Interacting with ionic species representations using ElementEmbeddings
This notebook will serve as a tutorial for using the ElementEmbeddings package to interact with ionic species representations.
from elementembeddings.core import SpeciesEmbedding
from elementembeddings.composition import (
SpeciesCompositionalEmbedding,
species_composition_featuriser,
)
Elements are the building blocks of chemistry, but species (elements in a given charge state) dictate the structure and properties of inorganic compounds.
For example, the local spin and atomic environment in Fe(s), FeO, Fe2O3, and Fe3O4 solids are different due to variations in the charge state and coordination of iron.
For composition-only machine learning, there are many schemes that represent compounds as vectors built on embeddings of elements. However, these are limited when we want to represent ionic species, because the charge state of the element is not taken into account. As such, we need vector representations of the ionic species themselves.
The ElementEmbeddings package contains a set of pre-trained embeddings for elements and ionic species, which can be used to represent ionic species in a vector space.
At the time of writing, the 200-dimensional SkipSpecies vector embeddings are available for ionic species representations. These embeddings are trained using the Skip-gram model on a large dataset of inorganic compounds.
# Load the SkipSpecies vectors as a SpeciesEmbedding object
skipspecies = SpeciesEmbedding.load_data(embedding_name="skipspecies")
print("Below is the representation of Fe3+ using the SkipSpecies vectors.")
print(skipspecies.embeddings["Fe3+"])
Below is the representation of Fe3+ using the SkipSpecies vectors. [-3.46536078e-02 -3.23320180e-02 -6.41056001e-02 -6.64595328e-03 -3.81412022e-02 -9.60185826e-02 -1.92383174e-02 -2.02107765e-02 8.79131556e-02 9.14798677e-02 -3.54749635e-02 -1.33267939e-01 -1.77447721e-01 -9.33702961e-02 -7.14094117e-02 -6.68478478e-03 -1.49846703e-01 3.65290008e-02 -1.11083306e-01 2.04584867e-01 -7.30767250e-02 7.07381591e-02 1.29051596e-01 8.26864019e-02 -3.41298096e-02 1.55206323e-01 5.24081439e-02 7.91398287e-02 1.86461732e-02 1.88235074e-01 1.51956931e-01 1.14296928e-01 -1.12691864e-01 6.95107281e-02 -1.16133653e-01 -1.42861262e-01 -3.24610062e-02 -6.37443736e-02 9.47019458e-02 -7.04379454e-02 1.51012568e-02 -6.04141466e-02 -7.57871270e-02 6.90726042e-02 -3.73109318e-02 -1.04284994e-01 -7.36037940e-02 -3.05999294e-02 -4.32690326e-03 -6.09171018e-02 1.28173083e-02 4.53064829e-01 4.73245084e-02 -1.39801240e+00 -1.01322591e-01 -1.62838653e-01 -4.33158763e-02 -1.32046595e-01 1.88525077e-02 -9.60192643e-03 -5.94866455e-01 1.12727061e-01 1.86967605e-03 8.49850774e-02 1.26277655e-01 -5.00426851e-02 -4.56427746e-02 -3.25046569e-01 1.37247995e-01 -9.46224555e-02 7.27631105e-03 -5.33877499e-02 -3.18312906e-02 -8.66127461e-02 -1.40548006e-01 6.63848501e-03 6.23855107e-02 1.06035680e-01 -1.68600217e-01 -1.79605886e-01 -9.72149730e-01 1.33717686e-01 -5.84784038e-02 -1.49619198e+00 1.86823923e-02 7.76157603e-02 -5.89469783e-02 -9.49078351e-02 -1.11909047e-01 3.17605101e-02 5.79413511e-02 1.40282623e-02 7.69326091e-02 -1.12443836e-02 -8.67934301e-02 -6.59158587e-01 9.15968940e-02 -3.47942114e-01 -9.98707302e-03 -4.93343398e-02 7.81614780e-02 1.12851635e-01 2.69402359e-02 1.41710088e-01 5.72816245e-02 1.60002038e-01 -2.57115781e-01 -1.09435096e-01 -4.88008857e-02 5.72116769e-05 -1.07527770e-01 5.56552038e-02 7.56548047e-02 8.72470587e-02 -1.57128468e-01 -1.33189365e-01 -1.06330979e+00 -5.80653787e-01 -7.17684031e-02 -3.73947710e-01 1.13771893e-02 -1.42221987e-01 -1.48932025e-01 
-2.07824185e-02 3.69309634e-02 1.27229178e-02 4.40038621e-01 -1.32923722e-01 -1.88622907e-01 2.58340001e-01 2.99438331e-02 1.02058776e-01 1.04237549e-01 -9.04425755e-02 2.39991665e-01 8.11270997e-02 -2.99125281e-03 2.83314623e-02 -2.62917858e-02 7.42266746e-03 -5.04185539e-03 -4.37292382e-02 1.17831230e-01 -4.98771993e-03 1.18534625e-01 1.53611377e-01 5.65077439e-02 -1.91291913e-01 -9.52507034e-02 -8.89603943e-02 2.01912194e-01 1.17760837e-01 -2.85485648e-02 -9.52739790e-02 1.49672581e-02 -7.14538768e-02 4.95206676e-02 3.00312508e-02 8.33884105e-02 9.99914482e-02 -9.40189809e-02 -4.94113080e-02 5.30362427e-02 -3.15267175e-01 -3.44095714e-02 1.56485736e-02 2.91987918e-02 -7.36336783e-02 -1.27800524e-01 5.92167228e-02 1.07430264e-01 5.31437919e-02 -1.76421866e-01 2.23079890e-01 7.48595372e-02 -5.39487004e-01 5.16922653e-01 1.29015148e-01 4.36748080e-02 -5.45317074e-03 1.46122992e-01 -7.71054178e-02 3.18054631e-02 -4.02254723e-02 -7.62721375e-02 5.14244894e-03 -6.23153821e-02 -6.00104272e-01 6.64846972e-02 6.28835186e-02 -1.06045604e-01 -1.76288888e-01 -4.96284366e-02 -7.97898546e-02 7.50872344e-02 -5.45614585e-03 -6.50706142e-02 -2.17388973e-01 -3.25618118e-01 4.77024205e-02]
We can check which ionic species have a feature vector in a particular embedding.
print("SkipSpecies has feature vectors for the following ionic species:\n")
print(skipspecies.species_list)
SkipSpecies has feature vectors for the following ionic species: ['H+', 'H-', 'Li+', 'Be2+', 'B+', 'B2+', 'B2-', 'B3-', 'B3+', 'B-', 'C4-', 'C-', 'C4+', 'C+', 'C2+', 'C3+', 'C2-', 'C3-', 'N3-', 'N2+', 'N3+', 'N-', 'N+', 'N2-', 'N5+', 'N4+', 'O2-', 'O-', 'F-', 'Na+', 'Mg2+', 'Al3+', 'Al2+', 'Si2+', 'Si4+', 'Si-', 'Si2-', 'Si4-', 'Si3+', 'Si3-', 'P5+', 'P2-', 'P3-', 'P4+', 'P+', 'P-', 'P3+', 'P2+', 'S2-', 'S6+', 'S-', 'S2+', 'S3+', 'S+', 'S4+', 'S5+', 'Cl-', 'Cl7+', 'Cl5+', 'Cl3+', 'K+', 'Ca2+', 'Sc3+', 'Sc+', 'Sc2+', 'Ti3+', 'Ti4+', 'Ti2+', 'V4+', 'V3+', 'V2+', 'V5+', 'Cr3+', 'Cr2+', 'Cr6+', 'Cr4+', 'Cr5+', 'Mn2+', 'Mn3+', 'Mn4+', 'Mn+', 'Mn7+', 'Mn6+', 'Mn5+', 'Fe2+', 'Fe3+', 'Fe+', 'Fe4+', 'Fe6+', 'Fe5+', 'Co2+', 'Co4+', 'Co3+', 'Co+', 'Ni2+', 'Ni4+', 'Ni3+', 'Ni+', 'Cu2+', 'Cu3+', 'Cu+', 'Zn2+', 'Ga+', 'Ga3+', 'Ga4+', 'Ga2+', 'Ge4-', 'Ge4+', 'Ge2-', 'Ge2+', 'Ge3+', 'As-', 'As2-', 'As3+', 'As5+', 'As3-', 'As+', 'As2+', 'As4+', 'Se2-', 'Se-', 'Se4+', 'Se6+', 'Se5+', 'Se2+', 'Se+', 'Se3+', 'Br-', 'Br+', 'Br2+', 'Br5+', 'Br3+', 'Rb+', 'Sr2+', 'Y3+', 'Y2+', 'Y+', 'Zr2+', 'Zr4+', 'Zr3+', 'Zr+', 'Nb5+', 'Nb3+', 'Nb4+', 'Nb2+', 'Nb+', 'Nb7+', 'Mo3+', 'Mo4+', 'Mo6+', 'Mo5+', 'Mo2+', 'Tc-', 'Tc4+', 'Tc3-', 'Tc3+', 'Tc+', 'Tc7+', 'Tc5+', 'Tc6+', 'Tc2-', 'Tc2+', 'Ru2+', 'Ru6+', 'Ru4+', 'Ru5+', 'Ru3+', 'Rh+', 'Rh4+', 'Rh3+', 'Pd2+', 'Pd4+', 'Pd3+', 'Ag3+', 'Ag+', 'Ag2+', 'Cd2+', 'In3+', 'In+', 'In2+', 'Sn4+', 'Sn3+', 'Sn2+', 'Sb5+', 'Sb2-', 'Sb3-', 'Sb3+', 'Sb4+', 'Sb-', 'Sb+', 'Te-', 'Te2-', 'Te4+', 'Te6+', 'Te2+', 'Te5+', 'Te+', 'I-', 'I3+', 'I7+', 'I5+', 'I+', 'I2+', 'Cs+', 'Ba2+', 'La3+', 'La2+', 'La+', 'Ce3+', 'Ce2+', 'Ce4+', 'Pr3+', 'Pr4+', 'Pr2+', 'Nd3+', 'Nd2+', 'Pm3+', 'Sm3+', 'Sm2+', 'Eu2+', 'Eu3+', 'Gd2+', 'Gd3+', 'Tb3+', 'Tb+', 'Tb2+', 'Tb4+', 'Dy3+', 'Dy2+', 'Ho3+', 'Ho2+', 'Er3+', 'Tm3+', 'Tm2+', 'Yb3+', 'Yb2+', 'Lu3+', 'Hf3+', 'Hf2+', 'Hf4+', 'Ta5+', 'Ta3+', 'Ta4+', 'Ta+', 'Ta2+', 'W6+', 'W4+', 'W2+', 'W3+', 'W5+', 'Re5+', 'Re3+', 'Re6+', 'Re2+', 'Re4+', 
'Re7+', 'Os7+', 'Os6+', 'Os5+', 'Os2-', 'Os3+', 'Os-', 'Os4+', 'Os8+', 'Os2+', 'Os+', 'Ir3+', 'Ir4+', 'Ir5+', 'Ir6+', 'Pt2+', 'Pt2-', 'Pt4+', 'Pt3+', 'Pt5+', 'Pt-', 'Pt6+', 'Pt+', 'Au-', 'Au2+', 'Au+', 'Au3+', 'Au5+', 'Au4+', 'Hg2+', 'Hg+', 'Tl+', 'Tl3+', 'Tl2+', 'Pb2+', 'Pb3+', 'Pb4+', 'Bi3+', 'Bi5+', 'Bi2+', 'Bi3-', 'Bi4+', 'Bi+', 'Ac3+', 'Th4+', 'Th3+', 'Pa4+', 'Pa5+', 'Pa3+', 'U6+', 'U4+', 'U3+', 'U2+', 'U5+', 'Np6+', 'Np4+', 'Np3+', 'Np7+', 'Np5+', 'Pu7+', 'Pu6+', 'Pu3+', 'Pu4+', 'Pu5+']
We can also check which elements have an ionic species representation in the embedding.
print("The following elements have SkipSpecies ionic species representations:\n")
print(skipspecies.element_list)
The following elements have SkipSpecies ionic species representations: ['Na', 'Dy', 'Y', 'Pr', 'Nd', 'Cd', 'Np', 'Mn', 'H', 'Zr', 'In', 'Ca', 'Cs', 'Pa', 'V', 'Br', 'Tc', 'B', 'Si', 'Er', 'Ba', 'Zn', 'Te', 'Mo', 'Pm', 'P', 'Hg', 'Cu', 'K', 'Hf', 'Nb', 'As', 'Ga', 'Rh', 'Li', 'Sc', 'Bi', 'Au', 'Sn', 'Pu', 'Rb', 'Tm', 'S', 'Lu', 'Ru', 'Cr', 'Tb', 'Eu', 'Os', 'W', 'Gd', 'Re', 'Co', 'Sr', 'C', 'I', 'Ac', 'Ta', 'Sm', 'F', 'Ir', 'Ho', 'Pd', 'Al', 'Fe', 'Ag', 'Pt', 'La', 'Yb', 'Pb', 'Cl', 'Sb', 'U', 'Be', 'Ti', 'Mg', 'O', 'Ni', 'Ce', 'N', 'Se', 'Th', 'Ge', 'Tl']
As with the element representations, BibTeX citation information is available for the ionic species embeddings.
print(skipspecies.citation())
['@article{Onwuli_Butler_Walsh_2024, title={Ionic species representations for materials informatics}, DOI={10.26434/chemrxiv-2024-8621l}, journal={ChemRxiv}, author={Onwuli, Anthony and Butler, Keith T. and Walsh, Aron}, year={2024}} This content is a preprint and has not been peer-reviewed.', '@article{antunes2022distributed,title={Distributed representations of atoms and materials for machine learning},author={Antunes, Luis M and Grau-Crespo, Ricardo and Butler, Keith T},journal={npj Computational Materials},volume={8},number={1},pages={1--9},year={2022},publisher={Nature Publishing Group} }']
Representing ionic compositions using ElementEmbeddings
In addition to representing individual ionic species, we can also represent ionic compositions using the ElementEmbeddings package. This is useful for representing inorganic compounds as vectors. Let's take the example of Fe3O4.
Fe3O4 is a mixed-valence iron oxide: each formula unit contains one Fe2+ and two Fe3+ cations balanced by four O2- anions. We pass the composition as a dictionary in the following format:
composition = {
'Fe2+': 1,
'Fe3+': 2,
'O2-': 4
}
composition = {"Fe2+": 1, "Fe3+": 2, "O2-": 4}
Fe3O4_skipspecies = SpeciesCompositionalEmbedding(
formula_dict=composition, embedding=skipspecies
)
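A quick sanity check on a species dictionary like the one above is that the charges sum to zero. The sketch below is illustrative only: parse_species and net_charge are hypothetical helpers written for this example, not part of the ElementEmbeddings API, and the regular expression assumes labels of the form symbol, optional magnitude, sign (as in "Fe3+" or "O2-").

```python
import re

# Hypothetical helper pattern (an assumption, not from ElementEmbeddings):
# element symbol, optional charge magnitude, then a sign.
SPECIES_PATTERN = re.compile(r"^([A-Z][a-z]?)(\d*)([+-])$")


def parse_species(label):
    """Split a label such as 'Fe3+' into ('Fe', 3)."""
    match = SPECIES_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unrecognised species label: {label!r}")
    symbol, magnitude, sign = match.groups()
    charge = int(magnitude) if magnitude else 1  # 'Li+' carries a charge of 1
    return symbol, charge if sign == "+" else -charge


def net_charge(composition):
    """Sum charge * amount over every species in the composition."""
    return sum(
        parse_species(label)[1] * amount for label, amount in composition.items()
    )


composition = {"Fe2+": 1, "Fe3+": 2, "O2-": 4}
# (+2)(1) + (+3)(2) + (-2)(4) = 0, so the formula unit is charge neutral
print(net_charge(composition))
```

A non-zero result would suggest a mistyped charge state or amount before the composition is passed on for featurisation.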
A few properties are accessible from the SpeciesCompositionalEmbedding class:
# Print the pretty formula
print(Fe3O4_skipspecies.formula_pretty)
# Print the list of elements in the composition
print(Fe3O4_skipspecies.element_list)
# Print the list of ionic species in the composition
print(Fe3O4_skipspecies.species_list)
# Print the stoichiometric vector of the composition
print(Fe3O4_skipspecies.stoich_vector)
# Print the normalised stoichiometric vector of the composition
print(Fe3O4_skipspecies.norm_stoich_vector)
# Print the number of atoms
print(Fe3O4_skipspecies.num_atoms)
Fe3O4
['O', 'Fe']
['Fe2+', 'Fe3+', 'O2-']
[1 2 4]
[0.14285714 0.28571429 0.57142857]
7
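To see where the stoichiometric quantities printed above come from, here is a minimal sketch in plain Python. It assumes the vector entries follow the alphabetical order of the species labels, which matches the printed species_list; it is an illustration of the arithmetic, not the package's own implementation.

```python
composition = {"Fe2+": 1, "Fe3+": 2, "O2-": 4}

# Assumption: entries are ordered by sorted species label
species = sorted(composition)
stoich_vector = [composition[label] for label in species]

# Total number of atoms in the formula unit
num_atoms = sum(stoich_vector)

# Each amount divided by the total, so the entries sum to 1
norm_stoich_vector = [amount / num_atoms for amount in stoich_vector]

print(species)
print(stoich_vector)
print(num_atoms)
print([round(x, 8) for x in norm_stoich_vector])
```

The normalised vector is what makes compositions with different formula-unit sizes comparable: Fe3O4 and Fe6O8 would give the same normalised entries.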
Featurising compositions
We can featurise the composition using the .feature_vector method, which returns the feature vector for the composition. This is identical in operation to the CompositionalEmbedding class for featurising compositions.
The species_composition_featuriser function can be used to featurise a list of compositions, which is useful when featurising a large number of them. It can also export the feature vectors to a pandas DataFrame by setting the to_dataframe argument to True.
compositions = [
{"Fe2+": 1, "Fe3+": 2, "O2-": 4},
{"Fe3+": 2, "O2-": 3},
{"Li+": 7, "La3+": 3, "Zr4+": 1, "O2-": 12},
{"Cs+": 1, "Pb2+": 1, "I-": 3},
{"Pb2+": 1, "Pb4+": 1, "O2-": 3},
]
featurised_comps_df = species_composition_featuriser(
data=compositions, embedding="skipspecies", stats="mean", to_dataframe=True
)
featurised_comps_df
 | formula | composition | mean_0 | mean_1 | mean_2 | mean_3 | mean_4 | mean_5 | mean_6 | mean_7 | ... | mean_190 | mean_191 | mean_192 | mean_193 | mean_194 | mean_195 | mean_196 | mean_197 | mean_198 | mean_199 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Fe3O4 | {'Fe2+': 1, 'Fe3+': 2, 'O2-': 4} | -0.018255 | 0.001659 | -0.009839 | 0.005230 | -0.010928 | -0.057023 | -0.002567 | -0.005813 | ... | -0.037202 | -0.008057 | -0.027421 | -0.008534 | -0.009001 | 0.002369 | 0.017834 | -0.055822 | -0.219390 | 0.020507 |
1 | Fe2O3 | {'Fe3+': 2, 'O2-': 3} | -0.036597 | -0.009373 | -0.013700 | -0.015516 | -0.020896 | -0.071463 | 0.002221 | -0.014784 | ... | -0.045530 | -0.024589 | -0.037825 | -0.025545 | 0.010654 | -0.002034 | -0.001094 | -0.096479 | -0.211483 | 0.035755 |
2 | Li7La3ZrO12 | {'Li+': 7, 'La3+': 3, 'Zr4+': 1, 'O2-': 12} | -0.031236 | -0.015952 | -0.018968 | -0.029273 | -0.005297 | -0.035049 | 0.045972 | -0.032007 | ... | -0.042820 | 0.045177 | -0.056733 | 0.006726 | 0.017449 | -0.023732 | 0.021772 | -0.034134 | -0.102773 | 0.061038 |
3 | CsPbI3 | {'Cs+': 1, 'Pb2+': 1, 'I-': 3} | -0.002381 | 0.023988 | -0.026468 | -0.020235 | -0.002876 | -0.033317 | 0.076300 | -0.069057 | ... | 0.055368 | 0.058231 | -0.079549 | -0.032172 | -0.076099 | -0.024554 | 0.108428 | -0.058528 | -0.055804 | -0.031679 |
4 | Pb2O3 | {'Pb2+': 1, 'Pb4+': 1, 'O2-': 3} | -0.077403 | -0.015334 | 0.023065 | -0.060073 | -0.043160 | -0.140865 | 0.067917 | -0.044093 | ... | 0.038975 | 0.102474 | -0.051598 | 0.001011 | -0.131225 | -0.026707 | 0.145250 | -0.057493 | -0.188810 | 0.055239 |
5 rows × 202 columns
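The mean_0 to mean_199 columns are the pooled feature vector of each composition. As a rough sketch of what mean pooling does, the example below assumes that stats="mean" corresponds to an average of the species vectors weighted by their normalised stoichiometry. The three-dimensional toy_embedding values are made up for illustration and are not real SkipSpecies vectors, and mean_pool is a hypothetical helper rather than a package function.

```python
# Toy 3-dimensional vectors for illustration only; the real SkipSpecies
# vectors are 200-dimensional and have different values.
toy_embedding = {
    "Fe2+": [1.0, 0.0, 2.0],
    "Fe3+": [0.0, 1.0, 2.0],
    "O2-": [-1.0, 0.5, 0.0],
}


def mean_pool(composition, embedding):
    """Average the species vectors, weighted by normalised stoichiometry."""
    total = sum(composition.values())
    dim = len(next(iter(embedding.values())))
    pooled = [0.0] * dim
    for species, amount in composition.items():
        weight = amount / total  # fraction of the formula unit
        for i, value in enumerate(embedding[species]):
            pooled[i] += weight * value
    return pooled


fe3o4 = mean_pool({"Fe2+": 1, "Fe3+": 2, "O2-": 4}, toy_embedding)
fe2o3 = mean_pool({"Fe3+": 2, "O2-": 3}, toy_embedding)
print(fe3o4)
print(fe2o3)
```

Because the pooled vector has the same length as a single species vector, compositions with different numbers of species map to vectors of the same size.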
We can also calculate the "distance" between two compositions using their feature vectors. This can be used to determine which compositions are more similar to each other.
print(
f"The euclidean distance between Fe3O4 and Fe2O3 is {Fe3O4_skipspecies.distance({'Fe3+': 2, 'O2-': 3}, distance_metric='euclidean', stats='mean'):.2f}"
)
print(
f"The euclidean distance between Fe3O4 and Pb2O3 is {Fe3O4_skipspecies.distance({'Pb2+': 1, 'Pb4+': 1, 'O2-': 3}, distance_metric='euclidean', stats='mean'):.2f}"
)
print(
f"The euclidean distance between Fe3O4 and CsPbI3 is {Fe3O4_skipspecies.distance({'Cs+': 1, 'Pb2+': 1, 'I-': 3},distance_metric='euclidean', stats='mean'):.2f}"
)
The euclidean distance between Fe3O4 and Fe2O3 is 0.38
The euclidean distance between Fe3O4 and Pb2O3 is 1.60
The euclidean distance between Fe3O4 and CsPbI3 is 2.11
Based on the mean-pooled feature vectors, we can see that Fe3O4 is closer to Fe2O3 than to either Pb2O3 or CsPbI3.
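For reference, the Euclidean distance used above is the square root of the summed squared differences between two pooled vectors. The sketch below uses made-up three-dimensional vectors (the labels "reference", "near" and "far" are placeholders, not real compounds) to show how candidates could be ranked by distance to a reference; it relies only on math.dist from the Python standard library (Python 3.8+).

```python
import math

# Made-up pooled vectors for illustration; real ones would come from the
# featuriser and have 200 dimensions.
pooled = {
    "reference": [0.0, 0.0, 0.0],
    "near": [3.0, 4.0, 0.0],
    "far": [6.0, 8.0, 0.0],
}

# Distance from the reference to every other entry
distances = {
    name: math.dist(pooled["reference"], vector)
    for name, vector in pooled.items()
    if name != "reference"
}

# Smaller distance means more similar under this metric
ranking = sorted(distances, key=distances.get)

for name in ranking:
    print(f"{name}: {distances[name]:.2f}")
```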