Using the ElementEmbeddings package¶
This notebook is a tutorial on the ElementEmbeddings package and walks through its core features.
# Imports
import numpy as np
import pandas as pd
import seaborn as sns
from elementembeddings.core import Embedding
from elementembeddings.plotter import heatmap_plotter, dimension_plotter
import matplotlib.pyplot as plt
sns.set(font_scale=1.5)
Elemental representations¶
A key problem in supervised machine learning is choosing the featurisation (representation) scheme that turns a material into numerical input for a mathematical algorithm. For composition-only machine learning, we want to be able to create a numerical representation of a chemical formula AwBxCyDz, where w, x, y and z are the amounts of each element. We can achieve this by creating a composition-based feature vector (CBFV) derived from the elemental properties of the constituent atoms, or a representation can be learned during the supervised training process.
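As a rough illustration of the idea, the sketch below builds a feature vector for a formula by taking the composition-weighted mean of per-element vectors. This is a minimal pure-Python sketch, not the package's implementation: the `featurise` helper and the two-component element vectors (atomic number and an electronegativity-like value) are hypothetical and chosen only for illustration.

```python
# Toy elemental vectors: [atomic number, electronegativity-like value].
# These two-component vectors are illustrative and are not one of the
# representations shipped with the package.
element_vectors = {
    "Na": [11.0, 0.93],
    "Cl": [17.0, 3.16],
}


def featurise(composition, vectors):
    """Return the composition-weighted mean of the elemental vectors.

    `composition` maps element symbols to amounts, e.g. {"Na": 1, "Cl": 1}.
    """
    total = sum(composition.values())
    dim = len(next(iter(vectors.values())))
    cbfv = [0.0] * dim
    for element, amount in composition.items():
        fraction = amount / total
        for i, value in enumerate(vectors[element]):
            cbfv[i] += fraction * value
    return cbfv


nacl = featurise({"Na": 1, "Cl": 1}, element_vectors)
print(nacl)
```

A weighted mean is only one possible pooling choice; sums, variances, minima and maxima over the constituent elements are also commonly used.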
A few of these CBFVs are included in the package, and we can load them using the load_data class method.
# Create a list of the available CBFVs included in the package
cbfvs = [
"magpie",
"mat2vec",
"matscholar",
"megnet16",
"oliynyk",
"random_200",
"skipatom",
"mod_petti",
"magpie_sc",
"oliynyk_sc",
]
# Create a dictionary of {cbfv name : Embedding objects} key, value pairs
AtomEmbeds = {cbfv: Embedding.load_data(cbfv) for cbfv in cbfvs}
Taking the magpie representation as our example, we will demonstrate some features of the Embedding class.
# Let's use magpie as our example
# Let's look at the CBFV of hydrogen for the magpie representation
print(
"Below is the CBFV/representation of the hydrogen atom from the magpie data we have \n"
)
print(AtomEmbeds["magpie"].embeddings["H"])
Below is the CBFV/representation of the hydrogen atom from the magpie data we have

[ 1. 92. 1.00794 14.01 1. 1. 31. 2.2 1. 0. 0. 0. 1. 1. 0. 0. 0. 1. 6.615 7.853 0. 194. ]
We can check which elements have a feature vector for a particular embedding.
# We can also check to see what elements have a CBFV for our chosen representation
print("Magpie has composition-based feature vectors for the following elements: \n")
print(AtomEmbeds["magpie"].element_list)
Magpie has composition-based feature vectors for the following elements:

['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk']
For the elemental representations distributed with the package, we also include BibTeX citations of the original papers from which these representations are derived. These are accessible through the .citation() method.
# Print the bibtex citation for the magpie embedding
print(AtomEmbeds["magpie"].citation())
['@article{ward2016general,title={A general-purpose machine learning framework for predicting properties of inorganic materials},author={Ward, Logan and Agrawal, Ankit and Choudhary, Alok and Wolverton, Christopher},journal={npj Computational Materials},volume={2},number={1},pages={1--7},year={2016},publisher={Nature Publishing Group}}']
We can also check the dimensionality of the elemental representation.
# We can quickly check the dimensionality of this CBFV
magpie_dim = AtomEmbeds["magpie"].dim
print(f"The magpie CBFV has a dimensionality of {magpie_dim}")
The magpie CBFV has a dimensionality of 22
# Let's find the dimensionality of all of the CBFVs that we have loaded
AtomEmbeds_dim = {
cbfv: {"dim": AtomEmbeds[cbfv].dim, "type": AtomEmbeds[cbfv].embedding_type}
for cbfv in cbfvs
}
dim_df = pd.DataFrame.from_dict(AtomEmbeds_dim)
dim_df.T
| | dim | type |
|---|---|---|
| magpie | 22 | vector |
| mat2vec | 200 | vector |
| matscholar | 200 | vector |
| megnet16 | 16 | vector |
| oliynyk | 44 | vector |
| random_200 | 200 | vector |
| skipatom | 200 | vector |
| mod_petti | 103 | one-hot |
| magpie_sc | 22 | vector |
| oliynyk_sc | 44 | vector |
The composition-based feature vectors span a wide range of dimensions, from 16 (megnet16) to 200 (mat2vec, matscholar, random_200 and skipatom).
Let's now explore more of the core features of the package. The numerical representation of the elements enables us to quantify the differences between atoms. With these embedding features, we can explore how similar two atoms are by using a 'distance' metric. Atoms with distances close to zero are 'similar', whereas elements which have a large distance between them should in theory be dissimilar.
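To make the notion of a 'distance' concrete, the sketch below implements three of the standard metrics from their textbook definitions and applies them to two toy vectors. The helper functions and the vectors `u` and `v` are hypothetical and written in plain Python for illustration; they are not the package's own code.

```python
def euclidean(u, v):
    """Straight-line (L2) distance: square root of the summed squared differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def manhattan(u, v):
    """City-block (L1) distance: sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    """L-infinity distance: the largest absolute difference in any component."""
    return max(abs(a - b) for a, b in zip(u, v))


# Two toy three-component vectors (illustrative values only)
u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]

for name, metric in [("euclidean", euclidean), ("manhattan", manhattan), ("chebyshev", chebyshev)]:
    print(f"{name}: {metric(u, v):.2f}")
```

All three metrics return zero for identical vectors, but they weight component-wise differences differently, which is why the same pair of elements can give quite different values under each metric.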
Using the compute_distance_metric method of an Embedding object, we can compute these distances.
# Let's continue using our magpie cbfv
# Distance metrics compared here: euclidean, manhattan, chebyshev, wasserstein, energy
metrics = ["euclidean", "manhattan", "chebyshev", "wasserstein", "energy"]
distances = [
AtomEmbeds["magpie"].compute_distance_metric("Li", "K", metric=metric)
for metric in metrics
]
print("For the magpie representation:")
for i, distance in enumerate(distances):
    print(
        f"Using the metric {metrics[i]}, the distance between Li and K is {distance:.2f}"
    )
For the magpie representation:
Using the metric euclidean, the distance between Li and K is 154.41
Using the metric manhattan, the distance between Li and K is 300.99
Using the metric chebyshev, the distance between Li and K is 117.16
Using the metric wasserstein, the distance between Li and K is 13.68
Using the metric energy, the distance between Li and K is 1.25
# Now repeat the comparison with the scaled magpie CBFV (magpie_sc)
# using the same five distance metrics
metrics = ["euclidean", "manhattan", "chebyshev", "wasserstein", "energy"]
distances = [
AtomEmbeds["magpie_sc"].compute_distance_metric("Li", "K", metric=metric)
for metric in metrics
]
print("For the scaled magpie representation:")
for i, distance in enumerate(distances):
    print(
        f"Using the metric {metrics[i]}, the distance between Li and K is {distance:.2f}"
    )
For the scaled magpie representation:
Using the metric euclidean, the distance between Li and K is 4.09
Using the metric manhattan, the distance between Li and K is 7.87
Using the metric chebyshev, the distance between Li and K is 3.39
Using the metric wasserstein, the distance between Li and K is 0.32
Using the metric energy, the distance between Li and K is 0.23
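The large gap between the unscaled and scaled distances is what we would expect when a few features with large magnitudes dominate the raw vectors. The sketch below illustrates this with z-score standardisation of each feature across a toy set of elements. Note the assumptions: the `standardise` helper, the element labels and the values are hypothetical, and z-scoring is used here only as a common choice; the scaling actually used to produce magpie_sc may differ.

```python
import math


def standardise(rows):
    """Z-score each column (feature) across all rows (elements)."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col))
        for col, m in zip(columns, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy data: the first feature is ~100x larger in magnitude than the second
raw = {"A": [100.0, 1.0], "B": [200.0, 3.0], "C": [300.0, 2.0]}
names = list(raw)
scaled = dict(zip(names, standardise([raw[n] for n in names])))

d_raw = euclidean(raw["A"], raw["B"])
d_scaled = euclidean(scaled["A"], scaled["B"])
print(f"unscaled A-B distance: {d_raw:.2f}")
print(f"scaled A-B distance:   {d_scaled:.2f}")
```

In the unscaled case the distance is almost entirely set by the first feature, whereas after standardisation both features contribute on a comparable footing.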
Plotting¶
We can also explore the correlation between embedding vectors. In the examples below, we plot heatmaps of the Pearson correlation for our magpie CBFV, the scaled magpie CBFV and the 16-dimensional megnet embeddings.
Pearson Correlation plots¶
Unscaled and scaled Magpie¶
fig, ax = plt.subplots(figsize=(24, 24))
heatmap_plotter(
embedding=AtomEmbeds["magpie"],
metric="pearson",
sortaxisby="atomic_number",
# show_axislabels=False,
ax=ax,
)
fig.show()
fig, ax = plt.subplots(figsize=(24, 24))
heatmap_plotter(
embedding=AtomEmbeds["magpie_sc"],
metric="pearson",
sortaxisby="atomic_number",
# show_axislabels=False,
ax=ax,
)
fig.show()
As the Pearson correlation heatmaps above show, the visualisation of correlations across the atomic embeddings is sensitive to the scale of the components of the embedding vectors. The unscaled magpie representation produces a plot which makes qualitative assessment of chemical trends difficult, whereas the scaled representation allows some qualitative analysis of the (dis)similarity of elements based on their feature vectors.
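A small numerical example may help explain this sensitivity. The sketch below computes the Pearson correlation between two toy element vectors, first with one large-magnitude feature left as is, and then after dividing each feature by a typical magnitude. Everything here is hypothetical: the `pearson` helper, the vectors and the `feature_scales` values are stand-ins chosen for illustration, not the package's data or its scaling procedure.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# Two toy element vectors whose first feature dwarfs the other two
x = [1000.0, 1.0, 3.0]
y = [1200.0, 3.0, 1.0]

# Hypothetical typical magnitude of each feature, used to rescale
feature_scales = [1000.0, 1.0, 1.0]
x_scaled = [a / s for a, s in zip(x, feature_scales)]
y_scaled = [b / s for b, s in zip(y, feature_scales)]

r_raw = pearson(x, y)
r_scaled = pearson(x_scaled, y_scaled)
print(f"unscaled correlation: {r_raw:.4f}")
print(f"scaled correlation:   {r_scaled:.4f}")
```

With the dominant feature left unscaled, the two vectors look almost perfectly correlated; once it is brought onto a comparable scale, the opposing trend in the remaining features becomes visible and the correlation turns negative.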
fig, ax = plt.subplots(figsize=(24, 24))
heatmap_plotter(
embedding=AtomEmbeds["megnet16"],
metric="pearson",
sortaxisby="atomic_number",
# show_axislabels=False,
ax=ax,
)
fig.show()
PCA plots¶
fig, ax = plt.subplots(figsize=(16, 12))
dimension_plotter(
embedding=AtomEmbeds["magpie"],
reducer="pca",
n_components=2,
ax=ax,
adjusttext=True,
)
fig.tight_layout()
fig.show()
fig, ax = plt.subplots(figsize=(16, 12))
dimension_plotter(
embedding=AtomEmbeds["magpie_sc"],
reducer="pca",
n_components=2,
ax=ax,
adjusttext=True,
)
fig.tight_layout()
fig.show()
fig, ax = plt.subplots(figsize=(16, 12))
dimension_plotter(
embedding=AtomEmbeds["megnet16"],
reducer="pca",
n_components=2,
ax=ax,
adjusttext=True,
)
fig.tight_layout()
fig.show()
t-SNE plots¶
fig, ax = plt.subplots(figsize=(16, 12))
dimension_plotter(
embedding=AtomEmbeds["magpie"],
reducer="tsne",
n_components=2,
ax=ax,
adjusttext=True,
)
fig.tight_layout()
fig.show()
fig, ax = plt.subplots(figsize=(16, 12))
dimension_plotter(
embedding=AtomEmbeds["magpie_sc"],
reducer="tsne",
n_components=2,
ax=ax,
adjusttext=True,
)
fig.tight_layout()
fig.show()
fig, ax = plt.subplots(figsize=(16, 12))
dimension_plotter(
embedding=AtomEmbeds["megnet16"],
reducer="tsne",
n_components=2,
ax=ax,
adjusttext=True,
)
fig.tight_layout()
fig.show()