Using the ElementEmbeddings package

This notebook is a tutorial on using the ElementEmbeddings package and walks through its core features.

In [1]:

# Imports
import numpy as np
import pandas as pd
import seaborn as sns

from elementembeddings.core import Embedding
from elementembeddings.plotter import heatmap_plotter, dimension_plotter
import matplotlib.pyplot as plt

sns.set(font_scale=1.5)


Elemental representations

A key problem in supervised machine learning is choosing the featurisation (representation) scheme that turns a material into numerical input for a mathematical algorithm. For composition-only machine learning, we want to be able to create a numerical representation of a chemical formula A_wB_xC_yD_z. We can achieve this either by building a composition-based feature vector from the elemental properties of the constituent atoms, or by learning a representation during the supervised training process.
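
As a rough illustration of the first route, the sketch below builds a feature vector for a formula by averaging element vectors, weighted by atomic fraction. The element vectors are made-up numbers and `composition_vector` is a hypothetical helper written for this example; it is not part of the ElementEmbeddings API, and weighted averaging is only one of several possible pooling choices.

```python
import numpy as np

# Toy 3-dimensional element vectors (made-up numbers, for illustration only)
toy_embeddings = {
    "Na": np.array([1.0, 0.0, 2.0]),
    "Cl": np.array([0.0, 3.0, 1.0]),
    "O": np.array([2.0, 2.0, 0.0]),
}


def composition_vector(amounts, embeddings):
    """Mean of element vectors, weighted by atomic fraction.

    `amounts` maps element symbols to stoichiometric amounts,
    e.g. {"Na": 2, "O": 1} for Na2O.
    """
    total = sum(amounts.values())
    return sum((n / total) * embeddings[el] for el, n in amounts.items())


nacl = composition_vector({"Na": 1, "Cl": 1}, toy_embeddings)
na2o = composition_vector({"Na": 2, "O": 1}, toy_embeddings)

print(nacl)  # Na and Cl weighted equally
print(na2o)  # Na weighted 2/3, O weighted 1/3
```

Because the weights are atomic fractions, the result depends only on the ratio of the elements, so Na2O and Na4O2 would map to the same vector.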

A few of these composition-based feature vectors (CBFVs) are included in the package, and we can load them using the load_data class method.

In [2]:

# Create a list of the available CBFVs included in the package

cbfvs = [
    "magpie",
    "mat2vec",
    "matscholar",
    "megnet16",
    "oliynyk",
    "random_200",
    "skipatom",
    "mod_petti",
    "magpie_sc",
    "oliynyk_sc",
]

# Create a dictionary of {cbfv name : Embedding objects} key, value pairs
AtomEmbeds = {cbfv: Embedding.load_data(cbfv) for cbfv in cbfvs}

Taking the magpie representation as our example, we will demonstrate some features of the Embedding class.

In [3]:

# Let's use magpie as our example

# Let's look at the CBFV of hydrogen for the magpie representation
print(
    "Below is the CBFV/representation of the hydrogen atom from the magpie data we have \n"
)
print(AtomEmbeds["magpie"].embeddings["H"])

Below is the CBFV/representation of the hydrogen atom from the magpie data we have 

[  1.       92.        1.00794  14.01      1.        1.       31.
   2.2       1.        0.        0.        0.        1.        1.
   0.        0.        0.        1.        6.615     7.853     0.
 194.     ]

We can check which elements have a feature vector in a particular embedding.

In [4]:

# We can also check to see what elements have a CBFV for our chosen representation
print("Magpie has composition-based feature vectors for the following elements: \n")
print(AtomEmbeds["magpie"].element_list)

Magpie has composition-based feature vectors for the following elements: 

['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne', 'Na', 'Mg', 'Al', 'Si', 'P', 'S', 'Cl', 'Ar', 'K', 'Ca', 'Sc', 'Ti', 'V', 'Cr', 'Mn', 'Fe', 'Co', 'Ni', 'Cu', 'Zn', 'Ga', 'Ge', 'As', 'Se', 'Br', 'Kr', 'Rb', 'Sr', 'Y', 'Zr', 'Nb', 'Mo', 'Tc', 'Ru', 'Rh', 'Pd', 'Ag', 'Cd', 'In', 'Sn', 'Sb', 'Te', 'I', 'Xe', 'Cs', 'Ba', 'La', 'Ce', 'Pr', 'Nd', 'Pm', 'Sm', 'Eu', 'Gd', 'Tb', 'Dy', 'Ho', 'Er', 'Tm', 'Yb', 'Lu', 'Hf', 'Ta', 'W', 'Re', 'Os', 'Ir', 'Pt', 'Au', 'Hg', 'Tl', 'Pb', 'Bi', 'Po', 'At', 'Rn', 'Fr', 'Ra', 'Ac', 'Th', 'Pa', 'U', 'Np', 'Pu', 'Am', 'Cm', 'Bk']

For the elemental representations distributed with the package, we also include BibTeX citations of the original papers from which these representations are derived. These are accessible through the .citation() method.

In [5]:

# Print the bibtex citation for the magpie embedding
print(AtomEmbeds["magpie"].citation())

['@article{ward2016general,title={A general-purpose machine learning framework for predicting properties of inorganic materials},author={Ward, Logan and Agrawal, Ankit and Choudhary, Alok and Wolverton, Christopher},journal={npj Computational Materials},volume={2},number={1},pages={1--7},year={2016},publisher={Nature Publishing Group}}']

We can also check the dimensionality of the elemental representation.

In [6]:

# We can quickly check the dimensionality of this CBFV
magpie_dim = AtomEmbeds["magpie"].dim
print(f"The magpie CBFV has a dimensionality of {magpie_dim}")

The magpie CBFV has a dimensionality of 22

In [7]:

# Let's find the dimensionality of all of the CBFVs that we have loaded

AtomEmbeds_dim = {
    cbfv: {"dim": AtomEmbeds[cbfv].dim, "type": AtomEmbeds[cbfv].embedding_type}
    for cbfv in cbfvs
}

dim_df = pd.DataFrame.from_dict(AtomEmbeds_dim)
dim_df.T

Out[7]:

	dim	type
magpie	22	vector
mat2vec	200	vector
matscholar	200	vector
megnet16	16	vector
oliynyk	44	vector
random_200	200	vector
skipatom	200	vector
mod_petti	103	one-hot
magpie_sc	22	vector
oliynyk_sc	44	vector

The dimensionality of the composition-based feature vectors varies widely, from 16 (megnet16) to 200 (mat2vec, matscholar, random_200 and skipatom).
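
The table also distinguishes a "one-hot" type from the dense "vector" type. The sketch below shows what a one-hot representation looks like for a tiny, made-up element list. It is only an illustration of the idea: how mod_petti orders its 103 positions is not shown in this notebook.

```python
import numpy as np

elements = ["H", "Li", "Na", "K"]  # a tiny illustrative element list

# Row i of the identity matrix is the one-hot vector for elements[i]
one_hot = {el: row for el, row in zip(elements, np.eye(len(elements)))}

print(one_hot["Li"])  # [0. 1. 0. 0.]

# Every pair of distinct one-hot vectors is the same Euclidean distance apart
d_li_na = np.linalg.norm(one_hot["Li"] - one_hot["Na"])
d_h_k = np.linalg.norm(one_hot["H"] - one_hot["K"])
print(d_li_na, d_h_k)
```

A one-hot vector therefore identifies an element without encoding any similarity between elements, which is worth keeping in mind when comparing distances across embedding types.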

Let's now explore more of the core features of the package. The numerical representation of the elements enables us to quantify the differences between atoms. With these embedding features, we can explore how similar two atoms are by using a 'distance' metric. Atoms with distances close to zero are 'similar', whereas elements with a large distance between them should, in theory, be dissimilar.

Using the compute_distance_metric method of an Embedding object, we can compute these distances.

In [8]:

# Let's continue using our magpie CBFV
# Distance metrics compared below: euclidean, manhattan, chebyshev, wasserstein, energy

metrics = ["euclidean", "manhattan", "chebyshev", "wasserstein", "energy"]

distances = [
    AtomEmbeds["magpie"].compute_distance_metric("Li", "K", metric=metric)
    for metric in metrics
]
print("For the magpie representation:")
for i, distance in enumerate(distances):
    print(
        f"Using the metric {metrics[i]}, the distance between Li and K is {distance:.2f}"
    )

For the magpie representation:
Using the metric euclidean, the distance between Li and K is 154.41
Using the metric manhattan, the distance between Li and K is 300.99
Using the metric chebyshev, the distance between Li and K is 117.16
Using the metric wasserstein, the distance between Li and K is 13.68
Using the metric energy, the distance between Li and K is 1.25
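
To make the first three metrics concrete, here is a minimal sketch that computes the Euclidean, Manhattan and Chebyshev distances by hand with NumPy. The two vectors are made up and are not taken from any embedding in the package; the Wasserstein and energy distances are not covered here.

```python
import numpy as np

# Two made-up 4-dimensional element vectors (not from any real embedding)
a = np.array([1.0, 4.0, 0.0, 2.0])
b = np.array([3.0, 1.0, 0.0, 6.0])

diff = np.abs(a - b)  # element-wise absolute differences: [2, 3, 0, 4]

euclidean = np.sqrt(np.sum(diff**2))  # square root of the summed squares
manhattan = np.sum(diff)  # sum of the absolute differences
chebyshev = np.max(diff)  # largest single absolute difference

print(f"euclidean: {euclidean:.2f}")  # 5.39
print(f"manhattan: {manhattan:.2f}")  # 9.00
print(f"chebyshev: {chebyshev:.2f}")  # 4.00
```

For any pair of vectors these three always satisfy chebyshev <= euclidean <= manhattan, which matches the ordering of the Li and K distances printed above.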

In [9]:

# Now repeat the comparison with the scaled magpie CBFV (magpie_sc)
# Distance metrics compared below: euclidean, manhattan, chebyshev, wasserstein, energy

metrics = ["euclidean", "manhattan", "chebyshev", "wasserstein", "energy"]

distances = [
    AtomEmbeds["magpie_sc"].compute_distance_metric("Li", "K", metric=metric)
    for metric in metrics
]
print("For the scaled magpie representation:")
for i, distance in enumerate(distances):
    print(
        f"Using the metric {metrics[i]}, the distance between Li and K is {distance:.2f}"
    )

For the scaled magpie representation:
Using the metric euclidean, the distance between Li and K is 4.09
Using the metric manhattan, the distance between Li and K is 7.87
Using the metric chebyshev, the distance between Li and K is 3.39
Using the metric wasserstein, the distance between Li and K is 0.32
Using the metric energy, the distance between Li and K is 0.23
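
The distances shrink considerably once the features are scaled, because no single large-valued feature dominates the difference any more. The sketch below illustrates this with made-up data and z-score standardisation. The scaling procedure actually used to produce magpie_sc is not stated in this notebook, so treat the choice of standardisation as an assumption.

```python
import numpy as np

# Made-up feature table: rows are elements, columns are features on very
# different scales (a small count-like value and a large temperature-like value)
X = np.array(
    [
        [1.0, 450.0],
        [2.0, 1200.0],
        [3.0, 300.0],
    ]
)

# Standardise each column to zero mean and unit variance
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)


def euclidean(u, v):
    """Euclidean distance between two vectors."""
    return float(np.linalg.norm(u - v))


raw_distance = euclidean(X[0], X[1])
scaled_distance = euclidean(X_scaled[0], X_scaled[1])

print(f"raw distance:    {raw_distance:.2f}")
print(f"scaled distance: {scaled_distance:.2f}")
```

In the raw data the distance is almost entirely set by the second column; after scaling, both columns contribute on a comparable footing.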

Plotting

We can also explore the correlation between embedding vectors. In the examples below, we plot heatmaps of the Pearson correlation for our magpie CBFV, the scaled magpie CBFV and the 16-dimensional megnet embeddings.

Pearson Correlation plots

Unscaled and scaled Magpie

In [10]:

fig, ax = plt.subplots(figsize=(24, 24))
heatmap_plotter(
    embedding=AtomEmbeds["magpie"],
    metric="pearson",
    sortaxisby="atomic_number",
    # show_axislabels=False,
    ax=ax,
)

fig.show()


In [11]:

fig, ax = plt.subplots(figsize=(24, 24))
heatmap_plotter(
    embedding=AtomEmbeds["magpie_sc"],
    metric="pearson",
    sortaxisby="atomic_number",
    # show_axislabels=False,
    ax=ax,
)

fig.show()

As the Pearson correlation heatmaps above show, the visualisation of correlations across the atomic embeddings is sensitive to the scale of the embedding vector components. The unscaled magpie representation produces a plot that makes qualitative assessment of chemical trends difficult, whereas the scaled representation allows some qualitative analysis of the (dis)similarity of elements based on their feature vectors.
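
The effect can be reproduced on a small scale. In the sketch below, three made-up element vectors share one feature that is much larger than the others, so their raw Pearson correlations are all close to 1; standardising each feature first spreads the correlations out. The data and the z-score standardisation are assumptions for illustration and do not reproduce how magpie_sc was built.

```python
import numpy as np

# Made-up vectors for three elements; the second feature is on a much
# larger scale than the other two
X = np.array(
    [
        [1.0, 500.0, 0.2],
        [2.0, 520.0, 0.9],
        [9.0, 480.0, 0.1],
    ]
)

# np.corrcoef treats each row as a variable, so this is the
# element-by-element Pearson correlation matrix
corr_raw = np.corrcoef(X)

# Standardise each feature (column), then correlate the elements again
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
corr_scaled = np.corrcoef(X_scaled)

print(np.round(corr_raw, 3))
print(np.round(corr_scaled, 3))
```

With only three features the numbers are crude, but the pattern is the same as in the heatmaps: a nearly uniform matrix before scaling and visible structure after it.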

In [12]:

fig, ax = plt.subplots(figsize=(24, 24))
heatmap_plotter(
    embedding=AtomEmbeds["megnet16"],
    metric="pearson",
    sortaxisby="atomic_number",
    # show_axislabels=False,
    ax=ax,
)

fig.show()

PCA plots
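
Principal component analysis (PCA) projects the embedding vectors onto the directions of greatest variance, so that the elements can be drawn in two dimensions. The sketch below performs the projection with a plain NumPy singular value decomposition on random, made-up data. How dimension_plotter carries out the reduction internally is not shown in this notebook, so this is an illustration of the technique rather than of the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: 10 "elements", each described by a 5-dimensional vector
X = rng.normal(size=(10, 5))

# Centre each feature, then take the singular value decomposition
X_centred = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centred, full_matrices=False)

# Project onto the first two principal axes (the first two rows of Vt)
coords_2d = X_centred @ Vt[:2].T

# Fraction of the total variance captured by each component
explained = S**2 / np.sum(S**2)

print(coords_2d.shape)  # (10, 2)
print(np.round(explained[:2], 3))
```

The explained-variance fractions indicate how much of the original structure survives in the two-dimensional plot; when they are small, distances in the plot should be read with caution.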

In [13]:

fig, ax = plt.subplots(figsize=(16, 12))

dimension_plotter(
    embedding=AtomEmbeds["magpie"],
    reducer="pca",
    n_components=2,
    ax=ax,
    adjusttext=True,
)

fig.tight_layout()
fig.show()

In [14]:

fig, ax = plt.subplots(figsize=(16, 12))

dimension_plotter(
    embedding=AtomEmbeds["magpie_sc"],
    reducer="pca",
    n_components=2,
    ax=ax,
    adjusttext=True,
)

fig.tight_layout()
fig.show()

In [15]:

fig, ax = plt.subplots(figsize=(16, 12))

dimension_plotter(
    embedding=AtomEmbeds["megnet16"],
    reducer="pca",
    n_components=2,
    ax=ax,
    adjusttext=True,
)

fig.tight_layout()
fig.show()

t-SNE plots

In [16]:

fig, ax = plt.subplots(figsize=(16, 12))

dimension_plotter(
    embedding=AtomEmbeds["magpie"],
    reducer="tsne",
    n_components=2,
    ax=ax,
    adjusttext=True,
)

fig.tight_layout()
fig.show()

In [17]:

fig, ax = plt.subplots(figsize=(16, 12))

dimension_plotter(
    embedding=AtomEmbeds["magpie_sc"],
    reducer="tsne",
    n_components=2,
    ax=ax,
    adjusttext=True,
)

fig.tight_layout()
fig.show()

In [18]:

fig, ax = plt.subplots(figsize=(16, 12))

dimension_plotter(
    embedding=AtomEmbeds["megnet16"],
    reducer="tsne",
    n_components=2,
    ax=ax,
    adjusttext=True,
)

fig.tight_layout()
fig.show()