What contributions have the authors mentioned in the paper "Comparing molecules and solids across structural and alchemical space" ?

In this paper, an entropy regularization is proposed to reduce the size of the SOAP kernel to quadratic and to obtain a better behaved, smoothly varying metric that interpolates, depending on the regularization parameter, between the average and best-match limits.

What is the important descriptor of oligopeptide structure?

Conventional wisdom [57] assumes that the Cα dihedral angles φ and ψ are the most important descriptors of oligopeptide structure.

Why did the authors use the conventional best-match distance for the rest of their analyses?

For the sake of simplicity (and given the authors reduced the size of the environment covariance matrix C not considering H atoms as environment centers) the authors used the conventional best-match distance for the rest of their analyses.

What is the significance of the REMatch-SOAP approach?

Reaching chemical accuracy in the automated prediction of atomization energies is an important milestone, and the fact that the authors could achieve that without fully exploring the flexibility of the REMatch-SOAP framework (e.g. by optimizing the entropy regularization parameter, the environment cutoff, eliminating the outliers, combining multiple layers of description or using a non-diagonal alchemical similarity matrix) highlights the potential of their approach.

How many conformers of arginine dipeptide were selected?

The authors selected a library of 5062 locally stable conformers of arginine dipeptide (845 with and 4217 without a Ca2+ counterion) from a public database of oligopeptide structures developed by Ropo et al. [56].

What could be used to accelerate the exploration of chemical and conformational space of materials and molecules?

For instance, it could be used to detect outliers in automated high-throughput screenings of materials, to cluster similar configurations together, to accelerate the exploration of chemical and conformational space of materials and molecules.

How many hypothetical structures were used in the map?

Although the map has been built using only reference configurations from a few of the conventional Si phases, the authors have also projected on it (using out-of-sample embedding) two sets of hypothetical configurations obtained by minima hopping [53] and by ab initio random structure search (AIRSS) [52, 55].

What is the smallest number of local minima?

In the absence of a complexing cation, the dipeptide can exist in a very large number of local minima, spanning a relatively narrow range of energies.

What is the way to define a metric in structural and alchemical space?

Distances between atomic structures based on combinations of local similarity kernels provide a flexible framework to define a metric in structural and alchemical space.

Why did the authors not include them in the environment descriptors of other atoms?

Since H atoms stay at almost fixed positions relative to their neighboring atoms, the authors decided to include them in the environment descriptors of other atoms, but did not include them explicitly as centers of atomic environments.

(Open Access) Comparing molecules and solids across structural and alchemical space. (2016) | Sandip De

warwick.ac.uk/lib-publications

Manuscript version: Author’s Accepted Manuscript

The version presented in WRAP is the author’s accepted manuscript and may differ from the

published version or Version of Record.

Persistent WRAP URL:

http://wrap.warwick.ac.uk/133475


Comparing molecules and solids across structural and alchemical space

Sandip De (1, 2), Albert P. Bartók (3), Gábor Csányi (3), and Michele Ceriotti (1, 2)

(1) National Center for Computational Design and Discovery of Novel Materials (MARVEL)

(2) Laboratory of Computational Science and Modelling, Institute of Materials, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

(3) Engineering Laboratory, University of Cambridge, Trumpington Street, Cambridge CB2 1PZ, United Kingdom

Evaluating the (dis)similarity of crystalline, disordered and molecular compounds is a critical step in the development of algorithms to navigate automatically the configuration space of complex materials. For instance, a structural similarity metric is crucial for classifying structures, searching chemical space for better compounds and materials, and driving the next generation of machine-learning techniques for predicting the stability and properties of molecules and materials. In the last few years several strategies have been designed to compare atomic coordination environments. In particular, the Smooth Overlap of Atomic Positions (SOAP) has emerged as an elegant framework to obtain translation, rotation and permutation-invariant descriptors of groups of atoms, driven by the design of various classes of machine-learned inter-atomic potentials. Here we discuss how one can combine such local descriptors using a Regularized Entropy Match (REMatch) approach to describe the similarity of both whole molecular and bulk periodic structures, introducing powerful metrics that enable the navigation of alchemical and structural complexity within a unified framework. Furthermore, using this kernel and a ridge regression method we can predict atomization energies for a database of small organic molecules with a mean absolute error below 1 kcal/mol, reaching an important milestone in the application of machine-learning techniques to the evaluation of molecular properties.

I. INTRODUCTION

The increase of available computational power, together with the development of more accurate and efficient simulation algorithms, has made it possible to reliably predict the properties of materials and molecules of increasing levels of complexity. Furthermore, high-throughput computational screening of existing and hypothetical compounds promises to dramatically accelerate the development of materials with better performance or custom-tailored properties [1–6].

These developments have made even more urgent the need for automated tools to analyze, classify [7–11] and represent [12–16] large amounts of structural data, as well as techniques to leverage this wealth of information to estimate inexpensively the properties of materials using machine-learning techniques, circumventing the need for computationally demanding quantum mechanical calculations [17–28].

At the most fundamental level, the crucial ingredient for all these techniques is a mathematical formulation of the concept of (dis)similarity between atomic configurations. It can take the form of a distance, which can be used for dimensionality reduction or clustering, or of a kernel function, which could be used for ridge regression or automated classification [29–32]. The most obvious choice for a metric to compare atomic structures would involve the Euclidean distance between the Cartesian coordinates of the atoms, commonly known as root mean square displacement (RMSD) distance, that can be easily made invariant to relative translations and rotations. It is however highly non-trivial to extend the RMSD to deal with situations in which atoms in the two structures cannot be mapped unequivocally onto each other. The deterministic evaluation of a "permutationally invariant" RMSD scales combinatorially with the size of the molecules to be compared [33], and introduces cusps at locations where the mapping of atom identities changes. Furthermore, as we will discuss later on, the RMSD is perhaps the most straightforward, but not necessarily the most flexible or effective strategy to compare molecular and condensed-phase configurations.

In the last few years, a large number of "fingerprint" functions have been developed to represent the state of structures, or of groups of atoms within a structure. Structural descriptors have been developed based on graph-theoretic procedures (e.g. SPRINTs [34]), as well as on analogies with electronic structure methods (e.g. Hamiltonian matrix, Hessian matrix, overlap matrix of Gaussian-type orbitals (GTO) or even Kohn-Sham eigenvalue fingerprints [33]). Most of these approaches have been introduced to provide a fast and reliable estimate of the dissimilarity between structures. Several other descriptors have also been used in machine learning, to predict properties of materials and molecules circumventing the need for an expensive electronic structure calculation. A non-comprehensive list of such methods includes Coulomb matrices [17], bags of bonds [28], "symmetry functions" [35], and a scattering transformation applied to a linear superposition of atomic densities [23].

A particularly promising approach to compare structures in a way that is invariant to rotations, translations, and permutations of equivalent atoms is to start from descriptors designed to represent local atomic environments and that fulfill these requirements, and combine them to yield a global measure of similarity between structures. This idea typically relies on finding the best match between pairs of environments in the two configurations [22, 33, 36], and can also be traced back to methods developed to compare images based on the matching of local features [37].

In the present work we start from a recently-developed strategy to define a similarity kernel between local environments, the smooth overlap of atomic positions (SOAP) [38], and discuss the different ways one can process the set of all possible matchings between atomic environments to generate a global kernel to compare two structures. In particular, we introduce a regularized entropy match (REMatch) strategy that is based on techniques in optimal-transport theory [39], and that is both more efficient and tunable than previously-applied methods. We discuss the relative merits of different approaches, and generalize this strategy to the comparison between structures with different numbers and kinds of atoms. We demonstrate the behavior of the different global kernels when applied to completely different classes of problems, ranging from elemental clusters, to bulk structures, to the conformers of oligopeptides and to a heterogeneous database of small organic molecules. We visualize the behavior of the distance associated with these kernels using sketch-map [13], a non-linear dimensionality reduction technique, and demonstrate the great promise shown by the straightforward application of the REMatch-SOAP kernel to the machine-learning of molecular properties. Finally, we present our conclusions.

II. THEORY

Let us start by introducing the notation we will employ in the rest of the paper. We will label structures to be compared by capital letters, use a lowercase Latin letter to indicate the index of an atom, and when necessary use a Greek lowercase letter to mark its chemical identity. For instance, the position of the i-th atom within the structure A will be labeled as $\mathbf{x}_i^A$. The environment of that atom, i.e. the abstract descriptor of the arrangement of atoms in its vicinity, will be labelled with a calligraphic upper case letter, e.g. $\mathcal{X}_i^A$, and the sub-set of such environment that singles out atoms of species α will be indicated as $\mathcal{X}_i^{A,\alpha}$.

Among the many descriptors of local environments that have been developed in recent years [1–3, 5, 6, 17–22, 24–28, 33, 36], we will refer in particular to the SOAP fingerprints [38], that have been proven to be a very elegant and robust strategy to describe coordination environments in a way that is naturally invariant with respect to translations, rotations and permutations of atoms.

We will use the notation $k(\mathcal{X}, \mathcal{X}')$ to indicate the similarity kernel (normalized to one) between two environments, which one would use in a kernel ridge regression method [31, 32, 40], and $d(\mathcal{X}, \mathcal{X}')^2 = 2 - 2k(\mathcal{X}, \mathcal{X}')$ to indicate the (squared) kernel distance between the environments, which one would use in a dimensionality reduction method [13, 16]. In what follows we will discuss different ways by which environment kernels can be combined to yield a global similarity kernel between two structures $K(A, B)$, and the associated squared distance $D(A, B)^2 = 2 - 2K(A, B)$.

A. SOAP similarity kernels and local environment distance

We will first focus on the comparison between the environment of two atoms in a pure compound made up of a single atomic species α. The crucial ingredient in making the comparison is a kernel function based on the distribution of atoms in the two environments. In the context of SOAP kernels one represents the local density of atoms within the environment $\mathcal{X}$ as a sum of Gaussian functions with variance $\sigma^2$, centered on each of the neighbors of the central atom, as well as on the central atom itself:

$$\rho_{\mathcal{X}}(\mathbf{r}) = \sum_{i \in \mathcal{X}} \exp\left(-\frac{(\mathbf{x}_i - \mathbf{r})^2}{2\sigma^2}\right). \tag{1}$$

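The SOAP kernel is then defined as the overlap of the two local atomic neighbour densities, integrated over all three-dimensional rotations $\hat{R}$ (the tilde marks the kernel before normalization),

$$\tilde{k}(\mathcal{X}, \mathcal{X}') = \int d\hat{R} \left| \int \rho_{\mathcal{X}}(\mathbf{r})\, \rho_{\mathcal{X}'}(\hat{R}\mathbf{r})\, d\mathbf{r} \right|^n. \tag{2}$$

Note that in the n = 1 case the two integrals can be switched, and therefore the kernel loses all angular information, so we focus on the n = 2 case exclusively. For most applications it is helpful to normalise the kernel so that the self-similarity of any environment is unity, giving the final kernel

$$k(\mathcal{X}, \mathcal{X}') = \frac{\tilde{k}(\mathcal{X}, \mathcal{X}')}{\sqrt{\tilde{k}(\mathcal{X}, \mathcal{X})\, \tilde{k}(\mathcal{X}', \mathcal{X}')}}. \tag{3}$$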

It is a remarkable property of the SOAP kernel that the integration over all rotations can be carried out analytically. First, the atomic neighbour density is expanded in a basis composed of spherical harmonics and a set of orthogonal radial basis functions $\{g_b(r)\}$,

$$\rho_{\mathcal{X}}(\mathbf{r}) = \sum_{blm} c_{blm}\, g_b(|\mathbf{r}|)\, Y_{lm}(\hat{\mathbf{r}}), \tag{4}$$

then the rotationally invariant power spectrum is given by

$$p(\mathcal{X})_{b_1 b_2 l} = \pi \sqrt{\frac{8}{2l + 1}} \sum_m (c_{b_1 l m})^{\dagger} c_{b_2 l m}. \tag{5}$$

Collecting the elements of the power spectrum into a unit-length vector $\hat{\mathbf{p}}(\mathcal{X})$, the SOAP kernel is shown [38] to be given by

$$k(\mathcal{X}, \mathcal{X}') = \hat{\mathbf{p}}(\mathcal{X}) \cdot \hat{\mathbf{p}}(\mathcal{X}'), \tag{6}$$

eventually leaving a definition of the distance as

$$d(\mathcal{X}, \mathcal{X}') = \sqrt{2 - 2\, \hat{\mathbf{p}}(\mathcal{X}) \cdot \hat{\mathbf{p}}(\mathcal{X}')}. \tag{7}$$

The SOAP kernel can be written in the form of a dot product, therefore it is manifestly positive definite, which implies that the distance function (7) is a proper metric.
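To make Eqs. (6) and (7) concrete, the following minimal sketch evaluates the kernel and the distance for two toy fingerprints. The three-component vectors are hypothetical placeholders, not actual SOAP power spectra, and the function names are chosen here purely for illustration.

```python
import math

def normalise(p):
    """Scale a power-spectrum vector to unit length."""
    norm = math.sqrt(sum(x * x for x in p))
    return [x / norm for x in p]

def soap_kernel(p1, p2):
    """Eq. (6): dot product of the unit-length power spectra."""
    return sum(a * b for a, b in zip(normalise(p1), normalise(p2)))

def soap_distance(p1, p2):
    """Eq. (7): kernel distance; max() guards against round-off below zero."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * soap_kernel(p1, p2)))

# Hypothetical three-component fingerprints for two environments.
p_x = [1.0, 2.0, 2.0]
p_y = [2.0, 1.0, 2.0]
print(soap_kernel(p_x, p_y), soap_distance(p_x, p_y))
```

Because both vectors have unit length after normalisation, the distance is simply the Euclidean distance between them, which is why it satisfies the properties of a metric.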

B. From local descriptors to structure matching

The vectors that enter the definition of the environments are defined in such a way that their dot product is the overlap of (smoothed) atomic distributions. Given two structures with the same number N of atoms, we can compute an environment covariance matrix that contains all the possible pairings of environments,

$$C_{ij}(A, B) = k\left(\mathcal{X}_i^A, \mathcal{X}_j^B\right). \tag{8}$$

This matrix contains the complete information on the pair-wise similarity of all the environments between the two systems. Based on it, one can introduce a global kernel to compare two structures or molecules. We will discuss and compare four different approaches. All of them are meant to be normalized, i.e. the given expressions for $K(A, B)$ are to be divided by $\sqrt{K(A, A) K(B, B)}$ whenever the kernel is not normalized to one by construction.
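The construction of the environment covariance matrix and the normalization of a global kernel can be sketched as follows. The fingerprints are hypothetical two-component unit vectors, and `mean_entry` is only a stand-in for whichever global kernel one chooses to build from the matrix.

```python
import math

def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def covariance_matrix(envs_a, envs_b):
    """Eq. (8): C_ij(A, B) = k(X_i^A, X_j^B), here a dot product of unit vectors."""
    return [[dot(pa, pb) for pb in envs_b] for pa in envs_a]

def normalised(global_kernel, envs_a, envs_b):
    """Divide K(A, B) by sqrt(K(A, A) K(B, B))."""
    k_ab = global_kernel(covariance_matrix(envs_a, envs_b))
    k_aa = global_kernel(covariance_matrix(envs_a, envs_a))
    k_bb = global_kernel(covariance_matrix(envs_b, envs_b))
    return k_ab / math.sqrt(k_aa * k_bb)

def mean_entry(C):
    """Stand-in global kernel (an assumption): the mean of all entries of C."""
    return sum(map(sum, C)) / (len(C) * len(C[0]))

# Hypothetical unit-length fingerprints: two environments per structure.
envs_a = [[1.0, 0.0], [0.0, 1.0]]
envs_b = [[0.6, 0.8], [1.0, 0.0]]
C_ab = covariance_matrix(envs_a, envs_b)
print(C_ab)
print(normalised(mean_entry, envs_a, envs_b))
```

After normalization, the kernel of any structure with itself is one regardless of which global kernel is plugged in.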

Average structural kernel. A first possibility to compare two structures involves computing an average kernel

$$\bar{K}(A, B) = \frac{1}{N^2} \sum_{ij} C_{ij}(A, B) = \left[\frac{1}{N} \sum_i \hat{\mathbf{p}}(\mathcal{X}_i^A)\right] \cdot \left[\frac{1}{N} \sum_j \hat{\mathbf{p}}(\mathcal{X}_j^B)\right]. \tag{9}$$

One sees that $\bar{K}$ can be computed inexpensively by just storing the average SOAP fingerprint over all environments of each of the two structures. This kernel is also positive-definite, being based on a scalar product [41], and therefore induces a metric $\bar{D}(A, B) = \sqrt{2 - 2\bar{K}(A, B)}$. On the other hand, it is not a very sensitive metric: two very different structures can appear to be the same if they are composed of environments that give the same fingerprint upon averaging.
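The two equivalent forms of the average kernel in Eq. (9) can be checked numerically. This is a minimal sketch with hypothetical two-component unit fingerprints; the function names are illustrative only.

```python
def dot(u, v):
    """Scalar product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def average_kernel_from_matrix(envs_a, envs_b):
    """First form of Eq. (9): mean of all pairwise environment kernels."""
    total = sum(dot(pa, pb) for pa in envs_a for pb in envs_b)
    return total / (len(envs_a) * len(envs_b))

def mean_fingerprint(envs):
    """Average fingerprint of a structure, one component at a time."""
    return [sum(column) / len(envs) for column in zip(*envs)]

def average_kernel_from_means(envs_a, envs_b):
    """Second form of Eq. (9): dot product of the two averaged fingerprints."""
    return dot(mean_fingerprint(envs_a), mean_fingerprint(envs_b))

# Hypothetical unit-length fingerprints, two environments per structure.
envs_a = [[1.0, 0.0], [0.0, 1.0]]
envs_b = [[0.6, 0.8], [0.8, 0.6]]
print(average_kernel_from_matrix(envs_a, envs_b))
print(average_kernel_from_means(envs_a, envs_b))
```

The second form only needs one stored vector per structure, which is the source of the low cost noted in the text.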

Best-match structural kernel. Another possibility, that has been used previously with different kinds of structural fingerprints [22, 33, 42, 43], is to identify the best match between the environments of the two structures,

$$\hat{K}(A, B) = \frac{1}{N} \max_{\pi} \sum_i C_{i\pi_i}(A, B), \tag{10}$$

which can be accomplished with an $O(N^3)$ effort using the Munkres algorithm [44]. The corresponding distance has the properties of a metric, which means it can still be safely used to assess similarity between structures and molecules. Unfortunately, this "best-match" kernel is not guaranteed to be positive-definite, which makes it less than ideal for use in machine-learning applications. Furthermore, the distance obtained by a best-match strategy is continuous, but has discontinuous derivatives whenever the matching of environments changes. These problems can be solved or alleviated by matching the environments based on a different strategy, that combines features of the average and the best-match kernels.
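For very small structures, the best-match kernel of Eq. (10) can be illustrated by brute force. The sketch below enumerates all permutations instead of using the Munkres algorithm mentioned in the text, so it is only meant to show what is being maximised; the matrix entries are hypothetical.

```python
from itertools import permutations

def best_match_kernel(C):
    """Eq. (10) by exhaustive search: (1/N) max over permutations of sum_i C[i][pi(i)].

    This costs N! evaluations and stands in for the O(N^3) Munkres algorithm
    only for illustration on tiny matrices.
    """
    n = len(C)
    best = max(
        sum(C[i][pi[i]] for i in range(n))
        for pi in permutations(range(n))
    )
    return best / n

# Hypothetical 3 x 3 environment covariance matrix.
C = [[0.9, 0.2, 0.1],
     [0.3, 0.8, 0.4],
     [0.2, 0.5, 0.7]]
print(best_match_kernel(C))
```

In this toy matrix the identity pairing wins, so the result is the mean of the diagonal entries.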

Regularized entropy match kernel. The best-match problem can also be stated in an alternative form, namely

$$\hat{K}(A, B) = \max_{\mathbf{P} \in \mathcal{U}(N, N)} \sum_{ij} C_{ij}(A, B) P_{ij}, \tag{11}$$

where $\mathcal{U}(N, N)$ is the set of N × N (scaled) doubly stochastic matrices, whose rows and columns sum to 1/N, i.e. $\sum_i P_{ij} = \sum_j P_{ij} = 1/N$. We can then borrow an idea that was recently introduced in the field of optimal transport [39] to regularize this problem, adding a penalty that instead aims at maximizing the information entropy for the matrix $\mathbf{P}$ subject to the aforementioned constraints on its marginals. Such a "regularized-entropy match" (REMatch) kernel is defined as

$$\hat{K}^{\gamma}(A, B) = \mathrm{Tr}\left[(\mathbf{P}^{\gamma})^{\top} \mathbf{C}(A, B)\right], \qquad \mathbf{P}^{\gamma} = \operatorname*{argmin}_{\mathbf{P} \in \mathcal{U}(N, N)} \sum_{ij} P_{ij} \left(1 - C_{ij} + \gamma \ln P_{ij}\right), \tag{12}$$

where the regularization is given by an entropy term $E(\mathbf{P}) = -\sum_{ij} P_{ij} \ln P_{ij}$. $\mathbf{P}^{\gamma}$ can be computed very efficiently, with $O(N^2)$ effort, by the Sinkhorn algorithm [39] (see Appendix C). For γ → 0, the entropic penalty becomes negligible, and $\hat{K}^{\gamma}(A, B) \to \hat{K}(A, B)$, the best-match kernel. For γ → ∞, one selects the $\mathbf{P}$ with the least information content, that is one with constant $P_{ij} = 1/N^2$. Hence, in this limit $\hat{K}^{\gamma}(A, B) \to \bar{K}(A, B)$, the average kernel.
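The two limits of the REMatch kernel can be explored with a small Sinkhorn iteration. This is a sketch under stated assumptions rather than the procedure of Appendix C: it uses the standard alternating row and column rescaling of exp(-(1 - C_ij)/γ), a fixed number of iterations instead of a convergence test, and for rectangular matrices it assumes marginals of 1/N for rows and 1/M for columns. The matrix entries are hypothetical.

```python
import math

def rematch_kernel(C, gamma, n_iter=500):
    """Entropy-regularised match of Eq. (12) via Sinkhorn scaling.

    P_ij = u_i * K_ij * v_j with K_ij = exp(-(1 - C_ij) / gamma); u and v are
    rescaled in turn so that rows sum to 1/n and columns sum to 1/m.
    """
    n, m = len(C), len(C[0])
    K = [[math.exp(-(1.0 - C[i][j]) / gamma) for j in range(m)] for i in range(n)]
    u = [1.0 / n] * n
    v = [1.0 / m] * m
    for _ in range(n_iter):
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Kernel value: sum_ij P_ij C_ij, i.e. Tr(P^T C).
    return sum(u[i] * K[i][j] * v[j] * C[i][j] for i in range(n) for j in range(m))

# Hypothetical 3 x 3 environment covariance matrix.
C = [[0.9, 0.2, 0.1],
     [0.3, 0.8, 0.4],
     [0.2, 0.5, 0.7]]
for gamma in (0.02, 0.5, 1000.0):
    print(gamma, rematch_kernel(C, gamma))
```

For this matrix the best-match value is 0.8 (identity pairing) and the plain average of the entries is 4.1/9, so a small γ should approach the former and a large γ the latter. Very small γ values make the exponentials underflow, so a practical implementation would need a more careful treatment than this sketch.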

Permutation structural kernel. For the sake of completeness, we also discuss a fourth option: rather than summing over all possible pairs of environments, one can consider each pairing of environments separately, and sum over all the N! possible permutations that define the pairings. In order to kill off more rapidly the combinations of environments that contain bad matches, one can multiply the kernels that appear in each pairing, and define a permutation kernel

$$K^{\mathrm{perm}}(A, B) = \sum_{\pi} \prod_i C_{i\pi_i}(A, B) = \mathrm{perm}\, \mathbf{C}(A, B). \tag{13}$$

This choice corresponds to the evaluation of the permanent of the environment kernel matrix, and has some appeal as it is guaranteed to yield a positive-definite kernel [45]. The evaluation of the permanent of a matrix, however, has combinatorial computational complexity [46]. Its application is limited to small molecules, and we will not discuss it further in the present work.
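The combinatorial cost of the permanent in Eq. (13) is easy to see from a direct implementation. This is a brute-force sketch that follows the definition literally (a sum over all N! permutations of products of matrix entries), with a hypothetical 2 × 2 matrix as input.

```python
from itertools import permutations
from math import prod

def permanent(C):
    """Permanent of a square matrix: sum over permutations of prod_i C[i][pi(i)]."""
    n = len(C)
    return sum(
        prod(C[i][pi[i]] for i in range(n))
        for pi in permutations(range(n))
    )

# Hypothetical 2 x 2 environment covariance matrix.
C = [[0.9, 0.2],
     [0.3, 0.8]]
print(permanent(C))
```

Unlike the determinant, the permanent carries no alternating signs, so a pairing with one poor match contributes a small product instead of cancelling against other terms.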

C. Matching structures containing multiple species

When comparing structures that contain different atomic species, the first problem that has to be addressed is that of extending the local environment metric so that the presence of multiple elements is properly accounted for.

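SOAP descriptors provide a straightforward way to do this: a separate density can be built for each atomic species,

$$\rho_{\mathcal{X}}^{\alpha}(\mathbf{r}) = \sum_{i \in \mathcal{X}^{\alpha}} \exp\left(-\frac{(\mathbf{x}_i - \mathbf{r})^2}{2\sigma^2}\right), \tag{14}$$

and a (non-normalized) kernel be defined by matching separately the different species:

$$\tilde{k}(\mathcal{X}, \mathcal{X}') = \int d\hat{R} \left| \sum_{\alpha} \int \rho_{\mathcal{X}}^{\alpha}(\mathbf{r})\, \rho_{\mathcal{X}'}^{\alpha}(\hat{R}\mathbf{r})\, d\mathbf{r} \right|^2 = \sum_{\alpha\beta} \mathbf{p}_{\alpha\beta}(\mathcal{X}) \cdot \mathbf{p}_{\alpha\beta}(\mathcal{X}'). \tag{15}$$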

Here we have introduced "partial" power spectra $\mathbf{p}_{\alpha\beta}$ that encode information on the relative arrangement of pairs of species, and can be written as

$$p(\mathcal{X})_{b_1 b_2 l}^{\alpha\beta} = \pi \sqrt{\frac{8}{2l + 1}} \sum_m (c_{b_1 l m}^{\alpha})^{\dagger} c_{b_2 l m}^{\beta}, \tag{16}$$

where we built the angular channel dependent weights into the elements of the power spectrum. The expansion coefficients describe the atomic density of species α,

$$\rho_{\mathcal{X}}^{\alpha}(\mathbf{r}) = \sum_{blm} c_{blm}^{\alpha}\, g_b(|\mathbf{r}|)\, Y_{lm}(\hat{\mathbf{r}}), \tag{17}$$

in terms of a basis set, which is a combination of spherical harmonics and orthogonal radial functions. The kernel in Eq. (15) can then be normalized as in Eq. (3).

Note that, even though the overlap between the environments of the different species is considered to be zero, the kernel is sensitive to the relative correlations of different species. This is because, due to the squaring of the density overlap within the rotational average, the SO(3) power spectrum vectors contain mixed-species components. One could also introduce a notion of "alchemical similarity" between different species. For instance, when comparing structures of III-V semiconductors one could disregard the chemical information on the identity of an atom as long as it belongs to the same column of the periodic table. Such a notion can be readily implemented, defining an alchemical similarity kernel $\kappa_{\alpha\beta}$ which is one for pairs that should be considered interchangeable, and tends to zero for pairs that one wants to consider as completely unrelated. The expression then becomes

$$\tilde{k}(\mathcal{X}, \mathcal{X}') = \int d\hat{R} \left| \sum_{\alpha\alpha'} \kappa_{\alpha\alpha'} \int \rho_{\mathcal{X}}^{\alpha}(\mathbf{r})\, \rho_{\mathcal{X}'}^{\alpha'}(\hat{R}\mathbf{r})\, d\mathbf{r} \right|^2 = \sum_{\alpha\beta\alpha'\beta'} \mathbf{p}_{\alpha\beta}(\mathcal{X}) \cdot \mathbf{p}_{\alpha'\beta'}(\mathcal{X}')\, \kappa_{\alpha\alpha'} \kappa_{\beta\beta'}. \tag{18}$$

The original expression (15) can be recovered by setting $\kappa_{\alpha\beta} = \delta_{\alpha\beta}$. Global similarity kernels can then be transparently introduced to compare structures composed of different atomic species, with geometry and alchemical composition treated on the same footing and the possibility of adapting the definition of similarity to the system and application.
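The effect of the alchemical similarity kernel in Eq. (18) can be illustrated with a toy calculation. In this sketch the partial power spectra are hypothetical two-component vectors stored in a dictionary keyed by species pair (a missing pair is treated as a zero vector), and the `GROUP` table and both κ functions are assumptions made for the example.

```python
def alchemical_kernel(p_x, p_y, kappa):
    """Non-normalised kernel of Eq. (18).

    Sums kappa(a, a') * kappa(b, b') * p_ab(X) . p_a'b'(X') over all stored
    species pairs (a, b) of X and (a', b') of X'.
    """
    total = 0.0
    for (a, b), vec_x in p_x.items():
        for (a2, b2), vec_y in p_y.items():
            weight = kappa(a, a2) * kappa(b, b2)
            if weight:
                total += weight * sum(u * v for u, v in zip(vec_x, vec_y))
    return total

def delta(a, b):
    """kappa = Kronecker delta: recovers Eq. (15)."""
    return 1.0 if a == b else 0.0

# Hypothetical group assignment used to treat same-column elements as equal.
GROUP = {"Ga": 13, "In": 13, "As": 15, "Sb": 15}

def same_group(a, b):
    """kappa = 1 for elements in the same column of the periodic table."""
    return 1.0 if GROUP[a] == GROUP[b] else 0.0

# Toy partial power spectra for one environment in GaAs and one in InAs.
p_x = {("Ga", "As"): [1.0, 0.5]}
p_y = {("In", "As"): [0.8, 0.4]}
print(alchemical_kernel(p_x, p_y, delta))
print(alchemical_kernel(p_x, p_y, same_group))
```

With the delta choice the two environments share no species pair and the kernel vanishes, whereas the same-column choice lets the Ga and In components overlap.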

D. Matching structures with different numbers of atoms

The definitions above can be readily extended to compare structures containing different numbers of atoms $N_A$ and $N_B$. We discuss two possible strategies. When comparing crystalline, periodic structures, it may be the case that one of the structures corresponds to a slight distortion of the other, which needs a larger unit cell for a proper representation. Comparing the structures using the average kernel (9) automatically does the "right thing", that is, it performs the comparison in a way that is independent of the number of times the two structures have to be replicated to match atom counts. In the case of the permutation kernel and of the best-match kernel, the most effective way to perform the comparison is to evaluate the least common multiple N of $N_A$ and $N_B$, and replicate the environment similarity matrix to form a square matrix. One can then proceed to compute the permanent, or solve the linear assignment problem, based on such replicated matrix. The advantage of this procedure is that one does not need to explicitly find the relation between the shape of the two unit cells and replicate them to perform the comparison: the environment similarities can be evaluated including periodic replicas, and the minimum number of comparisons will be naturally performed among any pairs of structures. However, the least common multiple can become very large, making even the best-match kernel (10) impractically demanding, although the cost can be reduced by exploiting the redundancy in the extended environment covariance matrix. As shown in the Appendix, the REMatch kernel (12) can be computed easily also for a rectangular matrix, which constitutes an additional advantage of formulating the environment matching problem in terms of a regularized transport optimization.
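The replication of a rectangular similarity matrix up to the least common multiple of the two atom counts can be sketched as follows. The 2 × 3 matrix is a hypothetical example, and tiling the entries periodically is one straightforward reading of "replicate" that is assumed here.

```python
import math

def replicate_to_square(C):
    """Tile an N_A x N_B similarity matrix up to lcm(N_A, N_B) rows and columns."""
    n_a, n_b = len(C), len(C[0])
    n = n_a * n_b // math.gcd(n_a, n_b)  # least common multiple
    return [[C[i % n_a][j % n_b] for j in range(n)] for i in range(n)]

def mean_entry(C):
    """Mean of all matrix entries, i.e. the average kernel before normalization."""
    return sum(map(sum, C)) / (len(C) * len(C[0]))

# Hypothetical similarity matrix between a 2-atom and a 3-atom cell.
C = [[0.9, 0.2, 0.4],
     [0.3, 0.8, 0.5]]
C_big = replicate_to_square(C)
print(len(C_big), len(C_big[0]))
print(mean_entry(C), mean_entry(C_big))
```

The mean over entries is unchanged by the tiling, which matches the remark that the average kernel needs no replication, while the replicated square matrix is what a linear-assignment or permanent calculation would take as input.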

When comparing molecules or molecular fragments, it may be advisable to proceed differently, since in that case the chemical composition might differ, and it may not make sense to compare molecules as if they were part of an infinite periodic assembly. A possible strategy is
