Comparing molecules and solids across structural and alchemical space.

TLDR
In this article, a regularized entropy match (REMatch) approach was proposed to describe the similarity of both molecular and bulk periodic structures, introducing powerful metrics that enable the navigation of alchemical and structural complexities within a unified framework.
Abstract
Evaluating the (dis)similarity of crystalline, disordered and molecular compounds is a critical step in the development of algorithms to navigate automatically the configuration space of complex materials. For instance, a structural similarity metric is crucial for classifying structures, searching chemical space for better compounds and materials, and driving the next generation of machine-learning techniques for predicting the stability and properties of molecules and materials. In the last few years several strategies have been designed to compare atomic coordination environments. In particular, the smooth overlap of atomic positions (SOAP) has emerged as an elegant framework to obtain translation, rotation and permutation-invariant descriptors of groups of atoms, underlying the development of various classes of machine-learned inter-atomic potentials. Here we discuss how one can combine such local descriptors using a regularized entropy match (REMatch) approach to describe the similarity of both whole molecular and bulk periodic structures, introducing powerful metrics that enable the navigation of alchemical and structural complexities within a unified framework. Furthermore, using this kernel and a ridge regression method we can predict atomization energies for a database of small organic molecules with a mean absolute error below 1 kcal mol$^{-1}$, reaching an important milestone in the application of machine-learning techniques for the evaluation of molecular properties.

warwick.ac.uk/lib-publications
Manuscript version: Author’s Accepted Manuscript
The version presented in WRAP is the author’s accepted manuscript and may differ from the
published version or Version of Record.
Persistent WRAP URL:
http://wrap.warwick.ac.uk/133475

Comparing molecules and solids across structural and alchemical space
Sandip De,$^{1,2}$ Albert P. Bartók,$^{3}$ Gábor Csányi,$^{3}$ and Michele Ceriotti$^{1,2}$
$^{1}$ National Center for Computational Design and Discovery of Novel Materials (MARVEL)
$^{2}$ Laboratory of Computational Science and Modelling, Institute of Materials,
École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
$^{3}$ Engineering Laboratory, University of Cambridge,
Trumpington Street, Cambridge CB2 1PZ, United Kingdom
Evaluating the (dis)similarity of crystalline, disordered and molecular compounds is a critical
step in the development of algorithms to navigate automatically the configuration space of complex
materials. For instance, a structural similarity metric is crucial for classifying structures, searching
chemical space for better compounds and materials, and driving the next generation of machine-
learning techniques for predicting the stability and properties of molecules and materials. In the last
few years several strategies have been designed to compare atomic coordination environments. In
particular, the Smooth Overlap of Atomic Positions (SOAP) has emerged as an elegant framework
to obtain translation, rotation and permutation-invariant descriptors of groups of atoms, driven by
the design of various classes of machine-learned inter-atomic potentials. Here we discuss how one can
combine such local descriptors using a Regularized Entropy Match (REMatch) approach to describe
the similarity of both whole molecular and bulk periodic structures, introducing powerful metrics
that enable the navigation of alchemical and structural complexity within a unified framework.
Furthermore, using this kernel and a ridge regression method we can predict atomization energies
for a database of small organic molecules with a mean absolute error below 1 kcal/mol, reaching an
important milestone in the application of machine-learning techniques to the evaluation of molecular
properties.
I. INTRODUCTION
The increase of available computational power, together
with the development of more accurate and efficient
simulation algorithms, has made it possible to reliably
predict the properties of materials and molecules of
increasing levels of complexity. Furthermore,
high-throughput computational screening of existing and
hypothetical compounds promises to dramatically accelerate
the development of materials with better performance or
custom-tailored properties [1–6].
These developments have made even more urgent the
need for automated tools to analyze, classify [7–11] and
represent [12–16] large amounts of structural data, as
well as techniques to leverage this wealth of information
to estimate inexpensively the properties of materials us-
ing machine-learning techniques, circumventing the need
for computationally demanding quantum mechanical cal-
culations [17–28].
At the most fundamental level, the crucial ingredient
for all these techniques is a mathematical formulation of
the concept of (dis)similarity between atomic configura-
tions, that can take the form of a distance - that can be
used for dimensionality reduction or clustering - or of a
kernel function, that could be used for ridge regression or
automated classification.[29–32] The most obvious choice
for a metric to compare atomic structures would involve
the Euclidean distance between the Cartesian coordi-
nates of the atoms, commonly known as root mean square
displacement (RMSD) distance, that can be easily made
invariant to relative translations and rotations. It is how-
ever highly non-trivial to extend the RMSD to deal with
situations in which atoms in the two structures cannot
be mapped unequivocally onto each other. The determin-
istic evaluation of a “permutationally invariant” RMSD
scales combinatorially with the size of the molecules to be
compared [33], and introduces cusps at locations where
the mapping of atom identities changes. Furthermore, as
we will discuss later on, the RMSD is perhaps the most
straightforward, but not necessarily the most flexible or
effective strategy to compare molecular and condensed-
phase configurations.
In the last few years, a large number of “fingerprint”
functions have been developed to represent the state
of structures, or of groups of atoms within a struc-
ture. Structural descriptors have been developed based
on graph-theoretic procedures (e.g. SPRINTs [34]), as
well as on analogies with electronic structure methods
(e.g. Hamiltonian matrix, Hessian matrix, Overlap ma-
trix of Gaussian type Orbitals (GTO) or even Kohn-
Sham eigenvalues fingerprints [33]). Most of these ap-
proaches have been introduced to provide a fast and re-
liable estimate of the dissimilarity between structures.
Several other descriptors have been also used in machine
learning, to predict properties of materials and molecules
circumventing the need for an expensive electronic
structure calculation. A non-comprehensive list of such
methods includes Coulomb matrices [17], bags of bonds [28],
“symmetry functions” [35], and scattering transforms
applied to a linear superposition of atomic densities [23].
A particularly promising approach to compare struc-
tures in a way that is invariant to rotations, transla-
tions, and permutations of equivalent atoms, is to start
from descriptors designed to represent local atomic en-
vironments and that fulfill these requirements, and com-
bine them to yield a global measure of similarity between

structures. This idea typically relies on finding the best
match between pairs of environments in the two configu-
rations [22, 33, 36], and can also be traced back to meth-
ods developed to compare images based on the matching
of local features [37].
In the present work we start from a recently-developed
strategy to define a similarity kernel between local
environments, the smooth overlap of atomic positions
(SOAP) [38], and discuss the different ways one can
process the set of all possible matchings between atomic
environments to generate a global kernel to compare
two structures. In particular, we introduce a regular-
ized entropy match (REMatch) strategy that is based on
techniques in optimal-transport theory [39], and that is
both more efficient and tunable than previously-applied
methods. We discuss the relative merits of different ap-
proaches, and generalize this strategy to the compari-
son between structures with different numbers and kinds
of atoms. We demonstrate the behavior of the differ-
ent global kernels when applied to completely different
classes of problems, ranging from elemental clusters, to
bulk structures, to the conformers of oligopeptides and
to a heterogeneous database of small organic molecules.
We visualize the behavior of the distance associated with
these kernels using sketch-map [13], a non-linear dimen-
sionality reduction technique, and demonstrate the great
promise shown by the straightforward application of the
REMatch-SOAP kernel to the machine-learning of molec-
ular properties. Finally, we present our conclusions.
II. THEORY
Let us start by introducing the notation we will employ
in the rest of the paper. We will label structures to be
compared by capital letters, use a lowercase Latin letter
to indicate the index of an atom, and when necessary use a
Greek lowercase letter to mark its chemical identity. For
instance, the position of the i-th atom within the
structure A will be labeled as $\mathbf{x}_i^A$. The
environment of that atom, i.e. the abstract descriptor of
the arrangement of atoms in its vicinity, will be labelled
with a calligraphic upper case letter, e.g.
$\mathcal{X}_i^A$, and the sub-set of such environment that
singles out atoms of species α will be indicated as
$\mathcal{X}_i^{A,\alpha}$.
Among the many descriptors of local environments
that have been developed in the recent years[1–3, 5, 6, 17–
22, 24–28, 33, 36], we will refer in particular to the SOAP
fingerprints [38], that have been proven to be a very el-
egant and robust strategy to describe coordination envi-
ronments in a way that is naturally invariant with respect
to translations, rotations and permutations of atoms.
We will use the notation $k(\mathcal{X}, \mathcal{X}')$ to
indicate the similarity kernel (normalized to one) between
two environments, which one would use in a kernel ridge
regression method [31, 32, 40], and
$d(\mathcal{X}, \mathcal{X}')^2 = 2 - 2k(\mathcal{X}, \mathcal{X}')$
to indicate the (squared) kernel distance between the
environments, which one would use in a dimensionality
reduction method [13, 16]. In what follows we will discuss
different ways by which environment kernels can be
combined to yield a global similarity kernel between two
structures $K(A, B)$, and the associated squared distance
$D(A, B)^2 = 2 - 2K(A, B)$.
A. SOAP similarity kernels and local environment
distance
We will first focus on the comparison between the en-
vironment of two atoms in a pure compound made up
of a single atomic species α. The crucial ingredient in
making the comparison is a kernel function based on the
distribution of atoms in the two environments. In the
context of SOAP kernels one represents the local density
of atoms within the environment $\mathcal{X}$ as a sum of
Gaussian functions with variance $\sigma^2$, centered on
each of the neighbors of the central atom, as well as on
the central atom itself:
$$\rho_{\mathcal{X}}(\mathbf{r}) = \sum_{i\in\mathcal{X}} \exp\left(-\frac{(\mathbf{x}_i - \mathbf{r})^2}{2\sigma^2}\right). \tag{1}$$
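The SOAP kernel is then defined as the overlap of the
two local atomic neighbour densities, integrated over all
three-dimensional rotations $\hat{R}$,
$$\tilde{k}(\mathcal{X}, \mathcal{X}') = \int d\hat{R} \left| \int \rho_{\mathcal{X}}(\mathbf{r})\, \rho_{\mathcal{X}'}(\hat{R}\mathbf{r})\, d\mathbf{r} \right|^n. \tag{2}$$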
Note that in the n = 1 case the two integrals can be
switched, and therefore the kernel loses all angular
information, so we focus on the n = 2 case exclusively. For
most applications it is helpful to normalise the kernel so
that the self-similarity of any environment is unity, giving
the final kernel
$$k(\mathcal{X}, \mathcal{X}') = \tilde{k}(\mathcal{X}, \mathcal{X}') \Big/ \sqrt{\tilde{k}(\mathcal{X}, \mathcal{X})\, \tilde{k}(\mathcal{X}', \mathcal{X}')}. \tag{3}$$
It is a remarkable property of the SOAP kernel that the
integration over all rotations can be carried out
analytically. First, the atomic neighbour density is
expanded in a basis composed of spherical harmonics and a
set of orthogonal radial basis functions $\{g_b(r)\}$,
$$\rho_{\mathcal{X}}(\mathbf{r}) = \sum_{blm} c_{blm}\, g_b(|\mathbf{r}|)\, Y_{lm}(\hat{\mathbf{r}}), \tag{4}$$
then the rotationally invariant power spectrum is given
by
$$p(\mathcal{X})_{b_1 b_2 l} = \pi \sqrt{\frac{8}{2l+1}} \sum_m (c_{b_1 lm})^{\dagger} c_{b_2 lm}. \tag{5}$$
Collecting the elements of the power spectrum into a
unit-length vector $\hat{\mathbf{p}}(\mathcal{X})$, the SOAP
kernel is shown [38] to be given by
$$k(\mathcal{X}, \mathcal{X}') = \hat{\mathbf{p}}(\mathcal{X}) \cdot \hat{\mathbf{p}}(\mathcal{X}') \tag{6}$$

eventually leaving a definition of the distance as
$$d(\mathcal{X}, \mathcal{X}') = \sqrt{2 - 2\,\hat{\mathbf{p}}(\mathcal{X}) \cdot \hat{\mathbf{p}}(\mathcal{X}')}. \tag{7}$$
The SOAP kernel can be written in the form of a dot
product, therefore it is manifestly positive definite, which
implies that the distance function (7) is a proper metric.
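As a minimal illustration of Eqs. (6) and (7), the sketch below assumes that the power spectrum of each environment has already been computed and is available as a plain list of numbers. The toy vectors and the function names `soap_kernel` and `kernel_distance` are hypothetical and only serve to show the normalization and the kernel-induced distance; they are not output of an actual SOAP code.

```python
import math

def normalize(p):
    """Scale a power-spectrum vector to unit length."""
    norm = math.sqrt(sum(x * x for x in p))
    return [x / norm for x in p]

def soap_kernel(p1, p2):
    """Eq. (6): dot product of unit-length power-spectrum vectors."""
    return sum(a * b for a, b in zip(normalize(p1), normalize(p2)))

def kernel_distance(p1, p2):
    """Eq. (7): d = sqrt(2 - 2k); max() guards against rounding below zero."""
    return math.sqrt(max(0.0, 2.0 - 2.0 * soap_kernel(p1, p2)))

# Hypothetical power-spectrum vectors for three environments
p_a = [1.0, 0.5, 0.2]
p_b = [0.9, 0.6, 0.1]
p_c = [0.1, 0.3, 1.0]

print(soap_kernel(p_a, p_b))
print(kernel_distance(p_a, p_b))
```

Because the kernel is a dot product of unit vectors, the self-similarity is one and the distance between an environment and itself vanishes up to rounding.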
B. From local descriptors to structure matching
The vectors that enter the definition of the environ-
ments are defined in such a way that their dot product
is the overlap of (smoothed) atomic distributions. Given
two structures with the same number N of atoms, we can
compute an environment covariance matrix that contains
all the possible pairings of environments
$$C_{ij}(A, B) = k\left(\mathcal{X}_i^A, \mathcal{X}_j^B\right). \tag{8}$$
This matrix contains the complete information on the
pair-wise similarity of all the environments between the
two systems. Based on it, one can introduce a global
kernel to compare two structures or molecules. We will
discuss and compare four different approaches. All of them
are meant to be normalized, i.e. the given expressions for
$K(A, B)$ are to be divided by $\sqrt{K(A, A)K(B, B)}$
whenever the kernel is not normalized to one by construction.
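Average structural kernel. A first possibility to compare
two structures involves computing an average kernel
$$\bar{K}(A, B) = \frac{1}{N^2} \sum_{ij} C_{ij}(A, B) = \left[ \frac{1}{N} \sum_i \mathbf{p}(\mathcal{X}_i^A) \right] \cdot \left[ \frac{1}{N} \sum_j \mathbf{p}(\mathcal{X}_j^B) \right]. \tag{9}$$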
One sees that $\bar{K}$ can be computed inexpensively by
just storing the average SOAP fingerprint over all
environments of each of the two structures. This kernel is
also positive-definite, being based on a scalar
product [41], and therefore induces a metric
$\bar{D}(A, B) = \sqrt{2 - 2\bar{K}(A, B)}$. On the other
hand, it is not a very sensitive metric: two very different
structures can appear to be the same if they are composed
of environments that give the same fingerprint upon averaging.
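The two forms of Eq. (9) can be checked numerically with a short sketch. It assumes each structure is given as a list of unit-length fingerprint vectors (toy values, not real SOAP output), and all function names are illustrative. The prefactor is written as 1/(N_A N_B), which reduces to 1/N^2 when both structures have N atoms.

```python
import math

def unit(p):
    """Scale a fingerprint vector to unit length."""
    n = math.sqrt(sum(x * x for x in p))
    return [x / n for x in p]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def covariance_matrix(envs_a, envs_b):
    """Eq. (8): C_ij = k(X_i^A, X_j^B) for unit-length fingerprints."""
    return [[dot(pa, pb) for pb in envs_b] for pa in envs_a]

def average_kernel(envs_a, envs_b):
    """First form of Eq. (9): mean of all entries of C."""
    C = covariance_matrix(envs_a, envs_b)
    return sum(map(sum, C)) / (len(envs_a) * len(envs_b))

def mean_fingerprint(envs):
    """Average fingerprint of a structure; it can be stored once per structure."""
    n = len(envs)
    return [sum(col) / n for col in zip(*envs)]

def normalized_average_kernel(envs_a, envs_b):
    """Second form of Eq. (9), divided by sqrt(K(A,A) K(B,B))."""
    fa, fb = mean_fingerprint(envs_a), mean_fingerprint(envs_b)
    return dot(fa, fb) / math.sqrt(dot(fa, fa) * dot(fb, fb))

# Hypothetical two-atom structures
A = [unit(p) for p in ([1.0, 0.2, 0.1], [0.3, 1.0, 0.2])]
B = [unit(p) for p in ([0.9, 0.3, 0.1], [0.2, 0.8, 0.5])]

print(average_kernel(A, B))
print(dot(mean_fingerprint(A), mean_fingerprint(B)))
print(normalized_average_kernel(A, B))
```

The first two printed values agree up to rounding, which is why only one averaged vector per structure needs to be kept.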
Best-match structural kernel. Another possibility,
that has been used previously with different kinds of
structural fingerprints [22, 33, 42, 43], is to identify the
best match between the environments of the two structures,
$$\hat{K}(A, B) = \frac{1}{N} \max_{\pi} \sum_i C_{i\pi_i}(A, B), \tag{10}$$
where $\pi$ is a permutation of the $N$ environment indices.
The maximization can be accomplished with an $O(N^3)$ effort
using the Munkres algorithm [44]. The corresponding distance
has the properties of a metric, which means it can still
be safely used to assess similarity between structures and
molecules. Unfortunately, this “best-match” kernel is not
guaranteed to be positive-definite, which makes it less
than ideal for use in machine-learning applications. Fur-
thermore, the distance obtained by a best-match strategy
is continuous, but has discontinuous derivatives whenever
the matching of environments changes. These problems
can be solved or alleviated by matching the environments
based on a different strategy, that combines features of
the average and the best-match kernels.
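For very small N the best-match kernel of Eq. (10) can be illustrated by an exhaustive search over permutations. This is not the Munkres algorithm cited in the text: the brute-force loop below scales factorially and is only meant to make the definition concrete. In practice a linear-assignment solver (for example SciPy's `linear_sum_assignment`) would take its place. The matrix entries are hypothetical.

```python
from itertools import permutations

def best_match_kernel(C):
    """Eq. (10) by exhaustive search over permutations (small N only)."""
    n = len(C)
    best = max(
        sum(C[i][pi[i]] for i in range(n))
        for pi in permutations(range(n))
    )
    return best / n

def average_kernel(C):
    """Mean of all entries of C, for comparison with Eq. (9)."""
    return sum(map(sum, C)) / (len(C) * len(C[0]))

# Hypothetical environment covariance matrix for two 3-atom structures
C = [[0.9, 0.2, 0.1],
     [0.3, 0.8, 0.2],
     [0.1, 0.4, 0.7]]

print(best_match_kernel(C))
print(average_kernel(C))
```

Here the identity pairing is optimal, so the best-match value is the mean of the diagonal entries, and it is never smaller than the plain average over all pairs.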
Regularized entropy match kernel. The best match
problem can be also stated in an alternative form, namely
$$\hat{K}(A, B) = \max_{\mathbf{P} \in \mathcal{U}(N,N)} \sum_{ij} C_{ij}(A, B)\, P_{ij}, \tag{11}$$
where $\mathcal{U}(N, N)$ is the set of $N \times N$
(scaled) doubly stochastic matrices, whose rows and columns
sum to $1/N$, i.e. $\sum_i P_{ij} = \sum_j P_{ij} = 1/N$.
We can then borrow an idea that was recently introduced in
the field of optimal transport [39] to regularize this
problem, adding a penalty that instead aims at maximizing
the information entropy for the matrix $\mathbf{P}$ subject
to the aforementioned constraints on its marginals. Such a
“regularized-entropy match” (REMatch) kernel is defined as
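$$\hat{K}^{\gamma}(A, B) = \mathrm{Tr}\, \mathbf{P}^{\gamma} \mathbf{C}(A, B), \qquad \mathbf{P}^{\gamma} = \underset{\mathbf{P} \in \mathcal{U}(N,N)}{\mathrm{argmin}} \sum_{ij} P_{ij} \left( 1 - C_{ij} + \gamma \ln P_{ij} \right), \tag{12}$$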
where the regularization is given by an entropy term
$E(\mathbf{P}) = -\sum_{ij} P_{ij} \ln P_{ij}$.
$\mathbf{P}^{\gamma}$ can be computed very efficiently, with
$O(N^2)$ effort, by the Sinkhorn algorithm [39]
(see Appendix C). For $\gamma \to 0$, the entropic penalty
becomes negligible, and
$\hat{K}^{\gamma}(A, B) \to \hat{K}(A, B)$. For
$\gamma \to \infty$, one selects the $\mathbf{P}$ with the
least information content, that is one with constant
$P_{ij} = 1/N^2$. Hence, in this limit
$\hat{K}^{\gamma}(A, B) \to \bar{K}(A, B)$.
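The two limits above can be explored with a small Sinkhorn-type sketch of Eq. (12). Several choices here are assumptions rather than the authors' implementation (Appendix C is not part of this section): the fixed iteration count, the function name `rematch_kernel`, the toy matrix, and the use of marginals 1/n and 1/m when the matrix is rectangular. The scaling form P_ij = u_i exp(-(1 - C_ij)/γ) v_j follows from setting the derivative of the regularized objective to zero under the marginal constraints.

```python
import math

def rematch_kernel(C, gamma, n_iter=500):
    """Sketch of Eq. (12) via Sinkhorn scaling; C is an n x m list of lists."""
    n, m = len(C), len(C[0])
    # Gibbs factor from the stationarity condition of the regularized problem.
    # Very small gamma makes these factors underflow, so keep gamma moderate.
    K = [[math.exp(-(1.0 - c) / gamma) for c in row] for row in C]
    v = [1.0 / m] * m
    for _ in range(n_iter):
        # Alternately rescale rows and columns towards marginals 1/n and 1/m.
        u = [(1.0 / n) / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [(1.0 / m) / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    P = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    # Kernel value: sum_ij P_ij C_ij
    return sum(P[i][j] * C[i][j] for i in range(n) for j in range(m))

# Hypothetical environment covariance matrix
C = [[0.9, 0.2, 0.1],
     [0.3, 0.8, 0.2],
     [0.1, 0.4, 0.7]]

for gamma in (0.02, 0.5, 1000.0):
    print(gamma, rematch_kernel(C, gamma))
```

For this matrix the small-γ value approaches the best-match result (the mean of the diagonal), while the large-γ value approaches the average over all entries. A production version would replace the fixed iteration count with a convergence check on the marginals.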
Permutation structural kernel. For the sake of
completeness, we also discuss a fourth option: rather than
summing over all possible pairs of environments, one can
consider each pairing of environments separately, and
sum over all the $N!$ possible permutations that define
the pairings. In order to suppress more rapidly the
combinations of environments that contain bad matches, one
can multiply the kernels that appear in each pairing, and
define a permutation kernel
$$\breve{K}(A, B) = \frac{1}{N!} \sum_{\pi} \prod_i C_{i\pi_i}(A, B) = \mathrm{perm}\, \mathbf{C}(A, B). \tag{13}$$
This choice corresponds to the evaluation of the per-
manent of the environment kernel matrix, and has
some appeal as it is guaranteed to yield a positive-
definite kernel [45]. The evaluation of the perma-
nent of a matrix, however, has combinatorial computa-
tional complexity[46]. Its application is limited to small
molecules, and we will not discuss it further in the present
work.
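For reference only, the sum over permutations in Eq. (13) can be written out directly for tiny matrices. The sketch below is a naive enumeration with factorial cost, includes the 1/N! prefactor of Eq. (13), and uses a hypothetical 2 × 2 matrix; the function name is illustrative.

```python
from itertools import permutations
from math import factorial, prod

def permutation_kernel(C):
    """Eq. (13): sum over permutations of products of matched entries, over N!."""
    n = len(C)
    total = sum(
        prod(C[i][pi[i]] for i in range(n))
        for pi in permutations(range(n))
    )
    return total / factorial(n)

# Hypothetical 2 x 2 environment covariance matrix
C = [[1.0, 0.5],
     [0.5, 1.0]]

print(permutation_kernel(C))
```

With two environments there are only two pairings, the identity (product 1.0) and the swap (product 0.25), which makes the factorial growth of the sum easy to see.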

C. Matching structures containing multiple species
When comparing structures that contain different
atomic species, the first problem that has to be addressed
is that of extending the local environment metric so that
the presence of multiple elements is properly accounted
for.
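SOAP descriptors provide a straightforward way to do
this: a separate density can be built for each atomic
species
$$\rho^{\alpha}_{\mathcal{X}}(\mathbf{r}) = \sum_{i\in\mathcal{X}^{\alpha}} \exp\left(-\frac{(\mathbf{x}_i - \mathbf{r})^2}{2\sigma^2}\right), \tag{14}$$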
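and a (non-normalized) kernel can be defined by matching
separately the different species:
$$\tilde{k}(\mathcal{X}, \mathcal{X}') = \int d\hat{R} \left| \int \sum_{\alpha} \rho^{\alpha}_{\mathcal{X}}(\mathbf{r})\, \rho^{\alpha}_{\mathcal{X}'}(\hat{R}\mathbf{r})\, d\mathbf{r} \right|^2 = \sum_{\alpha\beta} \mathbf{p}_{\alpha\beta}(\mathcal{X}) \cdot \mathbf{p}_{\alpha\beta}(\mathcal{X}'). \tag{15}$$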
Here we have introduced “partial” power spectra
$\mathbf{p}_{\alpha\beta}$ that encode information on the
relative arrangement of pairs of species, and can be
written as
$$p(\mathcal{X})^{\alpha\beta}_{b_1 b_2 l} = \pi \sqrt{\frac{8}{2l+1}} \sum_m (c^{\alpha}_{b_1 lm})^{\dagger} c^{\beta}_{b_2 lm}, \tag{16}$$
where we built the angular-channel-dependent weights
into the elements of the power spectrum. The expansion
coefficients describe the atomic density of species α,
$$\rho^{\alpha}_{\mathcal{X}}(\mathbf{r}) = \sum_{blm} c^{\alpha}_{blm}\, g_b(|\mathbf{r}|)\, Y_{lm}(\hat{\mathbf{r}}), \tag{17}$$
in terms of a basis set, which is a combination of spherical
harmonics and orthogonal radial functions. The kernel
in Eq. (15) can then be normalized as in Eq. (3).
Note that, even though the overlap between the
environments of the different species is considered to be
zero, the kernel is sensitive to the relative correlations
of different species. This is because, due to the squaring
of the density overlap within the rotational average, the
SO(3) power spectrum vectors contain mixed-species
components. One could also introduce a notion of “alchemical
similarity” between different species. For instance, when
comparing structures of III-V semiconductors one could
disregard the chemical information on the identity of an
atom as long as it belongs to the same column of the
periodic table. Such a notion can be readily implemented,
defining an alchemical similarity kernel
$\kappa_{\alpha\beta}$ which is one for pairs that should be
considered interchangeable, and tends to zero for pairs that
one wants to consider as completely unrelated. The
expression then becomes
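$$\tilde{k}(\mathcal{X}, \mathcal{X}') = \int d\hat{R} \left| \int \sum_{\alpha\alpha'} \kappa_{\alpha\alpha'}\, \rho^{\alpha}_{\mathcal{X}}(\mathbf{r})\, \rho^{\alpha'}_{\mathcal{X}'}(\hat{R}\mathbf{r})\, d\mathbf{r} \right|^2 = \sum_{\alpha\beta\alpha'\beta'} \mathbf{p}_{\alpha\beta}(\mathcal{X}) \cdot \mathbf{p}_{\alpha'\beta'}(\mathcal{X}')\, \kappa_{\alpha\alpha'}\, \kappa_{\beta\beta'}. \tag{18}$$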
The original expression (15) can be recovered by setting
$\kappa_{\alpha\beta} = \delta_{\alpha\beta}$. Global
similarity kernels can then be transparently introduced to
compare structures composed of different atomic species,
with geometry and alchemical composition treated on the
same footing and the possibility of adapting the definition
of similarity to the system and application.
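The effect of the alchemical kernel in Eq. (18) can be sketched with toy data. The example assumes that the partial power spectra of an environment are stored in a dictionary keyed by species pairs (α, β); the vectors, the `GROUP` lookup table and all function names are hypothetical. With κ = δ a GaAs-like and an InP-like environment share no species and the kernel vanishes, whereas a κ that treats elements of the same column as interchangeable makes two geometrically identical environments fully similar.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def alchemical_kernel(p1, p2, kappa):
    """Unnormalized Eq. (18); p1, p2 map (alpha, beta) -> partial power spectrum."""
    return sum(
        kappa(a, a2) * kappa(b, b2) * dot(v1, v2)
        for (a, b), v1 in p1.items()
        for (a2, b2), v2 in p2.items()
    )

def normalized_kernel(p1, p2, kappa):
    """Normalization as in Eq. (3)."""
    return alchemical_kernel(p1, p2, kappa) / math.sqrt(
        alchemical_kernel(p1, p1, kappa) * alchemical_kernel(p2, p2, kappa)
    )

def delta(a, b):
    """kappa = Kronecker delta, which recovers Eq. (15)."""
    return 1.0 if a == b else 0.0

GROUP = {"Ga": 13, "In": 13, "As": 15, "P": 15}  # hypothetical lookup table

def same_group(a, b):
    """kappa = 1 for elements in the same column of the periodic table."""
    return 1.0 if GROUP[a] == GROUP[b] else 0.0

# Hypothetical partial power spectra with identical geometry, different species
gaas = {("Ga", "Ga"): [1.0, 0.2], ("Ga", "As"): [0.5, 0.1],
        ("As", "Ga"): [0.5, 0.1], ("As", "As"): [0.8, 0.3]}
inp = {("In", "In"): [1.0, 0.2], ("In", "P"): [0.5, 0.1],
       ("P", "In"): [0.5, 0.1], ("P", "P"): [0.8, 0.3]}

print(normalized_kernel(gaas, inp, delta))
print(normalized_kernel(gaas, inp, same_group))
```

A smoother choice of κ, with values between zero and one, would interpolate between these two extremes.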
D. Matching structures with different numbers of
atoms
The definitions above can be readily extended to compare
structures containing different numbers of atoms $N_A$
and $N_B$. We discuss two possible strategies. When
comparing crystalline, periodic structures, it may be the
case that one of the structures corresponds to a slight
distortion of the other, that needs a larger unit cell for
a proper representation. Comparing the structures using the
average kernel (9) automatically does the “right thing”,
that is performing the comparison in a way that is
independent of the number of times the two structures have
to be replicated to match atom counts. In the case of the
permutation kernel and of the best-match kernel, the most
effective way to perform the comparison is to evaluate the
least common multiple $N$ of $N_A$ and $N_B$, and replicate
the environment similarity matrix to form a square ma-
trix. One can then proceed to compute the permanent, or
the linear assignment problem, based on such replicated
matrix. The advantage of this procedure is that one does
not need to explicitly find the relation between the shape
of the two unit cells and replicate them to perform the
comparison: the environment similarities can be evalu-
ated including periodic replicas, and the minimum num-
ber of comparisons will be naturally performed among
any pairs of structures. However, the least common mul-
tiple can become very large, making even the best-match
kernel (10) impractically demanding, although the cost
can be reduced by exploiting the redundancy in the ex-
tended environment covariance matrix. As shown in the
Appendix, the REMatch kernel (12) can be computed
easily also for a rectangular matrix, which constitutes
an additional advantage of formulating the environment
matching problem in terms of a regularized transport op-
timization.
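The replication step described above can be sketched as a simple tiling of the rectangular environment matrix. The modulo indexing below is one possible way to lay out the replicas (any ordering of the copies is equivalent up to a permutation of rows and columns), and the matrix values and function names are hypothetical.

```python
from math import gcd

def replicate_to_square(C):
    """Tile an n_a x n_b matrix up to N x N, with N = lcm(n_a, n_b)."""
    n_a, n_b = len(C), len(C[0])
    n = n_a * n_b // gcd(n_a, n_b)
    return [[C[i % n_a][j % n_b] for j in range(n)] for i in range(n)]

def mean_entry(M):
    """Average over all entries, i.e. the average kernel of Eq. (9)."""
    return sum(map(sum, M)) / (len(M) * len(M[0]))

# Hypothetical covariance matrix between a 2-atom and a 3-atom cell
C = [[0.9, 0.2, 0.4],
     [0.1, 0.8, 0.3]]

S = replicate_to_square(C)
print(len(S), len(S[0]))
print(mean_entry(C), mean_entry(S))
```

The mean entry is unchanged by the tiling, consistent with the statement that the average kernel is independent of how many times the cells are replicated. The square matrix `S` is what a permanent or linear-assignment routine would then operate on.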
When comparing molecules or molecular fragments, it
may be advisable to proceed differently since in that
case the chemical composition might differ, and it may
not make sense to compare molecules as if they were part
of an infinite periodic assembly. A possible strategy is
