
Showing papers in "Journal of Chemical Physics in 2018"


Journal ArticleDOI
TL;DR: SchNet is a deep learning architecture designed to model atomistic systems by making use of continuous-filter convolutional layers; the model learns chemically plausible embeddings of atom types across the periodic table.
Abstract: Deep learning has led to a paradigm shift in artificial intelligence, including web, text, and image search, speech recognition, as well as bioinformatics, with growing impact in chemical physics. Machine learning, in general, and deep learning, in particular, are ideally suitable for representing quantum-mechanical interactions, enabling us to model nonlinear potential-energy surfaces or enhancing the exploration of chemical compound space. Here we present the deep learning architecture SchNet that is specifically designed to model atomistic systems by making use of continuous-filter convolutional layers. We demonstrate the capabilities of SchNet by accurately predicting a range of properties across chemical space for molecules and materials, where our model learns chemically plausible embeddings of atom types across the periodic table. Finally, we employ SchNet to predict potential-energy surfaces and energy-conserving force fields for molecular dynamics simulations of small molecules and perform an exemplary study on the quantum-mechanical properties of C20-fullerene that would have been infeasible with regular ab initio molecular dynamics.
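The continuous-filter convolution at the heart of SchNet can be sketched in a few lines of NumPy: each atom aggregates the feature vectors of the other atoms, modulated element-wise by a filter generated from the interatomic distance. Everything below (the sizes, the Gaussian distance expansion, a single linear layer as the filter generator, the inclusion of the self term) is an illustrative toy, not SchNet's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_expand(d, centers, gamma=10.0):
    # expand interatomic distances on a grid of Gaussians
    return np.exp(-gamma * (d[..., None] - centers) ** 2)

n_atoms, n_feat, n_rbf = 5, 8, 16
centers = np.linspace(0.0, 5.0, n_rbf)
# toy filter-generating network: a single linear layer (SchNet uses an MLP)
W_filter = 0.1 * rng.normal(size=(n_rbf, n_feat))

pos = rng.normal(size=(n_atoms, 3))          # atomic positions
x = rng.normal(size=(n_atoms, n_feat))       # atom-type embeddings

d = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
filters = rbf_expand(d, centers) @ W_filter  # shape (n_atoms, n_atoms, n_feat)

# continuous-filter convolution: atom i aggregates features of all atoms j
# (including itself in this toy), modulated by a distance-dependent filter
x_new = np.einsum('jf,ijf->if', x, filters)
print(x_new.shape)
```

Because the filters depend only on interatomic distances, the updated features are invariant under rigid rotations and translations of the molecule.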

1,104 citations


Journal ArticleDOI
TL;DR: Active learning via query by committee (AL-QBC) uses the disagreement within an ensemble of ML potentials to infer the reliability of the ensemble's prediction; it improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials by mitigating human biases in deciding what new training data to use.
Abstract: The development of accurate and transferable machine learning (ML) potentials for predicting molecular energetics is a challenging task. The process of data generation to train such ML potentials is a task neither well understood nor researched in detail. In this work, we present a fully automated approach for the generation of datasets with the intent of training universal ML potentials. It is based on the concept of active learning (AL) via Query by Committee (QBC), which uses the disagreement between an ensemble of ML potentials to infer the reliability of the ensemble's prediction. QBC allows the presented AL algorithm to automatically sample regions of chemical space where the ML potential fails to accurately predict the potential energy. AL improves the overall fitness of ANAKIN-ME (ANI) deep learning potentials in rigorous test cases by mitigating human biases in deciding what new training data to use. AL also reduces the training set size to a fraction of the data required when using naive random sampling techniques. To provide validation of our AL approach, we develop the COmprehensive Machine-learning Potential (COMP6) benchmark (publicly available on GitHub) which contains a diverse set of organic molecules. Active learning-based ANI potentials outperform the original random sampled ANI-1 potential with only 10% of the data, while the final active learning-based model vastly outperforms ANI-1 on the COMP6 benchmark after training to only 25% of the data. Finally, we show that our proposed AL technique develops a universal ANI potential (ANI-1x) that provides accurate energy and force predictions on the entire COMP6 benchmark. This universal ML potential achieves a level of accuracy on par with the best ML potentials for single molecules or materials, while remaining applicable to the general class of organic molecules composed of the elements CHNO.
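The query-by-committee step can be illustrated with a toy committee of bootstrap-resampled polynomial fits standing in for the ensemble of ANI networks; the candidate configurations where the committee disagrees most are the ones that would be sent for new reference calculations (all models and data here are hypothetical stand-ins, not the paper's actual procedure):

```python
import numpy as np

rng = np.random.default_rng(1)

def train_committee(X, y, n_models=5, degree=4):
    # bootstrap-resampled polynomial fits stand in for an ensemble of ML potentials
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=len(X), replace=True)
        models.append(np.polyfit(X[idx], y[idx], degree))
    return models

def disagreement(models, X):
    # committee standard deviation: large where the ensemble is unreliable
    preds = np.stack([np.polyval(m, X) for m in models])
    return preds.std(axis=0)

X_train = rng.uniform(-1.0, 1.0, 30)         # sampled configurations
y_train = np.sin(3.0 * X_train)              # toy "potential energy"
committee = train_committee(X_train, y_train)

X_pool = np.linspace(-2.0, 2.0, 401)         # candidate configurations
sigma = disagreement(committee, X_pool)
queries = X_pool[np.argsort(sigma)[-5:]]     # most contested points -> new QM data
print(np.sort(np.abs(queries)).round(2))
```

On this toy problem the most contested candidates lie outside the sampled interval [-1, 1], which is exactly the region where new training data would be most informative.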

362 citations


Journal ArticleDOI
TL;DR: An improved perturbative triples correction (T) algorithm for domain based local pair-natural orbital singles and doubles coupled cluster (DLPNO-CCSD) theory is reported; by using triples natural orbitals to represent the virtual spaces for triples amplitudes, storage bottlenecks are avoided.
Abstract: In this communication, an improved perturbative triples correction (T) algorithm for domain based local pair-natural orbital singles and doubles coupled cluster (DLPNO-CCSD) theory is reported. In our previous implementation, the semi-canonical approximation was used and linear scaling was achieved for both the DLPNO-CCSD and (T) parts of the calculation. In this work, we refer to this previous method as DLPNO-CCSD(T0) to emphasize the semi-canonical approximation. It is well-established that the DLPNO-CCSD method can predict very accurate absolute and relative energies with respect to the parent canonical CCSD method. However, the (T0) approximation may introduce significant errors in absolute energies as the triples correction grows in magnitude. In the majority of cases, the relative energies from (T0) are as accurate as the canonical (T) results themselves. Unfortunately, in rare cases and in particular for small gap systems, the (T0) approximation breaks down and relative energies show large deviations from the parent canonical CCSD(T) results. To address this problem, an iterative (T) algorithm based on the previous DLPNO-CCSD(T0) algorithm has been implemented [abbreviated here as DLPNO-CCSD(T)]. Using triples natural orbitals to represent the virtual spaces for triples amplitudes, storage bottlenecks are avoided. Various carefully designed approximations ease the computational burden such that overall, the increase in the DLPNO-(T) calculation time over DLPNO-(T0) only amounts to a factor of about two (depending on the basis set). Benchmark calculations for the GMTKN30 database show that compared to DLPNO-CCSD(T0), the errors in absolute energies are greatly reduced and relative energies are moderately improved. The particularly problematic case of cumulene chains of increasing lengths is also successfully addressed by DLPNO-CCSD(T).

344 citations


Journal ArticleDOI
TL;DR: A revised version of the well-established B97-D density functional approximation with general applicability for chemical properties of large systems is proposed, based on Becke's power-series ansatz from 1997 and explicitly parametrized by including the standard D3 semi-classical dispersion correction.
Abstract: A revised version of the well-established B97-D density functional approximation with general applicability for chemical properties of large systems is proposed. Like B97-D, it is based on Becke’s power-series ansatz from 1997 and is explicitly parametrized by including the standard D3 semi-classical dispersion correction. The orbitals are expanded in a modified valence triple-zeta Gaussian basis set, which is available for all elements up to Rn. Remaining basis set errors are mostly absorbed in the modified B97 parametrization, while an established atom-pairwise short-range potential is applied to correct for the systematically too long bonds of main group elements which are typical for most semi-local density functionals. The new composite scheme (termed B97-3c) completes the hierarchy of “low-cost” electronic structure methods, which are all mainly free of basis set superposition error and account for most interactions in a physically sound and asymptotically correct manner. B97-3c yields excellent mol...

342 citations


Journal ArticleDOI
TL;DR: A representation of any atom in any chemical environment for the automatized generation of universal kernel ridge regression-based quantum machine learning (QML) models of electronic properties, trained throughout chemical compound space is introduced.
Abstract: We introduce a representation of any atom in any chemical environment for the automatized generation of universal kernel ridge regression-based quantum machine learning (QML) models of electronic properties, trained throughout chemical compound space. The representation is based on Gaussian distribution functions, scaled by power laws and explicitly accounting for structural as well as elemental degrees of freedom. The elemental components help us to lower the QML model’s learning curve, and, through interpolation across the periodic table, even enable “alchemical extrapolation” to covalent bonding between elements not part of training. This point is demonstrated for the prediction of covalent binding in single, double, and triple bonds among main-group elements as well as for atomization energies in organic molecules. We present numerical evidence that resulting QML energy models, after training on a few thousand random training instances, reach chemical accuracy for out-of-sample compounds. Compound dat...
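Kernel ridge regression itself is compact enough to sketch: a property is modeled as a kernel-weighted sum over training representations, with the weights obtained from a single regularized linear solve. The one-dimensional "representation" and target below are placeholders for the paper's atomic representation and electronic properties:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=0.5):
    # Gaussian kernel on pairwise squared distances between representation vectors
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

# 1-d stand-in for atomic-environment representations and a target property
X = np.linspace(0.0, 2.0 * np.pi, 50)[:, None]
y = np.sin(X[:, 0])

lam = 1e-6                                   # ridge regularization strength
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # training = one linear solve

X_new = np.array([[1.0], [4.0]])             # out-of-sample "compounds"
y_pred = gaussian_kernel(X_new, X) @ alpha   # KRR prediction
print(y_pred)
```

The regularization strength and kernel width are the two hyperparameters that, in the QML setting, control the slope of the learning curve.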

297 citations


Journal ArticleDOI
TL;DR: It is shown that the time-lagged autoencoder reliably finds low-dimensional embeddings for high-dimensional feature spaces which capture the slow dynamics of the underlying stochastic processes, beyond the capabilities of linear dimension reduction techniques.
Abstract: Inspired by the success of deep learning techniques in the physical and chemical sciences, we apply a modification of an autoencoder type deep neural network to the task of dimension reduction of molecular dynamics data. We can show that our time-lagged autoencoder reliably finds low-dimensional embeddings for high-dimensional feature spaces which capture the slow dynamics of the underlying stochastic processes—beyond the capabilities of linear dimension reduction techniques.
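A linear caricature of the time-lagged objective makes the mechanism visible: instead of reconstructing x_t, the network predicts x_{t+τ} from a low-dimensional encoding of x_t, so the latent coordinate must capture what is predictable over the lag time, i.e., the slow dynamics. Below, a rank-1 linear encoder/decoder is fitted by alternating least squares on a synthetic trajectory (the paper uses deep nonlinear networks; all parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)

# synthetic 2-d trajectory: a slow mode hidden under a high-variance fast mode
n, tau = 5000, 10
slow = np.zeros(n)
for t in range(1, n):                      # slowly decorrelating process
    slow[t] = 0.995 * slow[t - 1] + 0.05 * rng.normal()
fast = rng.normal(size=n)                  # fast mode: larger variance, no memory
traj = np.stack([slow + fast, slow - fast], axis=1)

X, Y = traj[:-tau], traj[tau:]             # training pairs (x_t, x_{t+tau})

# rank-1 linear "time-lagged autoencoder": minimize ||x_{t+tau} - d (e . x_t)||^2
# by alternating least squares over encoder e and decoder d
e = rng.normal(size=2)
for _ in range(50):
    z = X @ e                              # 1-d latent coordinate
    d = Y.T @ z / (z @ z)                  # optimal decoder for fixed encoder
    e = np.linalg.lstsq(X, Y @ d / (d @ d), rcond=None)[0]
e /= np.linalg.norm(e)
print(e)
```

The fast mode has the larger variance here, so a variance-maximizing projection (PCA) would pick the direction (1, -1); the time-lagged objective instead recovers the slow direction (1, 1)/√2, up to sign.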

295 citations


Journal ArticleDOI
TL;DR: HIP-NN achieves state-of-the-art performance on a dataset of 131k ground state organic molecules and predicts energies with 0.26 kcal/mol mean absolute error.
Abstract: We introduce the Hierarchically Interacting Particle Neural Network (HIP-NN) to model molecular properties from datasets of quantum calculations. Inspired by a many-body expansion, HIP-NN decomposes properties, such as energy, as a sum over hierarchical terms. These terms are generated from a neural network—a composition of many nonlinear transformations—acting on a representation of the molecule. HIP-NN achieves the state-of-the-art performance on a dataset of 131k ground state organic molecules and predicts energies with 0.26 kcal/mol mean absolute error. With minimal tuning, our model is also competitive on a dataset of molecular dynamics trajectories. In addition to enabling accurate energy predictions, the hierarchical structure of HIP-NN helps to identify regions of model uncertainty.

258 citations


Journal ArticleDOI
TL;DR: Automatic protocols to select a number of fingerprints out of a large pool of candidates, based on the correlations that are intrinsic to the training data, can greatly simplify the construction of neural network potentials that strike the best balance between accuracy and computational efficiency.
Abstract: Machine learning of atomic-scale properties is revolutionizing molecular modeling, making it possible to evaluate inter-atomic potentials with first-principles accuracy, at a fraction of the costs. The accuracy, speed, and reliability of machine learning potentials, however, depend strongly on the way atomic configurations are represented, i.e., the choice of descriptors used as input for the machine learning method. The raw Cartesian coordinates are typically transformed in "fingerprints," or "symmetry functions," that are designed to encode, in addition to the structure, important properties of the potential energy surface like its invariances with respect to rotation, translation, and permutation of like atoms. Here we discuss automatic protocols to select a number of fingerprints out of a large pool of candidates, based on the correlations that are intrinsic to the training data. This procedure can greatly simplify the construction of neural network potentials that strike the best balance between accuracy and computational efficiency and has the potential to accelerate by orders of magnitude the evaluation of Gaussian approximation potentials based on the smooth overlap of atomic positions kernel. We present applications to the construction of neural network potentials for water and for an Al-Mg-Si alloy and to the prediction of the formation energies of small organic molecules using Gaussian process regression.
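A minimal stand-in for such a selection protocol: given a pool of candidate fingerprints evaluated on the training structures, greedily keep the fingerprints that are least correlated with those already selected, so near-duplicates are skipped. The paper's CUR- and farthest-point-sampling-based protocols are more sophisticated; the fingerprint matrix below is synthetic:

```python
import numpy as np

rng = np.random.default_rng(4)

# toy fingerprint matrix: 6 candidate descriptors evaluated on 300 structures,
# with two pairs of nearly redundant columns
base = rng.normal(size=(300, 2))
F = np.column_stack([
    base[:, 0],
    base[:, 0] + 0.01 * rng.normal(size=300),   # near-duplicate of column 0
    base[:, 1],
    base[:, 1] + 0.01 * rng.normal(size=300),   # near-duplicate of column 2
    base.sum(axis=1),
    rng.normal(size=300),                       # genuinely independent information
])

def select_fingerprints(F, n_keep):
    # greedy selection: start from column 0, then repeatedly add the candidate
    # least correlated with everything already chosen
    C = np.abs(np.corrcoef(F, rowvar=False))
    chosen = [0]
    while len(chosen) < n_keep:
        similarity = C[:, chosen].max(axis=1)
        similarity[chosen] = np.inf             # never re-pick a chosen column
        chosen.append(int(np.argmin(similarity)))
    return chosen

print(select_fingerprints(F, 3))
```

The selection never picks both members of a redundant pair, which is the behavior that lets a small fingerprint set retain the spatial resolution of the full pool.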

248 citations


Journal ArticleDOI
TL;DR: The usefulness and reliability of RAVE are demonstrated by applying it to model potentials of increasing complexity, including computation of the binding free energy profile for a hydrophobic ligand-substrate system in explicit water with a dissociation time of more than 3 min, in at least twenty times less computer time than needed for umbrella sampling or metadynamics.
Abstract: Here we propose the reweighted autoencoded variational Bayes for enhanced sampling (RAVE) method, a new iterative scheme that uses the deep learning framework of variational autoencoders to enhance sampling in molecular simulations. RAVE involves iterations between molecular simulations and deep learning in order to produce an increasingly accurate probability distribution along a low-dimensional latent space that captures the key features of the molecular simulation trajectory. Using the Kullback-Leibler divergence between this latent space distribution and the distribution of various trial reaction coordinates sampled from the molecular simulation, RAVE determines an optimum, yet nonetheless physically interpretable, reaction coordinate and optimum probability distribution. Both then directly serve as the biasing protocol for a new biased simulation, which is once again fed into the deep learning module with appropriate weights accounting for the bias, the procedure continuing until estimates of desirable thermodynamic observables are converged. Unlike recent methods using deep learning for enhanced sampling purposes, RAVE stands out in that (a) it naturally produces a physically interpretable reaction coordinate, (b) it is independent of existing enhanced sampling protocols to enhance the fluctuations along the latent space identified via deep learning, and (c) it provides the ability to easily filter out spurious solutions learned by the deep learning procedure. The usefulness and reliability of RAVE are demonstrated by applying it to model potentials of increasing complexity, including computation of the binding free energy profile for a hydrophobic ligand-substrate system in explicit water with dissociation time of more than 3 min, in computer time at least twenty times less than that needed for umbrella sampling or metadynamics.
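The screening of trial reaction coordinates can be sketched directly: histogram the latent coordinate (simulated here in place of a trained variational autoencoder) and each candidate coordinate, then rank the candidates by Kullback-Leibler divergence. All coordinates and weights below are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(5)

def kl_divergence(p, q, eps=1e-12):
    # discrete KL divergence between two (unnormalized) histograms
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# toy 2-d trajectory; pretend the VAE latent coordinate is 0.8*x + 0.2*y
traj = rng.normal(size=(20000, 2)) * [1.0, 0.3]
latent = traj @ np.array([0.8, 0.2])

# candidate physically interpretable reaction coordinates
trials = {"x": [1.0, 0.0], "y": [0.0, 1.0], "0.8x+0.2y": [0.8, 0.2]}

bins = np.linspace(-4.0, 4.0, 81)
p_latent, _ = np.histogram(latent, bins=bins)
kls = {}
for name, w in trials.items():
    p_trial, _ = np.histogram(traj @ np.array(w), bins=bins)
    kls[name] = kl_divergence(p_latent, p_trial)
print(min(kls, key=kls.get))   # -> 0.8x+0.2y
```

The trial coordinate whose distribution best matches the latent one minimizes the divergence and would serve as the biasing coordinate for the next RAVE round.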

225 citations


Journal ArticleDOI
TL;DR: For the wACSFs employed here, a simple empirical parametrization scheme is found to be sufficient to obtain HDNNPs with high accuracy, although the intrinsic parameters of the descriptors can in principle be optimized with a genetic algorithm in a highly automated manner.
Abstract: We introduce weighted atom-centered symmetry functions (wACSFs) as descriptors of a chemical system’s geometry for use in the prediction of chemical properties such as enthalpies or potential energies via machine learning. The wACSFs are based on conventional atom-centered symmetry functions (ACSFs) but overcome the undesirable scaling of the latter with an increasing number of different elements in a chemical system. The performance of these two descriptors is compared using them as inputs in high-dimensional neural network potentials (HDNNPs), employing the molecular structures and associated enthalpies of the 133 855 molecules containing up to five different elements reported in the QM9 database as reference data. A substantially smaller number of wACSFs than ACSFs is needed to obtain a comparable spatial resolution of the molecular structures. At the same time, this smaller set of wACSFs leads to a significantly better generalization performance in the machine learning potential than the large set of conventional ACSFs. Furthermore, we show that the intrinsic parameters of the descriptors can in principle be optimized with a genetic algorithm in a highly automated manner. For the wACSFs employed here, we find however that using a simple empirical parametrization scheme is sufficient in order to obtain HDNNPs with high accuracy.
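A radial wACSF can be written down in a few lines: a standard Gaussian radial symmetry function in which each neighbor's contribution is multiplied by an element-dependent weight, so one parameter set covers all elements instead of one set per element pair. The weight choice w(Z) = Z, the cutoff, and the geometry below are illustrative assumptions:

```python
import numpy as np

def cutoff(r, rc=6.0):
    # Behler-style cosine cutoff function
    return np.where(r < rc, 0.5 * (np.cos(np.pi * r / rc) + 1.0), 0.0)

def radial_wacsf(pos, Z, eta, mu):
    # one weighted radial symmetry function per atom: element identity enters
    # only through the neighbor weight w(Z) = Z, so a single (eta, mu) set
    # serves all elements at once
    d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
    np.fill_diagonal(d, 1e6)                 # exclude self-interaction
    g = Z[None, :] * np.exp(-eta * (d - mu) ** 2) * cutoff(d)
    return g.sum(axis=1)

# toy water geometry (angstrom): O first, then the two H atoms
pos = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])
Z = np.array([8, 1, 1])
print(radial_wacsf(pos, Z, eta=4.0, mu=1.0))
```

With conventional per-element-pair ACSFs the descriptor count grows with the number of element combinations; the weighting collapses this to a single descriptor per (η, μ) choice, which is the scaling advantage the paper exploits.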

196 citations


Journal ArticleDOI
TL;DR: Density functional theory is used to study the ensemble, ligand, and strain effects of close-packed surfaces alloyed by transition metals with a combination of strong and weak adsorption of H and O; the tunability of adsorbate binding on random alloys is found to be predominantly described by the ensemble effect.
Abstract: Alloying elements with strong and weak adsorption properties can produce a catalyst with optimally tuned adsorbate binding. A full understanding of this alloying effect, however, is not well-established. Here, we use density functional theory to study the ensemble, ligand, and strain effects of close-packed surfaces alloyed by transition metals with a combination of strong and weak adsorption of H and O. Specifically, we consider PdAu, RhAu, and PtAu bimetallics as ordered and randomly alloyed (111) surfaces, as well as randomly alloyed 140-atom clusters. In these alloys, Au is the weak-binding component and Pd, Rh, and Pt are characteristic strong-binding metals. In order to separate the different effects of alloying on binding, we calculate the tunability of H- and O-binding energies as a function of lattice constant (strain effect), number of alloy-substituted sublayers (ligand effect), and randomly alloyed geometries (ensemble effect). We find that on these alloyed surfaces, the ensemble effect more significantly tunes the adsorbate binding as compared to the ligand and strain effects, with the binding energies predominantly determined by the local adsorption environment provided by the specific triatomic ensemble on the (111) surface. However, we also find that tuning of adsorbate binding from the ligand and strain effects cannot be neglected in a quantitative description. Extending our studies to other bimetallics (PdAg, RhAg, PtAg, PdCu, RhCu, and PtCu), we find similar conclusions that the tunability of adsorbate binding on random alloys is predominately described by the ensemble effect.

Journal ArticleDOI
TL;DR: Recent progress in the theory and simulation of quantum transport in molecular junctions is discussed and challenges are identified, which appear crucial to achieve a comprehensive and quantitative understanding of transport in these systems.
Abstract: Molecular junctions, where single molecules are bound to metal or semiconductor electrodes, represent a unique architecture to investigate molecules in a distinct nonequilibrium situation and, in a broader context, to study basic mechanisms of charge and energy transport in a many-body quantum system at the nanoscale. Experimental studies of molecular junctions have revealed a wealth of interesting transport phenomena, the understanding of which necessitates theoretical modeling. The accurate theoretical description of quantum transport in molecular junctions is challenging because it requires methods that are capable to describe the electronic structure and dynamics of molecules in a condensed phase environment out of equilibrium, in some cases with strong electron-electron and/or electronic-vibrational interaction. This perspective discusses recent progress in the theory and simulation of quantum transport in molecular junctions. Furthermore, challenges are identified, which appear crucial to achieve a comprehensive and quantitative understanding of transport in these systems.

Journal ArticleDOI
TL;DR: This article gives an overview of excess-entropy scaling, the 1977 discovery by Rosenfeld that entropy determines properties of liquids like viscosity, diffusion constant, and heat conductivity, and gives examples from computer simulations confirming this intriguing connection between dynamics and thermodynamics.
Abstract: This article gives an overview of excess-entropy scaling, the 1977 discovery by Rosenfeld that entropy determines properties of liquids like viscosity, diffusion constant, and heat conductivity. We give examples from computer simulations confirming this intriguing connection between dynamics and thermodynamics, counterexamples, and experimental validations. Recent uses in application-related contexts are reviewed, and theories proposed for the origin of excess-entropy scaling are briefly summarized. It is shown that if two thermodynamic state points of a liquid have the same microscopic dynamics, they must have the same excess entropy. In this case, the potential-energy function exhibits a symmetry termed hidden scale invariance, stating that the ordering of the potential energies of configurations is maintained if these are scaled uniformly to a different density. This property leads to the isomorph theory, which provides a general framework for excess-entropy scaling and illuminates, in particular, why this does not apply rigorously and universally. It remains an open question whether all aspects of excess-entropy scaling and related regularities reflect hidden scale invariance in one form or other.

Journal ArticleDOI
TL;DR: In this article, a combination of physics-based potentials with machine learning (ML), coined IPML, is proposed; it handles new molecules and conformations without explicit prior parametrization and is transferable across small neutral organic and biologically relevant molecules.
Abstract: Classical intermolecular potentials typically require an extensive parametrization procedure for any new compound considered. To do away with prior parametrization, we propose a combination of physics-based potentials with machine learning (ML), coined IPML, which is transferable across small neutral organic and biologically relevant molecules. ML models provide on-the-fly predictions for environment-dependent local atomic properties: electrostatic multipole coefficients (significant error reduction compared to previously reported), the population and decay rate of valence atomic densities, and polarizabilities across conformations and chemical compositions of H, C, N, and O atoms. These parameters enable accurate calculations of intermolecular contributions—electrostatics, charge penetration, repulsion, induction/polarization, and many-body dispersion. Unlike other potentials, this model is transferable in its ability to handle new molecules and conformations without explicit prior parametrization: All l...

Journal ArticleDOI
TL;DR: This paper re-fits an accurate PES of formaldehyde and compares PES errors on the entire point set used to solve the vibrational Schrödinger equation, i.e., the only error that matters in quantum dynamics calculations.
Abstract: For molecules with more than three atoms, it is difficult to fit or interpolate a potential energy surface (PES) from a small number of (usually ab initio) energies at points. Many methods have been proposed in recent decades, each claiming a set of advantages. Unfortunately, there are few comparative studies. In this paper, we compare neural networks (NNs) with Gaussian process (GP) regression. We re-fit an accurate PES of formaldehyde and compare PES errors on the entire point set used to solve the vibrational Schrodinger equation, i.e., the only error that matters in quantum dynamics calculations. We also compare the vibrational spectra computed on the underlying reference PES and the NN and GP potential surfaces. The NN and GP surfaces are constructed with exactly the same points, and the corresponding spectra are computed with the same points and the same basis. The GP fitting error is lower, and the GP spectrum is more accurate. The best NN fits to 625/1250/2500 symmetry unique potential energy poin...

Journal ArticleDOI
Linfeng Zhang, Jiequn Han, Han Wang, Roberto Car, Weinan E
TL;DR: In this article, the Deep Coarse-Grained Potential (abbreviated DeePCG) model was proposed to construct a many-body coarse-grained potential.
Abstract: We introduce a general framework for constructing coarse-grained potential models without ad hoc approximations such as limiting the potential to two- and/or three-body contributions. The scheme, called the Deep Coarse-Grained Potential (abbreviated DeePCG), exploits a carefully crafted neural network to construct a many-body coarse-grained potential. The network is trained with full atomistic data in a way that preserves the natural symmetries of the system. The resulting model is very accurate and can be used to sample the configurations of the coarse-grained variables in a much faster way than with the original atomistic model. As an application, we consider liquid water and use the oxygen coordinates as the coarse-grained variables, starting from a full atomistic simulation of this system at the ab initio molecular dynamics level. We find that the two-body, three-body, and higher-order oxygen correlation functions produced by the coarse-grained and full atomistic models agree very well with each other, illustrating the effectiveness of the DeePCG model on a rather challenging task.

Journal ArticleDOI
TL;DR: An extension of the SNAP form that includes quadratic terms in the bispectrum components is proposed that is shown to provide a large increase in accuracy relative to the linear form, while incurring only a modest increase in computational cost.
Abstract: The Spectral Neighbor Analysis Potential (SNAP) is a classical interatomic potential that expresses the energy of each atom as a linear function of selected bispectrum components of the neighbor atoms. An extension of the SNAP form is proposed that includes quadratic terms in the bispectrum components. The extension is shown to provide a large increase in accuracy relative to the linear form, while incurring only a modest increase in computational cost. The mathematical structure of the quadratic SNAP form is similar to the embedded atom method (EAM), with the SNAP bispectrum components serving as counterparts to the two-body density functions in EAM. The effectiveness of the new form is demonstrated using an extensive set of training data for tantalum structures. Similar to artificial neural network potentials, the quadratic SNAP form requires substantially more training data in order to prevent overfitting. The quality of this new potential form is measured through a robust cross-validation analysis.
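The linear-to-quadratic extension amounts to augmenting the regression design matrix with all pairwise products of the bispectrum components. A toy regression (random stand-ins for bispectrum components and a synthetic per-atom energy with genuine quadratic structure) shows why the extra terms matter:

```python
import numpy as np

rng = np.random.default_rng(6)

# toy per-atom descriptors standing in for bispectrum components
B = rng.normal(size=(500, 4))
# synthetic "energy" with genuine quadratic structure in the descriptors
E = B @ [1.0, -0.5, 0.3, 0.1] + 0.5 * B[:, 0] * B[:, 1] + 0.2 * B[:, 2] ** 2

def design(B, quadratic=False):
    X = np.column_stack([np.ones(len(B)), B])    # linear SNAP: constant + B_k
    if quadratic:                                # quadratic SNAP: add B_k * B_l
        i, j = np.triu_indices(B.shape[1])
        X = np.column_stack([X, B[:, i] * B[:, j]])
    return X

rmse = {}
for quad in (False, True):
    X = design(B, quad)
    coef, *_ = np.linalg.lstsq(X, E, rcond=None)
    rmse[quad] = float(np.sqrt(np.mean((X @ coef - E) ** 2)))
print(rmse)   # the quadratic form fits the quadratic target essentially exactly
```

As the paper notes for the tantalum fits, the enlarged basis also raises the risk of overfitting, which is why the quadratic form needs substantially more training data and cross-validation.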

Journal ArticleDOI
TL;DR: In this article, the performance of permutationally invariant polynomials, neural networks, and Gaussian approximation potentials (GAPs) in representing water two-body and three-body interaction energies was investigated.
Abstract: The accurate representation of multidimensional potential energy surfaces is a necessary requirement for realistic computer simulations of molecular systems. The continued increase in computer power accompanied by advances in correlated electronic structure methods nowadays enables routine calculations of accurate interaction energies for small systems, which can then be used as references for the development of analytical potential energy functions (PEFs) rigorously derived from many-body (MB) expansions. Building on the accuracy of the MB-pol many-body PEF, we investigate here the performance of permutationally invariant polynomials (PIPs), neural networks, and Gaussian approximation potentials (GAPs) in representing water two-body and three-body interaction energies, denoting the resulting potentials PIP-MB-pol, Behler-Parrinello neural network-MB-pol, and GAP-MB-pol, respectively. Our analysis shows that all three analytical representations exhibit similar levels of accuracy in reproducing both two-body and three-body reference data as well as interaction energies of small water clusters obtained from calculations carried out at the coupled cluster level of theory, the current gold standard for chemical accuracy. These results demonstrate the synergy between interatomic potentials formulated in terms of a many-body expansion, such as MB-pol, that are physically sound and transferable, and machine-learning techniques that provide a flexible framework to approximate the short-range interaction energy terms.

Journal ArticleDOI
TL;DR: The results suggest that ωB97M(2) has the potential to serve as a powerful predictive tool for accurate and efficient electronic structure calculations of main-group chemistry.
Abstract: A meta-generalized gradient approximation, range-separated double hybrid (DH) density functional with VV10 non-local correlation is presented. The final 14-parameter functional form is determined by screening trillions of candidate fits through a combination of best subset selection, forward stepwise selection, and random sample consensus (RANSAC) outlier detection. The MGCDB84 database of 4986 data points is employed in this work, containing a training set of 870 data points, a validation set of 2964 data points, and a test set of 1152 data points. Following an xDH approach, orbitals from the ωB97M-V density functional are used to compute the second-order perturbation theory correction. The resulting functional, ωB97M(2), is benchmarked against a variety of leading double hybrid density functionals, including B2PLYP-D3(BJ), B2GPPLYP-D3(BJ), ωB97X-2(TQZ), XYG3, PTPSS-D3(0), XYGJ-OS, DSD-PBEP86-D3(BJ), and DSD-PBEPBE-D3(BJ). Encouragingly, the overall performance of ωB97M(2) on nearly 5000 data points clearly surpasses that of all of the tested density functionals. As a Rung 5 density functional, ωB97M(2) completes our family of combinatorially optimized functionals, complementing B97M-V on Rung 3, and ωB97X-V and ωB97M-V on Rung 4. The results suggest that ωB97M(2) has the potential to serve as a powerful predictive tool for accurate and efficient electronic structure calculations of main-group chemistry.

Journal ArticleDOI
TL;DR: In this paper, the authors extend SINDy to stochastic dynamical systems, which are frequently used to model biophysical processes, and prove the asymptotic correctness of stochastic SINDy in the infinite data limit.
Abstract: With the rapid increase of available data for complex systems, there is great interest in the extraction of physically relevant information from massive datasets. Recently, a framework called Sparse Identification of Nonlinear Dynamics (SINDy) has been introduced to identify the governing equations of dynamical systems from simulation data. In this study, we extend SINDy to stochastic dynamical systems which are frequently used to model biophysical processes. We prove the asymptotic correctness of stochastic SINDy in the infinite data limit, both in the original and projected variables. We discuss algorithms to solve the sparse regression problem arising from the practical implementation of SINDy and show that cross validation is an essential tool to determine the right level of sparsity. We demonstrate the proposed methodology on two test systems, namely, the diffusion in a one-dimensional potential and the projected dynamics of a two-dimensional diffusion process.
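For an overdamped diffusion, the stochastic variant boils down to regressing finite-difference increments (whose conditional mean is the drift) on a library of candidate functions, then iterating a sparsity threshold. A minimal sketch for a 1-d Ornstein-Uhlenbeck process follows; it is simpler than the paper's double-well and 2-d test systems, and the library and threshold are ad hoc choices:

```python
import numpy as np

rng = np.random.default_rng(7)

# simulate overdamped diffusion dx = -x dt + sqrt(2 D) dW
dt, n, D = 1e-3, 400_000, 1.0
noise = np.sqrt(2.0 * D * dt) * rng.normal(size=n)
x = np.zeros(n)
for t in range(n - 1):
    x[t + 1] = x[t] - x[t] * dt + noise[t]

# regress finite-difference drift estimates on a candidate function library
y = np.diff(x) / dt
Theta = np.column_stack([np.ones(n - 1), x[:-1], x[:-1] ** 2, x[:-1] ** 3])

# sequentially thresholded least squares: the sparsity step of SINDy
xi, *_ = np.linalg.lstsq(Theta, y, rcond=None)
for _ in range(10):
    small = np.abs(xi) < 0.4
    xi[small] = 0.0
    keep = ~small
    xi[keep], *_ = np.linalg.lstsq(Theta[:, keep], y, rcond=None)
print(np.round(xi, 2))   # only the linear term survives: drift ~ -x
```

The finite-difference targets are extremely noisy (variance 2D/dt per sample), but the regression averages over the whole trajectory, and the thresholding loop then zeroes the spurious library terms.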

Journal ArticleDOI
TL;DR: This work proposes a methodology to speed up the sampling of amorphous and disordered materials using a combination of a genetic algorithm and a specialized machine-learning potential based on artificial neural networks (ANNs).
Abstract: The atomistic modeling of amorphous materials requires structure sizes and sampling statistics that are challenging to achieve with first-principles methods. Here, we propose a methodology to speed up the sampling of amorphous and disordered materials using a combination of a genetic algorithm and a specialized machine-learning potential based on artificial neural networks (ANNs). We show for the example of the amorphous LiSi alloy that around 1000 first-principles calculations are sufficient for the ANN-potential assisted sampling of low-energy atomic configurations in the entire amorphous LixSi phase space. The obtained phase diagram is validated by comparison with the results from an extensive sampling of LixSi configurations using molecular dynamics simulations and a general ANN potential trained to ∼45 000 first-principles calculations. This demonstrates the utility of the approach for the first-principles modeling of amorphous materials.

Journal ArticleDOI
TL;DR: In this paper, a local model of interatomic interactions is proposed for predicting molecular properties, which provides high accuracy when trained on relatively small training sets and an active learning algorithm of optimally choosing the training set that maximizes the expected performance.
Abstract: In recent years, machine learning techniques have shown great potential in various problems from a multitude of disciplines, including materials design and drug discovery. The high computational speed on the one hand and the accuracy comparable to that of density functional theory on the other hand make machine learning algorithms efficient for high-throughput screening through chemical and configurational space. However, the machine learning algorithms available in the literature require large training datasets to reach chemical accuracy and also show large errors for the so-called outliers, i.e., the out-of-sample molecules not well represented in the training set. In the present paper, we propose a new machine learning algorithm for predicting molecular properties that addresses these two issues: it is based on a local model of interatomic interactions providing high accuracy when trained on relatively small training sets and an active learning algorithm of optimally choosing the training set that sig...

Journal ArticleDOI
TL;DR: This work analyzes the electric double layer in an approach beyond the point charge scheme by instead assessing charge polarizations at electrochemical metal-water interfaces from first principles and derives the electrode potential from the charge polarization.
Abstract: The description of electrode-electrolyte interfaces is based on the concept of the formation of an electric double layer. This concept was derived from continuum theories extended by introducing point charge distributions. Based on ab initio molecular dynamics simulations, we analyze the electric double layer in an approach beyond the point charge scheme by instead assessing charge polarizations at electrochemical metal-water interfaces from first principles. We show that the atomic structure of water layers at room temperature leads to an oscillatory behavior of the averaged electrostatic potential. We address the relation between the polarization distribution at the interface and the extent of the electric double layer and subsequently derive the electrode potential from the charge polarization.

Journal ArticleDOI
TL;DR: In this article, a fast semistochastic heat-bath configuration interaction (SHCI) method for solving the many-body Schrodinger equation is presented, which identifies and eliminates computational bottlenecks in both the variational and perturbative steps.
Abstract: This paper presents in detail our fast semistochastic heat-bath configuration interaction (SHCI) method for solving the many-body Schrödinger equation. We identify and eliminate computational bottlenecks in both the variational and perturbative steps of the SHCI algorithm. We also describe the parallelization and the key data structures in our implementation, such as the distributed hash table. The improved SHCI algorithm enables us to include in our variational wavefunction two orders of magnitude more determinants than has been reported previously with other selected configuration interaction methods. We use our algorithm to calculate an accurate benchmark energy for the chromium dimer with the X2C relativistic Hamiltonian in the cc-pVDZ-DK basis, correlating 28 electrons in 76 spatial orbitals. Our largest calculation uses two billion Slater determinants in the variational space and semistochastically includes perturbative contributions from at least trillions of additional determinants with better than 10^-5 Ha statistical uncertainty.

Journal ArticleDOI
TL;DR: A number of sophistications of the neural network architectures are described to improve and generalize the process of interleaved collective variable discovery and enhanced sampling and to support bespoke error functions for network training to incorporate prior knowledge.
Abstract: Auto-associative neural networks ("autoencoders") present a powerful nonlinear dimensionality reduction technique to mine data-driven collective variables from molecular simulation trajectories. This technique furnishes explicit and differentiable expressions for the nonlinear collective variables, making it ideally suited for integration with enhanced sampling techniques for accelerated exploration of configurational space. In this work, we describe a number of sophistications of the neural network architectures to improve and generalize the process of interleaved collective variable discovery and enhanced sampling. We employ circular network nodes to accommodate periodicities in the collective variables, hierarchical network architectures to rank-order the collective variables, and generalized encoder-decoder architectures to support bespoke error functions for network training to incorporate prior knowledge. We demonstrate our approach in blind collective variable discovery and enhanced sampling of the configurational free energy landscapes of alanine dipeptide and Trp-cage using an open-source plugin developed for the OpenMM molecular simulation package.
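
To make the autoencoder-as-CV idea concrete, here is a deliberately stripped-down sketch: a one-node linear autoencoder with tied weights, trained by gradient descent on synthetic 2-D "configurations" with one dominant slow direction. The paper uses deep nonlinear networks (with circular and hierarchical variants); this linear toy only recovers the leading principal direction, but the latent coordinate plays the same role of a data-driven collective variable.

```python
import numpy as np

rng = np.random.default_rng(1)
# toy "trajectory": 2-D configurations varying mainly along the (1, 1) direction
t = rng.normal(size=(500, 1))
X = t @ np.array([[2.0, 2.0]]) + 0.05 * rng.normal(size=(500, 2))
X -= X.mean(axis=0)

# tied-weight autoencoder: encoder z = X @ w, decoder Xhat = z @ w.T,
# trained on the mean-squared reconstruction error
w = np.array([[0.3], [-0.5]])
lr = 0.01
C = X.T @ X / len(X)                  # data covariance
for _ in range(2000):
    # gradient of ||X w w^T - X||^2 / n with respect to w
    grad = 2 * (C @ w * (w.T @ w) + w * (w.T @ C @ w) - 2 * C @ w)
    w -= lr * grad

z = X @ w                             # the learned collective variable
recon_err = np.mean((z @ w.T - X) ** 2) / np.mean(X ** 2)
```

At convergence w aligns with the leading eigenvector of C (here close to (1, 1)/sqrt(2)), and the relative reconstruction error is small; nonlinear encoders generalize this to curved manifolds.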

Journal ArticleDOI
TL;DR: This work shows how the decision functions in SML algorithms can be used as initial CVs (SMLcv ) for accelerated sampling, and illustrates how the distance to the support vector machines' decision hyperplane, the output probability estimates from logistic regression, the outputs from shallow or deep neural network classifiers, and other classifiers may be used to reversibly sample slow structural transitions.
Abstract: Selection of appropriate collective variables (CVs) for enhancing sampling of molecular simulations remains an unsolved problem in computational modeling. Picking initial CVs is especially challenging in higher dimensions. Which atomic coordinates or transforms thereof, from a list of thousands, should one pick for enhanced sampling runs? How does a modeler even begin to pick starting coordinates for investigation? This remains true even in the case of simple two-state systems and only increases in difficulty for multi-state systems. In this work, we solve the "initial" CV problem using a data-driven approach inspired by the field of supervised machine learning (SML). In particular, we show how the decision functions in SML algorithms can be used as initial CVs (SMLcv) for accelerated sampling. Using solvated alanine dipeptide and the Chignolin mini-protein as our test cases, we illustrate how the distance to the support vector machines' decision hyperplane, the output probability estimates from logistic regression, the outputs from shallow or deep neural network classifiers, and other classifiers may be used to reversibly sample slow structural transitions. We discuss the utility of other SML algorithms that might be useful for identifying CVs for accelerating molecular simulations.
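
A minimal version of the SMLcv idea can be sketched with logistic regression trained from scratch: label configurations sampled from two synthetic metastable states, fit the classifier, and use its decision function as the collective variable. The two Gaussian "states" and the training hyperparameters below are illustrative, not the paper's alanine dipeptide or Chignolin setups.

```python
import numpy as np

rng = np.random.default_rng(2)
# configurations sampled from two metastable "states" in a 2-D feature space
A = rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(200, 2))
B = rng.normal(loc=[+2.0, 0.0], scale=0.3, size=(200, 2))
X = np.vstack([A, B])
y = np.concatenate([np.zeros(200), np.ones(200)])

# logistic regression by gradient descent; the decision function
# f(x) = x . w + b is then used directly as the collective variable
w, b = np.zeros(2), 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid probabilities
    w -= lr * (X.T @ (p - y)) / len(y)
    b -= lr * np.mean(p - y)

cv = X @ w + b   # negative in state A, positive in state B
```

The sign and magnitude of the decision function order configurations along the transition, which is what makes it usable as a biasing coordinate in enhanced sampling.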

Journal ArticleDOI
TL;DR: Two different classes of molecular representations for use in machine learning of thermodynamic and electronic properties are studied, including the Coulomb matrix and Bag of Bonds, and Encoded Bonds, which encode such lists into a feature vector whose length is independent of molecular size.
Abstract: Two different classes of molecular representations for use in machine learning of thermodynamic and electronic properties are studied. The representations are evaluated by monitoring the performance of linear and kernel ridge regression models on well-studied data sets of small organic molecules. One class of representations studied here counts the occurrence of bonding patterns in the molecule. These require only the connectivity of atoms in the molecule as may be obtained from a line diagram or a SMILES string. The second class utilizes the three-dimensional structure of the molecule. These include the Coulomb matrix and Bag of Bonds, which list the inter-atomic distances present in the molecule, and Encoded Bonds, which encode such lists into a feature vector whose length is independent of molecular size. The Encoded Bonds features introduced here have the advantage of leading to models that may be trained on smaller molecules and then used successfully on larger molecules. A wide range of feature sets are constructed by selecting, at each rank, either a graph or geometry-based feature. Here, rank refers to the number of atoms involved in the feature, e.g., atom counts are rank 1, while Encoded Bonds are rank 2. For atomization energies in the QM7 data set, the best graph-based feature set gives a mean absolute error of 3.4 kcal/mol. Inclusion of 3D geometry substantially enhances the performance, with Encoded Bonds giving 2.4 kcal/mol when used alone and 1.19 kcal/mol when combined with graph features.
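
Kernel ridge regression itself is a compact algorithm, and a minimal sketch helps fix ideas. The 1-D feature below is a toy stand-in for a molecular descriptor such as a Coulomb-matrix or Encoded Bonds vector; the Gaussian kernel, length scale, and regularization strength are illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def krr_fit(X, y, sigma=1.0, lam=1e-6):
    """Kernel ridge regression: solve (K + lam*I) alpha = y."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(Xtrain, alpha, Xtest, sigma=1.0):
    return gaussian_kernel(Xtest, Xtrain, sigma) @ alpha

# toy "property" of a 1-D feature, in place of a real molecular descriptor
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = np.sin(X).ravel()
alpha = krr_fit(X, y, sigma=1.0, lam=1e-6)
yhat = krr_predict(X, alpha, np.array([[0.5]]), sigma=1.0)
```

The regularizer lam trades training-set fit against smoothness; for noiseless data a small value nearly interpolates, so the prediction at 0.5 lands close to sin(0.5).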

Journal ArticleDOI
TL;DR: A screening procedure using a simple string representation for a promising class of donor-acceptor polymers in conjunction with a grammar variational autoencoder is proposed which increases the chance of finding suitable polymers by more than a factor of five in comparison to the randomised search used in gathering the training set.
Abstract: Polymer solar cells admit numerous potential advantages including low energy payback time and scalable high-speed manufacturing, but the power conversion efficiency is currently lower than for their inorganic counterparts. In a phenyl-C61-butyric acid methyl ester (PCBM)-based blended polymer solar cell, the optical gap of the polymer and the energetic alignment of the lowest unoccupied molecular orbital (LUMO) of the polymer and the PCBM are crucial for the device efficiency. Searching for new and better materials for polymer solar cells is a computationally costly affair using density functional theory (DFT) calculations. In this work, we propose a screening procedure using a simple string representation for a promising class of donor-acceptor polymers in conjunction with a grammar variational autoencoder. The model is trained on a dataset of 3989 monomers obtained from DFT calculations and is able to predict the LUMO and the lowest optical transition energy for unseen molecules with mean absolute errors of 43 and 74 meV, respectively, without knowledge of the atomic positions. We demonstrate the merit of the model for generating new molecules with the desired LUMO and optical gap energies, which increases the chance of finding suitable polymers by more than a factor of five in comparison to the randomised search used in gathering the training set.

Journal ArticleDOI
TL;DR: This perspective discusses the recent progress and current challenges in multireference wave function methods for dynamical electron correlation, focusing on systematically improvable methods that go beyond the limitations of configuration interaction and perturbation theory.
Abstract: Predicting the electronic structure and properties of molecular systems that display strong electron correlation effects continues to remain a fundamental theoretical challenge. This perspective discusses the recent progress and current challenges in multireference wave function methods for dynamical electron correlation, focusing on systematically improvable methods that go beyond the limitations of configuration interaction and perturbation theory.

Journal ArticleDOI
TL;DR: The present method revises the reference (internal) space under the effect of its interaction with the outer space via the construction of an effective Hamiltonian, following the shifted-Bk philosophy of Davidson and co-workers.
Abstract: Selected configuration interaction (sCI) methods including second-order perturbative corrections provide near full CI (FCI) quality energies with only a small fraction of the determinants of the FCI space. Here, we introduce both a state-specific and a multi-state sCI method based on the configuration interaction using a perturbative selection made iteratively (CIPSI) algorithm. The present method revises the reference (internal) space under the effect of its interaction with the outer space via the construction of an effective Hamiltonian, following the shifted-Bk philosophy of Davidson and co-workers. In particular, the multi-state algorithm removes the storage bottleneck of the effective Hamiltonian via a low-rank factorization of the dressing matrix. Illustrative examples are reported for the state-specific and multi-state versions.