Journal Article

Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules

TL;DR: In this article, a deep neural network is trained on hundreds of thousands of existing chemical structures to construct three coupled functions (an encoder, a decoder, and a predictor) that together enable the generation of new molecules for efficient exploration and optimization of open-ended spaces of chemical compounds.
Abstract: We report a method to convert discrete representations of molecules to and from a multidimensional continuous representation. This model allows us to generate new molecules for efficient exploration and optimization through open-ended spaces of chemical compounds. A deep neural network was trained on hundreds of thousands of existing chemical structures to construct three coupled functions: an encoder, a decoder, and a predictor. The encoder converts the discrete representation of a molecule into a real-valued continuous vector, and the decoder converts these continuous vectors back to discrete molecular representations. The predictor estimates chemical properties from the latent continuous vector representation of the molecule. Continuous representations of molecules allow us to automatically generate novel chemical structures by performing simple operations in the latent space, such as decoding random vectors, perturbing known chemical structures, or interpolating between molecules. Continuous representations also allow the use of powerful gradient-based optimization to efficiently guide the search for optimized functional compounds.
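The encoder/decoder/predictor coupling described above maps naturally onto a variational autoencoder with an auxiliary property head. The sketch below is a minimal, illustrative PyTorch version: the layer sizes, SMILES length, and alphabet size are placeholders, not the architecture used in the paper.

```python
# Minimal sketch of the encoder/decoder/predictor coupling described above.
# Assumes PyTorch; dimensions and layers are illustrative placeholders only.
import torch
import torch.nn as nn
import torch.nn.functional as F

MAX_LEN, ALPHABET, LATENT = 120, 35, 56  # hypothetical SMILES length, alphabet, latent size

class MolecularVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(MAX_LEN * ALPHABET, 512), nn.ReLU())
        self.to_mu = nn.Linear(512, LATENT)
        self.to_logvar = nn.Linear(512, LATENT)
        self.decoder = nn.Sequential(nn.Linear(LATENT, 512), nn.ReLU(),
                                     nn.Linear(512, MAX_LEN * ALPHABET))
        self.predictor = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        h = self.encoder(x)                                        # discrete (one-hot) -> hidden
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)    # reparameterization trick
        logits = self.decoder(z).view(-1, MAX_LEN, ALPHABET)       # continuous -> discrete logits
        prop = self.predictor(z)                                   # property estimate from latent vector
        return logits, prop, mu, logvar

def loss_fn(logits, x, prop_pred, prop_true, mu, logvar):
    recon = F.cross_entropy(logits.transpose(1, 2), x.argmax(-1))    # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # latent regularizer
    mse = F.mse_loss(prop_pred.squeeze(-1), prop_true)               # property-prediction term
    return recon + kl + mse

# Dummy usage with a batch of two one-hot "SMILES" tensors.
x = torch.zeros(2, MAX_LEN, ALPHABET)
x[:, :, 0] = 1.0
model = MolecularVAE()
logits, prop, mu, logvar = model(x)
loss = loss_fn(logits, x, prop, torch.zeros(2), mu, logvar)
```

Training jointly on the reconstruction, KL, and property terms is what organizes the latent space so that decoding perturbed or interpolated vectors yields plausible molecules.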
Citations
Journal Article
TL;DR: This article provides a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields and proposes a new taxonomy that divides the state-of-the-art GNNs into four categories, namely recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs.
Abstract: Deep learning has revolutionized many machine learning tasks in recent years, ranging from image classification and video processing to speech recognition and natural language understanding. The data in these tasks are typically represented in the Euclidean space. However, there is an increasing number of applications, where data are generated from non-Euclidean domains and are represented as graphs with complex relationships and interdependency between objects. The complexity of graph data has imposed significant challenges on the existing machine learning algorithms. Recently, many studies on extending deep learning approaches for graph data have emerged. In this article, we provide a comprehensive overview of graph neural networks (GNNs) in data mining and machine learning fields. We propose a new taxonomy to divide the state-of-the-art GNNs into four categories, namely, recurrent GNNs, convolutional GNNs, graph autoencoders, and spatial–temporal GNNs. We further discuss the applications of GNNs across various domains and summarize the open-source codes, benchmark data sets, and model evaluation of GNNs. Finally, we propose potential research directions in this rapidly growing field.
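As a concrete anchor for the taxonomy, the sketch below shows the operation that convolutional GNNs in the survey's sense share: aggregating neighbor features through a normalized adjacency matrix and applying a learned transformation. It is a generic GCN-style layer in PyTorch with illustrative dimensions, not code from the survey.

```python
# Minimal sketch of a convolutional-GNN (GCN-style) layer, one of the survey's four categories.
# Assumes PyTorch; feature sizes and the example graph are illustrative.
import torch
import torch.nn as nn

class GraphConvLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, adj, features):
        # Add self-loops, then symmetrically normalize the adjacency matrix.
        a_hat = adj + torch.eye(adj.size(0))
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        norm = deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)
        # Aggregate neighbor features and apply a learned transformation.
        return torch.relu(norm @ self.linear(features))

# Usage: a 4-node graph with hypothetical 8-dimensional node features.
adj = torch.tensor([[0., 1., 0., 0.],
                    [1., 0., 1., 1.],
                    [0., 1., 0., 0.],
                    [0., 1., 0., 0.]])
layer = GraphConvLayer(8, 16)
out = layer(adj, torch.randn(4, 8))   # -> (4, 16) updated node embeddings
```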

4,584 citations

Journal Article
Science, 27 Jul 2018
TL;DR: Methods for achieving inverse design, which aims to discover tailored materials from the starting point of a particular desired functionality, are reviewed.
Abstract: The discovery of new materials can bring enormous societal and technological progress. In this context, exploring completely the large space of potential materials is computationally intractable. Here, we review methods for achieving inverse design, which aims to discover tailored materials from the starting point of a particular desired functionality. Recent advances from the rapidly growing field of artificial intelligence, mostly from the subfield of machine learning, have resulted in a fertile exchange of ideas, where approaches to inverse molecular design are being proposed and employed at a rapid pace. Among these, deep generative models have been applied to numerous classes of materials: rational design of prospective drugs, synthetic routes to organic compounds, and optimization of photovoltaics and redox flow batteries, as well as a variety of other solid-state materials.
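One common realization of inverse design discussed in this line of work is gradient-based search in the latent space of a deep generative model. The sketch below illustrates the idea with untrained PyTorch placeholders standing in for a trained generator and property predictor; the dimensions and target value are hypothetical.

```python
# Minimal sketch of gradient-based inverse design: search a generative model's latent
# space for a point whose predicted property matches a desired target value.
# Assumes PyTorch; `decoder` and `property_model` are untrained placeholders.
import torch
import torch.nn as nn

latent_dim, target_property = 32, 1.5          # hypothetical values
decoder = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, 128))
property_model = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1))

z = torch.zeros(1, latent_dim, requires_grad=True)   # starting point in latent space
optimizer = torch.optim.Adam([z], lr=0.05)

for step in range(200):
    optimizer.zero_grad()
    candidate = decoder(z)                            # latent vector -> material representation
    loss = (property_model(candidate) - target_property).pow(2).mean()
    loss.backward()                                   # gradients flow back to the latent point
    optimizer.step()

# After optimization, decoder(z) encodes a candidate whose predicted property is near the target.
```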

1,090 citations

Journal Article
TL;DR: This work provides an introduction to variational autoencoders and some important extensions, which provide a principled framework for learning deep latent-variable models and corresponding inference models.
Abstract: Variational autoencoders provide a principled framework for learning deep latent-variable models and corresponding inference models. In this work, we provide an introduction to variational autoencoders and some important extensions.
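The principled framework referred to here is maximization of the evidence lower bound (ELBO), written below in the standard notation of the VAE literature (not quoted from this reference), with encoder q_phi(z|x) and decoder p_theta(x|z) trained jointly.

```latex
% Evidence lower bound (ELBO) maximized jointly over encoder parameters \phi and
% decoder parameters \theta; the bound is tight when q_\phi(z|x) matches the true posterior.
\log p_\theta(x) \;\ge\; \mathcal{L}(\theta,\phi;x)
  = \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log p_\theta(x\mid z)\right]
  - D_{\mathrm{KL}}\!\left(q_\phi(z\mid x)\,\|\,p(z)\right)
```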

1,089 citations

Journal Article
TL;DR: The ReLeaSE method is used to design chemical libraries with a bias toward structural complexity or toward compounds with maximal, minimal, or a specific range of physical properties, such as melting point or hydrophobicity.
Abstract: We have devised and implemented a novel computational strategy for de novo design of molecules with desired properties termed ReLeaSE (Reinforcement Learning for Structural Evolution). On the basis of deep and reinforcement learning (RL) approaches, ReLeaSE integrates two deep neural networks—generative and predictive—that are trained separately but are used jointly to generate novel targeted chemical libraries. ReLeaSE uses simple representation of molecules by their simplified molecular-input line-entry system (SMILES) strings only. Generative models are trained with a stack-augmented memory network to produce chemically feasible SMILES strings, and predictive models are derived to forecast the desired properties of the de novo–generated compounds. In the first phase of the method, generative and predictive models are trained separately with a supervised learning algorithm. In the second phase, both models are trained jointly with the RL approach to bias the generation of new chemical structures toward those with the desired physical and/or biological properties. In the proof-of-concept study, we have used the ReLeaSE method to design chemical libraries with a bias toward structural complexity or toward compounds with maximal, minimal, or specific range of physical properties, such as melting point or hydrophobicity, or toward compounds with inhibitory activity against Janus protein kinase 2. The approach proposed herein can find a general use for generating targeted chemical libraries of novel compounds optimized for either a single desired property or multiple properties.
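The two-phase scheme described above ends with a policy-gradient step that rewards generated strings according to the predictive model. The sketch below is a generic REINFORCE-style version of that step in PyTorch, with a toy per-token categorical "generator" and a random placeholder reward; it is not the authors' stack-augmented network or their reward design.

```python
# Schematic of the RL phase: bias a SMILES generator toward rewarded structures
# with a REINFORCE-style update. Assumes PyTorch; the generator and reward here
# are toy placeholders so the sketch runs end to end.
import torch

vocab_size, max_len = 40, 60                                    # hypothetical SMILES alphabet/length
logits = torch.zeros(max_len, vocab_size, requires_grad=True)   # toy generator parameters
optimizer = torch.optim.Adam([logits], lr=1e-3)

def reward(token_ids):
    # Placeholder for the predictive network: score the decoded SMILES string.
    return float(torch.rand(()))

for step in range(100):
    dist = torch.distributions.Categorical(logits=logits)
    tokens = dist.sample()                          # one sampled token sequence
    log_prob = dist.log_prob(tokens).sum()          # log-probability of that sequence
    loss = -reward(tokens) * log_prob               # REINFORCE: upweight rewarded sequences
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```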

792 citations

Journal Article
TL;DR: A machine learning model allows new small-molecule kinase inhibitors to be identified in days; it was used to discover potent inhibitors of discoidin domain receptor 1 (DDR1), a kinase target implicated in fibrosis and other diseases, in 21 days.
Abstract: We have developed a deep generative model, generative tensorial reinforcement learning (GENTRL), for de novo small-molecule design. GENTRL optimizes synthetic feasibility, novelty, and biological activity. We used GENTRL to discover potent inhibitors of discoidin domain receptor 1 (DDR1), a kinase target implicated in fibrosis and other diseases, in 21 days. Four compounds were active in biochemical assays, and two were validated in cell-based assays. One lead candidate was tested and demonstrated favorable pharmacokinetics in mice.

663 citations

References
Posted Content
TL;DR: This work introduces a class of CNNs called deep convolutional generative adversarial networks (DCGANs) that have certain architectural constraints, and demonstrates that they are a strong candidate for unsupervised learning.
Abstract: In recent years, supervised learning with convolutional networks (CNNs) has seen huge adoption in computer vision applications. Comparatively, unsupervised learning with CNNs has received less attention. In this work we hope to help bridge the gap between the success of CNNs for supervised learning and unsupervised learning. We introduce a class of CNNs called deep convolutional generative adversarial networks (DCGANs), that have certain architectural constraints, and demonstrate that they are a strong candidate for unsupervised learning. Training on various image datasets, we show convincing evidence that our deep convolutional adversarial pair learns a hierarchy of representations from object parts to scenes in both the generator and discriminator. Additionally, we use the learned features for novel tasks - demonstrating their applicability as general image representations.
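Below is a minimal PyTorch sketch of the generator-side constraints the DCGAN recipe prescribes (strided transposed convolutions instead of pooling, batch normalization, ReLU activations, tanh output); the channel counts and output resolution are illustrative, not the paper's exact configuration.

```python
# Minimal DCGAN-style generator reflecting the paper's architectural constraints.
# Assumes PyTorch; channel sizes and output resolution are illustrative.
import torch
import torch.nn as nn

def dcgan_generator(z_dim=100, base=64, out_channels=3):
    return nn.Sequential(
        nn.ConvTranspose2d(z_dim, base * 4, 4, 1, 0, bias=False),     # 1x1 -> 4x4
        nn.BatchNorm2d(base * 4), nn.ReLU(True),
        nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1, bias=False),  # 4x4 -> 8x8
        nn.BatchNorm2d(base * 2), nn.ReLU(True),
        nn.ConvTranspose2d(base * 2, base, 4, 2, 1, bias=False),      # 8x8 -> 16x16
        nn.BatchNorm2d(base), nn.ReLU(True),
        nn.ConvTranspose2d(base, out_channels, 4, 2, 1, bias=False),  # 16x16 -> 32x32
        nn.Tanh(),                                                    # images in [-1, 1]
    )

gen = dcgan_generator()
images = gen(torch.randn(8, 100, 1, 1))   # -> (8, 3, 32, 32)
```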

6,759 citations

Journal Article
TL;DR: This work discusses the enumeration, classification, and topological properties of benzenoid and coronoid hydrocarbons, drawing on computer-based generation of these systems.
Abstract:
(1) Klarner, D. A. "Some Results Concerning Polyominoes". Fibonacci Q. 1965, 3(1), 9-20.
(2) Golomb, S. W. Polyominoes; Scribner: New York, 1965.
(3) Harary, F.; Read, R. C. "The Enumeration of Tree-like Polyhexes". Proc. Edinburgh Math. Soc. 1970, 17, 1-14.
(4) Lunnon, W. F. "Counting Polyominoes" in Computers in Number Theory; Academic: London, 1971; pp 347-372.
(5) Lunnon, W. F. "Counting Hexagonal and Triangular Polyominoes". Graph Theory Comput. 1972, 87-100.
(6) Brunvoll, J.; Cyvin, S. J.; Cyvin, B. N. "Enumeration and Classification of Benzenoid Hydrocarbons". J. Comput. Chem. 1987, 8, 189-197.
(7) Balaban, A. T., et al. "Enumeration of Benzenoid and Coronoid Hydrocarbons". Z. Naturforsch., A: Phys., Phys. Chem., Kosmophys. 1987, 42A, 863-870.
(8) Gutman, I. "Topological Properties of Benzenoid Systems". Bull. Soc. Chim. Beograd 1982, 47, 453-471.
(9) Gutman, I.; Polansky, O. E. Mathematical Concepts in Organic Chemistry; Springer: Berlin, 1986.
(10) Tošić, R.; Doroslovački, R.; Gutman, I. "Topological Properties of Benzenoid Systems—The Boundary Code". MATCH 1986, No. 19, 219-228.
(11) Doroslovački, R.; Tošić, R. "A Characterization of Hexagonal Systems". Rev. Res. Fac. Sci.-Univ. Novi Sad, Math. Ser. 1984, 14(2), 201-209.
(12) Knop, J. V.; Szymanski, K.; Trinajstić, N. "Computer Enumeration of Substituted Polyhexes". Comput. Chem. 1984, 8(2), 107-115.
(13) Stojmenović, I.; Tošić, R.; Doroslovački, R. "Generating and Counting Hexagonal Systems". Proc. Yugosl. Semin. Graph Theory, 6th, Dubrovnik, 1985; pp 189-198.
(14) Doroslovački, R.; Stojmenović, I.; Tošić, R. "Generating and Counting Triangular Systems". BIT 1987, 27, 18-24.
(15) Knop, J. V.; Müller, W. R.; Szymanski, K.; Trinajstić, N. Computer Generation of Certain Classes of Molecules; Association of Chemists and Technologists of Croatia: Zagreb, 1985.

4,541 citations

Journal Article
TL;DR: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks.
Abstract: The exact form of a gradient-following learning algorithm for completely recurrent networks running in continually sampled time is derived and used as the basis for practical algorithms for temporal supervised learning tasks. These algorithms have (1) the advantage that they do not require a precisely defined training interval, operating while the network runs; and (2) the disadvantage that they require nonlocal communication in the network being trained and are computationally expensive. These algorithms allow networks having recurrent connections to learn complex tasks that require the retention of information over time periods having either fixed or indefinite length.
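The gradient-following algorithm referred to here is usually presented through a running sensitivity term; the recursion below uses common textbook notation, which may differ from the paper's own symbols.

```latex
% Real-time recurrent learning in textbook notation: z(t) concatenates external inputs
% and unit outputs, s_k(t) = \sum_l w_{kl} z_l(t) is the net input to unit k, and
% p^k_{ij}(t) = \partial y_k(t) / \partial w_{ij} is a sensitivity carried forward in time.
p^{k}_{ij}(t+1) = f'\!\big(s_k(t)\big)\Big[\sum_{l} w_{kl}\, p^{l}_{ij}(t) + \delta_{ik}\, z_j(t)\Big],
\qquad p^{k}_{ij}(0) = 0
% On-line weight update from the instantaneous errors e_k(t), with learning rate \eta:
\Delta w_{ij}(t) = \eta \sum_{k} e_k(t)\, p^{k}_{ij}(t)
```

Carrying these sensitivities for every weight is what lets the method operate while the network runs, but it is also what makes it nonlocal and computationally expensive, as noted in the abstract.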

4,351 citations

Journal Article
TL;DR: ECFPs are circular topological fingerprints that can be calculated very rapidly and can represent an essentially infinite number of different molecular features; a description of their implementation has not previously been presented in the literature.
Abstract: Extended-connectivity fingerprints (ECFPs) are a novel class of topological fingerprints for molecular characterization. Historically, topological fingerprints were developed for substructure and similarity searching. ECFPs were developed specifically for structure−activity modeling. ECFPs are circular fingerprints with a number of useful qualities: they can be very rapidly calculated; they are not predefined and can represent an essentially infinite number of different molecular features (including stereochemical information); their features represent the presence of particular substructures, allowing easier interpretation of analysis results; and the ECFP algorithm can be tailored to generate different types of circular fingerprints, optimized for different uses. While the use of ECFPs has been widely adopted and validated, a description of their implementation has not previously been presented in the literature.
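For readers who want to compute such fingerprints, the snippet below uses RDKit's Morgan fingerprints, the usual open-source ECFP-style implementation (RDKit is not part of this reference, and the choice of radius 2, corresponding to ECFP4, is an assumption of the example).

```python
# Computing ECFP-style circular fingerprints with RDKit's Morgan implementation.
# Assumption: RDKit's Morgan fingerprint with radius 2 stands in for ECFP4 here.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
caffeine = Chem.MolFromSmiles("Cn1cnc2c1c(=O)n(C)c(=O)n2C")

# Hash circular substructures up to radius 2 into fixed-length bit vectors.
fp1 = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
fp2 = AllChem.GetMorganFingerprintAsBitVect(caffeine, 2, nBits=2048)

print(DataStructs.TanimotoSimilarity(fp1, fp2))   # structural similarity in [0, 1]
```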

4,173 citations

Posted Content
TL;DR: This paper proposes WaveNet, a deep neural network for generating raw audio waveforms; the model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones.
Abstract: This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones; nonetheless we show that it can be efficiently trained on data with tens of thousands of samples per second of audio. When applied to text-to-speech, it yields state-of-the-art performance, with human listeners rating it as significantly more natural sounding than the best parametric and concatenative systems for both English and Mandarin. A single WaveNet can capture the characteristics of many different speakers with equal fidelity, and can switch between them by conditioning on the speaker identity. When trained to model music, we find that it generates novel and often highly realistic musical fragments. We also show that it can be employed as a discriminative model, returning promising results for phoneme recognition.
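The autoregressive factorization the abstract describes, written in standard notation:

```latex
% Joint distribution over a waveform x = (x_1, ..., x_T); each sample's predictive
% distribution is conditioned on all previous samples.
p(\mathbf{x}) = \prod_{t=1}^{T} p\big(x_t \mid x_1, \ldots, x_{t-1}\big)
```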

4,002 citations