A Deep Learning Approach to Antibiotic Discovery

doi:10.1016/J.CELL.2020.01.021

Journal Article•DOI•

A Deep Learning Approach to Antibiotic Discovery

Jonathan M. Stokes¹, Kevin Yang¹, Kyle Swanson¹, Wengong Jin¹, Andres Cubillos-Ruiz², Nina M. Donghia², Craig R. MacNair³, Shawn French³, Lindsey A. Carfrae³, Zohar Bloom-Ackermann⁴, Victoria M. Tran⁵, Anush Chiappino-Pepe², Ahmed H. Badran⁵, Ian W. Andrews¹,²,⁵, Emma J. Chory¹, George M. Church², Eric D. Brown³, Tommi S. Jaakkola¹, Regina Barzilay¹, James J. Collins

Massachusetts Institute of Technology¹, Wyss Institute for Biologically Inspired Engineering², McMaster University³, Harvard University⁴, Broad Institute⁵

20 Feb 2020-Cell-Vol. 180, Iss: 4, pp 688-702.e13

TL;DR: A deep neural network capable of predicting molecules with antibacterial activity is trained, and a molecule from the Drug Repurposing Hub, halicin, is discovered that is structurally divergent from conventional antibiotics and displays bactericidal activity against a wide phylogenetic spectrum of pathogens.

About: This article was published in Cell on 2020-02-20 and is currently open access. It has received 1002 citations to date.

Citations

Posted Content•

Open Graph Benchmark: Datasets for Machine Learning on Graphs

Weihua Hu¹, Matthias Fey², Marinka Zitnik³, Yuxiao Dong¹, Hongyu Ren¹, Bowen Liu¹, Michele Catasta¹, Jure Leskovec¹

Stanford University¹, Technical University of Dortmund², Harvard University³

02 May 2020-arXiv: Learning

TL;DR: The OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover domains ranging from social and information networks to biological networks, molecular graphs, source code ASTs, and knowledge graphs. Benchmark experiments suggest they pose significant challenges of scalability and out-of-distribution generalization, indicating fruitful opportunities for future research.

Abstract: We present the Open Graph Benchmark (OGB), a diverse set of challenging and realistic benchmark datasets to facilitate scalable, robust, and reproducible graph machine learning (ML) research. OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains, ranging from social and information networks to biological networks, molecular graphs, source code ASTs, and knowledge graphs. For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics. In addition to building the datasets, we also perform extensive benchmark experiments for each dataset. Our experiments suggest that OGB datasets present significant challenges of scalability to large-scale graphs and out-of-distribution generalization under realistic data splits, indicating fruitful opportunities for future research. Finally, OGB provides an automated end-to-end graph ML pipeline that simplifies and standardizes the process of graph data loading, experimental setup, and model evaluation. OGB will be regularly updated and welcomes inputs from the community. OGB datasets as well as data loaders, evaluation scripts, baseline code, and leaderboards are publicly available at this https URL .

1,097 citations

Posted Content•

TUDataset: A collection of benchmark datasets for learning with graphs.

Christopher Morris, Nils M. Kriege, Franka Bause, Kristian Kersting, Petra Mutzel, Marion Neumann

16 Jul 2020-arXiv: Learning

TL;DR: The TUDataset for graph classification and regression is introduced, which consists of over 120 datasets of varying sizes from a wide range of applications and provides Python-based data loaders, kernel and graph neural network baseline implementations, and evaluation tools.

Abstract: Recently, there has been an increasing interest in (supervised) learning with graph data, especially using graph neural networks. However, the development of meaningful benchmark datasets and standardized evaluation procedures is lagging, consequently hindering advancements in this area. To address this, we introduce the TUDataset for graph classification and regression. The collection consists of over 120 datasets of varying sizes from a wide range of applications. We provide Python-based data loaders, kernel and graph neural network baseline implementations, and evaluation tools. Here, we give an overview of the datasets, standardized evaluation procedures, and provide baseline experiments. All datasets are available at this http URL. The experiments are fully reproducible from the code available at this http URL.

346 citations

Journal Article•DOI•

AI in health and medicine

Pranav Rajpurkar, Emma Chen, O. Banerjee, Eric J. Topol

01 Jan 2022-Nature Medicine

TL;DR: Key findings from a 2-year weekly effort to track and share key developments in medical AI are discussed, including prospective studies and advances in medical image analysis, which have reduced the gap between research and deployment.

346 citations

Journal Article•DOI•

A guide to machine learning for biologists.

Joe G Greener¹, Shaun M. Kandathil¹, Lewis Moffat¹, David T. Jones¹

University College London¹

13 Sep 2021-Nature Reviews Molecular Cell Biology

TL;DR: Machine learning is becoming a widely used tool for the analysis of biological data, but proper use of its methods can be challenging for experimentalists. This Review gives an overview of key techniques and discusses best practices and points to consider when embarking on experiments involving machine learning.

Abstract: The expanding scale and inherent complexity of biological data have encouraged a growing use of machine learning in biology to build informative and predictive models of the underlying biological processes. All machine learning techniques fit models to data; however, the specific methods are quite varied and can at first glance seem bewildering. In this Review, we aim to provide readers with a gentle introduction to a few key machine learning techniques, including the most recently developed and widely used techniques involving deep neural networks. We describe how different techniques may be suited to specific types of biological data, and also discuss some best practices and points to consider when one is embarking on experiments involving machine learning. Some emerging directions in machine learning methodology are also discussed. Machine learning is becoming a widely used tool for the analysis of biological data. However, for experimentalists, proper use of machine learning methods can be challenging. This Review provides an overview of machine learning techniques and provides guidance on their applications in biology.

325 citations

Journal Article•DOI•

Towards the sustainable discovery and development of new antibiotics

Marcus Miethke¹, Marco Pieroni², Tilmann Weber³, Mark Brönstrup, Peter Hammann⁴, Ludovic Halby⁵, Paola B. Arimondo⁵, Philippe Glaser⁵, Bertrand Aigle⁶, Helge B. Bode⁷,⁸, Rui Moreira⁹, Yanyan Li¹⁰, Andriy Luzhetskyy¹, Marnix H. Medema¹¹, Jean-Luc Pernodet¹², Marc Stadler, José R. Tormo, Olga Genilloud, Andrew W. Truman¹³, Kira J. Weissman⁶, Eriko Takano¹⁴, Stefano Sabatini¹⁵, Evi Stegmann¹⁶, Heike Brötz-Oesterhelt¹⁶, Wolfgang Wohlleben¹⁶, Myriam Seemann¹⁷, Martin Empting¹, Anna K. H. Hirsch¹, Brigitta Loretz¹, Claus-Michael Lehr¹, Alexander Titz¹, Jennifer Herrmann¹, Timo Jaeger, Silke Alt, Thomas Hesterkamp, Mathias Winterhalter¹⁸, Andrea Schiefer¹⁹, Kenneth Pfarr¹⁹, Achim Hoerauf¹⁹, Heather Graz, Michael Graz²⁰, Mika Lindvall, Savithri Ramurthy, Anders Karlén²¹, Maarten van Dongen, Hrvoje Petković²², Andreas Keller¹, Frédéric Peyrane, Stefano Donadio, Laurent Fraisse²³, Laura J. V. Piddock, Ian H. Gilbert²⁴, Heinz E. Moser²⁵, Rolf Müller¹

Saarland University¹, University of Parma², Technical University of Denmark³, University of Giessen⁴, Pasteur Institute⁵, University of Lorraine⁶, Goethe University Frankfurt⁷, Max Planck Society⁸, University of Lisbon⁹, National Museum of Natural History¹⁰, Wageningen University and Research Centre¹¹, University of Paris¹², John Innes Centre¹³, University of Manchester¹⁴, University of Perugia¹⁵, University of Tübingen¹⁶, University of Strasbourg¹⁷, Jacobs University Bremen¹⁸, University Hospital Bonn¹⁹, University of Bristol²⁰, Uppsala University²¹, University of Ljubljana²², Drugs for Neglected Diseases Initiative²³, University of Dundee²⁴, Novartis²⁵

19 Aug 2021

TL;DR: In this paper, the authors present a strategic blueprint to substantially improve our ability to discover and develop new antibiotics, and propose both short-term and long-term solutions to overcome the most urgent limitations in the various sectors of research and funding.

Abstract: An ever-increasing demand for novel antimicrobials to treat life-threatening infections caused by the global spread of multidrug-resistant bacterial pathogens stands in stark contrast to the current level of investment in their development, particularly in the fields of natural-product-derived and synthetic small molecules. New agents displaying innovative chemistry and modes of action are desperately needed worldwide to tackle the public health menace posed by antimicrobial resistance. Here, our consortium presents a strategic blueprint to substantially improve our ability to discover and develop new antibiotics. We propose both short-term and long-term solutions to overcome the most urgent limitations in the various sectors of research and funding, aiming to bridge the gap between academic, industrial and political stakeholders, and to unite interdisciplinary expertise in order to efficiently fuel the translational pipeline for the benefit of future generations.

255 citations

References

Journal Article•DOI•

Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2

Michael I. Love¹,², Wolfgang Huber, Simon Anders

Harvard University¹, Max Planck Society²

05 Dec 2014-Genome Biology

TL;DR: This work presents DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates, which enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression.

Abstract: In comparative high-throughput sequencing assays, a fundamental task is the analysis of count data, such as read counts per gene in RNA-seq, for evidence of systematic changes across experimental conditions. Small replicate numbers, discreteness, large dynamic range and the presence of outliers require a suitable statistical approach. We present DESeq2, a method for differential analysis of count data, using shrinkage estimation for dispersions and fold changes to improve stability and interpretability of estimates. This enables a more quantitative analysis focused on the strength rather than the mere presence of differential expression. The DESeq2 package is available at http://www.bioconductor.org/packages/release/bioc/html/DESeq2.html .

47,038 citations

Journal Article•DOI•

Fast and accurate short read alignment with Burrows–Wheeler transform

Heng Li¹, Richard Durbin¹

Wellcome Trust Sanger Institute¹

01 Jul 2009-Bioinformatics

TL;DR: The Burrows-Wheeler Alignment tool (BWA) is a read alignment package based on backward search with the Burrows–Wheeler Transform (BWT) that efficiently aligns short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.

Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]
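The backward search mentioned in the abstract can be illustrated with a minimal sketch. This is not BWA's implementation: it handles exact matches only (no mismatches or gaps), builds the suffix array by naive sorting, and the names `build_index` and `backward_search` are hypothetical helpers introduced here for illustration.

```python
def build_index(text):
    """Build a toy BWT index: suffix array, C table, and occurrence table."""
    text += "$"  # sentinel; sorts before A/C/G/T in ASCII
    # Naive suffix array construction (fine for a short illustrative string).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)

    # C[c] = number of characters in the text strictly smaller than c.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, C, occ


def backward_search(pattern, sa, C, occ):
    """Return sorted start positions of exact matches of pattern."""
    lo, hi = 0, len(sa)  # half-open interval of suffix array rows
    for ch in reversed(pattern):  # consume the pattern right to left
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_index("GATTACATTAC")
print(backward_search("TTAC", *index))
```

Each pattern character narrows the suffix array interval in constant time given the tables, so the search cost grows with the read length rather than with the reference length.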

43,862 citations

Additional excerpts

...REAGENT or RESOURCE | SOURCE | IDENTIFIER
tgcaaaataatatgcaccacgacggcggtcagaaaaataa | This study | AB5046
gaagcgttacttcgcgatctgatcaacgattcgtggaatc | This study | AB5047
Software and Algorithms
Chemprop | Yang et al., 2019b | https://github.com/swansonk14/chemprop
RDKit | Landrum, 2006 | https://github.com/rdkit
BWA | Li and Durbin, 2009 | https://github.com/lh3/bwa
DESeq2 | Love et al., 2014 | https://bioconductor.org/packages/release/bioc/html/DESeq2.html
edgeR | Robinson et al., 2010 | https://bioconductor.org/packages/release/bioc/html/edgeR.html
GenomeView | Abeel et al., 2012 | https://genomeview.org
EcoCyc Pathway Tools | Keseler et al., 2013 | https://ecocyc.org...

Journal Article•DOI•

edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.

Mark D. Robinson¹, Davis J. McCarthy¹, Gordon K. Smyth¹

Walter and Eliza Hall Institute of Medical Research¹

01 Jan 2010-Bioinformatics

TL;DR: edgeR is a Bioconductor software package for examining differential expression of replicated count data. It uses an overdispersed Poisson model to account for both biological and technical variability, and empirical Bayes methods to moderate the degree of overdispersion across transcripts, improving the reliability of inference.

Abstract: Summary: It is expected that emerging digital gene expression (DGE) technologies will overtake microarray technologies in the near future for many functional genomics applications. One of the fundamental data analysis tasks, especially for gene expression studies, involves determining whether there is evidence that counts for a transcript or exon are significantly different across experimental conditions. edgeR is a Bioconductor software package for examining differential expression of replicated count data. An overdispersed Poisson model is used to account for both biological and technical variability. Empirical Bayes methods are used to moderate the degree of overdispersion across transcripts, improving the reliability of inference. The methodology can be used even with the most minimal levels of replication, provided at least one phenotype or experimental condition is replicated. The software may have other applications beyond sequencing data, such as proteome peptide count data. Availability: The package is freely available under the LGPL licence from the Bioconductor web site (http://bioconductor.org).

29,413 citations

Journal Article•DOI•

One-step inactivation of chromosomal genes in Escherichia coli K-12 using PCR products

Kirill A. Datsenko¹, Barry L. Wanner

Purdue University¹

06 Jun 2000-Proceedings of the National Academy of Sciences of the United States of America

TL;DR: A simple and highly efficient method to disrupt chromosomal genes in Escherichia coli in which PCR primers provide the homology to the targeted gene(s), which should be widely useful, especially in genome analysis of E. coli and other bacteria.

Abstract: We have developed a simple and highly efficient method to disrupt chromosomal genes in Escherichia coli in which PCR primers provide the homology to the targeted gene(s). In this procedure, recombination requires the phage lambda Red recombinase, which is synthesized under the control of an inducible promoter on an easily curable, low copy number plasmid. To demonstrate the utility of this approach, we generated PCR products by using primers with 36- to 50-nt extensions that are homologous to regions adjacent to the gene to be inactivated and template plasmids carrying antibiotic resistance genes that are flanked by FRT (FLP recognition target) sites. By using the respective PCR products, we made 13 different disruptions of chromosomal genes. Mutants of the arcB, cyaA, lacZYA, ompR-envZ, phnR, pstB, pstCA, pstS, pstSCAB-phoU, recA, and torSTRCAD genes or operons were isolated as antibiotic-resistant colonies after the introduction into bacteria carrying a Red expression plasmid of synthetic (PCR-generated) DNA. The resistance genes were then eliminated by using a helper plasmid encoding the FLP recombinase which is also easily curable. This procedure should be widely useful, especially in genome analysis of E. coli and other bacteria because the procedure can be done in wild-type cells.

14,389 citations

Journal Article•DOI•

Extended-Connectivity Fingerprints

David Rogers¹, Mathew Hahn¹

Symyx Technologies¹

28 Apr 2010-Journal of Chemical Information and Modeling

TL;DR: Extended-connectivity fingerprints (ECFPs) are circular topological fingerprints developed for structure-activity modeling. They can be calculated very rapidly and can represent an essentially infinite number of different molecular features; this paper gives the first published description of their implementation.

Abstract: Extended-connectivity fingerprints (ECFPs) are a novel class of topological fingerprints for molecular characterization. Historically, topological fingerprints were developed for substructure and similarity searching. ECFPs were developed specifically for structure−activity modeling. ECFPs are circular fingerprints with a number of useful qualities: they can be very rapidly calculated; they are not predefined and can represent an essentially infinite number of different molecular features (including stereochemical information); their features represent the presence of particular substructures, allowing easier interpretation of analysis results; and the ECFP algorithm can be tailored to generate different types of circular fingerprints, optimized for different uses. While the use of ECFPs has been widely adopted and validated, a description of their implementation has not previously been presented in the literature.

4,173 citations

"A Deep Learning Approach to Antibiotic Discovery" cites this paper for background

...An important development relates to how molecules are represented; traditionally, molecules were represented by their fingerprint vectors, which reflected the presence or absence of functional groups in the molecule, or by descriptors that include computable molecular properties and require expert knowledge to construct (Mauri et al., 2006; Moriwaki et al., 2018; Rogers and Hahn, 2010)....
[...]
...Excitingly, we observed that halicin resulted in C. difficile clearance at a greater rate than vehicle or the antibiotic metronidazole (Figure 5F), which is not only a first-line treatment for C. difficile infection, but also the antibiotic most similar to halicin based on Tanimoto score (Figure 2H; Table S2H)....
[...]
...Excitingly, halicin, which is structurally most similar to a family of nitro-containing antiparasitic compounds (Tanimoto similarity 0.37; Figures 2G and 2H; Table S2H) (Rogers and Hahn, 2010) and the antibiotic metronidazole (Tanimoto similarity 0.21), displayed excellent growth inhibitory activity against E. coli, achieving a minimum inhibitory concentration (MIC) of 2 μg/mL (Figure 2I)....
[...]
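The Tanimoto similarity quoted in the excerpts above compares two molecular fingerprints by the fraction of "on" bits they share. A minimal sketch follows, assuming each fingerprint is given as a set of on-bit indices. The bit sets below are made-up illustrative values, not the fingerprints of halicin or any real compound, and `tanimoto` is a hypothetical helper name. In practice a cheminformatics toolkit such as RDKit would typically generate the fingerprints and compute the score.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    if not a and not b:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical on-bit indices for two fingerprints (illustrative only).
fp_query = {3, 17, 42, 128, 511}
fp_reference = {3, 42, 99, 511, 700, 901}

# 3 shared bits out of 8 distinct bits overall.
print(tanimoto(fp_query, fp_reference))
```

A score of 1.0 means identical on-bit sets and 0.0 means no shared bits, so low values such as those quoted for halicin indicate structural divergence from the reference compounds.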