
MoleculeNet: a benchmark for molecular machine learning

Zhenqin Wu,a Bharath Ramsundar,b Evan N. Feinberg,§c Joseph Gomes,§a Caleb Geniesse,c Aneesh S. Pappu,b Karl Leswingd and Vijay Pande*a
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm.
1 Introduction
Overlap between chemistry and statistical learning has had a long history. The field of cheminformatics has been utilizing machine learning methods in chemical modeling (e.g. quantitative structure–activity relationships, QSAR) for decades.[1–6] In the recent 10 years, with the advent of sophisticated deep learning methods,[7,8] machine learning has gathered increasing amounts of attention from the scientific community. Data-driven analysis has become a routine step in many chemical and biological applications, including virtual screening,[9–12] chemical property prediction,[13–16] and quantum chemistry calculations.[17–20]
In many such applications, machine learning has shown strong potential to compete with or even outperform conventional ab initio computations.[16,18] It follows that introduction of novel machine learning methods has the potential to reshape research on properties of molecules. However, this potential has been limited by the lack of a standard evaluation platform for proposed machine learning algorithms. Algorithmic papers often benchmark proposed methods on disjoint dataset collections, making it a challenge to gauge whether a proposed technique does in fact improve performance.
Data for molecule-based machine learning tasks are highly heterogeneous and expensive to gather. Obtaining precise and accurate results for chemical properties typically requires specialized instruments as well as expert supervision (contrast with computer speech and vision, where lightly trained workers can annotate data suitable for machine learning systems). As a result, molecular datasets are usually much smaller than those available for other machine learning tasks. Furthermore, the breadth of chemical research means our interests with respect to a molecule may range from quantum characteristics to measured impacts on the human body. Molecular machine learning methods have to be capable of learning to predict this very broad range of properties. Complicating this challenge, input molecules can have arbitrary size and components, highly variable connectivity and many three dimensional conformers (three dimensional molecular shapes). To transform molecules into a form suitable for conventional machine learning algorithms (that usually accept fixed length input), we have to extract useful and related information from a molecule into a fixed dimensional representation (a process called featurization).[21–23]
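To make the idea of featurization concrete, here is a deliberately toy sketch: a fixed-length vector built by counting element symbols in a SMILES string. This is not one of the featurizations shipped with MoleculeNet (those, such as ECFP and graph convolutions, are far richer); the function name and element vocabulary are illustrative assumptions.

```python
ELEMENTS = ["C", "N", "O", "S", "F", "Cl", "Br"]  # fixed, ordered vocabulary

def toy_featurize(smiles: str) -> list:
    """Count element symbols in a SMILES string; two-letter symbols first.
    Aromatic atoms (lowercase, e.g. 'c') are folded into their element."""
    counts = dict.fromkeys(ELEMENTS, 0)
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in counts:          # two-letter symbols: Cl, Br
            counts[smiles[i:i + 2]] += 1
            i += 2
        elif smiles[i].upper() in counts:      # C, N, O, S, F (and c, n, o, s)
            counts[smiles[i].upper()] += 1
            i += 1
        else:                                  # bonds, ring digits, branches
            i += 1
    return [counts[e] for e in ELEMENTS]

# Molecules of any size map to the same fixed dimension:
print(toy_featurize("CCO"))                  # ethanol -> [2, 0, 1, 0, 0, 0, 0]
print(toy_featurize("CC(=O)Nc1ccc(O)cc1"))   # paracetamol
```

The point of the sketch is only that molecules of arbitrary size all map to the same fixed dimension; real featurizers parse molecular graphs properly rather than scanning strings.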
a Department of Chemistry, Stanford University, Stanford, CA 94305, USA. E-mail: pande@stanford.edu
b Department of Computer Science, Stanford University, Stanford, CA 94305, USA
c Program in Biophysics, Stanford School of Medicine, Stanford, CA 94305, USA
d Schrodinger Inc., USA
Electronic supplementary information (ESI) available. See DOI: 10.1039/c7sc02664a
Joint first authorship.
§ Joint second authorship.
Cite this: Chem. Sci., 2018, 9, 513
Received 15th June 2017; Accepted 30th October 2017
DOI: 10.1039/c7sc02664a
rsc.li/chemical-science
Published on 31 October 2017. This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence.

To put it simply, building machine learning models on molecules requires overcoming several key issues: limited amounts of data, wide ranges of outputs to predict, large heterogeneity in input molecular structures and appropriate learning algorithms. Therefore, this work aims to facilitate the development of molecular machine learning methods by curating a number of dataset collections, creating a suite of software that implements many known featurizations of molecules, and providing high quality implementations of many previously proposed algorithms. Following the footsteps of WordNet[24] and ImageNet,[25] we call our suite MoleculeNet, a benchmark collection for molecular machine learning.
In machine learning, a benchmark serves as more than a simple collection of data and methods. The introduction of the ImageNet benchmark in 2009 triggered a series of breakthroughs in computer vision, and in particular facilitated the rapid development of deep convolutional networks. The ILSVRC, an annual contest held by the ImageNet team,[26] draws considerable attention from the community, and greatly stimulates collaborations and competitions across the field. The contest has given rise to a series of prominent machine learning models such as AlexNet,[27] GoogLeNet[28] and ResNet,[29] which have had broad impact on the academic and industrial computer science communities. We hope that MoleculeNet will trigger similar breakthroughs by serving as a platform for the wider community to develop and improve models for learning molecular properties.
In particular, MoleculeNet contains data on the properties of over 700 000 compounds. All datasets have been curated and integrated into the open source DeepChem package.[30] Users of DeepChem can easily load all MoleculeNet benchmark data through provided library calls. MoleculeNet also contributes high quality implementations of well known (bio)chemical featurization methods. To facilitate comparison and development of new methods, we also provide high quality implementations of several previously proposed machine learning methods. Our implementations are integrated with DeepChem, and depend on Scikit-Learn[31] and TensorFlow[32] under the hood. Finally, evaluation of machine learning algorithms requires defined methods to split datasets into training/validation/test collections. Random splitting, common in machine learning, is often not correct for chemical data.[33] MoleculeNet contributes a library of splitting mechanisms to DeepChem and evaluates all algorithms with multiple choices of data split. MoleculeNet provides a series of benchmark results of implemented machine learning algorithms using various featurizations and splits on our dataset collections. These results are provided within this paper, and will be maintained online in an ongoing fashion as part of DeepChem.
The related work section will review prior work in the chemistry community on gathering curated datasets and discuss how MoleculeNet differs from these previous efforts. The methods section reviews the dataset collections, metrics, featurization methods, and machine learning models included as part of MoleculeNet. The results section will analyze the benchmarking results to draw conclusions about the algorithms and datasets considered.
2 Related work
MoleculeNet draws upon a broader movement within the
chemical community to gather large sources of curated data.
PubChem
34
and PubChem BioAssasy
35
gather together thou-
sands of bioassay results, along with millions of unique
molecules tested within these assays. The ChEMBL database
o er s a similar service, with millions of bioactivity outcomes
across thousands of protein targets. Both PubChem and
ChEMBL are human researcher oriented, with web portals that
facilitate browsing of the available targets and compounds.
ChemSpider is a repository of nearly 60 million chemical
structures, with web based search capabilities for users. The
Crystallography Open Database
36
and Cambridge Structural
Database
37
o er large repositories of o rganic and i norganic
compounds. The protein data bank
38
o ers a repository of
experimentally resolved three dimensional protein structures.
This listing is by no means comprehensive; the methods
section will discuss a number of smaller data sources in
greater detail.
These past efforts have been critical in enabling the growth of computational chemistry. However, these previous databases are not machine-learning focused. In particular, these collections don't define metrics which measure the effectiveness of algorithmic methods in understanding the data contained. Furthermore, there is no prescribed separation of the data into training/validation/test sets (critical for machine learning development). Without specified metrics or splits, the choice is left to individual researchers, and there are indeed many chemical machine learning papers which use subsets of these data stores for machine learning evaluation. Unfortunately, the choice of metric and subset varies widely between groups, so two methods papers using PubChem data may be entirely incomparable. MoleculeNet aims to bridge this gap by providing benchmark results for a reasonable range of metrics, splits, and subsets of these (and other) data collections.
It's important to note that there have been some efforts to create benchmarking datasets for machine learning in chemistry. The Quantum Machine group[39] and previous work on multitask learning[10] both introduce benchmarking collections which have been used in multiple papers. MoleculeNet incorporates data from both these efforts and significantly expands upon them.
3 Methods
MoleculeNet is based on the open source package DeepChem.
30
Fig. 1 shows an annotated DeepChem benchmark script. Note
how dierent choices for data splitting, featurization, and
model are available. DeepChem also directly provides molnet
sub-module to support benchmarking. The single line below
runs benchmarking on the specied dataset, model and fea-
turizer. User dened models capable of handling DeepChem
datasets are also supported.
deepchem.molnet.run_benchmark (datasets, model, split,
featurizer)
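In outline, a benchmark driver of this kind splits each dataset, fits a model on the training portion only, and scores it on the held-out portions. The pure-Python sketch below illustrates that flow with hypothetical names; it is not the actual DeepChem implementation.

```python
def run_benchmark_sketch(datasets, split_fn, train_fn, metric_fn):
    """Hypothetical outline of a benchmark loop: split, train, score."""
    results = {}
    for name, (X, y) in datasets.items():
        (Xtr, ytr), (Xva, yva), (Xte, yte) = split_fn(X, y)
        model = train_fn(Xtr, ytr)                 # fit on training set only
        results[name] = {
            "valid": metric_fn(model, Xva, yva),   # for hyperparameter tuning
            "test": metric_fn(model, Xte, yte),    # reported once at the end
        }
    return results

# Tiny usage example with a mean-predictor "model" and an MAE metric:
def split_80_10_10(X, y):
    n = len(X)
    a, b = int(0.8 * n), int(0.9 * n)
    return (X[:a], y[:a]), (X[a:b], y[a:b]), (X[b:], y[b:])

def train_mean(Xtr, ytr):
    mean = sum(ytr) / len(ytr)
    return lambda x: mean          # predict the training mean everywhere

def mae(model, X, y):
    return sum(abs(model(x) - t) for x, t in zip(X, y)) / len(y)

data = {"toy": (list(range(10)), [float(i) for i in range(10)])}
print(run_benchmark_sketch(data, split_80_10_10, train_mean, mae))
```

The key discipline the loop encodes is that validation scores guide tuning while test scores are reported only once, which is how the MoleculeNet results in this paper are produced.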

In this section, we will further elaborate on the benchmarking system, introducing available datasets as well as implemented splitting, metrics, featurization, and learning methods.

3.1 Datasets
MoleculeNet is built upon multiple public databases. The full collection currently includes over 700 000 compounds tested on a range of different properties. These properties can be subdivided into four categories: quantum mechanics, physical chemistry, biophysics and physiology. As illustrated in Fig. 2, separate datasets in the MoleculeNet collection cover various levels of molecular properties, ranging from molecular-level properties to macroscopic influences on the human body. For each dataset, we propose a metric and a splitting pattern (introduced in the following text) that best fit the properties of the dataset. Performance on the recommended metric and split is reported in the results section.
In most datasets, SMILES strings[40] are used to represent input molecules; 3D coordinates are also included in part of the collection as molecular features, which enables different methods to be applied. Properties, or output labels, are either 0/1 for classification tasks, or floating point numbers for regression tasks. At the time of writing, MoleculeNet contains 17 datasets prepared and benchmarked, but we anticipate adding further datasets in an on-going fashion. We also highly welcome contributions from other public data collections. For more detailed dataset structure requirements and instructions on curating datasets, please refer to the tutorial notebook in the examples folder of the DeepChem GitHub repository.
Table 1 lists details of datasets in the collection, including tasks, compounds and their features, recommended splits and metrics. The contents of each dataset are elaborated in this subsection; function calls to access the datasets can be found in the ESI.
Fig. 1 Example code for benchmark evaluation with DeepChem; multiple methods are provided for data splitting, featurization and learning.
Fig. 2 Tasks in different datasets focus on different levels of properties of molecules.

3.1.1 QM7/QM7b. The QM7/QM7b datasets are subsets of the GDB-13 database,[41] a database of nearly 1 billion stable and synthetically accessible organic molecules, containing up to seven heavy atoms (C, N, O, S). The 3D Cartesian coordinates of the most stable conformation and electronic properties (atomization energy, HOMO/LUMO eigenvalues, etc.) of each molecule were determined using ab initio density functional theory (PBE0/tier2 basis set).[17,18] Learning methods benchmarked on QM7/QM7b are responsible for predicting these electronic properties given stable conformational coordinates. To obtain more stable performance and better comparison, we recommend stratified splitting (introduced in the next subsection) for QM7.
3.1.2 QM8. The QM8 dataset comes from a recent study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules.[42] Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database[43]). In total, four excited state properties are calculated by three different methods on 22 thousand samples.
3.1.3 QM9. QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of the GDB-17 database,[43] comprising 134 thousand stable organic molecules with up to nine heavy atoms.[44] All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT). In our benchmark, geometric properties (atomic coordinates) are integrated into features, which are then applied to predict other properties.
The datasets introduced above (QM7, QM7b, QM8, QM9) were curated as part of the Quantum-Machine effort,[39] which has processed a number of datasets to measure the efficacy of machine-learning methods for quantum chemistry.
3.1.4 ESOL. ESOL is a small dataset consisting of water solubility data for 1128 compounds.[13] The dataset has been used to train models that estimate solubility directly from chemical structures (as encoded in SMILES strings).[22] Note that these structures don't include 3D coordinates, since solubility is a property of a molecule and not of its particular conformers.
3.1.5 FreeSolv. The Free Solvation Database (FreeSolv) provides experimental and calculated hydration free energies of small molecules in water.[16] A subset of the compounds in the dataset are also used in the SAMPL blind prediction challenge.[15] The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. We include the experimental values in the benchmark collection, and use calculated values for comparison.
3.1.6 Lipophilicity. Lipophilicity is an important property of drug molecules that affects both membrane permeability and solubility. This dataset, curated from the ChEMBL database,[45] provides experimental results for the octanol/water distribution coefficient (log D at pH 7.4) of 4200 compounds.
3.1.7 PCBA. PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening.[35] We use a subset of PCBA, containing 128 bioassays measured over 400 thousand compounds, used by previous work to benchmark machine learning methods.[10]
3.1.8 MUV. The Maximum Unbiased Validation (MUV) group is another benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis.[46] The MUV dataset contains 17 challenging tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques.
3.1.9 HIV. The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40 000 compounds.[47] Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM). As we are more interested in discovering new categories of HIV inhibitors, scaffold splitting (introduced in the next subsection) is recommended for this dataset.
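The label merge described above can be sketched as a simple category-to-binary mapping (illustrative code; the function name is hypothetical):

```python
# CI -> inactive (0); CA and CM -> active (1), per the DTP screen categories.
LABEL_MAP = {"CI": 0, "CA": 1, "CM": 1}

def binarize(categories):
    return [LABEL_MAP[c] for c in categories]

print(binarize(["CI", "CA", "CM", "CI"]))  # -> [0, 1, 1, 0]
```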
Table 1 Dataset details: number of compounds and tasks, recommended splits and metrics

| Category | Dataset | Data type | Tasks | Task type | Compounds | Rec. split | Rec. metric |
|---|---|---|---|---|---|---|---|
| Quantum mechanics | QM7 | SMILES, 3D coordinates | 1 | Regression | 7165 | Stratified | MAE |
| Quantum mechanics | QM7b | 3D coordinates | 14 | Regression | 7211 | Random | MAE |
| Quantum mechanics | QM8 | SMILES, 3D coordinates | 12 | Regression | 21 786 | Random | MAE |
| Quantum mechanics | QM9 | SMILES, 3D coordinates | 12 | Regression | 133 885 | Random | MAE |
| Physical chemistry | ESOL | SMILES | 1 | Regression | 1128 | Random | RMSE |
| Physical chemistry | FreeSolv | SMILES | 1 | Regression | 643 | Random | RMSE |
| Physical chemistry | Lipophilicity | SMILES | 1 | Regression | 4200 | Random | RMSE |
| Biophysics | PCBA | SMILES | 128 | Classification | 439 863 | Random | PRC-AUC |
| Biophysics | MUV | SMILES | 17 | Classification | 93 127 | Random | PRC-AUC |
| Biophysics | HIV | SMILES | 1 | Classification | 41 913 | Scaffold | ROC-AUC |
| Biophysics | PDBbind | SMILES, 3D coordinates | 1 | Regression | 11 908 | Time | RMSE |
| Biophysics | BACE | SMILES | 1 | Classification | 1522 | Scaffold | ROC-AUC |
| Physiology | BBBP | SMILES | 1 | Classification | 2053 | Scaffold | ROC-AUC |
| Physiology | Tox21 | SMILES | 12 | Classification | 8014 | Random | ROC-AUC |
| Physiology | ToxCast | SMILES | 617 | Classification | 8615 | Random | ROC-AUC |
| Physiology | SIDER | SMILES | 27 | Classification | 1427 | Random | ROC-AUC |
| Physiology | ClinTox | SMILES | 2 | Classification | 1491 | Random | ROC-AUC |

3.1.10 PDBbind. PDBbind is a comprehensive database of experimentally measured binding affinities for bio-molecular complexes.[48,49] Unlike other ligand-based biological activity datasets, in which only the structures of ligands are provided, PDBbind provides detailed 3D Cartesian coordinates of both ligands and their target proteins derived from experimental (e.g., X-ray crystallography) measurements. The availability of coordinates of the protein–ligand complexes permits structure-based featurization that is aware of the protein–ligand binding geometry. We use the "refined" and "core" subsets of the database,[50] more carefully processed for data artifacts, as additional benchmarking targets. Samples in the PDBbind dataset are collected over a relatively long period of time (since 1982), hence a time splitting pattern (introduced in the next subsection) is recommended to mimic actual development in the field.
3.1.11 BACE. The BACE dataset provides quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1).[51] All data are experimental values reported in the scientific literature over the past decade, some with detailed crystal structures available. We merged a collection of 1522 compounds with their 2D structures and binary labels in MoleculeNet, built as a classification task. Similarly, as the dataset concerns a single protein target, scaffold splitting will be more practically useful.
3.1.12 BBBP. The blood–brain barrier penetration (BBBP) dataset comes from a recent study[52] on the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood–brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier forms a long-standing issue in the development of drugs targeting the central nervous system. This dataset includes binary labels for over 2000 compounds on their permeability properties. Scaffold splitting is also recommended for this well-defined target.
3.1.13 Tox21. The Toxicology in the 21st Century (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge.[53] This dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways.
3.1.14 ToxCast. ToxCast is another data collection (from the same initiative as Tox21) providing toxicology data for a large library of compounds based on in vitro high-throughput screening.[54] The processed collection in MoleculeNet includes qualitative results of over 600 experiments on 8615 compounds.
3.1.15 SIDER. The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR).[55] The version of the SIDER dataset in DeepChem[56] groups drug side effects into 27 system organ classes following MedDRA classifications,[57] measured for 1427 approved drugs (following previous usage[56]).
3.1.16 ClinTox. The ClinTox dataset, introduced as part of this work, compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons.[58,59] The dataset includes two classification tasks for 1491 drug compounds with known chemical structures: (1) clinical trial toxicity (or absence of toxicity) and (2) FDA approval status. The list of FDA-approved drugs is compiled from the SWEETLEAD database,[60] and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.[61]
3.2 Dataset splitting
Typical machine learning methods require datasets to be split into training/validation/test subsets (or alternatively into K-folds) for benchmarking. All MoleculeNet datasets are split into training, validation and test sets, following an 80/10/10 ratio. Training sets are used to train models, validation sets are used for tuning hyperparameters, and test sets are used for evaluation of models.
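The 80/10/10 random split can be sketched in a few lines of pure Python (an illustrative version with a fixed seed for reproducibility, not the DeepChem splitter itself):

```python
import random

def random_split(items, frac_valid=0.1, frac_test=0.1, seed=0):
    """Shuffle indices with a fixed seed, then carve off test and valid."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_test = int(frac_test * len(items))
    n_valid = int(frac_valid * len(items))
    test = [items[i] for i in idx[:n_test]]
    valid = [items[i] for i in idx[n_test:n_test + n_valid]]
    train = [items[i] for i in idx[n_test + n_valid:]]
    return train, valid, test

train, valid, test = random_split(list(range(100)))
print(len(train), len(valid), len(test))  # -> 80 10 10
```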
As mentioned previously, random splitting of molecular data isn't always best for evaluating machine learning methods. Consequently, MoleculeNet implements multiple different splittings for each dataset (Fig. 3). Random splitting randomly splits samples into the training/validation/test subsets. Scaffold splitting splits the samples based on their two-dimensional structural frameworks,[62] as implemented in RDKit.[63] Since scaffold splitting attempts to separate structurally different molecules into different subsets, it offers a greater challenge for learning algorithms than the random split.
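In outline, scaffold splitting groups molecules by their scaffold and assigns whole groups to subsets, so no scaffold spans both train and test. The sketch below assumes scaffold strings have already been computed (e.g. with RDKit's Murcko scaffold utility); the function name, group-ordering heuristic, and example data are illustrative assumptions, not the DeepChem implementation.

```python
from collections import defaultdict

def scaffold_split(mols, scaffolds, frac_train=0.8, frac_valid=0.1):
    """Group indices by scaffold key; assign whole groups, largest first."""
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mols)
    train, valid, test = [], [], []
    for g in ordered:
        if len(train) + len(g) <= frac_train * n:
            train += g
        elif len(valid) + len(g) <= frac_valid * n:
            valid += g
        else:
            test += g
    return train, valid, test

mols = ["m%d" % i for i in range(10)]
scaffolds = ["A", "A", "A", "A", "B", "B", "C", "C", "D", "E"]
train, valid, test = scaffold_split(mols, scaffolds)
print(train, valid, test)
```

Because each scaffold group is kept intact, the test set contains only scaffolds never seen in training, which is what makes this split harder than the random one.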
In addition, a stratified random sampling method is implemented on the QM7 dataset to reproduce the results from the original work.[18] This method sorts datapoints in order of increasing label value (note this is only defined for real-valued output). The sorted list is then split into training/validation/test subsets such that each set contains the full range of provided labels. Time splitting is also adopted for the dataset that includes time information (PDBbind). Under this splitting method, the model is trained on older data and tested on newer data, mimicking real world development conditions.
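The stratified and time splits can be sketched as follows (illustrative pure-Python versions with hypothetical names; the window size of 10 giving an 8/1/1 deal is an assumption chosen to match the 80/10/10 ratio):

```python
def stratified_split(labels):
    """Sort by label, then deal each window of 10 out as 8 train / 1 valid / 1 test,
    so every subset spans the full range of label values."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    train, valid, test = [], [], []
    for start in range(0, len(order), 10):
        window = order[start:start + 10]
        train += window[:8]
        valid += window[8:9]
        test += window[9:10]
    return train, valid, test

def time_split(years, frac_train=0.8, frac_valid=0.1):
    """Train on the oldest samples, test on the newest."""
    order = sorted(range(len(years)), key=lambda i: years[i])
    a = int(frac_train * len(years))
    b = int((frac_train + frac_valid) * len(years))
    return order[:a], order[a:b], order[b:]

# Stratified: validation and test each receive low and high labels.
tr, va, te = stratified_split([float(i) for i in range(20)])
print(sorted(va), sorted(te))

# Time: the most recent measurement ends up in the test set.
years = [1985, 1999, 2007, 2015, 1982, 2020, 1990, 2001, 2010, 2018]
tr2, va2, te2 = time_split(years)
print([years[i] for i in te2])
```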
MoleculeNet contributes the code for these splitting
methods into DeepChem. Users of the library can use these
splits on new datasets with short library calls.
3.3 Metrics
MoleculeNet contains both regression datasets (QM7, QM7b, QM8, QM9, ESOL, FreeSolv, Lipophilicity and PDBbind) and classification datasets (PCBA, MUV, HIV, BACE, BBBP, Tox21, ToxCast and SIDER). Consequently, different performance metrics need to be measured for each. Following suggestions from the community,[64] regression datasets are evaluated by mean absolute error (MAE) and root-mean-square error (RMSE), while classification datasets are evaluated by the area under curve (AUC) of the receiver operating characteristic (ROC) curve[65] and the precision–recall curve (PRC).[66] For datasets containing more than one task, we report the mean metric value over all tasks.
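As a concrete illustration of the classification metric, ROC-AUC can be computed from the rank (Mann–Whitney) formulation, and the multitask score is simply its per-task mean. The sketch below is illustrative pure Python (function names are hypothetical; in practice a library routine such as scikit-learn's would be used):

```python
def roc_auc(labels, scores):
    """Probability that a random positive is ranked above a random negative;
    ties receive half credit."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_multitask_auc(task_labels, task_scores):
    """Mean ROC-AUC over tasks, as reported for multitask datasets."""
    aucs = [roc_auc(y, s) for y, s in zip(task_labels, task_scores)]
    return sum(aucs) / len(aucs)

# Perfect ranking on task 1, imperfect ranking on task 2:
y1, s1 = [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]
y2, s2 = [0, 1, 0, 1], [0.6, 0.4, 0.3, 0.7]
print(mean_multitask_auc([y1, y2], [s1, s2]))
```

Averaging over tasks is why a single number can summarize datasets like Tox21 (12 tasks) or ToxCast (617 tasks), at the cost of hiding per-task variation.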
To allow better comparison, we propose regression metrics according to previous work on either the same models or datasets. For classification datasets, we propose recommended metrics chosen from the two commonly used metrics: AUC-PRC and AUC-ROC. Four representative sets of ROC curves and PRCs are depicted in Fig. 4, resulting from the predictions of logistic regression and graph convolutional models on four tasks. Details about these tasks and AUC values of all curves are listed in Table 2. Note that these four tasks have different class imbalances, represented as the number of positive samples and negative samples.
As noted in previous literature,[66] ROC curves and PRCs are highly correlated, but perform significantly differently in the case of highly imbalanced classification.
