
MoleculeNet: a benchmark for molecular machine learning

Zhenqin Wu,a Bharath Ramsundar,b Evan N. Feinberg,§c Joseph Gomes,§a Caleb Geniesse,c Aneesh S. Pappu,b Karl Leswingd and Vijay Pande*a
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm.
1 Introduction
Overlap between chemistry and statistical learning has had a long history. The field of cheminformatics has been utilizing machine learning methods in chemical modeling (e.g. quantitative structure–activity relationships, QSAR) for decades.[1–6] In the recent 10 years, with the advent of sophisticated deep learning methods,[7,8] machine learning has gathered increasing amounts of attention from the scientific community. Data-driven analysis has become a routine step in many chemical and biological applications, including virtual screening,[9–12] chemical property prediction,[13–16] and quantum chemistry calculations.[17–20]
In many such applications, machine learning has shown strong potential to compete with or even outperform conventional ab initio computations.[16,18] It follows that introduction of novel machine learning methods has the potential to reshape research on properties of molecules. However, this potential has been limited by the lack of a standard evaluation platform for proposed machine learning algorithms. Algorithmic papers often benchmark proposed methods on disjoint dataset collections, making it a challenge to gauge whether a proposed technique does in fact improve performance.
Data for molecule-based machine learning tasks are highly heterogeneous and expensive to gather. Obtaining precise and accurate results for chemical properties typically requires specialized instruments as well as expert supervision (contrast with computer speech and vision, where lightly trained workers can annotate data suitable for machine learning systems). As a result, molecular datasets are usually much smaller than those available for other machine learning tasks. Furthermore, the breadth of chemical research means our interests with respect to a molecule may range from quantum characteristics to measured impacts on the human body. Molecular machine learning methods have to be capable of learning to predict this very broad range of properties. Complicating this challenge, input molecules can have arbitrary size and components, highly variable connectivity and many three dimensional conformers (three dimensional molecular shapes). To transform molecules into a form suitable for conventional machine learning algorithms (that usually accept fixed length input), we have to extract useful and related information from a molecule into a fixed dimensional representation (a process called featurization).[21–23]
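To make the idea of featurization concrete, here is a deliberately toy sketch: a fixed-length vector built by counting element symbols in a SMILES string. This is not one of the featurizations shipped with MoleculeNet (those, such as ECFP and graph convolutions, are far richer); the function name and element vocabulary are illustrative assumptions.

```python
ELEMENTS = ["C", "N", "O", "S", "F", "Cl", "Br"]  # fixed, ordered vocabulary

def toy_featurize(smiles: str) -> list:
    """Count element symbols in a SMILES string; two-letter symbols first.
    Aromatic atoms (lowercase, e.g. 'c') are folded into their element."""
    counts = dict.fromkeys(ELEMENTS, 0)
    i = 0
    while i < len(smiles):
        if smiles[i:i + 2] in counts:          # two-letter symbols: Cl, Br
            counts[smiles[i:i + 2]] += 1
            i += 2
        elif smiles[i].upper() in counts:      # C, N, O, S, F (and c, n, o, s)
            counts[smiles[i].upper()] += 1
            i += 1
        else:                                  # bonds, ring digits, branches
            i += 1
    return [counts[e] for e in ELEMENTS]

# Molecules of any size map to the same fixed dimension:
print(toy_featurize("CCO"))                  # ethanol -> [2, 0, 1, 0, 0, 0, 0]
print(toy_featurize("CC(=O)Nc1ccc(O)cc1"))   # paracetamol
```

The point of the sketch is only that molecules of arbitrary size all map to the same fixed dimension; real featurizers parse molecular graphs properly rather than scanning strings.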
a Department of Chemistry, Stanford University, Stanford, CA 94305, USA. E-mail: pande@stanford.edu
b Department of Computer Science, Stanford University, Stanford, CA 94305, USA
c Program in Biophysics, Stanford School of Medicine, Stanford, CA 94305, USA
d Schrodinger Inc., USA
Electronic supplementary information (ESI) available. See DOI: 10.1039/c7sc02664a
Joint first authorship.
§ Joint second authorship.
Cite this: Chem. Sci., 2018, 9, 513
Received 15th June 2017; Accepted 30th October 2017
DOI: 10.1039/c7sc02664a
rsc.li/chemical-science
Published on 31 October 2017. This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence.

To put it simply, building machine learning models on molecules requires overcoming several key issues: limited amounts of data, wide ranges of outputs to predict, large heterogeneity in input molecular structures and appropriate learning algorithms. Therefore, this work aims to facilitate the development of molecular machine learning methods by curating a number of dataset collections, creating a suite of software that implements many known featurizations of molecules, and providing high quality implementations of many previously proposed algorithms. Following the footsteps of WordNet[24] and ImageNet,[25] we call our suite MoleculeNet, a benchmark collection for molecular machine learning.
In machine learning, a benchmark serves as more than a simple collection of data and methods. The introduction of the ImageNet benchmark in 2009 triggered a series of breakthroughs in computer vision, and in particular facilitated the rapid development of deep convolutional networks. The ILSVRC, an annual contest held by the ImageNet team,[26] draws considerable attention from the community, and greatly stimulates collaborations and competitions across the field. The contest has given rise to a series of prominent machine learning models such as AlexNet,[27] GoogLeNet[28] and ResNet,[29] which have had broad impact on the academic and industrial computer science communities. We hope that MoleculeNet will trigger similar breakthroughs by serving as a platform for the wider community to develop and improve models for learning molecular properties.
In particular, MoleculeNet contains data on the properties of over 700 000 compounds. All datasets have been curated and integrated into the open source DeepChem package.[30] Users of DeepChem can easily load all MoleculeNet benchmark data through provided library calls. MoleculeNet also contributes high quality implementations of well known (bio)chemical featurization methods. To facilitate comparison and development of new methods, we also provide high quality implementations of several previously proposed machine learning methods. Our implementations are integrated with DeepChem, and depend on Scikit-Learn[31] and TensorFlow[32] under the hood. Finally, evaluation of machine learning algorithms requires defined methods to split datasets into training/validation/test collections. Random splitting, common in machine learning, is often not correct for chemical data.[33] MoleculeNet contributes a library of splitting mechanisms to DeepChem and evaluates all algorithms with multiple choices of data split. MoleculeNet provides a series of benchmark results of implemented machine learning algorithms using various featurizations and splits on our dataset collections. These results are provided within this paper, and will be maintained online in an ongoing fashion as part of DeepChem.
The related work section will review prior work in the chemistry community on gathering curated datasets and discuss how MoleculeNet differs from these previous efforts. The methods section reviews the dataset collections, metrics, featurization methods, and machine learning models included as part of MoleculeNet. The results section will analyze the benchmarking results to draw conclusions about the algorithms and datasets considered.
2 Related work
MoleculeNet draws upon a broader movement within the
chemical community to gather large sources of curated data.
PubChem
34
and PubChem BioAssasy
35
gather together thou-
sands of bioassay results, along with millions of unique
molecules tested within these assays. The ChEMBL database
o er s a similar service, with millions of bioactivity outcomes
across thousands of protein targets. Both PubChem and
ChEMBL are human researcher oriented, with web portals that
facilitate browsing of the available targets and compounds.
ChemSpider is a repository of nearly 60 million chemical
structures, with web based search capabilities for users. The
Crystallography Open Database
36
and Cambridge Structural
Database
37
o er large repositories of o rganic and i norganic
compounds. The protein data bank
38
o ers a repository of
experimentally resolved three dimensional protein structures.
This listing is by no means comprehensive; the methods
section will discuss a number of smaller data sources in
greater detail.
These past efforts have been critical in enabling the growth of computational chemistry. However, these previous databases are not machine-learning focused. In particular, these collections don't define metrics which measure the effectiveness of algorithmic methods in understanding the data contained. Furthermore, there is no prescribed separation of the data into training/validation/test sets (critical for machine learning development). Without specified metrics or splits, the choice is left to individual researchers, and there are indeed many chemical machine learning papers which use subsets of these data stores for machine learning evaluation. Unfortunately, the choice of metric and subset varies widely between groups, so two methods papers using PubChem data may be entirely incomparable. MoleculeNet aims to bridge this gap by providing benchmark results for a reasonable range of metrics, splits, and subsets of these (and other) data collections.
It's important to note that there have been some efforts to create benchmarking datasets for machine learning in chemistry. The Quantum Machine group[39] and previous work on multitask learning[10] both introduce benchmarking collections which have been used in multiple papers. MoleculeNet incorporates data from both these efforts and significantly expands upon them.
3 Methods
MoleculeNet is based on the open source package DeepChem.
30
Fig. 1 shows an annotated DeepChem benchmark script. Note
how dierent choices for data splitting, featurization, and
model are available. DeepChem also directly provides molnet
sub-module to support benchmarking. The single line below
runs benchmarking on the specied dataset, model and fea-
turizer. User dened models capable of handling DeepChem
datasets are also supported.
deepchem.molnet.run_benchmark (datasets, model, split,
featurizer)
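In outline, a benchmark driver of this kind splits each dataset, fits a model on the training portion only, and scores it on the held-out portions. The pure-Python sketch below illustrates that flow with hypothetical names; it is not the actual DeepChem implementation.

```python
def run_benchmark_sketch(datasets, split_fn, train_fn, metric_fn):
    """Hypothetical outline of a benchmark loop: split, train, score."""
    results = {}
    for name, (X, y) in datasets.items():
        (Xtr, ytr), (Xva, yva), (Xte, yte) = split_fn(X, y)
        model = train_fn(Xtr, ytr)                 # fit on training set only
        results[name] = {
            "valid": metric_fn(model, Xva, yva),   # for hyperparameter tuning
            "test": metric_fn(model, Xte, yte),    # reported once at the end
        }
    return results

# Tiny usage example with a mean-predictor "model" and an MAE metric:
def split_80_10_10(X, y):
    n = len(X)
    a, b = int(0.8 * n), int(0.9 * n)
    return (X[:a], y[:a]), (X[a:b], y[a:b]), (X[b:], y[b:])

def train_mean(Xtr, ytr):
    mean = sum(ytr) / len(ytr)
    return lambda x: mean          # predict the training mean everywhere

def mae(model, X, y):
    return sum(abs(model(x) - t) for x, t in zip(X, y)) / len(y)

data = {"toy": (list(range(10)), [float(i) for i in range(10)])}
print(run_benchmark_sketch(data, split_80_10_10, train_mean, mae))
```

The key discipline the loop encodes is that validation scores guide tuning while test scores are reported only once, which is how the MoleculeNet results in this paper are produced.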

In this section, we will further elaborate on the benchmarking system, introducing available datasets as well as implemented splitting, metrics, featurization, and learning methods.

3.1 Datasets
MoleculeNet is built upon multiple public databases. The full collection currently includes over 700 000 compounds tested on a range of different properties. These properties can be subdivided into four categories: quantum mechanics, physical chemistry, biophysics and physiology. As illustrated in Fig. 2, separate datasets in the MoleculeNet collection cover various levels of molecular properties, ranging from molecular-level properties to macroscopic influences on the human body. For each dataset, we propose a metric and a splitting pattern (introduced in the following text) that best fit the properties of the dataset. Performance on the recommended metric and split is reported in the results section.
In most datasets, SMILES strings[40] are used to represent input molecules; 3D coordinates are also included in part of the collection as molecular features, which enables different methods to be applied. Properties, or output labels, are either 0/1 for classification tasks, or floating point numbers for regression tasks. At the time of writing, MoleculeNet contains 17 datasets prepared and benchmarked, but we anticipate adding further datasets in an on-going fashion. We also highly welcome contributions from other public data collections. For more detailed dataset structure requirements and instructions on curating datasets, please refer to the tutorial notebook in the examples folder of the DeepChem GitHub repository.
Table 1 lists details of datasets in the collection, including tasks, compounds and their features, recommended splits and metrics. The contents of each dataset are elaborated in this subsection; function calls to access the datasets can be found in the ESI.
Fig. 1 Example code for benchmark evaluation with DeepChem; multiple methods are provided for data splitting, featurization and learning.
Fig. 2 Tasks in different datasets focus on different levels of properties of molecules.

3.1.1 QM7/QM7b. The QM7/QM7b datasets are subsets of the GDB-13 database,[41] a database of nearly 1 billion stable and synthetically accessible organic molecules, containing up to seven heavy atoms (C, N, O, S). The 3D Cartesian coordinates of the most stable conformation and electronic properties (atomization energy, HOMO/LUMO eigenvalues, etc.) of each molecule were determined using ab initio density functional theory (PBE0/tier2 basis set).[17,18] Learning methods benchmarked on QM7/QM7b are responsible for predicting these electronic properties given stable conformational coordinates. To obtain more stable performance and better comparison, we recommend stratified splitting (introduced in the next subsection) for QM7.
3.1.2 QM8. The QM8 dataset comes from a recent study on modeling quantum mechanical calculations of electronic spectra and excited state energy of small molecules.[42] Multiple methods, including time-dependent density functional theories (TDDFT) and second-order approximate coupled-cluster (CC2), are applied to a collection of molecules that include up to eight heavy atoms (also a subset of the GDB-17 database[43]). In total, four excited state properties are calculated by three different methods on 22 thousand samples.
3.1.3 QM9. QM9 is a comprehensive dataset that provides geometric, energetic, electronic and thermodynamic properties for a subset of the GDB-17 database,[43] comprising 134 thousand stable organic molecules with up to nine heavy atoms.[44] All molecules are modeled using density functional theory (B3LYP/6-31G(2df,p) based DFT). In our benchmark, geometric properties (atomic coordinates) are integrated into features, which are then applied to predict other properties.
The datasets introduced above (QM7, QM7b, QM8, QM9) were curated as part of the Quantum-Machine effort,[39] which has processed a number of datasets to measure the efficacy of machine-learning methods for quantum chemistry.
3.1.4 ESOL. ESOL is a small dataset consisting of water solubility data for 1128 compounds.[13] The dataset has been used to train models that estimate solubility directly from chemical structures (as encoded in SMILES strings).[22] Note that these structures don't include 3D coordinates, since solubility is a property of a molecule and not of its particular conformers.
3.1.5 FreeSolv. The Free Solvation Database (FreeSolv) provides experimental and calculated hydration free energies of small molecules in water.[16] A subset of the compounds in the dataset are also used in the SAMPL blind prediction challenge.[15] The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. We include the experimental values in the benchmark collection, and use calculated values for comparison.
3.1.6 Lipophilicity. Lipophilicity is an important property of drug molecules that affects both membrane permeability and solubility. This dataset, curated from the ChEMBL database,[45] provides experimental results for the octanol/water distribution coefficient (log D at pH 7.4) of 4200 compounds.
3.1.7 PCBA. PubChem BioAssay (PCBA) is a database consisting of biological activities of small molecules generated by high-throughput screening.[35] We use a subset of PCBA, containing 128 bioassays measured over 400 thousand compounds, used by previous work to benchmark machine learning methods.[10]
3.1.8 MUV. The Maximum Unbiased Validation (MUV) group is another benchmark dataset selected from PubChem BioAssay by applying a refined nearest neighbor analysis.[46] The MUV dataset contains 17 challenging tasks for around 90 thousand compounds and is specifically designed for validation of virtual screening techniques.
3.1.9 HIV. The HIV dataset was introduced by the Drug Therapeutics Program (DTP) AIDS Antiviral Screen, which tested the ability to inhibit HIV replication for over 40 000 compounds.[47] Screening results were evaluated and placed into three categories: confirmed inactive (CI), confirmed active (CA) and confirmed moderately active (CM). We further combine the latter two labels, making it a classification task between inactive (CI) and active (CA and CM). As we are more interested in discovering new categories of HIV inhibitors, scaffold splitting (introduced in the next subsection) is recommended for this dataset.
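The label merge described above can be sketched as a simple category-to-binary mapping (illustrative code; the function name is hypothetical):

```python
# CI -> inactive (0); CA and CM -> active (1), per the DTP screen categories.
LABEL_MAP = {"CI": 0, "CA": 1, "CM": 1}

def binarize(categories):
    return [LABEL_MAP[c] for c in categories]

print(binarize(["CI", "CA", "CM", "CI"]))  # -> [0, 1, 1, 0]
```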
Table 1 Dataset details: number of compounds and tasks, recommended splits and metrics

| Category | Dataset | Data type | Tasks | Task type | Compounds | Rec. split | Rec. metric |
|---|---|---|---|---|---|---|---|
| Quantum mechanics | QM7 | SMILES, 3D coordinates | 1 | Regression | 7165 | Stratified | MAE |
| Quantum mechanics | QM7b | 3D coordinates | 14 | Regression | 7211 | Random | MAE |
| Quantum mechanics | QM8 | SMILES, 3D coordinates | 12 | Regression | 21 786 | Random | MAE |
| Quantum mechanics | QM9 | SMILES, 3D coordinates | 12 | Regression | 133 885 | Random | MAE |
| Physical chemistry | ESOL | SMILES | 1 | Regression | 1128 | Random | RMSE |
| Physical chemistry | FreeSolv | SMILES | 1 | Regression | 643 | Random | RMSE |
| Physical chemistry | Lipophilicity | SMILES | 1 | Regression | 4200 | Random | RMSE |
| Biophysics | PCBA | SMILES | 128 | Classification | 439 863 | Random | PRC-AUC |
| Biophysics | MUV | SMILES | 17 | Classification | 93 127 | Random | PRC-AUC |
| Biophysics | HIV | SMILES | 1 | Classification | 41 913 | Scaffold | ROC-AUC |
| Biophysics | PDBbind | SMILES, 3D coordinates | 1 | Regression | 11 908 | Time | RMSE |
| Biophysics | BACE | SMILES | 1 | Classification | 1522 | Scaffold | ROC-AUC |
| Physiology | BBBP | SMILES | 1 | Classification | 2053 | Scaffold | ROC-AUC |
| Physiology | Tox21 | SMILES | 12 | Classification | 8014 | Random | ROC-AUC |
| Physiology | ToxCast | SMILES | 617 | Classification | 8615 | Random | ROC-AUC |
| Physiology | SIDER | SMILES | 27 | Classification | 1427 | Random | ROC-AUC |
| Physiology | ClinTox | SMILES | 2 | Classification | 1491 | Random | ROC-AUC |

3.1.10 PDBbind. PDBbind is a comprehensive database of experimentally measured binding affinities for bio-molecular complexes.[48,49] Unlike other ligand-based biological activity datasets, in which only the structures of ligands are provided, PDBbind provides detailed 3D Cartesian coordinates of both ligands and their target proteins derived from experimental (e.g., X-ray crystallography) measurements. The availability of coordinates of the protein–ligand complexes permits structure-based featurization that is aware of the protein–ligand binding geometry. We use the "refined" and "core" subsets of the database,[50] more carefully processed for data artifacts, as additional benchmarking targets. Samples in the PDBbind dataset are collected over a relatively long period of time (since 1982), hence a time splitting pattern (introduced in the next subsection) is recommended to mimic actual development in the field.
3.1.11 BACE. The BACE dataset provides quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1).[51] All data are experimental values reported in the scientific literature over the past decade, some with detailed crystal structures available. We merged a collection of 1522 compounds with their 2D structures and binary labels in MoleculeNet, built as a classification task. Similarly, as the dataset concerns a single protein target, scaffold splitting will be more practically useful.
3.1.12 BBBP. The blood–brain barrier penetration (BBBP) dataset comes from a recent study[52] on the modeling and prediction of barrier permeability. As a membrane separating circulating blood and brain extracellular fluid, the blood–brain barrier blocks most drugs, hormones and neurotransmitters. Thus penetration of the barrier forms a long-standing issue in the development of drugs targeting the central nervous system. This dataset includes binary labels for over 2000 compounds on their permeability properties. Scaffold splitting is also recommended for this well-defined target.
3.1.13 Tox21. The Toxicology in the 21st Century (Tox21) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge.[53] This dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways.
3.1.14 ToxCast. ToxCast is another data collection (from the same initiative as Tox21) providing toxicology data for a large library of compounds based on in vitro high-throughput screening.[54] The processed collection in MoleculeNet includes qualitative results of over 600 experiments on 8615 compounds.
3.1.15 SIDER. The Side Effect Resource (SIDER) is a database of marketed drugs and adverse drug reactions (ADR).[55] The version of the SIDER dataset in DeepChem[56] groups drug side effects into 27 system organ classes following MedDRA classifications,[57] measured for 1427 approved drugs (following previous usage[56]).
3.1.16 ClinTox. The ClinTox dataset, introduced as part of this work, compares drugs approved by the FDA and drugs that have failed clinical trials for toxicity reasons.[58,59] The dataset includes two classification tasks for 1491 drug compounds with known chemical structures: (1) clinical trial toxicity (or absence of toxicity) and (2) FDA approval status. The list of FDA-approved drugs is compiled from the SWEETLEAD database,[60] and the list of drugs that failed clinical trials for toxicity reasons is compiled from the Aggregate Analysis of ClinicalTrials.gov (AACT) database.[61]
3.2 Dataset splitting
Typical machine learning methods require datasets to be split into training/validation/test subsets (or alternatively into K-folds) for benchmarking. All MoleculeNet datasets are split into training, validation and test sets, following an 80/10/10 ratio. Training sets are used to train models, validation sets are used for tuning hyperparameters, and test sets are used for evaluation of models.
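The 80/10/10 random split can be sketched in a few lines of pure Python (an illustrative version with a fixed seed for reproducibility, not the DeepChem splitter itself):

```python
import random

def random_split(items, frac_valid=0.1, frac_test=0.1, seed=0):
    """Shuffle indices with a fixed seed, then carve off test and valid."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    n_test = int(frac_test * len(items))
    n_valid = int(frac_valid * len(items))
    test = [items[i] for i in idx[:n_test]]
    valid = [items[i] for i in idx[n_test:n_test + n_valid]]
    train = [items[i] for i in idx[n_test + n_valid:]]
    return train, valid, test

train, valid, test = random_split(list(range(100)))
print(len(train), len(valid), len(test))  # -> 80 10 10
```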
As mentioned previously, random splitting of molecular data isn't always best for evaluating machine learning methods. Consequently, MoleculeNet implements multiple different splittings for each dataset (Fig. 3). Random splitting randomly splits samples into the training/validation/test subsets. Scaffold splitting splits the samples based on their two-dimensional structural frameworks,[62] as implemented in RDKit.[63] Since scaffold splitting attempts to separate structurally different molecules into different subsets, it offers a greater challenge for learning algorithms than the random split.
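In outline, scaffold splitting groups molecules by their scaffold and assigns whole groups to subsets, so no scaffold spans both train and test. The sketch below assumes scaffold strings have already been computed (e.g. with RDKit's Murcko scaffold utility); the function name, group-ordering heuristic, and example data are illustrative assumptions, not the DeepChem implementation.

```python
from collections import defaultdict

def scaffold_split(mols, scaffolds, frac_train=0.8, frac_valid=0.1):
    """Group indices by scaffold key; assign whole groups, largest first."""
    groups = defaultdict(list)
    for i, s in enumerate(scaffolds):
        groups[s].append(i)
    ordered = sorted(groups.values(), key=len, reverse=True)
    n = len(mols)
    train, valid, test = [], [], []
    for g in ordered:
        if len(train) + len(g) <= frac_train * n:
            train += g
        elif len(valid) + len(g) <= frac_valid * n:
            valid += g
        else:
            test += g
    return train, valid, test

mols = ["m%d" % i for i in range(10)]
scaffolds = ["A", "A", "A", "A", "B", "B", "C", "C", "D", "E"]
train, valid, test = scaffold_split(mols, scaffolds)
print(train, valid, test)
```

Because each scaffold group is kept intact, the test set contains only scaffolds never seen in training, which is what makes this split harder than the random one.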
In addition, a stratified random sampling method is implemented on the QM7 dataset to reproduce the results from the original work.[18] This method sorts datapoints in order of increasing label value (note this is only defined for real-valued output). The sorted list is then split into training/validation/test subsets such that each set contains the full range of provided labels. Time splitting is also adopted for the dataset that includes time information (PDBbind). Under this splitting method, the model is trained on older data and tested on newer data, mimicking real world development conditions.
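The stratified and time splits can be sketched as follows (illustrative pure-Python versions with hypothetical names; the window size of 10 giving an 8/1/1 deal is an assumption chosen to match the 80/10/10 ratio):

```python
def stratified_split(labels):
    """Sort by label, then deal each window of 10 out as 8 train / 1 valid / 1 test,
    so every subset spans the full range of label values."""
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    train, valid, test = [], [], []
    for start in range(0, len(order), 10):
        window = order[start:start + 10]
        train += window[:8]
        valid += window[8:9]
        test += window[9:10]
    return train, valid, test

def time_split(years, frac_train=0.8, frac_valid=0.1):
    """Train on the oldest samples, test on the newest."""
    order = sorted(range(len(years)), key=lambda i: years[i])
    a = int(frac_train * len(years))
    b = int((frac_train + frac_valid) * len(years))
    return order[:a], order[a:b], order[b:]

# Stratified: validation and test each receive low and high labels.
tr, va, te = stratified_split([float(i) for i in range(20)])
print(sorted(va), sorted(te))

# Time: the most recent measurement ends up in the test set.
years = [1985, 1999, 2007, 2015, 1982, 2020, 1990, 2001, 2010, 2018]
tr2, va2, te2 = time_split(years)
print([years[i] for i in te2])
```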
MoleculeNet contributes the code for these splitting
methods into DeepChem. Users of the library can use these
splits on new datasets with short library calls.
3.3 Metrics
MoleculeNet contains both regression datasets (QM7, QM7b, QM8, QM9, ESOL, FreeSolv, Lipophilicity and PDBbind) and classification datasets (PCBA, MUV, HIV, BACE, BBBP, Tox21, ToxCast and SIDER). Consequently, different performance metrics need to be measured for each. Following suggestions from the community,[64] regression datasets are evaluated by mean absolute error (MAE) and root-mean-square error (RMSE), while classification datasets are evaluated by the area under curve (AUC) of the receiver operating characteristic (ROC) curve[65] and the precision–recall curve (PRC).[66] For datasets containing more than one task, we report the mean metric value over all tasks.
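As a concrete illustration of the classification metric, ROC-AUC can be computed from the rank (Mann–Whitney) formulation, and the multitask score is simply its per-task mean. The sketch below is illustrative pure Python (function names are hypothetical; in practice a library routine such as scikit-learn's would be used):

```python
def roc_auc(labels, scores):
    """Probability that a random positive is ranked above a random negative;
    ties receive half credit."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mean_multitask_auc(task_labels, task_scores):
    """Mean ROC-AUC over tasks, as reported for multitask datasets."""
    aucs = [roc_auc(y, s) for y, s in zip(task_labels, task_scores)]
    return sum(aucs) / len(aucs)

# Perfect ranking on task 1, imperfect ranking on task 2:
y1, s1 = [0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]
y2, s2 = [0, 1, 0, 1], [0.6, 0.4, 0.3, 0.7]
print(mean_multitask_auc([y1, y2], [s1, s2]))
```

Averaging over tasks is why a single number can summarize datasets like Tox21 (12 tasks) or ToxCast (617 tasks), at the cost of hiding per-task variation.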
To allow better comparison, we propose regression metrics according to previous work on either the same models or datasets. For classification datasets, we propose recommended metrics chosen from the two commonly used metrics: AUC-PRC and AUC-ROC. Four representative sets of ROC curves and PRCs are depicted in Fig. 4, resulting from the predictions of logistic regression and graph convolutional models on four tasks. Details about these tasks and AUC values of all curves are listed in Table 2. Note that these four tasks have different class imbalances, represented as the number of positive samples and negative samples.
As noted in previous literature,[66] ROC curves and PRCs are highly correlated, but perform significantly differently in the case of highly imbalanced classification.
