
Predicting Binding from Screening Assays with Transformer Network Embeddings
Paul Morris, Rachel St. Clair, Elan Barenholtz, and William Edward Hahn
Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, Florida 33431, United States
E-mail: pmorris2012@fau.edu
Abstract
Cheminformatics aims to assist in chemistry applications that depend on molecular interactions, structural characteristics, and functional properties. The arrival of deep learning and the abundance of easily accessible chemical data from repositories like PubChem have enabled advancements in computer-aided drug discovery. Virtual High-Throughput Screening (vHTS) is one such technique that integrates chemical domain knowledge to perform in silico biomolecular simulations, but prediction of binding affinity is restricted due to limited availability of ground-truth binding assay results. Here, text representations of 83,000,000 molecules are leveraged to enable single-target binding affinity prediction directly on the outcome of screening assays. The embedding of an end-to-end Transformer neural network, trained to encode the structural characteristics of a molecule via a text-based translation task, is repurposed through transfer learning to classify binding affinity to a single target. Classifiers trained on the embedding outperform those trained on SMILES strings for multiple tasks, achieving AUC values between 0.67 and 0.99. Visualization reveals organization of structural and functional properties in the learned embedding useful for binding prediction. The proposed model is suitable for parallel computing, enabling rapid screening as a complement to virtual screening techniques when limited data are available.
Introduction
Cheminformatics aims to assist in chemistry applications that depend on molecular interactions, structural characteristics, and functional properties. The arrival of powerful computational techniques and the abundance of easily accessible chemical data from repositories have enabled dramatic recent advancements in computer-aided drug discovery. The domain of computer-aided drug design spans quantitative structure-activity relationship modeling,1 drug-induced liver injury prediction,2 toxicity modeling,3 and virtual screening,4 among other tasks. All of these tasks have been aided by models that make use of computational techniques that leverage large datasets and human expertise to encode molecular features in order to predict biochemical activity.5 Such techniques seek to expedite drug-discovery pipelines by increasing the quantity and quality of active compounds identified, potentially resulting in new drug leads. Computational approaches also have the advantage of being able to integrate features from many sources describing different chemical properties to approximate chemical function without the limitations of traditional wet-lab approaches.6
Virtual High-Throughput Screening (vHTS) is one such technique that integrates chemical domain knowledge to perform in silico biomolecular simulations. While binding assays are generally more accurate than traditional virtual screening approaches, they can only identify drug leads from a set of compounds which are easy and cost-efficient to synthesize. Computational techniques which simulate or approximate physical models of chemistry are not constrained by real-world limitations, as molecules do not need to be synthesized and resources for wet-lab experiments are not required. Models or algorithms which are driven by chemical data and expert knowledge have been used to estimate structural and functional properties and aid in scoring of existing molecules,7,8 as well as in de novo drug design.9
Earlier approaches to the application of machine learning in cheminformatics involved more traditional techniques such as support vector machines (SVM), random-forest decision tree ensembles, Markov models, and linear regression.10 However, the advent of deep learning on parallel computing resources has increased the power and utility of computational models, leading to new opportunities to leverage the wealth of available machine-readable chemical information. Deep neural networks (DNNs) trained to classify molecular representations have reportedly been highly effective for cheminformatic tasks in computer-aided drug design, computational structural biology, quantum chemistry, and computational material design.
The recent success of deep learning can be attributed in part to the availability of large, labeled datasets.11 Repositories such as PubChem,12 which compile information on molecular structure and properties, have enabled the application of deep learning vision and natural language processing (NLP) techniques to many molecular property prediction tasks. These include training convolutional neural networks (CNN) on raw SMILES strings,13,14 adapting CNNs to atom graphs and connectivity matrices,15–18 and using neural networks to classify molecules from fingerprints or other hand-designed molecular descriptors.19
A number of approaches attempt to replace the hand-designed scoring function of traditional molecular docking algorithms with a learned scoring function.16,20–22 Another class of deep learning applications for drug discovery attempts to simulate molecular docking. Large databases such as PDBBind23 contain 3D conformations of molecules bound to relevant sites on thousands of target structures. Deep learning approaches encode this 3D information to learn a model of physics and identify molecules with low-energy conformations and high likelihood of binding.24 Though physics-based molecular docking models are less constrained than wet-lab screening approaches, they can still be computationally expensive and require significant time and/or resources.
As an alternative to docking, other deep learning approaches attempt to improve the quality of virtual screening predictions by learning to represent molecules with automatically selected features.25,26 In particular, translation between distinct molecular representations has previously been shown to be an effective technique for learning useful representations of molecular properties.9 By learning from existing representations and other information which describe structural patterns, these techniques develop custom chemical feature sets which can match or increase performance on molecular classification/prediction tasks compared to existing representations. Learning new representations expands the scope of cheminformatics applications by allowing prediction of molecular function, as improved representations can increase the predictive quality of models trained on limited amounts of data.
While the application of deep learning to prediction of molecular properties and other tasks has shown promise in aiding drug discovery, the direct application of deep learning to prediction of screening assay results has been made difficult by the limited quantity of available data. Molecules screened against a particular target likely constitute a much less representative sample of chemical space than is typical of dataset samples from vision or NLP populations, where deep learning has been most successful. The application of deep learning is especially difficult for datasets that are not primarily hand-engineered.5
To address these limitations, we leverage the vast wealth of publicly available and easily computable molecular structure data to augment training of a neural network for binding affinity prediction from historical assay data. To do so, we train a Transformer neural network, an architecture first introduced in the context of natural language translation,27 to translate between two distinct, text-based molecular representations in a well-studied subset of chemical space. An intermediate set of features computed by this trained model is considered as an embedding which contains abstract features describing general molecular structure. Molecules represented by this abstract embedding are then used to train a binding affinity prediction model directly on a limited set of assay results which quantify binding to a single target. The organization of structural and functional properties in embedding feature space enables simple classifiers to simulate screening assays in limited data scenarios.
Learning abstract representations of chemical information has recently been shown to improve performance in predicting molecular function.5,28 Another recent study derived word embeddings and repurposed them through transfer learning for multiple NLP tasks, outperforming classifiers trained without such embeddings.29 Here, we utilize a Transformer network to create such embeddings for functional assays that may otherwise be poor candidates for virtual screening. Since the molecular representations from the Transformer network are learned by text translation to encode the functional properties indicated by structure, they can be applied to any screening assay model, regardless of bioactivity. This approach shows that pretraining embeddings for generic chemical representations can improve supervised classification. Our translation-based pretraining extends that insight to the task of predicting binding assay results.
We evaluate the novel molecular embedding learned by our Transformer on three single-target prediction tasks and observe improvement upon baselines for direct prediction of binding assay results. Since neural network training is data-driven, embedding features are also suitable for fine-tuning to incorporate target-specific information. Furthermore, the operations in the Transformer model used to compute molecular embeddings are easily parallelizable on modern computing infrastructure (GPUs), enabling rapid screening of millions of molecules to assist wet-lab screening assays and other drug discovery pipelines.
Methods
To accurately predict binding assay results for a single target with few active compounds, we first perform an auxiliary text translation task based on state-of-the-art NLP techniques and structural text representations of millions of molecules. We collect SMILES strings and IUPAC chemical names for a large set of molecules on PubChem. SMILES and IUPAC representations are selected because they both describe similar aspects of molecular structure following consistent rules in a machine-readable format. While the atoms, bonds, and substructures described in the two representations are similar, the SMILES grammar and IUPAC nomenclature have distinct text representations. By learning to translate between the two, the common information they contain must be organized efficiently in an intermediate set of features. We then repurpose these features of the learned embedding for direct prediction of assay results, treated as a binary classification task between binding and non-binding regions of chemical space.
An overview of this process is shown in Figure 1. In Step 1, a high-level depiction of the network architecture illustrates how the network layers generate molecular embeddings when performing SMILES-IUPAC translation. In Step 2, embeddings generated from the trained network are provided as input to a target-specific binding classification network.
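As a rough sketch of how the two steps connect, the snippet below builds a stand-in encoder (in place of the translation-trained Transformer from Step 1), freezes its weights, and runs a batch through a small binding classification head of the kind that would be trained on assay labels in Step 2. All layer sizes, the classifier architecture, and the random input batch are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the two-step idea (not the authors' released code).
# Step 1 would yield a trained encoder; Step 2 freezes it and trains a
# small classifier head on the molecular embedding it produces.

d_model, max_len = 512, 256

# Stand-in for the encoder stack trained on SMILES -> IUPAC translation.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Step 2: freeze the pretrained encoder; only the classifier head is updated.
for p in encoder.parameters():
    p.requires_grad = False

classifier = nn.Sequential(
    nn.Flatten(),                        # (batch, max_len * d_model)
    nn.Linear(max_len * d_model, 256),
    nn.ReLU(),
    nn.Linear(256, 1),                   # single binding / non-binding logit
)

# One hypothetical forward pass on already-embedded SMILES characters.
x = torch.randn(4, max_len, d_model)     # (batch, sequence, features)
with torch.no_grad():
    emb = encoder(x)                      # the "molecular embedding"
logits = classifier(emb)                  # binding scores for the batch
```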
Transformer Neural Network
The Transformer27 is a deep neural network suited for NLP tasks. It relies on large weight matrices to store patterns and learn short- and long-term dependencies in training sequences. The Transformer network architecture is flexible and can be used for both classification and text generation tasks. In our implementation, the Transformer is used to generate output as an IUPAC chemical name which corresponds to a molecule described by a SMILES string provided as input.

Figure 1: Diagram of the two-step procedure followed to predict binding affinity using the learned embedding of a Transformer network.
Before being processed by the main layers of the Transformer, SMILES strings are converted to an initial, random embedding. Each character in the SMILES alphabet is replaced with a random vector, where the same vector is used for multiple occurrences of the same character. The values in this vector are the first network weights of the Transformer, and they are tuned during training based on the frequency, co-occurrence, and sequential dependencies of each SMILES character. Periodic functions at different frequencies are added to the signal of each vector so that the frequency of the added signal encodes a character's location in the SMILES sequence. This allows the character-specific layers of the Transformer to determine the order of one character relative to others.
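The following PyTorch sketch illustrates such a position-aware character embedding. The toy SMILES alphabet and the standard sinusoidal encoding are assumptions chosen for illustration, not necessarily the paper's exact vocabulary or frequency scheme.

```python
import math
import torch
import torch.nn as nn

# Sketch of the initial SMILES embedding: a learned vector per character plus
# fixed sinusoidal signals whose frequencies encode position in the string.

smiles_vocab = list("CNOFPSclnos()[]=#+-0123456789@/\\Hr ")  # toy alphabet
d_model, max_len = 512, 256

char_embedding = nn.Embedding(len(smiles_vocab), d_model)     # learned vectors

def positional_encoding(max_len, d_model):
    """Standard sinusoidal encoding: each dimension is a sine/cosine whose
    frequency identifies a character's location in the sequence."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def embed_smiles(smiles):
    """Map a SMILES string to its initial (position-aware) embedding."""
    idx = torch.tensor([smiles_vocab.index(ch) for ch in smiles])
    emb = char_embedding(idx)                       # (len, d_model)
    return emb + positional_encoding(max_len, d_model)[: len(smiles)]

print(embed_smiles("CCCCC1CCNCC1F").shape)          # torch.Size([13, 512])
```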
Once an initial embedding of character vectors is generated, the signal in each vector is modulated by the layers in the Transformer's encoder stack. An encoder layer consists of a self-attention operation which modifies each character vector based on its relation to other characters in the sequence, followed by a simple matrix multiplication and nonlinearity which is applied to each character vector individually. The output of each layer is a set of character vectors with the same size as the input. The output of the final encoder layer is treated as a molecular embedding, where each character vector has been modified to contain abstract features useful for describing the structure of a molecule. The features in this embedding are used by an equivalent set of decoder layers for IUPAC name generation. Character vectors are processed by the decoder stack one at a time, resulting in a new character in the IUPAC alphabet being predicted. Decoder layers share a similar structure to encoder layers, except for a slightly modified form of self-attention which looks at previously predicted IUPAC characters to inform prediction of the next character. During training, previous predictions are ignored in favor of characters from the correct IUPAC name for a molecule.
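A compact way to realize this encoder-decoder arrangement is PyTorch's generic `nn.Transformer`, as sketched below. The vocabulary sizes, layer counts, and tokenization are placeholders, positional encoding is omitted for brevity (see the earlier sketch), and the causal target mask implements the teacher forcing described above.

```python
import torch
import torch.nn as nn

# Rough sketch of a SMILES -> IUPAC translation Transformer; sizes are assumed.
SMILES_VOCAB, IUPAC_VOCAB, D_MODEL = 64, 96, 512

class Smiles2IupacTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(SMILES_VOCAB, D_MODEL)   # SMILES characters
        self.tgt_embed = nn.Embedding(IUPAC_VOCAB, D_MODEL)    # IUPAC characters
        # (positional encoding omitted for brevity)
        self.transformer = nn.Transformer(d_model=D_MODEL, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6,
                                          batch_first=True)
        self.out = nn.Linear(D_MODEL, IUPAC_VOCAB)              # next-character logits

    def forward(self, src_ids, tgt_ids):
        # Causal mask: each decoder position may only attend to earlier
        # (ground-truth) IUPAC characters, i.e. teacher forcing during training.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.src_embed(src_ids),
                                  self.tgt_embed(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)

    def molecular_embedding(self, src_ids):
        # Output of the final encoder layer, repurposed later as the embedding.
        return self.transformer.encoder(self.src_embed(src_ids))

model = Smiles2IupacTransformer()
smiles_ids = torch.randint(0, SMILES_VOCAB, (2, 40))   # toy tokenized batch
iupac_ids = torch.randint(0, IUPAC_VOCAB, (2, 60))
logits = model(smiles_ids, iupac_ids)                   # (2, 60, IUPAC_VOCAB)
```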
Molecular Self-Attention
The core mechanism of the Transformer is 'self-attention'. In this operation, input vectors representing each character in a SMILES string are output as a linear combination of vectors for all characters in the string. The output of the attention layer is a vector for each character with the same length as the input. However, vectors for each character are weighted by an output-specific attention score which represents how relevant every character in the string is to a particular output. Attention scores are computed by weight matrices, which accept character vectors as input. The meaning of the attention scores depends on the task on which the model is being trained, and multiple sets of model weights are used to produce multiple sets of scores which may attend to different relevant features in the input. In the case of our translation task, attention scores may indicate the importance of a certain substructure for generating part of a molecule's IUPAC name. Example visualizations from the trained Transformer are shown in Figure 2. The matrices on the right of the figure demonstrate the capacity of the Transformer network to learn a descriptive, varied set of abstract molecular features useful for describing structure.

Figure 2: Left: A selection of attention weights for the SMILES string CCCCC1CCNCC1F are visualized, showing how the character vectors of each input on the left are weighted to produce the output vector for the 3rd carbon atom on the right. Opacity indicates larger attention values. Right: Some of the self-attention weights from the trained Transformer are visualized for a molecule. Separate weights in a single layer attend to different features which describe the same structure.
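For a single attention head, the self-attention operation above reduces to a few matrix products, as in the sketch below. The random weight matrices stand in for the learned projections, and in the full model multiple such heads run in parallel.

```python
import torch

# Minimal scaled dot-product self-attention over character vectors.
# Weight matrices and sizes are illustrative placeholders.

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) character vectors for one SMILES string.
    Each output vector is a linear combination of all character vectors,
    weighted by learned attention scores."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / (k.size(-1) ** 0.5)        # relevance of every character
    weights = torch.softmax(scores, dim=-1)       # rows sum to 1
    return weights @ v, weights                   # new vectors + attention map

d_model = 512
x = torch.randn(13, d_model)                       # e.g. "CCCCC1CCNCC1F"
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
out, attn = self_attention(x, W_q, W_k, W_v)
print(out.shape, attn.shape)                       # (13, 512) and (13, 13)
```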
Training Procedure
To train the Transformer network for translation, pairs of SMILES strings and IUPAC names are sourced directly from the PubChem compound database for 83,000,000 molecules. SMILES strings are used as-is, and no canonicalization is performed. Similarly, IUPAC names for each molecule are collected from PubChem with no modification. Deep neural networks trained on large, labeled datasets have been shown to be robust to unreliable annotation.30 Noise during neural network training can increase generalization due to the intricacies of network optimization,31 making the unmodified molecular text representations robust to under-fitting.
A Transformer network is created with 512-dimension character embedding vectors. A maximum SMILES string length of 256 characters is imposed during training, although this limit can be exceeded during screening inference. Thus, each molecular embedding contains 256 × 512 dimensions. Training is performed in batches of 96 molecular string pairs. The Adam optimization algorithm32 is used to update the weights of the network. The learning rate during optimization begins at 0.001 and decreases two orders of magnitude, following half a period of a cosine function, over the course of a single pass, or epoch, over the 83,000,000-molecule training set. Training continues for three epochs.
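One way to reproduce this schedule is to set the optimizer's learning rate manually at each step from a half-cosine curve. Whether the decay restarts at every epoch is an assumption in the sketch below, as is the placeholder parameter list.

```python
import math
import torch

# Half-cosine decay of the learning rate from 1e-3 down two orders of
# magnitude (to 1e-5) across each epoch of the translation training set.

lr_max, lr_min = 1e-3, 1e-5
steps_per_epoch = 83_000_000 // 96           # molecules / batch size

def cosine_lr(step):
    """Learning rate for a given optimizer step within one epoch."""
    progress = (step % steps_per_epoch) / steps_per_epoch   # 0 -> 1
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

params = [torch.nn.Parameter(torch.randn(2, 2))]  # placeholder parameters
optimizer = torch.optim.Adam(params, lr=lr_max)

for step in range(5):                              # a few illustrative steps
    for group in optimizer.param_groups:
        group["lr"] = cosine_lr(step)
    # ... forward pass, loss.backward(), and optimizer.step() would go here ...
```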
Experiments
The utility of the Transformer embedding was investigated by training and evaluating binding prediction models on molecular embeddings for three binary classification tasks. Equivalent prediction models are also trained on two representation baselines to quantify the Transformer embeddings' usefulness for binding affinity prediction and explain the relation between learned features and molecular properties. Finally, an unsupervised evaluation of the learned embedding is performed by visualizing how changes in molecular structure correspond to changes in the embedding.
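A minimal version of this comparison scores identical classifiers by AUC on each representation, for example with scikit-learn. The prediction arrays below are random placeholders rather than the paper's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Sketch of the evaluation protocol: identical classifiers are compared by AUC
# on each molecular representation. Labels and scores here are synthetic.

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                     # binding labels

scores = {
    "transformer_embedding": rng.random(200),             # model confidences
    "untrained_embedding":   rng.random(200),
    "smiles_baseline":       rng.random(200),
}

for name, y_score in scores.items():
    print(f"{name:>24s}  AUC = {roc_auc_score(y_true, y_score):.3f}")
```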
Assay Datasets
Three datasets for supervised prediction of binding assay results were compiled. In each, the results of an assay (or compilation of assays) for binding affinity to a target were sampled into a small, labeled dataset. Binary classification of molecular embeddings was performed by binning continuous, assay-specific binding affinity values into binding and non-binding categories according to an activity threshold. In each assay, fewer binding than non-binding molecules were identified. To account for this, the non-binding sets were randomly undersampled to match the count of binding molecules for the purpose of training and evaluating a balanced binding classifier.
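The dataset construction described here can be sketched in a few lines of NumPy: binarize the assay readout at an activity threshold, then randomly undersample the non-binding class to the size of the binding class. The affinity values and threshold below are hypothetical.

```python
import numpy as np

# Sketch of the dataset construction: continuous assay readouts are binarized
# at an activity threshold, then non-binders are undersampled to balance classes.

rng = np.random.default_rng(0)
affinity = rng.random(1000)                 # placeholder assay activity values
threshold = 0.9                             # hypothetical activity cutoff

labels = (affinity >= threshold).astype(int)       # 1 = binding, 0 = non-binding
binders = np.where(labels == 1)[0]
non_binders = np.where(labels == 0)[0]

# Balance the classes by undersampling the (larger) non-binding set.
keep = rng.choice(non_binders, size=len(binders), replace=False)
balanced_idx = np.concatenate([binders, keep])
rng.shuffle(balanced_idx)

print(len(binders), "binders,", len(keep), "non-binders after undersampling")
```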

Citations
Posted Content

Molecular representation learning with language models and domain-relevant auxiliary tasks.

TL;DR: The Transformer architecture is applied, specifically BERT, to learn flexible and high quality molecular representations for drug discovery problems, and molecular representations learnt by the model 'MolBert' improve upon the current state of the art on the benchmark datasets.
Journal ArticleDOI

Accurate predictions of aqueous solubility of drug molecules via the multilevel graph convolutional network (MGCN) and SchNet architectures.

TL;DR: This study proposes two novel models for aqueous solubility predictions, based on the Multilevel Graph Convolutional Network (MGCN) and SchNet architectures, respectively, and found that both the MGCN and SchNet models performed well for aqueous solubility predictions.
Journal ArticleDOI

Accurate predictions of drugs aqueous solubility via deep learning tools

TL;DR: This study proposed one novel and efficient quantitative structure-property relationship (QSPR) model for molecular properties predictions within the framework of deep learning neural network (DNN), using molecular descriptors calculated by Mordred.
Posted Content

Geometric Deep Learning on Molecular Representations

TL;DR: Geometric deep learning (GDL) has emerged as a recent paradigm in artificial intelligence as discussed by the authors and has shown particular promise in molecular modeling applications, in which various molecular representations with different symmetry properties and levels of abstraction exist.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Journal ArticleDOI

Deep learning

TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Book

Deep Learning

TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Frequently Asked Questions (16)
Q1. What have the authors contributed in "Predicting binding from screening assays with transformer network embeddings" ?

In this paper, an end-to-end Transformer neural network, trained to encode the structural characteristics of a molecule via a text-based translation task, is repurposed through transfer learning to classify binding affinity to a single target. 

While overall accuracy was somewhat limited and varied per-target, these results suggest a promising direction for further research into the application of deep learning to direct modeling of assay experiment results as a computational screening aid to existing drug discovery pipelines. Data-driven models trained on the Transformer embeddings can be applied as a quick, inexpensive computational screening method to assist the early drug discovery process for targets where a functional assay has been designed. 

The authors found that learning a mapping of chemical space via a Transformer network achieved increased accuracy of data-driven models on multiple binding affinity prediction tasks compared to models trained on hand-designed or untrained representations. 

In this case, the SMILES string for each molecule is converted to a random embedding of 512-dimensional vectors for each character. 

Numeric property values were normalized between 0 and 1 according to the minimum and maximum values of all screened molecules, on a per-dataset basis. 

The networks used to classify binding affinity are identical to the Transformer and random embedding networks, except only 20 input neurons are needed in this case. 

To analyze how the learned molecular embeddings encode binding properties, the authors modified molecular sequences and observed changes in binding confidence to HIV-1 Protease from a binary classifier. 

The same model used in all three reaction experiments was a simple CNN composed of an input layer, two hidden convolutional layers with ReLU, and a fully connected output layer originally trained on the HIV dataset for target binding classification using a random embedding of SMILES strings.