
Predicting Binding from Screening Assays with Transformer Network Embeddings
Paul Morris, Rachel St. Clair, Elan Barenholtz, and William Edward Hahn
Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton, Florida 33431, United States
E-mail: pmorris2012@fau.edu
Abstract
Cheminformatics aims to assist in chemistry applications that depend on molecular interactions, structural characteristics, and functional properties. The arrival of deep learning and the abundance of easily accessible chemical data from repositories like PubChem have enabled advancements in computer-aided drug discovery. Virtual High-Throughput Screening (vHTS) is one such technique that integrates chemical domain knowledge to perform in silico biomolecular simulations, but prediction of binding affinity is restricted due to limited availability of ground-truth binding assay results. Here, text representations of 83,000,000 molecules are leveraged to enable single-target binding affinity prediction directly on the outcome of screening assays. The embedding of an end-to-end Transformer neural network, trained to encode the structural characteristics of a molecule via a text-based translation task, is repurposed through transfer learning to classify binding affinity to a single target. Classifiers trained on the embedding outperform those trained on SMILES strings for multiple tasks, achieving AUC values between 0.67 and 0.99. Visualization reveals organization of structural and functional properties in the learned embedding useful for binding prediction. The proposed model is suitable for parallel computing, enabling rapid screening as a complement to virtual screening techniques when limited data are available.
Introduction
Cheminformatics aims to assist in chemistry applications that depend on molecular interactions, structural characteristics, and functional properties. The arrival of powerful computational techniques and the abundance of easily accessible chemical data from repositories have enabled dramatic recent advancements in computer-aided drug discovery. The domain of computer-aided drug design spans quantitative structure-activity relationship modeling,1 drug-induced liver injury prediction,2 toxicity modeling,3 and virtual screening,4 among other tasks. All of these tasks have been aided by models that make use of computational techniques that leverage large datasets and human expertise to encode molecular features in order to predict biochemical activity.5 Such techniques seek to expedite drug-discovery pipelines by increasing the quantity and quality of active compounds identified, potentially resulting in new drug leads. Computational approaches also have the advantage of being able to integrate features from many sources describing different chemical properties to approximate chemical function without the limitations of traditional wet-lab approaches.6
Virtual High-Throughput Screening (vHTS) is one such technique that integrates chemical domain knowledge to perform in silico biomolecular simulations. While binding assays are generally more accurate than traditional virtual screening approaches, they can only identify drug leads from a set of compounds which are easy and cost-efficient to synthesize. Computational techniques which simulate or approximate physical models of chemistry are not constrained by real-world limitations, as molecules do not need to be synthesized and resources for wet-lab experiments are not required. Models or algorithms which are driven by chemical data and expert knowledge have been used to estimate structural and functional properties and aid in scoring of existing molecules,7,8 as well as in de novo drug design.9
Earlier approaches to the application of machine learning in cheminformatics involved more traditional techniques such as support vector machines (SVM), random-forest decision tree ensembles, Markov models, and linear regression.10 However, the advent of deep learning on parallel computing resources has increased the power and utility of computational models, leading to new opportunities to leverage the wealth of available machine-readable chemical information. Deep neural networks (DNNs) trained to classify molecular representations have reportedly been highly effective for cheminformatic tasks in computer-aided drug design, computational structural biology, quantum chemistry, and computational material design.
The recent success of deep learning can be attributed in part to the availability of large, labeled datasets.11 Repositories such as PubChem,12 which compile information on molecular structure and properties, have enabled the application of deep learning vision and natural language processing (NLP) techniques to many molecular property prediction tasks. These include training convolutional neural networks (CNN) on raw SMILES strings,13,14 adapting CNNs to atom graphs and connectivity matrices,15–18 and using neural networks to classify molecules from fingerprints or other hand-designed molecular descriptors.19
A number of approaches attempt to replace the hand-designed scoring function of traditional molecular docking algorithms with a learned scoring function.16,20–22 Another class of deep learning applications for drug discovery attempts to simulate molecular docking. Large databases such as PDBBind23 contain 3D conformations of molecules bound to relevant sites on thousands of target structures. Deep learning approaches encode this 3D information to learn a model of physics and identify molecules with low-energy conformations and high likelihood of binding.24 Though physics-based molecular docking models are less constrained than wet-lab screening approaches, they can still be computationally expensive and require significant time and/or resources.
As an alternative to docking, other deep learning approaches attempt to improve the quality of virtual screening predictions by learning to represent molecules with automatically selected features.25,26 In particular, translation between distinct molecular representations has previously been shown to be an effective technique for learning useful representations of molecular properties.9 By learning from existing representations and other information which describe structural patterns, these techniques develop custom chemical feature sets which can match or increase performance on molecular classification/prediction tasks compared to existing representations. Learning new representations expands the scope of cheminformatics applications by allowing prediction of molecular function, as improved representations can increase the predictive quality of models trained on limited amounts of data.
While the application of deep learning to prediction of molecular properties and other tasks has shown promise in aiding drug discovery, the direct application of deep learning to prediction of screening assay results has been made difficult by the limited quantity of available data. Molecules screened against a particular target likely constitute a much less representative sample of chemical space than is typical of dataset samples from vision or NLP populations, where deep learning has been most successful. The application of deep learning is especially difficult for datasets that are not primarily hand-engineered.5
To address these limitations, we leverage the vast wealth of publicly available and easily computable molecular structure data to augment training of a neural network for binding affinity prediction from historical assay data. To do so, we train a Transformer neural network, an architecture first introduced in the context of natural language translation,27 to translate between two distinct, text-based molecular representations in a well-studied subset of chemical space. An intermediate set of features computed by this trained model is considered as an embedding which contains abstract features describing general molecular structure. Molecules represented by this abstract embedding are then used to train a binding affinity prediction model directly on a limited set of assay results which quantify binding to a single target. The organization of structural and functional properties in embedding feature space enables simple classifiers to simulate screening assays in limited data scenarios.
Learning abstract representations of chemical information has recently been shown to improve performance in predicting molecular function.5,28 Another recent study derived word embeddings and repurposed them through transfer learning for multiple NLP tasks, outperforming classifiers trained without such embeddings.29 Here, we utilize a Transformer network to create such embeddings for functional assays that may otherwise be poor candidates for virtual screening. Since the molecular representations from the Transformer network are learned by text translation to encode the functional properties indicated by structure, they can be applied to any screening assay model, regardless of bioactivity. This approach shows that pretraining embeddings for generic chemical representations can improve supervised classification. Our translation-based pretraining extends that insight to the task of predicting binding assay results.
We evaluate the novel molecular embedding learned by our Transformer on three single-target prediction tasks and observe improvement upon baselines for direct prediction of binding assay results. Since neural network training is data-driven, embedding features are also suitable for fine-tuning to incorporate target-specific information. Furthermore, the operations in the Transformer model used to compute molecular embeddings are easily parallelizable on modern computing infrastructure (GPUs), enabling rapid screening of millions of molecules to assist wet-lab screening assays and other drug discovery pipelines.
Methods
To accurately predict binding assay results for a single target with few active compounds, we first perform an auxiliary text translation task based on state-of-the-art NLP techniques and structural text representations of millions of molecules. We collect SMILES strings and IUPAC chemical names for a large set of molecules on PubChem. SMILES and IUPAC representations are selected because they both describe similar aspects of molecular structure following consistent rules in a machine-readable format. While the atoms, bonds, and substructures described in the two representations are similar, the SMILES grammar and IUPAC nomenclature have distinct text representations. By learning to translate between the two, the common information they contain must be organized efficiently in an intermediate set of features. We then repurpose these features of the learned embedding for direct prediction of assay results, treated as a binary classification task between binding and non-binding regions of chemical space.
An overview of this process is shown in Figure 1. In Step 1, a high-level depiction of the network architecture illustrates how the network layers generate molecular embeddings when performing SMILES-IUPAC translation. In Step 2, embeddings generated from the trained network are provided as input to a target-specific binding classification network.
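As a rough sketch of how the two steps connect, the snippet below builds a stand-in encoder (in place of the translation-trained Transformer from Step 1), freezes its weights, and runs a batch through a small binding classification head of the kind that would be trained on assay labels in Step 2. All layer sizes, the classifier architecture, and the random input batch are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the two-step idea (not the authors' released code).
# Step 1 would yield a trained encoder; Step 2 freezes it and trains a
# small classifier head on the molecular embedding it produces.

d_model, max_len = 512, 256

# Stand-in for the encoder stack trained on SMILES -> IUPAC translation.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

# Step 2: freeze the pretrained encoder; only the classifier head is updated.
for p in encoder.parameters():
    p.requires_grad = False

classifier = nn.Sequential(
    nn.Flatten(),                        # (batch, max_len * d_model)
    nn.Linear(max_len * d_model, 256),
    nn.ReLU(),
    nn.Linear(256, 1),                   # single binding / non-binding logit
)

# One hypothetical forward pass on already-embedded SMILES characters.
x = torch.randn(4, max_len, d_model)     # (batch, sequence, features)
with torch.no_grad():
    emb = encoder(x)                      # the "molecular embedding"
logits = classifier(emb)                  # binding scores for the batch
```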
Transformer Neural Network
The Transformer27 is a deep neural network suited for NLP tasks. It relies on large weight matrices to store patterns and learn short- and long-term dependencies in training sequences. The Transformer network architecture is flexible and can be used for both classification and text generation tasks. In our implementation, the Transformer is used to generate output as an IUPAC chemical name which corresponds to a molecule described by a SMILES string provided as input.

Figure 1: Diagram of the two-step procedure followed to predict binding affinity using the learned embedding of a Transformer network.
Before being processed by the main layers of the Transformer, SMILES strings are converted to an initial, random embedding. Each character in the SMILES alphabet is replaced with a random vector, where the same vector is used for multiple occurrences of the same character. The values in this vector are the first network weights of the Transformer, and they are tuned during training based on the frequency, co-occurrence, and sequential dependencies of each SMILES character. Periodic functions at different frequencies are added to the signal of each vector so that the frequency of the added signal encodes a character's location in the SMILES sequence. This allows the character-specific layers of the Transformer to determine the order of one character relative to others.
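The following PyTorch sketch illustrates such a position-aware character embedding. The toy SMILES alphabet and the standard sinusoidal encoding are assumptions chosen for illustration, not necessarily the paper's exact vocabulary or frequency scheme.

```python
import math
import torch
import torch.nn as nn

# Sketch of the initial SMILES embedding: a learned vector per character plus
# fixed sinusoidal signals whose frequencies encode position in the string.

smiles_vocab = list("CNOFPSclnos()[]=#+-0123456789@/\\Hr ")  # toy alphabet
d_model, max_len = 512, 256

char_embedding = nn.Embedding(len(smiles_vocab), d_model)     # learned vectors

def positional_encoding(max_len, d_model):
    """Standard sinusoidal encoding: each dimension is a sine/cosine whose
    frequency identifies a character's location in the sequence."""
    position = torch.arange(max_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def embed_smiles(smiles):
    """Map a SMILES string to its initial (position-aware) embedding."""
    idx = torch.tensor([smiles_vocab.index(ch) for ch in smiles])
    emb = char_embedding(idx)                       # (len, d_model)
    return emb + positional_encoding(max_len, d_model)[: len(smiles)]

print(embed_smiles("CCCCC1CCNCC1F").shape)          # torch.Size([13, 512])
```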
Once an initial embedding of character vectors is generated, the signal in each vector is modulated by the layers in the Transformer's encoder stack. An encoder layer consists of a self-attention operation which modifies each character vector based on its relation to other characters in the sequence, followed by a simple matrix multiplication and nonlinearity which is applied to each character vector individually. The output of each layer is a set of character vectors with the same size as the input. The output of the final encoder layer is treated as a molecular embedding, where each character vector has been modified to contain abstract features useful for describing the structure of a molecule. The features in this embedding are used by an equivalent set of decoder layers for IUPAC name generation. Character vectors are processed by the decoder stack one at a time, resulting in a new character in the IUPAC alphabet being predicted. Decoder layers share a similar structure to encoder layers, except for a slightly modified form of self-attention which looks at previously predicted IUPAC characters to inform prediction of the next character. During training, previous predictions are ignored in favor of characters from the correct IUPAC name for a molecule.
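A compact way to realize this encoder-decoder arrangement is PyTorch's generic `nn.Transformer`, as sketched below. The vocabulary sizes, layer counts, and tokenization are placeholders, positional encoding is omitted for brevity (see the earlier sketch), and the causal target mask implements the teacher forcing described above.

```python
import torch
import torch.nn as nn

# Rough sketch of a SMILES -> IUPAC translation Transformer; sizes are assumed.
SMILES_VOCAB, IUPAC_VOCAB, D_MODEL = 64, 96, 512

class Smiles2IupacTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.src_embed = nn.Embedding(SMILES_VOCAB, D_MODEL)   # SMILES characters
        self.tgt_embed = nn.Embedding(IUPAC_VOCAB, D_MODEL)    # IUPAC characters
        # (positional encoding omitted for brevity)
        self.transformer = nn.Transformer(d_model=D_MODEL, nhead=8,
                                          num_encoder_layers=6,
                                          num_decoder_layers=6,
                                          batch_first=True)
        self.out = nn.Linear(D_MODEL, IUPAC_VOCAB)              # next-character logits

    def forward(self, src_ids, tgt_ids):
        # Causal mask: each decoder position may only attend to earlier
        # (ground-truth) IUPAC characters, i.e. teacher forcing during training.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(self.src_embed(src_ids),
                                  self.tgt_embed(tgt_ids),
                                  tgt_mask=tgt_mask)
        return self.out(hidden)

    def molecular_embedding(self, src_ids):
        # Output of the final encoder layer, repurposed later as the embedding.
        return self.transformer.encoder(self.src_embed(src_ids))

model = Smiles2IupacTransformer()
smiles_ids = torch.randint(0, SMILES_VOCAB, (2, 40))   # toy tokenized batch
iupac_ids = torch.randint(0, IUPAC_VOCAB, (2, 60))
logits = model(smiles_ids, iupac_ids)                   # (2, 60, IUPAC_VOCAB)
```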
Molecular Self-Attention
The core mechanism of the Transformer is 'self-attention'. In this operation, input vectors representing each character in a SMILES string are output as a linear combination of vectors for all characters in the string. The output of the attention layer is a vector for each character with the same length as the input. However, vectors for each character are weighted by an output-specific attention score which represents how relevant every character in the string is to a particular output. Attention scores are computed by weight matrices, which accept character vectors as input. The meaning of the attention scores depends on the task on which the model is being trained, and multiple sets of model weights are used to produce multiple sets of scores which may attend to different relevant features in the input. In the case of our translation task, attention scores may indicate the importance of a certain substructure for generating part of a molecule's IUPAC name. Example visualizations from the trained Transformer are shown in Figure 2. The matrices on the right of the figure demonstrate the capacity of the Transformer network to learn a descriptive, varied set of abstract molecular features useful for describing structure.

Figure 2: Left: A selection of attention weights for the SMILES string CCCCC1CCNCC1F are visualized, showing how the character vectors of each input on the left are weighted to produce the output vector for the 3rd carbon atom on the right. Opacity indicates larger attention values. Right: Some of the self-attention weights from the trained Transformer are visualized for a molecule. Separate weights in a single layer attend to different features which describe the same structure.
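For a single attention head, the self-attention operation above reduces to a few matrix products, as in the sketch below. The random weight matrices stand in for the learned projections, and in the full model multiple such heads run in parallel.

```python
import torch

# Minimal scaled dot-product self-attention over character vectors.
# Weight matrices and sizes are illustrative placeholders.

def self_attention(x, W_q, W_k, W_v):
    """x: (seq_len, d_model) character vectors for one SMILES string.
    Each output vector is a linear combination of all character vectors,
    weighted by learned attention scores."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / (k.size(-1) ** 0.5)        # relevance of every character
    weights = torch.softmax(scores, dim=-1)       # rows sum to 1
    return weights @ v, weights                   # new vectors + attention map

d_model = 512
x = torch.randn(13, d_model)                       # e.g. "CCCCC1CCNCC1F"
W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
out, attn = self_attention(x, W_q, W_k, W_v)
print(out.shape, attn.shape)                       # (13, 512) and (13, 13)
```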
Training Procedure
To train the Transformer network for translation, pairs of SMILES strings and IUPAC names are sourced directly from the PubChem compound database for 83,000,000 molecules. SMILES strings are used as-is, and no canonicalization is performed. Similarly, IUPAC names for each molecule are collected from PubChem with no modification. Deep neural networks trained on large, labeled datasets have been shown to be robust to unreliable annotation.30 Noise during neural network training can increase generalization due to the intricacies of network optimization,31 making the unmodified molecular text representations robust to under-fitting.
A Transformer network is created with 512-dimension character embedding vectors. A maximum SMILES string length of 256 characters is imposed during training, although this limit can be exceeded during screening inference. Thus, each molecular embedding contains 256 × 512 dimensions. Training is performed in batches of 96 molecular string pairs. The Adam optimization algorithm32 is used to update the weights of the network. The learning rate during optimization begins at 0.001 and decreases two orders of magnitude, following half a period of a cosine function, over the course of a single pass, or epoch, over the 83,000,000-molecule training set. Training continues for three epochs.
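One way to reproduce this schedule is to set the optimizer's learning rate manually at each step from a half-cosine curve. Whether the decay restarts at every epoch is an assumption in the sketch below, as is the placeholder parameter list.

```python
import math
import torch

# Half-cosine decay of the learning rate from 1e-3 down two orders of
# magnitude (to 1e-5) across each epoch of the translation training set.

lr_max, lr_min = 1e-3, 1e-5
steps_per_epoch = 83_000_000 // 96           # molecules / batch size

def cosine_lr(step):
    """Learning rate for a given optimizer step within one epoch."""
    progress = (step % steps_per_epoch) / steps_per_epoch   # 0 -> 1
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

params = [torch.nn.Parameter(torch.randn(2, 2))]  # placeholder parameters
optimizer = torch.optim.Adam(params, lr=lr_max)

for step in range(5):                              # a few illustrative steps
    for group in optimizer.param_groups:
        group["lr"] = cosine_lr(step)
    # ... forward pass, loss.backward(), and optimizer.step() would go here ...
```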
Experiments
The utility of the Transformer embedding was investigated by training and evaluating binding prediction models on molecular embeddings for three binary classification tasks. Equivalent prediction models are also trained on two representation baselines to quantify the Transformer embeddings' usefulness for binding affinity prediction and explain the relation between learned features and molecular properties. Finally, an unsupervised evaluation of the learned embedding is performed by visualizing how changes in molecular structure correspond to changes in the embedding.
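A minimal version of this comparison scores identical classifiers by AUC on each representation, for example with scikit-learn. The prediction arrays below are random placeholders rather than the paper's results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Sketch of the evaluation protocol: identical classifiers are compared by AUC
# on each molecular representation. Labels and scores here are synthetic.

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)                     # binding labels

scores = {
    "transformer_embedding": rng.random(200),             # model confidences
    "untrained_embedding":   rng.random(200),
    "smiles_baseline":       rng.random(200),
}

for name, y_score in scores.items():
    print(f"{name:>24s}  AUC = {roc_auc_score(y_true, y_score):.3f}")
```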
Assay Datasets
Three datasets for supervised prediction of binding assay results were compiled. In each, the results of an assay (or compilation of assays) for binding affinity to a target were sampled into a small, labeled dataset. Binary classification of molecular embeddings was performed by binning continuous, assay-specific binding affinity values into binding and non-binding categories according to an activity threshold. In each assay, fewer binding than non-binding molecules were identified. To account for this, the non-binding sets were randomly undersampled to match the count of binding molecules for the purpose of training and evaluating a balanced binding classifier.
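The dataset construction described here can be sketched in a few lines of NumPy: binarize the assay readout at an activity threshold, then randomly undersample the non-binding class to the size of the binding class. The affinity values and threshold below are hypothetical.

```python
import numpy as np

# Sketch of the dataset construction: continuous assay readouts are binarized
# at an activity threshold, then non-binders are undersampled to balance classes.

rng = np.random.default_rng(0)
affinity = rng.random(1000)                 # placeholder assay activity values
threshold = 0.9                             # hypothetical activity cutoff

labels = (affinity >= threshold).astype(int)       # 1 = binding, 0 = non-binding
binders = np.where(labels == 1)[0]
non_binders = np.where(labels == 0)[0]

# Balance the classes by undersampling the (larger) non-binding set.
keep = rng.choice(non_binders, size=len(binders), replace=False)
balanced_idx = np.concatenate([binders, keep])
rng.shuffle(balanced_idx)

print(len(binders), "binders,", len(keep), "non-binders after undersampling")
```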

Citations
Posted Content

Molecular representation learning with language models and domain-relevant auxiliary tasks.

TL;DR: The Transformer architecture is applied, specifically BERT, to learn flexible and high quality molecular representations for drug discovery problems, and molecular representations learnt by the model 'MolBert' improve upon the current state of the art on the benchmark datasets.
Journal ArticleDOI

Accurate predictions of aqueous solubility of drug molecules via the multilevel graph convolutional network (MGCN) and SchNet architectures.

TL;DR: This study proposes two novel models for aqueous solubility predictions, based on the Multilevel Graph Convolutional Network (MGCN) and SchNet architectures, respectively, and found that both the MGCN and SchNet models performed well for aqueous solubility predictions.
Journal ArticleDOI

Accurate predictions of drugs aqueous solubility via deep learning tools

TL;DR: This study proposed one novel and efficient quantitative structure-property relationship (QSPR) model for molecular properties predictions within the framework of deep learning neural network (DNN), using molecular descriptors calculated by Mordred.
Posted Content

Geometric Deep Learning on Molecular Representations

TL;DR: Geometric deep learning (GDL) has emerged as a recent paradigm in artificial intelligence as discussed by the authors and has shown particular promise in molecular modeling applications, in which various molecular representations with different symmetry properties and levels of abstraction exist.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
Journal ArticleDOI

Long short-term memory

TL;DR: A novel, efficient, gradient based method called long short-term memory (LSTM) is introduced, which can learn to bridge minimal time lags in excess of 1000 discrete-time steps by enforcing constant error flow through constant error carousels within special units.
Journal ArticleDOI

Deep learning

TL;DR: Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Book

Deep Learning

TL;DR: Deep learning as mentioned in this paper is a form of machine learning that enables computers to learn from experience and understand the world in terms of a hierarchy of concepts, and it is used in many applications such as natural language processing, speech recognition, computer vision, online recommendation systems, bioinformatics, and videogames.
Journal Article

Dropout: a simple way to prevent neural networks from overfitting

TL;DR: It is shown that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
Frequently Asked Questions (16)
Q1. What have the authors contributed in "Predicting binding from screening assays with transformer network embeddings" ?

In this paper, an end-to-end Transformer neural network, trained to encode the structural characteristics of a molecule via a text-based translation task, is repurposed through transfer learning to classify binding affinity to a single target. 

While overall accuracy was somewhat limited and varied per-target, these results suggest a promising direction for further research into the application of deep learning to direct modeling of assay experiment results as a computational screening aid to existing drug discovery pipelines. Data-driven models trained on the Transformer embeddings can be applied as a quick, inexpensive computational screening method to assist the early drug discovery process for targets where a functional assay has been designed. 

The authors found that learning a mapping of chemical space via a Transformer network achieved increased accuracy of data-driven models on multiple binding affinity prediction tasks compared to models trained on hand-designed or untrained representations. 

In this case, the SMILES string for each molecule is converted to a random embedding of 512-dimensional vectors for each character. 

Numeric property values were normalized between 0 and 1 according to the minimum and maximum values of all screened molecules, on a per-dataset basis. 

The networks used to classify binding affinity are identical to the Transformer and random embedding networks, except only 20 input neurons are needed in this case. 

To analyze how the learned molecular embeddings encode binding properties, the authors modified molecular sequences and observed changes in binding confidence to HIV-1 Protease from a binary classifier. 

The same model used in all three reaction experiments was a simple CNN composed of an input layer, two hidden convolutional layers with ReLU, and a fully connected output layer originally trained on the HIV dataset for target binding classification using a random embedding of SMILES strings.