Leveraging molecular structure and bioactivity with chemical language models for drug design


Moret et al.
1
Michael Moret¹, Francesca Grisoni¹,²*, Cyrill Brunner¹ & Gisbert Schneider¹,³*
¹ ETH Zurich, Department of Chemistry and Applied Biosciences, RETHINK, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland;
² Eindhoven University of Technology, Institute for Complex Molecular Systems, Department of Biomedical Engineering, Groene Loper 7, 5612AZ Eindhoven, Netherlands;
³ ETH Singapore SEC Ltd, 1 CREATE Way, #06-01 CREATE Tower, Singapore 138602, Singapore;
* Correspondence to Gisbert Schneider (gisbert@ethz.ch) and Francesca Grisoni (f.grisoni@tue.nl)
Abstract
Generative chemical language models (CLMs) can be used for de novo molecular structure
generation. These CLMs learn from the structural information of known molecules to generate
new ones. In this paper, we show that “hybrid” CLMs can additionally leverage the bioactivity
information available for the training compounds. To computationally design ligands of
phosphoinositide 3-kinase gamma (PI3Kγ), we created a large collection of virtual molecules with
a generative CLM. This primary virtual compound library was further refined using a CLM-based
classifier for bioactivity prediction. This second hybrid CLM was pretrained with patented
molecular structures and fine-tuned with known PI3Kγ binders and non-binders by transfer
learning. Several of the computer-generated molecular designs were commercially available,
which allowed for fast prescreening and preliminary experimental validation. A new PI3Kγ ligand
with sub-micromolar activity was identified. The results support the use of hybrid CLMs for virtual
compound screening and activity-focused molecular design in low-data situations.
Introduction
Computational methods have become key players in hit and lead discovery in pharmaceutical research, complementing experimental high-throughput screening [1]. Bespoke virtual compound libraries provide access to untapped regions of the chemical space [2], thereby extending the diversity of potential drug candidates. However, owing to the potentially unlimited size of virtual chemical libraries, concerns have been raised over the pragmatism of successfully screening billions of molecules virtually with a potentially high risk of false positives [2,3]. To mitigate some of these challenges, researchers have employed generative deep learning models to construct compounds on demand by de novo design and to obtain small, bespoke virtual compound libraries [4,5]. A variety of data-driven approaches can be used to generate focused virtual chemical libraries and create molecules with the desired properties [5–18]. Chemical language models (CLMs) are based on deep learning networks for processing string representations of molecules (e.g., simplified molecular input line entry system (SMILES) strings; Fig. 1a) [5,7,19]. CLMs have already been successfully employed to generate focused virtual chemical libraries. Examples of de novo designed bioactive molecules include inhibitors of vascular endothelial growth factor receptor 2 kinase and the unfolded protein response pathway [7], and nuclear hormone receptor modulators [20–23].
The creation of a focused virtual chemical library with a CLM generally includes three basic steps: (i) model pretraining with a large set of molecules to learn the SMILES grammar and the feature distribution of the pretraining data, (ii) transfer learning with a smaller set of molecules (fine-tuning set) to bias the molecule generation by the CLM toward the chemical space of interest, and (iii) sampling of new molecules from the data distributions modeled in steps (i) and (ii) [5,24]. There are alternative approaches for CLM development, e.g., model fine-tuning (step ii) by reinforcement learning [6,25].
In this study, we developed a data-driven molecular design pipeline that leverages both the structural and bioactivity information of known ligands to generate de novo bespoke molecules. We pretrained two CLMs, each with a distinct pretraining strategy, on a large set of patented compound structures (one for molecular generation and one for classification). Both CLMs were fine-tuned on inhibitors of phosphoinositide 3-kinase gamma (PI3Kγ), which is an anticancer, anti-inflammatory, and immunomodulatory drug target [26,27]. For rapid validation, commercially available compounds from the set of de novo generated molecules were tested, as opposed to synthesizing them. A new nanomolar ligand of phosphoinositide 3-kinase gamma (PI3Kγ) was identified.
Results and Discussion
Molecular design and scoring were performed in two steps, each of which was executed by a
distinct CLM: (i) molecular de novo design and (ii) refinement of the generated virtual molecule
library using the available ligand bioactivity data for the target of interest (PI3Kγ).
Focused library generation
Chemical language model. A CLM based on a long short-term memory (LSTM) model and SMILES strings as input was developed for the de novo generation of a focused virtual chemical library for PI3Kγ [28]. To learn from unlabeled data, CLMs leverage “self-supervised” learning [29]. Specifically, the CLM was trained with an autoregressive approach, i.e., the process of iteratively predicting the next character in a SMILES string given all the previous characters in the string (Fig. 2a) [30]. In previous studies, CLMs were pretrained on molecules with known biological activity (IC50, EC50, Kd, and Ki) <1 µM retrieved from the ChEMBL database [20,23,31–33]. Although the training set can capture the general features of bioactive compounds, it does not necessarily represent the physicochemical properties of approved drugs. Here, to enable the CLM to capture features more related to approved drugs, we used 839,674 molecules from the US patent database for the CLM pretraining [34]. We hypothesized that patented compounds are more likely to become marketed drugs than the molecules deposited in ChEMBL. Transfer learning was performed to properly focus the pretrained CLM toward the target space of PI3Kγ ligands. For transfer learning, 46 PI3Kγ inhibitors with IC50 ≤100 nM were selected from the Drug Target Commons (DTC) database [35].
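The autoregressive objective described above can be illustrated with a minimal sketch that turns one SMILES string into (context, next token) training pairs. The tokenizer pattern, the `<start>` and `<stop>` marker names, and the function names are assumptions made for illustration; the source does not specify how the authors tokenized their strings.

```python
import re

# Simplified SMILES tokenizer (an assumption, not the authors' tokenizer):
# bracket atoms, the two-letter halogens Br/Cl, and two-digit ring closures
# are kept as single tokens; everything else is split per character.
TOKEN_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

# Hypothetical start/stop markers framing every string.
START, STOP = "<start>", "<stop>"


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_PATTERN.findall(smiles)


def next_token_pairs(smiles):
    """Return (context, target) pairs: each target is the token that follows
    the context, which is what an autoregressive model learns to predict."""
    tokens = [START] + tokenize(smiles) + [STOP]
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    for context, target in next_token_pairs("CCO"):
        print(context, "->", target)
```

Each pair asks the model to predict one token from everything that precedes it, so a string with n tokens yields n + 1 training examples once the stop marker is included.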
Nucleus sampling for molecule generation. CLMs generate new molecules by extending strings from a “start” character until the “stop” character is sampled or a preset maximum string length is reached. String characters are iteratively added by weighted random sampling from the probability distribution learned by the CLM during training. The more likely a given character is at a given step according to the probabilities learned by the CLM, the more often it will be sampled, and vice versa. Narrowing the probabilities learned by the CLM with a parameter (the so-called temperature; Fig. 1b) generally improves the SMILES string sampling [31]. This improvement occurs in terms of (i) the quality of the SMILES strings generated, as reflected by their validity (grammatically valid SMILES strings), uniqueness (nonrepetitive molecules), and novelty (molecules not present in the pretraining and fine-tuning data), and (ii) the similarity of the sampled virtual chemical libraries to the reference data in terms of their chemical structures and bioactivities, as measured by the Fréchet ChemNet Distance (FCD) [36]. However, with this “temperature sampling” approach, unlikely SMILES characters can still be sampled, which could result in the construction of molecules that do not match the design objective. To prevent the CLM from picking unlikely SMILES characters, we employed “nucleus sampling” here [37]. This method reflects the confidence of the model in its predictions by allowing only the most probable character(s) to be sampled using a probability threshold based on the cumulative probabilities of the SMILES characters (Fig. 1c).
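The difference between the two sampling schemes can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the function names and the four-character probability table are invented for the example, and only the threshold of 0.85 is taken from the text.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a distribution: temperature < 1 sharpens it, > 1 flattens it.
    Every character with nonzero probability remains sampleable."""
    logits = {tok: math.log(p) / temperature for tok, p in probs.items() if p > 0}
    peak = max(logits.values())  # subtract the maximum for numerical stability
    weights = {tok: math.exp(v - peak) for tok, v in logits.items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}


def nucleus(probs, threshold):
    """Keep the smallest set of most probable characters whose cumulative
    probability reaches the threshold, then renormalize over that set."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= threshold:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def sample(probs, rng):
    """Draw one character by weighted random sampling."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


if __name__ == "__main__":
    # Invented next-character probabilities, for illustration only.
    probs = {"C": 0.55, "N": 0.35, "O": 0.07, "F": 0.03}
    rng = random.Random(0)
    print(apply_temperature(probs, 0.7))  # O and F stay possible
    print(nucleus(probs, 0.85))           # O and F are excluded
    print(sample(nucleus(probs, 0.85), rng))
```

With these invented numbers, temperature scaling lowers the probability of the rare characters but never removes them, whereas the nucleus filter excludes them entirely once the cumulative probability of the top-ranked characters reaches the threshold.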
Nucleus sampling improved upon temperature sampling in terms of lower FCD values (Fig. 1d), indicating a greater overall similarity of the de novo generated molecules to the pretraining set in terms of structural and bioactivity properties. During transfer learning, nucleus sampling generally improved the quality of the sampled molecules in terms of the novelty of the SMILES strings compared to the best temperature sampling data obtained (Fig. 1e) [33]. The results were stable over a range of sampling threshold values (Supplementary Table S1). However, nucleus sampling did not outperform temperature sampling in terms of the uniqueness, validity, and novelty of the SMILES strings generated after the pretraining (Supplementary Table S2). To create a PI3Kγ focused chemical library during transfer learning, we used nucleus sampling with a threshold of 0.85. In each of 10 repetitions, 5000 SMILES strings were sampled at each of 50 transfer learning epochs (5000 × 50 × 10). A total of 2,500,000 SMILES strings were generated, of which 1,121,735 were valid, unique, and novel compared to both the training and fine-tuning compounds.
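The library counts above follow from simple filtering, which the sketch below makes explicit. It is a hedged illustration: `library_counts` and `balanced_parentheses` are hypothetical names, the validity check is a toy stand-in (a real pipeline would parse each string with a cheminformatics toolkit), and the code assumes that all strings are already in one canonical form so that plain string comparison identifies identical molecules.

```python
def balanced_parentheses(smiles):
    """Toy validity check used only for this example: branches must be
    balanced. It is NOT a real SMILES validity test."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def library_counts(sampled, reference, is_valid):
    """Count sampled strings that are valid, then unique, then novel with
    respect to the reference (pretraining plus fine-tuning) strings."""
    valid = [s for s in sampled if is_valid(s)]
    unique = set(valid)
    novel = unique - set(reference)
    return {
        "sampled": len(sampled),
        "valid": len(valid),
        "unique": len(unique),
        "valid_unique_novel": len(novel),
    }


if __name__ == "__main__":
    sampled = ["CCO", "CCO", "CC(", "CCN", "c1ccccc1"]
    reference = {"CCN"}
    print(library_counts(sampled, reference, balanced_parentheses))

    # Arithmetic from the text: 5000 strings x 50 epochs x 10 repetitions.
    total_sampled = 5000 * 50 * 10
    fraction_kept = 1_121_735 / total_sampled
    print(total_sampled, round(100 * fraction_kept, 1))
```

Applied to the reported numbers, 1,121,735 of 2,500,000 sampled strings corresponds to about 44.9% of all sampled strings passing the three filters.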

Fig. 1 | De novo molecular generation with the CLM. a, SMILES string representation of a
molecule. b, Example of the effect of the temperature parameter on the probability distribution
learnt by the CLM. c, Example of the effect of the nucleus sampling threshold. Only the characters
N and C can be sampled here. d, Fréchet ChemNet Distance (FCD) comparison between
temperature and nucleus sampling after the pretraining (reported as the mean with standard
deviation over 10 repeats with 5000 molecules sampled per repeat). e, Comparison of the novelty
of the generated SMILES strings during the transfer learning between temperature sampling
(temperature = 0.7) and nucleus sampling (threshold = 0.85). Mean values (lines) and standard
deviations (shaded areas) are shown for 10 repeats (1000 SMILES strings were sampled every
second epoch over 40 epochs). Novelty is expressed as the percentage of SMILES strings
generated that were valid and not included in either the training or the fine-tuning data.
Bioactivity prediction with a hybrid chemical language model
Leveraging bioactivity data for molecule selection. The availability of bioactivity data for the fine-tuning molecules permitted the training of a bioactivity prediction model to select the most promising de novo designs [38]. Classical chemoinformatics methods often rely on precomputed features (molecular descriptors), combined with a machine learning algorithm for molecular property prediction. In this study, we aimed to explore the potential of a SMILES string-based hybrid CLM to predict the bioactivity. This neural network model combines a generative CLM with a classifier network. Given that (i) inactive molecules were annotated with PI3Kγ pIC50 = 4.0 (Fig. 2c) and (ii) there is a natural ordering of the PI3Kγ ligands according to their pIC50 values, the bioactivity prediction task was framed as an ordinal classification task, i.e., classification with a class order [39]. Such a model considers both the active and inactive compounds for training and preserves both the class labels and the class order. For model training, we defined three class labels: “inactive” (pIC50 ≤ 4.0, 34 molecules), “moderately active” (4.0 < pIC50 ≤ 6.5, 121 molecules), and “highly active” (pIC50 > 6.5, 43 molecules). The CLM generated a focused virtual chemical library by leveraging the structural information of the molecules used for fine-tuning, while the classifier layer factored their activity labels into the model (Fig. 2d).
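The three activity classes and their ordering can be made concrete with a short sketch. Two points are assumptions rather than statements from the source: the class boundaries are read as pIC50 ≤ 4.0 for “inactive” and 4.0 < pIC50 ≤ 6.5 for “moderately active”, and the cumulative binary encoding shown here is one common way to represent ordinal targets, not necessarily the encoding the authors used. All function names are hypothetical.

```python
CLASS_NAMES = ("inactive", "moderately active", "highly active")


def activity_class(pic50):
    """Map a pIC50 value to an ordered class index (0 < 1 < 2).
    Boundary handling (<= 4.0 and <= 6.5) is an assumption."""
    if pic50 <= 4.0:
        return 0
    if pic50 <= 6.5:
        return 1
    return 2


def ordinal_targets(class_index, n_classes=3):
    """Cumulative encoding: target k answers 'is the class above level k?'.
    Neighboring classes differ in one position, which preserves the order."""
    return [1 if class_index > k else 0 for k in range(n_classes - 1)]


def decode(probabilities, cutoff=0.5):
    """Turn the per-level probabilities back into a class index by counting
    how many levels are predicted to be exceeded."""
    return sum(1 for p in probabilities if p > cutoff)


if __name__ == "__main__":
    for value in (4.0, 5.2, 6.5, 7.1):
        index = activity_class(value)
        print(value, CLASS_NAMES[index], ordinal_targets(index))
```

Under this encoding, misclassifying a “highly active” compound as “inactive” changes two target positions, whereas confusing neighboring classes changes only one, so the training signal reflects the class order.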
We explored two different pretraining strategies for feature learning with a large amount of
unlabeled data.
1. Autoregressive pretraining (Fig. 2a). This strategy is analogous to the one performed for
the generative CLM.
2. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) pretraining (Fig. 2b) [40]. The ELECTRA approach is based on training a model to distinguish between real input characters and corrupt ones, which was previously shown to be useful for contextual representation of natural language [40]. We adapted ELECTRA for the CLM training with an LSTM model and SMILES strings as input [28]. The training data contained corrupted input SMILES strings generated by randomly substituting multiple characters with other characters of the SMILES language. The CLM was trained to spot the corrupted characters.
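The corruption step used for this kind of pretraining can be sketched as follows. This is a minimal illustration of random character substitution with per-position labels; the function name `corrupt`, the substitution rate, and the small vocabulary are assumptions, since the source does not state how many characters were replaced or how replacements were chosen.

```python
import random


def corrupt(tokens, vocabulary, rate, rng):
    """Replace each token with a different vocabulary token with probability
    `rate`. Returns the corrupted sequence and one label per position
    (1 = replaced, 0 = original), i.e. the targets a detector is trained on.
    The vocabulary must contain at least two distinct tokens."""
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < rate:
            # Exclude the original token so a replacement is always a real change.
            replacement = rng.choice([v for v in vocabulary if v != tok])
            corrupted.append(replacement)
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels


if __name__ == "__main__":
    # Hypothetical mini-vocabulary and substitution rate, for illustration.
    vocabulary = ["C", "N", "O", "c", "1", "(", ")", "="]
    rng = random.Random(42)
    original = list("CC(=O)Nc1ccccc1")
    corrupted, labels = corrupt(original, vocabulary, rate=0.15, rng=rng)
    print("".join(original))
    print("".join(corrupted))
    print(labels)
```

The model then receives the corrupted sequence as input and is trained to predict the label at every position, which forces it to judge each character in its full context.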
We hypothesized that, compared to autoregressive pretraining, ELECTRA pretraining has a more appropriate inductive bias (i.e., the set of algorithmic assumptions to solve a given task) to extract useful features for ordinal classification. The inductive bias of autoregressive pretraining is particularly suited for generating SMILES strings because the training and generative tasks are the same, namely, adding characters iteratively. However, ligands of the same macromolecular target tend to have similar chemical substructures, and, therefore, the ability of a model to distinguish small structural changes was deemed relevant. At the same time, small structural changes might lead to drastic variation of the biological activity (the so-called activity cliffs) [41]. Hereinafter, the model that was pretrained with the ELECTRA method is referred to as “E-CLM.”

Frequently Asked Questions (20)
Q1. What have the authors contributed in "Leveraging molecular structure and bioactivity with chemical language models for drug design" ?

In this paper, the authors show that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), the authors created a large collection of virtual molecules with a generative CLM. This primary virtual compound library was further refined using a CLM-based classifier for bioactivity prediction.

Importantly, CLM training was performed without data augmentation to study the positive effect of nucleus sampling on the generation of a SMILES string. Future prospective studies will also have to assess the general applicability of this approach to other targets from different target families. This study highlights the versatility of generative deep learning for hit and lead finding in drug discovery, where the same computational pipeline can be used to both create new molecules and screen libraries of existing compounds. The authors envision future projects in which de novo design methods are first validated for physically available molecules from a compound repository or commercial suppliers before investing in potentially more expensive and time-consuming syntheses. 

To increase the confidence in the bioactivity predictions, the authors used a deep ensemble model by combining the predictions of multiple models with a majority voting approach [48,49].

To mitigate the class data imbalance, the authors applied oversampling to the classes with fewer data (i.e., the “inactive” and “highly active” classes) [44].

Among these top-ranked molecules, 64% featured a new atom scaffold and 62% featured a new graph scaffold with respect to the fine-tuning set [52,53].

With increasing confidence levels, the number of molecules predicted as “highly active” decreased (Fig. 3a), a documented effect of ensemble voting [51].

To probe the effect of the pretraining scheme on the predictions, the authors added only a single feedforward layer to the pretrained CLM and E-CLM for bioactivity prediction.

Hit compound 1 has a new atom scaffold compared to all molecules in the ChEMBL database (version 28) annotated with “pActivity” ≥ 5.0 on PI3Kγ (“pActivity”: -log(molar IC50, XC50, EC50, AC50, Ki, Kd, or “potency”)).