
Leveraging molecular structure and bioactivity with chemical language models for drug design

TL;DR: It is shown that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds, supporting their use for virtual compound screening and activity-focused molecular design in low-data situations.

Summary (1 min read)

Introduction

  • Computational methods have become key players in hit and lead discovery in pharmaceutical research, complementing experimental high-throughput screening1.
  • There are alternative approaches for CLM development, e.g., model fine-tuning (step ii) by reinforcement learning6,25.
  • Both CLMs were fine-tuned on inhibitors of phosphoinositide 3-kinase gamma (PI3Kγ), which is an anticancer, anti-inflammatory, and immunomodulatory drug target26,27.
  • For rapid validation, commercially available compounds from the set of de novo generated molecules were tested, as opposed to synthesizing them.

Focused library generation

  • A CLM based on a long short-term memory (LSTM) model and SMILES strings as input was developed for the de novo generation of a focused virtual chemical library for PI3Kγ28.
  • To learn from unlabeled data, CLMs leverage “self-supervised” learning29.
  • Here, to enable the CLM to capture features more related to approved drugs, the authors used 839,674 molecules from the US patent database for the CLM pretraining34.
  • Narrowing the probabilities learned by the CLM with a parameter (the so-called temperature; Fig. 1b) generally improves the SMILES string sampling31.

In vitro bioactivity testing

  • For a proof of concept, some of the molecules generated by the CLM were tested for PI3Kγ binding in vitro.
  • The authors hypothesize that this might be due to the positive effect of the ELECTRA pretraining, which was aimed at recognizing the effect of small structural changes.
  • This preliminary in vitro validation advocates an ensemble prediction approach for virtual compound screening and ranking of the computer-generated molecular designs.

Conclusion

  • Methodological improvements in CLM training advanced the sampling of target-focused virtual molecule libraries.
  • It remains to be determined in more detail to what extent the CLM pretraining method affects model performance in the downstream task, i.e., molecular generation or ordinal classification.
  • Importantly, CLM training was performed without data augmentation to study the positive effect of nucleus sampling on the generation of a SMILES string.
  • The long time required for hit-to-lead expansion and for preclinical and clinical drug development until a marketed drug is obtained will likely preclude any such analysis.
  • Obtaining rapid experimental validation of a set of readily available de novo designed molecules prior to embarking on de novo synthesis might help assess the value of computationally generated activity-focused chemical libraries.


Moret et al.

Leveraging molecular structure and bioactivity with chemical language models for drug design

Michael Moret¹, Francesca Grisoni¹,²*, Cyrill Brunner¹ & Gisbert Schneider¹,³*

¹ ETH Zurich, Department of Chemistry and Applied Biosciences, RETHINK, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland
² Eindhoven University of Technology, Institute for Complex Molecular Systems, Department of Biomedical Engineering, Groene Loper 7, 5612AZ Eindhoven, Netherlands
³ ETH Singapore SEC Ltd, 1 CREATE Way, #06-01 CREATE Tower, Singapore 138602, Singapore
* Correspondence to Gisbert Schneider (gisbert@ethz.ch) and Francesca Grisoni (f.grisoni@tue.nl)
Abstract

Generative chemical language models (CLMs) can be used for de novo molecular structure generation. These CLMs learn from the structural information of known molecules to generate new ones. In this paper, we show that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), we created a large collection of virtual molecules with a generative CLM. This primary virtual compound library was further refined using a CLM-based classifier for bioactivity prediction. This second hybrid CLM was pretrained with patented molecular structures and fine-tuned with known PI3Kγ binders and non-binders by transfer learning. Several of the computer-generated molecular designs were commercially available, which allowed for fast prescreening and preliminary experimental validation. A new PI3Kγ ligand with sub-micromolar activity was identified. The results positively advocate hybrid CLMs for virtual compound screening and activity-focused molecular design in low-data situations.
Introduction

Computational methods have become key players in hit and lead discovery in pharmaceutical research, complementing experimental high-throughput screening1. Bespoke virtual compound libraries provide access to untapped regions of the chemical space2, thereby extending the diversity of potential drug candidates. However, owing to the potentially unlimited size of virtual chemical libraries, concerns have been raised over the pragmatism of successfully screening billions of molecules virtually with a potentially high risk of false positives2,3. To mitigate some of these challenges, researchers have employed generative deep learning models to construct compounds on demand by de novo design and to obtain small, bespoke virtual compound libraries4,5. A variety of data-driven approaches can be used to generate focused virtual chemical libraries and create molecules with the desired properties5–18. Chemical language models (CLMs) are based on deep learning networks for processing string representations of molecules (e.g., simplified molecular input line entry system (SMILES) strings; Fig. 1a)5,7,19. CLMs have already been successfully employed to generate focused virtual chemical libraries. Examples of de novo designed bioactive molecules include inhibitors of vascular endothelial growth factor receptor 2 kinase and the unfolded protein response pathway7, and nuclear hormone receptor modulators20–23.

The creation of a focused virtual chemical library with a CLM generally includes three basic steps: (i) model pretraining with a large set of molecules to learn the SMILES grammar and the feature distribution of the pretraining data, (ii) transfer learning with a smaller set of molecules (fine-tuning set) to bias the molecule generation by the CLM toward the chemical space of interest, and (iii) sampling of new molecules from the data distributions modeled in steps (i) and (ii)5,24. There are alternative approaches for CLM development, e.g., model fine-tuning (step ii) by reinforcement learning6,25.
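The three-step workflow can be sketched end to end with a toy stand-in for the CLM. In this sketch a character-bigram counter replaces the LSTM, and the molecule sets are illustrative placeholders, not the actual training data:

```python
import random
from collections import defaultdict

class TinyCharModel:
    """Character-bigram stand-in for a CLM (a real CLM would be an LSTM)."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, smiles_list, epochs=1):
        # Steps (i)/(ii): count next-character frequencies ("^" = start, "$" = stop).
        for _ in range(epochs):
            for s in smiles_list:
                padded = "^" + s + "$"
                for a, b in zip(padded, padded[1:]):
                    self.counts[a][b] += 1

    def sample(self, max_len=80, rng=None):
        # Step (iii): extend the string character by character until "$" or max_len.
        rng = rng or random.Random(0)
        out, ch = [], "^"
        for _ in range(max_len):
            nxt = self.counts[ch]
            if not nxt:
                break
            chars, weights = zip(*nxt.items())
            ch = rng.choices(chars, weights=weights)[0]
            if ch == "$":
                break
            out.append(ch)
        return "".join(out)

# (i) pretrain on a large general set, (ii) fine-tune on a small focused set,
# (iii) sample new strings from the fine-tuned model.
model = TinyCharModel()
model.train(["CCO", "CCN", "c1ccccc1"])   # stand-in pretraining set
model.train(["CCOC", "CCOCC"], epochs=5)  # stand-in fine-tuning set (biases the counts)
print(model.sample())
```

Fine-tuning here simply adds counts from the focused set, biasing sampling toward it; transfer learning in the paper plays the analogous role for the LSTM weights.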
In this study, we developed a data-driven molecular design pipeline that leverages both the structural and bioactivity information of known ligands to generate de novo bespoke molecules. We pretrained two CLMs, each with a distinct pretraining strategy, on a large set of patented compound structures (one for molecular generation and one for classification). Both CLMs were fine-tuned on inhibitors of phosphoinositide 3-kinase gamma (PI3Kγ), which is an anticancer, anti-inflammatory, and immunomodulatory drug target26,27. For rapid validation, commercially available compounds from the set of de novo generated molecules were tested, as opposed to synthesizing them. A new nanomolar PI3Kγ ligand was identified.
Results and Discussion

Molecular design and scoring were performed in two steps, each of which was executed by a distinct CLM: (i) molecular de novo design and (ii) refinement of the generated virtual molecule library using the available ligand bioactivity data for the target of interest (PI3Kγ).
Focused library generation

Chemical language model. A CLM based on a long short-term memory (LSTM) model and SMILES strings as input was developed for the de novo generation of a focused virtual chemical library for PI3Kγ28. To learn from unlabeled data, CLMs leverage “self-supervised” learning29. Specifically, the CLM was trained with an autoregressive approach, i.e., the process of iteratively predicting the next character in a SMILES string given all the previous characters in the string (Fig. 2a)30. In previous studies, CLMs were pretrained on molecules with known biological activity (IC50, EC50, Kd, and Ki) <1 µM retrieved from the ChEMBL database20,23,31–33. Although the training set can capture the general features of bioactive compounds, it does not necessarily represent the physicochemical properties of approved drugs. Here, to enable the CLM to capture features more related to approved drugs, we used 839,674 molecules from the US patent database for the CLM pretraining34. We hypothesized that patented compounds are more likely to become marketed drugs than the molecules deposited in ChEMBL. Transfer learning was performed to properly focus the pretrained CLM toward the target space of PI3Kγ ligands. For transfer learning, 46 PI3Kγ inhibitors with IC50 ≤100 nM were selected from the Drug Target Commons (DTC) database35.
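The autoregressive objective can be made concrete: each SMILES string yields one (context, next character) training pair per position. A minimal sketch (tokenization is simplified to single characters; real CLMs typically treat multi-character tokens such as "Cl" or "Br" as single symbols):

```python
def autoregressive_pairs(smiles, start="^", stop="$"):
    """Turn one SMILES string into (context, next_char) training pairs."""
    s = start + smiles + stop
    return [(s[:i], s[i]) for i in range(1, len(s))]

# The model is trained to predict `target` from `context` at every position.
for context, target in autoregressive_pairs("CCO"):
    print(f"{context!r} -> {target!r}")
```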
Nucleus sampling for molecule generation. CLMs generate new molecules by extending strings from a “start” character until the “stop” character is sampled or a preset maximum string length is reached. String characters are iteratively added by weighted random sampling from the probability distribution learned by the CLM during training. The more likely a given character is at a given step according to the probabilities learned by the CLM, the more often it will be sampled, and vice versa. Narrowing the probabilities learned by the CLM with a parameter (the so-called temperature; Fig. 1b) generally improves the SMILES string sampling31. This improvement occurs in terms of (i) the quality of the SMILES strings generated, as reflected by their validity (grammatically valid SMILES strings), uniqueness (nonrepetitive molecules), and novelty (molecules not present in the pretraining and fine-tuning data), and (ii) the similarity of the sampled virtual chemical libraries to the reference data in terms of their chemical structures and bioactivities, as measured by the Fréchet ChemNet Distance (FCD)36. However, even with this “temperature sampling” approach, unlikely SMILES characters can still be picked, which could result in the construction of molecules that do not match the design objective. To prevent the CLM from picking unlikely SMILES characters by temperature sampling, we employed “nucleus sampling” here37. This method reflects the confidence of the model in its predictions by allowing only the most probable character(s) to be sampled, using a probability threshold based on the cumulative probabilities of the SMILES characters (Fig. 1c).
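Both sampling schemes can be illustrated on a toy distribution over four SMILES characters at one generation step. This is a sketch: the temperature form is the usual rescaling of the learned probabilities, and the 0.85 threshold matches the value used later in this study:

```python
import math

def apply_temperature(probs, temperature):
    """Sharpen (T < 1) or flatten (T > 1) a probability distribution."""
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    z = sum(scaled)
    return [p / z for p in scaled]

def nucleus_filter(probs, threshold=0.85):
    """Keep only the smallest set of characters whose cumulative probability
    reaches the threshold; all other characters get probability zero."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= threshold:
            break
    z = sum(probs[i] for i in kept)
    return [probs[i] / z if i in kept else 0.0 for i in range(len(probs))]

# Toy next-character distribution:
chars = ["C", "N", "O", ")"]
probs = [0.60, 0.25, 0.10, 0.05]
print(apply_temperature(probs, 0.7))  # sharper: mass shifts toward "C"
print(nucleus_filter(probs, 0.85))    # only "C" and "N" remain sampleable
```

With temperature sampling every character keeps a nonzero probability, however small; the nucleus filter zeroes out the improbable tail entirely, which is the behavior the text describes.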
Nucleus sampling improved upon temperature sampling in terms of lower FCD values (Fig. 1d), indicating a greater overall similarity of the de novo generated molecules to the pretraining set in terms of structural and bioactivity properties. During transfer learning, nucleus sampling generally improved the quality of the sampled molecules in terms of the novelty of the SMILES strings compared to the best temperature sampling data obtained (Fig. 1e)33. The results were stable over a range of sampling threshold values (Supplementary Table S1). However, nucleus sampling did not outperform temperature sampling in terms of the uniqueness, validity, and novelty of the SMILES strings generated after the pretraining (Supplementary Table S2). To create a PI3Kγ-focused chemical library during transfer learning, we used nucleus sampling with a threshold of 0.85. A total of 5000 SMILES strings were sampled over 50 transfer learning epochs with 10 repetitions (5000 × 50 × 10). Of the 2,500,000 SMILES strings generated, 1,121,735 were valid, unique, and novel compared to both the training and fine-tuning compounds.
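The validity, uniqueness, and novelty statistics used above can be computed as follows. This is a sketch: the validity check is a trivial bracket-balance stand-in, whereas a real pipeline would parse each string (e.g., RDKit's Chem.MolFromSmiles returning None for invalid SMILES) and compare canonicalized forms:

```python
def is_valid(smiles):
    """Stand-in validity check (balanced branch parentheses only).
    A real pipeline would use a full SMILES parser such as RDKit."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

def library_quality(sampled, known):
    """Validity, uniqueness, and novelty fractions of a sampled SMILES library."""
    valid = [s for s in sampled if is_valid(s)]
    unique = set(valid)                 # nonrepetitive molecules
    novel = unique - set(known)         # not in pretraining/fine-tuning data
    n = len(sampled)
    return {"validity": len(valid) / n,
            "uniqueness": len(unique) / n,
            "novelty": len(novel) / n}

sampled = ["CCO", "CCO", "CC(C", "CCN"]  # one duplicate, one invalid string
known = ["CCO"]                          # stand-in training + fine-tuning set
print(library_quality(sampled, known))
```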

Fig. 1 | De novo molecular generation with the CLM. a, SMILES string representation of a molecule. b, Example of the effect of the temperature parameter on the probability distribution learnt by the CLM. c, Example of the effect of the nucleus sampling threshold. Only the characters N and C can be sampled here. d, Fréchet ChemNet Distance (FCD) comparison between temperature and nucleus sampling after the pretraining (reported as the mean with standard deviation over 10 repeats, with 5000 molecules sampled per repeat). e, Comparison of the novelty of the generated SMILES strings during transfer learning between temperature sampling (temperature = 0.7) and nucleus sampling (threshold = 0.85). Mean values (lines) and standard deviations (shaded areas) are shown for 10 repeats (1000 SMILES strings were sampled every second epoch over 40 epochs). Novelty is expressed as the percentage of generated SMILES strings that were valid and not included in either the training or the fine-tuning data.
Bioactivity prediction with a hybrid chemical language model

Leveraging bioactivity data for molecule selection. The availability of bioactivity data for the fine-tuning molecules permitted the training of a bioactivity prediction model to select the most promising de novo designs38. Classical chemoinformatics methods often rely on precomputed features (molecular descriptors), combined with a machine learning algorithm for molecular property prediction. In this study, we aimed to explore the potential of a SMILES string-based hybrid CLM to predict bioactivity. This neural network model combines a generative CLM with a classifier network. Given that (i) inactive molecules were annotated with PI3Kγ pIC50 = 4.0 (Fig. 2c) and (ii) there is a natural ordering of the PI3Kγ ligands according to their pIC50 values, the bioactivity prediction task was framed as an ordinal classification task, i.e., classification with a class order39. Such a model considers both the active and inactive compounds for training and preserves both the class labels and the class order. For model training, we defined three class labels: “inactive” (pIC50 ≤ 4.0, 34 molecules), “moderately active” (4.0 < pIC50 ≤ 6.5, 121 molecules), and “highly active” (pIC50 > 6.5, 43 molecules). The CLM generated a focused virtual chemical library by leveraging the structural information of the molecules used for fine-tuning, while the classifier layer factored their activity labels into the model (Fig. 2d).
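The ordered three-class setup can be sketched as follows. The class boundaries come from the text; the cumulative target encoding shown is one standard way (Frank–Hall style) to make a classifier respect class order, not necessarily the authors' exact formulation:

```python
def ordinal_class(pic50):
    """Map a pIC50 value to the ordered classes used in the study:
    inactive (<= 4.0) < moderately active (<= 6.5) < highly active (> 6.5)."""
    if pic50 <= 4.0:
        return 0  # "inactive"
    if pic50 <= 6.5:
        return 1  # "moderately active"
    return 2      # "highly active"

def ordinal_targets(label, n_classes=3):
    """Cumulative binary encoding that preserves class order:
    class k is encoded as k ones followed by zeros, so each output unit
    answers "is the activity greater than threshold t?"."""
    return [1 if k < label else 0 for k in range(n_classes - 1)]

for p in (4.0, 5.2, 7.1):
    c = ordinal_class(p)
    print(p, c, ordinal_targets(c))
```

Unlike one-hot targets, the cumulative encoding penalizes a model less for confusing adjacent classes than distant ones, which is the point of framing the task ordinally.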
We explored two different pretraining strategies for feature learning with a large amount of unlabeled data.
1. Autoregressive pretraining (Fig. 2a). This strategy is analogous to the one performed for the generative CLM.
2. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) pretraining (Fig. 2b)40. The ELECTRA approach is based on training a model to distinguish between “real” input characters and “corrupt” ones, which was previously shown to be useful for contextual representation of natural language40. We adapted ELECTRA for the CLM training with an LSTM model and SMILES strings as input28. The training data contained corrupted input SMILES strings generated by randomly substituting multiple characters with other characters of the SMILES language. The CLM was trained to spot the corrupted characters.
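The corruption step described in point 2 can be sketched as follows. The character set and replacement rate are illustrative; note that the original ELECTRA formulation uses a small generator network to propose plausible replacements, whereas the random substitution here follows the adaptation described in the text:

```python
import random

SMILES_CHARS = list("CNOSFclnos()=#123456")  # simplified SMILES alphabet

def corrupt(smiles, rate=0.15, rng=None):
    """Randomly replace characters and record which positions were corrupted.
    Returns (corrupted_string, labels), where labels[i] == 1 marks a replaced
    character. The discriminator (E-CLM) is trained to predict these labels."""
    rng = rng or random.Random(42)
    chars, labels = [], []
    for ch in smiles:
        if rng.random() < rate:
            replacement = rng.choice([c for c in SMILES_CHARS if c != ch])
            chars.append(replacement)
            labels.append(1)
        else:
            chars.append(ch)
            labels.append(0)
    return "".join(chars), labels

corrupted, labels = corrupt("c1ccccc1O", rate=0.3)
print(corrupted, labels)
```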
We hypothesized that, compared to autoregressive pretraining, ELECTRA pretraining has a more appropriate inductive bias (i.e., the set of algorithmic assumptions to solve a given task) to extract useful features for ordinal classification. The inductive bias of autoregressive pretraining is particularly suited for generating SMILES strings because the training and generative tasks are the same, namely, adding characters iteratively. However, ligands of the same macromolecular target tend to have similar chemical substructures, and, therefore, the ability of a model to distinguish small structural changes was deemed relevant. At the same time, small structural changes might lead to drastic variation of the biological activity (the so-called activity cliffs)41. Hereinafter, the model that was pretrained with the ELECTRA method is referred to as the “E-CLM.”



Frequently Asked Questions
Q1. What have the authors contributed in "Leveraging molecular structure and bioactivity with chemical language models for drug design" ?

In this paper, the authors show that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), the authors created a large collection of virtual molecules with a generative CLM. This primary virtual compound library was further refined using a CLM-based classifier for bioactivity prediction.

Future prospective studies will also have to assess the general applicability of this approach to other targets from different target families. This study highlights the versatility of generative deep learning for hit and lead finding in drug discovery, where the same computational pipeline can be used to both create new molecules and screen libraries of existing compounds. The authors envision future projects in which de novo design methods are first validated for physically available molecules from a compound repository or commercial suppliers before investing in potentially more expensive and time-consuming syntheses.

To increase the confidence in the bioactivity predictions, the authors used a deep ensemble model by combining the predictions of multiple models with a majority voting approach48,49. 

To mitigate the class data imbalance, the authors applied oversampling to the classes with fewer data (i.e., the “inactive” and “highly active” classes)44. 

Among these top-ranked molecules, 64% featured a new atom scaffold and 62% featured a new graph scaffold with respect to the fine-tuning set52,53.

With increasing confidence levels, the number of molecules predicted as “highly active” decreased (Fig. 3a), a documented effect of ensemble voting51. 

To probe the effect of the pretraining scheme on the predictions, the authors added only a single feedforward layer to the pretrained CLM and E-CLM for bioactivity prediction.

Hit compound 1 has a new atom scaffold compared to all molecules in the ChEMBL database (version 28) annotated with “pActivity” ≥ 5.0 on PI3Kγ (“pActivity”: -log(molar IC50, XC50, EC50, AC50, Ki, Kd, or “potency”)).