Leveraging molecular structure and bioactivity with chemical language models for drug design

- 04 Oct 2021 -

TLDR

It is shown that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds. The results support the use of hybrid CLMs for virtual compound screening and activity-focused molecular design in low-data situations.

Abstract:

Generative chemical language models (CLMs) can be used for de novo molecular structure generation. These CLMs learn from the structural information of known molecules to generate new ones. In this paper, we show that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), we created a large collection of virtual molecules with a generative CLM. This primary virtual compound library was further refined using a CLM-based classifier for bioactivity prediction. This second hybrid CLM was pretrained with patented molecular structures and fine-tuned with known PI3Kγ binders and non-binders by transfer learning. Several of the computer-generated molecular designs were commercially available, which allowed for fast prescreening and preliminary experimental validation. A new PI3Kγ ligand with sub-micromolar activity was identified. The results positively advocate hybrid CLMs for virtual compound screening and activity-focused molecular design in low-data situations.

Moret et al.

Michael Moret, Francesca Grisoni¹,²*, Cyrill Brunner & Gisbert Schneider¹,³*

¹ ETH Zurich, Department of Chemistry and Applied Biosciences, RETHINK, Vladimir-Prelog-Weg 4, 8093 Zurich, Switzerland;

² Eindhoven University of Technology, Institute for Complex Molecular Systems, Department of Biomedical Engineering, Groene Loper 7, 5612AZ Eindhoven, Netherlands;

³ ETH Singapore SEC Ltd, 1 CREATE Way, #06-01 CREATE Tower, Singapore 138602, Singapore;

*Correspondence to Gisbert Schneider (gisbert@ethz.ch) and Francesca Grisoni (f.grisoni@tue.nl)

Introduction

Computational methods have become key players in hit and lead discovery in pharmaceutical research, complementing experimental high-throughput screening [1]. Bespoke virtual compound libraries provide access to untapped regions of the chemical space [2], thereby extending the diversity of potential drug candidates. However, owing to the potentially unlimited size of virtual chemical libraries, concerns have been raised over the practicality of virtually screening billions of molecules, given the potentially high risk of false positives [2,3]. To mitigate some of these challenges, researchers have employed generative deep learning models to construct compounds on demand by de novo design and to obtain small, bespoke virtual compound libraries [4,5]. A variety of data-driven approaches can be used to generate focused virtual chemical libraries and create molecules with the desired properties [5–18]. Chemical language models (CLMs) are based on deep learning networks for processing string representations of molecules (e.g., simplified molecular input line entry system (SMILES) strings; Fig. 1a) [5,7,19]. CLMs have already been successfully employed to generate focused virtual chemical libraries. Examples of de novo designed bioactive molecules include inhibitors of vascular endothelial growth factor receptor 2 kinase and the unfolded protein response pathway [7], and nuclear hormone receptor modulators [20–23].

The creation of a focused virtual chemical library with a CLM generally includes three basic steps: (i) model pretraining with a large set of molecules to learn the SMILES grammar and the feature distribution of the pretraining data, (ii) transfer learning with a smaller set of molecules (fine-tuning set) to bias the molecule generation by the CLM toward the chemical space of interest, and (iii) sampling of new molecules from the data distributions modeled in steps (i) and (ii) [5,24]. There are alternative approaches for CLM development, e.g., model fine-tuning (step ii) by reinforcement learning [6,25].

In this study, we developed a data-driven molecular design pipeline that leverages both the structural and bioactivity information of known ligands to generate de novo bespoke molecules. We pretrained two CLMs, each with a distinct pretraining strategy, on a large set of patented compound structures (one for molecular generation and one for classification). Both CLMs were fine-tuned on inhibitors of phosphoinositide 3-kinase gamma (PI3Kγ), which is an anticancer, anti-inflammatory, and immunomodulatory drug target [26,27]. For rapid validation, commercially available compounds from the set of de novo generated molecules were tested rather than synthesizing the designs. A new nanomolar ligand of PI3Kγ was identified.

Results and Discussion

Molecular design and scoring were performed in two steps, each of which was executed by a distinct CLM: (i) molecular de novo design and (ii) refinement of the generated virtual molecule library using the available ligand bioactivity data for the target of interest (PI3Kγ).

Focused library generation

Chemical language model. A CLM based on a long short-term memory (LSTM) model and SMILES strings as input was developed for the de novo generation of a focused virtual chemical library for PI3Kγ. To learn from unlabeled data, CLMs leverage “self-supervised” learning. Specifically, the CLM was trained with an autoregressive approach, i.e., the process of iteratively predicting the next character in a SMILES string given all the previous characters in the string (Fig. 2a). In previous studies, CLMs were pretrained on molecules with known biological activity (IC50, EC50, Kd, and Ki) <1 µM retrieved from the ChEMBL database [20,23,31–33]. Although such a training set can capture the general features of bioactive compounds, it does not necessarily represent the physicochemical properties of approved drugs. Here, to enable the CLM to capture features more related to approved drugs, we used 839,674 molecules from the US patent database for the CLM pretraining. We hypothesized that patented compounds are more likely to become marketed drugs than the molecules deposited in ChEMBL. Transfer learning was performed to focus the pretrained CLM on the target space of PI3Kγ ligands. For transfer learning, 46 PI3Kγ inhibitors with IC50 ≤100 nM were selected from the Drug Target Commons (DTC) database.
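The autoregressive training task described above can be illustrated with a minimal sketch. It only shows how one SMILES string is unrolled into (context, next character) pairs; it is not the authors' implementation. The start and stop markers `^` and `$`, the function name `next_char_pairs`, and the plain character-level tokenization (which would split two-letter atoms such as Cl) are assumptions made for illustration.

```python
START, STOP = "^", "$"  # hypothetical start/stop markers, not taken from the paper


def next_char_pairs(smiles: str) -> list[tuple[str, str]]:
    """Return (context, next character) training pairs for one SMILES string."""
    tokens = START + smiles + STOP
    # At every position i the model sees tokens[:i] and must predict tokens[i].
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


pairs = next_char_pairs("CCO")  # ethanol
for context, target in pairs:
    print(f"{context!r} -> {target!r}")
```

During training, a model such as an LSTM would be asked to assign a high probability to each target character given its context; sampling then reuses the same step in a loop.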

Nucleus sampling for molecule generation. CLMs generate new molecules by extending strings from a “start” character until the “stop” character is sampled or a preset maximum string length is reached. String characters are iteratively added by weighted random sampling from the probability distribution learned by the CLM during training. The more likely a given character is at a given step according to the probabilities learned by the CLM, the more often it will be sampled, and vice versa. Narrowing the probabilities learned by the CLM with a parameter (the so-called temperature; Fig. 1b) generally improves the SMILES string sampling [31]. This improvement occurs in terms of (i) the quality of the SMILES strings generated, as reflected by their validity (grammatically valid SMILES strings), uniqueness (nonrepetitive molecules), and novelty (molecules not present in the pretraining and fine-tuning data), and (ii) the similarity of the sampled virtual chemical libraries to the reference data in terms of their chemical structures and bioactivities, as measured by the Fréchet ChemNet Distance (FCD). However, with this “temperature sampling” approach, unlikely SMILES characters can still be sampled, which could result in the construction of molecules that do not match the design objective. To prevent the CLM from picking unlikely SMILES characters, we employed “nucleus sampling” here. This method reflects the confidence of the model in its predictions by allowing only the most probable character(s) to be sampled using a probability threshold based on the cumulative probabilities of the SMILES characters (Fig. 1c).
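The difference between the two sampling schemes can be sketched on a single next-character distribution. The sketch below is an illustration under stated assumptions, not the authors' code: the toy probabilities in `toy` are invented, and the function names are hypothetical. Temperature scaling is written as renormalizing p^(1/T), which is equivalent to dividing the logits by T before the softmax; nucleus sampling keeps the smallest set of most probable characters whose cumulative probability reaches the threshold.

```python
import math
import random


def apply_temperature(probs: dict[str, float], temperature: float) -> dict[str, float]:
    """Renormalize p ** (1 / T); T < 1 sharpens the distribution."""
    scaled = {ch: math.exp(math.log(p) / temperature) for ch, p in probs.items() if p > 0}
    total = sum(scaled.values())
    return {ch: value / total for ch, value in scaled.items()}


def nucleus(probs: dict[str, float], threshold: float) -> dict[str, float]:
    """Keep the most probable characters until their cumulative probability
    reaches the threshold, then renormalize over the kept characters."""
    kept: dict[str, float] = {}
    cumulative = 0.0
    for ch, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[ch] = p
        cumulative += p
        if cumulative >= threshold:
            break
    total = sum(kept.values())
    return {ch: p / total for ch, p in kept.items()}


def sample(probs: dict[str, float], rng: random.Random) -> str:
    """Weighted random choice of one character."""
    chars = list(probs)
    return rng.choices(chars, weights=[probs[ch] for ch in chars], k=1)[0]


# Invented toy distribution over the next SMILES character.
toy = {"C": 0.55, "N": 0.32, "O": 0.10, "S": 0.03}

tempered = apply_temperature(toy, temperature=0.7)
truncated = nucleus(toy, threshold=0.85)
rng = random.Random(0)
print(sorted(tempered), sorted(truncated), sample(truncated, rng))
```

With these toy numbers, temperature sampling still leaves a small nonzero probability on the unlikely character S, whereas the nucleus with a threshold of 0.85 contains only C and N.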

Nucleus sampling improved upon temperature sampling in terms of lower FCD values (Fig. 1d), indicating a greater overall similarity of the de novo generated molecules to the pretraining set in terms of structural and bioactivity properties. During transfer learning, nucleus sampling generally improved the quality of the sampled molecules in terms of the novelty of the SMILES strings compared to the best temperature sampling data obtained (Fig. 1e) [33]. The results were stable over a range of sampling threshold values (Supplementary Table S1). However, nucleus sampling did not outperform temperature sampling in terms of the uniqueness, validity, and novelty of the SMILES strings generated after the pretraining (Supplementary Table S2). To create a PI3Kγ-focused chemical library during transfer learning, we used nucleus sampling with a threshold of 0.85. In each of 10 repetitions, 5000 SMILES strings were sampled at each of 50 transfer learning epochs (5000 × 50 × 10). In total, 2,500,000 SMILES strings were generated, of which 1,121,735 were valid, unique, and novel compared to both the training and fine-tuning compounds.
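As a rough illustration of how validity, uniqueness, and novelty could be tallied for a sampled library, consider the sketch below. Several parts are assumptions: all three values are expressed as percentages of the total number of sampled strings (the source states this convention only for novelty), strings are compared verbatim (in practice they would first be canonicalized), and the validity check is a stand-in that only tests balanced parentheses. A real pipeline would parse each string with a cheminformatics toolkit instead.

```python
from typing import Callable, Iterable


def balanced_parentheses(smiles: str) -> bool:
    """Stand-in validity check (assumption): branches must open and close properly."""
    depth = 0
    for ch in smiles:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def library_metrics(
    sampled: Iterable[str],
    reference: Iterable[str],
    is_valid: Callable[[str], bool],
) -> dict[str, float]:
    """Percentages of sampled strings that are valid, valid and unique,
    and valid, unique, and absent from the reference (training) data."""
    sampled = list(sampled)
    valid = [s for s in sampled if is_valid(s)]
    unique = set(valid)
    novel = unique - set(reference)
    n = len(sampled)
    return {
        "validity": 100 * len(valid) / n,
        "uniqueness": 100 * len(unique) / n,
        "novelty": 100 * len(novel) / n,
    }


toy_sampled = ["CCO", "CCO", "CC(=O)O", "CC(C", "c1ccccc1"]
toy_reference = {"c1ccccc1"}
metrics = library_metrics(toy_sampled, toy_reference, balanced_parentheses)
print(metrics)

# Arithmetic from the text: 5000 strings x 50 epochs x 10 repetitions.
total_generated = 5000 * 50 * 10
fraction_retained = 1_121_735 / total_generated  # roughly 0.449
print(total_generated, round(fraction_retained, 3))
```

By the same arithmetic, the 1,121,735 retained strings correspond to about 44.9% of the 2,500,000 generated.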

Fig. 1 | De novo molecular generation with the CLM. a, SMILES string representation of a molecule. b, Example of the effect of the temperature parameter on the probability distribution learnt by the CLM. c, Example of the effect of the nucleus sampling threshold. Only the characters N and C can be sampled here. d, Fréchet ChemNet Distance (FCD) comparison between temperature and nucleus sampling after the pretraining (reported as the mean with standard deviation over 10 repeats with 5000 molecules sampled per repeat). e, Comparison of the novelty of the generated SMILES strings during the transfer learning between temperature sampling (temperature = 0.7) and nucleus sampling (threshold = 0.85). Mean values (lines) and standard deviations (shaded areas) are shown for 10 repeats (1000 SMILES strings were sampled every second epoch over 40 epochs). Novelty is expressed as the percentage of SMILES strings generated that were valid and not included in either the training or the fine-tuning data.

Bioactivity prediction with a hybrid chemical language model

Leveraging bioactivity data for molecule selection. The availability of bioactivity data for the fine-tuning molecules permitted the training of a bioactivity prediction model to select the most promising de novo designs. Classical chemoinformatics methods often rely on precomputed features (molecular descriptors), combined with a machine learning algorithm for molecular property prediction. In this study, we aimed to explore the potential of a SMILES string-based hybrid CLM to predict bioactivity. This neural network model combines a generative CLM with a classifier network. Given that (i) inactive molecules were annotated with PI3Kγ pIC50 = 4.0 (Fig. 2c) and (ii) there is a natural ordering of the PI3Kγ ligands according to their pIC50 values, the bioactivity prediction task was framed as an ordinal classification task, i.e., classification with a class order. Such a model considers both the active and inactive compounds for training and preserves both the class labels and the class order. For model training, we defined three class labels: “inactive” (pIC50 ≤ 4.0, 34 molecules), “moderately active” (4.0 < pIC50 ≤ 6.5, 121 molecules), and “highly active” (pIC50 > 6.5, 43 molecules). The CLM generated a focused virtual chemical library by leveraging the structural information of the molecules used for fine-tuning, while the classifier layer factored their activity labels into the model (Fig. 2d).
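The three activity classes defined above can be written down as a small labeling function. The thresholds (4.0 and 6.5) come from the text; everything else in this sketch is an assumption for illustration. In particular, the cumulative binary target encoding shown in `ordinal_target` is one common way to preserve class order and is not stated to be the scheme used by the authors, and the helper `pic50_from_nm` simply applies the standard definition pIC50 = -log10(IC50 in mol/L).

```python
import math

CLASSES = ("inactive", "moderately active", "highly active")


def pic50_from_nm(ic50_nm: float) -> float:
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/L)."""
    return 9.0 - math.log10(ic50_nm)


def activity_class(pic50: float) -> int:
    """Map a pIC50 value to the index of its class in CLASSES."""
    if pic50 <= 4.0:
        return 0
    if pic50 <= 6.5:
        return 1
    return 2


def ordinal_target(class_index: int, n_classes: int = len(CLASSES)) -> list[int]:
    """Cumulative encoding (assumed scheme): element k answers
    'is the class ranked above class k?'."""
    return [1 if class_index > k else 0 for k in range(n_classes - 1)]


for value in (4.0, 5.2, 7.0):
    index = activity_class(value)
    print(value, CLASSES[index], ordinal_target(index))
```

Under this encoding, "inactive" maps to [0, 0], "moderately active" to [1, 0], and "highly active" to [1, 1], so a misclassification into an adjacent class differs in only one position. An IC50 of 100 nM corresponds to a pIC50 of 7.0, which falls in the "highly active" class.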

We explored two different pretraining strategies for feature learning with a large amount of unlabeled data.

1. Autoregressive pretraining (Fig. 2a). This strategy is analogous to the one performed for the generative CLM.

2. ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) pretraining (Fig. 2b). The ELECTRA approach is based on training a model to distinguish between “real” input characters and “corrupt” ones, which was previously shown to be useful for contextual representation of natural language. We adapted ELECTRA for the CLM training with an LSTM model and SMILES strings as input. The training data contained corrupted input SMILES strings generated by randomly substituting multiple characters with other characters of the SMILES language. The CLM was trained to spot the corrupted characters.
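The corruption step of the second strategy can be sketched as follows. This is a minimal illustration, assuming a toy character vocabulary (`VOCAB`), an independent per-character substitution probability (`rate`), and a hypothetical function name; the source only states that multiple characters were randomly substituted with other characters of the SMILES language. The per-character labels (1 = substituted, 0 = original) are what a discriminator-style model would be trained to predict.

```python
import random

# Toy character vocabulary (assumption); a real one would be built from the training data.
VOCAB = list("CNOSFclnos()=#123")


def corrupt(smiles: str, rate: float, rng: random.Random) -> tuple[str, list[int]]:
    """Randomly substitute characters and return the corrupted string together
    with one label per character (1 = substituted, 0 = original)."""
    chars = list(smiles)
    labels = [0] * len(chars)
    for i, original in enumerate(chars):
        if rng.random() < rate:
            # Always pick a different character so that the label is unambiguous.
            chars[i] = rng.choice([c for c in VOCAB if c != original])
            labels[i] = 1
    return "".join(chars), labels


source = "c1ccccc1"  # benzene
corrupted, labels = corrupt(source, rate=0.3, rng=random.Random(0))
print(source)
print(corrupted)
print(labels)
```

Because every replacement differs from the original character, the labels mark exactly the positions where the corrupted string deviates from the source string.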

We hypothesized that, compared to autoregressive pretraining, ELECTRA pretraining has a more appropriate inductive bias (i.e., the set of algorithmic assumptions to solve a given task) to extract useful features for ordinal classification. The inductive bias of autoregressive pretraining is particularly suited for generating SMILES strings because the training and generative tasks are the same, namely, adding characters iteratively. However, ligands of the same macromolecular target tend to have similar chemical substructures, and, therefore, the ability of a model to distinguish small structural changes was deemed relevant. At the same time, small structural changes might lead to drastic variation of the biological activity (the so-called activity cliffs). Hereinafter, the model that was pretrained with the ELECTRA method is referred to as “E-CLM.”

Figures

Fig. 2 | Bioactivity prediction. a, A CLM for molecule generation iteratively predicts the next character in a SMILES string given the preceding characters (“autoregressive” approach). b, An E-CLM (a CLM pretrained with the ELECTRA method) is trained on corrupted SMILES strings aiming to predict, for each string character, whether it is the original (correct) or a corrupted (substituted) character. c, Activity distribution of the PI3Kγ ligands. Compounds with annotated pIC50 ≤ 4.0 were considered “inactive”, and a pIC50 value of 6.5 was used to separate the “moderately active” from the “highly active” compounds. d, The molecular structures (in the form of a SMILES string) of the fine-tuning set were used to focus the CLM (pretrained on the US patent database) on the chemical space of the target of interest (PI3Kγ). To account for the uncertainty in the predictions, we employed an ensemble of 100 models to rank the generated molecules by the number of “votes”.

Fig. 5 | In vitro characterization of compound 1. Kinase-ligand binding was determined in a competition assay (n = 2), using an immobilized ligand of PI3Kγ and quantitative polymerase chain reaction (qPCR) measuring the competing DNA-tagged PI3Kγ protein. The signal is expressed as a transformation of the qPCR cycle time (2^(-cycle time)).

Fig. 3 | Molecule ranking with a deep ensemble model. a, Number of molecules in the refined virtual chemical library that were predicted as “highly active” as a function of the number of votes (confidence level). b, Average structural similarity (Tanimoto similarity index computed on Morgan fingerprints) of each de novo design to the fine-tuning set as a function of the number of votes. The solid line represents the mean value, with the shaded area representing the standard deviation. c, Top-ranked designs (99/100 votes) selected with the most distant nearest neighbor, whose similarity is indicated below the structure (“Most similar”) in the fine-tuning set. The atom (“Atom scaffold”) and graph (“Graph scaffold”) scaffold novelty of the structure with respect to the fine-tuning set is indicated below each structure (“Yes”: new, “No”: not new). d, Top-ranked designs (99/100 votes) selected with the closest nearest neighbor in the fine-tuning set.

Fig. 4 | Compounds tested for PI3Kγ inhibition. Compounds 1–16 are shown, together with the number of ensemble votes (out of a maximum of 100) and the experimentally determined binding constant Kd. A dash (-) indicates that no binding of the compound to the target was observed.
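The vote-based ranking referred to in the captions of Figs. 2–4 can be sketched as a simple tally over the ensemble members. The sketch below is illustrative only: the function names, the toy predictions, and the three-model ensemble are assumptions (the captions describe an ensemble of 100 models), and each model is represented by a plain dictionary from SMILES string to predicted class label.

```python
from collections import Counter


def count_votes(
    predictions_per_model: list[dict[str, str]],
    target_class: str = "highly active",
) -> Counter:
    """Count, for every molecule, how many ensemble members predict the target class."""
    votes: Counter = Counter()
    for predictions in predictions_per_model:
        for smiles, label in predictions.items():
            if label == target_class:
                votes[smiles] += 1
    return votes


def rank_by_votes(votes: Counter, min_votes: int) -> list[tuple[str, int]]:
    """Return molecules with at least min_votes votes, most-voted first."""
    return [(smiles, n) for smiles, n in votes.most_common() if n >= min_votes]


# Invented predictions from a toy ensemble of three models.
toy_ensemble = [
    {"CCO": "highly active", "CCN": "highly active", "CCC": "inactive"},
    {"CCO": "highly active", "CCN": "moderately active", "CCC": "inactive"},
    {"CCO": "highly active", "CCN": "highly active", "CCC": "moderately active"},
]

votes = count_votes(toy_ensemble)
print(rank_by_votes(votes, min_votes=2))
print(rank_by_votes(votes, min_votes=3))
```

Raising `min_votes` plays the role of the confidence level: fewer molecules pass a stricter threshold, which matches the trend described for Fig. 3a.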

Citations

Theory meets reality

Ruth-Blandina M. Quinn

References

Combining generative artificial intelligence and on-chip synthesis for de novo drug design.

Francesca Grisoni, +9 more

- 01 Jun 2021 -

Science Advances

TL;DR: The results support the suitability of the proposed design-make-test-analyze framework as a blueprint for automated drug design with artificial intelligence and miniaturized bench-top synthesis.

Applications of Deep-Learning in Exploiting Large-Scale and Heterogeneous Compound Data in Industrial Pharmaceutical Research.

Laurianne David, +9 more

- 05 Nov 2019 -

Frontiers in Pharmacology

TL;DR: The current state of analyzing large-scale compound data in industrial pharmaceutical research is summarized and the impact it has had on the drug discovery process over the last two decades is described, with a specific focus on deep-learning technologies.

Chemical language models enable navigation in sparsely populated chemical space

Michael A. Skinnider, +3 more

- 01 Sep 2021 -

Nature Machine Intelligence

TL;DR: It is shown that models developed for natural language processing work well for generating molecules from small amounts of training data, and robust metrics to evaluate the quality of generated molecules are identified.

PaccMannRL: De novo generation of hit-like anticancer molecules from transcriptomic data via reinforcement learning.

Jannis Born, +7 more

- 05 Mar 2021 -

iScience

TL;DR: In this article, a hybrid VAE was used to generate drugs with high predicted efficacy against cell lines or cancer types, using an anticancer drug sensitivity prediction model as reward function.

Beam Search for Automated Design and Scoring of Novel ROR Ligands with Machine Intelligence

Michael Moret, +4 more

- 24 Jun 2021 -

Angewandte Chemie

TL;DR: This paper leveraged the probabilities learned by chemical language models with the beam search algorithm as a model-intrinsic technique for automated molecule design and scoring and yielded novel inverse agonists of retinoic acid receptor-related orphan receptors (RORs).

Related Papers (5)

De Novo Design of Bioactive Small Molecules by Artificial Intelligence

Daniel Merk, +4 more

- 01 Jan 2018 -

Molecular Informatics

CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models

Vijil Chenthamarakshan, +10 more

- 02 Apr 2020 -

arXiv: Learning

Multi-Objective Molecular De Novo Design by Adaptive Fragment Prioritization

Michael Reutlinger, +3 more

- 14 Apr 2014 -

Angewandte Chemie

Strategies for Design of Molecular Structures with a Desired Pharmacophore Using Deep Reinforcement Learning.

Atsushi Yoshimori, +3 more

- 01 Mar 2020 -

Chemical & Pharmaceutical Bulletin

Structure based drug design for HIV protease: from molecular modeling to cheminformatics.

Patra Volarath, +2 more

- 01 May 2007 -

Current Topics in Medicinal Chemistry

Frequently Asked Questions (20)

Q1. What have the authors contributed in "Leveraging molecular structure and bioactivity with chemical language models for drug design"?

In this paper, the authors show that “hybrid” CLMs can additionally leverage the bioactivity information available for the training compounds. To computationally design ligands of phosphoinositide 3-kinase gamma (PI3Kγ), the authors created a large collection of virtual molecules with a generative CLM. This primary virtual compound library was further refined using a CLM-based classifier for bioactivity prediction.

Q2. What have the authors stated for future work in "Leveraging molecular structure and bioactivity with chemical language models for drug design"?

Importantly, CLM training was performed without data augmentation to study the positive effect of nucleus sampling on the generation of a SMILES string. Future prospective studies will also have to assess the general applicability of this approach to other targets from different target families. This study highlights the versatility of generative deep learning for hit and lead finding in drug discovery, where the same computational pipeline can be used to both create new molecules and screen libraries of existing compounds. The authors envision future projects in which de novo design methods are first validated for physically available molecules from a compound repository or commercial suppliers before investing in potentially more expensive and time-consuming syntheses.

Q3. What are examples of de novo designed bioactive molecules?

Examples of de novo designed bioactive molecules include inhibitors of vascular endothelial growth factor receptor 2 kinase and the unfolded protein response pathway [7], and nuclear hormone receptor modulators [20–23].

Q4. How did the authors increase the confidence in the bioactivity predictions?

To increase the confidence in the bioactivity predictions, the authors used a deep ensemble model by combining the predictions of multiple models with a majority voting approach [48,49].

Q5. What is the purpose of this study?

In this study, the authors developed a data-driven molecular design pipeline that leverages both the structural and bioactivity information of known ligands to generate de novo bespoke molecules.

Q6. What can be done to generate a focused virtual chemical library?

A variety of data-driven approaches can be used to generate focused virtual chemical libraries and create molecules with the desired properties [5–18].

Q7. What are the advantages of using a CLM to generate a focused chemical library?

Bespoke virtual compound libraries provide access to untapped regions of the chemical space [2], thereby extending the diversity of potential drug candidates.

Q8. What are the main advantages of using CLMs in pharmaceutical research?

Computational methods have become key players in hit and lead discovery in pharmaceutical research, complementing experimental high-throughput screening [1].

Q9. What is the inductive bias of ELECTRA pretraining?

The inductive bias of autoregressive pretraining is particularly suited for generating SMILES strings because the training and generative tasks are the same, namely, adding characters iteratively.

Q10. What are the steps to create a focused virtual chemical library?

The creation of a focused virtual chemical library with a CLM generally includes three basic steps: (i) model pretraining with a large set of molecules to learn the SMILES grammar and the feature distribution of the pretraining data, (ii) transfer learning with a smaller set of molecules (fine-tuning set) to bias the molecule generation by the CLM toward the chemical space of interest, and (iii) sampling of new molecules from the data distributions modeled in steps (i) and (ii) [5,24].

Q11. What did the authors do to mitigate the class data imbalance?

To mitigate the class data imbalance, the authors applied oversampling to the classes with fewer data (i.e., the “inactive” and “highly active” classes) [44].

Q12. What are the main characteristics of the CLMs?

Both CLMs were fine-tuned on inhibitors of phosphoinositide 3-kinase gamma (PI3Kγ), which is an anticancer, anti-inflammatory, and immunomodulatory drug target [26,27].

Q13. How many of the top-ranked molecules featured a new atom scaffold?

Among these top-ranked molecules, 64% featured a new atom scaffold and 62% featured a new graph scaffold with respect to the fine-tuning set [52,53].

Q14. What are the main characteristics of a CLM?

Chemical language models (CLMs) are based on deep learning networks for processing string representations of molecules (e.g., simplified molecular input line entry system (SMILES) strings; Fig. 1a) [5,7,19].

Q15. What is the way to improve the quality of the SMILES strings?

Narrowing the probabilities learned by the CLM with a parameter (the so-called temperature; Fig. 1b) generally improves the SMILES string sampling [31].

Q16. What are the challenges of generating a virtual chemical library?

Owing to the potentially unlimited size of virtual chemical libraries, concerns have been raised over the pragmatism of successfully screening billions of molecules virtually with a potentially high risk of false positives [2,3].

Q17. How did the confidence level in the prediction of bioactivity decrease?

With increasing confidence levels, the number of molecules predicted as “highly active” decreased (Fig. 3a), a documented effect of ensemble voting [51].

Q18. What is the way to improve the quality of the sampled molecules?

During transfer learning, nucleus sampling generally improved the quality of the sampled molecules in terms of the novelty of the SMILES strings compared to the best temperature sampling data obtained (Fig. 1e) [33].

Q19. What is the simplest explanation for the name of the pretraining scheme?

To probe the effect of the pretraining scheme on the predictions, the authors added only a single feedforward layer to the pretrained CLM and E-CLM for bioactivity prediction.

Q20. What is the PI3K activity of compound 1?

Hit compound 1 has a new atom scaffold compared to all molecules in the ChEMBL database (version 28) annotated with “pActivity” ≥ 5.0 on PI3Kγ (“pActivity”: -log(molar IC50, XC50, EC50, AC50, Ki, Kd, or “potency”)).
