Leveraging molecular structure and bioactivity with chemical language models for drug design
Summary
Introduction
- Computational methods have become key players in hit and lead discovery in pharmaceutical research, complementing experimental high-throughput screening1.
- There are alternative approaches for CLM development, e.g., model fine-tuning (step ii) by reinforcement learning6,25.
- Both CLMs were fine-tuned on inhibitors of phosphoinositide 3-kinase gamma (PI3Kγ), which is an anticancer, anti-inflammatory, and immunomodulatory drug target26,27.
- For rapid validation, commercially available compounds from the set of de novo generated molecules were tested, as opposed to synthesizing them.
Focused library generation
- A CLM based on a long short-term memory (LSTM) model and SMILES strings as input was developed for the de novo generation of a focused virtual chemical library for PI3Kγ28.
- To learn from unlabeled data, CLMs leverage “self-supervised” learning29.
- Here, to enable the CLM to capture features more related to approved drugs, the authors used 839,674 molecules from the US patent database for the CLM pretraining34.
- Narrowing the probabilities learned by the CLM with a parameter (the so-called temperature; Fig. 1b) generally improves the SMILES string sampling31.
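The temperature effect described above can be illustrated with a minimal sketch. It assumes the model outputs one raw score (logit) per SMILES token and that sampling divides these scores by a temperature before the softmax. The toy vocabulary, the logit values, and the function names are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw token scores into probabilities, rescaled by a temperature > 0.

    Temperatures below 1 sharpen (narrow) the distribution;
    temperatures above 1 flatten it.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, vocab, temperature=1.0, rng=random):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]


# Hypothetical toy vocabulary and scores, for illustration only.
vocab = ["C", "N", "O", "(", ")"]
logits = [2.0, 1.0, 0.5, 0.1, 0.1]

sharp = softmax_with_temperature(logits, temperature=0.5)
flat = softmax_with_temperature(logits, temperature=2.0)
```

In this sketch the most likely token receives a larger share of the probability mass at temperature 0.5 than at temperature 2.0, which is the narrowing that the summary refers to.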
In vitro bioactivity testing
- For a proof of concept, some of the molecules generated by the CLM were tested for PI3Kγ binding in vitro.
- The authors hypothesize that this might be due to the positive effect of the ELECTRA pretraining, which was aimed at recognizing the effect of small structural changes.
- This preliminary in vitro validation advocates an ensemble prediction approach for virtual compound screening and ranking of the computer-generated molecular designs.
Conclusion
- Methodological improvements in CLM training advanced the sampling of target-focused virtual molecule libraries.
- It remains to be determined in more detail to what extent the CLM pretraining method affects model performance in the downstream task, i.e., molecular generation or ordinal classification.
- Importantly, CLM training was performed without data augmentation to study the positive effect of nucleus sampling on the generation of a SMILES string.
- The long time required for hit-to-lead expansion and for preclinical and clinical drug development until a marketed drug is obtained will likely preclude any such analysis.
- Obtaining rapid experimental validation of a set of readily available de novo designed molecules prior to embarking on de novo synthesis might help assess the value of computationally generated activity-focused chemical libraries.
Frequently Asked Questions (20)
Q2. What future work do the authors propose in "Leveraging molecular structure and bioactivity with chemical language models for drug design"?
Importantly, CLM training was performed without data augmentation to study the positive effect of nucleus sampling on the generation of a SMILES string. Future prospective studies will also have to assess the general applicability of this approach to other targets from different target families. This study highlights the versatility of generative deep learning for hit and lead finding in drug discovery, where the same computational pipeline can be used to both create new molecules and screen libraries of existing compounds. The authors envision future projects in which de novo design methods are first validated for physically available molecules from a compound repository or commercial suppliers before investing in potentially more expensive and time-consuming syntheses.
Q3. What are examples of de novo designed bioactive molecules?
Examples of de novo designed bioactive molecules include inhibitors of vascular endothelial growth factor receptor 2 kinase and the unfolded protein response pathway7, and nuclear hormone receptor modulators20–23.
Q4. How did the authors increase the confidence in the bioactivity predictions?
To increase the confidence in the bioactivity predictions, the authors used a deep ensemble model by combining the predictions of multiple models with a majority voting approach48,49.
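A minimal sketch of majority voting over an ensemble is given here, assuming each ensemble member outputs one class label per molecule. The `min_agreement` threshold, the molecule names, and the vote counts are hypothetical; the source does not describe how the authors implemented the vote or defined the confidence level.

```python
from collections import Counter


def majority_vote(votes, min_agreement=0.5):
    """Return (label, agreement) for one molecule.

    votes: class labels, one per ensemble member.
    min_agreement: fraction of members that must agree (assumed parameter).
    The label is None when the winning class falls below the threshold.
    Ties are resolved by first occurrence in `votes`.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    return (label if agreement >= min_agreement else None), agreement


def count_highly_active(votes_by_molecule, min_agreement):
    """Count molecules whose accepted majority label is 'highly active'."""
    return sum(
        majority_vote(votes, min_agreement)[0] == "highly active"
        for votes in votes_by_molecule.values()
    )


# Hypothetical votes from a ten-member ensemble.
ensemble_votes = {
    "mol_a": ["highly active"] * 9 + ["inactive"] * 1,
    "mol_b": ["highly active"] * 6 + ["inactive"] * 4,
    "mol_c": ["inactive"] * 8 + ["highly active"] * 2,
}

relaxed = count_highly_active(ensemble_votes, min_agreement=0.5)
strict = count_highly_active(ensemble_votes, min_agreement=0.8)
```

With these toy votes, raising the required agreement from 0.5 to 0.8 lowers the number of molecules accepted as "highly active", since a stricter threshold filters out molecules on which the ensemble is split.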
Q5. What is the purpose of this study?
In this study, the authors developed a data-driven molecular design pipeline that leverages both the structural and bioactivity information of known ligands to generate de novo bespoke molecules.
Q6. What can be done to generate a focused virtual chemical library?
A variety of data-driven approaches can be used to generate focused virtual chemical libraries and create molecules with the desired properties5–18.
Q7. What are the advantages of using a CLM to generate a focused chemical library?
Bespoke virtual compound libraries provide access to untapped regions of the chemical space2, thereby extending the diversity of potential drug candidates.
Q8. What are the main advantages of using CLMs in pharmaceutical research?
Computational methods have become key players in hit and lead discovery in pharmaceutical research, complementing experimental high-throughput screening1.
Q9. Why is the inductive bias of autoregressive pretraining suited to generating SMILES strings?
The inductive bias of autoregressive pretraining is particularly suited for generating SMILES strings because the training and generative tasks are the same, namely, adding characters iteratively.
Q10. What are the steps to create a focused virtual chemical library?
The creation of a focused virtual chemical library with a CLM generally includes three basic steps: (i) model pretraining with a large set of molecules to learn the SMILES grammar and the feature distribution of the pretraining data, (ii) transfer learning with a smaller set of molecules (fine-tuning set) to bias the molecule generation by the CLM toward the chemical space of interest, and (iii) sampling of new molecules from the data distributions modeled in steps i) and ii)5,24.
Q11. What did the authors do to mitigate the class data imbalance?
To mitigate the class data imbalance, the authors applied oversampling to the classes with fewer data (i.e., the “inactive” and “highly active” classes)44.
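One common way to oversample is to duplicate randomly chosen examples of the smaller classes until every class matches the largest one. The sketch below assumes this random-duplication variant; the source does not state which oversampling method was used. The SMILES strings and the "middle" class label are placeholders.

```python
import random
from collections import Counter, defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class examples until all classes have equal size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra examples with replacement from the same class.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Placeholder data: the majority class is labeled "middle" for illustration.
smiles = ["CCO", "CCN", "CCC", "CCCl", "c1ccccc1", "CC=O"]
classes = ["inactive", "middle", "middle", "middle", "middle", "highly active"]

balanced_smiles, balanced_classes = random_oversample(smiles, classes)
class_counts = Counter(balanced_classes)
```

Oversampling of this kind is normally applied to the training split only, so that duplicated molecules do not leak into the evaluation data.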
Q12. What are the main characteristics of the CLMs?
Both CLMs were fine-tuned on inhibitors of phosphoinositide 3-kinase gamma (PI3Kγ), which is an anticancer, anti-inflammatory, and immunomodulatory drug target26,27.
Q13. How many of the top-ranked molecules featured a new atom scaffold?
Among these top-ranked molecules, 64% featured a new atom scaffold and 62% featured a new graph scaffold with respect to the fine-tuning set52,53.
Q14. What are the main characteristics of a CLM?
Chemical language models (CLMs) are based on deep learning networks for processing string representations of molecules (e.g., simplified molecular input line entry system (SMILES) strings; Fig. 1a)5,7,19.
Q15. What is the way to improve the quality of the SMILES strings?
Narrowing the probabilities learned by the CLM with a parameter (the so-called temperature; Fig. 1b) generally improves the SMILES string sampling31.
Q16. What are the challenges of generating a virtual chemical library?
Owing to the potentially unlimited size of virtual chemical libraries, concerns have been raised over the pragmatism of successfully screening billions of molecules virtually with a potentially high risk of false positives2,3.
Q17. How did the number of molecules predicted as "highly active" change with increasing confidence levels?
With increasing confidence levels, the number of molecules predicted as “highly active” decreased (Fig. 3a), a documented effect of ensemble voting51.
Q18. What is the way to improve the quality of the sampled molecules?
During transfer learning, nucleus sampling generally improved the quality of the sampled molecules in terms of the novelty of the SMILES strings compared to the best temperature sampling data obtained (Fig. 1e)33.
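Nucleus (top-p) sampling restricts each sampling step to the smallest set of most probable tokens whose cumulative probability reaches a cutoff, and then renormalizes within that set. The sketch below shows the general technique under assumed values; the `top_p` cutoff, the toy probabilities, and the function names are illustrative and are not taken from the paper.

```python
import random


def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest high-probability token set with mass >= top_p.

    Returns a dict mapping kept token indices to renormalized probabilities.
    """
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_nucleus(probs, vocab, top_p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    indices = list(filtered)
    weights = [filtered[i] for i in indices]
    return vocab[rng.choices(indices, weights=weights, k=1)[0]]


# Hypothetical next-token distribution over a toy vocabulary.
vocab = ["C", "N", "O", "#"]
probs = [0.5, 0.3, 0.15, 0.05]

nucleus = nucleus_filter(probs, top_p=0.9)
token = sample_nucleus(probs, vocab, top_p=0.9)
```

In this toy case the least likely token ("#") falls outside the nucleus and can never be sampled. Unlike a fixed temperature, the size of the nucleus adapts to how peaked the distribution is at each step.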
Q19. How did the authors probe the effect of the pretraining scheme on the predictions?
To probe the effect of the pretraining scheme on the predictions, the authors added only a single feedforward layer to the pretrained CLM and E-CLM for bioactivity prediction.
Q20. How novel is the scaffold of hit compound 1 compared to known PI3Kγ ligands?
Hit compound 1 has a new atom scaffold compared to all molecules in the ChEMBL database (version 28) annotated with “pActivity” ≥ 5.0 on PI3Kγ (“pActivity”: -log(molar IC50, XC50, EC50, AC50, Ki, Kd, or “potency”)).
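The "pActivity" definition above is the negative base-10 logarithm of a molar activity value, so the threshold of 5.0 corresponds to 10 µM or better. A short sketch of the conversion follows; the function names and the nanomolar helper are illustrative conveniences, not part of the source.

```python
import math


def p_activity(value_molar):
    """Return -log10 of an activity value (IC50, Ki, Kd, ...) given in mol/L."""
    if value_molar <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value_molar)


def p_activity_from_nm(value_nm):
    """Same conversion for a value given in nanomolar (assumed helper)."""
    return p_activity(value_nm * 1e-9)


def passes_threshold(value_molar, threshold=5.0):
    """True if the compound meets the pActivity >= threshold criterion."""
    return p_activity(value_molar) >= threshold


ten_micromolar = p_activity(1e-5)           # approximately 5.0
hundred_nanomolar = p_activity_from_nm(100)  # approximately 7.0
```

Because the scale is logarithmic, each unit of pActivity corresponds to a tenfold change in the underlying concentration.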