Leveraging molecular structure and bioactivity with chemical language models for drug design
Frequently Asked Questions (20)
Q2. What future work do the authors propose in "Leveraging molecular structure and bioactivity with chemical language models for drug design"?
Importantly, CLM training was performed without data augmentation to study the positive effect of nucleus sampling on the generation of a SMILES string. Future prospective studies will also have to assess the general applicability of this approach to other targets from different target families. This study highlights the versatility of generative deep learning for hit and lead finding in drug discovery, where the same computational pipeline can be used to both create new molecules and screen libraries of existing compounds. The authors envision future projects in which de novo design methods are first validated for physically available molecules from a compound repository or commercial suppliers before investing in potentially more expensive and time-consuming syntheses.
Q3. What are examples of de novo designed bioactive molecules?
Examples of de novo designed bioactive molecules include inhibitors of vascular endothelial growth factor receptor 2 kinase and the unfolded protein response pathway [7], and nuclear hormone receptor modulators [20–23].
Q4. How did the authors increase the confidence in the bioactivity predictions?
To increase the confidence in the bioactivity predictions, the authors used a deep ensemble model by combining the predictions of multiple models with a majority voting approach [48,49].
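As a rough illustration of majority voting over an ensemble, the sketch below combines class labels predicted by several models and only returns a label when enough members agree. The function name `majority_vote` and the `min_agreement` threshold are assumptions made for this example; the source does not describe how the authors implemented their voting or confidence levels.

```python
from collections import Counter

def majority_vote(predictions, min_agreement=0.5):
    """Return the most common label, or None if too few members agree.

    predictions: one class label per ensemble member.
    min_agreement: hypothetical confidence threshold, i.e. the fraction
    of ensemble members that must vote for the winning label.
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    label, count = Counter(predictions).most_common(1)[0]
    if count / len(predictions) >= min_agreement:
        return label
    return None

# Toy votes from five hypothetical ensemble members for one molecule.
votes = ["highly active", "highly active", "highly active", "inactive", "inactive"]
print(majority_vote(votes, min_agreement=0.5))  # 3 of 5 agree, label is kept
print(majority_vote(votes, min_agreement=0.8))  # agreement too low, no label
```

Raising the threshold makes the ensemble abstain more often, so fewer molecules receive a confident label.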
Q5. What is the purpose of this study?
In this study, the authors developed a data-driven molecular design pipeline that leverages both the structural and bioactivity information of known ligands to generate de novo bespoke molecules.
Q6. What can be done to generate a focused virtual chemical library?
A variety of data-driven approaches can be used to generate focused virtual chemical libraries and create molecules with the desired properties [5–18].
Q7. What are the advantages of using a CLM to generate a focused chemical library?
Bespoke virtual compound libraries provide access to untapped regions of the chemical space [2], thereby extending the diversity of potential drug candidates.
Q8. What are the main advantages of using CLMs in pharmaceutical research?
Computational methods have become key players in hit and lead discovery in pharmaceutical research, complementing experimental high-throughput screening [1].
Q9. Why is the inductive bias of autoregressive pretraining suited for generating SMILES strings?
The inductive bias of autoregressive pretraining is particularly suited for generating SMILES strings because the training and generative tasks are the same, namely, adding characters iteratively.
Q10. What are the steps to create a focused virtual chemical library?
The creation of a focused virtual chemical library with a CLM generally includes three basic steps: (i) model pretraining with a large set of molecules to learn the SMILES grammar and the feature distribution of the pretraining data, (ii) transfer learning with a smaller set of molecules (fine-tuning set) to bias the molecule generation by the CLM toward the chemical space of interest, and (iii) sampling of new molecules from the data distributions modeled in steps (i) and (ii) [5,24].
Q11. What did the authors do to mitigate the class data imbalance?
To mitigate the class data imbalance, the authors applied oversampling to the classes with fewer data (i.e., the “inactive” and “highly active” classes) [44].
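One common way to oversample is to duplicate randomly chosen examples from the smaller classes until every class matches the largest one. The sketch below shows that variant; it is an assumption for illustration, since the source does not say which oversampling procedure was used. The function name `oversample`, the toy SMILES strings, and the "intermediate" label are hypothetical.

```python
import random
from collections import Counter, defaultdict

def oversample(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw extra copies with replacement from the existing class members.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Toy data: the hypothetical "intermediate" class dominates.
smiles = ["CCO", "CCN", "CCC", "CCCl", "c1ccccc1", "CC(=O)O"]
labels = ["intermediate", "intermediate", "intermediate", "intermediate",
          "inactive", "highly active"]
balanced_smiles, balanced_labels = oversample(smiles, labels)
print(Counter(balanced_labels))
```

Oversampling should be applied to the training split only, so that duplicated molecules do not leak into the evaluation data.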
Q12. What are the main characteristics of the CLMs?
Both CLMs were fine-tuned on inhibitors of phosphoinositide 3-kinase gamma (PI3Kγ), which is an anticancer, anti-inflammatory, and immunomodulatory drug target [26,27].
Q13. How many of the top-ranked molecules featured a new atom scaffold?
Among these top-ranked molecules, 64% featured a new atom scaffold and 62% featured a new graph scaffold with respect to the fine-tuning set [52,53].
Q14. What are the main characteristics of a CLM?
Chemical language models (CLMs) are based on deep learning networks for processing string representations of molecules (e.g., simplified molecular input line entry system (SMILES) strings; Fig. 1a) [5,7,19].
Q15. What is the way to improve the quality of the SMILES strings?
Narrowing the probabilities learned by the CLM with a parameter (the so-called temperature; Fig. 1b) generally improves the SMILES string sampling [31].
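A minimal sketch of temperature sampling, assuming the model outputs one unnormalized score (logit) per token: dividing the logits by a temperature below 1 sharpens the distribution, while a temperature above 1 flattens it. The function names, the three-token vocabulary, and the logit values are made up for illustration and are not taken from the source.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1/temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, vocab, temperature=1.0, rng=random):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical next-token scores over a tiny SMILES vocabulary.
vocab = ["C", "N", "O"]
logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, temperature=0.5))  # sharper
print(softmax_with_temperature(logits, temperature=2.0))  # flatter
print(sample_token(logits, vocab, temperature=0.5))
```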
Q16. What are the challenges of generating a virtual chemical library?
Owing to the potentially unlimited size of virtual chemical libraries, concerns have been raised over the pragmatism of successfully screening billions of molecules virtually with a potentially high risk of false positives [2,3].
Q17. How did the number of molecules predicted as “highly active” change with the confidence level?
With increasing confidence levels, the number of molecules predicted as “highly active” decreased (Fig. 3a), a documented effect of ensemble voting [51].
Q18. What is the way to improve the quality of the sampled molecules?
During transfer learning, nucleus sampling generally improved the quality of the sampled molecules in terms of the novelty of the SMILES strings compared to the best temperature sampling data obtained (Fig. 1e) [33].
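Nucleus (top-p) sampling restricts each sampling step to the smallest set of most probable tokens whose cumulative probability reaches a threshold, then renormalizes within that set. The sketch below shows the filtering step on a plain probability list; the function name `nucleus_filter`, the `top_p` value, and the toy probabilities are assumptions for illustration, not settings from the source.

```python
import random

def nucleus_filter(probs, top_p=0.9):
    """Zero out tokens outside the nucleus and renormalize the rest."""
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in (0, 1]")
    # Token indices sorted from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

# Hypothetical next-token probabilities over four tokens.
vocab = ["C", "N", "O", "F"]
probs = [0.5, 0.3, 0.15, 0.05]
filtered = nucleus_filter(probs, top_p=0.75)
print(filtered)  # only the two most probable tokens keep nonzero mass
print(random.choices(vocab, weights=filtered, k=1)[0])
```

Unlike temperature scaling, which reshapes the whole distribution, this cutoff removes the low-probability tail entirely.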
Q19. What is the simplest explanation for the name of the pretraining scheme?
To probe the effect of the pretraining scheme on the predictions, the authors added only a single feedforward layer to the pretrained CLM and E-CLM for bioactivity prediction.
Q20. What is the PI3K activity of compound 1?
Hit compound 1 has a new atom scaffold compared to all molecules in the ChEMBL database (version 28) annotated with “pActivity” ≥ 5.0 on PI3Kγ (“pActivity”: -log(molar IC50, XC50, EC50, AC50, Ki, Kd, or “potency”)).
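The “pActivity” definition above is a negative logarithm of a molar concentration; the sketch below assumes base 10, as is conventional for pIC50-style values. The function names and the example concentrations are hypothetical, and only the 5.0 threshold comes from the text.

```python
import math

def p_activity(molar_value):
    """Return -log10 of an activity value given in mol/L."""
    if molar_value <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(molar_value)

def meets_threshold(molar_value, threshold=5.0):
    """Check whether a measurement reaches the pActivity cutoff."""
    return p_activity(molar_value) >= threshold

# Example values: 100 nM and 1 mM, both expressed in mol/L.
print(round(p_activity(1e-7), 3), meets_threshold(1e-7))
print(round(p_activity(1e-3), 3), meets_threshold(1e-3))
```

On this scale a lower concentration gives a higher pActivity, so the cutoff of 5.0 corresponds to measured values of 10 µM or lower.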