Showing papers in "Journal of Chemical Information and Modeling in 2020"


Journal ArticleDOI
TL;DR: Virtual screening identifies several known drugs, including carfilzomib, eravacycline, valrubicin, lopinavir, and elbasvir, as potential inhibitors of the SARS-CoV-2 main protease; the findings can facilitate rational drug design targeting this enzyme.
Abstract: The recent outbreak of novel coronavirus disease-19 (COVID-19) calls for and welcomes possible treatment strategies using drugs on the market. It is very efficient to apply computer-aided drug design techniques to quickly identify promising drug repurposing candidates, especially after the detailed 3D structures of key viral proteins are resolved. The virus causing COVID-19 is SARS-CoV-2. Taking advantage of a recently released crystal structure of SARS-CoV-2 main protease in complex with a covalently bonded inhibitor, N3 (Liu et al., 10.2210/pdb6LU7/pdb), I conducted virtual docking screening of approved drugs and drug candidates in clinical trials. For the top docking hits, I then performed molecular dynamics simulations followed by binding free energy calculations using an end point method called MM-PBSA-WSAS (molecular mechanics/Poisson-Boltzmann surface area/weighted solvent-accessible surface area; Wang, Chem. Rev. 2019, 119, 9478; Wang, Curr. Comput.-Aided Drug Des. 2006, 2, 287; Wang; ; Hou J. Chem. Inf. Model., 2012, 52, 1199). Several promising known drugs stand out as potential inhibitors of SARS-CoV-2 main protease, including carfilzomib, eravacycline, valrubicin, lopinavir, and elbasvir. Carfilzomib, an approved anticancer drug acting as a proteasome inhibitor, has the best MM-PBSA-WSAS binding free energy, -13.8 kcal/mol. The second-best repurposing drug candidate, eravacycline, is a synthetic halogenated tetracycline-class antibiotic. Streptomycin, another antibiotic and a charged molecule, also demonstrates some inhibitory effect, even though the predicted binding free energy of the charged form (-3.8 kcal/mol) is not nearly as low as that of the neutral form (-7.9 kcal/mol). One bioactive, PubChem 23727975, has a binding free energy of -12.9 kcal/mol. Detailed receptor-ligand interactions were analyzed and hot spots for the receptor-ligand binding were identified. I found that one hot spot residue, His41, is a conserved residue across many viruses including SARS-CoV, SARS-CoV-2, MERS-CoV, and hepatitis C virus (HCV). The findings of this study can facilitate rational drug design targeting the SARS-CoV-2 main protease.

376 citations


Journal ArticleDOI
TL;DR: The binding pose and affinity between a ligand and an enzyme are very important pieces of information for computer-aided drug design. This study finds that the Vina approach converges much faster than the AD4 one; interestingly, however, AD4 performs better than Vina for 21 of the considered targets, whereas Vina is better than AD4 for 10 other targets.
Abstract: The binding pose and affinity between a ligand and enzyme are very important pieces of information for computer-aided drug design. In the initial stage of a drug discovery project, this information is often obtained by using molecular docking methods. Autodock4 and Autodock Vina are two commonly used open-source and free software tools to perform this task, and each has been cited more than 6000 times in the last ten years. It is of great interest to compare the success rate of the two docking software programs for a large and diverse set of protein-ligand complexes. In this study, we selected 800 protein-ligand complexes for which both PDB structures and experimental binding affinity are available. Docking calculations were performed for these complexes using both Autodock4 and Autodock Vina with different docking options related to computing resource consumption and accuracy. Our calculation results are in good agreement with a previous study showing that the Vina approach converges much faster than the AD4 one. However, interestingly, AD4 shows a better performance than Vina over 21 considered targets, whereas the Vina protocol is better than the AD4 package for 10 other targets. There are 16 complexes for which both the AD4 and Vina protocols fail to produce a reasonable correlation with the respective experiments, so neither is suitable for estimating binding free energies in these cases. In addition, the best docking option for performing the AD4 approach is the long option. However, the short option is the best solution for carrying out Vina docking. The obtained results will probably be useful for future docking studies in deciding which program to use.

209 citations


Journal ArticleDOI
TL;DR: ZINC20, a new version of ZINC, adds two major new features: billions of new molecules and new methods to search them, namely explicit atomic-level graph-based methods and 3D methods such as docking.
Abstract: Identifying and purchasing new small molecules to test in biological assays are enabling for ligand discovery, but as purchasable chemical space continues to grow into the tens of billions based on inexpensive make-on-demand compounds, simply searching this space becomes a major challenge. We have therefore developed ZINC20, a new version of ZINC with two major new features: billions of new molecules and new methods to search them. As a fully enumerated database, ZINC can be searched precisely using explicit atomic-level graph-based methods, such as SmallWorld for similarity and Arthor for pattern and substructure search, as well as 3D methods such as docking. Analysis of the new make-on-demand compound sets by these and related tools reveals startling features. For instance, over 97% of the core Bemis-Murcko scaffolds in make-on-demand libraries are unavailable from "in-stock" collections. Correspondingly, the number of new Bemis-Murcko scaffolds is rising almost as a linear fraction of the elaborated molecules. Thus, an 88-fold increase in the number of molecules in the make-on-demand versus the in-stock sets is built upon a 16-fold increase in the number of Bemis-Murcko scaffolds. The make-on-demand library is also more structurally diverse than physical libraries, with a massive increase in disc- and sphere-like shaped molecules. The new system is freely available at zinc20.docking.org.
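As a quick check on the figures quoted in the abstract above, an 88-fold increase in molecules built on a 16-fold increase in Bemis-Murcko scaffolds implies that the average number of molecules per scaffold grows by 88/16. A minimal Python sketch of that arithmetic follows; the variable names are illustrative and do not come from ZINC20.

```python
# Fold increases quoted in the abstract (make-on-demand vs. in-stock sets).
molecule_fold = 88
scaffold_fold = 16

# Dividing the two gives the change in average molecules per scaffold.
molecules_per_scaffold_fold = molecule_fold / scaffold_fold
print(molecules_per_scaffold_fold)  # 5.5
```

In other words, each scaffold is elaborated about 5.5 times more heavily in the make-on-demand set, under the assumption that both fold changes refer to the same pair of libraries.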

202 citations


Journal ArticleDOI
TL;DR: This analysis suggests that to improve the utility of state-of-the-art generative models in real discovery workflows, new algorithm development is warranted.
Abstract: The discovery of functional molecules is an expensive and time-consuming process, exemplified by the rising costs of small molecule therapeutic discovery. One class of techniques of growing interest for early stage drug discovery is de novo molecular generation and optimization, catalyzed by the development of new deep learning approaches. These techniques can suggest novel molecular structures intended to maximize a multiobjective function, e.g., suitability as a therapeutic against a particular target, without relying on brute-force exploration of a chemical space. However, the utility of these approaches is stymied by ignorance of synthesizability. To highlight the severity of this issue, we use a data-driven computer-aided synthesis planning program to quantify how often molecules proposed by state-of-the-art generative models cannot be readily synthesized. Our analysis demonstrates that there are several tasks for which these models generate unrealistic molecular structures despite performing well on popular quantitative benchmarks. Synthetic complexity heuristics can successfully bias generation toward synthetically tractable chemical space, although doing so necessarily detracts from the primary objective. This analysis suggests that to improve the utility of these models in real discovery workflows, new algorithm development is warranted.

146 citations


Journal ArticleDOI
TL;DR: A contemporary overview of the scientific, technical, and practical issues associated with running relative BFE simulations in AMBER20 is provided, with a focus on real-world drug discovery applications.
Abstract: Predicting protein-ligand binding affinities and the associated thermodynamics of biomolecular recognition is a primary objective of structure-based drug design. Alchemical free energy simulations offer a highly accurate and computationally efficient route to achieving this goal. While the AMBER molecular dynamics package has successfully been used for alchemical free energy simulations in academic research groups for decades, widespread impact in industrial drug discovery settings has been minimal because of the previous limitations within the AMBER alchemical code, coupled with challenges in system setup and postprocessing workflows. Through a close academia-industry collaboration we have addressed many of the previous limitations with an aim to improve accuracy, efficiency, and robustness of alchemical binding free energy simulations in industrial drug discovery applications. Here, we highlight some of the recent advances in AMBER20 with a focus on alchemical binding free energy (BFE) calculations, which are less computationally intensive than alternative binding free energy methods where full binding/unbinding paths are explored. In addition to scientific and technical advances in AMBER20, we also describe the essential practical aspects associated with running relative alchemical BFE calculations, along with recommendations for best practices, highlighting the importance not only of the alchemical simulation code but also the auxiliary functionalities and expertise required to obtain accurate and reliable results. This work is intended to provide a contemporary overview of the scientific, technical, and practical issues associated with running relative BFE simulations in AMBER20, with a focus on real-world drug discovery applications.

138 citations


Journal ArticleDOI
TL;DR: This study has developed a template-free self-corrected retrosynthesis predictor (SCROP) to predict retrosynthesis by using Transformer neural networks, which was 1.7 times more accurate than other state-of-the-art methods for compounds not appearing in the training set.
Abstract: Synthesis planning is the process of recursively decomposing target molecules into available precursors. Computer-aided retrosynthesis can potentially assist chemists in designing synthetic routes; however, at present, it is cumbersome and cannot provide satisfactory results. In this study, we have developed a template-free self-corrected retrosynthesis predictor (SCROP) to predict retrosynthesis using transformer neural networks. In the method, the retrosynthesis planning was converted to a machine translation problem from the products to molecular linear notations of the reactants. By coupling with a neural network-based syntax corrector, our method achieved an accuracy of 59.0% on a standard benchmark data set, which outperformed other deep learning methods by >21% and template-based methods by >6%. More importantly, our method was 1.7 times more accurate than other state-of-the-art methods for compounds not appearing in the training set.

132 citations


Journal ArticleDOI
TL;DR: This work presents a high-throughput protein-ligand complex MD simulation approach that uses the output from AutoDock Vina to improve docking results in distinguishing active from decoy ligands in the DUD-E (Directory of Useful Decoys, Enhanced) data set.
Abstract: Structure-based virtual screening relies on classical scoring functions that often fail to reliably discriminate binders from nonbinders. In this work, we present a high-throughput protein-ligand complex molecular dynamics (MD) simulation that uses the output from AutoDock Vina to improve docking results in distinguishing active from decoy ligands in the Directory of Useful Decoys, Enhanced (DUD-E) data set. MD trajectories are processed by evaluating ligand-binding stability using root-mean-square deviations. We select 56 protein targets (of 7 different protein classes) and 560 ligands (280 actives, 280 decoys) and show a 22% improvement in ROC AUC (area under the receiver operating characteristic curve), from an initial value of 0.68 (AutoDock Vina) to a final value of 0.83. The MD simulation demonstrates a robust performance across all seven different protein classes. In addition, some predicted ligand-binding modes are moderately refined during MD simulations. These results systematically validate the reliability of a physics-based approach to evaluate protein-ligand binding interactions.
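To make the ROC AUC figures above concrete, the following Python sketch computes ROC AUC as the probability that a randomly chosen active is scored above a randomly chosen decoy, and reproduces the quoted 22% relative improvement from 0.68 to 0.83. The function name and toy data are hypothetical and are not taken from the paper; for docking scores where more negative means better, the scores would need to be negated first.

```python
def roc_auc(scores, labels):
    """ROC AUC via pairwise comparison of actives (1) and decoys (0).

    A higher score means "more likely active"; tied pairs count as 0.5.
    """
    actives = [s for s, y in zip(scores, labels) if y == 1]
    decoys = [s for s, y in zip(scores, labels) if y == 0]
    if not actives or not decoys:
        raise ValueError("need at least one active and one decoy")
    wins = 0.0
    for a in actives:
        for d in decoys:
            if a > d:
                wins += 1.0
            elif a == d:
                wins += 0.5
    return wins / (len(actives) * len(decoys))


# Hypothetical toy scores: three actives followed by three decoys.
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)  # 8 of 9 active/decoy pairs ranked correctly

# Relative improvement implied by the values quoted in the abstract.
relative_gain = (0.83 - 0.68) / 0.68
print(round(relative_gain * 100))  # 22
```

The pairwise form is quadratic in the number of ligands, which is fine for a set of 560 ligands; a rank-based formula would be preferable for much larger screens.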

128 citations


Journal ArticleDOI
TL;DR: This is the first molecular generative model to incorporate 3D structural information directly in the design process, and the effectiveness and applicability of this approach are demonstrated on a diverse range of design problems: fragment linking, scaffold hopping, and proteolysis targeting chimera (PROTAC) design.
Abstract: Rational compound design remains a challenging problem for both computational methods and medicinal chemists. Computational generative methods have begun to show promising results for the design pr...

122 citations


Journal ArticleDOI
TL;DR: The DockThor program can be considered suitable for docking highly flexible and challenging ligands, with up to 40 rotatable bonds, and it outperforms other protein-ligand docking programs on the LEADS-PEP data set.
Abstract: Protein-peptide interactions play a crucial role in many cellular and biological functions, which justifies the increasing interest in the development of peptide-based drugs. However, predicting experimental binding modes and affinities in protein-peptide docking remains a great challenge for most docking programs due to some particularities of this class of ligands, such as the high degree of flexibility. In this paper, we present the performance of the DockThor program on the LEADS-PEP data set, a benchmarking set composed of 53 diverse protein-peptide complexes with peptides ranging from 3 to 12 residues and with up to 51 rotatable bonds. The DockThor performance for pose prediction on redocking studies was compared with some state-of-the-art docking programs that were also evaluated on the LEADS-PEP data set, AutoDock, AutoDock Vina, Surflex, GOLD, Glide, rDock, and DINC, as well as with the task-specific docking protocol HPepDock. Our results indicate that DockThor could dock 40% of the cases with an overall backbone RMSD below 2.5 Å when the top-scored docking pose was considered, exhibiting similar results to Glide and outperforming other protein-ligand docking programs, whereas rDock and HPepDock achieved superior results. Assessing the docking poses closest to the crystal structure (i.e., best-RMSD pose), DockThor achieved a success rate of 60% in pose prediction. Due to the great overall performance of handling peptidic compounds, the DockThor program can be considered as suitable for docking highly flexible and challenging ligands, with up to 40 rotatable bonds. DockThor is freely available as a virtual screening Web server at https://www.dockthor.lncc.br/ .
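The success rates above count redocked poses whose backbone RMSD to the crystal pose falls below a 2.5 Å cutoff. The Python sketch below illustrates that bookkeeping under simplifying assumptions: atoms are already matched one to one, both poses share the receptor frame so no superposition is applied, and the function names and toy coordinates are hypothetical rather than part of DockThor.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def success_rate(rmsd_values, cutoff=2.5):
    """Fraction of complexes whose pose RMSD is strictly below the cutoff (in Å)."""
    if not rmsd_values:
        raise ValueError("no RMSD values given")
    return sum(1 for r in rmsd_values if r < cutoff) / len(rmsd_values)


# Toy example: a pose shifted by 1 Å along z relative to the crystal pose.
crystal = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pose = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(crystal, pose))  # 1.0

# Hypothetical per-complex RMSDs; two of five fall below 2.5 Å.
print(success_rate([1.0, 3.0, 2.4, 2.5, 5.0]))  # 0.4
```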

122 citations


Journal ArticleDOI
TL;DR: This paper introduces a set of quantitative criteria to capture different uncertainty aspects, and uses these criteria to compare MC-Dropout, Deep Ensembles, and bootstrapping, both theoretically in a unified framework that separates aleatoric/epistemic uncertainty and experimentally on public datasets.
Abstract: Advances in deep neural network (DNN)-based molecular property prediction have recently led to the development of models of remarkable accuracy and generalization ability, with graph convolutional neural networks (GCNNs) reporting state-of-the-art performance for this task. However, some challenges remain, and one of the most important that needs to be fully addressed concerns uncertainty quantification. DNN performance is affected by the volume and the quality of the training samples. Therefore, establishing when and to what extent a prediction can be considered reliable is just as important as outputting accurate predictions, especially when out-of-domain molecules are targeted. Recently, several methods to account for uncertainty in DNNs have been proposed, most of which are based on approximate Bayesian inference. Among these, only a few scale to the large data sets required in applications. Evaluating and comparing these methods has recently attracted great interest, but results are generally fragmented and absent for molecular property prediction. In this paper, we quantitatively compare scalable techniques for uncertainty estimation in GCNNs. We introduce a set of quantitative criteria to capture different uncertainty aspects and then use these criteria to compare MC-dropout, Deep Ensembles, and bootstrapping, both theoretically in a unified framework that separates aleatoric/epistemic uncertainty and experimentally on public data sets. Our experiments quantify the performance of the different uncertainty estimation methods and their impact on uncertainty-related error reduction. Our findings indicate that Deep Ensembles and bootstrapping consistently outperform MC-dropout, with different context-specific pros and cons. Our analysis leads to a better understanding of the role of aleatoric/epistemic uncertainty, also in relation to the target data set features, and highlights the challenge posed by out-of-domain uncertainty.
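One common way to separate aleatoric from epistemic uncertainty in an ensemble, sketched here in Python, uses the law of total variance: the average of the variances predicted by the members is read as aleatoric, and the spread of the members' mean predictions is read as epistemic. This is an illustration under the assumption that every ensemble member outputs a mean and a variance for the same molecule; it is not necessarily the exact formulation used in the paper, and the function name and numbers are hypothetical.

```python
from statistics import fmean, pvariance


def decompose_uncertainty(member_means, member_variances):
    """Split total predictive variance into (aleatoric, epistemic, total).

    aleatoric = mean of per-member predicted variances (data noise)
    epistemic = population variance of per-member means (model disagreement)
    """
    if len(member_means) != len(member_variances) or not member_means:
        raise ValueError("need one mean and one variance per ensemble member")
    aleatoric = fmean(member_variances)
    epistemic = pvariance(member_means)
    return aleatoric, epistemic, aleatoric + epistemic


# Hypothetical predictions from a four-member ensemble for one molecule.
means = [1.0, 1.2, 0.8, 1.0]
variances = [0.10, 0.12, 0.08, 0.10]
aleatoric, epistemic, total = decompose_uncertainty(means, variances)
print(round(aleatoric, 3), round(epistemic, 3), round(total, 3))  # 0.1 0.02 0.12
```

For an out-of-domain molecule one would expect the members to disagree more, so the epistemic term would grow while the aleatoric term need not change.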

121 citations


Journal ArticleDOI
TL;DR: A supercomputer-driven pipeline for in silico drug discovery using enhanced sampling molecular dynamics (MD) and ensemble docking is presented, including the use of quantum mechanical, machine learning, and artificial intelligence methods to cluster MD trajectories and rescore docking poses.
Abstract: We present a supercomputer-driven pipeline for in silico drug discovery using enhanced sampling molecular dynamics (MD) and ensemble docking. Ensemble docking makes use of MD results by docking compound databases into representative protein binding-site conformations, thus taking into account the dynamic properties of the binding sites. We also describe preliminary results obtained for 24 systems involving eight proteins of the proteome of SARS-CoV-2. The MD involves temperature replica exchange enhanced sampling, making use of massively parallel supercomputing to quickly sample the configurational space of protein drug targets. Using the Summit supercomputer at the Oak Ridge National Laboratory, more than 1 ms of enhanced sampling MD can be generated per day. We have ensemble docked repurposing databases to 10 configurations of each of the 24 SARS-CoV-2 systems using AutoDock Vina. Comparison to experiment demonstrates remarkably high hit rates for the top scoring tranches of compounds identified by our ensemble approach. We also demonstrate that, using Autodock-GPU on Summit, it is possible to perform exhaustive docking of one billion compounds in under 24 h. Finally, we discuss preliminary results and planned improvements to the pipeline, including the use of quantum mechanical (QM), machine learning, and artificial intelligence (AI) methods to cluster MD trajectories and rescore docking poses.
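For readers unfamiliar with temperature replica exchange, the Python sketch below shows the standard Metropolis criterion for swapping configurations between two replicas at different temperatures. It is a textbook illustration, not the pipeline's actual code; the function name, the energies, and the temperatures are hypothetical.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol*K)


def swap_acceptance(energy_i, energy_j, temp_i, temp_j):
    """Probability of accepting a swap between replicas i and j.

    energy_i is the potential energy (kcal/mol) of the configuration currently
    at temperature temp_i (K), and likewise for j. The criterion is
    min(1, exp[(beta_i - beta_j) * (energy_i - energy_j)]).
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    exponent = (beta_i - beta_j) * (energy_i - energy_j)
    if exponent >= 0.0:
        return 1.0  # also avoids overflow in math.exp for large exponents
    return math.exp(exponent)


# The hotter replica holds the lower-energy configuration: always accepted.
p_up = swap_acceptance(-1000.0, -1005.0, 300.0, 310.0)

# The colder replica already holds the lower-energy configuration: sometimes accepted.
p_down = swap_acceptance(-1005.0, -1000.0, 300.0, 310.0)
print(p_up, round(p_down, 2))
```

Accepted swaps let configurations that crossed barriers at high temperature migrate down to the temperature of interest, which is what makes the sampling "enhanced".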

Journal ArticleDOI
TL;DR: This application note aims to offer the community a production-ready tool for de novo design, called REINVENT, which can be effectively applied on drug discovery projects that are striving to resolve either exploration or exploitation problems while navigating the chemical space.
Abstract: In the past few years, we have witnessed a renaissance of the field of molecular de novo drug design. The advancements in deep learning and artificial intelligence (AI) have triggered an avalanche of ideas on how to translate such techniques to a variety of domains including the field of drug design. A range of architectures have been devised to find the optimal way of generating chemical compounds by using either graph- or string (SMILES)-based representations. With this application note, we aim to offer the community a production-ready tool for de novo design, called REINVENT. It can be effectively applied on drug discovery projects that are striving to resolve either exploration or exploitation problems while navigating the chemical space. It can facilitate the idea generation process by bringing to the researcher's attention the most promising compounds. REINVENT's code is publicly available at https://github.com/MolecularAI/Reinvent.

Journal ArticleDOI
TL;DR: Uncertainty quantification (UQ) is an important component of molecular property prediction, particularly for drug discovery applications where model predictions direct experimental design.
Abstract: Uncertainty quantification (UQ) is an important component of molecular property prediction, particularly for drug discovery applications where model predictions direct experimental design and where...

Journal ArticleDOI
TL;DR: This work presents the results of large-scale prospective application of the FEP+ method in active drug discovery projects in an industry setting at Merck KGaA, Darmstadt, Germany and compares results obtained on a new diverse, public benchmark of eight pharmaceutically relevant targets.
Abstract: Accurate ranking of compounds with regards to their binding affinity to a protein using computational methods is of great interest to pharmaceutical research. Physics-based free energy calculations are regarded as the most rigorous way to estimate binding affinity. In recent years, many retrospective studies carried out both in academia and industry have demonstrated its potential. Here, we present the results of large-scale prospective application of the FEP+ method in active drug discovery projects in an industry setting at Merck KGaA, Darmstadt, Germany. We compare these prospective data to results obtained on a new diverse, public benchmark of eight pharmaceutically relevant targets. Our results offer insights into the challenges faced when using free energy calculations in real-life drug discovery projects and identify limitations that could be tackled by future method development. The new public data set we provide to the community can support further method development and comparative benchmarking of free energy calculations.

Journal ArticleDOI
Abstract: The novel coronavirus (SARS-CoV-2) has infected several million people and caused thousands of deaths worldwide since Dec 2019. As the disease is spreading rapidly all over the world, it is urgent to find effective drugs to treat the virus. The main protease (Mpro) of SARS-CoV-2 is one of the potential drug targets. Therefore, in this context, we used rigorous computational methods, including molecular docking, fast pulling of ligand (FPL), and free energy perturbation (FEP), to investigate potential inhibitors of SARS-CoV-2 Mpro. We first tested our approach with three reported inhibitors of SARS-CoV-2 Mpro, and our computational results are in good agreement with the respective experimental data. Subsequently, we applied our approach to a database of ~4600 natural compounds, as well as 8 available HIV-1 protease (PR) inhibitors and an aza-peptide epoxide. Molecular docking resulted in a short list of 35 natural compounds, which was subsequently refined using the FPL scheme. FPL simulations resulted in five potential inhibitors, including three natural compounds and two available HIV-1 PR inhibitors. Finally, FEP, the most accurate and precise method, was used to determine the absolute binding free energy of these five compounds. FEP results indicate that two natural compounds, cannabisin A and isoacteoside, and an HIV-1 PR inhibitor, darunavir, exhibit large binding free energy to SARS-CoV-2 Mpro, which is larger than that of 13b, the most reliable SARS-CoV-2 Mpro inhibitor recently reported. The binding free energy largely arises from van der Waals interaction. We also found that Glu166 forms H-bonds with all the inhibitors. Replacing Glu166 with an alanine residue leads to a ~2.0 kcal/mol decrease in the affinity of darunavir for SARS-CoV-2 Mpro. Our results could contribute to the development of potential drugs inhibiting SARS-CoV-2.
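To put the roughly 2.0 kcal/mol loss of affinity reported for the Glu166-to-alanine replacement in perspective, a change in binding free energy ΔΔG corresponds to a fold change in the dissociation constant of exp(ΔΔG / RT). The Python sketch below evaluates this; the temperature of 298.15 K is an assumption, since the abstract does not state one, and the function name is hypothetical.

```python
import math

R_KCAL = 0.0019872  # gas constant in kcal/(mol*K)


def kd_fold_change(ddg_kcal_per_mol, temperature_k=298.15):
    """Fold change in dissociation constant for a binding free energy change.

    A positive ddG (weaker binding) gives a fold change greater than 1.
    """
    return math.exp(ddg_kcal_per_mol / (R_KCAL * temperature_k))


# About 2.0 kcal/mol weaker binding, as quoted in the abstract.
fold = kd_fold_change(2.0)
print(round(fold))  # 29
```

Under that assumption, a 2.0 kcal/mol penalty corresponds to binding that is weaker by roughly a factor of 30.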

Journal ArticleDOI
TL;DR: It is asserted that alchemical binding free energy methods using all-atom molecular dynamics simulations have matured to the point where they can be applied in virtual screening campaigns as a final scoring stage to prioritize the top molecules for experimental testing.
Abstract: Virtual high throughput screening (vHTS) in drug discovery is a powerful approach to identify hits: when applied successfully, it can be much faster and cheaper than experimental high-throughput screening approaches. However, mainstream vHTS tools have significant limitations: ligand-based methods depend on knowledge of existing chemical matter, while structure-based tools such as docking involve significant approximations that limit their accuracy. Recent advances in scientific methods coupled with dramatic speedups in computational processing with GPUs make this an opportune time to consider the role of more rigorous methods that could improve the predictive power of vHTS workflows. In this Perspective, we assert that alchemical binding free energy methods using all-atom molecular dynamics simulations have matured to the point where they can be applied in virtual screening campaigns as a final scoring stage to prioritize the top molecules for experimental testing. Specifically, we propose that alchemical absolute binding free energy (ABFE) calculations offer the most direct and computationally efficient approach within a rigorous statistical thermodynamic framework for computing binding energies of diverse molecules, as is required for virtual screening. ABFE calculations are particularly attractive for drug discovery at this point in time, where the confluence of large-scale genomics data and insights from chemical biology have unveiled a large number of promising disease targets for which no small molecule binders are known, precluding ligand-based approaches, and where traditional docking approaches have foundered to find progressible chemical matter.

Journal ArticleDOI
TL;DR: It is suggested that Convolution Neural Network models built with amino acid property descriptors are the most widely applicable to the types of protein redesign problems faced in the pharmaceutical industry.
Abstract: Protein redesign and engineering has become an important task in pharmaceutical research and development. Recent advances in technology have enabled efficient protein redesign by mimicking natural evolutionary mutation, selection, and amplification steps in the laboratory environment. For any given protein, the number of possible mutations is astronomical. It is impractical to synthesize all sequences or even to investigate all functionally interesting variants. Recently, there has been an increased interest in using machine learning to assist protein redesign, since prediction models can be used to virtually screen a large number of novel sequences. However, many state-of-the-art machine learning models, especially deep learning models, have not been extensively explored. Moreover, only a small selection of protein sequence descriptors has been considered. In this work, the performance of prediction models built using an array of machine learning methods and protein descriptor types, including two novel, single amino acid descriptors and one structure-based three-dimensional descriptor, is benchmarked. The predictions were evaluated on a diverse collection of public and proprietary data sets, using a variety of evaluation metrics. The results of this comparison suggest that Convolution Neural Network models built with amino acid property descriptors are the most widely applicable to the types of protein redesign problems faced in the pharmaceutical industry.

Journal ArticleDOI
TL;DR: This paper presents TorchANI, a PyTorch-based program for training/inference of ANI (ANAKIN-ME) deep learning models to obtain potential energy surfaces and other physical properties of molecular systems.
Abstract: This paper presents TorchANI, a PyTorch-based program for training/inference of ANI (ANAKIN-ME) deep learning models to obtain potential energy surfaces and other physical properties of molecular systems. ANI is an accurate neural network potential originally implemented using C++/CUDA in a program called NeuroChem. Compared with NeuroChem, TorchANI has a design emphasis on being lightweight, user friendly, cross platform, and easy to read and modify for fast prototyping, while allowing acceptable sacrifice on running performance. Because the computation of atomic environmental vectors and atomic neural networks are all implemented using PyTorch operators, TorchANI is able to use PyTorch's autograd engine to automatically compute analytical forces and Hessian matrices, as well as do force training without requiring any additional codes. TorchANI is open-source and freely available on GitHub: https://github.com/aiqm/torchani.

Journal ArticleDOI
TL;DR: The computational results complement previous crystallographic studies on the SARS-CoV-2 enzyme and, together with other simulation studies, should contribute to outline useful structure–activity relationships.
Abstract: Herein, we investigate the structure and flexibility of the hydrated SARS-CoV-2 main protease by means of 2.0 μs molecular dynamics (MD) simulations in explicit solvent. After having performed electrostatic pKa calculations on several X-ray structures, we consider both the native (unbound) configuration of the enzyme and its noncovalent complex with a model peptide, Ace-Ala-Val-Leu-Gln∼Ser-Nme, which mimics the polyprotein sequence recognized at the active site. For each configuration, we also study their monomeric and homodimeric forms. The simulations of the unbound systems show that the relative orientation of domain III is not stable in the monomeric form and provide further details about interdomain motions, protomer-protomer interactions, inter-residue contacts, accessibility at the catalytic site, etc. In the presence of the peptide substrate, the monomeric protease exhibits a stable interdomain arrangement, but the relative orientation between the scissile peptide bond and the catalytic dyad is not favorable for catalysis. By means of comparative analysis, we further assess the catalytic impact of the enzyme dimerization, the actual flexibility of the active site region, and other structural effects induced by substrate binding. Overall, our computational results complement previous crystallographic studies on the SARS-CoV-2 enzyme and, together with other simulation studies, should contribute to outline useful structure-activity relationships.

Journal ArticleDOI
TL;DR: It is reported that SARS-CoV-2 has undergone 8309 single mutations which can be clustered into six subtypes, and mutations on 40% of nucleotides in the nucleocapsid gene in the population level are identified, signaling potential impacts on the ongoing development of COVID-19 diagnosis, vaccines, and antibody and small-molecular drugs.
Abstract: Tremendous effort has been given to the development of diagnostic tests, preventive vaccines, and therapeutic medicines for coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Much of this development has been based on the reference genome collected on January 5, 2020. Based on the genotyping of 15 140 genome samples collected up to June 1, 2020, we report that SARS-CoV-2 has undergone 8309 single mutations which can be clustered into six subtypes. We introduce mutation ratio and mutation h-index to characterize the protein conservativeness and unveil that SARS-CoV-2 envelope protein, main protease, and endoribonuclease protein are relatively conservative, while SARS-CoV-2 nucleocapsid protein, spike protein, and papain-like protease are relatively nonconservative. In particular, we have identified mutations on 40% of nucleotides in the nucleocapsid gene in the population level, signaling potential impacts on the ongoing development of COVID-19 diagnosis, vaccines, and antibody and small-molecular drugs.

Journal ArticleDOI
TL;DR: The results positively advocate bidirectional strategies for SMILES-based molecular de novo design, with BIMODAL showing superior results to the unidirectional forward RNN for most of the criteria in the tested conditions.
Abstract: Recurrent neural networks (RNNs) are able to generate de novo molecular designs using simplified molecular input line entry system (SMILES) string representations of the chemical structure. RNN-based structure generation is usually performed unidirectionally, by growing SMILES strings from left to right. However, there is no natural start or end of a small molecule, and SMILES strings are intrinsically nonunivocal representations of molecular graphs. These properties motivate bidirectional structure generation. Here, bidirectional generative RNNs for SMILES-based molecule design are introduced. To this end, two established bidirectional methods were implemented, and a new method for SMILES string generation and data augmentation is introduced: the bidirectional molecule design by alternate learning (BIMODAL). These three bidirectional strategies were compared to the unidirectional forward RNN approach for SMILES string generation, in terms of the (i) novelty, (ii) scaffold diversity, and (iii) chemical-biological relevance of the computer-generated molecules. The results positively advocate bidirectional strategies for SMILES-based molecular de novo design, with BIMODAL showing superior results to the unidirectional forward RNN for most of the criteria in the tested conditions. The code of the methods and the pretrained models can be found at https://github.com/ETHmodlab/BIMODAL.

Journal ArticleDOI
TL;DR: In this work, an application for automated setup and processing of free energy calculations is presented and several sanity checks for assessing the reliability of the calculations were implemented, constituting a distinct advantage over existing open-source tools.
Abstract: Free-energy calculations have seen increased usage in structure-based drug design. Despite the rising interest, automation of the complex calculations and subsequent analysis of their results are still hampered by the restricted choice of available tools. In this work, an application for automated setup and processing of free-energy calculations is presented. Several sanity checks for assessing the reliability of the calculations were implemented, constituting a distinct advantage over existing open-source tools. The underlying workflow is built on top of the software Sire, SOMD, BioSimSpace, and OpenMM and uses the AMBER 14SB and GAFF2.1 force fields. It was validated on two datasets originally composed by Schrödinger, consisting of 14 protein structures and 220 ligands. Predicted binding affinities were in good agreement with experimental values. For the larger dataset, the average correlation coefficient Rp was 0.70 ± 0.05 and average Kendall's τ was 0.53 ± 0.05, which are broadly comparable to or better than previously reported results using other methods.
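
The two agreement metrics quoted above (Pearson Rp and Kendall's τ) can be sketched in plain Python as follows. This is only a minimal illustration, not the tool's own analysis code: the affinity values are invented, and the Kendall variant shown is the simple tau-a (ties count as neither concordant nor discordant), which may differ from the variant used in the paper.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical binding free energies (kcal/mol) for five ligands.
experimental = [-9.1, -8.4, -7.9, -7.2, -6.5]
predicted = [-8.8, -8.9, -7.5, -7.0, -6.9]

print(f"Pearson Rp  = {pearson_r(experimental, predicted):.2f}")
print(f"Kendall tau = {kendall_tau_a(experimental, predicted):.2f}")
```

Pearson Rp measures linear agreement of the values, while Kendall's τ depends only on the ranking of ligands, which is often what matters when prioritizing compounds.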

Journal ArticleDOI
TL;DR: This work presents a new dataset for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and performs a comprehensive evaluation of grid-based convolutional neural network models on this dataset.
Abstract: One of the main challenges in drug discovery is predicting protein-ligand binding affinity. Recently, machine learning approaches have made substantial progress on this task. However, current methods of model evaluation are overly optimistic in measuring generalization to new targets, and there does not exist a standard data set of sufficient size to compare performance between models. We present a new data set for structure-based machine learning, the CrossDocked2020 set, with 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank, and perform a comprehensive evaluation of grid-based convolutional neural network (CNN) models on this data set. We also demonstrate how the partitioning of the training data and test data can impact the results of models trained with the PDBbind data set, how performance improves by adding more lower-quality training data, and how training with docked poses imparts pose sensitivity to the predicted affinity of a complex. Our best performing model, an ensemble of five densely connected CNNs, achieves a root mean squared error of 1.42 and Pearson R of 0.612 on the affinity prediction task, an AUC of 0.956 at binding pose classification, and a 68.4% accuracy at pose selection on the CrossDocked2020 set. By providing data splits for clustered cross-validation and the raw data for the CrossDocked2020 set, we establish the first standardized data set for training machine learning models to recognize ligands in noncognate target structures while also greatly expanding the number of poses available for training. In order to facilitate community adoption of this data set for benchmarking protein-ligand binding affinity prediction, we provide our models, weights, and the CrossDocked2020 set at https://github.com/gnina/models.

Journal ArticleDOI
TL;DR: A combined protocol for the modeling of a ternary complex induced by a given PROTAC, which may be used to design PROTACs for new targets, as well as improve proteolysis-targeting chimeras for existing targets, potentially cutting down time and synthesis efforts.
Abstract: Proteolysis-targeting chimeras (PROTACs), which induce degradation by recruitment of an E3 ligase to a target protein, are gaining much interest as a new pharmacological modality. However, designing PROTACs is challenging. Formation of a ternary complex between the protein target, the PROTAC, and the recruited E3 ligase is considered paramount for successful degradation. A structural model of this ternary complex could in principle inform rational PROTAC design. Unfortunately, only a handful of structures are available for such complexes, necessitating tools for their modeling. We developed a combined protocol for the modeling of a ternary complex induced by a given PROTAC. Our protocol alternates between sampling of the protein-protein interaction space and the PROTAC molecule conformational space. Application of this protocol, PRosettaC, to a benchmark of known PROTAC ternary complexes results in near-native predictions, with often atomic accuracy prediction of the protein chains, as well as the PROTAC binding moieties. It allowed the modeling of a CRBN/BTK complex that recapitulated experimental results for a series of PROTACs. PRosettaC-generated models may be used to design PROTACs for new targets, as well as improve PROTACs for existing targets, potentially cutting down time and synthesis efforts. To enable wide access to this protocol, we have made it available through a web server (https://prosettac.weizmann.ac.il/).

Journal ArticleDOI
TL;DR: Results supported that this novel compound 16 binds with domains I and II, and the domain II–III linker of the 3CLpro protein, suggesting its suitability as a strong candidate for therapeutic discovery against COVID-19.
Abstract: The novel coronavirus, SARS-CoV-2, has caused a recent pandemic called COVID-19 and a severe health threat around the world. In the current situation, the virus is rapidly spreading worldwide, and the discovery of a vaccine and potential therapeutics is critically important. The crystal structure for the main protease (Mpro) of SARS-CoV-2, 3-chymotrypsin-like cysteine protease (3CLpro), was recently made available and is considerably similar to that of the previously reported SARS-CoV. Due to its essentiality in viral replication, it represents a potential drug target. Herein, a computer-aided drug design (CADD) approach was implemented for the initial screening of 13 approved antiviral drugs. Molecular docking of 13 antivirals against the 3-chymotrypsin-like cysteine protease (3CLpro) enzyme was accomplished, and indinavir was described as a lead drug with a docking score of -8.824 and a XP Gscore of -9.466 kcal/mol. Indinavir possesses an important pharmacophore, hydroxyethylamine (HEA), and thus, a new library of HEA compounds (>2500) was subjected to virtual screening that led to 25 hits with a better docking score than indinavir. Only compound 16, with a docking score of -8.955, adhered to drug-like parameters, and the structure-activity relationship (SAR) analysis was demonstrated to highlight the importance of chemical scaffolds therein. Molecular dynamics (MD) simulation analysis performed over 100 ns supported the stability of 16 within the binding pocket. Overall, our results supported that this novel compound 16 binds with domains I and II, and the domain II-III linker of the 3CLpro protein, suggesting its suitability as a strong candidate for therapeutic discovery against COVID-19.

Journal ArticleDOI
TL;DR: An automated, unsupervised method for connecting scientific literature to inorganic synthesis insights, which learns representations of materials corresponding to synthesis-related properties and whose behavior complements existing thermodynamic knowledge.
Abstract: Leveraging new data sources is a key step in accelerating the pace of materials design and discovery. To complement the strides in synthesis planning driven by historical, experimental, and computed data, we present an automated, unsupervised method for connecting scientific literature to inorganic synthesis insights. Starting from the natural language text, we apply word embeddings from language models, which are fed into a named entity recognition model, upon which a conditional variational autoencoder is trained to generate syntheses for any inorganic materials of interest. We show the potential of this technique by predicting precursors for two perovskite materials, using only training data published over a decade prior to their first reported syntheses. We demonstrate that the model learns representations of materials corresponding to synthesis-related properties and that the model's behavior complements the existing thermodynamic knowledge. Finally, we apply the model to perform synthesizability screening for proposed novel perovskite compounds.

Journal ArticleDOI
TL;DR: A fragment molecular orbital (FMO) based interaction analysis of the complex between the SARS-CoV-2 main protease (Mpro) and its peptide-like inhibitor N3 (PDB ID: 6LU7) found His41, His163, His164, and Glu166 to be the most important amino acid residues of Mpro in interacting with the inhibitor, mainly due to hydrogen bonding.
Abstract: The worldwide spread of COVID-19 (new coronavirus found in 2019) is an emergent issue to be tackled. In fact, a great deal of work in various fields has been done in a rather short period. Here, we report a fragment molecular orbital (FMO) based interaction analysis on a complex between the SARS-CoV-2 main protease (Mpro) and its peptide-like inhibitor N3 (PDB ID: 6LU7). The target inhibitor molecule was segmented into five fragments in order to capture site-specific interactions with amino acid residues of the protease. The interaction energies were decomposed into several contributions, and then the characteristics of hydrogen bonding and dispersion stabilization were made clear. Furthermore, the hydration effect was incorporated by the Poisson-Boltzmann (PB) scheme. From the present FMO study, His41, His163, His164, and Glu166 were found to be the most important amino acid residues of Mpro in interacting with the inhibitor, mainly due to hydrogen bonding. A guideline for optimizing the inhibitor molecule was also suggested on the basis of the FMO analysis.

Journal ArticleDOI
Yibo Li, Jianxing Hu, Yanxing Wang, Jielong Zhou, Liangren Zhang, Zhenming Liu
TL;DR: A scaffold-based molecular generative model for drug discovery is proposed, which performs molecule generation based on a wide spectrum of scaffold definitions, including Bemis-Murcko (BM) scaffolds, cyclic skeletons, and scaffolds with specifications on side-chain properties.
Abstract: The ultimate goal of drug design is to find novel compounds with desirable pharmacological properties. Designing molecules retaining particular scaffolds as their core structures is an efficient way to obtain potential drug candidates. We propose a scaffold-based molecular generative model for drug discovery, which performs molecule generation based on a wide spectrum of scaffold definitions, including Bemis-Murcko scaffolds, cyclic skeletons, and scaffolds with specifications on side-chain properties. The model can generalize the learned chemical rules of adding atoms and bonds to a given scaffold. The generated compounds were evaluated by molecular docking in DRD2 targets, and the results demonstrated that this approach can be effectively applied to solve several drug design problems, including the generation of compounds containing a given scaffold and de novo drug design of potential drug candidates with specific docking scores.

Journal ArticleDOI
TL;DR: A novel dataset specifically designed for virtual screening and machine learning, consisting of 15 targets, 7844 confirmed active and 407,381 confirmed inactive compounds, which mimics experimental screening decks in terms of hit rate (ratio of active to inactive compounds) and potency distribution.
Abstract: Comparative evaluation of virtual screening methods requires a rigorous benchmarking procedure on diverse, realistic, and unbiased data sets. Recent investigations from numerous research groups unambiguously demonstrate that artificially constructed ligand sets classically used by the community (e.g., DUD, DUD-E, MUV) are unfortunately biased by both obvious and hidden chemical biases, therefore overestimating the true accuracy of virtual screening methods. We herewith present a novel data set (LIT-PCBA) specifically designed for virtual screening and machine learning. LIT-PCBA relies on 149 dose-response PubChem bioassays that were additionally processed to remove false positives and assay artifacts and keep active and inactive compounds within similar molecular property ranges. To ascertain that the data set is suited to both ligand-based and structure-based virtual screening, target sets were restricted to single protein targets for which at least one X-ray structure is available in complex with ligands of the same phenotype (e.g., inhibitor, inverse agonist) as that of the PubChem active compounds. Preliminary virtual screening on the 21 remaining target sets with state-of-the-art orthogonal methods (2D fingerprint similarity, 3D shape similarity, molecular docking) enabled us to select 15 target sets for which at least one of the three screening methods is able to enrich the top 1%-ranked compounds in true actives by at least a factor of 2. The corresponding ligand sets (training, validation) were finally unbiased by the recently described asymmetric validation embedding (AVE) procedure to afford the LIT-PCBA data set, consisting of 15 targets and 7844 confirmed active and 407,381 confirmed inactive compounds. The data set mimics experimental screening decks in terms of hit rate (ratio of active to inactive compounds) and potency distribution. It is available online at http://drugdesign.unistra.fr/LIT-PCBA for download and for benchmarking novel virtual screening methods, notably those relying on machine learning.
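
The selection criterion mentioned above (enriching the top 1%-ranked compounds in true actives by at least a factor of 2) corresponds to the commonly used enrichment factor. The sketch below is a generic illustration rather than the LIT-PCBA authors' code: it assumes that a higher score means a better rank (docking scores would need their sign flipped), and the screening deck of 1000 compounds with 20 actives is entirely made up.

```python
def enrichment_factor(scores, is_active, fraction=0.01):
    """Enrichment factor: hit rate in the top fraction divided by overall hit rate.

    Assumes higher scores rank first; `is_active` holds booleans (or 0/1).
    """
    n = len(scores)
    total_actives = sum(is_active)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    n_top = max(1, round(n * fraction))
    # Rank compounds from best to worst score and keep the top slice.
    ranked = sorted(zip(scores, is_active), key=lambda pair: pair[0], reverse=True)
    hits_in_top = sum(active for _, active in ranked[:n_top])
    return (hits_in_top / n_top) / (total_actives / n)


# Hypothetical deck: compound i has score 1000 - i, so index 0 ranks first.
scores = [1000 - i for i in range(1000)]
active_indices = {0, 3, 5, 8} | set(range(100, 116))  # 20 actives in total
is_active = [i in active_indices for i in range(1000)]

ef1 = enrichment_factor(scores, is_active, fraction=0.01)
print(f"EF(1%) = {ef1:.1f}")
```

An enrichment factor of 1 corresponds to random ranking, so the threshold of 2 in the abstract asks for at least twice the actives expected by chance in the top 1%.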

Journal ArticleDOI
TL;DR: A main finding is that changing the force field has a stronger effect on the simulated aggregation pathway than changing the peptide sequence, and the new force fields are not able to reproduce the experimental aggregation propensity order of the peptides; CHARMM36m, especially its version with increased protein-water interactions, gives the most promising results and is recommended for peptide aggregation simulations and as the basis for future reparameterizations.
Abstract: The progress toward understanding the molecular basis of Alzheimer's disease is strongly connected to elucidating the early aggregation events of the amyloid-β (Aβ) peptide. Molecular dynamics (MD) simulations provide a viable technique to study the aggregation of Aβ into oligomers with high spatial and temporal resolution. However, the results of an MD simulation can only be as good as the underlying force field. A recent study by our group showed that none of the common force fields can distinguish between aggregation-prone and nonaggregating peptide sequences, producing similar and in most cases too fast aggregation kinetics for all peptides. Since then, new force fields specially designed for intrinsically disordered proteins such as Aβ were developed. Here, we assess the applicability of these new force fields to studying peptide aggregation using the Aβ16-22 peptide and mutations of it as a test case. We investigate their performance in modeling the monomeric state, the aggregation into oligomers, and the stability of the aggregation end product, i.e., the fibrillar state. A main finding is that changing the force field has a stronger effect on the simulated aggregation pathway than changing the peptide sequence. Also, the new force fields are not able to reproduce the experimental aggregation propensity order of the peptides. Dissecting the various energy contributions shows that AMBER99SB-disp overestimates the interactions between the peptides and water, thereby inhibiting peptide aggregation. More promising results are obtained with CHARMM36m and especially its version with increased protein-water interactions. It is thus recommended to use this force field for peptide aggregation simulations and base future reparameterizations on it.