
Showing papers in "Journal of Cheminformatics in 2014"


Journal ArticleDOI
TL;DR: The particular strengths of TCMSP are the inclusion of a large number of herbal entries and the ability to identify drug-target and drug-disease networks, which will help reveal the mechanisms of action of Chinese herbs, uncover the nature of TCM theory and develop new herb-oriented drugs.
Abstract: Modern medicine often clashes with traditional medicine such as Chinese herbal medicine because the underlying mechanisms of action of the herbs are poorly understood. To promote the integration of both sides and to accelerate drug discovery from herbal medicines, an efficient systems pharmacology platform that converges information on pharmacochemistry, ADME properties, drug-likeness, drug targets, associated diseases and interaction networks is urgently needed. The traditional Chinese medicine systems pharmacology database and analysis platform (TCMSP) was built on the framework of systems pharmacology for herbal medicines. It covers all 499 Chinese herbs registered in the Chinese pharmacopoeia, with 29,384 ingredients, 3,311 targets and 837 associated diseases. Twelve important ADME-related properties, such as human oral bioavailability, half-life, drug-likeness, Caco-2 permeability, blood-brain barrier penetration and Lipinski’s rule of five, are provided for drug screening and evaluation. TCMSP also provides the drug targets and diseases of each active compound and can automatically build the compound-target and target-disease networks that let users view and analyze drug action mechanisms. It is designed to fuel the development of herbal medicines and to promote the integration of modern and traditional medicine for drug discovery and development. The particular strengths of TCMSP are the inclusion of a large number of herbal entries and the ability to identify drug-target and drug-disease networks, which will help reveal the mechanisms of action of Chinese herbs, uncover the nature of TCM theory and develop new herb-oriented drugs. TCMSP is freely available at http://sm.nwsuaf.edu.cn/lsp/tcmsp.php .

2,451 citations


Journal ArticleDOI
TL;DR: An algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and a repeated nested cross-validation algorithm for model assessment, are described and evaluated.
Abstract: We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
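
To make the procedure concrete, the following is a minimal sketch of repeated nested cross-validation using scikit-learn; it illustrates the general scheme only, not the authors' cloud-based implementation, and the SVR estimator and parameter grid are arbitrary placeholders.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVR

    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # hypothetical grid

    def repeated_nested_cv(X, y, n_repeats=10, v=5):
        """Repeat V-fold nested cross-validation with different random splits
        and return the distribution of outer-loop prediction errors."""
        scores = []
        for rep in range(n_repeats):
            inner = KFold(n_splits=v, shuffle=True, random_state=rep)         # tuning
            outer = KFold(n_splits=v, shuffle=True, random_state=1000 + rep)  # assessment
            model = GridSearchCV(SVR(), param_grid, cv=inner)
            scores.extend(cross_val_score(model, X, y, cv=outer,
                                          scoring="neg_mean_squared_error"))
        return np.array(scores)  # spread reflects split-to-split variability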

644 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a method for the placement of hydrogen coordinates in protein-ligand complexes which takes tautomers and protonation states of both protein and ligand into account.
Abstract: The calculation of hydrogen positions is a common preprocessing step when working with crystal structures of protein-ligand complexes. An explicit description of hydrogen atoms is generally needed in order to analyze the binding mode of particular ligands or to calculate the associated binding energies. Due to the large number of degrees of freedom resulting from different chemical moieties and the high degree of mutual dependence this problem is anything but trivial. In addition to an efficient algorithm to take care of the complexity resulting from complicated hydrogen bonding networks, a robust chemical model is needed to describe effects such as tautomerism and ionization consistently. We present a novel method for the placement of hydrogen coordinates in protein-ligand complexes which takes tautomers and protonation states of both protein and ligand into account. Our method generates the most probable hydrogen positions on the basis of an optimal hydrogen bonding network using an empirical scoring function. The high quality of our results was verified by comparison to the manually adjusted Astex diverse set and by a remarkably low rate of undesirable hydrogen contacts compared to other tools.

129 citations


Journal ArticleDOI
TL;DR: The cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection.
Abstract: Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging – especially under model uncertainty – and requires independent test objects. These test objects must be involved neither in model building nor in model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive possibility to generate test data and to select QSAR models since it uses the data very efficiently. Nevertheless, there is a controversy in the literature with respect to the reliability of double cross-validation under model uncertainty. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection. Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided. The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters for the inner loop of double cross-validation mainly influence bias and variance of the resulting models, the parameters for the outer loop mainly influence the variability of the resulting prediction error estimate. Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. Compared to a single test set, double cross-validation provided a more realistic picture of model quality and should be preferred over a single test set.

117 citations


Journal ArticleDOI
TL;DR: This open-source implementation of an MMFF-capable molecular mechanics engine, coupled with the rest of the RDKit functionality and covered by the BSD license, is appealing to researchers operating in both academia and industry.
Abstract: A general purpose force field such as MMFF94/MMFF94s, which can properly deal with a wide range of diverse structures, is very valuable in the context of a cheminformatics toolkit. Herein we present an open-source implementation of this force field within the RDKit. The new MMFF functionality can be accessed through a C++/C#/Python/Java application programming interface (API) developed along the lines of the one already available for UFF in the RDKit. Our implementation was fully validated against the official validation suite provided by the MMFF authors. All energies and gradients were correctly computed; moreover, atom types and force constants were correctly assigned for 3D molecules built from SMILES strings. To provide full flexibility, the API gives direct access to include/exclude individual terms from the MMFF energy expression and to carry out constrained geometry optimizations. The availability of an MMFF-capable molecular mechanics engine coupled with the rest of the RDKit functionality and covered by the BSD license is appealing to researchers operating in both academia and industry.
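
As a usage illustration (assuming a standard RDKit installation; the SMILES string is an arbitrary example), the Python API described above can be exercised roughly as follows:

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
    AllChem.EmbedMolecule(mol, randomSeed=42)        # generate a 3D conformer

    # Assign MMFF94s atom types/parameters and build the force field object.
    props = AllChem.MMFFGetMoleculeProperties(mol, mmffVariant="MMFF94s")
    ff = AllChem.MMFFGetMoleculeForceField(mol, props)
    print("initial energy:", ff.CalcEnergy())
    ff.Minimize()                                    # geometry optimization
    print("optimized energy:", ff.CalcEnergy())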

111 citations


Journal ArticleDOI
TL;DR: This review sketches out dictionary-based, rule-based, machine learning and hybrid chemical named entity recognition approaches together with their applied solutions, and closes with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
Abstract: The rapid increase in the flow rate of published digital information in all disciplines has resulted in a pressing need for techniques that can simplify the use of this information. The chemistry literature is very rich with information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature, “text mining” the extracted data and determining contextual relationships helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the texts. In this review, we briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We sketch out dictionary-based, rule-based and machine learning, as well as hybrid, chemical named entity recognition approaches with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
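
As a toy illustration of the simplest of these approaches, dictionary-based recognition amounts to matching known chemical names against the text; real systems add normalisation, rule-based and machine-learning layers on top. The dictionary and sentence below are invented.

    import re

    dictionary = {"aspirin", "acetylsalicylic acid", "caffeine", "ibuprofen"}
    text = "Co-administration of aspirin and caffeine increased plasma ibuprofen levels."

    # Longest-match-first alternation so multi-word names win over substrings.
    pattern = re.compile(r"\b(" + "|".join(sorted(map(re.escape, dictionary),
                                                  key=len, reverse=True)) + r")\b",
                         re.IGNORECASE)
    for match in pattern.finditer(text):
        print(match.group(0), match.span())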

99 citations


Journal ArticleDOI
TL;DR: The comparison of multiple combinations of binary fingerprints and similarity metrics for computing the chemical similarity in the context of two different applications of the read-across technique demonstrates that the classical similarity measurements can be improved with a generalizable model of similarity.
Abstract: Methods that provide a measure of chemical similarity are strongly relevant in several fields of chemoinformatics as they make it possible to predict the molecular behavior and fate of structurally close compounds. One common application of chemical similarity measurements, based on the principle that similar molecules have similar properties, is the read-across approach, where an estimation of a specific endpoint for a chemical is provided using experimental data available from highly similar compounds. This paper reports the comparison of multiple combinations of binary fingerprints and similarity metrics for computing the chemical similarity in the context of two different applications of the read-across technique. Our analysis demonstrates that the classical similarity measurements can be improved with a generalizable model of similarity. The proposed approach has already been used to build similarity indices in two open-source software tools (CAESAR and VEGA) that make several QSAR models available. In these tools, the similarity index plays a key role in the assessment of the applicability domain.
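
For orientation, a typical fingerprint/metric combination of the kind compared in the paper can be computed with RDKit as sketched below (Morgan fingerprints with Tanimoto similarity; the molecules are arbitrary examples and the published CAESAR/VEGA similarity index is more elaborate than this):

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    query = Chem.MolFromSmiles("c1ccccc1O")  # phenol, the chemical to be read across
    neighbours = [Chem.MolFromSmiles(s) for s in ("Cc1ccccc1", "Oc1ccccc1Cl")]

    def fp(mol):
        return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

    q_fp = fp(query)
    for mol in neighbours:
        sim = DataStructs.TanimotoSimilarity(q_fp, fp(mol))
        print(Chem.MolToSmiles(mol), round(sim, 3))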

77 citations


Journal ArticleDOI
TL;DR: The ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier.
Abstract: The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods. The impact of this rather neglected aspect of machine learning method application was examined for sets containing a fixed number of positive examples and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of the dynamics of those variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naive Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naive Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set. In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
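
The experimental design can be sketched as follows (illustrative only; X_act and X_inact stand for precomputed fingerprints of actives and of ZINC-derived inactives, and the Random Forest classifier is just one of the five algorithms studied):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef, precision_score, recall_score
    from sklearn.model_selection import cross_val_predict

    def evaluate_ratio(X_act, X_inact, n_negatives, seed=0):
        """Fixed actives plus n_negatives randomly drawn inactives,
        evaluated by cross-validated MCC, precision and recall."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X_inact), size=n_negatives, replace=False)
        X = np.vstack([X_act, X_inact[idx]])
        y = np.array([1] * len(X_act) + [0] * n_negatives)
        pred = cross_val_predict(RandomForestClassifier(n_estimators=100), X, y, cv=5)
        return {"MCC": matthews_corrcoef(y, pred),
                "precision": precision_score(y, pred),
                "recall": recall_score(y, pred)}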

61 citations


Journal ArticleDOI
TL;DR: The hereby-established quantitative assessment can rank compounds regardless of whether they pass or fail well-established pesticide-likeness rules, and offers an efficient way to prioritize (class-specific) pesticides.
Abstract: The design of chemical libraries, an early step in agrochemical discovery programs, is frequently addressed by means of qualitative physicochemical and/or topological rule-based methods. The aim of this study is to develop quantitative estimates of herbicide- (QEH), insecticide- (QEI), fungicide- (QEF), and, finally, pesticide-likeness (QEP). In the assessment of these definitions, we relied on the concept of desirability functions. We found a simple function, shared by the three classes of pesticides, parameterized for six easy-to-compute, independent and interpretable molecular properties: molecular weight, logP, number of hydrogen bond acceptors, number of hydrogen bond donors, number of rotatable bonds and number of aromatic rings. Subsequently, we describe the scoring of each pesticide class by the corresponding quantitative estimate. In a comparative study, we assessed the performance of the scoring functions using extensive datasets of patented pesticides. The hereby-established quantitative assessment can rank compounds regardless of whether they pass or fail well-established pesticide-likeness rules, and offers an efficient way to prioritize (class-specific) pesticides. These findings are valuable for the efficient estimation of pesticide-likeness of vast chemical libraries in the field of agrochemical discovery.
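
The desirability-function idea can be sketched as below; the Gaussian functional form and every parameter value are hypothetical placeholders chosen for illustration, not the published QEH/QEI/QEF/QEP parameterisation:

    import math

    def desirability(x, mu, sigma):
        """Simple bell-shaped desirability for a single molecular property."""
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

    def quantitative_estimate(props, params):
        """Geometric mean of the individual desirabilities (QED-like scheme)."""
        d = [max(desirability(props[k], *params[k]), 1e-12) for k in params]
        return math.exp(sum(map(math.log, d)) / len(d))

    # Hypothetical (mu, sigma) pairs for the six properties named in the abstract.
    params = {"MW": (300, 120), "logP": (3, 2), "HBA": (3, 2),
              "HBD": (1, 1), "RotB": (4, 3), "AromRings": (2, 1)}
    compound = {"MW": 255.7, "logP": 3.4, "HBA": 2, "HBD": 1, "RotB": 3, "AromRings": 2}
    print(round(quantitative_estimate(compound, params), 3))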

58 citations


Journal ArticleDOI
TL;DR: The promising results suggest that the proposed ligand-based target fishing approach is useful not only for finding new uses for promiscuous drugs, but also for predicting some important toxic liabilities.
Abstract: Background Ligand-based in silico target fishing can be used to identify the potential interacting targets of bioactive ligands, which is useful for understanding the polypharmacology and safety profile of existing drugs. The underlying principle of the approach is that known bioactive ligands can be used as a reference to predict the targets for a new compound.

53 citations


Journal ArticleDOI
TL;DR: InCHlib (Interactive Cluster Heatmap Library), a highly interactive and lightweight JavaScript library for cluster heatmap visualization and exploration, is developed; it is a versatile tool whose application domain is not limited to the life sciences.
Abstract: Hierarchical clustering is an exploratory data analysis method that reveals the groups (clusters) of similar objects. The result of the hierarchical clustering is a tree structure called a dendrogram that shows the arrangement of individual clusters. To investigate the row/column hierarchical cluster structure of a data matrix, a visualization tool called a ‘cluster heatmap’ is commonly employed. In the cluster heatmap, the data matrix is displayed as a heatmap, a 2-dimensional array in which the colour of each element corresponds to its value. The rows/columns of the matrix are ordered such that similar rows/columns are near each other. The ordering is given by the dendrogram, which is displayed on the side of the heatmap. We developed InCHlib (Interactive Cluster Heatmap Library), a highly interactive and lightweight JavaScript library for cluster heatmap visualization and exploration. InCHlib enables the user to select individual or clustered heatmap rows, to zoom in and out of clusters or to flexibly modify heatmap appearance. The cluster heatmap can be augmented with additional metadata displayed in a different colour scale. In addition, to further enhance the visualization, the cluster heatmap can be interconnected with external data sources or analysis tools. Data clustering and the preparation of the input file for InCHlib are facilitated by the Python utility script inchlib_clust. The cluster heatmap is one of the most popular visualizations of large chemical and biomedical data sets originating, e.g., in high-throughput screening, genomics or transcriptomics experiments. The presented JavaScript library InCHlib is a client-side solution for cluster heatmap exploration. InCHlib can be easily deployed into any modern web application and configured to cooperate with external tools and data sources. Though InCHlib is primarily intended for the analysis of chemical or biological data, it is a versatile tool whose application domain is not limited to the life sciences.
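
The clustering step that inchlib_clust automates can be pictured with a few lines of SciPy (concept only; this is not the InCHlib or inchlib_clust API):

    import numpy as np
    from scipy.cluster.hierarchy import leaves_list, linkage
    from scipy.spatial.distance import pdist

    data = np.random.rand(20, 8)  # toy data matrix: 20 objects, 8 features

    # Hierarchical clustering of rows and columns gives the two dendrograms.
    row_link = linkage(pdist(data), method="ward")
    col_link = linkage(pdist(data.T), method="ward")

    # Reorder the matrix so that similar rows/columns are adjacent, which is
    # exactly the ordering a cluster heatmap displays next to its dendrograms.
    ordered = data[leaves_list(row_link), :][:, leaves_list(col_link)]
    print(ordered.shape)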

Journal ArticleDOI
TL;DR: GP models trained on these PCM datasets are statistically sound, at the same level of statistical significance as Support Vector Machines (SVM), with R₀² values on the external dataset ranging from 0.68 to 0.92, and RMSEP values close to the experimental error.
Abstract: Proteochemometrics (PCM) is an approach for bioactivity predictive modeling which models the relationship between protein and chemical information. Gaussian Processes (GP), based on Bayesian inference, provide the most objective estimation of the uncertainty of the predictions, thus permitting the evaluation of the applicability domain (AD) of the model. Furthermore, the experimental error on bioactivity measurements can be used as input for this probabilistic model.
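
Conceptually, the uncertainty estimate that defines the applicability domain comes straight out of the GP posterior; a minimal scikit-learn sketch (with random placeholder descriptors and activities rather than real PCM data) looks like this:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    X = np.random.rand(50, 10)   # placeholder protein + ligand descriptors
    y = np.random.rand(50)       # placeholder bioactivities (e.g. pIC50-like values)

    # The WhiteKernel term lets the model account for experimental noise.
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X, y)

    mean, std = gp.predict(X[:5], return_std=True)
    print(mean, std)             # large std -> prediction outside the AD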

Journal ArticleDOI
TL;DR: The study shows that such an approach, referred to as prediction-driven MMP analysis, is a useful tool for medicinal chemists, allowing identification of large numbers of “interesting” transformations that can be used to drive the molecular optimization process.
Abstract: QSAR is an established and powerful method for cheap in silico assessment of physicochemical properties and biological activities of chemical compounds. However, QSAR models are rather complex mathematical constructs that cannot easily be interpreted. Medicinal chemists would benefit from practical guidance regarding which molecules to synthesize. Another possible approach is analysis of pairs of very similar molecules, so-called matched molecular pairs (MMPs). Such an approach allows identification of molecular transformations that affect particular activities (e.g. toxicity). In contrast to QSAR, chemical interpretation of these transformations is straightforward. Furthermore, such transformations can give medicinal chemists useful hints for the hit-to-lead optimization process. The current study suggests a combination of QSAR and MMP approaches by finding MMP transformations based on QSAR predictions for large chemical datasets. The study shows that such an approach, referred to as prediction-driven MMP analysis, is a useful tool for medicinal chemists, allowing identification of large numbers of “interesting” transformations that can be used to drive the molecular optimization process. All the methodological developments have been implemented as software products available online as part of OCHEM ( http://ochem.eu/ ). The prediction-driven MMPs methodology was exemplified by two use cases: modelling of aquatic toxicity and CYP3A4 inhibition. This approach helped us to interpret QSAR models and allowed identification of a number of “significant” molecular transformations that affect the desired properties. This can facilitate drug design as a part of the molecular optimization process.

Journal ArticleDOI
TL;DR: A new pharmacophore-based docking program PharmDock is presented that combines pose sampling and ranking based on optimized protein-based pharmacophore models with local optimization using an empirical scoring function.
Abstract: Protein-based pharmacophore models are enriched with the information of potential interactions between ligands and the protein target. We have shown in a previous study that protein-based pharmacophore models can be applied for ligand pose prediction and pose ranking. In this publication, we present a new pharmacophore-based docking program PharmDock that combines pose sampling and ranking based on optimized protein-based pharmacophore models with local optimization using an empirical scoring function. Tests of PharmDock on ligand pose prediction, binding affinity estimation, compound ranking and virtual screening yielded comparable or better performance to existing and widely used docking programs. The docking program comes with an easy-to-use GUI within PyMOL. Two features have been incorporated in the program suite that allow for user-defined guidance of the docking process based on previous experimental data. Docking with those features demonstrated superior performance compared to unbiased docking. A protein pharmacophore-based docking program, PharmDock, has been made available with a PyMOL plugin. PharmDock and the PyMOL plugin are freely available from http://people.pharmacy.purdue.edu/~mlill/software/pharmdock .

Journal ArticleDOI
TL;DR: A computational framework to build disease-drug networks using drug- and disease-specific subnetworks is developed; it incorporates protein networks to refine drug- and disease-associated genes and to prioritize genes in disease- and drug-specific networks.
Abstract: Background With the rapid development of high-throughput genomic technologies and the accumulation of genome-wide datasets for gene expression profiling and biological networks, the impact of diseases and drugs on gene expression can be comprehensively characterized. Drug repositioning offers the possibility of reduced risks in the drug discovery process, thus it is an essential step in drug development.

Journal ArticleDOI
TL;DR: A formal specification of the quantitative structure-activity relationship data organization and archival format, the QSAR DataBank (QsarDB for shorter, or QDB for shortest), is presented; its utility and benefits have been thoroughly tested by solving everyday QSAR and predictive modeling problems.
Abstract: Research efforts in the field of descriptive and predictive Quantitative Structure-Activity Relationships or Quantitative Structure–Property Relationships produce around one thousand scientific publications annually. All the materials and results are mainly communicated using printed media, which in its present form has obvious limitations when it comes to effectively representing mathematical models, including complex and non-linear ones, and large bodies of associated numerical chemical data. It does not support secondary information extraction or reuse efforts, while in silico studies pose additional requirements for accessibility, transparency and reproducibility of the research. This gap can and should be bridged by introducing domain-specific digital data exchange standards and tools. The current publication presents a formal specification of the quantitative structure-activity relationship data organization and archival format called the QSAR DataBank (QsarDB for shorter, or QDB for shortest). The article describes the QsarDB data schema, which formalizes QSAR concepts (objects and relationships between them), and the QsarDB data format, which formalizes their presentation for computer systems. The utility and benefits of QsarDB have been thoroughly tested by solving everyday QSAR and predictive modeling problems, with examples in the field of predictive toxicology, and can be applied to a wide variety of other endpoints. The work is accompanied by an open source reference implementation and tools. The proposed open data, open source, and open standards design is open to public and proprietary extensions on many levels. Selected use cases exemplify the benefits of the proposed QsarDB data format. General ideas for future development are discussed.

Journal ArticleDOI
TL;DR: Cα torsion angles built by four consecutive Cα atoms are highly valuable as a similarity measure on a substructure scale and for finding major “events” – combinations of the time at which a transition occurs and the local structural changes involved – occurring in the course of an MD simulation.
Abstract: Molecular dynamics (MD) simulation, a standard technique for studying the dynamical properties of biomolecules, produces trajectories – series of snapshots of the coordinates of the system – that grow with system size and simulation time. These MD-generated trajectories are huge (many gigabytes) and their analysis may take much longer than the data generation. Managing the large amount of data and presenting it in a flexible and comprehensible manner are the major challenges, and analyzing the trajectories with a standard parameter such as the root-mean-square deviation (RMSD) may not reveal the most interesting properties of the dynamics. To overcome these challenges, Cα torsion angles [1] – torsion angles built by four consecutive Cα atoms – are highly valuable as a similarity measure on a substructure scale. The information on the time at which a transition occurs (temporal domain) combined with the local structural changes involved (spatial domain) is called an “event” occurring in the course of the MD simulation. By calculating the time series of the Cα torsion angles and clustering them, it is possible to determine mechanistic details on a residue length scale and to find major events in simulations of large proteins or protein complexes. The main advantages of the Cα torsion angle criterion are that it does not depend on a previous alignment of the structures and that the direction of the change is also defined. Heat maps of Cα torsion angles give clear graphical representations of the processes described by the MD simulations. Clustering of snapshots according to specific Cα torsion angles is used to automatically find the spatial domains of the structural changes: residues whose snapshots are all assigned to a single cluster are considered part of the rigid core, and the remaining residues are considered flexible. The temporal domain can be characterized in more detail by identifying continuous time intervals assigned to a single cluster as (meta)stable structures and time intervals where the assignment jumps between two clusters as transitional periods. Since outliers can be removed from the fuzzy clusters, the starts and ends of time patches now qualify as important events for the underlying substructure, and structural changes of larger regions are caused by an accumulation of such substructure events. DNA polymerase I – the open ternary complex of the large fragment of Thermus aquaticus DNA polymerase I (Klentaq1), used here as a practical application of Cα torsion angle based analysis – shows a hand-like arrangement including a thumb, a palm and a finger domain [2]. The catalytic cycle leading to nucleotide insertion comprises several steps, including a large structural rearrangement in the form of a movement of the finger domain towards the thumb domain, i.e. the transition from the open to the closed form. Molecular dynamics simulations were performed using the AMBER 10 suite of programs [3]. To get a visual picture of the ongoing processes, the Cα torsion angles, expressed as differences to the crystal structure of the open form, were plotted as a heat map. The rigid and flexible parts, showing no or many significant changes respectively, were clearly visible in the heat map. Once the Cα torsion angles corresponding to the rigid parts are removed, the remaining regions change only in a specific time interval of the simulation. The spatial and temporal domains of the structural changes were identified automatically by clustering the snapshots (using KNIME [4]) and by finding the continuous time intervals, respectively.
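
The elementary quantity behind this analysis is the torsion (dihedral) angle defined by four consecutive Cα atoms; a plain NumPy sketch of its computation from Cartesian coordinates (toy coordinates, standard dihedral formula) is given below:

    import numpy as np

    def ca_torsion(p1, p2, p3, p4):
        """Torsion angle in degrees for four consecutive C-alpha positions."""
        b1, b2, b3 = p2 - p1, p3 - p2, p4 - p3
        n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
        m1 = np.cross(n1, b2 / np.linalg.norm(b2))
        return np.degrees(np.arctan2(np.dot(m1, n2), np.dot(n1, n2)))

    # Toy coordinates (angstroms) of four consecutive C-alpha atoms.
    p = [np.array(v, dtype=float) for v in
         [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.3, 1.3, 0.0), (3.8, 1.3, 1.0)]]
    print(round(ca_torsion(*p), 1))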

Journal ArticleDOI
TL;DR: Molpher is an open-source software framework for the design of virtual chemical libraries focused on a particular mechanistic class of compounds that produces a path of structurally-related compounds through a process the authors term ‘molecular morphing’.
Abstract: Chemical space is virtual space occupied by all chemically meaningful organic compounds. It is an important concept in contemporary chemoinformatics research, and its systematic exploration is vital to the discovery of either novel drugs or new tools for chemical biology. In this paper, we describe Molpher, an open-source framework for the systematic exploration of chemical space. Through a process we term ‘molecular morphing’, Molpher produces a path of structurally-related compounds. This path is generated by the iterative application of so-called ‘morphing operators’ that represent simple structural changes, such as the addition or removal of an atom or a bond. Molpher incorporates an optimized parallel exploration algorithm, compound logging and a two-dimensional visualization of the exploration process. Its feature set can be easily extended by implementing additional morphing operators, chemical fingerprints, similarity measures and visualization methods. Molpher not only offers an intuitive graphical user interface, but also can be run in batch mode. This enables users to easily incorporate molecular morphing into their existing drug discovery pipelines. Molpher is an open-source software framework for the design of virtual chemical libraries focused on a particular mechanistic class of compounds. These libraries, represented by a morphing path and its surroundings, provide valuable starting data for future in silico and in vitro experiments. Molpher is highly extensible and can be easily incorporated into any existing computational drug design pipeline.

Journal ArticleDOI
TL;DR: This paper presents BioPhytMol, a drug discovery community resource of anti-mycobacterial phytomolecules and plant extracts generated using crowdsourcing, designed to systematically represent and search anti-mycobacterial phytochemical information.
Abstract: Tuberculosis (TB) is the second leading cause of death from a single infectious organism, demanding attention towards the discovery of novel anti-tubercular compounds. Natural products or their derivatives have provided more than 50% of all existing drugs, offering a chemically diverse space for the discovery of novel drugs. BioPhytMol has been designed to systematically curate and analyze the anti-mycobacterial natural product chemical space. BioPhytMol is developed as a drug-discovery community resource with anti-mycobacterial phytomolecules and plant extracts. Currently, it holds 2582 entries including 188 plant families (692 genera and 808 species) from global flora, manually curated from literature. In total, there are 633 phytomolecules (with structures) curated against 25 target mycobacteria. Multiple analysis approaches have been used to prioritize the library for drug-like compounds, for both whole cell screening and target-based approaches. In order to represent the multidimensional data on chemical diversity, physicochemical properties and biological activity of the compound library, novel approaches such as the use of circular graphs have been employed. BioPhytMol has been designed to systematically represent and search anti-mycobacterial phytochemical information. Extensive compound analyses can also be performed through the web application to prioritize drug-like compounds. The resource is freely available online at http://ab-openlab.csir.res.in/biophytmol/ .

Journal ArticleDOI
TL;DR: A consensus model based on the predicted values of individual LLL models of LD50 is developed, yielding a correlation coefficient R² of 0.712 on a test set containing 2,896 compounds.
Abstract: Background Acute toxicity refers to the ability of a substance to cause adverse effects within a short period following dosing or exposure; its assessment is usually the first step in the toxicological investigation of unknown substances. The median lethal dose, LD50, is frequently used as a general indicator of a substance’s acute toxicity, and there is high demand for developing non-animal-based predictions of LD50. Unfortunately, it is difficult to accurately predict compound LD50 using a single QSAR model, because the acute toxicity may involve complex mechanisms and multiple biochemical processes.

Journal ArticleDOI
TL;DR: iDrug is easy to use and provides a novel, fast and reliable tool for conducting drug design experiments: various molecular design processing tasks can be submitted and visualized in a single browser without locally installing any standalone modeling software.
Abstract: The progress in computer-aided drug design (CADD) approaches over the past decades has accelerated early-stage pharmaceutical research. Many powerful standalone tools for CADD have been developed in academia. As programs are developed by various research groups, a consistent, user-friendly online graphical working environment combining computational techniques such as pharmacophore mapping, similarity calculation, scoring, and target identification is needed. We present a versatile, user-friendly, and efficient online tool for computer-aided drug design based on pharmacophore and 3D molecular similarity searching. The web interface enables binding site detection, virtual screening hit identification, and drug target prediction in an interactive manner through a seamless interface to all adapted packages (e.g., Cavity, PocketV.2, PharmMapper, SHAFTS). Several commercially available compound databases for hit identification and a well-annotated pharmacophore database for drug target prediction were integrated into iDrug as well. The web interface provides tools for real-time molecular building/editing, converting, displaying, and analyzing. All the customized configurations of the functional modules can be accessed through the featured session files provided, which can be saved to the local disk and uploaded to resume or update previous work. iDrug is easy to use and provides a novel, fast and reliable tool for conducting drug design experiments. By using iDrug, various molecular design processing tasks can be submitted and visualized in a single browser without locally installing any standalone modeling software. iDrug is accessible free of charge at http://lilab.ecust.edu.cn/idrug .

Journal ArticleDOI
TL;DR: This methodology allows for greater utilisation of the predictions made by black box models and can expedite further study based on the output of a (quantitative) structure-activity model.
Abstract: A new algorithm has been developed to enable the interpretation of black box models. The developed algorithm is agnostic to the learning algorithm and open to all structure-based descriptors such as fragments, keys and hashed fingerprints. The algorithm has provided meaningful interpretation of Ames mutagenicity predictions from both random forest and support vector machine models built on a variety of structural fingerprints. A fragmentation algorithm is utilised to investigate the model’s behaviour on specific substructures present in the query. An output is formulated summarising causes of activation and deactivation. The algorithm is able to identify multiple causes of activation or deactivation, in addition to identifying localised deactivations where the prediction for the query is active overall. No loss in performance is seen as there is no change in the prediction; the interpretation is produced directly from the model’s behaviour for the specific query. Models have been built using multiple learning algorithms including support vector machine and random forest. The models were built on public Ames mutagenicity data and a variety of fingerprint descriptors were used. These models produced good performance in both internal and external validation, with accuracies around 82%. The models were used to evaluate the interpretation algorithm. The interpretation was found to link closely with understood mechanisms for Ames mutagenicity. This methodology allows for greater utilisation of the predictions made by black box models and can expedite further study based on the output of a (quantitative) structure-activity model. Additionally, the algorithm could be utilised for chemical dataset investigation and knowledge extraction/human SAR development.

Journal ArticleDOI
TL;DR: This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems along with an illustrative application to the prediction of mutagenicity.
Abstract: Combining different sources of knowledge to build improved structure-activity relationship models is not easy owing to the variety of knowledge formats and the absence of a common framework for interoperating between learning techniques. Most current approaches address this problem by using consensus models that operate at the prediction level. We explore the possibility of directly combining these sources at the knowledge level, with the aim of harvesting potentially increased synergy at an earlier stage. Our goal is to design a general methodology to facilitate knowledge discovery and produce accurate and interpretable models. To combine models at the knowledge level, we propose to decouple the learning phase from the knowledge application phase using a pivot representation (lingua franca) based on the concept of hypothesis. A hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a collection of hypotheses. These hypotheses are subsequently organised into a hierarchical network. This unification permits different sources of knowledge to be combined into a common formalised framework. The approach allows us to create a synergistic system between different forms of knowledge, and new algorithms can be applied to leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems, along with an illustrative application to the prediction of mutagenicity. It is possible to represent knowledge in the unified form of a hypothesis network, allowing interpretable predictions with performance comparable to mainstream machine learning techniques. This new approach offers the potential to combine knowledge from different sources into a common framework in which high level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work.

Journal ArticleDOI
TL;DR: MQN-similarity is shown to efficiently recover 15 different fragrance molecule families from the different FL subsets, demonstrating the relevance of the MQN-based tool to explore the fragrance chemical space.
Abstract: The properties of fragrance molecules in the public databases SuperScent and Flavornet were analyzed to define a “fragrance-like” (FL) property range (Heavy Atom Count ≤ 21, only C, H, O, S, (O + S) ≤ 3, Hydrogen Bond Donor ≤ 1) and the corresponding chemical space including FL molecules from PubChem (NIH repository of molecules), ChEMBL (bioactive molecules), ZINC (drug-like molecules), and GDB-13 (all possible organic molecules up to 13 atoms of C, N, O, S, Cl). The FL subsets of these databases were classified by MQN (Molecular Quantum Numbers, a set of 42 integer value descriptors of molecular structure) and formatted for fast MQN-similarity searching and interactive exploration of color-coded principal component maps in the form of the FL-mapplet and FL-browser applications freely available at http://www.gdb.unibe.ch . MQN-similarity is shown to efficiently recover 15 different fragrance molecule families from the different FL subsets, demonstrating the relevance of the MQN-based tool for exploring the fragrance chemical space.
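
The FL property range quoted above translates directly into a small property filter; a sketch with RDKit (the example SMILES are arbitrary) could look like this:

    from rdkit import Chem
    from rdkit.Chem import Lipinski

    def is_fragrance_like(smiles):
        """FL filter: HAC <= 21; only C, H, O, S; (O + S) <= 3; HBD <= 1."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None or mol.GetNumHeavyAtoms() > 21:
            return False
        symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
        if any(s not in ("C", "H", "O", "S") for s in symbols):
            return False
        if symbols.count("O") + symbols.count("S") > 3:
            return False
        return Lipinski.NumHDonors(mol) <= 1

    print(is_fragrance_like("CC(=O)OCC(C)C"))  # isobutyl acetate -> True
    print(is_fragrance_like("c1ccncc1"))       # pyridine (nitrogen) -> False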

Journal ArticleDOI
TL;DR: 2D topological fingerprints calculated to a bond depth of 4-6 contain sufficient information to allow the identification of SoMs using classifiers based on relatively small data sets; the machine learning methods outlined are conceptually simpler and more efficient than the other methods tested and give results competitive with approaches using more expensive quantum chemical descriptors.
Abstract: The prediction of sites and products of metabolism in xenobiotic compounds is key to the development of new chemical entities, where screening potential metabolites for toxicity or unwanted side-effects is of crucial importance. In this work 2D topological fingerprints are used to encode atomic sites and three probabilistic machine learning methods are applied: Parzen-Rosenblatt Window (PRW), Naive Bayesian (NB) and a novel approach called RASCAL (Random Attribute Subsampling Classification ALgorithm). These are implemented by randomly subsampling descriptor space to alleviate the problem often suffered by data mining methods of having to exactly match fingerprints, and in the case of PRW by measuring a distance between feature vectors rather than requiring exact matching. The classifiers have been implemented in CUDA/C++ to exploit the parallel architecture of graphical processing units (GPUs) and are freely available in a public repository. It is shown that for PRW a SoM (Site of Metabolism) is identified in the top two predictions for 85%, 91% and 88% of the CYP 3A4, 2D6 and 2C9 data sets respectively, with RASCAL giving similar performance of 83%, 91% and 88%, respectively. These results put PRW and RASCAL performance ahead of NB, which gave a much lower classification performance of 51%, 73% and 74%, respectively. 2D topological fingerprints calculated to a bond depth of 4-6 contain sufficient information to allow the identification of SoMs using classifiers based on relatively small data sets. Thus, the machine learning methods outlined in this paper are conceptually simpler and more efficient than other methods tested, and the use of simple topological descriptors derived from 2D structure gives results competitive with other approaches using more expensive quantum chemical descriptors. The descriptor space subsampling approach and ensemble methodology allow the methods to be applied to molecules more distant from the training data, where data mining would be more likely to fail due to the lack of common fingerprints. The RASCAL algorithm is shown to give equivalent classification performance to PRW but at lower computational expense, allowing it to be applied more efficiently in the ensemble scheme.

Journal ArticleDOI
TL;DR: TB Mobile can now manage a small collection of compounds that can be imported from external sources, or exported by various means such as email or app-to-app inter-process communication, meaning that TB Mobile can be used as a node within a growing ecosystem of mobile apps for cheminformatics.
Abstract: We recently developed a freely available mobile app (TB Mobile) for both iOS and Android platforms that displays Mycobacterium tuberculosis (Mtb) active molecule structures and their targets with links to associated data. The app was developed to make target information available to as large an audience as possible. We now report a major update of the iOS version of the app. This includes enhancements that use an implementation of ECFP_6 fingerprints that we have made open source. Using these fingerprints, the user can propose compounds with possible anti-TB activity, and view the compounds within a cluster landscape. Proposed compounds can also be compared to existing target data, using a naive Bayesian scoring system to rank probable targets. We have curated an additional 60 new compounds and their targets for Mtb and added these to the original set of 745 compounds. We have also curated 20 further compounds (many without targets in TB Mobile) to evaluate this version of the app with 805 compounds and associated targets. TB Mobile can now manage a small collection of compounds that can be imported from external sources, or exported by various means such as email or app-to-app inter-process communication. This means that TB Mobile can be used as a node within a growing ecosystem of mobile apps for cheminformatics. It can also cluster compounds and use internal algorithms to help identify potential targets based on molecular similarity. TB Mobile represents a valuable dataset, data-visualization aid and target prediction tool.

Journal ArticleDOI
TL;DR: 2D similarity calculations can be used to more efficiently decide which drugs and metabolites should be tested in cross-reactivity studies, as well as to design experiments and potentially predict antigens that would lead to immunoassays with cross-reactivity for a broader array of designer drugs.
Abstract: A challenge for drug of abuse testing is presented by ‘designer drugs’, compounds typically discovered by modifications of existing clinical drug classes such as amphetamines and cannabinoids. Drug of abuse screening immunoassays directed at amphetamine or methamphetamine only detect a small subset of designer amphetamine-like drugs, and those immunoassays designed for tetrahydrocannabinol metabolites generally do not cross-react with synthetic cannabinoids lacking the classic cannabinoid chemical backbone. This suggests complexity in understanding how to detect and identify whether a patient has taken a molecule of one class or another, impacting clinical care. Cross-reactivity data from immunoassays specifically targeting designer amphetamine-like and synthetic cannabinoid drugs were collected from multiple published sources, and virtual chemical libraries for molecular similarity analysis were built. The virtual library for synthetic cannabinoid analysis contained a total of 169 structures, while the virtual library for amphetamine-type stimulants contained 288 compounds. Two-dimensional (2D) similarity for each test compound was compared to the target molecule of the immunoassay undergoing analysis. 2D similarity differentiated between cross-reactive and non-cross-reactive compounds for immunoassays targeting mephedrone/methcathinone, 3,4-methylenedioxypyrovalerone, benzylpiperazine, mephentermine, and synthetic cannabinoids. In this study, we applied 2D molecular similarity analysis to designer amphetamine-type stimulants and synthetic cannabinoids. Similarity calculations can be used to more efficiently decide which drugs and metabolites should be tested in cross-reactivity studies, as well as to design experiments and potentially predict antigens that would lead to immunoassays with cross-reactivity for a broader array of designer drugs.

Journal ArticleDOI
TL;DR: This work describes how the layered structural representation of the Standard InChI is exploited to create new functionality within UniChem that integrates these related molecular forms.
Abstract: UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’ allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, but none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
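
The underlying idea can be illustrated by comparing only the connectivity-related sub-layers of two Standard InChIs; the sketch below is a simplification of the notion of connectivity-level equivalence, not UniChem's implementation, and treats the formula plus the /c and /h sub-layers as the match key so that stereoisomers collapse onto the same key:

    def connectivity_key(std_inchi):
        """Formula + /c + /h sub-layers of a Standard InChI."""
        body = std_inchi.split("=", 1)[1]       # drop the "InChI" prefix
        layers = body.split("/")                # ["1S", formula, "c...", "h...", ...]
        keep = [layers[1]] + [l for l in layers[2:] if l[0] in ("c", "h")]
        return "/".join(keep)

    # L- and D-alanine differ only in the stereo (/t, /m) layers.
    l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
    print(connectivity_key(l_ala) == connectivity_key(d_ala))  # True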

Journal ArticleDOI
TL;DR: The experimental HTS validation of a novel VLS approach, FINDSITEcomb, across a diverse set of medically-relevant proteins is demonstrated, showing that FINDSITEcomb is a promising new VLS approach that can assist drug discovery.
Abstract: Identification of ligand-protein binding interactions is a critical step in drug discovery. Experimental screening of large chemical libraries, in spite of its specific role and importance in drug discovery, suffers from the disadvantages of being random, time-consuming and expensive. To accelerate the process, traditional structure- or ligand-based VLS approaches are combined with experimental high-throughput screening, HTS. Often a single protein or, at most, a protein family is considered. Large scale VLS benchmarking across diverse protein families is rarely done, and the reported success rate is very low. Here, we demonstrate the experimental HTS validation of a novel VLS approach, FINDSITEcomb, across a diverse set of medically-relevant proteins. For eight different proteins belonging to different fold-classes and from diverse organisms, the top 1% of FINDSITEcomb’s VLS predictions were tested, and depending on the protein target, 4%-47% of the predicted ligands were shown to bind with μM or better affinities. In total, 47 small molecule binders were identified. Low nanomolar (nM) binders for dihydrofolate reductase and protein tyrosine phosphatases (PTPs) and micromolar binders for the other proteins were identified. Six novel molecules had cytotoxic activity (<10 μg/ml) against the HCT-116 colon carcinoma cell line and one novel molecule had potent antibacterial activity. We show that FINDSITEcomb is a promising new VLS approach that can assist drug discovery.

Journal ArticleDOI
TL;DR: 2D fingerprints are used to show the extent of the relationship between computed levels of structural similarity for pairs of molecules and expert judgments of the similarities of those pairs; such similarity measures can provide a useful source of information for the assessment of orphan drug status by regulatory authorities.
Abstract: In the European Union, medicines are authorised for some rare disease only if they are judged to be dissimilar to authorised orphan drugs for that disease. This paper describes the use of 2D fingerprints to show the extent of the relationship between computed levels of structural similarity for pairs of molecules and expert judgments of the similarities of those pairs. The resulting relationship can be used to provide input to the assessment of new active compounds for which orphan drug authorisation is being sought. 143 experts provided judgments of the similarity or dissimilarity of 100 pairs of drug-like molecules from the DrugBank 3.0 database. The similarities of these pairs were also computed using BCI, Daylight, ECFC4, ECFP4, MDL and Unity 2D fingerprints. Logistic regression analyses demonstrated a strong relationship between the human and computed similarity assessments, with the resulting regression models having significant predictive power in experiments using data from submissions of orphan drug medicines to the European Medicines Agency. The BCI fingerprints performed best overall on the DrugBank dataset while the BCI, Daylight, ECFP4 and Unity fingerprints performed comparably on the European Medicines Agency dataset. Measures of structural similarity based on 2D fingerprints can provide a useful source of information for the assessment of orphan drug status by regulatory authorities.
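
The statistical core of the analysis is a logistic regression of a binary expert judgment on a computed similarity value; the sketch below uses invented toy numbers purely to show the shape of the calculation, not the study's DrugBank or EMA data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    tanimoto = np.array([[0.12], [0.25], [0.41], [0.55], [0.63], [0.78], [0.88], [0.95]])
    expert_similar = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # toy expert votes

    model = LogisticRegression().fit(tanimoto, expert_similar)
    # Probability that experts would judge a new pair "similar" at Tanimoto 0.7.
    print(model.predict_proba([[0.7]])[0, 1])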