
Showing papers in "Journal of Cheminformatics in 2014"


Journal ArticleDOI
TL;DR: The particular strengths of TCMSP are the inclusion of a large number of herbal entries and the ability to identify drug-target and drug-disease networks, which will help reveal the mechanisms of action of Chinese herbs, uncover the nature of TCM theory and develop new herb-oriented drugs.
Abstract: Modern medicine often clashes with traditional medicine such as Chinese herbal medicine because the underlying mechanisms of action of the herbs are poorly understood. To promote the integration of both sides and to accelerate drug discovery from herbal medicines, an efficient systems pharmacology platform that converges information on pharmacochemistry, ADME properties, drug-likeness, drug targets, associated diseases and interaction networks is urgently needed. The traditional Chinese medicine systems pharmacology database and analysis platform (TCMSP) was built on the framework of systems pharmacology for herbal medicines. It covers all 499 Chinese herbs registered in the Chinese pharmacopoeia, with 29,384 ingredients, 3,311 targets and 837 associated diseases. Twelve important ADME-related properties, such as human oral bioavailability, half-life, drug-likeness, Caco-2 permeability, blood-brain barrier penetration and Lipinski’s rule of five, are provided for drug screening and evaluation. TCMSP also provides the drug targets and diseases of each active compound and can automatically build the compound-target and target-disease networks that let users view and analyze drug action mechanisms. It is designed to fuel the development of herbal medicines and to promote the integration of modern and traditional medicine for drug discovery and development. The particular strengths of TCMSP are the inclusion of a large number of herbal entries and the ability to identify drug-target and drug-disease networks, which will help reveal the mechanisms of action of Chinese herbs, uncover the nature of TCM theory and develop new herb-oriented drugs. TCMSP is freely available at http://sm.nwsuaf.edu.cn/lsp/tcmsp.php .

2,451 citations


Journal ArticleDOI
TL;DR: An algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and a repeated nested cross-validation algorithm for model assessment, are described and evaluated.
Abstract: We address the problem of selecting and assessing classification and regression models using cross-validation. Current state-of-the-art methods can yield models with high variance, rendering them unsuitable for a number of practical applications including QSAR. In this paper we describe and evaluate best practices which improve reliability and increase confidence in selected models. A key operational component of the proposed methods is cloud computing which enables routine use of previously infeasible approaches. We describe in detail an algorithm for repeated grid-search V-fold cross-validation for parameter tuning in classification and regression, and we define a repeated nested cross-validation algorithm for model assessment. As regards variable selection and parameter tuning we define two algorithms (repeated grid-search cross-validation and double cross-validation), and provide arguments for using the repeated grid-search in the general case. We show results of our algorithms on seven QSAR datasets. The variation of the prediction performance, which is the result of choosing different splits of the dataset in V-fold cross-validation, needs to be taken into account when selecting and assessing classification and regression models. We demonstrate the importance of repeating cross-validation when selecting an optimal model, as well as the importance of repeating nested cross-validation when assessing a prediction error.
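
To make the procedure concrete, the following is a minimal sketch of repeated nested cross-validation using scikit-learn; it illustrates the general scheme only, not the authors' cloud-based implementation, and the SVR estimator and parameter grid are arbitrary placeholders.

    import numpy as np
    from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
    from sklearn.svm import SVR

    param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}  # hypothetical grid

    def repeated_nested_cv(X, y, n_repeats=10, v=5):
        """Repeat V-fold nested cross-validation with different random splits
        and return the distribution of outer-loop prediction errors."""
        scores = []
        for rep in range(n_repeats):
            inner = KFold(n_splits=v, shuffle=True, random_state=rep)         # tuning
            outer = KFold(n_splits=v, shuffle=True, random_state=1000 + rep)  # assessment
            model = GridSearchCV(SVR(), param_grid, cv=inner)
            scores.extend(cross_val_score(model, X, y, cv=outer,
                                          scoring="neg_mean_squared_error"))
        return np.array(scores)  # spread reflects split-to-split variability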

644 citations


Journal ArticleDOI
TL;DR: In this article, the authors present a method for the placement of hydrogen coordinates in protein-ligand complexes which takes tautomers and protonation states of both protein and ligand into account.
Abstract: The calculation of hydrogen positions is a common preprocessing step when working with crystal structures of protein-ligand complexes. An explicit description of hydrogen atoms is generally needed in order to analyze the binding mode of particular ligands or to calculate the associated binding energies. Due to the large number of degrees of freedom resulting from different chemical moieties and the high degree of mutual dependence this problem is anything but trivial. In addition to an efficient algorithm to take care of the complexity resulting from complicated hydrogen bonding networks, a robust chemical model is needed to describe effects such as tautomerism and ionization consistently. We present a novel method for the placement of hydrogen coordinates in protein-ligand complexes which takes tautomers and protonation states of both protein and ligand into account. Our method generates the most probable hydrogen positions on the basis of an optimal hydrogen bonding network using an empirical scoring function. The high quality of our results was verified by comparison to the manually adjusted Astex diverse set and by a remarkably low rate of undesirable hydrogen contacts compared to other tools.

129 citations


Journal ArticleDOI
TL;DR: The cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection.
Abstract: Generally, QSAR modelling requires both model selection and validation since there is no a priori knowledge about the optimal QSAR model. Prediction errors (PE) are frequently used to select and to assess the models under study. Reliable estimation of prediction errors is challenging – especially under model uncertainty – and requires independent test objects. These test objects must be involved neither in model building nor in model selection. Double cross-validation, sometimes also termed nested cross-validation, offers an attractive possibility to generate test data and to select QSAR models since it uses the data very efficiently. Nevertheless, there is a controversy in the literature with respect to the reliability of double cross-validation under model uncertainty. Moreover, systematic studies investigating the adequate parameterization of double cross-validation are still missing. Here, the cross-validation design in the inner loop and the influence of the test set size in the outer loop are systematically studied for regression models in combination with variable selection. Simulated and real data are analysed with double cross-validation to identify important factors for the resulting model quality. For the simulated data, a bias-variance decomposition is provided. The prediction errors of QSAR/QSPR regression models in combination with variable selection depend to a large degree on the parameterization of double cross-validation. While the parameters for the inner loop of double cross-validation mainly influence bias and variance of the resulting models, the parameters for the outer loop mainly influence the variability of the resulting prediction error estimate. Double cross-validation reliably and unbiasedly estimates prediction errors under model uncertainty for regression models. Compared to a single test set, double cross-validation provided a more realistic picture of model quality and should be preferred over a single test set.

117 citations


Journal ArticleDOI
TL;DR: This open-source implementation of an MMFF-capable molecular mechanics engine, coupled with the rest of the RDKit functionality and covered by the BSD license, is appealing to researchers operating in both academia and industry.
Abstract: A general purpose force field such as MMFF94/MMFF94s, which can properly deal with a wide range of diverse structures, is very valuable in the context of a cheminformatics toolkit. Herein we present an open-source implementation of this force field within the RDKit. The new MMFF functionality can be accessed through a C++/C#/Python/Java application programming interface (API) developed along the lines of the one already available for UFF in the RDKit. Our implementation was fully validated against the official validation suite provided by the MMFF authors. All energies and gradients were correctly computed; moreover, atom types and force constants were correctly assigned for 3D molecules built from SMILES strings. To provide full flexibility, the API gives direct access to include/exclude individual terms from the MMFF energy expression and to carry out constrained geometry optimizations. The availability of an MMFF-capable molecular mechanics engine coupled with the rest of the RDKit functionality and covered by the BSD license is appealing to researchers operating in both academia and industry.
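
As a usage illustration (assuming a standard RDKit installation; the SMILES string is an arbitrary example), the Python API described above can be exercised roughly as follows:

    from rdkit import Chem
    from rdkit.Chem import AllChem

    mol = Chem.AddHs(Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
    AllChem.EmbedMolecule(mol, randomSeed=42)        # generate a 3D conformer

    # Assign MMFF94s atom types/parameters and build the force field object.
    props = AllChem.MMFFGetMoleculeProperties(mol, mmffVariant="MMFF94s")
    ff = AllChem.MMFFGetMoleculeForceField(mol, props)
    print("initial energy:", ff.CalcEnergy())
    ff.Minimize()                                    # geometry optimization
    print("optimized energy:", ff.CalcEnergy())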

111 citations


Journal ArticleDOI
TL;DR: This review sketches out dictionary-based, rule-based, machine learning and hybrid chemical named entity recognition approaches together with their applied solutions, and closes with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
Abstract: The rapid increase in the flow rate of published digital information in all disciplines has resulted in a pressing need for techniques that can simplify the use of this information. The chemistry literature is very rich with information about chemical entities. Extracting molecules and their related properties and activities from the scientific literature, “text mining” the extracted data and determining contextual relationships helps research scientists, particularly those in drug development. One of the most important challenges in chemical text mining is the recognition of chemical entities mentioned in the texts. In this review, we briefly introduce the fundamental concepts of chemical literature mining, the textual contents of chemical documents, and the methods of naming chemicals in documents. We sketch out dictionary-based, rule-based and machine learning, as well as hybrid, chemical named entity recognition approaches with their applied solutions. We end with an outlook on the pros and cons of these approaches and the types of chemical entities extracted.
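
As a toy illustration of the simplest of these approaches, dictionary-based recognition amounts to matching known chemical names against the text; real systems add normalisation, rule-based and machine-learning layers on top. The dictionary and sentence below are invented.

    import re

    dictionary = {"aspirin", "acetylsalicylic acid", "caffeine", "ibuprofen"}
    text = "Co-administration of aspirin and caffeine increased plasma ibuprofen levels."

    # Longest-match-first alternation so multi-word names win over substrings.
    pattern = re.compile(r"\b(" + "|".join(sorted(map(re.escape, dictionary),
                                                  key=len, reverse=True)) + r")\b",
                         re.IGNORECASE)
    for match in pattern.finditer(text):
        print(match.group(0), match.span())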

99 citations


Journal ArticleDOI
TL;DR: The comparison of multiple combinations of binary fingerprints and similarity metrics for computing the chemical similarity in the context of two different applications of the read-across technique demonstrates that the classical similarity measurements can be improved with a generalizable model of similarity.
Abstract: Methods that provide a measure of chemical similarity are strongly relevant in several fields of chemoinformatics as they make it possible to predict the molecular behavior and fate of structurally close compounds. One common application of chemical similarity measurements, based on the principle that similar molecules have similar properties, is the read-across approach, where an estimation of a specific endpoint for a chemical is provided using experimental data available from highly similar compounds. This paper reports the comparison of multiple combinations of binary fingerprints and similarity metrics for computing the chemical similarity in the context of two different applications of the read-across technique. Our analysis demonstrates that the classical similarity measurements can be improved with a generalizable model of similarity. The proposed approach has already been used to build similarity indices in two open-source software tools (CAESAR and VEGA) that make several QSAR models available. In these tools, the similarity index plays a key role in the assessment of the applicability domain.
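
For orientation, a typical fingerprint/metric combination of the kind compared in the paper can be computed with RDKit as sketched below (Morgan fingerprints with Tanimoto similarity; the molecules are arbitrary examples and the published CAESAR/VEGA similarity index is more elaborate than this):

    from rdkit import Chem, DataStructs
    from rdkit.Chem import AllChem

    query = Chem.MolFromSmiles("c1ccccc1O")  # phenol, the chemical to be read across
    neighbours = [Chem.MolFromSmiles(s) for s in ("Cc1ccccc1", "Oc1ccccc1Cl")]

    def fp(mol):
        return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

    q_fp = fp(query)
    for mol in neighbours:
        sim = DataStructs.TanimotoSimilarity(q_fp, fp(mol))
        print(Chem.MolToSmiles(mol), round(sim, 3))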

77 citations


Journal ArticleDOI
TL;DR: The ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier.
Abstract: The paper presents a thorough analysis of the influence of the number of negative training examples on the performance of machine learning methods. The impact of this rather neglected aspect of machine learning method application was examined for sets containing a fixed number of positive examples and a varying number of negative examples randomly selected from the ZINC database. An increase in the ratio of positive to negative training instances was found to greatly influence most of the investigated evaluation parameters of ML methods in simulated virtual screening experiments. In a majority of cases, substantial increases in precision and MCC were observed in conjunction with some decreases in hit recall. The analysis of the dynamics of those variations allowed us to recommend an optimal composition of training data. The study was performed on several protein targets, 5 machine learning algorithms (SMO, Naive Bayes, IBk, J48 and Random Forest) and 2 types of molecular fingerprints (MACCS and CDK FP). The most effective classification was provided by the combination of CDK FP with the SMO or Random Forest algorithms. The Naive Bayes models appeared to be hardly sensitive to changes in the number of negative instances in the training set. In conclusion, the ratio of positive to negative training instances should be taken into account during the preparation of machine learning experiments, as it might significantly influence the performance of a particular classifier. What is more, the optimization of negative training set size can be applied as a boosting-like approach in machine learning-based virtual screening.
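
The experimental design can be sketched as follows (illustrative only; X_act and X_inact stand for precomputed fingerprints of actives and of ZINC-derived inactives, and the Random Forest classifier is just one of the five algorithms studied):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef, precision_score, recall_score
    from sklearn.model_selection import cross_val_predict

    def evaluate_ratio(X_act, X_inact, n_negatives, seed=0):
        """Fixed actives plus n_negatives randomly drawn inactives,
        evaluated by cross-validated MCC, precision and recall."""
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X_inact), size=n_negatives, replace=False)
        X = np.vstack([X_act, X_inact[idx]])
        y = np.array([1] * len(X_act) + [0] * n_negatives)
        pred = cross_val_predict(RandomForestClassifier(n_estimators=100), X, y, cv=5)
        return {"MCC": matthews_corrcoef(y, pred),
                "precision": precision_score(y, pred),
                "recall": recall_score(y, pred)}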

61 citations


Journal ArticleDOI
TL;DR: The hereby-established quantitative assessment can rank compounds regardless of whether they pass or fail well-established pesticide-likeness rules, and offers an efficient way to prioritize (class-specific) pesticides.
Abstract: The design of chemical libraries, an early step in agrochemical discovery programs, is frequently addressed by means of qualitative physicochemical and/or topological rule-based methods. The aim of this study is to develop quantitative estimates of herbicide- (QEH), insecticide- (QEI), fungicide- (QEF), and, finally, pesticide-likeness (QEP). In the assessment of these definitions, we relied on the concept of desirability functions. We found a simple function, shared by the three classes of pesticides, parameterized for six easy-to-compute, independent and interpretable molecular properties: molecular weight, logP, number of hydrogen bond acceptors, number of hydrogen bond donors, number of rotatable bonds and number of aromatic rings. Subsequently, we describe the scoring of each pesticide class by the corresponding quantitative estimate. In a comparative study, we assessed the performance of the scoring functions using extensive datasets of patented pesticides. The hereby-established quantitative assessment can rank compounds regardless of whether they pass or fail well-established pesticide-likeness rules, and offers an efficient way to prioritize (class-specific) pesticides. These findings are valuable for the efficient estimation of pesticide-likeness of vast chemical libraries in the field of agrochemical discovery.
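
The desirability-function idea can be sketched as below; the Gaussian functional form and every parameter value are hypothetical placeholders chosen for illustration, not the published QEH/QEI/QEF/QEP parameterisation:

    import math

    def desirability(x, mu, sigma):
        """Simple bell-shaped desirability for a single molecular property."""
        return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

    def quantitative_estimate(props, params):
        """Geometric mean of the individual desirabilities (QED-like scheme)."""
        d = [max(desirability(props[k], *params[k]), 1e-12) for k in params]
        return math.exp(sum(map(math.log, d)) / len(d))

    # Hypothetical (mu, sigma) pairs for the six properties named in the abstract.
    params = {"MW": (300, 120), "logP": (3, 2), "HBA": (3, 2),
              "HBD": (1, 1), "RotB": (4, 3), "AromRings": (2, 1)}
    compound = {"MW": 255.7, "logP": 3.4, "HBA": 2, "HBD": 1, "RotB": 3, "AromRings": 2}
    print(round(quantitative_estimate(compound, params), 3))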

58 citations


Journal ArticleDOI
TL;DR: The promising results suggest that the proposed ligand-based target fishing approach is useful not only for finding new uses for promiscuous drugs, but also for predicting some important toxic liabilities.
Abstract: Background Ligand-based in silico target fishing can be used to identify the potential interacting targets of bioactive ligands, which is useful for understanding the polypharmacology and safety profile of existing drugs. The underlying principle of the approach is that known bioactive ligands can be used as a reference to predict the targets for a new compound.

53 citations


Journal ArticleDOI
TL;DR: InCHlib (Interactive Cluster Heatmap Library), a highly interactive and lightweight JavaScript library for cluster heatmap visualization and exploration, is developed; it is a versatile tool whose application domain is not limited to the life sciences.
Abstract: Hierarchical clustering is an exploratory data analysis method that reveals the groups (clusters) of similar objects. The result of the hierarchical clustering is a tree structure called a dendrogram that shows the arrangement of individual clusters. To investigate the row/column hierarchical cluster structure of a data matrix, a visualization tool called a ‘cluster heatmap’ is commonly employed. In the cluster heatmap, the data matrix is displayed as a heatmap, a 2-dimensional array in which the colour of each element corresponds to its value. The rows/columns of the matrix are ordered such that similar rows/columns are near each other. The ordering is given by the dendrogram, which is displayed on the side of the heatmap. We developed InCHlib (Interactive Cluster Heatmap Library), a highly interactive and lightweight JavaScript library for cluster heatmap visualization and exploration. InCHlib enables the user to select individual or clustered heatmap rows, to zoom in and out of clusters or to flexibly modify heatmap appearance. The cluster heatmap can be augmented with additional metadata displayed in a different colour scale. In addition, to further enhance the visualization, the cluster heatmap can be interconnected with external data sources or analysis tools. Data clustering and the preparation of the input file for InCHlib are facilitated by the Python utility script inchlib_clust. The cluster heatmap is one of the most popular visualizations of large chemical and biomedical data sets originating, e.g., in high-throughput screening, genomics or transcriptomics experiments. The presented JavaScript library InCHlib is a client-side solution for cluster heatmap exploration. InCHlib can be easily deployed into any modern web application and configured to cooperate with external tools and data sources. Though InCHlib is primarily intended for the analysis of chemical or biological data, it is a versatile tool whose application domain is not limited to the life sciences.
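
The clustering step that inchlib_clust automates can be pictured with a few lines of SciPy (concept only; this is not the InCHlib or inchlib_clust API):

    import numpy as np
    from scipy.cluster.hierarchy import leaves_list, linkage
    from scipy.spatial.distance import pdist

    data = np.random.rand(20, 8)  # toy data matrix: 20 objects, 8 features

    # Hierarchical clustering of rows and columns gives the two dendrograms.
    row_link = linkage(pdist(data), method="ward")
    col_link = linkage(pdist(data.T), method="ward")

    # Reorder the matrix so that similar rows/columns are adjacent, which is
    # exactly the ordering a cluster heatmap displays next to its dendrograms.
    ordered = data[leaves_list(row_link), :][:, leaves_list(col_link)]
    print(ordered.shape)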

Journal ArticleDOI
TL;DR: GP models trained on these PCM datasets are statistically sound, at the same level of statistical significance as Support Vector Machines (SVM), with R₀² values on the external dataset ranging from 0.68 to 0.92, and RMSEP values close to the experimental error.
Abstract: Proteochemometrics (PCM) is an approach for bioactivity predictive modeling which models the relationship between protein and chemical information. Gaussian Processes (GP), based on Bayesian inference, provide the most objective estimation of the uncertainty of the predictions, thus permitting the evaluation of the applicability domain (AD) of the model. Furthermore, the experimental error on bioactivity measurements can be used as input for this probabilistic model.
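
Conceptually, the uncertainty estimate that defines the applicability domain comes straight out of the GP posterior; a minimal scikit-learn sketch (with random placeholder descriptors and activities rather than real PCM data) looks like this:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, WhiteKernel

    X = np.random.rand(50, 10)   # placeholder protein + ligand descriptors
    y = np.random.rand(50)       # placeholder bioactivities (e.g. pIC50-like values)

    # The WhiteKernel term lets the model account for experimental noise.
    gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
    gp.fit(X, y)

    mean, std = gp.predict(X[:5], return_std=True)
    print(mean, std)             # large std -> prediction outside the AD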

Journal ArticleDOI
TL;DR: The study shows that such an approach, referred to as prediction-driven MMP analysis, is a useful tool for medicinal chemists, allowing identification of large numbers of “interesting” transformations that can be used to drive the molecular optimization process.
Abstract: QSAR is an established and powerful method for cheap in silico assessment of physicochemical properties and biological activities of chemical compounds. However, QSAR models are rather complex mathematical constructs that cannot easily be interpreted. Medicinal chemists would benefit from practical guidance regarding which molecules to synthesize. Another possible approach is analysis of pairs of very similar molecules, so-called matched molecular pairs (MMPs). Such an approach allows identification of molecular transformations that affect particular activities (e.g. toxicity). In contrast to QSAR, chemical interpretation of these transformations is straightforward. Furthermore, such transformations can give medicinal chemists useful hints for the hit-to-lead optimization process. The current study suggests a combination of QSAR and MMP approaches by finding MMP transformations based on QSAR predictions for large chemical datasets. The study shows that such an approach, referred to as prediction-driven MMP analysis, is a useful tool for medicinal chemists, allowing identification of large numbers of “interesting” transformations that can be used to drive the molecular optimization process. All the methodological developments have been implemented as software products available online as part of OCHEM ( http://ochem.eu/ ). The prediction-driven MMPs methodology was exemplified by two use cases: modelling of aquatic toxicity and CYP3A4 inhibition. This approach helped us to interpret QSAR models and allowed identification of a number of “significant” molecular transformations that affect the desired properties. This can facilitate drug design as a part of the molecular optimization process.

Journal ArticleDOI
TL;DR: A new pharmacophore-based docking program PharmDock is presented that combines pose sampling and ranking based on optimized protein-based pharmacophore models with local optimization using an empirical scoring function.
Abstract: Protein-based pharmacophore models are enriched with the information of potential interactions between ligands and the protein target. We have shown in a previous study that protein-based pharmacophore models can be applied for ligand pose prediction and pose ranking. In this publication, we present a new pharmacophore-based docking program PharmDock that combines pose sampling and ranking based on optimized protein-based pharmacophore models with local optimization using an empirical scoring function. Tests of PharmDock on ligand pose prediction, binding affinity estimation, compound ranking and virtual screening yielded comparable or better performance to existing and widely used docking programs. The docking program comes with an easy-to-use GUI within PyMOL. Two features have been incorporated in the program suite that allow for user-defined guidance of the docking process based on previous experimental data. Docking with those features demonstrated superior performance compared to unbiased docking. A protein pharmacophore-based docking program, PharmDock, has been made available with a PyMOL plugin. PharmDock and the PyMOL plugin are freely available from http://people.pharmacy.purdue.edu/~mlill/software/pharmdock .

Journal ArticleDOI
TL;DR: A computational framework to build disease-drug networks using drug- and disease-specific subnetworks is developed; it incorporates protein networks to refine drug- and disease-associated genes and to prioritize genes in disease- and drug-specific networks.
Abstract: Background With the rapid development of high-throughput genomic technologies and the accumulation of genome-wide datasets for gene expression profiling and biological networks, the impact of diseases and drugs on gene expression can be comprehensively characterized. Drug repositioning offers the possibility of reduced risks in the drug discovery process, thus it is an essential step in drug development.

Journal ArticleDOI
TL;DR: A formal specification of the quantitative structure-activity relationship data organization and archival format, the QSAR DataBank (QsarDB for shorter, or QDB for shortest), is presented; its utility and benefits have been thoroughly tested by solving everyday QSAR and predictive modeling problems.
Abstract: Research efforts in the field of descriptive and predictive Quantitative Structure-Activity Relationships or Quantitative Structure–Property Relationships produce around one thousand scientific publications annually. All the materials and results are mainly communicated using printed media, which in its present form has obvious limitations when it comes to effectively representing mathematical models, including complex and non-linear ones, and large bodies of associated numerical chemical data. It does not support secondary information extraction or reuse efforts, while in silico studies pose additional requirements for accessibility, transparency and reproducibility of the research. This gap can and should be bridged by introducing domain-specific digital data exchange standards and tools. The current publication presents a formal specification of the quantitative structure-activity relationship data organization and archival format called the QSAR DataBank (QsarDB for shorter, or QDB for shortest). The article describes the QsarDB data schema, which formalizes QSAR concepts (objects and relationships between them), and the QsarDB data format, which formalizes their presentation for computer systems. The utility and benefits of QsarDB have been thoroughly tested by solving everyday QSAR and predictive modeling problems, with examples in the field of predictive toxicology, and can be applied to a wide variety of other endpoints. The work is accompanied by an open source reference implementation and tools. The proposed open data, open source, and open standards design is open to public and proprietary extensions on many levels. Selected use cases exemplify the benefits of the proposed QsarDB data format. General ideas for future development are discussed.

Journal ArticleDOI
TL;DR: Cα torsion angles built by four consecutive Cα atoms are highly valuable as a similarity measure on a substructure scale and for finding major “events” – combinations of the time at which a transition occurs and the local structural changes involved – occurring in the course of an MD simulation.
Abstract: Molecular dynamics (MD) simulation, a standard technique for studying the dynamical properties of biomolecules, produces trajectories – series of snapshots of the coordinates of the system – that grow with system size and simulation time. These MD-generated trajectories are huge (many gigabytes) and their analysis may take much longer than the data generation. Managing the large amount of data and presenting it in a flexible and comprehensible manner are the major challenges, and analyzing the trajectories with a standard parameter such as the root-mean-square deviation (RMSD) may not reveal the most interesting properties of the dynamics. To overcome these challenges, Cα torsion angles [1] – torsion angles built by four consecutive Cα atoms – are highly valuable as a similarity measure on a substructure scale. The information on the time at which a transition occurs (temporal domain) combined with the local structural changes involved (spatial domain) is called an “event” occurring in the course of the MD simulation. By calculating the time series of the Cα torsion angles and clustering them, it is possible to determine mechanistic details on a residue length scale and to find major events in simulations of large proteins or protein complexes. The main advantages of the Cα torsion angle criterion are that it does not depend on a previous alignment of the structures and that the direction of the change is also defined. Heat maps of Cα torsion angles give clear graphical representations of the processes described by the MD simulations. Clustering of snapshots according to specific Cα torsion angles is used to automatically find the spatial domains of the structural changes: residues whose snapshots are all assigned to a single cluster are considered part of the rigid core, and the remaining residues are considered flexible. The temporal domain can be characterized in more detail by identifying continuous time intervals assigned to a single cluster as (meta)stable structures and time intervals where the assignment jumps between two clusters as transitional periods. Since outliers can be removed from the fuzzy clusters, the starts and ends of time patches now qualify as important events for the underlying substructure, and structural changes of larger regions are caused by an accumulation of such substructure events. DNA polymerase I – the open ternary complex of the large fragment of Thermus aquaticus DNA polymerase I (Klentaq1), used here as a practical application of Cα torsion angle based analysis – shows a hand-like arrangement including a thumb, a palm and a finger domain [2]. The catalytic cycle leading to nucleotide insertion comprises several steps, including a large structural rearrangement in the form of a movement of the finger domain towards the thumb domain, i.e. the transition from the open to the closed form. Molecular dynamics simulations were performed using the AMBER 10 suite of programs [3]. To get a visual picture of the ongoing processes, the Cα torsion angles, expressed as differences to the crystal structure of the open form, were plotted as a heat map. The rigid and flexible parts, showing no or many significant changes respectively, were clearly visible in the heat map. Once the Cα torsion angles corresponding to the rigid parts are removed, the remaining regions change only in a specific time interval of the simulation. The spatial and temporal domains of the structural changes were identified automatically by clustering the snapshots (using KNIME [4]) and by finding the continuous time intervals, respectively.
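
The elementary quantity behind this analysis is the torsion (dihedral) angle defined by four consecutive Cα atoms; a plain NumPy sketch of its computation from Cartesian coordinates (toy coordinates, standard dihedral formula) is given below:

    import numpy as np

    def ca_torsion(p1, p2, p3, p4):
        """Torsion angle in degrees for four consecutive C-alpha positions."""
        b1, b2, b3 = p2 - p1, p3 - p2, p4 - p3
        n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
        m1 = np.cross(n1, b2 / np.linalg.norm(b2))
        return np.degrees(np.arctan2(np.dot(m1, n2), np.dot(n1, n2)))

    # Toy coordinates (angstroms) of four consecutive C-alpha atoms.
    p = [np.array(v, dtype=float) for v in
         [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.3, 1.3, 0.0), (3.8, 1.3, 1.0)]]
    print(round(ca_torsion(*p), 1))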

Journal ArticleDOI
TL;DR: Molpher is an open-source software framework for the design of virtual chemical libraries focused on a particular mechanistic class of compounds that produces a path of structurally-related compounds through a process the authors term ‘molecular morphing’.
Abstract: Chemical space is virtual space occupied by all chemically meaningful organic compounds. It is an important concept in contemporary chemoinformatics research, and its systematic exploration is vital to the discovery of either novel drugs or new tools for chemical biology. In this paper, we describe Molpher, an open-source framework for the systematic exploration of chemical space. Through a process we term ‘molecular morphing’, Molpher produces a path of structurally-related compounds. This path is generated by the iterative application of so-called ‘morphing operators’ that represent simple structural changes, such as the addition or removal of an atom or a bond. Molpher incorporates an optimized parallel exploration algorithm, compound logging and a two-dimensional visualization of the exploration process. Its feature set can be easily extended by implementing additional morphing operators, chemical fingerprints, similarity measures and visualization methods. Molpher not only offers an intuitive graphical user interface, but also can be run in batch mode. This enables users to easily incorporate molecular morphing into their existing drug discovery pipelines. Molpher is an open-source software framework for the design of virtual chemical libraries focused on a particular mechanistic class of compounds. These libraries, represented by a morphing path and its surroundings, provide valuable starting data for future in silico and in vitro experiments. Molpher is highly extensible and can be easily incorporated into any existing computational drug design pipeline.

Journal ArticleDOI
TL;DR: This paper presents BioPhytMol, a drug discovery community resource of anti-mycobacterial phytomolecules and plant extracts generated using crowdsourcing, designed to systematically represent and search anti-mycobacterial phytochemical information.
Abstract: Tuberculosis (TB) is the second leading cause of death from a single infectious organism, demanding attention towards the discovery of novel anti-tubercular compounds. Natural products or their derivatives have provided more than 50% of all existing drugs, offering a chemically diverse space for the discovery of novel drugs. BioPhytMol has been designed to systematically curate and analyze the anti-mycobacterial natural product chemical space. BioPhytMol is developed as a drug-discovery community resource with anti-mycobacterial phytomolecules and plant extracts. Currently, it holds 2582 entries including 188 plant families (692 genera and 808 species) from global flora, manually curated from literature. In total, there are 633 phytomolecules (with structures) curated against 25 target mycobacteria. Multiple analysis approaches have been used to prioritize the library for drug-like compounds, for both whole cell screening and target-based approaches. In order to represent the multidimensional data on chemical diversity, physicochemical properties and biological activity of the compound library, novel approaches such as the use of circular graphs have been employed. BioPhytMol has been designed to systematically represent and search anti-mycobacterial phytochemical information. Extensive compound analyses can also be performed through the web application to prioritize drug-like compounds. The resource is freely available online at http://ab-openlab.csir.res.in/biophytmol/ .

Journal ArticleDOI
TL;DR: A consensus model based on the predicted values of individual LLL models of LD50 is developed, yielding a correlation coefficient R² of 0.712 on a test set containing 2,896 compounds.
Abstract: Background Acute toxicity refers to the ability of a substance to cause adverse effects within a short period following dosing or exposure; its assessment is usually the first step in the toxicological investigation of unknown substances. The median lethal dose, LD50, is frequently used as a general indicator of a substance’s acute toxicity, and there is high demand for developing non-animal-based predictions of LD50. Unfortunately, it is difficult to accurately predict compound LD50 using a single QSAR model, because the acute toxicity may involve complex mechanisms and multiple biochemical processes.

Journal ArticleDOI
TL;DR: iDrug is easy to use and provides a novel, fast and reliable tool for conducting drug design experiments: various molecular design processing tasks can be submitted and visualized in a single browser without locally installing any standalone modeling software.
Abstract: The progress in computer-aided drug design (CADD) approaches over the past decades has accelerated early-stage pharmaceutical research. Many powerful standalone tools for CADD have been developed in academia. As programs are developed by various research groups, a consistent, user-friendly online graphical working environment combining computational techniques such as pharmacophore mapping, similarity calculation, scoring, and target identification is needed. We present a versatile, user-friendly, and efficient online tool for computer-aided drug design based on pharmacophore and 3D molecular similarity searching. The web interface enables binding site detection, virtual screening hit identification, and drug target prediction in an interactive manner through a seamless interface to all adapted packages (e.g., Cavity, PocketV.2, PharmMapper, SHAFTS). Several commercially available compound databases for hit identification and a well-annotated pharmacophore database for drug target prediction were integrated into iDrug as well. The web interface provides tools for real-time molecular building/editing, converting, displaying, and analyzing. All the customized configurations of the functional modules can be accessed through the featured session files provided, which can be saved to the local disk and uploaded to resume or update previous work. iDrug is easy to use and provides a novel, fast and reliable tool for conducting drug design experiments. By using iDrug, various molecular design processing tasks can be submitted and visualized in a single browser without locally installing any standalone modeling software. iDrug is accessible free of charge at http://lilab.ecust.edu.cn/idrug .

Journal ArticleDOI
TL;DR: This methodology allows for greater utilisation of the predictions made by black box models and can expedite further study based on the output of a (quantitative) structure-activity model.
Abstract: A new algorithm has been developed to enable the interpretation of black box models. The developed algorithm is agnostic to the learning algorithm and open to all structure-based descriptors such as fragments, keys and hashed fingerprints. The algorithm has provided meaningful interpretation of Ames mutagenicity predictions from both random forest and support vector machine models built on a variety of structural fingerprints. A fragmentation algorithm is utilised to investigate the model’s behaviour on specific substructures present in the query. An output is formulated summarising causes of activation and deactivation. The algorithm is able to identify multiple causes of activation or deactivation, in addition to identifying localised deactivations where the prediction for the query is active overall. No loss in performance is seen as there is no change in the prediction; the interpretation is produced directly from the model’s behaviour for the specific query. Models have been built using multiple learning algorithms including support vector machine and random forest. The models were built on public Ames mutagenicity data and a variety of fingerprint descriptors were used. These models produced good performance in both internal and external validation, with accuracies around 82%. The models were used to evaluate the interpretation algorithm. The interpretation was found to link closely with understood mechanisms for Ames mutagenicity. This methodology allows for greater utilisation of the predictions made by black box models and can expedite further study based on the output of a (quantitative) structure-activity model. Additionally, the algorithm could be utilised for chemical dataset investigation and knowledge extraction/human SAR development.

Journal ArticleDOI
TL;DR: This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems along with an illustrative application to the prediction of mutagenicity.
Abstract: Combining different sources of knowledge to build improved structure-activity relationship models is not easy owing to the variety of knowledge formats and the absence of a common framework for interoperating between learning techniques. Most current approaches address this problem by using consensus models that operate at the prediction level. We explore the possibility of directly combining these sources at the knowledge level, with the aim of harvesting potentially increased synergy at an earlier stage. Our goal is to design a general methodology to facilitate knowledge discovery and produce accurate and interpretable models. To combine models at the knowledge level, we propose to decouple the learning phase from the knowledge application phase using a pivot representation (lingua franca) based on the concept of hypothesis. A hypothesis is a simple and interpretable knowledge unit. Regardless of its origin, knowledge is broken down into a collection of hypotheses. These hypotheses are subsequently organised into a hierarchical network. This unification permits different sources of knowledge to be combined into a common formalised framework. The approach allows us to create a synergistic system between different forms of knowledge, and new algorithms can be applied to leverage this unified model. This first article focuses on the general principle of the Self Organising Hypothesis Network (SOHN) approach in the context of binary classification problems, along with an illustrative application to the prediction of mutagenicity. It is possible to represent knowledge in the unified form of a hypothesis network, allowing interpretable predictions with performance comparable to mainstream machine learning techniques. This new approach offers the potential to combine knowledge from different sources into a common framework in which high level reasoning and meta-learning can be applied; these latter perspectives will be explored in future work.

Journal ArticleDOI
TL;DR: MQN-similarity is shown to efficiently recover 15 different fragrance molecule families from the different FL subsets, demonstrating the relevance of the MQN-based tool to explore the fragrance chemical space.
Abstract: The properties of fragrance molecules in the public databases SuperScent and Flavornet were analyzed to define a “fragrance-like” (FL) property range (Heavy Atom Count ≤ 21, only C, H, O, S, (O + S) ≤ 3, Hydrogen Bond Donor ≤ 1) and the corresponding chemical space including FL molecules from PubChem (NIH repository of molecules), ChEMBL (bioactive molecules), ZINC (drug-like molecules), and GDB-13 (all possible organic molecules up to 13 atoms of C, N, O, S, Cl). The FL subsets of these databases were classified by MQN (Molecular Quantum Numbers, a set of 42 integer value descriptors of molecular structure) and formatted for fast MQN-similarity searching and interactive exploration of color-coded principal component maps in the form of the FL-mapplet and FL-browser applications freely available at http://www.gdb.unibe.ch . MQN-similarity is shown to efficiently recover 15 different fragrance molecule families from the different FL subsets, demonstrating the relevance of the MQN-based tool for exploring the fragrance chemical space.
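
The FL property range quoted above translates directly into a small property filter; a sketch with RDKit (the example SMILES are arbitrary) could look like this:

    from rdkit import Chem
    from rdkit.Chem import Lipinski

    def is_fragrance_like(smiles):
        """FL filter: HAC <= 21; only C, H, O, S; (O + S) <= 3; HBD <= 1."""
        mol = Chem.MolFromSmiles(smiles)
        if mol is None or mol.GetNumHeavyAtoms() > 21:
            return False
        symbols = [atom.GetSymbol() for atom in mol.GetAtoms()]
        if any(s not in ("C", "H", "O", "S") for s in symbols):
            return False
        if symbols.count("O") + symbols.count("S") > 3:
            return False
        return Lipinski.NumHDonors(mol) <= 1

    print(is_fragrance_like("CC(=O)OCC(C)C"))  # isobutyl acetate -> True
    print(is_fragrance_like("c1ccncc1"))       # pyridine (nitrogen) -> False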

Journal ArticleDOI
TL;DR: 2D topological fingerprints calculated to a bond depth of 4-6 contain sufficient information to allow the identification of SoMs using classifiers based on relatively small data sets; the machine learning methods outlined are conceptually simpler and more efficient than the other methods tested and give results competitive with approaches using more expensive quantum chemical descriptors.
Abstract: The prediction of sites and products of metabolism in xenobiotic compounds is key to the development of new chemical entities, where screening potential metabolites for toxicity or unwanted side-effects is of crucial importance. In this work 2D topological fingerprints are used to encode atomic sites and three probabilistic machine learning methods are applied: Parzen-Rosenblatt Window (PRW), Naive Bayesian (NB) and a novel approach called RASCAL (Random Attribute Subsampling Classification ALgorithm). These are implemented by randomly subsampling descriptor space to alleviate the problem often suffered by data mining methods of having to exactly match fingerprints, and in the case of PRW by measuring a distance between feature vectors rather than requiring exact matching. The classifiers have been implemented in CUDA/C++ to exploit the parallel architecture of graphical processing units (GPUs) and are freely available in a public repository. It is shown that for PRW a SoM (Site of Metabolism) is identified in the top two predictions for 85%, 91% and 88% of the CYP 3A4, 2D6 and 2C9 data sets respectively, with RASCAL giving similar performance of 83%, 91% and 88%, respectively. These results put PRW and RASCAL performance ahead of NB, which gave a much lower classification performance of 51%, 73% and 74%, respectively. 2D topological fingerprints calculated to a bond depth of 4-6 contain sufficient information to allow the identification of SoMs using classifiers based on relatively small data sets. Thus, the machine learning methods outlined in this paper are conceptually simpler and more efficient than other methods tested, and the use of simple topological descriptors derived from 2D structure gives results competitive with other approaches using more expensive quantum chemical descriptors. The descriptor space subsampling approach and ensemble methodology allow the methods to be applied to molecules more distant from the training data, where data mining would be more likely to fail due to the lack of common fingerprints. The RASCAL algorithm is shown to give equivalent classification performance to PRW but at lower computational expense, allowing it to be applied more efficiently in the ensemble scheme.

Journal ArticleDOI
TL;DR: TB Mobile can now manage a small collection of compounds that can be imported from external sources, or exported by various means such as email or app-to-app inter-process communication, meaning that TB Mobile can be used as a node within a growing ecosystem of mobile apps for cheminformatics.
Abstract: We recently developed a freely available mobile app (TB Mobile) for both iOS and Android platforms that displays Mycobacterium tuberculosis (Mtb) active molecule structures and their targets with links to associated data. The app was developed to make target information available to as large an audience as possible. We now report a major update of the iOS version of the app. This includes enhancements that use an implementation of ECFP_6 fingerprints that we have made open source. Using these fingerprints, the user can propose compounds with possible anti-TB activity, and view the compounds within a cluster landscape. Proposed compounds can also be compared to existing target data, using a naive Bayesian scoring system to rank probable targets. We have curated an additional 60 new compounds and their targets for Mtb and added these to the original set of 745 compounds. We have also curated 20 further compounds (many without targets in TB Mobile) to evaluate this version of the app with 805 compounds and associated targets. TB Mobile can now manage a small collection of compounds that can be imported from external sources, or exported by various means such as email or app-to-app inter-process communication. This means that TB Mobile can be used as a node within a growing ecosystem of mobile apps for cheminformatics. It can also cluster compounds and use internal algorithms to help identify potential targets based on molecular similarity. TB Mobile represents a valuable dataset, data-visualization aid and target prediction tool.

Journal ArticleDOI
TL;DR: 2D similarity calculations can be used to more efficiently decide which drugs and metabolites should be tested in cross-reactivity studies, as well as to design experiments and potentially predict antigens that would lead to immunoassays with cross-reactivity for a broader array of designer drugs.
Abstract: A challenge for drug of abuse testing is presented by ‘designer drugs’, compounds typically discovered by modifications of existing clinical drug classes such as amphetamines and cannabinoids. Drug of abuse screening immunoassays directed at amphetamine or methamphetamine only detect a small subset of designer amphetamine-like drugs, and those immunoassays designed for tetrahydrocannabinol metabolites generally do not cross-react with synthetic cannabinoids lacking the classic cannabinoid chemical backbone. This suggests complexity in understanding how to detect and identify whether a patient has taken a molecule of one class or another, impacting clinical care. Cross-reactivity data from immunoassays specifically targeting designer amphetamine-like and synthetic cannabinoid drugs were collected from multiple published sources, and virtual chemical libraries for molecular similarity analysis were built. The virtual library for synthetic cannabinoid analysis contained a total of 169 structures, while the virtual library for amphetamine-type stimulants contained 288 compounds. Two-dimensional (2D) similarity for each test compound was compared to the target molecule of the immunoassay undergoing analysis. 2D similarity differentiated between cross-reactive and non-cross-reactive compounds for immunoassays targeting mephedrone/methcathinone, 3,4-methylenedioxypyrovalerone, benzylpiperazine, mephentermine, and synthetic cannabinoids. In this study, we applied 2D molecular similarity analysis to designer amphetamine-type stimulants and synthetic cannabinoids. Similarity calculations can be used to more efficiently decide which drugs and metabolites should be tested in cross-reactivity studies, as well as to design experiments and potentially predict antigens that would lead to immunoassays with cross-reactivity for a broader array of designer drugs.

Journal ArticleDOI
TL;DR: This work describes how the layered structural representation of the Standard InChI is exploited to create new functionality within UniChem that integrates these related molecular forms.
Abstract: UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’ allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, but none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
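
The underlying idea can be illustrated by comparing only the connectivity-related sub-layers of two Standard InChIs; the sketch below is a simplification of the notion of connectivity-level equivalence, not UniChem's implementation, and treats the formula plus the /c and /h sub-layers as the match key so that stereoisomers collapse onto the same key:

    def connectivity_key(std_inchi):
        """Formula + /c + /h sub-layers of a Standard InChI."""
        body = std_inchi.split("=", 1)[1]       # drop the "InChI" prefix
        layers = body.split("/")                # ["1S", formula, "c...", "h...", ...]
        keep = [layers[1]] + [l for l in layers[2:] if l[0] in ("c", "h")]
        return "/".join(keep)

    # L- and D-alanine differ only in the stereo (/t, /m) layers.
    l_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
    d_ala = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
    print(connectivity_key(l_ala) == connectivity_key(d_ala))  # True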

Journal ArticleDOI
TL;DR: The experimental HTS validation of a novel VLS approach, FINDSITEcomb, across a diverse set of medically-relevant proteins is demonstrated, showing that FINDSITEcomb is a promising new VLS approach that can assist drug discovery.
Abstract: Identification of ligand-protein binding interactions is a critical step in drug discovery. Experimental screening of large chemical libraries, in spite of its specific role and importance in drug discovery, suffers from the disadvantages of being random, time-consuming and expensive. To accelerate the process, traditional structure- or ligand-based VLS approaches are combined with experimental high-throughput screening, HTS. Often a single protein or, at most, a protein family is considered. Large scale VLS benchmarking across diverse protein families is rarely done, and the reported success rate is very low. Here, we demonstrate the experimental HTS validation of a novel VLS approach, FINDSITEcomb, across a diverse set of medically-relevant proteins. For eight different proteins belonging to different fold-classes and from diverse organisms, the top 1% of FINDSITEcomb’s VLS predictions were tested, and depending on the protein target, 4%-47% of the predicted ligands were shown to bind with μM or better affinities. In total, 47 small molecule binders were identified. Low nanomolar (nM) binders for dihydrofolate reductase and protein tyrosine phosphatases (PTPs) and micromolar binders for the other proteins were identified. Six novel molecules had cytotoxic activity (<10 μg/ml) against the HCT-116 colon carcinoma cell line and one novel molecule had potent antibacterial activity. We show that FINDSITEcomb is a promising new VLS approach that can assist drug discovery.

Journal ArticleDOI
TL;DR: 2D fingerprints are used to show the extent of the relationship between computed levels of structural similarity for pairs of molecules and expert judgments of the similarities of those pairs; such similarity measures can provide a useful source of information for the assessment of orphan drug status by regulatory authorities.
Abstract: In the European Union, medicines are authorised for some rare disease only if they are judged to be dissimilar to authorised orphan drugs for that disease. This paper describes the use of 2D fingerprints to show the extent of the relationship between computed levels of structural similarity for pairs of molecules and expert judgments of the similarities of those pairs. The resulting relationship can be used to provide input to the assessment of new active compounds for which orphan drug authorisation is being sought. 143 experts provided judgments of the similarity or dissimilarity of 100 pairs of drug-like molecules from the DrugBank 3.0 database. The similarities of these pairs were also computed using BCI, Daylight, ECFC4, ECFP4, MDL and Unity 2D fingerprints. Logistic regression analyses demonstrated a strong relationship between the human and computed similarity assessments, with the resulting regression models having significant predictive power in experiments using data from submissions of orphan drug medicines to the European Medicines Agency. The BCI fingerprints performed best overall on the DrugBank dataset while the BCI, Daylight, ECFP4 and Unity fingerprints performed comparably on the European Medicines Agency dataset. Measures of structural similarity based on 2D fingerprints can provide a useful source of information for the assessment of orphan drug status by regulatory authorities.
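
The statistical core of the analysis is a logistic regression of a binary expert judgment on a computed similarity value; the sketch below uses invented toy numbers purely to show the shape of the calculation, not the study's DrugBank or EMA data:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    tanimoto = np.array([[0.12], [0.25], [0.41], [0.55], [0.63], [0.78], [0.88], [0.95]])
    expert_similar = np.array([0, 0, 0, 1, 0, 1, 1, 1])  # toy expert votes

    model = LogisticRegression().fit(tanimoto, expert_similar)
    # Probability that experts would judge a new pair "similar" at Tanimoto 0.7.
    print(model.predict_proba([[0.7]])[0, 1])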