
Showing papers in "Journal of Cheminformatics in 2010"


Journal ArticleDOI
TL;DR: The QUICS methodology enabled rapid, in-depth evaluation of all possible metabolites within a set of samples to identify the metabolites and, for those that did not have an entry in the reference library, to create a library entry to identify that metabolite in future studies.
Abstract: Metabolomics experiments involve generating and comparing small molecule (metabolite) profiles from complex mixture samples to identify those metabolites that are modulated in altered states (e.g., disease, drug treatment, toxin exposure). One non-targeted metabolomics approach attempts to identify and interrogate all small molecules in a sample using GC or LC separation followed by MS or MSn detection. Analysis of the resulting large, multifaceted data sets to rapidly and accurately identify the metabolites is a challenging task that relies on the availability of chemical libraries of metabolite spectral signatures. A method for analyzing spectrometry data to identify and Quantify Individual Components in a Sample (QUICS) enables generation of chemical library entries from known standards and, importantly, from unknown metabolites present in experimental samples but without a corresponding library entry. This method accounts for all ions in a sample spectrum, performs library matches, and allows review of the data to quality check library entries. The QUICS method identifies ions related to any given metabolite by correlating ion data across the complete set of experimental samples, thus revealing subtle spectral trends that may not be evident when viewing individual samples and are likely to be indicative of the presence of one or more otherwise obscured metabolites. LC-MS/MS or GC-MS data from 33 liver samples were analyzed simultaneously, which exploited the inherent biological diversity of the samples and the largely non-covariant chemical nature of the metabolites when viewed over multiple samples. Ions were partitioned by both retention time (RT) and covariance, which grouped ions from a single common underlying metabolite. This approach benefitted from using mass, time and intensity data in aggregate over the entire sample set to reject outliers and noise, thereby producing higher quality chemical identities. The aggregated data was matched to reference chemical libraries to aid in identifying the ion set as a known metabolite or as a new unknown biochemical to be added to the library. The QUICS methodology enabled rapid, in-depth evaluation of all possible metabolites (known and unknown) within a set of samples to identify the metabolites and, for those that did not have an entry in the reference library, to create a library entry to identify that metabolite in future studies.
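A minimal sketch of the grouping idea described above (the retention-time window, correlation threshold and data layout are illustrative assumptions, not the QUICS implementation): ions are binned by retention time and then clustered by how strongly their intensities co-vary across the whole sample set.

```python
import numpy as np

def group_ions(rt, intensities, rt_window=2.0, min_corr=0.9):
    """Group ions that co-elute (similar retention time) and whose intensities
    co-vary across all samples, as a rough proxy for 'same underlying metabolite'.

    rt          : (n_ions,) retention times
    intensities : (n_ions, n_samples) intensity matrix over the whole sample set
    """
    corr = np.corrcoef(intensities)        # ion-vs-ion correlation over samples
    unassigned = set(range(len(rt)))
    groups = []
    while unassigned:
        seed = unassigned.pop()
        group = [seed]
        for j in list(unassigned):
            if abs(rt[j] - rt[seed]) <= rt_window and corr[seed, j] >= min_corr:
                group.append(j)
                unassigned.discard(j)
        groups.append(sorted(group))
    return groups
```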

524 citations


Journal ArticleDOI
TL;DR: Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.
Abstract: OpenTox provides an interoperable, standards-based Framework for the support of predictive toxicology data management, algorithms, modelling, validation and reporting. It is relevant to satisfying the chemical safety assessment requirements of the REACH legislation as it supports access to experimental data, (Quantitative) Structure-Activity Relationship models, and toxicological information through an integrating platform that adheres to regulatory requirements and OECD validation principles. Initial research defined the essential components of the Framework including the approach to data access, schema and management, use of controlled vocabularies and ontologies, architecture, web service and communications protocols, and selection and integration of algorithms for predictive modelling. OpenTox provides end-user oriented tools to non-computational specialists, risk assessors, and toxicological experts in addition to Application Programming Interfaces (APIs) for developers of new applications. OpenTox actively supports public standards for data representation, interfaces, vocabularies and ontologies, Open Source approaches to core platform components, and community-based collaboration approaches, so as to progress system interoperability goals. The OpenTox Framework includes APIs and services for compounds, datasets, features, algorithms, models, ontologies, tasks, validation, and reporting which may be combined into multiple applications satisfying a variety of different user needs. OpenTox applications are based on a set of distributed, interoperable OpenTox API-compliant REST web services. The OpenTox approach to ontology allows for efficient mapping of complementary data coming from different datasets into a unifying structure having a shared terminology and representation. Two initial OpenTox applications are presented as an illustration of the potential impact of OpenTox for high-quality and consistent structure-activity relationship modelling of REACH-relevant endpoints: ToxPredict which predicts and reports on toxicities for endpoints for an input chemical structure, and ToxCreate which builds and validates a predictive toxicity model based on an input toxicology dataset. Because of the extensible nature of the standardised Framework design, barriers of interoperability between applications and content are removed, as the user may combine data, models and validation from multiple sources in a dependable and time-effective way.

108 citations


Journal ArticleDOI
TL;DR: PoseView, a tool which displays molecular complexes incorporating a simple, easy-to-perceive arrangement of the ligand and the amino acids towards which it forms interactions, is developed, and the underlying interaction models are presented.
Abstract: Chemists are well trained in perceiving 2D molecular sketches. On the side of computer assistance, the automated generation of such sketches becomes very difficult when it comes to multi-molecular arrangements such as protein-ligand complexes in a drug design context. Existing solutions to date suffer from drawbacks such as missing important interaction types, inappropriate levels of abstraction and layout quality. During the last few years we have developed PoseView [1,2], a tool which displays molecular complexes incorporating a simple, easy-to-perceive arrangement of the ligand and the amino acids towards which it forms interactions. PoseView produces diagrams at atomic resolution and operates on a fast tree re-arrangement algorithm to minimize crossing lines in the sketches. Due to a de-coupling of interaction perception and the drawing engine, PoseView can draw any interactions determined by either distance-based rules or the FlexX interaction model (which itself is user accessible). Owing to the small molecule drawing engine 2Ddraw [3], molecules are drawn in a textbook-like manner following the IUPAC recommendations. The tool has a generic file interface for complexes other than protein-ligand arrangements. It can therefore be used as well for the display of, e.g., RNA/DNA complexes with small molecules. For batch processing, an additional command line interface is available; output can be provided in various formats, amongst them gif, ps, svg and pdf. Besides the underlying interaction models, we will present new algorithmic approaches, an assessment of usability issues and a large-scale validation study on the PDB.

73 citations


Journal ArticleDOI
TL;DR: The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; rule-based filtering and disambiguation is necessary to achieve a high precision for both the automatically generated and the manually curated dictionary.
Abstract: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated which impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships. We acquired the component of ChemSpider containing only manually curated names and synonyms. Rule-based term filtering, semi-automatic manual curation, and disambiguation rules were applied. We tested the dictionary from ChemSpider on an annotated corpus and compared the results with those for the Chemlist dictionary. The ChemSpider dictionary of ca. 80 k names was only one third to one quarter the size of the Chemlist dictionary of around 300 k names. The ChemSpider dictionary had a precision of 0.43 and a recall of 0.19 before the application of filtering and disambiguation and a precision of 0.87 and a recall of 0.19 after filtering and disambiguation. The Chemlist dictionary had a precision of 0.20 and a recall of 0.47 before the application of filtering and disambiguation and a precision of 0.67 and a recall of 0.40 after filtering and disambiguation. We conclude the following: (1) The ChemSpider dictionary achieved the best precision but the Chemlist dictionary had a higher recall and the best F-score; (2) Rule-based filtering and disambiguation is necessary to achieve a high precision for both the automatically generated and the manually curated dictionary. ChemSpider is available as a web service at http://www.chemspider.com/ and the Chemlist dictionary is freely available as an XML file in Simple Knowledge Organization System format on the web at http://www.biosemantics.org/chemlist .
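The F-score comparison in conclusion (1) follows directly from the reported post-filtering precision and recall values; a quick check with the standard harmonic-mean formula:

```python
def f_score(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Figures after filtering and disambiguation, as reported above:
print(round(f_score(0.87, 0.19), 2))   # ChemSpider dictionary -> 0.31
print(round(f_score(0.67, 0.40), 2))   # Chemlist dictionary   -> 0.5
```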

46 citations


Journal ArticleDOI
TL;DR: QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply.
Abstract: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusion regarding descriptors by defining them crisply. This makes it easy to join, extend and combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.
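As an illustration of what such an exchange format has to capture (the element and attribute names below are hypothetical placeholders, not the actual QSAR-ML schema), a reproducible dataset ties each structure to descriptors whose implementation and version are recorded explicitly:

```python
import xml.etree.ElementTree as ET

# Hypothetical element/attribute names -- consult the published QSAR-ML schema
# for the real ones; this only shows the provenance a reproducible setup needs.
dataset = ET.Element("qsar-dataset")
ET.SubElement(dataset, "structure", id="mol1", source="benzene.mol")
descriptor = ET.SubElement(dataset, "descriptor", ontologyRef="XLogP")
ET.SubElement(descriptor, "implementation",
              vendor="CDK", version="1.3.4")   # versioned descriptor implementation
print(ET.tostring(dataset, encoding="unicode"))
```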

43 citations


Journal ArticleDOI
TL;DR: The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model.
Abstract: The virtual screening of large compound databases is an important application of structure-activity relationship models. Due to the high structural diversity of these data sets, it is impossible for machine learning based QSAR models, which rely on a specific training set, to give reliable results for all compounds. Thus, it is important to consider the subset of the chemical space in which the model is applicable. The approaches to this problem that have been published so far mostly use vectorial descriptor representations to define this domain of applicability of the model. Unfortunately, these cannot be extended easily to structured kernel-based machine learning models. For this reason, we propose three approaches to estimate the domain of applicability of a kernel-based QSAR model. We evaluated three kernel-based applicability domain estimations using three different structured kernels on three virtual screening tasks. Each experiment consisted of the training of a kernel-based QSAR model using support vector regression and the ranking of a disjoint screening data set according to the predicted activity. For each prediction, the applicability of the model for the respective compound is quantitatively described using a score obtained by an applicability domain formulation. The suitability of the applicability domain estimation is evaluated by comparing the model performance on the subsets of the screening data sets obtained by different thresholds for the applicability scores. This comparison indicates that it is possible to separate the part of the chemical space, in which the model gives reliable predictions, from the part consisting of structures too dissimilar to the training set to apply the model successfully. A closer inspection reveals that the virtual screening performance of the model is considerably improved if half of the molecules, those with the lowest applicability scores, are omitted from the screening. The proposed applicability domain formulations for kernel-based QSAR models can successfully identify compounds for which no reliable predictions can be expected from the model. The resulting reduction of the search space and the elimination of some of the active compounds should not be considered as a drawback, because the results indicate that, in most cases, these omitted ligands would not be found by the model anyway.
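The evaluation protocol can be sketched in a few lines (an illustrative outline with assumed array inputs, not the paper's kernels, SVR models or exact enrichment metric): rank the screening set by predicted activity, retain only compounds above an applicability-score threshold, and compare how many actives end up near the top of the ranking.

```python
import numpy as np

def active_recovery(pred_activity, ad_score, is_active, keep_fraction=0.5, top_fraction=0.01):
    """Keep the fraction of the screening set with the highest applicability
    scores, rank it by predicted activity, and report which share of all
    actives falls into the top of that ranking."""
    keep = np.argsort(-ad_score)[: max(1, int(len(ad_score) * keep_fraction))]
    order = keep[np.argsort(-pred_activity[keep])]   # kept compounds, best predictions first
    top = order[: max(1, int(len(order) * top_fraction))]
    return is_active[top].sum() / max(1, is_active.sum())

# Comparing keep_fraction=1.0 (whole library) with keep_fraction=0.5 mirrors the
# observation above that dropping the least applicable half improves screening.
```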

40 citations


Journal ArticleDOI
TL;DR: Initial testing indicates this tool is useful in identifying potential biological applications of compounds that are not obvious, and in identifying corroborating and conflicting information from multiple sources.
Abstract: In recent years, there has been a huge increase in the amount of publicly-available and proprietary information pertinent to drug discovery. However, there is a distinct lack of data mining tools available to harness this information, and in particular for knowledge discovery across multiple information sources. At Indiana University we have an ongoing project with Eli Lilly to develop web-service based tools for integrative mining of chemical and biological information. In this paper, we report on the first of these tools, called WENDI (Web Engine for Non-obvious Drug Information) that attempts to find non-obvious relationships between a query compound and scholarly publications, biological properties, genes and diseases using multiple information sources. We have created an aggregate web service that takes a query compound as input, calls multiple web services for computation and database search, and returns an XML file that aggregates this information. We have also developed a client application that provides an easy-to-use interface to this web service. Both the service and client are publicly available. Initial testing indicates this tool is useful in identifying potential biological applications of compounds that are not obvious, and in identifying corroborating and conflicting information from multiple sources. We encourage feedback on the tool to help us refine it further. We are now developing further tools based on this model.

39 citations


Journal ArticleDOI
TL;DR: A collection of primitive operations for molecular diagram sketching has been developed, composing a concise set of operations that can be used to construct publication-quality 2D coordinates for molecular structures using a bare minimum of input bandwidth.
Abstract: A collection of primitive operations for molecular diagram sketching has been developed. These primitives compose a concise set of operations which can be used to construct publication-quality 2D coordinates for molecular structures using a bare minimum of input bandwidth. The input requirements for each primitive consist of a small number of discrete choices, which means that these primitives can be used to form the basis of a user interface which does not require an accurate pointing device. This is particularly relevant to software designed for contemporary mobile platforms. The reduction of input bandwidth is accomplished by using algorithmic methods for anticipating probable geometries during the sketching process, and by intelligent use of template grafting. The algorithms and their uses are described in detail.

35 citations


Journal ArticleDOI
TL;DR: It is shown that the outcomes of different assay formats can be mutually predictive, thus removing the need to submit a potentially toxic compound to multiple assays, and that this enables selection of the easiest-to-run assay as corporate standard, or of the most descriptive panel of assays by including assays whose outcomes are not mutually predictive.
Abstract: We collected data from over 80 different cytotoxicity assays from Pfizer in-house work as well as from public sources and investigated the feasibility of using these datasets, which come from a variety of assay formats (having for instance different measured endpoints, incubation times and cell types) to derive a general cytotoxicity model. Our main aim was to derive a computational model based on this data that can highlight potentially cytotoxic series early in the drug discovery process. We developed Bayesian models for each assay using Scitegic FCFP_6 fingerprints together with the default physical property descriptors. Pairs of assays that are mutually predictive were identified by calculating the ROC score of the model derived from one predicting the experimental outcome of the other, and vice versa. The prediction pairs were visualised in a network where nodes are assays and edges are drawn for ROC scores >0.60 in both directions. We observed that, if assay pairs (A, B) and (B, C) were mutually predictive, this was often not the case for the pair (A, C). The results from 48 assays connected to each other were merged in one training set of 145590 compounds and a general cytotoxicity model was derived. The model has been cross-validated as well as being validated with a set of 89 FDA approved drug compounds. We have generated a predictive model for general cytotoxicity which could speed up the drug discovery process in multiple ways. Firstly, this analysis has shown that the outcomes of different assay formats can be mutually predictive, thus removing the need to submit a potentially toxic compound to multiple assays. Furthermore, this analysis enables selection of (a) the easiest-to-run assay as corporate standard, or (b) the most descriptive panel of assays by including assays whose outcomes are not mutually predictive. The model is no replacement for a cytotoxicity assay but opens the opportunity to be more selective about which compounds are to be submitted to it. On a more mundane level, having data from more than 80 assays in one dataset answers, for the first time, the question - "what are the known cytotoxic compounds from the Pfizer compound collection?" Finally, having a predictive cytotoxicity model will assist the design of new compounds with a desired cytotoxicity profile, since comparison of the model output with data from an in vitro safety/toxicology assay suggests one is predictive of the other.
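A small sketch of the network-construction step described above (the data layout is an assumption; the ROC scores themselves come from the per-assay Bayesian models in the paper): assays become nodes, and an edge is drawn only when each assay's model predicts the other assay's outcome with ROC > 0.60.

```python
import networkx as nx

def build_predictivity_network(roc, threshold=0.60):
    """roc[(a, b)] = ROC score of the model trained on assay `a` when used to
    predict the experimental outcome of assay `b`."""
    g = nx.Graph()
    for (a, b), score in roc.items():
        g.add_nodes_from([a, b])
        if score > threshold and roc.get((b, a), 0.0) > threshold:  # mutual predictivity
            g.add_edge(a, b)
    return g

# The merged training set described above corresponds to one large connected
# component of such a graph (48 assays, 145590 compounds).
```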

34 citations


Journal ArticleDOI
TL;DR: This work set up several kinds of tautomer definitions and derived a stable definition covering the major kinds of prototropic tautomerism, and analyzed what expenditure of time is needed for large databases to determine, for more than 99% of the database entries, which structures have tautomers and which do not.
Abstract: For molecules with mobile H atoms, the result of quantitative structure-activity relationships (QSARs) depends on the position of the respective hydrogen atoms. Thus, to obtain reliable results, tautomerism needs to be taken into account. In recent years, many approaches have been introduced to achieve this. In this work we present a further development of our previous algorithm based on InChI layers. While the InChI approach supports only heteroatom tautomerism, we suggest an extension covering carbon atoms too. Whereas in other tautomer-generating algorithms the hydrogen shifts are based on pattern rules, we try to overcome this rule restriction and develop a more general solution. The advantage of our approach is quite simple: by avoiding a rule system with its necessity for exceptions to the rules, we can apply our solution to any kind of tautomerism definition. We set up a branch-and-bound approach, which is optimized to generate a complete enumeration of all tautomers, with regard to a certain definition, from any structure. With a few simple decisions such as symmetry detection, we avoid a lot of calculation overhead. Decisions with significant influence on the algorithm's efficiency are made as early as possible. We have set up several kinds of tautomer definitions and derived a stable definition covering the major kinds of prototropic tautomerism. Furthermore, we analyzed what expenditure of time is needed for large databases (case study: more than 70,000 entries) to determine, for more than 99% of the database entries, which structures have tautomers and which do not. This study has been financially supported by the EU project OSIRIS (IP, contract no. 037017).
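A toy enumeration skeleton (an illustration of the general branch-and-bound idea, not the authors' InChI-layer-based algorithm): mobile hydrogens are distributed over candidate positions, branches that can no longer be completed are cut off early, and symmetry-equivalent placements are collapsed through a user-supplied canonicalisation function.

```python
def enumerate_placements(positions, n_h, canonical=frozenset):
    """Enumerate placements of n_h mobile hydrogens over candidate positions.

    `canonical` maps a placement to a key that is identical for
    symmetry-equivalent placements, so each tautomer is emitted only once."""
    seen = set()

    def branch(remaining, left, chosen):
        if left == 0:                              # complete tautomer candidate
            key = canonical(chosen)
            if key not in seen:
                seen.add(key)
                yield tuple(chosen)
            return
        if len(remaining) < left:                  # bound: branch cannot be completed
            return
        head, rest = remaining[0], remaining[1:]
        yield from branch(rest, left - 1, chosen + (head,))   # place an H on `head`
        yield from branch(rest, left, chosen)                 # leave `head` bare

    return branch(tuple(positions), n_h, ())

# list(enumerate_placements(["N1", "N3", "O6"], 1)) -> [('N1',), ('N3',), ('O6',)]
```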

31 citations


Journal ArticleDOI
TL;DR: The chemical structure search facility has been restructured to use OrChem, an Oracle chemistry plug-in using the Chemistry Development Kit, and cross-references in ChEBI have been extended to include BRENDA (the enzyme database), NMRShiftDB (the database for organic structures and their nuclear magnetic resonance (NMR) spectra), Rhea (the biochemical reaction database) and IntEnz (the enzyme nomenclature database).
Abstract: The bioinformatics community has developed a policy of open access and open data since its inception. This is contrary to chemoinformatics, which has traditionally been a closed-access area. In 2004, two complementary open access databases were initiated by the bioinformatics community, ChEBI [1] and PubChem. PubChem serves as an automated repository on the biological activities of small molecules and ChEBI (Chemical Entities of Biological Interest) as a manually annotated database of molecular entities focused on 'small' chemical compounds. Although ChEBI is reasonably compact, containing just over 18,000 entities, it provides a wide range of data items such as chemical nomenclature, an ontology and chemical structures. The ChEBI database has a strong focus on quality with exceptional efforts afforded to IUPAC nomenclature rules, classification within the ontology and best IUPAC practices when drawing chemical structures. ChEBI is currently undergoing a period of restructuring which will allow it to incorporate the small molecule structures from (and link to) EBI's new chemogenomics database ChEMBL [2], increasing its small molecules coverage to over 500,000 entities. We have restructured the chemical structure search facility to use OrChem [3], an Oracle chemistry plug-in using the Chemistry Development Kit [4]. The facility allows a user to draw a chemical structure or load one from a file and then execute either a substructure or similarity search. Furthermore, the ChEBI text search will have extensive facilities for querying based not only on names but also on formula, a range of charges and molecular weight. The ability to query the ChEBI ontology and retrieve all children for a given entity will also be included. In order to aid the distribution of ChEBI to the chemoinformatics community we have extended our export formats to include an MDL SDF format, with a lighter version consisting only of compound structure, name and identifier. A complete version is available with all the ChEBI data properties such as synonyms, cross-references, SMILES and InChI. Furthermore, cross-references in ChEBI have been extended to include BRENDA (the enzyme database), NMRShiftDB (the database for organic structures and their nuclear magnetic resonance (NMR) spectra), Rhea (the biochemical reaction database) and IntEnz (the enzyme nomenclature database). ChEBI is available at http://www.ebi.ac.uk/chebi.

Journal ArticleDOI
TL;DR: MOLA is an easy-to-use graphical user interface tool that automates parallel virtual screening using AutoDock4 and/or Vina in bootable non-dedicated computer clusters and is an ideal virtual screening tool for non-experienced users.
Abstract: Virtual screening of small molecules using molecular docking has become an important tool in drug discovery. However, large scale virtual screening is time demanding and usually requires dedicated computer clusters. There are a number of software tools that perform virtual screening using AutoDock4 but they require access to dedicated Linux computer clusters. Also, no software is available for performing virtual screening with Vina using computer clusters. In this paper we present MOLA, an easy-to-use graphical user interface tool that automates parallel virtual screening using AutoDock4 and/or Vina in bootable non-dedicated computer clusters. MOLA automates several tasks including: ligand preparation, parallel AutoDock4/Vina job distribution and result analysis. When the virtual screening project finishes, an OpenOffice spreadsheet file opens with the ligands ranked by binding energy and distance to the active site. All result files can automatically be recorded on a USB flash drive or on the hard-disk drive using VirtualBox. MOLA works inside a customized Live CD GNU/Linux operating system, developed by us, that bypasses the original operating system installed on the computers used in the cluster. This operating system boots from a CD on the master node and then clusters other computers as slave nodes via ethernet connections. MOLA is an ideal virtual screening tool for non-experienced users, with a limited number of multi-platform heterogeneous computers available and no access to dedicated Linux computer clusters. When a virtual screening project finishes, the computers can just be restarted to their original operating system. The originality of MOLA lies in the fact that any platform-independent computer available can be added to the cluster, without ever using the computer hard-disk drive and without interfering with the installed operating system. With a cluster of 10 processors, and a potential maximum speed-up of 10×, the parallel algorithm of MOLA performed with a speed-up of 8.64× using AutoDock4 and 8.60× using Vina.

Journal ArticleDOI
TL;DR: OrChem, an extension for the Oracle 11G database that adds registration and indexing of chemical structures to support fast substructure and similarity searching, and provides similarity searching with response times in the order of seconds for databases with millions of compounds.
Abstract: Background: Registration, indexing and searching of chemical structures in relational databases is one of the core areas of cheminformatics. However, little detail has been published on the inner workings of search engines and their development has been mostly closed-source. We decided to develop an open source chemistry extension for Oracle, the de facto database platform in the commercial world.

Journal ArticleDOI
TL;DR: An overview of the history of ChemSpider, the present capabilities of the platform and how it can become one of the primary foundations of the semantic web for chemistry are provided.
Abstract: There is an increasing availability of free and open access resources for chemists to use on the internet. Coupled with the increasing availability of Open Source software tools, we are in the middle of a revolution in data availability and tools to manipulate these data. ChemSpider is a free access website for chemists built with the intention of providing a structure centric community for chemists. It was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge. There are tens if not hundreds of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data etc., and no single way to search across them. Despite the fact that a large number of databases containing chemical compounds and data were available online, their inherent quality, accuracy and completeness were lacking in many regards. The intention with ChemSpider was to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data, experimental properties and linking to other valuable resources. It has grown into a resource containing over 21 million unique chemical structures from over 200 data sources. ChemSpider has enabled real time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including with activity data) and transaction-based predictions of physicochemical data. The social community aspects of the system demonstrate the potential of this approach. Curation of the data continues daily and thousands of edits and depositions by members of the community have dramatically improved the quality of the data relative to other public resources for chemistry. This presentation will provide an overview of the history of ChemSpider, the present capabilities of the platform and how it can become one of the primary foundations of the semantic web for chemistry. It will also discuss some of the present projects underway since the acquisition of ChemSpider by the Royal Society of Chemistry.

Journal ArticleDOI
TL;DR: The design and properties of the KnowledgeSpace and other in-house chemistry spaces that build on the same strategy as well as validation of results and a number of successful applications including prospective results are presented.
Abstract: Virtual high throughput screening of in-house compound collections and vendor catalogs is a validated approach in the quest for novel molecular entities. However, these libraries are small compared to the overall synthesizable number of compounds from validated "wet" chemical reactions in pharma companies or the public domain. In order to overcome this limitation, we designed a large virtual combinatorial chemistry space from publicly available combinatorial libraries that gives access to billions of synthetically accessible compounds. Together with FTrees, a fuzzy similarity calculator, the researcher has a means of searching this KnowledgeSpace for analogues to one or several query molecules within a few minutes. The resulting compounds not only exhibit similar properties to the query molecule(s), but also feature an annotation through which of the synthetic routes these molecules can be made. Results can be expected to be diverse, based on FTrees scaffold hopping capabilities, and provide ideas for hit follow-up into novel compound classes. In this contribution we present the design and properties of the KnowledgeSpace and other in-house chemistry spaces that build on the same strategy as well as validation of results and a number of successful applications including prospective results.

Journal ArticleDOI
TL;DR: In this paper, the authors present pharmACOphore, a new approach for pairwise as well as multiple flexible alignment of ligands based on ant colony optimization, which describes ligand similarity by minimizing the distance of pharmacophoric features.
Abstract: The flexible superimposition of biologically active ligands is a crucial step in ligand-based drug design. Here we present pharmACOphore, a new approach for pairwise as well as multiple flexible alignment of ligands based on ant colony optimization (ACO; Dorigo, M.; Stutzle, T. Ant Colony Optimization; MIT Press: Cambridge, MA, USA, 2004). An empirical scoring function is used, which describes ligand similarity by minimizing the distance of pharmacophoric features. The scoring function was parametrized on pairwise alignments of ligand sets for four proteins from diverse protein families (cyclooxygenase-2, cyclin-dependent kinase 2, factor Xa and peroxisome proliferator-activated receptor γ). The derived parameters were assessed with respect to pose prediction performance on the independent FlexS data set (Lemmen, C.; Lengauer, T.; Klebe, G. J. Med. Chem. 1998, 41, 4502−4520) in exhaustive pairwise alignments. Additionally, multiple flexible alignment experiments were carried out for the pharmacologically ...

Journal ArticleDOI
Thorsten Meinl, C. Ostermann, O. Nimz, Andrea Zaliani, Berthold
TL;DR: It is shown that the task of diversity selection is quite complicated and that heuristic approaches are therefore needed for typical dataset sizes; among the heuristics evaluated, Score Erosion is by far the fastest while finding solutions of equal quality compared to the genetic algorithm and BB2.
Abstract: Diversity selection is a common task in early drug discovery, be it for removing redundant molecules prior to HTS or reducing the number of molecules to synthesize from scratch. One drawback of the current approach, especially with regard to HTS, is, however, that only the structural diversity is taken into account. The fact that a molecule may be highly active or completely inactive is usually ignored. This is especially remarkable, as quite a lot of research is involved in improving virtual screening methods in order to forecast activity. We therefore present a modified version of diversity selection -- which we termed Maximum-Score Diversity Selection -- which additionally takes the predicted activities of the molecules into account. Not very surprisingly, both objectives -- maximizing activity whilst also maximizing diversity in the selected subset -- conflict. As a result, we end up with a multiobjective optimization problem. We will show that the task of diversity selection is quite complicated (it is NP-complete) and that heuristic approaches are therefore needed for typical dataset sizes. A common and popular approach is using multiobjective genetic algorithms, such as NSGA-II [1], for optimizing both objectives for the selected subsets. However, we will show that common implementations suffer from severe limitations that prevent them from finding many potentially interesting solutions. Therefore, we evaluated two other heuristics for maximum-score diversity selection. One is a special heuristic (called BB2) motivated by the mentioned proof of NP-completeness [2]. The other is a novel heuristic called Score Erosion, which was specifically developed for this problem. Among all three heuristics, Score Erosion is by far the fastest one while finding solutions of equal quality compared to the genetic algorithm and BB2. This will be shown on several real world datasets, both public and internal ones. All experiments were carried out using the data analysis platform KNIME [3]; therefore, we will also show some examples of how maximum-score diversity selection can be performed inside workflow-based environments.
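For orientation, the activity/diversity trade-off can be illustrated with a simple greedy scheme (a generic sketch with assumed inputs, not the Score Erosion or BB2 heuristics described above): each step picks the molecule with the best weighted sum of predicted activity and distance to the subset selected so far.

```python
import numpy as np

def greedy_score_diversity(dist, score, k, weight=0.5):
    """Greedy selection of k items balancing predicted activity against diversity.

    dist   : (n, n) pairwise distance matrix in descriptor space
    score  : (n,)   predicted activities, scaled to [0, 1]
    weight : 1.0 selects purely by activity, 0.0 purely by diversity
    """
    selected = [int(np.argmax(score))]              # start with the top-scoring item
    while len(selected) < k:
        min_dist = dist[:, selected].min(axis=1)    # distance to nearest selected item
        objective = weight * score + (1 - weight) * min_dist
        objective[selected] = -np.inf               # never pick an item twice
        selected.append(int(np.argmax(objective)))
    return selected
```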

Journal ArticleDOI
TL;DR: Bingo is a data cartridge for Oracle database that provides the industry's next-generation, fast, scalable, and efficient storage and searching solution for chemical information.
Abstract: Bingo is a data cartridge for Oracle database that provides the industry's next-generation, fast, scalable, and efficient storage and searching solution for chemical information. Bingo seamlessly integrates the chemistry into Oracle databases. Its extensible indexing is designed to enable scientists to store, index, and search chemical moieties alongside numbers and text within one underlying relational database server. For molecule structure searching, Bingo supports 2D and 3D exact and substructure searches, as well as similarity, tautomer, Markush, formula, molecular weight, and flexmatch searches. For reaction searches, Bingo supports reaction substructure search (RSS) with optional automatic generation of atom-to-atom mapping. All of these techniques are available through extensions to the SQL and PL/SQL syntax. Bingo also has features not present in other cartridges, for example, advanced tautomer search, resonance substructure search, and fast updating of the index when adding new structures. The presentation itself can be downloaded from our site: http://opensource.scitouch.net/downloads/bingo-cic.pdf

Journal ArticleDOI
TL;DR: From the 8th to the 10th of November 2009, the Chemistry-Information-Computers (CIC) division of the German Chemical Society (GDCh) invited the chemoinformatics and modeling community to Goslar, Germany, to participate in the 5th German Conference on Chemoinformatics (GCC 2009).
Abstract: From the 8th to the 10th of November 2009, the Chemistry-Information-Computers (CIC) division of the German Chemical Society (GDCh) invited the chemoinformatics and modeling community to Goslar, Germany, to participate in the 5th German Conference on Chemoinformatics (GCC 2009). The international symposium addressed a broad range of modern research topics in the field of computers and chemistry. The focus was on recent developments and trends in the fields of Chemoinformatics and Drug Discovery, Chemical Information, Patents and Databases, Molecular Modeling, Computational Material Science and Nanotechnology. In addition, other contributions from the field of Computational Chemistry were welcome. The conference was traditionally opened with a "Free-Software-Session" on Sunday afternoon right before the official conference opening at 5 pm, including three talks about the Open Source projects Bingo, Dingo and OrChem. In parallel, the "Chemoinformatics Market Place" took place, including software tutorials by the Chemical Computing Group, the Helmholtz Center Munich and the Cambridge Crystallographic Data Centre. The scientific program was opened by an evening talk giving an overview of the field of Systems Chemistry (Gunter von Kiedrowski). In addition, the program included six plenary lectures [Eberhard Voit (USA), Knut Baumann (Germany), Thomas Kostka (Germany), Anthony J. Williams (USA), Karl-Heinz Baringhaus (Germany), Christoph Sotriffer (Germany)], 17 general lectures as well as 54 poster presentations. Besides the scientific program, a special highlight of the conference was the FIZ CHEMIE Berlin 2009 awards ceremony on Monday afternoon (Figure 1). The CIC division awards this prize each year to the best diploma thesis and the best PhD thesis in the field of Computational Chemistry. The prize for the PhD thesis was awarded to Dr. Jose Batista from the group of Prof. Dr. Jurgen Bajorath, University of Bonn, for his dissertation "Analysis of Random Fragment Profiles for the Detection of Structure-Activity Relationships". The award for the best diploma thesis went to Frank Tristram from the group of Dr. Wolfgang Wenzel, Karlsruhe Institute of Technology, for his thesis "Modellierung der Hauptkettenbeweglichkeit in der rechnergestutzten Medikamentenentwicklung" (modelling of main-chain flexibility in computer-aided drug development). Figure 1: FIZ CHEMIE Berlin Awards 2009: from left to right, Frank Tristram (award for the best diploma thesis), Rene de Planque (Head of FIZ CHEMIE Berlin), Frank Oellien (Chair of the GDCh-CIC division), Jose Batista (award for the best PhD thesis).

Journal ArticleDOI
TL;DR: DrugScorePPI was used to successfully identify hotspots in multiple protein interfaces and for computational alanine-scanning of a dataset of 18 targets to predict changes in the binding free energy upon mutations in the interface.
Abstract: Protein-protein complexes are known to play key roles in many cellular processes. Therefore, knowledge of the three-dimensional structure of protein complexes is of fundamental importance. A key goal in protein-protein docking is to identify near-native protein-complex structures. In this work, we address this problem by deriving a knowledge-based scoring function from protein-protein complex structures and further fine-tuning of the statistical potentials against experimentally determined alanine-scanning results. Based on the formalism of the DrugScore approach [1], distance-dependent pair potentials are derived from 850 crystallographically determined protein-protein complexes [2]. These DrugScorePPI potentials display quantitative differences compared to those of DrugScore, which was derived from protein-ligand complexes. When used as an objective function to score a non-redundant dataset of 54 targets with "unbound perturbation" solutions, DrugScorePPI was able to rank a near-native solution in the top ten in 89% and in the top five in 65% of the cases. Applied to a dataset of "unbound docking" solutions, DrugScorePPI was able to rank a near-native solution in the top ten in 100% and in the top five in 67% of the cases. Furthermore, DrugScorePPI was used for computational alanine-scanning of a dataset of 18 targets with a total of 309 mutations to predict changes in the binding free energy upon mutations in the interface. Computed and experimental values showed a correlation of R2 = 0.34. To improve the predictive power, a QSAR model was built based on 24 residue-specific atom types, which improves the correlation coefficient to a value of 0.53, with a root mean square deviation of 0.89 kcal/mol. A Leave-One-Out analysis yields a correlation coefficient of 0.41. This clearly demonstrates the robustness of the model. The application to an independent validation dataset of alanine mutations was used to show the predictive power of the method and yields a correlation coefficient of 0.51. Based on these findings, DrugScorePPI was used to successfully identify hotspots in multiple protein interfaces. These results suggest that DrugScorePPI is an adequate method to score protein-protein interactions.
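The general shape of such knowledge-based, distance-dependent pair potentials can be written down in a few lines (an inverse-Boltzmann-style sketch under generic assumptions, not the exact DrugScorePPI parametrisation): the potential for an atom-type pair is the negative log-ratio of its observed distance distribution to a pooled reference distribution.

```python
import numpy as np

def pair_potential(pair_distances, reference_distances, bins=np.arange(1.0, 12.0, 0.5)):
    """Distance-dependent statistical potential for one atom-type pair, derived
    from observed interface distances relative to a reference distribution
    pooled over all atom-type pairs."""
    g_pair, _ = np.histogram(pair_distances, bins=bins, density=True)
    g_ref, _ = np.histogram(reference_distances, bins=bins, density=True)
    eps = 1e-6                              # avoid log(0) in sparsely populated bins
    return -np.log((g_pair + eps) / (g_ref + eps))
```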

Journal ArticleDOI
TL;DR: It can be shown that data splitting into a training set and an external test set often estimates the prediction error less precisely than proper cross-validation, which causes seemingly paradoxical phenomena such as the so-called "Kubinyi's paradox" for small data sets.
Abstract: Cross-validation was originally invented to estimate the prediction error of a mathematical modelling procedure. It can be shown that cross-validation estimates the prediction error almost unbiasedly. Nonetheless, there are numerous reports in the chemoinformatic literature that cross-validated figures of merit cannot be trusted and that a so-called external test set has to be used to estimate the prediction error of a mathematical model. In most cases where cross-validation fails to estimate the prediction error correctly, this can be traced back to the fact that it was employed as an objective function for model selection. Typically, each model has some meta-parameters that need to be tuned, such as the choice of the actual descriptors and the number of variables in a QSAR equation, the network topology of a neural net, or the complexity of a decision tree. In this case the meta-parameter is varied and the cross-validated prediction error is determined for each setting. Finally, the parameter setting is chosen that optimizes the cross-validated prediction error in an attempt to optimize the predictivity of the model. However, in these cases cross-validation is no longer an unbiased estimator of the prediction error and may grossly deviate from the result of an external test set. It can be shown that the "amount" of model selection can directly be related to the inflation of cross-validated figures of merit. Hence, the model selection step has to be separated from the step of estimating the prediction error. If this is done correctly, cross-validation (or resampling in general) retains its property of unbiasedly estimating the prediction error. In fact, it can be shown that data splitting into a training set and an external test set often estimates the prediction error less precisely than proper cross-validation. It is this variability of prediction errors, which depends on test set size, that causes seemingly paradoxical phenomena such as the so-called "Kubinyi's paradox" for small data sets.
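The separation of model selection from error estimation that is called for above is exactly what nested cross-validation provides; a minimal scikit-learn sketch with an illustrative parameter grid and synthetic data (not the data or models of the presentation):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVR

X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)

# Inner loop: tune meta-parameters. Outer loop: estimate the prediction error of
# the *whole* selection procedure, so the estimate stays (nearly) unbiased.
inner = GridSearchCV(SVR(), {"C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
                     cv=KFold(n_splits=5, shuffle=True, random_state=1))
outer_scores = cross_val_score(inner, X, y,
                               cv=KFold(n_splits=5, shuffle=True, random_state=2),
                               scoring="neg_root_mean_squared_error")
print(outer_scores.mean())   # the inner CV score alone would be optimistically biased
```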

Journal ArticleDOI
TL;DR: This work presents methods for the descriptive analysis of screening data that systematically mine the network for SAR pathways, i.e. sequences of pairwise similar compounds that connect two molecules via a gradually increasing potency gradient.
Abstract: The analysis of high-throughput screening data poses significant challenges to medicinal and computational chemists. The number of compounds assayed in a single screen is prohibitively large for manual data analysis and no generally applicable computational methods have thus far been developed to consistently solve the problem of how to best select hits for further chemical exploration. Focusing on the question of how structure-activity relationship (SAR) information can be used to support this decision, we present methods for the descriptive analysis of screening data. Network representations visualize the distribution of 2D similarity relationships and potency in a data set and give an overview of global and local features of an activity landscape. Although dominated by many weakly active hits, different local SAR environments can be identified among screening hits, thus helping to focus on regions in chemical space that might show favorable SAR behavior in further exploration [1]. A more detailed analysis of the data is achieved by systematically mining the network for SAR pathways, i.e. sequences of pairwise similar compounds that connect two molecules via a gradually increasing potency gradient. The SAR pathways are calculated exhaustively for all possible compound pairs in a data set to identify those having most significant SAR information content. Often, high-scoring pathways lead to activity cliffs, i.e. pairs of similar compounds with significant differences in potency, and scaffold transitions can be observed along the pathways. Furthermore, a tree structure organizes alternative pathways that begin at the same compound but lead to different molecules and chemotypes. Similarly, SAR trees can be generated from all pathways that lead to an activity cliff in order to characterize the surrounding SAR microenvironment [2].
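A sketch of the pathway-mining step (an illustrative graph formulation with assumed inputs, using networkx rather than the authors' implementation): on a 2D-similarity network, enumerate simple paths between two compounds and keep those along which potency increases at every step.

```python
import networkx as nx

def sar_pathways(similarity_graph, start, end, potency, cutoff=6):
    """Yield paths from `start` to `end` in which each step moves to a similar
    compound (an edge of the graph) with strictly higher potency (e.g. pKi)."""
    for path in nx.all_simple_paths(similarity_graph, start, end, cutoff=cutoff):
        if all(potency[a] < potency[b] for a, b in zip(path, path[1:])):
            yield path

# High-scoring pathways of this kind typically end at activity cliffs, i.e. pairs
# of similar compounds with a large potency difference, as described above.
```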

Journal ArticleDOI
TL;DR: A unified conceptual framework to describe and quantify the important issue of the Applicability Domains (AD) of Quantitative Structure-Activity Relationships (QSARs) is proposed, and statistical tools developed to tackle this latter aspect led to a unified AD metric benchmarking scheme.
Abstract: The present work proposes a unified conceptual framework to describe and quantify the important issue of the Applicability Domains (AD) of Quantitative Structure-Activity Relationships (QSARs). AD models are conceived as meta-models designed to associate an untrustworthiness score to any molecule M subject to property prediction by a QSAR model. Untrustworthiness scores or "AD metrics" are an expression of the relationship between M (represented by its descriptors in chemical space) and the space zones populated by the training molecules at the basis of model μ. Scores integrating some of the classical AD criteria (similarity-based, box-based) were considered in addition to newly invented terms, such as the dissimilarity to outlier-free training sets and the correlation breakdown count. A loose correlation is expected to exist between this untrustworthiness and the error affecting the predicted property. While high untrustworthiness does not preclude correct predictions, inaccurate predictions at low untrustworthiness must be imperatively avoided. This kind of relationship is characteristic of the Neighborhood Behavior (NB) problem: dissimilar molecule pairs may or may not display similar properties, but similar molecule pairs with different properties are explicitly "forbidden". Therefore, statistical tools developed to tackle this latter aspect were applied and led to a unified AD metric benchmarking scheme. A first use of untrustworthiness scores resides in the prioritization of predictions, without the need to specify a hard AD border. Moreover, if a significant set of external compounds is available, the formalism allows optimal AD borderlines to be fitted. Finally, consensus AD definitions were built by means of a nonparametric mixing scheme of two AD metrics of comparable quality, and shown to outperform their respective parents.
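One of the classical similarity-based criteria mentioned above is easy to express as an untrustworthiness score; a minimal sketch assuming a plain Euclidean descriptor space and a k-nearest-neighbour distance (one possible AD metric among those benchmarked, not the paper's consensus scheme):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def untrustworthiness(X_train, X_query, k=5):
    """Mean distance of each query molecule to its k nearest training molecules
    in descriptor space; larger values mean the molecule lies further away from
    the space zones populated by the training set."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    dist, _ = nn.kneighbors(X_query)
    return dist.mean(axis=1)
```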

Journal ArticleDOI
TL;DR: The MembraneEditor (CmME) has been extended and tested to meet the requirements of different PDB visualization programs as well as molecular dynamics (MD) simulation environments like Gromacs.
Abstract: Background: Today, only a few programs support membrane computation and/or modeling in 3D. They enable the user to create very simple-structured membrane layers and usually assume a high level of biochemical/biophysical knowledge. The CELLmicrocosmos 2 project developed a tool providing a simplified workflow to create membrane (bi-)layers: the MembraneEditor (CmME). Results: The geometry-based, scalable and modular computation concept supports fast to more complex membrane generations. CmME is based on the integration of two different types of PDB [1] models: lipids are integrated with editable percental distribution values and algorithms; proteins are inserted and aligned into the bilayer manually or automatically, by using data from the PDB_TM [2] or OPM [3] database. Compatibility with other programs is offered by extensive PDB format export settings. High lipid densities are possible through advanced packing algorithms. Lipid distributions can be developed by using the Plugin-Interface. Although originally not intended to change the atomic structure of the molecules due to performance issues, it is now also possible to access the atomic level for user-defined computations. Multiple membrane (bi-)layers and microdomains are supported, as well as a reengineering function providing the re-editing of externally simulated PDB membranes. Conclusions: The capabilities of CmME have been extended and tested to meet the requirements of different PDB visualization programs as well as molecular dynamics (MD) simulation environments like Gromacs [4]. The documentation and the Java Webstart application, requiring only an internet connection and Java 6, are accessible at: http://Cm2.CELLmicrocosmos.org

Journal ArticleDOI
TL;DR: This tool aims to generate a new paradigm for structure-activity relationship knowledge bases, making QSAR/QSPR models active, user-contributed and easily accessible for benchmarking, general use and educational purposes.
Abstract: The main goal of the OCHEM database http://qspr.eu is to collect, store and manipulate chemical data for their subsequent use in model development. Its main features that distinguish it from other available databases include: 1. The database is open and it is based on Wiki-style principles. We encourage users to submit data and to correct inaccurate submitted data; 2. The database is aimed at collecting high-quality data. To achieve this we require users to submit references to the article where the data was published. The reference may include the article name, journal name, date of publication, page number, line number, etc. 3. Since the compound properties may vary depending on the conditions under which they were measured, we store the measurement conditions with the data to provide the users with more accurate information about each data point. The modeling framework is being developed to complement the Wiki-style database of chemical structures. Its main goal is to provide a flexible and expandable calculation environment that would allow a user to create and manipulate QSAR and QSPR models on-line. The modeling framework is integrated with the database web-interface, which allows easy transfer of database data to the models. The web interface of the modeling environment aims to provide Web users with easy means to create high-quality prediction models and to estimate their accuracy of prediction and applicability domain. The developed models can be published on the Web and be accessed by other users to predict new molecules on-line. This tool aims to generate a new paradigm for structure-activity relationship knowledge bases, making QSAR/QSPR models active, user-contributed and easily accessible for benchmarking, general use and educational purposes. The use of the database within national and EU projects will be illustrated with examples.

Journal ArticleDOI
TL;DR: The current technical state of the InChI algorithm and how the InChI Trust is working to assure the continued support and delivery of the InChI algorithm are described.
Abstract: The IUPAC InChI/InChIKey project has evolved to the point where its future development and promulgation require a new management system that will provide stable and financially viable administrative arrangements for the foreseeable future. This is necessary to give the world-wide chemistry community that IUPAC serves the confidence that facilities for development, maintenance and support of the InChI/InChIKey algorithm are firmly established on an ongoing basis, and will ensure acceptance and usage of InChI as a mainstream standard. This new management system is provided by means of a new organization, the InChI Trust, an independent not-for-profit entity paid for by the community and those who use and benefit from the InChI algorithm. The mission of the Trust is quite simple and limited; its sole purpose is to create and support administratively and financially a scientifically robust and comprehensive InChI algorithm and related standards and protocols. This presentation will describe the current technical state of the InChI algorithm and how the InChI Trust is working to assure the continued support and delivery of the InChI algorithm.

Journal ArticleDOI
TL;DR: This work focuses on a derivative question: Can the authors determine an applicability domain measure suitable for deriving quantitative error bars which accurately reflect the expected error when making predictions for specified values of the domain measure?
Abstract: When making a prediction with a statistical model, it is not sufficient to know that the model is "good", in the sense that it is able to make accurate predictions on test data. Another relevant question is: How good is the model for a specific sample whose properties we wish to predict? Stated another way: Is the sample within or outside the model's domain of applicability, or to what degree is a test compound within the model's domain of applicability? Numerous studies have been done on determining appropriate measures to address this question [1-4]. Here we focus on a derivative question: Can we determine an applicability domain measure suitable for deriving quantitative error bars -- that is, error bars which accurately reflect the expected error when making predictions for specified values of the domain measure? Such a measure could then be used to provide an indication of the confidence in a given prediction (i.e. the likely error in a prediction based on the degree to which the test compound is part of the model's domain of applicability). Ideally, we wish such a measure to be simple to calculate and to understand, to apply to models of all types -- including classification and regression models for both molecular and non-molecular data -- and to be free of adjustable parameters. Consistent with recent work by others [5,6], the measures we have seen that best meet these criteria are distances to individual samples in the training data. We describe our attempts to construct a recipe for deriving quantitative error bars from these distances.
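A sketch of one such recipe (an assumed procedure for illustration, not necessarily the authors' final method): bin validation compounds by their distance to the nearest training samples and report the RMSE within each bin as the error bar for that range of the domain measure.

```python
import numpy as np

def error_bars_by_distance(distance, prediction_error, n_bins=5):
    """Group validation-set predictions into equal-count distance bins and
    return, per bin, the distance range and the RMSE observed in it."""
    edges = np.quantile(distance, np.linspace(0.0, 1.0, n_bins + 1))
    bars = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (distance >= lo) & (distance <= hi)
        rmse = np.sqrt(np.mean(prediction_error[in_bin] ** 2))
        bars.append((lo, hi, rmse))
    return bars
```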

Journal ArticleDOI
TL;DR: A method to predict protein binding pockets and split them into sub-pockets such that small molecules are mostly contained within one sub-pocket; the number of test cases with more than 30% pocket coverage rises from 30% to 74% (PDBbind) and from 28% to 63% (scPDB) when considering sub-pockets.
Abstract: Computer-based prediction of protein druggability is an essential task in the drug development process. Early identification of disease modifying targets that can be modulated by low-molecular weight compounds can help to speed up and reduce costs in drug discovery. Recently, first methods have been presented performing a druggability estimation solely based on the 3D structure of the protein [1-3]. The essential first step for such methods is the identification of the active site. A multitude of methods exist for automated active site prediction [4-6]. However, most methods developed for automated docking procedures do not explicitly focus on the definition of the boundary of the active site. Since druggability estimates are based on structural descriptors of the active site, a precise description of the active site boundaries is vital for correct predictions. In this work, we present a method to predict protein binding pockets and split them into sub-pockets such that small molecules are mostly contained within one sub-pocket. The method is based on a novel strategy to geometrically detect narrow regions in pockets. For druggability predictions, such pocket descriptions result in more meaningful structural descriptors like active site surface or volume. Moreover, if several structures of one protein are known, sub-pockets can give hints about protein flexibility and induced fit conformational changes. Our method was evaluated on 718 proteins from the PDBbind [7] data set, as well as 5419 proteins from the scPDB [8] data set. Binding pockets are correctly predicted in 94% and 93% of the datasets. 38% and 45%, respectively, of the proteins from the two datasets contain pockets which can be divided into more than one sub-pocket. In all cases one sub-pocket completely covers the co-crystallized ligand. Besides the classical overlap measure of ligand versus predicted active site, we additionally considered the pocket coverage by the co-crystallized ligand. We found that the number of test cases with more than 30% pocket coverage rises from 30% to 74% (PDBbind) and from 28% to 63% (scPDB), respectively, when considering sub-pockets.

Journal ArticleDOI
TL;DR: The protein-ligand docking approach GOLD has been extended to search such conformational ensembles time-efficiently and the performance of the approach has been assessed on several protein targets using different scoring functions.
Abstract: In recent years, the importance of considering induced fit effects in molecular docking calculations has been widely recognised in the molecular modelling community. While small-scale protein side-chain movements are now accounted for in many state-of-the-art docking strategies, the explicit modelling of large-scale protein motions such as loop movements in kinase domains is still a challenging task. For this reason ensemble-based methods have been introduced taking into account several discrete protein conformations in the conformational sampling step. Our protein-ligand docking approach GOLD [1,2] has been extended to search such conformational ensembles time-efficiently. The performance of the approach has been assessed on several protein targets using different scoring functions. A detailed analysis of pose prediction and virtual screening results in dependence of the number of protein structures considered in the conformational ensemble will be presented and limitations of the approach will be highlighted.