Journal ArticleDOI

In silico prediction of chemical aquatic toxicity with chemical category approaches and substructural alerts

23 Feb 2015-Toxicology Research (The Royal Society of Chemistry)-Vol. 4, Iss: 2, pp 452-463
TL;DR: In silico models were developed for the prediction of chemical aquatic toxicity in different fish species, and information gain and ChemoTyper methods were used to identify toxic substructures that significantly correlate with chemical aquatic toxicity.
Abstract: Aquatic toxicity is an important endpoint in the evaluation of adverse chemical effects on ecosystems. In this study, in silico models were developed for the prediction of chemical aquatic toxicity in different fish species. Firstly, a large data set containing 6422 data points on aquatic toxicity with 1906 diverse chemicals was constructed. Using molecular descriptors and fingerprints to represent the molecules, local and global models were then developed with five machine learning methods based on three fish species (rainbow trout, fathead minnow and bluegill sunfish). For the local models, both binary and ternary classification models were obtained for each of the three fish species. For the global models, data of all the three fish species were used together. The predictive accuracy of both the local and global models was around 0.8 for the test sets. Moreover, data of the sheepshead minnow were used as an external validation set. For the best local model (model 2), the predictive accuracy was 0.875 for the sheepshead minnow, while for the best global model (model 14), the predictive accuracy was 0.872 for the sheepshead minnow. The numbers of false negative (FN) compounds in model 2 and model 14 were 18 and 10, respectively. Hence, model 14 was the best model and could be used to predict the toxicity of other fish species. Furthermore, information gain and ChemoTyper methods were used to identify toxic substructures, which could significantly correlate with chemical aquatic toxicity. This study provides critical tools for an early evaluation of chemical aquatic toxicity in an environmental hazard assessment.
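The modeling setup described above (molecules encoded as fingerprints, fed to machine learning classifiers for a binary toxic/non-toxic call) can be sketched as follows. Everything here is an illustrative stand-in: the random fingerprint bits, labels, and choice of random forest are assumptions, not the paper's actual data set or descriptor pipeline.

```python
# Minimal sketch of binary aquatic-toxicity classification from fingerprints.
# Synthetic data only; the paper used 1906 real chemicals and five ML methods.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
n_mols, n_bits = 400, 64
X = rng.integers(0, 2, size=(n_mols, n_bits))       # toy fingerprint bit vectors
# make the label depend on a few "toxic" bits so the task is learnable
y = (X[:, 0] + X[:, 1] + X[:, 2] >= 2).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
test_acc = accuracy_score(y_te, clf.predict(X_te))
print(f"test accuracy: {test_acc:.3f}")
```

The paper's local/global distinction would correspond to fitting this per fish species versus on the pooled data.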


Citations
Journal ArticleDOI
TL;DR: This review introduces the computational methods used to predict chemical toxicity in drug design, including machine learning methods and structural alerts, and surveys recent progress in predictive models for various toxicities along with available databases and web servers.
Abstract: During drug development, safety is always the most important issue, including a variety of toxicities and adverse drug effects, which should be evaluated in preclinical and clinical trial phases. This review article at first simply introduced the computational methods used in prediction of chemical toxicity for drug design, including machine learning methods and structural alerts. Machine learning methods have been widely applied in qualitative classification and quantitative regression studies, while structural alerts can be regarded as a complementary tool for lead optimization. The emphasis of this article was put on the recent progress of predictive models built for various toxicities. Available databases and web servers were also provided. Though the methods and models are very helpful for drug design, there are still some challenges and limitations to be improved for drug safety assessment in the future.

144 citations

Journal ArticleDOI
TL;DR: Four methods for monosubstructure identification were evaluated with three indices, including accuracy rate, coverage rate, and information gain, to compare their advantages and disadvantages; the results showed that Bioalerts and FP could detect key substructures with high accuracy and coverage rates.
Abstract: Identification of structural alerts for toxicity is useful in drug discovery and other fields such as environmental protection. With structural alerts, researchers can quickly identify potential toxic compounds and learn how to modify them. Hence, it is important to determine structural alerts from a large number of compounds quickly and accurately. There are already many methods reported for identification of structural alerts. However, how to evaluate those methods is a problem. In this paper, we tried to evaluate four of the methods for monosubstructure identification with three indices including accuracy rate, coverage rate, and information gain to compare their advantages and disadvantages. The Kazius’ Ames mutagenicity data set was used as the benchmark, and the four methods were MoSS (graph-based), SARpy (fragment-based), and two fingerprint-based methods including Bioalerts and the fingerprint (FP) method we previously used. The results showed that Bioalerts and FP could detect key substructures with high accuracy and coverage rates.

47 citations
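The information-gain index used above to rank candidate substructures can be sketched in pure Python: the gain is the drop in label entropy once molecules are split by presence or absence of the substructure. The counts below are made up for illustration.

```python
# Sketch of information gain for a single substructure vs. toxicity labels.
import math

def entropy(pos, neg):
    """Shannon entropy (bits) of a binary label distribution."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def information_gain(with_sub, without_sub):
    """with_sub / without_sub: (n_toxic, n_nontoxic) counts for molecules
    that do / do not contain the substructure."""
    pos = with_sub[0] + without_sub[0]
    neg = with_sub[1] + without_sub[1]
    n = pos + neg
    h_parent = entropy(pos, neg)
    n_w, n_wo = sum(with_sub), sum(without_sub)
    h_cond = (n_w / n) * entropy(*with_sub) + (n_wo / n) * entropy(*without_sub)
    return h_parent - h_cond

# toy counts: substructure present in 40 toxic and 5 non-toxic molecules
ig = information_gain((40, 5), (10, 45))
print(f"information gain: {ig:.3f}")
```

A substructure that perfectly separates toxic from non-toxic molecules would score the full parent entropy (1.0 bit for a balanced set).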

Journal ArticleDOI
TL;DR: A large data set containing 1163 diverse compounds with IC50 values determined by the patch clamp method was collected; models built with molecular descriptors combined with fingerprints yielded the best results, and each threshold had its own best-suited method, suggesting that hERG blockage assessment might depend on threshold values.
Abstract: The human ether-a-go-go related gene (hERG) plays an important role in cardiac action potential. It encodes an ion channel protein named Kv11.1, which is related to long QT syndrome and may cause avoidable sudden cardiac death. Therefore, it is important to assess the hERG channel blockage of lead compounds in an early drug discovery process. In this study, we collected a large data set containing 1163 diverse compounds with IC50 values determined by the patch clamp method on mammalian cell lines. The whole data set was divided into 80% as the training set and 20% as the test set. Then, five machine learning methods were applied to build a series of binary classification models based on 13 molecular descriptors, five fingerprints and molecular descriptors combining fingerprints at four IC50 thresholds to discriminate hERG blockers from nonblockers, respectively. Models built by molecular descriptors combining fingerprints were validated by using an external validation set containing 407 compounds collected from the hERGCentral database. The performance indicated that the model built by molecular descriptors combining fingerprints yielded the best results and each threshold had its best suitable method, which means that hERG blockage assessment might depend on threshold values. Meanwhile, kNN and SVM methods were better than the others for model building. Furthermore, six privileged substructures were identified using information gain and frequency analysis methods, which could be regarded as structural alerts of cardiac toxicity mediated by hERG channel blockage.

47 citations
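The threshold-based labeling step described above (discriminating hERG blockers from nonblockers at several IC50 cutoffs) can be sketched as below. The compound names, IC50 values, and the 10 µM cutoff are illustrative assumptions; the paper evaluated four thresholds on patch-clamp data.

```python
# Sketch: binarize IC50 values into blocker (1) / nonblocker (0) at a cutoff.
ic50_um = {"cpd_A": 0.5, "cpd_B": 8.0, "cpd_C": 25.0, "cpd_D": 120.0}  # toy values

def label_blockers(ic50_map, threshold_um):
    """Compounds at or below the IC50 threshold are labeled blockers."""
    return {name: int(v <= threshold_um) for name, v in ic50_map.items()}

labels_10um = label_blockers(ic50_um, 10.0)
print(labels_10um)  # cpd_A and cpd_B fall below the 10 µM cutoff
```

Repeating this at each threshold yields one binary classification problem per cutoff, which is why each threshold can end up with a different best-performing learner.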

Journal ArticleDOI
TL;DR: The results suggest the appropriateness of the developed QSAR models to reliably predict the toxicity of pesticides in multiple avian test species, and the models can be useful tools in screening new chemical pesticides for regulatory purposes.
Abstract: A comprehensive safety evaluation of chemicals should require toxicity assessment in both the aquatic and terrestrial test species. Due to the application practices and nature of chemical pesticides, avian toxicity testing is considered an essential requirement in the risk assessment process. In this study, tree-based multispecies QSAR (quantitative structure-activity relationship) models were constructed for predicting the avian toxicity of pesticides using a set of nine descriptors derived directly from the chemical structures and following the OECD guidelines. Accordingly, the Bobwhite quail toxicity data were used to construct the QSAR models (SDT, DTF, DTB), which were externally validated using the toxicity data in four other test species (Mallard duck, Ring-necked pheasant, Japanese quail, House sparrow). Prior to the model development, the diversity in the chemical structures and end-point were verified. The external predictive power of the QSAR models was tested through rigorous validation der...

38 citations
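The tree-based QSAR workflow above can be sketched with a single decision tree regressor fit on synthetic descriptor data (nine descriptors, a continuous toxicity target). This is purely illustrative and is not the paper's SDT/DTF/DTB models or its pesticide data.

```python
# Sketch: decision-tree QSAR regression on nine toy structural descriptors.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 9))                       # nine toy "descriptors"
# synthetic toxicity driven by two descriptors plus noise
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

qsar_tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
r2 = qsar_tree.score(X, y)
print(f"training R^2: {r2:.3f}")
```

External validation in the paper's sense would mean scoring such a model on toxicity data from species not used in training.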

Journal ArticleDOI
TL;DR: Ten structural fragments which can be used to assess the genotoxicity potential of a chemical were identified by using information gain and structural fragment frequency analysis and might be helpful for the initial screening of potential genotoxic compounds.
Abstract: Genotoxicity tests can detect compounds that have an adverse effect on the process of heredity. The in vivo micronucleus assay, a genotoxicity test method, has been widely used to evaluate the presence and extent of chromosomal damage in human beings. Due to the high cost and laboriousness of experimental tests, computational approaches for predicting genotoxicity based on chemical structures and properties are recognized as an alternative. In this study, a dataset containing 641 diverse chemicals was collected and the molecules were represented by both fingerprints and molecular descriptors. Then classification models were constructed by six machine learning methods, including the support vector machine (SVM), naive Bayes (NB), k-nearest neighbor (kNN), C4.5 decision tree (DT), random forest (RF) and artificial neural network (ANN). The performance of the models was estimated by five-fold cross-validation and an external validation set. The top ten models showed excellent performance for the external validation with accuracies ranging from 0.846 to 0.938, among which models Pubchem_SVM and MACCS_RF showed a more reliable predictive ability. The applicability domain was also defined to distinguish favorable predictions from unfavorable ones. Finally, ten structural fragments which can be used to assess the genotoxicity potential of a chemical were identified by using information gain and structural fragment frequency analysis. Our models might be helpful for the initial screening of potential genotoxic compounds.

27 citations
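The protocol above (several named learners compared under five-fold cross-validation) can be sketched as follows. The synthetic fingerprint data and the specific hyperparameters are assumptions for illustration; the paper additionally used C4.5 decision trees and an ANN, and evaluated on a real external set.

```python
# Sketch: five-fold cross-validation over several of the named classifiers.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(300, 32)).astype(float)   # toy fingerprint bits
y = (X[:, 0] + X[:, 1] >= 1).astype(int)               # learnable toy label

models = {
    "SVM": SVC(),
    "NB": BernoulliNB(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
}
cv_acc = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
for name, acc in cv_acc.items():
    print(f"{name}: {acc:.3f}")
```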

References
Journal ArticleDOI
01 Oct 2001
TL;DR: Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Abstract: Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large. The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them. Using a random selection of features to split each node yields error rates that compare favorably to Adaboost (Y. Freund & R. Schapire, Machine Learning: Proceedings of the Thirteenth International Conference, 148–156), but are more robust with respect to noise. Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the splitting. Internal estimates are also used to measure variable importance. These ideas are also applicable to regression.

79,257 citations
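Breiman's "internal estimates" correspond to out-of-bag (OOB) error and variable importance, which scikit-learn exposes directly; the sketch below uses synthetic data for illustration only.

```python
# Sketch: a random forest's internal OOB estimate and variable importances.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # only features 0 and 1 matter

rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)
print(f"OOB accuracy: {rf.oob_score_:.3f}")
top_two = np.argsort(rf.feature_importances_)[-2:]
print("two most important features:", sorted(top_two.tolist()))
```

Because each tree is trained on a bootstrap sample, the roughly one-third of points left out of each tree provide a built-in test set, which is why no separate hold-out is needed for the OOB estimate.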

Journal ArticleDOI
TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.
Abstract: The support-vector network is a new learning machine for two-group classification problems. The machine conceptually implements the following idea: input vectors are non-linearly mapped to a very high-dimension feature space. In this feature space a linear decision surface is constructed. Special properties of the decision surface ensures high generalization ability of the learning machine. The idea behind the support-vector network was previously implemented for the restricted case where the training data can be separated without errors. We here extend this result to non-separable training data. High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated. We also compare the performance of the support-vector network to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

37,861 citations
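The non-linear mapping described above can be sketched with a polynomial-kernel SVM on a problem that no linear decision surface separates (XOR); the kernel implicitly maps the inputs into a higher-dimensional feature space where a linear surface suffices. Purely illustrative.

```python
# Sketch: a degree-2 polynomial kernel separates XOR, which is not
# linearly separable in the original input space.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])  # XOR labels

poly_svm = SVC(kernel="poly", degree=2, coef0=1.0, C=10.0).fit(X, y)
train_acc = poly_svm.score(X, y)
print(f"training accuracy on XOR: {train_acc:.1f}")
```

The degree-2 feature map contains the cross term x1·x2, which is exactly what makes XOR linearly separable in the lifted space.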

Book
15 Oct 1992
TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
Abstract: From the Publisher: Classifier systems play a major role in machine learning and knowledge-based systems, and Ross Quinlan's work on ID3 and C4.5 is widely acknowledged to have made some of the most significant contributions to their development. This book is a complete guide to the C4.5 system as implemented in C for the UNIX environment. It contains a comprehensive guide to the system's use, the source code (about 8,800 lines), and implementation notes. The source code and sample datasets are also available on a 3.5-inch floppy diskette for a Sun workstation. C4.5 starts with large sets of cases belonging to known classes. The cases, described by any mixture of nominal and numeric properties, are scrutinized for patterns that allow the classes to be reliably discriminated. These patterns are then expressed as models, in the form of decision trees or sets of if-then rules, that can be used to classify new cases, with emphasis on making the models understandable as well as accurate. The system has been applied successfully to tasks involving tens of thousands of cases described by hundreds of properties. The book starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting. Advantages and disadvantages of the C4.5 approach are discussed and illustrated with several case studies. This book and software should be of interest to developers of classification-based intelligent systems and to students in machine learning and expert systems courses.

21,674 citations

Journal ArticleDOI
TL;DR: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points, so it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.
Abstract: The nearest neighbor decision rule assigns to an unclassified sample point the classification of the nearest of a set of previously classified points. This rule is independent of the underlying joint distribution on the sample points and their classifications, and hence the probability of error R of such a rule must be at least as great as the Bayes probability of error R*, the minimum probability of error over all decision rules taking underlying probability structure into account. However, in a large sample analysis, we will show in the M-category case that R* ≤ R ≤ R*(2 − MR*/(M − 1)), where these bounds are the tightest possible, for all suitably smooth underlying distributions. Thus for any number of categories, the probability of error of the nearest neighbor rule is bounded above by twice the Bayes probability of error. In this sense, it may be said that half the classification information in an infinite sample set is contained in the nearest neighbor.

12,243 citations
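The Cover–Hart bound quoted above, R* ≤ R ≤ R*(2 − MR*/(M − 1)), is easy to check numerically; the Bayes risk values below are illustrative.

```python
# Numerical check of the Cover-Hart upper bound on nearest-neighbor risk.
def nn_upper_bound(bayes_risk, m_classes):
    """Upper bound R*(2 - M*R*/(M-1)) on the asymptotic NN error rate."""
    return bayes_risk * (2 - m_classes * bayes_risk / (m_classes - 1))

for r_star in (0.05, 0.1, 0.2):
    ub = nn_upper_bound(r_star, m_classes=2)
    print(f"R* = {r_star:.2f}  ->  NN risk <= {ub:.3f}  (always < 2R* = {2 * r_star:.2f})")
```

For M = 2 the bound reduces to 2R*(1 − R*), which makes the "at most twice the Bayes error" statement explicit.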

Journal ArticleDOI
TL;DR: PaDEL-Descriptor is a software for calculating molecular descriptors and fingerprints, which currently calculates 797 descriptors (663 1D, 2D descriptors, and 134 3D descriptors) and 10 types of fingerprints.
Abstract: Introduction PaDEL-Descriptor is a software for calculating molecular descriptors and fingerprints. The software currently calculates 797 descriptors (663 1D, 2D descriptors, and 134 3D descriptors) and 10 types of fingerprints. These descriptors and fingerprints are calculated mainly using The Chemistry Development Kit. Some additional descriptors and fingerprints were added, which include atom type electrotopological state descriptors, McGowan volume, molecular linear free energy relation descriptors, ring counts, count of chemical substructures identified by Laggner, and binary fingerprints and count of chemical substructures identified by Klekota and Roth. Methods PaDEL-Descriptor was developed using the Java language and consists of a library component and an interface component. The library component allows it to be easily integrated into quantitative structure activity relationship software to provide the descriptor calculation feature while the interface component allows it to be used as a standalone software. The software uses a Master/Worker pattern to take advantage of the multiple CPU cores that are present in most modern computers to speed up calculations of molecular descriptors. Results The software has several advantages over existing standalone molecular descriptor calculation software. It is free and open source, has both graphical user interface and command line interfaces, can work on all major platforms (Windows, Linux, MacOS), supports more than 90 different molecular file formats, and is multithreaded. Conclusion PaDEL-Descriptor is a useful addition to the currently available molecular descriptor calculation software. The software can be downloaded at http://padel.nus.edu.sg/software/padeldescriptor. © 2010 Wiley Periodicals, Inc. J Comput Chem, 2011

1,865 citations