Journal Article

Beware of q2

01 Jan 2002-Journal of Molecular Graphics & Modelling (J Mol Graph Model)-Vol. 20, Iss: 4, pp 269-276
TL;DR: A high LOO q2 is a necessary but not a sufficient condition for a QSAR model to have high predictive power; the authors argue that this holds generally for QSAR models developed using LOO cross-validation.
Abstract: Validation is a crucial aspect of any quantitative structure-activity relationship (QSAR) modeling. This paper examines one of the most popular validation criteria, leave-one-out cross-validated R2 (LOO q2). Often, a high value of this statistical characteristic (q2 > 0.5) is considered proof of the high predictive ability of the model. In this paper, we show that this assumption is generally incorrect. In the case of 3D QSAR, the lack of correlation between a high LOO q2 and the high predictive ability of a QSAR model has been established earlier [Pharm. Acta Helv. 70 (1995) 149; J. Chemomet. 10 (1996) 95; J. Med. Chem. 41 (1998) 2553]. In this paper, we use two-dimensional (2D) molecular descriptors and the k-nearest neighbors (kNN) QSAR method for the analysis of several datasets. No correlation between the values of q2 for the training set and predictive ability for the test set was found for any of the datasets. Thus, a high value of LOO q2 appears to be a necessary but not a sufficient condition for the model to have high predictive power. We argue that this is a general property of QSAR models developed using LOO cross-validation. We emphasize that external validation is the only way to establish a reliable QSAR model. We formulate a set of criteria for evaluating the predictive ability of QSAR models.
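The distinction the abstract draws between LOO q2 on the training set and predictive ability on a test set can be made concrete with a small sketch. The code below is illustrative only and is not the authors' implementation: the kNN variant (unweighted mean of the k nearest neighbors, Euclidean distance), the function names, the synthetic data, and the choice of test-set R2 formula (1 minus the ratio of residual to total sum of squares about the test-set mean, one of several definitions in use) are all assumptions.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query as the unweighted mean activity of its k nearest training compounds."""
    # Pairwise Euclidean distances in descriptor space, shape (n_query, n_train).
    dist = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


def loo_q2(X, y, k=3):
    """Leave-one-out q2 = 1 - PRESS / SS_total, computed on the training set only."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i  # drop compound i, then predict it from the rest
        preds[i] = knn_predict(X[keep], y[keep], X[i:i + 1], k)[0]
    press = np.sum((y - preds) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def external_r2(X_train, y_train, X_test, y_test, k=3):
    """R2 on a held-out test set, referenced to the test-set mean (an assumed definition)."""
    preds = knn_predict(X_train, y_train, X_test, k)
    return 1.0 - np.sum((y_test - preds) ** 2) / np.sum((y_test - y_test.mean()) ** 2)


# Synthetic stand-in for a descriptor matrix and activities (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X[:, 0] + 0.1 * rng.normal(size=60)
X_tr, y_tr, X_te, y_te = X[:40], y[:40], X[40:], y[40:]
print("LOO q2 (training):", loo_q2(X_tr, y_tr))
print("external R2 (test):", external_r2(X_tr, y_tr, X_te, y_te))
```

The two numbers are computed from disjoint information: q2 never sees the test compounds, which is why a high q2 alone says nothing definite about external predictions.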
Citations
Journal Article
TL;DR: A set of simple guidelines for developing validated and predictive QSPR models is presented, highlighting the need to establish the domain of model applicability in the chemical space to flag molecules for which predictions may be unreliable, and some algorithms that can be used for this purpose.
Abstract: This paper emphasizes the importance of rigorous validation as a crucial, integral component of Quantitative Structure Property Relationship (QSPR) model development. We consider some examples of published QSPR models, which in spite of their high fitted accuracy for the training sets and apparent mechanistic appeal, fail rigorous validation tests, and, thus, may lack practical utility as reliable screening tools. We present a set of simple guidelines for developing validated and predictive QSPR models. To this end, we discuss several validation strategies including (1) randomization of the modelled property, also called Y-scrambling, (2) multiple leave-many-out cross-validations, and (3) external validation using rational division of a dataset into training and test sets. We also highlight the need to establish the domain of model applicability in the chemical space to flag molecules for which predictions may be unreliable, and discuss some algorithms that can be used for this purpose. We advocate the broad use of these guidelines in the development of predictive QSPR models.
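Of the validation strategies listed above, Y-scrambling is the easiest to sketch. The code below is a minimal illustration under stated assumptions: the model is ordinary least squares with an intercept (the abstract does not prescribe a model type), the function names `fit_r2` and `y_scramble` are hypothetical, and the data are synthetic.

```python
import numpy as np


def fit_r2(X, y):
    """Fitted R2 of an ordinary least-squares model with an intercept term."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


def y_scramble(X, y, n_rounds=100, seed=0):
    """Refit the model on randomly permuted responses and return the scrambled R2 values."""
    rng = np.random.default_rng(seed)
    return np.array([fit_r2(X, rng.permutation(y)) for _ in range(n_rounds)])


# Synthetic descriptors and property values (hypothetical data).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=50)

r2_true = fit_r2(X, y)
r2_scrambled = y_scramble(X, y)
print(f"true R2 = {r2_true:.3f}, max scrambled R2 = {r2_scrambled.max():.3f}")
```

If the scrambled fits score nearly as well as the real one, the apparent structure-property relationship is likely a chance correlation.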

1,838 citations

Journal Article
TL;DR: Evidence is presented that only models that have been validated externally, after their internal validation, can be considered reliable and applicable for both external prediction and regulatory purposes.
Abstract: The recent REACH Policy of the European Union has led scientists and regulators to focus their attention on establishing general validation principles for QSAR models in the context of chemical regulation (previously known as the Setubal principles, now the OECD principles). This paper gives a brief analysis of some of these principles: unambiguous algorithm, Applicability Domain (AD), and statistical validation. Some concerns related to QSAR algorithm reproducibility and an example of a fast check of the applicability domain for MLR models are presented. Common myths and misconceptions related to popular techniques for verifying internal predictivity, particularly for MLR models (for instance cross-validation and bootstrap), are commented on and compared with commonly used statistical techniques for external validation. The differences between the two validation approaches are highlighted, and evidence is presented that only models that have been validated externally, after their internal validation, can be considered reliable and applicable for both external prediction and regulatory purposes. (“Validation is one of those words...that is constantly used and seldom defined” as stated by A. R. Feinstein in the book Multivariate Analysis: An Introduction, Yale University Press, New Haven, 1996).
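The abstract mentions a fast applicability-domain check for MLR models without saying how it works. One widely used approach is the leverage (hat-value) test, sketched below as an assumption rather than as the method of that paper: a compound is flagged when its leverage exceeds the warning value h* = 3(p + 1)/n, where p is the number of descriptors and n the number of training compounds. Function names and data are hypothetical.

```python
import numpy as np


def leverages(X_train, X_query):
    """Leverage h = x^T (X^T X)^-1 x of each query row against the training design matrix."""
    A = np.column_stack([np.ones(len(X_train)), X_train])  # intercept column + descriptors
    Q = np.column_stack([np.ones(len(X_query)), X_query])
    xtx_inv = np.linalg.pinv(A.T @ A)
    # Row-wise quadratic form q_i^T (A^T A)^-1 q_i.
    return np.einsum("ij,jk,ik->i", Q, xtx_inv, Q)


def inside_domain(X_train, X_query):
    """True where the leverage is at or below the warning value h* = 3(p + 1)/n."""
    n, p = X_train.shape
    h_star = 3.0 * (p + 1) / n
    return leverages(X_train, X_query) <= h_star


# Synthetic training descriptors and two query compounds (hypothetical data).
rng = np.random.default_rng(2)
X_train = rng.normal(size=(30, 2))
X_query = np.array([[0.1, -0.2], [100.0, 100.0]])
print(inside_domain(X_train, X_query))
```

A compound far from the training descriptors gets a large leverage, so an MLR prediction for it is an extrapolation and should be treated with caution.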

1,697 citations

Journal Article
TL;DR: The most critical QSAR modeling routines, regarded as best practices in the field, are examined, including procedures used to validate models, both internally and externally, as well as the need to define model applicability domains that should be used when models are employed for the prediction of external compounds or compound libraries.
Abstract: After nearly five decades "in the making", QSAR modeling has established itself as one of the major computational molecular modeling methodologies. Like any mature research discipline, QSAR modeling can be characterized by a collection of well-defined protocols and procedures that enable the expert application of the method for exploring and exploiting ever-growing collections of biologically active chemical compounds. This review examines the most critical QSAR modeling routines that we regard as best practices in the field. We discuss these procedures in the context of an integrative predictive QSAR modeling workflow that is focused on achieving models of the highest statistical rigor and external predictive power. Specific elements of the workflow consist of data preparation including chemical structure (and when possible, associated biological data) curation, outlier detection, dataset balancing, and model validation. We especially emphasize procedures used to validate models, both internally and externally, as well as the need to define model applicability domains that should be used when models are employed for the prediction of external compounds or compound libraries. Finally, we present several examples of successful applications of QSAR models for virtual screening to identify experimentally confirmed hits.

1,362 citations

Journal Article
TL;DR: In this paper, the authors provide guidelines for QSAR development, validation, and application, which are summarized in best practices for building rigorously validated and externally predictive quantitative structure-activity relationship models.
Abstract: Quantitative structure–activity relationship modeling is one of the major computational tools employed in medicinal chemistry. However, throughout its entire history it has drawn both praise and criticism concerning its reliability, limitations, successes, and failures. In this paper, we discuss (i) the development and evolution of QSAR; (ii) the current trends, unsolved problems, and pressing challenges; and (iii) several novel and emerging applications of QSAR modeling. Throughout this discussion, we provide guidelines for QSAR development, validation, and application, which are summarized in best practices for building rigorously validated and externally predictive QSAR models. We hope that this Perspective will help communications between computational and experimental chemists toward collaborative development and use of QSAR models. We also believe that the guidelines presented here will help journal editors and reviewers apply more stringent scientific standards to manuscripts reporting new QSAR stu...

1,314 citations

Journal Article
08 Aug 2019
TL;DR: A comprehensive overview and analysis of the most recent research in machine learning principles, algorithms, descriptors, and databases in materials science, and proposes solutions and future research paths for various challenges in computational materials science.
Abstract: One of the most exciting tools that have entered the material science toolbox in recent years is machine learning. This collection of statistical methods has already proved to be capable of considerably speeding up both fundamental and applied research. At present, we are witnessing an explosion of works that develop and apply machine learning to solid-state systems. We provide a comprehensive overview and analysis of the most recent research in this topic. As a starting point, we introduce machine learning principles, algorithms, descriptors, and databases in materials science. We continue with the description of different machine learning approaches for the discovery of stable materials and the prediction of their crystal structure. Then we discuss research in numerous quantitative structure–property relationships and various approaches for the replacement of first-principle methods by machine learning. We review how active learning and surrogate-based optimization can be applied to improve the rational design process and related examples of applications. Two major questions are always the interpretability of and the physical understanding gained from machine learning models. We consider therefore the different facets of interpretability and their importance in materials science. Finally, we propose solutions and future research paths for various challenges in computational materials science.

1,301 citations

References
Book
01 Jan 1969

16,023 citations

Journal Article
TL;DR: The main features of the CoMFA approach, exemplified by analyses of the affinities of 21 varied steroids to corticosteroid and testosterone-binding globulins, and a number of advances in the methodology of molecular graphics are described.
Abstract: Comparative molecular field analysis (CoMFA) is a promising new approach to structure/activity correlation. Its characteristic features are (1) representation of ligand molecules by their steric and electrostatic fields, sampled at the intersections of a three-dimensional lattice, (2) a new "field fit" technique, allowing optimal mutual alignment within a series, by minimizing the RMS field differences between molecules, (3) data analysis by partial least squares (PLS), using cross-validation to maximize the likelihood that the results have predictive validity, and (4) graphic representation of results, as contoured three-dimensional coefficient plots. CoMFA is exemplified by analyses of the affinities of 21 varied steroids to corticosteroid- and testosterone-binding globulins. Also described are the sensitivities of results to the nature of the field and the definition of the lattice and, for comparison, analyses of the same data using various combinations of other parameters. From these results, a set of ten steroid-binding affinity values unknown to us during the CoMFA analysis were well predicted.
A major goal in chemical research is to predict the behavior of new molecules, using relationships derived from analysis of the properties of previously tested molecules. Relationships derived primarily by empirical analysis of a data table, whose columns are numerical property values and whose rows are compounds, usually taking the form of a linear equation, are called quantitative structure/activity relationships (QSAR) [1]. Especially in biological applications, it has long been agreed that the most relevant numerical property values would be shape-dependent.
Work on comparative molecular field analysis (CoMFA) began 12 years ago with two additional observations: (1) at the molecular level, the interactions which produce an observed biological effect are usually non-covalent; and (2) molecular mechanics force fields, most of which treat non-covalent (non-bonded) interactions only as steric and electrostatic forces, can account precisely for a great variety of observed molecular properties [2]. Thus it seems reasonable that a suitable sampling of the steric and electrostatic fields surrounding a set of ligand (drug) molecules might provide all the information necessary for understanding their observed biological properties. However, the emergence of a practical CoMFA methodology had to await a new method of data analysis, partial least squares (PLS) [3], which can derive robust linear equations from tables having many more columns than rows, and a number of advances in the methodology of molecular graphics.
Other "3D-QSAR" methodologies have been described. The molecular shape (MS) approaches, developed independently by Simon et al. [4] and by Hopfinger [5], compare net, rather than location-dependent, differences in molecular connectivities, volumes, and/or fields. A second approach, the "distance geometry" method of Crippen [6], provides validation of a "site-point" hypothesis, a list of binding set coordinates and properties that must be proposed by the investigator. A prototype version of the CoMFA method is called "DYLOMMS" [7]. In related work, for exploring binding modes of ligands to receptors, Goodford [8] advocates the display of probe-interaction "grids", similar to those used in CoMFA, while Hansch, Blaney, Langridge, et al. [9] have shown the complementarity of QSAR and molecular graphics in understanding enzyme inhibitor data.
Below we describe the main features of the CoMFA approach, exemplifying its use by analyzing the binding affinities of 21 varied steroid structures to human corticosteroid-binding globulins (CBG) and testosterone-binding globulins [10] (TBG). In this series, the comparative rigidity of the steroid nucleus allows the conformational variable to be neglected, and the in vitro, particularly simple, character of the test system minimizes the importance of nonreceptor-related, hence non-shape-related, compound differences on the experimental observations [11]. We then investigated the sensitivity of the excellent results obtained to critical model assumptions. For the purpose of comparison, we have also analysed these steroid binding data using both classical and other "molecular shape" parameters, in various combinations. Finally, toward the end of this work, we were informed of additional corticosteroid binding data [12], and thus were able to test the ability of our model to predict the binding constants of ten more, structurally diverse, steroids.
Computational Methods
CoMFA Methodology. The overall data flow of a CoMFA analysis appears in Figure 1. Its top two panels show how the data table is constructed from the field values at the lattice intersections. These automatically calculated parameters are the energies of steric (van der Waals 6-12) and electrostatic (Coulombic, with a 1/r dielectric) interaction between the compound of interest and a "probe atom" placed at the various intersections of a regular three-dimensional lattice, large enough to surround all of the compounds in the series, and with a 2.0 Å separation between lattice points unless otherwise stated. The van der Waals A/B values were taken from the standard Tripos force field [13] and the atomic charges were calculated by the method of Gasteiger and Marsili [14]. Unless stated otherwise, the probe atom had the van der Waals properties of sp3 carbon and a charge of +1.0. Wherever the probe atom experiences a steric repulsion greater than "cutoff" (30 kcal/mol
[1] Martin, Y. C. Quantitative Drug Design; Marcel Dekker: New York, 1978.
[2] Burkert, U.; Allinger, N. L. Molecular Mechanics; American Chemical Society: Washington, DC, 1982.
[3] Wold, S.; Ruhe, A.; Wold, H.; Dunn, W. J., III SIAM J. Sci. Stat. Comput. 1984, 5, 135.
[4] Simon, Z.; Badileuscu, I.; Racovitan, T. J. Theor. Biol. 1977, 66, 485. Simon, Z.; Dragomir, N.; Plauchithiu, M. G.; Holban, S.; Glatt, H.; Kerek, F. Eur. J. Med. Chem. 1980, 15, 521.
[5] Hopfinger, A. J. J. Am. Chem. Soc. 1980, 102, 7196.
[6] Ghose, A. K.; Crippen, G. M. J. Med. Chem. 1985, 28, 333 and references therein.
[7] Cramer, R. D., III; Milne, M. Abstracts of the ACS Meeting, April 1979, COMP 44. Wise, M.; Cramer, R. D.; Smith, D. M.; Exman, I. In Quantitative Approaches to Drug Design; Dearden, J. C., Ed.; Elsevier: Amsterdam, 1983; p 145. Wise, M. In Molecular Graphics and Drug Design; Burgen, A. S. V., Roberts, G. C. K., Tute, M. S., Eds.; Elsevier: New York, 1986; pp 183-194. Cramer, R. D., III; Bunce, J. D. In QSAR in Drug Design and Toxicology; Hadzi, D., Jerman-Blazic, B., Eds.; Elsevier: New York, 1987; p 3.
[8] Goodford, P. J. J. Med. Chem. 1985, 28, 849.
[9] Hansch, C.; Hathaway, B. A.; Guo, Z. R.; Selassie, C. D.; Dietrich, S. W.; Blaney, J. M.; Langridge, R.; Volz, K. W.; Kaufman, B. T. J. Med. Chem. 1984, 27, 129.
[10] Dunn, J. F.; Nisula, B. C.; Rodbard, D. J. Clin. Endocrin. Metab. 1981, 63.
[11] Cramer, R. D., III Quant. Struct. Act. Relat. Pharmacol., Chem. Biol. 1983, 2, 7, 13. Yunger, L. M.; Cramer, R. D., III Quant. Struct. Act. Relat. Pharmacol., Chem. Biol. 1983, 2, 149.
[12] Westphal, U. Steroid-Protein Interactions II; Springer-Verlag: Berlin, 1986.
[13] Vinter, J. G.; Davis, A.; Saunder, M. R. J. Comp-Aided Mol. Design 1987, 1, 31.
[14] Gasteiger, J.; Marsili, M. Tetrahedron 1980, 36, 3219.
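The field calculation described in the CoMFA Methodology paragraph above can be sketched roughly as follows. This is an illustration under several assumptions, not the original implementation: the A and B coefficients are hypothetical placeholders rather than Tripos force-field values, the "1/r dielectric" is read as a distance-dependent dielectric (so the Coulomb energy falls off as 1/r^2), steric energies above the cutoff are simply clipped to it (the source sentence describing the cutoff handling is truncated), and the function names and lattice margin are invented for the example.

```python
import numpy as np

COULOMB_K = 332.06  # kcal*Angstrom/(mol*e^2), usual electrostatic conversion factor


def make_lattice(coords, spacing=2.0, margin=4.0):
    """Regular 3D lattice enclosing all atoms; `margin` (in Angstrom) is an assumed padding."""
    lo = coords.min(axis=0) - margin
    hi = coords.max(axis=0) + margin
    axes = [np.arange(a, b + 1e-9, spacing) for a, b in zip(lo, hi)]
    return np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)


def comfa_fields(coords, charges, A, B, lattice, probe_charge=1.0, cutoff=30.0):
    """Steric (6-12) and electrostatic probe energies at every lattice point, in kcal/mol."""
    # Probe-to-atom distances, shape (n_points, n_atoms).
    r = np.linalg.norm(lattice[:, None, :] - coords[None, :, :], axis=2)
    r = np.maximum(r, 1e-6)  # guard against a lattice point sitting exactly on an atom
    steric = np.sum(A / r**12 - B / r**6, axis=1)
    # Assumed distance-dependent dielectric (epsilon = r), hence the 1/r^2 dependence.
    electro = np.sum(COULOMB_K * probe_charge * charges / r**2, axis=1)
    # Assumption: steric energies above the cutoff are clipped; electrostatics are left as is.
    steric = np.minimum(steric, cutoff)
    return steric, electro


# Toy two-atom "molecule" with hypothetical charges and 6-12 coefficients.
coords = np.array([[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]])
charges = np.array([-0.2, 0.2])
A = np.array([5.0e5, 5.0e5])
B = np.array([6.0e2, 6.0e2])

lattice = make_lattice(coords)
steric, electro = comfa_fields(coords, charges, A, B, lattice)
row = np.concatenate([steric, electro])  # one row of the compound-by-field data table
print(lattice.shape, row.shape)
```

Each compound contributes one such row, which is why the resulting table has many more columns than rows and calls for a method such as PLS.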

3,655 citations

Book
01 Jan 1979
Substituent Constants for Correlation Analysis in Chemistry and Biology

3,169 citations