Author

Yun De Xiao

Bio: Yun De Xiao is an academic researcher at the University of North Carolina at Chapel Hill. The author has contributed to research on the topics of quantitative structure–activity relationships and applicability domain. The author has an h-index of 3 and has co-authored 4 publications receiving 655 citations.

Papers
Journal Article
TL;DR: There is additional evidence that there exists no correlation between the values of q2 for the training set and accuracy of prediction (R2) for the test set and it is argued that this observation is a general property of any QSAR model developed with LOO cross-validation.
Abstract: Quantitative Structure–Activity Relationship (QSAR) models are used increasingly to screen chemical databases and/or virtual chemical libraries for potentially bioactive molecules. These developments emphasize the importance of rigorous model validation to ensure that the models have acceptable predictive power. Using k nearest neighbors (kNN) variable selection QSAR method for the analysis of several datasets, we have demonstrated recently that the widely accepted leave-one-out (LOO) cross-validated R2 (q2) is an inadequate characteristic to assess the predictive ability of the models [Golbraikh, A., Tropsha, A. Beware of q2! J. Mol. Graphics Mod. 20, 269-276, (2002)]. Herein, we provide additional evidence that there exists no correlation between the values of q2 for the training set and accuracy of prediction (R2) for the test set and argue that this observation is a general property of any QSAR model developed with LOO cross-validation. We suggest that external validation using rationally selected training and test sets provides a means to establish a reliable QSAR model. We propose several approaches to the division of experimental datasets into training and test sets and apply them in QSAR studies of 48 functionalized amino acid anticonvulsants and a series of 157 epipodophyllotoxin derivatives with antitumor activity. We formulate a set of general criteria for the evaluation of predictive power of QSAR models.
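The two statistics contrasted in this abstract can be sketched in a few lines. The example below is only an illustration under stated assumptions: it uses a single-descriptor least-squares line instead of the kNN variable selection method, it takes q2 = 1 - PRESS/SS for the training set, and it takes the external R2 to be the squared Pearson correlation between observed and predicted test-set activities. All function names and data values are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x for one descriptor (illustrative stand-in model)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def loo_q2(xs, ys):
    """Leave-one-out q2 = 1 - PRESS / SS, refitting the model without each compound in turn."""
    press = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    y_bar = mean(ys)
    ss = sum((y - y_bar) ** 2 for y in ys)
    return 1.0 - press / ss


def external_r2(observed, predicted):
    """Squared Pearson correlation between observed and predicted test-set activities."""
    o_bar, p_bar = mean(observed), mean(predicted)
    sop = sum((o - o_bar) * (p - p_bar) for o, p in zip(observed, predicted))
    soo = sum((o - o_bar) ** 2 for o in observed)
    spp = sum((p - p_bar) ** 2 for p in predicted)
    return sop ** 2 / (soo * spp)


# Hypothetical data: one descriptor value and one activity per compound.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]
test_x = [1.5, 3.5, 5.5]
test_y = [1.4, 3.6, 5.3]

q2 = loo_q2(train_x, train_y)
intercept, slope = fit_line(train_x, train_y)
r2_ext = external_r2(test_y, [intercept + slope * x for x in test_x])
print(f"q2 (training, LOO) = {q2:.3f}, R2 (external test) = {r2_ext:.3f}")
```

On this clean toy data both values are high. The abstract's point is that on real datasets a high q2 gives no guarantee of a high external R2, so both should be computed and reported.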

591 citations

Journal Article
TL;DR: It is demonstrated that not every combination of the data modeling technique and the descriptor collection yields a validated and predictive QSAR model, which affords automation, computational efficiency, and higher probability of identifying significant QSAR models for experimental data sets than the traditional approaches that rely on a single QSAR method.
Abstract: A combinatorial quantitative structure−activity relationships (Combi-QSAR) approach has been developed and applied to a data set of 98 ambergris fragrance compounds with complex stereochemistry. The Combi-QSAR approach explores all possible combinations of different independent descriptor collections and various individual correlation methods to obtain statistically significant models with high internal (for the training set) and external (for the test set) accuracy. Seven different descriptor collections were generated with commercially available MOE, CoMFA, CoMMA, Dragon, VolSurf, and MolconnZ programs; we also included chirality topological descriptors recently developed in our laboratory (Golbraikh, A.; Bonchev, D.; Tropsha, A. J. Chem. Inf. Comput. Sci. 2001, 41, 147−158). CoMMA descriptors were used in combination with MOE descriptors. MolconnZ descriptors were used in combination with chirality descriptors. Each descriptor collection was combined individually with four correlation methods, includin...

91 citations

Journal Article
TL;DR: The resulting Catalyst pharmacophore and kNN QSAR models can be used concurrently for rapid virtual screening of chemical databases to identify novel p38 MAP kinase inhibitors.
Abstract: We have employed in parallel the Catalyst HypoGen pharmacophore modeling approach and the variable selection k-nearest neighbor quantitative structure–activity relationship (kNN QSAR) method to model a diverse data set of p38 mitogen-activated protein (MAP) kinase inhibitors. The HypoGen pharmacophore model, developed from a novel automated training set selection protocol, identified chemical functional features that were characteristic of the active compounds and differentiated the active from the inactive inhibitors. The kNN QSAR modeling employed topological descriptors and afforded predictive QSAR models with consistently high values of both leave-one-out cross-validated R2 for the training set and predictive R2 for the test set. The results of both modeling approaches were sensitive to the selection of the training and test sets used for model development and validation. The resulting Catalyst pharmacophore and kNN QSAR models can be used concurrently for rapid virtual screening of chemical databases to identify novel p38 MAP kinase inhibitors.

24 citations

Journal Article
TL;DR: A combinatorial quantitative structure-activity relationship (Combi-QSAR) approach is used to obtain statistically significant models with high internal and external accuracy.
Abstract: A combinatorial quantitative structure−activity relationships (Combi-QSAR) approach has been developed and applied to a data set of 98 ambergris fragrance compounds with complex stereochemistry. The Combi-QSAR approach explores all possible combinations of different independent descriptor collections and various individual correlation methods to obtain statistically significant models with high internal (for the training set) and external (for the test set) accuracy. Seven different descriptor collections were generated with commercially available MOE, CoMFA, CoMMA, Dragon, VolSurf, and MolconnZ programs; we also included chirality topological descriptors recently developed in our laboratory (Golbraikh, A.; Bonchev, D.; Tropsha, A. J. Chem. Inf. Comput. Sci. 2001, 41, 147−158). CoMMA descriptors were used in combination with MOE descriptors. MolconnZ descriptors were used in combination with chirality descriptors. Each descriptor collection was combined individually with four correlation methods, includin...

Cited by
Journal Article
TL;DR: Evidence is presented that only models that have been validated externally, after their internal validation, can be considered reliable and applicable for both external prediction and regulatory purposes.
Abstract: The recent REACH Policy of the European Union has led scientists and regulators to focus their attention on establishing general validation principles for QSAR models in the context of chemical regulation (previously known as the Setubal principles, now the OECD principles). This paper gives a brief analysis of some of these principles: unambiguous algorithm, Applicability Domain (AD), and statistical validation. Some concerns related to QSAR algorithm reproducibility and an example of a fast check of the applicability domain for MLR models are presented. Common myths and misconceptions related to popular techniques for verifying internal predictivity, particularly for MLR models (for instance cross-validation, bootstrap), are commented on and compared with commonly used statistical techniques for external validation. The differences between the two validation approaches are highlighted, and evidence is presented that only models that have been validated externally, after their internal validation, can be considered reliable and applicable for both external prediction and regulatory purposes. (“Validation is one of those words...that is constantly used and seldom defined” as stated by A. R. Feinstein in the book Multivariate Analysis: An Introduction, Yale University Press, New Haven, 1996).
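The abstract does not say which fast applicability domain check it presents. One widely used option for regression models is a leverage check, sketched below as an assumption and not as the paper's own procedure. For a single-descriptor model the leverage of a query compound is h = 1/n + (x - mean)^2 / Sxx, and a commonly quoted warning threshold is h* = 3(p + 1)/n for p descriptors and n training compounds. Function names and data are hypothetical.

```python
from statistics import mean


def leverages(train_x, query_x):
    """Leverage of each query compound with respect to a one-descriptor training set."""
    n = len(train_x)
    x_bar = mean(train_x)
    sxx = sum((x - x_bar) ** 2 for x in train_x)
    return [1.0 / n + (x - x_bar) ** 2 / sxx for x in query_x]


def warning_leverage(n_train, n_descriptors=1):
    """Commonly used cutoff h* = 3(p + 1)/n (an assumed convention, not taken from the abstract)."""
    return 3.0 * (n_descriptors + 1) / n_train


# Hypothetical descriptor values for ten training compounds and two query compounds.
train_x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
query_x = [2.8, 9.0]

h_star = warning_leverage(len(train_x))
inside = [h <= h_star for h in leverages(train_x, query_x)]
for x, ok in zip(query_x, inside):
    print(f"descriptor value {x}: {'inside' if ok else 'outside'} the applicability domain")
```

A prediction for a compound flagged as outside the domain is an extrapolation and should be treated with caution, whatever the model's validation statistics are.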

1,697 citations

Journal Article
TL;DR: The most critical QSAR modeling routines that are regarded as best practices in the field are examined, including procedures used to validate models, both internally and externally, as well as the need to define model applicability domains that should be used when models are employed for the prediction of external compounds or compound libraries.
Abstract: After nearly five decades "in the making", QSAR modeling has established itself as one of the major computational molecular modeling methodologies. Like any mature research discipline, QSAR modeling can be characterized by a collection of well-defined protocols and procedures that enable the expert application of the method for exploring and exploiting ever-growing collections of biologically active chemical compounds. This review examines the most critical QSAR modeling routines that we regard as best practices in the field. We discuss these procedures in the context of an integrative predictive QSAR modeling workflow that is focused on achieving models of the highest statistical rigor and external predictive power. Specific elements of the workflow consist of data preparation including chemical structure (and when possible, associated biological data) curation, outlier detection, dataset balancing, and model validation. We especially emphasize procedures used to validate models, both internally and externally, as well as the need to define model applicability domains that should be used when models are employed for the prediction of external compounds or compound libraries. Finally, we present several examples of successful applications of QSAR models for virtual screening to identify experimentally confirmed hits.

1,362 citations

Journal Article
TL;DR: In this article, the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data is explored, where the compounds of the dataset were classified using K-means clustering technique applied on standardized descriptor matrix and ten combinations of training and test sets were generated based on the obtained clusters.
Abstract: This paper explores the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data. The compounds of the dataset were classified using the K-means clustering technique applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated based on the obtained clusters. For a particular training set, PLS models were developed with a number of components optimized by leave-one-out Q2, and the developed models were then validated (externally) using the test set compounds. For each set, the PLS model was initially constructed using all descriptors (variables). The variables having the smallest standardized values of regression coefficients were deleted and the next model was developed with a reduced set of variables. These steps were repeated until further reduction in the number of variables did not improve the Q2 value. In each case, statistical parameters such as predictive R2 (R2pred), the squared correlation coefficient between observed and predicted values with (r2) and without (r02) intercept, and the Root Mean Square Error of Prediction (RMSEP) were calculated from the test set compounds. For all ten sets, Q2 values steadily increase on deletion of variables while R2pred values do not show any specific trend. In no case do the highest Q2 and the highest R2pred appear in the same trial, i.e., with the same combination of variables. This suggests that, from the viewpoint of external predictability, choosing variables for PLS based on the Q2 value may not be optimal. Moreover, a clear separation of the r2 and r02 curves in some sets suggests that such models may not be truly predictive in spite of acceptable R2pred values. Another observation is that the coefficient of determination R2 for the training set is less sensitive to the deletion of variables than validation parameters such as Q2 and R2pred.
Finally, a new parameter rm2 has been suggested to indicate external predictability of QSAR models.
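The abstract names rm2 without giving its formula. The form usually quoted in the QSAR literature is rm2 = r2 * (1 - sqrt(|r2 - r02|)), where r2 and r02 are the squared correlation coefficients between observed and predicted test-set values with and without intercept. The sketch below assumes that form, regresses observed on predicted values through the origin to obtain r02, and uses the absolute value as a guard against a negative difference. These conventions, the function names, and the data are assumptions and may differ from the paper's exact definition.

```python
from statistics import mean


def r2_with_intercept(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    o_bar, p_bar = mean(observed), mean(predicted)
    sop = sum((o - o_bar) * (p - p_bar) for o, p in zip(observed, predicted))
    soo = sum((o - o_bar) ** 2 for o in observed)
    spp = sum((p - p_bar) ** 2 for p in predicted)
    return sop ** 2 / (soo * spp)


def r2_through_origin(observed, predicted):
    """r02 for the regression observed = k * predicted, with no intercept term."""
    k = sum(o * p for o, p in zip(observed, predicted)) / sum(p * p for p in predicted)
    o_bar = mean(observed)
    ss_res = sum((o - k * p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - o_bar) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rm2(observed, predicted):
    """rm2 = r2 * (1 - sqrt(|r2 - r02|)); the abs() guard is an assumption of this sketch."""
    r2 = r2_with_intercept(observed, predicted)
    r02 = r2_through_origin(observed, predicted)
    return r2 * (1.0 - abs(r2 - r02) ** 0.5)


# Hypothetical observed and predicted activities for five test-set compounds.
observed = [5.1, 6.0, 6.8, 7.9, 8.7]
predicted = [5.0, 6.2, 6.9, 7.6, 8.9]
rm2_value = rm2(observed, predicted)
print(f"rm2 = {rm2_value:.3f}")
```

The penalty term grows as r2 and r02 move apart, which matches the abstract's remark that a clear separation of the two curves signals a model that may not be truly predictive.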

683 citations

Journal Article
TL;DR: This work compared y-randomization and several variants thereof, using original response, permuted response, or random number pseudoresponse and original descriptors or random number pseudodescriptors, in the typical setting of multilinear regression (MLR) with descriptor selection, and reported progress toward the aim of obtaining the mean highest r2 of random pseudomodels by calculation rather than by tedious multiple simulations on random number variables.
Abstract: y-Randomization is a tool used in validation of QSPR/QSAR models, whereby the performance of the original model in data description (r2) is compared to that of models built for permuted (randomly shuffled) response, based on the original descriptor pool and the original model building procedure. We compared y-randomization and several variants thereof, using original response, permuted response, or random number pseudoresponse and original descriptors or random number pseudodescriptors, in the typical setting of multilinear regression (MLR) with descriptor selection. For each combination of number of observations (compounds), number of descriptors in the final model, and number of descriptors in the pool to select from, computer experiments using the same descriptor selection method result in two different mean highest random r2 values. A lower one is produced by y-randomization or a variant likewise based on the original descriptors, while a higher one is obtained from variants that use random number pse...
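The basic y-randomization test described at the start of this abstract can be sketched as follows. This is a deliberately reduced illustration: it uses one descriptor and no descriptor selection, so it does not reproduce the variants or the selection bias the paper analyzes. The function names, the number of trials, the seed, and the data are hypothetical.

```python
import random
from statistics import mean


def r2(xs, ys):
    """Squared Pearson correlation, equal to r2 of a one-descriptor least-squares fit."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy ** 2 / (sxx * syy)


def y_randomization(xs, ys, n_trials=200, seed=0):
    """Return r2 of the original model and r2 values of models built on permuted responses."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(ys)
    permuted_scores = []
    for _ in range(n_trials):
        rng.shuffle(shuffled)
        permuted_scores.append(r2(xs, shuffled))
    return r2(xs, ys), permuted_scores


# Hypothetical data: ten compounds, one descriptor, nearly linear response.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 19.9]

original_r2, permuted_scores = y_randomization(xs, ys)
print(f"original r2 = {original_r2:.3f}, mean permuted r2 = {mean(permuted_scores):.3f}")
```

A model is considered suspect when the permuted-response models reach r2 values close to that of the original model. With descriptor selection from a large pool, as in the setting the abstract studies, the random r2 values become much higher than in this one-descriptor sketch.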

653 citations

Journal Article
TL;DR: This review seeks to provide a bird's eye view of the different 3D-QSAR approaches employed within the current drug discovery community to construct predictive structure-activity relationships and discusses the limitations that are fundamental to these approaches, as well as those that might be overcome with the improved strategies.
Abstract: Quantitative structure-activity relationships (QSAR) have been applied for decades in the development of relationships between physicochemical properties of chemical substances and their biological activities to obtain a reliable statistical model for prediction of the activities of new chemical entities. The fundamental principle underlying the formalism is that the difference in structural properties is responsible for the variations in biological activities of the compounds. In classical QSAR studies, affinities of ligands to their binding sites, inhibition constants, rate constants, and other biological end points have been correlated with atomic, group or molecular properties such as lipophilicity, polarizability, electronic and steric properties (Hansch analysis) or with certain structural features (Free-Wilson analysis). However, such an approach has only limited utility for designing a new molecule because it does not consider the 3D structure of the molecules. 3D-QSAR has emerged as a natural extension to the classical Hansch and Free-Wilson approaches, which exploits the three-dimensional properties of the ligands to predict their biological activities using robust chemometric techniques such as PLS, G/PLS, ANN, etc. It has served as a valuable predictive tool in the design of pharmaceuticals and agrochemicals. Although the trial and error factor involved in the development of a new drug cannot be ignored completely, QSAR certainly decreases the number of compounds to be synthesized by facilitating the selection of the most promising candidates. Several success stories of QSAR have attracted medicinal chemists to investigate the relationships of structural properties with biological activity.
This review seeks to provide a bird's-eye view of the different 3D-QSAR approaches employed within the current drug discovery community to construct predictive structure-activity relationships and also discusses the limitations that are fundamental to these approaches, as well as those that might be overcome with improved strategies. The components involved in building a useful 3D-QSAR model are discussed, including the validation techniques available for this purpose.

606 citations