Author

Min Shen

Bio: Min Shen is an academic researcher at the University of North Carolina at Chapel Hill. The author has contributed to research on quantitative structure–activity relationships and applicability domains, has an h-index of 4, and has co-authored 4 publications receiving 929 citations.

Papers
Journal ArticleDOI
TL;DR: Additional evidence is presented that q2 for the training set does not correlate with the accuracy of prediction (R2) for the test set, and it is argued that this is a general property of any QSAR model developed with LOO cross-validation.
Abstract: Quantitative Structure–Activity Relationship (QSAR) models are used increasingly to screen chemical databases and/or virtual chemical libraries for potentially bioactive molecules. These developments emphasize the importance of rigorous model validation to ensure that the models have acceptable predictive power. Using k nearest neighbors (kNN) variable selection QSAR method for the analysis of several datasets, we have demonstrated recently that the widely accepted leave-one-out (LOO) cross-validated R2 (q2) is an inadequate characteristic to assess the predictive ability of the models [Golbraikh, A., Tropsha, A. Beware of q2! J. Mol. Graphics Mod. 20, 269-276, (2002)]. Herein, we provide additional evidence that there exists no correlation between the values of q2 for the training set and accuracy of prediction (R2) for the test set and argue that this observation is a general property of any QSAR model developed with LOO cross-validation. We suggest that external validation using rationally selected training and test sets provides a means to establish a reliable QSAR model. We propose several approaches to the division of experimental datasets into training and test sets and apply them in QSAR studies of 48 functionalized amino acid anticonvulsants and a series of 157 epipodophyllotoxin derivatives with antitumor activity. We formulate a set of general criteria for the evaluation of predictive power of QSAR models.

591 citations
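
The distinction the abstract above draws between internal q2 and external R2 can be made concrete with a minimal sketch. It assumes a one-descriptor least-squares model rather than the kNN method used in the paper, takes R2 as the coefficient of determination on the test set (one common definition), and uses made-up numbers; the function names `fit_line`, `loo_q2`, and `external_r2` are hypothetical.

```python
from statistics import mean


def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept for a single descriptor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_q2(x, y):
    """Leave-one-out q2 = 1 - PRESS / SS_total, computed on the training set."""
    press = 0.0
    for i in range(len(x)):
        # Refit the model without compound i, then predict compound i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    my = mean(y)
    return 1.0 - press / sum((yi - my) ** 2 for yi in y)


def external_r2(x_train, y_train, x_test, y_test):
    """R2 on a held-out test set, using a model fit on the training set only."""
    slope, intercept = fit_line(x_train, y_train)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2
                 for xi, yi in zip(x_test, y_test))
    my = mean(y_test)
    return 1.0 - ss_res / sum((yi - my) ** 2 for yi in y_test)


# Made-up descriptor values and activities, for illustration only.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_train = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
x_test = [1.5, 3.5, 5.5]
y_test = [1.4, 3.6, 5.4]

print(f"q2 (LOO, training set) = {loo_q2(x_train, y_train):.3f}")
print(f"R2 (external test set) = {external_r2(x_train, y_train, x_test, y_test):.3f}")
```

The two numbers come from different compounds: q2 only ever sees the training set, which is why the abstract treats a high q2 as insufficient evidence of predictive power.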

Journal ArticleDOI
TL;DR: The authors report the development, validation, and application of quantitative structure-property relationship (QSPR) models of metabolic turnover rate for compounds in human S9 homogenate, which provide a rapid computational screen for generating components of the ADME profile in a drug discovery process.
Abstract: Computational ADME (absorption, distribution, metabolism, and excretion) models may be used early in the drug discovery process in order to flag drug candidates with potentially problematic ADME profiles. We report the development, validation, and application of quantitative structure−property relationship (QSPR) models of metabolic turnover rate for compounds in human S9 homogenate. Biological data were obtained from uniform bioassays of 631 diverse chemicals proprietary to GlaxoSmithKline (GSK). The models were built with topological molecular descriptors such as molecular connectivity indices or atom pairs using the k-nearest neighbor variable selection optimization method developed at the University of North Carolina (Zheng, W.; Tropsha, A. A novel variable selection QSAR approach based on the k-nearest neighbor principle. J. Chem. Inf. Comput. Sci., 2000, 40, 185−194.). For the purpose of validation, the whole data set was divided into training and test sets. The training set QSPR models were charact...

149 citations

Journal ArticleDOI
TL;DR: Rigorously validated quantitative structure-activity relationship (QSAR) models were developed for 48 chemically diverse functionalized amino acids with anticonvulsant activity; the models predicted with reasonable accuracy the activity of 13 novel compounds not included in the original dataset.
Abstract: We report the development of rigorously validated quantitative structure−activity relationship (QSAR) models for 48 chemically diverse functionalized amino acids with anticonvulsant activity. Two variable selection approaches, simulated annealing partial least squares (SA-PLS) and k nearest neighbor (kNN), were employed. Both methods utilize multiple descriptors such as molecular connectivity indices or atom pair descriptors, which are derived from two-dimensional molecular topology. QSAR models with high internal accuracy were generated, with leave-one-out cross-validated R2 (q2) values ranging between 0.6 and 0.8. The q2 values for the actual dataset were significantly higher than those obtained for the same dataset with randomly shuffled activity values, indicating that models were statistically significant. The original dataset was further divided into several training and test sets, with highly predictive models providing q2 values greater than 0.5 for the training sets and R2 values greater than 0.6...

132 citations

Journal ArticleDOI
TL;DR: A drug discovery strategy is presented that employs variable selection quantitative structure-activity relationship (QSAR) models for chemical database mining; only variables selected as a result of model building are used in chemical similarity searches comparing active compounds of the training set with those in chemical databases.
Abstract: We have developed a drug discovery strategy that employs variable selection quantitative structure-activity relationship (QSAR) models for chemical database mining. The approach starts with the development of rigorously validated QSAR models obtained with the variable selection k nearest neighbor (kNN) method (or, in principle, with any other robust model-building technique). Model validation is based on several statistical criteria, including the randomization of the target property (Y-randomization), independent assessment of the training set model's predictive power using external test sets, and the establishment of the model's applicability domain. All successful models are employed in database mining concurrently; in each case, only variables selected as a result of model building (termed descriptor pharmacophore) are used in chemical similarity searches comparing active compounds of the training set (queries) with those in chemical databases. Specific biological activity (characteristic of the train...

132 citations


Cited by
Journal ArticleDOI
TL;DR: Evidence is presented that only models that have been validated externally, after their internal validation, can be considered reliable and applicable for both external prediction and regulatory purposes.
Abstract: The recent REACH Policy of the European Union has led scientists and regulators to focus their attention on establishing general validation principles for QSAR models in the context of chemical regulation (previously known as the Setubal, nowadays, the OECD principles). This paper gives a brief analysis of some principles: unambiguous algorithm, Applicability Domain (AD), and statistical validation. Some concerns related to QSAR algorithm reproducibility and an example of a fast check of the applicability domain for MLR models are presented. Common myths and misconceptions related to popular techniques for verifying internal predictivity, particularly for MLR models (for instance crossvalidation, bootstrap), are commented on and compared with commonly used statistical techniques for external validation. The differences in the two validating approaches are highlighted, and evidence is presented that only models that have been validated externally, after their internal validation, can be considered reliable and applicable for both external prediction and regulatory purposes. (“Validation is one of those words...that is constantly used and seldom defined” as stated by A. R. Feinstein in the book Multivariate Analysis: An Introduction, Yale University Press, New Haven, 1996).

1,697 citations
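
The abstract above does not say which applicability-domain check it uses. As an illustration only, a leverage-based check is one widely used option for regression models: a query compound is flagged when its leverage exceeds a warning threshold, here assumed to be 3(p + 1)/n for p descriptors and n training compounds. The sketch is restricted to a single descriptor, and the names `leverage` and `in_domain` are hypothetical.

```python
from statistics import mean


def leverage(x_train, x_query):
    """Leverage of a query point for a one-descriptor model with intercept."""
    n = len(x_train)
    mx = mean(x_train)
    sxx = sum((xi - mx) ** 2 for xi in x_train)
    return 1.0 / n + (x_query - mx) ** 2 / sxx


def in_domain(x_train, x_query, n_descriptors=1):
    """True if the query leverage is at or below the assumed 3(p + 1)/n threshold."""
    threshold = 3.0 * (n_descriptors + 1) / len(x_train)
    return leverage(x_train, x_query) <= threshold


# Made-up training descriptor values, for illustration only.
x_train = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(in_domain(x_train, 5.5))   # query near the centre of the training data
print(in_domain(x_train, 20.0))  # query far outside the training range
```

A prediction for a compound outside the domain is an extrapolation, so it would normally be reported with a warning rather than trusted at face value.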

Journal ArticleDOI
TL;DR: Most critical QSAR modeling routines that are regarded as best practices in the field are examined, including procedures used to validate models, both internally and externally, as well as the need to define model applicability domains that should be used when models are employed for the prediction of external compounds or compound libraries.
Abstract: After nearly five decades "in the making", QSAR modeling has established itself as one of the major computational molecular modeling methodologies. As any mature research discipline, QSAR modeling can be characterized by a collection of well defined protocols and procedures that enable the expert application of the method for exploring and exploiting ever growing collections of biologically active chemical compounds. This review examines most critical QSAR modeling routines that we regard as best practices in the field. We discuss these procedures in the context of integrative predictive QSAR modeling workflow that is focused on achieving models of the highest statistical rigor and external predictive power. Specific elements of the workflow consist of data preparation including chemical structure (and when possible, associated biological data) curation, outlier detection, dataset balancing, and model validation. We especially emphasize procedures used to validate models, both internally and externally, as well as the need to define model applicability domains that should be used when models are employed for the prediction of external compounds or compound libraries. Finally, we present several examples of successful applications of QSAR models for virtual screening to identify experimentally confirmed hits.

1,362 citations

Journal ArticleDOI
TL;DR: In this article, the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data is explored, where the compounds of the dataset were classified using K-means clustering technique applied on standardized descriptor matrix and ten combinations of training and test sets were generated based on the obtained clusters.
Abstract: This paper tries to explore the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data. The compounds of the dataset were classified using K-means clustering technique applied on standardized descriptor matrix and ten combinations of training and test sets were generated based on the obtained clusters. For a particular training set, PLS models were developed with a number of components optimized by leave-one-out Q2 and then the developed models were validated (externally) using the test set compounds. For each set, PLS model was initially constructed using all descriptors (variables). The variables having least standardized values of regression coefficients were deleted and the next model was developed with a reduced set of variables. These steps were performed several times until further reduction in number of variables did not improve Q2 value. In each case, statistical parameters like predictive R2 (R2pred), squared correlation coefficient between observed and predicted values with (r2) and without (r02) intercept and Root Mean Square Error of Prediction (RMSEP) were calculated from the test set compounds. In case of all ten sets, Q2 values steadily increase on deletion of variables while R2pred values do not show any specific trend. In no case, the highest Q2 and highest R2pred appear in the same trial, i.e., with the same combinations of variables. This suggests that from the viewpoint of external predictability, choice of variables for PLS based on Q2 value may not be optimum. Moreover, a clear separation of r2 and r02 curves in some sets suggests that such models may not be truly predictive in spite of acceptable R2pred values. Another observation is that coefficient of determination R2 for the training set is more immune to changes on deletion of variables than the validation parameters like Q2 and R2pred. Finally, a new parameter rm2 has been suggested to indicate external predictability of QSAR models.

683 citations
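
The abstract above names the rm2 parameter without giving its formula. The sketch below assumes the form commonly quoted in the QSAR literature, rm2 = r2 × (1 − sqrt(|r2 − r02|)), with r2 taken as the squared Pearson correlation between observed and predicted values and r02 as the determination coefficient for the regression of observed on predicted values forced through the origin. The choice of axes for r02 and the absolute value are assumptions, and all function names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson_r2(obs, pred):
    """Squared correlation between observed and predicted values (with intercept)."""
    mo, mp = mean(obs), mean(pred)
    sop = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    soo = sum((o - mo) ** 2 for o in obs)
    spp = sum((p - mp) ** 2 for p in pred)
    return sop * sop / (soo * spp)


def r02(obs, pred):
    """Determination coefficient for the regression line forced through the origin."""
    k = sum(o * p for o, p in zip(obs, pred)) / sum(p * p for p in pred)
    mo = mean(obs)
    ss_res = sum((o - k * p) ** 2 for o, p in zip(obs, pred))
    return 1.0 - ss_res / sum((o - mo) ** 2 for o in obs)


def rm2(obs, pred):
    """Assumed form: rm2 = r2 * (1 - sqrt(|r2 - r02|))."""
    r2 = pearson_r2(obs, pred)
    return r2 * (1.0 - sqrt(abs(r2 - r02(obs, pred))))


# Made-up test-set values: predictions are perfectly correlated but offset by 1.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [2.0, 3.0, 4.0, 5.0]
print(f"r2 = {pearson_r2(observed, predicted):.3f}, rm2 = {rm2(observed, predicted):.3f}")
```

In this toy case r2 alone cannot see the systematic offset, while the gap between r2 and r02 pulls rm2 down, which is the kind of separation between the r2 and r02 curves the abstract describes.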

Journal ArticleDOI
TL;DR: This work compared y-randomization and several variants thereof, using original response, permuted response, or random number pseudoresponse and original descriptors or random number pseudodescriptors, in the typical setting of multilinear regression (MLR) with descriptor selection, and reported progress toward obtaining the mean highest r2 of random pseudomodels by calculation rather than by tedious multiple simulations on random number variables.
Abstract: y-Randomization is a tool used in validation of QSPR/QSAR models, whereby the performance of the original model in data description (r2) is compared to that of models built for permuted (randomly shuffled) response, based on the original descriptor pool and the original model building procedure. We compared y-randomization and several variants thereof, using original response, permuted response, or random number pseudoresponse and original descriptors or random number pseudodescriptors, in the typical setting of multilinear regression (MLR) with descriptor selection. For each combination of number of observations (compounds), number of descriptors in the final model, and number of descriptors in the pool to select from, computer experiments using the same descriptor selection method result in two different mean highest random r2 values. A lower one is produced by y-randomization or a variant likewise based on the original descriptors, while a higher one is obtained from variants that use random number pse...

653 citations
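
The basic y-randomization procedure described in the abstract above can be sketched as follows. This is a simplified illustration, not the paper's setup: it assumes a single descriptor with no descriptor selection, uses made-up data, and the names `r2_single_descriptor` and `y_randomization` are hypothetical.

```python
import random
from statistics import mean


def r2_single_descriptor(x, y):
    """Squared Pearson correlation, equal to r2 of a one-descriptor linear fit."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy * sxy / (sxx * syy)


def y_randomization(x, y, n_trials=200, seed=0):
    """Return r2 for the original response and for n_trials permuted responses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    original = r2_single_descriptor(x, y)
    shuffled_scores = []
    for _ in range(n_trials):
        y_perm = y[:]          # copy, so the original response is left untouched
        rng.shuffle(y_perm)
        shuffled_scores.append(r2_single_descriptor(x, y_perm))
    return original, shuffled_scores


# Made-up descriptor values and activities, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.1]

original, shuffled = y_randomization(x, y)
print(f"original r2 = {original:.3f}")
print(f"mean r2 over permuted responses = {mean(shuffled):.3f}")
```

A model is considered meaningful only if its r2 clearly exceeds what the same procedure achieves on permuted responses. With descriptor selection from a large pool, as in the abstract, the random r2 values can be much higher than in this one-descriptor case.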
