Journal Article

On Some Aspects of Variable Selection for Partial Least Squares Regression Models

Partha Pratim Roy, +1 more
01 Mar 2008, Vol. 27, Iss. 3, pp. 302-313
Abstract
This paper explores the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data. The compounds were classified by K-means clustering applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated from the resulting clusters. For each training set, PLS models were developed with the number of components optimized by leave-one-out Q2, and the models were then validated externally using the test set compounds. For each set, a PLS model was first constructed using all descriptors (variables). The variables with the smallest standardized regression coefficients were then deleted, and the next model was developed with the reduced set of variables. These steps were repeated until further reduction in the number of variables no longer improved Q2. In each case, predictive R2 (R2pred), the squared correlation coefficient between observed and predicted values with (r2) and without (r02) intercept, and the Root Mean Square Error of Prediction (RMSEP) were calculated from the test set compounds. For all ten sets, Q2 values increase steadily on deletion of variables, while R2pred values show no specific trend. In no case do the highest Q2 and the highest R2pred appear in the same trial, i.e., with the same combination of variables. This suggests that, from the viewpoint of external predictability, choosing variables for PLS on the basis of Q2 may not be optimal. Moreover, a clear separation of the r2 and r02 curves in some sets suggests that such models may not be truly predictive in spite of acceptable R2pred values. Another observation is that the coefficient of determination R2 for the training set is less sensitive to deletion of variables than validation parameters such as Q2 and R2pred. Finally, a new parameter, rm2, is suggested to indicate the external predictability of QSAR models.
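The external validation parameters named in the abstract can be illustrated with a small sketch. The abstract gives no formulas, so everything below is an assumption based on forms commonly used in the QSAR validation literature: R2pred is referenced to the training set mean, r02 comes from a least-squares fit of observed on predicted values forced through the origin, and rm2 is taken as r2 * (1 - sqrt(|r2 - r02|)). The function names are hypothetical and are not taken from the paper.

```python
import math


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def r2_pred(y_obs_test, y_pred_test, y_train_mean):
    """Predictive R2 (assumed form): 1 - PRESS / SS, where SS is the squared
    deviation of the test set observations from the TRAINING set mean."""
    press = sum((o - p) ** 2 for o, p in zip(y_obs_test, y_pred_test))
    ss = sum((o - y_train_mean) ** 2 for o in y_obs_test)
    return 1.0 - press / ss


def rmsep(y_obs, y_pred):
    """Root Mean Square Error of Prediction over the test set."""
    sq_err = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    return math.sqrt(sq_err / len(y_obs))


def r2_with_intercept(y_obs, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    mo, mp = mean(y_obs), mean(y_pred)
    sxy = sum((o - mo) * (p - mp) for o, p in zip(y_obs, y_pred))
    sxx = sum((p - mp) ** 2 for p in y_pred)
    syy = sum((o - mo) ** 2 for o in y_obs)
    return sxy ** 2 / (sxx * syy)


def r2_through_origin(y_obs, y_pred):
    """r02 (assumed convention): fit observed = k * predicted with no
    intercept, then compute 1 - SS_res / SS_tot on the observed values."""
    k = sum(o * p for o, p in zip(y_obs, y_pred)) / sum(p * p for p in y_pred)
    mo = mean(y_obs)
    ss_res = sum((o - k * p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mo) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot


def rm2(y_obs, y_pred):
    """rm2 (assumed form): r2 * (1 - sqrt(|r2 - r02|)). The absolute value
    is a guard added here so the square root is always defined."""
    r2 = r2_with_intercept(y_obs, y_pred)
    r02 = r2_through_origin(y_obs, y_pred)
    return r2 * (1.0 - math.sqrt(abs(r2 - r02)))
```

Under these assumed definitions, predictions that are perfectly correlated with the observations but shifted by a constant still give r2 = 1, while r02 and rm2 fall below 1. This is the kind of separation between the r2 and r02 curves that the abstract treats as a warning sign.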


Citations
Journal Article

Real external predictivity of QSAR models: how to evaluate it? Comparison of different validation criteria and proposal of using the concordance correlation coefficient.

TL;DR: The concordance correlation coefficient is proposed as a complementary, or alternative, more prudent measure of a QSAR model to be externally predictive, and works well on real data sets, where it seems to be more stable, and helps in making decisions when the validation measures are in conflict.
Journal Article

On Two Novel Parameters for Validation of Predictive QSAR Models

TL;DR: A test for these two parameters is suggested to be a more stringent requirement than the traditional validation parameters to decide acceptability of a predictive QSAR model, especially when a regulatory decision is involved.
Journal Article

Further exploring rm2 metrics for validation of QSPR models

TL;DR: In this article, some additional variants of rm2 metrics have been proposed and their applications in judging the quality of predictions of QSPR models have been shown by analyzing results of the QSPR models obtained from three different data sets (n = 119, 90, and 384).
Journal Article

Some case studies on application of "r(m)2" metrics for judging quality of quantitative structure-activity relationship predictions: emphasis on scaling of response data.

TL;DR: The present study reports that the web application can be easily used for computation of rm2 metrics provided observed and QSAR‐predicted data for a set of compounds are available and scaling of response data is recommended prior to rm2 calculation.
Journal Article

On some aspects of validation of predictive quantitative structure–activity relationship models

TL;DR: This review focuses on the importance of validation of quantitative structure–activity relationship models and different methods of validation.
References
Journal Article

Assessing model fit by cross-validation.

TL;DR: It is shown by theoretical argument and empiric study of a large QSAR data set that when the available sample size is small, holding a portion of it back for testing is wasteful, and that it is much better to use cross-validation, but ensure that this is done properly.
Book

Chemometric methods in molecular design

TL;DR: Molecular concepts; experimental design in synthesis-planning and structure-property correlations; multivariate analysis of chemical and biological data; statistical validation of QSAR results.
Journal Article

Rational selection of training and test sets for the development of validated QSAR models.

TL;DR: There is additional evidence that there exists no correlation between the values of q2 for the training set and accuracy of prediction (R2) for the test set and it is argued that this observation is a general property of any QSAR model developed with LOO cross-validation.
Journal Article

Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection.

TL;DR: It is demonstrated that QSAR models built and validated with the approach have statistically better predictive power than models generated with either random or activity ranking based selection of the training and test sets.
Journal Article

Novel variable selection quantitative structure-property relationship approach based on the k-nearest-neighbor principle

TL;DR: A novel automated variable selection quantitative structure-activity relationship (QSAR) method, based on the k-nearest neighbor principle (kNN-QSAR), has been developed, which implies that similar compounds display similar profiles of pharmacological activities.