On Some Aspects of Variable Selection for Partial Least Squares Regression Models

doi:10.1002/QSAR.200710043


Partha Pratim Roy, +1 more

01 Mar 2008

QSAR & Combinatorial Science

Vol. 27, Iss. 3, pp. 302-313


TLDR

This article explores the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data. The compounds were classified by K-means clustering applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated from the resulting clusters.

Abstract:

This paper explores the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data. The compounds of the dataset were classified using the K-means clustering technique applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated based on the obtained clusters. For each training set, PLS models were developed with the number of components optimized by leave-one-out Q2, and the developed models were then validated externally using the test set compounds. For each set, a PLS model was initially constructed using all descriptors (variables). The variables having the smallest standardized regression coefficients were deleted, and the next model was developed with the reduced set of variables. These steps were repeated until further reduction in the number of variables did not improve the Q2 value. In each case, statistical parameters such as predictive R2 (R2pred), the squared correlation coefficient between observed and predicted values with (r2) and without (r02) intercept, and Root Mean Square Error of Prediction (RMSEP) were calculated from the test set compounds. In all ten sets, Q2 values steadily increase on deletion of variables, while R2pred values do not show any specific trend. In no case do the highest Q2 and the highest R2pred appear in the same trial, i.e., with the same combination of variables. This suggests that, from the viewpoint of external predictability, choosing variables for PLS based on the Q2 value may not be optimum. Moreover, a clear separation of the r2 and r02 curves in some sets suggests that such models may not be truly predictive in spite of acceptable R2pred values. Another observation is that the coefficient of determination R2 for the training set is less sensitive to deletion of variables than validation parameters such as Q2 and R2pred. Finally, a new parameter, rm2, is suggested to indicate the external predictability of QSAR models.
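The external validation parameters named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give formulas, so the definitions below are assumptions based on conventions commonly used in the QSAR literature: R2pred compares the test-set residuals with deviations from the training-set mean, r2 is the squared Pearson correlation between observed and predicted values, r02 is the coefficient of determination for a regression of observed on predicted values forced through the origin, and RMSEP is the root mean squared test-set residual. The function name `external_validation_metrics` and the dictionary keys are hypothetical, and the parameter rm2 is left out because its definition is not stated here.

```python
import numpy as np


def external_validation_metrics(y_train, y_test, y_pred):
    """Test-set statistics under assumed, commonly used definitions.

    y_train: observed responses of the training set (used only for its mean)
    y_test:  observed responses of the test set
    y_pred:  model predictions for the test set
    """
    y_train = np.asarray(y_train, dtype=float)
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)

    resid = y_test - y_pred
    press = np.sum(resid ** 2)  # predictive residual sum of squares

    # Assumed definition: R2pred = 1 - PRESS / SS about the training mean.
    r2_pred = 1.0 - press / np.sum((y_test - y_train.mean()) ** 2)

    # r2: squared Pearson correlation (regression line with intercept).
    r2 = np.corrcoef(y_test, y_pred)[0, 1] ** 2

    # r02: fit y_test = k * y_pred with no intercept, then compute the
    # coefficient of determination of that through-origin line (assumed form).
    k = np.sum(y_test * y_pred) / np.sum(y_pred ** 2)
    ss_origin = np.sum((y_test - k * y_pred) ** 2)
    r02 = 1.0 - ss_origin / np.sum((y_test - y_test.mean()) ** 2)

    # RMSEP: root mean square error of prediction on the test set.
    rmsep = np.sqrt(np.mean(resid ** 2))

    return {"r2_pred": r2_pred, "r2": r2, "r02": r02, "rmsep": rmsep}
```

Under these assumed definitions, predictions that are shifted by a constant offset keep r2 at 1 while r02 drops below 1, which is the kind of separation between the r2 and r02 curves that the abstract treats as a warning sign.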



