Journal ArticleDOI

On Some Aspects of Variable Selection for Partial Least Squares Regression Models

01 Mar 2008-Qsar & Combinatorial Science (John Wiley & Sons, Ltd)-Vol. 27, Iss: 3, pp 302-313
TL;DR: In this article, the optimum variable selection strategy for Partial Least Squares (PLS) regression is explored using a model dataset of cytoprotection data; the compounds were classified using the K-means clustering technique applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated based on the obtained clusters.
Abstract: This paper explores the optimum variable selection strategy for Partial Least Squares (PLS) regression using a model dataset of cytoprotection data. The compounds of the dataset were classified using the K-means clustering technique applied to the standardized descriptor matrix, and ten combinations of training and test sets were generated based on the obtained clusters. For a particular training set, PLS models were developed with the number of components optimized by leave-one-out Q2, and the developed models were then validated externally using the test set compounds. For each set, a PLS model was initially constructed using all descriptors (variables). The variables having the smallest standardized regression coefficients were deleted, and the next model was developed with the reduced set of variables. These steps were repeated until further reduction in the number of variables did not improve the Q2 value. In each case, statistical parameters such as predictive R2 (R2pred), the squared correlation coefficient between observed and predicted values with (r2) and without (r02) intercept, and the Root Mean Square Error of Prediction (RMSEP) were calculated for the test set compounds. For all ten sets, Q2 values steadily increase on deletion of variables, while R2pred values do not show any specific trend. In no case do the highest Q2 and the highest R2pred appear in the same trial, i.e., with the same combination of variables. This suggests that, from the viewpoint of external predictability, the choice of variables for PLS based on the Q2 value may not be optimum. Moreover, a clear separation of the r2 and r02 curves in some sets suggests that such models may not be truly predictive in spite of acceptable R2pred values. Another observation is that the coefficient of determination R2 for the training set is less sensitive to deletion of variables than validation parameters such as Q2 and R2pred. Finally, a new parameter, rm2, is suggested to indicate the external predictability of QSAR models.
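The external validation statistics named in the abstract can be illustrated with a small sketch. It assumes the commonly used definitions, which the abstract itself does not spell out: R2pred = 1 - sum((y_test - y_hat)^2) / sum((y_test - mean(y_train))^2), and RMSEP as the root of the mean squared prediction error. The function names and the sample numbers are hypothetical.

```python
import math


def r2_pred(y_test, y_hat, y_train_mean):
    """Predictive R2 (assumed definition): squared test-set errors are
    compared with squared deviations of the observed test responses
    from the TRAINING set mean."""
    press = sum((y - p) ** 2 for y, p in zip(y_test, y_hat))
    ss_dev = sum((y - y_train_mean) ** 2 for y in y_test)
    return 1.0 - press / ss_dev


def rmsep(y_test, y_hat):
    """Root Mean Square Error of Prediction over the test set."""
    n = len(y_test)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_test, y_hat)) / n)


# Hypothetical observed/predicted responses, for illustration only.
y_train = [5.0, 6.0, 7.0, 8.0]
y_test = [5.5, 6.5, 7.5]
y_hat = [5.4, 6.7, 7.4]
train_mean = sum(y_train) / len(y_train)
print(r2_pred(y_test, y_hat, train_mean), rmsep(y_test, y_hat))
```

Because the denominator of R2pred depends on how far the test responses lie from the training mean, the same prediction errors can yield quite different R2pred values for different training/test splits.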
Citations
Journal ArticleDOI
TL;DR: The concordance correlation coefficient is proposed as a complementary, or alternative, and more prudent measure of whether a QSAR model is externally predictive; it works well on real data sets, where it seems to be more stable, and helps in making decisions when the validation measures are in conflict.
Abstract: The main utility of QSAR models is their ability to predict activities/properties for new chemicals, and this external prediction ability is evaluated by means of various validation criteria. As a measure for such evaluation the OECD guidelines have proposed the predictive squared correlation coefficient Q2F1 (Shi et al.). However, other validation criteria have been proposed by other authors: the Golbraikh-Tropsha method, r2m (Roy), Q2F2 (Schuurmann et al.), Q2F3 (Consonni et al.). In QSAR studies these measures are usually in accordance, though this is not always the case, thus doubts can arise when contradictory results are obtained. It is likely that none of the aforementioned criteria is the best in every situation, so a comparative study using simulated data sets is proposed here, using threshold values suggested by the proponents or those widely used in QSAR modeling. In addition, a different and simple external validation measure, the concordance correlation coefficient (CCC), is proposed and comp...

552 citations
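As an illustration of the concordance correlation coefficient mentioned above, here is a minimal sketch assuming Lin's standard formula, CCC = 2·Sxy / (Sxx + Syy + n·(mean_x − mean_y)^2). The truncated abstract does not show the formula, so this form and the function name are assumptions.

```python
def ccc(observed, predicted):
    """Concordance correlation coefficient (Lin's form, assumed here).

    Unlike the Pearson correlation, it is reduced when predictions are
    systematically shifted or scaled relative to the observations.
    """
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(predicted) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    sxx = sum((x - mean_x) ** 2 for x in observed)
    syy = sum((y - mean_y) ** 2 for y in predicted)
    return 2.0 * sxy / (sxx + syy + n * (mean_x - mean_y) ** 2)


# Perfectly correlated predictions with a constant offset of +1:
# the Pearson correlation would be 1, but the CCC is penalized.
print(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

The offset example shows why such a measure is described as more prudent: agreement with the line of identity is required, not just linear association.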

Journal ArticleDOI
TL;DR: A test for these two parameters is suggested to be a more stringent requirement than the traditional validation parameters to decide acceptability of a predictive QSAR model, especially when a regulatory decision is involved.
Abstract: Validation is a crucial aspect of quantitative structure-activity relationship (QSAR) modeling. The present paper shows that traditionally used validation parameters (leave-one-out Q2 for internal validation and predictive R2 for external validation) may be supplemented with two novel parameters, rm2 and Rp2, for a stricter test of validation. The parameter rm2(overall) penalizes a model for large differences between observed and predicted values of the compounds of the whole set (considering both training and test sets), while the parameter Rp2 penalizes model R2 for large differences between the determination coefficient of the nonrandom model and the square of the mean correlation coefficient of random models in a randomization test. Two other variants of the rm2 parameter, rm2(LOO) and rm2(test), penalize a model more strictly than Q2 and R2pred, respectively. Three different data sets of moderate to large size have been used to develop multiple models in order to indicate the suitability of the novel parameters in QSAR studies. The results show that in many cases the developed models could satisfy the requirements of conventional parameters (Q2 and R2pred) but fail to achieve the required values for the novel parameters rm2 and Rp2. Moreover, these parameters also help in identifying the best models from among a set of comparable models. Thus, a test for these two parameters is suggested to be a more stringent requirement than the traditional validation parameters to decide acceptability of a predictive QSAR model, especially when a regulatory decision is involved.

474 citations
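The randomization-based parameter described above can be sketched as follows. The form Rp2 = R2 · sqrt(R2 − Rr2), with Rr2 taken as the square of the mean correlation coefficient of the randomized models, is an assumption based on the verbal description in the abstract; the function name and the input values are hypothetical.

```python
import math


def rp2(r2_model, r_random):
    """Penalized R2 from a randomization (Y-scrambling) test.

    r2_model : R2 of the model fitted to the real responses
    r_random : correlation coefficients R of models fitted to
               scrambled responses
    Assumes r2_model >= squared mean of r_random; otherwise
    math.sqrt raises ValueError.
    """
    mean_r = sum(r_random) / len(r_random)
    rr2 = mean_r ** 2  # square of the mean correlation coefficient
    return r2_model * math.sqrt(r2_model - rr2)


# Hypothetical values: a model with R2 = 0.8 and two scrambled models.
print(rp2(0.8, [0.2, 0.4]))
```

A model whose R2 is barely above what scrambled responses achieve gets a small Rp2, even when R2 itself looks acceptable.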

Journal ArticleDOI
TL;DR: In this article, some additional variants of rm2 metrics have been proposed and their applications in judging the quality of predictions of QSPR models have been shown by analyzing results of the QSPR models obtained from three different data sets (n = 119, 90, and 384).

467 citations

Journal ArticleDOI
TL;DR: The present study reports that the web application can be easily used for computation of rm2 metrics provided observed and QSAR‐predicted data for a set of compounds are available and scaling of response data is recommended prior to rm2 calculation.
Abstract: Quantitative structure-activity relationship (QSAR) techniques have found wide application in the fields of drug design, property modeling, and toxicity prediction of untested chemicals. A rigorous validation of the developed models plays the key role for their successful application in prediction for new compounds. The rm2 metrics introduced by Roy et al. have been extensively used by different research groups for validation of regression-based QSAR models. This concept has been further advanced here with introduction of scaling of response data prior to computation of rm2. Further, a web application (accessible from http://aptsoftware.co.in/rmsquare/ and http://203.200.173.43:8080/rmsquare/) for calculation of the rm2 metrics has been introduced here. The present study reports that the web application can be easily used for computation of rm2 metrics provided observed and QSAR-predicted data for a set of compounds are available. Further, scaling of response data is recommended prior to rm2 calculation.

360 citations
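The abstract above recommends scaling the response data before computing rm2 but does not state the scaling formula. The sketch below shows one plausible choice, min-max scaling of both observed and predicted values against the observed range; whether this matches the procedure used by the web application is an assumption, and the function name is hypothetical.

```python
def scale_to_observed_range(observed, predicted):
    """Min-max scale both series using the OBSERVED minimum and maximum
    (an assumed scheme). Observed values map onto [0, 1]; predicted
    values may fall slightly outside that interval."""
    lo, hi = min(observed), max(observed)
    span = hi - lo
    scaled_obs = [(y - lo) / span for y in observed]
    scaled_pred = [(p - lo) / span for p in predicted]
    return scaled_obs, scaled_pred


# Hypothetical responses, for illustration only.
obs_s, pred_s = scale_to_observed_range([2.0, 4.0, 6.0], [3.0, 5.0, 7.0])
print(obs_s, pred_s)
```

Using a single reference range for both series keeps the differences between observed and predicted values comparable across data sets with different response units.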

Journal ArticleDOI
Kunal Roy
TL;DR: This review focuses on the importance of validation of quantitative structure–activity relationship models and different methods of validation.
Abstract: The success of any quantitative structure-activity relationship model depends on the accuracy of the input data, selection of appropriate descriptors and statistical tools and, most importantly, the validation of the developed model. Validation is the process by which the reliability and relevance of a procedure are established for a specific purpose. This review focuses on the importance of validation of quantitative structure-activity relationship models and different methods of validation. Some important issues, such as internal versus external validation, method of selection of training set compounds and training set size, applicability domain, variable selection and suitable parameters to indicate external predictivity, are also discussed.

242 citations


Cites background or methods from "On Some Aspects of Variable Selecti..."

  • ...Very recently, Roy and Roy [67] have worked on the validation of PLS models of a thiocarbamate data set to explore the optimum variable selection strategy....

    [...]

  • ...Roy et al. [68] have shown that when modelling data sets of anti-HIV thiocarbamates (n = 67) and 1-[(2-hydroxyethoxy)methyl]-6-(phenylthio)thymine (HEPT) (n = 107) derivatives using topological parameters, the quality of the external validation statistics goes on decreasing when the training set size is gradually decreased....

    [...]

  • ...In their work on validation of PLS models of a thiocarbamate data set, Roy and Roy have shown that in some cases, R2pred does not show high intercorrelation with r2 and/or r02 (the squared correlation coefficient between observed and predicted values for the test set with and without intercept, respectively) [67]....

    [...]

  • ...Leonard and Roy [33] have performed validation of QSAR models for three data sets of different sizes based on random division, sorted biological activity data and K-means clusters for the factor scores of the original variable matrix along with/without biological activity values....

    [...]

  • ...To better understand external predictivity of models, Roy and Roy [67] have defined a modified r2 term (rm2) in the following manner: $r_m^2 = r^2 \times \left(1 - \left|\sqrt{r^2 - r_0^2}\right|\right)$. In cases of good external prediction, the predicted values of test set compounds are near the corresponding observed values and, thus, r02 values are close to the r2 value....

    [...]
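The rm2 term discussed in the last excerpt combines r2 and r02 as rm2 = r2 × (1 − sqrt(|r2 − r02|)). A minimal sketch of the calculation follows. It assumes that r02 is obtained by regressing observed on predicted values through the origin; the choice of axes and the function names are assumptions, not details confirmed by the excerpts.

```python
import math


def r2(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    sxx = sum((o - mo) ** 2 for o in obs)
    syy = sum((p - mp) ** 2 for p in pred)
    return sxy ** 2 / (sxx * syy)


def r02(obs, pred):
    """Squared correlation for the line forced through the origin
    (assumed convention: observed regressed on predicted)."""
    n = len(obs)
    mo = sum(obs) / n
    k = sum(o * p for o, p in zip(obs, pred)) / sum(p * p for p in pred)
    ss_res = sum((o - k * p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mo) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot


def rm2(obs, pred):
    """rm2 = r2 * (1 - sqrt(|r2 - r02|)); the absolute value mirrors
    the one in the quoted definition."""
    a, b = r2(obs, pred), r02(obs, pred)
    return a * (1.0 - math.sqrt(abs(a - b)))


# Predictions perfectly correlated with observations but shifted by +1:
# r2 equals 1, yet rm2 is clearly lower because r02 differs from r2.
print(rm2([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]))
```

When predictions track the observations closely, r02 approaches r2 and rm2 approaches r2; a systematic shift widens the gap and lowers rm2.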

References
Book
01 Jan 1974
TL;DR: This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering.
Abstract: Cluster analysis comprises a range of methods for classifying multivariate data into subgroups. By organising multivariate data into such subgroups, clustering can help reveal the characteristics of any structure or patterns present. These techniques are applicable in a wide range of areas such as medicine, psychology and market research. This fourth edition of the highly successful Cluster Analysis represents a thorough revision of the third edition and covers new and developing areas such as classification likelihood and neural networks for clustering. Real life examples are used throughout to demonstrate the application of the theory, and figures are used extensively to illustrate graphical techniques. The book is comprehensive yet relatively non-mathematical, focusing on the practical aspects of cluster analysis.

9,857 citations

Journal ArticleDOI
TL;DR: It is argued that the high value of LOO q2 appears to be the necessary but not the sufficient condition for the model to have a high predictive power, which is the general property of QSAR models developed using LOO cross-validation.
Abstract: Validation is a crucial aspect of any quantitative structure-activity relationship (QSAR) modeling. This paper examines one of the most popular validation criteria, leave-one-out cross-validated R2 (LOO q2). Often, a high value of this statistical characteristic (q2 > 0.5) is considered as a proof of the high predictive ability of the model. In this paper, we show that this assumption is generally incorrect. In the case of 3D QSAR, the lack of the correlation between the high LOO q2 and the high predictive ability of a QSAR model has been established earlier [Pharm. Acta Helv. 70 (1995) 149; J. Chemomet. 10(1996)95; J. Med. Chem. 41 (1998) 2553]. In this paper, we use two-dimensional (2D) molecular descriptors and k nearest neighbors (kNN) QSAR method for the analysis of several datasets. No correlation between the values of q2 for the training set and predictive ability for the test set was found for any of the datasets. Thus, the high value of LOO q2 appears to be the necessary but not the sufficient condition for the model to have a high predictive power. We argue that this is the general property of QSAR models developed using LOO cross-validation. We emphasize that the external validation is the only way to establish a reliable QSAR model. We formulate a set of criteria for evaluation of predictive ability of QSAR models.

3,176 citations
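To make the leave-one-out statistic concrete, here is a small sketch of LOO q2 = 1 − PRESS / sum((y − mean(y))^2). The cited paper uses kNN models; a one-descriptor least squares line is swapped in here purely to keep the example short, and the function names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_q2(xs, ys):
    """Leave-one-out cross-validated R2: each compound is predicted by
    a model fitted to all the other compounds."""
    n = len(ys)
    y_mean = sum(ys) / n
    press = 0.0
    for i in range(n):
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    return 1.0 - press / sum((y - y_mean) ** 2 for y in ys)


# Hypothetical descriptor/response pairs, for illustration only.
print(loo_q2([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]))
```

Every prediction in this procedure still comes from compounds of the same data set, which is why the abstract treats a high q2 as necessary but not sufficient and calls for external validation.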

Journal ArticleDOI
TL;DR: In this paper, the mathematical and statistical structure of PLS regression is developed, the PLS regression algorithm is interpreted in a model building setting, and the PLS decomposition of the data matrices involved is analyzed.
Abstract: In this paper we develop the mathematical and statistical structure of PLS regression. We show the PLS regression algorithm and how it can be interpreted in model building. The basic mathematical principles that lie behind two block PLS are depicted. We also show the statistical aspects of the PLS method when it is used for model building. Finally we show the structure of the PLS decompositions of the data matrices involved.

1,778 citations