Journal Article

Performance of some variable selection methods when multicollinearity is present

TL;DR: The nature of the VIP method is explored and compared with other methods through computer simulation experiments considering four factors: the proportion of relevant predictors, the magnitude of correlations between predictors, the structure of regression coefficients, and the magnitude of signal to noise.
About: This article was published in Chemometrics and Intelligent Laboratory Systems on 2005-07-28. It has received 1595 citations to date. The article focuses on the topics: Multicollinearity & Partial least squares regression.
Citations
Journal Article
TL;DR: Results suggest some changes to the pilot plant configuration are necessary to reduce power consumption while maximizing biodigester performance, and a modification of the typical continuous stirred tank reactor is a promising process, being relatively stable and able to manage considerable amounts of residuals at low operational cost.
Abstract: Intensive poultry production generates over 100,000 t of litter annually in West Virginia and 9×10⁶ t nationwide. Currently available technological alternatives based on thermophilic anaerobic digestion for residuals treatment are diverse. A modification of the typical continuous stirred tank reactor is a promising process, being relatively stable and able to manage considerable amounts of residuals at low operational cost. A 40-m³ pilot plant digester was used for performance evaluation considering energy input and methane production. Results suggest some changes to the pilot plant configuration are necessary to reduce power consumption while maximizing biodigester performance.

1,287 citations

Journal Article
TL;DR: A review of available methods for variable selection within one of the many modeling approaches for high-throughput data, Partial Least Squares Regression, to gain an understanding of the characteristics of the methods and a basis for selecting an appropriate method for one's own use.

1,180 citations


Cites background from "Performance of some variable select..."

  • ...It is generally accepted that a variable should be selected if vj>1, [27–29], but a proper threshold between 0....

    [...]

  • ...21 can yield more relevant variables according to [28]....

    [...]
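The v_j > 1 rule quoted in the excerpts above can be illustrated with a minimal sketch. It assumes the commonly used VIP definition for a single-response PLS model, VIP_j = sqrt(p · Σ_a SS_a w_ja² / Σ_a SS_a), where w_a is the unit-length weight vector of component a and SS_a is the response sum of squares that component explains. The function name `pls1_vip`, the NIPALS-style fit, and the toy data are illustrative assumptions, not taken from the paper.

```python
import math

def pls1_vip(X, y, n_components):
    """VIP scores for a single-response PLS model fitted NIPALS-style (sketch)."""
    n, p = len(X), len(X[0])
    # Center the predictors and the response; work on copies.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    E = [[row[j] - means[j] for j in range(p)] for row in X]
    y_mean = sum(y) / n
    f = [v - y_mean for v in y]

    weights, explained = [], []
    for _ in range(n_components):
        # Weight vector: covariance of each predictor with the current residual response.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            break  # no covariance left to model
        w = [v / norm for v in w]
        # Scores, response loading, and predictor loadings for this component.
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]
        tt = sum(v * v for v in t)
        q = sum(f[i] * t[i] for i in range(n)) / tt
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        # Deflate predictors and response before extracting the next component.
        for i in range(n):
            for j in range(p):
                E[i][j] -= t[i] * load[j]
            f[i] -= q * t[i]
        weights.append(w)
        explained.append(q * q * tt)  # response sum of squares explained

    total = sum(explained)
    if total == 0.0:
        raise ValueError("no response variation was explained")
    return [
        math.sqrt(p * sum(ss * w[j] ** 2 for ss, w in zip(explained, weights)) / total)
        for j in range(p)
    ]

# Toy data (hypothetical): the response tracks the first predictor closely.
X = [[1.0, 2.0, 0.5], [2.0, 1.0, 0.1], [3.0, 4.0, 0.9],
     [4.0, 3.0, 0.3], [5.0, 6.0, 0.7], [6.0, 5.0, 0.2]]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

vip = pls1_vip(X, y, n_components=2)
selected = [j for j, v in enumerate(vip) if v > 1.0]  # "greater than one" rule
```

Because each weight vector has unit length, the squared VIP scores sum to the number of predictors, so their mean square is one. That makes 1 a natural default cutoff, although the excerpts note that other thresholds have been proposed.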

Journal Article
15 Aug 2010-Geoderma
TL;DR: In this article, the root mean square error (RMSE) and the Akaike Information Criterion (AIC) were used to compare different data mining algorithms for modelling soil visible-near infrared (vis-NIR) diffuse reflectance spectra and to assess the interpretability of the results.

928 citations
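The two criteria named in the entry above can be sketched as follows. The AIC form used here, n·ln(RSS/n) + 2k, is the common Gaussian least-squares variant and is an assumption; the cited article may define it differently. The function names are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error of the predictions."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)

def aic_least_squares(y_true, y_pred, n_params):
    """AIC up to an additive constant for a Gaussian least-squares fit (assumed form)."""
    n = len(y_true)
    rss = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    if rss == 0.0:
        raise ValueError("AIC is undefined for a perfect fit")
    return n * math.log(rss / n) + 2 * n_params

# Hypothetical observations and predictions from a one-parameter model.
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 5.0]
error = rmse(observed, predicted)
score = aic_least_squares(observed, predicted, n_params=1)
```

RMSE measures fit alone, whereas AIC adds a penalty of 2 per parameter, so a more complex model is preferred only if it reduces the residual sum of squares enough to offset that penalty.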

Journal Article
TL;DR: The emphasis in this paper is on how to use variable selection in practice and avoid the most common pitfalls.
Abstract: This paper provides a practical guide to variable selection in chemometrics with a focus on regression-based calibration models. Several approaches, such as genetic algorithms (GAs), jack-knifing, forward selection, etc., are explained; it is also explained how to choose between different kinds of variable selection methods. The emphasis in this paper is on how to use variable selection in practice and avoid the most common pitfalls. Copyright © 2010 John Wiley & Sons, Ltd.

580 citations

Journal Article
28 Apr 2016-Nature
TL;DR: It is shown that specific plankton communities, from the surface and deep chlorophyll maximum, correlate with carbon export at 150 m and that the relative abundance of a few bacterial and viral genes can predict a significant fraction of the variability in carbon export in these regions.
Abstract: The biological carbon pump is the process by which CO2 is transformed to organic carbon via photosynthesis, exported through sinking particles, and finally sequestered in the deep ocean. While the intensity of the pump correlates with plankton community composition, the underlying ecosystem structure driving the process remains largely uncharacterized. Here we use environmental and metagenomic data gathered during the Tara Oceans expedition to improve our understanding of carbon export in the oligotrophic ocean. We show that specific plankton communities, from the surface and deep chlorophyll maximum, correlate with carbon export at 150 m and highlight unexpected taxa such as Radiolaria and alveolate parasites, as well as Synechococcus and their phages, as lineages most strongly associated with carbon export in the subtropical, nutrient-depleted, oligotrophic ocean. Additionally, we show that the relative abundance of a few bacterial and viral genes can predict a significant fraction of the variability in carbon export in these regions.

556 citations

References
Journal Article
TL;DR: Chapter 11 includes more case studies in other areas, ranging from manufacturing to marketing research, and a detailed comparison with other diagnostic tools, such as logistic regression and tree-based methods.
Abstract: Chapter 11 includes more case studies in other areas, ranging from manufacturing to marketing research. Chapter 12 concludes the book with some commentary about the scientific contributions of MTS. The Taguchi method for design of experiment has generated considerable controversy in the statistical community over the past few decades. The MTS/MTGS method seems to lead another source of discussions on the methodology it advocates (Montgomery 2003). As pointed out by Woodall et al. (2003), the MTS/MTGS methods are considered ad hoc in the sense that they have not been developed using any underlying statistical theory. Because the “normal” and “abnormal” groups form the basis of the theory, some sampling restrictions are fundamental to the applications. First, it is essential that the “normal” sample be uniform, unbiased, and/or complete so that a reliable measurement scale is obtained. Second, the selection of “abnormal” samples is crucial to the success of dimensionality reduction when OAs are used. For example, if each abnormal item is really unique in the medical example, then it is unclear how the statistical distance MD can be guaranteed to give a consistent diagnosis measure of severity on a continuous scale when the larger-the-better type S/N ratio is used. Multivariate diagnosis is not new to Technometrics readers and is now becoming increasingly more popular in statistical analysis and data mining for knowledge discovery. As a promising alternative that assumes no underlying data model, The Mahalanobis–Taguchi Strategy does not provide sufficient evidence of gains achieved by using the proposed method over existing tools. Readers may be very interested in a detailed comparison with other diagnostic tools, such as logistic regression and tree-based methods. Overall, although the idea of MTS/MTGS is intriguing, this book would be more valuable had it been written in a rigorous fashion as a technical reference. There is some lack of precision even in several mathematical notations. Perhaps a follow-up with additional theoretical justification and careful case studies would answer some of the lingering questions.

11,507 citations


"Performance of some variable select..." refers methods in this paper

  • ...The number of latent variables for PLS regression, the tuning parameter for the Lasso and the significance levels for stepwise regression are determined by five-fold cross-validation, which is widely used for estimating prediction error [12]....

    [...]
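The five-fold cross-validation mentioned in the excerpt above can be sketched as below. To keep the example short, the tuned model is a hypothetical one-predictor ridge fit standing in for PLS, the Lasso, or stepwise regression; the helper names, candidate values, and toy data are all assumptions for illustration.

```python
import random

def k_fold_indices(n, k=5, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_ridge_1d(x, y, lam):
    """Stand-in model: one-predictor ridge regression on centered data."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return mx, my, sxy / (sxx + lam)

def predict(model, x):
    mx, my, slope = model
    return [my + slope * (a - mx) for a in x]

def cv_error(x, y, lam, k=5, seed=0):
    """Mean squared prediction error estimated by k-fold cross-validation."""
    squared_error = 0.0
    for held_out in k_fold_indices(len(x), k, seed):
        held = set(held_out)
        train = [i for i in range(len(x)) if i not in held]
        # Fit on the other k-1 folds, then predict the held-out fold.
        model = fit_ridge_1d([x[i] for i in train], [y[i] for i in train], lam)
        preds = predict(model, [x[i] for i in held_out])
        squared_error += sum((p - y[i]) ** 2 for p, i in zip(preds, held_out))
    return squared_error / len(x)

# Toy data (hypothetical): a linear trend with a small alternating disturbance.
x = [float(i) for i in range(20)]
y = [2.0 * v + 1.0 + (0.5 if i % 2 == 0 else -0.5) for i, v in enumerate(x)]

candidates = [0.0, 1.0, 10.0, 100.0]
errors = {lam: cv_error(x, y, lam) for lam in candidates}
best_lam = min(errors, key=errors.get)  # candidate with the lowest CV error
```

The same loop applies to the settings named in the excerpt: replace the candidate penalties with candidate numbers of latent variables or significance levels, and swap in the corresponding fit and predict functions.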

Journal Article
TL;DR: PLS-regression (PLSR), the PLS approach in its simplest and, in chemistry and technology, most used form (two-block predictive PLS), is a method for relating two data matrices, X and Y, by a linear multivariate model.

7,861 citations

Journal Article
TL;DR: A publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates is described.
Abstract: The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a Cp estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.

7,828 citations

Journal Article
TL;DR: In this paper, a tutorial on the Partial Least Squares (PLS) regression method is provided, and an algorithm for a predictive PLS and some practical hints for its use are given.

6,393 citations
