Collinearity: a review of methods to deal with it and a simulation study evaluating their performance
Carsten F. Dormann, Jane Elith, Sven Bacher, Carsten M. Buchmann, Gudrun Carl, Gabriel Carré, Jaime Ricardo García Márquez, Bernd Gruber, Bruno Lafourcade, Pedro J. Leitão, Tamara Münkemüller, Colin J. McClean, Patrick E. Osborne, Björn Reineking, Boris Schröder, Andrew K. Skidmore, Damaris Zurell, Sven Lautenbach
TLDR
It was found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. The results highlight the value of GLM in combination with penalised methods and threshold-based pre-selection, provided that omitted variables are considered in the final interpretation.
Abstract:
Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time, and predictions are made for another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors, threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating its performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation.
However, all approaches tested yielded degraded predictions under change in collinearity structure, and the ‘folklore’ threshold of |r| > 0.7 for correlation coefficients between predictor variables was an appropriate indicator of when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity, but cannot ultimately solve them.
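The threshold-based screening and the variance inflation mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' simulation code. The function names `correlated_pairs` and `variance_inflation_factors`, the toy predictors `x1` to `x3`, and the use of the variance inflation factor (VIF) as the inflation measure are assumptions made for illustration only.

```python
import numpy as np


def correlated_pairs(X, names, threshold=0.7):
    """Return (name_i, name_j, r) for predictor pairs with |r| above threshold."""
    r = np.corrcoef(X, rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(r[i, j]) > threshold:
                pairs.append((names[i], names[j], float(r[i, j])))
    return pairs


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), taken from the diagonal of the inverse
    correlation matrix of the predictors."""
    r = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(r))


# Hypothetical toy data: x2 is built to correlate with x1 at about 0.9,
# while x3 is generated independently of both.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
names = ["x1", "x2", "x3"]

flagged = correlated_pairs(X, names)   # pairs exceeding |r| > 0.7
vif = variance_inflation_factors(X)    # one value per predictor

print(flagged)
print(dict(zip(names, np.round(vif, 2))))
```

In a pre-selection workflow, one predictor from each flagged pair would be dropped before fitting. As the abstract stresses, the omitted variable still has to be considered when the fitted model is interpreted.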