Open Access Journal Article

Collinearity: a review of methods to deal with it and a simulation study evaluating their performance

TLDR
Methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. The results highlight the value of GLM in combination with penalised methods and threshold-based pre-selection, provided that omitted variables are considered in the final interpretation.
Abstract
Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time and then used to predict to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors and threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating their performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation.
However, all approaches tested yielded degraded predictions under a change in collinearity structure, and the ‘folklore’ threshold of |r| > 0.7 for correlation coefficients between predictor variables was an appropriate indicator of when collinearity begins to severely distort model estimation and subsequent prediction. The use of ecological understanding of the system in pre-analysis variable selection and the choice of the least sensitive statistical approaches reduce the problems of collinearity, but cannot ultimately solve them.
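
As a rough illustration of the |r| > 0.7 screening rule mentioned in the abstract, the following Python sketch flags highly correlated predictor pairs in simulated data. It is not the authors' simulation code. The function names, the toy data and the use of the variance inflation factor (VIF, a standard diagnostic that the abstract does not name) are assumptions made for this example.

```python
import numpy as np


def high_correlation_pairs(X, threshold=0.7):
    """Return (i, j, r) for every predictor pair with |r| above the threshold."""
    r = np.corrcoef(X, rowvar=False)
    p = r.shape[0]
    return [
        (i, j, r[i, j])
        for i in range(p)
        for j in range(i + 1, p)
        if abs(r[i, j]) > threshold
    ]


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the others."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Design matrix: intercept plus all predictors except column j.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        centred = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centred @ centred)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs


# Toy data (hypothetical): x2 is built to correlate with x1 at about 0.9,
# while x3 is independent of both.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + np.sqrt(1 - 0.9**2) * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

pairs = high_correlation_pairs(X)      # pairs exceeding |r| > 0.7
vifs = variance_inflation_factors(X)   # one VIF per predictor
```

A pre-selection step in this spirit would keep only one predictor from each flagged pair. As the abstract stresses, the omitted variable should still be considered when interpreting the fitted model.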
