BookDOI

Regression modeling strategies : with applications to linear models, logistic regression, and survival analysis

01 Jan 2001
TL;DR: In this book, the authors present strategies for regression modeling, including a case study in least squares fitting and interpretation of a linear model and models using nonparametric transformations of X and Y.
Abstract: Introduction * General Aspects of Fitting Regression Models * Missing Data * Multivariable Modeling Strategies * Resampling, Validating, Describing, and Simplifying the Model * S-PLUS Software * Case Study in Least Squares Fitting and Interpretation of a Linear Model * Case Study in Imputation and Data Reduction * Overview of Maximum Likelihood Estimation * Binary Logistic Regression * Logistic Model Case Study 1: Predicting Cause of Death * Logistic Model Case Study 2: Survival of Titanic Passengers * Ordinal Logistic Regression * Case Study in Ordinal Regression, Data Reduction, and Penalization * Models Using Nonparametric Transformations of X and Y * Introduction to Survival Analysis * Parametric Survival Models * Case Study in Parametric Survival Modeling and Model Approximation * Cox Proportional Hazards Regression Model * Case Study in Cox Regression
Citations
Journal ArticleDOI
TL;DR: The mice package adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing of imputed values, specialized pooling routines, model selection tools, and diagnostic graphs.
Abstract: The R package mice imputes incomplete multivariate data by chained equations. The software mice 1.0 appeared in the year 2000 as an S-PLUS library, and in 2001 as an R package. mice 1.0 introduced predictor selection, passive imputation and automatic pooling. This article documents mice, which extends the functionality of mice 1.0 in several ways. In mice, the analysis of imputed data is made completely general, whereas the range of models under which pooling works is substantially extended. mice adds new functionality for imputing multilevel data, automatic predictor selection, data handling, post-processing imputed values, specialized pooling routines, model selection tools, and diagnostic graphs. Imputation of categorical data is improved in order to bypass problems caused by perfect prediction. Special attention is paid to transformations, sum scores, indices and interactions using passive imputation, and to the proper setup of the predictor matrix. mice can be downloaded from the Comprehensive R Archive Network. This article provides a hands-on, stepwise approach to solve applied incomplete data problems.
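
A minimal sketch of the impute-analyze-pool workflow the article describes, using the nhanes example data shipped with mice; the analysis formula itself is purely illustrative:

```r
library(mice)

# Impute the incomplete data with five chained-equations imputations
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Fit the same analysis model to each completed data set
fit <- with(imp, lm(chl ~ bmi + age))

# Combine the five sets of estimates with Rubin's pooling rules
summary(pool(fit))
```

Keeping the imputation, analysis, and pooling steps separate is what makes the analysis of imputed data "completely general": any model that returns coefficients and standard errors can be passed through with() and pool().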

10,234 citations

Journal ArticleDOI
TL;DR: This work compared 16 modelling methods across 226 species from six regions of the world, creating the most comprehensive set of model comparisons to date, and found that presence-only data were effective for modelling species' distributions for many species and regions.
Abstract: Prediction of species' distributions is central to diverse applications in ecology, evolution and conservation science. There is increasing electronic access to vast sets of occurrence records in museums and herbaria, yet little effective guidance on how best to use this information given the numerous approaches for modelling distributions. To meet this need, we compared 16 modelling methods across 226 species from six regions of the world, creating the most comprehensive set of model comparisons to date. We used presence-only data to fit models and independent presence-absence data to evaluate the predictions. Along with well-established modelling methods such as generalised additive models, GARP, and BIOCLIM, we explored methods that either have been developed recently or have rarely been applied to modelling species' distributions. These include machine-learning methods and community models, both of which have features that may make them particularly well suited to noisy or sparse information, as is typical of species' occurrence data. Presence-only data were effective for modelling species' distributions for many species and regions. The novel methods consistently outperformed more established methods. The results of our analysis are promising for the use of data from museums and herbaria, especially as methods suited to the noise inherent in such data improve.
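
As a hedged illustration of one of the established approaches named above, the sketch below fits a generalised additive model to presence/background data with the mgcv package; the data frame and its climate covariates are simulated stand-ins, not data from the study:

```r
library(mgcv)

# Hypothetical presence/background data: presence = 1 for occurrence
# records, 0 for background points; temp and precip are covariates.
set.seed(1)
occ <- data.frame(
  presence = rbinom(500, 1, 0.3),
  temp     = rnorm(500, 15, 5),
  precip   = rnorm(500, 1000, 300)
)

# GAM with smooth terms for each climate predictor
m <- gam(presence ~ s(temp) + s(precip), family = binomial, data = occ)

# Predicted habitat suitability on the probability scale
occ$suit <- predict(m, type = "response")
```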

7,589 citations

Journal ArticleDOI
TL;DR: It is argued that researchers using linear mixed-effects models (LMEMs) for confirmatory hypothesis testing should minimally adhere to the standards that have been in place for many decades, and it is shown that LMEMs generalize best when they include the maximal random effects structure justified by the design.
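
One common way to express a "maximal" random-effects structure is lme4 formula syntax; the crossed subjects-by-items design, the within-unit factor condition, and all variable names below are hypothetical, chosen only to make the structure concrete:

```r
library(lme4)

# Hypothetical long-format data: subjects crossed with items,
# one response time per subject x item x condition cell
d <- expand.grid(subject = factor(1:20), item = factor(1:10),
                 condition = factor(c("a", "b")))
d$rt <- rnorm(nrow(d), 500, 50)

# Maximal structure justified by this design: by-subject and by-item
# random intercepts plus random slopes for the within-unit factor
m_max <- lmer(rt ~ condition +
                (1 + condition | subject) +
                (1 + condition | item), data = d)

# Intercepts-only model, for contrast with the maximal structure
m_min <- lmer(rt ~ condition + (1 | subject) + (1 | item), data = d)
```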

6,878 citations

Journal ArticleDOI
TL;DR: Methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM with threshold-based pre-selection; the results highlight the value of GLM combined with penalised methods and threshold-based pre-selection when omitted variables are considered in the final interpretation.
Abstract: Collinearity refers to the non-independence of predictor variables, usually in a regression-type analysis. It is a common feature of any descriptive ecological data set and can be a problem for parameter estimation because it inflates the variance of regression parameters and hence potentially leads to the wrong identification of relevant predictors in a statistical model. Collinearity is a severe problem when a model is trained on data from one region or time and used to predict to another with a different or unknown structure of collinearity. To demonstrate the reach of the problem of collinearity in ecology, we show how relationships among predictors differ between biomes, change over spatial scales and through time. Across disciplines, different approaches to addressing collinearity problems have been developed, ranging from clustering of predictors and threshold-based pre-selection, through latent variable methods, to shrinkage and regularisation. Using simulated data with five predictor-response relationships of increasing complexity and eight levels of collinearity, we compared ways to address collinearity with standard multiple regression and machine-learning approaches. We assessed the performance of each approach by testing its impact on prediction to new data. In the extreme, we tested whether the methods were able to identify the true underlying relationship in a training dataset with strong collinearity by evaluating performance on a test dataset without any collinearity. We found that methods specifically designed for collinearity, such as latent variable methods and tree-based models, did not outperform the traditional GLM and threshold-based pre-selection. Our results highlight the value of GLM in combination with penalised methods (particularly ridge) and threshold-based pre-selection when omitted variables are considered in the final interpretation. However, all approaches tested yielded degraded predictions under change in collinearity structure, and the 'folk lore' threshold of |r| > 0.7 for correlations between predictor variables was an appropriate indicator of when collinearity begins to severely distort model estimation and subsequent prediction. Using ecological understanding of the system in pre-analysis variable selection, together with the least sensitive statistical approaches, reduces the problems of collinearity but cannot ultimately solve them.
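
A hedged sketch of two of the remedies discussed above: flagging predictor pairs that exceed the |r| > 0.7 threshold, and fitting a ridge-penalised model via glmnet (alpha = 0 selects the ridge penalty). The collinear predictors are simulated for illustration:

```r
library(glmnet)

# Simulate collinear predictors: x2 is largely a noisy copy of x1
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)
x3 <- rnorm(n)
X  <- cbind(x1, x2, x3)
y  <- 1 + 2 * x1 - x3 + rnorm(n)

# Threshold-based screening: flag pairs with |r| > 0.7
r <- cor(X)
which(abs(r) > 0.7 & upper.tri(r), arr.ind = TRUE)

# Ridge regression, with the penalty chosen by cross-validation
cv <- cv.glmnet(X, y, alpha = 0)
coef(cv, s = "lambda.min")
```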

6,199 citations

Journal ArticleDOI
TL;DR: Methods are described to determine the sampling distribution of the standardized difference when the true standardized difference equals zero, thereby allowing one to determine the range of standardized differences that are plausible when the propensity score model has been correctly specified.
Abstract: The propensity score is a subject's probability of treatment, conditional on observed baseline covariates. Conditional on the true propensity score, treated and untreated subjects have similar distributions of observed baseline covariates. Propensity-score matching is a popular method of using the propensity score in the medical literature. Using this approach, matched sets of treated and untreated subjects with similar values of the propensity score are formed. Inferences about treatment effect made using propensity-score matching are valid only if, in the matched sample, treated and untreated subjects have similar distributions of measured baseline covariates. In this paper we discuss the following methods for assessing whether the propensity score model has been correctly specified: comparing means and prevalences of baseline characteristics using standardized differences; ratios comparing the variance of continuous covariates between treated and untreated subjects; comparison of higher-order moments and interactions; five-number summaries; and graphical methods such as quantile–quantile plots, side-by-side boxplots, and non-parametric density plots for comparing the distribution of baseline covariates between treatment groups. We describe methods to determine the sampling distribution of the standardized difference when the true standardized difference is equal to zero, thereby allowing one to determine the range of standardized differences that are plausible when the propensity score model has been correctly specified. We highlight the limitations of some previously used methods for assessing the adequacy of the specification of the propensity-score model. In particular, methods based on comparing the distribution of the estimated propensity score between treated and untreated subjects are uninformative.
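
A minimal sketch of the standardized difference used for balance checking, in its usual pooled-variance form d = (x̄_t − x̄_c) / √((s_t² + s_c²)/2) for a continuous covariate x and treatment indicator z; the simulated data and variable names are illustrative only:

```r
# Standardized difference for a continuous covariate:
# d = (mean_t - mean_c) / sqrt((s_t^2 + s_c^2) / 2)
std_diff <- function(x, z) {
  xt <- x[z == 1]; xc <- x[z == 0]
  (mean(xt) - mean(xc)) / sqrt((var(xt) + var(xc)) / 2)
}

# Illustrative check on simulated data with mild imbalance by construction
set.seed(1)
z <- rbinom(300, 1, 0.5)
x <- rnorm(300, mean = 0.2 * z)
std_diff(x, z)

# Variance ratio between groups, another diagnostic the paper discusses
var(x[z == 1]) / var(x[z == 0])
```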

3,929 citations