Journal ArticleDOI

Unbiased Recursive Partitioning: A Conditional Inference Framework

TLDR
A unified framework for recursive partitioning is proposed which embeds tree-structured regression models into a well defined theory of conditional inference procedures, and it is shown that the prediction accuracy of trees with early stopping is equivalent to the prediction accuracy of pruned trees with unbiased variable selection.
Abstract
Recursive binary partitioning is a popular tool for regression analysis. Two fundamental problems of exhaustive search procedures usually applied to fit such models have been known for a long time: overfitting and a selection bias towards covariates with many possible splits or missing values. While pruning procedures are able to solve the overfitting problem, the variable selection bias still seriously affects the interpretability of tree-structured regression models. For some special cases unbiased procedures have been suggested, however lacking a common theoretical foundation. We propose a unified framework for recursive partitioning which embeds tree-structured regression models into a well defined theory of conditional inference procedures. Stopping criteria based on multiple test procedures are implemented and it is shown that the predictive performance of the resulting trees is as good as the performance of established exhaustive search procedures. It turns out that the partitions and therefore the models induced by both approaches are structurally different.
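
The framework described in the abstract is available in R, for example as the ctree() function in the partykit package. A minimal sketch, assuming partykit is installed, of how the stopping rule is driven by the significance level of the conditional inference tests rather than by pruning (the alpha value and data set are illustrative choices):

    # Minimal sketch: fit a conditional inference tree on a built-in data set.
    # Assumes the partykit package is installed; airquality ships with base R.
    library(partykit)

    aq <- subset(airquality, !is.na(Ozone))   # drop rows with a missing response

    # Split/stop decisions are based on permutation tests; alpha is the
    # significance level of the stopping criterion, so no pruning step is needed.
    ct <- ctree(Ozone ~ ., data = aq,
                control = ctree_control(alpha = 0.05))

    print(ct)                          # text representation of the fitted tree
    plot(ct)                           # tree with node-wise distribution plots
    predict(ct, newdata = aq[1:5, ])   # fitted values for a few observations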


Citations
Journal ArticleDOI

Classification and regression trees

TL;DR: This article gives an introduction to the subject of classification and regression trees by reviewing some widely available algorithms and comparing their capabilities, strengths, and weaknesses in two examples.
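
For contrast with the conditional inference approach above, the classical exhaustive-search CART procedure with cost-complexity pruning is available in R through the rpart package. A minimal sketch, assuming rpart is installed (data set and pruning rule are illustrative):

    # Minimal sketch: CART-style exhaustive search followed by pruning (rpart).
    library(rpart)

    fit <- rpart(Ozone ~ ., data = airquality, method = "anova")

    # Inspect the cross-validated error for each cost-complexity value ...
    printcp(fit)

    # ... and prune back to the subtree at the complexity parameter that
    # minimizes the cross-validated error.
    cp_best <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned  <- prune(fit, cp = cp_best)
    print(pruned)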
Journal ArticleDOI

Building Predictive Models in R Using the caret Package

TL;DR: The caret package, short for classification and regression training, contains numerous tools for developing predictive models using the rich set of models available in R to simplify model training and tuning across a wide variety of modeling techniques.
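
A minimal sketch of how caret wraps model training and tuning, here using its "ctree" method (assumes the caret and partykit packages are installed; the resampling setup and grid size are illustrative choices):

    # Minimal sketch: train and tune a conditional inference tree via caret.
    library(caret)

    set.seed(1)
    fit <- train(Species ~ ., data = iris,
                 method     = "ctree",                        # conditional inference tree
                 trControl  = trainControl(method = "cv", number = 10),
                 tuneLength = 3)                               # size of the tuning grid

    print(fit)                        # resampled accuracy across the tuning grid
    predict(fit, newdata = head(iris))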
Book

Applied Predictive Modeling

Max Kuhn, +1 more
TL;DR: This book provides a practical guide to the labor-intensive, and therefore expensive, process of designing, implementing, and tuning predictive models for regression and classification problems.
Journal ArticleDOI

Bias in random forest variable importance measures: Illustrations, sources and a solution

TL;DR: An alternative implementation of random forests is proposed that provides unbiased variable selection in the individual classification trees and can therefore be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
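
Forests of conditional inference trees of this kind can be fitted in R with party::cforest using the cforest_unbiased() settings. A minimal sketch, assuming the party package is installed (ntree and mtry values are illustrative):

    # Minimal sketch: a random forest built from conditional inference trees,
    # using subsampling and unbiased split selection (cforest_unbiased settings).
    library(party)

    set.seed(1)
    cf <- cforest(Species ~ ., data = iris,
                  controls = cforest_unbiased(ntree = 200, mtry = 2))

    # Standard (marginal) permutation importance:
    varimp(cf)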
Journal ArticleDOI

Conditional variable importance for random forests

TL;DR: A new, conditional permutation scheme is developed for the computation of the variable importance measure that reflects the true impact of each predictor variable more reliably than the original marginal approach.
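
Continuing the cforest sketch above, the conditional permutation importance is exposed through the conditional argument of party::varimp; it is considerably more expensive to compute than the marginal version:

    # Minimal sketch: conditional permutation importance for a cforest fit.
    library(party)

    set.seed(1)
    cf <- cforest(Species ~ ., data = iris,
                  controls = cforest_unbiased(ntree = 200, mtry = 2))

    # conditional = TRUE permutes each predictor within a grid defined by the
    # covariates it is associated with, reducing spuriously inflated importance
    # for predictors that are merely correlated with relevant ones.
    varimp(cf, conditional = TRUE)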
References
Book

Applied Logistic Regression

TL;DR: Hosmer and Lemeshow provide an accessible introduction to the logistic regression model while incorporating advances of the last decade, including coverage of a variety of software packages for the analysis of data sets.
Journal ArticleDOI

Applied Logistic Regression.

TL;DR: Applied Logistic Regression, Third Edition provides an easily accessible introduction to the logistic regression model and highlights the power of this model by examining the relationship between a dichotomous outcome and a set of covariables.
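
A minimal sketch of modelling a dichotomous outcome with logistic regression in R via glm() (the mtcars data and choice of predictors are illustrative, not taken from the book):

    # Minimal sketch: logistic regression for a binary (dichotomous) outcome.
    # Uses the built-in mtcars data; 'am' (transmission type) is coded 0/1.
    fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)

    summary(fit)                       # coefficients on the log-odds scale
    exp(coef(fit))                     # odds ratios
    predict(fit, type = "response")    # fitted probabilities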
Book

C4.5: Programs for Machine Learning

TL;DR: A complete guide to the C4.5 system as implemented in C for the UNIX environment, which starts from simple core learning methods and shows how they can be elaborated and extended to deal with typical problems such as missing data and overfitting.
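
C4.5's tree learner is also available, for example, as Weka's J48 classifier, which can be called from R through the RWeka interface. A minimal sketch, assuming the RWeka package and a Java runtime are installed (the control settings shown are illustrative):

    # Minimal sketch: a C4.5-style decision tree via Weka's J48 learner.
    library(RWeka)

    fit <- J48(Species ~ ., data = iris,
               control = Weka_control(C = 0.25,   # pruning confidence factor
                                      M = 2))     # minimum instances per leaf

    print(fit)
    table(predict(fit), iris$Species)   # confusion matrix on the training data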