Open Access Book

Error Estimation and Model Selection

TLDR
A new analysis of the generalization error of the hypothesis which minimizes the empirical error within a finite hypothesis language is presented, and an information theoretic approach is pursued which does not require the assumption that empirical error rates in distinct cross validation folds are independent estimates.
Abstract
Machine learning algorithms search a space of possible hypotheses and estimate the error of each hypothesis using a sample. Most often, the goal of classification tasks is to find a hypothesis with a low true (or generalization) misclassification probability (or error rate); however, only the sample (or empirical) error rate can actually be measured and minimized. The true error rate of the returned hypothesis is unknown but can, for instance, be estimated using cross validation, and very general worst-case bounds can be given. This doctoral dissertation addresses a compound of questions on error assessment and the intimately related selection of a “good” hypothesis language, or learning algorithm, for a given problem.

In the first part of this thesis, I present a new analysis of the generalization error of the hypothesis which minimizes the empirical error within a finite hypothesis language. I present a solution which characterizes the generalization error of the apparently best hypothesis in terms of the distribution of error rates of hypotheses in the hypothesis language. The distribution of error rates can, for any given problem, be estimated efficiently from the sample. Effectively, this analysis predicts how good the outcome of a learning algorithm would be without the learning algorithm actually having to be invoked. This immediately leads to an efficient algorithm for the selection of a good hypothesis language (or “model”). The analysis predicts (and thus explains) the shape of learning curves with very high accuracy and thus contributes to a better understanding of the nature of over-fitting. I study the behavior of the model selection algorithm empirically (in particular, in comparison to cross validation) using both artificial problems and a large-scale text categorization problem.

In the next step, I study in which situations performing automatic model selection is actually beneficial; in particular, I study Occam algorithms and cross validation. Model selection techniques such as tree pruning, weight decay, or cross validation are employed by virtually all “practical” learners and are generally believed to enhance the performance of learning algorithms. However, I show that this belief is equivalent to an assumption on the distribution of problems which the learning algorithm is exposed to. I specify these distributional assumptions and quantify the benefit of Occam algorithms and cross validation in these situations. When the distributional assumptions fail, cross-validation based model selection increases the generalization error of the returned hypothesis on average.

When several distinct learners are assessed with respect to a particular problem (or one learner is assessed repeatedly with distinct parameter settings), an effect arises which is very similar to the over-fitting that occurs during error-minimization processes. The lowest observed error rate is an optimistic estimate of the corresponding generalization error. I quantify this bias. In particular, I study the bias which is imposed by repeated invocations of a learner with distinct parameter settings when n-fold cross validation is used to estimate the error rate. I pursue an information theoretic approach which does not require the assumption that empirical error rates measured in distinct cross validation folds are independent estimates. I discuss the implications of these results for empirical studies which have been carried out in the past and propose an experimental setting which leads to almost unbiased results.

Finally, I address complexity issues of model selection. In model selection based learning, the learning algorithm is restricted to a (small) model, chosen by the model selection algorithm. By contrast, in the boosting setting, the hypothesis is allowed to grow dynamically, often until the hypothesis is fitted to the data. By giving new worst-case time bounds for the AdaBoost algorithm, I show that in many cases the restriction to small sets of hypotheses causes the high complexity of learning.
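The selection effect described in the abstract can be made concrete with a small simulation: when the empirical error is minimized over many hypotheses, the lowest observed error rate tends to underestimate the true error of the selected hypothesis. The sketch below is not taken from the dissertation; the hypothesis-language size, sample size, and the uniform distribution of true error rates are arbitrary illustrative assumptions.

```python
# Minimal simulation sketch (not from the dissertation): it illustrates why the
# lowest empirical error over a finite hypothesis language is an optimistically
# biased estimate of the selected hypothesis' true error. The hypothesis count,
# sample size, and error-rate distribution below are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_hypotheses = 50   # size of the finite hypothesis language
sample_size = 100   # labeled examples used to measure empirical error
n_trials = 2000     # repetitions over which the bias is averaged

bias = []
for _ in range(n_trials):
    # each hypothesis has a fixed true (generalization) error rate
    true_errors = rng.uniform(0.2, 0.5, size=n_hypotheses)
    # empirical error rate = misclassified fraction of a finite i.i.d. sample
    empirical_errors = rng.binomial(sample_size, true_errors) / sample_size
    best = np.argmin(empirical_errors)  # empirical error minimization
    bias.append(true_errors[best] - empirical_errors[best])

print(f"mean optimistic bias of the selected hypothesis: {np.mean(bias):.3f}")
```

With these (arbitrary) settings the reported mean bias is positive: the apparently best hypothesis looks better on the sample than it actually is, which is the effect the first part of the thesis characterizes in terms of the distribution of error rates in the hypothesis language.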



Citations
Journal ArticleDOI

Machine learning

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Journal ArticleDOI

A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

TL;DR: Both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines, both when no gene selection is performed and when several popular gene selection methods are used.
Journal ArticleDOI

Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation

TL;DR: It is found that non-causal feature selection methods cannot be interpreted causally even when they achieve excellent predictivity, so only local causal techniques should be used when insight into causal structure is sought.

An essay towards solving a problem in the doctrine of chances. [Facsimile]

Thomas Bayes
TL;DR: The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening.
Book ChapterDOI

Finding association rules that trade support optimally against confidence

TL;DR: This work presents a fast algorithm that finds the n best rules which maximize the resulting criterion, and dynamically prunes redundant rules and parts of the hypothesis space that cannot contain better solutions than the best ones found so far.
References
Book

The Nature of Statistical Learning Theory

TL;DR: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Book

Reinforcement Learning: An Introduction

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Journal ArticleDOI

Support-Vector Networks

TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated, and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

Statistical learning theory

TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.