Open Access Book

Error Estimation and Model Selection

TLDR
A new analysis of the generalization error of the hypothesis which minimizes the empirical error within a finite hypothesis language is presented, and an information theoretic approach is pursued which does not require the assumption that empirical error rates in distinct cross validation folds are independent estimates.
Abstract
Machine learning algorithms search a space of possible hypotheses and estimate the error of each hypothesis using a sample. Most often, the goal of classification tasks is to find a hypothesis with a low true (or generalization) misclassification probability (or error rate); however, only the sample (or empirical) error rate can actually be measured and minimized. The true error rate of the returned hypothesis is unknown but can, for instance, be estimated using cross validation, and very general worst-case bounds can be given. This doctoral dissertation addresses a compound of questions on error assessment and the intimately related selection of a “good” hypothesis language, or learning algorithm, for a given problem.

In the first part of this thesis, I present a new analysis of the generalization error of the hypothesis which minimizes the empirical error within a finite hypothesis language. I present a solution which characterizes the generalization error of the apparently best hypothesis in terms of the distribution of error rates of hypotheses in the hypothesis language. The distribution of error rates can, for any given problem, be estimated efficiently from the sample. Effectively, this analysis predicts how good the outcome of a learning algorithm would be without the learning algorithm actually having to be invoked. This immediately leads to an efficient algorithm for the selection of a good hypothesis language (or “model”). The analysis predicts (and thus explains) the shape of learning curves with very high accuracy and thus contributes to a better understanding of the nature of over-fitting. I study the behavior of the model selection algorithm empirically (in particular, in comparison to cross validation) using both artificial problems and a large-scale text categorization problem.

In the next step, I study in which situations performing automatic model selection is actually beneficial; in particular, I study Occam algorithms and cross validation. Model selection techniques such as tree pruning, weight decay, or cross validation are employed by virtually all “practical” learners and are generally believed to enhance the performance of learning algorithms. However, I show that this belief is equivalent to an assumption on the distribution of problems which the learning algorithm is exposed to. I specify these distributional assumptions and quantify the benefit of Occam algorithms and cross validation in these situations. When the distributional assumptions fail, cross-validation based model selection increases the generalization error of the returned hypothesis on average.

When several distinct learners are assessed with respect to a particular problem (or one learner is assessed repeatedly with distinct parameter settings), an effect arises which is very similar to the over-fitting that occurs during error-minimization processes. The lowest observed error rate is an optimistic estimate of the corresponding generalization error. I quantify this bias. In particular, I study the bias which is imposed by repeated invocations of a learner with distinct parameter settings when n-fold cross validation is used to estimate the error rate. I pursue an information theoretic approach which does not require the assumption that empirical error rates measured in distinct cross validation folds are independent estimates. I discuss the implications of these results for empirical studies which have been carried out in the past and propose an experimental setting which leads to almost unbiased results.

Finally, I address complexity issues of model selection. In model selection based learning, the learning algorithm is restricted to a (small) model, chosen by the model selection algorithm. By contrast, in the boosting setting, the hypothesis is allowed to grow dynamically, often until the hypothesis is fitted to the data. By giving new worst-case time bounds for the AdaBoost algorithm, I show that in many cases the restriction to small sets of hypotheses causes the high complexity of learning.
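The selection effect described in the abstract can be made concrete with a small simulation: when the empirical error is minimized over many hypotheses, the lowest observed error rate tends to underestimate the true error of the selected hypothesis. The sketch below is not taken from the dissertation; the hypothesis-language size, sample size, and the uniform distribution of true error rates are arbitrary illustrative assumptions.

```python
# Minimal simulation sketch (not from the dissertation): it illustrates why the
# lowest empirical error over a finite hypothesis language is an optimistically
# biased estimate of the selected hypothesis' true error. The hypothesis count,
# sample size, and error-rate distribution below are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_hypotheses = 50   # size of the finite hypothesis language
sample_size = 100   # labeled examples used to measure empirical error
n_trials = 2000     # repetitions over which the bias is averaged

bias = []
for _ in range(n_trials):
    # each hypothesis has a fixed true (generalization) error rate
    true_errors = rng.uniform(0.2, 0.5, size=n_hypotheses)
    # empirical error rate = misclassified fraction of a finite i.i.d. sample
    empirical_errors = rng.binomial(sample_size, true_errors) / sample_size
    best = np.argmin(empirical_errors)  # empirical error minimization
    bias.append(true_errors[best] - empirical_errors[best])

print(f"mean optimistic bias of the selected hypothesis: {np.mean(bias):.3f}")
```

With these (arbitrary) settings the reported mean bias is positive: the apparently best hypothesis looks better on the sample than it actually is, which is the effect the first part of the thesis characterizes in terms of the distribution of error rates in the hypothesis language.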



Citations
Journal ArticleDOI

Machine learning

TL;DR: Machine learning addresses many of the same research questions as the fields of statistics, data mining, and psychology, but with differences of emphasis.
Journal ArticleDOI

A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification

TL;DR: Both on average and in the majority of microarray datasets, random forests are outperformed by support vector machines, both when no gene selection is performed and when several popular gene selection methods are used.
Journal ArticleDOI

Local Causal and Markov Blanket Induction for Causal Discovery and Feature Selection for Classification Part I: Algorithms and Empirical Evaluation

TL;DR: It is found that non-causal feature selection methods cannot be interpreted causally even when they achieve excellent predictivity, so only local causal techniques should be used when insight into causal structure is sought.

An essay towards solving a problem in the doctrine of chances. [Facsimile]

Thomas Bayes
TL;DR: The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the value of the thing expected upon its happening.
Book ChapterDOI

Finding association rules that trade support optimally against confidence

TL;DR: This work presents a fast algorithm that finds the n best rules which maximize the resulting criterion, and dynamically prunes redundant rules and parts of the hypothesis space that cannot contain better solutions than the best ones found so far.
References
Book

The Nature of Statistical Learning Theory

TL;DR: Setting of the learning problem; consistency of learning processes; bounds on the rate of convergence of learning processes; controlling the generalization ability of learning processes; constructing learning algorithms; what is important in learning theory?
Book

Reinforcement Learning: An Introduction

TL;DR: This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, ranging from the history of the field's intellectual foundations to the most recent developments and applications.
Journal ArticleDOI

Support-Vector Networks

TL;DR: High generalization ability of support-vector networks utilizing polynomial input transformations is demonstrated, and the performance of the support-vector network is compared to various classical learning algorithms that all took part in a benchmark study of Optical Character Recognition.

Statistical learning theory

TL;DR: Presenting a method for determining the necessary and sufficient conditions for consistency of the learning process, the author covers function estimates from small data pools, applying these estimations to real-life problems, and much more.