
Showing papers on "Empirical risk minimization published in 2005"


Journal ArticleDOI
TL;DR: This book presents an interplay between the classical theory of general Lévy processes described by Skorohod (1991), Bertoin (1996), Sato (2003), and modern stochastic analysis as presented by Liptser and Shiryayev (1989), Protter (2004), and others.
Abstract: (2005). Information Theory, Inference, and Learning Algorithms. Journal of the American Statistical Association: Vol. 100, No. 472, pp. 1461-1462.

740 citations


01 Jan 2005
TL;DR: In this article, the authors consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data, and motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning.
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. Performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to violation of the "cluster assumption". Finally, we also illustrate that the method can be far superior to manifold learning in high-dimensional spaces.

168 citations
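The minimum entropy regularizer described above augments the supervised log-loss with a penalty on the uncertainty of predictions at unlabeled points. A minimal sketch for a one-dimensional logistic model (the function names and the simple 1-D parameterization are illustrative, not the paper's formulation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def entropy(p):
    # Shannon entropy of a Bernoulli prediction, in nats.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def min_entropy_objective(w, b, labeled, unlabeled, lam):
    """Supervised log-loss plus lam times the entropy of the model's
    predictions at unlabeled points (names are illustrative)."""
    loss = 0.0
    for x, y in labeled:               # y in {0, 1}
        p = sigmoid(w * x + b)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    for x in unlabeled:
        loss += lam * entropy(sigmoid(w * x + b))
    return loss
```

Driving this objective down favors decision boundaries that pass through low-density regions, where the unlabeled predictions are confident.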


Journal ArticleDOI
TL;DR: This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed (i.i.d.) training examples from each hypothesis and demonstrates that several concepts from statistical learning theory have counterparts in the NP context.
Abstract: The Neyman-Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any α > 0, the NP lemma specifies the most powerful test of size α, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed (i.i.d.) training examples from each hypothesis. Building on a "fundamental lemma" of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

164 citations
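The NP-ERM idea above can be illustrated with the simplest possible class, one-sided thresholds: minimize the empirical miss rate subject to a relaxed empirical false-alarm constraint. A toy sketch (the paper's actual estimators use histograms and dyadic decision trees; names here are illustrative):

```python
def np_erm_threshold(null_samples, alt_samples, thresholds, alpha, tol):
    """NP-ERM over the class {x > t}: keep thresholds whose empirical
    false-alarm rate on null_samples is at most alpha + tol, then pick
    the one with the smallest empirical miss rate on alt_samples."""
    best_t, best_miss = None, float("inf")
    for t in thresholds:
        false_alarm = sum(x > t for x in null_samples) / len(null_samples)
        if false_alarm > alpha + tol:
            continue  # violates the relaxed size constraint
        miss = sum(x <= t for x in alt_samples) / len(alt_samples)
        if miss < best_miss:
            best_t, best_miss = t, miss
    return best_t, best_miss
```

The tolerance term mirrors the paper's relaxation of the constraint, which is needed because the false-alarm rate is itself only estimated from data.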


Journal Article
TL;DR: The formal definitions of stability for randomized algorithms are given, and non-asymptotic bounds on the difference between the empirical and expected error, as well as the leave-one-out and expected error, of such algorithms that depend on their random stability are proved.
Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.

160 citations


Proceedings Article
05 Dec 2005
TL;DR: In this article, the problem of estimating minimum volume sets based on independent samples distributed according to a probability measure P and a reference measure μ is addressed, where no other information is available regarding P, but the reference measure is assumed to be known.
Abstract: Given a probability measure P and a reference measure μ, one is often interested in the minimum μ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure μ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules.

133 citations
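For histogram classes, the ERM-style rule above reduces to greedily selecting the densest bins until the required empirical mass is covered. A sketch assuming equal-width bins (so bin counts order the empirical densities; the paper additionally penalizes complexity):

```python
def min_volume_bins(counts, alpha):
    """Empirical minimum-volume set over equal-width histogram bins:
    take the highest-count bins until their empirical mass reaches
    alpha. Returns the chosen bin indices and the achieved mass."""
    n = sum(counts)
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        if mass >= alpha:
            break
        chosen.append(i)
        mass += counts[i] / n
    return sorted(chosen), mass
```

With equal-width bins, picking high-count bins first is exactly picking high-empirical-density bins first, so the returned union has minimal volume among bin unions with the achieved mass.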


Proceedings ArticleDOI
06 Oct 2005
TL;DR: A training procedure based on empirical risk minimization / utility maximization is developed and evaluated on a simple extraction task.
Abstract: We consider the problem of training logistic regression models for binary classification in information extraction and information retrieval tasks. Fitting probabilistic models for use with such tasks should take into account the demands of the task-specific utility function, in this case the well-known F-measure, which combines recall and precision into a global measure of utility. We develop a training procedure based on empirical risk minimization / utility maximization and evaluate it on a simple extraction task.

102 citations
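A crude way to see the utility-maximization idea: rather than thresholding predicted probabilities at 0.5, pick the decision threshold that maximizes training-set F-measure. This is only a stand-in for the paper's procedure, which builds the F-measure objective into model fitting itself:

```python
def f_measure(tp, fp, fn):
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f_threshold(scores, labels, thresholds):
    """Pick the decision threshold that maximizes training-set F1.
    scores are predicted probabilities, labels are in {0, 1}."""
    best_t, best_f = None, -1.0
    for t in thresholds:
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        f = f_measure(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```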


Journal ArticleDOI
TL;DR: This work proposes a novel algorithm using the framework of empirical risk minimization and marginalized kernels and analyzes its computational and statistical properties both theoretically and empirically.
Abstract: We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known, we address the problem when only a set of empirical samples is available. We propose a novel algorithm using the framework of empirical risk minimization and marginalized kernels and analyze its computational and statistical properties both theoretically and empirically. We provide an efficient implementation of the algorithm and demonstrate its performance on both simulated and real data sets.

89 citations


Book ChapterDOI
27 Jun 2005
TL;DR: In this article, the authors established learning rates to the Bayes risk for support vector machines (SVMs) using a regularization sequence, where the approximation error function describes how well the infinite sample versions of the considered SVMs approximate the data-generating distribution.
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) using a regularization sequence $\lambda_n = n^{-\alpha}$, where $\alpha \in (0,1)$ is arbitrary. Under a noise condition recently proposed by Tsybakov these rates can become faster than $n^{-1/2}$. In order to deal with the approximation error we present a general concept called the approximation error function, which describes how well the infinite-sample versions of the considered SVMs approximate the data-generating distribution. In addition we discuss in some detail the relation between the "classical" approximation error and the approximation error function. Finally, for distributions satisfying a geometric noise assumption we establish some learning rates when the used RKHS is a Sobolev space.

88 citations


Journal ArticleDOI
TL;DR: It is shown how various stability assumptions can be employed for bounding the bias and variance of estimators of the expected error, and an extension of the bounded-difference inequality for "almost always" stable algorithms is proved.
Abstract: The problem of proving generalization bounds for the performance of learning algorithms can be formulated as a problem of bounding the bias and variance of estimators of the expected error. We show how various stability assumptions can be employed for this purpose. We provide a necessary and sufficient stability condition for bounding the bias and variance for the Empirical Risk Minimization algorithm, and various sufficient conditions for bounding bias and variance of estimators for general algorithms. We discuss settings in which it is possible to obtain exponential bounds, and we prove an extension of the bounded-difference inequality for "almost always" stable algorithms.

82 citations


Book ChapterDOI
27 Jun 2005
TL;DR: This work investigates learning methods based on empirical minimization of the natural estimates of the ranking risk of U-statistics and U-processes to give a theoretical framework for ranking algorithms based on boosting and support vector machines.
Abstract: A general model is proposed for studying ranking problems. We investigate learning methods based on empirical minimization of the natural estimates of the ranking risk. The empirical estimates are of the form of a U-statistic. Inequalities from the theory of U-statistics and U-processes are used to obtain performance bounds for the empirical risk minimizers. Convex risk minimization methods are also studied to give a theoretical framework for ranking algorithms based on boosting and support vector machines. As in binary classification, fast rates of convergence are achieved under certain noise assumptions. General sufficient conditions are proposed in several special cases that guarantee fast rates of convergence.

78 citations
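The ranking-risk estimate mentioned above is a U-statistic: an average, over pairs of examples, of a pairwise misordering indicator. A direct sketch (ties are scored as half an error, a common but not universal convention):

```python
def empirical_ranking_risk(scores, labels):
    """U-statistic estimate of the ranking risk: the fraction of
    differently-labeled pairs that the scoring function orders
    incorrectly (score ties counted as half an error)."""
    errors, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                continue  # only discordant-label pairs contribute
            pairs += 1
            diff = (scores[i] - scores[j]) * (labels[i] - labels[j])
            if diff < 0:
                errors += 1.0
            elif diff == 0:
                errors += 0.5
    return errors / pairs if pairs else 0.0
```

Because every pair enters the average, this estimator is a U-statistic of order two, which is exactly why the paper needs inequalities for U-processes rather than plain empirical processes.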


Journal Article
TL;DR: A tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group, bringing together modern optimization methods on manifolds and comparing the different algorithms on a common machine learning problem.
Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as transparent a way as possible in order to make it accessible to readers who do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications.

Posted Content
TL;DR: This work constructs plug-in classifiers that can achieve not only the fast, but also the super-fast rates, i.e., the rates faster than n^{-1}.
Abstract: It has been recently shown that, under the margin (or low noise) assumption, there exist classifiers attaining fast rates of convergence of the excess Bayes risk, i.e., rates faster than n^{-1/2}. The works on this subject suggested the following two conjectures: (i) the best achievable fast rate is of the order n^{-1}, and (ii) the plug-in classifiers generally converge more slowly than the classifiers based on empirical risk minimization. We show that both conjectures are not correct. In particular, we construct plug-in classifiers that can achieve not only the fast, but also the super-fast rates, i.e., the rates faster than n^{-1}. We establish minimax lower bounds showing that the obtained rates cannot be improved.

Proceedings ArticleDOI
07 Aug 2005
TL;DR: A framework and objective functions for active learning in three fundamental HMM problems: model learning, state estimation, and path estimation are introduced and a new set of algorithms for efficiently finding optimal greedy queries using these objective functions are described.
Abstract: Hidden Markov Models (HMMs) model sequential data in many fields such as text/speech processing and biosignal analysis. Active learning algorithms learn faster and/or better by closing the data-gathering loop, i.e., they choose the examples most informative with respect to their learning objectives. We introduce a framework and objective functions for active learning in three fundamental HMM problems: model learning, state estimation, and path estimation. In addition, we describe a new set of algorithms for efficiently finding optimal greedy queries using these objective functions. The algorithms are fast, i.e., linear in the number of time steps to select the optimal query and we present empirical results showing that these algorithms can significantly reduce the need for labelled training data.

Journal ArticleDOI
TL;DR: A penalized empirical risk minimization classifier is suggested that adaptively attains fast optimal rates of convergence for the excess risk, that is, rates that can be faster than n -1/2 , where n is the sample size.
Abstract: We consider the problem of adaptation to the margin in binary classification. We suggest a penalized empirical risk minimization classifier that adaptively attains, up to a logarithmic factor, fast optimal rates of convergence for the excess risk, that is, rates that can be faster than n^{-1/2}, where n is the sample size. We show that our method also gives adaptive estimators for the problem of edge estimation.
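The penalized-ERM rule has the generic form "empirical risk plus complexity penalty". A sketch with a generic VC-style penalty term (the paper's margin-adaptive penalty is more refined; the sqrt(d·log n / n) form here is only a typical placeholder):

```python
import math

def complexity_penalty(dim, n, c=1.0):
    # A typical VC-style penalty term: c * sqrt(dim * log(n) / n).
    return c * math.sqrt(dim * math.log(n) / n)

def penalized_erm(models, n):
    """models: list of (name, empirical_risk, dim) candidates fitted
    on n samples. Returns the model minimizing empirical risk plus
    the complexity penalty; names and the penalty are illustrative."""
    return min(models, key=lambda m: m[1] + complexity_penalty(m[2], n))
```

With few samples the penalty dominates and a simple model wins; with many samples the empirical-risk term dominates and a richer, better-fitting model can be selected.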

Proceedings Article
09 Jul 2005
TL;DR: This work considers a novel framework where a learner may influence the test distribution in a bounded way and derives an efficient algorithm that acts as a wrapper around a broad class of existing supervised learning algorithms while guaranteeing more robust behavior under changes in the input distribution.
Abstract: Supervised machine learning techniques developed in the Probably Approximately Correct, Maximum A Posteriori, and Structural Risk Minimization frameworks typically make the assumption that the test data a learner is applied to is drawn from the same distribution as the training data. In various prominent applications of learning techniques, from robotics to medical diagnosis to process control, this assumption is violated. We consider a novel framework where a learner may influence the test distribution in a bounded way. From this framework, we derive an efficient algorithm that acts as a wrapper around a broad class of existing supervised learning algorithms while guaranteeing more robust behavior under changes in the input distribution.

Journal ArticleDOI
01 Oct 2005
TL;DR: The empirical comparison of a recent algorithm RM, its new extensions and three classical classifiers in different aspects including classification accuracy, computational time and storage requirement shows that nominal attributes do have an impact on the performance of those compared learning algorithms.
Abstract: There are many learning algorithms available in the field of pattern classification, and people are still discovering new algorithms that they hope will work better. Any new learning algorithm, besides its theoretical foundation, needs to be justified in many aspects including accuracy and efficiency when applied to real-life problems. In this paper, we report an empirical comparison of a recent algorithm RM, its new extensions, and three classical classifiers in different aspects including classification accuracy, computational time, and storage requirement. The comparison is performed in a standardized way, and we believe that it gives good insight into the algorithm RM and its extensions. The experiments also show that nominal attributes do have an impact on the performance of the compared learning algorithms.

Proceedings ArticleDOI
07 Aug 2005
TL;DR: In this paper, the authors combine model-based and instance-based learning to produce an incremental first-order regression algorithm that is both computationally efficient and produces better predictions earlier in the learning experiment.
Abstract: The introduction of relational reinforcement learning and the RRL algorithm gave rise to the development of several first-order regression algorithms. So far, these algorithms have employed either a model-based approach or an instance-based approach. As a consequence, they suffer from the typical drawbacks of model-based learning, such as coarse function approximation, or those of lazy learning, such as high computational intensity. In this paper we develop a new regression algorithm that combines the strong points of both approaches and tries to avoid the normally inherent drawbacks. By combining model-based and instance-based learning, we produce an incremental first-order regression algorithm that is both computationally efficient and produces better predictions earlier in the learning experiment.

Journal ArticleDOI
TL;DR: This work considers an elaboration of binary classification in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer Q, and makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer.
Abstract: The goal of binary classification is to estimate a discriminant function $\gamma$ from observations of covariate vectors and corresponding binary labels. We consider an elaboration of this problem in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer $Q$. We present conditions on loss functions such that empirical risk minimization yields Bayes consistency when both the discriminant function and the quantizer are estimated. These conditions are stated in terms of a general correspondence between loss functions and a class of functionals known as Ali-Silvey or $f$-divergence functionals. Whereas this correspondence was established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1 (1951) 93--102. Univ. California Press, Berkeley] for the 0--1 loss, we extend the correspondence to the broader class of surrogate loss functions that play a key role in the general theory of Bayes consistency for binary classification. Our result makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer.

01 Jan 2005
TL;DR: This dissertation presents a kernel-based method for solving the density estimation problem, one of the fundamental problems in machine learning, and investigates the performance of the proposed algorithm in different computer vision and pattern recognition applications.
Abstract: Statistical learning-based kernel methods are rapidly replacing other empirical learning methods (e.g. neural networks) as a preferred tool for machine learning due to many attractive features: a strong basis in statistical learning theory; no computational penalty in moving from linear to non-linear models; and a convex optimization problem, guaranteeing a unique global solution and consequently producing systems with excellent generalization performance. This research work introduces statistical learning for solving different problems in computer vision and pattern recognition applications. Probability density function (pdf) estimation is one of the major ingredients in Bayesian pattern recognition and machine learning. Many algorithms have been introduced for solving the probability density function estimation problem in either a parametric or a nonparametric setup. In the parametric approach, a reasonable functional form for the probability density function is assumed, so the problem is reduced to estimating the parameters of that functional form. For estimating general density functions, nonparametric setups are used, where no form is assumed for the density function. The curse of dimensionality is a major difficulty in density function estimation over high-dimensional data spaces. An active area of research in the pattern analysis community is to develop algorithms which cope with the dimensionality problem. The purpose of this dissertation is to present a kernel-based method for solving the density estimation problem as one of the fundamental problems in machine learning. The proposed method is only minimally affected by the dimensionality problem.
The contribution of this dissertation is threefold: creating a reliable and efficient learning-based density estimation algorithm which is minimally dependent on the input space dimensionality, investigating efficient learning algorithms for the proposed approach, and investigating the performance of the proposed algorithm in different computer vision and pattern recognition applications.

Proceedings ArticleDOI
Liwei Wang1, Jufu Feng1
07 Nov 2005
TL;DR: A methodology based on structural risk minimization is presented which trades off between training error and the model complexity, and gives the capacity of an N-component GMM.
Abstract: Gaussian mixture models are often used for probability density estimation in pattern recognition and machine learning systems. Selecting an optimal number of components in mixture model is important to ensure an accurate and efficient estimate. In this paper, a methodology based on structural risk minimization is presented which trades off between training error and the model complexity. The main contribution of this work is that we give the capacity of an N-component GMM. When applied to unsupervised learning and speech recognition system, the new method shows good performance compared to classical model selection methods.
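The SRM-style selection above can be sketched as a penalized comparison across mixture sizes, given each candidate's training loss and parameter count. The per-parameter log n penalty below is a BIC-like placeholder, not the paper's GMM capacity term:

```python
import math

def select_components(n, fits, penalty_per_param):
    """fits: list of (k_components, train_neg_log_lik, n_params)
    for GMMs already fitted on n samples. Returns the component
    count minimizing training loss plus a complexity penalty.
    The penalty form and names here are illustrative."""
    return min(fits, key=lambda f: f[1] + penalty_per_param * f[2] * math.log(n))[0]
```

Adding components always lowers the training negative log-likelihood, so without the penalty the selection would always pick the largest mixture; the penalty is what makes the trade-off meaningful.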

Journal ArticleDOI
TL;DR: A parametric family of loss functions that provides accurate estimates for the posterior class probabilities near the decision regions are proposed and it is shown that these loss functions can be seen as an alternative to support vector machines (SVM) classifiers for low-dimensional feature spaces.

Proceedings ArticleDOI
13 Mar 2005
TL;DR: Two novel active SV learning algorithms that use adaptive mixtures of random and query learning are presented; one is inspired by online decision problems and involves a hard choice among the pure strategies at each step.
Abstract: Active learning is a generic approach to accelerate training of classifiers in order to achieve a higher accuracy with a small number of training examples. In the past, simple active learning algorithms like random learning and query learning have been proposed for the design of support vector machine (SVM) classifiers. In random learning, examples are chosen randomly, while in query learning examples closer to the current separating hyperplane are chosen at each learning step. However, it is observed that a better scheme would be to use random learning in the initial stages (more exploration) and query learning in the final stages (more exploitation) of learning. Here we present two novel active SV learning algorithms which use adaptive mixtures of random and query learning. One of the proposed algorithms is inspired by online decision problems, and involves a hard choice among the pure strategies at each step. The other extends this to soft choices using a mixture of instances recommended by the individual pure strategies. Both strategies handle the exploration-exploitation trade-off in an efficient manner. The efficacy of the algorithms is demonstrated by experiments on benchmark datasets.
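The mixture of pure strategies can be sketched as a single query-selection step: explore with some probability (random learning), otherwise exploit (query the point nearest the current boundary). The paper's algorithms adapt this mixture online; the fixed mixing weight and helper names here are illustrative:

```python
import random

def choose_query(pool, w, model_score, rng=None):
    """Pick the next example to label: with probability w explore
    (uniform random choice from the pool), otherwise exploit (query
    the point closest to the decision boundary, i.e. with smallest
    |model_score(x)|). A minimal sketch, not the paper's algorithm."""
    rng = rng or random.Random(0)
    if rng.random() < w:
        return rng.choice(pool)
    return min(pool, key=lambda x: abs(model_score(x)))
```

Annealing w from near 1 toward 0 over the course of learning reproduces the "explore early, exploit late" schedule the abstract motivates.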

DissertationDOI
01 May 2005
TL;DR: This thesis proposes an extension of the standard framework for the derivation of generalization bounds for algorithms taking their hypotheses from random classes of functions and provides an algorithm which computes a data-dependent upper bound for the expected error of empirical minimizers in terms of the "complexity" of data-dependent local subsets.
Abstract: This thesis studies the generalization ability of machine learning algorithms in a statistical setting. It focuses on the data-dependent analysis of the generalization performance of learning algorithms in order to make full use of the potential of the actual training sample from which these algorithms learn. First, we propose an extension of the standard framework for the derivation of generalization bounds for algorithms taking their hypotheses from random classes of functions. This approach is motivated by the fact that the function produced by a learning algorithm based on a random sample of data depends on this sample and is therefore a random function. Such an approach avoids the detour of the worst-case uniform bounds as done in the standard approach. We show that the mechanism which allows one to obtain generalization bounds for random classes in our framework is based on a "small complexity" of certain random coordinate projections. We demonstrate how this notion of complexity relates to learnability and how one can explore geometric properties of these projections in order to derive estimates of rates of convergence and good confidence interval estimates for the expected risk. We then demonstrate the generality of our new approach by presenting a range of examples, among them the algorithm-dependent compression schemes and the data-dependent luckiness frameworks, which fall into our random subclass framework. Second, we study in more detail generalization bounds for a specific algorithm which is of central importance in learning theory, namely the Empirical Risk Minimization algorithm (ERM). Recent results show that one can significantly improve the high-probability estimates for the convergence rates for empirical minimizers by a direct analysis of the ERM algorithm.
These results are based on a new localized notion of complexity of subsets of hypothesis functions with identical expected errors and are therefore dependent on the underlying unknown distribution. We investigate the extent to which one can estimate these high-probability convergence rates in a data-dependent manner. We provide an algorithm which computes a data-dependent upper bound for the expected error of empirical minimizers in terms of the "complexity" of data-dependent local subsets. These subsets are sets of functions with empirical errors in a given range and can be determined based solely on empirical data. We then show that recent direct estimates, which are essentially sharp estimates on the high-probability convergence rate for the ERM algorithm, cannot be recovered universally from empirical data.

Proceedings Article
30 Jul 2005
TL;DR: It is shown that while there is no phase transition when considering the whole hypothesis space, there is a much more severe "gap" phenomenon affecting the effective search space of standard grammatical induction algorithms for deterministic finite automata (DFA).
Abstract: It is now well-known that the feasibility of inductive learning is ruled by statistical properties linking the empirical risk minimization principle and the "capacity" of the hypothesis space. The discovery, a few years ago, of a phase transition phenomenon in inductive logic programming proves that other fundamental characteristics of the learning problems may similarly affect the very possibility of learning under very general conditions. Our work examines the case of grammatical inference. We show that while there is no phase transition when considering the whole hypothesis space, there is a much more severe "gap" phenomenon affecting the effective search space of standard grammatical induction algorithms for deterministic finite automata (DFA). Focusing on the search heuristics of the RPNI and RED-BLUE algorithms, we show that they overcome this problem to some extent, but that they are subject to overgeneralization. Finally, the paper suggests some directions for new generalization operators suited to this phase transition phenomenon.

Book ChapterDOI
22 Jun 2005
TL;DR: In this paper, an application of kernel-based learning algorithms to the endoscopy image classification problem is presented, as part of the attempts to extend the existing recommendation system (ERS) with an image classification facility.
Abstract: In this paper, an application of kernel-based learning algorithms to the endoscopy image classification problem is presented. This work is part of the attempts to extend the existing recommendation system (ERS) with an image classification facility. The use of a computer-based system could support the doctor when making a diagnosis and help to avoid human subjectivity. We give a brief description of the SVM and LS-SVM algorithms. The algorithms are then used in the problem of recognizing malignant versus benign tumours in the gullet. The classification was performed on features based on edge structure and colour. A detailed experimental comparison of classification performance for different kernel functions and different combinations of feature vectors was made. The algorithms performed very well in the experiments, achieving a high percentage of correct predictions.

Journal Article
TL;DR: Wang et al. as discussed by the authors analyzed the shortcomings of neural networks based on the rule of empirical risk minimization (ERM), and introduced the rule of structural risk minimization (SRM) to overcome those shortcomings.
Abstract: One of the purposes of data-driven machine learning is to find regularities that cannot be discovered by principled analysis, in order to forecast future data. With their excellent ability for function approximation, neural networks are widely used to develop the map between the past and the future data to carry out predictions. First, we analyze the shortcomings of neural networks based on the rule of empirical risk minimization (ERM), and introduce the rule of structural risk minimization (SRM) to overcome these shortcomings. Second, the support vector machine (SVM), an algorithm implementing SRM, is introduced. Finally, multi-step predictions of the trend of the Shanghai Security Composite Index are achieved with acceptable accuracy.


31 Aug 2005
TL;DR: This paper proposes a family of random coordinate descent algorithms for perceptron learning on binary classification problems that directly minimize the training error, and usually achieve the lowest training error compared with other algorithms.
Abstract: A perceptron is a linear threshold classifier that separates examples with a hyperplane. It is perhaps the simplest learning model that is used standalone. In this paper, we propose a family of random coordinate descent algorithms for perceptron learning on binary classification problems. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the training error, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost, and achieve the lowest test errors for half of the datasets.
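A sketch of the core loop: perturb one randomly chosen coordinate by a few candidate step sizes and keep only updates that strictly reduce the 0-1 training error. The paper determines the best update along the chosen direction more carefully; the fixed candidate steps here are a simplification:

```python
import random

def train_error(w, data):
    # Fraction of examples misclassified by sign(w . x); labels in {-1, +1}.
    wrong = 0
    for x, y in data:
        s = sum(wi * xi for wi, xi in zip(w, x))
        if s * y <= 0:
            wrong += 1
    return wrong / len(data)

def random_coordinate_descent(data, dim, iters=200,
                              steps=(-1.0, -0.1, 0.1, 1.0), seed=0):
    """Random coordinate descent directly on the 0-1 training error:
    pick a random coordinate, try a few step sizes along it, and keep
    an update only if it strictly lowers the training error."""
    rng = random.Random(seed)
    w = [0.0] * dim
    err = train_error(w, data)
    for _ in range(iters):
        i = rng.randrange(dim)
        for step in steps:
            trial = list(w)
            trial[i] += step
            trial_err = train_error(trial, data)
            if trial_err < err:
                w, err = trial, trial_err
    return w, err
```

Because only the 0-1 error is evaluated, no smoothness of the cost function is needed, which is exactly the point the abstract makes.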

Journal ArticleDOI
TL;DR: The Kim-Pollard theorem is applied to show that, under certain differentiability assumptions, â_n converges to a* at rate n^{-1/3}, and the asymptotic distribution of the renormalized estimator is presented.
Abstract: In this paper, we study a two-category classification problem. We indicate the categories by labels Y=1 and Y=-1. We observe a covariate, or feature, X ∈ X ⊂ ℜ^d. Consider a collection {h_a} of classifiers indexed by a finite-dimensional parameter a, and the classifier h_{a*} that minimizes the prediction error over this class. The parameter a* is estimated by the empirical risk minimizer â_n over the class, where the empirical risk is calculated on a training sample of size n. We apply the Kim-Pollard theorem to show that under certain differentiability assumptions, â_n converges to a* with rate n^{-1/3}, and also present the asymptotic distribution of the renormalized estimator. For example, let V_0 denote the set of x on which, given X=x, the label Y=1 is more likely (than the label Y=-1). If X is one-dimensional, the set V_0 is the union of disjoint intervals. The problem is then to estimate the thresholds of the intervals. We obtain the asymptotic distribution of the empirical risk minimizer when the classifiers have K thresholds, where K is fixed. We furthermore consider an extension to higher-dimensional X, assuming basically that V_0 has a smooth boundary in some given parametric class. We also discuss various rates of convergence when the differentiability conditions are possibly violated. Here, we again restrict ourselves to one-dimensional X. We show that the rate is n^{-1} in certain cases, and then also obtain the asymptotic distribution for the empirical prediction error.
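For the one-dimensional, single-threshold case discussed above, the empirical risk minimizer can be computed exactly by scanning midpoints between sorted observations. A small sketch (labels in {-1, +1}; helper names are illustrative):

```python
def erm_threshold(xs, ys):
    """Empirical risk minimizer over single-threshold classifiers
    h_a(x) = 1{x > a}: try midpoints between sorted observations
    (plus points outside the data range) and return the candidate
    with the fewest training mistakes."""
    pts = sorted(xs)
    candidates = [pts[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    candidates.append(pts[-1] + 1.0)
    def risk(a):
        # 0-1 training error of the rule "predict 1 iff x > a".
        return sum((x > a) != (y == 1) for x, y in zip(xs, ys)) / len(xs)
    return min(candidates, key=risk)
```

The paper's cube-root asymptotics concern how this minimizing threshold fluctuates around the optimal one as the sample size grows.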