
Showing papers on "Empirical risk minimization published in 2005"


Journal ArticleDOI
TL;DR: This book presents an interplay between the classical theory of general Lévy processes described by Skorohod (1991), Bertoin (1996), Sato (2003), and modern stochastic analysis as presented by Liptser and Shiryayev (1989), Protter (2004), and others.
Abstract: (2005). Information Theory, Inference, and Learning Algorithms. Journal of the American Statistical Association: Vol. 100, No. 472, pp. 1461-1462.

740 citations


01 Jan 2005
TL;DR: In this article, the authors consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data, and motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning.
Abstract: We consider the semi-supervised learning problem, where a decision rule is to be learned from labeled and unlabeled data. In this framework, we motivate minimum entropy regularization, which makes it possible to incorporate unlabeled data into standard supervised learning. Our approach includes other approaches to the semi-supervised problem as particular or limiting cases. A series of experiments illustrates that the proposed solution benefits from unlabeled data. The method challenges mixture models when the data are sampled from the distribution class spanned by the generative model. Performance is clearly in favor of minimum entropy regularization when generative models are misspecified, and the weighting of unlabeled data provides robustness to violation of the "cluster assumption". Finally, we also illustrate that the method can be far superior to manifold learning in high-dimensional spaces.

168 citations
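The minimum entropy regularizer described above augments the supervised log-loss with a penalty on the uncertainty of predictions at unlabeled points. A minimal sketch for a one-dimensional logistic model (the function names and the simple 1-D parameterization are illustrative, not the paper's formulation):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def entropy(p):
    # Shannon entropy of a Bernoulli prediction, in nats.
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def min_entropy_objective(w, b, labeled, unlabeled, lam):
    """Supervised log-loss plus lam times the entropy of the model's
    predictions at unlabeled points (names are illustrative)."""
    loss = 0.0
    for x, y in labeled:               # y in {0, 1}
        p = sigmoid(w * x + b)
        loss -= y * math.log(p) + (1 - y) * math.log(1 - p)
    for x in unlabeled:
        loss += lam * entropy(sigmoid(w * x + b))
    return loss
```

Driving this objective down favors decision boundaries that pass through low-density regions, where the unlabeled predictions are confident.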


Journal ArticleDOI
TL;DR: This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed (i.i.d.) training examples from each hypothesis and demonstrates that several concepts from statistical learning theory have counterparts in the NP context.
Abstract: The Neyman-Pearson (NP) approach to hypothesis testing is useful in situations where different types of error have different consequences or a priori probabilities are unknown. For any α > 0, the NP lemma specifies the most powerful test of size α, but assumes the distributions for each hypothesis are known or (in some cases) the likelihood ratio is monotonic in an unknown parameter. This paper investigates an extension of NP theory to situations in which one has no knowledge of the underlying distributions except for a collection of independent and identically distributed (i.i.d.) training examples from each hypothesis. Building on a "fundamental lemma" of Cannon et al., we demonstrate that several concepts from statistical learning theory have counterparts in the NP context. Specifically, we consider constrained versions of empirical risk minimization (NP-ERM) and structural risk minimization (NP-SRM), and prove performance guarantees for both. General conditions are given under which NP-SRM leads to strong universal consistency. We also apply NP-SRM to (dyadic) decision trees to derive rates of convergence. Finally, we present explicit algorithms to implement NP-SRM for histograms and dyadic decision trees.

164 citations
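The NP-ERM idea above can be illustrated with the simplest possible class, one-sided thresholds: minimize the empirical miss rate subject to a relaxed empirical false-alarm constraint. A toy sketch (the paper's actual estimators use histograms and dyadic decision trees; names here are illustrative):

```python
def np_erm_threshold(null_samples, alt_samples, thresholds, alpha, tol):
    """NP-ERM over the class {x > t}: keep thresholds whose empirical
    false-alarm rate on null_samples is at most alpha + tol, then pick
    the one with the smallest empirical miss rate on alt_samples."""
    best_t, best_miss = None, float("inf")
    for t in thresholds:
        false_alarm = sum(x > t for x in null_samples) / len(null_samples)
        if false_alarm > alpha + tol:
            continue  # violates the relaxed size constraint
        miss = sum(x <= t for x in alt_samples) / len(alt_samples)
        if miss < best_miss:
            best_t, best_miss = t, miss
    return best_t, best_miss
```

The tolerance term mirrors the paper's relaxation of the constraint, which is needed because the false-alarm rate is itself only estimated from data.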


Journal Article
TL;DR: The formal definitions of stability for randomized algorithms are given, and non-asymptotic bounds on the difference between the empirical and expected error, as well as the leave-one-out and expected error, of such algorithms that depend on their random stability are proved.
Abstract: We extend existing theory on stability, namely how much changes in the training data influence the estimated models, and generalization performance of deterministic learning algorithms to the case of randomized algorithms. We give formal definitions of stability for randomized algorithms and prove non-asymptotic bounds on the difference between the empirical and expected error as well as the leave-one-out and expected error of such algorithms that depend on their random stability. The setup we develop for this purpose can be also used for generally studying randomized learning algorithms. We then use these general results to study the effects of bagging on the stability of a learning method and to prove non-asymptotic bounds on the predictive performance of bagging which have not been possible to prove with the existing theory of stability for deterministic learning algorithms.

160 citations


Proceedings Article
05 Dec 2005
TL;DR: In this article, the problem of estimating minimum volume sets based on independent samples distributed according to a probability measure P and a reference measure μ is addressed, where no other information is available regarding P, but the reference measure is assumed to be known.
Abstract: Given a probability measure P and a reference measure μ, one is often interested in the minimum μ-measure set with P-measure at least α. Minimum volume sets of this type summarize the regions of greatest probability mass of P, and are useful for detecting anomalies and constructing confidence regions. This paper addresses the problem of estimating minimum volume sets based on independent samples distributed according to P. Other than these samples, no other information is available regarding P, but the reference measure μ is assumed to be known. We introduce rules for estimating minimum volume sets that parallel the empirical risk minimization and structural risk minimization principles in classification. As in classification, we show that the performances of our estimators are controlled by the rate of uniform convergence of empirical to true probabilities over the class from which the estimator is drawn. Thus we obtain finite sample size performance bounds in terms of VC dimension and related quantities. We also demonstrate strong universal consistency and an oracle inequality. Estimators based on histograms and dyadic partitions illustrate the proposed rules.

133 citations
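For histogram classes, the ERM-style rule above reduces to greedily selecting the densest bins until the required empirical mass is covered. A sketch assuming equal-width bins (so bin counts order the empirical densities; the paper additionally penalizes complexity):

```python
def min_volume_bins(counts, alpha):
    """Empirical minimum-volume set over equal-width histogram bins:
    take the highest-count bins until their empirical mass reaches
    alpha. Returns the chosen bin indices and the achieved mass."""
    n = sum(counts)
    order = sorted(range(len(counts)), key=lambda i: counts[i], reverse=True)
    chosen, mass = [], 0.0
    for i in order:
        if mass >= alpha:
            break
        chosen.append(i)
        mass += counts[i] / n
    return sorted(chosen), mass
```

With equal-width bins, picking high-count bins first is exactly picking high-empirical-density bins first, so the returned union has minimal volume among bin unions with the achieved mass.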


Proceedings ArticleDOI
06 Oct 2005
TL;DR: A training procedure based on empirical risk minimization / utility maximization is developed and evaluated on a simple extraction task.
Abstract: We consider the problem of training logistic regression models for binary classification in information extraction and information retrieval tasks. Fitting probabilistic models for use with such tasks should take into account the demands of the task-specific utility function, in this case the well-known F-measure, which combines recall and precision into a global measure of utility. We develop a training procedure based on empirical risk minimization / utility maximization and evaluate it on a simple extraction task.

102 citations
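A crude way to see the utility-maximization idea: rather than thresholding predicted probabilities at 0.5, pick the decision threshold that maximizes training-set F-measure. This is only a stand-in for the paper's procedure, which builds the F-measure objective into model fitting itself:

```python
def f_measure(tp, fp, fn):
    # Balanced F-measure (F1): harmonic mean of precision and recall.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_f_threshold(scores, labels, thresholds):
    """Pick the decision threshold that maximizes training-set F1.
    scores are predicted probabilities, labels are in {0, 1}."""
    best_t, best_f = None, -1.0
    for t in thresholds:
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        f = f_measure(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```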


Journal ArticleDOI
TL;DR: This work proposes a novel algorithm using the framework of empirical risk minimization and marginalized kernels and analyzes its computational and statistical properties both theoretically and empirically.
Abstract: We consider the problem of decentralized detection under constraints on the number of bits that can be transmitted by each sensor. In contrast to most previous work, in which the joint distribution of sensor observations is assumed to be known, we address the problem when only a set of empirical samples is available. We propose a novel algorithm using the framework of empirical risk minimization and marginalized kernels and analyze its computational and statistical properties both theoretically and empirically. We provide an efficient implementation of the algorithm and demonstrate its performance on both simulated and real data sets.

89 citations


Book ChapterDOI
27 Jun 2005
TL;DR: In this article, the authors established learning rates to the Bayes risk for support vector machines (SVMs) using a regularization sequence, where the approximation error function describes how well the infinite sample versions of the considered SVMs approximate the data-generating distribution.
Abstract: We establish learning rates to the Bayes risk for support vector machines (SVMs) using a regularization sequence $\lambda_n = n^{-\alpha}$, where $\alpha \in (0,1)$ is arbitrary. Under a noise condition recently proposed by Tsybakov these rates can become faster than $n^{-1/2}$. In order to deal with the approximation error we present a general concept called the approximation error function, which describes how well the infinite-sample versions of the considered SVMs approximate the data-generating distribution. In addition we discuss in some detail the relation between the "classical" approximation error and the approximation error function. Finally, for distributions satisfying a geometric noise assumption we establish some learning rates when the used RKHS is a Sobolev space.

88 citations


Journal ArticleDOI
TL;DR: It is shown how various stability assumptions can be employed for bounding the bias and variance of estimators of the expected error, and an extension of the bounded-difference inequality for "almost always" stable algorithms is proved.
Abstract: The problem of proving generalization bounds for the performance of learning algorithms can be formulated as a problem of bounding the bias and variance of estimators of the expected error. We show how various stability assumptions can be employed for this purpose. We provide a necessary and sufficient stability condition for bounding the bias and variance for the Empirical Risk Minimization algorithm, and various sufficient conditions for bounding bias and variance of estimators for general algorithms. We discuss settings in which it is possible to obtain exponential bounds, and we prove an extension of the bounded-difference inequality for "almost always" stable algorithms.

82 citations


Book ChapterDOI
27 Jun 2005
TL;DR: This work investigates learning methods based on empirical minimization of the natural estimates of the ranking risk of U-statistics and U-processes to give a theoretical framework for ranking algorithms based on boosting and support vector machines.
Abstract: A general model is proposed for studying ranking problems. We investigate learning methods based on empirical minimization of the natural estimates of the ranking risk. The empirical estimates are of the form of a U-statistic. Inequalities from the theory of U-statistics and U-processes are used to obtain performance bounds for the empirical risk minimizers. Convex risk minimization methods are also studied to give a theoretical framework for ranking algorithms based on boosting and support vector machines. As in binary classification, fast rates of convergence are achieved under certain noise assumptions. General sufficient conditions are proposed in several special cases that guarantee fast rates of convergence.

78 citations
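The ranking-risk estimate mentioned above is a U-statistic: an average, over pairs of examples, of a pairwise misordering indicator. A direct sketch (ties are scored as half an error, a common but not universal convention):

```python
def empirical_ranking_risk(scores, labels):
    """U-statistic estimate of the ranking risk: the fraction of
    differently-labeled pairs that the scoring function orders
    incorrectly (score ties counted as half an error)."""
    errors, pairs = 0.0, 0
    n = len(scores)
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                continue  # only discordant-label pairs contribute
            pairs += 1
            diff = (scores[i] - scores[j]) * (labels[i] - labels[j])
            if diff < 0:
                errors += 1.0
            elif diff == 0:
                errors += 0.5
    return errors / pairs if pairs else 0.0
```

Because every pair enters the average, this estimator is a U-statistic of order two, which is exactly why the paper needs inequalities for U-processes rather than plain empirical processes.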


Journal Article
TL;DR: A tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group, bringing together modern optimization methods on manifolds and comparing the different algorithms on a common machine learning problem.
Abstract: The aim of this contribution is to present a tutorial on learning algorithms for a single neural layer whose connection matrix belongs to the orthogonal group. The algorithms exploit geodesics appropriately connected as piece-wise approximate integrals of the exact differential learning equation. The considered learning equations essentially arise from the Riemannian-gradient-based optimization theory with deterministic and diffusion-type gradient. The paper aims specifically at reviewing the relevant mathematics (and at presenting it in as transparent a way as possible in order to make it accessible to readers who do not possess a background in differential geometry), at bringing together modern optimization methods on manifolds and at comparing the different algorithms on a common machine learning problem. As a numerical case-study, we consider an application to non-negative independent component analysis, although it should be recognized that Riemannian gradient methods give rise to general-purpose algorithms, by no means limited to ICA-related applications.

Posted Content
TL;DR: This work constructs plug-in classifiers that can achieve not only the fast, but also the super-fast rates, i.e., the rates faster than n^{-1}.
Abstract: It has been recently shown that, under the margin (or low noise) assumption, there exist classifiers attaining fast rates of convergence of the excess Bayes risk, i.e., rates faster than n^{-1/2}. The works on this subject suggested the following two conjectures: (i) the best achievable fast rate is of the order n^{-1}, and (ii) the plug-in classifiers generally converge more slowly than the classifiers based on empirical risk minimization. We show that both conjectures are not correct. In particular, we construct plug-in classifiers that can achieve not only the fast, but also the super-fast rates, i.e., the rates faster than n^{-1}. We establish minimax lower bounds showing that the obtained rates cannot be improved.

Proceedings ArticleDOI
07 Aug 2005
TL;DR: A framework and objective functions for active learning in three fundamental HMM problems: model learning, state estimation, and path estimation are introduced and a new set of algorithms for efficiently finding optimal greedy queries using these objective functions are described.
Abstract: Hidden Markov Models (HMMs) model sequential data in many fields such as text/speech processing and biosignal analysis. Active learning algorithms learn faster and/or better by closing the data-gathering loop, i.e., they choose the examples most informative with respect to their learning objectives. We introduce a framework and objective functions for active learning in three fundamental HMM problems: model learning, state estimation, and path estimation. In addition, we describe a new set of algorithms for efficiently finding optimal greedy queries using these objective functions. The algorithms are fast, i.e., linear in the number of time steps to select the optimal query and we present empirical results showing that these algorithms can significantly reduce the need for labelled training data.

Journal ArticleDOI
TL;DR: A penalized empirical risk minimization classifier is suggested that adaptively attains fast optimal rates of convergence for the excess risk, that is, rates that can be faster than n -1/2 , where n is the sample size.
Abstract: We consider the problem of adaptation to the margin in binary classification. We suggest a penalized empirical risk minimization classifier that adaptively attains, up to a logarithmic factor, fast optimal rates of convergence for the excess risk, that is, rates that can be faster than n^{-1/2}, where n is the sample size. We show that our method also gives adaptive estimators for the problem of edge estimation.
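The penalized-ERM rule has the generic form "empirical risk plus complexity penalty". A sketch with a generic VC-style penalty term (the paper's margin-adaptive penalty is more refined; the sqrt(d·log n / n) form here is only a typical placeholder):

```python
import math

def complexity_penalty(dim, n, c=1.0):
    # A typical VC-style penalty term: c * sqrt(dim * log(n) / n).
    return c * math.sqrt(dim * math.log(n) / n)

def penalized_erm(models, n):
    """models: list of (name, empirical_risk, dim) candidates fitted
    on n samples. Returns the model minimizing empirical risk plus
    the complexity penalty; names and the penalty are illustrative."""
    return min(models, key=lambda m: m[1] + complexity_penalty(m[2], n))
```

With few samples the penalty dominates and a simple model wins; with many samples the empirical-risk term dominates and a richer, better-fitting model can be selected.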

Proceedings Article
09 Jul 2005
TL;DR: This work considers a novel framework where a learner may influence the test distribution in a bounded way and derives an efficient algorithm that acts as a wrapper around a broad class of existing supervised learning algorithms while guaranteeing more robust behavior under changes in the input distribution.
Abstract: Supervised machine learning techniques developed in the Probably Approximately Correct, Maximum A Posteriori, and Structural Risk Minimization frameworks typically make the assumption that the test data a learner is applied to is drawn from the same distribution as the training data. In various prominent applications of learning techniques, from robotics to medical diagnosis to process control, this assumption is violated. We consider a novel framework where a learner may influence the test distribution in a bounded way. From this framework, we derive an efficient algorithm that acts as a wrapper around a broad class of existing supervised learning algorithms while guaranteeing more robust behavior under changes in the input distribution.

Journal ArticleDOI
01 Oct 2005
TL;DR: The empirical comparison of a recent algorithm RM, its new extensions and three classical classifiers in different aspects including classification accuracy, computational time and storage requirement shows that nominal attributes do have an impact on the performance of those compared learning algorithms.
Abstract: There are many learning algorithms available in the field of pattern classification, and people are still discovering new algorithms that they hope will work better. Any new learning algorithm, besides its theoretical foundation, needs to be justified in many aspects including accuracy and efficiency when applied to real-life problems. In this paper, we report an empirical comparison of a recent algorithm RM, its new extensions, and three classical classifiers in different aspects including classification accuracy, computational time, and storage requirement. The comparison is performed in a standardized way, and we believe that it gives good insight into the algorithm RM and its extensions. The experiments also show that nominal attributes do have an impact on the performance of the compared learning algorithms.

Proceedings ArticleDOI
07 Aug 2005
TL;DR: In this paper, the authors combine model-based and instance-based learning to produce an incremental first-order regression algorithm that is both computationally efficient and produces better predictions earlier in the learning experiment.
Abstract: The introduction of relational reinforcement learning and the RRL algorithm gave rise to the development of several first-order regression algorithms. So far, these algorithms have employed either a model-based approach or an instance-based approach. As a consequence, they suffer from the typical drawbacks of model-based learning, such as coarse function approximation, or those of lazy learning, such as high computational intensity. In this paper we develop a new regression algorithm that combines the strong points of both approaches and tries to avoid the normally inherent drawbacks. By combining model-based and instance-based learning, we produce an incremental first-order regression algorithm that is both computationally efficient and produces better predictions earlier in the learning experiment.

Journal ArticleDOI
TL;DR: This work considers an elaboration of binary classification in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer Q, and makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer.
Abstract: The goal of binary classification is to estimate a discriminant function $\gamma$ from observations of covariate vectors and corresponding binary labels. We consider an elaboration of this problem in which the covariates are not available directly but are transformed by a dimensionality-reducing quantizer $Q$. We present conditions on loss functions such that empirical risk minimization yields Bayes consistency when both the discriminant function and the quantizer are estimated. These conditions are stated in terms of a general correspondence between loss functions and a class of functionals known as Ali-Silvey or $f$-divergence functionals. Whereas this correspondence was established by Blackwell [Proc. 2nd Berkeley Symp. Probab. Statist. 1 (1951) 93--102. Univ. California Press, Berkeley] for the 0--1 loss, we extend the correspondence to the broader class of surrogate loss functions that play a key role in the general theory of Bayes consistency for binary classification. Our result makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer.

01 Jan 2005
TL;DR: This dissertation presents a kernel-based method for solving the density estimation problem, one of the fundamental problems in machine learning, and investigates the performance of the proposed algorithm in different computer vision and pattern recognition applications.
Abstract: Statistical learning-based kernel methods are rapidly replacing other empirical learning methods (e.g. neural networks) as a preferred tool for machine learning due to many attractive features: a strong basis in statistical learning theory; no computational penalty in moving from linear to non-linear models; and a convex optimization problem, guaranteeing a unique global solution and consequently producing systems with excellent generalization performance. This research work introduces statistical learning for solving different problems in computer vision and pattern recognition applications. Probability density function (pdf) estimation is one of the major ingredients in Bayesian pattern recognition and machine learning. Many algorithms have been introduced for solving the probability density function estimation problem in either a parametric or a nonparametric setup. In the parametric approach, a reasonable functional form for the probability density function is assumed, so the problem is reduced to estimating the parameters of that functional form. For estimating general density functions, nonparametric setups are used, where no form is assumed for the density function. The curse of dimensionality is a major difficulty in density function estimation over high-dimensional data spaces. An active area of research in the pattern analysis community is to develop algorithms which cope with the dimensionality problem. The purpose of this dissertation is to present a kernel-based method for solving the density estimation problem as one of the fundamental problems in machine learning. The proposed method is only minimally affected by the dimensionality problem.
The contribution of this dissertation is threefold: creating a reliable and efficient learning-based density estimation algorithm which is minimally dependent on the input space dimensionality, investigating efficient learning algorithms for the proposed approach, and investigating the performance of the proposed algorithm in different computer vision and pattern recognition applications.

Proceedings ArticleDOI
Liwei Wang1, Jufu Feng1
07 Nov 2005
TL;DR: A methodology based on structural risk minimization is presented which trades off between training error and the model complexity, and gives the capacity of an N-component GMM.
Abstract: Gaussian mixture models are often used for probability density estimation in pattern recognition and machine learning systems. Selecting an optimal number of components in mixture model is important to ensure an accurate and efficient estimate. In this paper, a methodology based on structural risk minimization is presented which trades off between training error and the model complexity. The main contribution of this work is that we give the capacity of an N-component GMM. When applied to unsupervised learning and speech recognition system, the new method shows good performance compared to classical model selection methods.
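The SRM-style selection above can be sketched as a penalized comparison across mixture sizes, given each candidate's training loss and parameter count. The per-parameter log n penalty below is a BIC-like placeholder, not the paper's GMM capacity term:

```python
import math

def select_components(n, fits, penalty_per_param):
    """fits: list of (k_components, train_neg_log_lik, n_params)
    for GMMs already fitted on n samples. Returns the component
    count minimizing training loss plus a complexity penalty.
    The penalty form and names here are illustrative."""
    return min(fits, key=lambda f: f[1] + penalty_per_param * f[2] * math.log(n))[0]
```

Adding components always lowers the training negative log-likelihood, so without the penalty the selection would always pick the largest mixture; the penalty is what makes the trade-off meaningful.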

Journal ArticleDOI
TL;DR: A parametric family of loss functions that provides accurate estimates for the posterior class probabilities near the decision regions are proposed and it is shown that these loss functions can be seen as an alternative to support vector machines (SVM) classifiers for low-dimensional feature spaces.

Proceedings ArticleDOI
13 Mar 2005
TL;DR: Two novel active SV learning algorithms that use adaptive mixtures of random and query learning are presented; one is inspired by online decision problems and involves a hard choice among the pure strategies at each step.
Abstract: Active learning is a generic approach to accelerate training of classifiers in order to achieve a higher accuracy with a small number of training examples. In the past, simple active learning algorithms like random learning and query learning have been proposed for the design of support vector machine (SVM) classifiers. In random learning, examples are chosen randomly, while in query learning examples closer to the current separating hyperplane are chosen at each learning step. However, it is observed that a better scheme would be to use random learning in the initial stages (more exploration) and query learning in the final stages (more exploitation) of learning. Here we present two novel active SV learning algorithms which use adaptive mixtures of random and query learning. One of the proposed algorithms is inspired by online decision problems, and involves a hard choice among the pure strategies at each step. The other extends this to soft choices using a mixture of instances recommended by the individual pure strategies. Both strategies handle the exploration-exploitation trade-off in an efficient manner. The efficacy of the algorithms is demonstrated by experiments on benchmark datasets.
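The mixture of pure strategies can be sketched as a single query-selection step: explore with some probability (random learning), otherwise exploit (query the point nearest the current boundary). The paper's algorithms adapt this mixture online; the fixed mixing weight and helper names here are illustrative:

```python
import random

def choose_query(pool, w, model_score, rng=None):
    """Pick the next example to label: with probability w explore
    (uniform random choice from the pool), otherwise exploit (query
    the point closest to the decision boundary, i.e. with smallest
    |model_score(x)|). A minimal sketch, not the paper's algorithm."""
    rng = rng or random.Random(0)
    if rng.random() < w:
        return rng.choice(pool)
    return min(pool, key=lambda x: abs(model_score(x)))
```

Annealing w from near 1 toward 0 over the course of learning reproduces the "explore early, exploit late" schedule the abstract motivates.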

DissertationDOI
01 May 2005
TL;DR: This thesis proposes an extension of the standard framework for the derivation of generalization bounds for algorithms taking their hypotheses from random classes of functions and provides an algorithm which computes a data-dependent upper bound for the expected error of empirical minimizers in terms of the "complexity" of data-dependent local subsets.
Abstract: This thesis studies the generalization ability of machine learning algorithms in a statistical setting. It focuses on the data-dependent analysis of the generalization performance of learning algorithms in order to make full use of the potential of the actual training sample from which these algorithms learn. First, we propose an extension of the standard framework for the derivation of generalization bounds for algorithms taking their hypotheses from random classes of functions. This approach is motivated by the fact that the function produced by a learning algorithm based on a random sample of data depends on this sample and is therefore a random function. Such an approach avoids the detour of the worst-case uniform bounds as done in the standard approach. We show that the mechanism which allows one to obtain generalization bounds for random classes in our framework is based on a "small complexity" of certain random coordinate projections. We demonstrate how this notion of complexity relates to learnability and how one can explore geometric properties of these projections in order to derive estimates of rates of convergence and good confidence interval estimates for the expected risk. We then demonstrate the generality of our new approach by presenting a range of examples, among them the algorithm-dependent compression schemes and the data-dependent luckiness frameworks, which fall into our random subclass framework. Second, we study in more detail generalization bounds for a specific algorithm which is of central importance in learning theory, namely the Empirical Risk Minimization algorithm (ERM). Recent results show that one can significantly improve the high-probability estimates for the convergence rates for empirical minimizers by a direct analysis of the ERM algorithm.
These results are based on a new localized notion of complexity of subsets of hypothesis functions with identical expected errors and are therefore dependent on the underlying unknown distribution. We investigate the extent to which one can estimate these high-probability convergence rates in a data-dependent manner. We provide an algorithm which computes a data-dependent upper bound for the expected error of empirical minimizers in terms of the "complexity" of data-dependent local subsets. These subsets are sets of functions with empirical errors in a given range and can be determined based solely on empirical data. We then show that recent direct estimates, which are essentially sharp estimates on the high-probability convergence rate for the ERM algorithm, cannot be recovered universally from empirical data.

Proceedings Article
30 Jul 2005
TL;DR: It is shown that while there is no phase transition when considering the whole hypothesis space, there is a much more severe "gap" phenomenon affecting the effective search space of standard grammatical induction algorithms for deterministic finite automata (DFA).
Abstract: It is now well-known that the feasibility of inductive learning is ruled by statistical properties linking the empirical risk minimization principle and the "capacity" of the hypothesis space. The discovery, a few years ago, of a phase transition phenomenon in inductive logic programming proves that other fundamental characteristics of the learning problems may similarly affect the very possibility of learning under very general conditions. Our work examines the case of grammatical inference. We show that while there is no phase transition when considering the whole hypothesis space, there is a much more severe "gap" phenomenon affecting the effective search space of standard grammatical induction algorithms for deterministic finite automata (DFA). Focusing on the search heuristics of the RPNI and RED-BLUE algorithms, we show that they overcome this problem to some extent, but that they are subject to overgeneralization. Finally, the paper suggests some directions for new generalization operators suited to this phase transition phenomenon.

Book ChapterDOI
22 Jun 2005
TL;DR: In this paper, an application of kernel-based learning algorithms to the endoscopy image classification problem is presented, as part of the attempts to extend the existing recommendation system (ERS) with an image classification facility.
Abstract: In this paper, an application of kernel-based learning algorithms to the endoscopy image classification problem is presented. This work is part of the attempts to extend the existing recommendation system (ERS) with an image classification facility. The use of a computer-based system could support the doctor when making a diagnosis and help to avoid human subjectivity. We give a brief description of the SVM and LS-SVM algorithms. The algorithms are then used in the problem of recognizing malignant versus benign tumours in the gullet. The classification was performed on features based on edge structure and colour. A detailed experimental comparison of classification performance for different kernel functions and different combinations of feature vectors was made. The algorithms performed very well in the experiments, achieving a high percentage of correct predictions.

Journal Article
TL;DR: Wang et al. as discussed by the authors analyzed the shortcomings of neural networks based on the rule of empirical risk minimization (ERM), and introduced the rule of structural risk minimization (SRM) to overcome those shortcomings.
Abstract: One of the purposes of data-driven machine learning is to find regularities that cannot be discovered by principled analysis, in order to forecast future data. With their excellent ability for function approximation, neural networks are widely used to develop the map between the past and the future data to carry out predictions. First, we analyze the shortcomings of neural networks based on the rule of empirical risk minimization (ERM), and introduce the rule of structural risk minimization (SRM) to overcome these shortcomings. Second, the support vector machine (SVM), an algorithm implementing SRM, is introduced. Finally, multi-step predictions of the trend of the Shanghai Security Composite Index are achieved with acceptable accuracy.


31 Aug 2005
TL;DR: This paper proposes a family of random coordinate descent algorithms for perceptron learning on binary classification problems that directly minimize the training error, and usually achieve the lowest training error compared with other algorithms.
Abstract: A perceptron is a linear threshold classifier that separates examples with a hyperplane. It is perhaps the simplest learning model that is used standalone. In this paper, we propose a family of random coordinate descent algorithms for perceptron learning on binary classification problems. Unlike most perceptron learning algorithms, which require smooth cost functions, our algorithms directly minimize the training error, and usually achieve the lowest training error compared with other algorithms. The algorithms are also computationally efficient. Such advantages make them favorable for both standalone use and ensemble learning, on problems that are not linearly separable. Experiments show that our algorithms work very well with AdaBoost, and achieve the lowest test errors for half of the datasets.
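A sketch of the core loop: perturb one randomly chosen coordinate by a few candidate step sizes and keep only updates that strictly reduce the 0-1 training error. The paper determines the best update along the chosen direction more carefully; the fixed candidate steps here are a simplification:

```python
import random

def train_error(w, data):
    # Fraction of examples misclassified by sign(w . x); labels in {-1, +1}.
    wrong = 0
    for x, y in data:
        s = sum(wi * xi for wi, xi in zip(w, x))
        if s * y <= 0:
            wrong += 1
    return wrong / len(data)

def random_coordinate_descent(data, dim, iters=200,
                              steps=(-1.0, -0.1, 0.1, 1.0), seed=0):
    """Random coordinate descent directly on the 0-1 training error:
    pick a random coordinate, try a few step sizes along it, and keep
    an update only if it strictly lowers the training error."""
    rng = random.Random(seed)
    w = [0.0] * dim
    err = train_error(w, data)
    for _ in range(iters):
        i = rng.randrange(dim)
        for step in steps:
            trial = list(w)
            trial[i] += step
            trial_err = train_error(trial, data)
            if trial_err < err:
                w, err = trial, trial_err
    return w, err
```

Because only the 0-1 error is evaluated, no smoothness of the cost function is needed, which is exactly the point the abstract makes.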

Journal ArticleDOI
TL;DR: The Kim-Pollard theorem is applied to show that, under certain differentiability assumptions, â_n converges to a* at rate n^{-1/3}, and the asymptotic distribution of the renormalized estimator is presented.
Abstract: In this paper, we study a two-category classification problem. We indicate the categories by labels Y=1 and Y=-1. We observe a covariate, or feature, X ∈ X ⊂ ℜ^d. Consider a collection {h_a} of classifiers indexed by a finite-dimensional parameter a, and the classifier h_{a*} that minimizes the prediction error over this class. The parameter a* is estimated by the empirical risk minimizer â_n over the class, where the empirical risk is calculated on a training sample of size n. We apply the Kim-Pollard theorem to show that under certain differentiability assumptions, â_n converges to a* with rate n^{-1/3}, and also present the asymptotic distribution of the renormalized estimator. For example, let V_0 denote the set of x on which, given X=x, the label Y=1 is more likely (than the label Y=-1). If X is one-dimensional, the set V_0 is the union of disjoint intervals. The problem is then to estimate the thresholds of the intervals. We obtain the asymptotic distribution of the empirical risk minimizer when the classifiers have K thresholds, where K is fixed. We furthermore consider an extension to higher-dimensional X, assuming basically that V_0 has a smooth boundary in some given parametric class. We also discuss various rates of convergence when the differentiability conditions are possibly violated. Here, we again restrict ourselves to one-dimensional X. We show that the rate is n^{-1} in certain cases, and then also obtain the asymptotic distribution for the empirical prediction error.
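For the one-dimensional, single-threshold case discussed above, the empirical risk minimizer can be computed exactly by scanning midpoints between sorted observations. A small sketch (labels in {-1, +1}; helper names are illustrative):

```python
def erm_threshold(xs, ys):
    """Empirical risk minimizer over single-threshold classifiers
    h_a(x) = 1{x > a}: try midpoints between sorted observations
    (plus points outside the data range) and return the candidate
    with the fewest training mistakes."""
    pts = sorted(xs)
    candidates = [pts[0] - 1.0]
    candidates += [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    candidates.append(pts[-1] + 1.0)
    def risk(a):
        # 0-1 training error of the rule "predict 1 iff x > a".
        return sum((x > a) != (y == 1) for x, y in zip(xs, ys)) / len(xs)
    return min(candidates, key=risk)
```

The paper's cube-root asymptotics concern how this minimizing threshold fluctuates around the optimal one as the sample size grows.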