
Showing papers on "Empirical risk minimization published in 2011"


Journal Article
TL;DR: This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
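The per-coordinate adaptation the abstract describes can be sketched in a few lines; the update below is the diagonal AdaGrad-style rule on a toy quadratic (the objective, step size, and iteration count are illustrative choices, not the paper's):

```python
import numpy as np

def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=200):
    """Diagonal AdaGrad sketch: divide each coordinate's step by the root of
    its accumulated squared gradients, damping frequently-large coordinates."""
    x = x0.astype(float).copy()
    g_accum = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        g_accum += g * g
        x -= lr * g / (np.sqrt(g_accum) + eps)
    return x

# Toy quadratic 0.5*(100*x1^2 + x2^2): curvatures differ by a factor of 100
a = np.array([100.0, 1.0])
x_star = adagrad(lambda x: a * x, np.array([1.0, 1.0]))
```

Both coordinates converge at essentially the same rate despite the 100x difference in curvature, which is the point of dividing each coordinate's step by its own gradient history.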

6,984 citations


Journal ArticleDOI
TL;DR: This work proposes a new method, objective perturbation, for privacy-preserving machine learning algorithm design, and shows that both theoretically and empirically, this method is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
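The objective perturbation idea — add a random linear term to the regularized objective before optimizing — can be sketched as follows. The noise distribution and scale here are simplified stand-ins, not the paper's calibrated ε-dependent values:

```python
import numpy as np

def private_logreg(X, y, lam=0.1, epsilon=1.0, steps=500, lr=0.1, rng=None):
    """Sketch of objective perturbation for regularized logistic regression:
    minimize (1/n) sum_i log(1 + exp(-y_i w.x_i)) + (lam/2)||w||^2 + (1/n) b.w,
    where b is random noise drawn once before optimization. The noise scale
    below is an illustrative stand-in, NOT the calibrated private value."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    b = rng.normal(size=d)
    b *= rng.exponential(2.0 / epsilon) / np.linalg.norm(b)  # random direction, random norm
    w = np.zeros(d)
    for _ in range(steps):
        z = y * (X @ w)
        g = -(X * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)  # logistic loss gradient
        w -= lr * (g + lam * w + b / n)
    return w

# Toy separable data; with a large epsilon the added noise is negligible
X = np.array([[1.0, 0.0], [2.0, 0.1], [-1.0, 0.0], [-2.0, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_priv = private_logreg(X, y, epsilon=100.0, rng=0)
```

Because the perturbation enters the objective rather than the output, the optimizer itself trades the noise off against the data-fit term, which is the intuition behind the method's better privacy/utility tradeoff.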

1,057 citations


Journal Article
TL;DR: This work considers the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms defined as sums of Euclidean norms on certain subsets of variables, and explores the relationship between groups defining the norm and the resulting nonzero patterns.
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual l1-norm and the group l1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings.
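The norms in question can be computed directly; a minimal sketch (the group definitions and weights below are illustrative):

```python
import numpy as np

def structured_norm(w, groups, weights=None):
    """Omega(w) = sum over groups g of d_g * ||w[g]||_2, allowing groups to
    overlap. Singleton groups recover the l1 norm; a partition of the indices
    recovers the group l1 (group lasso) norm."""
    weights = weights if weights is not None else [1.0] * len(groups)
    return sum(d * np.linalg.norm(w[list(g)]) for d, g in zip(weights, groups))

w = np.array([3.0, 4.0, 0.0])
l1 = structured_norm(w, [[0], [1], [2]])        # 3 + 4 + 0 = 7, the l1 norm
overlap = structured_norm(w, [[0, 1], [1, 2]])  # ||(3,4)|| + ||(4,0)|| = 5 + 4 = 9
```

Which nonzero patterns the penalty allows is determined by which unions of group complements can be zeroed out, which is the group-to-pattern correspondence the abstract refers to.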

480 citations


Book
05 Aug 2011
TL;DR: The purpose of these lecture notes is to provide an introduction to the general theory of empirical risk minimization with an emphasis on excess risk bounds and oracle inequalities in penalized problems.
Abstract: The purpose of these lecture notes is to provide an introduction to the general theory of empirical risk minimization with an emphasis on excess risk bounds and oracle inequalities in penalized problems. In recent years, there have been new developments in this area motivated by the study of new classes of methods in machine learning such as large margin classification methods (boosting, kernel machines). The main probabilistic tools involved in the analysis of these problems are concentration and deviation inequalities by Talagrand along with other methods of empirical processes theory (symmetrization inequalities, contraction inequality for Rademacher sums, entropy and generic chaining bounds). Sparse recovery based on l_1-type penalization and low rank matrix recovery based on the nuclear norm penalization are other active areas of research, where the main problems can be stated in the framework of penalized empirical risk minimization, and concentration inequalities and empirical processes tools have proved to be very useful.

458 citations


BookDOI
01 Jan 2011
TL;DR: The main tools involved in the analysis of these problems are concentration and deviation inequalities by Talagrand along with other methods of empirical processes theory (symmetrization inequalities, contraction inequality for Rademacher sums, entropy and generic chaining bounds) as discussed by the authors.
Abstract: The purpose of these lecture notes is to provide an introduction to the general theory of empirical risk minimization with an emphasis on excess risk bounds and oracle inequalities in penalized problems. In recent years, there have been new developments in this area motivated by the study of new classes of methods in machine learning such as large margin classification methods (boosting, kernel machines). The main probabilistic tools involved in the analysis of these problems are concentration and deviation inequalities by Talagrand along with other methods of empirical processes theory (symmetrization inequalities, contraction inequality for Rademacher sums, entropy and generic chaining bounds). Sparse recovery based on l_1-type penalization and low rank matrix recovery based on the nuclear norm penalization are other active areas of research, where the main problems can be stated in the framework of penalized empirical risk minimization, and concentration inequalities and empirical processes tools have proved to be very useful.

274 citations


Journal ArticleDOI
01 Mar 2011
TL;DR: This investigation presents an SVR model with a chaotic genetic algorithm (CGA), namely SVRCGA, to forecast tourism demand; empirical results involving tourism demand data from an existing paper reveal that the proposed SVRCGA model outperforms other approaches in the literature.
Abstract: Accurate tourism demand forecasting systems are essential in tourism planning, particularly in tourism-based countries. Artificial neural networks are attracting attention for forecasting tourism demand due to their general nonlinear mapping capabilities. Unlike most conventional neural network models, which are based on the empirical risk minimization principle, support vector regression (SVR) applies the structural risk minimization principle to minimize an upper bound of the generalization error, rather than minimizing the training error. This investigation presents an SVR model with a chaotic genetic algorithm (CGA), namely SVRCGA, to forecast tourism demand. With the increasing complexity and larger problem scale of tourism demand, genetic algorithms (GAs) often face the problems of premature convergence, slow convergence to the global optimal solution, or trapping in a local optimum. The proposed CGA, based on the chaos optimization algorithm and GAs, employs the internal randomness of chaos iterations to overcome premature convergence to a local optimum when determining the three parameters of an SVR model. Empirical results involving tourism demand data from an existing paper reveal that the proposed SVRCGA model outperforms other approaches in the literature.

227 citations


Book ChapterDOI
01 May 2011
TL;DR: Statistical learning theory, as discussed by the authors, is regarded as one of the most beautifully developed branches of artificial intelligence; it provides the theoretical basis for many of today's machine learning algorithms, such as those for classification.
Abstract: Statistical learning theory is regarded as one of the most beautifully developed branches of artificial intelligence. It provides the theoretical basis for many of today's machine learning algorithms. The theory helps to explore what permits drawing valid conclusions from empirical data. This chapter provides an overview of the key ideas and insights of statistical learning theory. Statistical learning theory begins with a class of hypotheses and uses empirical data to select one hypothesis from the class. If the data-generating mechanism is benign, then the difference between the training error and test error of a hypothesis from the class is small. Statistical learning theory generally avoids metaphysical statements about aspects of the true underlying dependency, and is thus precise by referring to the difference between training and test error. The chapter also describes some other variants of machine learning.

205 citations


Journal ArticleDOI
TL;DR: This work establishes inequalities that describe how close approximate pinball risk minimizers are to the corresponding conditional quantile, and uses them to establish an oracle inequality for support vector machines that use the pinball loss.
Abstract: The so-called pinball loss for estimating conditional quantiles is a well-known tool in both statistics and machine learning. So far, however, little work has been done to quantify the efficiency of this tool for nonparametric approaches. We fill this gap by establishing inequalities that describe how close approximate pinball risk minimizers are to the corresponding conditional quantile. These inequalities, which hold under mild assumptions on the data-generating distribution, are then used to establish so-called variance bounds, which recently turned out to play an important role in the statistical analysis of (regularized) empirical risk minimization approaches. Finally, we use both types of inequalities to establish an oracle inequality for support vector machines that use the pinball loss. The resulting learning rates are min-max optimal under some standard regularity assumptions on the conditional quantile.
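For reference, the pinball loss and its defining property — the empirical pinball risk is minimized at the sample τ-quantile — can be checked numerically (the data and search grid below are illustrative):

```python
import numpy as np

def pinball_loss(y, t, tau):
    """Pinball loss: tau*(y-t) when y >= t, (tau-1)*(y-t) otherwise."""
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

# The empirical pinball risk is minimized at the sample tau-quantile:
# for tau = 0.5 that is the median, which ignores the outlier at 100.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
ts = np.linspace(0.0, 10.0, 1001)
risks = np.array([pinball_loss(y, t, 0.5).mean() for t in ts])
t_best = ts[np.argmin(risks)]   # close to 3.0, the median (the mean is 22)
```

Choosing τ ≠ 0.5 tilts the two slopes asymmetrically, so the minimizer shifts to the corresponding quantile rather than the median.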

149 citations


Proceedings Article
14 Jun 2011
TL;DR: This work argues that instead of choosing approximate MAP parameters, one should seek the parameters that minimize the empirical risk of the entire imperfect system, and shows how to locally optimize this risk using back-propagation and stochastic metadescent.
Abstract: Graphical models are often used "inappropriately," with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic metadescent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.

148 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used the so-called pinball loss for estimating conditional quantiles and established inequalities that describe how close approximate pinball risk minimizers are to the corresponding conditional quantile.
Abstract: Using the so-called pinball loss for estimating conditional quantiles is a well-known tool in both statistics and machine learning. So far, however, only little work has been done to quantify the efficiency of this tool for non-parametric (modified) empirical risk minimization approaches. The goal of this work is to fill this gap by establishing inequalities that describe how close approximate pinball risk minimizers are to the corresponding conditional quantile. These inequalities, which hold under mild assumptions on the data-generating distribution, are then used to establish so-called variance bounds which recently turned out to play an important role in the statistical analysis of (modified) empirical risk minimization approaches. To illustrate the use of the established inequalities, we then use them to establish an oracle inequality for support vector machines that use the pinball loss. Here, it turns out that we obtain learning rates which are optimal in a min-max sense under some standard assumptions on the regularity of the conditional quantile function.

119 citations


Journal ArticleDOI
TL;DR: A rapid sparse twin support vector machine (STSVM) classifier in primal space is proposed to improve the sparsity and robustness of TSVM.

Journal ArticleDOI
TL;DR: An SVM-based CCP recognition model is presented for the on-line real-time recognition of seven typical types of unnatural CCP, assuming that the process observations are AR(1) correlated over time; it is more robust to background noise in the process data than a model based on a back-propagation network.

Journal ArticleDOI
TL;DR: This work considers the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms defined as sums of Euclidean norms on certai...
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certai...

Proceedings ArticleDOI
16 Jul 2011
TL;DR: It is proved that the empirical risk converges to its expected counterpart at a rate of root-n, and, under the assumption that the best metric minimizing the expected risk is bounded, that the learned metric is consistent.
Abstract: In this paper, we study the problem of learning a metric and propose a loss function based metric learning framework, in which the metric is estimated by minimizing an empirical risk over a training set. With mild conditions on the instance distribution and the used loss function, we prove that the empirical risk converges to its expected counterpart at a rate of root-n. In addition, under the assumption that the best metric that minimizes the expected risk is bounded, we prove that the learned metric is consistent. Two example algorithms are presented using the proposed loss function based metric learning framework, one with a log loss function and the other with a smoothed hinge loss function. Experimental results suggest the effectiveness of the proposed algorithms.
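One instance of such a loss-function-based empirical risk can be sketched with a Mahalanobis distance and a log loss on labeled pairs; the margin parameter and pairing scheme below are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def mahalanobis_sq(x1, x2, M):
    """Squared Mahalanobis distance d_M(x1, x2) = (x1-x2)^T M (x1-x2)."""
    d = x1 - x2
    return d @ M @ d

def pair_log_loss_risk(X1, X2, labels, M, b=1.0):
    """Empirical risk for metric learning with a log loss over labeled pairs:
    similar pairs (label +1) should fall inside margin b, dissimilar pairs
    (label -1) outside it. z = label * (b - d_M(x1, x2)); loss = log(1+exp(-z))."""
    z = np.array([lab * (b - mahalanobis_sq(a, c, M))
                  for a, c, lab in zip(X1, X2, labels)])
    return np.log1p(np.exp(-z)).mean()

# A metric that separates the pairs scores lower risk than a degenerate one
X1 = np.array([[0.0, 0.0], [0.0, 0.0]])
X2 = np.array([[0.1, 0.1], [3.0, 3.0]])
labels = np.array([1.0, -1.0])
risk_id = pair_log_loss_risk(X1, X2, labels, np.eye(2))
risk_zero = pair_log_loss_risk(X1, X2, labels, np.zeros((2, 2)))
```

Minimizing such an empirical risk over positive semidefinite M is the estimation problem whose root-n convergence and consistency the paper analyzes.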

Posted Content
TL;DR: It is shown that stochastic gradient descent (SGD) with the usual hypotheses is CVon stable, and the implications of CVon stability for the convergence of SGD are discussed.
Abstract: In batch learning, stability together with existence and uniqueness of the solution corresponds to well-posedness of Empirical Risk Minimization (ERM) methods; recently, it was proved that CVloo stability is necessary and sufficient for generalization and consistency of ERM ([2]). In this note, we introduce CVon stability, which plays a similar role in online learning. We show that stochastic gradient descent (SGD) with the usual hypotheses is CVon stable, and we then discuss the implications of CVon stability for the convergence of SGD.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: This paper proposes a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness for supervised learning models which are based on optimizing a regularized empirical risk function.
Abstract: In this paper, we study the problem of data skewness. A data set is skewed/imbalanced if its dependent variable is asymmetrically distributed. Dealing with skewed data sets has been identified as one of the ten most challenging problems in data mining research. We address the problem of class skewness for supervised learning models that are based on optimizing a regularized empirical risk function. These include both classification and regression models for discrete and continuous dependent variables. Classical empirical risk minimization is akin to minimizing the arithmetic mean of prediction errors, an approach whose induction process is biased towards the majority class for skewed data. To overcome this drawback, we propose a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness. We note that minimizing the quadratic mean is a convex optimization problem and hence can be efficiently solved for large and high-dimensional data. Comprehensive experiments demonstrate that the QMLearn model significantly outperforms existing statistical learners, including logistic regression, support vector machines, linear regression, support vector regression and quantile regression.
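The contrast between the arithmetic and quadratic means of per-class errors is easy to see numerically; the sketch below uses fixed per-class error values for illustration (QMLearn itself optimizes the quadratic mean inside a regularized risk):

```python
import numpy as np

def arithmetic_mean_risk(errors_a, errors_b):
    """Classical ERM criterion: the mean over all examples, which a large
    majority class can dominate."""
    return np.concatenate([errors_a, errors_b]).mean()

def quadratic_mean_risk(errors_a, errors_b):
    """QMLearn-style criterion: the quadratic mean of the two per-class
    average errors, which cannot hide a badly-served minority class."""
    ra, rb = errors_a.mean(), errors_b.mean()
    return np.sqrt((ra ** 2 + rb ** 2) / 2.0)

# Skewed toy losses: 95 majority examples with tiny error, 5 minority with large error
maj = np.full(95, 0.05)
mino = np.full(5, 1.0)
am = arithmetic_mean_risk(mino, maj)   # 0.0975: the skew hides the minority's loss
qm = quadratic_mean_risk(mino, maj)    # about 0.708: the minority's loss dominates
```

The quadratic mean upper-bounds the arithmetic mean of the two class risks, so driving it down forces both classes to be fit, which is why the criterion is insensitive to skew.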

Proceedings ArticleDOI
03 Oct 2011
TL;DR: This paper studies a larger panel of both algorithms (9 different kinds) and data sets (17 UCI data sets) to assess the ability of algorithms to produce models from only a few examples.
Abstract: Learning algorithms have proved their ability to deal with large amounts of data. Most statistical approaches use fixed-size learning sets and produce static models. However, in specific situations such as active or incremental learning, the learning task starts with only very few data points. In that case, looking for algorithms able to produce models with only a few examples becomes necessary. Classifiers in the literature are generally evaluated with criteria such as accuracy or the ability to order data (ranking). But this taxonomy of classifiers can change dramatically if the focus is on the ability to learn with just a few examples. To our knowledge, only a few studies have addressed this problem. The study presented in this paper examines a larger panel of both algorithms (9 different kinds) and data sets (17 UCI data sets).

Proceedings Article
14 Jun 2011
TL;DR: In this paper, two generalization error bounds for multiple kernel learning (MKL) were proposed, one of which is a Rademacher complexity bound which is additive in the kernel complexity and margin term.
Abstract: We propose two new generalization error bounds for multiple kernel learning (MKL). First, using the bound of Srebro and Ben-David (2006) as a starting point, we derive a new version which uses a simple counting argument for the choice of kernels in order to generate a tighter bound when 1-norm regularization (sparsity) is imposed in the kernel learning problem. The second bound is a Rademacher complexity bound which is additive in the (logarithmic) kernel complexity and margin term. This dependence is superior to all previously published Rademacher bounds for learning a convex combination of kernels, including the recent bound of Cortes et al. (2010), which exhibits a multiplicative interaction. We illustrate the tightness of our bounds with simulations.

Journal ArticleDOI
TL;DR: In this paper, the authors consider a weaker version of the margin condition that allows one to take into account that learning within a small model can be much easier than within a large one.
Abstract: A classical condition for fast learning rates is the margin condition, first introduced by Mammen and Tsybakov. We tackle in this paper the problem of adaptivity to this condition in the context of model selection, in a general learning framework. Actually, we consider a weaker version of this condition that allows one to take into account that learning within a small model can be much easier than within a large one. Requiring this “strong margin adaptivity” makes the model selection problem more challenging. We first prove, in a general framework, that some penalization procedures (including local Rademacher complexities) exhibit this adaptivity when the models are nested. Contrary to previous results, this holds with penalties that only depend on the data. Our second main result is that strong margin adaptivity is not always possible when the models are not nested: for every model selection procedure (even a randomized one), there is a problem for which it does not demonstrate strong margin adaptivity.

Book ChapterDOI
01 Jan 2011
TL;DR: This work presents the classification problem, starting with definitions and notations necessary to ground subsequent discussions, and discusses the Probably Approximately Correct learning framework and some function approximation strategies.
Abstract: We present the classification problem, starting with definitions and notations that are necessary to ground subsequent discussions. Then, we discuss the Probably Approximately Correct learning framework, and some function approximation strategies.

Proceedings ArticleDOI
18 Dec 2011
TL;DR: A method based on reinforcement learning is proposed for choosing a good supporting function during optimization using genetic algorithm and results of applying this method to a model problem are shown.
Abstract: This paper describes an optimization problem with one target function to be optimized and several supporting functions that can be used to speed up the optimization process. A method based on reinforcement learning is proposed for choosing a good supporting function during optimization using genetic algorithm. Results of applying this method to a model problem are shown.

Journal ArticleDOI
TL;DR: This study proposes a novel approach, combining the support vector machine method with a genetic algorithm (GA) for feature selection and chaotic particle swarm optimization (CPSO) for optimizing the parameters of support vector regression (SVR), to predict financial returns.
Abstract: Nowadays there are many novel forecasting approaches to improve forecasting accuracy in the financial markets. The Support Vector Machine (SVM), as a modern statistical tool, has been successfully used to solve nonlinear regression and time series problems. Unlike most conventional neural network models, which are based on the empirical risk minimization principle, SVM applies the structural risk minimization principle to minimize an upper bound of the generalization error rather than minimizing the training error. To build an effective SVM model, SVM parameters must be set carefully. This study proposes a novel approach, combining the support vector machine method with a genetic algorithm (GA) for feature selection and chaotic particle swarm optimization (CPSO) for parameter optimization of support vector regression (SVR), to predict financial returns. The advantage of GA-CPSO-SVR is that it can deal with feature selection and SVM parameter optimization simultaneously. A numerical example is employed to compare the performance of the proposed model. Experiment results show that the proposed model outperforms the other approaches in forecasting financial returns. Index Terms—SVR; GA-CPSO; financial returns; forecasting

Journal ArticleDOI
TL;DR: A new fault diagnosis method is proposed by combining particle swarm optimization (PSO) with the LS-SVM algorithm, dynamically choosing the σ parameter of the kernel function, which enhances both the precision and the efficiency of fault diagnosis.
Abstract: Dissolved gas analysis (DGA) is an important method for diagnosing power transformer faults. The least squares support vector machine (LS-SVM) has excellent learning, classification and generalization ability; it uses structural risk minimization instead of traditional empirical risk minimization, which relies on large samples. LS-SVM is widely used in pattern recognition and function fitting. Kernel parameter selection is very important and determines the precision of power transformer fault diagnosis. In order to enhance fault diagnosis precision, a new fault diagnosis method is proposed by combining particle swarm optimization (PSO) with the LS-SVM algorithm. The method dynamically chooses the σ parameter of the kernel function, which enhances both the precision and the efficiency of fault diagnosis. Experiments show that the algorithm can efficiently find suitable kernel parameters that result in good classification performance.

Journal ArticleDOI
TL;DR: This brief tackles the problem of learning over the complex-valued matrix-hypersphere Sn,pα(C) in terms of Riemannian-gradient-based optimization of a regular criterion function and is implemented by a geodesic-stepping method.
Abstract: This brief tackles the problem of learning over the complex-valued matrix-hypersphere Sn,pα(C). The developed learning theory is formulated in terms of Riemannian-gradient-based optimization of a regular criterion function and is implemented by a geodesic-stepping method. The stepping method is equipped with a geodesic-search sub-algorithm to compute the optimal learning stepsize at any step. Numerical results show the effectiveness of the developed learning method and of its implementation.

Book ChapterDOI
01 Jan 2011
TL;DR: The aim is to show that the Lasso penalty enjoys good theoretical properties, in the sense that its prediction error is of the same order of magnitude as the prediction error one would have if one knew a priori which variables are relevant.
Abstract: We study the Lasso, i.e., l1-penalized empirical risk minimization, for general convex loss functions. The aim is to show that the Lasso penalty enjoys good theoretical properties, in the sense that its prediction error is of the same order of magnitude as the prediction error one would have if one knew a priori which variables are relevant. The chapter starts out with squared error loss with fixed design, because there the derivations are the simplest. For more general loss, we defer the probabilistic arguments to Chapter 14. We allow for misspecification of the (generalized) linear model, and will consider an oracle that represents the best approximation within the model of the truth. An important quantity in the results will be the so-called compatibility constant, which we require to be non-zero. The latter requirement is called the compatibility condition, a condition with eigenvalue-flavor to it. Our bounds (for prediction error, etc.) are given in explicit (non-asymptotic) form.
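A minimal numerical companion: l1-penalized least squares solved by iterative soft-thresholding (ISTA). The solver, data, and penalty level are illustrative choices; the chapter's analysis concerns the estimator itself, not this particular algorithm:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=2000):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by iterative soft-thresholding."""
    n = X.shape[0]
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n              # gradient of the squared-error term
        w = soft_threshold(w - g / L, lam / L)
    return w

# Only variable 0 is relevant; the l1 penalty should (near-)zero out variable 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)
w_hat = lasso_ista(X, y, lam=0.5)
```

The oracle-style bounds the chapter proves say, roughly, that the prediction error of such an estimate scales as if one had known in advance that only variable 0 mattered, provided the compatibility condition holds.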

Posted Content
TL;DR: The power of regularized learning with the hinge loss function is shown; using sparse regularization, the number of selected classifiers in the diverse ensemble is reduced without sacrificing accuracy.
Abstract: The main principle of stacked generalization (or Stacking) is using a second-level generalizer to combine the outputs of base classifiers in an ensemble. In this paper, we investigate different combination types under the stacking framework; namely weighted sum (WS), class-dependent weighted sum (CWS) and linear stacked generalization (LSG). For learning the weights, we propose using regularized empirical risk minimization with the hinge loss. In addition, we propose using group sparsity for regularization to facilitate classifier selection. We performed experiments using two different ensemble setups with differing diversities on 8 real-world datasets. Results show the power of regularized learning with the hinge loss function. Using sparse regularization, we are able to reduce the number of selected classifiers of the diverse ensemble without sacrificing accuracy. With the non-diverse ensembles, we even gain accuracy on average by using sparse regularization.
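The weighted-sum (WS) combination type can be sketched directly; the weights below are hypothetical stand-ins for what a hinge-loss weight learner might output:

```python
import numpy as np

def combine_ws(probs, w):
    """Weighted-sum (WS) stacking: one weight per base classifier, shared
    across classes. probs has shape (n_classifiers, n_samples, n_classes)."""
    return np.tensordot(w, probs, axes=1)   # -> (n_samples, n_classes)

# Two base classifiers, two samples, two classes
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.4, 0.6]]])
w = np.array([0.75, 0.25])     # hypothetical weights from a hinge-loss learner
scores = combine_ws(p, w)      # [[0.825, 0.175], [0.25, 0.75]]
pred = scores.argmax(axis=1)   # sample 0 -> class 0, sample 1 -> class 1
```

CWS and LSG generalize this by giving each classifier one weight per class, or a full linear map over all outputs; group-sparse regularization then zeroes entire classifiers' weight groups, which is how classifier selection falls out of the weight learning.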

Journal ArticleDOI
TL;DR: In this article, a unified derivation is given by a generator function U which naturally defines entropy, divergence and loss function, which associates with the boosting learning algorithms for the loss minimization, which includes AdaBoost and LogitBoost as a twin generated from Kullback-Leibler divergence.
Abstract: This paper discusses recent developments in pattern recognition, focusing on the boosting approach in machine learning. Statistical properties such as Bayes risk consistency for several loss functions are discussed in a probabilistic framework. A number of loss functions have been proposed for different purposes and targets. A unified derivation is given by a generator function U which naturally defines entropy, divergence and loss function. The class of U-loss functions is associated with boosting learning algorithms for loss minimization, which includes AdaBoost and LogitBoost as a twin generated from the Kullback-Leibler divergence, and the (partial) area under the ROC curve. We extend boosting to unsupervised learning, typically density estimation employing the U-loss function. Finally, a future perspective in machine learning is discussed.

Book ChapterDOI
TL;DR: The notions of approximation, loss function, and empirical risk functional are introduced, inspired by empirical risk assessment for classifiers in the field of statistical learning.
Abstract: We discuss the problem of measuring the quality of a decision support (classification) system that involves granularity. We put forward a proposal for such a quality measure in the case when the underlying granular system is based on rough and fuzzy set paradigms. We introduce the notions of approximation, loss function, and empirical risk functional that are inspired by empirical risk assessment for classifiers in the field of statistical learning.

Journal ArticleDOI
TL;DR: The proposed continuous relaxation problem is compared with problems solved with the help of other approaches to the construction of linear classifiers and features of nonsmooth optimization methods used to solve the formulated problems are described.
Abstract: Problems of construction of linear classifiers for classifying many sets are considered. In the case of linearly separable sets, problem statements are given that generalize already well-known formulations. For linearly inseparable sets, a natural criterion for choosing a classifier is empirical risk minimization. A mixed integer formulation of the empirical risk minimization problem and possible solutions of its continuous relaxation are considered. The proposed continuous relaxation problem is compared with problems solved with the help of other approaches to the construction of linear classifiers. Features of nonsmooth optimization methods used to solve the formulated problems are described.

Proceedings ArticleDOI
22 May 2011
TL;DR: It is shown that this estimator can converge to the theoretically optimal solution as fast as n^-1, where n is the number of training samples, and its approximation error decay rate is derived as a function of the resolution of a class of partitions known as recursive dyadic partitions.
Abstract: Empirical divergence maximization is an estimation method similar to empirical risk minimization whereby the Kullback-Leibler divergence is maximized over a class of functions that induce probability distributions. We use this method as a design strategy for quantizers whose output will ultimately be used to make a decision about the quantizer's input. We derive this estimator's approximation error decay rate as a function of the resolution of a class of partitions known as recursive dyadic partitions. This result, coupled with earlier results, shows that this estimator can converge to the theoretically optimal solution as fast as n^-1, where n is the number of training samples. This estimator is also capable of producing estimates that closely approximate optimal solutions which existing techniques cannot.