
Showing papers on "Empirical risk minimization published in 2011"


Journal Article
TL;DR: This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
Abstract: We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms.
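The per-coordinate adaptation the abstract describes can be sketched in a few lines; the update below is the diagonal AdaGrad-style rule on a toy quadratic (the objective, step size, and iteration count are illustrative choices, not the paper's):

```python
import numpy as np

def adagrad(grad_fn, x0, lr=0.5, eps=1e-8, steps=200):
    """Diagonal AdaGrad sketch: divide each coordinate's step by the root of
    its accumulated squared gradients, damping frequently-large coordinates."""
    x = x0.astype(float).copy()
    g_accum = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x)
        g_accum += g * g
        x -= lr * g / (np.sqrt(g_accum) + eps)
    return x

# Toy quadratic 0.5*(100*x1^2 + x2^2): curvatures differ by a factor of 100
a = np.array([100.0, 1.0])
x_star = adagrad(lambda x: a * x, np.array([1.0, 1.0]))
```

Both coordinates converge at essentially the same rate despite the 100x difference in curvature, which is the point of dividing each coordinate's step by its own gradient history.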

6,984 citations


Journal ArticleDOI
TL;DR: This work proposes a new method, objective perturbation, for privacy-preserving machine learning algorithm design, and shows that both theoretically and empirically, this method is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
Abstract: Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
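The objective perturbation idea — add a random linear term to the regularized objective before optimizing — can be sketched as follows. The noise distribution and scale here are simplified stand-ins, not the paper's calibrated ε-dependent values:

```python
import numpy as np

def private_logreg(X, y, lam=0.1, epsilon=1.0, steps=500, lr=0.1, rng=None):
    """Sketch of objective perturbation for regularized logistic regression:
    minimize (1/n) sum_i log(1 + exp(-y_i w.x_i)) + (lam/2)||w||^2 + (1/n) b.w,
    where b is random noise drawn once before optimization. The noise scale
    below is an illustrative stand-in, NOT the calibrated private value."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    b = rng.normal(size=d)
    b *= rng.exponential(2.0 / epsilon) / np.linalg.norm(b)  # random direction, random norm
    w = np.zeros(d)
    for _ in range(steps):
        z = y * (X @ w)
        g = -(X * (y / (1.0 + np.exp(z)))[:, None]).mean(axis=0)  # logistic loss gradient
        w -= lr * (g + lam * w + b / n)
    return w

# Toy separable data; with a large epsilon the added noise is negligible
X = np.array([[1.0, 0.0], [2.0, 0.1], [-1.0, 0.0], [-2.0, -0.1]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w_priv = private_logreg(X, y, epsilon=100.0, rng=0)
```

Because the perturbation enters the objective rather than the output, the optimizer itself trades the noise off against the data-fit term, which is the intuition behind the method's better privacy/utility tradeoff.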

1,057 citations


Journal Article
TL;DR: This work considers the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms defined as sums of Euclidean norms on certain subsets of variables, and explores the relationship between groups defining the norm and the resulting nonzero patterns.
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certain subsets of variables, extending the usual l1-norm and the group l1-norm by allowing the subsets to overlap. This leads to a specific set of allowed nonzero patterns for the solutions of such problems. We first explore the relationship between the groups defining the norm and the resulting nonzero patterns, providing both forward and backward algorithms to go back and forth from groups to patterns. This allows the design of norms adapted to specific prior knowledge expressed in terms of nonzero patterns. We also present an efficient active set algorithm, and analyze the consistency of variable selection for least-squares linear regression in low and high-dimensional settings.
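The norms in question can be computed directly; a minimal sketch (the group definitions and weights below are illustrative):

```python
import numpy as np

def structured_norm(w, groups, weights=None):
    """Omega(w) = sum over groups g of d_g * ||w[g]||_2, allowing groups to
    overlap. Singleton groups recover the l1 norm; a partition of the indices
    recovers the group l1 (group lasso) norm."""
    weights = weights if weights is not None else [1.0] * len(groups)
    return sum(d * np.linalg.norm(w[list(g)]) for d, g in zip(weights, groups))

w = np.array([3.0, 4.0, 0.0])
l1 = structured_norm(w, [[0], [1], [2]])        # 3 + 4 + 0 = 7, the l1 norm
overlap = structured_norm(w, [[0, 1], [1, 2]])  # ||(3,4)|| + ||(4,0)|| = 5 + 4 = 9
```

Which nonzero patterns the penalty allows is determined by which unions of group complements can be zeroed out, which is the group-to-pattern correspondence the abstract refers to.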

480 citations


Book
05 Aug 2011
TL;DR: The purpose of these lecture notes is to provide an introduction to the general theory of empirical risk minimization with an emphasis on excess risk bounds and oracle inequalities in penalized problems.
Abstract: The purpose of these lecture notes is to provide an introduction to the general theory of empirical risk minimization with an emphasis on excess risk bounds and oracle inequalities in penalized problems. In recent years, there have been new developments in this area motivated by the study of new classes of methods in machine learning such as large margin classification methods (boosting, kernel machines). The main probabilistic tools involved in the analysis of these problems are concentration and deviation inequalities by Talagrand along with other methods of empirical processes theory (symmetrization inequalities, contraction inequality for Rademacher sums, entropy and generic chaining bounds). Sparse recovery based on l_1-type penalization and low rank matrix recovery based on the nuclear norm penalization are other active areas of research, where the main problems can be stated in the framework of penalized empirical risk minimization, and concentration inequalities and empirical processes tools have proved to be very useful.

458 citations


BookDOI
01 Jan 2011
TL;DR: The main tools involved in the analysis of these problems are concentration and deviation inequalities by Talagrand along with other methods of empirical processes theory (symmetrization inequalities, contraction inequality for Rademacher sums, entropy and generic chaining bounds) as discussed by the authors.
Abstract: The purpose of these lecture notes is to provide an introduction to the general theory of empirical risk minimization with an emphasis on excess risk bounds and oracle inequalities in penalized problems. In recent years, there have been new developments in this area motivated by the study of new classes of methods in machine learning such as large margin classification methods (boosting, kernel machines). The main probabilistic tools involved in the analysis of these problems are concentration and deviation inequalities by Talagrand along with other methods of empirical processes theory (symmetrization inequalities, contraction inequality for Rademacher sums, entropy and generic chaining bounds). Sparse recovery based on l_1-type penalization and low rank matrix recovery based on the nuclear norm penalization are other active areas of research, where the main problems can be stated in the framework of penalized empirical risk minimization, and concentration inequalities and empirical processes tools have proved to be very useful.

274 citations


Journal ArticleDOI
01 Mar 2011
TL;DR: This investigation presents an SVR model with a chaotic genetic algorithm (CGA), namely SVRCGA, to forecast tourism demand; empirical results involving tourism demand data from an existing paper reveal that the proposed SVRCGA model outperforms other approaches in the literature.
Abstract: Accurate tourism demand forecasting systems are essential in tourism planning, particularly in tourism-based countries. Artificial neural networks are attracting attention for forecasting tourism demand due to their general nonlinear mapping capabilities. Unlike most conventional neural network models, which are based on the empirical risk minimization principle, support vector regression (SVR) applies the structural risk minimization principle to minimize an upper bound of the generalization error, rather than minimizing the training error. This investigation presents an SVR model with a chaotic genetic algorithm (CGA), namely SVRCGA, to forecast tourism demand. With the increasing complexity and larger problem scale of tourism demand, genetic algorithms (GAs) often face the problems of premature convergence, slow convergence to the global optimal solution, or trapping in a local optimum. The proposed CGA, based on the chaos optimization algorithm and GAs, employs the internal randomness of chaos iterations to overcome premature convergence to a local optimum when determining the three parameters of an SVR model. Empirical results involving tourism demand data from an existing paper reveal that the proposed SVRCGA model outperforms other approaches in the literature.

227 citations


Book ChapterDOI
01 May 2011
TL;DR: Statistical learning theory, as discussed by the authors, is regarded as one of the most beautifully developed branches of artificial intelligence; it provides the theoretical basis for many of today's machine learning algorithms, such as those for classification.
Abstract: Statistical learning theory is regarded as one of the most beautifully developed branches of artificial intelligence. It provides the theoretical basis for many of today's machine learning algorithms. The theory helps to explore what permits drawing valid conclusions from empirical data. This chapter provides an overview of the key ideas and insights of statistical learning theory. Statistical learning theory begins with a class of hypotheses and uses empirical data to select one hypothesis from the class. If the data-generating mechanism is benign, then the difference between the training error and test error of a hypothesis from the class is small. Statistical learning theory generally avoids metaphysical statements about aspects of the true underlying dependency, and is thus precise by referring to the difference between training and test error. The chapter also describes some other variants of machine learning.

205 citations


Journal ArticleDOI
TL;DR: This work establishes inequalities that describe how close approximate pinball risk minimizers are to the corresponding conditional quantile, and uses them to establish an oracle inequality for support vector machines that use the pinball loss.
Abstract: The so-called pinball loss for estimating conditional quantiles is a well-known tool in both statistics and machine learning. So far, however, little work has been done to quantify the efficiency of this tool for nonparametric approaches. We fill this gap by establishing inequalities that describe how close approximate pinball risk minimizers are to the corresponding conditional quantile. These inequalities, which hold under mild assumptions on the data-generating distribution, are then used to establish so-called variance bounds, which recently turned out to play an important role in the statistical analysis of (regularized) empirical risk minimization approaches. Finally, we use both types of inequalities to establish an oracle inequality for support vector machines that use the pinball loss. The resulting learning rates are min-max optimal under some standard regularity assumptions on the conditional quantile.
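For reference, the pinball loss and its defining property — the empirical pinball risk is minimized at the sample τ-quantile — can be checked numerically (the data and search grid below are illustrative):

```python
import numpy as np

def pinball_loss(y, t, tau):
    """Pinball loss: tau*(y-t) when y >= t, (tau-1)*(y-t) otherwise."""
    r = y - t
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)

# The empirical pinball risk is minimized at the sample tau-quantile:
# for tau = 0.5 that is the median, which ignores the outlier at 100.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
ts = np.linspace(0.0, 10.0, 1001)
risks = np.array([pinball_loss(y, t, 0.5).mean() for t in ts])
t_best = ts[np.argmin(risks)]   # close to 3.0, the median (the mean is 22)
```

Choosing τ ≠ 0.5 tilts the two slopes asymmetrically, so the minimizer shifts to the corresponding quantile rather than the median.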

149 citations


Proceedings Article
14 Jun 2011
TL;DR: This work argues that instead of choosing approximate MAP parameters, one should seek the parameters that minimize the empirical risk of the entire imperfect system, and shows how to locally optimize this risk using back-propagation and stochastic metadescent.
Abstract: Graphical models are often used "inappropriately," with approximations in the topology, inference, and prediction. Yet it is still common to train their parameters to approximately maximize training likelihood. We argue that instead, one should seek the parameters that minimize the empirical risk of the entire imperfect system. We show how to locally optimize this risk using back-propagation and stochastic metadescent. Over a range of synthetic-data problems, compared to the usual practice of choosing approximate MAP parameters, our approach significantly reduces loss on test data, sometimes by an order of magnitude.

148 citations


Journal ArticleDOI
TL;DR: In this paper, the authors used the so-called pinball loss for estimating conditional quantiles and established inequalities that describe how close approximate pinball risk minimizers are to the corresponding conditional quantile.
Abstract: Using the so-called pinball loss for estimating conditional quantiles is a well-known tool in both statistics and machine learning. So far, however, only little work has been done to quantify the efficiency of this tool for non-parametric (modified) empirical risk minimization approaches. The goal of this work is to fill this gap by establishing inequalities that describe how close approximate pinball risk minimizers are to the corresponding conditional quantile. These inequalities, which hold under mild assumptions on the data-generating distribution, are then used to establish so-called variance bounds which recently turned out to play an important role in the statistical analysis of (modified) empirical risk minimization approaches. To illustrate the use of the established inequalities, we then use them to establish an oracle inequality for support vector machines that use the pinball loss. Here, it turns out that we obtain learning rates which are optimal in a min-max sense under some standard assumptions on the regularity of the conditional quantile function.

119 citations


Journal ArticleDOI
TL;DR: A rapid sparse twin support vector machine (STSVM) classifier in primal space is proposed to improve the sparsity and robustness of TSVM.

Journal ArticleDOI
TL;DR: An SVM-based CCP recognition model is presented for the on-line real-time recognition of seven typical types of unnatural CCP, assuming that the process observations are AR(1) correlated over time; it is more robust to background noise in the process data than a model based on a back-propagation network.

Journal ArticleDOI
TL;DR: This work considers the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms defined as sums of Euclidean norms on certai...
Abstract: We consider the empirical risk minimization problem for linear supervised learning, with regularization by structured sparsity-inducing norms. These are defined as sums of Euclidean norms on certai...

Proceedings ArticleDOI
16 Jul 2011
TL;DR: It is proved that the empirical risk converges to its expected counterpart at a rate of root-n, and, under the assumption that the best metric minimizing the expected risk is bounded, that the learned metric is consistent.
Abstract: In this paper, we study the problem of learning a metric and propose a loss function based metric learning framework, in which the metric is estimated by minimizing an empirical risk over a training set. With mild conditions on the instance distribution and the used loss function, we prove that the empirical risk converges to its expected counterpart at a rate of root-n. In addition, under the assumption that the best metric that minimizes the expected risk is bounded, we prove that the learned metric is consistent. Two example algorithms are presented using the proposed loss function based metric learning framework, one with a log loss function and the other with a smoothed hinge loss function. Experimental results suggest the effectiveness of the proposed algorithms.
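One instance of such a loss-function-based empirical risk can be sketched with a Mahalanobis distance and a log loss on labeled pairs; the margin parameter and pairing scheme below are illustrative assumptions, not the paper's exact objective:

```python
import numpy as np

def mahalanobis_sq(x1, x2, M):
    """Squared Mahalanobis distance d_M(x1, x2) = (x1-x2)^T M (x1-x2)."""
    d = x1 - x2
    return d @ M @ d

def pair_log_loss_risk(X1, X2, labels, M, b=1.0):
    """Empirical risk for metric learning with a log loss over labeled pairs:
    similar pairs (label +1) should fall inside margin b, dissimilar pairs
    (label -1) outside it. z = label * (b - d_M(x1, x2)); loss = log(1+exp(-z))."""
    z = np.array([lab * (b - mahalanobis_sq(a, c, M))
                  for a, c, lab in zip(X1, X2, labels)])
    return np.log1p(np.exp(-z)).mean()

# A metric that separates the pairs scores lower risk than a degenerate one
X1 = np.array([[0.0, 0.0], [0.0, 0.0]])
X2 = np.array([[0.1, 0.1], [3.0, 3.0]])
labels = np.array([1.0, -1.0])
risk_id = pair_log_loss_risk(X1, X2, labels, np.eye(2))
risk_zero = pair_log_loss_risk(X1, X2, labels, np.zeros((2, 2)))
```

Minimizing such an empirical risk over positive semidefinite M is the estimation problem whose root-n convergence and consistency the paper analyzes.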

Posted Content
TL;DR: It is shown that stochastic gradient descent (SGD) with the usual hypotheses is CVon stable, and the implications of CVon stability for the convergence of SGD are discussed.
Abstract: In batch learning, stability together with existence and uniqueness of the solution corresponds to well-posedness of Empirical Risk Minimization (ERM) methods; recently, it was proved that CVloo stability is necessary and sufficient for generalization and consistency of ERM ([2]). In this note, we introduce CVon stability, which plays a similar role in online learning. We show that stochastic gradient descent (SGD) with the usual hypotheses is CVon stable, and we then discuss the implications of CVon stability for the convergence of SGD.

Proceedings ArticleDOI
01 Jan 2011
TL;DR: This paper proposes a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness for supervised learning models which are based on optimizing a regularized empirical risk function.
Abstract: In this paper, we study the problem of data skewness. A data set is skewed/imbalanced if its dependent variable is asymmetrically distributed. Dealing with skewed data sets has been identified as one of the ten most challenging problems in data mining research. We address the problem of class skewness for supervised learning models that are based on optimizing a regularized empirical risk function. These include both classification and regression models for discrete and continuous dependent variables. Classical empirical risk minimization is akin to minimizing the arithmetic mean of prediction errors, an approach whose induction process is biased towards the majority class for skewed data. To overcome this drawback, we propose a quadratic mean based learning framework (QMLearn) that is robust and insensitive to class skewness. We note that minimizing the quadratic mean is a convex optimization problem and hence can be efficiently solved for large and high-dimensional data. Comprehensive experiments demonstrate that the QMLearn model significantly outperforms existing statistical learners, including logistic regression, support vector machines, linear regression, support vector regression and quantile regression.
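The contrast between the arithmetic and quadratic means of per-class errors is easy to see numerically; the sketch below uses fixed per-class error values for illustration (QMLearn itself optimizes the quadratic mean inside a regularized risk):

```python
import numpy as np

def arithmetic_mean_risk(errors_a, errors_b):
    """Classical ERM criterion: the mean over all examples, which a large
    majority class can dominate."""
    return np.concatenate([errors_a, errors_b]).mean()

def quadratic_mean_risk(errors_a, errors_b):
    """QMLearn-style criterion: the quadratic mean of the two per-class
    average errors, which cannot hide a badly-served minority class."""
    ra, rb = errors_a.mean(), errors_b.mean()
    return np.sqrt((ra ** 2 + rb ** 2) / 2.0)

# Skewed toy losses: 95 majority examples with tiny error, 5 minority with large error
maj = np.full(95, 0.05)
mino = np.full(5, 1.0)
am = arithmetic_mean_risk(mino, maj)   # 0.0975: the skew hides the minority's loss
qm = quadratic_mean_risk(mino, maj)    # about 0.708: the minority's loss dominates
```

The quadratic mean upper-bounds the arithmetic mean of the two class risks, so driving it down forces both classes to be fit, which is why the criterion is insensitive to skew.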

Proceedings ArticleDOI
03 Oct 2011
TL;DR: This paper studies a larger panel of both algorithms (9 different kinds) and data sets (17 UCI data sets) to assess the ability of algorithms to produce models from only a few examples.
Abstract: Learning algorithms have proved their ability to deal with large amounts of data. Most statistical approaches use fixed-size learning sets and produce static models. However, in specific situations such as active or incremental learning, the learning task starts with only very few data points. In that case, looking for algorithms able to produce models with only a few examples becomes necessary. Classifiers in the literature are generally evaluated with criteria such as accuracy or the ability to order data (ranking). But this taxonomy of classifiers can change dramatically if the focus is on the ability to learn with just a few examples. To our knowledge, only a few studies have addressed this problem. The study presented in this paper examines a larger panel of both algorithms (9 different kinds) and data sets (17 UCI data sets).

Proceedings Article
14 Jun 2011
TL;DR: In this paper, two generalization error bounds for multiple kernel learning (MKL) were proposed, one of which is a Rademacher complexity bound which is additive in the kernel complexity and margin term.
Abstract: We propose two new generalization error bounds for multiple kernel learning (MKL). First, using the bound of Srebro and Ben-David (2006) as a starting point, we derive a new version which uses a simple counting argument for the choice of kernels in order to generate a tighter bound when 1-norm regularization (sparsity) is imposed in the kernel learning problem. The second bound is a Rademacher complexity bound which is additive in the (logarithmic) kernel complexity and margin term. This dependence is superior to all previously published Rademacher bounds for learning a convex combination of kernels, including the recent bound of Cortes et al. (2010), which exhibits a multiplicative interaction. We illustrate the tightness of our bounds with simulations.

Journal ArticleDOI
TL;DR: In this paper, the authors consider a weaker version of the margin condition that allows one to take into account that learning within a small model can be much easier than within a large one.
Abstract: A classical condition for fast learning rates is the margin condition, first introduced by Mammen and Tsybakov. We tackle in this paper the problem of adaptivity to this condition in the context of model selection, in a general learning framework. Actually, we consider a weaker version of this condition that allows one to take into account that learning within a small model can be much easier than within a large one. Requiring this “strong margin adaptivity” makes the model selection problem more challenging. We first prove, in a general framework, that some penalization procedures (including local Rademacher complexities) exhibit this adaptivity when the models are nested. Contrary to previous results, this holds with penalties that only depend on the data. Our second main result is that strong margin adaptivity is not always possible when the models are not nested: for every model selection procedure (even a randomized one), there is a problem for which it does not demonstrate strong margin adaptivity.

Book ChapterDOI
01 Jan 2011
TL;DR: This work presents the classification problem, starting with definitions and notations necessary to ground subsequent discussions, and discusses the Probably Approximately Correct learning framework and some function approximation strategies.
Abstract: We present the classification problem, starting with definitions and notations that are necessary to ground subsequent discussions. Then, we discuss the Probably Approximately Correct learning framework, and some function approximation strategies.

Proceedings ArticleDOI
18 Dec 2011
TL;DR: A method based on reinforcement learning is proposed for choosing a good supporting function during optimization using genetic algorithm and results of applying this method to a model problem are shown.
Abstract: This paper describes an optimization problem with one target function to be optimized and several supporting functions that can be used to speed up the optimization process. A method based on reinforcement learning is proposed for choosing a good supporting function during optimization using genetic algorithm. Results of applying this method to a model problem are shown.

Journal ArticleDOI
TL;DR: This study proposes a novel approach, combining the support vector machine method with a genetic algorithm (GA) for feature selection and chaotic particle swarm optimization (CPSO) for optimizing the parameters of support vector regression (SVR), to predict financial returns.
Abstract: Nowadays there are many novel forecasting approaches to improve forecasting accuracy in the financial markets. The Support Vector Machine (SVM), as a modern statistical tool, has been successfully used to solve nonlinear regression and time series problems. Unlike most conventional neural network models, which are based on the empirical risk minimization principle, SVM applies the structural risk minimization principle to minimize an upper bound of the generalization error rather than minimizing the training error. To build an effective SVM model, SVM parameters must be set carefully. This study proposes a novel approach, combining the support vector machine method with a genetic algorithm (GA) for feature selection and chaotic particle swarm optimization (CPSO) for parameter optimization of support vector regression (SVR), to predict financial returns. The advantage of GA-CPSO-SVR is that it can deal with feature selection and SVM parameter optimization simultaneously. A numerical example is employed to compare the performance of the proposed model. Experiment results show that the proposed model outperforms the other approaches in forecasting financial returns. Index Terms—SVR; GA-CPSO; financial returns; forecasting

Journal ArticleDOI
TL;DR: A new fault diagnosis method is proposed by combining particle swarm optimization (PSO) with the LS-SVM algorithm, dynamically choosing the σ parameter of the kernel function, which enhances both the precision and the efficiency of fault diagnosis.
Abstract: Dissolved gas analysis (DGA) is an important method for diagnosing power transformer faults. The least squares support vector machine (LS-SVM) has excellent learning, classification and generalization ability; it uses structural risk minimization instead of traditional empirical risk minimization, which relies on large samples. LS-SVM is widely used in pattern recognition and function fitting. Kernel parameter selection is very important and determines the precision of power transformer fault diagnosis. In order to enhance fault diagnosis precision, a new fault diagnosis method is proposed by combining particle swarm optimization (PSO) with the LS-SVM algorithm. The method dynamically chooses the σ parameter of the kernel function, which enhances both the precision and the efficiency of fault diagnosis. Experiments show that the algorithm can efficiently find suitable kernel parameters that result in good classification performance.

Journal ArticleDOI
TL;DR: This brief tackles the problem of learning over the complex-valued matrix-hypersphere Sn,pα(C) in terms of Riemannian-gradient-based optimization of a regular criterion function and is implemented by a geodesic-stepping method.
Abstract: This brief tackles the problem of learning over the complex-valued matrix-hypersphere Sn,pα(C). The developed learning theory is formulated in terms of Riemannian-gradient-based optimization of a regular criterion function and is implemented by a geodesic-stepping method. The stepping method is equipped with a geodesic-search sub-algorithm to compute the optimal learning stepsize at any step. Numerical results show the effectiveness of the developed learning method and of its implementation.

Book ChapterDOI
01 Jan 2011
TL;DR: The aim is to show that the Lasso penalty enjoys good theoretical properties, in the sense that its prediction error is of the same order of magnitude as the prediction error one would have if one knew a priori which variables are relevant.
Abstract: We study the Lasso, i.e., l1-penalized empirical risk minimization, for general convex loss functions. The aim is to show that the Lasso penalty enjoys good theoretical properties, in the sense that its prediction error is of the same order of magnitude as the prediction error one would have if one knew a priori which variables are relevant. The chapter starts out with squared error loss with fixed design, because there the derivations are the simplest. For more general loss, we defer the probabilistic arguments to Chapter 14. We allow for misspecification of the (generalized) linear model, and will consider an oracle that represents the best approximation within the model of the truth. An important quantity in the results will be the so-called compatibility constant, which we require to be non-zero. The latter requirement is called the compatibility condition, a condition with eigenvalue-flavor to it. Our bounds (for prediction error, etc.) are given in explicit (non-asymptotic) form.
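A minimal numerical companion: l1-penalized least squares solved by iterative soft-thresholding (ISTA). The solver, data, and penalty level are illustrative choices; the chapter's analysis concerns the estimator itself, not this particular algorithm:

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=2000):
    """Minimize (1/2n)||y - Xw||^2 + lam*||w||_1 by iterative soft-thresholding."""
    n = X.shape[0]
    L = np.linalg.norm(X, 2) ** 2 / n          # Lipschitz constant of the smooth part
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n              # gradient of the squared-error term
        w = soft_threshold(w - g / L, lam / L)
    return w

# Only variable 0 is relevant; the l1 penalty should (near-)zero out variable 1
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)
w_hat = lasso_ista(X, y, lam=0.5)
```

The oracle-style bounds the chapter proves say, roughly, that the prediction error of such an estimate scales as if one had known in advance that only variable 0 mattered, provided the compatibility condition holds.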

Posted Content
TL;DR: The power of regularized learning with the hinge loss function is shown; using sparse regularization, the number of selected classifiers in the diverse ensemble is reduced without sacrificing accuracy.
Abstract: The main principle of stacked generalization (or Stacking) is using a second-level generalizer to combine the outputs of base classifiers in an ensemble. In this paper, we investigate different combination types under the stacking framework; namely weighted sum (WS), class-dependent weighted sum (CWS) and linear stacked generalization (LSG). For learning the weights, we propose using regularized empirical risk minimization with the hinge loss. In addition, we propose using group sparsity for regularization to facilitate classifier selection. We performed experiments using two different ensemble setups with differing diversities on 8 real-world datasets. Results show the power of regularized learning with the hinge loss function. Using sparse regularization, we are able to reduce the number of selected classifiers of the diverse ensemble without sacrificing accuracy. With the non-diverse ensembles, we even gain accuracy on average by using sparse regularization.
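The weighted-sum (WS) combination type can be sketched directly; the weights below are hypothetical stand-ins for what a hinge-loss weight learner might output:

```python
import numpy as np

def combine_ws(probs, w):
    """Weighted-sum (WS) stacking: one weight per base classifier, shared
    across classes. probs has shape (n_classifiers, n_samples, n_classes)."""
    return np.tensordot(w, probs, axes=1)   # -> (n_samples, n_classes)

# Two base classifiers, two samples, two classes
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.4, 0.6]]])
w = np.array([0.75, 0.25])     # hypothetical weights from a hinge-loss learner
scores = combine_ws(p, w)      # [[0.825, 0.175], [0.25, 0.75]]
pred = scores.argmax(axis=1)   # sample 0 -> class 0, sample 1 -> class 1
```

CWS and LSG generalize this by giving each classifier one weight per class, or a full linear map over all outputs; group-sparse regularization then zeroes entire classifiers' weight groups, which is how classifier selection falls out of the weight learning.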

Journal ArticleDOI
TL;DR: In this article, a unified derivation is given by a generator function U which naturally defines entropy, divergence and loss function, which associates with the boosting learning algorithms for the loss minimization, which includes AdaBoost and LogitBoost as a twin generated from Kullback-Leibler divergence.
Abstract: This paper discusses recent developments in pattern recognition, focusing on the boosting approach in machine learning. Statistical properties such as Bayes risk consistency for several loss functions are discussed in a probabilistic framework. A number of loss functions have been proposed for different purposes and targets. A unified derivation is given by a generator function U which naturally defines entropy, divergence and loss function. The class of U-loss functions is associated with boosting learning algorithms for loss minimization, which includes AdaBoost and LogitBoost as a twin generated from the Kullback-Leibler divergence, and the (partial) area under the ROC curve. We extend boosting to unsupervised learning, typically density estimation employing the U-loss function. Finally, a future perspective in machine learning is discussed.

Book ChapterDOI
TL;DR: The notions of approximation, loss function, and empirical risk functional are introduced, inspired by empirical risk assessment for classifiers in the field of statistical learning.
Abstract: We discuss the problem of measuring the quality of a decision support (classification) system that involves granularity. We put forward a proposal for such a quality measure in the case when the underlying granular system is based on rough and fuzzy set paradigms. We introduce the notions of approximation, loss function, and empirical risk functional that are inspired by empirical risk assessment for classifiers in the field of statistical learning.

Journal ArticleDOI
TL;DR: The proposed continuous relaxation problem is compared with problems solved with the help of other approaches to the construction of linear classifiers and features of nonsmooth optimization methods used to solve the formulated problems are described.
Abstract: Problems of construction of linear classifiers for classifying many sets are considered. In the case of linearly separable sets, problem statements are given that generalize already well-known formulations. For linearly inseparable sets, a natural criterion for choosing a classifier is empirical risk minimization. A mixed integer formulation of the empirical risk minimization problem and possible solutions of its continuous relaxation are considered. The proposed continuous relaxation problem is compared with problems solved with the help of other approaches to the construction of linear classifiers. Features of nonsmooth optimization methods used to solve the formulated problems are described.

Proceedings ArticleDOI
22 May 2011
TL;DR: It is shown that this estimator can converge to the theoretically optimal solution as fast as n^-1, where n is the number of training samples, and its approximation error decay rate is derived as a function of the resolution of a class of partitions known as recursive dyadic partitions.
Abstract: Empirical divergence maximization is an estimation method similar to empirical risk minimization whereby the Kullback-Leibler divergence is maximized over a class of functions that induce probability distributions. We use this method as a design strategy for quantizers whose output will ultimately be used to make a decision about the quantizer's input. We derive this estimator's approximation error decay rate as a function of the resolution of a class of partitions known as recursive dyadic partitions. This result, coupled with earlier results, shows that this estimator can converge to the theoretically optimal solution as fast as n^-1, where n is the number of training samples. This estimator is also capable of producing estimates that closely approximate optimal solutions which existing techniques cannot.