
Showing papers on "Empirical risk minimization" published in 2019


Journal Article
TL;DR: The authors developed an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error.
Abstract: We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

132 citations


Posted Content
TL;DR: By focusing on excess risk rather than parameter estimation, this work can give guarantees under weaker assumptions than in previous works and accommodate the case where the target parameter belongs to a complex nonparametric class.
Abstract: We provide non-asymptotic excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target parameter depends on an unknown nuisance parameter that must be estimated from data. We analyze a two-stage sample splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target parameter and one for the nuisance parameter. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order. Our theorem is agnostic to the particular algorithms used for the target and nuisance and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate settings in which the target parameter belongs to a complex nonparametric class. We provide conditions on the metric entropy of the nuisance and target classes such that oracle rates (rates of the same order as if we knew the nuisance parameter) are achieved. We also derive new rates for specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation and sparse high-dimensional linear model estimation. We highlight the applicability of our results in four settings of central importance: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.

122 citations


Posted Content
TL;DR: It is shown that data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant, and it is proved that it leads to variance reduction.
Abstract: Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a theoretical framework. We show data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant. We prove that it leads to variance reduction. We study empirical risk minimization, and the examples of exponential families, linear regression, and certain two-layer neural networks. We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM).

111 citations
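
As a toy illustration of the orbit-averaging view described in the abstract above, the following sketch averages a function over a finite group of transformations. The two-element sign-flip group, the test function, and the helper name `orbit_average` are assumptions made for this example and are not taken from the paper.

```python
from typing import Callable, Sequence

Transform = Callable[[float], float]


def orbit_average(f: Callable[[float], float],
                  group: Sequence[Transform]) -> Callable[[float], float]:
    """Return the function x -> mean of f(g(x)) over all g in `group`."""
    def averaged(x: float) -> float:
        return sum(f(g(x)) for g in group) / len(group)
    return averaged


# Hypothetical example group: {identity, negation} acting on the real line.
sign_flip_group = [lambda x: x, lambda x: -x]


def f(x: float) -> float:
    # Not invariant under x -> -x because of the odd term x.
    return x + x * x


f_bar = orbit_average(f, sign_flip_group)
print(f_bar(3.0), f_bar(-3.0))
```

In this example the odd part of `f` cancels in the average, so `f_bar` takes the same value at 3.0 and at -3.0; the averaged function is invariant under the group by construction.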


Posted Content
TL;DR: Performing ERM with peer loss functions on the noisy dataset leads to the optimal or a near-optimal classifier as if performing ERM over the clean training data, which the authors do not have access to.
Abstract: Learning with noisy labels is a common challenge in supervised learning. Existing approaches often require practitioners to specify noise rates, i.e., a set of parameters controlling the severity of label noise in the problem, and the specifications are either assumed to be given or estimated using additional steps. In this work, we introduce a new family of loss functions, which we call peer loss functions, that enables learning from noisy labels and does not require a priori specification of the noise rates. Peer loss functions work within the standard empirical risk minimization (ERM) framework. We show that, under mild conditions, performing ERM with peer loss functions on the noisy dataset leads to the optimal or a near-optimal classifier as if performing ERM over the clean training data, which we do not have access to. We pair our results with an extensive set of experiments. Peer loss provides a way to simplify model development when facing potentially noisy training labels, and can be promoted as a robust candidate loss function in such situations.

99 citations


Posted Content
TL;DR: This work designs loopless variants of the stochastic variance-reduced gradient method and proves that the new methods enjoy the same superior theoretical convergence properties as the original methods.
Abstract: The stochastic variance-reduced gradient method (SVRG) and its accelerated variant (Katyusha) have attracted enormous attention in the machine learning community in the last few years due to their superior theoretical properties and empirical behaviour on training supervised machine learning models via the empirical risk minimization paradigm. A key structural element in both of these methods is the inclusion of an outer loop at the beginning of which a full pass over the training data is made in order to compute the exact gradient, which is then used to construct a variance-reduced estimator of the gradient. In this work we design loopless variants of both of these methods. In particular, we remove the outer loop and replace its function by a coin flip performed in each iteration designed to trigger, with a small probability, the computation of the gradient. We prove that the new methods enjoy the same superior theoretical convergence properties as the original methods. However, we demonstrate through numerical experiments that our methods have substantially superior practical behavior.

85 citations
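
The coin-flip mechanism described in the abstract above can be sketched as follows. This is a minimal illustration for a scalar parameter, not the authors' implementation: the function name `loopless_svrg`, the toy least-squares data, and the values of `step` and `p` are all assumptions chosen for the example.

```python
import random
from typing import Callable


def loopless_svrg(grad_i: Callable[[int, float], float], n: int, x0: float,
                  step: float, p: float, iters: int, seed: int = 0) -> float:
    """Minimize (1/n) * sum_i f_i(x) over scalar x, given per-example gradients."""
    rng = random.Random(seed)

    def full_grad(z: float) -> float:
        return sum(grad_i(i, z) for i in range(n)) / n

    x = x0
    w = x0               # reference point
    g_w = full_grad(w)   # exact gradient at the reference point
    for _ in range(iters):
        i = rng.randrange(n)
        # Variance-reduced gradient estimate built from one sampled example.
        g = grad_i(i, x) - grad_i(i, w) + g_w
        x_next = x - step * g
        # A coin flip replaces the outer loop: with probability p, move the
        # reference point to the current iterate and recompute the full gradient.
        if rng.random() < p:
            w = x
            g_w = full_grad(w)
        x = x_next
    return x


# Hypothetical toy problem: f_i(x) = 0.5 * (a[i] * x - b[i])**2, minimized at x = 2.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]


def toy_grad(i: int, x: float) -> float:
    return a[i] * (a[i] * x - b[i])


x_hat = loopless_svrg(toy_grad, n=3, x0=0.0, step=0.01, p=0.2, iters=2000)
print(x_hat)
```

On average the full gradient is recomputed once every `1/p` iterations, so `p` plays the role that the outer-loop length plays in standard SVRG.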


Proceedings Article
11 Apr 2019
TL;DR: It is proved that tensor initialization followed by gradient descent can converge to the ground-truth parameters at a linear rate up to some statistical error.
Abstract: We study the problem of learning one-hidden-layer neural networks with Rectified Linear Unit (ReLU) activation function, where the inputs are sampled from a standard Gaussian distribution and the outputs are generated from a noisy teacher network. We analyze the performance of gradient descent for training such neural networks based on empirical risk minimization, and provide algorithm-dependent guarantees. In particular, we prove that tensor initialization followed by gradient descent can converge to the ground-truth parameters at a linear rate up to some statistical error. To the best of our knowledge, this is the first work characterizing the recovery guarantee for practical learning of one-hidden-layer ReLU networks with multiple neurons. Numerical experiments verify our theoretical findings.

83 citations


Journal Article
TL;DR: This study investigates a simple yet effective algorithm that constructs an SVR model for structural reliability analysis by solving an optimization sub-problem that minimizes the mean squared error obtained by cross-validation.

74 citations


Journal Article
TL;DR: In this paper, the median-of-means tournament (MOMT) was introduced to achieve the optimal tradeoff between accuracy and confidence under minimal assumptions, and in particular outperforms classical methods based on empirical risk minimization.
Abstract: We consider the classical statistical learning/regression problem, when the value of a real random variable Y is to be predicted based on the observation of another random variable X. Given a class of functions F and a sample of independent copies of (X, Y), one needs to choose a function f from F such that f(X) approximates Y as well as possible, in the mean-squared sense. We introduce a new procedure, the so-called median-of-means tournament, that achieves the optimal tradeoff between accuracy and confidence under minimal assumptions, and in particular outperforms classical methods based on empirical risk minimization.

71 citations
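
The tournament procedure itself is more involved than the abstract above can convey. The sketch below shows only the basic median-of-means mean estimator that gives the procedure its name: split the sample into blocks, average each block, and take the median of the block means. The function name and the toy sample are assumptions made for illustration.

```python
from statistics import fmean, median
from typing import Sequence


def median_of_means(values: Sequence[float], n_blocks: int) -> float:
    """Median of the block means over n_blocks equal-sized consecutive blocks."""
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    size = len(values) // n_blocks  # trailing leftover points are dropped
    block_means = [fmean(values[b * size:(b + 1) * size])
                   for b in range(n_blocks)]
    return median(block_means)


# Hypothetical sample with a single gross outlier.
sample = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1000.0]
robust = median_of_means(sample, n_blocks=3)
plain = fmean(sample)
print(robust, plain)
```

Because the outlier can corrupt only one block, the median of the three block means stays at 1.0 while the plain sample mean is pulled far away from it.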


Proceedings Article
06 Sep 2019
TL;DR: This work designs an oracle-efficient algorithm for the fair empirical risk minimization task and shows that given sufficiently many samples, the ERM solution generalizes in two directions: both to new individuals, and to new classification tasks, drawn from their corresponding distributions.
Abstract: We propose a new family of fairness definitions for classification problems that combine some of the best properties of both statistical and individual notions of fairness. We posit not only a distribution over individuals, but also a distribution over (or collection of) classification tasks. We then ask that standard statistics (such as error or false positive/negative rates) be (approximately) equalized across individuals, where the rate is defined as an expectation over the classification tasks. Because we are no longer averaging over coarse groups (such as race or gender), this is a semantically meaningful individual-level constraint. Given a sample of individuals and problems, we design an oracle-efficient algorithm (i.e. one that is given access to any standard, fairness-free learning heuristic) for the fair empirical risk minimization task. We also show that given sufficiently many samples, the ERM solution generalizes in two directions: both to new individuals, and to new classification tasks, drawn from their corresponding distributions. Finally we implement our algorithm and empirically verify its effectiveness.

58 citations



Posted Content
TL;DR: This paper proposes RayNet, which combines a CNN that learns view-invariant feature representations with an MRF that explicitly encodes the physics of perspective projection and occlusion and trains RayNet end-to-end using empirical risk minimization.
Abstract: In this paper, we consider the problem of reconstructing a dense 3D model using images captured from different views. Recent methods based on convolutional neural networks (CNN) allow learning the entire task from data. However, they do not incorporate the physics of image formation such as perspective geometry and occlusion. Instead, classical approaches based on Markov Random Fields (MRF) with ray-potentials explicitly model these physical processes, but they cannot cope with large surface appearance variations across different viewpoints. In this paper, we propose RayNet, which combines the strengths of both frameworks. RayNet integrates a CNN that learns view-invariant feature representations with an MRF that explicitly encodes the physics of perspective projection and occlusion. We train RayNet end-to-end using empirical risk minimization. We thoroughly evaluate our approach on challenging real-world datasets and demonstrate its benefits over a piece-wise trained baseline, hand-crafted models as well as other learning-based approaches.

Posted Content
TL;DR: This work considers the problem of sampling from a target distribution, which is not necessarily logconcave, in the context of empirical risk minimization and stochastic optimization as presented in Raginsky et al. (2017).
Abstract: We consider the problem of sampling from a target distribution, which is not necessarily logconcave, in the context of empirical risk minimization and stochastic optimization as presented in Raginsky et al. (2017). Non-asymptotic analysis results are established in the $L^1$-Wasserstein distance for the behaviour of Stochastic Gradient Langevin Dynamics (SGLD) algorithms. We allow the estimation of gradients to be performed even in the presence of dependent data streams. Our convergence estimates are sharper and uniform in the number of iterations, in contrast to those in previous studies.

Journal Article
17 Jul 2019
TL;DR: This paper investigates the DP-ERM problem in high dimensional space, and shows that by measuring the utility with Frank-Wolfe gap, it is possible to bound the utility by the Gaussian Width of the constraint set, instead of the dimensionality p of the underlying space.
Abstract: In this paper, we study the Differentially Private Empirical Risk Minimization (DP-ERM) problem with non-convex loss functions and give several upper bounds for the utility in different settings. We first consider the problem in low-dimensional space. For DP-ERM with a non-smooth regularizer, we generalize an existing work by measuring the utility using the ℓ2 norm of the projected gradient. Also, we extend the error bound measurement, for the first time, from empirical risk to population risk by using the expected ℓ2 norm of the gradient. We then investigate the problem in high dimensional space, and show that by measuring the utility with the Frank-Wolfe gap, it is possible to bound the utility by the Gaussian Width of the constraint set, instead of the dimensionality p of the underlying space. We further demonstrate that the advantages of this result can also be achieved by measuring the ℓ2 norm of the projected gradient. A somewhat surprising discovery is that although the two kinds of measurements are quite different, their induced utility upper bounds are asymptotically the same under some assumptions. We also show that the utility of some special non-convex loss functions can be reduced to a level (i.e., depending only on log p) similar to that of convex loss functions. Finally, we test our proposed algorithms on both synthetic and real world datasets and the experimental results confirm our theoretical analysis.

Journal Article
TL;DR: A regularized risk minimization procedure for regression function estimation is introduced that achieves near optimal accuracy and confidence under general conditions, including heavy-tailed predictor and response variables.
Abstract: We introduce a regularized risk minimization procedure for regression function estimation. The procedure is based on median-of-means tournaments, introduced by the authors in Lugosi and Mendelson (2018) and achieves near optimal accuracy and confidence under general conditions, including heavy-tailed predictor and response variables. It outperforms standard regularized empirical risk minimization procedures such as LASSO or SLOPE in heavy-tailed problems.

Journal Article
27 Sep 2019
TL;DR: In this article, the authors propose a general noise reduction framework that can apply to a variety of private empirical risk minimization (ERM) algorithms, using them to search the space of privacy levels for the empirically strongest one that meets the accuracy constraint, while incurring only logarithmic overhead in the number of privacy levels searched.
Abstract: Traditional approaches to differential privacy assume a fixed privacy requirement ε for a computation, and attempt to maximize the accuracy of the computation subject to the privacy constraint. As differential privacy is increasingly deployed in practical settings, it may often be that there is instead a fixed accuracy requirement for a given computation and the data analyst would like to maximize the privacy of the computation subject to the accuracy constraint. This raises the question of how to find and run a maximally private empirical risk minimizer subject to a given accuracy requirement. We propose a general “noise reduction” framework that can apply to a variety of private empirical risk minimization (ERM) algorithms, using them to “search” the space of privacy levels to find the empirically strongest one that meets the accuracy constraint, and incurring only logarithmic overhead in the number of privacy levels searched. The privacy analysis of our algorithm leads naturally to a version of differential privacy where the privacy parameters are dependent on the data, which we term ex-post privacy, and which is related to the recently introduced notion of privacy odometers. We also give an ex-post privacy analysis of the classical AboveThreshold privacy tool, modifying it to allow for queries chosen depending on the database. Finally, we apply our approach to two common objective functions, regularized linear and logistic regression, and empirically compare our noise reduction methods to (i) inverting the theoretical utility guarantees of standard private ERM algorithms and (ii) a stronger, empirical baseline based on binary search.

Proceedings Article
17 Jun 2019
TL;DR: In this article, the authors consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels.
Abstract: We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as $O(1/\sqrt{n})$ from $n$ observations, we assume that the individual losses are self-concordant, that is, their third-order derivatives are bounded by their second-order derivatives. This setting includes least-squares, as well as all generalized linear models such as logistic and softmax regression. For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity conditions, can be adapted to obtain fast non-asymptotic rates of convergence by improving the bias terms, the variance terms or both.

Posted Content
TL;DR: Non-asymptotic error bounds are obtained for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD), and examples from minibatch logistic regression and from variational inference are given by providing theoretical guarantees for the sampling behaviour of the algorithm.
Abstract: Within the context of empirical risk minimization, see Raginsky, Rakhlin, and Telgarsky (2017), we are concerned with a non-asymptotic analysis of sampling algorithms used in optimization. In particular, we obtain non-asymptotic error bounds for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD). These results are derived in Wasserstein-1 and Wasserstein-2 distances in the absence of log-concavity of the target distribution. More precisely, the stochastic gradient $H(\theta, x)$ is assumed to be locally Lipschitz continuous in both variables, and furthermore, the dissipativity condition is relaxed by removing its uniform dependence in $x$. This relaxation allows us to present two key paradigms within the framework of scalable posterior sampling for Bayesian inference and of nonconvex optimization; namely, examples from minibatch logistic regression and from variational inference are given by providing theoretical guarantees for the sampling behaviour of the algorithm.
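
For readers unfamiliar with SGLD, the recursion analyzed in the abstract above is commonly written as $\theta_{k+1} = \theta_k - \lambda H(\theta_k, X_{k+1}) + \sqrt{2\lambda/\beta}\,\xi_{k+1}$ with standard Gaussian noise $\xi_{k+1}$. The sketch below implements this update for a scalar parameter. The step size, the inverse temperature `beta`, the quadratic toy potential, and the Gaussian data stream are assumptions chosen for illustration; they are not the paper's examples.

```python
import math
import random
from typing import Callable, Iterator, List


def sgld(stoch_grad: Callable[[float, float], float], data: Iterator[float],
         theta0: float, step: float, beta: float, n_steps: int,
         seed: int = 0) -> List[float]:
    """Iterate theta <- theta - step * H(theta, x) + sqrt(2 * step / beta) * xi."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step / beta)
    theta = theta0
    path = []
    for _ in range(n_steps):
        x = next(data)  # next element of the data stream
        theta = (theta - step * stoch_grad(theta, x)
                 + noise_scale * rng.gauss(0.0, 1.0))
        path.append(theta)
    return path


def gaussian_stream(seed: int) -> Iterator[float]:
    """Hypothetical data stream of i.i.d. standard normal draws."""
    rng = random.Random(seed)
    while True:
        yield rng.gauss(0.0, 1.0)


# Toy potential U(theta) = theta**2 / 2 with noisy gradient H(theta, x) = theta - x,
# so the target density is proportional to exp(-beta * U), a centred Gaussian.
path = sgld(lambda t, x: t - x, gaussian_stream(1),
            theta0=5.0, step=0.01, beta=1.0, n_steps=50000)
tail = path[10000:]  # discard an initial burn-in segment
print(sum(tail) / len(tail))
```

With a very large `beta` the injected noise becomes negligible and the iteration behaves like stochastic gradient descent; with moderate `beta` the iterates spread out around the minimizer and approximately sample from the target.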

Posted Content
TL;DR: This paper studies a decentralized stochastic gradient tracking (DSGT) algorithm for non-convex empirical risk minimization problems over a peer-to-peer network of nodes, in sharp contrast to existing DSGT work, which covers only convex problems.
Abstract: This paper studies a decentralized stochastic gradient tracking (DSGT) algorithm for non-convex empirical risk minimization problems over a peer-to-peer network of nodes, in sharp contrast to existing DSGT work, which covers only convex problems. To ensure exact convergence and handle the variance among decentralized datasets, each node performs a stochastic gradient (SG) tracking step by using a mini-batch of samples, where the batch size is designed to be proportional to the size of the local dataset. We explicitly evaluate the convergence rate of DSGT with respect to the number of iterations in terms of algebraic connectivity of the network, mini-batch size, gradient variance, etc. Under certain conditions, we further show that DSGT has a network independence property in the sense that the network topology only affects the convergence rate up to a constant factor. Hence, the convergence rate of DSGT can be comparable to the centralized SGD method. Moreover, a linear speedup of DSGT with respect to the number of nodes is achievable for some scenarios. Numerical experiments for neural networks and logistic regression problems on CIFAR-10 finally illustrate the advantages of DSGT.

Proceedings Article
24 May 2019
TL;DR: Surprisingly, this approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization.
Abstract: We provide the first importance sampling variants of variance reduced algorithms for empirical risk minimization with non-convex loss functions. In particular, we analyze non-convex versions of SVRG, SAGA and SARAH. Our methods have the capacity to speed up the training process by an order of magnitude compared to the state of the art on real datasets. Moreover, we also improve upon current mini-batch analysis of these methods by proposing importance sampling for minibatches in this setting. Surprisingly, our approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization. All the above results follow from a general analysis of the methods which works with arbitrary sampling, i.e., fully general randomized strategy for the selection of subsets of examples to be sampled in each iteration. Finally, we also perform a novel importance sampling analysis of SARAH in the convex setting.

Proceedings Article
20 Oct 2019
TL;DR: In this paper, the authors propose to wrap the terms that cause a negative empirical risk in certain correction functions, prove the consistency of the corrected risk estimator, and derive an estimation error bound for the corrected risk minimizer.
Abstract: The recently proposed unlabeled-unlabeled (UU) classification method allows us to train a binary classifier only from two unlabeled datasets with different class priors. Since this method is based on empirical risk minimization, it works as if it were a supervised classification method, compatible with any model and optimizer. However, this method sometimes suffers from severe overfitting, which we would like to prevent in this paper. Our empirical finding in applying the original UU method is that overfitting often co-occurs with the empirical risk going negative, which is not legitimate. Therefore, we propose to wrap the terms that cause a negative empirical risk by certain correction functions. Then, we prove the consistency of the corrected risk estimator and derive an estimation error bound for the corrected risk minimizer. Experiments show that our proposal can successfully mitigate overfitting of the UU method and significantly improve the classification accuracy.

Proceedings Article
24 May 2019
TL;DR: In this article, an example-weighting algorithm based on empirical risk minimization is proposed to address the PUbN classification problem, with the weight of each example computed through a preliminary step that draws inspiration from PU learning.
Abstract: In binary classification, there are situations where negative (N) data are too diverse to be fully labeled and we often resort to positive-unlabeled (PU) learning in these scenarios. However, collecting a non-representative N set that contains only a small portion of all possible N data can often be much easier in practice. This paper studies a novel classification framework which incorporates such biased N (bN) data in PU learning. We provide a method based on empirical risk minimization to address this PUbN classification problem. Our approach can be regarded as a novel example-weighting algorithm, with the weight of each example computed through a preliminary step that draws inspiration from PU learning. We also derive an estimation error bound for the proposed method. Experimental results demonstrate the effectiveness of our algorithm in not only PUbN learning scenarios but also ordinary PU learning scenarios on several benchmark datasets.

Journal Article
01 Jan 2019-Optik
TL;DR: The experimental results demonstrate that the suggested DSVCNN-based method is competitive with current state-of-the-art approaches and superior to those that use traditional CNN methods.

Journal Article
30 Mar 2019
TL;DR: These algorithms can provide confidence intervals that satisfy differential privacy (as well as the more recently proposed concentrated differential privacy) and can be used with existing differentially private mechanisms that train models using objective perturbation and output perturbation.
Abstract: The process of data mining with differential privacy produces results that are affected by two types of noise: sampling noise due to data collection and privacy noise that is designed to prevent the reconstruction of sensitive information. In this paper, we consider the problem of designing confidence intervals for the parameters of a variety of differentially private machine learning models. The algorithms can provide confidence intervals that satisfy differential privacy (as well as the more recently proposed concentrated differential privacy) and can be used with existing differentially private mechanisms that train models using objective perturbation and output perturbation.

Proceedings Article
01 Oct 2019
TL;DR: Results computed on multiple UCI benchmark datasets clearly indicate the effectiveness and applicability of the proposed ISPTSVM compared to pinball support vector machine (Pin-SVM), twin bounded support vector machine (TBSVM) and SPTSVM.
Abstract: In this paper, we propose an improved version of sparse pinball twin support vector machine (SPTSVM) [1], called improved sparse pinball twin support vector machine (ISPTSVM). SPTSVM implements the empirical risk minimization principle, and the matrices appearing in the formulation of SPTSVM are positive semi-definite. Here, we reformulate the primal problems of SPTSVM by introducing an extra regularization term to the objective function of SPTSVM. Unlike SPTSVM, the proposed ISPTSVM implements the structural risk minimization (SRM) principle, which embodies the essence of statistical learning theory. Also, the matrices that appear in the dual formulation of the proposed ISPTSVM are positive definite. Results computed on multiple UCI benchmark datasets clearly indicate the effectiveness and applicability of the proposed ISPTSVM compared to pinball support vector machine (Pin-SVM), twin bounded support vector machine (TBSVM) and SPTSVM.

Journal Article
TL;DR: This paper considers multiple independent control systems, each connected over a nonstationary wireless channel, and uses Newton's method to learn transmit-power allocations within a fixed budget that maximize control performance over all the systems.
Abstract: This paper considers a set of multiple independent control systems that are each connected over a nonstationary wireless channel. The goal is to maximize control performance over all the systems through the allocation of transmitting power within a fixed budget. This can be formulated as a constrained optimization problem examined using Lagrangian duality. By taking samples of the unknown wireless channel at every time instance, the resulting problem takes on the form of empirical risk minimization, a well-studied problem in machine learning. Due to the nonstationarity of wireless channels, optimal allocations must be continuously learned and updated as the channel evolves. The quadratic convergence property of Newton's method motivates its use in learning approximately optimal power allocation policies over the sampled dual function as the channel evolves over time. Conditions are established under which Newton's method learns approximate solutions with a single update, and the subsequent suboptimality of the control problem is further characterized. Numerical simulations illustrate the near-optimal performance of the method and resulting stability on a wireless control problem.

Posted Content
TL;DR: Finite sample upper bounds are derived for the generalization error committed by specific families of reservoir computing systems when processing discrete-time inputs under various hypotheses on their dependence structure in the framework of statistical learning theory.
Abstract: We analyze the practices of reservoir computing in the framework of statistical learning theory. In particular, we derive finite sample upper bounds for the generalization error committed by specific families of reservoir computing systems when processing discrete-time inputs under various hypotheses on their dependence structure. Non-asymptotic bounds are explicitly written down in terms of the multivariate Rademacher complexities of the reservoir systems and the weak dependence structure of the signals that are being handled. This makes it possible, in particular, to determine the minimal number of observations needed in order to guarantee a prescribed estimation accuracy with high probability for a given reservoir family. At the same time, the asymptotic behavior of the devised bounds guarantees the consistency of the empirical risk minimization procedure for various hypothesis classes of reservoir functionals.

Journal Article
17 Jul 2019
TL;DR: A novel online learning algorithm, OLVF, is proposed to learn from data with arbitrarily varying feature spaces. It learns to classify the feature spaces and the instances from those feature spaces simultaneously, and a feature sparsity method is applied to reduce the model complexity.
Abstract: We study the problem of online learning with varying feature spaces. The problem is challenging because, unlike traditional online learning problems, varying feature spaces can introduce new features or stop having some features without following a pattern. Other existing methods such as online streaming feature selection (Wu et al. 2013), online learning from trapezoidal data streams (Zhang et al. 2016), and learning with feature evolvable streams (Hou, Zhang, and Zhou 2017) are not capable of learning from arbitrarily varying feature spaces because they make assumptions about the feature space dynamics. In this paper, we propose a novel online learning algorithm OLVF to learn from data with arbitrarily varying feature spaces. The OLVF algorithm learns to classify the feature spaces and the instances from feature spaces simultaneously. To classify an instance, the algorithm dynamically projects the instance classifier and the training instance onto their shared feature subspace. The feature space classifier predicts the projection confidences for a given feature space. The instance classifier will be updated by following the empirical risk minimization principle and the strength of the constraints will be scaled by the projection confidences. Afterwards, a feature sparsity method is applied to reduce the model complexity. Experiments on 10 datasets with varying feature spaces have been conducted to demonstrate the performance of the proposed OLVF algorithm. Moreover, experiments with trapezoidal data streams on the same datasets have been conducted to show that OLVF performs better than the state-of-the-art learning algorithm (Zhang et al. 2016).

Posted Content
TL;DR: It is shown that in the case of nonparametric regression over sieves of càdlàg functions with bounded sectional variation norm, this upper bound on the rate of convergence holds for least-squares estimators, under the random design, sub-exponential errors setting.
Abstract: Empirical risk minimization over classes of functions that are bounded for some version of the variation norm has a long history, starting with Total Variation Denoising (Rudin et al., 1992), and has been considered by several recent articles, in particular Fang et al., 2019 and van der Laan, 2015. In this article, we consider empirical risk minimization over the class $\mathcal{F}_d$ of cadlag functions over $[0,1]^d$ with bounded sectional variation norm (also called Hardy-Krause variation). We show how a certain representation of functions in $\mathcal{F}_d$ allows us to bound the bracketing entropy of sieves of $\mathcal{F}_d$, and therefore derive rates of convergence in nonparametric function estimation. Specifically, for sieves whose growth is controlled by some rate $a_n$, we show that the empirical risk minimizer has rate of convergence $O_P(n^{-1/3} (\log n)^{2(d-1)/3} a_n)$. Remarkably, the dimension only affects the rate in $n$ through the logarithmic factor, making this method especially appropriate for high dimensional problems. In particular, we show that in the case of nonparametric regression over sieves of cadlag functions with bounded sectional variation norm, this upper bound on the rate of convergence holds for least-squares estimators, under the random design, sub-exponential errors setting.

Proceedings Article
01 Mar 2019
TL;DR: These results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.
Abstract: We present a novel notion of complexity that interpolates between and generalizes some classic existing complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian, KL(posterior∥prior) complexity). For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to L2(P) entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi who did the log-loss case with L∞. Together, these results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.

Posted Content
TL;DR: This work shows that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number.
Abstract: Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on "light-tail" noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.
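
Of the two ingredients named in this abstract, robust distance estimation is simple enough to sketch. The version below follows the standard selection rule: among several independent candidate solutions, return the one whose smallest ball containing a strict majority of the candidates has the least radius. The function name and the tuple-of-floats representation are assumptions for illustration; this is not the proxBoost procedure itself, and the proximal point stage is not shown.

```python
import math


def robust_distance_estimate(candidates):
    """Pick the candidate whose tightest majority-containing ball is smallest.

    candidates: list of equal-length tuples of floats, e.g. the outputs of
    independent runs of a stochastic optimizer (an assumed setup).
    Returns (chosen_point, radius).
    """
    m = len(candidates)
    if m == 0:
        raise ValueError("need at least one candidate")
    # Sorted distances include the zero distance to the point itself, so the
    # entry at 0-based index m // 2 is the radius covering m // 2 + 1 > m / 2 points.
    k = m // 2
    best_point, best_radius = None, math.inf
    for x in candidates:
        dists = sorted(math.dist(x, y) for y in candidates)
        if dists[k] < best_radius:
            best_point, best_radius = x, dists[k]
    return best_point, best_radius
```

If more than half of the candidates lie within some distance of the true minimizer, the selected point cannot be a far outlier, which is how a guarantee that holds only with constant probability can be turned into a high confidence one by repeating the run.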