Showing papers on "Empirical risk minimization" published in 2019

Open Access

Journal Article•

Variance-based Regularization with Convex Objectives

01 Jan 2019-Journal of Machine Learning Research

TL;DR: The authors developed an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error.

Abstract: We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

132 citations

Posted Content•

Orthogonal Statistical Learning

Dylan J. Foster¹, Vasilis Syrgkanis²•Institutions (2)

Massachusetts Institute of Technology¹, Microsoft²

25 Jan 2019-arXiv: Statistics Theory

TL;DR: By focusing on excess risk rather than parameter estimation, this work can give guarantees under weaker assumptions than in previous works and accommodate the case where the target parameter belongs to a complex nonparametric class.

Abstract: We provide non-asymptotic excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target parameter depends on an unknown nuisance parameter that must be estimated from data. We analyze a two-stage sample splitting meta-algorithm that takes as input two arbitrary estimation algorithms: one for the target parameter and one for the nuisance parameter. We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order. Our theorem is agnostic to the particular algorithms used for the target and nuisance and only makes an assumption on their individual performance. This enables the use of a plethora of existing results from statistical learning and machine learning to give new guarantees for learning with a nuisance component. Moreover, by focusing on excess risk rather than parameter estimation, we can give guarantees under weaker assumptions than in previous works and accommodate settings in which the target parameter belongs to a complex nonparametric class. We provide conditions on the metric entropy of the nuisance and target classes such that oracle rates---rates of the same order as if we knew the nuisance parameter---are achieved. We also derive new rates for specific estimation algorithms such as variance-penalized empirical risk minimization, neural network estimation and sparse high-dimensional linear model estimation. We highlight the applicability of our results in four settings of central importance: 1) heterogeneous treatment effect estimation, 2) offline policy optimization, 3) domain adaptation, and 4) learning with missing data.

122 citations

Posted Content•

A Group-Theoretic Framework for Data Augmentation

Shuxiao Chen¹, Edgar Dobriban¹, Jane H. Lee¹•Institutions (1)

University of Pennsylvania¹

25 Jul 2019-arXiv: Machine Learning

TL;DR: It is shown that data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant, and it is proved that it leads to variance reduction.

Abstract: Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a theoretical framework. We show data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant. We prove that it leads to variance reduction. We study empirical risk minimization, and the examples of exponential families, linear regression, and certain two-layer neural networks. We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM).
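
The orbit-averaging view in the abstract above can be illustrated with a minimal sketch. It assumes a toy setting that is not from the paper: the group is the set of cyclic shifts of a short tuple, and `toy_loss` is a hypothetical loss chosen only because it is not shift-invariant by itself. Averaging the loss over the orbit of an input makes the result identical for every point in that orbit; the sketch does not attempt to show the variance-reduction result.

```python
def cyclic_orbit(x):
    """Return every cyclic shift of the tuple x (its orbit under the cyclic group)."""
    return [x[k:] + x[:k] for k in range(len(x))]


def orbit_average(loss, x):
    """Average a loss over the orbit of x, i.e. over all augmented copies of x."""
    orbit = cyclic_orbit(x)
    return sum(loss(gx) for gx in orbit) / len(orbit)


def toy_loss(x):
    """Hypothetical loss that depends on coordinate order, so it is not invariant."""
    return x[0] ** 2 + 2.0 * x[1]


x = (1.0, 2.0, 3.0)
shifted = (2.0, 3.0, 1.0)  # a cyclic shift of x, so it lies in the same orbit

# The raw loss differs between the two points of the orbit ...
plain_gap = abs(toy_loss(x) - toy_loss(shifted))
# ... while the orbit-averaged loss takes the same value on both.
averaged_gap = abs(orbit_average(toy_loss, x) - orbit_average(toy_loss, shifted))
```

Here the group is exact and finite; the abstract only requires the data distribution to be approximately invariant, which this toy example does not model.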

111 citations

Posted Content•

Peer Loss Functions: Learning from Noisy Labels without Knowing Noise Rates

Yang Liu¹, Hongyi Guo²•Institutions (2)

University of California, Santa Cruz¹, Shanghai Jiao Tong University²

08 Oct 2019-arXiv: Learning

TL;DR: Performing ERM with peer loss functions on the noisy dataset leads to the optimal or a near-optimal classifier as if performing ERM over the clean training data, which the authors do not have access to.

Abstract: Learning with noisy labels is a common challenge in supervised learning. Existing approaches often require practitioners to specify noise rates, i.e., a set of parameters controlling the severity of label noise in the problem, and the specifications are either assumed to be given or estimated using additional steps. In this work, we introduce a new family of loss functions that we call peer loss functions, which enables learning from noisy labels and does not require a priori specification of the noise rates. Peer loss functions work within the standard empirical risk minimization (ERM) framework. We show that, under mild conditions, performing ERM with peer loss functions on the noisy dataset leads to the optimal or a near-optimal classifier as if performing ERM over the clean training data, which we do not have access to. We pair our results with an extensive set of experiments. Peer loss provides a way to simplify model development when facing potentially noisy training labels, and can be promoted as a robust candidate loss function in such situations.

99 citations

Posted Content•

Don't Jump Through Hoops and Remove Those Loops: SVRG and Katyusha are Better Without the Outer Loop

Dmitry Kovalev¹, Samuel Horváth¹, Peter Richtárik¹•Institutions (1)

King Abdullah University of Science and Technology¹

24 Jan 2019-arXiv: Learning

TL;DR: This work designs loopless variants of the stochastic variance-reduced gradient method and proves that the new methods enjoy the same superior theoretical convergence properties as the original methods.

Abstract: The stochastic variance-reduced gradient method (SVRG) and its accelerated variant (Katyusha) have attracted enormous attention in the machine learning community in the last few years due to their superior theoretical properties and empirical behaviour on training supervised machine learning models via the empirical risk minimization paradigm. A key structural element in both of these methods is the inclusion of an outer loop at the beginning of which a full pass over the training data is made in order to compute the exact gradient, which is then used to construct a variance-reduced estimator of the gradient. In this work we design loopless variants of both of these methods. In particular, we remove the outer loop and replace its function by a coin flip performed in each iteration designed to trigger, with a small probability, the computation of the gradient. We prove that the new methods enjoy the same superior theoretical convergence properties as the original methods. However, we demonstrate through numerical experiments that our methods have substantially superior practical behavior.
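
The coin-flip idea described in the abstract above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the objective is a hypothetical one-dimensional least-squares problem with losses 0.5 * (a_i * x - b_i)^2, and the step size, refresh probability `p = 1/n`, and iteration count are illustrative choices. Only the loopless SVRG variant is shown, not the Katyusha one.

```python
import random


def loopless_svrg(a, b, step=0.05, p=None, iters=5000, seed=0):
    """Loopless SVRG sketch for f(x) = (1/n) * sum_i 0.5 * (a[i] * x - b[i])**2."""
    rng = random.Random(seed)
    n = len(a)
    p = 1.0 / n if p is None else p  # assumed refresh probability

    def grad_i(i, x):
        return a[i] * (a[i] * x - b[i])

    def full_grad(x):
        return sum(grad_i(i, x) for i in range(n)) / n

    x = 0.0
    w = x                # reference point
    gw = full_grad(w)    # exact gradient stored at the reference point
    for _ in range(iters):
        i = rng.randrange(n)
        # Variance-reduced gradient estimator built from the stored full gradient.
        g = grad_i(i, x) - grad_i(i, w) + gw
        x_next = x - step * g
        # The coin flip replaces the outer loop: with small probability p,
        # move the reference point to the current iterate and recompute the full gradient.
        if rng.random() < p:
            w = x
            gw = full_grad(w)
        x = x_next
    return x


# Hypothetical data: slopes in [0.5, 1.5], targets near 2 * a_i with alternating offsets.
a = [0.5 + i / 19 for i in range(20)]
b = [2.0 * ai + 0.1 * (-1) ** i for i, ai in enumerate(a)]
x_star = sum(ai * bi for ai, bi in zip(a, b)) / sum(ai * ai for ai in a)  # closed-form minimizer
x_hat = loopless_svrg(a, b)
```

On this toy problem `x_hat` can be compared against the closed-form minimizer `x_star`; the full gradient is recomputed only on the iterations where the coin flip succeeds.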

85 citations

Proceedings Article•

Learning One-hidden-layer ReLU Networks via Gradient Descent

Xiao Zhang¹, Yaodong Yu¹, Lingxiao Wang², Quanquan Gu²•Institutions (2)

University of Virginia¹, University of California, Los Angeles²

11 Apr 2019

TL;DR: It is proved that tensor initialization followed by gradient descent can converge to the ground-truth parameters at a linear rate up to some statistical error.

Abstract: We study the problem of learning one-hidden-layer neural networks with the Rectified Linear Unit (ReLU) activation function, where the inputs are sampled from a standard Gaussian distribution and the outputs are generated from a noisy teacher network. We analyze the performance of gradient descent for training such neural networks based on empirical risk minimization, and provide algorithm-dependent guarantees. In particular, we prove that tensor initialization followed by gradient descent can converge to the ground-truth parameters at a linear rate up to some statistical error. To the best of our knowledge, this is the first work characterizing the recovery guarantee for practical learning of one-hidden-layer ReLU networks with multiple neurons. Numerical experiments verify our theoretical findings.

83 citations

Journal Article•DOI•

Support vector regression based metamodeling for structural reliability analysis

Atin Roy¹, Ramkrishna Manna¹, Subrata Chakraborty¹•Institutions (1)

Indian Institute of Engineering Science and Technology, Shibpur¹

01 Jan 2019-Probabilistic Engineering Mechanics

TL;DR: This study investigates a simple yet effective algorithm that constructs an SVR model for structural reliability analysis by solving an optimization sub-problem that minimizes the mean square error obtained by cross-validation.

74 citations

Journal Article•DOI•

Risk minimization by median-of-means tournaments

Gábor Lugosi¹, Shahar Mendelson²•Institutions (2)

Pompeu Fabra University¹, Australian National University²

16 Dec 2019-Journal of the European Mathematical Society

TL;DR: This paper introduces the median-of-means tournament, a procedure that achieves the optimal tradeoff between accuracy and confidence under minimal assumptions and, in particular, outperforms classical methods based on empirical risk minimization.

Abstract: We consider the classical statistical learning/regression problem, when the value of a real random variable Y is to be predicted based on the observation of another random variable X. Given a class of functions F and a sample of independent copies of (X, Y ), one needs to choose a function f from F such that f(X) approximates Y as well as possible, in the mean-squared sense. We introduce a new procedure, the so-called median-of-means tournament, that achieves the optimal tradeoff between accuracy and confidence under minimal assumptions, and in particular outperforms classical methods based on empirical risk minimization.
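
The tournament procedure itself is not spelled out in the abstract above, so it is not reproduced here. As background, the following minimal sketch shows only the plain median-of-means estimator of a mean, the building block that gives the procedure its name. The function name, the contiguous block split, and the toy data are assumptions made for illustration.

```python
import statistics


def median_of_means(values, n_blocks):
    """Split values into n_blocks equal-size blocks and return the median of the block means."""
    if not 1 <= n_blocks <= len(values):
        raise ValueError("n_blocks must be between 1 and len(values)")
    block_size = len(values) // n_blocks  # leftover samples at the end are dropped
    block_means = [
        statistics.fmean(values[k * block_size:(k + 1) * block_size])
        for k in range(n_blocks)
    ]
    return statistics.median(block_means)


# Hypothetical heavy-tailed sample: 99 ordinary observations and one extreme outlier.
sample = [1.0] * 99 + [1e6]
plain_mean = statistics.fmean(sample)
robust_mean = median_of_means(sample, n_blocks=10)
```

The single outlier contaminates only one of the ten blocks, so the median of the block means is unaffected, whereas the ordinary empirical mean is pulled far from 1.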

71 citations

Proceedings Article•

Average Individual Fairness: Algorithms, Generalization and Experiments

Saeed Sharifi-Malvajerdi¹, Michael Kearns¹, Aaron Roth¹•Institutions (1)

University of Pennsylvania¹

06 Sep 2019

TL;DR: This work designs an oracle-efficient algorithm for the fair empirical risk minimization task and shows that given sufficiently many samples, the ERM solution generalizes in two directions: both to new individuals, and to new classification tasks, drawn from their corresponding distributions.

Abstract: We propose a new family of fairness definitions for classification problems that combine some of the best properties of both statistical and individual notions of fairness. We posit not only a distribution over individuals, but also a distribution over (or collection of) classification tasks. We then ask that standard statistics (such as error or false positive/negative rates) be (approximately) equalized across individuals, where the rate is defined as an expectation over the classification tasks. Because we are no longer averaging over coarse groups (such as race or gender), this is a semantically meaningful individual-level constraint. Given a sample of individuals and problems, we design an oracle-efficient algorithm (i.e. one that is given access to any standard, fairness-free learning heuristic) for the fair empirical risk minimization task. We also show that given sufficiently many samples, the ERM solution generalizes in two directions: both to new individuals, and to new classification tasks, drawn from their corresponding distributions. Finally we implement our algorithm and empirically verify its effectiveness.

58 citations

Proceedings Article•

Differentially Private Empirical Risk Minimization with Non-convex Loss Functions

Di Wang¹, Changyou Chen², Jinhui Xu³•Institutions (3)

Georgia Institute of Technology¹, State University of New York System², University at Buffalo³

24 May 2019

58 citations

Posted Content•

RayNet: Learning Volumetric 3D Reconstruction with Ray Potentials

Despoina Paschalidou, Ali Osman Ulusoy¹, Carolin Schmitt², Luc Van Gool³, Andreas Geiger³•Institutions (3)

Microsoft¹, Max Planck Society², ETH Zurich³

06 Jan 2019-arXiv: Computer Vision and Pattern Recognition

TL;DR: This paper proposes RayNet, which combines a CNN that learns view-invariant feature representations with an MRF that explicitly encodes the physics of perspective projection and occlusion and trains RayNet end-to-end using empirical risk minimization.

Abstract: In this paper, we consider the problem of reconstructing a dense 3D model using images captured from different views. Recent methods based on convolutional neural networks (CNN) allow learning the entire task from data. However, they do not incorporate the physics of image formation such as perspective geometry and occlusion. Instead, classical approaches based on Markov Random Fields (MRF) with ray-potentials explicitly model these physical processes, but they cannot cope with large surface appearance variations across different viewpoints. In this paper, we propose RayNet, which combines the strengths of both frameworks. RayNet integrates a CNN that learns view-invariant feature representations with an MRF that explicitly encodes the physics of perspective projection and occlusion. We train RayNet end-to-end using empirical risk minimization. We thoroughly evaluate our approach on challenging real-world datasets and demonstrate its benefits over a piece-wise trained baseline, hand-crafted models as well as other learning-based approaches.

Posted Content•

On stochastic gradient Langevin dynamics with dependent data streams: the fully non-convex case

N. H. Chau, Eric Moulines, Miklós Rásonyi¹, Sotirios Sabanis, Ying Zhang•Institutions (1)

Alfréd Rényi Institute of Mathematics¹

30 May 2019-arXiv: Statistics Theory

TL;DR: This work considers the problem of sampling from a target distribution, which is not necessarily logconcave, in the context of empirical risk minimization and stochastic optimization as presented in Raginsky et al. (2017).

Abstract: We consider the problem of sampling from a target distribution, which is not necessarily logconcave, in the context of empirical risk minimization and stochastic optimization as presented in Raginsky et al. (2017). Non-asymptotic analysis results are established in the $L^1$-Wasserstein distance for the behaviour of Stochastic Gradient Langevin Dynamics (SGLD) algorithms. We allow the estimation of gradients to be performed even in the presence of dependent data streams. Our convergence estimates are sharper and uniform in the number of iterations, in contrast to those in previous studies.

Journal Article•DOI•

Differentially Private Empirical Risk Minimization with Smooth Non-Convex Loss Functions: A Non-Stationary View

Di Wang¹, Jinhui Xu¹•Institutions (1)

University at Buffalo¹

17 Jul 2019

TL;DR: This paper investigates the DP-ERM problem in high dimensional space, and shows that by measuring the utility with Frank-Wolfe gap, it is possible to bound the utility by the Gaussian Width of the constraint set, instead of the dimensionality p of the underlying space.

Abstract: In this paper, we study the Differentially Private Empirical Risk Minimization (DP-ERM) problem with non-convex loss functions and give several upper bounds for the utility in different settings. We first consider the problem in low-dimensional space. For DP-ERM with non-smooth regularizer, we generalize an existing work by measuring the utility using l2 norm of the projected gradient. Also, we extend the error bound measurement, for the first time, from empirical risk to population risk by using the expected l2 norm of the gradient. We then investigate the problem in high dimensional space, and show that by measuring the utility with Frank-Wolfe gap, it is possible to bound the utility by the Gaussian Width of the constraint set, instead of the dimensionality p of the underlying space. We further demonstrate that the advantages of this result can be achieved by the measure of l2 norm of the projected gradient. A somewhat surprising discovery is that although the two kinds of measurements are quite different, their induced utility upper bounds are asymptotically the same under some assumptions. We also show that the utility of some special non-convex loss functions can be reduced to a level (i.e., depending only on log p) similar to that of convex loss functions. Finally, we test our proposed algorithms on both synthetic and real world datasets and the experimental results confirm our theoretical analysis.

Journal Article•DOI•

Regularization, sparse recovery, and median-of-means tournaments

Gábor Lugosi, Shahar Mendelson

01 Aug 2019-Bernoulli

TL;DR: A regularized risk minimization procedure for regression function estimation is introduced that achieves near optimal accuracy and confidence under general conditions, including heavy-tailed predictor and response variables.

Abstract: We introduce a regularized risk minimization procedure for regression function estimation. The procedure is based on median-of-means tournaments, introduced by the authors in Lugosi and Mendelson (2018) and achieves near optimal accuracy and confidence under general conditions, including heavy-tailed predictor and response variables. It outperforms standard regularized empirical risk minimization procedures such as LASSO or SLOPE in heavy-tailed problems.

Journal Article•DOI•

Accuracy First: Selecting a Differential Privacy Level for Accuracy-Constrained ERM

Steven C. Wu¹, Aaron Roth², Katrina Ligett, Bo Waggoner³, Seth Neel²•Institutions (3)

University of Minnesota¹, University of Pennsylvania², Microsoft³

27 Sep 2019

TL;DR: In this article, the authors propose a general noise reduction framework that can apply to a variety of private empirical risk minimization (ERM) algorithms, using them to search the space of privacy levels for the empirically strongest one that meets the accuracy constraint, while incurring only logarithmic overhead in the number of privacy levels searched.

Abstract: Traditional approaches to differential privacy assume a fixed privacy requirement ε for a computation, and attempt to maximize the accuracy of the computation subject to the privacy constraint. As differential privacy is increasingly deployed in practical settings, it may often be that there is instead a fixed accuracy requirement for a given computation and the data analyst would like to maximize the privacy of the computation subject to the accuracy constraint. This raises the question of how to find and run a maximally private empirical risk minimizer subject to a given accuracy requirement. We propose a general “noise reduction” framework that can apply to a variety of private empirical risk minimization (ERM) algorithms, using them to “search” the space of privacy levels to find the empirically strongest one that meets the accuracy constraint, and incurring only logarithmic overhead in the number of privacy levels searched. The privacy analysis of our algorithm leads naturally to a version of differential privacy where the privacy parameters are dependent on the data, which we term ex-post privacy, and which is related to the recently introduced notion of privacy odometers. We also give an ex-post privacy analysis of the classical AboveThreshold privacy tool, modifying it to allow for queries chosen depending on the database. Finally, we apply our approach to two common objective functions, regularized linear and logistic regression, and empirically compare our noise reduction methods to (i) inverting the theoretical utility guarantees of standard private ERM algorithms and (ii) a stronger, empirical baseline based on binary search.

Proceedings Article•

Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance

Ulysse Marteau-Ferey¹, Dmitrii Ostrovskii¹, Francis Bach¹, Alessandro Rudi¹•Institutions (1)

University of Paris¹

17 Jun 2019

TL;DR: In this article, the authors consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels.

Abstract: We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. In order to go beyond the generic analysis leading to convergence rates of the excess risk as $O(1/\sqrt{n})$ from $n$ observations, we assume that the individual losses are self-concordant, that is, their third-order derivatives are bounded by their second-order derivatives. This setting includes least-squares, as well as all generalized linear models such as logistic and softmax regression. For this class of losses, we provide a bias-variance decomposition and show that the assumptions commonly made in least-squares regression, such as the source and capacity conditions, can be adapted to obtain fast non-asymptotic rates of convergence by improving the bias terms, the variance terms or both.

Posted Content•

Nonasymptotic estimates for Stochastic Gradient Langevin Dynamics under local conditions in nonconvex optimization

Ying Zhang¹, Ömer Deniz Akyildiz², Theo Damoulas², Sotirios Sabanis³•Institutions (3)

University of Edinburgh¹, University of Warwick², The Turing Institute³

04 Oct 2019-arXiv: Statistics Theory

TL;DR: Non-asymptotic error bounds are obtained for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD), and examples from minibatch logistic regression and from variational inference are given by providing theoretical guarantees for the sampling behaviour of the algorithm.

Abstract: Within the context of empirical risk minimization, see Raginsky, Rakhlin, and Telgarsky (2017), we are concerned with a non-asymptotic analysis of sampling algorithms used in optimization. In particular, we obtain non-asymptotic error bounds for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD). These results are derived in Wasserstein-1 and Wasserstein-2 distances in the absence of log-concavity of the target distribution. More precisely, the stochastic gradient $H(\theta, x)$ is assumed to be locally Lipschitz continuous in both variables, and furthermore, the dissipativity condition is relaxed by removing its uniform dependence in $x$. This relaxation allows us to present two key paradigms within the framework of scalable posterior sampling for Bayesian inference and of nonconvex optimization; namely, examples from minibatch logistic regression and from variational inference are given by providing theoretical guarantees for the sampling behaviour of the algorithm.
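
For readers unfamiliar with SGLD, the standard recursion is $\theta_{k+1} = \theta_k - \lambda H(\theta_k, X_{k+1}) + \sqrt{2\lambda/\beta}\,\xi_{k+1}$ with standard Gaussian noise $\xi_{k+1}$. The symbols $\lambda$ (step size) and $\beta$ (inverse temperature) are notation assumed here, since the abstract above only names $H(\theta, x)$. The minimal sketch below runs this recursion on a hypothetical one-dimensional quadratic potential, so the target is Gaussian and therefore log-concave, unlike the setting the paper analyzes; it illustrates the update only, not the error bounds.

```python
import math
import random
import statistics


def sgld(data, step=0.01, beta=100.0, iters=20000, burn_in=2000, seed=0):
    """Run SGLD on U(theta) = mean_i 0.5 * (theta - x_i)**2 and return post-burn-in iterates."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step / beta)
    theta = 0.0
    samples = []
    for k in range(iters):
        x = rng.choice(data)   # one data point per step (stochastic gradient)
        grad = theta - x       # H(theta, x) for the quadratic potential
        theta = theta - step * grad + noise_scale * rng.gauss(0.0, 1.0)
        if k >= burn_in:
            samples.append(theta)
    return samples


# Hypothetical data set; the target is proportional to exp(-beta * U(theta)),
# a Gaussian centred at the data mean 1.0.
data = [0.0, 0.5, 1.0, 1.5, 2.0]
samples = sgld(data)
posterior_mean = statistics.fmean(samples)
```

The average of the retained iterates should sit close to the data mean, which is the centre of the target distribution in this toy case.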

Posted Content•

Decentralized Stochastic Gradient Tracking for Non-convex Empirical Risk Minimization

Jiaqi Zhang, Keyou You

06 Sep 2019-arXiv: Learning

TL;DR: This paper studies a decentralized stochastic gradient tracking (DSGT) algorithm for non-convex empirical risk minimization problems over a peer-to-peer network of nodes, which is in sharp contrast to the existing DSGT only for convex problems.

Abstract: This paper studies a decentralized stochastic gradient tracking (DSGT) algorithm for non-convex empirical risk minimization problems over a peer-to-peer network of nodes, which is in sharp contrast to the existing DSGT only for convex problems. To ensure exact convergence and handle the variance among decentralized datasets, each node performs a stochastic gradient (SG) tracking step by using a mini-batch of samples, where the batch size is designed to be proportional to the size of the local dataset. We explicitly evaluate the convergence rate of DSGT with respect to the number of iterations in terms of algebraic connectivity of the network, mini-batch size, gradient variance, etc. Under certain conditions, we further show that DSGT has a network independence property in the sense that the network topology only affects the convergence rate up to a constant factor. Hence, the convergence rate of DSGT can be comparable to the centralized SGD method. Moreover, a linear speedup of DSGT with respect to the number of nodes is achievable for some scenarios. Numerical experiments for neural networks and logistic regression problems on CIFAR-10 finally illustrate the advantages of DSGT.
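
The abstract above does not give the exact DSGT recursion, so the following is a minimal sketch of one common form of gradient tracking rather than the paper's algorithm. It departs from the paper in two labeled ways: each node uses its exact local gradient instead of a mini-batch stochastic gradient, and the local objectives are hypothetical convex quadratics 0.5 * (x - c_i)^2 chosen so the consensus optimum is known. The mixing matrix `weights` is an assumed doubly stochastic matrix for a three-node network.

```python
def gradient_tracking(centers, weights, step=0.1, iters=500):
    """Gradient tracking sketch: node i holds f_i(x) = 0.5 * (x - centers[i])**2."""
    n = len(centers)

    def grad(i, x):
        return x - centers[i]

    x = [0.0] * n                               # local iterates
    g = [grad(i, x[i]) for i in range(n)]       # last local gradients
    y = list(g)                                 # trackers of the network-average gradient
    for _ in range(iters):
        # Mix neighbours' iterates, then step along the tracked gradient.
        x_new = [
            sum(weights[i][j] * x[j] for j in range(n)) - step * y[i]
            for i in range(n)
        ]
        g_new = [grad(i, x_new[i]) for i in range(n)]
        # Mix neighbours' trackers and add the change in the local gradient.
        y = [
            sum(weights[i][j] * y[j] for j in range(n)) + g_new[i] - g[i]
            for i in range(n)
        ]
        x, g = x_new, g_new
    return x


# Hypothetical three-node network with a doubly stochastic mixing matrix.
weights = [
    [0.50, 0.25, 0.25],
    [0.25, 0.50, 0.25],
    [0.25, 0.25, 0.50],
]
centers = [1.0, 2.0, 6.0]
local_iterates = gradient_tracking(centers, weights)
consensus_optimum = sum(centers) / len(centers)  # minimizer of the average objective
```

In this toy run every node's iterate approaches the minimizer of the average objective, even though each node only ever evaluates its own local gradient.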

Proceedings Article•

Nonconvex Variance Reduced Optimization with Arbitrary Sampling

Samuel Horváth¹, Peter Richtárik¹•Institutions (1)

King Abdullah University of Science and Technology¹

24 May 2019

TL;DR: Surprisingly, this approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization.

Abstract: We provide the first importance sampling variants of variance reduced algorithms for empirical risk minimization with non-convex loss functions. In particular, we analyze non-convex versions of SVRG, SAGA and SARAH. Our methods have the capacity to speed up the training process by an order of magnitude compared to the state of the art on real datasets. Moreover, we also improve upon current mini-batch analysis of these methods by proposing importance sampling for minibatches in this setting. Surprisingly, our approach can in some regimes lead to superlinear speedup with respect to the minibatch size, which is not usually present in stochastic optimization. All the above results follow from a general analysis of the methods which works with arbitrary sampling, i.e., fully general randomized strategy for the selection of subsets of examples to be sampled in each iteration. Finally, we also perform a novel importance sampling analysis of SARAH in the convex setting.

Proceedings Article•

Mitigating Overfitting in Supervised Classification from Two Unlabeled Datasets: A Consistent Risk Correction Approach

Nan Lu¹, Tianyi Zhang¹, Gang Niu², Masashi Sugiyama¹•Institutions (2)

University of Tokyo¹, Nanyang Technological University²

20 Oct 2019

TL;DR: In this paper, the authors propose to wrap the terms that cause a negative empirical risk in certain correction functions, prove the consistency of the corrected risk estimator, and derive an estimation error bound for the corrected risk minimizer.

Abstract: The recently proposed unlabeled-unlabeled (UU) classification method allows us to train a binary classifier only from two unlabeled datasets with different class priors. Since this method is based on the empirical risk minimization, it works as if it is a supervised classification method, compatible with any model and optimizer. However, this method sometimes suffers from severe overfitting, which we would like to prevent in this paper. Our empirical finding in applying the original UU method is that overfitting often co-occurs with the empirical risk going negative, which is not legitimate. Therefore, we propose to wrap the terms that cause a negative empirical risk by certain correction functions. Then, we prove the consistency of the corrected risk estimator and derive an estimation error bound for the corrected risk minimizer. Experiments show that our proposal can successfully mitigate overfitting of the UU method and significantly improve the classification accuracy.

Proceedings Article•

Classification from Positive, Unlabeled and Biased Negative Data

Yu-Guan Hsieh¹, Gang Niu², Masashi Sugiyama³•Institutions (3)

École Normale Supérieure¹, Nanyang Technological University², University of Tokyo³

24 May 2019

TL;DR: In this article, an example-weighting algorithm based on empirical risk minimization is proposed to address the PUbN classification problem, with the weight of each example computed through a preliminary step that draws inspiration from PU learning.

Abstract: In binary classification, there are situations where negative (N) data are too diverse to be fully labeled and we often resort to positive-unlabeled (PU) learning in these scenarios. However, collecting a non-representative N set that contains only a small portion of all possible N data can often be much easier in practice. This paper studies a novel classification framework which incorporates such biased N (bN) data in PU learning. We provide a method based on empirical risk minimization to address this PUbN classification problem. Our approach can be regarded as a novel example-weighting algorithm, with the weight of each example computed through a preliminary step that draws inspiration from PU learning. We also derive an estimation error bound for the proposed method. Experimental results demonstrate the effectiveness of our algorithm in not only PUbN learning scenarios but also ordinary PU learning scenarios on several benchmark datasets.

Journal Article•DOI•

Multi-focus image fusion using deep support value convolutional neural network

Chao-ben Du¹, Shesheng Gao¹, Ying Liu², BingBing Gao¹•Institutions (2)

Northwestern Polytechnical University¹, Chinese Ministry of Public Security²

01 Jan 2019-Optik

TL;DR: The experimental results demonstrate that the suggested DSVCNN-based method is competitive with current state-of-the-art approaches and superior to those that use traditional CNN methods.

Journal Article•DOI•

Differentially Private Confidence Intervals for Empirical Risk Minimization

Yue Wang¹, Daniel Kifer¹, Jaewoo Lee²•Institutions (2)

Pennsylvania State University¹, University of Georgia²

30 Mar 2019

TL;DR: These algorithms can provide confidence intervals that satisfy differential privacy (as well as the more recently proposed concentrated differential privacy) and can be used with existing differentially private mechanisms that train models using objective perturbation and output perturbation.

Abstract: The process of data mining with differential privacy produces results that are affected by two types of noise: sampling noise due to data collection and privacy noise that is designed to prevent the reconstruction of sensitive information. In this paper, we consider the problem of designing confidence intervals for the parameters of a variety of differentially private machine learning models. The algorithms can provide confidence intervals that satisfy differential privacy (as well as the more recently proposed concentrated differential privacy) and can be used with existing differentially private mechanisms that train models using objective perturbation and output perturbation.

Proceedings Article•DOI•

Improved Sparse Pinball Twin SVM

Muhammad Tanveer¹, T. Rajani¹, M. A. Ganaie¹•Institutions (1)

Indian Institute of Technology Indore¹

01 Oct 2019

TL;DR: Results computed on multiple UCI benchmark datasets clearly indicate the effectiveness and applicability of the proposed ISPTSVM compared to pinball support vector machine (Pin-SVM), twin bounded support vector machine (TBSVM) and SPTSVM.

Abstract: In this paper, we propose an improved version of the sparse pinball twin support vector machine (SPTSVM) [1], called the improved sparse pinball twin support vector machine (ISPTSVM). SPTSVM implements the empirical risk minimization principle, and the matrices appearing in the formulation of SPTSVM are positive semi-definite. Here, we reformulate the primal problems of SPTSVM by introducing an extra regularization term into the objective function of SPTSVM. Unlike SPTSVM, the proposed ISPTSVM implements the structural risk minimization (SRM) principle, which embodies the marrow of statistical learning theory. Also, the matrices that appear in the dual formulation of the proposed ISPTSVM are positive definite. Results computed on multiple UCI benchmark datasets clearly indicate the effectiveness and applicability of the proposed ISPTSVM compared to pinball support vector machine (Pin-SVM), twin bounded support vector machine (TBSVM) and SPTSVM.
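
A minimal sketch of the loss behind the "pinball" name may help here. The abstract does not define it, so the form below is the standard pinball loss from the pin-SVM literature, taken as an assumption. The names `pinball_loss`, `hinge_loss`, and `tau` are illustrative and not from the paper, and the sparse and twin-hyperplane parts of SPTSVM and ISPTSVM are not reproduced.

```python
import numpy as np

def pinball_loss(u, tau=0.5):
    """Pinball loss on u = 1 - y * f(x): u when u >= 0, -tau * u otherwise.

    Assumed textbook form; tau in [0, 1] controls the penalty on points
    that lie beyond the margin on the correct side.
    """
    u = np.asarray(u, dtype=float)
    return np.where(u >= 0.0, u, -tau * u)

def hinge_loss(u):
    """Hinge loss on the same quantity u = 1 - y * f(x)."""
    u = np.asarray(u, dtype=float)
    return np.maximum(u, 0.0)

# Three illustrative values of u: well classified, on the margin, violating.
u = np.array([-2.0, 0.0, 1.5])
print(pinball_loss(u, tau=0.5))
print(hinge_loss(u))
```

Under this assumed definition, setting `tau=0` recovers the hinge loss, while a positive `tau` also penalizes points far on the correct side of the margin.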

Journal Article•DOI•

Learning in Wireless Control Systems Over Nonstationary Channels

Mark Eisen¹, Konstantinos Gatsis¹, George J. Pappas¹, Alejandro Ribeiro¹•Institutions (1)

University of Pennsylvania¹

01 Mar 2019-IEEE Transactions on Signal Processing

TL;DR: This paper considers multiple independent control systems, each connected over a nonstationary wireless channel, and uses Newton's method to learn transmit power allocations within a fixed budget that maximize control performance across all systems.

Abstract: This paper considers a set of multiple independent control systems that are each connected over a nonstationary wireless channel. The goal is to maximize control performance over all the systems through the allocation of transmitting power within a fixed budget. This can be formulated as a constrained optimization problem examined using Lagrangian duality. By taking samples of the unknown wireless channel at every time instance, the resulting problem takes on the form of empirical risk minimization, a well-studied problem in machine learning. Due to the nonstationarity of wireless channels, optimal allocations must be continuously learned and updated as the channel evolves. The quadratic convergence property of Newton's method motivates its use in learning approximately optimal power allocation policies over the sampled dual function as the channel evolves over time. Conditions are established under which Newton's method learns approximate solutions with a single update, and the subsequent suboptimality of the control problem is further characterized. Numerical simulations illustrate the near-optimal performance of the method and resulting stability on a wireless control problem.
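
The quadratic convergence mentioned above can be illustrated with a small sketch. It runs Newton's method on a toy strongly convex function, f(x) = sum_i (exp(x_i) - c_i * x_i), whose minimizer is log(c). The function, the vector `c`, and the helper `newton_minimize` are assumptions chosen for illustration; this is not the sampled dual function or the power allocation problem from the paper.

```python
import numpy as np

def newton_minimize(grad, hess, x0, num_steps):
    """Run plain Newton steps x <- x - H(x)^{-1} g(x) and return all iterates."""
    x = np.asarray(x0, dtype=float)
    iterates = [x.copy()]
    for _ in range(num_steps):
        # Solve H(x) d = g(x) instead of forming the inverse explicitly.
        x = x - np.linalg.solve(hess(x), grad(x))
        iterates.append(x.copy())
    return iterates

# Toy objective f(x) = sum(exp(x_i) - c_i * x_i), minimized at x = log(c).
c = np.array([2.0, 3.0])
grad = lambda x: np.exp(x) - c
hess = lambda x: np.diag(np.exp(x))

iterates = newton_minimize(grad, hess, np.zeros(2), num_steps=7)
errors = [np.linalg.norm(x - np.log(c)) for x in iterates]
for k, err in enumerate(errors):
    print(k, err)
```

Once the iterates are close to the minimizer, the printed error roughly squares at each step, which is the behavior that makes a single update per channel sample plausible when the problem changes slowly.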

Posted Content•

Risk bounds for reservoir computing

Lukas Gonon¹, Lyudmila Grigoryeva², Juan-Pablo Ortega¹•Institutions (2)

University of St. Gallen¹, University of Konstanz²

30 Oct 2019-arXiv: Learning

TL;DR: Finite sample upper bounds are derived for the generalization error committed by specific families of reservoir computing systems when processing discrete-time inputs under various hypotheses on their dependence structure in the framework of statistical learning theory.

Abstract: We analyze the practices of reservoir computing in the framework of statistical learning theory. In particular, we derive finite sample upper bounds for the generalization error committed by specific families of reservoir computing systems when processing discrete-time inputs under various hypotheses on their dependence structure. Non-asymptotic bounds are explicitly written down in terms of the multivariate Rademacher complexities of the reservoir systems and the weak dependence structure of the signals that are being handled. This makes it possible, in particular, to determine the minimal number of observations needed in order to guarantee a prescribed estimation accuracy with high probability for a given reservoir family. At the same time, the asymptotic behavior of the devised bounds guarantees the consistency of the empirical risk minimization procedure for various hypothesis classes of reservoir functionals.
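
For readers unfamiliar with the complexity measure named above, here is a hedged sketch of the textbook scalar empirical Rademacher complexity of a finite function class, estimated by Monte Carlo. It assumes i.i.d.-style evaluation on a fixed sample and does not reproduce the multivariate, weakly dependent setting analyzed in the paper. The function name and its parameters are illustrative.

```python
import numpy as np

def empirical_rademacher(values, num_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ max_f (1/n) * sum_i sigma_i * f(x_i) ].

    values has shape (num_functions, n), with values[j, i] = f_j(x_i).
    """
    values = np.asarray(values, dtype=float)
    n = values.shape[1]
    rng = np.random.default_rng(seed)
    # Each row of sigma is one draw of n independent random signs.
    sigma = rng.choice([-1.0, 1.0], size=(num_draws, n))
    correlations = sigma @ values.T / n  # shape (num_draws, num_functions)
    return correlations.max(axis=1).mean()

# A class that realizes every sign pattern on two points can always match
# the random signs, so its complexity on this sample is maximal.
all_patterns = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
print(empirical_rademacher(all_patterns))
print(empirical_rademacher([[0.0, 0.0]]))
```

A richer class correlates better with random signs and so gets a larger value, which is why bounds of this type shrink for simpler hypothesis classes.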

Journal Article•DOI•

Online Learning from Data Streams with Varying Feature Spaces

Ege Beyazit¹, Jeevithan Alagurajah¹, Xindong Wu¹•Institutions (1)

University of Louisiana at Lafayette¹

17 Jul 2019

TL;DR: A novel online learning algorithm, OLVF, is proposed to learn from data with arbitrarily varying feature spaces. It classifies the feature spaces and the instances from those feature spaces simultaneously, and applies a feature sparsity method to reduce model complexity.

Abstract: We study the problem of online learning with varying feature spaces. The problem is challenging because, unlike traditional online learning problems, varying feature spaces can introduce new features or stop having some features without following a pattern. Other existing methods such as online streaming feature selection (Wu et al. 2013), online learning from trapezoidal data streams (Zhang et al. 2016), and learning with feature evolvable streams (Hou, Zhang, and Zhou 2017) are not capable of learning from arbitrarily varying feature spaces because they make assumptions about the feature space dynamics. In this paper, we propose a novel online learning algorithm OLVF to learn from data with arbitrarily varying feature spaces. The OLVF algorithm learns to classify the feature spaces and the instances from feature spaces simultaneously. To classify an instance, the algorithm dynamically projects the instance classifier and the training instance onto their shared feature subspace. The feature space classifier predicts the projection confidences for a given feature space. The instance classifier will be updated by following the empirical risk minimization principle and the strength of the constraints will be scaled by the projection confidences. Afterwards, a feature sparsity method is applied to reduce the model complexity. Experiments on 10 datasets with varying feature spaces have been conducted to demonstrate the performance of the proposed OLVF algorithm. Moreover, experiments with trapezoidal data streams on the same datasets have been conducted to show that OLVF performs better than the state-of-the-art learning algorithm (Zhang et al. 2016).
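
The projection step described above can be sketched as follows, assuming classifier weights and instances are stored as feature-name-to-value dictionaries. This representation and the helper names `project_to_shared` and `predict_score` are assumptions for illustration; the projection confidences, the update rule, and the sparsity step of OLVF are not reproduced.

```python
def project_to_shared(weights, instance):
    """Restrict a weight dict and an instance dict to the features both contain."""
    shared = weights.keys() & instance.keys()
    w = {name: weights[name] for name in shared}
    x = {name: instance[name] for name in shared}
    return w, x

def predict_score(weights, instance):
    """Linear score computed only on the shared feature subspace."""
    w, x = project_to_shared(weights, instance)
    return sum(w[name] * x[name] for name in w)

# The classifier has seen features a, b, c; the new instance carries b, c, d.
weights = {"a": 1.0, "b": -2.0, "c": 0.5}
instance = {"b": 1.0, "c": 6.0, "d": 7.0}
score = predict_score(weights, instance)
print(score, "positive" if score > 0 else "negative")
```

Features seen only by the classifier (`a`) or only by the instance (`d`) do not contribute to the score in this sketch, which is one simple way to handle a feature space that changes from one instance to the next.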

Posted Content•

Fast rates for empirical risk minimization over càdlàg functions with bounded sectional variation norm

Aurélien Bibaut, Mark J. van der Laan

22 Jul 2019-arXiv: Statistics Theory

TL;DR: It is shown that in the case of nonparametric regression over sieves of càdlàg functions with bounded sectional variation norm, this upper bound on the rate of convergence holds for least-squares estimators, under the random design, sub-exponential errors setting.

Abstract: Empirical risk minimization over classes of functions that are bounded for some version of the variation norm has a long history, starting with Total Variation Denoising (Rudin et al., 1992), and has been considered by several recent articles, in particular Fang et al., 2019 and van der Laan, 2015. In this article, we consider empirical risk minimization over the class $\mathcal{F}_d$ of càdlàg functions over $[0,1]^d$ with bounded sectional variation norm (also called Hardy-Krause variation). We show how a certain representation of functions in $\mathcal{F}_d$ allows us to bound the bracketing entropy of sieves of $\mathcal{F}_d$, and therefore derive rates of convergence in nonparametric function estimation. Specifically, for sieves whose growth is controlled by some rate $a_n$, we show that the empirical risk minimizer has rate of convergence $O_P(n^{-1/3} (\log n)^{2(d-1)/3} a_n)$. Remarkably, the dimension only affects the rate in $n$ through the logarithmic factor, making this method especially appropriate for high dimensional problems. In particular, we show that in the case of nonparametric regression over sieves of càdlàg functions with bounded sectional variation norm, this upper bound on the rate of convergence holds for least-squares estimators, under the random design, sub-exponential errors setting.
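
To make the role of the dimension concrete, the sketch below evaluates the stated rate expression numerically for a few values of $n$ and $d$. It ignores the constant hidden in $O_P$ and takes $a_n = 1$ by default, both of which are simplifying assumptions; the function name `sieve_rate` is illustrative.

```python
import math

def sieve_rate(n, d, a_n=1.0):
    """Evaluate n^(-1/3) * (log n)^(2(d-1)/3) * a_n, with no leading constant."""
    return n ** (-1.0 / 3.0) * math.log(n) ** (2.0 * (d - 1) / 3.0) * a_n

# The power of n is -1/3 for every d; only the logarithmic factor grows with d.
for d in (1, 2, 5):
    print(d, [round(sieve_rate(n, d), 4) for n in (10**3, 10**6, 10**9)])
```

The exponent on $n$ stays at $-1/3$ in every row, although the logarithmic factor can still be sizable at moderate sample sizes when $d$ is large.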

Proceedings Article•

A Tight Excess Risk Bound via a Unified PAC-Bayesian-Rademacher-Shtarkov-MDL Complexity

Peter Grünwald¹, Nishant A. Mehta²•Institutions (2)

Centrum Wiskunde & Informatica¹, University of Victoria²

01 Mar 2019

TL;DR: These results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.

Abstract: We present a novel notion of complexity that interpolates between and generalizes some classic existing complexity notions in learning theory: for estimators like empirical risk minimization (ERM) with arbitrary bounded losses, it is upper bounded in terms of data-independent Rademacher complexity; for generalized Bayesian estimators, it is upper bounded by the data-dependent information complexity (also known as stochastic or PAC-Bayesian, KL(posterior∥prior), complexity). For (penalized) ERM, the new complexity reduces to (generalized) normalized maximum likelihood (NML) complexity, i.e. a minimax log-loss individual-sequence regret. Our first main result bounds excess risk in terms of the new complexity. Our second main result links the new complexity via Rademacher complexity to L2(P) entropy, thereby generalizing earlier results of Opper, Haussler, Lugosi, and Cesa-Bianchi who did the log-loss case with L∞. Together, these results recover optimal bounds for VC- and large (polynomial entropy) classes, replacing localized Rademacher complexity by a simpler analysis which almost completely separates the two aspects that determine the achievable rates: 'easiness' (Bernstein) conditions and model complexity.

Posted Content•

From low probability to high confidence in stochastic convex optimization

Damek Davis¹, Dmitriy Drusvyatskiy², Lin Xiao³, Junyu Zhang⁴•Institutions (4)

Cornell University¹, University of Washington², Facebook³, University of Minnesota⁴

31 Jul 2019-arXiv: Optimization and Control

TL;DR: This work shows that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number.

Abstract: Standard results in stochastic convex optimization bound the number of samples that an algorithm needs to generate a point with small function value in expectation. More nuanced high probability guarantees are rare, and typically either rely on "light-tail" noise assumptions or exhibit worse sample complexity. In this work, we show that a wide class of stochastic optimization algorithms for strongly convex problems can be augmented with high confidence bounds at an overhead cost that is only logarithmic in the confidence level and polylogarithmic in the condition number. The procedure we propose, called proxBoost, is elementary and builds on two well-known ingredients: robust distance estimation and the proximal point method. We discuss consequences for both streaming (online) algorithms and offline algorithms based on empirical risk minimization.
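
One of the two ingredients named above, robust distance estimation, can be sketched in a few lines. The version below is an assumption based on the classical recipe: run a low-confidence algorithm several times, then keep the candidate whose smallest ball containing more than half of all candidates has the smallest radius. The function name is illustrative, the candidates are made-up numbers, and the combination with the proximal point method that defines proxBoost is not reproduced.

```python
import numpy as np

def robust_distance_estimate(candidates):
    """Pick the candidate whose majority-covering ball has the smallest radius."""
    pts = np.asarray(candidates, dtype=float)
    m = pts.shape[0]
    # Pairwise Euclidean distances, shape (m, m); the diagonal is zero.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    # Sorted row i starts with the zero self-distance, so index m // 2 is the
    # radius of the smallest ball around candidate i that holds more than
    # m / 2 of the candidates (itself included).
    radii = np.sort(dists, axis=1)[:, m // 2]
    return pts[np.argmin(radii)]

# Four runs landed near the origin and one run failed badly.
candidates = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [100.0, 100.0]]
print(robust_distance_estimate(candidates))
```

If more than half of the runs land close to the true minimizer, the selected point is also close to it, so a guarantee that holds with constant probability per run can be turned into a high-confidence one with a modest number of repetitions.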
