
Showing papers on "Empirical risk minimization published in 2020"


Posted Content
TL;DR: This paper implements DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria, and finds that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets.
Abstract: The goal of domain generalization algorithms is to predict well on distributions different from those seen during training. While a myriad of domain generalization algorithms exist, inconsistencies in experimental conditions -- datasets, architectures, and model selection criteria -- render fair and realistic comparisons difficult. In this paper, we are interested in understanding how useful domain generalization algorithms are in realistic settings. As a first step, we realize that model selection is non-trivial for domain generalization tasks. Contrary to prior work, we argue that domain generalization algorithms without a model selection strategy should be regarded as incomplete. Next, we implement DomainBed, a testbed for domain generalization including seven multi-domain datasets, nine baseline algorithms, and three model selection criteria. We conduct extensive experiments using DomainBed and find that, when carefully implemented, empirical risk minimization shows state-of-the-art performance across all datasets. Looking forward, we hope that the release of DomainBed, along with contributions from fellow researchers, will streamline reproducible and rigorous research in domain generalization.

492 citations


Journal ArticleDOI
28 Jul 2020
TL;DR: The development of new classification and regression algorithms based on empirical risk minimization (ERM) over deep neural network hypothesis classes, coined deep learning, has revolutionized machine learning.
Abstract: The development of new classification and regression algorithms based on empirical risk minimization (ERM) over deep neural network hypothesis classes, coined deep learning, revolutionized the area...

190 citations


Posted Content
TL;DR: In this setting, the first analysis of classification under the IRM objective is presented, and it is found that IRM and its alternatives fundamentally do not improve over standard Empirical Risk Minimization.
Abstract: Invariant Causal Prediction (Peters et al., 2016) is a technique for out-of-distribution generalization which assumes that some aspects of the data distribution vary across the training set but that the underlying causal mechanisms remain constant. Recently, Arjovsky et al. (2019) proposed Invariant Risk Minimization (IRM), an objective based on this idea for learning deep, invariant features of data which are a complex function of latent variables; many alternatives have subsequently been suggested. However, formal guarantees for all of these works are severely lacking. In this paper, we present the first analysis of classification under the IRM objective--as well as these recently proposed alternatives--under a fairly natural and general model. In the linear case, we show simple conditions under which the optimal solution succeeds or, more often, fails to recover the optimal invariant predictor. We furthermore present the very first results in the non-linear regime: we demonstrate that IRM can fail catastrophically unless the test data are sufficiently similar to the training distribution--this is precisely the issue that it was intended to solve. Thus, in this setting we find that IRM and its alternatives fundamentally do not improve over standard Empirical Risk Minimization.
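
For readers unfamiliar with the objective analyzed here, a minimal sketch of the IRMv1 penalty from Arjovsky et al. (2019) may help. This is our own generic PyTorch illustration, not code from the paper; the function name `irmv1_penalty` is ours.

```python
import torch
import torch.nn.functional as F

def irmv1_penalty(logits, y):
    # IRMv1 (Arjovsky et al., 2019): penalize the squared gradient of the
    # risk w.r.t. a frozen scalar "dummy" classifier evaluated at 1.0.
    scale = torch.tensor(1.0, requires_grad=True)
    loss = F.cross_entropy(logits * scale, y)
    grad, = torch.autograd.grad(loss, [scale], create_graph=True)
    return grad.pow(2).sum()

# Training objective across environments e (sketch):
#   sum_e [ risk_e + lambda * irmv1_penalty(logits_e, y_e) ]
```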

146 citations


Journal ArticleDOI
TL;DR: In this paper, the authors examine the theoretical properties of enforcing priors provided by generative deep neural networks via empirical risk minimization, and show that, with high probability, there is a descent direction at any point away from small neighborhoods around two scalar multiples of the desired solution, so no local minima, saddle points, or other stationary points exist outside these neighborhoods.
Abstract: We examine the theoretical properties of enforcing priors provided by generative deep neural networks via empirical risk minimization. In particular we consider two models, one in which the task is to invert a generative neural network given access to its last layer and another in which the task is to invert a generative neural network given only compressive linear observations of its last layer. We establish that in both cases, in suitable regimes of network layer sizes and a randomness assumption on the network weights, that the non-convex objective function given by empirical risk minimization does not have any spurious stationary points. That is, we establish that with high probability, at any point away from small neighborhoods around two scalar multiples of the desired solution, there is a descent direction. Hence, there are no local minima, saddle points, or other stationary points outside these neighborhoods. These results constitute the first theoretical guarantees which establish the favorable global geometry of these non-convex optimization problems, and they bridge the gap between the empirical success of enforcing deep generative priors and a rigorous understanding of non-linear inverse problems.

101 citations


Posted Content
TL;DR: A striking result is that the widely-used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative; this contrasts with inferential settings, where mean imputation is faulted for distorting the distribution of the data.
Abstract: In many application settings, the data have missing entries, which makes analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely-used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is faulted for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we further analyze decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing theoretically and empirically different missing values strategies in trees, we recommend using the "missing incorporated in attribute" method as it can handle both non-informative and informative missing values.
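
As a concrete illustration of the constant-imputation strategy the paper proves consistent, here is a minimal sketch (ours, not the authors' code; it assumes missing entries are encoded as NaN):

```python
import numpy as np

def mean_impute(X_train, X_test):
    # Impute each feature's missing entries with its *training-set* mean,
    # then learn on the completed data; consistent when missingness is
    # not informative, per the paper's result.
    col_means = np.nanmean(X_train, axis=0)
    X_train = np.where(np.isnan(X_train), col_means, X_train)
    X_test = np.where(np.isnan(X_test), col_means, X_test)
    return X_train, X_test
```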

81 citations


Proceedings Article
01 Jan 2020
TL;DR: This paper is the first to solve distributionally robust federated learning with reduced communication, and to analyze the efficiency of local descent methods on distributed minimax problems.
Abstract: In this paper, we study communication efficient distributed algorithms for distributionally robust federated learning via periodic averaging with adaptive sampling. In contrast to standard empirical risk minimization, due to the minimax structure of the underlying optimization problem, a key difficulty arises from the fact that the global parameter that controls the mixture of local losses can only be updated infrequently on the global stage. To compensate for this, we propose a Distributionally Robust Federated Averaging (DRFA) algorithm that employs a novel snapshotting scheme to approximate the accumulated history of gradients of the mixing parameter. We analyze the convergence rate of DRFA in both convex-linear and nonconvex-linear settings. We also generalize the proposed idea to objectives with regularization on the mixture parameter and propose a proximal variant, dubbed DRFA-Prox, with provable convergence rates. We also analyze an alternative optimization method for regularized cases in strongly-convex-strongly-concave and non-convex (under the PL condition)-strongly-concave settings. To the best of our knowledge, this paper is the first to solve distributionally robust federated learning with reduced communication, and to analyze the efficiency of local descent methods on distributed minimax problems. We give corroborating experimental evidence for our theoretical results in federated learning settings.
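
The minimax structure can be made concrete with a deliberately simplified, single-machine sketch (our illustration only; the actual DRFA algorithm updates the mixture only at infrequent synchronization points via its snapshotting scheme):

```python
import numpy as np

def robust_fl_round(w, lam, client_grads, client_losses, lr_w, lr_lam):
    # min over model w, max over mixture lam on the simplex:
    # descend on the lam-weighted loss, ascend lam toward high-loss clients.
    g = sum(l * gi for l, gi in zip(lam, client_grads))
    w = w - lr_w * g
    lam = lam * np.exp(lr_lam * np.asarray(client_losses))  # exponentiated ascent
    return w, lam / lam.sum()                               # back onto the simplex
```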

73 citations


Proceedings Article
24 Feb 2020
TL;DR: It is observed that model predictions can substantially benefit the training process: self-adaptive training significantly improves generalization over ERM under various levels of noise and mitigates the overfitting issue in both natural and adversarial training.
Abstract: We propose self-adaptive training---a new training algorithm that dynamically corrects problematic training labels using model predictions, without incurring extra computational cost---to improve the generalization of deep learning on potentially corrupted training data. This problem is crucial for robustly learning from data corrupted by, e.g., label noise and out-of-distribution samples. Standard empirical risk minimization (ERM) on such data, however, may easily overfit the noise and thus suffer from sub-optimal performance. In this paper, we observe that model predictions can substantially benefit the training process: self-adaptive training significantly improves generalization over ERM under various levels of noise and mitigates the overfitting issue in both natural and adversarial training. We evaluate the error-capacity curve of self-adaptive training: the test error is monotonically decreasing w.r.t. model capacity. This is in sharp contrast to the recently-discovered double-descent phenomenon in ERM, which might be a result of overfitting noise. Experiments on the CIFAR and ImageNet datasets verify the effectiveness of our approach in two applications: classification with label noise and selective classification. We release our code at this https URL.
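
A minimal sketch of the label-correction idea (ours, not the released implementation; it assumes per-example soft targets stored in a tensor indexed by `idx`):

```python
import torch
import torch.nn.functional as F

def self_adaptive_loss(logits, soft_targets, idx, alpha=0.9):
    # Blend stored targets with the model's current predictions (an
    # exponential moving average), then train against the corrected targets.
    probs = F.softmax(logits, dim=1).detach()
    soft_targets[idx] = alpha * soft_targets[idx] + (1 - alpha) * probs
    return -(soft_targets[idx] * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```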

72 citations


Proceedings ArticleDOI
26 May 2020
TL;DR: Wireless protocols are proposed that implement Decentralized Stochastic Gradient Descent by accounting for the presence of path loss, fading, blockages, and mutual interference in the deployment of Federated Learning.
Abstract: Federated Learning (FL), an emerging paradigm for fast intelligence acquisition at the network edge, enables joint training of a machine learning model over distributed data sets and computing resources with limited disclosure of local data. Communication is a critical enabler of large-scale FL due to the significant amount of model information exchanged among edge devices. In this paper, we consider a network of wireless devices sharing a common fading wireless channel for the deployment of FL. Each device holds a generally distinct training set, and communication typically takes place in a Device-to-Device (D2D) manner. In the ideal case in which all devices within communication range can communicate simultaneously and noiselessly, a standard protocol that is guaranteed to converge to an optimal solution of the global empirical risk minimization problem under convexity and connectivity assumptions is Decentralized Stochastic Gradient Descent (DSGD). DSGD integrates local SGD steps with periodic consensus averages that require communication between neighboring devices. In this paper, wireless protocols are proposed that implement DSGD by accounting for the presence of path loss, fading, blockages, and mutual interference. The proposed protocols are based on graph coloring for scheduling and on both digital and analog transmission strategies at the physical layer, with the latter leveraging over-the-air computing via sparsity-based recovery.
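
Abstracting away the wireless physical layer, one DSGD iteration combines a consensus average with a local gradient step; a schematic sketch (ours) under an idealized noiseless channel:

```python
import numpy as np

def dsgd_step(params, grads, W, lr):
    # params: (n_devices, dim) stacked local models; W: doubly stochastic
    # mixing matrix encoding the communication graph.
    return W @ params - lr * grads   # consensus average + local SGD step
```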

71 citations


Journal Article
TL;DR: It turns out that, under mild conditions, function estimation consistency and convergence may be pursued in modal regression as in vanilla regression protocols; moreover, modal regression outperforms these regression models in terms of robustness, as shown in the study from a re-descending M-estimation view.
Abstract: This paper studies the nonparametric modal regression problem systematically from a statistical learning view. Originally motivated by pursuing a theoretical understanding of the maximum correntropy criterion based regression (MCCR), our study reveals that MCCR with a tending-to-zero scale parameter is essentially modal regression. We show that the nonparametric modal regression problem can be approached via classical empirical risk minimization. Some efforts are then made to develop a framework for analyzing and implementing modal regression. For instance, the modal regression function is described, the modal regression risk is defined explicitly, and its Bayes rule is characterized; for the sake of computational tractability, the surrogate modal regression risk, termed the generalization risk in our study, is introduced. On the theoretical side, the excess modal regression risk, the excess generalization risk, the function estimation error, and the relations among these three quantities are studied rigorously. It turns out that, under mild conditions, function estimation consistency and convergence may be pursued in modal regression as in vanilla regression protocols such as mean regression, median regression, and quantile regression. However, modal regression outperforms these regression models in terms of robustness, as shown in our study from a re-descending M-estimation view. This coincides with, and in return explains, the merits of MCCR on robustness. On the practical side, the implementation issues of modal regression, including the computational algorithm and tuning parameter selection, are discussed. Numerical assessments of modal regression are also conducted to verify our findings empirically.

59 citations


Proceedings Article
12 Jul 2020
TL;DR: In this paper, a peer loss function is proposed for learning from noisy labels, which does not require a priori specification of the noise rates and can be used to simplify model development when facing potentially noisy training labels.
Abstract: Learning with noisy labels is a common challenge in supervised learning. Existing approaches often require practitioners to specify noise rates, i.e., a set of parameters controlling the severity of label noise in the problem, and these specifications are either assumed to be given or estimated using additional steps. In this work, we introduce a new family of loss functions that we name peer loss functions, which enable learning from noisy labels and do not require a priori specification of the noise rates. Peer loss functions work within the standard empirical risk minimization (ERM) framework. We show that, under mild conditions, performing ERM with peer loss functions on the noisy dataset leads to the optimal or a near-optimal classifier, as if ERM were performed over the clean training data, which we do not have access to. We pair our results with an extensive set of experiments. Peer loss provides a way to simplify model development when facing potentially noisy training labels, and can be promoted as a robust candidate loss function in such situations.
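
A sketch of the peer-loss idea (ours; the paper defines peer samples by independent resampling, approximated here with random permutations within the batch):

```python
import torch
import torch.nn.functional as F

def peer_loss(logits, noisy_y, alpha=1.0):
    # Standard loss on (prediction, label) pairs, minus the loss on
    # randomly re-paired "peer" samples; needs no noise-rate estimates.
    n = noisy_y.size(0)
    i, j = torch.randperm(n), torch.randperm(n)
    return F.cross_entropy(logits, noisy_y) - alpha * F.cross_entropy(logits[i], noisy_y[j])
```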

57 citations


Proceedings Article
28 Jan 2020
TL;DR: In this article, a coin flip performed in each iteration is used to trigger, with small probability, the computation of the full gradient in SVRG and its accelerated variant (Katyusha), removing their outer loop.
Abstract: The stochastic variance-reduced gradient method (SVRG) and its accelerated variant (Katyusha) have attracted enormous attention in the machine learning community in the last few years due to their superior theoretical properties and empirical behavior on training supervised machine learning models via the empirical risk minimization paradigm. A key structural element in both of these methods is the inclusion of an outer loop at the beginning of which a full pass over the training data is made in order to compute the exact gradient, which is then used to construct a variance-reduced estimator of the gradient. In this work we design loopless variants of both of these methods. In particular, we remove the outer loop and replace its function by a coin flip performed in each iteration, designed to trigger, with a small probability, the computation of the gradient. We prove that the new methods enjoy the same superior theoretical convergence properties as the original methods. However, we demonstrate through numerical experiments that our methods have substantially superior practical behavior.
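
The loopless construction is simple enough to sketch directly. This is our illustration of the loopless SVRG idea, where `grad_i(w, i)` denotes the gradient of the i-th loss and `full_grad(w)` the exact gradient:

```python
import numpy as np

def l_svrg(grad_i, full_grad, w, n, lr=0.1, p=None, iters=1000, seed=0):
    # Loopless SVRG: the outer loop is replaced by a coin flip that, with
    # small probability p (e.g. 1/n), refreshes the reference point.
    rng = np.random.default_rng(seed)
    p = p if p is not None else 1.0 / n
    w_ref, mu = w.copy(), full_grad(w)      # reference point, exact gradient
    for _ in range(iters):
        i = rng.integers(n)
        w = w - lr * (grad_i(w, i) - grad_i(w_ref, i) + mu)  # VR estimator
        if rng.random() < p:                # coin flip triggers a full pass
            w_ref, mu = w.copy(), full_grad(w)
    return w
```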

Posted Content
TL;DR: This work shows that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt, and demonstrates that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance.
Abstract: Empirical risk minimization (ERM) is typically designed to perform well on the average loss, which can result in estimators that are sensitive to outliers, generalize poorly, or treat subgroups unfairly. While many methods aim to address these problems individually, in this work, we explore them through a unified framework---tilted empirical risk minimization (TERM). In particular, we show that it is possible to flexibly tune the impact of individual losses through a straightforward extension to ERM using a hyperparameter called the tilt. We provide several interpretations of the resulting framework: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to a superquantile method. We develop batch and stochastic first-order optimization methods for solving TERM, and show that the problem can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. TERM is not only competitive with existing solutions tailored to these individual problems, but can also enable entirely new applications, such as simultaneously addressing outliers and promoting fairness.
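
The tilted objective itself is a one-liner; a numerically stable sketch (ours):

```python
import numpy as np

def tilted_risk(losses, t):
    # TERM objective: (1/t) * log(mean(exp(t * loss))).
    # t -> 0 recovers the average loss; t > 0 emphasizes high losses
    # (robust max-like / fairness behavior); t < 0 suppresses outliers.
    m = np.max(t * losses)                  # log-sum-exp stabilization
    return (m + np.log(np.mean(np.exp(t * losses - m)))) / t
```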

DOI
01 Jan 2020
TL;DR: The Equal Opportunity fairness constraint combined with ERM will provably recover the Bayes optimal classifier under a range of bias models, and these theoretical results provide additional motivation for considering fairness interventions even if an actor cares primarily about accuracy.
Abstract: Multiple fairness constraints have been proposed in the literature, motivated by a range of concerns about how demographic groups might be treated unfairly by machine learning classifiers. In this work we consider a different motivation: learning from biased training data. We posit several ways in which training data may be biased, including having a more noisy or negatively biased labeling process on members of a disadvantaged group, or a decreased prevalence of positive or negative examples from the disadvantaged group, or both. Given such biased training data, Empirical Risk Minimization (ERM) may produce a classifier that not only is biased but also has suboptimal accuracy on the true data distribution. We examine the ability of fairness-constrained ERM to correct this problem. In particular, we find that the Equal Opportunity fairness constraint [Hardt et al., 2016] combined with ERM will provably recover the Bayes optimal classifier under a range of bias models. We also consider other recovery methods, including re-weighting the training data, Equalized Odds, Demographic Parity, and Calibration. These theoretical results provide additional motivation for considering fairness interventions even if an actor cares primarily about accuracy.
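
For reference, the Equal Opportunity criterion of Hardt et al. (2016) requires equal true-positive rates across groups; a small diagnostic sketch (ours, assuming binary labels, predictions, and a 0/1 group indicator):

```python
import numpy as np

def equal_opportunity_gap(y_true, y_pred, group):
    # |TPR(group 0) - TPR(group 1)|; a zero gap means Equal Opportunity holds.
    tpr = lambda g: y_pred[(y_true == 1) & (group == g)].mean()
    return abs(tpr(0) - tpr(1))
```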

Journal Article
TL;DR: New excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed, are presented.
Abstract: We present new excess risk bounds for general unbounded loss functions including log loss and squared loss, where the distribution of the losses may be heavy-tailed. The bounds hold for general estimators, but they are optimized when applied to η-generalized Bayesian, MDL, and empirical risk minimization estimators. In the case of log loss, the bounds imply convergence rates for generalized Bayesian inference under misspecification in terms of a generalization of the Hellinger metric as long as the learning rate η is set correctly. For general loss functions, our bounds rely on two separate conditions: the v-GRIP (generalized reversed information projection) conditions, which control the lower tail of the excess loss; and the newly introduced witness condition, which controls the upper tail. The parameter v in the v-GRIP conditions determines the achievable rate and is akin to the exponent in the Tsybakov margin condition and the Bernstein condition for bounded losses, which the v-GRIP conditions generalize; favorable v in combination with small model complexity leads to O(1/n) rates. The witness condition allows us to connect the excess risk to an “annealed” version thereof, by which we generalize several previous results connecting Hellinger and Renyi divergence to KL divergence.

Journal ArticleDOI
TL;DR: In this paper, a variance reduced stochastic gradient descent (VR-SGD) algorithm is proposed to solve non-smooth and non-strongly convex problems directly without any reduction techniques.
Abstract: In this paper, we propose a simple variant of the original SVRG, called variance reduced stochastic gradient descent (VR-SGD). Unlike the choices of snapshot and starting points in SVRG and its proximal variant, Prox-SVRG, the two vectors of VR-SGD are set to the average and last iterate of the previous epoch, respectively. The settings allow us to use much larger learning rates, and also make our convergence analysis more challenging. We also design two different update rules for smooth and non-smooth objective functions, respectively, which means that VR-SGD can tackle non-smooth and/or non-strongly convex problems directly without any reduction techniques. Moreover, we analyze the convergence properties of VR-SGD for strongly convex problems, which show that VR-SGD attains linear convergence. Different from most algorithms that have no convergence guarantees for non-strongly convex problems, we also provide the convergence guarantees of VR-SGD for this case, and empirically verify that VR-SGD with varying learning rates achieves similar performance to its momentum accelerated variant that has the optimal convergence rate $\mathcal{O}(1/T^2)$. Finally, we apply VR-SGD to solve various machine learning problems, such as convex and non-convex empirical risk minimization, and leading eigenvalue computation. Experimental results show that VR-SGD converges significantly faster than SVRG and Prox-SVRG, and usually outperforms state-of-the-art accelerated methods, e.g., Katyusha.
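
The distinguishing snapshot/start-point choice is easy to state in code; a schematic epoch (our sketch, not the authors' implementation):

```python
import numpy as np

def vr_sgd_epoch(grad_i, w_last, w_snap, n, lr, rng):
    # VR-SGD: snapshot = average iterate of the previous epoch,
    # starting point = last iterate of the previous epoch.
    mu = np.mean([grad_i(w_snap, i) for i in range(n)], axis=0)
    w, iterates = w_last, []
    for _ in range(n):
        i = rng.integers(n)
        w = w - lr * (grad_i(w, i) - grad_i(w_snap, i) + mu)
        iterates.append(w)
    return w, np.mean(iterates, axis=0)   # next start point, next snapshot
```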

Posted Content
Kartik Ahuja, Jun Wang, Amit Dhurandhar, Karthikeyan Shanmugam, Kush R. Varshney
TL;DR: This work analyzes the IRM and ERM frameworks from the perspective of sample complexity, finding that depending on the type of data generation mechanism, the two approaches might have very different finite sample and asymptotic behavior.
Abstract: Recently, invariant risk minimization (IRM) was proposed as a promising solution to address out-of-distribution (OOD) generalization. However, it is unclear when IRM should be preferred over the widely-employed empirical risk minimization (ERM) framework. In this work, we analyze both these frameworks from the perspective of sample complexity, thus taking a firm step towards answering this important question. We find that depending on the type of data generation mechanism, the two approaches might have very different finite sample and asymptotic behavior. For example, in the covariate shift setting we see that the two approaches not only arrive at the same asymptotic solution, but also have similar finite sample behavior with no clear winner. For other distribution shifts such as those involving confounders or anti-causal variables, however, the two approaches arrive at different asymptotic solutions where IRM is guaranteed to be close to the desired OOD solutions in the finite sample regime, while ERM is biased even asymptotically. We further investigate how different factors -- the number of environments, complexity of the model, and IRM penalty weight -- impact the sample complexity of IRM in relation to its distance from the OOD solutions.

Posted Content
TL;DR: It is shown that Mixup can be interpreted as a standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data, and that these transformations and perturbations induce multiple known regularization schemes that interact synergistically, resulting in a self-calibrated and effective regularization effect that prevents overfitting and overconfident predictions.
Abstract: Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has been shown empirically to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step toward explaining the theoretical foundations of Mixup by clarifying its regularization effects. We show that Mixup can be interpreted as a standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We further show that these transformations and perturbations induce multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator, and that these schemes interact synergistically with each other, resulting in a self-calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We illustrate our theoretical analysis with experiments that empirically support our conclusions.
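
The data transformation under analysis is the standard Mixup recipe of Zhang et al. (2018); a minimal sketch (ours):

```python
import numpy as np

def mixup(x, y_onehot, alpha=0.2, rng=None):
    # New examples as convex combinations of random pairs of inputs and
    # (one-hot) labels, with mixing weight lam ~ Beta(alpha, alpha).
    rng = rng or np.random.default_rng(0)
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]
```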

Journal ArticleDOI
TL;DR: In this article, the authors consider the problem of learning a concave function on a compact convex domain using linear combinations of "bump-like" components (neurons).
Abstract: Fitting a function by using linear combinations of a large number $N$ of “simple” components is one of the most fruitful ideas in statistical learning. This idea lies at the core of a variety of methods, from two-layer neural networks to kernel regression, to boosting. In general, the resulting risk minimization problem is nonconvex and is solved by gradient descent or its variants. Unfortunately, little is known about global convergence properties of these approaches. Here, we consider the problem of learning a concave function $f$ on a compact convex domain $\Omega \subset {\mathbb{R}}^{d}$, using linear combinations of “bump-like” components (neurons). The parameters to be fitted are the centers of $N$ bumps, and the resulting empirical risk minimization problem is highly nonconvex. We prove that, in the limit in which the number of neurons diverges, the evolution of gradient descent converges to a Wasserstein gradient flow in the space of probability distributions over $\Omega $. Further, when the bump width $\delta $ tends to $0$, this gradient flow has a limit which is a viscous porous medium equation. Remarkably, the cost function optimized by this gradient flow exhibits a special property known as displacement convexity, which implies exponential convergence rates for $N\to \infty $, $\delta \to 0$. Surprisingly, this asymptotic theory appears to capture well the behavior for moderate values of $\delta $, $N$. Explaining this phenomenon, and understanding the dependence on $\delta $, $N$ in a quantitative manner remains an outstanding challenge.

Proceedings ArticleDOI
TL;DR: This work addresses how to train fair classifiers when training labels are corrupted with random noise and the error rates of corruption depend both on the label class and on membership in a protected subgroup, by performing empirical risk minimization with carefully defined surrogate loss functions and surrogate constraints.
Abstract: This work examines how to train fair classifiers in settings where training labels are corrupted with random noise, and where the error rates of corruption depend both on the label class and on the membership function for a protected subgroup. Heterogeneous label noise models systematic biases towards particular groups when generating annotations. We begin by presenting analytical results which show that naively imposing parity constraints on demographic disparity measures, without accounting for heterogeneous and group-dependent error rates, can decrease both the accuracy and the fairness of the resulting classifier. Our experiments demonstrate these issues arise in practice as well. We address these problems by performing empirical risk minimization with carefully defined surrogate loss functions and surrogate constraints that help avoid the pitfalls introduced by heterogeneous label noise. We provide both theoretical and empirical justifications for the efficacy of our methods. We view our results as an important example of how imposing fairness on biased data sets without proper care can do at least as much harm as it does good.

Journal ArticleDOI
TL;DR: An extension of Chervonenkis and Vapnik’s classical empirical risk minimization (ERM) where the empirical risk is replaced by a median-of-means (MOM) estimator of the risk is presented, and the resulting new estimators are called MOM minimizers.
Abstract: We present an extension of Chervonenkis and Vapnik's classical empirical risk minimization (ERM) where the empirical risk is replaced by a median-of-means (MOM) estimator of the risk. The resulting new estimators are called MOM minimizers. While ERM is sensitive to corruption of the dataset for many classical loss functions used in classification, we show that MOM minimizers behave well in theory, in the sense that they achieve Vapnik's (slow) rates of convergence under weak assumptions: the functions in the hypothesis class are only required to have a finite second moment, and some outliers may also have corrupted the dataset. We propose algorithms, inspired by MOM minimizers, which may be interpreted as MOM versions of block stochastic gradient descent (BSGD). The key point of these algorithms is that the block of data on which a descent step is performed is chosen according to its “centrality” among the other blocks. This choice of “descent block” makes these algorithms robust to outliers, and it is the only extra step added to classical BSGD algorithms. As a consequence, classical BSGD algorithms can easily be turned into robust MOM versions. Moreover, MOM algorithms perform a smart subsampling which may substantially reduce computation time and memory resources when applied to non-linear algorithms. These empirical performances are illustrated on both simulated and real datasets.
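
The "central block" selection that makes MOM versions of BSGD robust can be sketched as follows (our illustration):

```python
import numpy as np

def median_block(per_example_losses, n_blocks, rng):
    # Split the data into blocks and return the indices of the block whose
    # mean loss is the median of the block means; the next descent step is
    # taken on this "most central" block, sidelining outlier-heavy blocks.
    idx = rng.permutation(len(per_example_losses))
    blocks = np.array_split(idx, n_blocks)
    means = [per_example_losses[b].mean() for b in blocks]
    return blocks[np.argsort(means)[n_blocks // 2]]
```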

Posted Content
TL;DR: This paper proposes Projected DP-SGD that performs noise reduction by projecting the noisy gradients to a low-dimensional subspace, which is given by the top gradient eigenspace on a small public dataset.
Abstract: Differentially private SGD (DP-SGD) is one of the most popular methods for solving differentially private empirical risk minimization (ERM). Due to its noisy perturbation on each gradient update, the error rate of DP-SGD scales with the ambient dimension $p$, the number of parameters in the model. Such dependence can be problematic for over-parameterized models where $p \gg n$, the number of training samples. Existing lower bounds on private ERM show that such dependence on $p$ is inevitable in the worst case. In this paper, we circumvent the dependence on the ambient dimension by leveraging a low-dimensional structure of gradient space in deep networks---that is, the stochastic gradients for deep nets usually stay in a low dimensional subspace in the training process. We propose Projected DP-SGD that performs noise reduction by projecting the noisy gradients to a low-dimensional subspace, which is given by the top gradient eigenspace on a small public dataset. We provide a general sample complexity analysis on the public dataset for the gradient subspace identification problem and demonstrate that under certain low-dimensional assumptions the public sample complexity only grows logarithmically in $p$. Finally, we provide a theoretical analysis and empirical evaluations to show that our method can substantially improve the accuracy of DP-SGD.
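
A schematic step (our sketch; `V_k` is an assumed p x k orthonormal basis for the top gradient eigenspace estimated on the public data):

```python
import numpy as np

def projected_dp_sgd_step(w, grad, V_k, lr, clip, sigma, rng):
    # Clip, add Gaussian noise calibrated to the clipping norm, then
    # project the noisy gradient onto the public top-k eigenspace.
    g = grad / max(1.0, np.linalg.norm(grad) / clip)
    noisy = g + sigma * clip * rng.standard_normal(g.shape)
    return w - lr * (V_k @ (V_k.T @ noisy))
```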

Proceedings Article
03 Jun 2020
TL;DR: This work suggests that performing a certain amount of local communications and computations per iteration can substantially improve the overall efficiency, and extends Network-DANE to composite optimization by allowing a nonsmooth penalty term.
Abstract: There is growing interest in large-scale machine learning and optimization over decentralized networks, e.g. in the context of multi-agent learning and federated learning. Due to the imminent need to alleviate the communication burden, the investigation of communication-efficient distributed optimization algorithms - particularly for empirical risk minimization - has flourished in recent years. A large fraction of these algorithms have been developed for the master/slave setting, relying on a central parameter server that can communicate with all agents. This paper focuses on distributed optimization over networks, or decentralized optimization, where each agent is only allowed to aggregate information from its neighbors. By properly adjusting the global gradient estimate via local averaging in conjunction with proper correction, we develop a communication-efficient approximate Newton-type method Network-DANE, which generalizes DANE to the decentralized scenarios. Our key ideas can be applied in a systematic manner to obtain decentralized versions of other master/slave distributed algorithms. A notable development is Network-SVRG/SARAH, which employs variance reduction to further accelerate local computation. We establish linear convergence of Network-DANE and Network-SVRG for strongly convex losses, and Network-SARAH for quadratic losses, which shed light on the impacts of data homogeneity, network connectivity, and local averaging upon the rate of convergence. We further extend Network-DANE to composite optimization by allowing a nonsmooth penalty term. Numerical evidence is provided to demonstrate the appealing performance of our algorithms over competitive baselines, in terms of both communication and computation efficiency. Our work suggests that performing a certain amount of local communications and computations per iteration can substantially improve the overall efficiency.

Proceedings ArticleDOI
TL;DR: An unbiased learning framework for the causal effect of recommendation is proposed based on the inverse propensity scoring technique and it is demonstrated that the proposed method outperforms other biased learning methods in various settings.
Abstract: Increasing users' positive interactions, such as purchases or clicks, is an important objective of recommender systems. Recommenders typically aim to select items that users will interact with. If the recommended items are purchased, an increase in sales is expected. However, the items could have been purchased even without recommendation. Thus, we want to recommend items that result in purchases caused by recommendation. This can be formulated as a ranking problem in terms of the causal effect. Despite its importance, this problem has not been well explored in related research. It is challenging because the ground truth of the causal effect is unobservable, and estimating the causal effect is prone to bias arising from currently deployed recommenders. This paper proposes an unbiased learning framework for the causal effect of recommendation. Based on the inverse propensity scoring technique, the proposed framework first constructs unbiased estimators for ranking metrics. Then, it conducts empirical risk minimization on the estimators with propensity capping, which reduces variance under finite training samples. Based on the framework, we develop an unbiased learning method for the causal effect extension of a ranking metric. We theoretically analyze the unbiasedness of the proposed method and empirically demonstrate that the proposed method outperforms other biased learning methods in various settings.
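
The variance-reduction device mentioned here, propensity capping, amounts to flooring the propensities before inverting them; a one-line sketch (ours):

```python
import numpy as np

def capped_ips_weights(propensities, cap=0.1):
    # Inverse propensity weights with capping: flooring propensities at
    # `cap` bounds the weights by 1/cap, trading a little bias for a
    # substantial variance reduction under finite training samples.
    return 1.0 / np.maximum(propensities, cap)
```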

Proceedings Article
03 Jun 2020
TL;DR: A computationally efficient method is proposed to construct a gradient estimator that is purposely biased toward observations with higher current losses; it is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for weakly convex (non-convex) loss.
Abstract: We propose a new stochastic optimization framework for empirical risk minimization problems such as those that arise in machine learning. The traditional approaches, such as (mini-batch) stochastic gradient descent (SGD), utilize an unbiased gradient estimator of the empirical average loss. In contrast, we develop a computationally efficient method to construct a gradient estimator that is purposely biased toward those observations with higher current losses. On the theory side, we show that the proposed method minimizes a new ordered modification of the empirical average loss, and is guaranteed to converge at a sublinear rate to a global optimum for convex loss and to a critical point for weakly convex (non-convex) loss. Furthermore, we prove a new generalization bound for the proposed algorithm. On the empirical side, the numerical experiments show that our proposed method consistently improves the test errors compared with the standard mini-batch SGD in various models including SVM, logistic regression, and deep learning problems.
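
One common way to realize such a purposely biased estimator is to keep only the k largest losses in each mini-batch; a sketch consistent with the paper's description (ours, not the authors' code):

```python
import torch

def ordered_batch_loss(per_example_losses, k):
    # Average only the k largest losses in the mini-batch, biasing the
    # gradient toward observations with higher current losses.
    topk, _ = torch.topk(per_example_losses, k)
    return topk.mean()
```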

Posted Content
11 Jun 2020
TL;DR: The first formal convergence analysis of clipped DP-GD is provided, showing that the value one sets for clipping really matters: if the clipping norm is set within at most a constant factor above the optimal, one can obtain an excess empirical risk guarantee that is independent of the dimensionality of the model space.
Abstract: Differentially private gradient descent (DP-GD) has been extremely effective both theoretically, and in practice, for solving private empirical risk minimization (ERM) problems. In this paper, we focus on understanding the impact of the clipping norm, a critical component of DP-GD, on its convergence. We provide the first formal convergence analysis of clipped DP-GD. More generally, we show that the value which one sets for clipping really matters: done wrong, it can dramatically affect the resulting quality; done properly, it can eliminate the dependence of convergence on the model dimensionality. We do this by showing a dichotomous behavior of the clipping norm. First, we show that if the clipping norm is set smaller than the optimal, even by a constant factor, the excess empirical risk for convex ERMs can increase from $O(1/n)$ to $\Omega(1)$, where $n$ is the number of data samples. Next, we show that, regardless of the value of the clipping norm, clipped DP-GD minimizes a well-defined convex objective over an unconstrained space, as long as the underlying ERM is a generalized linear problem. Furthermore, if the clipping norm is set within at most a constant factor higher than the optimal, then one can obtain an excess empirical risk guarantee that is independent of the dimensionality of the model space. Finally, we extend our result to non-convex generalized linear problems by showing that DP-GD reaches a first-order stationary point as long as the loss is smooth, and the convergence is independent of the dimensionality of the model space.
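
For concreteness, here is a standard clipped, noised gradient step of the kind analyzed in the paper (our sketch; `C` is the clipping norm under study and `sigma` the noise multiplier):

```python
import numpy as np

def clipped_dp_gd_step(w, per_example_grads, lr, C, sigma, rng):
    # Clip each per-example gradient to norm C, average, and add Gaussian
    # noise whose scale is calibrated to C; the choice of C drives the
    # dichotomous behavior analyzed in the paper.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / C)
    g = clipped.mean(axis=0)
    noise = (sigma * C / len(per_example_grads)) * rng.standard_normal(w.shape)
    return w - lr * (g + noise)
```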

Journal ArticleDOI
TL;DR: It is proved that with Gaussian inputs, the empirical risk based on cross entropy exhibits strong convexity and smoothness uniformly in a local neighborhood of the ground truth, as soon as the sample complexity is sufficiently large.
Abstract: We study model recovery for data classification, where the training labels are generated from a one-hidden-layer neural network with sigmoid activations, also known as a single-layer feedforward network, and the goal is to recover the weights of the neural network. We consider two network models, the fully-connected network (FCN) and the non-overlapping convolutional neural network (CNN). We prove that with Gaussian inputs, the empirical risk based on cross entropy exhibits strong convexity and smoothness uniformly in a local neighborhood of the ground truth, as soon as the sample complexity is sufficiently large. This implies that if initialized in this neighborhood, gradient descent converges linearly to a critical point that is provably close to the ground truth. Furthermore, we show such an initialization can be obtained via the tensor method. This establishes the global convergence guarantee for empirical risk minimization using cross entropy via gradient descent for learning one-hidden-layer neural networks, at the near-optimal sample and computational complexity with respect to the network input dimension without unrealistic assumptions such as requiring a fresh set of samples at each iteration.

Journal ArticleDOI
TL;DR: A comprehensive understanding of algorithmic convergence with respect to data homogeneity is obtained by measuring the smoothness of the discrepancy between the local and global loss functions and it is shown that when the data are less balanced, regularization can be used to ensure convergence at a slower rate.
Abstract: Stochastic variance reduced methods have gained a lot of interest recently for empirical risk minimization due to its appealing run time complexity. When the data size is large and disjointly stored on different machines, it becomes imperative to distribute the implementation of such variance reduced methods. In this paper, we consider a general framework that directly distributes popular stochastic variance reduced methods in the master/slave model, by assigning outer loops to the parameter server, and inner loops to worker machines. This framework is natural and friendly to implement, but its theoretical convergence is not well understood. We obtain a comprehensive understanding of algorithmic convergence with respect to data homogeneity by measuring the smoothness of the discrepancy between the local and global loss functions. We establish the linear convergence of distributed versions of a family of stochastic variance reduced algorithms, including those using accelerated and recursive gradient updates, for minimizing strongly convex losses. Our theory captures how the convergence of distributed algorithms behaves as the number of machines and the size of local data vary. Furthermore, we show that when the data are less balanced, regularization can be used to ensure convergence at a slower rate. We also demonstrate that our analysis can be further extended to handle nonconvex loss functions.

Posted Content
TL;DR: Invariant Risk Minimization (IRM), a domain generalization algorithm that employs a causal-discovery-inspired method to find robust predictors, is applied to the task of fairly predicting the toxicity of internet comments, and it is shown that IRM achieves better out-of-distribution accuracy and fairness than Empirical Risk Minimization (ERM) methods.
Abstract: Robustness is of central importance in machine learning and has given rise to the fields of domain generalization and invariant learning, which are concerned with improving performance on a test distribution distinct from but related to the training distribution. In light of recent work suggesting an intimate connection between fairness and robustness, we investigate whether algorithms from robust ML can be used to improve the fairness of classifiers that are trained on biased data and tested on unbiased data. We apply Invariant Risk Minimization (IRM), a domain generalization algorithm that employs a causal discovery inspired method to find robust predictors, to the task of fairly predicting the toxicity of internet comments. We show that IRM achieves better out-of-distribution accuracy and fairness than Empirical Risk Minimization (ERM) methods, and analyze both the difficulties that arise when applying IRM in practice and the conditions under which IRM will likely be effective in this scenario. We hope that this work will inspire further studies of how robust machine learning methods relate to algorithmic fairness.

Proceedings ArticleDOI
19 Jul 2020
TL;DR: In this article, the authors tackle the problem of algorithmic fairness, where the goal is to avoid the unfair influence of sensitive information, in the general context of regression with possibly continuous sensitive attributes.
Abstract: We tackle the problem of algorithmic fairness, where the goal is to avoid the unfair influence of sensitive information, in the general context of regression with possibly continuous sensitive attributes. We extend the framework of fair empirical risk minimization of [1] to this general scenario, covering in this way the whole standard supervised learning setting. Our generalized fairness measure reduces to well-known notions of fairness available in the literature. We derive learning guarantees for our method that imply in particular its statistical consistency, both in terms of the risk and of the fairness measure. We then specialize our approach to kernel methods and propose a convex fair estimator in that setting. We test the estimator on a commonly used benchmark dataset (Communities and Crime) and on a new dataset collected at the University of Genoa, containing information on the academic careers of five thousand students. The latter dataset provides a challenging real-world case of unfair behaviour of standard regression methods that benefits from our methodology. The experimental results show that our estimator is effective at mitigating the trade-off between accuracy and fairness requirements.
