
Showing papers by "Anit Kumar Sahu published in 2019"


Posted Content
TL;DR: A novel algorithm, MATCHA, is proposed that uses matching decomposition sampling of the base topology to parallelize inter-worker information exchange and thereby significantly reduce communication delay, while communicating more frequently over critical links so as to maintain the same convergence rate as vanilla decentralized SGD.
Abstract: This paper studies the problem of error-runtime trade-off, typically encountered in decentralized training based on stochastic gradient descent (SGD) using a given network. While a denser (sparser) network topology results in faster (slower) error convergence in terms of iterations, it incurs more (less) communication time/delay per iteration. In this paper, we propose MATCHA, an algorithm that can achieve a win-win in this error-runtime trade-off for any arbitrary network topology. The main idea of MATCHA is to parallelize inter-node communication by decomposing the topology into matchings. To preserve fast error convergence speed, it identifies and communicates more frequently over critical links, and saves communication time by using other links less frequently. Experiments on a suite of datasets and deep neural networks validate the theoretical analyses and demonstrate that MATCHA takes up to $5\times$ less time than vanilla decentralized SGD to reach the same training loss.
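
As a rough illustration of the matching-decomposition idea, the sketch below splits a base topology into matchings via a greedy edge coloring and activates each matching independently in every iteration. The toy topology, the uniform activation probability, and the scheduling are illustrative placeholders for the per-matching activation probabilities that MATCHA optimizes.

```python
import random
import networkx as nx

def matching_decomposition(graph):
    """Split the edge set into matchings by greedily coloring the line graph;
    edges that share a color never share a node, so each color class is a matching."""
    coloring = nx.greedy_color(nx.line_graph(graph))
    matchings = {}
    for edge, color in coloring.items():
        matchings.setdefault(color, []).append(edge)
    return list(matchings.values())

def sample_active_links(matchings, activation_prob=0.5):
    """Activate each matching independently; active matchings can be scheduled
    in parallel because their edges are node-disjoint."""
    active = []
    for matching in matchings:
        if random.random() < activation_prob:
            active.extend(matching)
    return active

# Toy usage on a ring with two chords (illustration only).
G = nx.cycle_graph(8)
G.add_edges_from([(0, 4), (2, 6)])
matchings = matching_decomposition(G)
print(len(matchings), "matchings")
print("links used this iteration:", sample_active_links(matchings))
```

In MATCHA the activation probabilities are chosen per matching (higher for critical links) subject to a communication budget, rather than being a single constant as above.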

104 citations


Proceedings ArticleDOI
18 Mar 2019
TL;DR: It is shown that under a sufficiently small constant step-size, $\mathcal{S}$-$\mathcal{AB}$ converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer.
Abstract: In this paper, we study distributed stochastic optimization to minimize a sum of smooth and strongly-convex local cost functions over a network of agents, communicating over a strongly-connected graph. Assuming that each agent has access to a stochastic first-order oracle ($\mathcal{SFO}$), we propose a novel distributed method, called $\mathcal{S}$-$\mathcal{AB}$, where each agent uses an auxiliary variable to asymptotically track the gradient of the global cost in expectation. The $\mathcal{S}$-$\mathcal{AB}$ algorithm employs row- and column-stochastic weights simultaneously to ensure both consensus and optimality. Since doubly-stochastic weights are not used, $\mathcal{S}$-$\mathcal{AB}$ is applicable to arbitrary strongly-connected graphs. We show that under a sufficiently small constant step-size, $\mathcal{S}$-$\mathcal{AB}$ converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer. We present numerical simulations based on real-world data sets to illustrate the theoretical results.
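
For context, gradient-tracking methods of this family typically take the following generic form, written here as a sketch; the precise placement of the mixing matrices and the initialization used by $\mathcal{S}$-$\mathcal{AB}$ may differ:

$\mathbf{x}_{k+1} = A\left(\mathbf{x}_k - \alpha\,\mathbf{y}_k\right), \qquad \mathbf{y}_{k+1} = B\,\mathbf{y}_k + \mathbf{g}_{k+1} - \mathbf{g}_k,$

where $A$ is row-stochastic, $B$ is column-stochastic, $\alpha$ is the constant step-size, $\mathbf{g}_k$ stacks the local stochastic gradients returned by the $\mathcal{SFO}$ at the current iterates, and $\mathbf{y}_k$ is the auxiliary variable that tracks the global gradient in expectation.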

103 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This article proposes FedDANE, an optimization method adapted from DANE, a method for classical distributed optimization, to handle the practical constraints of federated learning, and provides convergence guarantees for this method when learning over both convex and non-convex functions.
Abstract: Federated learning aims to jointly learn statistical models over massively distributed remote devices. In this work, we propose FedDANE, an optimization method that we adapt from DANE [8], [9], a method for classical distributed optimization, to handle the practical constraints of federated learning. We provide convergence guarantees for this method when learning over both convex and non-convex functions. Despite encouraging theoretical results, we find that the method has underwhelming performance empirically. In particular, through empirical simulations on both synthetic and real-world datasets, FedDANE consistently underperforms baselines of FedAvg [7] and FedProx [4] in realistic federated settings. We identify low device participation and statistical device heterogeneity as two underlying causes of this underwhelming performance, and conclude by suggesting several directions of future work.
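
For orientation, DANE-style methods have each selected device solve a regularized local subproblem built around a global gradient estimate; a simplified form of such a subproblem, with the full gradient that DANE uses replaced by an estimate $g^t$ aggregated from a sampled subset of devices and with constants suppressed, is

$\min_{w}\; F_k(w) - \left(\nabla F_k(w^t) - g^t\right)^{\top} w + \frac{\mu}{2}\left\|w - w^t\right\|^2,$

where $F_k$ is device $k$'s local objective, $w^t$ is the current global model, and $\mu \ge 0$ is a proximal parameter; see the paper for the exact weighting of the gradient-correction term.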

56 citations


Proceedings ArticleDOI
01 Dec 2019
TL;DR: MATCHA, as discussed by the authors, uses matching decomposition sampling of the base topology to parallelize inter-worker information exchange, significantly reducing communication delay while communicating more frequently over critical links so as to maintain the same convergence rate as vanilla decentralized SGD.
Abstract: Decentralized stochastic gradient descent (SGD) is a promising approach to learn a machine learning model over a network of workers connected in an arbitrary topology. Although a densely-connected network topology can ensure faster convergence in terms of iterations, it incurs more communication time/delay per iteration, resulting in longer training time. In this paper, we propose a novel algorithm MATCHA to achieve a win-win in this error-runtime trade-off. MATCHA uses matching decomposition sampling of the base topology to parallelize inter-worker information exchange so as to significantly reduce communication delay. At the same time, the algorithm communicates more frequently over critical links such that it can maintain the same convergence rate as vanilla decentralized SGD. Experiments on a suite of datasets and deep neural networks validate the theoretical analysis and demonstrate the effectiveness of the proposed scheme as far as reducing communication delays is concerned.

45 citations


Proceedings Article
11 Apr 2019
TL;DR: A zeroth order Frank-Wolfe algorithm is proposed, which, in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm, makes it gradient free, and it is shown that the proposed algorithm converges to the optimal objective value at a rate of $O\left(1/T^{1/3}\right)$, where $T$ denotes the iteration count.
Abstract: This paper focuses on the problem of constrained stochastic optimization. A zeroth order Frank-Wolfe algorithm is proposed, which, in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm, makes it gradient free. Under convexity and smoothness assumptions, we show that the proposed algorithm converges to the optimal objective value at a rate of $O\left(1/T^{1/3}\right)$, where $T$ denotes the iteration count. In particular, the primal sub-optimality gap is shown to have a dimension dependence of $O\left(d^{1/3}\right)$, which is the best known dimension dependence among all zeroth order optimization algorithms with one directional derivative per iteration. For non-convex functions, we obtain the Frank-Wolfe gap to be $O\left(d^{1/3}T^{-1/4}\right)$. Experiments on black-box optimization setups demonstrate the efficacy of the proposed algorithm.
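
To make the gradient-free and projection-free structure concrete, here is a toy zeroth order Frank-Wolfe loop over an $\ell_1$ ball that uses one random directional finite difference per iteration; the averaging and step-size schedules are illustrative placeholders rather than the schedules analyzed in the paper.

```python
import numpy as np

def zeroth_order_fw(loss, dim, radius=1.0, iters=200, delta=1e-3, seed=0):
    """Toy zeroth order Frank-Wolfe over an l1 ball of the given radius,
    using one directional finite difference per iteration and an averaged
    gradient surrogate (all schedules are illustrative placeholders)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    g_avg = np.zeros(dim)
    for t in range(1, iters + 1):
        u = rng.standard_normal(dim)
        u /= np.linalg.norm(u)
        # One directional derivative estimate per iteration (gradient free).
        g_hat = dim * (loss(x + delta * u) - loss(x)) / delta * u
        rho = 1.0 / np.sqrt(t)            # averaging weight (illustrative)
        g_avg = (1 - rho) * g_avg + rho * g_hat
        # Linear minimization oracle for the l1 ball: a signed vertex.
        i = int(np.argmax(np.abs(g_avg)))
        v = np.zeros(dim)
        v[i] = -radius * np.sign(g_avg[i])
        gamma = 2.0 / (t + 2)             # projection-free Frank-Wolfe step
        x = (1 - gamma) * x + gamma * v
    return x

# Example: minimize a smooth quadratic using only function evaluations.
target = np.linspace(-0.5, 0.5, 20)
x_hat = zeroth_order_fw(lambda z: float(np.sum((z - target) ** 2)), dim=20)
```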

31 citations


Journal ArticleDOI
TL;DR: The unique characteristics and challenges of federated learning are discussed, a broad overview of current approaches is provided, and several directions of future work that are relevant to a wide range of research communities are outlined.
Abstract: Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving data analysis. In this article, we discuss the unique characteristics and challenges of federated learning, provide a broad overview of current approaches, and outline several directions of future work that are relevant to a wide range of research communities.

27 citations


Posted Content
TL;DR: This work uses Bayesian optimization (BO) to develop query-efficient adversarial attacks tailored to scenarios with low query budgets, and alleviates the difficulty of applying BO to high-dimensional deep learning models via effective dimension upsampling techniques.
Abstract: We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples using information limited to loss function evaluations of input-output pairs. We use Bayesian optimization (BO) to specifically cater to scenarios involving low query budgets to develop query-efficient adversarial attacks. We alleviate the issues surrounding BO with regard to optimizing high-dimensional deep learning models by effective dimension upsampling techniques. Our proposed approach achieves performance comparable to state-of-the-art black-box adversarial attacks, albeit with a much lower average query count. In particular, in low query budget regimes, our proposed method reduces the query count by up to $80\%$ with respect to state-of-the-art methods.
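
A minimal sketch of the low-query-budget idea, assuming a scikit-learn Gaussian process as the BO surrogate: optimize a coarse, low-dimensional perturbation by expected improvement and tile it up to image resolution before querying the model. The helper names (attack_loss, upsample), the candidate-sampling scheme, and the kernel choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def upsample(z, factor):
    """Tile a coarse square perturbation up to image resolution
    (one simple choice of dimension upsampling)."""
    side = int(np.sqrt(z.size))
    return np.kron(z.reshape(side, side), np.ones((factor, factor)))

def bo_attack(attack_loss, coarse_dim=16, eps=0.1, budget=50, n_init=10, seed=0):
    """Toy BO attack: fit a GP to loss evaluations of low-dimensional
    perturbations and choose the next query by expected improvement
    over random candidates."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-eps, eps, size=(n_init, coarse_dim))
    y = np.array([attack_loss(z) for z in Z])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(budget - n_init):
        gp.fit(Z, y)
        cand = rng.uniform(-eps, eps, size=(256, coarse_dim))
        mu, sd = gp.predict(cand, return_std=True)
        gap = y.min() - mu                      # improvement over best query so far
        ei = gap * norm.cdf(gap / (sd + 1e-9)) + sd * norm.pdf(gap / (sd + 1e-9))
        z_next = cand[np.argmax(ei)]
        Z = np.vstack([Z, z_next])
        y = np.append(y, attack_loss(z_next))
    return Z[np.argmin(y)]
```

Here attack_loss(z) is assumed to add upsample(z, factor) to the image, query the victim model, and return a margin-style loss that is smaller when the input is closer to being misclassified.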

22 citations


Posted Content
TL;DR: In this paper, a distributed stochastic optimization problem is studied: minimizing a sum of smooth and strongly-convex local cost functions over a network of agents communicating over a strongly-connected graph.
Abstract: In this paper, we study distributed stochastic optimization to minimize a sum of smooth and strongly-convex local cost functions over a network of agents, communicating over a strongly-connected graph. Assuming that each agent has access to a stochastic first-order oracle ($\mathcal{SFO}$), we propose a novel distributed method, called $\mathcal{S}$-$\mathcal{AB}$, where each agent uses an auxiliary variable to asymptotically track the gradient of the global cost in expectation. The $\mathcal{S}$-$\mathcal{AB}$ algorithm employs row- and column-stochastic weights simultaneously to ensure both consensus and optimality. Since doubly-stochastic weights are not used, $\mathcal{S}$-$\mathcal{AB}$ is applicable to arbitrary strongly-connected graphs. We show that under a sufficiently small constant step-size, $\mathcal{S}$-$\mathcal{AB}$ converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer. We present numerical simulations based on real-world data sets to illustrate the theoretical results.

12 citations


Posted Content
TL;DR: This work considers the setting of batch active learning, in which multiple samples are selected (as opposed to a single sample, as in classical settings) to reduce the training overhead, and bridges between uniform randomness and score-based importance sampling of clusters when selecting a batch of new samples.
Abstract: We study the problem of training machine learning models incrementally using active learning with access to imperfect or noisy oracles. We specifically consider the setting of batch active learning, in which multiple samples are selected as opposed to a single sample as in classical settings, so as to reduce the training overhead. Our approach bridges between uniform randomness and score-based importance sampling of clusters when selecting a batch of new samples. Experiments on benchmark image classification datasets (MNIST, SVHN, and CIFAR10) show improvement over existing active learning strategies. We introduce an extra denoising layer to deep networks to make active learning robust to label noise and show significant improvements.
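
A minimal sketch of the blended selection rule, under assumed clustering and scoring choices: cluster the unlabeled pool, then draw clusters from a mixture of a uniform distribution and a score-weighted importance distribution before picking members to label. The mixing weight, KMeans clustering, and mean-score aggregation below are illustrative placeholders, not the paper's exact acquisition rule.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_batch(features, scores, batch_size, n_clusters=10, mix=0.5, seed=0):
    """Pick a batch by sampling clusters from a blend of uniform and
    score-based (importance) probabilities; `scores` are assumed to be
    nonnegative acquisition scores (e.g., predictive uncertainty)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(features)
    cluster_scores = np.array([scores[labels == c].mean()
                               for c in range(n_clusters)])
    importance = cluster_scores / cluster_scores.sum()
    uniform = np.full(n_clusters, 1.0 / n_clusters)
    probs = mix * uniform + (1 - mix) * importance
    # Sample clusters (with replacement, for simplicity) and one member each.
    chosen = []
    for c in rng.choice(n_clusters, size=batch_size, p=probs):
        members = np.flatnonzero(labels == c)
        chosen.append(int(rng.choice(members)))
    return chosen
```

Setting mix=1 recovers uniform sampling of clusters, while mix=0 is purely score-driven importance sampling; the approach described above interpolates between these extremes.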

5 citations


Posted Content
TL;DR: This work incorporates model uncertainty into the sampling probability to compensate for poor estimation of the importance scores when the training data is too small to build a meaningful model.
Abstract: We study the problem of training machine learning models incrementally with batches of samples annotated with noisy oracles. We select each batch of samples to be both important and diverse via clustering and importance sampling. More importantly, we incorporate model uncertainty into the sampling probability to compensate for poor estimation of the importance scores when the training data is too small to build a meaningful model. Experiments on benchmark image classification datasets (MNIST, SVHN, CIFAR10, and EMNIST) show improvement over existing active learning strategies. We introduce an extra denoising layer to deep networks to make active learning robust to label noise and show significant improvements.

4 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper proposes a decentralized stochastic gradient method with gradient tracking, and shows that the proposed algorithm converges linearly to an error ball around the optimal solution with a constant step-size.
Abstract: In this paper, we present stochastic optimization for empirical risk minimization over directed graphs. Using a novel information fusion approach that utilizes both row- and column-stochastic weights simultaneously, we propose $\mathcal{S}$-$\mathcal{AB}$, a decentralized stochastic gradient method with gradient tracking, and show that the proposed algorithm converges linearly to an error ball around the optimal solution with a constant step-size. We provide a sketch of the convergence analysis as well as the generalization of the proposed algorithm. Finally, we illustrate the theoretical results with the help of experiments with real data.

Proceedings ArticleDOI
01 Jul 2019
TL;DR: The directed $\mathcal{D}$-$\mathcal{CREDO}$ distributed recursive estimator further dramatically improves communication efficiency, achieving a communication MSE rate of $O(1/C_{t}^{\kappa})$ with arbitrarily high exponent $\kappa$, while keeping the order-optimal $O(1/t)$ sample-wise MSE rate.
Abstract: Recently, a communication efficient recursive distributed estimator, $\mathcal{CREDO}$, has been proposed, that utilizes increasingly sparse randomized bidirectional communications. $\mathcal{CREDO}$ achieves an order-optimal $O(1/t)$ mean square error (MSE) rate in the number of per-node processed samples $t$, and an $O(1/C_{t}^{2-\zeta})$ MSE rate in the number of per-node communications $C_{t}$, where $\zeta > 0$ is arbitrarily small. In this paper, we present directed $\mathcal{CREDO}$, $\mathcal{D}$-$\mathcal{CREDO}$ for short, a distributed recursive estimator that utilizes directed increasingly sparse communications. We show that $\mathcal{D}$-$\mathcal{CREDO}$ further dramatically improves communication efficiency, achieving a communication MSE rate of $O(1/C_{t}^{\kappa})$ with arbitrarily high exponent $\kappa$, while keeping the order-optimal $O(1/t)$ sample-wise MSE rate. Numerical examples on real data sets confirm our results.
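
As a purely illustrative toy of increasingly sparse randomized communications (undirected here for simplicity, unlike the directed links used by $\mathcal{D}$-$\mathcal{CREDO}$), the sketch below runs a distributed recursive mean estimator in which each link is active with a probability that decays over time, so the per-node communication count grows sublinearly in $t$; all step-size and activation schedules are placeholder choices, not those analyzed in the paper.

```python
import numpy as np

def sparse_comm_estimator(samples, adjacency, iters=2000, seed=0):
    """Toy distributed recursive estimation of a common mean with
    increasingly sparse randomized link activations (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    x = np.zeros(n)
    for t in range(1, iters + 1):
        p_t = min(1.0, t ** -0.3)          # decaying link-activation probability
        mask = rng.random((n, n)) < p_t
        active = np.triu(adjacency * mask, 1)
        active = active + active.T         # undirected toy version
        deg = active.sum(axis=1)
        # Consensus step over the randomly activated links.
        x = x + (0.5 / n) * (active @ x - deg * x)
        # Innovation step with one new noisy sample per node.
        x = x + (1.0 / t) * (samples[:, (t - 1) % samples.shape[1]] - x)
    return x

# Toy usage: 8 nodes on a ring estimating a common mean of 1.0.
ring = np.roll(np.eye(8), 1, axis=1) + np.roll(np.eye(8), -1, axis=1)
obs = 1.0 + np.random.default_rng(1).standard_normal((8, 2000))
print(sparse_comm_estimator(obs, ring))
```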


Posted Content
TL;DR: This work proposes to capture correlations within gradients of the loss function with respect to the input images via a Gaussian Markov random field (GMRF), and shows that the covariance structure can be efficiently represented using the Fast Fourier Transform (FFT), along with low-rank updates to perform exact posterior estimation under this model.
Abstract: We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth order oracle, providing us with loss function evaluations. Although this setting has been investigated in previous work, most past approaches using zeroth order optimization implicitly assume that the gradients of the loss function with respect to the input images are unstructured. In this work, we show that in fact substantial correlations exist within these gradients, and we propose to capture these correlations via a Gaussian Markov random field (GMRF). Given the intractability of the explicit covariance structure of the MRF, we show that the covariance structure can be efficiently represented using the Fast Fourier Transform (FFT), along with low-rank updates to perform exact posterior estimation under this model. We use this modeling technique to find fast one-step adversarial attacks, akin to a black-box version of the Fast Gradient Sign Method (FGSM), and show that the method uses fewer queries and achieves higher attack success rates than the current state of the art. We also highlight the general applicability of this gradient modeling setup.
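
To make the FFT remark concrete, the toy sketch below assumes a stationary nearest-neighbour GMRF on a periodic image lattice, so that the 2-D FFT diagonalizes its precision, and draws a spatially correlated "gradient field" by scaling white noise in the Fourier domain. The coupling parameters and the periodic-boundary assumption are illustrative, and the low-rank posterior updates described in the paper are not shown.

```python
import numpy as np

def gmrf_spectrum(shape, alpha=1.0, tau=0.1):
    """Eigenvalues of a stationary lattice-GMRF precision with
    nearest-neighbour coupling and periodic boundaries: tau*I plus
    alpha times the torus graph Laplacian, diagonalized by the 2-D FFT."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return tau + 2.0 * alpha * (2.0 - np.cos(2 * np.pi * fy) - np.cos(2 * np.pi * fx))

def sample_gmrf(shape, rng):
    """Draw a correlated sample from the GMRF prior: scale white noise by
    the square root of the covariance spectrum in the Fourier domain."""
    cov_spectrum = 1.0 / gmrf_spectrum(shape)
    noise = np.fft.fft2(rng.standard_normal(shape))
    return np.fft.ifft2(np.sqrt(cov_spectrum) * noise).real

rng = np.random.default_rng(0)
field = sample_gmrf((32, 32), rng)   # smooth, spatially correlated draw
```

The same diagonalization is what makes covariance-vector products cheap: multiplying by the (block-)circulant covariance reduces to an FFT, an elementwise scaling by the spectrum, and an inverse FFT.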