
Showing papers by "Anit Kumar Sahu published in 2019"


Posted Content
TL;DR: A novel algorithm, MATCHA, is proposed that uses matching decomposition sampling of the base topology to parallelize inter-worker information exchange and thereby significantly reduce communication delay, while communicating more frequently over critical links so as to maintain the same convergence rate as vanilla decentralized SGD.
Abstract: This paper studies the problem of error-runtime trade-off, typically encountered in decentralized training based on stochastic gradient descent (SGD) using a given network. While a denser (sparser) network topology results in faster (slower) error convergence in terms of iterations, it incurs more (less) communication time/delay per iteration. In this paper, we propose MATCHA, an algorithm that can achieve a win-win in this error-runtime trade-off for any arbitrary network topology. The main idea of MATCHA is to parallelize inter-node communication by decomposing the topology into matchings. To preserve fast error convergence speed, it identifies and communicates more frequently over critical links, and saves communication time by using other links less frequently. Experiments on a suite of datasets and deep neural networks validate the theoretical analyses and demonstrate that MATCHA takes up to $5\times$ less time than vanilla decentralized SGD to reach the same training loss.
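
As a rough illustration of the matching-decomposition idea, the sketch below splits a base topology into matchings via a greedy edge coloring and activates each matching independently in every iteration. The toy topology, the uniform activation probability, and the scheduling are illustrative placeholders for the per-matching activation probabilities that MATCHA optimizes.

```python
import random
import networkx as nx

def matching_decomposition(graph):
    """Split the edge set into matchings by greedily coloring the line graph;
    edges that share a color never share a node, so each color class is a matching."""
    coloring = nx.greedy_color(nx.line_graph(graph))
    matchings = {}
    for edge, color in coloring.items():
        matchings.setdefault(color, []).append(edge)
    return list(matchings.values())

def sample_active_links(matchings, activation_prob=0.5):
    """Activate each matching independently; active matchings can be scheduled
    in parallel because their edges are node-disjoint."""
    active = []
    for matching in matchings:
        if random.random() < activation_prob:
            active.extend(matching)
    return active

# Toy usage on a ring with two chords (illustration only).
G = nx.cycle_graph(8)
G.add_edges_from([(0, 4), (2, 6)])
matchings = matching_decomposition(G)
print(len(matchings), "matchings")
print("links used this iteration:", sample_active_links(matchings))
```

In MATCHA the activation probabilities are chosen per matching (higher for critical links) subject to a communication budget, rather than being a single constant as above.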

104 citations


Proceedings ArticleDOI
18 Mar 2019
TL;DR: It is shown that under a sufficiently small constant step-size, $\mathcal{S}$-$\mathcal{AB}$ converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer.
Abstract: In this paper, we study distributed stochastic optimization to minimize a sum of smooth and strongly-convex local cost functions over a network of agents, communicating over a strongly-connected graph. Assuming that each agent has access to a stochastic first-order oracle ($\mathcal{SFO}$), we propose a novel distributed method, called $\mathcal{S}$-$\mathcal{AB}$, where each agent uses an auxiliary variable to asymptotically track the gradient of the global cost in expectation. The $\mathcal{S}$-$\mathcal{AB}$ algorithm employs row- and column-stochastic weights simultaneously to ensure both consensus and optimality. Since doubly-stochastic weights are not used, $\mathcal{S}$-$\mathcal{AB}$ is applicable to arbitrary strongly-connected graphs. We show that under a sufficiently small constant step-size, $\mathcal{S}$-$\mathcal{AB}$ converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer. We present numerical simulations based on real-world data sets to illustrate the theoretical results.
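
For context, gradient-tracking methods of this family typically take the following generic form, written here as a sketch; the precise placement of the mixing matrices and the initialization used by $\mathcal{S}$-$\mathcal{AB}$ may differ:

$\mathbf{x}_{k+1} = A\left(\mathbf{x}_k - \alpha\,\mathbf{y}_k\right), \qquad \mathbf{y}_{k+1} = B\,\mathbf{y}_k + \mathbf{g}_{k+1} - \mathbf{g}_k,$

where $A$ is row-stochastic, $B$ is column-stochastic, $\alpha$ is the constant step-size, $\mathbf{g}_k$ stacks the local stochastic gradients returned by the $\mathcal{SFO}$ at the current iterates, and $\mathbf{y}_k$ is the auxiliary variable that tracks the global gradient in expectation.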

103 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This article proposes FedDANE, an optimization method adapted from DANE, a method for classical distributed optimization, to handle the practical constraints of federated learning, and provides convergence guarantees for this method when learning over both convex and non-convex functions.
Abstract: Federated learning aims to jointly learn statistical models over massively distributed remote devices. In this work, we propose FedDANE, an optimization method that we adapt from DANE [8], [9], a method for classical distributed optimization, to handle the practical constraints of federated learning. We provide convergence guarantees for this method when learning over both convex and non-convex functions. Despite encouraging theoretical results, we find that the method has underwhelming performance empirically. In particular, through empirical simulations on both synthetic and real-world datasets, FedDANE consistently underperforms baselines of FedAvg [7] and FedProx [4] in realistic federated settings. We identify low device participation and statistical device heterogeneity as two underlying causes of this underwhelming performance, and conclude by suggesting several directions of future work.
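
For orientation, DANE-style methods have each selected device solve a regularized local subproblem built around a global gradient estimate; a simplified form of such a subproblem, with the full gradient that DANE uses replaced by an estimate $g^t$ aggregated from a sampled subset of devices and with constants suppressed, is

$\min_{w}\; F_k(w) - \left(\nabla F_k(w^t) - g^t\right)^{\top} w + \frac{\mu}{2}\left\|w - w^t\right\|^2,$

where $F_k$ is device $k$'s local objective, $w^t$ is the current global model, and $\mu \ge 0$ is a proximal parameter; see the paper for the exact weighting of the gradient-correction term.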

56 citations


Proceedings ArticleDOI
01 Dec 2019
TL;DR: MATCHA, as discussed by the authors, uses matching decomposition sampling of the base topology to parallelize inter-worker information exchange, significantly reducing communication delay while communicating more frequently over critical links so as to maintain the same convergence rate as vanilla decentralized SGD.
Abstract: Decentralized stochastic gradient descent (SGD) is a promising approach to learn a machine learning model over a network of workers connected in an arbitrary topology. Although a densely-connected network topology can ensure faster convergence in terms of iterations, it incurs more communication time/delay per iteration, resulting in longer training time. In this paper, we propose a novel algorithm MATCHA to achieve a win-win in this error-runtime trade-off. MATCHA uses matching decomposition sampling of the base topology to parallelize inter-worker information exchange so as to significantly reduce communication delay. At the same time, the algorithm communicates more frequently over critical links such that it can maintain the same convergence rate as vanilla decentralized SGD. Experiments on a suite of datasets and deep neural networks validate the theoretical analysis and demonstrate the effectiveness of the proposed scheme as far as reducing communication delays is concerned.

45 citations


Proceedings Article
11 Apr 2019
TL;DR: A zeroth order Frank-Wolfe algorithm is proposed, which, in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm, makes it gradient free, and it is shown that the proposed algorithm converges to the optimal objective value at a rate of $O\left(1/T^{1/3}\right)$, where $T$ denotes the iteration count.
Abstract: This paper focuses on the problem of constrained stochastic optimization. A zeroth order Frank-Wolfe algorithm is proposed, which, in addition to the projection-free nature of the vanilla Frank-Wolfe algorithm, makes it gradient free. Under convexity and smoothness assumptions, we show that the proposed algorithm converges to the optimal objective value at a rate of $O\left(1/T^{1/3}\right)$, where $T$ denotes the iteration count. In particular, the primal sub-optimality gap is shown to have a dimension dependence of $O\left(d^{1/3}\right)$, which is the best known dimension dependence among all zeroth order optimization algorithms with one directional derivative per iteration. For non-convex functions, we obtain the Frank-Wolfe gap to be $O\left(d^{1/3}T^{-1/4}\right)$. Experiments on black-box optimization setups demonstrate the efficacy of the proposed algorithm.
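
To make the gradient-free and projection-free structure concrete, here is a toy zeroth order Frank-Wolfe loop over an $\ell_1$ ball that uses one random directional finite difference per iteration; the averaging and step-size schedules are illustrative placeholders rather than the schedules analyzed in the paper.

```python
import numpy as np

def zeroth_order_fw(loss, dim, radius=1.0, iters=200, delta=1e-3, seed=0):
    """Toy zeroth order Frank-Wolfe over an l1 ball of the given radius,
    using one directional finite difference per iteration and an averaged
    gradient surrogate (all schedules are illustrative placeholders)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    g_avg = np.zeros(dim)
    for t in range(1, iters + 1):
        u = rng.standard_normal(dim)
        u /= np.linalg.norm(u)
        # One directional derivative estimate per iteration (gradient free).
        g_hat = dim * (loss(x + delta * u) - loss(x)) / delta * u
        rho = 1.0 / np.sqrt(t)            # averaging weight (illustrative)
        g_avg = (1 - rho) * g_avg + rho * g_hat
        # Linear minimization oracle for the l1 ball: a signed vertex.
        i = int(np.argmax(np.abs(g_avg)))
        v = np.zeros(dim)
        v[i] = -radius * np.sign(g_avg[i])
        gamma = 2.0 / (t + 2)             # projection-free Frank-Wolfe step
        x = (1 - gamma) * x + gamma * v
    return x

# Example: minimize a smooth quadratic using only function evaluations.
target = np.linspace(-0.5, 0.5, 20)
x_hat = zeroth_order_fw(lambda z: float(np.sum((z - target) ** 2)), dim=20)
```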

31 citations


Journal ArticleDOI
TL;DR: The unique characteristics and challenges of federated learning are discussed, a broad overview of current approaches is provided, and several directions of future work that are relevant to a wide range of research communities are outlined.
Abstract: Federated learning involves training statistical models over remote devices or siloed data centers, such as mobile phones or hospitals, while keeping data localized. Training in heterogeneous and potentially massive networks introduces novel challenges that require a fundamental departure from standard approaches for large-scale machine learning, distributed optimization, and privacy-preserving data analysis. In this article, we discuss the unique characteristics and challenges of federated learning, provide a broad overview of current approaches, and outline several directions of future work that are relevant to a wide range of research communities.

27 citations


Posted Content
TL;DR: This work uses Bayesian optimization (BO) to develop query-efficient adversarial attacks tailored to scenarios with low query budgets, and alleviates the difficulty of applying BO to high-dimensional deep learning models via effective dimension upsampling techniques.
Abstract: We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples using information limited to loss function evaluations of input-output pairs. We use Bayesian optimization (BO) to specifically cater to scenarios involving low query budgets to develop query-efficient adversarial attacks. We alleviate the issues surrounding BO with regard to optimizing high-dimensional deep learning models by effective dimension upsampling techniques. Our proposed approach achieves performance comparable to state-of-the-art black-box adversarial attacks, albeit with a much lower average query count. In particular, in low query budget regimes, our proposed method reduces the query count by up to $80\%$ with respect to state-of-the-art methods.
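
A minimal sketch of the low-query-budget idea, assuming a scikit-learn Gaussian process as the BO surrogate: optimize a coarse, low-dimensional perturbation by expected improvement and tile it up to image resolution before querying the model. The helper names (attack_loss, upsample), the candidate-sampling scheme, and the kernel choice are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def upsample(z, factor):
    """Tile a coarse square perturbation up to image resolution
    (one simple choice of dimension upsampling)."""
    side = int(np.sqrt(z.size))
    return np.kron(z.reshape(side, side), np.ones((factor, factor)))

def bo_attack(attack_loss, coarse_dim=16, eps=0.1, budget=50, n_init=10, seed=0):
    """Toy BO attack: fit a GP to loss evaluations of low-dimensional
    perturbations and choose the next query by expected improvement
    over random candidates."""
    rng = np.random.default_rng(seed)
    Z = rng.uniform(-eps, eps, size=(n_init, coarse_dim))
    y = np.array([attack_loss(z) for z in Z])
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    for _ in range(budget - n_init):
        gp.fit(Z, y)
        cand = rng.uniform(-eps, eps, size=(256, coarse_dim))
        mu, sd = gp.predict(cand, return_std=True)
        gap = y.min() - mu                      # improvement over best query so far
        ei = gap * norm.cdf(gap / (sd + 1e-9)) + sd * norm.pdf(gap / (sd + 1e-9))
        z_next = cand[np.argmax(ei)]
        Z = np.vstack([Z, z_next])
        y = np.append(y, attack_loss(z_next))
    return Z[np.argmin(y)]
```

Here attack_loss(z) is assumed to add upsample(z, factor) to the image, query the victim model, and return a margin-style loss that is smaller when the input is closer to being misclassified.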

22 citations


Posted Content
TL;DR: In this paper, a distributed stochastic optimization problem is studied: minimizing a sum of smooth and strongly-convex local cost functions over a network of agents communicating over a strongly-connected graph.
Abstract: In this paper, we study distributed stochastic optimization to minimize a sum of smooth and strongly-convex local cost functions over a network of agents, communicating over a strongly-connected graph. Assuming that each agent has access to a stochastic first-order oracle ($\mathcal{SFO}$), we propose a novel distributed method, called $\mathcal{S}$-$\mathcal{AB}$, where each agent uses an auxiliary variable to asymptotically track the gradient of the global cost in expectation. The $\mathcal{S}$-$\mathcal{AB}$ algorithm employs row- and column-stochastic weights simultaneously to ensure both consensus and optimality. Since doubly-stochastic weights are not used, $\mathcal{S}$-$\mathcal{AB}$ is applicable to arbitrary strongly-connected graphs. We show that under a sufficiently small constant step-size, $\mathcal{S}$-$\mathcal{AB}$ converges linearly (in expected mean-square sense) to a neighborhood of the global minimizer. We present numerical simulations based on real-world data sets to illustrate the theoretical results.

12 citations


Posted Content
TL;DR: This work considers the setting of batch active learning, in which multiple samples are selected (as opposed to a single sample, as in classical settings) to reduce the training overhead, and bridges between uniform randomness and score-based importance sampling of clusters when selecting a batch of new samples.
Abstract: We study the problem of training machine learning models incrementally using active learning with access to imperfect or noisy oracles. We specifically consider the setting of batch active learning, in which multiple samples are selected as opposed to a single sample as in classical settings, so as to reduce the training overhead. Our approach bridges between uniform randomness and score-based importance sampling of clusters when selecting a batch of new samples. Experiments on benchmark image classification datasets (MNIST, SVHN, and CIFAR10) show improvement over existing active learning strategies. We introduce an extra denoising layer to deep networks to make active learning robust to label noise and show significant improvements.
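
A minimal sketch of the blended selection rule, under assumed clustering and scoring choices: cluster the unlabeled pool, then draw clusters from a mixture of a uniform distribution and a score-weighted importance distribution before picking members to label. The mixing weight, KMeans clustering, and mean-score aggregation below are illustrative placeholders, not the paper's exact acquisition rule.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_batch(features, scores, batch_size, n_clusters=10, mix=0.5, seed=0):
    """Pick a batch by sampling clusters from a blend of uniform and
    score-based (importance) probabilities; `scores` are assumed to be
    nonnegative acquisition scores (e.g., predictive uncertainty)."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(features)
    cluster_scores = np.array([scores[labels == c].mean()
                               for c in range(n_clusters)])
    importance = cluster_scores / cluster_scores.sum()
    uniform = np.full(n_clusters, 1.0 / n_clusters)
    probs = mix * uniform + (1 - mix) * importance
    # Sample clusters (with replacement, for simplicity) and one member each.
    chosen = []
    for c in rng.choice(n_clusters, size=batch_size, p=probs):
        members = np.flatnonzero(labels == c)
        chosen.append(int(rng.choice(members)))
    return chosen
```

Setting mix=1 recovers uniform sampling of clusters, while mix=0 is purely score-driven importance sampling; the approach described above interpolates between these extremes.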

5 citations


Posted Content
TL;DR: This work incorporates model uncertainty into the sampling probability to compensate for poor estimation of the importance scores when the training data is too small to build a meaningful model.
Abstract: We study the problem of training machine learning models incrementally with batches of samples annotated with noisy oracles. We select each batch of samples to be both important and diverse via clustering and importance sampling. More importantly, we incorporate model uncertainty into the sampling probability to compensate for poor estimation of the importance scores when the training data is too small to build a meaningful model. Experiments on benchmark image classification datasets (MNIST, SVHN, CIFAR10, and EMNIST) show improvement over existing active learning strategies. We introduce an extra denoising layer to deep networks to make active learning robust to label noise and show significant improvements.

4 citations


Proceedings ArticleDOI
01 Nov 2019
TL;DR: This paper proposes a decentralized stochastic gradient method with gradient tracking, and shows that the proposed algorithm converges linearly to an error ball around the optimal solution with a constant step-size.
Abstract: In this paper, we present stochastic optimization for empirical risk minimization over directed graphs. Using a novel information fusion approach that utilizes both row- and column-stochastic weights simultaneously, we propose $\mathcal{S}$-$\mathcal{AB}$, a decentralized stochastic gradient method with gradient tracking, and show that the proposed algorithm converges linearly to an error ball around the optimal solution with a constant step-size. We provide a sketch of the convergence analysis as well as the generalization of the proposed algorithm. Finally, we illustrate the theoretical results with the help of experiments with real data.

Proceedings ArticleDOI
01 Jul 2019
TL;DR: The directed $\mathcal{D}$-$\mathcal{CREDO}$ distributed recursive estimator further dramatically improves communication efficiency, achieving a communication MSE rate of $O(1/C_{t}^{\kappa})$ with arbitrarily high exponent $\kappa$, while keeping the order-optimal $O(1/t)$ sample-wise MSE rate.
Abstract: Recently, a communication efficient recursive distributed estimator, $\mathcal{CREDO}$, has been proposed, that utilizes increasingly sparse randomized bidirectional communications. $\mathcal{CREDO}$ achieves an order-optimal $O(1/t)$ mean square error (MSE) rate in the number of per-node processed samples $t$, and an $O(1/C_{t}^{2-\zeta})$ MSE rate in the number of per-node communications $C_{t}$, where $\zeta > 0$ is arbitrarily small. In this paper, we present directed $\mathcal{CREDO}$, $\mathcal{D}$-$\mathcal{CREDO}$ for short, a distributed recursive estimator that utilizes directed increasingly sparse communications. We show that $\mathcal{D}$-$\mathcal{CREDO}$ further dramatically improves communication efficiency, achieving a communication MSE rate of $O(1/C_{t}^{\kappa})$ with arbitrarily high exponent $\kappa$, while keeping the order-optimal $O(1/t)$ sample-wise MSE rate. Numerical examples on real data sets confirm our results.
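
As a purely illustrative toy of increasingly sparse randomized communications (undirected here for simplicity, unlike the directed links used by $\mathcal{D}$-$\mathcal{CREDO}$), the sketch below runs a distributed recursive mean estimator in which each link is active with a probability that decays over time, so the per-node communication count grows sublinearly in $t$; all step-size and activation schedules are placeholder choices, not those analyzed in the paper.

```python
import numpy as np

def sparse_comm_estimator(samples, adjacency, iters=2000, seed=0):
    """Toy distributed recursive estimation of a common mean with
    increasingly sparse randomized link activations (illustrative only)."""
    rng = np.random.default_rng(seed)
    n = adjacency.shape[0]
    x = np.zeros(n)
    for t in range(1, iters + 1):
        p_t = min(1.0, t ** -0.3)          # decaying link-activation probability
        mask = rng.random((n, n)) < p_t
        active = np.triu(adjacency * mask, 1)
        active = active + active.T         # undirected toy version
        deg = active.sum(axis=1)
        # Consensus step over the randomly activated links.
        x = x + (0.5 / n) * (active @ x - deg * x)
        # Innovation step with one new noisy sample per node.
        x = x + (1.0 / t) * (samples[:, (t - 1) % samples.shape[1]] - x)
    return x

# Toy usage: 8 nodes on a ring estimating a common mean of 1.0.
ring = np.roll(np.eye(8), 1, axis=1) + np.roll(np.eye(8), -1, axis=1)
obs = 1.0 + np.random.default_rng(1).standard_normal((8, 2000))
print(sparse_comm_estimator(obs, ring))
```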


Posted Content
TL;DR: This work proposes to capture correlations within gradients of the loss function with respect to the input images via a Gaussian Markov random field (GMRF), and shows that the covariance structure can be efficiently represented using the Fast Fourier Transform (FFT), along with low-rank updates to perform exact posterior estimation under this model.
Abstract: We study the problem of generating adversarial examples in a black-box setting, where we only have access to a zeroth order oracle, providing us with loss function evaluations. Although this setting has been investigated in previous work, most past approaches using zeroth order optimization implicitly assume that the gradients of the loss function with respect to the input images are unstructured. In this work, we show that in fact substantial correlations exist within these gradients, and we propose to capture these correlations via a Gaussian Markov random field (GMRF). Given the intractability of the explicit covariance structure of the MRF, we show that the covariance structure can be efficiently represented using the Fast Fourier Transform (FFT), along with low-rank updates to perform exact posterior estimation under this model. We use this modeling technique to find fast one-step adversarial attacks, akin to a black-box version of the Fast Gradient Sign Method (FGSM), and show that the method uses fewer queries and achieves higher attack success rates than the current state of the art. We also highlight the general applicability of this gradient modeling setup.
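
To make the FFT remark concrete, the toy sketch below assumes a stationary nearest-neighbour GMRF on a periodic image lattice, so that the 2-D FFT diagonalizes its precision, and draws a spatially correlated "gradient field" by scaling white noise in the Fourier domain. The coupling parameters and the periodic-boundary assumption are illustrative, and the low-rank posterior updates described in the paper are not shown.

```python
import numpy as np

def gmrf_spectrum(shape, alpha=1.0, tau=0.1):
    """Eigenvalues of a stationary lattice-GMRF precision with
    nearest-neighbour coupling and periodic boundaries: tau*I plus
    alpha times the torus graph Laplacian, diagonalized by the 2-D FFT."""
    h, w = shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    return tau + 2.0 * alpha * (2.0 - np.cos(2 * np.pi * fy) - np.cos(2 * np.pi * fx))

def sample_gmrf(shape, rng):
    """Draw a correlated sample from the GMRF prior: scale white noise by
    the square root of the covariance spectrum in the Fourier domain."""
    cov_spectrum = 1.0 / gmrf_spectrum(shape)
    noise = np.fft.fft2(rng.standard_normal(shape))
    return np.fft.ifft2(np.sqrt(cov_spectrum) * noise).real

rng = np.random.default_rng(0)
field = sample_gmrf((32, 32), rng)   # smooth, spatially correlated draw
```

The same diagonalization is what makes covariance-vector products cheap: multiplying by the (block-)circulant covariance reduces to an FFT, an elementwise scaling by the spectrum, and an inverse FFT.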