Open Access · Posted Content

CANITA: Faster Rates for Distributed Convex Optimization with Communication Compression

TLDR
In this paper, the authors proposed CANITA, which combines the benefits of communication compression and convergence acceleration for distributed optimization, and achieved the first accelerated convergence rate of $O\big(\sqrt{\big(1+\sqrt{\frac{\omega^3}{n}}\big)\frac{L}{\epsilon}} + \omega\big(\frac{1}{\epsilon}\big)^{\frac{1}{3}}\big)$, which improves upon the state-of-the-art non-accelerated rate of DIANA for general convex problems.
Abstract
Due to the high communication cost in distributed and federated learning, methods relying on compressed communication are becoming increasingly popular. Moreover, the best theoretically and practically performing gradient-type methods invariably rely on some form of acceleration/momentum to reduce the number of communications (faster convergence), e.g., Nesterov's accelerated gradient descent (Nesterov, 2004) and Adam (Kingma and Ba, 2014). In order to combine the benefits of communication compression and convergence acceleration, we propose a \emph{compressed and accelerated} gradient method for distributed optimization, which we call CANITA. Our CANITA achieves the \emph{first accelerated rate} $O\bigg(\sqrt{\Big(1+\sqrt{\frac{\omega^3}{n}}\Big)\frac{L}{\epsilon}} + \omega\big(\frac{1}{\epsilon}\big)^{\frac{1}{3}}\bigg)$, which improves upon the state-of-the-art non-accelerated rate $O\left((1+\frac{\omega}{n})\frac{L}{\epsilon} + \frac{\omega^2+n}{\omega+n}\frac{1}{\epsilon}\right)$ of DIANA (Khaled et al., 2020b) for distributed general convex problems, where $\epsilon$ is the target error, $L$ is the smoothness parameter of the objective, $n$ is the number of machines/devices, and $\omega$ is the compression parameter (larger $\omega$ means more compression can be applied, and no compression implies $\omega=0$). Our results show that as long as the number of devices $n$ is large (often true in distributed/federated learning), or the compression $\omega$ is not very high, CANITA achieves the faster convergence rate $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$, i.e., the number of communication rounds is $O\Big(\sqrt{\frac{L}{\epsilon}}\Big)$ (vs. $O\big(\frac{L}{\epsilon}\big)$ achieved by previous works). As a result, CANITA enjoys the advantages of both compression (compressed communication in each round) and acceleration (far fewer communication rounds).
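To make the regimes in the abstract concrete, the dominant terms of the two rates can be compared directly; the short derivation below is a sketch based only on the rates stated above.

$$
\underbrace{O\bigg(\sqrt{\Big(1+\sqrt{\tfrac{\omega^3}{n}}\Big)\tfrac{L}{\epsilon}} + \omega\big(\tfrac{1}{\epsilon}\big)^{\frac{1}{3}}\bigg)}_{\text{CANITA (accelerated)}}
\qquad \text{vs.} \qquad
\underbrace{O\left(\Big(1+\tfrac{\omega}{n}\Big)\tfrac{L}{\epsilon} + \tfrac{\omega^2+n}{\omega+n}\tfrac{1}{\epsilon}\right)}_{\text{DIANA (non-accelerated)}}
$$

If $\omega^3 \le n$ (many devices or mild compression), then $1+\sqrt{\omega^3/n}=O(1)$, so the first CANITA term is $O\big(\sqrt{L/\epsilon}\big)$; and since $(1/\epsilon)^{1/3}$ grows more slowly than $(1/\epsilon)^{1/2}$ as $\epsilon \to 0$, the second term is of lower order. The total number of communication rounds is therefore $O\big(\sqrt{L/\epsilon}\big)$, compared with $O\big(\frac{L}{\epsilon}\big)$ for the non-accelerated rate.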


Citations
Journal Article

ANITA: An Optimal Loopless Accelerated Variance-Reduced Gradient Method.

TL;DR: In this article, a novel accelerated variance-reduced gradient method called ANITA was proposed for finite-sum optimization, which achieves optimal convergence results for both general convex and strongly convex problems.
Posted Content

FedPAGE: A Fast Local Stochastic Gradient Method for Communication-Efficient Federated Learning

TL;DR: In this paper, the authors proposed a new federated learning algorithm, FedPAGE, which further reduces the communication complexity by utilizing the recent optimal PAGE method (Li et al., 2021) instead of plain SGD in FedAvg.
Posted Content

EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback

TL;DR: This article proposes six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, the proximal setting, momentum, and bidirectional compression.
References
Proceedings Article

Adam: A Method for Stochastic Optimization

TL;DR: This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
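Since the CANITA abstract above cites Adam as a representative acceleration/momentum method, a minimal sketch of the update described in this TL;DR may help: exponential moving averages of the first and second moments of the gradient with bias correction. The function and parameter names below are illustrative, not the paper's reference implementation.

import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Adaptive estimates of lower-order moments (Kingma and Ba, 2014): running
    # exponential averages of the gradient (first moment) and its square (second moment).
    m = beta1 * m + (1 - beta1) * grad          # biased first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # biased second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction (t is the step count, t >= 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v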
Journal Article

LIBSVM: A library for support vector machines

TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Book

Introductory Lectures on Convex Optimization: A Basic Course

TL;DR: A polynomial-time interior-point method for linear optimization was proposed in this paper; its importance lay not only in its complexity bound, but also in the fact that the theoretical prediction of its high efficiency was supported by excellent computational results.
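The CANITA abstract cites this book for Nesterov's accelerated gradient descent; a minimal sketch of the standard accelerated scheme for an $L$-smooth convex objective is given below. It is an illustration under those assumptions (the gradient oracle grad_f and the iteration budget are placeholders), not the book's exact presentation.

import numpy as np

def accelerated_gradient_descent(grad_f, x0, L, iters=100):
    # Basic accelerated gradient method for an L-smooth convex objective:
    # take the gradient step at an extrapolated (momentum) point y rather than at x.
    x_prev = np.asarray(x0, dtype=float)
    x = x_prev.copy()
    t_prev = 1.0
    for _ in range(iters):
        t = (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0
        y = x + ((t_prev - 1.0) / t) * (x - x_prev)   # extrapolation (momentum) step
        x_prev, x = x, y - (1.0 / L) * grad_f(y)      # gradient step of size 1/L at y
        t_prev = t
    return x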
Proceedings Article

Communication-Efficient Learning of Deep Networks from Decentralized Data

TL;DR: In this paper, the authors presented a decentralized approach for federated learning of deep networks based on iterative model averaging, and conducted an extensive empirical evaluation considering five different model architectures and four datasets.
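A minimal sketch of the iterative model averaging described above: in each round the server sends the current model to the participating clients, each client runs a few local gradient steps, and the server averages the returned models weighted by local data size. The helper local_sgd_step and the data representation are illustrative assumptions, not the paper's code.

import numpy as np

def fedavg_round(global_weights, client_datasets, local_sgd_step, lr=0.1, local_steps=5):
    # One communication round of FedAvg-style iterative model averaging.
    local_models, sizes = [], []
    for data in client_datasets:
        w = global_weights.copy()
        for _ in range(local_steps):
            w = local_sgd_step(w, data, lr)    # one local (mini-batch) gradient step
        local_models.append(w)
        sizes.append(len(data))
    total = float(sum(sizes))
    # Weighted average of the client models, weights proportional to local data size.
    return sum((n_i / total) * w_i for n_i, w_i in zip(sizes, local_models))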
Posted Content

Federated Learning: Strategies for Improving Communication Efficiency

TL;DR: Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
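One standard instance of such subsampling-style compressed updates, which also fits the unbiased-compression setting with parameter $\omega$ used in the CANITA abstract above, is rand-$k$ sparsification; the sketch below is illustrative and not the scheme proposed in this paper.

import numpy as np

def rand_k_compress(x, k, rng=None):
    # Unbiased rand-k sparsification: keep k random coordinates and rescale by d/k,
    # so that E[C(x)] = x and E||C(x) - x||^2 = (d/k - 1) ||x||^2, i.e. omega = d/k - 1.
    rng = rng or np.random.default_rng()
    d = x.shape[0]
    idx = rng.choice(d, size=k, replace=False)   # random support of size k
    out = np.zeros(d)
    out[idx] = (d / k) * x[idx]                  # rescaling keeps the estimate unbiased
    return out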