Open Access · Posted Content
FedPAGE: A Fast Local Stochastic Gradient Method for Communication-Efficient Federated Learning
TL;DR: In this paper, the authors propose a new federated learning algorithm, FedPAGE, which further reduces the communication complexity by utilizing the recent optimal PAGE method (Li et al., 2021) instead of plain SGD in FedAvg.
Abstract:
Federated Averaging (FedAvg, also known as Local-SGD) (McMahan et al., 2017) is a classical federated learning algorithm in which clients run multiple local SGD steps before communicating their update to an orchestrating server. We propose a new federated learning algorithm, FedPAGE, which further reduces the communication complexity by utilizing the recent optimal PAGE method (Li et al., 2021) instead of plain SGD in FedAvg. We show that FedPAGE requires far fewer communication rounds than previous local methods for both federated convex and nonconvex optimization. Concretely, 1) in the convex setting, the number of communication rounds of FedPAGE is $O(\frac{N^{3/4}}{S\epsilon})$, improving the best-known result $O(\frac{N}{S\epsilon})$ of SCAFFOLD (Karimireddy et al., 2020) by a factor of $N^{1/4}$, where $N$ is the total number of clients (usually very large in federated learning), $S$ is the size of the sampled subset of clients in each communication round, and $\epsilon$ is the target error; 2) in the nonconvex setting, the number of communication rounds of FedPAGE is $O(\frac{\sqrt{N}+S}{S\epsilon^2})$, improving the best-known result $O(\frac{N^{2/3}}{S^{2/3}\epsilon^2})$ of SCAFFOLD (Karimireddy et al., 2020) by a factor of $N^{1/6}S^{1/3}$, provided the number of sampled clients satisfies $S\leq \sqrt{N}$. Note that in both settings, the communication cost per round is the same for FedPAGE and SCAFFOLD. As a result, FedPAGE achieves new state-of-the-art results in terms of communication complexity for both federated convex and nonconvex optimization.
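To make the structure described above concrete, here is a minimal NumPy sketch of a FedAvg-style loop in which each sampled client runs local steps with a PAGE-style gradient estimator instead of plain SGD. The toy quadratic losses, step sizes, minibatch size, and the probability of refreshing the full local gradient are illustrative assumptions; the precise FedPAGE algorithm, its parameter choices, and its analysis are given in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: client i holds a quadratic loss f_i(x) = 0.5 * ||A_i x - b_i||^2 / m.
N, d, m, S, local_steps, rounds, lr = 20, 10, 30, 5, 10, 50, 0.05
A = [rng.standard_normal((m, d)) for _ in range(N)]
b = [rng.standard_normal(m) for _ in range(N)]

def grad(i, x, idx=None):
    """(Mini-batch) gradient of client i's loss at x."""
    Ai, bi = (A[i], b[i]) if idx is None else (A[i][idx], b[i][idx])
    return Ai.T @ (Ai @ x - bi) / len(bi)

def local_page_steps(i, x, p=0.5, batch=8):
    """Local steps with a PAGE-style gradient estimator in place of plain SGD."""
    g = grad(i, x)                                  # start from a full local gradient
    for _ in range(local_steps):
        x_new = x - lr * g
        idx = rng.choice(m, batch, replace=False)
        if rng.random() < p:                        # with prob. p, refresh the full gradient
            g = grad(i, x_new)
        else:                                       # otherwise reuse g plus a cheap correction
            g = g + grad(i, x_new, idx) - grad(i, x, idx)
        x = x_new
    return x

x_server = np.zeros(d)
for _ in range(rounds):
    clients = rng.choice(N, S, replace=False)       # sample S of the N clients
    updates = [local_page_steps(i, x_server) - x_server for i in clients]
    x_server = x_server + np.mean(updates, axis=0)  # server averages the client updates

loss = np.mean([0.5 * np.linalg.norm(A[i] @ x_server - b[i]) ** 2 / m for i in range(N)])
print(f"final average loss: {loss:.4f}")
```

The point of the PAGE-style estimator is that, most of the time, a client only pays for a small correction minibatch and reuses its previous gradient estimate, rather than recomputing a fresh stochastic gradient at every local step.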
Citations
Posted Content
ZeroSARAH: Efficient Nonconvex Finite-Sum Optimization with Zero Full Gradient Computation
TL;DR: ZeroSARAH is a variant of the variance-reduced method SARAH (Nguyen et al., 2017) for minimizing the average of a large number of nonconvex functions without ever computing a full gradient.
Posted Content
EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback
TL;DR: This article proposes six practical extensions of EF21, all supported by strong convergence theory: partial participation, stochastic approximation, variance reduction, proximal setting, momentum, and bidirectional compression.
References
Journal Article (DOI)
LIBSVM: A library for support vector machines
Chih-Chung Chang, Chih-Jen Lin
TL;DR: Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates, and parameter selection are discussed in detail.
Proceedings Article
Communication-Efficient Learning of Deep Networks from Decentralized Data
TL;DR: In this paper, the authors present a decentralized approach for federated learning of deep networks based on iterative model averaging, and conduct an extensive empirical evaluation considering five different model architectures and four datasets.
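As a rough illustration of the iterative model averaging step mentioned in this summary, the sketch below averages client parameters weighted by local dataset size. The function name and weighting scheme follow the common description of FedAvg and are assumptions, not code from the paper.

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """Weighted average of client model parameters, proportional to local data size.

    client_weights: list of 1-D NumPy arrays (flattened model parameters).
    client_sizes:   number of local examples held by each client.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    w = sizes / sizes.sum()                 # n_k / n weighting, as in the usual FedAvg description
    return sum(wk * theta for wk, theta in zip(w, client_weights))

# Example: three clients with different amounts of local data.
models = [np.array([1.0, 2.0]), np.array([3.0, 0.0]), np.array([0.0, 1.0])]
print(federated_average(models, client_sizes=[100, 50, 50]))  # -> [1.25 1.25]
```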
Posted Content
Federated Learning: Strategies for Improving Communication Efficiency
Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, Dave Bacon, et al.
TL;DR: Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
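The following sketch illustrates the "sketched updates" idea on a toy update vector, combining random subsampling with coarse uniform quantization. The function names, compression ratio, and quantization levels are illustrative assumptions; the paper additionally considers random rotations and structured (low-rank or masked) updates.

```python
import numpy as np

rng = np.random.default_rng(0)

def sketch_update(update, keep_frac=0.1, levels=16):
    """Compress a model update by random subsampling plus uniform quantization.

    Returns the kept indices, quantized values, and the scale needed to decode.
    Rough illustration only: the actual schemes also use random rotations.
    """
    k = max(1, int(keep_frac * update.size))
    idx = rng.choice(update.size, size=k, replace=False)          # random subsampling mask
    vals = update[idx]
    scale = np.abs(vals).max() or 1.0
    q = np.round((vals / scale) * (levels // 2)).astype(np.int8)  # coarse quantization
    return idx, q, scale

def unsketch_update(idx, q, scale, size, keep_frac=0.1, levels=16):
    """Decode on the server: rescale, and divide by keep_frac to correct for subsampling."""
    out = np.zeros(size)
    out[idx] = (q.astype(float) / (levels // 2)) * scale / keep_frac
    return out

u = rng.standard_normal(1000)
idx, q, scale = sketch_update(u)
u_hat = unsketch_update(idx, q, scale, u.size)
print("compression ratio ~", u.size / q.size,
      "relative error:", np.linalg.norm(u - u_hat) / np.linalg.norm(u))
```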
Posted Content
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He, et al.
TL;DR: This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization, enabling the training of visual recognition models on internet-scale data with high efficiency.
Proceedings Article
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
Rie Johnson, Tong Zhang
TL;DR: It is proved that this method enjoys the same fast convergence rate as stochastic dual coordinate ascent (SDCA) and stochastic average gradient (SAG), but the analysis is significantly simpler and more intuitive.
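For context, the sketch below shows the variance-reduced (SVRG-style) gradient update that this reference introduces, on a toy least-squares problem. The objective, step size, and snapshot schedule are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: f(x) = (1/n) * sum_i 0.5 * (a_i^T x - b_i)^2.
n, d = 200, 10
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def grad_i(x, i):
    """Stochastic gradient of the i-th component function."""
    return (A[i] @ x - b[i]) * A[i]

x_snapshot, lr, epochs = np.zeros(d), 0.01, 10
for _ in range(epochs):
    mu = A.T @ (A @ x_snapshot - b) / n        # full gradient at the snapshot point
    x = x_snapshot.copy()
    for _ in range(n):                         # one inner pass of n stochastic steps
        i = rng.integers(n)
        # Variance-reduced gradient: stochastic gradient corrected by the snapshot.
        v = grad_i(x, i) - grad_i(x_snapshot, i) + mu
        x -= lr * v
    x_snapshot = x                             # use the last iterate as the new snapshot

print("final loss:", 0.5 * np.mean((A @ x_snapshot - b) ** 2))
```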