Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

Open Access Proceedings Article

TLDR

Deep Gradient Compression (DGC) employs momentum correction, local gradient clipping, momentum factor masking, and warm-up training to preserve accuracy during compression, achieving gradient compression ratios from 270x to 600x without losing accuracy.

Abstract:

Large-scale distributed training requires significant communication bandwidth for gradient exchange that limits the scalability of multi-node training, and requires expensive high-bandwidth network infrastructure. The situation gets even worse with distributed training on mobile devices (federated learning), which suffers from higher latency, lower throughput, and intermittent poor connections. In this paper, we find 99.9% of the gradient exchange in distributed SGD is redundant, and propose Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth. To preserve accuracy during compression, DGC employs four methods: momentum correction, local gradient clipping, momentum factor masking, and warm-up training. We have applied Deep Gradient Compression to image classification, speech recognition, and language modeling with multiple datasets including Cifar10, ImageNet, Penn Treebank, and Librispeech Corpus. On these scenarios, Deep Gradient Compression achieves a gradient compression ratio from 270x to 600x without losing accuracy, cutting the gradient size of ResNet-50 from 97MB to 0.35MB, and for DeepSpeech from 488MB to 0.74MB. Deep gradient compression enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile.
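The abstract names the techniques but does not spell out the update rule. The following is a minimal sketch of one worker-side step, assuming top-k magnitude selection at 99.9% sparsity together with momentum correction and momentum factor masking. The function name `dgc_step`, its parameters, and the dictionary used for the sparse update are hypothetical choices for illustration, not the authors' implementation. Local gradient clipping and warm-up training are omitted.

```python
def dgc_step(grad, velocity, accum, momentum=0.9, sparsity=0.999):
    """One worker-side compression step (illustrative sketch only).

    grad, velocity, accum are equal-length lists of floats.
    Returns (sparse_update, velocity, accum), where sparse_update maps
    coordinate index -> value for the entries that would be communicated.
    """
    n = len(grad)

    # Momentum correction (assumed form): apply momentum locally, then
    # accumulate the resulting velocity instead of the raw gradient.
    velocity = [momentum * u + g for u, g in zip(velocity, grad)]
    accum = [v + u for v, u in zip(accum, velocity)]

    # Keep only the largest-magnitude (1 - sparsity) fraction, at least one.
    k = max(1, round(n * (1.0 - sparsity)))
    top = sorted(range(n), key=lambda i: abs(accum[i]), reverse=True)[:k]
    sparse_update = {i: accum[i] for i in top}

    # Momentum factor masking (assumed form): clear the sent coordinates in
    # both buffers so stale momentum does not carry over to later steps.
    for i in top:
        accum[i] = 0.0
        velocity[i] = 0.0

    return sparse_update, velocity, accum


if __name__ == "__main__":
    grad = [0.1, -5.0, 0.2, 0.3]
    zeros = [0.0] * len(grad)
    update, vel, acc = dgc_step(grad, zeros, zeros, sparsity=0.75)
    print(update)  # only the largest-magnitude coordinate is sent
```

Coordinates that are not sent stay in `accum` and keep growing across steps until they are large enough to be selected, so small gradients are delayed rather than discarded.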

Citations

Posted Content

Federated Learning with Non-IID Data.

Yue Zhao, +5 more

- 02 Jun 2018 -

arXiv: Learning

TL;DR: This work presents a strategy to improve training on non-IID data by creating a small subset of data which is globally shared between all the edge devices, and shows that accuracy can be increased by 30% for the CIFAR-10 dataset with only 5% globally shared data.

Posted Content

Advances and Open Problems in Federated Learning

Peter Kairouz, +58 more

- 10 Dec 2019 -

arXiv: Learning

TL;DR: Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.

Proceedings Article

Exploiting Unintended Feature Leakage in Collaborative Learning

Luca Melis, +3 more

TL;DR: This work proposes passive and active inference attacks that exploit the leakage of information about participants' training data in federated learning. A participant can infer the presence of exact data points, as well as properties that hold only for a subset of the training data and are independent of the properties of the joint model.

Journal Article

Edge Intelligence: Paving the Last Mile of Artificial Intelligence With Edge Computing

Zhi Zhou, +5 more

TL;DR: This paper presents a comprehensive survey of recent research efforts on edge intelligence. The authors review the background and motivation for AI running at the network edge and provide an overview of the overarching architectures, frameworks, and emerging key technologies for training and inference of deep learning models at the edge.

Journal Article

Federated Learning in Mobile Edge Networks: A Comprehensive Survey

Wei Yang Bryan Lim, +7 more

- 08 Apr 2020 -

IEEE Communications Surveys and Tutorial...

TL;DR: Federated learning (FL) has been proposed to enable collaborative training of an ML model and to enable DL for mobile edge network optimization in large-scale and complex mobile edge networks, where heterogeneous devices with varying constraints are involved.
