Proceedings ArticleDOI

Scalable distributed DNN training using commodity GPU cloud computing.

Nikko Strom
- pp 1488-1492
TLDR
It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling, and that it enables efficient scaling to more parallel GPU nodes than any other method the authors are aware of.
Abstract
We introduce a new method for scaling up distributed Stochastic Gradient Descent (SGD) training of Deep Neural Networks (DNN). The method solves the well-known communication bottleneck problem that arises for data-parallel SGD because compute nodes frequently need to synchronize a replica of the model. We solve it by purposefully controlling the rate of weight-update per individual weight, which is in contrast to the uniform update-rate customarily imposed by the size of a mini-batch. It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling. This reduction in communication bandwidth enables efficient scaling to more parallel GPU nodes than any other method that we are aware of, and it can be achieved with neither loss in convergence rate nor accuracy in the resulting DNN. Furthermore, the training can be performed on commodity cloud infrastructure and networking.
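
The abstract states the principle (controlling the update rate per individual weight) without spelling out the mechanics, but the idea can be illustrated with a threshold-based scheme: each node accumulates its gradient locally and only communicates an element once its accumulated magnitude crosses a threshold tau, sending a quantized +/- tau and keeping the remainder as a local residual. The NumPy sketch below is a minimal illustration under that reading; the function names, the value of tau, and the way updates are gathered are assumptions, not the paper's reference implementation.

```python
import numpy as np

def compress_update(residual, grad, tau=1.0):
    """Accumulate this node's gradient into its local residual and emit a
    sparse, quantized update: only elements whose accumulated magnitude
    reaches the threshold tau are sent, quantized to +/- tau; the rest stays
    local until a later iteration. (Illustrative sketch; tau is an assumption.)"""
    residual += grad
    mask = np.abs(residual) >= tau            # elements "ready" to be communicated
    indices = np.nonzero(mask)[0]
    signs = np.sign(residual[indices])        # one bit of magnitude information per element
    residual[indices] -= signs * tau          # keep the quantization error for next time
    return indices, signs                     # the only data that crosses the network

def apply_updates(weights, updates, tau=1.0):
    """Apply the sparse updates gathered from all nodes to a local model replica.
    Here tau doubles as the fixed step size of each transmitted update."""
    for indices, signs in updates:
        weights[indices] -= signs * tau
    return weights

# Toy usage: two simulated workers, one shared replica.
dim = 10
weights = np.zeros(dim)
residuals = [np.zeros(dim), np.zeros(dim)]
grads = [np.random.randn(dim), np.random.randn(dim)]
weights = apply_updates(weights, [compress_update(r, g) for r, g in zip(residuals, grads)])
```

Because most elements stay below the threshold on any given iteration, only a small fraction of the gradient is ever transmitted, which is where a reduction of the size claimed in the abstract can come from.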

Citations
Journal ArticleDOI

Federated Learning in Mobile Edge Networks: A Comprehensive Survey

TL;DR: Federated learning (FL), as discussed in this paper, has been proposed to enable collaborative training of an ML model and also to enable DL for mobile edge network optimization in large-scale and complex mobile edge networks, where heterogeneous devices with varying constraints are involved.
Proceedings Article

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

TL;DR: Quantized SGD (QSGD) as discussed by the authors is a family of compression schemes for gradient updates which provides convergence guarantees for convex and nonconvex objectives, under asynchrony, and can be extended to stochastic variance-reduced techniques.
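
For context on how a quantization-based alternative works: QSGD stochastically rounds each gradient coordinate to a small set of levels scaled by the gradient norm, so the quantized vector is unbiased in expectation. The sketch below shows only that basic quantizer, with assumed parameter names; the lossless coding of the result described in the paper is omitted.

```python
import numpy as np

def qsgd_quantize(grad, num_levels=4, rng=None):
    """Stochastically round each coordinate of `grad` to one of `num_levels`
    uniformly spaced magnitude levels in [0, ||grad||], keeping its sign, so
    that the result is unbiased in expectation. (Basic QSGD-style quantizer;
    the level count and the encoding of the output are assumptions.)"""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return np.zeros_like(grad)
    scaled = np.abs(grad) / norm * num_levels          # position in [0, num_levels]
    lower = np.floor(scaled)
    round_up = rng.random(grad.shape) < (scaled - lower)
    levels = lower + round_up                          # stochastic rounding
    return np.sign(grad) * levels * norm / num_levels  # unbiased reconstruction
```
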
Posted Content

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

TL;DR: Deep Gradient Compression (DGC) as mentioned in this paper employs momentum correction, local gradient clipping, momentum factor masking, and warm-up training to preserve accuracy during compression, and achieves a gradient compression ratio from 270x to 600x without losing accuracy.
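
The sparsification core of this approach (send only the largest accumulated gradient elements, keep the rest locally) can be sketched as follows; momentum correction, local gradient clipping, momentum factor masking, and warm-up training from the paper are deliberately left out, and the names are illustrative.

```python
import numpy as np

def topk_sparsify(residual, grad, k):
    """Accumulate the local gradient and keep only the k largest-magnitude
    elements for communication; everything else remains in the residual and
    is sent in a later iteration once it has grown large enough."""
    residual += grad
    idx = np.argpartition(np.abs(residual), -k)[-k:]   # indices of the top-k elements
    values = residual[idx].copy()
    residual[idx] = 0.0                                # transmitted mass leaves the residual
    return idx, values                                 # sparse (index, value) message
```
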
Posted Content

Federated Learning in Mobile Edge Networks: A Comprehensive Survey

TL;DR: In a large-scale and complex mobile edge network, heterogeneous devices with varying constraints are involved; this raises challenges of communication costs, resource allocation, and privacy and security in the implementation of FL at scale.
Journal ArticleDOI

Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data

TL;DR: In this paper, the authors propose sparse ternary compression (STC), a new compression framework specifically designed to meet the requirements of the federated learning environment. STC extends the existing compression technique of top-$k$ gradient sparsification with a novel mechanism that enables downstream compression, as well as ternarization and optimal Golomb encoding of the weight updates.
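
A minimal sketch of the ternarization step described above, assuming top-$k$ selection has produced the surviving elements: every kept element is replaced by a single shared magnitude times its sign, so a message reduces to k indices, k signs, and one scalar. The Golomb encoding of the index stream is omitted.

```python
import numpy as np

def sparse_ternary_compress(update, k):
    """Sparse ternary compression core: select the k largest-magnitude
    elements of the weight update, then represent each as mu * sign, where
    mu is the mean magnitude of the selected elements. Decoding is
    out[idx] = mu * signs, zeros elsewhere. (Index coding omitted.)"""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    mu = float(np.mean(np.abs(update[idx])))           # one shared magnitude
    signs = np.sign(update[idx])
    return idx, signs, mu
```
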
References
Proceedings Article

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.

TL;DR: Adaptive subgradient methods as discussed by the authors dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning, which allows us to find needles in haystacks in the form of very predictive but rarely seen features.
Journal Article

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

TL;DR: This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as those of the best proximal function that can be chosen in hindsight.
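
Both versions of this reference analyze the same per-coordinate rule; for orientation, the standard form of the AdaGrad update can be written as follows (notation added here: $\eta$ is the base learning rate, $\epsilon$ a small stabilizing constant).

```latex
% Standard AdaGrad per-coordinate update (notation added for orientation)
g_{t,i} = \nabla_{w_i} f_t(w_t), \qquad
G_{t,i} = \sum_{s=1}^{t} g_{s,i}^2, \qquad
w_{t+1,i} = w_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i}
```
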
Posted Content

ADADELTA: An Adaptive Learning Rate Method

Matthew D. Zeiler
- 22 Dec 2012
TL;DR: A novel per-dimension learning rate method for gradient descent called ADADELTA that dynamically adapts over time using only first order information and has minimal computational overhead beyond vanilla stochastic gradient descent is presented.
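
The per-dimension rule summarized above keeps decaying averages of the squared gradients and of the squared updates (decay $\rho$, small constant $\epsilon$), so no global learning rate needs to be set:

```latex
% ADADELTA per-dimension update, with RMS[x]_t = \sqrt{E[x^2]_t + \epsilon}
E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2, \qquad
\Delta x_t = -\frac{\mathrm{RMS}[\Delta x]_{t-1}}{\mathrm{RMS}[g]_t}\, g_t, \qquad
E[\Delta x^2]_t = \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2, \qquad
x_{t+1} = x_t + \Delta x_t
```
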
Proceedings Article

Large Scale Distributed Deep Networks

TL;DR: This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
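
Downpour SGD is the asynchronous, parameter-server half of that work: many model replicas fetch parameters, compute gradients on their own data shard, and push updates back without a global barrier. The worker loop below is a minimal sketch of that pattern; `ps`, `replica`, and their methods are hypothetical interfaces used only for illustration.

```python
def downpour_worker(ps, replica, data_shard, n_fetch=5, n_push=5, lr=0.01):
    """One asynchronous worker in a Downpour-SGD style setup. `ps` (sharded
    parameter server) and `replica` (local model copy) are hypothetical
    objects; the real system distributes both across many machines."""
    accumulated = None
    for step, batch in enumerate(data_shard):
        if step % n_fetch == 0:
            replica.set_parameters(ps.fetch())      # refresh stale local parameters
        grad = replica.gradient(batch)              # gradient on one local mini-batch
        replica.sgd_step(grad, lr)                  # keep making progress locally
        accumulated = grad if accumulated is None else accumulated + grad
        if (step + 1) % n_push == 0:
            ps.push_gradient(accumulated, lr)       # asynchronous push, no global barrier
            accumulated = None
```
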
Journal Article

Deep Neural Networks for Acoustic Modeling in Speech Recognition

TL;DR: This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.