Proceedings ArticleDOI

A Quick Survey on Large Scale Distributed Deep Learning Systems

TLDR
A simple and quick survey of distributed deep learning systems from the algorithm, distributed-system, and applications perspectives, with an analysis of the restrictions and prospects of distributed methods.
Abstract
Deep learning has been widely used across many fields and plays a major role in them. With this gradual penetration, the data volume of each application is growing tremendously, and so are the computational complexity and the number of model parameters. As an obvious result, training and inference are time-consuming: for example, training a classic ResNet-50 classification model on the ImageNet data set takes 14 days on an NVIDIA M40 GPU. Distributed acceleration is therefore a very useful way to dispatch the computation of training, and even inference, across many nodes in parallel and speed up the whole process. Facebook's work and UC Berkeley's work can train the ResNet-50 model within an hour and within minutes, respectively, using distributed deep learning algorithms and systems. Like other forms of distributed acceleration, this makes it possible to shrink the training of large models on large data sets from weeks to minutes, giving researchers and developers more room to explore and search. However, beyond acceleration, what other issues does a distributed deep learning system face? Where is the upper limit of acceleration? Which applications will acceleration be used for? What is the price and cost of acceleration? In this paper, we take a simple and quick survey of distributed deep learning systems from the algorithm perspective, the distributed-system perspective, and the applications perspective. We present several recent notable works and analyze the restrictions and prospects of distributed methods.
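As a rough illustration of the synchronous data-parallel scheme that such accelerations build on, the sketch below trains one ResNet-50 replica per worker and averages gradients every step. It assumes PyTorch with torchvision, the gloo backend (nccl on GPUs), a launcher such as torchrun that sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, and random tensors standing in for ImageNet batches; it is not the implementation of the systems surveyed here.

# Minimal sketch of synchronous data-parallel training (illustrative only).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torchvision.models import resnet50

def main():
    dist.init_process_group(backend="gloo")       # "nccl" on GPU clusters
    model = DDP(resnet50())                       # gradients are averaged across workers each step
    # Linear learning-rate scaling with the number of workers (illustrative choice).
    opt = torch.optim.SGD(model.parameters(), lr=0.1 * dist.get_world_size(), momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(10):                           # toy loop; a real run iterates over an ImageNet loader
        x = torch.randn(32, 3, 224, 224)
        y = torch.randint(0, 1000, (32,))
        opt.zero_grad()
        loss_fn(model(x), y).backward()           # backward() triggers the all-reduce of gradients
        opt.step()                                # every worker applies the same averaged update
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

Launched with, for example, torchrun --nproc_per_node=4 train.py, each process consumes its own shard of the data while the gradient all-reduce keeps the model replicas identical, which is how adding workers shortens wall-clock training time.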


Citations
Posted Content

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey.

TL;DR: A comprehensive survey of communication-efficient distributed training algorithms covering both system-level and algorithmic-level optimizations, helping readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimization.
Journal ArticleDOI

Blockchain-Enabled Cross-Domain Object Detection for Autonomous Driving: A Model Sharing Approach

TL;DR: A novel blockchain-enabled model sharing approach is proposed to improve the performance of object detection with cross-domain adaptation for autonomous driving systems using a domain-adaptive you-only-look-once (YOLOv2) model.
Journal ArticleDOI

Distributed Machine Learning for Wireless Communication Networks: Techniques, Architectures, and Applications

TL;DR: The latest applications of DML in power control, spectrum management, user association, and edge cloud computing, and the potential adversarial attacks faced by DML applications are reviewed, and state-of-the-art countermeasures to preserve privacy and security are described.
Journal ArticleDOI

A Systematic Literature Review on Distributed Machine Learning in Edge Computing

TL;DR: The challenges of running ML/DL on edge devices in a distributed way are investigated, paying special attention to how techniques are adapted or designed to execute on these restricted devices.
References
Proceedings ArticleDOI

Deep Residual Learning for Image Recognition

TL;DR: In this article, the authors propose a residual learning framework to ease the training of networks that are substantially deeper than those used previously; the resulting models won 1st place in the ILSVRC 2015 classification task.
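A minimal sketch of the residual block idea described above, assuming PyTorch; it is an illustration, not the authors' exact implementation.

# One basic residual block: the stack learns a residual F(x) and the
# identity shortcut makes very deep networks easier to optimize.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)                # add the shortcut, then activate

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)    # torch.Size([1, 64, 56, 56])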
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: State-of-the-art image classification performance is achieved by a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
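A rough sketch of the architecture described above (five convolutional layers, max-pooling after some of them, and three fully-connected layers ending in a 1000-way classifier), assuming PyTorch; the layer sizes are illustrative, not a faithful reproduction of the original model.

# Five conv layers (some followed by max-pooling), then three FC layers.
import torch
import torch.nn as nn

alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(3, 2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 1000),                        # logits; the softmax lives in the loss
)
print(alexnet_like(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])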
Proceedings Article

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

TL;DR: Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
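A minimal sketch of the per-mini-batch normalization that Batch Normalization applies to layer inputs, assuming PyTorch; it shows only the core operation, not the paper's full training recipe (e.g., the running statistics used at inference time).

# Normalize each feature over the mini-batch, then apply a learned
# scale (gamma) and shift (beta).
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=0, keepdim=True)
    var = x.var(dim=0, unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

x = torch.randn(32, 128)                          # a mini-batch of 32 activation vectors
y = batch_norm(x, torch.ones(128), torch.zeros(128))
print(y.mean().item(), y.std().item())            # approximately 0 and 1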
Proceedings ArticleDOI

Learning Transferable Architectures for Scalable Image Recognition

TL;DR: NASNet searches for an architectural building block on a small dataset and then transfers the block to a larger dataset, enabling transferability and achieving state-of-the-art performance.