vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training
Shixiong Zhao, Fanxin Li, Xusheng Chen, Xiuxian Guan, Jianyu Jiang, Dong Huang, Yuhao Qing, Sen Wang, Peng Wang, Gong Zhang, Cheng Li, Ping Luo, Heming Cui, et al.
TLDR
vPipe provides dynamic layer partitioning and memory management for pipeline parallelism by searching for a near-optimal partitioning/memory-management plan and using a live layer migration protocol to rebalance the layer distribution across a training pipeline.

Abstract:
DNNs of increasing computational complexity have achieved unprecedented successes in areas such as machine vision and natural language processing (NLP); e.g., the recent advanced Transformer has billions of parameters. However, as large-scale DNNs significantly exceed a GPU's physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism, which partitions a large DNN into small subnets and trains them on different GPUs, is a plausible solution. Unfortunately, the layer partitioning and memory management in existing pipeline parallel systems are fixed during training, making them easily impeded by out-of-memory errors and GPU under-utilization. These drawbacks are amplified when performing neural architecture search (NAS), as with the Evolved Transformer, where different Transformer architectures need to be trained repeatedly. vPipe is the first system that transparently provides dynamic layer partitioning and memory management for pipeline parallelism. vPipe makes two unique contributions: (1) an online algorithm for searching a near-optimal layer partitioning and memory management plan, and (2) a live layer migration protocol for rebalancing the layer distribution across a training pipeline. vPipe improved the training throughput of two notable baselines (PipeDream and GPipe) by 61.4-463.4 percent and 24.8-291.3 percent, respectively, on various large DNNs and training settings.
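The core planning problem the abstract describes — splitting a DNN's layers into contiguous pipeline stages so no GPU is overloaded — can be sketched with a simple greedy heuristic. This is an illustrative toy, not vPipe's actual online algorithm: the function name, the per-layer cost model, and the greedy rule are all assumptions for clarity (vPipe additionally co-plans memory via swap/recompute and rebalances live during training).

```python
# Toy sketch of pipeline layer partitioning: split layers into contiguous
# stages with roughly equal total compute cost. Hypothetical helper, not
# vPipe's algorithm (which also plans memory and adapts online).

def partition_layers(layer_costs, num_stages):
    """Return a list of stages, each a list of layer indices."""
    total = sum(layer_costs)
    target = total / num_stages  # ideal per-stage cost
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        remaining_layers = len(layer_costs) - i - 1
        remaining_stages = num_stages - len(stages) - 1
        # Close this stage once it reaches the target, but keep at least
        # one layer available for every remaining stage.
        if acc >= target and remaining_stages > 0 and remaining_layers >= remaining_stages:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

# Example: 6 layers with uneven costs split across 3 GPUs.
print(partition_layers([1, 1, 4, 2, 2, 2], 3))  # → [[0, 1, 2], [3, 4], [5]]
```

A static plan like this is exactly what GPipe/PipeDream-style systems fix up front; vPipe's contribution is re-searching this plan and migrating layers live when the balance drifts (e.g., across NAS candidate architectures).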
Citations
Proceedings Article (DOI)
Pathways: Asynchronous Distributed Dataflow for ML
Paul Barham, Aakanksha Chowdhery, Jeffrey Dean, Sanjay Ghemawat, Steven Hand, D. Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, Yonghui Wu, et al.
Proceedings Article (DOI)
NASPipe: high performance and reproducible pipeline parallel supernet training via causal synchronous parallelism
Shixiong Zhao, Fanxin Li, Xusheng Chen, Tianxiang Shen, Li Chen, Sen Wang, Nicholas Zhang, Cheng Li, Heming Cui, et al.
Posted Content
Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, Olli Sarikivi, et al.
TL;DR: CoCoNeT as discussed by the authors provides a DSL to express a program with both computation and communication, which allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation.
Proceedings Article (DOI)
Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
TL;DR: In this paper, the authors propose breaking the separation between computation and communication kernels in machine learning frameworks, which enables many optimizations that improve the performance of distributed workloads; manually applying these optimizations, however, requires modifying the underlying computation and communication libraries for each scenario, which is both time-consuming and error-prone.
References
Proceedings Article (DOI)
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won 1st place in the ILSVRC 2015 classification task.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
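The attention mechanism this entry summarizes is the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch below implements just that formula in plain Python for clarity; multi-head projections, batching, and masking from the full architecture are deliberately omitted.

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on lists of row vectors."""
    d_k = len(K[0])
    # Similarity scores: scores[i][j] = dot(Q[i], K[j]) / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Row-wise softmax (shifted by the max for numerical stability)
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # Each output row is a convex combination of the value rows
    return [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
             for j in range(len(V[0]))] for w_row in weights]
```

With one query [1, 0] against keys/values [1, 0] and [0, 1], the output is a weighted mix of the two value rows, leaning toward the more similar first row.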
Proceedings Article (DOI)
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article (DOI)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.