vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training
Shixiong Zhao, Fanxin Li, Xusheng Chen, Xiuxian Guan, Jianyu Jiang, Dong Huang, Yuhao Qing, Sen Wang, Peng Wang, Gong Zhang, Cheng Li, Ping Luo, Heming Cui, et al.
TLDR
vPipe provides dynamic layer partitioning and memory management for pipeline parallelism by searching for a near-optimal partitioning/memory-management plan and using a live layer migration protocol to rebalance the layer distribution across a training pipeline.

Abstract:
DNNs of increasing computational complexity have achieved unprecedented successes in areas such as machine vision and natural language processing (NLP); e.g., the recent advanced Transformer has billions of parameters. However, as large-scale DNNs significantly exceed a GPU's physical memory limit, they cannot be trained by conventional methods such as data parallelism. Pipeline parallelism, which partitions a large DNN into small subnets and trains them on different GPUs, is a plausible solution. Unfortunately, the layer partitioning and memory management in existing pipeline parallel systems are fixed during training, making them easily impeded by out-of-memory errors and GPU under-utilization. These drawbacks are amplified when performing neural architecture search (NAS), as with the Evolved Transformer, where different Transformer architectures need to be trained repeatedly. vPipe is the first system that transparently provides dynamic layer partitioning and memory management for pipeline parallelism. vPipe makes two unique contributions: (1) an online algorithm for searching a near-optimal layer partitioning and memory management plan, and (2) a live layer migration protocol for rebalancing the layer distribution across a training pipeline. vPipe improved the training throughput of two notable baselines (PipeDream and GPipe) by 61.4-463.4 percent and 24.8-291.3 percent, respectively, on various large DNNs and training settings.
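The core planning problem the abstract describes — splitting a DNN's layers into contiguous pipeline stages so no GPU is overloaded — can be sketched with a simple greedy heuristic. This is an illustrative toy, not vPipe's actual online algorithm: the function name, the per-layer cost model, and the greedy rule are all assumptions for clarity (vPipe additionally co-plans memory via swap/recompute and rebalances live during training).

```python
# Toy sketch of pipeline layer partitioning: split layers into contiguous
# stages with roughly equal total compute cost. Hypothetical helper, not
# vPipe's algorithm (which also plans memory and adapts online).

def partition_layers(layer_costs, num_stages):
    """Return a list of stages, each a list of layer indices."""
    total = sum(layer_costs)
    target = total / num_stages  # ideal per-stage cost
    stages, current, acc = [], [], 0.0
    for i, cost in enumerate(layer_costs):
        current.append(i)
        acc += cost
        remaining_layers = len(layer_costs) - i - 1
        remaining_stages = num_stages - len(stages) - 1
        # Close this stage once it reaches the target, but keep at least
        # one layer available for every remaining stage.
        if acc >= target and remaining_stages > 0 and remaining_layers >= remaining_stages:
            stages.append(current)
            current, acc = [], 0.0
    stages.append(current)
    return stages

# Example: 6 layers with uneven costs split across 3 GPUs.
print(partition_layers([1, 1, 4, 2, 2, 2], 3))  # → [[0, 1, 2], [3, 4], [5]]
```

A static plan like this is exactly what GPipe/PipeDream-style systems fix up front; vPipe's contribution is re-searching this plan and migrating layers live when the balance drifts (e.g., across NAS candidate architectures).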
Citations
Proceedings Article (DOI)
Pathways: Asynchronous Distributed Dataflow for ML
Paul Barham, Aakanksha Chowdhery, Jeffrey Dean, Sanjay Ghemawat, Steven Hand, D. Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, Brennan Saeta, Parker Schuh, Ryan Sepassi, Laurent El Shafey, Chandramohan A. Thekkath, Yonghui Wu, et al.
Proceedings Article (DOI)
NASPipe: high performance and reproducible pipeline parallel supernet training via causal synchronous parallelism
Shixiong Zhao, Fanxin Li, Xusheng Chen, Tianxiang Shen, Li Chen, Sen Wang, Nicholas Zhang, Cheng Li, Heming Cui, et al.
Posted Content
Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
Abhinav Jangda, Jun Huang, Guodong Liu, Amir Hossein Nodehi Sabet, Saeed Maleki, Youshan Miao, Madanlal Musuvathi, Todd Mytkowicz, Olli Sarikivi, et al.
TL;DR: CoCoNeT as discussed by the authors provides a DSL to express a program with both computation and communication, which allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation.
Proceedings Article (DOI)
Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads
TL;DR: In this paper, the authors propose breaking the separation between computation and communication kernels in machine learning frameworks, which enables many optimizations that improve the performance of distributed workloads; manually applying these optimizations, however, requires modifying the underlying computation and communication libraries for each scenario, which is both time-consuming and error-prone.
References
Proceedings Article (DOI)
Deep Residual Learning for Image Recognition
TL;DR: In this article, the authors proposed a residual learning framework to ease the training of networks that are substantially deeper than those used previously, which won 1st place in the ILSVRC 2015 classification task.
Proceedings Article
Very Deep Convolutional Networks for Large-Scale Image Recognition
Karen Simonyan, Andrew Zisserman
TL;DR: This work investigates the effect of convolutional network depth on accuracy in the large-scale image recognition setting, using an architecture with very small convolution filters, and shows that a significant improvement over prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Proceedings Article
Attention is All you Need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
TL;DR: This paper proposes a simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely, and achieves state-of-the-art performance on English-to-French translation.
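The attention mechanism this entry summarizes is the paper's scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. The sketch below implements just that formula in plain Python for clarity; multi-head projections, batching, and masking from the full architecture are deliberately omitted.

```python
import math

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, on lists of row vectors."""
    d_k = len(K[0])
    # Similarity scores: scores[i][j] = dot(Q[i], K[j]) / sqrt(d_k)
    scores = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k)
               for k_row in K] for q_row in Q]
    # Row-wise softmax (shifted by the max for numerical stability)
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # Each output row is a convex combination of the value rows
    return [[sum(w * v_row[j] for w, v_row in zip(w_row, V))
             for j in range(len(V[0]))] for w_row in weights]
```

With one query [1, 0] against keys/values [1, 0] and [0, 1], the output is a weighted mix of the two value rows, leaning toward the more similar first row.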
Proceedings Article (DOI)
ImageNet: A large-scale hierarchical image database
TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Proceedings Article (DOI)
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TL;DR: BERT pre-trains deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, and can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.