Proceedings ArticleDOI
Profiling DNN Workloads on a Volta-based DGX-1 System
Saiful A. Mojumder, Marcia S. Louis, Yifan Sun, Amir Kavyan Ziabari, José L. Abellán, John Kim, David Kaeli, Ajay Joshi
pp. 122-133
TL;DR: This work profiles and analyzes the training of five popular DNNs using 1, 2, 4 and 8 GPUs, and shows the breakdown of the training time across the FP+BP stage and the WU stage to provide insights about the limiting factors of the training algorithm, as well as to identify the bottlenecks in the multi-GPU system architecture.
Abstract:
High-performance multi-GPU systems are widely used to accelerate the training of deep neural networks (DNNs) by exploiting the inherently massive parallelism of the training process. Typically, the training of DNNs in multi-GPU systems leverages a data-parallel model in which a DNN is replicated on every GPU, and each GPU performs Forward Propagation (FP), Backward Propagation (BP), and Weight Update (WU). We analyze the WU stage, which is composed of collective communication (e.g., allReduce, broadcast) and demands very efficient communication among the GPUs to avoid diminishing returns when scaling the number of GPUs in the system. To overcome this issue, NVIDIA has introduced different data transfer mechanisms and libraries, which have been adopted by high-level frameworks to train DNNs. In this work, we evaluate and compare the performance of the peer-to-peer (P2P) data transfer method and the NCCL library-based communication method for training DNNs on a DGX-1 system consisting of 8 NVIDIA Volta-based GPUs. We profile and analyze the training of five popular DNNs (GoogLeNet, AlexNet, Inception-v3, ResNet and LeNet) using 1, 2, 4 and 8 GPUs. We show the breakdown of the training time across the FP+BP stage and the WU stage to provide insights about the limiting factors of the training algorithm, as well as to identify the bottlenecks in the multi-GPU system architecture. Our detailed profiling and analysis can help programmers and DNN model designers accelerate the training process of DNNs.
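The data-parallel pattern the abstract describes maps directly onto framework code. Below is a minimal sketch (not the paper's benchmark code) of one training step in PyTorch; when the process group is initialized with dist.init_process_group(backend="nccl"), the all_reduce call uses the NCCL-based communication the paper evaluates. The model, optimizer, and loss function are assumed to be set up by the caller.

```python
import torch.distributed as dist

def train_step(model, optimizer, loss_fn, inputs, targets):
    # Forward Propagation (FP): compute the loss on this GPU's batch shard
    loss = loss_fn(model(inputs), targets)

    # Backward Propagation (BP): each replica computes local gradients
    optimizer.zero_grad()
    loss.backward()

    # Weight Update (WU), part 1: allReduce sums gradients across GPUs,
    # then each replica divides by the GPU count to get the average
    world_size = dist.get_world_size()
    for param in model.parameters():
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size

    # Weight Update (WU), part 2: every replica applies the same averaged
    # gradients locally, keeping all model copies in sync
    optimizer.step()
    return loss.item()
```

In practice, wrappers such as torch.nn.parallel.DistributedDataParallel overlap these gradient reductions with BP, which is one reason the FP+BP versus WU breakdown reported in the paper is worth measuring rather than assuming.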
Citations
Proceedings ArticleDOI
Analyzing Machine Learning Workloads Using a Detailed GPU Simulator
Jonathan Lew, Deval Shah, Suchita Pati, Shaylin Cattell, Mengchi Zhang, Amruth Sandhupatla, Christopher Ng, Negar Goli, Matthew D. Sinclair, Timothy G. Rogers, Tor M. Aamodt
TL;DR: This paper describes changes made to GPGPU-Sim to run ML applications that use cuDNN and PyTorch, two widely used frameworks for running Deep Neural Networks (DNNs), and demonstrates how the simulator identifies opportunities for architectural optimization that prior tools are unable to provide.
Proceedings ArticleDOI
MGPUSim: enabling multi-GPU performance modeling and optimization
Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, Shi Dong, Xiang Gong, Shane Treadway, Yuhui Bao, Spencer Hance, Carter McCardwell, Vincent Zhao, Harrison Barclay, Amir Kavyan Ziabari, Zhongliang Chen, Rafael Ubal, José L. Abellán, John Kim, Ajay Joshi, David Kaeli
TL;DR: This work presents MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory.
Journal ArticleDOI
An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators
Seyed Morteza Nabavinejad, Mohammad Baharloo, Kun-Chih Chen, Maurizio Palesi, Tim Kogel, Masoumeh Ebrahimi
TL;DR: This paper provides a comprehensive investigation of recent advances in efficient on-chip interconnection and design methodologies for DNN accelerators, and investigates emerging interconnection technologies (e.g., in/near-memory processing) for DNN accelerator design.
Proceedings ArticleDOI
NV-group: link-efficient reduction for distributed deep learning on modern dense GPU systems
Ching-Hsiang Chu, Pouya Kousha, Ammar Ahmad Awan, Kawthar Shafie Khorassani, Hari Subramoni, Dhabaleswar K. Panda
TL;DR: This is the first study to achieve near-ideal scaling efficiency for distributed DL training, with designs tailored for cutting-edge dense GPU systems such as DGX-2 and Summit clusters.
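The reductions such designs optimize are variants of allReduce. As a generic illustration (not NV-group's NVLink-topology-aware design), the widely used ring allReduce splits each buffer into P chunks and circulates them so every link carries the same traffic; a numpy simulation, assuming equal-shape gradient buffers, one per simulated GPU:

```python
import numpy as np

def ring_allreduce(grads):
    """Simulate ring allReduce: returns the elementwise sum on every rank."""
    p = len(grads)
    chunks = [np.array_split(g.astype(np.float64), p) for g in grads]

    # Reduce-scatter: p-1 steps; each rank forwards one chunk to its ring
    # neighbor and accumulates the chunk it receives. Afterwards, rank i
    # holds the fully reduced chunk (i + 1) % p.
    for t in range(p - 1):
        for i in range(p):
            c = (i - t - 1) % p  # chunk index arriving at rank i
            chunks[i][c] = chunks[i][c] + chunks[(i - 1) % p][c]

    # Allgather: p-1 more steps circulate the reduced chunks until every
    # rank holds the complete summed tensor.
    for t in range(p - 1):
        for i in range(p):
            c = (i - t) % p      # reduced chunk arriving at rank i
            chunks[i][c] = chunks[(i - 1) % p][c]

    return [np.concatenate(ch) for ch in chunks]

# Example: 4 simulated GPUs, each contributing a constant gradient buffer
out = ring_allreduce([np.full(8, r, dtype=float) for r in range(4)])
assert all(np.allclose(o, 0 + 1 + 2 + 3) for o in out)
```

Each rank transfers about 2(P-1)/P of its buffer in total, nearly independent of P, which is what makes ring-style reductions bandwidth-efficient on symmetric links.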
References
Proceedings Article
ImageNet Classification with Deep Convolutional Neural Networks
TL;DR: As discussed by the authors, state-of-the-art ImageNet classification performance was achieved by a deep convolutional neural network consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
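For reference, that layer stack is compact enough to sketch directly; the following PyTorch module uses the original paper's channel counts in a single stream (the actual network split these layers across two GPUs) and assumes a 227x227 RGB input:

```python
import torch.nn as nn

# AlexNet-style stack: five conv layers, interleaved max-pooling,
# three fully-connected layers, 1000-way classification
alexnet_like = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(96, 256, kernel_size=5, padding=2), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.Flatten(),
    nn.Linear(256 * 6 * 6, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 4096), nn.ReLU(), nn.Dropout(),
    nn.Linear(4096, 1000),  # logits; the 1000-way softmax lives in the loss
)
```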
Journal ArticleDOI
Deep learning in neural networks
TL;DR: This historical survey compactly summarizes relevant work, much of it from the previous millennium, reviewing deep supervised learning, unsupervised learning, reinforcement learning and evolutionary computation, as well as indirect search for short programs encoding deep and large networks.
Posted Content
Caffe: Convolutional Architecture for Fast Feature Embedding
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell
TL;DR: Caffe, as discussed by the authors, is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
Proceedings ArticleDOI
Caffe: Convolutional Architecture for Fast Feature Embedding
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, Trevor Darrell
TL;DR: Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.
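Both Caffe entries describe the same deploy-a-reference-model workflow, which the Python bindings ("pycaffe") reduce to a few calls. A hedged sketch, where the .prototxt/.caffemodel file names are placeholders and the input blob is assumed to be named 'data' as in Caffe's reference models:

```python
import numpy as np
import caffe

caffe.set_mode_gpu()
net = caffe.Net('deploy.prototxt',       # network definition (placeholder path)
                'weights.caffemodel',    # pretrained weights (placeholder path)
                caffe.TEST)

# Fill the input blob with a preprocessed batch and run forward propagation
batch = np.random.rand(*net.blobs['data'].data.shape).astype(np.float32)
net.blobs['data'].data[...] = batch
outputs = net.forward()                  # dict: output blob name -> array
probs = outputs[net.outputs[0]]          # e.g., softmax class probabilities
```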
Proceedings Article
Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot, Yoshua Bengio
TL;DR: The objective is to better understand why standard gradient descent from random initialization performs so poorly with deep neural networks, in order to explain recent relative successes and help design better algorithms in the future.
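The remedy that paper proposes, now widely known as Xavier (Glorot) initialization, keeps activation and gradient variances roughly constant across layers by drawing weights from W ~ U[-a, a] with a = sqrt(6 / (fan_in + fan_out)). A minimal numpy sketch:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Sample a (fan_out, fan_in) weight matrix from U[-a, a],
    with a = sqrt(6 / (fan_in + fan_out))."""
    rng = rng or np.random.default_rng()
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_out, fan_in))

# Example: hidden layer mapping 784 inputs to 256 units
W = xavier_uniform(784, 256)
```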
Related Papers
In-Datacenter Performance Analysis of a Tensor Processing Unit
Norman P. Jouppi, Cliff Young, Nishant Patil, David A. Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Albert T. Borchers, Rick Boyle, Pierre-luc Cantin, Clifford Chao, Christopher Aaron Clark, Jeremy Coriell, Michael J. Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William John Gulland, Robert Hagmann, C. Richard Ho, Doug Hogberg, John Hu, Robert Hundt, D. Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andrew Everett Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Michael Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay K. Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon