Open Access Proceedings ArticleDOI

MGPUSim: enabling multi-GPU performance modeling and optimization

TLDR
This work presents MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture, and proposes the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory.
Abstract
The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in raw computational power of Graphics Processing Units (GPUs). As single-GPU platforms struggle to satisfy these performance demands, multi-GPU platforms have started to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabric, runtime libraries, and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework to evaluate next-generation multi-GPU system designs. In this work, we present MGPUSim, a cycle-accurate, extensively validated multi-GPU simulator based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. MGPUSim comes with built-in support for multi-threaded execution to enable fast, parallelized, and accurate simulation. In terms of performance accuracy, MGPUSim differs by only 5.5% on average from the actual GPU hardware. We also achieve a 3.5x and a 2.5x average speedup running functional emulation and detailed timing simulation, respectively, on a 4-core CPU, while delivering the same accuracy as serial simulation. We illustrate the flexibility and capability of the simulator through two concrete design studies. In the first, we propose the Locality API, an API extension that allows the GPU programmer to avoid the complexity of multi-GPU programming while precisely controlling data placement in the multi-GPU memory. In the second design study, we propose Progressive Page Splitting Migration (PASI), a customized multi-GPU memory management system enabling the hardware to progressively improve data placement. For a discrete 4-GPU system, we observe that the Locality API can speed up the system by 1.6x (geometric mean), and PASI can improve the system performance by 2.6x (geometric mean) across all benchmarks, compared to a unified 4-GPU platform.
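As a rough illustration of the kind of placement control the Locality API provides, the Go sketch below lets a programmer allocate a buffer in a unified multi-GPU address space and then distribute its pages across specific GPUs. The type and method names (Driver, AllocateMemory, Distribute) and the page-granularity bookkeeping are assumptions made for this sketch, not code taken from the simulator itself.

package main

import "fmt"

// GPUPtr is a virtual address in the unified multi-GPU address space.
type GPUPtr uint64

// Driver is a hypothetical stand-in for a multi-GPU runtime driver that
// exposes Locality-API-style placement control.
type Driver struct {
	nextAddr  GPUPtr
	pageSize  uint64
	placement map[GPUPtr]int // first address of a page -> GPU ID
}

func NewDriver(pageSize uint64) *Driver {
	return &Driver{nextAddr: 0x1000, pageSize: pageSize, placement: make(map[GPUPtr]int)}
}

// AllocateMemory reserves size bytes (rounded up to whole pages) in the
// unified address space without binding them to a particular GPU yet.
func (d *Driver) AllocateMemory(size uint64) GPUPtr {
	ptr := d.nextAddr
	numPages := (size + d.pageSize - 1) / d.pageSize
	d.nextAddr += GPUPtr(numPages * d.pageSize)
	return ptr
}

// Distribute assigns the pages backing [ptr, ptr+size) round-robin across the
// given GPUs, letting the programmer control data placement explicitly.
func (d *Driver) Distribute(ptr GPUPtr, size uint64, gpus []int) {
	numPages := (size + d.pageSize - 1) / d.pageSize
	for i := uint64(0); i < numPages; i++ {
		page := ptr + GPUPtr(i*d.pageSize)
		d.placement[page] = gpus[int(i)%len(gpus)]
	}
}

func main() {
	drv := NewDriver(4096)
	buf := drv.AllocateMemory(4 * 4096)
	drv.Distribute(buf, 4*4096, []int{0, 1, 2, 3}) // one page per GPU
	for page, gpu := range drv.placement {
		fmt.Printf("page 0x%x -> GPU %d\n", page, gpu)
	}
}

In this toy model the programmer keeps a single logical buffer, as on a unified multi-GPU platform, but decides per page which GPU's memory backs it; PASI, by contrast, would let the hardware refine such a placement progressively at run time.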


Citations
Proceedings ArticleDOI

Accel-sim: an extensible simulation framework for validated GPU modeling

TL;DR: A new GPU simulator frontend is introduced that minimizes the effort required to simulate different machine ISAs through trace-driven simulation of NVIDIA's native machine ISA, while still supporting execution-driven simulation of the virtual ISA.
Proceedings ArticleDOI

Characterizing and Modeling Non-Volatile Memory Systems

TL;DR: VANS is developed, which models the sophisticated microarchitecture design of the Optane DIMM and is validated against the detailed performance characteristics of Optane-DIMM-attached Intel servers; two architectural optimizations on top of the Optane DIMM, Lazy Cache and Pre-translation, significantly improve cloud workload performance.
Proceedings ArticleDOI

Towards Inspecting and Eliminating Trojan Backdoors in Deep Neural Networks

TL;DR: TABOR, as discussed by the authors, proposes a new objective function that guides optimization to identify a trojan backdoor more precisely and to prune the restored triggers, which not only facilitates the identification of intentionally injected triggers but also filters out false alarms.
Proceedings ArticleDOI

AccelWattch: A Power Modeling Framework for Modern GPUs

TL;DR: In this article, the authors propose a configurable GPU power model called AccelWattch that can be driven by emulation or trace-driven environments, hardware counters, or a mix of the two; it models both the PTX and SASS ISAs, accounts for power gating and control-flow divergence, and supports DVFS.
Journal ArticleDOI

Grus: Toward Unified-memory-efficient High-performance Graph Processing on GPU

TL;DR: This monograph presents a meta-analysis of the architecture and code optimization challenges faced in this rapidly changing environment and describes some of the strategies that have been proposed to address them.
References
Proceedings Article

ImageNet Classification with Deep Convolutional Neural Networks

TL;DR: State-of-the-art image classification performance was achieved by a deep convolutional neural network, as discussed by the authors, consisting of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax.
Proceedings ArticleDOI

ImageNet: A large-scale hierarchical image database

TL;DR: A new database called “ImageNet” is introduced, a large-scale ontology of images built upon the backbone of the WordNet structure, much larger in scale and diversity and much more accurate than the current image datasets.
Journal ArticleDOI

ImageNet classification with deep convolutional neural networks

TL;DR: A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
Journal ArticleDOI

Parallel discrete event simulation

TL;DR: This article deals with executing a simulation program on a parallel computer by decomposing the simulation application into a set of concurrently executing processes, a decomposition that introduces the synchronization problems at the heart of PDES (a minimal sketch of this event-driven scheme appears after the reference list).
Proceedings ArticleDOI

Analyzing CUDA workloads using a detailed GPU simulator

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
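The parallel discrete event simulation approach referenced above, which MGPUSim's multi-threaded engine builds on to deliver the same accuracy as serial simulation, can be sketched as follows: events sit in a time-ordered queue, and all events sharing the earliest timestamp are dispatched to worker goroutines concurrently, so a parallel run produces the same results as a serial one as long as same-time events are independent. The Event and Engine types below are illustrative assumptions for this sketch, not MGPUSim's actual simulation engine.

package main

import (
	"fmt"
	"sort"
	"sync"
)

// Event is a unit of work scheduled to happen at a virtual time (e.g. a cycle).
type Event struct {
	Time   uint64
	Action func()
}

// Engine runs events in timestamp order; events that share a timestamp are
// executed concurrently, which cannot change the outcome relative to a
// serial run as long as same-time events do not touch shared state.
type Engine struct {
	queue []Event
}

func (e *Engine) Schedule(ev Event) { e.queue = append(e.queue, ev) }

func (e *Engine) Run() {
	for len(e.queue) > 0 {
		// Find the earliest virtual time still pending.
		sort.Slice(e.queue, func(i, j int) bool { return e.queue[i].Time < e.queue[j].Time })
		now := e.queue[0].Time

		// Collect every event scheduled at that time.
		var batch []Event
		for len(e.queue) > 0 && e.queue[0].Time == now {
			batch = append(batch, e.queue[0])
			e.queue = e.queue[1:]
		}

		// Execute same-time events in parallel, then synchronize before
		// advancing virtual time.
		var wg sync.WaitGroup
		for _, ev := range batch {
			wg.Add(1)
			go func(ev Event) {
				defer wg.Done()
				ev.Action()
			}(ev)
		}
		wg.Wait()
	}
}

func main() {
	eng := &Engine{}
	for cycle := uint64(1); cycle <= 2; cycle++ {
		c := cycle
		eng.Schedule(Event{Time: c, Action: func() { fmt.Println("compute unit ticks at cycle", c) }})
		eng.Schedule(Event{Time: c, Action: func() { fmt.Println("cache responds at cycle", c) }})
	}
	eng.Run()
}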