Proceedings ArticleDOI

Scaling large-data computations on multi-GPU accelerators

TLDR
A mechanism and an implementation are presented to automatically pipeline the CPU-GPU memory channel so as to overlap GPU computation with memory copies, alleviating the data transfer overhead; in addition, a novel adaptive runtime tuning mechanism is proposed to automatically select the pipeline stage size.
Abstract
Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programming models, limited device memory sizes and overheads of data transfers between CPU and accelerator memories are among the open challenges that restrict the widespread use of accelerators. First, this paper proposes a mechanism and an implementation to automatically pipeline the CPU-GPU memory channel so as to overlap GPU computation with the memory copies, alleviating the data transfer overhead. Second, in doing so, the paper presents a technique called Computation Splitting (COSP) that caters to arbitrary device memory sizes and automatically runs out-of-card OpenMP-like applications on GPUs. Third, a novel adaptive runtime tuning mechanism is proposed to automatically select the pipeline stage size so as to gain the best possible performance; the mechanism adapts to the underlying hardware in the starting phase of a program and chooses the pipeline stage size. The techniques are implemented in a system that translates an input OpenMP program to run on multiple GPUs attached to the same host CPU. Experiments on a set of nine benchmarks show that, on average, the pipelining scheme improves performance by 1.49x while limiting the runtime tuning overhead to 3% of the execution time.
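The paper's runtime system is not reproduced here, but the core pipelining idea it automates can be sketched in plain CUDA: split a large array into chunks and issue each chunk's host-to-device copy, kernel, and device-to-host copy on a rotating set of streams, so transfers of one chunk overlap computation on another. The kernel, chunk size, and stream count below are illustrative assumptions, not the paper's tuned values.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for a translated OpenMP loop body (hypothetical workload).
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 24;       // total elements
    const int CHUNK = 1 << 20;   // pipeline stage size; the paper's runtime
                                 // tunes this value adaptively
    const int NSTREAMS = 4;      // in-flight pipeline stages

    float *h;                    // pinned memory is required for async copies
    cudaMallocHost((void **)&h, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    float *d[NSTREAMS];
    cudaStream_t s[NSTREAMS];
    for (int k = 0; k < NSTREAMS; ++k) {
        cudaMalloc((void **)&d[k], CHUNK * sizeof(float));  // per-stage buffer
        cudaStreamCreate(&s[k]);
    }

    // Issue copy-in, kernel, copy-out per chunk on rotating streams; work in
    // different streams overlaps, so copy engines and SMs run concurrently.
    for (int off = 0, k = 0; off < N; off += CHUNK, k = (k + 1) % NSTREAMS) {
        int n = (N - off < CHUNK) ? N - off : CHUNK;
        cudaMemcpyAsync(d[k], h + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n, 2.0f);
        cudaMemcpyAsync(h + off, d[k], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %.1f\n", h[0]);   // expect 2.0

    for (int k = 0; k < NSTREAMS; ++k) {
        cudaFree(d[k]);
        cudaStreamDestroy(s[k]);
    }
    cudaFreeHost(h);
    return 0;
}
```

Because operations enqueued on the same stream execute in order, reusing a stage's device buffer for a later chunk on that stream needs no extra synchronization; the adaptive tuning the paper proposes would adjust CHUNK at runtime rather than hard-coding it.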


Citations
Proceedings ArticleDOI

DRAGON: breaking GPU memory capacity limits with direct NVM access

TL;DR: DRAGON leverages the page-faulting mechanism on recent NVIDIA GPUs by extending the capabilities of CUDA Unified Memory (UM); it transparently expands memory capacity and obtains additional speedups via automated overlapping of I/O and data transfers.
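DRAGON's own interface is not shown in this summary; for context, here is a minimal sketch of the CUDA Unified Memory baseline it extends, in which a single managed allocation is migrated between host and device by page faults on demand (the sizes and kernel are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void bump(char *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;            // access faults pages onto the GPU
}

int main() {
    size_t n = 1ull << 30;           // may exceed device memory; UM pages
                                     // data in and out on demand
    char *p;
    cudaMallocManaged((void **)&p, n);  // one pointer, valid on CPU and GPU
    for (size_t i = 0; i < n; ++i) p[i] = 0;   // first touch on the host
    bump<<<(unsigned int)((n + 255) / 256), 256>>>(p, n);
    cudaDeviceSynchronize();         // required before the host reads p again
    cudaFree(p);
    return 0;
}
```

DRAGON extends this same fault-handling path so the backing store can be NVM rather than host DRAM, letting working sets grow past both GPU and host memory.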
Proceedings ArticleDOI

Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

TL;DR: Pagoda is presented, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel, and achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
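Pagoda's MasterKernel is not available from this summary; the sketch below shows only the generic persistent-kernel pattern such daemon kernels build on, with a hypothetical one-slot task queue in host-mapped memory (all names are ours, not Pagoda's):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical one-slot task queue; a real daemon kernel manages many
// tasks and schedules them across warps.
struct Task { int op; int arg; int done; };

// Persistent "daemon" kernel: spins on host-visible mapped memory until
// the host requests shutdown, serving tasks without new kernel launches.
__global__ void daemon(volatile Task *t) {
    while (true) {
        int op = t->op;
        if (op == -1) return;            // shutdown request from the host
        if (op == 1) {                   // a task is pending
            t->arg *= 2;                 // stand-in for real work dispatch
            __threadfence_system();      // publish the result to the host
            t->done = 1;
            t->op = 0;                   // mark the slot free again
        }
    }
}

int main() {
    Task *h;                             // host memory mapped into the GPU
    cudaHostAlloc((void **)&h, sizeof(Task), cudaHostAllocMapped);
    h->op = 0; h->arg = 0; h->done = 0;

    Task *d;
    cudaHostGetDevicePointer((void **)&d, h, 0);
    daemon<<<1, 1>>>(d);                 // launched once, serves many tasks

    h->arg = 21;                         // submit work with no kernel launch
    __sync_synchronize();                // order the writes (GCC builtin)
    h->op = 1;
    while (((volatile Task *)h)->done == 0) { }  // wait for completion
    printf("result = %d\n", h->arg);             // expect 42

    h->op = -1;                          // ask the daemon to exit
    cudaDeviceSynchronize();
    cudaFreeHost(h);
    return 0;
}
```

A real daemon kernel would run many blocks, manage a multi-entry queue, and dispatch tasks at warp granularity, which is where Pagoda's fine-grained virtualization comes in.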
Proceedings ArticleDOI

Siena: exploring the design space of heterogeneous memory systems

TL;DR: This paper systematically explores the organization of heterogeneous memory systems using a framework called Siena, which facilitates quick exploration of memory architectures with flexible configurations of memory systems and realistic memory workloads.
Journal ArticleDOI

A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

TL;DR: This work designs a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs.
Journal ArticleDOI

CEDR: A Compiler-integrated, Extensible DSSoC Runtime

TL;DR: CEDR is an environment that enables research into productive application development, resource-management heuristic development, and hardware-configuration analysis for heterogeneous architectures, and it provides insights into the trade-offs present in this design space.
References
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power characteristics, and it has led to important architectural insights, such as the growing significance of memory-bandwidth limitations and the consequent importance of data layout.
Book ChapterDOI

StreamIt: A Language for Streaming Applications

TL;DR: The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain and the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations.
Proceedings ArticleDOI

Mars: a MapReduce framework on graphics processors

TL;DR: Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface, and is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine.
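Mars's MapReduce API is not reproduced here, but the two-pass output scheme the paper describes for writing variable-sized map output without locks or dynamic allocation can be sketched: count output sizes in a first pass, prefix-sum the counts into write offsets, then emit in a second pass. The toy "map" below keeps only even inputs, and the names are ours:

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <cstdio>

// Pass 1: each thread reports only how many records it will emit.
__global__ void count_outputs(const int *in, int *counts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) counts[i] = (in[i] % 2 == 0) ? 1 : 0;
}

// Pass 2: each thread writes at its precomputed, conflict-free offset.
__global__ void emit_outputs(const int *in, const int *offsets,
                             int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] % 2 == 0) out[offsets[i]] = in[i];
}

int main() {
    const int N = 1 << 20;
    thrust::device_vector<int> in(N);
    thrust::sequence(in.begin(), in.end());   // 0, 1, 2, ...

    thrust::device_vector<int> counts(N), offsets(N);
    count_outputs<<<(N + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(in.data()),
        thrust::raw_pointer_cast(counts.data()), N);

    // Exclusive prefix sum turns per-thread counts into write offsets.
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    int total = offsets[N - 1] + counts[N - 1];   // total output size
    thrust::device_vector<int> out(total);
    emit_outputs<<<(N + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(in.data()),
        thrust::raw_pointer_cast(offsets.data()),
        thrust::raw_pointer_cast(out.data()), N);
    cudaDeviceSynchronize();
    printf("emitted %d of %d records\n", total, N);
    return 0;
}
```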
Proceedings ArticleDOI

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

TL;DR: A simple analytical model is proposed that estimates the execution time of massively parallel programs by considering the number of running threads and the memory bandwidth, modeling the cost of memory requests to predict the overall execution time of a program.
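The cited model's actual equations are not reproduced here; the snippet below is only a back-of-the-envelope estimate in the same spirit: assuming enough threads are in flight to hide memory latency, execution time is bounded by whichever of compute throughput or memory bandwidth saturates first. All peak numbers are made up for illustration:

```cuda
#include <cstdio>

// Simplified bound in the spirit of analytical GPU models (not the cited
// paper's equations): time is limited by compute throughput or memory
// bandwidth, whichever saturates first.
double estimate_seconds(double flops, double bytes,
                        double peak_flops_per_s, double peak_bytes_per_s) {
    double t_compute = flops / peak_flops_per_s;
    double t_memory  = bytes / peak_bytes_per_s;
    return t_compute > t_memory ? t_compute : t_memory;
}

int main() {
    // Hypothetical kernel: 2 flops and 12 bytes of traffic per element,
    // on a device with assumed peaks of 5 TFLOP/s and 300 GB/s.
    double n = 1e8;
    double t = estimate_seconds(2.0 * n, 12.0 * n, 5e12, 3e11);
    printf("estimated time: %.4f s (memory-bound in this example)\n", t);
    return 0;
}
```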
Proceedings ArticleDOI

OpenMP to GPGPU: a compiler framework for automatic translation and optimization

TL;DR: This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications, and identifies several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance.
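To make the translation concrete, here is roughly the shape of CUDA code such a source-to-source framework emits for a simple OpenMP loop; the real compiler's output (naming, data transfers, tuning) differs, so treat this as an assumed sketch:

```cuda
#include <cuda_runtime.h>

// Input (OpenMP):
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
//
// The loop body becomes a kernel and the iteration space is mapped onto
// the thread grid, one iteration per thread.
__global__ void omp_loop_0(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard for the ragged last block
}

// A translator would also insert the device allocations, host-device
// transfers, and launch configuration around the generated kernel.
void launch_omp_loop_0(const float *a, const float *b, float *c, int n) {
    omp_loop_0<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
}
```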