Proceedings ArticleDOI

Scaling large-data computations on multi-GPU accelerators

TLDR
A mechanism and an implementation are presented to automatically pipeline the CPU-GPU memory channel so as to overlap GPU computation with memory copies, alleviating the data transfer overhead; in addition, a novel adaptive runtime tuning mechanism is proposed to automatically select the pipeline stage size.
Abstract
Modern supercomputers rely on accelerators to speed up highly parallel workloads. Intricate programming models, limited device memory sizes and overheads of data transfers between CPU and accelerator memories are among the open challenges that restrict the widespread use of accelerators. First, this paper proposes a mechanism and an implementation to automatically pipeline the CPU-GPU memory channel so as to overlap GPU computation with the memory copies, alleviating the data transfer overhead. Second, in doing so, the paper presents a technique called Computation Splitting (COSP) that caters to arbitrary device memory sizes and automatically runs out-of-card OpenMP-like applications on GPUs. Third, a novel adaptive runtime tuning mechanism is proposed to automatically select the pipeline stage size so as to gain the best possible performance; the mechanism adapts to the underlying hardware in the starting phase of a program and chooses the pipeline stage size. The techniques are implemented in a system that translates an input OpenMP program to run on multiple GPUs attached to the same host CPU. Experiments on a set of nine benchmarks show that, on average, the pipelining scheme improves performance by 1.49x while limiting the runtime tuning overhead to 3% of the execution time.
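The paper's runtime system is not reproduced here, but the core pipelining idea it automates can be sketched in plain CUDA: split a large array into chunks and issue each chunk's host-to-device copy, kernel, and device-to-host copy on a rotating set of streams, so transfers of one chunk overlap computation on another. The kernel, chunk size, and stream count below are illustrative assumptions, not the paper's tuned values.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for a translated OpenMP loop body (hypothetical workload).
__global__ void scale(float *d, int n, float f) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= f;
}

int main() {
    const int N = 1 << 24;       // total elements
    const int CHUNK = 1 << 20;   // pipeline stage size; the paper's runtime
                                 // tunes this value adaptively
    const int NSTREAMS = 4;      // in-flight pipeline stages

    float *h;                    // pinned memory is required for async copies
    cudaMallocHost((void **)&h, N * sizeof(float));
    for (int i = 0; i < N; ++i) h[i] = 1.0f;

    float *d[NSTREAMS];
    cudaStream_t s[NSTREAMS];
    for (int k = 0; k < NSTREAMS; ++k) {
        cudaMalloc((void **)&d[k], CHUNK * sizeof(float));  // per-stage buffer
        cudaStreamCreate(&s[k]);
    }

    // Issue copy-in, kernel, copy-out per chunk on rotating streams; work in
    // different streams overlaps, so copy engines and SMs run concurrently.
    for (int off = 0, k = 0; off < N; off += CHUNK, k = (k + 1) % NSTREAMS) {
        int n = (N - off < CHUNK) ? N - off : CHUNK;
        cudaMemcpyAsync(d[k], h + off, n * sizeof(float),
                        cudaMemcpyHostToDevice, s[k]);
        scale<<<(n + 255) / 256, 256, 0, s[k]>>>(d[k], n, 2.0f);
        cudaMemcpyAsync(h + off, d[k], n * sizeof(float),
                        cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();
    printf("h[0] = %.1f\n", h[0]);   // expect 2.0

    for (int k = 0; k < NSTREAMS; ++k) {
        cudaFree(d[k]);
        cudaStreamDestroy(s[k]);
    }
    cudaFreeHost(h);
    return 0;
}
```

Because operations enqueued on the same stream execute in order, reusing a stage's device buffer for a later chunk on that stream needs no extra synchronization; the adaptive tuning the paper proposes would adjust CHUNK at runtime rather than hard-coding it.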


Citations
Proceedings ArticleDOI

DRAGON: breaking GPU memory capacity limits with direct NVM access

TL;DR: DRAGON leverages the page-faulting mechanism on recent NVIDIA GPUs by extending the capabilities of CUDA Unified Memory (UM); it transparently expands memory capacity and obtains additional speedups via automated overlapping of I/O and data transfers.
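DRAGON's own interface is not shown in this summary; for context, here is a minimal sketch of the CUDA Unified Memory baseline it extends, in which a single managed allocation is migrated between host and device by page faults on demand (the sizes and kernel are illustrative):

```cuda
#include <cuda_runtime.h>

__global__ void bump(char *p, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;            // access faults pages onto the GPU
}

int main() {
    size_t n = 1ull << 30;           // may exceed device memory; UM pages
                                     // data in and out on demand
    char *p;
    cudaMallocManaged((void **)&p, n);  // one pointer, valid on CPU and GPU
    for (size_t i = 0; i < n; ++i) p[i] = 0;   // first touch on the host
    bump<<<(unsigned int)((n + 255) / 256), 256>>>(p, n);
    cudaDeviceSynchronize();         // required before the host reads p again
    cudaFree(p);
    return 0;
}
```

DRAGON extends this same fault-handling path so the backing store can be NVM rather than host DRAM, letting working sets grow past both GPU and host memory.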
Proceedings ArticleDOI

Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

TL;DR: Pagoda is presented, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel, and achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
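Pagoda's MasterKernel is not available from this summary; the sketch below shows only the generic persistent-kernel pattern such daemon kernels build on, with a hypothetical one-slot task queue in host-mapped memory (all names are ours, not Pagoda's):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical one-slot task queue; a real daemon kernel manages many
// tasks and schedules them across warps.
struct Task { int op; int arg; int done; };

// Persistent "daemon" kernel: spins on host-visible mapped memory until
// the host requests shutdown, serving tasks without new kernel launches.
__global__ void daemon(volatile Task *t) {
    while (true) {
        int op = t->op;
        if (op == -1) return;            // shutdown request from the host
        if (op == 1) {                   // a task is pending
            t->arg *= 2;                 // stand-in for real work dispatch
            __threadfence_system();      // publish the result to the host
            t->done = 1;
            t->op = 0;                   // mark the slot free again
        }
    }
}

int main() {
    Task *h;                             // host memory mapped into the GPU
    cudaHostAlloc((void **)&h, sizeof(Task), cudaHostAllocMapped);
    h->op = 0; h->arg = 0; h->done = 0;

    Task *d;
    cudaHostGetDevicePointer((void **)&d, h, 0);
    daemon<<<1, 1>>>(d);                 // launched once, serves many tasks

    h->arg = 21;                         // submit work with no kernel launch
    __sync_synchronize();                // order the writes (GCC builtin)
    h->op = 1;
    while (((volatile Task *)h)->done == 0) { }  // wait for completion
    printf("result = %d\n", h->arg);             // expect 42

    h->op = -1;                          // ask the daemon to exit
    cudaDeviceSynchronize();
    cudaFreeHost(h);
    return 0;
}
```

A real daemon kernel would run many blocks, manage a multi-entry queue, and dispatch tasks at warp granularity, which is where Pagoda's fine-grained virtualization comes in.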
Proceedings ArticleDOI

Siena: exploring the design space of heterogeneous memory systems

TL;DR: This paper systematically explores the organization of heterogeneous memory systems using a framework called Siena, which facilitates quick exploration of memory architectures with flexible configurations of memory systems and realistic memory workloads.
Journal ArticleDOI

A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

TL;DR: This work designs a class of heterogeneous tile algorithms to maximize the degree of parallelism, to minimize the communication volume, and to accommodate the heterogeneity between CPUs and GPUs.
Journal ArticleDOI

CEDR: A Compiler-integrated, Extensible DSSoC Runtime

TL;DR: CEDR is an environment that enables research into productive application development, resource-management heuristic development, and hardware-configuration analysis for heterogeneous architectures, and it provides insights into the trade-offs present in this design space.
References
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power characteristics, and it has led to important architectural insights, such as the growing significance of memory-bandwidth limitations and the consequent importance of data layout.
Book ChapterDOI

StreamIt: A Language for Streaming Applications

TL;DR: The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain and the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations.
Proceedings ArticleDOI

Mars: a MapReduce framework on graphics processors

TL;DR: Mars hides the programming complexity of the GPU behind the simple and familiar MapReduce interface, and is up to 16 times faster than its CPU-based counterpart for six common web applications on a quad-core machine.
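Mars's MapReduce API is not reproduced here, but the two-pass output scheme the paper describes for writing variable-sized map output without locks or dynamic allocation can be sketched: count output sizes in a first pass, prefix-sum the counts into write offsets, then emit in a second pass. The toy "map" below keeps only even inputs, and the names are ours:

```cuda
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <cstdio>

// Pass 1: each thread reports only how many records it will emit.
__global__ void count_outputs(const int *in, int *counts, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) counts[i] = (in[i] % 2 == 0) ? 1 : 0;
}

// Pass 2: each thread writes at its precomputed, conflict-free offset.
__global__ void emit_outputs(const int *in, const int *offsets,
                             int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && in[i] % 2 == 0) out[offsets[i]] = in[i];
}

int main() {
    const int N = 1 << 20;
    thrust::device_vector<int> in(N);
    thrust::sequence(in.begin(), in.end());   // 0, 1, 2, ...

    thrust::device_vector<int> counts(N), offsets(N);
    count_outputs<<<(N + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(in.data()),
        thrust::raw_pointer_cast(counts.data()), N);

    // Exclusive prefix sum turns per-thread counts into write offsets.
    thrust::exclusive_scan(counts.begin(), counts.end(), offsets.begin());

    int total = offsets[N - 1] + counts[N - 1];   // total output size
    thrust::device_vector<int> out(total);
    emit_outputs<<<(N + 255) / 256, 256>>>(
        thrust::raw_pointer_cast(in.data()),
        thrust::raw_pointer_cast(offsets.data()),
        thrust::raw_pointer_cast(out.data()), N);
    cudaDeviceSynchronize();
    printf("emitted %d of %d records\n", total, N);
    return 0;
}
```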
Proceedings ArticleDOI

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

TL;DR: A simple analytical model is proposed that estimates the execution time of massively parallel programs by considering the number of running threads and the memory bandwidth, modeling the cost of memory requests to predict the overall execution time of a program.
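The cited model's actual equations are not reproduced here; the snippet below is only a back-of-the-envelope estimate in the same spirit: assuming enough threads are in flight to hide memory latency, execution time is bounded by whichever of compute throughput or memory bandwidth saturates first. All peak numbers are made up for illustration:

```cuda
#include <cstdio>

// Simplified bound in the spirit of analytical GPU models (not the cited
// paper's equations): time is limited by compute throughput or memory
// bandwidth, whichever saturates first.
double estimate_seconds(double flops, double bytes,
                        double peak_flops_per_s, double peak_bytes_per_s) {
    double t_compute = flops / peak_flops_per_s;
    double t_memory  = bytes / peak_bytes_per_s;
    return t_compute > t_memory ? t_compute : t_memory;
}

int main() {
    // Hypothetical kernel: 2 flops and 12 bytes of traffic per element,
    // on a device with assumed peaks of 5 TFLOP/s and 300 GB/s.
    double n = 1e8;
    double t = estimate_seconds(2.0 * n, 12.0 * n, 5e12, 3e11);
    printf("estimated time: %.4f s (memory-bound in this example)\n", t);
    return 0;
}
```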
Proceedings ArticleDOI

OpenMP to GPGPU: a compiler framework for automatic translation and optimization

TL;DR: This paper presents a compiler framework for automatic source-to-source translation of standard OpenMP applications into CUDA-based GPGPU applications, and identifies several key transformation techniques, which enable efficient GPU global memory access, to achieve high performance.
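To make the translation concrete, here is roughly the shape of CUDA code such a source-to-source framework emits for a simple OpenMP loop; the real compiler's output (naming, data transfers, tuning) differs, so treat this as an assumed sketch:

```cuda
#include <cuda_runtime.h>

// Input (OpenMP):
//   #pragma omp parallel for
//   for (int i = 0; i < n; ++i) c[i] = a[i] + b[i];
//
// The loop body becomes a kernel and the iteration space is mapped onto
// the thread grid, one iteration per thread.
__global__ void omp_loop_0(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard for the ragged last block
}

// A translator would also insert the device allocations, host-device
// transfers, and launch configuration around the generated kernel.
void launch_omp_loop_0(const float *a, const float *b, float *c, int n) {
    omp_loop_0<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
}
```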