Proceedings ArticleDOI

Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

TLDR
Pagoda is presented, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel, and achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
Abstract
Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain only a few threads. This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. Experimental results demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.


Citations
Proceedings ArticleDOI

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

TL;DR: MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications, is proposed and evaluations show that MASK restores much of the throughput lost to TLB contention.
Proceedings Article

NICA: An Infrastructure for Inline Acceleration of Network Applications

TL;DR: NICA is a hardware-software co-designed framework for inline acceleration of the application data plane on F-NICs in multi-tenant systems; it integrates its ikernel abstraction with the high-performance VMA network stack and the KVM hypervisor.
Posted Content

Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

TL;DR: Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications and enforces fine-grained sharing through iteration scheduling and careful memory management; it can implement flexible sharing policies such as fairness, prioritization, and packing for various use cases.
Proceedings ArticleDOI

Manna: An Accelerator for Memory-Augmented Neural Networks

TL;DR: Manna is a memory-centric accelerator design that maximizes performance in an extremely low FLOPS/byte context; it is evaluated with a detailed architectural simulator whose timing and power models are calibrated by synthesis to the 15 nm Nangate Open Cell library.
Proceedings ArticleDOI

AvA: Accelerated Virtualization of Accelerators

TL;DR: AvA provides near-native performance and can enforce sharing policies that are not possible with current techniques, with orders of magnitude less developer effort than required for hand-built virtualization support.
References
Book ChapterDOI

StreamIt: A Language for Streaming Applications

TL;DR: The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain and the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations.
Journal ArticleDOI

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

TL;DR: StarPU is a runtime system that offers a high-level unified execution model, providing numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware and to easily develop and tune powerful scheduling algorithms.
Proceedings ArticleDOI

Benchmarking GPUs to tune dense linear algebra

TL;DR: The authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units, exploiting blocking as on vector computers as well as the heterogeneity of the system.
Book

Parallel programming in C with MPI and OpenMP

TL;DR: This chapter discusses Parallel Architectures, Message-Passing Programming, and Combining MPI and OpenMP.
Book ChapterDOI

Dynamic Storage Allocation: A Survey and Critical Review

TL;DR: This survey describes a variety of memory allocator designs, points out issues relevant to their design and evaluation, and chronologically surveys most of the literature on allocators between 1961 and 1995.