Proceedings ArticleDOI

Pagoda: Fine-Grained GPU Resource Virtualization for Narrow Tasks

TLDR
Pagoda is presented, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel, and achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.
Abstract
Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain only a few threads. This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. Experimental results demonstrate that Pagoda achieves a geometric mean speedup of 5.70x over PThreads running on a 20-core CPU, 1.51x over CUDA-HyperQ, and 1.69x over GeMTC, the state-of-the-art runtime GPU task scheduling system.


Citations
Proceedings ArticleDOI

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

TL;DR: MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications, is proposed and evaluations show that MASK restores much of the throughput lost to TLB contention.
Proceedings Article

NICA: An Infrastructure for Inline Acceleration of Network Applications

TL;DR: NICA is a hardware-software co-designed framework for inline acceleration of the application data plane on F-NICs in multi-tenant systems; it integrates its ikernel abstraction with the high-performance VMA network stack and the KVM hypervisor.
Posted Content

Salus: Fine-Grained GPU Sharing Primitives for Deep Learning Applications

TL;DR: Salus implements an efficient, consolidated execution service that exposes the GPU to different DL applications and enforces fine-grained sharing through iteration scheduling and careful memory management; it can implement flexible sharing policies such as fairness, prioritization, and packing for various use cases.
Proceedings ArticleDOI

Manna: An Accelerator for Memory-Augmented Neural Networks

TL;DR: Manna is a memory-centric accelerator design that maximizes performance in an extremely low FLOPS/byte context; it is evaluated with a detailed architectural simulator whose timing and power models are calibrated by synthesis to the 15 nm Nangate Open Cell library.
Proceedings ArticleDOI

AvA: Accelerated Virtualization of Accelerators

TL;DR: AvA provides near-native performance and can enforce sharing policies that are not possible with current techniques, with orders of magnitude less developer effort than required for hand-built virtualization support.
References
Book ChapterDOI

StreamIt: A Language for Streaming Applications

TL;DR: The StreamIt language provides novel high-level representations to improve programmer productivity and program robustness within the streaming domain and the StreamIt compiler aims to improve the performance of streaming applications via stream-specific analyses and optimizations.
Journal ArticleDOI

StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

TL;DR: StarPU is a runtime system that offers a high-level unified execution model, providing numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware and to easily develop and tune powerful scheduling algorithms.
Proceedings ArticleDOI

Benchmarking GPUs to tune dense linear algebra

TL;DR: The authors present performance results for dense linear algebra using recent NVIDIA GPUs and argue that modern GPUs should be viewed as multithreaded multicore vector units, exploiting blocking as on vector computers as well as the heterogeneity of the system.
Book

Parallel programming in C with MPI and OpenMP

TL;DR: This chapter discusses Parallel Architectures, Message-Passing Programming, and Combining MPI and OpenMP.
Book ChapterDOI

Dynamic Storage Allocation: A Survey and Critical Review

TL;DR: This survey describes a variety of memory allocator designs, points out issues relevant to their design and evaluation, and chronologically surveys most of the literature on allocators between 1961 and 1995.