Proceedings ArticleDOI

MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

TLDR
This paper proposes MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications; evaluations show that MASK restores much of the throughput lost to TLB contention.
Abstract
Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate.

We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.

Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to greatly reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
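To make the first mechanism concrete, below is a minimal, self-contained C++ sketch of a token-based shared-TLB fill policy in the spirit of MASK's TLB-fill tokens: applications that do not hold a token still perform their page walks, but do not insert the result into the shared TLB, so they cannot evict a token holder's translations. All identifiers here (SharedTlb, tokenHolders, the access trace) are illustrative assumptions, not the paper's hardware implementation, and the fixed token assignment stands in for the paper's adaptive, epoch-based token distribution.

```cpp
// tlb_tokens.cpp - illustrative sketch only, not MASK's actual hardware.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <list>
#include <unordered_map>
#include <unordered_set>

// A tiny LRU-managed shared L2 TLB, keyed by (appId, virtual page number).
class SharedTlb {
public:
    explicit SharedTlb(std::size_t capacity) : capacity_(capacity) {}

    bool lookup(uint64_t key) {
        auto it = map_.find(key);
        if (it == map_.end()) return false;
        lru_.splice(lru_.begin(), lru_, it->second); // move hit to MRU position
        return true;
    }

    void fill(uint64_t key) {
        if (map_.count(key)) return;
        if (lru_.size() == capacity_) {              // evict the LRU entry
            map_.erase(lru_.back());
            lru_.pop_back();
        }
        lru_.push_front(key);
        map_[key] = lru_.begin();
    }

private:
    std::size_t capacity_;
    std::list<uint64_t> lru_;
    std::unordered_map<uint64_t, std::list<uint64_t>::iterator> map_;
};

int main() {
    SharedTlb tlb(4);                                // deliberately tiny TLB
    std::unordered_set<int> tokenHolders = {0};      // only app 0 holds a token

    // Two apps interleave accesses. App 1's misses bypass the shared TLB
    // fill instead of thrashing app 0's resident translations.
    struct Access { int app; uint64_t vpn; };
    Access trace[] = {{0, 1}, {1, 100}, {0, 2}, {1, 101}, {0, 1}, {1, 102}, {0, 2}};

    int hits[2] = {0, 0}, misses[2] = {0, 0};
    for (const Access& a : trace) {
        uint64_t key = (static_cast<uint64_t>(a.app) << 48) | a.vpn;
        if (tlb.lookup(key)) {
            ++hits[a.app];
        } else {
            ++misses[a.app];                         // a page walk would occur here
            if (tokenHolders.count(a.app))           // fill only with a token
                tlb.fill(key);
        }
    }
    for (int app = 0; app < 2; ++app)
        std::printf("app %d: %d hits, %d misses\n", app, hits[app], misses[app]);
    return 0;
}
```

In hardware, the same idea amounts to an admission filter on the shared TLB's fill path; the software LRU map above exists only to make the thrashing-avoidance effect observable in a few lines.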


Citations
Journal ArticleDOI

Processing data where it makes sense: Enabling in-memory computation

TL;DR: The authors discuss recent research that aims to practically enable computation close to data, highlighting two promising directions for processing-in-memory (PIM): (1) performing massively parallel bulk operations in memory by exploiting the analog operational properties of DRAM, with low-cost changes, and (2) exploiting the logic layer in 3D-stacked memory technology to accelerate important data-intensive applications.
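As a toy illustration of the first direction, the sketch below emulates on the CPU the kind of row-wide bulk bitwise operation that in-DRAM computation can issue as a single command over an entire row. The 8 KiB row size and the bulkAnd helper are assumptions for illustration; no real PIM interface is used here.

```cpp
// bulk_bitwise.cpp - CPU emulation standing in for an in-DRAM bulk AND.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Emulate an in-memory AND of two "DRAM rows" into a destination row.
// In a PIM design this is one row-wide operation, not a per-word loop.
void bulkAnd(const std::vector<uint64_t>& rowA,
             const std::vector<uint64_t>& rowB,
             std::vector<uint64_t>& rowDst) {
    for (std::size_t i = 0; i < rowDst.size(); ++i)
        rowDst[i] = rowA[i] & rowB[i];
}

int main() {
    const std::size_t words = 8192 / sizeof(uint64_t); // one 8 KiB row (assumed)
    std::vector<uint64_t> a(words, 0xF0F0F0F0F0F0F0F0ull);
    std::vector<uint64_t> b(words, 0x0FF00FF00FF00FF0ull);
    std::vector<uint64_t> dst(words);
    bulkAnd(a, b, dst);
    std::printf("dst[0] = %016llx\n",
                static_cast<unsigned long long>(dst[0]));
    return 0;
}
```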

Aergia: Exploiting Packet Latency Slack in On-Chip Networks

TL;DR: Aergia introduces new router prioritization policies that exploit interfering packets' available slack to improve overall system performance and fairness, defining slack as a key measure of a packet's relative importance.
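A toy sketch of the core idea, under the assumption that each packet arrives annotated with an estimated slack (the cycles it can be delayed without lengthening a stall): router arbitration simply grants low-slack packets first. The Packet fields and slack values below are hypothetical; Aergia's actual slack estimation uses hardware predictors in the on-chip network.

```cpp
// aergia_slack.cpp - toy slack-aware arbitration, not the paper's design.
#include <cstdio>
#include <queue>
#include <vector>

struct Packet {
    int id;
    int slackCycles;   // lower slack => more critical => higher priority
};

// Order the arbitration queue so the lowest-slack packet wins first.
struct BySlack {
    bool operator()(const Packet& a, const Packet& b) const {
        return a.slackCycles > b.slackCycles;   // min-heap on slack
    }
};

int main() {
    std::priority_queue<Packet, std::vector<Packet>, BySlack> vcArbiter;
    // Hypothetical packets contending for one router output port.
    vcArbiter.push({1, 40});   // ample slack: other misses hide its latency
    vcArbiter.push({2, 0});    // critical: a thread stalls on this packet
    vcArbiter.push({3, 12});

    while (!vcArbiter.empty()) {
        Packet p = vcArbiter.top();
        vcArbiter.pop();
        std::printf("grant packet %d (slack %d)\n", p.id, p.slackCycles);
    }
    return 0;
}
```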
Posted Content

A Modern Primer on Processing in Memory.

TL;DR: This chapter discusses recent research that aims to practically enable computation close to data, an approach called processing-in-memory (PIM).
Proceedings ArticleDOI

The locality descriptor: a holistic cross-layer abstraction to express data locality in GPUs

TL;DR: The Locality Descriptor is proposed, a cross-layer abstraction to explicitly express and exploit data locality in GPUs that improves performance by 26.6% on average when exploiting reuse-based locality in the cache hierarchy, and by 53.7% when exploiting NUMA locality in a NUMA memory system.
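For flavor, here is a hypothetical sketch of what such a descriptor could look like as a plain data structure: the program annotates a data structure with the kind of locality it exhibits, and a runtime (in the paper, compiler, runtime, and hardware together) uses the hint to pick a caching or placement policy. Every name and field below is an assumption for illustration; the paper's abstraction is richer.

```cpp
// locality_descriptor.cpp - hypothetical sketch, not the paper's interface.
#include <cstddef>
#include <cstdio>

enum class LocalityType { ReuseBased, NumaPlacement };

struct LocalityDescriptor {
    const void* base;     // data structure this hint describes
    std::size_t bytes;    // its footprint
    LocalityType type;    // which kind of locality to exploit
    int tileSize;         // granularity of sharing across thread blocks
};

// A runtime (sketch) consumes the hint to choose a policy.
void applyHint(const LocalityDescriptor& d) {
    if (d.type == LocalityType::ReuseBased)
        std::printf("prefer caching %zu bytes, tile=%d\n", d.bytes, d.tileSize);
    else
        std::printf("co-locate %zu bytes with the thread blocks that touch them\n",
                    d.bytes);
}

int main() {
    float data[1024] = {};
    LocalityDescriptor d{data, sizeof(data), LocalityType::ReuseBased, 256};
    applyHint(d);
    return 0;
}
```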
Posted Content

Enabling the Adoption of Processing-in-Memory: Challenges, Mechanisms, Future Research Directions

TL;DR: This work proposes and evaluates two general-purpose solutions that minimize unnecessary off-chip communication for PIM architectures, and shows that both mechanisms improve the performance and reduce the energy consumption of many important memory-intensive applications.
References
Journal ArticleDOI

Fast gapped-read alignment with Bowtie 2

TL;DR: Bowtie 2 combines the strengths of the full-text minute (FM) index with the flexibility and speed of hardware-accelerated dynamic programming algorithms to achieve high speed, sensitivity, and accuracy.
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power consumption, and has led to important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Proceedings ArticleDOI

Analyzing CUDA workloads using a detailed GPU simulator

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Journal ArticleDOI

NVIDIA Tesla: A Unified Graphics and Computing Architecture

TL;DR: To enable flexible, programmable graphics and high-performance computing, NVIDIA has developed the Tesla scalable unified graphics and parallel computing architecture, which is massively multithreaded and programmable in C or via graphics APIs.
Trending Questions (1)
What are the latest advances in GPU memory hierarchy design?

The paper proposes MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications on a shared GPU. MASK consists of three novel address-translation-aware cache and memory management mechanisms: a token-based technique that reduces TLB contention, a TLB bypassing mechanism that improves the effectiveness of cached address translations, and an application-aware memory scheduler that reduces interference between address translation and data requests.