Proceedings ArticleDOI

Selective GPU caches to eliminate CPU-GPU HW cache coherence

TLDR
This work proposes selective caching, which disallows GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols.
Abstract
Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor. Instead, we propose selective caching, wherein we disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols. We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU interconnect optimization to support variable-size transfers. Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache coherence, to ensure correctness. These optimizations bring a selective caching GPU implementation to within 93% of a hardware cache-coherent implementation without the need to integrate CPUs and GPUs under a single hardware coherence protocol.
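The caching policy the abstract describes can be sketched as a simple per-page decision. This is an illustrative reconstruction, not code from the paper; the names (`PageInfo`, `may_cache_on_gpu`) and the default-deny structure are assumptions drawn from the abstract's three cases.

```python
# Hypothetical sketch of the selective-caching decision: the GPU caches a
# page only when doing so can never require CPU-GPU coherence traffic.
from dataclasses import dataclass

@dataclass
class PageInfo:
    cpu_shared: bool   # page may also be touched by CPU threads
    read_only: bool    # page protection forbids writes

def may_cache_on_gpu(page: PageInfo) -> bool:
    # GPU-private pages never need cross-device coherence updates.
    if not page.cpu_shared:
        return True
    # Read-only shared pages can be cached "promiscuously": a page-level
    # protection fault on any write preserves correctness without
    # hardware cache coherence.
    if page.read_only:
        return True
    # Writable shared pages stay uncached on the GPU; their requests are
    # coalesced and served by CPU-side coherent caches instead.
    return False
```

The key design point is that correctness never depends on a cross-vendor coherence protocol: either the page cannot generate coherence traffic, or page-level protection catches the violating access.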


Citations
Proceedings ArticleDOI

Crossing Guard: Mediating Host-Accelerator Coherence Interactions

TL;DR: The Crossing Guard interface provides the accelerator designer with a standardized set of coherence messages that are simple enough to aid in design of bug-free coherent caches, and sufficiently complex to allow customized and optimized accelerator caches with performance comparable to using the host protocol.
Proceedings ArticleDOI

Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator

TL;DR: NVArchSim is an architectural simulator used within NVIDIA to design and evaluate features that are difficult to appraise using other methodologies due to workload type, size, complexity, or lack of modeling flexibility.
Proceedings ArticleDOI

Coordinated Page Prefetch and Eviction for Memory Oversubscription Management in GPUs

TL;DR: This article proposes coordinated page prefetch and eviction (CPPE) to manage memory oversubscription in GPUs with unified memory; CPPE combines a modified page eviction policy, hierarchical page eviction (HPE), with an access-pattern-aware prefetcher operating at fine granularity.
Journal ArticleDOI

HPE: Hierarchical Page Eviction Policy for Unified Memory in GPUs

TL;DR: HPE is proposed, a new replacement policy for GPUs with unified memory that uses statistics to classify applications into three categories and selects an appropriate eviction strategy for each category, and applies dynamic adjustment to switch the eviction strategy when necessary.
Patent

Apparatus and method for managing data bias in a graphics processing architecture

TL;DR: This patent describes an apparatus and method for managing data that is biased toward either a processor or a GPU: the GPU and processor cores share a virtual address space for accessing a system memory; a GPU memory coupled to the processor is addressable through that shared virtual address space; and bias management circuitry stores an indication, for each of a plurality of blocks of data, of whether the block has processor bias or GPU bias.
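The per-block bias tracking described in the patent summary can be illustrated with a small table keyed by block index. This is a hypothetical sketch; the names (`Bias`, `BiasTable`), the block size, and the CPU-bias default are all assumptions, not details from the patent.

```python
# Illustrative per-block bias table: one indication per block of data,
# recording whether the block currently has processor (CPU) or GPU bias.
from enum import Enum

class Bias(Enum):
    CPU = "cpu"
    GPU = "gpu"

class BiasTable:
    def __init__(self, block_size: int = 4096):  # block size is an assumption
        self.block_size = block_size
        self._bias = {}  # block index -> Bias

    def set_bias(self, addr: int, bias: Bias) -> None:
        self._bias[addr // self.block_size] = bias

    def get_bias(self, addr: int) -> Bias:
        # Untracked blocks default to CPU bias here (an assumption).
        return self._bias.get(addr // self.block_size, Bias.CPU)
```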
References
Journal ArticleDOI

Space/time trade-offs in hash coding with allowable errors

TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
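The trade-off in this TL;DR — accepting a small false-positive rate in exchange for a much smaller "hash area" — is the Bloom filter. A minimal sketch (parameter choices `m` and `k` are illustrative):

```python
# Minimal Bloom filter: a bit array of m bits and k hash functions.
# Membership tests have no false negatives, but may report false
# positives; shrinking m trades space for a higher false-positive rate.
import hashlib

class BloomFilter:
    def __init__(self, m: int = 1024, k: int = 3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _indexes(self, item: str):
        # Derive k indexes by salting a cryptographic hash (one simple
        # construction; real implementations often use cheaper hashes).
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item: str) -> None:
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def __contains__(self, item: str) -> bool:
        return all(self.bits[idx] for idx in self._indexes(item))
```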
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Proceedings ArticleDOI

Analyzing CUDA workloads using a detailed GPU simulator

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Journal ArticleDOI

Cuckoo hashing

TL;DR: In this paper, a simple dictionary with worst case constant lookup time was presented, equaling the theoretical performance of the classic dynamic perfect hashing scheme of Dietzfelbinger et al.
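The worst-case constant lookup in this TL;DR comes from giving each key exactly two candidate slots, one per table, so a query inspects at most two positions. A compact sketch (table size, kick limit, and the cycle-handling shortcut are simplifications; a full implementation rehashes on failure):

```python
# Compact cuckoo hash table: two tables, each key hashes to one slot per
# table. Lookup checks at most two slots -> worst-case constant time.
class CuckooHash:
    def __init__(self, size: int = 16):
        self.size = size
        self.tables = [[None] * size, [None] * size]

    def _slot(self, key, t: int) -> int:
        return hash((t, key)) % self.size

    def get(self, key):
        for t in (0, 1):
            entry = self.tables[t][self._slot(key, t)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value, max_kicks: int = 32) -> None:
        # Update in place if the key is already stored.
        for t in (0, 1):
            i = self._slot(key, t)
            if self.tables[t][i] is not None and self.tables[t][i][0] == key:
                self.tables[t][i] = (key, value)
                return
        # Otherwise insert, displacing ("kicking") occupants between
        # tables until an empty slot is found.
        item, t = (key, value), 0
        for _ in range(max_kicks):
            i = self._slot(item[0], t)
            item, self.tables[t][i] = self.tables[t][i], item
            if item is None:
                return
            t = 1 - t
        raise RuntimeError("kick limit hit; a full implementation rehashes")
```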