Proceedings ArticleDOI
Selective GPU caches to eliminate CPU-GPU HW cache coherence
Neha Agarwal, David Nellans, Eiman Ebrahimi, Thomas F. Wenisch, John M. Danskin, Stephen W. Keckler
pp. 494-506
TL;DR
This work proposes selective caching, wherein GPU caching is disallowed for any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols.
Abstract
Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high-performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor. Instead, we propose selective caching, wherein we disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols. We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU interconnect optimization to support variable-size transfers. Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache coherence, to ensure correctness. These optimizations bring a selective caching GPU implementation to within 93% of a hardware cache-coherent implementation without the need to integrate CPUs and GPUs under a single hardware coherence protocol.
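As a rough, hypothetical sketch (not the paper's actual hardware logic), the per-page caching decision that selective caching describes can be illustrated as follows; the `Page` fields and function name are assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Hypothetical page descriptor; fields are illustrative, not from the paper."""
    read_only: bool        # page is mapped read-only (page-level protection)
    shared_with_cpu: bool  # CPU threads may read or write this page

def should_cache_on_gpu(page: Page) -> bool:
    """Decide whether the GPU may cache lines from this page.

    Read-only pages are cached "promiscuously": correctness comes from
    page-level protection (a later write faults before it can create
    stale copies) rather than from hardware cache coherence. Pages the
    CPU may write are left uncached on the GPU, so no coherence updates
    ever need to propagate between the two processors.
    """
    if page.read_only:
        return True   # promiscuous caching, guarded by page protection
    if page.shared_with_cpu:
        return False  # caching would require CPU-GPU coherence traffic
    return True       # GPU-private read-write data needs no CPU coherence
```

Requests to pages that fail this check would instead be coalesced and serviced through the CPU-side coherent cache, per the optimizations listed in the abstract.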
Citations
Proceedings ArticleDOI
Crossing Guard: Mediating Host-Accelerator Coherence Interactions
TL;DR: The Crossing Guard interface provides the accelerator designer with a standardized set of coherence messages that are simple enough to aid in the design of bug-free coherent caches, yet sufficiently expressive to allow customized and optimized accelerator caches with performance comparable to using the host protocol.
Proceedings ArticleDOI
Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator
Oreste Villa, Daniel Lustig, Zi Yan, Evgeny Bolotin, Yaosheng Fu, Niladrish Chatterjee, Nan Jiang, David Nellans
TL;DR: NVArchSim is an architectural simulator used within NVIDIA to design and evaluate features that are difficult to appraise using other methodologies due to workload type, size, complexity, or lack of modeling flexibility.
Proceedings ArticleDOI
Coordinated Page Prefetch and Eviction for Memory Oversubscription Management in GPUs
TL;DR: A coordinated page prefetch and eviction (CPPE) scheme is proposed to manage memory oversubscription in GPUs with unified memory; it combines, in a fine-grained manner, a modified page eviction policy, hierarchical page eviction (HPE), and an access-pattern-aware prefetcher.
Journal ArticleDOI
HPE: Hierarchical Page Eviction Policy for Unified Memory in GPUs
TL;DR: HPE, a new replacement policy for GPUs with unified memory, is proposed; it uses statistics to classify applications into three categories, selects an appropriate eviction strategy for each category, and dynamically switches strategies when necessary.
Patent
Apparatus and method for managing data bias in a graphics processing architecture
TL;DR: An apparatus and method are described for managing data that is biased toward a processor or a GPU. The GPU and processor cores share a virtual address space for accessing a system memory; a GPU memory, coupled to the processor, is addressable through the virtual address space shared by the processor cores and the GPU; and bias management circuitry stores an indication, for each of a plurality of blocks of data, of whether the data has a processor bias or a GPU bias.
References
Journal ArticleDOI
Space/time trade-offs in hash coding with allowable errors
TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
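The space/time trade-off this summary refers to is the Bloom filter. A minimal sketch (parameter choices and hash construction are arbitrary illustrations) shows how a small bit array yields fast membership tests at the cost of occasional false positives, while never producing false negatives:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per item set in a fixed bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k independent positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # True for every added item; may (rarely) be True for others.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Shrinking `size_bits` raises the false-positive rate, which is exactly the "allowable errors for a smaller hash area" trade-off the paper analyzes.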
Proceedings ArticleDOI
Rodinia: A benchmark suite for heterogeneous computing
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, Kevin Skadron
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Proceedings ArticleDOI
Analyzing CUDA workloads using a detailed GPU simulator
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Journal ArticleDOI
Cuckoo hashing
TL;DR: A simple dictionary with worst-case constant lookup time is presented, equaling the theoretical performance of the classic dynamic perfect hashing scheme of Dietzfelbinger et al.
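A minimal cuckoo-hashing sketch (a simplification of the scheme this entry summarizes; a real implementation would rehash with new hash functions on a displacement cycle instead of raising) illustrates the worst-case constant lookup: each key lives in one of two tables, so a lookup probes at most two slots.

```python
class CuckooHash:
    """Toy two-table cuckoo hash map; class and method names are illustrative."""

    def __init__(self, capacity: int = 64):
        self.cap = capacity
        self.t1 = [None] * capacity  # each slot holds (key, value) or None
        self.t2 = [None] * capacity

    def _h1(self, key):
        return hash(("t1", key)) % self.cap

    def _h2(self, key):
        return hash(("t2", key)) % self.cap

    def get(self, key):
        # Worst-case constant time: at most two probes, one per table.
        for table, i in ((self.t1, self._h1(key)), (self.t2, self._h2(key))):
            entry = table[i]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def put(self, key, value) -> None:
        # Overwrite in place if the key is already present.
        for table, i in ((self.t1, self._h1(key)), (self.t2, self._h2(key))):
            if table[i] is not None and table[i][0] == key:
                table[i] = (key, value)
                return
        # Otherwise insert, displacing occupants back and forth ("cuckoo" step).
        entry = (key, value)
        for _ in range(32):  # bounded displacement chain
            i = self._h1(entry[0])
            entry, self.t1[i] = self.t1[i], entry
            if entry is None:
                return
            j = self._h2(entry[0])
            entry, self.t2[j] = self.t2[j], entry
            if entry is None:
                return
        raise RuntimeError("displacement cycle; a full implementation rehashes")
```

Insertion cost is amortized: an item evicted from one table moves to its slot in the other, and chains terminate quickly when the tables are not too full.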
Journal ArticleDOI
A reconfigurable fabric for accelerating large-scale datacenter services
Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen F. Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James R. Larus, Eric C. Peterson, Simon Pope, Aaron L. Smith, Jason Thong, Phillip Yi Xiao, Doug Burger
TL;DR: The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.