Proceedings ArticleDOI
Selective GPU caches to eliminate CPU-GPU HW cache coherence
Neha Agarwal, David Nellans, Eiman Ebrahimi, Thomas F. Wenisch, John M. Danskin, Stephen W. Keckler
pp. 494-506
TL;DR
This work proposes selective caching, wherein GPU caching is disallowed for any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols.
Abstract
Cache coherence is ubiquitous in shared memory multiprocessors because it provides a simple, high-performance memory abstraction to programmers. Recent work suggests extending hardware cache coherence between CPUs and GPUs to help support programming models with tightly coordinated sharing between CPU and GPU threads. However, implementing hardware cache coherence is particularly challenging in systems with discrete CPUs and GPUs that may not be produced by a single vendor. Instead, we propose selective caching, wherein we disallow GPU caching of any memory that would require coherence updates to propagate between the CPU and GPU, thereby decoupling the GPU from vendor-specific CPU coherence protocols. We propose several architectural improvements to offset the performance penalty of selective caching: aggressive request coalescing, CPU-side coherent caching for GPU-uncacheable requests, and a CPU-GPU interconnect optimization to support variable-size transfers. Moreover, current GPU workloads access many read-only memory pages; we exploit this property to allow promiscuous GPU caching of these pages, relying on page-level protection, rather than hardware cache coherence, to ensure correctness. These optimizations bring a selective caching GPU implementation to within 93% of a hardware cache-coherent implementation without the need to integrate CPUs and GPUs under a single hardware coherence protocol.
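As a rough, hypothetical sketch (not the paper's actual hardware logic), the per-page caching decision that selective caching describes can be illustrated as follows; the `Page` fields and function name are assumptions made for illustration:

```python
from dataclasses import dataclass

@dataclass
class Page:
    """Hypothetical page descriptor; fields are illustrative, not from the paper."""
    read_only: bool        # page is mapped read-only (page-level protection)
    shared_with_cpu: bool  # CPU threads may read or write this page

def should_cache_on_gpu(page: Page) -> bool:
    """Decide whether the GPU may cache lines from this page.

    Read-only pages are cached "promiscuously": correctness comes from
    page-level protection (a later write faults before it can create
    stale copies) rather than from hardware cache coherence. Pages the
    CPU may write are left uncached on the GPU, so no coherence updates
    ever need to propagate between the two processors.
    """
    if page.read_only:
        return True   # promiscuous caching, guarded by page protection
    if page.shared_with_cpu:
        return False  # caching would require CPU-GPU coherence traffic
    return True       # GPU-private read-write data needs no CPU coherence
```

Requests to pages that fail this check would instead be coalesced and serviced through the CPU-side coherent cache, per the optimizations listed in the abstract.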
Citations
Proceedings ArticleDOI
Crossing Guard: Mediating Host-Accelerator Coherence Interactions
TL;DR: The Crossing Guard interface provides the accelerator designer with a standardized set of coherence messages that are simple enough to aid in the design of bug-free coherent caches, yet sufficiently expressive to allow customized and optimized accelerator caches with performance comparable to using the host protocol.
Proceedings ArticleDOI
Need for Speed: Experiences Building a Trustworthy System-Level GPU Simulator
Oreste Villa, Daniel Lustig, Zi Yan, Evgeny Bolotin, Yaosheng Fu, Niladrish Chatterjee, Nan Jiang, David Nellans
TL;DR: NVArchSim is an architectural simulator used within NVIDIA to design and evaluate features that are difficult to appraise using other methodologies due to workload type, size, complexity, or lack of modeling flexibility.
Proceedings ArticleDOI
Coordinated Page Prefetch and Eviction for Memory Oversubscription Management in GPUs
TL;DR: A coordinated page prefetch and eviction (CPPE) scheme is proposed to manage memory oversubscription in GPUs with unified memory; it combines, in a fine-grained manner, a modified page eviction policy, hierarchical page eviction (HPE), and an access-pattern-aware prefetcher.
Journal ArticleDOI
HPE: Hierarchical Page Eviction Policy for Unified Memory in GPUs
TL;DR: HPE, a new replacement policy for GPUs with unified memory, is proposed; it uses statistics to classify applications into three categories, selects an appropriate eviction strategy for each category, and dynamically switches strategies when necessary.
Patent
Apparatus and method for managing data bias in a graphics processing architecture
TL;DR: An apparatus and method are described for managing data that is biased toward a processor or a GPU. The GPU and processor cores share a virtual address space for accessing a system memory; a GPU memory, coupled to the processor, is addressable through the virtual address space shared by the processor cores and the GPU; and bias management circuitry stores an indication, for each of a plurality of blocks of data, of whether the data has a processor bias or a GPU bias.
References
Journal ArticleDOI
Space/time trade-offs in hash coding with allowable errors
TL;DR: Analysis of the paradigm problem demonstrates that allowing a small number of test messages to be falsely identified as members of the given set will permit a much smaller hash area to be used without increasing reject time.
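The space/time trade-off this summary refers to is the Bloom filter. A minimal sketch (parameter choices and hash construction are arbitrary illustrations) shows how a small bit array yields fast membership tests at the cost of occasional false positives, while never producing false negatives:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions per item set in a fixed bit array."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        # Derive k independent positions by salting one cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        # True for every added item; may (rarely) be True for others.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

Shrinking `size_bits` raises the false-positive rate, which is exactly the "allowable errors for a smaller hash area" trade-off the paper analyzes.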
Proceedings ArticleDOI
Rodinia: A benchmark suite for heterogeneous computing
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, Kevin Skadron
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Proceedings ArticleDOI
Analyzing CUDA workloads using a detailed GPU simulator
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Journal ArticleDOI
Cuckoo hashing
TL;DR: A simple dictionary with worst-case constant lookup time is presented, equaling the theoretical performance of the classic dynamic perfect hashing scheme of Dietzfelbinger et al.
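A minimal cuckoo-hashing sketch (a simplification of the scheme this entry summarizes; a real implementation would rehash with new hash functions on a displacement cycle instead of raising) illustrates the worst-case constant lookup: each key lives in one of two tables, so a lookup probes at most two slots.

```python
class CuckooHash:
    """Toy two-table cuckoo hash map; class and method names are illustrative."""

    def __init__(self, capacity: int = 64):
        self.cap = capacity
        self.t1 = [None] * capacity  # each slot holds (key, value) or None
        self.t2 = [None] * capacity

    def _h1(self, key):
        return hash(("t1", key)) % self.cap

    def _h2(self, key):
        return hash(("t2", key)) % self.cap

    def get(self, key):
        # Worst-case constant time: at most two probes, one per table.
        for table, i in ((self.t1, self._h1(key)), (self.t2, self._h2(key))):
            entry = table[i]
            if entry is not None and entry[0] == key:
                return entry[1]
        raise KeyError(key)

    def put(self, key, value) -> None:
        # Overwrite in place if the key is already present.
        for table, i in ((self.t1, self._h1(key)), (self.t2, self._h2(key))):
            if table[i] is not None and table[i][0] == key:
                table[i] = (key, value)
                return
        # Otherwise insert, displacing occupants back and forth ("cuckoo" step).
        entry = (key, value)
        for _ in range(32):  # bounded displacement chain
            i = self._h1(entry[0])
            entry, self.t1[i] = self.t1[i], entry
            if entry is None:
                return
            j = self._h2(entry[0])
            entry, self.t2[j] = self.t2[j], entry
            if entry is None:
                return
        raise RuntimeError("displacement cycle; a full implementation rehashes")
```

Insertion cost is amortized: an item evicted from one table moves to its slot in the other, and chains terminate quickly when the tables are not too full.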
Journal ArticleDOI
A reconfigurable fabric for accelerating large-scale datacenter services
Andrew Putnam, Adrian M. Caulfield, Eric S. Chung, Derek Chiou, Kypros Constantinides, John Demme, Hadi Esmaeilzadeh, Jeremy Fowers, Gopi Prashanth Gopal, Jan Gray, Michael Haselman, Scott Hauck, Stephen F. Heil, Amir Hormati, Joo-Young Kim, Sitaram Lanka, James R. Larus, Eric C. Peterson, Simon Pope, Aaron L. Smith, Jason Thong, Phillip Yi Xiao, Doug Burger
TL;DR: The authors deployed the reconfigurable fabric in a bed of 1,632 servers and FPGAs in a production datacenter and successfully used it to accelerate the ranking portion of the Bing Web search engine by nearly a factor of two.