Cache-Conscious Wavefront Scheduling
Timothy G. Rogers, Mike O'Connor, Tor M. Aamodt
pp. 72–83
TL;DR
This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that uses a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
Abstract
This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating it against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high-performance computing. At an estimated cost of 0.17% of total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to an average of 25% fewer L1 data cache misses, which results in a harmonic-mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.
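The mechanism the abstract describes can be illustrated with a minimal sketch: each wavefront (warp) accumulates a lost-locality score when it misses on a cache line it recently evicted, and the scheduler throttles issue to fewer warps once aggregate lost locality grows too large. All names, the victim-tag capacity, the score increment, the threshold, and the keep-half throttling policy below are illustrative assumptions, not the paper's exact parameters or structures.

```python
from collections import OrderedDict

class WarpState:
    def __init__(self, wid):
        self.wid = wid
        self.score = 0                    # lost-locality score for this warp
        self.victim_tags = OrderedDict()  # per-warp victim tag array (FIFO)

class CCWSScheduler:
    def __init__(self, num_warps, victim_entries=8, threshold=4):
        self.warps = [WarpState(w) for w in range(num_warps)]
        self.victim_entries = victim_entries
        self.threshold = threshold

    def on_l1_eviction(self, wid, tag):
        """Record which warp lost which L1 line (feeds the locality detector)."""
        vt = self.warps[wid].victim_tags
        vt[tag] = True
        if len(vt) > self.victim_entries:
            vt.popitem(last=False)        # FIFO replacement of victim tags

    def on_l1_miss(self, wid, tag):
        """A miss that hits in the warp's own victim tags means the warp
        re-referenced a line it recently lost: intra-wavefront locality was
        destroyed by cache contention, so bump that warp's score."""
        vt = self.warps[wid].victim_tags
        if tag in vt:
            del vt[tag]
            self.warps[wid].score += 1

    def issuable_warps(self):
        """Throttle: once total lost locality exceeds the threshold, only the
        highest-scoring warps may issue, so their working sets stay cached."""
        total = sum(w.score for w in self.warps)
        if total <= self.threshold:
            return [w.wid for w in self.warps]
        ranked = sorted(self.warps, key=lambda w: w.score, reverse=True)
        keep = max(1, len(ranked) // 2)   # illustrative throttling depth
        return [w.wid for w in ranked[:keep]]
```

For example, if one warp of four repeatedly misses on lines it just evicted, its score climbs past the threshold and the scheduler restricts issue to a subset of warps led by that warp, shaping the access pattern instead of relying on the replacement policy.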
Citations
Proceedings Article
OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance
Adwait Jog, Onur Kayiran, Nachiappan Chidambaram Nachiappan, Asit K. Mishra, Mahmut Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das
TL;DR: This paper presents a coordinated CTA-aware scheduling policy that utilizes four schemes to minimize the impact of long memory latencies; evaluations indicate that the proposed mechanism provides a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
Proceedings Article
Neither more nor less: optimizing thread-level parallelism for GPGPUs
TL;DR: To reduce resource contention, this paper proposes a dynamic CTA scheduling mechanism, called DYNCTA, which modulates the TLP by allocating an optimal number of CTAs, based on application characteristics, to minimize resource contention.
Journal Article
Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems
Kevin Hsieh, Eiman Ebrahimi, Gwangsun Kim, Niladrish Chatterjee, Mike O'Connor, Nandita Vijaykumar, Onur Mutlu, Stephen W. Keckler
TL;DR: Extensive evaluations across a variety of modern memory-intensive GPU workloads show that TOM significantly improves performance compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
Proceedings Article
Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities
Ashutosh Pattnaik, Xulong Tang, Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut Kandemir, Onur Mutlu, Chita R. Das
TL;DR: Two new runtime techniques are developed: a regression-based affinity prediction model and mechanism that accurately identifies which kernels would benefit from PIM and offloads them to GPU cores in memory, and a concurrent kernel management mechanism that uses the affinity prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on the main GPU cores and the GPU cores in memory.
Proceedings Article
Orchestrated scheduling and prefetching for GPGPUs
Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das
TL;DR: Techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies are presented, and a new prefetch-aware warp scheduling policy is proposed that overcomes problems with existing warp scheduling policies.
References
Book
Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods
TL;DR: In this book, which focuses on the use of iterative methods for solving large sparse systems of linear equations, templates are introduced to meet the needs of both the traditional user and the high-performance specialist.
Proceedings Article
Rodinia: A benchmark suite for heterogeneous computing
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, Kevin Skadron
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Journal Article
A study of replacement algorithms for a virtual-storage computer
TL;DR: One of the basic limitations of a digital computer is the size of its available memory; an approach that permits the programmer to use a sufficiently large address range can overcome this limitation, assuming that means are provided for automatic execution of the memory-overlay functions.
Proceedings Article
Analyzing CUDA workloads using a detailed GPU simulator
TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Journal Article
Dark Silicon and the End of Multicore Scaling
TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.