Open Access Proceedings Article

Cache-Conscious Wavefront Scheduling

TL;DR
This paper proposes Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that uses a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity.
Abstract
This paper studies the effects of hardware thread scheduling on cache management in GPUs. We propose Cache-Conscious Wavefront Scheduling (CCWS), an adaptive hardware mechanism that makes use of a novel intra-wavefront locality detector to capture locality that is lost by other schedulers due to excessive contention for cache capacity. In contrast to improvements in the replacement policy that can better tolerate difficult access patterns, CCWS shapes the access pattern to avoid thrashing the shared L1. We show that CCWS can outperform any replacement scheme by evaluating against the Belady-optimal policy. Our evaluation demonstrates that cache efficiency and preservation of intra-wavefront locality become more important as GPU computing expands beyond use in high performance computing. At an estimated cost of 0.17% total chip area, CCWS reduces the number of threads actively issued on a core when appropriate. This leads to 25% fewer L1 data cache misses on average, which results in a harmonic mean 24% performance improvement over previously proposed scheduling policies across a diverse selection of cache-sensitive workloads.
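Because the abstract compresses the mechanism into a few sentences, a sketch may help make it concrete. The C++ sketch below illustrates the core idea of an intra-wavefront lost-locality detector driving issue throttling: each wavefront keeps a small victim-tag buffer of lines it recently evicted, a miss that hits in that buffer raises the wavefront's lost-locality score, and the issue logic stops scheduling wavefronts once the cumulative score crosses a threshold. The class name, buffer depth, gain/decay constants, and the unsorted cutoff walk are illustrative assumptions, not the paper's exact hardware.

```cpp
// Minimal sketch of CCWS-style lost-locality scoring and wavefront throttling.
// Sizes and constants are illustrative assumptions, not the paper's parameters.
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

struct WavefrontState {
    std::deque<std::uint64_t> victim_tags;  // cache tags this wavefront recently evicted
    double lls = 0.0;                       // lost-locality score
};

class CCWSScheduler {
    std::vector<WavefrontState> wf_;
    double throttle_threshold_;                     // cumulative-score cutoff (assumed)
    static constexpr std::size_t kVictimDepth = 8;  // victim-tag entries per wavefront (assumed)
    static constexpr double kGain = 32.0;           // score bump per detected lost-locality miss (assumed)
    static constexpr double kDecay = 1.0;           // score decay per scheduling step (assumed)

public:
    CCWSScheduler(std::size_t num_wavefronts, double threshold)
        : wf_(num_wavefronts), throttle_threshold_(threshold) {}

    // On an L1D miss: if the tag was recently evicted by the same wavefront,
    // the miss is intra-wavefront locality lost to contention, so raise the score.
    void on_miss(std::size_t wf_id, std::uint64_t tag) {
        for (std::uint64_t v : wf_[wf_id].victim_tags) {
            if (v == tag) { wf_[wf_id].lls += kGain; return; }
        }
    }

    // On an L1D eviction caused by this wavefront: remember the victim's tag.
    void on_evict(std::size_t wf_id, std::uint64_t tag) {
        auto& tags = wf_[wf_id].victim_tags;
        tags.push_back(tag);
        if (tags.size() > kVictimDepth) tags.pop_front();
    }

    // Each scheduling step: decay scores, then allow wavefronts to issue only
    // while the running score total stays under the cutoff, throttling the rest.
    std::vector<bool> issue_mask() {
        std::vector<bool> can_issue(wf_.size(), false);
        double running = 0.0;
        for (auto& w : wf_) w.lls = (w.lls > kDecay) ? (w.lls - kDecay) : 0.0;
        // Simplification: walk in wavefront-id order; the real mechanism sorts
        // by score so the wavefronts with the most locality keep issuing.
        for (std::size_t i = 0; i < wf_.size(); ++i) {
            running += wf_[i].lls;
            if (running <= throttle_threshold_) can_issue[i] = true;
        }
        return can_issue;
    }
};
```

The essential design point is that the knob being turned is the amount of active parallelism itself, not the replacement policy: by descheduling wavefronts when lost locality is detected, the access stream presented to the L1 is reshaped rather than merely tolerated.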


Citations
Proceedings Article

OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance

TL;DR: This paper presents a coordinated CTA-aware scheduling policy that uses four schemes to minimize the impact of long memory latencies, and shows that the proposed mechanism provides a 33% average performance improvement over the commonly employed round-robin warp scheduling policy.
Proceedings Article

Neither more nor less: optimizing thread-level parallelism for GPGPUs

TL;DR: To reduce resource contention, this paper proposes a dynamic CTA scheduling mechanism, called DYNCTA, which modulates the TLP by allocating an optimal number of CTAs to each core based on application characteristics.
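The modulation loop behind such a scheme is simple enough to sketch. The C++ fragment below, a minimal sketch rather than the paper's implementation, nudges a core's active CTA count up when the core is starved for ready warps and down when memory stalls dominate; the metric names and the 0.2/0.5/0.7 thresholds are illustrative assumptions.

```cpp
// Hedged sketch of DYNCTA-style dynamic CTA throttling: the active CTA count
// per core is nudged up or down from sampled idleness/contention metrics.
#include <cstddef>

struct CoreSample {
    double idle_fraction;       // fraction of cycles with no warp ready to issue (assumed metric)
    double mem_stall_fraction;  // fraction of cycles stalled on memory (assumed metric)
};

std::size_t next_cta_count(std::size_t current, std::size_t min_ctas,
                           std::size_t max_ctas, const CoreSample& s) {
    // Core is starved for work: more parallelism may hide latency.
    if (s.idle_fraction > 0.2 && s.mem_stall_fraction < 0.5 && current < max_ctas)
        return current + 1;
    // Memory system is saturated: fewer CTAs reduce contention.
    if (s.mem_stall_fraction > 0.7 && current > min_ctas)
        return current - 1;
    return current;  // near the sweet spot; hold steady
}
```

The design point shared with CCWS above is that thread-level parallelism is treated as a tunable resource rather than something to maximize unconditionally.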
Journal Article

Transparent offloading and mapping (TOM): enabling programmer-transparent near-data processing in GPU systems

TL;DR: Extensive evaluations across a variety of modern memory-intensive GPU workloads show that TOM significantly improves performance compared to a baseline GPU system that cannot offload computation to 3D-stacked memories.
Proceedings Article

Scheduling Techniques for GPU Architectures with Processing-In-Memory Capabilities

TL;DR: Two new runtime techniques are developed: a regression-based affinity prediction model that accurately identifies which kernels would benefit from PIM and offloads them to GPU cores in memory, and a concurrent kernel management mechanism that uses the affinity prediction model, a new kernel execution time prediction model, and kernel dependency information to decide which kernels to schedule concurrently on the main GPU cores and the GPU cores in memory.
Proceedings Article

Orchestrated scheduling and prefetching for GPGPUs

TL;DR: Techniques that coordinate the thread scheduling and prefetching decisions in a General Purpose Graphics Processing Unit (GPGPU) architecture to better tolerate long memory latencies are presented, along with a new prefetch-aware warp scheduling policy that overcomes problems with existing warp scheduling policies.
References
Book

Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods

TL;DR: In this book, which focuses on the use of iterative methods for solving large sparse systems of linear equations, templates are introduced to meet the needs of both the traditional user and the high-performance specialist.
Proceedings Article

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Journal Article

A study of replacement algorithms for a virtual-storage computer

TL;DR: One of the basic limitations of a digital computer is the size of its available memory; an approach that permits the programmer to use a sufficiently large address range can overcome this limitation, assuming that means are provided for the automatic execution of the memory-overlay functions.
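The abstract above uses the Belady-optimal replacement policy from this paper as an upper bound on any replacement scheme. That policy (often called MIN or OPT) is compact to state: with the full future access trace known, always evict the resident line whose next reference is farthest away. The C++ sketch below counts misses on an address trace; the function name and the use of plain int addresses are illustrative choices.

```cpp
// Hedged sketch of Belady's optimal (MIN) replacement on a known trace.
#include <cstddef>
#include <limits>
#include <unordered_map>
#include <vector>

// Returns the miss count for the given address trace with `capacity` lines.
std::size_t belady_misses(const std::vector<int>& trace, std::size_t capacity) {
    // Precompute, for each position, where the same address is next referenced
    // (or "infinity" if never again).
    const std::size_t INF = std::numeric_limits<std::size_t>::max();
    std::vector<std::size_t> next_use(trace.size(), INF);
    std::unordered_map<int, std::size_t> last_seen;
    for (std::size_t i = trace.size(); i-- > 0;) {
        auto it = last_seen.find(trace[i]);
        if (it != last_seen.end()) next_use[i] = it->second;
        last_seen[trace[i]] = i;
    }

    std::unordered_map<int, std::size_t> resident;  // address -> its next use
    std::size_t misses = 0;
    for (std::size_t i = 0; i < trace.size(); ++i) {
        auto it = resident.find(trace[i]);
        if (it != resident.end()) {
            it->second = next_use[i];  // hit: refresh next-use distance
            continue;
        }
        ++misses;
        if (resident.size() == capacity) {
            // Evict the line re-referenced farthest in the future.
            auto victim = resident.begin();
            for (auto r = resident.begin(); r != resident.end(); ++r)
                if (r->second > victim->second) victim = r;
            resident.erase(victim);
        }
        resident[trace[i]] = next_use[i];
    }
    return misses;
}
```

Because it requires the future trace, MIN is only usable offline in a simulator, which is exactly how such an upper bound is obtained.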
Proceedings Article

Analyzing CUDA workloads using a detailed GPU simulator

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's Parallel Thread Execution (PTX) virtual instruction set.
Journal Article

Dark Silicon and the End of Multicore Scaling

TL;DR: A comprehensive study that projects the speedup potential of future multicores and examines the underutilization of integration capacity (dark silicon) is timely and crucial.