Journal Article

Branch and Data Herding: Reducing Control and Memory Divergence for Error-Tolerant GPU Applications

TL;DR
This paper proposes a static analysis and compiler framework to prevent exceptions when control and data errors are introduced, a profiling framework that aims to maximize performance while maintaining acceptable output quality, and hardware optimizations to improve the performance benefits of exploiting error tolerance through branch and data herding.
Abstract
Control and memory divergence between threads within the same execution bundle, or warp, have been shown to cause significant performance bottlenecks for GPU applications. In this paper, we exploit the observation that many GPU applications exhibit error tolerance to propose branch and data herding. Branch herding eliminates control divergence by forcing all threads in a warp to take the same control path. Data herding eliminates memory divergence by forcing each thread in a warp to load from the same memory block. To safely and efficiently support branch and data herding, we propose a static analysis and compiler framework to prevent exceptions when control and data errors are introduced, a profiling framework that aims to maximize performance while maintaining acceptable output quality, and hardware optimizations to improve the performance benefits of exploiting error tolerance through branch and data herding. Our software implementation of branch herding on an NVIDIA GeForce GTX 480 improves performance by up to 34% (13%, on average) for a suite of NVIDIA CUDA SDK and Parboil benchmarks. Our hardware implementation of branch herding improves performance by up to 55% (30%, on average). Data herding improves performance by up to 32% (25%, on average). Observed output quality degradation is minimal for several applications that exhibit error tolerance, especially for visual computing applications.
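To make the two mechanisms concrete, the CUDA sketch below illustrates them under stated assumptions: herd_branch() and herd_load() are hypothetical helper names (not from the paper's artifact), the 128-byte memory block size is an assumption, and the modern __ballot_sync/__shfl_sync intrinsics stand in for the Fermi-era mechanisms; note the paper implements branch herding in both software and hardware, while data herding is a hardware proposal.

// Hedged sketch, not the paper's artifact: helper names, the 128-byte
// block size, and the *_sync intrinsics (Kepler+/CUDA 9+) are assumptions.
#include <cuda_runtime.h>

// Branch herding: warp-wide majority vote over the branch predicate,
// so every active lane follows the same control path (no divergence).
__device__ bool herd_branch(bool cond) {
    unsigned active = __activemask();              // lanes still running
    unsigned votes  = __ballot_sync(active, cond); // one bit per voting lane
    return 2 * __popc(votes) >= __popc(active);    // majority wins
}

// Data herding: redirect every lane's load into the memory block of the
// lowest active lane, so the warp touches one cache block per load.
// Minority lanes read approximate values; only safe for error-tolerant code.
__device__ float herd_load(const float* p) {
    const unsigned long long BLOCK = 128;          // assumed block size (bytes)
    unsigned active = __activemask();
    int leader = __ffs(active) - 1;                // lowest active lane id
    unsigned long long addr = (unsigned long long)p;
    unsigned long long base =
        __shfl_sync(active, addr & ~(BLOCK - 1), leader); // leader's block
    return *(const float*)(base + (addr & (BLOCK - 1)));  // keep own offset
}

// Example use: an indirect gather, where idx[] can scatter a warp's loads
// across many memory blocks and make the following branch diverge.
__global__ void gather_kernel(const float* in, const int* idx,
                              float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = herd_load(&in[idx[i]]);                 // one block per warp load
    out[i] = herd_branch(x > 0.5f) ? 2.0f * x : 0.0f; // one path per warp
}

The majority vote mirrors branch herding's rule of forcing the whole warp down the more popular path, and the leader broadcast mirrors data herding's rule of collapsing a warp's loads into a single memory block, trading value errors in minority lanes for non-divergent execution.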


Citations
Journal Article

A Survey of Techniques for Approximate Computing

TL;DR: A survey of approximate computing (AC) techniques that discusses strategies for finding approximable program portions and monitoring output quality; techniques for applying AC in different processing units, processor components, and memory technologies; and programming frameworks for AC.
Proceedings Article

SAGE: self-tuning approximation for graphics engines

TL;DR: Across a set of machine learning and image processing kernels, SAGE's approximation yields an average 2.5× speedup with less than 10% quality loss compared to accurate execution on an NVIDIA GTX 560 GPU.
Proceedings Article

Paraprox: pattern-based approximation for data parallel applications

TL;DR: This paper proposes Paraprox, a software-only system for transparent approximation of data-parallel programs that runs on commodity hardware and yields an average performance gain of 2.7x on an NVIDIA GTX 560 GPU and 2.5x on an Intel Core i7 quad-core processor.
Proceedings Article

Rumba: an online quality management system for approximate computing

TL;DR: Rumba is proposed for online detection and correction of large approximation errors in an approximate accelerator-based computing environment; it achieves a 2.1x reduction in output error for an unchecked approximation accelerator while preserving the accelerator's performance gains, at the cost of reduced energy savings.
References
Proceedings Article

Analyzing CUDA workloads using a detailed GPU simulator

TL;DR: In this paper, the performance of non-graphics applications written in NVIDIA's CUDA programming model is evaluated on a microarchitecture performance simulator that runs NVIDIA's parallel thread execution (PTX) virtual instruction set.
Proceedings Article

Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0

TL;DR: This work implements two major extensions to the CACTI cache modeling tool that focus on interconnect design for large caches and adopts state-of-the-art design-space exploration strategies for non-uniform cache access (NUCA).
Proceedings Article

Demystifying GPU microarchitecture through microbenchmarking

TL;DR: This work develops a microbenchmark suite and measures the CUDA-visible architectural characteristics of the NVIDIA GT200 (GTX 280) GPU, exposing undocumented features that impact program performance and correctness.
Proceedings Article

Improving GPU performance via large warps and two-level warp scheduling

TL;DR: This work proposes two independent ideas, the large warp microarchitecture and two-level warp scheduling, which together improve performance by 19.1% over traditional GPU cores for a wide variety of general-purpose parallel applications that heretofore have been unable to fully exploit the GPU's available resources.
Proceedings Article

Dynamic warp subdivision for integrated branch and memory divergence tolerance

TL;DR: This paper introduces dynamic warp subdivision (DWS), which allows a single warp to occupy more than one slot in the scheduler without requiring extra register file space, improving latency hiding and memory-level parallelism (MLP).