Proceedings Article

A Profile Guided Approach to Optimize Branch Divergence While Transforming Applications for GPUs

TLDR
A novel profile-guided approach to optimizing branch divergence while transforming a serial program into a data-parallel program for GPUs is proposed, based on the observation that branches inside some data-parallel loops, although divergent, exhibit repetitive, regular patterns of outcomes.
Abstract
GPUs offer a powerful bulk-synchronous programming model for exploiting data parallelism; however, branch divergence within executing warps can lead to serious performance degradation due to execution serialization. We propose a novel profile-guided approach to optimizing branch divergence while transforming a serial program into a data-parallel program for GPUs. Our approach is based on the observation that branches inside some data-parallel loops, although divergent, exhibit repetitive, regular patterns of outcomes. By exploiting such patterns, loop iterations can be aligned so that corresponding iterations traverse the same branch path. These aligned iterations, when executed as a warp on a GPU, become convergent. We propose a new metric, based on the characteristics of the repetitive patterns, that indicates whether a data-parallel loop is worth restructuring. When we tested our approach on the well-known Rodinia benchmark suite, we found that it is possible to achieve up to 48% performance improvement through the loop restructuring suggested by the patterns and our metric.
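To make the idea concrete, here is a minimal sketch, not the authors' transformation or tooling: the kernel names, the data-dependent flag predicate, and the host-side remapping below are hypothetical stand-ins for a profile-derived reordering of loop iterations. The first kernel takes a data-dependent branch, so the threads of a warp may follow both paths and serialize; the second processes iterations through a remapping that groups same-outcome iterations together, so each warp tends to follow a single path.

// A minimal sketch (hypothetical names and predicate), illustrating iteration
// alignment by branch outcome rather than any specific published transformation.
#include <algorithm>
#include <numeric>
#include <vector>
#include <cuda_runtime.h>

// Original data-parallel loop body: the branch outcome depends on the data, so the
// threads of one warp may take both paths and execute them one after the other.
__global__ void divergentKernel(const int *flag, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (flag[i]) out[i] *= 2.0f;   // path A
    else         out[i] += 1.0f;   // path B
}

// Restructured loop: thread t processes iteration remap[t]. If the remapping groups
// iterations with the same branch outcome, each warp tends to follow a single path.
__global__ void alignedKernel(const int *flag, const int *remap, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    int i = remap[t];
    if (flag[i]) out[i] *= 2.0f;
    else         out[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    std::vector<int> flag(n);
    for (int i = 0; i < n; ++i) flag[i] = (i % 3 == 0);   // a repetitive outcome pattern

    // Stand-in for the profiling step: group iteration indices by observed branch outcome.
    std::vector<int> remap(n);
    std::iota(remap.begin(), remap.end(), 0);
    std::stable_partition(remap.begin(), remap.end(),
                          [&](int i) { return flag[i] != 0; });

    int *dFlag, *dRemap; float *dOut;
    cudaMalloc(&dFlag, n * sizeof(int));
    cudaMalloc(&dRemap, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dFlag, flag.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dRemap, remap.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    divergentKernel<<<grid, block>>>(dFlag, dOut, n);         // warps diverge
    alignedKernel<<<grid, block>>>(dFlag, dRemap, dOut, n);   // warps stay convergent
    cudaDeviceSynchronize();

    cudaFree(dFlag); cudaFree(dRemap); cudaFree(dOut);
    return 0;
}

Note that the remapped order can scatter memory accesses, so restructuring only pays off when divergence dominates; that trade-off is what a metric of the kind described above has to capture.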


Citations
Proceedings Article

Efficient warp execution in presence of divergence with collaborative context collection

TL;DR: This work presents a software technique named Collaborative Context Collection (CCC), which improves the warp execution efficiency of real-world benchmarks by up to 56% and achieves an average speedup of 1.69× (maximum 3.08×), and proposes code transformations that enable applicability of CCC to a variety of program segments with thread divergence.
Proceedings Article

Merge or Separate?: Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms

TL;DR: A machine learning-based predictive model is used at runtime to decide whether to merge OpenCL kernels or schedule them separately to the most appropriate devices, without the need for ahead-of-time profiling.
Proceedings Article

Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement

TL;DR: A novel software technique called Collaborative Task Engagement (CTE) is introduced that achieves sustained high warp execution efficiencies across irregular inputs and provides portable performance.
Journal Article

Scaling modified condition/decision coverage using distributed concolic testing for Java programs

TL;DR: A Java coverage analyzer is presented that measures MC/DC according to the test cases produced by distributed Java concolic testers, and this version of the MC/DC analyzer is more powerful than that for procedural languages.
Journal Article

An Automated Analysis of the Branch Coverage and Energy Consumption Using Concolic Testing

TL;DR: The contribution of this paper is to automate the computation and analysis of the energy consumption of the testing technique while enhancing branch coverage using concolic testing, through a proposed automation framework implemented in a tool named Green Analysis of Branch Coverage Enhancement.
References
Proceedings Article

LLVM: a compilation framework for lifelong program analysis & transformation

TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Journal Article

A bridging model for parallel computation

TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this bridging role, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
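For context, the BSP superstep structure is what the bulk-synchronous GPU model above mirrors. The sketch below is a generic block-wide reduction, not code from either paper; it only illustrates a superstep as local computation, communication through shared memory, and a barrier before the next superstep.

// Generic illustration of BSP-style supersteps on a GPU (assumes blockDim.x == 256).
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;            // superstep: local work
    __syncthreads();                                       // barrier ends the superstep
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];      // communicate via shared memory
        __syncthreads();                                   // one barrier per superstep
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];        // one partial sum per block
}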
Proceedings Article

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Book

Programming Massively Parallel Processors: A Hands-on Approach

TL;DR: Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture, and explores various techniques for constructing parallel programs in detail.
Proceedings Article

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.
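As a hedged illustration of the access-reordering point (a generic example, not code from the paper): when adjacent threads touch adjacent off-chip addresses, the hardware can combine their requests into a few wide transactions, while strided accesses cannot be combined. The two kernels below compute the same row sums; the second reads from a column-major copy of the matrix so that, at each step, consecutive threads read consecutive addresses.

// Generic coalescing example: same result, different access patterns.
#include <cuda_runtime.h>

// Uncoalesced: each thread walks its own row of a row-major matrix, so consecutive
// threads access addresses 'width' floats apart and each warp issues many transactions.
__global__ void rowSumStrided(const float *m, float *sums, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    float s = 0.0f;
    for (int col = 0; col < width; ++col)
        s += m[row * width + col];
    sums[row] = s;
}

// Coalesced: the same sums from a column-major copy, so consecutive threads read
// consecutive addresses and their requests are combined into few wide transactions.
__global__ void rowSumCoalesced(const float *mColMajor, float *sums, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    float s = 0.0f;
    for (int col = 0; col < width; ++col)
        s += mColMajor[col * height + row];
    sums[row] = s;
}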