Proceedings Article

A Profile Guided Approach to Optimize Branch Divergence While Transforming Applications for GPUs

TLDR
A novel profile-guided approach to optimizing branch divergence while transforming a serial program into a data-parallel program for GPUs is proposed, based on the observation that branches inside some data-parallel loops, although divergent, exhibit repetitive, regular patterns of outcomes.
Abstract
GPUs offer a powerful bulk-synchronous programming model for exploiting data parallelism; however, branch divergence within executing warps can lead to serious performance degradation due to execution serialization. We propose a novel profile-guided approach to optimizing branch divergence while transforming a serial program into a data-parallel program for GPUs. Our approach is based on the observation that branches inside some data-parallel loops, although divergent, exhibit repetitive, regular patterns of outcomes. By exploiting such patterns, loop iterations can be aligned so that corresponding iterations traverse the same branch path. These aligned iterations, when executed as a warp on a GPU, become convergent. We propose a new metric, based on the characteristics of the repetitive patterns, that indicates whether a data-parallel loop is worth restructuring. When we tested our approach on the well-known Rodinia benchmark suite, we found that it is possible to achieve up to 48% performance improvement through the loop restructuring suggested by the patterns and our metric.
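To make the idea concrete, here is a minimal sketch, not the authors' transformation or tooling: the kernel names, the data-dependent flag predicate, and the host-side remapping below are hypothetical stand-ins for a profile-derived reordering of loop iterations. The first kernel takes a data-dependent branch, so the threads of a warp may follow both paths and serialize; the second processes iterations through a remapping that groups same-outcome iterations together, so each warp tends to follow a single path.

// A minimal sketch (hypothetical names and predicate), illustrating iteration
// alignment by branch outcome rather than any specific published transformation.
#include <algorithm>
#include <numeric>
#include <vector>
#include <cuda_runtime.h>

// Original data-parallel loop body: the branch outcome depends on the data, so the
// threads of one warp may take both paths and execute them one after the other.
__global__ void divergentKernel(const int *flag, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (flag[i]) out[i] *= 2.0f;   // path A
    else         out[i] += 1.0f;   // path B
}

// Restructured loop: thread t processes iteration remap[t]. If the remapping groups
// iterations with the same branch outcome, each warp tends to follow a single path.
__global__ void alignedKernel(const int *flag, const int *remap, float *out, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    int i = remap[t];
    if (flag[i]) out[i] *= 2.0f;
    else         out[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    std::vector<int> flag(n);
    for (int i = 0; i < n; ++i) flag[i] = (i % 3 == 0);   // a repetitive outcome pattern

    // Stand-in for the profiling step: group iteration indices by observed branch outcome.
    std::vector<int> remap(n);
    std::iota(remap.begin(), remap.end(), 0);
    std::stable_partition(remap.begin(), remap.end(),
                          [&](int i) { return flag[i] != 0; });

    int *dFlag, *dRemap; float *dOut;
    cudaMalloc(&dFlag, n * sizeof(int));
    cudaMalloc(&dRemap, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dFlag, flag.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dRemap, remap.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, n * sizeof(float));

    dim3 block(256), grid((n + 255) / 256);
    divergentKernel<<<grid, block>>>(dFlag, dOut, n);         // warps diverge
    alignedKernel<<<grid, block>>>(dFlag, dRemap, dOut, n);   // warps stay convergent
    cudaDeviceSynchronize();

    cudaFree(dFlag); cudaFree(dRemap); cudaFree(dOut);
    return 0;
}

Note that the remapped order can scatter memory accesses, so restructuring only pays off when divergence dominates; that trade-off is what a metric of the kind described above has to capture.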


Citations
Proceedings Article

Efficient warp execution in presence of divergence with collaborative context collection

TL;DR: This work presents a software technique named Collaborative Context Collection (CCC), which improves the warp execution efficiency of real-world benchmarks by up to 56% and achieves an average speedup of 1.69× (maximum 3.08×), and proposes code transformations that enable applicability of CCC to a variety of program segments with thread divergence.
Proceedings Article

Merge or Separate?: Multi-job Scheduling for OpenCL Kernels on CPU/GPU Platforms

TL;DR: A machine learning-based predictive model is used at runtime to decide whether to merge OpenCL kernels or schedule them separately to the most appropriate devices, without the need for ahead-of-time profiling.
Proceedings Article

Eliminating Intra-Warp Load Imbalance in Irregular Nested Patterns via Collaborative Task Engagement

TL;DR: A novel software technique called Collaborative Task Engagement (CTE) is introduced that achieves sustained high warp execution efficiencies across irregular inputs and provides portable performance.
Journal Article

Scaling modified condition/decision coverage using distributed concolic testing for Java programs

TL;DR: A Java coverage analyzer is presented that measures MC/DC according to the test cases produced by distributed Java concolic testers, and this version of the MC/DC analyzer is more powerful than that for procedural languages.
Journal Article

An Automated Analysis of the Branch Coverage and Energy Consumption Using Concolic Testing

TL;DR: The contribution of this paper is to automate the computation and analysis of the energy consumption of the testing technique while enhancing branch coverage using concolic testing, through a proposed automation framework implemented in a tool named Green Analysis of Branch Coverage Enhancement.
References
Proceedings Article

LLVM: a compilation framework for lifelong program analysis & transformation

TL;DR: The design of the LLVM representation and compiler framework is evaluated in three ways: the size and effectiveness of the representation, including the type information it provides; compiler performance for several interprocedural problems; and illustrative examples of the benefits LLVM provides for several challenging compiler problems.
Journal Article

A bridging model for parallel computation

TL;DR: The bulk-synchronous parallel (BSP) model is introduced as a candidate for this bridging role, with results quantifying its efficiency both in implementing high-level language features and algorithms and in being implemented in hardware.
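For context, the BSP superstep structure is what the bulk-synchronous GPU model above mirrors. The sketch below is a generic block-wide reduction, not code from either paper; it only illustrates a superstep as local computation, communication through shared memory, and a barrier before the next superstep.

// Generic illustration of BSP-style supersteps on a GPU (assumes blockDim.x == 256).
#include <cuda_runtime.h>

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;            // superstep: local work
    __syncthreads();                                       // barrier ends the superstep
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];      // communicate via shared memory
        __syncthreads();                                   // one barrier per superstep
    }
    if (threadIdx.x == 0) out[blockIdx.x] = buf[0];        // one partial sum per block
}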
Proceedings Article

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Book

Programming Massively Parallel Processors: A Hands-on Approach

TL;DR: Programming Massively Parallel Processors: A Hands-on Approach shows students and professionals alike the basic concepts of parallel programming and GPU architecture, and explores various techniques for constructing parallel programs in detail.
Proceedings Article

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.
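As a hedged illustration of the access-reordering point (a generic example, not code from the paper): when adjacent threads touch adjacent off-chip addresses, the hardware can combine their requests into a few wide transactions, while strided accesses cannot be combined. The two kernels below compute the same row sums; the second reads from a column-major copy of the matrix so that, at each step, consecutive threads read consecutive addresses.

// Generic coalescing example: same result, different access patterns.
#include <cuda_runtime.h>

// Uncoalesced: each thread walks its own row of a row-major matrix, so consecutive
// threads access addresses 'width' floats apart and each warp issues many transactions.
__global__ void rowSumStrided(const float *m, float *sums, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    float s = 0.0f;
    for (int col = 0; col < width; ++col)
        s += m[row * width + col];
    sums[row] = s;
}

// Coalesced: the same sums from a column-major copy, so consecutive threads read
// consecutive addresses and their requests are combined into few wide transactions.
__global__ void rowSumCoalesced(const float *mColMajor, float *sums, int width, int height) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= height) return;
    float s = 0.0f;
    for (int col = 0; col < width; ++col)
        s += mColMajor[col * height + row];
    sums[row] = s;
}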