Journal ArticleDOI

An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications

TL;DR
This article describes GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions; experiments show that GPA provides useful advice for tuning GPU code.
Abstract
The US Department of Energy’s fastest supercomputers and forthcoming exascale systems employ Graphics Processing Units (GPUs) to increase the computational performance of compute nodes. However, the complexity of GPU architectures makes tailoring sophisticated applications to achieve high performance on GPU-accelerated systems a major challenge. At best, prior performance tools for GPU code only provide coarse-grained tuning advice at the kernel level. In this article, we describe GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To gather the fine-grained measurements needed to produce such insights, GPA uses instruction sampling and binary instrumentation to monitor execution of GPU code; at the time of this writing, GPU instruction sampling is available only on NVIDIA GPUs. To understand performance losses, GPA uses data flow analysis to approximately attribute measured instruction stalls back to their causes. GPA then analyzes patterns of stalls using information about a program’s structure and the GPU architecture to identify optimization strategies that address the inefficiencies observed. Finally, GPA employs detailed performance models to estimate the potential speedup that each optimization might provide. Experiments with benchmarks and applications show that GPA provides useful advice for tuning GPU code. We applied GPA to analyze and tune a collection of codes on NVIDIA V100 and A100 GPUs. GPA suggested optimizations that it estimates will accelerate performance across the set of codes by a geometric mean of 1.21×; applying the optimizations suggested by GPA accelerated these codes by a geometric mean of 1.19×.
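To make the estimation step concrete, the following is a minimal sketch, in Python, of how stall-based speedup estimation can work: attribute sampled stall cycles to causes, then bound the speedup an optimization could deliver if it eliminated the stalls it targets. The field names, stall categories, and the simple linear model are illustrative assumptions, not GPA's actual data structures or formulas.

# Sketch of stall-based speedup estimation (names and model are hypothetical).
from dataclasses import dataclass

@dataclass
class StallSample:
    pc: int        # program counter of the sampled instruction
    reason: str    # e.g., "memory_dependency", "execution_dependency"
    cycles: int    # stall cycles attributed to this sample

def estimate_speedup(samples, addressable_reasons):
    """Optimistic upper bound on kernel speedup if the stalls targeted by an
    optimization were fully eliminated."""
    total = sum(s.cycles for s in samples)
    removable = sum(s.cycles for s in samples if s.reason in addressable_reasons)
    return total / (total - removable) if total > removable else float("inf")

samples = [StallSample(0x100, "memory_dependency", 30),
           StallSample(0x108, "execution_dependency", 20),
           StallSample(0x110, "no_stall", 50)]
speedup = estimate_speedup(samples, {"memory_dependency"})
print(f"estimated speedup: {speedup:.2f}x")

For example, if 30% of sampled cycles stall on memory dependencies, eliminating them bounds the kernel speedup at 1/0.7 ≈ 1.43×; GPA's actual models are more detailed, accounting for program structure and architectural constraints.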


Citations
Proceedings ArticleDOI

DrGPU: A Top-Down Profiler for GPU Applications

TL;DR: DrGPU is a profiler that performs top-down analysis to guide GPU code optimization: it quantifies stall cycles, decomposes them into various stall reasons, pinpoints root causes, and provides intuitive optimization guidance.
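As a rough illustration of the top-down decomposition this summary describes, the sketch below groups stall cycles into a two-level hierarchy of reasons so the largest contributor surfaces first. The categories and numbers are invented for illustration and are not DrGPU's actual taxonomy or output.

# Sketch of top-down stall decomposition (categories and counts are illustrative).
stall_cycles = {
    "memory": {"global_load": 4200, "shared_bank_conflict": 900},
    "compute": {"execution_dependency": 1500, "pipeline_busy": 400},
    "synchronization": {"barrier_wait": 700},
}

total = sum(c for group in stall_cycles.values() for c in group.values())
# Report groups, then reasons within each group, largest contributors first.
for group, reasons in sorted(stall_cycles.items(),
                             key=lambda kv: -sum(kv[1].values())):
    print(f"{group}: {100 * sum(reasons.values()) / total:.1f}% of stall cycles")
    for reason, cycles in sorted(reasons.items(), key=lambda kv: -kv[1]):
        print(f"  {reason}: {100 * cycles / total:.1f}%")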

simwave -- A Finite Difference Simulator for Acoustic Waves Propagation

TL;DR: The architecture of simwave is designed for applications with geophysical exploration in mind, and its Python front-end enables straightforward integration with many existing Python scientific libraries for the composition of more complex workflows and applications.
Journal ArticleDOI

Automated performance analysis tools framework for HPC programs

TL;DR: The authors introduce an extensible framework for HPC performance analysis tools that not only provides a convenient graphical interface for the tools but also greatly simplifies their use.
Proceedings ArticleDOI

An Empirical Study of High Performance Computing (HPC) Performance Bugs

TL;DR: Wang et al. performed a large-scale empirical analysis of 1729 HPC performance commits collected from 23 real-world projects and identified 186 performance issues in these projects.
Journal ArticleDOI

A Comprehensive Survey of Benchmarks for Automated Improvement of Software's Non-Functional Properties

Aymeric Blot, +1 more - 16 Dec 2022
TL;DR: The authors present a survey on automated improvement of non-functional properties of software, focusing on the benchmark programs found in the 386 papers collected for the survey.
References
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Journal ArticleDOI

The Tau Parallel Performance System

TL;DR: This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.
Journal ArticleDOI

BerkeleyGW: A massively parallel computer package for the calculation of the quasiparticle and optical properties of materials and nanostructures

TL;DR: This work constructs and solves the Dyson equation for the quasiparticle energies and wavefunctions within the GW approximation for the electron self-energy, and additionally constructs and solves the Bethe–Salpeter equation for the correlated electron–hole (exciton) wavefunctions and excitation energies.
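For context, the Dyson equation referenced here has the following standard form in the GW literature; this is the generic textbook form, not necessarily the notation used in the BerkeleyGW paper itself:

% Quasiparticle (Dyson) equation within the GW approximation (standard form)
\left[ -\tfrac{1}{2}\nabla^2 + V_{\mathrm{ext}}(\mathbf{r}) + V_{\mathrm{H}}(\mathbf{r}) \right] \psi_{n\mathbf{k}}(\mathbf{r})
  + \int d\mathbf{r}'\, \Sigma(\mathbf{r}, \mathbf{r}'; E_{n\mathbf{k}})\, \psi_{n\mathbf{k}}(\mathbf{r}')
  = E_{n\mathbf{k}}\, \psi_{n\mathbf{k}}(\mathbf{r})

where \Sigma = iGW is the electron self-energy and E_{n\mathbf{k}} are the quasiparticle energies.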
Journal ArticleDOI

HPCTOOLKIT: tools for performance analysis of optimized parallel programs

TL;DR: This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.
Proceedings ArticleDOI

ProfileMe: hardware support for instruction-level profiling on out-of-order processors

TL;DR: This paper describes an inexpensive hardware implementation of ProfileMe, outlines a variety of software techniques to extract useful profile information from the hardware, and explains several ways in which this information can provide valuable feedback for programmers and optimizers.