Journal ArticleDOI

An Automated Tool for Analysis and Tuning of GPU-Accelerated Code in HPC Applications

TL;DR
This article describes GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions; experiments show that GPA provides useful advice for tuning GPU code.
Abstract
The US Department of Energy’s fastest supercomputers and forthcoming exascale systems employ Graphics Processing Units (GPUs) to increase the computational performance of compute nodes. However, the complexity of GPU architectures makes tailoring sophisticated applications to achieve high performance on GPU-accelerated systems a major challenge. At best, prior performance tools for GPU code only provide coarse-grained tuning advice at the kernel level. In this article, we describe GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To gather the fine-grained measurements needed to produce such insights, GPA uses instruction sampling and binary instrumentation to monitor execution of GPU code; at the time of this writing, GPU instruction sampling is available only on NVIDIA GPUs. To understand performance losses, GPA uses data flow analysis to approximately attribute measured instruction stalls back to their causes. GPA then analyzes patterns of stalls using information about a program’s structure and the GPU architecture to identify optimization strategies that address the inefficiencies observed. Finally, GPA employs detailed performance models to estimate the potential speedup that each optimization might provide. Experiments with benchmarks and applications show that GPA provides useful advice for tuning GPU code. We applied GPA to analyze and tune a collection of codes on NVIDIA V100 and A100 GPUs. GPA suggested optimizations that it estimates will accelerate performance across the set of codes by a geometric mean of 1.21×; applying the optimizations suggested by GPA accelerated these codes by a geometric mean of 1.19×.
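To make the estimation step concrete, the following is a minimal sketch, in Python, of how stall-based speedup estimation can work: attribute sampled stall cycles to causes, then bound the speedup an optimization could deliver if it eliminated the stalls it targets. The field names, stall categories, and the simple linear model are illustrative assumptions, not GPA's actual data structures or formulas.

# Sketch of stall-based speedup estimation (names and model are hypothetical).
from dataclasses import dataclass

@dataclass
class StallSample:
    pc: int        # program counter of the sampled instruction
    reason: str    # e.g., "memory_dependency", "execution_dependency"
    cycles: int    # stall cycles attributed to this sample

def estimate_speedup(samples, addressable_reasons):
    """Optimistic upper bound on kernel speedup if the stalls targeted by an
    optimization were fully eliminated."""
    total = sum(s.cycles for s in samples)
    removable = sum(s.cycles for s in samples if s.reason in addressable_reasons)
    return total / (total - removable) if total > removable else float("inf")

samples = [StallSample(0x100, "memory_dependency", 30),
           StallSample(0x108, "execution_dependency", 20),
           StallSample(0x110, "no_stall", 50)]
speedup = estimate_speedup(samples, {"memory_dependency"})
print(f"estimated speedup: {speedup:.2f}x")

For example, if 30% of sampled cycles stall on memory dependencies, eliminating them bounds the kernel speedup at 1/0.7 ≈ 1.43×; GPA's actual models are more detailed, accounting for program structure and architectural constraints.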


Citations
Proceedings ArticleDOI

DrGPU: A Top-Down Profiler for GPU Applications

TL;DR: DrGPU is a profiler that performs top-down analysis to guide GPU code optimization: it quantifies stall cycles, decomposes them into various stall reasons, pinpoints root causes, and provides intuitive optimization guidance.
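As a rough illustration of the top-down decomposition this summary describes, the sketch below groups stall cycles into a two-level hierarchy of reasons so the largest contributor surfaces first. The categories and numbers are invented for illustration and are not DrGPU's actual taxonomy or output.

# Sketch of top-down stall decomposition (categories and counts are illustrative).
stall_cycles = {
    "memory": {"global_load": 4200, "shared_bank_conflict": 900},
    "compute": {"execution_dependency": 1500, "pipeline_busy": 400},
    "synchronization": {"barrier_wait": 700},
}

total = sum(c for group in stall_cycles.values() for c in group.values())
# Report groups, then reasons within each group, largest contributors first.
for group, reasons in sorted(stall_cycles.items(),
                             key=lambda kv: -sum(kv[1].values())):
    print(f"{group}: {100 * sum(reasons.values()) / total:.1f}% of stall cycles")
    for reason, cycles in sorted(reasons.items(), key=lambda kv: -kv[1]):
        print(f"  {reason}: {100 * cycles / total:.1f}%")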

simwave -- A Finite Difference Simulator for Acoustic Waves Propagation

TL;DR: The architecture of simwave is designed for applications with geophysical exploration in mind, and its Python front-end enables straightforward integration with many existing Python scientific libraries for the composition of more complex workflows and applications.
Journal ArticleDOI

Automated performance analysis tools framework for HPC programs

TL;DR: The authors introduce an extensible framework for HPC performance analysis tools that not only provides a convenient graphical interface for the tools but also greatly simplifies their use.
Proceedings ArticleDOI

An Empirical Study of High Performance Computing (HPC) Performance Bugs

TL;DR: Wang et al. performed a large-scale empirical analysis of 1729 HPC performance commits collected from 23 real-world projects and identified 186 performance issues in these projects.
Journal ArticleDOI

A Comprehensive Survey of Benchmarks for Automated Improvement of Software's Non-Functional Properties

Aymeric Blot, +1 more - 16 Dec 2022
TL;DR: The authors present a survey on automated improvement of non-functional properties of software, focusing on the benchmark programs found in the 386 papers collected for the survey.
References
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Journal ArticleDOI

The Tau Parallel Performance System

TL;DR: This paper presents the TAU (Tuning and Analysis Utilities) parallel performance system and describes how it addresses diverse requirements for performance observation and analysis.
Journal ArticleDOI

BerkeleyGW: A massively parallel computer package for the calculation of the quasiparticle and optical properties of materials and nanostructures

TL;DR: This work constructs and solves the Dyson equation for the quasiparticle energies and wavefunctions within the GW approximation for the electron self-energy, and additionally constructs and solves the Bethe–Salpeter equation for the correlated electron–hole (exciton) wavefunctions and excitation energies.
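For context, the Dyson equation referenced here has the following standard form in the GW literature; this is the generic textbook form, not necessarily the notation used in the BerkeleyGW paper itself:

% Quasiparticle (Dyson) equation within the GW approximation (standard form)
\left[ -\tfrac{1}{2}\nabla^2 + V_{\mathrm{ext}}(\mathbf{r}) + V_{\mathrm{H}}(\mathbf{r}) \right] \psi_{n\mathbf{k}}(\mathbf{r})
  + \int d\mathbf{r}'\, \Sigma(\mathbf{r}, \mathbf{r}'; E_{n\mathbf{k}})\, \psi_{n\mathbf{k}}(\mathbf{r}')
  = E_{n\mathbf{k}}\, \psi_{n\mathbf{k}}(\mathbf{r})

where \Sigma = iGW is the electron self-energy and E_{n\mathbf{k}} are the quasiparticle energies.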
Journal ArticleDOI

HPCTOOLKIT: tools for performance analysis of optimized parallel programs

TL;DR: This paper provides an overview of HPCTOOLKIT and illustrates its utility for performance analysis of parallel applications.
Proceedings ArticleDOI

ProfileMe: hardware support for instruction-level profiling on out-of-order processors

TL;DR: This paper describes an inexpensive hardware implementation of ProfileMe, outlines a variety of software techniques to extract useful profile information from the hardware, and explains several ways in which this information can provide valuable feedback for programmers and optimizers.