Open Access · Journal ArticleDOI

MATOG: Array Layout Auto-Tuning for CUDA

TL;DR: The MATOG auto-tuner abstracts array memory accesses in CUDA applications and automatically optimizes the code for the GPU it runs on, independent of the GPU generation and without the need to manually tune the code.
Abstract
Optimal code performance is (besides correctness and accuracy) the most important objective in compute-intensive applications. In many of these applications, Graphics Processing Units (GPUs) are used because of their high compute power. However, due to their massively parallel architecture, code has to be specifically adjusted to the underlying hardware to achieve optimal performance, and therefore has to be reoptimized for each new hardware generation. In practice this rarely happens, as production code is usually several years old and nobody has the time to continuously adjust existing code to new hardware. In recent years, more and more approaches have emerged that automatically tune the performance of applications toward the underlying hardware. In this article, we present the MATOG auto-tuner and its concepts. It abstracts the array memory accesses in CUDA applications and automatically optimizes the code for the GPU in use. MATOG requires only a few profiling runs to analyze even complex applications, while achieving significant speedups over non-optimized code, independent of the GPU generation and without the need to manually tune the code.
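The core of MATOG's search space is the memory layout of logical arrays: the same data can be stored array-of-structs (AoS) or struct-of-arrays (SoA), and which layout is faster depends on the access pattern and hardware. A minimal CPU-side sketch of the two layouts (an illustration of the concept, not MATOG's actual API) looks like this:

```cpp
#include <cstddef>
#include <vector>

// Array-of-Structs: the components of one element sit contiguously.
struct PointAoS { float x, y, z; };

// Struct-of-Arrays: each component forms its own contiguous array.
// On a GPU this enables coalesced access when consecutive threads
// read the same component of consecutive elements.
struct PointsSoA {
    std::vector<float> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Sum of x-components, AoS layout: strided access, one struct apart.
float sum_x_aos(const std::vector<PointAoS>& pts) {
    float s = 0.0f;
    for (const auto& p : pts) s += p.x;
    return s;
}

// Sum of x-components, SoA layout: unit-stride over a dense array.
float sum_x_soa(const PointsSoA& pts) {
    float s = 0.0f;
    for (float v : pts.x) s += v;
    return s;
}
```

Both functions compute the same result; an auto-tuner like MATOG profiles such layout alternatives and picks the one that performs best on the target GPU, without the programmer rewriting the kernel.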


Citations
Proceedings ArticleDOI

GPU Acceleration of Range Queries over Large Data Sets

TL;DR: This paper presents three GPU algorithms and one CPU based algorithm for the parallel execution of bitmap-range queries and shows that in 95% of tests, using real and synthetic data, the GPU algorithms greatly outperform the parallel CPU algorithm.
Book ChapterDOI

Block-Size Independence for GPU Programs

TL;DR: This paper focuses on optimizing GPU programs by tuning execution parameters, noting that many of these optimizations do not ensure correctness and that subtle errors can be introduced while optimizing a GPU program.
Journal ArticleDOI

Parallel acceleration of CPU and GPU range queries over large data sets

TL;DR: This paper presents four GPU algorithms and two CPU-based algorithms for the parallel execution of bitmap-range queries and shows that in 98.8% of tests, using real and synthetic data, the GPU algorithms greatly outperform the parallel CPU algorithms.
Journal ArticleDOI

Analysis of Schedule and Layout Tuning for Sparse Matrices With Compound Entries on GPUs

TL;DR: This work generalizes several matrix layouts and applies joint schedule and layout autotuning to improve the performance of the sparse matrix-vector product on massively parallel graphics processing units.
Book ChapterDOI

Astute Approach to Handling Memory Layouts of Regular Data Structures

TL;DR: Noarr is a GPU-ready, portable C++ library that uses generic programming, functional design, and compile-time computation to let the programmer specify and compose data-structure layouts declaratively while minimizing indexing and coding overhead.
References
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Proceedings ArticleDOI

Sorting networks and their applications

TL;DR: To achieve high throughput rates, today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
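A sorting network, the subject of this reference, fixes its sequence of compare-exchange operations in advance, so all comparisons within a stage can execute in parallel. As a small illustration (a standard 5-comparator network for 4 inputs, shown sequentially; not code from the cited paper):

```cpp
#include <algorithm>
#include <array>
#include <utility>

// Compare-exchange: order a pair ascending.
void compare_exchange(int& a, int& b) {
    if (a > b) std::swap(a, b);
}

// 4-input sorting network with 5 comparators in 3 stages.
// Comparators within a stage touch disjoint inputs and could
// run in parallel in hardware.
std::array<int, 4> sort4(std::array<int, 4> v) {
    compare_exchange(v[0], v[1]);  // stage 1 (independent pair)
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);  // stage 2 (independent pair)
    compare_exchange(v[1], v[3]);
    compare_exchange(v[1], v[2]);  // stage 3
    return v;
}
```

Because the comparison pattern is data-independent, such networks map well onto massively parallel hardware like GPUs, which is why this classic paper is cited in GPU auto-tuning work.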
Proceedings ArticleDOI

The Reyes image rendering architecture

TL;DR: An architecture is presented for fast high-quality rendering of complex images that uses micropolygons to minimize paging and to support models that contain arbitrarily many primitives.
Proceedings ArticleDOI

OpenTuner: an extensible framework for program autotuning

TL;DR: The efficacy and generality of OpenTuner are demonstrated by building autotuners for 7 distinct projects and 16 total benchmarks, showing speedups over prior techniques of these projects of up to 2.8× with little programmer effort.
Proceedings ArticleDOI

Model-driven autotuning of sparse matrix-vector multiply on GPUs

TL;DR: A performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU) and shows that the model can identify the implementations that achieve within 15% of those found through exhaustive search.
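The kernel tuned in this reference is the sparse matrix-vector product (SpMV). A minimal sequential sketch in the common compressed-sparse-row (CSR) format (an illustration of the operation, not the cited framework's GPU implementation) is:

```cpp
#include <cstddef>
#include <vector>

// Compressed-sparse-row matrix: row_ptr[r]..row_ptr[r+1] delimits
// the nonzeros of row r in col_idx/vals.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> vals;          // size nnz
};

// y = A * x, the SpMV kernel. On a GPU, the autotuning space
// includes how rows map to threads/warps and how the nonzeros
// are laid out in memory.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t r = 0; r < A.rows; ++r)
        for (std::size_t k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            y[r] += A.vals[k] * x[A.col_idx[k]];
    return y;
}
```

The performance of this loop on a GPU depends heavily on the matrix's sparsity structure, which is why model-driven autotuning over implementation variants pays off.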