Open Access · Journal ArticleDOI

MATOG: Array Layout Auto-Tuning for CUDA

TL;DR: The MATOG auto-tuner abstracts array memory accesses in CUDA applications and automatically optimizes the code for the GPU it runs on, independent of the GPU generation and without the need to manually tune the code.
Abstract
Optimal code performance is (besides correctness and accuracy) the most important objective in compute-intensive applications. In many of these applications, Graphics Processing Units (GPUs) are used because of their high compute power. However, due to their massively parallel architecture, code has to be specifically adjusted to the underlying hardware to achieve optimal performance, and therefore has to be reoptimized for each new hardware generation. In practice this rarely happens, as production code is usually several years old and nobody has the time to continuously adjust existing code to new hardware. In recent years, more and more approaches have emerged that automatically tune the performance of applications toward the underlying hardware. In this article, we present the MATOG auto-tuner and its concepts. It abstracts the array memory accesses in CUDA applications and automatically optimizes the code for the GPU in use. MATOG requires only a few profiling runs to analyze even complex applications, while achieving significant speedups over non-optimized code, independent of the GPU generation and without the need to manually tune the code.
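The core of MATOG's search space is the memory layout of logical arrays: the same data can be stored array-of-structs (AoS) or struct-of-arrays (SoA), and which layout is faster depends on the access pattern and hardware. A minimal CPU-side sketch of the two layouts (an illustration of the concept, not MATOG's actual API) looks like this:

```cpp
#include <cstddef>
#include <vector>

// Array-of-Structs: the components of one element sit contiguously.
struct PointAoS { float x, y, z; };

// Struct-of-Arrays: each component forms its own contiguous array.
// On a GPU this enables coalesced access when consecutive threads
// read the same component of consecutive elements.
struct PointsSoA {
    std::vector<float> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
};

// Sum of x-components, AoS layout: strided access, one struct apart.
float sum_x_aos(const std::vector<PointAoS>& pts) {
    float s = 0.0f;
    for (const auto& p : pts) s += p.x;
    return s;
}

// Sum of x-components, SoA layout: unit-stride over a dense array.
float sum_x_soa(const PointsSoA& pts) {
    float s = 0.0f;
    for (float v : pts.x) s += v;
    return s;
}
```

Both functions compute the same result; an auto-tuner like MATOG profiles such layout alternatives and picks the one that performs best on the target GPU, without the programmer rewriting the kernel.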


Citations
Proceedings ArticleDOI

GPU Acceleration of Range Queries over Large Data Sets

TL;DR: This paper presents three GPU algorithms and one CPU based algorithm for the parallel execution of bitmap-range queries and shows that in 95% of tests, using real and synthetic data, the GPU algorithms greatly outperform the parallel CPU algorithm.
Book ChapterDOI

Block-Size Independence for GPU Programs

TL;DR: This paper focuses on optimizing GPU programs by tuning execution parameters, noting that many of these optimizations do not ensure correctness and that subtle errors can be introduced while optimizing a GPU program.
Journal ArticleDOI

Parallel acceleration of CPU and GPU range queries over large data sets

TL;DR: This paper presents four GPU algorithms and two CPU-based algorithms for the parallel execution of bitmap-range queries and shows that in 98.8% of tests, using real and synthetic data, the GPU algorithms greatly outperform the parallel CPU algorithms.
Journal ArticleDOI

Analysis of Schedule and Layout Tuning for Sparse Matrices With Compound Entries on GPUs

TL;DR: This work generalizes several matrix layouts and applies joint schedule and layout autotuning to improve the performance of the sparse matrix-vector product on massively parallel graphics processing units.
Book ChapterDOI

Astute Approach to Handling Memory Layouts of Regular Data Structures

TL;DR: Noarr is a GPU-ready, portable C++ library that uses generic programming, functional design, and compile-time computation to let the programmer specify and compose data-structure layouts declaratively while minimizing indexing and coding overhead.
References
Proceedings ArticleDOI

Rodinia: A benchmark suite for heterogeneous computing

TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques and power consumption, and has led to some important architectural insight, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Proceedings ArticleDOI

Sorting networks and their applications

TL;DR: To achieve high throughput rates, today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently.
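A sorting network, the subject of this reference, fixes its sequence of compare-exchange operations in advance, so all comparisons within a stage can execute in parallel. As a small illustration (a standard 5-comparator network for 4 inputs, shown sequentially; not code from the cited paper):

```cpp
#include <algorithm>
#include <array>
#include <utility>

// Compare-exchange: order a pair ascending.
void compare_exchange(int& a, int& b) {
    if (a > b) std::swap(a, b);
}

// 4-input sorting network with 5 comparators in 3 stages.
// Comparators within a stage touch disjoint inputs and could
// run in parallel in hardware.
std::array<int, 4> sort4(std::array<int, 4> v) {
    compare_exchange(v[0], v[1]);  // stage 1 (independent pair)
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);  // stage 2 (independent pair)
    compare_exchange(v[1], v[3]);
    compare_exchange(v[1], v[2]);  // stage 3
    return v;
}
```

Because the comparison pattern is data-independent, such networks map well onto massively parallel hardware like GPUs, which is why this classic paper is cited in GPU auto-tuning work.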
Proceedings ArticleDOI

The Reyes image rendering architecture

TL;DR: An architecture is presented for fast high-quality rendering of complex images that uses micropolygons to minimize paging and to support models that contain arbitrarily many primitives.
Proceedings ArticleDOI

OpenTuner: an extensible framework for program autotuning

TL;DR: The efficacy and generality of OpenTuner are demonstrated by building autotuners for 7 distinct projects and 16 total benchmarks, showing speedups over prior techniques of these projects of up to 2.8× with little programmer effort.
Proceedings ArticleDOI

Model-driven autotuning of sparse matrix-vector multiply on GPUs

TL;DR: A performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPU) and shows that the model can identify the implementations that achieve within 15% of those found through exhaustive search.
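The kernel tuned in this reference is the sparse matrix-vector product (SpMV). A minimal sequential sketch in the common compressed-sparse-row (CSR) format (an illustration of the operation, not the cited framework's GPU implementation) is:

```cpp
#include <cstddef>
#include <vector>

// Compressed-sparse-row matrix: row_ptr[r]..row_ptr[r+1] delimits
// the nonzeros of row r in col_idx/vals.
struct CsrMatrix {
    std::size_t rows;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> vals;          // size nnz
};

// y = A * x, the SpMV kernel. On a GPU, the autotuning space
// includes how rows map to threads/warps and how the nonzeros
// are laid out in memory.
std::vector<double> spmv(const CsrMatrix& A, const std::vector<double>& x) {
    std::vector<double> y(A.rows, 0.0);
    for (std::size_t r = 0; r < A.rows; ++r)
        for (std::size_t k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k)
            y[r] += A.vals[k] * x[A.col_idx[k]];
    return y;
}
```

The performance of this loop on a GPU depends heavily on the matrix's sparsity structure, which is why model-driven autotuning over implementation variants pays off.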