MATOG: Array Layout Auto-Tuning for CUDA
Nicolas Weber, Michael Goesele
TL;DR
The MATOG auto-tuner abstracts the array memory access in CUDA applications and automatically optimizes the code for the GPU in use, independent of the GPU generation and without the need to manually tune the code.
Abstract
Optimal code performance is (besides correctness and accuracy) the most important objective in compute-intensive applications. In many of these applications, Graphics Processing Units (GPUs) are used because of their high compute power. However, because of their massively parallel architecture, code has to be specifically adjusted to the underlying hardware to achieve optimal performance and therefore has to be reoptimized for each new hardware generation. In practice this rarely happens, as production code is usually several years old and nobody has the time to continuously adjust existing code to new hardware. In recent years, more and more approaches have emerged that automatically tune the performance of applications to the underlying hardware. In this article, we present the MATOG auto-tuner and its concepts. It abstracts the array memory access in CUDA applications and automatically optimizes the code for the GPU in use. MATOG requires only a few profiling runs to analyze even complex applications, while achieving significant speedups over non-optimized code, independent of the GPU generation and without the need to manually tune the code.
Citations
Proceedings Article
GPU Acceleration of Range Queries over Large Data Sets
TL;DR: This paper presents three GPU algorithms and one CPU-based algorithm for the parallel execution of bitmap-range queries and shows that in 95% of tests, using real and synthetic data, the GPU algorithms greatly outperform the parallel CPU algorithm.
Book Chapter
Block-Size Independence for GPU Programs
TL;DR: This paper focuses on optimizing GPU programs by tuning execution parameters; many of these optimizations do not ensure correctness, and subtle errors can be introduced while optimizing a GPU program.
Journal Article
Parallel acceleration of CPU and GPU range queries over large data sets
TL;DR: This paper presents four GPU algorithms and two CPU-based algorithms for the parallel execution of bitmap-range queries and shows that in 98.8% of tests, using real and synthetic data, the GPU algorithms greatly outperform the parallel CPU algorithms.
Journal Article
Analysis of Schedule and Layout Tuning for Sparse Matrices With Compound Entries on GPUs
TL;DR: This work generalizes several matrix layouts and applies joint schedule and layout auto-tuning to improve the performance of the sparse matrix-vector product on massively parallel graphics processing units.
Book Chapter
Astute Approach to Handling Memory Layouts of Regular Data Structures
TL;DR: Noarr is a GPU-ready, portable C++ library that uses generic programming, functional design, and compile-time computation to let the programmer specify and compose data-structure layouts declaratively while minimizing indexing and coding overhead.
References
Proceedings Article
Rodinia: A benchmark suite for heterogeneous computing
Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, Kevin Skadron
TL;DR: This characterization shows that the Rodinia benchmarks cover a wide range of parallel communication patterns, synchronization techniques, and power consumption, and has led to important architectural insights, such as the growing importance of memory-bandwidth limitations and the consequent importance of data layout.
Proceedings Article
Sorting networks and their applications
TL;DR: To achieve high throughput rates, today's computers perform several operations simultaneously; not only are I/O operations performed concurrently with computing, but in multiprocessors several computing operations are also done concurrently.
Proceedings Article
The Reyes image rendering architecture
TL;DR: An architecture is presented for fast high-quality rendering of complex images that uses micropolygons to minimize paging and to support models that contain arbitrarily many primitives.
Proceedings Article
OpenTuner: an extensible framework for program autotuning
Jason Ansel, Shoaib Kamil, Kalyan Veeramachaneni, Jonathan Ragan-Kelley, Jeffrey Bosboom, Una-May O'Reilly, Saman Amarasinghe
TL;DR: The efficacy and generality of OpenTuner are demonstrated by building autotuners for 7 distinct projects and 16 total benchmarks, showing speedups of up to 2.8× over prior techniques for these projects, with little programmer effort.
Proceedings Article
Model-driven autotuning of sparse matrix-vector multiply on GPUs
TL;DR: A performance model-driven framework for automated performance tuning (autotuning) of sparse matrix-vector multiply (SpMV) on systems accelerated by graphics processing units (GPUs); the model can identify implementations that achieve within 15% of those found through exhaustive search.