Proceedings ArticleDOI

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

TLDR
A novel 3.5D-blocking algorithm that performs 2.5D spatial and temporal blocking of the input grid into on-chip memory is presented for both CPUs and GPUs, scaling near-linearly with SIMD width and core count.
Abstract
A stencil computation sweeps over a spatial grid across multiple time steps, performing nearest-neighbor computations at each point. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows more slowly than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and the number of cores. Our performance numbers are faster than or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs and 1.8X faster on GPUs for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup on CPUs is 2.1X.
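The core idea is easiest to see in code. Below is a minimal C++ sketch of the 2.5D spatial-blocking half of the scheme: the XY plane is tiled so each tile fits in on-chip memory, and the sweep streams along Z. The full 3.5D algorithm additionally applies temporal blocking so that several time steps are performed per streamed pass; that part is omitted here for brevity. The function name, tile parameters, and stencil coefficients are illustrative assumptions, not the authors' code.

```cpp
// Minimal sketch of 2.5D spatial blocking for a 7-point Jacobi-style stencil:
// tile the XY plane, stream each tile along Z. Boundary planes are left
// untouched. Names, tile sizes, and coefficients are illustrative.
#include <cstddef>

#define IDX(x, y, z, nx, ny) \
    ((std::size_t)(z) * (nx) * (ny) + (std::size_t)(y) * (nx) + (x))

// One sweep over an nx x ny x nz grid; tx, ty are the XY tile sizes.
void stencil7_2p5d(const float *in, float *out,
                   int nx, int ny, int nz,
                   int tx, int ty)
{
    for (int yy = 1; yy < ny - 1; yy += ty) {
        for (int xx = 1; xx < nx - 1; xx += tx) {
            int ymax = (yy + ty < ny - 1) ? yy + ty : ny - 1;
            int xmax = (xx + tx < nx - 1) ? xx + tx : nx - 1;
            // Stream the tile along Z; only a few Z-planes of the tile
            // need to be cache-resident at any time.
            for (int z = 1; z < nz - 1; ++z)
                for (int y = yy; y < ymax; ++y)
                    for (int x = xx; x < xmax; ++x)
                        out[IDX(x, y, z, nx, ny)] =
                            0.4f * in[IDX(x, y, z, nx, ny)] +
                            0.1f * (in[IDX(x - 1, y, z, nx, ny)] +
                                    in[IDX(x + 1, y, z, nx, ny)] +
                                    in[IDX(x, y - 1, z, nx, ny)] +
                                    in[IDX(x, y + 1, z, nx, ny)] +
                                    in[IDX(x, y, z - 1, nx, ny)] +
                                    in[IDX(x, y, z + 1, nx, ny)]);
        }
    }
}
```

Because streaming along Z keeps only a handful of Z-planes of each tile resident at once, there is room left in on-chip memory for the temporal dimension, which is what the extra "1D" of the 3.5D scheme exploits.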

Citations
Proceedings ArticleDOI

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

TL;DR: This paper presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage of an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
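For context, the separation described above looks roughly like the canonical blur example from the Halide paper: the algorithm states what is computed, and the schedule picks one concrete point in the locality/parallelism/recomputation tradeoff space. The sketch below is reconstructed from memory and should be treated as an approximation, not code verified against a particular Halide release.

```cpp
// Hedged approximation of Halide's canonical separable 3x3 blur:
// the algorithm is written once; the schedule chooses tiling,
// vectorization, parallelism, and where blur_x is recomputed.
#include "Halide.h"
using namespace Halide;

Func make_blur(Func input) {
    Var x("x"), y("y"), xi("xi"), yi("yi");
    Func blur_x("blur_x"), blur_y("blur_y");

    // Algorithm: what is computed.
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: tile the output, vectorize and parallelize, and recompute
    // blur_x per tile instead of materializing it whole.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);
    return blur_y;
}
```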
Proceedings ArticleDOI

High-performance code generation for stencil computations on GPU architectures

TL;DR: This paper develops compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation, and shows that the code generation scheme can achieve high performance on a range of GPU architectures, including both nVidia and AMD devices.
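Time tiling amortizes memory traffic by applying several time steps to data while it is resident on chip. One common variant, overlapped (ghost-zone) tiling, is sketched below at a toy 1D scale; it is not necessarily the exact scheme used by the paper above, and the names, tile size, and boundary handling are illustrative assumptions.

```cpp
// Toy overlapped (ghost-zone) time tiling for a 1D 3-point stencil:
// each tile is loaded with T halo cells per side, T time steps are applied
// entirely in a local buffer, and only the tile interior is written back.
// `out` is assumed to be preallocated to in.size().
#include <vector>
#include <algorithm>

void stencil3_time_tiled(const std::vector<float>& in, std::vector<float>& out,
                         int tile, int T)
{
    const int n = (int)in.size();
    std::vector<float> buf, tmp;
    for (int start = 0; start < n; start += tile) {
        int end = std::min(start + tile, n);
        // Load the tile plus T ghost cells on each side (clamped at the ends).
        int lo = std::max(start - T, 0), hi = std::min(end + T, n);
        buf.assign(in.begin() + lo, in.begin() + hi);
        tmp.resize(buf.size());
        // Apply T time steps locally; the outermost cells act as fixed
        // boundaries and lose one cell of validity per step.
        for (int t = 0; t < T; ++t) {
            tmp.front() = buf.front();
            tmp.back()  = buf.back();
            for (std::size_t i = 1; i + 1 < buf.size(); ++i)
                tmp[i] = 0.25f * buf[i - 1] + 0.5f * buf[i] + 0.25f * buf[i + 1];
            buf.swap(tmp);
        }
        // After T steps only the tile interior [start, end) is valid.
        std::copy(buf.begin() + (start - lo),
                  buf.begin() + (start - lo) + (end - start),
                  out.begin() + start);
    }
}
```

The redundant work on the ghost regions is the price overlapped tiling pays for independence between tiles, which is why alternative time-blocking schemes (such as the streaming approach in the paper above) try to avoid it.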
Journal ArticleDOI

Darkroom: compiling high-level image processing code into hardware pipelines

TL;DR: The semantics of the Darkroom language allow it to compile programs directly into line-buffered pipelines, with all intermediate values in local line-buffer storage, eliminating unnecessary communication with off-chip DRAM.
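A line buffer is the key structure behind such pipelines: each stage retains only the last few input rows it needs, so intermediate images never round-trip through off-chip DRAM. The toy C++ sketch below shows the idea for a 3-row vertical blur; the class name and interface are illustrative assumptions, not Darkroom's representation.

```cpp
// Toy line-buffered stage: a 3-row vertical blur that keeps only the last
// three input rows in a small circular buffer.
#include <cstddef>
#include <vector>

class LineBufferedBlurY {
public:
    explicit LineBufferedBlurY(std::size_t width)
        : width_(width), rows_(3, std::vector<float>(width, 0.0f)), count_(0) {}

    // Push one input row; once 3 rows are buffered, fill `out` and return true.
    bool push_row(const std::vector<float>& row, std::vector<float>& out) {
        rows_[count_ % 3] = row;          // overwrite the oldest buffered row
        ++count_;
        if (count_ < 3) return false;
        out.resize(width_);
        const auto& r0 = rows_[(count_ - 3) % 3];
        const auto& r1 = rows_[(count_ - 2) % 3];
        const auto& r2 = rows_[(count_ - 1) % 3];
        for (std::size_t x = 0; x < width_; ++x)
            out[x] = (r0[x] + r1[x] + r2[x]) / 3.0f;
        return true;
    }

private:
    std::size_t width_;
    std::vector<std::vector<float>> rows_;  // circular buffer of 3 rows
    std::size_t count_;                     // rows pushed so far
};
```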
Proceedings ArticleDOI

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

TL;DR: A compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters is proposed and the feasibility of such automatic translations is demonstrated by implementing several structured grid applications in this framework.
Proceedings ArticleDOI

Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

TL;DR: The proposed framework takes a concise specification of stencil behavior from the user as a single formula, auto-generates tunable code from it, systematically searches for the best configuration, and generates code with optimal parameter configurations for different GPUs.
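The search step can be as simple as timing the generated kernel over a small grid of candidate tile sizes and keeping the fastest, as in the sketch below. The candidate sets and the kernel signature (borrowed from the earlier 2.5D-blocking sketch) are illustrative assumptions, not the framework's tuner.

```cpp
// Toy auto-tuning loop: exhaustively time a tiled stencil sweep over a few
// candidate XY tile sizes and report the fastest configuration.
#include <chrono>
#include <cstdio>

// Assumed to exist elsewhere: one tiled stencil sweep (see the earlier sketch).
void stencil7_2p5d(const float *in, float *out,
                   int nx, int ny, int nz, int tx, int ty);

void autotune(const float *in, float *out, int nx, int ny, int nz)
{
    const int candidates[] = {8, 16, 32, 64, 128};
    double best = 1e30;
    int best_tx = 0, best_ty = 0;
    for (int tx : candidates) {
        for (int ty : candidates) {
            auto t0 = std::chrono::steady_clock::now();
            stencil7_2p5d(in, out, nx, ny, nz, tx, ty);
            auto t1 = std::chrono::steady_clock::now();
            double s = std::chrono::duration<double>(t1 - t0).count();
            if (s < best) { best = s; best_tx = tx; best_ty = ty; }
        }
    }
    std::printf("best tile: %d x %d (%.3f s per sweep)\n", best_tx, best_ty, best);
}
```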