Proceedings ArticleDOI

3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs

TLDR
A novel 3.5D-blocking algorithm that performs 2.5D spatial and temporal blocking of the input grid into on-chip memory is presented for both CPUs and GPUs, scaling near-linearly with SIMD width and core count.
Abstract
A stencil computation sweeps over a spatial grid across multiple time steps, performing nearest-neighbor computations at each point. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows more slowly than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and the number of cores. Our performance numbers are faster than or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs and 1.8X faster on GPUs for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup on CPUs is 2.1X.
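The core idea is easiest to see in code. Below is a minimal C++ sketch of the 2.5D spatial-blocking half of the scheme: the XY plane is tiled so each tile fits in on-chip memory, and the sweep streams along Z. The full 3.5D algorithm additionally applies temporal blocking so that several time steps are performed per streamed pass; that part is omitted here for brevity. The function name, tile parameters, and stencil coefficients are illustrative assumptions, not the authors' code.

```cpp
// Minimal sketch of 2.5D spatial blocking for a 7-point Jacobi-style stencil:
// tile the XY plane, stream each tile along Z. Boundary planes are left
// untouched. Names, tile sizes, and coefficients are illustrative.
#include <cstddef>

#define IDX(x, y, z, nx, ny) \
    ((std::size_t)(z) * (nx) * (ny) + (std::size_t)(y) * (nx) + (x))

// One sweep over an nx x ny x nz grid; tx, ty are the XY tile sizes.
void stencil7_2p5d(const float *in, float *out,
                   int nx, int ny, int nz,
                   int tx, int ty)
{
    for (int yy = 1; yy < ny - 1; yy += ty) {
        for (int xx = 1; xx < nx - 1; xx += tx) {
            int ymax = (yy + ty < ny - 1) ? yy + ty : ny - 1;
            int xmax = (xx + tx < nx - 1) ? xx + tx : nx - 1;
            // Stream the tile along Z; only a few Z-planes of the tile
            // need to be cache-resident at any time.
            for (int z = 1; z < nz - 1; ++z)
                for (int y = yy; y < ymax; ++y)
                    for (int x = xx; x < xmax; ++x)
                        out[IDX(x, y, z, nx, ny)] =
                            0.4f * in[IDX(x, y, z, nx, ny)] +
                            0.1f * (in[IDX(x - 1, y, z, nx, ny)] +
                                    in[IDX(x + 1, y, z, nx, ny)] +
                                    in[IDX(x, y - 1, z, nx, ny)] +
                                    in[IDX(x, y + 1, z, nx, ny)] +
                                    in[IDX(x, y, z - 1, nx, ny)] +
                                    in[IDX(x, y, z + 1, nx, ny)]);
        }
    }
}
```

Because streaming along Z keeps only a handful of Z-planes of each tile resident at once, there is room left in on-chip memory for the temporal dimension, which is what the extra "1D" of the 3.5D scheme exploits.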

Citations
Proceedings ArticleDOI

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

TL;DR: This paper presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage of an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.
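For context, the separation described above looks roughly like the canonical blur example from the Halide paper: the algorithm states what is computed, and the schedule picks one concrete point in the locality/parallelism/recomputation tradeoff space. The sketch below is reconstructed from memory and should be treated as an approximation, not code verified against a particular Halide release.

```cpp
// Hedged approximation of Halide's canonical separable 3x3 blur:
// the algorithm is written once; the schedule chooses tiling,
// vectorization, parallelism, and where blur_x is recomputed.
#include "Halide.h"
using namespace Halide;

Func make_blur(Func input) {
    Var x("x"), y("y"), xi("xi"), yi("yi");
    Func blur_x("blur_x"), blur_y("blur_y");

    // Algorithm: what is computed.
    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: tile the output, vectorize and parallelize, and recompute
    // blur_x per tile instead of materializing it whole.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);
    return blur_y;
}
```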
Proceedings ArticleDOI

High-performance code generation for stencil computations on GPU architectures

TL;DR: This paper develops compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation, and shows that the code generation scheme can achieve high performance on a range of GPU architectures, including both nVidia and AMD devices.
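Time tiling amortizes memory traffic by applying several time steps to data while it is resident on chip. One common variant, overlapped (ghost-zone) tiling, is sketched below at a toy 1D scale; it is not necessarily the exact scheme used by the paper above, and the names, tile size, and boundary handling are illustrative assumptions.

```cpp
// Toy overlapped (ghost-zone) time tiling for a 1D 3-point stencil:
// each tile is loaded with T halo cells per side, T time steps are applied
// entirely in a local buffer, and only the tile interior is written back.
// `out` is assumed to be preallocated to in.size().
#include <vector>
#include <algorithm>

void stencil3_time_tiled(const std::vector<float>& in, std::vector<float>& out,
                         int tile, int T)
{
    const int n = (int)in.size();
    std::vector<float> buf, tmp;
    for (int start = 0; start < n; start += tile) {
        int end = std::min(start + tile, n);
        // Load the tile plus T ghost cells on each side (clamped at the ends).
        int lo = std::max(start - T, 0), hi = std::min(end + T, n);
        buf.assign(in.begin() + lo, in.begin() + hi);
        tmp.resize(buf.size());
        // Apply T time steps locally; the outermost cells act as fixed
        // boundaries and lose one cell of validity per step.
        for (int t = 0; t < T; ++t) {
            tmp.front() = buf.front();
            tmp.back()  = buf.back();
            for (std::size_t i = 1; i + 1 < buf.size(); ++i)
                tmp[i] = 0.25f * buf[i - 1] + 0.5f * buf[i] + 0.25f * buf[i + 1];
            buf.swap(tmp);
        }
        // After T steps only the tile interior [start, end) is valid.
        std::copy(buf.begin() + (start - lo),
                  buf.begin() + (start - lo) + (end - start),
                  out.begin() + start);
    }
}
```

The redundant work on the ghost regions is the price overlapped tiling pays for independence between tiles, which is why alternative time-blocking schemes (such as the streaming approach in the paper above) try to avoid it.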
Journal ArticleDOI

Darkroom: compiling high-level image processing code into hardware pipelines

TL;DR: The semantics of the Darkroom language allow it to compile programs directly into line-buffered pipelines, with all intermediate values in local line-buffer storage, eliminating unnecessary communication with off-chip DRAM.
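A line buffer is the key structure behind such pipelines: each stage retains only the last few input rows it needs, so intermediate images never round-trip through off-chip DRAM. The toy C++ sketch below shows the idea for a 3-row vertical blur; the class name and interface are illustrative assumptions, not Darkroom's representation.

```cpp
// Toy line-buffered stage: a 3-row vertical blur that keeps only the last
// three input rows in a small circular buffer.
#include <cstddef>
#include <vector>

class LineBufferedBlurY {
public:
    explicit LineBufferedBlurY(std::size_t width)
        : width_(width), rows_(3, std::vector<float>(width, 0.0f)), count_(0) {}

    // Push one input row; once 3 rows are buffered, fill `out` and return true.
    bool push_row(const std::vector<float>& row, std::vector<float>& out) {
        rows_[count_ % 3] = row;          // overwrite the oldest buffered row
        ++count_;
        if (count_ < 3) return false;
        out.resize(width_);
        const auto& r0 = rows_[(count_ - 3) % 3];
        const auto& r1 = rows_[(count_ - 2) % 3];
        const auto& r2 = rows_[(count_ - 1) % 3];
        for (std::size_t x = 0; x < width_; ++x)
            out[x] = (r0[x] + r1[x] + r2[x]) / 3.0f;
        return true;
    }

private:
    std::size_t width_;
    std::vector<std::vector<float>> rows_;  // circular buffer of 3 rows
    std::size_t count_;                     // rows pushed so far
};
```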
Proceedings ArticleDOI

Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers

TL;DR: A compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters is proposed and the feasibility of such automatic translations is demonstrated by implementing several structured grid applications in this framework.
Proceedings ArticleDOI

Auto-generation and auto-tuning of 3D stencil codes on GPU clusters

TL;DR: The proposed framework takes a concise specification of stencil behavior from the user as a single formula, auto-generates tunable code from it, systematically searches for the best configuration, and generates code with optimal parameter configurations for different GPUs.
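The search step can be as simple as timing the generated kernel over a small grid of candidate tile sizes and keeping the fastest, as in the sketch below. The candidate sets and the kernel signature (borrowed from the earlier 2.5D-blocking sketch) are illustrative assumptions, not the framework's tuner.

```cpp
// Toy auto-tuning loop: exhaustively time a tiled stencil sweep over a few
// candidate XY tile sizes and report the fastest configuration.
#include <chrono>
#include <cstdio>

// Assumed to exist elsewhere: one tiled stencil sweep (see the earlier sketch).
void stencil7_2p5d(const float *in, float *out,
                   int nx, int ny, int nz, int tx, int ty);

void autotune(const float *in, float *out, int nx, int ny, int nz)
{
    const int candidates[] = {8, 16, 32, 64, 128};
    double best = 1e30;
    int best_tx = 0, best_ty = 0;
    for (int tx : candidates) {
        for (int ty : candidates) {
            auto t0 = std::chrono::steady_clock::now();
            stencil7_2p5d(in, out, nx, ny, nz, tx, ty);
            auto t1 = std::chrono::steady_clock::now();
            double s = std::chrono::duration<double>(t1 - t0).count();
            if (s < best) { best = s; best_tx = tx; best_ty = ty; }
        }
    }
    std::printf("best tile: %d x %d (%.3f s per sweep)\n", best_tx, best_ty, best);
}
```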