Proceedings Article
3.5-D Blocking Optimization for Stencil Computations on Modern CPUs and GPUs
Anthony Nguyen, Nadathur Satish, Jatin Chhugani, Changkyu Kim, Pradeep Dubey
- pp 1-13
Abstract:
Stencil computation sweeps over a spatial grid over multiple time steps to perform nearest-neighbor computations. The bandwidth-to-compute requirement for a large class of stencil kernels is very high, and their performance is bound by the available memory bandwidth. Since memory bandwidth grows slower than compute, the performance of stencil kernels will not scale with increasing compute density. We present a novel 3.5D-blocking algorithm that performs 2.5D-spatial and temporal blocking of the input grid into on-chip memory for both CPUs and GPUs. The resultant algorithm is amenable to both thread-level and data-level parallelism, and scales near-linearly with the SIMD width and multiple cores. Our performance numbers are faster or comparable to state-of-the-art stencil implementations on CPUs and GPUs. Our implementation of the 7-point stencil is 1.5X faster on CPUs, and 1.8X faster on GPUs, for single-precision floating-point inputs than previously reported numbers. For Lattice Boltzmann methods, the corresponding speedup number on CPUs is 2.1X.
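The idea of combining spatial and temporal blocking can be illustrated with a small sketch. The code below is not the paper's implementation. It is a simplified, pure-Python illustration of overlapped (ghost-zone) tiling in the XY plane with several time steps fused per tile, checked against a plain sweep. The weights `ALPHA` and `BETA`, the fixed-boundary condition, and the helper names (`sweep`, `naive`, `blocked`, `make_grid`) are assumptions made for the example. For brevity, each tile keeps its full Z extent instead of streaming through Z with a small buffer of planes, which is what a 2.5D scheme would do.

```python
ALPHA, BETA = 0.4, 0.1  # illustrative stencil weights (assumed, not from the paper)


def sweep(grid):
    """One Jacobi sweep of a 7-point stencil; the outermost layer stays fixed."""
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])
    new = [[col[:] for col in plane] for plane in grid]
    for x in range(1, nx - 1):
        for y in range(1, ny - 1):
            for z in range(1, nz - 1):
                new[x][y][z] = ALPHA * grid[x][y][z] + BETA * (
                    grid[x - 1][y][z] + grid[x + 1][y][z]
                    + grid[x][y - 1][z] + grid[x][y + 1][z]
                    + grid[x][y][z - 1] + grid[x][y][z + 1]
                )
    return new


def naive(grid, steps):
    """Reference version: every time step touches the whole grid."""
    for _ in range(steps):
        grid = sweep(grid)
    return grid


def blocked(grid, steps, tile):
    """XY tiles with a halo of `steps` cells; all time steps run per tile."""
    nx, ny = len(grid), len(grid[0])
    out = [[col[:] for col in plane] for plane in grid]
    for x0 in range(0, nx, tile):
        for y0 in range(0, ny, tile):
            x1, y1 = min(x0 + tile, nx), min(y0 + tile, ny)
            # Extend the tile by one ghost cell per fused time step,
            # clamped to the grid so real boundaries stay real boundaries.
            lx, ly = max(x0 - steps, 0), max(y0 - steps, 0)
            hx, hy = min(x1 + steps, nx), min(y1 + steps, ny)
            local = [[grid[x][y][:] for y in range(ly, hy)]
                     for x in range(lx, hx)]
            # Stale values creep in one cell per step from artificial edges,
            # so after `steps` sweeps the tile's own cells are still exact.
            local = naive(local, steps)
            for x in range(x0, x1):
                for y in range(y0, y1):
                    out[x][y] = local[x - lx][y - ly]
    return out


def make_grid(nx, ny, nz):
    """Deterministic, non-uniform test data (hypothetical helper)."""
    return [[[float((7 * x + 3 * y + 5 * z) % 11) for z in range(nz)]
             for y in range(ny)] for x in range(nx)]


if __name__ == "__main__":
    g = make_grid(10, 10, 6)
    assert blocked(g, steps=2, tile=4) == naive(g, steps=2)
    print("blocked and naive sweeps agree")
```

In this sketch each tile is read from the full grid once and written once for `steps` time steps, at the cost of recomputing the halo cells. That trade of redundant compute for reduced memory traffic is the general motivation for temporal blocking in bandwidth-bound kernels.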
Citations
Proceedings Article
Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines
Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, Saman Amarasinghe
TL;DR: A systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation which describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high performance implementations from a Halide algorithm and a schedule are presented.
Proceedings Article
High-performance code generation for stencil computations on GPU architectures
TL;DR: This paper develops compiler algorithms for automatic generation of efficient, time-tiled stencil code for GPU accelerators from a high-level description of the stencil operation, and shows that the code generation scheme can achieve high performance on a range of GPU architectures, including both nVidia and AMD devices.
Journal Article
Darkroom: compiling high-level image processing code into hardware pipelines
James Hegarty, John Brunhaver, Zachary DeVito, Jonathan Ragan-Kelley, Noy Cohen, Steven Bell, Artem Vasilyev, Mark Horowitz, Pat Hanrahan
TL;DR: The semantics of the Darkroom language allow it to compile programs directly into line-buffered pipelines, with all intermediate values in local line-buffer storage, eliminating unnecessary communication with off-chip DRAM.
Proceedings Article
Physis: an implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers
TL;DR: A compiler-based programming framework that automatically translates user-written structured grid code into scalable parallel implementation code for GPU-equipped clusters is proposed and the feasibility of such automatic translations is demonstrated by implementing several structured grid applications in this framework.
Proceedings Article
Auto-generation and auto-tuning of 3D stencil codes on GPU clusters
Yongpeng Zhang, Frank Mueller
TL;DR: This proposed framework takes a most concise specification of stencil behavior from the user as a single formula, auto-generates tunable code from it, systematically searches for the best configuration and generates the code with optimal parameter configurations for different GPUs.
References
Book
Adaptive mesh refinement for hyperbolic partial differential equations
Marsha Berger, Joseph Oliger
TL;DR: This work presents an adaptive method based on the idea of multiple, component grids for the solution of hyperbolic partial differential equations using finite difference techniques based upon Richardson-type estimates of the truncation error, which is a mesh refinement algorithm in time and space.
Journal Article
Synthesis and evaluation of linear motion transitions
Jing Wang, Bobby Bodenheimer
TL;DR: This article develops methods for determining visually appealing motion transitions using linear blending, and assesses the importance of these techniques by determining the minimum sensitivity of viewers to transition durations, the just noticeable difference, for both center-aligned and start-end specifications.
Journal Article
Algorithms for scalable synchronization on shared-memory multiprocessors
TL;DR: The principal conclusion is that contention due to synchronization need not be a problem in large-scale shared-memory multiprocessors, and the existence of scalable algorithms greatly weakens the case for costly special-purpose hardware support for synchronization, and provides protection against so-called “dance hall” architectures.
Book
Algorithms for scalable synchronization on shared-memory multiprocessors
TL;DR: In this article, the authors present a scalable algorithm for spin locks that provides reasonable latency in the absence of contention, requires only a constant amount of space per lock, and requires no hardware support other than a swap-with-memory instruction.
Journal Article
Larrabee: a many-core x86 architecture for visual computing
Larry D. Seiler, Doug Carmean, Eric Sprangle, Tom Forsyth, Michael Abrash, Pradeep Dubey, Stephen Junkins, Adam T. Lake, Jeremy Sugerman, Robert Dale Cavin, Roger Espasa, Ed Grochowski, Toni Juan, Pat Hanrahan
TL;DR: This article consists of a collection of slides from the author's conference presentation, some of the topics discussed include: architecture convergence; Larrabee architecture; and graphics pipeline.