Proceedings ArticleDOI

Using time skewing to eliminate idle time due to memory bandwidth and network limitations

David Wonnacott
pp. 171-180
TLDR
A generalization of time skewing for multiprocessor architectures is given, together with techniques for using multilevel caches that reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.
Abstract
Time skewing is a compile-time optimization that can provide arbitrarily high cache hit rates for a class of iterative calculations, given a sufficient number of time steps and sufficient cache memory. Thus, it can eliminate processor idle time caused by inadequate main memory bandwidth. In this article, we give a generalization of time skewing for multiprocessor architectures, and discuss time skewing for multilevel caches. Our generalization for multiprocessors lets us eliminate processor idle time caused by any combination of inadequate main memory bandwidth, limited network bandwidth, and high network latency, given a sufficiently large problem and sufficient cache. As in the uniprocessor case, the cache requirement grows with the machine balance rather than the problem size. Our techniques for using multilevel caches reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.
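To make the transformation concrete, the following is a minimal uniprocessor sketch of the skewed-and-tiled loop structure that time skewing produces for a 1-D three-point Jacobi stencil. The grid size, tile sizes, two-buffer storage scheme, and fixed-boundary handling are illustrative assumptions of this sketch, not the paper's code; the space loop is skewed by the time step (j = i + t) so that rectangular tiles of the (t, j) space are legal and each tile's working set stays in cache.

    /* Illustrative sizes (assumptions): grid points, time steps, tile extents. */
    enum { N = 1024, T = 512, TB = 64, SB = 256 };

    /* A[t % 2] holds the grid at time t; boundary cells A[*][0] and A[*][N-1]
     * are assumed to be initialized in both buffers and held fixed. */
    static double A[2][N];

    /* Skewed-and-tiled 1-D Jacobi sweep.  After the skew j = i + t, every
     * dependence has non-negative components in (t, j), so rectangular tiles
     * executed in lexicographic (tt, jj) order are legal, and each tile runs
     * up to TB time steps over a block of data that fits in cache. */
    void jacobi_time_skewed(void) {
        for (int tt = 0; tt < T; tt += TB) {
            for (int jj = tt + 1; jj < tt + TB + N - 2; jj += SB) {
                for (int t = tt; t < tt + TB && t < T; t++) {
                    /* un-skew: i = j - t; clip j to this tile and the grid interior */
                    int jlo = jj > t + 1 ? jj : t + 1;
                    int jhi = (jj + SB < t + N - 1) ? jj + SB : t + N - 1;
                    for (int j = jlo; j < jhi; j++) {
                        int i = j - t;
                        A[(t + 1) % 2][i] =
                            (A[t % 2][i - 1] + A[t % 2][i] + A[t % 2][i + 1]) / 3.0;
                    }
                }
            }
        }
    }

With SB chosen so that roughly 2*SB doubles fit in cache, each value brought from main memory is reused for about TB time steps, which is the source of the arbitrarily high hit rates the abstract describes.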


Citations

The Potential of the Cell Processor for Scientific Computing

TL;DR: In this article, the authors examined the potential of using the STI Cell processor as a building block for future high-end computing systems and proposed modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
Proceedings ArticleDOI

The potential of the cell processor for scientific computing

TL;DR: This work introduces a performance model for Cell and applies it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs, and proposes modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
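The cited model's details are specific to Cell and are not reproduced here; the sketch below shows only the generic bound-taking that bandwidth/compute performance models of this kind share, with predicted time taken as the larger of the compute-limited and memory-limited estimates. All peak rates and per-point costs are invented placeholders.

    #include <stdio.h>

    /* Predicted kernel time as the maximum of a compute-bound and a
     * memory-bound estimate (a generic bound, not the paper's model). */
    double predicted_seconds(double flops, double bytes,
                             double peak_flops_per_s, double peak_bytes_per_s) {
        double compute_time = flops / peak_flops_per_s;
        double memory_time  = bytes / peak_bytes_per_s;
        return compute_time > memory_time ? compute_time : memory_time;
    }

    int main(void) {
        /* e.g. one 7-point stencil sweep over a 256^3 grid of doubles,
         * assuming ~8 flops and ~16 bytes of memory traffic per point */
        double points = 256.0 * 256.0 * 256.0;
        printf("%.4f s\n",
               predicted_seconds(8.0 * points, 16.0 * points, 200e9, 25e9));
        return 0;
    }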
Journal ArticleDOI

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

TL;DR: Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations; the work represents one of the most extensive analyses of stencil optimizations and performance modeling to date.
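For reference, the "traditional cache blocking" whose efficacy the study questions looks roughly like the single-sweep sketch below: only the spatial loops are blocked, so each grid value is reused at most a handful of times per sweep, unlike the time-skewed sweep shown earlier. Grid and block sizes are illustrative assumptions.

    /* Illustrative sizes (assumptions). */
    enum { NX = 2048, NY = 2048, IB = 256, JB = 256 };

    /* One spatially blocked Jacobi sweep over a 2-D grid: the block is sized
     * so the rows reused by successive i iterations stay resident in cache. */
    void jacobi2d_blocked(const double (*in)[NY], double (*out)[NY]) {
        for (int ii = 1; ii < NX - 1; ii += IB)
            for (int jj = 1; jj < NY - 1; jj += JB)
                for (int i = ii; i < ii + IB && i < NX - 1; i++)
                    for (int j = jj; j < jj + JB && j < NY - 1; j++)
                        out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                                          + in[i][j-1] + in[i][j+1]);
    }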
Proceedings ArticleDOI

PolyMage: Automatic Optimization for Image Processing Pipelines

TL;DR: This is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization automatically, achieving performance up to 1.81x better than manual tuning in Halide, a state-of-the-art language and compiler for image processing pipelines.
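A minimal sketch of the producer-consumer fusion such pipeline compilers perform automatically, shown here for a two-stage 1-D blur: the fused version recomputes the producer values each output point needs instead of materializing the full intermediate. The stage definitions and sizes are illustrative assumptions, not PolyMage code.

    /* Illustrative width (assumption). */
    enum { W = 4096 };

    /* Unfused pipeline: the full intermediate b[] is written to memory and
     * re-read between the two stages. */
    void blur2_unfused(const double *a, double *b, double *c) {
        for (int i = 1; i < W - 1; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        for (int i = 2; i < W - 2; i++)
            c[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0;
    }

    /* Fused (with recompute): each output point evaluates the three producer
     * values it needs, trading a few extra flops for the eliminated
     * intermediate memory traffic. */
    void blur2_fused(const double *a, double *c) {
        for (int i = 2; i < W - 2; i++) {
            double bm = (a[i - 2] + a[i - 1] + a[i]) / 3.0;
            double b0 = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
            double bp = (a[i] + a[i + 1] + a[i + 2]) / 3.0;
            c[i] = (bm + b0 + bp) / 3.0;
        }
    }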
Proceedings ArticleDOI

Tiling stencil computations to maximize parallelism

TL;DR: This work provides necessary and sufficient conditions on tiling hyperplanes to enable concurrent start for programs with affine data accesses, along with an approach to find such hyperplanes.
References
Proceedings ArticleDOI

A data locality optimizing algorithm

TL;DR: An algorithm is proposed that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing, and tiling; it successfully optimizes codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.
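Applied by hand to matrix multiplication, the interchange-plus-tiling the cited algorithm performs yields the familiar blocked loop nest below. The matrix and block sizes are illustrative assumptions, and C is assumed to be zero-initialized by the caller.

    /* Illustrative sizes (assumptions); BS divides N here. */
    enum { N = 512, BS = 64 };

    /* Blocked matrix multiply: each BS x BS tile of A, B, and C is reused
     * from cache while it is live.  C must be zero-initialized by the caller. */
    void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }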
Book ChapterDOI

(σ, ρ)-calculus

TL;DR: In this article, a discrete-time system with time indexed by t = 0, 1, 2, … is considered.
Proceedings ArticleDOI

Design and evaluation of a compiler algorithm for prefetching

TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs, with some improving by as much as a factor of two.
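A hand-written sketch of the kind of prefetch insertion the cited algorithm performs automatically: data is fetched a fixed number of iterations ahead of its use. The distance and the GCC/Clang __builtin_prefetch intrinsic are assumptions of this sketch; the compiler algorithm derives the distance from the miss latency and the loop body cost.

    #include <stddef.h>

    /* Prefetch distance in elements: an illustrative guess, not a computed value. */
    #define PF_DIST 16

    double sum_with_prefetch(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)   /* stay within the array */
                __builtin_prefetch(&x[i + PF_DIST], 0 /* read */, 1 /* low reuse */);
            s += x[i];
        }
        return s;
    }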
Journal ArticleDOI

Improving data locality with loop transformations

TL;DR: This article presents compiler optimizations to improve data locality based on a simple yet accurate cost model; although performance improvements were difficult to achieve, several programs improved.
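One of the simplest transformations such cost models select is loop interchange for stride: the sketch below contrasts a column-wise traversal of a row-major array with the interchanged, unit-stride version. The array size is an illustrative assumption.

    /* Illustrative size (assumption). */
    enum { M = 1024 };

    /* Poor spatial locality: the inner loop walks down a column of a
     * row-major array, touching a new cache line on every iteration. */
    void scale_columnwise(double (*a)[M], double s) {
        for (int j = 0; j < M; j++)
            for (int i = 0; i < M; i++)
                a[i][j] *= s;
    }

    /* After interchange: the inner loop is unit-stride, so each cache line
     * is fully used before it is evicted. */
    void scale_rowwise(double (*a)[M], double s) {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
                a[i][j] *= s;
    }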
Journal Article

Publish or perish.
