Proceedings ArticleDOI

Using time skewing to eliminate idle time due to memory bandwidth and network limitations

David Wonnacott
pp. 171-180
TLDR
A generalization of time skewing for multiprocessor architectures is given, together with techniques for using multilevel caches that reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.
Abstract
Time skewing is a compile-time optimization that can provide arbitrarily high cache hit rates for a class of iterative calculations, given a sufficient number of time steps and sufficient cache memory. Thus, it can eliminate processor idle time caused by inadequate main memory bandwidth. In this article, we give a generalization of time skewing for multiprocessor architectures, and discuss time skewing for multilevel caches. Our generalization for multiprocessors lets us eliminate processor idle time caused by any combination of inadequate main memory bandwidth, limited network bandwidth, and high network latency, given a sufficiently large problem and sufficient cache. As in the uniprocessor case, the cache requirement grows with the machine balance rather than the problem size. Our techniques for using multilevel caches reduce the L1 cache requirement, which would otherwise be unacceptably high for some architectures when using arrays of high dimension.
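To make the transformation concrete, the following is a minimal uniprocessor sketch of the skewed-and-tiled loop structure that time skewing produces for a 1-D three-point Jacobi stencil. The grid size, tile sizes, two-buffer storage scheme, and fixed-boundary handling are illustrative assumptions of this sketch, not the paper's code; the space loop is skewed by the time step (j = i + t) so that rectangular tiles of the (t, j) space are legal and each tile's working set stays in cache.

    /* Illustrative sizes (assumptions): grid points, time steps, tile extents. */
    enum { N = 1024, T = 512, TB = 64, SB = 256 };

    /* A[t % 2] holds the grid at time t; boundary cells A[*][0] and A[*][N-1]
     * are assumed to be initialized in both buffers and held fixed. */
    static double A[2][N];

    /* Skewed-and-tiled 1-D Jacobi sweep.  After the skew j = i + t, every
     * dependence has non-negative components in (t, j), so rectangular tiles
     * executed in lexicographic (tt, jj) order are legal, and each tile runs
     * up to TB time steps over a block of data that fits in cache. */
    void jacobi_time_skewed(void) {
        for (int tt = 0; tt < T; tt += TB) {
            for (int jj = tt + 1; jj < tt + TB + N - 2; jj += SB) {
                for (int t = tt; t < tt + TB && t < T; t++) {
                    /* un-skew: i = j - t; clip j to this tile and the grid interior */
                    int jlo = jj > t + 1 ? jj : t + 1;
                    int jhi = (jj + SB < t + N - 1) ? jj + SB : t + N - 1;
                    for (int j = jlo; j < jhi; j++) {
                        int i = j - t;
                        A[(t + 1) % 2][i] =
                            (A[t % 2][i - 1] + A[t % 2][i] + A[t % 2][i + 1]) / 3.0;
                    }
                }
            }
        }
    }

With SB chosen so that roughly 2*SB doubles fit in cache, each value brought from main memory is reused for about TB time steps, which is the source of the arbitrarily high hit rates the abstract describes.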


Citations

The Potential of the Cell Processor for Scientific Computing

TL;DR: In this article, the authors examined the potential of using the STI Cell processor as a building block for future high-end computing systems and proposed modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
Proceedings ArticleDOI

The potential of the cell processor for scientific computing

TL;DR: This work introduces a performance model for Cell and applies it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs, and proposes modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations.
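The cited model's details are specific to Cell and are not reproduced here; the sketch below shows only the generic bound-taking that bandwidth/compute performance models of this kind share, with predicted time taken as the larger of the compute-limited and memory-limited estimates. All peak rates and per-point costs are invented placeholders.

    #include <stdio.h>

    /* Predicted kernel time as the maximum of a compute-bound and a
     * memory-bound estimate (a generic bound, not the paper's model). */
    double predicted_seconds(double flops, double bytes,
                             double peak_flops_per_s, double peak_bytes_per_s) {
        double compute_time = flops / peak_flops_per_s;
        double memory_time  = bytes / peak_bytes_per_s;
        return compute_time > memory_time ? compute_time : memory_time;
    }

    int main(void) {
        /* e.g. one 7-point stencil sweep over a 256^3 grid of doubles,
         * assuming ~8 flops and ~16 bytes of memory traffic per point */
        double points = 256.0 * 256.0 * 256.0;
        printf("%.4f s\n",
               predicted_seconds(8.0 * points, 16.0 * points, 200e9, 25e9));
        return 0;
    }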
Journal ArticleDOI

Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors

TL;DR: Results demonstrate that recent trends in memory system organization have reduced the efficacy of traditional cache-blocking optimizations; the work represents one of the most extensive analyses of stencil optimizations and performance modeling to date.
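For reference, the "traditional cache blocking" whose efficacy the study questions looks roughly like the single-sweep sketch below: only the spatial loops are blocked, so each grid value is reused at most a handful of times per sweep, unlike the time-skewed sweep shown earlier. Grid and block sizes are illustrative assumptions.

    /* Illustrative sizes (assumptions). */
    enum { NX = 2048, NY = 2048, IB = 256, JB = 256 };

    /* One spatially blocked Jacobi sweep over a 2-D grid: the block is sized
     * so the rows reused by successive i iterations stay resident in cache. */
    void jacobi2d_blocked(const double (*in)[NY], double (*out)[NY]) {
        for (int ii = 1; ii < NX - 1; ii += IB)
            for (int jj = 1; jj < NY - 1; jj += JB)
                for (int i = ii; i < ii + IB && i < NX - 1; i++)
                    for (int j = jj; j < jj + JB && j < NY - 1; j++)
                        out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j]
                                          + in[i][j-1] + in[i][j+1]);
    }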
Proceedings ArticleDOI

PolyMage: Automatic Optimization for Image Processing Pipelines

TL;DR: This is the first model-driven compiler for image processing pipelines that performs complex fusion, tiling, and storage optimization automatically, achieving performance up to 1.81x better than manual tuning in Halide, a state-of-the-art language and compiler for image processing pipelines.
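A minimal sketch of the producer-consumer fusion such pipeline compilers perform automatically, shown here for a two-stage 1-D blur: the fused version recomputes the producer values each output point needs instead of materializing the full intermediate. The stage definitions and sizes are illustrative assumptions, not PolyMage code.

    /* Illustrative width (assumption). */
    enum { W = 4096 };

    /* Unfused pipeline: the full intermediate b[] is written to memory and
     * re-read between the two stages. */
    void blur2_unfused(const double *a, double *b, double *c) {
        for (int i = 1; i < W - 1; i++)
            b[i] = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
        for (int i = 2; i < W - 2; i++)
            c[i] = (b[i - 1] + b[i] + b[i + 1]) / 3.0;
    }

    /* Fused (with recompute): each output point evaluates the three producer
     * values it needs, trading a few extra flops for the eliminated
     * intermediate memory traffic. */
    void blur2_fused(const double *a, double *c) {
        for (int i = 2; i < W - 2; i++) {
            double bm = (a[i - 2] + a[i - 1] + a[i]) / 3.0;
            double b0 = (a[i - 1] + a[i] + a[i + 1]) / 3.0;
            double bp = (a[i] + a[i + 1] + a[i + 2]) / 3.0;
            c[i] = (bm + b0 + bp) / 3.0;
        }
    }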
Proceedings ArticleDOI

Tiling stencil computations to maximize parallelism

TL;DR: This work provides necessary and sufficient conditions on tiling hyperplanes to enable concurrent start for programs with affine data accesses, along with an approach to find such hyperplanes.
References
Proceedings ArticleDOI

A data locality optimizing algorithm

TL;DR: An algorithm is proposed that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing, and tiling; it successfully optimizes codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.
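Applied by hand to matrix multiplication, the interchange-plus-tiling the cited algorithm performs yields the familiar blocked loop nest below. The matrix and block sizes are illustrative assumptions, and C is assumed to be zero-initialized by the caller.

    /* Illustrative sizes (assumptions); BS divides N here. */
    enum { N = 512, BS = 64 };

    /* Blocked matrix multiply: each BS x BS tile of A, B, and C is reused
     * from cache while it is live.  C must be zero-initialized by the caller. */
    void matmul_tiled(const double A[N][N], const double B[N][N], double C[N][N]) {
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++)
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += A[i][k] * B[k][j];
    }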
Book ChapterDOI

(σ, ρ)-calculus

TL;DR: In this article, a discrete-time system with time indexed by t = 0, 1, 2, … is considered.
Proceedings ArticleDOI

Design and evaluation of a compiler algorithm for prefetching

TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs, with some improving by as much as a factor of two.
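A hand-written sketch of the kind of prefetch insertion the cited algorithm performs automatically: data is fetched a fixed number of iterations ahead of its use. The distance and the GCC/Clang __builtin_prefetch intrinsic are assumptions of this sketch; the compiler algorithm derives the distance from the miss latency and the loop body cost.

    #include <stddef.h>

    /* Prefetch distance in elements: an illustrative guess, not a computed value. */
    #define PF_DIST 16

    double sum_with_prefetch(const double *x, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + PF_DIST < n)   /* stay within the array */
                __builtin_prefetch(&x[i + PF_DIST], 0 /* read */, 1 /* low reuse */);
            s += x[i];
        }
        return s;
    }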
Journal ArticleDOI

Improving data locality with loop transformations

TL;DR: This article presents compiler optimizations to improve data locality based on a simple yet accurate cost model; although performance improvements were difficult to achieve, several programs improved.
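One of the simplest transformations such cost models select is loop interchange for stride: the sketch below contrasts a column-wise traversal of a row-major array with the interchanged, unit-stride version. The array size is an illustrative assumption.

    /* Illustrative size (assumption). */
    enum { M = 1024 };

    /* Poor spatial locality: the inner loop walks down a column of a
     * row-major array, touching a new cache line on every iteration. */
    void scale_columnwise(double (*a)[M], double s) {
        for (int j = 0; j < M; j++)
            for (int i = 0; i < M; i++)
                a[i][j] *= s;
    }

    /* After interchange: the inner loop is unit-stride, so each cache line
     * is fully used before it is evicted. */
    void scale_rowwise(double (*a)[M], double s) {
        for (int i = 0; i < M; i++)
            for (int j = 0; j < M; j++)
                a[i][j] *= s;
    }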
Journal Article

Publish or perish.
