The cache performance and optimizations of blocked algorithms

doi:10.1145/106972.106981

Proceedings Article

Monica D. Lam, +2 more

- Vol. 19, Iss: 2, pp 63-74

TLDR

It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.

Abstract:

Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained from a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.
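The two ideas in the abstract, blocking and the sensitivity of cache interference to matrix size, can be illustrated with a minimal Python sketch. It is not the paper's code or model. The function names, the loop order, and the simplified cache (direct-mapped, one word per line, matrix stored row-major starting at address 0) are all assumptions made for illustration.

```python
# Sketch only: names, loop order and the cache model are illustrative
# assumptions, not taken from the paper.

def blocked_matmul(a, b, n, block):
    """Multiply two n x n matrices (lists of lists) one block at a time."""
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, block):
        for jj in range(0, n, block):
            # The submatrix b[kk:kk+block][jj:jj+block] is reused for every
            # row i, which is the reuse that blocking is meant to capture.
            for i in range(n):
                for k in range(kk, min(kk + block, n)):
                    r = a[i][k]
                    for j in range(jj, min(jj + block, n)):
                        c[i][j] += r * b[k][j]
    return c


def self_interference(n, block, cache_words):
    """Count elements of a block x block submatrix that share a cache
    location with another element of the same block.

    Assumed model: direct-mapped cache of cache_words words, one word per
    line, row-major n x n matrix whose first element is at address 0.
    """
    locations = {
        (i * n + j) % cache_words
        for i in range(block)
        for j in range(block)
    }
    return block * block - len(locations)
```

Under these assumptions, a 4 x 4 block in a 1024-word cache loses 8 of its 16 elements to self-interference when the matrix is 512 wide (`self_interference(512, 4, 1024)` returns 8), but none when it is 500 wide (`self_interference(500, 4, 1024)` returns 0). This is a toy version of the matrix-size sensitivity the abstract describes.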

Citations

Proceedings Article

A data locality optimizing algorithm

Michael Wolf, +1 more

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.

Proceedings Article

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

Shane Ryoo, +5 more

TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.

Book

Computer Architecture, Fifth Edition: A Quantitative Approach

John L. Hennessy, +1 more

TL;DR: The Fifth Edition of Computer Architecture focuses on the dramatic shift in how software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.

Journal Article

Compiler transformations for high-performance computing

David F. Bacon, +2 more

- 01 Dec 1994 -

ACM Computing Surveys

TL;DR: This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages, such as C and Fortran, and describes the purpose of each transformation, how to determine if it is legal, and an example of its application.

Proceedings Article

Design and evaluation of a compiler algorithm for prefetching

Todd C. Mowry, +2 more

TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs; some of the programs improve by as much as a factor of two.

References

Book

Matrix computations

Gene H. Golub

Journal Article

A set of level 3 basic linear algebra subprograms

Jack Dongarra, +3 more

- 01 Mar 1990 -

ACM Transactions on Mathematical Softwar...

TL;DR: This paper describes an extension to the set of Basic Linear Algebra Subprograms targeted at matrix-matrix operations that should provide for efficient and portable implementations of algorithms for high-performance computers.

Proceedings Article

A data locality optimizing algorithm

Michael Wolf, +1 more

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.

Proceedings Article

I/O complexity: The red-blue pebble game

Hong Jia-Wei, +1 more

TL;DR: Using the red-blue pebble game formulation, a number of lower bound results for the I/O requirement are proven and may provide insight into the difficult task of balancing I/O and computation in special-purpose system designs.

Journal Article

Strategies for cache and local memory management by global program transformation

Dennis Gannon, +2 more

- 01 Oct 1988 -

Journal of Parallel and Distributed Comp...

Related Papers (5)

A data locality optimizing algorithm

Tile size selection using cache organization and data layout

Computer Architecture: A Quantitative Approach

Improving data locality with loop transformations

More iteration space tiling