Proceedings ArticleDOI

The cache performance and optimizations of blocked algorithms

Monica D. Lam, +2 more
- Vol. 19, Iss: 2, pp 63-74
TLDR
It is shown that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes.
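The sensitivity to stride can be illustrated with a toy model. The sketch below is not the paper's model: it assumes a direct-mapped cache with one-word lines and a column-major matrix, and the names `self_interference`, `n`, `block`, and `cache_words` are hypothetical. It counts how many elements of a `block` x `block` submatrix map to a cache slot already claimed by another element of the same submatrix.

```python
def self_interference(n, block, cache_words):
    """Count elements of a block x block submatrix that collide in the cache.

    Assumptions (illustrative only): direct-mapped cache of `cache_words`
    one-word lines; matrix stored column-major with column length `n`, so
    element (i, j) lives at word address i + j * n.
    """
    used_slots = set()
    for j in range(block):
        for i in range(block):
            # Direct-mapped placement: address modulo cache size.
            used_slots.add((i + j * n) % cache_words)
    # Every element beyond the number of distinct slots evicts another one.
    return block * block - len(used_slots)


if __name__ == "__main__":
    for n in (500, 512):
        print(n, self_interference(n, block=16, cache_words=1024))
```

Under these assumptions, a 16 x 16 block in a 1024-word cache has no collisions when the column length is 500, but 224 of its 256 elements collide when the column length is 512, because every second column then maps to the same slots.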
Abstract
Blocking is a well-known optimization technique for improving the effectiveness of memory hierarchies. Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks, so that data loaded into the faster levels of the memory hierarchy are reused. This paper presents cache performance data for blocked programs and evaluates several optimizations to improve this performance. The data is obtained by a theoretical model of data conflicts in the cache, which has been validated by large amounts of simulation. We show that the degree of cache interference is highly sensitive to the stride of data accesses and the size of the blocks, and can cause wide variations in machine performance for different matrix sizes. The conventional wisdom of trying to use the entire cache, or even a fixed fraction of the cache, is incorrect. If a fixed block size is used for a given cache size, the block size that minimizes the expected number of cache misses is very small. Tailoring the block size according to the matrix size and cache parameters can improve the average performance and reduce the variance in performance for different matrix sizes. Finally, whenever possible, it is beneficial to copy non-contiguous reused data into consecutive locations.
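As a minimal sketch of the two ideas in the abstract (operating on blocks, and copying reused non-contiguous data into consecutive locations), the following blocked matrix multiplication may help. It is an illustration under stated assumptions, not the paper's code: matrices are flat row-major lists, and the names `blocked_matmul`, `block`, and `tile` are hypothetical. The block size is left as a parameter because the abstract argues it should be tailored to the matrix size and cache parameters; no selection formula is assumed here.

```python
def blocked_matmul(a, b, n, block):
    """Return C = A * B for n x n matrices stored as flat row-major lists."""
    c = [0.0] * (n * n)
    for kk in range(0, n, block):
        k_end = min(kk + block, n)
        for jj in range(0, n, block):
            j_end = min(jj + block, n)
            width = j_end - jj
            # Copy the reused block of B into consecutive locations so that
            # its rows are no longer separated by a stride of n.
            tile = [b[k * n + j]
                    for k in range(kk, k_end)
                    for j in range(jj, j_end)]
            # The copied block is reused for every row i of A.
            for i in range(n):
                for k in range(kk, k_end):
                    r = a[i * n + k]
                    base = (k - kk) * width
                    for j in range(jj, j_end):
                        c[i * n + j] += r * tile[base + (j - jj)]
    return c
```

Calling `blocked_matmul(a, b, n, block)` with different `block` values gives the same product; only the order of memory accesses changes, which is what determines cache behavior on real hardware.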


Citations
Proceedings ArticleDOI

A data locality optimizing algorithm

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.
Proceedings ArticleDOI

Optimization principles and application performance evaluation of a multithreaded GPU using CUDA

TL;DR: This work discusses the GeForce 8800 GTX processor's organization, features, and generalized optimization strategies, and achieves increased performance by reordering accesses to off-chip memory to combine requests to the same or contiguous memory locations and by applying classical optimizations to reduce the number of executed operations.
Book

Computer Architecture, Fifth Edition: A Quantitative Approach

TL;DR: The Fifth Edition of Computer Architecture focuses on the dramatic shift to mobile clients and cloud computing, exploring the ways in which software and technology in the "cloud" are accessed by cell phones, tablets, laptops, and other mobile computing devices.
Journal ArticleDOI

Compiler transformations for high-performance computing

TL;DR: This survey is a comprehensive overview of the important high-level program restructuring techniques for imperative languages, such as C and Fortran, and describes the purpose of each transformation, how to determine if it is legal, and an example of its application.
Proceedings ArticleDOI

Design and evaluation of a compiler algorithm for prefetching

TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs; some of the programs improve by as much as a factor of two.
References
Book

Matrix computations

Gene H. Golub
Journal ArticleDOI

A set of level 3 basic linear algebra subprograms

TL;DR: This paper describes an extension to the set of Basic Linear Algebra Subprograms targeted at matrix-matrix operations that should provide for efficient and portable implementations of algorithms for high-performance computers.
Proceedings ArticleDOI

A data locality optimizing algorithm

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.
Proceedings ArticleDOI

I/O complexity: The red-blue pebble game

TL;DR: Using the red-blue pebble game formulation, a number of lower bound results for the I/O requirement are proven and may provide insight into the difficult task of balancing I/O and computation in special-purpose system designs.