Blocking Linear Algebra Codes for Memory Hierarchies

Open Access Proceedings Article

pp. 400-405

TLDR

This paper presents encouraging preliminary results from a project to determine how much program restructuring is possible with automatic techniques, with the goal of reducing effective memory latency by making better use of cache.

Abstract:

Because computation speed and memory size are both increasing, the latency of memory, in basic machine cycles, is also increasing. As a result, recent compiler research has focused on reducing the effective latency by restructuring programs to take more advantage of high-speed intermediate memory (or cache, as it is usually called). The problem is that many real-world programs are non-trivial to restructure, and current methods will often fail. In this paper, we present some encouraging preliminary results of a project to determine how much restructuring is possible with automatic techniques.
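To make the idea of restructuring for cache concrete, the following is a minimal sketch of loop blocking (tiling) applied to matrix multiplication. It is not taken from the paper, which concerns automatic compiler transformations; the function names and the block-size parameter `bs` are illustrative assumptions, and Python is used only for readability.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: C = A * B for n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, bs):
    """Blocked variant (illustrative): the k and j loops are split into
    strips of width bs, so each bs x bs block of B is reused across all
    rows i before the computation moves on to the next block."""
    c = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, bs):          # strip of the k dimension
        for jj in range(0, n, bs):      # strip of the j dimension
            for i in range(n):
                for k in range(kk, min(kk + bs, n)):
                    a_ik = a[i][k]      # held in a scalar for the inner loop
                    for j in range(jj, min(jj + bs, n)):
                        c[i][j] += a_ik * b[k][j]
    return c
```

Both functions compute the same product; the blocked version only changes the order in which iterations are visited. In a compiled language, a suitable `bs` would be chosen so that a block of `b` fits in cache, which is an assumption about the target machine rather than something this sketch determines.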

Citations

Proceedings Article

Software prefetching

David Callahan, +2 more

TL;DR: These simulations show that, even when generated by a very simple compiler algorithm, prefetch instructions can eliminate nearly all cache misses, while causing only modest increases in data traffic between memory and cache.

Journal Article

Improving register allocation for subscripted variables

David Callahan, +2 more

TL;DR: This paper presents a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers.

Journal Article

The Uniform Memory Hierarchy Model of Computation

Bowen Alpern, +3 more

Algorithmica, 01 Jan 1993

TL;DR: In this paper, the authors introduced the Uniform Memory Hierarchy (UMH) model, which captures performance-relevant aspects of the hierarchical nature of computer memory and is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies.

Automatic Blocking of Nested Loops

Jack Dongarra, +1 more

TL;DR: It is shown, in a very general setting, how to choose a nearly optimal set of transformed indices and, in one particular but rather frequently occurring situation, how to choose an optimal set of block sizes.

Proceedings Article

Effective partial redundancy elimination

Preston Briggs, +1 more

TL;DR: This paper shows that a combination of global reassociation and global value numbering can increase the effectiveness of partial redundancy elimination by imposing a discipline on the choice of names and the shape of expressions.

References

Proceedings Article

Supernode partitioning

François Irigoin, +1 more

TL;DR: A class of partitionings is presented that encompasses previous techniques and provides enough flexibility to adapt code to multiprocessors with two levels of parallelism and two levels of memory.

Book

The Structure of Computers and Computations

David L. Kuck

Proceedings Article

More iteration space tiling

M. Wolfe

TL;DR: Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has several advantages, and tiles become a natural candidate as the unit of work for parallel task scheduling.

Proceedings Article

Iteration Space Tiling for Memory Hierarchies

Michael Wolfe

Dissertation

Software methods for improvement of cache performance on supercomputer applications

Allan Kennedy Porterfield, +1 more

TL;DR: Measurements of actual supercomputer cache performance have not previously been undertaken; PFC-Sim, a program-driven event tracing facility that can simulate data cache performance of very long programs, is used to measure the performance of various cache structures.

Related Papers (5)

The cache performance and optimizations of blocked algorithms

A data locality optimizing algorithm

Software prefetching

Optimizing supercompilers for supercomputers

A set of level 3 basic linear algebra subprograms