Open Access Proceedings Article

Blocking Linear Algebra Codes for Memory Hierarchies

Steve Carr, +1 more
- pp 400-405
TLDR
This paper presents encouraging preliminary results from a project to determine how much program restructuring, aimed at reducing the effective latency of memory, is possible with automatic techniques.
Abstract
Because computation speed and memory size are both increasing, the latency of memory, in basic machine cycles, is also increasing. As a result, recent compiler research has focused on reducing the effective latency by restructuring programs to take more advantage of high-speed intermediate memory (or cache, as it is usually called). The problem is that many real-world programs are non-trivial to restructure, and current methods will often fail. In this paper, we present some encouraging preliminary results of a project to determine how much restructuring is possible with automatic techniques.
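As a rough illustration of the kind of restructuring the abstract refers to, the sketch below blocks (tiles) a matrix multiplication so that a small sub-block of `b` is reused across every row of `a` before the loops move on. It is not taken from the paper: the function names, the default `block` size, and the choice of matrix multiplication as the example are all assumptions made here for illustration. Python is used only to show the loop structure; the cache benefit itself would be observed in a compiled language.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: returns a * b for n-by-n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block=2):
    """Blocked variant (illustrative sketch, not the paper's algorithm).

    The j and k loops are strip-mined into chunks of at most `block`
    iterations, and the chunk loops are moved outermost. Each
    block-by-block piece of b is then reused for all n values of i.
    """
    c = [[0.0] * n for _ in range(n)]
    for jj in range(0, n, block):              # start of a column strip
        for kk in range(0, n, block):          # start of a row strip of b
            for i in range(n):
                for j in range(jj, min(jj + block, n)):
                    acc = c[i][j]              # partial sum from earlier kk strips
                    for k in range(kk, min(kk + block, n)):
                        acc += a[i][k] * b[k][j]
                    c[i][j] = acc
    return c


if __name__ == "__main__":
    n = 5  # deliberately not a multiple of the block size
    a = [[float(i * n + j) for j in range(n)] for i in range(n)]
    b = [[float((i + 2 * j) % 7) for j in range(n)] for i in range(n)]
    print(matmul_blocked(a, b, n, block=2) == matmul_naive(a, b, n))
```

The `min(...)` bounds handle matrix sizes that are not a multiple of `block`. Because each element of the result still accumulates its terms in increasing order of `k`, the blocked version performs the same additions in the same order as the naive one.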


Citations
Proceedings Article

Software prefetching

TL;DR: These simulations show that, even when generated by a very simple compiler algorithm, prefetch instructions can eliminate nearly all cache misses, while causing only modest increases in data traffic between memory and cache.
Journal Article

Improving register allocation for subscripted variables

TL;DR: This paper presents a source-to-source transformation, called scalar replacement, that finds opportunities for reuse of subscripted variables and replaces the references involved by references to temporary scalar variables to increase the likelihood that these elements will be assigned to registers by the coloring-based register allocators found in most compilers.
Journal Article

The Uniform Memory Hierarchy Model of Computation

TL;DR: In this paper, the authors introduced the Uniform Memory Hierarchy (UMH) model, which captures performance-relevant aspects of the hierarchical nature of computer memory and is used to quantify architectural requirements of several algorithms and to ratify the faster speeds achieved by tuned implementations that use improved data-movement strategies.

Automatic Blocking of Nested Loops

TL;DR: It is shown, in a very general setting, how to choose a nearly optimal set of transformed indices and, in one particular but rather frequently occurring situation, how to choose an optimal set of block sizes.
Proceedings Article

Effective partial redundancy elimination

TL;DR: This paper shows that a combination of global reassociation and global value numbering can increase the effectiveness of partial redundancy elimination by imposing a discipline on the choice of names and the shape of expressions.
References
Proceedings Article

Supernode partitioning

TL;DR: A class of partitionings is presented that encompasses previous techniques and provides enough flexibility to adapt code to multiprocessors with two levels of parallelism and two levels of memory.
Proceedings Article

More iteration space tiling

TL;DR: Subdividing the iteration space of a loop into blocks or tiles with a fixed maximum size has several advantages, and tiles become a natural candidate as the unit of work for parallel task scheduling.
Proceedings Article

Iteration Space Tiling for Memory Hierarchies

Michael Wolfe
Dissertation

Software methods for improvement of cache performance on supercomputer applications

TL;DR: Measurement of actual supercomputer cache performance has not previously been undertaken, and PFC-Sim, a program-driven event tracing facility that can simulate data cache performance of very long programs, is used to measure the performance of various cache structures.