Open Access
Tolerating latency through software-controlled data prefetching
TLDR
This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code that attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses, and investigates the architectural support necessary to make prefetching effective.
Abstract:
The large latency of memory accesses in modern computer systems is a key obstacle to achieving high processor utilization. Furthermore, the technology trends indicate that this gap between processor and memory speeds is likely to increase in the future. While increased latency affects all computer systems, the problem is magnified in large-scale shared-memory multiprocessors, where physical dimensions cause latency to be an inherent problem. To cope with the memory latency problem, the basic solution that nearly all computer systems rely on is their cache hierarchy. While caches are useful, they are not a panacea.
Software-controlled prefetching is a technique for tolerating memory latency by explicitly executing prefetch instructions to move data close to the processor before it is actually needed. This technique is attractive because it can hide both read and write latency within a single thread of execution while requiring relatively little hardware support. Software-controlled prefetching, however, presents two major challenges. First, some sophistication is required on the part of either the programmer, runtime system, or (preferably) the compiler to insert prefetches into the code. Second, care must be taken that the overheads of prefetching, which include additional instructions and increased memory queueing delays, do not outweigh the benefits.
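As a rough illustration of what "explicitly executing prefetch instructions" can look like in source code, the sketch below sums an array while requesting data a fixed number of iterations ahead. It is not taken from the dissertation: the function name `sum_with_prefetch` and the constant `PREFETCH_DISTANCE` are hypothetical, and `__builtin_prefetch` is a GCC/Clang builtin, so this assumes one of those compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical prefetch distance, in elements. A suitable value depends on
   the memory latency and on how much work each iteration performs. */
#define PREFETCH_DISTANCE 16

/* Sum an array while prefetching the element that will be needed
   PREFETCH_DISTANCE iterations from now. */
double sum_with_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* Second argument 0 = prefetch for a read;
               third argument 3 = high temporal locality. */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        }
        sum += a[i];
    }
    return sum;
}
```

This naive version issues a prefetch on every iteration, which shows the overhead concern raised above: most of those prefetches target a cache line that an earlier prefetch has already requested.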
This dissertation proposes and evaluates a new compiler algorithm for inserting prefetches into code. The proposed algorithm attempts to minimize overheads by only issuing prefetches for references that are predicted to suffer cache misses. The algorithm can prefetch both dense-matrix and sparse-matrix codes, thus covering a large fraction of scientific applications. It also works for both uniprocessor and large-scale shared-memory multiprocessor architectures. We have implemented our algorithm in the SUIF (Stanford University Intermediate Form) optimizing compiler. The results of our detailed architectural simulations demonstrate that the speed of some applications can be improved by as much as a factor of two, both on uniprocessor and multiprocessor systems. This dissertation also compares software-controlled prefetching with other latency-hiding techniques (e.g., locality optimizations, relaxed consistency models, and multithreading), and investigates the architectural support necessary to make prefetching effective.
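The idea of issuing prefetches only for references predicted to miss can be sketched by hand for a simple sequential scan: only the first access to each cache line is expected to miss, so one prefetch per line is enough. The code below is a hand-written illustration under stated assumptions, not the compiler algorithm implemented in SUIF. The names `sum_selective_prefetch`, `LINE_ELEMS`, and `AHEAD_LINES` are hypothetical, the 64-byte line size is assumed, and `__builtin_prefetch` requires GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed: 64-byte cache lines holding eight 8-byte doubles. */
#define LINE_ELEMS 8
/* Hypothetical prefetch distance, measured in cache lines. */
#define AHEAD_LINES 4

/* Sum an array, issuing roughly one prefetch per cache line instead of
   one per element. The count is exactly one per line only if the array
   starts on a cache-line boundary. */
double sum_selective_prefetch(const double *a, size_t n)
{
    assert(a != NULL || n == 0);
    double sum = 0.0;
    size_t i = 0;

    /* Main loop, processed in line-sized chunks of LINE_ELEMS elements. */
    for (; i + LINE_ELEMS <= n; i += LINE_ELEMS) {
        size_t ahead = i + (size_t)AHEAD_LINES * LINE_ELEMS;
        if (ahead < n) {
            /* One prefetch covers the whole line AHEAD_LINES chunks ahead. */
            __builtin_prefetch(&a[ahead], 0, 3);
        }
        for (size_t j = 0; j < LINE_ELEMS; j++) {
            sum += a[i + j];
        }
    }

    /* Remainder loop for the final partial chunk; no prefetches needed. */
    for (; i < n; i++) {
        sum += a[i];
    }
    return sum;
}
```

Compared with prefetching on every iteration, this version executes about one eighth as many prefetch instructions for the same scan, which is the kind of overhead reduction the abstract describes.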
Citations
Book
Parallel Computer Architecture: A Hardware/Software Approach
TL;DR: This book explains the forces behind this convergence of shared-memory, message-passing, data parallel, and data-driven computing architectures and provides comprehensive discussions of parallel programming for high performance and of workload-driven evaluation, based on understanding hardware-software interactions.
Book
Memory Systems: Cache, DRAM, Disk
TL;DR: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be?
Proceedings ArticleDOI
Compiler-based prefetching for recursive data structures
Chi-Keung Luk, Todd C. Mowry
TL;DR: It is demonstrated that compiler-inserted prefetching can significantly improve the execution speed of pointer-based codes, by as much as 45% for the applications the authors study and in some cases by as much as twofold.
Proceedings Article
Database Architecture Optimized for the New Bottleneck: Memory Access
TL;DR: In this article, the authors discuss how vertically fragmented data structures optimize cache performance on sequential data access, and introduce radix algorithms for partitioned hash-join, which are quantified using a detailed analytical model that incorporates memory access cost.
Journal ArticleDOI
Meta optimization: improving compiler heuristics with machine learning
TL;DR: By evolving a compiler's heuristic over several benchmarks, Meta Optimization can create effective, general-purpose heuristics. The paper demonstrates the efficacy of the techniques on three different optimizations: hyperblock formation, register allocation, and data prefetching.
References
Journal ArticleDOI
How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs
TL;DR: Many large sequential computers execute operations in a different order than is specified by the program, and a correct execution by each processor does not guarantee the correct execution of the entire program.
Journal ArticleDOI
The NAS Parallel Benchmarks
David H. Bailey, Eric Barszcz, John T. Barton, D. S. Browning, Russell Carter, Leonardo Dagum, Rod Fatoohi, Paul O. Frederickson, T. A. Lasinski, Robert Schreiber, Horst D. Simon, V. Venkatakrishnan, Sisira Weeratunga
TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers that mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.
Proceedings ArticleDOI
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
TL;DR: In this article, hardware techniques to improve the performance of caches are presented: a small fully-associative cache is placed between a cache and its refill path, and prefetched data is placed in a buffer rather than in the cache.
Proceedings ArticleDOI
A data locality optimizing algorithm
Michael Wolf, Monica S. Lam
TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.