Proceedings ArticleDOI
Prefetching using Markov predictors
Doug Joseph, Dirk Grunwald
Vol. 25, Iss. 2, pp. 252-263
TLDR
The Markov prefetcher acts as an interface between the on-chip and off-chip cache and can be added to existing computer designs; it reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while using only two thirds the memory of a demand-fetch cache organization.
Abstract
Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor. This design results in a prefetching system that provides good coverage, is accurate, and produces timely results that can be used effectively by the processor. In our cycle-level simulations, the Markov prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while using only two thirds the memory of a demand-fetch cache organization.
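As a rough illustration of the idea in the abstract, the sketch below models a first-order Markov prediction table in Python: each miss address maps to the miss addresses most recently observed to follow it, and each miss returns a prioritized list of prefetch candidates. The class name, table layout, and replacement policy are illustrative assumptions, not the paper's exact hardware organization.

```python
from collections import defaultdict, deque

class MarkovPrefetcher:
    """Minimal first-order Markov prefetcher sketch (illustrative only)."""

    def __init__(self, successors_per_entry=2):
        self.k = successors_per_entry
        self.table = defaultdict(deque)   # miss addr -> recent successor miss addrs
        self.last_miss = None

    def on_miss(self, addr):
        # Learn the observed transition: last_miss -> addr.
        if self.last_miss is not None:
            succ = self.table[self.last_miss]
            if addr in succ:
                succ.remove(addr)
            succ.appendleft(addr)          # most recent prediction gets priority
            while len(succ) > self.k:
                succ.pop()
        self.last_miss = addr
        # Return prioritized prefetch candidates for this miss.
        return list(self.table[addr])
```

Feeding a repeating miss stream (e.g. A, B, C, A, B, C, ...) trains the table so that subsequent misses to A predict B, misses to B predict C, and so on.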
Citations
Book
Memory Systems: Cache, DRAM, Disk
TL;DR: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be?
Proceedings ArticleDOI
Phase tracking and prediction
TL;DR: This paper presents a unified profiling architecture that can efficiently capture, classify, and predict phase-based program behavior on the largest of time scales, and can capture phases that account for over 80% of execution using less than 500 bytes of on-chip memory.
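To make the phase-classification idea concrete, here is a toy Python sketch (my own illustration, not the paper's hardware design): each execution interval is summarized as a normalized basic-block vector, and an interval joins an existing phase when its Manhattan distance to that phase's signature falls within a threshold, otherwise it starts a new phase.

```python
def classify_phase(signature, phases, threshold=0.5):
    """Assign an interval's normalized basic-block vector to a phase.

    `phases` is a mutable list of phase signatures. The threshold value
    and distance metric are illustrative assumptions.
    """
    for pid, ref in enumerate(phases):
        # Manhattan distance between the interval and the phase signature.
        dist = sum(abs(a - b) for a, b in zip(signature, ref))
        if dist <= threshold:
            return pid
    phases.append(signature)       # no close match: start a new phase
    return len(phases) - 1
```

Intervals with similar execution profiles map to the same phase ID, which is what lets a predictor anticipate upcoming behavior from a compact history.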
Proceedings ArticleDOI
Runahead execution: an alternative to very large instruction windows for out-of-order processors
TL;DR: This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window.
Proceedings ArticleDOI
The predictability of data values
TL;DR: Comparison of context-based prediction and stride prediction shows that the higher accuracy of context-based prediction is due to relatively few static instructions giving large improvements; this suggests the usefulness of hybrid predictors.
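The distinction the TL;DR draws can be sketched in a few lines of Python; both predictors below are simplified illustrations (the names and interfaces are mine): a stride predictor extrapolates last value plus last stride, while a finite-context predictor remembers which value followed each recent-history context.

```python
class StridePredictor:
    """Predicts next value as last value + last observed stride."""

    def __init__(self):
        self.last = None
        self.stride = 0

    def predict_and_update(self, value):
        pred = None if self.last is None else self.last + self.stride
        if self.last is not None:
            self.stride = value - self.last
        self.last = value
        return pred

class ContextPredictor:
    """Finite-context predictor: maps the last k values seen to the
    value that followed that context before."""

    def __init__(self, k=2):
        self.k = k
        self.history = ()
        self.table = {}

    def predict_and_update(self, value):
        pred = self.table.get(self.history)
        if len(self.history) == self.k:
            self.table[self.history] = value   # learn: context -> next value
        self.history = (self.history + (value,))[-self.k:]
        return pred
```

On an arithmetic sequence the stride predictor locks on after two values; on a repeating but non-strided pattern only the context predictor succeeds, which is the complementary behavior that motivates hybrids.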
Proceedings ArticleDOI
Managing Wire Delay in Large Chip-Multiprocessor Caches
Bradford M. Beckmann, David A. Wood
TL;DR: This paper develops L2 cache designs for CMPs that incorporate block migration, stride-based prefetching between L1 and L2 caches, and on-chip transmission lines, and presents a hybrid design combining all three techniques that improves performance by an additional 2% to 19% over prefetching alone.
References
Proceedings ArticleDOI
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
TL;DR: In this article, a hardware technique to improve cache performance is presented: a small fully-associative cache between a cache and its refill path holds prefetched data, so that prefetches are not placed directly in the cache.
Proceedings ArticleDOI
Design and evaluation of a compiler algorithm for prefetching
TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs; some of the programs improve by as much as a factor of two.
Proceedings ArticleDOI
Evaluating stream buffers as a secondary cache replacement
TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches, and that as the data-set size of the scientific workload increases, the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.
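A stream buffer of the kind evaluated here can be sketched as a small FIFO of sequentially prefetched line addresses; the version below is a minimal single-buffer illustration in Python (the depth and interface are assumptions, not the paper's configuration).

```python
from collections import deque

class StreamBuffer:
    """Single sequential stream buffer sketch: a miss allocates the
    buffer to prefetch the next `depth` consecutive lines; a hit on the
    buffer head consumes it and prefetches one more line."""

    def __init__(self, depth=4):
        self.depth = depth
        self.lines = deque()
        self.next_line = None

    def allocate(self, miss_line):
        # Start streaming at the line after the missing one.
        self.next_line = miss_line + 1
        self.lines = deque()
        self._fill()

    def _fill(self):
        while len(self.lines) < self.depth:
            self.lines.append(self.next_line)
            self.next_line += 1

    def access(self, line):
        # Only the head of the FIFO can hit, as in the classic design.
        if self.lines and self.lines[0] == line:
            self.lines.popleft()
            self._fill()
            return True     # hit in stream buffer
        return False        # miss: would go to the next memory level
```

Sequential accesses after a miss hit in the buffer one after another, while a jump to an unrelated line misses, which is why stream buffers work well on the strided scientific workloads the paper studies.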
Proceedings ArticleDOI
An architecture for software-controlled data prefetching
TL;DR: Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops at only slight cost in total memory traffic.
Proceedings ArticleDOI
A modified approach to data cache management
TL;DR: The bare minimum amount of local memory that programs require to run without delay is measured by using the Value Reuse Profile, which contains the dynamic value reuse information of a program's execution, and by assuming the existence of efficient memory systems.