Proceedings ArticleDOI

Prefetching using Markov predictors

Doug Joseph, +1 more
- Vol. 25, Iss: 2, pp 252-263
TLDR
The Markov prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing computer designs and reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while only using two thirds the memory of a demand-fetch cache organization.
Abstract
Prefetching is one approach to reducing the latency of memory operations in modern computer systems. In this paper, we describe the Markov prefetcher. This prefetcher acts as an interface between the on-chip and off-chip cache, and can be added to existing computer designs. The Markov prefetcher is distinguished by prefetching multiple reference predictions from the memory subsystem, and then prioritizing the delivery of those references to the processor. This design results in a prefetching system that provides good coverage, is accurate, and produces timely results that can be effectively used by the processor. In our cycle-level simulations, the Markov prefetcher reduces the overall execution stalls due to instruction and data memory operations by an average of 54% for various commercial benchmarks while only using two thirds the memory of a demand-fetch cache organization.
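To make the mechanism concrete, the following is a minimal software sketch of a first-order Markov prefetcher, assuming a table indexed by miss address whose entries rank previously observed successor misses by frequency; the table organization, fanout, and replacement below are illustrative assumptions, not the paper's exact hardware design.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>
#include <algorithm>

// Toy model of a Markov prefetcher: a first-order Markov table keyed by
// miss address.  Each entry counts how often each successor miss followed
// it; on a new miss we emit the most frequent successors as prefetch
// candidates, highest priority first.  Sizes are illustrative.
class MarkovPrefetcher {
public:
    explicit MarkovPrefetcher(size_t predictions_per_entry = 4)
        : fanout_(predictions_per_entry) {}

    // Record a demand miss and return prioritized prefetch candidates.
    std::vector<uint64_t> onMiss(uint64_t miss_addr) {
        // Update the transition count from the previous miss to this one.
        if (have_prev_) {
            table_[prev_miss_][miss_addr]++;
        }
        prev_miss_ = miss_addr;
        have_prev_ = true;

        // Look up predicted successors of the current miss.
        std::vector<uint64_t> candidates;
        auto it = table_.find(miss_addr);
        if (it == table_.end()) return candidates;

        // Sort successors by observed frequency (descending) and keep the
        // top few; delivery order encodes prefetch priority.
        std::vector<std::pair<uint64_t, uint32_t>> succ(it->second.begin(),
                                                        it->second.end());
        std::sort(succ.begin(), succ.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
        for (size_t i = 0; i < succ.size() && i < fanout_; ++i)
            candidates.push_back(succ[i].first);
        return candidates;
    }

private:
    size_t fanout_;
    bool have_prev_ = false;
    uint64_t prev_miss_ = 0;
    // miss address -> (successor miss address -> count)
    std::unordered_map<uint64_t, std::unordered_map<uint64_t, uint32_t>> table_;
};
```

A hardware realization would bound the table and use approximate replacement rather than unbounded hash maps; the sketch only illustrates the address-to-successor transition table and the prioritized delivery of multiple predictions.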


Citations
Book

Memory Systems: Cache, DRAM, Disk

TL;DR: Is your memory hierarchy stopping your microprocessor from performing at the high level it should be?
Proceedings ArticleDOI

Phase tracking and prediction

TL;DR: This paper presents a unified profiling architecture that can efficiently capture, classify, and predict phase-based program behavior on the largest of time scales, and can capture phases that account for over 80% of execution using less than 500 bytes of on-chip memory.
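As a rough illustration of interval-based phase classification (a sketch under assumed parameters, not the paper's profiling hardware), the following accumulates a small signature vector of hashed branch counts over each interval and matches it against stored phase signatures using a distance threshold; the bucket count and threshold are assumptions.

```cpp
#include <array>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hedged sketch of interval-based phase classification: hash branch
// addresses into a small signature vector over each interval, then match
// the finished signature against previously seen phase signatures.
constexpr size_t kBuckets = 32;
using Signature = std::array<uint32_t, kBuckets>;

struct PhaseTracker {
    Signature current{};               // signature being accumulated
    std::vector<Signature> phases;     // one representative per phase

    void recordBranch(uint64_t pc) {
        current[(pc >> 2) % kBuckets]++;   // cheap hash of the branch PC
    }

    // Called at the end of each interval; returns the phase ID.
    int endInterval(uint64_t threshold = 1000) {
        int best = -1;
        uint64_t best_dist = ~0ull;
        for (size_t i = 0; i < phases.size(); ++i) {
            uint64_t dist = 0;   // Manhattan distance between signatures
            for (size_t b = 0; b < kBuckets; ++b)
                dist += std::abs((int64_t)current[b] - (int64_t)phases[i][b]);
            if (dist < best_dist) { best_dist = dist; best = (int)i; }
        }
        if (best < 0 || best_dist > threshold) {   // no close match: new phase
            phases.push_back(current);
            best = (int)phases.size() - 1;
        }
        current.fill(0);
        return best;
    }
};
```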
Proceedings ArticleDOI

Runahead execution: an alternative to very large instruction windows for out-of-order processors

TL;DR: This paper proposes runahead execution as an effective way to increase memory latency tolerance in an out-of-order processor without requiring an unreasonably large instruction window.
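The following is a heavily simplified, trace-level sketch of the runahead idea, assuming a toy cache model: on a blocking miss, the "processor" keeps scanning ahead to turn future misses into prefetches, then resumes from the checkpoint once the blocking miss completes. The runahead depth and cache abstraction are illustrative assumptions, not the paper's microarchitecture.

```cpp
#include <cstdint>
#include <unordered_set>
#include <vector>

// Toy illustration of runahead execution over a simple address trace.
// A conventional out-of-order core stalls once its instruction window
// fills behind a long-latency miss; runahead instead checkpoints, keeps
// "executing" past the miss purely to discover future misses and turn
// them into prefetches, then rolls back and re-executes normally.
struct ToyCache {
    std::unordered_set<uint64_t> lines;
    bool hit(uint64_t line) const { return lines.count(line) != 0; }
    void fill(uint64_t line) { lines.insert(line); }
};

void runTrace(const std::vector<uint64_t>& trace, ToyCache& cache,
              size_t runahead_depth = 64) {
    for (size_t i = 0; i < trace.size(); ++i) {
        uint64_t line = trace[i] >> 6;        // 64-byte cache lines
        if (cache.hit(line)) continue;        // normal hit, no stall

        // Blocking miss: enter runahead mode from a checkpoint at i.
        // Scan ahead, issuing prefetches for lines that would also miss;
        // any results computed in this mode would be discarded.
        for (size_t j = i + 1; j < trace.size() && j <= i + runahead_depth; ++j) {
            uint64_t future = trace[j] >> 6;
            if (!cache.hit(future)) cache.fill(future);   // prefetch issued
        }

        cache.fill(line);   // blocking miss returns; resume at i + 1
    }
}
```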
Proceedings ArticleDOI

The predictability of data values

TL;DR: Comparison of context-based prediction and stride prediction shows that the higher accuracy of context-based prediction is due to relatively few static instructions giving large improvements; this suggests the usefulness of hybrid predictors.
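To illustrate the two predictor classes being compared (a hedged sketch, not the paper's exact configurations), a stride predictor extrapolates the last value plus a learned per-instruction delta, while a context-based predictor recalls which value followed a hashed history of recent values:

```cpp
#include <cstdint>
#include <unordered_map>

// Simplified per-static-instruction value predictors, keyed by PC.
// Table organization and hashing are illustrative assumptions.
struct StridePredictor {
    struct Entry { int64_t last = 0, stride = 0; bool valid = false; };
    std::unordered_map<uint64_t, Entry> table;

    int64_t predict(uint64_t pc) {
        const Entry& e = table[pc];
        return e.last + e.stride;            // extrapolate last value + delta
    }
    void update(uint64_t pc, int64_t value) {
        Entry& e = table[pc];
        if (e.valid) e.stride = value - e.last;
        e.last = value;
        e.valid = true;
    }
};

struct ContextPredictor {
    std::unordered_map<uint64_t, uint64_t> history;     // pc -> hashed value context
    std::unordered_map<uint64_t, int64_t> next_value;   // context -> value seen next

    int64_t predict(uint64_t pc) { return next_value[history[pc]]; }
    void update(uint64_t pc, int64_t value) {
        uint64_t& ctx = history[pc];
        next_value[ctx] = value;                             // learn "ctx -> value"
        ctx = (ctx * 1099511628211ull) ^ (uint64_t)value;    // fold value into context
    }
};
```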
Proceedings ArticleDOI

Managing Wire Delay in Large Chip-Multiprocessor Caches

TL;DR: This paper develops L2 cache designs for CMPs that incorporate block migration, stride-based prefetching between L1 and L2 caches, and on-chip transmission lines, and presents a hybrid design, combining all three techniques, that improves performance by an additional 2% to 19% over prefetching alone.
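As a hedged sketch of one of these techniques, block migration in a banked shared L2 can be modeled as moving a block one bank closer to the requesting core on each access, so hot blocks gravitate toward their users and see shorter wire delay; the linear bank arrangement and one-step migration policy below are illustrative assumptions.

```cpp
#include <cstdint>
#include <unordered_map>

// Hedged sketch of block migration in a banked (NUCA-style) shared L2.
// Banks are arranged in a line; on each access a block moves one bank
// toward the requester's home bank.  Geometry and policy are illustrative.
struct MigratingL2 {
    std::unordered_map<uint64_t, int> bank_of;   // cache line -> current bank index

    // Returns the bank that services the request (or -1 on a miss),
    // then migrates the block one step toward the requester's home bank.
    int access(uint64_t line, int requester_bank) {
        auto it = bank_of.find(line);
        if (it == bank_of.end()) {
            bank_of[line] = requester_bank;      // fill near the requester
            return -1;
        }
        int serviced_from = it->second;
        if (serviced_from < requester_bank)      ++it->second;
        else if (serviced_from > requester_bank) --it->second;
        return serviced_from;
    }
};
```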
References
Proceedings ArticleDOI

Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers

TL;DR: In this article, a hardware technique to improve the performance of caches is presented: a small fully-associative cache is placed between a cache and its refill path, and prefetched data is placed in this buffer rather than in the cache.
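A minimal sketch of such a prefetch buffer, assuming FIFO replacement and an explicit promote-on-reference step (both assumptions, not necessarily the paper's exact policy):

```cpp
#include <cstdint>
#include <deque>
#include <unordered_set>

// Hedged sketch of a small fully-associative prefetch buffer between a
// cache and its refill path: prefetched lines go into the buffer, not the
// cache, and are promoted into the cache only when actually referenced,
// so useless prefetches never pollute the cache.
struct PrefetchBuffer {
    size_t capacity;
    std::deque<uint64_t> fifo;            // FIFO replacement order
    std::unordered_set<uint64_t> lines;   // lines currently buffered

    explicit PrefetchBuffer(size_t cap) : capacity(cap) {}

    void insert(uint64_t line) {          // called when a prefetch returns
        if (lines.count(line)) return;
        if (fifo.size() == capacity) {    // evict the oldest prefetched line
            lines.erase(fifo.front());
            fifo.pop_front();
        }
        fifo.push_back(line);
        lines.insert(line);
    }

    // On a cache miss, probe the buffer; a hit removes the line so the
    // caller can move it into the cache proper.
    bool probeAndPromote(uint64_t line) {
        if (!lines.count(line)) return false;
        lines.erase(line);
        for (auto it = fifo.begin(); it != fifo.end(); ++it)
            if (*it == line) { fifo.erase(it); break; }
        return true;
    }
};
```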
Proceedings ArticleDOI

Design and evaluation of a compiler algorithm for prefetching

TL;DR: This paper proposes a compiler algorithm to insert prefetch instructions into code that operates on dense matrices, and shows that this algorithm significantly improves the execution speed of the benchmark programs; some of the programs improve by as much as a factor of two.
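For a concrete, hand-written picture of the kind of code such a compiler algorithm emits, the loop below prefetches dense-array elements a fixed number of iterations ahead using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance and locality hints are illustrative tuning assumptions, not values from the paper.

```cpp
#include <cstddef>

// Hand-written example of compiler-style prefetch insertion for a dense
// array kernel: fetch data a fixed number of iterations ahead so it
// arrives by the time the loop reaches it.
void scale(double* a, const double* b, size_t n, double k) {
    const size_t kDistance = 16;              // iterations of lookahead (assumed)
    for (size_t i = 0; i < n; ++i) {
        if (i + kDistance < n) {
            __builtin_prefetch(&b[i + kDistance], /*rw=*/0, /*locality=*/1);
            __builtin_prefetch(&a[i + kDistance], /*rw=*/1, /*locality=*/1);
        }
        a[i] = k * b[i];
    }
}
```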
Proceedings ArticleDOI

Evaluating stream buffers as a secondary cache replacement

TL;DR: The results show that, for the majority of the benchmarks, stream buffers can attain hit rates that are comparable to typical hit rates of secondary caches, and that as the data-set size of the scientific workload increases, the performance of streams typically improves relative to secondary cache performance, showing that streams are more scalable to large data-set sizes.
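A minimal sketch of a single sequential stream buffer, assuming a fixed depth and a restart-on-miss policy (real designs use several buffers and may filter allocation):

```cpp
#include <cstdint>
#include <deque>

// Hedged sketch of one sequential stream buffer behind the L1: a miss
// that also misses the buffer reallocates it to the next few sequential
// lines; a hit at the head supplies the line and the buffer fetches one
// more line to keep the stream going.  Depth is an assumption.
struct StreamBuffer {
    size_t depth;
    std::deque<uint64_t> pending;   // prefetched line addresses, in order

    explicit StreamBuffer(size_t d) : depth(d) {}

    // Returns true if the missing line is supplied by the stream buffer.
    bool onL1Miss(uint64_t line) {
        if (!pending.empty() && pending.front() == line) {
            pending.pop_front();                         // head hit: consume it
            pending.push_back(pending.empty() ? line + 1
                                              : pending.back() + 1);
            return true;
        }
        pending.clear();                                 // miss: restart the stream
        for (uint64_t next = line + 1; pending.size() < depth; ++next)
            pending.push_back(next);
        return false;
    }
};
```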
Proceedings ArticleDOI

An architecture for software-controlled data prefetching

TL;DR: Simulations based on a MIPS processor model show that this technique can dramatically reduce on-chip cache miss ratios and average observed memory latency for scientific loops at only slight cost in total memory traffic.
Proceedings ArticleDOI

A modified approach to data cache management

TL;DR: The bare minimum amount of local memories that programs require to run without delay is measured by using the Value Reuse Profile, which contains the dynamic value reuse information of a program's execution, and by assuming the existence of efficient memory systems.