
Sandbox Prefetching: Safe Run-Time Evaluation of Aggressive Prefetchers
Seth H. Pugsley (1), Zeshan Chishti (2), Chris Wilkerson (2), Peng-fei Chuang (3),
Robert L. Scott (3), Aamer Jaleel (4), Shih-Lien Lu (2), Kingsum Chow (3),
and Rajeev Balasubramonian (1)

(1) University of Utah, {pugsley, rajeev}@cs.utah.edu
(2) Intel Labs, {zeshan.a.chishti, chris.wilkerson, shih-lien.l.lu}@intel.com
(3) Intel Software and Services Group, {peng-fei.chuang, robert.l.scott, kingsum.chow}@intel.com
(4) Intel Corporation VSSAD, aamer.jaleel@intel.com
Abstract
Memory latency is a major factor in limiting CPU per-
formance, and prefetching is a well-known method for hid-
ing memory latency. Overly aggressive prefetching can
waste scarce resources such as memory bandwidth and
cache capacity, limiting or even hurting performance. It
is therefore important to employ prefetching mechanisms
that use these resources prudently, while still prefetching
required data in a timely manner.
In this work, we propose a new mechanism to deter-
mine at run-time the appropriate prefetching mechanism for
the currently executing program, called Sandbox Prefetch-
ing. Sandbox Prefetching evaluates simple, aggressive
offset prefetchers at run-time by adding the prefetch ad-
dress to a Bloom filter, rather than actually fetching the
data into the cache. Subsequent cache accesses are tested
against the contents of the Bloom filter to see if the ag-
gressive prefetcher under evaluation could have accurately
prefetched the data, while simultaneously testing for the ex-
istence of prefetchable streams. Real prefetches are per-
formed when the accuracy of evaluated prefetchers exceeds
a threshold. This method combines the ideas of global
pattern confirmation and immediate prefetching action to
achieve high performance. Sandbox Prefetching improves
performance across the tested workloads by 47.6% com-
pared to not using any prefetching, and by 18.7% compared
to the Feedback Directed Prefetching technique. Perfor-
mance is also improved by 1.4% compared to the Access
Map Pattern Matching Prefetcher, while incurring consid-
erably less logic and storage overheads.
1 Introduction
Modern high performance microprocessors employ
hardware prefetching techniques to mitigate the perfor-
mance impact of long memory latencies. These prefetch-
ers operate by predicting which memory addresses will be
accessed by a program in the near future and then specu-
latively issuing memory requests for those addresses. The
performance improvement afforded by a prefetcher depends
on its ability to correctly predict the memory addresses ac-
cessed by a program. Accurate prefetches hide the memory
latency of potential demand misses by bringing data earlier
to the on-chip caches. In comparison, inaccurate prefetches
result in two problems: First, they increase the contention
for the available memory bandwidth, which could result in
both performance losses and energy overheads. Second,
they waste cache capacity, which could result in additional
cache misses, contributing to the problem they were in-
tended to solve. In fact, there is often a trade-off between
prefetch accuracy and coverage, i.e., to bring in more use-
ful cache lines, the prefetcher must also bring in more use-
less cache lines. Therefore, before employing a prefetching
technique, it is important to weigh the relative benefit of ac-
curate prefetches against the bandwidth and cache capacity
concerns of inaccurate prefetches.
One common approach to maximizing prefetcher accu-
racy is to track streams of accesses in the address space.
This is done by observing several memory accesses form-
ing a regular pattern, and then predicting that pattern will
continue in subsequent memory accesses. These prefetch-
ers can be accurate, but take some time to confirm streams
before any performance benefit can be reaped.
Other techniques, such as next-line prefetchers, or even
the use of larger cache lines, harvest fine grained spatial
locality, but do so blindly for all memory references with-
out first evaluating the benefits. As a result, these prefetch-
ers may increase bandwidth and power while providing no
performance benefit. Kumar et al. [17] took advantage
of the benefits of fine grained spatial locality while avoid-
ing the overheads of superfluous prefetches by maintaining
bit-vectors indicating which nearby cache lines were most
likely to be used after a reference to a particular cache line

occurred. The key drawback of this approach is the over-
head of storing the bit vector for each footprint.
In this paper, we build on the previous work on prefetch-
ers that exploit fine grained spatial locality. Rather than
build and store patterns as in [17], our approach evaluates
a few previously defined patterns and identifies at run-time
which patterns are most likely to provide benefit. Since not
all previously defined prefetch patterns are appropriate for
all workloads, we must identify when and where specific
patterns will benefit performance.
To address this problem we introduce the concept of a
“Prefetch Sandbox.” The key idea behind a prefetch sand-
box is to track prefetch requests generated by a candidate
prefetch pattern, without actually issuing those prefetch re-
quests to the memory system. To test the accuracy of a
candidate prefetcher, the sandbox stores the addresses of
all the cache lines that would have been prefetched by
the candidate pattern. We implement the sandbox using a
Bloom filter for its space efficiency and constant lookup
time. Based on the number of simulated prefetch hits, a can-
didate prefetcher may be globally activated to immediately
perform a prefetch action after every cache access, with no
further confirmation.
Our results show that using the proposed Sandbox
Prefetching technique improves the average performance of
14 memory-intensive benchmarks in the SPEC2006 suite
by 47.6% compared to no prefetching, by 18.7% compared
to the state-of-the-art Feedback Directed Prefetching, and
by 1.4% compared to the Access Map Pattern Matching
Prefetcher, which has a considerably larger storage and
logic requirement compared to Sandbox Prefetching.
2 Background
Hardware prefetchers for regular data access patterns
fall into two broad categories: conservative, confirmation-
based prefetchers (such as a stream prefetcher), and aggres-
sive, immediate prefetchers (such as a next-line prefetcher).
These two varieties of prefetchers have generally opposite
goals and methods, but Sandbox Prefetching combines at-
tributes of both.
2.1 Confirmation-Based Prefetchers
A confirmation-based prefetcher is one that performs a
prefetch only after it has built up some confidence that the
prefetch will be useful. A stream prefetcher is a good ex-
ample of a confirmation-based prefetcher. When a cache
line address A is seen for the first time, no prefetches are
performed at this time, but the stream prefetcher begins
watching for address A+1. Even when A+1 is seen, still no
prefetches are made, because there is not yet enough evi-
dence that this is a true stream. Only after A+2 is also seen,
and the stream has been fully “confirmed” will A+3 (and
perhaps further cache lines) finally be prefetched.
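To make the confirmation sequence concrete, here is a minimal software sketch of a single-stream confirmation prefetcher; the structure, names, and threshold are illustrative only, not the hardware of any particular design:

    # Minimal sketch of a confirmation-based stream prefetcher for one
    # ascending stream. Names and the threshold are illustrative only.
    class StreamPrefetcher:
        CONFIRMATIONS_NEEDED = 2  # must observe A+1 and A+2 after A

        def __init__(self):
            self.last_addr = None  # last cache line address seen
            self.confirms = 0      # consecutive +1 accesses observed

        def access(self, line_addr):
            """Called on each cache access; returns prefetch addresses."""
            if self.last_addr is not None and line_addr == self.last_addr + 1:
                self.confirms += 1
            else:
                self.confirms = 0  # pattern broken: start confirming again
            self.last_addr = line_addr
            if self.confirms >= self.CONFIRMATIONS_NEEDED:
                return [line_addr + 1]  # stream confirmed: prefetch ahead
            return []                   # still confirming: no prefetch

Note that the first prefetch (A+3) is only issued on the third access to the stream, which is exactly the lost opportunity discussed next.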
Since confirmation prefetchers always wait for some
time before issuing any prefetches, they will always leave
some performance on the table. Even if a stream prefetcher
is perfect at prefetching a long stream in a timely and accu-
rate manner, the fact that it had to confirm the stream before
the first prefetch was issued means that its performance will
always be limited, because it missed out on prefetching the
first three accesses.
Confirmation-based prefetchers have the advantage that
once a pattern has been confirmed, many prefetches can be
issued along that pattern, far ahead of the program’s actual
access stream. This improves performance by avoiding late
prefetches.
Confirmation-based stream prefetchers operate on the
granularity of individual streams. Each address that comes
to the prefetcher is considered to be either part of an ex-
isting stream, or not a part of any known stream, in which
case a new stream will be allocated for it. Once a stream
has been confirmed, prefetches may be made along it, but
each stream must be confirmed independently of all other
streams. Furthermore, a new stream will have to be allo-
cated and confirmed whenever an access stream reaches a
virtual page boundary, because the next physical page will
not yet be known.
2.2 Immediate Prefetchers
An immediate prefetcher is one that generates a prefetch
address and performs a prefetch as soon as it is given an in-
put reference address. The most basic and common of these
types of prefetchers is the next-line prefetcher. Every time
the next-line prefetcher is given the address of a cache line
A, it will immediately prefetch the cache line A+1. The only
requirement for the next-line prefetcher to prefetch address
A+1 is for it to see address A. No additional confirmation or
input is required.
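The entire mechanism fits in a few lines; an illustrative sketch in the same style as the stream prefetcher above:

    # Sketch of an immediate (next-line) prefetcher: one fixed action
    # per access and no confirmation state at all.
    class NextLinePrefetcher:
        def access(self, line_addr):
            return [line_addr + 1]  # always prefetch the next cache line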
Immediate prefetchers have the disadvantage that they
have a higher probability of being inaccurate, compared
to confirmation-based prefetchers. Also, with immediate
prefetchers, there is no notion of prefetching “ahead” of a
stream of accesses, because an immediate prefetcher takes
only a single action each time it is invoked.
Immediate prefetchers have the advantage that they can
prefetch patterns which a confirmation-based prefetcher
cannot prefetch, because no confirmable pattern exists. For
example, consider a linked list of data structures that are
exactly the size of two cache lines. A confirmation-based
prefetcher would consider the first cache line of each of
these linked list nodes to be the beginning of a new pat-
tern, and accessing the second cache line would help build
confidence in this pattern, but the third sequential access
would never come, because the linked list would jump
somewhere else in memory. On the other hand, because im-
mediate prefetchers work on the granularity of individual
cache lines, and not streams, a next-line prefetcher would
be able to perfectly prefetch the second cache line of these
linked list nodes.

3 Related Work
There are numerous studies that have proposed novel
prefetching algorithms [13, 3, 15, 19, 7, 20, 14, 5, 18,
10, 11, 22, 6, 9, 23, 2, 1, 17, 16]. Initial research on
prefetching relied on the fact that many applications ex-
hibit a high degree of spatial locality. As such, many studies
showed these applications can benefit from sequential and
stride prefetching [6, 9]. However, applications that lack
spatial locality receive very little benefit from sequential
prefetching. Therefore, more complex prefetching propos-
als such as Markov-based prefetchers [14], and prefetchers
for pointer chasing applications [7, 20], have also been pro-
posed. While we cannot cover all prefetching related re-
search work, we summarize prior art that closely relates to
our Sandbox Prefetching technique.
There have been a few studies that dynamically adapt
the aggressiveness of prefetchers. Dahlgren et al. proposed
adaptive sequential prefetching for multiprocessors [5].
Their proposal dynamically modulated prefetcher distance
by tracking the usefulness of prefetches. If the prefetches
are useful, the prefetch distance is increased, otherwise it is
decreased. Jimenez et al. present a real-life dynamic imple-
mentation of a prefetcher in the POWER7 processor [13].
The POWER7 processor supports a number of prefetcher
configurations and prefetch distances (seven in all). Their
approach exposes to the operating system software the dif-
ferent prefetcher configurations using a Configuration Sta-
tus Register (CSR) on a per-core basis. The operating sys-
tem/software time-samples the performance of all prefetch
configurations and chooses the best prefetch setting for the
given core and runs it for several time quanta. In contrast to
this work, which uses software to evaluate and program the
hardware prefetcher, our proposed scheme is a hardware-
only solution.
In this work, we compare Sandbox Prefetching to Feed-
back Directed Prefetching (FDP) [21] and Address Map
Pattern Matching Prefetching (AMPM) [12]. We now de-
scribe both of these techniques in some detail.
3.1 Feedback Directed Prefetching
FDP is an improvement on a conventional stream
prefetcher, which takes into account the accuracy and time-
liness of the prefetcher, as well as the cache pollution gen-
erated by the prefetcher, to dynamically vary how aggres-
sively the prefetcher operates.
FDP works by maintaining a structure that tracks mul-
tiple different access streams. The FDP mechanism is in-
voked in the event of an L2 cache miss (the L2 was the
last-level cache in the FDP work). When a new stream
is accessed for the first time, a new entry in the tracking
structure is allocated. The stream then trains on the next
two cache misses that fall within +/-16 cache blocks of the
initial miss in order to determine the direction, whether pos-
itive or negative, that the stream is heading in.
After a stream and its direction are confirmed, the stream
tracker enters monitor and request mode. The tracker mon-
itors a region of memory, between a start and end pointer,
and whenever there is an access to that region, one or more
prefetches are issued, and the bounds of the monitored re-
gion of memory are advanced in the direction of the stream.
The size of the monitored memory region, and the number
of cache lines which are prefetched with each access, are
determined by the current aggressiveness level.
FDP has five levels of aggressiveness that it can switch
between, given the current behavior of the program. The
least aggressive level monitors a region of 4 cache blocks,
and prefetches a single cache line on each stream access.
The most aggressive level monitors a region of 64 cache
blocks, and prefetches 4 cache lines on each stream access.
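As a rough illustration of the monitor-and-request mode described above, the following sketch models one FDP stream tracker; this is our simplified reading of the mechanism, and all field names are ours:

    # Simplified sketch of one FDP stream tracker in monitor-and-request
    # mode. region_size (4..64 blocks) and degree (1..4 prefetches)
    # correspond to the aggressiveness levels quoted above.
    class FdpStreamTracker:
        def __init__(self, start, direction, region_size, degree):
            self.start = start              # trailing edge of the region
            self.direction = direction      # +1 or -1
            self.region_size = region_size  # monitored blocks
            self.degree = degree            # prefetches per stream access

        def access(self, line_addr):
            end = self.start + self.direction * self.region_size
            lo, hi = min(self.start, end), max(self.start, end)
            if not lo <= line_addr <= hi:
                return []                   # not part of this stream
            # prefetch beyond the region and advance it along the stream
            prefetches = [end + self.direction * i
                          for i in range(1, self.degree + 1)]
            self.start += self.direction * self.degree
            return prefetches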
3.2 Address Map Pattern Matching
AMPM is the winner of a data prefetching championship
whose only limitations were on the number of bits that
could be used to store prefetcher state (4 KB). As a con-
sequence, AMPM uses complex logic to make the most of
its limited storage budget. The main idea of AMPM is to
track every cache line in large 16 KB regions of memory,
and then to exhaustively determine if any strides can be dis-
covered through the use of pattern matching, and then to
prefetch along those strides.
AMPM tracks address maps for 52 separate 16 KB regions
of memory, each of which maintains 2 bits of state per cache
line in the region, corresponding to whether the line has
not been accessed, has been demand accessed, or has been
prefetched. This address map is updated on every L2 access
and prefetch (L2 was the last level of cache in their work).
Also on every L2 access, the address map corresponding
to the current access is retrieved from a fully-associative
array of address maps, and is placed in a large shift regis-
ter to align the map with the current access address. Then
it attempts to match 256 separate patterns with this shifted
address map, each pattern match requiring two compares to
discover series of strided accesses centered around the cur-
rent access. This generates a list of candidate prefetches,
and a number of these are prefetched according to a dy-
namically changing prefetch degree, and in the order of the
smallest magnitude offset to the largest.
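The pattern matching itself can be approximated in software as follows; this is our reading of the two-compare stride test (lines at -d and -2d accessed implies a candidate prefetch at +d), with the 2-bit line states collapsed to a single accessed bit:

    # Simplified sketch of AMPM-style pattern matching. 128 stride
    # magnitudes in two directions give the 256 patterns mentioned
    # above. access_map holds a per-cache-line accessed bit, already
    # aligned so that index pos is the current access.
    def ampm_candidates(access_map, pos, max_offset=128):
        candidates = []
        for d in range(1, max_offset + 1):
            for stride in (d, -d):
                i1, i2 = pos - stride, pos - 2 * stride
                if (0 <= i1 < len(access_map) and 0 <= i2 < len(access_map)
                        and access_map[i1] and access_map[i2]):
                    candidates.append(stride)  # prefetch line pos + stride
        # issue smallest-magnitude offsets first, per the text
        return sorted(candidates, key=abs)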
4 Sandbox Prefetching
Sandbox Prefetching (SBP) represents another class of
prefetcher, and combines the ideas of global confirmation
with immediate action to aggressively, and safely, per-
form prefetches. All of our discussion and evaluations are
made in the context of a two-level cache hierarchy, and
prefetches happen exclusively from memory to L2.
4.1 Overview
SBP operates on the principle of validating the accuracy
of aggressive, immediate offset prefetchers in a safe, sand-
boxed environment, where neither the real cache nor mem-
ory bandwidth are disturbed, and then deploying them in
the real memory hierarchy only if they prove that they can
accurately prefetch useful data. A set of candidate prefetch-
ers, each corresponding to a specific cache line offset, are
constantly evaluated and re-evaluated for accuracy, and the
most accurate of them are allowed to issue real prefetches
to main memory.

Figure 1. Sandbox Prefetching's place in the memory hierarchy.
Immediate prefetchers have a single prefetch action,
which they perform in every situation where they are used.
A next-line prefetcher will always fetch the plus-one cache
line, regardless of the input it receives. It is the same with
the candidate offset prefetchers. Each one of them will per-
form a prefetch with a specific offset from the current input
cache line address. Prefetcher accuracy is a concern, and
we therefore cannot allow all candidate prefetchers to issue
prefetches all the time.
Candidates are evaluated by simulating their prefetch ac-
tion and measuring the simulated effect. This is done by
adding prefetch addresses into a sandbox, rather than issu-
ing real prefetches to main memory. The sandbox is a struc-
ture which implements a set and keeps track of addresses
which have been added to it. Subsequent cache accesses test
the sandbox to see if their address can be found there. If the
address is found there, then the current candidate prefetcher
could have accurately prefetched this cache line, and that
candidate’s accuracy score is increased. The accuracy score
is used to tell which, if any, of the candidate prefetchers has
qualified to issue real prefetches in the real memory hier-
archy. If the address is not found there, then that means
the current candidate prefetcher could not have accurately
prefetched this line.
Candidate prefetchers are not “confirmed” in the con-
text of a single access stream, as in a stream prefetcher, but
rather in the context of all memory access patterns present
in the currently executing program. We do not test whether
a particular offset prefetcher, which prefetches offset O from
the current cache line address, is accurate for only a single
stream; rather, we test whether, for every access A, there is
a subsequent access to A+O. If the pattern holds true for a
large enough number of cache accesses, then the candidate
prefetcher is turned on in the real memory hierarchy.
Each candidate is evaluated for a fixed number of L2 ac-
cesses, and then the contents of the sandbox are reset, and
the next candidate is evaluated.

Figure 2. Sandbox Prefetching acts on every L2 access.
4.2 The Sandbox
The sandbox of Sandbox Prefetching is implemented as
a Bloom filter [16]. Each prefetch address generated by the
current candidate prefetcher is added to the sandbox Bloom
filter, and each time there is a cache access, the Bloom fil-
ter is tested to see if the cache line address is contained
in it. The sandbox can be thought of as tracking an un-
ordered history of all prefetch addresses the current candi-
date prefetcher has generated.
Because of the probabilities governing Bloom filters, the
size of the Bloom filter is directly related to how many items
can be added to it before the false positive rate rises above
a desirable level. Because each candidate prefetcher gener-
ates only a single prefetch address, we will add a number of
items to the Bloom filter equal to the number of L2 accesses
in an evaluation period. We experimentally determined that
an evaluation period of 256 L2 accesses is optimal for the
tested workloads. We chose a Bloom filter size of 2048 bits
(256 bytes), which for 256 item insertions gives us a maximum
expected false positive rate of approximately 1%.
There is only one sandbox per core, and the candidate
prefetchers are evaluated one at a time, in a time multi-
plexed fashion, with the sandbox being reset in between
each evaluation. This means there is no opportunity for
cross-contamination between multiple candidate prefetchers
sharing a sandbox.
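The paper does not specify the sandbox's hash functions, so the following sketch is one plausible software rendering of a 2048-bit sandbox with the add, test, and reset operations used above; the hashing scheme and hash count are our choices, not the paper's:

    import hashlib

    class Sandbox:
        # 2048-bit Bloom filter; the hash scheme is illustrative. With
        # m = 2048 bits and n = 256 insertions, the standard estimate
        # (1 - e^(-k*n/m))^k puts the false-positive rate in the low
        # single-digit percent range for small k.
        M_BITS = 2048
        K_HASHES = 4

        def __init__(self):
            self.bits = 0  # bit array packed into a Python int

        def _positions(self, line_addr):
            for i in range(self.K_HASHES):
                h = hashlib.blake2b(f"{i}:{line_addr}".encode()).digest()
                yield int.from_bytes(h[:4], "little") % self.M_BITS

        def add(self, line_addr):  # record a simulated prefetch
            for p in self._positions(line_addr):
                self.bits |= 1 << p

        def maybe_contains(self, line_addr):  # may rarely false-positive
            return all(self.bits >> p & 1 for p in self._positions(line_addr))

        def reset(self):  # cleared between evaluation periods
            self.bits = 0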

4.3 Candidate Evaluation
Sandbox Prefetching maintains a set of 16 candidate
prefetchers, which are evaluated in round-robin fashion.
Initially this set of prefetchers is for offsets -8 to -1, and +1
to +8. At the beginning of an evaluation period, the sand-
box, the L2 access counter, and the prefetch accuracy score
are all reset, along with other counters which track period
cache reads, writes, and prefetches, which are used to ap-
proximate bandwidth usage.
Each time the L2 cache is accessed, the cache line ad-
dress is used to check the sandbox to see if this line would
have been prefetched by the current candidate prefetcher.
If it is a hit, then the prefetch accuracy score is incre-
mented; otherwise, nothing happens. After this, the can-
didate prefetcher generates a prefetch address, based on the
reference cache line address and its own prefetch offset, and
adds this address to the sandbox. Finally, the counter that
tracks the number of L2 accesses this period is incremented.
Once this number reaches 256, the evaluation period is over
and the sandbox and other counters are reset, and the evalu-
ation of the next candidate prefetcher begins.
After a complete round of evaluating every candidate
prefetcher is over, the 4 prefetchers with the lowest
prefetch accuracy score are cycled out, and 4 more offset
prefetchers that have not been recently evaluated from the
range -16 to +16 are cycled in.
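Putting Sections 4.2 and 4.3 together, the per-access evaluation step can be sketched as a simplified software model; it uses the Sandbox sketch above, omits the bandwidth counters and the cycling of the worst four candidates, and the helper names are ours:

    PERIOD = 256  # L2 accesses per evaluation period

    class CandidateEvaluator:
        def __init__(self):
            # initial candidate offsets: -8 to -1 and +1 to +8
            self.candidates = [o for o in range(-8, 9) if o != 0]
            self.scores = {o: 0 for o in self.candidates}
            self.current = 0          # candidate under evaluation
            self.accesses = 0
            self.sandbox = Sandbox()  # from the sketch in Section 4.2

        def on_l2_access(self, line_addr):
            offset = self.candidates[self.current]
            # 1. score a hit if this line would have been prefetched
            if self.sandbox.maybe_contains(line_addr):
                self.scores[offset] += 1
            # 2. simulate the candidate's prefetch action in the sandbox
            self.sandbox.add(line_addr + offset)
            # 3. end of period: reset, then evaluate the next candidate
            self.accesses += 1
            if self.accesses == PERIOD:
                self.accesses = 0
                self.sandbox.reset()
                self.current = (self.current + 1) % len(self.candidates)
                # incoming candidate starts its period with a fresh score
                self.scores[self.candidates[self.current]] = 0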
4.4 Prefetch Action
As soon as a candidate prefetcher has finished its evalu-
ation, it may be used to issue real prefetches, if its accuracy
score is high enough. In addition to all of the candidate
evaluation that is done, each L2 access may result in one
or more prefetches being issued to main memory. We con-
trol the number of prefetches that are issued by estimating
the amount of bandwidth each core has consumed during
its last evaluation period, and then using that to estimate
the amount of unused bandwidth available to be used for
additional prefetches. Each core in a multi-core setup gets
a prefetch degree budget proportional to the number of L2
accesses it performs. This prefetch degree is bounded by a
minimum of one prefetch per prefetch direction (positive
and negative), per core, per L2 access, and a maximum of
eight. The prefetch degree is recalculated at the end of
each evaluation period.
Evaluated prefetchers with lower numbered offsets are
given preference to issue their prefetches first (and there-
fore use up some of the prefetch degree budget first). There
is an accuracy score cutoff point, below which an evalu-
ated prefetcher will not be allowed to issue any prefetches.
Prefetches continue until a number of prefetches equal to
the prefetch degree has been issued, and then the process is
repeated for the negative offset prefetchers. The actual offsets of
the evaluated prefetchers can change as the less accurate
candidate prefetchers are cycled out, so there will need to
be some hardware logic to decide the order that evaluated
prefetchers will be considered to issue their prefetches. The
specific values of the cutoff points will be discussed in the
next subsection.

Table 1. Sandbox Prefetching parameters and counters.

  Sandbox Size                     2048 bits
  Evaluation Period                256 L2 accesses
  Total PF Candidates              32
  Candidate Offset Ranges          -16 to +16, excluding 0;
                                   16 evaluated per round,
                                   then worst 4 cycled out
  Candidate Score Storage          16 10-bit counters
  Prefetch Accuracy Cutoffs to     256 (1 PF)
  Issue Multiple Prefetches        512 (2 PFs)
  per L2 Access                    768 (3 PFs)
  Bandwidth Estimation Counters    Read Counter,
                                   Write Counter,
                                   Prefetch Counter
It is important to keep in mind that there is no additional
confirmation before prefetches are issued at this stage. All
of the confirmation has already been done globally in the
sandbox during the offset prefetcher’s evaluation.
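The issue policy above, using the Table 1 cutoffs of 256, 512, and 768 for one, two, or three prefetches per L2 access, can be sketched as follows; the per-direction budget handling is simplified relative to the bandwidth-based calculation in the text:

    # Accuracy cutoffs from Table 1: score -> prefetches per L2 access.
    CUTOFFS = [(768, 3), (512, 2), (256, 1)]

    def prefetches_for(score):
        for cutoff, n_pf in CUTOFFS:
            if score >= cutoff:
                return n_pf
        return 0  # below the lowest cutoff: may not prefetch at all

    def issue_prefetches(line_addr, scores, degree_budget):
        # scores: offset -> accuracy score from the last evaluation.
        # Lower-numbered offsets get preference, and the positive and
        # negative directions are budgeted separately, per the text.
        issued = []
        for sign in (+1, -1):
            budget = degree_budget
            for o in sorted((o for o in scores if o * sign > 0), key=abs):
                n = min(prefetches_for(scores[o]), budget)
                # prefetch ahead along the qualified offset's stride
                issued += [line_addr + i * o for i in range(1, n + 1)]
                budget -= n
                if budget == 0:
                    break
        return issued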
4.5 Detecting Streams
So far we have focused on the sandbox’s ability to de-
tect the accuracy of offset prefetches, but it can also be used
to detect strided access streams. When the sandbox is be-
ing probed to see if the current L2 access cache line ad-
dress could have been prefetched by the current candidate
prefetcher, we can also act as though this access is the latest
in a strided stream of accesses (where the stride is equal to
the offset of the current candidate prefetcher), and test to
see if earlier addresses in this strided stream are also found
in the sandbox.
For example, if the current candidate prefetcher’s offset
is +3, whenever we check the sandbox to see if the current
cache line address A is found in it, we can also check for
A-3, A-6, A-9, and so on. If those addresses are also found in
the sandbox, then that means that the program is accessing
a stream with stride +3. When we were only considering
individual offsets that could be prefetched, there was no op-
portunity to prefetch “ahead,” because there was no stream
to follow. But now that we can accurately detect strided
streams in the access pattern, it makes sense that each can-
didate prefetcher be allowed to prefetch more than a single
line.
We treat the detection of earlier members of a stream in
the sandbox the same as we treat the detection of the cur-
rent access address, by incrementing the sandbox accuracy
score for each line found. We probe the sandbox for the cur-
rent address and the previous three members of the stream,
so it’s possible that on each L2 access, the prefetch accu-
racy score is incremented up to four times.
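Folded into the evaluation step of Section 4.3, the stream probe replaces the single sandbox test with up to four probes, each contributing one point to the accuracy score; a sketch under the same assumptions as the earlier code:

    STREAM_DEPTH = 3  # earlier stream members probed, per the text

    def probe_with_streams(sandbox, line_addr, offset):
        # Score increment for one L2 access: one point per probe that
        # hits, for the current address and the previous three members
        # of a stride-`offset` stream.
        probes = [line_addr - back * offset
                  for back in range(0, STREAM_DEPTH + 1)]
        return sum(1 for a in probes if sandbox.maybe_contains(a))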
