
Sandbox Prefetching: Safe Run-Time Evaluation of Aggressive Prefetchers
Seth H. Pugsley (1), Zeshan Chishti (2), Chris Wilkerson (2), Peng-fei Chuang (3),
Robert L. Scott (3), Aamer Jaleel (4), Shih-Lien Lu (2), Kingsum Chow (3),
and Rajeev Balasubramonian (1)

(1) University of Utah, {pugsley, rajeev}@cs.utah.edu
(2) Intel Labs, {zeshan.a.chishti, chris.wilkerson, shih-lien.l.lu}@intel.com
(3) Intel Software and Services Group, {peng-fei.chuang, robert.l.scott, kingsum.chow}@intel.com
(4) Intel Corporation VSSAD, aamer.jaleel@intel.com
Abstract
Memory latency is a major factor in limiting CPU per-
formance, and prefetching is a well-known method for hid-
ing memory latency. Overly aggressive prefetching can
waste scarce resources such as memory bandwidth and
cache capacity, limiting or even hurting performance. It
is therefore important to employ prefetching mechanisms
that use these resources prudently, while still prefetching
required data in a timely manner.
In this work, we propose a new mechanism to deter-
mine at run-time the appropriate prefetching mechanism for
the currently executing program, called Sandbox Prefetch-
ing. Sandbox Prefetching evaluates simple, aggressive
offset prefetchers at run-time by adding the prefetch ad-
dress to a Bloom filter, rather than actually fetching the
data into the cache. Subsequent cache accesses are tested
against the contents of the Bloom filter to see if the ag-
gressive prefetcher under evaluation could have accurately
prefetched the data, while simultaneously testing for the ex-
istence of prefetchable streams. Real prefetches are per-
formed when the accuracy of evaluated prefetchers exceeds
a threshold. This method combines the ideas of global
pattern confirmation and immediate prefetching action to
achieve high performance. Sandbox Prefetching improves
performance across the tested workloads by 47.6% com-
pared to not using any prefetching, and by 18.7% compared
to the Feedback Directed Prefetching technique. Perfor-
mance is also improved by 1.4% compared to the Access
Map Pattern Matching Prefetcher, while incurring consid-
erably less logic and storage overheads.
1 Introduction
Modern high performance microprocessors employ
hardware prefetching techniques to mitigate the perfor-
mance impact of long memory latencies. These prefetch-
ers operate by predicting which memory addresses will be
accessed by a program in the near future and then specu-
latively issuing memory requests for those addresses. The
performance improvement afforded by a prefetcher depends
on its ability to correctly predict the memory addresses ac-
cessed by a program. Accurate prefetches hide the memory
latency of potential demand misses by bringing data earlier
to the on-chip caches. In comparison, inaccurate prefetches
result in two problems: First, they increase the contention
for the available memory bandwidth, which could result in
both performance losses and energy overheads. Second,
they waste cache capacity, which could result in additional
cache misses, contributing to the problem they were in-
tended to solve. In fact, there is often a trade-off between
prefetch accuracy and coverage, i.e., to bring in more use-
ful cache lines, the prefetcher must also bring in more use-
less cache lines. Therefore, before employing a prefetching
technique, it is important to weigh the relative benefit of ac-
curate prefetches against the bandwidth and cache capacity
concerns of inaccurate prefetches.
One common approach to maximizing prefetcher accu-
racy is to track streams of accesses in the address space.
This is done by observing several memory accesses form-
ing a regular pattern, and then predicting that pattern will
continue in subsequent memory accesses. These prefetch-
ers can be accurate, but take some time to confirm streams
before any performance benefit can be reaped.
Other techniques, such as next-line prefetchers, or even
the use of larger cache lines, harvest fine grained spatial
locality, but do so blindly for all memory references with-
out first evaluating the benefits. As a result, these prefetch-
ers may increase bandwidth and power while providing no
performance benefit. Kumar et al. [17] took advantage
of the benefits of fine grained spatial locality while avoid-
ing the overheads of superfluous prefetches by maintaining
bit-vectors indicating which nearby cache lines were most
likely to be used after a reference to a particular cache line

occurred. The key drawback of this approach is the over-
head of storing the bit vector for each footprint.
In this paper, we build on the previous work on prefetch-
ers that exploit fine grained spatial locality. Rather than
build and store patterns as in [17], our approach evaluates
a few previously defined patterns and identifies at run-time
which patterns are most likely to provide benefit. Since not
all previously defined prefetch patterns are appropriate for
all workloads, we must identify when and where specific
patterns will benefit performance.
To address this problem we introduce the concept of a
“Prefetch Sandbox.” The key idea behind a prefetch sand-
box is to track prefetch requests generated by a candidate
prefetch pattern, without actually issuing those prefetch re-
quests to the memory system. To test the accuracy of a
candidate prefetcher, the sandbox stores the addresses of
all the cache lines that would have been prefetched by
the candidate pattern. We implement the sandbox using a
Bloom filter for its space efficiency and constant lookup
time. Based on the number of simulated prefetch hits, a can-
didate prefetcher may be globally activated to immediately
perform a prefetch action after every cache access, with no
further confirmation.
Our results show that using the proposed Sandbox
Prefetching technique improves the average performance of
14 memory-intensive benchmarks in the SPEC2006 suite
by 47.6% compared to no prefetching, by 18.7% compared
to the state-of-the-art Feedback Directed Prefetching, and
by 1.4% compared to the Access Map Pattern Matching
Prefetcher, which has a considerably larger storage and
logic requirement compared to Sandbox Prefetching.
2 Background
Hardware prefetchers for regular data access patterns
fall into two broad categories: conservative, confirmation-
based prefetchers (such as a stream prefetcher), and aggres-
sive, immediate prefetchers (such as a next-line prefetcher).
These two varieties of prefetchers have generally opposite
goals and methods, but Sandbox Prefetching combines at-
tributes of both.
2.1 Confirmation-Based Prefetchers
A confirmation-based prefetcher is one that performs a
prefetch only after it has built up some confidence that the
prefetch will be useful. A stream prefetcher is a good ex-
ample of a confirmation-based prefetcher. When a cache
line address A is seen for the first time, no prefetches are
performed at this time, but the stream prefetcher begins
watching for address A+1. Even when A+1 is seen, still no
prefetches are made, because there is not yet enough evi-
dence that this is a true stream. Only after A+2 is also seen,
and the stream has been fully “confirmed” will A+3 (and
perhaps further cache lines) finally be prefetched.
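To make the confirmation sequence concrete, here is a minimal software sketch of a single-stream confirmation prefetcher; the structure, names, and threshold are illustrative only, not the hardware of any particular design:

    # Minimal sketch of a confirmation-based stream prefetcher for one
    # ascending stream. Names and the threshold are illustrative only.
    class StreamPrefetcher:
        CONFIRMATIONS_NEEDED = 2  # must observe A+1 and A+2 after A

        def __init__(self):
            self.last_addr = None  # last cache line address seen
            self.confirms = 0      # consecutive +1 accesses observed

        def access(self, line_addr):
            """Called on each cache access; returns prefetch addresses."""
            if self.last_addr is not None and line_addr == self.last_addr + 1:
                self.confirms += 1
            else:
                self.confirms = 0  # pattern broken: start confirming again
            self.last_addr = line_addr
            if self.confirms >= self.CONFIRMATIONS_NEEDED:
                return [line_addr + 1]  # stream confirmed: prefetch ahead
            return []                   # still confirming: no prefetch

Note that the first prefetch (A+3) is only issued on the third access to the stream, which is exactly the lost opportunity discussed next.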
Since confirmation prefetchers always wait for some
time before issuing any prefetches, they will always leave
some performance on the table. Even if a stream prefetcher
is perfect at prefetching a long stream in a timely and accu-
rate manner, the fact that it had to confirm the stream before
the first prefetch was issued means that its performance will
always be limited, because it missed out on prefetching the
first three accesses.
Confirmation-based prefetchers have the advantage that
once a pattern has been confirmed, many prefetches can be
issued along that pattern, far ahead of the program’s actual
access stream. This improves performance by avoiding late
prefetches.
Confirmation-based stream prefetchers operate on the
granularity of individual streams. Each address that comes
to the prefetcher is considered to be either part of an ex-
isting stream, or not a part of any known stream, in which
case a new stream will be allocated for it. Once a stream
has been confirmed, prefetches may be made along it, but
each stream must be confirmed independently of all other
streams. Furthermore, a new stream will have to be allo-
cated and confirmed whenever an access stream reaches a
virtual page boundary, because the next physical page will
not yet be known.
2.2 Immediate Prefetchers
An immediate prefetcher is one that generates a prefetch
address and performs a prefetch as soon as it is given an in-
put reference address. The most basic and common of these
types of prefetchers is the next-line prefetcher. Every time
the next-line prefetcher is given the address of a cache line
A, it will immediately prefetch the cache line A+1. The only
requirement for the next-line prefetcher to prefetch address
A+1 is for it to see address A. No additional confirmation or
input is required.
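The entire mechanism fits in a few lines; an illustrative sketch in the same style as the stream prefetcher above:

    # Sketch of an immediate (next-line) prefetcher: one fixed action
    # per access and no confirmation state at all.
    class NextLinePrefetcher:
        def access(self, line_addr):
            return [line_addr + 1]  # always prefetch the next cache line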
Immediate prefetchers have the disadvantage that they
have a higher probability of being inaccurate, compared
to confirmation-based prefetchers. Also, with immediate
prefetchers, there is no notion of prefetching “ahead” of a
stream of accesses, because an immediate prefetcher takes
only a single action each time it is invoked.
Immediate prefetchers have the advantage that they can
prefetch patterns which a confirmation-based prefetcher
cannot prefetch, because no confirmable pattern exists. For
example, consider a linked list of data structures that are
exactly the size of two cache lines. A confirmation-based
prefetcher would consider the first cache line of each of
these linked list nodes to be the beginning of a new pat-
tern, and accessing the second cache line would help build
confidence in this pattern, but the third sequential access
would never come, because the linked list would jump
somewhere else in memory. On the other hand, because im-
mediate prefetchers work on the granularity of individual
cache lines, and not streams, a next-line prefetcher would
be able to perfectly prefetch the second cache line of these
linked list nodes.

3 Related Work
There are numerous studies that have proposed novel
prefetching algorithms [13, 3, 15, 19, 7, 20, 14, 5, 18,
10, 11, 22, 6, 9, 23, 2, 1, 17, 16]. Initial research on
prefetching relied on the fact that many applications ex-
hibit a high degree of spatial locality. As such, many studies
showed these applications can benefit from sequential and
stride prefetching [6, 9]. However, applications that lack
spatial locality receive very little benefit from sequential
prefetching. Therefore, more complex prefetching propos-
als such as Markov-based prefetchers [14], and prefetchers
for pointer chasing applications [7, 20], have also been pro-
posed. While we cannot cover all prefetching related re-
search work, we summarize prior art that closely relates to
our Sandbox Prefetching technique.
There have been a few studies that dynamically adapt
the aggressiveness of prefetchers. Dahlgren et al. proposed
adaptive sequential prefetching for multiprocessors [5].
Their proposal dynamically modulated prefetcher distance
by tracking the usefulness of prefetches. If the prefetches
are useful, the prefetch distance is increased, otherwise it is
decreased. Jimenez et al. present a real-life dynamic imple-
mentation of a prefetcher in the POWER7 processor [13].
The POWER7 processor supports a number of prefetcher
configurations and prefetch distances (seven in all). Their
approach exposes to the operating system software the dif-
ferent prefetcher configurations using a Configuration Sta-
tus Register (CSR) on a per-core basis. The operating sys-
tem/software time-samples the performance of all prefetch
configurations and chooses the best prefetch setting for the
given core and runs it for several time quanta. In contrast to
this work, which uses software to evaluate and program the
hardware prefetcher, our proposed scheme is a hardware-
only solution.
In this work, we compare Sandbox Prefetching to Feed-
back Directed Prefetching (FDP) [21] and Address Map
Pattern Matching Prefetching (AMPM) [12]. We now de-
scribe both of these techniques in some detail.
3.1 Feedback Directed Prefetching
FDP is an improvement on a conventional stream
prefetcher, which takes into account the accuracy and time-
liness of the prefetcher, as well as the cache pollution gen-
erated by the prefetcher, to dynamically vary how aggres-
sively the prefetcher operates.
FDP works by maintaining a structure that tracks mul-
tiple different access streams. The FDP mechanism is in-
voked in the event of an L2 cache miss (the L2 was the
last-level cache in the FDP work). When a new stream
is accessed for the first time, a new entry in the tracking
structure is allocated. The stream then trains on the next
two cache misses that fall within +/-16 cache blocks of the
initial miss in order to determine the direction, whether pos-
itive or negative, that the stream is heading in.
After a stream and its direction are confirmed, the stream
tracker enters monitor and request mode. The tracker mon-
itors a region of memory, between a start and end pointer,
and whenever there is an access to that region, one or more
prefetches are issued, and the bounds of the monitored re-
gion of memory are advanced in the direction of the stream.
The size of the monitored memory region, and the number
of cache lines which are prefetched with each access, are
determined by the current aggressiveness level.
FDP has five levels of aggressiveness that it can switch
between, given the current behavior of the program. The
least aggressive level monitors a region of 4 cache blocks,
and prefetches a single cache line on each stream access.
The most aggressive level monitors a region of 64 cache
blocks, and prefetches 4 cache lines on each stream access.
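As a rough illustration of the monitor-and-request mode described above, the following sketch models one FDP stream tracker; this is our simplified reading of the mechanism, and all field names are ours:

    # Simplified sketch of one FDP stream tracker in monitor-and-request
    # mode. region_size (4..64 blocks) and degree (1..4 prefetches)
    # correspond to the aggressiveness levels quoted above.
    class FdpStreamTracker:
        def __init__(self, start, direction, region_size, degree):
            self.start = start              # trailing edge of the region
            self.direction = direction      # +1 or -1
            self.region_size = region_size  # monitored blocks
            self.degree = degree            # prefetches per stream access

        def access(self, line_addr):
            end = self.start + self.direction * self.region_size
            lo, hi = min(self.start, end), max(self.start, end)
            if not lo <= line_addr <= hi:
                return []                   # not part of this stream
            # prefetch beyond the region and advance it along the stream
            prefetches = [end + self.direction * i
                          for i in range(1, self.degree + 1)]
            self.start += self.direction * self.degree
            return prefetches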
3.2 Address Map Pattern Matching
AMPM is the winner of a data prefetching championship
whose only limitations were on the number of bits that
could be used to store prefetcher state (4 KB). As a con-
sequence, AMPM uses complex logic to make the most of
its limited storage budget. The main idea of AMPM is to
track every cache line in large 16 KB regions of memory,
and then to exhaustively determine if any strides can be dis-
covered through the use of pattern matching, and then to
prefetch along those strides.
AMPM tracks address maps for 52 separate 16 KB regions
of memory, each of which maintains 2 bits of state per cache
line in the region, corresponding to whether the line has
not been accessed, has been demand accessed, or has been
prefetched. This address map is updated on every L2 access
and prefetch (L2 was the last level of cache in their work).
Also on every L2 access, the address map corresponding
to the current access is retrieved from a fully-associative
array of address maps, and is placed in a large shift regis-
ter to align the map with the current access address. Then
it attempts to match 256 separate patterns with this shifted
address map, each pattern match requiring two compares to
discover series of strided accesses centered around the cur-
rent access. This generates a list of candidate prefetches,
and a number of these are prefetched according to a dy-
namically changing prefetch degree, and in the order of the
smallest magnitude offset to the largest.
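The pattern matching itself can be approximated in software as follows; this is our reading of the two-compare stride test (lines at -d and -2d accessed implies a candidate prefetch at +d), with the 2-bit line states collapsed to a single accessed bit:

    # Simplified sketch of AMPM-style pattern matching. 128 stride
    # magnitudes in two directions give the 256 patterns mentioned
    # above. access_map holds a per-cache-line accessed bit, already
    # aligned so that index pos is the current access.
    def ampm_candidates(access_map, pos, max_offset=128):
        candidates = []
        for d in range(1, max_offset + 1):
            for stride in (d, -d):
                i1, i2 = pos - stride, pos - 2 * stride
                if (0 <= i1 < len(access_map) and 0 <= i2 < len(access_map)
                        and access_map[i1] and access_map[i2]):
                    candidates.append(stride)  # prefetch line pos + stride
        # issue smallest-magnitude offsets first, per the text
        return sorted(candidates, key=abs)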
4 Sandbox Prefetching
Sandbox Prefetching (SBP) represents another class of
prefetcher, and combines the ideas of global confirmation
with immediate action to aggressively, and safely, per-
form prefetches. All of our discussion and evaluations are
made in the context of a two-level cache hierarchy, and
prefetches happen exclusively from memory to L2.
4.1 Overview
SBP operates on the principle of validating the accuracy
of aggressive, immediate offset prefetchers in a safe, sand-
boxed environment, where neither the real cache nor mem-
ory bandwidth are disturbed, and then deploying them in
the real memory hierarchy only if they prove that they can
accurately prefetch useful data. A set of candidate prefetch-
ers, each corresponding to a specific cache line offset, are
constantly evaluated and re-evaluated for accuracy, and the
most accurate of them are allowed to issue real prefetches
to main memory.

Figure 1. Sandbox Prefetching's place in the memory hierarchy.
Immediate prefetchers have a single prefetch action,
which they perform in every situation where they are used.
A next-line prefetcher will always fetch the plus-one cache
line, regardless of the input it receives. It is the same with
the candidate offset prefetchers. Each one of them will per-
form a prefetch with a specific offset from the current input
cache line address. Prefetcher accuracy is a concern, and
we therefore cannot allow all candidate prefetchers to issue
prefetches all the time.
Candidates are evaluated by simulating their prefetch ac-
tion and measuring the simulated effect. This is done by
adding prefetch addresses into a sandbox, rather than issu-
ing real prefetches to main memory. The sandbox is a struc-
ture which implements a set and keeps track of addresses
which have been added to it. Subsequent cache accesses test
the sandbox to see if their address can be found there. If the
address is found there, then the current candidate prefetcher
could have accurately prefetched this cache line, and that
candidate’s accuracy score is increased. The accuracy score
is used to tell which, if any, of the candidate prefetchers has
qualified to issue real prefetches in the real memory hier-
archy. If the address is not found there, then that means
the current candidate prefetcher could not have accurately
prefetched this line.
Candidate prefetchers are not “confirmed” in the con-
text of a single access stream, as in a stream prefetcher, but
rather in the context of all memory access patterns present
in the currently executing program. We do not test whether
a particular offset prefetcher, which prefetches offset O from
the current cache line address, is accurate for only a single
stream; rather, we test whether, for every access A, there is
a subsequent access to A+O. If the pattern holds true for a
large enough number of cache accesses, then the candidate
prefetcher is turned on in the real memory hierarchy.
Each candidate is evaluated for a fixed number of L2 ac-
cesses, and then the contents of the sandbox are reset, and
the next candidate is evaluated.

Figure 2. Sandbox Prefetching acts on every L2 access.
4.2 The Sandbox
The sandbox of Sandbox Prefetching is implemented as
a Bloom filter [16]. Each prefetch address generated by the
current candidate prefetcher is added to the sandbox Bloom
filter, and each time there is a cache access, the Bloom fil-
ter is tested to see if the cache line address is contained
in it. The sandbox can be thought of as tracking an un-
ordered history of all prefetch addresses the current candi-
date prefetcher has generated.
Because of the probabilities governing Bloom filters, the
size of the Bloom filter is directly related to how many items
can be added to it before the false positive rate rises above
a desirable level. Because each candidate prefetcher gener-
ates only a single prefetch address, we will add a number of
items to the Bloom filter equal to the number of L2 accesses
in an evaluation period. We experimentally determined that
an evaluation period of 256 L2 accesses is optimal for the
tested workloads. We chose a Bloom filter size of 2048 bits
(256 bytes), which for 256 item insertions gives us a maximum
expected false positive rate of approximately 1%.
There is only one sandbox per core, and the candidate
prefetchers are evaluated one at a time, in a time multi-
plexed fashion, with the sandbox being reset in between
each evaluation. This means there is no opportunity for
cross-contamination between multiple candidate prefetchers
sharing a sandbox.
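The paper does not specify the sandbox's hash functions, so the following sketch is one plausible software rendering of a 2048-bit sandbox with the add, test, and reset operations used above; the hashing scheme and hash count are our choices, not the paper's:

    import hashlib

    class Sandbox:
        # 2048-bit Bloom filter; the hash scheme is illustrative. With
        # m = 2048 bits and n = 256 insertions, the standard estimate
        # (1 - e^(-k*n/m))^k puts the false-positive rate in the low
        # single-digit percent range for small k.
        M_BITS = 2048
        K_HASHES = 4

        def __init__(self):
            self.bits = 0  # bit array packed into a Python int

        def _positions(self, line_addr):
            for i in range(self.K_HASHES):
                h = hashlib.blake2b(f"{i}:{line_addr}".encode()).digest()
                yield int.from_bytes(h[:4], "little") % self.M_BITS

        def add(self, line_addr):  # record a simulated prefetch
            for p in self._positions(line_addr):
                self.bits |= 1 << p

        def maybe_contains(self, line_addr):  # may rarely false-positive
            return all(self.bits >> p & 1 for p in self._positions(line_addr))

        def reset(self):  # cleared between evaluation periods
            self.bits = 0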

4.3 Candidate Evaluation
Sandbox Prefetching maintains a set of 16 candidate
prefetchers, which are evaluated in round-robin fashion.
Initially this set of prefetchers is for offsets -8 to -1, and +1
to +8. At the beginning of an evaluation period, the sand-
box, the L2 access counter, and the prefetch accuracy score
are all reset, along with other counters which track period
cache reads, writes, and prefetches, which are used to ap-
proximate bandwidth usage.
Each time the L2 cache is accessed, the cache line ad-
dress is used to check the sandbox to see if this line would
have been prefetched by the current candidate prefetcher.
If it is a hit, then the prefetch accuracy score is incre-
mented; otherwise, nothing happens. After this, the can-
didate prefetcher generates a prefetch address, based on the
reference cache line address and its own prefetch offset, and
adds this address to the sandbox. Finally, the counter that
tracks the number of L2 accesses this period is incremented.
Once this number reaches 256, the evaluation period is over
and the sandbox and other counters are reset, and the evalu-
ation of the next candidate prefetcher begins.
After a complete round of evaluating every candidate
prefetcher is over, the 4 prefetchers with the lowest
prefetch accuracy score are cycled out, and 4 more offset
prefetchers that have not been recently evaluated from the
range -16 to +16 are cycled in.
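Putting Sections 4.2 and 4.3 together, the per-access evaluation step can be sketched as a simplified software model; it uses the Sandbox sketch above, omits the bandwidth counters and the cycling of the worst four candidates, and the helper names are ours:

    PERIOD = 256  # L2 accesses per evaluation period

    class CandidateEvaluator:
        def __init__(self):
            # initial candidate offsets: -8 to -1 and +1 to +8
            self.candidates = [o for o in range(-8, 9) if o != 0]
            self.scores = {o: 0 for o in self.candidates}
            self.current = 0          # candidate under evaluation
            self.accesses = 0
            self.sandbox = Sandbox()  # from the sketch in Section 4.2

        def on_l2_access(self, line_addr):
            offset = self.candidates[self.current]
            # 1. score a hit if this line would have been prefetched
            if self.sandbox.maybe_contains(line_addr):
                self.scores[offset] += 1
            # 2. simulate the candidate's prefetch action in the sandbox
            self.sandbox.add(line_addr + offset)
            # 3. end of period: reset, then evaluate the next candidate
            self.accesses += 1
            if self.accesses == PERIOD:
                self.accesses = 0
                self.sandbox.reset()
                self.current = (self.current + 1) % len(self.candidates)
                # incoming candidate starts its period with a fresh score
                self.scores[self.candidates[self.current]] = 0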
4.4 Prefetch Action
As soon as a candidate prefetcher has finished its evalu-
ation, it may be used to issue real prefetches, if its accuracy
score is high enough. In addition to all of the candidate
evaluation that is done, each L2 access may result in one
or more prefetches being issued to main memory. We con-
trol the number of prefetches that are issued by estimating
the amount of bandwidth each core has consumed during
its last evaluation period, and then using that to estimate
the amount of unused bandwidth available to be used for
additional prefetches. Each core in a multi-core setup gets
a prefetch degree budget proportional to the number of L2
accesses it performs. This prefetch degree is bounded by a
minimum of one prefetch per prefetch direction (positive
and negative), per core, per L2 access, and a maximum of
eight. The prefetch degree is recalculated at the end of
each evaluation period.
Evaluated prefetchers with lower numbered offsets are
given preference to issue their prefetches first (and there-
fore use up some of the prefetch degree budget first). There
is an accuracy score cutoff point, below which an evalu-
ated prefetcher will not be allowed to issue any prefetches.
Prefetches continue until a number of prefetches equal to
the prefetch degree has been issued, and then the process is
repeated for the negative offset prefetchers. The actual offsets of
the evaluated prefetchers can change as the less accurate
candidate prefetchers are cycled out, so there will need to
be some hardware logic to decide the order that evaluated
prefetchers will be considered to issue their prefetches. The
specific values of the cutoff points will be discussed in the
next subsection.

Table 1. Sandbox Prefetching parameters and counters.

  Sandbox Size                     2048 bits
  Evaluation Period                256 L2 accesses
  Total PF Candidates              32
  Candidate Offset Ranges          -16 to +16, excluding 0;
                                   16 evaluated per round,
                                   then worst 4 cycled out
  Candidate Score Storage          16 10-bit counters
  Prefetch Accuracy Cutoffs to     256 (1 PF)
  Issue Multiple Prefetches        512 (2 PFs)
  per L2 Access                    768 (3 PFs)
  Bandwidth Estimation Counters    Read Counter,
                                   Write Counter,
                                   Prefetch Counter
It is important to keep in mind that there is no additional
confirmation before prefetches are issued at this stage. All
of the confirmation has already been done globally in the
sandbox during the offset prefetcher’s evaluation.
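The issue policy above, using the Table 1 cutoffs of 256, 512, and 768 for one, two, or three prefetches per L2 access, can be sketched as follows; the per-direction budget handling is simplified relative to the bandwidth-based calculation in the text:

    # Accuracy cutoffs from Table 1: score -> prefetches per L2 access.
    CUTOFFS = [(768, 3), (512, 2), (256, 1)]

    def prefetches_for(score):
        for cutoff, n_pf in CUTOFFS:
            if score >= cutoff:
                return n_pf
        return 0  # below the lowest cutoff: may not prefetch at all

    def issue_prefetches(line_addr, scores, degree_budget):
        # scores: offset -> accuracy score from the last evaluation.
        # Lower-numbered offsets get preference, and the positive and
        # negative directions are budgeted separately, per the text.
        issued = []
        for sign in (+1, -1):
            budget = degree_budget
            for o in sorted((o for o in scores if o * sign > 0), key=abs):
                n = min(prefetches_for(scores[o]), budget)
                # prefetch ahead along the qualified offset's stride
                issued += [line_addr + i * o for i in range(1, n + 1)]
                budget -= n
                if budget == 0:
                    break
        return issued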
4.5 Detecting Streams
So far we have focused on the sandbox’s ability to de-
tect the accuracy of offset prefetches, but it can also be used
to detect strided access streams. When the sandbox is be-
ing probed to see if the current L2 access cache line ad-
dress could have been prefetched by the current candidate
prefetcher, we can also act as though this access is the latest
in a strided stream of accesses (where the stride is equal to
the offset of the current candidate prefetcher), and test to
see if earlier addresses in this strided stream are also found
in the sandbox.
For example, if the current candidate prefetcher’s offset
is +3, whenever we check the sandbox to see if the current
cache line address A is found in it, we can also check for
A-3, A-6, A-9, and so on. If those addresses are also found in
the sandbox, then that means that the program is accessing
a stream with stride +3. When we were only considering
individual offsets that could be prefetched, there was no op-
portunity to prefetch “ahead,” because there was no stream
to follow. But now that we can accurately detect strided
streams in the access pattern, it makes sense that each can-
didate prefetcher be allowed to prefetch more than a single
line.
We treat the detection of earlier members of a stream in
the sandbox the same as we treat the detection of the cur-
rent access address, by incrementing the sandbox accuracy
score for each line found. We probe the sandbox for the cur-
rent address and the previous three members of the stream,
so it’s possible that on each L2 access, the prefetch accu-
racy score is incremented up to four times.
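Folded into the evaluation step of Section 4.3, the stream probe replaces the single sandbox test with up to four probes, each contributing one point to the accuracy score; a sketch under the same assumptions as the earlier code:

    STREAM_DEPTH = 3  # earlier stream members probed, per the text

    def probe_with_streams(sandbox, line_addr, offset):
        # Score increment for one L2 access: one point per probe that
        # hits, for the current address and the previous three members
        # of a stride-`offset` stream.
        probes = [line_addr - back * offset
                  for back in range(0, STREAM_DEPTH + 1)]
        return sum(1 for a in probes if sandbox.maybe_contains(a))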
