
Using Dead Blocks as a Virtual Victim Cache
Samira Khan
Dept. of Computer Science
University of Texas at San Antonio
skhan@cs.utsa.edu
Daniel A. Jiménez
Dept. of Computer Science
University of Texas at San Antonio
dj@cs.utsa.edu
Doug Burger
Microsoft Research
dburger@microsoft.com
Babak Falsafi
Parallel Systems Architecture Lab
École Polytechnique Fédérale de Lausanne
babak.falsafi@epfl.ch
Abstract
Caches mitigate the long memory latency that limits the
performance of modern processors. However, caches can
be quite inefficient. On average, a cache block in a 2MB L2
cache is dead 59% of the time, i.e., it will not be referenced
again before it is evicted. Increasing cache efficiency can
improve performance by reducing miss rate, or alternately,
improve power and energy by allowing a smaller cache with
the same miss rate.
This paper proposes using predicted dead blocks to hold
blocks evicted from other sets. When these evicted blocks
are referenced again, the access can be satisfied from the
other set, avoiding a costly access to main memory. The
pool of predicted dead blocks can be thought of as a virtual
victim cache. A virtual victim cache in a 16-way set asso-
ciative 2MB L2 cache reduces misses by 11.7%, yields an
average speedup of 12.5% and improves cache efficiency by
15% on average, where cache efficiency is defined as the
average time during which cache blocks contain live infor-
mation. This virtual victim cache yields a lower average
miss rate than a fully-associative LRU cache of the same
capacity.
The virtual victim cache significantly reduces cache
misses in multi-threaded workloads. For a 2MB cache ac-
cessed simultaneously by four threads, the virtual victim
cache reduces misses by 12.9% and increases cache effi-
ciency by 16% on average
Alternately, a 1.7MB virtual victim cache achieves about
the same performance as a larger 2MB L2 cache, reducing
the number of SRAM cells required by 16%, thus maintain-
ing performance while reducing power and area.
1. Introduction
The performance gap between modern processors and
memory is a primary concern for computer architecture.
Processors have large on-chip caches and can access a block
in just a few cycles, but a miss that goes all the way to
memory incurs hundreds of cycles of delay. Thus, reduc-
ing cache misses can significantly improve performance.
One way to reduce the miss rate is to increase the num-
ber of live blocks in the cache. A cache block is live if it
will be referenced again before its eviction. From the last
reference until the block is evicted the block is dead [13].
Studies show that cache blocks are dead most of the time;
for the benchmarks and 2MB L2 cache used for this study,
cache blocks are dead on average 59% of the time. Dead
blocks lead to poor cache efficiency [15, 4] because after
the last access to a block, it resides a long time in the cache
before it is evicted. In the least-recently-used (LRU) re-
placement policy, after the last access, every block has to
move down from the MRU position to the LRU position
and then it is evicted. Cache efficiency can be improved
by replacing dead blocks with live blocks as soon as pos-
sible after a block becomes dead, rather than waiting for it
to be evicted. Having more live blocks in a cache of the same
size improves system performance by reducing the miss
rate; more live blocks means more cache hits. Alternately,
a technique that increases the number of live blocks may
allow reducing the size of the cache, resulting in a system
with the same performance but reduced power and energy
needs.
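As a concrete illustration of the efficiency metric used throughout this paper, the following sketch (not from the paper; the bookkeeping and names are illustrative) accumulates, per block generation, the time between fill and last access as live time and the full fill-to-eviction time as resident time:

def on_fill(block, now):
    block["fill"] = now
    block["last_access"] = now

def on_access(block, now):
    block["last_access"] = now

def on_evict(block, now, totals):
    # A block is live from its fill until its last access, and dead from
    # the last access until its eviction.
    totals["live"] += block["last_access"] - block["fill"]
    totals["resident"] += now - block["fill"]

def efficiency(totals):
    # Cache efficiency: fraction of block-resident time spent live.
    return totals["live"] / totals["resident"] if totals["resident"] else 0.0

Starting from totals = {"live": 0, "resident": 0}, the 59% dead-time figure quoted above corresponds to an efficiency of roughly 0.41 under this definition.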
This paper describes a technique to improve cache per-
formance by using predicted dead blocks to hold victims
from cache evictions in other sets. The pool of predicted
dead blocks can be thought of as a virtual victim cache
(VVC). Figure 1 graphically depicts the efficiency of a 1MB
16-way set associative L2 cache with LRU replacement for
the SPEC CPU 2006 benchmark 456.hmmer. The amount
of time each cache block is live is shown as a greyscale in-
tensity. Figure 1(a) shows the unoptimized cache. The dark-
ness shows that many blocks remain dead for large stretches
of time. Figure 1(b) shows the same cache optimized with
the VVC idea. Now many blocks have more live time so the
cache is more efficient.

Figure 1. Virtual victim cache increases cache efficiency. Block efficiency (i.e., fraction of time a block is live) shown as greyscale intensities for 456.hmmer for (a) a baseline 1MB cache and (b) a VVC-enhanced cache; darker blocks are dead longer. (Greyscale legend: 0.0 to 1.0.)
The VVC idea uses a dead block predictor, i.e., a mi-
croarchitectural structure that uses past history to predict
whether a given block is likely to be dead at a given time.
This study uses a trace based dead block predictor [13].
This idea has some similarity to the victim cache [7], but
victims are stored in the same cache from which they were
evicted, simply moving from one set to another. When a
victim block is referenced again, the access can be satisfied
from the other set. Another way to view the idea is as an
enhanced combination of block insertion (i.e. placement)
policy, search strategy, and replacement policy. Blocks are
initially placed in one set and migrated to a less active set
when they become least-recently-used. A more active block
is found with one access to the tag array, and a less active
block may be found with an additional search.
1.1. Contributions
This paper explores the idea of placing victim blocks into
dead blocks in other sets. This strategy reduces the num-
ber of cache misses per thousand instructions (MPKI) by
11.7% on average with a 2MB L2 cache, yields an average
speedup of 12.5% over the baseline and improves cache ef-
ficiency by 15% on average. The VVC outperforms a fully
associative cache with the same capacity; thus, VVC does
not simply improve performance by increasing associativ-
ity. The VVC also outperforms a real victim cache that uses
the same additional hardware budget as the VVC structures,
e.g. the predictor tables. It also provides an improvement
for multi-threaded workloads for a variety of cache sizes.
This paper introduces a new dead block predictor organi-
zation inspired by branch predictors. This organization re-
duces harmful false positive predictions by over 10% on av-
erage, significantly improving the performance of the VVC
with the potential to improve other optimizations that rely
on dead block prediction.
The VVC idea includes a block insertion policy driven
by cache evictions and dead block predictions. Using
an adaptive insertion policy, the VVC gives an average
speedup of 17.3% over the baseline 2MB cache.
2. Related Work
In this section we discuss related work. Previous work
introduced several dead block predictors and applied them
to problems such as prefetching and block replacement [13,
15, 10, 6, 1], but did not explore coupling dead block pre-
diction with alternative block placement strategies.
The VVC depends on an accurate means of determining
which blocks are dead and thus candidates to replace with
victims from other sets. We discuss related work in dead
block prediction.
2.1. Trace Based Predictor
The concept of a Dead Block Predictor (DBP) was intro-
duced by Lai et al. [13]. The main idea is that, if a given se-
quence of accesses to a given cache block leads to the death
(i.e. last access) of the block, then that same sequence of ac-
cesses to a different block is likely to lead to the death of that
block. An access to a cache block is represented by the pro-
gram counter (PC) of the instruction making the access.
The sequence, or trace, of PCs of the instructions access-
ing a block is encoded as the fixed-length truncated sum
of hashes of these PCs. This trace encoding is called a sig-
nature. For a given cache block, the trace for the sequence
of PCs begins when the block is refilled and ends when the
block is evicted. The predictor learns from the trace encod-
ing of the evicted blocks. A table of saturating counters is
indexed by block signatures. When a block is replaced, the

counter associated with it is incremented. When a block
is accessed, the corresponding counter is decremented. A
block is predicted dead when the counter corresponding to
its trace exceeds a threshold.
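A minimal sketch of this mechanism follows; the table size, hash, counter width, and threshold are illustrative assumptions rather than the paper's exact parameters.

SIG_BITS = 15
MASK = (1 << SIG_BITS) - 1
THRESHOLD = 2                      # predict dead at or above this count
table = [0] * (1 << SIG_BITS)      # 2-bit saturating counters

def on_fill(block):
    block["sig"] = 0               # the trace starts over when the block is refilled

def on_access(block, pc):
    # This signature did not mark a death, so weaken it, then fold the PC in.
    table[block["sig"]] = max(0, table[block["sig"]] - 1)
    block["sig"] = (block["sig"] + (pc & MASK)) & MASK
    block["predicted_dead"] = table[block["sig"]] >= THRESHOLD

def on_evict(block):
    # The signature at eviction was produced by the block's last access.
    table[block["sig"]] = min(3, table[block["sig"]] + 1)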
This dead block predictor is used to prefetch data into
predicted dead blocks in the L1 data cache, enabling looka-
head prefetching and eliminating the necessity of prefetch
buffers. That work also proposes a dead block correlating
prefetcher that uses address correlation to determine which
block to prefetch in the dead blocks.
A trace based predictor is also used to optimize a cache
coherence protocol [12, 24]. Dynamic self-invalidation
involves another kind of block “death” due to coherence
events [14]. PC traces are used to detect the last touch and
invalidate the shared cache blocks to reduce cache coher-
ence overhead.
2.2. Counting Based Predictor
Dead blocks can also be predicted depending on how
many times a block has been accessed. Kharbutli and Soli-
hin propose a counting based predictor for an L2 cache
where a counter for each block records how many times
the block has been referenced [10]. When a block is evicted
the history table stores the reference count and the first PC
that brought that block into the cache. When a block is
brought into the cache again by the same PC, the dead block
predictor predicts it to be dead after the number of refer-
ences reaches the threshold value stored in the history table.
Kharbutli and Solihin use this counter based dead block pre-
dictor to improve the LRU replacement policy [10]. This
improved LRU policy replaces a dead block if available, the
LRU block if not. Our technique also replaces predicted
dead blocks with other blocks, but the other blocks are vic-
tims from other sets, effectively extending associativity in
the same way a victim cache does.
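A sketch of the counting-based prediction scheme described above follows; indexing the history table by block address, and the exact fields stored, are assumptions made for illustration.

history = {}   # block address -> (PC that filled the block, reference count at eviction)

def on_fill(block, addr, pc):
    block["fill_pc"] = pc
    block["refs"] = 0
    entry = history.get(addr)
    # Only trust the learned count if the same PC brought the block in.
    block["learned"] = entry[1] if entry and entry[0] == pc else None

def on_access(block):
    block["refs"] += 1
    block["predicted_dead"] = (block["learned"] is not None
                               and block["refs"] >= block["learned"])

def on_evict(block, addr):
    history[addr] = (block["fill_pc"], block["refs"])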
2.3. Time Based Predictor
Another approach to dead block prediction is to predict
a block dead when it has not been accessed for a certain number
of cycles. Hu et al. proposed a time based dead block pre-
dictor [6]. It learns the number of cycles a block remains live and
predicts the block dead once it has gone unaccessed for more than twice
the number of cycles it had been live. This predictor is
used to prefetch data into the L1 cache. This work also pro-
posed using dead times to filter blocks in the victim cache.
Blocks with shorter dead times are likely to be reused before
getting evicted from the victim cache, so the time based victim
cache stores only blocks that are likely to be reused.
Abella et al. propose another time based predictor [1]. It
also predicts a block dead if it has not been accessed for
a certain number of cycles, but here the number of cy-
cles is calculated from the number of accesses to that block.
Abella et al. use this prediction to reduce L2 cache leakage by dynami-
cally turning off cache blocks whose contents are not likely to
be reused, without hurting performance.
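A sketch of a time-based rule of this kind, using the current generation's live time, is given below; the per-block timestamps and the handling of single-access blocks are simplifications, not the published scheme's exact details.

def on_fill(block, now):
    block["fill"] = now
    block["last_access"] = now

def on_access(block, now):
    block["last_access"] = now

def predicted_dead(block, now):
    # Live time accumulated so far in this generation, and idle time since
    # the last access; predict dead once idle time exceeds twice the live time.
    live = block["last_access"] - block["fill"]
    idle = now - block["last_access"]
    return live > 0 and idle > 2 * live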
2.4. Cache Burst Predictor
Cache bursts [15] can be used with trace based, counting
based and time based dead block predictors. A cache burst
consists of all the contiguous accesses that a block receives
while in the most-recently-used (MRU) position. Instead of
on each individual reference, a cache burst based predictor up-
dates its tables only once per burst. It also improves pre-
diction accuracy by making a prediction only when a block
moves out of the MRU position. The dead block predictor
needs to store trace or reference count information only for each
burst rather than for each reference. But since a predic-
tion is made only after a block becomes non-MRU, some
of the dead time is lost compared to non-burst predictors.
Cache burst predictors have been used to improve prefetching, bypassing,
and LRU replacement for both the L1 data cache
and the L2 cache.
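The burst refinement can be sketched as follows; the underlying predictor's update/predict interface is left abstract and is an assumption of this sketch.

def access(mru_list, block, pc, predictor):
    # mru_list holds the blocks of one set, most recently used first.
    if mru_list and mru_list[0] is block:
        block["burst_pc"] = pc          # still MRU: fold into the current burst
        return
    if mru_list:
        # The old MRU block leaves the MRU position: its burst has ended, so
        # train the predictor and make a single prediction for it now.
        old = mru_list[0]
        predictor.update(old, old.get("burst_pc"))
        old["predicted_dead"] = predictor.predict(old)
    if block in mru_list:
        mru_list.remove(block)
    mru_list.insert(0, block)           # accessed block becomes MRU
    block["burst_pc"] = pc              # and starts a new burst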
2.5. Other Dead Block Predictors
Another kind of dead block prediction involves predict-
ing in software [26, 21]. In this approach the compiler col-
lects dead block information and provides hints to the mi-
croarchitecture to make cache decisions. If a cache block
is likely to be reused again it hints to keep the block in the
cache; otherwise, it hints to evict the block.
2.6. Cache Placement and Replacement
Policy
Adaptive insertion policy [18] adaptively inserts in-
coming lines in the LRU position when the working set
size becomes larger than the cache size. Keramidas et al. [9]
proposed a cache replacement policy that uses reuse dis-
tance prediction. This policy tries to evict cache blocks that
will be reused furthest in the future. A memory-level par-
allelism aware cache replacement policy relies on the fact
that isolated misses are more costly on performance than
parallel misses [19].
3. Using Dead Blocks as a Virtual Victim Cache
Victim caches [7] work well because they effectively ex-
tend the associativity of any hot set in the cache, reducing
localized conflict misses. However, victim caches must be
small because of their high associativity, and are flushed
quickly if multiple hot sets are competing for space. Thus,
victim caches do not reduce capacity misses appreciably,
nor conflict misses where the reference patterns do not pro-
duce a new reference to the victim quickly, but they provide
excellent miss reduction for a small additional amount of
state and complexity. Larger victim caches have not come
into wide use because any additional miss reduction benefits
are outweighed by the overheads of the larger structures.
Large caches already contain significant quantities of un-
used state, however, which in theory can be used for opti-
mizations similar to victim caches if the unused state can
be identified and harvested with sufficiently low overhead.

Since the majority of the blocks in a cache are dead at any
point in time, and since dead-block predictors have been
shown to be accurate in many cases, the opportunity ex-
ists to replace these dead blocks with victim blocks, mov-
ing them back into their set when they are accessed. This
virtual victim cache approach has the potential to reduce
both capacity misses and additional conflict misses: Capac-
ity misses can be reduced because dead blocks are evicted
before potentially live blocks that have been accessed less
recently (avoiding misses that would occur with full asso-
ciativity), and conflict misses can be further reduced if hot
set overflows can spill into other dead regions of the cache,
no matter how many hot sets are active at any one time.
An important question is how the dead blocks outside
of a set are found and managed without adding prohibitive
overhead. By coupling small numbers of sets, and mov-
ing blocks overflowing from one set into the predicted dead
blocks (which we call receiver blocks) of a "partner set," a
virtual victim cache can be established with little additional
overhead. While this approach effectively creates a higher-
associativity cache, the power overheads are kept low be-
cause only the original set is searched the majority of the
time, with the partner sets only searched upon a miss in the
original set. The overheads include more tag bits (the log
of the number of partner sets) and more energy and latency
incurred on a cache miss, since the partner sets are searched
to no avail.
3.1. Identifying Potential Receiver Blocks
A trace based dead block predictor keeps a trace encod-
ing for each cache block. The trace is updated on each use
of the block. When a block is evicted from the cache, a sat-
urating counter associated with that block’s trace is incre-
mented. When a block is used, the counter is decremented.
Ideally, any victim block could replace any receiver
block in the entire cache, resulting in the highest possible
usage of the dead blocks as a virtual victim cache. How-
ever, this idea would increase the dead block hit latency and
energy as every set in the cache would have to be searched
for a hit. Thus, there is a trade-off between the number of
sets that can store victim blocks from a particular set and
the time and energy needed for a hit. We have determined
that, for each set, considering only one other partner set to
identify a receiver block yields a reasonable balance. Sets
are paired into adjacent sets that differ in their set indices
by one bit.
3.2. Placing Victim Blocks into the Adja-
cent Set
When a victim block is evicted from a set, the adjacent
set is searched for invalid or predicted dead receiver blocks.
If no such block is found, then the LRU block of the ad-
jacent set is used. Once a receiver block is identified, the
victim block replaces it. The victim block is placed into the
most-recently-used (MRU) position in the adjacent set.
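A sketch of this insertion path is given below; the set and block bookkeeping is illustrative, and v is the adjacent partner set described above.

def place_victim(sets, v, victim_tag):
    ways = sets[v]                       # blocks of the partner set, MRU first
    # Prefer an invalid block, then a predicted-dead block, else the LRU block.
    receiver = (next((b for b in ways if not b["valid"]), None)
                or next((b for b in ways if b["predicted_dead"]), None)
                or ways[-1])
    receiver.update(tag=victim_tag, valid=True, is_receiver=True,
                    predicted_dead=False)
    ways.remove(receiver)
    ways.insert(0, receiver)             # victims enter the MRU position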
3.3. Block Identification in the VVC
If a previously evicted block is referenced again, the tag
match will fail in the original set, the adjacent set will be
searched, and if the receiver block has not yet been evicted
then the block will be found there. The block will then
be refilled in the original set from the adjacent set, and the
block in the adjacent set will be marked as invalid. A small
penalty for the additional tag match and fill will accrue to
this access, but this access is considered a hit in the L2 for
purposes of counting hits and misses (analogously, an ac-
cess to a virtually-addressed cache following a TLB miss
may still be considered a hit, albeit with an extra delay).
To distinguish receiver blocks from other blocks, we
keep an extra bit with each block that is true if the block
is a receiver block, false otherwise. When a set’s tags are
searched for a normal cache access, receiver blocks from
the adjacent set are prevented from matching to maintain
correctness. Note that keeping this extra bit is equivalent to
keeping an extra tag bit in a higher associativity cache.
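A sketch of the tag match with the receiver bit follows; the flag distinguishing a normal access from a partner-set probe is an illustrative name.

def find_block(ways, tag, probing_partner=False):
    for b in ways:
        if not b["valid"] or b["tag"] != tag:
            continue
        # Normal accesses must not match blocks held for the partner set;
        # a VVC probe of the partner set matches only receiver blocks.
        if b["is_receiver"] != probing_partner:
            continue
        return b
    return None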
Figure 2(a) shows what happens when an LRU block is
evicted from a set s. If the adjacent set v has any predicted
dead or invalid block in it, the victim block replaces that
block; otherwise, the LRU block of set v is used. Similarly,
Figure 2(b) depicts a VVC hit. If the access results in a miss
in the original set s, that block can be found in a receiver
block of the adjacent set v. Algorithm 1 shows the complete
algorithm for the VVC.
Algorithm 1 Virtual Victim Cache with Trace based Pre-
dictor
On an access to set s with address a, PC pc
if the access is a hit in block blk then
    blk.trace ← updateTrace(blk.trace, pc)
    isDead ← lookupPredictor(blk.trace)
    if isDead then
        mark blk as dead
    end if
    return
end if
/* otherwise, search the adjacent set for a dead block hit */
v ← adjacentSet(s)
access set v with address a
if the access is a hit in a dead block dblk then
    bring dblk back into set s
    return
end if
/* this access is a miss */
repblk ← block chosen by LRU policy in set s
updatePredictor(repblk.trace)
place repblk in an invalid/dead/LRU block in set v
place block for address a into repblk
repblk.trace ← updateTrace(pc)
return

.
.
.
MRU
.
.
LRU
.
.
.
.
Set s
Set v = adjacentSet(s)
LRU
dead block
MRU
.
.
.
.
.
.
.
.
.
Set v = adjacentSet(s)Set s
MRU
MRU
LRU
blk b from set s
LRU
No tag match for block b tag match for block b
(a) (b)
Figure 2. (a) Placing evicted block into an adjacent partner set, and (b) hitting in the virtual victim
cache
3.4. Caching Predicted Dead Blocks
Note that blocks that are predicted dead and evicted from
a set may be cached in the VVC. Although it might seem
counterintuitive to replace one dead block with another dead
block, this policy does give an advantage over simply dis-
carding predicted dead blocks because the predictor might
be wrong, i.e. one or both blocks might not be dead. We
favor the block from the hotter set, likely to be the set just
accessed.
3.5. Implementation Issues
Adjacent sets differ in one bit, bit k. The set adjacent
to set index s is s exclusive-ORed with 2^k. A value of
k = 3 provides good performance, although performance
is largely insensitive to the choice of k. Victims replace re-
ceiver blocks in the MRU position of the adjacent set and
are allowed to be evicted just as any other block in the set.
Evicted receiver blocks are not allowed to return to their
original sets, i.e., evicted blocks may not “ping-pong” back
and forth between adjacent sets.
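The pairing function itself is a one-line XOR; a sketch with the k = 3 value mentioned above:

K = 3

def adjacent_set(s):
    return s ^ (1 << K)        # partner set differs from s in index bit k

# The pairing is symmetric: adjacent_set(adjacent_set(s)) == s, so each set
# both donates victims to and receives victims from the same partner.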
Each cache block keeps the following additional infor-
mation: whether or not it is a receiver block (1 bit), whether
or not the block is predicted dead (1 bit), and the truncated
sum representing the trace for this block (14 bits). The dead
block predictor additionally keeps two tables of two-bit sat-
urating counters indexed by traces. The predictor tables
consume an additional 2^14 entries × 2-bit counters × 2 ta-
bles = 64 kilobits, or 8 kilobytes.
4. Skewed Dead Block Predictor
In this section we discuss a new dead block predictor
based on the reference trace predictor of Lai et al. [13] as
well as skewed table organizations [22, 16].
4.1. Reference Trace Dead Block Predictor
The reference trace predictor collects a trace of the in-
structions used to access a particular block. The theory is
that, if a sequence of memory instructions to a block leads
to the last access of that block, then the same sequence of in-
structions should lead to the last access of other blocks. The
reference trace predictor encodes the path of memory access
instructions leading to a memory reference as the truncated
sum of the instructions’ addresses. This truncated sum is
called a signature. Each cache block is associated with a
signature that is cleared when that cache block is filled and
updated when that block is accessed. The signature is used
to access a table of two-bit saturating counters. When a
block is accessed, the corresponding counter is decremented
and then the signature is updated. When a block is evicted,
the counter is incremented. Thus, a counter is only incre-
mented by a signature resulting from the last access to a
block.
When a block is accessed and then the signature is up-
dated, the table of counters is consulted. If the counter ex-
ceeds a threshold (e.g. 2), then the block is predicted dead.
Each cache block stores a single bit prediction. For compar-
ison, we use a 15-bit signature indexing a 32K-entry table
of counters. The predictor exclusive-ORs the first 15 bits
of each PC with the next 15 bits and adds this quantity to a
running 15-bit trace.
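As a sketch, the signature update described above can be written as:

SIG_BITS = 15
MASK = (1 << SIG_BITS) - 1

def fold_pc(pc):
    # XOR the low 15 bits of the PC with the next 15 bits.
    return (pc ^ (pc >> SIG_BITS)) & MASK

def update_signature(sig, pc):
    # Add the folded PC into the running truncated 15-bit sum.
    return (sig + fold_pc(pc)) & MASK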
The original reference trace predictor of Lai et al. uses
data addresses as well as instruction addresses, requiring a
large table because of the high number of signatures. Subse-
quent work found that using only instruction addresses was
sufficient and allows smaller tables [15]; thus, we use only
instruction addresses for all of the predictors in this paper.
4.2. A Skewed Organization
In the original trace-based dead block predictor, a single
table is indexed with the signature. For this study, we ex-
plore an organization that uses the idea of a skewed organi-
zation [22, 16] to reduce the impact of conflicts in the table.
The predictor keeps two 16K-entry tables of 2-bit counters,
each indexed by a different 14-bit hash of the 15-bit block
signature. Each access to the predictor yields two counter
values. The sum of these values is used as a confidence that
is compared with a threshold; if the threshold is met, then
the block is predicted dead.
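A sketch of this skewed organization follows; the two hash functions and the threshold value are illustrative assumptions.

ENTRIES = 1 << 14                  # two 16K-entry tables of 2-bit counters
THRESHOLD = 4                      # confidence threshold on the summed counters
tables = [[0] * ENTRIES, [0] * ENTRIES]

def hashes(sig):
    h0 = sig & (ENTRIES - 1)                          # low 14 bits of the signature
    h1 = ((sig >> 1) ^ (sig << 13)) & (ENTRIES - 1)   # a different mix of the bits
    return h0, h1

def predict_dead(sig):
    h0, h1 = hashes(sig)
    return tables[0][h0] + tables[1][h1] >= THRESHOLD

def train(sig, block_died):
    # Strengthen both entries when the signature led to a death, weaken otherwise.
    for t, h in zip(tables, hashes(sig)):
        t[h] = min(3, t[h] + 1) if block_died else max(0, t[h] - 1)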

References
The SimpleScalar tool set, version 2.0
A study of replacement algorithms for a virtual-storage computer
Automatically characterizing large scale program behavior
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches