
Using Dead Blocks as a Virtual Victim Cache
Samira Khan
Dept. of Computer Science
University of Texas at San Antonio
skhan@cs.utsa.edu
Daniel A. Jiménez
Dept. of Computer Science
University of Texas at San Antonio
dj@cs.utsa.edu
Doug Burger
Microsoft Research
dburger@microsoft.com
Babak Falsafi
Parallel Systems Architecture Lab
École Polytechnique Fédérale de Lausanne
babak.falsafi@epfl.ch
Abstract
Caches mitigate the long memory latency that limits the
performance of modern processors. However, caches can
be quite inefficient. On average, a cache block in a 2MB L2
cache is dead 59% of the time, i.e., it will not be referenced
again before it is evicted. Increasing cache efficiency can
improve performance by reducing miss rate, or alternately,
improve power and energy by allowing a smaller cache with
the same miss rate.
This paper proposes using predicted dead blocks to hold
blocks evicted from other sets. When these evicted blocks
are referenced again, the access can be satisfied from the
other set, avoiding a costly access to main memory. The
pool of predicted dead blocks can be thought of as a virtual
victim cache. A virtual victim cache in a 16-way set asso-
ciative 2MB L2 cache reduces misses by 11.7%, yields an
average speedup of 12.5% and improves cache efficiency by
15% on average, where cache efficiency is defined as the
average time during which cache blocks contain live infor-
mation. This virtual victim cache yields a lower average
miss rate than a fully-associative LRU cache of the same
capacity.
The virtual victim cache significantly reduces cache
misses in multi-threaded workloads. For a 2MB cache ac-
cessed simultaneously by four threads, the virtual victim
cache reduces misses by 12.9% and increases cache effi-
ciency by 16% on average
Alternately, a 1.7MB virtual victim cache achieves about
the same performance as a larger 2MB L2 cache, reducing
the number of SRAM cells required by 16%, thus maintain-
ing performance while reducing power and area.
1. Introduction
The performance gap between modern processors and
memory is a primary concern for computer architecture.
Processors have large on-chip caches and can access a block
in just a few cycles, but a miss that goes all the way to
memory incurs hundreds of cycles of delay. Thus, reduc-
ing cache misses can significantly improve performance.
One way to reduce the miss rate is to increase the num-
ber of live blocks in the cache. A cache block is live if it
will be referenced again before its eviction. From the last
reference until the block is evicted the block is dead [13].
Studies show that cache blocks are dead most of the time;
for the benchmarks and 2MB L2 cache used for this study,
cache blocks are dead on average 59% of the time. Dead
blocks lead to poor cache efficiency [15, 4] because after
the last access to a block, it resides a long time in the cache
before it is evicted. In the least-recently-used (LRU) re-
placement policy, after the last access, every block has to
move down from the MRU position to the LRU position
and then it is evicted. Cache efficiency can be improved
by replacing dead blocks with live blocks as soon as pos-
sible after a block becomes dead, rather than waiting for it
to be evicted. Having more live blocks in a cache of the same
size improves system performance by reducing the miss
rate; more live blocks means more cache hits. Alternately,
a technique that increases the number of live blocks may
allow reducing the size of the cache, resulting in a system
with the same performance but reduced power and energy
needs.
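As a concrete illustration of the efficiency metric used throughout this paper, the following sketch (not from the paper; the bookkeeping and names are illustrative) accumulates, per block generation, the time between fill and last access as live time and the full fill-to-eviction time as resident time:

def on_fill(block, now):
    block["fill"] = now
    block["last_access"] = now

def on_access(block, now):
    block["last_access"] = now

def on_evict(block, now, totals):
    # A block is live from its fill until its last access, and dead from
    # the last access until its eviction.
    totals["live"] += block["last_access"] - block["fill"]
    totals["resident"] += now - block["fill"]

def efficiency(totals):
    # Cache efficiency: fraction of block-resident time spent live.
    return totals["live"] / totals["resident"] if totals["resident"] else 0.0

Starting from totals = {"live": 0, "resident": 0}, the 59% dead-time figure quoted above corresponds to an efficiency of roughly 0.41 under this definition.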
This paper describes a technique to improve cache per-
formance by using predicted dead blocks to hold victims
from cache evictions in other sets. The pool of predicted
dead blocks can be thought of as a virtual victim cache
(VVC). Figure 1 graphically depicts the efficiency of a 1MB
16-way set associative L2 cache with LRU replacement for
the SPEC CPU 2006 benchmark 456.hmmer. The amount
of time each cache block is live is shown as a greyscale in-
tensity. Figure 1(a) shows the unoptimized cache. The dark-
ness shows that many blocks remain dead for large stretches
of time. Figure 1(b) shows the same cache optimized with
the VVC idea. Now many blocks have more live time so the
cache is more efficient.

Figure 1. Virtual victim cache increases cache efficiency. Block efficiency (i.e., fraction of time a block is live) shown as greyscale intensities for 456.hmmer for (a) a baseline 1MB cache and (b) a VVC-enhanced cache; darker blocks are dead longer. (Greyscale legend: 0.0 to 1.0.)
The VVC idea uses a dead block predictor, i.e., a mi-
croarchitectural structure that uses past history to predict
whether a given block is likely to be dead at a given time.
This study uses a trace based dead block predictor [13].
This idea has some similarity to the victim cache [7], but
victims are stored in the same cache from which they were
evicted, simply moving from one set to another. When a
victim block is referenced again, the access can be satisfied
from the other set. Another way to view the idea is as an
enhanced combination of block insertion (i.e. placement)
policy, search strategy, and replacement policy. Blocks are
initially placed in one set and migrated to a less active set
when they become least-recently-used. A more active block
is found with one access to the tag array, and a less active
block may be found with an additional search.
1.1. Contributions
This paper explores the idea of placing victim blocks into
dead blocks in other sets. This strategy reduces the num-
ber of cache misses per thousand instructions (MPKI) by
11.7% on average with a 2MB L2 cache, yields an average
speedup of 12.5% over the baseline and improves cache ef-
ficiency by 15% on average. The VVC outperforms a fully
associative cache with the same capacity; thus, VVC does
not simply improve performance by increasing associativ-
ity. The VVC also outperforms a real victim cache that uses
the same additional hardware budget as the VVC structures,
e.g. the predictor tables. It also provides an improvement
for multi-threaded workloads for a variety of cache sizes.
This paper introduces a new dead block predictor organi-
zation inspired by branch predictors. This organization re-
duces harmful false positive predictions by over 10% on av-
erage, significantly improving the performance of the VVC
with the potential to improve other optimizations that rely
on dead block prediction.
The VVC idea includes a block insertion policy driven
by cache evictions and dead block predictions. Using
an adaptive insertion policy, the VVC gives an average
speedup of 17.3% over the baseline 2MB cache.
2. Related Work
In this section we discuss related work. Previous work
introduced several dead block predictors and applied them
to problems such as prefetching and block replacement [13,
15, 10, 6, 1], but did not explore coupling dead block pre-
diction with alternative block placement strategies.
The VVC depends on an accurate means of determining
which blocks are dead and thus candidates to replace with
victims from other sets. We discuss related work in dead
block prediction.
2.1. Trace Based Predictor
The concept of a Dead Block Predictor (DBP) was intro-
duced by Lai et al. [13]. The main idea is that, if a given se-
quence of accesses to a given cache block leads to the death
(i.e. last access) of the block, then that same sequence of ac-
cesses to a different block is likely to lead to the death of that
block. An access to a cache block is represented by the pro-
gram counter (PC) of the instruction making the access.
The sequence, or trace, of PCs of the instructions access-
ing a block is encoded as the fixed-length truncated sum
of hashes of these PCs. This trace encoding is called a sig-
nature. For a given cache block, the trace for the sequence
of PCs begins when the block is refilled and ends when the
block is evicted. The predictor learns from the trace encod-
ing of the evicted blocks. A table of saturating counters is
indexed by block signatures. When a block is replaced, the

counter associated with it is incremented. When a block
is accessed, the corresponding counter is decremented. A
block is predicted dead when the counter corresponding to
its trace exceeds a threshold.
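A minimal sketch of this mechanism follows; the table size, hash, counter width, and threshold are illustrative assumptions rather than the paper's exact parameters.

SIG_BITS = 15
MASK = (1 << SIG_BITS) - 1
THRESHOLD = 2                      # predict dead at or above this count
table = [0] * (1 << SIG_BITS)      # 2-bit saturating counters

def on_fill(block):
    block["sig"] = 0               # the trace starts over when the block is refilled

def on_access(block, pc):
    # This signature did not mark a death, so weaken it, then fold the PC in.
    table[block["sig"]] = max(0, table[block["sig"]] - 1)
    block["sig"] = (block["sig"] + (pc & MASK)) & MASK
    block["predicted_dead"] = table[block["sig"]] >= THRESHOLD

def on_evict(block):
    # The signature at eviction was produced by the block's last access.
    table[block["sig"]] = min(3, table[block["sig"]] + 1)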
This dead block predictor is used to prefetch data into
predicted dead blocks in the L1 data cache, enabling looka-
head prefetching and eliminating the necessity of prefetch
buffers. That work also proposes a dead block correlating
prefetcher that uses address correlation to determine which
block to prefetch in the dead blocks.
A trace based predictor is also used to optimize a cache
coherence protocol [12, 24]. Dynamic self-invalidation
involves another kind of block “death” due to coherence
events [14]. PC traces are used to detect the last touch and
invalidate the shared cache blocks to reduce cache coher-
ence overhead.
2.2. Counting Based Predictor
Dead blocks can also be predicted depending on how
many times a block has been accessed. Kharbutli and Soli-
hin propose a counting based predictor for an L2 cache
where a counter for each block records how many times
the block has been referenced [10]. When a block is evicted
the history table stores the reference count and the first PC
that brought that block into the cache. When a block is
brought into the cache again by the same PC, the dead block
predictor predicts it to be dead after the number of refer-
ences reaches the threshold value stored in the history table.
Kharbutli and Solihin use this counter based dead block pre-
dictor to improve the LRU replacement policy [10]. This
improved LRU policy replaces a dead block if available, the
LRU block if not. Our technique also replaces predicted
dead blocks with other blocks, but the other blocks are vic-
tims from other sets, effectively extending associativity in
the same way a victim cache does.
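A sketch of the counting-based prediction scheme described above follows; indexing the history table by block address, and the exact fields stored, are assumptions made for illustration.

history = {}   # block address -> (PC that filled the block, reference count at eviction)

def on_fill(block, addr, pc):
    block["fill_pc"] = pc
    block["refs"] = 0
    entry = history.get(addr)
    # Only trust the learned count if the same PC brought the block in.
    block["learned"] = entry[1] if entry and entry[0] == pc else None

def on_access(block):
    block["refs"] += 1
    block["predicted_dead"] = (block["learned"] is not None
                               and block["refs"] >= block["learned"])

def on_evict(block, addr):
    history[addr] = (block["fill_pc"], block["refs"])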
2.3. Time Based Predictor
Another approach to dead block prediction is to predict
a block dead when it has not been accessed for a certain number
of cycles. Hu et al. proposed a time based dead block pre-
dictor [6]. It learns the number of cycles a block remains live and
predicts the block dead once it has gone unaccessed for more than twice
the number of cycles it had been live. This predictor is
used to prefetch data into the L1 cache. This work also pro-
posed using dead times to filter blocks in the victim cache.
Blocks with shorter dead times are likely to be reused before
getting evicted from the victim cache, so the time based victim
cache stores only blocks that are likely to be reused.
Abella et al. propose another time based predictor [1]. It
also predicts a block dead if it has not been accessed for
a certain number of cycles, but here the number of cy-
cles is calculated from the number of accesses to that block.
Abella et al. use this prediction to reduce L2 cache leakage by dynami-
cally turning off cache blocks whose contents are not likely to
be reused, without hurting performance.
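A sketch of a time-based rule of this kind, using the current generation's live time, is given below; the per-block timestamps and the handling of single-access blocks are simplifications, not the published scheme's exact details.

def on_fill(block, now):
    block["fill"] = now
    block["last_access"] = now

def on_access(block, now):
    block["last_access"] = now

def predicted_dead(block, now):
    # Live time accumulated so far in this generation, and idle time since
    # the last access; predict dead once idle time exceeds twice the live time.
    live = block["last_access"] - block["fill"]
    idle = now - block["last_access"]
    return live > 0 and idle > 2 * live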
2.4. Cache Burst Predictor
Cache bursts [15] can be used with trace based, counting
based and time based dead block predictors. A cache burst
consists of all the contiguous accesses that a block receives
while in the most-recently-used (MRU) position. Instead of
on each individual reference, a cache burst based predictor up-
dates its tables only once per burst. It also improves pre-
diction accuracy by making a prediction only when a block
moves out of the MRU position. The dead block predictor
needs to store trace or reference count information only for each
burst rather than for each reference. But since a predic-
tion is made only after a block becomes non-MRU, some
of the dead time is lost compared to non-burst predictors.
Cache burst predictors have been used to improve prefetching, bypassing,
and LRU replacement for both the L1 data cache
and the L2 cache.
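The burst refinement can be sketched as follows; the underlying predictor's update/predict interface is left abstract and is an assumption of this sketch.

def access(mru_list, block, pc, predictor):
    # mru_list holds the blocks of one set, most recently used first.
    if mru_list and mru_list[0] is block:
        block["burst_pc"] = pc          # still MRU: fold into the current burst
        return
    if mru_list:
        # The old MRU block leaves the MRU position: its burst has ended, so
        # train the predictor and make a single prediction for it now.
        old = mru_list[0]
        predictor.update(old, old.get("burst_pc"))
        old["predicted_dead"] = predictor.predict(old)
    if block in mru_list:
        mru_list.remove(block)
    mru_list.insert(0, block)           # accessed block becomes MRU
    block["burst_pc"] = pc              # and starts a new burst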
2.5. Other Dead Block Predictors
Another kind of dead block prediction involves predict-
ing in software [26, 21]. In this approach the compiler col-
lects dead block information and provides hints to the mi-
croarchitecture to make cache decisions. If a cache block
is likely to be reused again it hints to keep the block in the
cache; otherwise, it hints to evict the block.
2.6. Cache Placement and Replacement
Policy
Adaptive insertion policy [18] adaptively inserts in-
coming lines in the LRU position when the working set
size becomes larger than the cache size. Keramidas et al. [9]
proposed a cache replacement policy that uses reuse dis-
tance prediction. This policy tries to evict cache blocks that
will be reused furthest in the future. A memory-level par-
allelism aware cache replacement policy relies on the fact
that isolated misses are more costly on performance than
parallel misses [19].
3. Using Dead Blocks as a Virtual Victim Cache
Victim caches [7] work well because they effectively ex-
tend the associativity of any hot set in the cache, reducing
localized conflict misses. However, victim caches must be
small because of their high associativity, and are flushed
quickly if multiple hot sets are competing for space. Thus,
victim caches do not reduce capacity misses appreciably,
nor conflict misses where the reference patterns do not pro-
duce a new reference to the victim quickly, but they provide
excellent miss reduction for a small additional amount of
state and complexity. Larger victim caches have not come
into wide use because any additional miss reduction benefits
are outweighed by the overheads of the larger structures.
Large caches already contain significant quantities of un-
used state, however, which in theory can be used for opti-
mizations similar to victim caches if the unused state can
be identified and harvested with sufficiently low overhead.

Since the majority of the blocks in a cache are dead at any
point in time, and since dead-block predictors have been
shown to be accurate in many cases, the opportunity ex-
ists to replace these dead blocks with victim blocks, mov-
ing them back into their set when they are accessed. This
virtual victim cache approach has the potential to reduce
both capacity misses and additional conflict misses: Capac-
ity misses can be reduced because dead blocks are evicted
before potentially live blocks that have been accessed less
recently (avoiding misses that would occur with full asso-
ciativity), and conflict misses can be further reduced if hot
set overflows can spill into other dead regions of the cache,
no matter how many hot sets are active at any one time.
An important question is how the dead blocks outside
of a set are found and managed without adding prohibitive
overhead. By coupling small numbers of sets, and mov-
ing blocks overflowing from one set into the predicted dead
blocks (which we call receiver blocks) of a "partner set," a
virtual victim cache can be established with little additional
overhead. While this approach effectively creates a higher-
associativity cache, the power overheads are kept low be-
cause only the original set is searched the majority of the
time, with the partner sets only searched upon a miss in the
original set. The overheads include more tag bits (the log
of the number of partner sets) and more energy and latency
incurred on a cache miss, since the partner sets are searched
to no avail.
3.1. Identifying Potential Receiver Blocks
A trace based dead block predictor keeps a trace encod-
ing for each cache block. The trace is updated on each use
of the block. When a block is evicted from the cache, a sat-
urating counter associated with that block’s trace is incre-
mented. When a block is used, the counter is decremented.
Ideally, any victim block could replace any receiver
block in the entire cache, resulting in the highest possible
usage of the dead blocks as a virtual victim cache. How-
ever, this idea would increase the dead block hit latency and
energy as every set in the cache would have to be searched
for a hit. Thus, there is a trade-off between the number of
sets that can store victim blocks from a particular set and
the time and energy needed for a hit. We have determined
that, for each set, considering only one other partner set to
identify a receiver block yields a reasonable balance. Sets
are paired into adjacent sets that differ in their set indices
by one bit.
3.2. Placing Victim Blocks into the Adja-
cent Set
When a victim block is evicted from a set, the adjacent
set is searched for invalid or predicted dead receiver blocks.
If no such block is found, then the LRU block of the ad-
jacent set is used. Once a receiver block is identified, the
victim block replaces it. The victim block is placed into the
most-recently-used (MRU) position in the adjacent set.
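A sketch of this insertion path is given below; the set and block bookkeeping is illustrative, and v is the adjacent partner set described above.

def place_victim(sets, v, victim_tag):
    ways = sets[v]                       # blocks of the partner set, MRU first
    # Prefer an invalid block, then a predicted-dead block, else the LRU block.
    receiver = (next((b for b in ways if not b["valid"]), None)
                or next((b for b in ways if b["predicted_dead"]), None)
                or ways[-1])
    receiver.update(tag=victim_tag, valid=True, is_receiver=True,
                    predicted_dead=False)
    ways.remove(receiver)
    ways.insert(0, receiver)             # victims enter the MRU position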
3.3. Block Identification in the VVC
If a previously evicted block is referenced again, the tag
match will fail in the original set, the adjacent set will be
searched, and if the receiver block has not yet been evicted
then the block will be found there. The block will then
be refilled in the original set from the adjacent set, and the
block in the adjacent set will be marked as invalid. A small
penalty for the additional tag match and fill will accrue to
this access, but this access is considered a hit in the L2 for
purposes of counting hits and misses (analogously, an ac-
cess to a virtually-addressed cache following a TLB miss
may still be considered a hit, albeit with an extra delay).
To distinguish receiver blocks from other blocks, we
keep an extra bit with each block that is true if the block
is a receiver block, false otherwise. When a set’s tags are
searched for a normal cache access, receiver blocks from
the adjacent set are prevented from matching to maintain
correctness. Note that keeping this extra bit is equivalent to
keeping an extra tag bit in a higher associativity cache.
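A sketch of the tag match with the receiver bit follows; the flag distinguishing a normal access from a partner-set probe is an illustrative name.

def find_block(ways, tag, probing_partner=False):
    for b in ways:
        if not b["valid"] or b["tag"] != tag:
            continue
        # Normal accesses must not match blocks held for the partner set;
        # a VVC probe of the partner set matches only receiver blocks.
        if b["is_receiver"] != probing_partner:
            continue
        return b
    return None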
Figure 2(a) shows what happens when an LRU block is
evicted from a set s. If the adjacent set v has any predicted
dead or invalid block in it, the victim block replaces that
block; otherwise, the LRU block of set v is used. Similarly,
Figure 2(b) depicts a VVC hit. If the access results in a miss
in the original set s, that block can be found in a receiver
block of the adjacent set v. Algorithm 1 shows the complete
algorithm for the VVC.
Algorithm 1 Virtual Victim Cache with Trace based Pre-
dictor
On an access to set s with address a, PC pc
if the access is a hit in block blk then
    blk.trace ← updateTrace(blk.trace, pc)
    isDead ← lookupPredictor(blk.trace)
    if isDead then
        mark blk as dead
    end if
    return
end if
/* otherwise, search the adjacent set for a dead block hit */
v ← adjacentSet(s)
access set v with address a
if the access is a hit in a dead block dblk then
    bring dblk back into set s
    return
end if
/* this access is a miss */
repblk ← block chosen by LRU policy in set s
updatePredictor(repblk.trace)
place repblk in an invalid/dead/LRU block in set v
place block for address a into repblk
repblk.trace ← updateTrace(pc)
return

.
.
.
MRU
.
.
LRU
.
.
.
.
Set s
Set v = adjacentSet(s)
LRU
dead block
MRU
.
.
.
.
.
.
.
.
.
Set v = adjacentSet(s)Set s
MRU
MRU
LRU
blk b from set s
LRU
No tag match for block b tag match for block b
(a) (b)
Figure 2. (a) Placing evicted block into an adjacent partner set, and (b) hitting in the virtual victim
cache
3.4. Caching Predicted Dead Blocks
Note that blocks that are predicted dead and evicted from
a set may be cached in the VVC. Although it might seem
counterintuitive to replace one dead block with another dead
block, this policy does give an advantage over simply dis-
carding predicted dead blocks because the predictor might
be wrong, i.e. one or both blocks might not be dead. We
favor the block from the hotter set, likely to be the set just
accessed.
3.5. Implementation Issues
Adjacent sets differ in one bit, bit k. The set adjacent
to set index s is s exclusive-ORed with 2^k. A value of
k = 3 provides good performance, although performance
is largely insensitive to the choice of k. Victims replace re-
ceiver blocks in the MRU position of the adjacent set and
are allowed to be evicted just as any other block in the set.
Evicted receiver blocks are not allowed to return to their
original sets, i.e., evicted blocks may not “ping-pong” back
and forth between adjacent sets.
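The pairing function itself is a one-line XOR; a sketch with the k = 3 value mentioned above:

K = 3

def adjacent_set(s):
    return s ^ (1 << K)        # partner set differs from s in index bit k

# The pairing is symmetric: adjacent_set(adjacent_set(s)) == s, so each set
# both donates victims to and receives victims from the same partner.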
Each cache block keeps the following additional infor-
mation: whether or not it is a receiver block (1 bit), whether
or not the block is predicted dead (1 bit), and the truncated
sum representing the trace for this block (14 bits). The dead
block predictor additionally keeps two tables of two-bit sat-
urating counters indexed by traces. The predictor tables
consume an additional 2^14 entries × 2-bit counters × 2 ta-
bles = 64 kilobits, or 8 kilobytes.
4. Skewed Dead Block Predictor
In this section we discuss a new dead block predictor
based on the reference trace predictor of Lai et al. [13] as
well as skewed table organizations [22, 16].
4.1. Reference Trace Dead Block Predictor
The reference trace predictor collects a trace of the in-
structions used to access a particular block. The theory is
that, if a sequence of memory instructions to a block leads
to the last access of that block, then the same sequence of in-
structions should lead to the last access of other blocks. The
reference trace predictor encodes the path of memory access
instructions leading to a memory reference as the truncated
sum of the instructions’ addresses. This truncated sum is
called a signature. Each cache block is associated with a
signature that is cleared when that cache block is filled and
updated when that block is accessed. The signature is used
to access a table of two-bit saturating counters. When a
block is accessed, the corresponding counter is decremented
and then the signature is updated. When a block is evicted,
the counter is incremented. Thus, a counter is only incre-
mented by a signature resulting from the last access to a
block.
When a block is accessed and then the signature is up-
dated, the table of counters is consulted. If the counter ex-
ceeds a threshold (e.g. 2), then the block is predicted dead.
Each cache block stores a single bit prediction. For compar-
ison, we use a 15-bit signature indexing a 32K-entry table
of counters. The predictor exclusive-ORs the first 15 bits
of each PC with the next 15 bits and adds this quantity to a
running 15-bit trace.
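As a sketch, the signature update described above can be written as:

SIG_BITS = 15
MASK = (1 << SIG_BITS) - 1

def fold_pc(pc):
    # XOR the low 15 bits of the PC with the next 15 bits.
    return (pc ^ (pc >> SIG_BITS)) & MASK

def update_signature(sig, pc):
    # Add the folded PC into the running truncated 15-bit sum.
    return (sig + fold_pc(pc)) & MASK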
The original reference trace predictor of Lai et al. uses
data addresses as well as instruction addresses, requiring a
large table because of the high number of signatures. Subse-
quent work found that using only instruction addresses was
sufficient and allows smaller tables [15]; thus, we use only
instruction addresses for all of the predictors in this paper.
4.2. A Skewed Organization
In the original trace-based dead block predictor, a single
table is indexed with the signature. For this study, we ex-
plore an organization that uses the idea of a skewed organi-
zation [22, 16] to reduce the impact of conflicts in the table.
The predictor keeps two 16K-entry tables of 2-bit counters,
each indexed by a different 14-bit hash of the 15-bit block
signature. Each access to the predictor yields two counter
values. The sum of these values is used as a confidence that
is compared with a threshold; if the threshold is met, then
the block is predicted dead.
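A sketch of this skewed organization follows; the two hash functions and the threshold value are illustrative assumptions.

ENTRIES = 1 << 14                  # two 16K-entry tables of 2-bit counters
THRESHOLD = 4                      # confidence threshold on the summed counters
tables = [[0] * ENTRIES, [0] * ENTRIES]

def hashes(sig):
    h0 = sig & (ENTRIES - 1)                          # low 14 bits of the signature
    h1 = ((sig >> 1) ^ (sig << 13)) & (ENTRIES - 1)   # a different mix of the bits
    return h0, h1

def predict_dead(sig):
    h0, h1 = hashes(sig)
    return tables[0][h0] + tables[1][h1] >= THRESHOLD

def train(sig, block_died):
    # Strengthen both entries when the signature led to a death, weaken otherwise.
    for t, h in zip(tables, hashes(sig)):
        t[h] = min(3, t[h] + 1) if block_died else max(0, t[h] - 1)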

References
The SimpleScalar tool set, version 2.0
A study of replacement algorithms for a virtual-storage computer
Automatically characterizing large scale program behavior
Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers
Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches