Bloom Filtering Cache Misses for Accurate Data
Speculation and Prefetching
Jih-Kwon Peir
CISE Department
University of Florida
peir@cise.ufl.edu
Shih-Chang Lai
ECE Department
Oregon State University
laish@ece.orst.edu
Shih-Lien Lu
Jared Stark
Konrad Lai
Microprocessor Research
Intel Labs
shih-lien.l.lu@intel.com
ABSTRACT
A processor must know a load instruction’s latency to sched-
ule the load’s dependent instructions at the correct time.
Unfortunately, modern processors do not know this latency
until well after the dependent instructions should have been
scheduled to avoid pipeline bubbles between themselves and
the load. One solution to this problem is to predict the load’s
latency, by predicting whether the load will hit or miss in
the data cache. Existing cache hit/miss predictors, however,
can only correctly predict about 50% of cache misses.
This paper introduces a new hit/miss predictor that uses
a Bloom Filter to identify cache misses early in the pipeline.
This early identification of cache misses allows the processor
to more accurately schedule instructions that are dependent
on loads and to more precisely prefetch data into the cache.
Simulations using a modified SimpleScalar model show that
the proposed Bloom Filter is nearly perfect, with a predic-
tion accuracy greater than 99% for the SPECint2000 bench-
marks. IPC (Instructions Per Cycle) performance improved
by 19% over a processor that delayed the scheduling of in-
structions dependent on a load until the load latency was
known, and by 6% and 7% over a processor that always pre-
dicted a load would hit the cache and a processor that used a
counter-based hit/miss predictor, respectively. This IPC reaches 99.7% of
the IPC of a processor with perfect scheduling.
Categories and Subject Descriptors
C.1.1 [Processor Architectures]: Single Data Stream Ar-
chitectures
General Terms
Algorithms, Design, Performance
Keywords
Bloom Filter, Data Cache, Data Prefetching, Instruction
Scheduling, Data Speculation
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ICS’02, June 22-26, 2002, New York, New York, USA.
Copyright 2002 ACM 1-58113-483-5/02/0006 ...$5.00.
1. INTRODUCTION
To achieve the highest performance, a processor must ex-
ecute a pair of dependent instructions with no intervening
pipeline bubbles. It must arrange for—or schedule—the de-
pendent instruction to begin execution immediately after
the instruction it depends on (i. e., the parent instruction)
completes execution. Accomplishing this requires knowing
the latency of the parent.
Unfortunately, a modern processor schedules an instruc-
tion well before it executes, and the latency of some in-
structions can only be determined by their execution. For
example, the latency of a load depends on where in the
cache/memory hierarchy its data exists, and can only be
determined by executing the load and querying the caches.
At the time the load is scheduled, its latency is unknown.
At the time its dependents should be scheduled, its latency
may still be unknown. Hence, the timely scheduling of the
instructions that are dependent on a load is a problem in
modern processors.
The Intel Pentium 4 illustrates this problem. On an Intel
Pentium 4 [6, 7], a load is scheduled 7 cycles before it be-
gins execution. Its execution (load-use) latency is 2 cycles.
At the time a load is scheduled, its execution will not begin
for another 7 cycles. Two cycles after the load is sched-
uled, if the load will hit the (first-level) cache, its dependent
instructions must be scheduled to avoid pipeline bubbles.
However, two cycles after the load is scheduled, the load has
not yet even started executing, so its cache hit/miss status
is unknown. A similar situation exists in the Compaq Al-
pha 21264 [9]. A load is scheduled 2 cycles before it begins
execution, and its execution latency is 3 cycles. If the load
will hit the (first-level) cache, its dependents must be sched-
uled 3 cycles after it has been scheduled to avoid pipeline
bubbles. However, the load’s cache hit/miss status is still
unknown 3 cycles after it has been scheduled.
One possible solution to this problem is to schedule the de-
pendents of a load only after the latency of the load is known.
The processor delays the scheduling of the dependents until
it knows the load hit the cache. This effectively increases
the load’s latency to the amount of time between when
the load is scheduled and when its cache hit/miss status is
known. This solution introduces bubbles into the pipeline,
and can devastate processor performance. Our simulations
show that a processor using this solution drops 17% of its
performance (in Instructions Per Cycle [IPC]) compared to
an ideal processor that uses an oracle to perfectly predict

load latencies and perfectly schedule their dependents.
A better solution—and the solution that is the focus of
this work—is to use data speculation. The processor specu-
lates that a load will hit the cache (a good assumption given
cache hit rates are generally over 90%), and schedules its
dependents accordingly. If the load hits, all is well. If the
load misses, any dependents that have been scheduled will
not receive the load’s result before they begin execution. All
these instructions have been erroneously scheduled, and will
need to be rescheduled.
Recovery must occur whenever instructions are erroneous-
ly scheduled due to data (mis)speculation. Although mis-
speculation is rare, the overall penalty for all mis-specula-
tions may be high, as the cost of each recovery can be high.
If the processor only rescheduled those instructions that are
(directly or indirectly) dependent on the load, the cost would
be low. However, such a recovery mechanism is expensive to
implement. The recovery mechanism for the Compaq Alpha
21264 simply reschedules all instructions scheduled since the
offending load was scheduled, whether they are dependent
or not. Although it’s cheaper to implement, the recovery
cost can be high with this mechanism due to the reschedul-
ing and re-execution of the independent instructions. Re-
gardless of which recovery mechanism is implemented, as
processor pipelines grow deeper and issue widths widen, the
number of erroneously scheduled instructions will increase,
and recovery costs will climb.
To reduce the penalty due to data mis-speculations, the
processor can predict whether the load will hit the cache,
instead of just speculating that the load will always hit.
The load’s dependents are then scheduled according to the
prediction. As an example of a cache hit/miss predictor,
the Compaq Alpha 21264 uses the most significant bit of a
4-bit saturating counter as the load’s hit/miss prediction.
The counter is incremented by one every time a load hits,
and decremented by two every time a load misses. Unfortu-
nately, even with 2-level predictors [15], only about 50% of
the cache misses can be correctly predicted.
In this paper, we describe a new approach to hit/miss pre-
diction that is very accurate and space (and hence power) ef-
ficient compared to existing approaches. This approach uses
a Bloom Filter (BF), which is a probabilistic algorithm to
quickly test membership in a large set using hash functions
into an array of bits [2]. We investigate two variants of this
approach: the first is based on partitioned-address matching,
and the second is based on partial-address matching. Ex-
perimental results show that, for modest-sized predictors,
Bloom Filters outperform predictors that use a table of
saturating counters indexed by load PC. These table-based
predictors operate just like the predictor for the Compaq
Alpha 21264, except they have multiple counters instead of
just one. As an example, for an 8K-bit predictor, the Bloom
Filter mispredicts 0.4% of all loads, whereas the table-based
predictor mispredicts 8% of all loads. This translates to
a 7% improvement in IPC over the table-based predictor.
Compared to a machine with a perfect predictor, a machine
with a Bloom Filter has 99.7% of its IPC.
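As a rough sizing illustration (assuming the 8K-bit budget is organized as a single partial-address BF bit array and that the table-based predictor spends the same budget on 4-bit counters; neither organization is prescribed by the numbers above), 2^p = 8192 gives p = 13 partial-address bits for the BF, while 8192 / 4 = 2048 entries are available for the counter table.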
The remainder of the paper is organized as follows: The
next section explains data speculation fundamentals and re-
lated work. Section 3 explains BFs and how they can be
used as hit/miss predictors. Section 4 describes how the
SimpleScalar microarchitecture [3] must be modified to sup-
port data speculation using a BF as a hit/miss predictor.
Section 5 evaluates the performance of BFs, reporting their
accuracy as hit/miss predictors and the performance benefit
(in IPC) they can provide. Finally, Section 6 concludes.
2. DATA SPECULATION
2.1 The Fundamentals
To facilitate the presentation and discussion, we consider
a baseline pipeline model that is similar to the Compaq Al-
pha 21264 [9]. In the baseline model, the front-end pipeline
stages are: instruction fetch and decode/rename. After de-
code/rename, the ALU instructions go through the back-end
stages: schedule, register read, execute, writeback, and com-
mit. Additional stages are required for executing a load.
After decode/rename, loads go through schedule, register
read, address generation, two cache access cycles, an addi-
tional cycle for hit/miss determination (data access before
hit/miss using way prediction [4]), writeback, and commit.
Thus, there are a total of 7 and 10 cycles for ALU and load
instructions, respectively.
Figure 1 shows the problem in scheduling the instructions
that are dependent on a load. For simplicity, the front-end
stages are omitted. In this example, the add instruction con-
sumes the data produced by the load instruction. After the
load is scheduled, it takes 5 cycles to resolve the hit/miss.
However, the dependent add must be scheduled the third
cycle after the load is scheduled to achieve the minimum
3-cycle load-use latency and allow back-to-back execution
of these two dependent instructions. If the processor spec-
ulatively schedules the add assuming the load will hit the
cache, the add will get incorrect data if the load actually misses
the cache. In this case, the add along with any other de-
pendent instructions scheduled within the illustrated 3-cycle
speculative window must be canceled and rescheduled.
To show the performance potential of using data spec-
ulation for scheduling instructions that are dependent on
loads, we simulated the SPECint2000 benchmarks. We com-
pare two scheduling techniques. The first is a no-speculation
scheme: the dependents are delayed until the hit/miss of the
parent load is known. The second uses a perfect hit/miss
predictor that knows the hit/miss of a load in time to (per-
fectly) schedule its dependents to achieve minimum load la-
tency. The performance gap (in IPC) between these two
extremes shows the performance potential of speculatively
scheduling the dependents of loads. Figure 2 shows the re-
sults. In these simulations, we modified the SimpleScalar
out-of-order pipeline to match our baseline model, and dou-
bled the default SimpleScalar issue width to 8, scaling the
other parameters accordingly. A more detailed description
of the simulation model is given in Section 5. On aver-
age, the IPC for perfect scheduling is 17% higher than the
IPC for the no-speculation scheme. Thus, the main focus of
this paper is to recover this 17% performance gap, by using
mechanisms for efficient load data speculation.
2.2 Related Work
The Compaq Alpha 21264 uses a mini-restart mechanism
to cancel and reschedule all instructions scheduled since a
mis-speculated load was scheduled [9]. While this mini-
restart is less costly than restarting the entire processor
pipeline, it is still expensive to reschedule (and re-execute)
both the dependent and the independent instructions. To
alleviate this problem, the Compaq Alpha 21264 uses the

Figure 1: Example of Data Speculation for a Load (timing for load r1 <- 0(r2) followed by a dependent add r3 <- r2, r1: the load passes through schedule, register read, address generation, two cache-access cycles, hit/miss determination, writeback, and commit; with speculative issue for a hit, the dependent add must be scheduled within the 3-cycle speculative window to achieve the minimum 3-cycle load-use latency, otherwise it stalls).
Figure 2: No-Speculation vs. Perfect Scheduling (IPC for the SPECint2000 benchmarks Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vpr, and their average, comparing the no-speculation scheme with perfect scheduling).
most significant bit of a 4-bit saturating counter as the load’s
hit/miss prediction. The counter is incremented by one ev-
ery time a load hits, and decremented by two every time a
load misses. The load’s dependents are scheduled accord-
ing to the prediction. If the prediction is wrong, either the
load was predicted to miss and it hit, in which case the ex-
ecution of the dependents will be unnecessarily delayed; or
the load was predicted to hit and it missed, in which case
dependents may have been erroneously scheduled and will
need to be rescheduled.
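To make this concrete, the following C sketch captures the behavior described above: a 4-bit saturating counter whose most significant bit serves as the hit/miss prediction, incremented by one on every hit and decremented by two on every miss. The single-counter organization and the function names are illustrative only; a table of such counters indexed by load PC (as evaluated later in this paper) behaves the same way per entry.

#include <stdbool.h>
#include <stdint.h>

/* 4-bit saturating counter; the most significant bit (bit 3) is the prediction. */
static uint8_t hm_counter = 15;              /* start saturated toward "hit" */

/* Predict before the load executes: MSB set means "predict hit". */
bool predict_hit(void)
{
    return (hm_counter & 0x8) != 0;
}

/* Train once the load's actual hit/miss status is known:
   +1 on a hit, -2 on a miss, saturating within [0, 15]. */
void train(bool hit)
{
    if (hit)
        hm_counter = (hm_counter < 15) ? hm_counter + 1 : 15;
    else
        hm_counter = (hm_counter >= 2) ? hm_counter - 2 : 0;
}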
Yoaz et al. [15] used 2-level local predictors, 2-level global
predictors, and hybrid predictors for cache hit/miss predic-
tion. Their results show that these predictors only correctly
identify half of the misses (for SPECint95), leaving the other
half predicted as hits. Furthermore, they incorrectly identify
a small percentage of the hits as being misses.
The MIPS R10000 speculatively issues instructions that
are dependent on a load and reschedules them if the load
misses the cache [14].
The Intel Pentium 4 achieves a minimum 2-cycle load-
use latency by leveraging the fact that most accesses hit
the first-level (L1) cache. The scheduler issues the depen-
dent micro-operations (called uops) before the parent load
has finished executing [6, 7]. In most cases, the scheduler
assumes the load will hit the L1 cache. A replay mecha-
nism is used to handle the case where the load misses the L1
cache. The replay logic keeps track of the dependent uops of
each speculative load. When a load misses, all its dependent
uops are re-executed with the correct data when that data
becomes available.
Morancho, Llabería, and Olivé describe a recovery mech-
anism for load latency misprediction [11]. A recovery buffer
retains all speculatively scheduled instructions. After a la-
tency misprediction, the load’s dependent instructions can
be re-scheduled directly from the recovery buffer as soon as
the load data becomes available. The recovery buffer allows
the processor to remove instructions from the scheduler early,
providing more space for other instructions.
3. BLOOM FILTERS
A Bloom Filter (BF) is a probabilistic algorithm to quickly
test membership in a large set using multiple hash functions
into an array of bits [2]. A BF quickly filters (i. e., identifies)
non-members without querying the large set by exploiting
the fact that a small percentage of erroneous classifications
can be tolerated. When a BF identifies a non-member, it is
guaranteed to not belong to the large set. When a BF iden-
tifies a member, however, it is not guaranteed to belong to
the large set. To put it more simply, the result of the mem-
bership test is either: it is definitely not a member, or, it is
probably a member. In this paper, we consider two variants
of the BF for filtering cache misses: one based on partitioned-
address matching, and the other based on partial-address
matching. To simplify our discussion, we first assume both
the BF and the cache use physical addresses. Afterwards,
we will describe using virtual addresses.
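For reference, the following C sketch shows the classic Bloom Filter membership test of [2], using two illustrative hash functions into a single bit array; a zero bit at any hashed position means the key is definitely not in the set, while all bits set means the key is only probably in the set. The hash functions and array size are arbitrary placeholders; the cache-miss variants described next differ in how the array is indexed and how entries are removed when lines leave the cache.

#include <stdbool.h>
#include <stdint.h>

#define BF_BITS 8192u                         /* size of the bit array (illustrative) */

static uint8_t bf[BF_BITS / 8];               /* the Bloom Filter bit array */

/* Two illustrative hash functions; any reasonably independent hashes would do. */
static uint32_t h1(uint32_t x) { return (x * 2654435761u) % BF_BITS; }
static uint32_t h2(uint32_t x) { return ((x >> 7) ^ (x * 40503u)) % BF_BITS; }

static void set_bit(uint32_t i) { bf[i / 8] |= (uint8_t)(1u << (i % 8)); }
static bool get_bit(uint32_t i) { return (bf[i / 8] >> (i % 8)) & 1u; }

/* Record an element as a member of the set. */
void bf_insert(uint32_t key)
{
    set_bit(h1(key));
    set_bit(h2(key));
}

/* Membership test: false means "definitely not a member";
   true only means "probably a member" (false positives are possible). */
bool bf_maybe_member(uint32_t key)
{
    return get_bit(h1(key)) && get_bit(h2(key));
}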
3.1 Partitioned-Address Bloom Filter
Consider a cache line address with n bits (ignoring the
offset bits). A large, direct-mapped array of 2^n bits is re-
quired to precisely record whether each cache line address is
in the cache. To reduce the space and allow a quick access, a
partitioned-address BF can be constructed. Instead of using
the entire line address, the address can be split into m par-
titions, with each partition using its own array of bits. The
result is m sub-arrays with 2^(n/m) bits, each of which records
the membership of the respective address partitions of lines
stored in the cache. A cache miss is identified when one or
more of the address partitions for the address of a requested
line does not belong to the respective address partition of
any line in the cache. A filter error is encountered when
a cache miss cannot be identified. This situation happens
when the line is not in the cache, but all m partitions of the
line’s address match address partitions of other cache lines.
The filter rate represents the percentage of cache misses that
can be identified.
Figure 3 illustrates how the partitioned-address BF works.
A load address is partitioned, in this example, into 4 equally
divided groups, A1, A2, A3, and A4. Each of the four ad-
dress partitions is used to index separate BF arrays, BF1,
BF2, BF3, and BF4, respectively.

Figure 3: Partitioned-Address Bloom Filter for Cache Miss Detection (the requested line address partitions A1-A4 index the arrays BF1-BF4; on a cache miss, the counters indexed by the requested line's partitions are incremented and those indexed by the replaced line's partitions R1-R4 are decremented; the output is true on a definite cache miss and false on a probable hit).

Each entry in the BF
arrays contains the information of whether the address par-
tition belongs to the corresponding address partition of any
line in the cache. If any of the 4 BF arrays indicates one
of the address partitions is absent from the cache, the re-
quested line is not in the cache. Otherwise, the requested
line is probably in the cache, but it’s not guaranteed to be.
Given the fact that a single address partition can exist
for multiple lines in the cache, the primary difficulty of the
partitioned-address BF is to maintain the correct member-
ship information. When a line is removed from the cache,
an exhaustive search is necessary to check if the address par-
titions for the address of the removed line still exist for any
of the remaining lines. To avoid such a search, each entry in
the BF array contains a reference counter that keeps track of
the number of cache lines with the entry’s corresponding ad-
dress partition. When a cache miss occurs, each counter for
the address partitions for the address of the newly-requested
line is incremented, while the counters for the address par-
titions for the address of the replaced line are decremented.
A zero count indicates the corresponding address partition
does not belong to any line in the cache. Although accurate,
this counter technique requires extra space in the BF arrays
for the counters along with adders to handle the updates.
A similar idea has been considered to reduce the number
of comparators for a set-associative cache [8] and to filter
cache-coherence traffic in a multiprocessor environment [12].
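The following C sketch illustrates the partitioned-address BF with reference counters as described above: on a miss, the counters for the filled line's address partitions are incremented and the counters for the replaced line's partitions are decremented, and a load is filtered as a definite miss whenever any of its partitions has a zero count. The partition count, partition width, and counter width are illustrative values, not the configuration evaluated in this paper.

#include <stdbool.h>
#include <stdint.h>

#define M          4                        /* number of address partitions (illustrative) */
#define PART_BITS  8                        /* bits per partition (illustrative) */
#define PART_SIZE  (1u << PART_BITS)        /* entries per sub-array: 2^(n/m) */

/* One reference counter per entry of each sub-array. */
static uint16_t bf_count[M][PART_SIZE];

/* Extract the i-th partition of a cache line address (offset bits already removed). */
static uint32_t partition(uint32_t line_addr, int i)
{
    return (line_addr >> (i * PART_BITS)) & (PART_SIZE - 1);
}

/* True: the line is definitely not in the cache (a filtered miss).
   False: the line is probably in the cache.                        */
bool bf_predict_miss(uint32_t line_addr)
{
    for (int i = 0; i < M; i++)
        if (bf_count[i][partition(line_addr, i)] == 0)
            return true;
    return false;
}

/* On a cache miss: the requested line is filled and (possibly) a victim is replaced. */
void bf_update_on_miss(uint32_t filled_line_addr, uint32_t replaced_line_addr,
                       bool line_was_replaced)
{
    for (int i = 0; i < M; i++) {
        bf_count[i][partition(filled_line_addr, i)]++;
        if (line_was_replaced)
            bf_count[i][partition(replaced_line_addr, i)]--;
    }
}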
3.2 Partial-Address Bloom Filter
The partial-address BF uses the least-significant bits of the
line address to index a small array of bits. Each bit indicates
whether the partial address matches any corresponding par-
tial address of a line in the cache. The array size is reduced
to 2^p bits, where p is the number of partial address bits. A
filter error occurs when the partial address of the requested
line matches the partial address of an existing cache line, but
the other portion of the line address does not match. We
call such cases collisions. The least-significant bits are selected rather than more-significant bits to reduce the chance of collisions. Due to memory reference locality, the more-significant line address bits tend to change less frequently. With a sufficient number of low-order partial address bits to represent cache line addresses, collisions are rare [10].

Figure 4: Partial-Address Bloom Filter for Cache Miss Detection (the p-bit partial address of the requested line indexes the BF array while the L1 cache tags are checked; a collision detector compares the p-bit partial address of the replaced cache line against the remaining lines in its set; on a cache miss the requested line's bit is set, and the replaced line's bit is reset only if there is no collision; the BF output is false on a definite cache miss and true on a probable hit).
The design of a partial-address BF is illustrated in Fig-
ure 4. A BF array with 2^p bits indicates whether the cor-
responding partial address matches that of any cache line.
The BF array is updated to reflect any cache content change.
When a cache miss occurs, except for the caveat described
in the paragraph below, the entry in the BF array for the re-
placed line is reset to indicate that the line with that partial
address is no longer in the cache. Then, the entry for the
requested line is set to indicate that a line with that partial
address now exists in the cache.
If the partial address is wider than the cache index, when
two cache lines share the same partial address, they must
be in the same set in a set-associative cache. The BF array
indicates which partial addresses exist in the cache, so if
one of these lines is replaced, the BF entry for the replaced
line should not be reset, since the partial address still exists
for the line that was not replaced. When a cache line is
replaced, the collision detector checks the remaining cache
lines in the same set as the replaced line to see if any of
them have the same partial address as the replaced line. If
any do have the same partial address, the BF entry is not
reset. Otherwise, the entry is reset. The collision detection
is done in parallel with the cache hit/miss detection. The
BF array is updated on the detection of a cache miss.
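The following C sketch summarizes the partial-address BF update for a set-associative cache: on a miss, the bit for the requested line's partial address is set, and the bit for the replaced line's partial address is cleared only if no remaining line in the victim's set shares that partial address. The partial-address width and the associativity are illustrative assumptions, and the caller is assumed to supply the line addresses still resident in the victim's set.

#include <stdbool.h>
#include <stdint.h>

#define P_BITS   13                         /* partial-address width (illustrative) */
#define BF_SIZE  (1u << P_BITS)
#define WAYS     4                          /* associativity (illustrative) */

static bool bf_bit[BF_SIZE];                /* one bit per partial address */

/* Partial address = least-significant P_BITS of the line address. */
static uint32_t partial(uint32_t line_addr) { return line_addr & (BF_SIZE - 1); }

/* True: definitely a cache miss.  False: probably a cache hit. */
bool bf_predict_miss(uint32_t line_addr)
{
    return !bf_bit[partial(line_addr)];
}

/* Called when a miss replaces 'victim_addr' with 'filled_addr'.
   'set_line_addr' and 'valid' describe the lines still resident in the
   victim's set after the replacement.                                   */
void bf_update_on_miss(uint32_t filled_addr, uint32_t victim_addr,
                       const uint32_t set_line_addr[WAYS], const bool valid[WAYS])
{
    bool collision = false;                 /* another resident line shares the partial address? */
    for (int w = 0; w < WAYS; w++)
        if (valid[w] && partial(set_line_addr[w]) == partial(victim_addr))
            collision = true;

    if (!collision)
        bf_bit[partial(victim_addr)] = false;   /* victim's partial address no longer cached */
    bf_bit[partial(filled_addr)] = true;        /* requested line's partial address now cached */
}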
3.3 Bloom Filters using Virtual Addresses
The hit/miss prediction for a load must be done before
the scheduling of its dependents. If the physical address is
not available in time to perform the prediction, the virtual
address must be used. When a virtual address is used to
access a BF, it is called a virtual-address BF. If the cache
is virtually indexed and tagged, the virtual-address BF operates analogously to the BF and cache that both use only physical addresses. However, if the cache is either virtually-indexed physically-tagged or physically-indexed physically-tagged, the BF array update for the virtual-address BF must be modified. In this section, we describe these modifications.

Figure 5: Partial-Virtual-Address Bloom Filter for Cache Miss Detection (the p0+p2 partial virtual address of the requested line indexes the BF array while the TLB and cache tag access proceed using the p0+p1 index bits; a Collision and Update Table (CUT), indexed by the p0 bits, holds the p2 bits of each cache line and supplies the victim information to the collision detector; on a cache miss the requested line's bit is set, and the replaced line's bit is reset only if there is no collision; the BF output is false on a definite cache miss and true on a probable hit).
With virtual addresses, two virtual addresses can map
to the same physical address, causing an address synonym.
With a virtual-address BF, the BF might identify the first
address as missing the cache, even though the line is in the
cache set identified by the second address. That is, the BF
identifies a load as missing the cache even though it hits.
This situation can arise regardless of whether the cache is
physically or virtually indexed. In this situation, the proces-
sor simply delays scheduling the load’s dependent instruc-
tions. Since cache hits by synonyms are rare, the perfor-
mance loss caused by the delayed scheduling is minimal. In
fact, for some virtually-indexed caches, the load-use latency
for a synonym hit is longer than for a non-synonym hit. For
scheduling, the processor may initially treat the synonym
hit as a cache miss, in which case the BF should identify the
synonym hit as a cache miss anyway.
A more essential issue is correctly updating the BF array
on cache misses. Let’s first focus on the partial-address BF
shown in Figure 5. To simplify our discussion, assume the
cache is physically indexed and tagged with p0+p1 index bits, where p0 bits are within the page offset and p1 bits are beyond the offset. During a cache access, p1 bits are translated. Also assume p0+p2 partial virtual address bits are used to access the BF, where p2 bits are beyond the page offset. To correctly update the BF array, the p2 bits of each cache line are stored in a Collision and Update Table (CUT). When a line is replaced, its p2 bits are read from the CUT. These p2 bits are then combined with the requested line's p0 bits to update the BF array.
The CUT is organized as a two-dimensional array and
indexed by the p0 bits. During each cache access, the set of p2 bits indexed by p0 are read from the CUT. If a cache miss is detected, the p2 bits of the victim (e.g., LRU) line in the accessed cache set are compared to the p2 bits for the other lines in that CUT set. If the victim's p2 bits don't match any other line's p2 bits, there is no collision, and the victim's p2 bits are used along with the p0 bits to reset the BF array to indicate that the line with the p0+p2 partial address is no longer in the cache. If the victim's p2 bits do match another line's p2
bits, the victim and the other line share the same
partial address, and there is a collision. In this case, the BF
entry for the victim line is left alone. Then, the BF entry
for the requested line is set using the partial virtual address
of the requested line. Note that when the cache is virtually-
indexed physically-tagged, all the cache index bits are used
to access the CUT. In this case, only the partial address bits
beyond the virtual cache index bits need to be saved in the
CUT and compared for collision detection.
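The following C sketch outlines this CUT-based update for the physically-indexed, physically-tagged case described above. The bit widths, the associativity implied by the CUT geometry, and the way the caller identifies the victim's CUT slot are illustrative assumptions; the virtually-indexed physically-tagged variant differs only in which bits index the CUT and which bits must be stored in it.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative widths: p0 index bits within the page offset, p1 translated
   index bits, p2 virtual partial-address bits beyond the page offset.      */
#define P0    6
#define P1    5
#define P2    7
#define WAYS  4

#define CUT_ROWS  (1u << P0)                /* the CUT is indexed by the p0 bits             */
#define CUT_COLS  ((1u << P1) * WAYS)       /* one slot per cache line sharing those p0 bits */
#define BF_SIZE   (1u << (P0 + P2))         /* BF indexed by the p0+p2 partial virtual bits  */

static bool     bf_bit[BF_SIZE];
static uint16_t cut_p2[CUT_ROWS][CUT_COLS];      /* stored p2 bits of each cache line */
static bool     cut_valid[CUT_ROWS][CUT_COLS];

static uint32_t bf_index(uint32_t p0, uint32_t p2) { return (p2 << P0) | p0; }

/* On a cache miss: the victim line occupies CUT slot 'victim_col' of row 'p0'
   (derived from the victim's set and way), and 'p2_req' holds the requested
   line's p2 virtual address bits.                                             */
void bf_update_on_miss(uint32_t p0, uint32_t victim_col, bool victim_valid,
                       uint32_t p2_req)
{
    if (victim_valid) {
        uint32_t p2_victim = cut_p2[p0][victim_col];

        /* Collision check: does any other resident line share the victim's p0+p2? */
        bool collision = false;
        for (uint32_t c = 0; c < CUT_COLS; c++)
            if (c != victim_col && cut_valid[p0][c] && cut_p2[p0][c] == p2_victim)
                collision = true;

        if (!collision)
            bf_bit[bf_index(p0, p2_victim)] = false;   /* partial address no longer cached */
    }

    /* Record the requested line in the CUT and mark its partial address present. */
    cut_p2[p0][victim_col]    = (uint16_t)p2_req;
    cut_valid[p0][victim_col] = true;
    bf_bit[bf_index(p0, p2_req)] = true;
}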
Handling a virtual partitioned-address BF is straightfor-
ward. Virtual address tags must be stored in the cache tag
array along with the physical tags. When a line is replaced,
the replaced line’s virtual address tag is used to update the
counter in each partitioned BF.
For the remainder of the paper, we will assume virtual-
address BFs. The virtual address needed to access the BF
is available after the address generation cycle. Due to their
rarity, we will omit discussions of synonym hits. In fact, for
our benchmarks there are no synonyms.
4. THE MICROARCHITECTURE
In our baseline model, ALU instructions require a min-
imum of 7 cycles: instruction fetch (IFE), decode/rename
(DEC), schedule (SCH), register read (REG), execute (EXE),
writeback (WRB), and commit (CMT). Loads extend the
execute stage to 4 cycles: address generation (AGN), two
cache access cycles (CA1, CA2), and hit/miss determina-
tion (H/M). Assuming a load hits the L1 cache, there is a
3-cycle speculative window in which the load’s dependents
and their children are scheduled. When a miss occurs, all of
the dependent instructions and their children scheduled in
these 3 cycles must be canceled and re-executed using the
correct data when it becomes available.
4.1 Predictor Timing and Mini-Restart
If data cache misses can be predicted early enough and
accurately enough, the processor’s scheduler can avoid in-
serting pipeline bubbles between a load and its dependent
instructions. To be effective, the load’s cache hit/miss pre-
diction must be done before its dependents must be sched-
uled. Thus, there are two basic issues: (1) when, and (2)
how fast the hit/miss prediction can be performed. Hit/miss
predictors that use saturating counters, like the one used by
the Compaq Alpha 21264, can access the counter at the be-
ginning of the pipeline. Since our pipeline has a minimum
3-cycle load latency, the prediction is available before any
of the load’s dependents need to be scheduled. If a miss is
predicted, the dependents are blocked from scheduling un-
til either the data comes back from the outer levels of the
memory hierarchy or the prediction is found to be incorrect.
The proposed Bloom Filter approach, on the other hand,
requires the load address to accurately identify (filter) misses.
This filtering can only be performed after the load address
is calculated in the address generation cycle. As shown in
Figure 1, the load’s dependent instructions must be sched-
uled the cycle after the load’s address generation to avoid

References

B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970.

D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 1997.

L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 2000.

R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, March 1999.

K. C. Yeager. The Mips R10000 superscalar microprocessor. IEEE Micro, April 1996.