Bloom Filtering Cache Misses for Accurate Data
Speculation and Prefetching
Jih-Kwon Peir
CISE Department
University of Florida
peir@cise.ufl.edu
Shih-Chang Lai
ECE Department
Oregon State University
laish@ece.orst.edu
Shih-Lien Lu
Jared Stark
Konrad Lai
Microprocessor Research
Intel Labs
shih-lien.l.lu@intel.com
ABSTRACT
A processor must know a load instruction’s latency to sched-
ule the load’s dependent instructions at the correct time.
Unfortunately, modern processors do not know this latency
until well after the dependent instructions should have been
scheduled to avoid pipeline bubbles between themselves and
the load. One solution to this problem is to predict the load’s
latency, by predicting whether the load will hit or miss in
the data cache. Existing cache hit/miss predictors, however,
can only correctly predict about 50% of cache misses.
This paper introduces a new hit/miss predictor that uses
a Bloom Filter to identify cache misses early in the pipeline.
This early identification of cache misses allows the processor
to more accurately schedule instructions that are dependent
on loads and to more precisely prefetch data into the cache.
Simulations using a modified SimpleScalar model show that
the proposed Bloom Filter is nearly perfect, with a predic-
tion accuracy greater than 99% for the SPECint2000 bench-
marks. IPC (Instructions Per Cycle) performance improved
by 19% over a processor that delayed the scheduling of in-
structions dependent on a load until the load latency was
known, and by 6% and 7% over a processor that always pre-
dicted a load would hit the cache and a processor that used a
counter-based hit/miss predictor, respectively. This IPC reaches 99.7% of
the IPC of a processor with perfect scheduling.
Categories and Subject Descriptors
C.1.1 [Processor Architectures]: Single Data Stream Ar-
chitectures
General Terms
Algorithms, Design, Performance
Keywords
Bloom Filter, Data Cache, Data Prefetching, Instruction
Scheduling, Data Speculation
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ICS’02, June 22-26, 2002, New York, New York, USA.
Copyright 2002 ACM 1-58113-483-5/02/0006 ...$5.00.
1. INTRODUCTION
To achieve the highest performance, a processor must ex-
ecute a pair of dependent instructions with no intervening
pipeline bubbles. It must arrange for—or schedule—the de-
pendent instruction to begin execution immediately after
the instruction it depends on (i. e., the parent instruction)
completes execution. Accomplishing this requires knowing
the latency of the parent.
Unfortunately, a modern processor schedules an instruc-
tion well before it executes, and the latency of some in-
structions can only be determined by their execution. For
example, the latency of a load depends on where in the
cache/memory hierarchy its data exists, and can only be
determined by executing the load and querying the caches.
At the time the load is scheduled, its latency is unknown.
At the time its dependents should be scheduled, its latency
may still be unknown. Hence, the timely scheduling of the
instructions that are dependent on a load is a problem in
modern processors.
The Intel Pentium 4 illustrates this problem. On an Intel
Pentium 4 [6, 7], a load is scheduled 7 cycles before it be-
gins execution. Its execution (load-use) latency is 2 cycles.
At the time a load is scheduled, its execution will not begin
for another 7 cycles. Two cycles after the load is sched-
uled, if the load will hit the (first-level) cache, its dependent
instructions must be scheduled to avoid pipeline bubbles.
However, two cycles after the load is scheduled, the load has
not yet even started executing, so its cache hit/miss status
is unknown. A similar situation exists in the Compaq Al-
pha 21264 [9]. A load is scheduled 2 cycles before it begins
execution, and its execution latency is 3 cycles. If the load
will hit the (first-level) cache, its dependents must be sched-
uled 3 cycles after it has been scheduled to avoid pipeline
bubbles. However, the load’s cache hit/miss status is still
unknown 3 cycles after it has been scheduled.
One possible solution to this problem is to schedule the de-
pendents of a load only after the latency of the load is known.
The processor delays the scheduling of the dependents until
it knows the load hit the cache. This effectively increases
the load’s latency to the amount of time between when
the load is scheduled and when its cache hit/miss status is
known. This solution introduces bubbles into the pipeline,
and can devastate processor performance. Our simulations
show that a processor using this solution drops 17% of its
performance (in Instructions Per Cycle [IPC]) compared to
an ideal processor that uses an oracle to perfectly predict

load latencies and perfectly schedule their dependents.
A better solution—and the solution that is the focus of
this work—is to use data speculation. The processor specu-
lates that a load will hit the cache (a good assumption given
cache hit rates are generally over 90%), and schedules its
dependents accordingly. If the load hits, all is well. If the
load misses, any dependents that have been scheduled will
not receive the load’s result before they begin execution. All
these instructions have been erroneously scheduled, and will
need to be rescheduled.
Recovery must occur whenever instructions are erroneous-
ly scheduled due to data (mis)speculation. Although mis-
speculation is rare, the overall penalty for all mis-specula-
tions may be high, as the cost of each recovery can be high.
If the processor only rescheduled those instructions that are
(directly or indirectly) dependent on the load, the cost would
be low. However, such a recovery mechanism is expensive to
implement. The recovery mechanism for the Compaq Alpha
21264 simply reschedules all instructions scheduled since the
offending load was scheduled, whether they are dependent
or not. Although it’s cheaper to implement, the recovery
cost can be high with this mechanism due to the reschedul-
ing and re-execution of the independent instructions. Re-
gardless of which recovery mechanism is implemented, as
processor pipelines grow deeper and issue widths widen, the
number of erroneously scheduled instructions will increase,
and recovery costs will climb.
To reduce the penalty due to data mis-speculations, the
processor can predict whether the load will hit the cache,
instead of just speculating that the load will always hit.
The load’s dependents are then scheduled according to the
prediction. As an example of a cache hit/miss predictor,
the Compaq Alpha 21264 uses the most significant bit of a
4-bit saturating counter as the load’s hit/miss prediction.
The counter is incremented by one every time a load hits,
and decremented by two every time a load misses. Unfortu-
nately, even with 2-level predictors [15], only about 50% of
the cache misses can be correctly predicted.
In this paper, we describe a new approach to hit/miss pre-
diction that is very accurate and space (and hence power) ef-
ficient compared to existing approaches. This approach uses
a Bloom Filter (BF), which is a probabilistic algorithm to
quickly test membership in a large set using hash functions
into an array of bits [2]. We investigate two variants of this
approach: the first is based on partitioned-address matching,
and the second is based on partial-address matching. Ex-
perimental results show that, for modest-sized predictors,
Bloom Filters outperform predictors that use a table of
saturating counters indexed by load PC. These table-based
predictors operate just like the predictor for the Compaq
Alpha 21264, except they have multiple counters instead of
just one. As an example, for an 8K-bit predictor, the Bloom
Filter mispredicts 0.4% of all loads, whereas the table-based
predictor mispredicts 8% of all loads. This translates to
a 7% improvement in IPC over the table-based predictor.
Compared to a machine with a perfect predictor, a machine
with a Bloom Filter has 99.7% of its IPC.
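As a rough sizing illustration (assuming the 8K-bit budget is organized as a single partial-address BF bit array and that the table-based predictor spends the same budget on 4-bit counters; neither organization is prescribed by the numbers above), 2^p = 8192 gives p = 13 partial-address bits for the BF, while 8192 / 4 = 2048 entries are available for the counter table.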
The remainder of the paper is organized as follows: The
next section explains data speculation fundamentals and re-
lated work. Section 3 explains BFs and how they can be
used as hit/miss predictors. Section 4 describes how the
SimpleScalar microarchitecture [3] must be modified to sup-
port data speculation using a BF as a hit/miss predictor.
Section 5 evaluates the performance of BFs, reporting their
accuracy as hit/miss predictors and the performance benefit
(in IPC) they can provide. Finally, Section 6 concludes.
2. DATA SPECULATION
2.1 The Fundamentals
To facilitate the presentation and discussion, we consider
a baseline pipeline model that is similar to the Compaq Al-
pha 21264 [9]. In the baseline model, the front-end pipeline
stages are: instruction fetch and decode/rename. After de-
code/rename, the ALU instructions go through the back-end
stages: schedule, register read, execute, writeback, and com-
mit. Additional stages are required for executing a load.
After decode/rename, loads go through schedule, register
read, address generation, two cache access cycles, an addi-
tional cycle for hit/miss determination (data access before
hit/miss using way prediction [4]), writeback, and commit.
Thus, there are a total of 7 and 10 cycles for ALU and load
instructions, respectively.
Figure 1 shows the problem in scheduling the instructions
that are dependent on a load. For simplicity, the front-end
stages are omitted. In this example, the add instruction con-
sumes the data produced by the load instruction. After the
load is scheduled, it takes 5 cycles to resolve the hit/miss.
However, the dependent add must be scheduled the third
cycle after the load is scheduled to achieve the minimum
3-cycle load-use latency and allow back-to-back execution
of these two dependent instructions. If the processor spec-
ulatively schedules the add assuming the load will hit the
cache, the add will get incorrect data if the load actually misses
the cache. In this case, the add along with any other de-
pendent instructions scheduled within the illustrated 3-cycle
speculative window must be canceled and rescheduled.
To show the performance potential of using data spec-
ulation for scheduling instructions that are dependent on
loads, we simulated the SPECint2000 benchmarks. We com-
pare two scheduling techniques. The first is a no-speculation
scheme: the dependents are delayed until the hit/miss of the
parent load is known. The second uses a perfect hit/miss
predictor that knows the hit/miss of a load in time to (per-
fectly) schedule its dependents to achieve minimum load la-
tency. The performance gap (in IPC) between these two
extremes shows the performance potential of speculatively
scheduling the dependents of loads. Figure 2 shows the re-
sults. In these simulations, we modified the SimpleScalar
out-of-order pipeline to match our baseline model, and dou-
bled the default SimpleScalar issue width to 8, scaling the
other parameters accordingly. A more detailed description
of the simulation model is given in Section 5. On aver-
age, the IPC for perfect scheduling is 17% higher than the
IPC for the no-speculation scheme. Thus, the main focus of
this paper is to recover this 17% performance gap, by using
mechanisms for efficient load data speculation.
2.2 Related Work
The Compaq Alpha 21264 uses a mini-restart mechanism
to cancel and reschedule all instructions scheduled since a
mis-speculated load was scheduled [9]. While this mini-
restart is less costly than restarting the entire processor
pipeline, it is still expensive to reschedule (and re-execute)
both the dependent and the independent instructions. To
alleviate this problem, the Compaq Alpha 21264 uses the

Figure 1: Example of Data Speculation for a Load (timing for load r1 <- 0(r2) followed by a dependent add r3 <- r2, r1: the load passes through schedule, register read, address generation, two cache-access cycles, hit/miss determination, writeback, and commit; with speculative issue for a hit, the dependent add must be scheduled within the 3-cycle speculative window to achieve the minimum 3-cycle load-use latency, otherwise it stalls).
Figure 2: No-Speculation vs. Perfect Scheduling (IPC for the SPECint2000 benchmarks Bzip, Gap, Gcc, Gzip, Mcf, Parser, Perl, Twolf, Vpr, and their average, comparing the no-speculation scheme with perfect scheduling).
most significant bit of a 4-bit saturating counter as the load’s
hit/miss prediction. The counter is incremented by one ev-
ery time a load hits, and decremented by two every time a
load misses. The load’s dependents are scheduled accord-
ing to the prediction. If the prediction is wrong, either the
load was predicted to miss and it hit, in which case the ex-
ecution of the dependents will be unnecessarily delayed; or
the load was predicted to hit and it missed, in which case
dependents may have been erroneously scheduled and will
need to be rescheduled.
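To make this concrete, the following C sketch captures the behavior described above: a 4-bit saturating counter whose most significant bit serves as the hit/miss prediction, incremented by one on every hit and decremented by two on every miss. The single-counter organization and the function names are illustrative only; a table of such counters indexed by load PC (as evaluated later in this paper) behaves the same way per entry.

#include <stdbool.h>
#include <stdint.h>

/* 4-bit saturating counter; the most significant bit (bit 3) is the prediction. */
static uint8_t hm_counter = 15;              /* start saturated toward "hit" */

/* Predict before the load executes: MSB set means "predict hit". */
bool predict_hit(void)
{
    return (hm_counter & 0x8) != 0;
}

/* Train once the load's actual hit/miss status is known:
   +1 on a hit, -2 on a miss, saturating within [0, 15]. */
void train(bool hit)
{
    if (hit)
        hm_counter = (hm_counter < 15) ? hm_counter + 1 : 15;
    else
        hm_counter = (hm_counter >= 2) ? hm_counter - 2 : 0;
}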
Yoaz et al. [15] used 2-level local predictors, 2-level global
predictors, and hybrid predictors for cache hit/miss predic-
tion. Their results show that these predictors only correctly
identify half of the misses (for SPECint95), leaving the other
half predicted as hits. Furthermore, they incorrectly identify
a small percentage of the hits as being misses.
The MIPS R10000 speculatively issues instructions that
are dependent on a load and reschedules them if the load
misses the cache [14].
The Intel Pentium 4 achieves a minimum 2-cycle load-
use latency by leveraging the fact that most accesses hit
the first-level (L1) cache. The scheduler issues the depen-
dent micro-operations (called uops) before the parent load
has finished executing [6, 7]. In most cases, the scheduler
assumes the load will hit the L1 cache. A replay mecha-
nism is used to handle the case where the load misses the L1
cache. The replay logic keeps track of the dependent uops of
each speculative load. When a load misses, all its dependent
uops are re-executed with the correct data when that data
becomes available.
Morancho, Llabería, and Olivé describe a recovery mech-
anism for load latency misprediction [11]. A recovery buffer
retains all speculatively scheduled instructions. After a la-
tency misprediction, the load’s dependent instructions can
be re-scheduled directly from the recovery buffer as soon as
the load data becomes available. The recovery buffer allows
the processor to remove instructions from the scheduler early,
providing more space for other instructions.
3. BLOOM FILTERS
A Bloom Filter (BF) is a probabilistic algorithm to quickly
test membership in a large set using multiple hash functions
into an array of bits [2]. A BF quickly filters (i. e., identifies)
non-members without querying the large set by exploiting
the fact that a small percentage of erroneous classifications
can be tolerated. When a BF identifies a non-member, it is
guaranteed to not belong to the large set. When a BF iden-
tifies a member, however, it is not guaranteed to belong to
the large set. To put it more simply, the result of the mem-
bership test is either: it is definitely not a member, or, it is
probably a member. In this paper, we consider two variants
of the BF for filtering cache misses: one based on partitioned-
address matching, and the other based on partial-address
matching. To simplify our discussion, we first assume both
the BF and the cache use physical addresses. Afterwards,
we will describe using virtual addresses.
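For reference, the following C sketch shows the classic Bloom Filter membership test of [2], using two illustrative hash functions into a single bit array; a zero bit at any hashed position means the key is definitely not in the set, while all bits set means the key is only probably in the set. The hash functions and array size are arbitrary placeholders; the cache-miss variants described next differ in how the array is indexed and how entries are removed when lines leave the cache.

#include <stdbool.h>
#include <stdint.h>

#define BF_BITS 8192u                         /* size of the bit array (illustrative) */

static uint8_t bf[BF_BITS / 8];               /* the Bloom Filter bit array */

/* Two illustrative hash functions; any reasonably independent hashes would do. */
static uint32_t h1(uint32_t x) { return (x * 2654435761u) % BF_BITS; }
static uint32_t h2(uint32_t x) { return ((x >> 7) ^ (x * 40503u)) % BF_BITS; }

static void set_bit(uint32_t i) { bf[i / 8] |= (uint8_t)(1u << (i % 8)); }
static bool get_bit(uint32_t i) { return (bf[i / 8] >> (i % 8)) & 1u; }

/* Record an element as a member of the set. */
void bf_insert(uint32_t key)
{
    set_bit(h1(key));
    set_bit(h2(key));
}

/* Membership test: false means "definitely not a member";
   true only means "probably a member" (false positives are possible). */
bool bf_maybe_member(uint32_t key)
{
    return get_bit(h1(key)) && get_bit(h2(key));
}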
3.1 Partitioned-Address Bloom Filter
Consider a cache line address with n bits (ignoring the
offset bits). A large, direct-mapped array of 2^n bits is re-
quired to precisely record whether each cache line address is
in the cache. To reduce the space and allow a quick access, a
partitioned-address BF can be constructed. Instead of using
the entire line address, the address can be split into m par-
titions, with each partition using its own array of bits. The
result is m sub-arrays with 2^(n/m) bits, each of which records
the membership of the respective address partitions of lines
stored in the cache. A cache miss is identified when one or
more of the address partitions for the address of a requested
line does not belong to the respective address partition of
any line in the cache. A filter error is encountered when
a cache miss cannot be identified. This situation happens
when the line is not in the cache, but all m partitions of the
line’s address match address partitions of other cache lines.
The filter rate represents the percentage of cache misses that
can be identified.
Figure 3 illustrates how the partitioned-address BF works.
A load address is partitioned, in this example, into 4 equally
divided groups, A1, A2, A3, and A4. Each of the four ad-
dress partitions is used to index separate BF arrays, BF1,
BF2, BF3, and BF4, respectively.

Figure 3: Partitioned-Address Bloom Filter for Cache Miss Detection (the requested line address partitions A1-A4 index the arrays BF1-BF4; on a cache miss, the counters indexed by the requested line's partitions are incremented and those indexed by the replaced line's partitions R1-R4 are decremented; the output is true on a definite cache miss and false on a probable hit).

Each entry in the BF
arrays contains the information of whether the address par-
tition belongs to the corresponding address partition of any
line in the cache. If any of the 4 BF arrays indicates one
of the address partitions is absent from the cache, the re-
quested line is not in the cache. Otherwise, the requested
line is probably in the cache, but it’s not guaranteed to be.
Given the fact that a single address partition can exist
for multiple lines in the cache, the primary difficulty of the
partitioned-address BF is to maintain the correct member-
ship information. When a line is removed from the cache,
an exhaustive search is necessary to check if the address par-
titions for the address of the removed line still exist for any
of the remaining lines. To avoid such a search, each entry in
the BF array contains a reference counter that keeps track of
the number of cache lines with the entry’s corresponding ad-
dress partition. When a cache miss occurs, each counter for
the address partitions for the address of the newly-requested
line is incremented, while the counters for the address par-
titions for the address of the replaced line are decremented.
A zero count indicates the corresponding address partition
does not belong to any line in the cache. Although accurate,
this counter technique requires extra space in the BF arrays
for the counters along with adders to handle the updates.
A similar idea has been considered to reduce the number
of comparators for a set-associative cache [8] and to filter
cache-coherence traffic in a multiprocessor environment [12].
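The following C sketch illustrates the partitioned-address BF with reference counters as described above: on a miss, the counters for the filled line's address partitions are incremented and the counters for the replaced line's partitions are decremented, and a load is filtered as a definite miss whenever any of its partitions has a zero count. The partition count, partition width, and counter width are illustrative values, not the configuration evaluated in this paper.

#include <stdbool.h>
#include <stdint.h>

#define M          4                        /* number of address partitions (illustrative) */
#define PART_BITS  8                        /* bits per partition (illustrative) */
#define PART_SIZE  (1u << PART_BITS)        /* entries per sub-array: 2^(n/m) */

/* One reference counter per entry of each sub-array. */
static uint16_t bf_count[M][PART_SIZE];

/* Extract the i-th partition of a cache line address (offset bits already removed). */
static uint32_t partition(uint32_t line_addr, int i)
{
    return (line_addr >> (i * PART_BITS)) & (PART_SIZE - 1);
}

/* True: the line is definitely not in the cache (a filtered miss).
   False: the line is probably in the cache.                        */
bool bf_predict_miss(uint32_t line_addr)
{
    for (int i = 0; i < M; i++)
        if (bf_count[i][partition(line_addr, i)] == 0)
            return true;
    return false;
}

/* On a cache miss: the requested line is filled and (possibly) a victim is replaced. */
void bf_update_on_miss(uint32_t filled_line_addr, uint32_t replaced_line_addr,
                       bool line_was_replaced)
{
    for (int i = 0; i < M; i++) {
        bf_count[i][partition(filled_line_addr, i)]++;
        if (line_was_replaced)
            bf_count[i][partition(replaced_line_addr, i)]--;
    }
}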
3.2 Partial-Address Bloom Filter
The partial-address BF uses the least-significant bits of the
line address to index a small array of bits. Each bit indicates
whether the partial address matches any corresponding par-
tial address of a line in the cache. The array size is reduced
to 2^p bits, where p is the number of partial address bits. A
filter error occurs when the partial address of the requested
line matches the partial address of an existing cache line, but
the other portion of the line address does not match. We
call such cases collisions. The least-significant bits are selected rather than more-significant bits to reduce the chance of collisions. Due to memory reference locality, the more-significant line address bits tend to change less frequently. With a sufficient number of low-order partial address bits to represent cache line addresses, collisions are rare [10].

Figure 4: Partial-Address Bloom Filter for Cache Miss Detection (the p-bit partial address of the requested line indexes the BF array while the L1 cache tags are checked; a collision detector compares the p-bit partial address of the replaced cache line against the remaining lines in its set; on a cache miss the requested line's bit is set, and the replaced line's bit is reset only if there is no collision; the BF output is false on a definite cache miss and true on a probable hit).
The design of a partial-address BF is illustrated in Fig-
ure 4. A BF array with 2^p bits indicates whether the cor-
responding partial address matches that of any cache line.
The BF array is updated to reflect any cache content change.
When a cache miss occurs, except for the caveat described
in the paragraph below, the entry in the BF array for the re-
placed line is reset to indicate that the line with that partial
address is no longer in the cache. Then, the entry for the
requested line is set to indicate that a line with that partial
address now exists in the cache.
If the partial address is wider than the cache index, when
two cache lines share the same partial address, they must
be in the same set in a set-associative cache. The BF array
indicates which partial addresses exist in the cache, so if
one of these lines is replaced, the BF entry for the replaced
line should not be reset, since the partial address still exists
for the line that was not replaced. When a cache line is
replaced, the collision detector checks the remaining cache
lines in the same set as the replaced line to see if any of
them have the same partial address as the replaced line. If
any do have the same partial address, the BF entry is not
reset. Otherwise, the entry is reset. The collision detection
is done in parallel with the cache hit/miss detection. The
BF array is updated on the detection of a cache miss.
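The following C sketch summarizes the partial-address BF update for a set-associative cache: on a miss, the bit for the requested line's partial address is set, and the bit for the replaced line's partial address is cleared only if no remaining line in the victim's set shares that partial address. The partial-address width and the associativity are illustrative assumptions, and the caller is assumed to supply the line addresses still resident in the victim's set.

#include <stdbool.h>
#include <stdint.h>

#define P_BITS   13                         /* partial-address width (illustrative) */
#define BF_SIZE  (1u << P_BITS)
#define WAYS     4                          /* associativity (illustrative) */

static bool bf_bit[BF_SIZE];                /* one bit per partial address */

/* Partial address = least-significant P_BITS of the line address. */
static uint32_t partial(uint32_t line_addr) { return line_addr & (BF_SIZE - 1); }

/* True: definitely a cache miss.  False: probably a cache hit. */
bool bf_predict_miss(uint32_t line_addr)
{
    return !bf_bit[partial(line_addr)];
}

/* Called when a miss replaces 'victim_addr' with 'filled_addr'.
   'set_line_addr' and 'valid' describe the lines still resident in the
   victim's set after the replacement.                                   */
void bf_update_on_miss(uint32_t filled_addr, uint32_t victim_addr,
                       const uint32_t set_line_addr[WAYS], const bool valid[WAYS])
{
    bool collision = false;                 /* another resident line shares the partial address? */
    for (int w = 0; w < WAYS; w++)
        if (valid[w] && partial(set_line_addr[w]) == partial(victim_addr))
            collision = true;

    if (!collision)
        bf_bit[partial(victim_addr)] = false;   /* victim's partial address no longer cached */
    bf_bit[partial(filled_addr)] = true;        /* requested line's partial address now cached */
}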
3.3 Bloom Filters using Virtual Addresses
The hit/miss prediction for a load must be done before
the scheduling of its dependents. If the physical address is
not available in time to perform the prediction, the virtual
address must be used. When a virtual address is used to
access a BF, it is called a virtual-address BF. If the cache
is virtually indexed and tagged, the virtual-address BF operates analogously to the BF and cache that both use only physical addresses. However, if the cache is either virtually-indexed physically-tagged or physically-indexed physically-tagged, the BF array update for the virtual-address BF must be modified. In this section, we describe these modifications.

Figure 5: Partial-Virtual-Address Bloom Filter for Cache Miss Detection (the p0+p2 partial virtual address of the requested line indexes the BF array while the TLB and cache tag access proceed using the p0+p1 index bits; a Collision and Update Table (CUT), indexed by the p0 bits, holds the p2 bits of each cache line and supplies the victim information to the collision detector; on a cache miss the requested line's bit is set, and the replaced line's bit is reset only if there is no collision; the BF output is false on a definite cache miss and true on a probable hit).
With virtual addresses, two virtual addresses can map
to the same physical address, causing an address synonym.
With a virtual-address BF, the BF might identify the first
address as missing the cache, even though the line is in the
cache set identified by the second address. That is, the BF
identifies a load as missing the cache even though it hits.
This situation can arise regardless of whether the cache is
physically or virtually indexed. In this situation, the proces-
sor simply delays scheduling the load’s dependent instruc-
tions. Since cache hits by synonyms are rare, the perfor-
mance loss caused by the delayed scheduling is minimal. In
fact, for some virtually-indexed caches, the load-use latency
for a synonym hit is longer than for a non-synonym hit. For
scheduling, the processor may initially treat the synonym
hit as a cache miss, in which case the BF should identify the
synonym hit as a cache miss anyway.
A more essential issue is correctly updating the BF array
on cache misses. Let’s first focus on the partial-address BF
shown in Figure 5. To simplify our discussion, assume the
cache is physically indexed and tagged with p0+p1 index bits, where p0 bits are within the page offset and p1 bits are beyond the offset. During a cache access, p1 bits are translated. Also assume p0+p2 partial virtual address bits are used to access the BF, where p2 bits are beyond the page offset. To correctly update the BF array, the p2 bits of each cache line are stored in a Collision and Update Table (CUT). When a line is replaced, its p2 bits are read from the CUT. These p2 bits are then combined with the requested line's p0 bits to update the BF array.
The CUT is organized as a two-dimensional array and
indexed by the p0 bits. During each cache access, the set of p2 bits indexed by p0 are read from the CUT. If a cache miss is detected, the p2 bits of the victim (e.g., LRU) line in the accessed cache set are compared to the p2 bits for the other lines in that CUT set. If the victim's p2 bits don't match any other line's p2 bits, there is no collision, and the victim's p2 bits are used along with the p0 bits to reset the BF array to indicate that the line with the p0+p2 partial address is no longer in the cache. If the victim's p2 bits do match another line's p2
bits, the victim and the other line share the same
partial address, and there is a collision. In this case, the BF
entry for the victim line is left alone. Then, the BF entry
for the requested line is set using the partial virtual address
of the requested line. Note that when the cache is virtually-
indexed physically-tagged, all the cache index bits are used
to access the CUT. In this case, only the partial address bits
beyond the virtual cache index bits need to be saved in the
CUT and compared for collision detection.
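The following C sketch outlines this CUT-based update for the physically-indexed, physically-tagged case described above. The bit widths, the associativity implied by the CUT geometry, and the way the caller identifies the victim's CUT slot are illustrative assumptions; the virtually-indexed physically-tagged variant differs only in which bits index the CUT and which bits must be stored in it.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative widths: p0 index bits within the page offset, p1 translated
   index bits, p2 virtual partial-address bits beyond the page offset.      */
#define P0    6
#define P1    5
#define P2    7
#define WAYS  4

#define CUT_ROWS  (1u << P0)                /* the CUT is indexed by the p0 bits             */
#define CUT_COLS  ((1u << P1) * WAYS)       /* one slot per cache line sharing those p0 bits */
#define BF_SIZE   (1u << (P0 + P2))         /* BF indexed by the p0+p2 partial virtual bits  */

static bool     bf_bit[BF_SIZE];
static uint16_t cut_p2[CUT_ROWS][CUT_COLS];      /* stored p2 bits of each cache line */
static bool     cut_valid[CUT_ROWS][CUT_COLS];

static uint32_t bf_index(uint32_t p0, uint32_t p2) { return (p2 << P0) | p0; }

/* On a cache miss: the victim line occupies CUT slot 'victim_col' of row 'p0'
   (derived from the victim's set and way), and 'p2_req' holds the requested
   line's p2 virtual address bits.                                             */
void bf_update_on_miss(uint32_t p0, uint32_t victim_col, bool victim_valid,
                       uint32_t p2_req)
{
    if (victim_valid) {
        uint32_t p2_victim = cut_p2[p0][victim_col];

        /* Collision check: does any other resident line share the victim's p0+p2? */
        bool collision = false;
        for (uint32_t c = 0; c < CUT_COLS; c++)
            if (c != victim_col && cut_valid[p0][c] && cut_p2[p0][c] == p2_victim)
                collision = true;

        if (!collision)
            bf_bit[bf_index(p0, p2_victim)] = false;   /* partial address no longer cached */
    }

    /* Record the requested line in the CUT and mark its partial address present. */
    cut_p2[p0][victim_col]    = (uint16_t)p2_req;
    cut_valid[p0][victim_col] = true;
    bf_bit[bf_index(p0, p2_req)] = true;
}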
Handling a virtual partitioned-address BF is straightfor-
ward. Virtual address tags must be stored in the cache tag
array along with the physical tags. When a line is replaced,
the replaced line’s virtual address tag is used to update the
counter in each partitioned BF.
For the remainder of the paper, we will assume virtual-
address BFs. The virtual address needed to access the BF
is available after the address generation cycle. Due to their
rarity, we will omit discussions of synonym hits. In fact, for
our benchmarks there are no synonyms.
4. THE MICROARCHITECTURE
In our baseline model, ALU instructions require a min-
imum of 7 cycles: instruction fetch (IFE), decode/rename
(DEC), schedule (SCH), register read (REG), execute (EXE),
writeback (WRB), and commit (CMT). Loads extend the
execute stage to 4 cycles: address generation (AGN), two
cache access cycles (CA1, CA2), and hit/miss determina-
tion (H/M). Assuming a load hits the L1 cache, there is a
3-cycle speculative window in which the load’s dependents
and their children are scheduled. When a miss occurs, all of
the dependent instructions and their children scheduled in
these 3 cycles must be canceled and re-executed using the
correct data when it becomes available.
4.1 Predictor Timing and Mini-Restart
If data cache misses can be predicted early enough and
accurately enough, the processor’s scheduler can avoid in-
serting pipeline bubbles between a load and its dependent
instructions. To be effective, the load’s cache hit/miss pre-
diction must be done before its dependents must be sched-
uled. Thus, there are two basic issues: (1) when, and (2)
how fast the hit/miss prediction can be performed. Hit/miss
predictors that use saturating counters, like the one used by
the Compaq Alpha 21264, can access the counter at the be-
ginning of the pipeline. Since our pipeline has a minimum
3-cycle load latency, the prediction is available before any
of the load’s dependents need to be scheduled. If a miss is
predicted, the dependents are blocked from scheduling un-
til either the data comes back from the outer levels of the
memory hierarchy or the prediction is found to be incorrect.
The proposed Bloom Filter approach, on the other hand,
requires the load address to accurately identify (filter) misses.
This filtering can only be performed after the load address
is calculated in the address generation cycle. As shown in
Figure 1, the load’s dependent instructions must be sched-
uled the cycle after the load’s address generation to avoid

References

B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 1970.

D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 1997.

L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 2000.

R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, March 1999.

K. C. Yeager. The Mips R10000 superscalar microprocessor. IEEE Micro, April 1996.