IEEE Transactions on Computers, Vol. 48, No. 2, pp. 185-192, February 1999

Randomized Cache Placement for Eliminating Conflicts

Nigel Topham, Member, IEEE Computer Society, and Antonio González, Member, IEEE Computer Society
Abstract: Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache where the miss ratio depends solely on working-set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels that illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.

Index Terms: Conflict avoidance, cache architectures, performance evaluation.
1 INTRODUCTION

If the upward trend in processor clock frequencies during the last 10 years is extrapolated over the next ten years, we will see clock frequencies increase by a factor of 20 during that period [1]. However, based on the current 7 percent per annum reduction in DRAM access times [2], memory latency can be expected to reduce by only 50 percent in the next 10 years. This potential 10-fold increase in the distance to main memory has serious implications for the design of future cache-based memory hierarchies, as well as for the architecture of memory devices themselves.
Each block of main memory can be placed in exactly one set of blocks in cache. The chosen set is determined by the indexing function. Conventional caches typically extract a field of m bits from the address and use this to select one block from a set of 2^m. While easy to implement, this indexing function is not robust. The principal weakness is its susceptibility to repetitive conflict misses. For example, if C is the number of cache sets and B is the block size, then addresses A_1 and A_2 map to the same cache set if ⌊A_1/B⌋ mod C = ⌊A_2/B⌋ mod C. If A_1 and A_2 collide on the same cache set, then addresses A_1 + k and A_2 + k also collide in cache, for any integer k, except when m_1 < B − (k mod B) ≤ m_2, where m_1 = min(A_1 mod B, A_2 mod B) and m_2 = max(A_1 mod B, A_2 mod B). There are two common cases when this happens. First, when accessing a stream of addresses {A_0, A_1, ..., A_m}, if A_i collides with A_{i+k}, then there may be up to m − k conflict misses in this stream. Second, when accessing elements of two distinct arrays b_0 and b_1, if b_0[i] collides with b_1[j], then b_0[i + k] collides with b_1[j + k], except under the conditions outlined above. Set-associativity can help to alleviate such conflicts, but it is not an effective solution for repetitive and regular conflicts.
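As a concrete illustration (a toy sketch of ours, not code from the paper), the following Python fragment models the conventional index function with parameters matching the cache simulated later in this section (B = 32-byte blocks, C = 256 sets, i.e., an 8 KB direct-mapped cache) and exhibits both failure cases:

```python
# Toy model of a conventional cache index: set = floor(address / B) mod C.
# B and C are illustrative (32-byte blocks, 256 sets = 8 KB direct-mapped).

B, C = 32, 256

def set_index(addr: int) -> int:
    return (addr // B) % C                 # conventional modulo indexing

# Case 1: a stream with stride B*C maps every access to the same set.
stream = [i * (B * C) for i in range(16)]
assert {set_index(a) for a in stream} == {0}   # 16 accesses, one cache set

# Case 2: if A1 and A2 collide, then A1+k and A2+k also collide for any k
# (away from the block-boundary exception described above).
A1, A2, k = 0x0000, 0x2000, 5
assert set_index(A1) == set_index(A2)
assert set_index(A1 + k) == set_index(A2 + k)
```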
One of the best ways to control locality in dense matrix
computations with large data structures is to use a tiled (or
ªblockedº) algorithm. This is effectively a reordering of the
iteration space which increases temporal locality. However,
previous work has shown that the conflicts introduced by
tiling can be a serious problem [3]. In practice, until now,
this has meant that compilers which tile loop nests really
ought to compute the maximal conflict-free tile size for
given values of B, major array dimension N, and cache
capacity C. Often, this will be too small to make it
worthwhile tiling a loop or, perhaps, the value of N will
not be known at compile time. Gosh et al. [4] present a
framework for analyzing cache misses in perfectly nested
loops with affine references. They develop a generic
technique for determining optimum tile sizes and methods
for determining array padding sizes to avoid conflicts.
These methods require solutions to sets of linear Diophan-
tine equations and depend upon there being sufficient
information at compile time to find such solutions.
Table 1 highlights the problem of conflict misses with reference to the SPEC95 benchmarks. The programs were compiled with the maximum optimization level and instrumented with the ATOM tool [5]. A data cache similar to the first-level cache of the Alpha 21164 microprocessor was simulated: 8 KB capacity, 32-byte lines, write-through and no write allocate. For each benchmark, we simulated the first 2^30 load operations.
N. Topham is with the Institute for Computing Systems Architecture, Division of Informatics, Edinburgh University, Edinburgh, Scotland. E-mail: npt@dcs.ed.ac.uk.
A. González is with the Department of Computer Architecture, Polytechnic University of Catalonia, c/ Jordi Girona 1-3, Modulo D6, 08034 Barcelona, Spain. E-mail: antonio@ac.upc.es.

Because of the no-write-allocate feature, the tables below refer only to load operations. Table 1 shows the miss ratio for the following
cache organizations: direct-mapped, two-way associative,
column-associative [6], victim cache with four victim lines
[7], and two-way skewed-associative [8], [9].
Of these schemes, only the two-way skewed-associative
cache uses an unconventional indexing scheme, as pro-
posed by its author. For comparison, the miss ratio of a fully
associative cache is shown in the penultimate column. The
miss ratio difference between a direct-mapped cache and
that of a fully associative cache is shown in the right-most
column of Table 1, and represents the direct-mapped
conflict miss ratio (CMR) [2]. In the case of hydro2d and
apsi, some organizations exhibit lower miss ratios than a
fully-associative cache due to suboptimality of LRU
replacement in a fully-associative cache for these particular
programs. Effectively, the direct-mapped conflict miss ratio
represents the target reduction in miss ratio that we hope to
achieve through improved indexing schemes. The other
type of misses, compulsory and capacity, will remain
unchanged by the use of randomized indexing schemes.
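Restated as a formula (our notation; the paper defines the conflict miss ratio only in prose):

```latex
\mathrm{CMR} \,=\, \mathit{miss}_{\mathrm{DM}} - \mathit{miss}_{\mathrm{FA}}
```

where miss_DM and miss_FA are the miss ratios of the direct-mapped and fully associative organizations, respectively.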
As expected, the improvement of a 2-way set-associative
cache over a direct-mapped cache is rather low. The
column-associative cache provides a miss ratio similar to
that of a two-way set-associative cache. Since the former has
a lower access time but requires two cache probes to satisfy
some hits, any choice between these two organizations
should take into account implementation parameters such
as access time and miss penalty. The victim cache removes
many conflict misses and outperforms a four-way set-
associative cache. Finally, the two-way skewed-associative
cache offers the lowest miss ratio. Previous work has shown
that it can be significantly more effective than a four-way
conventionally indexed set-associative cache [10].
In this paper, we investigate the use of alternative index
functions for reducing conflicts and discuss some practical
implementation issues. Section 2 introduces the alternative
index functions and Section 3 evaluates their conflict
avoidance properties. In Section 4, we discuss a number
of implementation issues, such as the effect of novel indexing
functions on cache access time. Then, in Section 5, we
evaluate the impact of the proposed indexing scheme on
the performance of a dynamically scheduled processor.
Finally, in Section 6, we draw conclusions from this study.
2 ALTERNATIVE INDEXING FUNCTIONS
The aim of this paper is to show how alternative cache
organizations can eliminate repetitive conflict misses. This
is analogous to the problem of finding an efficient hashing
function. For large secondary or tertiary caches, it may be
possible to use the virtual address mapping to adjust the
location of pages in cache, as suggested by Bershad et al.
[11], thus avoiding conflicts dynamically. However, for
small first-level caches, this effect can only be achieved by
using an alternative cache index function.
In the field of interleaved memories, it is well-known
that bank conflicts can be reduced by using bank selection
functions other than the simple modulo-power-of-two.
Lawrie and Vora proposed a scheme using prime-modulus functions [12], while Harper and Jump [13] and Sohi [14] proposed skewing functions. The use of XOR functions in
parallel memory systems was proposed by Frailong et al.
[15], and other pseudorandom functions were proposed by
Raghavan and Hayes [16] and Rau et al. [17], [18]. These
schemes each yield a more or less uniform distribution of
requests to banks, with varying degrees of theoretical
predictability and implementation cost. In principle, each
of these schemes could be used to construct a conflict-
resistant cache by using them as the indexing function.
However, in cache architectures, two factors are critical:
First, the chosen indexing function must have a logically
simple implementation and, second, we would like to be
able to guarantee good behavior on all regular address
patterns, even those that are pathological under a conventional index function.
In the commercial domain, the IBM 3033 [19] and the
Amdahl 470 [20] made use of XOR-mapping functions in
order to index the TLB. The first generation HP Precision
Architecture processors [21] also used a similar technique.
The use of pseudorandom cache indexing has been
suggested by other authors. For example, Smith [22]
compared a pseudorandom placement against a set-
associative placement. He concluded that random indexing
had a small advantage in most cases, but that the
advantages were not significant. In this paper, we show
that, for certain workloads and cache organizations, the
advantages can be very large.
Hashing the process
ID with the address bits in order to
index the cache was evaluated in a multiprogrammed
environment by Agarwal in [23]. Results showed that this scheme could reduce the miss ratio.

TABLE 1
Cache Miss Ratios for Direct-Mapped (DM), 2-Way Set-Associative (2W), Column-Associative (CA), Victim Cache (VC), 2-Way Skewed-Associative (SA), and Fully-Associative (FA) Organizations. Conflict-Miss Ratio (CMR) Is Also Shown.
Perhaps the most well-known alternative cache indexing
scheme is the class of bitwise exclusive-OR functions
proposed for the skewed associative cache [8]. The bitwise
XOR mapping computes each bit of the cache index as
either one bit of the address or the XOR of two bits. Where
two such mappings are required, different groups of bits
are chosen for XORing in each case. A two-way skewed-
associative cache consists of two banks of the same size that
are accessed simultaneously with two different hashing
functions. Not only does the associativity help to reduce
conflicts but the skewed indexing functions help to prevent
repetitive conflicts from occurring.
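A minimal Python sketch of this style of mapping follows (our illustration; the specific bit pairings below are invented for the example and are not Seznec's published functions):

```python
# Bitwise-XOR skewing: each index bit is one address bit, or the XOR of two.
# Two different pairings give the two banks their distinct hash functions.
# Bit positions are illustrative: 32-byte blocks (5 offset bits), 8 index bits.

def bit(addr: int, i: int) -> int:
    return (addr >> i) & 1

def xor_index(addr: int, pairing) -> int:
    idx = 0
    for i, (lo, hi) in enumerate(pairing):
        idx |= (bit(addr, lo) ^ bit(addr, hi)) << i
    return idx

# Bank 0 XORs address bit 5+i with bit 13+i; bank 1 pairs bits differently,
# so an address pair that collides in one bank rarely collides in the other.
pairing0 = [(5 + i, 13 + i) for i in range(8)]
pairing1 = [(5 + i, 13 + ((i + 3) % 8)) for i in range(8)]

addr = 0x12345
print(xor_index(addr, pairing0), xor_index(addr, pairing1))
```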
The polynomial modulus function was first applied to cache indexing in [10]. It is best described by first considering the unsigned integer address A in terms of its binary representation A = a_{n−1}·2^{n−1} + a_{n−2}·2^{n−2} + ... + a_0. This is interpreted as the polynomial A(x) = a_{n−1}·x^{n−1} + a_{n−2}·x^{n−2} + ... + a_0, defined over the field GF(2). The binary representation of the m-bit cache index R is similarly defined by the GF(2) polynomial R(x), of order less than m, such that A(x) = V(x)·P(x) + R(x). Effectively, R(x) is A(x) modulo P(x), where P(x) is an irreducible polynomial of order m such that x^i mod P(x) generates all polynomials of order lower than m. The polynomials that fulfil these requirements are called Ipoly polynomials. Rau showed how the computation of R(x) can be accomplished by the vector-matrix product of the address and an n × m matrix H of single-bit coefficients derived from P(x) [18]. In GF(2), this product is computed by a network of AND and XOR gates and, if the H-matrix is constant, the AND gates can be omitted; the mapping then requires just m XOR gates with fan-in from 2 to n. In practice, we may reduce the number of input address bits to the polynomial mapping function by ignoring some of the upper bits in A. This does not seriously degrade the quality of the mapping function.
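To make the construction concrete, here is a small Python sketch (ours, not the authors' implementation). The polynomial P(x) = x^8 + x^4 + x^3 + x^2 + 1 (0x11D, a well-known primitive polynomial), the 32-bit address, and the 8-bit index are illustrative assumptions; the paper does not state its polynomials here. The sketch computes R(x) = A(x) mod P(x) by GF(2) long division and checks it against the equivalent constant H-matrix form:

```python
# Ipoly cache index: R(x) = A(x) mod P(x) over GF(2).
# M, P, and the 32-bit address width are illustrative assumptions.

M = 8                  # index width in bits (2^M cache sets)
P = 0x11D              # P(x) = x^8 + x^4 + x^3 + x^2 + 1, a primitive polynomial

def ipoly_index(addr: int, n: int = 32) -> int:
    """GF(2) long division: repeatedly XOR out the highest remaining bit."""
    r = addr
    for bit in range(n - 1, M - 1, -1):
        if r & (1 << bit):
            r ^= P << (bit - M)       # subtract (XOR) a shifted copy of P(x)
    return r                          # the M-bit remainder is the cache index

# Equivalent constant H-matrix: column j of H is x^j mod P(x), so each index
# bit is the XOR of the address bits whose column has that bit set.
H = [ipoly_index(1 << j) for j in range(32)]

def ipoly_index_hmatrix(addr: int) -> int:
    r = 0
    for j, col in enumerate(H):
        if addr & (1 << j):
            r ^= col
    return r

assert all(ipoly_index(a) == ipoly_index_hmatrix(a) for a in range(0, 1 << 16, 97))
```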
Ipoly mapping functions have been studied previously
in the context of stride-insensitive interleaved memories
(see [17], [18]) and have certain provable characteristics of
significant value for cache indexing. In [24], it was
demonstrated that a skewed Ipoly cache indexing scheme
shows a higher degree of conflict resistance than that
exhibited by conventional set-associativity or other (non-
Ipoly) XOR-based mapping functions. Overall, the skewed-
associative cache using Ipoly mapping and a pure LRU
replacement policy achieved a miss ratio within 1 percent of
that achieved by a fully associative cache. Given the
advantage of an Ipoly function over the bitwise XOR
function, all results presented in this paper use the Ipoly
indexing scheme.
3 EVALUATION OF CONFLICT RESISTANCE
The performance of both the integer and floating-point
SPEC95 programs has been evaluated for column-associa-
tive, two-way set-associative (2W), and two-way skewed-
associative organizations using Ipoly indexing functions. In
all cases, a single-level cache is assumed. The miss ratios of
these configurations are shown in Table 2. Given a
conventional indexing function, the direct-mapped (DM)
and fully associative (FA) cache organizations display,
respectively, the lowest and the highest degrees of conflict
resistance of all possible cache architectures. As such, they
define the bounds within which novel indexing schemes
should be evaluated. Their miss ratios are shown in the
right-most two columns of Table 2.
The column-associative cache has access-time character-
istics similar to a direct-mapped cache but has some degree
of pseudo-associativity: each address can map to one of
two locations in the cache but, initially, only one is probed.
The column labeled SPL represents a cache which swaps
data between the two locations to increase the percentage of
a hit on the first probe. It also uses a realistic pseudo-LRU
replacement policy. The cache reported in the column
labeled LRU does not swap data between columns and uses
an unrealistic pure LRU replacement policy [10].
It is to be expected that a two-way set-associative cache
will be capable of eliminating many random conflicts.
However, a conventionally indexed set-associative cache is
not able to eliminate pathological conflict behavior as it has
limited associativity and a naive indexing function. The
performance of a two-way set-associative cache can be
improved by simply replacing the index function, while
retaining all other characteristics. Conventional LRU repla-
cement can still be used, as the indexing function has no
impact on replacement for this cache organization. For two
programs, the two-way Ipoly cache has a lower miss ratio
than a fully associative cache. This is again due to the
suboptimality of LRU replacement in the fully associative cache and is a common anomaly in programs with negligible conflict misses.

TABLE 2
Miss Ratios for Ipoly Indexing on SPEC95 Benchmarks
The final cache organization shown in Table 2 is the two-
way skewed-associative cache proposed originally by
Seznec [8]. In its original form, it used two bitwise XOR
indexing functions. Our version uses Ipoly indexing
functions, as proposed in [10] and [24]. In this case, two
distinct Ipoly functions are used to construct two distinct
cache indices from each address. Pure LRU is difficult to
implement in a skewed-associative cache, so here we
present results for a cache which uses a realistic pseudo-LRU policy (labeled pLRU) and a cache which uses an
unrealistic pure LRU policy (labeled LRU). This organiza-
tion produces the lowest conflict miss ratio, down from 4.8
percent to 0.67 percent for SPECint, and from 12.61 percent
to 0.07 percent for SPECfp.
It is striking that the performance improvement is
dominated by three programs (tomcatv, swim, and wave).
These effectively exhibit pathological conflict miss ratios
under conventional indexing schemes. Studies by Olukotun
et al. [25], have shown that the data cache miss ratio in
tomcatv wastes 56 percent and 40 percent of available IPC
in 6-way and 2-way superscalar processors, respectively.
Tiling will often introduce extra cache conflicts, the
elimination of which is not always possible through
software. Now that we have alternative indexing functions with conflict-avoidance properties, we can use them to avoid these induced conflicts. The effectiveness of Ipoly
indexing for tiled loops was evaluated by simulating the
cache behavior of a variety of tiled loop kernels. Here, we
present a small sample of results to illustrate the general
outcome. Figs. 1 and 2 show the miss ratios observed in two
tiled matrix multiplication kernels, where the original
matrices were square and of dimensions 171 and 256,
respectively. Tile sizes were varied from 2 × 2 up to 16 × 16
to show the effect of conflicts occurring in caches that are
direct-mapped (a1), 2-way set-associative (a2), fully asso-
ciative (fa), and skewed 2-way Ipoly (Hp-Sk). The tiled
working set divided by cache capacity measures the
fraction of the cache occupied by a single tile. Cache
capacity is 8 KBytes, with 32-byte lines.
For dimension 171, the miss ratio initially falls for all
caches as tile size increases. This is due to increasing spatial
locality, up to the point where self-conflicts begin to occur
in the conventionally indexed direct-mapped and two-way
set-associative caches. The fully associative cache suffers no
self-conflicts and its miss ratio decreases monotonically to
less than 1 percent at 50 percent loading. The behavior of
the skewed 2-way Ipoly cache tracks the fully associative
cache closely. The qualitative difference between the Ipoly
cache and a conventional two-way cache is clearly visible.
For dimension 256, the product array and the multi-
plicand array are positioned in memory so that cross-
conflicts occur in addition to self-conflicts. Hence, the
direct-mapped and 2-way set associative caches experience
little spatial locality. However, the Ipoly cache is able to
eliminate cross-conflicts, as well as self-conflicts, and it
again tracks the fully associative cache.
4 IMPLEMENTATION ISSUES
The logic of the GF(2) polynomial modulus operation
presented in Section 2 defines a class of hash functions
which compute the cache placement of an address by
combining subsets of the address bits using XOR gates. This
means that, for example, bit 0 of the cache index may be
computed as the XOR of bits 0, 11, 14, and 19 of the original
address. The choice of polynomial determines which bits
are included in each set. The implementation of such a
function for a cache with an 8-bit index would require just
eight XOR gates with fan-in of 3 or 4.
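For instance, continuing the illustrative polynomial from the sketch in Section 2 (P(x) = 0x11D and 19 usable address bits are our assumptions, not the paper's exact choices), one can enumerate which address bits feed the XOR gate for each index bit; the fan-in per gate depends on the polynomial chosen:

```python
# List, for each of the 8 index bits, the address bits XORed to produce it.
# These tap sets are the columns/rows of the constant H-matrix.

M, P, N = 8, 0x11D, 19      # index width, example Ipoly polynomial, input bits

def gf2_mod(a: int) -> int:
    for b in range(N - 1, M - 1, -1):    # GF(2) long division by P(x)
        if a & (1 << b):
            a ^= P << (b - M)
    return a

for i in range(M):
    taps = [j for j in range(N) if (gf2_mod(1 << j) >> i) & 1]
    print(f"index bit {i}: XOR of address bits {taps}")
```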
While this appears remarkably simple, there is more to
consider than just the placement function. First, the
function itself uses address bits beyond the normal limit
imposed by the typical minimum page-size restriction. Second, the use of pseudorandom placement in a multilevel memory hierarchy has implications for the maintenance of inclusion. In [24], we explain these two issues in more depth
and show how the virtual-real two-level cache hierarchy proposed by Wang et al. [26] provides a clean solution to both problems.

Fig. 1. Miss ratio versus cache loading for 171 × 171 matrix multiply.
Fig. 2. Miss ratio versus cache loading for 256 × 256 matrix multiply.
A cache memory access in a conventional organization
normally computes its effective address by adding two
registers or a register plus a displacement. Ipoly indexing
implies additional circuitry to compute the index from the
effective address. This circuitry consists of several XOR
gates that operate in parallel and therefore the total delay is
just the delay of one gate. Each XOR gate has a number of
inputs that depend on the particular polynomial being
used. For the experiments reported in this paper, the
number of inputs is never higher than 5. The XOR gating
required by the Ipoly mapping may increase the critical
path length within the processor pipeline. However, any
delay will be short since all bits of the index can be
computed in parallel. Moreover, we show later that, even if
this additional delay induces a full cycle penalty in the
cache access time, the Ipoly mapping provides a significant
overall performance improvement. Memory address predic-
tion can also be used to avoid the penalty introduced by the
XOR delay when it lengthens the critical path. Memory
addresses have been shown to be highly predictable. For
instance, in [27], it was shown that the addresses of about 75
percent of the dynamically executed memory instructions
from the SPEC95 suite can be predicted with a simple
tabular scheme which tracks the last address produced by a
given instruction and its last stride. A similar scheme that
could be used to give an early prediction of the line that is
likely to be accessed by a given load instruction is outlined
below.
The processor incorporates a table indexed by the
instruction address. Each entry stores the last address and
the predicted stride for some recently executed load
instruction. In the fetch stage, this table is accessed with
the program counter. In the decode stage, the predicted
address is computed and the XOR functions are performed
to compute the predicted cache line. This can be done in one
cycle since the XOR can be performed in parallel with the
computation of the most significant bits of the effective
address. When the instruction is subsequently issued to the
memory unit, it uses the predicted line number to access the
cache in parallel with the actual address and line computa-
tion. If the predicted line turns out to be incorrect, the cache
access is repeated with the actual address. Otherwise, the
data provided by the speculative access can be loaded into
the destination register.
A number of previous papers have suggested address
prediction as a means to reduce memory latency [28], [29],
[30], or to execute memory instructions and their dependent
instructions speculatively [31], [27], [32]. In the case of a
miss-speculation, a recovery mechanism similar to that
used by branch prediction schemes is then used to squash
miss-speculated instructions.
5 EFFECT OF IPOLY INDEXING ON IPC
In order to verify the impact of polynomial mapping on
realistic microprocessor architectures, we have developed a
parametric simulator for a four-way superscalar processor
with out-of-order execution. Table 3 summarizes the
functional units and their latencies used in these experi-
ments. The reorder buffer contained 32 entries, and there
were two separate physical register files (FP and Integer),
each with 64 physical registers. The processor had a lockup-
free data cache [33] that allowed eight outstanding misses to
different cache lines. Cache capacities of 8 KB and 16 KB
were simulated with 2-way associativity and 32-byte lines.
The cache was write-through and no-write-allocate. The
cache had two ports, each with a two-cycle hit time and a
miss penalty of 20 cycles. This was connected by a 64-bit
data bus to an infinite level-two cache. Data dependencies
through memory were speculated using a mechanism
similar to the ARB of the Multiscalar [34] and the HP PA-
8000 [35]. A branch history table with 2K entries and 2-bit
saturating counters was used for branch prediction.
The memory address prediction scheme was implemen-
ted by a direct-mapped table with 1K entries, indexed by
instruction address. To reduce cost, the entries were not
tagged, although this increases interference in the table.
Each entry contained the last effective address of the most
recent load instruction to index into that table entry,
together with the last observed stride. In addition, each
entry contained a 2-bit saturating counter to assign
confidence to the prediction. Only when the most-signifi-
cant bit of the counter is set would the prediction be
considered correct. The address field was updated for each
new reference, regardless of the prediction. However, the
stride field was updated only when the counter went below 2 (binary 10), i.e., after two consecutive mispredictions.
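The following Python sketch is our behavioral reconstruction of this table; the field widths, 1K untagged entries, and counter policy follow the text, while details such as 4-byte instruction alignment are illustrative:

```python
# Behavioral model of the address prediction table described above: 1K
# untagged entries indexed by instruction address, each holding the last
# effective address, a stride, and a 2-bit saturating confidence counter.

TABLE_SIZE = 1024

class Entry:
    __slots__ = ("last_addr", "stride", "counter")
    def __init__(self):
        self.last_addr, self.stride, self.counter = 0, 0, 0

table = [Entry() for _ in range(TABLE_SIZE)]

def index(pc: int) -> int:
    return (pc >> 2) % TABLE_SIZE      # untagged: aliasing is tolerated

def predict(pc: int):
    """Return a predicted effective address, or None if not confident."""
    e = table[index(pc)]
    if e.counter >= 2:                 # most-significant bit of counter set
        return e.last_addr + e.stride
    return None

def update(pc: int, actual: int) -> None:
    e = table[index(pc)]
    if e.last_addr + e.stride == actual:
        e.counter = min(3, e.counter + 1)
    else:
        e.counter = max(0, e.counter - 1)
        if e.counter < 2:              # below binary 10: update the stride
            e.stride = actual - e.last_addr
    e.last_addr = actual               # address field updated on every reference
```

In the pipeline described earlier, predict would be consulted at decode so that the XOR index computation can proceed in parallel with effective-address generation, and update would run once the actual address resolves.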
Table 4 shows the IPC and miss ratios for six configurations.¹
All IPC averages are computed using an equally-
weighted harmonic mean. The baseline configuration is an 8
KB cache with conventional indexing and no address
prediction (NP, third column). The average IPC for this
configuration is 1.27 from an average miss ratio of 16.53
percent. With Ipoly indexing, the average miss ratio falls to
9.68 percent. If the XOR gates are not in the critical path,
IPC rises to 1.33 (NX, fifth column). Conversely, if the XOR
gates are in the critical path and a one cycle penalty in the
cache access time is assumed, the resulting IPC is 1.29 (WX,
sixth column). However, if memory address prediction is
then introduced (WP, seventh column), IPC is the same as
for a cache without the XOR gates in the critical path (NX).
Hence, the memory address prediction scheme can offset
the penalty introduced by the additional delay of the XOR
gates when they are in the critical path, even under the
conservative assumption that a whole cycle of latency is
added to each load instruction. Finally, Table 4 also shows
the performance of a 16 KB 2-way set-associative cache
without Ipoly indexing (second column). Notice that the
addition of Ipoly indexing to an 8 KB cache yields over 60
percent of the IPC increase that can be obtained by doubling
the cache size.
These IPC measurements exhibit small absolute differ-
ences, but this is because the benefit of Ipoly indexing is
perceived by only a small subset of the benchmark
programs. Most programs in SPEC95 exhibit low conflict
1. For each configuration, we simulated 10^8 instructions after skipping the first 2 × 10^9.

REFERENCES (partial; only these entries survived in the source)

[2] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann.
[5] A. Srivastava and A. Eustace, "ATOM: A System for Building Customized Program Analysis Tools," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI), 1994.
[7] N.P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. 17th Int'l Symp. Computer Architecture (ISCA), 1990.
[22] A.J. Smith, "Cache Memories," ACM Computing Surveys, vol. 14, no. 3, pp. 473-530, 1982.