IEEE Transactions on Computers, Vol. 48, No. 2, pp. 185-192, February 1999

Randomized Cache Placement for Eliminating Conflicts

Nigel Topham, Member, IEEE Computer Society, and Antonio González, Member, IEEE Computer Society
Abstract: Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache where the miss ratio depends solely on working-set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels that illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.

Index Terms: Conflict avoidance, cache architectures, performance evaluation.
1 INTRODUCTION

If the upward trend in processor clock frequencies during the last 10 years is extrapolated over the next ten years, we will see clock frequencies increase by a factor of 20 during that period [1]. However, based on the current 7 percent per annum reduction in DRAM access times [2], memory latency can be expected to reduce by only 50 percent in the next 10 years. This potential 10-fold increase in the distance to main memory has serious implications for the design of future cache-based memory hierarchies, as well as for the architecture of memory devices themselves.
Each block of main memory can be placed in exactly one set of blocks in cache. The chosen set is determined by the indexing function. Conventional caches typically extract a field of m bits from the address and use this to select one block from a set of 2^m. While easy to implement, this indexing function is not robust. The principal weakness is its susceptibility to repetitive conflict misses. For example, if C is the number of cache sets and B is the block size, then addresses A_1 and A_2 map to the same cache set if ⌊A_1/B⌋ mod C = ⌊A_2/B⌋ mod C. If A_1 and A_2 collide on the same cache set, then addresses A_1 + k and A_2 + k also collide in cache, for any integer k, except when m_1 < B − (k mod B) ≤ m_2, where m_1 = min(A_1 mod B, A_2 mod B) and m_2 = max(A_1 mod B, A_2 mod B). There are two common cases when this happens. First, when accessing a stream of addresses {A_0, A_1, ..., A_m}, if A_i collides with A_{i+k}, then there may be up to m − k conflict misses in this stream. Second, when accessing elements of two distinct arrays b_0 and b_1, if b_0[i] collides with b_1[j], then b_0[i + k] collides with b_1[j + k], except under the conditions outlined above. Set-associativity can help to alleviate such conflicts, but it is not an effective solution for repetitive and regular conflicts.
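As a concrete illustration (a toy sketch of ours, not code from the paper), the following Python fragment models the conventional index function with parameters matching the cache simulated later in this section (B = 32-byte blocks, C = 256 sets, i.e., an 8 KB direct-mapped cache) and exhibits both failure cases:

```python
# Toy model of a conventional cache index: set = floor(address / B) mod C.
# B and C are illustrative (32-byte blocks, 256 sets = 8 KB direct-mapped).

B, C = 32, 256

def set_index(addr: int) -> int:
    return (addr // B) % C                 # conventional modulo indexing

# Case 1: a stream with stride B*C maps every access to the same set.
stream = [i * (B * C) for i in range(16)]
assert {set_index(a) for a in stream} == {0}   # 16 accesses, one cache set

# Case 2: if A1 and A2 collide, then A1+k and A2+k also collide for any k
# (away from the block-boundary exception described above).
A1, A2, k = 0x0000, 0x2000, 5
assert set_index(A1) == set_index(A2)
assert set_index(A1 + k) == set_index(A2 + k)
```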
One of the best ways to control locality in dense matrix
computations with large data structures is to use a tiled (or
ªblockedº) algorithm. This is effectively a reordering of the
iteration space which increases temporal locality. However,
previous work has shown that the conflicts introduced by
tiling can be a serious problem [3]. In practice, until now,
this has meant that compilers which tile loop nests really
ought to compute the maximal conflict-free tile size for
given values of B, major array dimension N, and cache
capacity C. Often, this will be too small to make it
worthwhile tiling a loop or, perhaps, the value of N will
not be known at compile time. Gosh et al. [4] present a
framework for analyzing cache misses in perfectly nested
loops with affine references. They develop a generic
technique for determining optimum tile sizes and methods
for determining array padding sizes to avoid conflicts.
These methods require solutions to sets of linear Diophan-
tine equations and depend upon there being sufficient
information at compile time to find such solutions.
Table 1 highlights the problem of conflict misses with reference to the SPEC95 benchmarks. The programs were compiled with the maximum optimization level and instrumented with the ATOM tool [5]. A data cache similar to the first-level cache of the Alpha 21164 microprocessor was simulated: 8 KB capacity, 32-byte lines, write-through and no write allocate. For each benchmark, we simulated the first 2^30 load operations.
N. Topham is with the Institute for Computing Systems Architecture, Division of Informatics, Edinburgh University, Edinburgh, Scotland. E-mail: npt@dcs.ed.ac.uk.
A. González is with the Department of Computer Architecture, Polytechnic University of Catalonia, c/ Jordi Girona 1-3, Modulo D6, 08034 Barcelona, Spain. E-mail: antonio@ac.upc.es.

Because of the no-write-allocate feature, the tables below refer only to load operations. Table 1 shows the miss ratio for the following
cache organizations: direct-mapped, two-way associative,
column-associative [6], victim cache with four victim lines
[7], and two-way skewed-associative [8], [9].
Of these schemes, only the two-way skewed-associative
cache uses an unconventional indexing scheme, as pro-
posed by its author. For comparison, the miss ratio of a fully
associative cache is shown in the penultimate column. The
miss ratio difference between a direct-mapped cache and
that of a fully associative cache is shown in the right-most
column of Table 1, and represents the direct-mapped
conflict miss ratio (CMR) [2]. In the case of hydro2d and
apsi, some organizations exhibit lower miss ratios than a
fully-associative cache due to suboptimality of LRU
replacement in a fully-associative cache for these particular
programs. Effectively, the direct-mapped conflict miss ratio
represents the target reduction in miss ratio that we hope to
achieve through improved indexing schemes. The other
type of misses, compulsory and capacity, will remain
unchanged by the use of randomized indexing schemes.
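Restated as a formula (our notation; the paper defines the conflict miss ratio only in prose):

```latex
\mathrm{CMR} \,=\, \mathit{miss}_{\mathrm{DM}} - \mathit{miss}_{\mathrm{FA}}
```

where miss_DM and miss_FA are the miss ratios of the direct-mapped and fully associative organizations, respectively.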
As expected, the improvement of a 2-way set-associative
cache over a direct-mapped cache is rather low. The
column-associative cache provides a miss ratio similar to
that of a two-way set-associative cache. Since the former has
a lower access time but requires two cache probes to satisfy
some hits, any choice between these two organizations
should take into account implementation parameters such
as access time and miss penalty. The victim cache removes
many conflict misses and outperforms a four-way set-
associative cache. Finally, the two-way skewed-associative
cache offers the lowest miss ratio. Previous work has shown
that it can be significantly more effective than a four-way
conventionally indexed set-associative cache [10].
In this paper, we investigate the use of alternative index
functions for reducing conflicts and discuss some practical
implementation issues. Section 2 introduces the alternative
index functions and Section 3 evaluates their conflict
avoidance properties. In Section 4, we discuss a number
of implementation issues, such as the effect of novel indexing
functions on cache access time. Then, in Section 5, we
evaluate the impact of the proposed indexing scheme on
the performance of a dynamically scheduled processor.
Finally, in Section 6, we draw conclusions from this study.
2 ALTERNATIVE INDEXING FUNCTIONS
The aim of this paper is to show how alternative cache
organizations can eliminate repetitive conflict misses. This
is analogous to the problem of finding an efficient hashing
function. For large secondary or tertiary caches, it may be
possible to use the virtual address mapping to adjust the
location of pages in cache, as suggested by Bershad et al.
[11], thus avoiding conflicts dynamically. However, for
small first-level caches, this effect can only be achieved by
using an alternative cache index function.
In the field of interleaved memories, it is well-known
that bank conflicts can be reduced by using bank selection
functions other than the simple modulo-power-of-two.
Lawrie and Vora proposed a scheme using prime-modulus functions [12], while Harper and Jump [13] and Sohi [14] proposed skewing functions. The use of XOR functions in
parallel memory systems was proposed by Frailong et al.
[15], and other pseudorandom functions were proposed by
Raghavan and Hayes [16] and Rau et al. [17], [18]. These
schemes each yield a more or less uniform distribution of
requests to banks, with varying degrees of theoretical
predictability and implementation cost. In principle, each
of these schemes could be used to construct a conflict-
resistant cache by using them as the indexing function.
However, in cache architectures, two factors are critical:
First, the chosen indexing function must have a logically
simple implementation and, second, we would like to be
able to guarantee good behavior on all regular address
patterns, even those that are pathological under a conventional index function.
In the commercial domain, the IBM 3033 [19] and the
Amdahl 470 [20] made use of XOR-mapping functions in
order to index the TLB. The first generation HP Precision
Architecture processors [21] also used a similar technique.
The use of pseudorandom cache indexing has been
suggested by other authors. For example, Smith [22]
compared a pseudorandom placement against a set-
associative placement. He concluded that random indexing
had a small advantage in most cases, but that the
advantages were not significant. In this paper, we show
that, for certain workloads and cache organizations, the
advantages can be very large.
Hashing the process
ID with the address bits in order to
index the cache was evaluated in a multiprogrammed
environment by Agarwal in [23]. Results showed that this scheme could reduce the miss ratio.

TABLE 1
Cache Miss Ratios for Direct-Mapped (DM), 2-Way Set-Associative (2W), Column-Associative (CA), Victim Cache (VC), 2-Way Skewed-Associative (SA), and Fully-Associative (FA) Organizations. Conflict-Miss Ratio (CMR) Is Also Shown.
Perhaps the most well-known alternative cache indexing
scheme is the class of bitwise exclusive-OR functions
proposed for the skewed associative cache [8]. The bitwise
XOR mapping computes each bit of the cache index as
either one bit of the address or the XOR of two bits. Where
two such mappings are required, different groups of bits
are chosen for XORing in each case. A two-way skewed-
associative cache consists of two banks of the same size that
are accessed simultaneously with two different hashing
functions. Not only does the associativity help to reduce
conflicts but the skewed indexing functions help to prevent
repetitive conflicts from occurring.
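A minimal Python sketch of this style of mapping follows (our illustration; the specific bit pairings below are invented for the example and are not Seznec's published functions):

```python
# Bitwise-XOR skewing: each index bit is one address bit, or the XOR of two.
# Two different pairings give the two banks their distinct hash functions.
# Bit positions are illustrative: 32-byte blocks (5 offset bits), 8 index bits.

def bit(addr: int, i: int) -> int:
    return (addr >> i) & 1

def xor_index(addr: int, pairing) -> int:
    idx = 0
    for i, (lo, hi) in enumerate(pairing):
        idx |= (bit(addr, lo) ^ bit(addr, hi)) << i
    return idx

# Bank 0 XORs address bit 5+i with bit 13+i; bank 1 pairs bits differently,
# so an address pair that collides in one bank rarely collides in the other.
pairing0 = [(5 + i, 13 + i) for i in range(8)]
pairing1 = [(5 + i, 13 + ((i + 3) % 8)) for i in range(8)]

addr = 0x12345
print(xor_index(addr, pairing0), xor_index(addr, pairing1))
```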
The polynomial modulus function was first applied to cache indexing in [10]. It is best described by first considering the unsigned integer address A in terms of its binary representation A = a_{n−1}·2^{n−1} + a_{n−2}·2^{n−2} + ... + a_0. This is interpreted as the polynomial A(x) = a_{n−1}·x^{n−1} + a_{n−2}·x^{n−2} + ... + a_0, defined over the field GF(2). The binary representation of the m-bit cache index R is similarly defined by the GF(2) polynomial R(x), of order less than m, such that A(x) = V(x)·P(x) + R(x). Effectively, R(x) is A(x) modulo P(x), where P(x) is an irreducible polynomial of order m such that x^i mod P(x) generates all polynomials of order lower than m. The polynomials that fulfil these requirements are called Ipoly polynomials. Rau showed how the computation of R(x) can be accomplished by the vector-matrix product of the address and an n × m matrix H of single-bit coefficients derived from P(x) [18]. In GF(2), this product is computed by a network of AND and XOR gates and, if the H-matrix is constant, the AND gates can be omitted; the mapping then requires just m XOR gates with fan-in from 2 to n. In practice, we may reduce the number of input address bits to the polynomial mapping function by ignoring some of the upper bits in A. This does not seriously degrade the quality of the mapping function.
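To make the construction concrete, here is a small Python sketch (ours, not the authors' implementation). The polynomial P(x) = x^8 + x^4 + x^3 + x^2 + 1 (0x11D, a well-known primitive polynomial), the 32-bit address, and the 8-bit index are illustrative assumptions; the paper does not state its polynomials here. The sketch computes R(x) = A(x) mod P(x) by GF(2) long division and checks it against the equivalent constant H-matrix form:

```python
# Ipoly cache index: R(x) = A(x) mod P(x) over GF(2).
# M, P, and the 32-bit address width are illustrative assumptions.

M = 8                  # index width in bits (2^M cache sets)
P = 0x11D              # P(x) = x^8 + x^4 + x^3 + x^2 + 1, a primitive polynomial

def ipoly_index(addr: int, n: int = 32) -> int:
    """GF(2) long division: repeatedly XOR out the highest remaining bit."""
    r = addr
    for bit in range(n - 1, M - 1, -1):
        if r & (1 << bit):
            r ^= P << (bit - M)       # subtract (XOR) a shifted copy of P(x)
    return r                          # the M-bit remainder is the cache index

# Equivalent constant H-matrix: column j of H is x^j mod P(x), so each index
# bit is the XOR of the address bits whose column has that bit set.
H = [ipoly_index(1 << j) for j in range(32)]

def ipoly_index_hmatrix(addr: int) -> int:
    r = 0
    for j, col in enumerate(H):
        if addr & (1 << j):
            r ^= col
    return r

assert all(ipoly_index(a) == ipoly_index_hmatrix(a) for a in range(0, 1 << 16, 97))
```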
Ipoly mapping functions have been studied previously
in the context of stride-insensitive interleaved memories
(see [17], [18]) and have certain provable characteristics of
significant value for cache indexing. In [24], it was
demonstrated that a skewed Ipoly cache indexing scheme
shows a higher degree of conflict resistance than that
exhibited by conventional set-associativity or other (non-
Ipoly) XOR-based mapping functions. Overall, the skewed-
associative cache using Ipoly mapping and a pure LRU
replacement policy achieved a miss ratio within 1 percent of
that achieved by a fully associative cache. Given the
advantage of an Ipoly function over the bitwise XOR
function, all results presented in this paper use the Ipoly
indexing scheme.
3 EVALUATION OF CONFLICT RESISTANCE
The performance of both the integer and floating-point
SPEC95 programs has been evaluated for column-associa-
tive, two-way set-associative (2W), and two-way skewed-
associative organizations using Ipoly indexing functions. In
all cases, a single-level cache is assumed. The miss ratios of
these configurations are shown in Table 2. Given a
conventional indexing function, the direct-mapped (DM)
and fully associative (FA) cache organizations display,
respectively, the lowest and the highest degrees of conflict
resistance of all possible cache architectures. As such, they
define the bounds within which novel indexing schemes
should be evaluated. Their miss ratios are shown in the
right-most two columns of Table 2.
The column-associative cache has access-time character-
istics similar to a direct-mapped cache but has some degree
of pseudo-associativity: each address can map to one of
two locations in the cache but, initially, only one is probed.
The column labeled SPL represents a cache which swaps
data between the two locations to increase the percentage of
a hit on the first probe. It also uses a realistic pseudo-LRU
replacement policy. The cache reported in the column
labeled LRU does not swap data between columns and uses
an unrealistic pure LRU replacement policy [10].
It is to be expected that a two-way set-associative cache
will be capable of eliminating many random conflicts.
However, a conventionally indexed set-associative cache is
not able to eliminate pathological conflict behavior as it has
limited associativity and a naive indexing function. The
performance of a two-way set-associative cache can be
improved by simply replacing the index function, while
retaining all other characteristics. Conventional LRU repla-
cement can still be used, as the indexing function has no
impact on replacement for this cache organization. For two
programs, the two-way Ipoly cache has a lower miss ratio
than a fully associative cache. This is again due to the
suboptimality of LRU replacement in the fully associative cache and is a common anomaly in programs with negligible conflict misses.

TABLE 2
Miss Ratios for Ipoly Indexing on SPEC95 Benchmarks
The final cache organization shown in Table 2 is the two-
way skewed-associative cache proposed originally by
Seznec [8]. In its original form, it used two bitwise XOR
indexing functions. Our version uses Ipoly indexing
functions, as proposed in [10] and [24]. In this case, two
distinct Ipoly functions are used to construct two distinct
cache indices from each address. Pure LRU is difficult to
implement in a skewed-associative cache, so here we
present results for a cache which uses a realistic pseudo-LRU policy (labeled pLRU) and a cache which uses an
unrealistic pure LRU policy (labeled LRU). This organiza-
tion produces the lowest conflict miss ratio, down from 4.8
percent to 0.67 percent for SPECint, and from 12.61 percent
to 0.07 percent for SPECfp.
It is striking that the performance improvement is
dominated by three programs (tomcatv, swim, and wave).
These effectively exhibit pathological conflict miss ratios
under conventional indexing schemes. Studies by Olukotun
et al. [25], have shown that the data cache miss ratio in
tomcatv wastes 56 percent and 40 percent of available IPC
in 6-way and 2-way superscalar processors, respectively.
Tiling will often introduce extra cache conflicts, the
elimination of which is not always possible through
software. Now that we have alternative indexing functions with conflict-avoidance properties, we can use them to avoid these induced conflicts. The effectiveness of Ipoly
indexing for tiled loops was evaluated by simulating the
cache behavior of a variety of tiled loop kernels. Here, we
present a small sample of results to illustrate the general
outcome. Figs. 1 and 2 show the miss ratios observed in two
tiled matrix multiplication kernels, where the original
matrices were square and of dimensions 171 and 256,
respectively. Tile sizes were varied from 2 × 2 up to 16 × 16
to show the effect of conflicts occurring in caches that are
direct-mapped (a1), 2-way set-associative (a2), fully asso-
ciative (fa), and skewed 2-way Ipoly (Hp-Sk). The tiled
working set divided by cache capacity measures the
fraction of the cache occupied by a single tile. Cache
capacity is 8 KBytes, with 32-byte lines.
For dimension 171, the miss ratio initially falls for all
caches as tile size increases. This is due to increasing spatial
locality, up to the point where self-conflicts begin to occur
in the conventionally indexed direct-mapped and two-way
set-associative caches. The fully associative cache suffers no
self-conflicts and its miss ratio decreases monotonically to
less than 1 percent at 50 percent loading. The behavior of
the skewed 2-way Ipoly cache tracks the fully associative
cache closely. The qualitative difference between the Ipoly
cache and a conventional two-way cache is clearly visible.
For dimension 256, the product array and the multi-
plicand array are positioned in memory so that cross-
conflicts occur in addition to self-conflicts. Hence, the
direct-mapped and 2-way set associative caches experience
little spatial locality. However, the Ipoly cache is able to
eliminate cross-conflicts, as well as self-conflicts, and it
again tracks the fully associative cache.
4 IMPLEMENTATION ISSUES
The logic of the GF(2) polynomial modulus operation
presented in Section 2 defines a class of hash functions
which compute the cache placement of an address by
combining subsets of the address bits using XOR gates. This
means that, for example, bit 0 of the cache index may be
computed as the XOR of bits 0, 11, 14, and 19 of the original
address. The choice of polynomial determines which bits
are included in each set. The implementation of such a
function for a cache with an 8-bit index would require just
eight XOR gates with fan-in of 3 or 4.
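For instance, continuing the illustrative polynomial from the sketch in Section 2 (P(x) = 0x11D and 19 usable address bits are our assumptions, not the paper's exact choices), one can enumerate which address bits feed the XOR gate for each index bit; the fan-in per gate depends on the polynomial chosen:

```python
# List, for each of the 8 index bits, the address bits XORed to produce it.
# These tap sets are the columns/rows of the constant H-matrix.

M, P, N = 8, 0x11D, 19      # index width, example Ipoly polynomial, input bits

def gf2_mod(a: int) -> int:
    for b in range(N - 1, M - 1, -1):    # GF(2) long division by P(x)
        if a & (1 << b):
            a ^= P << (b - M)
    return a

for i in range(M):
    taps = [j for j in range(N) if (gf2_mod(1 << j) >> i) & 1]
    print(f"index bit {i}: XOR of address bits {taps}")
```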
While this appears remarkably simple, there is more to
consider than just the placement function. First, the
function itself uses address bits beyond the normal limit
imposed by the typical minimum page-size restriction. Second, the use of pseudorandom placement in a multilevel memory hierarchy has implications for the maintenance of inclusion. In [24], we explain these two issues in more depth
and show how the virtual-real two-level cache hierarchy proposed by Wang et al. [26] provides a clean solution to both problems.

Fig. 1. Miss ratio versus cache loading for 171 × 171 matrix multiply.
Fig. 2. Miss ratio versus cache loading for 256 × 256 matrix multiply.
A cache memory access in a conventional organization
normally computes its effective address by adding two
registers or a register plus a displacement. Ipoly indexing
implies additional circuitry to compute the index from the
effective address. This circuitry consists of several XOR
gates that operate in parallel and therefore the total delay is
just the delay of one gate. Each XOR gate has a number of
inputs that depend on the particular polynomial being
used. For the experiments reported in this paper, the
number of inputs is never higher than 5. The XOR gating
required by the Ipoly mapping may increase the critical
path length within the processor pipeline. However, any
delay will be short since all bits of the index can be
computed in parallel. Moreover, we show later that, even if
this additional delay induces a full cycle penalty in the
cache access time, the Ipoly mapping provides a significant
overall performance improvement. Memory address predic-
tion can also be used to avoid the penalty introduced by the
XOR delay when it lengthens the critical path. Memory
addresses have been shown to be highly predictable. For
instance, in [27], it was shown that the addresses of about 75
percent of the dynamically executed memory instructions
from the SPEC95 suite can be predicted with a simple
tabular scheme which tracks the last address produced by a
given instruction and its last stride. A similar scheme that
could be used to give an early prediction of the line that is
likely to be accessed by a given load instruction is outlined
below.
The processor incorporates a table indexed by the
instruction address. Each entry stores the last address and
the predicted stride for some recently executed load
instruction. In the fetch stage, this table is accessed with
the program counter. In the decode stage, the predicted
address is computed and the XOR functions are performed
to compute the predicted cache line. This can be done in one
cycle since the XOR can be performed in parallel with the
computation of the most significant bits of the effective
address. When the instruction is subsequently issued to the
memory unit, it uses the predicted line number to access the
cache in parallel with the actual address and line computa-
tion. If the predicted line turns out to be incorrect, the cache
access is repeated with the actual address. Otherwise, the
data provided by the speculative access can be loaded into
the destination register.
A number of previous papers have suggested address
prediction as a means to reduce memory latency [28], [29],
[30], or to execute memory instructions and their dependent
instructions speculatively [31], [27], [32]. In the case of a
miss-speculation, a recovery mechanism similar to that
used by branch prediction schemes is then used to squash
miss-speculated instructions.
5 EFFECT OF IPOLY INDEXING ON IPC
In order to verify the impact of polynomial mapping on
realistic microprocessor architectures, we have developed a
parametric simulator for a four-way superscalar processor
with out-of-order execution. Table 3 summarizes the
functional units and their latencies used in these experi-
ments. The reorder buffer contained 32 entries, and there
were two separate physical register files (FP and Integer),
each with 64 physical registers. The processor had a lockup-
free data cache [33] that allowed eight outstanding misses to
different cache lines. Cache capacities of 8 KB and 16 KB
were simulated with 2-way associativity and 32-byte lines.
The cache was write-through and no-write-allocate. The
cache had two ports, each with a two-cycle hit time and a
miss penalty of 20 cycles. This was connected by a 64-bit
data bus to an infinite level-two cache. Data dependencies
through memory were speculated using a mechanism
similar to the ARB of the Multiscalar [34] and the HP PA-
8000 [35]. A branch history table with 2K entries and 2-bit
saturating counters was used for branch prediction.
The memory address prediction scheme was implemen-
ted by a direct-mapped table with 1K entries, indexed by
instruction address. To reduce cost, the entries were not
tagged, although this increases interference in the table.
Each entry contained the last effective address of the most
recent load instruction to index into that table entry,
together with the last observed stride. In addition, each
entry contained a 2-bit saturating counter to assign
confidence to the prediction. Only when the most-signifi-
cant bit of the counter is set would the prediction be
considered correct. The address field was updated for each
new reference, regardless of the prediction. However, the
stride field was updated only when the counter went below 2 (binary 10), i.e., after two consecutive mispredictions.
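The following Python sketch is our behavioral reconstruction of this table; the field widths, 1K untagged entries, and counter policy follow the text, while details such as 4-byte instruction alignment are illustrative:

```python
# Behavioral model of the address prediction table described above: 1K
# untagged entries indexed by instruction address, each holding the last
# effective address, a stride, and a 2-bit saturating confidence counter.

TABLE_SIZE = 1024

class Entry:
    __slots__ = ("last_addr", "stride", "counter")
    def __init__(self):
        self.last_addr, self.stride, self.counter = 0, 0, 0

table = [Entry() for _ in range(TABLE_SIZE)]

def index(pc: int) -> int:
    return (pc >> 2) % TABLE_SIZE      # untagged: aliasing is tolerated

def predict(pc: int):
    """Return a predicted effective address, or None if not confident."""
    e = table[index(pc)]
    if e.counter >= 2:                 # most-significant bit of counter set
        return e.last_addr + e.stride
    return None

def update(pc: int, actual: int) -> None:
    e = table[index(pc)]
    if e.last_addr + e.stride == actual:
        e.counter = min(3, e.counter + 1)
    else:
        e.counter = max(0, e.counter - 1)
        if e.counter < 2:              # below binary 10: update the stride
            e.stride = actual - e.last_addr
    e.last_addr = actual               # address field updated on every reference
```

In the pipeline described earlier, predict would be consulted at decode so that the XOR index computation can proceed in parallel with effective-address generation, and update would run once the actual address resolves.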
Table 4 shows the IPC and miss ratios for six configurations.¹
All IPC averages are computed using an equally-
weighted harmonic mean. The baseline configuration is an 8
KB cache with conventional indexing and no address
prediction (NP, third column). The average IPC for this
configuration is 1.27 from an average miss ratio of 16.53
percent. With Ipoly indexing, the average miss ratio falls to
9.68 percent. If the XOR gates are not in the critical path,
IPC rises to 1.33 (NX, fifth column). Conversely, if the XOR
gates are in the critical path and a one cycle penalty in the
cache access time is assumed, the resulting IPC is 1.29 (WX,
sixth column). However, if memory address prediction is
then introduced (WP, seventh column), IPC is the same as
for a cache without the XOR gates in the critical path (NX).
Hence, the memory address prediction scheme can offset
the penalty introduced by the additional delay of the XOR
gates when they are in the critical path, even under the
conservative assumption that a whole cycle of latency is
added to each load instruction. Finally, Table 4 also shows
the performance of a 16 KB 2-way set-associative cache
without Ipoly indexing (second column). Notice that the
addition of Ipoly indexing to an 8 KB cache yields over 60
percent of the IPC increase that can be obtained by doubling
the cache size.
These IPC measurements exhibit small absolute differ-
ences, but this is because the benefit of Ipoly indexing is
perceived by only a small subset of the benchmark
programs. Most programs in SPEC95 exhibit low conflict
1. For each configuration, we simulated 10^8 instructions after skipping the first 2 × 10^9.

REFERENCES (partial; only these entries survived in the source)

[2] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann.
[5] A. Srivastava and A. Eustace, "ATOM: A System for Building Customized Program Analysis Tools," Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI), 1994.
[7] N.P. Jouppi, "Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers," Proc. 17th Int'l Symp. Computer Architecture (ISCA), 1990.
[22] A.J. Smith, "Cache Memories," ACM Computing Surveys, vol. 14, no. 3, pp. 473-530, 1982.