
Cache profiling and the SPEC benchmarks: a case study

Alvin R. Lebeck and David A. Wood
Computer, Vol. 27, Iss. 10, pp. 15-26, October 1994
Abstract
A vital tool-box component, the CProf cache profiling system lets programmers identify hot spots by providing cache performance information at the source-line and data-structure level. Our purpose is to introduce a broad audience to cache performance profiling and tuning techniques. Although used sporadically in the supercomputer and multiprocessor communities, these techniques also have broad applicability to programs running on fast uniprocessor workstations. We show that cache profiling, using our CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.


Cache Profiling and the SPEC Benchmarks: A Case Study

Alvin R. Lebeck and David A. Wood
University of Wisconsin-Madison
Cache memories help bridge the cycle-time gap between fast microprocessors and relatively slow main memories. By holding recently referenced regions of memory, caches can reduce the number of cycles the processor must stall while waiting for data. As the disparity between processor and main memory cycle times increases - by 40 percent or more per year - cache performance becomes ever more critical.

Caches only work well, however, for programs that exhibit sufficient locality. Other programs have reference patterns that caches cannot exploit; they spend excessive execution time transferring data between main memory and cache. For example, the SPEC92 benchmark tomcatv spends as much as 53 percent of its time waiting for memory on a DECstation 5000/125.
Fortunately, for many programs, small source-code changes - called program transformations - can radically alter memory reference patterns, greatly improving cache performance. Consider the well-known example in Figure 1 of traversing a two-dimensional Fortran array. Since Fortran lays out two-dimensional arrays in column-major order, consecutive elements of a column are stored in consecutive memory locations. Traversing columns in the inner loop (by incrementing the row index) produces a sequential reference pattern and, hence, spatial locality that most caches can exploit. If, instead, the inner loop traverses rows, each inner-loop iteration references a different memory region.

For arrays that are much larger than the cache, the column-traversing version will have much better cache behavior than the row-traversing version. On a DECstation 5000/125, the column-traversing version runs 1.69 times faster than the row-traversing version on an array of single-precision floating-point numbers.
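The same effect is easy to reproduce in C, whose arrays are laid out in row-major order, so the roles of the two loop orders are reversed: the inner loop should increment the column index. The following sketch is our own illustration, not code from the article; the array dimensions and the repetition count are arbitrary:

#include <stdio.h>
#include <time.h>

#define ROWS 5000
#define COLS 1000

static float xa[ROWS][COLS];   /* C lays this out row by row */

int main(void)
{
    clock_t t0, t1, t2;
    int i, j, k;

    t0 = clock();
    for (k = 0; k < 10; k++)            /* good order for C: the inner   */
        for (i = 0; i < ROWS; i++)      /* loop walks consecutive        */
            for (j = 0; j < COLS; j++)  /* memory locations              */
                xa[i][j] *= 2.0f;
    t1 = clock();
    for (k = 0; k < 10; k++)            /* bad order for C: each inner-  */
        for (j = 0; j < COLS; j++)      /* loop iteration jumps to a     */
            for (i = 0; i < ROWS; i++)  /* different memory region       */
                xa[i][j] *= 2.0f;
    t2 = clock();

    printf("row order:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column order: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}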
We call this type of analysis a mental simulation of the cache behavior. By mentally applying the program reference pattern to the underlying cache organization, we can predict the program's cache performance. This simulation is similar to asymptotic analysis of algorithms (for example, worst-case behavior), which programmers commonly use to study the number of operations executed as a function of input size. When analyzing cache behavior, programmers perform a similar analysis, but they must also have a basic understanding of cache operation (see the next section).

(a) Row traversal:

      DO 20 K = 1,100
      DO 20 I = 1,5000
      DO 20 J = 1,100
   20 XA(I, J) = 2 * XA(I, J)

(b) Column traversal:

      DO 20 K = 1,100
      DO 20 J = 1,100
      DO 20 I = 1,5000
   20 XA(I, J) = 2 * XA(I, J)

Figure 1. Row-major traversal of a Fortran array (a), and column traversal (b).
Although asymptotic analysis is effective for certain algorithms, analysis is difficult when applied to large, complex programs. Instead, programmers often rely on an execution-time profile to isolate problematic code sections, to which they later apply asymptotic analysis. Unfortunately, traditional execution-time profiling tools (for example, gprof1) are generally insufficient to identify cache performance problems. For the Fortran array example (Figure 1), an execution-time profile would identify the procedure or source lines as a bottleneck, but the programmer could easily conclude that the floating-point operations were responsible. We can see, therefore, that programmers would benefit from a profile that focuses specifically on a program's cache behavior.

Our purpose in this article is to introduce a broad audience to cache performance profiling and tuning techniques. Although used sporadically in the supercomputer and multiprocessor communities, these techniques also have broad applicability to programs running on fast uniprocessor workstations. We show that cache profiling, using our CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.
Understanding cache behavior

Cache memory terminology

Associativity - The number of unique places in the cache where a particular block may reside.
Block size - The number of contiguous bytes fetched on each cache miss.
Cache hit - A memory reference satisfied by the cache.
Cache miss - A memory reference not satisfied by the cache.
Capacity - The total number of bytes a cache may contain.
Capacity miss - A reference that misses in a fully associative cache with LRU replacement.
Compulsory miss - A reference that misses because it is the first reference to a cache block.
Conflict miss - A reference that hits in a fully associative cache but misses in an A-way set-associative cache.
Direct mapped - A cache in which a block can reside in exactly one place in the cache.
Fully associative - A cache in which a block can reside in any place in the cache (A = C/B).
Miss penalty - The time required to fetch data from main memory into the cache on a cache miss.
Set-associative - A cache in which a block can reside in exactly A places in the cache.
Caches sit between the (fast) processor and (slow) main memory, holding regions of recently referenced main memory. References satisfied by the cache - called hits - proceed at processor speed; those unsatisfied - called misses - incur a cache miss penalty to fetch the corresponding data from main memory. Most current processors must wait, or stall, until the data arrive.

Caches work because most programs exhibit significant locality. Temporal locality exists when a program references the same memory location multiple times in a short period. Caches exploit temporal locality by retaining recently referenced data. Spatial locality occurs when the program accesses memory locations close to those it has recently accessed. Caches exploit spatial locality by fetching multiple contiguous words - a cache block - whenever a miss occurs.
Caches are characterized by three major parameters: capacity (C), block size (B), and associativity (A). A cache's capacity simply defines the total number of bytes it may contain. The block size determines how many contiguous bytes are fetched on each cache miss. A cache may contain at most C/B blocks at any one time. Associativity refers to the number of unique cache locations where a particular block may reside. If a block can reside in any cache location (A = C/B), we call it a fully associative cache; if it can reside in exactly one location (A = 1), we call it direct-mapped; if it can reside in exactly A locations, we call it A-way set-associative. (Smith's survey2 describes cache design in more detail.)

With these three parameters, a programmer can mentally simulate cache behavior for simple algorithms. Consider the simple example of nested loops where the outer loop iterates L times and the inner loop sequentially accesses an array of N 4-byte integers:
for (i = 0; i < L; ++i)
    for (j = 0; j < N; ++j)
        a[j] += 2;
If the array size (4N) is smaller than the cache capacity (see Figure 2a-b), we expect the number of cache misses to equal the array size divided by the cache block size, 4N/B (that is, the number of cache blocks required to hold the entire array). If the array size is larger than the cache capacity (see Figure 2c), the expected number of misses is equal to the number of cache blocks required to contain the array times the number of outer-loop iterations (4NL/B).
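As a concrete instance of this estimate (the numbers are ours, not the article's): an array of N = 5,000 4-byte integers occupies 20,000 bytes, so with a block size of B = 32 bytes one pass costs 4N/B = 625 misses; if the array exceeds the cache capacity, L = 100 passes cost roughly 4NL/B = 62,500 misses. The same arithmetic as a small C helper:

#include <stdio.h>

/* Estimated misses for l sequential passes over an array of n 4-byte
   integers, per the 4N/B and 4NL/B formulas above; capacity and block
   are the cache parameters C and B. (Illustrative helper, ours.) */
static long expected_misses(long n, long l, long capacity, long block)
{
    long bytes = 4 * n;
    long blocks = (bytes + block - 1) / block; /* blocks to hold the array */
    if (bytes <= capacity)
        return blocks;     /* 4N/B: every miss is a compulsory miss     */
    return l * blocks;     /* 4NL/B: the array is refetched each pass   */
}

int main(void)
{
    printf("%ld\n", expected_misses(5000, 1, 32768, 32));  /* fits: 625  */
    printf("%ld\n", expected_misses(5000, 100, 8192, 32)); /* 62500      */
    return 0;
}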

Compilers may someday automate this analysis and transform the code to reduce the miss frequency; recent research has produced promising results for restricted problem domains.3,4 However, for general codes using current commercial compilers, the programmer must manually analyze the programs and manually perform transformations.
To select appropriate program transformations, a programmer must first know what causes poor cache behavior. One approach to understanding why cache misses occur is to classify each miss as one of three disjoint types6: compulsory, capacity, and conflict. (Hill and Smith6 define compulsory, capacity, and conflict misses in terms of miss ratios. When generalizing this concept to individual cache misses, we must introduce anticonflict misses, which miss in a fully associative cache with LRU replacement but hit in an A-way set-associative cache. Anticonflict misses are generally only useful for understanding the rare cases when a set-associative cache performs better than a fully associative cache of the same capacity.)
A compulsory miss is caused by referencing a previously unreferenced cache block. In the small array example (Figure 2b), all misses are compulsory. Eliminating a compulsory miss requires prefetching the data, either by an explicit prefetch operation5 or by placing more data items in a single cache block. For example, if the integers in our example require only 2 bytes rather than 4, we can cut the misses in half by changing the declaration. However, since compulsory misses usually constitute only a fraction of all cache misses, we do not discuss them further.
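The declaration change the authors describe might look like the following sketch (the identifier names are ours):

/* before: 4-byte elements, so a 16-byte cache block holds 4 values */
int counts_wide[5000];

/* after: if every value fits in 16 bits, a 16-byte block holds 8   */
/* values, halving the compulsory misses of a sequential pass      */
short counts_narrow[5000];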
A reference that misses in a fully associative cache with LRU replacement is classified as a capacity miss. Capacity misses are caused by referencing more cache blocks than can fit in the cache. In the large array example (Figure 2c), we expect to see many capacity misses. Programmers can reduce capacity misses by restructuring the program to re-reference blocks while they are in cache. For example, it may be possible to modify the loop structure to perform the L outer-loop iterations on a portion of the array that fits in the cache and then move to the next portion of the array. This technique, called blocking, is similar to the techniques used to exploit the vector registers in some supercomputers.
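Applied to the earlier L-pass loop, the restructuring might look like this sketch (ours; CACHE_BYTES is an assumed cache capacity): every chunk of the array receives all L passes while it is resident, instead of each pass streaming over the whole array.

#include <stddef.h>

#define CACHE_BYTES 8192                          /* assumed cache capacity   */
#define CHUNK ((int)(CACHE_BYTES / sizeof(int)))  /* array elements per chunk */

void blocked_passes(int *a, int n, int l)
{
    int lo, hi, i, j;
    for (lo = 0; lo < n; lo += CHUNK) {
        hi = (lo + CHUNK < n) ? lo + CHUNK : n;
        for (i = 0; i < l; ++i)         /* all L passes touch this chunk    */
            for (j = lo; j < hi; ++j)   /* while it still fits in the cache */
                a[j] += 2;
    }
}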
Figure 2. Determining expected cache behavior. Sequentially accessing a small array (b) that fits in the cache (a) should produce M cache misses, where M is the number of cache blocks required to hold the array. Accessing an array that is much larger than the cache (c) should result in ML cache misses, where L is the number of passes over the array.

Figure 3. Conflicting cache mappings. The presence of conflict misses indicates a mapping problem: (b) shows two arrays that fit in the cache (a) with a mapping that produces no conflict misses, and (c) shows two mappings that will result in conflict misses.
A reference that hits in a fully associative cache but misses in an A-way set-associative cache is classified as a conflict miss. A conflict miss to block X indicates that block X has been referenced in the recent past, since it is contained in the fully associative cache, but at least A other cache blocks that map to the same cache set have been accessed since the last reference to block X.
Consider the execution of a doubly nested loop on a machine with a direct-mapped cache, where the inner loop sequentially accesses two arrays (for example, dot-product). If the combined array size is smaller than the cache, we might expect only compulsory misses. However, this ideal case occurs only if the two arrays map to different cache sets (Figure 3b). If they overlap, either partially or entirely (Figure 3c), then we will get conflict misses as array elements compete for space in the set. Eliminating conflict misses requires a program transformation that changes either the memory allocation of the two arrays, so that contemporaneous accesses do not compete for the same sets, or the manner in which the arrays are accessed.
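The mapping itself is easy to compute: a byte address addr falls in set (addr / B) mod (C/B) for a direct-mapped cache. Below is a small sketch of ours that checks whether corresponding elements of two dot-product operands compete for the same set; with the power-of-two sizes chosen here and typical contiguous placement, every pair collides:

#include <stdio.h>
#include <stdint.h>

#define CAPACITY 8192              /* assumed cache capacity in bytes  */
#define BLOCK 32                   /* assumed block size in bytes      */
#define NSETS (CAPACITY / BLOCK)   /* direct-mapped: one block per set */

static int cache_set(const void *p)
{
    return (int)(((uintptr_t)p / BLOCK) % NSETS);
}

int main(void)
{
    static float x[2048], y[2048];  /* dot-product operands, 8KB each */
    int i, collisions = 0;

    for (i = 0; i < 2048; i++)
        if (cache_set(&x[i]) == cache_set(&y[i]))
            collisions++;           /* x[i] and y[i] would evict each other */

    printf("%d of 2048 element pairs map to the same set\n", collisions);
    return 0;
}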
Our discussion assumes a cache indexed with virtual addresses. Many systems index their caches with real or physical addresses, making cache behavior strongly dependent on page placement. However, many operating systems use page coloring to minimize this effect, thus reducing the performance difference between virtual-indexed and real-indexed caches.7

/* old declaration of two arrays */
int val[SIZE];
int key[SIZE];

/* new declaration of */
/* array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

(a)

C old declaration
      integer X(N, N)
      integer Y(N, N)

C new declaration
      integer XY(2*N, N)

C preprocessor macro
C definitions to perform addressing
#define X(i, j) XY((2*i) - 1, j)
#define Y(i, j) XY((2*i), j)

(b)

Figure 4. Examples of merging arrays in C (a) and Fortran77 (b).
/* old declaration of a twelve */
/* byte structure */
struct ex_struct {
    int val1, val2, val3;
};

/* new declaration of structure */
/* padded to 16-byte block size */
struct ex_struct {
    int val1, val2, val3;
    char pad[4];
};

(a)

/* original allocation does not */
/* guarantee alignment */
ar = (struct ex_struct *)
    malloc(sizeof(struct ex_struct) * SIZE);

/* new code to guarantee alignment */
/* of structure. */
ar = (struct ex_struct *)
    malloc(sizeof(struct ex_struct) * (SIZE + 1));
ar = (struct ex_struct *)((((int) ar + B - 1) / B) * B);

(b)

Figure 5. Padding (a) and aligning structures (b) in C.
Techniques for improving cache behavior

Program transformations can be classified by the type of cache misses they eliminate. Conflict misses can be reduced by merging arrays, padding and aligning structures, packing structures and arrays, and interchanging loops. The first three techniques change the allocation of data structures, whereas loop interchange modifies the order in which data structures are referenced. Capacity misses can be eliminated by program transformations that reuse data before it is displaced from the cache, such as loop fusion, structure and array packing, and loop interchange.

Merging arrays. Some programs contemporaneously reference two (or more) arrays of the same dimension using the same indices. By merging multiple arrays into a single compound array, the programmer increases spatial locality and potentially reduces conflict misses. In the C programming language, this is accomplished by declaring an array of structures rather than two arrays (Figure 4a). This simple transformation can also be performed in Fortran90, which provides structures. Since Fortran77 does not have structures, the programmer can obtain the same effect using complex indexing (Figure 4b).
Padding and aligning structures. Referencing a data structure that spans two cache blocks may incur two misses, even if the structure is smaller than the block size. Padding structures to a multiple of the block size and aligning them on a block boundary can eliminate "misalignment" misses, which generally show up as conflict misses. Padding is easily accomplished in C (Figure 5a) by declaring extra pad fields. Alignment is a little more difficult, since the address of the structure must be a multiple of the cache block size. Statically declared structures generally require compiler support. Dynamically allocated structures can be aligned by the programmer using simple pointer arithmetic (Figure 5b). Some dynamic memory allocators (for example, some versions of malloc()) return cache block-aligned memory.
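On current systems, the same alignment can be obtained without manual pointer arithmetic through the POSIX allocator posix_memalign (a later interface, not mentioned in the article); a minimal sketch, assuming a 16-byte cache block:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

struct ex_struct { int val1, val2, val3; char pad[4]; };

/* allocate n block-aligned structures; returns 0 on success */
int alloc_aligned(struct ex_struct **ar, size_t n)
{
    /* 16 stands in for the cache block size B */
    return posix_memalign((void **)ar, 16, n * sizeof(struct ex_struct));
}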
Packing. Packing is the opposite of padding. By packing an array into the smallest space possible, the programmer increases spatial locality, which can reduce conflict and capacity misses. In Figure 6a, the programmer observes that the elements of array values are never greater than 255 and, hence, could fit in type unsigned char, which requires 8 bits, instead of unsigned int, which typically requires 32 bits. For a machine with 16-byte cache blocks, the code in Figure 6b permits 16 elements per block, rather than 4, reducing the maximum number of cache misses by a factor of 4.
Loop fusion. Numeric programs often consist of several operations on the same data, coded as multiple loops over the same arrays. By combining these loops, a programmer increases the program's temporal locality and frequently reduces the number of capacity misses. The examples in Figure 7 combine two doubly nested loops so that all operations are performed on an entire row before moving on to the next. Loop fusion is the exact opposite of loop fission, a program transformation that splits independent portions of a loop body into separate loops. Loop fission helps an optimizing compiler detect loops that exploit vector hardware on some supercomputers. Because most vector supercomputers do not employ caches, relying instead on high-bandwidth interleaved main memories, some of the transformations described in this article may be counterproductive for these machines.
Blocking. Blocking is a general technique for restructuring a program to reuse chunks of data that fit in the cache and reduce capacity misses. The SPEC matrix multiply (part of dnasa7, a Fortran77 program) implements a column-blocked algorithm (Figure 8b) that achieves a 2.04 speedup versus a naive implementation (Figure 8a) on a DECstation 5000/125. The algorithm tries to keep four columns of the A matrix in cache for the duration of the outermost loop, ideally getting N - 1 hits for each miss. If the matrix is so large that four columns do not fit in the cache, we can use a two-dimensional (row and column) blocked algorithm instead.
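The article does not show the two-dimensional variant; a sketch of our own in C (TILE is an assumed tile edge, tuned so a few TILE x TILE tiles fit in the cache) might look like:

/* two-dimensional (row and column) blocked matrix multiply:   */
/* C += A * B, all n x n, stored as flat row-major arrays      */
#define TILE 32   /* assumed tile edge; tune to the cache size */

void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    int i0, j0, k0, i, j, k;
    for (i0 = 0; i0 < n; i0 += TILE)
        for (k0 = 0; k0 < n; k0 += TILE)
            for (j0 = 0; j0 < n; j0 += TILE)
                /* one tile of C is updated using one tile of A and  */
                /* one tile of B, all three resident at once         */
                for (i = i0; i < i0 + TILE && i < n; i++)
                    for (k = k0; k < k0 + TILE && k < n; k++) {
                        double aik = a[i*n + k];
                        for (j = j0; j < j0 + TILE && j < n; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
}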
CProf cache profiling system

Cache misses result from the complex interaction among algorithm, memory allocation, and cache configuration; when the program is executed, the reality may not match the programmer's expectations. CProf, our cache profiling system, addresses this problem by identifying where cache misses occur and by classifying them as compulsory, capacity, or conflict misses.
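CProf's own machinery is not shown in the article; to convey the flavor of the simulation involved, here is a toy classifier entirely of our own devising. A direct-mapped tag array detects hits, and a first-touch record separates compulsory misses from the rest (fully distinguishing capacity from conflict misses would additionally require simulating a fully associative LRU cache, per the sidebar definitions):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define CAPACITY 8192              /* assumed capacity (bytes)   */
#define BLOCK 32                   /* assumed block size (bytes) */
#define NSETS (CAPACITY / BLOCK)   /* direct-mapped              */

static uintptr_t tags[NSETS];      /* resident block per set     */
static int valid[NSETS];

/* classify one reference: 0 = hit, 1 = compulsory miss, 2 = other miss */
static int reference(uintptr_t addr, int *seen)
{
    uintptr_t block = addr / BLOCK;
    int set = (int)(block % NSETS);

    if (valid[set] && tags[set] == block)
        return 0;                  /* hit                           */
    valid[set] = 1;
    tags[set] = block;             /* fetch the block into its set  */
    if (!seen[block]) {
        seen[block] = 1;
        return 1;                  /* first-ever touch: compulsory  */
    }
    return 2;                      /* re-fetch: capacity or conflict */
}

int main(void)
{
    enum { BYTES = 20000, PASSES = 100 };
    int *seen = calloc(BYTES / BLOCK + 1, sizeof(int));
    long counts[3] = {0, 0, 0};
    long pass, i;

    for (pass = 0; pass < PASSES; pass++)   /* the L-pass loop from earlier, */
        for (i = 0; i < BYTES; i += 4)      /* replayed as a synthetic trace */
            counts[reference((uintptr_t)i, seen)]++;

    printf("hits %ld, compulsory %ld, capacity/conflict %ld\n",
           counts[0], counts[1], counts[2]);
    free(seen);
    return 0;
}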
Cache- and memory-system profilers differ from the better-known execution-time profilers by focusing on memory-system performance. Memory-system profilers do not obviate execution-time profilers; instead, they provide vital supplementary information to quickly identify memory-system bottlenecks and tune memory-system performance.

Cache- and memory-system profilers differ in the level of detail they present.
/* old declaration of an array */
/* of unsigned integers. */
unsigned int values[10000];

/* loop sequencing through values */
for (i = 0; i < 10000; i++)
    values[i] = i % 256;

(a)

/* new declaration of an array */
/* of unsigned characters. */
/* Valid iff 0 <= value <= 255 */
unsigned char values[10000];

/* loop sequencing through values */
for (i = 0; i < 10000; i++)
    values[i] = i % 256;

(b)

Figure 6. Unpacked (a) and packed (b) array structures in C.
Figure 7. Separate (a) and fused (b) loops.

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1 / b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

(a)

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1 / b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

(b)
High-level tools, such as MTool,8 identify procedures or basic blocks that incur large memory overheads. CProf and PFC-Sim,9 on the other hand, allow more detailed analysis by identifying cache misses at the source-line level. This extra detail is not free; MTool runs much faster than profilers requiring address tracing and full cache simulation. However, full simulation also permits a profiler to identify which data structures are responsible for cache misses and to determine the type of miss - features provided by CProf and MemSpy.10

CProf is similar to MemSpy, the difference being the granularity at which source code is annotated and the miss-type classification. MemSpy annotates source code at the procedure level and provides two miss types for uniprocessors - compulsory and replacement. CProf provides fine-grained source identification and data-structure support, and classifies misses as compulsory, capacity, or conflict.
CProf uses a flexible X Windows interface (see Figure A on p. 20) to present the cache profile in a way that helps the programmer determine the cache performance bottlenecks. The data window lists either source lines or data structures
Figure 8. Naive (a) and SPEC column-blocked matrix multiply (b).

      DO 110 K = 1, N
      DO 110 J = 1, M
      DO 110 I = 1, L
      C(I, K) = C(I, K) + A(I, J) * B(J, K)
110   CONTINUE

(a)

      DO 110 J = 1, M, 4
      DO 110 K = 1, N
      DO 110 I = 1, L
      C(I, K) = C(I, K) + A(I, J) * B(J, K)
     &        + A(I, J + 1) * B(J + 1, K)
     &        + A(I, J + 2) * B(J + 2, K)
     &        + A(I, J + 3) * B(J + 3, K)
110   CONTINUE

(b)

References

A.J. Smith, "Cache Memories," ACM Computing Surveys, Vol. 14, No. 3, Sept. 1982, pp. 473-530.

M.S. Lam, E.E. Rothberg, and M.E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), 1991.

M.D. Hill and A.J. Smith, "Evaluating Associativity in CPU Caches," IEEE Trans. Computers, Vol. 38, No. 12, Dec. 1989, pp. 1612-1630.

R.E. Kessler and M.D. Hill, "Page Placement Algorithms for Large Real-Indexed Caches," ACM Trans. Computer Systems, Vol. 10, No. 4, Nov. 1992.

A.K. Porterfield, "Software Methods for Improvement of Cache Performance on Supercomputer Applications," PhD dissertation, Rice Univ., 1989.