
Cache profiling and the SPEC benchmarks: a case study

Alvin R. Lebeck and David A. Wood
Computer, Vol. 27, Iss. 10, pp. 15-26, October 1994
Abstract
A vital tool-box component, the CProf cache profiling system lets programmers identify hot spots by providing cache performance information at the source-line and data-structure level. Our purpose is to introduce a broad audience to cache performance profiling and tuning techniques. Although used sporadically in the supercomputer and multiprocessor communities, these techniques also have broad applicability to programs running on fast uniprocessor workstations. We show that cache profiling, using our CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.


Cache Profiling and the SPEC Benchmarks: A Case Study

Alvin R. Lebeck and David A. Wood
University of Wisconsin-Madison
Cache memories help bridge the cycle-time gap between fast microprocessors and relatively slow main memories. By holding recently referenced regions of memory, caches can reduce the number of cycles the processor must stall while waiting for data. As the disparity between processor and main memory cycle times increases - by 40 percent or more per year - cache performance becomes ever more critical.

Caches only work well, however, for programs that exhibit sufficient locality. Other programs have reference patterns that caches cannot exploit; they spend excessive execution time transferring data between main memory and cache. For example, the SPEC92 benchmark tomcatv spends as much as 53 percent of its time waiting for memory on a DECstation 5000/125.
Fortunately, for many programs, small source-code changes - called program transformations - can radically alter memory reference patterns, greatly improving cache performance. Consider the well-known example in Figure 1 of traversing a two-dimensional Fortran array. Since Fortran lays out two-dimensional arrays in column-major order, consecutive elements of a column are stored in consecutive memory locations. Traversing columns in the inner loop (by incrementing the row index) produces a sequential reference pattern and, hence, spatial locality that most caches can exploit. If, instead, the inner loop traverses rows, each inner-loop iteration references a different memory region.

For arrays that are much larger than the cache, the column-traversing version will have much better cache behavior than the row-traversing version. On a DECstation 5000/125, the column-traversing version runs 1.69 times faster than the row-traversing version on an array of single-precision floating-point numbers.
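The same effect is easy to reproduce in C, whose arrays are laid out in row-major order, so the roles of the two loop orders are reversed: the inner loop should increment the column index. The following sketch is our own illustration, not code from the article; the array dimensions and the repetition count are arbitrary:

#include <stdio.h>
#include <time.h>

#define ROWS 5000
#define COLS 1000

static float xa[ROWS][COLS];   /* C lays this out row by row */

int main(void)
{
    clock_t t0, t1, t2;
    int i, j, k;

    t0 = clock();
    for (k = 0; k < 10; k++)            /* good order for C: the inner   */
        for (i = 0; i < ROWS; i++)      /* loop walks consecutive        */
            for (j = 0; j < COLS; j++)  /* memory locations              */
                xa[i][j] *= 2.0f;
    t1 = clock();
    for (k = 0; k < 10; k++)            /* bad order for C: each inner-  */
        for (j = 0; j < COLS; j++)      /* loop iteration jumps to a     */
            for (i = 0; i < ROWS; i++)  /* different memory region       */
                xa[i][j] *= 2.0f;
    t2 = clock();

    printf("row order:    %.2f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("column order: %.2f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}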
We call this type of analysis a mental simulation of the cache behavior. By mentally applying the program reference pattern to the underlying cache organization, we can predict the program's cache performance. This simulation is similar to asymptotic analysis of algorithms (for example, worst-case behavior), which programmers commonly use to study the number of operations executed as a function of input size. When analyzing cache behavior, programmers perform a similar analysis, but they must also have a basic understanding of cache operation (see the next section).

(a) Row traversal:

      DO 20 K = 1,100
      DO 20 I = 1,5000
      DO 20 J = 1,100
   20 XA(I, J) = 2 * XA(I, J)

(b) Column traversal:

      DO 20 K = 1,100
      DO 20 J = 1,100
      DO 20 I = 1,5000
   20 XA(I, J) = 2 * XA(I, J)

Figure 1. Row-major traversal of a Fortran array (a), and column traversal (b).
Although asymptotic analysis is effective for certain algorithms, analysis is difficult when applied to large, complex programs. Instead, programmers often rely on an execution-time profile to isolate problematic code sections, to which they later apply asymptotic analysis. Unfortunately, traditional execution-time profiling tools (for example, gprof1) are generally insufficient to identify cache performance problems. For the Fortran array example (Figure 1), an execution-time profile would identify the procedure or source lines as a bottleneck, but the programmer could easily conclude that the floating-point operations were responsible. We can see, therefore, that programmers would benefit from a profile that focuses specifically on a program's cache behavior.

Our purpose in this article is to introduce a broad audience to cache performance profiling and tuning techniques. Although used sporadically in the supercomputer and multiprocessor communities, these techniques also have broad applicability to programs running on fast uniprocessor workstations. We show that cache profiling, using our CProf cache profiling system, improves program performance by focusing a programmer's attention on problematic code sections and providing insight into appropriate program transformations.
Understanding cache behavior

Cache memory terminology

Associativity - The number of unique places in the cache where a particular block may reside.
Block size - The number of contiguous bytes fetched on each cache miss.
Cache hit - A memory reference satisfied by the cache.
Cache miss - A memory reference not satisfied by the cache.
Capacity - The total number of bytes a cache may contain.
Capacity miss - A reference that misses in a fully associative cache with LRU replacement.
Compulsory miss - A reference that misses because it is the first reference to a cache block.
Conflict miss - A reference that hits in a fully associative cache but misses in an A-way set-associative cache.
Direct mapped - A cache in which a block can reside in exactly one place in the cache.
Fully associative - A cache in which a block can reside in any place in the cache (A = C/B).
Miss penalty - The time required to fetch data from main memory into the cache on a cache miss.
Set-associative - A cache in which a block can reside in exactly A places in the cache.
Caches sit between the (fast) processor and (slow) main memory, holding regions of recently referenced main memory. References satisfied by the cache - called hits - proceed at processor speed; those unsatisfied - called misses - incur a cache miss penalty to fetch the corresponding data from main memory. Most current processors must wait, or stall, until the data arrive.

Caches work because most programs exhibit significant locality. Temporal locality exists when a program references the same memory location multiple times in a short period. Caches exploit temporal locality by retaining recently referenced data. Spatial locality occurs when the program accesses memory locations close to those it has recently accessed. Caches exploit spatial locality by fetching multiple contiguous words - a cache block - whenever a miss occurs.
Caches are characterized by three major parameters: capacity (C), block size (B), and associativity (A). A cache's capacity simply defines the total number of bytes it may contain. The block size determines how many contiguous bytes are fetched on each cache miss. A cache may contain at most C/B blocks at any one time. Associativity refers to the number of unique cache locations where a particular block may reside. If a block can reside in any cache location (A = C/B), we call it a fully associative cache; if it can reside in exactly one location (A = 1), we call it direct-mapped; if it can reside in exactly A locations, we call it A-way set-associative. (Smith's survey2 describes cache design in more detail.)

With these three parameters, a programmer can mentally simulate cache behavior for simple algorithms. Consider the simple example of nested loops where the outer loop iterates L times and the inner loop sequentially accesses an array of N 4-byte integers:
for (i = 0; i < L; ++i)
    for (j = 0; j < N; ++j)
        a[j] += 2;
If the array size (4N) is smaller than the cache capacity (see Figure 2a-b), we expect the number of cache misses to equal the array size divided by the cache block size, 4N/B (that is, the number of cache blocks required to hold the entire array). If the array size is larger than the cache capacity (see Figure 2c), the expected number of misses is equal to the number of cache blocks required to contain the array times the number of outer-loop iterations (4NL/B).
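As a concrete instance of this estimate (the numbers are ours, not the article's): an array of N = 5,000 4-byte integers occupies 20,000 bytes, so with a block size of B = 32 bytes one pass costs 4N/B = 625 misses; if the array exceeds the cache capacity, L = 100 passes cost roughly 4NL/B = 62,500 misses. The same arithmetic as a small C helper:

#include <stdio.h>

/* Estimated misses for l sequential passes over an array of n 4-byte
   integers, per the 4N/B and 4NL/B formulas above; capacity and block
   are the cache parameters C and B. (Illustrative helper, ours.) */
static long expected_misses(long n, long l, long capacity, long block)
{
    long bytes = 4 * n;
    long blocks = (bytes + block - 1) / block; /* blocks to hold the array */
    if (bytes <= capacity)
        return blocks;     /* 4N/B: every miss is a compulsory miss     */
    return l * blocks;     /* 4NL/B: the array is refetched each pass   */
}

int main(void)
{
    printf("%ld\n", expected_misses(5000, 1, 32768, 32));  /* fits: 625  */
    printf("%ld\n", expected_misses(5000, 100, 8192, 32)); /* 62500      */
    return 0;
}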

Compilers may someday automate this analysis and transform the code to reduce the miss frequency; recent research has produced promising results for restricted problem domains.3,4 However, for general codes using current commercial compilers, the programmer must manually analyze the programs and manually perform transformations.
To select appropriate program transformations, a programmer must first know what causes poor cache behavior. One approach to understanding why cache misses occur is to classify each miss as one of three disjoint types6: compulsory, capacity, and conflict. (Hill and Smith6 define compulsory, capacity, and conflict misses in terms of miss ratios. When generalizing this concept to individual cache misses, we must introduce anticonflict misses, which miss in a fully associative cache with LRU replacement but hit in an A-way set-associative cache. Anticonflict misses are generally only useful for understanding the rare cases when a set-associative cache performs better than a fully associative cache of the same capacity.)
A compulsory miss is caused by referencing a previously unreferenced cache block. In the small array example (Figure 2b), all misses are compulsory. Eliminating a compulsory miss requires prefetching the data, either by an explicit prefetch operation5 or by placing more data items in a single cache block. For example, if the integers in our example require only 2 bytes rather than 4, we can cut the misses in half by changing the declaration. However, since compulsory misses usually constitute only a fraction of all cache misses, we do not discuss them further.
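The declaration change the authors describe might look like the following sketch (the identifier names are ours):

/* before: 4-byte elements, so a 16-byte cache block holds 4 values */
int counts_wide[5000];

/* after: if every value fits in 16 bits, a 16-byte block holds 8   */
/* values, halving the compulsory misses of a sequential pass      */
short counts_narrow[5000];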
A reference that misses in a fully associative cache with LRU replacement is classified as a capacity miss. Capacity misses are caused by referencing more cache blocks than can fit in the cache. In the large array example (Figure 2c), we expect to see many capacity misses. Programmers can reduce capacity misses by restructuring the program to re-reference blocks while they are in cache. For example, it may be possible to modify the loop structure to perform the L outer-loop iterations on a portion of the array that fits in the cache and then move to the next portion of the array. This technique, called blocking, is similar to the techniques used to exploit the vector registers in some supercomputers.
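Applied to the earlier L-pass loop, the restructuring might look like this sketch (ours; CACHE_BYTES is an assumed cache capacity): every chunk of the array receives all L passes while it is resident, instead of each pass streaming over the whole array.

#include <stddef.h>

#define CACHE_BYTES 8192                          /* assumed cache capacity   */
#define CHUNK ((int)(CACHE_BYTES / sizeof(int)))  /* array elements per chunk */

void blocked_passes(int *a, int n, int l)
{
    int lo, hi, i, j;
    for (lo = 0; lo < n; lo += CHUNK) {
        hi = (lo + CHUNK < n) ? lo + CHUNK : n;
        for (i = 0; i < l; ++i)         /* all L passes touch this chunk    */
            for (j = lo; j < hi; ++j)   /* while it still fits in the cache */
                a[j] += 2;
    }
}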
Figure 2. Determining expected cache behavior. Sequentially accessing a small array (b) that fits in the cache (a) should produce M cache misses, where M is the number of cache blocks required to hold the array. Accessing an array that is much larger than the cache (c) should result in ML cache misses, where L is the number of passes over the array.

Figure 3. Conflicting cache mappings. The presence of conflict misses indicates a mapping problem: (b) shows two arrays that fit in the cache (a) with a mapping that produces no conflict misses, and (c) shows two mappings that will result in conflict misses.
A reference that hits in a fully associative cache but misses in an A-way set-associative cache is classified as a conflict miss. A conflict miss to block X indicates that block X has been referenced in the recent past, since it is contained in the fully associative cache, but at least A other cache blocks that map to the same cache set have been accessed since the last reference to block X.
Consider the execution of a doubly nested loop on a machine with a direct-mapped cache, where the inner loop sequentially accesses two arrays (for example, dot-product). If the combined array size is smaller than the cache, we might expect only compulsory misses. However, this ideal case occurs only if the two arrays map to different cache sets (Figure 3b). If they overlap, either partially or entirely (Figure 3c), then we will get conflict misses as array elements compete for space in the set. Eliminating conflict misses requires a program transformation that changes either the memory allocation of the two arrays, so that contemporaneous accesses do not compete for the same sets, or the manner in which the arrays are accessed.
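The mapping itself is easy to compute: a byte address addr falls in set (addr / B) mod (C/B) for a direct-mapped cache. Below is a small sketch of ours that checks whether corresponding elements of two dot-product operands compete for the same set; with the power-of-two sizes chosen here and typical contiguous placement, every pair collides:

#include <stdio.h>
#include <stdint.h>

#define CAPACITY 8192              /* assumed cache capacity in bytes  */
#define BLOCK 32                   /* assumed block size in bytes      */
#define NSETS (CAPACITY / BLOCK)   /* direct-mapped: one block per set */

static int cache_set(const void *p)
{
    return (int)(((uintptr_t)p / BLOCK) % NSETS);
}

int main(void)
{
    static float x[2048], y[2048];  /* dot-product operands, 8KB each */
    int i, collisions = 0;

    for (i = 0; i < 2048; i++)
        if (cache_set(&x[i]) == cache_set(&y[i]))
            collisions++;           /* x[i] and y[i] would evict each other */

    printf("%d of 2048 element pairs map to the same set\n", collisions);
    return 0;
}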
Our discussion assumes a cache indexed with virtual addresses. Many systems index their caches with real or physical addresses, making cache behavior strongly dependent on page placement. However, many operating systems use page coloring to minimize this effect, thus reducing the performance difference between virtual-indexed and real-indexed caches.7

/* old declaration of two arrays */
int val[SIZE];
int key[SIZE];

/* new declaration of */
/* array of structures */
struct merge {
    int val;
    int key;
};
struct merge merged_array[SIZE];

(a)

C old declaration
      integer X(N, N)
      integer Y(N, N)

C new declaration
      integer XY(2*N, N)

C preprocessor macro
C definitions to perform addressing
#define X(i, j) XY((2*i) - 1, j)
#define Y(i, j) XY((2*i), j)

(b)

Figure 4. Examples of merging arrays in C (a) and Fortran77 (b).
/* old declaration of a twelve */
/* byte structure */
struct ex_struct {
    int val1, val2, val3;
};

/* new declaration of structure */
/* padded to 16-byte block size */
struct ex_struct {
    int val1, val2, val3;
    char pad[4];
};

(a)

/* original allocation does not */
/* guarantee alignment */
ar = (struct ex_struct *)
    malloc(sizeof(struct ex_struct) * SIZE);

/* new code to guarantee alignment */
/* of structure. */
ar = (struct ex_struct *)
    malloc(sizeof(struct ex_struct) * (SIZE + 1));
ar = (struct ex_struct *)((((int) ar + B - 1) / B) * B);

(b)

Figure 5. Padding (a) and aligning structures (b) in C.
Techniques for improving cache behavior

Program transformations can be classified by the type of cache misses they eliminate. Conflict misses can be reduced by merging arrays, padding and aligning structures, packing structures and arrays, and interchanging loops. The first three techniques change the allocation of data structures, whereas loop interchange modifies the order in which data structures are referenced. Capacity misses can be eliminated by program transformations that reuse data before it is displaced from the cache, such as loop fusion, structure and array packing, and loop interchange.

Merging arrays. Some programs contemporaneously reference two (or more) arrays of the same dimension using the same indices. By merging multiple arrays into a single compound array, the programmer increases spatial locality and potentially reduces conflict misses. In the C programming language, this is accomplished by declaring an array of structures rather than two arrays (Figure 4a). This simple transformation can also be performed in Fortran90, which provides structures. Since Fortran77 does not have structures, the programmer can obtain the same effect using complex indexing (Figure 4b).
Padding and aligning structures. Referencing a data structure that spans two cache blocks may incur two misses, even if the structure is smaller than the block size. Padding structures to a multiple of the block size and aligning them on a block boundary can eliminate "misalignment" misses, which generally show up as conflict misses. Padding is easily accomplished in C (Figure 5a) by declaring extra pad fields. Alignment is a little more difficult, since the address of the structure must be a multiple of the cache block size. Statically declared structures generally require compiler support. Dynamically allocated structures can be aligned by the programmer using simple pointer arithmetic (Figure 5b). Some dynamic memory allocators (for example, some versions of malloc()) return cache block-aligned memory.
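On current systems, the same alignment can be obtained without manual pointer arithmetic through the POSIX allocator posix_memalign (a later interface, not mentioned in the article); a minimal sketch, assuming a 16-byte cache block:

#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>

struct ex_struct { int val1, val2, val3; char pad[4]; };

/* allocate n block-aligned structures; returns 0 on success */
int alloc_aligned(struct ex_struct **ar, size_t n)
{
    /* 16 stands in for the cache block size B */
    return posix_memalign((void **)ar, 16, n * sizeof(struct ex_struct));
}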
Packing. Packing is the opposite of padding. By packing an array into the smallest space possible, the programmer increases spatial locality, which can reduce conflict and capacity misses. In Figure 6a, the programmer observes that the elements of array values are never greater than 255 and, hence, could fit in type unsigned char, which requires 8 bits, instead of unsigned int, which typically requires 32 bits. For a machine with 16-byte cache blocks, the code in Figure 6b permits 16 elements per block, rather than 4, reducing the maximum number of cache misses by a factor of 4.
Loop fusion. Numeric programs often consist of several operations on the same data, coded as multiple loops over the same arrays. By combining these loops, a programmer increases the program's temporal locality and frequently reduces the number of capacity misses. The examples in Figure 7 combine two doubly nested loops so that all operations are performed on an entire row before moving on to the next. Loop fusion is the exact opposite of loop fission, a program transformation that splits independent portions of a loop body into separate loops. Loop fission helps an optimizing compiler detect loops that exploit vector hardware on some supercomputers. Because most vector supercomputers do not employ caches, relying instead on high-bandwidth interleaved main memories, some of the transformations described in this article may be counterproductive for these machines.
Blocking. Blocking is a general technique for restructuring a program to reuse chunks of data that fit in the cache and reduce capacity misses. The SPEC matrix multiply (part of dnasa7, a Fortran77 program) implements a column-blocked algorithm (Figure 8b) that achieves a 2.04 speedup versus a naive implementation (Figure 8a) on a DECstation 5000/125. The algorithm tries to keep four columns of the A matrix in cache for the duration of the outermost loop, ideally getting N - 1 hits for each miss. If the matrix is so large that four columns do not fit in the cache, we can use a two-dimensional (row and column) blocked algorithm instead.
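The article does not show the two-dimensional variant; a sketch of our own in C (TILE is an assumed tile edge, tuned so a few TILE x TILE tiles fit in the cache) might look like:

/* two-dimensional (row and column) blocked matrix multiply:   */
/* C += A * B, all n x n, stored as flat row-major arrays      */
#define TILE 32   /* assumed tile edge; tune to the cache size */

void matmul_blocked(int n, const double *a, const double *b, double *c)
{
    int i0, j0, k0, i, j, k;
    for (i0 = 0; i0 < n; i0 += TILE)
        for (k0 = 0; k0 < n; k0 += TILE)
            for (j0 = 0; j0 < n; j0 += TILE)
                /* one tile of C is updated using one tile of A and  */
                /* one tile of B, all three resident at once         */
                for (i = i0; i < i0 + TILE && i < n; i++)
                    for (k = k0; k < k0 + TILE && k < n; k++) {
                        double aik = a[i*n + k];
                        for (j = j0; j < j0 + TILE && j < n; j++)
                            c[i*n + j] += aik * b[k*n + j];
                    }
}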
CProf cache profiling system

Cache misses result from the complex interaction among algorithm, memory allocation, and cache configuration; when the program is executed, the reality may not match the programmer's expectations. CProf, our cache profiling system, addresses this problem by identifying where cache misses occur and by classifying them as compulsory, capacity, or conflict misses.
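CProf's own machinery is not shown in the article; to convey the flavor of the simulation involved, here is a toy classifier entirely of our own devising. A direct-mapped tag array detects hits, and a first-touch record separates compulsory misses from the rest (fully distinguishing capacity from conflict misses would additionally require simulating a fully associative LRU cache, per the sidebar definitions):

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define CAPACITY 8192              /* assumed capacity (bytes)   */
#define BLOCK 32                   /* assumed block size (bytes) */
#define NSETS (CAPACITY / BLOCK)   /* direct-mapped              */

static uintptr_t tags[NSETS];      /* resident block per set     */
static int valid[NSETS];

/* classify one reference: 0 = hit, 1 = compulsory miss, 2 = other miss */
static int reference(uintptr_t addr, int *seen)
{
    uintptr_t block = addr / BLOCK;
    int set = (int)(block % NSETS);

    if (valid[set] && tags[set] == block)
        return 0;                  /* hit                           */
    valid[set] = 1;
    tags[set] = block;             /* fetch the block into its set  */
    if (!seen[block]) {
        seen[block] = 1;
        return 1;                  /* first-ever touch: compulsory  */
    }
    return 2;                      /* re-fetch: capacity or conflict */
}

int main(void)
{
    enum { BYTES = 20000, PASSES = 100 };
    int *seen = calloc(BYTES / BLOCK + 1, sizeof(int));
    long counts[3] = {0, 0, 0};
    long pass, i;

    for (pass = 0; pass < PASSES; pass++)   /* the L-pass loop from earlier, */
        for (i = 0; i < BYTES; i += 4)      /* replayed as a synthetic trace */
            counts[reference((uintptr_t)i, seen)]++;

    printf("hits %ld, compulsory %ld, capacity/conflict %ld\n",
           counts[0], counts[1], counts[2]);
    free(seen);
    return 0;
}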
Cache- and memory-system profilers differ from the better-known execution-time profilers by focusing on memory-system performance. Memory-system profilers do not obviate execution-time profilers; instead, they provide vital supplementary information to quickly identify memory-system bottlenecks and tune memory-system performance.

Cache- and memory-system profilers differ in the level of detail they present.
/* old declaration of an array */
/* of unsigned integers. */
unsigned int values[10000];

/* loop sequencing through values */
for (i = 0; i < 10000; i++)
    values[i] = i % 256;

(a)

/* new declaration of an array */
/* of unsigned characters. */
/* Valid iff 0 <= value <= 255 */
unsigned char values[10000];

/* loop sequencing through values */
for (i = 0; i < 10000; i++)
    values[i] = i % 256;

(b)

Figure 6. Unpacked (a) and packed (b) array structures in C.
Figure 7. Separate (a) and fused (b) loops.

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        a[i][j] = 1 / b[i][j] * c[i][j];
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        d[i][j] = a[i][j] + c[i][j];

(a)

for (i = 0; i < N; i++)
    for (j = 0; j < N; j++) {
        a[i][j] = 1 / b[i][j] * c[i][j];
        d[i][j] = a[i][j] + c[i][j];
    }

(b)
High-level tools, such as MTool,8 identify procedures or basic blocks that incur large memory overheads. CProf and PFC-Sim,9 on the other hand, allow more detailed analysis by identifying cache misses at the source-line level. This extra detail is not free; MTool runs much faster than profilers requiring address tracing and full cache simulation. However, full simulation also permits a profiler to identify which data structures are responsible for cache misses and to determine the type of miss - features provided by CProf and MemSpy.10

CProf is similar to MemSpy, the difference being the granularity at which source code is annotated and the miss-type classification. MemSpy annotates source code at the procedure level and provides two miss types for uniprocessors - compulsory and replacement. CProf provides fine-grained source identification and data-structure support, and classifies misses as compulsory, capacity, or conflict.
CProf uses a flexible X Windows interface (see Figure A on p. 20) to present the cache profile in a way that helps the programmer determine the cache performance bottlenecks. The data window lists either source lines or data structures
Figure 8. Naive (a) and SPEC column-blocked matrix multiply (b).

      DO 110 K = 1, N
      DO 110 J = 1, M
      DO 110 I = 1, L
      C(I, K) = C(I, K) + A(I, J) * B(J, K)
110   CONTINUE

(a)

      DO 110 J = 1, M, 4
      DO 110 K = 1, N
      DO 110 I = 1, L
      C(I, K) = C(I, K) + A(I, J) * B(J, K)
     &        + A(I, J + 1) * B(J + 1, K)
     &        + A(I, J + 2) * B(J + 2, K)
     &        + A(I, J + 3) * B(J + 3, K)
110   CONTINUE

(b)

References

A.J. Smith, "Cache Memories," ACM Computing Surveys, Vol. 14, No. 3, Sept. 1982, pp. 473-530.

M.S. Lam, E.E. Rothberg, and M.E. Wolf, "The Cache Performance and Optimizations of Blocked Algorithms," Proc. Fourth Int'l Conf. Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), 1991.

M.D. Hill and A.J. Smith, "Evaluating Associativity in CPU Caches," IEEE Trans. Computers, Vol. 38, No. 12, Dec. 1989, pp. 1612-1630.

R.E. Kessler and M.D. Hill, "Page Placement Algorithms for Large Real-Indexed Caches," ACM Trans. Computer Systems, Vol. 10, No. 4, Nov. 1992.

A.K. Porterfield, "Software Methods for Improvement of Cache Performance on Supercomputer Applications," PhD dissertation, Rice Univ., 1989.