Program Locality Analysis Using
Reuse Distance
YUTAO ZHONG
George Mason University
XIPENG SHEN
The College of William and Mary
and
CHEN DING
University of Rochester
On modern computer systems, the memory performance of an application depends on its locality. For
a single execution, locality-correlated measures like average miss rate or working-set size have long
been analyzed using reuse distance—the number of distinct locations accessed between consecutive
accesses to a given location. This article addresses the analysis problem at the program level, where
the size of data and the locality of execution may change significantly depending on the input.
The article presents two techniques that predict how the locality of a program changes with
its input. The first is approximate reuse-distance measurement, which is asymptotically faster
than exact methods while providing a guaranteed precision. The second is statistical prediction of
locality in all executions of a program based on the analysis of a few executions. The prediction
process has three steps: dividing data accesses into groups, finding the access patterns in each
group, and building parameterized models. The resulting prediction may be used on-line with
the help of distance-based sampling. When evaluated on fifteen benchmark applications, the new
techniques predicted program locality with good accuracy, even for test executions that are orders
of magnitude larger than the training executions.
The two techniques are among the first to enable quantitative analysis of whole-program locality in general sequential code. These findings form the basis for a unified understanding of program locality and its many facets. Concluding sections of the article present a taxonomy of related literature along five dimensions of locality and discuss the role of reuse distance in performance modeling, program optimization, cache and virtual memory management, and network traffic analysis.

The article contains material previously published in the 2002 Workshop on Languages, Compilers, and Runtime Systems (LCR), the 2003 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), and the 2003 Annual Symposium of the Los Alamos Computer Science Institute (LACSI).
The authors were supported by the National Science Foundation (CAREER Award CCR-0238176 and two grants, CNS-0720796 and CNS-0509270), the Department of Energy (Young Investigator Award DE-FG02-02ER25525), an IBM CAS Faculty Fellowship, and a gift from Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations.
Authors' addresses: Y. Zhong, George Mason University, Fairfax, VA; email: yzhong@cs.gmu.edu; X. Shen, College of William and Mary, Williamsburg, VA; email: xshen@cs.wm.edu; C. Ding, University of Rochester, Rochester, NY; email: cding@cs.rochester.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2009 ACM 0164-0925/2009/08-ART20 $10.00
DOI 10.1145/1552309.1552310 http://doi.acm.org/10.1145/1552309.1552310
Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—Optimiza-
tion, compilers
General Terms: Measurement, Languages, Algorithms
Additional Key Words and Phrases: Program locality, reuse distance, stack distance, training-based
analysis
ACM Reference Format:
Zhong, Y., Shen, X., and Ding, C. 2009. Program locality analysis using reuse distance. ACM Trans.
Program. Lang. Syst. 31, 6, Article 20 (August 2009), 39 pages.
DOI = 10.1145/1552309.1552310 http://doi.acm.org/10.1145/1552309.1552310
1. INTRODUCTION
Today’s computer systems must manage a vast amount of memory to meet the
data requirements of modern applications. Because of fundamental physical
limits—transistors cannot be infinitely small and signals cannot travel faster
than the speed of light—practically all memory systems are organized as a
hierarchy with multiple layers of fast cache memory. On the software side, the
notion of locality arises from the observation that a program uses only part of
its data at each moment of execution. A program can be said to conform to the
80-20 rule if 80% of its execution requires only 20% of its data. In the general
case, we need to measure the active data usage of a program to understand and
improve its use of cache memory.
Whole-program locality describes how well the data demand of a program can
be satisfied by data caching. Although a basic question in program understand-
ing, it has eluded systematic analysis in the past due to two main obstacles:
the complexity of program code and the effect of program input. In this article,
we address these two difficulties using training-based locality analysis. This
analysis examines the execution of a program rather than analyzing its code.
It profiles a few runs of the program and uses the result to build a statistical
model to predict how the locality changes in other runs. Conceptually, training-
based analysis is analogous to observation and prediction in the physical and
biological sciences.
The basic runtime metric we measure is reuse distance. For each data access
in a sequential execution, the reuse distance is the number of distinct data ele-
ments accessed between the current and previous accesses to the same datum
(the distance is infinite if no prior access exists). It is the same as the LRU stack
distance defined by Mattson et al. [1970]. As an illustration, Figure 1(a) shows
an example access trace and its reuse distances. If we take the histogram of
all (finite) reuse distances, we have the locality signature, which is shown in
Figure 1(b) for the example trace. For a fully-associative LRU cache, an access
misses in the cache if and only if its reuse distance is greater than the cache
size. Figure 1(c) shows all nonzero miss rates of the example execution on all
cache sizes. In general, a locality signature captures the average locality of an
execution from the view of the hardware as the miss rate in caches of all sizes
and all levels of associativity [Mattson et al. 1970; Smith 1976; Hill and Smith
1989], and from the view of the operating system as the size of the working
sets [Denning 1980].

Fig. 1. Example reuse distances, locality signature, and miss rate curve.
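To make the definition concrete, the following minimal Python sketch (our own
illustration, not code from the article) computes reuse distances by brute
force, builds a locality signature as a histogram of the finite distances, and
derives miss rates for a fully-associative LRU cache. With the exclusive
counting used here, an access hits only when its reuse distance is strictly
smaller than the cache size.

    from collections import defaultdict

    def reuse_distances(trace):
        # For each access, count the distinct elements accessed strictly
        # between this access and the previous access to the same datum.
        # None marks a first access (infinite reuse distance).
        last_pos = {}
        distances = []
        for i, datum in enumerate(trace):
            if datum in last_pos:
                distances.append(len(set(trace[last_pos[datum] + 1 : i])))
            else:
                distances.append(None)
            last_pos[datum] = i
        return distances

    def locality_signature(distances):
        # Histogram of all finite reuse distances.
        hist = defaultdict(int)
        for d in distances:
            if d is not None:
                hist[d] += 1
        return dict(hist)

    def lru_miss_rate(distances, cache_size):
        # Fully-associative LRU: an access misses iff its reuse distance
        # is at least the cache size; infinite distances always miss.
        misses = sum(1 for d in distances if d is None or d >= cache_size)
        return misses / len(distances)

    trace = list("abbcaab")                    # a hypothetical toy trace
    dists = reuse_distances(trace)             # [None, None, 0, None, 2, 0, 2]
    print(locality_signature(dists))           # {0: 2, 2: 2}
    print(lru_miss_rate(dists, cache_size=2))  # 5 of 7 accesses miss

The brute-force set construction makes each measurement cost proportional to
the distance itself; Section 2 is devoted to doing the same computation
asymptotically faster.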
At the program level, locality analysis is hampered by complex control flows
and data indirection. For example, pointer usage obscures the location of the
datum being accessed. With reuse distance, we can avoid the difficulty of code
analysis by directly examining the execution or, more accurately, the locality
aspect of the execution. Compilers may make local changes to a program, for
example, by unrolling a loop. Modern processors, likewise, may reorder instruc-
tions within a limited execution window. These transformations affect paral-
lelism but not cache locality. Such transformations are invisible to reuse
distance, since the number and the length of long reuse distances stay the
same with and without them. As a direct measure, reuse
distance is unaffected by coding and execution variations that do not affect
locality.
Furthermore, reuse distance makes it possible to correlate data usage
across training executions. Since a program may allocate different data (or
the same data in different locations) between runs, we cannot directly compare
data addresses, but we may find correlations in their reuse distances. More
importantly, we can partition memory accesses by decomposing the locality
signature into subcomponents with only short- or long-distance reuses. As we
shall see, programs often exhibit consistent patterns across inputs, at least in
some components. As a result, we can characterize whole-program locality by
defining common patterns and identifying program components that have these
patterns.
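A hypothetical sketch of such a decomposition (our own code, not the
article's): given a signature stored as a map from reuse distance to access
count, split it at a chosen threshold and analyze each component separately
across training runs.

    def split_signature(hist, threshold):
        # Partition a locality signature (reuse distance -> access count)
        # into a short-distance and a long-distance component.
        short = {d: c for d, c in hist.items() if d < threshold}
        long_ = {d: c for d, c in hist.items() if d >= threshold}
        return short, long_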
A major difficulty of training-based analysis is the immense size of execution
traces. Even a small program may produce a very long execution, since a modern
processor can execute billions of operations a second. Section 2 addresses the
problem of measuring reuse distance. We present two approximate algorithms:
one guarantees a relative precision and the other an absolute precision. Since
data may span the entire execution between uses, a solution must maintain
some representation of the trace history. The approximate solutions use a data
structure called a scale tree, in which each node represents a time range of
the trace. By properly adjusting these time ranges, an analyzer can examine
the trace and compute approximate reuse distance in effectively constant time
regardless of the length of the trace. Over the past four decades, there has been a
steady stream of solutions developed for the measurement problem. We review
the other solutions in Section 2.3 and present a new lower-bound result in
Section 2.4.
The key to modeling whole-program locality is prediction across program
inputs. Section 3 describes the prediction process, which first divides data ac-
cesses into groups, then identifies statistical patterns in each group, and finally
computes parameterized models that yield the least error. Pattern analysis is
assisted by the fact that reuse distance is always bounded and can change at
most as a linear function of the size of the data. We present five prediction meth-
ods assembled from different division schemes, pattern types, and statistical
equations. Two methods are single-model, which means that a locality compo-
nent, that is, a partition of memory accesses, has only one pattern. The other
three are multimodel, which means that multiple patterns may appear in the
same component. These offline models can be used in online prediction using a
technique called distance-based sampling.
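The sketch below illustrates the flavor of single-model fitting for one group
of reuses; the candidate pattern forms (constant, square-root, and linear in
the data size) and all names are our own illustrative assumptions rather than
the article's code. It picks the pattern whose growth between two training
inputs best matches the observed ratio of average reuse distances, then
extrapolates to a new input size.

    import math

    PATTERNS = {
        # Candidate pattern functions of the data size s; reuse distance
        # is bounded by the data size, so growth is at most linear.
        "constant": lambda s: 1.0,
        "sqrt":     lambda s: math.sqrt(s),
        "linear":   lambda s: s,
    }

    def fit_group(avg_d1, avg_d2, size1, size2):
        # Choose the pattern f whose ratio f(size2)/f(size1) is closest to
        # the observed ratio of average distances, then solve for the
        # coefficient c such that c * f(size1) = avg_d1.
        observed = avg_d2 / avg_d1
        name = min(PATTERNS, key=lambda p:
                   abs(PATTERNS[p](size2) / PATTERNS[p](size1) - observed))
        return name, avg_d1 / PATTERNS[name](size1)

    def predict(name, coeff, size):
        return coeff * PATTERNS[name](size)

    # Hypothetical training data for one group measured on two inputs.
    name, c = fit_group(avg_d1=2e3, avg_d2=2e4, size1=1e6, size2=1e8)
    print(name)                         # "sqrt": distances grew 10x for 100x data
    print(predict(name, c, size=1e10))  # extrapolation to a much larger input

A multimodel method, in contrast, would allow different subsets of the same
component to follow different pattern functions.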
The new techniques of approximate measurement and statistical prediction
are evaluated in Section 4 using real and artificial benchmarks. Section 4.1
compares eight analyzers and shows that approximate analysis is substantially
faster than previous techniques in measuring long reuse distances. Section 4.2
compares five prediction techniques and shows that most programs have pre-
dictable components, and the accuracy and efficiency of prediction increase with
additional training inputs and with multimodel prediction. On average, the lo-
cality in fifteen test programs can be predicted with 94% accuracy. Programs
that are difficult to predict include interpreters and scientific code with high-
dimension data. Interestingly, because reuse distance is execution-based, our
analyses can reveal similarities in inherent data usage among applications that
do not share code.
Our locality prediction techniques are examples of a broader approach we
call behavior-based program analysis. Conventional program analysis identi-
fies invariant properties by examining program code. Behavior analysis infers
common patterns by examining program executions. Section 5 discusses re-
lated work in locality analysis using program code and behavior metrics in-
cluding reuse distance, access frequency and data streams. Locality analysis
has numerous uses in performance modeling, program improvement, cache and
virtual memory management, and network caching. Section 6 presents a tax-
onomy that classifies the uses of reuse distance into five dimensions—program
code, data, input, time, and environment. Many of these uses may benefit from
the fast analysis and predictive modeling described in this article.
2. APPROXIMATE REUSE-DISTANCE MEASUREMENT
In our problem setup, a trace is a sequence of T accesses to N distinct data
items. A reuse-distance analyzer traverses the trace and measures the reuse
distance for each access. At each access, the analyzer finds the previous time the
data was accessed and counts the number of different data elements accessed in
between. To find the previous access, the analyzer assigns each access a logical
time and stores the last access time of each datum in a hash table. In the worst
case, the previous access may occur at the beginning of the trace, the
difference in access time is up to T − 1, and the reuse distance is up to
N − 1. In large applications, T can be over 100 billion, and N is often in
the tens of millions.

Fig. 2. An example illustrating the reuse-distance measurement. Part (a) shows a reuse distance.
Parts (b) and (c) show its measurement by the Bennett-Kruskal algorithm and the Olken algorithm.
Part (d) shows our approximate measurement with a guaranteed precision of 33%.
We use the example in Figure 2 to introduce two previous solutions and
then describe the basic idea for our solution. Part (a) shows an example trace.
Suppose we want to find the reuse distance between the two accesses of b at time
4 and 12. A solution has to store enough information about the trace history
before time 12. Bennett and Kruskal [1975] discovered that it is sufficient to
store only the last access of each datum, as shown in Part (b) for the example
trace. The reuse distance is measured by counting the number of last accesses,
which are stored in a bit vector, rather than by scanning the original trace.
The efficiency was improved by Olken [1981], who organized the last accesses
as nodes in a search tree keyed by their access time. The Olken-style tree for
the example trace has 7 nodes, one for the last access of each datum, as shown
in Figure 2(c). The reuse distance is measured by counting the number of nodes
whose key values are between 4 and 12. The counting can be done in a single
tree search, first finding the node with key value 4 and then backing up to the
root accumulating the subtree weights [Olken 1981]. Since the algorithm needs
one tree node for each data location, the search tree can grow to a significant
size when analyzing programs with a large amount of data.
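To make the counting concrete, the following self-contained Python sketch (our
stand-in, not the authors' implementation) keeps Bennett and Kruskal's
last-access flags in a Fenwick tree indexed by logical time, so the range
count that Olken performs with an augmented search tree takes O(log T) time
per access.

    class Fenwick:
        # Binary indexed tree over logical time; position t holds 1 iff the
        # access at time t is currently the last access of its datum.
        def __init__(self, n):
            self.tree = [0] * (n + 1)
        def add(self, i, delta):            # i is 1-based
            while i < len(self.tree):
                self.tree[i] += delta
                i += i & -i
        def prefix(self, i):                # sum of positions 1..i
            total = 0
            while i > 0:
                total += self.tree[i]
                i -= i & -i
            return total

    def measure(trace):
        # Exact reuse distances, one O(log T) range count per access.
        fen = Fenwick(len(trace))
        last = {}                           # datum -> time of its last access
        out = []
        for t, datum in enumerate(trace, start=1):
            if datum in last:
                p = last[datum]
                # Last accesses strictly between p and t are exactly the
                # distinct data touched since the previous access to datum.
                out.append(fen.prefix(t - 1) - fen.prefix(p))
                fen.add(p, -1)              # time p is no longer a last access
            else:
                out.append(None)            # infinite distance
            fen.add(t, 1)                   # t becomes datum's last access
            last[datum] = t
        return out

    print(measure(list("abbcaab")))         # [None, None, 0, None, 2, 0, 2]

As with Olken's tree, the number of live flags equals the number of distinct
data seen so far, which is exactly the cost that the approximate algorithms
below shrink.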
While it is costly to measure long reuse distances, we rarely need the exact
length. Often the first few digits suffice. For example, if a reuse distance is
about one million, it rarely matters whether the exact value is one million or
one million and one. Next we describe two approximate algorithms that extend
the Olken algorithm by adapting and trimming the search tree.
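The precision guarantee itself can be pictured without the tree machinery: if
a distance only needs to be accurate to within a relative error, exact values
can be collapsed into logarithmically spaced buckets. The sketch below is our
own illustration of that guarantee (the article's algorithms instead obtain it
by adapting time ranges inside the search tree); the default error of 33%
matches Figure 2(d).

    import math

    def approximate_distance(d, rel_err=0.33):
        # Map a distance into the bucket [(1+e)^k, (1+e)^(k+1)) and return
        # the bucket midpoint; the result differs from d by less than rel_err.
        if d is None or d == 0:
            return d
        base = 1.0 + rel_err
        k = math.floor(math.log(d, base))
        lo, hi = base ** k, base ** (k + 1)
        return (lo + hi) / 2

    print(approximate_distance(1_000_000) == approximate_distance(1_000_001))  # True

Merging adjacent time ranges under such a tolerance is, loosely, how the scale
tree of Section 2 keeps both its size and the per-access analysis cost bounded.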
The new algorithms guarantee two types of precision for the approximate
distance, d_approximate, compared to the actual distance, d_actual. In both
types, the

References
Proceedings ArticleDOI

A data locality optimizing algorithm

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.
Book

High-Performance Compilers for Parallel Computing

TL;DR: This book discusses Programming Language Features, Data Dependence, Dependence System Solvers, and Run-time Dependence Testing for High Performance Systems.
Journal ArticleDOI

Evaluation techniques for storage hierarchies

TL;DR: A new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms.
Journal ArticleDOI

Self-adjusting binary search trees

TL;DR: The splay tree, a self-adjusting form of binary search tree, is developed and analyzed and is found to be as efficient as balanced trees when total running time is the measure of interest.
Proceedings ArticleDOI

The space complexity of approximating the frequency moments

TL;DR: It turns out that the numbers F_0, F_1, and F_2 can be approximated in logarithmic space, whereas the approximation of F_k for k ≥ 6 requires n^Ω(1) space.
Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "Program locality analysis using reuse distance" ?

This article addresses the analysis problem at the program level, where the size of data and the locality of execution may change significantly depending on the input. The article presents two techniques that predict how the locality of a program changes with its input. 

Locality is considered a fundamental concept in computing because to understand a computation the authors must understand its use of data. 

Since one tree node is added for each access, the number of accesses between successive tree compressions is at least M/2.

Almasi et al. [2002] showed that by recording the empty regions instead of the last accesses in the trace, they could improve the efficiency of vector and tree based methods by 20% to 40%. 

Mattson et al. [1970] showed that buffer memory could be modeled as a stack, if the method of buffer management satisfied the inclusion property in that a smaller buffer would hold a subset of data held by a larger buffer. 

To measure only the cost of reuse-distance analysis, the hashing step is bypassed by pre-computing the last access time in all analyzers (except for KHW, which does not need the access time). 

Using more than two training inputs may produce a better prediction, because more data may reduce the noise from imprecise reuse distance measurement and histogram construction. 

For a group of reuse distances, the authors calculate the ratio of their average distance in two executions, d_i/d̂_i, and pick f_i to be the pattern function that is closest to d_i/d̂_i.

The consistency across inputs might be due to consistency in programmers’ coding style, for example, the distribution of function sizes. 

To satisfy the first requirement, the authors transfer the last accesses of c data elements from the precise trace to the approximate trace when the size of the precise trace exceeds 2c. 

Table VI shows the prediction accuracy when the size of the largest training run is reduced to 1.6%, 3%, and 13% of the size used previously in Table II.