Program Locality Analysis Using
Reuse Distance
YUTAO ZHONG
George Mason University
XIPENG SHEN
The College of William and Mary
and
CHEN DING
University of Rochester
On modern computer systems, the memory performance of an application depends on its locality. For
a single execution, locality-correlated measures like average miss rate or working-set size have long
been analyzed using reuse distance—the number of distinct locations accessed between consecutive
accesses to a given location. This article addresses the analysis problem at the program level, where
the size of data and the locality of execution may change significantly depending on the input.
The article presents two techniques that predict how the locality of a program changes with
its input. The first is approximate reuse-distance measurement, which is asymptotically faster
than exact methods while providing a guaranteed precision. The second is statistical prediction of
locality in all executions of a program based on the analysis of a few executions. The prediction
process has three steps: dividing data accesses into groups, finding the access patterns in each
group, and building parameterized models. The resulting prediction may be used on-line with
the help of distance-based sampling. When evaluated on fifteen benchmark applications, the new
techniques predicted program locality with good accuracy, even for test executions that are orders
of magnitude larger than the training executions.
The two techniques are among the first to enable quantitative analysis of whole-program locality in general sequential code. These findings form the basis for a unified understanding of program locality and its many facets. Concluding sections of the article present a taxonomy of related literature along five dimensions of locality and discuss the role of reuse distance in performance modeling, program optimization, cache and virtual memory management, and network traffic analysis.

The article contains material previously published in the 2002 Workshop on Languages, Compilers, and Runtime Systems (LCR), the 2003 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), and the 2003 Annual Symposium of the Los Alamos Computer Science Institute (LACSI).
The authors were supported by the National Science Foundation (CAREER Award CCR-0238176 and two grants, CNS-0720796 and CNS-0509270), the Department of Energy (Young Investigator Award DE-FG02-02ER25525), an IBM CAS Faculty Fellowship, and a gift from Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding organizations.
Authors' addresses: Y. Zhong, George Mason University, Fairfax, VA; email: yzhong@cs.gmu.edu; X. Shen, College of William and Mary, Williamsburg, VA; email: xshen@cs.wm.edu; C. Ding, University of Rochester, Rochester, NY; email: cding@cs.rochester.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2009 ACM 0164-0925/2009/08-ART20 $10.00
DOI 10.1145/1552309.1552310 http://doi.acm.org/10.1145/1552309.1552310
Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—Optimiza-
tion, compilers
General Terms: Measurement, Languages, Algorithms
Additional Key Words and Phrases: Program locality, reuse distance, stack distance, training-based
analysis
ACM Reference Format:
Zhong, Y., Shen, X., and Ding, C. 2009. Program locality analysis using reuse distance. ACM Trans.
Program. Lang. Syst. 31, 6, Article 20 (August 2009), 39 pages.
DOI = 10.1145/1552309.1552310 http://doi.acm.org/10.1145/1552309.1552310
1. INTRODUCTION
Today’s computer systems must manage a vast amount of memory to meet the
data requirements of modern applications. Because of fundamental physical
limits—transistors cannot be infinitely small and signals cannot travel faster
than the speed of light—practically all memory systems are organized as a
hierarchy with multiple layers of fast cache memory. On the software side, the
notion of locality arises from the observation that a program uses only part of
its data at each moment of execution. A program can be said to conform to the
80-20 rule if 80% of its execution requires only 20% of its data. In the general
case, we need to measure the active data usage of a program to understand and
improve its use of cache memory.
Whole-program locality describes how well the data demand of a program can
be satisfied by data caching. Although a basic question in program understand-
ing, it has eluded systematic analysis in the past due to two main obstacles:
the complexity of program code and the effect of program input. In this article,
we address these two difficulties using training-based locality analysis. This
analysis examines the execution of a program rather than analyzing its code.
It profiles a few runs of the program and uses the result to build a statistical
model to predict how the locality changes in other runs. Conceptually, training-
based analysis is analogous to observation and prediction in the physical and
biological sciences.
The basic runtime metric we measure is reuse distance. For each data access
in a sequential execution, the reuse distance is the number of distinct data ele-
ments accessed between the current and previous accesses to the same datum
(the distance is infinite if no prior access exists). It is the same as the LRU stack
distance defined by Mattson et al. [1970]. As an illustration, Figure 1(a) shows
an example access trace and its reuse distances. If we take the histogram of
all (finite) reuse distances, we have the locality signature, which is shown in
Figure 1(b) for the example trace. For a fully-associative LRU cache, an access
misses in the cache if and only if its reuse distance is greater than the cache
size. Figure 1(c) shows all nonzero miss rates of the example execution on all
cache sizes. In general, a locality signature captures the average locality of an
execution from the view of the hardware as the miss rate in caches of all sizes
and all levels of associativity [Mattson et al. 1970; Smith 1976; Hill and Smith
1989], and from the view of the operating system as the size of the working
sets [Denning 1980].

Fig. 1. Example reuse distances, locality signature, and miss rate curve.
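To make the definition concrete, the following minimal Python sketch (our own
illustration, not code from the article) computes reuse distances by brute
force, builds a locality signature as a histogram of the finite distances, and
derives miss rates for a fully-associative LRU cache. With the exclusive
counting used here, an access hits only when its reuse distance is strictly
smaller than the cache size.

    from collections import defaultdict

    def reuse_distances(trace):
        # For each access, count the distinct elements accessed strictly
        # between this access and the previous access to the same datum.
        # None marks a first access (infinite reuse distance).
        last_pos = {}
        distances = []
        for i, datum in enumerate(trace):
            if datum in last_pos:
                distances.append(len(set(trace[last_pos[datum] + 1 : i])))
            else:
                distances.append(None)
            last_pos[datum] = i
        return distances

    def locality_signature(distances):
        # Histogram of all finite reuse distances.
        hist = defaultdict(int)
        for d in distances:
            if d is not None:
                hist[d] += 1
        return dict(hist)

    def lru_miss_rate(distances, cache_size):
        # Fully-associative LRU: an access misses iff its reuse distance
        # is at least the cache size; infinite distances always miss.
        misses = sum(1 for d in distances if d is None or d >= cache_size)
        return misses / len(distances)

    trace = list("abbcaab")                    # a hypothetical toy trace
    dists = reuse_distances(trace)             # [None, None, 0, None, 2, 0, 2]
    print(locality_signature(dists))           # {0: 2, 2: 2}
    print(lru_miss_rate(dists, cache_size=2))  # 5 of 7 accesses miss

The brute-force set construction makes each measurement cost proportional to
the distance itself; Section 2 is devoted to doing the same computation
asymptotically faster.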
At the program level, locality analysis is hampered by complex control flows
and data indirection. For example, pointer usage obscures the location of the
datum being accessed. With reuse distance, we can avoid the difficulty of code
analysis by directly examining the execution or, more accurately, the locality
aspect of the execution. Compilers may make local changes to a program, for
example, by unrolling a loop. Modern processors, likewise, may reorder instruc-
tions within a limited execution window. These transformations affect paral-
lelism but not cache locality. Such transformations are invisible to reuse
distance, since the number and the length of long reuse distances stay the
same with and without them. As a direct measure, reuse
distance is unaffected by coding and execution variations that do not affect
locality.
Furthermore, reuse distance makes it possible to correlate data usage
across training executions. Since a program may allocate different data (or
the same data in different locations) between runs, we cannot directly compare
data addresses, but we may find correlations in their reuse distances. More
importantly, we can partition memory accesses by decomposing the locality
signature into subcomponents with only short- or long-distance reuses. As we
shall see, programs often exhibit consistent patterns across inputs, at least in
some components. As a result, we can characterize whole-program locality by
defining common patterns and identifying program components that have these
patterns.
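A hypothetical sketch of such a decomposition (our own code, not the
article's): given a signature stored as a map from reuse distance to access
count, split it at a chosen threshold and analyze each component separately
across training runs.

    def split_signature(hist, threshold):
        # Partition a locality signature (reuse distance -> access count)
        # into a short-distance and a long-distance component.
        short = {d: c for d, c in hist.items() if d < threshold}
        long_ = {d: c for d, c in hist.items() if d >= threshold}
        return short, long_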
A major difficulty of training-based analysis is the immense size of execution
traces. Even a small program may produce a very long execution, since a modern
processor can execute billions of operations a second. Section 2 addresses the
problem of measuring reuse distance. We present two approximate algorithms:
one guarantees a relative precision and the other an absolute precision. Since
data may span the entire execution between uses, a solution must maintain
some representation of the trace history. The approximate solutions use a data
structure called a scale tree, in which each node represents a time range of
the trace. By properly adjusting these time ranges, an analyzer can examine
the trace and compute approximate reuse distance in effectively constant time
regardless of the length of the trace. Over the past four decades, there has been a
steady stream of solutions developed for the measurement problem. We review
the other solutions in Section 2.3 and present a new lower-bound result in
Section 2.4.
The key to modeling whole-program locality is prediction across program
inputs. Section 3 describes the prediction process, which first divides data ac-
cesses into groups, then identifies statistical patterns in each group, and finally
computes parameterized models that yield the least error. Pattern analysis is
assisted by the fact that reuse distance is always bounded and can change at
most as a linear function of the size of the data. We present five prediction meth-
ods assembled from different division schemes, pattern types, and statistical
equations. Two methods are single-model, which means that a locality compo-
nent, that is, a partition of memory accesses, has only one pattern. The other
three are multimodel, which means that multiple patterns may appear in the
same component. These offline models can be used in online prediction using a
technique called distance-based sampling.
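The sketch below illustrates the flavor of single-model fitting for one group
of reuses; the candidate pattern forms (constant, square-root, and linear in
the data size) and all names are our own illustrative assumptions rather than
the article's code. It picks the pattern whose growth between two training
inputs best matches the observed ratio of average reuse distances, then
extrapolates to a new input size.

    import math

    PATTERNS = {
        # Candidate pattern functions of the data size s; reuse distance
        # is bounded by the data size, so growth is at most linear.
        "constant": lambda s: 1.0,
        "sqrt":     lambda s: math.sqrt(s),
        "linear":   lambda s: s,
    }

    def fit_group(avg_d1, avg_d2, size1, size2):
        # Choose the pattern f whose ratio f(size2)/f(size1) is closest to
        # the observed ratio of average distances, then solve for the
        # coefficient c such that c * f(size1) = avg_d1.
        observed = avg_d2 / avg_d1
        name = min(PATTERNS, key=lambda p:
                   abs(PATTERNS[p](size2) / PATTERNS[p](size1) - observed))
        return name, avg_d1 / PATTERNS[name](size1)

    def predict(name, coeff, size):
        return coeff * PATTERNS[name](size)

    # Hypothetical training data for one group measured on two inputs.
    name, c = fit_group(avg_d1=2e3, avg_d2=2e4, size1=1e6, size2=1e8)
    print(name)                         # "sqrt": distances grew 10x for 100x data
    print(predict(name, c, size=1e10))  # extrapolation to a much larger input

A multimodel method, in contrast, would allow different subsets of the same
component to follow different pattern functions.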
The new techniques of approximate measurement and statistical prediction
are evaluated in Section 4 using real and artificial benchmarks. Section 4.1
compares eight analyzers and shows that approximate analysis is substantially
faster than previous techniques in measuring long reuse distances. Section 4.2
compares five prediction techniques and shows that most programs have pre-
dictable components, and the accuracy and efficiency of prediction increase with
additional training inputs and with multimodel prediction. On average, the lo-
cality in fifteen test programs can be predicted with 94% accuracy. Programs
that are difficult to predict include interpreters and scientific code with high-
dimension data. Interestingly, because reuse distance is execution-based, our
analyses can reveal similarities in inherent data usage among applications that
do not share code.
Our locality prediction techniques are examples of a broader approach we
call behavior-based program analysis. Conventional program analysis identi-
fies invariant properties by examining program code. Behavior analysis infers
common patterns by examining program executions. Section 5 discusses re-
lated work in locality analysis using program code and behavior metrics in-
cluding reuse distance, access frequency and data streams. Locality analysis
has numerous uses in performance modeling, program improvement, cache and
virtual memory management, and network caching. Section 6 presents a tax-
onomy that classifies the uses of reuse distance into five dimensions—program
code, data, input, time, and environment. Many of these uses may benefit from
the fast analysis and predictive modeling described in this article.
2. APPROXIMATE REUSE-DISTANCE MEASUREMENT
In our problem setup, a trace is a sequence of T accesses to N distinct data
items. A reuse-distance analyzer traverses the trace and measures the reuse
distance for each access. At each access, the analyzer finds the previous time the
data was accessed and counts the number of different data elements accessed in
between. To find the previous access, the analyzer assigns each access a logical
time and stores the last access time of each datum in a hash table. In the worst
case, the previous access may occur at the beginning of the trace, the
difference in access time is up to T − 1, and the reuse distance is up to
N − 1. In large applications, T can be over 100 billion, and N is often in
the tens of millions.

Fig. 2. An example illustrating the reuse-distance measurement. Part (a) shows a reuse distance.
Parts (b) and (c) show its measurement by the Bennett-Kruskal algorithm and the Olken algorithm.
Part (d) shows our approximate measurement with a guaranteed precision of 33%.
We use the example in Figure 2 to introduce two previous solutions and
then describe the basic idea for our solution. Part (a) shows an example trace.
Suppose we want to find the reuse distance between the two accesses of b at time
4 and 12. A solution has to store enough information about the trace history
before time 12. Bennett and Kruskal [1975] discovered that it is sufficient to
store only the last access of each datum, as shown in Part (b) for the example
trace. The reuse distance is measured by counting the number of last accesses,
which are stored in a bit vector, rather than by scanning the original trace.
The efficiency was improved by Olken [1981], who organized the last accesses
as nodes in a search tree keyed by their access time. The Olken-style tree for
the example trace has 7 nodes, one for the last access of each datum, as shown
in Figure 2(c). The reuse distance is measured by counting the number of nodes
whose key values are between 4 and 12. The counting can be done in a single
tree search, first finding the node with key value 4 and then backing up to the
root accumulating the subtree weights [Olken 1981]. Since the algorithm needs
one tree node for each data location, the search tree can grow to a significant
size when analyzing programs with a large amount of data.
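To make the counting concrete, the following self-contained Python sketch (our
stand-in, not the authors' implementation) keeps Bennett and Kruskal's
last-access flags in a Fenwick tree indexed by logical time, so the range
count that Olken performs with an augmented search tree takes O(log T) time
per access.

    class Fenwick:
        # Binary indexed tree over logical time; position t holds 1 iff the
        # access at time t is currently the last access of its datum.
        def __init__(self, n):
            self.tree = [0] * (n + 1)
        def add(self, i, delta):            # i is 1-based
            while i < len(self.tree):
                self.tree[i] += delta
                i += i & -i
        def prefix(self, i):                # sum of positions 1..i
            total = 0
            while i > 0:
                total += self.tree[i]
                i -= i & -i
            return total

    def measure(trace):
        # Exact reuse distances, one O(log T) range count per access.
        fen = Fenwick(len(trace))
        last = {}                           # datum -> time of its last access
        out = []
        for t, datum in enumerate(trace, start=1):
            if datum in last:
                p = last[datum]
                # Last accesses strictly between p and t are exactly the
                # distinct data touched since the previous access to datum.
                out.append(fen.prefix(t - 1) - fen.prefix(p))
                fen.add(p, -1)              # time p is no longer a last access
            else:
                out.append(None)            # infinite distance
            fen.add(t, 1)                   # t becomes datum's last access
            last[datum] = t
        return out

    print(measure(list("abbcaab")))         # [None, None, 0, None, 2, 0, 2]

As with Olken's tree, the number of live flags equals the number of distinct
data seen so far, which is exactly the cost that the approximate algorithms
below shrink.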
While it is costly to measure long reuse distances, we rarely need the exact
length. Often the first few digits suffice. For example, if a reuse distance is
about one million, it rarely matters whether the exact value is one million or
one million and one. Next we describe two approximate algorithms that extend
the Olken algorithm by adapting and trimming the search tree.
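The precision guarantee itself can be pictured without the tree machinery: if
a distance only needs to be accurate to within a relative error, exact values
can be collapsed into logarithmically spaced buckets. The sketch below is our
own illustration of that guarantee (the article's algorithms instead obtain it
by adapting time ranges inside the search tree); the default error of 33%
matches Figure 2(d).

    import math

    def approximate_distance(d, rel_err=0.33):
        # Map a distance into the bucket [(1+e)^k, (1+e)^(k+1)) and return
        # the bucket midpoint; the result differs from d by less than rel_err.
        if d is None or d == 0:
            return d
        base = 1.0 + rel_err
        k = math.floor(math.log(d, base))
        lo, hi = base ** k, base ** (k + 1)
        return (lo + hi) / 2

    print(approximate_distance(1_000_000) == approximate_distance(1_000_001))  # True

Merging adjacent time ranges under such a tolerance is, loosely, how the scale
tree of Section 2 keeps both its size and the per-access analysis cost bounded.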
The new algorithms guarantee two types of precision for the approximate
distance, d_approximate, compared to the actual distance, d_actual. In both
types, the

References
Proceedings ArticleDOI

A data locality optimizing algorithm

TL;DR: An algorithm that improves the locality of a loop nest by transforming the code via interchange, reversal, skewing and tiling is proposed, and is successful in optimizing codes such as matrix multiplication, successive over-relaxation, LU decomposition without pivoting, and Givens QR factorization.
Book

High-Performance Compilers for Parallel Computing

TL;DR: This book discusses Programming Language Features, Data Dependence, Dependence System Solvers, and Run-time Dependence Testing for High Performance Systems.
Journal ArticleDOI

Evaluation techniques for storage hierarchies

TL;DR: A new and efficient method of determining, in one pass of an address trace, performance measures for a large class of demand-paged, multilevel storage systems utilizing a variety of mapping schemes and replacement algorithms.
Journal ArticleDOI

Self-adjusting binary search trees

TL;DR: The splay tree, a self-adjusting form of binary search tree, is developed and analyzed and is found to be as efficient as balanced trees when total running time is the measure of interest.
Proceedings ArticleDOI

The space complexity of approximating the frequency moments

TL;DR: It turns out that the numbers F_0, F_1, and F_2 can be approximated in logarithmic space, whereas the approximation of F_k for k ≥ 6 requires n^Ω(1) space.
Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "Program locality analysis using reuse distance" ?

This article addresses the analysis problem at the program level, where the size of data and the locality of execution may change significantly depending on the input. The article presents two techniques that predict how the locality of a program changes with its input. 

Locality is considered a fundamental concept in computing because to understand a computation the authors must understand its use of data. 

Since one tree node is added for each access, the number of accesses between successive tree compressions is at least M/2.

Almasi et al. [2002] showed that by recording the empty regions instead of the last accesses in the trace, they could improve the efficiency of vector and tree based methods by 20% to 40%. 

Mattson et al. [1970] showed that buffer memory could be modeled as a stack, if the method of buffer management satisfied the inclusion property in that a smaller buffer would hold a subset of data held by a larger buffer. 

To measure only the cost of reuse-distance analysis, the hashing step is bypassed by pre-computing the last access time in all analyzers (except for KHW, which does not need the access time). 

Using more than two training inputs may produce a better prediction, because more data may reduce the noise from imprecise reuse distance measurement and histogram construction. 

For a group of reuse distances, the authors calculate the ratio of their average distance in two executions, d_i/d̂_i, and pick f_i to be the pattern function that is closest to d_i/d̂_i.

The consistency across inputs might be due to consistency in programmers’ coding style, for example, the distribution of function sizes. 

To satisfy the first requirement, the authors transfer the last accesses of c data elements from the precise trace to the approximate trace when the size of the precise trace exceeds 2c. 

Table VI shows the prediction accuracy when the size of the largest training run is reduced to 1.6%, 3%, and 13% of the size used previously in Table II.