
© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Modeling Performance Variation Due to Cache Sharing
Andreas Sandberg, Andreas Sembrant, Erik Hagersten and David Black-Schaffer
Uppsala University, Department of Information Technology
P.O. Box 337, SE-751 05 Uppsala, Sweden
{andreas.sandberg, andreas.sembrant, eh, david.black-schaffer}@it.uu.se
Abstract
Shared cache contention can cause significant variabil-
ity in the performance of co-running applications from run
to run. This variability arises from different overlappings of
the applications’ phases, which can be the result of offsets
in application start times or other delays in the system. Un-
derstanding this variability is important for generating an
accurate view of the expected impact of cache contention.
However, variability effects are typically ignored due to the
high overhead of modeling or simulating the many execu-
tions needed to expose them.
This paper introduces a method for efficiently investi-
gating the performance variability due to cache contention.
Our method relies on input data captured from native execu-
tion of applications running in isolation and a fast, phase-
aware, cache sharing performance model. This allows us
to assess the performance interactions and bandwidth de-
mands of co-running applications by quickly evaluating
hundreds of overlappings.
We evaluate our method on a contemporary multicore
machine and show that performance and bandwidth de-
mands can vary significantly across runs of the same set
of co-running applications. We show that our method can
predict application slowdown with an average relative error
of 0.41% (maximum 1.8%) as well as bandwidth consump-
tion. Using our method, we can estimate an application
pair’s performance variation 213× faster, on average, than
native execution.
1. Introduction
Shared caches in contemporary multicores have re-
peatedly been shown to be critical resources for perfor-
mance [15, 23, 28, 8, 17]. A significant amount of research
has investigated the impact of cache sharing on application
performance [23, 30, 12, 11]. However, most previous research provides a single value for the slowdown of an application pair due to cache sharing and ignores the variability that occurs across multiple runs. This variability occurs due to different overlappings of application phases that occur when they are offset in time. As the different phases have varying sensitivities to contention for the shared cache, the result is a wide range of slowdowns for the same application pair.

[Figure 1. Performance distribution (population [%] vs. slowdown [%]) for astar co-running together with bwaves on an Intel Xeon E5620 based system. Ignoring performance variability can be misleading, since the average (7.7%) hides the fact that the performance can vary between 1% and 17% depending on how the two applications' phases overlap.]
In multicore systems, there can be large performance
variations due to cache contention, since an applica-
tion’s performance depends on how its memory accesses
are interleaved with other applications’ memory accesses.
For example, when running astar/lakes and bwaves from
SPEC CPU2006, we observe an average slowdown of 8%
for astar compared to running it in isolation. However, the
slowdown can vary between 1% and 17% depending on
how the two applications’ phases overlap. Figure 1 shows
astar’s slowdown distribution based on 100 runs with dif-
ferent offsets in starting times. A developer assessing the
performance of these applications could draw the wrong
conclusions from a single run, or even a few runs, since
the probability of measuring a slowdown smaller than 2%

is more than 25%, while the average slowdown is almost
8% and the maximum slowdown is 17%.
In order to accurately estimate the performance of a
mixed workload, we need to run it multiple times and es-
timate its performance distribution. This is both a time- and
resource-consuming process. The distribution in Figure 1
took almost seven hours to generate; our method reproduces
the same performance distribution in less than 40 s.
To do this, we combine the cache sharing model pro-
posed by Sandberg et al. [16], the phase detection frame-
work developed by Sembrant et al. [19], and the co-
execution phase optimizations proposed by Van Bies-
brouck et al. [25]. This allows us to efficiently predict the
performance and bandwidth requirements of mixed work-
loads. In addition, the input data to the cache model is cap-
tured using low-overhead profiling [7] of each application
running in isolation. This means that only a small number
of profiling runs need to be done on the target machine. The
modeling can then be performed quickly for a large number
of mixed workloads and runs.
The main contributions of this paper are:
- An extension to a statistical cache-sharing model [16] to handle time-dependent execution phases.
- A fast and efficient method to predict the performance variations due to shared cache contention on modern hardware by combining a cache sharing model [16] with phase optimizations [19, 25].
- A comparison with previous cache-sharing methods [16] demonstrating a 2.78× improvement in accuracy (the relative error is reduced from 1.14% to 0.41%) and a 3.5× reduction in maximum error (from 6.3% to 1.8%).
- An analysis of how different types of phase behavior impact the performance variations in mixed workloads.
2. Putting it Together
Our method combines and extends three existing pieces
of infrastructure: a cache sharing model [16], a low-
overhead cache analysis tool [7], and a phase detection
framework [19]. In this section, we describe the different
pieces and how we extend them.
2.1. Cache Sharing
We use the cache sharing model proposed by Sand-
berg et al. [16] for cache modeling. It accurately predicts
the amount of cache used, CPI, and bandwidth demand for
an application in a mixed workload of co-executing single-
threaded applications. The input to the model is a set of
independent application profiles. These profiles contain in-
formation about how the miss rate (misses per cycle) and
hit rate (hits per cycle) vary for an application as a func-
tion of cache size. We use the Cache Pirating [7] technique
(discussed below) to capture the model’s input data.
The model conceptually partitions the cache into two
parts with different reuse behavior. The model keeps fre-
quently reused data safe from replacements, while less fre-
quently reused data shares the remaining cache space pro-
portionally to its application’s miss rate. The partitioning
between frequently reused data and infrequently reused data
is an application property that is cache size dependent (i.e.,
the partitioning depends on how much cache an application
receives). The model uses an iterative solver that first solves
cache sharing for the infrequently reused data and then up-
dates partitioning between frequently reused data and infre-
quently reused data.
The model, however, only works on phase-less applications where the average behavior is representative of the entire application. In practice, most applications have phases.
To handle this, we extend the model by slicing applications
into multiple small time windows. As long as the windows
are short enough, the model’s assumption of constant be-
havior holds within the window. We then apply the model
to a set of co-executing windows instead of data averaged
across the entire execution.
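To make the window-sliced use of the model concrete, the sketch below shows one way such per-window data and a proportional, miss-rate-driven cache allocation could be expressed. It is a heavily simplified illustration, not the actual two-part (frequently/infrequently reused) model from [16]; the names Window, interpolate, and shared_allocation are introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Window:
    """One sample window of an isolated-run profile.

    miss_rate and hit_rate map cache size (bytes) to misses/hits per cycle,
    as measured for this window at each cache-size step.
    """
    instructions: int
    miss_rate: Dict[int, float]   # cache size -> misses per cycle
    hit_rate: Dict[int, float]    # cache size -> hits per cycle

def interpolate(curve: Dict[int, float], size: float) -> float:
    """Piecewise-linear lookup in a size-indexed curve."""
    sizes = sorted(curve)
    if size <= sizes[0]:
        return curve[sizes[0]]
    for lo, hi in zip(sizes, sizes[1:]):
        if size <= hi:
            t = (size - lo) / (hi - lo)
            return curve[lo] + t * (curve[hi] - curve[lo])
    return curve[sizes[-1]]

def shared_allocation(windows: List[Window], cache_size: int,
                      iterations: int = 20) -> List[float]:
    """Fixed-point sketch: each co-running window receives shared cache
    space in proportion to its miss rate at its current allocation."""
    alloc = [cache_size / len(windows)] * len(windows)  # start from an even split
    for _ in range(iterations):
        misses = [interpolate(w.miss_rate, a) for w, a in zip(windows, alloc)]
        total = sum(misses) or 1.0
        alloc = [cache_size * m / total for m in misses]
    return alloc
```

A real implementation would additionally keep the frequently reused data protected from replacement, as described above.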
2.2. Cache Pirating
The input to the cache sharing model is an application
profile with information about cache miss rates and hit rates
as a function of cache size. Traditionally, such profiles have
been generated through simulation, but such an approach is
slow and it is difficult to build accurate simulators for mod-
ern processor pipelines and memory systems. Instead, we
use Cache Pirating [7] to collect the data. Cache Pirating
solves both problems by measuring how an application be-
haves as a function of cache size on the target machine with
very low overhead.
Cache Pirating uses hardware performance monitoring
facilities to measure target application properties at runtime,
such as cache misses, hits, and execution cycles. To mea-
sure this information for varying cache sizes, Cache Pirat-
ing co-runs a small cache intensive stress application with
the target application. The amount of cache available to the
target application is then varied by changing the cache foot-
print of the stress application. This allows Cache Pirating
to measure any performance metric exposed by the target
machine as a function of available cache size.
The cache pirate method produces average measurements for an entire application run. This is illustrated in Figure 2a, which shows CPI as a function of cache size for astar. The solid black line (Average) is the output produced with Cache Pirating.

[Figure 2. Performance (CPI) as a function of cache size as produced by Cache Pirating. Figure (a) shows the time-oblivious application average as a solid line. Figure (b) shows the time-dependent variability of the cache sensitivity and the phases identified by ScarPhase above (instances A1, B1, C1, A2, B2). The behavior of the three largest phases varies significantly from the average, as can be seen by the dashed lines in Figure (a).]
Just examining the average behavior can, however, be misleading, since most applications have time-dependent behavior. Figure 2b instead shows astar's CPI as a function of
both time and cache size. As seen in the figure, the applica-
tion displays three different phases of behavior: some parts
of the application execute with a very high CPI (phase A
& phase B), while other parts execute with a very low
CPI (phase C). This information is lost unless time is taken
into account.
In this paper, we extend the cache pirate method to produce time-dependent data by dividing the execution into sample windows, i.e., by sampling the performance counters at regular intervals.
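As an illustration of this windowing step, the sketch below aggregates a stream of periodic counter samples into fixed-size instruction windows. The Sample record and the chosen counters are assumptions made for the example; the actual set of events measured by Cache Pirating is whatever the target machine exposes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    """One periodic performance-counter readout (deltas since the last
    sample), taken while the cache pirate holds a given amount of cache."""
    instructions: int
    cycles: int
    l3_misses: int
    l3_hits: int

def to_windows(samples: List[Sample], window_insns: int = 100_000_000):
    """Aggregate counter samples into fixed-size instruction windows and
    report per-window CPI and miss/hit rates (per cycle). Window boundaries
    are only as precise as the sampling interval."""
    windows, acc = [], Sample(0, 0, 0, 0)
    for s in samples:
        acc = Sample(acc.instructions + s.instructions,
                     acc.cycles + s.cycles,
                     acc.l3_misses + s.l3_misses,
                     acc.l3_hits + s.l3_hits)
        if acc.instructions >= window_insns:
            windows.append({
                "cpi": acc.cycles / acc.instructions,
                "miss_rate": acc.l3_misses / acc.cycles,
                "hit_rate": acc.l3_hits / acc.cycles,
            })
            acc = Sample(0, 0, 0, 0)
    return windows
```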
2.3. Phase Detection
A naive approach to phase-aware cache modeling would
be to model the effect of every pair of measured input sam-
ple windows. However, to make the analysis more efficient,
we incorporate application phase information. This enables
us to analyze multiple sample windows with similar behav-
ior at the same time, which reduces the number of times we
need to invoke the cache sharing model.
We use the ScarPhase [19] library to detect and clas-
sify phases. ScarPhase is an execution-history based, low-
overhead (2%), online phase-detection library. It examines
the application’s execution path to detect hardware indepen-
dent phases [21, 14]. Such phases can be readily missed by
performance counter based phase detection, while changes
in executed code reflect changes in many different met-
rics [20, 21, 5, 22, 9, 18]. To leverage this, ScarPhase mon-
itors what code is executed by dividing the application into
windows and using hardware performance counters to sam-
ple which branches execute in a window. The address of
each branch is hashed into a vector of counters called a ba-
sic block vector (BBV) [20]. Each entry in the vector shows
how many times its corresponding branches were sampled
during the window. The vectors are then used to determine
phases by clustering them together using an online cluster-
ing algorithm [6]. Windows with similar vectors are then
grouped into the same cluster and considered to belong to
the same phase.
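The following sketch illustrates the general BBV-plus-online-clustering idea in simplified form; it is not ScarPhase's implementation (which samples branches in hardware and uses the clustering algorithm of [6]), and the hash, vector size, and distance threshold are arbitrary choices for the example.

```python
import math
from typing import List

def make_bbv(branch_addresses: List[int], dim: int = 32) -> List[float]:
    """Hash sampled branch addresses into a fixed-size basic block vector
    and normalize it so windows of different lengths are comparable."""
    vec = [0.0] * dim
    for addr in branch_addresses:
        vec[hash(addr) % dim] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

def classify(bbv: List[float], centroids: List[List[float]],
             threshold: float = 0.2) -> int:
    """Leader-follower clustering: assign the window to the nearest existing
    phase if it is close enough, otherwise start a new phase."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if centroids:
        best = min(range(len(centroids)), key=lambda i: dist(bbv, centroids[i]))
        if dist(bbv, centroids[best]) < threshold:
            return best
    centroids.append(list(bbv))
    return len(centroids) - 1
```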
The phases detected by ScarPhase can be seen in the top
bar in Figure 2b for astar, with the longest phases labeled.
This benchmark has three major phases: A, B, and C, all
with different cache behaviors. To highlight the differences
in CPI, we have plotted the average CPI of each phase in
Figure 2a. For example, phase A runs slower than C, since
it has a higher CPI. Phase B is more sensitive to cache-size
changes than phase A, since phase B's CPI decreases with
more cache.
The same phase can occur several times during execu-
tion. For example, phase A recurs two times, once in the
beginning and once at the end of the execution. We refer
to multiple repetitions of the same phase as instances of the
same phase, e.g., A1 and A2 in Figure 2b.
In addition, Figure 2b also demonstrates the limitation
of defining phases based on changes in hardware-specific
metrics. For example, the CPI is very similar from 325 to
390 billion instructions when using 12 MB of cache (the
gray rectangle), but clearly different when using less than
4 MB (the black rectangle). This difference is even more
noticeable in Figure 2a when comparing phase A and B. A
phase detection method looking at only the CPI would draw
the conclusion that phase A and B are the same phase when
the application receives 12 MB of cache, while in reality
they are two very different phases. It is therefore important

to find phases that are independent of the execution envi-
ronment (e.g., co-scheduling).
3. Time Dependent Cache Sharing
The key difficulty in modeling time-dependent cache
sharing is to determine which parts of the application (i.e.,
sample windows or phases) will co-execute. Since applications typically execute at different speeds depending on phase, we cannot simply use the ith sample windows for each application, since they may not overlap. For example, consider two applications with different execution rates (e.g., CPIs of 2 and 4), executing sample windows of 100 million instructions. The slower application, with a CPI of 4, will take twice as long to finish executing its sample windows as the one with a CPI of 2. Furthermore, when they share a cache they impact each other's execution rates.
Instead, we advance time as follows:
1. Determine the cache sharing using the model for the
current windows and the resulting CPI for each appli-
cation due to its shared cache allocation.
2. Advance the fastest application (i.e., the one with low-
est CPI) to its next sample window. The slower appli-
cations will not have had time to completely execute
their windows. To handle this, their windows are first
split into two smaller windows so that the first window ends at the same time as the fastest application's sample window. Finally, time is advanced to the beginning of the latter windows.
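A minimal sketch of this advance-and-split loop is shown below. The Slice type and the cpi_of callback are illustrative stand-ins: cpi_of represents an invocation of the cache sharing model for the currently overlapping windows.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass
class Slice:
    """A (possibly split) piece of a sample window, measured in instructions."""
    window_id: int
    instructions: float

def co_execute(apps: List[List[Slice]],
               cpi_of: Callable[[int, Slice], float]) -> List[float]:
    """Advance co-running applications window by window.

    cpi_of(app_index, slice) stands in for the cache sharing model: it
    returns the CPI of that application's current slice given the shared
    cache allocation. Returns the cycles each application has executed."""
    cursors = [0] * len(apps)               # current slice per application
    pending = [app[0] for app in apps]      # remaining part of current slice
    cycles = [0.0] * len(apps)
    while all(c < len(a) for c, a in zip(cursors, apps)):
        # 1. Ask the model for each application's CPI in the current overlap.
        cpis = [cpi_of(i, pending[i]) for i in range(len(apps))]
        # 2. The fastest application finishes its slice first; advance time by
        #    the cycles it needs and split the slower applications' slices.
        finish = [pending[i].instructions * cpis[i] for i in range(len(apps))]
        dt = min(finish)
        for i in range(len(apps)):
            cycles[i] += dt
            done = dt / cpis[i]             # instructions retired in dt cycles
            left = pending[i].instructions - done
            if left <= 0.5:                 # slice finished: move to next window
                cursors[i] += 1
                if cursors[i] < len(apps[i]):
                    pending[i] = apps[i][cursors[i]]
            else:                           # split: keep the unfinished remainder
                pending[i] = replace(pending[i], instructions=left)
    return cycles
```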
This means that the cache model is applied several times
per sample window, since each window is usually split at
least twice. For example, when modeling the slowdown
of astar co-executing together with bwaves, we invoke the
cache sharing model roughly 13 000 times while astar only
has 4 000 sample windows by itself.
We refer to the method described so far as the window-based method (Window) in the rest of the paper. In the rest of this section, we will introduce two more methods, the dynamic-window-based method (Dynamic Window) and the phase-based method (Phase), which both use phase information to improve performance by reducing the number of times the cache sharing model needs to be applied.¹
¹ The cache sharing model is implemented in Python and takes approximately 88 ms per invocation on our reference system (see Section 4.1).

3.1. Dynamic-Windows: Merging Sample-Windows

To improve performance we need to reduce the number of times the cache sharing model is invoked. To do this,
we merge multiple adjacent sample windows belonging to
the same phase into larger windows, a dynamic window.
For example, in astar (Figure 2), we consider all sample
windows in A1 as one unit (i.e., the average of the sample windows) instead of looking at every individual sample
window within the phase. Merging consecutive windows
within a phase assumes that the behavior is stable within that instance (i.e., all windows have similar behavior). This
is usually true and does not significantly affect the accuracy
of the method. However, compared to the window-based
method, it is dramatically faster. For example, when modeling astar running together with bwaves, we reduce the number of times the cache sharing model is used from 13 000 to 520, which leads to a 25× speedup over the window-based method.
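As a sketch of this merging step (under the assumption that each window carries a phase label and a miss-rate curve; the representation is chosen only for illustration):

```python
from typing import Dict, List, Tuple

def merge_dynamic_windows(
        windows: List[Tuple[str, int, Dict[int, float]]]
) -> List[Tuple[str, int, Dict[int, float]]]:
    """Collapse consecutive windows with the same phase label into one
    dynamic window whose curve is the instruction-weighted average.

    Each window is (phase, instructions, miss_rate_curve), where the curve
    maps cache size to misses per cycle."""
    merged: List[Tuple[str, int, Dict[int, float]]] = []
    for phase, insns, curve in windows:
        if merged and merged[-1][0] == phase:
            p, acc_insns, acc_curve = merged[-1]
            total = acc_insns + insns
            avg = {size: (acc_curve[size] * acc_insns + curve[size] * insns) / total
                   for size in curve}
            merged[-1] = (p, total, avg)
        else:
            merged.append((phase, insns, dict(curve)))
    return merged

# Example: three windows of phase A followed by two of phase B collapse into
# two dynamic windows.
# merge_dynamic_windows([("A", 100, {1: 0.2}), ("A", 100, {1: 0.4}),
#                        ("A", 100, {1: 0.3}), ("B", 100, {1: 0.1}),
#                        ("B", 100, {1: 0.1})])
```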
3.2. Phase: Reusing Cache-Sharing Results
The performance can be further improved by merging
the data for all instances of a phase. For example, when
considering astar (Figure 2), we consider all phase instances
of A (i.e., A1 + A2) as one unit. This makes the assumption
that all instances of the same phase have similar behavior
in an execution. This is not necessarily true for all appli-
cations (e.g., same function but different input data), but
works well in practice.
Looking at whole phases does not change the number of times we need to determine an application's cache sharing. It does, however, enable us to reuse cache sharing results for co-executing phases that reappear later [25]. For example, when astar's phase A1 co-executes with bwaves' phase B, we can save the cache sharing result, and later reuse the result if the second instance (A2) co-executes with bwaves' B.
In the example with astar and bwaves, we can reuse the
results from previous cache sharing solutions 380 times.
We therefore only need to run the cache sharing model
140 times. The performance of the phase-based method is
highly dependent on an application’s phase behavior, but it
normally leads to a speed-up of 2–10× over the dynamic-window method.
The main benefit of the phase-based method comes when determining the performance variability of a mix. In this case, the
same mix is run several times with slightly different offsets
in starting times. The same co-executing phases will usu-
ally reappear in different runs. For example, when modeling
100 different runs of astar and bwaves, we need to evaluate
1 400 000 co-executing windows, but with the phase-based
method we only need to run the model 939 times.
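One way to realize this reuse is a simple memoization table keyed by the tuple of co-executing phase labels, sketched below. The class and its interface are illustrative only; the value stored is whatever the cache sharing model returns for that phase combination.

```python
from typing import Callable, Dict, Tuple

class PhasePairCache:
    """Memoize cache sharing results keyed by the co-executing phases.

    The key is the tuple of phase labels (one per application); the value
    is the model's result (e.g., per-application CPI and bandwidth)."""
    def __init__(self, model: Callable[..., Tuple[float, ...]]):
        self.model = model
        self.results: Dict[Tuple[str, ...], Tuple[float, ...]] = {}
        self.hits = 0
        self.misses = 0

    def solve(self, phases: Tuple[str, ...], *profiles):
        if phases in self.results:
            self.hits += 1                  # reuse an earlier solution
        else:
            self.misses += 1                # first time this overlap is seen
            self.results[phases] = self.model(*profiles)
        return self.results[phases]
```

With such a table, repeated overlaps of the same phases, within one run or across the many offset runs of a mix, only pay the model cost on their first occurrence.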
In addition to reducing the number of model invocations,
using phases reduces the amount of data needed to run the
model. Instead of storing a profile per sample window, all
sample windows in one phase can be merged. This typically

leads to a 100–1000× size reduction in input data. For example, bwaves, which is a long running benchmark with a large profile, reduces its profile size from 57 MB to 82 kB.

[Figure 3. Bandwidth usage (GB/s over time) across the whole execution of our six benchmark applications, including the four interference applications: (a) Single-Phase (omnetpp), (b) Dual-Phase (bwaves), (c) Target 1 (bzip2/chicken), (d) Few-Phase (astar/lakes), (e) Multi-Phase (mcf), (f) Target 2 (gcc/166). Detected phases are shown above. The Single-Phase, Dual-Phase, Few-Phase, and Multi-Phase behavior is clearly visible for the interference applications.]
4. Evaluation
To evaluate our method we compare the overhead and the
accuracy against results measured on real hardware. We ran
each target application together with an interference appli-
cation and measured the behavior of the target application.
In order to measure the performance variability, we started
the applications with an offset by first starting the interfer-
ence application and then waiting for it to execute a prede-
fined number of instructions before starting the target. We
then restarted the interference application if it terminated
before the target.
In order to get an accurate representation of the perfor-
mance, we ran each experiment (target-interference pair)
100 times with random start offsets for the target. We used
the same starting time offsets for both the hardware refer-
ence runs and for the modeled runs.
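A driver for such an experiment can be organized as in the sketch below: the random offsets are drawn once and the same offsets are fed to both the native runs and the modeled runs. The run_native and run_model callbacks are placeholders standing in for the two measurement paths.

```python
import random
from typing import Callable, List

def offset_experiments(n_runs: int, max_offset_insns: int,
                       run_native: Callable[[int], float],
                       run_model: Callable[[int], float],
                       seed: int = 0) -> List[dict]:
    """Evaluate one target/interference pair at n_runs random start offsets.

    run_native(offset) and run_model(offset) are placeholders that return
    the target's slowdown for a given offset (in interference instructions);
    the same offsets are used for both so the distributions are comparable."""
    rng = random.Random(seed)
    offsets = [rng.randrange(max_offset_insns) for _ in range(n_runs)]
    return [{"offset": off,
             "native_slowdown": run_native(off),
             "modeled_slowdown": run_model(off)}
            for off in offsets]
```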
4.1. Experimental Setup
We ran the experiments on a 2.4 GHz Intel Xeon E5620
system (Westmere) with 4 cores and 3 × 2 GB memory dis-
tributed across 3 DDR3 channels. Each core has a private
32 kB L1 data cache and a private 256 kB L2 cache. All
four cores share a 12 MB 16-way L3 cache with a pseudo-
LRU replacement policy.
The cache sharing model requires information about ap-
plication fetch rate, access rate and hit rate as a function
of cache size and time. We measured cache-size dependent
data using cache pirating in 16 steps of 768 kB (the equiv-
alent of one way) up to 12 MB, and used a sample window
size of 100 million instructions.
4.2. Benchmark Selection
In order to see how time-dependent phase behavior af-
fects cache sharing and performance, we selected bench-
marks from SPEC CPU2006 with interesting phase be-
havior. In addition to interesting phase behavior, we also
wanted to select applications that make significant use of
the shared L3 cache. For our evaluation, we selected four
interference benchmarks that represent four different phase
behaviors: Single-Phase (omnetpp), Dual-Phase (bwaves),
Few-Phase (astar/lakes) and Multi-Phase (mcf).
Figure 3 shows the interference applications’ bandwidth
usage (high bandwidth indicates significant use of the
shared L3 cache), and the detected phases. In addition to
the interference benchmarks, we selected two more bench-
marks, gcc/166 and bzip2/chicken, that we only use as tar-
gets. These benchmarks have a lower average bandwidth
usage than the interference benchmarks, but they are still
sensitive to cache contention. For the evaluation, we ran all
combinations of the six applications as targets vs. each of
the four interference applications.
