
© 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Modeling Performance Variation Due to Cache Sharing
Andreas Sandberg, Andreas Sembrant, Erik Hagersten and David Black-Schaffer
Uppsala University, Department of Information Technology
P.O. Box 337, SE-751 05 Uppsala, Sweden
{andreas.sandberg, andreas.sembrant, eh, david.black-schaffer}@it.uu.se
Abstract
Shared cache contention can cause significant variabil-
ity in the performance of co-running applications from run
to run. This variability arises from different overlappings of
the applications’ phases, which can be the result of offsets
in application start times or other delays in the system. Un-
derstanding this variability is important for generating an
accurate view of the expected impact of cache contention.
However, variability effects are typically ignored due to the
high overhead of modeling or simulating the many execu-
tions needed to expose them.
This paper introduces a method for efficiently investi-
gating the performance variability due to cache contention.
Our method relies on input data captured from native execu-
tion of applications running in isolation and a fast, phase-
aware, cache sharing performance model. This allows us
to assess the performance interactions and bandwidth de-
mands of co-running applications by quickly evaluating
hundreds of overlappings.
We evaluate our method on a contemporary multicore
machine and show that performance and bandwidth de-
mands can vary significantly across runs of the same set
of co-running applications. We show that our method can
predict application slowdown with an average relative error
of 0.41% (maximum 1.8%) as well as bandwidth consump-
tion. Using our method, we can estimate an application
pair’s performance variation 213× faster, on average, than
native execution.
1. Introduction
Shared caches in contemporary multicores have re-
peatedly been shown to be critical resources for perfor-
mance [15, 23, 28, 8, 17]. A significant amount of research
has investigated the impact of cache sharing on application
performance [23, 30, 12, 11]. However, most previous research provides a single value for the slowdown of an application pair due to cache sharing and ignores the variability that occurs across multiple runs. This variability occurs due to different overlappings of application phases that occur when they are offset in time. As the different phases have varying sensitivities to contention for the shared cache, the result is a wide range of slowdowns for the same application pair.

[Figure 1. Performance distribution (population [%] vs. slowdown [%]) for astar co-running together with bwaves on an Intel Xeon E5620 based system. Ignoring performance variability can be misleading, since the average (7.7%) hides the fact that the performance can vary between 1% and 17% depending on how the two applications' phases overlap.]
In multicore systems, there can be large performance
variations due to cache contention, since an applica-
tion’s performance depends on how its memory accesses
are interleaved with other applications’ memory accesses.
For example, when running astar/lakes and bwaves from
SPEC CPU2006, we observe an average slowdown of 8%
for astar compared to running it in isolation. However, the
slowdown can vary between 1% and 17% depending on
how the two applications’ phases overlap. Figure 1 shows
astar’s slowdown distribution based on 100 runs with dif-
ferent offsets in starting times. A developer assessing the
performance of these applications could draw the wrong
conclusions from a single run, or even a few runs, since
the probability of measuring a slowdown smaller than 2%

is more than 25%, while the average slowdown is almost
8% and the maximum slowdown is 17%.
In order to accurately estimate the performance of a
mixed workload, we need to run it multiple times and es-
timate its performance distribution. This is both a time- and
resource-consuming process. The distribution in Figure 1
took almost seven hours to generate; our method reproduces
the same performance distribution in less than 40 s.
To do this, we combine the cache sharing model pro-
posed by Sandberg et al. [16], the phase detection frame-
work developed by Sembrant et al. [19], and the co-
execution phase optimizations proposed by Van Bies-
brouck et al. [25]. This allows us to efficiently predict the
performance and bandwidth requirements of mixed work-
loads. In addition, the input data to the cache model is cap-
tured using low-overhead profiling [7] of each application
running in isolation. This means that only a small number
of profiling runs need to be done on the target machine. The
modeling can then be performed quickly for a large number
of mixed workloads and runs.
The main contributions of this paper are:
- An extension to a statistical cache-sharing model [16] to handle time-dependent execution phases.
- A fast and efficient method to predict the performance variations due to shared cache contention on modern hardware by combining a cache sharing model [16] with phase optimizations [19, 25].
- A comparison with previous cache-sharing methods [16] demonstrating a 2.78× improvement in accuracy (the relative error is reduced from 1.14% to 0.41%) and a 3.5× reduction in maximum error (from 6.3% to 1.8%).
- An analysis of how different types of phase behavior impact the performance variations in mixed workloads.
2. Putting it Together
Our method combines and extends three existing pieces
of infrastructure: a cache sharing model [16], a low-
overhead cache analysis tool [7], and a phase detection
framework [19]. In this section, we describe the different
pieces and how we extend them.
2.1. Cache Sharing
We use the cache sharing model proposed by Sand-
berg et al. [16] for cache modeling. It accurately predicts
the amount of cache used, CPI, and bandwidth demand for
an application in a mixed workload of co-executing single-
threaded applications. The input to the model is a set of
independent application profiles. These profiles contain in-
formation about how the miss rate (misses per cycle) and
hit rate (hits per cycle) vary for an application as a func-
tion of cache size. We use the Cache Pirating [7] technique
(discussed below) to capture the model’s input data.
The model conceptually partitions the cache into two
parts with different reuse behavior. The model keeps fre-
quently reused data safe from replacements, while less fre-
quently reused data shares the remaining cache space pro-
portionally to its application’s miss rate. The partitioning
between frequently reused data and infrequently reused data
is an application property that is cache size dependent (i.e.,
the partitioning depends on how much cache an application
receives). The model uses an iterative solver that first solves
cache sharing for the infrequently reused data and then up-
dates partitioning between frequently reused data and infre-
quently reused data.
The model, however, only works on phase-less applications where the average behavior is representative of the entire application. In practice, most applications have phases.
To handle this, we extend the model by slicing applications
into multiple small time windows. As long as the windows
are short enough, the model’s assumption of constant be-
havior holds within the window. We then apply the model
to a set of co-executing windows instead of data averaged
across the entire execution.
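To make the window-sliced use of the model concrete, the sketch below shows one way such per-window data and a proportional, miss-rate-driven cache allocation could be expressed. It is a heavily simplified illustration, not the actual two-part (frequently/infrequently reused) model from [16]; the names Window, interpolate, and shared_allocation are introduced only for this example.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Window:
    """One sample window of an isolated-run profile.

    miss_rate and hit_rate map cache size (bytes) to misses/hits per cycle,
    as measured for this window at each cache-size step.
    """
    instructions: int
    miss_rate: Dict[int, float]   # cache size -> misses per cycle
    hit_rate: Dict[int, float]    # cache size -> hits per cycle

def interpolate(curve: Dict[int, float], size: float) -> float:
    """Piecewise-linear lookup in a size-indexed curve."""
    sizes = sorted(curve)
    if size <= sizes[0]:
        return curve[sizes[0]]
    for lo, hi in zip(sizes, sizes[1:]):
        if size <= hi:
            t = (size - lo) / (hi - lo)
            return curve[lo] + t * (curve[hi] - curve[lo])
    return curve[sizes[-1]]

def shared_allocation(windows: List[Window], cache_size: int,
                      iterations: int = 20) -> List[float]:
    """Fixed-point sketch: each co-running window receives shared cache
    space in proportion to its miss rate at its current allocation."""
    alloc = [cache_size / len(windows)] * len(windows)  # start from an even split
    for _ in range(iterations):
        misses = [interpolate(w.miss_rate, a) for w, a in zip(windows, alloc)]
        total = sum(misses) or 1.0
        alloc = [cache_size * m / total for m in misses]
    return alloc
```

A real implementation would additionally keep the frequently reused data protected from replacement, as described above.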
2.2. Cache Pirating
The input to the cache sharing model is an application
profile with information about cache miss rates and hit rates
as a function of cache size. Traditionally, such profiles have
been generated through simulation, but such an approach is
slow and it is difficult to build accurate simulators for mod-
ern processor pipelines and memory systems. Instead, we
use Cache Pirating [7] to collect the data. Cache Pirating
solves both problems by measuring how an application be-
haves as a function of cache size on the target machine with
very low overhead.
Cache Pirating uses hardware performance monitoring
facilities to measure target application properties at runtime,
such as cache misses, hits, and execution cycles. To mea-
sure this information for varying cache sizes, Cache Pirat-
ing co-runs a small cache intensive stress application with
the target application. The amount of cache available to the
target application is then varied by changing the cache foot-
print of the stress application. This allows Cache Pirating
to measure any performance metric exposed by the target
machine as a function of available cache size.
The cache pirate method produces average measurements for an entire application run. This is illustrated in Figure 2a, which shows CPI as a function of cache size for astar. The solid black line (Average) is the output produced with Cache Pirating.

[Figure 2. Performance (CPI) as a function of cache size as produced by Cache Pirating. Figure (a) shows the time-oblivious application average as a solid line. Figure (b) shows the time-dependent variability of the cache sensitivity and the phases identified by ScarPhase above (instances A1, B1, C1, A2, B2). The behavior of the three largest phases varies significantly from the average, as can be seen by the dashed lines in Figure (a).]
Just examining the average behavior can, however, be misleading, since most applications have time-dependent behavior. Figure 2b instead shows astar's CPI as a function of
both time and cache size. As seen in the figure, the applica-
tion displays three different phases of behavior: some parts
of the application execute with a very high CPI (phase A
& phase B), while other parts execute with a very low
CPI (phase C). This information is lost unless time is taken
into account.
In this paper, we extend the cache pirate method to produce time-dependent data by dividing the execution into sample windows, i.e., by sampling the performance counters at regular intervals.
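As an illustration of this windowing step, the sketch below aggregates a stream of periodic counter samples into fixed-size instruction windows. The Sample record and the chosen counters are assumptions made for the example; the actual set of events measured by Cache Pirating is whatever the target machine exposes.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Sample:
    """One periodic performance-counter readout (deltas since the last
    sample), taken while the cache pirate holds a given amount of cache."""
    instructions: int
    cycles: int
    l3_misses: int
    l3_hits: int

def to_windows(samples: List[Sample], window_insns: int = 100_000_000):
    """Aggregate counter samples into fixed-size instruction windows and
    report per-window CPI and miss/hit rates (per cycle). Window boundaries
    are only as precise as the sampling interval."""
    windows, acc = [], Sample(0, 0, 0, 0)
    for s in samples:
        acc = Sample(acc.instructions + s.instructions,
                     acc.cycles + s.cycles,
                     acc.l3_misses + s.l3_misses,
                     acc.l3_hits + s.l3_hits)
        if acc.instructions >= window_insns:
            windows.append({
                "cpi": acc.cycles / acc.instructions,
                "miss_rate": acc.l3_misses / acc.cycles,
                "hit_rate": acc.l3_hits / acc.cycles,
            })
            acc = Sample(0, 0, 0, 0)
    return windows
```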
2.3. Phase Detection
A naive approach to phase-aware cache modeling would
be to model the effect of every pair of measured input sam-
ple windows. However, to make the analysis more efficient,
we incorporate application phase information. This enables
us to analyze multiple sample windows with similar behav-
ior at the same time, which reduces the number of times we
need to invoke the cache sharing model.
We use the ScarPhase [19] library to detect and clas-
sify phases. ScarPhase is an execution-history based, low-
overhead (2%), online phase-detection library. It examines
the application’s execution path to detect hardware indepen-
dent phases [21, 14]. Such phases can be readily missed by
performance counter based phase detection, while changes
in executed code reflect changes in many different met-
rics [20, 21, 5, 22, 9, 18]. To leverage this, ScarPhase mon-
itors what code is executed by dividing the application into
windows and using hardware performance counters to sam-
ple which branches execute in a window. The address of
each branch is hashed into a vector of counters called a ba-
sic block vector (BBV) [20]. Each entry in the vector shows
how many times its corresponding branches were sampled
during the window. The vectors are then used to determine
phases by clustering them together using an online cluster-
ing algorithm [6]. Windows with similar vectors are then
grouped into the same cluster and considered to belong to
the same phase.
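The following sketch illustrates the general BBV-plus-online-clustering idea in simplified form; it is not ScarPhase's implementation (which samples branches in hardware and uses the clustering algorithm of [6]), and the hash, vector size, and distance threshold are arbitrary choices for the example.

```python
import math
from typing import List

def make_bbv(branch_addresses: List[int], dim: int = 32) -> List[float]:
    """Hash sampled branch addresses into a fixed-size basic block vector
    and normalize it so windows of different lengths are comparable."""
    vec = [0.0] * dim
    for addr in branch_addresses:
        vec[hash(addr) % dim] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]

def classify(bbv: List[float], centroids: List[List[float]],
             threshold: float = 0.2) -> int:
    """Leader-follower clustering: assign the window to the nearest existing
    phase if it is close enough, otherwise start a new phase."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    if centroids:
        best = min(range(len(centroids)), key=lambda i: dist(bbv, centroids[i]))
        if dist(bbv, centroids[best]) < threshold:
            return best
    centroids.append(list(bbv))
    return len(centroids) - 1
```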
The phases detected by ScarPhase can be seen in the top
bar in Figure 2b for astar, with the longest phases labeled.
This benchmark has three major phases: A, B, and C, all
with different cache behaviors. To highlight the differences
in CPI, we have plotted the average CPI of each phase in
Figure 2a. For example, phase A runs slower than C, since
it has a higher CPI. Phase B is more sensitive to cache-size
changes than phase A, since phase B's CPI decreases with
more cache.
The same phase can occur several times during execu-
tion. For example, phase A recurs two times, once in the
beginning and once at the end of the execution. We refer
to multiple repetitions of the same phase as instances of the
same phase, e.g., A1 and A2 in Figure 2b.
In addition, Figure 2b also demonstrates the limitation
of defining phases based on changes in hardware-specific
metrics. For example, the CPI is very similar from 325 to
390 billion instructions when using 12 MB of cache (the
gray rectangle), but clearly different when using less than
4 MB (the black rectangle). This difference is even more
noticeable in Figure 2a when comparing phase A and B. A
phase detection method looking at only the CPI would draw
the conclusion that phase A and B are the same phase when
the application receives 12 MB of cache, while in reality
they are two very different phases. It is therefore important

to find phases that are independent of the execution envi-
ronment (e.g., co-scheduling).
3. Time Dependent Cache Sharing
The key difficulty in modeling time-dependent cache
sharing is to determine which parts of the application (i.e.,
sample windows or phases) will co-execute. Since applications typically execute at different speeds depending on phase, we cannot simply use the ith sample windows for each application, since they may not overlap. For example, consider two applications with different execution rates (e.g., CPIs of 2 and 4), executing sample windows of 100 million instructions. The slower application, with a CPI of 4, will take twice as long to finish executing its sample windows as the one with a CPI of 2. Furthermore, when they share a cache they impact each other's execution rates.
Instead, we advance time as follows:
1. Determine the cache sharing using the model for the
current windows and the resulting CPI for each appli-
cation due to its shared cache allocation.
2. Advance the fastest application (i.e., the one with low-
est CPI) to its next sample window. The slower appli-
cations will not have had time to completely execute
their windows. To handle this, their windows are first
split into two smaller windows so that the first window ends at the same time as the fastest application's sample window. Finally, time is advanced to the beginning of the latter windows.
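A minimal sketch of this advance-and-split loop is shown below. The Slice type and the cpi_of callback are illustrative stand-ins: cpi_of represents an invocation of the cache sharing model for the currently overlapping windows.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

@dataclass
class Slice:
    """A (possibly split) piece of a sample window, measured in instructions."""
    window_id: int
    instructions: float

def co_execute(apps: List[List[Slice]],
               cpi_of: Callable[[int, Slice], float]) -> List[float]:
    """Advance co-running applications window by window.

    cpi_of(app_index, slice) stands in for the cache sharing model: it
    returns the CPI of that application's current slice given the shared
    cache allocation. Returns the cycles each application has executed."""
    cursors = [0] * len(apps)               # current slice per application
    pending = [app[0] for app in apps]      # remaining part of current slice
    cycles = [0.0] * len(apps)
    while all(c < len(a) for c, a in zip(cursors, apps)):
        # 1. Ask the model for each application's CPI in the current overlap.
        cpis = [cpi_of(i, pending[i]) for i in range(len(apps))]
        # 2. The fastest application finishes its slice first; advance time by
        #    the cycles it needs and split the slower applications' slices.
        finish = [pending[i].instructions * cpis[i] for i in range(len(apps))]
        dt = min(finish)
        for i in range(len(apps)):
            cycles[i] += dt
            done = dt / cpis[i]             # instructions retired in dt cycles
            left = pending[i].instructions - done
            if left <= 0.5:                 # slice finished: move to next window
                cursors[i] += 1
                if cursors[i] < len(apps[i]):
                    pending[i] = apps[i][cursors[i]]
            else:                           # split: keep the unfinished remainder
                pending[i] = replace(pending[i], instructions=left)
    return cycles
```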
This means that the cache model is applied several times
per sample window, since each window is usually split at
least twice. For example, when modeling the slowdown
of astar co-executing together with bwaves, we invoke the
cache sharing model roughly 13 000 times while astar only
has 4 000 sample windows by itself.
We refer to the method described so far as the window-based method (Window) in the rest of the paper. In the rest of this section, we will introduce two more methods, the dynamic-window-based method (Dynamic Window) and the phase-based method (Phase), which both use phase information to improve performance by reducing the number of times the cache sharing model needs to be applied.¹
¹ The cache sharing model is implemented in Python and takes approximately 88 ms per invocation on our reference system (see Section 4.1).

3.1. Dynamic-Windows: Merging Sample-Windows

To improve performance we need to reduce the number of times the cache sharing model is invoked. To do this,
we merge multiple adjacent sample windows belonging to
the same phase into larger windows, a dynamic window.
For example, in astar (Figure 2), we consider all sample
windows in A1 as one unit (i.e., the average of the sample windows) instead of looking at every individual sample
window within the phase. Merging consecutive windows
within a phase assumes that the behavior is stable within that instance (i.e., all windows have similar behavior). This
is usually true and does not significantly affect the accuracy
of the method. However, compared to the window-based
method, it is dramatically faster. For example, when modeling astar running together with bwaves, we reduce the number of times the cache sharing model is used from 13 000 to 520, which leads to a 25× speedup over the window-based method.
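As a sketch of this merging step (under the assumption that each window carries a phase label and a miss-rate curve; the representation is chosen only for illustration):

```python
from typing import Dict, List, Tuple

def merge_dynamic_windows(
        windows: List[Tuple[str, int, Dict[int, float]]]
) -> List[Tuple[str, int, Dict[int, float]]]:
    """Collapse consecutive windows with the same phase label into one
    dynamic window whose curve is the instruction-weighted average.

    Each window is (phase, instructions, miss_rate_curve), where the curve
    maps cache size to misses per cycle."""
    merged: List[Tuple[str, int, Dict[int, float]]] = []
    for phase, insns, curve in windows:
        if merged and merged[-1][0] == phase:
            p, acc_insns, acc_curve = merged[-1]
            total = acc_insns + insns
            avg = {size: (acc_curve[size] * acc_insns + curve[size] * insns) / total
                   for size in curve}
            merged[-1] = (p, total, avg)
        else:
            merged.append((phase, insns, dict(curve)))
    return merged

# Example: three windows of phase A followed by two of phase B collapse into
# two dynamic windows.
# merge_dynamic_windows([("A", 100, {1: 0.2}), ("A", 100, {1: 0.4}),
#                        ("A", 100, {1: 0.3}), ("B", 100, {1: 0.1}),
#                        ("B", 100, {1: 0.1})])
```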
3.2. Phase: Reusing Cache-Sharing Results
The performance can be further improved by merging
the data for all instances of a phase. For example, when
considering astar (Figure 2), we consider all phase instances
of A (i.e., A1 + A2) as one unit. This makes the assumption
that all instances of the same phase have similar behavior
in an execution. This is not necessarily true for all appli-
cations (e.g., same function but different input data), but
works well in practice.
Looking at whole phases does not change the number of times we need to determine an application's cache sharing. It does, however, enable us to reuse cache sharing results for co-executing phases that reappear later [25]. For example, when astar's phase A1 co-executes with bwaves' phase B, we can save the cache sharing result, and later reuse the result if the second instance (A2) co-executes with bwaves' B.
In the example with astar and bwaves, we can reuse the
results from previous cache sharing solutions 380 times.
We therefore only need to run the cache sharing model
140 times. The performance of the phase-based method is
highly dependent on an application’s phase behavior, but it
normally leads to a speed-up of 2–10× over the dynamic-window method.
The main benefit of the phase-based method comes when determining the performance variability of a mix. In this case, the
same mix is run several times with slightly different offsets
in starting times. The same co-executing phases will usu-
ally reappear in different runs. For example, when modeling
100 different runs of astar and bwaves, we need to evaluate
1 400 000 co-executing windows, but with the phase-based
method we only need to run the model 939 times.
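One way to realize this reuse is a simple memoization table keyed by the tuple of co-executing phase labels, sketched below. The class and its interface are illustrative only; the value stored is whatever the cache sharing model returns for that phase combination.

```python
from typing import Callable, Dict, Tuple

class PhasePairCache:
    """Memoize cache sharing results keyed by the co-executing phases.

    The key is the tuple of phase labels (one per application); the value
    is the model's result (e.g., per-application CPI and bandwidth)."""
    def __init__(self, model: Callable[..., Tuple[float, ...]]):
        self.model = model
        self.results: Dict[Tuple[str, ...], Tuple[float, ...]] = {}
        self.hits = 0
        self.misses = 0

    def solve(self, phases: Tuple[str, ...], *profiles):
        if phases in self.results:
            self.hits += 1                  # reuse an earlier solution
        else:
            self.misses += 1                # first time this overlap is seen
            self.results[phases] = self.model(*profiles)
        return self.results[phases]
```

With such a table, repeated overlaps of the same phases, within one run or across the many offset runs of a mix, only pay the model cost on their first occurrence.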
In addition to reducing the number of model invocations,
using phases reduces the amount of data needed to run the
model. Instead of storing a profile per sample window, all
sample windows in one phase can be merged. This typically

leads to a 100–1000× size reduction in input data. For example, bwaves, which is a long running benchmark with a large profile, reduces its profile size from 57 MB to 82 kB.

[Figure 3. Bandwidth usage (GB/s over time) across the whole execution of our six benchmark applications, including the four interference applications: (a) Single-Phase (omnetpp), (b) Dual-Phase (bwaves), (c) Target 1 (bzip2/chicken), (d) Few-Phase (astar/lakes), (e) Multi-Phase (mcf), (f) Target 2 (gcc/166). Detected phases are shown above. The Single-Phase, Dual-Phase, Few-Phase, and Multi-Phase behavior is clearly visible for the interference applications.]
4. Evaluation
To evaluate our method we compare the overhead and the
accuracy against results measured on real hardware. We ran
each target application together with an interference appli-
cation and measured the behavior of the target application.
In order to measure the performance variability, we started
the applications with an offset by first starting the interfer-
ence application and then waiting for it to execute a prede-
fined number of instructions before starting the target. We
then restarted the interference application if it terminated
before the target.
In order to get an accurate representation of the perfor-
mance, we ran each experiment (target-interference pair)
100 times with random start offsets for the target. We used
the same starting time offsets for both the hardware refer-
ence runs and for the modeled runs.
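A driver for such an experiment can be organized as in the sketch below: the random offsets are drawn once and the same offsets are fed to both the native runs and the modeled runs. The run_native and run_model callbacks are placeholders standing in for the two measurement paths.

```python
import random
from typing import Callable, List

def offset_experiments(n_runs: int, max_offset_insns: int,
                       run_native: Callable[[int], float],
                       run_model: Callable[[int], float],
                       seed: int = 0) -> List[dict]:
    """Evaluate one target/interference pair at n_runs random start offsets.

    run_native(offset) and run_model(offset) are placeholders that return
    the target's slowdown for a given offset (in interference instructions);
    the same offsets are used for both so the distributions are comparable."""
    rng = random.Random(seed)
    offsets = [rng.randrange(max_offset_insns) for _ in range(n_runs)]
    return [{"offset": off,
             "native_slowdown": run_native(off),
             "modeled_slowdown": run_model(off)}
            for off in offsets]
```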
4.1. Experimental Setup
We ran the experiments on a 2.4 GHz Intel Xeon E5620
system (Westmere) with 4 cores and 3 × 2 GB memory dis-
tributed across 3 DDR3 channels. Each core has a private
32 kB L1 data cache and a private 256 kB L2 cache. All
four cores share a 12 MB 16-way L3 cache with a pseudo-
LRU replacement policy.
The cache sharing model requires information about ap-
plication fetch rate, access rate and hit rate as a function
of cache size and time. We measured cache-size dependent
data using cache pirating in 16 steps of 768 kB (the equiv-
alent of one way) up to 12 MB, and used a sample window
size of 100 million instructions.
4.2. Benchmark Selection
In order to see how time-dependent phase behavior af-
fects cache sharing and performance, we selected bench-
marks from SPEC CPU2006 with interesting phase be-
havior. In addition to interesting phase behavior, we also
wanted to select applications that make significant use of
the shared L3 cache. For our evaluation, we selected four
interference benchmarks that represent four different phase
behaviors: Single-Phase (omnetpp), Dual-Phase (bwaves),
Few-Phase (astar/lakes) and Multi-Phase (mcf).
Figure 3 shows the interference applications’ bandwidth
usage (high bandwidth indicates significant use of the
shared L3 cache), and the detected phases. In addition to
the interference benchmarks, we selected two more bench-
marks, gcc/166 and bzip2/chicken, that we only use as tar-
gets. These benchmarks have a lower average bandwidth
usage than the interference benchmarks, but they are still
sensitive to cache contention. For the evaluation, we ran all
combinations of the six applications as targets vs. each of
the four interference applications.
