
Characterizing the Influence of System Noise on
Large-Scale Applications by Simulation
Torsten Hoefler
University of Illinois at Urbana-Champaign
Urbana IL 61801, USA
htor@illinois.edu
Timo Schneider and Andrew Lumsdaine
Indiana University
Bloomington IN 47405, USA
{timoschn,lums}@cs.indiana.edu
Abstract—This paper presents an in-depth analysis of the impact of system noise on large-scale parallel application performance in realistic settings. Our analytical model shows that not only collective operations but also point-to-point communications influence the application's sensitivity to noise. We present a simulation toolchain that injects noise delays from traces gathered on common large-scale architectures into a LogGPS simulation and allows new insights into the scaling of applications in noisy environments. We investigate collective operations with up to 1 million processes and three applications (Sweep3D, AMG, and POP) with up to 32,000 processes. We show that the scale at which noise becomes a bottleneck is system-specific and depends on the structure of the noise. Simulations with different network speeds show that a 10x faster network does not improve application scalability. We quantify noise and conclude that our tools can be utilized to tune the noise signatures of a specific system.
I. MOTIVATION AND BACKGROUND
The performance impact of operating system and architectural overheads (system noise) at massive scale is increasingly of concern. Even small local delays on compute nodes, which can be caused by interrupts, operating system daemons, or even cache or page misses, can affect global application performance significantly [1]. Such local delays often cause less than 1% overhead per process, but severe performance losses can occur if noise is propagated (amplified) through communication or global synchronization. Previous analyses generally assume that the performance impact of system noise grows at scale, and Tsafrir et al. [2] even suggest that the impact of very low frequency noise scales linearly with the system size.
A. Related Work
Petrini, Kerbyson, and Pakin [1] report that the parallel performance of SAGE on a fixed number of ASCI Q nodes was highest when SAGE used only three of the four CPUs per node. It turned out that "resonance" between the application's collective communication and the misconfigured system caused delays during each iteration. Jones, Brenner, and Fier [3] observed similar effects with collective communication and also report that, under certain circumstances, it is beneficial to leave one CPU idle. A theoretical analysis of the influence of noise on collective communication [4] suggests that the impact of noise depends on the type of distribution and its parameters and can, in the worst case (exponential distribution), scale linearly with the number of processes. Ferreira, Bridges, and Brightwell use noise-injection techniques to assess the impact of noise on several applications [5]. Beckman et al. [6] analyzed the performance on BlueGene/L, concluding that most sources of noise can be avoided in very specialized systems.
Previous work was either limited to experimental analysis on specific architectures with injection of artificially generated noise (fixed frequency), or to purely theoretical analyses that assume a particular collective pattern [4]. These previous results show the severity of the problem but allow little generalization and provide limited insight into application behavior on real machines. Effects such as absorption of noise are described but not further investigated [5]. One common theme in all previous works is to treat collective communications as the main problem and often model such operations as strictly synchronizing opaque entities. However, the Message Passing Interface (MPI) standard [7] states that "[...] a collective communication call may, or may not, have the effect of synchronizing all calling processes. This statement excludes, of course, the barrier function." This invalidates all simple models in use today. The synchronization properties of an application depend on the collective algorithm, point-to-point messaging, and the system's network parameters.
We choose a simulation approach similar to that of Sottile et al. [8] and improve it by using noise traces from existing systems combined with detailed simulation and extrapolation of collective operations and parallel application traces. Our simulator enables us to simulate applications on HPC systems that cannot be accessed easily at full scale or that do not exist yet, and it also allows us to investigate the effect of changing network speeds and other system parameters.
B. Contributions
In this work, we introduce an open-source measurement and simulation framework that measures OS noise and assesses its impact on large-scale applications by simulation. We build upon a detailed model for dependencies in applications and synchronization (cf. Lamport's happens-before relation) that considers collective as well as point-to-point patterns of real applications. We perform simulations for a set of applications and systems, explain phenomena observed by
other researchers, and show how the performance of collective operations and applications is decreased by noise.

Such techniques are very helpful in evaluating system software and parallel applications that target upcoming Peta- and Exascale computers, which are not yet available. For example, asynchronous (threaded) progression or active messages might be needed for programming such extreme-scale systems but could increase noise on those systems. The new approaches to implementing parallel applications or new concepts for the system software that will be needed to achieve reasonable performance at this scale can be analyzed with our toolchain.
Our approach combines theory and practice in that we use a detailed network model to describe all synchronization at the message level and analyze the global impact of node-local system noise. The key contributions of this work are:
• A detailed dependency and synchronization model for noise propagation and absorption in parallel applications.
• A discrete-event simulation strategy to investigate the impact of real-world and artificial OS noise on real applications. The strategy accurately reproduces previous experimental results and provides significant insight into previous observations.
• Simulation results of up to 1 million processes using real-world noise traces. The results show that the influence of real-world noise on collective communication can be very different from that of artificially generated noise.
• Simulation results showing that point-to-point messaging influences the noise sensitivity of applications.
• Simulation results including the effects of co-scheduling and network speed in the context of noise.
• A quantification of an effect that we call noise bottleneck, where increasing the network speed does not improve application performance due to noise.
In the next section, we discuss established measurement techniques for system noise and present an enhanced noise benchmark. In Section III, we model the synchronization properties of point-to-point messages and collective operations. Then, we introduce the established LogGPS model to simulate the behavior of collective operations under the influence of noise in Section IV. In Section V, we simulate complete applications with our collected noise traces from real systems. Our methodology provides important insight into the effects of noise on parallel applications and enables us to explain various phenomena found in previous studies. For example, our model shows that the impact of noise depends on the type of collective operation (we are able to explain Ferreira, Bridges, and Brightwell's finding that broadcast is significantly less sensitive to noise than allreduce [5]). Our model also explains why small-message collectives are more affected by noise than larger ones and why low-frequency noise with higher amplitude degrades the performance significantly while high-frequency noise has nearly no impact.
II. MEASURING SYSTEM NOISE
A straightforward noise measurement technique, called fixed work quantum (FWQ), measures the times t_i to compute several fixed workloads. FWQ assumes that the minimum time t_min represents the noiseless execution and all other times t_i are perturbed by t_i − t_min. The main problem with this approach is that the sampling frequency is not constant because each perturbation influences the start of the next sample. Sottile and Minnich [9] propose an inverse measurement technique, fixed time quantum (FTQ), which counts the number of fixed-work computations that can be performed in a specific time and thus enables the application of techniques from signal analysis.

But we argue that both methods fail to record noise with high frequency because the fixed workload needs t_min to compute. This effectively means that the sampling frequency is limited to 1/t_min, and all noise that has a higher frequency simply elevates t_min and underestimates noise. Additionally, if the Nyquist-Shannon sampling theorem is not satisfied, then aliasing could lead to wrong (Moiré) observations. Thus, we conclude that to capture all noise frequencies accurately, the workload has to be chosen as small as possible (t_min → 0). For our experiments, we choose a FWQ benchmark with a workload close to zero.¹
To manage the huge number of measurements, we only store the times of the perturbed measurements, similar to Beckman's "selfish detour" benchmark [6], which also uses a tight loop to measure perturbations. The first difference from "selfish detour" is that we define the threshold relative to t_min (instead of a fixed threshold) and thus allow the benchmark to run on a wide variety of systems. We used 9·t_min as threshold to filter cache misses that are caused by recording the data. Such cache misses occurred on all systems and caused a detour between 6·t_min (Opteron) and 8·t_min (BlueGene/P). Another difference is that we assess t_min in a separate step such that we do not need to update t_min in the benchmark loop. This removes one of the three branches in the critical loop and increased the sampling frequency (benchmark resolution) by approximately 30% in our tests. We measured all times with architecture-dependent high-resolution timers (RDTSC on x86, MFTB on PowerPC, AR.ITC on IA64). All benchmarks are implemented in the publicly available tool Netgauge [10].²
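The measurement loop itself is tiny. The following C sketch illustrates the approach described above under our own simplifying assumptions (x86-only RDTSC, raw cycle units, an illustrative calibration loop count); it is not the Netgauge implementation:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch of a relative-threshold "detour" loop (not the
 * actual Netgauge code). x86-only: reads the time-stamp counter. */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#define NSAMPLES 100000 /* 10^5 detours, as in the paper's runs */

int main(void) {
    /* Calibration step: determine t_min (loop + timer overhead) separately,
     * so the measurement loop below needs no t_min-update branch. */
    uint64_t tmin = UINT64_MAX;
    for (int i = 0; i < 1000000; i++) {
        uint64_t t0 = rdtsc();
        uint64_t dt = rdtsc() - t0;
        if (dt < tmin) tmin = dt;
    }
    static uint64_t when[NSAMPLES], detour[NSAMPLES];
    uint64_t thresh = 9 * tmin; /* relative threshold filters cache misses
                                 * caused by recording the data */
    int n = 0;
    uint64_t prev = rdtsc();
    while (n < NSAMPLES) { /* on a truly noiseless system (cf. CNK) this
                            * loop would never fill up */
        uint64_t now = rdtsc();
        uint64_t dt = now - prev;
        if (dt > thresh) { /* perturbed sample: store time stamp and detour */
            when[n] = now;
            detour[n] = dt - tmin;
            n++;
        }
        prev = now;
    }
    for (int i = 0; i < n; i++)
        printf("%llu %llu\n", (unsigned long long)when[i],
                              (unsigned long long)detour[i]);
    return 0;
}
```

Storing only the perturbed samples keeps the memory traffic of the benchmark itself minimal, which is exactly why the recording-induced cache misses mentioned above are worth filtering.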
A. Analyzing Real-World Architectures
We analyze four different systems that represent to day’s
common large-scale system a rchitectures, often scaling to tens
or even hundreds of thousands of processing cores. The first
system represents Linux clusters with InfiniBand such as the
Ranger system at TACC that run a default Linux kerne l.
The other three systems represen t specialized machines with
custom operating system kernels: SGI Altix 4700, Cray XT-4,
and BlueGene/P. For our benchmarks, we used the standard
batch mechanisms without special tuning to ru n our jobs (as
¹ We use a tight loop; t_min denotes the loop overhead.
² http://www.unixer.de/research/netgauge/osnoise

System/Architecture                                | R_peak        | OS               | t_min    | 1 ppn | c ppn
CHiC Cluster, diskless, 2152 Opteron 2.6 GHz cores | 11.2 TFlop/s  | Linux 2.6.18     | 3.74 ns  | 0.26% | 0.21%
SGI Altix 4700, 2048 Itanium II 1.6 GHz cores      | 13.1 TFlop/s  | Linux 2.6.16     | 25.1 ns  | 0.05% | n/a
Jugene, BlueGene/P, 295k PPC 450 850 MHz cores     | 825.5 TFlop/s | CNK 2.6.19.2     | 29.4 ns  | 0%    | 0%
Intrepid, BlueGene/P, 164k PPC 450 cores           | 458.6 TFlop/s | ZeptoOS 2.6.19.2 | 29.12 ns | 0.02% | 0.08%
Jaguar, Cray XT-4, 150k Opteron 2.1 GHz cores      | 1.38 PFlop/s  | Linux 2.6.16 CNL | 32.9 ns  | 0.02% | 0.02%

TABLE I: System parameters and serial noise overhead for all investigated machines, with either one core (1 ppn) or all cores (c ppn) per node used. t_min represents the minimum loop time (measurement accuracy).
[Fig. 1. Scatter plot of 10^5 detours on the different machines using all cores per node; each panel plots Detour [µs] (log scale, 0.01–100000) over Sample Time [s]: (a) CHiC (Linux), (b) SGI Altix, (c) ZeptoOS, (d) Jaguar.]
We ran the noise benchmark three times on each system, recording 10^5 events, and chose the result with the lowest total detour (all runs differed by less than 0.1%). On systems with c processing cores per node, we executed our benchmark with 1 process per node (ppn) as well as with c ppn.
Table I shows the configuration of all test systems. Overheads of the sampling loop and timer access limit the sampling frequency on real systems, even if the workload is zero. Such loop overheads are system dependent and vary between a few CPU cycles (3.74 ns) and 32.9 ns, as listed in Table I. As discussed before, we cannot reliably measure noise frequencies higher than 1/(2·t_min) Hz, i.e., 1/(2 · 3.74 ns) ≈ 134 MHz on our most accurate system. However, we assume that this limit is only of theoretical interest because most noise has a much lower frequency. In the following, we use the configuration where all cores on a node are used by the application (as most parallel codes are executed). Figure 1 shows scatter plots of the noise patterns for some of our investigated systems.
Figure 1(a) shows the diskless CHiC system with low regular noise but reproducible longer interruptions (as seen around 23 seconds in the plot).³ Figure 1(b) shows that most detours on the SGI Altix lie in the 18 µs range while 220 µs interruptions occur approximately every 2 seconds. We measured absolutely no system noise on the BlueGene/P system running CNK (the benchmark ran for several hours and did not collect a single detour). This is consistent with the results by Yoshii et al. [11], who also report CNK as absolutely noiseless. ZeptoOS on BlueGene/P, however, causes low noise in a regular pattern as shown in Figure 1(c). The XT-4 part of the Jaguar system, number one in the current top-500 list (06/10), also shows high and infrequent random detours in addition to two baselines.
³ Other investigated large Linux systems (e.g., Ranger and Juropa) show structurally similar noise patterns and are omitted for brevity.
III. AN ANALYTICAL MODEL FOR NOISE PROPAGATION
Now, we present a suitable model to analyze noise effects on applications. Noise (or "detours" as discussed in the previous section) can either be absorbed or propagated by synchronization. Processes are often synchronized implicitly by remote data dependencies (cf. happens-before relation). For example, a receive cannot finish before the corresponding (matching) send has been posted and the network transmission cost has been paid (recv/send dependency). We analyze those effects in detail by utilizing the LogGOPS network model to characterize all situations where noise is transported or absorbed.
A. The LogGOPS Model

The LogGPS model [12] is a member of the LogP model family. LogP models are often used to model parallel applications and network transmissions. Multiple researchers have shown that the LogP model family is able to model many parallel algorithms and architectures accurately (e.g., [13]). LogGPS additionally offers support for modeling the synchronization effects of rendezvous messages. We use the extended LogGOPS model in our simulation, which includes an additional parameter O [14] that models the overhead per byte. Table II describes all parameters of the LogGOPS model briefly (see [12], [14] for details).
L : maximum latency between any two endpoints
o : CPU overhead, o_s for send and o_r for receive
g : inter-message gap, the minimum delay between two messages (1/g ≈ message rate)
G : gap per byte (1/G ≈ bandwidth)
O : overhead per byte
P : number of communicating processes
S : threshold for eager messages that are buffered on the receiver; messages larger than S block the sender in the rendezvous protocol until the receive has been posted, while messages smaller than S are sent immediately

TABLE II: LogGOPS parameters
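To make the model concrete, the following C sketch encodes these parameters together with the LogGP network overhead N = o_s + L + (k−1)G + o_r used in the next sections. The struct layout and the example values are our own placeholders, not measurements from any of the systems above:

```c
#include <stddef.h>

/* LogGOPS parameters (cf. Table II); illustrative encoding, times in µs. */
typedef struct {
    double L;        /* maximum latency between any two endpoints */
    double o_s, o_r; /* CPU overhead for send and receive */
    double g;        /* inter-message gap (1/g ~ message rate) */
    double G;        /* gap per byte (1/G ~ bandwidth) */
    double O;        /* overhead per byte */
    int    P;        /* number of communicating processes */
    size_t S;        /* eager/rendezvous protocol threshold in bytes */
} loggops_t;

/* LogGP network overhead of a k-byte message: N = o_s + L + (k-1)G + o_r. */
static double net_overhead(const loggops_t *m, size_t k) {
    return m->o_s + m->L + (double)(k - 1) * m->G + m->o_r;
}

/* Example with placeholder (not measured) values:
 * loggops_t m = { .L = 1.5, .o_s = 0.5, .o_r = 0.5, .g = 2.0,
 *                 .G = 0.001, .O = 0.0005, .P = 8, .S = 4096 }; */
```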

[Fig. 2. Examples for blocking (a) and nonblocking (b) point-to-point synchronization and noise absorption (c). The LogGOP diagrams annotate the send/receive overheads o_s and o_r, the post times T_s and T_r, the wait times T_w^s and T_w^r, the transmission time L + (k−1)G, the synchronization overhead X_r, and, in (c), the absorbed noise.]
The LogGOPS model ignores contention in the network and might thus underestimate communication costs. Our simulations are mostly targeted towards investigating noise propagation, and one could see a congestion-free execution as showing the worst effect of noise. If necessary, average network contention can be modeled by increasing G in the LogGOPS model. We chose this approach to limit simulation resources and allow for larger process counts, cf. [14]. We discuss the influence of the network parameters on noise propagation and absorption in Section V-B.

In the following, we describe synchronization effects in parallel applications and derive an analytical model for noise propagation (sometimes also called amplification) and absorption. This model allows reasoning about the effects of noise and forms a base for our simulation. We discuss blocking and nonblocking point-to-point messages in detail before we proceed to more complex collective communication patterns.
B. Blocking Point-to-Point Communication

If the receive of a message is started too early, then the receiver must wait until the message is sent. Likewise, if the send of k bytes (with k > S, i.e., a rendezvous send) is started too early, then the sender must wait until the receiver is ready. Now let us discuss what "too early" means.

Figure 2(a) shows the scenario of a late sender and the associated synchronization overhead X_r. Both the sender and the receiver are assumed to start at time t = 0. T_s denotes the time when the send is started and T_r the time when the receive is posted. We assume that all other time is spent with computation that advances the algorithm (that is, no polling or testing). Let N = o_s + L + (k−1)G + o_r denote the network overhead in the LogGP model. We define the synchronization overhead on the receiver as X_r = max{T_s + N − T_r, 0}.

For k > S (rendezvous protocol), a synchronization overhead X_s can also occur on the sender: X_s = max{T_r − N − T_s, 0}. In this case, the only scenario where neither the sender nor the receiver is delayed is T_r = T_s + N, that is, the receive is posted exactly at the right moment. Such timing is very unlikely, and blocking communication often propagates noise.
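As an illustration of these formulas (our own sketch in C, not code from the toolchain; a minimal parameter struct is repeated here for self-containment), both overheads can be computed directly from the model parameters:

```c
#include <stddef.h>

/* Minimal LogGP parameter set for this sketch; times in µs. */
typedef struct { double L, o_s, o_r, G; size_t S; } loggp_t;

static double max0(double x) { return x > 0.0 ? x : 0.0; }

/* N = o_s + L + (k-1)G + o_r */
static double N(const loggp_t *m, size_t k) {
    return m->o_s + m->L + (double)(k - 1) * m->G + m->o_r;
}

/* Receiver-side synchronization overhead: X_r = max{T_s + N - T_r, 0}. */
static double X_r(const loggp_t *m, size_t k, double Ts, double Tr) {
    return max0(Ts + N(m, k) - Tr);
}

/* Sender-side overhead, rendezvous protocol only (k > S):
 * X_s = max{T_r - N - T_s, 0}; eager sends never block the sender. */
static double X_s(const loggp_t *m, size_t k, double Ts, double Tr) {
    return (k > m->S) ? max0(Tr - N(m, k) - Ts) : 0.0;
}
```

Note that X_r(T_s, T_r) and X_s(T_s, T_r) cannot both be positive, and both vanish only at the single point T_r = T_s + N, matching the argument above.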
C. Nonblocking Point-to-Point Communication

A common method to avoid synchronization overheads and to reduce communication costs in general is nonblocking communication. Figure 2(b) shows the communication diagram for a nonblocking send/receive pair. Nonblocking transfers are split into two phases, the posting of the operation and the waiting for completion. Synchronization and data transfer can now happen in the background and the associated overheads can be hidden. However, nonblocking transfers underlie several restrictions, and synchronization overheads can still occur if operations are waited for too early. Figure 2(b) shows an example where the receiver waits for an operation before the message arrives. The rendezvous-send and receive synchronization overheads are X_r = max{T_s + N − T_w^r, 0} and X_s = max{T_r − N − T_w^s, 0}, respectively. The main difference from the blocking case is that, if the time between a send or receive and the respective wait is large enough, then synchronization can be avoided. Informally, no synchronization overhead occurs on the receiver when the received data is needed late enough, that is, T_w^r ≥ T_s + N. Synchronization overhead on the sender (rendezvous protocol) can be avoided if the send has enough time to complete, that is, T_w^s ≥ T_r − N.
D. Noise Propagation and Absorption

As discussed before, system noise occurs locally at each process and usually has little impact on the process itself (< 0.25%, cf. Table I). However, the synchronization described before can lead to noise propagation. But noise can also be consumed by existing synchronization delays and disappear completely (cf. [15]). Figure 2(c) shows an example where noise on the receiver is completely absorbed in X_r. However, if the system noise had happened at the same time on the sender, then the noise would have been propagated and X_r would have been increased. Generally, only a limited amount of noise can be subsumed in a synchronization phase. We use σ_α to denote the noise that happens before time α.

If blocking communication is used, then σ_{T_s} propagates to the receiver but might be absorbed if the receive is posted late enough, that is, T_r ≥ T_s + σ_{T_s} + N. The synchronization overheads (including noise propagation) on the receiver and the rendezvous sender are X_r = max{T_s + σ_{T_s} + N − T_r, 0} and X_s = max{T_r + σ_{T_r} − N − T_s, 0}, respectively. The condition for X_r = X_s = 0 (execution without synchronization overhead) is now T_r = T_s + N + (σ_{T_s} − σ_{T_r})/2. If the detours on sender and receiver are identical, then X_s = X_r = 0 iff T_r = T_s + N, as in the noiseless case.

We expect applications that use nonblocking communication to be relatively resistant to system noise due to the possibility to hide some synchronization overheads. The synchronization overheads on the receiver and the rendezvous sender can be modeled as X_r = max{T_s + σ_{T_s} + N − T_w^r, 0} and X_s = max{T_r + σ_{T_r} − N − T_w^s, 0}, respectively. We conclude that no synchronization overhead occurs on the receiver and all noise on the sender is absorbed if T_w^r ≥ T_s + σ_{T_s} + N. The noise at the receiver can be absorbed on the sender (rendezvous protocol) if T_w^s ≥ T_r + σ_{T_r} − N. Thus, nonblocking point-to-point communication has a higher potential to absorb noise than blocking communication.
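The four cases differ only in which noise term enters and which event time bounds the wait. A compact C sketch of these formulas (again our own illustration, with times and noise terms in the same unit as before):

```c
/* Synchronization overheads with noise (cf. the formulas above).
 * N is the LogGP network overhead o_s + L + (k-1)G + o_r; sig_s and
 * sig_r are the noise terms sigma_{T_s} and sigma_{T_r}. Sketch only. */
static double max0(double x) { return x > 0.0 ? x : 0.0; }

/* Blocking: the post times T_s / T_r themselves are the blocking points. */
static double Xr_blocking(double Ts, double sig_s, double N, double Tr) {
    return max0(Ts + sig_s + N - Tr);  /* X_r = max{T_s + sigma + N - T_r, 0} */
}
static double Xs_blocking(double Tr, double sig_r, double N, double Ts) {
    return max0(Tr + sig_r - N - Ts);  /* rendezvous protocol only (k > S) */
}

/* Nonblocking: the wait times T_w^r / T_w^s replace T_r / T_s as the
 * moments that can be delayed, so noise before them can be absorbed. */
static double Xr_nonblocking(double Ts, double sig_s, double N, double Twr) {
    return max0(Ts + sig_s + N - Twr);
}
static double Xs_nonblocking(double Tr, double sig_r, double N, double Tws) {
    return max0(Tr + sig_r - N - Tws);
}
```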
E. Collective Operations

Collective operations often have more complex dataflow dependencies than point-to-point messages. We can, however, identify the following dependence classes in MPI (encoded in the sketch after the list):
1) broadcast, scatter: all non-root processes depend on the root process
2) reduce, gather: the root process depends on all non-root processes
3) scan, exscan: each process depends on all processes with a lower rank
4) alltoall, allgather, allreduce, barrier, reduce_scatter: each process depends on all other processes
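A minimal illustrative encoding of this classification in C (our own helper, not part of the toolchain; the operation names are passed as plain strings):

```c
#include <string.h>

typedef enum {
    DEP_FROM_ROOT,    /* 1) broadcast, scatter */
    DEP_TO_ROOT,      /* 2) reduce, gather */
    DEP_LOWER_RANKS,  /* 3) scan, exscan */
    DEP_ALL_TO_ALL    /* 4) alltoall, allgather, allreduce, barrier,
                            reduce_scatter */
} depclass_t;

/* Map a collective operation to its semantic dependence class. */
static depclass_t dep_class(const char *op) {
    if (!strcmp(op, "broadcast") || !strcmp(op, "scatter")) return DEP_FROM_ROOT;
    if (!strcmp(op, "reduce")    || !strcmp(op, "gather"))  return DEP_TO_ROOT;
    if (!strcmp(op, "scan")      || !strcmp(op, "exscan"))  return DEP_LOWER_RANKS;
    return DEP_ALL_TO_ALL;
}
```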
Those semantic dependencies are lower bounds for synchronization and noise propagation, which means, for example, that an eager broadcast (at least) propagates all noise that happened on the root before the call (σ_{T_s}) to all other processes. This model assumes a linear implementation of the algorithm and would perform asymptotically worse than a binomial-tree implementation [runtime of Ω(P) vs. Ω(log P)]. Thus, at large scale, optimized algorithms must be used to implement collective operations. Such algorithms usually add recv/send (data) dependencies to the (minimal) semantic dependencies, which can cause additional noise propagation from intermediate processes. For example, the binomial tree shown in Figure 3 has multiple paths from the root node to the destinations, and additional recv/send dependencies are introduced along each path.
[Fig. 3. LogGOP diagram for a binomial broadcast tree (P = 15), showing the recv/send dependencies between processes 0–14 over time and the range in which process 7 can absorb noise.]
The longest paths in the 15-process example are (0, 1, 3, 7), (0, 2, 6, 14), (0, 1, 5, 13), and (0, 1, 3, 11), with four recv/send dependencies along each path. Each detour σ_{T_s} that precedes any send along these paths might delay all following processes. On the other hand, if all processes post the broadcast operation at the same global time, all but the root (process 0 in our example) can absorb some detour. Some processes (e.g., process 13) could even absorb three times as much as others. Also, if we take a detailed look at the longest paths, on all but the root node (e.g., processes 1, 2, or 3), noise that happens before the message is received is likely to be absorbed, and only detours during the short period between the receive and the send will delay the operation. Thus the binomial broadcast is relatively insensitive to noise.
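To make the tree structure concrete, the following C sketch (our illustration of the standard binomial-tree construction, not the schedule generator's code) enumerates each rank's parent and children; for P = 15 it reproduces the paths above, e.g., 0 → 1 → 3 → 7:

```c
#include <stdio.h>

/* Print parent and children of every rank in a binomial broadcast tree
 * rooted at rank 0: rank r receives from r - 2^hb (hb = highest set bit
 * of r) and forwards to r + 2^m for all m > hb with r + 2^m < P.
 * Standard construction; illustrative sketch only. */
int main(void) {
    const int P = 15;
    for (int r = 0; r < P; r++) {
        int hb = 0; /* highest set bit of r (unused for the root) */
        while ((1 << (hb + 1)) <= r) hb++;
        if (r == 0) printf("rank 0 (root) sends to:");
        else        printf("rank %d recv from %d, sends to:", r, r - (1 << hb));
        for (int m = (r == 0) ? 0 : hb + 1; r + (1 << m) < P; m++)
            printf(" %d", r + (1 << m));
        printf("\n");
    }
    return 0;
}
```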
The binomial-tree argument shows that the influence of noise and its propagation can, even for simple algorithms, not easily be assessed analytically. Even the globally dependent algorithms in the fourth category depend on the details of the underlying point-to-point algorithm. Figure 4 shows the LogGP diagram of two barrier operations with a compute phase between them.
[Fig. 4. LogGP diagram of two barriers (dissemination algorithm) with a COMPUTE phase in between and process 4 delayed (P = 8).]
We assume that the barrier is implemented with the dissemination algorithm and that process 4 is delayed during the compute phase. All processes leave the second barrier at different times due to recv/send dependencies, and process 3 is delayed most. This example shows clearly that current models, which treat the collective operation as a black box (and assume that all processes are delayed in the same way, e.g., [2]), cannot be used to assess the effects of noise propagation accurately. An accurate analytical model has to account for the whole communication and synchronization of each send/receive pair and all recv/send dependencies to capture each noise propagation and absorption correctly. Finding such models for complex communication patterns seems infeasible. Thus, we propose a full LogGOPS simulator that enables accurate simulation of large-scale systems.
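For reference, the dissemination algorithm used in Figure 4 runs for ⌈log2 P⌉ rounds; in round k, each process i sends to (i + 2^k) mod P and receives from (i − 2^k) mod P. A hedged sketch of this communication pattern (not the simulator's implementation):

```c
#include <stdio.h>

/* Communication pattern of a dissemination barrier for P processes:
 * ceil(log2(P)) rounds; in round k, rank i sends to (i + 2^k) mod P
 * (and receives from (i - 2^k) mod P). Illustrative sketch only. */
int main(void) {
    const int P = 8; /* as in Fig. 4 */
    for (int k = 0, dist = 1; dist < P; k++, dist <<= 1) {
        printf("round %d:", k);
        for (int i = 0; i < P; i++)
            printf(" %d->%d", i, (i + dist) % P);
        printf("\n");
    }
    return 0;
}
```

The cyclic send pattern is exactly why a single delayed process (process 4 in Figure 4) skews the exit times of all other processes differently.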
IV. LOGGOPS SIMULATION FRAMEWORK

The LogGOPS simulation toolchain consists of a trace collector, a schedule generator, an optimized LogGOPS discrete-event simulator similar to [16], and a visualizer.

The trace collector is a library that uses the MPI profiling interface [7, §14] in order to record all MPI calls of an application with minimal overhead.

The schedule generator reads the MPI traces and represents the control and data flow of the application as happens-before dependencies. Collective operations are replaced with suitable point-to-point algorithms. The generator supports state-of-the-art collective algorithms, such as n-ary (binomial) trees, dissemination, recursive doubling, and pipelined trees. A mapping from collective operation to algorithm (e.g., allreduce → binary tree reduce + binary tree broadcast, or barrier → dissemination) can be specified in the schedule generation phase. In this work, we used the dissemination algorithm for small allreduce, allgather, alltoall, and barrier calls and the binomial tree algorithm for small scatter, gather, and broadcast calls.

The simulator reads the schedule, performs the full LogGOPS simulation (cf. Section III), and reports the end times for each process. The simulator was shown to predict collective operations up to 128 processes with an average error of less than 1% and full MPI applications with an error below 2%. A complete description of the simulator and a detailed performance and accuracy study is available in [14].
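A discrete-event simulation of this kind can be reduced to a small core: operations form a DAG of happens-before dependencies, each operation starts when its latest predecessor finishes, and noise is injected by stretching CPU phases. The following C toy (our own conceptual illustration with a hypothetical noise() hook and a hand-built four-operation schedule, not the simulator from [14]) shows that core:

```c
#include <stdio.h>

#define NOPS 4

/* Conceptual core of a LogGOPS discrete-event simulation: each operation
 * is a DAG node whose start time is the max finish time of its
 * predecessors; only CPU phases receive injected noise. Ops are listed
 * in a topological order. Illustrative sketch only. */
typedef struct {
    double duration;   /* o_s, o_r, O*k, L+(k-1)G, or compute time */
    int is_cpu;        /* only CPU phases can be stretched by noise */
    int npred;
    int pred[2];       /* indices of predecessor ops */
    double finish;
} op_t;

/* Placeholder noise model: detour injected into a CPU phase starting at
 * time t, e.g., replayed from a recorded trace (here: no noise). */
static double noise(double t) { (void)t; return 0.0; }

int main(void) {
    /* Tiny schedule: send overhead -> wire time -> recv overhead -> compute */
    op_t ops[NOPS] = {
        { 0.5, 1, 0, {0}, 0 },  /* o_s on the sender (CPU)   */
        { 2.0, 0, 1, {0}, 0 },  /* L + (k-1)G on the wire    */
        { 0.5, 1, 1, {1}, 0 },  /* o_r on the receiver (CPU) */
        { 9.0, 1, 1, {2}, 0 },  /* application compute phase */
    };
    for (int i = 0; i < NOPS; i++) {
        double start = 0.0;
        for (int p = 0; p < ops[i].npred; p++)
            if (ops[ops[i].pred[p]].finish > start)
                start = ops[ops[i].pred[p]].finish;
        double d = ops[i].duration + (ops[i].is_cpu ? noise(start) : 0.0);
        ops[i].finish = start + d;
        printf("op %d: start %.2f finish %.2f\n", i, start, ops[i].finish);
    }
    return 0;
}
```

Replacing noise() with a replay of the recorded detour traces from Section II is, conceptually, how measured noise is injected into the simulated execution.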

References (partial list preserved from the page)

[1] F. Petrini, D. J. Kerbyson, and S. Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q," in Proc. ACM/IEEE SC2003.
[2] D. Tsafrir, Y. Etsion, D. G. Feitelson, and S. Kirkpatrick, "System Noise, OS Clock Ticks, and Fine-Grained Parallel Applications," in Proc. ICS'05.
[5] K. B. Ferreira, P. Bridges, and R. Brightwell, "Characterizing Application Sensitivity to OS Interference Using Kernel-Level Noise Injection," in Proc. ACM/IEEE SC'08.
[7] MPI Forum, "MPI: A Message-Passing Interface Standard."
V. E. Henson and U. M. Yang, "BoomerAMG: A Parallel Algebraic Multigrid Solver and Preconditioner," Applied Numerical Mathematics, 2002.