
Characterizing the Influence of System Noise on
Large-Scale Applications by Simulation
Torsten Hoefler
University of Illinois at Urbana-Champaign
Urbana IL 61801, USA
htor@illinois.edu
Timo Schneider and Andrew Lumsdaine
Indiana University
Bloomington IN 47405, USA
{timoschn,lums}@cs.indiana.edu
Abstract—This paper presents an in-depth analysis of the impact of system noise on large-scale parallel application performance in realistic settings. Our analytical model shows that not only collective operations but also point-to-point communications influence the application's sensitivity to noise. We present a simulation toolchain that injects noise delays from traces gathered on common large-scale architectures into a LogGPS simulation and allows new insights into the scaling of applications in noisy environments. We investigate collective operations with up to 1 million processes and three applications (Sweep3D, AMG, and POP) with up to 32,000 processes. We show that the scale at which noise becomes a bottleneck is system-specific and depends on the structure of the noise. Simulations with different network speeds show that a 10x faster network does not improve application scalability. We quantify noise and conclude that our tools can be utilized to tune the noise signatures of a specific system.
I. MOTIVATION AND BACKGROUND
The performance impact of operating system and architectural overheads (system noise) at massive scale is increasingly of concern. Even small local delays on compute nodes, which can be caused by interrupts, operating system daemons, or even cache or page misses, can affect global application performance significantly [1]. Such local delays often cause less than 1% overhead per process, but severe performance losses can occur if noise is propagated (amplified) through communication or global synchronization. Previous analyses generally assume that the performance impact of system noise grows at scale, and Tsafrir et al. [2] even suggest that the impact of very low frequency noise scales linearly with the system size.
A. Related Work
Petrini, Kerbyson, and Pakin [1] report that the parallel performance of SAGE on a fixed number of ASCI Q nodes was highest when SAGE used only three of the four CPUs per node. It turned out that "resonance" between the application's collective communication and the misconfigured system caused delays during each iteration. Jones, Brenner, and Fier [3] observed similar effects with collective communication and also report that, under certain circumstances, it is beneficial to leave one CPU idle. A theoretical analysis of the influence of noise on collective communication [4] suggests that the impact of noise depends on the type of distribution and its parameters and can, in the worst case (exponential distribution), scale linearly with the number of processes. Ferreira, Bridges, and Brightwell use noise-injection techniques to assess the impact of noise on several applications [5]. Beckman et al. [6] analyzed the performance on BlueGene/L, concluding that most sources of noise can be avoided in very specialized systems.
Previous work was either limited to experimental analysis on specific architectures with injection of artificially generated noise (fixed frequency), or to purely theoretical analyses that assume a particular collective pattern [4]. These previous results show the severity of the problem but allow little generalization and provide limited insight into application behavior on real machines. Effects such as absorption of noise are described but not further investigated [5]. One common theme in all previous works is to treat collective communications as the main problem and often model such operations as strictly synchronizing opaque entities. However, the Message Passing Interface (MPI) standard [7] states that "[...] a collective communication call may, or may not, have the effect of synchronizing all calling processes. This statement excludes, of course, the barrier function." This invalidates all simple models in use today. The synchronization properties of an application depend on the collective algorithm, point-to-point messaging, and the system's network parameters.
We choose a simulation approach similar to that of Sottile et al. [8] and improve it by using noise traces from existing systems combined with detailed simulation and extrapolation of collective operations and parallel application traces. Our simulator enables us to simulate applications on HPC systems that cannot be accessed easily at full scale or that do not exist yet, and it also allows us to investigate the effect of changing network speeds and other system parameters.
B. Contributions
In this work, we introduce an open-source measurement and simulation framework that measures OS noise and assesses its impact on large-scale applications by simulation. We build upon a detailed model for dependencies in applications and synchronization (cf. Lamport's happens-before relation) that considers collective as well as point-to-point patterns of real applications. We perform simulations for a set of applications and systems, explain phenomena observed by
other researchers, and show how the performance of collective operations and applications is decreased by noise.

Such techniques are very helpful in evaluating system software and parallel applications that target upcoming Peta- and Exascale computers, which are not yet available. For example, asynchronous (threaded) progression or active messages might be needed for programming such extreme-scale systems but could increase noise on those systems. The new approaches to implementing parallel applications or new concepts for the system software that will be needed to achieve reasonable performance at this scale can be analyzed with our toolchain.
Our approach combines theory and practice in that we use a detailed network model to describe all synchronization at the message level and analyze the global impact of node-local system noise. The key contributions of this work are:
• A detailed dependency and synchronization model for noise propagation and absorption in parallel applications.
• A discrete-event simulation strategy to investigate the impact of real-world and artificial OS noise on real applications. The strategy accurately reproduces previous experimental results and provides significant insight into previous observations.
• Simulation results of up to 1 million processes using real-world noise traces. The results show that the influence of real-world noise on collective communication can be very different from that of artificially generated noise.
• Simulation results showing that point-to-point messaging influences the noise sensitivity of applications.
• Simulation results including the effects of co-scheduling and network speed in the context of noise.
• A quantification of an effect that we call noise bottleneck, where increasing the network speed does not improve application performance due to noise.
In the next section, we discuss established measurement techniques for system noise and present an enhanced noise benchmark. In Section III, we model the synchronization properties of point-to-point messages and collective operations. Then, we introduce the established LogGPS model to simulate the behavior of collective operations under the influence of noise in Section IV. In Section V, we simulate complete applications with our collected noise traces from real systems. Our methodology provides important insight into the effects of noise on parallel applications and enables us to explain various phenomena found in previous studies. For example, our model shows that the impact of noise depends on the type of collective operation (we are able to explain Ferreira, Bridges, and Brightwell's finding that broadcast is significantly less sensitive to noise than allreduce [5]). Our model also explains why small-message collectives are more affected by noise than larger ones and why low-frequency noise with higher amplitude degrades the performance significantly while high-frequency noise has nearly no impact.
II. MEASURING SYSTEM NOISE
A straightforward noise measurement technique, called fixed work quantum (FWQ), measures the times t_i to compute several fixed workloads. FWQ assumes that the minimum time t_min represents the noiseless execution and all other times t_i are perturbed by t_i − t_min. The main problem with this approach is that the sampling frequency is not constant because each perturbation influences the start of the next sample. Sottile and Minnich [9] propose an inverse measurement technique, fixed time quantum (FTQ), which counts the number of fixed-work computations that can be performed in a specific time and thus enables the application of techniques from signal analysis.

But we argue that both methods fail to record noise with high frequency because the fixed workload needs t_min to compute. This effectively means that the sampling frequency is limited to 1/t_min, and all noise that has a higher frequency simply elevates t_min and underestimates noise. Additionally, if the Nyquist-Shannon sampling theorem is not satisfied, then aliasing could lead to wrong (Moiré) observations. Thus, we conclude that to capture all noise frequencies accurately, the workload has to be chosen as small as possible (t_min → 0). For our experiments, we choose a FWQ benchmark with a workload close to zero.¹
To manage the huge number of measurements, we only store the times of the perturbed measurements, similar to Beckman's "selfish detour" benchmark [6], which also uses a tight loop to measure perturbations. The first difference from "selfish detour" is that we define the threshold relative to t_min (instead of a fixed threshold) and thus allow the benchmark to run on a wide variety of systems. We used 9·t_min as threshold to filter cache misses that are caused by recording the data. Such cache misses occurred on all systems and caused a detour between 6·t_min (Opteron) and 8·t_min (BlueGene/P). Another difference is that we assess t_min in a separate step such that we do not need to update t_min in the benchmark loop. This removes one of the three branches in the critical loop and increased the sampling frequency (benchmark resolution) by approximately 30% in our tests. We measured all times with architecture-dependent high-resolution timers (RDTSC on x86, MFTB on PowerPC, AR.ITC on IA64). All benchmarks are implemented in the publicly available tool Netgauge [10].²
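The measurement loop itself is tiny. The following C sketch illustrates the approach described above under our own simplifying assumptions (x86-only RDTSC, raw cycle units, an illustrative calibration loop count); it is not the Netgauge implementation:

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative sketch of a relative-threshold "detour" loop (not the
 * actual Netgauge code). x86-only: reads the time-stamp counter. */
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

#define NSAMPLES 100000 /* 10^5 detours, as in the paper's runs */

int main(void) {
    /* Calibration step: determine t_min (loop + timer overhead) separately,
     * so the measurement loop below needs no t_min-update branch. */
    uint64_t tmin = UINT64_MAX;
    for (int i = 0; i < 1000000; i++) {
        uint64_t t0 = rdtsc();
        uint64_t dt = rdtsc() - t0;
        if (dt < tmin) tmin = dt;
    }
    static uint64_t when[NSAMPLES], detour[NSAMPLES];
    uint64_t thresh = 9 * tmin; /* relative threshold filters cache misses
                                 * caused by recording the data */
    int n = 0;
    uint64_t prev = rdtsc();
    while (n < NSAMPLES) { /* on a truly noiseless system (cf. CNK) this
                            * loop would never fill up */
        uint64_t now = rdtsc();
        uint64_t dt = now - prev;
        if (dt > thresh) { /* perturbed sample: store time stamp and detour */
            when[n] = now;
            detour[n] = dt - tmin;
            n++;
        }
        prev = now;
    }
    for (int i = 0; i < n; i++)
        printf("%llu %llu\n", (unsigned long long)when[i],
                              (unsigned long long)detour[i]);
    return 0;
}
```

Storing only the perturbed samples keeps the memory traffic of the benchmark itself minimal, which is exactly why the recording-induced cache misses mentioned above are worth filtering.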
A. Analyzing Real-World Architectures
We analyze four different systems that represent to day’s
common large-scale system a rchitectures, often scaling to tens
or even hundreds of thousands of processing cores. The first
system represents Linux clusters with InfiniBand such as the
Ranger system at TACC that run a default Linux kerne l.
The other three systems represen t specialized machines with
custom operating system kernels: SGI Altix 4700, Cray XT-4,
and BlueGene/P. For our benchmarks, we used the standard
batch mechanisms without special tuning to ru n our jobs (as
¹ We use a tight loop; t_min denotes the loop overhead.
² http://www.unixer.de/research/netgauge/osnoise

System/Architecture                                | R_peak        | OS               | t_min    | 1 ppn | c ppn
CHiC Cluster, diskless, 2152 Opteron 2.6 GHz cores | 11.2 TFlop/s  | Linux 2.6.18     | 3.74 ns  | 0.26% | 0.21%
SGI Altix 4700, 2048 Itanium II 1.6 GHz cores      | 13.1 TFlop/s  | Linux 2.6.16     | 25.1 ns  | 0.05% | n/a
Jugene, BlueGene/P, 295k PPC 450 850 MHz cores     | 825.5 TFlop/s | CNK 2.6.19.2     | 29.4 ns  | 0%    | 0%
Intrepid, BlueGene/P, 164k PPC 450 cores           | 458.6 TFlop/s | ZeptoOS 2.6.19.2 | 29.12 ns | 0.02% | 0.08%
Jaguar, Cray XT-4, 150k Opteron 2.1 GHz cores      | 1.38 PFlop/s  | Linux 2.6.16 CNL | 32.9 ns  | 0.02% | 0.02%

TABLE I: System parameters and serial noise overhead for all investigated machines, with either one core (1 ppn) or all cores (c ppn) per node used. t_min represents the minimum loop time (measurement accuracy).
[Fig. 1. Scatter plot of 10^5 detours on the different machines using all cores per node; each panel plots Detour [µs] (log scale, 0.01–100000) over Sample Time [s]: (a) CHiC (Linux), (b) SGI Altix, (c) ZeptoOS, (d) Jaguar.]
We ran the noise benchmark three times on each system, recording 10^5 events, and chose the result with the lowest total detour (all runs differed by less than 0.1%). On systems with c processing cores per node, we executed our benchmark with 1 process per node (ppn) as well as with c ppn.
Table I shows the configuration of all test systems. Overheads of the sampling loop and timer access limit the sampling frequency on real systems, even if the workload is zero. Such loop overheads are system dependent and vary between a few CPU cycles (3.74 ns) and 32.9 ns, as listed in Table I. As discussed before, we cannot reliably measure noise frequencies higher than 1/(2·t_min) Hz, i.e., 1/(2 · 3.74 ns) ≈ 134 MHz on our most accurate system. However, we assume that this limit is only of theoretical interest because most noise has a much lower frequency. In the following, we use the configuration where all cores on a node are used by the application (as most parallel codes are executed). Figure 1 shows scatter plots of the noise patterns for some of our investigated systems.
Figure 1(a) shows the diskless CHiC system with low regular noise but reproducible longer interruptions (as seen around 23 seconds in the plot).³ Figure 1(b) shows that most detours on the SGI Altix lie in the 18 µs range while 220 µs interruptions occur approximately every 2 seconds. We measured absolutely no system noise on the BlueGene/P system running CNK (the benchmark ran for several hours and did not collect a single detour). This is consistent with the results by Yoshii et al. [11], who also report CNK as absolutely noiseless. ZeptoOS on BlueGene/P, however, causes low noise in a regular pattern as shown in Figure 1(c). The XT-4 part of the Jaguar system, number one in the current top-500 list (06/10), also shows high and infrequent random detours in addition to two baselines.
³ Other investigated large Linux systems (e.g., Ranger and Juropa) show structurally similar noise patterns and are omitted for brevity.
III. AN ANALYTICAL MODEL FOR NOISE PROPAGATION
Now, we present a suitable model to analyze noise effects on applications. Noise (or "detours" as discussed in the previous section) can either be absorbed or propagated by synchronization. Processes are often synchronized implicitly by remote data dependencies (cf. happens-before relation). For example, a receive cannot finish before the corresponding (matching) send has been posted and the network transmission cost has been paid (recv/send dependency). We analyze those effects in detail by utilizing the LogGOPS network model to characterize all situations where noise is transported or absorbed.
A. The LogGOPS Model

The LogGPS model [12] is a member of the LogP model family. LogP models are often used to model parallel applications and network transmissions. Multiple researchers have shown that the LogP model family is able to model many parallel algorithms and architectures accurately (e.g., [13]). LogGPS additionally offers support for modeling the synchronization effects of rendezvous messages. We use the extended LogGOPS model in our simulation, which includes an additional parameter O [14] that models the overhead per byte. Table II describes all parameters of the LogGOPS model briefly (see [12], [14] for details).
L : maximum latency between any two endpoints
o : CPU overhead, o_s for send and o_r for receive
g : inter-message gap, the minimum delay between two messages (1/g ≈ message rate)
G : gap per byte (1/G ≈ bandwidth)
O : overhead per byte
P : number of communicating processes
S : threshold for eager messages that are buffered on the receiver; messages larger than S block the sender in the rendezvous protocol until the receive has been posted, while messages smaller than S are sent immediately

TABLE II: LogGOPS parameters
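To make the model concrete, the following C sketch encodes these parameters together with the LogGP network overhead N = o_s + L + (k−1)G + o_r used in the next sections. The struct layout and the example values are our own placeholders, not measurements from any of the systems above:

```c
#include <stddef.h>

/* LogGOPS parameters (cf. Table II); illustrative encoding, times in µs. */
typedef struct {
    double L;        /* maximum latency between any two endpoints */
    double o_s, o_r; /* CPU overhead for send and receive */
    double g;        /* inter-message gap (1/g ~ message rate) */
    double G;        /* gap per byte (1/G ~ bandwidth) */
    double O;        /* overhead per byte */
    int    P;        /* number of communicating processes */
    size_t S;        /* eager/rendezvous protocol threshold in bytes */
} loggops_t;

/* LogGP network overhead of a k-byte message: N = o_s + L + (k-1)G + o_r. */
static double net_overhead(const loggops_t *m, size_t k) {
    return m->o_s + m->L + (double)(k - 1) * m->G + m->o_r;
}

/* Example with placeholder (not measured) values:
 * loggops_t m = { .L = 1.5, .o_s = 0.5, .o_r = 0.5, .g = 2.0,
 *                 .G = 0.001, .O = 0.0005, .P = 8, .S = 4096 }; */
```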

[Fig. 2. Examples for blocking (a) and nonblocking (b) point-to-point synchronization and noise absorption (c). The LogGOP diagrams annotate the send/receive overheads o_s and o_r, the post times T_s and T_r, the wait times T_w^s and T_w^r, the transmission time L + (k−1)G, the synchronization overhead X_r, and, in (c), the absorbed noise.]
The LogGOPS model ignores contention in the network and might thus underestimate communication costs. Our simulations are mostly targeted towards investigating noise propagation, and one could see a congestion-free execution as showing the worst effect of noise. If necessary, average network contention can be modeled by increasing G in the LogGOPS model. We chose this approach to limit simulation resources and allow for larger process counts, cf. [14]. We discuss the influence of the network parameters on noise propagation and absorption in Section V-B.

In the following, we describe synchronization effects in parallel applications and derive an analytical model for noise propagation (sometimes also called amplification) and absorption. This model allows reasoning about the effects of noise and forms a base for our simulation. We discuss blocking and nonblocking point-to-point messages in detail before we proceed to more complex collective communication patterns.
B. Blocking Point-to-Point Communication

If the receive of a message is started too early, then the receiver must wait until the message is sent. Likewise, if the send of k bytes (with k > S, i.e., a rendezvous send) is started too early, then the sender must wait until the receiver is ready. Now let us discuss what "too early" means.

Figure 2(a) shows the scenario of a late sender and the associated synchronization overhead X_r. Both the sender and the receiver are assumed to start at time t = 0. T_s denotes the time when the send is started and T_r the time when the receive is posted. We assume that all other time is spent with computation that advances the algorithm (that is, no polling or testing). Let N = o_s + L + (k−1)G + o_r denote the network overhead in the LogGP model. We define the synchronization overhead on the receiver as X_r = max{T_s + N − T_r, 0}.

For k > S (rendezvous protocol), a synchronization overhead X_s can also occur on the sender: X_s = max{T_r − N − T_s, 0}. In this case, the only scenario where neither the sender nor the receiver is delayed is T_r = T_s + N, that is, the receive is posted exactly at the right moment. Such timing is very unlikely, and blocking communication often propagates noise.
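As an illustration of these formulas (our own sketch in C, not code from the toolchain; a minimal parameter struct is repeated here for self-containment), both overheads can be computed directly from the model parameters:

```c
#include <stddef.h>

/* Minimal LogGP parameter set for this sketch; times in µs. */
typedef struct { double L, o_s, o_r, G; size_t S; } loggp_t;

static double max0(double x) { return x > 0.0 ? x : 0.0; }

/* N = o_s + L + (k-1)G + o_r */
static double N(const loggp_t *m, size_t k) {
    return m->o_s + m->L + (double)(k - 1) * m->G + m->o_r;
}

/* Receiver-side synchronization overhead: X_r = max{T_s + N - T_r, 0}. */
static double X_r(const loggp_t *m, size_t k, double Ts, double Tr) {
    return max0(Ts + N(m, k) - Tr);
}

/* Sender-side overhead, rendezvous protocol only (k > S):
 * X_s = max{T_r - N - T_s, 0}; eager sends never block the sender. */
static double X_s(const loggp_t *m, size_t k, double Ts, double Tr) {
    return (k > m->S) ? max0(Tr - N(m, k) - Ts) : 0.0;
}
```

Note that X_r(T_s, T_r) and X_s(T_s, T_r) cannot both be positive, and both vanish only at the single point T_r = T_s + N, matching the argument above.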
C. Nonblocking Point-to-Point Communication

A common method to avoid synchronization overheads and to reduce communication costs in general is nonblocking communication. Figure 2(b) shows the communication diagram for a nonblocking send/receive pair. Nonblocking transfers are split into two phases, the posting of the operation and the waiting for completion. Synchronization and data transfer can now happen in the background and the associated overheads can be hidden. However, nonblocking transfers underlie several restrictions, and synchronization overheads can still occur if operations are waited for too early. Figure 2(b) shows an example where the receiver waits for an operation before the message arrives. The rendezvous-send and receive synchronization overheads are X_r = max{T_s + N − T_w^r, 0} and X_s = max{T_r − N − T_w^s, 0}, respectively. The main difference from the blocking case is that, if the time between a send or receive and the respective wait is large enough, then synchronization can be avoided. Informally, no synchronization overhead occurs on the receiver when the received data is needed late enough, that is, T_w^r ≥ T_s + N. Synchronization overhead on the sender (rendezvous protocol) can be avoided if the send has enough time to complete, that is, T_w^s ≥ T_r − N.
D. Noise Propagation and Absorption

As discussed before, system noise occurs locally at each process and usually has little impact on the process itself (< 0.25%, cf. Table I). However, the synchronization described before can lead to noise propagation. But noise can also be consumed by existing synchronization delays and disappear completely (cf. [15]). Figure 2(c) shows an example where noise on the receiver is completely absorbed in X_r. However, if the system noise had happened at the same time on the sender, then the noise would have been propagated and X_r would have been increased. Generally, only a limited amount of noise can be subsumed in a synchronization phase. We use σ_α to denote the noise that happens before time α.

If blocking communication is used, then σ_{T_s} propagates to the receiver but might be absorbed if the receive is posted late enough, that is, T_r ≥ T_s + σ_{T_s} + N. The synchronization overheads (including noise propagation) on the receiver and the rendezvous sender are X_r = max{T_s + σ_{T_s} + N − T_r, 0} and X_s = max{T_r + σ_{T_r} − N − T_s, 0}, respectively. The condition for X_r = X_s = 0 (execution without synchronization overhead) is now T_r = T_s + N + (σ_{T_s} − σ_{T_r})/2. If the detours on sender and receiver are identical, then X_s = X_r = 0 iff T_r = T_s + N, as in the noiseless case.

We expect applications that use nonblocking communication to be relatively resistant to system noise due to the possibility to hide some synchronization overheads. The synchronization overheads on the receiver and the rendezvous sender can be modeled as X_r = max{T_s + σ_{T_s} + N − T_w^r, 0} and X_s = max{T_r + σ_{T_r} − N − T_w^s, 0}, respectively. We conclude that no synchronization overhead occurs on the receiver and all noise on the sender is absorbed if T_w^r ≥ T_s + σ_{T_s} + N. The noise at the receiver can be absorbed on the sender (rendezvous protocol) if T_w^s ≥ T_r + σ_{T_r} − N. Thus, nonblocking point-to-point communication has a higher potential to absorb noise than blocking communication.
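The four cases differ only in which noise term enters and which event time bounds the wait. A compact C sketch of these formulas (again our own illustration, with times and noise terms in the same unit as before):

```c
/* Synchronization overheads with noise (cf. the formulas above).
 * N is the LogGP network overhead o_s + L + (k-1)G + o_r; sig_s and
 * sig_r are the noise terms sigma_{T_s} and sigma_{T_r}. Sketch only. */
static double max0(double x) { return x > 0.0 ? x : 0.0; }

/* Blocking: the post times T_s / T_r themselves are the blocking points. */
static double Xr_blocking(double Ts, double sig_s, double N, double Tr) {
    return max0(Ts + sig_s + N - Tr);  /* X_r = max{T_s + sigma + N - T_r, 0} */
}
static double Xs_blocking(double Tr, double sig_r, double N, double Ts) {
    return max0(Tr + sig_r - N - Ts);  /* rendezvous protocol only (k > S) */
}

/* Nonblocking: the wait times T_w^r / T_w^s replace T_r / T_s as the
 * moments that can be delayed, so noise before them can be absorbed. */
static double Xr_nonblocking(double Ts, double sig_s, double N, double Twr) {
    return max0(Ts + sig_s + N - Twr);
}
static double Xs_nonblocking(double Tr, double sig_r, double N, double Tws) {
    return max0(Tr + sig_r - N - Tws);
}
```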
E. Collective Operations

Collective operations often have more complex dataflow dependencies than point-to-point messages. We can, however, identify the following dependence classes in MPI (encoded in the sketch after the list):
1) broadcast, scatter: all non-root processes depend on the root process
2) reduce, gather: the root process depends on all non-root processes
3) scan, exscan: each process depends on all processes with a lower rank
4) alltoall, allgather, allreduce, barrier, reduce_scatter: each process depends on all other processes
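A minimal illustrative encoding of this classification in C (our own helper, not part of the toolchain; the operation names are passed as plain strings):

```c
#include <string.h>

typedef enum {
    DEP_FROM_ROOT,    /* 1) broadcast, scatter */
    DEP_TO_ROOT,      /* 2) reduce, gather */
    DEP_LOWER_RANKS,  /* 3) scan, exscan */
    DEP_ALL_TO_ALL    /* 4) alltoall, allgather, allreduce, barrier,
                            reduce_scatter */
} depclass_t;

/* Map a collective operation to its semantic dependence class. */
static depclass_t dep_class(const char *op) {
    if (!strcmp(op, "broadcast") || !strcmp(op, "scatter")) return DEP_FROM_ROOT;
    if (!strcmp(op, "reduce")    || !strcmp(op, "gather"))  return DEP_TO_ROOT;
    if (!strcmp(op, "scan")      || !strcmp(op, "exscan"))  return DEP_LOWER_RANKS;
    return DEP_ALL_TO_ALL;
}
```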
Those semantic dependencies are lower bounds for synchronization and noise propagation, which means, for example, that an eager broadcast (at least) propagates all noise that happened on the root before the call (σ_{T_s}) to all other processes. This model assumes a linear implementation of the algorithm and would perform asymptotically worse than a binomial-tree implementation [runtime of Ω(P) vs. Ω(log P)]. Thus, at large scale, optimized algorithms must be used to implement collective operations. Such algorithms usually add recv/send (data) dependencies to the (minimal) semantic dependencies, which can cause additional noise propagation from intermediate processes. For example, the binomial tree shown in Figure 3 has multiple paths from the root node to the destinations, and additional recv/send dependencies are introduced along each path.
[Fig. 3. LogGOP diagram for a binomial broadcast tree (P = 15), showing the recv/send dependencies between processes 0–14 over time and the range in which process 7 can absorb noise.]
The longest paths in the 15-process example are (0, 1, 3, 7), (0, 2, 6, 14), (0, 1, 5, 13), and (0, 1, 3, 11), with four recv/send dependencies along each path. Each detour σ_{T_s} that precedes any send along these paths might delay all following processes. On the other hand, if all processes post the broadcast operation at the same global time, all but the root (process 0 in our example) can absorb some detour. Some processes (e.g., process 13) could even absorb three times as much as others. Also, if we take a detailed look at the longest paths, on all but the root node (e.g., processes 1, 2, or 3), noise that happens before the message is received is likely to be absorbed, and only detours during the short period between the receive and the send will delay the operation. Thus the binomial broadcast is relatively insensitive to noise.
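To make the tree structure concrete, the following C sketch (our illustration of the standard binomial-tree construction, not the schedule generator's code) enumerates each rank's parent and children; for P = 15 it reproduces the paths above, e.g., 0 → 1 → 3 → 7:

```c
#include <stdio.h>

/* Print parent and children of every rank in a binomial broadcast tree
 * rooted at rank 0: rank r receives from r - 2^hb (hb = highest set bit
 * of r) and forwards to r + 2^m for all m > hb with r + 2^m < P.
 * Standard construction; illustrative sketch only. */
int main(void) {
    const int P = 15;
    for (int r = 0; r < P; r++) {
        int hb = 0; /* highest set bit of r (unused for the root) */
        while ((1 << (hb + 1)) <= r) hb++;
        if (r == 0) printf("rank 0 (root) sends to:");
        else        printf("rank %d recv from %d, sends to:", r, r - (1 << hb));
        for (int m = (r == 0) ? 0 : hb + 1; r + (1 << m) < P; m++)
            printf(" %d", r + (1 << m));
        printf("\n");
    }
    return 0;
}
```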
The binomial-tree argument shows that the influence of noise and its propagation can, even for simple algorithms, not easily be assessed analytically. Even the globally dependent algorithms in the fourth category depend on the details of the underlying point-to-point algorithm. Figure 4 shows the LogGP diagram of two barrier operations with a compute phase between them.
[Fig. 4. LogGP diagram of two barriers (dissemination algorithm) with a COMPUTE phase in between and process 4 delayed (P = 8).]
We assume that the barrier is implemented with the dissemination algorithm and that process 4 is delayed during the compute phase. All processes leave the second barrier at different times due to recv/send dependencies, and process 3 is delayed most. This example shows clearly that current models, which treat the collective operation as a black box (and assume that all processes are delayed in the same way, e.g., [2]), cannot be used to assess the effects of noise propagation accurately. An accurate analytical model has to account for the whole communication and synchronization of each send/receive pair and all recv/send dependencies to capture each noise propagation and absorption correctly. Finding such models for complex communication patterns seems infeasible. Thus, we propose a full LogGOPS simulator that enables accurate simulation of large-scale systems.
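For reference, the dissemination algorithm used in Figure 4 runs for ⌈log2 P⌉ rounds; in round k, each process i sends to (i + 2^k) mod P and receives from (i − 2^k) mod P. A hedged sketch of this communication pattern (not the simulator's implementation):

```c
#include <stdio.h>

/* Communication pattern of a dissemination barrier for P processes:
 * ceil(log2(P)) rounds; in round k, rank i sends to (i + 2^k) mod P
 * (and receives from (i - 2^k) mod P). Illustrative sketch only. */
int main(void) {
    const int P = 8; /* as in Fig. 4 */
    for (int k = 0, dist = 1; dist < P; k++, dist <<= 1) {
        printf("round %d:", k);
        for (int i = 0; i < P; i++)
            printf(" %d->%d", i, (i + dist) % P);
        printf("\n");
    }
    return 0;
}
```

The cyclic send pattern is exactly why a single delayed process (process 4 in Figure 4) skews the exit times of all other processes differently.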
IV. LOGGOPS SIMULATION FRAMEWORK

The LogGOPS simulation toolchain consists of a trace collector, a schedule generator, an optimized LogGOPS discrete-event simulator similar to [16], and a visualizer.

The trace collector is a library that uses the MPI profiling interface [7, §14] in order to record all MPI calls of an application with minimal overhead.

The schedule generator reads the MPI traces and represents the control and data flow of the application as happens-before dependencies. Collective operations are replaced with suitable point-to-point algorithms. The generator supports state-of-the-art collective algorithms, such as n-ary (binomial) trees, dissemination, recursive doubling, and pipelined trees. A mapping from collective operation to algorithm (e.g., allreduce → binary tree reduce + binary tree broadcast, or barrier → dissemination) can be specified in the schedule generation phase. In this work, we used the dissemination algorithm for small allreduce, allgather, alltoall, and barrier calls and the binomial tree algorithm for small scatter, gather, and broadcast calls.

The simulator reads the schedule, performs the full LogGOPS simulation (cf. Section III), and reports the end times for each process. The simulator was shown to predict collective operations up to 128 processes with an average error of less than 1% and full MPI applications with an error below 2%. A complete description of the simulator and a detailed performance and accuracy study is available in [14].
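A discrete-event simulation of this kind can be reduced to a small core: operations form a DAG of happens-before dependencies, each operation starts when its latest predecessor finishes, and noise is injected by stretching CPU phases. The following C toy (our own conceptual illustration with a hypothetical noise() hook and a hand-built four-operation schedule, not the simulator from [14]) shows that core:

```c
#include <stdio.h>

#define NOPS 4

/* Conceptual core of a LogGOPS discrete-event simulation: each operation
 * is a DAG node whose start time is the max finish time of its
 * predecessors; only CPU phases receive injected noise. Ops are listed
 * in a topological order. Illustrative sketch only. */
typedef struct {
    double duration;   /* o_s, o_r, O*k, L+(k-1)G, or compute time */
    int is_cpu;        /* only CPU phases can be stretched by noise */
    int npred;
    int pred[2];       /* indices of predecessor ops */
    double finish;
} op_t;

/* Placeholder noise model: detour injected into a CPU phase starting at
 * time t, e.g., replayed from a recorded trace (here: no noise). */
static double noise(double t) { (void)t; return 0.0; }

int main(void) {
    /* Tiny schedule: send overhead -> wire time -> recv overhead -> compute */
    op_t ops[NOPS] = {
        { 0.5, 1, 0, {0}, 0 },  /* o_s on the sender (CPU)   */
        { 2.0, 0, 1, {0}, 0 },  /* L + (k-1)G on the wire    */
        { 0.5, 1, 1, {1}, 0 },  /* o_r on the receiver (CPU) */
        { 9.0, 1, 1, {2}, 0 },  /* application compute phase */
    };
    for (int i = 0; i < NOPS; i++) {
        double start = 0.0;
        for (int p = 0; p < ops[i].npred; p++)
            if (ops[ops[i].pred[p]].finish > start)
                start = ops[ops[i].pred[p]].finish;
        double d = ops[i].duration + (ops[i].is_cpu ? noise(start) : 0.0);
        ops[i].finish = start + d;
        printf("op %d: start %.2f finish %.2f\n", i, start, ops[i].finish);
    }
    return 0;
}
```

Replacing noise() with a replay of the recorded detour traces from Section II is, conceptually, how measured noise is injected into the simulated execution.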

References (partial list preserved from the page)

[1] F. Petrini, D. J. Kerbyson, and S. Pakin, "The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q," in Proc. ACM/IEEE SC2003.
[2] D. Tsafrir, Y. Etsion, D. G. Feitelson, and S. Kirkpatrick, "System Noise, OS Clock Ticks, and Fine-Grained Parallel Applications," in Proc. ICS'05.
[5] K. B. Ferreira, P. Bridges, and R. Brightwell, "Characterizing Application Sensitivity to OS Interference Using Kernel-Level Noise Injection," in Proc. ACM/IEEE SC'08.
[7] MPI Forum, "MPI: A Message-Passing Interface Standard."
V. E. Henson and U. M. Yang, "BoomerAMG: A Parallel Algebraic Multigrid Solver and Preconditioner," Applied Numerical Mathematics, 2002.