Automated Application-level Checkpointing of MPI Programs
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
Department of Computer Science,
Cornell University, Ithaca, NY 14853
Abstract
Because of increasing hardware and software complexity,
the running time of many computational science applica-
tions is now more than the mean-time-to-failure of high-
performance computing platforms. Therefore, computational
science applications need to tolerate hardware failures.
In this paper, we focus on the stopping failure model in
which a faulty process hangs and stops responding to the
rest of the system. We argue that tolerating such faults is
best done by an approach called application-level coordi-
nated non-blocking checkpointing, and that existing fault-
tolerance protocols in the literature are not suitable for im-
plementing this approach.
In this paper, we present a suitable protocol, and show
how it can be used with a precompiler that instruments
C/MPI programs to save application and MPI library state.
An advantage of our approach is that it is independent of the
MPI implementation. We present experimental results that
argue that the overhead of using our system can be small.
1 Introduction
Fault-tolerant programming has been studied extensively in
the context of distributed systems [6]. In contrast, the high-
performance parallel computing community has not devoted
much attention to this problem because hardware failures in
parallel platforms were not frequent enough to be a cause
for concern. Most high-performance computing was done
on “big-iron platforms”: monolithic vector or parallel com-
puters that were designed, built, and maintained by a single
vendor. Because these machines cost many millions of dol-
lars, vendors could afford to design reliable components and
integrate them carefully to produce relatively robust com-
puting platforms. Moreover, unlike distributed systems pro-
grams such as air-traffic control systems that must run with-
out stopping, most computational science programs ran for
durations that were much less than the mean-time-between-
failure (MTBF) of the underlying hardware.
This work was supported by NSF grants ACI-9870687, EIA-9972853, ACI-0085969, ACI-0090217, ACI-0103723, and ACI-0121401.
[Figure 1: Hierarchy of different fault tolerance techniques. The hierarchy divides into checkpointing (uncoordinated, or coordinated: blocking or non-blocking) and message logging (pessimistic, optimistic, or causal).]
Recent changes in the high-performance parallel comput-
ing world are bringing the issue of fault-tolerance to the front
and center. First, the number of processors in big-iron ma-
chines is increasing rapidly: the recently announced Blue
Gene/L will have over 130,000 [18]. Anecdotal evidence
is that such a machine loses a processor every few hours;
increasing the number of processors increases the overall
performance, but it also increases the number of points of
failure. Second, parallel computing is shifting from ex-
pensive monolithic hardware systems to low-cost, custom-
assembled clusters of processors and communication fab-
ric. The recent trend towards Internet-wide grid-computing
is another change in the hardware picture that increases the
probability of hardware failures during program execution.
Third, many computational science programs are now de-
signed to run for days or even months at a time; some exam-
ples are the ASCI stockpile certification programs [13] and ab initio protein-folding programs such as IBM’s Blue Gene [9] codes, which are intended to run for months.
Therefore, the running times of many applications are
now significantly longer than the MTBF of the underlying
hardware. Computational science programs must tolerate
hardware failures.
1.1 Problem Definition
To address this problem, it is necessary to define the fault
model. Two common classes of models are Stopping and

Byzantine [11]. In a Stopping model, a faulty process hangs
and stops
responding to the rest of the system, neither send-
ing nor receiving messages. Byzantine faults permit a faulty
process to perform more damaging acts such as sending cor-
rupted data to other processes.
In this paper, we focus our attention on stopping pro-
cesses. As we discuss in this paper, there are many inter-
esting problems to be solved even in this restricted domain.
Moreover, a good solution for this failure model can be a
useful mechanism in addressing the more general problem
of Byzantine faults.
In general, good abstractions are key to effective han-
dling of failures. In this spirit, we make the standard assump-
tion that there is a reliable transport layer for delivering ap-
plication messages, and we build our solutions on top of that
abstraction. One such reliable implementation of the MPI
communication library is Los Alamos MPI (LA-MPI) [7].
We can now state the problem we address in this paper.
We are given a long-running MPI program that must run on
a machine that has (i) a reliable message delivery system, (ii)
unreliable processors which can fail silently at any time, and
(iii) a mechanism such as a distributed failure detector [8]
for detecting failed processes. How do we ensure that the
program makes progress inspite of these faults?
1.2 Solution space
Figure 1 classifies some of the ways in which programs can
be made fault-tolerant. An excellent survey of these tech-
niques can be found in [6].
Checkpointing techniques periodically save a description
of the state of a computation to stable storage; if any process
fails, all processes are rolled back to the last checkpoint, and
the computation is restarted from there. Message-logging
techniques in contrast require restarting only the computa-
tion performed by the failed process. Surviving processes
are not rolled back but must help the restarted process by re-
playing messages that were sent to it before it failed. The
simplest implementation of message logging requires every
process to save a copy of every message it sends. A more
sophisticated approach might try to regenerate messages on
demand using approaches like reversible computation. Al-
though message-logging is a very appealing idea which has
been studied intensively by the distributed systems commu-
nity [5, 10, 16], our experience is that the overhead of sav-
ing or regenerating messages tends to be so overwhelming
that the technique is not competitive in practice. This may
be because parallel programs communicate more data more
frequently than distributed programs [17].
We therefore focus on checkpointing.
Checkpointing techniques can be classified along two in-
dependent dimensions.
(1) The first dimension is the abstraction level at which
the state of a process is saved. In system-level checkpoint-
ing, the bits that constitute the state of the process, such as the contents of the program counter, registers, and memory, are saved on stable storage. Examples of systems that do system-level checkpointing are Condor [12] and Libckpt [14].
Some systems like Starfish [1] give the programmer some control over what is saved. Unfortunately, complete system-
level checkpointing of parallel machines with thousands of
processors can be impractical because each system check-
point can require thousands of nodes sending terabytes of
data to stable storage. For this reason, system-level check-
pointing is not done on large machines such as the IBM Blue
Gene or the ASCI machines.
One alternative which is popular is application-level
checkpointing. Applications can obtain fault-tolerance by
providing their own checkpointing code [3]. The application
is written such that it correctly restarts from various posi-
tions in the code by storing certain information to a restart
file. The benefit of this technique is that the programmer need only save the minimum amount of data necessary
to recover the program state. For example, in an ab initio
protein folding code, it suffices to save the positions and ve-
locities of the various bases, which is a small fraction of the
total state of the parallel system. The disadvantage of this
approach to implementing application-level checkpointing
is that it complicates the coding of the application program,
and it is one more chore for the parallel programmer.
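For concreteness, a hand-rolled application-level checkpoint for a simple time-stepping code might look like the following sketch; the restart file name, array sizes, and variables are hypothetical and not taken from the paper.

/* A minimal sketch of hand-rolled application-level checkpointing
 * (hypothetical file name and variables; not the paper's system). */
#include <stdio.h>

#define N 1024                 /* number of particles (assumed) */
static double pos[N], vel[N];  /* the only state we choose to save */

/* Save just enough data to restart: step counter, positions, velocities. */
static void save_checkpoint(int step) {
    FILE *f = fopen("restart.dat", "wb");
    if (!f) { perror("restart.dat"); return; }
    fwrite(&step, sizeof step, 1, f);
    fwrite(pos, sizeof(double), N, f);
    fwrite(vel, sizeof(double), N, f);
    fclose(f);
}

/* Returns the step to resume from, or 0 if no restart file exists. */
static int try_restore(void) {
    FILE *f = fopen("restart.dat", "rb");
    int step = 0;
    if (!f) return 0;
    fread(&step, sizeof step, 1, f);
    fread(pos, sizeof(double), N, f);
    fread(vel, sizeof(double), N, f);
    fclose(f);
    return step;
}

int main(void) {
    int step = try_restore();
    for (; step < 1000000; step++) {
        /* ... advance positions and velocities ... */
        if (step % 10000 == 0) save_checkpoint(step);
    }
    return 0;
}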
In this paper, we explore the use of compiler technology
to automate application-level checkpointing.
(2) The second dimension along which checkpointing
techniques can be classified is the technique used to coor-
dinate parallel processes when checkpoints need to be taken.
In uncoordinated checkpointing, each process saves its state
whenever it wants to without coordinating with other pro-
cesses. Although this is simple, restart can be problematic
due to exponential rollback, which may cause the computa-
tion to roll so far back that it makes no progress [6]. For this
reason, uncoordinated checkpointing has fallen out of favor.
Coordinated checkpointing can be divided into block-
ing and non-blocking checkpointing. Blocking techniques
bring all processes to a stop before taking a global check-
point. Hardware blocking was used on the IBM SP-2 to take
system-level checkpoints. Software blocking techniques ex-
ploit barriers - when processes reach a global barrier, each
one saves its own state on stable storage. This is essentially
the solution used today by applications programmers who
roll their own application-level state-saving code. However,
this solution can fail for some MPI programs since MPI al-
lows messages to cross barriers. These messages would not
be saved with the global checkpoint. Moreover, new data-
driven programming styles are eschewing the global barri-
ers, ubiquitous in BSP-style bulk-synchronous programs, in
favor of fine-grain, data-oriented synchronization. Such pro-
grams may not have barriers, and there may be no safe places
in the code in which barriers can be inserted without creating deadlocks.

[Figure 2: System Architecture. At compile time, the fault-tolerance precompiler transforms the application source and the native compiler builds the result into a fault-tolerant application; at run time, the fault-tolerant application processes run on top of a protocol layer, the MPI library, and the hardware.]
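Returning to the barrier-based scheme above, a rough sketch of the roll-your-own approach follows; save_local_state is a hypothetical per-process routine, and the final comment marks the flaw that motivates the rest of this paper.

#include <mpi.h>

void save_local_state(int step);      /* hypothetical per-process state saver */

/* Sketch of a blocking, barrier-coordinated checkpoint (the roll-your-own
 * approach, not the paper's protocol). */
void checkpoint_at_barrier(MPI_Comm comm, int step) {
    MPI_Barrier(comm);                /* bring all processes to a common point */
    save_local_state(step);           /* each process saves its own state      */
    MPI_Barrier(comm);                /* resume computation together           */

    /* Flaw: MPI allows a message to be sent before the first barrier and
     * received only after it (for example, an MPI_Isend whose matching
     * receive is posted later). Such a message appears in neither the
     * sender's nor the receiver's saved state, so the global checkpoint
     * is not a consistent snapshot. */
}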
For these reasons, non-blocking coordinated checkpoint-
ing is an interesting alternative. In this approach, a global
coordination protocol, implemented by exchanging special
marker or control tokens, is used to orchestrate the saving
of the states of individual processes and the contents of cer-
tain messages, to provide a global snapshot of the computa-
tion from which the computation can be restarted. A distin-
guished process called the initiator is responsible for initiat-
ing and monitoring the protocol; to take a local checkpoint,
an application process may communicate with other applica-
tion processes but it makes no assumptions about the states
of other processes. The Chandy-Lamport protocol is perhaps
the most well-known non-blocking protocol [4]. Unfortu-
nately, these protocols were designed to work with system-
level checkpointing; as we discuss in Section 3, there are
fundamental difficulties in using them for application-level
checkpointing.
Therefore, we have developed a new protocol for non-
blocking coordination that works smoothly with application-
level state-saving.
1.3 Overview of our approach
In this paper, we discuss the use of compiler technology
to implement application-level, coordinated, non-blocking
checkpointing of MPI programs.
Figure 2 is an overview of our approach. The CCIFT
(Cornell Compiler for Inserting Fault-Tolerance) precom-
piler reads almost unmodified single-threaded C/MPI source
files and instruments them to perform application-level
state-saving; the only additional requirement for the pro-
grammer is that he insert calls to a function called
PotentialCheckpoint at points in the application
where the programmer wants checkpointing to occur. We
have not yet implemented optimizations to reduce the
amount of state that is saved, so the instrumented code saves
the entire state when it takes a checkpoint. The output of
this precompiler is compiled with the native compiler on the
hardware platform, and is linked with a library that consti-
tutes a protocol layer for implementing the non-blocking co-
ordination. This layer sits between the application and the
MPI layer, and intercepts all calls from the instrumented ap-
plication program to the MPI library.¹
This design permits us to implement the coordination
protocol without modifying the underlying MPI library,
which promotes modularity and eliminates the need for ac-
cess to MPI library code which is proprietary on some sys-
tems. Further, it allows us to easily migrate from one MPI
implementation to another.
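To make the programmer's one obligation concrete, the following sketch shows an instrumented main loop; the excerpt does not give the exact signature of PotentialCheckpoint, so a no-argument call is assumed.

/* Sketch of the only source change the programmer must make: marking
 * places where a checkpoint may be taken. The signature shown for
 * PotentialCheckpoint() is an assumption. */
#include <mpi.h>

void PotentialCheckpoint(void);   /* provided by the CCIFT runtime (assumed) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    for (int step = 0; step < 1000000; step++) {
        /* ... compute and exchange data with MPI_Send/MPI_Recv ... */

        /* The precompiler-generated code decides, in cooperation with the
         * protocol layer, whether a checkpoint is actually taken here. */
        PotentialCheckpoint();
    }
    MPI_Finalize();
    return 0;
}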
The rest of this paper is organized as follows. We intro-
duce some notation and terminology in Section 2. In Sec-
tion 3, we discuss the main hurdles that must be overcome
to implement our solution, and argue that the coordination
protocols in the literature cannot be used for our problem.
In Section 4, we present our solutions to these problems.
In particular, we describe a new coordination protocol that
supports application-level checkpointing. We have im-
plemented this approach on a Windows 2000 cluster at the
Cornell Theory Center. In Section 5, we discuss how we
save and restore the state of the application and the MPI li-
brary. In Section 6, we measure the performance overheads
of our approach by running a number of small benchmarks
on this platform. The full paper will present more detailed
measurements of these and larger benchmarks. We conclude
in Section 7 with a discussion of future work.
2 Terminology
In this section, we introduce the terminology and notation
used in the rest of the paper. Following usual practice, we as-
sume that the system does not initiate the creation of a global
checkpoint before all previous global checkpoints have been
created and committed to global storage.
The execution of an application process can therefore be
divided into a succession of epochs where an epoch is the
period between two successive local checkpoints (by con-
vention, the start of the program is assumed to begin the first
epoch). Epochs are labeled successively by integers starting
at zero, as shown in Figure 3.
It is convenient to classify an application message into
three categories depending on the epoch numbers of the
sending and receiving processes at the points in the appli-
cation program execution when the message is sent and re-
ceived respectively.
¹ Note that MPI can bypass the protocol layer to read and write message buffers in the application space directly. Such manipulations, however, are not invisible to the protocol layer. MPI may not begin to access a message buffer until after it has been given specific permission to do so by the application (e.g. via a call to MPI_Irecv). Similarly, once the application has granted such permission to MPI, it should not access that buffer until MPI has informed it that doing so is safe (e.g. with the return of a call to MPI_Wait). The calls to, and returns from, those functions are intercepted by the protocol layer.
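The excerpt does not say how this interception is implemented; one standard mechanism that would work without modifying the MPI library is the MPI profiling interface (PMPI). A minimal sketch of that approach, with hypothetical record_* bookkeeping routines:

/* Sketch of interposing on MPI calls via the standard profiling interface
 * (PMPI). The record_* bookkeeping functions are hypothetical. */
#include <mpi.h>

extern void record_recv_request(void *buf, MPI_Request *req);
extern void record_delivery(MPI_Request *req);

int MPI_Irecv(void *buf, int count, MPI_Datatype type, int src,
              int tag, MPI_Comm comm, MPI_Request *req) {
    /* The application has now granted MPI permission to write into buf. */
    int rc = PMPI_Irecv(buf, count, type, src, tag, comm, req);
    record_recv_request(buf, req);
    return rc;
}

int MPI_Wait(MPI_Request *req, MPI_Status *status) {
    int rc = PMPI_Wait(req, status);
    /* The message (if any) has now been delivered to the application,
     * which is the delivery point used in the message classification. */
    record_delivery(req);
    return rc;
}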

[Figure 3: Epochs and message classification. Execution traces of three processes P, Q, and R; local checkpoints (marked x) divide each trace into epochs 0, 1, and 2 and together form Global Checkpoint 1 and Global Checkpoint 2; example early, intra-epoch, and late messages are shown.]
Definition 1 Given an application message from process A to process B, let e_s be the epoch number of A at the point in the application program execution when the send command is executed, and let e_r be the epoch number of B at the point when the message is delivered to the application.

Late message: If e_s < e_r, the message is said to be a late message.

Intra-epoch message: If e_s = e_r, the message is said to be an intra-epoch message.

Early message: If e_s > e_r, the message is said to be an early message.
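If the protocol layer piggybacks the sender's epoch number on each application message (a plausible mechanism, not spelled out in this excerpt), the receiver can classify a message at delivery time by a direct comparison:

/* Sketch: classifying a delivered message by comparing the sender's
 * piggybacked epoch number with the receiver's current epoch. */
enum msg_class { LATE_MSG, INTRA_EPOCH_MSG, EARLY_MSG };

enum msg_class classify(int sender_epoch, int receiver_epoch) {
    if (sender_epoch < receiver_epoch)  return LATE_MSG;
    if (sender_epoch == receiver_epoch) return INTRA_EPOCH_MSG;
    return EARLY_MSG;                   /* sender_epoch > receiver_epoch */
}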
Figure 3 shows examples of the three kinds of messages,
using the execution trace of three processes named P, Q, and R. MPI has several kinds of send and receive commands,
so it is important to understand what the message arrows
mean in the context of MPI programs. The source of the
arrow represents the point in the execution of the sending
process at which control returns from the MPI routine that
was invoked to send this message. Note that if this routine
is a non-blocking send, the message may not make it to the
communication network until much later in execution; nev-
ertheless, what is important for us is that if the system tries to
recover from global checkpoint 2, it will not reissue the MPI
send. Similarly, the destination of the arrow represents the
delivery of the message to the application program. In par-
ticular, if an MPI_Irecv is used by the receiving process to
get the message, the destination of the arrow represents not
the point where control returns from the MPI_Irecv rou-
tine, but the point at which an MPI_Wait for the message
would have returned.
In the literature, late messages are sometimes called in-
flight messages, and early messages are sometimes called in-
consistent messages. This terminology was developed in the
context of system-level checkpointing protocols but in our
opinion, it is misleading in the context of application-level
checkpointing.
3 Difficulties in Application-level Checkpointing
of MPI programs
In this section, we describe the difficulties with imple-
menting application-level, coordinated, non-blocking check-
pointing for MPI programs. In particular, we argue that the
existing protocols for non-blocking parallel checkpointing,
which were designed for system-level checkpointers, are not
suitable when the state saving occurs at the application level.
3.1 Delayed state-saving
A fundamental difference between system-level checkpoint-
ing and application-level checkpointing is that a system-
level checkpoint may be taken at any time during a pro-
gram’s execution, while an application-level checkpoint
can only be taken when a program executes Poten-
tialCheckpoint calls.
System-level checkpointing protocols, such as the
Chandy-Lamport distributed snapshot protocol, exploit this
flexibility with checkpoint scheduling to avoid the creation of early messages: during the creation of a global checkpoint, a process must take its local checkpoint before it can read any message that another process sent after taking its own checkpoint. This strategy does not work for application-level checkpointing, because a process might need to receive an early message before it can arrive at a point where it may take a checkpoint.
Therefore, unlike system-level checkpointing protocols,
application-level checkpointing protocols must handle both
late and early messages.
3.2 Handling late and early messages
We use Figure 3 to illustrate the issues associated with late
and early messages. Suppose that one of the processes in
this figure fails after the taking of Global Checkpoint 2. On
restart, each process will resume execution from its state as saved in the checkpoint. For the receiver of the late message in the figure to recover correctly, it must obtain the late message that was sent to it prior to the failure. However, the sender will not resend this message, because the send occurred before the sender took its checkpoint. Therefore, we need mechanisms for (i) identifying late messages and saving them along with the global checkpoint, and (ii) replaying these messages to the receiving process during recovery. Late messages must be handled
by system-level checkpointing protocols as well.
Early messages, such as the early message in the figure, pose a different problem. The receiver got this message before taking its checkpoint, so after recovery it does not expect the message to be resent. For the application to be correct, therefore, the sender must suppress resending this message. To handle this, we need mechanisms for (i)
identifying early messages, and (ii) ensuring that they are
not resent during recovery.
Early messages also pose a separate and more subtle
problem. The saved state of the receiver at Global Checkpoint 2 may depend on the data contained in the early message. If that data was, for example, a random number generated by the sender, the receiver's state would depend on a non-deterministic event at the sender. If the number was generated after the sender took its checkpoint, then on restart the two processes may disagree on its value.
In general, we must ensure that if a global checkpoint de-
pends on a non-deterministic event, that event will re-occur
after restart. Therefore, mechanisms are needed to (i) log the
non-deterministic events that a global checkpoint depends
on, so that (ii) these events can be replayed during recovery.
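One plausible realization of these mechanisms, not necessarily the paper's exact protocol, is for the protocol layer to buffer the payloads of late messages with the receiver's checkpoint and to record early-message deliveries so that the corresponding sends can be suppressed and the data they carried reproduced on restart. A sketch with hypothetical data structures:

/* Sketch of the bookkeeping the protocol layer needs at delivery time
 * (hypothetical data structures; not the paper's implementation). */
#include <stdlib.h>
#include <string.h>

struct logged_msg {
    int source, tag, len;
    char *data;
    struct logged_msg *next;
};
static struct logged_msg *late_log;    /* saved with the local checkpoint  */
static struct logged_msg *early_log;   /* records data the checkpoint may
                                          depend on and which sends must
                                          be suppressed on restart         */

static void log_msg(struct logged_msg **list, int src, int tag,
                    const void *buf, int len) {
    struct logged_msg *m = malloc(sizeof *m);
    m->source = src; m->tag = tag; m->len = len;
    m->data = malloc(len);
    memcpy(m->data, buf, len);
    m->next = *list; *list = m;
}

void on_delivery(int src, int tag, const void *buf, int len,
                 int sender_epoch, int receiver_epoch) {
    if (sender_epoch < receiver_epoch)       /* late: replay to receiver   */
        log_msg(&late_log, src, tag, buf, len);
    else if (sender_epoch > receiver_epoch)  /* early: suppress resend and
                                                record non-determinism     */
        log_msg(&early_log, src, tag, buf, len);
}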
3.3 Non-FIFO message delivery at application
level
Many system-level protocols assume that the communica-
tion between a pair of processes behaves in a FIFO manner.
For example, in the Chandy-Lamport protocol, a process that takes a checkpoint sends a marker token to the other processes, informing them of what it has done. The protocol relies on the FIFO assumption to ensure that these other processes receive this token before they can receive any message sent by the checkpointing process after it took its checkpoint.
In an MPI application, a process can use tag matching to receive messages from another process in a different order than that in which they were sent.
application-level, as would be the case for application-level
checkpointing, cannot assume FIFO communication. It is
important to note that this problem has nothing to do with the
FIFO behavior (or lack thereof) of the underlying communication
system; rather, it is a property of a particular application.
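A small example of why FIFO cannot be assumed at the application level: the receiver below deliberately delivers the second message before the first, purely through tag matching, so a marker sent between the two messages could be overtaken (ranks and tags are illustrative):

/* Messages can be delivered to the application out of send order. */
#include <mpi.h>

void sender(void) {                       /* rank 0 (illustrative) */
    int a = 1, b = 2;
    MPI_Request r[2];
    MPI_Isend(&a, 1, MPI_INT, 1, /*tag=*/10, MPI_COMM_WORLD, &r[0]);
    MPI_Isend(&b, 1, MPI_INT, 1, /*tag=*/20, MPI_COMM_WORLD, &r[1]);
    MPI_Waitall(2, r, MPI_STATUSES_IGNORE);
}

void receiver(void) {                     /* rank 1 (illustrative) */
    int a, b;
    /* Receives the second message first by asking for tag 20 ... */
    MPI_Recv(&b, 1, MPI_INT, 0, 20, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... so a marker inserted between the two sends would be overtaken. */
    MPI_Recv(&a, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}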
3.4 Collective communication
The MPI standard includes collective communications func-
tions such as MPI_Bcast and MPI_Alltoall, which
involve the exchange of data among a number of proces-
sors. However, most checkpointing protocols in the litera-
ture, which were designed in the context of distributed com-
puting, ignore the issue of collective communication.
The difficulty presented by such functions occurs when
some processes make a collective communication call be-
fore taking their checkpoints, and others after. We need to
ensure that on restart, the processes that reexecute the calls
do not deadlock and receive correct information. Further-
more, MPI_Barrier guarantees specific synchronization
semantics, which must be preserved on restart.
3.5 Problems Checkpointing MPI Library State
The key issue in performing application-level checkpointing
of the state of the MPI library is that we do not assume access to its source code. While it would be possi-
ble for us to add application-level checkpointing methods to
an existing MPI implementation, this would limit the porta-
bility of our checkpointer and would keep the programmer
from using vendor-provided, platform-optimized implemen-
tations of MPI. Thus, our problem is to record and recover
the state of the MPI library using only the MPI interface.
The library state can be broken up into three categories:
Library message buffers. At the application level,
messages are invisible until they are received by the ap-
plication. Therefore, at checkpoint time, the applica-
tion cannot distinguish whether a given message is sit-
ting in a network buffer on the sending processor, being
transmitted, or sitting in a network buffer on the desti-
nation processor. All such messages are equivalently
“in-flight” from the application’s perspective. There-
fore, we do not need to checkpoint the library’s com-
munication buffers.
MPI’s opaque objects. Such objects are internal to the MPI library but are visible to the application via handles. These objects include request ob-
jects (MPI_Request), communicators (MPI_Comm),
groups (MPI_Group), data types (MPI_Datatype),
error handlers (MPI_Errhandler), user defined op-
erators (MPI_Op), and key-value pairs.
State internal to the MPI library. There is certain
state in the MPI library, such as message queues, timers
and the network addresses of processors, that is com-
pletely hidden from the application. Since this state cannot
be manipulated via MPI’s interface, it is impossible for
us to save or restore it. However, this is not required for
correctness. All that is required is that the application’s
view of the library remains consistent before and after
restart.
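For opaque objects, one way a checkpointer restricted to the MPI interface could restore them is to record the calls that created them and replay those calls on restart. The paper does not detail its mechanism in this excerpt; the following is a sketch for communicators created with MPI_Comm_split, with hypothetical bookkeeping:

/* Sketch: restoring an MPI opaque object (a communicator) on restart by
 * replaying the call that created it (bookkeeping names are hypothetical). */
#include <mpi.h>

struct comm_record { int color, key; MPI_Comm handle; };
static struct comm_record split_log[64];
static int nsplits;

int ckpt_comm_split(MPI_Comm parent, int color, int key, MPI_Comm *out) {
    int rc = MPI_Comm_split(parent, color, key, out);
    /* Remember enough to re-create the communicator after a restart. */
    split_log[nsplits].color = color;
    split_log[nsplits].key = key;
    split_log[nsplits].handle = *out;
    nsplits++;
    return rc;
}

void replay_comm_creation(MPI_Comm parent) {
    /* On restart, re-execute the recorded calls so the application's
     * handles refer to valid communicators again. */
    for (int i = 0; i < nsplits; i++)
        MPI_Comm_split(parent, split_log[i].color, split_log[i].key,
                       &split_log[i].handle);
}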
4 A Non-Blocking, Coordinated Protocol for
Application-level Checkpointing
We now describe the coordination protocol for global check-
pointing. The protocol is independent of the technique used
by processes to take local checkpoints. To avoid complicat-
ing the presentation, we first describe the protocol for point-
to-point communication only. Then, we show that collective
communication can be handled elegantly using the mecha-
nism in place for point-to-point communication.
4.1 High-level description of protocol
Phase #1 To initiate a distributed snapshot, the initiator
sends a control message called pleaseCheckpoint to all ap-
plication processes. Each application process must take a
local checkpoint at some time after it receives this request,
but it is free to send and receive as many messages as it likes
between the time it is asked to take a checkpoint and when it
actually complies with this request.
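A sketch of how Phase #1 might look from the initiator's side, with the control token sent out of band on a reserved tag; the tag value, payload, and routine name are assumptions, since the excerpt does not specify the wire format:

/* Sketch of Phase #1: the initiator asks every process to checkpoint.
 * The reserved tag and the single-int payload are assumptions. */
#include <mpi.h>

#define CTRL_TAG 32000              /* reserved for protocol control tokens */
enum ctrl_token { PLEASE_CHECKPOINT = 1 };

void initiate_global_checkpoint(MPI_Comm comm) {
    int nprocs, me, token = PLEASE_CHECKPOINT;
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &me);
    /* Point-to-point sends rather than a collective, so that application
     * processes remain free to send and receive application messages
     * until they actually reach a point where they can checkpoint. */
    for (int dest = 0; dest < nprocs; dest++)
        if (dest != me)             /* the initiator notifies itself locally */
            MPI_Send(&token, 1, MPI_INT, dest, CTRL_TAG, comm);
}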
Phase #2 When an application process reaches a point in
the program where it can take a local checkpoint, it saves its
local state and the identities of any early messages on stable storage.

References

N. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.

Message Passing Interface Forum. MPI: A Message-Passing Interface Standard.

K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, 3(1), 1985.

E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3), 2002.

J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: transparent checkpointing under Unix. USENIX Winter Technical Conference, 1995.