Automated Application-level Checkpointing of MPI Programs
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
Department of Computer Science,
Cornell University, Ithaca, NY 14853
Abstract
Because of increasing hardware and software complexity,
the running time of many computational science applica-
tions is now more than the mean-time-to-failure of high-
performance computing platforms. Therefore, computational
science applications need to tolerate hardware failures.
In this paper, we focus on the stopping failure model in
which a faulty process hangs and stops responding to the
rest of the system. We argue that tolerating such faults is
best done by an approach called application-level coordi-
nated non-blocking checkpointing, and that existing fault-
tolerance protocols in the literature are not suitable for im-
plementing this approach.
In this paper, we present a suitable protocol, and show
how it can be used with a precompiler that instruments
C/MPI programs to save application and MPI library state.
An advantage of our approach is that it is independent of the
MPI implementation. We present experimental results that
argue that the overhead of using our system can be small.
1 Introduction
Fault-tolerant programming has been studied extensively in
the context of distributed systems [6]. In contrast, the high-
performance parallel computing community has not devoted
much attention to this problem because hardware failures in
parallel platforms were not frequent enough to be a cause
for concern. Most high-performance computing was done
on “big-iron platforms”: monolithic vector or parallel com-
puters that were designed, built, and maintained by a single
vendor. Because these machines cost many millions of dol-
lars, vendors could afford to design reliable components and
integrate them carefully to produce relatively robust com-
puting platforms. Moreover, unlike distributed systems pro-
grams such as air-traffic control systems that must run with-
out stopping, most computational science programs ran for
durations that were much less than the mean-time-between-
failure (MTBF) of the underlying hardware.
This work was supported by NSF grants ACI-9870687, EIA-9972853, ACI-0085969, ACI-0090217, ACI-0103723, and ACI-0121401.
[Figure 1: Hierarchy of different fault tolerance techniques. The hierarchy divides into checkpointing (uncoordinated, or coordinated: blocking or non-blocking) and message logging (pessimistic, optimistic, or causal).]
Recent changes in the high-performance parallel comput-
ing world are bringing the issue of fault-tolerance to the front
and center. First, the number of processors in big-iron ma-
chines is increasing rapidly: the recently announced Blue
Gene/L will have over 130,000 [18]. Anecdotal evidence
is that such a machine loses a processor every few hours;
increasing the number of processors increases the overall
performance, but it also increases the number of points of
failure. Second, parallel computing is shifting from ex-
pensive monolithic hardware systems to low-cost, custom-
assembled clusters of processors and communication fab-
ric. The recent trend towards Internet-wide grid-computing
is another change in the hardware picture that increases the
probability of hardware failures during program execution.
Third, many computational science programs are now de-
signed to run for days or even months at a time; some exam-
ples are the ASCI stockpile certification programs [13] and ab initio protein-folding programs such as IBM’s Blue Gene [9] codes, which are intended to run for months.
Therefore, the running times of many applications are
now significantly longer than the MTBF of the underlying
hardware. Computational science programs must tolerate
hardware failures.
1.1 Problem Definition
To address this problem, it is necessary to define the fault
model. Two common classes of models are Stopping and

Byzantine [11]. In a Stopping model, a faulty process hangs
and stops
responding to the rest of the system, neither send-
ing nor receiving messages. Byzantine faults permit a faulty
process to perform more damaging acts such as sending cor-
rupted data to other processes.
In this paper, we focus our attention on stopping pro-
cesses. As we discuss in this paper, there are many inter-
esting problems to be solved even in this restricted domain.
Moreover, a good solution for this failure model can be a
useful mechanism in addressing the more general problem
of Byzantine faults.
In general, good abstractions are key to effective han-
dling of failures. In this spirit, we make the standard assump-
tion that there is a reliable transport layer for delivering ap-
plication messages, and we build our solutions on top of that
abstraction. One such reliable implementation of the MPI
communication library is Los Alamos MPI (LA-MPI) [7].
We can now state the problem we address in this paper.
We are given a long-running MPI program that must run on
a machine that has (i) a reliable message delivery system, (ii)
unreliable processors which can fail silently at any time, and
(iii) a mechanism such as a distributed failure detector [8]
for detecting failed processes. How do we ensure that the
program makes progress inspite of these faults?
1.2 Solution space
Figure 1 classifies some of the ways in which programs can
be made fault-tolerant. An excellent survey of these tech-
niques can be found in [6].
Checkpointing techniques periodically save a description
of the state of a computation to stable storage; if any process
fails, all processes are rolled back to the last checkpoint, and
the computation is restarted from there. Message-logging
techniques in contrast require restarting only the computa-
tion performed by the failed process. Surviving processes
are not rolled back but must help the restarted process by re-
playing messages that were sent to it before it failed. The
simplest implementation of message logging requires every
process to save a copy of every message it sends. A more
sophisticated approach might try to regenerate messages on
demand using approaches like reversible computation. Al-
though message-logging is a very appealing idea which has
been studied intensively by the distributed systems commu-
nity [5, 10, 16], our experience is that the overhead of sav-
ing or regenerating messages tends to be so overwhelming
that the technique is not competitive in practice. This may
be because parallel programs communicate more data more
frequently than distributed programs [17].
We therefore focus on checkpointing.
Checkpointing techniques can be classified along two in-
dependent dimensions.
(1) The first dimension is the abstraction level at which
the state of a process is saved. In system-level checkpoint-
ing, the bits that constitute the state of the process, such as the contents of the program counter, registers, and memory, are saved on stable storage. Examples of systems that do system-level checkpointing are Condor [12] and Libckpt [14].
Some systems like Starfish [1] give the programmer some control over what is saved. Unfortunately, complete system-
level checkpointing of parallel machines with thousands of
processors can be impractical because each system check-
point can require thousands of nodes sending terabytes of
data to stable storage. For this reason, system-level check-
pointing is not done on large machines such as the IBM Blue
Gene or the ASCI machines.
One alternative which is popular is application-level
checkpointing. Applications can obtain fault-tolerance by
providing their own checkpointing code [3]. The application
is written such that it correctly restarts from various posi-
tions in the code by storing certain information to a restart
file. The benefit of this technique is that the programmer need only save the minimum amount of data necessary
to recover the program state. For example, in an ab initio
protein folding code, it suffices to save the positions and ve-
locities of the various bases, which is a small fraction of the
total state of the parallel system. The disadvantage of this
approach to implementing application-level checkpointing
is that it complicates the coding of the application program,
and it is one more chore for the parallel programmer.
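For concreteness, a hand-rolled application-level checkpoint for a simple time-stepping code might look like the following sketch; the restart file name, array sizes, and variables are hypothetical and not taken from the paper.

/* A minimal sketch of hand-rolled application-level checkpointing
 * (hypothetical file name and variables; not the paper's system). */
#include <stdio.h>

#define N 1024                 /* number of particles (assumed) */
static double pos[N], vel[N];  /* the only state we choose to save */

/* Save just enough data to restart: step counter, positions, velocities. */
static void save_checkpoint(int step) {
    FILE *f = fopen("restart.dat", "wb");
    if (!f) { perror("restart.dat"); return; }
    fwrite(&step, sizeof step, 1, f);
    fwrite(pos, sizeof(double), N, f);
    fwrite(vel, sizeof(double), N, f);
    fclose(f);
}

/* Returns the step to resume from, or 0 if no restart file exists. */
static int try_restore(void) {
    FILE *f = fopen("restart.dat", "rb");
    int step = 0;
    if (!f) return 0;
    fread(&step, sizeof step, 1, f);
    fread(pos, sizeof(double), N, f);
    fread(vel, sizeof(double), N, f);
    fclose(f);
    return step;
}

int main(void) {
    int step = try_restore();
    for (; step < 1000000; step++) {
        /* ... advance positions and velocities ... */
        if (step % 10000 == 0) save_checkpoint(step);
    }
    return 0;
}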
In this paper, we explore the use of compiler technology
to automate application-level checkpointing.
(2) The second dimension along which checkpointing
techniques can be classified is the technique used to coor-
dinate parallel processes when checkpoints need to be taken.
In uncoordinated checkpointing, each process saves its state
whenever it wants to without coordinating with other pro-
cesses. Although this is simple, restart can be problematic
due to exponential rollback, which may cause the computa-
tion to roll so far back that it makes no progress [6]. For this
reason, uncoordinated checkpointing has fallen out of favor.
Coordinated checkpointing can be divided into block-
ing and non-blocking checkpointing. Blocking techniques
bring all processes to a stop before taking a global check-
point. Hardware blocking was used on the IBM SP-2 to take
system-level checkpoints. Software blocking techniques ex-
ploit barriers - when processes reach a global barrier, each
one saves its own state on stable storage. This is essentially
the solution used today by applications programmers who
roll their own application-level state-saving code. However,
this solution can fail for some MPI programs since MPI al-
lows messages to cross barriers. These messages would not
be saved with the global checkpoint. Moreover, new data-
driven programming styles are eschewing the global barri-
ers, ubiquitous in BSP-style bulk-synchronous programs, in
favor of fine-grain, data-oriented synchronization. Such pro-
grams may not have barriers, and there may be no safe places
in the code in which barriers can be inserted without creating deadlocks.

[Figure 2: System Architecture. At compile time, the fault-tolerance precompiler transforms the application source and the native compiler builds the result into a fault-tolerant application; at run time, the fault-tolerant application processes run on top of a protocol layer, the MPI library, and the hardware.]
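Returning to the barrier-based scheme above, a rough sketch of the roll-your-own approach follows; save_local_state is a hypothetical per-process routine, and the final comment marks the flaw that motivates the rest of this paper.

#include <mpi.h>

void save_local_state(int step);      /* hypothetical per-process state saver */

/* Sketch of a blocking, barrier-coordinated checkpoint (the roll-your-own
 * approach, not the paper's protocol). */
void checkpoint_at_barrier(MPI_Comm comm, int step) {
    MPI_Barrier(comm);                /* bring all processes to a common point */
    save_local_state(step);           /* each process saves its own state      */
    MPI_Barrier(comm);                /* resume computation together           */

    /* Flaw: MPI allows a message to be sent before the first barrier and
     * received only after it (for example, an MPI_Isend whose matching
     * receive is posted later). Such a message appears in neither the
     * sender's nor the receiver's saved state, so the global checkpoint
     * is not a consistent snapshot. */
}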
For these reasons, non-blocking coordinated checkpoint-
ing is an interesting alternative. In this approach, a global
coordination protocol, implemented by exchanging special
marker or control tokens, is used to orchestrate the saving
of the states of individual processes and the contents of cer-
tain messages, to provide a global snapshot of the computa-
tion from which the computation can be restarted. A distin-
guished process called the initiator is responsible for initiat-
ing and monitoring the protocol; to take a local checkpoint,
an application process may communicate with other applica-
tion processes but it makes no assumptions about the states
of other processes. The Chandy-Lamport protocol is perhaps
the most well-known non-blocking protocol [4]. Unfortu-
nately, these protocols were designed to work with system-
level checkpointing; as we discuss in Section 3, there are
fundamental difficulties in using them for application-level
checkpointing.
Therefore, we have developed a new protocol for non-
blocking coordination that works smoothly with application-
level state-saving.
1.3 Overview of our approach
In this paper, we discuss the use of compiler technology
to implement application-level, coordinated, non-blocking
checkpointing of MPI programs.
Figure 2 is an overview of our approach. The CCIFT
(Cornell Compiler for Inserting Fault-Tolerance) precom-
piler reads almost unmodified single-threaded C/MPI source
files and instruments them to perform application-level
state-saving; the only additional requirement for the pro-
grammer is that he insert calls to a function called
PotentialCheckpoint at points in the application
where the programmer wants checkpointing to occur. We
have not yet implemented optimizations to reduce the
amount of state that is saved, so the instrumented code saves
the entire state when it takes a checkpoint. The output of
this precompiler is compiled with the native compiler on the
hardware platform, and is linked with a library that consti-
tutes a protocol layer for implementing the non-blocking co-
ordination. This layer sits between the application and the
MPI layer, and intercepts all calls from the instrumented ap-
plication program to the MPI library.¹
This design permits us to implement the coordination
protocol without modifying the underlying MPI library,
which promotes modularity and eliminates the need for ac-
cess to MPI library code which is proprietary on some sys-
tems. Further, it allows us to easily migrate from one MPI
implementation to another.
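To make the programmer's one obligation concrete, the following sketch shows an instrumented main loop; the excerpt does not give the exact signature of PotentialCheckpoint, so a no-argument call is assumed.

/* Sketch of the only source change the programmer must make: marking
 * places where a checkpoint may be taken. The signature shown for
 * PotentialCheckpoint() is an assumption. */
#include <mpi.h>

void PotentialCheckpoint(void);   /* provided by the CCIFT runtime (assumed) */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    for (int step = 0; step < 1000000; step++) {
        /* ... compute and exchange data with MPI_Send/MPI_Recv ... */

        /* The precompiler-generated code decides, in cooperation with the
         * protocol layer, whether a checkpoint is actually taken here. */
        PotentialCheckpoint();
    }
    MPI_Finalize();
    return 0;
}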
The rest of this paper is organized as follows. We intro-
duce some notation and terminology in Section 2. In Sec-
tion 3, we discuss the main hurdles that must be overcome
to implement our solution, and argue that the coordination
protocols in the literature cannot be used for our problem.
In Section 4, we present our solutions to these problems.
In particular, we describe a new coordination protocol that
supports application-level checkpointing. We have im-
plemented this approach on a Windows 2000 cluster at the
Cornell Theory Center. In Section 5, we discuss how we
save and restore the state of the application and the MPI li-
brary. In Section 6, we measure the performance overheads
of our approach by running a number of small benchmarks
on this platform. The full paper will present more detailed
measurements of these and larger benchmarks. We conclude
in Section 7 with a discussion of future work.
2 Terminology
In this section, we introduce the terminology and notation
used in the rest of the paper. Following usual practice, we as-
sume that the system does not initiate the creation of a global
checkpoint before all previous global checkpoints have been
created and committed to global storage.
The execution of an application process can therefore be
divided into a succession of epochs where an epoch is the
period between two successive local checkpoints (by con-
vention, the start of the program is assumed to begin the first
epoch). Epochs are labeled successively by integers starting
at zero, as shown in Figure 3.
It is convenient to classify an application message into
three categories depending on the epoch numbers of the
sending and receiving processes at the points in the appli-
cation program execution when the message is sent and re-
ceived respectively.
¹ Note that MPI can bypass the protocol layer to read and write message buffers in the application space directly. Such manipulations, however, are not invisible to the protocol layer. MPI may not begin to access a message buffer until after it has been given specific permission to do so by the application (e.g. via a call to MPI_Irecv). Similarly, once the application has granted such permission to MPI, it should not access that buffer until MPI has informed it that doing so is safe (e.g. with the return of a call to MPI_Wait). The calls to, and returns from, those functions are intercepted by the protocol layer.
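The excerpt does not say how this interception is implemented; one standard mechanism that would work without modifying the MPI library is the MPI profiling interface (PMPI). A minimal sketch of that approach, with hypothetical record_* bookkeeping routines:

/* Sketch of interposing on MPI calls via the standard profiling interface
 * (PMPI). The record_* bookkeeping functions are hypothetical. */
#include <mpi.h>

extern void record_recv_request(void *buf, MPI_Request *req);
extern void record_delivery(MPI_Request *req);

int MPI_Irecv(void *buf, int count, MPI_Datatype type, int src,
              int tag, MPI_Comm comm, MPI_Request *req) {
    /* The application has now granted MPI permission to write into buf. */
    int rc = PMPI_Irecv(buf, count, type, src, tag, comm, req);
    record_recv_request(buf, req);
    return rc;
}

int MPI_Wait(MPI_Request *req, MPI_Status *status) {
    int rc = PMPI_Wait(req, status);
    /* The message (if any) has now been delivered to the application,
     * which is the delivery point used in the message classification. */
    record_delivery(req);
    return rc;
}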

[Figure 3: Epochs and message classification. Execution traces of three processes P, Q, and R; local checkpoints (marked x) divide each trace into epochs 0, 1, and 2 and together form Global Checkpoint 1 and Global Checkpoint 2; example early, intra-epoch, and late messages are shown.]
Definition 1 Given an application message from process A to process B, let e_s be the epoch number of A at the point in the application program execution when the send command is executed, and let e_r be the epoch number of B at the point when the message is delivered to the application.

Late message: If e_s < e_r, the message is said to be a late message.

Intra-epoch message: If e_s = e_r, the message is said to be an intra-epoch message.

Early message: If e_s > e_r, the message is said to be an early message.
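If the protocol layer piggybacks the sender's epoch number on each application message (a plausible mechanism, not spelled out in this excerpt), the receiver can classify a message at delivery time by a direct comparison:

/* Sketch: classifying a delivered message by comparing the sender's
 * piggybacked epoch number with the receiver's current epoch. */
enum msg_class { LATE_MSG, INTRA_EPOCH_MSG, EARLY_MSG };

enum msg_class classify(int sender_epoch, int receiver_epoch) {
    if (sender_epoch < receiver_epoch)  return LATE_MSG;
    if (sender_epoch == receiver_epoch) return INTRA_EPOCH_MSG;
    return EARLY_MSG;                   /* sender_epoch > receiver_epoch */
}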
Figure 3 shows examples of the three kinds of messages,
using the execution trace of three processes named P, Q, and R. MPI has several kinds of send and receive commands,
so it is important to understand what the message arrows
mean in the context of MPI programs. The source of the
arrow represents the point in the execution of the sending
process at which control returns from the MPI routine that
was invoked to send this message. Note that if this routine
is a non-blocking send, the message may not make it to the
communication network until much later in execution; nev-
ertheless, what is important for us is that if the system tries to
recover from global checkpoint 2, it will not reissue the MPI
send. Similarly, the destination of the arrow represents the
delivery of the message to the application program. In par-
ticular, if an MPI_Irecv is used by the receiving process to
get the message, the destination of the arrow represents not
the point where control returns from the MPI_Irecv rou-
tine, but the point at which an MPI_Wait for the message
would have returned.
In the literature, late messages are sometimes called in-
flight messages, and early messages are sometimes called in-
consistent messages. This terminology was developed in the
context of system-level checkpointing protocols but in our
opinion, it is misleading in the context of application-level
checkpointing.
3 Difficulties in Application-level Checkpointing
of MPI programs
In this section, we describe the difficulties with imple-
menting application-level, coordinated, non-blocking check-
pointing for MPI programs. In particular, we argue that the
existing protocols for non-blocking parallel checkpointing,
which were designed for system-level checkpointers, are not
suitable when the state saving occurs at the application level.
3.1 Delayed state-saving
A fundamental difference between system-level checkpoint-
ing and application-level checkpointing is that a system-
level checkpoint may be taken at any time during a pro-
gram’s execution, while an application-level checkpoint
can only be taken when a program executes Poten-
tialCheckpoint calls.
System-level checkpointing protocols, such as the
Chandy-Lamport distributed snapshot protocol, exploit this
flexibility with checkpoint scheduling to avoid the creation of early messages: during the creation of a global checkpoint, a process must take its local checkpoint before it can read any message that another process sent after taking its own checkpoint. This strategy does not work for application-level checkpointing, because a process might need to receive an early message before it can arrive at a point where it may take a checkpoint.
Therefore, unlike system-level checkpointing protocols,
application-level checkpointing protocols must handle both
late and early messages.
3.2 Handling late and early messages
We use Figure 3 to illustrate the issues associated with late
and early messages. Suppose that one of the processes in
this figure fails after the taking of Global Checkpoint 2. On
restart, each process will resume execution from its state as saved in the checkpoint. For the receiver of the late message in the figure to recover correctly, it must obtain the late message that was sent to it prior to the failure. However, the sender will not resend this message, because the send occurred before the sender took its checkpoint. Therefore, we need mechanisms for (i) identifying late messages and saving them along with the global checkpoint, and (ii) replaying these messages to the receiving process during recovery. Late messages must be handled
by system-level checkpointing protocols as well.
Early messages, such as the early message in the figure, pose a different problem. The receiver got this message before taking its checkpoint, so after recovery it does not expect the message to be resent. For the application to be correct, therefore, the sender must suppress resending this message. To handle this, we need mechanisms for (i)
identifying early messages, and (ii) ensuring that they are
not resent during recovery.
Early messages also pose a separate and more subtle
problem. The saved state of the receiver at Global Checkpoint 2 may depend on the data contained in the early message. If that data was, for example, a random number generated by the sender, the receiver's state would depend on a non-deterministic event at the sender. If the number was generated after the sender took its checkpoint, then on restart the two processes may disagree on its value.
In general, we must ensure that if a global checkpoint de-
pends on a non-deterministic event, that event will re-occur
after restart. Therefore, mechanisms are needed to (i) log the
non-deterministic events that a global checkpoint depends
on, so that (ii) these events can be replayed during recovery.
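One plausible realization of these mechanisms, not necessarily the paper's exact protocol, is for the protocol layer to buffer the payloads of late messages with the receiver's checkpoint and to record early-message deliveries so that the corresponding sends can be suppressed and the data they carried reproduced on restart. A sketch with hypothetical data structures:

/* Sketch of the bookkeeping the protocol layer needs at delivery time
 * (hypothetical data structures; not the paper's implementation). */
#include <stdlib.h>
#include <string.h>

struct logged_msg {
    int source, tag, len;
    char *data;
    struct logged_msg *next;
};
static struct logged_msg *late_log;    /* saved with the local checkpoint  */
static struct logged_msg *early_log;   /* records data the checkpoint may
                                          depend on and which sends must
                                          be suppressed on restart         */

static void log_msg(struct logged_msg **list, int src, int tag,
                    const void *buf, int len) {
    struct logged_msg *m = malloc(sizeof *m);
    m->source = src; m->tag = tag; m->len = len;
    m->data = malloc(len);
    memcpy(m->data, buf, len);
    m->next = *list; *list = m;
}

void on_delivery(int src, int tag, const void *buf, int len,
                 int sender_epoch, int receiver_epoch) {
    if (sender_epoch < receiver_epoch)       /* late: replay to receiver   */
        log_msg(&late_log, src, tag, buf, len);
    else if (sender_epoch > receiver_epoch)  /* early: suppress resend and
                                                record non-determinism     */
        log_msg(&early_log, src, tag, buf, len);
}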
3.3 Non-FIFO message delivery at application
level
Many system-level protocols assume that the communica-
tion between a pair of processes behaves in a FIFO manner.
For example, in the Chandy-Lamport protocol, a process that takes a checkpoint sends a marker token to the other processes, informing them of what it has done. The protocol relies on the FIFO assumption to ensure that these other processes receive this token before they can receive any message sent by the checkpointing process after it took its checkpoint.
In an MPI application, a process can use tag matching to receive messages from another process in a different order than that in which they were sent.
application-level, as would be the case for application-level
checkpointing, cannot assume FIFO communication. It is
important to note that this problem has nothing to do with the
FIFO behavior (or lack thereof) of the underlying communication
system; rather, it is a property of a particular application.
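A small example of why FIFO cannot be assumed at the application level: the receiver below deliberately delivers the second message before the first, purely through tag matching, so a marker sent between the two messages could be overtaken (ranks and tags are illustrative):

/* Messages can be delivered to the application out of send order. */
#include <mpi.h>

void sender(void) {                       /* rank 0 (illustrative) */
    int a = 1, b = 2;
    MPI_Request r[2];
    MPI_Isend(&a, 1, MPI_INT, 1, /*tag=*/10, MPI_COMM_WORLD, &r[0]);
    MPI_Isend(&b, 1, MPI_INT, 1, /*tag=*/20, MPI_COMM_WORLD, &r[1]);
    MPI_Waitall(2, r, MPI_STATUSES_IGNORE);
}

void receiver(void) {                     /* rank 1 (illustrative) */
    int a, b;
    /* Receives the second message first by asking for tag 20 ... */
    MPI_Recv(&b, 1, MPI_INT, 0, 20, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* ... so a marker inserted between the two sends would be overtaken. */
    MPI_Recv(&a, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}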
3.4 Collective communication
The MPI standard includes collective communications func-
tions such as MPI_Bcast and MPI_Alltoall, which
involve the exchange of data among a number of proces-
sors. However, most checkpointing protocols in the litera-
ture, which were designed in the context of distributed com-
puting, ignore the issue of collective communication.
The difficulty presented by such functions occurs when
some processes make a collective communication call be-
fore taking their checkpoints, and others after. We need to
ensure that on restart, the processes that reexecute the calls
do not deadlock and receive correct information. Further-
more, MPI_Barrier guarantees specific synchronization
semantics, which must be preserved on restart.
3.5 Problems Checkpointing MPI Library State
The key issue in performing application-level checkpointing
of the state of the MPI library is that we do not assume access to its source code. While it would be possi-
ble for us to add application-level checkpointing methods to
an existing MPI implementation, this would limit the porta-
bility of our checkpointer and would keep the programmer
from using vendor-provided, platform-optimized implemen-
tations of MPI. Thus, our problem is to record and recover
the state of the MPI library using only the MPI interface.
The library state can be broken up into three categories:
Library message buffers. At the application level,
messages are invisible until they are received by the ap-
plication. Therefore, at checkpoint time, the applica-
tion cannot distinguish whether a given message is sit-
ting in a network buffer on the sending processor, being
transmitted, or sitting in a network buffer on the desti-
nation processor. All such messages are equivalently
“in-flight” from the application’s perspective. There-
fore, we do not need to checkpoint the library’s com-
munication buffers.
MPI’s opaque objects. Such objects are internal to the MPI library but are visible to the application via handles. These objects include request ob-
jects (MPI_Request), communicators (MPI_Comm),
groups (MPI_Group), data types (MPI_Datatype),
error handlers (MPI_Errhandler), user defined op-
erators (MPI_Op), and key-value pairs.
State internal to the MPI library. There is certain
state in the MPI library, such as message queues, timers
and the network addresses of processors, that is com-
pletely hidden from the application. Since this state cannot
be manipulated via MPI’s interface, it is impossible for
us to save or restore it. However, this is not required for
correctness. All that is required is that the application’s
view of the library remains consistent before and after
restart.
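For opaque objects, one way a checkpointer restricted to the MPI interface could restore them is to record the calls that created them and replay those calls on restart. The paper does not detail its mechanism in this excerpt; the following is a sketch for communicators created with MPI_Comm_split, with hypothetical bookkeeping:

/* Sketch: restoring an MPI opaque object (a communicator) on restart by
 * replaying the call that created it (bookkeeping names are hypothetical). */
#include <mpi.h>

struct comm_record { int color, key; MPI_Comm handle; };
static struct comm_record split_log[64];
static int nsplits;

int ckpt_comm_split(MPI_Comm parent, int color, int key, MPI_Comm *out) {
    int rc = MPI_Comm_split(parent, color, key, out);
    /* Remember enough to re-create the communicator after a restart. */
    split_log[nsplits].color = color;
    split_log[nsplits].key = key;
    split_log[nsplits].handle = *out;
    nsplits++;
    return rc;
}

void replay_comm_creation(MPI_Comm parent) {
    /* On restart, re-execute the recorded calls so the application's
     * handles refer to valid communicators again. */
    for (int i = 0; i < nsplits; i++)
        MPI_Comm_split(parent, split_log[i].color, split_log[i].key,
                       &split_log[i].handle);
}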
4 A Non-Blocking, Coordinated Protocol for
Application-level Checkpointing
We now describe the coordination protocol for global check-
pointing. The protocol is independent of the technique used
by processes to take local checkpoints. To avoid complicat-
ing the presentation, we first describe the protocol for point-
to-point communication only. Then, we show that collective
communication can be handled elegantly using the mecha-
nism in place for point-to-point communication.
4.1 High-level description of protocol
Phase #1 To initiate a distributed snapshot, the initiator
sends a control message called pleaseCheckpoint to all ap-
plication processes. Each application process must take a
local checkpoint at some time after it receives this request,
but it is free to send and receive as many messages as it likes
between the time it is asked to take a checkpoint and when it
actually complies with this request.
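A sketch of how Phase #1 might look from the initiator's side, with the control token sent out of band on a reserved tag; the tag value, payload, and routine name are assumptions, since the excerpt does not specify the wire format:

/* Sketch of Phase #1: the initiator asks every process to checkpoint.
 * The reserved tag and the single-int payload are assumptions. */
#include <mpi.h>

#define CTRL_TAG 32000              /* reserved for protocol control tokens */
enum ctrl_token { PLEASE_CHECKPOINT = 1 };

void initiate_global_checkpoint(MPI_Comm comm) {
    int nprocs, me, token = PLEASE_CHECKPOINT;
    MPI_Comm_size(comm, &nprocs);
    MPI_Comm_rank(comm, &me);
    /* Point-to-point sends rather than a collective, so that application
     * processes remain free to send and receive application messages
     * until they actually reach a point where they can checkpoint. */
    for (int dest = 0; dest < nprocs; dest++)
        if (dest != me)             /* the initiator notifies itself locally */
            MPI_Send(&token, 1, MPI_INT, dest, CTRL_TAG, comm);
}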
Phase #2 When an application process reaches a point in
the program where it can take a local checkpoint, it saves its
local state and the identities of any early messages on stable storage.

References

N. Lynch. Distributed Algorithms. Morgan Kaufmann, 1996.

Message Passing Interface Forum. MPI: A Message-Passing Interface Standard.

K. M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed systems. ACM Transactions on Computer Systems, 3(1), 1985.

E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3), 2002.

J. S. Plank, M. Beck, G. Kingsley, and K. Li. Libckpt: transparent checkpointing under Unix. USENIX Winter Technical Conference, 1995.