Automated application-level checkpointing of MPI programs
Citations
DMTCP: Transparent checkpointing for cluster computations and the desktop
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
Adaptive incremental checkpointing for massively parallel systems
Online-ABFT: an online algorithm based fault tolerance scheme for soft error detection in iterative methods
Application-level checkpointing for shared memory programs
References
Distributed algorithms
MPI: A Message-Passing Interface Standard
Distributed snapshots: determining global states of distributed systems
A survey of rollback-recovery protocols in message-passing systems
Libckpt: transparent checkpointing under Unix
Frequently Asked Questions (11)
Q2. What is the state of the application running on each node?
The state of the application running on each node consists of its position in the static text of the program, its position in the dynamic execution of the program, its local and global variables, and its heap-allocated structures.
Q3. How are stack variables restored?
On restart, the authors first restore the stack using the PS, and then use the VDS to restore stack variables by copying their values from the checkpoint to their locations on the stack.
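The copy-back step can be illustrated with a small simulation. This is only a sketch under stated assumptions: a bytearray stands in for stack memory, each VDS entry is modeled as a hypothetical (address, size) pair, and the helper names are invented for illustration. The actual data structures are not described in the answer above.

```python
import struct

# Hypothetical model: a bytearray plays the role of stack memory, and each
# VDS entry records the (address, size) of one live stack variable.

def save_variables(memory, vds):
    """Copy the bytes of every variable described by the VDS into a checkpoint."""
    return [(addr, bytes(memory[addr:addr + size])) for addr, size in vds]

def restore_variables(memory, checkpoint):
    """Copy each saved value back to the address it was saved from."""
    for addr, data in checkpoint:
        memory[addr:addr + len(data)] = data

# Original run: two stack variables at fixed addresses.
stack = bytearray(16)
struct.pack_into("<i", stack, 0, 42)    # 4-byte int at address 0
struct.pack_into("<d", stack, 8, 2.5)   # 8-byte double at address 8
vds = [(0, 4), (8, 8)]

ckpt = save_variables(stack, vds)

# Restart: the stack frames are assumed to be rebuilt already (the PS step),
# so only the variable contents still need to be copied back.
restarted = bytearray(16)
restore_variables(restarted, ckpt)

restored_int = struct.unpack_from("<i", restarted, 0)[0]
restored_double = struct.unpack_from("<d", restarted, 8)[0]
```

Because every value returns to the address it was saved from, anything that referred to those addresses before the checkpoint still refers to the same variables afterwards.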
Q4. What is the second dimension along which checkpointing techniques can be classified?
The second dimension along which checkpointing techniques can be classified is the technique used to coordinate parallel processes when checkpoints need to be taken.
Q5. Why did the developers choose not to follow the PORCH approach?
Since portability is not one of their goals, and because the authors feel that the limitations on programming style and the added overhead of doing pointer conversion are too burdensome for their applications, the authors have chosen not to follow the PORCH approach.
Q6. Why do the authors need to restore heap objects to their original addresses?
Restoring stack variables and heap objects to their original virtual addresses means that data pointers need no special handling: they are saved as ordinary data and remain valid after restart.
Q7. What is the key issue in performing application-level checkpointing of the state of the MPI library?
The key issue in performing application-level checkpointing of the state of the MPI library is that the authors do not assume that they have access to its source code.
Q8. What is the function used to compute the conjunction of the bits?
Each process piggybacks its amLogging bit on the application data, and the function invoked by MPI_Allreduce computes the conjunction of these bits.
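The piggybacking idea can be sketched without an MPI installation. In the simulation below, `simulated_allreduce` is a hypothetical stand-in for MPI_Allreduce, and the application's own reduction is assumed to be a sum; only the logical AND over the amLogging bits comes from the answer above.

```python
from functools import reduce

def combine(a, b):
    """Reduction operator: sum the application data, AND the piggybacked bits."""
    (data_a, logging_a), (data_b, logging_b) = a, b
    return (data_a + data_b, logging_a and logging_b)

def simulated_allreduce(contributions):
    """Stand-in for MPI_Allreduce: every process receives the same reduced value."""
    result = reduce(combine, contributions)
    return [result for _ in contributions]

# Each tuple is (application data, amLogging bit) contributed by one process.
all_logging = simulated_allreduce([(1.0, True), (2.0, True), (3.0, True)])
one_stopped = simulated_allreduce([(1.0, True), (2.0, False), (3.0, True)])
```

The combined bit is true only when every participating process is still logging, so a single process that has stopped logging is visible to all the others after the collective call.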
Q9. Why did the authors use only 16 processors for their tests?
Due to hardware problems, the authors used only 16 of those processors for their tests; in the final paper, the authors will present results for the full machine.
Q10. What is the problem when Q saves its log?
When Q saves its log, the authors have a problem: the saved state of the global computation is causally dependent on an event that was not itself saved.
Q11. How are application messages classified?
It is convenient to classify each application message into one of three categories, depending on the epoch numbers of the sending and receiving processes at the points in the application program execution when the message is sent and received, respectively.
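Comparing two epoch numbers gives exactly three outcomes, which a short function makes concrete. This is a minimal sketch: the function name and the labels "late", "intra-epoch", and "early" are illustrative assumptions, since the answer above does not name the three categories.

```python
def classify_message(sender_epoch, receiver_epoch):
    """Classify a message by the epochs in which it was sent and received.

    The three labels are illustrative names for the three possible outcomes
    of comparing the two epoch numbers.
    """
    if sender_epoch < receiver_epoch:
        return "late"         # sent before the sender's checkpoint, received after the receiver's
    if sender_epoch > receiver_epoch:
        return "early"        # sent after the sender's checkpoint, received before the receiver's
    return "intra-epoch"      # sent and received in the same epoch

examples = [classify_message(s, r) for s, r in [(0, 1), (1, 1), (2, 1)]]
```

Here an epoch is taken to be the interval between two consecutive local checkpoints of a process, so the comparison tells whether a message crossed a checkpoint boundary and in which direction.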