DMTCP: Transparent checkpointing for cluster computations and the desktop

doi:10.1109/IPDPS.2009.5161063

Open Access · Proceedings Article

- pp 1-12

TLDR

DMTCP is a transparent user-level checkpointing package for distributed applications. It has been demonstrated on runCMS, the software used for the CMS experiment of the Large Hadron Collider at CERN. Because it is unprivileged, it can be incorporated and distributed as a checkpoint-restart module within a larger package.

Abstract:

DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpoint and restart are demonstrated for more than 20 well-known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.
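
The abstract credits forked checkpointing with cutting typical checkpoint time from 2 seconds to 0.2 seconds. The sketch below illustrates the general idea only, and is not DMTCP's implementation: the process forks, the child writes a snapshot of the state it inherited at fork time, and the parent resumes work immediately. The names `forked_checkpoint` and `demo`, and the use of `pickle` for the image format, are assumptions made for illustration. It requires a Unix-like system where `os.fork` is available.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    Hypothetical helper for illustration, not part of DMTCP.
    """
    pid = os.fork()
    if pid == 0:
        # Child: holds a copy-on-write snapshot of the parent's memory
        # as it was at the moment of the fork.
        status = 1
        try:
            with open(path, "wb") as f:
                pickle.dump(state, f)
            status = 0
        finally:
            # Exit without running the parent's cleanup handlers.
            os._exit(status)
    # Parent: returns at once and can keep computing.
    return pid


def demo():
    state = {"step": 1, "values": [1, 2, 3]}
    fd, path = tempfile.mkstemp(suffix=".ckpt")
    os.close(fd)
    try:
        pid = forked_checkpoint(state, path)
        # The parent mutates its state while the child is still writing.
        state["step"] = 2
        state["values"].append(4)
        _, wait_status = os.waitpid(pid, 0)
        with open(path, "rb") as f:
            saved = pickle.load(f)
        return wait_status, saved, state
    finally:
        os.remove(path)
```

Calling `demo()` returns the child's wait status, the state read back from the checkpoint file, and the parent's current state. The saved state reflects the moment of the fork, not the parent's later changes, which is why the parent only pays for the fork itself rather than for the full write.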
