Open Access, Proceedings Article

DMTCP: Transparent checkpointing for cluster computations and the desktop

TLDR
DMTCP is a transparent user-level checkpointing package for distributed applications. It checkpoints applications such as runCMS, which is used for the CMS experiment of the Large Hadron Collider at CERN, and because it is unprivileged it can be incorporated and distributed as a checkpoint-restart module within some larger package.
Abstract
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpoint and restart are demonstrated for more than 20 well-known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. Because DMTCP takes an unprivileged, user-space approach, it maintains compatibility across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, it can be incorporated and distributed as a checkpoint-restart module within some larger package.
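The forked checkpointing mentioned in the abstract can be illustrated with a minimal sketch. The general idea, as commonly described, is that the process forks at checkpoint time: the child holds a copy-on-write snapshot of memory and writes it to disk while the parent resumes computing at once. The code below is not DMTCP's implementation or API. The names `forked_checkpoint`, `restore`, and `demo` are hypothetical, and application state is reduced to a picklable dictionary, whereas DMTCP saves the whole process image.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Write `state` to `path` from a forked child and return the child's pid.

    Hypothetical helper for illustration only; not part of DMTCP.
    """
    pid = os.fork()
    if pid == 0:
        # Child: holds a copy-on-write snapshot of the parent's memory as of
        # the fork, so later updates in the parent cannot change what is saved.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)  # publish the finished image atomically
            status = 0
        finally:
            os._exit(status)  # exit without running the parent's cleanup code
    # Parent: returns immediately and keeps computing.
    return pid


def restore(path):
    """Load a previously written checkpoint image."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo():
    """Run a toy computation, checkpoint at step 3, and finish at step 5."""
    state = {"step": 0, "values": []}
    ckpt = os.path.join(tempfile.mkdtemp(), "app.ckpt")
    child = None
    for step in range(1, 6):
        state["step"] = step
        state["values"].append(step * step)
        if step == 3:
            child = forked_checkpoint(state, ckpt)
    # The computation is done; now wait for the background writer to finish.
    _, wait_status = os.waitpid(child, 0)
    assert os.WEXITSTATUS(wait_status) == 0
    return state, restore(ckpt)


if __name__ == "__main__":
    final, restored = demo()
    print("final:", final)
    print("restored:", restored)
```

In this sketch the parent is blocked only for the duration of the `fork` call rather than for the whole write. That is one plausible reading of why the abstract reports shorter checkpoint times with forked checkpointing.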


Citations
Proceedings Article

CPRtree: A Tree-Based Checkpointing Architecture for Heterogeneous FPGA Computing

TL;DR: Proposes a new checkpoint/restart architecture with a checkpointing mechanism on FPGAs, "fine-grain" checkpoint management to reduce performance degradation, and a technique to capture consistent snapshots of FPGAs and the rest of the computing system.
Journal Article

A Comparison of Application-Level Fault Tolerance Schemes for Task Pools

TL;DR: Three application-level fault tolerance schemes for task pools are described and evaluated; the evaluation reveals that IncFT and LogFT are superior in scenarios with large task descriptors.
Proceedings Article

CRAC: Checkpoint-Restart Architecture for CUDA with Streams and UVM

TL;DR: The Checkpoint-Restart Architecture for CUDA (CRAC) is a new checkpoint-restart solution for fault tolerance that supports the full range of CUDA applications. It combines low runtime overhead (approximately 1% or less), fast checkpoint-restart, support for scalable CUDA streams (for efficient usage of all of the thousands of GPU cores), and support for the full features of Unified Virtual Memory.
Proceedings Article

A Performance and Energy Comparison of Fault Tolerance Techniques for Exascale Computing Systems

TL;DR: This work explores four resilience techniques, and considers each technique's ability to handle varying levels of system reliability and system sizes, and demonstrates how each technique compares in terms of application performance and energy use and provides recommendations on their suitability at an exascale level.
Proceedings Article

Performance evaluation of checkpoint/restart techniques: For MPI applications on Amazon cloud

TL;DR: This work evaluates two of the most commonly used checkpoint/restart techniques, Distributed Multithreaded Checkpointing (DMTCP) and the Berkeley Lab Checkpoint/Restart library (BLCR) integrated into the OpenMPI framework, testing their validity and performance in both local and Amazon Elastic Compute Cloud (EC2) environments.
References
Journal Article

IPython: A System for Interactive Scientific Computing

TL;DR: The IPython project provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation for interactive work, and a comprehensive library on top of which more sophisticated systems can be built.
Journal Article

Distributed snapshots: determining global states of distributed systems

TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.

Portable implementation of the MPI message passing interface standard

TL;DR: MPI (Message Passing Interface) is a specification for a standard message-passing library, defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists.
Proceedings Article

Libckpt: transparent checkpointing under Unix

TL;DR: In this paper, the authors describe a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Proceedings Article

Fast transparent migration for virtual machines

TL;DR: This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others, to provide fast, transparent application migration.