DMTCP: Transparent checkpointing for cluster computations and the desktop
Jason Ansel, Kapil Arya, Gene Cooperman
- pp 1-12
TLDR
DMTCP is a transparent user-level checkpointing package for distributed applications. It checkpoints applications as large as runCMS, which is used for the CMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within a larger package.
Abstract:
DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for over 20 well-known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.
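The forked checkpointing mentioned in the abstract can be illustrated with a minimal sketch. This is not DMTCP's implementation: DMTCP checkpoints whole process images transparently, whereas the hypothetical helpers `forked_checkpoint` and `restart` below only pickle a single Python object. The sketch assumes a POSIX system where `os.fork` is available. It shows the general idea: the child process holds a copy-on-write snapshot of memory taken at the moment of the fork and writes it to disk, while the parent resumes work immediately.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Write `state` to `path` from a forked child and return the child's pid.

    Hypothetical helper for illustration only. The child sees a
    copy-on-write snapshot of the parent's memory as of fork(), so the
    parent may keep mutating `state` while the child is still writing.
    """
    pid = os.fork()
    if pid == 0:
        # Child: serialize the snapshot, then exit without running the
        # parent's cleanup handlers.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
                f.flush()
                os.fsync(f.fileno())
            os.replace(tmp, path)  # atomic rename: no partial checkpoint files
            status = 0
        finally:
            os._exit(status)
    return pid  # Parent: returns right after the fork


def restart(path):
    """Load the state saved by forked_checkpoint (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


if __name__ == "__main__":
    ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    state = {"step": 3}
    child = forked_checkpoint(state, ckpt)
    state["step"] = 4            # parent continues computing
    os.waitpid(child, 0)         # wait only so the demo can read the file
    print("restored:", restart(ckpt), "live:", state)
```

In this sketch the parent pays only for the `fork` call rather than for the disk write, which is the intuition behind the shorter checkpoint time the abstract reports for forked checkpointing.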
Citations
Proceedings Article
Application migration in HPC — A driver of the exascale era?
TL;DR: This paper investigates the viability of application migration for HPC by deriving the requirements specific to this field, and presents a prototype mechanism that enables seamless migration of MPI processes in HPC systems.
Journal Article
Explorations of the viability of ARM and Xeon Phi for physics processing
David Abdurachmanov, Kapil Arya, Josh Bendavid, Tommaso Boccali, Gene Cooperman, Andrea Dotti, Peter Elmer, Giulio Eulisse, Francesco Giacomini, C. D. Jones, Matteo Manzali, Shahzad Muzaffar
TL;DR: This paper describes the experience of porting software to ARM and Xeon Phi processors and of running benchmarks with real physics applications to explore the potential of these processors for production physics processing.
Journal Article
Resilient Computing on ROS using Adaptive Fault Tolerance
Michael Lauer, Matthieu Amy, Jean-Charles Fabre, Matthieu Roy, William Excoffon, Miruna Stoicescu
TL;DR: Computer-based systems are now expected to evolve during their service life to cope with changes of various kinds, ranging from evolving user needs (e.g., additional features requested by users) to system configuration changes (e.g., modifications in available hardware resources).
Proceedings Article
Improving Performance of CAPE Using Discontinuous Incremental Checkpointing
Viet Hai Ha, Eric Renault
TL;DR: This paper presents a new prototype of CAPE based on the discontinuous incremental checkpointing technique, together with an analysis of its performance.
Proceedings Article
A framework for an in-depth comparison of scale-up and scale-out
TL;DR: A novel comparison framework based on MapReduce is proposed that accounts for the application, its requirements, and its input size by considering input, software, and hardware parameters.
References
Journal Article
IPython: A System for Interactive Scientific Computing
Fernando Perez, Brian E. Granger
TL;DR: The IPython project provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation, along with a comprehensive library on top of which more sophisticated systems can be built.
Journal Article
Distributed snapshots: determining global states of distributed systems
K. Mani Chandy, Leslie Lamport
TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
Portable implementation of the MPI message passing interface standard
TL;DR: The Message Passing Interface (MPI) is a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists.
Proceedings Article
Libckpt: transparent checkpointing under Unix
TL;DR: In this paper, the authors describe a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Proceedings Article
Fast transparent migration for virtual machines
TL;DR: This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others, to provide fast, transparent application migration.