Proceedings ArticleDOI

DMTCP: Transparent checkpointing for cluster computations and the desktop

TL;DR: DMTCP is a transparent user-level checkpointing package for distributed applications. It has been demonstrated on runCMS, which is used for the CMS experiment of the Large Hadron Collider at CERN, and because it is unprivileged it can be incorporated and distributed as a checkpoint-restart module within a larger package.
Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart are demonstrated for a wide range of over 20 well-known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.
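The forked checkpointing mentioned in the abstract can be illustrated with a minimal sketch. It assumes a Unix system and uses Python's `pickle` on a plain dictionary as a stand-in for process state; DMTCP itself saves whole process images and is not implemented this way. The function names `forked_checkpoint`, `restore`, and `demo` are hypothetical and exist only for this example. The point shown is that the child process sees a copy-on-write snapshot taken at `fork()`, so the parent resumes work immediately while the child writes to disk.

```python
import os
import pickle
import tempfile


def forked_checkpoint(state, path):
    """Fork a child that writes `state` to `path`; return the child's pid.

    The child holds a copy-on-write snapshot of the parent's memory as of
    the fork() call, so the parent may keep mutating `state` meanwhile.
    """
    pid = os.fork()
    if pid == 0:
        # Child: serialize the snapshot, then exit without parent cleanup.
        status = 1
        try:
            tmp = path + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)  # atomic rename: no partial image is visible
            status = 0
        finally:
            os._exit(status)
    return pid


def restore(path):
    """Load the saved state back from the checkpoint file."""
    with open(path, "rb") as f:
        return pickle.load(f)


def demo():
    state = {"step": 1, "data": [1, 2, 3]}
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    pid = forked_checkpoint(state, path)
    # The parent continues computing right away; the child is unaffected.
    state["step"] = 2
    state["data"].append(4)
    _, status = os.waitpid(pid, 0)
    assert os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0
    return state, restore(path)


if __name__ == "__main__":
    live, saved = demo()
    print("live state: ", live)
    print("checkpoint: ", saved)
```

The parent only pays for the `fork()` call before resuming, which is the general reason a forked checkpoint can appear much shorter to the application than a blocking one.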


Citations
Proceedings ArticleDOI
01 Oct 2016
TL;DR: This work proposes a novel architecture in which different semantic scenes are implemented as independent operating system processes, and employs checkpointing to provide a simpler framework for the end programmer, while at the same time resulting in higher performance.
Abstract: Typical devices of the Internet of Things are usually under-powered, and have limited RAM. This is due to energy and cost concerns. Yet, IoT applications require increasingly complex programs with increasingly large amounts of data. In principle, an application could manage the increasing data within the limited RAM by saving and loading data from the file system as needed. But managing the use of RAM in this way is both time-consuming and error-prone for the code developer. We propose instead a novel architecture in which different semantic scenes are implemented as independent operating system processes. As the need arises to switch from one scene to another, the currently running process, which represents the current scene, is checkpointed and a process representing the new scene is restarted from a checkpoint image. This solution employs checkpointing to provide a simpler framework for the end programmer, while at the same time resulting in higher performance. For example, experiments show that restarting an old process from a checkpoint image is about 25 times faster than starting a new process. When using an mmap-based optimization (deferring the paging in of virtual memory pages until runtime), restarting an old process is about 500 times faster. Overall, checkpoint and restart each execute in less than 0.2 seconds on a Raspberry Pi B.

5 citations
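The mmap-based optimization described in the abstract above (deferring the paging in of memory until it is touched) can be sketched as follows. This is only an illustration of the general idea, assuming a raw byte file as a stand-in for a checkpoint image; it is not the cited system's restart code, and `write_image`, `restore_eager`, `restore_lazy`, and `demo` are hypothetical names.

```python
import mmap
import os
import tempfile


def write_image(path, payload):
    """Write a stand-in 'checkpoint image' (just raw bytes) to disk."""
    with open(path, "wb") as f:
        f.write(payload)


def restore_eager(path):
    """Read the whole image into memory before returning."""
    with open(path, "rb") as f:
        return f.read()


def restore_lazy(path):
    """Map the image; the OS pages data in only when it is touched."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after f closes.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "scene.img")
    payload = b"scene-state:" + bytes(1024 * 1024)
    write_image(path, payload)
    eager = restore_eager(path)
    lazy = restore_lazy(path)
    header = lazy[:12]          # touches only the first page of the mapping
    same = lazy[:] == eager     # reading everything yields identical contents
    lazy.close()
    return header, same
```

With the lazy variant, the cost of loading pages is spread over the run instead of being paid up front at restart, which is the trade-off the abstract alludes to.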

01 Jan 2014
TL;DR: This dissertation presents user-space process virtualization to decouple application processes from the external subsystems and an adaptive plugin based approach is used to implement the virtualization layers that allow the checkpoint-restart system to grow organically.
Abstract: Checkpoint-Restart is the ability to save a set of running processes to a checkpoint image on disk, and to later restart them from the disk. In addition to its traditional use in fault tolerance, recovering from a system failure, it has numerous other uses, such as for application debugging and save/restore of the workspace of an interactive problem-solving environment. Transparent checkpointing operates without modifying the underlying application program, but it implicitly relies on a “Closed World Assumption”: the world (including file system, network, etc.) will look the same upon restart as it did at the time of checkpoint. This is not valid for more complex programs. Until now, checkpoint-restart packages have adopted ad hoc solutions for each case where the environment changes upon restart. This dissertation presents user-space process virtualization to decouple application processes from the external subsystems. A thin virtualization layer is introduced between the application and each external subsystem. It provides the application with a consistent view of the external world and allows for checkpoint-restart to succeed. The ever-growing number of external subsystems makes it harder to deploy and maintain virtualization layers in a monolithic checkpoint-restart system. To address this, an adaptive plugin-based approach is used to implement the virtualization layers that allow the checkpoint-restart system to grow organically. The principle of decoupling the external subsystem through process virtualization is also applied in the context of virtual machines for providing a solution to the long-standing double-paging problem. Double-paging occurs when the guest attempts to page out memory that has previously been swapped out by the hypervisor and leads to long delays for the guest as the contents are read back into machine memory only to be written out again. The performance rapidly drops as a result of significant lengthening of the time to complete the guest I/O request.

5 citations

Proceedings ArticleDOI
01 Nov 2012
TL;DR: This paper proposes a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL, which consists of two different checkpoint implementations, Local CheCL and Global CheCL.
Abstract: As the number of nodes in a GPU computing system increases, checkpointing to a global file system becomes more time-consuming due to I/O bottlenecks and network congestion. To solve this problem, in this paper, we propose a transparent and scalable checkpoint/restart mechanism for OpenCL applications, named Two-level CheCL. As its name implies, Two-level CheCL consists of two different checkpoint implementations, Local CheCL and Global CheCL. Local CheCL avoids checkpointing to the global file system by utilizing each node's local storage. Our experimental results show that Local CheCL can make the checkpointing process up to four times faster than a conventional checkpointing mechanism. We also implement Global CheCL, which utilizes a global file system, to make sure that we always have a global checkpoint file even in the case of a catastrophic failure. We discuss the performance of our proposed mechanism through an analysis with a two-level checkpoint model.

4 citations

23 Sep 2012
TL;DR: The novel design changes the default socket model so that sockets are not unexpectedly closed due to a remote node failure, and a pessimistic log-based rollback recovery protocol added at this level makes it possible to restart and re-execute a failed parallel process up to the point of failure independently of the rest of the processes.
Abstract: We present an analysis and design of how to incorporate a transparent fault tolerance system at socket level for message passing applications. The novel design changes the default socket model so that sockets are not unexpectedly closed due to a remote node failure. Moreover, a pessimistic log-based rollback recovery protocol added to this level makes it possible to restart and re-execute a failed parallel process up to the point of failure independently of the rest of the processes. This paper explains and analyzes the design-time decisions. We tested and assessed them by executing master-worker (M/W) and Single Program Multiple Data (SPMD) applications, which follow different communication patterns. Promising results of robustness in interprocess communication were obtained.

4 citations

Proceedings ArticleDOI
26 Jun 2017
TL;DR: The approach involving AFT (Adaptive Fault Tolerance) implemented on ROS (Robot Operating System) and the simulation platform developed to experiment and validate over-the-air updates of ADAS and AFT are summarized and some lessons learnt and perspectives are drawn.
Abstract: The use of over-the-air updates has attracted much interest in the last few years with the software-intensive development of embedded systems in the car industry. The development of autonomous driving and ADAS (Advanced Driver Assistance Systems) renders over-the-air updates mandatory, for both user satisfaction and economic reasons. How can we make sure that remote updates of critical ADAS do not have an impact on safety? This is the question we tackle in our work with a major car manufacturer. This paper is a progress report. We summarize our approach involving AFT (Adaptive Fault Tolerance) implemented on ROS (Robot Operating System), describe the simulation platform we have developed to experiment with and validate over-the-air updates of ADAS and AFT, and finally draw some lessons learnt and perspectives.

4 citations

References
Journal ArticleDOI
01 May 2007
TL;DR: The IPython project provides an enhanced interactive environment for Python that includes, among other features, support for data visualization and facilities for distributed and parallel computation.
Abstract: Python offers basic facilities for interactive work and a comprehensive library on top of which more sophisticated systems can be built. The IPython project provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation.

3,355 citations

Journal ArticleDOI
TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
Abstract: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.

2,738 citations
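The abstract above does not spell out how a global state is recorded, so the following is a toy, single-threaded simulation of the marker-based scheme usually associated with this line of work (commonly called the Chandy-Lamport snapshot algorithm). The mechanism details, the money-transfer scenario, and all class and method names are assumptions made for illustration. Channels are assumed to be FIFO and reliable.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.recorded_state = None   # local state captured for the snapshot
        self.channel_state = {}      # sender name -> messages caught in flight
        self.recording = set()       # incoming channels still being recorded


class Network:
    def __init__(self, balances):
        self.procs = {n: Process(n, b) for n, b in balances.items()}
        # One FIFO channel per ordered pair (src, dst).
        self.chan = {(s, d): deque()
                     for s in self.procs for d in self.procs if s != d}

    def send(self, src, dst, amount):
        self.procs[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def _record(self, p, marker_from=None):
        # Record local state, start recording incoming channels (except the
        # one the marker arrived on), and send a marker on every outgoing one.
        p.recorded_state = p.balance
        others = [s for s in self.procs if s != p.name]
        p.recording = {s for s in others if s != marker_from}
        p.channel_state = {s: [] for s in others}
        for d in others:
            self.chan[(p.name, d)].append(MARKER)

    def start_snapshot(self, initiator):
        self._record(self.procs[initiator])

    def deliver(self, src, dst):
        msg = self.chan[(src, dst)].popleft()
        p = self.procs[dst]
        if msg == MARKER:
            if p.recorded_state is None:
                self._record(p, marker_from=src)
            else:
                p.recording.discard(src)   # channel src->dst is now closed off
        else:
            p.balance += msg
            if src in p.recording:
                p.channel_state[src].append(msg)

    def snapshot_total(self):
        # Sum of recorded local states plus recorded in-flight messages.
        return sum(p.recorded_state + sum(sum(m) for m in p.channel_state.values())
                   for p in self.procs.values())


def demo():
    net = Network({"A": 100, "B": 50})
    net.send("A", "B", 10)    # in flight A->B, A now holds 90
    net.send("B", "A", 5)     # in flight B->A, B now holds 45
    net.start_snapshot("A")   # A records 90 and sends a marker to B
    net.deliver("B", "A")     # A gets 5 after recording: logged as in flight
    net.deliver("A", "B")     # B gets 10 before recording: part of B's state
    net.deliver("A", "B")     # B gets the marker: records 55, replies with marker
    net.deliver("B", "A")     # A gets B's marker: stops recording B->A
    return net
```

In this run the snapshot consists of A's recorded balance, B's recorded balance, and the one message caught in flight from B to A; their sum equals the initial total, which is the consistency property a checkpoint needs.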

01 Jan 1996
TL;DR: This paper describes MPICH, an implementation of MPI (Message Passing Interface), a specification for a standard message-passing library defined by the MPI Forum; MPICH's design goal is to combine portability with high performance.
Abstract: MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.

2,065 citations

Proceedings Article
16 Jan 1995
TL;DR: In this paper, the authors describe a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Abstract: Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.

670 citations

Proceedings Article
10 Apr 2005
TL;DR: This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others, using virtual machine technology to provide fast, transparent application migration.
Abstract: This paper describes the design and implementation of a system that uses virtual machine technology [1] to provide fast, transparent application migration. This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others. Neither the application nor any clients communicating with the application can tell that the application has been migrated. Experimental measurements show that for a variety of workloads, application downtime caused by migration is less than a second.

588 citations