Proceedings ArticleDOI

DMTCP: Transparent checkpointing for cluster computations and the desktop

TL;DR: DMTCP as mentioned in this paper is a transparent user-level checkpointing package for distributed applications; it is used for runCMS, part of the CMS experiment of the Large Hadron Collider at CERN, and can be incorporated and distributed as a checkpoint-restart module within a larger package.
Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart are demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads; as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.
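
The abstract credits forked checkpointing with cutting checkpoint times from about 2 seconds to 0.2 seconds. A minimal sketch of the idea, assuming a Unix system with fork(): the child inherits a copy-on-write snapshot of the parent's memory and writes it out while the parent resumes immediately. The dict standing in for process state is illustrative; DMTCP itself snapshots the real address space, sockets, and other kernel state.

```python
import os
import pickle

def forked_checkpoint(state, path):
    pid = os.fork()
    if pid == 0:                      # child: owns a COW snapshot of 'state'
        with open(path, "wb") as f:
            pickle.dump(state, f)     # serialize while the parent keeps running
        os._exit(0)                   # exit without running parent cleanup
    return pid                        # parent: returns at once

if __name__ == "__main__":
    state = {"step": 0, "data": list(range(1000))}
    for step in range(1, 4):
        state["step"] = step
        pid = forked_checkpoint(state, f"ckpt.{step}.pkl")
        # ... parent continues the computation here ...
        os.waitpid(pid, 0)            # reap the checkpointer before the next round
```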


Citations
Dissertation
01 Jan 2019
TL;DR: This dissertation determines whether the update process of the Access Control controller can be improved by implementing existing update techniques to update the kernel and filesystem within seconds, with additional fail-safe measures to revert to the last working system in case of a failed update.
Abstract: Nedap needs to improve the update process of the Access Control controller to make their product highly available with only seconds of downtime per update. The current update process uses a straightforward approach which downloads the update, checks the files, stops the application, overwrites all files and reboots into the new system. After a full reboot, the access control application starts up, including fetching all authorisations from the server and initialising all connected hardware. This results in at least 3 minutes of downtime, which can increase to 23 minutes depending on the number of authorisations and the complexity of the system. This research aims to determine whether the update process can be improved by implementing existing update techniques to update the kernel and filesystem within seconds, and additionally to add fail-safe measures to revert to the last working system in case of a failed update. The improvement indicator is the relative speed-up in the downtime the access control software experiences during an update; downtime starts at the point the application is killed and ends when it is fully up and running again. Based on insights into the old update process and two Design Space Explorations of kernel update techniques and checkpoint-and-restore techniques, we propose a new update process. This new process uses a second partition to store the update, uses Kexec to load and execute a new kernel directly from the running one, and uses CRIU to create a checkpoint of the access control application which can be restored after a reboot. Additionally, a watchdog is implemented to reset the device in case the update fails and to reboot into the last working system using the second partition. With the new update process, a kernel and filesystem update is performed with only seconds of downtime. In tests on a full-system emulation tool, a system update is performed with 13.8 seconds of downtime; compared to the old update process, this is a relative speed-up of a factor of 5.6 to 11.
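
A sketch of the proposed update sequence, using the standard criu and kexec command-line tools; the paths, image layout, and process-lookup details are hypothetical stand-ins for the dissertation's actual setup, and the watchdog/second-partition fallback is omitted. Must run as root.

```python
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def update(app_pid):
    # 1. Freeze and dump the application; images persist on disk across reboot.
    run(["criu", "dump", "--tree", str(app_pid),
         "--images-dir", "/update/ckpt", "--leave-stopped", "--shell-job"])
    # 2. Stage the new kernel from the update partition.
    run(["kexec", "--load", "/update/vmlinuz",
         "--initrd=/update/initrd.img", "--reuse-cmdline"])
    # 3. Jump straight into it, skipping firmware and bootloader delays.
    run(["kexec", "--exec"])

# After the new kernel boots, an init script would bring the app back:
#   criu restore --images-dir /update/ckpt --shell-job --restore-detached
```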

1 citation

Proceedings ArticleDOI
14 Nov 2021
TL;DR: ParaCrash as discussed by the authors is a testing framework for studying crash recovery in a typical HPC I/O stack, and demonstrates its use by identifying 15 new crash-consistency bugs in various parallel file systems (PFS) and I/O libraries.
Abstract: We present ParaCrash, a testing framework for studying crash recovery in a typical HPC I/O stack, and demonstrate its use by identifying 15 new crash-consistency bugs in various parallel file systems (PFS) and I/O libraries. ParaCrash uses a "golden version" approach to test the entire HPC I/O stack: storage state after recovery from a crash is correct if it matches the state that can be achieved by a partial execution with no crashes. It supports systematic testing of a multilayered I/O stack while properly identifying the layer responsible for the bugs.
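
A toy model of the "golden version" oracle, under heavy simplification: the "file system" is a flat dict, and a crash state counts as consistent only if it matches some prefix of the operation trace (ParaCrash additionally accounts for legal reorderings across stack layers). None of the names below are ParaCrash's API.

```python
def apply_ops(ops):
    """Replay a crash-free partial execution over a flat key/value 'file system'."""
    fs = {}
    for op, path, data in ops:
        if op == "write":
            fs[path] = data
        elif op == "delete":
            fs.pop(path, None)
    return fs

def is_crash_consistent(recovered_state, trace):
    # Consistent iff the recovered state equals the state after some prefix.
    return any(recovered_state == apply_ops(trace[:i])
               for i in range(len(trace) + 1))

trace = [("write", "/a", "v1"), ("write", "/b", "v2"), ("delete", "/a", None)]
print(is_crash_consistent({"/a": "v1"}, trace))              # True: prefix of length 1
print(is_crash_consistent({"/a": "??", "/b": "v2"}, trace))  # False: no such prefix
```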

1 citation

01 Jan 2014
TL;DR: A vast majority of current-generation HPC systems and application codes work around system failures using rollback-recovery schemes, also known as Checkpoint-Restart (CR), wherein the parallel processes of an application frequently save a mutually agreed-upon state of their execution as checkpoints in a globally-shared storage medium.
Abstract: In high-performance computing (HPC), tightly-coupled, parallel applications run in lock-step over thousands to millions of processor cores. These applications simulate a wide range of scientific phenomena, such as hurricanes and earthquakes, or the functioning of a human heart. The results of these simulations are important and time-critical, e.g., we want to know the path of the hurricane before it makes landfall. Thus, these applications are run on the fastest supercomputers in the world at the largest scales possible. However, due to the increased component count, large-scale executions are more prone to experience faults, with Mean Times Between Failures (MTBF) on the order of hours or days due to hardware breakdowns and soft errors. A vast majority of current-generation HPC systems and application codes work around system failures using rollback-recovery schemes, also known as Checkpoint-Restart (CR), wherein the parallel processes of an application frequently save a mutually agreed-upon state of their execution as checkpoints in a globally-shared storage medium. In the face of failures, applications roll back their execution to a fault-free state using these snapshots that were saved periodically. Over the years, checkpointing mechanisms have gained notoriety for their colossal I/O demands. While state-of-the-art parallel file systems are optimized for concurrent accesses from millions of processes, checkpointing overheads continue to dominate application run times, with the time taken to write a single checkpoint taking on the…
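
A minimal sketch of the coordinated checkpoint pattern the abstract describes, written with the mpi4py bindings as an assumed stand-in for the usual C/Fortran MPI codes: all ranks synchronize on a barrier, then each writes its share of the mutually agreed-upon state to shared storage. The /shared/ckpt path and checkpoint interval are illustrative.

```python
# Run with e.g.: mpiexec -n 4 python checkpoint.py
import pickle
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Per-rank application state; a real code would hold mesh/field data here.
state = {"rank": rank, "iteration": 0, "field": [float(rank)] * 1024}

for step in range(1, 101):
    state["iteration"] = step
    # ... one timestep of the simulation ...
    if step % 25 == 0:                 # checkpoint interval (illustrative)
        comm.Barrier()                 # agree on a consistent cut
        with open(f"/shared/ckpt/step{step}.rank{rank}.pkl", "wb") as f:
            pickle.dump(state, f)      # each rank writes to shared storage
        comm.Barrier()                 # nobody resumes until all have written
```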

1 citation

01 Jan 2014
TL;DR: This work presents a layered approach to providing fault tolerance for message-passing applications on compute clusters built from COTS hardware components, COTS operating systems, and a COTS API for application programmers; the approach relies on highly-resilient cluster management middleware.
Abstract: Clusters of message-passing computing nodes provide high-performance platforms for distributed applications. Cost-effective implementations of such systems are based on commercial off-the-shelf (COTS) hardware and software components. One trend in the deployment of such systems is to scale up the number of compute nodes to deliver higher performance levels. The higher component count results in a corresponding higher rate of failure. Another trend is to deploy clusters for mission-critical applications or in harsh environments, where reliability requirements are higher than in a controlled lab setting. Both of these trends point to an increasing need to employ fault tolerance techniques to meet the reliability requirements of the applications being executed. We present a layered approach to providing fault tolerance for message-passing applications on compute clusters that are based on COTS hardware components, COTS operating systems, and a COTS API for application programmers. This approach relies on highly-resilient cluster management middleware (CMM) that ensures the survival of key system services despite the failure of cluster components. A key feature of this CMM is that it provides services that enable and simplify user-level implementation of fault tolerance for applications without dictating the specific techniques employed. In particular, while application-transparent techniques are supported, the CMM also supports application-specific techniques that are tailored and optimized for the characteristics and requirements of specific applications. To this end, we have developed an API that can be used in the implementation of fault tolerance by the application programmer as well as by developers of user-level libraries that provide application-transparent fault tolerance. The effectiveness of our layered approach is demonstrated and evaluated with several applications employing different techniques for fault tolerance. The entire system is subjected to a fault injection campaign. We show that the CMM services that support fault tolerance techniques operate reliably and with very low overhead. We also show that application-specific fault tolerance techniques detect and recover from a vast majority of manifested faults while imposing much lower performance overhead than application-transparent schemes.
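
The paper's CMM API is not reproduced in the abstract, so the sketch below invents a minimal analogue of the kind of service it describes: the middleware detects component failures and notifies registered handlers, leaving the recovery technique (application-specific or transparent) to the layer above. Every name here is hypothetical.

```python
from typing import Callable, Dict, List

class ClusterManagementMiddleware:
    """Hypothetical stand-in for the paper's CMM failure-notification service."""

    def __init__(self):
        self._handlers: List[Callable[[int], None]] = []
        self._alive: Dict[int, bool] = {}

    def register_failure_handler(self, fn: Callable[[int], None]):
        """Hook for application-specific or library-level recovery logic."""
        self._handlers.append(fn)

    def report_heartbeat(self, node: int):
        self._alive[node] = True

    def declare_failed(self, node: int):
        """Called by the (simulated) failure detector; fans out to handlers."""
        self._alive[node] = False
        for fn in self._handlers:
            fn(node)

cmm = ClusterManagementMiddleware()
cmm.register_failure_handler(
    lambda n: print(f"restarting ranks from node {n}'s last checkpoint"))
cmm.report_heartbeat(3)
cmm.declare_failed(3)   # triggers the application's own recovery policy
```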

1 citation

References
Journal ArticleDOI
01 May 2007
TL;DR: The IPython project as mentioned in this paper provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation.
Abstract: Python offers basic facilities for interactive work and a comprehensive library on top of which more sophisticated systems can be built. The IPython project provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation.
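
IPython's distributed and parallel facilities live on today as the ipyparallel package (split out of the IPython project); a minimal sketch, assuming a local cluster has been started with `ipcluster start -n 4`.

```python
import ipyparallel as ipp

rc = ipp.Client()           # connect to the running cluster
view = rc[:]                # a direct view over all engines
squares = view.map_sync(lambda x: x * x, range(8))
print(squares)              # [0, 1, 4, 9, 16, 25, 36, 49]
```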

3,355 citations

Journal ArticleDOI
TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation; the algorithm helps to solve an important class of problems: stable property detection.
Abstract: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.
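
The algorithm is the Chandy-Lamport snapshot: on first receipt of a marker, a process records its own state, sends markers on its outgoing channels, and records each incoming channel until that channel's marker arrives. A compact simulation on two processes with FIFO channels; the message and marker encodings are illustrative.

```python
from collections import deque

class Proc:
    def __init__(self, pid, peers):
        self.pid, self.peers = pid, peers
        self.state = 0          # application state, e.g. tokens held
        self.snap = None        # recorded local state (None = not yet recorded)
        self.open = set()       # incoming channels still being recorded
        self.inflight = {}      # peer -> messages caught in flight at the cut

    def start_snapshot(self, net):
        if self.snap is not None:
            return
        self.snap = self.state                      # 1. record own state
        self.open = set(self.peers)                 # 2. record all incoming channels
        self.inflight = {p: [] for p in self.peers}
        for p in self.peers:                        # 3. marker on every outgoing channel
            net[(self.pid, p)].append("MARKER")

    def deliver(self, src, msg, net):
        if msg == "MARKER":
            self.start_snapshot(net)    # first marker triggers recording
            self.open.discard(src)      # channel src -> self is now fully recorded
        else:
            self.state += msg
            if src in self.open:        # message crossed the snapshot cut
                self.inflight[src].append(msg)

net = {(a, b): deque() for a in (0, 1) for b in (0, 1) if a != b}
procs = {0: Proc(0, [1]), 1: Proc(1, [0])}
procs[0].state, procs[1].state = 7, 3

net[(0, 1)].append(2)            # a message already in flight
procs[0].start_snapshot(net)     # P0 initiates the snapshot
while any(net.values()):         # pump the FIFO channels until quiescent
    for (src, dst), q in list(net.items()):
        if q:
            procs[dst].deliver(src, q.popleft(), net)

total = sum(p.snap for p in procs.values()) + \
        sum(sum(m) for p in procs.values() for m in p.inflight.values())
print("recorded global total:", total)   # 12 == 7 + 3 + 2: a consistent cut
```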

2,738 citations

01 Jan 1996
TL;DR: MPI (Message Passing Interface) as discussed by the authors is a specification for a standard message-passing library, defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists.
Abstract: MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.
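
A minimal MPI program of the kind this specification standardizes, using the mpi4py bindings as an assumed stand-in for C code linked against MPICH itself.

```python
# Run with e.g.: mpiexec -n 2 python ping.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"payload": 42}, dest=1, tag=0)   # point-to-point send
elif rank == 1:
    msg = comm.recv(source=0, tag=0)            # matching receive
    print("rank 1 received:", msg)
```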

2,065 citations

Proceedings Article
16 Jan 1995
TL;DR: In this paper, the authors describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Abstract: Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.
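
A sketch of the user-directed idea in Python: the programmer marks where a checkpoint is safe to take and which data may be skipped because it is cheap to recompute. The function names are illustrative of the paper's style of directives, not libckpt's actual C API.

```python
import pickle

EXCLUDED = set()

def exclude_from_checkpoint(name):
    """Mark state that is cheap to recompute, so it is never written out."""
    EXCLUDED.add(name)

def checkpoint_here(state, path="app.ckpt"):
    """The programmer calls this at points where the state is consistent."""
    saved = {k: v for k, v in state.items() if k not in EXCLUDED}
    with open(path, "wb") as f:
        pickle.dump(saved, f)

state = {"params": [1, 2, 3], "scratch": [0] * 1_000_000}
exclude_from_checkpoint("scratch")   # large buffer, rebuilt on restart
checkpoint_here(state)               # small file: only 'params' is written
```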

670 citations

Proceedings Article
10 Apr 2005
TL;DR: This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others, to provide fast, transparent application migration.
Abstract: This paper describes the design and implementation of a system that uses virtual machine technology [1] to provide fast, transparent application migration. This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others. Neither the application nor any clients communicating with the application can tell that the application has been migrated. Experimental measurements show that for a variety of workloads, application downtime caused by migration is less than a second.
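
Live-migration systems of this kind typically reach sub-second downtime via iterative pre-copy: copy memory while the VM runs, re-copy what it dirtied, and pause it only for the final small dirty set. A toy simulation of that loop, with made-up page counts and dirty rates; the paper's exact mechanism is not reproduced here.

```python
import random

PAGES, DIRTY_RATE, STOP_THRESHOLD = 100_000, 0.02, 500

dirty = set(range(PAGES))       # round 0: the whole memory image must move
rounds = 0
while len(dirty) > STOP_THRESHOLD:
    rounds += 1
    copied = len(dirty)
    # Copy time scales with pages moved; meanwhile the running VM
    # dirties a fraction of them, which must be re-sent next round.
    dirty = {random.randrange(PAGES) for _ in range(int(copied * DIRTY_RATE))}
    print(f"round {rounds}: copied {copied} pages, {len(dirty)} dirtied meanwhile")

print(f"stop-and-copy: pause VM, send final {len(dirty)} pages, resume on target")
```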

588 citations