Proceedings ArticleDOI

DMTCP: Transparent checkpointing for cluster computations and the desktop

TL;DR: DMTCP as mentioned in this paper is a transparent user-level checkpointing package for distributed applications, which is used for the CMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within some larger package.
Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads; as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.
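The "forked checkpointing" optimization the abstract measures (0.2-second checkpoints) can be illustrated with a toy sketch: the process forks, the child serializes a copy-on-write snapshot of the state while the parent resumes immediately, so the checkpoint latency seen by the parent is roughly the cost of fork(). This is only a hypothetical illustration of the idea, not DMTCP's implementation (DMTCP checkpoints whole processes transparently, including threads and sockets).

```python
# Toy sketch of forked checkpointing (hypothetical, not DMTCP's code).
import json, os, tempfile

def forked_checkpoint(state, path):
    pid = os.fork()
    if pid == 0:                  # child: sees a copy-on-write snapshot
        with open(path, "w") as f:
            json.dump(state, f)   # serialize the state at fork time
        os._exit(0)               # exit without running cleanup handlers
    return pid                    # parent: returns right away

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
state = {"step": 42}
child = forked_checkpoint(state, path)
state["step"] += 1                # parent mutates state immediately...
os.waitpid(child, 0)
with open(path) as f:
    print(json.load(f)["step"])   # ...but the checkpoint captured 42
```

Because fork() gives the child its own (lazily copied) view of memory, the parent's later mutation does not affect the snapshot being written.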


Citations
Journal ArticleDOI
TL;DR: In this paper, the authors investigated the potential benefits of resource disaggregation from the aspect of reliability, which has not been considered before, and proposed a resource allocation problem to maximize the number of requests accepted with guaranteed reliability.
Abstract: By overcoming the “server box” barrier, resource disaggregation in data centers (DCs) can significantly improve resource utilization. This may then provide a more cost-efficient approach for resource upgrade and expansion. The advantages of resource disaggregation have been explored in earlier research to improve the efficiency of resource usage. This paper investigates the potential benefits of resource disaggregation from the aspect of reliability, which has not been considered before. Resource disaggregation gives rise to a new failure pattern. For example, in a conventional server, the failure of one type of resource leads to the failure of the entire server, so that other types of resources in the same server also become unavailable. After disaggregating, the failure of different types of resources becomes more isolated so that other resources are still available. In this paper, we model the reliability of a resource allocation request in a server-based or disaggregated DC based on whether the request is allocated with only working resources or is also provisioned with backup resources. We then consider a resource allocation problem to maximize the number of requests accepted with guaranteed reliability. This is formulated as an integer linear programming (ILP) problem, and a more straightforward heuristic approach is also proposed. Our numerical studies demonstrate that it may be possible to significantly improve service reliability with this resource disaggregation approach.
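The failure-isolation argument can be illustrated with a toy probability calculation (the numbers and the two-resource model are hypothetical, not the paper's ILP formulation): a request needs one CPU unit and one memory unit, and in a disaggregated DC each resource type can be backed up independently, so a request stays up unless both the working and backup unit of some resource type fail.

```python
# Toy reliability model (illustrative assumptions, not the paper's).
p_cpu_fail = 0.01    # assumed per-period failure probability, CPU unit
p_mem_fail = 0.02    # assumed per-period failure probability, memory unit

# Request with no backup: up only if both resource units survive.
p_up = (1 - p_cpu_fail) * (1 - p_mem_fail)

# Disaggregated request with one backup per resource type: each type is
# up unless both its working and backup unit fail in the same period.
p_up_backed = (1 - p_cpu_fail ** 2) * (1 - p_mem_fail ** 2)

print(round(p_up, 4), round(p_up_backed, 6))  # -> 0.9702 0.9995
```

The point of the sketch is that per-type backups in a disaggregated pool drive the failure terms from first order (p) to second order (p²), which is what makes guaranteed-reliability admission cheaper than replicating whole servers.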

2 citations

Journal ArticleDOI
TL;DR: An extension to CPPC is presented to allow the checkpointing of OpenMP applications and maintains the main characteristics of CPPC: portability and reduced checkpoint file size.

Abstract: Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications. This paper presents an extension to CPPC to allow the checkpointing of OpenMP applications. The proposed solution maintains the main characteristics of CPPC: portability and reduced checkpoint file size. The performance of the proposal is evaluated using the OpenMP NAS Parallel Benchmarks showing that most of the applications present small checkpoint overheads.

2 citations

Dissertation
31 Jan 2017
TL;DR: This dissertation presents productive programming systems for heterogeneous supercomputers, aiming to reduce the labor-intensive, and therefore time-consuming and expensive, process of programming these machines.
Abstract: Productive Programming Systems for Heterogeneous Supercomputers

2 citations

Proceedings ArticleDOI
28 Jun 2018
TL;DR: The recently completed research project DEEP-ER has developed a variety of hardware and software technologies to improve the I/O capabilities of next generation high-performance computers, and to enable applications recovering from the larger hardware failure rates expected on these machines.
Abstract: The recently completed research project DEEP-ER has developed a variety of hardware and software technologies to improve the I/O capabilities of next-generation high-performance computers, and to enable applications to recover from the higher hardware failure rates expected on these machines. The heterogeneous Cluster-Booster architecture - first introduced in the predecessor DEEP project - has been extended by a multi-level memory hierarchy employing non-volatile and network-attached memory devices. Based on this hardware infrastructure, an I/O and resiliency software stack has been implemented, combining and extending well-established libraries and software tools while sticking to standard user interfaces. Real-world scientific codes have tested the project's developments and demonstrated the improvements achieved without compromising the portability of the applications.

1 citation

Dissertation
01 Jan 2016
TL;DR: This dissertation proposes and implements a new distributed architecture for next generation core routers by implementing N-Modular Redundancy for Control Plane Processes and uses a distributed key-value data store as a substrate for High Availability (HA) operation and synchronization with different consistency requirements.
Abstract: The unending expansion of Internet-enabled services drives demand for higher speeds and more stringent reliability capabilities in the commercial Internet. Network carriers, driven by these customer demands, increasingly acquire and deploy ever larger routers with an ever growing feature set. The software however, is not currently structured adequately to preserve Non-Stop Operation and is sometimes deployed in non-redundant software configurations. In this dissertation, we look at current Routing Frameworks and address the growing reliability concerns of Future Internet Architectures. We propose and implement a new distributed architecture for next generation core routers by implementing N-Modular Redundancy for Control Plane Processes. We use a distributed key-value data store as a substrate for High Availability (HA) operation and synchronization with different consistency requirements. A new platform is proposed, with many characteristics similar to Software Defined Network, on which the framework is deployed. Our work looks at the feasibility of Non-Stop Routing System Design, where a router can continue to process both control and data messages, even with multiple concurrent software and hardware failures. A thorough discussion on the framework, components, and capabilities is given, and an estimation of overhead, fault recovery and responsiveness is presented. We argue that even though our work is focused in the context of routing, it also applies to any use-case where Non-Stop Real-Time response is required.
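The core of N-Modular Redundancy can be sketched in a few lines: N replicas of a control-plane process compute the same answer, and a voter takes the majority output, masking up to floor((N-1)/2) faulty replicas. This is a generic illustration of the technique, not the dissertation's key-value-store-based framework.

```python
# Minimal majority voter for N-Modular Redundancy (illustrative sketch).
from collections import Counter

def nmr_vote(replica_outputs):
    """Return the majority output, or raise if no majority exists."""
    (value, count), = Counter(replica_outputs).most_common(1)
    if count <= len(replica_outputs) // 2:
        raise RuntimeError("no majority -- too many divergent replicas")
    return value

# Three replicas compute a next hop; one is faulty and is outvoted.
outputs = ["next_hop=10.0.0.1", "next_hop=10.0.0.1", "next_hop=10.0.0.9"]
print(nmr_vote(outputs))  # -> next_hop=10.0.0.1
```

In a real router framework the replicas would synchronize state through the shared data store and the voter would run on every control message, so a single software fault never reaches the forwarding plane.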

1 citation

References
Journal ArticleDOI
01 May 2007
TL;DR: The IPython project as mentioned in this paper provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation, built on top of Python's basic facilities for interactive work and its comprehensive library.
Abstract: Python offers basic facilities for interactive work and a comprehensive library on top of which more sophisticated systems can be built. The IPython project provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation.

3,355 citations

Journal ArticleDOI
TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
Abstract: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked,” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.
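The snapshot algorithm described here (Chandy-Lamport) can be simulated in a few dozen lines: the initiator records its own state and sends a marker on its outgoing channels; on first receipt of a marker a process records its state and forwards markers; messages that arrive on a channel after the local state is recorded but before that channel's marker are recorded as in flight. Below is a single-threaded two-process sketch (each process has one inbox channel, so per-channel marker bookkeeping collapses to a flag); it is an illustration, not the paper's presentation.

```python
# Two-process simulation of the Chandy-Lamport snapshot protocol.
from collections import deque

MARKER = "MARKER"

class Proc:
    def __init__(self, state, inbox, outbox):
        self.state = state
        self.inbox, self.outbox = inbox, outbox  # one FIFO channel each way
        self.snap_state = None       # recorded local state
        self.snap_channel = None     # recorded in-flight messages
        self.marker_seen = False     # marker arrived on the inbox channel

    def record(self):
        self.snap_state = self.state
        self.snap_channel = []
        self.outbox.append(MARKER)   # propagate marker downstream

    def step(self):
        if not self.inbox:
            return
        msg = self.inbox.popleft()
        if msg == MARKER:
            if self.snap_state is None:
                self.record()        # first marker: record state now
            self.marker_seen = True  # channel recording for inbox is done
        elif self.snap_state is not None and not self.marker_seen:
            self.snap_channel.append(msg)  # message was in flight
        # (normal application processing of msg would go here)

c01, c10 = deque(), deque()          # channels P0 -> P1 and P1 -> P0
p0 = Proc("s0", inbox=c10, outbox=c01)
p1 = Proc("s1", inbox=c01, outbox=c10)

p1.outbox.append("m2")   # m2 is in flight toward P0 when the snapshot starts
p0.record()              # P0 initiates the snapshot
for _ in range(4):       # run until all markers are delivered
    p0.step(); p1.step()

print(p0.snap_state, p1.snap_state, p0.snap_channel)  # -> s0 s1 ['m2']
```

The snapshot captures a consistent global state: both local states plus the message m2 that was on the wire, which no single instant of real time would have observed locally.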

2,738 citations

01 Jan 1996
TL;DR: MPI (Message Passing Interface), as discussed by the authors, is a specification for a standard library for message passing, defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists.
Abstract: MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.

2,065 citations

Proceedings Article
16 Jan 1995
TL;DR: In this paper, the authors describe a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Abstract: Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.

670 citations

Proceedings Article
10 Apr 2005
TL;DR: This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others, to provide fast, transparent application migration.

Abstract: This paper describes the design and implementation of a system that uses virtual machine technology [1] to provide fast, transparent application migration. This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others. Neither the application nor any clients communicating with the application can tell that the application has been migrated. Experimental measurements show that for a variety of workloads, application downtime caused by migration is less than a second.

588 citations