Proceedings ArticleDOI

DMTCP: Transparent checkpointing for cluster computations and the desktop

TL;DR: DMTCP as mentioned in this paper is a transparent user-level checkpointing package for distributed applications, which is used for the CMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within some larger package.
Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart are demonstrated for a wide range of over 20 well-known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, it can be incorporated and distributed as a checkpoint-restart module within some larger package.
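
The abstract mentions forked checkpointing, where checkpoint time drops because the checkpoint is written by a forked copy of the process while the original keeps running. The sketch below illustrates that general idea in Python; it is not DMTCP's implementation (DMTCP checkpoints the whole process image transparently), and the state dictionary, file name, and loop are illustrative assumptions.

```python
# Minimal sketch of the forked-checkpointing idea: fork a copy-on-write child
# that serializes a snapshot of the application state to disk while the parent
# keeps computing. Illustration only -- not DMTCP's implementation.
import os
import pickle

def forked_checkpoint(state, path):
    """Write `state` to `path` from a forked child; the parent returns at once."""
    pid = os.fork()
    if pid == 0:                 # child: sees a copy-on-write snapshot of memory
        with open(path, "wb") as f:
            pickle.dump(state, f)
        os._exit(0)              # leave without running the parent's cleanup code
    return pid                   # parent: continue the computation

def restore(path):
    with open(path, "rb") as f:
        return pickle.load(f)

if __name__ == "__main__":
    state = {"iteration": 0, "partial_sum": 0.0}
    for i in range(1, 6):
        state["iteration"] = i
        state["partial_sum"] += i ** 0.5
        child = forked_checkpoint(dict(state), "app.ckpt")
        os.waitpid(child, 0)     # toy: wait here; a real checkpointer reaps asynchronously
    print("restored state:", restore("app.ckpt"))
```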


Citations
Book ChapterDOI
TL;DR: This tutorial chapter outlines, designs, and demonstrates a fault-tolerant mechanism for handling ROS master failure, and presents a modified version of the ROS master equipped with a logging mechanism that records the meta-information and network state of ROS nodes.
Abstract: In this chapter we discuss the problem of master failure in ROS 1.0 and its impact on robotic deployments in the real world. We address this issue in this tutorial chapter, where we outline, design, and demonstrate a fault-tolerant mechanism for handling a ROS master failure. Unlike previous solutions, which use primary-backup replication and external checkpointing libraries that are resource demanding, our mechanism adds a lightweight functionality to the ROS master to enable it to recover from failure. We present a modified version of the ROS master which is equipped with a logging mechanism to record the meta-information and network state of ROS nodes, as well as a recovery mechanism to go back to the previous state without having to abort or restart all the nodes. We also implement an additional master monitor node responsible for failure detection on the master by polling it for its availability. Our code is implemented in Python and preliminary tests were conducted successfully on a variety of land, aerial, and underwater robots and a teleoperated computer running ROS Kinetic on Ubuntu 16.04. The code is publicly available under a Creative Commons license on GitHub at https://github.com/PushyamiKaveti/fault-tolerant-ros-master.
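
As a rough illustration of the monitoring loop described above, the sketch below polls the ROS master's standard XML-RPC endpoint (getPid) and calls a placeholder recovery hook after repeated failures. It is not the authors' code; the polling period, failure threshold, and recovery function are assumptions.

```python
# Sketch of a master-monitor loop: poll the ROS master's XML-RPC API and trigger
# a (placeholder) recovery hook when it stops responding.
import os
import time
import xmlrpc.client

MASTER_URI = os.environ.get("ROS_MASTER_URI", "http://localhost:11311")
POLL_PERIOD_S = 1.0                 # assumed polling period
FAILURES_BEFORE_RECOVERY = 3        # assumed threshold

def master_alive(uri):
    """True if the master answers the standard getPid() call of the Master API."""
    try:
        code, _msg, _pid = xmlrpc.client.ServerProxy(uri).getPid("/master_monitor")
        return code == 1
    except (OSError, xmlrpc.client.Error):
        return False

def recover_master():
    # Placeholder: here the modified master would be restarted and its logged
    # node/topic state restored, as outlined in the chapter.
    print("master unreachable -- triggering recovery")

if __name__ == "__main__":
    misses = 0
    while True:
        misses = 0 if master_alive(MASTER_URI) else misses + 1
        if misses >= FAILURES_BEFORE_RECOVERY:
            recover_master()
            misses = 0
        time.sleep(POLL_PERIOD_S)
```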

3 citations

Proceedings ArticleDOI
07 Jul 2015
TL;DR: An Android cluster system is presented that can reconfigure its scale statically and dynamically, automatically detecting changes in the number of computation nodes and reconfiguring the cluster's nodes even while a parallel and distributed application is running.
Abstract: In recent years, high-performance mobile devices such as smartphones and tablets have spread rapidly. They have attracted attention as a new platform for parallel and distributed applications. Against this background, we are developing a cluster computer system using wireless-connected mobile devices running Android OS. However, since mobile devices can move anywhere, node computers might leave the cluster and new devices might join it. In this paper, we present an Android cluster system that can reconfigure its scale statically and dynamically. The system can automatically detect changes in the number of computation nodes and reconfigure the cluster's nodes, even while a parallel and distributed application is running. Furthermore, we show preliminary performance results for our system, which indicate that the cluster's performance scales with the number of nodes in parallel computation. We have also confirmed that the runtime overhead caused by checkpointing depends strongly on the checkpointing interval.
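
The sketch below shows one generic way a head node could detect nodes joining or leaving, by tracking UDP heartbeats from workers. It illustrates only the detection step, not the paper's mechanism; the port number and timeout are assumptions.

```python
# Generic membership detection via heartbeats: workers periodically send a small
# UDP packet (e.g. sock.sendto(b"hb", (head_ip, HEARTBEAT_PORT))); the head node
# rebuilds its node list from heartbeats seen within a timeout window.
import socket
import time

HEARTBEAT_PORT = 50007     # assumed port
NODE_TIMEOUT_S = 5.0       # a node is considered gone after this much silence

def monitor_membership():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", HEARTBEAT_PORT))
    sock.settimeout(1.0)
    last_seen = {}                       # node address -> time of last heartbeat
    members = set()
    while True:
        try:
            _data, (addr, _port) = sock.recvfrom(64)
            last_seen[addr] = time.time()
        except socket.timeout:
            pass
        now = time.time()
        alive = {a for a, t in last_seen.items() if now - t < NODE_TIMEOUT_S}
        if alive != members:             # membership changed: reconfigure the cluster
            print("reconfiguring, nodes now:", sorted(alive))
            members = alive

if __name__ == "__main__":
    monitor_membership()
```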

3 citations

Journal ArticleDOI
TL;DR: An Android cluster system is presented that can automatically detect changes in the number of computation nodes and reconfigure the cluster's nodes even while a parallel and distributed application is running, and preliminary performance results of the system are shown.
Abstract: Recently, high-performance mobile devices such as smartphones and tablets have spread rapidly. They have attracted attention as a new and promising platform for parallel and distributed applications. Against this background, we are developing a cluster computer system using mobile devices or single-board computers running Android OS. However, since mobile devices can move anywhere, node computers might leave the cluster and new nodes might join it. In this paper, we present an Android cluster system that can reconfigure its scale dynamically. Our system can automatically detect changes in the number of computation nodes and reconfigure the cluster's nodes, even while a parallel and distributed application is running. Furthermore, we show preliminary performance results for our system. The results show that our cluster's performance scales with the number of nodes in parallel computation. Finally, we confirm that per-process load balancing and switching to a more efficient inter-process communication method can reduce the execution time of parallel applications; per-process load balancing reduces execution time by up to 11.8% compared to per-node load balancing, and switching the communication method between processes to a more efficient one reduces execution time by up to 68%.
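
To make the per-node versus per-process distinction concrete, here is a tiny sketch (not the paper's scheduler) that splits a fixed number of tasks either equally per node or proportionally to each node's worker-process count; the node names and process counts are invented.

```python
# Illustration of per-node vs per-process load balancing when nodes run
# different numbers of worker processes.
def split_per_node(num_tasks, nodes):
    """Equal share per node, ignoring how many processes each node runs."""
    share = num_tasks // len(nodes)
    return {node: share for node in nodes}

def split_per_process(num_tasks, nodes):
    """Share proportional to each node's number of worker processes."""
    total_procs = sum(nodes.values())
    return {node: num_tasks * procs // total_procs for node, procs in nodes.items()}

if __name__ == "__main__":
    nodes = {"tablet-a": 8, "phone-b": 4, "phone-c": 2}   # node -> processes (invented)
    print("per node:   ", split_per_node(1400, nodes))
    print("per process:", split_per_process(1400, nodes))
```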

3 citations

Proceedings ArticleDOI
28 Jun 2018
TL;DR: The mechanism uses block checksums, which not only meet the requirements of matrix computations on large-scale parallel systems but also reduce the overhead of fault tolerance compared to traditional schemes based on row and column checksums.
Abstract: With the scaling up of high-performance computers, resilience has become a major challenge. Among the various kinds of software-based fault-tolerant approaches, algorithm-based fault tolerance (ABFT) has some attractive characteristics in the era of exascale systems, such as high efficiency and light weight. In particular, considering that many engineering and scientific applications rely on some fundamental algorithms, it is possible to provide algorithm-based fault-tolerant mechanisms at a low level and make them application-independent. Previous fault-tolerant mechanisms for matrix computation use row and column checksums, which cannot be directly used on large-scale parallel systems. This paper proposes an algorithm-based fault-tolerant approach for matrix multiplication on large-scale parallel systems. The mechanism uses block checksums, which not only meet the requirements of matrix computations on large-scale parallel systems but also reduce the overhead of fault tolerance compared to traditional schemes based on row and column checksums. In addition, the paper gives a method for choosing the block size to balance accuracy and efficiency. The complexity analysis and examples demonstrate the effectiveness and feasibility of our approach.
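
The checksum relation that such ABFT schemes rely on can be shown on a single block: augmenting A with a column-checksum row and B with a row-checksum column makes the product carry checksums of C that localize and correct one corrupted entry. The numpy sketch below illustrates only this underlying principle on one block, not the paper's distributed block-checksum mechanism; the matrix sizes and injected fault are arbitrary.

```python
# Single-block demonstration of the ABFT checksum relation for C = A @ B.
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 5, 3
A = rng.integers(0, 10, (m, n)).astype(float)
B = rng.integers(0, 10, (n, p)).astype(float)

# Checksum-augmented inputs: extra row of column sums on A, extra column of row sums on B.
A_c = np.vstack([A, A.sum(axis=0)])                  # (m+1) x n
B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])   # n x (p+1)

F = A_c @ B_r                  # full-checksum product, (m+1) x (p+1)
C = F[:m, :p].copy()           # the data block of C = A @ B

C[2, 1] += 7.0                 # inject a single fault into the data block

# Detect: compare recomputed row/column sums of C with the stored checksums in F.
row_err = C.sum(axis=1) - F[:m, p]     # nonzero at the faulty row
col_err = C.sum(axis=0) - F[m, :p]     # nonzero at the faulty column
i = int(np.argmax(np.abs(row_err)))
j = int(np.argmax(np.abs(col_err)))

C[i, j] -= row_err[i]          # correct using the checksum discrepancy
assert np.allclose(C, A @ B)
print(f"corrected single fault at ({i}, {j})")
```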

3 citations

Journal ArticleDOI
TL;DR: A methodology has been designed for predicting the checkpoint size when the number of processes, the application workload, or the mapping varies, using a reduced number of resources.
Abstract: The use of fault tolerance strategies such as checkpointing is essential to maintain the availability of systems and their applications in high-performance computing environments. However, checkpoint storage can impact the performance and scalability of parallel applications that use message passing. In the present work, a study is carried out on the elements that can impact checkpoint storage and how they can influence the scalability of an application with fault tolerance. A methodology has been designed for predicting the checkpoint size when the number of processes, the application workload, or the mapping varies, using a reduced number of resources. By following this methodology, the system administrator will be able to decide on the number of processes to use and the appropriate number of nodes, adjusting the process mapping in applications that use checkpoints.
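
As a purely hypothetical illustration of the prediction step, the sketch below fits a simple linear model to checkpoint sizes measured at small process counts and extrapolates to larger runs. The data points, the model form, and the numbers are assumptions for illustration, not results or details from the paper.

```python
# Hypothetical sketch: predict total checkpoint size at scale from small-scale runs.
import numpy as np

# (number of processes, total checkpoint size in GB) -- assumed measurements
procs = np.array([4, 8, 16, 32], dtype=float)
ckpt_gb = np.array([6.1, 11.8, 23.5, 46.7])

# Fit total size ~= a * procs + b (per-process state plus a shared constant part).
a, b = np.polyfit(procs, ckpt_gb, deg=1)

def predict_checkpoint_gb(n_procs):
    """Extrapolate total checkpoint size for a larger run under the linear model."""
    return a * n_procs + b

for n in (64, 128, 256):
    print(f"{n:4d} processes -> ~{predict_checkpoint_gb(n):.1f} GB of checkpoint data")
```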

3 citations

References
Journal ArticleDOI
01 May 2007
TL;DR: The IPython project as mentioned in this paper provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation, building on Python's basic facilities for interactive work and its comprehensive library on top of which more sophisticated systems can be built.
Abstract: Python offers basic facilities for interactive work and a comprehensive library on top of which more sophisticated systems can be built. The IPython project provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation.
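
As a small example of the parallel facilities mentioned, the snippet below uses the present-day ipyparallel package (this functionality has lived in IPython.kernel, IPython.parallel, and now ipyparallel over the years); it assumes an engine cluster has already been started, e.g. with `ipcluster start -n 4`.

```python
# Map a function across all engines of a running IPython/ipyparallel cluster.
import ipyparallel as ipp

rc = ipp.Client()      # connect to the running controller
view = rc[:]           # DirectView over all engines

results = view.map_sync(lambda x: x ** 2, range(16))
print(results)
```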

3,355 citations

Journal ArticleDOI
TL;DR: An algorithm is presented by which a process in a distributed system determines a global state of the system during a computation; global state detection helps to solve an important class of problems: stable property detection.
Abstract: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked,” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.
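
The sketch below is a compact single-threaded simulation of the marker-based recording rule from this paper: a process records its own state, sends a marker on each outgoing channel, and records as in-flight any message that arrives on a channel after it has recorded but before that channel's marker. The token-passing workload and the fixed delivery order are illustrative simplifications, not the paper's formal model.

```python
# Simplified simulation of marker-based global-state recording (Chandy-Lamport).
# Tokens are conserved, so a consistent snapshot must account for exactly 10 tokens.
from collections import deque

MARKER = "MARKER"

class Process:
    def __init__(self, name, tokens):
        self.name, self.tokens = name, tokens
        self.inbox = {}           # sender name -> FIFO channel (deque)
        self.peers = {}           # peer name -> Process object
        self.saved_state = None   # recorded local state
        self.channel_rec = {}     # sender name -> recorded in-flight messages
        self.done = set()         # channels whose marker has already arrived

    def connect(self, other):
        self.inbox[other.name] = deque()
        self.peers[other.name] = other

    def send(self, dest, msg):
        dest.inbox[self.name].append(msg)

    def start_recording(self):
        self.saved_state = self.tokens                    # record own state
        self.channel_rec = {s: [] for s in self.inbox}    # start recording channels
        for peer in self.peers.values():                  # marker on every out-channel
            self.send(peer, MARKER)

    def receive_one(self, sender):
        msg = self.inbox[sender].popleft()
        if msg == MARKER:
            if self.saved_state is None:
                self.start_recording()
            self.done.add(sender)                         # this channel's state is final
        else:
            self.tokens += msg                            # deliver the tokens
            if self.saved_state is not None and sender not in self.done:
                self.channel_rec[sender].append(msg)      # message was in flight

p, q = Process("p", 6), Process("q", 4)
p.connect(q); q.connect(p)

p.tokens -= 2; p.send(q, 2)        # 2 tokens in flight from p to q
q.tokens -= 1; q.send(p, 1)        # 1 token in flight from q to p
p.start_recording()                # p initiates the snapshot

q.receive_one("p"); q.receive_one("p")   # q gets the tokens, then p's marker
p.receive_one("q"); p.receive_one("q")   # p gets the token, then q's marker

in_flight = sum(sum(msgs) for proc in (p, q) for msgs in proc.channel_rec.values())
print("recorded total:", p.saved_state + q.saved_state + in_flight)   # -> 10
```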

2,738 citations

01 Jan 1996
TL;DR: MPI (Message Passing Interface) as discussed by the authors is a specification for a standard message-passing library that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists.
Abstract: MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.
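
As a minimal example of the message-passing model that MPICH implements, the snippet below passes a message around a ring of ranks. It uses the mpi4py Python binding purely for brevity (MPICH itself exposes a C/Fortran API) and assumes an MPI implementation plus mpi4py are installed, launched with something like `mpiexec -n 4 python ring.py`.

```python
# Ring exchange over MPI: each rank sends to its right neighbour and receives
# from its left neighbour in a single combined call to avoid deadlock.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

right = (rank + 1) % size
left = (rank - 1) % size

msg = comm.sendrecv(f"hello from rank {rank}", dest=right, source=left)
print(f"rank {rank} received: {msg!r}")
```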

2,065 citations

Proceedings Article
16 Jan 1995
TL;DR: In this paper, the authors describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Abstract: Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.
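
The rollback-recovery idea in the first sentence can be shown with a toy that is much simpler than libckpt (which checkpoints the process image of a Unix program transparently): periodically write the program's state to disk atomically, and on restart resume from the latest checkpoint. The file name and interval below are arbitrary.

```python
# Toy periodic checkpointing with rollback recovery: rerunning the script after a
# crash resumes from the most recently written checkpoint file.
import os
import pickle

CKPT_FILE = "loop.ckpt"      # arbitrary checkpoint file name
CKPT_EVERY = 1000            # checkpoint interval, in iterations

def load_or_init():
    if os.path.exists(CKPT_FILE):
        with open(CKPT_FILE, "rb") as f:
            return pickle.load(f)        # recover after a failure
    return {"i": 0, "acc": 0.0}

def checkpoint(state):
    tmp = CKPT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT_FILE)           # atomic rename: never a torn checkpoint

if __name__ == "__main__":
    state = load_or_init()
    while state["i"] < 10_000:
        state["acc"] += state["i"] ** 0.5    # stand-in for real work
        state["i"] += 1
        if state["i"] % CKPT_EVERY == 0:
            checkpoint(state)
    print("done:", state["acc"])
```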

670 citations

Proceedings Article
10 Apr 2005
TL;DR: This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare, and others, providing fast, transparent application migration.
Abstract: This paper describes the design and implementation of a system that uses virtual machine technology [1] to provide fast, transparent application migration. This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare, and others. Neither the application nor any clients communicating with the application can tell that the application has been migrated. Experimental measurements show that, for a variety of workloads, application downtime caused by migration is less than a second.

588 citations