scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

DMTCP: Transparent checkpointing for cluster computations and the desktop

TL;DR: DMTCP as mentioned in this paper is a transparent user-level checkpointing package for distributed applications, which is used for the runCMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within some larger package.
Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads; as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/ semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.

Content maybe subject to copyright    Report

Citations
More filters
01 Jan 2020
TL;DR: The concept of data migration as a promising way of achieving proactive fault tolerance in HPC systems is introduced and a lightweight application library is presented to assist application programmers in making their applications fault tolerant.
Abstract: In this dissertation, we present a comprehensive survey on the state-of-the-practice failure prediction methods for HPC systems. We further introduce the concept of data migration as a promising way of achieving proactive fault tolerance in HPC systems. We present a lightweight application library – called LAIK – to assist application programmers in making their applications fault tolerant. Moreover, we propose an extension – called MPI sessions and MPI process sets – to the state-of-the-art programming model for HPC applications – the Message Passing Interface (MPI) – in order to benefit from failure prediction.

3 citations

Proceedings ArticleDOI
14 Sep 2020
TL;DR: An algorithm based on dynamic programming with predictable performance and scalability that enables supercomputing infrastructure schedulers to analyze the trade-off between delaying the start of opportunistic on-demand jobs and loss of progress due to preemption of batch jobs, and take decisions in near real-time is proposed.
Abstract: The increasing scale and complexity of scientific applications are rapidly transforming the ecosystem of tools, methods, and workflows adopted by the high-performance computing (HPC) community. Big data analytics and deep learning are gaining traction as essential components in this ecosystem in a variety of scenarios, such as, steering of experimental instruments, acceleration of high-fidelity simulations through surrogate computations, and guided ensemble searches. In this context, the batch job model traditionally adopted by the supercomputing infrastructures needs to be complemented with support to schedule opportunistic on-demand analytics jobs, leading to the problem of efficient preemption of batch jobs with minimum loss of progress. In this paper, we design and implement a simulator, CoSim, that enables on-the-fly analysis of the trade-offs arising between delaying the start of opportunistic on-demand jobs, which leads to longer analytics latency, and loss of progress due to preemption of batch jobs, which is necessary to make room for on-demand jobs. To this end, we propose an algorithm based on dynamic programming with predictable performance and scalability that enables supercomputing infrastructure schedulers to analyze the aforementioned trade-off and take decisions in near real-time. Compared with other state-of-art approaches using traces of the Theta pre-Exascale machine, our approach is capable of finding the optimal solution, while achieving high performance and scalability.

3 citations

Proceedings ArticleDOI
15 Dec 2014
TL;DR: It is reported that with existing technologies, it is feasible to create a cloud renderer that is standards compliant, perform ant and extensible, and point out some research avenues for further extension and improvements.
Abstract: In this paper we make a feasibility study of a cloud renderer. We record our attempts at implementing one with off-the-self components. Apart from the resilience scheme, we report that with existing technologies, it is feasible to create one that is standards compliant, perform ant and extensible. We point out some research avenues for further extension and improvements.

3 citations

31 Dec 2017
TL;DR: A framework for intelligently characterizing and managing extreme-scale HPC system resources is proposed and new HPC resource management techniques for intelligent utilizing system resources are developed through the optimal scheduling of applications to HPC nodes and the optimal configuration of fault resilience protocols.
Abstract: RESOURCE MANAGEMENT FOR EXTREME SCALE HIGH PERFORMANCE COMPUTING SYSTEMS IN THE PRESENCE OF FAILURES High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of applications being forced to share resources, in particular, the contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through the (a) optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects from system

2 citations

Proceedings ArticleDOI
01 Sep 2018
TL;DR: This paper intends to help Fortran MPI application developers avoid problems when developing fault-tolerant MPI applications.
Abstract: Powerful high performance computing systems of the future are expected to have higher failure rates than current systems. As a result, HPC applications running on such future systems are more likely to encounter a system failure than on today's machines. Application fault tolerance is therefore becoming more important to avoid costly waste of resources associated with rerunning failed applications. The MPI 3.1 standard does not address the issue of MPI process failures. Checkpoint/restart is commonly used to add fault tolerance to MPI applications. However, there can be complicated issues impacting an MPI application's ability to correctly and efficiently write checkpoint files, particularly if Fortran I/O statements are used. Moreover, it may be inefficient restart a large number MPI processes from a checkpoint. Several MPI fault tolerance libraries, such as ULFM, are being developed to enabl MPI programs to recover from MPI process failures. This can circumvent much of the overhead of an application restart, including rescheduling, launching, initializing, and reading checkpoint data. Each library uses a different approach to recovery from MPI process failures. Unfortunately, some of the proposed recovery models are incompatible with Fortran. This paper intends to help Fortran MPI application developers avoid problems when developing fault-tolerant MPI applications.

2 citations

References
More filters
Journal ArticleDOI
01 May 2007
TL;DR: The IPython project as mentioned in this paper provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation for interactive work and a comprehensive library on top of which more sophisticated systems can be built.
Abstract: Python offers basic facilities for interactive work and a comprehensive library on top of which more sophisticated systems can be built. The IPython project provides on enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation

3,355 citations

Journal ArticleDOI
TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
Abstract: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “ the system is deadlocked” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.

2,738 citations

01 Jan 1996
TL;DR: The MPI Message Passing Interface (MPI) as discussed by the authors is a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists.
Abstract: MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.

2,065 citations

Proceedings Article
16 Jan 1995
TL;DR: In this paper, the authors describe a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Abstract: Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.

670 citations

Proceedings Article
10 Apr 2005
TL;DR: This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating system, including Microsoft Windows, Linux, Novell NetWare and others, to provide fast, transparent application migration.
Abstract: This paper describes the design and implementation of a system that uses virtual machine technology [1] to provide fast, transparent application migration. This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating system, including Microsoft Windows, Linux, Novell NetWare and others. Neither the application nor any clients communicating with the application can tell that the application has been migrated. Experimental measurements show that for a variety of workloads, application downtime caused by migration is less than a second.

588 citations