DMTCP: Transparent checkpointing for cluster computations and the desktop
23 May 2009, pp. 1-12
TL;DR: DMTCP is a transparent user-level checkpointing package for distributed applications. It checkpoints runCMS, which is used for the CMS experiment of the Large Hadron Collider at CERN, and because it is unprivileged it can be incorporated and distributed as a checkpoint-restart module within a larger package.
Abstract: DMTCP (Distributed MultiThreaded CheckPointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment of the Large Hadron Collider at CERN. DMTCP transparently checkpoints general cluster computations consisting of many nodes, processes, and threads, as well as typical desktop applications. On 128 distributed cores (32 nodes), checkpoint and restart times are typically 2 seconds, with negligible run-time overhead. Typical checkpoint times are reduced to 0.2 seconds when using forked checkpointing. Experimental results show that checkpoint time remains nearly constant as the number of nodes increases on a medium-size cluster. DMTCP automatically accounts for fork, exec, ssh, mutexes/semaphores, TCP/IP sockets, UNIX domain sockets, pipes, ptys (pseudo-terminals), terminal modes, ownership of controlling terminals, signal handlers, open file descriptors, shared open file descriptors, I/O (including the readline library), shared memory (via mmap), parent-child process relationships, pid virtualization, and other operating system artifacts. By emphasizing an unprivileged, user-space approach, compatibility is maintained across Linux kernels from 2.6.9 through the current 2.6.28. Since DMTCP is unprivileged and does not require special kernel modules or kernel patches, DMTCP can be incorporated and distributed as a checkpoint-restart module within some larger package.
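The forked checkpointing mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes a POSIX system where `os.fork` is available. The names `forked_checkpoint` and `load_checkpoint`, the use of `pickle`, and the temporary-file layout are illustrative assumptions, not DMTCP's implementation, which saves whole process images rather than Python objects. The idea shown is only the general one: the parent forks, the child holds a copy-on-write snapshot of memory and writes it out, and the parent resumes work without waiting for the write.

```python
import os
import pickle


def forked_checkpoint(state, path):
    """Write `state` to `path` from a forked child and return the child's pid.

    The child sees a copy-on-write snapshot of the parent's memory as of the
    fork, so the parent may keep mutating `state` while the child writes.
    (Hypothetical helper for illustration; not part of DMTCP.)
    """
    pid = os.fork()
    if pid == 0:
        # Child: serialize the snapshot, then exit immediately without
        # running the parent's cleanup handlers.
        status = 1
        try:
            tmp = f"{path}.tmp"
            with open(tmp, "wb") as f:
                pickle.dump(state, f)
            os.replace(tmp, path)  # atomic rename: no partial file is visible
            status = 0
        finally:
            os._exit(status)
    return pid  # Parent: returns right away and keeps computing.


def load_checkpoint(path):
    """Read back a checkpoint written by forked_checkpoint."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The parent should eventually call `os.waitpid` on the returned pid to reap the child and to learn whether the write succeeded.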
Citations
01 Jan 2012
TL;DR: A new reversible debugging platform, based on checkpoint, restart, re-execute and decomposition of histories of debugging commands, which can reversibly debug real-world multithreaded programs, such as MySQL, on multi-core architectures and a novel tool implemented on top of this platform, called reverse expression watchpoint.
Abstract: Reversible debuggers have existed since the early 1970s. However, they are not widely used, with the possible exception of GDB. GDB's target record is useful only when the cause of the bug is close in time to the bug manifestation. When the cause of the bug is far away from the manifestation, one resorts to a series of debugging sessions with the goal of narrowing down the cause of the bug. Thanks to reverse execution, it is possible to jump back and forth to any time of the execution.
In this dissertation, we present a new reversible debugging platform, based on checkpoint, restart, re-execute and decomposition of histories of debugging commands. Our platform can reversibly debug real-world multithreaded programs, such as MySQL, on multi-core architectures.
We present a novel tool implemented on top of this reversible debugging platform, called reverse expression watchpoint. Reverse expression watchpoint helps the user diagnose bugs for which the cause is far away from the manifestation. Once the user identifies a failed invariant, she wishes to automatically locate a program statement inside a debugger, such as GDB, where the invariant holds but fails at the immediately following statement. This approach is different from GDB's software watchpoints. Reverse expression watchpoint performs large jumps in time, thanks to this new approach, and it takes advantage of multi-core architectures during replay for multithreaded applications.
3 citations
27 Dec 2011
TL;DR: This work presents a novel approach to modeling the behavior of message-passing parallel applications based on the concept of signatures, which yields a model that predicts application execution time on different systems with variable input data size.
Abstract: Being able to accurately estimate how an application will perform in a specific computational system provides many useful benefits and can result in smarter decisions. In this work we present a novel approach to model the behavior of message passing parallel applications. Based on the concept of signatures, which are the most relevant parts of an application (phases), we are able to build a model that allows us to predict the application execution time in different systems with variable input data size. Executing these signatures with different input data sizes defines a partial function of the program's behavior. Using regression we can generalize this behavior function to predict an application's performance in a target system with other input data sizes within a predefined range. We explain our methodology and, in order to validate the proposal, present results using a synthetic program and well known applications.
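The regression step described in this abstract can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which regression model is used, so an ordinary least-squares straight line is assumed here, and the function names `fit_line` and `predict_time` are hypothetical. The sketch fits measured (input size, execution time) pairs and refuses to extrapolate outside the measured range, mirroring the "within a predefined range" restriction.

```python
def fit_line(sizes, times):
    """Ordinary least-squares fit of times = slope * size + intercept."""
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_time(sizes, times, new_size):
    """Predict execution time for new_size, only inside the measured range."""
    if not min(sizes) <= new_size <= max(sizes):
        raise ValueError("input size is outside the measured range")
    slope, intercept = fit_line(sizes, times)
    return slope * new_size + intercept
```

In practice the measured times would come from executing the signatures on the target system; a different model (polynomial, logarithmic) may fit better depending on the application.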
3 citations
07 May 2014
TL;DR: An approach and prototype module for ns-3 is presented which provides an API for checkpointing running ns-3 applications at arbitrary times, restoring these applications to a running state, and for modifying parameters of the restored simulation before process continuation.
Abstract: Given that large-scale network simulations are a significant part of active research [6], and that such simulations are known to be computationally and resource intensive, it is still of interest to find innovative ways to achieve faster execution times and to more efficiently provide data and results for the researcher. While utilization of parallel programming techniques and frameworks such as MPI, CUDA, and OpenCL are valid approaches and very useful in speeding up computations, few of these techniques seek to reduce repetitive, uninteresting, or non-changing segments of simulation runs which are unnecessarily repeated, for instance during initialization, and which are not of interest to the end result being studied. Recent advances in process checkpointing utilities, such as Distributed Multithreaded Checkpointing (DMTCP) [1], which support practical user-space checkpointing of a wide variety of distributed, multithreaded applications and support the use of important libraries such as MPI, enable innovative techniques for achieving computational savings and provide the potential for a more generic mechanism of moving simulations forward and backward in time efficiently. We present an approach and prototype module for ns-3 which provides an API for checkpointing running ns-3 applications at arbitrary times, restoring these applications to a running state, and for modifying parameters of the restored simulation before process continuation.
3 citations
01 Jan 2010
TL;DR: Scientific and economic problems can increasingly no longer be handled with the resources available locally at research institutions or companies; a Grid Checkpointing Architecture (GCA) is designed to realize grid fault tolerance.
Abstract: Scientific and economic problems can increasingly no longer be handled with the resources available locally at research institutions or companies. Grid technologies make extensive, distributed resources, formed by joining several sites, jointly usable. As the number of compute nodes grows, so does the probability of node failures. To avoid losing intermediate states, for example of long-running applications, when a machine fails, fault-tolerance mechanisms must be employed. Fault tolerance can be achieved above all through checkpoint/restart; that is, application states are saved at periodic intervals and can be restored in case of failure. In this thesis, a Grid Checkpointing Architecture (GCA) is designed to realize grid fault tolerance. The focus is on supporting heterogeneity; that is, the thesis investigates to what extent different existing, node-bound checkpointer packages can be integrated into a GCA. The analysis of available checkpointer packages reveals large differences in their ability to save and restore process resources such as IPC, sockets, files, et cetera. This makes it necessary to match applications against the checkpointer packages in use so that all process resources of an application can be reconstructed after a failure. The core element of the GCA is the so-called Uniform Checkpointer Interface (Uniforme Checkpointer-Schnittstelle, UCS), which serves as a uniform interface to heterogeneous checkpointer packages. The interface implementation must relate to existing checkpointing protocols, attend to the mapping of grid-level semantics onto those of the grid-node level (process and user management), take individual call semantics into account, and implement checkpoint-file and callback management. So-called containers are based on lightweight virtualization. With their help, potential resource conflicts at restart can be avoided when identifiers of processes, IPC objects, et cetera are already taken. This thesis sets out the effects on the GCA when various container-supporting checkpointer packages are integrated. A central challenge in grid fault tolerance is that heterogeneous checkpointer packages take part in saving and restoring a distributed application. For example, to save channel states using heterogeneous checkpointer packages at both channel ends, the packages must cooperate in a marker-based approach. Cooperation must not, however, come at the cost of modifying checkpointers or applications. Techniques such as callbacks and library interposition are suitable means here. The goal of adaptive checkpointing is to increase checkpointing performance. To this end, this thesis examined switching in both directions between coordinated and uncoordinated checkpointing and implemented incremental grid checkpointing. The thesis shows that grid fault tolerance can be realized using heterogeneous checkpointer packages. The prototype developed integrates three different checkpointer packages, which are used to save and restore applications in a coordinated, independent, and incremental manner. The applications need not be modified. The extensive measurements show that the GCA incurs no appreciable overhead compared with native checkpointer packages.
3 citations
TL;DR: This article presents a novel way of expressing computational capacity, more adequate for heterogeneous clusters, and also advocates for task migration in order to further improve the utilization.
Abstract: This work has been supported by the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and TIN2016-81840-REDT (CAPAP-H6 network) and the European HiPEAC Network of Excellence.
3 citations
References
01 May 2007
TL;DR: The IPython project provides an enhanced interactive environment for Python that includes, among other features, support for data visualization and facilities for distributed and parallel computation.
Abstract: Python offers basic facilities for interactive work and a comprehensive library on top of which more sophisticated systems can be built. The IPython project provides an enhanced interactive environment that includes, among other features, support for data visualization and facilities for distributed and parallel computation.
3,355 citations
TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
Abstract: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “the system is deadlocked” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.
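The abstract does not spell out the mechanism, so the following is a hedged sketch of the marker-based distributed snapshot scheme commonly associated with this paper, simulated inside a single Python process. It assumes reliable FIFO channels between every pair of nodes. The class names `Node` and `System`, the bank-balance example, and the round-robin `drain` scheduler are illustrative assumptions. A node records its local state when it starts the snapshot or first receives a marker, sends markers on all outgoing channels, and records messages that arrive on each incoming channel until that channel delivers its marker.

```python
from collections import deque

MARKER = object()  # sentinel distinguishing markers from application messages


class Node:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.snapshot = None   # local state recorded for the snapshot
        self.in_transit = {}   # sender -> messages recorded on that channel
        self.open = set()      # incoming channels still being recorded


class System:
    def __init__(self, balances):
        self.nodes = {n: Node(n, b) for n, b in balances.items()}
        # One FIFO channel per ordered pair of distinct nodes.
        self.chan = {(s, d): deque()
                     for s in self.nodes for d in self.nodes if s != d}

    def send(self, src, dst, amount):
        """Application message: move `amount` from src to dst."""
        self.nodes[src].balance -= amount
        self.chan[(src, dst)].append(amount)

    def start_snapshot(self, name):
        """Record local state and send a marker on every outgoing channel."""
        node = self.nodes[name]
        others = [n for n in self.nodes if n != name]
        node.snapshot = node.balance
        node.open = set(others)
        node.in_transit = {n: [] for n in others}
        for n in others:
            self.chan[(name, n)].append(MARKER)

    def deliver(self, src, dst):
        """Deliver the oldest message on channel src -> dst."""
        msg = self.chan[(src, dst)].popleft()
        node = self.nodes[dst]
        if msg is MARKER:
            if node.snapshot is None:
                self.start_snapshot(dst)  # first marker triggers recording
            node.open.discard(src)        # state of this channel is now final
        else:
            node.balance += msg
            if node.snapshot is not None and src in node.open:
                node.in_transit[src].append(msg)  # message was in flight

    def drain(self):
        """Deliver messages round-robin until every channel is empty."""
        while any(self.chan.values()):
            for (src, dst), queue in self.chan.items():
                if queue:
                    self.deliver(src, dst)

    def snapshot_total(self):
        """Sum of recorded local states plus recorded in-flight messages."""
        return sum(n.snapshot + sum(sum(m) for m in n.in_transit.values())
                   for n in self.nodes.values())
```

Because money only moves between nodes, the total is a conserved quantity: a consistent snapshot must report the same total as the live system, even when transfers are in flight while the snapshot is taken.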
2,738 citations
01 Jan 1996
TL;DR: MPICH is an implementation of MPI (Message Passing Interface), the standard message-passing library specification defined by the MPI Forum, and is designed to combine portability with high performance.
Abstract: MPI (Message Passing Interface) is a specification for a standard library for message passing that was defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists. Multiple implementations of MPI have been developed. In this paper, we describe MPICH, unique among existing implementations in its design goal of combining portability with high performance. We document its portability and performance and describe the architecture by which these features are simultaneously achieved. We also discuss the set of tools that accompany the free distribution of MPICH, which constitute the beginnings of a portable parallel programming environment. A project of this scope inevitably imparts lessons about parallel computing, the specification being followed, the current hardware and software environment for parallel computing, and project management; we describe those we have learned. Finally, we discuss future developments for MPICH, including those necessary to accommodate extensions to the MPI Standard now being contemplated by the MPI Forum.
2,065 citations
16 Jan 1995
TL;DR: libckpt is a portable checkpointing tool for Unix that implements all applicable performance optimizations reported in the literature and also supports the incorporation of user directives into the creation of checkpoints.
Abstract: Checkpointing is a simple technique for rollback recovery: the state of an executing program is periodically saved to a disk file from which it can be recovered after a failure. While recent research has developed a collection of powerful techniques for minimizing the overhead of writing checkpoint files, checkpointing remains unavailable to most application developers. In this paper we describe libckpt, a portable checkpointing tool for Unix that implements all applicable performance optimizations which are reported in the literature. While libckpt can be used in a mode which is almost totally transparent to the programmer, it also supports the incorporation of user directives into the creation of checkpoints. This user-directed checkpointing is an innovation which is unique to our work.
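The rollback-recovery idea in the first sentence of this abstract (periodically save state to a disk file, recover from it after a failure) can be sketched in a few lines of Python. This is an application-level illustration only, not libckpt's mechanism or API: the function names, the `pickle` format, and the `fail_at` parameter used to simulate a crash are all assumptions made for the example.

```python
import os
import pickle


def save_checkpoint(state, path):
    """Write state to a temporary file, then atomically rename it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)


def load_checkpoint(path, default):
    """Return the saved state, or `default` when no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, n_steps, every, fail_at=None):
    """Sum 0..n_steps-1, checkpointing after every `every` completed steps.

    `fail_at` simulates a crash by raising just before that step executes.
    """
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(state, path)
    return state["total"]
```

After a failure, calling `run` again with the same path resumes from the last saved step rather than from the beginning; work done since the last checkpoint is repeated, which is the trade-off that the checkpoint interval controls.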
670 citations
10 Apr 2005
TL;DR: This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others, using virtual machine technology to provide fast, transparent application migration.
Abstract: This paper describes the design and implementation of a system that uses virtual machine technology [1] to provide fast, transparent application migration. This is the first system that can migrate unmodified applications on unmodified mainstream Intel x86-based operating systems, including Microsoft Windows, Linux, Novell NetWare and others. Neither the application nor any clients communicating with the application can tell that the application has been migrated. Experimental measurements show that for a variety of workloads, application downtime caused by migration is less than a second.
588 citations