Short Notes Low-Latency, Concurrent Checkpointing for Parallel Programs

Open Access

Abstract:

This short note presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
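The copy-on-write idea in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the implementation measured in the note: the class `CowMemory`, its methods, and the page layout are hypothetical names chosen for illustration, and a lock plus a Python set stand in for the hardware page protection a real checkpointer would use. The program is interrupted only while every page is marked protected; the copying of pages then overlaps with the program's own writes.

```python
import threading


class CowMemory:
    """Toy paged memory with a copy-on-write checkpoint (hypothetical names)."""

    def __init__(self, num_pages, page_size=4):
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.lock = threading.Lock()
        self.protected = set()  # pages the current checkpoint has not saved yet
        self.snapshot = {}      # page index -> contents at checkpoint start

    def begin_checkpoint(self):
        # The only step that interrupts the program: protect every page.
        with self.lock:
            self.snapshot = {}
            self.protected = set(range(len(self.pages)))

    def write(self, page, offset, value):
        # Program-side store. A write to a protected page first saves the
        # old contents, which simulates the page-fault handler.
        with self.lock:
            if page in self.protected:
                self.snapshot[page] = bytes(self.pages[page])
                self.protected.discard(page)
            self.pages[page][offset] = value

    def copy_remaining(self):
        # Checkpointer-side loop: saves still-protected pages while the
        # program keeps running.
        while True:
            with self.lock:
                if not self.protected:
                    return
                page = self.protected.pop()
                self.snapshot[page] = bytes(self.pages[page])


def run_demo():
    mem = CowMemory(num_pages=8)
    mem.write(0, 0, 1)              # state before the checkpoint
    mem.begin_checkpoint()
    worker = threading.Thread(target=mem.copy_remaining)
    worker.start()
    mem.write(0, 0, 99)             # program continues during the checkpoint
    mem.write(3, 1, 7)
    worker.join()
    return mem
```

Whichever thread reaches a page first, the snapshot holds that page's contents as of `begin_checkpoint`, while the live pages hold the later writes. In a real system the snapshot would be written to stable storage rather than kept in a dictionary.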

Citations

Proceedings ArticleDOI

Algorithm-based diskless checkpointing for fault tolerant matrix operations

James S. Plank, +2 more

TL;DR: This paper presents high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. The implementations run on PVM networks of at least N processors and complete with low overhead as long as any N processors remain functional.

On staggered checkpointing

Nitin H. Vaidya

TL;DR: In this paper, a simple approach to arbitrarily stagger the checkpoints is presented, which requires that the processes take consistent logical checkpoints, as compared to consistent physical checkpoints enforced by existing algorithms.

References

Journal ArticleDOI

Distributed snapshots: determining global states of distributed systems

K. Mani Chandy, +1 more

- 01 Feb 1985 -

ACM Transactions on Computer Systems

TL;DR: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation; the algorithm helps solve an important class of problems, stable property detection.

Journal ArticleDOI

System structure for software fault tolerance

Brian Randell

- 01 Apr 1975 -

Sigplan Notices

TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".

Journal ArticleDOI

Memory coherence in shared virtual memory systems

Kai Li, +1 more

- 01 Nov 1989 -

ACM Transactions on Computer Systems

TL;DR: Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely coupled multiprocessor.

Proceedings ArticleDOI

Implementation techniques for main memory database systems

David J. DeWitt, +5 more

TL;DR: This paper considers the changes necessary to permit a relational database system to take advantage of large amounts of main memory. It evaluates AVL-tree vs. B+-tree access methods and hash-based vs. sort-merge query processing strategies, and studies recovery issues when most or all of the database fits in main memory.

Journal ArticleDOI

Checkpointing and Rollback-Recovery for Distributed Systems

Richard Koo, +1 more

- 01 Jan 1987 -

IEEE Transactions on Software Engineerin...

TL;DR: In this article, the authors consider the problem of bringing a distributed system to a consistent state after transient failures, and propose a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system from transient failures.

Related Papers (5)

Low-latency, concurrent checkpointing for parallel programs

An on-line algorithm for checkpoint placement

Low Overhead Incremental Checkpointing and Rollback Recovery Scheme on Windows Operating System

User-Level Checkpointing for LinuxThreads Programs

Experimental evaluation of concurrent checkpointing and rollback-recovery algorithms