Open Access
Short Notes Low-Latency, Concurrent Checkpointing for Parallel Programs
TLDR
The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
Abstract:
This short note presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
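The copy-on-write idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the implementation measured in the note: memory is modeled as a list of tiny `bytearray` pages, page protection is modeled as a Python set rather than real hardware protection, and the names `CowCheckpointer`, `begin`, `write`, and `copy_step` are hypothetical. It shows the two properties the abstract highlights: the program is interrupted only long enough to protect its pages, and the slow copying is overlapped with the program's continued execution.

```python
from dataclasses import dataclass, field


@dataclass
class CowCheckpointer:
    """Toy model of copy-on-write checkpointing (all names are hypothetical)."""

    memory: list                                   # list of bytearray "pages"
    protected: set = field(default_factory=set)    # pages not yet saved
    snapshot: dict = field(default_factory=dict)   # page index -> saved bytes

    def begin(self):
        """Short interruption: mark every page protected, copy nothing yet."""
        self.protected = set(range(len(self.memory)))
        self.snapshot = {}

    def write(self, page, offset, value):
        """A write by the running program.

        Writing to a still-protected page models a protection fault:
        the old contents are saved before the write proceeds.
        """
        if page in self.protected:
            self._save(page)
        self.memory[page][offset] = value

    def copy_step(self):
        """Background copier: save one still-protected page.

        Returns False once no protected pages remain.
        """
        if not self.protected:
            return False
        self._save(min(self.protected))
        return True

    def _save(self, page):
        # Record the page as it was when the checkpoint began, then unprotect it.
        self.snapshot[page] = bytes(self.memory[page])
        self.protected.discard(page)


# The program keeps writing while the copier works through the pages.
mem = [bytearray(b"aaaa"), bytearray(b"bbbb"), bytearray(b"cccc")]
ckpt = CowCheckpointer(mem)
ckpt.begin()
ckpt.write(1, 0, ord("X"))   # page 1 is saved first, then modified
ckpt.copy_step()             # copier saves page 0
ckpt.write(0, 0, ord("Y"))   # page 0 is already saved, so the write goes straight through
while ckpt.copy_step():      # copier saves the remaining pages
    pass

# The checkpoint image reflects memory as it was at begin().
image = b"".join(ckpt.snapshot[i] for i in range(len(mem)))
```

In this model the checkpoint image is consistent with the instant `begin()` was called even though the program modified two pages afterwards. A real implementation would rely on the operating system's page-protection mechanism and a separate copier thread or process instead of explicit `write` and `copy_step` calls.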
Citations
Proceedings Article
Algorithm-based diskless checkpointing for fault tolerant matrix operations
TL;DR: This paper presents high-performance implementations of several algorithms for distributed scientific computing, including Cholesky factorization, LU factorization, QR factorization, and preconditioned conjugate gradient. The implementations run on PVM networks of at least N processors and can complete with low overhead as long as any N processors remain functional.
On staggered checkpointing
TL;DR: In this paper, a simple approach to arbitrarily stagger the checkpoints is presented, which requires that the processes take consistent logical checkpoints, as compared to consistent physical checkpoints enforced by existing algorithms.
References
Journal Article
Distributed snapshots: determining global states of distributed systems
K. Mani Chandy, Leslie Lamport
TL;DR: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
Journal Article
System structure for software fault tolerance
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
Journal Article
Memory coherence in shared virtual memory systems
Kai Li, Paul Hudak
TL;DR: Both theoretical and practical results show that the memory coherence problem can indeed be solved efficiently on a loosely coupled multiprocessor.
Proceedings Article
Implementation techniques for main memory database systems
David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael Stonebraker, Darien Wood
TL;DR: This paper considers the changes necessary to permit a relational database system to take advantage of large amounts of main memory. It evaluates AVL versus B+-tree access methods and hash-based versus sort-merge query processing strategies, and studies recovery issues when most or all of the database fits in main memory.
Journal Article
Checkpointing and Rollback-Recovery for Distributed Systems
Richard Koo, Sam Toueg
TL;DR: In this article, the authors consider the problem of bringing a distributed system to a consistent state after transient failures, and propose a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system from transient failures.