
Short Notes: Low-Latency, Concurrent Checkpointing for Parallel Programs

Abstract
This short note presents the results of an implementation of several algorithms for checkpointing and restarting parallel programs on shared-memory multiprocessors. The algorithms are compared according to the metrics of overall checkpointing time, overhead imposed by the checkpointer on the target program, and amount of time during which the checkpointer interrupts the target program. The best algorithm measured achieves its efficiency through a variation of copy-on-write, which allows the most time-consuming operations of the checkpoint to be overlapped with the running of the program being checkpointed.
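
The copy-on-write idea mentioned above can be illustrated with a toy model. The following is only a minimal sketch under stated assumptions, not the algorithm measured in the note: memory is modeled as a list of small pages, and the class name `CowCheckpointer` and its methods (`begin_checkpoint`, `write`, `copy_step`, `checkpoint_image`) are hypothetical names introduced for illustration. The program is interrupted only while pages are marked as pending; the expensive copying is then interleaved with the program's own writes, and a write to a page that has not been saved yet saves the old contents first.

```python
class CowCheckpointer:
    """Toy page-level copy-on-write checkpoint (illustrative only)."""

    def __init__(self, num_pages, page_size=4):
        self.pages = [bytearray(page_size) for _ in range(num_pages)]
        self.snapshot = {}    # page index -> bytes saved for the checkpoint
        self.pending = set()  # pages not yet saved (conceptually write-protected)

    def begin_checkpoint(self):
        # The only step that would pause the program: mark every page as pending.
        self.snapshot = {}
        self.pending = set(range(len(self.pages)))

    def write(self, page, offset, value):
        # Program write: if the page is still pending, save its old contents first.
        if page in self.pending:
            self._save(page)
        self.pages[page][offset] = value

    def copy_step(self):
        # Background copier: save one pending page; return False when none remain.
        if not self.pending:
            return False
        self._save(min(self.pending))
        return True

    def _save(self, page):
        self.snapshot[page] = bytes(self.pages[page])
        self.pending.discard(page)

    def checkpoint_image(self):
        # Valid only once every page has been saved.
        assert not self.pending, "checkpoint still in progress"
        return [self.snapshot[i] for i in range(len(self.pages))]


# Usage: program writes are interleaved with the background copier.
mem = CowCheckpointer(num_pages=3)
mem.write(0, 0, 1)      # state before the checkpoint
mem.begin_checkpoint()
mem.write(0, 0, 9)      # page 0 is saved before being overwritten
mem.copy_step()         # copier saves page 1
mem.write(2, 1, 7)      # page 2 is saved before being overwritten
while mem.copy_step():  # copier drains any remaining pages
    pass
image = mem.checkpoint_image()
```

In this sketch the checkpoint image reflects memory exactly as it was when `begin_checkpoint` was called, even though the program kept writing afterwards. A real implementation would rely on hardware page protection rather than an explicit check in `write`.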

