A higher order estimate of the optimum checkpoint interval for restart dumps

J. T. Daly
01 Feb 2006, Vol. 22, Iss. 3, pp. 303-312
TLDR
This paper examines methods of approximating the optimum checkpoint restart strategy for minimizing application run time on a system exhibiting Poisson single component failures and develops and compares two different models.
About
This article was published in Future Generation Computer Systems on 2006-02-01 and is currently open access. It has received 501 citations to date.

Citations
Proceedings Article

Pregel: a system for large-scale graph processing

TL;DR: A model for processing large graphs, designed for efficient, scalable and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.
Proceedings Article

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

TL;DR: The Scalable Checkpoint/Restart (SCR) library is a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. It improves efficiency on existing large-scale systems, and this benefit increases as the system size grows.
Proceedings Article

PLFS: a checkpoint filesystem for parallel applications

TL;DR: A virtual parallel log-structured file system that remaps an application's preferred data layout into one optimized for the underlying file system, which can reduce checkpoint time by an order of magnitude.
Journal Article

Toward Exascale Resilience: 2014 Update

TL;DR: This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
References
Journal Article

A first order approximation to the optimum checkpoint interval

TL;DR: It is standard practice to save periodically sufficient information to enable the job to be restarted at the previous point at which information was saved, and the saving of such information at these points is called checkpointing.
Journal Article

Impact of checkpoint latency on overhead ratio of a checkpointing scheme

TL;DR: The authors show that a large increase in checkpoint latency is acceptable if it is accompanied by a relatively small reduction in overhead, and that for equidistant checkpoints the optimal checkpoint interval is typically independent of checkpoint latency.

Journal Article

A variational calculus approach to optimal checkpoint placement

TL;DR: By means of the calculus of variations, an explicit formula is derived that links the optimal checkpointing frequency with a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery.
Frequently Asked Questions (1)
Q1. What contributions do the authors describe in the paper "A higher order estimate of the optimum checkpoint interval for restart dumps"?

This paper examines methods of approximating the optimum checkpoint restart strategy for minimizing application run time on a system exhibiting Poisson single component failures. The authors then derive a more complete cost function and demonstrate a perturbation solution that provides accurate high order approximations to the optimum checkpoint interval.
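
To make the difference between a first order and a higher order estimate concrete, here is a minimal Python sketch. This page does not state either formula, so both are assumptions: the first is the square-root form commonly attributed to the first order approximation listed in the references, and the second is the closed form commonly quoted for the higher order estimate. The function names and the symbols `delta` (time to write one checkpoint) and `mtbf` (mean time between failures) are hypothetical and chosen for illustration only.

```python
import math


def first_order_interval(delta: float, mtbf: float) -> float:
    """Assumed first order estimate: sqrt(2 * delta * mtbf).

    delta and mtbf must use the same time unit; the result is the
    compute time between checkpoints in that unit.
    """
    return math.sqrt(2.0 * delta * mtbf)


def higher_order_interval(delta: float, mtbf: float) -> float:
    """Assumed higher order estimate (commonly quoted form, not from this page).

    For delta < 2 * mtbf:
        sqrt(2*delta*mtbf) * (1 + sqrt(x)/3 + x/9) - delta,  x = delta / (2*mtbf)
    For delta >= 2 * mtbf the estimate is taken to be mtbf itself.
    """
    if delta >= 2.0 * mtbf:
        return mtbf
    x = delta / (2.0 * mtbf)
    correction = 1.0 + math.sqrt(x) / 3.0 + x / 9.0
    return math.sqrt(2.0 * delta * mtbf) * correction - delta


if __name__ == "__main__":
    # Illustrative inputs only: a 5 minute checkpoint write, a 24 hour MTBF.
    delta_s = 300.0
    mtbf_s = 86400.0
    print("first order  (s):", first_order_interval(delta_s, mtbf_s))
    print("higher order (s):", higher_order_interval(delta_s, mtbf_s))
```

When the checkpoint write time is small relative to the mean time between failures, the two estimates differ by only a few percent. The gap widens as `delta` approaches `mtbf`, which is the regime a higher order estimate is meant to cover.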