Open Access Journal Article

A higher order estimate of the optimum checkpoint interval for restart dumps

- 01 Feb 2006 -

- Vol. 22, Iss: 3, pp 303-312

About:

This article was published in Future Generation Computer Systems on 2006-02-01 and is currently open access. It has received 501 citations to date.

Fig. 2. The relative error for four different perturbation solutions η̃, represented by solid lines, plotted as a function of the nondimensional dump time ξ. The relative error for the asymptotic solution η̃ = 2ξ² is shown with a dashed line in each plot.

Fig. 1. The application time line broken into five passed compute segments and one failed compute segment designated by X. The application run is complete when the accumulated computation time τ of all of the passed segments is equal to the total solution time Ts for the application.

Table 1. The optimal reference value ξ0 associated with applying different numbers of terms from the perturbation solution η̃ in Eq. (32) is shown along with the maximum relative error in τopt corresponding to that solution.

Fig. 4. Comparison of model and simulation results for M = 6 h, Ts = 500 h, R = 10 min, and δ = 5 min. The new model predicts τopt = 57 min.

Table 2. The optimal reference value ξ1 associated with applying different numbers of terms from the perturbation solution η̃ in Eq. (32) is shown along with the maximum relative error in Tw(τopt) corresponding to that solution.

Fig. 5. Comparison of model and simulation results for M = 15 min, Ts = 500 h, R = 10 min, and δ = 5 min. The new model predicts τopt = 9.1 min.
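The τopt values quoted in the captions of Figs. 4 and 5 can be checked with a few lines of arithmetic. The sketch below is not taken from this page: it assumes the commonly quoted closed form of the paper's higher order estimate, τopt = sqrt(2δM)[1 + ξ/3 + ξ²/9] − δ with ξ = sqrt(δ/2M) for δ < 2M (and τopt = M otherwise), alongside Young's first order estimate sqrt(2δM). The function names are hypothetical, and all times are in minutes.

```python
import math


def young_interval(delta, mtbf):
    """First order estimate of the checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * delta * mtbf)


def higher_order_interval(delta, mtbf):
    """Higher order estimate (assumed closed form, see the lead-in).

    delta: time to write one checkpoint (dump time)
    mtbf:  mean time between failures M
    """
    if delta >= 2.0 * mtbf:
        # Assumed fallback when the dump time is large relative to M.
        return mtbf
    xi = math.sqrt(delta / (2.0 * mtbf))
    return math.sqrt(2.0 * delta * mtbf) * (1.0 + xi / 3.0 + xi * xi / 9.0) - delta


# Parameters from the Fig. 4 and Fig. 5 captions, converted to minutes.
for label, mtbf in (("Fig. 4", 6 * 60.0), ("Fig. 5", 15.0)):
    delta = 5.0
    print(label,
          "first order:", round(young_interval(delta, mtbf), 2),
          "higher order:", round(higher_order_interval(delta, mtbf), 2))
```

Under these assumptions the higher order estimate rounds to 57 min for M = 6 h and to 9.1 min for M = 15 min, which agrees with the two captions. The first order estimate gives 60 min for the Fig. 4 parameters.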

Citations

Proceedings Article

Pregel: a system for large-scale graph processing

Grzegorz Malewicz, +6 more

TL;DR: A model for processing large graphs, designed for efficient, scalable, and fault-tolerant implementation on clusters of thousands of commodity computers; its implied synchronicity makes reasoning about programs easier.

Proceedings Article

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Adam Moody, +3 more

TL;DR: The Scalable Checkpoint/Restart (SCR) library is a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. It improves efficiency on existing large-scale systems, and this benefit increases as the system size grows.

Journal Article

Addressing failures in exascale computing

Marc Snir, +27 more

TL;DR: A report produced by a workshop on ‘Addressing failures in exascale computing’ held in Park City, Utah, 4–11 August 2012, which summarizes and builds on discussions on resilience.

Proceedings Article

PLFS: a checkpoint filesystem for parallel applications

John M. Bent, +7 more

TL;DR: A virtual parallel log-structured file system that remaps an application's preferred data layout into one optimized for the underlying file system, which can reduce checkpoint time by an order of magnitude.

Journal Article

Toward Exascale Resilience: 2014 Update

Franck Cappello, +5 more

TL;DR: This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.

References

Journal Article

A first order approximation to the optimum checkpoint interval

John W. Young

- 01 Sep 1974 -

Communications of The ACM

TL;DR: It is standard practice to save periodically sufficient information to enable the job to be restarted at the previous point at which information was saved, and the saving of such information at these points is called checkpointing.

NIST/SEMATECH e-Handbook of Statistical Methods; Chapter 1: Exploratory Data Analysis

Nathanael A. Heckert, +1 more

Journal Article

Impact of checkpoint latency on overhead ratio of a checkpointing scheme

Nitin H. Vaidya

- 01 Aug 1997 -

IEEE Transactions on Computers

TL;DR: In this paper, the authors show that a large increase in latency is acceptable if it is accompanied by a relatively small reduction in overhead, and for equidistant checkpoints, optimal checkpoint interval is typically independent of checkpoint latency.

Brief Contributions Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme

Nitin H. Vaidya

TL;DR: The paper shows that a large increase in latency is acceptable if it is accompanied by a relatively small reduction in overhead, and for equidistant checkpoints, optimal checkpoint interval is shown to be typically independent of checkpoint latency.

Journal Article

A variational calculus approach to optimal checkpoint placement

Yibei Ling, +2 more

- 01 Jul 2001 -

IEEE Transactions on Computers

TL;DR: By means of the calculus of variations, an explicit formula is derived that links the optimal checkpointing frequency with a general failure rate, with the objective of globally minimizing the total expected cost of checkpointing and recovery.

Related Papers (5)

A first order approximation to the optimum checkpoint interval

John W. Young

- 01 Sep 1974 -

Communications of The ACM

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

Adam Moody, +3 more

A survey of rollback-recovery protocols in message-passing systems

Elmootazbellah Nabil Elnozahy, +3 more

- 01 Sep 2002 -

ACM Computing Surveys

Understanding failures in petascale computers

Bianca Schroeder, +1 more

A large-scale study of failures in high-performance computing systems

Bianca Schroeder, +1 more

Frequently Asked Questions (1)

Q1. What contributions have the authors mentioned in the paper "A higher order estimate of the optimum checkpoint interval for restart dumps"?

This paper examines methods of approximating the optimum checkpoint restart strategy for minimizing application run time on a system exhibiting Poisson single component failures. The authors then derive a more complete cost function and demonstrate a perturbation solution that provides accurate high order approximations to the optimum checkpoint interval.
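To make the idea of minimizing a cost function concrete, the sketch below minimizes an expected wall clock time numerically. The cost function is an assumption recalled from the paper rather than quoted on this page: Tw(τ) = M e^(R/M) (e^((τ+δ)/M) − 1) Ts/τ, where M is the mean time between failures, R the restart time, δ the dump time, and Ts the total solution time. The function names and the ternary search are illustrative choices, not the authors' method, which is a perturbation solution.

```python
import math


def expected_wall_time(tau, delta, mtbf, restart, solve_time):
    """Assumed expected wall clock time Tw(tau) under Poisson failures."""
    return (mtbf * math.exp(restart / mtbf)
            * (math.exp((tau + delta) / mtbf) - 1.0)
            * solve_time / tau)


def minimize_interval(delta, mtbf, restart, solve_time, iterations=200):
    """Ternary search for the tau in (0, M] that minimizes Tw(tau).

    For this cost function the minimizer lies below M, and Tw has a single
    minimum on the search interval, so a ternary search is sufficient.
    """
    lo, hi = 1e-6, mtbf
    for _ in range(iterations):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if (expected_wall_time(a, delta, mtbf, restart, solve_time)
                < expected_wall_time(b, delta, mtbf, restart, solve_time)):
            hi = b
        else:
            lo = a
    return 0.5 * (lo + hi)


# All times in minutes: M = 6 h, Ts = 500 h, R = 10 min, delta = 5 min.
tau_best = minimize_interval(5.0, 360.0, 10.0, 500 * 60.0)
print("numerical optimum (min):", round(tau_best, 2))
print("expected wall time (h):",
      round(expected_wall_time(tau_best, 5.0, 360.0, 10.0, 500 * 60.0) / 60.0, 1))
```

In this assumed model R and Ts only scale Tw, so they do not move the location of the minimum. The optimum interval depends on δ and M alone.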
