Journal Article
A model for error recovery with global checkpointing
TL;DR: A new technique for providing software fault tolerance in concurrent systems is proposed; it combines traditional global checkpointing with the recovery block concept to yield an easily implementable error recovery mechanism.
About
This article was published in Information Sciences on 1983-09-01. It has received 4 citations to date. The article focuses on the topics: Software fault tolerance and Overhead (computing).
Citations
Journal Article
A survey of rollback-recovery protocols in message-passing systems
TL;DR: This survey covers rollback-recovery techniques that do not require special language constructs and distinguishes between checkpoint-based protocols, which rely solely on checkpointing for system state restoration, and log-based protocols.
Checkpointing and the modeling of program execution time
TL;DR: This chapter considers several models of checkpointing and recovery in a program in order to derive the distribution of program execution time or its expectation, and it is shown that the expected execution time increases linearly with the processing requirement in the presence of checkpoints.
Journal Article
A quasi-synchronous checkpointing algorithm that prevents contention for stable storage
TL;DR: This paper proposes a staggered quasi-synchronous checkpointing algorithm which reduces contention for network stable storage without any synchronization overhead.
Journal Article
On selecting rollback points for error recovery
TL;DR: A generalized formulation for checkpoint selection is given from which three different schemes are derived and each scheme is shown to have a smaller expected cost of recovery and a larger optimal checkpoint interval than rolling back to the most recent checkpoint.
References
Journal Article
System structure for software fault tolerance
TL;DR: In this article, the authors present a method for structuring complex computing systems by the use of what they term "recovery blocks", "conversations", and "fault-tolerant interfaces".
Journal Article
System structure for software fault tolerance
TL;DR: In this article, the authors present and discuss the rationale behind a method for structuring complex computing systems by the use of what they term "recovery blocks," "conversations," and "fault-tolerant interfa...
Journal Article
Performance-Related Reliability Measures for Computing Systems
TL;DR: These measures, which reflect the interaction between the reliability and the performance characteristics of computing systems, can be used to evaluate traditional computer architectures; gracefully degrading systems; and distributed systems.
Journal Article
On the Optimum Checkpoint Interval
TL;DR: It is shown that the optimum checkpoint interval is a function of the load of the system, and it is proved that the total operating time between successive checkpoints should be a deterministic quantity in order to maximize the availability.
Journal Article
Theories of Software Reliability: How Good Are They and How Can They Be Improved?
TL;DR: An examination of the assumptions used in early bug-counting models of software reliability shows them to be deficient and it is suggested that current theories are only the first step along what threatens to be a long road.