Journal ArticleDOI

Rollback and Recovery Strategies for Computer Programs

K. M. Chandy, +1 more
01 Jun 1972, Vol. 21, Iss. 6, pp. 546-556
TLDR
The problem is that of recovering error-free information when an error is detected at some stage in the processing of a program, and the solution is to determine the optimum points at which the state of the program should be stored to recover after any malfunction.
Abstract
Reliability is an important aspect of any system. On-line diagnosis, parity check coding, triple modular redundancy, and other methods have been used to improve the reliability of computing systems. In this paper another aspect of reliable computing systems is explored. The problem is that of recovering error-free information when an error is detected at some stage in the processing of a program. If an error or fault is detected while a program is being processed and it cannot be corrected immediately, it may be necessary to run the entire program again. The time spent in rerunning the program may be substantial and, in some real-time applications, critical. Recovery time can be reduced by saving states of the program (all the information stored in registers, primary and secondary storage, etc.) at intervals as the processing continues. If an error is detected, the program is restarted from its most recently saved state. However, saving a state has a price: the time spent storing all the relevant information in secondary storage. Hence it is expensive to save the state of the program too often. Conversely, not saving any state of the program may cause an unacceptably large recovery time. The problem that we solve is the following: determine the optimum points at which the state of the program should be stored to recover after any malfunction.
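
The trade-off described in the abstract can be illustrated with a small sketch. It is not the model used in the paper; it is a toy formulation under assumptions chosen here for illustration: the program is a sequence of tasks with known durations, a state may be saved after any task at a fixed cost `save_cost`, failures arrive as a Poisson process with rate `rate`, and a failure restarts work from the most recent saved state with no extra restart overhead. Under those assumptions the expected time to complete a segment of length W is (e^(rate*W) - 1) / rate, and a dynamic program over task boundaries picks the save points. The names `expected_segment_time` and `best_checkpoints` are hypothetical.

```python
import math


def expected_segment_time(work, rate):
    """Expected time to finish `work` time units without interruption,
    assuming Poisson failures at `rate` and a restart of the whole
    segment after each failure (illustrative assumption)."""
    if rate == 0:
        return float(work)
    return math.expm1(rate * work) / rate


def best_checkpoints(durations, save_cost, rate):
    """Return (expected_total_time, save_points) for the toy model.

    save_points lists k such that the state is saved after the k-th task
    (1-based). No save is charged after the final task.
    """
    n = len(durations)
    prefix = [0.0]
    for d in durations:
        prefix.append(prefix[-1] + d)

    # best[j]: minimum expected time to complete the first j tasks,
    # with a saved state (or program end) at boundary j.
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        cost_here = 0.0 if j == n else save_cost
        for i in range(j):
            # Segment runs tasks i+1..j, then saves the state (if any).
            segment = prefix[j] - prefix[i] + cost_here
            candidate = best[i] + expected_segment_time(segment, rate)
            if candidate < best[j]:
                best[j] = candidate
                prev[j] = i

    # Walk back through the chosen boundaries to list the save points.
    points = []
    j = n
    while j > 0:
        j = prev[j]
        if j > 0:
            points.append(j)
    points.reverse()
    return best[n], points


if __name__ == "__main__":
    total, points = best_checkpoints([10, 10, 10, 10], save_cost=1.0, rate=0.1)
    print(points, round(total, 2))
```

With a zero failure rate the sketch saves no state at all, since every save only adds cost; as the failure rate grows relative to the save cost, more save points become worthwhile.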


Citations
Journal ArticleDOI

A survey of rollback-recovery protocols in message-passing systems

TL;DR: This survey covers rollback-recovery techniques that do not require special language constructs and distinguishes between checkpoint-based protocols, which rely solely on checkpointing for system state restoration, and log-based protocols.
Journal ArticleDOI

Necessary and sufficient conditions for consistent global snapshots

TL;DR: This work proves the exact conditions for an arbitrary checkpoint, or a set of checkpoints, to belong to a consistent global snapshot, a previously open problem.
Journal ArticleDOI

Distributed execution of recovery blocks: an approach for uniform treatment of hardware and software faults in real-time applications

TL;DR: The concept of distributed execution of recovery blocks is examined as an approach for uniform treatment of hardware and software faults and a specific formulation of the approach aimed at minimizing the recovery time is presented, called the distributed recovery blocks scheme.
Journal ArticleDOI

Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery

TL;DR: This paper starts by surveying the current technology roadmap and particularly how Peta-Flop capable systems may be plausibly constructed in the next few years and considers how rollback-recovery as practiced today will fare when systems may have to be constructed out of thousands of nodes.
Journal ArticleDOI

Analytic models for rollback and recovery strategies in data base systems

TL;DR: Models and techniques are presented that aid in determining optimal times for checkpoints; after an error, all transactions on the audit trail since the last checkpoint are reprocessed in chronological sequence, thus recovering from the error.
References
Journal ArticleDOI

Optimal Scheduling Strategies in a Multiprocessor System

TL;DR: A set of techniques is described for optimally scheduling a sequence of interrelated computational tasks on a multiprocessor computer system, using a directed graph model to represent a computational process.
Book

A general-purpose file system for secondary storage

TL;DR: The need for a versatile on-line secondary storage complex in a multiprogramming environment is immense: information must be easy to access when required, safe from accidents and maliciousness, and accessible to other users on an easily controllable basis when desired.
Proceedings ArticleDOI

A general-purpose file system for secondary storage

TL;DR: In this article, the need for a versatile on-line secondary storage complex in a multiprogramming environment is immense, and various needs become crucial: little-used information must percolate to devices with longer access times, to allow ample space on faster devices for more frequently used files.
Proceedings ArticleDOI

A structural theory of machine diagnosis

TL;DR: A unified approach to diagnostics in multi-processors, based on graph theory, is presented; it seems to provide new insight into the problem regardless of the level of detail under consideration.
Proceedings ArticleDOI

Measurement based automatic analysis of FORTRAN programs

TL;DR: A valuable by-product of this measurement and analysis which directs attention toward those parts of a program which are leading candidates for application of optimization techniques is discussed, including an example of the automatic analysis of programs written in the FORTRAN IV language.