Rollback and Recovery Strategies for Computer Programs

doi:10.1109/TC.1972.5009007

Journal Article

K. M. Chandy, +1 more

01 Jun 1972

IEEE Transactions on Computers

Vol. 21, Iss. 6, pp. 546-556

TLDR

The paper addresses how to recover error-free information when an error is detected partway through the processing of a program. Its approach is to determine the optimum points at which the program's state should be saved so that the program can recover after any malfunction.

Abstract:

Reliability is an important aspect of any system. On-line diagnosis, parity check coding, triple modular redundancy, and other methods have been used to improve the reliability of computing systems. In this paper another aspect of reliable computing systems is explored. The problem is that of recovering error-free information when an error is detected at some stage in the processing of a program. If an error or fault is detected while a program is being processed and it cannot be corrected immediately, it may be necessary to run the entire program again. The time spent in rerunning the program may be substantial and, in some real-time applications, critical. Recovery time can be reduced by saving states of the program (all the information stored in registers, primary and secondary storage, etc.) at intervals as the processing continues. If an error is detected, the program is restarted from its most recently saved state. However, a price is paid in saving a state, in the form of time spent storing all the relevant information in secondary storage. Hence it is expensive to save the state of the program too often, while not saving any state of the program may cause an unacceptably large recovery time. The problem that we solve is the following: determine the optimum points at which the state of the program should be stored to recover after any malfunction.
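
The trade-off described in the abstract (saving state costs time, while saving too rarely makes recovery slow) can be illustrated with a small sketch. This is not the algorithm from the paper, whose model is not reproduced on this page. It assumes a simplified, hypothetical setting: a program is a fixed sequence of tasks with known durations, a save may be placed at any boundary between tasks at a known cost, and the goal is to minimize total save cost while keeping the work that must be rerun after a failure within a bound. The function name `plan_checkpoints` and all of its parameters are illustrative assumptions.

```python
from itertools import accumulate


def plan_checkpoints(durations, save_costs, max_rerun):
    """Choose save points for a linear chain of tasks (illustrative model only).

    Boundary j means "after task j-1"; boundary 0 is the program start, which
    is assumed to be a free restart point. save_costs[j-1] is the assumed cost
    of saving state at boundary j, for j = 1 .. n-1.

    Returns (total_save_cost, boundaries) such that no stretch of work between
    consecutive restart points exceeds max_rerun, or None if that is impossible.
    """
    n = len(durations)
    prefix = [0, *accumulate(durations)]  # prefix[j] = time to finish tasks 0..j-1
    inf = float("inf")
    best = [inf] * (n + 1)   # best[j]: minimum save cost with a restart point at j
    prev = [None] * (n + 1)  # prev[j]: previous restart point on the best plan
    best[0] = 0
    for j in range(1, n + 1):
        for i in range(j):
            # Work between restart points i and j is what a failure would redo.
            if best[i] < inf and prefix[j] - prefix[i] <= max_rerun:
                # Boundary n is the end of the program, so nothing is saved there.
                cost = best[i] + (save_costs[j - 1] if j < n else 0)
                if cost < best[j]:
                    best[j], prev[j] = cost, i
    if best[n] == inf:
        return None
    points = []
    j = prev[n]
    while j:  # walk back until boundary 0 (the program start)
        points.append(j)
        j = prev[j]
    return best[n], points[::-1]


if __name__ == "__main__":
    # Four tasks; saving after task 1 is expensive, elsewhere it is cheap.
    print(plan_checkpoints([4, 3, 5, 2], [1, 5, 1], max_rerun=8))
```

In this toy model, tightening `max_rerun` forces more saves and raises the total save cost, while loosening it allows fewer saves at the price of longer reruns, which is the tension the abstract describes.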

Citations

A survey of rollback-recovery protocols in message-passing systems

Necessary and sufficient conditions for consistent global snapshots

Distributed execution of recovery blocks: an approach for uniform treatment of hardware and software faults in real-time applications

Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery

Analytic models for rollback and recovery strategies in data base systems

Related Papers (5)

System structure for software fault tolerance

On the Optimum Checkpoint Interval

A first order approximation to the optimum checkpoint interval

Analytic models for rollback and recovery strategies in data base systems

Distributed snapshots: determining global states of distributed systems