Proceedings Article
System structure for software fault tolerance
Brian Randell
pp. 437-449
Abstract:
The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
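The recovery-block idea named in the abstract can be illustrated with a minimal Python sketch. It assumes the scheme as it is commonly described: the state is saved on entry, a primary alternate runs, an acceptance test checks the result, and on failure the state is restored and the next alternate is tried. All names here (`recovery_block`, `RecoveryBlockError`, the sorting alternates) are hypothetical and not taken from the paper, and deep-copying a dict is only a stand-in for whatever state-saving mechanism a real system would use.

```python
import copy


class RecoveryBlockError(Exception):
    """Raised when every alternate fails its acceptance test (hypothetical name)."""


def recovery_block(state, acceptance_test, alternates):
    """Try each alternate in order until one passes the acceptance test.

    state: mutable dict holding the variables the block may change.
    acceptance_test: callable(state) -> bool.
    alternates: list of callables(state) that update state in place.
    """
    checkpoint = copy.deepcopy(state)  # recovery point taken on block entry
    for alternate in alternates:
        try:
            alternate(state)
            if acceptance_test(state):
                return state  # accepted: the checkpoint is simply dropped
        except Exception:
            pass  # an exception inside an alternate counts as a detected error
        # Backward recovery: put back the state saved on entry.
        state.clear()
        state.update(copy.deepcopy(checkpoint))
    raise RecoveryBlockError("all alternates failed the acceptance test")


def buggy_sort(state):
    # Hypothetical faulty primary alternate: loses the last element.
    state["items"] = sorted(state["items"])[:-1]


def simple_sort(state):
    # Hypothetical secondary alternate.
    state["items"] = sorted(state["items"])


def sorted_with_length(expected_len):
    # Builds an acceptance test: output is ordered and no element was lost.
    def test(state):
        items = state["items"]
        ordered = all(a <= b for a, b in zip(items, items[1:]))
        return ordered and len(items) == expected_len
    return test


data = {"items": [3, 1, 2]}
recovery_block(data, sorted_with_length(3), [buggy_sort, simple_sort])
```

In this sketch the primary alternate fails the acceptance test, the saved state is restored, and the secondary alternate produces the accepted result. The acceptance test is deliberately a cheap plausibility check rather than a full proof of correctness.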
Citations
Basic Concepts and Taxonomy of Dependable and Secure Computing
TL;DR: This paper gives the main definitions relating to dependability, a generic concept that includes as special cases such attributes as reliability, availability, safety, integrity, and maintainability.
Journal Article
A survey of rollback-recovery protocols in message-passing systems
TL;DR: This survey covers rollback-recovery techniques that do not require special language constructs and distinguishes checkpoint-based protocols, which rely solely on checkpointing for system state restoration, from log-based protocols, which combine checkpointing with logging of nondeterministic events.
Journal Article
The N-Version Approach to Fault-Tolerant Software
TL;DR: Principal requirements for the implementation of N-version software are summarized, and the DEDIX distributed supervisor and testbed for the execution of N-version software is described.
Journal Article
Checkpointing and Rollback-Recovery for Distributed Systems
Richard Koo, Sam Toueg
TL;DR: In this article, the authors consider the problem of bringing a distributed system to a consistent state after transient failures, and propose a distributed algorithm to create consistent checkpoints, as well as a rollback-recovery algorithm to recover the system from transient failures.
Journal Article
Resilience and survivability in communication networks: Strategies, principles, and survey of disciplines
James P. G. Sterbenz, David Hutchison, Egemen K. Çetinkaya, Abdul Jabbar, Justin P. Rohrer, Marcus Schöller, Paul Smith
TL;DR: An architectural framework for resilience and survivability in communication networks is provided, together with a survey of the disciplines that resilience encompasses and of significant past failures of the network infrastructure.
References
Journal Article
The structure of the “THE”-multiprogramming system
TL;DR: A multiprogramming system is described in which all activities are divided over a number of sequential processes, in each of which one or more independent abstractions have been implemented.
Book Chapter
A program structure for error detection and recovery
TL;DR: Describes a method of structuring programs which aids the design and validation of facilities for the detection of and recovery from software errors, together with a mechanism for the automatic preservation of restart information at a level of overhead which is believed to be tolerable.
Proceedings Article
PLANNER: a language for proving theorems in robots
TL;DR: The deductive system of PLANNER is subordinate to the hierarchical control structure in order to make the language efficient, and the use of a general-purpose matching language makes the deductive system more powerful.
Proceedings Article
Recovery semantics for a DB/DC system
TL;DR: A unified, systematic view of integrity and recovery as it relates to a data-processing system (whether man, machine, or both) is presented.
Proceedings Article
A recursive virtual machine architecture
Hugh C. Lauer, David Wyeth
TL;DR: This paper summarizes the preliminary design of a computer system with a recursive, virtual machine architecture and gives a brief account of the considerations leading to that design.