Self-stabilization

Open Access Book

TLDR

A formal impossibility proof shows that correct system behavior can be ensured only if fewer than one-third of the processors are Byzantine. Self-stabilization covers the case in which such assumptions are temporarily violated: the system is designed as if it had no past history, so that it can be started in any possible state of its state space.

Abstract:

Fault tolerance and reliability are important issues for flight vehicles such as aircraft, space shuttles, and satellites. A self-stabilizing system recovers automatically following disturbances that force the system into an arbitrary state. Self-stabilization is an essential property of any autonomous control or computing system.
Important branches of distributed computing theory were initiated because of the need for fault tolerance in aircraft computing devices. The Byzantine fault model, for example, was a creation of the NASA SIFT project of more than a couple of decades ago. It is an elegant abstraction of faults in which faulty parts are assumed to behave as adversaries. The idea is to use redundancy (for instance, in the number of processors) to overcome the behavior of faulty processors. This line of thinking fits the common engineering practice in which critical parts are designed independently by several design teams to make sure that there is no mistake in the calculations. Analogously, in the Byzantine fault model, several processors simultaneously perform the same computation to implement a robust flight controller; thus the flight controller can function well in spite of the faulty behavior of some of the processors. Because faulty behavior cannot be anticipated, the most severe behavior is assumed: one that is reminiscent of the behavior of an adversary in the Byzantine court, in which backstabbing was common. A formal impossibility proof shows that, in order to ensure the correct behavior of the system, fewer than one-third of the processors may be Byzantine. The task examined is agreement, or consensus, in which the processors need to decide on the same output, which is the input of one of the processors. The decision can be viewed as choosing the common result among the results of the design teams in the engineering example.
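The one-third bound stated above can be written as n >= 3f + 1, where n is the total number of processors and f is the number of Byzantine ones. The following is a minimal sketch of that arithmetic only; the function names `max_byzantine_faults` and `min_processors` are hypothetical and are not taken from the source.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that f < n / 3, i.e. n >= 3f + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


def min_processors(f: int) -> int:
    """Smallest n for which fewer than one-third of processors are faulty."""
    if f < 0:
        raise ValueError("f must be non-negative")
    return 3 * f + 1


if __name__ == "__main__":
    # Illustrative system sizes (assumed values, not from the source).
    for n in (3, 4, 6, 7, 10):
        print(f"n={n}: tolerates up to {max_byzantine_faults(n)} Byzantine processor(s)")
```

Under this bound, three processors cannot tolerate even a single Byzantine processor, which matches the Alice-and-Bob intuition: with one honest and one dishonest party, the observer has no way to break the symmetry.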
The intuition behind the impossibility result is as follows. Assume you have met two persons, Alice and Bob, one of whom is honest while the other is not. You may try to decide what to do by speaking with each of them. Because you do not know which of the two is honest, you have to find this out. You may ask Alice directly which of them is dishonest; Alice will answer "Bob", and Bob, if asked, will obviously answer "Alice". Each of them may also describe the conversations he or she had with the other, knowing that no one listened to the communication between them. This symmetry in the weights of the answers of Alice and Bob makes it impossible to decide. It is possible to formally prove that agreement can be achieved if, and only if, fewer than one-third of the processors are Byzantine (e.g., Ref. 7).
Systems that tolerate Byzantine faults are designed for flight devices, which need to be extremely robust. In such a device, the assumptions made for reaching agreement can be violated: Is it reasonable to assume that, during any period of the execution, fewer than one-third of the processors are faulty? What happens if, for a short period, more than a third are faulty, or perhaps experience a weaker fault than a Byzantine fault (say, one caused by a transient electric spark)? What happens if messages sent by non-faulty processors are lost at one instant of time? Seven years prior to the introduction of the Byzantine fault model, Edsger W. Dijkstra suggested an important fundamental fault tolerance property called self-stabilization (Ref. 3): design the system as if it had no past history, so that it can be started in any possible state of its state space. Such a design does not assume that consistency has been maintained since a fixed initial state by always executing steps according to the processors' program. Self-stabilizing systems thus overcome transient faults.
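To make "started in any possible state" concrete, here is a minimal sketch of the K-state token ring from Dijkstra's paper on self-stabilization, simulated in Python. It assumes a central scheduler that lets exactly one privileged machine move at a time and a state range K at least as large as the number of machines. The function names (`privileged`, `move`, `stabilize`), the random scheduler, and the step limit are illustrative assumptions, not part of the source.

```python
import random


def privileged(states):
    """Return the indices of machines that currently hold a privilege."""
    n = len(states)
    result = []
    for i in range(n):
        if i == 0:
            # Machine 0 is privileged when it equals its predecessor (the last machine).
            if states[0] == states[n - 1]:
                result.append(0)
        elif states[i] != states[i - 1]:
            # Every other machine is privileged when it differs from its predecessor.
            result.append(i)
    return result


def move(states, k, i):
    """Let privileged machine i take one step."""
    if i == 0:
        states[0] = (states[0] + 1) % k
    else:
        states[i] = states[i - 1]


def stabilize(states, k, rng, max_steps=10_000):
    """Run from an arbitrary state until exactly one privilege remains.

    Returns the number of moves taken. The scheduler (an assumption here)
    picks one privileged machine at random for each move.
    """
    steps = 0
    while len(privileged(states)) != 1:
        if steps >= max_steps:
            raise RuntimeError("did not stabilize within max_steps")
        move(states, k, rng.choice(privileged(states)))
        steps += 1
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    n = 5
    k = n  # assumed: K >= n
    states = [rng.randrange(k) for _ in range(n)]  # arbitrary initial state
    print("initial state:", states, "privileged:", privileged(states))
    taken = stabilize(states, k, rng)
    print("after", taken, "moves:", states, "privileged:", privileged(states))
```

The simulation never reads a trusted initial state: whatever values a transient fault leaves in `states`, at least one machine is always privileged, and the ring converges to a configuration with a single circulating privilege and stays there.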
Temporary violations of the assumptions made by the algorithm designer can be viewed as leaving the system in an arbitrary initial state from which the system resumes. Self-stabilizing systems work correctly when started in any initial system state. Thus, even if the system loses its consistency due to an unexpected temporary violation of the assumptions made, it converges to legal behavior once the assumptions start to hold again. Self-stabilization is a strong fault tolerance property for systems; it ensures automatic recovery once faults stop occurring. A self-stabilizing system is designed to start in any possible state where its components—e.g., processors, processes, communication links, communication buffers—are in an arbitrary state; i.e., arbitrary variable values,

References

Reaching Agreement in the Presence of Faults

SIFT - Design and analysis of a fault-tolerant computer for aircraft control. [Software Implemented Fault Tolerant systems]

Self-repairing computers.

Self-stabilizing microprocessor: analyzing and overcoming soft errors

Self-Stabilizing Autonomic Recoverer for Eventual Byzantine Software (Extended Abstract)
