Automatically increasing fault tolerance in distributed systems

Open Access

TLDR

This dissertation presents a complete study of the relationship between fault-tolerance and round complexity of translations. It develops new translations that are optimal and proves that some previously developed translations are also optimal.

Abstract:

Developing fault-tolerant distributed protocols is a difficult task. The difficulty of this task increases with the severity of the failures to be tolerated. One way to deal with this difficulty is to develop protocols tolerant of benign failures and then transform these protocols into ones that are tolerant of more severe failures. This transformation mechanism is called a translation. This dissertation considers a variety of processor failures and synchrony models. The failures studied range from simple stopping failures to arbitrary faulty behavior. The synchrony models range from systems in which processors are fully synchronized (synchronous systems) to systems in which processors are not synchronized at all (asynchronous systems). For all synchrony models, the dissertation gives general definitions of translations and of measures to evaluate their performance. The two measures considered are communication complexity and fault-tolerance. Communication complexity is the communication overhead incurred when using a translation. Fault-tolerance is the maximum proportion of processors that can be faulty without affecting the correctness of the translations. For synchronous systems, this dissertation presents a complete study of the relationship between fault-tolerance and round complexity of translations. It develops new translations that are optimal and proves that some previously developed translations are optimal. For asynchronous systems, it proves that some previously developed translations are optimal. For systems that are only partially synchronous, this dissertation discusses some of the issues involved in designing efficient translations.

Citations

Proceedings Article

Nysiad: practical protocol transformation to tolerate Byzantine failures

Chi Ho, +3 more

TL;DR: Nysiad is presented, a system that implements a new technique for transforming a scalable distributed system or network protocol tolerant only of crash failures into one that tolerates arbitrary failures, including such failures as freeloading and malicious attacks.

Journal Article

Simplifying fault-tolerance: providing the abstraction of crash failures

Rida A. Bazzi, +1 more

- 01 May 2001 -

Journal of the ACM

TL;DR: Methods that automatically translate algorithms tolerant of simple crash failures into ones tolerant of more severe failures are considered, showing that previously developed translations to send-omission failures are optimal with respect to both fault-tolerance and round-complexity.

Book Chapter

Making distributed applications robust

Chi Ho, +2 more

TL;DR: A novel translation of systems that are tolerant of crash failures to systems that are tolerant of Byzantine failures in an asynchronous environment is presented, making weaker assumptions than previous approaches.

Journal Article

Time Bounds for Decision Problems in the Presence of Timing Uncertainty and Failures

Hagit Attiya, +1 more

- 01 Aug 2001 -

Journal of Parallel and Distributed Comp...

TL;DR: This paper presents a new stretching technique for deriving lower bounds in the presence of late timing failures. The technique yields lower bounds for a semi-synchronous model of distributed message-passing in which there is inexact information about time and process failures.

Dissertation

Reducing Costs Of Byzantine Fault Tolerant Distributed Applications

Chi Ho

Related Papers (5)

Consensus in the presence of partial synchrony (Preliminary Version)

Localizing failures in distributed synchronization

Reaching agreement on processor-group membership in synchronous distributed systems

On the Cost of Fault-Tolerant Consensus When There Are No Faults – A Tutorial

Consensus in asynchronous systems where processes can crash and recover