RADIC-based Message Passing Fault Tolerance System

Marcela Castro, +2 more

- pp 59-64

Abstract:

We present the analysis and design of a transparent fault tolerance system incorporated at the socket level for message passing applications. The novel design changes the default socket model so that a connection is not unexpectedly closed when a remote node fails. Moreover, a pessimistic log-based rollback recovery protocol added at this level makes it possible to restart a failed parallel process and re-execute it up to the point of failure, independently of the rest of the processes. This paper explains and analyzes the design-time decisions. We tested and assessed them by executing master-worker (M/W) and Single Program Multiple Data (SPMD) applications, which follow different communication patterns. The results for robustness in interprocess communication are promising.
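
The pessimistic logging idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation, which operates at the socket level; it only shows the general technique under stated assumptions: the process is deterministic given its received messages, a local file stands in for stable storage, and every name here (`PessimisticLog`, `Worker`, `demo`) is hypothetical.

```python
import json
import os
import tempfile


class PessimisticLog:
    """Append-only receive log (hypothetical); a local file stands in for stable storage."""

    def __init__(self, path):
        self.path = path

    def append(self, sender, payload):
        # Pessimistic step: force the message to disk before the
        # application is allowed to see it.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps({"sender": sender, "payload": payload}) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Yield logged messages in their original receive order.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                yield entry["sender"], entry["payload"]


class Worker:
    """Hypothetical deterministic process: its state depends only on received messages."""

    def __init__(self, log):
        self.log = log
        self.total = 0

    def deliver(self, sender, payload):
        # Application-level handling of one message.
        self.total += payload

    def receive(self, sender, payload):
        self.log.append(sender, payload)  # log first
        self.deliver(sender, payload)     # then deliver

    @classmethod
    def recover(cls, log):
        # Restart from the initial state and re-deliver the logged messages,
        # without involving any other process.
        worker = cls(log)
        for sender, payload in log.replay():
            worker.deliver(sender, payload)
        return worker


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        log = PessimisticLog(os.path.join(tmp, "worker0.log"))
        worker = Worker(log)
        for sender, payload in [(1, 10), (2, 5), (1, 7)]:
            worker.receive(sender, payload)
        state_before_failure = worker.total
        del worker  # simulate the failure of the process
        recovered = Worker.recover(log)
        return state_before_failure, recovered.total
```

Calling `demo()` returns the state before the simulated failure and the state after recovery. Because each message is logged before delivery, replaying the log alone is enough to bring the restarted process back to the point of failure.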

Citations

Journal Article

Fault tolerance at system level based on RADIC architecture

Marcela Castro-León, +3 more

- 01 Dec 2015 -

Journal of Parallel and Distributed Comp...

TL;DR: This paper presents an automatic and scalable fault tolerant model designed to be transparent for applications and for message passing libraries, consisting of detecting failures in the communication socket caused by a faulty node.

Book Chapter

Integrated Tolerant Distributed Computing Network

O. M. Brekhov

TL;DR: In this paper, two analytical performance models of tolerant computing networks are described, and the first model is itself based on two models: a model for evaluating performance depending on the number of serviceable computing modules and a performance model depending upon the method of ensuring the tolerance of the computer network.

Fault tolerance using credentials management in online transaction

G. S. Anandhamala

TL;DR: Credential management and session management handle multilevel credentials from the web client to the web resource level and vice versa; credential management also performs the maintenance process that fixes the fault tolerance level for the web user.

Journal Article

Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints

Alvaro Wong, +3 more

- 01 Feb 2021 -

IEEE Transactions on Parallel and Distri...

TL;DR: A methodology that allows applications to tolerate failures by creating semi-coordinated checkpoints within the RADIC architecture. It divides the application into independent MPI worlds, each corresponding to a compute node, and uses the DMTCP checkpoint library in a semi-coordinated environment.

