Open Access

RADIC-based Message Passing Fault Tolerance System

TLDR
The design changes the default socket model so that a socket is not unexpectedly closed when a remote node fails. A pessimistic log-based rollback-recovery protocol added at this level makes it possible to restart a failed parallel process and re-execute it up to the point of failure, independently of the other processes.
Abstract
We present a design analysis of how to incorporate a transparent fault tolerance system at socket level for message passing applications. The design changes the default socket model so that a socket is not unexpectedly closed when a remote node fails. Moreover, a pessimistic log-based rollback-recovery protocol added at this level makes it possible to restart a failed parallel process and re-execute it up to the point of failure, independently of the other processes. This paper explains and analyzes the design-time decisions. We tested and assessed them by executing master-worker (M/W) and Single Program Multiple Data (SPMD) applications, which follow different communication patterns. The results show promising robustness in interprocess communication.
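The idea behind pessimistic log-based recovery can be illustrated with a minimal sketch. It is not the paper's implementation: the names `StableLog`, `LoggedProcess`, and `mix` are hypothetical, the log is an in-memory stand-in for storage that would survive the failure of the receiving node, and the application is assumed to be deterministic with respect to the messages it receives. The sketch shows the two rules that define the technique: a message is logged before it is delivered, and a restarted process replays the log in order to reach its pre-failure state without involving the other processes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Message = Tuple[int, bytes]  # (sender id, payload); hypothetical message shape


@dataclass
class StableLog:
    """Hypothetical stand-in for storage that survives the receiver's failure."""
    entries: List[Message] = field(default_factory=list)

    def append(self, msg: Message) -> None:
        self.entries.append(msg)


@dataclass
class LoggedProcess:
    log: StableLog
    # Assumed deterministic: (state, message) -> new state
    handler: Callable[[int, Message], int]
    state: int = 0

    def receive(self, msg: Message) -> None:
        # Pessimistic rule: the message is logged BEFORE the application sees it,
        # so no delivered message can be lost by a later failure.
        self.log.append(msg)
        self.state = self.handler(self.state, msg)

    @classmethod
    def recover(cls, log: StableLog, handler: Callable[[int, Message], int],
                checkpoint_state: int = 0) -> "LoggedProcess":
        # Restart from the last checkpoint and replay logged messages in their
        # original order; other processes are not rolled back.
        proc = cls(log=log, handler=handler, state=checkpoint_state)
        for msg in log.entries:
            proc.state = handler(proc.state, msg)
        return proc


def mix(state: int, msg: Message) -> int:
    """Order-sensitive toy handler, used only for illustration."""
    sender, payload = msg
    return state * 31 + sender + len(payload)


log = StableLog()
original = LoggedProcess(log=log, handler=mix)
original.receive((1, b"ab"))
original.receive((2, b"c"))

# Simulate a failure of `original` and a restart from the initial state.
recovered = LoggedProcess.recover(log, mix)
```

After the replay, `recovered.state` equals `original.state`. A real system would also truncate the log at each checkpoint and suppress messages that the recovering process re-sends during replay; both are omitted here.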


Citations
Journal ArticleDOI

Fault tolerance at system level based on RADIC architecture

TL;DR: This paper presents an automatic and scalable fault tolerant model designed to be transparent for applications and for message passing libraries, consisting of detecting failures in the communication socket caused by a faulty node.
Book ChapterDOI

Integrated Tolerant Distributed Computing Network

TL;DR: Two analytical performance models of tolerant computing networks are described. The first is itself based on two models: one that evaluates performance as a function of the number of serviceable computing modules, and one that evaluates performance as a function of the method used to ensure the tolerance of the computer network.

Fault tolerance using credentials management in online transaction

TL;DR: Credential management and session management are used to manage a multilevel credential from the web client to the web resource level and vice versa. Credential management also performs the maintenance process that fixes the fault tolerance level for the web user.
Journal ArticleDOI

Middleware to Manage Fault Tolerance Using Semi-Coordinated Checkpoints

TL;DR: A methodology that allows applications to tolerate failures by creating semi-coordinated checkpoints within the RADIC architecture. The application is divided into independent MPI worlds, each corresponding to a compute node, and the DMTCP checkpoint library is used in a semi-coordinated environment.
References
Journal ArticleDOI

A survey of rollback-recovery protocols in message-passing systems

TL;DR: This survey covers rollback-recovery techniques that do not require special language constructs. It distinguishes between checkpoint-based protocols, which rely solely on checkpointing for system state restoration, and log-based protocols, which combine checkpointing with logging of nondeterministic events.
Book

The design and implementation of the 4.3BSD UNIX operating system

TL;DR: This book describes the design and implementation of the BSD operating system, previously known as the Berkeley version of UNIX, which is widely used for Internet services and firewalls, timesharing, and multiprocessing systems.
Journal ArticleDOI

Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters

TL;DR: The motivation, design and implementation of Berkeley Lab Checkpoint/Restart (BLCR), a system-level checkpoint/restart implementation for Linux clusters that targets the space of typical High Performance Computing applications, including MPI, are described.
Proceedings ArticleDOI

DMTCP: Transparent checkpointing for cluster computations and the desktop

TL;DR: DMTCP is a transparent user-level checkpointing package for distributed applications. It has been demonstrated on runCMS, software for the CMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within a larger package.