Journal Article

MPI-FT: Portable Fault Tolerance Scheme for MPI

TLDR
This paper proposes the design and development of a fault tolerance and recovery scheme for the Message Passing Interface (MPI), consisting of a detection mechanism and a recovery mechanism.
Abstract
In this paper, we propose the design and development of a fault tolerant and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for detec...


Citations
Proceedings Article

MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes

TL;DR: This work presents MPICH-V, an automatic Volatility tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging, and presents a detailed performance evaluation of every component and its global performance for non-trivial parallel applications.
Proceedings Article

MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging

TL;DR: Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while dramatically reducing the number of reliable nodes compared to MPICH-V1.
Journal Article

MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

TL;DR: The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applications, covering a large spectrum of known approaches from coordinated checkpoint, to uncoordinated checkpoint associated with causal message logging.
Journal Article

Fault Tolerance in Message Passing Interface Programs

TL;DR: It is concluded that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Book Chapter

Proactive fault tolerance in MPI applications via task migration

TL;DR: This paper presents a fault tolerance solution for MPI applications that proactively migrates execution away from processors where failure is imminent. It assumes that some failures are predictable and leverages features in current hardware devices that support early indication of faults.
References
Journal Article

Hypertree: A Multiprocessor Interconnection Topology

TL;DR: A new interconnection topology for incrementally expansible multicomputer systems is described, which combines the easy expansibility of tree structures with the compactness of the n-dimensional hypercube.
Journal Article

Algorithm-based fault tolerance on a hypercube multiprocessor

TL;DR: The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube using a novel scheme of algorithm-based error detection, which allows the authors to isolate and replace faulty processors with spare processors.
Book Chapter

Net-dbx: A Java Powered Tool for Interactive Debugging of MPI Programs Across the Internet

TL;DR: Net-dbx is a source level interactive debugger with the full power of gdb augmented with the debug functionality of LAM-MPI, a tool that utilizes Java and other WWW tools for the debugging of MPI programs from anywhere in the Internet.
Proceedings Article

Fault detection and recovery in a data-driven real-time multiprocessor

TL;DR: The mechanisms required to perform fault detection and recovery in the DART (Data-driven Architecture for Real-Time) multiprocessor architecture are introduced and a strategy to statically predict the system performance in the event of multiple processor failures is presented and evaluated.