Journal Article
MPI-FT: Portable Fault Tolerance Scheme for MPI
TL;DR: This paper proposes the design and development of a fault-tolerance and recovery scheme for the Message Passing Interface (MPI), consisting of a detection mechanism and a recovery mechanism.
Abstract:
In this paper, we propose the design and development of a fault tolerant and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for detec...
Citations
Proceedings Article
MPICH-V: Toward a Scalable Fault Tolerant MPI for Volatile Nodes
George Bosilca, Aurelien Bouteiller, Franck Cappello, Samir Djilali, Gilles Fedak, Cécile Germain, Thomas Herault, Pierre Lemarinier, Oleg Lodygensky, Frédéric Magniette, Vincent Neri, Anton Selikhov
TL;DR: This work presents MPICH-V, an automatic Volatility tolerant MPI environment based on uncoordinated checkpoint/roll-back and distributed message logging, and presents a detailed performance evaluation of every component and its global performance for non-trivial parallel applications.
Proceedings Article
MPICH-V2: a Fault Tolerant MPI for Volatile Nodes based on Pessimistic Sender Based Message Logging
Aurelien Bouteiller, Franck Cappello, Thomas Herault, Géraud Krawezik, Pierre Lemarinier, Frédéric Magniette
TL;DR: Experimental results demonstrate that MPICH-V2 provides performance close to MPICH-P4 for applications using large messages while dramatically reducing the number of reliable nodes compared to MPICH-V1.
Journal Article
MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI
TL;DR: The MPICH-V project focuses on designing, implementing and comparing several automatic fault-tolerant protocols for MPI applications, covering a large spectrum of known approaches from coordinated checkpoint, to uncoordinated checkpoint associated with causal message logging.
Journal Article
Fault Tolerance in Message Passing Interface Programs
William Gropp, Ewing Lusk
TL;DR: It is concluded that, within certain constraints, MPI can provide a useful context for writing application programs that exhibit significant degrees of fault tolerance.
Book Chapter
Proactive fault tolerance in MPI applications via task migration
TL;DR: This paper presents a fault tolerance solution for MPI applications that proactively migrates execution away from processors where failure is imminent. It assumes that some failures are predictable and leverages features in current hardware devices that support early indication of faults.
References
Journal Article
Hypertree: A Multiprocessor Interconnection Topology
TL;DR: A new interconnection topology for incrementally expansible multicomputer systems is described, which combines the easy expansibility of tree structures with the compactness of the n-dimensional hypercube.
Journal Article
Algorithm-based fault tolerance on a hypercube multiprocessor
Prithviraj Banerjee, J.T. Rahmeh, Craig B. Stunkel, V.S.S. Nair, Kaushik Roy, V. Balasubramanian, Jacob A. Abraham
TL;DR: The authors propose the detection and location of faulty processors concurrently with the actual execution of parallel applications on the hypercube using a novel scheme of algorithm-based error detection, which allows the authors to isolate and replace faulty processors with spare processors.
Book Chapter
Net-dbx: A Java Powered Tool for Interactive Debugging of MPI Programs Across the Internet
TL;DR: Net-dbx is a source-level interactive debugger that combines the full power of gdb with the debugging functionality of LAM-MPI. It uses Java and other WWW tools to debug MPI programs from anywhere on the Internet.
Proceedings Article
Fault detection and recovery in a data-driven real-time multiprocessor
TL;DR: The mechanisms required to perform fault detection and recovery in the DART (Data-driven Architecture for Real-Time) multiprocessor architecture are introduced and a strategy to statically predict the system performance in the event of multiple processor failures is presented and evaluated.