Open Access Journal Article (DOI)

MPICH-V Project: A Multiprotocol Automatic Fault-Tolerant MPI

TL;DR
The MPICH-V project focuses on designing, implementing, and comparing several automatic fault-tolerant protocols for MPI applications, covering a large spectrum of known approaches, from coordinated checkpointing to uncoordinated checkpointing combined with causal message logging.
Abstract
High-performance computing platforms such as clusters, Grids, and Desktop Grids are becoming larger and subject to more frequent failures. MPI is one of the most widely used message-passing libraries in HPC applications. These two trends raise the need for fault-tolerant MPI. The MPICH-V project focuses on designing, implementing, and comparing several automatic fault-tolerant protocols for MPI applications. We present an extensive related-work section highlighting the originality of our approach and the proposed protocols. We then present four fault-tolerant protocols implemented in a new generic framework for fault-tolerant protocol comparison, covering a large spectrum of known approaches, from coordinated checkpointing to uncoordinated checkpointing combined with causal message logging. We measure the performance of these protocols on a micro-benchmark and compare them on the NAS benchmarks, using an original fault-tolerance test. Finally, we outline the lessons learned from this in-depth comparison of fault-tolerant protocols for MPI applications.
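To give a concrete feel for the message-logging end of this spectrum, the sketch below shows sender-based payload logging, the basic building block that uncoordinated-checkpoint protocols combine with causality tracking. The logged_send wrapper, the in-memory log, and its fixed size are illustrative assumptions for this sketch, not the MPICH-V implementation.

```c
/*
 * Minimal sketch of sender-based message logging, one ingredient of
 * uncoordinated-checkpoint protocols.  The wrapper and log layout are
 * illustrative assumptions, not MPICH-V's actual implementation.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct {                     /* one logged send event                */
    int   dest, tag, count;
    char *payload;                   /* copy of the outgoing buffer          */
} log_entry_t;

static log_entry_t log_buf[1024];    /* volatile sender-side log (unbounded
                                        growth and GC ignored in this sketch) */
static int         log_len = 0;

/* Log a copy of every outgoing message, then send it normally.  After a
 * receiver failure, the logged copies can be replayed from the sender.      */
static int logged_send(const void *buf, int count, MPI_Datatype type,
                       int dest, int tag, MPI_Comm comm) {
    int size;
    MPI_Type_size(type, &size);
    log_entry_t *e = &log_buf[log_len++];
    e->dest = dest; e->tag = tag; e->count = count;
    e->payload = malloc((size_t)count * (size_t)size);
    memcpy(e->payload, buf, (size_t)count * (size_t)size);
    return MPI_Send(buf, count, type, dest, tag, comm);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 42;
    if (rank == 0) {                 /* run with at least 2 ranks            */
        logged_send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        printf("rank 0: %d send(s) logged for possible replay\n", log_len);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}
```

Causal message logging additionally piggybacks causality information on application messages; only the payload-logging part is shown here.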


Citations
Proceedings Article (DOI)

Uncoordinated Checkpointing Without Domino Effect for Send-Deterministic MPI Applications

TL;DR: An uncoordinated checkpointing protocol for send-deterministic MPI HPC applications is proposed that (i) logs only a subset of the application messages and (ii) does not require systematically restarting all processes when a failure occurs.
Book Chapter (DOI)

Fault-Tolerance Techniques for High-Performance Computing

TL;DR: This timely text presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC), including a survey of resilience methods and performance models, and investigates different approaches to replication.
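As a pointer to the kind of performance model such surveys cover, the classic first-order result below (often attributed to Young and Daly) relates the checkpoint cost and the platform MTBF to the optimal checkpoint interval; the notation is ours, not necessarily the book's.

```latex
% First-order checkpointing model (notation ours):
%   C    : time to write one checkpoint
%   \mu  : platform mean time between failures (MTBF)
%   \tau : compute time between two consecutive checkpoints
\[
  \mathrm{waste}(\tau) \;\approx\; \frac{C}{\tau} + \frac{\tau}{2\mu},
  \qquad
  \tau_{\mathrm{opt}} \;=\; \sqrt{2\,C\,\mu}.
\]
```

Minimizing the waste term gives the square-root law: as the platform MTBF shrinks, the optimal checkpoint interval shrinks with it, and checkpoint I/O pressure rises.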
Proceedings Article (DOI)

Exploring automatic, online failure recovery for scientific applications at extreme scales

TL;DR: Fenix is presented, a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online and transparent manner; it relies on application-driven, diskless, implicitly coordinated checkpointing.
Journal Article (DOI)

The Reliability Wall for Exascale Supercomputing

TL;DR: This paper analyzes and extrapolates the existence of the reliability wall using two representative supercomputers, Intrepid and ASCI White, both employing checkpointing for fault tolerance, and generalizes these results into a general reliability speedup/wall framework by considering not only speedup but also costup.
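A back-of-the-envelope way to see why such a wall appears (our simplification, not the paper's reliability speedup/costup definitions): if one node fails on average every m hours, an n-node machine fails roughly every m/n hours, so the checkpoint/restart waste from the model above grows with n and eventually cancels the benefit of adding nodes.

```latex
% Back-of-the-envelope illustration (ours, not the paper's exact framework):
%   m : MTBF of a single node,   \mu(n) \approx m/n : MTBF of an n-node platform
\[
  S_{\mathrm{eff}}(n)
  \;\approx\; n\,\bigl(1 - \mathrm{waste}(\tau_{\mathrm{opt}})\bigr)
  \;\approx\; n\left(1 - \sqrt{\tfrac{2\,C\,n}{m}}\right),
\]
% which stops improving once the square-root term approaches 1: the "wall".
```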
References
Journal Article (DOI)

Distributed snapshots: determining global states of distributed systems

TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
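Since coordinated checkpointing protocols build on this snapshot algorithm, here is a compact sketch of its two marker rules for a single process; the snapshot_t layout and helper names are hypothetical, and real implementations must also transmit the markers and collect the recorded states.

```c
/* Minimal sketch of the Chandy-Lamport marker rules for one process.
 * Struct layout and helper names are hypothetical; marker and message
 * delivery over real channels is omitted.                                   */
#include <stdbool.h>
#include <stdio.h>

#define NCHAN 2                      /* incoming channels of this process    */

typedef struct {
    int  local_state;                /* application state captured in snapshot */
    bool state_recorded;             /* has this process recorded its state?   */
    bool chan_recording[NCHAN];      /* still recording messages on channel i? */
} snapshot_t;

/* Marker sending rule: record the local state, then (conceptually) send a
 * marker on every outgoing channel before any further application message,
 * and start recording on every incoming channel.                            */
static void initiate_snapshot(snapshot_t *s, int app_state) {
    s->local_state    = app_state;
    s->state_recorded = true;
    for (int c = 0; c < NCHAN; c++) s->chan_recording[c] = true;
    printf("state %d recorded; markers go out on all outgoing channels\n",
           app_state);
}

/* Marker receiving rule: on the first marker, record the state and mark the
 * arrival channel's state as empty; on later markers, close that channel
 * with whatever messages were recorded on it since the snapshot began.      */
static void on_marker(snapshot_t *s, int chan, int app_state) {
    if (!s->state_recorded)
        initiate_snapshot(s, app_state);
    s->chan_recording[chan] = false;
    printf("channel %d state closed\n", chan);
}

/* Application messages arriving while a channel is still being recorded
 * belong to that channel's recorded state (in-flight messages).             */
static void on_app_message(const snapshot_t *s, int chan, int msg) {
    if (s->state_recorded && s->chan_recording[chan])
        printf("in-flight message %d recorded for channel %d\n", msg, chan);
}

int main(void) {
    snapshot_t s = { 0, false, { false, false } };
    on_marker(&s, 0, 42);       /* first marker: record state, channel 0 empty */
    on_app_message(&s, 1, 7);   /* message still in flight on channel 1        */
    on_marker(&s, 1, 42);       /* second marker: close channel 1              */
    return 0;
}
```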
Book

MPI: The Complete Reference

TL;DR: MPI: The Complete Reference is an annotated manual for the latest 1.1 version of the standard that illuminates the more advanced and subtle features of MPI and covers such advanced issues in parallel computing and programming as true portability, deadlock, high-performance message passing, and libraries for distributed and parallel computing.
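As an example of the kind of subtlety the book covers, the toy exchange below shows the classic deadlock pitfall: two ranks that each call MPI_Send before MPI_Recv may hang once the message no longer fits the implementation's internal buffers, whereas MPI_Sendrecv pairs the operations portably. The example itself is ours, not taken from the book.

```c
/* Two ranks exchange large arrays.  Calling MPI_Send on both sides before
 * MPI_Recv is unsafe (may deadlock for large messages); MPI_Sendrecv lets
 * the library pair the send and the receive.  Run with exactly 2 ranks.     */
#include <mpi.h>
#include <stdio.h>

#define N (1 << 20)                  /* large enough to defeat eager buffering */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double out[N], in[N];
    int peer = 1 - rank;

    /* Unsafe pattern (may hang):
     *   MPI_Send(out, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
     *   MPI_Recv(in,  N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     * Portable pattern:                                                      */
    MPI_Sendrecv(out, N, MPI_DOUBLE, peer, 0,
                 in,  N, MPI_DOUBLE, peer, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) printf("exchange completed without deadlock\n");
    MPI_Finalize();
    return 0;
}
```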
Journal Article (DOI)

The NAS Parallel Benchmarks

TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers; they mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.
Journal Article (DOI)

A high-performance, portable implementation of the MPI message passing interface standard

TL;DR: The Message Passing Interface (MPI) is a standard library interface for message passing defined by the MPI Forum, a broadly based group of parallel computer vendors, library writers, and applications specialists.
