A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
TLDR
The failure rates of HPC systems are reviewed, the rollback-recovery techniques most often used for long-running applications on HPC clusters are discussed, and a taxonomy is developed for over twenty popular checkpoint/restart solutions.
Abstract:
In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for these systems, along with the issues those approaches raise. We focus on rollback-recovery techniques because they are the most widely used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.
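As a rough illustration of the rollback-recovery idea the abstract refers to, the following is a minimal application-level checkpoint/restart sketch. It is not taken from the paper or from any of the surveyed tools; the function names, the pickle-based file format, and the `fail_at` fault-injection parameter are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to path atomically so a crash never leaves a torn file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over the old checkpoint only after the new one is fully on disk.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    fail_at is a hypothetical fault-injection hook: raise at that step.
    """
    # Restart path: resume from the last checkpoint if one exists.
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a run with `interval=5` fails at step 7, calling `run` again with the same path resumes from the checkpoint taken at step 5, so only the work done since that checkpoint is repeated. The interval therefore trades checkpoint overhead against the amount of lost work.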
Citations
Journal Article
Running resilient MPI applications on a Dynamic Group of Recommended Processes
TL;DR: This work presents a new model to deal with this problem in which processes execute tests among themselves in order to determine whether the processors (or cores) on which they are running are recommended or non-recommended.
Journal Article
Freezing and defrosting cloud applications: automated saving and restoring of running applications
TL;DR: Two approaches are introduced: a concept to generically terminate applications and save their internal state, and an approach to reinstate the application in the same state again.
Posted Content
Understanding Fault Scenarios and Impacts through Fault Injection Experiments in Cielo
Valerio Formicola, Saurabh Jha, Daniel Chen, Fei Deng, Amanda Bonnie, Mike Mason, Jim Brandt, Ann C. Gentile, Larry Kaplan, Jason Repik, Jeremy Enos, Mike Showerman, Annette Greiner, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Bill Krammer
TL;DR: A set of fault injection experiments performed on the ACES (LANL/SNL) Cray XE supercomputer Cielo is presented to improve the understanding of the failure causes and propagation that were observed in the field failure data analysis of NCSA's Blue Waters.
Proceedings Article
Complex Patterns of Failure: Fault Tolerance via Complex Event Processing for IoT Systems
Alexander Power, Gerald Kotonya
TL;DR: Complex Patterns of Failure (CPoF), an approach to providing FT support for IoT systems using Complex Event Processing (CEP) that promotes modularity and reusability in FT-support design, is proposed.
Proceedings Article
Factory: Non-stop batch jobs without checkpointing
TL;DR: In the course of experiments, this study successfully applied a method to make a hydrodynamics HPC application run on a constantly changing number of nodes, and the authors believe that this technique can be generalised to other types of scientific applications as well.
References
Journal Article
Time, clocks, and the ordering of events in a distributed system
TL;DR: In this article, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.
Proceedings Article
Live migration of virtual machines
Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, Andrew Warfield
TL;DR: The design options for migrating OSes running services with liveness constraints are considered, the concept of writable working set is introduced, and the design, implementation and evaluation of high-performance OS migration built on top of the Xen VMM are presented.
MPI: A Message-Passing Interface Standard
TL;DR: This document contains all the technical features proposed for the interface. The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs.
Journal Article
Distributed snapshots: determining global states of distributed systems
K. Mani Chandy, Leslie Lamport
TL;DR: An algorithm is presented by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
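The logical-clock idea summarized in the "Time, clocks, and the ordering of events in a distributed system" entry above can be sketched in a few lines. This is a minimal illustration, not code from the referenced paper: the class and method names are assumptions, and the receive rule uses the common `max(local, received) + 1` formulation. Ties between equal timestamps are broken by process id to obtain a total order.

```python
class LamportClock:
    """A per-process logical clock (illustrative sketch)."""

    def __init__(self, pid):
        self.pid = pid    # process id, used only to break timestamp ties
        self.time = 0

    def tick(self):
        """Advance the clock for a local event and return the new time."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event: tick, then attach the time to the message."""
        return self.tick()

    def receive(self, msg_time):
        """Move past both the local time and the message's timestamp."""
        self.time = max(self.time, msg_time) + 1
        return self.time

    def stamp(self):
        """(time, pid) pairs compare lexicographically, giving a total order."""
        return (self.time, self.pid)


# Two processes exchange one message.
a = LamportClock(pid=0)
b = LamportClock(pid=1)
a.tick()                  # local event on a
sent_at = a.send()        # a sends a message
send_stamp = a.stamp()
b.receive(sent_at)        # b receives it
recv_stamp = b.stamp()
```

Because the receiver always advances past the sender's timestamp, `send_stamp < recv_stamp` holds for every message, which is the property checkpointing protocols rely on when they reason about consistent cuts.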