Open Access Journal Article

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

TLDR
The failure rates of HPC systems are reviewed, rollback-recovery techniques which are most often used for long-running applications on HPC clusters are discussed, and a taxonomy is developed for over twenty popular checkpoint/restart solutions.
Abstract
In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed because they are the techniques most often used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.


Citations
Proceedings Article

CSR: Core Surprise Removal in Commodity Operating Systems

TL;DR: This work presents CSR, a strategy for recovery from unexpected permanent processor faults in commodity operating systems, which overcomes surprise removal of faulty cores, and also tolerates cascading core failures.
Proceedings Article

A snapshot security protocol for radar network protection

TL;DR: A snapshot security protocol is designed and implemented, which calculates a consistent global snapshot for distributed applications running on radar networks in order to improve the reliability and availability of these systems.
Journal Article

Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure

TL;DR: The results indicate that correlation can negatively impact checkpointing, necessitating more frequent checkpointing and increasing the total time required, but that the approach can still identify the optimal number of equidistant checkpoints, despite this correlation.
Posted Content

Dependency-Aware Rollback and Checkpoint-Restart for Distributed Task-Based Runtimes

TL;DR: This work explores task dependencies in the rollback and designs a new C/R technique, influenced by the recursive decomposition of tasks, which it combines with dependency-aware rollbacks that are expected to cancel and recompute fewer tasks in the presence of node failures.
Proceedings Article

Modeling and evaluation of mixed redundancy strategy with instant switching in cloud-based systems

TL;DR: A model is presented to evaluate the reliability and performance of a degraded cloud-based system subject to a mixed active and cold-standby redundancy strategy with a continual monitoring and detection mechanism. The results show that system behavior differs across kinds of mixed strategy and that the analysis model for the traditional strategy is not suitable for strategies in cloud-based systems.
References
Book Chapter

Time, clocks, and the ordering of events in a distributed system

TL;DR: In this paper, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.
Proceedings Article

Live migration of virtual machines

TL;DR: The design options for migrating OSes running services with liveness constraints are considered, the concept of writable working set is introduced, and the design, implementation and evaluation of high-performance OS migration built on top of the Xen VMM are presented.

MPI: A Message-Passing Interface Standard

TL;DR: This document contains all the technical features proposed for the interface. The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs.
Journal Article

Distributed snapshots: determining global states of distributed systems

TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.