A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

doi:10.1007/S11227-013-0884-0

Open Access · Journal Article

Ifeanyi P. Egwutuoha, +3 more

- 01 Sep 2013 -

The Journal of Supercomputing

- Vol. 65, Iss: 3, pp 1302-1326

TLDR

The failure rates of HPC systems are reviewed, rollback-recovery techniques which are most often used for long-running applications on HPC clusters are discussed, and a taxonomy is developed for over twenty popular checkpoint/restart solutions.

Abstract:

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed in particular because they are the techniques most widely used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.
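
As a rough illustration of the rollback-recovery idea the abstract refers to, the following is a minimal sketch of application-level checkpoint/restart. It is not taken from the surveyed paper or from any of the checkpoint/restart solutions it classifies. The function names (`save_checkpoint`, `load_checkpoint`, `run`), the pickle file format, and the `fail_at` parameter used to simulate a failure are all hypothetical choices made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash mid-write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # force the data to stable storage before renaming
    os.replace(tmp_path, path)  # atomic swap of the old checkpoint for the new one


def load_checkpoint(path, default):
    """Return the saved state, or the default if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing periodically.

    fail_at (hypothetical, for demonstration only) raises an error at that
    step to stand in for a node failure.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})  # roll back to last checkpoint
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

If a run is interrupted, calling `run` again with the same `path` resumes from the most recent checkpoint rather than from step 0, so only the work done since that checkpoint is repeated. The checkpoint interval trades checkpointing overhead against the amount of recomputation after a failure.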

Citations

Journal Article

Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

Georgia Psychou, +6 more

- 04 Oct 2017 -

ACM Computing Surveys

TL;DR: A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW).

Journal Article

Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units

Daniel Oliveira, +3 more

- 01 Mar 2016 -

IEEE Transactions on Computers

TL;DR: Novel insights on GPU reliability are given by evaluating the neutron sensitivity of modern GPU memory structures, highlighting pattern dependence and the occurrence of multiple errors. Error-correcting code, algorithm-based fault tolerance, and comparison hardening strategies are presented and evaluated on GPUs through radiation experiments.

System Structure for Software Fault Tolerance

Brian Randell

TL;DR: The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

Journal Article

Camflow: Managed Data-Sharing for Cloud Services

Thomas Pasquier, +3 more

- 01 Jul 2017 -

IEEE Transactions on Cloud Computing

TL;DR: The potential of cloud-deployed IFC for enforcing owners’ data flow policy with regard to protection and sharing, as well as safeguarding against malicious or buggy software is discussed.

Journal Article

Scalable Graph Processing Frameworks: A Taxonomy and Open Challenges

Safiollah Heidari, +3 more

- 12 Jun 2018 -

ACM Computing Surveys

TL;DR: A taxonomy of graph processing systems is proposed and existing systems are mapped to this classification, which captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks.

References

Study of fault-tolerant software technology

T. Slivinski, +6 more

TL;DR: It is concluded that fault-tolerant software has progressed beyond the pure research state and that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault tolerance.

Journal Article

A high performance data integrity assurance based on the determinant technique

Jasim A. Ghaeb, +2 more

- 01 May 2011 -

Future Generation Computer Systems

TL;DR: The proposed technique is based on the Check Determinant Factor (CDF) in measuring data integrity assurance and outperforms the traditional methods such as Hamming code and RAID methods for improving the detection of data integrity violations.

The Dangers of Failure Masking in Fault-Tolerant Software: Aspects of a Recent In-Flight Upset Event

Johnson, +1 more

Proceedings Article

Failure Semantics in a SOA Environment

C. Hobbs, +2 more

TL;DR: The technique of crash-only failure is proposed as a useful first step and it is illustrated how it is particularly applicable to web services in a SOA.

On Performance Optimization and System Design of Flash Memory based Solid State Drives in the Storage Hierarchy

Feng Chen

TL;DR: This dissertation presents a thorough experimental study on the unique features of SSDs, and states that although SSDs have shown great performance, especially for handling small and random data accesses, SSDs are much more expensive than conventional hard disks.
