Open Access Journal Article

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

TLDR
The failure rates of HPC systems are reviewed, rollback-recovery techniques, which are the most widely used for long-running applications on HPC clusters, are discussed, and a taxonomy is developed for over twenty popular checkpoint/restart solutions.
Abstract
In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. Rollback-recovery techniques are discussed in detail because they are the most widely used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.


Citations
Journal Article

Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems

TL;DR: A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW).
Journal Article

Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units

TL;DR: Novel insights on GPU reliability are given by evaluating the neutron sensitivity of modern GPUs' memory structures, highlighting pattern dependence and multiple-error occurrences. Error-correcting code, algorithm-based fault tolerance, and comparison hardening strategies are presented and evaluated on GPUs through radiation experiments.

System Structure for Software Fault Tolerance

TL;DR: The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
Journal Article

Camflow: Managed Data-Sharing for Cloud Services

TL;DR: The potential of cloud-deployed IFC for enforcing owners’ data flow policy with regard to protection and sharing, as well as safeguarding against malicious or buggy software is discussed.
Journal Article

Scalable Graph Processing Frameworks: A Taxonomy and Open Challenges

TL;DR: A taxonomy of graph processing systems is proposed and existing systems are mapped to this classification, which captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks.
References
Proceedings Article

BlueGene/L Failure Analysis and Prediction Models

TL;DR: This study has collected RAS event logs from BlueGene/L over a period of more than 100 days, and investigated the characteristics of fatal failure events, as well as the correlation between fatal events and non-fatal events, leading to three simple yet effective failure prediction methods.
Proceedings Article

DMTCP: Transparent checkpointing for cluster computations and the desktop

TL;DR: DMTCP is a transparent user-level checkpointing package for distributed applications. It is used for the runCMS experiment of the Large Hadron Collider at CERN, and it can be incorporated and distributed as a checkpoint-restart module within some larger package.
Report

The design and implementation of Berkeley Lab's Linux checkpoint/restart

TL;DR: BLCR can be used either as a standalone system for checkpointing applications on a single machine, or as a component by a scheduling system or parallel communication library for checkpointing and restoring parallel jobs running on multiple machines.
Journal Article

Handbook of reliability engineering and management

TL;DR: This book discusses the development of reliability standards and specifications, techniques for estimating reliability at the design stage, and the role of management in reliability.
Journal Article

Fault Tolerance in Petascale/Exascale Systems: Current Knowledge, Challenges and Research Opportunities

TL;DR: This paper summarizes and analyzes the existing results concerning the failures in large-scale computers and points out the urgent need for drastic improvements or disruptive approaches for fault tolerance in these systems.