A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
TL;DR: The failure rates of HPC systems are reviewed, rollback-recovery techniques most often used for long-running applications on HPC clusters are discussed, and a taxonomy is developed for over twenty popular checkpoint/restart solutions.

Abstract:
In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems, survey fault tolerance approaches for HPC systems, and discuss issues with these approaches. Rollback-recovery techniques are examined in detail because they are the most widely used fault tolerance mechanism for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.
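The rollback-recovery approach the survey taxonomizes amounts to periodically persisting application state and, after a failure, restarting from the last saved state rather than from scratch. As a minimal sketch (not any surveyed tool's implementation; the file name, interval, and loop body are illustrative assumptions), an application-level checkpoint/restart loop in Python might look like:

```python
import os
import pickle

CHECKPOINT = "state.ckpt"  # hypothetical checkpoint file name

def run(total_steps):
    # Restart: resume from the last checkpoint if one exists, else start fresh.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            step, acc = pickle.load(f)
    else:
        step, acc = 0, 0

    while step < total_steps:
        acc += step          # stand-in for one unit of real computation
        step += 1
        if step % 100 == 0:  # illustrative checkpoint interval
            tmp = CHECKPOINT + ".tmp"
            with open(tmp, "wb") as f:
                pickle.dump((step, acc), f)
            os.replace(tmp, CHECKPOINT)  # atomic rename avoids torn checkpoints
    return acc
```

Writing to a temporary file and renaming it into place ensures a crash during checkpointing never corrupts the last good checkpoint, a concern common to the solutions surveyed.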
Citations
Journal Article
Classification of Resilience Techniques Against Functional Errors at Higher Abstraction Layers of Digital Systems
Georgia Psychou, Dimitrios Rodopoulos, Mohamed M. Sabry, Tobias Gemmeke, David Atienza, Tobias G. Noll, Francky Catthoor, and 6 more
TL;DR: A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW).
Journal Article
Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units
TL;DR: Novel insights on GPU reliability are given by evaluating the neutron sensitivity of modern GPU memory structures, highlighting pattern dependence and multiple-error occurrences; error-correcting codes, algorithm-based fault tolerance, and comparison-based hardening strategies are presented and evaluated on GPUs through radiation experiments.
System Structure for Software Fault Tolerance
TL;DR: The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
Journal Article
CamFlow: Managed Data-Sharing for Cloud Services
TL;DR: The potential of cloud-deployed Information Flow Control (IFC) for enforcing owners' data flow policy with regard to protection and sharing, as well as safeguarding against malicious or buggy software, is discussed.
Journal Article
Scalable Graph Processing Frameworks: A Taxonomy and Open Challenges
TL;DR: A taxonomy of graph processing systems is proposed and existing systems are mapped to this classification, which captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks.
References
Patent
Wear leveling techniques for flash EEPROM systems
TL;DR: A mass storage system made of flash electrically erasable and programmable read only memory (EEPROM) cells organized into blocks, the blocks in turn being grouped into memory banks, is managed to even out the numbers of erase and rewrite cycles experienced by the memory banks in order to extend the service lifetime of the memory.
Journal Article
Understanding fault-tolerant distributed systems
TL;DR: This article attempts to introduce discipline and order into the understanding of fault-tolerance issues in distributed system architectures by examining various proposals, discussing their relative merits, and illustrating their use in existing commercial fault-tolerant systems.
Journal Article
The use of triple-modular redundancy to improve computer reliability
R. E. Lyons, W. Vanderkulk, and 1 more
TL;DR: A proposed technique for meeting the severe reliability requirements of certain future computer applications is described: triple-modular redundancy, which is essentially the use of the two-out-of-three voting concept at a low level.
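The two-out-of-three voting at the heart of triple-modular redundancy can be sketched in a few lines of Python (an illustrative sketch, not Lyons and Vanderkulk's hardware formulation; `tmr_vote` and `tmr_run` are hypothetical names):

```python
def tmr_vote(a, b, c):
    """Two-out-of-three majority vote: return the value at least two replicas agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: all three replicas disagree")

def tmr_run(f, x):
    # Running the same computation on three replicas and voting on the
    # results masks any single faulty replica.
    return tmr_vote(f(x), f(x), f(x))
```

The key property is that one arbitrary wrong result is outvoted by the two correct ones; two simultaneous faults defeat the scheme, which is why the technique is applied "at a low level," where individual modules fail independently.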
Report
Static analysis of executables to detect malicious patterns
Mihai Christodorescu, Somesh Jha, and 1 more
TL;DR: An architecture for detecting malicious patterns in executables that is resilient to common obfuscation transformations is presented, and experimental results demonstrate the efficacy of the prototype tool, SAFE (a static analyzer for executables).
Proceedings Article
A large-scale study of failures in high-performance computing systems
Bianca Schroeder, Garth A. Gibson, and 1 more
TL;DR: Analysis of failure data collected at two large high-performance computing sites finds that average failure rates vary widely across systems, ranging from 20 to 1000 failures per year, and that time between failures is modeled well by a Weibull distribution with a decreasing hazard rate.
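The Weibull hazard rate mentioned in this finding has a standard closed form, h(t) = (k/λ)(t/λ)^(k−1), which is decreasing in t exactly when the shape parameter k < 1. A small sketch (illustrative; the parameter values are assumptions, not fits from the cited study):

```python
def weibull_hazard(t, shape, scale):
    # h(t) = (k/lam) * (t/lam)**(k-1); decreasing in t when shape k < 1
    return (shape / scale) * (t / scale) ** (shape - 1)

# With shape < 1, the instantaneous failure rate falls over time:
# a node that has already survived a while is less likely to fail soon,
# consistent with infant-mortality-dominated failure data.
```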