scispace - formally typeset
Search or ask a question
Journal ArticleDOI

A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

TL;DR: The failure rates of HPC systems are reviewed, rollback-recovery techniques which are most often used for long-running applications on HPC clusters are discussed, and a taxonomy is developed for over twenty popular checkpoint/restart solutions.
Abstract: In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and also survey the fault tolerance approaches for HPC systems and issues with these approaches. Rollback-recovery techniques which are most often used for long-running applications on HPC clusters are discussed because they are widely used for long-running applications on HPC systems. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
TL;DR: A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW).
Abstract: Nanoscale technology nodes bring reliability concerns back to the center stage of digital system design. A systematic classification of approaches that increase system resilience in the presence of functional hardware (HW)-induced errors is presented, dealing with higher system abstractions, such as the (micro)architecture, the mapping, and platform software (SW). The field is surveyed in a systematic way based on nonoverlapping categories, which add insight into the ongoing work by exposing similarities and differences. HW and SW solutions are discussed in a similar fashion so that interrelationships become apparent. The presented categories are illustrated by representative literature examples to illustrate their properties. Moreover, it is demonstrated how hybrid schemes can be decomposed into their primitive components.

103 citations


Cites background from "A survey of fault tolerance mechani..."

  • ...Such concepts are elaborated in Elnozahy et al. (2002), Sancho et al. (2005), Chen et al. (2015), and Egwutuoha et al. (2013)....

    [...]

Journal ArticleDOI
TL;DR: Novel insights on GPU reliability are given by evaluating the neutron sensitivity of modern GPUs memory structures, highlighting pattern dependence and multiple errors occurrences and error-correcting code, algorithm-based fault tolerance, and comparison hardening strategies are presented and evaluated on GPUs through radiation experiments.
Abstract: Graphics processing units (GPUs) are increasingly attractive for both safety-critical and High-Performance Computing applications. GPU reliability is a primary concern for both the automotive and aerospace markets and is becoming an issue also for supercomputers. In fact, the high number of devices in large data centers makes the probability of having at least a device corrupted to be very high. In this paper, we aim at giving novel insights on GPU reliability by evaluating the neutron sensitivity of modern GPUs memory structures, highlighting pattern dependence and multiple errors occurrences. Additionally, a wide set of parallel codes are exposed to controlled neutron beams to measure GPUs operative error rates. From experimental data and algorithm analysis we derive general insights on parallel algorithms and programming approaches reliability. Finally, error-correcting code, algorithm-based fault tolerance, and duplication with comparison hardening strategies are presented and evaluated on GPUs through radiation experiments. We present and compare both the reliability improvement and imposed overhead of the selected hardening solutions.

80 citations

01 Jan 1975
TL;DR: The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.
Abstract: The paper presents, and discusses the rationale behind, a method for structuring complex computing systems by the use of what we term “recovery blocks”, “conversations” and “fault-tolerant interfaces”. The aim is to facilitate the provision of dependable error detection and recovery facilities which can cope with errors caused by residual design inadequacies, particularly in the system software, rather than merely the occasional malfunctioning of hardware components.

62 citations

Journal ArticleDOI
TL;DR: The potential of cloud-deployed IFC for enforcing owners’ data flow policy with regard to protection and sharing, as well as safeguarding against malicious or buggy software is discussed.
Abstract: A model of cloud services is emerging whereby a few trusted providers manage the underlying hardware and communications whereas many companies build on this infrastructure to offer higher level, cloud-hosted PaaS services and/or SaaS applications. From the start, strong isolation between cloud tenants was seen to be of paramount importance, provided first by virtual machines (VM) and later by containers, which share the operating system (OS) kernel. Increasingly it is the case that applications also require facilities to effect isolation and protection of data managed by those applications. They also require flexible data sharing with other applications, often across the traditional cloud-isolation boundaries; for example, when government, consisting of different departments, provides services to its citizens through a common platform. These concerns relate to the management of data. Traditional access control is application and principal/role specific, applied at policy enforcement points, after which there is no subsequent control over where data flows;a crucial issue once data has left its owner’s control by cloud-hosted applications andwithin cloud-services. Information Flow Control (IFC), in addition, offers system-wide, end-to-end, flow control based on the properties of the data. We discuss the potential of cloud-deployed IFC for enforcing owners’ data flow policy with regard to protection and sharing, aswell as safeguarding against malicious or buggy software. In addition, the audit log associated with IFC provides transparency and offers system-wide visibility over data flows. This helps those responsible to meet their data management obligations, providing evidence of compliance, and aids in the identification ofpolicy errors and misconfigurations. We present our IFC model and describe and evaluate our IFC architecture and implementation (CamFlow). This comprises an OS level implementation of IFC with support for application management, together with an IFC-enabled middleware.

59 citations


Cites background from "A survey of fault tolerance mechani..."

  • ...Checkpointing a process involves halting its execution, allowing it to be restarted at a later stage, and enabling migration, see [32], [33]....

    [...]

Journal ArticleDOI
TL;DR: A taxonomy of graph processing systems is proposed and existing systems are mapped to this classification, which captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks.
Abstract: The world is becoming a more conjunct place and the number of data sources such as social networks, online transactions, web search engines, and mobile devices is increasing even more than had been predicted. A large percentage of this growing dataset exists in the form of linked data, more generally, graphs, and of unprecedented sizes. While today's data from social networks contain hundreds of millions of nodes connected by billions of edges, inter-connected data from globally distributed sensors that forms the Internet of Things can cause this to grow exponentially larger. Although analyzing these large graphs is critical for the companies and governments that own them, big data tools designed for text and tuple analysis such as MapReduce cannot process them efficiently. So, graph distributed processing abstractions and systems are developed to design iterative graph algorithms and process large graphs with better performance and scalability. These graph frameworks propose novel methods or extend previous methods for processing graph data. In this article, we propose a taxonomy of graph processing systems and map existing systems to this classification. This captures the diversity in programming and computation models, runtime aspects of partitioning and communication, both for in-memory and distributed frameworks. Our effort helps to highlight key distinctions in architectural approaches, and identifies gaps for future research in scalable graph systems.

47 citations


Cites methods from "A survey of fault tolerance mechani..."

  • ...Most graph processing systems use checkpointing and rollback mechanisms (Egwutuoha et al. 2013) for failure recovery, such as Pregel and Pregel-like systems like Giraph....

    [...]

References
More filters
Book ChapterDOI
Leslie Lamport1
TL;DR: In this paper, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.
Abstract: The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.

8,381 citations

Journal ArticleDOI
Leslie Lamport1
TL;DR: In this article, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.
Abstract: The concept of one event happening before another in a distributed system is examined, and is shown to define a partial ordering of the events. A distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events. The use of the total ordering is illustrated with a method for solving synchronization problems. The algorithm is then specialized for synchronizing physical clocks, and a bound is derived on how far out of synchrony the clocks can become.

6,804 citations

Proceedings ArticleDOI
02 May 2005
TL;DR: The design options for migrating OSes running services with liveness constraints are considered, the concept of writable working set is introduced, and the design, implementation and evaluation of high-performance OS migration built on top of the Xen VMM are presented.
Abstract: Migrating operating system instances across distinct physical hosts is a useful tool for administrators of data centers and clusters: It allows a clean separation between hard-ware and software, and facilitates fault management, load balancing, and low-level system maintenance.By carrying out the majority of migration while OSes continue to run, we achieve impressive performance with minimal service downtimes; we demonstrate the migration of entire OS instances on a commodity cluster, recording service downtimes as low as 60ms. We show that that our performance is sufficient to make live migration a practical tool even for servers running interactive loads.In this paper we consider the design options for migrating OSes running services with liveness constraints, focusing on data center and cluster environments. We introduce and analyze the concept of writable working set, and present the design, implementation and evaluation of high-performance OS migration built on top of the Xen VMM.

3,186 citations


"A survey of fault tolerance mechani..." refers methods in this paper

  • ...Stop-and-copy and live migration of VMs are the commonly used techniques [16]....

    [...]

01 Apr 1994
TL;DR: This document contains all the technical features proposed for the interface and the goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs.
Abstract: The Message Passing Interface Forum (MPIF), with participation from over 40 organizations, has been meeting since November 1992 to discuss and define a set of library standards for message passing MPIF is not sanctioned or supported by any official standards organization The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs As such the interface should establish a practical, portable, efficient and flexible standard for message passing , This is the final report, Version 10, of the Message Passing Interface Forum This document contains all the technical features proposed for the interface This copy of the draft was processed by LATEX on April 21, 1994 , Please send comments on MPI to mpi-comments@csutkedu Your comment will be forwarded to MPIF committee members who will attempt to respond

3,181 citations

Journal ArticleDOI
TL;DR: An algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
Abstract: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation. Many problems in distributed systems can be cast in terms of the problem of detecting global states. For instance, the global state detection algorithm helps to solve an important class of problems: stable property detection. A stable property is one that persists: once a stable property becomes true it remains true thereafter. Examples of stable properties are “computation has terminated,” “ the system is deadlocked” and “all tokens in a token ring have disappeared.” The stable property detection problem is that of devising algorithms to detect a given stable property. Global state detection can also be used for checkpointing.

2,738 citations


"A survey of fault tolerance mechani..." refers background in this paper

  • ...A number of checkpoint protocols have been proposed to ensure global coordination: a nonblocking checkpointing coordination protocol was proposed [11] to ensure that applications that would make coordinated checkpointing inconsistent are prevented from running....

    [...]