A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems

doi:10.1007/S11227-013-0884-0

Open Access Journal Article

Ifeanyi P. Egwutuoha, +3 more

- 01 Sep 2013 -

The Journal of Supercomputing

- Vol. 65, Iss: 3, pp 1302-1326

TLDR

The failure rates of HPC systems are reviewed, rollback-recovery techniques which are most often used for long-running applications on HPC clusters are discussed, and a taxonomy is developed for over twenty popular checkpoint/restart solutions.

Abstract:

In recent years, High Performance Computing (HPC) systems have been shifting from expensive massively parallel architectures to clusters of commodity PCs to take advantage of cost and performance benefits. Fault tolerance in such systems is a growing concern for long-running applications. In this paper, we briefly review the failure rates of HPC systems and survey the fault tolerance approaches for HPC systems, along with the issues these approaches raise. We discuss rollback-recovery techniques in particular because they are the most widely used for long-running applications on HPC clusters. Specifically, the feature requirements of rollback-recovery are discussed and a taxonomy is developed for over twenty popular checkpoint/restart solutions. The intent of this paper is to aid researchers in the domain as well as to facilitate development of new checkpointing solutions.

Citations

Proceedings Article

Page overlays: an enhanced virtual memory framework to enable fine-grained memory management

Vivek Seshadri, +7 more

TL;DR: The page-overlay framework enables cache-line-granularity memory management without significantly altering the existing virtual memory framework or introducing high overheads and can enable simple and efficient implementations of seven memory management techniques, each of which has a wide variety of applications.

Journal Article

Implementation of a Large-Scale Platform for Cyber-Physical System Real-Time Monitoring

Mikel Canizo, +5 more

- 18 Apr 2019 -

IEEE Access

TL;DR: This paper presents a large-scale platform for CPS real-time monitoring based on big data technologies, which aims to perform real-time analysis for monitoring industrial machines in a real work environment; the overall equipment effectiveness has been improved.

An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management

Vivek Seshadri, +8 more

TL;DR: In this paper, the authors propose a page-overlay framework for fine-grained deduplication and memorization, which reduces the number of pages to be added to a regular page by using overlay-on-write and sparse data structures.

Book Chapter

Static Analysis-Based Approaches for Secure Software Development

Miltiadis Siavvas, +3 more

TL;DR: Two mechanisms, particularly the vulnerability prediction models (VPMs) and the optimum checkpoint recommendation (OCR) mechanisms, are theoretically examined, while their potential improvement by using static analysis is also investigated.

Posted Content

Simple DRAM and Virtual Memory Abstractions to Enable Highly Efficient Memory Systems.

Vivek Seshadri

- 20 May 2016 -

arXiv: Hardware Architecture

TL;DR: This thesis proposes page overlays, a framework that augments the existing virtual memory framework with the ability to track a new version of a subset of cache lines within each virtual page, and Gather-Scatter DRAM, a technique that exploits DRAM organization to effectively gather/scatter values with a power-of-2 strided access patterns.

References

Book Chapter

Time, clocks, and the ordering of events in a distributed system

Leslie Lamport

- 04 Oct 2019 -

Concurrency and Computation: Practice an...

TL;DR: In this paper, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.

Journal Article

Time, clocks, and the ordering of events in a distributed system

Leslie Lamport

- 01 Jul 1978 -

Communications of The ACM

TL;DR: In this article, the concept of one event happening before another in a distributed system is examined, and a distributed algorithm is given for synchronizing a system of logical clocks which can be used to totally order the events.

Proceedings Article

Live migration of virtual machines

Christopher Clark, +7 more

TL;DR: The design options for migrating OSes running services with liveness constraints are considered, the concept of writable working set is introduced, and the design, implementation and evaluation of high-performance OS migration built on top of the Xen VMM are presented.

MPI: A Message-Passing Interface Standard

Message Passing Interface Forum

TL;DR: This document contains all the technical features proposed for the interface. The goal of the Message Passing Interface, simply stated, is to develop a widely used standard for writing message-passing programs.

Journal Article

Distributed snapshots: determining global states of distributed systems

K. Mani Chandy, +1 more

- 01 Feb 1985 -

ACM Transactions on Computer Systems

TL;DR: This paper presents an algorithm by which a process in a distributed system determines a global state of the system during a computation, which helps to solve an important class of problems: stable property detection.
