Open Access

FaulTM: Fault-Tolerance Using Hardware Transactional Memory

TLDR
This study shows how to provide low-cost fault tolerance for serial programs using a minimally modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. The scheme, called FaulTM, employs a hybrid hardware-software fault-tolerance technique.
Abstract
Fault tolerance has become an essential concern for processor designers due to increasing soft-error rates. In this study, we are motivated by the fact that Transactional Memory (TM) hardware provides an ideal base upon which to build a fault-tolerant system. We show how it is possible to provide low-cost fault tolerance for serial programs by using a minimally modified Hardware Transactional Memory (HTM) that features lazy conflict detection and lazy data versioning. This scheme, called FaulTM, employs a hybrid hardware-software fault-tolerance technique. On the software side, the FaulTM programming model gives programmers the flexibility to trade off performance against reliability. Our experimental results indicate that FaulTM incurs relatively low performance overhead by reducing the number of comparisons and by leveraging already proposed TM hardware. We also conduct experiments which indicate that the baseline FaulTM design has good error coverage. To the best of our knowledge, this is the first architectural fault-tolerance proposal using Hardware Transactional Memory.
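
The abstract does not spell out the mechanism, so the following is only an illustrative software sketch of how lazy data versioning can support error detection. It assumes that a code region is executed twice, that each copy buffers its writes until commit, and that the two buffered write sets are compared once at commit time rather than per instruction. The names `run_redundant`, `region`, and `fault_hook` are hypothetical and do not come from the paper.

```python
def run_redundant(region, memory, max_retries=3, fault_hook=None):
    """Run `region` twice on private write buffers; commit only if both agree.

    Hypothetical illustration, not the FaulTM hardware design.
    Returns the index of the attempt that committed.
    """
    for attempt in range(max_retries):
        write_sets = []
        for copy_id in range(2):
            buffer = {}  # lazy versioning: new values stay here until commit

            def read(addr, _buf=buffer):
                # A copy sees its own buffered writes first, then memory.
                return _buf.get(addr, memory.get(addr, 0))

            def write(addr, value, _buf=buffer):
                _buf[addr] = value

            region(read, write)
            if fault_hook is not None:
                # Optional fault injection, used here only for demonstration.
                fault_hook(attempt, copy_id, buffer)
            write_sets.append(buffer)

        # A single comparison of the two write sets at commit time.
        if write_sets[0] == write_sets[1]:
            memory.update(write_sets[0])  # commit one copy's writes
            return attempt
        # Mismatch: both buffers are discarded and memory is untouched,
        # so the region can simply be re-executed.
    raise RuntimeError("write sets never matched within max_retries")


def example_region(read, write):
    write("sum", read("x") + read("y"))
    write("x", read("x") + 1)


def flip_bit_once(attempt, copy_id, buf):
    # Corrupt the first copy on the first attempt only.
    if attempt == 0 and copy_id == 0:
        buf["sum"] ^= 1
```

Calling `run_redundant(example_region, {"x": 1, "y": 2}, fault_hook=flip_bit_once)` detects the injected mismatch on the first attempt, leaves memory unchanged, and commits on the retry. Because uncommitted writes never reach memory, recovery in this sketch needs no separate checkpoint.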

Citations
Journal Article

Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

TL;DR: In this paper, a structured approach to the management of HPC resilience using the concept of resilience-based design patterns is presented, where each established solution is described in the form of a pattern that addresses concrete problems in the design of resilient systems.
Proceedings Article

FIMSIM: A fault injection infrastructure for microarchitectural simulators

TL;DR: FIMSIM provides the opportunity to comprehensively evaluate the vulnerability of different microarchitectural structures against different fault models, and enables a preliminary analysis of the correlation between the criticality of processor-structure level faults and their impact on applications.
Proceedings Article

FaulTM: error detection and recovery using hardware transactional memory

TL;DR: It is shown how a minimally modified HTM that features lazy conflict detection and lazy data versioning can provide low-cost reliability in addition to HTM's intended purpose of supporting optimistic concurrency.
Proceedings Article

Transactional memory for dependable embedded systems

TL;DR: The position is that it is both possible and worthwhile to develop embedded transactional memory, yet the focus of TM should be on failure control and not concurrency control, and this will require modifications of the TM language primitives, tools, algorithms, runtime systems, and hardware itself.
Journal Article

Rolex: Resilience-oriented language extensions for extreme-scale systems

TL;DR: Rolex is a set of language extensions for HPC applications that facilitates the incorporation of fault resilience as an intrinsic property of the application code and leverages the programmer's insight to reason about the context and significance of faults to the application outcome.
References
Proceedings Article

MediaBench: a tool for evaluating and synthesizing multimedia and communications systems

TL;DR: The MediaBench benchmark suite is designed to fill the gap between the compiler community and embedded applications developers. It was constructed through a three-step process: intuition and market driven initial selection, experimental measurement, and integration with system synthesis algorithms to establish usefulness.
Journal Article

The M5 Simulator: Modeling Networked Systems

TL;DR: The M5 simulator provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically.
Journal Article

Transactional Memory Coherence and Consistency

TL;DR: To explore the costs and benefits of TCC, the characteristics of an optimal transaction-based memory system are studied, and the effect of different design parameters on the performance of real systems is examined.
Proceedings Article

LogTM: log-based transactional memory

TL;DR: This paper presents a new implementation of transactional memory, log-based transactional memory (LogTM), that makes commits fast by storing old values to a per-thread log in cacheable virtual memory and storing new values in place.
Proceedings Article

Transient fault detection via simultaneous multithreading

TL;DR: The concept of the sphere of replication is introduced, which abstracts both the physical redundancy of a lockstepped system and the logical redundancy of an SRT processor. Two mechanisms, slack fetch and the branch outcome queue, are proposed and evaluated; they enhance the performance of an SRT processor by allowing one thread to prefetch cache misses and branch results for the other thread.