Proceedings ArticleDOI

Fault tolerance for multi-threaded applications by leveraging hardware transactional memory

TLDR
This paper presents FaulTM-multi, a fault tolerance scheme for multi-threaded applications running on transactional memory hardware that reduces the performance degradation of lockstepping and creates 28% fewer checkpoints than Rebound, the state-of-the-art checkpointing scheme.
Abstract
Providing fault tolerance, especially for mission-critical applications, so that transient and permanent faults can be detected and recovered from is one of the main necessities for processor designers. However, fault tolerance for multi-threaded applications incurs high performance degradation due to comparing the results of the instruction streams, checkpointing the entire system, and recovering from detected errors to an agreed state. In this study, we present FaulTM-multi, a fault tolerance scheme for multi-threaded applications running on transactional memory hardware that reduces this performance degradation. FaulTM-multi decreases the performance degradation of lockstepping, a conventional fault detection scheme, from 23% and 9% to 10% and 2% for lock-based parallel and TM applications, respectively. FaulTM-multi also creates 28% fewer checkpoints than Rebound, the state-of-the-art checkpointing scheme.

Citations
Journal ArticleDOI

Toward Exascale Resilience: 2014 Update

TL;DR: This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
Proceedings ArticleDOI

HAFT: hardware-assisted fault tolerance

TL;DR: This work presents HAFT, a fault tolerance technique using hardware extensions of commodity CPUs to protect unmodified multithreaded applications against data corruptions, and applies it to real-world case studies including Memcached, Apache, and SQLite.
Journal ArticleDOI

Edge-TM: Exploiting Transactional Memory for Error Tolerance and Energy Efficiency

TL;DR: This paper proposes Edge-TM, an adaptive hardware/software error management policy that optimistically scales the voltage beyond the edge of safe operation for better energy savings and works in combination with a Hardware Transactional Memory (HTM)-based error recovery mechanism.
Proceedings ArticleDOI

Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margins

TL;DR: A first study investigating the combination of different error detection mechanisms with transactional memory with the objective to improve energy efficiency and reliability is provided.
Book ChapterDOI

Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support

TL;DR: This paper proposes a software/hardware hybrid approach, which targets Intel’s current x86 multi-core platforms of the Core and Xeon family, and leverages hardware transactional memory (Intel TSX) to support implicit checkpoint creation and fast rollback.
References
Journal ArticleDOI

Pin: building customized program analysis tools with dynamic instrumentation

TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. To illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Proceedings ArticleDOI

The SPLASH-2 programs: characterization and methodological considerations

TL;DR: This paper quantitatively characterizes the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication-to-computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Proceedings ArticleDOI

STAMP: Stanford Transactional Applications for Multi-Processing

TL;DR: This paper introduces the Stanford Transactional Applications for Multi-Processing (STAMP), a comprehensive benchmark suite for evaluating TM systems, and uses the suite to evaluate six different TM systems, identify their shortcomings, and motivate further research on their performance characteristics.
Journal ArticleDOI

The M5 Simulator: Modeling Networked Systems

TL;DR: The M5 simulator provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically.
Journal ArticleDOI

Soft errors in advanced computer systems

TL;DR: This article comprehensively analyzes soft-error sensitivity in modern systems and shows it to be application dependent.