Proceedings ArticleDOI
Fault tolerance for multi-threaded applications by leveraging hardware transactional memory
Gulay Yalcin,Osman Unsal,Adrian Cristal +2 more
- pp 4
TLDR
FaulTM-multi is a fault tolerance scheme for multi-threaded applications running on transactional memory hardware. It reduces the performance degradation of lockstepping and creates 28% fewer checkpoints than Rebound, the state-of-the-art checkpointing scheme.
Abstract:
Providing fault tolerance, especially to mission-critical applications, in order to detect transient and permanent faults and recover from them is one of the main necessities for processor designers. However, fault tolerance for multi-threaded applications incurs high performance degradation due to comparing the results of the instruction streams, checkpointing the entire system, and recovering from detected errors to an agreed state. In this study, we present FaulTM-multi, a fault tolerance scheme for multi-threaded applications running on transactional memory hardware that reduces this performance degradation. FaulTM-multi decreases the performance degradation of lockstepping, a conventional fault detection scheme, from 23% and 9% to 10% and 2% for lock-based parallel and TM applications, respectively. Also, FaulTM-multi creates 28% fewer checkpoints than Rebound, the state-of-the-art checkpointing scheme.
Citations
Journal ArticleDOI
Toward Exascale Resilience: 2014 Update
TL;DR: This paper surveys what the community has learned in the past five years and summarizes the research problems still considered critical by the HPC community.
Proceedings ArticleDOI
HAFT: hardware-assisted fault tolerance
TL;DR: This work presents HAFT, a fault tolerance technique using hardware extensions of commodity CPUs to protect unmodified multithreaded applications against data corruptions, and applies it to real-world case studies including Memcached, Apache, and SQLite.
Journal ArticleDOI
Edge-TM: Exploiting Transactional Memory for Error Tolerance and Energy Efficiency
TL;DR: Edge-TM is proposed, an adaptive hardware/software error management policy that optimistically scales the voltage beyond the edge of safe operation for better energy savings and works in combination with a Hardware Transactional Memory (HTM)-based error recovery mechanism.
Proceedings ArticleDOI
Combining Error Detection and Transactional Memory for Energy-Efficient Computing below Safe Operation Margins
Gulay Yalcin,Adrian Cristal,Osman Unsal,Anita Sobe,Derin Harmanci,Pascal Felber,Alexey Voronin,Jons-Tobias Wamhoff,Christof Fetzer
TL;DR: A first study investigating the combination of different error detection mechanisms with transactional memory with the objective to improve energy efficiency and reliability is provided.
Book ChapterDOI
Fault-Tolerant Execution on COTS Multi-core Processors with Hardware Transactional Memory Support
TL;DR: This paper proposes a software/hardware hybrid approach that targets Intel’s current x86 multi-core platforms of the Core and Xeon family and leverages hardware transactional memory (Intel TSX) to support implicit checkpoint creation and fast rollback.
References
Journal ArticleDOI
Pin: building customized program analysis tools with dynamic instrumentation
Chi-Keung Luk,Robert Cohn,Robert Muth,Harish Patil,Artur Klauser,Geoff Lowney,Steven Wallace,Vijay Janapa Reddi,Kim Hazelwood
TL;DR: The goals are to provide easy-to-use, portable, transparent, and efficient instrumentation. To illustrate Pin's versatility, two Pintools in daily use to analyze production software are described.
Proceedings ArticleDOI
The SPLASH-2 programs: characterization and methodological considerations
TL;DR: This paper quantitatively characterizes the SPLASH-2 programs in terms of fundamental properties and architectural interactions that are important to understand them well, including the computational load balance, communication-to-computation ratio and traffic needs, important working set sizes, and issues related to spatial locality.
Proceedings ArticleDOI
STAMP: Stanford Transactional Applications for Multi-Processing
TL;DR: This paper introduces the Stanford Transactional Applications for Multi-Processing (STAMP), a comprehensive benchmark suite for evaluating TM systems, and uses the suite to evaluate six different TM systems, identify their shortcomings, and motivate further research on their performance characteristics.
Journal ArticleDOI
The M5 Simulator: Modeling Networked Systems
TL;DR: The M5 simulator provides features necessary for simulating networked hosts, including full-system capability, a detailed I/O subsystem, and the ability to simulate multiple networked systems deterministically.
Journal ArticleDOI
Soft errors in advanced computer systems
TL;DR: This article comprehensively analyzes soft-error sensitivity in modern systems and shows it to be application dependent.