Proceedings Article
FastLane: improving performance of software transactional memory for low thread counts
Jons-Tobias Wamhoff, Christof Fetzer, Pascal Felber, Etienne Rivière, Gilles Muller
- Vol. 48, Iss: 8, pp 113-122
TLDR
Evaluation results indicate that the approach provides promising performance at low thread counts: FastLane almost systematically wins over a classical STM in the 1-6 threads range, and often performs better than sequential execution of the non-instrumented version of the same application starting with 2 threads.
Abstract:
Software transactional memory (STM) can lead to scalable implementations of concurrent programs, as the relative performance of an application increases with the number of threads that support it. However, the absolute performance is typically impaired by the overheads of transaction management and instrumented accesses to shared memory. This often leads STM-based programs with low thread counts to perform worse than a sequential, non-instrumented version of the same application.
In this paper, we propose FastLane, a new STM algorithm that bridges the performance gap between sequential execution and classical STM algorithms when running on few cores. FastLane seeks to reduce instrumentation costs and thus performance degradation in its target operation range. We introduce a novel algorithm that differentiates between two types of threads: one thread (the master) executes transactions pessimistically without ever aborting, thus with minimal instrumentation and management costs, while other threads (the helpers) can commit speculative transactions only when they do not conflict with the master. Helpers thus contribute to the application's progress without impairing the performance of the master.
We implement FastLane as an extension of a state-of-the-art STM runtime system and compiler. Multiple code paths are produced for execution on a single, few, and many cores. The runtime system selects the code path providing the best throughput, depending on the number of cores available on the target machine. Evaluation results indicate that our approach provides promising performance at low thread counts: FastLane almost systematically wins over a classical STM in the 1-6 threads range, and often performs better than sequential execution of the non-instrumented version of the same application starting with 2 threads.
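The master/helper split described in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not FastLane's actual protocol: the names `ToyStore`, `MasterView`, and `HelperTransaction` are hypothetical, conflicts are detected with per-key version counters, and a single lock serializes master transactions and helper commits. The abstract does not specify any of these mechanisms.

```python
import threading


class ToyStore:
    """Shared state plus per-key versions (an assumed conflict-detection scheme)."""

    def __init__(self, initial):
        self.values = dict(initial)
        self.versions = {key: 0 for key in initial}
        # Assumption: one lock serializes master transactions and helper commits.
        self.commit_lock = threading.Lock()

    def master_transaction(self, body):
        """Master path: pessimistic, writes in place, never aborts."""
        with self.commit_lock:
            body(MasterView(self))


class MasterView:
    """Direct access for the master; no read set and no write buffer."""

    def __init__(self, store):
        self.store = store

    def read(self, key):
        return self.store.values[key]

    def write(self, key, value):
        self.store.values[key] = value
        # Bumping the version lets helpers notice the conflict at commit time.
        self.store.versions[key] = self.store.versions.get(key, 0) + 1


class HelperTransaction:
    """Helper path: speculative, buffers writes, validates reads at commit."""

    def __init__(self, store):
        self.store = store
        self.read_versions = {}  # key -> version first observed
        self.write_buffer = {}   # key -> pending value

    def read(self, key):
        if key in self.write_buffer:
            return self.write_buffer[key]
        with self.store.commit_lock:
            self.read_versions.setdefault(key, self.store.versions[key])
            return self.store.values[key]

    def write(self, key, value):
        self.write_buffer[key] = value

    def try_commit(self):
        """Return True if committed, False if a conflicting write was detected."""
        with self.store.commit_lock:
            for key, seen in self.read_versions.items():
                if self.store.versions[key] != seen:
                    return False  # a value this helper read has since changed
            for key, value in self.write_buffer.items():
                self.store.values[key] = value
                self.store.versions[key] = self.store.versions.get(key, 0) + 1
            return True
```

In this model, a helper that reads `x` and then sees the master overwrite `x` gets `False` from `try_commit()` and would have to retry, while the master's own transaction always completes. The real system aims for far cheaper instrumentation on the master than a lock held for the whole transaction. The sketch only shows the asymmetry in who is allowed to abort.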