scispace - formally typeset
Search or ask a question

Showing papers on "Transactional memory published in 2014"


Proceedings ArticleDOI
14 Apr 2014
TL;DR: The design, implementation, and evaluation of a high-throughput and memory-efficient concurrent hash table that supports multiple readers and writers is presented, and performance results demonstrate that the new hash table design, based around optimistic cuckoo hashing, outperforms other optimized concurrent hash tables by up to 2.5x for write-heavy workloads, even while using substantially less memory for small key-value items.
Abstract: Fast concurrent hash tables are an increasingly important building block as we scale systems to greater numbers of cores and threads. This paper presents the design, implementation, and evaluation of a high-throughput and memory-efficient concurrent hash table that supports multiple readers and writers. The design arises from careful attention to systems-level optimizations such as minimizing critical section length and reducing interprocessor coherence traffic through algorithm re-engineering. As part of the architectural basis for this engineering, we include a discussion of our experience and results adopting Intel's recent hardware transactional memory (HTM) support to this critical building block. We find that naively allowing concurrent access using a coarse-grained lock on existing data structures reduces overall performance with more threads. While HTM mitigates this slowdown somewhat, it does not eliminate it. Algorithmic optimizations that benefit both HTM and designs for fine-grained locking are needed to achieve high performance.Our performance results demonstrate that our new hash table design---based around optimistic cuckoo hashing---outperforms other optimized concurrent hash tables by up to 2.5x for write-heavy workloads, even while using substantially less memory for small key-value items. On a 16-core machine, our hash table executes almost 40 million insert and more than 70 million lookup operations per second.

172 citations


Proceedings ArticleDOI
19 May 2014
TL;DR: This work shows that HTM allows to achieve nearly lock-free processing of database transactions by carefully controlling the data layout and the access patterns, and allows for an optimistic, and thus very low-overhead execution of concurrent transactions.
Abstract: So far, transactional memory—although a promising technique—suffered from the absence of an efficient hardware implementation. The upcoming Haswell microarchitecture from Intel introduces hardware transactional memory (HTM) in mainstream CPUs. HTM allows for efficient concurrent, atomic operations, which is also highly desirable in the context of databases. On the other hand HTM has several limitations that, in general, prevent a one-to-one mapping of database transactions to HTM transactions. In this work we devise several building blocks that can be used to exploit HTM in main-memory databases. We show that HTM allows to achieve nearly lock-free processing of database transactions by carefully controlling the data layout and the access patterns. The HTM component is used for detecting the (infrequent) conflicts, which allows for an optimistic, and thus very low-overhead execution of concurrent transactions.

105 citations


Proceedings ArticleDOI
14 Apr 2014
TL;DR: DBX is an in-memory database that uses Intel's restricted transactional memory (RTM) to achieve high performance and good scalability across multi-core machines.
Abstract: The recent availability of Intel Haswell processors marks the transition of hardware transactional memory from research toys to mainstream reality. DBX is an in-memory database that uses Intel's restricted transactional memory (RTM) to achieve high performance and good scalability across multi-core machines. The main limitation (and also key to practicality) of RTM is its constrained working set size: an RTM region that reads or writes too much data will always be aborted. The design of DBX addresses this challenge in several ways. First, DBX builds a database transaction layer on top of an underlying shared-memory store. The two layers use separate RTM regions to synchronize shared memory access. Second, DBX uses optimistic concurrency control to separate transaction execution from its commit. Only the commit stage uses RTM for synchronization. As a result, the working set of the RTMs used scales with the meta-data of reads and writes in a database transaction as opposed to the amount of data read/written. Our evaluation using TPC-C workload mix shows that DBX achieves 506,817 transactions per second on a 4-core machine.

98 citations


Proceedings ArticleDOI
14 Apr 2014
TL;DR: StackTrack as mentioned in this paper is the first concurrent memory reclamation scheme that can be applied automatically by a compiler, while maintaining efficiency, and tracks thread variables dynamically, and in an atomic fashion, effectively makes all memory references visible without having threads pay the overhead of writing out this information.
Abstract: Dynamic memory reclamation is arguably the biggest open problem in concurrent data structure design: all known solutions induce high overhead, or must be customized to the specific data structure by the programmer, or both. This paper presents StackTrack, the first concurrent memory reclamation scheme that can be applied automatically by a compiler, while maintaining efficiency. StackTrack eliminates most of the expensive bookkeeping required for memory reclamation by leveraging the power of hardware transactional memory (HTM) in a new way: it tracks thread variables dynamically, and in an atomic fashion. This effectively makes all memory references visible without having threads pay the overhead of writing out this information. Our empirical results show that this new approach matches or outperforms prior, non-automated, techniques.

65 citations


Proceedings Article
01 Jan 2014
TL;DR: This work introduces an innovative self-tuning approach that exploits lightweight reinforcement learning techniques to identify the optimal TSX configuration in a workload-oblivious manner, i.e. not requiring any off-line/a-priori sampling of the application’s workload.
Abstract: Transactional Memory was recently integrated in Intel processors under the name TSX. We show that its performance can be significantly affected by the configuration of its interplay with the software-based fallback: in fact, there does not seem to exist a single configuration that can perform best independently of the application and workload. We address this challenge by introducing an innovative self-tuning approach that exploits lightweight reinforcement learning techniques to identify the optimal TSX configuration in a workload-oblivious manner, i.e. not requiring any off-line/a-priori sampling of the application’s workload. To achieve total transparency for the programmer, we integrated the proposed algorithm in the GCC compiler. Our evaluation shows improvements up to 2× over state of the art approaches, while remaining within 5% from the performance achievable using optimal static configurations.

63 citations


Proceedings ArticleDOI
24 Aug 2014
TL;DR: In this paper, Invyswell is presented, a novel HyTM that exploits the benefits and manages the limitations of Haswell's RTM and outperforms NOrec, a state-of-the-art STM, by 35%, Hybrid Norec, NOrec's hybrid implementation, by 18%, and haswell's hardware-only lock elision by 25% across all STAMP benchmarks.
Abstract: The Intel Haswell processor includes restricted transactional memory (RTM), which is the first commodity-based hardware transactional memory (HTM) to become publicly available. However, like other real HTMs, such as IBM's Blue Gene/Q, Haswell's RTM is best-effort, meaning it provides no transactional forward progress guarantees. Because of this, a software fallback system must be used in conjunction with Haswell's RTM to ensure transactional programs execute to completion. To complicate matters, Haswell does not provide escape actions. Without escape actions, non-transactional instructions cannot be executed within the context of a hardware transaction, thereby restricting the ways in which a software fallback can interact with the HTM. As such, the challenge of creating a scalable hybrid TM (HyTM) that uses Haswell's RTM and a software TM (STM) fallback is exacerbated. In this paper, we present Invyswell, a novel HyTM that exploits the benefits and manages the limitations of Haswell's RTM. After describing Invyswell's design, we show that it outperforms NOrec, a state-of-the-art STM, by 35%, Hybrid NOrec, NOrec's hybrid implementation, by 18%, and Haswell's hardware-only lock elision by 25% across all STAMP benchmarks.

60 citations


Proceedings ArticleDOI
24 Aug 2014
TL;DR: This study identifies workloads in which HTM clearly outperforms any alternative synchronization mechanism and shows that current HTM implementations suffer of restrictions that narrow the scope in which these can be more effective than state of the art software solutions.
Abstract: Over the last years Transactional Memory (TM) gained growing popularity as a simpler, attractive alternative to classic lock-based synchronization schemes. Recently, the TM landscape has been profoundly changed by the integration of Hardware TM (HTM) in Intel commodity processors, raising a number of questions on the future of TM. We seek answers to these questions by conducting the largest study on TM to date, comparing different locking techniques, hardware and software TMs, as well as different combinations of these mechanisms, from the dual perspective of performance and power consumption. Our study sheds a mix of light and shadows on currently available commodity HTM: on one hand, we identify workloads in which HTM clearly outperforms any alternative synchronization mechanism; on the other hand, we show that current HTM implementations suffer of restrictions that narrow the scope in which these can be more effective than state of the art software solutions. Thanks to the results of our study, we identify a number of compelling research problems in the areas of TM design, compilers and self-tuning.

59 citations


Proceedings ArticleDOI
Yutao Liu1, Yubin Xia1, Haibing Guan1, Binyu Zang1, Haibo Chen1 
19 Jun 2014
TL;DR: This paper proposes a novel approach, called TxIntro, which retrofits hardware transactional memory (HTM) for concurrent, timely and consistent introspection of guest VMs, and leverages the strong atomicity of HTM to actively monitor updates to critical kernel data structures.
Abstract: Virtual machine introspection, which provides tamperresistant, high-fidelity “out of the box” monitoring of virtual machines, has many prominent security applications including VM-based intrusion detection, malware analysis and memory forensic analysis. However, prior approaches are either intrusive in stopping the world to avoid race conditions between introspection tools and the guest VM, or providing no guarantee of getting a consistent state of the guest VM. Further, there is currently no effective means for timely examining the VM states in question. In this paper, we propose a novel approach, called TxIntro, which retrofits hardware transactional memory (HTM) for concurrent, timely and consistent introspection of guest VMs. Specifically, TxIntro leverages the strong atomicity of HTM to actively monitor updates to critical kernel data structures. Then TxIntro can mount introspection to timely detect malicious tampering. To avoid fetching inconsistent kernel states for introspection, TxIntro uses HTM to add related synchronization states into the read set of the monitoring core and thus can easily detect potential inflight concurrent kernel updates. We have implemented and evaluated TxIntro based on Xen VMM on a commodity Intel Haswell machine that provides restricted transactional memory (RTM) support. To demonstrate the effectiveness of TxIntro, we implemented a set of kernel rootkit detectors using TxIntro. Evaluation results show that TxIntro is effective in detecting these rootkits, and is efficient in adding negligible performance overhead.

53 citations


Proceedings ArticleDOI
24 Feb 2014
TL;DR: In this article, snapshot isolation transactional memory (SHM) is proposed to reduce the number of aborted transactions by up to 20x in some cases by exploiting snapshots, an established database model of transactions and only need to abort on write-write conflicts.
Abstract: Transactional memory represents an attractive conceptual model for programming concurrent applications. Unfortunately, high transaction abort rates can cause significant performance degradation. Conventional transactional memory realizations not only pessimistically abort transactions on every read-write conflict but also because of false sharing, cache evictions, TLB misses, page faults and interrupts. Consequently, the use of transactions needs to be restricted to a very small number of operations to achieve predictable performance, thereby, limiting its benefit to programming simplification. In this paper, we investigate snapshot isolation transactional memory in which transactions operate on memory snapshots that always guarantee consistent reads. By exploiting snapshots, an established database model of transactions, transactions can ignore read-write conflicts and only need to abort on write-write conflicts. Our implementation utilizes a memory controller that supports multiversion memory, to efficiently support snapshotting in hardware.We show that snapshot isolation can reduce the number of aborts in some cases by three orders of magnitude and improve performance by up to 20x.

47 citations


Proceedings ArticleDOI
15 Jul 2014
TL;DR: Evaluation on STAMP and on data structure benchmarks on an Intel Haswell processor shows that these techniques improve the speedup by up to 3.5 times and $10$ times respectively, compared to using Haswell's hardware lock elision as is.
Abstract: With hardware transactional memory (HTM) becoming available in mainstream processors, lock-based critical sections may now initiate a hardware transaction instead of taking the lock, enabling their concurrent execution unless a real data conflict occurs. However, just a few transactional aborts can cause the lock to be acquired non-transactionally resulting in the serialization of all the threads, severely degrading the amount of speedup obtained. In this paper we provide two software extension mechanisms that considerably improve the concurrency and speedup levels attained by lock based programs using HTM-based lock elision. The first sacrifices opacity to achieve higher levels of concurrency, and the second retains opacity while reaching slightly lower levels of concurrency.Evaluation on STAMP and on data structure benchmarks on an Intel Haswell processor shows that these techniques improve the speedup by up to 3.5 times and $10$ times respectively, compared to using Haswell's hardware lock elision as is.

42 citations


Proceedings ArticleDOI
24 Feb 2014
TL;DR: This paper describes the experiences transactionalizing the memcached application through the use of the GCC implementation of the Draft C++ TM Specification and presents experiences and recommendations that it hopes will guide the effort to integrate TM into languages.
Abstract: The addition of transactional memory (TM) support to existing languages provides the opportunity to create new soft- ware from scratch using transactions, and also to simplify or extend legacy code by replacing existing synchronization with language-level transactions. In this paper, we describe our experiences transactionalizing the memcached application through the use of the GCC implementation of the Draft C++ TM Specification. We present experiences and recommendations that we hope will guide the effort to integrate TM into languages, and that may also contribute to the growing collective knowledge about how programmers can begin to exploit TM in existing production-quality software.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work describes an adaptive policy and present experiments on three platforms, two of which support HTM, showing that the adaptive policy is competitive with and often significantly better than hand-tuned static policies.
Abstract: Transactional Lock Elision (TLE) and optimistic software execution can both improve scalability of lock-based programs. The former uses hardware transactional memory (HTM) without requiring code changes; the latter involves modest code changes but does not require special hardware support. Numerous factors affect the choice of technique, including: critical section code, calling context, workload characteristics, and hardware support for synchronization. The ALE library integrates these techniques, and collects detailed, fine-grained performance data, enabling policies that decide between them at runtime for each critical section execution. We describe an adaptive policy and present experiments on three platforms, two of which support HTM, showing that---without tuning for specific platforms or workload---the adaptive policy is competitive with and often significantly better than hand-tuned static policies.

Proceedings ArticleDOI
19 May 2014
TL;DR: It is found that which system performs better depends heavily on the workload characteristics of the RTM system, and what kinds of software optimizations can help overcome RTM's hardware limitations.
Abstract: Hardware transactional memory implementations are becoming increasingly available. For instance, the Intel Core i7 4770 implements Restricted Transactional Memory (RTM) support for Intel Transactional Synchronization Extensions (TSX). In this paper, we present a detailed evaluation of RTM performance and energy expenditure. We compare RTM behavior to that of the TinySTM software transactional memory system, first by running micro benchmarks, and then by running the STAMP benchmark suite. We find that which system performs better depends heavily on the workload characteristics. We then conduct a case study of two STAMP applications to assess the impact of programming style on RTM performance and to investigate what kinds of software optimizations can help overcome RTM's hardware limitations.

Journal ArticleDOI
TL;DR: Algorithms for concurrently reading and modifying a red‐black tree (RBTree) are presented, which have deterministic response times for a given tree size and uncontended read performance that is at least 60% faster than other known approaches.
Abstract: This paper presents algorithms for concurrently reading and modifying a red-black tree RBTree. The algorithms allow wait-free, linearly scalable lookups in the presence of concurrent inserts and deletes. They have deterministic response times for a given tree size and uncontended read performance that is at least 60% faster than other known approaches. The techniques used to derive these algorithms arise from a concurrent programming methodology called relativistic programming. Relativistic programming introduces write-side delay primitives that allow the writer to pay most of the cost of synchronization between readers and writers. Only minimal synchronization overhead is placed on readers. Relativistic programming avoids unnecessarily strict ordering of read and write operations while still providing the capability to enforce linearizability. This paper shows how relativistic programming can be used to build a concurrent RBTree with synchronization-free readers and both lock-based and transactional memory-based writers. Copyright © 2013 John Wiley & Sons, Ltd.

Patent
27 Feb 2014
TL;DR: In this paper, a transaction-begin instruction specifies an initial value for a transaction count-to-completion (CTC) value, which is adjusted as the transaction progresses.
Abstract: When executed, a transaction-begin instruction specifies an initial value for a transaction-count-to-completion (CTC) value for a transaction. The initial value indicates a predicted duration of the transaction. The CTC value may be a number of instructions to completion or an amount of time to completion. The CTC value is adjusted as the transaction progresses. The adjusted CTC value indicates how far the transaction is from completion. When a disruptive event associated with inducing transactional aborts, such as an interrupt or a conflicting memory access, is identified while processing the transaction, processing of the disruptive event is deferred if the adjusted CTC value satisfies deferral criteria. If the adjusted CTC value does not satisfy deferral criteria, the transaction is aborted and the disruptive event is processed.

Proceedings ArticleDOI
22 Oct 2014
TL;DR: The idea of using the dataflow concept to define novel thread types that are called Data-Flow-Threads or DF- Threads, aimed at a more complete model with the way of managing the mutable shared state by relying on the transactional memory semantics.
Abstract: Current computing systems are mostly focused on achieving performance, programmability, energy efficiency, resiliency by essentially trying to replicate the uni-core execution model n-times in parallel on a multi/many-core system. This choice has heavily conditioned the way both software and hardware are designed nowadays. However, as old as computer architecture is the concept of dataflow, that is "initiating an activity in presence of the data it needs to perform its function" [J. Dennis]. Dataflow had been historically initially explored at instruction level and has led to the major result of the realization of current superscalar processors, which implement a form of "restricted dataflow" at instruction level. In this paper, we illustrate the idea of using the dataflow concept to define novel thread types that we call Data-Flow-Threads or DF-Threads. The advantages we are aiming at regard several aspects, not fully explored yet: i) isolating the computations so that communication patterns can be more efficiently managed by a not-so-complex architecture, ii) possibility to repeat the execution of a thread in case of detected faults affecting the thread resources, iii) providing a minimalistic low-level API for allowing compilers and programmers to map their parallel codes and architects to implement more efficient and scalable systems. The semantics of DF-Threads is also tightly connected to their execution model, hereby illustrated. Several other efforts have been done with similar purposes since the introduction of macro-dataflow through the more recent DF-Codelets and the OCR project. In our case, we aim at a more complete model with the above advantages and in particular including the way of managing the mutable shared state by relying on the transactional memory semantics. Our initial experiments show how to map some simple kernel and the scalability potential on a futuristic 1k-core many-core.

Proceedings ArticleDOI
Rei Odaira1, Takuya Nakaike1
01 Oct 2014
TL;DR: This investigation suggests that future hardware should support not only ordered transactions but also memory data forwarding, data synchronization, multi-version cache, and word-level conflict detection for thread-level speculation.
Abstract: Thread-level speculation can speed up a single-thread application by splitting its execution into multiple tasks and speculatively executing those tasks in multiple threads. Efficient thread-level speculation requires hardware support for memory conflict detection, store buffering, and execution rollback, and in addition, previous research has also proposed advanced optimization facilities, such as ordered transactions and data forwarding. Recently, implementations of hardware transactional memory (HTM) are coming into the market with minimal hardware support for thread-level speculation. However, few implementations offer advanced optimization facilities. Thus, it is important to determine how well thread-level speculation can be realized on the current HTM implementations, and what optimization facilities should be implemented in the future. In our research, we studied thread-level speculation on the off-the-shelf HTM implementation in Intel TSX. We manually modified potentially parallel benchmarks in SPEC CPU2006 for thread-level speculation. Our experimental results showed that thread-level speculation resulted in up to an 11% speed-up even without the advanced optimization facilities, but actually degraded the performance in most cases. In contrast to our expectations, the main reason for the performance loss was not the lack of hardware support for ordered transactions but the transaction aborts due to memory conflicts. Our investigation suggests that future hardware should support not only ordered transactions but also memory data forwarding, data synchronization, multi-version cache, and word-level conflict detection for thread-level speculation.

Journal ArticleDOI
TL;DR: To the best of the knowledge, this is the first deterministic consistency protocol for distributed transactional memory that achieves poly-log approximation in general networks.
Abstract: We consider the problem of implementing transactional memory in large-scale distributed networked systems We present Spiral, a novel distributed directory-based protocol for transactional memory, and theoretically analyze and experimentally evaluate it for the performance boundaries of this approach from the worst-case perspective Spiral is designed for the data-flow distributed implementation of software transactional memory which supports three basic operations: publish, allowing a shared object to be inserted in the directory so that other nodes can find it; lookup, providing a read-only copy of the object to the requesting node; move, allowing the requesting node to write the object locally after the node gets it The protocol runs on a hierarchical directory construction based on sparse covers, where clusters at each level are ordered to avoid race conditions while serving concurrent requests Given a shared object the protocol maintains a directory path pointing to the object The basic idea is to use "spiral" paths that grow outward to search for the directory path of the object in a bottom-up fashion For general networks, this protocol guarantees an $$\mathcal{O}(\log ^2 n\cdot \log D)$$ O ( log 2 n · log D ) approximation in sequential and one-shot concurrent executions of a finite set of move requests, where $$n$$ n is the number of nodes and $$D$$ D is the diameter of the network It also guarantees poly-log approximation for any single lookup request Our bounds are deterministic and hold in the worst-case Moreover, this protocol requires only polylogarithmic bits of memory per node Experimental evaluations in real networks also confirm our theoretical findings To the best of our knowledge, this is the first deterministic consistency protocol for distributed transactional memory that achieves poly-log approximation in general networks

Proceedings ArticleDOI
18 Oct 2014
TL;DR: This paper illustrates three lightweight STM designs that are capable of scaling to a large number of GPU threads and proposes to support software transactional memory on GPU that is able to achieve performance comparable to fine-grained locking, while requiring minimal programming efforts.
Abstract: Graphics Processing Units (GPUs) provide an attractive option for extracting data-level parallelism from diverse applications However, some applications, although possess abundant data-level parallelism, exhibit irregular memory access patterns to the shared data structures Porting such applications to GPUs requires synchronization mechanisms such as locks, which significantly increase the programming complexity Coarse-grained locking, where a single lock controls all the shared resources, although reduces programming efforts, can substantially serialize GPU threads On the other hand, fine-grained locking, where each data element is protected by an independent lock, although facilitates maximum parallelism, requires significant programming efforts To overcome these challenges, we propose to support software transactional memory (STM) on GPU that is able to achieve performance comparable to fine-grained locking, while requiring minimal programming efforts Software-based transactional execution can incur significant runtime overheads due to activities such as detecting conflicts across thousands of GPU threads and managing a consistent memory state Thus, in this paper we illustrate three lightweight STM designs that are capable of scaling to a large number of GPU threads In our system, programmers simply mark the critical sections in the applications, and the underlying STM support is able to achieve performance comparable to fine-grained locking

Journal ArticleDOI
TL;DR: Control transactions without compromising their simplicity for the sake of expressiveness, application concurrency, or performance.
Abstract: Control transactions without compromising their simplicity for the sake of expressiveness, application concurrency, or performance

Proceedings ArticleDOI
06 Feb 2014
TL;DR: This paper presents the first results eliminating the GIL in Ruby using Hardware Transactional Memory (HTM) in the IBM zEnterprise EC12 and Intel 4th Generation Core processors.
Abstract: Many scripting languages use a Global Interpreter Lock (GIL) to simplify the internal designs of their interpreters, but this kind of lock severely lowers the multi-thread per-formance on multi-core machines. This paper presents our first results eliminating the GIL in Ruby using Hardware Transactional Memory (HTM) in the IBM zEnterprise EC12 and Intel 4th Generation Core processors. Though prior prototypes replaced a GIL with HTM, we tested real-istic programs, the Ruby NAS Parallel Benchmarks (NPB), the WEBrick HTTP server, and Ruby on Rails. We devised a new technique to dynamically adjust the transaction lengths on a per-bytecode basis, so that we can optimize the likelihood of transaction aborts against the relative overhead of the instructions to begin and end the transactions. Our results show that HTM achieved 1.9- to 4.4-fold speedups in the NPB programs over the GIL with 12 threads, and 1.6- and 1.2-fold speedups in WEBrick and Ruby on Rails, respectively. The dynamic transaction-length adjustment chose the best transaction lengths for any number of threads and applications with sufficiently long running times.

Proceedings ArticleDOI
06 Feb 2014
TL;DR: This paper seeks to identify a sweet spot between permissiveness and efficiency by introducing the Time-Warp Multi-version algorithm (TWM), and demonstrates the practicality of this approach through an extensive experimental study, where it is shown to show an average performance improvement of 65% in high concurrency scenarios.
Abstract: The notion of permissiveness in Transactional Memory (TM) translates to only aborting a transaction when it cannot be accepted in any history that guarantees correctness criterion. This property is neglected by most TMs, which, in order to maximize implementation's efficiency, resort to aborting transactions under overly conservative conditions. In this paper we seek to identify a sweet spot between permissiveness and efficiency by introducing the Time-Warp Multi-version algorithm (TWM). TWM is based on the key idea of allowing an update transaction that has performed stale reads (i.e., missed the writes of concurrently committed transactions) to be serialized by committing it in the past, which we call a time-warp commit. At its core, TWM uses a novel, lightweight validation mechanism with little computational overheads. TWM also guarantees that read-only transactions can never be aborted. Further, TWM guarantees Virtual World Consistency, a safety property that is deemed as particularly relevant in the context of TM. We demonstrate the practicality of this approach through an extensive experimental study, where we compare TWM with four other TMs, and show an average performance improvement of 65% in high concurrency scenarios.

Proceedings ArticleDOI
12 Jun 2014
TL;DR: This work explores how this can be applied to three garbage collection scenarios in Jikes RVM: parallel copying, concurrent copying and bitmap marking, and demonstrates gains in concurrent copying speed over traditional synchronisation mechanisms of 48-101%.
Abstract: Intel's latest processor microarchitecture, Haswell, adds support for a restricted form of transactional memory to the x86 programming model. We explore how this can be applied to three garbage collection scenarios in Jikes RVM: parallel copying, concurrent copying and bitmap marking. We demonstrate gains in concurrent copying speed over traditional synchronisation mechanisms of 48-101%. We also show how similar but portable performance gains can be achieved through software transactional memory techniques. We identify the architectural overhead of capturing sufficient work for transactional execution as a major stumbling block to the effective use of transactions in the other scenarios.

Proceedings ArticleDOI
23 Jun 2014
TL;DR: This work introduces a new implementation of condition variables, which uses transactions internally, which can be used from within both transactions and lock-based critical sections, and which is compatible with existing C/C++ interfaces for condition synchronization.
Abstract: Recent microprocessors and compilers have added support for transactional memory (TM). While state-of-the-art TM systems allow the replacement of lock-based critical sections with scalable, optimistic transactions, there is not yet an acceptable mechanism for supporting the use of condition variables in transactions. We introduce a new implementation of condition variables, which uses transactions internally, which can be used from within both transactions and lock-based critical sections, and which is compatible with existing C/C++ interfaces for condition synchronization. By moving most of the mechanism for condition synchronization into user-space, our condition variables have low overhead and permit flexible interfaces that can avoid some of the pitfalls of traditional condition variables. Performance evaluation on an unmodified PARSEC benchmark suite shows equivalent performance to lock-basedcode, and our transactional condition variables also make it possible to replace all locks in PARSEC with transactions.

01 Jan 2014
TL;DR: This paper identifies several pitfalls that show that lazy subscription is not safe for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways and proposes hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for Tle, without the need for special compiler support.
Abstract: Transactional Lock Elision (TLE) uses Hardware Transactional Memory (HTM) to execute unmodified critical sections concurrently, even if they are protected by the same lock. To ensure correctness, the transactions used to execute these critical sections “subscribe” to the lock by reading it and checking that it is available. A recent paper proposed using the tempting “lazy subscription” optimization for a similar technique in a different context, namely transactional systems that use a single global lock (SGL) to protect all transactional data. We identify several pitfalls that show that lazy subscription is not safe for TLE because unmodified critical sections executing before subscribing to the lock may behave incorrectly in a number of subtle ways. We also show that recently proposed compiler support for modifying transaction code to ensure subscription occurs before any incorrect behavior could manifest is not sufficient to avoid all of the pitfalls we identify. We further argue that extending such compiler support to avoid all pitfalls would add substantial complexity and would usually limit the extent to which subscription can be deferred, undermining the effectiveness of the optimization. Hardware extensions suggested in the recent proposal also do not address all of the pitfalls we identify. A longer version of this paper proposes hardware extensions that make lazy subscription safe, both for SGL-based transactional systems and for TLE, without the need for special compiler support.

Book ChapterDOI
12 Oct 2014
TL;DR: One of the main challenges in stating the correctness of transactional memory (TM) systems is the need to provide guarantees on the system state observed by live transactions, i.e., those that have not yet committed or aborted.
Abstract: One of the main challenges in stating the correctness of transactional memory (TM) systems is the need to provide guarantees on the system state observed by live transactions, i.e., those that have not yet committed or aborted. A TM correctness condition should be weak enough to allow flexibility in implementation, yet strong enough to disallow undesirable TM behavior, which can lead to run-time errors in live transactions. The latter feature is formalized by observational refinement between TM implementations, stating that properties of a program using a concrete TM implementation can be established by analyzing its behavior with an abstract TM, serving as a specification of the concrete one.

Book ChapterDOI
16 Dec 2014
TL;DR: This paper provides general guidelines to show how to design a transactional (non-optimized) version of the highly concurrent lazy set with a minimal reengineering effort and shows how to make specific optimizations to the implementations of the OTB-Set for further enhancing its performance.
Abstract: Transactional data structures with the same performance of highly concurrent data structures enable performance-competitive transactional applications. Although Software Transactional Memory (STM) is a promising technology for designing and implementing transactional applications, STM-based transactional data structures still perform inferior to their optimized, concurrent (i.e. non-transactional) counterparts. In this paper, we present OTB-Set, an efficient optimistic transactional lazy set based on both linked-list and skip-list implementations. We first provide general guidelines to show how to design a transactional (non-optimized) version of the highly concurrent lazy set with a minimal reengineering effort. Subsequently, we show how to make specific optimizations to the implementations of the OTB-Set for further enhancing its performance. We also prove that our OTB-Set provides linearizable individual operations and opaque transactions. Our experimental study on a 64-core machine reveals that OTB-Set outperforms competitors in most workloads.

Journal ArticleDOI
TL;DR: This model has been implemented in a lightweight software transactional memory system, TinySTM, and has been evaluated on a set of benchmarks obtaining an important performance improvement over the baseline TM system.
Abstract: This article describes a transactional memory execution model intended to exploit maximum parallelism from sequential and multithreaded programs. A program code section is partitioned into chunks that will be mapped onto threads and executed transactionally. These transactions run concurrently and out of order, trying to exploit maximum parallelism but managed by a specific fully distributed commit control to meet data dependencies. To accomplish correct parallel execution, a partial precedence order relation is derived from the program code section and/or defined by the programmer. When a conflict between chunks is eagerly detected, the precedence order relation is used to determine the best policy to solve the conflict that preserves the precedence order while maximizing concurrency. The model defines a new transactional state called executed but not committed. This state allows exploiting concurrency on two levels: intrathread and interthread. Intrathread concurrency is improved by having pending uncommitted transactions while executing a new one in the same thread. The new state improves interthread concurrency because it permits out-of-order transaction commits regarding the precedence order. Our model has been implemented in a lightweight software transactional memory system, TinySTM, and has been evaluated on a set of benchmarks obtaining an important performance improvement over the baseline TM system.

Proceedings ArticleDOI
01 Jun 2014
TL;DR: This paper describes the innovation that was applied to tools and methodology in pre-silicon simulation, acceleration and post- silicon in order to verify transactional memory in the IBM POWER8 processor core.
Abstract: Transactional memory is a promising mechanism for synchronizing concurrent programs that eliminates locks at the expense of hardware complexity. Transactional memory is a hard feature to verify. First, transactions comprise several instructions that must be observed as a single global atomic operation. In addition, there are many reasons a transaction can fail. This results in a high level of non-determinism which must be tamed by the verification methodology. This paper describes the innovation that was applied to tools and methodology in pre-silicon simulation, acceleration and post-silicon in order to verify transactional memory in the IBM POWER8 processor core.

Proceedings ArticleDOI
20 May 2014
TL;DR: This paper presents DaSH - the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models, and uses DaSH to evaluate three different hybrid data flow models, identify their advantages and shortcomings, and motivate further research on their characteristics.
Abstract: The current trend in development of parallel programming models is to combine different well established models into a single programming model in order to support efficient implementation of a wide range of real world applications. The dataflow model has particularly managed to recapture the interest of the research community due to its ability to express parallelism efficiently. Thus, a number of recently proposed hybrid parallel programming models combine dataflow and traditional shared memory. Their findings have influenced the introduction of task dependency in the recently published OpenMP 4.0 standard. In this paper, we present DaSH - the first comprehensive benchmark suite for hybrid dataflow and shared memory programming models. DaSH features 11 benchmarks, each representing one of the Berkeley dwarfs that capture patterns of communication and computation common to a wide range of emerging applications. We also include sequential and shared-memory implementations based on OpenMP and TBB to facilitate easy comparison between hybrid dataflow implementations and traditional shared memory implementations based on work-sharing and/or tasks. Finally, we use DaSH to evaluate three different hybrid dataflow models, identify their advantages and shortcomings, and motivate further research on their characteristics.