
Showing papers on "Transactional memory published in 2016"


Proceedings ArticleDOI
14 Jun 2016
TL;DR: A novel hybrid SCM-DRAM persistent and concurrent B-Tree, named Fingerprinting Persistent Tree (FPTree), that achieves performance similar to DRAM-based counterparts is proposed, together with a hybrid concurrency scheme for the FPTree that is partially based on Hardware Transactional Memory.
Abstract: The advent of Storage Class Memory (SCM) is driving a rethink of storage systems towards a single-level architecture where memory and storage are merged. In this context, several works have investigated how to design persistent trees in SCM as a fundamental building block for these novel systems. However, these trees are significantly slower than DRAM-based counterparts since trees are latency-sensitive and SCM exhibits higher latencies than DRAM. In this paper we propose a novel hybrid SCM-DRAM persistent and concurrent B-Tree, named Fingerprinting Persistent Tree (FPTree) that achieves similar performance to DRAM-based counterparts. In this novel design, leaf nodes are persisted in SCM while inner nodes are placed in DRAM and rebuilt upon recovery. The FPTree uses Fingerprinting, a technique that limits the expected number of in-leaf probed keys to one. In addition, we propose a hybrid concurrency scheme for the FPTree that is partially based on Hardware Transactional Memory. We conduct a thorough performance evaluation and show that the FPTree outperforms state-of-the-art persistent trees with different SCM latencies by up to a factor of 8.2. Moreover, we show that the FPTree scales very well on a machine with 88 logical cores. Finally, we integrate the evaluated trees in memcached and a prototype database. We show that the FPTree incurs an almost negligible performance overhead over using fully transient data structures, while significantly outperforming other persistent trees.
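The fingerprinting idea above can be sketched as follows: each leaf stores a 1-byte hash per key, so a lookup rejects almost every slot with a cheap byte comparison and, in expectation, performs only one full key comparison. This is a minimal sketch; the leaf layout, slot count, and hash function are illustrative assumptions, not the paper's exact design.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical leaf layout: 1-byte fingerprints of the keys are scanned
// first, so on average only one full key comparison is needed per probe.
struct Leaf {
    static constexpr std::size_t kSlots = 8;
    std::array<uint8_t, kSlots> fingerprints{};
    std::array<uint64_t, kSlots> keys{};
    std::array<bool, kSlots> used{};
};

uint8_t fingerprint(uint64_t key) {
    // Illustrative 1-byte hash (Fibonacci hashing); the paper's exact
    // hash function is not reproduced here.
    return static_cast<uint8_t>((key * 0x9E3779B97F4A7C15ULL) >> 56);
}

bool leaf_insert(Leaf& leaf, uint64_t key) {
    for (std::size_t i = 0; i < Leaf::kSlots; ++i) {
        if (!leaf.used[i]) {
            leaf.fingerprints[i] = fingerprint(key);
            leaf.keys[i] = key;
            leaf.used[i] = true;
            return true;
        }
    }
    return false;  // leaf full; a real tree would split here
}

bool leaf_contains(const Leaf& leaf, uint64_t key) {
    uint8_t fp = fingerprint(key);
    for (std::size_t i = 0; i < Leaf::kSlots; ++i) {
        // Most slots are rejected by the cheap fingerprint comparison; the
        // full key comparison runs only when the fingerprint matches.
        if (leaf.used[i] && leaf.fingerprints[i] == fp && leaf.keys[i] == key)
            return true;
    }
    return false;
}
```

Because mismatching fingerprints filter out non-matching slots before any key is read, the expected number of in-leaf key probes approaches one, which is exactly what makes SCM-resident leaves tolerable despite SCM's higher latency.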

281 citations


Proceedings ArticleDOI
02 Jun 2016
TL;DR: This work designs and implements a library supporting transactions on any number of maps, sets, and queues; stand-alone data structure operations are treated as first-class citizens and execute with virtually no overhead, at the speed of the original data structure library.
Abstract: We introduce transactions into libraries of concurrent data structures; such transactions can be used to ensure atomicity of sequences of data structure operations. By focusing on transactional access to a well-defined set of data structure operations, we strike a balance between the ease-of-programming of transactions and the efficiency of custom-tailored data structures. We exemplify this concept by designing and implementing a library supporting transactions on any number of maps, sets (implemented as skiplists), and queues. Our library offers efficient and scalable transactions, which are an order of magnitude faster than state-of-the-art transactional memory toolkits. Moreover, our approach treats stand-alone data structure operations (like put and enqueue) as first class citizens, and allows them to execute with virtually no overhead, at the speed of the original data structure library.

48 citations


Proceedings ArticleDOI
11 Jul 2016
TL;DR: This work's approach leverages the semantic knowledge of the data structure to eliminate the overhead of false conflicts and rollbacks in high-performance lock-free transactional linked data structures without revamping the data structures' original synchronization design.
Abstract: Non-blocking data structures allow scalable and thread-safe accesses to shared data. They provide individual operations that appear to execute atomically. However, it is often desirable to execute multiple operations atomically in a transactional manner. Previous solutions, such as software transactional memory (STM) and transactional boosting, manage transaction synchronization in an external layer separated from the data structure's own thread-level concurrency control. Although this reduces programming effort, it leads to overhead associated with additional synchronization and the need to roll back aborted transactions. In this work, we present a new methodology for transforming high-performance lock-free linked data structures into high-performance lock-free transactional linked data structures without revamping the data structures' original synchronization design. Our approach leverages the semantic knowledge of the data structure to eliminate the overhead of false conflicts and rollbacks. We encapsulate all operations, operands, and transaction status in a transaction descriptor, which is shared among the nodes accessed by the same transaction. We coordinate threads to help finish the remaining operations of delayed transactions based on their transaction descriptors. When a transaction fails, we recover the correct abstract state by reversely interpreting the logical status of a node. In our experimental evaluation using transactions with randomly generated operations, our lock-free transactional lists and skiplist outperform the transactional boosted ones by 40% on average and as much as 125% for large transactions. They also outperform the alternative STM-based approaches by a factor of 3 to 10 across all scenarios. More importantly, we achieve 4 to 6 orders of magnitude less spurious aborts than the alternatives.
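A plausible shape for the transaction descriptor described above is a single shared object holding the operation list and an atomic status word, so that any thread encountering a delayed transaction can help complete it, and threads race to finalize the outcome exactly once. The field names and operation set below are illustrative assumptions, not the paper's definitions.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical transaction descriptor: operations, operands, and a shared
// status word live in one object that is reachable from every node the
// transaction touches, enabling cooperative helping.
enum class TxStatus : uint8_t { Active, Committed, Aborted };
enum class OpType : uint8_t { Insert, Delete, Find };

struct Operation {
    OpType type;
    int64_t key;
};

struct TxDescriptor {
    std::atomic<TxStatus> status{TxStatus::Active};
    std::vector<Operation> ops;  // immutable once the descriptor is published

    // Helping threads and the owner race to finalize the transaction;
    // the CAS guarantees exactly one of them succeeds.
    bool try_finalize(TxStatus outcome) {
        TxStatus expected = TxStatus::Active;
        return status.compare_exchange_strong(expected, outcome);
    }
};
```

Because the status transition is a single CAS on the descriptor, a node that observes `Aborted` can reinterpret its logical state without any per-node rollback log, which is the source of the overhead savings the abstract claims.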

38 citations


Proceedings ArticleDOI
25 Mar 2016
TL;DR: The proposed system, ProteusTM, hides behind the TM interface a large library of implementations and leverages a novel multi-dimensional online optimization scheme, combining two popular learning techniques: Collaborative Filtering and Bayesian Optimization.
Abstract: The Transactional Memory (TM) paradigm promises to greatly simplify the development of concurrent applications. This led, over the years, to the creation of a plethora of TM implementations delivering wide ranges of performance across workloads. Yet, no universal implementation fits each and every workload. In fact, the best TM for a given workload can prove disastrous for another. This forces developers to face the complex task of tuning TM implementations, which significantly hampers their wide adoption. In this paper, we address the challenge of automatically identifying the best TM implementation for a given workload. Our proposed system, ProteusTM, hides behind the TM interface a large library of implementations. Underneath, it leverages a novel multi-dimensional online optimization scheme, combining two popular learning techniques: Collaborative Filtering and Bayesian Optimization. We integrated ProteusTM in GCC and demonstrate its ability to switch between TMs and adapt several configuration parameters (e.g., number of threads). We extensively evaluated ProteusTM, obtaining average performance

32 citations


Proceedings ArticleDOI
Andrea Cerone1, Alexey Gotsman1
25 Jul 2016
TL;DR: An alternative specification to SI is given that characterises it in terms of the transactional dependency graphs of Adya et al., generalising serialization graphs, without requiring additional information about transactions' start and commit points to be added to the graphs.
Abstract: Snapshot isolation (SI) is a widely used consistency model for transaction processing, implemented by most major databases and some transactional memory systems. Unfortunately, its classical definition is given in a low-level operational way, by an idealised concurrency-control algorithm, and this complicates reasoning about the behaviour of applications running under SI. We give an alternative specification to SI that characterises it in terms of the transactional dependency graphs of Adya et al., generalising serialization graphs. Unlike previous work, our characterisation does not require adding additional information to dependency graphs about start and commit points of transactions. We then exploit our specification to obtain two kinds of static analyses. The first one checks when a set of transactions running under SI can be chopped into smaller pieces without introducing new behaviours, to improve performance. The other analysis checks whether a set of transactions running under a weakening of SI behaves the same as when running under SI.

32 citations


Journal ArticleDOI
01 Nov 2016
TL;DR: PHyTM allows hardware assisted ACID transactions to execute concurrently with pure software transactions, which allows applications to gain the benefit of persistent HTM while simultaneously accommodating unbounded transactions (with a high degree of concurrency).
Abstract: Processors with hardware support for transactional memory (HTM) are rapidly becoming commonplace, and processor manufacturers are currently working on implementing support for upcoming non-volatile memory (NVM) technologies. The combination of HTM and NVM promises to be a natural choice for in-memory database synchronization. However, limitations on the size of hardware transactions and the lack of progress guarantees by modern HTM implementations prevent some applications from obtaining the full benefit of hardware transactional memory. In this paper, we propose a persistent hybrid TM algorithm called PHyTM for systems that support NVM and HTM. PHyTM allows hardware assisted ACID transactions to execute concurrently with pure software transactions, which allows applications to gain the benefit of persistent HTM while simultaneously accommodating unbounded transactions (with a high degree of concurrency). Experimental simulations demonstrate that PHyTM is fast and scalable for realistic workloads.

31 citations


Proceedings ArticleDOI
25 Mar 2016
TL;DR: TxRace is a new software data race detector that leverages commodity hardware transactional memory (HTM) to speed up data race detection, reducing the average runtime overhead of dynamic data race detection from 11.68x to 4.65x with only a small number of false negatives.
Abstract: Detecting data races is important for debugging shared-memory multithreaded programs, but the high runtime overhead prevents the wide use of dynamic data race detectors. This paper presents TxRace, a new software data race detector that leverages commodity hardware transactional memory (HTM) to speed up data race detection. TxRace instruments a multithreaded program to transform synchronization-free regions into transactions, and exploits the conflict detection mechanism of HTM for lightweight data race detection at runtime. However, the limitations of the current best-effort commodity HTMs expose several challenges in using them for data race detection: (1) lack of ability to pinpoint racy instructions, (2) false positives caused by cache line granularity of conflict detection, and (3) transactional aborts for non-conflict reasons (e.g., capacity or unknown). To overcome these challenges, TxRace performs lightweight HTM-based data race detection at first, and occasionally switches to slow yet precise data race detection only for the small fraction of execution intervals in which potential races are reported by HTM. According to the experimental results, TxRace reduces the average runtime overhead of dynamic data race detection from 11.68x to 4.65x with only a small number of false negatives.

28 citations


Patent
16 Dec 2016
TL;DR: Systems, apparatuses, and methods for improving TM throughput using a TM region indicator (or color) are described, allowing younger TM regions to have their instructions retired while waiting for older regions to commit.
Abstract: Systems, apparatuses, and methods for improving TM throughput using a TM region indicator (or color) are described. Through the use of TM region indicators younger TM regions can have their instructions retired while waiting for older TM regions to commit.

27 citations


Journal ArticleDOI
TL;DR: This work attempts to solve the problem of finding optimal partitioning schemes by automating the partitioning process, choosing the correct transactional primitive, and routing transactions appropriately.
Abstract: Modern transactional processing systems need to be fast and scalable, but this means many such systems settled for weak consistency models. It is however possible to achieve all of strong consistency, high scalability and high performance, by using fine-grained partitions and light-weight concurrency control that avoids superfluous synchronization and other overheads such as lock management. Independent transactions are one such mechanism, relying on good partitions and appropriately defined transactions. On the downside, it is not usually straightforward to determine optimal partitioning schemes, especially when dealing with non-trivial amounts of data. Our work attempts to solve this problem by automating the partitioning process, choosing the correct transactional primitive, and routing transactions appropriately.

26 citations


Proceedings ArticleDOI
11 Jul 2016
TL;DR: It is shown that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance, and a simple adaptive technique is presented that overcomes this problem by throttling threads as necessary to optimize system performance.
Abstract: The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example for such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.
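The TLE pattern that this paper (and the following one) builds on has a simple control-flow skeleton: attempt the critical section a bounded number of times as a hardware transaction, then fall back to the lock for guaranteed progress. In the sketch below, `htm_begin`/`htm_end` are stand-ins for real HTM primitives (e.g. Intel TSX `_xbegin`/`_xend`); the stub always reports an abort so the sketch runs on any machine, exercising the fallback path.

```cpp
#include <mutex>

// Stand-ins for hardware transaction primitives. The stub pretends every
// speculative attempt aborts, which forces the lock fallback below.
bool htm_begin() { return false; }
void htm_end() {}

constexpr int kMaxAttempts = 3;  // illustrative retry budget
std::mutex fallback_lock;

template <typename CriticalSection>
void tle_execute(CriticalSection cs) {
    for (int attempt = 0; attempt < kMaxAttempts; ++attempt) {
        if (htm_begin()) {
            // Real TLE would also read the lock word here, so that a
            // concurrent lock holder aborts this transaction.
            cs();
            htm_end();
            return;
        }
    }
    // All speculative attempts failed: acquire the lock for progress.
    std::lock_guard<std::mutex> guard(fallback_lock);
    cs();
}
```

The throttling technique in the paper operates on top of this skeleton: rather than letting cross-socket threads burn retry attempts that almost always abort, it restricts which threads are allowed to speculate at a given time.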

26 citations


Proceedings ArticleDOI
27 Feb 2016
TL;DR: Two algorithms are presented that rely on existing compiler support for transactional programs and allow threads to speculate concurrently on HTM along with a thread holding the lock, and demonstrate the benefit of the algorithms over TLE and other related approaches with an in-depth analysis of a number of benchmarks.
Abstract: Transactional lock elision (TLE) is a well-known technique that exploits hardware transactional memory (HTM) to introduce concurrency into lock-based software. It achieves that by attempting to execute a critical section protected by a lock in an atomic hardware transaction, reverting to the lock if these attempts fail. One significant drawback of TLE is that it disables hardware speculation once there is a thread running under lock. In this paper we present two algorithms that rely on existing compiler support for transactional programs and allow threads to speculate concurrently on HTM along with a thread holding the lock. We demonstrate the benefit of our algorithms over TLE and other related approaches with an in-depth analysis of a number of benchmarks and a wide range of workloads, including an AVL tree-based micro-benchmark and ccTSA, a real sequence assembler application.

Proceedings Article
01 Jan 2016
TL;DR: Persistent hybrid TM is studied, which allows hardware assisted ACID transactions to execute concurrently with pure software transactions to gain the benefit of persistent HTM while accommodating unbounded transactions with a high degree of concurrency.
Abstract: The availability of hardware transactional memory (HTM) and the feasibility of persistent hardware transactions make them a natural choice for in-memory database synchronization. However, limitations on the size of hardware transactions and the lack of progress guarantees by modern HTM implementations prevent some applications from obtaining the benefit of hardware transactional memory. In this paper, we study persistent hybrid TM, which allows hardware assisted ACID transactions to execute concurrently with pure software transactions. This allows applications to gain the benefit of persistent HTM while accommodating unbounded transactions with a high degree of concurrency. Our experiments demonstrate that PHyTM is fast and scalable for realistic workloads.

Journal ArticleDOI
TL;DR: It is shown that HTM allows for achieving nearly lock-free processing of database transactions by carefully controlling the data layout and the access patterns, and provides a scalable, powerful, and easy to use synchronization primitive.
Abstract: So far, transactional memory—although a promising technique—suffered from the absence of an efficient hardware implementation. Intel’s Haswell microarchitecture introduced hardware transactional memory (HTM) in mainstream CPUs. HTM allows for efficient concurrent, atomic operations, which is also highly desirable in the context of databases. On the other hand, HTM has several limitations that, in general, prevent a one-to-one mapping of database transactions to HTM transactions. In this work, we devise several building blocks that can be used to exploit HTM in main-memory databases. We show that HTM allows for achieving nearly lock-free processing of database transactions by carefully controlling the data layout and the access patterns. The HTM component is used for detecting the (infrequent) conflicts, which allows for an optimistic, and thus very low-overhead execution of concurrent transactions. We evaluate our approach on a four-core desktop and a 28-core server system and find that HTM indeed provides a scalable, powerful, and easy to use synchronization primitive.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: The cause of conflicts and contentions are identified and analyzed, and two enhancements that try to resolve conflicts early are proposed that greatly improves overall execution speed while reducing energy consumption.
Abstract: It has been proposed that Transactional Memory be added to Graphics Processing Units (GPUs) in recent years. One proposed hardware design, Warp TM, can scale to 1000s of concurrent transactions. As a programming method that can atomicize an arbitrary number of memory access locations and greatly reduce the efforts to program parallel applications, transactional memory handles the complexity of inter-thread synchronization. However, when thousands of transactions run concurrently on a GPU, conflicts and resource contentions arise, causing performance loss. In this paper, we identify and analyze the cause of conflicts and contentions and propose two enhancements that try to resolve conflicts early: (1) Early-Abort global conflict resolution that allows conflicts to be detected before they reach the Commit Units so that contention in the Commit Units is reduced and (2) Pause-and-Go execution scheme that reduces the chance of conflict and the performance penalty of re-executing long transactions. These two enhancements are enabled by a single hardware modification. Our evaluation shows the combination of the two enhancements greatly improves overall execution speed while reducing energy consumption.

Journal ArticleDOI
TL;DR: Atomic RMI extends Java RMI with distributed transactions that can run on many Java virtual machines located on different network nodes and employs SVA, a fully-pessimistic concurrency control algorithm that provides exclusive access to shared objects and supports rollback and fault tolerance.
Abstract: This paper presents Atomic RMI, a distributed transactional memory framework that supports the control flow model of execution. Atomic RMI extends Java RMI with distributed transactions that can run on many Java virtual machines located on different network nodes. Our system employs SVA, a fully-pessimistic concurrency control algorithm that provides exclusive access to shared objects and supports rollback and fault tolerance. SVA is capable of achieving a relatively high level of parallelism by interweaving transactions that access the same objects and by making transactions that do not share objects independent of one another. It also allows any operations within transactions, including irrevocable ones, like system calls, and provides an unobtrusive API. Our evaluation shows that in most cases Atomic RMI performs better than fine grained mutual-exclusion and read/write locking mechanisms. Atomic RMI also performs better than an optimistic transactional memory in environments with high contention and a high ratio of write operations, while being competitive otherwise.

Book ChapterDOI
19 Sep 2016
TL;DR: This paper demonstrates that the Transactional Synchronization Extensions (TSX) recently introduced by Intel in the x86-64 instruction set can be used to support CFI.
Abstract: Control Flow Integrity (CFI) is a promising defense technique against code-reuse attacks. While proposals to use hardware features to support CFI already exist, there is still a growing demand for an architectural CFI support on commodity hardware. To tackle this problem, in this paper we demonstrate that the Transactional Synchronization Extensions (TSX) recently introduced by Intel in the x86-64 instruction set can be used to support CFI.

Proceedings ArticleDOI
23 May 2016
TL;DR: This paper shows that performance issues well-known to loop parallelism are exacerbated in the presence of HTM, and that capacity aborts can increase when one tries to overcome them, and reveals that, although modern HTM extensions can provide support for TLS, they are not powerful enough to fully implement TLS.
Abstract: This paper presents a detailed analysis of the application of Hardware Transactional Memory (HTM) support for loop parallelization with Thread-Level Speculation (TLS). As a result it provides three contributions: (a) it shows that performance issues well-known to loop parallelism (e.g. false sharing) are exacerbated in the presence of HTM, and that capacity aborts can increase when one tries to overcome them, (b) it reveals that, although modern HTM extensions can provide support for TLS, they are not powerful enough to fully implement TLS, (c) it shows that simple code transformations, such as judicious strip mining and privatization techniques, can overcome such shortcomings, delivering speed-ups for programs that contain loop-carried dependencies. Experimental results reveal that, when these code transformations are used, speed-ups of up to 30% can be achieved for some loops for which previous research had reported slowdowns.
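The strip-mining transformation mentioned above can be illustrated structurally: the iteration space is split into fixed-size strips so that, under TLS on HTM, each strip's speculative read/write set stays within hardware capacity. The strip size of 64 below is illustrative, not taken from the paper, and the sketch runs the strips sequentially rather than speculatively.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kStrip = 64;  // illustrative strip size

long sum_strip_mined(const std::vector<int>& a) {
    long total = 0;
    // Outer loop walks strip by strip; under TLS each strip would be
    // dispatched as one hardware transaction, keeping its footprint
    // below the HTM capacity limit.
    for (std::size_t base = 0; base < a.size(); base += kStrip) {
        std::size_t end = std::min(base + kStrip, a.size());
        for (std::size_t i = base; i < end; ++i)
            total += a[i];
    }
    return total;
}
```

Privatization, the other transformation the paper names, would additionally give each strip a private copy of variables that cause false sharing, merging them after commit.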

Proceedings ArticleDOI
19 Oct 2016
TL;DR: It is demonstrated that HTM offers significant acceleration of both closed and open nested transactions, while yielding parallel scaling up to the limits of the hardware, whereupon scaling in software continues but with the penalty to throughput imposed by software mechanisms.
Abstract: Transactional memory (TM) has long been advocated as a promising pathway to more automated concurrency control for scaling concurrent programs running on parallel hardware. Software TM (STM) has the benefit of being able to run general transactional programs, but at the significant cost of overheads imposed to log memory accesses, mediate access conflicts, and maintain other transaction metadata. Recently, hardware manufacturers have begun to offer commodity hardware TM (HTM) support in their processors wherein the transaction metadata is maintained "for free" in hardware. However, HTM approaches are only best-effort: they cannot successfully run all transactional programs, whether because of hardware capacity issues (causing large transactions to fail), or compatibility restrictions on the processor instructions permitted within hardware transactions (causing transactions that execute those instructions to fail). In such cases, programs must include failure-handling code to attempt the computation by some other software means, since retrying the transaction would be futile. Thus, a canonical use of HTM is lock elision: replacing lock regions with transactions, retrying some number of times in the case of conflicts, but falling back to locking when HTM fails for other reasons. Here, we describe how software and hardware schemes can combine seamlessly into a hybrid system in support of transactional programs, allowing use of low-cost HTM when it works, but reverting to STM when it doesn't. We describe heuristics used to make this choice dynamically and automatically, but allowing the transition back to HTM opportunistically. Our implementation is for an extension of Java having syntax for both open and closed nested transactions, and boosting, running on the OpenJDK, with dynamic injection of STM mechanisms (into code variants used under STM) and HTM instructions (into code variants used under HTM). 
Both schemes are compatible to allow different threads to run concurrently with either mechanism, while preserving transaction safety. Using a standard synthetic benchmark we demonstrate that HTM offers significant acceleration of both closed and open nested transactions, while yielding parallel scaling up to the limits of the hardware, whereupon scaling in software continues but with the penalty to throughput imposed by software mechanisms.

Proceedings ArticleDOI
27 Feb 2016
TL;DR: In this paper, the authors propose hybrid tracking, which combines pessimistic and optimistic tracking of cross-thread dependences under an adaptive, profile-based switching policy, and is suitable for building efficient runtime support for parallel software systems that are both scalable and correct.
Abstract: It is notoriously challenging to develop parallel software systems that are both scalable and correct. Runtime support for parallelism---such as multithreaded record & replay, data race detectors, transactional memory, and enforcement of stronger memory models---helps achieve these goals, but existing commodity solutions slow programs substantially in order to track (i.e., detect or control) an execution's cross-thread dependences accurately. Prior work tracks cross-thread dependences either "pessimistically," slowing every program access, or "optimistically," allowing for lightweight instrumentation of most accesses but dramatically slowing accesses involved in cross-thread dependences. This paper seeks to hybridize pessimistic and optimistic tracking, which is challenging because there exists a fundamental mismatch between pessimistic and optimistic tracking. We address this challenge based on insights about how dependence tracking and program synchronization interact, and introduce a novel approach called hybrid tracking. Hybrid tracking is suitable for building efficient runtime support, which we demonstrate by building hybrid-tracking-based versions of a dependence recorder and a region serializability enforcer. An adaptive, profile-based policy makes runtime decisions about switching between pessimistic and optimistic tracking. Our evaluation shows that hybrid tracking enables runtime support to overcome the performance limitations of both pessimistic and optimistic tracking alone.

Proceedings ArticleDOI
15 Oct 2016
TL;DR: This work presents CommTM, an HTM that exploits semantic commutativity and extends the coherence protocol and conflict detection scheme to support user-defined commutative operations, while preserving transactional guarantees; the approach can be applied to arbitrary HTMs.
Abstract: Hardware speculative execution schemes such as hardware transactional memory (HTM) enjoy low run-time overheads but suffer from limited concurrency because they rely on reads and writes to detect conflicts. By contrast, software speculation schemes can exploit semantic knowledge of concurrent operations to reduce conflicts. In particular, they often exploit that many operations on shared data, like insertions into sets, are semantically commutative: they produce semantically equivalent results when reordered. However, software techniques often incur unacceptable run-time overheads. To solve this dichotomy, we present CommTM, an HTM that exploits semantic commutativity. CommTM extends the coherence protocol and conflict detection scheme to support user-defined commutative operations. Multiple cores can perform commutative operations to the same data concurrently and without conflicts. CommTM preserves transactional guarantees and can be applied to arbitrary HTMs. CommTM scales on many operations that serialize in conventional HTMs, like set insertions, reference counting, and top-K insertions, and retains the low overhead of HTMs. As a result, at 128 cores, CommTM outperforms a conventional eager-lazy HTM by up to 3.4x and reduces or eliminates aborts.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: This paper presents PleaseTM, a mechanism that allows more freedom in deciding which transaction to abort, while leaving the coherence protocol design unchanged, on STAMP benchmarks running at 32 threads compared to requester-wins HTM.
Abstract: With recent commercial offerings, hardware transactional memory (HTM) has finally become an important tool in writing multithreaded applications. However, current offerings are commonly implemented in a way that keeps the coherence protocol unmodified. Data conflicts are recognized by coherence messages sent by the requester to sharers of the cache block (e.g., a write to a speculatively read line), who are then aborted. This tends to abort transactions that have done more work, leading to suboptimal performance. Even worse, this can lead to live-lock situations where transactions repeatedly abort each other. In this paper, we present PleaseTM, a mechanism that allows more freedom in deciding which transaction to abort, while leaving the coherence protocol design unchanged. In PleaseTM, transactions insert plea bits into their responses to coherence requests as a simple payload, and use these bits to inform conflict management decisions. Coherence permission changes are then achieved with normal coherence requests. Our experiments show that this additional freedom can provide on average 43% speedup, with a maximum of 7-fold speedup, on STAMP benchmarks running at 32 threads compared to requester-wins HTM.

Patent
20 Jun 2016
TL;DR: Prevention of a prefetch memory operation from causing a transaction to abort is described: a local processor receives a prefetch request from a remote processor and determines whether the prefetch request conflicts with a transaction of the local processor.
Abstract: Prevention of a prefetch memory operation from causing a transaction to abort. A local processor receives a prefetch request from a remote processor. A processor determines whether the prefetch request conflicts with a transaction of the local processor. A processor responds to at least one of (i) a determination that the local processor has no transaction, and (ii) a determination that the prefetch request does not conflict with a transaction, by providing the requested prefetch data. A processor responds to a determination that the prefetch request conflicts with a transaction by suppressing processing of the prefetch request.

Proceedings ArticleDOI
23 May 2016
TL;DR: An adaptive transaction scheduling policy relying on a Markov Chain-based model of STM systems, which schedules transactions depending on throughput predictions by the model as a function of the current system state and is periodically re-instantiated at run-time to adapt it to dynamic variations of the workload.
Abstract: Software Transactional Memory (STM) may suffer from performance degradation due to excessive conflicts among concurrent transactions. An approach to cope with this issue consists in putting in place smart scheduling policies which temporarily suspend the execution of some transactions in order to reduce the actual conflict rate. In this paper, we present an adaptive transaction scheduling policy relying on a Markov Chain-based model of STM systems. The policy is adaptive in a twofold sense: (i) it schedules transactions depending on throughput predictions by the model as a function of the current system state, and (ii) its underlying Markov Chain-based model is periodically re-instantiated at run-time to adapt it to dynamic variations of the workload. We also present an implementation of our adaptive transaction scheduler which has been integrated within the open source TinySTM package. The accuracy of our performance model in predicting the system throughput and the advantages of the adaptive scheduling policy over state-of-the-art approaches have been assessed via an experimental study based on the STAMP benchmark suite.

Proceedings ArticleDOI
11 Sep 2016
TL;DR: This paper describes how EXCITE-VM implements snapshot isolation transactions efficiently by manipulating virtual memory mappings and using a novel copy-on-read mechanism with a customized page cache.
Abstract: Multi-core programming remains a major software development and maintenance challenge because of data races, deadlock, non-deterministic failures and complex performance issues. In this paper, we describe EXCITE-VM, a system that provides snapshot isolation transactions on shared memory to facilitate programming and to improve the performance of parallel applications. With snapshots, an application thread is not exposed to the committed changes of other threads until it receives the updates by explicitly creating a new snapshot. Snapshot isolation enables low-overhead lockless read operations and improves fault tolerance by isolating each thread from the transient, uncommitted writes of other threads. This paper describes how EXCITE-VM implements snapshot isolation transactions efficiently by manipulating virtual memory mappings and using a novel copy-on-read mechanism with a customized page cache. Compared to conventional software transactional memory systems, EXCITE-VM provides up to 2.2× performance improvement for the STAMP benchmark suite and up to 1000× speedup for a modified benchmark having long-running read-only transactions. Furthermore, EXCITE-VM achieves a 2× performance improvement on a Memcached benchmark and the Yahoo! Cloud Serving Benchmark. Finally, EXCITE-VM improves fault tolerance and offers features such as low-overhead concurrent audit and analysis.
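The copy-on-read snapshot semantics can be modeled at page granularity. This is a conceptual sketch under our own assumptions, not EXCITE-VM's virtual-memory implementation: each thread reads through a private snapshot that lazily copies in the newest page version visible at snapshot-creation time, so concurrent commits stay invisible until a new snapshot is taken.

```python
# Conceptual model (not EXCITE-VM's VM-mapping mechanism): page-granular
# versioned store with copy-on-read snapshots.

class SnapshotStore:
    def __init__(self):
        self.history = {}            # page -> [(version, value), ...]
        self.version = 0

    def snapshot(self):
        return Snapshot(self)

class Snapshot:
    def __init__(self, store):
        self.store = store
        self.base = store.version    # versions after this stay invisible
        self.pages = {}              # copy-on-read page cache
        self.writes = {}

    def read(self, page):
        if page in self.writes:
            return self.writes[page]
        if page not in self.pages:
            # copy-on-read: materialise, once, the newest version of the
            # page that was committed at or before snapshot creation
            value = None
            for version, val in self.store.history.get(page, []):
                if version <= self.base:
                    value = val
            self.pages[page] = value
        return self.pages[page]

    def write(self, page, value):
        self.writes[page] = value

    def commit(self):
        self.store.version += 1
        for page, value in self.writes.items():
            self.store.history.setdefault(page, []).append(
                (self.store.version, value))
```

A reader holding an old snapshot keeps seeing the old page value even after a concurrent commit, which is exactly the lockless-read property the abstract highlights.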

Journal ArticleDOI
TL;DR: Pot as mentioned in this paper leverages the concept of preordered transactions to achieve deterministic multithreaded execution of programs that use transactional memory and uses a concurrency control protocol to distinguish between fast and speculative transaction execution modes in order to mitigate the overhead of imposing a deterministic order.
Abstract: This article presents Pot, a system that leverages the concept of preordered transactions to achieve deterministic multithreaded execution of programs that use Transactional Memory. Preordered transactions eliminate the root cause of nondeterminism in transactional execution: they provide the illusion of executing in a deterministic serial order, unlike traditional transactions that appear to execute in a nondeterministic order that can change from execution to execution. Pot uses a new concurrency control protocol that exploits the serialization order to distinguish between fast and speculative transaction execution modes in order to mitigate the overhead of imposing a deterministic order. We build two Pot prototypes: one using STM and another using off-the-shelf HTM. To the best of our knowledge, Pot enables deterministic execution of programs using off-the-shelf HTM for the first time. An experimental evaluation shows that Pot achieves deterministic execution of TM programs with low overhead, sometimes even outperforming nondeterministic executions, and clearly outperforming the state of the art.
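The core of the preordered-transactions idea is a deterministic commit order. The following is our own minimal illustration, not Pot's concurrency control protocol: every transaction is assigned a sequence number up front, may execute speculatively in any interleaving, but commits only when its turn arrives, so the apparent serial order is identical in every run.

```python
# Minimal sketch (our illustration, not Pot's protocol): transactions commit
# strictly in their preassigned sequence-number order.

import threading

class PreorderedCommitter:
    def __init__(self):
        self.turn = 0
        self.cond = threading.Condition()

    def commit_in_order(self, seqno, apply_writes):
        with self.cond:
            while self.turn != seqno:    # wait for our deterministic slot
                self.cond.wait()
            apply_writes()               # publish speculative state
            self.turn += 1
            self.cond.notify_all()
```

Even if threads reach their commit point in a different order on every run, the published order is always 0, 1, 2, ..., which is the "illusion of executing in a deterministic serial order" the article describes.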

Proceedings ArticleDOI
17 Jul 2016
TL;DR: A feedback control loop is designed to automate thread management at runtime and reduce program execution time, building on Software Transactional Memory (STM), which has emerged as a promising technique to address synchronization issues through transactions.
Abstract: Parallel programs need to manage the trade-off between the time spent in synchronization and computation. This trade-off is significantly affected by the number of active threads: high parallelism may decrease computing time while increasing synchronization cost. Furthermore, thread locality on different cores may also impact program performance, as memory access time can vary from one core to another due to the complexity of the underlying memory architecture. The performance of a program can therefore be improved by adjusting both the number of active threads and the mapping of its threads to physical cores. However, there is no universal offline rule for choosing the parallelism and thread locality of a program, and offline tuning is error-prone. In this paper, we manage parallelism and thread localities dynamically. We address multi-threading synchronization issues via Software Transactional Memory (STM), which has emerged as a promising technique that bypasses locks by handling synchronization through transactions. Autonomic computing offers designers a framework of methods and techniques to build autonomic systems with well-mastered behaviours. Its key idea is to implement feedback control loops to design safe, efficient and predictable controllers, which enable monitoring and adjusting controlled systems dynamically while keeping overhead low. We propose a feedback control loop design to automate thread management at runtime and reduce program execution time.
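A feedback control loop of this kind can be sketched as a simple hill-climbing controller. This is a hedged simplification under our own assumptions, not the paper's controller design: each control interval, the measured throughput (e.g., committed transactions per interval) decides whether the last change to the thread count helped, and the controller keeps moving in the profitable direction.

```python
# Our own simplified hill-climbing sketch of a thread-count control loop;
# the real controller in the paper is designed with control-theoretic tools.

class ThreadController:
    def __init__(self, min_threads=1, max_threads=8):
        self.min_threads = min_threads
        self.max_threads = max_threads
        self.threads = min_threads
        self.last_throughput = 0.0
        self.direction = +1

    def update(self, throughput):
        """One control-loop iteration: returns the new thread count."""
        if throughput < self.last_throughput:
            self.direction = -self.direction   # last move hurt: reverse
        self.last_throughput = throughput
        self.threads = max(self.min_threads,
                           min(self.max_threads,
                               self.threads + self.direction))
        return self.threads
```

The monitor-decide-act cycle here is the essence of the autonomic-computing loop the abstract invokes: measure the controlled system, compare with the previous measurement, and adjust the actuator (active thread count) while keeping per-interval overhead negligible.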

Patent
23 Feb 2016
TL;DR: In this paper, a computer system includes transactional memory to implement a nested transaction, and the computer system generates a plurality of speculative identification numbers (IDs), identifies at least one of a software thread executed by a hardware processor and a memory operation performed in accordance with an application code.
Abstract: A computer system includes transactional memory to implement a nested transaction. The computer system generates a plurality of speculative identification numbers (IDs), identifies at least one of a software thread executed by a hardware processor and a memory operation performed in accordance with an application code. The computer system assigns at least one speculative cache version to a requested transaction based on a corresponding software thread. The speculative ID of the corresponding software thread identifies the speculative cache version. The computer system also identifies a nested transaction in the memory unit, assigns a cache version to the nested transaction, detects a conflict with the nested transaction, determines a conflicted nesting level of the nested transaction, and determines a cache version corresponding to the conflicted nesting level. The computer system also invalidates the cache version corresponding to the conflicted nesting level.
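The version-per-nesting-level bookkeeping described in the claim can be modeled as a stack. This is an illustrative software model, not the patent's hardware design: each nesting level gets its own speculative cache version, and a conflict detected at level k invalidates that version together with everything nested inside it.

```python
# Illustrative model (not the patented hardware): one speculative write
# buffer per nesting level; a conflict invalidates its level and deeper.

class NestedTxCache:
    def __init__(self):
        self.versions = []           # stack: one cache version per level

    def begin(self):
        self.versions.append({})     # new speculative cache version

    def write(self, addr, value):
        self.versions[-1][addr] = value

    def read(self, addr, memory):
        # innermost version wins; fall through to committed memory
        for level in reversed(self.versions):
            if addr in level:
                return level[addr]
        return memory.get(addr)

    def conflict_at(self, level):
        """Invalidate the cache versions at the conflicted level and deeper."""
        del self.versions[level:]

    def depth(self):
        return len(self.versions)
```

After a conflict at the inner level, the outer transaction's speculative state survives, which is the selective-invalidation behavior the patent claims.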

Proceedings ArticleDOI
01 Nov 2016
TL;DR: The rationale and design of a new parallel execution model for RPython is described that allows the generation of parallel virtual machines while leaving the language semantics unchanged and improves the runtime of a set of multi-threaded Python programs over PyPy with a GIL.
Abstract: The RPython framework takes an interpreter for a dynamic language as its input and produces a Virtual Machine (VM) for that language. RPython is being used to develop PyPy, a high-performance Python interpreter. However, the produced VM does not support parallel execution since the framework relies on a Global Interpreter Lock (GIL): PyPy serialises the execution of multi-threaded Python programs. We describe the rationale and design of a new parallel execution model for RPython that allows the generation of parallel virtual machines while leaving the language semantics unchanged. This model then allows different implementations of concurrency control, and we discuss an implementation based on a GIL and an implementation based on Software Transactional Memory (STM). To evaluate the benefits of either choice, we adapt PyPy to work with both implementations (GIL and STM). The evaluation shows that PyPy with STM improves the runtime of a set of multi-threaded Python programs over PyPy with a GIL by factors in the range of 1.87x up to 5.96x when executing on a processor with 8 cores.
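The interchangeable-backend idea can be sketched behind a single "run one quantum of interpretation atomically" interface. This is a heavily simplified toy of our own, not RPython's execution model: the GIL backend serializes quanta with one lock, while the STM-style backend runs them optimistically and retries on conflict (the quantum function must be re-executable, and real STM also buffers and rolls back writes, which this toy omits).

```python
# Toy sketch (our own, greatly simplified): two concurrency-control
# backends behind the same atomic-quantum interface.

import threading

class GILBackend:
    def __init__(self):
        self.gil = threading.Lock()

    def run_quantum(self, fn):
        with self.gil:               # one quantum at a time, system-wide
            return fn()

class ToySTMBackend:
    def __init__(self):
        self.version = 0
        self.lock = threading.Lock()  # commit lock, not held during fn()

    def run_quantum(self, fn):
        while True:
            start = self.version
            result = fn()             # speculative execution
            with self.lock:
                if self.version == start:   # nobody committed meanwhile
                    self.version += 1
                    return result
            # conflict: another quantum committed first, re-execute fn
```

Because both backends expose the same interface, the interpreter's semantics stay unchanged while the STM backend lets non-conflicting quanta overlap, which is where the paper's multi-threaded speedups come from.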

Proceedings ArticleDOI
11 Jul 2016
TL;DR: RUBIC is proposed, a novel parallelism tuning method for TM applications in both single- and multi-process scenarios that overcomes the shortcomings of previously proposed solutions and achieves unprecedented system-wide fairness and efficiency.
Abstract: With the advent of Chip-Multiprocessors, Transactional Memory (TM) emerged as a powerful paradigm to simplify parallel programming. Unfortunately, as more cores become available in commodity systems, the scalability limits of a wide class of TM applications become more evident. Hence, online parallelism tuning techniques were proposed to adapt the optimal number of threads of TM applications. However, state-of-the-art solutions are exclusively tailored to single-process systems with relatively static workloads, exhibiting pathological behaviors in scenarios where multiple multi-threaded TM processes contend for the shared hardware resources. This paper proposes RUBIC, a novel parallelism tuning method for TM applications in both single- and multi-process scenarios that overcomes the shortcomings of previously proposed solutions. RUBIC helps the co-running processes adapt their parallelism level so that they can efficiently space-share the hardware. When compared to previous online parallelism tuning solutions, RUBIC achieves unprecedented system-wide fairness and efficiency, both in single- and multi-process scenarios. Our evaluation with different workloads and scenarios shows that, on average, RUBIC enhances the overall performance by 26% with respect to the best-performing state-of-the-art online parallelism tuning techniques in multi-process scenarios, while incurring negligible overhead in single-process cases. RUBIC also exhibits unique features in converging to a fair and efficient state.
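The space-sharing intuition can be made concrete with a tiny fair-share calculation. This is our own simplified illustration, not RUBIC's actual algorithm: each co-running process caps its thread count at an approximately equal share of the machine's cores instead of greedily expanding and thrashing the shared hardware.

```python
# Our own toy illustration of cooperative space-sharing (not RUBIC itself):
# split the cores fairly across co-running processes.

def fair_thread_cap(total_cores, num_processes, process_index):
    """Split cores across processes; early processes absorb the remainder."""
    base, remainder = divmod(total_cores, num_processes)
    return base + (1 if process_index < remainder else 0)
```

On 8 cores with 3 TM processes the caps come out as 3, 3 and 2, so the processes together never oversubscribe the machine; a real tuner like RUBIC would then adapt each process's parallelism within (and around) such a share based on observed performance.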

Journal ArticleDOI
01 May 2016
TL;DR: This detailed performance study provides insights on the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was critical to achieving performance.
Abstract: We evaluated the strengths and weaknesses of Intel's hardware extensions for HTM (TSX). We described features that are likely to yield performance gains when using TSX. We explored, with the aid of a new tool called htm-pBuilder, the performance of TSX. We introduced an efficient policy for guaranteeing forward progress on top of TSX. We explored various fallback policy tunings and transaction properties of TSX. This paper presents an extensive performance study of the implementation of Hardware Transactional Memory (HTM) in the Haswell generation of Intel x86 core processors. It evaluates the strengths and weaknesses of this new architecture by exploring several dimensions in the space of Transactional Memory (TM) application characteristics using the Eigenbench (Hong et al., 2010) and CLOMP-TM (Schindewolf et al., 2012) benchmarks. This paper also introduces a new tool, called htm-pBuilder, that tailors fallback policies and allows independent exploration of its parameters. This detailed performance study provides insights on the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was critical to achieving performance. The evaluation also shows that there are a number of potential improvements for designers of TM applications and software systems that use Intel's TM, and provides recommendations to extract maximum benefit from the current TM support available in Haswell.
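Because Intel TSX is best-effort (any transaction may abort, even without contention), forward progress requires a software fallback. The following is our own simplified rendition of the classic retry-then-lock pattern such policies build on, not the paper's htm-pBuilder policy; try_htm here is a stand-in parameter modeling a hardware transaction attempt.

```python
# Simplified sketch of the standard HTM fallback pattern (our rendition):
# retry the hardware transaction a bounded number of times, then fall
# back to a global lock that guarantees forward progress.

import threading

fallback_lock = threading.Lock()

def run_with_fallback(body, try_htm, max_retries=5):
    """try_htm(body) models an HTM attempt: True on commit, False on abort."""
    for _ in range(max_retries):
        if not fallback_lock.locked():   # don't race a fallback holder
            if try_htm(body):
                return
    with fallback_lock:                  # guaranteed forward progress
        body()
```

In a real TSX implementation the hardware transaction would also read the fallback lock inside the transaction (lock elision), so that a concurrent fallback execution aborts in-flight transactions; tuning max_retries and the abort handling is precisely the parameter space the paper explores.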