
Showing papers on "Transactional memory published in 2016"


Proceedings ArticleDOI
14 Jun 2016
TL;DR: A novel hybrid SCM-DRAM persistent and concurrent B-Tree, named Fingerprinting Persistent Tree (FPTree), that achieves performance similar to DRAM-based counterparts is proposed, together with a hybrid concurrency scheme for the FPTree that is partially based on Hardware Transactional Memory.
Abstract: The advent of Storage Class Memory (SCM) is driving a rethink of storage systems towards a single-level architecture where memory and storage are merged. In this context, several works have investigated how to design persistent trees in SCM as a fundamental building block for these novel systems. However, these trees are significantly slower than DRAM-based counterparts since trees are latency-sensitive and SCM exhibits higher latencies than DRAM. In this paper we propose a novel hybrid SCM-DRAM persistent and concurrent B-Tree, named Fingerprinting Persistent Tree (FPTree) that achieves similar performance to DRAM-based counterparts. In this novel design, leaf nodes are persisted in SCM while inner nodes are placed in DRAM and rebuilt upon recovery. The FPTree uses Fingerprinting, a technique that limits the expected number of in-leaf probed keys to one. In addition, we propose a hybrid concurrency scheme for the FPTree that is partially based on Hardware Transactional Memory. We conduct a thorough performance evaluation and show that the FPTree outperforms state-of-the-art persistent trees with different SCM latencies by up to a factor of 8.2. Moreover, we show that the FPTree scales very well on a machine with 88 logical cores. Finally, we integrate the evaluated trees in memcached and a prototype database. We show that the FPTree incurs an almost negligible performance overhead over using fully transient data structures, while significantly outperforming other persistent trees.
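The fingerprinting idea above can be sketched as follows: each leaf stores a 1-byte hash per key, so a lookup rejects almost every slot with a cheap byte comparison and, in expectation, performs only one full key comparison. This is a minimal sketch; the leaf layout, slot count, and hash function are illustrative assumptions, not the paper's exact design.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical leaf layout: 1-byte fingerprints of the keys are scanned
// first, so on average only one full key comparison is needed per probe.
struct Leaf {
    static constexpr std::size_t kSlots = 8;
    std::array<uint8_t, kSlots> fingerprints{};
    std::array<uint64_t, kSlots> keys{};
    std::array<bool, kSlots> used{};
};

uint8_t fingerprint(uint64_t key) {
    // Illustrative 1-byte hash (Fibonacci hashing); the paper's exact
    // hash function is not reproduced here.
    return static_cast<uint8_t>((key * 0x9E3779B97F4A7C15ULL) >> 56);
}

bool leaf_insert(Leaf& leaf, uint64_t key) {
    for (std::size_t i = 0; i < Leaf::kSlots; ++i) {
        if (!leaf.used[i]) {
            leaf.fingerprints[i] = fingerprint(key);
            leaf.keys[i] = key;
            leaf.used[i] = true;
            return true;
        }
    }
    return false;  // leaf full; a real tree would split here
}

bool leaf_contains(const Leaf& leaf, uint64_t key) {
    uint8_t fp = fingerprint(key);
    for (std::size_t i = 0; i < Leaf::kSlots; ++i) {
        // Most slots are rejected by the cheap fingerprint comparison; the
        // full key comparison runs only when the fingerprint matches.
        if (leaf.used[i] && leaf.fingerprints[i] == fp && leaf.keys[i] == key)
            return true;
    }
    return false;
}
```

Because mismatching fingerprints filter out non-matching slots before any key is read, the expected number of in-leaf key probes approaches one, which is exactly what makes SCM-resident leaves tolerable despite SCM's higher latency.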

281 citations


Proceedings ArticleDOI
02 Jun 2016
TL;DR: This work designs and implements a library supporting transactions on any number of maps, sets, and queues; stand-alone data structure operations are treated as first-class citizens and execute with virtually no overhead, at the speed of the original data structure library.
Abstract: We introduce transactions into libraries of concurrent data structures; such transactions can be used to ensure atomicity of sequences of data structure operations. By focusing on transactional access to a well-defined set of data structure operations, we strike a balance between the ease-of-programming of transactions and the efficiency of custom-tailored data structures. We exemplify this concept by designing and implementing a library supporting transactions on any number of maps, sets (implemented as skiplists), and queues. Our library offers efficient and scalable transactions, which are an order of magnitude faster than state-of-the-art transactional memory toolkits. Moreover, our approach treats stand-alone data structure operations (like put and enqueue) as first class citizens, and allows them to execute with virtually no overhead, at the speed of the original data structure library.

48 citations


Proceedings ArticleDOI
11 Jul 2016
TL;DR: This work's approach leverages the semantic knowledge of the data structure to eliminate the overhead of false conflicts and rollbacks in high-performance lock-free transactional linked data structures without revamping the data structures' original synchronization design.
Abstract: Non-blocking data structures allow scalable and thread-safe accesses to shared data. They provide individual operations that appear to execute atomically. However, it is often desirable to execute multiple operations atomically in a transactional manner. Previous solutions, such as software transactional memory (STM) and transactional boosting, manage transaction synchronization in an external layer separated from the data structure's own thread-level concurrency control. Although this reduces programming effort, it leads to overhead associated with additional synchronization and the need to roll back aborted transactions. In this work, we present a new methodology for transforming high-performance lock-free linked data structures into high-performance lock-free transactional linked data structures without revamping the data structures' original synchronization design. Our approach leverages the semantic knowledge of the data structure to eliminate the overhead of false conflicts and rollbacks. We encapsulate all operations, operands, and transaction status in a transaction descriptor, which is shared among the nodes accessed by the same transaction. We coordinate threads to help finish the remaining operations of delayed transactions based on their transaction descriptors. When a transaction fails, we recover the correct abstract state by reversely interpreting the logical status of a node. In our experimental evaluation using transactions with randomly generated operations, our lock-free transactional lists and skiplist outperform the transactional boosted ones by 40% on average and as much as 125% for large transactions. They also outperform the alternative STM-based approaches by a factor of 3 to 10 across all scenarios. More importantly, we achieve 4 to 6 orders of magnitude less spurious aborts than the alternatives.
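A plausible shape for the transaction descriptor described above is a single shared object holding the operation list and an atomic status word, so that any thread encountering a delayed transaction can help complete it, and threads race to finalize the outcome exactly once. The field names and operation set below are illustrative assumptions, not the paper's definitions.

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical transaction descriptor: operations, operands, and a shared
// status word live in one object that is reachable from every node the
// transaction touches, enabling cooperative helping.
enum class TxStatus : uint8_t { Active, Committed, Aborted };
enum class OpType : uint8_t { Insert, Delete, Find };

struct Operation {
    OpType type;
    int64_t key;
};

struct TxDescriptor {
    std::atomic<TxStatus> status{TxStatus::Active};
    std::vector<Operation> ops;  // immutable once the descriptor is published

    // Helping threads and the owner race to finalize the transaction;
    // the CAS guarantees exactly one of them succeeds.
    bool try_finalize(TxStatus outcome) {
        TxStatus expected = TxStatus::Active;
        return status.compare_exchange_strong(expected, outcome);
    }
};
```

Because the status transition is a single CAS on the descriptor, a node that observes `Aborted` can reinterpret its logical state without any per-node rollback log, which is the source of the overhead savings the abstract claims.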

38 citations


Proceedings ArticleDOI
25 Mar 2016
TL;DR: The proposed system, ProteusTM, hides behind the TM interface a large library of implementations and leverages a novel multi-dimensional online optimization scheme, combining two popular learning techniques: Collaborative Filtering and Bayesian Optimization.
Abstract: The Transactional Memory (TM) paradigm promises to greatly simplify the development of concurrent applications. This led, over the years, to the creation of a plethora of TM implementations delivering wide ranges of performance across workloads. Yet, no universal implementation fits each and every workload. In fact, the best TM for a given workload can prove disastrous for another. This forces developers to face the complex task of tuning TM implementations, which significantly hampers their wide adoption. In this paper, we address the challenge of automatically identifying the best TM implementation for a given workload. Our proposed system, ProteusTM, hides behind the TM interface a large library of implementations. Underneath, it leverages a novel multi-dimensional online optimization scheme, combining two popular learning techniques: Collaborative Filtering and Bayesian Optimization. We integrated ProteusTM in GCC and demonstrate its ability to switch between TMs and adapt several configuration parameters (e.g., number of threads). We extensively evaluated ProteusTM, obtaining average performance

32 citations


Proceedings ArticleDOI
Andrea Cerone1, Alexey Gotsman1
25 Jul 2016
TL;DR: An alternative specification to SI is given that characterises it in terms of the transactional dependency graphs of Adya et al., generalising serialization graphs, without requiring additional information about transactions' start and commit points to be added to the graphs.
Abstract: Snapshot isolation (SI) is a widely used consistency model for transaction processing, implemented by most major databases and some transactional memory systems. Unfortunately, its classical definition is given in a low-level operational way, by an idealised concurrency-control algorithm, and this complicates reasoning about the behaviour of applications running under SI. We give an alternative specification to SI that characterises it in terms of the transactional dependency graphs of Adya et al., generalising serialization graphs. Unlike previous work, our characterisation does not require adding additional information to dependency graphs about start and commit points of transactions. We then exploit our specification to obtain two kinds of static analyses. The first one checks when a set of transactions running under SI can be chopped into smaller pieces without introducing new behaviours, to improve performance. The other analysis checks whether a set of transactions running under a weakening of SI behaves the same as when running under SI.

32 citations


Journal ArticleDOI
01 Nov 2016
TL;DR: PHyTM allows hardware assisted ACID transactions to execute concurrently with pure software transactions, which allows applications to gain the benefit of persistent HTM while simultaneously accommodating unbounded transactions (with a high degree of concurrency).
Abstract: Processors with hardware support for transactional memory (HTM) are rapidly becoming commonplace, and processor manufacturers are currently working on implementing support for upcoming non-volatile memory (NVM) technologies. The combination of HTM and NVM promises to be a natural choice for in-memory database synchronization. However, limitations on the size of hardware transactions and the lack of progress guarantees by modern HTM implementations prevent some applications from obtaining the full benefit of hardware transactional memory. In this paper, we propose a persistent hybrid TM algorithm called PHyTM for systems that support NVM and HTM. PHyTM allows hardware assisted ACID transactions to execute concurrently with pure software transactions, which allows applications to gain the benefit of persistent HTM while simultaneously accommodating unbounded transactions (with a high degree of concurrency). Experimental simulations demonstrate that PHyTM is fast and scalable for realistic workloads.

31 citations


Proceedings ArticleDOI
25 Mar 2016
TL;DR: TxRace is a new software data race detector that leverages commodity hardware transactional memory (HTM) to speed up data race detection, reducing the average runtime overhead of dynamic data race detection from 11.68x to 4.65x with only a small number of false negatives.
Abstract: Detecting data races is important for debugging shared-memory multithreaded programs, but the high runtime overhead prevents the wide use of dynamic data race detectors. This paper presents TxRace, a new software data race detector that leverages commodity hardware transactional memory (HTM) to speed up data race detection. TxRace instruments a multithreaded program to transform synchronization-free regions into transactions, and exploits the conflict detection mechanism of HTM for lightweight data race detection at runtime. However, the limitations of the current best-effort commodity HTMs expose several challenges in using them for data race detection: (1) lack of ability to pinpoint racy instructions, (2) false positives caused by cache line granularity of conflict detection, and (3) transactional aborts for non-conflict reasons (e.g., capacity or unknown). To overcome these challenges, TxRace performs lightweight HTM-based data race detection at first, and occasionally switches to slow yet precise data race detection only for the small fraction of execution intervals in which potential races are reported by HTM. According to the experimental results, TxRace reduces the average runtime overhead of dynamic data race detection from 11.68x to 4.65x with only a small number of false negatives.

28 citations


Patent
16 Dec 2016
TL;DR: Systems, apparatuses, and methods for improving TM throughput using a TM region indicator (or color) are described, allowing younger TM regions to have their instructions retired while waiting for older regions to commit.
Abstract: Systems, apparatuses, and methods for improving TM throughput using a TM region indicator (or color) are described. Through the use of TM region indicators younger TM regions can have their instructions retired while waiting for older TM regions to commit.

27 citations


Journal ArticleDOI
TL;DR: This work attempts to solve the problem of finding optimal partitioning schemes by automating the partitioning process, choosing the correct transactional primitive, and routing transactions appropriately.
Abstract: Modern transactional processing systems need to be fast and scalable, but this means many such systems settled for weak consistency models. It is however possible to achieve all of strong consistency, high scalability and high performance, by using fine-grained partitions and light-weight concurrency control that avoids superfluous synchronization and other overheads such as lock management. Independent transactions are one such mechanism, relying on good partitions and appropriately defined transactions. On the downside, it is not usually straightforward to determine optimal partitioning schemes, especially when dealing with non-trivial amounts of data. Our work attempts to solve this problem by automating the partitioning process, choosing the correct transactional primitive, and routing transactions appropriately.

26 citations


Proceedings ArticleDOI
11 Jul 2016
TL;DR: It is shown that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance, and a simple adaptive technique is presented that overcomes this problem by throttling threads as necessary to optimize system performance.
Abstract: The introduction of hardware transactional memory (HTM) into commercial processors opens a door for designing and implementing scalable synchronization mechanisms. One example for such a mechanism is transactional lock elision (TLE), where lock-based critical sections are executed concurrently using hardware transactions. So far, the effectiveness of TLE and other HTM-based mechanisms has been assessed mostly on small, single-socket machines. This paper investigates the behavior of hardware transactions on a large two-socket machine. Using TLE as an example, we show that a system can scale as long as all threads run on the same socket, but a single thread running on a different socket can wreck performance. We identify the reason for this phenomenon, and present a simple adaptive technique that overcomes this problem by throttling threads as necessary to optimize system performance. Using extensive evaluation of multiple microbenchmarks and real applications, we demonstrate that our technique achieves the full performance of the system for workloads that scale across sockets, and avoids the performance degradation that cripples TLE for workloads that do not.
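The TLE pattern that this paper (and the following one) builds on has a simple control-flow skeleton: attempt the critical section a bounded number of times as a hardware transaction, then fall back to the lock for guaranteed progress. In the sketch below, `htm_begin`/`htm_end` are stand-ins for real HTM primitives (e.g. Intel TSX `_xbegin`/`_xend`); the stub always reports an abort so the sketch runs on any machine, exercising the fallback path.

```cpp
#include <mutex>

// Stand-ins for hardware transaction primitives. The stub pretends every
// speculative attempt aborts, which forces the lock fallback below.
bool htm_begin() { return false; }
void htm_end() {}

constexpr int kMaxAttempts = 3;  // illustrative retry budget
std::mutex fallback_lock;

template <typename CriticalSection>
void tle_execute(CriticalSection cs) {
    for (int attempt = 0; attempt < kMaxAttempts; ++attempt) {
        if (htm_begin()) {
            // Real TLE would also read the lock word here, so that a
            // concurrent lock holder aborts this transaction.
            cs();
            htm_end();
            return;
        }
    }
    // All speculative attempts failed: acquire the lock for progress.
    std::lock_guard<std::mutex> guard(fallback_lock);
    cs();
}
```

The throttling technique in the paper operates on top of this skeleton: rather than letting cross-socket threads burn retry attempts that almost always abort, it restricts which threads are allowed to speculate at a given time.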

26 citations


Proceedings ArticleDOI
27 Feb 2016
TL;DR: Two algorithms are presented that rely on existing compiler support for transactional programs and allow threads to speculate concurrently on HTM along with a thread holding the lock, and demonstrate the benefit of the algorithms over TLE and other related approaches with an in-depth analysis of a number of benchmarks.
Abstract: Transactional lock elision (TLE) is a well-known technique that exploits hardware transactional memory (HTM) to introduce concurrency into lock-based software. It achieves that by attempting to execute a critical section protected by a lock in an atomic hardware transaction, reverting to the lock if these attempts fail. One significant drawback of TLE is that it disables hardware speculation once there is a thread running under lock. In this paper we present two algorithms that rely on existing compiler support for transactional programs and allow threads to speculate concurrently on HTM along with a thread holding the lock. We demonstrate the benefit of our algorithms over TLE and other related approaches with an in-depth analysis of a number of benchmarks and a wide range of workloads, including an AVL tree-based micro-benchmark and ccTSA, a real sequence assembler application.

Proceedings Article
01 Jan 2016
TL;DR: Persistent hybrid TM is studied, which allows hardware assisted ACID transactions to execute concurrently with pure software transactions to gain the benefit of persistent HTM while accommodating unbounded transactions with a high degree of concurrency.
Abstract: The availability of hardware transactional memory (HTM) and the feasibility of persistent hardware transactions make them a natural choice for in-memory database synchronization. However, limitations on the size of hardware transactions and the lack of progress guarantees by modern HTM implementations prevent some applications from obtaining the benefit of hardware transactional memory. In this paper, we study persistent hybrid TM, which allows hardware assisted ACID transactions to execute concurrently with pure software transactions. This allows applications to gain the benefit of persistent HTM while accommodating unbounded transactions with a high degree of concurrency. Our experiments demonstrate that PHyTM is fast and scalable for realistic workloads.

Journal ArticleDOI
TL;DR: It is shown that HTM allows for achieving nearly lock-free processing of database transactions by carefully controlling the data layout and the access patterns, and provides a scalable, powerful, and easy to use synchronization primitive.
Abstract: So far, transactional memory—although a promising technique—suffered from the absence of an efficient hardware implementation. Intel’s Haswell microarchitecture introduced hardware transactional memory (HTM) in mainstream CPUs. HTM allows for efficient concurrent, atomic operations, which is also highly desirable in the context of databases. On the other hand, HTM has several limitations that, in general, prevent a one-to-one mapping of database transactions to HTM transactions. In this work, we devise several building blocks that can be used to exploit HTM in main-memory databases. We show that HTM allows for achieving nearly lock-free processing of database transactions by carefully controlling the data layout and the access patterns. The HTM component is used for detecting the (infrequent) conflicts, which allows for an optimistic, and thus very low-overhead execution of concurrent transactions. We evaluate our approach on a four-core desktop and a 28-core server system and find that HTM indeed provides a scalable, powerful, and easy to use synchronization primitive.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: The cause of conflicts and contentions are identified and analyzed, and two enhancements that try to resolve conflicts early are proposed that greatly improves overall execution speed while reducing energy consumption.
Abstract: It has been proposed that Transactional Memory be added to Graphics Processing Units (GPUs) in recent years. One proposed hardware design, Warp TM, can scale to 1000s of concurrent transactions. As a programming method that can atomicize an arbitrary number of memory access locations and greatly reduce the efforts to program parallel applications, transactional memory handles the complexity of inter-thread synchronization. However, when thousands of transactions run concurrently on a GPU, conflicts and resource contentions arise, causing performance loss. In this paper, we identify and analyze the cause of conflicts and contentions and propose two enhancements that try to resolve conflicts early: (1) Early-Abort global conflict resolution that allows conflicts to be detected before they reach the Commit Units so that contention in the Commit Units is reduced and (2) Pause-and-Go execution scheme that reduces the chance of conflict and the performance penalty of re-executing long transactions. These two enhancements are enabled by a single hardware modification. Our evaluation shows the combination of the two enhancements greatly improves overall execution speed while reducing energy consumption.

Journal ArticleDOI
TL;DR: Atomic RMI extends Java RMI with distributed transactions that can run on many Java virtual machines located on different network nodes and employs SVA, a fully-pessimistic concurrency control algorithm that provides exclusive access to shared objects and supports rollback and fault tolerance.
Abstract: This paper presents Atomic RMI, a distributed transactional memory framework that supports the control flow model of execution. Atomic RMI extends Java RMI with distributed transactions that can run on many Java virtual machines located on different network nodes. Our system employs SVA, a fully-pessimistic concurrency control algorithm that provides exclusive access to shared objects and supports rollback and fault tolerance. SVA is capable of achieving a relatively high level of parallelism by interweaving transactions that access the same objects and by making transactions that do not share objects independent of one another. It also allows any operations within transactions, including irrevocable ones, like system calls, and provides an unobtrusive API. Our evaluation shows that in most cases Atomic RMI performs better than fine grained mutual-exclusion and read/write locking mechanisms. Atomic RMI also performs better than an optimistic transactional memory in environments with high contention and a high ratio of write operations, while being competitive otherwise.

Book ChapterDOI
19 Sep 2016
TL;DR: This paper demonstrates that the Transactional Synchronization Extensions (TSX) recently introduced by Intel in the x86-64 instruction set can be used to support CFI.
Abstract: Control Flow Integrity (CFI) is a promising defense technique against code-reuse attacks. While proposals to use hardware features to support CFI already exist, there is still a growing demand for an architectural CFI support on commodity hardware. To tackle this problem, in this paper we demonstrate that the Transactional Synchronization Extensions (TSX) recently introduced by Intel in the x86-64 instruction set can be used to support CFI.

Proceedings ArticleDOI
23 May 2016
TL;DR: This paper shows that performance issues well-known to loop parallelism are exacerbated in the presence of HTM, and that capacity aborts can increase when one tries to overcome them, and reveals that, although modern HTM extensions can provide support for TLS, they are not powerful enough to fully implement TLS.
Abstract: This paper presents a detailed analysis of the application of Hardware Transactional Memory (HTM) support for loop parallelization with Thread-Level Speculation (TLS). As a result it provides three contributions: (a) it shows that performance issues well-known to loop parallelism (e.g. false sharing) are exacerbated in the presence of HTM, and that capacity aborts can increase when one tries to overcome them, (b) it reveals that, although modern HTM extensions can provide support for TLS, they are not powerful enough to fully implement TLS, (c) it shows that simple code transformations, such as judicious strip mining and privatization techniques, can overcome such shortcomings, delivering speed-ups for programs that contain loop-carried dependencies. Experimental results reveal that, when these code transformations are used, speed-ups of up to 30% can be achieved for some loops for which previous research had reported slowdowns.
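The strip-mining transformation mentioned above can be illustrated structurally: the iteration space is split into fixed-size strips so that, under TLS on HTM, each strip's speculative read/write set stays within hardware capacity. The strip size of 64 below is illustrative, not taken from the paper, and the sketch runs the strips sequentially rather than speculatively.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

constexpr std::size_t kStrip = 64;  // illustrative strip size

long sum_strip_mined(const std::vector<int>& a) {
    long total = 0;
    // Outer loop walks strip by strip; under TLS each strip would be
    // dispatched as one hardware transaction, keeping its footprint
    // below the HTM capacity limit.
    for (std::size_t base = 0; base < a.size(); base += kStrip) {
        std::size_t end = std::min(base + kStrip, a.size());
        for (std::size_t i = base; i < end; ++i)
            total += a[i];
    }
    return total;
}
```

Privatization, the other transformation the paper names, would additionally give each strip a private copy of variables that cause false sharing, merging them after commit.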

Proceedings ArticleDOI
19 Oct 2016
TL;DR: It is demonstrated that HTM offers significant acceleration of both closed and open nested transactions, while yielding parallel scaling up to the limits of the hardware, whereupon scaling in software continues but with the penalty to throughput imposed by software mechanisms.
Abstract: Transactional memory (TM) has long been advocated as a promising pathway to more automated concurrency control for scaling concurrent programs running on parallel hardware. Software TM (STM) has the benefit of being able to run general transactional programs, but at the significant cost of overheads imposed to log memory accesses, mediate access conflicts, and maintain other transaction metadata. Recently, hardware manufacturers have begun to offer commodity hardware TM (HTM) support in their processors wherein the transaction metadata is maintained "for free" in hardware. However, HTM approaches are only best-effort: they cannot successfully run all transactional programs, whether because of hardware capacity issues (causing large transactions to fail), or compatibility restrictions on the processor instructions permitted within hardware transactions (causing transactions that execute those instructions to fail). In such cases, programs must include failure-handling code to attempt the computation by some other software means, since retrying the transaction would be futile. Thus, a canonical use of HTM is lock elision: replacing lock regions with transactions, retrying some number of times in the case of conflicts, but falling back to locking when HTM fails for other reasons. Here, we describe how software and hardware schemes can combine seamlessly into a hybrid system in support of transactional programs, allowing use of low-cost HTM when it works, but reverting to STM when it doesn't. We describe heuristics used to make this choice dynamically and automatically, but allowing the transition back to HTM opportunistically. Our implementation is for an extension of Java having syntax for both open and closed nested transactions, and boosting, running on the OpenJDK, with dynamic injection of STM mechanisms (into code variants used under STM) and HTM instructions (into code variants used under HTM). 
Both schemes are compatible to allow different threads to run concurrently with either mechanism, while preserving transaction safety. Using a standard synthetic benchmark we demonstrate that HTM offers significant acceleration of both closed and open nested transactions, while yielding parallel scaling up to the limits of the hardware, whereupon scaling in software continues but with the penalty to throughput imposed by software mechanisms.

Proceedings ArticleDOI
27 Feb 2016
TL;DR: In this paper, the authors propose hybrid tracking, which combines pessimistic and optimistic tracking of cross-thread dependences under an adaptive, profile-based switching policy, and is suitable for building efficient runtime support for parallel software systems that are both scalable and correct.
Abstract: It is notoriously challenging to develop parallel software systems that are both scalable and correct. Runtime support for parallelism---such as multithreaded record & replay, data race detectors, transactional memory, and enforcement of stronger memory models---helps achieve these goals, but existing commodity solutions slow programs substantially in order to track (i.e., detect or control) an execution's cross-thread dependences accurately. Prior work tracks cross-thread dependences either "pessimistically," slowing every program access, or "optimistically," allowing for lightweight instrumentation of most accesses but dramatically slowing accesses involved in cross-thread dependences. This paper seeks to hybridize pessimistic and optimistic tracking, which is challenging because there exists a fundamental mismatch between pessimistic and optimistic tracking. We address this challenge based on insights about how dependence tracking and program synchronization interact, and introduce a novel approach called hybrid tracking. Hybrid tracking is suitable for building efficient runtime support, which we demonstrate by building hybrid-tracking-based versions of a dependence recorder and a region serializability enforcer. An adaptive, profile-based policy makes runtime decisions about switching between pessimistic and optimistic tracking. Our evaluation shows that hybrid tracking enables runtime support to overcome the performance limitations of both pessimistic and optimistic tracking alone.

Proceedings ArticleDOI
15 Oct 2016
TL;DR: This work presents CommTM, an HTM that exploits semantic commutativity and extends the coherence protocol and conflict detection scheme to support user-defined commutative operations, while preserving transactional guarantees; the approach can be applied to arbitrary HTMs.
Abstract: Hardware speculative execution schemes such as hardware transactional memory (HTM) enjoy low run-time overheads but suffer from limited concurrency because they rely on reads and writes to detect conflicts. By contrast, software speculation schemes can exploit semantic knowledge of concurrent operations to reduce conflicts. In particular, they often exploit that many operations on shared data, like insertions into sets, are semantically commutative: they produce semantically equivalent results when reordered. However, software techniques often incur unacceptable run-time overheads. To solve this dichotomy, we present CommTM, an HTM that exploits semantic commutativity. CommTM extends the coherence protocol and conflict detection scheme to support user-defined commutative operations. Multiple cores can perform commutative operations to the same data concurrently and without conflicts. CommTM preserves transactional guarantees and can be applied to arbitrary HTMs. CommTM scales on many operations that serialize in conventional HTMs, like set insertions, reference counting, and top-K insertions, and retains the low overhead of HTMs. As a result, at 128 cores, CommTM outperforms a conventional eager-lazy HTM by up to 3.4x and reduces or eliminates aborts.

Proceedings ArticleDOI
12 Mar 2016
TL;DR: This paper presents PleaseTM, a mechanism that allows more freedom in deciding which transaction to abort, while leaving the coherence protocol design unchanged, on STAMP benchmarks running at 32 threads compared to requester-wins HTM.
Abstract: With recent commercial offerings, hardware transactional memory (HTM) has finally become an important tool in writing multithreaded applications. However, current offerings are commonly implemented in a way that keeps the coherence protocol unmodified. Data conflicts are recognized by coherence messages sent by the requester to sharers of the cache block (e.g., a write to a speculatively read line), who are then aborted. This tends to abort transactions that have done more work, leading to suboptimal performance. Even worse, this can lead to live-lock situations where transactions repeatedly abort each other. In this paper, we present PleaseTM, a mechanism that allows more freedom in deciding which transaction to abort, while leaving the coherence protocol design unchanged. In PleaseTM, transactions insert plea bits into their responses to coherence requests as a simple payload, and use these bits to inform conflict management decisions. Coherence permission changes are then achieved with normal coherence requests. Our experiments show that this additional freedom can provide on average 43% speedup, with a maximum of 7-fold speedup, on STAMP benchmarks running at 32 threads compared to requester-wins HTM.

Patent
20 Jun 2016
TL;DR: Prevention of a prefetch memory operation from causing a transaction to abort is described: a local processor receives a prefetch request from a remote processor and determines whether the prefetch request conflicts with a transaction of the local processor.
Abstract: Prevention of a prefetch memory operation from causing a transaction to abort. A local processor receives a prefetch request from a remote processor. A processor determines whether the prefetch request conflicts with a transaction of the local processor. A processor responds to at least one of (i) a determination that the local processor has no transaction, and (ii) a determination that the prefetch request does not conflict with a transaction, by providing the requested prefetch data. A processor responds to a determination that the prefetch request conflicts with a transaction by suppressing processing of the prefetch request.

Proceedings ArticleDOI
23 May 2016
TL;DR: An adaptive transaction scheduling policy relying on a Markov Chain-based model of STM systems, which schedules transactions depending on throughput predictions by the model as a function of the current system state and is periodically re-instantiated at run-time to adapt it to dynamic variations of the workload.
Abstract: Software Transactional Memory (STM) may suffer from performance degradation due to excessive conflicts among concurrent transactions. An approach to cope with this issue consists in putting in place smart scheduling policies which temporarily suspend the execution of some transactions in order to reduce the actual conflict rate. In this paper, we present an adaptive transaction scheduling policy relying on a Markov Chain-based model of STM systems. The policy is adaptive in a twofold sense: (i) it schedules transactions depending on throughput predictions by the model as a function of the current system state, and (ii) its underlying Markov Chain-based model is periodically re-instantiated at run-time to adapt it to dynamic variations of the workload. We also present an implementation of our adaptive transaction scheduler which has been integrated within the open source TinySTM package. The accuracy of our performance model in predicting the system throughput and the advantages of the adaptive scheduling policy over state-of-the-art approaches have been assessed via an experimental study based on the STAMP benchmark suite.

Proceedings ArticleDOI
11 Sep 2016
TL;DR: This paper describes how EXCITE-VM implements snapshot isolation transactions efficiently by manipulating virtual memory mappings and using a novel copy-on-read mechanism with a customized page cache.
Abstract: Multi-core programming remains a major software development and maintenance challenge because of data races, deadlock, non-deterministic failures and complex performance issues. In this paper, we describe EXCITE-VM, a system that provides snapshot isolation transactions on shared memory to facilitate programming and to improve the performance of parallel applications. With snapshots, an application thread is not exposed to the committed changes of other threads until it receives the updates by explicitly creating a new snapshot. Snapshot isolation enables low-overhead lockless read operations and improves fault tolerance by isolating each thread from the transient, uncommitted writes of other threads. This paper describes how EXCITE-VM implements snapshot isolation transactions efficiently by manipulating virtual memory mappings and using a novel copy-on-read mechanism with a customized page cache. Compared to conventional software transactional memory systems, EXCITE-VM provides up to 2.2× performance improvement for the STAMP benchmark suite and up to 1000× speedup for a modified benchmark having long-running read-only transactions. Furthermore, EXCITE-VM achieves a 2× performance improvement on a Memcached benchmark and the Yahoo! Cloud Serving Benchmark. Finally, EXCITE-VM improves fault tolerance and offers features such as low-overhead concurrent audit and analysis.
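The copy-on-read snapshot semantics can be modeled at page granularity. This is a conceptual sketch under our own assumptions, not EXCITE-VM's virtual-memory implementation: each thread reads through a private snapshot that lazily copies in the newest page version visible at snapshot-creation time, so concurrent commits stay invisible until a new snapshot is taken.

```python
# Conceptual model (not EXCITE-VM's VM-mapping mechanism): page-granular
# versioned store with copy-on-read snapshots.

class SnapshotStore:
    def __init__(self):
        self.history = {}            # page -> [(version, value), ...]
        self.version = 0

    def snapshot(self):
        return Snapshot(self)

class Snapshot:
    def __init__(self, store):
        self.store = store
        self.base = store.version    # versions after this stay invisible
        self.pages = {}              # copy-on-read page cache
        self.writes = {}

    def read(self, page):
        if page in self.writes:
            return self.writes[page]
        if page not in self.pages:
            # copy-on-read: materialise, once, the newest version of the
            # page that was committed at or before snapshot creation
            value = None
            for version, val in self.store.history.get(page, []):
                if version <= self.base:
                    value = val
            self.pages[page] = value
        return self.pages[page]

    def write(self, page, value):
        self.writes[page] = value

    def commit(self):
        self.store.version += 1
        for page, value in self.writes.items():
            self.store.history.setdefault(page, []).append(
                (self.store.version, value))
```

A reader holding an old snapshot keeps seeing the old page value even after a concurrent commit, which is exactly the lockless-read property the abstract highlights.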

Journal ArticleDOI
TL;DR: Pot as mentioned in this paper leverages the concept of preordered transactions to achieve deterministic multithreaded execution of programs that use transactional memory and uses a concurrency control protocol to distinguish between fast and speculative transaction execution modes in order to mitigate the overhead of imposing a deterministic order.
Abstract: This article presents Pot, a system that leverages the concept of preordered transactions to achieve deterministic multithreaded execution of programs that use Transactional Memory. Preordered transactions eliminate the root cause of nondeterminism in transactional execution: they provide the illusion of executing in a deterministic serial order, unlike traditional transactions that appear to execute in a nondeterministic order that can change from execution to execution. Pot uses a new concurrency control protocol that exploits the serialization order to distinguish between fast and speculative transaction execution modes in order to mitigate the overhead of imposing a deterministic order. We build two Pot prototypes: one using STM and another using off-the-shelf HTM. To the best of our knowledge, Pot enables deterministic execution of programs using off-the-shelf HTM for the first time. An experimental evaluation shows that Pot achieves deterministic execution of TM programs with low overhead, sometimes even outperforming nondeterministic executions, and clearly outperforming the state of the art.
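The core of the preordered-transactions idea is a deterministic commit order. The following is our own minimal illustration, not Pot's concurrency control protocol: every transaction is assigned a sequence number up front, may execute speculatively in any interleaving, but commits only when its turn arrives, so the apparent serial order is identical in every run.

```python
# Minimal sketch (our illustration, not Pot's protocol): transactions commit
# strictly in their preassigned sequence-number order.

import threading

class PreorderedCommitter:
    def __init__(self):
        self.turn = 0
        self.cond = threading.Condition()

    def commit_in_order(self, seqno, apply_writes):
        with self.cond:
            while self.turn != seqno:    # wait for our deterministic slot
                self.cond.wait()
            apply_writes()               # publish speculative state
            self.turn += 1
            self.cond.notify_all()
```

Even if threads reach their commit point in a different order on every run, the published order is always 0, 1, 2, ..., which is the "illusion of executing in a deterministic serial order" the article describes.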

Proceedings ArticleDOI
17 Jul 2016
TL;DR: A feedback control loop is designed to automate thread management at runtime and reduce program execution time, building on Software Transactional Memory (STM), which has emerged as a promising technique to address synchronization issues through transactions.
Abstract: Parallel programs need to manage the trade-off between the time spent in synchronization and computation. This trade-off is significantly affected by the number of active threads: high parallelism may decrease computing time while increasing synchronization cost. Furthermore, thread locality on different cores may also impact program performance, as memory access time can vary from one core to another due to the complexity of the underlying memory architecture. The performance of a program can therefore be improved by adjusting both the number of active threads and the mapping of its threads to physical cores. However, there is no universal offline rule for choosing the parallelism and thread locality of a program, and offline tuning is error-prone. In this paper, we manage parallelism and thread localities dynamically. We address multi-threading synchronization issues via Software Transactional Memory (STM), which has emerged as a promising technique that bypasses locks by handling synchronization through transactions. Autonomic computing offers designers a framework of methods and techniques to build autonomic systems with well-mastered behaviours. Its key idea is to implement feedback control loops to design safe, efficient and predictable controllers, which enable monitoring and adjusting controlled systems dynamically while keeping overhead low. We propose a feedback control loop design to automate thread management at runtime and reduce program execution time.
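A feedback control loop of this kind can be sketched as a simple hill-climbing controller. This is a hedged simplification under our own assumptions, not the paper's controller design: each control interval, the measured throughput (e.g., committed transactions per interval) decides whether the last change to the thread count helped, and the controller keeps moving in the profitable direction.

```python
# Our own simplified hill-climbing sketch of a thread-count control loop;
# the real controller in the paper is designed with control-theoretic tools.

class ThreadController:
    def __init__(self, min_threads=1, max_threads=8):
        self.min_threads = min_threads
        self.max_threads = max_threads
        self.threads = min_threads
        self.last_throughput = 0.0
        self.direction = +1

    def update(self, throughput):
        """One control-loop iteration: returns the new thread count."""
        if throughput < self.last_throughput:
            self.direction = -self.direction   # last move hurt: reverse
        self.last_throughput = throughput
        self.threads = max(self.min_threads,
                           min(self.max_threads,
                               self.threads + self.direction))
        return self.threads
```

The monitor-decide-act cycle here is the essence of the autonomic-computing loop the abstract invokes: measure the controlled system, compare with the previous measurement, and adjust the actuator (active thread count) while keeping per-interval overhead negligible.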

Patent
23 Feb 2016
TL;DR: In this paper, a computer system includes transactional memory to implement a nested transaction, and the computer system generates a plurality of speculative identification numbers (IDs), identifies at least one of a software thread executed by a hardware processor and a memory operation performed in accordance with an application code.
Abstract: A computer system includes transactional memory to implement a nested transaction. The computer system generates a plurality of speculative identification numbers (IDs), identifies at least one of a software thread executed by a hardware processor and a memory operation performed in accordance with an application code. The computer system assigns at least one speculative cache version to a requested transaction based on a corresponding software thread. The speculative ID of the corresponding software thread identifies the speculative cache version. The computer system also identifies a nested transaction in the memory unit, assigns a cache version to the nested transaction, detects a conflict with the nested transaction, determines a conflicted nesting level of the nested transaction, and determines a cache version corresponding to the conflicted nesting level. The computer system also invalidates the cache version corresponding to the conflicted nesting level.
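The version-per-nesting-level bookkeeping described in the claim can be modeled as a stack. This is an illustrative software model, not the patent's hardware design: each nesting level gets its own speculative cache version, and a conflict detected at level k invalidates that version together with everything nested inside it.

```python
# Illustrative model (not the patented hardware): one speculative write
# buffer per nesting level; a conflict invalidates its level and deeper.

class NestedTxCache:
    def __init__(self):
        self.versions = []           # stack: one cache version per level

    def begin(self):
        self.versions.append({})     # new speculative cache version

    def write(self, addr, value):
        self.versions[-1][addr] = value

    def read(self, addr, memory):
        # innermost version wins; fall through to committed memory
        for level in reversed(self.versions):
            if addr in level:
                return level[addr]
        return memory.get(addr)

    def conflict_at(self, level):
        """Invalidate the cache versions at the conflicted level and deeper."""
        del self.versions[level:]

    def depth(self):
        return len(self.versions)
```

After a conflict at the inner level, the outer transaction's speculative state survives, which is the selective-invalidation behavior the patent claims.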

Proceedings ArticleDOI
01 Nov 2016
TL;DR: The rationale and design of a new parallel execution model for RPython is described that allows the generation of parallel virtual machines while leaving the language semantics unchanged and improves the runtime of a set of multi-threaded Python programs over PyPy with a GIL.
Abstract: The RPython framework takes an interpreter for a dynamic language as its input and produces a Virtual Machine (VM) for that language. RPython is being used to develop PyPy, a high-performance Python interpreter. However, the produced VM does not support parallel execution since the framework relies on a Global Interpreter Lock (GIL): PyPy serialises the execution of multi-threaded Python programs. We describe the rationale and design of a new parallel execution model for RPython that allows the generation of parallel virtual machines while leaving the language semantics unchanged. This model then allows different implementations of concurrency control, and we discuss an implementation based on a GIL and an implementation based on Software Transactional Memory (STM). To evaluate the benefits of either choice, we adapt PyPy to work with both implementations (GIL and STM). The evaluation shows that PyPy with STM improves the runtime of a set of multi-threaded Python programs over PyPy with a GIL by factors in the range of 1.87x up to 5.96x when executing on a processor with 8 cores.
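The interchangeable-backend idea can be sketched behind a single "run one quantum of interpretation atomically" interface. This is a heavily simplified toy of our own, not RPython's execution model: the GIL backend serializes quanta with one lock, while the STM-style backend runs them optimistically and retries on conflict (the quantum function must be re-executable, and real STM also buffers and rolls back writes, which this toy omits).

```python
# Toy sketch (our own, greatly simplified): two concurrency-control
# backends behind the same atomic-quantum interface.

import threading

class GILBackend:
    def __init__(self):
        self.gil = threading.Lock()

    def run_quantum(self, fn):
        with self.gil:               # one quantum at a time, system-wide
            return fn()

class ToySTMBackend:
    def __init__(self):
        self.version = 0
        self.lock = threading.Lock()  # commit lock, not held during fn()

    def run_quantum(self, fn):
        while True:
            start = self.version
            result = fn()             # speculative execution
            with self.lock:
                if self.version == start:   # nobody committed meanwhile
                    self.version += 1
                    return result
            # conflict: another quantum committed first, re-execute fn
```

Because both backends expose the same interface, the interpreter's semantics stay unchanged while the STM backend lets non-conflicting quanta overlap, which is where the paper's multi-threaded speedups come from.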

Proceedings ArticleDOI
11 Jul 2016
TL;DR: RUBIC is proposed, a novel parallelism tuning method for TM applications in both single- and multi-process scenarios that overcomes the shortcomings of previously proposed solutions and achieves unprecedented system-wide fairness and efficiency.
Abstract: With the advent of Chip-Multiprocessors, Transactional Memory (TM) emerged as a powerful paradigm to simplify parallel programming. Unfortunately, as more cores become available in commodity systems, the scalability limits of a wide class of TM applications become more evident. Hence, online parallelism tuning techniques were proposed to adapt the optimal number of threads of TM applications. However, state-of-the-art solutions are exclusively tailored to single-process systems with relatively static workloads, exhibiting pathological behaviors in scenarios where multiple multi-threaded TM processes contend for the shared hardware resources. This paper proposes RUBIC, a novel parallelism tuning method for TM applications in both single- and multi-process scenarios that overcomes the shortcomings of previously proposed solutions. RUBIC helps the co-running processes adapt their parallelism level so that they can efficiently space-share the hardware. When compared to previous online parallelism tuning solutions, RUBIC achieves unprecedented system-wide fairness and efficiency, both in single- and multi-process scenarios. Our evaluation with different workloads and scenarios shows that, on average, RUBIC enhances the overall performance by 26% with respect to the best-performing state-of-the-art online parallelism tuning techniques in multi-process scenarios, while incurring negligible overhead in single-process cases. RUBIC also exhibits unique features in converging to a fair and efficient state.
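The space-sharing intuition can be made concrete with a tiny fair-share calculation. This is our own simplified illustration, not RUBIC's actual algorithm: each co-running process caps its thread count at an approximately equal share of the machine's cores instead of greedily expanding and thrashing the shared hardware.

```python
# Our own toy illustration of cooperative space-sharing (not RUBIC itself):
# split the cores fairly across co-running processes.

def fair_thread_cap(total_cores, num_processes, process_index):
    """Split cores across processes; early processes absorb the remainder."""
    base, remainder = divmod(total_cores, num_processes)
    return base + (1 if process_index < remainder else 0)
```

On 8 cores with 3 TM processes the caps come out as 3, 3 and 2, so the processes together never oversubscribe the machine; a real tuner like RUBIC would then adapt each process's parallelism within (and around) such a share based on observed performance.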

Journal ArticleDOI
01 May 2016
TL;DR: This detailed performance study provides insights on the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was critical to achieving performance.
Abstract: We evaluated the strengths and weaknesses of Intel's hardware extensions for HTM (TSX). We described features that are likely to yield performance gains when using TSX. We explored, with the aid of a new tool called htm-pBuilder, the performance of TSX. We introduced an efficient policy for guaranteeing forward progress on top of TSX. We explored various fallback policy tunings and transaction properties of TSX. This paper presents an extensive performance study of the implementation of Hardware Transactional Memory (HTM) in the Haswell generation of Intel x86 core processors. It evaluates the strengths and weaknesses of this new architecture by exploring several dimensions in the space of Transactional Memory (TM) application characteristics using the Eigenbench (Hong et al., 2010) and CLOMP-TM (Schindewolf et al., 2012) benchmarks. This paper also introduces a new tool, called htm-pBuilder, that tailors fallback policies and allows independent exploration of its parameters. This detailed performance study provides insights on the constraints imposed by Intel's Transactional Synchronization Extensions (Intel TSX) and introduces a simple but efficient policy for guaranteeing forward progress on top of Intel's best-effort HTM, which was critical to achieving performance. The evaluation also shows that there are a number of potential improvements for designers of TM applications and software systems that use Intel's TM, and provides recommendations to extract maximum benefit from the current TM support available in Haswell.
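Because Intel TSX is best-effort (any transaction may abort, even without contention), forward progress requires a software fallback. The following is our own simplified rendition of the classic retry-then-lock pattern such policies build on, not the paper's htm-pBuilder policy; try_htm here is a stand-in parameter modeling a hardware transaction attempt.

```python
# Simplified sketch of the standard HTM fallback pattern (our rendition):
# retry the hardware transaction a bounded number of times, then fall
# back to a global lock that guarantees forward progress.

import threading

fallback_lock = threading.Lock()

def run_with_fallback(body, try_htm, max_retries=5):
    """try_htm(body) models an HTM attempt: True on commit, False on abort."""
    for _ in range(max_retries):
        if not fallback_lock.locked():   # don't race a fallback holder
            if try_htm(body):
                return
    with fallback_lock:                  # guaranteed forward progress
        body()
```

In a real TSX implementation the hardware transaction would also read the fallback lock inside the transaction (lock elision), so that a concurrent fallback execution aborts in-flight transactions; tuning max_retries and the abort handling is precisely the parameter space the paper explores.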