scispace - formally typeset
Search or ask a question

Showing papers on "Transactional memory published in 2015"


Proceedings ArticleDOI
24 Jan 2015
TL;DR: In this article, the authors present the most extensive comparison of synchronization techniques through a series of 31 data structure algorithms from the recent literature on 3 multicore platforms from Intel, Sun Microsystems and AMD.
Abstract: In this paper, we present the most extensive comparison of synchronization techniques. We evaluate 5 different synchronization techniques through a series of 31 data structure algorithms from the recent literature on 3 multicore platforms from Intel, Sun Microsystems and AMD. To this end, we developed in C/C++ and Java a new micro-benchmark suite, called Synchrobench, hence helping the community evaluate new data structures and synchronization techniques. The main conclusion of this evaluation is threefold: (i) although compare-and-swap helps achieving the best performance on multicores, doing so correctly is hard; (ii) optimistic locking offers varying performance results while transactional memory offers more consistent results; and (iii) copy-on-write and read-copy-update suffer more from contention than any other technique but could be combined with others to derive efficient algorithms.

143 citations


Proceedings ArticleDOI
Le Guan, Jingqiang Lin, Bo Luo1, Jiwu Jing, Jing Wang 
17 May 2015
TL;DR: Through extensive experiments, it is shown that Mimosa effectively protects cryptographic keys against various attacks that attempt to read sensitive data from memory, and it only introduces a small performance overhead.
Abstract: Cryptography plays an important role in computer and communication security. In practical implementations of cryptosystems, the cryptographic keys are usually loaded into the memory as plaintext, and then used in the cryptographic algorithms. Therefore, the private keys are subject to memory disclosure attacks that read unauthorized data from RAM. Such attacks could be performed through software methods (e.g., Open SSL Heart bleed) even when the integrity of the victim system's executable binaries is maintained. They could also be performed through physical methods (e.g., Cold-boot attacks on RAM chips) even when the system is free of software vulnerabilities. In this paper, we propose Mimosa that protects RSA private keys against the above software-based and physical memory attacks. When the Mimosa service is in idle, private keys are encrypted and reside in memory as cipher text. During the cryptographic computing, Mimosa uses hardware transactional memory (HTM) to ensure that (a) whenever a malicious process other than Mimosa attempts to read the plaintext private key, the transaction aborts and all sensitive data are automatically cleared with hardware mechanisms, due to the strong atomicity guarantee of HTM, and (b) all sensitive data, including private keys and intermediate states, appear as plaintext only within CPU-bound caches, and are never loaded to RAM chips. To the best of our knowledge, Mimosa is the first solution to use transactional memory to protect sensitive data against memory disclosure attacks. We have implemented Mimosa on a commodity machine with Intel Core i7 Has well CPUs. Through extensive experiments, we show that Mimosa effectively protects cryptographic keys against various attacks that attempt to read sensitive data from memory, and it only introduces a small performance overhead.

95 citations


Proceedings ArticleDOI
13 Jun 2015
TL;DR: There is no single HTM system that is more scalable than the others in all of the benchmarks, there are measurable performance differences among the HTM systems in some benchmarks, and eachHTM system has its own implementation characteristics that limit its scalability.
Abstract: Transactional Memory (TM) is a new programming paradigm for both simple concurrent programming and high concurrent performance. Hardware Transactional Memory (HTM) is hardware support for TM-based programming. It has lower overhead than software transactional memory (STM), which is a software-based implementation of TM. There are now four commercial systems, IBM Blue Gene/Q, IBM zEnterprise EC12, Intel Core, and IBM POWER8, offering HTM. Our work is the first to compare the performance of these four HTM systems. We measured the STAMP benchmarks, the most widely used TM benchmarks. We also evaluated the specific features of each HTM system. Our experimental results show that: (1) there is no single HTM system that is more scalable than the others in all of the benchmarks, (2) there are measurable performance differences among the HTM systems in some benchmarks, and (3) each HTM system has its own implementation characteristics that limit its scalability.

86 citations


Journal ArticleDOI
TL;DR: The Power ISA transactional memory architecture, the POWER8 implementation of this architecture, and two practical uses of this Architecture—Transactional Lock Elision (TLE) and Thread-Level Speculation (TLS)—and provide performance results for these uses are detailed.
Abstract: With multi-core processors, parallel programming has taken on greater importance. Traditional parallel programming techniques based on critical sections controlled by locking have several well-known drawbacks. To allow for more efficient parallel programming with higher performance, the IBM POWER8™ processor implements a hardware transactional memory facility. Transactional memory allows groups of load and store operations to execute and commit as a single atomic unit without the use of traditional locks, thereby improving performance and simplifying the parallel programming model. The POWER8 transactional memory facility provides a robust capability to execute transactions that can survive interrupts. It also allows non-speculative accesses within transactions, which facilitates debugging and thread-level speculation. Unique challenges caused by implementing transactional memory on top of the Power ISA (Instruction Set Architecture) weakly consistent memory model are addressed. We detail the Power ISA transactional memory architecture, the POWER8 implementation of this architecture, and two practical uses of this architecture—Transactional Lock Elision (TLE) and Thread-Level Speculation (TLS)—and provide performance results for these uses.

57 citations


Proceedings ArticleDOI
14 Mar 2015
TL;DR: This work proposes a novel reduced-hardware (RH) version of the NOrec HyTM algorithm, which is the first Hybrid TM that scales while fully preserving the hardware's original semantics of opacity and privatization.
Abstract: Because of hardware TM limitations, software fallbacks are the only way to make TM algorithms guarantee progress. Nevertheless, all known software fallbacks to date, from simple locks to sophisticated versions of the NOrec Hybrid TM algorithm, have either limited scalability or weakened semantics. We propose a novel reduced-hardware (RH) version of the NOrec HyTM algorithm. Instead of an all-software slow path, in our RH NOrec the slow-path is a "mix" of hardware and software: one short hardware transaction executes a maximal amount of initial reads in the hardware, and the second executes all of the writes. This novel combination of the RH approach and the NOrec algorithm delivers the first Hybrid TM that scales while fully preserving the hardware's original semantics of opacity and privatization. Our GCC implementation of RH NOrec is promising in that it shows improved performance relative to all prior methods, at the concurrency levels we could test today.

50 citations


Proceedings ArticleDOI
15 Jun 2015
TL;DR: A detailed performance analysis of AAM is conducted on Intel Haswell and IBM Blue Gene/Q and various performance tradeoffs between different HTM parameters that impact the efficiency of graph processing are illustrated.
Abstract: We propose Atomic Active Messages (AAM), a mechanism that accelerates irregular graph computations on both shared- and distributed-memory machines The key idea behind AAM is that hardware transactional memory (HTM) can be used for simple and efficient processing of irregular structures in highly parallel environments We illustrate techniques such as coarsening and coalescing that enable hardware transactions to considerably accelerate graph processing We conduct a detailed performance analysis of AAM on Intel Haswell and IBM Blue Gene/Q and we illustrate various performance tradeoffs between different HTM parameters that impact the efficiency of graph processing AAM can be used to implement abstractions offered by existing programming models and to improve the performance of irregular graph processing codes such as Graph500 or Galois

43 citations


Proceedings ArticleDOI
14 Jun 2015
TL;DR: SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast, and for HTM improves the odds that the transaction will commit, and generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size.
Abstract: SuperMalloc is an implementation of malloc(3) originally designed for X86 Hardware Transactional Memory (HTM)@. It turns out that the same design decisions also make it fast even without HTM@. For the malloc-test benchmark, which is one of the most difficult workloads for an allocator, with one thread SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc, Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times faster; and on 32 threads without HTM SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size. SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 chunks which contain objects all the same size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple array (most of which is uncommitted to physical memory). SuperMalloc takes care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, SuperMalloc employs a 10-object per-thread cache, a per-CPU cache that holds about a level-2-cache worth of objects per size class, and a global cache that is organized to allow the movement of many objects between a per-CPU cache and the global cache using $O(1)$ instructions. SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast, and for HTM improves the odds that the transaction will commit.

38 citations


Journal ArticleDOI
TL;DR: A novel hybrid approach is introduced that combines model-driven performance forecasting techniques and on-line exploration in order to take the best of the two techniques, namely enhancing robustness despite model’s inaccuracies, and maximizing convergence speed towards optimum solutions.
Abstract: In this paper we investigate the issue of automatically identifying the "natural" degree of parallelism of an application using software transactional memory (STM), i.e., the workload-specific multiprogramming level that maximizes application's performance. We discuss the importance of adapting the concurrency level in two different scenarios, a shared-memory and a distributed STM infrastructure. We propose and evaluate two alternative self-tuning methodologies, explicitly tailored for the considered scenarios. In shared-memory STM, we show that lightweight, black-box approaches relying solely on on-line exploration can be extremely effective. For distributed STMs , we introduce a novel hybrid approach that combines model-driven performance forecasting techniques and on-line exploration in order to take the best of the two techniques, namely enhancing robustness despite model's inaccuracies, and maximizing convergence speed towards optimum solutions.

36 citations


Proceedings ArticleDOI
01 Jul 2015
TL;DR: This study uses two state-of-the-art index implementations: a memory-optimized B-tree extended with HTM to provide multi-threaded concurrency and the Bw-tree lock-free B- tree used in several Microsoft production environments.
Abstract: The release of hardware transactional memory (HTM) in commodity CPUs has major implications on the design and implementation of main-memory databases, especially on the architecture of high-performance lock-free indexing methods at the core of several of these systems. This paper studies the interplay of HTM and lock-free indexing methods. First, we evaluate whether HTM will obviate the need for crafty lock-free index designs by integrating it in a traditional B-tree architecture. HTM performs well for simple data sets with small fixed-length keys and payloads, but its benefits disappear for more complex scenarios (e.g., larger variable-length keys and payloads), making it unattractive as a general solution for achieving high performance. Second, we explore fundamental differences between HTM-based and lock-free B-tree designs. While lock-freedom entails design complexity and extra mechanism, it has performance advantages in several scenarios, especially high-contention cases where readers proceed uncontested (whereas HTM aborts readers). Finally, we explore the use of HTM as a method to simplify lock-free design. We find that using HTM to implement a multi-word compare-and-swap greatly reduces lock-free programming complexity at the cost of only a 10-15% performance degradation. Our study uses two state-of-the-art index implementations: a memory-optimized B-tree extended with HTM to provide multi-threaded concurrency and the Bw-tree lock-free B-tree used in several Microsoft production environments.

34 citations


Proceedings ArticleDOI
24 Jan 2015
TL;DR: This work describes a programming technique and compiler support to reduce both overflow and conflict rates by partitioning common operations into read- mostly (planning) and write-mostly (completion) operations, which then execute separately.
Abstract: Best-effort hardware transactional memory (HTM) allows complex operations to execute atomically and in parallel, so long as hardware buffers do not overflow, and conflicts are not encountered with concurrent operations. We describe a programming technique and compiler support to reduce both overflow and conflict rates by partitioning common operations into read-mostly (planning) and write-mostly (completion) operations, which then execute separately. The completion operation remains transactional; planning can often occur in ordinary code. High-level (semantic) atomicity for the overall operation is ensured by passing an application-specific validator object between planning and completion. Transparent composition of partitioned operations is made possible through fully-automated compiler support, which migrates all planning operations out of the parent transaction while respecting all program data flow and dependences. For both micro- and macro-benchmarks, experiments on IBM z-Series and Intel Haswell machines demonstrate that partitioning can lead to dramatically lower abort rates and higher scalability.

33 citations


Journal ArticleDOI
Zhaoguo Wang1, Han Yi1, Ran Liu1, Mingkai Dong1, Haibo Chen1 
TL;DR: Persistent transactional memory dynamically tracks transactional updates to cache lines to ensure the ACI properties during cache flushes and leverages an undo log in NVM to ensure PTM can always consistently recover transactional data structures from a machine crash.
Abstract: This paper proposes persistent transactional memory (PTM), a new design that adds durability to transactional memory (TM) by incorporating with the emerging non-volatile memory (NVM). PTM dynamically tracks transactional updates to cache lines to ensure the ACI (atomicity, consistency and isolation) properties during cache flushes and leverages an undo log in NVM to ensure PTM can always consistently recover transactional data structures from a machine crash. This paper describes the PTM design based on Intel’s restricted transactional memory. A preliminary evaluation using a concurrent key/value store and a database with a cache-based simulator shows that the additional cache line flushes are small.

Book ChapterDOI
07 Oct 2015
TL;DR: The notion of persistent HTM PHTM is presented, which combines HTM and NVM and features hardware-assisted, lock-free, full ACID transactions and is an order of magnitude faster than the best persistent STM on read-dominant workloads.
Abstract: Hardware transactional memory HTM implementations already provide a transactional abstraction at HW speed in multi-core systems. The imminent availability of mature byte-addressable, nonvolatile memory NVM will provide persistence at the speed of accessing main memory. This paper presents the notion of persistent HTM PHTM, which combines HTM and NVM and features hardware-assisted, lock-free, full ACID transactions. For atomicity and isolation, PHTM is based on the current implementations of HTM. For durability, PHTM adds the algorithmic and minimal HW enhancements needed due to the incorporation of NVM. The paper compares the performance of an implementation of PHTM that emulates NVM aspects with other schemes that are based on HTM and STM. The results clearly indicate the advantage of PHTM in reads, as they are served directly from the cache without locking or versioning. In particular, PHTM is an order of magnitude faster than the best persistent STM on read-dominant workloads.

Book ChapterDOI
TL;DR: In this article, the authors provide formal definitions for a comprehensive collection of consistency conditions for transactional memory (TM) computing, and express all conditions in a uniform way using a formal framework that they present.
Abstract: This chapter provides formal definitions for a comprehensive collection of consistency conditions for transactional memory (TM) computing. We express all conditions in a uniform way using a formal framework that we present. For each of the conditions, we provide two versions: one that allows a transaction T to read the value of a data item written by another transaction T′ that can be live and not yet commit-pending provided that T′ will eventually commit, and a version which allows transactions to read values written only by transactions that have either committed before T starts or are commit-pending. Deriving the first version for a consistency condition was not an easy task but it has the benefit that this version is weaker than the second one and so it results in a wider universe of algorithms which there is no reason to exclude from being considered correct. The formalism for the presented consistency conditions is not based on any unrealistic assumptions, such as that transactional operations are executed atomically or that write operations write distinct values for data items. Making such assumptions facilitates the task of formally expressing the consistency conditions significantly, but results in formal presentations of them that are unrealistic, i.e. that cannot be used to characterize the correctness of most of the executions produced by any reasonable TM algorithm.

Patent
28 Aug 2015
TL;DR: In this paper, a transaction-hint instruction specifies a transaction count-to-completion (CTC) value for a transaction, which indicates how far a transaction is from completion.
Abstract: When executed, a transaction-hint instruction specifies a transaction-count-to-completion (CTC) value for a transaction. The CTC value indicates how far a transaction is from completion. The CTC may be a number of instructions to completion or an amount of time to completion. The CTC value is adjusted as the transaction progresses. When a disruptive event associated with inducing transactional aborts, such as an interrupt or a conflicting memory access, is identified while processing the transaction, processing of the disruptive event is deferred if the adjusted CTC value satisfies deferral criteria. If the adjusted CTC value does not satisfy deferral criteria, the transaction is aborted and the disruptive event is processed.

Patent
15 Jun 2015
TL;DR: In this paper, a transactional memory system coalesces two outermost transactions in a transaction environment, based on encountering a first transaction end instruction of the first outermost transaction, the processor determines whether the first transaction is to-be coalesced with a second transaction.
Abstract: A transactional memory system coalesces two outermost transactions in a transactional memory environment. A processor of the transactional memory system executes a first transaction begin instruction of a first outermost transaction and processes the first transaction. Based on encountering a first transaction end instruction of the first outermost transaction, the processor determines whether the first transaction is to-be coalesced with a second outermost transaction.

Proceedings ArticleDOI
21 Jul 2015
TL;DR: It is shown that adopting WRTO makes it possible to design a strictly DAP TM with invisible and wait-free read-only transactions, while preserving strong progressiveness for write transactions and an isolation level known in literature as Extended Update Serializability.
Abstract: Disjoint-Access Parallelism (DAP) is considered one of the most desirable properties to maximize the scalability of Transactional Memory (TM). This paper investigates the possibility and inherent cost of implementing a DAP TM that ensures two properties that are regarded as important to maximize efficiency in read-dominated workloads, namely having invisible and wait-free read-only transactions. We first prove that relaxing Real-Time Order (RTO) is necessary to implement such a TM. This result motivates us to introduce Witnessable Real-Time Order (WRTO), a weaker variant of RTO that demands enforcing RTO only between directly conflicting transactions. Then we show that adopting WRTO makes it possible to design a strictly DAP TM with invisible and wait-free read-only transactions, while preserving strong progressiveness for write transactions and an isolation level known in literature as Extended Update Serializability. Finally, we shed light on the inherent inefficiency of DAP TM implementations that have invisible and wait-free read-only transactions, by establishing lower bounds on the time and space complexity of such TMs.

Journal ArticleDOI
01 Jul 2015
Abstract: The release of hardware transactional memory (HTM) in commodity CPUs has major implications on the design and implementation of main-memory databases, especially on the architecture of high-perform...

Proceedings ArticleDOI
03 Jun 2015
TL;DR: A compact semantics in which concurrent transactions PUSH their effects into the shared view and PULL the effects of potentially uncommitted concurrent transactions into their local view (or UNPULL to detangle).
Abstract: We present a general theory of serializability, unifying a wide range of transactional algorithms, including some that are yet to come. To this end, we provide a compact semantics in which concurrent transactions PUSH their effects into the shared view (or UNPUSH to recall effects) and PULL the effects of potentially uncommitted concurrent transactions into their local view (or UNPULL to detangle). Each operation comes with simple criteria given in terms of commutativity (Lipton's left-movers and right-movers). The benefit of this model is that most of the elaborate reasoning (coinduction, simulation, subtle invariants, etc.) necessary for proving the serializability of a transactional algorithm is already proved within the semantic model. Thus, proving serializability (or opacity) amounts simply to mapping the algorithm on to our rules, and showing that it satisfies the rules' criteria.

Book ChapterDOI
07 Oct 2015
TL;DR: This work investigates the classical problem of transaction chopping for a promising consistency model in this class--parallel snapshot isolation PSI, and proposes a criterion for checking when a set of transactions executing on PSI can be chopped into smaller pieces without introducing new behaviours, thus improving efficiency.
Abstract: Modern Internet services often achieve scalability and availability by relying on large-scale distributed databases that provide consistency models for transactions weaker than serialisability. We investigate the classical problem of transaction chopping for a promising consistency model in this class--parallel snapshot isolation PSI, which weakens the classical snapshot isolation to allow more efficient large-scale implementations. Namely, we propose a criterion for checking when a set of transactions executing on PSI can be chopped into smaller pieces without introducing new behaviours, thus improving efficiency. We find that our criterion is more permissive than the existing one for chopping serialisable transactions. To establish our criterion, we propose a novel declarative specification of PSI that does not refer to implementation-level concepts and, thus, allows reasoning about the behaviour of PSI databases more easily. Our results contribute to building a theory of consistency models for modern large-scale databases.

Proceedings ArticleDOI
13 Jun 2015
TL;DR: Seer is proposed, a scheduler that addresses precisely this restriction of HTM by leveraging on an on-line probabilistic inference technique that identifies the most likely conflict relations, and establishes a dynamic locking scheme to serialize transactions in a fine-grained manner.
Abstract: Scheduling concurrent transactions to minimize contention is a well known technique in the Transactional Memory (TM) literature, which was largely investigated in the context of software TMs. However, the recent advent of Hardware Transactional Memory (HTM), and its inherently restricted nature, pose new technical challenges that prevent the adoption of existing schedulers: unlike software implementations of TM, existing HTMs provide no information on which data item or contending transaction caused abort. We propose Seer, a scheduler that addresses precisely this restriction of HTM by leveraging on an on-line probabilistic inference technique that identifies the most likely conflict relations, and establishes a dynamic locking scheme to serialize transactions in a fine-grained manner. Our evaluation shows that Seer improves the performance of the Intel TSX HTM by up to 2.5x, and by 62% on average, in TM benchmarks with 8 threads. These performance gains are not only a consequence of the reduced aborts, but also of the reduced activation of the HTM's pessimistic fall-back path.

Patent
18 Aug 2015
TL;DR: In this paper, a transactional memory system salvages a partially executed hardware transaction by detecting an about-to-fail condition in the first code region of the first hardware transaction.
Abstract: A transactional memory system salvages a partially executed hardware transaction. A processor of the transactional memory system saves state information in a first code region of a first hardware transaction, the state information useable to determine whether the first hardware transaction is to be salvaged or to be aborted. The processor detects an about to fail condition in the first code region of the first hardware transaction. The processor, based on the detecting, executes an about-to-fail handler, the about-to-fail handler using the saved state information to determine whether the first hardware transaction is to be salvaged or to be aborted. The processor executing the about-to-fail handler, based on the transaction being to be salvaged, uses the saved state information to determine what portion of the first hardware transaction to salvage.

Book ChapterDOI
07 Oct 2015
TL;DR: The key idea in ALE is to use a sequence of fine-grained locks in the fall back-path to detect conflicts with the fast-path, and at the same time reduce the costs of these locks by executing the fallback-path as a series segments, where each segment is a dynamic length short hardware transaction.
Abstract: Hardware lock-elision HLE introduces concurrency into legacy lock-based code by optimistically executing critical sections in a fast-path as hardware transactions. Its main limitation is that in case of repeated aborts, it reverts to a fallback-path that acquires a serial lock. This fallback-path lacks hardware-software concurrency, because all fast-path hardware transactions abort and wait for the completion of the fallback. Software lock elision has no such limitation, but the overheads incurred are simply too high. We propose amalgamated lock-elision ALE, a novel lock-elision algorithm that provides hardware-software concurrency and efficiency: the fallback-path executes concurrently with fast-path hardware transactions, while the common-path fast-path reads incur no overheads and proceed without any instrumentation. The key idea in ALE is to use a sequence of fine-grained locks in the fallback-path to detect conflicts with the fast-path, and at the same time reduce the costs of these locks by executing the fallback-path as a series segments, where each segment is a dynamic length short hardware transaction. We implemented ALE into GCC and tested the new system on Intel Haswell 16-way chip that provides hardware transactions. We benchmarked linked-lists, hash-tables and red-black trees, as well as converting KyotoCacheDB to use ALE in GCC, and all show that ALE significantly outperforms HLE.

Proceedings ArticleDOI
04 Oct 2015
TL;DR: NUVA, which stands for nonuniform verification architecture, a distributed automata-based RV architecture for parametric specifications in the form of parameterized finite-state automata, is introduced with a case study over a cache-coherent non uniform-memory-access (ccNUMA) multiprocessor.
Abstract: Runtime Verification (RV) has recently emerged as a complementary technology to extend coverage of conventional software verification methods. To address the substantial performance and power overhead of pure software RV frameworks, this paper introduces NUVA, which stands for nonuniform verification architecture, a distributed automata-based RV architecture for parametric specifications in the form of parameterized finite-state automata, with a case study over a cache-coherent nonuniform-memory-access (ccNUMA) multiprocessor. The core of NUVA is a coherent distributed automata transactional memory (ATM) that efficiently maintains states of a dynamic population of automata checkers organized into a rooted dynamic directed acyclic graph (DAG) concurrently shared among all processor nodes. A cycle-accurate model of a ccNUMA multiprocessor confirms that performance slowdown is 1~3% and NoC message traffic increases by 10~15% at parametric event density1 of 0.025 EPI for two compute-intensive scientific benchmarks having irregular concurrent data structures. The detailed architecture and implementation metrics in TSMC 40nm CMOS technology are presented. NUVA can be dimensioned to incur average total power overhead of less than 140mW and area overhead of 4mm2 for a quad-core multiprocessor chip, at operating frequency of 250MHz. It is also estimated to incur 1.9~2.6% area overhead and 0.5~1% power overhead when integrated with Intel's family of high-end desktop and mobile quad-core processors operating at 1.6~3.2GHz. Our silicon implementation achieves average performance2 of 1.5 MEPS/mW. For a processor core with 5 MIPS/mW, this corresponds to a 3.2% drop in energy efficiency at parametric event density of 0.01 EPI.

Journal ArticleDOI
TL;DR: This article presents a mechanism for dynamically reordering memory operations within a transaction so that read-modify-write operations on highly contended locations can be delayed until the very end of the transaction, thereby improving throughput and reducing wasted work.
Abstract: Language-level transactions are said to provide “atomicity,” implying that the order of operations within a transaction should be invisible to concurrent transactions and thus that independent operations within a transaction should be safe to execute in any order. In this article, we present a mechanism for dynamically reordering memory operations within a transaction so that read-modify-write operations on highly contended locations can be delayed until the very end of the transaction. When integrated with traditional transactional conflict detection mechanisms, our approach reduces aborts on hot memory locations, such as statistics counters, thereby improving throughput and reducing wasted work. We present three algorithms for delaying highly contended read-modify-write operations within transactions, and we evaluate their impact on throughput for eager and lazy transactional systems across multiple workloads. We also discuss complications that arise from the interaction between our mechanism and the need for strong language-level semantics, and we propose algorithmic extensions that prevent errors from occurring when accesses are aggressively reordered in a transactional memory implementation with weak semantics.

Proceedings ArticleDOI
10 Jun 2015
TL;DR: Evaluation of both forms of transactional memory found in the Intel Haswell processor, Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM), are evaluated and show that RTM generally outperforms conventional locking mechanisms and that HLE provides consistently better performance than conventional locking mechanism.
Abstract: Transactional memory is a concurrency control mechanism that dynamically determines when threads may safely execute critical sections of code. It provides the performance of fine-grained locking mechanisms with the simplicity of coarse-grained locking mechanisms. With hardware based transactions, the protection of shared data accesses and updates can be evaluated at runtime so that only true collisions to shared data force serialization. This paper explores the use of transactional memory as an alternative to conventional synchronization mechanisms for managing the pending event set in a Time Warp synchronized parallel simulator. In particular, we explore the application of Intel's hardware-based transactional memory (TSX) to manage shared access to the pending event set by the simulation threads. Comparison between conventional locking mechanisms and transactional memory access is performed to evaluate each within the warped Time Warp synchronized parallel simulation kernel. In this testing, evaluation of both forms of transactional memory found in the Intel Haswell processor, Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM), are evaluated. The results show that RTM generally outperforms conventional locking mechanisms and that HLE provides consistently better performance than conventional locking mechanisms, in some cases as much as 27%.

Book ChapterDOI
01 Jan 2015
TL;DR: Multiversion concurrency control is a classical approach for providing concurrent access to the database in database management systems that lets a reading transaction obtain a consistent snapshot corresponding to an arbitrary point in time.
Abstract: Reducing the number of aborts is one of the biggest challenges of most transactional systems: existing TMs may abort many transactions that could, in fact, commit without violating correctness. Historically, the commonly used method for reducing the abort rate was maintaining multiple object versions. Multiversion concurrency control is a classical approach for providing concurrent access to the database in database management systems. Its idea is to let a reading transaction obtain a consistent snapshot corresponding to an arbitrary point in time (e.g., defined at the beginning of a transaction) – concurrent updates are isolated through maintaining old versions rather than via scheduling decisions.

Patent
15 Sep 2015
TL;DR: In this article, a transactional memory system dynamically predicts the resource requirements of hardware transactions and allocates resources for the first hardware transaction based on the predicted resource requirements, based on a previous execution of a prior hardware transaction.
Abstract: A transactional memory system dynamically predicts the resource requirements of hardware transactions. A processor of the transactional memory system predicts resource requirements of a first hardware transaction to be executed based on any one of a resource hint and a previous execution of a prior hardware transaction. The processor allocates resources for the first hardware transaction based on the predicted resource requirements. The processor executes the first hardware transaction. The processor saves resource usage information of the first hardware transaction for future prediction.

Journal ArticleDOI
TL;DR: This work presents a technique based on dynamic code and graph dependency analysis that automatically corrects existing snapshot isolation anomalies in transactional memory programs and shows that corrected applications retain the performance benefits characteristic of snapshot isolation over conventional transactionalMemory systems.
Abstract: Transactional memory systems providing snapshot isolation enable concurrent access to shared data without incurring aborts on read-write conflicts. Reducing aborts is extremely relevant as it leads to higher concurrency, greater performance, and better predictability. Unfortunately, snapshot isolation does not provide serializability as it allows certain anomalies that can lead to subtle consistency violations. While some mechanisms have been proposed to verify the correctness of a program utilizing snapshot isolation transactions, it remains difficult to repair incorrect applications. To reduce the programmer’s burden in this case, we present a technique based on dynamic code and graph dependency analysis that automatically corrects existing snapshot isolation anomalies in transactional memory programs. Our evaluation shows that corrected applications retain the performance benefits characteristic of snapshot isolation over conventional transactional memory systems.

Proceedings ArticleDOI
23 Oct 2015
TL;DR: The tool is the first to insert fences into transactional memory algorithms and it solves the long-standing problem of how to easily port such algorithms to a novel memory model.
Abstract: Previous work has shown how to insert fences that enforce sequential consistency. However, for many concurrent algorithms, sequential consistency is unnecessarily strong and can lead to high execution overhead. The reason is that, often, correctness relies on the execution order of a few specific pairs of instructions. Algorithm designers can declare those execution orders and thereby enable memory-model-independent reasoning about correctness and also ease implementation of algorithms on multiple platforms. The literature has examples of such reasoning, while tool support for enforcing the orders has been lacking until now. In this paper we present a declarative approach to specify and enforce execution orders. Our fence insertion algorithm first identifies the execution orders that a given memory model enforces automatically, and then inserts fences that enforce the rest. Our benchmarks include three off-the-shelf transactional memory algorithms written in C/C++ for which we specify suitable execution orders. For those benchmarks, our experiments with the x86 and ARMv7 memory models show that our tool inserts fences that are competitive with those inserted by the original authors. Our tool is the first to insert fences into transactional memory algorithms and it solves the long-standing problem of how to easily port such algorithms to a novel memory model.

Proceedings ArticleDOI
24 Jan 2015
TL;DR: A thorough investigation of the interplay between memory allocators and software transactional memory (STM) systems shows that allocators can interfere with the way memory addresses are mapped to versioned locks on state-of-the-art softwaretransactional memory implementations.
Abstract: Although dynamic memory management accounts for a significant part of the execution time on many modern software systems, its impact on the performance of transactional memory systems has been mostly overlooked. In order to shed some light into this subject, this paper conducts a thorough investigation of the interplay between memory allocators and software transactional memory (STM) systems. We show that allocators can interfere with the way memory addresses are mapped to versioned locks on state-of-the-art software transactional memory implementations. Moreover, we observed that key aspects of allocators such as false sharing avoidance, scalability, and locality have a drastic impact on the final performance. For instance, we have detected performance differences of up to 171% in the STAMP applications when using distinct allocators. Moreover, we show that optimizations at the STM-level (such as caching transactional objects) are not effective when a modern allocator is already in use. All in all, our study highlights the importance of reporting the allocator utilized in the performance evaluation of transactional memory systems.