Showing papers on "Transactional memory" published in 2011


Proceedings ArticleDOI
05 Mar 2011
TL;DR: A family of hybrid TMs built using the recent NOrec STM algorithm is introduced that, unlike existing hybrid approaches, provide both low overhead on hardware transactions and concurrent execution of hardware and software transactions.
Abstract: Transactional memory (TM) is a promising synchronization mechanism for the next generation of multicore processors. Best-effort Hardware Transactional Memory (HTM) designs, such as Sun's prototype Rock processor and AMD's proposed Advanced Synchronization Facility (ASF), can efficiently execute many transactions, but abort in some cases due to various limitations. Hybrid TM systems can use a compatible software TM (STM) in such cases. We introduce a family of hybrid TMs built using the recent NOrec STM algorithm that, unlike existing hybrid approaches, provide both low overhead on hardware transactions and concurrent execution of hardware and software transactions. We evaluate implementations for Rock and ASF, exploring how the differing HTM designs affect optimization choices. Our investigation yields valuable input for designers of future best-effort HTMs.

131 citations


Journal ArticleDOI
TL;DR: This article develops semantics and type systems for the constructs of the Automatic Mutual Exclusion (AME) programming model for STM systems that use in-place update, optimistic concurrency, lazy conflict detection, and rollback.
Abstract: Software Transactional Memory (STM) is an attractive basis for the development of language features for concurrent programming. However, the semantics of these features can be delicate and problematic. In this article we explore the trade-offs between semantic simplicity, the viability of efficient implementation strategies, and the flexibility of language constructs. Specifically, we develop semantics and type systems for the constructs of the Automatic Mutual Exclusion (AME) programming model; our results apply also to other constructs, such as atomic blocks. With this semantics as a point of reference, we study several implementation strategies. We model STM systems that use in-place update, optimistic concurrency, lazy conflict detection, and rollback. These strategies are correct only under nontrivial assumptions that we identify and analyze. One important source of errors is that some efficient implementations create dangerous “zombie” computations where a transaction keeps running after experiencing a conflict; the assumptions confine the effects of these computations.

119 citations


Proceedings ArticleDOI
03 Dec 2011
TL;DR: KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions, is proposed; it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead.
Abstract: Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data-races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations for single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.

104 citations
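
The word-level, value-based conflict detection mentioned in the KILO TM abstract above can be illustrated with a small sequential model. This is only a sketch under stated assumptions: the class `ValueBasedTx`, its methods, and the dictionary used as "memory" are hypothetical names chosen for illustration, and the code does not model the GPU hardware, the parallel commit, or the Bloom filter organization the paper describes.

```python
class ValueBasedTx:
    """Toy transaction that validates by comparing logged read values to memory."""

    def __init__(self, memory):
        self.memory = memory   # shared word-addressed store (a plain dict here)
        self.read_log = {}     # address -> value observed at first read
        self.write_buf = {}    # address -> value to publish at commit

    def read(self, addr):
        if addr in self.write_buf:          # a transaction sees its own writes
            return self.write_buf[addr]
        if addr not in self.read_log:       # log the value on first read
            self.read_log[addr] = self.memory[addr]
        return self.read_log[addr]

    def write(self, addr, value):
        self.write_buf[addr] = value        # buffered until commit

    def commit(self):
        # Value-based validation: each word read must still hold the logged value.
        for addr, seen in self.read_log.items():
            if self.memory[addr] != seen:
                return False                # conflict detected; caller retries
        self.memory.update(self.write_buf)  # publish buffered writes
        return True


# Two transactions increment the same word; the second one to commit fails.
memory = {0: 10, 1: 20}
t1, t2 = ValueBasedTx(memory), ValueBasedTx(memory)
t1.write(0, t1.read(0) + 1)
t2.write(0, t2.read(0) + 5)
ok1 = t1.commit()
ok2 = t2.commit()
```

In this model no per-location metadata (locks or version numbers) is needed: a conflict is detected only because the value `t2` logged no longer matches memory.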


Proceedings ArticleDOI
12 Feb 2011
TL;DR: A new lock-free commit algorithm is presented that allows write transactions to proceed in parallel, by allowing them to run their validation phase independently of each other, and by resorting to helping from threads that would otherwise be waiting to commit, during the write-back phase.
Abstract: Software Transactional Memory (STM) was initially proposed as a lock-free mechanism for concurrency control. Early implementations had efficiency limitations, and soon obstruction-free proposals appeared, to tackle this problem, often simplifying STM implementation. Today, most of the modern and top-performing STMs use blocking designs, relying on locks to ensure an atomic commit operation. This approach has proved better in practice, in part due to its simplicity. Yet, it may have scalability problems when we move into many-core computers, requiring fine-tuning and careful programming to avoid contention. In this paper we present and discuss the modifications we made to a lock-based multi-version STM in Java, to turn it into a lock-free implementation that we have tested to scale at least up to 192 cores, and which provides results that compete with, and sometimes exceed, some of today's top-performing lock-based implementations. The new lock-free commit algorithm allows write transactions to proceed in parallel, by allowing them to run their validation phase independently of each other, and by resorting to helping from threads that would otherwise be waiting to commit, during the write-back phase. We also present a new garbage collection algorithm to dispose of old unused object versions that allows for asynchronous identification of unnecessary versions, which minimizes its interference with the rest of the transactional system.

91 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: A detailed case study comparing teams of programmers developing a parallel program from scratch using transactional memory and locks, finding that TM code was easier to understand than locks code, because the locks teams used many locks to improve performance.
Abstract: Transactional Memory (TM) promises to simplify parallel programming by replacing locks with atomic transactions. Despite much recent progress in TM research, there is very little experience using TM to develop realistic parallel programs from scratch. In this paper, we present the results of a detailed case study comparing teams of programmers developing a parallel program from scratch using transactional memory and locks. We analyze and quantify in a realistic environment the development time, programming progress, code metrics, programming patterns, and ease of code understanding for six teams who each wrote a parallel desktop search engine over a fifteen week period. Three randomly chosen teams used Intel's Software Transactional Memory compiler and Pthreads, while the other teams used just Pthreads. Our analysis is exploratory: Given the same requirements, how far did each team get? The TM teams were among the first to have a prototype parallel search engine. Compared to the locks teams, the TM teams spent less than half the time debugging segmentation faults, but had more problems tuning performance and implementing queries. Code inspections with industry experts revealed that TM code was easier to understand than locks code, because the locks teams used many locks (up to thousands) to improve performance. Learning from each team's individual success and failure story, this paper provides valuable lessons for improving TM.

81 citations


Proceedings ArticleDOI
04 Jun 2011
TL;DR: Several new hybrid TM algorithms are presented that can execute HTM and STM transactions concurrently and can thus provide good performance over a large spectrum of workloads and are evaluated based on AMD's Advanced Synchronization Facility.
Abstract: Transactional memory (TM) is a speculative shared-memory synchronization mechanism used to speed up concurrent programs. Most current TM implementations are software-based (STM) and incur noticeable overheads for each transactional memory access. Hardware TM proposals (HTM) address this issue but typically suffer from other restrictions such as limits on the number of data locations that can be accessed in a transaction. In this paper, we present several new hybrid TM algorithms that can execute HTM and STM transactions concurrently and can thus provide good performance over a large spectrum of workloads. The algorithms exploit the ability of some HTMs to have both speculative and nonspeculative (nontransactional) memory accesses within a transaction to decrease the transactions' runtime overhead, abort rates, and hardware capacity requirements. We evaluate implementations of these algorithms based on AMD's Advanced Synchronization Facility, an x86 instruction set extension proposal that has been shown to provide a sound basis for HTM.

75 citations


Proceedings ArticleDOI
06 Jun 2011
TL;DR: It is demonstrated that HTM enables simpler and faster solutions, with better memory reclamation properties, than prior approaches, and the results support the claim that HTM can provide significantly better common-case performance, as well as reduced conceptual complexity.
Abstract: Dynamic memory management is a significant source of complexity in the design and implementation of practical concurrent data structures. We study how hardware transactional memory (HTM) can be used to simplify and streamline memory reclamation for such data structures. We propose and evaluate several new HTM-based algorithms for the "Dynamic Collect" problem that lies at the heart of many modern memory management algorithms. We demonstrate that HTM enables simpler and faster solutions, with better memory reclamation properties, than prior approaches. Despite recent theoretical arguments that HTM provides no worst-case advantages, our results support the claim that HTM can provide significantly better common-case performance, as well as reduced conceptual complexity.

66 citations


Book ChapterDOI
20 Sep 2011
TL;DR: It is shown that the memory consumption of algorithms keeping a constant number of versions per object might grow exponentially with the number of objects, while SMV operates successfully even in systems with stringent memory constraints.
Abstract: We present Selective Multi-Versioning (SMV), a new STM that reduces the number of aborts, especially those of long read-only transactions. SMV keeps old object versions as long as they might be useful for some transaction to read. It is able to do so while still allowing reading transactions to be invisible by relying on automatic garbage collection to dispose of obsolete versions. SMV is most suitable for read-dominated workloads, for which it performs better than previous solutions. It achieves up to a 7× throughput improvement over a single-version STM and more than a two-fold improvement over an STM keeping a constant number of versions per object. We show that the memory consumption of algorithms keeping a constant number of versions per object might grow exponentially with the number of objects, while SMV operates successfully even in systems with stringent memory constraints.

65 citations


Journal ArticleDOI
TL;DR: A lower bound of Ω(t) is proved on the number of writes needed to implement a read-only transaction of t items that successfully terminates in a disjoint-access parallel TM implementation; the results assume strict serializability and thus hold under the assumption of opacity.
Abstract: Transactional memory (TM) is a popular approach for alleviating the difficulty of programming concurrent applications; TM guarantees that a transaction, consisting of a sequence of operations, appears to be executed atomically. Two fundamental properties of TM implementations are disjoint-access parallelism and the invisibility of read operations. Disjoint-access parallelism ensures that operations on disconnected data do not interfere, and thus it is critical for TM scalability. The invisibility of read operations means that their implementation does not write to the memory, thereby reducing memory contention. This paper proves an inherent tradeoff for implementations of transactional memories: they cannot be both disjoint-access parallel and have read-only transactions that are invisible and always terminate successfully. In fact, a lower bound of Ω(t) is proved on the number of writes needed in order to implement a read-only transaction of t items, which successfully terminates in a disjoint-access parallel TM implementation. The results assume strict serializability and thus hold under the assumption of opacity. It is shown how to extend the results to hold also for weaker consistency conditions, snapshot isolation and serializability.

63 citations


Patent
22 Feb 2011
TL;DR: An apparatus and method are disclosed for a computer processor (102) configured to access a memory shared by a plurality of processing cores, to execute memory access operations in a transactional mode as a single atomic transaction, and to suspend the transactional mode in response to determining an implicit suspend condition, such as a program control transfer.
Abstract: An apparatus and method is disclosed for a computer processor (102) configured to access a memory (140) shared by a plurality of processing cores and to execute a plurality of memory access operations in a transactional mode as a single atomic transaction and to suspend the transactional mode in response to determining an implicit suspend condition, such as a program control transfer. As part of executing the transaction, the processor marks data accessed by the speculative memory access operations as being speculative data (220). In response to determining a suspend condition (including by detecting a control transfer in an executing thread) (230) the processor suspends the transactional mode of execution, which includes setting a suspend flag (240) and suspending marking speculative data (250). If the processor later detects a resumption condition (e.g., a return control transfer corresponding to a return from the control transfer), the processor is configured to resume the marking of speculative data.

57 citations


Book ChapterDOI
13 Dec 2011
TL;DR: In this paper, the authors evaluate the cost of concurrency by measuring the amount of expensive synchronization that must be employed in an STM implementation that ensures positive concurrency, i.e., allows for concurrent transaction processing in some executions.
Abstract: The promise of software transactional memory (STM) is to combine an easy-to-use programming interface with an efficient utilization of the concurrent-computing abilities provided by modern machines. But does this combination come with an inherent cost? We evaluate the cost of concurrency by measuring the amount of expensive synchronization that must be employed in an STM implementation that ensures positive concurrency, i.e., allows for concurrent transaction processing in some executions. We focus on two popular progress conditions that provide positive concurrency: progressiveness and permissiveness. We show that in permissive STMs, providing a very high degree of concurrency, a transaction may perform a linear number of expensive synchronization patterns with respect to its read-set size. In contrast, progressive STMs provide a very small degree of concurrency but, as we demonstrate, can be implemented using at most one expensive synchronization pattern per transaction. However, we show that even in progressive STMs, a transaction has to "protect" (e.g., by using locks or strong synchronization primitives) a linear amount of data with respect to its write-set size. Our results suggest that achieving high degrees of concurrency in STM implementations may bring a considerable synchronization cost.

Proceedings ArticleDOI
18 Dec 2011
TL;DR: This paper proposes a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications and shows that this approach improves performance up to 18.46% compared to the worst case and up to 6.37% over the Linux default thread mapping strategy.
Abstract: Thread mapping has been extensively used as a technique to efficiently exploit memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. Particularly, Software Transactional Memory (STM) applications introduce another dimension due to their runtime system support. Existing STM systems implement several conflict detection and resolution mechanisms, which leads STM applications to behave differently for each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite considering application, STM system and platform features to build a set of input instances. Then, such data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new unobserved instances. Results show that our approach improves performance up to 18.46% compared to the worst case and up to 6.37% over the Linux default thread mapping strategy.

Proceedings ArticleDOI
04 Oct 2011
TL;DR: OSARE, an active replication protocol for transactional systems that combines the usage of Optimistic Atomic Broadcast with a speculative concurrency control mechanism in order to overlap transaction processing and replica synchronization, achieves remarkable speed-up with respect to state of the art speculative replication protocols.
Abstract: In this work we present OSARE, an active replication protocol for transactional systems that combines the usage of Optimistic Atomic Broadcast with a speculative concurrency control mechanism in order to overlap transaction processing and replica synchronization. OSARE biases the speculative serialization of transactions towards an order aligned with the optimistic message delivery order. However, due to the lock-free nature of its concurrency control algorithm, at high concurrency levels, namely when the probability of mismatches between optimistic and final deliveries is higher, OSARE explores additional alternative transaction serialization orders in a lightweight and opportunistic fashion. A simulation study we carried out in the context of Software Transactional Memory systems shows that OSARE achieves robust performance also in scenarios characterized by non-minimal likelihood of reorder between optimistic and final deliveries, providing remarkable speed-up with respect to state of the art speculative replication protocols.

Proceedings ArticleDOI
08 Jun 2011
TL;DR: HyFlow is a Java framework for D-STM, with pluggable support for directory lookup protocols, transactional synchronization and recovery mechanisms, contention management policies, cache coherence protocols, and network communication protocols, that outperforms competitors on a broad range of transactional workloads on a 72-node system.
Abstract: We present HyFlow, a distributed software transactional memory (D-STM) framework for distributed concurrency control. HyFlow is a Java framework for D-STM, with pluggable support for directory lookup protocols, transactional synchronization and recovery mechanisms, contention management policies, cache coherence protocols, and network communication protocols. HyFlow exports a simple distributed programming model that excludes locks: using (Java 5) annotations, atomic sections are defined as transactions, in which reads and writes to shared, local and remote objects appear to take effect instantaneously. No changes are needed to the underlying virtual machine or compiler. We describe HyFlow's architecture and implementation, and report on experimental studies comparing HyFlow against competing models including Java remote method invocation (RMI) with mutual exclusion and read/write locks, distributed shared memory (DSM), and directory-based D-STM. Our studies show that HyFlow outperforms competitors by as much as 40-190% on a broad range of transactional workloads on a 72-node system, with more than 500 concurrent transactions.

Patent
28 Jul 2011
TL;DR: In this article, a processor includes an execution unit and at least one last branch record (LBR) register to store address information of a branch taken during program execution, which may further store a transaction indicator to indicate whether the branch was taken during a transactional memory (TM) transaction.
Abstract: In one embodiment, a processor includes an execution unit and at least one last branch record (LBR) register to store address information of a branch taken during program execution. This register may further store a transaction indicator to indicate whether the branch was taken during a transactional memory (TM) transaction. This register may further store an abort indicator to indicate whether the branch was caused by a transaction abort. Other embodiments are described and claimed.

Proceedings ArticleDOI
12 Feb 2011
TL;DR: A novel transaction scheduling scheme called “Bloom Filter Guided Transaction Scheduling” (BFGTS) is proposed that uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide enhanced performance in high contention situations.
Abstract: Contention management is an important design component to a transactional memory system. Without effective contention management to ensure forward progress, a transactional memory system can experience live-lock, which is difficult to debug in parallel programs. Early work in contention management focused on heuristic managers that reacted to conflicts between transactions by picking the most appropriate transaction to abort. Reactive methods allow conflicts to happen repeatedly as they do not try to prevent future conflicts from happening. These shortcomings of reactive contention managers have led to proposals that approach contention management as a scheduling problem — proactive managers. Proactive techniques range from throttling execution in predicted periods of high contention to preventing groups of transactions running concurrently that are predicted likely to conflict. We propose a novel transaction scheduling scheme called “Bloom Filter Guided Transaction Scheduling” (BFGTS), that uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide enhanced performance in high contention situations. We compare to two state-of-the-art transaction schedulers, “Adaptive Transaction Scheduling” and “Proactive Transaction Scheduling” and show that BFGTS attains up to a 4.6× and 1.7× improvement on high contention benchmarks respectively. Across all benchmarks it shows a 35% and 25% average performance improvement respectively.

Proceedings ArticleDOI
12 Feb 2011
TL;DR: A programming model is presented that is the first to have opaque transactions, safe asynchronous message passing, and an efficient implementation and a novel definition of safe message passing that may be of independent interest.
Abstract: Many concurrent programming models enable both transactional memory and message passing. For such models, researchers have built increasingly efficient implementations and defined reasonable correctness criteria, while it remains an open problem to obtain the best of both worlds. We present a programming model that is the first to have opaque transactions, safe asynchronous message passing, and an efficient implementation. Our semantics uses tentative message passing and keeps track of dependencies to enable undo of message passing in case a transaction aborts. We can program communication idioms such as barrier and rendezvous that do not deadlock when used in an atomic block. Our experiments show that our model adds little overhead to pure transactions, and that it is significantly more efficient than Transactional Events. We use a novel definition of safe message passing that may be of independent interest.

Proceedings ArticleDOI
27 Jun 2011
TL;DR: This paper proposes to support reactive applications by allowing the developer to annotate some transaction blocks with deadlines; the transaction execution strategy is then adjusted, decreasing the level of optimism as the deadlines near through two modes of conservative execution, without overly limiting the progress of concurrent transactions.
Abstract: Software Transactional Memory (STM) is an optimistic concurrency control mechanism that simplifies the development of parallel programs. Still, the interest of STM has not yet been demonstrated for reactive applications that require bounded response time for some of their operations. We propose to support such applications by allowing the developer to annotate some transaction blocks with deadlines. Based on previous execution statistics, we adjust the transaction execution strategy by decreasing the level of optimism as the deadlines near through two modes of conservative execution, without overly limiting the progress of concurrent transactions. Our implementation comprises an STM extension for gathering statistics and implementing the execution mode strategies. We have also extended the Linux scheduler to disable preemption or migration of threads that are executing transactions with deadlines. Our experimental evaluation shows that our approach significantly improves the chance of a transaction meeting its deadline when its progress is hampered by conflicts.

Proceedings ArticleDOI
05 Mar 2011
TL;DR: It is demonstrated that hardware can substantially accelerate the performance of an STM on unmodified commodity processors, and it is shown that, for all but short transactions, it is not necessary to modify the processor to obtain substantial improvement in TM performance.
Abstract: The adoption of transactional memory is hindered by the high overhead of software transactional memory and the intrusive design changes required by previously proposed TM hardware. We propose that hardware to accelerate software transactional memory (STM) can reside outside an unmodified commodity processor core, thereby substantially reducing implementation costs. This paper introduces Transactional Memory Acceleration using Commodity Cores (TMACC), a hardware-accelerated TM system that does not modify the processor, caches, or coherence protocol. We present a complete hardware implementation of TMACC using a rapid prototyping platform. Using this hardware, we implement two unique conflict detection schemes which are accelerated using Bloom filters on an FPGA. These schemes employ novel techniques for tolerating the latency of fine-grained asynchronous communication with an out-of-core accelerator. We then conduct experiments to explore the feasibility of accelerating TM without modifying existing system hardware. We show that, for all but short transactions, it is not necessary to modify the processor to obtain substantial improvement in TM performance. In these cases, TMACC outperforms an STM by an average of 69% in applications using moderate-length transactions, showing maximum speedup within 8% of an upper bound on TM acceleration. Overall, we demonstrate that hardware can substantially accelerate the performance of an STM on unmodified commodity processors.
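
As a rough illustration of how a Bloom filter can summarize a transaction's write set for conflict detection, the sketch below checks another transaction's reads against such a filter. It is a software toy under stated assumptions, not TMACC's FPGA scheme: `BloomFilter`, `may_conflict`, the filter size, and the SHA-256-based hashing are all choices made here for illustration.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter over addresses; false positives are possible."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # a Python int used as a bit vector

    def _positions(self, addr):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{addr}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, addr):
        for pos in self._positions(addr):
            self.bits |= 1 << pos

    def might_contain(self, addr):
        # True for every added address; may also be True for others.
        return all((self.bits >> pos) & 1 for pos in self._positions(addr))


def may_conflict(read_addrs, writer_filter):
    """Conservatively report whether any read hits the writer's summary."""
    return any(writer_filter.might_contain(a) for a in read_addrs)


# One transaction's write set, summarized in a filter.
writes = BloomFilter()
for addr in (0x1000, 0x1008):
    writes.add(addr)

hit = may_conflict([0x2000, 0x1008], writes)   # 0x1008 was written
empty_hit = may_conflict([0x1000], BloomFilter())
```

The filter never misses a real overlap, so a negative answer is safe to act on; a positive answer may be a false positive and would, in a real system, cost an unnecessary abort or a more precise check.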

Book ChapterDOI
Andras Vajda
01 Jan 2011
TL;DR: This chapter’s main goal is to introduce the reader to the most important processor architecture concepts relevant in the context of multi-core processors as well as the most common processor architectures available today.
Abstract: No book on programming would be complete without an overview of the hardware on which the software will execute. In this chapter we outline the main design principles and solutions applied when designing these chips, as well as the challenges facing the hardware industry, together with an outlook of promising technologies not yet in common practice. This chapter’s main goal is to introduce the reader to the most important processor architecture concepts (core organization, interconnects, memory architectures, support for parallel programming, etc.) relevant in the context of multi-core processors as well as the most common processor architectures available today. We also analyze the challenges faced by processor designs as the number of cores will continue scaling and the emerging technologies (such as transactional memory, support for speculative threading, novel interconnects, 3D stacking of memory, etc.) that will allow continued scaling of processors in terms of available computational power.

Proceedings ArticleDOI
22 Oct 2011
TL;DR: This paper presents Aida, a new model of isolated execution for parallel programs that perform frequent, irregular accesses to pointer-based shared data structures, and offers an implementation of Aida on top of the Habanero Java parallel programming language.
Abstract: Isolation, the property that a task can access shared data without interference from other tasks, is one of the most basic concerns in parallel programming. In this paper, we present Aida, a new model of isolated execution for parallel programs that perform frequent, irregular accesses to pointer-based shared data structures. The three primary benefits of Aida are dynamism, safety and liveness guarantees, and programmability. First, Aida allows tasks to dynamically select and modify, in an isolated manner, arbitrary fine-grained regions in shared data structures, all the while maintaining a high level of concurrency. Consequently, the model can achieve scalable parallelization of regular as well as irregular shared-memory applications. Second, the model offers freedom from data races, deadlocks, and livelocks. Third, no extra burden is imposed on programmers, who access the model via a simple, declarative isolation construct that is similar to that for transactional memory. The key new insight in Aida is a notion of delegation among concurrent isolated tasks (known in Aida as assemblies). Each assembly A is equipped with a region in the shared heap that it owns; the only objects accessed by A are those it owns, guaranteeing race-freedom. The region owned by A can grow or shrink flexibly; however, when A needs to own a datum owned by B, A delegates itself, as well as its owned region, to B. From now on, B has the responsibility of re-executing the task A set out to complete. Delegation as above is the only inter-assembly communication primitive in Aida. In addition to reducing contention in a local, data-driven manner, it guarantees freedom from deadlocks and livelocks. We offer an implementation of Aida on top of the Habanero Java parallel programming language. The implementation employs several novel ideas, including the use of a union-find data structure to represent tasks and the regions that they own. A thorough evaluation using several irregular data-parallel benchmarks demonstrates the low overhead and excellent scalability of Aida, as well as its benefits over existing approaches to declarative isolation. Our results show that Aida performs on par with the state-of-the-art customized implementations of irregular applications and much better than coarse-grained locking and transactional memory approaches.
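
The abstract above mentions a union-find data structure for representing tasks and the regions they own. A minimal sketch of that idea follows, assuming a simplified model in which each set's representative is the owning assembly and delegation is a directed union. The class `OwnershipForest` and its method names are hypothetical, and this is not the Habanero Java implementation.

```python
class OwnershipForest:
    """Union-find in which the root of each set is the owning assembly."""

    def __init__(self):
        self.parent = {}

    def add(self, x):
        self.parent.setdefault(x, x)       # a new element owns itself

    def owner(self, x):
        # Find with path halving: point each visited node at its grandparent.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def delegate(self, a, b):
        """Merge everything owned by a's owner into b's owner."""
        ra, rb = self.owner(a), self.owner(b)
        if ra != rb:
            self.parent[ra] = rb           # direction matters: b's side wins
        return rb


forest = OwnershipForest()
for name in ("A", "B", "o1", "o2", "o3"):
    forest.add(name)

forest.delegate("o1", "A")   # assembly A acquires objects o1 and o2
forest.delegate("o2", "A")
forest.delegate("o3", "B")   # assembly B acquires object o3

# A needs o3, which B owns, so A delegates itself and its region to B.
forest.delegate("A", "B")
owners = {x: forest.owner(x) for x in ("o1", "o2", "o3", "A")}
```

Because delegation is a single pointer update at the root, transferring an entire region costs the same regardless of how many objects it contains, which is one plausible reason such a structure suits this model.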

Proceedings ArticleDOI
30 Sep 2011
TL;DR: RMS-TM is introduced, a Transactional Memory benchmark suite composed of seven real-world applications from the Recognition, Mining and Synthesis domain that provide a mix of short and long transactions with small/large read and write sets with low/medium/high contention rates.
Abstract: Transactional Memory (TM) has been proposed as an alternative concurrency mechanism for the shared memory parallel programming model. Its main goal is to make parallel programming for Chip Multiprocessors (CMPs) easier than using the traditional lock synchronization constructs, without compromising the performance and the scalability. This topic has received substantial research attention and several TM designs have been proposed using various TM benchmarks. We believe that the evaluation of TM proposals would be more solid if it included realistic applications that address ongoing TM research issues and that provide the potential for straightforward comparison against locks. In this paper, we introduce RMS-TM, a Transactional Memory benchmark suite composed of seven real-world applications from the Recognition, Mining and Synthesis (RMS) domain. In addition to featuring current TM research issues such as nesting and I/O and system calls inside transactions, the RMS-TM applications also provide a mix of short and long transactions with small/large read and write sets with low/medium/high contention rates. These characteristics, as well as providing lock-based versions of the applications, make RMS-TM a useful TM tool. Current TM benchmarks do not explore all these features. In our evaluation with selected STM and HTM systems, we find that our benchmark suite is also scalable, which is useful for evaluating TM designs on high core counts.

Proceedings ArticleDOI
30 Aug 2011
TL;DR: This paper argues that the amount of contention can be reduced if read-only transactions access recent consistent data snapshots, progressing in a wait-free manner, and shows how the required number of versions of a shared object can be calculated for a set of tasks.
Abstract: Recent trends in chip architectures toward a higher number of heterogeneous cores and non-uniform memory/non-coherent caches bring renewed attention to the use of Software Transactional Memory (STM) as a fundamental building block for developing parallel applications. Nevertheless, although STM promises to ease concurrent and parallel software development, it relies on the possibility of aborting conflicting transactions to maintain data consistency, which impacts the responsiveness and timing guarantees required by embedded real-time systems. In these systems, contention delays must be (efficiently) limited so that the response times of tasks executing transactions are upper-bounded and task sets can be feasibly scheduled. In this paper we assess the use of STM in the development of embedded real-time software, arguing that the amount of contention can be reduced if read-only transactions access recent consistent data snapshots, progressing in a wait-free manner. We show how the required number of versions of a shared object can be calculated for a set of tasks. We also outline an algorithm to manage conflicts between update transactions that prevents starvation.
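
The idea of read-only transactions reading from recent snapshots can be sketched as a bounded multi-version object. This is a minimal illustration, not the paper's algorithm: `VersionedObject`, `max_versions`, and the integer timestamps are hypothetical, and the bound is passed in by hand here, whereas the paper derives the required number of versions from the task set.

```python
class VersionedObject:
    """Keeps the last `max_versions` committed values of one shared object.

    Assumes writers commit with strictly increasing integer timestamps.
    """

    def __init__(self, initial, max_versions):
        self.max_versions = max_versions
        self.versions = [(0, initial)]  # (commit timestamp, value), oldest first

    def write(self, timestamp, value):
        # An update transaction commits a new version; the oldest one is dropped
        # once the bound is exceeded.
        self.versions.append((timestamp, value))
        if len(self.versions) > self.max_versions:
            self.versions.pop(0)

    def read_snapshot(self, start_timestamp):
        # A read-only transaction sees the newest version committed at or before
        # its start, so it never has to abort because of a later writer.
        for ts, value in reversed(self.versions):
            if ts <= start_timestamp:
                return value
        raise LookupError("snapshot too old: the required version was discarded")
```

If too few versions are kept, a long-running reader hits the `LookupError` path, which is why bounding the number of versions for a given task set matters.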

Book ChapterDOI
26 Mar 2011
TL;DR: This paper introduces a revision calculus that concisely captures the programming model and proves that the calculus is confluent and guarantees determinacy, and shows that the consistency guarantees of the calculus are a logical extension of snapshot isolation with support for conflict resolution and nesting.
Abstract: Enabling applications to execute various tasks in parallel is difficult if those tasks exhibit read and write conflicts. We recently developed a programming model based on concurrent revisions that addresses this challenge in a novel way: each forked task gets a conceptual copy of all the shared state, and state changes are integrated only when tasks are joined, at which time write-write conflicts are deterministically resolved. In this paper, we study the precise semantics of this model, in particular its guarantees for determinacy and consistency. First, we introduce a revision calculus that concisely captures the programming model. Despite allowing concurrent execution and locally nondeterministic scheduling, we prove that the calculus is confluent and guarantees determinacy. We show that the consistency guarantees of our calculus are a logical extension of snapshot isolation with support for conflict resolution and nesting. Moreover, we discuss how custom merge functions can provide stronger guarantees for particular data types that are tailored to the needs of the application. Finally, we show we can visualize the nonlinear history of state in our computations using revision diagrams that clarify the synchronization between tasks and allow local reasoning about state updates.
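
A toy model can make the fork/join behaviour concrete. The sketch below is not the revision calculus itself: the `Revision` class and its methods are hypothetical, and it assumes one particular deterministic rule for write-write conflicts, namely that the joined task's writes override the joiner's.

```python
class Revision:
    """A task's private view of shared state, merged back only on join."""

    def __init__(self, state):
        self.state = dict(state)  # conceptual copy of all shared state
        self.written = set()      # keys this revision has written

    def read(self, key):
        return self.state[key]

    def write(self, key, value):
        self.state[key] = value
        self.written.add(key)

    def fork(self):
        # The forked task starts from a snapshot of the current state.
        return Revision(self.state)

    def join(self, child):
        # Integrate the child's changes. Assumed policy: on a write-write
        # conflict the child's value wins, regardless of scheduling.
        for key in sorted(child.written):
            self.state[key] = child.state[key]
            self.written.add(key)
```

Because the outcome depends only on which keys each revision wrote and not on the interleaving of the writes, the result of a join is the same on every run.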

Proceedings ArticleDOI
31 May 2011
TL;DR: This work develops an HTM system that allows selection of versioning and conflict resolution policies at the granularity of cache lines and discovers that this neat match in granularity with that of the cache coherence protocol results in a design that is very simple and yet able to track closely or exceed the performance of the best performing policy for a given workload.
Abstract: Hardware Transactional Memory (HTM) systems, in prior research, have either fixed policies of conflict resolution and data versioning for the entire system or allowed a degree of flexibility at the level of transactions. Unfortunately, this results in susceptibility to pathologies, lower average performance over diverse workload characteristics, or high design complexity. In this work we explore a new dimension along which flexibility in policy can be introduced. Recognizing that contention is more a property of data than of an atomic code block, we develop an HTM system that allows selection of versioning and conflict resolution policies at the granularity of cache lines. We discover that this neat match in granularity with that of the cache coherence protocol results in a design that is very simple and yet able to track closely or exceed the performance of the best performing policy for a given workload. It also brings together the benefits of parallel commits (inherent in traditional eager HTMs) and good optimistic concurrency without deadlock avoidance mechanisms (inherent in lazy HTMs), with little increase in complexity.

Proceedings ArticleDOI
24 Jan 2011
TL;DR: This work addresses a number of challenges posed by this type of parallelization and quantifies the trade-offs of some of the design decisions, such as how to select good loops for parallelization, how to partition the iteration space among parallel threads, how to handle loop-carried dependencies, and how to transition from serial to parallel execution and back.
Abstract: This paper proposes a new runtime parallelization technique, based on a dynamic optimization framework, to automatically parallelize single-threaded legacy programs. It heavily leverages the optimistic concurrency of transactional memory. This work addresses a number of challenges posed by this type of parallelization and quantifies the trade-offs of some of the design decisions, such as how to select good loops for parallelization, how to partition the iteration space among parallel threads, how to handle loop-carried dependencies, and how to transition from serial to parallel execution and back. The simulated implementation of runtime parallelization shows a potential speedup of 1.36 for the NAS benchmarks and a 1.34 speedup for the SPEC 2000 CPU floating point benchmarks when using two cores for parallel execution.

Book ChapterDOI
13 Dec 2011
TL;DR: It is shown empirically that the COP approach can enhance a software transactional memory (STM) framework to deliver more efficient concurrent data structures from serial source code and deliver performance comparable to that of more complex fine-grained structures.
Abstract: It is well known that guaranteeing program consistency when accessing shared data comes at the price of degraded performance and scalability. This paper initiates the investigation of consistency oblivious programming (COP). In COP, sections of concurrent code that meet certain criteria are executed without checking for consistency. However, checkpoints are added before any shared data modification to verify the algorithm was on the right track, and if not, it is re-executed in a more conservative and expensive consistent way. We show empirically that the COP approach can enhance a software transactional memory (STM) framework to deliver more efficient concurrent data structures from serial source code. In some cases the COP code delivers performance comparable to that of more complex fine-grained structures.
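
The COP pattern (traverse without consistency checks, validate before modifying, fall back to a conservative re-execution) can be sketched on a sorted linked list. This is an illustration under stated assumptions, not the paper's STM-based implementation: a single lock stands in for the consistent mode, and `Node`, `SortedList`, and the `fallbacks` counter are hypothetical names.

```python
import threading


class Node:
    def __init__(self, key, next=None):
        self.key = key
        self.next = next
        self.removed = False  # set when the node is unlinked


class SortedList:
    def __init__(self):
        self.head = Node(float("-inf"))  # sentinel
        self.lock = threading.Lock()
        self.fallbacks = 0  # how often validation failed

    def _search(self, key):
        pred, curr = self.head, self.head.next
        while curr is not None and curr.key < key:
            pred, curr = curr, curr.next
        return pred, curr

    def insert(self, key):
        # Oblivious phase: read-only traversal with no consistency checks.
        pred, curr = self._search(key)
        with self.lock:
            # Checkpoint before modifying shared data: is the result still valid?
            if pred.removed or pred.next is not curr:
                self.fallbacks += 1
                pred, curr = self._search(key)  # conservative re-execution
            if curr is not None and curr.key == key:
                return False
            pred.next = Node(key, curr)
            return True

    def remove(self, key):
        with self.lock:
            pred, curr = self._search(key)
            if curr is None or curr.key != key:
                return False
            curr.removed = True
            pred.next = curr.next
            return True

    def keys(self):
        out, node = [], self.head.next
        while node is not None:
            out.append(node.key)
            node = node.next
        return out
```

The expensive path runs only when the validation at the checkpoint fails, so in low-contention runs most of the work happens outside the lock.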

Patent
23 Jun 2011
TL;DR: In this paper, the authors describe a shared dynamic-sized data structure using hardware transactional memory to simplify and/or improve memory management of the data structure, and various indicators may be used to determine whether memory allocated to the element can be freed.
Abstract: The systems and methods described herein may be used to implement a shared dynamic-sized data structure using hardware transactional memory to simplify and/or improve memory management of the data structure. An application (or thread thereof) may indicate (or register) the intended use of an element of the data structure and may initialize the value of the data structure element. Thereafter, another thread or application may use hardware transactions to access the data structure element while confirming that the data structure element is still part of the dynamic data structure and/or that memory allocated to the data structure element has not been freed. Various indicators may be used to determine whether memory allocated to the element can be freed.

Proceedings ArticleDOI
25 Jul 2011
TL;DR: Evaluation of a Java language extension for coordinated exception handling where a named abox (atomic box) is used to demarcate a region of code that must execute atomically and in isolation indicates that, in addition to enabling recovery, an atomic box executes a reasonably small area of code twice as fast as when using a failbox.
Abstract: In concurrent programs, raising an exception in one thread does not prevent others from operating on an inconsistent shared state. Instead, exceptions should ideally be handled in coordination by all the threads that are affected by their cause. In this paper, we propose a Java language extension for coordinated exception handling where a named abox (atomic box) is used to demarcate a region of code that must execute atomically and in isolation. Upon an exception raised inside an abox, threads executing in dependent aboxes roll back their changes and execute their recovery handler in coordination. We provide a dedicated compiler framework, CXH, to evaluate our atomic box construct experimentally. Our evaluation indicates that, in addition to enabling recovery, an atomic box executes a reasonably small region of code twice as fast as when using a failbox, the existing coordination alternative that has no recovery support.

Proceedings ArticleDOI
09 Oct 2011
TL;DR: This proposal leverages a Hardware Transactional Memory (HTM) design, based on a dedicated HW module for conflict management, whose functionality is exposed to the software through compiler directives, implemented as an extension to the popular OpenMP programming model.
Abstract: Two overriding concerns in the development of embedded MPSoCs are ease of programming and hardware complexity. In this paper we present SoC-TM, an integrated HW/SW solution for transactional programming on embedded MPSoCs. Our proposal leverages a Hardware Transactional Memory (HTM) design, based on a dedicated HW module for conflict management, whose functionality is exposed to the software through compiler directives, implemented as an extension to the popular OpenMP programming model. To further improve ease of programming, our framework supports speculative parallelism, thanks to the ability to enforce a given commit order in hardware. Our experimental results confirm that SoC-TM is a viable and cost-effective solution for embedded MPSoCs, in terms of energy, performance and productivity.