Showing papers on "Transactional memory" published in 2011

Proceedings Article

Hybrid NOrec: a case study in the effectiveness of best effort hardware transactional memory

Luke Dalessandro¹, Francois Carouge², Sean White², Yossi Lev³, Mark S. Moir³, Michael L. Scott¹, Michael Spear²•Institutions (3)

University of Rochester¹, Lehigh University², Oracle Corporation³

05 Mar 2011

TL;DR: A family of hybrid TMs built using the recent NOrec STM algorithm is introduced that, unlike existing hybrid approaches, provide both low overhead on hardware transactions and concurrent execution of hardware and software transactions.

Abstract: Transactional memory (TM) is a promising synchronization mechanism for the next generation of multicore processors. Best-effort Hardware Transactional Memory (HTM) designs, such as Sun's prototype Rock processor and AMD's proposed Advanced Synchronization Facility (ASF), can efficiently execute many transactions, but abort in some cases due to various limitations. Hybrid TM systems can use a compatible software TM (STM) in such cases. We introduce a family of hybrid TMs built using the recent NOrec STM algorithm that, unlike existing hybrid approaches, provide both low overhead on hardware transactions and concurrent execution of hardware and software transactions. We evaluate implementations for Rock and ASF, exploring how the differing HTM designs affect optimization choices. Our investigation yields valuable input for designers of future best-effort HTMs.

131 citations

Journal Article

Semantics of transactional memory and automatic mutual exclusion

Martín Abadi¹, Andrew Birrell², Tim Harris², Michael Isard²•Institutions (2)

University of California, Santa Cruz¹, Microsoft²

25 Jan 2011-ACM Transactions on Programming Languages and Systems

TL;DR: This article develops semantics and type systems for the constructs of the Automatic Mutual Exclusion (AME) programming model for STM systems that use in-place update, optimistic concurrency, lazy conflict detection, and rollback.

Abstract: Software Transactional Memory (STM) is an attractive basis for the development of language features for concurrent programming. However, the semantics of these features can be delicate and problematic. In this article we explore the trade-offs between semantic simplicity, the viability of efficient implementation strategies, and the flexibility of language constructs. Specifically, we develop semantics and type systems for the constructs of the Automatic Mutual Exclusion (AME) programming model; our results apply also to other constructs, such as atomic blocks. With this semantics as a point of reference, we study several implementation strategies. We model STM systems that use in-place update, optimistic concurrency, lazy conflict detection, and rollback. These strategies are correct only under nontrivial assumptions that we identify and analyze. One important source of errors is that some efficient implementations create dangerous “zombie” computations where a transaction keeps running after experiencing a conflict; the assumptions confine the effects of these computations.

119 citations

Proceedings Article

Hardware transactional memory for GPU architectures

Wilson W. L. Fung¹, Inderpreet Singh¹, Andrew Brownsword¹, Tor M. Aamodt¹•Institutions (1)

University of British Columbia¹

03 Dec 2011

TL;DR: This paper proposes KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions and uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead.

Abstract: Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data-races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations for single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128x faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
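The value-based conflict detection named in this abstract can be illustrated with a minimal, single-threaded Python sketch. It is not KILO TM's hardware design: the `Transaction` class, its logs, and the dictionary standing in for global memory are all hypothetical, and the commit step is not made atomic here. The sketch only shows the core idea that a transaction validates by re-reading the words it read and comparing values, rather than relying on coherence messages.

```python
# Illustrative sketch only: all names are hypothetical, not KILO TM's design.
class Transaction:
    def __init__(self, memory):
        self.memory = memory      # shared word-addressed store (a dict here)
        self.read_log = {}        # address -> value first observed
        self.write_log = {}       # address -> buffered new value

    def read(self, addr):
        if addr in self.write_log:            # read-your-own-writes
            return self.write_log[addr]
        value = self.memory[addr]
        self.read_log.setdefault(addr, value)
        return value

    def write(self, addr, value):
        self.write_log[addr] = value          # buffered until commit

    def commit(self):
        # Value-based validation: re-read every logged word and compare.
        for addr, seen in self.read_log.items():
            if self.memory[addr] != seen:
                return False                  # conflict: abort, caller may retry
        self.memory.update(self.write_log)    # publish buffered writes
        return True


memory = {0: 10, 1: 20}
t1, t2 = Transaction(memory), Transaction(memory)
t1.write(0, t1.read(0) + 1)
t2.write(0, t2.read(0) + 5)
first = t1.commit()     # validates and publishes word 0
second = t2.commit()    # fails validation because word 0 changed
```

In this run the second transaction aborts and would have to re-execute, which is the retry behaviour a value-based scheme relies on.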

104 citations

Proceedings Article

Lock-free and scalable multi-version software transactional memory

Sérgio Miguel Fernandes¹, João Cachopo¹•Institutions (1)

INESC-ID¹

12 Feb 2011

TL;DR: A new lock-free commit algorithm is presented that allows write transactions to proceed in parallel, by allowing them to run their validation phase independently of each other, and by resorting to helping from threads that would otherwise be waiting to commit, during the write-back phase.

Abstract: Software Transactional Memory (STM) was initially proposed as a lock-free mechanism for concurrency control. Early implementations had efficiency limitations, and soon obstruction-free proposals appeared, to tackle this problem, often simplifying STM implementation. Today, most of the modern and top-performing STMs use blocking designs, relying on locks to ensure an atomic commit operation. This approach has proved better in practice, in part due to its simplicity. Yet, it may have scalability problems when we move into many-core computers, requiring fine-tuning and careful programming to avoid contention. In this paper we present and discuss the modifications we made to a lock-based multi-version STM in Java, to turn it into a lock-free implementation that we have tested to scale at least up to 192 cores, and which provides results that compete with, and sometimes exceed, some of today's top-performing lock-based implementations. The new lock-free commit algorithm allows write transactions to proceed in parallel, by allowing them to run their validation phase independently of each other, and by resorting to helping from threads that would otherwise be waiting to commit, during the write-back phase. We also present a new garbage collection algorithm to dispose of old unused object versions that allows for asynchronous identification of unnecessary versions, which minimizes its interference with the rest of the transactional system.
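The multi-version idea behind this abstract (validate, then write back new versions while readers keep using old ones) can be sketched in a few lines of Python. This is a single-threaded illustration under stated assumptions, not the paper's lock-free commit or its garbage collector: `VBox`, `Clock`, and `Txn` are hypothetical names, and no helping or concurrency is modeled.

```python
# Illustrative sketch only: hypothetical names, single-threaded, no helping.
class VBox:
    """A shared cell keeping several (version, value) pairs, newest first."""
    def __init__(self, value):
        self.versions = [(0, value)]

    def read_at(self, snapshot):
        for version, value in self.versions:
            if version <= snapshot:
                return value
        raise LookupError("version already reclaimed")

    def install(self, version, value):
        self.versions.insert(0, (version, value))


class Clock:
    def __init__(self):
        self.now = 0              # number of the latest committed version


class Txn:
    def __init__(self, clock):
        self.clock = clock
        self.snapshot = clock.now # all reads are served as of this version
        self.reads = set()
        self.writes = {}

    def read(self, box):
        if box in self.writes:
            return self.writes[box]
        self.reads.add(box)
        return box.read_at(self.snapshot)

    def write(self, box, value):
        self.writes[box] = value

    def commit(self):
        if not self.writes:
            return True           # read-only: its snapshot is already consistent
        # Validation phase: abort if anything read has a newer version.
        if any(box.versions[0][0] > self.snapshot for box in self.reads):
            return False
        # Write-back phase: install the write set under a fresh version number.
        self.clock.now += 1
        for box, value in self.writes.items():
            box.install(self.clock.now, value)
        return True


clock = Clock()
balance = VBox(100)
reader = Txn(clock)                       # snapshot taken before the update
writer = Txn(clock)
writer.write(balance, writer.read(balance) - 30)
committed = writer.commit()
old_view = reader.read(balance)           # still served from version 0
new_view = Txn(clock).read(balance)       # a later transaction sees version 1
```

The point of the design is visible in the last two lines: the older reader is never forced to abort, because the version it needs is still available.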

91 citations

Proceedings Article

A study of transactional memory vs. locks in practice

Victor Pankratius¹, Ali-Reza Adl-Tabatabai²•Institutions (2)

Karlsruhe Institute of Technology¹, Intel²

04 Jun 2011

TL;DR: A detailed case study comparing teams of programmers developing a parallel program from scratch using transactional memory and locks, finding that TM code was easier to understand than locks code, because the locks teams used many locks to improve performance.

Abstract: Transactional Memory (TM) promises to simplify parallel programming by replacing locks with atomic transactions. Despite much recent progress in TM research, there is very little experience using TM to develop realistic parallel programs from scratch. In this paper, we present the results of a detailed case study comparing teams of programmers developing a parallel program from scratch using transactional memory and locks. We analyze and quantify in a realistic environment the development time, programming progress, code metrics, programming patterns, and ease of code understanding for six teams who each wrote a parallel desktop search engine over a fifteen week period. Three randomly chosen teams used Intel's Software Transactional Memory compiler and Pthreads, while the other teams used just Pthreads. Our analysis is exploratory: Given the same requirements, how far did each team get? The TM teams were among the first to have a prototype parallel search engine. Compared to the locks teams, the TM teams spent less than half the time debugging segmentation faults, but had more problems tuning performance and implementing queries. Code inspections with industry experts revealed that TM code was easier to understand than locks code, because the locks teams used many locks (up to thousands) to improve performance. Learning from each team's individual success and failure story, this paper provides valuable lessons for improving TM.

81 citations

Proceedings Article

Optimizing hybrid transactional memory: the importance of nonspeculative operations

Torvald Riegel¹, Patrick Marlier, Martin Nowack¹, Pascal Felber, Christof Fetzer¹•Institutions (1)

Dresden University of Technology¹

04 Jun 2011

TL;DR: Several new hybrid TM algorithms are presented that can execute HTM and STM transactions concurrently and can thus provide good performance over a large spectrum of workloads and are evaluated based on AMD's Advanced Synchronization Facility.

Abstract: Transactional memory (TM) is a speculative shared-memory synchronization mechanism used to speed up concurrent programs. Most current TM implementations are software-based (STM) and incur noticeable overheads for each transactional memory access. Hardware TM proposals (HTM) address this issue but typically suffer from other restrictions such as limits on the number of data locations that can be accessed in a transaction. In this paper, we present several new hybrid TM algorithms that can execute HTM and STM transactions concurrently and can thus provide good performance over a large spectrum of workloads. The algorithms exploit the ability of some HTMs to have both speculative and nonspeculative (nontransactional) memory accesses within a transaction to decrease the transactions' runtime overhead, abort rates, and hardware capacity requirements. We evaluate implementations of these algorithms based on AMD's Advanced Synchronization Facility, an x86 instruction set extension proposal that has been shown to provide a sound basis for HTM.

75 citations

Proceedings Article

On the power of hardware transactional memory to simplify memory management

Aleksandar Dragojevic¹, Maurice Herlihy², Yossi Lev³, Mark S. Moir³•Institutions (3)

École Polytechnique Fédérale de Lausanne¹, Brown University², Oracle Corporation³

06 Jun 2011

TL;DR: It is demonstrated that HTM enables simpler and faster solutions, with better memory reclamation properties, than prior approaches, and the results support the claim that HTM can provide significantly better common-case performance, as well as reduced conceptual complexity.

Abstract: Dynamic memory management is a significant source of complexity in the design and implementation of practical concurrent data structures. We study how hardware transactional memory (HTM) can be used to simplify and streamline memory reclamation for such data structures. We propose and evaluate several new HTM-based algorithms for the "Dynamic Collect" problem that lies at the heart of many modern memory management algorithms. We demonstrate that HTM enables simpler and faster solutions, with better memory reclamation properties, than prior approaches. Despite recent theoretical arguments that HTM provides no worst-case advantages, our results support the claim that HTM can provide significantly better common-case performance, as well as reduced conceptual complexity.

66 citations

Book Chapter

SMV: selective multi-versioning STM

Dmitri Perelman¹, Anton Byshevsky¹, Oleg Litmanovich¹, Idit Keidar¹•Institutions (1)

Technion – Israel Institute of Technology¹

20 Sep 2011

TL;DR: It is shown that the memory consumption of algorithms keeping a constant number of versions per object might grow exponentially with the number of objects, while SMV operates successfully even in systems with stringent memory constraints.

Abstract: We present Selective Multi-Versioning (SMV), a new STM that reduces the number of aborts, especially those of long read-only transactions. SMV keeps old object versions as long as they might be useful for some transaction to read. It is able to do so while still allowing reading transactions to be invisible by relying on automatic garbage collection to dispose of obsolete versions. SMV is most suitable for read-dominated workloads, for which it performs better than previous solutions. It has an up to ×7 throughput improvement over a single-version STM and more than a two-fold improvement over an STM keeping a constant number of versions per object. We show that the memory consumption of algorithms keeping a constant number of versions per object might grow exponentially with the number of objects, while SMV operates successfully even in systems with stringent memory constraints.

65 citations

Journal Article

Inherent Limitations on Disjoint-Access Parallel Implementations of Transactional Memory

Hagit Attiya¹, Eshcar Hillel¹, Alessia Milani¹•Institutions (1)

Technion – Israel Institute of Technology¹

01 Nov 2011-Theory of Computing Systems / Mathematical Systems Theory

TL;DR: A lower bound of Ω(t) is proved on the number of writes needed to implement a read-only transaction of t items that successfully terminates in a disjoint-access parallel TM implementation; the results assume strict serializability and thus hold under the assumption of opacity.

Abstract: Transactional memory (TM) is a popular approach for alleviating the difficulty of programming concurrent applications; TM guarantees that a transaction, consisting of a sequence of operations, appears to be executed atomically. Two fundamental properties of TM implementations are disjoint-access parallelism and the invisibility of read operations. Disjoint access parallelism ensures that operations on disconnected data do not interfere, and thus it is critical for TM scalability. The invisibility of read operations means that their implementation does not write to the memory, thereby reducing memory contention. This paper proves an inherent tradeoff for implementations of transactional memories: they cannot be both disjoint-access parallel and have read-only transactions that are invisible and always terminate successfully. In fact, a lower bound of Ω(t) is proved on the number of writes needed in order to implement a read-only transaction of t items, which successfully terminates in a disjoint-access parallel TM implementation. The results assume strict serializability and thus hold under the assumption of opacity. It is shown how to extend the results to hold also for weaker consistency conditions, snapshot isolation and serializability.

63 citations

Patent

Automatic suspend and resume in hardware transactional memory

Jaewoong Chung¹, David S. Christie¹, Michael P. Hohmuth¹, Stephan Diestelhorst¹, Martin T. Pohlack¹•Institutions (1)

Advanced Micro Devices¹

22 Feb 2011

TL;DR: An apparatus and method are described for a computer processor (102) configured to access a memory shared by a plurality of processing cores, to execute memory access operations in a transactional mode as a single atomic transaction, and to suspend the transaction in response to determining an implicit suspend condition, such as a program control transfer.

Abstract: An apparatus and method is disclosed for a computer processor (102) configured to access a memory (140) shared by a plurality of processing cores and to execute a plurality of memory access operations in a transactional mode as a single atomic transaction and to suspend the transactional mode in response to determining an implicit suspend condition, such as a program control transfer. As part of executing the transaction, the processor marks data accessed by the speculative memory access operations as being speculative data (220). In response to determining a suspend condition (including by detecting a control transfer in an executing thread) (230) the processor suspends the transactional mode of execution, which includes setting a suspend flag (240) and suspending marking speculative data (250). If the processor later detects a resumption condition (e.g., a return control transfer corresponding to a return from the control transfer), the processor is configured to resume the marking of speculative data.

57 citations

Book Chapter

On the cost of concurrency in transactional memory

Petr Kuznetsov¹, Srivatsan Ravi¹•Institutions (1)

Deutsche Telekom¹

13 Dec 2011

TL;DR: In this paper, the authors evaluate the cost of concurrency by measuring the amount of expensive synchronization that must be employed in an STM implementation that ensures positive concurrency, i.e., allows for concurrent transaction processing in some executions.

Abstract: The promise of software transactional memory (STM) is to combine an easy-to-use programming interface with an efficient utilization of the concurrent-computing abilities provided by modern machines. But does this combination come with an inherent cost? We evaluate the cost of concurrency by measuring the amount of expensive synchronization that must be employed in an STM implementation that ensures positive concurrency, i.e., allows for concurrent transaction processing in some executions. We focus on two popular progress conditions that provide positive concurrency: progressiveness and permissiveness. We show that in permissive STMs, providing a very high degree of concurrency, a transaction may perform a linear number of expensive synchronization patterns with respect to its read-set size. In contrast, progressive STMs provide a very small degree of concurrency but, as we demonstrate, can be implemented using at most one expensive synchronization pattern per transaction. However, we show that even in progressive STMs, a transaction has to "protect" (e.g., by using locks or strong synchronization primitives) a linear amount of data with respect to its write-set size. Our results suggest that achieving high degrees of concurrency in STM implementations may bring a considerable synchronization cost.

Proceedings Article

A machine learning-based approach for thread mapping on transactional memory applications

Márcio Castro¹, Luís F. W. Góes², Christiane Pousa Ribeiro¹, Murray Cole², Marcelo Cintra², Jean-François Méhaut¹•Institutions (2)

University of Grenoble¹, University of Edinburgh²

18 Dec 2011

TL;DR: This paper proposes a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications and shows that this approach improves performance up to 18.46% compared to the worst case and up to 6.37% over the Linux default thread mapping strategy.

Abstract: Thread mapping has been extensively used as a technique to efficiently exploit memory hierarchy on modern chip-multiprocessors. It places threads on cores in order to amortize memory latency and/or to reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. Particularly, Software Transactional Memory (STM) applications introduce another dimension due to its runtime system support. Existing STM systems implement several conflict detection and resolution mechanisms, which leads STM applications to behave differently for each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite considering application, STM system and platform features to build a set of input instances. Then, such data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new unobserved instances. Results show that our approach improves performance up to 18.46% compared to the worst case and up to 6.37% over the Linux default thread mapping strategy.

Proceedings Article

OSARE: Opportunistic Speculation in Actively REplicated Transactional Systems

Roberto Palmieri¹, Francesco Quaglia¹, Paolo Romano²•Institutions (2)

Sapienza University of Rome¹, INESC-ID²

04 Oct 2011

TL;DR: OSARE, an active replication protocol for transactional systems that combines the usage of Optimistic Atomic Broadcast with a speculative concurrency control mechanism in order to overlap transaction processing and replica synchronization, achieves remarkable speed-up with respect to state of the art speculative replication protocols.

Abstract: In this work we present OSARE, an active replication protocol for transactional systems that combines the usage of Optimistic Atomic Broadcast with a speculative concurrency control mechanism in order to overlap transaction processing and replica synchronization. OSARE biases the speculative serialization of transactions towards an order aligned with the optimistic message delivery order. However, due to the lock-free nature of its concurrency control algorithm, at high concurrency levels, namely when the probability of mismatches between optimistic and final deliveries is higher, OSARE explores additional alternative transaction serialization orders in a lightweight and opportunistic fashion. A simulation study we carried out in the context of Software Transactional Memory systems shows that OSARE achieves robust performance also in scenarios characterized by non-minimal likelihood of reorder between optimistic and final deliveries, providing remarkable speed-up with respect to state of the art speculative replication protocols.

Proceedings Article

HyFlow: a high performance distributed software transactional memory framework

Mohamed M. Saad¹, Binoy Ravindran¹•Institutions (1)

Virginia Tech¹

08 Jun 2011

TL;DR: HyFlow is a Java framework for D-STM, with pluggable support for directory lookup protocols, transactional synchronization and recovery mechanisms, contention management policies, cache coherence protocols, and network communication protocols, that outperforms competitors on a broad range of transactional workloads on a 72-node system.

Abstract: We present HyFlow --- a distributed software transactional memory (D-STM) framework for distributed concurrency control. HyFlow is a Java framework for D-STM, with pluggable support for directory lookup protocols, transactional synchronization and recovery mechanisms, contention management policies, cache coherence protocols, and network communication protocols. HyFlow exports a simple distributed programming model that excludes locks: using (Java 5) annotations, atomic sections are defined as transactions, in which reads and writes to shared, local and remote objects appear to take effect instantaneously. No changes are needed to the underlying virtual machine or compiler. We describe HyFlow's architecture and implementation, and report on experimental studies comparing HyFlow against competing models including Java remote method invocation (RMI) with mutual exclusion and read/write locks, distributed shared memory (DSM), and directory-based D-STM. Our studies show that HyFlow outperforms competitors by as much as 40-190% on a broad range of transactional workloads on a 72-node system, with more than 500 concurrent transactions.

Patent

Last branch record indicators for transactional memory

Ravi Rajwar, Peter Lachner, Laura A. Knauth, Konrad K. Lai

28 Jul 2011

TL;DR: In this article, a processor includes an execution unit and at least one last branch record (LBR) register to store address information of a branch taken during program execution, which may further store a transaction indicator to indicate whether the branch was taken during a transactional memory (TM) transaction.

Abstract: In one embodiment, a processor includes an execution unit and at least one last branch record (LBR) register to store address information of a branch taken during program execution. This register may further store a transaction indicator to indicate whether the branch was taken during a transactional memory (TM) transaction. This register may further store an abort indicator to indicate whether the branch was caused by a transaction abort. Other embodiments are described and claimed.

Proceedings Article

Bloom Filter Guided Transaction Scheduling

Geoffrey Blake¹, Ronald G. Dreslinski¹, Trevor Mudge¹•Institutions (1)

University of Michigan¹

12 Feb 2011

TL;DR: A novel transaction scheduling scheme called “Bloom Filter Guided Transaction Scheduling” (BFGTS), that uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide enhanced performance in high contention situations.

Abstract: Contention management is an important design component to a transactional memory system. Without effective contention management to ensure forward progress, a transactional memory system can experience live-lock, which is difficult to debug in parallel programs. Early work in contention management focused on heuristic managers that reacted to conflicts between transactions by picking the most appropriate transaction to abort. Reactive methods allow conflicts to happen repeatedly as they do not try to prevent future conflicts from happening. These shortcomings of reactive contention managers have led to proposals that approach contention management as a scheduling problem — proactive managers. Proactive techniques range from throttling execution in predicted periods of high contention to preventing groups of transactions running concurrently that are predicted likely to conflict. We propose a novel transaction scheduling scheme called “Bloom Filter Guided Transaction Scheduling” (BFGTS), that uses a combination of simple hardware and Bloom filter heuristics to guide scheduling decisions and provide enhanced performance in high contention situations. We compare to two state-of-the-art transaction schedulers, “Adaptive Transaction Scheduling” and “Proactive Transaction Scheduling” and show that BFGTS attains up to a 4.6× and 1.7× improvement on high contention benchmarks respectively. Across all benchmarks it shows a 35% and 25% average performance improvement respectively.
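How a Bloom filter can summarize a transaction's accessed addresses and flag likely conflicts before they happen can be sketched as follows. This is a generic Bloom-filter illustration in Python, not the BFGTS hardware or its heuristics: the `BloomSignature` class, its size parameters, and the use of SHA-256 as the hash family are assumptions made for the example.

```python
# Illustrative sketch only: a generic Bloom filter, not the BFGTS design.
import hashlib


class BloomSignature:
    def __init__(self, bits=256, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.vector = 0                     # Python int used as a bit vector

    def _positions(self, item):
        # Derive `hashes` bit positions from salted SHA-256 digests.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.vector |= 1 << pos

    def may_contain(self, item):
        # True for every added item; may also be True for others (false positive).
        return all((self.vector >> pos) & 1 for pos in self._positions(item))

    def may_overlap(self, other):
        # A shared address always leaves common bits set, so an empty
        # intersection proves the two access sets are disjoint.
        return (self.vector & other.vector) != 0


a, b, empty = BloomSignature(), BloomSignature(), BloomSignature()
for addr in (0x1000, 0x1008):
    a.add(addr)
for addr in (0x1008, 0x2000):
    b.add(addr)
serialize = a.may_overlap(b)        # both touched 0x1008
independent = a.may_overlap(empty)  # nothing recorded in `empty`
```

A proactive scheduler could use `may_overlap` as a cheap hint: transactions whose signatures intersect are candidates for serialization, at the cost of occasional false positives.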

Proceedings Article

Communicating memory transactions

Mohsen Lesani¹, Jens Palsberg¹•Institutions (1)

University of California, Los Angeles¹

12 Feb 2011

TL;DR: A programming model is presented that is the first to have opaque transactions, safe asynchronous message passing, and an efficient implementation and a novel definition of safe message passing that may be of independent interest.

Abstract: Many concurrent programming models enable both transactional memory and message passing. For such models, researchers have built increasingly efficient implementations and defined reasonable correctness criteria, while it remains an open problem to obtain the best of both worlds. We present a programming model that is the first to have opaque transactions, safe asynchronous message passing, and an efficient implementation. Our semantics uses tentative message passing and keeps track of dependencies to enable undo of message passing in case a transaction aborts. We can program communication idioms such as barrier and rendezvous that do not deadlock when used in an atomic block. Our experiments show that our model adds little overhead to pure transactions, and that it is significantly more efficient than Transactional Events. We use a novel definition of safe message passing that may be of independent interest.

Proceedings Article

Deadline-aware scheduling for Software Transactional Memory

Walther Maldonado, Patrick Marlier, Pascal Felber, Julia Lawall, Gilles Muller¹, Etienne Rivière•Institutions (1)

French Institute for Research in Computer Science and Automation¹

27 Jun 2011

TL;DR: This paper proposes to support reactive applications by allowing the developer to annotate some transaction blocks with deadlines; the transaction execution strategy is then adjusted by decreasing the level of optimism as the deadlines near, through two modes of conservative execution, without overly limiting the progress of concurrent transactions.

Abstract: Software Transactional Memory (STM) is an optimistic concurrency control mechanism that simplifies the development of parallel programs. Still, the interest of STM has not yet been demonstrated for reactive applications that require bounded response time for some of their operations. We propose to support such applications by allowing the developer to annotate some transaction blocks with deadlines. Based on previous execution statistics, we adjust the transaction execution strategy by decreasing the level of optimism as the deadlines near through two modes of conservative execution, without overly limiting the progress of concurrent transactions. Our implementation comprises a STM extension for gathering statistics and implementing the execution mode strategies. We have also extended the Linux scheduler to disable preemption or migration of threads that are executing transactions with deadlines. Our experimental evaluation shows that our approach significantly improves the chance of a transaction meeting its deadline when its progress is hampered by conflicts.

Proceedings Article

Hardware acceleration of transactional memory on commodity systems

Jared Casper¹, Tayo Oguntebi¹, Sungpack Hong¹, Nathan G. Bronson¹, Christos Kozyrakis¹, Kunle Olukotun¹•Institutions (1)

Stanford University¹

05 Mar 2011

TL;DR: It is demonstrated that hardware can substantially accelerate the performance of an STM on unmodified commodity processors, and it is shown that, for all but short transactions, it is not necessary to modify the processor to obtain substantial improvement in TM performance.

Abstract: The adoption of transactional memory is hindered by the high overhead of software transactional memory and the intrusive design changes required by previously proposed TM hardware. We propose that hardware to accelerate software transactional memory (STM) can reside outside an unmodified commodity processor core, thereby substantially reducing implementation costs. This paper introduces Transactional Memory Acceleration using Commodity Cores (TMACC), a hardware-accelerated TM system that does not modify the processor, caches, or coherence protocol.We present a complete hardware implementation of TMACC using a rapid prototyping platform. Using this hardware, we implement two unique conflict detection schemes which are accelerated using Bloom filters on an FPGA. These schemes employ novel techniques for tolerating the latency of fine-grained asynchronous communication with an out-of-core accelerator. We then conduct experiments to explore the feasibility of accelerating TM without modifying existing system hardware. We show that, for all but short transactions, it is not necessary to modify the processor to obtain substantial improvement in TM performance. In these cases, TMACC outperforms an STM by an average of 69% in applications using moderate-length transactions, showing maximum speedup within 8% of an upper bound on TM acceleration. Overall, we demonstrate that hardware can substantially accelerate the performance of an STM on unmodified commodity processors.

Book Chapter•DOI•

Multi-core and Many-core Processor Architectures

Andras Vajda¹•Institutions (1)

Ericsson¹

01 Jan 2011

TL;DR: This chapter’s main goal is to introduce the reader to the most important processor architecture concepts relevant in the context of multi-core processors as well as the most common processor architectures available today.

Abstract: No book on programming would be complete without an overview of the hardware on which the software will execute. In this chapter we outline the main design principles and solutions applied when designing these chips, as well as the challenges facing the hardware industry, together with an outlook of promising technologies not yet in common practice. This chapter’s main goal is to introduce the reader to the most important processor architecture concepts (core organization, interconnects, memory architectures, support for parallel programming, etc.) relevant in the context of multi-core processors, as well as the most common processor architectures available today. We also analyze the challenges faced by processor designs as the number of cores continues to scale, and the emerging technologies (such as transactional memory, support for speculative threading, novel interconnects, and 3D stacking of memory) that will allow continued scaling of processors in terms of available computational power.

Proceedings Article•DOI•

Delegated isolation

Roberto Lublinerman¹, Jisheng Zhao², Zoran Budimlić², Swarat Chaudhuri², Vivek Sarkar²•Institutions (2)

Pennsylvania State University¹, Rice University²

22 Oct 2011

TL;DR: This paper presents Aida, a new model of isolated execution for parallel programs that perform frequent, irregular accesses to pointer-based shared data structures, and offers an implementation of Aida on top of the Habanero Java parallel programming language.

Abstract: Isolation---the property that a task can access shared data without interference from other tasks---is one of the most basic concerns in parallel programming. In this paper, we present Aida, a new model of isolated execution for parallel programs that perform frequent, irregular accesses to pointer-based shared data structures. The three primary benefits of Aida are dynamism, safety and liveness guarantees, and programmability. First, Aida allows tasks to dynamically select and modify, in an isolated manner, arbitrary fine-grained regions in shared data structures, all the while maintaining a high level of concurrency. Consequently, the model can achieve scalable parallelization of regular as well as irregular shared-memory applications. Second, the model offers freedom from data races, deadlocks, and livelocks. Third, no extra burden is imposed on programmers, who access the model via a simple, declarative isolation construct that is similar to that for transactional memory. The key new insight in Aida is a notion of delegation among concurrent isolated tasks (known in Aida as assemblies). Each assembly A is equipped with a region in the shared heap that it owns---the only objects accessed by A are those it owns, guaranteeing race-freedom. The region owned by A can grow or shrink flexibly---however, when A needs to own a datum owned by B, A delegates itself, as well as its owned region, to B. From now on, B has the responsibility of re-executing the task A set out to complete. Delegation as above is the only inter-assembly communication primitive in Aida. In addition to reducing contention in a local, data-driven manner, it guarantees freedom from deadlocks and livelocks. We offer an implementation of Aida on top of the Habanero Java parallel programming language. The implementation employs several novel ideas, including the use of a union-find data structure to represent tasks and the regions that they own.
A thorough evaluation using several irregular data-parallel benchmarks demonstrates the low overhead and excellent scalability of Aida, as well as its benefits over existing approaches to declarative isolation. Our results show that Aida performs on par with the state-of-the-art customized implementations of irregular applications and much better than coarse-grained locking and transactional memory approaches.
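
The delegation step described in this abstract maps naturally onto a union-find structure, which can be sketched as follows. This is a single-threaded illustration only, not the Habanero Java implementation: the class `Assemblies` and its methods `add`, `owner`, and `delegate` are hypothetical names, and the re-execution of the delegated task is left out.

```python
class Assemblies:
    """Union-find over assembly ids; the root of each set is the current owner."""

    def __init__(self):
        self.parent = {}

    def add(self, assembly):
        # A new assembly starts out owning only itself.
        self.parent.setdefault(assembly, assembly)

    def owner(self, assembly):
        # Find the root, shortening the path as we go (path halving).
        while self.parent[assembly] != assembly:
            self.parent[assembly] = self.parent[self.parent[assembly]]
            assembly = self.parent[assembly]
        return assembly

    def delegate(self, a, b):
        """Assembly a hands itself and its whole region to b's current owner."""
        root_a, root_b = self.owner(a), self.owner(b)
        if root_a != root_b:
            self.parent[root_a] = root_b
        return root_b


# Usage: A needs a datum owned by B, so A delegates to B; later B delegates to C.
tasks = Assemblies()
for name in ("A", "B", "C"):
    tasks.add(name)
tasks.delegate("A", "B")
tasks.delegate("B", "C")
print(tasks.owner("A"))  # C
```

Merging sets in one direction only, from the requester into the current owner, is what lets a single pointer update transfer an entire region, including everything previously delegated to the requester.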

Proceedings Article•DOI•

RMS-TM: a comprehensive benchmark suite for transactional memory systems

Gokcen Kestor¹, Vasileios Karakostas¹, Osman Unsal¹, Adrian Cristal², Ibrahim Hur¹, Mateo Valero³•Institutions (3)

Barcelona Supercomputing Center¹, Spanish National Research Council², Polytechnic University of Catalonia³

30 Sep 2011

TL;DR: RMS-TM is introduced, a Transactional Memory benchmark suite composed of seven real-world applications from the Recognition, Mining and Synthesis domain that provide a mix of short and long transactions with small/large read and write sets with low/medium/high contention rates.

Abstract: Transactional Memory (TM) has been proposed as an alternative concurrency mechanism for the shared memory parallel programming model. Its main goal is to make parallel programming for Chip Multiprocessors (CMPs) easier than using the traditional lock synchronization constructs, without compromising performance and scalability. This topic has received substantial research attention, and several TM designs have been proposed using various TM benchmarks. We believe that the evaluation of TM proposals would be more solid if it included realistic applications that address ongoing TM research issues and that provide the potential for straightforward comparison against locks. In this paper, we introduce RMS-TM, a Transactional Memory benchmark suite composed of seven real-world applications from the Recognition, Mining and Synthesis (RMS) domain. In addition to featuring current TM research issues such as nesting and I/O and system calls inside transactions, the RMS-TM applications also provide a mix of short and long transactions with small/large read and write sets with low/medium/high contention rates. These characteristics, as well as providing lock-based versions of the applications, make RMS-TM a useful TM tool. Current TM benchmarks do not explore all these features. In our evaluation with selected STM and HTM systems, we find that our benchmark suite is also scalable, which is useful for evaluating TM designs on high core counts.

Proceedings Article•DOI•

Software Transactional Memory as a Building Block for Parallel Embedded Real-Time Systems

António Barros, Luis Miguel Pinho

30 Aug 2011

TL;DR: This paper argues that the amount of contention can be reduced if read-only transactions access recent consistent data snapshots, progressing in a wait-free manner, and shows how the required number of versions of a shared object can be calculated for a set of tasks.

Abstract: The recent trends of chip architectures with a higher number of heterogeneous cores, and non-uniform memory/non-coherent caches, bring renewed attention to the use of Software Transactional Memory (STM) as a fundamental building block for developing parallel applications. Nevertheless, although STM promises to ease concurrent and parallel software development, it relies on the possibility of aborting conflicting transactions to maintain data consistency, which affects the responsiveness and timing guarantees required by embedded real-time systems. In these systems, contention delays must be (efficiently) limited so that the response times of tasks executing transactions are upper-bounded and task sets can be feasibly scheduled. In this paper we assess the use of STM in the development of embedded real-time software, arguing that the amount of contention can be reduced if read-only transactions access recent consistent data snapshots, progressing in a wait-free manner. We show how the required number of versions of a shared object can be calculated for a set of tasks. We also outline an algorithm to manage conflicts between update transactions that prevents starvation.
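
The role of the version count can be illustrated with a small multi-version object. This sketch is an assumption-laden illustration, not the paper's algorithm: `VersionedObject`, `write`, `read_snapshot`, and the integer timestamps are hypothetical, and the paper's method for computing the required number of versions is not reproduced here.

```python
from collections import deque


class VersionedObject:
    """Keeps only the most recent max_versions committed values."""

    def __init__(self, initial, max_versions):
        # Each entry is (commit_timestamp, value); the oldest entry is
        # dropped automatically once max_versions is exceeded.
        self.versions = deque([(0, initial)], maxlen=max_versions)

    def write(self, commit_ts, value):
        # Update transactions append a new version at commit time.
        self.versions.append((commit_ts, value))

    def read_snapshot(self, snapshot_ts):
        # A read-only transaction takes the newest version that is not
        # newer than its snapshot, so it never has to abort a writer.
        for ts, value in reversed(self.versions):
            if ts <= snapshot_ts:
                return value
        raise LookupError("snapshot too old: the version was discarded")


# Usage: a reader whose snapshot was taken at time 3 still sees the old value.
obj = VersionedObject("a", max_versions=2)
obj.write(5, "b")
print(obj.read_snapshot(3))  # a
```

If too few versions are kept, a slow reader can find its snapshot already discarded, which is why the number of versions has to be sized against the task set.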

Book Chapter•DOI•

Semantics of concurrent revisions

Sebastian Burckhardt¹, Daan Leijen¹•Institutions (1)

Microsoft¹

26 Mar 2011

TL;DR: This paper introduces a revision calculus that concisely captures the programming model and proves that the calculus is confluent and guarantees determinacy, and shows that the consistency guarantees of the calculus are a logical extension of snapshot isolation with support for conflict resolution and nesting.

Abstract: Enabling applications to execute various tasks in parallel is difficult if those tasks exhibit read and write conflicts. We recently developed a programming model based on concurrent revisions that addresses this challenge in a novel way: each forked task gets a conceptual copy of all the shared state, and state changes are integrated only when tasks are joined, at which time write-write conflicts are deterministically resolved. In this paper, we study the precise semantics of this model, in particular its guarantees for determinacy and consistency. First, we introduce a revision calculus that concisely captures the programming model. Despite allowing concurrent execution and locally nondeterministic scheduling, we prove that the calculus is confluent and guarantees determinacy. We show that the consistency guarantees of our calculus are a logical extension of snapshot isolation with support for conflict resolution and nesting. Moreover, we discuss how custom merge functions can provide stronger guarantees for particular data types that are tailored to the needs of the application. Finally, we show we can visualize the nonlinear history of state in our computations using revision diagrams that clarify the synchronization between tasks and allow local reasoning about state updates.
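
The fork/join behavior described in this abstract can be sketched in a few lines. This is a sequential toy model under stated assumptions, not the revision calculus: the class `Revision` and its methods are hypothetical, the default rule that the joined task's write wins is an assumption of this sketch, and the three-argument `merge(original, mine, theirs)` signature is likewise illustrative.

```python
class Revision:
    """A task's private view: a snapshot taken at fork time plus its own writes."""

    def __init__(self, snapshot):
        self.base = dict(snapshot)  # state as seen when this revision was forked
        self.writes = {}            # changes made by this revision only

    def read(self, key):
        return self.writes.get(key, self.base.get(key))

    def write(self, key, value):
        self.writes[key] = value

    def fork(self):
        # The child gets a conceptual copy of everything the parent sees now.
        snapshot = dict(self.base)
        snapshot.update(self.writes)
        return Revision(snapshot)

    def join(self, child, merge=None):
        # Integrate the child's changes; each key is resolved deterministically.
        for key, theirs in child.writes.items():
            if merge is None:
                self.writes[key] = theirs  # assumption: joined task's write wins
            else:
                original = child.base.get(key)
                self.writes[key] = merge(original, self.read(key), theirs)


# Usage: both tasks add to a counter; a custom merge keeps both increments.
main = Revision({"n": 10})
child = main.fork()
main.write("n", 12)    # parent adds 2
child.write("n", 15)   # child adds 5
main.join(child, merge=lambda original, mine, theirs: mine + theirs - original)
print(main.read("n"))  # 17
```

Since neither task observes the other's writes before the join, the result depends only on the fork/join structure and the merge rule, not on how the tasks were scheduled.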

Proceedings Article•DOI•

ZEBRA: a data-centric, hybrid-policy hardware transactional memory design

Ruben Titos-Gil¹, Anurag Negi², Manuel E. Acacio¹, José M. García¹, Per Stenström²•Institutions (2)

University of Murcia¹, Chalmers University of Technology²

31 May 2011

TL;DR: This work develops an HTM system that allows selection of versioning and conflict resolution policies at the granularity of cache lines and discovers that this neat match in granularity with that of the cache coherence protocol results in a design that is very simple and yet able to track closely or exceed the performance of the best performing policy for a given workload.

Abstract: Hardware Transactional Memory (HTM) systems, in prior research, have either fixed policies of conflict resolution and data versioning for the entire system or allowed a degree of flexibility at the level of transactions. Unfortunately, this results in susceptibility to pathologies, lower average performance over diverse workload characteristics or high design complexity. In this work we explore a new dimension along which flexibility in policy can be introduced. Recognizing the fact that contention is more a property of data rather than that of an atomic code block, we develop an HTM system that allows selection of versioning and conflict resolution policies at the granularity of cache lines. We discover that this neat match in granularity with that of the cache coherence protocol results in a design that is very simple and yet able to track closely or exceed the performance of the best performing policy for a given workload. It also brings together the benefits of parallel commits (inherent in traditional eager HTMs) and good optimistic concurrency without deadlock avoidance mechanisms (inherent in lazy HTMs), with little increase in complexity.

Proceedings Article•DOI•

Runtime parallelization of legacy code on a transactional memory system

Matthew DeVuyst¹, Dean M. Tullsen¹, Seon Wook Kim²•Institutions (2)

University of California, San Diego¹, Korea University²

24 Jan 2011

TL;DR: This work addresses a number of challenges posed by this type of parallelization and quantifies the trade-offs of some of the design decisions, such as how to select good loops for parallelization, how to partition the iteration space among parallel threads, how to handle loop-carried dependencies, and how to transition from serial to parallel execution and back.

Abstract: This paper proposes a new runtime parallelization technique, based on a dynamic optimization framework, to automatically parallelize single-threaded legacy programs. It heavily leverages the optimistic concurrency of transactional memory. This work addresses a number of challenges posed by this type of parallelization and quantifies the trade-offs of some of the design decisions, such as how to select good loops for parallelization, how to partition the iteration space among parallel threads, how to handle loop-carried dependencies, and how to transition from serial to parallel execution and back. The simulated implementation of runtime parallelization shows a potential speedup of 1.36 for the NAS benchmarks and a 1.34 speedup for the SPEC 2000 CPU floating point benchmarks when using two cores for parallel execution.

Book Chapter•DOI•

Towards consistency oblivious programming

Yehuda Afek¹, Hillel Avni¹, Nir Shavit¹•Institutions (1)

Tel Aviv University¹

13 Dec 2011

TL;DR: It is shown empirically that the COP approach can enhance a software transactional memory (STM) framework to deliver more efficient concurrent data structures from serial source code and deliver performance comparable to that of more complex fine-grained structures.

Abstract: It is well known that guaranteeing program consistency when accessing shared data comes at the price of degraded performance and scalability. This paper initiates the investigation of consistency oblivious programming (COP). In COP, sections of concurrent code that meet certain criteria are executed without checking for consistency. However, checkpoints are added before any shared data modification to verify the algorithm was on the right track, and if not, it is re-executed in a more conservative and expensive consistent way. We show empirically that the COP approach can enhance a software transactional memory (STM) framework to deliver more efficient concurrent data structures from serial source code. In some cases the COP code delivers performance comparable to that of more complex fine-grained structures.
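
The COP pattern of an unchecked traversal followed by a checkpoint before modification can be sketched on a sorted linked list. This is an illustration under stated assumptions, not the authors' code: `Node`, `SortedList`, and the `deleted` flag are hypothetical names, and a plain lock stands in for the STM transaction that the paper uses for the consistent part.

```python
import threading


class Node:
    def __init__(self, key, next=None):
        self.key, self.next, self.deleted = key, next, False


class SortedList:
    def __init__(self):
        # Sentinels at both ends keep the traversal free of None checks.
        self.head = Node(float("-inf"), Node(float("inf")))
        self.lock = threading.Lock()  # assumption: stand-in for an STM transaction

    def _search(self, key):
        pred = self.head
        while pred.next.key < key:
            pred = pred.next
        return pred, pred.next

    def insert(self, key):
        pred, curr = self._search(key)  # unchecked: no consistency tracking
        with self.lock:                 # checkpoint before any modification
            if pred.deleted or pred.next is not curr:
                # The optimistic result is stale: redo the search consistently.
                pred, curr = self._search(key)
            if curr.key == key:
                return False
            pred.next = Node(key, curr)
            return True

    def remove(self, key):
        with self.lock:  # conservative path only, to keep the sketch short
            pred, curr = self._search(key)
            if curr.key != key:
                return False
            curr.deleted = True
            pred.next = curr.next
            return True

    def keys(self):
        out, node = [], self.head.next
        while node.key != float("inf"):
            out.append(node.key)
            node = node.next
        return out


# Usage: single-threaded demonstration of the interface.
lst = SortedList()
lst.insert(3)
lst.insert(1)
print(lst.keys())  # [1, 3]
```

The long traversal, which usually dominates the cost, runs without any bookkeeping; only the short validation and update step pays for consistency, and the expensive consistent search runs only when validation fails.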

Patent•

System and method for performing memory management using hardware transactions

Aleksandar Dragojevic¹, Maurice P. Herlihy¹, Yosef Lev¹, Mark S. Moir¹•Institutions (1)

Business International Corporation¹

23 Jun 2011

TL;DR: This patent describes systems and methods for implementing a shared dynamic-sized data structure using hardware transactional memory to simplify and/or improve memory management of the data structure; various indicators may be used to determine whether memory allocated to an element can be freed.

Abstract: The systems and methods described herein may be used to implement a shared dynamic-sized data structure using hardware transactional memory to simplify and/or improve memory management of the data structure. An application (or thread thereof) may indicate (or register) the intended use of an element of the data structure and may initialize the value of the data structure element. Thereafter, another thread or application may use hardware transactions to access the data structure element while confirming that the data structure element is still part of the dynamic data structure and/or that memory allocated to the data structure element has not been freed. Various indicators may be used to determine whether memory allocated to the element can be freed.

Proceedings Article•DOI•

Atomic boxes: coordinated exception handling with transactional memory

Derin Harmanci, Vincent Gramoli¹, Pascal Felber•Institutions (1)

École Polytechnique Fédérale de Lausanne¹

25 Jul 2011

TL;DR: Evaluation of a Java language extension for coordinated exception handling where a named abox (atomic box) is used to demarcate a region of code that must execute atomically and in isolation indicates that, in addition to enabling recovery, an atomic box executes a reasonably small area of code twice as fast as when using a failbox.

Abstract: In concurrent programs, raising an exception in one thread does not prevent others from operating on an inconsistent shared state. Instead, exceptions should ideally be handled in coordination by all the threads that are affected by their cause. In this paper, we propose a Java language extension for coordinated exception handling where a named abox (atomic box) is used to demarcate a region of code that must execute atomically and in isolation. Upon an exception raised inside an abox, threads executing in dependent aboxes roll back their changes and execute their recovery handler in coordination. We provide a dedicated compiler framework, CXH, to evaluate our atomic box construct experimentally. Our evaluation indicates that, in addition to enabling recovery, an atomic box executes a reasonably small region of code twice as fast as when using a failbox, the existing coordination alternative that has no recovery support.

Proceedings Article•DOI•

SoC-TM: integrated HW/SW support for transactional memory programming on embedded MPSoCs

Cesare Ferri¹, Andrea Marongiu², Benjamin Lipton³, R. Iris Bahar¹, Tali Moreshet³, Luca Benini², Maurice Herlihy¹•Institutions (3)

Brown University¹, University of Bologna², Swarthmore College³

09 Oct 2011

TL;DR: This proposal leverages a Hardware Transactional Memory (HTM) design, based on a dedicated HW module for conflict management, whose functionality is exposed to the software through compiler directives, implemented as an extension to the popular OpenMP programming model.

Abstract: Two overriding concerns in the development of embedded MPSoCs are ease of programming and hardware complexity. In this paper we present SoC-TM, an integrated HW/SW solution for transactional programming on embedded MPSoCs. Our proposal leverages a Hardware Transactional Memory (HTM) design, based on a dedicated HW module for conflict management, whose functionality is exposed to the software through compiler directives, implemented as an extension to the popular OpenMP programming model. To further improve ease of programming, our framework supports speculative parallelism, thanks to the ability to enforce a given commit order in hardware. Our experimental results confirm that SoC-TM is a viable and cost-effective solution for embedded MPSoCs, in terms of energy, performance and productivity.
