Showing papers on "Software transactional memory published in 2007"


Book
12 Jan 2007
TL;DR: This book presents an overview of the state of the art in the design and implementation of transactional memory systems, as of early summer 2006.
Abstract: The advent of multicore processors has renewed interest in the idea of incorporating transactions into the programming model used to write parallel programs. This approach, known as transactional memory, offers an alternative, and hopefully better, way to coordinate concurrent threads. The ACI (atomicity, consistency, isolation) properties of transactions provide a foundation to ensure that concurrent reads and writes of shared data do not produce inconsistent or incorrect results. At a higher level, a computation wrapped in a transaction executes atomically – either it completes successfully and commits its result in its entirety or it aborts. In addition, isolation ensures the transaction produces the same result as if no other transactions were executing concurrently. Although transactions are not a parallel programming panacea, they shift much of the burden of synchronizing and coordinating parallel computations from a programmer to a compiler, runtime system, and hardware. The challenge for the system implementers is to build an efficient transactional memory infrastructure. This book presents an overview of the state of the art in the design and implementation of transactional memory systems, as of early summer 2006.
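
To make the programming model concrete, the sketch below shows a transfer between two accounts expressed as an atomic block. It uses GCC's experimental __transaction_atomic syntax (enabled with -fgnu-tm) as an assumption; the example and names are illustrative and do not come from the book.

    // Minimal sketch of the transactional programming model (not from the book).
    // Compile with: g++ -fgnu-tm -pthread sketch.cpp  (GCC's experimental TM support)
    #include <thread>
    #include <vector>
    #include <cstdio>

    static int balance_a = 1000;
    static int balance_b = 0;

    // Transfer money atomically: the whole block commits or aborts as a unit,
    // and appears isolated from concurrently executing transactions.
    void transfer(int amount) {
        __transaction_atomic {
            balance_a -= amount;
            balance_b += amount;
        }
    }

    int main() {
        std::vector<std::thread> workers;
        for (int i = 0; i < 4; ++i)
            workers.emplace_back([] { for (int j = 0; j < 1000; ++j) transfer(1); });
        for (auto& t : workers) t.join();
        // Invariant preserved: the total is unchanged despite concurrent updates.
        std::printf("a+b = %d\n", balance_a + balance_b);
    }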

442 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: For certain workloads, SigTM can match the performance of a full-featured hardware TM system, while for workloads with large read-sets it can be up to two times slower.
Abstract: We propose signature-accelerated transactional memory (SigTM), a hybrid TM system that reduces the overhead of software transactions. SigTM uses hardware signatures to track the read-set and write-set for pending transactions and perform conflict detection between concurrent threads. All other transactional functionality, including data versioning, is implemented in software. Unlike previously proposed hybrid TM systems, SigTM requires no modifications to the hardware caches, which reduces hardware cost and simplifies support for nested transactions and multithreaded processor cores. SigTM is also the first hybrid TM system to provide strong isolation guarantees between transactional blocks and non-transactional accesses without additional read and write barriers in non-transactional code. Using a set of parallel programs that make frequent use of coarse-grain transactions, we show that SigTM accelerates software transactions by 30% to 280%. For certain workloads, SigTM can match the performance of a full-featured hardware TM system, while for workloads with large read-sets it can be up to two times slower. Overall, we show that SigTM combines the performance characteristics and strong isolation guarantees of hardware TM implementations with the low cost and flexibility of software TM systems.
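
As a rough software analogue of the hardware signatures described above, the sketch below tracks read and write sets in fixed-size Bloom-filter bit sets and declares a (possibly false-positive) conflict when they intersect. The hash functions, signature width, and address granularity are illustrative assumptions, not SigTM's actual design.

    // Sketch: Bloom-filter-style read/write signatures for conflict detection.
    #include <bitset>
    #include <cstdint>
    #include <cstdio>

    struct Signature {
        static constexpr std::size_t kBits = 1024;
        std::bitset<kBits> bits;

        void insert(const void* addr) {
            auto a = reinterpret_cast<std::uintptr_t>(addr) >> 3;  // coarse granularity
            bits.set(a % kBits);
            bits.set((a * 2654435761u) % kBits);  // second hash bit, Bloom-filter style
        }
        bool intersects(const Signature& other) const {
            return (bits & other.bits).any();     // possible conflict (may be a false positive)
        }
    };

    int main() {
        int x = 0, y = 0;
        Signature writerWrites, readerReads;
        writerWrites.insert(&x);   // transaction A wrote x
        readerReads.insert(&y);    // transaction B read y
        std::printf("conflict? %d\n", writerWrites.intersects(readerReads));  // likely 0
        readerReads.insert(&x);    // B also read x
        std::printf("conflict? %d\n", writerWrites.intersects(readerReads));  // 1
    }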

340 citations


Journal ArticleDOI
TL;DR: This article presents three APIs which make it easier to develop nonblocking implementations of arbitrary data structures, and compares the performance of the resulting implementations against one another and against high-performance lock-based systems.
Abstract: Mutual exclusion locks remain the de facto mechanism for concurrency control on shared-memory data structures. However, their apparent simplicity is deceptive: it is hard to design scalable locking strategies because locks can harbor problems such as priority inversion, deadlock, and convoying. Furthermore, scalable lock-based systems are not readily composable when building compound operations. In looking for solutions to these problems, interest has developed in nonblocking systems which have promised scalability and robustness by eschewing mutual exclusion while still ensuring safety. However, existing techniques for building nonblocking systems are rarely suitable for practical use, imposing substantial storage overheads, serializing nonconflicting operations, or requiring instructions not readily available on today's CPUs. In this article we present three APIs which make it easier to develop nonblocking implementations of arbitrary data structures. The first API is a multiword compare-and-swap operation (MCAS) which atomically updates a set of memory locations. This can be used to advance a data structure from one consistent state to another. The second API is a word-based software transactional memory (WSTM) which can allow sequential code to be reused more directly than with MCAS and which provides better scalability when locations are being read rather than being updated. The third API is an object-based software transactional memory (OSTM). OSTM allows a simpler implementation than WSTM, but at the cost of reengineering the code to use OSTM objects. We present practical implementations of all three of these APIs, built from operations available across all of today's major CPU families. We illustrate the use of these APIs by using them to build highly concurrent skip lists and red-black trees. We compare the performance of the resulting implementations against one another and against high-performance lock-based systems. These results demonstrate that it is possible to build useful nonblocking data structures with performance comparable to, or better than, sophisticated lock-based designs.
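
The MCAS interface can be sketched as follows. This toy version serializes through a single mutex purely to illustrate the semantics (all-or-nothing update of several words); the paper's contribution is a lock-free implementation built from single-word compare-and-swap, which this sketch does not attempt.

    // Toy illustration of an MCAS-style interface (semantics only; a global
    // mutex stands in for the paper's lock-free protocol).
    #include <cstddef>
    #include <mutex>
    #include <cstdio>

    struct McasEntry { long* addr; long expected; long desired; };

    static std::mutex g_mcas_lock;  // stand-in for the real lock-free protocol

    // Atomically: if every location holds its expected value, install all
    // desired values and return true; otherwise change nothing.
    bool mcas(McasEntry* entries, std::size_t n) {
        std::lock_guard<std::mutex> g(g_mcas_lock);
        for (std::size_t i = 0; i < n; ++i)
            if (*entries[i].addr != entries[i].expected) return false;
        for (std::size_t i = 0; i < n; ++i)
            *entries[i].addr = entries[i].desired;
        return true;
    }

    int main() {
        long head = 1, tail = 2;
        McasEntry update[] = { { &head, 1, 10 }, { &tail, 2, 20 } };
        std::printf("mcas ok: %d, head=%ld tail=%ld\n", mcas(update, 2), head, tail);
    }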

283 citations


Proceedings ArticleDOI
21 Mar 2007
TL;DR: STMBench7 is presented: a candidate benchmark for evaluating STM implementations and illustrated with an evaluation of a well-known software transactional memory implementation.
Abstract: Software transactional memory (STM) is a promising technique for controlling concurrency in modern multi-processor architectures. STM aims to be more scalable than explicit coarse-grained locking and easier to use than fine-grained locking. However, STM implementations have yet to demonstrate that their runtime overheads are acceptable. To date, empirical evaluations of these implementations have suffered from the lack of realistic benchmarks. Measuring performance of an STM in an overly simplified setting can be at best uninformative and at worst misleading, as it may steer researchers to try to optimize irrelevant aspects of their implementations. This paper presents STMBench7: a candidate benchmark for evaluating STM implementations. The underlying data structure consists of a set of graphs and indexes intended to be suggestive of many complex applications, e.g., CAD/CAM. A collection of operations is supported to model a wide range of workloads and concurrency patterns. Companion locking strategies serve as a baseline for STM performance comparisons. STMBench7 strives for simplicity. Users may choose a workload, number of threads, benchmark length, as well as the possibility of structure modification and the nature of traversals of shared data structures. We illustrate the use of STMBench7 with an evaluation of a well-known software transactional memory implementation.

226 citations


Proceedings ArticleDOI
14 Mar 2007
TL;DR: New language constructs to support open nesting in Java are described, and it is demonstrated how these constructs can be mapped efficiently to existing STM data structures, demonstrating how open nesting can enhance application scalability.
Abstract: Transactional memory (TM) promises to simplify concurrent programming while providing scalability competitive to fine-grained locking. Language-based constructs allow programmers to denote atomic regions declaratively and to rely on the underlying system to provide transactional guarantees along with concurrency. In contrast with fine-grained locking, TM allows programmers to write simpler programs that are composable and deadlock-free. TM implementations operate by tracking loads and stores to memory and by detecting concurrent conflicting accesses by different transactions. By automating this process, they greatly reduce the programmer's burden, but they also are forced to be conservative. In certain cases, conflicting memory accesses may not actually violate the higher-level semantics of a program, and a programmer may wish to allow seemingly conflicting transactions to execute concurrently. Open nested transactions enable expert programmers to differentiate between physical conflicts, at the level of memory, and logical conflicts that actually violate application semantics. A TM system with open nesting can permit physical conflicts that are not logical conflicts, and thus increase concurrency among application threads. Here we present an implementation of open nested transactions in a Java-based software transactional memory (STM) system. We describe new language constructs to support open nesting in Java, and we discuss new abstract locking mechanisms that a programmer can use to prevent logical conflicts. We demonstrate how these constructs can be mapped efficiently to existing STM data structures. Finally, we evaluate our system on a set of Java applications and data structures, demonstrating how open nesting can enhance application scalability.

212 citations


Proceedings ArticleDOI
10 Jun 2007
TL;DR: The results on a set of Java programs show that strong atomicity can be implemented efficiently in a high-performance STM system and introduces a dynamic escape analysis that differentiates private and public data at runtime to make barriers cheaper and a static not-accessed-in-transaction analysis that removes many barriers completely.
Abstract: Transactional memory provides a new concurrency control mechanism that avoids many of the pitfalls of lock-based synchronization. High-performance software transactional memory (STM) implementations thus far provide weak atomicity: Accessing shared data both inside and outside a transaction can result in unexpected, implementation-dependent behavior. To guarantee isolation and consistent ordering in such a system, programmers are expected to enclose all shared-memory accesses inside transactions. A system that provides strong atomicity guarantees isolation even in the presence of threads that access shared data outside transactions. A strongly-atomic system also orders transactions with conflicting non-transactional memory operations in a consistent manner. In this paper, we discuss some surprising pitfalls of weak atomicity, and we present an STM system that avoids these problems via strong atomicity. We demonstrate how to implement non-transactional data accesses via efficient read and write barriers, and we present compiler optimizations that further reduce the overheads of these barriers. We introduce a dynamic escape analysis that differentiates private and public data at runtime to make barriers cheaper and a static not-accessed-in-transaction analysis that removes many barriers completely. Our results on a set of Java programs show that strong atomicity can be implemented efficiently in a high-performance STM system.

209 citations


Proceedings ArticleDOI
12 Aug 2007
TL;DR: It is argued that privatization comprises a pair of symmetric subproblems: private operations may fail to see updates made by transactions that have committed but not yet completed; conversely, transactions that are doomed but have not yet aborted may see updates made by private code, causing them to perform erroneous, externally visible operations.
Abstract: Early implementations of software transactional memory (STM) assumed that sharable data would be accessed only within transactions. Memory may appear inconsistent in programs that violate this assumption, even when program logic would seem to make extra-transactional accesses safe. Designing STM systems that avoid such inconsistency has been dubbed the privatization problem. We argue that privatization comprises a pair of symmetric subproblems: private operations may fail to see updates made by transactions that have committed but not yet completed; conversely, transactions that are doomed but have not yet aborted may see updates made by private code, causing them to perform erroneous, externally visible operations. We explain how these problems arise in different styles of STM, present strategies to address them, and discuss their implementation tradeoffs. We also propose a taxonomy of contracts between the system and the user, analogous to programmer-centric memory consistency models, which allow us to classify programs based on their privatization requirements. Finally, we present empirical comparisons of several privatization strategies. Our results suggest that the best strategy may depend on application characteristics.
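
The first subproblem shows up in the common privatization idiom sketched below: a transaction detaches a node from a shared list, after which the caller accesses it without barriers. The atomic-block syntax (GCC's -fgnu-tm) and the list structure are illustrative assumptions, not taken from the paper.

    // Sketch of the privatization idiom that triggers the problems described above.
    // Compile with: g++ -fgnu-tm privatize.cpp
    #include <cstdio>

    struct Node { int value; Node* next; };
    static Node  sentinel{0, nullptr};
    static Node* shared_head = &sentinel;

    Node* privatize_first() {
        Node* n;
        __transaction_atomic {
            n = shared_head->next;               // detach the first real node...
            if (n) shared_head->next = n->next;  // ...so no other transaction can reach it
        }
        // DANGER: a committed-but-incomplete writer, or a doomed transaction that
        // has not yet aborted, may still touch *n after we return it.
        return n;
    }

    int main() {
        Node a{42, nullptr};
        __transaction_atomic { a.next = shared_head->next; shared_head->next = &a; }
        Node* mine = privatize_first();
        if (mine) std::printf("privatized value: %d\n", mine->value);  // no barriers here
    }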

163 citations


Proceedings ArticleDOI
11 Mar 2007
TL;DR: This system is the first to demonstrate that transactions integrate well with an unmanaged language, and can perform as well as fine-grain locking while providing the programming ease of coarse-grain locking, even in an unmanaged environment.
Abstract: Transactional memory offers significant advantages for concurrency control compared to locks. This paper presents the design and implementation of transactional memory constructs in an unmanaged language. Unmanaged languages pose a unique set of challenges to transactional memory constructs - for example, lack of type and memory safety, use of function pointers, aliasing of local variables, and others. This paper describes novel compiler and runtime mechanisms that address these challenges and optimize the performance of transactions in an unmanaged environment. We have implemented these mechanisms in a production-quality C compiler and a high-performance software transactional memory runtime. We measure the effectiveness of these optimizations and compare the performance of lock-based versus transaction-based programming on a set of concurrent data structures and the SPLASH-2 benchmark suite. On a 16 processor SMP system, the transaction-based version of the SPLASH-2 benchmarks scales much better than the coarse-grain locking version and performs comparably to the fine-grain locking version. Compiler optimizations significantly reduce the overheads of transactional memory so that, on a single thread, the transaction-based version incurs only about 6.4% overhead compared to the lock-based version for the SPLASH-2 benchmark suite. Thus, our system is the first to demonstrate that transactions integrate well with an unmanaged language, and can perform as well as fine-grain locking while providing the programming ease of coarse-grain locking even in an unmanaged environment.

154 citations


Proceedings ArticleDOI
10 Feb 2007
TL;DR: This paper presents the first scalable TM implementation for directory-based distributed shared memory systems that is livelock free without the need for user-level intervention and is based on transactional coherence and consistency (TCC), which supports continuous transactions and fault isolation.
Abstract: Transactional memory (TM) provides mechanisms that promise to simplify parallel programming by eliminating the need for locks and their associated problems (deadlock, livelock, priority inversion, convoying). For TM to be adopted in the long term, not only does it need to deliver on these promises, but it needs to scale to a high number of processors. To date, proposals for scalable TM have relegated livelock issues to user-level contention managers. This paper presents the first scalable TM implementation for directory-based distributed shared memory systems that is livelock free without the need for user-level intervention. The design is a scalable implementation of optimistic concurrency control that supports parallel commits with a two-phase commit protocol, uses write-back caches, and filters coherence messages. The scalable design is based on transactional coherence and consistency (TCC), which supports continuous transactions and fault isolation. A performance evaluation of the design using both scientific and enterprise benchmarks demonstrates that the directory-based TCC design scales efficiently for NUMA systems up to 64 processors

148 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper describes how the Lazy Snapshot Algorithm, which forms the basis for the LSA-STM time-based software transactional memory, has to be changed to support these new time bases, and how the global counter can be replaced by an efficiently accessible external or physical clock, or by multiple synchronized physical clocks.
Abstract: Time-based transactional memories use time to reason about the consistency of data accessed by transactions and about the order in which transactions commit. They avoid the large read overhead of transactional memories that always check consistency when a new object is accessed, while still guaranteeing consistency at all times--in contrast to transactional memories that only check consistency on transaction commit. Current implementations of time-based transactional memories use a single global clock that is incremented by the commit operation for each update transaction that commits. In large systems with frequent commits, the contention on this global counter can thus become a major bottleneck. We present a scalable replacement for this global counter and describe how the Lazy Snapshot Algorithm (LSA), which forms the basis for our LSA-STM time-based software transactional memory, has to be changed to support these new time bases. In particular, we show how the global counter can be replaced (1) by an external or physical clock that can be accessed efficiently, and (2) by multiple synchronized physical clocks.
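
To make the bottleneck concrete, here is a minimal sketch of a time-based STM's commit path with a single global version clock, in the general TL2/LSA style; the data layout and memory orderings are illustrative, not LSA-STM's implementation.

    // Sketch of why a single global commit counter becomes a hot spot in
    // time-based STMs: every update transaction increments it at commit.
    #include <atomic>
    #include <cstdio>

    static std::atomic<unsigned long> global_clock{0};

    struct VersionedWord { std::atomic<unsigned long> version{0}; long value = 0; };

    // A read is consistent if the location's version is not newer than the
    // snapshot time the transaction started with.
    bool read_consistent(const VersionedWord& w, unsigned long snapshot) {
        return w.version.load(std::memory_order_acquire) <= snapshot;
    }

    void commit_write(VersionedWord& w, long v) {
        // The fetch_add below is the contended operation the paper proposes to
        // replace with external/physical clocks.
        unsigned long commit_time = global_clock.fetch_add(1) + 1;
        w.value = v;
        w.version.store(commit_time, std::memory_order_release);
    }

    int main() {
        VersionedWord x;
        unsigned long snapshot = global_clock.load();
        std::printf("consistent before: %d\n", read_consistent(x, snapshot));
        commit_write(x, 7);
        std::printf("consistent after:  %d\n", read_consistent(x, snapshot));  // stale snapshot
    }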

145 citations


Proceedings ArticleDOI
09 Jun 2007
TL;DR: This paper proposes OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache, and introduces the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible.
Abstract: Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs, allowing programmers to exploit more effectively the soon-to-be-ubiquitous multi-core designs. Several recent proposals have extended the original bounded transactional memory to unbounded transactional memory, a crucial step toward transactions becoming a general-purpose primitive. Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging, and as a result, many proposed implementations are complex. This paper explores a different approach. First, we introduce the permissions-only cache to extend the bound at which transactions overflow to allow the fast, bounded case to be used as frequently as possible. Second, we propose OneTM to simplify the implementation of unbounded transactional memory by bounding the concurrency of transactions that overflow the cache. These mechanisms work synergistically to provide a simple and fast unbounded transactional memory system. The permissions-only cache efficiently maintains the coherence permissions (but not data) for blocks read or written transactionally that have been evicted from the processor's caches. By holding coherence permissions for these blocks, the regular cache coherence protocol can be used to detect transactional conflicts using only a few bits of on-chip storage per overflowed cache block. OneTM allows only one overflowed transaction at a time, relying on the permissions-only cache to ensure that overflow is infrequent. We present two implementations. In OneTM-Serialized, an overflowed transaction simply stalls all other threads in the application. In OneTM-Concurrent, non-overflowed transactions and non-transactional code can execute concurrently with the overflowed transaction, providing more concurrency while retaining OneTM's core simplifying assumption.

Proceedings ArticleDOI
14 Oct 2007
TL;DR: TxLinux is a variant of Linux that is the first operating system to use hardware transactional memory (HTM) as a synchronization primitive and the first to manage HTM in the scheduler; the integration of transactions with the OS scheduler is discussed.
Abstract: TxLinux is a variant of Linux that is the first operating system to use hardware transactional memory (HTM) as a synchronization primitive, and the first to manage HTM in the scheduler. This paper describes and measures TxLinux and discusses two innovations in detail: cooperation between locks and transactions, and the integration of transactions with the OS scheduler. Mixing locks and transactions requires a new primitive, cooperative transactional spinlocks (cxspinlocks) that allow locks and transactions to protect the same data while maintaining the advantages of both synchronization primitives. Cxspinlocks allow the system to attempt execution of critical regions with transactions and automatically roll back to use locking if the region performs I/O. Integrating the scheduler with HTM eliminates priority inversion. On a series of real-world benchmarks TxLinux has similar performance to Linux, exposing concurrency with as many as 32 concurrent threads on 32 CPUs in the same critical region.
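
A user-space analogy to the cxspinlock idea, attempting a critical section transactionally and falling back to a real lock, can be sketched with Intel TSX intrinsics (compile with -mrtm; actually running it requires a TSX-capable CPU). This is only an illustration of the pattern, not TxLinux's in-kernel mechanism, and the fallback policy is deliberately simplistic.

    // Sketch: try the critical section as a hardware transaction; fall back to a
    // plain lock, e.g. after aborts or before doing I/O. Not TxLinux's code.
    #include <immintrin.h>
    #include <atomic>
    #include <cstdio>

    struct CxLock {
        std::atomic<int> locked{0};

        void acquire() {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                // Put the lock word in our read set: if a fallback holder grabs it,
                // our transaction aborts instead of racing with the holder.
                if (locked.load(std::memory_order_relaxed) == 0) return;  // elided path
                _xabort(0xff);
            }
            // Fallback path: take the real lock.
            int expected = 0;
            while (!locked.compare_exchange_weak(expected, 1)) expected = 0;
        }
        void release() {
            if (locked.load(std::memory_order_relaxed) == 0)
                _xend();                                     // transactional path
            else
                locked.store(0, std::memory_order_release);  // fallback path
        }
    };

    static CxLock section;
    static int counter = 0;

    int main() {
        section.acquire();
        ++counter;            // critical region; would fall back to the lock on I/O
        section.release();
        std::printf("counter = %d\n", counter);
    }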

01 Jan 2007
TL;DR: Phased Transactional Memory (PhTM) is introduced, which supports switching between different “phases”, each implemented by a different form of transactional memory support, and can match the performance and scalability of unbounded HTM implementations better than the previous HyTM prototype.
Abstract: Hybrid transactional memory (HyTM) [3] works in today's systems, and can use future "best effort" hardware transactional memory (HTM) support to improve performance. Best effort HTM can be substantially simpler than alternative "unbounded" HTM designs being proposed in the literature, so HyTM both supports and encourages an incremental approach to adopting HTM. We introduce Phased Transactional Memory (PhTM), which supports switching between different "phases", each implemented by a different form of transactional memory support. This allows us to adapt between a variety of different transactional memory implementations according to the current environment and workload. We describe a simple PhTM prototype, and present experimental results showing that PhTM can match the performance and scalability of unbounded HTM implementations better than our previous HyTM prototype when best effort HTM support is available and effective, and is more competitive with state-of-the-art software transactional memory implementations when it is not.

Patent
01 May 2007
TL;DR: In this article, methods and systems for managing transactional memory allocations and deallocations while in transactional code, including nested transactional codes, are described and claimed. But they do not specify how to manage transactional data structures.
Abstract: Methods and systems are provided for managing memory allocations and deallocations while in transactional code, including nested transactional code. The methods and systems manage transactional memory operations by using identifiers, such as sequence numbers, to handle memory management in transactions. The methods and systems also maintain lists of deferred actions to be performed at transaction abort and commit times. A number of memory management routines associated with one or more transactions examine the transaction sequence number of the current transaction, manipulate commit and/or undo logs, and set/use the transaction sequence number of an associated object, but are not so limited. The methods and systems provide for memory allocation and deallocations within transactional code while preserving transactional semantics. Other embodiments are described and claimed.

Proceedings ArticleDOI
09 Jun 2007
TL;DR: This work describes an alert on update mechanism (AOU) that allows a thread to receive fast, asynchronous notification when previously-identified lines are written by other threads, and a programmable data isolation mechanism (PDI) that allows a thread to hide its speculative writes from other threads, ignoring conflicts, until software decides to make them visible.
Abstract: There has been considerable recent interest in both hardware and software transactional memory (TM). We present an intermediate approach, in which hardware serves to accelerate a TM implementation controlled fundamentally by software. Specifically, we describe an alert on update mechanism (AOU) that allows a thread to receive fast, asynchronous notification when previously-identified lines are written by other threads, and a programmable data isolation mechanism (PDI) that allows a thread to hide its speculative writes from other threads, ignoring conflicts, until software decides to make them visible. These mechanisms reduce bookkeeping, validation, and copying overheads without constraining software policy on a host of design decisions. We have used AOU and PDI to implement a hardware-accelerated software transactional memory system we call RTM. We have also used AOU alone to create a simpler "RTM-Lite". Across a range of microbenchmarks, RTM outperforms RSTM, a publicly available software transactional memory system, by as much as 8.7x (geometric mean of 3.5x) in single-thread mode. At 16 threads, it outperforms RSTM by as much as 5x, with an average speedup of 2x. Performance degrades gracefully when transactions overflow hardware structures. RTM-Lite is slightly faster than RTM for transactions that modify only small objects; full RTM is significantly faster when objects are large. In a strong argument for policy flexibility, we find that the choice between eager (first-access) and lazy (commit-time) conflict detection can lead to significant performance differences in both directions, depending on application characteristics.

Journal ArticleDOI
TL;DR: The heart of the design is a new cache-coherence protocol, called the Ballistic protocol, for tracking and moving up-to-date copies of cached objects, which has stretch logarithmic in the diameter of the network.
Abstract: Transactional Memory is a concurrent programming API in which concurrent threads synchronize via transactions (instead of locks). Although this model has mostly been studied in the context of multiprocessors, it has attractive features for distributed systems as well. In this paper, we consider the problem of implementing transactional memory in a network of nodes where communication costs form a metric. The heart of our design is a new cache-coherence protocol, called the Ballistic protocol, for tracking and moving up-to-date copies of cached objects. For constant-doubling metrics, a broad class encompassing both Euclidean spaces and growth-restricted networks, this protocol has stretch logarithmic in the diameter of the network.

Patent
06 Feb 2007
TL;DR: In this article, the authors present a method and apparatus for accelerating transactional execution by using hardware support to determine if an access is the first access to a shared memory line during a pendancy of a transaction.
Abstract: A method and apparatus for accelerating transactional execution. Barriers associated with shared memory lines referenced by memory accesses within a transaction are only invoked/executed the first time the shared memory lines are accessed within a transaction. Hardware support, such as a transaction field/transaction bits, is provided to determine if an access is the first access to a shared memory line during the pendency of a transaction. Additionally, in an aggressive operational mode, version numbers representing versions of elements stored in shared memory lines are not stored and validated upon commitment, to save on validation costs. Moreover, even in a cautious mode that stores version numbers to enable validation, validation costs may not be incurred if eviction of accessed shared memory lines does not occur during execution of the transaction.

Proceedings ArticleDOI
15 Sep 2007
TL;DR: JudoSTM is presented, a novel dynamic binary-rewriting approach to implementing STM that supports C and C++ code and significantly lowers overhead through several novel optimizations that improve the quality of rewritten code and reduce the cost of conflict detection and buffering.
Abstract: With the advent of chip-multiprocessors, we are faced with the challenge of parallelizing performance-critical software. Transactional memory (TM) has emerged as a promising programming model allowing programmers to focus on parallelism rather than maintaining correctness and avoiding deadlock. Many implementations of hardware, software, and hybrid support for TM have been proposed; of these, software-only implementations (STMs) are especially compelling since they can be used with current commodity hardware. However, in addition to higher overheads, many existing STM systems are limited to either managed languages or intrusive APIs. Furthermore, transactions in STMs cannot normally contain calls to unobservable code such as shared libraries or system calls. In this paper we present JudoSTM, a novel dynamic binary-rewriting approach to implementing STM that supports C and C++ code. Furthermore, by using value-based conflict detection, JudoSTM additionally supports the transactional execution of both (i) irreversible system calls and (ii) library functions that may contain locks. We significantly lower overhead through several novel optimizations that improve the quality of rewritten code and reduce the cost of conflict detection and buffering. We show that our approach performs comparably to Rochester's RSTM library-based implementation, demonstrating that a dynamic binary-rewriting approach to implementing STM is an interesting alternative.
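
Value-based conflict detection, which is what lets JudoSTM tolerate code it cannot instrument precisely, can be sketched as logging (address, value) pairs at read time and re-checking the values at commit. The log layout and the mutex standing in for the commit machinery are illustrative assumptions.

    // Sketch of value-based conflict detection: record the value observed at
    // each read and re-validate by value at commit time. Illustrative only.
    #include <vector>
    #include <mutex>
    #include <cstdio>

    struct ReadLogEntry { const long* addr; long seen; };

    struct Transaction {
        std::vector<ReadLogEntry> readLog;

        long read(const long& loc) {
            long v = loc;
            readLog.push_back({ &loc, v });
            return v;
        }
        // Commit succeeds only if every location still holds the value we read;
        // intermediate writes that restore the value are tolerated (only values,
        // not versions, matter here).
        bool validate() const {
            for (const auto& e : readLog)
                if (*e.addr != e.seen) return false;
            return true;
        }
    };

    static std::mutex commit_lock;   // stand-in for the real commit machinery

    int main() {
        long shared = 5;
        Transaction tx;
        long v = tx.read(shared);
        shared = 6;                  // concurrent writer (simulated)
        std::lock_guard<std::mutex> g(commit_lock);
        std::printf("read %ld, commit %s\n", v, tx.validate() ? "ok" : "aborts");
    }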

Proceedings ArticleDOI
10 Jun 2007
TL;DR: In this article, the authors combine two seemingly opposed programming models for building massively concurrent network services: the event-driven model and the multithreaded model, and show how the hybrid model can be implemented entirely at the application level using concurrency monads in Haskell.
Abstract: This paper proposes to combine two seemingly opposed programming models for building massively concurrent network services: the event-driven model and the multithreaded model. The result is a hybrid design that offers the best of both worlds--the ease of use and expressiveness of threads and the flexibility and performance of events. This paper shows how the hybrid model can be implemented entirely at the application level using concurrency monads in Haskell, which provides type-safe abstractions for both events and threads. This approach simplifies the development of massively concurrent software in a way that scales to real-world network services. The Haskell implementation supports exceptions, symmetrical multiprocessing, software transactional memory, asynchronous I/O mechanisms and application-level network protocol stacks. Experimental results demonstrate that this monad-based approach has good performance: the threads are extremely lightweight (scaling to ten million threads), and the I/O performance compares favorably to that of Linux NPTL. Such services must handle tens of thousands of simultaneous, mostly-idle client connections; massively concurrent programs of this kind are difficult to implement, especially when other requirements, such as high performance and strong security, must also be met.

Proceedings ArticleDOI
11 Mar 2007
TL;DR: This paper created the transactional locking (TL) framework of STM algorithms and used it to conduct a range of comparisons of the performance of non-blocking, lock-based, and hybrid STM algorithms versus fine-grained hand-crafted ones.
Abstract: There has been a flurry of recent work on the design of high performance software and hybrid hardware/software transactional memories (STMs and HyTMs). This paper reexamines the design decisions behind several of these state-of-the-art algorithms, adopting some ideas, rejecting others, all in an attempt to make STMs faster. We created the transactional locking (TL) framework of STM algorithms and used it to conduct a range of comparisons of the performance of non-blocking, lock-based, and hybrid STM algorithms versus fine-grained hand-crafted ones. We were able to make several illuminating observations regarding lock acquisition order, the interaction of STMs with memory management schemes, and the role of overheads and abort rates in STM performance.
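
One of the design decisions the TL framework examines, lock acquisition order, can be illustrated by acquiring per-stripe commit locks in a single global order so that concurrently committing transactions cannot deadlock. The lock table and striping below are illustrative, not TL's actual layout.

    // Sketch of commit-time lock acquisition in a fixed global (stripe-index)
    // order, one of the design decisions examined by the TL framework.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <mutex>
    #include <vector>

    static std::mutex lock_table[256];

    static std::size_t stripe_index(const void* addr) {
        return (reinterpret_cast<std::uintptr_t>(addr) >> 4) % 256;
    }

    // Acquire the stripes covering the write set in ascending index order, so
    // every committer uses the same order and cannot deadlock with another.
    std::vector<std::size_t> lock_write_set(const std::vector<void*>& writeSet) {
        std::vector<std::size_t> stripes;
        for (void* a : writeSet) stripes.push_back(stripe_index(a));
        std::sort(stripes.begin(), stripes.end());
        stripes.erase(std::unique(stripes.begin(), stripes.end()), stripes.end());
        for (std::size_t s : stripes) lock_table[s].lock();
        return stripes;
    }

    void unlock_write_set(const std::vector<std::size_t>& stripes) {
        for (auto it = stripes.rbegin(); it != stripes.rend(); ++it)
            lock_table[*it].unlock();
    }

    int main() {
        long x = 0, y = 0;
        auto held = lock_write_set({ static_cast<void*>(&x), static_cast<void*>(&y) });
        x = 1; y = 2;              // write back while holding the stripe locks
        unlock_write_set(held);
    }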

Patent
20 Jun 2007
TL;DR: In this paper, a method and apparatus for virtualizing and/or extending transactional memory is described, where transactions are executed using local shared transactional memories, such as a cache memory.
Abstract: A method and apparatus for virtualizing and/or extending transactional memory is described herein. Transactions are executed using local shared transactional memory, such as a cache memory. Upon overflowing the shared transactional memory, the transactional memory is virtualized and/or extended into a higher-level memory, such as a system memory. Upon an overflow event, such as an eviction of a cache line previously accessed during a currently pending transaction, an overflow flag is set to notify processors/cores that the transactional memory is to be virtualized in a global overflow table. A base address of the global overflow table is also potentially stored to reference the base of the global overflow table in the higher-level memory.

Proceedings ArticleDOI
21 Mar 2007
TL;DR: Using Whodunit, the prototype transactional profiler, the authors are able to track and profile transactions that flow through shared memory, events, stages or via message passing, and measure the interference among concurrent transactions.
Abstract: This paper is concerned with performance debugging of multi-tier applications, such as commonly found in servers and dynamic-content web sites. Existing tools and techniques for profiling such applications are not general enough to track and profile transactions in a generic multi-tier application. We propose transactional profiling that provides a general solution to this problem. We provide novel algorithms and techniques to track and profile transactions that flow through shared memory, events, stages or via interprocess communication using messages. We also measure interference among concurrent transactions. We describe the design and implementation of Whodunit, our prototype transactional profiler. We demonstrate the correctness of our proposed algorithm for tracking transaction flow through shared memory using Apache and MySQL. Using Whodunit we are able to track and profile transactions that flow through shared memory, events, stages or via message passing, and measure the interference among concurrent transactions. We illustrate the use of Whodunit in obtaining the transactional profile of web servers, a web proxy cache and a bookstore application.

Michael Isard, Andrew Birrell
07 May 2007
TL;DR: This work proposes a new concurrent programming model, Automatic Mutual Exclusion (AME), which favors correctness over performance for simple programs, while allowing advanced programmers the expressivity they need.
Abstract: We propose a new concurrent programming model, Automatic Mutual Exclusion (AME). In contrast to lock-based programming, and to other programming models built over software transactional memory (STM), we arrange that all shared state is implicitly protected unless the programmer explicitly specifies otherwise. An AME program is composed from serializable atomic fragments. We include features allowing the programmer to delimit and manage the fragments to achieve appropriate program structure and performance. We explain how I/O activity and legacy code can be incorporated within an AME program. Finally, we outline ways in which future work might expand on these ideas. The resulting programming model makes it easier to write correct code than incorrect code. It favors correctness over performance for simple programs, while allowing advanced programmers the expressivity they need.

Proceedings ArticleDOI
14 Mar 2007
TL;DR: The results show that easier-to-use long transactions can still allow programs to deliver scalable performance by simply wrapping existing data structures with transactional collection classes, without the need for custom implementations or knowledge of data structure internals.
Abstract: While parallel programmers find it easier to reason about large atomic regions, the conventional mutual exclusion-based primitives for synchronization force them to interleave many small operations to achieve performance. Transactional memory promises that programmers can use large atomic regions while achieving similar performance. However, these large transactions can conflict when operating on shared data structures, even for logically independent operations. Transactional collection classes address this problem by allowing long-running transactions to operate on shared data while eliminating unnecessary conflicts. Transactional collection classes wrap existing data structures, without the need for custom implementations or knowledge of data structure internals. Without transactional collection classes, access to shared data from within long-running transactions can suffer from data dependency conflicts that are logically unnecessary, but are artifacts of the data structure implementation such as hash table collisions or tree-balancing rotations. Our transactional collection classes use the concept of semantic concurrency control to eliminate these unnecessary data dependencies, replacing them with conflict detection based on the operations of the abstract data type. The design and behavior of these transactional collection classes is discussed with reference to the related work from the database community, such as multi-level transactions and semantic concurrency control, as well as other concurrent data structures such as java.util.concurrent. The transactional semantics needed for implementing transactional collection classes are enumerated, including open-nested transactions and commit and abort handlers. We also discuss how isolation can be reduced for greater concurrency. Finally, we provide guidelines on the construction of classes that preserve isolation and serializability. The performance of these classes is evaluated with a number of benchmarks, including targeted micro-benchmarks and a version of SPECjbb2000 with increased contention. The results show that easier-to-use long transactions can still allow programs to deliver scalable performance by simply wrapping existing data structures with transactional collection classes.

Proceedings ArticleDOI
27 Sep 2007
TL;DR: An open-source implementation of Delaunay triangulation that uses transactions as one component of a larger parallelization strategy, and employs one of the fastest known sequential algorithms to triangulate geometrically partitioned regions in parallel.
Abstract: Transactional memory has been widely hailed as a simpler alternative to locks in multithreaded programs, but few nontrivial transactional programs are currently available. We describe an open-source implementation of Delaunay triangulation that uses transactions as one component of a larger parallelization strategy. The code is written in C++, for use with the RSTM software transactional memory library (also open source). It employs one of the fastest known sequential algorithms to triangulate geometrically partitioned regions in parallel; it then employs alternating, barrier-separated phases of transactional and partitioned work to stitch those regions together. Experiments on multiprocessor and multicore machines confirm excellent single-thread performance and good speedup with increasing thread count. Since execution time is dominated by geometrically partitioned computation, performance is largely insensitive to the overhead of transactions, but highly sensitive to any costs imposed on sharable data that are currently "privatized".

Patent
Adam Welc, Ali-Reza Adl-Tabatabai
27 Dec 2007
TL;DR: In this article, a hybrid transactional memory system is described, where a main thread can directly update memory locations, while a helper thread's transactional writes are buffered to ensure they do not invalidate transactional reads of the main thread.
Abstract: A method and apparatus for a hybrid transactional memory system is herein described. A first transaction is executed utilizing a first style of a transactional memory system and a second transaction is executed in parallel utilizing a second style of a transactional memory system. For example, a main thread is executed utilizing an update-in-place Software Transactional Memory (STM) system while a parallel thread, such as a helper thread, is executed utilizing a write-buffering STM. As a result, a main thread may directly update memory locations, while a helper thread's transactional writes are buffered to ensure they do not invalidate transactional reads of the main thread. Therefore, parallel execution of threads is achieved, while ensuring at least one thread, such as a main thread, does not degrade below an amount of execution cycles it would take to execute the main thread serially.

Proceedings ArticleDOI
16 Apr 2007
TL;DR: The authors have mapped ATLAS to the BEE2 multi-FPGA board to create a full-system prototype that operates at 100MHz, boots Linux, and provides significant performance and ease-of-use benefits for a range of parallel applications.
Abstract: Chip-multiprocessors are quickly becoming popular in embedded systems. However, the practical success of CMPs strongly depends on addressing the difficulty of multithreaded application development for such systems. Transactional memory (TM) promises to simplify concurrency management in multithreaded applications by allowing programmers to specify coarse-grain parallel tasks, while achieving performance comparable to fine-grain lock-based applications. This paper presents ATLAS, the first prototype of a CMP with hardware support for transactional memory. ATLAS includes 8 embedded PowerPC cores that access coherent shared memory in a transactional manner. The data cache for each core is modified to support the speculative buffering and conflict detection necessary for transactional execution. The authors have mapped ATLAS to the BEE2 multi-FPGA board to create a full-system prototype that operates at 100MHz, boots Linux, and provides significant performance and ease-of-use benefits for a range of parallel applications. Overall, the ATLAS prototype provides an excellent framework for further research on the software and hardware techniques necessary to deliver on the potential of transactional memory

Patent
27 Jun 2007
TL;DR: In this article, a fine-grained filtering in a hardware accelerated software transactional memory system is described, where a data object, which may have any arbitrary size, is associated with a filter word and the filter word is in a first default state when no access, such as a read, from the data object has occurred during a pendancy of a transaction.
Abstract: A method and apparatus for fine-grained filtering in a hardware accelerated software transactional memory system is herein described. A data object, which may have any arbitrary size, is associated with a filter word. The filter word is in a first default state when no access, such as a read, from the data object has occurred during the pendency of a transaction. Upon encountering a first access, such as a first read, from the data object, access barrier operations including an ephemeral/private store operation to set the filter word to a second state are performed. Upon a subsequent/redundant access, such as a second read, the access barrier operations are elided to accelerate the subsequent access, based on the filter word being set to the second state to indicate a previous access occurred.
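
In software terms, the mechanism reads roughly like the sketch below: a per-object filter word records whether the current transaction has already run the access barrier, and later accesses skip it. The flag encoding and the stand-in barrier are illustrative assumptions; the patent describes a hardware-assisted version.

    // Sketch: eliding redundant access barriers with a per-object filter word.
    // The first access in a transaction runs the full barrier and sets the
    // filter; later accesses in the same transaction skip the barrier.
    #include <cstdio>

    struct TxObject {
        long data = 0;
        unsigned filter = 0;   // id of the transaction that last filtered this object
    };

    static unsigned current_tx = 1;
    static int barrier_invocations = 0;

    void full_read_barrier(TxObject& o) {
        ++barrier_invocations;         // stand-in for logging/validation work
        o.filter = current_tx;         // mark: already handled in this transaction
    }

    long tx_read(TxObject& o) {
        if (o.filter != current_tx)    // first access in this transaction?
            full_read_barrier(o);
        return o.data;                 // subsequent reads skip the barrier
    }

    int main() {
        TxObject obj;
        tx_read(obj); tx_read(obj); tx_read(obj);
        std::printf("barriers run: %d\n", barrier_invocations);  // 1, not 3
        ++current_tx;                  // a new transaction must re-filter
        tx_read(obj);
        std::printf("barriers run: %d\n", barrier_invocations);  // 2
    }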

Proceedings ArticleDOI
12 Aug 2007
TL;DR: This work investigates the use of weaker semantics for TM and introduces a new consistency criterion that is called z-linearizability, which provides a good trade-off between strong semantics and good practical performance even for long transactions.
Abstract: The current generation of time-based transactional memories (TMs) has the advantage of being simple and efficient, and providing strong linearizability semantics. Linearizability matches well the goal of TM to simplify the design and implementation of concurrent applications. However, long transactions can have a much lower likelihood of committing than smaller transactions because of the strict ordering constraints imposed by linearizability. We investigate the use of weaker semantics for TM and introduce a new consistency criterion that we call z-linearizability. By combining properties of linearizability and serializability, z-linearizability provides a good trade-off between strong semantics and good practical performance even for long transactions.

01 Jan 2007
TL;DR: It is concluded that while the interface is a significant improvement on earlier efforts, and makes it practical for systems researchers to build nontrivial applications, it fails to realize the programming simplicity that was supposed to be the motivation for transactions in the first place.
Abstract: Like many past extensions to user programming models, transactions can be added to the programming language or implemented in a library using existing language features. We describe a library-based transactional memory API for C++. Designed to address the limitations of an earlier API with similar functionality, the new interface leverages macros, exceptions, multiple inheritance, generics (templates), and overloading of operators (including pointer dereference) in an attempt to minimize syntactic clutter, admit a wide variety of back-end implementations, avoid arbitrary restrictions on otherwise valid language constructs, enable privatization, catch as many programmer errors as possible, and provide semantics that “seem natural” to C++ programmers. Having used our API to construct several small and one large application, we conclude that while the interface is a significant improvement on earlier efforts, and makes it practical for systems researchers to build nontrivial applications, it fails to realize the programming simplicity that was supposed to be the motivation for transactions in the first place. Several groups have proposed compiler support as a way to improve the performance of transactions. We conjecture that compiler—and language—support will be even more important as a way to improve the programming model.
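
The flavor of API the paper describes, macros that delimit transactions and smart pointers that intercept dereference, can be sketched as below. The names and the trivial lock-based runtime standing in for the STM back end are illustrative assumptions, not the actual RSTM-derived interface.

    // Sketch of a macro-and-smart-pointer transactional API in the spirit the
    // paper describes; the "runtime" here is a single lock, not a real STM.
    #include <mutex>
    #include <cstdio>

    static std::recursive_mutex g_tx_runtime;   // stand-in for the STM back end

    #define BEGIN_TRANSACTION { std::lock_guard<std::recursive_mutex> _tx(g_tx_runtime);
    #define END_TRANSACTION   }

    // A transactional pointer: overloading operator-> is where a real
    // implementation would insert read/write barriers and versioning.
    template <typename T>
    class tx_ptr {
        T* p;
    public:
        explicit tx_ptr(T* raw) : p(raw) {}
        T* operator->() { /* read/write barrier would go here */ return p; }
    };

    struct Account { int balance = 100; };

    int main() {
        Account a;
        tx_ptr<Account> acct(&a);
        BEGIN_TRANSACTION
            acct->balance -= 30;    // dereference goes through the wrapper
        END_TRANSACTION
        std::printf("balance = %d\n", a.balance);
    }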