
Showing papers on "Transactional memory" published in 2008


Proceedings ArticleDOI
30 Sep 2008
TL;DR: This paper introduces the Stanford Transactional Applications for Multi-Processing (STAMP), a comprehensive benchmark suite for evaluating TM systems, and uses the suite to evaluate six different TM systems, identify their shortcomings, and motivate further research on their performance characteristics.
Abstract: Transactional Memory (TM) is emerging as a promising technology to simplify parallel programming. While several TM systems have been proposed in the research literature, we are still missing the tools and workloads necessary to analyze and compare the proposals. Most TM systems have been evaluated using microbenchmarks, which may not be representative of any real-world behavior, or individual applications, which do not stress a wide range of execution scenarios. We introduce the Stanford Transactional Applications for Multi-Processing (STAMP), a comprehensive benchmark suite for evaluating TM systems. STAMP includes eight applications and thirty variants of input parameters and data sets in order to represent several application domains and cover a wide range of transactional execution cases (frequent or rare use of transactions, large or small transactions, high or low contention, etc.). Moreover, STAMP is portable across many types of TM systems, including hardware, software, and hybrid systems. In this paper, we provide descriptions and a detailed characterization of the applications in STAMP. We also use the suite to evaluate six different TM systems, identify their shortcomings, and motivate further research on their performance characteristics.
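One way benchmark suites achieve this kind of portability is to hide the TM interface behind a thin macro layer that each backend (hardware, software, or hybrid) defines differently. The sketch below illustrates that pattern with hypothetical macro names and a trivial single-global-lock fallback; it is not STAMP's actual API.

```cpp
// Minimal sketch of a portable TM abstraction layer (hypothetical names,
// not STAMP's actual macros). Each TM backend would supply its own
// definitions; this fallback maps everything to one global mutex.
#include <mutex>

static std::mutex g_tm_fallback_lock;

#define TM_BEGIN()          g_tm_fallback_lock.lock()
#define TM_END()            g_tm_fallback_lock.unlock()
#define TM_READ(addr)       (*(addr))          // a backend may instrument reads
#define TM_WRITE(addr, val) (*(addr) = (val))  // a backend may buffer/log writes

// Example workload kernel written once against the abstraction:
static long shared_counter = 0;

void transfer(long amount) {
    TM_BEGIN();
    long v = TM_READ(&shared_counter);
    TM_WRITE(&shared_counter, v + amount);
    TM_END();
}

int main() {
    transfer(42);
    return shared_counter == 42 ? 0 : 1;
}
```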

934 citations


Proceedings ArticleDOI
20 Feb 2008
TL;DR: Opacity is defined as a property of concurrent transaction histories, its graph-theoretical interpretation is given, and it is proved that every single-version TM system that uses invisible reads and does not abort non-conflicting transactions requires, in the worst case, Ω(k) steps for an operation to terminate.
Abstract: Transactional memory (TM) is perceived as an appealing alternative to critical sections for general purpose concurrent programming. Despite the large amount of recent work on TM implementations, however, very little effort has been devoted to precisely defining what guarantees these implementations should provide. A formal description of such guarantees is necessary in order to check the correctness of TM systems, as well as to establish TM optimality results and inherent trade-offs. This paper presents opacity, a candidate correctness criterion for TM implementations. We define opacity as a property of concurrent transaction histories and give its graph theoretical interpretation. Opacity captures precisely the correctness requirements that have been intuitively described by many TM designers. Most TM systems we know of do ensure opacity. At a very first approximation, opacity can be viewed as an extension of the classical database serializability property with the additional requirement that even non-committed transactions are prevented from accessing inconsistent states. Capturing this requirement precisely, in the context of general objects, and without precluding pragmatic strategies that are often used by modern TM implementations, such as versioning, invisible reads, lazy updates, and open nesting, is not trivial. As a use case of opacity, we prove the first lower bound on the complexity of TM implementations. Basically, we show that every single-version TM system that uses invisible reads and does not abort non-conflicting transactions requires, in the worst case, Ω(k) steps for an operation to terminate, where k is the total number of objects shared by transactions. This (tight) bound precisely captures an inherent trade-off in the design of TM systems. The bound also highlights a fundamental gap between systems in which transactions can be fully isolated from the outside environment, e.g., databases or certain specialized transactional languages, and systems that lack such isolation capabilities, e.g., general TM frameworks.
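The practical reason opacity insists that even doomed transactions see consistent states is that a not-yet-aborted ("zombie") transaction can still perform arbitrary computation on what it has read. The fragment below is an illustrative example of mine, not from the paper: if a reader could observe a mix of old and new values that breaks the invariant x + y != 0, it could divide by zero before the TM system ever gets to abort it.

```cpp
// Illustrative example of why opacity matters (not from the paper).
// Invariant maintained by writers: x + y != 0.
#include <cassert>

int x = 1, y = 1;

// Writer transaction: preserves the invariant when executed atomically.
//   atomically { x = -x; y = -y; }   (conceptual transaction)
void writer() {
    x = -x;
    y = -y;
}

// Reader transaction: safe under opacity, unsafe if it may read an
// inconsistent snapshot (new x but old y), where x + y == 0.
int reader() {
    int sum = x + y;   // a zombie transaction might see 0 here
    return 100 / sum;  // ...and crash before it can be aborted
}

int main() {
    assert(x + y != 0);
    return reader() > 0 ? 0 : 1;
}
```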

500 citations


Proceedings ArticleDOI
20 Feb 2008
TL;DR: This paper investigates dynamic tuning mechanisms on a new time-based software transactional memory implementation and exhibits the benefits of dynamic tuning.
Abstract: The current generation of software transactional memories has the advantage of being simple and efficient. Nevertheless, there are several parameters that affect the performance of a transactional memory, for example the locality of the application and the cache line size of the processor. In this paper, we investigate dynamic tuning mechanisms on a new time-based software transactional memory implementation. We study in extensive measurements the performance of our implementation and exhibit the benefits of dynamic tuning. We compare our results with TL2, which is currently one of the fastest word-based software transactional memories.
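Time-based STMs of this kind (TL2 included) generally validate reads against a global version clock: a transaction samples the clock when it starts and must not use any location whose version stamp is newer than that sample. The sketch below shows only that validation rule, with invented names and a single versioned word; it is a rough illustration, not the authors' implementation.

```cpp
// Rough sketch of time-based read validation (invented names, one location).
#include <atomic>
#include <cstdio>

std::atomic<unsigned long> global_clock{0};

struct VersionedWord {
    std::atomic<unsigned long> version{0};
    std::atomic<long> value{0};
};

VersionedWord cell;

struct Tx {
    unsigned long start_time;
    void begin() { start_time = global_clock.load(); }

    // A read is valid only if the location was not written after this
    // transaction started; otherwise the transaction must abort and retry.
    bool read(VersionedWord& w, long& out) {
        unsigned long v1 = w.version.load();
        out = w.value.load();
        unsigned long v2 = w.version.load();
        return v1 == v2 && v1 <= start_time;
    }

    // Writer: advance the clock and stamp the location (commit-time, simplified).
    void write(VersionedWord& w, long val) {
        unsigned long t = global_clock.fetch_add(1) + 1;
        w.value.store(val);
        w.version.store(t);
    }
};

int main() {
    Tx reader_tx; reader_tx.begin();    // samples clock = 0
    Tx writer_tx; writer_tx.begin();
    writer_tx.write(cell, 7);           // clock -> 1, cell.version = 1
    long v;
    bool ok = reader_tx.read(cell, v);  // 1 > 0: the reader must retry
    std::printf("read %s\n", ok ? "valid" : "must retry");
}
```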

485 citations


Proceedings ArticleDOI
20 Feb 2008
TL;DR: A simple wrapper is defined for the linearizable implementation that guarantees that concurrent transactions without inherent conflicts can synchronize at the same granularity as the original linearizable implementation.
Abstract: We describe a methodology for transforming a large class of highly-concurrent linearizable objects into highly-concurrent transactional objects. As long as the linearizable implementation satisfies certain regularity properties (informally, that every method has an inverse), we define a simple wrapper for the linearizable implementation that guarantees that concurrent transactions without inherent conflicts can synchronize at the same granularity as the original linearizable implementation.
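A rough way to picture the wrapper: every successful method call on the linearizable object logs its inverse, the undo log is replayed if the enclosing transaction aborts, and locking at the object's own granularity keeps transactions with inherent conflicts apart. The sketch below uses invented names and a set as the underlying object; it is an approximation of the idea under those assumptions, not the paper's formal construction.

```cpp
// Rough sketch of wrapping a linearizable set for transactional use
// (invented names; undo log of inverse operations).
#include <functional>
#include <mutex>
#include <set>
#include <vector>

class TxSet {
    std::set<int> data_;                       // stand-in for a linearizable base object
    std::mutex lock_;                          // simplification of per-key abstract locks
    std::vector<std::function<void()>> undo_;  // inverses, replayed on abort

public:
    bool insert(int k) {
        std::lock_guard<std::mutex> g(lock_);
        bool added = data_.insert(k).second;
        if (added) undo_.push_back([this, k] { data_.erase(k); });   // inverse
        return added;
    }
    bool erase(int k) {
        std::lock_guard<std::mutex> g(lock_);
        bool removed = data_.erase(k) > 0;
        if (removed) undo_.push_back([this, k] { data_.insert(k); }); // inverse
        return removed;
    }
    void commit() { undo_.clear(); }           // keep effects, drop inverses
    void abort() {                             // roll back in reverse order
        for (auto it = undo_.rbegin(); it != undo_.rend(); ++it) (*it)();
        undo_.clear();
    }
};

int main() {
    TxSet s;
    s.insert(1);
    s.insert(2);
    s.abort();                  // both inserts are undone
    return s.insert(1) ? 0 : 1; // 1 is free again, so insert succeeds
}
```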

313 citations


Journal ArticleDOI
TL;DR: The promise of STM may be undermined by its overheads and limited workload applicability.
Abstract: TM (transactional memory) is a concurrency control paradigm that provides atomic and isolated execution for regions of code. TM is considered by many researchers to be one of the most promising solutions...

252 citations


Journal ArticleDOI
01 Jul 2008
TL;DR: Is TM the answer for improving parallel programming?
Abstract: Is TM the answer for improving parallel programming?

245 citations


Journal ArticleDOI
TL;DR: This paper presents a concurrency model, based on transactional memory, that offers far richer composition, and describes modular forms of blocking and choice that were inaccessible in earlier work.
Abstract: Writing concurrent programs is notoriously difficult and is of increasing practical importance. A particular source of concern is that even correctly implemented concurrency abstractions cannot be composed together to form larger abstractions. In this paper we present a concurrency model, based on transactional memory, that offers far richer composition. All the usual benefits of transactional memory are present (e.g., freedom from low-level deadlock), but in addition we describe modular forms of blocking and choice that were inaccessible in earlier work.

224 citations


Proceedings ArticleDOI
14 Jun 2008
TL;DR: This paper proposes a new paradigm called adaptive transaction scheduling, based on the parallelism feedback from applications, that dynamically dispatches and controls the number of concurrently executing transactions and significantly improves performance for both hardware and software transactional memory systems.
Abstract: Transactional memory systems are expected to enable parallel programming at lower programming complexity, while delivering improved performance over traditional lock-based systems. Nonetheless, there are certain situations where transactional memory systems could actually perform worse. Transactional memory systems can outperform locks only when the executing workloads contain sufficient parallelism. When the workload lacks inherent parallelism, launching excessive transactions can adversely degrade performance. These situations are likely to become dominant in future workloads when large-scale transactions are frequently executed. In this paper, we propose a new paradigm called adaptive transaction scheduling to address this issue. Based on the parallelism feedback from applications, our adaptive transaction scheduler dynamically dispatches and controls the number of concurrently executing transactions. In our case study, we show that our low-cost mechanism not only guarantees that hardware transactional memory systems perform no worse than a single global lock, but also significantly improves performance for both hardware and software transactional memory systems.
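The scheduling idea can be approximated by a cap on admitted transactions that shrinks when attempts abort and grows when they commit, so concurrency tracks the parallelism the workload actually sustains. The sketch below uses invented names and a deliberately crude feedback rule; it is not the paper's scheduler.

```cpp
// Crude sketch of feedback-driven transaction admission (invented names).
#include <algorithm>
#include <condition_variable>
#include <mutex>

class AdaptiveTxScheduler {
    std::mutex m_;
    std::condition_variable cv_;
    int limit_;        // max transactions allowed to run concurrently
    int running_ = 0;
    int max_limit_;

public:
    explicit AdaptiveTxScheduler(int max_threads)
        : limit_(max_threads), max_limit_(max_threads) {}

    // Called before a thread starts (or restarts) a transaction.
    void acquire() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return running_ < limit_; });
        ++running_;
    }

    // Feedback from the TM system after the attempt finishes.
    void release(bool committed) {
        std::unique_lock<std::mutex> lk(m_);
        --running_;
        if (committed)
            limit_ = std::min(max_limit_, limit_ + 1);  // workload has parallelism
        else
            limit_ = std::max(1, limit_ / 2);           // contention: throttle
        cv_.notify_all();
    }
};

int main() {
    AdaptiveTxScheduler sched(8);
    sched.acquire();
    sched.release(false);  // an abort halves the admission limit
    sched.acquire();
    sched.release(true);   // a commit lets the limit grow back
}
```

A thread would call acquire() before starting or retrying a transaction and release(committed) afterwards; when contention drives the limit down to one, execution degenerates to roughly the behavior of a single global lock, which matches the guarantee described above.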

201 citations


Journal ArticleDOI
01 Jun 2008
TL;DR: A new substrate for deterministic replay, called DeLorean, that provides substantial advances in recording speed, log size, and replay speed: processors execute blocks of instructions atomically, as in transactional memory or speculative multithreading, and the system only needs to record the commit order of these blocks.
Abstract: Support for deterministic replay of multithreaded execution can greatly help in finding concurrency bugs. For highest effectiveness, replay schemes should (i) record at production-run speed, (ii) keep their logging requirements minute, and (iii) replay at a speed similar to that of the initial execution. In this paper, we propose a new substrate for deterministic replay that provides substantial advances along these axes. In our proposal, processors execute blocks of instructions atomically, as in transactional memory or speculative multithreading, and the system only needs to record the commit order of these blocks. We call our scheme DeLorean. Our results show that DeLorean records execution at a speed similar to that of Release Consistency (RC) execution and replays at about 82% of its speed. In contrast, most current schemes only record at the speed of Sequential Consistency (SC) execution. Moreover, DeLorean only needs 7.5% of the log size needed by a state-of-the-art scheme. Finally, DeLorean can be configured to need only 0.6% of the log size of the state-of-the-art scheme at the cost of recording at 86% of RC’s execution speed — still faster than SC. In this configuration, the log of an 8-processor 5-GHz machine is estimated to be only about 20GB per day.

194 citations


Journal ArticleDOI
TL;DR: It is observed that the overall performance of TM is significantly worse at low levels of parallelism, which is likely to limit the adoption of this programming paradigm.
Abstract: TM (transactional memory) is a concurrency control paradigm that provides atomic and isolated execution for regions of code. TM is considered by many researchers to be one of the most promising solutions to address the problem of programming multicore processors. Its most appealing feature is that most programmers only need to reason locally about shared data accesses, mark the code region to be executed transactionally, and let the underlying system ensure the correct concurrent execution. This model promises to provide the scalability of fine-grain locking, while avoiding common pitfalls of lock composition such as deadlock. In this article we explore the performance of a highly optimized STM and observe that the overall performance of TM is significantly worse at low levels of parallelism, which is likely to limit the adoption of this programming paradigm.

177 citations


Proceedings ArticleDOI
14 Jun 2008
TL;DR: The RingSTM system is the first STM that is inherently livelock-free and privatization-safe while at the same time permitting parallel writeback by concurrent disjoint transactions.
Abstract: Existing Software Transactional Memory (STM) designs attach metadata to ranges of shared memory; subsequent runtime instructions read and update this metadata in order to ensure that an in-flight transaction's reads and writes remain correct. The overhead of metadata manipulation and inspection is linear in the number of reads and writes performed by a transaction, and involves expensive read-modify-write instructions, resulting in substantial overheads. We consider a novel approach to STM, in which transactions represent their read and write sets as Bloom filters, and transactions commit by enqueuing a Bloom filter onto a global list. Using this approach, our RingSTM system requires at most one read-modify-write operation for any transaction, and incurs validation overhead linear not in transaction size, but in the number of concurrent writers who commit. Furthermore, RingSTM is the first STM that is inherently livelock-free and privatization-safe while at the same time permitting parallel writeback by concurrent disjoint transactions. We evaluate three variants of the RingSTM algorithm, and find that it offers superior performance and/or stronger semantics than the state-of-the-art TL2 algorithm under a number of workloads.
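The data structure doing most of the work here is the Bloom-filter signature: a transaction summarizes its read and write sets as fixed-size bit vectors, and conflict detection becomes a bitwise intersection against the filters of writers that committed concurrently. The sketch below shows just such a signature type with invented names; the ring ordering, commit protocol, and writeback are omitted.

```cpp
// Minimal Bloom-filter signature sketch (invented names). False positives
// are possible and only cause unnecessary aborts, never missed conflicts.
#include <bitset>
#include <cstdint>
#include <functional>

class Signature {
    static constexpr std::size_t kBits = 1024;
    std::bitset<kBits> bits_;

    static std::size_t h1(const void* p) {
        return std::hash<const void*>{}(p) % kBits;
    }
    static std::size_t h2(const void* p) {
        return (std::hash<const void*>{}(p) * 0x9E3779B97F4A7C15ULL >> 32) % kBits;
    }

public:
    void add(const void* addr) { bits_.set(h1(addr)); bits_.set(h2(addr)); }
    bool intersects(const Signature& other) const {
        return (bits_ & other.bits_).any();  // possible conflict
    }
    void clear() { bits_.reset(); }
};

int main() {
    int x = 0, y = 0;
    Signature my_reads, committed_writes;
    my_reads.add(&x);
    committed_writes.add(&y);
    // Validation: if a committed writer's filter intersects my read filter,
    // this transaction must abort (or at least re-validate).
    return my_reads.intersects(committed_writes) ? 1 : 0;
}
```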

Proceedings ArticleDOI
07 Jan 2008
TL;DR: This paper develops semantics and type systems for the constructs of the Automatic Mutual Exclusion (AME) programming model and model STM systems that use in-place update, optimistic concurrency, lazy conflict detection, and roll-back.
Abstract: Software Transactional Memory (STM) is an attractive basis for the development of language features for concurrent programming. However, the semantics of these features can be delicate and problematic. In this paper we explore the tradeoffs between semantic simplicity, the viability of efficient implementation strategies, and the flexibility of language constructs. Specifically, we develop semantics and type systems for the constructs of the Automatic Mutual Exclusion (AME) programming model; our results apply also to other constructs, such as atomic blocks. With this semantics as a point of reference, we study several implementation strategies. We model STM systems that use in-place update, optimistic concurrency, lazy conflict detection, and roll-back. These strategies are correct only under non-trivial assumptions that we identify and analyze. One important source of errors is that some efficient implementations create dangerous 'zombie' computations where a transaction keeps running after experiencing a conflict; the assumptions confine the effects of these computations.

Proceedings ArticleDOI
20 Feb 2008
TL;DR: Cluster-STM is presented, an STM designed for high performance on large-scale commodity clusters that addresses several novel issues posed by this domain, including aggregating communication, managing locality, and distributing transactional metadata onto the nodes.
Abstract: While there has been extensive work on the design of software transactional memory (STM) for cache coherent shared memory systems, there has been no work on the design of an STM system for very large scale platforms containing potentially thousands of nodes. In this work, we present Cluster-STM, an STM designed for high performance on large-scale commodity clusters. Our design addresses several novel issues posed by this domain, including aggregating communication, managing locality, and distributing transactional metadata onto the nodes. We also re-evaluate several STM design choices previously studied for cache-coherent machines and conclude that, in some cases, different choices are appropriate on clusters. Finally, we show that our design scales well up to 512 processors. This is because on a cluster, the main barrier to STM scalability is the remote communication overhead imposed by the STM operations, and our design aggregates most of that communication with the communication of the underlying data.

Proceedings ArticleDOI
14 Jun 2008
TL;DR: A novel mechanism called single-owner read locks is used to implement irrevocable actions inside transactions in a way that maximizes concurrency and avoids overhead when the mechanism is not used.
Abstract: Transactional memory (TM) provides a safer, more modular, and more scalable alternative to traditional lock-based synchronization. Implementing high performance TM systems has recently been an active area of research. However, current TM systems provide limited, if any, support for transactions executing irrevocable actions, such as I/O and system calls, whose side effects cannot in general be rolled back. This severely limits the ability of these systems to run commercial workloads. This paper describes the design of a transactional memory system that allows irrevocable actions to be executed inside of transactions. While one transaction is executing an irrevocable action, other transactions can still execute and commit concurrently. We use a novel mechanism called single-owner read locks to implement irrevocable actions inside transactions; it maximizes concurrency and avoids overhead when the mechanism is not used. We also show how irrevocable transactions can be leveraged for contention management to handle actions whose effects may be expensive to roll back. Finally, we present a thorough performance evaluation of the irrevocability mechanism for the different usage models.

Proceedings ArticleDOI
07 Jun 2008
TL;DR: This paper presents a system that combines compiler and runtime techniques to automatically transform programs written with atomic sections into programs that only use locking primitives, and shows that the compiler is sound, and reports experimental results.
Abstract: Atomic sections are a recent and popular idiom to support the development of concurrent programs. Updates performed within an atomic section should not be visible to other threads until the atomic section has been executed entirely. Traditionally, atomic sections are supported through the use of optimistic concurrency, either using transactional memory hardware or an equivalent software emulation (STM). This paper explores automatically supporting atomic sections using pessimistic concurrency. We present a system that combines compiler and runtime techniques to automatically transform programs written with atomic sections into programs that only use locking primitives. To minimize contention in the transformed programs, our compiler chooses from several lock granularities, using fine-grain locks whenever it is possible. This paper formally presents our framework, shows that our compiler is sound (i.e., it protects all shared locations accessed within atomic sections), and reports experimental results.
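The transformation can be pictured as: infer, for each atomic section, a set of locks covering the shared locations it may touch, then acquire that set at the start of the section (in a deadlock-free order) and release it at the end. The fragment below is a hand-written illustration of what such output might look like, with invented lock names and a simple two-lock granularity; it is not code generated by the paper's compiler.

```cpp
// Hand-written illustration of an atomic section lowered to pessimistic
// locking (invented names). The inferred lock set is acquired up front.
#include <mutex>

std::mutex lock_account_a;   // guards balance_a
std::mutex lock_account_b;   // guards balance_b
long balance_a = 100, balance_b = 0;

// Original program (conceptually):
//   atomic { balance_a -= amount; balance_b += amount; }
void transfer(long amount) {
    // Inferred lock set {lock_account_a, lock_account_b}, acquired via
    // std::scoped_lock (deadlock-avoiding multi-lock acquisition).
    std::scoped_lock section(lock_account_a, lock_account_b);
    balance_a -= amount;
    balance_b += amount;
}

// A section touching only balance_a needs only the finer-grained lock.
long read_a() {
    std::scoped_lock section(lock_account_a);
    return balance_a;
}

int main() {
    transfer(25);
    return read_a() == 75 ? 0 : 1;
}
```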

Proceedings ArticleDOI
01 Feb 2008
TL;DR: This paper shows how the goals of high throughput and high single-thread performance, mainframe-class reliability, hardware transactional memory, and linear scalability are met by the logical and physical design of this 2.3GHz 396mm2 16-core 32-thread plus 32-scout-thread microprocessor.
Abstract: The goals for this high-end commercial microprocessor are high throughput and high single-thread performance, mainframe-class reliability, hardware transactional memory, and linear scalability. We show how these goals are met by the logical and physical design of this 2.3GHz 396mm2 16-core 32-thread plus 32-scout-thread microprocessor.

Journal ArticleDOI
01 Jun 2008
TL;DR: FlexTM (FLEXible Transactional Memory) is described, along with an STM-inspired protocol that uses CSTs to manage conflicts in a distributed manner (no global arbitration) and allows parallel commits; FlexTM's distributed commit protocol is also more efficient than a central hardware manager.
Abstract: A high-concurrency transactional memory (TM) implementation needs to track concurrent accesses, buffer speculative updates, and manage conflicts. We present a system, FlexTM (FLEXible Transactional Memory), that coordinates four decoupled hardware mechanisms: read and write signatures, which summarize per-thread access sets; per-thread conflict summary tables (CSTs), which identify the threads with which conflicts have occurred; Programmable Data Isolation, which maintains speculative updates in the local cache and employs a thread-private buffer (in virtual memory) in the rare event of overflow; and Alert-On-Update, which selectively notifies threads about coherence events. All mechanisms are software-accessible, to enable virtualization and to support transactions of arbitrary length. FlexTM allows software to determine when to manage conflicts (either eagerly or lazily), and to employ a variety of conflict management and commit protocols. We describe an STM-inspired protocol that uses CSTs to manage conflicts in a distributed manner (no global arbitration) and allows parallel commits. In experiments with a prototype on Simics/GEMS, FlexTM exhibits 5x speedup over high-quality software TM, with no loss in policy flexibility. Its distributed commit protocol is also more efficient than a central hardware manager. Our results highlight the importance of flexibility in determining when to manage conflicts: lazy maximizes concurrency and helps to ensure forward progress while eager provides better overall utilization in a multi-programmed system.

Proceedings ArticleDOI
19 Oct 2008
TL;DR: A software transactional memory system that introduces first-class C++ language constructs for transactional programming, a production-quality optimizing C++ compiler that translates and optimizes these extensions, and a high-performance STM runtime library are presented.
Abstract: This paper presents a software transactional memory system that introduces first-class C++ language constructs for transactional programming. We describe new C++ language extensions, a production-quality optimizing C++ compiler that translates and optimizes these extensions, and a high-performance STM runtime library. The transactional language constructs support C++ language features including classes, inheritance, virtual functions, exception handling, and templates. The compiler automatically instruments the program for transactional execution and optimizes TM overheads. The runtime library implements multiple execution modes and implements a novel STM algorithm that supports both optimistic and pessimistic concurrency control. The runtime switches a transaction's execution mode dynamically to improve performance and to handle calls to precompiled functions and I/O libraries. We present experimental results on 8 cores (two quad-core CPUs) running a set of 20 non-trivial parallel programs. Our measurements show that our system scales well as the number of cores increases and that our compiler and runtime optimizations improve scalability.
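For a sense of what such constructs look like to the programmer, the snippet below uses the __transaction_atomic block syntax from GCC's later experimental -fgnu-tm support (and the ISO TM technical specification) as an approximation; the language extensions and attributes in the paper's system may differ.

```cpp
// Approximation using GCC's -fgnu-tm syntax (compile with: g++ -fgnu-tm).
// The paper's actual extensions may use different keywords and attributes.
static long counter = 0;

void deposit(long amount) {
    __transaction_atomic {    // the compiler instruments the block for TM
        counter += amount;    // reads and writes become transactional
    }
}

long read_counter() {
    long v;
    __transaction_atomic { v = counter; }
    return v;
}

int main() {
    deposit(5);
    return read_counter() == 5 ? 0 : 1;
}
```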

Proceedings ArticleDOI
14 Jun 2008
TL;DR: A new weakly atomic Java STM implementation is described that provides single global lock semantics while permitting concurrent execution, but it is shown that this comes at a significant performance cost.
Abstract: As memory transactions have been proposed as a language-level replacement for locks, there is growing need for well-defined semantics. In contrast to database transactions, transactional memory (TM) semantics are complicated by the fact that programs may access the same memory locations both inside and outside transactions. Strongly atomic semantics, where non-transactional accesses are treated as implicit single-operation transactions, remain difficult to provide without specialized hardware support or significant performance overhead. As an alternative, many in the community have informally proposed that a single global lock semantics [18,10], where transaction semantics are mapped to those of regions protected by a single global lock, provides an intuitive and efficiently implementable model for programmers. In this paper, we explore the implementation and performance implications of single global lock semantics in a weakly atomic STM from the perspective of Java, and we discuss why even recent STM implementations fall short of these semantics. We describe a new weakly atomic Java STM implementation that provides single global lock semantics while permitting concurrent execution, but we show that this comes at a significant performance cost. We also propose and implement various alternative semantics that loosen single lock requirements while still providing strong guarantees. We compare our new implementations to previous ones, including a strongly atomic STM [24].
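Single global lock semantics itself has a simple operational reading: the program must behave as if every atomic block acquired one process-wide lock. The trivial reference implementation below (a sketch of the semantics, not of the paper's STM) is the baseline a weakly atomic STM has to be indistinguishable from while still running non-conflicting blocks concurrently.

```cpp
// Trivial reference implementation of single-global-lock semantics:
// every atomic block acquires the same process-wide mutex. A real STM
// must be observably equivalent to this while permitting real concurrency.
#include <mutex>
#include <utility>

std::mutex single_global_lock;

template <typename F>
auto atomic_block(F&& body) {
    std::lock_guard<std::mutex> g(single_global_lock);
    return std::forward<F>(body)();
}

int shared_x = 0;

int main() {
    atomic_block([] { shared_x += 1; return 0; });
    int seen = atomic_block([] { return shared_x; });
    return seen == 1 ? 0 : 1;
}
```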

Proceedings ArticleDOI
07 Jan 2008
TL;DR: This work presents a family of formal languages that model a wide variety of behaviors for software transactions, and proves some key language-equivalence theorems to confirm that under sufficient static restrictions, no program can determine whether the language implementation uses weak isolation or strong isolation.
Abstract: Software transactions have received significant attention as a way to simplify shared-memory concurrent programming, but insufficient focus has been given to the precise meaning of software transactions or their interaction with other language features. This work begins to rectify that situation by presenting a family of formal languages that model a wide variety of behaviors for software transactions. These languages abstract away implementation details of transactional memory, providing high-level definitions suitable for programming languages. We use small-step semantics in order to represent explicitly the interleaved execution of threads that is necessary to investigate pertinent issues. We demonstrate the value of our core approach to modeling transactions by investigating two issues in depth. First, we consider parallel nesting, in which parallelism and transactions can nest arbitrarily. Second, we present multiple models for weak isolation, in which nontransactional code can violate the isolation of a transaction. For both, type-and-effect systems let us soundly and statically restrict what computation can occur inside or outside a transaction. We prove some key language-equivalence theorems to confirm that under sufficient static restrictions, in particular that each mutable memory location is used outside transactions or inside transactions (but not both), no program can determine whether the language implementation uses weak isolation or strong isolation.
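A concrete picture of the weak-isolation issue (my illustration, not the paper's): when a location is accessed both inside and outside transactions, a weakly isolated implementation may let the non-transactional access observe a transaction's intermediate state, while strong isolation, or the static restriction mentioned above that each mutable location is used only inside or only outside transactions, rules that interleaving out.

```cpp
// Illustration of the weak-isolation hazard (not from the paper).
// Transactional invariant: a and b are equal at every transaction boundary.
int a = 0, b = 0;

// Thread 1: transactional update.
//   atomic { a = 1; b = 1; }   (conceptual)
void tx_update() {
    a = 1;
    // <-- under weak isolation, a non-transactional reader scheduled here
    //     can observe a == 1 && b == 0, an intermediate state.
    b = 1;
}

// Thread 2: non-transactional read. Under strong isolation it behaves as an
// implicit single-operation transaction and can never see the torn state.
bool torn_read() { return a != b; }

int main() {
    tx_update();
    return torn_read() ? 1 : 0;   // after the transaction completes: not torn
}
```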

Proceedings ArticleDOI
09 Sep 2008
TL;DR: A research platform for exploiting software TM on clusters is introduced: the distributed software transactional memory (DiSTM) system has been designed for easy prototyping of TM coherence protocols and does not rely on a software or hardware implementation of distributed shared memory.
Abstract: While transactional memory (TM) research on shared-memory chip multiprocessors has been flourishing over the last years, limited research has been conducted in the cluster domain. In this paper, we introduce a research platform for exploiting software TM on clusters. The distributed software transactional memory (DiSTM) system has been designed for easy prototyping of TM coherence protocols and it does not rely on a software or hardware implementation of distributed shared memory. Three TM coherence protocols have been implemented and evaluated with established TM benchmarks. The decentralized transactional coherence and consistency protocol has been compared against two centralized protocols that utilize leases. Results indicate that, depending on network congestion and the amount of contention, different protocols perform better.

Proceedings ArticleDOI
18 Aug 2008
TL;DR: CAR-STM, a scheduling-based mechanism for STM Collision Avoidance and Resolution, is presented and incorporated into the University of Rochester's STM (RSTM), and the performance of the new implementation is compared with that of the original RSTM using STMBench7.
Abstract: Transactional memory (TM) is a key concurrent programming abstraction. Several software-based transactional memory (STM) implementations have been developed in recent years. All STM implementations must guarantee transaction atomicity but different STM implementations may provide different progress guarantees. In order to ensure progress, an STM implementation must resolve transaction conflicts. This is done either by the implementation itself or by delegating conflict resolution to a separate contention manager module that tries to resolve transaction collisions once they are detected. We present CAR-STM, a scheduling-based mechanism for STM Collision Avoidance and Resolution, that can be incorporated into existing STM implementations. CAR-STM maintains per-core transaction queues and schedules a thread while it is performing a transaction. CAR-STM employs the following two novel collision reduction techniques: (1) Serializing contention managers resolve conflicts by aborting one transaction and moving it to the transaction queue of the other, effectively serializing the execution of these transactions and ensuring they will not collide again. (2) Proactive collision reduction allows applications to provide information about transactions' collision probability. CAR-STM uses this information to pre-assign transactions that are more likely to collide to the same core. We have incorporated CAR-STM into the University of Rochester's STM (RSTM) and compared the performance of the new implementation with that of the original RSTM by using STMBench7. Our results show that the new implementation provides orders-of-magnitude reduction of execution times and improved throughput for almost all concurrency levels. Additionally, since CAR-STM greatly reduces the unpredictable influence of operating-system scheduling on STM performance, the new implementation provides much more stable performance. In contrast, the performance of the original RSTM implementation on STMBench7 workloads exhibits extremely high variance. Though our paper focuses on software transactional memory, we believe the ideas introduced by CAR-STM may also prove useful for hybrid implementations of transactional memory.
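The serializing contention manager can be pictured as a queue-migration rule: when transaction A aborts transaction B, B is not retried on its own core but appended to the transaction queue of A's core, so the two execute back to back and cannot collide again. The sketch below captures only that rule with invented names; the real dispatcher, proactive scheduling, and RSTM integration are omitted.

```cpp
// Sketch of the queue-migration rule (invented names, no real TM underneath).
#include <cstdio>
#include <deque>
#include <functional>
#include <vector>

using Transaction = std::function<void()>;

struct CoreQueue { std::deque<Transaction> pending; };

std::vector<CoreQueue> cores(4);

// Serializing contention management: the aborted transaction is moved to the
// winner's queue, so the two execute back to back and cannot collide again.
void on_conflict(int winner_core, Transaction aborted_tx) {
    cores[winner_core].pending.push_back(std::move(aborted_tx));
}

void drain(int core) {
    while (!cores[core].pending.empty()) {
        Transaction tx = std::move(cores[core].pending.front());
        cores[core].pending.pop_front();
        tx();   // in CAR-STM this would be a real transactional attempt
    }
}

int main() {
    cores[0].pending.push_back([] { std::puts("tx A commits"); });
    cores[1].pending.push_back([] { std::puts("tx B commits after migration"); });
    // Suppose tx B (core 1) conflicted with tx A (core 0) and was aborted:
    on_conflict(0, std::move(cores[1].pending.front()));
    cores[1].pending.pop_front();
    drain(0);   // runs A, then the migrated B, serially
}
```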

Journal ArticleDOI
01 Jun 2008
TL;DR: TokenTM is an unbounded HTM that uses the abstraction of tokens to precisely track conflicts on an unbounded number of memory blocks and implements tokens with new mechanisms, including metastate fission/fusion and fast token release.
Abstract: Current hardware transactional memory systems seek to simplify parallel programming, but assume that large transactions are rare, so it is acceptable to penalize their performance or concurrency. However, future programmers may wish to use large transactions more often in order to integrate with higher-level programming models (e.g., database transactions) or perform selected I/O operations. To prevent the "small transactions are common" assumption from becoming self-fulfilling, this paper contributes TokenTM—an unbounded HTM that uses the abstraction of tokens to precisely track conflicts on an unbounded number of memory blocks. TokenTM implements tokens with new mechanisms, including metastate fission/fusion and fast token release. TokenTM executes small transactions fast, executes concurrent large transactions with no penalty to nonconflicting transactions, and gracefully handles paging, context switching, and System-V-style shared memory.

Proceedings ArticleDOI
08 Nov 2008
TL;DR: A model and implementation of dependence-aware transactional memory (DATM), a novel solution to the problem of scaling under contention, is presented; DATM increases concurrency, for example by reducing the runtime of STAMP benchmarks by up to 39% and reducing transaction restarts by up to 94%.
Abstract: Transactional memory (TM) is a promising paradigm for helping programmers take advantage of emerging multi-core platforms. Though they perform well under low contention, hardware TM systems have a reputation of not performing well under high contention, as compared to locks. This paper presents a model and implementation of dependence-aware transactional memory (DATM), a novel solution to the problem of scaling under contention. Unlike many proposals to deal with write-shared data (which arise in common data structures like counters and linked lists), DATM operates transparently to the programmer. The main idea in DATM is to accept any transaction execution interleaving that is conflict serializable, including interleavings that contain simple conflicts. Current TM systems reduce useful concurrency by restarting conflicting transactions, even if the execution interleaving is conflict serializable. DATM manages dependences between uncommitted transactions, sometimes forwarding data between them to safely commit conflicting transactions. The evaluation of our prototype shows that DATM increases concurrency, for example by reducing the runtime of STAMP benchmarks by up to 39% and reducing transaction restarts by up to 94%.

Proceedings ArticleDOI
14 Jun 2008
TL;DR: Four major performance bottlenecks that were unique to the case of executing large-scale workloads on an STM are identified: false conflicts, over-instrumentation, privatization-safety cost, and poor amortization.
Abstract: Transactional Memory (TM) promises to simplify concurrent programming, which has been notoriously difficult but crucial in realizing the performance benefit of multi-core processors. Software Transactional Memory (STM), in particular, represents a body of important TM technologies since it provides a mechanism to run transactional programs when hardware TM support is not available, or when hardware TM resources are exhausted. Nonetheless, most previous research on STMs was constrained to executing trivial, small-scale workloads. The assumption was that the same techniques applied to small-scale workloads could readily be applied to real-life, large-scale workloads. However, by executing several nontrivial workloads such as a particle dynamics simulation and a game physics engine on a state-of-the-art STM, we noticed that this assumption does not hold. Specifically, we identified four major performance bottlenecks that were unique to the case of executing large-scale workloads on an STM: false conflicts, over-instrumentation, privatization-safety cost, and poor amortization. We believe that these bottlenecks would be common for any STM targeting real-world applications. In this paper, we describe those identified bottlenecks in detail, and we propose novel solutions to alleviate the issues. We also thoroughly validate these approaches with experimental results on real machines.
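Of these bottlenecks, false conflicts are the easiest to picture: a word-based STM typically hashes each address into a fixed-size table of ownership records, so transactions touching unrelated data can still appear to conflict when their addresses map to the same record. The toy example below (invented names, deliberately tiny table) shows how that aliasing arises; it is not the instrumentation of the STM used in the paper.

```cpp
// Toy illustration of false conflicts from ownership-record aliasing
// (invented names; real STMs use much larger tables).
#include <cstdint>
#include <cstdio>

constexpr std::size_t kOrecCount = 16;   // deliberately tiny to force aliasing

std::size_t orec_index(const void* addr) {
    // Typical word-based mapping: strip low bits, then mod the table size.
    return (reinterpret_cast<std::uintptr_t>(addr) >> 3) % kOrecCount;
}

int main() {
    // Two locations 16 words apart map to the same ownership record in a
    // 16-entry table, so transactions touching them appear to conflict.
    long a[32] = {};
    std::size_t ia = orec_index(&a[0]);
    std::size_t ib = orec_index(&a[16]);
    std::printf("a[0] -> orec %zu, a[16] -> orec %zu (aliased: %s)\n",
                ia, ib, ia == ib ? "yes" : "no");
    return 0;
}
```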

Patent
30 May 2008
TL;DR: Various technologies and techniques are disclosed for providing a bounded transactional memory application that accesses cache metadata in a cache of a central processing unit; the application can also interrogate a cache line metadata eviction summary to determine whether a transaction is doomed and then take an appropriate action.
Abstract: Various technologies and techniques are disclosed for providing a bounded transactional memory application that accesses cache metadata in a cache of a central processing unit. When performing a transactional read from the bounded transactional memory application, a cache line metadata transaction-read bit is set. When performing a transactional write from the bounded transactional memory application, a cache line metadata transaction-write bit is set and a conditional store is performed. At commit time, if any lines marked with the transaction-read bit or the transaction-write bit were evicted or invalidated, all speculatively written lines are discarded. The application can also interrogate a cache line metadata eviction summary to determine whether a transaction is doomed and then take an appropriate action.

Journal ArticleDOI
01 Jun 2008
TL;DR: This work demonstrates how fine-grained memory protection can be used in support of transactional memory systems and identifies contention-management policies, within and across the hardware and software TM components, that are key to achieving robust performance with a hybrid TM.
Abstract: We demonstrate how fine-grained memory protection can be used in support of transactional memory systems: first showing how a software transactional memory system (STM) can be made strongly atomic by using memory protection on transactionally-held state, then showing how such a strongly-atomic STM can be used with a bounded hardware TM system to build a hybrid TM system in which zero-overhead hardware transactions may safely run concurrently with potentially-conflicting software transactions. We experimentally demonstrate how this hybrid TM organization avoids the common-case overheads associated with previous hybrid TM proposals, achieving performance rivaling an unbounded HTM system without the hardware complexity of ensuring completion of arbitrary transactions in hardware. As part of our findings, we identify key policies regarding contention management within and across the hardware and software TM components that are key to achieving robust performance with a hybrid TM.

Book ChapterDOI
09 Jun 2008
TL;DR: A sample evaluation shows unfavourable transactional performance and scalability compared to lock-based execution, in contrast to many of the published TM evaluations, and highlights the need for non-trivial TM benchmarks.
Abstract: Transactional Memory (TM) is a concurrent programming paradigm that aims to make concurrent programming easier than fine-grain locking, whilst providing similar performance and scalability. Several TM systems have been made available for research purposes. However, there is a lack of a wide range of non-trivial benchmarks with which to thoroughly evaluate these TM systems. This paper introduces Lee-TM, a non-trivial and realistic TM benchmark suite based on Lee's routing algorithm. The benchmark suite provides sequential, lock-based, and transactional implementations to enable direct performance comparison. Lee's routing algorithm has several of the desirable properties of a non-trivial TM benchmark, such as large amounts of parallelism, complex contention characteristics, and a wide range of transaction durations and lengths. A sample evaluation shows unfavourable transactional performance and scalability compared to lock-based execution, in contrast to many of the published TM evaluations, and highlights the need for non-trivial TM benchmarks.

Patent
Gad Sheaffer, Raikin Shlomo, Vadim Bassin, Raanan Sade, Ehud Cohen, Oleg Margulis
30 Dec 2008
TL;DR: A method and apparatus for monitoring memory accesses in hardware to support transactional execution is described; attributes monitor accesses to data items without regard for detection at physical storage structure granularity, instead ensuring monitoring at least at data item granularity.
Abstract: A method and apparatus for monitoring memory accesses in hardware to support transactional execution is herein described. Attributes monitor accesses to data items without regard for detection at physical storage structure granularity, instead ensuring monitoring at least at data item granularity. As an example, attributes are added to state bits of a cache to enable new cache coherency states. Upon a monitored memory access to a data item, which may be selectively determined, coherency states associated with the data item are updated to a monitored state. As a result, invalidating requests to the data item are detected through combination of the request type and the monitored coherency state of the data item.