
Showing papers on "Transactional memory published in 2009"


Proceedings ArticleDOI
07 Mar 2009
TL;DR: The experience includes a number of promising results using HTM to improve performance in a variety of contexts, and also identifies some ways in which the feature could be improved.
Abstract: We report on our experience with the hardware transactional memory (HTM) feature of two pre-production revisions of a new commercial multicore processor. Our experience includes a number of promising results using HTM to improve performance in a variety of contexts, and also identifies some ways in which the feature could be improved. We give detailed accounts of our experiences, sharing the techniques we used to achieve our results and describing the challenges we faced in doing so.

318 citations


Proceedings ArticleDOI
15 Jun 2009
TL;DR: SwissTM is lock- and word-based and uses a new two-phase contention manager that ensures the progress of long transactions while inducing no overhead on short ones, and outperforms state-of-the-art STM implementations, namely RSTM, TL2, and TinySTM.
Abstract: Transactional memory (TM) is an appealing abstraction for programming multi-core systems. Potential target applications for TM, such as business software and video games, are likely to involve complex data structures and large transactions, requiring specific software solutions (STM). So far, however, STMs have been mainly evaluated and optimized for smaller-scale benchmarks. We revisit the main STM design choices from the perspective of complex workloads and propose a new STM, which we call SwissTM. In short, SwissTM is lock- and word-based and uses (1) optimistic (commit-time) conflict detection for read/write conflicts and pessimistic (encounter-time) conflict detection for write/write conflicts, as well as (2) a new two-phase contention manager that ensures the progress of long transactions while inducing no overhead on short ones. SwissTM outperforms state-of-the-art STM implementations, namely RSTM, TL2, and TinySTM, in our experiments on STMBench7, STAMP, Lee-TM and red-black tree benchmarks. Beyond SwissTM, we present the most complete evaluation to date of the individual impact of various STM design choices on the ability to support the mixed workloads of large applications.

228 citations
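
The mixed conflict-detection policy in (1) is easy to picture in code. Below is a minimal, single-threaded C sketch under assumed names (it is not SwissTM's actual implementation): the write locks its ownership record at encounter time, so a competing writer fails immediately, while the read is only validated against its recorded version at commit time.

```c
#include <stdatomic.h>
#include <stdio.h>

typedef struct { atomic_uint ver; } orec_t;   /* even = free, odd = write-locked */

static orec_t orec_a, orec_b;                 /* ownership records for two words */
static int word_a = 1, word_b = 2;

int main(void) {
    /* Optimistic read of word_a: just remember the version for later. */
    unsigned seen_a = atomic_load(&orec_a.ver);
    int local = word_a;

    /* Pessimistic write of word_b: lock the orec now; a concurrent
       writer would fail this CAS, an eager write/write conflict. */
    unsigned vb = atomic_load(&orec_b.ver);
    if ((vb & 1u) ||
        !atomic_compare_exchange_strong(&orec_b.ver, &vb, vb | 1u)) {
        puts("abort: write/write conflict detected at encounter time");
        return 1;
    }
    word_b = local + 10;                      /* in place; undo log omitted */

    /* Commit: validate the read; a writer that committed to word_a in
       the meantime would have bumped its version (lazy R/W conflict). */
    if (atomic_load(&orec_a.ver) != seen_a) {
        puts("abort: read/write conflict detected at commit time");
        atomic_store(&orec_b.ver, vb);        /* release lock (after undo) */
        return 1;
    }
    atomic_store(&orec_b.ver, vb + 2);        /* unlock and advance version */
    printf("committed: word_b = %d\n", word_b);
    return 0;
}
```

Eager write/write detection kills doomed transactions early, while lazy read/write validation lets short readers slip past long writers, which is the balance the abstract describes.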


Proceedings ArticleDOI
11 Oct 2009
TL;DR: TxOS is described, a variant of Linux 2.6.22 that implements system transactions that uses new implementation techniques to provide fast, serializable transactions with strong isolation and fairness between system transactions and non-transactional activity.
Abstract: Applications must be able to synchronize accesses to operating system resources in order to ensure correctness in the face of concurrency and system failures. System transactions allow the programmer to specify updates to heterogeneous system resources with the OS guaranteeing atomicity, consistency, isolation, and durability (ACID). System transactions efficiently and cleanly solve persistent concurrency problems that are difficult to address with other techniques. For example, system transactions eliminate security vulnerabilities in the file system that are caused by time-of-check-to-time-of-use (TOCTTOU) race conditions. System transactions enable an unsuccessful software installation to roll back without disturbing concurrent, independent updates to the file system. This paper describes TxOS, a variant of Linux 2.6.22 that implements system transactions. TxOS uses new implementation techniques to provide fast, serializable transactions with strong isolation and fairness between system transactions and non-transactional activity. The prototype demonstrates that a mature OS running on commodity hardware can provide system transactions at a reasonable performance cost. For instance, a transactional installation of OpenSSH incurs only 10% overhead, and a non-transactional compilation of Linux incurs negligible overhead on TxOS. By making transactions a central OS abstraction, TxOS enables new transactional services. For example, one developer prototyped a transactional ext3 file system in less than one month.

155 citations
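
As a concrete picture of the TOCTTOU example above, here is a hedged sketch of wrapping a check-then-use sequence in a system transaction. The sys_xbegin()/sys_xend() names follow the paper's description of its API, but they are stubbed out here so the fragment compiles on an ordinary POSIX system; on TxOS they would be real system calls.

```c
#include <fcntl.h>
#include <unistd.h>

/* Stubs standing in for the TxOS system calls, so this compiles anywhere. */
static int sys_xbegin(void) { return 0; }
static int sys_xend(void)   { return 0; }

int main(void) {
    sys_xbegin();
    /* Without a transaction, an attacker could swap the file for a
       symlink between access() and open(); inside one, the check and
       the use become atomic: they commit together or not at all. */
    if (access("/tmp/report", W_OK) == 0) {
        int fd = open("/tmp/report", O_WRONLY | O_APPEND);
        if (fd >= 0) {
            (void)write(fd, "entry\n", 6);
            close(fd);
        }
    }
    sys_xend();
    return 0;
}
```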


Journal ArticleDOI
TL;DR: Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads, and uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss and aggressive dynamic hardware parallelization of a sequential instruction stream.
Abstract: Rock, Sun's third-generation chip-multithreading processor, contains 16 high-performance cores, each of which can support two software threads. Rock uses a novel checkpoint-based architecture to support automatic hardware scouting under a load miss, speculative out-of-order retirement of instructions, and aggressive dynamic hardware parallelization of a sequential instruction stream. It is also the first processor to support transactional memory in hardware.

146 citations


Proceedings ArticleDOI
16 Nov 2009
TL;DR: D2STM is presented, a replicated STM whose consistency is ensured in a transparent manner, even in the presence of failures, and which achieves remarkable performance gains with only negligible increases in the transaction abort rate.
Abstract: To date, the problem of how to build distributed and replicated Software Transactional Memory (STM) to enhance both dependability and performance has remained largely unexplored. This paper fills this gap by presenting D2STM, a replicated STM whose consistency is ensured in a transparent manner, even in the presence of failures. Strong consistency is enforced at transaction commit time by a non-blocking distributed certification scheme, which we name BFC (Bloom Filter Certification). BFC exploits a novel Bloom filter-based encoding mechanism that significantly reduces the overheads of replica coordination at the cost of a user-tunable increase in the probability of transaction abort. Through an extensive experimental study based on standard STM benchmarks, we show that the BFC scheme achieves remarkable performance gains with only negligible (e.g., 1%) increases in the transaction abort rate.

144 citations
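
The Bloom-filter trade-off that BFC exploits is easy to demonstrate. The toy C sketch below (sizes and hash functions are invented for illustration, not BFC's actual encoding) packs a transaction's read set into a bit array; certification then tests whether an item written by a concurrently committed transaction may be in the read set, and a false positive simply causes a spurious abort, which is the user-tunable cost the abstract mentions.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BF_BITS 1024u
static uint8_t bloom[BF_BITS / 8];            /* the transaction's read set */

static uint32_t hash_k(uint64_t addr, uint32_t k) {
    uint64_t h = addr * 0x9e3779b97f4a7c15ull + k * 0xc2b2ae3d27d4eb4full;
    h ^= h >> 33;                             /* cheap mixing, 3 hashes via k */
    return (uint32_t)(h % BF_BITS);
}

static void bf_add(uint64_t addr) {
    for (uint32_t k = 0; k < 3; k++) {
        uint32_t b = hash_k(addr, k);
        bloom[b / 8] |= (uint8_t)(1u << (b % 8));
    }
}

static bool bf_maybe_contains(uint64_t addr) {
    for (uint32_t k = 0; k < 3; k++) {
        uint32_t b = hash_k(addr, k);
        if (!(bloom[b / 8] & (1u << (b % 8)))) return false;
    }
    return true;    /* "maybe": a false positive becomes a spurious abort */
}

int main(void) {
    bf_add(0x1000); bf_add(0x2000);           /* addresses this tx read */
    /* Certification: test items written by concurrently committed txs. */
    printf("write to 0x1000 conflicts? %d\n", bf_maybe_contains(0x1000));
    printf("write to 0x3000 conflicts? %d\n", bf_maybe_contains(0x3000));
    return 0;
}
```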


Proceedings ArticleDOI
14 Feb 2009
TL;DR: This paper introduces a new way to provide strong atomicity in an implementation of transactional memory by using off-the-shelf page-level memory protection hardware to detect conflicts between normal memory accesses and transactional ones and shows how a combination of careful object placement and dynamic code update allows us to eliminate almost all of the protection changes.
Abstract: This paper introduces a new way to provide strong atomicity in an implementation of transactional memory. Strong atomicity lets us offer clear semantics to programs, even if they access the same locations inside and outside transactions. It also avoids differences between hardware-implemented transactions and software-implemented ones. Our approach is to use off-the-shelf page-level memory protection hardware to detect conflicts between normal memory accesses and transactional ones. This page-level mechanism ensures correctness but gives poor performance because of the costs of manipulating memory protection settings and receiving notifications of access violations. However, in practice, we show how a combination of careful object placement and dynamic code update allows us to eliminate almost all of the protection changes. Existing implementations of strong atomicity in software rely on detecting conflicts by conservatively treating some non-transactional accesses as short transactions. In contrast, our page-level mechanism lets us be less conservative about how non-transactional accesses are treated; we avoid changes to non-transactional code until a possible conflict is detected dynamically, and we can respond to phase changes where a given instruction sometimes generates conflicts and sometimes does not. We evaluate our implementation with C# versions of many of the STAMP benchmarks, and show how it performs within 25% of an implementation with weak atomicity on all the benchmarks we have studied. It avoids pathological cases in which other implementations of strong atomicity perform poorly.

129 citations
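
The detection mechanism at the heart of the paper can be demonstrated with standard POSIX calls. The sketch below (Linux-style, illustration only; the real system adds careful object placement and dynamic code update on top) marks a page inaccessible while it is transactionally owned, so a conflicting non-transactional access raises SIGSEGV, which the handler treats as the detected conflict before restoring access.

```c
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

static long pagesz;

/* A non-transactional access to a protected page lands here; a TM
   runtime would record the conflict, then restore access and resume. */
static void on_violation(int sig, siginfo_t *si, void *ctx) {
    (void)sig; (void)ctx;
    (void)write(STDOUT_FILENO, "conflict detected\n", 18);  /* signal-safe */
    void *page = (void *)((uintptr_t)si->si_addr & ~((uintptr_t)pagesz - 1));
    mprotect(page, (size_t)pagesz, PROT_READ | PROT_WRITE);
}

int main(void) {
    pagesz = sysconf(_SC_PAGESIZE);
    char *page = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    struct sigaction sa = {0};
    sa.sa_sigaction = on_violation;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    mprotect(page, (size_t)pagesz, PROT_NONE);  /* a transaction "owns" it */
    page[0] = 'x';                              /* traps; handler unprotects */
    printf("resumed after conflict, page[0] = %c\n", page[0]);
    return 0;
}
```

The costs the paper works to eliminate are visible here too: every ownership change is an mprotect() call, and every conflict costs a signal delivery.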


Proceedings ArticleDOI
12 Dec 2009
TL;DR: A new scalable HTM architecture is shown that performs comparably to the state-of-the-art and can be implemented by minor modifications to the MESI protocol rather than re-engineering it from the ground up; it performs on average 7% faster than Scalable-TCC.
Abstract: Transactional Memory aims to provide a programming model that makes parallel programming easier. Hardware implementations of transactional memory (HTM) suffer from fewer overheads than implementations in software, and refinements in conflict management strategies for HTM allow for even larger improvements. In particular, lazy conflict management has been shown to deliver better performance, but it has hitherto required complex protocols and implementations. In this paper we show a new scalable HTM architecture that performs comparably to the state-of-the-art and can be implemented by minor modifications to the MESI protocol rather than re-engineering it from the ground up. Our approach detects conflicts eagerly while a transaction is running, but defers the resolution lazily until commit time. We evaluate this EAger-laZY system, EazyHTM, by comparing it with the Scalable-TCC-like approach and a system employing ideal lazy conflict management with a zero-cycle transaction validation and fully-parallel commits. We show that EazyHTM performs on average 7% faster than Scalable-TCC. In addition, EazyHTM has fast commits and aborts, can commit in parallel even if there is only one directory present, and does not suffer from cascading waits.

111 citations


Proceedings ArticleDOI
14 Feb 2009
TL;DR: The authors describe how they took an existing parallel implementation of the Quake game server and restructured it to use transactions, in the first attempt to develop a large, complex application that uses TM for all of its synchronization.
Abstract: Transactional Memory (TM) is being studied widely as a new technique for synchronizing concurrent accesses to shared memory data structures for use in multi-core systems. Much of the initial work on TM has been evaluated using microbenchmarks and application kernels; it is not clear whether conclusions drawn from these workloads will apply to larger systems. In this work we make the first attempt to develop a large, complex application that uses TM for all of its synchronization. We describe how we have taken an existing parallel implementation of the Quake game server and restructured it to use transactions. In doing so we have encountered examples where transactions simplify the structure of the program. We have also encountered cases where using transactions occludes the structure of the existing code. Compared with existing TM benchmarks, our workload exhibits non-block-structured transactions within which there are I/O operations and system call invocations. There are long- and short-running transactions (200 to 1.3M cycles) with small and large read and write sets (a few bytes to 1.5MB). There are nested transactions reaching up to 9 levels at runtime. There are examples where error handling and recovery occurs inside transactions. There are also examples where data changes between being accessed transactionally and accessed non-transactionally. However, we did not see examples where the kind of access to one piece of data depended on the value of another.

92 citations


Proceedings ArticleDOI
12 Sep 2009
TL;DR: This paper proposes an adaptive locking technique that dynamically observes whether a critical section would be best executed transactionally or while holding a mutex lock, and finds adaptive locks to consistently match or outperform the better of the two component mechanisms.
Abstract: Transactional memory is being advanced as an alternative to traditional lock-based synchronization for concurrent programming. Transactional memory simplifies the programming model and maximizes concurrency. At the same time, transactions can suffer from interference that causes them to often abort, from heavy overheads for memory accesses, and from expressiveness limitations (e.g., for I/O operations). In this paper we propose an adaptive locking technique that dynamically observes whether a critical section would be best executed transactionally or while holding a mutex lock. The critical new elements of our approach include the adaptivity logic and cost-benefit analysis, a low overhead implementation of statistics collection and adaptive locking in a full C compiler, and an exposition of the effects on the programming model. In experiments with both micro- and macro-benchmarks we found adaptive locks to consistently match or outperform the better of the two component mechanisms (mutexes or transactions). Compared to either mechanism alone, adaptive locks often provide 3-to-10x speedups. Additionally, adaptive locks simplify the programming model by reducing the need for fine-grained locking: with adaptive locks, the programmer can specify coarse-grained locking annotations and often achieve fine-grained locking performance due to the transactional memory mechanisms.

83 citations
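
A minimal sketch of the adaptive idea follows. The statistics, threshold, and tx_execute() stand-in are all invented for illustration and are not the paper's cost-benefit model: each critical section starts out speculative and falls back to its mutex once the observed abort ratio suggests speculation is wasting work.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    pthread_mutex_t mutex;
    unsigned attempts, aborts;      /* low-overhead per-section statistics */
} adaptive_lock_t;

/* Stand-in for "run the block as a transaction"; returns false on abort. */
static bool tx_execute(void (*body)(void), unsigned attempt) {
    if (attempt % 3 == 0) return false;        /* simulated contention */
    body();
    return true;
}

static void critical_section(void) { /* shared-state updates go here */ }

static void adaptive_enter(adaptive_lock_t *l) {
    l->attempts++;
    /* Cost-benefit test (illustrative): an abort ratio above 50% means
       speculation is wasting work, so take the mutex instead. */
    if (l->aborts * 2 > l->attempts) {
        pthread_mutex_lock(&l->mutex);
        critical_section();
        pthread_mutex_unlock(&l->mutex);
    } else if (!tx_execute(critical_section, l->attempts)) {
        l->aborts++;
        pthread_mutex_lock(&l->mutex);         /* fall back for this attempt */
        critical_section();
        pthread_mutex_unlock(&l->mutex);
    }
}

int main(void) {
    adaptive_lock_t l = { PTHREAD_MUTEX_INITIALIZER, 0, 0 };
    for (int i = 0; i < 10; i++) adaptive_enter(&l);
    printf("attempts=%u aborts=%u\n", l.attempts, l.aborts);
    return 0;
}
```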


Patent
30 Jun 2009
TL;DR: In this article, a hardware-assisted transactional memory system with open nested transactions is presented, where a top-level transaction can be implemented in software and thus not be limited by hardware constraints typical when using hardware transactional memory systems.
Abstract: Hardware-assisted transactional memory system with open nested transactions. Embodiments include a system whereby hardware acceleration of transactions can be accomplished by implementing open nested transactions in hardware that respect software locks, such that a top-level transaction can be implemented in software and thus not be limited by the hardware constraints typical of hardware transactional memory systems.

82 citations


Journal ArticleDOI
01 Nov 2009
TL;DR: The community must work together to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.
Abstract: Over the last 20 years, the open-source community has provided more and more software on which the world’s high-performance computing systems depend for performance and productivity. The community has invested millions of dollars and years of effort to build key components. Although the investments in these separate software elements have been tremendously valuable, a great deal of productivity has also been lost because of the lack of planning, coordination, and key integration of technologies necessary to make them work together smoothly and efficiently, both within individual petascale systems and between different systems. A repository gatekeeper and an email discussion list can coordinate open-source development within a single project, but there is no global mechanism working across the community to identify critical holes in the overall software environment, spot opportunities for beneficial integration, or specify requirements for more careful coordination. It seems clear that this completely uncoordinated development model will not provide the software needed to support the unprecedented parallelism required for peta/exascale computation on millions of cores, or the flexibility required to exploit new hardware models and features, such as transactional memory, speculative execution, and GPUs. We believe the community must work together to prepare for the challenges of exascale computing, ultimately combining their efforts in a coordinated International Exascale Software Project.

Proceedings ArticleDOI
14 Feb 2009
TL;DR: This paper provides a safety proof for the dependence-aware model and describes the first application of dependence tracking to software transactional memory (STM) design and implementation, quantifying how dependence tracking converts certain types of transactional conflicts into successful commits.
Abstract: Dependence-aware transactional memory (DATM) is a recently proposed model for increasing concurrency of memory transactions without complicating their interface. DATM manages dependences between conflicting, uncommitted transactions so that they commit safely. The contributions of this paper are twofold. First, we provide a safety proof for the dependence-aware model. This proof also shows that the DATM model accepts all concurrent interleavings that are conflict-serializable. Second, we describe the first application of dependence tracking to software transactional memory (STM) design and implementation. We compare our implementation with a state-of-the-art STM, TL2 [4]. We use benchmarks from the STAMP [21] suite, quantifying how dependence tracking converts certain types of transactional conflicts into successful commits. On high contention workloads, DATM is able to take advantage of dependences to speed up execution by up to 4.8x.
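
The commit-ordering consequence of dependence tracking can be shown in a few lines. In this toy, single-threaded C sketch (not the paper's implementation), T2 reads a value forwarded from uncommitted T1, which records a T1 -> T2 dependence; T2's commit is then deferred until T1 commits, instead of T2 being aborted.

```c
#include <stdbool.h>
#include <stdio.h>

typedef struct tx {
    int id;
    bool committed;
    struct tx *depends_on;         /* must commit after this transaction */
} tx_t;

static bool try_commit(tx_t *t) {
    if (t->depends_on && !t->depends_on->committed)
        return false;              /* would violate the dependence order */
    t->committed = true;
    return true;
}

int main(void) {
    tx_t t1 = {1, false, NULL};
    tx_t t2 = {2, false, NULL};

    /* T1 writes x; T2 reads the uncommitted value: forward it and
       record the dependence instead of treating it as a conflict. */
    t2.depends_on = &t1;

    printf("T2 first: %s\n", try_commit(&t2) ? "ok" : "deferred");
    printf("T1:       %s\n", try_commit(&t1) ? "ok" : "deferred");
    printf("T2 retry: %s\n", try_commit(&t2) ? "ok" : "deferred");
    return 0;
}
```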

Proceedings ArticleDOI
11 Aug 2009
TL;DR: The results show that nonblocking support introduces little overhead when compared with blocking STMs, and that NZTM is competitive with LogTM-SE, an unbounded HTM.
Abstract: This paper introduces NZTM, a nonblocking, zero-indirection, object-based, hybrid transactional memory system. NZTM comprises a nonblocking software transactional memory (STM) system that can exploit best-effort hardware transactional memory (HTM) if available to improve performance. Most previous nonblocking software transactional memory implementations pay a significant performance cost in the common case, as compared to simpler, blocking ones. However, blocking is problematic in some cases and unacceptable in others. NZTM is nonblocking, but shares the advantages of recent blocking STM proposals in the common case: it stores object data "in place", avoiding the costly levels of indirection of previous nonblocking STMs, and improves cache performance by collocating object metadata with the data it controls. We also explain how our nonblocking NZSTM algorithm can be substantially simplified using very simple hardware transactions, and evaluate its performance on Sun's forthcoming Rock processor. Our results show that nonblocking support introduces little overhead when compared with blocking STMs, and that NZTM is competitive with LogTM-SE, an unbounded HTM.

Proceedings ArticleDOI
21 Jan 2009
TL;DR: This paper is the first to formally define the progress semantics of lock-based TMs, which are considered the most effective in practice, and uses this semantics to reduce the problems of reasoning about the correctness and computability power of lock-based TMs to those of simple try-lock objects.
Abstract: Transactional memory (TM) is a promising paradigm for concurrent programming. While the number of TM implementations is growing, little research has been conducted to precisely define TM semantics, especially their progress guarantees. This paper is the first to formally define the progress semantics of lock-based TMs, which are considered the most effective in practice. We use our semantics to reduce the problems of reasoning about the correctness and computability power of lock-based TMs to those of simple try-lock objects. More specifically, we prove that checking the progress of any set of transactions accessing an arbitrarily large set of shared variables can be reduced to verifying a simple property of each individual (logical) try-lock used by those transactions. We use this theorem to determine the correctness of state-of-the-art lock-based TMs and highlight various configuration ambiguities. We also prove that lock-based TMs have consensus number 2. This means that, on the one hand, a lock-based TM cannot be implemented using only read-write memory, but, on the other hand, it does not need very powerful instructions such as the commonly used compare-and-swap. We finally use our semantics to formally capture an inherent trade-off in the performance of lock-based TM implementations. Namely, we show that the space complexity of every lock-based software TM implementation that uses invisible reads is at least exponential in the number of objects accessible to transactions.
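
For concreteness, the kind of object the reduction targets is no stronger than the following C11 try-lock, which either acquires immediately or fails without blocking (a minimal sketch, not the paper's formal definition).

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct { atomic_flag held; } trylock_t;

static bool try_acquire(trylock_t *l) {
    return !atomic_flag_test_and_set(&l->held);  /* true iff it was free */
}

static void release(trylock_t *l) { atomic_flag_clear(&l->held); }

int main(void) {
    trylock_t l = { ATOMIC_FLAG_INIT };
    printf("first:  %d\n", try_acquire(&l));     /* 1: acquired */
    printf("second: %d\n", try_acquire(&l));     /* 0: already held */
    release(&l);
    return 0;
}
```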

Patent
15 Dec 2009
TL;DR: In this paper, the authors present methods to monitor performance of one or more architecturally significant processor caches coupled to a processor, where the application utilizes the architecturally important portions of the processor caches.
Abstract: Monitoring performance of one or more architecturally significant processor caches coupled to a processor. The methods include executing an application on one or more processors coupled to one or more architecturally significant processor caches, where the application utilizes the architecturally significant portions of the architecturally significant processor caches. The methods further include at least one of generating metrics related to performance of the architecturally significant processor caches; implementing one or more debug exceptions related to performance of the architecturally significant processor caches; or implementing one or more transactional breakpoints related to performance of the architecturally significant processor caches as a result of utilizing the architecturally significant portions of the architecturally significant processor caches.

Proceedings ArticleDOI
12 Dec 2009
TL;DR: A dynamic contention management strategy is introduced that minimizes contention by using past history to identify when hot spots of contention will reoccur in the future and proactively schedule affected transactions around these hot spots.
Abstract: Hardware Transactional Memory offers a promising high-performance and easier-to-program alternative to lock-based synchronization for creating parallel programs. This is particularly important as hardware manufacturers continue to put more cores on die. But transactional memory still has one main drawback: contention. Contention is caused by multiple transactions trying to speculatively modify the same memory location concurrently, causing one or more transactions to abort and retry their execution. Contention serializes the execution, meaning high contention leads to very poor parallel performance. As more cores are added, contention worsens. To date, contention-manager designs have been primarily reactive in nature and limited to various forms of randomized backoff to effectively stall contending transactions when conflicts occur. While backoff-based managers have been popular due to their simplicity, at higher core counts our analysis on the STAMP benchmark suite shows that backoff-based managers perform poorly. In particular, small groups of transactions create hot spots of contention that lead to this poor performance. We show these hot spots commonly consist of small sets of conflicts that occur in a predictable manner. To counter this challenge we introduce a dynamic contention management strategy that minimizes contention by using past history to identify when these hot spots will reoccur in the future and proactively schedule affected transactions around them. The strategy predicts future contention and schedules to avoid it at runtime without the need for programmer input. Our experiments show that by using our proactive scheduling technique we outperform a backoff-based policy for a 16-processor system by an average of 85%.
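
A stripped-down sketch of the proactive idea follows; the prediction rule (serialize any pair that has conflicted before) is a deliberate simplification of the paper's history mechanism, and all names are invented.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

#define NTX 8
static unsigned conflicts[NTX][NTX];           /* past-conflict history */
static pthread_mutex_t serialize = PTHREAD_MUTEX_INITIALIZER;

static void record_conflict(int a, int b) { conflicts[a][b]++; conflicts[b][a]++; }

static bool predicted_hot(int a, int b) { return conflicts[a][b] > 0; }

/* Before running tx `me` concurrently with `other`: if history says the
   pair is a hot spot, schedule around it by serializing up front rather
   than speculating, aborting, and backing off. */
static void run_tx(int me, int other, void (*body)(void)) {
    if (predicted_hot(me, other)) {
        pthread_mutex_lock(&serialize);
        body();
        pthread_mutex_unlock(&serialize);
    } else {
        body();                                /* speculate as usual */
    }
}

static void demo_body(void) { puts("transaction body ran"); }

int main(void) {
    record_conflict(1, 2);                     /* the pair clashed before */
    run_tx(1, 2, demo_body);                   /* now serialized proactively */
    return 0;
}
```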

Patent
30 Jun 2009
TL;DR: In this article, the authors present a system that facilitates the execution of a transaction for a program in a hardware-supported transactional memory system, and provides an advice state associated with the recorded failure state to the program to facilitate a response to the transaction failure.
Abstract: One embodiment provides a system that facilitates the execution of a transaction for a program in a hardware-supported transactional memory system. During operation, the system records a failure state of the transaction during execution of the transaction using hardware transactional memory mechanisms. Next, the system detects a transaction failure associated with the transaction. Finally, the system provides an advice state associated with the recorded failure state to the program to facilitate a response to the transaction failure by the program.

Patent
13 Nov 2009
TL;DR: In this paper, the authors describe a processor that transactionally executes instructions from a protected section of program code and then encounters a transactional failure condition while transactionally executing the instructions from the protected section.
Abstract: The described embodiments provide a processor (e.g., processor 102) for executing instructions. During execution, the processor starts by transactionally executing instructions from a protected section of program code. The processor then encounters a transactional failure condition while transactionally executing the instructions from the protected section of program code. In response to encountering the transactional failure condition, the processor enters a transactional-scout mode and speculatively executes subsequent instructions in the transactional-scout mode.

01 Jan 2009
TL;DR: This position paper argues for an analogous transactional datarace-free (TDRF) programming model, observing that WI is strong enough to implement this model, and further that weakly isolated TM systems based on redo logging can provide the safety guarantees required by languages like Java.
Abstract: In the transactional memory (TM) community, much debate has revolved around the choice of strong vs. weak isolation (SI vs. WI) between transactions and conflicting nontransactional accesses. In this position paper we argue that what programmers really want is the natural transactional extension of sequential consistency (SC), and that even SI is insufficient to achieve this. It is widely agreed among architects and language designers that SC imposes unacceptable hardware costs and compiler restrictions. Programmer-centric, relaxed memory models were developed as a compromise, guaranteeing SC to programs that “follow the rules” while admitting many of the compiler optimizations that result in fast single-threaded execution. We argue for an analogous transactional datarace-free (TDRF) programming model. We observe that WI is strong enough to implement this model, and further that weakly isolated TM systems based on redo logging can provide the safety guarantees (no “out-of-thin-air reads”) required by languages like Java. Seen in this light, strong isolation (SI) serves only to require more constrained behavior in racy (buggy) programs. We submit that the benefit is not worth the cost, at least for software TM.

Proceedings ArticleDOI
08 Jun 2009
TL;DR: This paper presents QuakeTM, a multiplayer game server: a complex, real-life TM application that was parallelized from the sequential version with TM-specific considerations in mind, and provides extensive analysis of the transactional behavior of QuakeTM, with emphasis on the TM promise of making parallel programming easier.
Abstract: "Is transactional memory useful?" is the question that cannot be answered until we provide substantial applications that can evaluate its capabilities. While existing TM applications can partially answer the above question, and are useful in the sense that they provide a first-order TM experimentation framework, they serve only as a proof of concept and fail to make a conclusive case for wide adoption by the general computing community. This paper presents QuakeTM, a multiplayer game server: a complex, real-life TM application that was parallelized from the sequential version with TM-specific considerations in mind. QuakeTM consists of 27,600 lines of code spread across 49 files and exhibits irregular parallelism for which a task-parallel model fits well. We provide a coarse-grained TM implementation characterized by eight large transactional blocks as well as a fine-grained implementation consisting of 58 different critical sections, and compare these two approaches. In spite of the fact that QuakeTM scales, we show that more effort is needed to decrease the overhead and the abort rate of current software transactional memory systems to achieve good performance. We give insights into development challenges, suggest techniques to solve them, and provide extensive analysis of the transactional behavior of QuakeTM, with emphasis on the TM promise of making parallel programming easier.

Proceedings ArticleDOI
11 Aug 2009
TL;DR: An inherent tradeoff for implementations of transactional memories is proved: they cannot be both disjoint-access parallel and have read-only transactions that are invisible and always terminate successfully.
Abstract: Transactional memory (TM) is a promising approach for designing concurrent data structures, and it is essential to develop better understanding of the formal properties that can be achieved by TM implementations. Two fundamental properties of TM implementations are disjoint-access parallelism, which is critical for their scalability, and the invisibility of read operations, which reduces memory contention. This paper proves an inherent tradeoff for implementations of transactional memories: they cannot be both disjoint-access parallel and have read-only transactions that are invisible and always terminate successfully. In fact, a lower bound of Ω(t) is proved on the number of writes needed in order to implement a read-only transaction of t items, which successfully terminates in a disjoint-access parallel TM implementation. The results assume strict serializability and thus hold under the assumption of opacity. It is shown how to extend the results to hold also for weaker consistency conditions: serializability and snapshot isolation.

Patent
23 Dec 2009
TL;DR: In this article, the authors describe methods, systems, and apparatuses to provide an XABORT in a transactional memory access system, where the stored value is a context value indicating the context in which an execution was aborted.
Abstract: Methods, systems, and apparatuses to provide an XABORT in a transactional memory access system are described. In one embodiment, the stored value is a context value indicating the context in which a transactional memory execution was aborted. A fallback handler may use the context value to perform a series of operations particular to the context in which the abort occurred.
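
The mechanism the patent describes corresponds closely to what later shipped as the RTM XABORT instruction, whose 8-bit code a fallback handler can inspect. The sketch below uses the real RTM intrinsics from immintrin.h (compile with -mrtm and run on TSX-capable hardware), but the mapping of codes to "contexts" is invented for the example.

```c
#include <immintrin.h>
#include <stdio.h>

#define CTX_TABLE_RESIZE 1   /* hypothetical context ids for the example */
#define CTX_IO_REQUESTED 2

int main(void) {
    unsigned status = _xbegin();
    if (status == _XBEGIN_STARTED) {
        _xabort(CTX_IO_REQUESTED);   /* abort, recording the context */
        _xend();                     /* never reached */
    } else if (status & _XABORT_EXPLICIT) {
        /* Fallback handler: branch on the context the abort recorded. */
        printf("aborted in context %u\n", _XABORT_CODE(status));
    } else {
        printf("aborted for another reason, status=0x%x\n", status);
    }
    return 0;
}
```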

Proceedings ArticleDOI
12 Sep 2009
TL;DR: FASTM is presented, an eager log-based HTM that takes advantage of the processor’s cache hierarchy to provide fast abort recovery and uses a novel coherence protocol to buffer the transactional modifications in the first level cache and to keep the non-speculative values in the higher levels of the memory hierarchy.
Abstract: Version management, one of the key design dimensions of Hardware Transactional Memory (HTM) systems, defines where and how transactional modifications are stored. Current HTM systems use either eager or lazy version management. Eager systems that keep new values in-place while they hold old values in a software log, suffer long delays when aborts are frequent because the pre-transactional state is recovered by software. Lazy systems that buffer new values in specialized hardware offer complex and inefficient solutions to handle hardware overflows, which are common in applications with coarse-grain transactions. In this paper, we present FASTM, an eager log-based HTM that takes advantage of the processor’s cache hierarchy to provide fast abort recovery. FASTM uses a novel coherence protocol to buffer the transactional modifications in the first level cache and to keep the non-speculative values in the higher levels of the memory hierarchy. This mechanism allows fast abort recovery of transactions that do not overflow the first level cache resources. Contrary to lazy HTM systems, committing transactions do not have to perform any actions in order to make their results visible to the rest of the system. FASTM keeps the pre-transactional state in a software-managed log as well, which permits the eviction of speculative values and enables transparent execution even in the case of cache overflow. This approach simplifies eviction policies without degrading performance, because it only falls back to a software abort recovery for transactions whose modified state has overflowed the cache. Simulation results show that FASTM achieves a speed-up of 43% compared to LogTM-SE, improving the scalability of applications with coarse-grain transactions and obtaining similar performance to an ideal eager HTM with zero-cost abort recovery.

Book ChapterDOI
Bo Zhang, Binoy Ravindran
03 Dec 2009
TL;DR: The Relay protocol is presented, a novel cache-coherence protocol, which optimizes these values, and it is shown that Relay's competitive ratio is significantly improved by a factor of O(N_i) for N_i transactions requesting the same object when compared against past distributed queuing protocols.
Abstract: Distributed transactional memory promises to alleviate difficulties with lock-based (distributed) synchronization and object performance bottlenecks in distributed systems. The design of the cache-coherence protocol is critical to the performance of distributed transactional memory systems. We evaluate the performance of a cache-coherence protocol by measuring its worst-case competitive ratio, i.e., the ratio of its makespan to the makespan of the optimal cache-coherence protocol. We establish the upper bound of the competitive ratio and show that it is determined by the worst-case number of abortions, maximum locating stretch, and maximum moving stretch of the protocol -- the first such result. We present the Relay protocol, a novel cache-coherence protocol, which optimizes these values, and evaluate its performance. We show that Relay's competitive ratio is significantly improved by a factor of O(N_i) for N_i transactions requesting the same object when compared against past distributed queuing protocols.
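
Restating the abstract's metric in symbols (notation assumed here, not the paper's exact formalism), a protocol A's worst-case competitive ratio is

```latex
\[
  \mathrm{CR}(\mathcal{A}) \;=\; \max_{w \in \text{workloads}}
    \frac{\mathrm{makespan}_{\mathcal{A}}(w)}{\mathrm{makespan}_{\mathrm{OPT}}(w)},
\]
```

and the upper bound cited above says this ratio is controlled by the worst-case number of abortions together with the maximum locating and moving stretches.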

Patent
Thomas J. Heller, Jr.
16 Nov 2009
TL;DR: The authors propose hybrid transactional memory support with a simple, cost-effective hardware design that can cope with limited hardware resources, yet whose transactional facility control logic provides a backup assist thread, allowing programmers to include calls to existing software libraries inside their transactions without forcing user code onto a second, lock-based solution.
Abstract: A computer processing system having memory and processing facilities for processing data with a computer program is a Hybrid Transactional Memory multiprocessor system with modules 1 . . . n coupled to a system physical memory array and I/O devices via a high-speed interconnection element. A CPU is integrated, as in a multi-chip module, with microprocessors which contain or are coupled in the CPU module to an assist thread facility, as well as a memory controller, cache controllers, cache memory, and other components which form part of the CPU, which connects to the high-speed interconnect that functions under the architecture and operating system to interconnect elements of the computer system with physical memory, various I/O devices and the other CPUs of the system. The hybrid transactional memory elements support a transactional memory system with a simple, cost-effective hardware design that can deal with limited hardware resources, yet whose transactional facility control logic provides a backup assist thread that still allows transactions to reference existing libraries; programmers can include calls to existing software libraries inside their transactions without resorting to a second, lock-based solution.

Proceedings ArticleDOI
26 Apr 2009
TL;DR: A state-of-the-art profile-driven TLS compiler is used to identify loops that can be speculatively parallelized; with optimal loop selection, one can potentially achieve an average speedup of 60% on four cores over what could be achieved by a traditional parallelizing compiler such as Intel's ICC compiler.
Abstract: The computer industry has adopted multi-threaded and multi-core architectures as the clock rate increase stalled in the early 2000s. It was hoped that the continuous improvement of single-program performance could be achieved through these architectures. However, traditional parallelizing compilers often fail to effectively parallelize general-purpose applications, which typically have complex control flow and excessive pointer usage. Recently, hardware techniques such as Transactional Memory (TM) and Thread-Level Speculation (TLS) have been proposed to simplify the task of parallelization by using speculative threads. The potential of speculative parallelism in general-purpose applications like SPEC CPU 2000 has been well studied and shown to be moderately successful. Preliminary work examining the potential parallelism in SPEC 2006 deployed parallel threads with a restrictive TLS execution model and limited compiler support, and thus showed only limited performance potential. In this paper, we first analyze the cross-iteration dependence behavior of the SPEC 2006 benchmarks and show that more parallelism potential is available in SPEC 2006 than in SPEC 2000. We further use a state-of-the-art profile-driven TLS compiler to identify loops that can be speculatively parallelized. Overall, we found that with optimal loop selection we can potentially achieve an average speedup of 60% on four cores over what could be achieved by a traditional parallelizing compiler such as Intel's ICC compiler. We also found that an additional 11% improvement can potentially be obtained on selected benchmarks using 8 cores when we extend TLS to multiple loop levels as opposed to restricting it to a single loop level.

Proceedings ArticleDOI
08 Jun 2009
TL;DR: This analysis corroborates previous research findings that stalling (especially prior to retrying an access rather than the entire transaction) helps side-step conflicts and avoid wasted work, and demonstrates that conflict resolution time has the dominant effect on performance.
Abstract: In the search for high performance, most transactional memory (TM) systems execute atomic blocks concurrently and must thus be prepared for data conflicts. The TM system also needs to choose a policy to decide when and how to manage the resulting contention. In this paper, we analyze the interplay between conflict resolution time and contention management policy in the context of hardware-supported TM systems, highlighting both the implementation tradeoffs and the performance implications of the various points in the design space. We show that both policy decisions have a significant impact on the ability to exploit available parallelism and thereby affect overall performance. Our analysis corroborates previous research findings that stalling (especially prior to retrying an access rather than the entire transaction) helps side-step conflicts and avoid wasted work. We also demonstrate that conflict resolution time has the dominant effect on performance: Lazy (which delays resolution to commit time) uncovers more parallelism than Eager (which resolves conflicts at access time). Furthermore, Lazy's delayed conflict management decreases the likelihood of livelock, while Eager needs sophisticated priority mechanisms. Finally, we evaluate a mixed conflict detection mode that detects write-write conflicts eagerly while detecting read-write conflicts lazily, and show that it provides a good compromise between flexibility and implementation complexity.

Proceedings ArticleDOI
01 Apr 2009
TL;DR: Despite the overhead of implementing transactions in software, transactions with xCalls improved the performance of two applications with poor locking behavior by 16% and 70%.
Abstract: Memory transactions, similar to database transactions, allow a programmer to focus on the logic of their program and let the system ensure that transactions are atomic and isolated. Thus, programs using transactions do not suffer from deadlock. However, when a transaction performs I/O or accesses kernel resources, the atomicity and isolation guarantees from the TM system do not apply to the kernel. The xCall interface is a new API that provides transactional semantics for system calls. With a combination of deferral and compensation, xCalls enable transactional memory programs to use common OS functionality within transactions. We implement xCalls for the Intel Software Transactional Memory compiler, and found it straightforward to convert programs to use transactions and xCalls. In tests on a 16-core NUMA machine, we show that xCalls enable concurrent I/O and system calls within transactions. Despite the overhead of implementing transactions in software, transactions with xCalls improved the performance of two applications with poor locking behavior by 16% and 70%.
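
Deferral and compensation, the two techniques the abstract combines, can be sketched generically. The names below are invented for illustration (this is not the actual xCall interface): externally visible actions are queued and run only at commit, while actions performed eagerly inside the transaction register an undo that runs on abort.

```c
#include <stdio.h>

typedef void (*action_fn)(void);

static action_fn deferred[8], compensations[8];
static int n_deferred, n_compensations;

static void defer(action_fn f)       { deferred[n_deferred++] = f; }
static void on_abort(action_fn undo) { compensations[n_compensations++] = undo; }

static void tx_commit(void) {        /* run deferred effects, drop undos */
    for (int i = 0; i < n_deferred; i++) deferred[i]();
    n_deferred = n_compensations = 0;
}

static void tx_abort(void) {         /* undo eager effects, drop deferred */
    for (int i = n_compensations - 1; i >= 0; i--) compensations[i]();
    n_deferred = n_compensations = 0;
}

static void send_reply(void)  { puts("reply sent at commit"); }
static void unlink_temp(void) { puts("temp file removed on abort"); }

int main(void) {
    defer(send_reply);      /* e.g., network I/O: visible only if we commit */
    on_abort(unlink_temp);  /* e.g., a file created eagerly inside the tx */
    tx_commit();            /* prints "reply sent at commit" */
    (void)tx_abort;         /* the abort path would run compensations */
    return 0;
}
```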

Patent
26 Jun 2009
TL;DR: In this article, a system and method for transactional memory using read-write locks is described, where each thread executing a group of memory access operations as an atomic transaction acquires the proper read or write permissions before performing a memory operation.
Abstract: A system and method for transactional memory using read-write locks is disclosed. Each of a plurality of shared memory areas is associated with a respective read-write lock, which includes a read-lock portion indicating whether any thread has a read-lock for read-only access to the memory area and a write-lock portion indicating whether any thread has a write-lock for write access to the memory area. A thread executing a group of memory access operations as an atomic transaction acquires the proper read or write permissions before performing a memory operation. To perform a read access, the thread attempts to obtain the corresponding read-lock and succeeds if no other thread holds a write-lock for the memory area. To perform a write access, the thread attempts to obtain the corresponding write-lock and succeeds if no other thread holds a write-lock or read-lock for the memory area.
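
The locking discipline described here maps directly onto POSIX read-write locks. A minimal illustration (not the patent's implementation), with one rwlock standing in for the lock of one shared memory area:

```c
#include <pthread.h>
#include <stdio.h>

static pthread_rwlock_t area_lock = PTHREAD_RWLOCK_INITIALIZER;
static int shared_area;

int main(void) {
    /* Read access: succeeds unless a writer holds the lock. */
    if (pthread_rwlock_tryrdlock(&area_lock) == 0) {
        int v = shared_area;
        pthread_rwlock_unlock(&area_lock);
        printf("read %d\n", v);
    }
    /* Write access: needs exclusive ownership (no readers or writers). */
    if (pthread_rwlock_trywrlock(&area_lock) == 0) {
        shared_area = 7;
        pthread_rwlock_unlock(&area_lock);
    }
    return 0;
}
```

A transaction that fails to obtain a permission would release its locks and retry, which is how deadlock is avoided in such schemes.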

Patent
18 Dec 2009
TL;DR: A method for software prioritization of concurrent transactions for embedded conflict arbitration in transactional memory management is presented, which can include setting different hardware registers with different priority values for correspondingly different transactions.
Abstract: Embodiments of the present invention provide a method, system and computer program product for software prioritization of concurrent transactions for embedded conflict arbitration in transactional memory management. In an embodiment of the invention, a method for software prioritization of concurrent transactions for embedded conflict arbitration in transactional memory management can include setting different hardware registers with different priority values for correspondingly different transactions in a transactional memory system configured for transactional memory management according to respective priority values specified by priority assignment logic in external software support for the system. The method also can include detecting a conflict amongst the transactions in the system. Finally, the method can include applying conflict arbitration within the system based upon the priority values specified by the priority assignment logic in the external software support for the system.