
Showing papers on "Transactional memory" published in 2019


Proceedings ArticleDOI
01 Jun 2019
TL;DR: OneFile is presented, the first wait-free PTM with integrated wait-free memory reclamation, and two variants of OneFile are designed and implemented, one with lock-free progress and the other with bounded wait-free progress.
Abstract: A persistent transactional memory (PTM) library provides an easy-to-use interface to programmers for using byte-addressable non-volatile memory (NVM). Previously proposed PTMs have, so far, been blocking. We present OneFile, the first wait-free PTM with integrated wait-free memory reclamation. We have designed and implemented two variants of the OneFile, one with lock-free progress and the other with bounded wait-free progress. We additionally present software transactional memory (STM) implementations of the lock-free and wait-free algorithms targeting volatile memory. Each of our PTMs and STMs is implemented as a single C++ file with ~1,000 lines of code, making them versatile to use. Equipped with these PTMs and STMs, non-expert developers can design and implement their own lock-free and wait-free data structures on NVM, thus making lock-free programming accessible to common software developers.

45 citations


Proceedings ArticleDOI
17 Jun 2019
TL;DR: This paper presents a construction that takes any concurrent program with reads, writes and CASs to shared memory and makes it persistent, i.e., can be continued after one or more processes fault and have to restart, and provides an optimized transformation for normalized lock-free data structures, thus speeding up a large class of concurrent algorithms.
Abstract: Non-volatile memory (NVM) promises persistent main memory that remains correct despite loss of power. This has sparked a line of research into algorithms that can recover from a system crash. Since caches are expected to remain volatile, concurrent data structures and algorithms must be redesigned to guarantee that they are left in a consistent state after a system crash, and that the execution can be continued upon recovery. However, the prospect of redesigning every concurrent data structure or algorithm before it can be used in NVM architectures is daunting. In this paper, we present a construction that takes any concurrent program with reads, writes and CASs to shared memory and makes it persistent, i.e., can be continued after one or more processes fault and have to restart. The converted algorithm has constant computational delay (preserves instruction counts on each process within a constant factor), as well as constant recovery delay (a process can recover from a fault in a constant number of instructions). We show this first for a simple transformation, and then present optimizations to make it more practical, allowing for a trade-off between computation and recovery delay. We also provide an optimized transformation for normalized lock-free data structures, thus speeding up a large class of concurrent algorithms. Finally, we experimentally evaluate these transformations by applying them to a queue. We compare the performance of our transformations to that of a persistent transactional memory framework, Romulus, and to a hand-tuned persistent queue. We show that our optimized transformation performs favorably when compared to Romulus. Furthermore, our optimized transformation is even comparable to the hand-tuned version, showing that the generality we provide comes at very little performance cost.

45 citations


Proceedings Article
01 Jan 2019
TL;DR: Pisces is presented, a read-friendly PTM that exploits snapshot isolation (SI) on NVM and proposes a dual-version concurrency control (DVCC) protocol that maintains up to two versions in NVM-backed storage hierarchy.
Abstract: Persistent transactional memory (PTM) programming model has recently been exploited to provide crash-consistent transactional interfaces to ease programming atop NVM. However, existing PTM designs either incur high reader-side overhead due to blocking or long delay in the writer side (efficiency), or place excessive constraints on persistent ordering (scalability). This paper presents Pisces, a read-friendly PTM that exploits snapshot isolation (SI) on NVM. The key design of Pisces is based on two observations: the redo logs of transactions can be reused as newer versions for the data, and an intuitive MVCC-based design has read deficiency. Based on the observations, we propose a dual-version concurrency control (DVCC) protocol that maintains up to two versions in NVM-backed storage hierarchy. Together with a three-stage commit protocol, Pisces ensures SI and allows more transactions to commit and persist simultaneously. Most importantly, it promises a desired feature: hiding NVM persistence overhead from reads and allowing nearly non-blocking reads. Experimental evaluation on an Intel 40-thread (20-core) machine with real NVM equipped shows that Pisces outperforms the state-of-the-art design (i.e., DUDETM) by up to 6.3× for micro-benchmarks and 4.6× for TPC-C new order transaction, and also scales much better. The persistency cost is from 19% to 50% for 40 threads.

36 citations


Proceedings ArticleDOI
01 Sep 2019
TL;DR: This work shows how to build concurrent persistent transactional memory from traditional software transactional memories and introduces general and model-specific optimizations that can substantially improve the performance of persistent transactions.
Abstract: Byte-addressable, non-volatile, random access memory (NVM) has the potential to dramatically accelerate the performance of storage-intensive workloads. For applications with irregular data access patterns, and applications that rely on ad-hoc data structures, the most promising model for interacting with NVM is a transactional model. However, the specifics of the model matter significantly. We introduce two models for programming persistent transactions. We show how to build concurrent persistent transactional memory from traditional software transactional memories. We then introduce general and model-specific optimizations that can substantially improve the performance of persistent transactions. Our evaluation shows a substantial improvement in both the latency and scalability of persistent transactions.

18 citations


Proceedings ArticleDOI
05 Aug 2019
TL;DR: This work proposes RNTree, a durable NVM-based B+tree using the hardware transactional memory (HTM), and proposes a new slot-array approach which traces the order of entries in the leaf nodes while still reducing the number of persistent instructions.
Abstract: Emerging non-volatile memory (NVM) opens an opportunity to build durable data structures. However, to build a highly efficient complex data structure like B+tree on NVM is not easy. We investigate the essential performance bottleneck for NVM-based B+tree. Even with a single-core CPU, the performance is limited by the atomic-write size which plays an essential role in the trade-off between the persistent overhead and keeping leaf node entries sorted. For the multi-core setting, the overlapping of concurrency and persistency is key to the system scalability. Based on the analysis, we propose RNTree, a durable NVM-based B+tree using the hardware transactional memory (HTM). Our way of using HTM can actually address both problems mentioned above simultaneously. (1) HTM can use cache-line granularity to provide larger atomic-write size. Based on this, we propose a new slot-array approach which traces the order of entries in the leaf nodes while still reducing the number of persistent instructions. (2) With careful design, RNTree moves slow persistent instructions out of critical sections and proposes the dual slot array design, to extract more concurrency. For single thread, RNTree achieves 1.44×/4.2× higher throughput for single-key operations and range queries respectively. For multiple threads, the throughput of RNTree is 2.3× higher than state-of-the-art works.

14 citations


Journal ArticleDOI
TL;DR: NV-HTM is presented, a system that allows the execution of transactions over PM using unmodified commodity HTM implementations, and can achieve up to 10× speed-ups and up to 11.6× reduced flush operations with respect to state-of-the-art solutions, which, unlike NV-HTM, require custom modifications to existing HTM systems.

13 citations


Journal ArticleDOI
TL;DR: This work proposes a binary search tree data structure whose key novelty stems from the decoupling of update operations, i.e., instead of performing an update operation in a single large transaction, it is split into one transaction that modifies the abstraction state and several other transactions that restructure the tree implementation in the background.
Abstract: We introduce the first binary search tree algorithm designed for speculative executions. Prior to this work, tree structures were mainly designed for their pessimistic (non-speculative) accesses to have a bounded complexity. Researchers tried to evaluate transactional memory using such tree structures whose prominent example is the red-black tree library developed by Oracle Labs that is part of multiple benchmark distributions. Although well-engineered, such structures remain badly suited for speculative accesses, whose step complexity might rise dramatically with contention. We show that our speculation-friendly tree outperforms the existing transaction-based version of the AVL and the red-black trees. Its key novelty stems from the decoupling of update operations: they are split into one transaction that modifies the abstraction state and multiple ones that restructure its tree implementation in the background. In particular, the speculation-friendly tree is shown correct, reusable and it speeds up a transaction-based travel reservation application by up to 3.5x.

13 citations


Journal ArticleDOI
TL;DR: By forbidding explicit self-abort, and by introducing an executor-based mechanism for running transactions, this approach makes it easier for developers to get code up and running with TM, and enables the implementation of robust support for TM in a small, orthogonal compiler extension.
Abstract: C++ has supported a provisional version of Transactional Memory (TM) since 2015, via a technical specification. However, TM has not seen widespread adoption, and compiler vendors have been slow to implement the technical specification. We conjecture that the proposed TM support is too difficult for programmers to use, too complex for compiler designers to implement and verify, and not industry-proven enough to justify final standardization in its current form. To address these problems, we present a different design for supporting TM in C++. By forbidding explicit self-abort, and by introducing an executor-based mechanism for running transactions, our approach makes it easier for developers to get code up and running with TM. Our proposal should also be appealing to compiler developers, as it allows a spectrum of levels of support for TM, with varying performance, and varying reliance on hardware TM support in order to provide scalability. While our design does not enable some of the optimizations admitted by the current technical specification, we show that it enables the implementation of robust support for TM in a small, orthogonal compiler extension. Our implementation is able to handle a wide range of transactional programs, delivering low instrumentation overhead and scalability and performance on par with the current state of the art. Based on this experience, we believe our approach to be a viable means of reinvigorating the standardization of TM in C++.

12 citations


Proceedings ArticleDOI
17 Feb 2019
TL;DR: This work extends LFTT to add support for dynamic transactions and wait-free progress while retaining its speed, and finds that these features do not hurt the performance of LFTT for the authors' test cases.
Abstract: Transactional data structures support threads executing a sequence of operations atomically. Dynamic transactions allow operands to be generated on the fly and allow threads to execute code in between the operations of a transaction, in contrast to static transactions which need to know the operands in advance. A framework called Lock-free Transactional Transformation (LFTT) allows data structures to run high-performance transactions, but it only supports static transactions. We extend LFTT to add support for dynamic transactions and wait-free progress while retaining its speed. The thread-helping scheme of LFTT presents a unique challenge to dynamic transactions. We overcome this challenge by changing the input of LFTT from a list of operations to a function, forcing helping threads to always start at the beginning of the transaction, and allowing threads to skip completed operations through the use of a list of return values. We thoroughly evaluate the performance impact of support for dynamic transactions and wait-free progress and find that these features do not hurt the performance of LFTT for our test cases.

10 citations


Book ChapterDOI
19 Jun 2019
TL;DR: In this paper, a starvation free algorithm for multi-version STM is proposed, which can be used either with the case where the number of versions is unbounded and garbage collection is used or where only the latest K versions are maintained.
Abstract: Software Transactional Memory systems (STMs) have garnered significant interest as an elegant alternative for addressing synchronization and concurrency issues with multi-threaded programming in multi-core systems. Client programs use STMs by issuing transactions. An STM ensures that a transaction either commits or aborts. A transaction aborted due to conflicts is typically re-issued with the expectation that it will complete successfully in a subsequent incarnation. However, many existing STMs fail to provide starvation freedom, i.e., in these systems, it is possible that concurrency conflicts may prevent an incarnated transaction from committing. To overcome this limitation, we systematically derive a novel starvation-free algorithm for multi-version STM. Our algorithm can be used either in the case where the number of versions is unbounded and garbage collection is used, or in the case where only the latest K versions are maintained (KSFTM). We have demonstrated that our proposed algorithm performs better than existing state-of-the-art STMs.
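The bounded-version option mentioned in the abstract (keeping only the latest K versions of each object) can be illustrated with a minimal sketch. This is not the KSFTM algorithm and says nothing about starvation freedom; it is a generic multi-version object, and the class name `VersionedObject`, its methods, and the convention of returning `None` when the needed version has been pruned are all assumptions made for illustration.

```python
import bisect


class VersionedObject:
    """Hypothetical multi-version object that retains at most k versions."""

    def __init__(self, initial, k=3):
        self.k = k
        self.timestamps = [0]      # sorted commit timestamps
        self.values = [initial]    # values[i] was written at timestamps[i]

    def write(self, ts, value):
        # Insert the new version in timestamp order.
        i = bisect.bisect_right(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)
        # Bounded variant: drop the oldest version once more than k exist.
        if len(self.timestamps) > self.k:
            del self.timestamps[0]
            del self.values[0]

    def read(self, ts):
        # Return the newest version written at or before ts.
        i = bisect.bisect_right(self.timestamps, ts)
        if i == 0:
            # The version this reader needs was pruned; in an STM the
            # reading transaction would have to abort and retry.
            return None
        return self.values[i - 1]
```

With `k=2`, writing at timestamps 5 and 10 discards the initial version, so a reader with timestamp 3 finds nothing to read while a reader with timestamp 7 still sees the value written at 5. The unbounded variant would skip the pruning step and rely on garbage collection instead.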

9 citations


Posted Content
TL;DR: This paper develops an efficient framework for a miner to execute smart contract transactions (SCTs) concurrently using optimistic Object-Based Software Transactional Memory systems (OSTMs) and Multi-Version OSTMs (MV-OSTMs).
Abstract: This paper proposes an efficient framework to execute Smart Contract Transactions (SCTs) concurrently based on object semantics, using optimistic Single-Version Object-based Software Transactional Memory Systems (SVOSTMs) and Multi-Version OSTMs (MVOSTMs). In our framework, a multi-threaded miner constructs a Block Graph (BG), capturing the object-conflicts relations between SCTs, and stores it in the block. Later, validators re-execute the same SCTs concurrently and deterministically relying on this BG. A malicious miner can modify the BG to harm the blockchain, e.g., to cause double-spending. To identify malicious miners, we propose Smart Multi-threaded Validator (SMV). Experimental analysis shows that the proposed multi-threaded miner and validator achieve significant performance gains over state-of-the-art SCT execution framework.

Proceedings ArticleDOI
01 Oct 2019
TL;DR: Algorithms using software transactional memory for implementing thread-safe associative arrays (a red-black tree, a hash table with open addressing based on the Hopscotch method hashing collision resolution) and algorithm selection recommendations for performing transactions are proposed.
Abstract: Algorithms using software transactional memory for implementing thread-safe associative arrays (a red-black tree, a hash table with open addressing based on the Hopscotch method hashing collision resolution) are proposed. The analysis of the efficiency of associative arrays with different number of involved threads and processor cores is given, comparison with data structures based on coarse-grained and fine-grained locks is given, algorithm selection recommendations for performing transactions are also formulated. The basics of software transaction memory, various policies for updating objects in memory and strategies for conflict detection are described. Various locking methods for using transactional memory implemented in the GCC 5.4.0 compiler are presented. The alternatives currently in use are briefly considered, advantages and disadvantages are also highlighted.

Journal ArticleDOI
TL;DR: The case for phase-based transactional systems is made using PhTM*, the first implementation of PhTM on modern HTM-ready processors, and it is shown for the first time that conventional hybrid systems do not perform better than phase-based systems in a scenario with hybrid-behaved transactions.
Abstract: In recent years, Hybrid TM (HyTM) has been proposed as a transactional memory approach that leverages the advantages of both hardware (HTM) and software (STM) execution modes. HyTM assumes that concurrent transactions can have very different phases and thus should run under different execution modes. Although HyTM has been shown to improve performance, the overall solution can be complicated to manage, both in terms of correctness and performance. On the other hand, Phased Transactional Memory (PhTM) considers that concurrent transactions have similar phases, and thus all transactions could run under the same mode. As a result, PhTM does not require coordination between transactions on distinct modes, making its implementation simpler and more flexible. In this article we make the case for phase-based transactional systems using PhTM*, the first implementation of PhTM on modern HTM-ready processors. PhTM*'s novelty relies on avoiding unnecessary transitions to software mode by: (i) taking into account the categories of hardware aborts; (ii) adding a new serialization mode. Experimental results using Broadwell's TSX reveal that, for the STAMP benchmark suite, PhTM* performs on average 1.68x better than PhTM, a previous phase-based TM, 2.08x better than HyTM-NOrec, a state-of-the-art HyTM, and 2.28x better than HyCO, the most recent hybrid system in the literature. In addition, PhTM* also proved effective when running on a Power8 machine, performing over 1.18x, 1.36x and 1.81x better than PhTM, HyTM-NOrec and HyCO, respectively. We also show that STAMP applications do not exhibit hybrid behavior to justify the use of conventional hybrid systems, thus making PhTM* a better solution for those types of programs. Finally, we show for the first time that conventional hybrid systems do not perform better than phase-based systems in a scenario with hybrid-behaved transactions.

Proceedings ArticleDOI
16 Feb 2019
TL;DR: TxSampler measures performance via sampling and provides a structured performance analysis to guide intuitive optimization with a novel decision-tree model and incurs ~4% runtime overhead and negligible memory overhead for its insightful analyses.
Abstract: Programs that use hardware transactional memory (HTM) demand sophisticated performance analysis tools when they suffer from performance losses. We have developed TxSampler---a lightweight profiler for programs that use HTM. TxSampler measures performance via sampling and provides a structured performance analysis to guide intuitive optimization with a novel decision-tree model. TxSampler computes metrics that drive the investigation process in a systematic way. It not only pinpoints hot transactions with time quantification of transactional and fallback paths, but also identifies causes of transaction aborts such as data contention, capacity overflow, false sharing, and problematic instructions. TxSampler associates metrics with full call paths that are even deeply embedded inside transactions and maps them to the program's source code. Our evaluation of more than 30 HTM benchmarks and applications shows that TxSampler incurs ~4% runtime overhead and negligible memory overhead for its insightful analyses. Guided by TxSampler, we are able to optimize several HTM programs and obtain nontrivial speedups.

Journal ArticleDOI
TL;DR: The main goal of APUTM is to understand the trade-offs of implementing a software TM on such platform, and it is shown that it is able to outperform sequential execution of the applications.
Abstract: The heterogeneous accelerated processing units (APUs) integrate a multi-core CPU and a GPU within the same chip. Modern APUs implement CPU–GPU platform atomics for simple data types. However, ensuring atomicity for complex data types is a task delegated to programmers. Transactional memory (TM) is an optimistic approach to achieve this goal. With TM, shared data can be accessed by multiple computing threads speculatively, but changes are only visible if a transaction ends with no conflict with others in its memory accesses. In this paper we present APUTM, a software TM designed for APU processors which focuses on minimizing the access to shared metadata. The main goal of APUTM is to understand the trade-offs of implementing a software TM on such platform. In our experiments, APUTM is able to outperform sequential execution of the applications. Additionally, we compare its adaptability to execute in one of the devices or in both simultaneously.

Posted Content
TL;DR: This work evaluates and analyzes the performance of several concurrent Union-Find algorithms and optimization strategies across a wide range of platforms and workloads and finds one of the fastest algorithm variants is a sequential one that uses coarse-grained locking with the lock elision optimization to reduce synchronization cost and increase scalability.
Abstract: Union-Find (or Disjoint-Set Union) is one of the fundamental problems in computer science; it has been well-studied from both theoretical and practical perspectives in the sequential case. Recently, there has been mounting interest in analyzing this problem in the concurrent scenario, and several asymptotically-efficient algorithms have been proposed. Yet, to date, there is very little known about the practical performance of concurrent Union-Find. This work addresses this gap. We evaluate and analyze the performance of several concurrent Union-Find algorithms and optimization strategies across a wide range of platforms (Intel, AMD, and ARM) and workloads (social, random, and road networks, as well as integrations into more complex algorithms). We first observe that, due to the limited computational cost, the number of induced cache misses is the critical determining factor for the performance of existing algorithms. We introduce new techniques to reduce this cost by storing node priorities implicitly and by using plain reads and writes in a way that does not affect the correctness of the algorithms. Finally, we show that Union-Find implementations are an interesting application for Transactional Memory (TM): one of the fastest algorithm variants we discovered is a sequential one that uses coarse-grained locking with the lock elision optimization to reduce synchronization cost and increase scalability.

Proceedings ArticleDOI
01 Apr 2019
TL;DR: A lightweight transactional memory library TuFast which provides easy-to-use primitives for the end-user to agilely develop fast shared memory graph parallelization on a multi-core server and outperform state-of-the-art distributed and multi-core graph analytical systems by up to 4 orders of magnitude.
Abstract: Recently, there has been significant interest in large-scale graph analytics systems. However, most of the design efforts focus on accelerating graph analytics on giant graphs and/or in a distributed environment. Little attention focuses on the programmer usability perspective, which is critical to implementing ad-hoc analytics on moderate size graphs. In this paper, we present a lightweight transactional memory (TM) library TuFast which provides easy-to-use primitives for the end-user to agilely develop fast shared memory graph parallelization on a multi-core server. TuFast exploits recent CPU instructions set Hardware Transactional Memory (HTM), which has been available in off-the-shelf CPUs. HTM offers free transactional semantic but also suffers from capacity limitation. Our framework resolves the capacity challenge and efficiently utilizes HTM on graph parallelization by exploiting the graph degree information. Large scale graphs have a power-law degree distribution: a large proportion of the vertices with a small degree, fits in single HTM transactions; a small proportion of vertices with a big degree fits a pessimistic approach like locking; other vertices with a moderate degree can be processed with an optimistic approach with HTM acceleration. Our hybrid approach automatically adapts to the degree of graphs dynamically during the processing. The graph analytical jobs expressed via our library are straightforward and concise and outperform state-of-the-art distributed and multi-core graph analytical systems by up to 4 orders of magnitude.
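The degree-based hybrid scheme described in this abstract can be sketched as a simple dispatch on vertex degree. This is only an illustration of the idea, not TuFast's API: the names `Mode`, `choose_mode`, and `plan`, and the two threshold values, are hypothetical, and the abstract does not state how the real library picks its cut-offs.

```python
from enum import Enum


class Mode(Enum):
    HTM = "single HTM transaction"
    OPTIMISTIC = "optimistic with HTM acceleration"
    LOCK = "pessimistic locking"


def choose_mode(degree, htm_limit=64, lock_limit=4096):
    """Pick a synchronization mode from a vertex's degree.

    htm_limit and lock_limit are made-up thresholds for illustration.
    """
    if degree <= htm_limit:
        return Mode.HTM          # small neighborhood fits in one HTM transaction
    if degree >= lock_limit:
        return Mode.LOCK         # very large neighborhood: lock pessimistically
    return Mode.OPTIMISTIC       # moderate degree: optimistic path


def plan(graph):
    """Map each vertex of an adjacency-list graph to its chosen mode."""
    return {v: choose_mode(len(neighbors)) for v, neighbors in graph.items()}
```

For a power-law graph, most vertices would land in `Mode.HTM` and only a handful of hubs in `Mode.LOCK`, which is the distribution the abstract relies on.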

Journal ArticleDOI
TL;DR: This work quantifies the performance and energy efficiency of HTM for scientific workloads based on the widely-used CLOMP-TM benchmark and discusses a set of generic software optimizations, which effectively improve the performance of transactional science workloads on large-scale NUMA systems.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: This work introduces the abstraction of Heterogeneous Transactional Memory (HeTM), which provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions.
Abstract: Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. However, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, referred herein as Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages speculative techniques, which aims at hiding the inherently large communication latency between CPUs and discrete GPUs and at minimizing inter-device synchronization overhead. We demonstrate the efficiency of the SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a popular object caching system.

Proceedings ArticleDOI
TL;DR: In this paper, speculative heterogeneous transactional memory (HeTM) is proposed to reduce the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory.
Abstract: Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notoriously hard task. This work takes a step towards reducing the complexity of programming heterogeneous systems by introducing the abstraction of Heterogeneous Transactional Memory (HeTM). HeTM provides programmers with the illusion of a single memory region, shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with support for atomic transactions. Besides introducing the abstract semantics and programming model of HeTM, we present the design and evaluation of a concrete implementation of the proposed abstraction, which we named Speculative HeTM (SHeTM). SHeTM makes use of a novel design that leverages on speculative techniques and aims at hiding the inherently large communication latency between CPUs and discrete GPUs and at minimizing inter-device synchronization overhead. SHeTM is based on a modular and extensible design that allows for easily integrating alternative TM implementations on the CPU's and GPU's sides, which allows the flexibility to adopt, on either side, the TM implementation (e.g., in hardware or software) that best fits the applications' workload and the architectural characteristics of the processing unit. We demonstrate the efficiency of the SHeTM via an extensive quantitative study based both on synthetic benchmarks and on a porting of a popular object caching system.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: This paper describes a mechanism that allows conventional hardware to support lazy conflict detection, while still keeping the coherence protocol intact, and introduces ForgiveTM, a scheme that effectively allows conflict detection to be done lazily.
Abstract: Commercial hardware transactional memory (TM) systems commonly use coherence messages to detect data conflicts. When a core inside a transaction receives a coherence request for data, it uses this information to determine whether there was a data conflict. Inherent in this behavior is the fact that data conflicts are detected eagerly, i.e., as soon as possible, and even while both sides of the conflict are speculative. Although it has been shown that lazy conflict detection can lead to better performance, this approach precludes lazy detection. In this paper, we describe a mechanism that allows conventional hardware to support lazy conflict detection, while still keeping the coherence protocol intact. Under ForgiveTM, speculative writes are done immediately to a special buffer, without first obtaining global write permission. The write permission is acquired later, when the transaction is about to commit. In other words, it "acts first, and asks forgiveness later." This effectively allows conflict detection to be done lazily. Using this scheme, ForgiveTM is able to provide 19% overall performance improvement in STAMP.

Posted Content
TL;DR: In this paper, the authors explore the concurrency vs. software validation trade-offs for hybrid TMs and show that, unlike in progressive STMs, software transactions in progressive HyTMs cannot avoid incremental validation, even if hardware transactions can read metadata non-speculatively.
Abstract: State-of-the-art software transactional memory (STM) implementations achieve good performance by carefully avoiding the overhead of incremental validation (i.e., re-reading previously read data items to avoid inconsistency) while still providing progressiveness (allowing transactional aborts only due to data conflicts). Hardware transactional memory (HTM) implementations promise even better performance, but offer no progress guarantees. Thus, they must be combined with STMs, leading to hybrid TMs (HyTMs) in which hardware transactions must be instrumented (i.e., access metadata) to detect contention with software transactions. We show that, unlike in progressive STMs, software transactions in progressive HyTMs cannot avoid incremental validation. In fact, this result holds even if hardware transactions can read metadata non-speculatively. We then present opaque HyTM algorithms providing progressiveness for a subset of transactions that are optimal in terms of hardware instrumentation. We explore the concurrency vs. hardware instrumentation vs. software validation trade-offs for these algorithms. Our experiments with Intel and IBM POWER8 HTMs seem to suggest that (i) the cost of concurrency also exists in practice, (ii) it is important to implement HyTMs that provide progressiveness for a maximal set of transactions without incurring high hardware instrumentation overhead or using global contending bottlenecks and (iii) there is no easy way to derive more efficient HyTMs by taking advantage of non-speculative accesses within hardware.

Book ChapterDOI
22 Oct 2019
TL;DR: The authors implement an adaptive versioning approach in the latest TinySTM distribution and evaluate it extensively on 5 micro-benchmarks and 8 complex benchmarks from the STAMP and STAMPEDE suites, showing significant benefits of the approach.
Abstract: Transactional memory has been receiving much attention from both academia and industry. In transactional memory, program code is split into transactions, blocks of code that appear to execute atomically. Transactions are executed speculatively, and the speculative execution is supported through data versioning and conflict detection and resolution mechanisms. Lazy versioning makes aborts fast but penalizes commits, whereas eager versioning makes commits fast but penalizes aborts. In this paper, we present an adaptive versioning approach that dynamically switches between eager and lazy versioning at runtime based on appropriate system parameters, so that the performance of a transactional memory system is always better than that obtained using either eager or lazy versioning individually. We implemented our adaptive versioning approach in the latest TinySTM distribution and extensively evaluated it through 5 micro-benchmarks and 8 complex benchmarks from the STAMP and STAMPEDE suites. The results show significant benefits of our approach, giving performance improvements of as much as 6.3x for execution time and as much as 170x for number of aborts.
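
The trade-off between the two versioning policies can be sketched as follows. This is not the TinySTM implementation: `EagerTx`, `LazyTx`, and `choose_versioning` are hypothetical names, and the abort-rate threshold is an assumed stand-in for the "appropriate system parameters", which the abstract does not specify.

```python
_MISSING = object()  # marks an address that had no value before the write


class EagerTx:
    """Eager versioning: write in place, keep old values in an undo log."""

    def __init__(self, memory):
        self.memory = memory
        self.undo_log = []

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory.get(addr, _MISSING)))
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()  # fast: the new values are already in place

    def abort(self):
        # Slow: every write must be rolled back, newest first.
        for addr, old in reversed(self.undo_log):
            if old is _MISSING:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = old
        self.undo_log.clear()


class LazyTx:
    """Lazy versioning: buffer new values, publish them at commit."""

    def __init__(self, memory):
        self.memory = memory
        self.redo_buffer = {}

    def write(self, addr, value):
        self.redo_buffer[addr] = value

    def commit(self):
        # Slow: every buffered write must be copied to memory.
        self.memory.update(self.redo_buffer)
        self.redo_buffer.clear()

    def abort(self):
        self.redo_buffer.clear()  # fast: memory was never touched


def choose_versioning(abort_rate, threshold=0.5):
    """Hypothetical policy: prefer lazy versioning when aborts are frequent."""
    return LazyTx if abort_rate > threshold else EagerTx
```

The policy function reflects only the qualitative statement in the abstract (lazy favors aborts, eager favors commits); an actual adaptive system would likely use richer runtime statistics.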

Patent
15 Oct 2019
TL;DR: In this article, a transaction log is updated in a persistent memory at each memory access of the transaction and when the transaction is complete, it is marked as ‘pending’.
Abstract: Methods and apparatus are provided for executing a transaction in a data processing system. Responsive to each memory access of the transaction, a transaction log is updated in a persistent memory. After execution of the transaction and when the transaction log is complete, the transaction log is marked as ‘pending’. When all values modified in the transaction have been written back to the persistent memory, the transaction log is marked as ‘free’. When, following a reboot, a transaction log is marked as ‘pending’, data stored in the transaction log is copied to the persistent memory at addresses indicated in the transaction log. After the copying is complete, the transaction log is marked as ‘free’. Cache values modified in the transaction may be written back to persistent memory when evicted, and values read in the transaction may be read from the cache rather than from the transaction log.
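
The log life cycle described in this abstract (update on each access, mark ‘pending’, write back, mark ‘free’, replay after reboot) can be sketched as a small simulation. Everything here is an assumption made for illustration: `PersistentTxSystem` and its method names are hypothetical, persistent memory and the cache are plain dictionaries, and only writes are logged.

```python
class PersistentTxSystem:
    """Toy model of a log-based persistent transaction (illustrative only)."""

    def __init__(self):
        self.persistent = {}    # simulated persistent memory: addr -> value
        self.cache = {}         # simulated volatile cache, lost on a crash
        self.log_entries = []   # transaction log, assumed to be persistent
        self.log_state = "free"

    def begin(self):
        self.log_entries = []
        self.log_state = "free"

    def write(self, addr, value):
        # The log is updated on each memory access of the transaction.
        self.log_entries.append((addr, value))
        self.cache[addr] = value

    def end(self):
        # The log is complete: mark it 'pending'.
        self.log_state = "pending"

    def write_back(self):
        # All modified values reach persistent memory, then the log is freed.
        for addr, _ in self.log_entries:
            self.persistent[addr] = self.cache[addr]
        self.log_state = "free"

    def crash(self):
        self.cache = {}  # volatile contents disappear

    def recover(self):
        # After a reboot, a 'pending' log is replayed, then marked 'free'.
        if self.log_state == "pending":
            for addr, value in self.log_entries:
                self.persistent[addr] = value
            self.log_state = "free"
```

In this model, a crash after `end()` but before `write_back()` loses nothing, because recovery replays the pending log. A crash before `end()` leaves the log unmarked, so recovery ignores the incomplete transaction.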

Book ChapterDOI
22 Oct 2019
TL;DR: A Starvation-Free Multi-Version OSTM (SF-MVOSTM) is proposed which ensures starvation-freedom in object-based STM systems while satisfying correctness criteria such as co-opacity and local opacity.
Abstract: To utilize multi-core processors properly, concurrent programming is needed. The main challenge is to design a correct and efficient concurrent program. Software Transactional Memory Systems (STMs) provide ease of multithreading to the programmer without worrying about concurrency issues such as deadlock, livelock, priority inversion, etc. Most STMs work on read-write operations and are known as RWSTMs. Some STMs work on higher-level operations and ensure greater concurrency than RWSTMs. Such STMs are known as Single-Version Object-based STMs (SVOSTMs). The transactions of SVOSTMs can return commit or abort. Aborted SVOSTM transactions retry. But in the current setting of SVOSTMs, transactions may starve. So, we propose a starvation-free SVOSTM, SF-SVOSTM, that satisfies the correctness criterion of conflict-opacity.

Book ChapterDOI
19 Jun 2019
TL;DR: This work extends GPU Software Transactional Memory to allow threads across many GPUs to access a coherent distributed shared memory space and proposes a scheme for GPU-to-GPU communication using CUDA-Aware MPI.
Abstract: We present CUDA-DTM, the first ever Distributed Transactional Memory framework written in CUDA for large scale GPU clusters. Transactional Memory has become an attractive auto-coherence scheme for GPU applications with irregular memory access patterns due to its ability to avoid serializing threads while still maintaining programmability. We extend GPU Software Transactional Memory to allow threads across many GPUs to access a coherent distributed shared memory space and propose a scheme for GPU-to-GPU communication using CUDA-Aware MPI. The performance of CUDA-DTM is evaluated using a suite of seven irregular memory access benchmarks with varying degrees of compute intensity, contention, and node-to-node communication frequency. Using a cluster of 256 devices, our experiments show that GPU clusters using CUDA-DTM can be up to 115x faster than CPU clusters.

Proceedings ArticleDOI
01 Sep 2019
TL;DR: This paper presents a new approach to multiversioning support (Multiversioned Page Overlays) along with a new HTM design that it enables: OverlayTM, which takes advantage of multiversioning to reduce unnecessary transaction aborts while providing full serializable semantics.
Abstract: Practical and efficient support for multiversioning memory systems would offer a number of potential advantages, including improving the performance and functionality of hardware transactional memory (HTM). This paper presents a new approach to multiversioning support (Multiversioned Page Overlays) along with a new HTM design that it enables: OverlayTM. Compared with existing HTM designs, OverlayTM takes advantage of multiversioning to reduce unnecessary transaction aborts while providing full serializable semantics (in contrast with multiversioning HTMs that improve performance at the expense of being vulnerable to write skew anomalies). Our performance results demonstrate that OverlayTM is especially advantageous in read-heavy workloads.

Proceedings ArticleDOI
02 Sep 2019
TL;DR: This paper proves that a Python STM implementation is serializable by constructing its Push/Pull model and by showing that the model satisfies the correctness criteria for the relevant push/pull semantic rules.
Abstract: The Push/Pull semantic model of transactions has appeared recently as a solution that unifies a wide range of transactional memory algorithms. It has been proved that the push/pull semantic model satisfies serializability, thus one may prove that a given STM satisfies serializability by constructing its push/pull model such that this model satisfies respective correctness criteria. In this paper, we prove that a Python STM implementation is serializable by constructing its Push/Pull model and by showing that the model satisfies the correctness criteria for the relevant push/pull semantic rules. We first identify that modeling Python STM requires only four, out of seven, push/pull operations, namely the operations pull, apply, push, and commit. Next, we introduce the detailed specification of the PSTM transactional algorithm. Then we map the steps of the PSTM transactional algorithm to the respective push/pull semantic rules. Finally, we prove that the PSTM algorithm satisfies the correctness criteria of the respective push/pull semantic rules. We have envisaged this paper to provide interested researchers with a better understanding of PSTM semantics, in order to construct push/pull models of their own STMs more easily.

Proceedings ArticleDOI
01 Feb 2019
TL;DR: This paper examines the code generated by a state-of-the-art JavaScript compiler and finds that the code has a high frequency of Stack Map Points, and extends the compiler to generate hardware transactions around SMPs, and performs simple within-transaction optimizations enabled by transactions.
Abstract: Scripting languages’ inferior performance stems from compilers lacking enough static information. To address this limitation, they use JIT compilers organized into multiple tiers, with higher tiers using profiling information to generate high-performance code. Checks are inserted to detect incorrect assumptions and, when a check fails, execution transfers to a lower tier. The points of potential transfer between tiers are called Stack Map Points (SMPs). They require a consistent state in both tiers and, hence, limit code optimization across SMPs in the higher tier. This paper examines the code generated by a state-of-the-art JavaScript compiler and finds that the code has a high frequency of SMPs. These SMPs rarely cause execution to transfer to lower tiers. However, both the optimization-limiting effect of the SMPs and the overhead of the SMP-guarding checks contribute to scripting languages’ low performance. To tackle this problem, we extend the compiler to generate hardware transactions around SMPs, and perform simple within-transaction optimizations enabled by transactions. We target emerging lightweight HTM systems and call our changes NoMap. We evaluate NoMap on the SunSpider and Kraken suites. We find that NoMap lowers the instruction count by an average of 14.2% and 11.5%, and the execution time by an average of 16.7% and 8.9%, for SunSpider and Kraken, respectively. Keywords: JavaScript; Transactional Memory; Compiler Optimizations; JIT Compilation.

Proceedings ArticleDOI
04 Jan 2019
TL;DR: In this paper, a block of the chain consists of multiple transactions of smart contracts which are added by a miner, and the validators serially reexecute the smart contract transactions of the block.
Abstract: It is commonly believed that blockchain is a revolutionary technology for doing business on the Internet. Blockchain is a decentralized, distributed database or ledger of records. It ensures that the records are tamper-proof but publicly readable. Blockchain platforms such as Ethereum [3] and several others execute complex transactions in blocks through user-defined scripts known as smart contracts. Normally, a block of the chain consists of multiple smart contract transactions, which are added by a miner. To append a correct block to the blockchain, miners execute these smart contract transactions sequentially. Later, the validators serially re-execute the smart contract transactions of the block. If the validators agree with the final state of the block as recorded by the miner, then the block is said to be valid and is added to the blockchain using a consensus protocol.
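
The sequential miner and validator workflow described above can be sketched as below. This is a toy model, not Ethereum's execution engine: `transfer`, `execute_serially`, `mine_block`, and `validate_block` are hypothetical names, smart contract calls are modeled as plain Python functions over a balance dictionary, and the consensus step is omitted.

```python
import copy


def transfer(sender, receiver, amount):
    """Build a toy smart contract call that moves funds if the balance allows."""
    def contract_call(state):
        if state.get(sender, 0) >= amount:
            state[sender] = state.get(sender, 0) - amount
            state[receiver] = state.get(receiver, 0) + amount
    return contract_call


def execute_serially(initial_state, transactions):
    """Run the transactions one after another on a copy of the state."""
    state = copy.deepcopy(initial_state)
    for tx in transactions:
        tx(state)
    return state


def mine_block(initial_state, transactions):
    """Miner: execute sequentially and record the resulting final state."""
    final_state = execute_serially(initial_state, transactions)
    return {"transactions": transactions, "final_state": final_state}


def validate_block(initial_state, block):
    """Validator: serially re-execute and compare with the recorded state."""
    replayed = execute_serially(initial_state, block["transactions"])
    return replayed == block["final_state"]
```

A block whose recorded final state differs from what serial re-execution produces is rejected by `validate_block`. Because both sides execute one transaction at a time in this model, neither benefits from multi-core hardware.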