
Showing papers on "Transactional memory published in 2020"


Proceedings ArticleDOI
09 Mar 2020
TL;DR: TimeStone is a highly scalable DTM system with low write amplification and minimal memory footprint, which uses a novel multi-layered hybrid logging technique, called TOC logging, to guarantee crash consistency and relies on a Multi-Version Concurrency Control mechanism to achieve high scalability and to support different isolation levels on the same data set.
Abstract: Non-volatile main memory (NVMM) technologies promise byte addressability and near-DRAM access that allows developers to build persistent applications with common load and store instructions. However, it is difficult to realize these promises because NVMM software should also provide crash consistency while providing high performance and scalability. Durable transactional memory (DTM) systems address these challenges. However, none of them scale beyond 16 cores. The poor scalability either stems from the underlying STM layer or from employing limited write parallelism (single writer or dual version). In addition, existing approaches face other fundamental issues in guaranteeing crash consistency: high write amplification and a large memory footprint. To address these challenges, we propose TimeStone: a highly scalable DTM system with low write amplification and minimal memory footprint. TimeStone uses a novel multi-layered hybrid logging technique, called TOC logging, to guarantee crash consistency. Also, TimeStone further relies on a Multi-Version Concurrency Control (MVCC) mechanism to achieve high scalability and to support different isolation levels on the same data set. Our evaluation of TimeStone against the state-of-the-art DTM systems shows that it significantly outperforms other systems for a wide range of workloads with varying data-set size and contention levels, up to 112 hardware threads. In addition, with our TOC logging, TimeStone achieves a write amplification of less than 1, while existing DTM systems suffer from 2×-6× overhead.
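
TimeStone's internals are not reproduced here, but the MVCC idea it relies on can be sketched in a few lines: each shared item keeps a list of timestamped versions, and a reader returns the newest version no later than its snapshot timestamp. The class and method names below are hypothetical, and the sketch deliberately ignores persistence, logging, and write-write conflict handling.

```python
import threading

class MVStore:
    """Toy multi-version store: each key maps to a list of
    (commit_ts, value) pairs, and a reader returns the newest
    version no later than its snapshot timestamp."""

    def __init__(self):
        self._versions = {}      # key -> [(commit_ts, value), ...]
        self._clock = 0          # global commit timestamp
        self._lock = threading.Lock()

    def begin(self):
        """Take a snapshot timestamp for a new reader."""
        with self._lock:
            return self._clock

    def read(self, key, snapshot_ts):
        """Return the latest value committed at or before snapshot_ts."""
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def commit(self, writes):
        """Install a new version of every written key atomically."""
        with self._lock:
            self._clock += 1
            for key, value in writes.items():
                self._versions.setdefault(key, []).append((self._clock, value))
            return self._clock
```

A reader that began before a commit keeps seeing its snapshot, which is how MVCC lets readers and writers proceed concurrently and supports different isolation levels over the same data.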

37 citations


Posted Content
TL;DR: In this paper, the authors propose Atomic Active Messages (AAM), a mechanism that accelerates irregular graph computations on both shared- and distributed-memory machines, and conduct a detailed performance analysis of AAM on Intel Haswell and IBM Blue Gene/Q, illustrating various performance tradeoffs between different HTM parameters that impact the efficiency of graph processing.
Abstract: We propose Atomic Active Messages (AAM), a mechanism that accelerates irregular graph computations on both shared- and distributed-memory machines. The key idea behind AAM is that hardware transactional memory (HTM) can be used for simple and efficient processing of irregular structures in highly parallel environments. We illustrate techniques such as coarsening and coalescing that enable hardware transactions to considerably accelerate graph processing. We conduct a detailed performance analysis of AAM on Intel Haswell and IBM Blue Gene/Q and we illustrate various performance tradeoffs between different HTM parameters that impact the efficiency of graph processing. AAM can be used to implement abstractions offered by existing programming models and to improve the performance of irregular graph processing codes such as Graph500 or Galois.

31 citations


Proceedings ArticleDOI
15 Apr 2020
TL;DR: CX-PUC is presented, the first bounded wait-free persistent universal construction requiring no annotation of the underlying sequential data structure, and Redo-PTM is proposed, a new generic construction based on a finite number of replicas and Herlihy's wait- free consensus, which uses physical instead of logical logging.
Abstract: Non-Volatile Main Memory (NVMM) has brought forth the need for data structures that are not only concurrent but also resilient to non-corrupting failures. Until now, persistent transactional memory libraries (PTMs) have focused on providing correct recovery from non-corrupting failures without memory leaks. Most PTMs that provide concurrent access do so with blocking progress. The main focus of this paper is to design practical PTMs with wait-free progress based on universal constructions. We first present CX-PUC, the first bounded wait-free persistent universal construction requiring no annotation of the underlying sequential data structure. CX-PUC is an adaptation to persistence of CX, a recently proposed universal construction. We next introduce CX-PTM, a PTM that achieves better throughput and supports transactions over multiple data structure instances, at the price of requiring annotation of the loads and stores in the data structure---as is commonplace in software transactional memory. Finally, we propose a new generic construction, Redo-PTM, based on a finite number of replicas and Herlihy's wait-free consensus, which uses physical instead of logical logging. By exploiting its capability of providing wait-free ACID transactions, we have used Redo-PTM to implement the world's first persistent key-value store with bounded wait-free progress.

22 citations


Proceedings ArticleDOI
11 Jun 2020
TL;DR: Crafty employs a novel technique called nondestructive undo logging that leverages commodity hardware transactional memory (HTM) capabilities to control persist ordering, achieving state-of-the-art performance under low contention and competitive performance under high contention.
Abstract: Byte-addressable persistent memory, such as Intel/Micron 3D XPoint, is an emerging technology that bridges the gap between volatile memory and persistent storage. Data in persistent memory survives crashes and restarts; however, it is challenging to ensure that this data is consistent after failures. Existing approaches incur significant performance costs to ensure crash consistency. This paper introduces Crafty, a new approach for ensuring consistency and atomicity on persistent memory operations using commodity hardware with existing hardware transactional memory (HTM) capabilities, while incurring low overhead. Crafty employs a novel technique called nondestructive undo logging that leverages commodity HTM to control persist ordering. Our evaluation shows that Crafty outperforms state-of-the-art prior work under low contention, and performs competitively under high contention.
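
Crafty's actual mechanism couples its undo log with HTM to control persist ordering; the toy sketch below shows only the plain undo-logging half of the idea — record the old value before each in-place write, and roll the log back on recovery. All names are hypothetical, and the "crash" is simulated simply by calling recover before commit.

```python
class UndoLogStore:
    """Toy undo-log store: record each old value before mutating in
    place; a crash before commit is undone by replaying the log in
    reverse. (A real system must also persist log entries in order.)"""

    _MISSING = object()          # marker for "key did not exist"

    def __init__(self):
        self.data = {}
        self.log = []            # [(key, old_value_or_MISSING), ...]

    def tx_write(self, key, value):
        self.log.append((key, self.data.get(key, self._MISSING)))
        self.data[key] = value   # mutate in place after logging

    def commit(self):
        self.log.clear()         # truncating the log marks the commit

    def recover(self):
        """Roll back uncommitted writes, newest first."""
        for key, old in reversed(self.log):
            if old is self._MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self.log.clear()
```

The ordering constraint — the log entry must be durable before the in-place write — is exactly the persist ordering that Crafty enforces with HTM instead of expensive fences.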

17 citations


Proceedings ArticleDOI
Jungi Jeong1, Jaewan Hong2, Seungryoul Maeng2, Changhee Jung1, Youngjin Kwon2 
01 Oct 2020
TL;DR: UHTM, an unbounded hardware transactional memory for DRAM and NVM hybrid memory systems, combines the cache coherence protocol and address signatures to detect conflicts across the entire memory space, and improves concurrency by significantly reducing the false-positive rates of previous studies.
Abstract: Persistent memory programming requires failure atomicity. To achieve this in an efficient manner, recent proposals use hardware-based logging for atomic-durable updates and hardware transactional memory (HTM) for isolation. Although unbounded HTMs are promising for both performance and programmability reasons, none of the previous studies satisfies the practical requirements. They either require unrealistic hardware overheads or do not allow transactions to exceed on-chip cache boundaries. Furthermore, it has never been possible to use both DRAM and NVM in HTM, though it is becoming a popular persistency model. To this end, this study proposes UHTM, unbounded hardware transactional memory for DRAM and NVM hybrid memory systems. UHTM combines the cache coherence protocol and address signatures to detect conflicts in the entire memory space. This approach improves concurrency by significantly reducing the false-positive rates of previous studies. More importantly, UHTM allows both DRAM and NVM data to interact with each other in transactions without compromising the consistency guarantee. This is rendered possible by UHTM’s hybrid version management that provides an undo-based log for DRAM and a redo-based log for NVM. The experimental results show that UHTM outperforms the state-of-the-art durable HTM, which is LLC-bounded, by 56% on average and up to 818%.
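
UHTM's hybrid version management pairs an undo log for DRAM with a redo log for NVM. The redo half can be sketched as follows: transactional writes are buffered in a log and replayed to the home locations only at commit, so a crash before the commit marker is persisted loses nothing. This is a hypothetical software toy, not UHTM's hardware design; the commit "marker" here is just a boolean.

```python
class RedoLogStore:
    """Toy redo-log store: transactional writes are buffered in a log
    and replayed to home locations only at commit, so a crash before
    the commit marker is persisted simply discards the log."""

    def __init__(self):
        self.data = {}           # home locations (stand-in for NVM)
        self.log = {}            # buffered transactional writes
        self.committed = False   # stand-in for a persisted commit marker

    def tx_write(self, key, value):
        self.log[key] = value    # redirect the write into the log

    def tx_read(self, key):
        return self.log.get(key, self.data.get(key))  # read own writes

    def commit(self):
        self.committed = True            # persist the marker first...
        for key, value in self.log.items():
            self.data[key] = value       # ...then replay the log
        self.log.clear()
        self.committed = False

    def crash_recover(self):
        """After a crash, replay a committed log or discard an
        uncommitted one; home locations were never dirtied early."""
        if self.committed:
            for key, value in self.log.items():
                self.data[key] = value
            self.committed = False
        self.log.clear()
```

The trade-off against undo logging is visible even in the toy: redo never dirties home locations before commit (good for slow NVM writes), but every transactional read must first check the log.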

14 citations


Book ChapterDOI
03 Jun 2020
TL;DR: In this article, the authors leverage multiple threads to execute SCTs, achieving better efficiency and higher throughput, and employ Multi-Version OSTMs, which maintain multiple versions for each shared data item, as opposed to Single-Version OSTMs (SVOSTMs).
Abstract: Several popular blockchains such as Ethereum execute complex transactions through user-defined scripts. A block of the chain typically consists of multiple smart contract transactions (SCTs). To append a block into the blockchain, a miner executes these SCTs. On receiving this block, other nodes act as validators, who re-execute these SCTs as part of the consensus protocol to validate the block. In Ethereum and other blockchains that support cryptocurrencies, a miner gets an incentive every time such a valid block is successfully added to the blockchain. When executing SCTs sequentially, miners and validators fail to harness the power of multiprocessing offered by the prevalence of multi-core processors, thus degrading throughput. By leveraging multiple threads to execute SCTs, we can achieve better efficiency and higher throughput. Recently, Read-Write Software Transactional Memory Systems (RWSTMs) were used for concurrent execution of SCTs. It is known that Object-based STMs (OSTMs), using higher-level objects (such as hash-tables or lists), achieve better throughput as compared to RWSTMs. Even greater concurrency can be obtained using Multi-Version OSTMs (MVOSTMs), which maintain multiple versions for each shared data item as opposed to Single-Version OSTMs (SVOSTMs).

10 citations


Proceedings ArticleDOI
19 Feb 2020
TL;DR: This paper applies hardware transactional memory (HTM) to design TxCAS, a scalable compare-and-set (CAS) primitive, and applies it to the baskets queue, which steers enqueuers whose CAS fails into dedicated basket data structures, resulting in SBQ, the scalable baskets queue.
Abstract: Queues are fundamental concurrent data structures, but despite years of research, even the state-of-the-art queues scale poorly. This poor scalability occurs because of contended atomic read-modify-write (RMW) operations. This paper makes a first step towards designing a scalable linearizable queue. We leverage hardware transactional memory (HTM) to design TxCAS, a scalable compare-and-set (CAS) primitive---despite HTM being targeted mainly at uncontended scenarios. Leveraging TxCAS's scalability requires a queue design that does not blindly retry failed CASs. We thus apply TxCAS to the baskets queue, which steers enqueuers whose CAS fails into dedicated basket data structures. Coupled with a new, scalable basket algorithm, we obtain SBQ, the scalable baskets queue. At high concurrency levels, SBQ outperforms the fastest queue today by 1.6X on a producer-only workload.
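
The basket idea from the abstract can be caricatured in a few lines: an enqueuer whose tail CAS fails deposits its item into the basket of the node that won, instead of retrying on the contended tail. This toy uses a lock to emulate an atomic CAS, ignores dequeues and linearizability, and all names are invented; SBQ's real algorithm is far more careful.

```python
import threading

class Node:
    def __init__(self, value):
        self.value = value
        self.basket = []         # items from enqueuers whose CAS lost
        self.next = None

class BasketQueueSketch:
    """Toy caricature of the baskets idea: an enqueuer that loses the
    tail CAS joins the winner's basket instead of retrying the CAS."""

    def __init__(self):
        self.head = Node(None)   # sentinel
        self.tail = self.head
        self._lock = threading.Lock()

    def _cas_tail(self, expected, new):
        """Compare-and-set on the tail pointer (a lock emulates it)."""
        with self._lock:
            if self.tail is expected:
                self.tail = new
                return True
            return False

    def enqueue(self, value):
        node = Node(value)
        tail = self.tail
        if self._cas_tail(tail, node):
            tail.next = node         # link the winner's node
        else:
            with self._lock:
                self.tail.basket.append(value)  # join the basket

    def to_list(self):
        out, node = [], self.head.next
        while node:
            out.append(node.value)
            out.extend(node.basket)
            node = node.next
        return out
```

The point of the design is that a failed CAS already tells the loser *where* its item belongs (the same basket as the winner's), so the hot tail pointer is not hammered with retries.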

7 citations


Proceedings ArticleDOI
22 Feb 2020
TL;DR: The authors make the first attempt at cross-ISA virtualization of hardware transactional memory (HTM): the proposed mechanism emulates guest HTMs using host HTMs, and tries to preserve as much as possible the performance and scalability of guest applications.
Abstract: System virtualization is a key enabling technology. However, existing virtualization techniques suffer from a significant limitation due to their limited cross-ISA support for emerging architecture-specific hardware extensions. To address this issue, we make the first attempt at hardware transactional memory (HTM), which has been supported by modern multi-core processors and used by more and more applications to simplify concurrent programming. In particular, we propose an efficient and scalable mechanism to support cross-ISA virtualization of HTMs. The mechanism emulates guest HTMs using host HTMs, and tries to preserve as much as possible the performance and the scalability of guest applications. Experimental results on STAMP benchmarks show that an average of 2.3X and 12.6X performance speedup can be achieved respectively for x86_64 and PowerPC64 guest applications on an x86_64 host machine. Moreover, it can attain similar scalability to the native execution of the applications.

7 citations


Journal ArticleDOI
TL;DR: This work proposes using Helper Warps to move persistence out of the critical path of transaction execution, alleviating the impact of latencies; it achieves speedups of 4.4× and 1.5× under bandwidth limits, resulting in a reduction in overall energy consumption.
Abstract: Non-volatile Random Access Memories (NVRAM) have emerged in recent years to bridge the performance gap between the main memory and external storage devices, such as Solid State Drives (SSD). In addition to higher storage density, NVRAM provides byte-addressability, higher bandwidth, near-DRAM latency, and easier access compared to block devices such as traditional SSDs. This enables new programming paradigms taking advantage of durability and larger memory footprint. With the range and size of GPU workloads expanding, NVRAM will present itself as a promising addition to the GPU's memory hierarchy. To utilize the non-volatility of NVRAMs, programs should allow durable stores, maintaining consistency through a power loss event. This is usually done through a logging mechanism that works in tandem with a transaction execution layer, which can consist of a transactional memory or a locking mechanism. Together, this results in a transaction processing system that preserves the ACID properties. GPUs are designed with high throughput in mind, leveraging high degrees of parallelism. Transactional memory proposals enable fine-grained transactions at the GPU thread level. However, with lower write bandwidths compared to those of DRAMs, using NVRAM as-is may yield sub-optimal overall system performance when threads experience long latency. To address this problem, we propose using Helper Warps to move persistence out of the critical path of transaction execution, alleviating the impact of latencies. Our mechanism achieves speedups of 4.4× and 1.5× under bandwidth limits of 1.6 GB/s and 12 GB/s, respectively, and is projected to maintain its speed advantage even when NVRAM bandwidth gets as high as hundreds of GB/s in certain cases. Due to the speedup, our proposed method also results in a reduction in overall energy consumption.

6 citations


Proceedings ArticleDOI
18 May 2020
TL;DR: A large throughput difference is found, which emphasizes the importance of choosing the best durability domain for each application and system, and confirms that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance.
Abstract: Storing data structures in high-capacity byte-addressable persistent memory instead of DRAM or a storage device offers the opportunity to (1) reduce cost and power consumption compared with DRAM, (2) decrease the latency and CPU resources needed for an I/O operation compared with storage, and (3) allow for fast recovery as the data structure remains in memory after a machine failure. The first commercial offering in this space is Intel® Optane™ Direct Connect (Optane™ DC) Persistent Memory. Optane™ DC promises access time within a constant factor of DRAM, with larger capacity, lower energy consumption, and persistence. We present an experimental evaluation of persistent transactional memory performance, and explore how Optane™ DC durability domains affect the overall results. Given that neither of the two available durability domains can deliver performance competitive with DRAM, we introduce and emulate a new durability domain, called PDRAM, in which the memory controller tracks enough information (and has enough reserve power) to make DRAM behave like a persistent cache of Optane™ DC memory. In this paper we compare the performance of these durability domains on several configurations of five persistent transactional memory applications. We find a large throughput difference, which emphasizes the importance of choosing the best durability domain for each application and system. At the same time, our results confirm that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance, with speedups as high as 6× at 16 threads.

Proceedings ArticleDOI
TL;DR: SI-HTM is proposed, which stretches the capacity bounds of the underlying HTM, thus opening HTM to a much broader class of applications; it exhibits improved scalability, achieving speedups of up to 300% relative to HTM on in-memory database benchmarks.
Abstract: The hardware transactional memory (HTM) implementations in commercially available processors are significantly hindered by their tight capacity constraints. In practice, this renders current HTMs unsuitable to many real-world workloads of in-memory databases. This paper proposes SI-HTM, which stretches the capacity bounds of the underlying HTM, thus opening HTM to a much broader class of applications. SI-HTM leverages the HTM implementation of the IBM POWER architecture with a software layer to offer a single-version implementation of Snapshot Isolation. When compared to HTM- and software-based concurrency control alternatives, SI-HTM exhibits improved scalability, achieving speedups of up to 300% relative to HTM on in-memory database benchmarks.

Proceedings ArticleDOI
18 May 2020
TL;DR: This paper analyzes for the very first time scheduling algorithms in the online dynamic scheduling setting where transactions and the objects they access are not known a priori and the transactions may arrive online over time and provides efficient and near-optimal execution time schedules for dynamic scheduling in many specialized network architectures.
Abstract: We investigate scheduling algorithms for distributed transactional memory systems where transactions residing at nodes of a communication graph operate on shared, mobile objects. A transaction requests the objects it needs, executes once those objects have been assembled, and then sends the objects to other waiting transactions. We study scheduling algorithms with provable performance guarantees. Previously, only the offline batch scheduling setting was considered in the literature, where transactions and the objects they access are known a priori. Minimizing execution time, even for offline batch scheduling, is known to be NP-hard for arbitrary communication graphs. In this paper, we analyze for the very first time scheduling algorithms in the online dynamic scheduling setting, where transactions and the objects they access are not known a priori and the transactions may arrive online over time. We provide efficient and near-optimal execution time schedules for dynamic scheduling in many specialized network architectures. The core of our technique is a method to convert offline schedules to online. We first describe a centralized scheduler, which we then adapt to a purely distributed scheduler. To our knowledge, these are the first attempts to obtain provably efficient online execution schedules for distributed transactional memory.

Journal ArticleDOI
TL;DR: A replication model using a quorum system for a transactional memory protocol where communication among the nodes takes place using gossip; the algorithm maintains the coherence of the objects and aims to achieve low communication cost while reducing the execution time of transactions.
Abstract: Single-copy Distributed Software Transactional Memory protocols maintain only one replica of each object in the system and are therefore prone to failures in a large-scale, dynamically changing network. In this paper we propose a replication model using a quorum system for a transactional memory protocol where communication among the nodes takes place using gossip. The previous protocols demand a static structure over the network. Maintenance of a static structure for a dynamic network requires a significant overhead. Our method executes on an unstructured network which does not require adaptation in case of node joining and node leaving. The algorithm maintains the coherence of the objects and aims to achieve low communication cost while reducing the execution time of transactions. The algorithm achieves a message complexity of O(√n) and a time complexity of O(log √n), which is an improvement over previous replication protocols for distributed transactional memory. Simulation results show that the method exhibits better fault tolerance and requires fewer messages than existing approaches.
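
The quorum intersection property that makes such replication work can be sketched quickly: if every write reaches a majority of replicas and every read polls a majority, the two sets must overlap, so a read always sees the latest version. The sketch below is a hypothetical single-register toy — no gossip layer, no failures, no transactions.

```python
class QuorumReplica:
    """Toy majority-quorum register: every write reaches a majority of
    replicas with a fresh version; every read polls a majority and
    takes the highest-versioned value, so the two quorums intersect."""

    def __init__(self, n):
        self.quorum = n // 2 + 1
        self.replicas = [(0, None)] * n      # (version, value) each

    def write(self, value, targets):
        assert len(targets) >= self.quorum, "write must reach a majority"
        # a real protocol learns the highest version by polling a quorum
        version = max(v for v, _ in self.replicas) + 1
        for i in targets:
            self.replicas[i] = (version, value)

    def read(self, targets):
        assert len(targets) >= self.quorum, "read must poll a majority"
        return max(self.replicas[i] for i in targets)[1]
```

Because any two majorities of n replicas share at least one member, a reader contacting a completely different majority than the writer still finds at least one replica carrying the latest version.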

Journal ArticleDOI
TL;DR: P8TM is proposed, a novel approach that mitigates HTM's capacity limitation on IBM’s POWER8 architecture by leveraging a key combination of hardware and software techniques to support different execution paths; it also relies on self-tuning mechanisms aimed at dynamically switching between different execution modes to best adapt to the workload characteristics.
Abstract: Transactional memory (TM) aims at simplifying concurrent programming via the familiar abstraction of atomic transactions. Recently, Intel and IBM have integrated hardware based TM (HTM) implementations in commodity processors, paving the way for the mainstream adoption of the TM paradigm. Yet, existing HTM implementations suffer from a crucial limitation, which hampers the adoption of HTM as a general technique for regulating concurrent access to shared memory: the inability to execute transactions whose working sets exceed the capacity of CPU caches. In this article we propose P8TM, a novel approach that mitigates this limitation on IBM’s POWER8 architecture by leveraging a key combination of hardware and software techniques to support different execution paths. P8TM also relies on self-tuning mechanisms aimed at dynamically switching between different execution modes to best adapt to the workload characteristics. In-depth evaluation with several benchmarks indicates that P8TM can achieve striking performance gains in workloads that stress the capacity limitations of HTM, while achieving performance on par with HTM even in unfavourable workloads.

Proceedings ArticleDOI
22 Feb 2020
TL;DR: This work presents the first lock-free transactional vector, which pre-processes transactions to reduce shared memory access and simplify access logic, and generally offers better scalability than STM and STO, and competitive performance with Transactional Boosting, but with additionalLock-free guarantees.
Abstract: The vector is a fundamental data structure, offering constant-time traversal to elements and a dynamically resizable range of indices. While several concurrent vectors exist, a composition of concurrent vector operations dependent on each other can lead to undefined behavior. Techniques for providing transactional capabilities for data structure operations include Software Transactional Memory (STM) and transactional transformation methodologies. Transactional transformations convert concurrent data structures into their transactional equivalents at an operation level, rather than STM's object or memory level. To the best of our knowledge, existing STMs do not support dynamic read/write sets in a lock-free manner, and transactional transformation methodologies are unsuitable for the vector's contiguous memory layout. In this work, we present the first lock-free transactional vector. It integrates the fast lock-free resizing and instant logical status changes from related works. Our approach pre-processes transactions to reduce shared memory access and simplify access logic. This can be done without locking elements or verifying conflicts between transactions. We compare our design against state-of-the-art transactional designs, GCC STM, Transactional Boosting, and STO. All data structures are tested on four different platforms, including x86_64 and ARM architectures. We find that our lock-free transactional vector generally offers better scalability than STM and STO, and competitive performance with Transactional Boosting, but with additional lock-free guarantees. In scenarios with only reads and writes, our vector is as much as 47% faster than Transactional Boosting.

Proceedings ArticleDOI
22 Feb 2020
TL;DR: This paper designs a mechanism for localizing conflicts back to transactional program points, defines the semantics for optional repair handler annotations, and extends the conflict detection algorithm to ensure all repairs are completed.
Abstract: Transactional memory (TM) provides developers with a transaction primitive for concurrent code execution that transparently checks for concurrency conflicts. When such a conflict is detected, the system recovers by aborting and restarting the transaction. Although correct, this behavior wastes work and inhibits forward progress. In this paper, we present TardisTM, a software TM system that supports repairing concurrency conflicts while preserving unaffected computation. Our key insight is that existing conflict detection mechanisms can be extended to perform incremental transaction repair, when augmented with additional runtime information. To do so, we design a mechanism for localizing conflicts back to transactional program points, define the semantics for optional repair handler annotations, and extend the conflict detection algorithm to ensure all repairs are completed. To evaluate our system, we characterize the benefit of repair on a set of benchmark programs; we measure up to 2.95x speedup over mutual exclusion, and 93% abort reduction over a baseline software TM system that does not support repair.

Proceedings ArticleDOI
01 Sep 2020
TL;DR: A novel approach for fail-operational systems using hardware transactional memory, which can also be used for embedded systems running heterogeneous multi-cores; the transactional memory is extended to support multiple versions, which allows the reproduction of atomic operations and recovery in case of an error.
Abstract: Modern safety-critical embedded applications like autonomous driving need to be fail-operational. At the same time, high performance and low power consumption are demanded. A common way to achieve this is the use of heterogeneous multi-cores. When applied to such systems, prevalent fault tolerance mechanisms suffer from some disadvantages: Some (e.g. triple modular redundancy) require a substantial amount of duplication, resulting in high hardware costs and power consumption. Others (e.g. lockstep) require supplementary checkpointing mechanisms to recover from errors. Further approaches (e.g. software-based process-level redundancy) cannot handle the indeterminism introduced by multithreaded execution. This paper presents a novel approach for fail-operational systems using hardware transactional memory, which can also be used for embedded systems running heterogeneous multi-cores. Each thread is automatically split into transactions, which then execute redundantly. The hardware transactional memory is extended to support multiple versions, which allows the reproduction of atomic operations and recovery in case of an error. In our FPGA-based evaluation, we executed the PARSEC benchmark suite with fault tolerance on 12 cores.

Proceedings ArticleDOI
18 May 2020
TL;DR: It is revealed that, by correctly using the annotations on just a few lines of code, it is possible to reduce the total number of instrumented barriers by 95% and to achieve speed-ups of up to 7× when compared to the original code generated by GCC and the Clang compiler.
Abstract: With chip manufacturers such as Intel, IBM and ARM offering native support for transactional memory in their instruction set architectures, memory transactions are on the verge of being considered a genuine application tool rather than just an interesting research topic. Despite this recent increase in popularity on the hardware side of transactional memory (HTM), software support for transactional memory (STM) is still scarce and the only compiler with transactional support currently available, the GNU Compiler Collection (GCC), does not generate code that achieves desirable performance. This paper presents a detailed analysis of transactional code generated by GCC and by a proposed transactional memory support added to the Clang/LLVM compiler framework. Experimental results support the following contributions: (a) STM’s performance overhead is due to an excessive amount of read and write barriers added by the compiler; (b) a new annotation mechanism for the Clang/LLVM compiler framework that aims to overcome the barrier over-instrumentation problem by allowing programmers to specify which variables should be free from transactional instrumentation; (c) a profiling tool that ranks the most accessed memory locations at runtime, working as a guiding tool for programmers to annotate the code. Furthermore, it is revealed that, by correctly using the annotations on just a few lines of code, it is possible to reduce the total number of instrumented barriers by 95% and to achieve speed-ups of up to 7× when compared to the original code generated by GCC and the Clang compiler.

Journal ArticleDOI
TL;DR: This paper introduces and tackles a special performance hazard in Hardware Transactional Memory: false abortion. It proposes a new memory allocator design that is able to put objects likely to be accessed from different threads into different cache lines, and thus avoid conflicts between hardware transactions in different threads.
Abstract: This paper introduces and tackles a special performance hazard in Hardware Transactional Memory (HTM): false abortion. False abortion causes many unnecessary transaction abortions in HTM and can gr...

Proceedings ArticleDOI
06 Jul 2020
TL;DR: This paper introduces memory tagging, a simple hardware mechanism which enables the programmer to "tag" a dynamic set of memory locations, at cache-line granularity, and later validate whether the memory has been concurrently modified, with the possibility of updating one of the underlying locations atomically if validation succeeds.
Abstract: There has been a significant amount of research on hardware and software support for efficient concurrent data structures; yet, the question of how to build correct, simple, and scalable data structures has not yet been definitively settled. In this paper, we revisit this question from a minimalist perspective, and ask: what is the smallest amount of synchronization required for correct and efficient concurrent search data structures, and how could this minimal synchronization support be provided in hardware? To address these questions, we introduce memory tagging, a simple hardware mechanism which enables the programmer to "tag" a dynamic set of memory locations, at cache-line granularity, and later validate whether the memory has been concurrently modified, with the possibility of updating one of the underlying locations atomically if validation succeeds. We provide several examples showing that this mechanism can enable fast and arguably simple concurrent data structure designs, such as lists, binary search trees, balanced search trees, range queries, and Software Transactional Memory (STM) implementations. We provide an implementation of memory tags in the Graphite multi-core simulator, showing that the mechanism can be implemented entirely at the level of L1 cache, and that it can enable non-trivial speedups versus existing implementations of the above data structures.
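
The tag-then-validate pattern the paper describes can be imitated in software with per-location version counters: tagging records the current versions, and a later validate-and-update succeeds only if none of the tagged locations changed in between. This is only a software caricature of the proposed L1-cache mechanism; all names are invented, and a lock stands in for hardware atomicity.

```python
import threading

class TaggedMemory:
    """Software caricature of tag/validate: tagging records per-location
    versions; validate_and_update applies a write only if none of the
    tagged locations changed since they were tagged."""

    def __init__(self):
        self._values = {}
        self._versions = {}
        self._lock = threading.Lock()

    def read(self, addr):
        return self._values.get(addr)

    def write(self, addr, value):
        with self._lock:
            self._values[addr] = value
            self._versions[addr] = self._versions.get(addr, 0) + 1

    def tag(self, addrs):
        """Record the current version of every tagged location."""
        with self._lock:
            return {a: self._versions.get(a, 0) for a in addrs}

    def validate_and_update(self, tags, addr, value):
        """Atomically: fail if any tagged location changed, else write."""
        with self._lock:
            if any(self._versions.get(a, 0) != v for a, v in tags.items()):
                return False
            self._values[addr] = value
            self._versions[addr] = self._versions.get(addr, 0) + 1
            return True
```

An optimistic tree or list search tags the handful of nodes it traversed and commits its single update only if that validation still holds — the "minimal synchronization" the paper argues for.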

Journal ArticleDOI
01 Nov 2020
TL;DR: This article develops two novel and provably correct LL/SC emulation schemes, implements them in the Synopsys DesignWare ARC nSIM DBT system, and evaluates them against full applications and targeted microbenchmarks.
Abstract: Dynamic binary translation (DBT) requires the implementation of load-link/store-conditional (LL/SC) primitives for guest systems that rely on this form of synchronization. When targeting, e.g., x86 host systems, LL/SC guest instructions are typically emulated using atomic compare-and-swap (CAS) instructions on the host. Whilst this direct mapping is efficient, this approach is problematic due to subtle differences between LL/SC and CAS semantics. In this article, we demonstrate that this is a real problem, and we provide code examples that fail to execute correctly on QEMU and a commercial DBT system, which both use the CAS approach to LL/SC emulation. We then develop two novel and provably correct LL/SC emulation schemes: 1) a purely software-based scheme, which uses the DBT system’s page translation cache for correctly selecting between fast, but unsynchronized, and slow, but fully synchronized memory accesses and 2) a hardware-accelerated scheme that leverages hardware transactional memory (HTM) provided by the host. We have implemented these two schemes in the Synopsys DesignWare ARC nSIM DBT system, and we evaluate our implementations against full applications and targeted microbenchmarks. We demonstrate that our novel schemes are not only correct but also deliver competitive performance on par with or better than the widely used, but broken, CAS scheme.

Journal ArticleDOI
TL;DR: An extensive experimental evaluation is conducted on an x86 and an ARM v8 multicore platform to explore the trade-offs of the proposed designs with respect to programmability, scalability, and performance, and to evaluate the performance improvements achievable with relaxed memory consistency models.

Journal ArticleDOI
TL;DR: This paper presents a solution that minimises blocking/waiting in GPGPU computing using a contention manager that offsets memory conflicts across threads through thread re-ordering; the authors believe this is the first work of its kind to demonstrate a generalised conflict and semantic contention manager suitable for the scale of parallel execution found on a GPU.

Journal ArticleDOI
TL;DR: This article proposes a software-based thread-level synchronization mechanism called lock stealing for GPUs to avoid live-locks, and describes how to implement the lock-stealing algorithm in mutually exclusive locks and readers-writer locks with high performance.
Abstract: As more emerging applications are moving to GPUs, thread-level synchronization has become a requirement. However, GPUs only provide warp-level and thread-block-level rather than thread-level synchronization. Moreover, it is highly possible to cause live-locks by using CPU synchronization mechanisms to implement thread-level synchronization for GPUs. In this article, we first propose a software-based thread-level synchronization mechanism called lock stealing for GPUs to avoid live-locks. We then describe how to implement our lock stealing algorithm in mutually exclusive locks and readers-writer locks with high performance. Finally, by putting it all together, we develop a thread-level locking library (TLLL) for commercial GPUs. To evaluate TLLL and show its general applicability, we use it to implement six widely used programs. We compare TLLL against the state-of-the-art ad-hoc GPU synchronization, GPU software transactional memory (STM), and CPU hardware transactional memory (HTM), respectively. The results show that, compared with the ad-hoc GPU synchronization for Delaunay mesh refinement (DMR), TLLL improves the performance by 22 percent on average on a GTX970 GPU, and shows up to 11 percent of performance improvement on a Volta V100 GPU. Moreover, it significantly reduces the required memory size. Such low memory consumption enables DMR to successfully run on the GTX970 GPU with the 10-million mesh size, and the V100 GPU with the 40-million mesh size, with which the ad-hoc synchronization cannot run successfully. In addition, TLLL outperforms the GPU STM by 65 percent, and the CPU HTM (running on a Xeon E5-2620 v4 CPU with 16 hardware threads) by 43 percent on average.

Posted Content
TL;DR: A new abstraction characterizes the ways in which the effects of two methods differ depending on the order in which they are applied, abstracting away effects that would be the same regardless of the order; a novel algorithm then reduces the problem to reachability, so that off-the-shelf program analysis tools can perform the reasoning necessary for proving commutativity.
Abstract: Commutativity of data structure methods is of ongoing interest, with roots in the database community. In recent years commutativity has been shown to be a key ingredient to enabling multicore concurrency in contexts such as parallelizing compilers, transactional memory, speculative execution and, more broadly, software scalability. Despite this interest, it remains an open question as to how a data structure's commutativity specification can be verified automatically from its implementation. In this paper, we describe techniques to automatically prove the correctness of method commutativity conditions from data structure implementations. We introduce a new kind of abstraction that characterizes the ways in which the effects of two methods differ depending on the order in which the methods are applied, and abstracts away effects of methods that would be the same regardless of the order. We then describe a novel algorithm that reduces the problem to reachability, so that off-the-shelf program analysis tools can perform the reasoning necessary for proving commutativity. Finally, we describe a proof-of-concept implementation and experimental results, showing that our tool can verify commutativity of data structures such as a memory cell, counter, two-place Set, array-based stack, queue, and a rudimentary hash table. We conclude with a discussion of what makes a data structure's commutativity provable with today's tools and what needs to be done to prove more in the future.

Book ChapterDOI
24 Aug 2020
TL;DR: NV-PhTM is a transactional system for NVM that delivers the best out of both HW and SW transactions by dynamically selecting the best execution mode according to the application’s characteristics; to the best of the authors’ knowledge, it is the first phase-based system to provide durable transactions.
Abstract: Non-Volatile Memory (NVM) is an emerging memory technology aimed to eliminate the gap between main memory and stable storage. Nevertheless, today’s programs will not readily benefit from NVM because crash failures may render the program in an unrecoverable and inconsistent state. In this context, durable transactions have been proposed as a mechanism to ease the adoption of NVM by simplifying the task of programming NVM systems. Existing systems employ either hardware (HW) or software (SW) transactions with different performance tradeoffs. Although SW transactions are flexible and unbounded, they may significantly hurt the performance of short-lived transactions. On the other hand, HW transactional memories provide low-overhead but are resource-constrained. In this paper we present NV-PhTM, a transactional system for NVM that delivers the best out of both HW and SW transactions by dynamically selecting the best execution mode according to the application’s characteristics. NV-PhTM is comprised of a set of heuristics to guide online phase transition while retaining persistency in case of crashes during migration. To the best of our knowledge, NV-PhTM is the first phase-based system to provide durable transactions. Experimental results with the STAMP benchmark show that the proposed heuristics are efficient in guiding phase transitions with low overhead.

Proceedings ArticleDOI
31 Jul 2020
TL;DR: This work discusses some consequences of the C++ memory model on STM, identifies an easy-to-fix implementation error, and describes an unavoidable formal race condition that occurs in an important class of STM algorithms.
Abstract: High-performance software transactional memory (STM) implementations rely on nuanced use of synchronization variables to coordinate speculative accesses to program data. We discuss some consequences of the C++ memory model on STM, identify an easy-to-fix implementation error, and describe an unavoidable formal race condition that occurs in an important class of STM algorithms.

Proceedings ArticleDOI
24 Nov 2020
TL;DR: Transactional memory (TM) is a paradigm that provides atomicity and isolation for concurrent threads, removing the need for locks and reducing the programmer's synchronization effort; this survey explores machine learning techniques for improving TM.
Abstract: Transactional memory (TM) is a paradigm that removes the need for using locks and makes synchronization easier for programmers. TM enables the atomicity and isolation of concurrent threads. TM can be realized in software (STM), with hardware support (HTM), or as a hybrid of the two. In this survey, different machine learning (ML) techniques are explored to improve the performance, energy efficiency, simplicity, and benefits of using TM. Various methods, such as thread mapping and self-adjusting concurrency, are combined with machine learning algorithms. After presenting the techniques, the obtained results are discussed.

Proceedings ArticleDOI
04 Jan 2020
TL;DR: The challenge is how to schedule the transactions so that two crucial performance metrics, namely total execution time to commit all the transactions, and total communication cost involved in moving the objects to the requesting nodes, are minimized.
Abstract: In this paper, we present GraphTM, an efficient and scalable framework for processing transactions in a distributed environment. The distributed environment is modeled as a graph where each node of the graph is a processing node that issues transactions. The objects that transactions use to execute are also on the graph nodes (the initial placement may be arbitrary). The transactions execute on the nodes which issue them after collecting all the objects that they need, following the data-flow model of computation. This collection is done by issuing the requests for the objects as soon as the transaction starts and waiting until all required objects for the transaction come to the requesting node. The challenge is how to schedule the transactions so that two crucial performance metrics, namely (i) total execution time to commit all the transactions, and (ii) total communication cost involved in moving the objects to the requesting nodes, are minimized. We implemented GraphTM in Java and assessed its performance through 3 micro-benchmarks and 5 complex benchmarks from the STAMP benchmark suite on 5 different network topologies, namely, clique, line, grid, cluster, and star, that form an underlying communication network for a representative set of distributed systems commonly used in practice. The results show the efficiency and scalability of our approach.