
Showing papers on "Transactional memory published in 2017"


Proceedings ArticleDOI
26 Feb 2017
TL;DR: T-SGX is implemented as a compiler-level scheme to automatically transform a normal enclave program into a secured enclave program without requiring manual source code modification or annotation, and is an order of magnitude faster than the state-of-the-art mitigation schemes.
Abstract: Intel Software Guard Extensions (SGX) is a hardware-based Trusted Execution Environment (TEE) that enables secure execution of a program in an isolated environment, called an enclave. SGX hardware protects the running enclave against malicious software, including the operating system, hypervisor, and even low-level firmware. This strong security property allows trustworthy execution of programs in hostile environments, such as a public cloud, without trusting anyone (e.g., a cloud provider) between the enclave and the SGX hardware. However, recent studies have demonstrated that enclave programs are vulnerable to accurate controlled-channel attacks conducted by a malicious OS. Since enclaves rely on the underlying OS, curious and potentially malicious OSs can observe the sequence of accessed addresses by intentionally triggering page faults. In this paper, we propose T-SGX, a complete mitigation solution to the controlled-channel attack in terms of compatibility, performance, and ease of use. T-SGX relies on a commodity component of the Intel processor (since Haswell), called Transactional Synchronization Extensions (TSX), which implements a restricted form of hardware transactional memory. Because TSX is implemented as an extension (i.e., by snooping the cache protocol), any unusual event, such as an exception or interrupt, that would normally be handled by the core results in an abort of the ongoing transaction. One interesting property is that a TSX abort suppresses the notification of errors to the underlying OS: the OS cannot know whether a page fault has occurred during the transaction. By exploiting this property of TSX, T-SGX can carefully isolate the effects of attempts to tap running enclaves, thereby completely eradicating the known controlled-channel attack.
We have implemented T-SGX as a compiler-level scheme to automatically transform a normal enclave program into a secured enclave program without requiring manual source code modification or annotation. We not only evaluate the security properties of T-SGX, but also demonstrate that it could be applied to all the previously demonstrated attack targets, such as libjpeg, Hunspell, and FreeType. To evaluate the performance of T-SGX, we ported 10 benchmark programs of nbench to the SGX environment. Our evaluation results look promising. T-SGX is an order of magnitude faster than the state-of-the-art mitigation schemes. On our benchmarks, T-SGX incurs on average 50% performance overhead and less than 30% storage overhead.

362 citations


Proceedings ArticleDOI
02 Apr 2017
TL;DR: Deja Vu is presented, a software framework that enables a shielded execution to detect privileged side-channel attacks and builds into shielded execution the ability to check program execution time at the granularity of paths in its control-flow graph.
Abstract: Intel Software Guard Extension (SGX) protects the confidentiality and integrity of an unprivileged program running inside a secure enclave from a privileged attacker who has full control of the entire operating system (OS). Program execution inside this enclave is therefore referred to as shielded. Unfortunately, shielded execution does not protect programs from side-channel attacks by a privileged attacker. For instance, it has been shown that by changing page table entries of memory pages used by shielded execution, a malicious OS kernel could observe memory page accesses from the execution and hence infer a wide range of sensitive information about it. In fact, this page-fault side channel is only an instance of a category of side-channel attacks, here called privileged side-channel attacks, in which privileged attackers frequently preempt the shielded execution to obtain fine-grained side-channel observations. In this paper, we present Deja Vu, a software framework that enables a shielded execution to detect such privileged side-channel attacks. Specifically, we build into shielded execution the ability to check program execution time at the granularity of paths in its control-flow graph. To provide a trustworthy source of time measurement, Deja Vu implements a novel software reference clock that is protected by Intel Transactional Synchronization Extensions (TSX), a hardware implementation of transactional memory. Evaluations show that Deja Vu effectively detects side-channel attacks against shielded execution and against the reference clock itself.

182 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: DUDETM is presented, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging and can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7× speedup.
Abstract: Emerging non-volatile memory (NVM) offers non-volatility, byte-addressability and fast access at the same time. To make the best use of these properties, it has been shown by empirical evidence that programs should access NVM directly through CPU load and store instructions, so that the overhead of a traditional file system or database can be avoided. Thus, durable transactions become a common choice of applications for accessing persistent memory data in a crash consistent manner. However, existing durable transaction systems employ either undo logging, which requires a fence for every memory write, or redo logging, which requires intercepting all memory reads within transactions. This paper presents DUDETM, a crash-consistent durable transaction system that avoids the drawbacks of both undo logging and redo logging. DUDETM uses shadow DRAM to decouple the execution of a durable transaction into three fully asynchronous steps. The advantage is that only minimal fences and no memory read instrumentation are required. This design also enables an out-of-the-box transactional memory (TM) to be used as an independent component in our system. The evaluation results show that DUDETM adds durability to a TM system with only 7.4–24.6% throughput degradation. Compared to the existing durable transaction systems, DUDETM provides 1.7× to 4.4× higher throughput. Moreover, DUDETM can be implemented with existing hardware TMs with minor hardware modifications, leading to a further 1.7× speedup.
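The three-step decoupling that DUDETM describes can be illustrated with a toy model (this is a sketch of the idea only, not the paper's implementation; all names are invented, and real persistence, fences, and asynchrony are elided):

```python
# Toy model of DuDeTM-style decoupled durability (illustrative only):
# a transaction runs entirely against a volatile "shadow DRAM" copy,
# producing a redo log; the log is then persisted; finally the log is
# replayed into simulated NVM home locations.

shadow = {}           # volatile shadow DRAM (working copy)
nvm = {}              # simulated persistent memory (home locations)
persisted_logs = []   # simulated persistent redo-log area

def run_transaction(writes):
    """Step 1: execute against shadow DRAM, recording a redo log."""
    redo_log = []
    for addr, value in writes.items():
        shadow[addr] = value            # update the volatile copy in place
        redo_log.append((addr, value))  # no read instrumentation needed
    return redo_log

def persist_log(redo_log):
    """Step 2: make the redo log durable (the one ordering point)."""
    persisted_logs.append(list(redo_log))

def replay_logs():
    """Step 3: asynchronously apply persisted logs to NVM."""
    while persisted_logs:
        for addr, value in persisted_logs.pop(0):
            nvm[addr] = value

# A crash after persist_log loses nothing (the log is replayed on
# recovery); a crash before it simply drops the whole transaction.
log = run_transaction({"x": 1, "y": 2})
persist_log(log)
replay_logs()
```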

179 citations


Proceedings Article
Qingda Hu, Jinglei Ren, Anirudh Badam, Jiwu Shu, Thomas Moscibroda
12 Jul 2017
TL;DR: This paper presents a log-structured NVMM system that not only maintains NVMM in a compact manner but also reduces the write traffic and the number of persist barriers needed for executing transactions.
Abstract: Emerging non-volatile main memory (NVMM) unlocks the performance potential of applications by storing persistent data in the main memory. Such applications require a lightweight persistent transactional memory (PTM) system, instead of a heavyweight filesystem or database, to have fast access to data. In a PTM system, the memory usage, both capacity and bandwidth, plays a key role in dictating performance and efficiency. Existing memory management mechanisms for PTMs generate high memory fragmentation, high write traffic and a large number of persist barriers, since data is first written to a log and then to the main data store. In this paper, we present a log-structured NVMM system that not only maintains NVMM in a compact manner but also reduces the write traffic and the number of persist barriers needed for executing transactions. All data allocations and modifications are appended to the log which becomes the location of the data. Further, we address a unique challenge of log-structured memory management by designing a tree-based address translation mechanism where access granularities are flexible and different from allocation granularities. Our results show that the new system enjoys up to 89.9% higher transaction throughput and up to 82.8% lower write traffic than a traditional PTM system.
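The core idea, that the log itself becomes the data's home and a translation structure maps objects to their latest log position, can be sketched as follows (illustrative only; a dict stands in for the paper's tree-based address translation, and persistence is elided):

```python
# Toy sketch of log-structured persistent memory: every allocation or
# update is appended to the log, which *is* the data's location; a
# translation table maps each object id to its most recent log offset.

log = []            # append-only log: each entry is (obj_id, payload)
translation = {}    # obj_id -> index of its most recent log record

def tx_write(obj_id, payload):
    """Append the new version and repoint the translation entry."""
    log.append((obj_id, payload))
    translation[obj_id] = len(log) - 1

def read(obj_id):
    """Reads go through the translation table to the latest version."""
    return log[translation[obj_id]][1]

tx_write("a", 10)
tx_write("b", 20)
tx_write("a", 11)   # supersedes the first version; old record is garbage
```

Because data is written once, at its final location, there is no separate log-then-store traffic, which is where the write-traffic and persist-barrier savings come from.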

58 citations


Proceedings ArticleDOI
04 Apr 2017
TL;DR: This work develops a novel "failure-atomic slotted page structure" for persistent memory that leverages byte addressability and durability of persistent memory to minimize redundant write operations used to maintain consistency in traditional database systems.
Abstract: The slotted-page structure is a database page format commonly used for managing variable-length records. In this work, we develop a novel "failure-atomic slotted page structure" for persistent memory that leverages byte addressability and durability of persistent memory to minimize redundant write operations used to maintain consistency in traditional database systems. Failure-atomic slotted paging consists of two key elements: (i) in-place commit per page using hardware transactional memory and (ii) slot header logging that logs the commit mark of each page. The proposed scheme is implemented in SQLite and compared against NVWAL, the current state-of-the-art scheme. Our performance study shows that our failure-atomic slotted paging shows optimal performance for database transactions that insert a single record. For transactions that touch more than one database page, our proposed slot-header logging scheme minimizes the logging overhead by avoiding duplicating pages and logging only the metadata of the dirty pages. Overall, we find that our failure-atomic slotted-page management scheme reduces database logging overhead to 1/6 and improves query response time by up to 33% compared to NVWAL.
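The failure-atomicity argument rests on publishing an insert through a single small header update. A toy model (illustrative; field names are invented, and the hardware transaction is elided):

```python
# Toy slotted page with a commit mark: an insert first writes the
# record and its slot, and only then publishes it by bumping the
# committed-slot count in the page header -- modeling the paper's
# in-place commit per page.

class SlottedPage:
    def __init__(self):
        self.records = []         # variable-length records
        self.slots = []           # slot array: offsets into records
        self.committed_slots = 0  # header field: the single commit mark

    def insert(self, record):
        self.records.append(record)
        self.slots.append(len(self.records) - 1)
        # This header update is the atomic commit point; a crash
        # before this line leaves the new slot invisible.
        self.committed_slots += 1

    def visible_records(self):
        """Readers (and crash recovery) trust only committed slots."""
        return [self.records[self.slots[i]]
                for i in range(self.committed_slots)]

page = SlottedPage()
page.insert("alice")
page.insert("bob")
# Simulate a crash mid-insert: record and slot written, but the commit
# mark was never bumped -- the torn insert stays invisible.
page.records.append("torn")
page.slots.append(len(page.records) - 1)
```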

50 citations


Journal ArticleDOI
TL;DR: The elastic transaction model and an implementation of it are presented; its simplicity and performance are then illustrated on various concurrent data structures, namely double-ended queue, hash table, linked list, and skip list.

31 citations


Proceedings ArticleDOI
22 Feb 2017
TL;DR: This paper describes an architecture that consists of many lightweight multi-threaded processing engines, a global transactional shared memory, and a work scheduler and argues that despite increased transaction conflicts due to the higher concurrency and single-thread latency, scalable speedup over serial execution can be achieved.
Abstract: Many applications that operate on large graphs can be intuitively parallelized by executing a large number of the graph operations concurrently and as transactions to deal with potential conflicts. However, large numbers of concurrent operations may incur so many conflicts that the potential benefits of parallelization are negated, which has probably made highly multi-threaded transactional machines seem impractical. Given the large size and topology of many modern graphs, however, such machines can provide real performance, energy efficiency, and programmability benefits. This paper describes an architecture that consists of many lightweight multi-threaded processing engines, a global transactional shared memory, and a work scheduler. We present challenges of realizing such an architecture, especially the requirement of scalable conflict detection, and propose solutions. We also argue that despite increased transaction conflicts due to the higher concurrency and single-thread latency, scalable speedup over serial execution can be achieved. We implement the proposed architecture as a synthesizable FPGA RTL design and demonstrate improved per-socket performance (2X) and energy efficiency (22X) compared to a baseline platform that contains two Intel Haswell processors, each with 12 cores.

30 citations


Journal ArticleDOI
TL;DR: An optimized transaction chopping algorithm is designed and implemented to decompose a set of large transactions into smaller pieces such that HTM is only required to protect each piece and this work describes how DrTM supports common database features like read-only transactions and logging for durability.
Abstract: DrTM is a fast in-memory transaction processing system that exploits advanced hardware features such as remote direct memory access (RDMA) and hardware transactional memory (HTM). To achieve high efficiency, it mostly offloads concurrency control such as tracking read/write accesses and conflict detection into HTM in a local machine and leverages the strong consistency between RDMA and HTM to ensure serializability among concurrent transactions across machines. To mitigate the high probability of HTM aborts for large transactions, we design and implement an optimized transaction chopping algorithm to decompose a set of large transactions into smaller pieces such that HTM is only required to protect each piece. We further build an efficient hash table for DrTM by leveraging HTM and RDMA to simplify the design and notably improve the performance. We describe how DrTM supports common database features like read-only transactions and logging for durability. Evaluation using typical OLTP workloads including TPC-C and SmallBank shows that DrTM has better single-node efficiency and scales well on a six-node cluster; it achieves over 1.51 and 34 million transactions per second for TPC-C and SmallBank, respectively, on a single node, and 5.24 and 138 million on the cluster. Such numbers outperform a state-of-the-art single-node system (i.e., Silo) and a distributed transaction system (i.e., Calvin) by at least 1.9X and 29.6X for TPC-C.
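The decomposition idea behind transaction chopping can be sketched in a few lines (illustrative only: DrTM's chopping also depends on a static analysis that guarantees the pieces remain serializable, which is omitted here, and an HTM region is simulated by a plain function call):

```python
# Minimal sketch of transaction chopping: a large transaction's
# operations are split into pieces small enough to fit in a hardware
# transaction's limited working set, and each piece is protected by
# its own (here simulated) HTM region.

def chop(ops, piece_size):
    """Decompose one large transaction into pieces of bounded size."""
    return [ops[i:i + piece_size] for i in range(0, len(ops), piece_size)]

def run_htm_piece(store, piece):
    """Stand-in for one hardware transaction protecting one piece."""
    for key, delta in piece:
        store[key] = store.get(key, 0) + delta

store = {}
big_tx = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
for piece in chop(big_tx, 2):   # each piece small enough for HTM
    run_htm_piece(store, piece)
```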

29 citations


Journal ArticleDOI
TL;DR: This paper investigates another, less obvious, feature of “not taking effect” called non-interference: aborted or incomplete transactions should not force any other transaction to abort. It presents a simple yet efficient implementation that satisfies non-interference and local opacity.

26 citations


Proceedings ArticleDOI
18 Jun 2017
TL;DR: This paper exploits the ordering mechanism to design a novel persistence method that decouples HTM concurrency from back-end NVM operations and yields significant gains in throughput and latency in comparison with persistent transactional locking.
Abstract: This paper addresses the challenges of coupling byte addressable non-volatile memory (NVM) and hardware transaction memory (HTM) in high-performance transaction processing. We first show that HTM transactions can be ordered using existing processor instructions without any hardware changes. In contrast, existing solutions posit changes to HTM mechanisms in the form of special instructions or modified functionality. We exploit the ordering mechanism to design a novel persistence method that decouples HTM concurrency from back-end NVM operations. Failure atomicity is achieved using redo logging coupled with aliasing to guard against mistimed cache evictions. Our algorithm uses efficient lock-free mechanisms with bounded static memory requirements. We evaluated our approach using both micro-benchmarks and benchmarks from the STAMP suite, and showed that it compares well with standard (volatile) HTM transactions. We also showed that it yields significant gains in throughput and latency in comparison with persistent transactional locking.

24 citations


Journal ArticleDOI
TL;DR: Edge-TM is proposed, an adaptive hardware/software error management policy that optimistically scales the voltage beyond the edge of safe operation for better energy savings and works in combination with a Hardware Transactional Memory (HTM)-based error recovery mechanism.
Abstract: Scaling of semiconductor devices has enabled higher levels of integration and performance improvements at the price of making devices more susceptible to the effects of static and dynamic variability. Adding safety margins (guardbands) on the operating frequency or supply voltage prevents timing errors, but has a negative impact on performance and energy consumption. We propose Edge-TM, an adaptive hardware/software error management policy that (i) optimistically scales the voltage beyond the edge of safe operation for better energy savings and (ii) works in combination with a Hardware Transactional Memory (HTM)-based error recovery mechanism. The policy applies dynamic voltage scaling (DVS) (while keeping frequency fixed) based on the feedback provided by HTM, which makes it simple and generally applicable. Experiments on an embedded platform show our technique capable of 57% energy improvement compared to using voltage guardbands and an extra 21-24% improvement over existing state-of-the-art error tolerance solutions, at a nominal area and time overhead.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism, can parallelize a broader range of applications than prior speculative execution models and outperforms prior speculative architectures by up to 88× at 256 cores.
Abstract: Most systems that support speculative parallelization, like hardware transactional memory (HTM), do not support nested parallelism. This sacrifices substantial parallelism and precludes composing parallel algorithms. And the few HTMs that do support nested parallelism focus on parallelizing at the coarsest (shallowest) levels, incurring large overheads that squander most of their potential. We present FRACTAL, a new execution model that supports unordered and timestamp-ordered nested parallelism. FRACTAL lets programmers seamlessly compose speculative parallel algorithms, and lets the architecture exploit parallelism at all levels. FRACTAL can parallelize a broader range of applications than prior speculative execution models. We design a FRACTAL implementation that extends the Swarm architecture and focuses on parallelizing at the finest (deepest) levels. Our approach sidesteps the issues of nested parallel HTMs and uncovers abundant fine-grain parallelism. As a result, FRACTAL outperforms prior speculative architectures by up to 88× at 256 cores.

Proceedings ArticleDOI
04 Feb 2017
TL;DR: This paper demonstrates a multi-threaded DBT-based emulator that scales in an architecture-independent manner, explores the trade-offs that exist when emulating atomic operations across ISAs, and presents a novel approach for correct and scalable emulation of load-locked/store-conditional instructions based on hardware transactional memory (HTM).
Abstract: Speed, portability and correctness have traditionally been the main requirements for dynamic binary translation (DBT) systems. Given the increasing availability of multi-core machines as both emulation guests and hosts, scalability has emerged as an additional design objective. It has however been an elusive goal for two reasons: contention on common data structures such as the translation cache is difficult to avoid without hurting performance, and instruction set architecture (ISA) disparities between guest and host (such as mismatches in the memory consistency model and the semantics of atomic operations) can compromise correctness. In this paper we address these challenges in a simple and memory-efficient way, demonstrating a multi-threaded DBT-based emulator that scales in an architecture-independent manner. Furthermore, we explore the trade-offs that exist when emulating atomic operations across ISAs, and present a novel approach for correct and scalable emulation of load-locked/store-conditional instructions based on hardware transactional memory (HTM). By adding around 1000 lines of code to QEMU, we demonstrate the scalability of both user-mode and full-system emulation on a 64-core x86_64 host running x86_64 guest code, and a 12-core, 96-thread POWER8 host running x86_64 and AArch64 guest code.
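The semantics a DBT must reproduce for load-locked/store-conditional can be sketched with a per-address version counter (a simplification for illustration only: the paper's approach instead wraps the emulated LL/SC sequence in a host hardware transaction):

```python
# Illustrative LL/SC semantics via version counters: SC succeeds only
# if no store to the address intervened since the paired LL.

memory = {"x": 0}
versions = {"x": 0}

def load_linked(addr):
    """LL returns the value plus the version it was read at."""
    return memory[addr], versions[addr]

def store_conditional(addr, value, seen_version):
    """SC fails if any store has hit the address since the LL."""
    if versions[addr] != seen_version:
        return False
    memory[addr] = value
    versions[addr] += 1
    return True

# Successful LL/SC pair:
val, ver = load_linked("x")
ok1 = store_conditional("x", val + 1, ver)

# Failing pair: another store sneaks in between LL and SC.
val, ver = load_linked("x")
store_conditional("x", 99, versions["x"])   # intervening store
ok2 = store_conditional("x", val + 1, ver)  # fails: version moved on
```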

Proceedings ArticleDOI
26 Jan 2017
TL;DR: Evaluation using key-value store benchmarks on a 20-core HTM-capable multi-core machine shows that Eunomia leads to 5X-11X speedup under high contention, while incurring small overhead under low contention.
Abstract: While hardware transactional memory (HTM) has recently been adopted to construct efficient concurrent search tree structures, such designs fail to deliver scalable performance under contention. In this paper, we first conduct a detailed analysis on an HTM-based concurrent B+Tree, which uncovers several reasons for excessive HTM aborts induced by both false and true conflicts under contention. Based on the analysis, we advocate Eunomia, a design pattern for search trees which contains several principles to reduce HTM aborts, including splitting HTM regions with version-based concurrency control to reduce HTM working sets, partitioned data layout to reduce false conflicts, proactively detecting and avoiding true conflicts, and adaptive concurrency control. To validate their effectiveness, we apply such designs to construct a scalable concurrent B+Tree using HTM. Evaluation using key-value store benchmarks on a 20-core HTM-capable multi-core machine shows that Eunomia leads to 5X-11X speedup under high contention, while incurring small overhead under low contention.
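One Eunomia principle, splitting a large HTM region by doing the traversal outside any transaction and using version validation inside a small one, can be sketched as follows (illustrative only; a single node stands in for a tree, names are invented, and the small HTM region is simulated by a plain function):

```python
# Sketch of version-based region splitting: traverse outside HTM while
# recording the target node's version, then use a small (simulated)
# HTM region only to validate the version and apply the write,
# retrying the traversal on mismatch. This keeps HTM working sets tiny.

class Node:
    def __init__(self):
        self.version = 0
        self.keys = {}

def traverse(node, key):
    """Phase 1, outside HTM: find the target, remember its version."""
    return node, node.version

def small_htm_insert(node, seen_version, key, value):
    """Phase 2, the small HTM region: validate, then write."""
    if node.version != seen_version:
        return False                 # concurrent writer: retry traversal
    node.keys[key] = value
    node.version += 1
    return True

def insert(root, key, value):
    while True:
        node, ver = traverse(root, key)
        if small_htm_insert(node, ver, key, value):
            return

root = Node()
insert(root, 5, "five")
insert(root, 7, "seven")
```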

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper presents the first formal verification of a pessimistic software TM algorithm, namely, an algorithm proposed by Matveev and Shavit, and proves that this pessimistic TM is a refinement of an intermediate opaque I/O-automaton, known as TMS2.
Abstract: Transactional Memory (TM) is a high-level programming abstraction for concurrency control that provides programmers with the illusion of atomically executing blocks of code, called transactions. TMs come in two categories, optimistic and pessimistic, where in the latter transactions never abort. While this simplifies the programming model, high-performing pessimistic TMs can be complex. In this paper, we present the first formal verification of a pessimistic software TM algorithm, namely, an algorithm proposed by Matveev and Shavit. The correctness criterion used is opacity, formalising the transactional atomicity guarantees. We prove that this pessimistic TM is a refinement of an intermediate opaque I/O-automaton, known as TMS2. To this end, we develop a rely-guarantee approach for reducing the complexity of the proof. Proofs are mechanised in the interactive prover Isabelle.

Proceedings ArticleDOI
24 Jun 2017
TL;DR: The results show that snapshot isolation can effectively boost the performance of dynamically sized data structures such as linked lists, binary trees and red-black trees, sometimes by as much as 4.5×, which results in improved overall performance of benchmarks utilizing these data structures.
Abstract: Snapshot Isolation (SI) is an established model in the database community, which permits write-read conflicts to pass and aborts transactions only on write-write conflicts. With the Write Skew anomaly correctly eliminated, SI can reduce the occurrence of aborts, save the work done by transactions, and greatly benefit long transactions involving complex data structures. GPUs are evolving towards a general-purpose computing device with growing support for irregular workloads, including transactional memory. The usage of snapshot isolation on transactional memory has proven to be greatly beneficial for performance. In this paper, we propose a multi-versioned memory subsystem for hardware-based transactional memory on the GPU, with a method for eliminating the Write Skew anomaly on the fly, and finally incorporate Snapshot Isolation with this system. The results show that snapshot isolation can effectively boost the performance of dynamically sized data structures such as linked lists, binary trees and red-black trees, sometimes by as much as 4.5×, which results in improved overall performance of benchmarks utilizing these data structures.
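The SI rule the abstract describes, let write-read conflicts pass and abort only on write-write conflicts, can be modeled with a toy multi-versioned store (illustrative only; this is a first-committer-wins sketch, not the paper's GPU hardware design):

```python
# Toy multi-versioned snapshot isolation: a transaction reads the
# versions committed before it began; at commit it aborts only if
# another transaction has since committed a write to one of its own
# write keys (write-write conflict).

versions = {}      # key -> list of (commit_ts, value)
commit_ts = [0]    # global commit timestamp counter

def begin():
    return {"start": commit_ts[0], "writes": {}}

def read(tx, key):
    """Read the newest version committed no later than the snapshot."""
    for ts, value in reversed(versions.get(key, [])):
        if ts <= tx["start"]:
            return value
    return None

def write(tx, key, value):
    tx["writes"][key] = value

def commit(tx):
    """First-committer-wins check on write-write conflicts only."""
    for key in tx["writes"]:
        for ts, _ in versions.get(key, []):
            if ts > tx["start"]:
                return False          # write-write conflict: abort
    commit_ts[0] += 1
    for key, value in tx["writes"].items():
        versions.setdefault(key, []).append((commit_ts[0], value))
    return True

t1, t2 = begin(), begin()
write(t1, "x", 1)
_ = read(t2, "x")           # t2 reads x but only writes y
write(t2, "y", 2)
ok1 = commit(t1)
ok2 = commit(t2)            # passes: write-read conflicts are allowed

t3, t4 = begin(), begin()
write(t3, "x", 10)
write(t4, "x", 20)
ok3 = commit(t3)
ok4 = commit(t4)            # aborts: write-write conflict on "x"
```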

Journal ArticleDOI
TL;DR: This paper introduces Optimistic Transactional Boosting (OTB), an optimistic methodology for extending data structures in order to support the composition of multiple operations into one atomic execution by building a single traversal phase and a single commit phase for the whole atomic execution.
Abstract: The last two decades witnessed the success of many efficient designs of concurrent data structures. A large set of them has a common base principle: each operation is split into a read-only traversal phase, which scans the data structure without locking or monitoring, and a read-write commit phase, which atomically validates the output of the traversal phase and applies the needed modifications to the data structure. In this paper we introduce Optimistic Transactional Boosting (OTB), an optimistic methodology for extending those designs in order to support the composition of multiple operations into one atomic execution by building a single traversal phase and a single commit phase for the whole atomic execution. As a result, OTB-based data structures are optimistic and composable. The former because they defer any locking and/or monitoring to the commit phase of the entire atomic execution; the latter because they allow the execution of multiple operations atomically. Additionally, in this paper we provide a theoretical model for analyzing OTB-based data structures and proving their correctness. In particular, we extended a recent approach that models concurrent data structures by including the two notions of optimism and composition of operations.
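The single-traversal, single-commit shape of an OTB-style composed operation can be sketched minimally (illustrative only; names are invented, and a real system would retry on failed validation under true concurrency):

```python
# OTB-flavoured sketch: several operations on a versioned set are
# composed into one atomic execution with one unmonitored traversal
# phase and one commit phase that validates what the traversal
# observed before applying all writes.

class VersionedSet:
    def __init__(self):
        self.items = set()
        self.version = 0

def atomic_add_all(s, elems):
    # Traversal phase: no locks, no monitoring; just observe.
    seen_version = s.version
    to_add = [e for e in elems if e not in s.items]
    # Commit phase: validate the traversal, then apply atomically.
    if s.version != seen_version:
        return False              # would retry in a real system
    s.items.update(to_add)
    s.version += 1
    return True

s = VersionedSet()
ok = atomic_add_all(s, [1, 2, 2, 3])
```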

Journal ArticleDOI
TL;DR: This work proposes Seer, a software scheduler that addresses precisely this restriction of HTM by leveraging an online probabilistic inference technique that identifies the most likely conflict relations and establishes a dynamic locking scheme to serialize transactions in a fine-grained manner.
Abstract: The ubiquity of multicore processors has led programmers to write parallel and concurrent applications to take advantage of the underlying hardware and speed up their executions. In this context, Transactional Memory (TM) has emerged as a simple and effective synchronization paradigm, via the familiar abstraction of atomic transactions. After many years of intense research, major processor manufacturers (including Intel) have recently released mainstream processors with hardware support for TM (HTM). In this work, we study a relevant issue with great impact on the performance of HTM. Due to the optimistic and inherently limited nature of HTM, transactions may have to be aborted and restarted numerous times, without any progress guarantee. As a result, it is up to the software library that regulates HTM usage to ensure progress and optimize performance. Transaction scheduling is probably one of the most well-studied and effective techniques to achieve these goals. However, these recent mainstream HTMs have a technical limitation that prevents the adoption of known scheduling techniques: unlike the software TM implementations used in the past, existing HTMs provide limited or no information on which memory regions or contending transactions caused an abort. To address this crucial issue for HTMs, we propose Seer, a software scheduler that works around precisely this restriction by leveraging an online probabilistic inference technique that identifies the most likely conflict relations and establishes a dynamic locking scheme to serialize transactions in a fine-grained manner. The key idea of our solution is to constrain the portions of parallelism that negatively affect the whole system. As a result, this not only prevents performance degradation but in fact unveils further scalability and performance for HTM.
Via an extensive evaluation study, we show that Seer improves the performance of Intel's HTM by up to 3.6×, and by 65% on average across all concurrency degrees and benchmarks, on a large processor with 28 cores.
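Seer's key idea, inferring likely conflict pairs from which transactions were active when aborts occurred, since HTM reports neither the conflicting address nor the contending transaction, can be sketched as follows (illustrative only; the threshold, names, and counting scheme are invented stand-ins for the paper's probabilistic inference):

```python
# Toy conflict inference: on each abort, blame every concurrently
# active transaction type; once a pair co-occurs with aborts often
# enough, the scheduler decides to serialize that pair with a lock.

from collections import Counter

abort_counts = Counter()   # (tx_a, tx_b) -> aborts observed together
serialized = set()         # pairs the scheduler has decided to lock

def record_abort(aborted_tx, active_txs):
    for other in active_txs:
        if other != aborted_tx:
            pair = tuple(sorted((aborted_tx, other)))
            abort_counts[pair] += 1
            if abort_counts[pair] >= 3:   # invented threshold
                serialized.add(pair)

def must_serialize(tx_a, tx_b):
    return tuple(sorted((tx_a, tx_b))) in serialized

# "update_order" keeps aborting while "new_order" is active:
for active in (["new_order", "payment"], ["new_order"], ["new_order"]):
    record_abort("update_order", active)
```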

Proceedings ArticleDOI
TL;DR: Weak memory and transactional memory have each been studied in detail, but their combined semantics is not well understood; this article clarifies their interplay, which matters because such widely-used architectures and languages as x86, Power, and C++ all support TM, and all have weak memory models.
Abstract: Weak memory models provide a complex, system-centric semantics for concurrent programs, while transactional memory (TM) provides a simpler, programmer-centric semantics. Both have been studied in detail, but their combined semantics is not well understood. This is problematic because such widely-used architectures and languages as x86, Power, and C++ all support TM, and all have weak memory models. Our work aims to clarify the interplay between weak memory and TM by extending existing axiomatic weak memory models (x86, Power, ARMv8, and C++) with new rules for TM. Our formal models are backed by automated tooling that enables (1) the synthesis of tests for validating our models against existing implementations and (2) the model-checking of TM-related transformations, such as lock elision and compiling C++ transactions to hardware. A key finding is that a proposed TM extension to ARMv8 currently being considered within ARM Research is incompatible with lock elision without sacrificing portability or performance.

Proceedings ArticleDOI
04 Apr 2017
TL;DR: It is argued that the problem lies with the programming model and data structures used to write STAMP programs and that HTM and STM designs will benefit if more attention is paid to the division of labor between application programs, systems software, and hardware.
Abstract: Transactional memory (TM) has been the focus of numerous studies, and it is supported in processors such as the IBM Blue Gene/Q and Intel Haswell. Many studies have used the STAMP benchmark suite to evaluate their designs. However, the speedups obtained for the STAMP benchmarks on all TM systems we know of are quite limited; for example, with 64 threads on the IBM Blue Gene/Q, we observe a median speedup of 1.4X using the Blue Gene/Q hardware transactional memory (HTM), and a median speedup of 4.1X using a software transactional memory (STM). What limits the performance of these benchmarks on TMs? In this paper, we argue that the problem lies with the programming model and data structures used to write them. To make this point, we articulate two principles that we believe must be embodied in any scalable program and argue that STAMP programs violate both of them. By modifying the STAMP programs to satisfy both principles, we produce a new set of programs that we call the Stampede suite. Its median speedup on the Blue Gene/Q is 8.0X when using an STM. The two principles also permit us to simplify the TM design. Using this new STM with the Stampede benchmarks, we obtain a median speedup of 17.7X with 64 threads on the Blue Gene/Q and 13.2X with 32 threads on an Intel Westmere system. These results suggest that HTM and STM designs will benefit if more attention is paid to the division of labor between application programs, systems software, and hardware.

Journal ArticleDOI
TL;DR: This article reviews the wide variety of scheduling techniques proposed for Software Transactional Memories, and proposes a classification of the existing techniques, and discusses the specific characteristics of each technique.
Abstract: Transactional Memory (TM) is a practical programming paradigm for developing concurrent applications. Performance is a critical factor for TM implementations, and various studies demonstrated that specialised transaction/thread scheduling support is essential for implementing performance-effective TM systems. After one decade of research, this article reviews the wide variety of scheduling techniques proposed for Software Transactional Memories. Based on peculiarities and differences of the adopted scheduling strategies, we propose a classification of the existing techniques, and we discuss the specific characteristics of each technique. Also, we analyse the results of previous evaluation and comparison studies, and we present the results of a new experimental study encompassing techniques based on different scheduling strategies. Finally, we identify potential strengths and weaknesses of the different techniques, as well as the issues that require further investigation.

Book ChapterDOI
19 Jun 2017
TL;DR: Transactional memory manages thread synchronisation on behalf of the programmer so that blocks of code execute with the illusion of atomicity; its main safety criterion, opacity, defines conditions for serialising concurrent transactions.
Abstract: Transactional memory (TM) is a mechanism that manages thread synchronisation on behalf of a programmer so that blocks of code execute with the illusion of atomicity. The main safety criterion for transactional memory is opacity, which defines conditions for serialising concurrent transactions.
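The "illusion of atomicity" has a one-line reference implementation: serialize every transaction through a single global lock. Every resulting history is sequential, so opacity holds trivially (though all concurrency is lost). A sketch of ours, not from the chapter:

```python
import threading

_tm_lock = threading.RLock()

def atomic(fn):
    """The simplest correct TM: run every 'transaction' under one global
    lock. Histories are sequential, so no transaction can ever observe
    an inconsistent intermediate state (opacity holds trivially)."""
    def wrapper(*args, **kwargs):
        with _tm_lock:
            return fn(*args, **kwargs)
    return wrapper

accounts = {"a": 100, "b": 100}

@atomic
def transfer(src, dst, amount):
    # Without the atomic wrapper, a concurrent reader could observe the
    # state where money has left src but not yet reached dst.
    accounts[src] -= amount
    accounts[dst] += amount

@atomic
def total():
    return accounts["a"] + accounts["b"]
```

Real TM systems aim for the same safety guarantee while letting non-conflicting transactions run in parallel, which is exactly what makes the serialisation conditions non-trivial.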

Book ChapterDOI
Florian Haas1, Sebastian Weis1, Theo Ungerer1, Gilles Pokam2, Youfeng Wu2 
03 Apr 2017
TL;DR: This paper proposes a software/hardware hybrid approach, which targets Intel’s current x86 multi-core platforms of the Core and Xeon family, and leverages hardware transactional memory (Intel TSX) to support implicit checkpoint creation and fast rollback.
Abstract: The demand for fault-tolerant execution on high-performance computer systems increases due to higher fault rates resulting from smaller structure sizes. As an alternative to hardware-based lockstep solutions, software-based fault-tolerance mechanisms can increase the reliability of multi-core commercial off-the-shelf (COTS) CPUs while being cheaper and more flexible. This paper proposes a software/hardware hybrid approach, which targets Intel’s current x86 multi-core platforms of the Core and Xeon family. We leverage hardware transactional memory (Intel TSX) to support implicit checkpoint creation and fast rollback. Redundant execution of processes and signature-based comparison of their computations provide error detection, and transactional wrapping enables error recovery. Existing applications are enhanced towards fault-tolerant redundant execution by post-link binary instrumentation. Hardware enhancements to further increase the applicability of the approach are proposed and evaluated with the SPEC CPU 2006 benchmarks. The resulting performance overhead is 47% on average, assuming the existence of the proposed hardware support.
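The detect-and-recover loop the abstract describes (checkpoint, redundant execution, signature comparison, rollback) can be sketched in plain software. Function names below are ours; the real system does the checkpointing and rollback with TSX transactions and applies redundancy via post-link binary instrumentation, none of which is reproduced here:

```python
import hashlib
import pickle

def _sig(result):
    # Compact signature of a computation's output, used for comparison.
    return hashlib.sha256(pickle.dumps(result)).hexdigest()

def ft_execute(fn, state, max_retries=3):
    """Run fn(state) redundantly; commit only if both executions'
    signatures agree, otherwise roll back to the checkpoint and retry."""
    for _ in range(max_retries):
        checkpoint = pickle.dumps(state)            # implicit checkpoint
        leading = fn(pickle.loads(checkpoint))      # leading execution
        trailing = fn(pickle.loads(checkpoint))     # redundant execution
        if _sig(leading) == _sig(trailing):         # signatures agree
            return leading                          # "commit"
        # Mismatch: a fault was detected; both results are discarded
        # (rollback to the checkpoint) and the computation re-executes.
    raise RuntimeError("persistent fault: signatures kept diverging")
```

In the paper, detection granularity is a transaction rather than a whole function call, which bounds both the rollback distance and the fault-detection latency.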

Proceedings ArticleDOI
01 Sep 2017
TL;DR: This work presents speculation-aware multithreading (SAM), a simple policy that addresses major performance pathologies of speculative parallelism by coordinating instruction dispatch and conflict resolution priorities and makes multithreaded cores much more beneficial on speculative parallel programs.
Abstract: This work studies the interplay between multithreaded cores and speculative parallelism (e.g., transactional memory or thread-level speculation). These techniques are often used together, yet they have been developed independently. This disconnect causes major performance pathologies: increasing the number of threads per core adds conflicts and wasted work, and puts pressure on speculative execution resources. These pathologies often squander the benefits of multithreading. We present speculation-aware multithreading (SAM), a simple policy that addresses these pathologies. By coordinating instruction dispatch and conflict resolution priorities, SAM focuses execution resources on work that is more likely to commit, avoiding aborts and using speculation resources more efficiently. We design SAM variants for in-order and out-of-order cores. SAM is cheap to implement and makes multithreaded cores much more beneficial on speculative parallel programs. We evaluate SAM on systems with up to 64 SMT cores. With SAM, 8-threaded cores outperform single-threaded cores by 2.33x on average, while a speculation-oblivious policy yields a 1.85x speedup. SAM also reduces wasted work by 52%.
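SAM's two priority rules can be modelled in a few lines. The dictionaries and field names below are ours, a toy stand-in for SMT dispatch and conflict arbitration, not the paper's hardware mechanism: tasks earlier in speculative commit order are more likely to commit, so they get dispatch priority and win conflicts.

```python
def dispatch_order(tasks):
    """Issue resources go first to the task earliest in speculative
    commit order (most likely to commit); ties are broken in favour
    of tasks with fewer recent aborts."""
    return sorted(tasks, key=lambda t: (t["spec_order"], t["aborts"]))

def resolve_conflict(a, b):
    """On a conflict, keep (return) the task that is more likely to
    commit, i.e. the one earlier in speculative order; the other
    task's speculative work is the one squashed."""
    return a if a["spec_order"] <= b["spec_order"] else b
```

The real policy operates per cycle inside the core's dispatch stage and inside the conflict-detection logic; this sketch only captures the ordering decisions.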

Patent
14 Nov 2017
TL;DR: Register restoration using transactional memory register snapshots is presented: an indication that a transaction is to be initiated is obtained, and a determination is made as to whether register restoration is in active use.
Abstract: Register restoration using transactional memory register snapshots. An indication that a transaction is to be initiated is obtained. Based on obtaining the indication, a determination is made as to whether register restoration is in active use. Based on obtaining the indication and determining register restoration is in active use, register restoration is deactivated. To recover one or more architected registers of the transaction, a transactional rollback snapshot is created.

Dissertation
03 Sep 2017
TL;DR: This thesis considers object-based STMs (OSTMs), which operate on higher-level objects rather than the read/write operations on memory buffers used by conventional STMs (denoted RWSTMs), and argues that OSTMs could be an efficient means of achieving composability of higher-level operations in software applications built on concurrent data structures.
Abstract: The major focus of software transactional memory systems (STMs) has been to facilitate multiprocessor programming and provide parallel programmers with an abstraction for fast development of concurrent and parallel applications. STMs thus allow parallel programmers to focus on the logic of parallel programs rather than worrying about synchronization. The heart of such applications is the underlying concurrent data structure, whose design is the deciding factor in whether the software application will be efficient, scalable and composable. However, achieving composition in concurrent data structures such that they are efficient as well as easy to program poses many consistency and design challenges. We say a concurrent data structure composes when multiple operations from the same or different object instances can be glued together such that the new operation also behaves atomically. For example, assume we have a linked list as the concurrent data structure, with lookup, insert and delete as the atomic operations. Now we want to implement a new move operation, which deletes a node from one position in the list and inserts it into another (or the same) list. Such a move operation may not be atomic (transactional), since it may result in an execution where another process accesses an inconsistent state of the linked list in which the node has been deleted but not yet inserted. This inability to compose may hinder the practical use of concurrent data structures. In this context, the compositionality provided by transactions in STMs can be handy: STMs offer an easy-to-program, composable transactional interface which can be used to develop concurrent data structures and thus parallel software applications. Whether this can be achieved efficiently is a question we try to answer in this thesis.
Most of the STMs proposed in the literature are based on read/write primitive operations (or methods) on memory buffers, and are hence denoted RWSTMs. These lower-level read/write primitives provide no useful information beyond the fact that a write must always be ordered with respect to any other read or write, which limits the number of possible concurrent executions. In this thesis, we consider object-based STMs, or OSTMs, which operate on higher-level objects rather than read/write operations on memory locations. The main advantage of OSTMs is that, with the greater semantic information provided by the methods of the object, conflicts among transactions can be reduced and, as a result, so can the number of aborts. This permits a larger number of concurrent executions, leading to more concurrency. Hence, OSTMs could be an efficient means of achieving composability of higher-level operations in software applications using concurrent data structures, allowing parallel programmers to leverage the underlying multi-core architecture. To design the OSTM, we have adopted the transactional tree model developed for databases. We extend the traditional notions of conflicts and legality to higher-level operations in STMs, which allows efficient composability, and using these notions we define the standard STM correctness notion of conflict-opacity. The OSTM model can be easily extended to implement concurrent lists, sets, queues or other concurrent data structures. We use the proposed theoretical OSTM model to design HT-OSTM, an OSTM with an underlying hash-table object. We observed that the major concurrency hot-spot is the chaining data structure within the hash table, so we use a lazy skip-list approach, which is more time-efficient than ordinary lists in terms of traversal overhead. At the transactional level, we use a timestamp-ordering protocol to ensure that the executions are conflict-opaque.
We provide a detailed handcrafted proof of correctness from the operational level up to the transactional level. At the operational level we show that HT-OSTM generates legal sequential histories; at the transactional level we show that every such sequential history is opaque and thus co-opaque. HT-OSTM exports STM insert, STM lookup and STM delete methods to the programmer, along with STM begin and STM trycommit. Using these higher-level operations, a user may easily and efficiently program any parallel software application involving a concurrent hash table. To demonstrate the efficiency of composition, we build a test application which atomically executes a number of hash-table methods (generated with a given probability) in a transaction. Finally, we evaluate HT-OSTM against the ESTM-based hash table of Synchrobench and a hash table designed for an RWSTM based on the basic timestamp-ordering protocol. We observe that HT-OSTM outperforms ESTM by an average margin of 10^6 transactions per second (throughput) for both lookup-intensive and update-intensive workloads, and outperforms the RWSTM by 3% and 3.4% on update-intensive and lookup-intensive workloads, respectively.
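The exported interface (begin, insert, lookup, delete, trycommit) is easy to picture as a deferred-update transaction over a hash table. The sketch below is deliberately simplified and ours alone: the real HT-OSTM uses per-bucket lazy skip-lists and a timestamp-ordering protocol, whereas here a single commit-time lock and per-key versions stand in for both.

```python
import threading

class TinyOSTMHashTable:
    """Toy deferred-update transactional hash table, a stand-in for the
    STM_begin / STM_insert / STM_lookup / STM_delete / STM_tryCommit
    interface described in the abstract (simplified validation)."""

    def __init__(self):
        self._data = {}                 # key -> (value, version)
        self._lock = threading.Lock()   # serializes commits only

    def begin(self):
        return _Tx(self)

class _Tx:
    def __init__(self, table):
        self._t, self._reads, self._writes = table, {}, {}

    def lookup(self, key):
        if key in self._writes:                      # read-your-own-writes
            return self._writes[key]
        value, version = self._t._data.get(key, (None, 0))
        self._reads[key] = version                   # remember for validation
        return value

    def insert(self, key, value):
        self._writes[key] = value                    # deferred update

    def delete(self, key):
        self._writes[key] = None                     # None marks absence

    def try_commit(self):
        with self._t._lock:
            # Validate: every key we read must still have the version we saw.
            for key, seen in self._reads.items():
                if self._t._data.get(key, (None, 0))[1] != seen:
                    return False                     # conflict: abort
            # Install the write set, bumping versions.
            for key, value in self._writes.items():
                _, version = self._t._data.get(key, (None, 0))
                self._t._data[key] = (value, version + 1)
            return True
```

A composed move operation, the thesis's motivating example, is then just a delete and an insert inside one transaction, committed (or aborted) atomically.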

Proceedings ArticleDOI
25 Jul 2017
TL;DR: In this article, the authors propose an approach that combines the best of both worlds: a software-only fallback path and a hardware-only fast path for lock-free data structures based on down-trees.
Abstract: Algorithms that use hardware transactional memory (HTM) must provide a software-only fallback path to guarantee progress. The design of the fallback path can have a profound impact on performance. If the fallback path is allowed to run concurrently with hardware transactions, then hardware transactions must be instrumented, adding significant overhead. Otherwise, hardware transactions must wait for any processes on the fallback path, causing concurrency bottlenecks, or move to the fallback path. We introduce an approach that combines the best of both worlds. The key idea is to use three execution paths: an HTM fast path, an HTM middle path, and a software fallback path, such that the middle path can run concurrently with each of the other two. The fast path and fallback path do not run concurrently, so the fast path incurs no instrumentation overhead. Furthermore, fast path transactions can move to the middle path instead of waiting or moving to the software path. We demonstrate our approach by producing an accelerated version of the tree update template of Brown et al., which can be used to implement fast lock-free data structures based on down-trees. We used the accelerated template to implement two lock-free trees: a binary search tree (BST), and an (a,b)-tree (a generalization of a B-tree). Experiments show that, with 72 concurrent processes, our accelerated (a,b)-tree performs between 4.0x and 4.2x as many operations per second as an implementation obtained using the original tree update template.
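The path-selection discipline can be sketched independently of the tree template. The function below is our illustration, not the authors' code: try the uninstrumented HTM fast path a few times, demote to the HTM middle path on persistent aborts, and only then take the software fallback. The three callbacks are assumptions standing in for the real attempt logic.

```python
def run_update(op, htm_fast_try, htm_middle_try, sw_fallback,
               fast_attempts=3, middle_attempts=3):
    """Three-path execution: fast and fallback paths never run
    concurrently, so the fast path needs no instrumentation; the
    middle path may run concurrently with either of the other two.
    htm_fast_try/htm_middle_try return True iff the HTM transaction
    committed; sw_fallback always completes."""
    for _ in range(fast_attempts):
        if htm_fast_try(op):
            return "fast"
    for _ in range(middle_attempts):      # demote instead of waiting
        if htm_middle_try(op):
            return "middle"
    sw_fallback(op)                       # guaranteed progress
    return "fallback"
```

The performance argument in the paper is precisely that the common case returns "fast" with zero instrumentation, while contended or oversized operations still make progress through the middle and fallback paths.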

Journal ArticleDOI
TL;DR: The implementation reveals that HiperTM guarantees 0% out-of-order optimistic deliveries and performance up to 3.5× better than the atomic-broadcast-based competitor (PaxosSTM) under the standard configuration of the TPC-C benchmark.

Proceedings ArticleDOI
01 Sep 2017
TL;DR: The evaluation covers a diverse range of tree sizes and operation workloads and reveals that BSTs based on RCU-HTM outperform the alternatives by more than 18% on average on a multi-core server with 44 hardware threads.
Abstract: In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to proceed without any synchronization and without being affected by concurrent modifications. The novelty of RCU-HTM lies at leveraging HTM to permit multiple updating threads to execute concurrently. After appropriately modifying the private copy, we execute an HTM transaction, which atomically validates that all the affected parts of the tree have remained unchanged since they've been read and, only if this validation is successful, installs the copy in the tree structure. We apply RCU-HTM on AVL and Red-Black balanced BSTs and compare their performance to state-of-the-art lock-based, non-blocking, RCU- and HTM-based BSTs. Our experimental evaluation reveals that BSTs implemented with RCU-HTM achieve high performance, not only for read-only operations, but also for update operations. More specifically, our evaluation includes a diverse range of tree sizes and operation workloads and reveals that BSTs based on RCU-HTM outperform other alternatives by more than 18%, on average, on a multi-core server with 44 hardware threads.
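The copy-validate-install step is the heart of RCU-HTM. In the toy model below (ours, not the paper's code) a short lock-protected critical section with a version check stands in for the HTM transaction; readers take unsynchronized snapshots, and an updater installs its private copy only if nothing it read has changed in the meantime.

```python
import threading

class VersionedRoot:
    """Toy stand-in for RCU-HTM's install step: updaters prepare a
    private copy off to the side, then atomically (here: under a lock
    simulating the HTM transaction) validate that the structure is
    unchanged since it was read, and swing the root to the copy."""

    def __init__(self, root):
        self.root, self.version = root, 0
        self._lock = threading.Lock()

    def read_snapshot(self):
        # Readers traverse without any synchronization; they just
        # remember which version of the structure they saw.
        return self.root, self.version

    def try_install(self, new_root, seen_version):
        with self._lock:                   # simulates the HTM transaction
            if self.version != seen_version:
                return False               # concurrent update won: retry
            self.root = new_root           # install the private copy
            self.version += 1
            return True
```

The real algorithm validates only the affected parts of the tree (not a single global version), which is what lets multiple non-conflicting updaters commit concurrently.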

Proceedings ArticleDOI
14 Jun 2017
TL;DR: This paper argues that PhTM is a competitive alternative to HyTM and proposes PhTM*, the first implementation of PhTM on modern HTM-ready processors, which avoids unnecessary transitions to software mode by taking into account the categories of hardware aborts and by adding a new serialization mode.
Abstract: In recent years, Hybrid TM (HyTM) has been proposed as a transactional memory approach that leverages the advantages of both hardware (HTM) and software (STM) execution modes. HyTM assumes that concurrent transactions can have very different phases and thus should run under different execution modes. Although HyTM has been shown to improve performance, the overall solution can be complicated to manage, both in terms of correctness and performance. On the other hand, Phased Transactional Memory (PhTM) considers that concurrent transactions have similar phases, and thus all transactions can run under the same mode. As a result, PhTM does not require coordination between transactions in distinct modes, making its implementation simpler and more flexible. In this paper we claim that PhTM is a competitive alternative to HyTM and propose PhTM*, the first implementation of PhTM on modern HTM-ready processors. The novelty of PhTM* lies in avoiding unnecessary transitions to software mode by: (i) taking into account the categories of hardware aborts; (ii) adding a new serialization mode. Experimental results with Haswell's TSX reveal that, for the STAMP benchmark suite, PhTM* performs on average 11% better than PhTM, a previous phase-based TM, and 15% better than HyTM-NOrec, a state-of-the-art HyTM. In addition, PhTM* proved even more effective on a Power8 machine, performing over 25% and 36% better than PhTM and HyTM-NOrec, respectively.
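The two novelties translate into a small mode-transition policy. The sketch below is our reading of the abstract, not the authors' implementation, and the mode names are ours: conflict (transient) aborts simply retry in hardware, while capacity-style aborts first try a serialization mode before the whole system is demoted to software.

```python
class PhTMStarPolicy:
    """Toy PhTM*-style phase machine: all transactions share one global
    mode. Abort categories decide transitions: conflicts retry in HW;
    capacity aborts serialize first (new mode) and only fall back to
    SW after repeated failures."""

    HW, SERIAL, SW = "HW", "SERIAL", "SW"

    def __init__(self, hw_retries=3):
        self.mode = self.HW
        self.hw_retries = hw_retries

    def on_abort(self, kind, attempt):
        # Transient conflict aborts are likely to succeed on retry,
        # so they never force a mode transition by themselves.
        if kind == "conflict":
            return self.mode
        if kind == "capacity":
            # One oversized transaction serializes instead of dragging
            # every transaction into instrumented software mode; only
            # persistent capacity failures demote the system to SW.
            self.mode = self.SERIAL if attempt < self.hw_retries else self.SW
        return self.mode
```

Because every transaction runs in the same mode, no cross-mode coordination is needed, which is exactly the simplicity argument PhTM makes against HyTM.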