Showing papers on "Transactional memory" published in 2018


Proceedings ArticleDOI
11 Jul 2018
TL;DR: Romulus is presented, a user-level library persistent transactional memory (PTM) which provides durable transactions through the usage of twin copies of the data and achieves twice the throughput of current state of the art PTMs in update-only workloads, and more than one order of magnitude in read-mostly scenarios.
Abstract: Byte addressable persistent memory eliminates the need for serialization and deserialization of data, to and from persistent storage, allowing applications to interact with it through common store and load instructions. In the event of a process or system failure, applications rely on persistent techniques to provide consistent storage of data in non-volatile memory (NVM). For most of these techniques, consistency is ensured through logging of updates, with consequent intensive cache line flushing and persistent fences necessary to guarantee correctness. Undo log based approaches require store interposition and persistence fences before each in-place modification. Redo log based techniques can execute transactions using just two persistence fences, although they require store and load interposition which may incur a performance penalty for large transactions. So far, these techniques have been difficult to integrate with known memory allocators, requiring allocators or garbage collectors specifically designed for NVM. We present Romulus, a user-level library persistent transactional memory (PTM) which provides durable transactions through the usage of twin copies of the data. A transaction in Romulus requires at most four persistence fences, regardless of the transaction size. Romulus uses only store interposition. Any sequential implementation of a memory allocator can be adapted to work with Romulus. Thanks to its lightweight design and low synchronization overhead, Romulus achieves twice the throughput of current state of the art PTMs in update-only workloads, and more than one order of magnitude in read-mostly scenarios.

77 citations
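
The twin-copy idea described in the Romulus abstract can be illustrated with a toy model. The following is a minimal sketch in Python, not the library's implementation: the class `TwinCopyStore`, the state names `IDLE`, `MUTATING` and `COPYING`, the `crash_at` parameter and the recovery rules are all assumptions made for illustration, and ordinary Python attributes stand in for persistent memory, cache line flushes and persistence fences.

```python
IDLE, MUTATING, COPYING = "IDLE", "MUTATING", "COPYING"


class TwinCopyStore:
    """Toy model of a twin-copy durable store (hypothetical, for illustration)."""

    def __init__(self, initial):
        self.main = dict(initial)  # copy that transactions modify in place
        self.back = dict(initial)  # twin copy holding the last committed state
        self.state = IDLE          # marker that recovery would read after a crash

    def transaction(self, update, crash_at=None):
        """Run update(main); crash_at ('mutating' or 'copying') simulates a failure."""
        self.state = MUTATING      # a real system would flush and fence here
        update(self.main)          # in-place stores on the main copy
        if crash_at == "mutating":
            return                 # crash: main may be inconsistent, back is intact
        self.state = COPYING       # main is now consistent and complete
        if crash_at == "copying":
            return                 # crash: back is stale, main is intact
        self.back = dict(self.main)
        self.state = IDLE

    def recover(self):
        """Bring both copies to the same consistent state after a simulated crash."""
        if self.state == MUTATING:
            self.main = dict(self.back)  # roll back the unfinished transaction
        elif self.state == COPYING:
            self.back = dict(self.main)  # roll forward the committed transaction
        self.state = IDLE


def move_one(data):
    """Example update: move one unit from 'a' to 'b'."""
    data["a"] -= 1
    data["b"] += 1


store = TwinCopyStore({"a": 1, "b": 2})
store.transaction(move_one, crash_at="mutating")
store.recover()  # the interrupted transaction is undone
```

In this model the number of state changes per transaction is fixed, regardless of how many stores `update` performs, which is the property the abstract highlights for persistence fences.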


Proceedings ArticleDOI
02 Jun 2018
TL;DR: DHTM (durable hardware transactional memory) is the first complete and practical hardware based solution for ACID transactions that has the potential to significantly ease the burden of crash consistent programming.
Abstract: The emergence of byte-addressable persistent (non-volatile) memory provides a low latency and high bandwidth path to durability. However, programmers need guarantees on what will remain in persistent memory in the event of a system crash. A widely accepted model for crash consistent programming is ACID transactions, in which updates within a transaction are made visible as well as durable in an atomic manner. However, existing software based proposals suffer from significant performance overheads. In this paper, we support both atomic visibility and durability in hardware. We propose DHTM (durable hardware transactional memory) that leverages a commercial HTM to provide atomic visibility and extends it with hardware support for redo logging to provide atomic durability. Furthermore, we leverage the same logging infrastructure to extend the supported transaction size (from being L1-limited to LLC-limited) with only minor changes to the coherence protocol. Our evaluation shows that DHTM outperforms the state-of-the-art by an average of 21% to 25% on TATP, TPC-C and a set of microbenchmarks. We believe DHTM is the first complete and practical hardware based solution for ACID transactions that has the potential to significantly ease the burden of crash consistent programming.

54 citations


Proceedings ArticleDOI
29 May 2018
TL;DR: In this article, a combination of state-of-the-art cache attacks with kernel-fuzzing techniques is proposed to detect and eliminate double-fetch bugs, a special type of race condition in which an unprivileged execution thread is able to change a memory location between the time of check and the time of use of a privileged execution thread.
Abstract: Double-fetch bugs are a special type of race condition, where an unprivileged execution thread is able to change a memory location between the time-of-check and time-of-use of a privileged execution thread. If an unprivileged attacker changes the value at the right time, the privileged operation becomes inconsistent, leading to a change in control flow, and thus an escalation of privileges for the attacker. More severely, such double-fetch bugs can be introduced by the compiler, entirely invisible on the source-code level. We propose novel techniques to efficiently detect, exploit, and eliminate double-fetch bugs. We demonstrate the first combination of state-of-the-art cache attacks with kernel-fuzzing techniques to allow fully automated identification of double fetches. We demonstrate the first fully automated reliable detection and exploitation of double-fetch bugs, making manual analysis as in previous work superfluous. We show that cache-based triggers outperform state-of-the-art exploitation techniques significantly, leading to an exploitation success rate of up to 97%. Our modified fuzzer automatically detects double fetches and automatically narrows down this candidate set for double-fetch bugs to the exploitable ones. We present the first generic technique based on hardware transactional memory, to eliminate double-fetch bugs in a fully automated and transparent manner. We extend defensive programming techniques by retrofitting arbitrary code with automated double-fetch prevention, both in trusted execution environments as well as in syscalls, with a performance overhead below 1%.

39 citations
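
The time-of-check to time-of-use pattern behind a double fetch can be shown with a small deterministic model. This is a sketch under stated assumptions: `FlippingBuffer`, `MAX_LEN` and both functions are hypothetical names, the "attacker" is simulated by a buffer that returns a different value on its second read, and the fix shown is the conventional single-fetch copy rather than the HTM-based elimination the paper proposes.

```python
MAX_LEN = 16  # hypothetical upper bound enforced by the privileged code


class FlippingBuffer:
    """Stand-in for user memory that a concurrent thread rewrites between reads."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def read(self):
        # Return the next scripted value; repeat the last one when exhausted.
        value = self._values[min(self.reads, len(self._values) - 1)]
        self.reads += 1
        return value


def length_double_fetch(user_len):
    """Vulnerable: the value is fetched once for the check and again for the use."""
    if user_len.read() > MAX_LEN:      # first fetch: time of check
        raise ValueError("length too large")
    return user_len.read()             # second fetch: time of use


def length_single_fetch(user_len):
    """Safer: fetch once into private storage, then check and use that copy."""
    length = user_len.read()
    if length > MAX_LEN:
        raise ValueError("length too large")
    return length


unsafe = length_double_fetch(FlippingBuffer([8, 4096]))
safe = length_single_fetch(FlippingBuffer([8, 4096]))
```

Here `unsafe` ends up holding a value that never passed the bounds check, while `safe` holds the checked value.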


Proceedings ArticleDOI
10 Feb 2018
TL;DR: This work identifies a property of epoch-based memory reclamation algorithms that makes them ideal for implementing range queries, and produces three algorithms, which use locks, transactional memory and lock-free techniques, respectively.
Abstract: Concurrent sets with range query operations are highly desirable in applications such as in-memory databases. However, few set implementations offer range queries. Known techniques for augmenting data structures with range queries (or operations that can be used to build range queries) have numerous problems that limit their usefulness. For example, they impose high overhead or rely heavily on garbage collection. In this work, we show how to augment data structures with highly efficient range queries, without relying on garbage collection. We identify a property of epoch-based memory reclamation algorithms that makes them ideal for implementing range queries, and produce three algorithms, which use locks, transactional memory and lock-free techniques, respectively. Our algorithms are applicable to more data structures than previous work, and are shown to be highly efficient on a large scale Intel system.

34 citations


Proceedings ArticleDOI
11 Jun 2018
TL;DR: This work aims to clarify the interplay between weak memory and TM by extending existing axiomatic weak memory models with new rules for TM with a key finding that a proposed TM extension to ARMv8 currently being considered within ARM Research is incompatible with lock elision without sacrificing portability or performance.
Abstract: Weak memory models provide a complex, system-centric semantics for concurrent programs, while transactional memory (TM) provides a simpler, programmer-centric semantics. Both have been studied in detail, but their combined semantics is not well understood. This is problematic because such widely-used architectures and languages as x86, Power, and C++ all support TM, and all have weak memory models. Our work aims to clarify the interplay between weak memory and TM by extending existing axiomatic weak memory models (x86, Power, ARMv8, and C++) with new rules for TM. Our formal models are backed by automated tooling that enables (1) the synthesis of tests for validating our models against existing implementations and (2) the model-checking of TM-related transformations, such as lock elision and compiling C++ transactions to hardware. A key finding is that a proposed TM extension to ARMv8 currently being considered within ARM Research is incompatible with lock elision without sacrificing portability or performance.

28 citations


Proceedings ArticleDOI
21 May 2018
TL;DR: NV-HTM is presented, a system that allows the execution of transactions over PM using unmodified commodity HTM implementations, and can achieve up to 10× speed-ups and up to 11.6× reduced flush operations with respect to state of the art solutions, which, unlike NV-HTM, require custom modifications to existing HTM systems.
Abstract: Persistent Memory (PM) and Hardware Transactional Memory (HTM) are two recent architectural developments whose joint usage promises to drastically accelerate the performance of concurrent, data-intensive applications. Unfortunately, combining these two mechanisms using existing architectural supports is far from being trivial. This paper presents NV-HTM, a system that allows the execution of transactions over PM using unmodified commodity HTM implementations. NV-HTM relies on a hardware-software co-design technique, which is based on three key ideas: i) relying on software to persist transactional modifications after they have been committed via HTM; ii) postponing the externalization of commit events to applications until it is ensured, via software, that any data version produced and observed by committed transactions is first logged in PM; iii) pruning the commit logs via checkpointing schemes that not only bound the log space and recovery time, but also implement wear levelling techniques to enhance PM's endurance. By means of an extensive experimental evaluation, we show that NV-HTM can achieve up to 10× speed-ups and up to 11.6× reduced flush operations with respect to state of the art solutions, which, unlike NV-HTM, require custom modifications to existing HTM systems.

22 citations


Proceedings ArticleDOI
29 May 2018
TL;DR: A systematic analysis of the security requirements that a software-only solution must meet to defeat cache attacks is provided, a software design that leverages HTM to satisfy these requirements is proposed and several optimization techniques in the implementation are devised to reduce performance impact caused by transaction aborts.
Abstract: A program's use of CPU caches may reveal its memory access pattern and thus leak sensitive information when the program performs secret-dependent memory accesses. In recent studies, it has been demonstrated that cache side-channel attacks that extract secrets by observing the victim program's cache uses can be conducted under a variety of scenarios, among which the most concerning are cross-VM attacks and those against SGX enclaves. In this paper, we propose a mechanism that leverages hardware transactional memory (HTM) to enable software programs to defend themselves against various cache side-channel attacks. We observe that when the HTM is implemented by retrofitting cache coherence protocols, as is the case of Intel's Transactional Synchronization Extensions, the cache interference that is necessary in cache side-channel attacks will inevitably terminate hardware transactions. We provide a systematic analysis of the security requirements that a software-only solution must meet to defeat cache attacks, propose a software design that leverages HTM to satisfy these requirements and devise several optimization techniques in our implementation to reduce performance impact caused by transaction aborts. The empirical evaluation suggests that the performance overhead caused by the HTM-based solution is low.

22 citations


Journal ArticleDOI
TL;DR: An alternative specification to SI is given that characterises it in terms of transactional dependency graphs of Adya et al., generalising serialisation graphs, and does not require adding additional information to dependency graphs about start and commit points of transactions.
Abstract: Snapshot isolation (SI) is a widely used consistency model for transaction processing, implemented by most major databases and some transactional memory systems. Unfortunately, its classical definition is given in a low-level operational way, by an idealised concurrency-control algorithm, and this complicates reasoning about the behaviour of applications running under SI. We give an alternative specification to SI that characterises it in terms of transactional dependency graphs of Adya et al., generalising serialisation graphs. Unlike previous work, our characterisation does not require adding additional information to dependency graphs about start and commit points of transactions. We then exploit our specification to obtain two kinds of static analyses. The first one checks when a set of transactions running under SI can be chopped into smaller pieces without introducing new behaviours, to improve performance. The other analysis checks whether a set of transactions running under a weakening of SI behaves the same as when running under SI.

21 citations


Journal ArticleDOI
TL;DR: Experimental results reveal that by implementing TLS on top of HTM, speed-ups of up to 3.8× can be obtained for some loops.
Abstract: This paper presents a detailed analysis of the application of Hardware Transactional Memory (HTM) support for loop parallelization with Thread-Level Speculation (TLS) and describes a careful evaluation of the implementation of TLS on the HTM extensions available in such machines. The sample implementation of TLS over HTM described in this paper also provides evidence that the programming effort to implement TLS over HTM support is non-trivial. Thus the paper also describes an extension to OpenMP that both makes TLS more accessible to OpenMP programmers and allows for the easy tuning of TLS parameters. As a result, it provides evidence to support several important claims about the performance of TLS over HTM in the Intel Core and the IBM POWER8 architectures. Experimental results reveal that by implementing TLS on top of HTM, speed-ups of up to 3.8× can be obtained for some loops.

20 citations


Journal ArticleDOI
TL;DR: The proposed clfB-tree, a B-tree structure whose tree node fits in a single cache line, achieves atomicity and consistency via in-place update, which requires at most four cache line flushes.
Abstract: Emerging byte-addressable non-volatile memory (NVRAM) is expected to replace block device storage as an alternative low-latency persistent storage device. If NVRAM is used as a persistent storage device, a cache line instead of a disk page will be the unit of data transfer, consistency, and durability. In this work, we design and develop clfB-tree, a B-tree structure whose tree node fits in a single cache line. We employ the existing write combining store buffer and restricted transactional memory to provide a failure-atomic cache line write operation. Using the failure-atomic cache line write operations, we atomically update a clfB-tree node via a single cache line flush instruction without major changes in hardware. However, there exist many processors that do not provide a software interface for transactional memory. For those processors, our proposed clfB-tree achieves atomicity and consistency via in-place update, which requires at most four cache line flushes. We evaluate the performance of clfB-tree on an NVRAM emulation board with an ARM Cortex A-9 processor and a workstation that has an Intel Xeon E7-4809 v3 processor. Our experimental results show clfB-tree outperforms wB-tree and CDDS B-tree by a large margin in terms of both insertion and search performance.

20 citations


Proceedings ArticleDOI
23 Jul 2018
TL;DR: This brief announcement presents a fundamental concurrent primitive for persistent memory - a persistent atomic multi-word compare-and-swap (PMCAS), carefully crafted to ensure that atomic updates to a multitude of words modified by the PMCAS are persisted correctly.
Abstract: This brief announcement presents a fundamental concurrent primitive for persistent memory - a persistent atomic multi-word compare-and-swap (PMCAS). We present a novel algorithm carefully crafted to ensure that atomic updates to a multitude of words modified by the PMCAS are persisted correctly. Our algorithm leverages hardware transactional memory (HTM) for concurrency control, and has a total of 3 persist barriers in its critical path. We also overview variants based on just the compare-and-swap (CAS) instruction and a hybrid of CAS and HTM.
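
The semantics of a multi-word compare-and-swap, leaving persistence aside, can be sketched briefly. This is only an illustration of the all-or-nothing contract, not the announced algorithm: the class `MultiWordCAS` and its `mcas` method are hypothetical names, a `threading.Lock` stands in for the HTM-based concurrency control, and persist barriers are not modelled at all.

```python
import threading


class MultiWordCAS:
    """Toy multi-word compare-and-swap over a dict of words (hypothetical names)."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self._lock = threading.Lock()  # assumption: a lock replaces HTM here

    def mcas(self, entries):
        """entries is a list of (address, expected, new) triples.

        Either every word matches its expected value and all are updated,
        or nothing is changed. Returns True on success, False otherwise.
        """
        with self._lock:
            if any(self.memory[addr] != expected for addr, expected, _ in entries):
                return False
            for addr, _, new in entries:
                self.memory[addr] = new
            return True


words = MultiWordCAS({"x": 1, "y": 2})
first = words.mcas([("x", 1, 10), ("y", 2, 20)])   # both expectations hold
second = words.mcas([("x", 10, 11), ("y", 2, 21)])  # "y" no longer equals 2
```

After these two calls, only the first update has taken effect; the second leaves both words untouched because one comparison failed.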

Proceedings ArticleDOI
20 Oct 2018
TL;DR: Two techniques are contributed that enable seamlessly composing and coordinating speculative and non-speculative work in the context of ordered parallelism, and allow speculative tasks to safely invoke irrevocable actions.
Abstract: Multicore systems should support both speculative and non-speculative parallelism. Speculative parallelism is easy to use and is crucial to scale many challenging applications, while non-speculative parallelism is more efficient and allows parallel irrevocable actions (e.g., parallel I/O). Unfortunately, prior techniques are far from this goal. Hardware transactional memory (HTM) systems support speculative (transactional) and non-speculative (non-transactional) work, but lack coordination mechanisms between the two, and are limited to unordered parallelism. Prior work has extended HTMs to avoid the limitations of speculative execution, e.g., through escape actions and open-nested transactions. But these mechanisms are incompatible with systems that exploit ordered parallelism, which parallelize a broader range of applications and are easier to use. We contribute two techniques that enable seamlessly composing and coordinating speculative and non-speculative work in the context of ordered parallelism: (i) a task-based execution model that efficiently coordinates concurrent speculative and non-speculative ordered tasks, allowing them to create tasks of either kind and to operate on shared data; and (ii) a safe way for speculative tasks to invoke software-managed speculative actions that avoid hardware version management and conflict detection. These contributions improve efficiency and enable new capabilities. Across several benchmarks, they allow the system to dynamically choose whether to execute tasks speculatively or non-speculatively, avoid needless conflicts among speculative tasks, and allow speculative tasks to safely invoke irrevocable actions.

Journal ArticleDOI
TL;DR: This work proposes GPU-LocalTM as a lightweight and efficient transactional memory (TM) for GPU local memory, which provides from 1.1X up to 100X speedup over serialized critical sections.
Abstract: Graphics Processing Units (GPUs) have become the accelerator of choice for data-parallel applications, enabling the execution of thousands of threads in a Single Instruction - Multiple Thread (SIMT) fashion. Using OpenCL terminology, GPUs offer a global memory space shared by all the threads in the GPU, as well as a local memory space shared by only a subset of the threads. Programmers can use local memory as a scratchpad to improve the performance of their applications due to its lower latency as compared to global memory. In the SIMT execution model, data locking mechanisms used to protect shared data limit scalability. To take full advantage of the lower latency that local memory affords, and to provide an efficient synchronization mechanism, we propose GPU-LocalTM as a lightweight and efficient transactional memory (TM) for GPU local memory. To minimize the storage resources required for TM support, GPU-LocalTM allocates transactional metadata in the existing memory resources. Additionally, GPU-LocalTM implements different conflict detection mechanisms that can be used to match the characteristics of the application. For the workloads studied in our simulation-based evaluation, GPU-LocalTM provides from 1.1X up to 100X speedup over serialized critical sections.

Book ChapterDOI
09 May 2018
TL;DR: Higher-level methods of the underlying cds like lookup, insert or delete aid in ignoring unimportant lower level read/write conflicts and allow better concurrency.
Abstract: Composing together the individual atomic methods of concurrent data-structures (cds) poses multiple design and consistency challenges. In this context, composition provided by transactions in software transactional memory (STM) can be handy. However, most STMs offer read/write primitives to access shared cds. These read/write primitives result in unnecessary aborts. Instead, semantically rich higher-level methods of the underlying cds like lookup, insert or delete (in case of hash-table or lists) aid in ignoring unimportant lower level read/write conflicts and allow better concurrency.
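
The difference between read/write-level and method-level conflict detection can be made concrete with a small example. The sketch below is an assumption-laden illustration, not the chapter's algorithm: the functions `rw_conflict` and `semantic_conflict`, the node labels, and the sorted-list scenario (keys 10, 20, 30 with an insert of 25 and a lookup of 30) are all invented for this purpose.

```python
def rw_conflict(tx_a, tx_b):
    """Read/write-level check: one transaction writes what the other touches."""
    a_touches = tx_a["reads"] | tx_a["writes"]
    b_touches = tx_b["reads"] | tx_b["writes"]
    return bool(tx_a["writes"] & b_touches or tx_b["writes"] & a_touches)


def semantic_conflict(op_a, op_b):
    """Method-level check for a set: same key and at least one update."""
    (name_a, key_a), (name_b, key_b) = op_a, op_b
    return key_a == key_b and (name_a != "lookup" or name_b != "lookup")


# Sorted linked list head -> n10 -> n20 -> n30.
# insert(25) traverses up to n20 and rewrites n20's next pointer.
insert_25 = {"reads": {"head", "n10", "n20"}, "writes": {"n20"}}
# lookup(30) traverses the whole list and writes nothing.
lookup_30 = {"reads": {"head", "n10", "n20", "n30"}, "writes": set()}

low_level = rw_conflict(insert_25, lookup_30)
high_level = semantic_conflict(("insert", 25), ("lookup", 30))
```

At the read/write level the two operations collide on node `n20`, so an STM built on those primitives would abort one of them; at the method level they touch different keys and can both proceed.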

Journal ArticleDOI
TL;DR: This article proposes very simple architectural changes to the existing requester-wins HTM implementations that enhance conflict resolution between hardware transactions and thus improve their parallelism.
Abstract: Today’s hardware transactional memory (HTM) systems rely on existing coherence protocols, which implement a requester-wins strategy. This, in turn, leads to poor performance when transactions frequently conflict, causing them to resort to a non-speculative fallback path. Often, such a path severely limits parallelism. In this article, we propose very simple architectural changes to the existing requester-wins HTM implementations that enhance conflict resolution between hardware transactions and thus improve their parallelism. Our idea is compatible with existing HTM systems, requires no changes to target applications that employ traditional lock synchronization, and is shown to provide robust performance benefits.

Proceedings Article
07 Jan 2018
TL;DR: This paper provides a general architectural framework for the introduction of transactions into models of relaxed memory in hardware, including the SC, TSO, ARMv8 and PPC models, and proves abstraction theorems to demonstrate that the programmer API matches the intuitions and expectations about transactions.
Abstract: The integration of transactions into hardware relaxed memory architectures is a topic of research both in industry and academia. In this paper, we provide a general architectural framework (that includes the SC, TSO and ARMv8 models) for the introduction of transactions into models of relaxed memory in hardware. Our model incorporates flexible and expressive forms of transaction abort and execution that have hitherto been in the realm of Software Transactional Memory. In contrast to Software Transactional Memory, we account for the characteristics of relaxed memory as a (restricted form of a) distributed system without a notion of global time. We prove abstraction theorems to demonstrate that the programmer API matches the intuitions and expectations about transactions.

Journal ArticleDOI
TL;DR: It is shown that there are problem instances for which there is no scheduling algorithm that can simultaneously minimize the completion time and communication cost, and these instances reveal a trade-off: minimizing execution time implies high communication cost and vice versa.
Abstract: We consider scheduling problems in the data flow model of distributed transactional memory. Objects shared by transactions move from one network node to another by following network paths. We examine how the objects’ transfer in the network affects the completion time of all transactions and the total communication cost. We show that there are problem instances for which there is no scheduling algorithm that can simultaneously minimize the completion time and communication cost. These instances reveal a trade-off: minimizing execution time implies high communication cost and vice versa. On the positive side, we provide scheduling algorithms which are independently communication cost near-optimal or execution time efficient.

Journal ArticleDOI
01 Apr 2018
TL;DR: This paper investigates the efficacy of exploiting semantic and temporal characteristics of critical sections in preventing excessive loss in computation accuracy, and devise a light-weight, proof-of-concept Approximate Speculative Lock Elision (ASLE) implementation, which exploits existing hardware support for SLE.
Abstract: Each synchronization point represents a point of serialization, and thereby can easily hurt parallel scalability. As demonstrated by recent studies, approximating, i.e., relaxing synchronization by eliminating a subset of synchronization points spatio-temporally can help improve parallel scalability, as long as approximation incurred violations of basic execution semantics remain predictable and controllable. Even if the divergence from fully-synchronized execution renders lower computation accuracy rather than catastrophic program termination, for approximation to be viable, the accuracy loss must be bounded. In this paper, we assess the viability of approximate synchronization using Speculative Lock Elision (SLE), which was adopted by hardware transactional memory implementations from industry, as a baseline for comparison. Specifically, we investigate the efficacy of exploiting semantic and temporal characteristics of critical sections in preventing excessive loss in computation accuracy, and devise a light-weight, proof-of-concept Approximate Speculative Lock Elision (ASLE) implementation, which exploits existing hardware support for SLE.

Proceedings ArticleDOI
16 Apr 2018
TL;DR: Non-volatile memory technology and its impact on systems is viewed as the convergence of several past research trends starting with the concept of single-level store, encompassing the 1980s excitement around bubble memory, building upon persistent object systems, and leveraging recent work in transactional memory.
Abstract: Around 2010, we observed significant research activity around the development of non-volatile memory technologies. Shortly thereafter, other research communities began considering the implications of non-volatile memory on system design, from storage systems to data management solutions to entire systems. Finally, in July 2015, Intel and Micron Technology announced 3D XPoint. It's now 2018; Intel is shipping its technology in SSD packages, but we've not yet seen the widespread availability of byte-addressable non-volatile memory that resides on the memory bus. We can view non-volatile memory technology and its impact on systems through an historical lens revealing it as the convergence of several past research trends starting with the concept of single-level store, encompassing the 1980s excitement around bubble memory, building upon persistent object systems, and leveraging recent work in transactional memory. We present this historical context, recalling past ideas that seem particularly relevant and potentially applicable and highlighting aspects that are novel.

Journal ArticleDOI
13 Jun 2018
TL;DR: This work presents a new methodology for transforming high-performance lock-free linked data structures into high-performance lock-free transactional linked data structures without revamping the data structures’ original synchronization design, achieving 4,700 to 915,000 times fewer spurious aborts than the alternatives.
Abstract: Nonblocking data structures allow scalable and thread-safe access to shared data. They provide individual operations that appear to execute atomically. However, it is often desirable to execute multiple operations atomically in a transactional manner. Previous solutions, such as Software Transactional Memory (STM) and transactional boosting, manage transaction synchronization separately from the underlying data structure’s thread synchronization. Although this reduces programming effort, it leads to overhead associated with additional synchronization and the need to rollback aborted transactions. In this work, we present a new methodology for transforming high-performance lock-free linked data structures into high-performance lock-free transactional linked data structures without revamping the data structures’ original synchronization design. Our approach leverages the semantic knowledge of the data structure to eliminate the overhead of false conflicts and rollbacks. We encapsulate all operations, operands, and transaction status in a transaction descriptor, which is shared among the nodes accessed by the same transaction. We coordinate threads to help finish the remaining operations of delayed transactions based on their transaction descriptors. When a transaction fails, we recover the correct abstract state by reversely interpreting the logical status of a node. We also present an obstruction-free version of our algorithm that can be applied to dynamic execution scenarios and an example of our approach applied to a hash map. In our experimental evaluation using transactions with randomly generated operations, our lock-free transactional data structures outperform the transactional boosted ones by 70% on average. They also outperform the alternative STM-based approaches by a factor of 2 to 13 across all scenarios. More importantly, we achieve 4,700 to 915,000 times fewer spurious aborts than the alternatives.

Proceedings ArticleDOI
01 Oct 2018
TL;DR: In this paper, a software protocol combined with a persistent memory controller is proposed to ensure the atomicity of transactions on persistent memory resident data and to maintain consistency between the order in which processors perform stores and that in which the updated values become durable.
Abstract: Emerging Persistent Memory technologies (also PM, Non-Volatile DIMMs, Storage Class Memory or SCM) hold tremendous promise for accelerating popular data-management applications like in-memory databases. However, programmers now need to deal with ensuring the atomicity of transactions on Persistent Memory resident data and maintaining consistency between the order in which processors perform stores and that in which the updated values become durable. The problem is especially challenging when high-performance isolation mechanisms like Hardware Transactional Memory (HTM) are used for concurrency control. This work shows how HTM transactions can be ordered correctly and atomically into PM by the use of a novel software protocol combined with a Persistent Memory Controller, without requiring changes to processor cache hardware or HTM protocols. In contrast, previous approaches require significant changes to existing processor microarchitectures. Our approach, evaluated using both micro-benchmarks and the STAMP suite, compares well with standard (volatile) HTM transactions. It also yields significant gains in throughput and latency in comparison with persistent transactional locking.

Patent
24 May 2018
TL;DR: In this paper, a data processing system is configured to perform a hardware transactional memory (HTM) transaction and an indicator is added to the nonvolatile memory indicating the successful commit of the HTM transaction.
Abstract: The invention relates to a data processing system and a data processing method. The data processing system is configured to perform a hardware transactional memory (HTM) transaction. The data processing system comprises a byte-addressable nonvolatile memory for persistently storing data and a processor being configured to execute an atomic HTM write operation in connection with committing the HTM transaction by writing an indicator to the nonvolatile memory indicating the successful commit of the HTM transaction.

Journal ArticleDOI
TL;DR: This work describes the experience of implementing the Sapphire algorithm as the first on-the-fly, parallel, replication copying, garbage collector for the Jikes RVM Java virtual machine (JVM).
Abstract: Constructing a high-performance garbage collector is hard. Constructing a fully concurrent ‘on-the-fly’ compacting collector is much more so. We describe our experience of implementing the Sapphire algorithm as the first on-the-fly, parallel, replication copying, garbage collector for the Jikes RVM Java virtual machine (JVM). In part, we explain our innovations such as copying with hardware and software transactions, on-the-fly management of Java’s reference types, and simple, yet correct, lock-free management of volatile fields in a replicating collector. We fully evaluate, for the first time, and using realistic benchmarks, Sapphire’s performance and suitability as a low latency collector. An important contribution of this work is a detailed description of our experience of building an on-the-fly copying collector for a complete JVM with some assurance that it is correct. A key aspect of this is model checking of critical components of this complicated and highly concurrent system.

Journal ArticleDOI
TL;DR: This work offers two optimization techniques for Transactional Memory (TM): an adaptive technique that switches between RTM and STM, and a technique that uses a combination of Linear Regression and a decision tree to decide on the transaction size automatically.
Abstract: Transactional memory (TM) is a programming paradigm that facilitates parallel programming for multi‐core processors. In the last few years, some chip manufacturers provided hardware support for TM to reduce runtime overhead of Software Transactional Memory (STM). In this work, we offer two optimization techniques for TMs. The first technique focuses on Restricted Transactional Memory (RTM) in Intel's Haswell processor and shows that while in some applications, RTM improves performance over STM, in some others, it falls behind STM. We exploit this variability and propose an adaptive technique that switches between RTM and STM, statically. The second technique focuses on the overhead of TM and enhances the speed of the adaptive system. In particular, we focus on the size of transactions and improve performance by changing the transaction size. Optimizing the transaction size manually is a time‐consuming process and requires significant software engineering effort. We use a combination of Linear Regression (LR) and decision tree to decide on the transaction size, automatically. We evaluate our optimization techniques using a set of benchmarks from NAS, DiscoPoP, and STAMP benchmark suites. Our experimental results reveal that our optimization techniques are able to improve the performance of TM programs by 9% and energy‐delay by 15%, on average.

29 Apr 2018
TL;DR: This work proposes a complete framework where an STM service is associated to a set of fully partitioned scheduling algorithms in order to improve the predictability of the system as well as guaranteeing that the timing constraints are met for all the tasks.
Abstract: The current trend in the development of recent real-time embedded systems is driven by (i) a shift from single-core to multi-core platform architectures at the hardware level; (ii) a shift from sequential to parallel programming paradigms at the software level; and finally (iii) the ever increasing demand for new functionalities (e.g. additional tasks with specific timing requirements). These trends taken together increase the complexity of the system as a whole, and have a significant impact on the type of mechanisms that are adopted in order to guarantee both the functional and non-functional correctness of the system. This holds true especially in the case where these mechanisms have to maintain the correctness of data shared between different tasks executing concurrently in parallel. The access to shared resources (e.g. main memory) on single-core systems has traditionally relied on lock-based mechanisms. At any time instant, a single task is granted exclusive access to each shared resource. However, assuming the new settings, i.e. multi-core architectures executing a set of potentially parallel tasks sharing data, the big picture changes. Tasks executing in parallel on different cores and sharing the same data may have to compete before completing the execution. It has been proven that lock-based synchronisation approaches, which were sound in a single-core context, do not scale to multi-cores and, furthermore, hinder the composability of the system. On the path to solving these issues, Software Transactional Memory (STM) based approaches have been proposed as promising candidates. By using these alternative techniques, the underlying STM service would solve the conflicts between contending tasks while maintaining data consistency, and critical sections would be executed speculatively, i.e. they are executed but if the result of the computation harms the system correctness, then changes made by the computation are reverted and the results are ignored. This way, the details on how to synchronise shared data would be hidden from the programmer, thus representing a significant advantage as compared to lock-based synchronisation techniques regarding the functional correctness of the system. Regarding the non-functional correctness instead, the use of STM based approaches in real-time systems also requires the tasks' timing constraints to be met. This is due to the fact that each transaction aborting and repeating multiple times before its eventual commit incurs a timing overhead that might not be negligible and, therefore, must be taken into account to prevent deadline misses at runtime. This work considers a set of potentially parallel real-time tasks sharing data and executed on a multi-core platform. Assuming this setting, first it proposes a complete framework where an STM service is associated to a set of fully partitioned scheduling algorithms in order to improve the predictability of the system as well as guaranteeing that the timing constraints are met for all the tasks. Then, it proposes the corresponding schedulability analysis for each pair of STM and scheduling algorithms. Finally, it proposes a lightweight syntax to enrich the original Ada programming language in order to support STM for concurrent real-time applications.
Real-Time Software Transactional Memory. António Manuel de Sousa Barros. Faculdade de Engenharia da Universidade do Porto, Programa Doutoral em Engenharia Electrotécnica e de Computadores. Supervisor: Luís Miguel Rosário da Silva Pinho.
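
The speculative execution described in this abstract (run the critical section, then commit or revert and repeat) can be illustrated with a minimal Python sketch. This is not the thesis's framework or its Ada syntax. The names `VersionedCell`, `run_transaction` and `demo` are hypothetical, and the single versioned cell with a commit lock is an assumption chosen only to keep the example small. The abort count returned by `run_transaction` is the kind of quantity a timing analysis would have to bound.

```python
import threading


class VersionedCell:
    """Shared value plus a version counter that is bumped on every commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.commit_lock = threading.Lock()


def run_transaction(cell, compute):
    """Speculatively apply compute(old) -> new; return the number of aborts."""
    aborts = 0
    while True:
        seen_version = cell.version      # read the version first ...
        new_value = compute(cell.value)  # ... then compute on a snapshot
        with cell.commit_lock:
            # Commit only if nobody else committed since our snapshot.
            if cell.version == seen_version:
                cell.value = new_value
                cell.version += 1
                return aborts
        # Conflict detected: discard the speculative result and repeat.
        aborts += 1


def demo(threads=4, increments=1000):
    """Increment one shared cell from several threads; return the final value."""
    cell = VersionedCell(0)

    def worker():
        for _ in range(increments):
            run_transaction(cell, lambda v: v + 1)

    pool = [threading.Thread(target=worker) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return cell.value
```

Calling `demo()` should return `threads * increments`, since every aborted attempt is discarded and retried rather than lost.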

Journal ArticleDOI
TL;DR: In this article, the authors address concurrency issues via Software Transactional Memory (STM), which bypasses locks to tackle synchronisation through transactions, and implement feedback control loops to automate management of threads and diminish program execution time.
Abstract: A parallel program needs to manage the trade-off between the time spent in synchronisation and computation. This trade-off is significantly affected by its parallelism degree. A high parallelism degree may decrease computing time while increasing synchronisation cost. Furthermore, thread placement on processor cores may impact program performance, as the data access time can vary from one core to another due to intricacies of the underlying memory architecture. Alas, there is no universal rule to decide thread parallelism and its mapping to cores from an offline view, especially for a program with online behaviour variation. Moreover, offline tuning is less precise. We present our work on dynamic control of thread parallelism and mapping. We address concurrency issues via Software Transactional Memory (STM). STM bypasses locks to tackle synchronisation through transactions. Autonomic computing offers designers a framework of methods and techniques to build autonomic systems with well-mastered behaviours. Its key idea is to implement feedback control loops to design safe, efficient and predictable controllers, which enable monitoring and adjusting controlled systems dynamically while keeping overhead low. We implement feedback control loops to automate management of threads and diminish program execution time.

Posted Content
TL;DR: The authors propose a notion of transactional data-race freedom (DRF) that takes transactional fences into account and allows the programmer to use privatization idioms, and prove that DRF programs have strongly atomic semantics when the TM satisfies a condition generalizing opacity.
Abstract: Transactional memory (TM) facilitates the development of concurrent applications by letting the programmer designate certain code blocks as atomic. Programmers using a TM often would like to access the same data both inside and outside transactions, e.g., to improve performance or to support legacy code. In this case, programmers would ideally like the TM to guarantee strong atomicity, where transactions can be viewed as executing atomically also with respect to non-transactional accesses. Since guaranteeing strong atomicity for arbitrary programs is prohibitively expensive, researchers have suggested guaranteeing it only for certain data-race free (DRF) programs, particularly those that follow the privatization idiom: from some point on, threads agree that a given object can be accessed non-transactionally. Supporting privatization safely in a TM is nontrivial, because this often requires correctly inserting transactional fences, which wait until all active transactions complete. Unfortunately, there is currently no consensus on a single definition of transactional DRF, in particular, because no existing notion of DRF takes into account transactional fences. In this paper we propose such a notion and prove that, if a TM satisfies a certain condition generalizing opacity and a program using it is DRF assuming strong atomicity, then the program indeed has strongly atomic semantics. We show that our DRF notion allows the programmer to use privatization idioms. We also propose a method for proving our generalization of opacity and apply it to the TL2 TM.

Proceedings ArticleDOI
01 May 2018
TL;DR: AUTOPN is proposed, an on-line self-tuning system that combines model-driven learning techniques with localized search heuristics in order to pursue a twofold goal: enhance convergence speed, and increase robustness against modeling errors via a final local search phase aimed at refining the model's prediction.
Abstract: This paper addresses the problem of self-tuning the parallelism degree in Transactional Memory (TM) systems that support parallel nesting (PN-TM). This problem has been long investigated for TMs not supporting nesting, but, to the best of our knowledge, has never been studied in the context of PN-TMs. Indeed, the problem complexity is inherently exacerbated in PN-TMs, since these require to identify the optimal parallelism degree not only for top-level transactions but also for nested sub-transactions. The increase of the problem dimensionality raises new challenges (e.g., increase of the search space, and proneness to suffer from local maxima), which are unsatisfactorily addressed by self-tuning solutions conceived for flat nesting TMs. We tackle these challenges by proposing AUTOPN, an on-line self-tuning system that combines model-driven learning techniques with localized search heuristics in order to pursue a twofold goal: i) enhance convergence speed by identifying the most promising region of the search space via model-driven techniques, while ii) increasing robustness against modeling errors, via a final local search phase aimed at refining the model's prediction. We further address the problem of tuning the duration of the monitoring windows used to collect feedback on the system's performance, by introducing novel, domain-specific, mechanisms aimed to strike an optimal trade-off between latency and accuracy of the self-tuning process. We integrated AUTOPN with a state of the art PN-TM (JVSTM) and evaluated it via an extensive experimental study. The results of this study highlight that AUTOPN can achieve gains of up to 45× in terms of increased accuracy and 4× faster convergence speed, when compared with several on-line optimization techniques (gradient descent, simulated annealing and genetic algorithm), some of which were already successfully used in the context of flat nesting TMs.
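
The two-phase idea in this abstract (a model proposes a promising region, then a local search refines it) can be sketched in Python as a hill climb over pairs of thread counts. This is only an illustration under stated assumptions, not AUTOPN's actual algorithm: `local_search` and `toy_throughput` are hypothetical names, the throughput function is a made-up stand-in for real measurements, and the starting point `(4, 4)` stands in for a model's prediction.

```python
def local_search(measure, start, bounds):
    """Hill-climb over (top_level, nested) thread counts from a starting guess.

    measure(top_level, nested) returns a score to maximise (e.g. throughput).
    bounds is (max_top_level, max_nested); both dimensions start at 1.
    """
    best = start
    best_score = measure(*best)
    improved = True
    while improved:
        improved = False
        t, n = best
        # Try the four grid neighbours of the current point.
        for cand in ((t + 1, n), (t - 1, n), (t, n + 1), (t, n - 1)):
            if not (1 <= cand[0] <= bounds[0] and 1 <= cand[1] <= bounds[1]):
                continue
            score = measure(*cand)
            if score > best_score:
                best, best_score, improved = cand, score, True
    return best, best_score


def toy_throughput(top_level, nested):
    """Hypothetical performance surface with a single peak at (6, 3)."""
    return -((top_level - 6) ** 2) - (nested - 3) ** 2


# Assume a (hypothetical) model predicted (4, 4); local search refines it.
result = local_search(toy_throughput, start=(4, 4), bounds=(8, 8))
```

On a surface with several local maxima, the quality of the starting point matters much more, which is why the abstract pairs the local phase with a model-driven one.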

Proceedings ArticleDOI
21 May 2018
TL;DR: This work quantifies the performance and energy efficiency of HTM for scientific workloads based on the widely-used CLOMP-TM benchmark and discusses a set of generic software optimizations that can be effectively used to improve the performance of transactional science workloads on large-scale NUMA systems.
Abstract: Hardware transactional memory (HTM) is supported by widely-used commodity processors. While the effectiveness of HTM has been evaluated based on small-scale multi-core systems, it still remains unexplored to quantify the performance and energy-efficiency of HTM for scientific workloads on large-scale NUMA systems, which have been increasingly adopted to high-performance computing. To bridge this gap, this work investigates the performance and energy-efficiency impact of HTM on scientific applications on large-scale NUMA systems. We first quantify the performance and energy efficiency of HTM for scientific workloads based on the widely-used CLOMP-TM benchmark. We then discuss a set of generic software optimizations that can be effectively used to improve the performance and energy efficiency of transactional scientific workloads on large-scale NUMA systems. Finally, we present case studies in which we apply a set of the optimizations to representative transactional scientific applications and significantly optimize their performance and energy efficiency on large-scale NUMA systems.

Proceedings ArticleDOI
19 Mar 2018
TL;DR: This paper makes speculative parallelization less laborious and more feasible through low-overhead speculation validation, presenting the first complete design, implementation, and evaluation of hardware MTXs.
Abstract: Speculation with transactional memory systems helps programmers and compilers produce profitable thread-level parallel programs. Prior work shows that supporting transactions that can span multiple threads, rather than requiring transactions be contained within a single thread, enables new types of speculative parallelization techniques for both programmers and parallelizing compilers. Unfortunately, software support for multi-threaded transactions (MTXs) comes with significant additional inter-thread communication overhead for speculation validation. This overhead can make otherwise good parallelization unprofitable for programs with sizeable read and write sets. Some programs using these prior software MTXs overcame this problem through significant efforts by expert programmers to minimize these sets and optimize communication, capabilities which compiler technology has been unable to equivalently achieve. Instead, this paper makes speculative parallelization less laborious and more feasible through low-overhead speculation validation, presenting the first complete design, implementation, and evaluation of hardware MTXs. Even with maximal speculation validation of every load and store inside transactions of tens to hundreds of millions of instructions, profitable parallelization of complex programs can be achieved. Across 8 benchmarks, this system achieves a geomean speedup of 99% over sequential execution on a multicore machine with 4 cores.