
Showing papers on "Transactional memory published in 2022"


Proceedings ArticleDOI
14 Mar 2022
TL;DR: This evaluation demonstrates that Block-STM is adaptive to workloads with different conflict rates and utilizes the inherent parallelism therein.
Abstract: Block-STM is a parallel execution engine for smart contracts, built around the principles of Software Transactional Memory. Transactions are grouped in blocks, and every execution of the block must yield the same deterministic outcome. Block-STM further enforces that the outcome is consistent with executing transactions according to a preset order, leveraging this order to dynamically detect dependencies and avoid conflicts during speculative transaction execution. At the core of Block-STM is a novel, low-overhead collaborative scheduler of execution and validation tasks. Block-STM is implemented on the main branch of the Diem Blockchain code-base and runs in production at Aptos. Our evaluation demonstrates that Block-STM is adaptive to workloads with different conflict rates and utilizes the inherent parallelism therein. Block-STM achieves up to 110k tps in the Diem benchmarks and up to 170k tps in the Aptos Benchmarks, which is a 20x and 17x improvement over the sequential baseline with 32 threads, respectively. The throughput on a contended workload is up to 50k tps and 80k tps in Diem and Aptos benchmarks, respectively.
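The execute-then-validate loop at the heart of this design can be illustrated with a single-threaded simulation (all names here are illustrative, not the Diem/Aptos implementation; the real engine runs execution and validation tasks in parallel under a collaborative scheduler):

```python
# Single-threaded sketch of Block-STM-style optimistic execution: each
# transaction runs against a multi-version store, records what it read,
# and is re-executed during in-order validation if a read became stale.

class MVStore:
    def __init__(self, base):
        self.base = dict(base)
        self.writes = {}                     # key -> {txn_index: value}

    def read(self, key, txn_idx):
        # Return the value written by the highest-indexed earlier txn,
        # plus that writer's index (-1 means the pre-block value).
        prior = [i for i in self.writes.get(key, {}) if i < txn_idx]
        if prior:
            src = max(prior)
            return self.writes[key][src], src
        return self.base[key], -1

    def write(self, key, txn_idx, value):
        self.writes.setdefault(key, {})[txn_idx] = value

    def clear(self, txn_idx):
        for versions in self.writes.values():
            versions.pop(txn_idx, None)

def run_block(base, txns):
    """txns[i] is a function taking a read callback, returning its writes."""
    store, read_sets = MVStore(base), [None] * len(txns)

    def execute(i):
        store.clear(i)                       # drop writes of any earlier run
        reads = []
        def read(key):
            value, src = store.read(key, i)
            reads.append((key, src))
            return value
        for key, value in txns[i](read).items():
            store.write(key, i, value)
        read_sets[i] = reads

    for i in range(len(txns)):               # speculative pass
        execute(i)
    for i in range(len(txns)):               # validate in the preset order
        if any(store.read(k, i)[1] != src for k, src in read_sets[i]):
            execute(i)                       # stale read: re-execute
    final = dict(base)
    for key, versions in store.writes.items():
        if versions:
            final[key] = versions[max(versions)]
    return final
```

Because validation proceeds in the preset order and stale transactions are re-executed against the now-final writes of earlier transactions, the block's result matches its sequential execution.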

9 citations


Proceedings ArticleDOI
01 May 2022
TL;DR: Polynesia is proposed, a hardware-software co-designed system for in-memory HTAP databases that avoids the large throughput losses of traditional HTAP systems and reduces energy consumption by 48% over the prior lowest-energy HTAP system.
Abstract: A growth in data volume, combined with increasing demand for real-time analysis (using the most recent data), has resulted in the emergence of database systems that concurrently support transactions and data analytics. These hybrid transactional and analytical processing (HTAP) database systems can support real-time data analysis without the high costs of synchronizing across separate single-purpose databases. Unfortunately, for many applications that perform a high rate of data updates, state-of-the-art HTAP systems incur significant losses in transactional (up to 74.6%) and/or analytical (up to 49.8%) throughput compared to performing only transactional or only analytical queries in isolation, due to (1) data movement between the CPU and memory, (2) data update propagation from transactional to analytical workloads, and (3) the cost to maintain a consistent view of data across the system. We propose Polynesia, a hardware-software co-designed system for in-memory HTAP databases that avoids the large throughput losses of traditional HTAP systems. Polynesia (1) divides the HTAP system into transactional and analytical processing islands, (2) implements new custom hardware that unlocks software optimizations to reduce the costs of update propagation and consistency, and (3) exploits processing-in-memory for the analytical islands to alleviate data movement overheads. Our evaluation shows that Polynesia outperforms three state-of-the-art HTAP systems, with average transactional/analytical throughput improvements of 1.7×/3.7×, and reduces energy consumption by 48% over the prior lowest-energy HTAP system.

8 citations


Journal ArticleDOI
TL;DR: In this paper, a formal verification framework called C4 is presented and used to prove the correctness of a transactional set object. The proof is modular, reasoning separately about the transactional and non-transactional parts of the implementation.
Abstract: Transactional objects combine the performance of classical concurrent objects with the high-level programmability of transactional memory. However, verifying the correctness of transactional objects is tricky, requiring reasoning simultaneously about classical concurrent objects, which guarantee the atomicity of individual methods (the property known as linearizability), and about software-transactional-memory libraries, which guarantee the atomicity of user-defined sequences of method calls, or serializability. We present a formal-verification framework called C4, built up from the familiar notion of linearizability and its compositional properties, that allows proof of both kinds of libraries, along with composition of theorems from both styles to prove correctness of applications or further libraries. We apply the framework in a significant case study, verifying a transactional set object built out of both classical and transactional components following the technique of transactional predication; the proof is modular, reasoning separately about the transactional and nontransactional parts of the implementation. Central to our approach is the use of syntactic transformers on interaction trees, i.e., transactional libraries that transform client code to enforce particular synchronization disciplines. Our framework and case studies are mechanized in Coq.

6 citations


Proceedings ArticleDOI
28 Feb 2022
TL;DR: PMRace is proposed, the first PM-specific concurrency bug detection tool; it identifies and defines two new types of concurrent crash-consistency bugs: PM Inter-thread Inconsistency and PM Synchronization Inconsistency.
Abstract: Due to the salient DRAM-comparable performance, TB-scale capacity, and non-volatility, persistent memory (PM) provides new opportunities for large-scale in-memory computing with instant crash recovery. However, programming PM systems is error-prone due to the existence of crash-consistency bugs, which are challenging to diagnose especially with concurrent programming widely adopted in PM applications to exploit hardware parallelism. Existing bug detection tools for DRAM-based concurrency issues cannot detect PM crash-consistency bugs because they are oblivious to PM operations and PM consistency. On the other hand, existing PM-specific debugging tools only focus on sequential PM programs and cannot effectively detect crash-consistency issues hidden in concurrent executions. In order to effectively detect crash-consistency bugs that only manifest in concurrent executions, we propose PMRace, the first PM-specific concurrency bug detection tool. We identify and define two new types of concurrent crash-consistency bugs: PM Inter-thread Inconsistency and PM Synchronization Inconsistency. In particular, PMRace adopts PM-aware and coverage-guided fuzz testing to explore PM program executions. For PM Inter-thread Inconsistency, which denotes the data inconsistency hidden in thread interleavings, PMRace performs PM-aware interleaving exploration and thread scheduling to drive the execution towards executions that reveal such inconsistencies. For PM Synchronization Inconsistency between persisted synchronization variables and program data, PMRace identifies the inconsistency during interleaving exploration. The post-failure validation reduces the false positives that come from custom crash recovery mechanisms. PMRace has found 14 bugs (10 new bugs) in real-world concurrent PM systems including PM-version memcached.
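The PM Inter-thread Inconsistency class of bugs can be illustrated with a toy model of volatile caches versus persistent memory; the dictionaries and helper names below are illustrative, not PMRace's machinery:

```python
# Toy model of a PM inter-thread inconsistency: stores land in a volatile
# "cache" until explicitly flushed to "pm". If thread B consumes thread A's
# unflushed store and persists data derived from it, a crash before A's
# flush leaves persistent memory in an inconsistent state.

cache, pm = {}, {}

def store(key, value):
    cache[key] = value        # visible to other threads, not yet durable

def flush(key):
    pm[key] = cache[key]      # persisted: survives crashes

def crash():
    cache.clear()             # all volatile state is lost

# Thread A stores a node but crashes before flushing it.
store("node", 42)
# Thread B reads A's unflushed store and persists an index entry for it.
store("index", cache["node"])
flush("index")
crash()
# After recovery: the index points at a node that never reached PM.
inconsistent = "index" in pm and "node" not in pm
```

In the real setting the window between the store and the flush is a thread interleaving that PMRace's coverage-guided scheduler tries to drive the execution into.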

5 citations


Journal ArticleDOI
TL;DR: DeTraS (Delayed Transactional Stores), as discussed by the authors, is an HTM-aware store buffer design aimed at mitigating livelock in commercial hardware transactional memory systems.
Abstract: Commercial Hardware Transactional Memory (HTM) systems are best-effort designs that leverage the coherence substrate to detect conflicts eagerly. Resolving conflicts in favor of the requesting core is the simplest option for ensuring deadlock freedom, yet it is prone to livelocks. In this work, we propose and evaluate DeTraS (Delayed Transactional Stores), an HTM-aware store buffer design aimed at mitigating such livelocks. DeTraS takes advantage of the fact that modern commercial processors implement a large store buffer, and uses it to prevent transactional stores predicted to conflict from performing early in the transaction. By leveraging existing processor structures, we propose a simple design that improves the ability of requester-wins HTM systems to achieve forward progress in spite of high contention while side-stepping the performance penalty of falling back to mutual exclusion. With just over 50 extra bytes, DeTraS captures the advantages of lazy conflict management without the complexity brought into the coherence fabric by commit arbitration schemes nor the relaxation of the single-writer invariant of prior works. Through detailed simulations of a 16-core tiled CMP using gem5, we demonstrate that DeTraS brings reductions in average execution time of 25 percent when compared to an Intel RTM-like design.

2 citations


Journal ArticleDOI
TL;DR: TMS2-ra, as presented in this paper, is a relaxed operational transactional memory (TM) specification that provides a formal semantics for TM libraries and their clients; it can be implemented by a C11 library, TML-ra, that uses relaxed and release-acquire atomics.
Abstract: Transactional memory (TM) is an intensively studied synchronisation paradigm with many proposed implementations in software and hardware, and combinations thereof. However, TM under relaxed memory, e.g., C11 (the 2011 C/C++ standard) is still poorly understood, lacking rigorous foundations that support verifiable implementations. This paper addresses this gap by developing TMS2-ra, a relaxed operational TM specification. We integrate TMS2-ra with RC11 (the repaired C11 memory model that disallows load-buffering) to provide a formal semantics for TM libraries and their clients. We develop a logic, TARO, for verifying client programs that use TMS2-ra for synchronisation. We also show how TMS2-ra can be implemented by a C11 library, TML-ra, that uses relaxed and release-acquire atomics, yet guarantees the synchronisation properties required by TMS2-ra. We benchmark TML-ra and show that it outperforms its sequentially consistent counterpart in the STAMP benchmarks. Finally, we use a simulation-based verification technique to prove correctness of TML-ra. Our entire development is supported by the Isabelle/HOL proof assistant.
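TML-ra builds on the classic TML algorithm: a single global counter that is odd exactly while a writer is active, with readers validating against it. Below is a sketch of the sequentially consistent variant, not the release-acquire C11 code verified in the paper; a lock stands in for the hardware compare-and-swap:

```python
# Sketch of the TML (Transactional Mutex Lock) algorithm underlying TML-ra.
# One global counter: even = quiescent, odd = a writer is active.
import threading

class Abort(Exception):
    pass

class TML:
    def __init__(self):
        self.glb = 0
        self._cas_lock = threading.Lock()     # models a hardware CAS

    def cas(self, expected, new):
        with self._cas_lock:
            if self.glb == expected:
                self.glb = new
                return True
            return False

    def atomic(self, fn, *refs):
        while True:
            loc = self.glb
            if loc % 2:                       # writer active: retry
                continue
            tx = _Tx(self, loc)
            try:
                result = fn(tx, *refs)
                tx.commit()
                return result
            except Abort:
                continue                      # conflict: restart transaction

class _Tx:
    def __init__(self, tml, loc):
        self.tml, self.loc, self.writer = tml, loc, False

    def read(self, ref):
        value = ref[0]
        if self.tml.glb != self.loc:          # someone committed: reads stale
            raise Abort()
        return value

    def write(self, ref, value):
        if not self.writer:                   # first write: become the writer
            if not self.tml.cas(self.loc, self.loc + 1):
                raise Abort()                 # lost the race
            self.loc += 1
            self.writer = True
        ref[0] = value                        # write in place (single writer)

    def commit(self):
        if self.writer:
            self.tml.glb = self.loc + 1       # make the counter even again
```

Read-only transactions commit without any write to the shared counter, which is what makes the relaxed/release-acquire version attractive on weak memory.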

1 citation


Book ChapterDOI
01 Jan 2022
TL;DR: In this article, the authors propose two new OpenMP clauses for parallel for that enable speculative privatization of scalars or arrays in may-DOACROSS loops: spec_private and spec_reduction.
Abstract: Loop Thread-Level Speculation on Hardware Transactional Memories is a promising strategy to improve application performance in the multicore era. However, the reuse of shared scalar or array variables introduces constraints (false dependences or false sharing) that obstruct efficient speculative parallelization. Speculative privatization relieves these constraints by creating speculatively private data copies for each transaction, thus enabling scalable parallelization. To support it, this paper proposes two new OpenMP clauses for parallel for that enable speculative privatization of scalars or arrays in may-DOACROSS loops: spec_private and spec_reduction. We also present an evaluation revealing that, for certain loops, speed-ups of up to 3.24× can be obtained by applying speculative privatization in TLS.
Keywords: Privatization, Reduction, Thread-level speculation
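The effect of the two clauses can be shown with a serial sketch: spec_private gives each speculative task its own copy of a temporary, and spec_reduction gives it a private partial result that is merged when the task commits. This is a hypothetical Python analogue; the actual proposal targets OpenMP parallel for loops in C:

```python
# Serial sketch of speculative privatization for a may-DOACROSS loop.
# Each chunk models one speculative task; privatizing the temporary and
# the reduction variable removes the false dependences that would
# otherwise make the transactions conflict.

def chunked_loop(data, chunk_size):
    total = 0                              # shared reduction variable
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        tmp = 0                            # spec_private: per-task copy
        priv_total = 0                     # spec_reduction: per-task partial
        for x in chunk:
            tmp = x * x                    # reuse of tmp no longer conflicts
            priv_total += tmp
        total += priv_total                # merged when the task commits
    return total
```

In the speculative setting the chunks would run concurrently inside transactions, with the merge step serialized in loop order.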

1 citation


Journal ArticleDOI
TL;DR: This paper presents a novel multi-version concurrency control approach that enables a memory-optimized disk-based system to achieve excellent performance on transactional workloads as well, achieving transaction throughput up to an order of magnitude higher than competing disk-based systems.
Abstract: Pure in-memory database systems offer outstanding performance but degrade heavily if the working set does not fit into DRAM, which is problematic in view of declining main memory growth rates. In contrast, recently proposed memory-optimized disk-based systems such as Umbra leverage large in-memory buffers for query processing but rely on fast solid-state disks for persistent storage. They offer near in-memory performance while the working set is cached, and scale gracefully to arbitrarily large data sets far beyond main memory capacity. Past research has shown that this architecture is indeed feasible for read-heavy analytical workloads. We continue this line of work in the following paper, and present a novel multi-version concurrency control approach that enables a memory-optimized disk-based system to achieve excellent performance on transactional workloads as well. Our approach exploits that the vast majority of versioning information can be maintained entirely in-memory without ever being persisted to stable storage, which minimizes the overhead of concurrency control. Large write transactions for which this is not possible are extremely rare, and handled transparently by a lightweight fallback mechanism. Our experiments show that the proposed approach achieves transaction throughput up to an order of magnitude higher than competing disk-based systems, confirming its viability in a real-world setting.
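The underlying idea, consulting purely in-memory version chains by commit timestamp, can be sketched with textbook multi-version concurrency control; this is a generic MVCC sketch, not Umbra's actual design:

```python
# Minimal MVCC sketch: versioning information lives only in RAM; a reader
# with snapshot timestamp s sees the newest version committed at or
# before s. Only the newest committed value would ever need persisting.

class MVCC:
    def __init__(self):
        self.ts = 0
        self.data = {}   # key -> list of (commit_ts, value), append-only

    def write(self, txn_writes):
        """Commit a write set atomically; returns the commit timestamp."""
        self.ts += 1
        for key, value in txn_writes.items():
            self.data.setdefault(key, []).append((self.ts, value))
        return self.ts

    def read(self, key, snapshot_ts):
        # Scan the version chain from newest to oldest.
        for ts, value in reversed(self.data[key]):
            if ts <= snapshot_ts:
                return value
        raise KeyError(key)
```

Keeping the `(commit_ts, value)` chains volatile is what minimizes concurrency-control overhead; the paper's fallback mechanism handles the rare large write transactions for which this is not possible.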

1 citation


Posted ContentDOI
02 Jun 2022
TL;DR: This paper surveys work on thread and data mapping that uses solely information gathered from the STM runtime to guide mapping decisions, and discusses future research directions within this research area.
Abstract: In current microarchitectures, due to complex memory hierarchies and different latencies on memory accesses, thread and data mapping are important issues for improving application performance. Software transactional memory (STM) is an abstraction used for thread synchronization, replacing the use of locks in parallel programming. Regarding thread and data mapping, STM presents new challenges and mapping opportunities, since (1) STM can use different conflict detection and resolution strategies, making the behavior of the application less predictable; and (2) the STM runtime has precise information about shared data and the intensity with which each thread accesses them. These unique characteristics provide many opportunities for low-overhead but precise statistics to guide mapping strategies for STM applications. The main objective of this paper is to survey the existing work on thread and data mapping that uses solely information gathered from the STM runtime to guide thread and data mapping decisions. We also discuss future research directions within this research area.

Journal ArticleDOI
TL;DR: In this article, a speculative barrier (SB) is used to elide barriers speculatively, keeping the updates private to the thread, and letting the HTM system detect potential conflicts.
Abstract: Transactional Memory (TM) is a synchronization model for parallel programming which provides optimistic concurrency control. Transactions can run in parallel and are only serialized in case of conflict. In this article we use hardware TM (HTM) to implement an optimistic speculative barrier (SB) to replace the lock-based solution. SBs leverage HTM support to elide barriers speculatively. When a thread reaches an SB, a new SB transaction is started, keeping the updates private to the thread, and letting the HTM system detect potential conflicts. Once the last thread reaches the corresponding SB, the speculative threads can commit their changes. The main contributions of this work are: an API for SBs implemented with HTM extensions; a procedure to check the speculation state in between barriers to enable SBs with non-transactional codes; an HTM SB-aware conflict resolution enhancement where SB transactions stall on a conflict with a standard transaction; and a set of SB use guidelines derived from our experience using SBs in a variety of applications. We evaluated our proposals in two different architectures with a full-system simulator and an IBM Power8 server. Results show an overall performance improvement of SBs over traditional barriers.

Proceedings ArticleDOI
21 Jan 2022
TL;DR: This paper proposes SAMShm, a new version of SAM that shows good performance under any number of cores by reducing the overhead of socket communication and garbage collection using a message channel based on shared memory.
Abstract: SAM is a parallel programming model in Haskell, suitable for manycore computing environments. It has been developed in two versions: SAMSoc, adopting socket communication, and SAMSTM, adopting software transactional memory (STM). However, neither version of SAM always guarantees the best performance, due to the overhead of synchronization. Therefore, a specific version of SAM has to be selected to maximize performance depending on the number of cores available in the running environment. In this paper, we propose SAMShm, a new version of SAM that shows good performance under any number of cores. SAMShm reduces the overhead of socket communication and garbage collection using a message channel based on shared memory. According to a performance test on a 72-core machine, the scalability of SAMShm is improved by 52 percentage points over SAMSoc and 295 percentage points over SAMSTM.

Journal ArticleDOI
TL;DR: This work proposes a hybrid transactional memory scheme based on both abort prediction and an adaptive retry policy, called HyTM-AP, which can predict not only conflicts between concurrently running transactions, but also the capacity and other aborts of transactions by collecting the information of transactions previously executed.
Abstract: Recently, work on integrating HTM with STM, called hybrid transactional memory (HyTM), has been intensively studied. However, the existing works consider only the prediction of a conflict between two transactions and provide a static HTM configuration for all workloads. To solve these problems, we propose a hybrid transactional memory scheme based on both abort prediction and an adaptive retry policy, called HyTM-AP. First, HyTM-AP can predict not only conflicts between concurrently running transactions, but also capacity and other aborts, by collecting information about previously executed transactions. Second, HyTM-AP provides an adaptive retry policy based on machine learning algorithms, according to the characteristics of a given workload. Finally, our experimental performance analysis using the STAMP benchmark shows that HyTM-AP achieves 12-13% better performance than existing HyTM schemes.

Proceedings ArticleDOI
15 Nov 2022
TL;DR: In this paper, the authors describe typical challenges in computer architecture research, demonstrated on a case study of hardware transactional memory, and show how the proposed concepts and solutions are implemented in a software simulator.
Abstract: The paper describes typical challenges in computer architecture research, demonstrated on a case study of hardware transactional memory. It shows how the proposed concepts and solutions are implemented in a software simulator. Then, various experiments are carefully prepared in order to evaluate the performance of the proposed transactional memory implementation. The transactional memory in our experiments was paired with an asymmetric multicore processor with support for transaction migration. We present design decisions on how to implement the transactional memory, the transaction migration, and the different cache memory subsystem organizations in the simulator. We also varied the cache memory subsystem organization and parameters in the experiments. An important issue was also how to organize the data collected from the experiments, and how to analyze and visually present them. Finally, the paper demonstrates the use of a benchmark suite for the transactional memory. Problems encountered during the research are pointed out and discussed, and solutions are provided. The paper concludes with brief lessons learned in this research effort.

Proceedings ArticleDOI
01 Jul 2022
TL;DR: In this article, a performance comparison is presented between STL and a previously proposed technique that implements Thread-Level Speculation in the for worksharing construct (FOR-TLS), over a set of loops from the cbench and SPEC2006 benchmarks.
Abstract: Speculative Taskloop (STL) is a loop parallelization technique that takes the best of Task-based Parallelism and Thread-Level Speculation to speed up loops with may loop-carried dependencies that were previously difficult for compilers to parallelize. Previous studies show the efficiency of STL when implemented using Hardware Transactional Memory and the advantages it offers compared to a typical DOACROSS technique such as OpenMP ordered. This paper presents a performance comparison between STL and a previously proposed technique that implements Thread-Level Speculation (TLS) in the for worksharing construct (FOR-TLS) over a set of loops from cbench and SPEC2006 benchmarks. The results show interesting insights on how each technique can be more appropriate depending on the characteristics of the evaluated loop. Experimental results reveal that by implementing both techniques on top of HTM, speed-ups of up to 2.41× can be obtained for STL and up to 2× for FOR-TLS.


DissertationDOI
10 Jun 2022
TL;DR: In this paper, the authors study the challenges of supporting transactional workloads on the GPU, propose methods for performance improvement, and discuss the implications of incorporating emerging NVRAM technologies into GPUs for building transaction processing systems.
Abstract: The continued evolution of GPUs has enabled the use of irregular algorithms that involve fine-grained data sharing between threads, as well as transaction processing applications such as databases. Transactional Memory (TM), a programming construct derived from databases, simplifies the programming of parallel workloads and combines the advantages of traditional approaches such as fine-grained and coarse-grained locking. While hardware support for TM has started to enter mainstream commodity products, it is much farther from becoming reality on the GPU and is still being researched. In this dissertation, we study the challenges of supporting transactional workloads on the GPU and propose methods for performance improvement. On the low level, this dissertation discusses the design of various facets of software and hardware TM on the GPU. Chapter 1 discusses the relationship between GPUs, transactional memory, and transactional memory systems. Chapters 2 and 3 discuss two hardware improvements upon an existing proposal for hardware TM on the GPU, which reduce contention and improve performance for various workloads. On the high level, this dissertation discusses the implications of incorporating emerging NVRAM technologies into GPUs for building transaction processing systems. NVRAM delivers higher capacity and fast access speed, filling the gap between main memory and external storage; in addition, it is non-volatile. As such, incorporating NVRAM affects several facets of the system, with issues that need to be addressed to achieve optimal performance. We discuss the software-based method we propose for efficient transaction processing involving NVRAM in Chapter 4. In Chapter 5, we conclude our work and envision future directions in which this work may continue.

Journal ArticleDOI
TL;DR: In this article, the authors propose a type system to statically estimate the maximum memory required by well-typed programs of a language with STM primitives.
Abstract: Software transactional memory (STM) programs usually use more memory resources than traditional programs. Therefore, estimating an upper bound on the memory resources used by an STM program is crucial for optimizing the program and reducing the risk of out-of-memory runtime exceptions. However, due to the complex nesting of transactions and threads, the estimation problem is challenging. In our previous work, we developed several type systems to address the problem for core imperative languages with STM primitives. This work advances those previous works by adding object-oriented constructs to the language while keeping the STM primitives, to make the language closer to practical languages. We then built a type system to statically estimate the maximum memory required by well-typed programs of the language.

DissertationDOI
10 Jun 2022
TL;DR: In this article, the authors introduce the notion of approximate consistency in transactional memory: K-opacity, a relaxed consistency property where transactions' read operations may return one of the K most recent written values.
Abstract: Shared memory multi-core systems benefit from transactional memory implementations due to the inherent avoidance of deadlocks and progress guarantees. In this research, we examine how system performance is affected by transaction fairness in scheduling and by the precision of consistency. We first explore the fairness aspect using a Lazy Snapshot (multi-version) Algorithm. The fairness of transaction scheduling aims to balance the load between read-only and update transactions. We implement a fairness mechanism based on machine learning techniques that improves fairness decisions according to the transaction execution history. Experimental analysis shows that the throughput of the Lazy Snapshot Algorithm is improved with machine learning support. We also explore the performance impact of consistency relaxation. In transactional memory, correctness is typically proven with opacity, a precise consistency property that requires a legal serialization of an execution such that transactions do not overlap (atomicity) and read instructions always return the most recent value (legality). In real systems there are situations where system delays do not allow precise consistency, such as in large scale applications, due to network or other time delays. Thus, we introduce here the notion of approximate consistency in transactional memory. We define K-opacity as a relaxed consistency property where transactions' read operations may return one of the K most recent written values. In multi-version transactional memory, this allows saving a new object version once every K object updates, which has two benefits: (i) it reduces space requirements by a factor of K, and (ii) it reduces the number of aborts, since there is a smaller chance for conflicts. In fact, we apply the concept of K-opacity to regular read and write, count, and queue objects, which are common objects used in typical concurrent programs.
We provide formal correctness proofs and we also demonstrate the performance benefits of our approach with experimental analysis. We compare the performance of precise consistent execution (1-opaque) with different consistency values of K using micro benchmarks. The results show that increased relaxation of opacity gives higher throughput and decreases the abort rate significantly.
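The space-saving mechanism behind K-opacity can be sketched with a toy register that snapshots a version only on every K-th write, so a snapshot read returns one of the K most recent written values (illustrative only, not the paper's algorithm):

```python
# Toy K-opaque register: only every K-th update is saved as a version,
# cutting version-storage space by a factor of K, at the cost of snapshot
# reads observing a value that is at most K-1 writes stale.

class KRegister:
    def __init__(self, k, value):
        self.k, self.count = k, 0
        self.value = value             # always the latest write
        self.versions = [value]        # one saved version per K writes

    def write(self, value):
        self.value = value
        self.count += 1
        if self.count % self.k == 0:   # snapshot a version every K updates
            self.versions.append(value)

    def read_snapshot(self):
        # Returns the latest saved version: one of the K most recent
        # written values, as K-opacity permits.
        return self.versions[-1]
```

Fewer saved versions also means fewer conflicting validations, which is the paper's second benefit (a lower abort rate).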

DissertationDOI
08 Mar 2022
TL;DR: In this article, the authors present RCU-HTM, a synchronization technique combining Read-Copy-Update and Hardware Transactional Memory that supports the implementation of a concurrent version of any type of search tree and achieves very high performance across a wide range of execution scenarios.
Abstract: Search trees are among the most classic and widely used data structures. They are employed in applications that need to maintain a large, sorted volume of data with support for fast search, insertion, deletion, and additional operations such as range queries. Owing to their importance, a large body of research has proposed many different types of trees with different characteristics, such as the maximum length a path within the tree can have. Each tree type offers different performance guarantees for each operation, and a tree is chosen according to the needs of the application into which it will be integrated. With the prevalence of multicore processors, where multiple threads execute concurrently and potentially access shared data, concurrent data structures have become an important part of such applications. In concurrent data structures, the concurrent accesses by different threads must be coordinated in a way that preserves the integrity of the structure and ensures the correct execution of all operations. This coordination is achieved using a synchronization mechanism such as locks, the atomic memory-access instructions provided by modern processors, the Read-Copy-Update (RCU) technique, or Transactional Memory. Concurrent search trees are among the most widely used data structures for storing and retrieving data in modern multithreaded applications. Despite the very large volume of related work, implementing high-performance concurrent search trees remains a significant challenge.
This is mainly because both the classical synchronization methods (i.e., locks and atomic operations) and the more recent ones (i.e., Read-Copy-Update and Transactional Memory) are not sufficient on their own to offer solutions that are general and easy to implement while at the same time delivering high performance across different execution scenarios and levels of contention on the structure. Until recently, Transactional Memory was used mainly through libraries implementing it in software. In recent years, however, two of the largest processor manufacturers, Intel and IBM, have added support for Transactional Memory at the hardware level, removing the large overheads introduced by software-level implementations. In this work we examine the ways in which Transactional Memory can be used to implement high-performance concurrent search trees. More specifically, we present RCU-HTM, a synchronization technique that combines Read-Copy-Update (RCU) and Hardware Transactional Memory (HTM) and (a) supports implementing a concurrent version of any type of search tree, and (b) achieves very high performance across a wide range of execution scenarios. In RCU-HTM, threads that modify the tree structure in any way work on copies of the part of the tree they affect.
Once their local copy is ready, they use HTM to verify that the part of the tree to be replaced has not been modified in the meantime by another thread and, if so, to replace the old copy with their local one, which includes the appropriate modifications. To demonstrate the capabilities of our technique, we implement and evaluate a substantial number of search trees using RCU-HTM and compare their performance with a range of competing concurrent trees. More specifically, we apply the RCU-HTM technique to 12 different types of binary trees, B+ trees, and (a-b)-trees and compare with numerous other implementations that use four different synchronization mechanisms: locks, atomic operations, RCU, and HTM. We evaluate the search trees under many different execution scenarios, varying the size of the keys stored in the tree, the number of keys stored, the mix of operations executed, and the number of threads concurrently executing operations. All combinations of these parameters yield 630 different execution scenarios for each search tree. We also evaluate the trees using two benchmarks widely used for evaluating database systems, TPC-C and YCSB. Our evaluation shows that in the majority of the experiments the trees using RCU-HTM outperform their competitors, and even in the few cases where they are not the best, their performance is very close to that of the best implementation. This, combined with the ease of programming that the RCU-HTM technique offers, makes it the first synchronization technique that can be applied relatively easily to any type of search tree without substantially affecting performance.
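The copy-validate-replace pattern at the core of RCU-HTM can be sketched with a lock-backed compare-and-swap standing in for the HTM transaction; all names below are illustrative:

```python
# Sketch of the RCU-HTM update pattern: copy the affected part of the
# structure, modify the copy privately, then atomically verify that the
# original is unchanged and swap the new copy in. A lock-backed CAS
# models the HTM transaction here.
import copy
import threading

class Ref:
    _lock = threading.Lock()

    def __init__(self, node):
        self.node = node

    def cas(self, expected, new):
        # Models the HTM transaction: validate and replace atomically.
        with Ref._lock:
            if self.node is expected:
                self.node = new
                return True
            return False

def rcu_htm_update(ref, mutate):
    while True:
        old = ref.node
        new = copy.deepcopy(old)       # private copy of the affected part
        mutate(new)                    # apply the modification to the copy
        if ref.cas(old, new):          # unchanged meanwhile? then publish
            return new                 # readers see either old or new copy
```

Readers traverse the structure without any synchronization, in RCU style: they observe either the old or the new copy, both of which are internally consistent.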

Journal ArticleDOI
TL;DR: This article shows that STM's performance overhead, due to an excessive number of read and write barriers added by the compiler, also impacts the performance of HyTM systems, and it reveals the importance of a previously proposed annotation mechanism for reducing the performance gap between HTM and STM in phased runtime systems.
Abstract: With chip manufacturers such as Intel, IBM, and ARM offering native support for transactional memory in their instruction set architectures, memory transactions are on the verge of being considered a genuine application tool rather than just an interesting research topic. Despite this recent increase in popularity on the hardware side of transactional memory (HTM), software support for transactional memory (STM) is still scarce and the only compiler with transactional support currently available, the GNU Compiler Collection (GCC), does not generate code that achieves desirable performance. For hybrid solutions of TM (HyTM), which are frameworks that leverage the best aspects of HTM and STM, the subpar performance of the software side, caused by inefficient compiler-generated code, might forbid HyTM to offer optimal results. This article extends previous work focused exclusively on STM implementations by presenting a detailed analysis of transactional code generated by GCC in the context of HybridTM implementations. In particular, it builds on previous research of transactional memory support in the Clang/LLVM compiler framework, which is decoupled from any TM runtime, and presents the following novel contributions: (a) it shows that STM's performance overhead, due to an excessive amount of read and write barriers added by the compiler, also impacts the performance of HyTM systems; and (b) it reveals the importance of the previously proposed annotation mechanism to reduce the performance gap between HTM and STM in phased runtime systems. Furthermore, it shows that, by correctly using the annotations on just a few lines of code, it is possible to reduce the total number of instrumented barriers by 95% and to achieve speed-ups of up to 7× when compared to the original code generated by GCC and the Clang compiler.

Proceedings ArticleDOI
06 Sep 2022
TL;DR: This paper introduces a new abstraction called Open Transactional Actions (OTAs) that provides a framework for wrapping non-transactional resources in a transactional layer and believes that OTAs could be used by expert programmers to implement useful system libraries and also to give a transactional semantics to fast linearizable data structures, i.e., transactional boosting.
Abstract: This paper addresses the problem of accessing external resources from inside transactions in STM Haskell, and for that purpose introduces a new abstraction called Open Transactional Actions (OTAs) that provides a framework for wrapping non-transactional resources in a transactional layer. OTAs allow the programmer to access resources through IO actions from inside transactions, and also to register commit and abort handlers: the former are used to make the accesses to resources visible to other transactions at commit time, and the latter to undo changes in the resource if the transaction has to roll back. OTAs, once started, are guaranteed to execute completely before the hosting transaction can be aborted, guaranteeing that if a resource is accessed, its respective commit and abort actions will be properly registered. We believe that OTAs could be used by expert programmers to implement useful system libraries and also to give a transactional semantics to fast linearizable data structures, i.e., transactional boosting. As a proof of concept, we present examples that use OTAs to implement transactional file access and transactionally boosted data types that are faster than pure STM Haskell in most cases.

Journal ArticleDOI
TL;DR: This paper considers the problem of developing compact tools that support programming with dynamic transactional memory (implying on-the-fly generation of transactional pages) for the Cilk++ language, and proposes new, minimalist solutions.
Abstract: In this paper, we consider the problem of developing compact tools that support programming with dynamic transactional memory, implying on-the-fly generation of transactional pages, for the Cilk++ language. We argue that such an implementation requires weakened transaction isolation. The current state of the problem is analyzed: existing solutions are quite cumbersome, although they can handle complex data structures such as lists and trees. We argue that new, minimalist solutions are needed, based on specialized classes (for generating transactional pages and for implementing consistent transactional variables) combined with a set of keywords characteristic of Cilk++, and we propose such solutions. New syntax elements are introduced, implemented with the language-extension facilities of the Planning C platform, and their semantics is described. Unlike its analogues, the developed toolset allows transactions to be declaratively assembled into a network (a network work schedule) that determines the order in which transactions execute and the parallelism potentially available. The proposed approach was tested on the task of constructing a histogram; it was also successfully applied to training an artificial neural network by error back-propagation and to solving an integer linear programming problem by the branch-and-bound method.

DissertationDOI
10 Jun 2022
TL;DR: In this article, the authors study the transaction scheduling problem in several systems that differ in the intra-core communication cost of accessing shared resources, and make several theoretical contributions providing tight, near-tight, and/or impossibility results on three performance evaluation metrics (execution time, communication cost, and load) for any transaction scheduling algorithm.
Abstract: Transactional memory provides an alternative synchronization mechanism that removes many limitations of traditional lock-based synchronization, so that writing concurrent programs is easier than writing lock-based code on modern multicore architectures. The fundamental module in a transactional memory system is the transaction, which represents a sequence of read and write operations performed atomically on a set of shared resources; transactions may conflict if they access the same shared resources. A transaction scheduling algorithm is used to handle these conflicts and schedule the transactions appropriately. In this dissertation, we study the transaction scheduling problem in several systems that differ in the intra-core communication cost of accessing shared resources: symmetric communication costs imply tightly-coupled systems, asymmetric communication costs imply large-scale distributed systems, and partially asymmetric communication costs imply non-uniform memory access systems. We make several theoretical contributions, providing tight, near-tight, and/or impossibility results on three different performance evaluation metrics (execution time, communication cost, and load) for any transaction scheduling algorithm. We then complement these theoretical results with experimental evaluations, whenever possible, showing their benefits in practical scenarios. To the best of our knowledge, the contributions of this dissertation are either the first of their kind or significant improvements over the best previously known results.

Proceedings ArticleDOI
01 Jun 2022
TL;DR: In this paper , the authors apply CSP (Communicating Sequential Processes) to formally analyze the components of DPSTM v2 architecture, the data exchange process between components, and two different transaction processing modes.
Abstract: Transactional memory is designed to ease the development of parallel programs and improve their efficiency. PSTM (Python software transactional memory) mainly supports multi-core parallel programs based on the Python language. To better meet the development requirements of distributed concurrent programs and enhance system safety, DPSTM (distributed Python software transactional memory) was developed. Compared with PSTM, DPSTM offers higher operating efficiency and stronger fault tolerance. In this paper, we apply CSP (Communicating Sequential Processes) to formally analyze the components of the DPSTM v2 architecture, the data exchange process between components, and two different transaction processing modes. We use the model checker PAT (Process Analysis Toolkit) to model the DPSTM v2 architecture and verify eight properties, including deadlock freedom, ACI (atomicity, consistency, and isolation), sequential consistency, data server availability, read tolerance, and crash tolerance. The verification results show that the DPSTM v2 architecture can guarantee all of the above properties. In particular, the system keeps operating normally when some of the data servers crash, ensuring the safety of the distributed system.

Journal ArticleDOI
TL;DR: In this article, the authors provide a characterization of mode transitions and their impact on the performance of phase-based transactional memory (PhTM) systems, in which all transactions in a phase execute in the same (hardware/software) mode.

Posted ContentDOI
24 Apr 2022
TL;DR: Polynesia as discussed by the authors divides the HTAP system into transactional and analytical processing islands, implements new custom hardware that unlocks software optimizations to reduce the costs of update propagation and consistency, and exploits processing-in-memory for the analytical islands to alleviate data movement overheads.
Abstract: A growth in data volume, combined with increasing demand for real-time analysis (using the most recent data), has resulted in the emergence of database systems that concurrently support transactions and data analytics. These hybrid transactional and analytical processing (HTAP) database systems can support real-time data analysis without the high costs of synchronizing across separate single-purpose databases. Unfortunately, for many applications that perform a high rate of data updates, state-of-the-art HTAP systems incur significant losses in transactional (up to 74.6%) and/or analytical (up to 49.8%) throughput compared to performing only transactional or only analytical queries in isolation, due to (1) data movement between the CPU and memory, (2) data update propagation from transactional to analytical workloads, and (3) the cost to maintain a consistent view of data across the system. We propose Polynesia, a hardware-software co-designed system for in-memory HTAP databases that avoids the large throughput losses of traditional HTAP systems. Polynesia (1) divides the HTAP system into transactional and analytical processing islands, (2) implements new custom hardware that unlocks software optimizations to reduce the costs of update propagation and consistency, and (3) exploits processing-in-memory for the analytical islands to alleviate data movement overheads. Our evaluation shows that Polynesia outperforms three state-of-the-art HTAP systems, with average transactional/analytical throughput improvements of 1.7x/3.7x, and reduces energy consumption by 48% over the prior lowest-energy HTAP system.

Journal ArticleDOI
TL;DR: DBXN as mentioned in this paper proposes a technique called parity version to decouple HTM execution from NVM writes, so that transactions can correctly and efficiently use NVM to reduce their commit latency with HTM.
Reducing Transaction Processing Latency in Hardware Transactional Memory-based Database with Non-volatile Memory (DOI: 10.21655/ijsi.1673-7288.00274)
Abstract: The emergence of Hardware Transactional Memory (HTM) has greatly boosted transaction processing performance in in-memory databases. However, the group commit protocol, which aims to reduce the impact of slow storage devices, leads to high transaction commit latency. Non-Volatile Memory (NVM) opens opportunities for reducing transaction commit latency, but HTM cannot cooperate with NVM: flushing data to NVM always causes HTM to abort. In this paper, we propose a technique called parity version to decouple HTM execution from NVM writes, so that transactions can correctly and efficiently use NVM to reduce their commit latency under HTM. We have integrated this technique into DBX, a state-of-the-art HTM-based database, to build DBXN: a low-latency and high-throughput in-memory transaction processing system. Evaluations using typical OLTP workloads including TPC-C show that it has 99% lower latency and 2.1 times higher throughput than DBX.

Proceedings ArticleDOI
01 Mar 2022
TL;DR: This work characterizes, for the first time, how the aggressiveness of the cores in terms of exploiting instruction-level parallelism can interact with the thread-level speculation support brought by HTM systems, and concludes that, depending on contention, a careful choice of processor aggressiveness can reduce abort ratios.
Abstract: Hardware Transactional Memory (HTM) allows the use of transactions by programmers, making parallel programming easier and theoretically obtaining the performance of fine-grained locks. However, transactions can abort for a variety of reasons, resulting in the squash of speculatively executed instructions and the consequent loss in both performance and energy efficiency. Among the different sources of abort, conflicting concurrent accesses to the same shared memory locations from different transactions are often the prevalent cause. In this work, we characterize, for the first time to the best of our knowledge, how the aggressiveness of the cores in terms of exploiting instruction-level parallelism can interact with the thread-level speculation support brought by HTM systems. We observe that altering the size of the structures that support out-of-order and speculative execution changes the number of aborts produced in the execution of transactional workloads on a best-effort HTM implementation. Our results show that a small number of powerful cores is more suitable for high-contention scenarios, whereas under low contention it is preferable to use a larger number of less aggressive cores. In addition, an aggressive core can lead to performance loss in medium-contention scenarios due to an increase in the number of aborts. We conclude that, depending on contention, a careful choice of processor aggressiveness can reduce abort ratios.

Journal ArticleDOI
TL;DR: In this article, the authors show that implementation choices unrelated to concurrency control can explain some of the performance differences between OCC and non-OCC systems, and present two optimization techniques, deferred updates and timestamp splitting, that can dramatically improve the high-contention performance of both optimistic concurrency control (OCC) and multiversion concurrency control (MVCC) protocols.
Abstract: Main-memory multicore transactional systems have achieved excellent performance using single-version optimistic concurrency control (OCC), especially on uncontended workloads. Nevertheless, systems based on other concurrency control protocols, such as hybrid OCC/locking and variations on multiversion concurrency control (MVCC), are reported to outperform the best OCC systems, especially with increasing contention. This paper shows that implementation choices unrelated to concurrency control can explain some of these performance differences. Our evaluation shows the strengths and weaknesses of OCC, MVCC, and TicToc concurrency control under varying workloads and contention levels, and the importance of several implementation choices called basis factors. Given sensible basis factor choices, OCC performance does not collapse on high-contention TPC-C. We also present two optimization techniques, deferred updates and timestamp splitting, that can dramatically improve the high-contention performance of both OCC and MVCC. These techniques are known, but we apply them in a new context and highlight their potency: when combined, they lead to performance gains of 4.74× for MVCC and 5.01× for OCC in a TPC-C workload.

Book ChapterDOI
01 Jan 2022
TL;DR: Transactional and analytic data are fundamentally different in terms of how they are written, read, stored, and managed as mentioned in this paper. To achieve optimal performance of either type of data requires architecting data structures and software that cater to each of these paradigms.
Abstract: Transactional and analytic data are fundamentally different in terms of how they are written, read, stored, and managed. To achieve optimal performance of either type of data requires architecting data structures and software that cater to each of these paradigms.