
Showing papers on "Concurrency control published in 2020"


Journal ArticleDOI
01 Apr 2020
TL;DR: The researcher found that the proposed mechanism significantly reduces communication delay and enables low-latency fog computing services for delay-sensitive IoT applications.
Abstract: In cloud–fog environments, adapting the conventional concurrency control protocols makes it possible to avoid using the upstream communication channel from the clients to the cloud server all the time. This paper introduces a new variant of the optimistic concurrency control protocol. Through the deployment of an augmented partial validation protocol, read-only IoT transactions can be processed locally at the fog node. Only update transactions are sent to the cloud for final validation. Moreover, update transactions go through partial validation at the fog node, which makes them more likely to commit at the cloud. This protocol reduces communication and computation at the cloud as much as possible while supporting the scalability of the transactional services needed by the applications running in such environments. Based on numerical studies, the researcher assessed the partial validation procedure under three concurrency protocols. The study's results indicate that the proposed mechanism generates benefits for IoT users, obtained from transactional services. We evaluated the effect of deploying partial validation at the fog node for the three concurrency protocols, namely AOCCRBSC, AOCCRB, and STUBcast, and performed a set of intensive experiments to compare them with and without such deployment. The results show a reduction in miss rate, restart rate, and communication delay for all three. The researcher found that the proposed mechanism significantly reduces communication delay and enables low-latency fog computing services for delay-sensitive IoT applications.
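The partial-validation idea lends itself to a short illustration. Below is a minimal sketch, assuming a simple versioned store; the class and method names are hypothetical, not the paper's AOCCRBSC/AOCCRB/STUBcast code:

```python
# Hypothetical sketch of fog-side partial validation for optimistic
# concurrency control: read-only transactions commit at the fog node,
# update transactions are pre-filtered there before the cloud round trip.

class Txn:
    def __init__(self):
        self.read_set = {}    # item -> version observed
        self.write_set = {}   # item -> new value

class Cloud:
    def final_validate(self, txn):
        return True           # placeholder for authoritative cloud validation

class FogNode:
    def __init__(self):
        self.versions = {}    # item -> latest committed version at this node

    def read(self, txn, item):
        version = self.versions.get(item, 0)
        txn.read_set[item] = version
        return version

    def partially_validate(self, txn):
        # Passes only if every item read is still at the observed version.
        return all(self.versions.get(item, 0) == v
                   for item, v in txn.read_set.items())

    def process(self, txn, cloud):
        if not txn.write_set:                    # read-only IoT transaction:
            return self.partially_validate(txn)  # commit locally, no round trip
        if not self.partially_validate(txn):     # update transaction: filter
            return False                         # obvious conflicts before upload
        return cloud.final_validate(txn)         # only survivors reach the cloud
```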

186 citations


Journal ArticleDOI
Youmin Chen1, Youyou Lu1, Kedong Fang1, Qing Wang1, Jiwu Shu1 
01 Jul 2020
TL;DR: A B+-tree variant named μTree incorporates a shadow list-based layer into the leaf nodes of a B+-tree to gain benefits from both list and tree data structures, achieving a 99th percentile latency that is one order of magnitude lower and 2.8-4.7 times higher throughput.
Abstract: Tail latency is a critical design issue in recent storage systems. B+-tree, as a fundamental building block in storage systems, incurs high tail latency, especially when placed in persistent memory (PM). Our empirical study identifies two factors that lead to such latency spikes: (i) the internal structural refinement operations (i.e., split, merge, and balance), and (ii) the interference between concurrent operations. The problem is even worse when high concurrency meets the low write bandwidth of persistent memory. In this paper, we propose a B+-tree variant named μTree. It incorporates a shadow list-based layer into the leaf nodes of a B+-tree to gain benefits from both list and tree data structures. The list layer in PM is exempt from the structural refinement operations since list nodes in the list layer own separate PM spaces, which are organized in an element-based way. Meanwhile, μTree still gains the locality benefit from the tree-based nodes. To alleviate the interference overhead, μTree coordinates the concurrency control between the tree and list layers, which moves the slow PM accesses out of the critical path. We compare μTree to state-of-the-art designs of PM-aware B+-tree indices under both YCSB workloads and real-world applications. μTree achieves a 99th percentile latency that is one order of magnitude lower and 2.8-4.7 times higher throughput.
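To make the two-layer idea concrete, here is a hypothetical sketch of how a leaf can keep its entries in an element-based linked list so that a split moves only tree-layer pointers; it illustrates the concept, not μTree's actual PM layout:

```python
# Illustrative sketch (not the paper's code): leaf entries live in an
# element-based linked list, so a B+-tree split only redistributes
# pointers in the tree layer and never moves or rewrites the list nodes
# that would sit in persistent memory.

class ListNode:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.next = None          # PM-resident list layer

class Leaf:
    def __init__(self):
        self.slots = []           # DRAM tree layer: sorted (key, ListNode) pairs

    def insert(self, key, value):
        node = ListNode(key, value)          # allocate a separate PM element
        self.slots.append((key, node))
        self.slots.sort(key=lambda kv: kv[0])
        return node

    def split(self):
        # Structural refinement touches only the DRAM slot array;
        # the PM list nodes keep their addresses.
        mid = len(self.slots) // 2
        right = Leaf()
        right.slots = self.slots[mid:]
        self.slots = self.slots[:mid]
        return right
```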

50 citations


Proceedings ArticleDOI
11 Jun 2020
TL;DR: Wang et al. propose a fine-grained enhancement of the execute-order-validate architecture in Hyperledger Fabric, where unserializable transactions are aborted before reordering and the rest are guaranteed to be serializable.
Abstract: Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms to general transactional systems. A new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions. However, this architecture might render many transactions invalid when serializing them. This problem is further exacerbated as the block formation rate is inherently limited by factors besides data processing, such as cryptography and consensus. Inspired by optimistic concurrency control in modern databases, we propose a novel method to enhance the execute-order-validate architecture by reordering transactions to reduce the abort rate. In contrast to existing blockchains that adopt databases' preventive approaches, which might over-abort serializable transactions, our method is theoretically more fine-grained: unserializable transactions are aborted before reordering and the rest are guaranteed to be serializable. We implement our method in two blockchains, FabricSharp on top of Hyperledger Fabric and FastFabricSharp on top of FastFabric. We compare the performance of FabricSharp with vanilla Fabric and three related systems, two of which implement a standard and a state-of-the-art concurrency control technique from databases, respectively. The results demonstrate that FabricSharp achieves 25% higher throughput compared to the other systems in nearly all experimental scenarios. Moreover, FastFabricSharp's improvement over FastFabric is up to 66%.
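The reordering idea can be illustrated with a conventional dependency-graph pass: drop the transactions caught in cycles (unserializable) and emit the rest in a topological, hence serializable, order. The sketch below is a deliberate simplification, not FabricSharp's actual algorithm:

```python
from collections import defaultdict

def reorder(txns, conflicts):
    """txns: iterable of txn ids; conflicts: set of (a, b) = 'a must precede b'.
    Returns (serializable_order, aborted)."""
    succs = defaultdict(set)
    indeg = {t: 0 for t in txns}
    for a, b in conflicts:
        if b not in succs[a]:
            succs[a].add(b)
            indeg[b] += 1

    ready = [t for t in txns if indeg[t] == 0]
    order = []
    while ready:                          # Kahn's topological sort
        t = ready.pop()
        order.append(t)
        for s in succs[t]:
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)

    # Whatever was not emitted sits on (or behind) a dependency cycle;
    # this simplification conservatively aborts all of it.
    aborted = set(txns) - set(order)
    return order, aborted

# reorder(["t1", "t2", "t3"], {("t1", "t2"), ("t2", "t1")})
# -> (["t3"], {"t1", "t2"}): only the cyclic pair is aborted.
```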

42 citations


Proceedings ArticleDOI
09 Mar 2020
TL;DR: TimeStone is a highly scalable DTM system with low write amplification and minimal memory footprint, which uses a novel multi-layered hybrid logging technique, called TOC logging, to guarantee crash consistency and relies on a Multi-Version Concurrency Control mechanism to achieve high scalability and to support different isolation levels on the same data set.
Abstract: Non-volatile main memory (NVMM) technologies promise byte addressability and near-DRAM access, allowing developers to build persistent applications with common load and store instructions. However, it is difficult to realize these promises because NVMM software must also provide crash consistency while delivering high performance and scalability. Durable transactional memory (DTM) systems address these challenges, but none of them scales beyond 16 cores. The poor scalability stems either from the underlying STM layer or from limited write parallelism (single writer or dual version). In addition, existing approaches suffer from high write amplification and a large memory footprint when guaranteeing crash consistency. To address these challenges, we propose TimeStone: a highly scalable DTM system with low write amplification and minimal memory footprint. TimeStone uses a novel multi-layered hybrid logging technique, called TOC logging, to guarantee crash consistency. It further relies on a Multi-Version Concurrency Control (MVCC) mechanism to achieve high scalability and to support different isolation levels on the same data set. Our evaluation of TimeStone against state-of-the-art DTM systems shows that it significantly outperforms them for a wide range of workloads with varying data-set sizes and contention levels, up to 112 hardware threads. In addition, with TOC logging, TimeStone achieves a write amplification of less than 1, while existing DTM systems suffer from 2×-6× overhead.
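The MVCC side of the design follows the textbook pattern: readers walk a version chain to the newest version no younger than their snapshot, so they never block writers. A generic sketch of that pattern (not TimeStone's implementation):

```python
# Textbook multi-version read/write, to illustrate how an MVCC-based DTM
# can serve readers without blocking writers.

class Version:
    def __init__(self, value, commit_ts, older=None):
        self.value, self.commit_ts, self.older = value, commit_ts, older

class MVObject:
    def __init__(self, value):
        self.head = Version(value, commit_ts=0)    # initial committed version

    def read(self, snapshot_ts):
        v = self.head
        while v is not None and v.commit_ts > snapshot_ts:
            v = v.older                            # skip versions newer than snapshot
        return v.value if v is not None else None

    def write(self, value, commit_ts):
        self.head = Version(value, commit_ts, older=self.head)

obj = MVObject("a")
obj.write("b", commit_ts=5)
assert obj.read(snapshot_ts=3) == "a"   # old snapshot still sees the old version
assert obj.read(snapshot_ts=5) == "b"   # newer snapshot sees the new one
```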

37 citations


Journal ArticleDOI
01 Jan 2020
TL;DR: It is found that implementation choices unrelated to concurrency control may explain much of OCC's previously-reported degradation, and two optimization techniques, commit-time updates and timestamp splitting, that can dramatically improve the high-contention performance of both OCC and MVCC are presented.
Abstract: Optimistic concurrency control, or OCC, can achieve excellent performance on uncontended workloads for main-memory transactional databases. Contention causes OCC's performance to degrade, however, and recent concurrency control designs, such as hybrid OCC/locking systems and variations on multiversion concurrency control (MVCC), have claimed to outperform the best OCC systems. We evaluate several concurrency control designs under varying contention and varying workloads, including TPC-C, and find that implementation choices unrelated to concurrency control may explain much of OCC's previously-reported degradation. When these implementation choices are made sensibly, OCC performance does not collapse on high-contention TPC-C. We also present two optimization techniques, commit-time updates and timestamp splitting, that can dramatically improve the high-contention performance of both OCC and MVCC. Though these techniques are known, we apply them in a new context and highlight their potency: when combined, they lead to performance gains of 3.4X for MVCC and 3.6X for OCC in a TPC-C workload.
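Both optimizations are easy to sketch. Under my reading of the abstract, timestamp splitting keeps one timestamp per column group rather than per record, so validation of a read fails only if the columns actually read were overwritten; commit-time updates defer blind read-modify-writes (e.g., counter increments) to the commit point so they never enter the read set. A hypothetical illustration:

```python
class Record:
    def __init__(self, groups):
        self.data = dict(groups)             # column group -> value
        self.ts = {g: 0 for g in groups}     # one timestamp per group, not per record

def validate(reads, record):
    """reads: {group: timestamp observed}. With split timestamps, only a
    concurrent write to the *same* column group fails validation."""
    return all(record.ts[g] == seen for g, seen in reads.items())

def commit(record, writes, deferred_increments, now):
    for g, v in writes.items():
        record.data[g], record.ts[g] = v, now
    for g, delta in deferred_increments.items():
        record.data[g] += delta              # commit-time update: applied at
        record.ts[g] = now                   # commit, never read, never validated

r = Record({"hot": 0, "cold": 0})
reads = {"cold": r.ts["cold"]}               # transaction reads only "cold"
commit(r, {"hot": 42}, {}, now=1)            # concurrent write hits "hot" only
assert validate(reads, r)                    # read remains valid: groups differ
```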

35 citations


Journal ArticleDOI
01 Mar 2020
TL;DR: LiveGraph is presented, a graph storage system that outperforms both the best graph transactional systems and the best systems for real-time graph analytics on fresh data by ensuring that adjacency list scans, a key operation in graph workloads, are purely sequential.
Abstract: The specific characteristics of graph workloads make it hard to design a one-size-fits-all graph storage system. Systems that support transactional updates use data structures with poor data locality, which limits the efficiency of analytical workloads or even simple edge scans. Other systems run graph analytics workloads efficiently, but cannot properly support transactions. This paper presents LiveGraph, a graph storage system that outperforms both the best graph transactional systems and the best solutions for real-time graph analytics on fresh data. LiveGraph achieves this by ensuring that adjacency list scans, a key operation in graph workloads, are purely sequential: they never require random accesses, even in the presence of concurrent transactions. Such purely sequential operations are enabled by combining a novel graph-aware data structure, the Transactional Edge Log (TEL), with a concurrency control mechanism that leverages TEL's data layout. Our evaluation shows that LiveGraph significantly outperforms state-of-the-art (graph) database solutions on both transactional and real-time analytical workloads.
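A minimal, hypothetical rendering of the purely-sequential idea: if per-vertex edge updates are appended to a contiguous log with creation and deletion timestamps, an adjacency scan at a given read timestamp is a single forward pass with no pointer chasing. Field names below are illustrative, not LiveGraph's actual TEL layout:

```python
class TransactionalEdgeLog:
    def __init__(self):
        self.entries = []                    # contiguous, append-only

    def add_edge(self, dst, txn_ts):
        self.entries.append((dst, txn_ts, None))    # (dst, created, deleted)

    def delete_edge(self, dst, txn_ts):
        self.entries.append((dst, None, txn_ts))    # tombstone entry

    def scan(self, read_ts):
        # One sequential pass; the newest entry visible at read_ts wins.
        alive = {}
        for dst, created, deleted in self.entries:
            if created is not None and created <= read_ts:
                alive[dst] = True
            elif deleted is not None and deleted <= read_ts:
                alive.pop(dst, None)
        return list(alive)

tel = TransactionalEdgeLog()
tel.add_edge("v2", txn_ts=1)
tel.add_edge("v3", txn_ts=2)
tel.delete_edge("v2", txn_ts=3)
assert tel.scan(read_ts=2) == ["v2", "v3"]   # deletion not yet visible
assert tel.scan(read_ts=3) == ["v3"]
```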

35 citations


Posted Content
TL;DR: A novel method to enhance the execute-order-validate architecture by reordering transactions to reduce the abort rate; it is theoretically more fine-grained: unserializable transactions are aborted before reordering and the rest are guaranteed to be serializable.
Abstract: Smart contracts have enabled blockchain systems to evolve from simple cryptocurrency platforms, such as Bitcoin, to general transactional systems, such as Ethereum. Catering to emerging business requirements, a new architecture called execute-order-validate has been proposed in Hyperledger Fabric to support parallel transactions and improve the blockchain's throughput. However, this new architecture might render many transactions invalid when serializing them. This problem is further exacerbated as the block formation rate is inherently limited by factors besides data processing, such as cryptography and consensus. In this work, we propose a novel method to enhance the execute-order-validate architecture by reducing invalid transactions to improve the throughput of blockchains. Our method is inspired by state-of-the-art optimistic concurrency control techniques in modern database systems. In contrast to existing blockchains that adopt databases' preventive approaches, which might abort serializable transactions, our method is theoretically more fine-grained. Specifically, unserializable transactions are aborted before ordering and the remaining transactions are guaranteed to be serializable. For evaluation, we implement our method in two blockchains, FabricSharp on top of Hyperledger Fabric and FastFabricSharp on top of FastFabric. We compare the performance of FabricSharp with vanilla Fabric and three related systems, two of which implement a standard and a state-of-the-art concurrency control technique from databases, respectively. The results demonstrate that FabricSharp achieves 25% higher throughput compared to the other systems in nearly all experimental scenarios. Moreover, FastFabricSharp's improvement over FastFabric is up to 66%.

33 citations


Book ChapterDOI
25 Apr 2020
TL;DR: This work proposes a proof methodology for establishing that a given object maintains a given invariant, taking into account any concurrency control, for the subclass of state-based distributed systems.
Abstract: To provide high availability in distributed systems, object replicas allow concurrent updates. Although replicas eventually converge, they may diverge temporarily, for instance when the network fails. This makes it difficult for the developer to reason about the object's properties, and in particular, to prove invariants over its state. For the subclass of state-based distributed systems, we propose a proof methodology for establishing that a given object maintains a given invariant, taking into account any concurrency control. Our approach allows reasoning about individual operations separately. We demonstrate that our rules are sound, and we illustrate their use with some representative examples. We automate the rules using Boogie, an SMT-based tool.

20 citations


Proceedings ArticleDOI
11 Jun 2020
TL;DR: While Strife incurs about 50% overhead relative to partitioned systems in the statically partitionable case, it performs 2x better when such static partitioning is not possible and adapts to dynamically varying workloads.
Abstract: Research on transaction processing has made significant progress towards improving the performance of main-memory multicore OLTP systems under low contention. However, these systems struggle on workloads with many conflicts. Partitioned databases (and variants) perform well on high-contention workloads that are statically partitionable, but time-varying workloads often make them impractical. To address this, we propose Strife, a novel transaction processing scheme that clusters transactions together dynamically and executes most of them without any concurrency control. Strife executes transactions in batches, where each batch is partitioned into disjoint clusters without any cross-cluster conflicts plus a small set of residuals. The clusters are then executed in parallel with no concurrency control, followed by the residuals, executed separately with concurrency control. Strife uses a fast dynamic clustering algorithm that exploits a combination of random sampling and a concurrent union-find data structure to partition the batch online, before executing it. Strife outperforms lock-based and optimistic protocols by up to 2x on high-contention workloads. While Strife incurs about 50% overhead relative to partitioned systems in the statically partitionable case, it performs 2x better when such static partitioning is not possible and adapts to dynamically varying workloads.
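The clustering step can be sketched with a plain union-find: transactions that touch a common record are merged into the same cluster, and the resulting disjoint clusters can run in parallel with no concurrency control. The following simplification omits Strife's random sampling, residual handling, and concurrent data structure:

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:                  # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def cluster(batch):
    """batch: list of (txn_id, non-empty list of records accessed).
    Returns {cluster_root: [txn_ids]} with no cross-cluster conflicts."""
    uf = UnionFind()
    for _, records in batch:
        for r in records[1:]:
            uf.union(records[0], r)
    clusters = {}
    for txn, records in batch:
        clusters.setdefault(uf.find(records[0]), []).append(txn)
    return clusters

# cluster([("t1", ["a", "b"]), ("t2", ["b"]), ("t3", ["c"])])
# -> two clusters: ["t1", "t2"] (share record "b") and ["t3"].
```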

20 citations


Journal ArticleDOI
01 Sep 2020
TL;DR: This paper presents yet another concurrency control analysis platform, CCBench, which supports seven protocols and seven versatile optimization methods and enables the configuration of seven workload parameters; the protocols and optimization methods are analyzed using various workload parameters.
Abstract: This paper presents yet another concurrency control analysis platform, CCBench. CCBench supports seven protocols (Silo, TicToc, MOCC, Cicada, SI, SI with latch-free SSN, 2PL) and seven versatile optimization methods and enables the configuration of seven workload parameters. We analyzed the protocols and optimization methods using various workload parameters and a thread count of 224. Previous studies focused on thread scalability and did not explore the space analyzed here. We classified the optimization methods on the basis of three performance factors: CPU cache, delay on conflict, and version lifetime. Analyses using CCBench and 224 threads produced six insights. (I1) The performance of an optimistic concurrency control protocol for a read-only workload rapidly degrades as cardinality increases, even without L3 cache misses. (I2) Silo can outperform TicToc for some write-intensive workloads by using the invisible reads optimization. (I3) The effectiveness of the two approaches to coping with conflict (wait and no-wait) depends on the situation. (I4) OCC reads the same record two or more times if a concurrent transaction interruption occurs, which can be exploited to improve performance. (I5) Mixing different implementations is inappropriate for deep analysis. (I6) Even a state-of-the-art garbage collection method cannot improve the performance of multi-version protocols if a single long transaction is mixed into the workload. On the basis of I4, we defined the read phase extension optimization, in which an artificial delay is added to the read phase. On the basis of I6, we defined the aggressive garbage collection optimization, in which even visible versions are collected. The code for CCBench and all the data in this paper are available online on GitHub.
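As a rough illustration of the read phase extension derived from I4, the retry loop below artificially lengthens the read phase after a validation failure, giving the conflicting writer time to finish before validation runs again. The helper signatures are assumptions of this sketch, not CCBench's API:

```python
import time

def run_occ(read_phase, validate_and_commit, base_delay=1e-5, retries=100):
    """read_phase and validate_and_commit are caller-supplied callables."""
    extension = 0.0
    for _ in range(retries):
        read_phase()                  # read phase...
        if extension:
            time.sleep(extension)     # ...artificially extended after a conflict
        if validate_and_commit():     # validation + write phase
            return True
        extension = base_delay if extension == 0.0 else extension * 2
    return False
```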

18 citations


Proceedings ArticleDOI
11 Jun 2020
TL;DR: Given the vast number of possible storage engine designs and their complexity, there is a need to be able to describe and communicate design decisions in a high-level descriptive language, and a first version of such a language is presented.
Abstract: Key-value stores are everywhere. They power a diverse set of data-driven applications across both industry and science. Key-value stores are used as stand-alone NoSQL systems, but they are also used as part of more complex pipelines and systems such as machine learning and relational systems. In this tutorial, we survey the state-of-the-art approaches to designing the core storage engine of a key-value store system. We focus on several critical components of the engine, starting with the core data structures used to lay out data across the memory hierarchy. We also discuss design issues related to caching, timestamps, concurrency control, updates, shifting workloads, as well as mixed workloads with both analytical and transactional characteristics. We cover designs that are read-optimized, write-optimized, and hybrid. We draw examples from several state-of-the-art systems, but we also put everything together in a general framework which allows us to model storage engine designs under a single unified model and reason about the expected behavior of diverse designs. In addition, we show that, given the vast number of possible storage engine designs and their complexity, there is a need to be able to describe and communicate design decisions in a high-level descriptive language, and we present a first version of such a language. We then use that framework to present several open challenges in the field, especially in terms of supporting increasingly more diverse and dynamic applications in the era of data science and AI, including neural networks, graphs, and data versioning.

Journal ArticleDOI
Pan Fan1, Jing Liu1, Wei Yin, Hui Wang, Xiaohong Chen1, Haiying Sun1 
TL;DR: This paper proposes 2PC*, a novel concurrency control protocol for distributed transactions that outperforms 2PC by allowing greater concurrency across multiple microservices, and improves 2PC*'s fault-tolerance mechanism using transaction compensation.
Abstract: The two-phase commit (2PC) protocol is a key technique for achieving distributed transactions in storage systems such as relational databases and distributed databases. 2PC is a strongly consistent and centralized atomic commit protocol that ensures the serialization of the transaction execution order. However, it does not scale well to large and high-throughput systems, especially for applications with many transactional conflicts, such as microservices and cloud computing. Therefore, 2PC is a performance bottleneck for distributed transaction control across multiple microservices. In this paper, we propose 2PC*, a novel concurrency control protocol for distributed transactions that outperforms 2PC, allowing greater concurrency across multiple microservices. 2PC* can greatly reduce the overhead that arises when locks are held throughout the transaction process. Moreover, we improve the fault-tolerance mechanism of 2PC* using transaction compensation. We also implement a middleware solution for transactions in microservice support using 2PC*. We compare 2PC* to 2PC by applying both to Ctrip MSECP, and 2PC* outperforms 2PC in workloads with varying degrees of contention. When contention becomes high, the experimental results show that 2PC* achieves at most a 3.3x improvement in throughput and a 67% reduction in latency, which shows that our scheme can easily support distributed transactions with multi-microservice modules. Finally, we embed our middleware scheme in a PaaS cloud platform and demonstrate its strong applicability to cloud computing through long-term analysis of the monitoring results in the cloud platform.
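For contrast, below is a toy version of the classic 2PC coordinator that 2PC* is measured against; locks acquired at prepare time are released only in the second phase, which is exactly the overhead the paper targets. This is textbook 2PC, not the 2PC* protocol:

```python
class Participant:
    def prepare(self, txn):   # acquire locks, persist undo/redo, then vote
        return True
    def commit(self, txn):    # apply and release locks
        pass
    def abort(self, txn):     # roll back and release locks
        pass

def two_phase_commit(participants, txn):
    votes = [p.prepare(txn) for p in participants]   # phase 1: voting
    decision = all(votes)                            # any "no" vote aborts all
    for p in participants:                           # phase 2: decision
        (p.commit if decision else p.abort)(txn)     # locks released only here
    return decision
```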

Proceedings ArticleDOI
18 May 2020
TL;DR: In this article, the authors evaluate the performance of DRAM-cached NVM for accelerating HPC applications and enabling large problems beyond the DRAM capacity, and identify write throttling and concurrency control as the priorities in optimizing applications.
Abstract: The emergence of high-density byte-addressable non-volatile memory (NVM) promises to accelerate data- and compute-intensive applications. Current NVM technologies have lower performance than DRAM and, thus, are often paired with DRAM in a heterogeneous main memory. Recently, byte-addressable NVM hardware has become available. This work provides a timely evaluation of representative HPC applications from the "Seven Dwarfs" on NVM-based main memory. Our results quantify the effectiveness of DRAM-cached NVM for accelerating HPC applications and enabling large problems beyond the DRAM capacity. On uncached NVM, HPC applications exhibit three tiers of performance sensitivity: insensitive, scaled, and bottlenecked. We identify write throttling and concurrency control as the priorities in optimizing applications. We highlight that concurrency changes may have a diverging effect on read and write accesses in applications. Based on these findings, we explore two optimization approaches. First, we provide a prediction model that uses datasets from a small set of configurations to estimate performance at various concurrency and data sizes, avoiding exhaustive search of the configuration space. Second, we demonstrate that write-aware data placement on uncached NVM can achieve a 2x performance improvement with a 60% reduction in DRAM usage.

Proceedings ArticleDOI
15 Jun 2020
TL;DR: This paper follows up on prior work with an evaluation of the characteristics of concurrency control schemes on real production multi-socket hardware with 1568 cores and makes several interesting findings.
Abstract: In this paper, we set out to revisit the results of "Staring into the Abyss [...] of Concurrency Control with [1000] Cores" [27] and analyse in-memory DBMSs on today's large hardware. Despite the original assumption of the authors, today we do not see single-socket CPUs with 1000 cores. Instead, multi-socket hardware has made its way into production data centres. Hence, we follow up on this prior work with an evaluation of the characteristics of concurrency control schemes on real production multi-socket hardware with 1568 cores. To our surprise, we made several interesting findings, which we report on in this paper.

Journal ArticleDOI
01 Sep 2020
TL;DR: MorphoSys is a distributed database system that dynamically chooses, and alters, its physical design based on the workload, making integrated decisions about data partitioning, replication, and placement on the fly using a learned cost model.
Abstract: Distributed database systems are widely used to meet the demands of storing and managing computation-heavy workloads. To boost performance and minimize resource and data contention, these systems require selecting a distributed physical design that determines where to place data, and which data items to replicate and partition. Deciding on a physical design is difficult, as each choice poses a trade-off in the design space, and a poor choice can significantly degrade performance. Current design decisions are typically static, unable to adapt to workload changes, or unable to combine multiple design choices such as data replication and data partitioning integrally. This paper presents MorphoSys, a distributed database system that dynamically chooses, and alters, its physical design based on the workload. MorphoSys makes integrated decisions about data partitioning, replication, and placement on the fly using a learned cost model. MorphoSys provides efficient transaction execution in the face of design changes via a novel concurrency control and update propagation scheme. Our experimental evaluation, using several benchmark workloads and state-of-the-art comparison systems, shows that MorphoSys delivers excellent system performance through effective and efficient physical designs.

Proceedings ArticleDOI
11 Jun 2020
TL;DR: This paper introduces an index data structure, called the partitioned in-memory merge tree, to address the challenges that arise when indexing highly dynamic data, which is common in streaming settings, and proposes a low-cost and effective concurrency control mechanism to meet the demands of high-rate update queries.
Abstract: Indexing sliding window content to enhance the performance of streaming queries can be greatly accelerated by utilizing the computational capabilities of a multicore processor. Conventional indexing data structures, optimized for frequent search queries on a prestored dataset, do not meet the demands of indexing highly dynamic data as in streaming environments. In this paper, we introduce an index data structure, called the partitioned in-memory merge tree, to address the challenges that arise when indexing highly dynamic data, which is common in streaming settings. Exploiting the specific pattern of streaming data and the distribution of queries, we propose a low-cost and effective concurrency control mechanism to meet the demands of high-rate update queries. To complement the index, we design an algorithm to realize a parallel index-based stream join that exploits the computational power of multicore processors. Our experiments using an octa-core processor show that our parallel stream join achieves up to 5.5 times higher throughput than a single-threaded approach.

Journal ArticleDOI
TL;DR: The control algorithm is mathematically described using Finite State Machine methodology, in canonical and concurrent variants, for concurrent processes with real-time requirements.

Journal ArticleDOI
TL;DR: RACE aims to reduce the transaction miss percentage by eliminating the following problems: deadlock, by dividing the execution stage into a locking phase and a processing phase; cyclic restart, by prejudging its occurrence; and pseudo-priority inversion, which may occur with an intermediate low-priority lock-holding cohort.
Abstract: The two-phase locking with high priority (2PL-HP) protocol is a broadly used concurrency control protocol, as it better handles the priority inversion problem. However, its performance may degrade due to cyclic restarts, deadlocks, unnecessary aborts, pseudo-priority inversion, and starvation. To overcome these problems, this paper proposes the RACE concurrency control protocol: Reduction of long transactions' starvation effect, Avoidance of deadlock and pseudo-priority inversion, and Conditional restart for Efficient resource utilization. RACE specifically aims to reduce the transaction miss percentage by eliminating the following problems: deadlock, by dividing the execution stage into a locking phase and a processing phase; cyclic restart, by prejudging its occurrence; and pseudo-priority inversion, which may occur with an intermediate low-priority lock-holding cohort. Moreover, it reduces unnecessary transaction aborts through the priority inheritance method and, to some extent, saves long transactions from starvation by ensuring fair chances of completion. Simulation results confirm up to 11% improvement in transaction miss percentage and up to 38% reduction in transaction rollbacks for the RACE protocol over 2PL-HP and extended 2PL-HP.
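The 2PL-HP rule that RACE refines is compact enough to sketch: a lock request from a higher-priority transaction aborts a lower-priority holder instead of waiting, which avoids priority inversion at the cost of exactly the restarts RACE tries to eliminate. A hypothetical illustration:

```python
class Lock:
    def __init__(self):
        self.holder = None                    # (txn_id, priority) or None

    def acquire(self, txn_id, priority, abort_txn):
        """abort_txn: caller-supplied callback that aborts (and later
        restarts) the victim transaction."""
        if self.holder is None:
            self.holder = (txn_id, priority)
            return True
        holder_id, holder_priority = self.holder
        if priority > holder_priority:        # high priority wins immediately:
            abort_txn(holder_id)              # the 2PL-HP abort RACE reduces
            self.holder = (txn_id, priority)
            return True
        return False                          # lower-priority requester must wait
```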

Proceedings ArticleDOI
15 Sep 2020
TL;DR: The Sparse Synchronous model of computation, which allows a programmer to specify software timing more precisely than the traditional “heartbeat” of mainstream operating systems or the synchronous languages, is presented.
Abstract: We present the Sparse Synchronous model (SSM) of computation, which allows a programmer to specify software timing more precisely than the traditional “heartbeat” of mainstream operating systems or the synchronous languages. SSM is a mix of semantics inspired by discrete event simulators and the synchronous languages designed to operate in resource-constrained environments such as microcontrollers. SSM provides precise timing prescriptions, concurrency, and determinism. We present SSM, its motivations, and details of a lightweight runtime system upon which a future language will be built.

Proceedings ArticleDOI
14 Jun 2020
TL;DR: The robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns, is addressed, and a coNP-hardness proof is obtained.
Abstract: While serializability always guarantees application correctness, lower isolation levels can be chosen to improve transaction throughput at the risk of introducing certain anomalies. A set of transactions is robust against a given isolation level if every possible interleaving of the transactions under the specified isolation level is serializable. Robustness therefore always guarantees application correctness with the performance benefit of the lower isolation level. While the robustness problem has received considerable attention in the literature, only sufficient conditions have been obtained. The most notable exception is the seminal work by Fekete, where he obtained a characterization for deciding robustness against SNAPSHOT ISOLATION. In this paper, we address the robustness problem for the lower SQL isolation levels READ UNCOMMITTED and READ COMMITTED, which are defined in terms of the forbidden dirty write and dirty read patterns. The first main contribution of this paper is that we characterize robustness against both isolation levels in terms of the absence of counterexample schedules of a specific form (split and multi-split schedules) and the absence of cycles in interference graphs that satisfy various properties. A critical difference from Fekete's work is that the properties of cycles obtained in this paper have to take the relative ordering of operations within transactions into account, as READ UNCOMMITTED and READ COMMITTED do not satisfy the atomic visibility requirement. A particular consequence is that the latter renders the robustness problem against READ COMMITTED coNP-complete. The second main contribution of this paper is the coNP-hardness proof. For READ UNCOMMITTED, we obtain LOGSPACE-completeness.

Journal ArticleDOI
TL;DR: A scalable deterministic concurrency control, Deterministic and Optimistic Concurrency Control (DOCC), scales performance both within a single node and across multiple nodes.
Abstract: Deterministic databases can improve the performance of distributed workloads by eliminating the distributed commit protocol and reducing the contention cost. Unfortunately, current deterministic schemes do not consider performance scalability within a single machine. In this paper, we describe a scalable deterministic concurrency control, Deterministic and Optimistic Concurrency Control (DOCC), which is able to scale performance both within a single node and across multiple nodes. The performance improvement comes from enforcing determinism lazily and avoiding read-only transactions blocking execution. The evaluation shows that DOCC achieves an 8x performance improvement over the popular deterministic database system Calvin.

Journal ArticleDOI
TL;DR: In this article, the Paxos protocol is extended with the concept of consistent quorums to detect whether the previous consensus still needs to be consolidated or is already finished so that the next consensus value can be safely proposed.
Abstract: Building consensus sequences based on distributed, fault-tolerant consensus, as used for replicated state machines, typically requires a separate distributed state for every new consensus instance. Allocating and maintaining this state causes significant overhead. In particular, freeing the distributed, outdated states in a fault-tolerant way is not trivial and adds further complexity and cost to the system. In this article, we propose an extension to the single-decree Paxos protocol that can learn a sequence of consensus decisions `in-place', i.e., with a single set of distributed states. Our protocol does not require dynamic log structures and hence has no need for distributed log pruning, snapshotting, compaction, or dynamic resource allocation. The protocol builds a fault-tolerant atomic register that supports arbitrary read-modify-write operations. We use the concept of consistent quorums to detect whether the previous consensus still needs to be consolidated or is already finished so that the next consensus value can be safely proposed. Reading a consolidated consensus is done without state modifications and is thereby free of concurrency control and demand for serialisation. A proposer that is not interrupted reaches agreement on consecutive consensus decisions within a single message round-trip per decision by preparing the acceptors eagerly with the previous request.
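The consistent-quorum test can be illustrated as follows, assuming each acceptor reports its current (round, value) pair; the shapes here are my own simplification of the protocol's fault-tolerant register, not the paper's implementation:

```python
class Acceptor:
    def __init__(self, round_no=0, value=None):
        self.round_no, self.value = round_no, value

    def read_state(self):
        return (self.round_no, self.value)

def read_with_consistent_quorum(acceptors, quorum_size):
    replies = [a.read_state() for a in acceptors[:quorum_size]]
    if all(r == replies[0] for r in replies):
        # Consistent quorum: the previous decision is consolidated, so the
        # next consensus value can be proposed immediately, and the read
        # itself required no state modification.
        return "consolidated", replies[0]
    # Mixed replies: the previous decision must be consolidated first;
    # re-propose the state carried by the highest round.
    return "in_progress", max(replies, key=lambda r: r[0])
```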

Proceedings ArticleDOI
TL;DR: SI-HTM is proposed, which stretches the capacity bounds of the underlying HTM, thus opening HTM to a much broader class of applications, and exhibits improved scalability, achieving speedups of up to 300% relative to HTM on in-memory database benchmarks.
Abstract: The hardware transactional memory (HTM) implementations in commercially available processors are significantly hindered by their tight capacity constraints. In practice, this renders current HTMs unsuitable for many real-world workloads of in-memory databases. This paper proposes SI-HTM, which stretches the capacity bounds of the underlying HTM, thus opening HTM to a much broader class of applications. SI-HTM leverages the HTM implementation of the IBM POWER architecture with a software layer to offer a single-version implementation of Snapshot Isolation. When compared to HTM- and software-based concurrency control alternatives, SI-HTM exhibits improved scalability, achieving speedups of up to 300% relative to HTM on in-memory database benchmarks.

Proceedings ArticleDOI
TL;DR: RisGraph proposes a data structure named Indexed Adjacency Lists and uses sparse arrays and a Hybrid Parallel Mode to enable localized data access and inter-update parallelism.
Abstract: Evolving graphs in the real world are large-scale and constantly changing, as hundreds of thousands of updates may arrive every second. Monotonic algorithms such as Reachability and Shortest Path are widely used in real-time analytics to gain both static and temporal insights and can be accelerated by incremental computing. Existing streaming systems adopt the incremental computing model and achieve either low latency or high throughput, but not both. However, both high throughput and low latency are required in real scenarios such as financial fraud detection. This paper presents RisGraph, a real-time streaming system that provides low-latency analysis for each update with high throughput. RisGraph addresses the challenge with localized data access and inter-update parallelism. We propose a data structure named Indexed Adjacency Lists and use sparse arrays and a Hybrid Parallel Mode to enable localized data access. To achieve inter-update parallelism, we propose a domain-specific concurrency control mechanism based on the classification of safe and unsafe updates. Experiments show that RisGraph can ingest millions of updates per second for graphs with several hundred million vertices and billions of edges, and the P999 processing latency is within 20 milliseconds. RisGraph achieves orders-of-magnitude improvement in throughput when analyses are executed for each update without batching and performs better than existing systems with batches of up to 20 million updates.

Book ChapterDOI
Zihao Zhang1, Huiqi Hu1, Yang Yu1, Weining Qian1, Ke Shu 
24 Sep 2020
Abstract: Modern databases are commonly deployed on multiple commercial machines with quorum-based replication to provide high availability and guarantee strong consistency. A widely adopted consensus protocol is Raft, because it is easy to understand and implement. However, Raft's strict serialization limits the concurrency of the system, making it unable to reflect the high concurrent transaction processing capability brought by new hardware and concurrency control technologies. Recognizing this, this work targets improving the parallelism of replication. We propose a variant of the Raft protocol named DP-Raft to support parallel replication of database logs, so that replication can match the speed of transaction execution. Our key contributions are: (1) we define rules for using log dependencies to commit and apply logs out of order; (2) we propose DP-Raft for replicating logs in parallel; DP-Raft preserves log dependencies to ensure the safety of parallel replication and uses additional data structures to reduce the cost of state maintenance; (3) experiments on the YCSB benchmark show that our method improves throughput and reduces the latency of transaction processing compared to existing Raft-based solutions.

Patent
12 May 2020
TL;DR: A transaction execution method and device, computer equipment, and a storage medium in the technical field of databases: query result information is sent to the coordination node equipment and, when a global commit request for the target transaction is received from the coordination node equipment under a target condition, the target transaction is globally committed; the target condition represents that no conflicting transaction of the target transaction exists in the database system, so the concurrency control algorithm need not depend on distributed deadlock detection, avoiding the associated performance loss.
Abstract: The invention discloses a transaction execution method and device, computer equipment, and a storage medium, belonging to the technical field of databases. According to the invention, the node equipment responds to a conflict query request from the coordination node equipment for a target transaction; it queries whether a conflicting transaction of the target transaction exists in the node equipment to obtain query result information, where the operation objects of the conflicting transaction and the target transaction include the same data item; the query result information is sent to the coordination node equipment; and, in response to a global commit request for the target transaction sent by the coordination node equipment under a target condition, the target transaction is globally committed. The target condition represents that no conflicting transaction of the target transaction exists in the database system, so the concurrency control algorithm does not need to depend on distributed deadlock detection; the performance loss caused by distributed deadlocks is avoided, and the transaction execution efficiency of the database system is improved.

Proceedings ArticleDOI
01 Nov 2020
TL;DR: In this article, the authors propose a concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing to achieve an average of 1.52x speedup in six different models over the state of the art.
Abstract: The dedicated memory of hardware accelerators can be insufficient to store all weights and/or intermediate states of large deep learning models. Although model parallelism is a viable approach to reducing memory pressure, it requires significant modification of the source code and considerations for algorithms. An alternative solution is to use out-of-core methods instead of, or in addition to, data parallelism. We propose a performance model based on the concurrency analysis of out-of-core training behavior, and derive a strategy that combines layer swapping and redundant recomputing. We achieve an average of 1.52x speedup in six different models over state-of-the-art out-of-core methods. We also introduce the first method to solve the challenging problem of out-of-core multi-node training by carefully pipelining gradient exchanges and performing the parameter updates on the host. Our data-parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g., Megatron-LM and Turing-NLG.

Posted Content
28 Feb 2020
TL;DR: RCC is the first unified and comprehensive RDMA-enabled distributed transaction processing framework supporting six concurrency control protocols using either two-sided or one-sided primitives; it is used to conduct the first and most comprehensive study of the six representative distributed concurrency control protocols on two clusters with different RDMA network capabilities.
Abstract: On-line transaction processing (OLTP) applications require efficient distributed transaction execution. When a transaction accesses multiple records in remote machines, network performance is a crucial factor affecting transaction latency and throughput. Due to its high bandwidth and very low latency, RDMA (Remote Direct Memory Access) has achieved much higher performance for distributed transactions than traditional TCP-based systems. RDMA provides primitives for both two-sided and one-sided communication. Although recent works have intensively studied the benefits of RDMA in distributed transaction systems, they either focus on primitive-level comparisons of two communication models (one-sided vs. two-sided) or only study one concurrency control protocol. A comprehensive understanding of the implication of RDMA for various concurrency control protocols is an open problem. In this paper, we build RCC, the first unified and comprehensive RDMA-enabled distributed transaction processing framework supporting six concurrency control protocols using either two-sided or one-sided primitives. We intensively optimize the performance of each protocol without bias, using known techniques such as co-routines, outstanding requests, and doorbell batching. Based on RCC, we conduct the first and most comprehensive (to the best of our knowledge) study of the six representative distributed concurrency control protocols on two clusters with different RDMA network capabilities.

Proceedings ArticleDOI
01 Dec 2020
TL;DR: In this article, the authors provide an overview of several in-memory data management systems that are not HTAP systems, some of them are purely transactional, some are purely analytical, and some support real-time analytics.
Abstract: These days, real-time analytics is one of the most frequently used notions in the world of databases. Broadly, this term means very fast analytics over very fresh data. The term usually comes together with other popular terms, hybrid transactional/analytical processing (HTAP) and in-memory data processing. The reason is that the simplest way to provide fresh operational data for analysis is to combine both transactional and analytical processing in one system, and the most effective way to provide fast transactional and analytical processing is to store the entire database in memory. So on the one hand these three terms are related, but on the other hand each has its own right to life. In this paper, we provide an overview of several in-memory data management systems that are not HTAP systems: some are purely transactional, some are purely analytical, and some support real-time analytics. We then overview nine in-memory HTAP DBMSs, some of which do not support real-time analytics. Existing real-time in-memory HTAP DBMSs have very diverse and interesting architectures, although they use a number of common approaches: multiversion concurrency control, multicore parallelization, advanced query optimization, just-in-time compilation, etc. Additionally, we are interested in whether these systems use non-volatile memory and, if so, in what manner. We conclude that the emergence of a new generation of NVM will greatly stimulate its use in in-memory HTAP systems.

Proceedings ArticleDOI
Shady Issa1, Miguel Viegas1, Pedro Raminhas, Nuno Machado2, Miguel Matos1, Paolo Romano1 
01 Nov 2020
TL;DR: Prognosticator employs symbolic execution to build fine-grained transaction profiles, which are then used by a deterministic concurrency control algorithm to execute transactions with a high degree of parallelism.
Abstract: Deterministic databases (DDs) are a promising approach for replicating data across different replicas. A fundamental component of DDs is a deterministic concurrency control algorithm that, given a set of transactions in a specific order, guarantees that their execution always results in the same serial order. State-of-the-art approaches rely either on single-threaded execution or on knowledge of the read- and write-sets of transactions to achieve this goal. The former yields poor performance on multi-core machines, while the latter requires either manual input from the user, a time-consuming and error-prone task, or a reconnaissance phase that increases both the latency and abort rates of transactions. In this paper, we present Prognosticator, a novel deterministic database system. Rather than relying on manual transaction classification or an expert programmer, Prognosticator employs symbolic execution to build fine-grained transaction profiles (at the key level). These profiles are then used by Prognosticator's novel deterministic concurrency control algorithm to execute transactions with a high degree of parallelism. Our experimental evaluation, based on both the TPC-C and RUBiS benchmarks, shows that Prognosticator can achieve up to 5× higher throughput with respect to state-of-the-art solutions.