
Showing papers on "Concurrency control" published in 2019


Proceedings ArticleDOI
08 Jun 2019
TL;DR: Coati, a system that supports event-driven concurrency via interrupts in an intermittent software execution model, is presented and it is shown that Coati prevents failures when interrupts are introduced, while the baseline fails in just seconds.
Abstract: Batteryless energy-harvesting devices are computing platforms that operate in environments where batteries are not viable for energy storage. Energy-harvesting devices operate intermittently, only as energy is available. Prior work developed software execution models robust to intermittent power failures but no existing intermittent execution model allows interrupts to update global persistent state without allowing incorrect behavior or requiring complex programming. We present Coati, a system that supports event-driven concurrency via interrupts in an intermittent software execution model. Coati exposes a task-based interface for synchronous computations and an event interface for asynchronous interrupts. Coati supports synchronizing tasks and events using transactions, which allow for multi-task atomic regions that extend across multiple power failures. This work explores two different models for serializing events and tasks that both safely provide intuitive semantics for event-driven intermittent programs. We implement a prototype of Coati as C language extensions and a runtime library. Using energy-harvesting hardware, we evaluate Coati on benchmarks adapted from prior work. We show that Coati prevents failures when interrupts are introduced, while the baseline fails in just seconds. Moreover, Coati operates with a reasonable run time overhead that is often comparable to an idealized baseline.

60 citations


Journal ArticleDOI
01 Oct 2019
TL;DR: This work proposes a novel garbage collection (GC) approach that prunes obsolete versions eagerly and its seamless integration into the transaction processing keeps the GC overhead minimal and ensures good scalability.
Abstract: To support Hybrid Transaction and Analytical Processing (HTAP), database systems generally rely on Multi-Version Concurrency Control (MVCC). While MVCC elegantly enables lightweight isolation of readers and writers, it also generates outdated tuple versions, which, eventually, have to be reclaimed. Surprisingly, we have found that in HTAP workloads, this reclamation of old versions, i.e., garbage collection, often becomes the performance bottleneck. It turns out that in the presence of long-running queries, state-of-the-art garbage collectors are too coarse-grained. As a consequence, the number of versions grows quickly, slowing down the entire system. Moreover, the standard background cleaning approach makes the system vulnerable to sudden spikes in workloads. In this work, we propose a novel garbage collection (GC) approach that prunes obsolete versions eagerly. Its seamless integration into the transaction processing keeps the GC overhead minimal and ensures good scalability. We show that our approach handles mixed workloads well and also speeds up pure OLTP workloads like TPC-C compared to existing state-of-the-art approaches.

39 citations


Proceedings ArticleDOI
09 Dec 2019
TL;DR: Evaluations show that in general, FabricCRDT offers higher throughput of successful transactions than Fabric, while successfully committing and merging all conflicting transactions without any failures, which is a good sign for scalability.
Abstract: With the increased adoption of blockchain technologies, permissioned blockchains such as Hyperledger Fabric provide a robust ecosystem for developing production-grade decentralized applications. However, the additional latency between executing and committing transactions, due to Fabric's three-phase transaction lifecycle of Execute-Order-Validate (EOV), is a potential scalability bottleneck. The added latency increases the probability of concurrent updates on the same keys by different transactions, leading to transaction failures caused by Fabric's concurrency control mechanism. The transaction failures increase the application development complexity and decrease Fabric's throughput. Conflict-free Replicated Datatypes (CRDTs) provide a solution for merging and resolving conflicts in the presence of concurrent updates. In this work, we introduce FabricCRDT, an approach for integrating CRDTs into Fabric. Our evaluations show that in general, FabricCRDT offers higher throughput of successful transactions than Fabric, while successfully committing and merging all conflicting transactions without any failures.

30 citations
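
The abstract above does not say which CRDT types FabricCRDT uses, so the following is only a generic illustration of how a CRDT merges concurrent updates without rejecting either one. It sketches a grow-only counter (G-Counter); the class name, method names, and replica labels are assumptions made for this example and are not taken from FabricCRDT or Hyperledger Fabric.

```python
from dataclasses import dataclass, field


@dataclass
class GCounter:
    """Grow-only counter CRDT: one slot per replica, merged by element-wise max.

    Hypothetical illustration only; not FabricCRDT's data model.
    """
    counts: dict = field(default_factory=dict)

    def increment(self, replica: str, amount: int = 1) -> None:
        # Each replica only ever increases its own slot.
        self.counts[replica] = self.counts.get(replica, 0) + amount

    def merge(self, other: "GCounter") -> "GCounter":
        # Element-wise max is commutative, associative, and idempotent,
        # so concurrent updates can be merged in any order.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    def value(self) -> int:
        return sum(self.counts.values())


# Two replicas update concurrently, then merge in either order.
a, b = GCounter(), GCounter()
a.increment("peer1", 2)
b.increment("peer2", 3)
merged = a.merge(b)
```

Because the merge function never discards an update, neither concurrent writer has to fail, which is the property the abstract relies on to avoid conflict-induced transaction failures.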


Journal ArticleDOI
01 Oct 2019
TL;DR: The Parallel Binary Tree (P-Tree) index structure is proposed, based on pure (immutable) data structures that use path-copying for updates for fast multi-versioning, to achieve SI and MVCC for multicore in-memory HTAP DBMSs.
Abstract: Modern data-driven applications require that databases support fast analytical queries while undergoing rapid updates, often referred to as Hybrid Transactional Analytical Processing (HTAP). Achieving fast queries and updates in a database management system (DBMS) is challenging since optimizations to improve analytical queries can cause overhead for updates. One solution is to use snapshot isolation (SI) for multi-version concurrency control (MVCC) to allow readers to make progress regardless of concurrent writers. In this paper, we propose the Parallel Binary Tree (P-Tree) index structure to achieve SI and MVCC for multicore in-memory HTAP DBMSs. At their core, P-Trees are based on pure (immutable) data structures that use path-copying for updates for fast multi-versioning. They support tree nesting to improve OLAP performance while still allowing for efficient updates. The data structure also enables parallel algorithms for bulk operations on indexes and their underlying tables. We evaluate P-Trees on OLTP and OLAP benchmarks, and compare them with state-of-the-art data structures and DBMSs. Our experiments show that P-Trees outperform many concurrent data structures for the YCSB workload, and are 4-9x faster than existing DBMSs for analytical queries, while also achieving reasonable throughput for simultaneous transactional updates.

26 citations
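
Path-copying, the mechanism the P-Tree abstract builds on, can be illustrated with a small immutable binary search tree. This is a minimal sketch under simplifying assumptions: the tree is unbalanced and sequential, whereas P-Trees add balancing, nesting, and parallel bulk operations that are not modeled here. `Node`, `insert`, and `lookup` are hypothetical names chosen for the example.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    """Immutable tree node; updates create new nodes instead of mutating."""
    key: int
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int, value: str) -> Node:
    """Return a new root; only nodes on the search path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)  # same key: new version of this node


def lookup(root: Optional[Node], key: int) -> Optional[str]:
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


# v1 is a snapshot; v2 is a newer version that shares untouched subtrees.
v1 = insert(insert(insert(None, 5, "a"), 3, "b"), 8, "c")
v2 = insert(v1, 3, "B")
```

A reader holding `v1` keeps seeing the old value for key 3 while a writer publishes `v2`, which is how immutable versions give snapshot-style reads without blocking writers.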


Book ChapterDOI
07 Dec 2019
TL;DR: A new smart contract platform called “Aplos” based on the Scala functional programming language and Akka actors is designed, which has developed a blockchain for highly scalable storage that aligns with big data requirements.
Abstract: Smart contract is a programming interface to interact with the underlying blockchain storage models. It is a database abstraction layer for blockchain. Existing smart contract platforms follow the imperative style programming model since states are shared. As a result, there is no concurrency control mechanism when executing transactions, resulting in considerable latency and hindering scalability. To address performance and scalability issues of existing smart contract platforms, we design a new smart contract platform called “Aplos” based on the Scala functional programming language and Akka actors. In Aplos, all blockchain-related smart contract functions are implemented with Akka actors. The Aplos platform is built over Mystiko—a highly scalable blockchain storage for big data. Mystiko supports concurrent transactions, high transaction throughput, data analytics and machine learning. With Aplos smart contracts over Mystiko, we have developed a blockchain for highly scalable storage that aligns with big data requirements.

23 citations


Proceedings ArticleDOI
04 Apr 2019
TL;DR: Evaluation results show that MV-RLU significantly outperforms other techniques for a wide range of workloads with varying contention levels and data-set size, and proposes new techniques to make multi-versioning efficient.
Abstract: This paper presents multi-version read-log-update (MV-RLU), an extension of the read-log-update (RLU) synchronization mechanism. While RLU has many merits including an intuitive programming model and excellent performance for read-mostly workloads, we observed that the performance of RLU significantly drops in workloads with more write operations. The core problem is that RLU manages only two versions. To overcome this limitation, we extend RLU to support multi-versioning and propose new techniques to make multi-versioning efficient. At the core of the MV-RLU design is concurrent autonomous garbage collection, which prevents the reclamation of invisible versions from becoming a bottleneck and reduces version traversal overhead, the main overhead of a multi-version design. We extensively evaluate MV-RLU against state-of-the-art synchronization mechanisms, including RCU, RLU, software transactional memory (STM), and lock-free approaches, on concurrent data structures and real-world applications (database concurrency control and in-memory key-value store). Our evaluation results show that MV-RLU significantly outperforms other techniques for a wide range of workloads with varying contention levels and data-set size.

21 citations


Proceedings ArticleDOI
02 Jun 2019
TL;DR: This paper presents a design which enables failure-resilient intermittently-powered systems without runtime checkpointing, which enforces the consistency and serializability of concurrent task execution while maximizing computation progress, as well as allows instant system recovery after power resumption, by leveraging the characteristics of data accessed in hybrid memory.
Abstract: Self-powered intermittent systems enable accumulative execution in unstable power environments, where checkpointing is often adopted as a means to achieve data consistency and system recovery under power failures. However, existing approaches based on the checkpointing paradigm normally require system suspension and/or logging at runtime. This paper presents a design which enables failure-resilient intermittently-powered systems without runtime checkpointing. Our design enforces the consistency and serializability of concurrent task execution while maximizing computation progress, as well as allows instant system recovery after power resumption, by leveraging the characteristics of data accessed in hybrid memory. We integrated the design into FreeRTOS running on a Texas Instruments device. Experimental results show that our design achieves up to 11.8 times the computation progress achieved by checkpointing-based approaches, while reducing the recovery time by nearly 90%. CCS Concepts: Computer systems organization → Embedded software; Reliability. Computing methodologies → Concurrent algorithms.

20 citations


Journal ArticleDOI
01 Jul 2019
TL;DR: It is shown that high availability via replication can coexist with fast serializable transaction execution in distributed in-memory databases, with STAR outperforming systems that employ conventional concurrency control and replication algorithms by up to one order of magnitude.
Abstract: In this paper, we present STAR, a new distributed in-memory database with asymmetric replication. By employing a single-node non-partitioned architecture for some replicas and a partitioned architecture for other replicas, STAR is able to efficiently run both highly partitionable workloads and workloads that involve cross-partition transactions. The key idea is a new phase-switching algorithm where the execution of single-partition and cross-partition transactions is separated. In the partitioned phase, single-partition transactions are run on multiple machines in parallel to exploit more concurrency. In the single-master phase, mastership for the entire database is switched to a single designated master node, which can execute these transactions without the use of expensive coordination protocols like two-phase commit. Because the master node has a full copy of the database, this phase-switching can be done at negligible cost. Our experiments on two popular benchmarks (YCSB and TPC-C) show that high availability via replication can coexist with fast serializable transaction execution in distributed in-memory databases, with STAR outperforming systems that employ conventional concurrency control and replication algorithms by up to one order of magnitude.

18 citations


Journal ArticleDOI
TL;DR: LiveGraph is a graph storage system that supports real-time graph analytics on fresh data by ensuring that adjacency list scans, a key operation in graph workloads, are purely sequential.
Abstract: The specific characteristics of graph workloads make it hard to design a one-size-fits-all graph storage system. Systems that support transactional updates use data structures with poor data locality, which limits the efficiency of analytical workloads or even simple edge scans. Other systems run graph analytics workloads efficiently, but cannot properly support transactions. This paper presents LiveGraph, a graph storage system that outperforms both the best graph transactional systems and the best systems for real-time graph analytics on fresh data. LiveGraph does that by ensuring that adjacency list scans, a key operation in graph workloads, are purely sequential: they never require random accesses even in the presence of concurrent transactions. This is achieved by combining a novel graph-aware data structure, the Transactional Edge Log (TEL), together with a concurrency control mechanism that leverages TEL's data layout. Our evaluation shows that LiveGraph significantly outperforms state-of-the-art (graph) database solutions on both transactional and real-time analytical workloads.

17 citations


Journal ArticleDOI
25 Jun 2019
TL;DR: SolarDB is a distributed relational database system, built on a shared-everything architecture with a two-layer log-structured merge-tree, that targets workloads with cross-partition distributed transactions and has been successfully tested at a large commercial bank.
Abstract: Efficient transaction processing over large databases is a key requirement for many mission-critical applications. Although modern databases have achieved good performance through horizontal partitioning, their performance deteriorates when cross-partition distributed transactions have to be executed. This article presents SolarDB, a distributed relational database system that has been successfully tested at a large commercial bank. The key features of SolarDB include (1) a shared-everything architecture based on a two-layer log-structured merge-tree; (2) a new concurrency control algorithm that works with the log-structured storage, which ensures efficient and non-blocking transaction processing even when the storage layer is compacting data among nodes in the background; and (3) fine-grained data access to effectively minimize and balance network communication within the cluster. According to our empirical evaluations on TPC-C, Smallbank, and a real-world workload, SolarDB outperforms the existing shared-nothing systems by up to 50x when there are close to or more than 5% distributed transactions.

17 citations


DOI
01 Jan 2019
TL;DR: This work proposes a new language, Gallifrey, which provides orthogonal replication through restrictions with merge strategies, contingencies for conflicts arising from concurrency, and branches, a novel concurrency control construct inspired by version control, to contain provisional behavior.
Abstract: Programming efficient distributed, concurrent systems requires new abstractions that go beyond traditional sequential programming. But programmers already have trouble getting sequential code right, so simplicity is essential. The core problem is that low-latency, high-availability access to data requires replication of mutable state. Keeping replicas fully consistent is expensive, so the question is how to expose asynchronously replicated objects to programmers in a way that allows them to reason simply about their code. We propose an answer to this question in our ongoing work designing a new language, Gallifrey, which provides orthogonal replication through _restrictions_ with _merge strategies_, _contingencies_ for conflicts arising from concurrency, and _branches_, a novel concurrency control construct inspired by version control, to contain provisional behavior.

Journal ArticleDOI
01 Jul 2019
TL;DR: This work introduces Ocean Vista - a novel distributed protocol that guarantees strict serializability and outperforms a leading distributed transaction processing engine (TAPIR) more than 10-fold in terms of peak throughput, albeit at the cost of additional latency for gossip.
Abstract: Providing ACID transactions under conflicts across globally distributed data is the Everest of transaction processing protocols. Transaction processing in this scenario is particularly costly due to the high latency of cross-continent network links, which inflates concurrency control and data replication overheads. To mitigate the problem, we introduce Ocean Vista - a novel distributed protocol that guarantees strict serializability. We observe that concurrency control and replication address different aspects of resolving the visibility of transactions, and we address both concerns using a multi-version protocol that tracks visibility using version watermarks and arrives at correct visibility decisions using efficient gossip. Gossiping the watermarks enables asynchronous transaction processing and acknowledging transaction visibility in batches in the concurrency control and replication protocols, which improves efficiency under high cross-datacenter network delays. In particular, Ocean Vista can process conflicting transactions in parallel, and supports efficient write-quorum / read-one access using one round trip in the common case. We demonstrate experimentally in a multi-data-center cloud environment that our design outperforms a leading distributed transaction processing engine (TAPIR) more than 10-fold in terms of peak throughput, albeit at the cost of additional latency for gossip. The latency penalty is generally bounded by one wide area network (WAN) round trip time (RTT), and in the best case (i.e., under light load) our system nearly breaks even with TAPIR by committing transactions in around one WAN RTT.

Proceedings ArticleDOI
08 Apr 2019
TL;DR: This work introduces a novel multi-version concurrency protocol that achieves high performance while reducing the number of aborted schedules to a minimum and providing the best isolation level and shows experimentally that the graph-based scheduler has very competitive throughput in pure transactional workloads while providing fewer aborts and improved user experience.
Abstract: Concurrency control is one of the most performance-critical steps in modern many-core database systems. Achieving higher throughput on multi-socket servers is difficult and many concurrency control algorithms reduce the number of accepted schedules in favor of transaction throughput or relax the isolation level, which introduces unwanted anomalies. Both approaches lead to unexpected transaction behavior that is difficult for database users to understand. We introduce a novel multi-version concurrency protocol that achieves high performance while reducing the number of aborted schedules to a minimum and providing the best isolation level. Our approach leverages the idea of a graph-based scheduler that uses the concept of conflict graphs. As conflict serializable histories can be represented by acyclic conflict graphs, our scheduler maintains the conflict graph and allows all transactions that keep the graph acyclic. All conflict serializable schedules can be accepted by such a graph-based algorithm due to the conflict graph theorem. Hence, only transaction schedules that truly violate the serializability constraints need to abort. Our developed approach is able to accept the useful intersection of commit order preserving conflict serializable (COCSR) and recoverable (RC) schedules which are the two most desirable classes in terms of correctness and user experience. We show experimentally that our graph-based scheduler has very competitive throughput in pure transactional workloads while providing fewer aborts and improved user experience. Our multi-version extension helps to efficiently perform long-running read transactions on the same up-to-date database. Moreover, our graph-based scheduler can outperform the competitors on mixed workloads.
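
The core idea in this abstract, accepting an operation only if the conflict graph stays acyclic, can be sketched as follows. This is a simplified single-threaded illustration, not the paper's scheduler: multi-versioning and the COCSR/RC restrictions are not modeled, and `ConflictGraphScheduler` and `add_conflict` are hypothetical names.

```python
from collections import defaultdict


class ConflictGraphScheduler:
    """Keeps a directed conflict graph and rejects edges that would close a cycle."""

    def __init__(self):
        self.edges = defaultdict(set)  # txn -> txns that must serialize after it

    def _reaches(self, src, dst):
        # Iterative depth-first search from src looking for dst.
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get(node, ()))
        return False

    def add_conflict(self, before, after):
        """Record 'before -> after'; return False if the edge would create a cycle.

        A False result means the schedule is no longer conflict serializable,
        so the caller would abort one of the transactions.
        """
        if before == after:
            return True  # a transaction never conflicts with itself
        if self._reaches(after, before):
            return False
        self.edges[before].add(after)
        return True


sched = ConflictGraphScheduler()
results = [
    sched.add_conflict("T1", "T2"),
    sched.add_conflict("T2", "T3"),
    sched.add_conflict("T3", "T1"),  # would close the cycle T1 -> T2 -> T3 -> T1
    sched.add_conflict("T1", "T3"),  # consistent with the existing order
]
```

Only the third edge is refused, which mirrors the claim that a graph-based scheduler aborts just the schedules that truly violate serializability.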

Proceedings ArticleDOI
25 Mar 2019
TL;DR: This work incorporates DRP into the software transactional objects library STO and finds that DRP improves STO's throughput on several STAMP benchmarks by up to 3.6x and an in-memory multicore database implemented with the modified variant of STO outperforms databases that use OCC or transaction chopping for concurrency control.
Abstract: DRP is a new concurrency control protocol for software transactional memory that achieves high throughput, even for skewed workloads that exhibit high contention. DRP builds on prior works that chop transactions into pieces to expose more concurrency opportunities, but unlike these works, DRP performs no static analyses and supports arbitrary workloads. DRP achieves a high degree of concurrency across most workloads and guarantees deadlock freedom, strict serializability, and opacity. We incorporate DRP into the software transactional objects library STO [18] and find that DRP improves STO's throughput on several STAMP benchmarks by up to 3.6x. Additionally, an in-memory multicore database implemented with our modified variant of STO outperforms databases that use OCC or transaction chopping for concurrency control. Specifically, DRP achieves 6.6x higher throughput than OCC when contention is high. Compared to transaction chopping, our DRP achieves 3.3x higher throughput when contention is medium or low. Furthermore, our implementation achieves comparable performance to OCC and transaction chopping at other contention levels.

Journal ArticleDOI
TL;DR: A dynamic concurrency management framework is developed that integrates the concurrency-aware model to intelligently reallocate soft resources in the system during the system scaling process and achieves significantly shorter tail latency and higher throughput compared to Amazon EC2-AutoScale under all the workload traces.
Abstract: Scaling complex distributed systems such as e-commerce is an important practice to simultaneously achieve high performance and high resource efficiency in the cloud. Most previous research focuses on hardware resource scaling to handle runtime workload variation. Through extensive experiments using a representative n-tier web application benchmark (RUBBoS), we demonstrate that scaling an n-tier system by adding or removing VMs without appropriately re-allocating soft resources (e.g., server threads and connections) may lead to significant performance degradation resulting from implicit change of request processing concurrency in the system, causing either over- or under-utilization of the critical hardware resource in the system. We build a concurrency-aware model that determines a near optimal soft resource allocation of each tier by combining some operational queuing laws and the fine-grained online measurement data of the system. We then develop a dynamic concurrency management (DCM) framework that integrates the concurrency-aware model to intelligently reallocate soft resources in the system during the system scaling process. We compare DCM with Amazon EC2-AutoScale, the state-of-the-art hardware only scaling management solution using six real-world bursty workload traces. The experimental results show that DCM achieves significantly shorter tail latency and higher throughput compared to Amazon EC2-AutoScale under all the workload traces.

Proceedings ArticleDOI
20 May 2019
TL;DR: This paper extends an existing runtime system (the TensorFlow runtime) to enable automatic concurrency control and scheduling of operations and explores performance modeling to predict the performance of operations with various thread-level parallelism.
Abstract: Training a neural network (NN) often uses a machine learning framework such as TensorFlow and Caffe2. These frameworks employ a dataflow model where the NN training is modeled as a directed graph composed of a set of nodes. Operations in NN training are typically implemented by the frameworks as primitives and represented as nodes in the dataflow graph. Training NN models in a dataflow-based machine learning framework involves a large number of fine-grained operations which present diverse memory access patterns and computation intensity. Managing and scheduling those operations is challenging, because we have to decide the number of threads to run each operation (concurrency control) and schedule those operations for good hardware utilization and system throughput. In this paper, we extend an existing runtime system (the TensorFlow runtime) to enable automatic concurrency control and scheduling of operations. We explore performance modeling to predict the performance of operations with various thread-level parallelism. Our performance model is highly accurate and lightweight. Leveraging the performance model, our runtime system employs a set of scheduling strategies that co-run operations to improve hardware utilization and system throughput. Our runtime system demonstrates a significant performance benefit. Compared with using the recommended configurations for concurrency control and operation scheduling in TensorFlow, our approach achieves 36% performance (execution time) improvement on average (up to 49%) for four neural network models, and achieves high performance close to the optimal one manually obtained by the user.

Proceedings ArticleDOI
01 Nov 2019
TL;DR: Experimental results show that the multiversion concurrency control design presented can double computation progress by reducing the runtime overheads incurred by system checkpointing, especially when tasks are executed with high concurrency.
Abstract: Concurrency control allows multiple tasks that share data objects to be concurrently executed in a serializable order, thus significantly improving computation progress. However, to accumulate forward progress on energy-harvesting intermittent systems while achieving data consistency across power cycles, existing approaches based on the checkpointing paradigm typically require system suspension at runtime. The runtime overheads incurred by suspension will be more manifest when more tasks are suspended and resumed during checkpointing, offsetting the computation progress improved by concurrent task execution. This paper presents a multiversion concurrency control design, which enables concurrent task execution without system suspension during checkpointing, while maintaining the serializability of task execution and ensuring data consistency after system recovery. We integrated our design into FreeRTOS running on a Texas Instruments device. Experimental results show that, at the very best, our design can double computation progress by reducing the runtime overheads incurred by system checkpointing, especially when tasks are executed with high concurrency.

Proceedings ArticleDOI
08 Apr 2019
TL;DR: GRIT is able to achieve consistent, high throughput and serializable distributed transactions for any application invoking microservices, by cleverly leveraging deterministic database technologies and an optimistic concurrency control (OCC) protocol.
Abstract: The popular microservice architecture for applications brings new challenges for consistent distributed transactions across multiple microservices. These microservices may be implemented in different languages, and access multiple underlying databases. Consistent distributed transactions are a real requirement but are very hard to achieve with existing technologies in these environments. In this demo we present GRIT: a system that resolves this challenge by cleverly leveraging deterministic database technologies and an optimistic concurrency control (OCC) protocol. A transaction is optimistically executed with its read-set and write-set captured during the execution phase. Then at the commit time, conflict checking is performed and a global commit decision is made. A logically committed transaction is persisted into logs first, and then asynchronously applied to the physical databases deterministically. GRIT is able to achieve consistent, high throughput and serializable distributed transactions for any application invoking microservices. The demonstration offers a walk-through of how GRIT can easily support distributed transactions across multiple microservices and databases.
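
The execute-then-validate flow described in this abstract (capture a read-set and write-set, then check for conflicts at commit time) can be sketched in a few lines. This is a single-process illustration of generic optimistic concurrency control, assuming per-key version numbers; GRIT's distributed commit decision, logging, and deterministic apply step are not modeled, and `Store` and `Transaction` are hypothetical names.

```python
class Store:
    """In-memory key-value store; each key maps to (value, version)."""

    def __init__(self):
        self.data = {}


class Transaction:
    def __init__(self, store):
        self.store = store
        self.read_set = {}   # key -> version observed during execution
        self.write_set = {}  # key -> value buffered until commit

    def read(self, key):
        if key in self.write_set:  # read your own buffered write
            return self.write_set[key]
        value, version = self.store.data.get(key, (None, 0))
        self.read_set.setdefault(key, version)
        return value

    def write(self, key, value):
        self.write_set[key] = value

    def commit(self):
        # Validation: every key read must still be at the version we saw.
        for key, seen in self.read_set.items():
            if self.store.data.get(key, (None, 0))[1] != seen:
                return False  # conflict detected, caller should retry
        # Apply buffered writes, bumping each key's version.
        for key, value in self.write_set.items():
            version = self.store.data.get(key, (None, 0))[1]
            self.store.data[key] = (value, version + 1)
        return True


store = Store()
setup = Transaction(store)
setup.write("x", 1)
setup_ok = setup.commit()

t1, t2 = Transaction(store), Transaction(store)
t1.write("x", t1.read("x") + 1)
t2.write("x", t2.read("x") + 2)
t1_ok = t1.commit()
t2_ok = t2.commit()  # t2 read a version that t1 has since overwritten
```

The second transaction fails validation instead of silently overwriting the first, which is the behavior an optimistic protocol trades for lock-free execution.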

Journal ArticleDOI
01 Sep 2019
TL;DR: A distributed lock table supporting all the standard locking modes used in database engines is developed, focusing on strong consistency in the form of strict serializability implemented through strict 2PL, but also exploring read-committed and repeatable-read, two common isolation levels used in many systems.
Abstract: Concurrency control is a cornerstone of distributed database engines and storage systems. In pursuit of scalability, a common assumption is that Two-Phase Locking (2PL) and Two-Phase Commit (2PC) are not viable solutions due to their communication overhead. Recent results, however, have hinted that 2PL and 2PC might not have such a bad performance. Nevertheless, there has been no attempt to actually measure how a state-of-the-art implementation of 2PL and 2PC would perform on modern hardware. The goal of this paper is to establish a baseline for concurrency control mechanisms on thousands of cores connected through a low-latency network. We develop a distributed lock table supporting all the standard locking modes used in database engines. We focus on strong consistency in the form of strict serializability implemented through strict 2PL, but also explore read-committed and repeatable-read, two common isolation levels used in many systems. We do not leverage any known optimizations in the locking or commit parts of the protocols. The surprising result is that, for TPC-C, 2PL and 2PC can be made to scale to thousands of cores and hundreds of machines, reaching a throughput of over 21 million transactions per second with 9.5 million New Order operations per second. Since most existing relational database engines use some form of locking for implementing concurrency control, our findings provide a path for such systems to scale without having to significantly redesign transaction management. To achieve these results, our implementation relies on Remote Direct Memory Access (RDMA). Today, this technology is commonly available on both Infiniband as well as Ethernet networks, making the results valid across a wide range of systems and platforms, including database appliances, data centers, and cloud environments.

Proceedings ArticleDOI
14 Jul 2019
TL;DR: This work presents a new lock-free approach for providing transaction isolation that harnesses the already existing versioning of key-value pairs in the database, used primarily for a read-write conflict detection during the validation phase, to create a version-based snapshot isolation.
Abstract: Hyperledger Fabric is a distributed operating system for permissioned blockchains hosted by the Linux Foundation. It is the first truly extensible blockchain system for running distributed applications at enterprise grade scale. To achieve this, Hyperledger Fabric introduces a novel execute-order-validate blockchain architecture, allowing parallelization of transaction execution and validation. However, this raises the need for transaction isolation. Today transaction isolation is attained by locking the entire state database during simulation of transactions and database updates. This lock is one of the major performance bottlenecks as observed by previous work. This work presents a new lock-free approach for providing transaction isolation. It harnesses the already existing versioning of key-value pairs in the database, used primarily for a read-write conflict detection during the validation phase, to create a version-based snapshot isolation. We further implement and evaluate our new approach. We show that our solution outperforms the current implementation by 8.1x and that it is comparable to the optimal solution where no isolation mechanism is applied.

Proceedings ArticleDOI
23 Jul 2019
TL;DR: This paper designs the prototype MusaeusDB as a solution for existing database systems, either as an external tool or as an extended SQL interface, and presents the first two architectures for versioning an entire state of a database system with respect to references among multiple relations.
Abstract: As relational database systems do not support collaborative dataset editing, online lexicons, such as Wikipedia's MediaWiki, build their own version control above the database system to allow constraint-preserving version checkouts or commits involving multiple tables. To eliminate the need for purpose-specific solutions, we propose adding version control as a layer on top of the database system or integrating versioning in the database system's core. This paper presents the first two architectures for versioning an entire state of a database system with respect to references among multiple relations. We design the prototype MusaeusDB as a solution for existing database systems, either as an external tool or as an extended SQL interface. The prototype TardisDB, an extended main-memory database system, reuses multi-version concurrency control for in-place updates while keeping older versions accessible. For performance tests on different storage layouts, we create the TardisBenchmark, based on Wikipedia's page history. Our results show that it is indeed feasible to reduce wasted space while still ensuring constant retrieval time. Also, extending a main-memory database system's multi-version concurrency control has no negative impact on the transactional throughput. For further research on database versioning, we offer a flexibly sized benchmark with time-evolving, text-based datasets and compression techniques.

Journal ArticleDOI
01 Aug 2019
TL;DR: The design of the recovery algorithm is described and how it allowed us to improve the availability of Azure SQL Database by guaranteeing consistent recovery times of under 3 minutes for 99.999% of recovery cases in production is demonstrated.
Abstract: Azure SQL Database and the upcoming release of SQL Server introduce a novel database recovery mechanism that combines traditional ARIES recovery with multi-version concurrency control to achieve database recovery in constant time, regardless of the size of user transactions. Additionally, our algorithm enables continuous transaction log truncation, even in the presence of long running transactions, thereby allowing large data modifications using only a small, constant amount of log space. These capabilities are particularly important for any Cloud database service given a) the constantly increasing database sizes, b) the frequent failures of commodity hardware, c) the strict availability requirements of modern, global applications and d) the fact that software upgrades and other maintenance tasks are managed by the Cloud platform, introducing unexpected failures for the users. This paper describes the design of our recovery algorithm and demonstrates how it allowed us to improve the availability of Azure SQL Database by guaranteeing consistent recovery times of under 3 minutes for 99.999% of recovery cases in production.

Book ChapterDOI
19 Jun 2019
TL;DR: In this paper, a starvation-free algorithm for multi-version STM is proposed, which can be used either where the number of versions is unbounded and garbage collection is used or where only the latest K versions are maintained.
Abstract: Software Transactional Memory systems (STMs) have garnered significant interest as an elegant alternative for addressing synchronization and concurrency issues with multi-threaded programming in multi-core systems. Client programs use STMs by issuing transactions. An STM ensures that a transaction either commits or aborts. A transaction aborted due to conflicts is typically re-issued with the expectation that it will complete successfully in a subsequent incarnation. However, many existing STMs fail to provide starvation freedom, i.e., in these systems, it is possible that concurrency conflicts may prevent an incarnated transaction from committing. To overcome this limitation, we systematically derive a novel starvation-free algorithm for multi-version STM. Our algorithm can be used either where the number of versions is unbounded and garbage collection is used or where only the latest K versions are maintained (KSFTM). We have demonstrated that our proposed algorithm performs better than existing state-of-the-art STMs.

Proceedings ArticleDOI
06 Jul 2019
TL;DR: A Non-Database-Operations-Aware Priority Ceiling Protocol (N-DOAP) is proposed for efficient scheduling of resources, resulting in improved schedulability.
Abstract: The operations performed on any complex real-time database system (RTDBS) may be categorized as database and/or non-database operations. The database operations require a lock on data objects to be acquired/released during the concurrent execution of transactions, while the non-database operations use a restrictive strategy of locking dummy data objects to ensure freedom from read and write operations, which affects system performance negatively. Almost no research has been done on the handling of non-database operations, although research in this direction may be useful for improving underlying system performance. Therefore, this paper proposes a Non-Database-Operations-Aware Priority Ceiling Protocol (N-DOAP) for efficient scheduling of resources, resulting in improved schedulability. The performance of the above protocol is assessed considering the aspect of non-database operations.

Journal ArticleDOI
TL;DR: The design of an automatic concurrency control algorithm for implementing power-efficient communications on shared-memory multicores is described, which automatically switches between nonblocking and blocking concurrency protocols, getting the best of both worlds.
Abstract: Continuous streaming computations are usually composed of different modules, exchanging data through shared message queues. The selection of the algorithm used to access such queues (i.e., the concurrency control) is a critical aspect both for performance and power consumption. In this paper, we describe the design of an automatic concurrency control algorithm for implementing power-efficient communications on shared-memory multicores. The algorithm automatically switches between nonblocking and blocking concurrency protocols, getting the best of both worlds, i.e., obtaining the same throughput offered by the nonblocking implementation and the same power efficiency of the blocking concurrency protocol. We demonstrate the effectiveness of our approach using two micro-benchmarks and two real streaming applications.

Posted Content
TL;DR: A novel efficient concurrency control scheme which is the first one to do optimization in both phases and outperforms state-of-the-art solutions significantly.
Abstract: Although the emergence of the programmable smart contract makes blockchain systems easily embrace a wider range of industrial areas, how to execute smart contracts efficiently becomes a big challenge nowadays. Due to the existence of Byzantine nodes, the mechanism of executing smart contracts is quite different from that in database systems, so that existing successful concurrency control protocols in database systems cannot be employed directly. Moreover, even though smart contract execution follows a two-phase style, i.e., the miner node executes a batch of smart contracts in the first phase and the validators replay them in the second phase, existing parallel solutions only focus on the optimization in the first phase, but not including the second phase. In this paper, we propose a novel efficient concurrency control scheme which is the first one to do optimization in both phases. Specifically, (i) in the first phase, we give a variant of the OCC (Optimistic Concurrency Control) protocol based on a batching feature to improve the concurrent execution efficiency for the miner and produce a schedule log with high parallelism for validators. Also, a graph partition algorithm is devised to divide the original schedule log into small pieces and further reduce the communication cost; and (ii) in the second phase, we give a deterministic OCC protocol to replay all smart contracts efficiently on multi-core validators where all cores can replay smart contracts independently. Theoretical analysis and extensive experimental results illustrate that the proposed scheme outperforms state-of-the-art solutions significantly.

Patent
02 Apr 2019
TL;DR: A distributed service data lock implementation method based on Redis is presented; the lock offers high performance, among other advantages, and is well suited to distributed systems.
Abstract: The invention provides a distributed service data lock implementation method based on Redis. The invention relates to the technical field of service data locks. According to the invention, the RedLock algorithm based on Redis is applied to the distributed system; furthermore, the storage control of the service logic lock is realized through the Redis database, and the concurrency control problem of the service logic level is solved by referring to the general idea of the service logic lock, the RedLock distributed lock security algorithm and the Redis database storage in the prior art. The safety of the lock is ensured, and the lock offers high performance, among other advantages, and is well suited to distributed systems.

Journal ArticleDOI
TL;DR: A transactional system consisting of TicToc and the P-WAL logging system, assuming non-volatile memory, was integrated, adopting a parallel write-ahead logging scheme for the recovery system.
Abstract: A transactional system consists of a concurrency control system and a recovery system. TicToc is one of the state-of-the-art concurrency control protocols today, but it lacks a recovery system. We studied ways to integrate TicToc with a recovery system. For efficiency, we adopted a parallel write-ahead logging scheme for the recovery system. There are two methods to optimize the logging. The first method is early lock release, which releases locks on data objects early. The second method is group commit, which transfers log records from memory to storage in batches. We integrated a transactional system consisting of TicToc and the P-WAL logging system, assuming non-volatile memory. We found that the two optimization methods incur performance degradation when storage access latency is equivalent to that of NVRAM.
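
Group commit, the second optimization named in the abstract, can be illustrated with a small Python sketch. It is a single-threaded toy under assumed names (`GroupCommitLog`, `append`, `flush`, and a plain list standing in for storage), not the P-WAL implementation: log records accumulate in a memory buffer and are transferred to storage in one batch once the buffer reaches a configured size.

```python
class GroupCommitLog:
    """Toy write-ahead log that flushes records to storage in batches."""

    def __init__(self, batch_size, storage):
        self.batch_size = batch_size
        self.storage = storage  # a list standing in for persistent storage
        self.buffer = []        # log records not yet transferred
        self.flushes = 0        # number of batched transfers performed

    def append(self, record):
        self.buffer.append(record)
        # Transfer the whole batch once enough records have accumulated.
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        self.storage.extend(self.buffer)  # one transfer for many records
        self.buffer.clear()
        self.flushes += 1
```

Batching amortizes the cost of each storage access over several records, which is why the benefit can shrink or reverse when storage latency is already close to that of memory, as the abstract reports for NVRAM.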

Journal ArticleDOI
Huiqi Hu, Xuan Zhou, Tao Zhu, Weining Qian, Aoying Zhou
TL;DR: A wide spectrum of design and implementation considerations that may affect the efficiency or scalability of an in-memory OLTP system are surveyed, including concurrency control, logging, indexing and transaction compilation.
Abstract: Traditional disk-resident OLTP systems were mainly designed for computers with relatively small memory. Driven by the advance of hardware, OLTP systems need to be redesigned for larger memory and multi-core environments. Compared to disk-resident systems, in-memory systems have significant performance advantages, from the perspectives of both transaction throughput and query latency. Their performance is no longer limited by disk I/Os. Instead, the efficiency and scalability over multi-core CPUs become more important. In this paper, we survey and summarize a wide spectrum of design and implementation considerations that may affect the efficiency or scalability of an in-memory OLTP system. These considerations are concerned with most of the main components of databases, including concurrency control, logging, indexing and transaction compilation. For each of the components, we provide some in-depth analysis based on recent research works. This survey also aims to provide some guidance for designing or implementing high-performance in-memory OLTP systems.

Journal ArticleDOI
TL;DR: This paper presents a design for asynchronous stream generators for Scala, thereby extending previous facilities for asynchronous programming in Scala from tasks/futures to asynchronous streams and contributing a complete formalization of the programming model based on a reduction semantics and a static type system.