Author

Rebecca Taft

Bio: Rebecca Taft is an academic researcher from Massachusetts Institute of Technology. The author has contributed to research in topics: Online transaction processing & Cloud computing. The author has an h-index of 8 and has co-authored 10 publications receiving 411 citations.

Papers
Journal ArticleDOI
01 Nov 2014
TL;DR: E-Store is presented, an elastic partitioning framework for distributed OLTP DBMSs that automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application's workload.
Abstract: On-line transaction processing (OLTP) database management systems (DBMSs) often serve time-varying workloads due to daily, weekly or seasonal fluctuations in demand, or because of rapid growth in demand due to a company's business success. In addition, many OLTP workloads are heavily skewed to "hot" tuples or ranges of tuples. For example, the majority of NYSE volume involves only 40 stocks. To deal with such fluctuations, an OLTP DBMS needs to be elastic; that is, it must be able to expand and contract resources in response to load fluctuations and dynamically balance load as hot tuples vary over time. This paper presents E-Store, an elastic partitioning framework for distributed OLTP DBMSs. It automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application's workload. E-Store addresses localized bottlenecks through a two-tier data placement strategy: cold data is distributed in large chunks, while smaller ranges of hot tuples are assigned explicitly to individual nodes. This is in contrast to traditional single-tier hash and range partitioning strategies. Our experimental evaluation of E-Store shows the viability of our approach and its efficacy under variations in load across a cluster of machines. Compared to single-tier approaches, E-Store improves throughput by up to 130% while reducing latency by 80%.
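To make the two-tier idea concrete, here is a minimal Python sketch of the placement strategy the abstract describes: the most frequently accessed tuples are placed individually on the least-loaded nodes, while cold tuples are spread in large chunks. The function name, thresholds, and data structures are illustrative assumptions, not E-Store's actual implementation.

    from collections import defaultdict

    def two_tier_placement(access_counts, num_nodes, hot_k=100, chunk_size=10_000):
        """access_counts: dict mapping tuple_id -> observed access frequency."""
        placement = {}                      # tuple_id -> node
        load = defaultdict(int)             # node -> assigned access load

        # Tier 1: the hot_k most accessed tuples are assigned individually,
        # each to the currently least-loaded node.
        ranked = sorted(access_counts, key=access_counts.get, reverse=True)
        hot, cold = ranked[:hot_k], ranked[hot_k:]
        for t in hot:
            node = min(range(num_nodes), key=load.__getitem__)
            placement[t] = node
            load[node] += access_counts[t]

        # Tier 2: cold tuples travel in large chunks spread round-robin,
        # since their individual load is negligible.
        for i in range(0, len(cold), chunk_size):
            for t in cold[i:i + chunk_size]:
                placement[t] = (i // chunk_size) % num_nodes
        return placement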

172 citations

Proceedings ArticleDOI
11 Jun 2020
TL;DR: The design of CockroachDB is presented, along with its novel transaction model that supports consistent geo-distributed transactions on commodity hardware; its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect.
Abstract: We live in an increasingly interconnected world, with many organizations operating across countries or even continents. To serve their global user base, organizations are replacing their legacy DBMSs with cloud-based systems capable of scaling OLTP workloads to millions of users. CockroachDB is a scalable SQL DBMS that was built from the ground up to support these global OLTP workloads while maintaining high availability and strong consistency. Just like its namesake, CockroachDB is resilient to disasters through replication and automatic recovery mechanisms. This paper presents the design of CockroachDB and its novel transaction model that supports consistent geo-distributed transactions on commodity hardware. We describe how CockroachDB replicates and distributes data to achieve fault tolerance and high performance, as well as how its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect. Finally, we present a comprehensive performance evaluation and share a couple of case studies of CockroachDB users. We conclude by describing lessons learned while building CockroachDB over the last five years.
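As a small client-side illustration of the standard SQL interface mentioned above: CockroachDB speaks the PostgreSQL wire protocol, so an ordinary driver such as psycopg2 can execute transactions against it. The sketch below uses the common pattern of retrying on serialization failures (SQLSTATE 40001); the connection string, table, and column names are placeholders, not taken from the paper.

    import psycopg2
    import psycopg2.errorcodes

    def transfer(dsn, src, dst, amount, max_retries=5):
        # dsn is a placeholder, e.g. "postgresql://user@localhost:26257/bank"
        conn = psycopg2.connect(dsn)
        try:
            for _ in range(max_retries):
                try:
                    with conn:  # commits on success, rolls back on error
                        with conn.cursor() as cur:
                            cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                        (amount, src))
                            cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                        (amount, dst))
                    return
                except psycopg2.Error as e:
                    # Retry only on serialization conflicts; re-raise anything else.
                    if e.pgcode != psycopg2.errorcodes.SERIALIZATION_FAILURE:
                        raise
            raise RuntimeError("transaction did not commit after retries")
        finally:
            conn.close()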

127 citations

Journal ArticleDOI
01 Nov 2016
TL;DR: A new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships is presented and it is shown that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
Abstract: Transaction processing database management systems (DBMSs) are critical for today's data-intensive applications because they enable an organization to quickly ingest and query new information. Many of these applications exceed the capabilities of a single server, and thus their database has to be deployed in a distributed DBMS. The key factor affecting such a system's performance is how the database is partitioned. If the database is partitioned incorrectly, the number of distributed transactions can be high. These transactions have to synchronize their operations over the network, which is considerably slower and leads to poor performance. Previous work on elastic database repartitioning has focused on a certain class of applications whose database schema can be represented in a hierarchical tree structure. But many applications cannot be partitioned in this manner, and thus are subject to distributed transactions that impede their performance and scalability. In this paper, we present a new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships. Clay dynamically creates blocks of tuples to migrate among servers during repartitioning, placing no constraints on the schema but taking care to balance load and reduce the amount of data migrated. Clay achieves this goal by including in each block a set of hot tuples and other tuples co-accessed with these hot tuples. To evaluate our approach, we integrate Clay in a distributed, main-memory DBMS and show that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
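The following sketch illustrates the block-building idea described above, assuming the system tracks per-tuple co-access counts: start from a hot tuple, greedily add tuples frequently co-accessed with it, and send the block to the least-loaded server. It is an illustration of the concept, not Clay's actual algorithm.

    def build_block(hot_tuple, co_access, max_block_size=50):
        """co_access: dict mapping tuple_id -> {neighbor_id: co-access count}."""
        block, frontier = {hot_tuple}, [hot_tuple]
        while frontier and len(block) < max_block_size:
            t = frontier.pop()
            # Prefer the neighbors most often accessed together with t, so that
            # transactions touching the block stay single-partition after the move.
            for neighbor, _ in sorted(co_access.get(t, {}).items(),
                                      key=lambda kv: kv[1], reverse=True):
                if len(block) >= max_block_size:
                    break
                if neighbor not in block:
                    block.add(neighbor)
                    frontier.append(neighbor)
        return block

    def choose_destination(partition_load):
        # A simplification: send the block to the least-loaded partition.
        return min(partition_load, key=partition_load.get)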

82 citations

Proceedings ArticleDOI
27 May 2015
TL;DR: Squall, a technique for live reconfiguration in partitioned, main-memory DBMSs, supports fine-grained repartitioning of databases in the presence of distributed transactions, high-throughput client workloads, and replicated data.
Abstract: For data-intensive applications with many concurrent users, modern distributed main memory database management systems (DBMSs) provide the necessary scale-out support beyond what is possible with single-node systems. These DBMSs are optimized for the short-lived transactions that are common in on-line transaction processing (OLTP) workloads. One way that they achieve this is to partition the database into disjoint subsets and use a single-threaded transaction manager per partition that executes transactions one-at-a-time in serial order. This minimizes the overhead of concurrency control mechanisms, but requires careful partitioning to limit distributed transactions that span multiple partitions. Previous methods used off-line analysis to determine how to partition data, but the dynamic nature of these applications means that they are prone to hotspots. In these situations, the DBMS needs to reconfigure how data is partitioned in real-time to maintain performance objectives. Bringing the system off-line to reorganize the database is unacceptable for on-line applications. To overcome this problem, we introduce the Squall technique for supporting live reconfiguration in partitioned, main memory DBMSs. Squall supports fine-grained repartitioning of databases in the presence of distributed transactions, high-throughput client workloads, and replicated data. An evaluation of our approach on a distributed DBMS shows that Squall can reconfigure a database with no downtime and minimal overhead on transaction latency.
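The sketch below shows, in simplified form, what live reconfiguration means operationally: the new partition plan takes effect immediately, tuples migrate lazily in small background batches, and a transaction that touches a not-yet-moved tuple pulls it over on demand. The class and method names are assumptions for illustration, not Squall's interface.

    class Reconfigurer:
        def __init__(self, old_plan, new_plan, data):
            # old_plan / new_plan: key -> partition id; data: partition id -> {key: value}
            self.old_plan, self.new_plan, self.data = old_plan, new_plan, data
            self.migrated = set()

        def locate(self, key):
            """Route a transaction's access; pull the tuple over if it has not moved yet."""
            if self.new_plan[key] == self.old_plan[key] or key in self.migrated:
                return self.new_plan[key]
            self._pull(key)                 # reactive, on-demand migration
            return self.new_plan[key]

        def _pull(self, key):
            src, dst = self.old_plan[key], self.new_plan[key]
            self.data[dst][key] = self.data[src].pop(key)
            self.migrated.add(key)

        def background_step(self, batch=1000):
            """Asynchronously migrate a small batch so reconfiguration eventually completes."""
            pending = [k for k in self.new_plan
                       if self.new_plan[k] != self.old_plan[k] and k not in self.migrated]
            for key in pending[:batch]:
                self._pull(key)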

69 citations

Proceedings ArticleDOI
18 Jun 2014
TL;DR: A new benchmark designed to test database management system (DBMS) performance on a mix of data management tasks and complex analytics (regression, singular value decomposition, etc.) is introduced.
Abstract: This paper introduces a new benchmark designed to test database management system (DBMS) performance on a mix of data management tasks (joins, filters, etc.) and complex analytics (regression, singular value decomposition, etc.). Such mixed workloads are prevalent in a number of application areas including most science workloads and web analytics. As a specific use case, we have chosen genomics data for our benchmark and have constructed a collection of typical tasks in this domain. In addition to being representative of a mixed data management and analytics workload, this benchmark is also meant to scale to large dataset sizes and multiple nodes across a cluster. Besides presenting this benchmark, we have run it on a variety of storage systems including traditional row stores, newer column stores, Hadoop, and an array DBMS. We present performance numbers on all systems on single and multiple nodes, and show that performance differs by orders of magnitude between the various solutions. In addition, we demonstrate that most platforms have scalability issues. We also test offloading the analytics onto a coprocessor. The intent of this benchmark is to focus research interest in this area; to this end, all of our data, data generators, and scripts are available on our web site.
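As a toy example of the kind of mixed task the benchmark targets, the snippet below combines a relational step (filter and join) with an analytics step (singular value decomposition) using pandas and NumPy; the tables and column names are invented for illustration and are not part of the benchmark.

    import numpy as np
    import pandas as pd

    samples = pd.DataFrame({"sample_id": range(6), "cohort": ["a", "b"] * 3})
    expr = pd.DataFrame(np.random.rand(6, 4), columns=[f"gene_{i}" for i in range(4)])
    expr["sample_id"] = range(6)

    # Data management step: filter one cohort and join it with its expression rows.
    cohort_a = samples[samples["cohort"] == "a"].merge(expr, on="sample_id")

    # Analytics step: singular value decomposition of the resulting matrix.
    matrix = cohort_a.drop(columns=["sample_id", "cohort"]).to_numpy()
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    print("singular values:", s)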

47 citations


Cited by
01 Mar 2001
TL;DR: Using singular value decomposition in transforming genome-wide expression data from genes × arrays space to reduced diagonalized "eigengenes" × "eigenarrays" space gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype.
Abstract: We describe the use of singular value decomposition in transforming genome-wide expression data from genes × arrays space to reduced diagonalized "eigengenes" × "eigenarrays" space, where the eigengenes (or eigenarrays) are unique orthonormal superpositions of the genes (or arrays). Normalizing the data by filtering out the eigengenes (and eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in different experiments. Sorting the data according to the eigengenes and eigenarrays gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. After normalization and sorting, the significant eigengenes and eigenarrays can be associated with observed genome-wide effects of regulators, or with measured samples, in which these regulators are overactive or underactive, respectively.
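A small worked example of the decomposition described above, using NumPy on a random genes-by-arrays matrix as a stand-in for real expression data: the rows of Vᵀ play the role of eigengenes, the columns of U the role of eigenarrays, and filtering amounts to keeping only the dominant singular components.

    import numpy as np

    rng = np.random.default_rng(0)
    expression = rng.normal(size=(1000, 12))        # genes x arrays

    # SVD: rows of vt are the "eigengenes", columns of u are the "eigenarrays".
    u, s, vt = np.linalg.svd(expression, full_matrices=False)

    # Fraction of overall expression captured by each eigengene; filtering out
    # eigengenes inferred to be noise corresponds to dropping small entries here.
    fraction = s**2 / np.sum(s**2)
    denoised = (u[:, :3] * s[:3]) @ vt[:3, :]       # keep the top 3 eigengenes
    print(fraction[:3], denoised.shape)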

1,815 citations

Proceedings ArticleDOI
14 Oct 2017
TL;DR: This work presents NetCache, a new key-value store architecture that leverages the power and flexibility of new-generation programmable switches to handle queries on hot items and balance the load across storage nodes, and shows that it improves the throughput by 3-10x and reduces the latency of up to 40% of queries by 50%, for high-performance, in-memory key-value stores.
Abstract: We present NetCache, a new key-value store architecture that leverages the power and flexibility of new-generation programmable switches to handle queries on hot items and balance the load across storage nodes. NetCache provides high aggregate throughput and low latency even under highly-skewed and rapidly-changing workloads. The core of NetCache is a packet-processing pipeline that exploits the capabilities of modern programmable switch ASICs to efficiently detect, index, cache and serve hot key-value items in the switch data plane. Additionally, our solution guarantees cache coherence with minimal overhead. We implement a NetCache prototype on Barefoot Tofino switches and commodity servers and demonstrate that a single switch can process 2+ billion queries per second for 64K items with 16-byte keys and 128-byte values, while only consuming a small portion of its hardware resources. To the best of our knowledge, this is the first time that a sophisticated application-level functionality, such as in-network caching, has been shown to run at line rate on programmable switches. Furthermore, we show that NetCache improves the throughput by 3-10x and reduces the latency of up to 40% of queries by 50%, for high-performance, in-memory key-value stores.
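The real system implements this logic in a programmable switch ASIC, but the caching idea can be modeled in a few lines of Python: count key popularity, admit the hottest keys to a small cache that serves reads, and invalidate on writes so the cache stays coherent with the storage nodes. The class, threshold, and eviction policy below are simplifications for illustration, not NetCache's data-plane design.

    from collections import Counter

    class HotItemCache:
        def __init__(self, store, capacity=64, hot_threshold=10):
            self.store, self.capacity, self.hot_threshold = store, capacity, hot_threshold
            self.counts, self.cache = Counter(), {}

        def get(self, key):
            self.counts[key] += 1
            if key in self.cache:
                return self.cache[key]              # served from the "switch" cache
            value = self.store[key]                 # falls through to a storage node
            if self.counts[key] >= self.hot_threshold:
                self._admit(key, value)
            return value

        def put(self, key, value):
            self.store[key] = value
            self.cache.pop(key, None)               # invalidate to keep the cache coherent

        def _admit(self, key, value):
            if len(self.cache) >= self.capacity:    # evict the least popular cached key
                del self.cache[min(self.cache, key=self.counts.get)]
            self.cache[key] = value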

437 citations

Proceedings Article
16 Mar 2016
TL;DR: Ernest, a performance prediction framework for large-scale analytics, is presented; evaluation on Amazon EC2 using several workloads shows that the prediction error is low while the training overhead is less than 5% for long-running jobs.
Abstract: Recent workload trends indicate rapid growth in the deployment of machine learning, genomics and scientific workloads on cloud computing infrastructure. However, efficiently running these applications on shared infrastructure is challenging, and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to addressing this challenge is the ability to predict the performance of applications under various resource configurations so that we can automatically choose the optimal configuration. Our insight is that a number of jobs have predictable structure in terms of computation and communication. Thus we can build performance models based on the behavior of the job on small samples of data and then predict its performance on larger datasets and cluster sizes. To minimize the time and resources spent in building a model, we use optimal experiment design, a statistical technique that allows us to collect as few training points as required. We have built Ernest, a performance prediction framework for large-scale analytics, and our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while having a training overhead of less than 5% for long-running jobs.
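To illustrate the modeling approach, the sketch below fits a small parametric runtime model from a handful of training runs and extrapolates to a larger configuration. The feature terms (a serial term, a parallel term that shrinks with more machines, and communication terms that grow with cluster size) follow the spirit of the paper, but the exact form, the non-negative least-squares fit, and the sample numbers are illustrative approximations.

    import numpy as np
    from scipy.optimize import nnls

    def features(scale, machines):
        return np.array([1.0, scale / machines, np.log2(machines), float(machines)])

    # Toy training points: (fraction of input data, number of machines, runtime in seconds).
    runs = [(0.05, 2, 30.0), (0.05, 4, 18.0), (0.10, 4, 33.0), (0.10, 8, 21.0)]
    X = np.array([features(s, m) for s, m, _ in runs])
    y = np.array([t for _, _, t in runs])

    theta, _ = nnls(X, y)                    # non-negative least-squares fit
    predicted = features(1.0, 64) @ theta    # predict the full dataset on 64 machines
    print(predicted)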

401 citations

Journal ArticleDOI
TL;DR: This survey aims to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks.
Abstract: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bound disk-based systems. Some issues such as fault tolerance and consistency are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as their data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important technology in memory management, and some key factors that need to be considered in order to achieve efficient in-memory data management and processing.

391 citations

Journal ArticleDOI
12 Aug 2015
TL;DR: In this paper, a new view of federated databases is presented to address the growing need for managing information that spans multiple data models, and the authors propose a polystore architecture designed to unify querying over multiple data models.
Abstract: This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by the proliferation of storage engines and query languages based on the observation that 'no one size fits all'. To address this shift, we propose a polystore architecture; it is designed to unify querying over multiple data models. We consider the challenges and opportunities associated with polystores. Open questions in this space revolve around query optimization and the assignment of objects to storage engines. We introduce our approach to these topics and discuss our prototype in the context of the Intel Science and Technology Center for Big Data.
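As a toy illustration of the routing problem a polystore must solve, the sketch below registers multiple engines and dispatches each query to the engine that holds the referenced object. The catalog, engine names, and query format are invented for illustration and do not reflect the prototype's actual interface.

    class Polystore:
        def __init__(self):
            self.engines = {}       # engine name -> callable that executes a query
            self.catalog = {}       # object name -> engine name

        def register(self, name, executor, objects):
            self.engines[name] = executor
            for obj in objects:
                self.catalog[obj] = name

        def query(self, obj, text):
            # The abstract's open question is how to assign objects to engines;
            # here we simply look the object up in a static catalog.
            return self.engines[self.catalog[obj]](text)

    # Usage: route relational queries to one engine and array queries to another.
    store = Polystore()
    store.register("relational", lambda q: f"sql:{q}", ["patients"])
    store.register("array", lambda q: f"scidb:{q}", ["genomes"])
    print(store.query("genomes", "filter(genomes, quality > 30)"))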

244 citations