Author

Rebecca Taft

Bio: Rebecca Taft is an academic researcher from Massachusetts Institute of Technology. The author has contributed to research in topics: Online transaction processing & Cloud computing. The author has an h-index of 8 and has co-authored 10 publications receiving 411 citations.

Papers
Journal ArticleDOI
01 Nov 2014
TL;DR: E-Store is presented, an elastic partitioning framework for distributed OLTP DBMSs that automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application's workload.
Abstract: On-line transaction processing (OLTP) database management systems (DBMSs) often serve time-varying workloads due to daily, weekly or seasonal fluctuations in demand, or because of rapid growth in demand due to a company's business success. In addition, many OLTP workloads are heavily skewed to "hot" tuples or ranges of tuples. For example, the majority of NYSE volume involves only 40 stocks. To deal with such fluctuations, an OLTP DBMS needs to be elastic; that is, it must be able to expand and contract resources in response to load fluctuations and dynamically balance load as hot tuples vary over time. This paper presents E-Store, an elastic partitioning framework for distributed OLTP DBMSs. It automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application's workload. E-Store addresses localized bottlenecks through a two-tier data placement strategy: cold data is distributed in large chunks, while smaller ranges of hot tuples are assigned explicitly to individual nodes. This is in contrast to traditional single-tier hash and range partitioning strategies. Our experimental evaluation of E-Store shows the viability of our approach and its efficacy under variations in load across a cluster of machines. Compared to single-tier approaches, E-Store improves throughput by up to 130% while reducing latency by 80%.
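To make the two-tier idea concrete, here is a minimal Python sketch of the placement strategy the abstract describes: the most frequently accessed tuples are placed individually on the least-loaded nodes, while cold tuples are spread in large chunks. The function name, thresholds, and data structures are illustrative assumptions, not E-Store's actual implementation.

    from collections import defaultdict

    def two_tier_placement(access_counts, num_nodes, hot_k=100, chunk_size=10_000):
        """access_counts: dict mapping tuple_id -> observed access frequency."""
        placement = {}                      # tuple_id -> node
        load = defaultdict(int)             # node -> assigned access load

        # Tier 1: the hot_k most accessed tuples are assigned individually,
        # each to the currently least-loaded node.
        ranked = sorted(access_counts, key=access_counts.get, reverse=True)
        hot, cold = ranked[:hot_k], ranked[hot_k:]
        for t in hot:
            node = min(range(num_nodes), key=load.__getitem__)
            placement[t] = node
            load[node] += access_counts[t]

        # Tier 2: cold tuples travel in large chunks spread round-robin,
        # since their individual load is negligible.
        for i in range(0, len(cold), chunk_size):
            for t in cold[i:i + chunk_size]:
                placement[t] = (i // chunk_size) % num_nodes
        return placement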

172 citations

Proceedings ArticleDOI
11 Jun 2020
TL;DR: The design of CockroachDB is presented, along with its novel transaction model that supports consistent geo-distributed transactions on commodity hardware; its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect.
Abstract: We live in an increasingly interconnected world, with many organizations operating across countries or even continents. To serve their global user base, organizations are replacing their legacy DBMSs with cloud-based systems capable of scaling OLTP workloads to millions of users. CockroachDB is a scalable SQL DBMS that was built from the ground up to support these global OLTP workloads while maintaining high availability and strong consistency. Just like its namesake, CockroachDB is resilient to disasters through replication and automatic recovery mechanisms. This paper presents the design of CockroachDB and its novel transaction model that supports consistent geo-distributed transactions on commodity hardware. We describe how CockroachDB replicates and distributes data to achieve fault tolerance and high performance, as well as how its distributed SQL layer automatically scales with the size of the database cluster while providing the standard SQL interface that users expect. Finally, we present a comprehensive performance evaluation and share a couple of case studies of CockroachDB users. We conclude by describing lessons learned while building CockroachDB over the last five years.
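As a small client-side illustration of the standard SQL interface mentioned above: CockroachDB speaks the PostgreSQL wire protocol, so an ordinary driver such as psycopg2 can execute transactions against it. The sketch below uses the common pattern of retrying on serialization failures (SQLSTATE 40001); the connection string, table, and column names are placeholders, not taken from the paper.

    import psycopg2
    import psycopg2.errorcodes

    def transfer(dsn, src, dst, amount, max_retries=5):
        # dsn is a placeholder, e.g. "postgresql://user@localhost:26257/bank"
        conn = psycopg2.connect(dsn)
        try:
            for _ in range(max_retries):
                try:
                    with conn:  # commits on success, rolls back on error
                        with conn.cursor() as cur:
                            cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                        (amount, src))
                            cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                        (amount, dst))
                    return
                except psycopg2.Error as e:
                    # Retry only on serialization conflicts; re-raise anything else.
                    if e.pgcode != psycopg2.errorcodes.SERIALIZATION_FAILURE:
                        raise
            raise RuntimeError("transaction did not commit after retries")
        finally:
            conn.close()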

127 citations

Journal ArticleDOI
01 Nov 2016
TL;DR: A new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships is presented and it is shown that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
Abstract: Transaction processing database management systems (DBMSs) are critical for today's data-intensive applications because they enable an organization to quickly ingest and query new information. Many of these applications exceed the capabilities of a single server, and thus their database has to be deployed in a distributed DBMS. The key factor affecting such a system's performance is how the database is partitioned. If the database is partitioned incorrectly, the number of distributed transactions can be high. These transactions have to synchronize their operations over the network, which is considerably slower and leads to poor performance. Previous work on elastic database repartitioning has focused on a certain class of applications whose database schema can be represented in a hierarchical tree structure. But many applications cannot be partitioned in this manner, and thus are subject to distributed transactions that impede their performance and scalability. In this paper, we present a new on-line partitioning approach, called Clay, that supports both tree-based schemas and more complex "general" schemas with arbitrary foreign key relationships. Clay dynamically creates blocks of tuples to migrate among servers during repartitioning, placing no constraints on the schema but taking care to balance load and reduce the amount of data migrated. Clay achieves this goal by including in each block a set of hot tuples and other tuples co-accessed with these hot tuples. To evaluate our approach, we integrate Clay in a distributed, main-memory DBMS and show that it can generate partitioning schemes that enable the system to achieve up to 15× better throughput and 99% lower latency than existing approaches.
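The following sketch illustrates the block-building idea described above, assuming the system tracks per-tuple co-access counts: start from a hot tuple, greedily add tuples frequently co-accessed with it, and send the block to the least-loaded server. It is an illustration of the concept, not Clay's actual algorithm.

    def build_block(hot_tuple, co_access, max_block_size=50):
        """co_access: dict mapping tuple_id -> {neighbor_id: co-access count}."""
        block, frontier = {hot_tuple}, [hot_tuple]
        while frontier and len(block) < max_block_size:
            t = frontier.pop()
            # Prefer the neighbors most often accessed together with t, so that
            # transactions touching the block stay single-partition after the move.
            for neighbor, _ in sorted(co_access.get(t, {}).items(),
                                      key=lambda kv: kv[1], reverse=True):
                if len(block) >= max_block_size:
                    break
                if neighbor not in block:
                    block.add(neighbor)
                    frontier.append(neighbor)
        return block

    def choose_destination(partition_load):
        # A simplification: send the block to the least-loaded partition.
        return min(partition_load, key=partition_load.get)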

82 citations

Proceedings ArticleDOI
27 May 2015
TL;DR: Squall, a technique for live reconfiguration in partitioned, main-memory DBMSs, supports fine-grained repartitioning of databases in the presence of distributed transactions, high-throughput client workloads, and replicated data.
Abstract: For data-intensive applications with many concurrent users, modern distributed main memory database management systems (DBMSs) provide the necessary scale-out support beyond what is possible with single-node systems. These DBMSs are optimized for the short-lived transactions that are common in on-line transaction processing (OLTP) workloads. One way that they achieve this is to partition the database into disjoint subsets and use a single-threaded transaction manager per partition that executes transactions one-at-a-time in serial order. This minimizes the overhead of concurrency control mechanisms, but requires careful partitioning to limit distributed transactions that span multiple partitions. Previous methods used off-line analysis to determine how to partition data, but the dynamic nature of these applications means that they are prone to hotspots. In these situations, the DBMS needs to reconfigure how data is partitioned in real-time to maintain performance objectives. Bringing the system off-line to reorganize the database is unacceptable for on-line applications. To overcome this problem, we introduce the Squall technique for supporting live reconfiguration in partitioned, main memory DBMSs. Squall supports fine-grained repartitioning of databases in the presence of distributed transactions, high-throughput client workloads, and replicated data. An evaluation of our approach on a distributed DBMS shows that Squall can reconfigure a database with no downtime and minimal overhead on transaction latency.
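The sketch below shows, in simplified form, what live reconfiguration means operationally: the new partition plan takes effect immediately, tuples migrate lazily in small background batches, and a transaction that touches a not-yet-moved tuple pulls it over on demand. The class and method names are assumptions for illustration, not Squall's interface.

    class Reconfigurer:
        def __init__(self, old_plan, new_plan, data):
            # old_plan / new_plan: key -> partition id; data: partition id -> {key: value}
            self.old_plan, self.new_plan, self.data = old_plan, new_plan, data
            self.migrated = set()

        def locate(self, key):
            """Route a transaction's access; pull the tuple over if it has not moved yet."""
            if self.new_plan[key] == self.old_plan[key] or key in self.migrated:
                return self.new_plan[key]
            self._pull(key)                 # reactive, on-demand migration
            return self.new_plan[key]

        def _pull(self, key):
            src, dst = self.old_plan[key], self.new_plan[key]
            self.data[dst][key] = self.data[src].pop(key)
            self.migrated.add(key)

        def background_step(self, batch=1000):
            """Asynchronously migrate a small batch so reconfiguration eventually completes."""
            pending = [k for k in self.new_plan
                       if self.new_plan[k] != self.old_plan[k] and k not in self.migrated]
            for key in pending[:batch]:
                self._pull(key)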

69 citations

Proceedings ArticleDOI
18 Jun 2014
TL;DR: A new benchmark designed to test database management system (DBMS) performance on a mix of data management tasks and complex analytics (regression, singular value decomposition, etc.) is introduced.
Abstract: This paper introduces a new benchmark designed to test database management system (DBMS) performance on a mix of data management tasks (joins, filters, etc.) and complex analytics (regression, singular value decomposition, etc.). Such mixed workloads are prevalent in a number of application areas including most science workloads and web analytics. As a specific use case, we have chosen genomics data for our benchmark and have constructed a collection of typical tasks in this domain. In addition to being representative of a mixed data management and analytics workload, this benchmark is also meant to scale to large dataset sizes and multiple nodes across a cluster. Besides presenting this benchmark, we have run it on a variety of storage systems including traditional row stores, newer column stores, Hadoop, and an array DBMS. We present performance numbers on all systems on single and multiple nodes, and show that performance differs by orders of magnitude between the various solutions. In addition, we demonstrate that most platforms have scalability issues. We also test offloading the analytics onto a coprocessor. The intent of this benchmark is to focus research interest in this area; to this end, all of our data, data generators, and scripts are available on our web site.
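As a toy example of the kind of mixed task the benchmark targets, the snippet below combines a relational step (filter and join) with an analytics step (singular value decomposition) using pandas and NumPy; the tables and column names are invented for illustration and are not part of the benchmark.

    import numpy as np
    import pandas as pd

    samples = pd.DataFrame({"sample_id": range(6), "cohort": ["a", "b"] * 3})
    expr = pd.DataFrame(np.random.rand(6, 4), columns=[f"gene_{i}" for i in range(4)])
    expr["sample_id"] = range(6)

    # Data management step: filter one cohort and join it with its expression rows.
    cohort_a = samples[samples["cohort"] == "a"].merge(expr, on="sample_id")

    # Analytics step: singular value decomposition of the resulting matrix.
    matrix = cohort_a.drop(columns=["sample_id", "cohort"]).to_numpy()
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    print("singular values:", s)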

47 citations


Cited by
01 Mar 2001
TL;DR: Using singular value decomposition in transforming genome-wide expression data from genes × arrays space to reduced diagonalized "eigengenes" × "eigenarrays" space gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype.
Abstract: We describe the use of singular value decomposition in transforming genome-wide expression data from genes × arrays space to reduced diagonalized "eigengenes" × "eigenarrays" space, where the eigengenes (or eigenarrays) are unique orthonormal superpositions of the genes (or arrays). Normalizing the data by filtering out the eigengenes (and eigenarrays) that are inferred to represent noise or experimental artifacts enables meaningful comparison of the expression of different genes across different arrays in different experiments. Sorting the data according to the eigengenes and eigenarrays gives a global picture of the dynamics of gene expression, in which individual genes and arrays appear to be classified into groups of similar regulation and function, or similar cellular state and biological phenotype, respectively. After normalization and sorting, the significant eigengenes and eigenarrays can be associated with observed genome-wide effects of regulators, or with measured samples, in which these regulators are overactive or underactive, respectively.
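A small worked example of the decomposition described above, using NumPy on a random genes-by-arrays matrix as a stand-in for real expression data: the rows of Vᵀ play the role of eigengenes, the columns of U the role of eigenarrays, and filtering amounts to keeping only the dominant singular components.

    import numpy as np

    rng = np.random.default_rng(0)
    expression = rng.normal(size=(1000, 12))        # genes x arrays

    # SVD: rows of vt are the "eigengenes", columns of u are the "eigenarrays".
    u, s, vt = np.linalg.svd(expression, full_matrices=False)

    # Fraction of overall expression captured by each eigengene; filtering out
    # eigengenes inferred to be noise corresponds to dropping small entries here.
    fraction = s**2 / np.sum(s**2)
    denoised = (u[:, :3] * s[:3]) @ vt[:3, :]       # keep the top 3 eigengenes
    print(fraction[:3], denoised.shape)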

1,815 citations

Proceedings ArticleDOI
14 Oct 2017
TL;DR: This work presents NetCache, a new key-value store architecture that leverages the power and flexibility of new-generation programmable switches to handle queries on hot items and balance the load across storage nodes, and shows that it improves the throughput by 3-10x and reduces the latency of up to 40% of queries by 50%, for high-performance, in-memory key-value stores.
Abstract: We present NetCache, a new key-value store architecture that leverages the power and flexibility of new-generation programmable switches to handle queries on hot items and balance the load across storage nodes. NetCache provides high aggregate throughput and low latency even under highly-skewed and rapidly-changing workloads. The core of NetCache is a packet-processing pipeline that exploits the capabilities of modern programmable switch ASICs to efficiently detect, index, cache and serve hot key-value items in the switch data plane. Additionally, our solution guarantees cache coherence with minimal overhead. We implement a NetCache prototype on Barefoot Tofino switches and commodity servers and demonstrate that a single switch can process 2+ billion queries per second for 64K items with 16-byte keys and 128-byte values, while only consuming a small portion of its hardware resources. To the best of our knowledge, this is the first time that a sophisticated application-level functionality, such as in-network caching, has been shown to run at line rate on programmable switches. Furthermore, we show that NetCache improves the throughput by 3-10x and reduces the latency of up to 40% of queries by 50%, for high-performance, in-memory key-value stores.
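The real system implements this logic in a programmable switch ASIC, but the caching idea can be modeled in a few lines of Python: count key popularity, admit the hottest keys to a small cache that serves reads, and invalidate on writes so the cache stays coherent with the storage nodes. The class, threshold, and eviction policy below are simplifications for illustration, not NetCache's data-plane design.

    from collections import Counter

    class HotItemCache:
        def __init__(self, store, capacity=64, hot_threshold=10):
            self.store, self.capacity, self.hot_threshold = store, capacity, hot_threshold
            self.counts, self.cache = Counter(), {}

        def get(self, key):
            self.counts[key] += 1
            if key in self.cache:
                return self.cache[key]              # served from the "switch" cache
            value = self.store[key]                 # falls through to a storage node
            if self.counts[key] >= self.hot_threshold:
                self._admit(key, value)
            return value

        def put(self, key, value):
            self.store[key] = value
            self.cache.pop(key, None)               # invalidate to keep the cache coherent

        def _admit(self, key, value):
            if len(self.cache) >= self.capacity:    # evict the least popular cached key
                del self.cache[min(self.cache, key=self.counts.get)]
            self.cache[key] = value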

437 citations

Proceedings Article
16 Mar 2016
TL;DR: Ernest, a performance prediction framework for large-scale analytics, is presented; evaluation on Amazon EC2 using several workloads shows that the prediction error is low while the training overhead is less than 5% for long-running jobs.
Abstract: Recent workload trends indicate rapid growth in the deployment of machine learning, genomics and scientific workloads on cloud computing infrastructure. However, efficiently running these applications on shared infrastructure is challenging, and we find that choosing the right hardware configuration can significantly improve performance and cost. The key to addressing this challenge is the ability to predict the performance of applications under various resource configurations so that we can automatically choose the optimal configuration. Our insight is that a number of jobs have predictable structure in terms of computation and communication. Thus we can build performance models based on the behavior of the job on small samples of data and then predict its performance on larger datasets and cluster sizes. To minimize the time and resources spent in building a model, we use optimal experiment design, a statistical technique that allows us to collect as few training points as required. We have built Ernest, a performance prediction framework for large-scale analytics, and our evaluation on Amazon EC2 using several workloads shows that our prediction error is low while having a training overhead of less than 5% for long-running jobs.
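To illustrate the modeling approach, the sketch below fits a small parametric runtime model from a handful of training runs and extrapolates to a larger configuration. The feature terms (a serial term, a parallel term that shrinks with more machines, and communication terms that grow with cluster size) follow the spirit of the paper, but the exact form, the non-negative least-squares fit, and the sample numbers are illustrative approximations.

    import numpy as np
    from scipy.optimize import nnls

    def features(scale, machines):
        return np.array([1.0, scale / machines, np.log2(machines), float(machines)])

    # Toy training points: (fraction of input data, number of machines, runtime in seconds).
    runs = [(0.05, 2, 30.0), (0.05, 4, 18.0), (0.10, 4, 33.0), (0.10, 8, 21.0)]
    X = np.array([features(s, m) for s, m, _ in runs])
    y = np.array([t for _, _, t in runs])

    theta, _ = nnls(X, y)                    # non-negative least-squares fit
    predicted = features(1.0, 64) @ theta    # predict the full dataset on 64 machines
    print(predicted)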

401 citations

Journal ArticleDOI
TL;DR: This survey aims to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks.
Abstract: Growing main memory capacity has fueled the development of in-memory big data management and processing. By eliminating the disk I/O bottleneck, it is now possible to support interactive data analytics. However, in-memory systems are much more sensitive to other sources of overhead that do not matter in traditional I/O-bound disk-based systems. Some issues such as fault tolerance and consistency are also more challenging to handle in an in-memory environment. We are witnessing a revolution in the design of database systems that exploit main memory as their data storage layer. Much of this research has focused on several dimensions: modern CPU and memory hierarchy utilization, time/space efficiency, parallelism, and concurrency control. In this survey, we aim to provide a thorough review of a wide range of in-memory data management and processing proposals and systems, including both data storage systems and data processing frameworks. We also give a comprehensive presentation of important technology in memory management, and some key factors that need to be considered in order to achieve efficient in-memory data management and processing.

391 citations

Journal ArticleDOI
12 Aug 2015
TL;DR: In this paper, a new view of federated databases is presented to address the growing need for managing information that spans multiple data models, and the authors propose a polystore architecture designed to unify querying over multiple data models.
Abstract: This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by the proliferation of storage engines and query languages based on the observation that 'no one size fits all'. To address this shift, we propose a polystore architecture; it is designed to unify querying over multiple data models. We consider the challenges and opportunities associated with polystores. Open questions in this space revolve around query optimization and the assignment of objects to storage engines. We introduce our approach to these topics and discuss our prototype in the context of the Intel Science and Technology Center for Big Data.
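As a toy illustration of the routing problem a polystore must solve, the sketch below registers multiple engines and dispatches each query to the engine that holds the referenced object. The catalog, engine names, and query format are invented for illustration and do not reflect the prototype's actual interface.

    class Polystore:
        def __init__(self):
            self.engines = {}       # engine name -> callable that executes a query
            self.catalog = {}       # object name -> engine name

        def register(self, name, executor, objects):
            self.engines[name] = executor
            for obj in objects:
                self.catalog[obj] = name

        def query(self, obj, text):
            # The abstract's open question is how to assign objects to engines;
            # here we simply look the object up in a static catalog.
            return self.engines[self.catalog[obj]](text)

    # Usage: route relational queries to one engine and array queries to another.
    store = Polystore()
    store.register("relational", lambda q: f"sql:{q}", ["patients"])
    store.register("array", lambda q: f"scidb:{q}", ["genomes"])
    print(store.query("genomes", "filter(genomes, quality > 30)"))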

244 citations