Author

Aaron J. Elmore

Bio: Aaron J. Elmore is an academic researcher from the University of Chicago. His research focuses on data management and relational databases. He has an h-index of 22 and has co-authored 79 publications receiving 1,974 citations. His previous affiliations include the University of California, Santa Barbara and the University of Illinois at Chicago.

Papers published on a yearly basis

Papers
Proceedings ArticleDOI
12 Jun 2011
TL;DR: Zephyr is proposed, a technique to efficiently migrate a live database in a shared-nothing transactional database architecture. It uses phases of on-demand pull and asynchronous push of data, requires minimal synchronization, provides ACID guarantees during migration, and ensures correctness in the presence of failures.
Abstract: Multitenant data infrastructures for large cloud platforms hosting hundreds of thousands of applications face the challenge of serving applications characterized by small data footprints and unpredictable load patterns. When such a platform is built on an elastic pay-per-use infrastructure, an added challenge is to minimize the system's operating cost while guaranteeing the tenants' service level agreements (SLAs). Elastic load balancing is therefore an important feature to enable scale-up during high load while scaling down when the load is low. Live migration, a technique to migrate tenants with minimal service interruption and no downtime, is critical to allow lightweight elastic scaling. We focus on the problem of live migration in the database layer. We propose Zephyr, a technique to efficiently migrate a live database in a shared-nothing transactional database architecture. Zephyr uses phases of on-demand pull and asynchronous push of data, requires minimal synchronization, results in no service unavailability and few or no aborted transactions, minimizes the data transfer overhead, provides ACID guarantees during migration, and ensures correctness in the presence of failures. We outline a prototype implementation using an open-source relational database engine and present a thorough evaluation using various transactional workloads. Zephyr's efficiency is evident from the few tens of failed operations, the 10-20% change in average transaction latency, minimal messaging, and no overhead during normal operation when migrating a live database.
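
The core of the migration protocol can be pictured as a hybrid of two transfer modes. Below is a minimal, illustrative Python sketch of that dual-mode idea, assuming database pages as the unit of ownership; the class and method names are invented for illustration, and the real protocol additionally handles indexes, concurrency control, and failure recovery.

# Minimal sketch of Zephyr-style live migration (illustrative only).
# During migration the destination pulls pages on first access while
# the source pushes the remaining pages asynchronously.

class MigratingTenant:
    def __init__(self, pages):
        self.source = dict(pages)   # page_id -> data, still owned by source
        self.dest = {}              # pages whose ownership has moved

    def read_at_destination(self, page_id):
        # On-demand pull: the first access at the destination fetches
        # the page from the source and transfers ownership.
        if page_id not in self.dest:
            self.dest[page_id] = self.source.pop(page_id)
        return self.dest[page_id]

    def background_push(self, batch=2):
        # Asynchronous push: the source streams leftover pages without
        # blocking transactions running at the destination.
        for page_id in list(self.source)[:batch]:
            self.dest[page_id] = self.source.pop(page_id)

    def done(self):
        return not self.source

tenant = MigratingTenant({1: "a", 2: "b", 3: "c"})
print(tenant.read_at_destination(2))  # pulled on demand
while not tenant.done():
    tenant.background_push()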

264 citations

Journal ArticleDOI
12 Aug 2015
TL;DR: A new view of federated databases is presented to address the growing need for managing information that spans multiple data models, and a polystore architecture designed to unify querying over those models is proposed.
Abstract: This paper presents a new view of federated databases to address the growing need for managing information that spans multiple data models. This trend is fueled by the proliferation of storage engines and query languages, based on the observation that 'no one size fits all'. To address this shift, we propose a polystore architecture designed to unify querying over multiple data models. We consider the challenges and opportunities associated with polystores. Open questions in this space revolve around query optimization and the assignment of objects to storage engines. We introduce our approach to these topics and discuss our prototype in the context of the Intel Science and Technology Center for Big Data.
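
A hedged sketch of the routing idea behind a polystore: each subquery is dispatched to the engine whose data model it targets, and the results are combined. The engine names, the plan format, and the stand-in query strings below are assumptions for illustration, not an API from the paper.

# Illustrative polystore-style routing: subqueries go to the engine
# matching their data model; results are gathered for later combination.

ENGINES = {
    "relational": lambda q: [("patient", 42)],       # stand-in for a SQL engine
    "key_value":  lambda q: {"patient:42": "Alice"}, # stand-in for a KV store
}

def run_polystore(plan):
    # plan: list of (data_model, subquery) pairs
    return [ENGINES[model](subquery) for model, subquery in plan]

results = run_polystore([
    ("relational", "SELECT id FROM patients"),
    ("key_value",  "GET patient:42"),
])
print(results)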

244 citations

Journal ArticleDOI
01 Nov 2014
TL;DR: E-Store is presented, an elastic partitioning framework for distributed OLTP DBMSs that automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application's workload.
Abstract: On-line transaction processing (OLTP) database management systems (DBMSs) often serve time-varying workloads due to daily, weekly, or seasonal fluctuations in demand, or because of rapid growth in demand due to a company's business success. In addition, many OLTP workloads are heavily skewed to "hot" tuples or ranges of tuples. For example, the majority of NYSE volume involves only 40 stocks. To deal with such fluctuations, an OLTP DBMS needs to be elastic; that is, it must be able to expand and contract resources in response to load fluctuations and dynamically balance load as hot tuples vary over time. This paper presents E-Store, an elastic partitioning framework for distributed OLTP DBMSs. It automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application's workload. E-Store addresses localized bottlenecks through a two-tier data placement strategy: cold data is distributed in large chunks, while smaller ranges of hot tuples are assigned explicitly to individual nodes. This is in contrast to traditional single-tier hash and range partitioning strategies. Our experimental evaluation of E-Store shows the viability of our approach and its efficacy under variations in load across a cluster of machines. Compared to single-tier approaches, E-Store improves throughput by up to 130% while reducing latency by 80%.
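
The two-tier placement strategy can be illustrated in a few lines of Python. This is a simplified sketch under assumed parameters (the chunk size, node count, and hot-tuple map below are arbitrary); E-Store itself derives these assignments from live workload monitoring.

# Sketch of two-tier placement: hot tuples get explicit per-tuple
# assignments, cold data is placed in large coarse-grained chunks.

NODES = 4
CHUNK = 1000                     # cold tuples placed in blocks of 1000
hot_map = {7: 2, 123456: 0}      # hot tuple_id -> node, assigned explicitly

def locate(tuple_id):
    # Tier 1: explicit mapping for hot tuples.
    if tuple_id in hot_map:
        return hot_map[tuple_id]
    # Tier 2: coarse chunk placement for cold data.
    return (tuple_id // CHUNK) % NODES

print(locate(7), locate(7001))   # hot tuple vs. cold chunk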

172 citations

Posted Content
TL;DR: A dataset version control system and a platform built on it, DataHub, are proposed, giving users the ability to create, branch, merge, difference, and search large, divergent collections of datasets.
Abstract: Relational databases have limited support for data collaboration, where teams collaboratively curate and analyze large datasets. Inspired by software version control systems like git, we propose (a) a dataset version control system, giving users the ability to create, branch, merge, difference and search large, divergent collections of datasets, and (b) a platform, DataHub, that gives users the ability to perform collaborative data analysis building on this version control system. We outline the challenges in providing dataset version control at scale.
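
One way to picture the version-control core is as a directed acyclic graph of dataset versions, where branching adds a node with one parent and merging adds a node with two. The sketch below is illustrative only; the VersionGraph class is invented, and storage of the actual dataset contents is elided.

# Toy dataset version graph in the spirit of git-like dataset versioning.

class VersionGraph:
    def __init__(self):
        self.parents = {"v0": []}     # version -> parent versions

    def commit(self, new, parent):
        self.parents[new] = [parent]

    def merge(self, new, left, right):
        # A merge node records both lineages, enabling later diffs.
        self.parents[new] = [left, right]

    def history(self, version):
        # Walk all ancestors, e.g. to difference two branches.
        seen, stack = set(), [version]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.parents[v])
        return seen

g = VersionGraph()
g.commit("v1", "v0"); g.commit("v2a", "v1"); g.commit("v2b", "v1")
g.merge("v3", "v2a", "v2b")
print(sorted(g.history("v3")))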

141 citations

Journal ArticleDOI
01 Aug 2015
TL;DR: BigDAWG is presented, a reference implementation of a new architecture for "Big Data" applications that showcases novel approaches for querying across multiple storage engines, data visualization, and scalable real-time analytics.
Abstract: This paper presents BigDAWG, a reference implementation of a new architecture for "Big Data" applications. Such applications not only call for large-scale analytics, but also for real-time streaming support, smaller analytics at interactive speeds, data visualization, and cross-storage-system queries. Guided by the principle that "one size does not fit all", we build on top of a variety of storage engines, each designed for a specialized use case. To illustrate the promise of this approach, we demonstrate its effectiveness on a hospital application using data from an intensive care unit (ICU). This complex application serves the needs of doctors and researchers and provides real-time support for streams of patient data. It showcases novel approaches for querying across multiple storage engines, data visualization, and scalable real-time analytics.
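
As an illustration of the cross-storage-system querying the paper showcases, the sketch below unions recent readings from a stand-in stream store with rows from a stand-in historical archive. The two "engines" and the ICU-flavored schema are invented for this example and are not BigDAWG components.

# Toy cross-engine query: live stream data plus historical archive data.

stream_store = [("patient42", "2024-01-01T10:05", 91)]   # recent heart rate
archive = [("patient42", "2023-12-31T23:00", 88)]        # historical store

def heart_rate_history(patient):
    # A single logical query spans both underlying systems.
    rows = [r for r in archive if r[0] == patient]
    rows += [r for r in stream_store if r[0] == patient]
    return sorted(rows, key=lambda r: r[1])

print(heart_rate_history("patient42"))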

119 citations


Cited by
Posted Content
TL;DR: Datasheets for datasets are proposed to document a dataset's motivation, composition, collection process, and recommended uses, facilitating communication between dataset creators and dataset consumers.
Abstract: The machine learning community currently has no standardized process for documenting datasets, which can lead to severe consequences in high-stakes domains. To address this gap, we propose datasheets for datasets. In the electronics industry, every component, no matter how simple or complex, is accompanied with a datasheet that describes its operating characteristics, test results, recommended uses, and other information. By analogy, we propose that every dataset be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on. Datasheets for datasets will facilitate better communication between dataset creators and dataset consumers, and encourage the machine learning community to prioritize transparency and accountability.
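
The proposal concerns documentation rather than code, but the datasheet structure can be made concrete as a small schema. The field names below are a subset drawn from the abstract (motivation, composition, collection process, recommended uses); the dataclass itself is an illustrative assumption, not part of the proposal.

# Minimal datasheet schema; fields mirror the questions the abstract lists.

from dataclasses import dataclass, asdict

@dataclass
class Datasheet:
    motivation: str          # why the dataset was created
    composition: str         # what the instances are
    collection_process: str  # how the data was gathered
    recommended_uses: str    # tasks it is (and is not) suited for

sheet = Datasheet(
    motivation="Benchmark OLTP elasticity research.",
    composition="Synthetic order records, one row per transaction.",
    collection_process="Generated by a TPC-C-style workload driver.",
    recommended_uses="Load-balancing experiments; not for ML training.",
)
print(asdict(sheet))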

1,080 citations

Book ChapterDOI
01 Dec 2018
TL;DR: Preliminary performance data on a subset of TPC-H is presented, showing that C-Store, the system the team is building, is substantially faster than popular commercial products.
Abstract: This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. Among the many differences in its design are: storage of data by column rather than by row; careful coding and packing of objects into storage, including main memory, during query processing; storing an overlapping collection of column-oriented projections rather than the current fare of tables and indexes; a non-traditional implementation of transactions that includes high availability and snapshot isolation for read-only transactions; and the extensive use of bitmap indexes to complement B-tree structures. We present preliminary performance data on a subset of TPC-H and show that the system we are building, C-Store, is substantially faster than popular commercial products. Hence, the architecture looks very encouraging.
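
The benefit of storage by column rather than by row can be shown in miniature: a scan that filters on one attribute and aggregates another touches only those two columns. This toy sketch omits C-Store's compression, projections, and transaction machinery.

# Toy contrast between row and column layouts.

rows = [(1, "US", 9.5), (2, "DE", 5.0), (3, "US", 7.5)]

# Column-oriented layout: one array per attribute.
ids, countries, prices = map(list, zip(*rows))

def sum_prices_for(country):
    # Reads only two columns; a row store would fetch whole tuples.
    return sum(p for c, p in zip(countries, prices) if c == country)

print(sum_prices_for("US"))   # 17.0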

1,063 citations

Book ChapterDOI
01 Dec 2018
TL;DR: The current RDBMS code lines, while attempting to be a "one size fits all" solution, in fact, excel at nothing and should be retired in favor of a collection of "from scratch" specialized engines.
Abstract: In previous papers [SC05, SBC+07], some of us predicted the end of "one size fits all" as a commercial relational DBMS paradigm. These papers presented reasons and experimental evidence showing that the major RDBMS vendors can be outperformed by 1-2 orders of magnitude by specialized engines in the data warehouse, stream processing, text, and scientific database markets. Assuming that specialized engines dominate these markets over time, the current relational DBMS code lines will be left with the business data processing (OLTP) market and hybrid markets where more than one kind of capability is required. In this paper we show that current RDBMSs can be beaten by nearly two orders of magnitude in the OLTP market as well. The experimental evidence comes from comparing a new OLTP prototype, H-Store, which we have built at M.I.T., to a popular RDBMS on the standard transactional benchmark, TPC-C. We conclude that the current RDBMS code lines, while attempting to be a "one size fits all" solution, in fact excel at nothing. Hence, they are 25-year-old legacy code lines that should be retired in favor of a collection of "from scratch" specialized engines. The DBMS vendors (and the research community) should start with a clean sheet of paper and design systems for tomorrow's requirements, not continue to push code lines and architectures designed for yesterday's needs.

679 citations

Journal ArticleDOI
TL;DR: A comprehensive introduction to knowledge graphs is provided, covering graph-based data models and query languages, the roles of schema, identity, and context, deductive and inductive techniques for representing and extracting knowledge, and prominent open and enterprise knowledge graphs.
Abstract: In this paper we provide a comprehensive introduction to knowledge graphs, which have recently garnered significant attention from both industry and academia in scenarios that require exploiting diverse, dynamic, large-scale collections of data. After some opening remarks, we motivate and contrast various graph-based data models and query languages that are used for knowledge graphs. We discuss the roles of schema, identity, and context in knowledge graphs. We explain how knowledge can be represented and extracted using a combination of deductive and inductive techniques. We summarise methods for the creation, enrichment, quality assessment, refinement, and publication of knowledge graphs. We provide an overview of prominent open knowledge graphs and enterprise knowledge graphs, their applications, and how they use the aforementioned techniques. We conclude with high-level future research directions for knowledge graphs.
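
The directed edge-labelled graph model the survey starts from can be demonstrated in a few lines: the data is a set of (subject, predicate, object) triples, and a query is a pattern with wildcards. The triples and predicate names below are invented examples.

# Minimal directed edge-labelled graph with one-hop pattern matching.

triples = {
    ("Santiago", "capital_of", "Chile"),
    ("Arica", "city_in", "Chile"),
    ("Chile", "part_of", "South America"),
}

def match(subject=None, predicate=None, obj=None):
    # None acts as a wildcard, like a variable in a graph query.
    return [t for t in triples
            if (subject   in (None, t[0]) and
                predicate in (None, t[1]) and
                obj       in (None, t[2]))]

print(match(predicate="capital_of"))   # which node is a capital of which?
print(match(obj="Chile"))              # every edge pointing at Chile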

560 citations