
Showing papers by Nesime Tatbul published in 2022


Journal ArticleDOI

TL;DR: Bao takes advantage of the wisdom built into existing query optimizers by providing per-query optimization hints, and combines modern tree convolutional neural networks with Thompson sampling, a well-studied reinforcement learning algorithm, to automatically learn from its mistakes and adapt to changes in query workloads, data, and schema.
Abstract: Recent efforts applying machine learning techniques to query optimization have shown few practical gains due to substantive training overhead, inability to adapt to changes, and poor tail performance. Motivated by these difficulties, we introduce Bao (the Bandit optimizer). Bao takes advantage of the wisdom built into existing query optimizers by providing per-query optimization hints. Bao combines modern tree convolutional neural networks with Thompson sampling, a well-studied reinforcement learning algorithm. As a result, Bao automatically learns from its mistakes and adapts to changes in query workloads, data, and schema. Experimentally, we demonstrate that Bao can quickly learn strategies that improve end-to-end query execution performance, including tail latency, for several workloads containing long-running queries. In cloud environments, we show that Bao can offer both reduced costs and better performance compared with a commercial system.
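To make the learning loop described above concrete, here is a minimal Python sketch of Thompson sampling over a set of optimizer hint sets. Everything in it is an illustrative assumption: the hint-set names, the run_query stand-in, and the Beta-Bernoulli reward model, which takes the place of Bao's actual tree convolutional network that predicts plan latency.

```python
# Minimal sketch: Thompson sampling over per-query optimizer hint sets.
# All names here (HINT_SETS, run_query) are hypothetical; Bao models
# expected latency with a tree convolutional network, not the simple
# Beta-Bernoulli "did it beat the default plan?" reward used below.
import random

HINT_SETS = ["default", "disable_nested_loop",
             "disable_hash_join", "force_index_scan"]

# One Beta(alpha, beta) posterior per hint set.
posteriors = {h: [1.0, 1.0] for h in HINT_SETS}

def run_query(hint_set: str) -> bool:
    """Stand-in for real execution: did this hint set beat the default?"""
    simulated_win_rate = {"default": 0.50, "disable_nested_loop": 0.70,
                          "disable_hash_join": 0.40, "force_index_scan": 0.60}
    return random.random() < simulated_win_rate[hint_set]

for _ in range(1000):
    # Sample a win rate from each posterior and act greedily on the samples:
    # this is the exploration/exploitation trade-off of Thompson sampling.
    draws = {h: random.betavariate(a, b) for h, (a, b) in posteriors.items()}
    chosen = max(draws, key=draws.get)
    won = run_query(chosen)
    posteriors[chosen][0] += won       # alpha counts wins
    posteriors[chosen][1] += 1 - won   # beta counts losses

print(max(posteriors, key=lambda h: posteriors[h][0] / sum(posteriors[h])))
```

Because each arm's posterior tightens as evidence accumulates, the loop self-corrects after bad choices, which is the "learns from its mistakes" behavior the abstract claims.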

7 citations


Proceedings Article
TL;DR: A new self-organizing, self-optimizing, meta-data rich storage format for the cloud that enables order-of-magnitude performance improvements in data-intensive applications through instance-optimization, i.e., the adaptation of data representation to exploit both the distribution of the data and the workload operating on it.
Abstract: We propose a new self-organizing, self-optimizing, meta-data rich storage format for the cloud, called a self-organizing data container (SDC), that enables order-of-magnitude performance improvements in data-intensive applications through instance-optimization, i.e., the adaptation of data representation to exploit both the distribution of the data and the workload operating on it. Unlike existing low-level cloud storage formats like Apache Arrow and Parquet, SDCs capture both data and metadata, like access histories and distributional statistics, and are designed to be flexible enough to encompass a variety of modern high-performance representations for data analytics, including partitioning, replication, indexing, and materialization. We present a preliminary design for SDCs and some motivating experiments, and discuss new challenges they present.
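As a rough illustration of the data-plus-metadata coupling described above, the sketch below pairs rows with per-column access counts and reorganizes the physical layout around the most frequently scanned column. All identifiers and the sort-based "reorganization" are assumptions made up for this example, not the SDC design from the paper.

```python
# Illustrative only: a container that tracks its own workload metadata
# and adapts its physical layout. Names and mechanics are invented here,
# not taken from the SDC paper.
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class SelfOrganizingContainer:
    rows: list[dict]                                           # the data itself
    access_history: Counter = field(default_factory=Counter)   # column -> scans

    def scan(self, column: str, predicate) -> list[dict]:
        self.access_history[column] += 1                       # record the workload
        return [row for row in self.rows if predicate(row[column])]

    def maybe_reorganize(self) -> None:
        # Instance optimization in miniature: physically cluster the data
        # on whichever column the workload scans most often.
        if self.access_history:
            hot_column, _ = self.access_history.most_common(1)[0]
            self.rows.sort(key=lambda row: row[hot_column])

sdc = SelfOrganizingContainer(rows=[{"a": 3, "b": 1}, {"a": 1, "b": 2}])
sdc.scan("a", lambda v: v > 1)
sdc.maybe_reorganize()      # rows are now clustered on the hot column "a"
```

A real SDC would also keep distributional statistics and choose among partitioning, replication, indexing, and materialization rather than a single sort, per the abstract.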

5 citations


Proceedings Article
TL;DR: Mach’s lean, loosely coordinated architecture aggressively leverages the characteristics of metrics data and observability workloads, yielding an order-of-magnitude improvement over existing approaches—especially those marketed as “time series database systems” (TSDBs).
Abstract: Observability is gaining traction as a key capability for understanding the internal behavior of large-scale system deployments. Instrumenting these systems to report quantitative telemetry data called metrics enables engineers to monitor and maintain services that operate at an enormous scale so they can respond rapidly to any issues that might arise. To be useful, metrics must be ingested, stored, and queryable in real time, but many existing solutions cannot keep up with the sheer volume of generated data. This paper describes Mach, a pluggable storage engine we are building specifically to handle high-volume metrics data. Similar to many popular libraries (e.g., Berkeley DB, LevelDB, RocksDB, WiredTiger), Mach provides a simple API to store and retrieve data. Mach’s lean, loosely coordinated architecture aggressively leverages the characteristics of metrics data and observability workloads, yielding an order-of-magnitude improvement over existing approaches—especially those marketed as “time series database systems” (TSDBs). In fact, our preliminary results show that Mach can achieve nearly 10× higher write throughput and 3× higher read throughput compared to several widely used alternatives.
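The abstract only says that Mach exposes a "simple API to store and retrieve data" in the style of LevelDB or RocksDB; the sketch below imagines what such a minimal per-series push/query interface could look like. Every identifier here is hypothetical, not Mach's actual API.

```python
# Hypothetical sketch of a LevelDB-style minimal interface for metrics;
# Mach's real API is not shown in the summary above, so these names
# (MetricsStore, push, query) are invented for illustration.
import bisect

class MetricsStore:
    """Append-only (timestamp, value) samples, kept per series."""

    def __init__(self) -> None:
        self._series: dict[str, list[tuple[int, float]]] = {}

    def push(self, series_id: str, timestamp: int, value: float) -> None:
        # Metrics arrive in rough time order, so appends stay cheap:
        # the kind of workload characteristic the abstract says Mach exploits.
        self._series.setdefault(series_id, []).append((timestamp, value))

    def query(self, series_id: str, start: int, end: int) -> list[tuple[int, float]]:
        """Return all samples with start <= timestamp <= end."""
        samples = self._series.get(series_id, [])
        lo = bisect.bisect_left(samples, (start,))
        hi = bisect.bisect_right(samples, (end, float("inf")))
        return samples[lo:hi]

store = MetricsStore()
store.push("cpu.host1", 100, 0.42)
store.push("cpu.host1", 160, 0.57)
print(store.query("cpu.host1", 90, 150))   # [(100, 0.42)]
```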

1 citation


Journal ArticleDOI
TL;DR: This short note summarizes the discussion of a panel held during VLDB 2021, titled "Artifacts, Availability & Reproducibility", which aimed to assess the reproducibility of data management research and to propose changes moving forward.
Abstract: In the last few years, SIGMOD and VLDB have intensified efforts to encourage, facilitate, and establish reproducibility as a key process for accepted research papers, awarding them with the Reproducibility badge. In addition, complementary efforts have focused on increasing the sharing of accompanying artifacts of published work (code, scripts, data), independently of reproducibility, awarding them the Artifacts Available badge. In this short note, we summarize the discussion of a panel held during VLDB 2021 titled "Artifacts, Availability & Reproducibility". We first present a more detailed summary of these recent efforts. Then, we present the discussion and the key points that were made, aiming to assess the reproducibility of data management research and to propose changes moving forward.

1 citation


Journal ArticleDOI
TL;DR: A tutorial on machine programming is presented, introducing its three pillars of intention, invention, and adaptation, and providing an overview of the data ecosystem central to all machine programming systems, highlighting challenges and novel opportunities relevant to the data systems community.
Abstract: Machine programming is an emerging research area that improves the software development life cycle from design through deployment. We present a tutorial on machine programming research highlighting aspects relevant to the data systems community. We divide this tutorial into three parts. First, we give an introduction to machine programming and its three pillars: intention, invention, and adaptation. Then, we provide an overview of the data ecosystem central to all machine programming systems, highlighting challenges and novel opportunities relevant to the data systems community. Finally, we describe recent advances in machine programming research and how these directions use various data sets to improve the ease of creating and maintaining performant software systems.

Journal ArticleDOI
TL;DR: The Scalable Data Science (SDS) research track was introduced as part of the 2021 International Conference on Very Large Data Bases (VLDB) to enhance the impact and visibility of the VLDB community on data science practice.
Abstract: As part of the International Conference on Very Large Data Bases (VLDB) 2021 / Proceedings of the VLDB Endowment Volume 14, a new Research Track category named Scalable Data Science (SDS) was launched [2, 6]. The goal of SDS is to attract cutting-edge and impactful real-world work in the scalable data science arena to enhance the impact and visibility of the VLDB community on data science practice, spur new technical connections, and inspire new follow-on research. The inaugural year proved to be successful, with numerous interesting papers from a wide cross section of both industry and academia, spanning several data science topics, and originating from several countries around the world. In this report, we reflect on the inaugural year of SDS with some statistics on both submissions and accepted papers, SDS invited talks, and our observations, lessons, and tips as inaugural Associate Editors for SDS. We hope this article is helpful to future authors, reviewers, and organizers of SDS, as well as other interested members of the wider database / data management community and beyond.


Proceedings ArticleDOI
10 Jun 2022
TL;DR: Over the last two decades, the DaMoN Workshop has established itself as the primary database venue for ideas on how to exploit new hardware for data management, in particular how to improve the performance or scalability of databases, how new hardware unlocks new database application scenarios, and how data management could benefit from future hardware.
Abstract: New hardware, such as multi-core CPUs, GPUs, FPGAs, new memory and storage technologies, and low-power devices, brings new challenges and opportunities for optimizing database system performance. Consequently, exploiting the characteristics of modern hardware has become an important topic of database systems research. In the last two decades, the DaMoN Workshop has established itself as the primary database venue to present ideas on how to exploit new hardware for data management, in particular how to improve performance or scalability of databases, how new hardware unlocks new database application scenarios, and how data management could benefit from future hardware.