
Showing papers by "Nesime Tatbul published in 2020"


Proceedings ArticleDOI
20 Apr 2020
TL;DR: Surprisingly, models trained with cost-guided cardinality estimation improve query performance while having higher prediction error than models trained without it, suggesting that prediction error for cardinalities is not necessarily the correct metric to optimize.
Abstract: The increasing prevalence of machine learning techniques has resulted in many works attempting to replace cardinality estimation, a core component of relational query optimizers, with learned models. The majority of those works have trained models to minimize the prediction error between the model’s output for a particular query and the true cardinality of that query. However, when cardinality estimators are used for query optimization, not all cardinality estimates are equally important. We present cost-guided cardinality estimation, a technique to train learned cardinality estimators that penalizes models for errors that lead to sub-optimal query plans, and rewards models for estimates that lead to high-quality query plans, regardless of the accuracy of those estimates. In a preliminary experimental study, we show that our technique can reduce average query runtime by 1.7-2×. Surprisingly, models trained with our approach achieve this increase in query performance while having higher prediction error than models trained without our approach, suggesting that prediction error for cardinalities is not necessarily the correct metric to optimize.
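The core idea of a cost-guided training signal can be sketched as follows. Everything below is an invented illustration (the plan names, the toy cost model, and all numbers are hypothetical, not the paper's actual cost model): the loss is the extra runtime cost of the plan chosen under the estimate, not the estimation error itself.

```python
def plan_cost(plan, card):
    # Toy cost model (invented numbers): nested-loop join is cheap for small
    # inputs, hash join pays a fixed build cost but scales linearly.
    if plan == "nested_loop":
        return card ** 2 / 1000.0
    return 50.0 + card  # hash_join

def choose_plan(estimated_card):
    # The optimizer picks whichever plan looks cheaper under the estimate.
    return min(("nested_loop", "hash_join"),
               key=lambda p: plan_cost(p, estimated_card))

def cost_guided_loss(estimated_card, true_card):
    # Regret: extra cost of the chosen plan vs. the best plan in hindsight.
    chosen = plan_cost(choose_plan(estimated_card), true_card)
    best = min(plan_cost(p, true_card) for p in ("nested_loop", "hash_join"))
    return chosen - best

# An estimate off by 10x incurs zero loss if it still yields the right plan:
print(cost_guided_loss(10, 100))   # 0.0
# The same estimate is heavily penalized when it leads to a bad plan:
print(cost_guided_loss(10, 2000))  # 1950.0
```

This captures why accuracy and loss can diverge: a model can have large prediction error yet zero cost-guided loss whenever its errors do not change the optimizer's plan choice.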

25 citations


Posted Content
TL;DR: Bao combines modern tree convolutional neural networks with Thompson sampling, a decades-old and well-studied reinforcement learning algorithm, to take advantage of the wisdom built into existing query optimizers by providing per-query optimization hints.
Abstract: Query optimization remains one of the most challenging problems in data management systems. Recent efforts to apply machine learning techniques to query optimization challenges have been promising, but have shown few practical gains due to substantive training overhead, inability to adapt to changes, and poor tail performance. Motivated by these difficulties and drawing upon a long history of research in multi-armed bandits, we introduce Bao (the BAndit Optimizer). Bao takes advantage of the wisdom built into existing query optimizers by providing per-query optimization hints. Bao combines modern tree convolutional neural networks with Thompson sampling, a decades-old and well-studied reinforcement learning algorithm. As a result, Bao automatically learns from its mistakes and adapts to changes in query workloads, data, and schema. Experimentally, we demonstrate that Bao can quickly (an order of magnitude faster than previous approaches) learn strategies that improve end-to-end query execution performance, including tail latency. In cloud environments, we show that Bao can offer both reduced costs and better performance compared with a sophisticated commercial system.
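Bao's actual model is a tree convolutional network over query plan trees; the minimal sketch below shows only the Thompson-sampling bandit loop over a few hypothetical hint sets, with simulated win/loss feedback standing in for measured query runtimes.

```python
import random

random.seed(0)

hint_sets = ["default", "disable_nested_loop", "disable_merge_join"]
# Beta(wins + 1, losses + 1) posterior per hint set ("arm").
stats = {h: {"wins": 0, "losses": 0} for h in hint_sets}

def choose_hint_set():
    # Thompson sampling: draw once from each arm's posterior,
    # then act greedily on the draws.
    draws = {h: random.betavariate(s["wins"] + 1, s["losses"] + 1)
             for h, s in stats.items()}
    return max(draws, key=draws.get)

def record_outcome(hint_set, improved_runtime):
    stats[hint_set]["wins" if improved_runtime else "losses"] += 1

# Simulated feedback (invented rates): one hint set helps 80% of
# queries, the others only 30%.
true_help_rate = {"default": 0.3,
                  "disable_nested_loop": 0.8,
                  "disable_merge_join": 0.3}
for _ in range(500):
    h = choose_hint_set()
    record_outcome(h, random.random() < true_help_rate[h])

# Over time, pulls concentrate on the arm that actually helps.
print({h: s["wins"] + s["losses"] for h, s in stats.items()})
```

Because the posterior is updated after every query, the loop keeps exploring when evidence is thin and adapts automatically if the workload (and hence the best hint set) shifts, which is the property the abstract highlights.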

21 citations


Proceedings ArticleDOI
15 Jun 2020
TL;DR: One of the first experimental studies on characterizing Intel® Optane™ DC PMM's performance behavior in the context of analytical database workloads is presented, revealing interesting performance tradeoffs that can help guide the design of next-generation OLAP systems in the presence of persistent memory in the storage hierarchy.
Abstract: New data storage technologies such as the recently introduced Intel® Optane™ DC Persistent Memory Module (PMM) offer exciting opportunities for optimizing the query processing performance of database workloads. In particular, the unique combination of low latency, byte-addressability, persistence, and large capacity makes persistent memory (PMem) an attractive alternative alongside DRAM and SSDs. Exploring the performance characteristics of this new medium is the first critical step in understanding how it will impact the design and performance of database systems. In this paper, we present one of the first experimental studies on characterizing Intel® Optane™ DC PMM's performance behavior in the context of analytical database workloads. First, we analyze basic access patterns common in such workloads, such as sequential, selective, and random reads as well as the complete Star Schema Benchmark, comparing standalone DRAM- and PMem-based implementations. Then we extend our analysis to join algorithms over larger datasets, which require using DRAM and PMem in a hybrid fashion while paying special attention to the read-write asymmetry of PMem. Our study reveals interesting performance tradeoffs that can help guide the design of next-generation OLAP systems in the presence of persistent memory in the storage hierarchy.

19 citations


Posted Content
TL;DR: MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax, and provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters.
Abstract: Code similarity systems are integral to a range of applications from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x).

17 citations


Journal ArticleDOI
01 Aug 2020
TL;DR: Dagger is introduced, an end-to-end system to debug and mitigate data-centric errors in data pipelines, such as a data transformation gone wrong or a classifier underperforming due to noisy training data.
Abstract: Data pipelines are the new code. Consequently, data scientists need new tools to support the often time-consuming process of debugging their pipelines. We introduce Dagger, an end-to-end system to debug and mitigate data-centric errors in data pipelines, such as a data transformation gone wrong or a classifier underperforming due to noisy training data. Dagger supports inter-module debugging, where the pipeline blocks are treated as black boxes, as well as intra-module debugging, where users can debug data objects in Python scripts (e.g., DataFrames). In this demo, we will walk the audience through a rich, real-world business intelligence use case from our industrial collaborators at Intel, to highlight how Dagger enables data scientists to productively identify and mitigate data-centric problems at different stages of pipeline development.

9 citations


Posted ContentDOI
22 Dec 2020 - bioRxiv
TL;DR: Learned Indexes for Sequence Analysis (LISA) is a learning-based approach to DNA sequence search that builds on and extends the FM-index to accelerate exact search and super-maximal exact match (SMEM) search.
Abstract: Background: Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences. Results: In this paper, we introduce LISA (Learned Indexes for Sequence Analysis), a novel learning-based approach to DNA sequence search. We focus on accelerating two of the most essential flavors of DNA sequence search: exact search and super-maximal exact match (SMEM) search. LISA builds on and extends the FM-index, which is the state-of-the-art technique widely deployed in genomics tools. Experiments with human, animal, and plant genome datasets indicate that LISA achieves up to 2.2x and 13.3x speedups over state-of-the-art FM-index based implementations for exact search and SMEM search, respectively.
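For context, the classical FM-index backward search that LISA accelerates looks roughly like the toy implementation below (LISA replaces the hot lookup steps with learned index structures; this sketch shows only the unaccelerated textbook baseline, with a naive O(n) rank instead of a real rank data structure).

```python
def fm_index(text):
    text += "$"  # sentinel, lexicographically smallest character
    # Suffix array via sorting (quadratic toy construction).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return bwt, C

def count_occurrences(bwt, C, pattern):
    # Backward search: maintain the suffix-array interval [lo, hi) of
    # suffixes prefixed by the pattern suffix processed so far.
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt[:lo].count(c)  # naive rank(c, lo)
        hi = C[c] + bwt[:hi].count(c)  # naive rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

bwt, C = fm_index("ACGTACGTACGA")
print(count_occurrences(bwt, C, "ACG"))  # 3
```

Each pattern character costs one rank query per interval bound; these per-character lookups are exactly the memory-bound steps where a learned index can pay off at genome scale.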

3 citations


Posted Content
05 Jun 2020
TL;DR: This work presents Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components: a novel context-aware semantic structure and a neural-based code similarity scoring algorithm that can be implemented with various neural network architectures with learned parameters.
Abstract: Code similarity systems are integral to a range of applications from code recommendation to automated software defect correction. We argue that code similarity is now a first-order problem that must be solved. To begin to address this, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware semantic structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters. We compare MISIM to three state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 328,155 programs (over 18 million lines of code), MISIM has 1.5x to 43.4x better accuracy than all three systems.

3 citations


Posted Content
TL;DR: AnomalyBench, the first comprehensive benchmark for explainable AD over high-dimensional (2000+) time series data, is presented, and its key design features and practical utility are demonstrated through an experimental study with three state-of-the-art semi-supervised AD techniques.
Abstract: Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many domains, as they provide the research community a common ground for training, testing, evaluating, comparing, and experimenting with novel machine learning models. Lack of such community resources for anomaly detection (AD) severely limits progress. In this report, we present AnomalyBench, the first comprehensive benchmark for explainable AD over high-dimensional (2000+) time series data. AnomalyBench has been systematically constructed based on real data traces from ~100 repeated executions of 10 large-scale stream processing jobs on a Spark cluster. 30+ of these executions were disturbed by introducing ~100 instances of different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of these anomaly instances, ground truth labels for the root-cause interval as well as those for the effect interval are available, providing a means for supporting both AD tasks and explanation discovery (ED) tasks via root-cause analysis. We demonstrate the key design features and practical utility of AnomalyBench through an experimental study with three state-of-the-art semi-supervised AD techniques.

3 citations


Posted Content
TL;DR: Exathlon is the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data, based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster.
Abstract: Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been attracting increasing attention, lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster. Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of the anomaly instances, ground truth labels for the root cause interval as well as those for the extended effect interval are provided, supporting the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon's dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.

Posted Content
TL;DR: Machine Inferred Code Similarity (MISIM) is a neural code semantics similarity system consisting of two core components: (i) a novel context-aware semantics structure, which was purpose-built to lift semantics from code syntax; (ii) an extensible neural code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters.
Abstract: Code semantics similarity can be used for many tasks such as code recommendation, automated software defect correction, and clone detection. Yet, the accuracy of such systems has not yet reached a level of general purpose reliability. To help address this, we present Machine Inferred Code Similarity (MISIM), a neural code semantics similarity system consisting of two core components: (i) MISIM uses a novel context-aware semantics structure, which was purpose-built to lift semantics from code syntax; (ii) MISIM uses an extensible neural code similarity scoring algorithm, which can be used for various neural network architectures with learned parameters. We compare MISIM to four state-of-the-art systems, including two additional hand-customized models, over 328K programs consisting of over 18 million lines of code. Our experiments show that MISIM has 8.08% better accuracy (using MAP@R) compared to the next best performing system.
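The scoring side of such a system reduces to ranking programs by similarity between learned vector representations. The sketch below is a generic illustration, not MISIM's actual model: the file names and embedding vectors are invented, standing in for the output of a trained neural encoder.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings, as if produced by a trained code encoder.
emb = {
    "bubble_sort.c":    [0.9, 0.1, 0.0],
    "insertion_sort.c": [0.8, 0.2, 0.1],
    "http_server.c":    [0.0, 0.1, 0.95],
}

query = "bubble_sort.c"
ranked = sorted((f for f in emb if f != query),
                key=lambda f: cosine(emb[query], emb[f]),
                reverse=True)
print(ranked)  # most semantically similar program first
```

Metrics like MAP@R, cited in the abstract, evaluate exactly such rankings: for each query program, how many of its top-R neighbors belong to the same semantic class.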