
Showing papers by "Nesime Tatbul published in 2020"


Proceedings ArticleDOI
20 Apr 2020
TL;DR: Surprisingly, models trained with cost-guided cardinality estimation improve query performance while having higher prediction error than models trained without it, suggesting that prediction error for cardinalities is not necessarily the correct metric to optimize.
Abstract: The increasing prevalence of machine learning techniques has resulted in many works attempting to replace cardinality estimation, a core component of relational query optimizers, with learned models. The majority of those works have trained models to minimize the prediction error between the model’s output for a particular query and the true cardinality of that query. However, when cardinality estimators are used for query optimization, not all cardinality estimates are equally important. We present cost-guided cardinality estimation, a technique to train learned cardinality estimators that penalizes models for errors that lead to sub-optimal query plans, and rewards models for estimates that lead to high-quality query plans, regardless of the accuracy of those estimates. In a preliminary experimental study, we show that our technique can reduce average query runtime by 1.7-2×. Surprisingly, models trained with our approach achieve this increase in query performance while having higher prediction error than models trained without our approach, suggesting that prediction error for cardinalities is not necessarily the correct metric to optimize.
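The core idea of a cost-guided training signal can be sketched as follows. Everything below is an invented illustration (the plan names, the toy cost model, and all numbers are hypothetical, not the paper's actual cost model): the loss is the extra runtime cost of the plan chosen under the estimate, not the estimation error itself.

```python
def plan_cost(plan, card):
    # Toy cost model (invented numbers): nested-loop join is cheap for small
    # inputs, hash join pays a fixed build cost but scales linearly.
    if plan == "nested_loop":
        return card ** 2 / 1000.0
    return 50.0 + card  # hash_join

def choose_plan(estimated_card):
    # The optimizer picks whichever plan looks cheaper under the estimate.
    return min(("nested_loop", "hash_join"),
               key=lambda p: plan_cost(p, estimated_card))

def cost_guided_loss(estimated_card, true_card):
    # Regret: extra cost of the chosen plan vs. the best plan in hindsight.
    chosen = plan_cost(choose_plan(estimated_card), true_card)
    best = min(plan_cost(p, true_card) for p in ("nested_loop", "hash_join"))
    return chosen - best

# An estimate off by 10x incurs zero loss if it still yields the right plan:
print(cost_guided_loss(10, 100))   # 0.0
# The same estimate is heavily penalized when it leads to a bad plan:
print(cost_guided_loss(10, 2000))  # 1950.0
```

This captures why accuracy and loss can diverge: a model can have large prediction error yet zero cost-guided loss whenever its errors do not change the optimizer's plan choice.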

25 citations


Posted Content
TL;DR: Bao combines modern tree convolutional neural networks with Thompson sampling, a decades-old and well-studied reinforcement learning algorithm, to take advantage of the wisdom built into existing query optimizers by providing per-query optimization hints.
Abstract: Query optimization remains one of the most challenging problems in data management systems. Recent efforts to apply machine learning techniques to query optimization challenges have been promising, but have shown few practical gains due to substantive training overhead, inability to adapt to changes, and poor tail performance. Motivated by these difficulties and drawing upon a long history of research in multi-armed bandits, we introduce Bao (the BAndit Optimizer). Bao takes advantage of the wisdom built into existing query optimizers by providing per-query optimization hints. Bao combines modern tree convolutional neural networks with Thompson sampling, a decades-old and well-studied reinforcement learning algorithm. As a result, Bao automatically learns from its mistakes and adapts to changes in query workloads, data, and schema. Experimentally, we demonstrate that Bao can quickly (an order of magnitude faster than previous approaches) learn strategies that improve end-to-end query execution performance, including tail latency. In cloud environments, we show that Bao can offer both reduced costs and better performance compared with a sophisticated commercial system.
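Bao's actual model is a tree convolutional network over query plan trees; the minimal sketch below shows only the Thompson-sampling bandit loop over a few hypothetical hint sets, with simulated win/loss feedback standing in for measured query runtimes.

```python
import random

random.seed(0)

hint_sets = ["default", "disable_nested_loop", "disable_merge_join"]
# Beta(wins + 1, losses + 1) posterior per hint set ("arm").
stats = {h: {"wins": 0, "losses": 0} for h in hint_sets}

def choose_hint_set():
    # Thompson sampling: draw once from each arm's posterior,
    # then act greedily on the draws.
    draws = {h: random.betavariate(s["wins"] + 1, s["losses"] + 1)
             for h, s in stats.items()}
    return max(draws, key=draws.get)

def record_outcome(hint_set, improved_runtime):
    stats[hint_set]["wins" if improved_runtime else "losses"] += 1

# Simulated feedback (invented rates): one hint set helps 80% of
# queries, the others only 30%.
true_help_rate = {"default": 0.3,
                  "disable_nested_loop": 0.8,
                  "disable_merge_join": 0.3}
for _ in range(500):
    h = choose_hint_set()
    record_outcome(h, random.random() < true_help_rate[h])

# Over time, pulls concentrate on the arm that actually helps.
print({h: s["wins"] + s["losses"] for h, s in stats.items()})
```

Because the posterior is updated after every query, the loop keeps exploring when evidence is thin and adapts automatically if the workload (and hence the best hint set) shifts, which is the property the abstract highlights.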

21 citations


Proceedings ArticleDOI
15 Jun 2020
TL;DR: One of the first experimental studies on characterizing Intel® Optane™ DC PMM's performance behavior in the context of analytical database workloads is presented, revealing interesting performance tradeoffs that can help guide the design of next-generation OLAP systems in the presence of persistent memory in the storage hierarchy.
Abstract: New data storage technologies such as the recently introduced Intel® Optane™ DC Persistent Memory Module (PMM) offer exciting opportunities for optimizing the query processing performance of database workloads. In particular, the unique combination of low latency, byte-addressability, persistence, and large capacity makes persistent memory (PMem) an attractive alternative alongside DRAM and SSDs. Exploring the performance characteristics of this new medium is the first critical step in understanding how it will impact the design and performance of database systems. In this paper, we present one of the first experimental studies on characterizing Intel® Optane™ DC PMM's performance behavior in the context of analytical database workloads. First, we analyze basic access patterns common in such workloads, such as sequential, selective, and random reads as well as the complete Star Schema Benchmark, comparing standalone DRAM- and PMem-based implementations. Then we extend our analysis to join algorithms over larger datasets, which require using DRAM and PMem in a hybrid fashion while paying special attention to the read-write asymmetry of PMem. Our study reveals interesting performance tradeoffs that can help guide the design of next-generation OLAP systems in the presence of persistent memory in the storage hierarchy.

19 citations


Posted Content
TL;DR: MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax, and provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters.
Abstract: Code similarity systems are integral to a range of applications from code recommendation to automated construction of software tests and defect mitigation. In this paper, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware similarity structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring system, which can be implemented with various neural network algorithms and topologies with learned parameters. We compare MISIM to three other state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 45,780 programs, MISIM consistently outperformed all three systems, often by a large factor (upwards of 40.6x).

17 citations


Journal ArticleDOI
01 Aug 2020
TL;DR: Dagger is introduced, an end-to-end system to debug and mitigate data-centric errors in data pipelines, such as a data transformation gone wrong or a classifier underperforming due to noisy training data.
Abstract: Data pipelines are the new code. Consequently, data scientists need new tools to support the often time-consuming process of debugging their pipelines. We introduce Dagger, an end-to-end system to debug and mitigate data-centric errors in data pipelines, such as a data transformation gone wrong or a classifier underperforming due to noisy training data. Dagger supports inter-module debugging, where the pipeline blocks are treated as black boxes, as well as intra-module debugging, where users can debug data objects in Python scripts (e.g., DataFrames). In this demo, we will walk the audience through a rich, real-world business intelligence use case from our industrial collaborators at Intel, to highlight how Dagger enables data scientists to productively identify and mitigate data-centric problems at different stages of pipeline development.

9 citations


Posted ContentDOI
22 Dec 2020 - bioRxiv
TL;DR: Learned Indexes for Sequence Analysis (LISA) is a learning-based approach to DNA sequence search that builds on and extends the FM-index to accelerate exact search and super-maximal exact match (SMEM) search.
Abstract: Background: Next-generation sequencing (NGS) technologies have enabled affordable sequencing of billions of short DNA fragments at high throughput, paving the way for population-scale genomics. Genomics data analytics at this scale requires overcoming performance bottlenecks, such as searching for short DNA sequences over long reference sequences. Results: In this paper, we introduce LISA (Learned Indexes for Sequence Analysis), a novel learning-based approach to DNA sequence search. We focus on accelerating two of the most essential flavors of DNA sequence search: exact search and super-maximal exact match (SMEM) search. LISA builds on and extends the FM-index, which is the state-of-the-art technique widely deployed in genomics tools. Experiments with human, animal, and plant genome datasets indicate that LISA achieves up to 2.2x and 13.3x speedups over state-of-the-art FM-index based implementations for exact search and SMEM search, respectively.
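For context, the classical FM-index backward search that LISA accelerates looks roughly like the toy implementation below (LISA replaces the hot lookup steps with learned index structures; this sketch shows only the unaccelerated textbook baseline, with a naive O(n) rank instead of a real rank data structure).

```python
def fm_index(text):
    text += "$"  # sentinel, lexicographically smallest character
    # Suffix array via sorting (quadratic toy construction).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in sorted(set(text)):
        C[c] = total
        total += text.count(c)
    return bwt, C

def count_occurrences(bwt, C, pattern):
    # Backward search: maintain the suffix-array interval [lo, hi) of
    # suffixes prefixed by the pattern suffix processed so far.
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + bwt[:lo].count(c)  # naive rank(c, lo)
        hi = C[c] + bwt[:hi].count(c)  # naive rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

bwt, C = fm_index("ACGTACGTACGA")
print(count_occurrences(bwt, C, "ACG"))  # 3
```

Each pattern character costs one rank query per interval bound; these per-character lookups are exactly the memory-bound steps where a learned index can pay off at genome scale.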

3 citations


Posted Content
05 Jun 2020
TL;DR: This work presents Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components: a novel context-aware semantic structure and a neural-based code similarity scoring algorithm that can be implemented with various neural network architectures with learned parameters.
Abstract: Code similarity systems are integral to a range of applications from code recommendation to automated software defect correction. We argue that code similarity is now a first-order problem that must be solved. To begin to address this, we present Machine Inferred Code Similarity (MISIM), a novel end-to-end code similarity system that consists of two core components. First, MISIM uses a novel context-aware semantic structure, which is designed to aid in lifting semantic meaning from code syntax. Second, MISIM provides a neural-based code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters. We compare MISIM to three state-of-the-art code similarity systems: (i) code2vec, (ii) Neural Code Comprehension, and (iii) Aroma. In our experimental evaluation across 328,155 programs (over 18 million lines of code), MISIM has 1.5x to 43.4x better accuracy than all three systems.

3 citations


Posted Content
TL;DR: AnomalyBench, the first comprehensive benchmark for explainable AD over high-dimensional (2000+) time series data, is presented, and its key design features and practical utility are demonstrated through an experimental study with three state-of-the-art semi-supervised AD techniques.
Abstract: Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many domains, as they provide the research community a common ground for training, testing, evaluating, comparing, and experimenting with novel machine learning models. Lack of such community resources for anomaly detection (AD) severely limits progress. In this report, we present AnomalyBench, the first comprehensive benchmark for explainable AD over high-dimensional (2000+) time series data. AnomalyBench has been systematically constructed based on real data traces from ~100 repeated executions of 10 large-scale stream processing jobs on a Spark cluster. 30+ of these executions were disturbed by introducing ~100 instances of different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of these anomaly instances, ground truth labels for the root-cause interval as well as those for the effect interval are available, providing a means for supporting both AD tasks and explanation discovery (ED) tasks via root-cause analysis. We demonstrate the key design features and practical utility of AnomalyBench through an experimental study with three state-of-the-art semi-supervised AD techniques.

3 citations


Posted Content
TL;DR: Exathlon is the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data, based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster.
Abstract: Access to high-quality data repositories and benchmarks has been instrumental in advancing the state of the art in many experimental research domains. While advanced analytics tasks over time series data have been attracting increasing attention, lack of such community resources severely limits scientific progress. In this paper, we present Exathlon, the first comprehensive public benchmark for explainable anomaly detection over high-dimensional time series data. Exathlon has been systematically constructed based on real data traces from repeated executions of large-scale stream processing jobs on an Apache Spark cluster. Some of these executions were intentionally disturbed by introducing instances of six different types of anomalous events (e.g., misbehaving inputs, resource contention, process failures). For each of the anomaly instances, ground truth labels for the root cause interval as well as those for the extended effect interval are provided, supporting the development and evaluation of a wide range of anomaly detection (AD) and explanation discovery (ED) tasks. We demonstrate the practical utility of Exathlon's dataset, evaluation methodology, and end-to-end data science pipeline design through an experimental study with three state-of-the-art AD and ED techniques.

Posted Content
TL;DR: Machine Inferred Code Similarity (MISIM) is a neural code semantics similarity system consisting of two core components: (i) a novel context-aware semantics structure, which was purpose-built to lift semantics from code syntax; (ii) an extensible neural code similarity scoring algorithm, which can be implemented with various neural network architectures with learned parameters.
Abstract: Code semantics similarity can be used for many tasks such as code recommendation, automated software defect correction, and clone detection. Yet, the accuracy of such systems has not yet reached a level of general purpose reliability. To help address this, we present Machine Inferred Code Similarity (MISIM), a neural code semantics similarity system consisting of two core components: (i) MISIM uses a novel context-aware semantics structure, which was purpose-built to lift semantics from code syntax; (ii) MISIM uses an extensible neural code similarity scoring algorithm, which can be used for various neural network architectures with learned parameters. We compare MISIM to four state-of-the-art systems, including two additional hand-customized models, over 328K programs consisting of over 18 million lines of code. Our experiments show that MISIM has 8.08% better accuracy (using MAP@R) compared to the next best performing system.
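The scoring side of such a system reduces to ranking programs by similarity between learned vector representations. The sketch below is a generic illustration, not MISIM's actual model: the file names and embedding vectors are invented, standing in for the output of a trained neural encoder.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical embeddings, as if produced by a trained code encoder.
emb = {
    "bubble_sort.c":    [0.9, 0.1, 0.0],
    "insertion_sort.c": [0.8, 0.2, 0.1],
    "http_server.c":    [0.0, 0.1, 0.95],
}

query = "bubble_sort.c"
ranked = sorted((f for f in emb if f != query),
                key=lambda f: cosine(emb[query], emb[f]),
                reverse=True)
print(ranked)  # most semantically similar program first
```

Metrics like MAP@R, cited in the abstract, evaluate exactly such rankings: for each query program, how many of its top-R neighbors belong to the same semantic class.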