
Showing papers in "arXiv: Databases in 2017"


Posted Content
Daniel Kang, John Emmons, Firas Abuzaid, Peter Bailis, Matei Zaharia
TL;DR: NoScope is a system for querying videos that can reduce the cost of neural network video analysis by up to three orders of magnitude via inference-optimized model search. It achieves speed-ups of two to three orders of magnitude on binary classification tasks over fixed-angle webcam and surveillance video while maintaining accuracy within 1-5% of state-of-the-art neural networks.
Abstract: Recent advances in computer vision, in the form of deep neural networks, have made it possible to query increasing volumes of video data with high accuracy. However, neural network inference is computationally expensive at scale: applying a state-of-the-art object detector in real time (i.e., 30+ frames per second) to a single video requires a $4000 GPU. In response, we present NoScope, a system for querying videos that can reduce the cost of neural network video analysis by up to three orders of magnitude via inference-optimized model search. Given a target video, object to detect, and reference neural network, NoScope automatically searches for and trains a sequence, or cascade, of models that preserves the accuracy of the reference network but is specialized to the target video and is therefore far less computationally expensive. NoScope cascades two types of models: specialized models that forgo the full generality of the reference model but faithfully mimic its behavior for the target video and object; and difference detectors that highlight temporal differences across frames. We show that the optimal cascade architecture differs across videos and objects, so NoScope uses an efficient cost-based optimizer to search across models and cascades. With this approach, NoScope achieves speed-ups of two to three orders of magnitude (265-15,500x real-time) on binary classification tasks over fixed-angle webcam and surveillance video while maintaining accuracy within 1-5% of state-of-the-art neural networks.

187 citations


Posted Content
TL;DR: In this article, the authors present a benchmarking framework for understanding performance of private blockchains against data processing workloads, and conduct a comprehensive evaluation of three major blockchain systems based on BLOCKBENCH, namely Ethereum, Parity and Hyperledger Fabric.
Abstract: Blockchain technologies have gained massive momentum in the last few years. Blockchains are distributed ledgers that enable parties who do not fully trust each other to maintain a set of global states. The parties agree on the existence, values and histories of the states. As the technology landscape is expanding rapidly, it is both important and challenging to have a firm grasp of what the core technologies have to offer, especially with respect to their data processing capabilities. In this paper, we first survey the state of the art, focusing on private blockchains (in which parties are authenticated). We analyze both in-production and research systems in four dimensions: distributed ledger, cryptography, consensus protocol and smart contract. We then present BLOCKBENCH, a benchmarking framework for understanding performance of private blockchains against data processing workloads. We conduct a comprehensive evaluation of three major blockchain systems based on BLOCKBENCH, namely Ethereum, Parity and Hyperledger Fabric. The results demonstrate several trade-offs in the design space, as well as big performance gaps between blockchain and database systems. Drawing from design principles of database systems, we discuss several research directions for bringing blockchain performance closer to the realm of databases.

173 citations


Posted Content
TL;DR: In this article, the authors show that the worst-case size of a query is characterised by the fractional edge cover number of its underlying hypergraph, a combinatorial parameter previously known to provide an upper bound.
Abstract: Relational joins are at the core of relational algebra, which in turn is the core of the standard database query language SQL. As their evaluation is expensive and very often dominated by the output size, it is an important task for database query optimisers to compute estimates on the size of joins and to find good execution plans for sequences of joins. We study these problems from a theoretical perspective, both in the worst-case model, and in an average-case model where the database is chosen according to a known probability distribution. In the former case, our first key observation is that the worst-case size of a query is characterised by the fractional edge cover number of its underlying hypergraph, a combinatorial parameter previously known to provide an upper bound. We complete the picture by proving a matching lower bound, and by showing that there exist queries for which the join-project plan suggested by the fractional edge cover approach may be substantially better than any join plan that does not use intermediate projections. On the other hand, we show that in the average-case model, every join-project plan can be turned into a plan containing no projections in such a way that the expected time to evaluate the plan increases only by a constant factor independent of the size of the database. Not surprisingly, the key combinatorial parameter in this context is the maximum density of the underlying hypergraph. We show how to make effective use of this parameter to eliminate the projections.
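
As a worked illustration of the bound described above, here is a sketch in notation introduced for this example rather than taken from the abstract: the query $Q$ has one relation $R_e$ per hyperedge $e$ of its hypergraph $(V, E)$, and $x_e$ are the weights of a fractional edge cover.

```latex
\[
  |Q(D)| \;\le\; \prod_{e \in E} |R_e|^{x_e}
  \qquad\text{whenever}\qquad
  \sum_{e \ni v} x_e \ge 1 \ \text{ for all } v \in V, \quad x_e \ge 0 .
\]
% Triangle query Q(a,b,c) = R(a,b) \bowtie S(b,c) \bowtie T(a,c) with |R| = |S| = |T| = N:
% the weights x_R = x_S = x_T = 1/2 cover every variable with total weight 1, so
\[
  |Q(D)| \;\le\; N^{1/2} \cdot N^{1/2} \cdot N^{1/2} \;=\; N^{3/2} .
\]
```

The smallest achievable total weight $\sum_e x_e$ is the fractional edge cover number; for the triangle query it is $3/2$.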

132 citations


Journal ArticleDOI
TL;DR: A thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications is presented and the capabilities of each system with regard to Stream Processing and Approximate Query Processing are provided.
Abstract: The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of things (IoT) device located in a household to enormous distributed Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity. To store and analyze these vast amounts of data, specialized Time Series Management Systems (TSMSs) have been developed to overcome the limitations of general purpose Database Management Systems (DBMSs) for time series management. In this paper, we present a thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications. Our classification is organized into categories based on the architectures observed during our analysis. In addition, we provide an overview of each system with a focus on the motivational use case that drove the development of the system, the functionality for storage and querying of time series a system implements, the components the system is composed of, and the capabilities of each system with regard to Stream Processing and Approximate Query Processing (AQP). Last, we provide a summary of research directions proposed by other researchers in the field and present our vision for a next generation TSMS.

113 citations


Posted Content
TL;DR: G-CORE is a graph query language design with two key characteristics: it is composable, meaning that graphs are the input and the output of queries, and it treats paths as first-class citizens.
Abstract: We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals, and strikes a careful balance between path query expressivity and evaluation complexity.

112 citations


Journal ArticleDOI
TL;DR: This work presents a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use, and requires much less human labeled data and does not need feature engineering, compared with traditional machine learning based approaches.
Abstract: Entity resolution (ER) is a key data integration problem. Despite 70+ years of effort on all aspects of ER, there is still a high demand for democratizing ER: humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human effort). For accuracy, we use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well as the case where they are not; we present ways to learn and tune the distributed representations. For efficiency, we propose a locality sensitive hashing (LSH) based blocking approach that uses distributed representations of tuples; it takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. For ease-of-use, DeepER requires much less human labeled data and does not need feature engineering, compared with traditional machine learning based approaches which require handcrafted features and similarity functions along with their associated thresholds. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.

106 citations


Posted Content
TL;DR: This paper describes BLOCKBENCH, the first evaluation framework for analyzing private blockchains, which serves as a fair means of comparison for different platforms, enables deeper understanding of different system design choices, and is released for public use.
Abstract: Blockchain technologies are taking the world by storm. Public blockchains, such as Bitcoin and Ethereum, enable secure peer-to-peer applications like crypto-currency or smart contracts. Their security and performance are well studied. This paper concerns recent private blockchain systems designed with stronger security (trust) assumptions and performance requirements. These systems target and aim to disrupt applications which have so far been implemented on top of database systems, for example banking and finance applications. Multiple platforms for private blockchains are being actively developed and fine tuned. However, there is a clear lack of a systematic framework with which different systems can be analyzed and compared against each other. Such a framework can be used to assess blockchains' viability as another distributed data processing platform, while helping developers to identify bottlenecks and accordingly improve their platforms. In this paper, we first describe BlockBench, the first evaluation framework for analyzing private blockchains. It serves as a fair means of comparison for different platforms and enables deeper understanding of different system design choices. Any private blockchain can be integrated into BlockBench via simple APIs and benchmarked against workloads that are based on real and synthetic smart contracts. BlockBench measures overall and component-wise performance in terms of throughput, latency, scalability and fault-tolerance. Next, we use BlockBench to conduct a comprehensive evaluation of three major private blockchains: Ethereum, Parity and Hyperledger Fabric. The results demonstrate that these systems are still far from displacing current database systems in traditional data processing workloads. Furthermore, there are gaps in performance among the three systems which are attributed to the design choices at different layers of the software stack.

103 citations


Posted Content
TL;DR: This paper introduces a system called One Button Machine, or OneBM for short, which automates feature discovery in relational databases by automatically performing a key activity of data scientists, namely, joining database tables and applying advanced data transformations to extract useful features from data.
Abstract: Feature engineering is one of the most important and time-consuming tasks in predictive analytics projects. It involves understanding domain knowledge and data exploration to discover relevant hand-crafted features from raw data. In this paper, we introduce a system called One Button Machine, or OneBM for short, which automates feature discovery in relational databases. OneBM automatically performs a key activity of data scientists, namely, joining of database tables and applying advanced data transformations to extract useful features from data. We validated OneBM in Kaggle competitions, in which OneBM achieved performance as good as the top 16% to 24% of data scientists in three Kaggle competitions. More importantly, OneBM outperformed the state-of-the-art system in a Kaggle competition in terms of prediction accuracy and ranking on the Kaggle leaderboard. The results show that OneBM can be useful for both data scientists and non-experts. It helps data scientists reduce data exploration time, allowing them to try many ideas by trial and error in a short time. On the other hand, it enables non-experts, who are not familiar with data science, to quickly extract value from their data with little effort, time and cost.

66 citations


Posted Content
TL;DR: This work presents BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting from an extensible library that is pre-populated with general detection functions, including a novel detector based on the Word2Vec deep learning model, which detects errors across a diverse set of domains.
Abstract: Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined from a variety of different sources, each susceptible to different types of inconsistencies, and as new data streams in at prediction time, the model may encounter previously unseen inconsistencies. An important class of such inconsistencies is domain value violations that occur when an attribute value is outside of an allowed domain. We explore automatically detecting and repairing such violations by leveraging the often available clean test labels to determine whether a given detection and repair combination will improve model accuracy. We present BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting. BoostClean selects this ensemble from an extensible library that is pre-populated with general detection functions, including a novel detector based on the Word2Vec deep learning model, which detects errors across a diverse set of domains. Our evaluation on a collection of 12 datasets from Kaggle, the UCI repository, real-world data analyses, and production datasets shows that BoostClean can increase absolute prediction accuracy by up to 9% over the best non-ensembled alternatives. Our optimizations, including parallelism, materialization, and indexing techniques, show a 22.2x end-to-end speedup on a 16-core machine.

64 citations


Posted Content
TL;DR: This study conducts an in-depth analytical study of the queries formulated by end-users and harvested from large and up-to-date query logs from a wide variety of RDF data sources, and introduces the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query.
Abstract: With the adoption of RDF as the data model for Linked Data and the Semantic Web, query specification from end-users has become more and more common in SPARQL endpoints. In this paper, we conduct an in-depth analytical study of the queries formulated by end-users and harvested from large and up-to-date query logs from a wide variety of RDF data sources. As opposed to previous studies, ours is the first assessment on a voluminous query corpus, spanning several years and covering many representative SPARQL endpoints. Apart from the syntactical structure of the queries, which already exhibits interesting results on this generalized corpus, we drill deeper into the structural characteristics related to the graph and hypergraph representation of queries. We outline the most common shapes of queries when visually displayed as pseudographs, and characterize their (hyper-)tree width. Moreover, we analyze the evolution of queries over time, by introducing the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights on the already rich query features of real SPARQL queries formulated by real users, and brings us to draw a number of conclusions and pinpoint future directions for SPARQL query evaluation, query optimization, tuning, and benchmarking.

63 citations


Posted Content
TL;DR: A formal data model for JSON documents is proposed and, based on the common features present in available systems using JSON, a lightweight query language is defined allowing us to navigate through JSON documents.
Abstract: Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there is no agreed-upon theoretical framework for dealing with JSON. Therefore, in this paper we propose a formal data model for JSON documents and, based on the common features present in available systems using JSON, we define a lightweight query language allowing us to navigate through JSON documents. We also introduce a logic capturing the schema proposal for JSON and study the complexity of basic computational tasks associated with these two formalisms.

Posted Content
TL;DR: The first tight theoretical bounds on the accuracy of marginals compiled under each approach are proved, which show that releasing information based on (local) Fourier transformations of the input is preferable to alternatives based directly on (local) marginals.
Abstract: Many analysis and machine learning tasks require the availability of marginal statistics on multidimensional datasets while providing strong privacy guarantees for the data subjects. Applications for these statistics range from finding correlations in the data to fitting sophisticated prediction models. In this paper, we provide a set of algorithms for materializing marginal statistics under the strong model of local differential privacy. We prove the first tight theoretical bounds on the accuracy of marginals compiled under each approach, perform empirical evaluation to confirm these bounds, and evaluate them for tasks such as modeling and correlation testing. Our results show that releasing information based on (local) Fourier transformations of the input is preferable to alternatives based directly on (local) marginals.
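
To make the local model concrete, here is a minimal sketch of classic randomized response for a single private bit, the basic primitive behind many local differential privacy protocols. It is not one of the paper's marginal-release algorithms, and the function names and the `epsilon` parameter are assumptions made for this illustration.

```python
import math
import random
from typing import List


def keep_probability(epsilon: float) -> float:
    """Probability of reporting the true bit under epsilon-local DP."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))


def randomize_bit(bit: int, epsilon: float, rng: random.Random) -> int:
    """Each user perturbs their own bit before it leaves their device."""
    return bit if rng.random() < keep_probability(epsilon) else 1 - bit


def estimate_fraction(reports: List[int], epsilon: float) -> float:
    """Unbiased estimate of the true fraction of ones (requires epsilon > 0)."""
    p = keep_probability(epsilon)
    observed = sum(reports) / len(reports)
    # E[observed] = f * p + (1 - f) * (1 - p); solve for f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)


rng = random.Random(0)
true_bits = [1] * 6000 + [0] * 14000  # hypothetical population, 30% ones
reports = [randomize_bit(b, 1.0, rng) for b in true_bits]
estimate = estimate_fraction(reports, 1.0)
```

The aggregator only ever sees `reports`, so each individual keeps plausible deniability, while the debiasing step recovers the population-level statistic with noise that shrinks as the number of users grows.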

Journal ArticleDOI
TL;DR: This paper investigates the potential privacy loss of a traditional differential privacy (DP) mechanism under temporal correlations and proposes data releasing mechanisms that convert any existing DP mechanism into one against temporal privacy leakage (TPL).
Abstract: Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives to continuously release private data for protecting privacy at each time point (i.e., event-level privacy), which assume that the data at different time points are independent, or that adversaries do not have knowledge of correlation between data. However, continuously generated data tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations. First, we analyze the privacy leakage of a DP mechanism under temporal correlation that can be modeled using a Markov chain. Our analysis reveals that the event-level privacy loss of a DP mechanism may increase over time. We call the unexpected privacy loss temporal privacy leakage (TPL). Although TPL may increase over time, we find that its supremum may exist in some cases. Second, we design efficient algorithms for calculating TPL. Third, we propose data releasing mechanisms that convert any existing DP mechanism into one against TPL. Experiments confirm that our approach is efficient and effective.

Posted Content
TL;DR: The state-of-the-art techniques for skyline query processing, the numerous variations of the initial algorithm that were proposed to solve similar problems and the application-specific approaches that were developed to provide a solution efficiently in each case are surveyed.
Abstract: Living in the Information Age allows almost everyone to have access to a large amount of information and options to choose from in order to fulfill their needs. In many cases, the amount of information available and the rate of change may hide the optimal and truly desired solution. This reveals the need for a mechanism that will highlight the best options to choose among every possible scenario. Based on this, the skyline query was proposed: a decision support mechanism that retrieves the value-for-money options of a dataset by identifying the objects that present the optimal combination of the characteristics of the dataset. This paper surveys the state-of-the-art techniques for skyline query processing, the numerous variations of the initial algorithm that were proposed to solve similar problems and the application-specific approaches that were developed to provide a solution efficiently in each case. Additionally, in each section a taxonomy is outlined along with the key aspects of each algorithm and its relation to previous studies.
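
The skyline idea described above can be sketched as a naive quadratic scan, assuming that smaller values are better in every dimension. The hotel data and function names are hypothetical, and the algorithms surveyed in the paper are considerably more efficient than this illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if it is no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach in km).
hotels = [(50.0, 8.0), (80.0, 2.0), (60.0, 9.0), (120.0, 1.0), (90.0, 3.0)]
best = skyline(hotels)  # [(50.0, 8.0), (80.0, 2.0), (120.0, 1.0)]
```

Here (60, 9) is dropped because (50, 8) is both cheaper and closer, and (90, 3) is dropped because of (80, 2); the three remaining hotels each represent a different price/distance trade-off.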

Proceedings ArticleDOI
TL;DR: Lara is a lean middleware algebra of three operators for mixed-abstraction analytics tasks; LaraDB implements the Lara operators as range iterators in Apache Accumulo, a popular implementation of Google's BigTable.
Abstract: Analytics tasks manipulate structured data with variants of relational algebra (RA) and quantitative data with variants of linear algebra (LA). The two computational models have overlapping expressiveness, motivating a common programming model that affords unified reasoning and algorithm design. At the logical level we propose Lara, a lean algebra of three operators, that expresses RA and LA as well as relevant optimization rules. We show a series of proofs that position Lara at just the right level of expressiveness for a middleware algebra: more explicit than MapReduce but more general than RA or LA. At the physical level we find that the Lara operators afford efficient implementations using a single primitive that is available in a variety of backend engines: range scans over partitioned sorted maps. To evaluate these ideas, we implemented the Lara operators as range iterators in Apache Accumulo, a popular implementation of Google's BigTable. First we show how Lara expresses a sensor quality control task, and we measure the performance impact of optimizations Lara admits on this task. Second we show that the LaraDB implementation outperforms Accumulo's native MapReduce integration on a core task involving join and aggregation in the form of matrix multiply, especially at smaller scales that are typically a poor fit for scale-out approaches. We find that LaraDB offers a conceptually lean framework for optimizing mixed-abstraction analytics tasks, without giving up fast record-level updates and scans.

Posted Content
TL;DR: Froid is an extensible framework for optimizing imperative programs in relational databases; it automatically transforms entire UDFs into relational algebraic expressions and embeds them into the calling SQL query.
Abstract: For decades, RDBMSs have supported declarative SQL as well as imperative functions and procedures as ways for users to express data processing tasks. While the evaluation of declarative SQL has received a lot of attention resulting in highly sophisticated techniques, the evaluation of imperative programs has remained naive and highly inefficient. Imperative programs offer several benefits over SQL and hence are often preferred and widely used. But unfortunately, their abysmal performance discourages, and even prohibits their use in many situations. We address this important problem that has hitherto received little attention. We present Froid, an extensible framework for optimizing imperative programs in relational databases. Froid's novel approach automatically transforms entire User Defined Functions (UDFs) into relational algebraic expressions, and embeds them into the calling SQL query. This form is now amenable to cost-based optimization and results in efficient, set-oriented, parallel plans as opposed to inefficient, iterative, serial execution of UDFs. Froid's approach additionally brings the benefits of many compiler optimizations to UDFs with no additional implementation effort. We describe the design of Froid and present our experimental evaluation that demonstrates performance improvements of up to multiple orders of magnitude on real workloads.

Posted Content
TL;DR: In this article, the authors proposed an instance-optimal algorithm LocalSearch whose time complexity is linearly proportional to the size of the smallest subgraph that a correct algorithm needs to access without indexes.
Abstract: Community search over large graphs is a fundamental problem in graph analysis. Recent studies propose to compute top-k influential communities, where each reported community not only is a cohesive subgraph but also has a high influence value. The existing approaches to the problem of top-k influential community search can be categorized as index-based algorithms and online search algorithms without indexes. The index-based algorithms, although being very efficient in conducting community searches, need to pre-compute a special-purpose index and only work for one built-in vertex weight vector. In this paper, we investigate on-line search approaches and propose an instance-optimal algorithm LocalSearch whose time complexity is linearly proportional to the size of the smallest subgraph that a correct algorithm needs to access without indexes. In addition, we also propose techniques to make LocalSearch progressively compute and report the communities in decreasing influence value order such that k does not need to be specified. Moreover, we extend our framework to the general case of top-k influential community search regarding other cohesiveness measures. Extensive empirical studies on real graphs demonstrate that our algorithms outperform the existing online search algorithms by several orders of magnitude.

Posted Content
Xuelian Lin, Shuai Ma, Han Zhang, Tianyu Wo, Jinpeng Huai
TL;DR: This study develops a one-pass error bounded trajectory simplification algorithm (OPERB), which scans each data point in a trajectory once and only once, and proposes an aggressive one-pass error bounded trajectory simplification algorithm (OPERB-A), which allows interpolating new data points into a trajectory under certain conditions.
Abstract: Nowadays, various sensors are collecting, storing and transmitting tremendous amounts of trajectory data, and it is known that raw trajectory data seriously wastes storage, network bandwidth and computing resources. Line simplification (LS) algorithms are an effective approach to attacking this issue by compressing data points in a trajectory to a set of continuous line segments, and are commonly used in practice. However, existing LS algorithms are not sufficient for the needs of sensors in mobile devices. In this study, we first develop a one-pass error bounded trajectory simplification algorithm (OPERB), which scans each data point in a trajectory once and only once. We then propose an aggressive one-pass error bounded trajectory simplification algorithm (OPERB-A), which allows interpolating new data points into a trajectory under certain conditions. Finally, we experimentally verify that our approaches (OPERB and OPERB-A) are both efficient and effective, using four real-life trajectory datasets.

Journal ArticleDOI
TL;DR: This work presents an optimized software library written in C implementing Roaring bitmaps: CRoaring, which benefits from several algorithms designed for the single‐instruction–multiple‐data instructions available on commodity processors.
Abstract: Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix's Atlas, LinkedIn's Pinot, Metamarkets' Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Services and Apache Kylin rely on a specific type of compressed bitmap index called Roaring. We present an optimized software library written in C implementing Roaring bitmaps: CRoaring. It benefits from several algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. In particular, we present vectorized algorithms to compute the intersection, union, difference and symmetric difference between arrays. We benchmark the library against a wide range of competitive alternatives, identifying weaknesses and strengths in our software. Our work is available under a liberal open-source license.

Posted Content
TL;DR: HoloClean is a framework for holistic data repairing driven by probabilistic inference; it unifies existing qualitative data repairing approaches with quantitative data repairing methods that leverage statistical properties of the input data.
Abstract: We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean scales to instances with millions of tuples and finds data repairs with an average precision of ~90% and an average recall above ~76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2x against state-of-the-art methods.

Journal ArticleDOI
TL;DR: In this article, a new analytics operator called ASAP is developed to automatically smooth streaming time series by adaptively optimizing the trade-off between noise reduction and trend retention, while retaining large-scale structure to highlight significant deviations.
Abstract: Time series visualization of streaming telemetry (i.e., charting of key metrics such as server load over time) is increasingly prevalent in modern data platforms and applications. However, many existing systems simply plot the raw data streams as they arrive, often obscuring large-scale trends due to small-scale noise. We propose an alternative: to better prioritize end users' attention, smooth time series visualizations as much as possible to remove noise, while retaining large-scale structure to highlight significant deviations. We develop a new analytics operator called ASAP that automatically smooths streaming time series by adaptively optimizing the trade-off between noise reduction (i.e., variance) and trend retention (i.e., kurtosis). We introduce metrics to quantitatively assess the quality of smoothed plots and provide an efficient search strategy for optimizing these metrics that combines techniques from stream processing, user interface design, and signal processing via autocorrelation-based pruning, pixel-aware preaggregation, and on-demand refresh. We demonstrate that ASAP can improve users' accuracy in identifying long-term deviations in time series by up to 38.4% while reducing response times by up to 44.3%. Moreover, ASAP delivers these results several orders of magnitude faster than alternative search strategies.
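
The trade-off described above can be sketched as a brute-force window search over simple moving averages. Several choices here are assumptions for illustration only: noise is measured as the standard deviation of first differences, trend retention is enforced by requiring kurtosis not to drop below that of the raw series, and every window is tried exhaustively, whereas the abstract describes a much faster pruned search.

```python
from statistics import fmean, pstdev
from typing import List


def moving_average(xs: List[float], window: int) -> List[float]:
    """Simple moving average; the output has len(xs) - window + 1 points."""
    return [fmean(xs[i:i + window]) for i in range(len(xs) - window + 1)]


def roughness(xs: List[float]) -> float:
    """Standard deviation of first differences (assumed noise measure)."""
    return pstdev([b - a for a, b in zip(xs, xs[1:])])


def kurtosis(xs: List[float]) -> float:
    """Fourth standardized moment; defined as 0.0 for a constant series."""
    mean = fmean(xs)
    var = fmean([(x - mean) ** 2 for x in xs])
    if var == 0:
        return 0.0
    return fmean([(x - mean) ** 4 for x in xs]) / var ** 2


def choose_window(xs: List[float], max_window: int) -> int:
    """Smoothest window whose kurtosis is at least that of the raw series."""
    base = kurtosis(xs)
    best_window, best_rough = 1, roughness(xs)
    for window in range(2, max_window + 1):
        smoothed = moving_average(xs, window)
        if len(smoothed) < 2:
            break
        rough = roughness(smoothed)
        if kurtosis(smoothed) >= base and rough < best_rough:
            best_window, best_rough = window, rough
    return best_window


# Hypothetical telemetry: alternating noise around 10 with a short excursion.
series = [10.0 + (1.0 if i % 2 else -1.0) for i in range(60)]
series[40:44] = [20.0, 20.0, 20.0, 20.0]
window = choose_window(series, max_window=10)
smoothed = moving_average(series, window)
```

A window of 1 leaves the series untouched and always satisfies the constraint, so the search never returns a plot that is rougher than the raw data.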

Posted Content
TL;DR: This work introduces OrpheusDB, a dataset version control system that "bolts on" versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database "for free".
Abstract: Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that "bolts on" versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database "for free". We develop and evaluate multiple data models for representing versioned data, as well as a light-weight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average 1000x faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to 20x relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by 10x on average.

Posted Content
TL;DR: The idea of replacing core components of a data management system with learned models has far-reaching implications for future system designs, and this work provides just a glimpse of what might be possible.
Abstract: Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.
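
The "index as a model" premise can be made concrete with a minimal sketch. It is not the paper's architecture: it assumes a single least-squares line in place of a neural net, a non-empty sorted list of numeric keys, and a hypothetical class name `LinearLearnedIndex`. The model predicts a position, and the largest prediction error seen at build time bounds a short binary search around that guess.

```python
from bisect import bisect_left


class LinearLearnedIndex:
    """Toy learned index: a fitted line predicts a key's position (assumed design)."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)  # must be sorted and non-empty
        n = len(self.keys)
        mean_k = sum(self.keys) / n
        mean_p = (n - 1) / 2
        var = sum((k - mean_k) ** 2 for k in self.keys)
        cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
        self.slope = cov / var if var else 0.0
        self.intercept = mean_p - self.slope * mean_k
        # Worst-case distance between predicted and true position.
        self.max_err = max(
            abs(i - self._predict(k)) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        return round(self.slope * key + self.intercept)

    def lookup(self, key):
        """Return the position of key, or None if it is absent."""
        n = len(self.keys)
        pred = self._predict(key)
        lo = min(max(0, pred - self.max_err), n)
        hi = min(max(lo, pred + self.max_err + 1), n)
        i = bisect_left(self.keys, key, lo, hi)  # search only the error window
        return i if i < n and self.keys[i] == key else None


if __name__ == "__main__":
    index = LinearLearnedIndex([2, 3, 5, 8, 13, 21, 34, 55])
    print(index.lookup(13), index.lookup(4))
```

The closer the key distribution is to the fitted line, the smaller `max_err` becomes and the fewer comparisons a lookup needs.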

Posted Content
TL;DR: In this paper, the complexity of querying text by Conjunctive Queries and Unions of CQs (UCQs) on top of regex formulas is investigated.
Abstract: Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
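
To make "relations of spans" concrete, here is a small sketch. It assumes 0-based half-open Python offsets rather than the paper's notation, a single capture variable per formula, and a brute-force enumeration of every substring. The helper names `spans` and `adjacent_join` are hypothetical. Python's `re.finditer` is not used because it reports only non-overlapping matches, whereas this relation contains every matching span.

```python
import re


def spans(pattern, text):
    """All (start, end) spans whose substring fully matches pattern."""
    rx = re.compile(pattern)
    n = len(text)
    return [
        (i, j)
        for i in range(n + 1)
        for j in range(i, n + 1)
        if rx.fullmatch(text[i:j])
    ]


def adjacent_join(left, right):
    """Join two span relations on 'left ends exactly where right begins'."""
    return [(x, y) for x in left for y in right if x[1] == y[0]]


if __name__ == "__main__":
    text = "aab"
    xs = spans(r"a+", text)  # spans covering runs of 'a'
    ys = spans(r"b", text)   # spans covering a single 'b'
    print(xs, ys, adjacent_join(xs, ys))
```

Even one variable can yield quadratically many spans in the length of the text, which hints at why materializing these relations before joining them can be costly.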

Posted Content
TL;DR: This survey presents important uses of state as an enabler in big data processing systems, discusses alternative approaches to handling and implementing state, proposes a taxonomy of state management, and highlights new research directions.
Abstract: State management and its use in diverse applications vary widely across big data processing systems. This is evident in both the research literature and existing systems, such as Apache Flink, Apache Samza, Apache Spark, and Apache Storm. Given the pivotal role that state management plays in various use cases, in this survey, we present some of the most important uses of state as an enabler, discuss the alternative approaches used to handle and implement state, propose a taxonomy to capture the many facets of state management, and highlight new research directions. Our aim is to provide insight into disparate state management techniques, motivate others to pursue research in this area, and draw attention to some open problems.

Posted Content
TL;DR: This survey discusses and analyzes the various implementations of Apriori on the MapReduce framework on the basis of their distinguishing characteristics, and covers the advantages and limitations of the MapReduce framework.
Abstract: The Apriori algorithm, which mines frequent itemsets, is one of the most popular and widely used data mining algorithms. Nowadays, many algorithms have been proposed on parallel and distributed platforms to enhance the performance of the Apriori algorithm. They differ from each other in the load balancing technique, memory system, data decomposition technique and data layout used to implement them. The problems with most distributed frameworks are the overhead of managing a distributed system and the lack of a high-level parallel programming language. Also, with grid computing there is always a chance of node failures, which cause multiple re-executions of tasks. These problems can be overcome by the MapReduce framework introduced by Google. MapReduce is an efficient, scalable and simplified programming model for large-scale distributed data processing on a large cluster of commodity computers, and is also used in cloud computing. In this paper, we present an overview of parallel Apriori algorithms implemented on the MapReduce framework. They are categorized on the basis of the Map and Reduce functions used to implement them, e.g. 1-phase vs. k-phase, I/O of Mapper, Combiner and Reducer, using the functionality of Combiner inside Mapper, etc. This survey discusses and analyzes the various implementations of Apriori on the MapReduce framework on the basis of their distinguishing characteristics. Moreover, it also includes the advantages and limitations of the MapReduce framework.
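
A k-phase design of the kind the abstract mentions can be sketched in plain Python. This is an in-process simulation and does not reproduce any surveyed implementation or the Hadoop API. The function names `map_phase`, `reduce_phase` and `apriori` are hypothetical, and each loop iteration stands in for one MapReduce job that counts the candidate itemsets of size k.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, candidates):
    """Mapper: emit (candidate, 1) for each candidate contained in the transaction."""
    items = set(transaction)
    for candidate in candidates:
        if candidate <= items:
            yield candidate, 1


def reduce_phase(pairs, min_support):
    """Reducer: sum counts per candidate and keep those meeting min_support."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return {k: v for k, v in counts.items() if v >= min_support}


def apriori(transactions, min_support):
    """Run one simulated map/reduce round per itemset size k."""
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent, k = {}, 1
    while candidates:
        pairs = (p for t in transactions for p in map_phase(t, candidates))
        level = reduce_phase(pairs, min_support)
        frequent.update(level)
        k += 1
        # Candidate generation: union pairs of frequent (k-1)-itemsets into k-itemsets.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


if __name__ == "__main__":
    baskets = [["bread", "milk"], ["bread", "butter"],
               ["bread", "milk", "butter"], ["milk"]]
    for itemset, count in apriori(baskets, min_support=2).items():
        print(sorted(itemset), count)
```

A 1-phase variant would instead emit every subset of each transaction in a single job, trading fewer rounds for far more intermediate pairs.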

Posted Content
TL;DR: The use of data readiness levels is proposed: it gives a rough outline of three stages of data preparedness and speculates on how formalisation of these levels into a common language for data readiness could facilitate project management.
Abstract: Application of models to data is fraught. Data-generating collaborators often only have a very basic understanding of the complications of collating, processing and curating data. Challenges include: poor data collection practices, missing values, inconvenient storage mechanisms, intellectual property, security and privacy. All these aspects obstruct the sharing and interconnection of data, and the eventual interpretation of data through machine learning or other approaches. In project reporting, a major challenge is in encapsulating these problems and enabling goals to be built around the processing of data. Project overruns can occur due to failure to account for the amount of time required to curate and collate. But to understand these failures we need to have a common language for assessing the readiness of a particular data set. This position paper proposes the use of data readiness levels: it gives a rough outline of three stages of data preparedness and speculates on how formalisation of these levels into a common language for data readiness could facilitate project management.

Book ChapterDOI
TL;DR: Odyssey is an approach that uses statistics to estimate the cost of federated queries more accurately, which enables it to produce better query execution plans over federations of SPARQL endpoints.
Abstract: Answering queries over a federation of SPARQL endpoints requires combining data from more than one data source. Optimizing queries in such scenarios is particularly challenging not only because of (i) the large variety of possible query execution plans that correctly answer the query but also because (ii) there is only limited access to statistics about schema and instance data of remote sources. To overcome these challenges, most federated query engines rely on heuristics to reduce the space of possible query execution plans or on dynamic programming strategies to produce optimal plans. Nevertheless, these plans may still exhibit a high number of intermediate results or high execution times because of heuristics and inaccurate cost estimations. In this paper, we present Odyssey, an approach that uses statistics that allow for a more accurate cost estimation for federated queries and therefore enables Odyssey to produce better query execution plans. Our experimental results show that Odyssey produces query execution plans that are better in terms of data transfer and execution time than state-of-the-art optimizers. Our experiments using the FedBench benchmark show execution time gains of at least 25 times on average.

Book ChapterDOI
TL;DR: This paper presents a formal specification for openCypher, a high-level declarative graph query language, introduces a relational graph algebra with a mapping from core openCypher constructs to it, and proposes an algorithm for systematically compiling openCypher queries.
Abstract: Graph database systems are increasingly adapted for storing and processing heterogeneous network-like datasets. However, due to the novelty of such systems, no standard data model or query language has yet emerged. Consequently, migrating datasets or applications even between related technologies often requires a large amount of manual work or ad-hoc solutions, thus subjecting the users to the possibility of vendor lock-in. To avoid this threat, vendors are working on supporting existing standard languages (e.g. SQL) or creating standardised languages. In this paper, we present a formal specification for openCypher, a high-level declarative graph query language with an ongoing standardisation effort. We introduce relational graph algebra, which extends relational operators by adapting graph-specific operators, and define a mapping from core openCypher constructs to this algebra. We propose an algorithm that allows systematic compilation of openCypher queries.

Posted Content
TL;DR: In this article, the authors propose a framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning.
Abstract: With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular, especially in industries. It is becoming increasingly evident that effective big data analysis is key to solving artificial intelligence problems. Thus, a multi-algorithm library was implemented in the Spark framework, called MLlib. While this library supports multiple machine learning algorithms, there is still scope to use the Spark setup efficiently for highly time-intensive and computationally expensive procedures like deep learning. In this paper, we propose a novel framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning. We conduct empirical analysis of our framework on two real world datasets. The results are encouraging and corroborate our proposed framework, in turn proving that it is an improvement over traditional big data analysis methods that use either Spark or Deep learning as individual elements.