Showing papers in "arXiv: Databases in 2017"

Posted Content•

NoScope: Optimizing Neural Network Queries over Video at Scale

Daniel Kang¹, John Emmons¹, Firas Abuzaid¹, Peter Bailis¹, Matei Zaharia¹•Institutions (1)

07 Mar 2017-arXiv: Databases

TL;DR: NoScope is a system for querying videos that can reduce the cost of neural network video analysis by up to three orders of magnitude via inference-optimized model search. It achieves speed-ups of two to three orders of magnitude on binary classification tasks over fixed-angle webcam and surveillance video while maintaining accuracy within 1-5% of state-of-the-art neural networks.

Abstract: Recent advances in computer vision, in the form of deep neural networks, have made it possible to query increasing volumes of video data with high accuracy. However, neural network inference is computationally expensive at scale: applying a state-of-the-art object detector in real time (i.e., 30+ frames per second) to a single video requires a $4000 GPU. In response, we present NoScope, a system for querying videos that can reduce the cost of neural network video analysis by up to three orders of magnitude via inference-optimized model search. Given a target video, object to detect, and reference neural network, NoScope automatically searches for and trains a sequence, or cascade, of models that preserves the accuracy of the reference network but is specialized to the target video and is therefore far less computationally expensive. NoScope cascades two types of models: specialized models that forgo the full generality of the reference model but faithfully mimic its behavior for the target video and object; and difference detectors that highlight temporal differences across frames. We show that the optimal cascade architecture differs across videos and objects, so NoScope uses an efficient cost-based optimizer to search across models and cascades. With this approach, NoScope achieves speed-ups of two to three orders of magnitude (265-15,500x real-time) on binary classification tasks over fixed-angle webcam and surveillance video while maintaining accuracy within 1-5% of state-of-the-art neural networks.
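
The cascade idea in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NoScope's implementation: the function names, the toy frame representation (a flat list of pixel intensities), and the threshold values are all hypothetical, and the real system learns its models and thresholds with a cost-based optimizer.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Frame = Sequence[float]  # toy stand-in for an image: flat pixel intensities


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Toy difference detector: mean absolute pixel difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def run_cascade(
    frames: Sequence[Frame],
    specialized: Callable[[Frame], float],  # cheap model, returns a score in [0, 1]
    reference: Callable[[Frame], bool],     # expensive model, returns the label
    diff_threshold: float = 0.05,           # hypothetical value
    low: float = 0.2,                       # hypothetical value
    high: float = 0.8,                      # hypothetical value
) -> Tuple[List[bool], int]:
    """Label each frame, calling the expensive model only when needed."""
    labels: List[bool] = []
    anchor: Optional[Frame] = None  # last frame that a model actually looked at
    anchor_label = False
    reference_calls = 0
    for frame in frames:
        if anchor is not None and mean_abs_diff(frame, anchor) < diff_threshold:
            # Stage 1: frame barely changed, so reuse the previous label.
            labels.append(anchor_label)
            continue
        score = specialized(frame)
        if score <= low:
            label = False                   # Stage 2: confident negative
        elif score >= high:
            label = True                    # Stage 2: confident positive
        else:
            label = reference(frame)        # Stage 3: fall back to the reference model
            reference_calls += 1
        labels.append(label)
        anchor, anchor_label = frame, label
    return labels, reference_calls
```

The design point is that each stage is cheaper than the next, so the expensive reference model runs only on frames that changed and that the specialized model is unsure about.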

187 citations

Posted Content•

Untangling Blockchain: A Data Processing View of Blockchain Systems

Tien Tuan Anh Dinh¹, Rui Liu¹, Meihui Zhang, Gang Chen², Beng Chin Ooi¹, Ji Wang¹•Institutions (2)

National University of Singapore¹, Zhejiang University²

17 Aug 2017-arXiv: Databases

TL;DR: In this article, the authors present a benchmarking framework for understanding performance of private blockchains against data processing workloads, and conduct a comprehensive evaluation of three major blockchain systems based on BLOCKBENCH, namely Ethereum, Parity and Hyperledger Fabric.

Abstract: Blockchain technologies have gained massive momentum in the last few years. Blockchains are distributed ledgers that enable parties who do not fully trust each other to maintain a set of global states. The parties agree on the existence, values and histories of the states. As the technology landscape is expanding rapidly, it is both important and challenging to have a firm grasp of what the core technologies have to offer, especially with respect to their data processing capabilities. In this paper, we first survey the state of the art, focusing on private blockchains (in which parties are authenticated). We analyze both in-production and research systems in four dimensions: distributed ledger, cryptography, consensus protocol and smart contract. We then present BLOCKBENCH, a benchmarking framework for understanding performance of private blockchains against data processing workloads. We conduct a comprehensive evaluation of three major blockchain systems based on BLOCKBENCH, namely Ethereum, Parity and Hyperledger Fabric. The results demonstrate several trade-offs in the design space, as well as big performance gaps between blockchain and database systems. Drawing from design principles of database systems, we discuss several research directions for bringing blockchain performance closer to the realm of databases.

173 citations

Posted Content•

Size bounds and query plans for relational joins

Albert Atserias, Martin Grohe, Dániel Marx

10 Nov 2017-arXiv: Databases

TL;DR: In this article, the authors show that the worst-case size of a query is characterised by the fractional edge cover number of its underlying hypergraph, a combinatorial parameter previously known to provide an upper bound.

Abstract: Relational joins are at the core of relational algebra, which in turn is the core of the standard database query language SQL. As their evaluation is expensive and very often dominated by the output size, it is an important task for database query optimisers to compute estimates on the size of joins and to find good execution plans for sequences of joins. We study these problems from a theoretical perspective, both in the worst-case model, and in an average-case model where the database is chosen according to a known probability distribution. In the former case, our first key observation is that the worst-case size of a query is characterised by the fractional edge cover number of its underlying hypergraph, a combinatorial parameter previously known to provide an upper bound. We complete the picture by proving a matching lower bound, and by showing that there exist queries for which the join-project plan suggested by the fractional edge cover approach may be substantially better than any join plan that does not use intermediate projections. On the other hand, we show that in the average-case model, every join-project plan can be turned into a plan containing no projections in such a way that the expected time to evaluate the plan increases only by a constant factor independent of the size of the database. Not surprisingly, the key combinatorial parameter in this context is the maximum density of the underlying hypergraph. We show how to make effective use of this parameter to eliminate the projections.
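
As a concrete instance of the worst-case bound described in the abstract, consider the triangle query Q(a,b,c) :- R(a,b), S(b,c), T(a,c). Giving each of the three relations weight 1/2 covers every variable with total weight 3/2, so the output has at most N^(3/2) tuples when each relation has N tuples. The Python sketch below (helper names are illustrative and not taken from the paper) builds an instance that meets this bound exactly.

```python
import math
from itertools import product


def triangle_join(R, S, T):
    """Evaluate Q(a, b, c) :- R(a, b), S(b, c), T(a, c) with a hash join."""
    s_by_b = {}
    for b, c in S:
        s_by_b.setdefault(b, []).append(c)
    t_set = set(T)
    return [
        (a, b, c)
        for a, b in R
        for c in s_by_b.get(b, [])
        if (a, c) in t_set
    ]


m = 4
full = list(product(range(m), repeat=2))  # complete relation over [m] x [m]
N = len(full)                             # N = m * m tuples per relation
result = triangle_join(full, full, full)

# N is a perfect square here, so N^(3/2) = N * sqrt(N) is an exact integer.
bound = N * math.isqrt(N)
print(len(result), bound)
```

On this instance every pairwise join of two relations already has m^3 tuples, which is why the abstract's discussion of plans with intermediate projections matters for keeping intermediate results small on other queries.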

132 citations

Journal Article•DOI•

Time Series Management Systems: A Survey

Søren Kejser Jensen¹, Torben Bach Pedersen¹, Christian Thomsen¹•Institutions (1)

Aalborg University¹

03 Oct 2017-arXiv: Databases

TL;DR: A thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications is presented and the capabilities of each system with regard to Stream Processing and Approximate Query Processing are provided.

Abstract: The collection of time series data increases as more monitoring and automation are being deployed. These deployments range in scale from an Internet of Things (IoT) device located in a household to enormous distributed Cyber-Physical Systems (CPSs) producing large volumes of data at high velocity. To store and analyze these vast amounts of data, specialized Time Series Management Systems (TSMSs) have been developed to overcome the limitations of general purpose Database Management Systems (DBMSs) for time series management. In this paper, we present a thorough analysis and classification of TSMSs developed through academic or industrial research and documented through publications. Our classification is organized into categories based on the architectures observed during our analysis. In addition, we provide an overview of each system with a focus on the motivational use case that drove the development of the system, the functionality for storage and querying of time series a system implements, the components the system is composed of, and the capabilities of each system with regard to Stream Processing and Approximate Query Processing (AQP). Last, we provide a summary of research directions proposed by other researchers in the field and present our vision for a next generation TSMS.

113 citations

Posted Content•

G-CORE: A Core for Future Graph Query Languages

Renzo Angles¹, Marcelo Arenas², Pablo Barceló³, Peter Boncz, George H. L. Fletcher⁴, Claudio Gutierrez³, Tobias Lindaaker, Marcus Paradies, Stefan Plantikow, Juan F. Sequeda, Oskar van Rest⁵, Hannes Voigt⁶•Institutions (6)

University of Talca¹, Pontifical Catholic University of Chile², University of Chile³, Eindhoven University of Technology⁴, Oracle Corporation⁵, Dresden University of Technology⁶

05 Dec 2017-arXiv: Databases

TL;DR: G-CORE is a graph query language design with two key characteristics: it is composable, meaning that graphs are the input and the output of queries, and it treats paths as first-class citizens.

Abstract: We report on a community effort between industry and academia to shape the future of graph query languages. We argue that existing graph database management systems should consider supporting a query language with two key characteristics. First, it should be composable, meaning that graphs are the input and the output of queries. Second, the graph query language should treat paths as first-class citizens. Our result is G-CORE, a powerful graph query language design that fulfills these goals and strikes a careful balance between path query expressivity and evaluation complexity.

112 citations

Journal Article•DOI•

DeepER - Deep Entity Resolution.

Muhammad Ebraheem, Saravanan Thirumuruganathan, Shafiq Joty, Mourad Ouzzani, Nan Tang

02 Oct 2017-arXiv: Databases

TL;DR: This work presents a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use, and requires much less human labeled data and does not need feature engineering, compared with traditional machine learning based approaches.

Abstract: Entity resolution (ER) is a key data integration problem. Despite 70+ years of effort on all aspects of ER, there is still a high demand for democratizing ER: humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human effort). For accuracy, we use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well as the case where they are not; we present ways to learn and tune the distributed representations. For efficiency, we propose a locality sensitive hashing (LSH) based blocking approach that uses distributed representations of tuples; it takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. For ease-of-use, DeepER requires much less human labeled data and does not need feature engineering, compared with traditional machine learning based approaches which require handcrafted features and similarity functions along with their associated thresholds. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.
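
To make the LSH-based blocking step more concrete, here is a minimal Python sketch. It assumes random-hyperplane (sign) hashing, which is one common LSH family for vectors; the abstract does not say which family DeepER uses, and the function names and toy vectors below are hypothetical.

```python
import random
from collections import defaultdict


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_key(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def build_blocks(tuple_vectors, hyperplanes):
    """Group tuple ids by LSH key; only tuples in one block are compared."""
    blocks = defaultdict(list)
    for tuple_id, vec in tuple_vectors.items():
        blocks[lsh_key(vec, hyperplanes)].append(tuple_id)
    return dict(blocks)


# Toy "distributed representations" of three tuples (assumed, not real embeddings).
vectors = {
    "t1": [1.0, 2.0, 3.0],
    "t2": [2.0, 4.0, 6.0],     # same direction as t1
    "t3": [-1.0, -2.0, -3.0],  # opposite direction
}
planes = make_hyperplanes(dim=3, n_bits=8)
blocks = build_blocks(vectors, planes)
```

Vectors pointing in similar directions tend to share a key, so candidate pairs are generated only within a block instead of across all tuple pairs.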

106 citations

Posted Content•

BLOCKBENCH: A Framework for Analyzing Private Blockchains

Tien Tuan Anh Dinh¹, Ji Wang¹, Gang Chen², Rui Liu¹, Beng Chin Ooi¹, Kian-Lee Tan¹•Institutions (2)

National University of Singapore¹, Zhejiang University²

12 Mar 2017-arXiv: Databases

TL;DR: This paper describes BLOCKBENCH, the first evaluation framework for analyzing private blockchains. It serves as a fair means of comparison for different platforms, enables deeper understanding of different system design choices, and is released for public use.

Abstract: Blockchain technologies are taking the world by storm. Public blockchains, such as Bitcoin and Ethereum, enable secure peer-to-peer applications like crypto-currency or smart contracts. Their security and performance are well studied. This paper concerns recent private blockchain systems designed with stronger security (trust) assumptions and performance requirements. These systems target and aim to disrupt applications which have so far been implemented on top of database systems, for example banking and finance applications. Multiple platforms for private blockchains are being actively developed and fine tuned. However, there is a clear lack of a systematic framework with which different systems can be analyzed and compared against each other. Such a framework can be used to assess blockchains' viability as another distributed data processing platform, while helping developers to identify bottlenecks and accordingly improve their platforms. In this paper, we first describe BlockBench, the first evaluation framework for analyzing private blockchains. It serves as a fair means of comparison for different platforms and enables deeper understanding of different system design choices. Any private blockchain can be integrated into BlockBench via simple APIs and benchmarked against workloads that are based on real and synthetic smart contracts. BlockBench measures overall and component-wise performance in terms of throughput, latency, scalability and fault-tolerance. Next, we use BlockBench to conduct a comprehensive evaluation of three major private blockchains: Ethereum, Parity and Hyperledger Fabric. The results demonstrate that these systems are still far from displacing current database systems in traditional data processing workloads. Furthermore, there are gaps in performance among the three systems which are attributed to the design choices at different layers of the software stack.

103 citations

Posted Content•

One button machine for automating feature engineering in relational databases

Hoang Thanh Lam, Johann-Michael Thiebaut, Mathieu Sinn, Bei Chen, Tiep Mai, Oznur Alkan

01 Jun 2017-arXiv: Databases

TL;DR: This paper introduces a system called One Button Machine, or OneBM for short, which automates feature discovery in relational databases. It automatically performs a key activity of data scientists, namely joining database tables and applying advanced data transformations to extract useful features from data.

Abstract: Feature engineering is one of the most important and time consuming tasks in predictive analytics projects. It involves understanding domain knowledge and data exploration to discover relevant hand-crafted features from raw data. In this paper, we introduce a system called One Button Machine, or OneBM for short, which automates feature discovery in relational databases. OneBM automatically performs a key activity of data scientists, namely, joining of database tables and applying advanced data transformations to extract useful features from data. We validated OneBM in three Kaggle competitions, in which it performed as well as the top 16% to 24% of data scientists. More importantly, OneBM outperformed the state-of-the-art system in a Kaggle competition in terms of prediction accuracy and ranking on the Kaggle leaderboard. The results show that OneBM can be useful for both data scientists and non-experts. It helps data scientists reduce data exploration time, allowing them to try many ideas by trial and error in a short time. On the other hand, it enables non-experts, who are not familiar with data science, to quickly extract value from their data with little effort, time and cost.

66 citations

Posted Content•

BoostClean: Automated Error Detection and Repair for Machine Learning

Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Eugene Wu

03 Nov 2017-arXiv: Databases

TL;DR: This work presents BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting from an extensible library that is pre-populated with general detection functions, including a novel detector based on the Word2Vec deep learning model, which detects errors across a diverse set of domains.

Abstract: Predictive models based on machine learning can be highly sensitive to data error. Training data are often combined from a variety of different sources, each susceptible to different types of inconsistencies, and as new data streams in during prediction time, the model may encounter previously unseen inconsistencies. An important class of such inconsistencies is domain value violations that occur when an attribute value is outside of an allowed domain. We explore automatically detecting and repairing such violations by leveraging the often available clean test labels to determine whether a given detection and repair combination will improve model accuracy. We present BoostClean, which automatically selects an ensemble of error detection and repair combinations using statistical boosting. BoostClean selects this ensemble from an extensible library that is pre-populated with general detection functions, including a novel detector based on the Word2Vec deep learning model, which detects errors across a diverse set of domains. Our evaluation on a collection of 12 datasets from Kaggle, the UCI repository, real-world data analyses, and production datasets shows that BoostClean can increase absolute prediction accuracy by up to 9% over the best non-ensembled alternatives. Our optimizations, including parallelism, materialization, and indexing techniques, show a 22.2x end-to-end speedup on a 16-core machine.

64 citations

Posted Content•

An Analytical Study of Large SPARQL Query Logs

Angela Bonifati, Wim Martens¹, Thomas Timm¹•Institutions (1)

University of Bayreuth¹

01 Aug 2017-arXiv: Databases

TL;DR: This study conducts an in-depth analytical study of the queries formulated by end-users and harvested from large and up-to-date query logs from a wide variety of RDF data sources, and introduces the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query.

Abstract: With the adoption of RDF as the data model for Linked Data and the Semantic Web, query specification from end-users has become more and more common in SPARQL endpoints. In this paper, we conduct an in-depth analytical study of the queries formulated by end-users and harvested from large and up-to-date query logs from a wide variety of RDF data sources. As opposed to previous studies, ours is the first assessment on a voluminous query corpus, spanning over several years and covering many representative SPARQL endpoints. Apart from the syntactical structure of the queries, that exhibits already interesting results on this generalized corpus, we drill deeper in the structural characteristics related to the graph- and hypergraph representation of queries. We outline the most common shapes of queries when visually displayed as pseudographs, and characterize their (hyper-)tree width. Moreover, we analyze the evolution of queries over time, by introducing the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights on the already rich query features of real SPARQL queries formulated by real users, and brings us to draw a number of conclusions and pinpoint future directions for SPARQL query evaluation, query optimization, tuning, and benchmarking.

63 citations

Posted Content•

JSON: data model, query languages and schema specification

Pierre Bourhis¹, Juan L. Reutter, Fernando Suárez, Domagoj Vrgoč•Institutions (1)

French Institute for Research in Computer Science and Automation¹

09 Jan 2017-arXiv: Databases

TL;DR: A formal data model for JSON documents is proposed and, based on the common features present in available systems using JSON, a lightweight query language is defined allowing us to navigate through JSON documents.

Abstract: Despite the fact that JSON is currently one of the most popular formats for exchanging data on the Web, there are very few studies on this topic and there is no agreed-upon theoretical framework for dealing with JSON. Therefore in this paper we propose a formal data model for JSON documents and, based on the common features present in available systems using JSON, we define a lightweight query language allowing us to navigate through JSON documents. We also introduce a logic capturing the schema proposal for JSON and study the complexity of basic computational tasks associated with these two formalisms.

Posted Content•

Marginal Release Under Local Differential Privacy

Tejas Kulkarni, Graham Cormode, Divesh Srivastava

08 Nov 2017-arXiv: Databases

TL;DR: The first tight theoretical bounds on the accuracy of marginals compiled under each approach are proved, which show that releasing information based on (local) Fourier transformations of the input is preferable to alternatives based directly on (local) marginals.

Abstract: Many analysis and machine learning tasks require the availability of marginal statistics on multidimensional datasets while providing strong privacy guarantees for the data subjects. Applications for these statistics range from finding correlations in the data to fitting sophisticated prediction models. In this paper, we provide a set of algorithms for materializing marginal statistics under the strong model of local differential privacy. We prove the first tight theoretical bounds on the accuracy of marginals compiled under each approach, perform empirical evaluation to confirm these bounds, and evaluate them for tasks such as modeling and correlation testing. Our results show that releasing information based on (local) Fourier transformations of the input is preferable to alternatives based directly on (local) marginals.

Journal Article•DOI•

Quantifying Differential Privacy in Continuous Data Release under Temporal Correlations.

Yang Cao¹, Masatoshi Yoshikawa², Yonghui Xiao³, Li Xiong¹•Institutions (3)

Emory University¹, Kyoto University², Google³

29 Nov 2017-arXiv: Databases

TL;DR: This paper investigates the potential privacy loss of a traditional differential privacy (DP) mechanism under temporal correlations and proposes data releasing mechanisms that convert any existing DP mechanism into one that protects against temporal privacy leakage (TPL).

Abstract: Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives to continuously release private data for protecting privacy at each time point (i.e., event-level privacy), which assume that the data at different time points are independent, or that adversaries do not have knowledge of correlation between data. However, continuously generated data tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations. First, we analyze the privacy leakage of a DP mechanism under temporal correlation that can be modeled using a Markov chain. Our analysis reveals that the event-level privacy loss of a DP mechanism may increase over time. We call the unexpected privacy loss temporal privacy leakage (TPL). Although TPL may increase over time, we find that its supremum may exist in some cases. Second, we design efficient algorithms for calculating TPL. Third, we propose data releasing mechanisms that convert any existing DP mechanism into one against TPL. Experiments confirm that our approach is efficient and effective.
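
For readers unfamiliar with the baseline being analyzed, the following Python sketch shows the Laplace mechanism applied independently at each time point, which is the event-level setting the abstract describes. It does not implement the paper's TPL analysis or its proposed mechanisms; the function names and the example counts are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def release_counts(true_counts, epsilon, sensitivity=1.0, seed=None):
    """Release one noisy count per time point with per-release budget epsilon.

    Each release is perturbed independently, which implicitly treats the
    time points as uncorrelated.
    """
    rng = random.Random(seed)
    scale = sensitivity / epsilon  # standard Laplace-mechanism noise scale
    return [count + laplace_noise(scale, rng) for count in true_counts]


# Hypothetical hourly counts of users at a location.
noisy = release_counts([10, 12, 11, 13], epsilon=0.5, seed=42)
```

The paper's point is that when consecutive counts are correlated, an adversary who knows the correlation can combine these independently noised releases, so the effective privacy loss at a single time point can exceed the nominal epsilon.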

Posted Content•

A Survey of Skyline Query Processing.

Christos Kalyvas, Theodoros Tzouramanis

06 Apr 2017-arXiv: Databases

TL;DR: The state-of-the-art techniques for skyline query processing, the numerous variations of the initial algorithm that were proposed to solve similar problems and the application-specific approaches that were developed to provide a solution efficiently in each case are surveyed.

Abstract: Living in the Information Age allows almost everyone to have access to a large amount of information and options to choose from in order to fulfill their needs. In many cases, the amount of information available and the rate of change may hide the optimal and truly desired solution. This reveals the need for a mechanism that will highlight the best options to choose among every possible scenario. Based on this, the skyline query was proposed, which is a decision support mechanism that retrieves the value-for-money options of a dataset by identifying the objects that present the optimal combination of the characteristics of the dataset. This paper surveys the state-of-the-art techniques for skyline query processing, the numerous variations of the initial algorithm that were proposed to solve similar problems and the application-specific approaches that were developed to provide a solution efficiently in each case. Additionally, in each section a taxonomy is outlined along with the key aspects of each algorithm and its relation to previous studies.
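
A skyline query can be illustrated with a short Python sketch. This is a simple nested-loop formulation chosen for readability, not any specific algorithm from the survey; it assumes smaller values are better in every dimension, and the hotel data is made up.

```python
def dominates(p, q):
    """p dominates q if p is no worse everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Return the points that no other point dominates (smaller is better)."""
    result = []
    for p in points:
        if any(dominates(q, p) for q in result):
            continue  # p is beaten by a current skyline candidate
        # p survives: drop any candidates that p itself dominates.
        result = [q for q in result if not dominates(p, q)]
        result.append(p)
    return result


# Hypothetical hotels as (price, distance_to_beach_km).
hotels = [(50, 8), (80, 2), (60, 9), (120, 1), (90, 3)]
print(skyline(hotels))  # [(50, 8), (80, 2), (120, 1)]
```

Each hotel in the result is a reasonable trade-off: no other hotel is both cheaper and closer to the beach.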

Proceedings Article•DOI•

LaraDB: A Minimalist Kernel for Linear and Relational Algebra Computation

Dylan Hutchison¹, Bill Howe¹, Dan Suciu¹•Institutions (1)

University of Washington¹

21 Mar 2017-arXiv: Databases

TL;DR: Lara is a lean middleware algebra of three operators for mixed-abstraction analytics tasks; LaraDB implements the Lara operators as range iterators in Apache Accumulo, a popular implementation of Google's BigTable.

Abstract: Analytics tasks manipulate structured data with variants of relational algebra (RA) and quantitative data with variants of linear algebra (LA). The two computational models have overlapping expressiveness, motivating a common programming model that affords unified reasoning and algorithm design. At the logical level we propose Lara, a lean algebra of three operators, that expresses RA and LA as well as relevant optimization rules. We show a series of proofs that position Lara at just the right level of expressiveness for a middleware algebra: more explicit than MapReduce but more general than RA or LA. At the physical level we find that the Lara operators afford efficient implementations using a single primitive that is available in a variety of backend engines: range scans over partitioned sorted maps. To evaluate these ideas, we implemented the Lara operators as range iterators in Apache Accumulo, a popular implementation of Google's BigTable. First we show how Lara expresses a sensor quality control task, and we measure the performance impact of optimizations Lara admits on this task. Second we show that the LaraDB implementation outperforms Accumulo's native MapReduce integration on a core task involving join and aggregation in the form of matrix multiply, especially at smaller scales that are typically a poor fit for scale-out approaches. We find that LaraDB offers a conceptually lean framework for optimizing mixed-abstraction analytics tasks, without giving up fast record-level updates and scans.

Posted Content•

Optimization of Imperative Programs in a Relational Database

Karthik Ramachandra¹, Kwanghyun Park¹, K. Venkatesh Emani², Alan Halverson¹, Cesar A. Galindo-Legaria¹, Conor Cunningham¹•Institutions (2)

Microsoft¹, Indian Institute of Technology Bombay²

01 Dec 2017-arXiv: Databases

TL;DR: Froid is an extensible framework for optimizing imperative programs in relational databases. It automatically transforms entire UDFs into relational algebraic expressions and embeds them into the calling SQL query.

Abstract: For decades, RDBMSs have supported declarative SQL as well as imperative functions and procedures as ways for users to express data processing tasks. While the evaluation of declarative SQL has received a lot of attention resulting in highly sophisticated techniques, the evaluation of imperative programs has remained naive and highly inefficient. Imperative programs offer several benefits over SQL and hence are often preferred and widely used. But unfortunately, their abysmal performance discourages, and even prohibits their use in many situations. We address this important problem that has hitherto received little attention. We present Froid, an extensible framework for optimizing imperative programs in relational databases. Froid's novel approach automatically transforms entire User Defined Functions (UDFs) into relational algebraic expressions, and embeds them into the calling SQL query. This form is now amenable to cost-based optimization and results in efficient, set-oriented, parallel plans as opposed to inefficient, iterative, serial execution of UDFs. Froid's approach additionally brings the benefits of many compiler optimizations to UDFs with no additional implementation effort. We describe the design of Froid and present our experimental evaluation that demonstrates performance improvements of up to multiple orders of magnitude on real workloads.

Posted Content•

An Optimal and Progressive Approach to Online Search of Top-k Influential Communities

Fei Bi¹, Lijun Chang², Xuemin Lin¹, Wenjie Zhang¹•Institutions (2)

University of New South Wales¹, University of Sydney²

16 Nov 2017-arXiv: Databases

TL;DR: In this article, the authors proposed an instance-optimal algorithm LocalSearch whose time complexity is linearly proportional to the size of the smallest subgraph that a correct algorithm needs to access without indexes.

Abstract: Community search over large graphs is a fundamental problem in graph analysis. Recent studies propose to compute top-k influential communities, where each reported community not only is a cohesive subgraph but also has a high influence value. The existing approaches to the problem of top-k influential community search can be categorized as index-based algorithms and online search algorithms without indexes. The index-based algorithms, although being very efficient in conducting community searches, need to pre-compute a special-purpose index and only work for one built-in vertex weight vector. In this paper, we investigate on-line search approaches and propose an instance-optimal algorithm LocalSearch whose time complexity is linearly proportional to the size of the smallest subgraph that a correct algorithm needs to access without indexes. In addition, we also propose techniques to make LocalSearch progressively compute and report the communities in decreasing influence value order such that k does not need to be specified. Moreover, we extend our framework to the general case of top-k influential community search regarding other cohesiveness measures. Extensive empirical studies on real graphs demonstrate that our algorithms outperform the existing online search algorithms by several orders of magnitude.

Posted Content•

One-Pass Error Bounded Trajectory Simplification

Xuelian Lin¹, Shuai Ma¹, Han Zhang¹, Tianyu Wo¹, Jinpeng Huai¹•Institutions (1)

Beihang University¹

18 Feb 2017-arXiv: Databases

TL;DR: This study develops a one-pass error bounded trajectory simplification algorithm (OPERB), which scans each data point in a trajectory once and only once, and proposes an aggressive one-pass error bounded trajectory simplification algorithm (OPERB-A), which allows interpolating new data points into a trajectory under certain conditions.

Abstract: Nowadays, various sensors are collecting, storing and transmitting tremendous amounts of trajectory data, and it is known that raw trajectory data seriously wastes storage, network bandwidth and computing resources. Line simplification (LS) algorithms are an effective approach to attacking this issue by compressing data points in a trajectory to a set of continuous line segments, and are commonly used in practice. However, existing LS algorithms are not sufficient for the needs of sensors in mobile devices. In this study, we first develop a one-pass error bounded trajectory simplification algorithm (OPERB), which scans each data point in a trajectory once and only once. We then propose an aggressive one-pass error bounded trajectory simplification algorithm (OPERB-A), which allows interpolating new data points into a trajectory under certain conditions. Finally, we experimentally verify that our approaches (OPERB and OPERB-A) are both efficient and effective, using four real-life trajectory datasets.

Journal Article•DOI•

Roaring Bitmaps: Implementation of an Optimized Software Library

Daniel Lemire¹, Owen Kaser², Nathan Kurz, Luca Deri, Chris O'Hara, François Saint-Jacques, Gregory Ssi-Yan-Kai•Institutions (2)

Université du Québec à Montréal¹, University of New Brunswick²

22 Sep 2017-arXiv: Databases

TL;DR: This work presents an optimized software library written in C implementing Roaring bitmaps: CRoaring, which benefits from several algorithms designed for the single‐instruction–multiple‐data instructions available on commodity processors.

Abstract: Compressed bitmap indexes are used in systems such as Git or Oracle to accelerate queries. They represent sets and often support operations such as unions, intersections, differences, and symmetric differences. Several important systems such as Elasticsearch, Apache Spark, Netflix's Atlas, LinkedIn's Pinot, Metamarkets' Druid, Pilosa, Apache Hive, Apache Tez, Microsoft Visual Studio Team Services and Apache Kylin rely on a specific type of compressed bitmap index called Roaring. We present an optimized software library written in C implementing Roaring bitmaps: CRoaring. It benefits from several algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. In particular, we present vectorized algorithms to compute the intersection, union, difference and symmetric difference between arrays. We benchmark the library against a wide range of competitive alternatives, identifying weaknesses and strengths in our software. Our work is available under a liberal open-source license.
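To make the four set operations named in the abstract concrete, the following is a minimal scalar sketch over sorted, duplicate-free integer lists. It is not the CRoaring implementation and not one of the paper's vectorized algorithms; it only illustrates, under those assumptions, the merge-style baseline that such SIMD routines accelerate. All function names are hypothetical.

```python
def intersect(a, b):
    """Elements present in both sorted, duplicate-free lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    return out


def union(a, b):
    """Elements present in either list, each emitted once, in sorted order."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def difference(a, b):
    """Elements of a that do not occur in b."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            j += 1
        else:
            i += 1
            j += 1
    out.extend(a[i:])
    return out


def symmetric_difference(a, b):
    """Elements that occur in exactly one of the two lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] > b[j]:
            out.append(b[j])
            j += 1
        else:
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out
```

Each function makes a single pass over both inputs, so the cost is linear in the combined length; for example, `intersect([1, 3, 5, 7], [3, 4, 5, 8])` yields `[3, 5]`.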

Posted Content•

HoloClean: Holistic Data Repairs with Probabilistic Inference

Theodoros Rekatsinas¹, Xu Chu², Ihab F. Ilyas², Christopher Ré¹•Institutions (2)

Stanford University¹, University of Waterloo²

02 Feb 2017-arXiv: Databases

TL;DR: HoloClean is a framework for holistic data repairing driven by probabilistic inference; it unifies existing qualitative data repairing approaches with quantitative data repairing methods that leverage statistical properties of the input data.

Abstract: We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies existing qualitative data repairing approaches, which rely on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean scales to instances with millions of tuples and finds data repairs with an average precision of ~90% and an average recall above ~76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2x against state-of-the-art methods.

Journal Article•DOI•

ASAP: Prioritizing Attention via Time Series Smoothing

Kexin Rong¹, Peter Bailis¹•Institutions (1)

Stanford University¹

02 Mar 2017-arXiv: Databases

TL;DR: In this article, a new analytics operator called ASAP is developed to automatically smooth streaming time series by adaptively optimizing the trade-off between noise reduction and trend retention, while retaining large-scale structure to highlight significant deviations.

Abstract: Time series visualization of streaming telemetry (i.e., charting of key metrics such as server load over time) is increasingly prevalent in modern data platforms and applications. However, many existing systems simply plot the raw data streams as they arrive, often obscuring large-scale trends due to small-scale noise. We propose an alternative: to better prioritize end users' attention, smooth time series visualizations as much as possible to remove noise, while retaining large-scale structure to highlight significant deviations. We develop a new analytics operator called ASAP that automatically smooths streaming time series by adaptively optimizing the trade-off between noise reduction (i.e., variance) and trend retention (i.e., kurtosis). We introduce metrics to quantitatively assess the quality of smoothed plots and provide an efficient search strategy for optimizing these metrics that combines techniques from stream processing, user interface design, and signal processing via autocorrelation-based pruning, pixel-aware preaggregation, and on-demand refresh. We demonstrate that ASAP can improve users' accuracy in identifying long-term deviations in time series by up to 38.4% while reducing response times by up to 44.3%. Moreover, ASAP delivers these results several orders of magnitude faster than alternative search strategies.
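The trade-off described above can be illustrated with a brute-force sketch. It assumes, as a simplification not spelled out in the abstract, that smoothing is a simple moving average, that noise is measured as the variance of consecutive differences, and that a window is acceptable only if the kurtosis of the smoothed series does not fall below that of the raw series. The function names are hypothetical, and none of the pruning or preaggregation techniques mentioned in the abstract are implemented here.

```python
def moving_average(xs, w):
    """Simple moving average with window size w (no padding)."""
    return [sum(xs[i:i + w]) / w for i in range(len(xs) - w + 1)]


def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


def kurtosis(xs):
    """Fourth standardized moment; returns 0.0 for a constant series."""
    m = sum(xs) / len(xs)
    var = variance(xs)
    if var == 0:
        return 0.0
    return sum((x - m) ** 4 for x in xs) / len(xs) / var ** 2


def roughness(xs):
    """Variance of consecutive differences, used here as the noise measure."""
    return variance([b - a for a, b in zip(xs, xs[1:])])


def choose_window(xs, max_window):
    """Exhaustively pick the window with the lowest roughness whose
    smoothed series keeps at least the kurtosis of the raw series."""
    base_kurtosis = kurtosis(xs)
    best_w, best_r = 1, roughness(xs)
    for w in range(2, max_window + 1):
        smoothed = moving_average(xs, w)
        if len(smoothed) < 3:
            break  # too few points left to measure roughness meaningfully
        r = roughness(smoothed)
        if kurtosis(smoothed) >= base_kurtosis and r < best_r:
            best_w, best_r = w, r
    return best_w
```

A caller would plot `moving_average(series, choose_window(series, max_window))` instead of the raw series. Window size 1 is always a valid fallback, so the chosen plot is never rougher than the original.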

Posted Content•

OrpheusDB: Bolt-on Versioning for Relational Databases

Silu Huang¹, Liqi Xu¹, Jialin Liu¹, Aaron J. Elmore², Aditya Parameswaran¹•Institutions (2)

University of Illinois at Urbana–Champaign¹, University of Chicago²

07 Mar 2017-arXiv: Databases

TL;DR: This work introduces OrpheusDB, a dataset version control system that "bolts on" versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database "for free".

Abstract: Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that "bolts on" versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database "for free". We develop and evaluate multiple data models for representing versioned data, as well as a light-weight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average 1000x faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to 20x relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by 10x on average.

Posted Content•

The Case for Learned Index Structures

Tim Kraska¹, Alex Beutel², Ed H. Chi², Jeffrey Dean², Neoklis Polyzotis²•Institutions (2)

Massachusetts Institute of Technology¹, Google²

04 Dec 2017-arXiv: Databases

TL;DR: The authors believe that the idea of replacing core components of a data management system through learned models has far-reaching implications for future systems designs and that this work provides just a glimpse of what might be possible.

Abstract: Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show that, by using neural nets, we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order of magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far-reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.
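The premise that an index is a model mapping a key to a position in a sorted array can be sketched in a few lines. The example below is only an illustration under stated assumptions: it swaps the paper's neural networks for a single least-squares line, assumes a non-empty set of numeric keys, and corrects the prediction with a binary search bounded by the worst error seen at build time. The class and function names are hypothetical.

```python
import bisect


def fit_linear(keys):
    """Least-squares fit of position ~ slope * key + intercept
    over a sorted list of distinct keys."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    var = sum((k - mean_k) ** 2 for k in keys)
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    slope = cov / var if var else 0.0
    return slope, mean_p - slope * mean_k


class LearnedIndex:
    def __init__(self, keys):
        self.keys = sorted(set(keys))  # assumed non-empty
        self.slope, self.intercept = fit_linear(self.keys)
        # Worst-case gap between predicted and true position; lookups
        # only need to search this far around the prediction.
        self.max_err = max(
            abs(self._predict(k) - i) for i, k in enumerate(self.keys)
        )

    def _predict(self, key):
        pos = int(round(self.slope * key + self.intercept))
        return min(max(pos, 0), len(self.keys) - 1)

    def lookup(self, key):
        """Return the position of key in the sorted array, or None."""
        guess = self._predict(key)
        lo = max(guess - self.max_err, 0)
        hi = min(guess + self.max_err + 1, len(self.keys))
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None
```

The closer the key distribution is to the fitted line, the smaller `max_err` becomes and the fewer comparisons each lookup performs; a poorly fitting model degrades toward an ordinary binary search over the whole array.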

Posted Content•

Joining Extractions of Regular Expressions

Dominik D. Freydenberger¹, Benny Kimelfeld², Liat Peterfreund²•Institutions (2)

Loughborough University¹, Technion – Israel Institute of Technology²

30 Mar 2017-arXiv: Databases

TL;DR: In this paper, the complexity of querying text by Conjunctive Queries and Unions of CQs (UCQs) on top of regex formulas is investigated.

Abstract: Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
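As a rough illustration of the setting, the sketch below uses Python's `re` module to extract span relations from named capture groups and then joins two such relations on their shared variables. It is an approximation, not the formal document-spanner semantics: `re.finditer` returns only non-overlapping, leftmost matches, whereas a regex formula defines a relation over all matching span assignments. The helper names are hypothetical.

```python
import re


def extract(pattern, text):
    """Return one {variable: (start, end)} tuple per match, where the
    variables are the named capture groups of the pattern."""
    regex = re.compile(pattern)
    return [
        {var: m.span(var) for var in regex.groupindex}
        for m in regex.finditer(text)
    ]


def natural_join(left, right):
    """Join two span relations on every variable they have in common."""
    joined = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[var] == r[var] for var in shared):
                joined.append({**l, **r})
    return joined
```

For instance, joining the relation extracted by `(?P<user>\w+)@(?P<host>[\w.]+)` with the one extracted by `(?P<host>\w+\.org)` keeps only the tuples whose `host` spans coincide. This nested-loop join materializes both relations first, which is exactly the step the abstract identifies as potentially intractable when a regex formula defines exponentially many tuples.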

Posted Content•

A Survey of State Management in Big Data Processing Systems

Quoc-Cuong To¹, Juan Soto¹, Volker Markl¹•Institutions (1)

German Research Centre for Artificial Intelligence¹

06 Feb 2017-arXiv: Databases

TL;DR: State management and its use in diverse applications vary widely across big data processing systems, as is evident in both the research literature and existing systems such as Apache Flink, Apache Samza, Apache Spark, and Apache Storm.

Abstract: State management and its use in diverse applications varies widely across big data processing systems. This is evident in both the research literature and existing systems, such as Apache Flink, Apache Samza, Apache Spark, and Apache Storm. Given the pivotal role that state management plays in various use cases, in this survey, we present some of the most important uses of state as an enabler, discuss the alternative approaches used to handle and implement state, propose a taxonomy to capture the many facets of state management, and highlight new research directions. Our aim is to provide insight into disparate state management techniques, motivate others to pursue research in this area, and draw attention to some open problems.

Posted Content•

Review of Apriori Based Algorithms on MapReduce Framework

Sudhakar Singh, Rakhi Garg, P. K. Mishra

21 Feb 2017-arXiv: Databases

TL;DR: This survey discusses and analyzes the various implementations of Apriori on the MapReduce framework on the basis of their distinguishing characteristics, and includes the advantages and limitations of the MapReduce framework.

Abstract: The Apriori algorithm, which mines frequent itemsets, is one of the most popular and widely used data mining algorithms. Nowadays, many algorithms have been proposed on parallel and distributed platforms to enhance the performance of the Apriori algorithm. They differ from each other on the basis of the load balancing technique, memory system, data decomposition technique and data layout used to implement them. The problems with most distributed frameworks are the overhead of managing a distributed system and the lack of a high-level parallel programming language. Also, with grid computing there is always a potential for node failures, which cause multiple re-executions of tasks. These problems can be overcome by the MapReduce framework introduced by Google. MapReduce is an efficient, scalable and simplified programming model for large-scale distributed data processing on a large cluster of commodity computers, and is also used in cloud computing. In this paper, we present an overview of parallel Apriori algorithms implemented on the MapReduce framework. They are categorized on the basis of the Map and Reduce functions used to implement them, e.g. 1-phase vs. k-phase, I/O of Mapper, Combiner and Reducer, using the functionality of Combiner inside Mapper, etc. This survey discusses and analyzes the various implementations of Apriori on the MapReduce framework on the basis of their distinguishing characteristics. Moreover, it also includes the advantages and limitations of the MapReduce framework.
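The k-phase style mentioned above can be sketched as one map/reduce round per itemset size. The code below is a single-process simulation under stated assumptions, not any of the surveyed implementations: there is no Hadoop dependency, no Combiner, and the function names are hypothetical. Itemsets are represented as sorted tuples.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(transaction, candidates):
    """Mapper: emit (candidate, 1) for each candidate contained in the transaction."""
    items = set(transaction)
    return [(c, 1) for c in candidates if set(c) <= items]


def reduce_phase(pairs, min_support):
    """Reducer: sum the counts per candidate and keep the frequent ones."""
    counts = defaultdict(int)
    for candidate, one in pairs:
        counts[candidate] += one
    return {c: n for c, n in counts.items() if n >= min_support}


def generate_candidates(frequent, k):
    """Build size-k candidates whose (k-1)-subsets are all frequent."""
    items = sorted({item for itemset in frequent for item in itemset})
    previous = set(frequent)
    return [
        c for c in combinations(items, k)
        if all(sub in previous for sub in combinations(c, k - 1))
    ]


def apriori(transactions, min_support):
    """Run one simulated map/reduce job per itemset size until no candidates remain."""
    candidates = sorted({(item,) for t in transactions for item in t})
    result, k = {}, 1
    while candidates:
        pairs = [p for t in transactions for p in map_phase(t, candidates)]
        frequent = reduce_phase(pairs, min_support)
        result.update(frequent)
        k += 1
        candidates = generate_candidates(frequent, k)
    return result
```

In a real deployment each iteration of the `while` loop would be a separate job over the partitioned transaction file, which is why the number of phases and the per-job overhead are among the distinguishing characteristics the survey compares.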

Posted Content•

Data Readiness Levels

Neil D. Lawrence¹•Institutions (1)

University of Sheffield¹

05 May 2017-arXiv: Databases

TL;DR: The use of data readiness levels is proposed: it gives a rough outline of three stages of data preparedness and speculates on how formalisation of these levels into a common language for data readiness could facilitate project management.

Abstract: Application of models to data is fraught. Data-generating collaborators often only have a very basic understanding of the complications of collating, processing and curating data. Challenges include: poor data collection practices, missing values, inconvenient storage mechanisms, intellectual property, security and privacy. All these aspects obstruct the sharing and interconnection of data, and the eventual interpretation of data through machine learning or other approaches. In project reporting, a major challenge is in encapsulating these problems and enabling goals to be built around the processing of data. Project overruns can occur due to failure to account for the amount of time required to curate and collate. But to understand these failures we need to have a common language for assessing the readiness of a particular data set. This position paper proposes the use of data readiness levels: it gives a rough outline of three stages of data preparedness and speculates on how formalisation of these levels into a common language for data readiness could facilitate project management.

Book Chapter•DOI•

The Odyssey Approach for Optimizing Federated SPARQL Queries

Gabriela Montoya¹, Hala Skaf-Molli, Katja Hose¹•Institutions (1)

Aalborg University¹

17 May 2017-arXiv: Databases

TL;DR: Odyssey is an approach that uses statistics to allow more accurate cost estimation for federated queries over SPARQL endpoints, enabling it to produce better query execution plans.

Abstract: Answering queries over a federation of SPARQL endpoints requires combining data from more than one data source. Optimizing queries in such scenarios is particularly challenging not only because of (i) the large variety of possible query execution plans that correctly answer the query but also because (ii) there is only limited access to statistics about schema and instance data of remote sources. To overcome these challenges, most federated query engines rely on heuristics to reduce the space of possible query execution plans or on dynamic programming strategies to produce optimal plans. Nevertheless, these plans may still exhibit a high number of intermediate results or high execution times because of heuristics and inaccurate cost estimations. In this paper, we present Odyssey, an approach that uses statistics that allow for a more accurate cost estimation for federated queries and therefore enables Odyssey to produce better query execution plans. Our experimental results show that Odyssey produces query execution plans that are better in terms of data transfer and execution time than state-of-the-art optimizers. Our experiments using the FedBench benchmark show execution time gains of at least 25 times on average.

Book Chapter•DOI•

Formalising openCypher Graph Queries in Relational Algebra

József Marton¹, Gábor Szárnyas¹ ², Dániel Varró¹ ²•Institutions (2)

Budapest University of Technology and Economics¹, McGill University²

08 May 2017-arXiv: Databases

TL;DR: Graph database systems are increasingly adapted for storing and processing heterogeneous network-like datasets. However, due to the novelty of such systems, no standard data model or query language has yet emerged, which subjects users to the possibility of vendor lock-in.

Abstract: Graph database systems are increasingly adapted for storing and processing heterogeneous network-like datasets. However, due to the novelty of such systems, no standard data model or query language has yet emerged. Consequently, migrating datasets or applications even between related technologies often requires a large amount of manual work or ad-hoc solutions, thus subjecting the users to the possibility of vendor lock-in. To avoid this threat, vendors are working on supporting existing standard languages (e.g. SQL) or creating standardised languages. In this paper, we present a formal specification for openCypher, a high-level declarative graph query language with an ongoing standardisation effort. We introduce relational graph algebra, which extends relational operators by adapting graph-specific operators and define a mapping from core openCypher constructs to this algebra. We propose an algorithm that allows systematic compilation of openCypher queries.

Posted Content•

A Big Data Analysis Framework Using Apache Spark and Deep Learning

Anand Gupta¹, Hardeo Kumar Thakur², Ritvik Shrivastava², Pulkit Kumar³, Sreyashi Nag²•Institutions (3)

Insight Enterprises¹, Netaji Subhas Institute of Technology², Indraprastha Institute of Information Technology³

25 Nov 2017-arXiv: Databases

TL;DR: In this article, the authors propose a framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning.

Abstract: With the spreading prevalence of Big Data, many advances have recently been made in this field. Frameworks such as Apache Hadoop and Apache Spark have gained a lot of traction over the past decades and have become massively popular, especially in industries. It is becoming increasingly evident that effective big data analysis is key to solving artificial intelligence problems. Thus, a multi-algorithm library was implemented in the Spark framework, called MLlib. While this library supports multiple machine learning algorithms, there is still scope to use the Spark setup efficiently for highly time-intensive and computationally expensive procedures like deep learning. In this paper, we propose a novel framework that combines the distributive computational abilities of Apache Spark and the advanced machine learning architecture of a deep multi-layer perceptron (MLP), using the popular concept of Cascade Learning. We conduct empirical analysis of our framework on two real world datasets. The results are encouraging and corroborate our proposed framework, in turn proving that it is an improvement over traditional big data analysis methods that use either Spark or Deep learning as individual elements.
