
Showing papers presented at "Extending Database Technology in 2017"


Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper improves the understanding of the utility of different features for web table to knowledge base matching by reimplementing matching techniques and similarity score aggregation methods from the literature within a single matching framework and evaluating different combinations of these techniques against a single gold standard.
Abstract: Relational HTML tables on the Web contain data describing a multitude of entities and covering a wide range of topics. Thus, web tables are very useful for filling missing values in cross-domain knowledge bases such as DBpedia, YAGO, or the Google Knowledge Graph. Before web table data can be used to fill missing values, the tables need to be matched to the knowledge base in question. This involves three matching tasks: table-to-class matching, row-to-instance matching, and attribute-to-property matching. Various matching approaches have been proposed for each of these tasks. Unfortunately, the existing approaches are evaluated using different web table corpora. Each individual approach also only exploits a subset of the web table and knowledge base features that are potentially helpful for the matching tasks. These two shortcomings make it difficult to compare the different matching approaches and to judge the impact of each feature on the overall matching results. This paper contributes to improving the understanding of the utility of different features for web table to knowledge base matching by reimplementing different matching techniques as well as similarity score aggregation methods from the literature within a single matching framework and evaluating different combinations of these techniques against a single gold standard. The gold standard consists of class-, instance-, and property correspondences between the DBpedia knowledge base and web tables from the Web Data Commons web table corpus.

79 citations


Proceedings ArticleDOI
21 Mar 2017
TL;DR: This paper identifies a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data, enabling reasonable schema inference time for massive collections.
Abstract: Recent years have seen the widespread use of JSON as a data format to represent massive data collections. JSON data collections are usually schemaless. While this ensures several advantages, the absence of schema information has important negative consequences: the correctness of complex queries and programs cannot be statically checked, users cannot rely on schema information to quickly figure out structural properties that could speed up the formulation of correct queries, and many schema-based optimizations are not possible. In this paper we deal with the problem of inferring a schema from massive JSON data sets. We first identify a JSON type language which is simple and, at the same time, expressive enough to capture irregularities and to give complete structural information about input data. We then present our main contribution, which is the design of a schema inference algorithm, its theoretical study and its implementation based on Spark, enabling reasonable schema inference time for massive collections. Finally, we report about an experimental analysis showing the effectiveness of our approach in terms of execution time, precision and conciseness of inferred schemas, and scalability.
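The core of such an approach is a type language plus a commutative merge over inferred types, which is what makes a distributed (Spark-based) fold possible. The toy sketch below infers and merges simple record and array types sequentially; the type representation, the 'optional' marker, and the merge rules are illustrative simplifications, not the paper's actual type language.

```python
def infer_type(value):
    """Map one JSON value to a simple structural type."""
    if isinstance(value, dict):
        return {"record": {k: infer_type(v) for k, v in value.items()}}
    if isinstance(value, list):
        elem = None
        for v in value:
            elem = merge(elem, infer_type(v)) if elem is not None else infer_type(v)
        return {"array": elem}
    return type(value).__name__                  # 'str', 'int', 'bool', 'NoneType', ...

def merge(t1, t2):
    """Merge two inferred types: records merge field-wise, fields seen in only one
    record are wrapped in an 'optional' marker, anything else becomes a union."""
    if t1 == t2:
        return t1
    if isinstance(t1, dict) and isinstance(t2, dict) and "record" in t1 and "record" in t2:
        r1, r2 = t1["record"], t2["record"]
        fields = {}
        for k in set(r1) | set(r2):
            if k in r1 and k in r2:
                fields[k] = merge(r1[k], r2[k])
            else:
                fields[k] = {"optional": r1.get(k, r2.get(k))}
        return {"record": fields}
    return {"union": [t1, t2]}

docs = [
    {"name": "ada", "age": 36},
    {"name": "bob", "tags": ["x", "y"]},
]
schema = None
for d in docs:        # the paper distributes this fold with Spark; here it is sequential
    t = infer_type(d)
    schema = merge(schema, t) if schema is not None else t
print(schema)
```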

72 citations


Proceedings Article
01 Jan 2017
TL;DR: This paper proposes a secure SC framework based on encryption, which ensures that workers’ location information is never released to any party, yet the system can still assign tasks to workers situated in proximity of each task’s location, and proposes a novel secure indexing technique with a newly devised SKD-tree to index encrypted worker locations.
Abstract: In spatial crowdsourcing, spatial tasks are outsourced to a set of workers in proximity of the task locations for efficient assignment. It usually requires workers to disclose their locations, which inevitably raises security concerns about the privacy of the workers’ locations. In this paper, we propose a secure SC framework based on encryption, which ensures that workers’ location information is never released to any party, yet the system can still assign tasks to workers situated in proximity of each task’s location. We solve the challenge of assigning tasks based on encrypted data using homomorphic encryption. Moreover, to overcome the efficiency issue, we propose a novel secure indexing technique with a newly devised SKD-tree to index encrypted worker locations. Experiments on real-world data evaluate various aspects of the performance of the proposed SC platform.

53 citations


Proceedings Article
01 Jan 2017
TL;DR: This work conducted an exhaustive experimental survey by evaluating several state-of-the-art compression algorithms as well as cascades of basic techniques, finding that there is no single-best algorithm.
Abstract: Lightweight data compression algorithms are frequently applied in in-memory database systems to tackle the growing gap between processor speed and main memory bandwidth. In recent years, the vectorization of basic techniques such as delta coding and null suppression has considerably enlarged the corpus of available algorithms. As a result, today there is a large number of algorithms to choose from, while different algorithms are tailored to different data characteristics. However, a comparative evaluation of these algorithms under different data characteristics has never been sufficiently conducted in the literature. To close this gap, we conducted an exhaustive experimental survey by evaluating several state-of-the-art compression algorithms as well as cascades of basic techniques. We systematically investigated the influence of the data properties on the performance and the compression rates. The evaluated algorithms are based on publicly available implementations as well as our own vectorized reimplementations. We summarize our experimental findings leading to several new insights and to the conclusion that there is no single-best algorithm.
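As a concrete illustration of such a cascade, the sketch below chains delta coding with a byte-oriented null suppression (varint) scheme on a sorted integer column. It is a scalar toy, not one of the vectorized implementations evaluated in the paper.

```python
def delta_encode(values):
    """Delta coding: keep the first value, then store successive differences,
    which are small for sorted or slowly changing columns."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

def varint_encode(values):
    """A simple null suppression scheme: suppress leading zero bits by writing
    each non-negative integer as a sequence of 7-bit groups (varint)."""
    buf = bytearray()
    for v in values:
        while v >= 0x80:
            buf.append((v & 0x7F) | 0x80)
            v >>= 7
        buf.append(v)
    return bytes(buf)

# Cascade: delta coding followed by null suppression (assumes a sorted column here).
column = [1000, 1003, 1007, 1012, 1013, 1021]
deltas = delta_encode(column)
compressed = varint_encode(deltas)
print(deltas)                                    # [1000, 3, 4, 5, 1, 8]
print(len(compressed), "bytes vs", 4 * len(column), "bytes as 32-bit integers")
assert delta_decode(deltas) == column
```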

49 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: Current research challenges and trends tied to the integration, management, analysis, and visualization of objects moving at sea are reviewed, along with a few suggestions for the successful development of maritime forecasting and decision-support systems.
Abstract: The correlated exploitation of heterogeneous data sources offering very large historical as well as streaming data is important for increasing the accuracy of computations when analysing and predicting future states of moving entities. This is particularly critical in the maritime domain, where online tracking, early recognition of events, and real-time forecast of anticipated trajectories of vessels are crucial to safety and operations at sea. The objective of this paper is to review current research challenges and trends tied to the integration, management, analysis, and visualization of objects moving at sea, and to offer a few suggestions for the successful development of maritime forecasting and decision-support systems.

39 citations


Proceedings Article
01 Jan 2017
TL;DR: It is shown that relational main-memory database systems are capable of executing analytical algorithms in a fully transactional environment while still exceeding the performance of state-of-the-art analytical systems, rendering the division of data management and data analytics unnecessary.
Abstract: Data volume and complexity continue to increase, as does the need for insight into data. Today, data management and data analytics are most often conducted in separate systems: database systems and dedicated analytics systems. This separation leads to time- and resource-consuming data transfer, stale data, and complex IT architectures. In this paper we show that relational main-memory database systems are capable of executing analytical algorithms in a fully transactional environment while still exceeding the performance of state-of-the-art analytical systems, rendering the division of data management and data analytics unnecessary. We classify and assess multiple ways of integrating data analytics in database systems. Based on this assessment, we extend SQL with a non-appending iteration construct that provides an important building block for analytical algorithms while retaining the high expressiveness of the original language. Furthermore, we propose the integration of analytics operators directly into the database core, where algorithms can be highly tuned for modern hardware. These operators can be parameterized with our novel user-defined lambda expressions. As we integrate lambda expressions into SQL instead of proposing a new proprietary query language, we ensure usability for diverse groups of users. Additionally, we carry out an extensive experimental evaluation of our approaches in HyPer, our full-fledged SQL main-memory database system, and show their superior performance in comparison to dedicated solutions.

34 citations


Proceedings ArticleDOI
21 Mar 2017
TL;DR: Relevance offers the best task throughput while div-pay achieves the best outcome quality, and different strategies prevail for different dimensions.
Abstract: We investigate how to leverage the notion of motivation in assigning tasks to workers and improving the performance of a crowdsourcing system. In particular, we propose to model motivation as the balance between task diversity, i.e., the difference in skills among the tasks to complete, and task payment, i.e., the difference between how much a chosen task offers to pay and how much other available tasks pay. We propose to test different task assignment strategies: (1) relevance, a strategy that assigns matching tasks, i.e., those that fit a worker's profile, (2) diversity, a strategy that chooses matching and diverse tasks, and (3) div-pay, a strategy that selects matching tasks that offer the best compromise between diversity and payment. For each strategy, we study multiple iterations where tasks are reassigned to workers as their motivation evolves. At each iteration, relevance and diversity assign tasks to a worker from an available pool of filtered tasks. div-pay, on the other hand, estimates each worker's motivation on-the-fly at each iteration, and uses it to assign tasks to the worker. Our empirical experiments study the impact of each strategy on overall performance. We examine both requester-centric and worker-centric performance dimensions and find that different strategies prevail for different dimensions. In particular, relevance offers the best task throughput while div-pay achieves the best outcome quality.

32 citations


Proceedings Article
01 Jan 2017
TL;DR: This paper presents a novel normalization algorithm called Normalize, which uses discovered functional dependencies to normalize relational datasets into BCNF and introduces an efficient method for calculating the closure over sets of functional dependencies and novel features for choosing appropriate constraints.
Abstract: Ensuring Boyce-Codd Normal Form (BCNF) is the most popular way to remove redundancy and anomalies from datasets. Normalization to BCNF forces functional dependencies (FDs) into keys and foreign keys, which eliminates duplicate values and makes data constraints explicit. Despite being well researched in theory, converting the schema of an existing dataset into BCNF is still a complex, manual task, especially because the number of functional dependencies is huge and deriving keys and foreign keys is NP-hard. In this paper, we present a novel normalization algorithm called Normalize, which uses discovered functional dependencies to normalize relational datasets into BCNF. Normalize runs entirely data-driven, which means that redundancy is removed only where it can be observed, and it is (semi-)automatic, which means that a user may or may not interfere with the normalization process. The algorithm introduces an efficient method for calculating the closure over sets of functional dependencies and novel features for choosing appropriate constraints. Our evaluation shows that Normalize can process millions of FDs within a few minutes and that the constraint selection techniques support the construction of meaningful relations during normalization.

1. FUNCTIONAL DEPENDENCIES
A functional dependency (FD) is a statement of the form X → A with X being a set of attributes and A being a single attribute from the same relation R. We say that the left-hand side (Lhs) X functionally determines the right-hand side (Rhs) A. This means that whenever two records in an instance r of R agree on all their X values, they must also agree on their A value [7]. More formally, an FD X → A holds in r iff ∀ t1, t2 ∈ r : t1[X] = t2[X] ⇒ t1[A] = t2[A]. In the following, we consider only non-trivial FDs, which are FDs with A ∉ X. Table 1 depicts an example address dataset for which the two functional dependencies Postcode→City and Postcode→Mayor hold. Because both FDs have the same Lhs, we can aggregate them to the notation Postcode→City,Mayor.

Table 1: Example address dataset
First    Last    Postcode  City       Mayor
Thomas   Miller  14482     Potsdam    Jakobs
Sarah    Miller  14482     Potsdam    Jakobs
Peter    Smith   60329     Frankfurt  Feldmann
Jasmine  Cone    01069     Dresden    Orosz
Mike     Cone    14482     Potsdam    Jakobs
Thomas   Moore   60329     Frankfurt  Feldmann

The presence of this FD introduces anomalies in the dataset, because the values Potsdam, Frankfurt, Jakobs, and Feldmann are stored redundantly and updating these values might cause inconsistencies. So if, for instance, some Mr. Schmidt was elected as the new mayor of Potsdam, we must correctly change all three occurrences of Jakobs to Schmidt. Such anomalies can be avoided by normalizing relations into the Boyce-Codd Normal Form (BCNF). A relational schema R is in BCNF iff for all FDs X → A in R the Lhs X is either a key or superkey [7]. Because Postcode is neither a key nor a superkey in the example dataset, this relation does not meet the BCNF condition.
To bring all relations of a schema into BCNF, one has to perform six steps, which are explained in more detail later: (1) discover all FDs, (2) extend the FDs, (3) derive all necessary keys from the extended FDs, (4) identify the BCNF-violating FDs, (5) select a violating FD for decomposition, and (6) split the relation according to the chosen violating FD. Steps (3) to (5) repeat until step (4) finds no more violating FDs and the resulting schema is BCNF-conform. Several FD discovery algorithms, such as Tane [14] and HyFD [19], serve step (1), but there are, thus far, no algorithms available to efficiently and automatically solve steps (2) to (6). For the example dataset, an FD discovery algorithm would find twelve valid FDs in step (1). These FDs must be aggregated and transitively extended in step (2) so that we find, inter alia, First,Last→Postcode,City,Mayor and Postcode→City,Mayor. In step (3), the former FD lets us derive the key {First, Last}, because these two attributes functionally determine all other attributes of the relation. Step (4), then, determines that the second FD violates the BCNF condition, because its Lhs Postcode is neither a key nor a superkey. If we assume that step (5) is able to automatically select the second FD for decomposition, step (6) decomposes the example relation into R1(First, Last, Postcode) and R2(Postcode, City, Mayor) with {First, Last} and {Postcode} being primary keys and R1.Postcode→R2.Postcode a foreign key constraint. Table 2 shows this result. When again checking for violating FDs, we do not find any and stop the normalization.

Table 2: Normalized example address dataset
R1 (First, Last, Postcode): (Thomas, Miller, 14482), (Sarah, Miller, 14482), (Peter, Smith, 60329), (Jasmine, Cone, 01069), (Mike, Cone, 14482), (Thomas, Moore, 60329)
R2 (Postcode, City, Mayor): (14482, Potsdam, Jakobs), (60329, Frankfurt, Feldmann), (01069, Dresden, Orosz)
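The formal definition above translates directly into a check over a relation instance: tuples that agree on the Lhs must agree on the Rhs. The sketch below tests FDs on the Table 1 data; it illustrates the definition only and is not part of the Normalize algorithm.

```python
# Minimal FD check over the Table 1 relation; illustrative only, not Normalize.
ATTRS = ["First", "Last", "Postcode", "City", "Mayor"]
ROWS = [
    ("Thomas", "Miller", "14482", "Potsdam", "Jakobs"),
    ("Sarah", "Miller", "14482", "Potsdam", "Jakobs"),
    ("Peter", "Smith", "60329", "Frankfurt", "Feldmann"),
    ("Jasmine", "Cone", "01069", "Dresden", "Orosz"),
    ("Mike", "Cone", "14482", "Potsdam", "Jakobs"),
    ("Thomas", "Moore", "60329", "Frankfurt", "Feldmann"),
]

def fd_holds(rows, lhs, rhs):
    """X -> A holds iff tuples agreeing on all X attributes agree on A."""
    idx = {a: i for i, a in enumerate(ATTRS)}
    seen = {}
    for row in rows:
        key = tuple(row[idx[a]] for a in lhs)
        val = row[idx[rhs]]
        if key in seen and seen[key] != val:
            return False          # two tuples agree on X but differ on A
        seen[key] = val
    return True

print(fd_holds(ROWS, ["Postcode"], "City"))        # True
print(fd_holds(ROWS, ["Postcode"], "Mayor"))       # True
print(fd_holds(ROWS, ["First"], "Postcode"))       # False (the two Thomas tuples differ)
print(fd_holds(ROWS, ["First", "Last"], "Mayor"))  # True: {First, Last} is a key
```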

31 citations


Proceedings Article
01 Jan 2017
TL;DR: In this tutorial, the previous work on multi-model data management is reviewed and the insights on the research challenges and directions for future work are provided.
Abstract: As more businesses realized that data, in all forms and sizes, is critical to making the best possible decisions, we see the continued growth of systems that support massive volumes of non-relational or unstructured forms of data. Nothing shows the picture more starkly than the Gartner Magic Quadrant for operational database management systems, which assumes that, by 2017, all leading operational DBMSs will offer multiple data models, relational and NoSQL, in a single DBMS platform. Having a single data platform for managing both well-structured data and NoSQL data is beneficial to users; this approach significantly reduces integration, migration, development, maintenance, and operational issues. Therefore, a challenging research question is how to develop an efficient, consolidated, single data management platform covering both relational data and NoSQL to reduce integration issues, simplify operations, and eliminate migration issues. In this tutorial, we review the previous work on multi-model data management and provide insights on the research challenges and directions for future work. The slides and more materials of this tutorial can be found at http://udbms.cs.helsinki.fi/?tutorials/edbt2017.

29 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: This work proposes an unsupervised approach to partition the data that does not exploit any external knowledge but only relies on heuristics to select the blocking attributes, and shows good results on a standard dataset of Web Tables.
Abstract: Entity matching, or record linkage, is the task of identifying records that refer to the same entity. Naive entity matching techniques (i.e., brute-force pairwise comparisons) have quadratic complexity. A typical shortcut to the problem is to employ blocking techniques to reduce the number of comparisons, i.e. to partition the data in several blocks and only compare records within the same block. While classic blocking methods are designed for data from relational databases with clearly defined schemas, they are not applicable to data from Web tables, which are more prone to noise and do not come with an explicit schema. At the same time, Web tables are an interesting data source for many knowledge intensive tasks, which makes record linkage on Web Tables an important challenge. In this work, we propose an unsupervised approach to partition the data that does not exploit any external knowledge, but only relies on heuristics to select the blocking attributes. We compare different partitioning methods: we use (i) clustering on bag-of-words, (ii) binning via Locality-Sensitive Hashing and (iii) clustering using word embeddings. In particular, the clustering methods show good results on a standard dataset of Web Tables, and, when combined with word embeddings, are a robust solution which allows for computing the clusters in a dense, low-dimensional space.
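To make the LSH-based binning option concrete, the sketch below blocks rows by banding MinHash signatures of their bag-of-words token sets; records that share any band land in the same candidate block. Record strings, parameter values, and helper names are illustrative, and the blocking is probabilistic by nature.

```python
import re
from collections import defaultdict

def tokens(record):
    """Bag-of-words token set for a web-table row given as a string."""
    return set(re.findall(r"\w+", record.lower()))

def minhash_signature(toks, seeds):
    """One min-hash value per seed; similar token sets tend to share values."""
    return tuple(min(hash((seed, t)) for t in toks) for seed in seeds)

def lsh_blocks(records, num_hashes=16, bands=8):
    """Band the signatures: rows sharing any complete band fall into the same block."""
    rows_per_band = num_hashes // bands
    seeds = list(range(num_hashes))
    buckets = defaultdict(set)
    for rid, rec in enumerate(records):
        sig = minhash_signature(tokens(rec), seeds)
        for b in range(bands):
            band = sig[b * rows_per_band:(b + 1) * rows_per_band]
            buckets[(b, band)].add(rid)
    # only buckets with at least two rows produce comparison candidates
    return [ids for ids in buckets.values() if len(ids) > 1]

rows = [
    "Berlin Germany 3.6M inhabitants",
    "Berlin Germany 3.6M people",
    "Paris France 2.1M inhabitants",
]
# Output is probabilistic (and Python's hash() is salted per run): the two Berlin
# rows will usually share a band, while the Paris row usually ends up alone.
for block in lsh_blocks(rows):
    print(sorted(block))
```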

25 citations


Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper proposes MultiPass, an exact algorithm which traverses the network k−1 times and employs two pruning criteria to reduce the number of paths that have to be examined, and proposes two approximate algorithms that trade accuracy for efficiency.
Abstract: Shortest path computation is a fundamental problem in road networks with various applications in research and industry. However, returning only the shortest path is often not satisfying. Users might also be interested in alternative paths that are slightly longer but have other desired properties, e.g., less frequent traffic congestion. In this paper, we study alternative routing and, in particular, the k-Shortest Paths with Limited Overlap (k-SPwLO) query, which aims at computing paths that are (a) sufficiently dissimilar to each other, and (b) as short as possible. First, we propose MultiPass, an exact algorithm which traverses the network k−1 times and employs two pruning criteria to reduce the number of paths that have to be examined. To achieve better performance and scalability, we also propose two approximate algorithms that trade accuracy for efficiency. OnePass employs the same pruning criteria as MultiPass, but traverses the network only once. Therefore, some paths might be lost that otherwise would be part of the solution. ESX computes alternative paths by incrementally removing edges from the road network and running shortest path queries on the updated network. An extensive experimental analysis on real road networks shows that: (a) MultiPass outperforms state-of-the-art exact algorithms for computing k-SPwLO queries, (b) OnePass runs significantly faster than MultiPass and its result is close to the exact solution, and (c) ESX is faster than OnePass (though slightly less accurate) and scales for large road networks and large values of k.
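The dissimilarity requirement in such queries boils down to bounding the overlap between a candidate path and the paths already selected. The sketch below uses one plausible overlap measure, shared edge length divided by the length of the shorter path; the paper's exact definition and thresholds may differ.

```python
def path_edges(path):
    """Undirected edge set of a path given as a node sequence."""
    return {tuple(sorted(e)) for e in zip(path, path[1:])}

def overlap(p1, p2, edge_length):
    """Shared length divided by the length of the shorter path; one plausible
    overlap measure (the paper's exact definition may differ)."""
    shared = path_edges(p1) & path_edges(p2)
    shared_len = sum(edge_length[e] for e in shared)
    len1 = sum(edge_length[e] for e in path_edges(p1))
    len2 = sum(edge_length[e] for e in path_edges(p2))
    return shared_len / min(len1, len2)

def sufficiently_dissimilar(candidate, result_set, edge_length, theta):
    """A candidate qualifies as an alternative path if its overlap with every
    already-selected path stays below the threshold theta."""
    return all(overlap(candidate, p, edge_length) < theta for p in result_set)

# Toy road network: edge lengths keyed by sorted node pair.
lengths = {("a", "b"): 2.0, ("b", "c"): 3.0, ("a", "d"): 2.5, ("c", "d"): 3.5}
p_short = ["a", "b", "c"]
p_alt = ["a", "d", "c"]
print(overlap(p_short, p_alt, lengths))                          # 0.0 (edge-disjoint)
print(sufficiently_dissimilar(p_alt, [p_short], lengths, 0.5))   # True
```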

Proceedings Article
01 Jan 2017
TL;DR: This paper analyzes and compares existing solutions for spatial data processing on Hadoop and Spark, and investigates their features as well as their performances in a micro benchmark for spatial filter and join queries.
Abstract: Nowadays, a vast amount of data is generated and collected every moment and often, this data has a spatial and/or temporal aspect. To analyze the massive data sets, big data platforms like Apache Hadoop MapReduce and Apache Spark emerged and extensions that take the spatial characteristics into account were created for them. In this paper, we analyze and compare existing solutions for spatial data processing on Hadoop and Spark. In our comparison, we investigate their features as well as their performances in a micro benchmark for spatial filter and join queries. Based on the results and our experiences with these frameworks, we outline the requirements for a general spatio-temporal benchmark for Big Spatial Data processing platforms and sketch first solutions to the identified problems.

Proceedings ArticleDOI
24 Mar 2017
TL;DR: Top-k Case Matching (TKCM) is resilient to consecutively missing values, the accuracy of the imputed values does not decrease if blocks of values are missing, and it outperforms state-of-the-art solutions.
Abstract: Time series data is ubiquitous but often incomplete, e.g., due to sensor failures and transmission errors. Since many applications require complete data, missing values must be imputed before further data processing is possible. We propose Top-k Case Matching (TKCM) to impute missing values in streams of time series data. TKCM defines for each time series a set of reference time series and exploits similar historical situations in the reference time series for the imputation. A situation is characterized by the anchor point of a pattern that consists of l consecutive measurements over the reference time series. A missing value in a time series s is derived from the values of s at the anchor points of the k most similar patterns. We show that TKCM imputes missing values consistently if the reference time series pattern-determine time series s, i.e., the pattern of length l at time tn is repeated at least k times in the reference time series and the corresponding values of s at the anchor time points are similar to each other. In contrast to previous work, we support time series that are not linearly correlated but, e.g., phase shifted. TKCM is resilient to consecutively missing values, and the accuracy of the imputed values does not decrease if blocks of values are missing. The results of an exhaustive experimental evaluation using real-world and synthetic data show that we outperform the state-of-the-art solutions.
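The imputation idea can be sketched compactly: compare the current pattern of l reference values against all historical patterns, take the k most similar anchor points, and average the target series at those anchors. The code below is a simplified single-reference version with illustrative parameter names, not the TKCM implementation.

```python
import math

def impute(target, reference, t_missing, l=3, k=2):
    """Impute target[t_missing] from similar historical situations in a reference
    series. A situation is the pattern of l consecutive reference values ending at
    an anchor point; the k anchors with the most similar patterns contribute the
    corresponding (non-missing) target values. Simplified single-reference sketch."""
    query = reference[t_missing - l + 1:t_missing + 1]        # current pattern
    candidates = []
    for anchor in range(l - 1, len(reference)):
        if anchor == t_missing or target[anchor] is None:
            continue
        pattern = reference[anchor - l + 1:anchor + 1]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(pattern, query)))
        candidates.append((dist, anchor))
    best = sorted(candidates)[:k]
    return sum(target[a] for _, a in best) / len(best)

# Toy data: the target stream has a gap at position 6, the reference is complete.
reference = [1, 2, 3, 4, 3, 2, 3, 4, 3, 2]
target    = [10, 20, 30, 40, 30, 20, None, 40, 30, 20]
print(impute(target, reference, t_missing=6))   # 30.0, averaged over the 2 best anchors
```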

Proceedings Article
01 Jan 2017
TL;DR: This paper presents I, an interactive development environment that coordinates running cluster applications and corresponding visualizations such that only the currently depicted data points are processed and transferred, and shows how cluster programs can adapt to changed visualization properties at runtime to allow interactive data exploration on data streams.
Abstract: Developing scalable real-time data analysis programs is a challenging task. Developers need insights from the data to define meaningful analysis flows, which often makes the development a trial and error process. Data visualization techniques can provide insights to aid the development, but the sheer amount of available data frequently makes it impossible to visualize all data points at the same time. We present I, an interactive development environment that coordinates running cluster applications and corresponding visualizations such that only the currently depicted data points are processed and transferred. To this end, we present an algorithm for the real-time visualization of time series, which is proven to be correct and minimal in terms of transferred data. Moreover, we show how cluster programs can adapt to changed visualization properties at runtime to allow interactive data exploration on data streams.

Proceedings Article
01 Jan 2017
TL;DR: Evaluation using two real-world use cases shows that EXstream can outperform existing techniques significantly in conciseness and consistency while achieving comparable high prediction power and retaining a highly efficient implementation of a data stream system.
Abstract: In this paper, we present the EXstream system that provides high-quality explanations for anomalous behaviors that users annotate on CEP-based monitoring results. Given the new requirements for explanations, namely, conciseness, consistency with human interpretation, and prediction power, most existing techniques cannot produce explanations that satisfy all three of them. The key technical contributions of this work include a formal definition of optimally explaining anomalies in CEP monitoring, and three key techniques for generating sufficient feature space, characterizing the contribution of each feature to the explanation, and selecting a small subset of features as the optimal explanation, respectively. Evaluation using two real-world use cases shows that EXstream can outperform existing techniques significantly in conciseness and consistency while achieving comparable high prediction power and retaining a highly efficient implementation of a data stream system.

Proceedings ArticleDOI
21 Mar 2017
TL;DR: The design and implementation of GraphCache started in mid 2014 and, in its current form, it can improve the performance of methods from various categories of graph-query processing research, including filter-then-verify methods and direct subgraph-isomorphism algorithms, across workloads and datasets of different characteristics.
Abstract: Graph datasets and NoSQL graph databases are becoming increasingly popular for a large variety of applications, dependent on modelling entities and their relationships and interactions. In such systems, graph queries are essential for graph analytics. However, they can be very time-consuming, as graph query processing entails the subgraph isomorphism problem, which is NP-Complete. In general, caching systems constitute a key component in software systems, including database systems, where they play a key role in expediting query processing. In this context, we put forth GraphCache, the first caching system for graph query processing. We report on the design issues and goals that GraphCache must meet, its overall system architecture and implementation, coupled with novel cache replacement and admission control policies. We also report on results from extensive performance evaluations which showcase and quantify its benefits and overheads, highlighting lessons learned. GraphCache can be used as a front end, complementing any graph query processing method, which is viewed as a pluggable component. The design and implementation of GraphCache started in mid 2014 and in its current implementation it can significantly improve the performance of methods from different categories of graph-query processing research, including filter-then-verify (FTV) methods and direct subgraph-isomorphism (SI) algorithms, and across workloads and datasets of different characteristics. Currently, GraphCache comprises more than 6,000 lines of Java code, excluding the pluggable FTV/SI query-processing algorithms. It is available bundled with 3 top-performing FTV methods and 3 top-performing SI algorithms.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper proposes a suite of novel lower bound functions and a grouping-based solution with multi-level pruning in order to compute motifs with DFD efficiently and reveals that the approach is 3 orders of magnitude faster than a baseline solution.
Abstract: The discrete Fréchet distance (DFD) captures perceptual and geographical similarity between discrete trajectories. It has been successfully adopted in a multitude of applications, such as signature and handwriting recognition, computer graphics, as well as geographic applications. Spatial applications, e.g., sports analysis, traffic analysis, etc. require discovering the pair of most similar subtrajectories, be they parts of the same or of different input trajectories. The identified pair of subtrajectories is called a motif. The adoption of DFD as the similarity measure in motif discovery, although semantically ideal, is hindered by the high computational complexity of DFD calculation. In this paper, we propose a suite of novel lower bound functions and a grouping-based solution with multi-level pruning in order to compute motifs with DFD efficiently. Our techniques apply directly to motif discovery within the same or between different trajectories. An extensive empirical study on three real trajectory datasets reveals that our approach is 3 orders of magnitude faster than a baseline solution.
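The discrete Fréchet distance itself has a compact dynamic-programming formulation (following the classic Eiter/Mannila recurrence), which also makes clear why repeated DFD evaluations during motif discovery are expensive. A minimal sketch, not the paper's pruned algorithm:

```python
import math

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two point sequences via dynamic programming."""
    def d(p, q):
        return math.dist(p, q)
    n, m = len(P), len(Q)
    ca = [[0.0] * m for _ in range(n)]
    ca[0][0] = d(P[0], Q[0])
    for j in range(1, m):
        ca[0][j] = max(ca[0][j - 1], d(P[0], Q[j]))
    for i in range(1, n):
        ca[i][0] = max(ca[i - 1][0], d(P[i], Q[0]))
    for i in range(1, n):
        for j in range(1, m):
            ca[i][j] = max(min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]),
                           d(P[i], Q[j]))
    return ca[n - 1][m - 1]

# Two toy subtrajectories given as sequences of 2D points.
t1 = [(0, 0), (1, 0), (2, 0), (3, 0)]
t2 = [(0, 1), (1, 1), (2, 2), (3, 1)]
print(discrete_frechet(t1, t2))   # 2.0, driven by the point (2, 2)
```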

Proceedings ArticleDOI
21 Mar 2017
TL;DR: The effectiveness and the efficiency of the proposed algorithm are experimentally validated over synthetic and real-world trajectory datasets, demonstrating that STClustering outperforms an off-the-shelf in-DBMS solution using PostGIS by several orders of magnitude.
Abstract: In this paper, we propose an efficient in-DBMS solution for the problem of sub-trajectory clustering and outlier detection in large moving object datasets. The method relies on a two-phase process: a voting-and-segmentation phase that segments trajectories according to a local density criterion and trajectory similarity criteria, followed by a sampling-and-clustering phase that selects the most representative sub-trajectories to be used as seeds for the clustering process. Our proposal, called STClustering (for Sampling-based Sub-Trajectory Clustering) is novel since it is the first, to our knowledge, that addresses the pure spatiotemporal sub-trajectory clustering and outlier detection problem in a real-world setting (by ‘pure’ we mean that the entire spatiotemporal information of trajectories is taken into consideration). Moreover, our proposal can be efficiently registered as a database query operator in the context of extensible DBMS (namely, PostgreSQL in our current implementation). The effectiveness and the efficiency of the proposed algorithm are experimentally validated over synthetic and real-world trajectory datasets, demonstrating that STClustering outperforms an off-the-shelf in-DBMS solution using PostGIS by several orders of magnitude.

Proceedings ArticleDOI
21 Mar 2017
TL;DR: The central idea is to employ parallelism in a novel way, whereby parallel matching/decision attempts are initiated, each using a query rewriting and/or an alternate algorithm, which is shown to be highly beneficial across algorithms and datasets.
Abstract: Subgraph queries are central to graph analytics and graph DBs. We analyze this problem and present key novel discoveries and observations on the nature of the problem which hold across query sizes, datasets, and top-performing algorithms. Firstly, we show that algorithms (for both the decision and matching versions of the problem) suffer from straggler queries, which dominate query workload times. As related research caps query times not reporting results for queries exceeding the cap, this can lead to erroneous conclusions of the methods’ relative performance. Secondly, we study and show the dramatic effect that isomorphic graph queries can have on query times. Thirdly, we show that for each query, isomorphic queries based on proposed query rewritings can introduce large performance benefits. Fourthly, that straggler queries are largely algorithm-specific: many challenging queries to one algorithm can be executed efficiently by another. Finally, the above discoveries naturally lead to the derivation of a novel framework for subgraph query processing. The central idea is to employ parallelism in a novel way, whereby parallel matching/decision attempts are initiated, each using a query rewriting and/or an alternate algorithm. The framework is shown to be highly beneficial across algorithms and datasets.

Proceedings ArticleDOI
01 Mar 2017
TL;DR: This paper devises the conditions under which a particular MaxRS solution may cease to be valid and a new optimal location for the query-rectangle R is needed, and solves the problem of maintaining the trajectory of the centroid of R.
Abstract: We address the problem of efficient maintenance of the answer to a new type of query: Continuous Maximizing Range-Sum (Co-MaxRS) for moving objects trajectories. The traditional static/spatial MaxRS problem finds a location for placing the centroid of a given (axes-parallel) rectangle R so that the sum of the weights of the point-objects from a given set O inside the interior of R is maximized. However, moving objects continuously change their locations over time, so the MaxRS solution for a particular time instant need not be a solution at another time instant. In this paper, we devise the conditions under which a particular MaxRS solution may cease to be valid and a new optimal location for the query-rectangle R is needed. More specifically, we solve the problem of maintaining the trajectory of the centroid of R. In addition, we propose efficient pruning strategies (and corresponding data structures) to speed-up the process of maintaining the accuracy of the Co-MaxRS solution. We prove the correctness of our approach and present experimental evaluations over both real and synthetic datasets, demonstrating the benefits of the proposed methods.

Proceedings Article
01 Jan 2017
TL;DR: This paper reports on an industrial project to develop a tool that facilitates access to a large database of hydrocarbon exploration data by combining RDF technology with keyword search; the tool features an algorithm to translate a keyword query into a SPARQL query such that each result of the SPARQL query is an answer for the keyword query.
Abstract: This paper presents the results of an industrial project, conducted by the TecGraf Institute and Petrobras (the Brazilian Petroleum Company), to develop a tool to facilitate access to a large database, with hydrocarbon exploration data, by combining RDF technology with keyword search. The tool features an algorithm to translate a keyword query into a SPARQL query such that each result of the SPARQL query is an answer for the keyword query. The algorithm explores the RDF schema of the RDF dataset to generate the SPARQL query and to avoid user intervention during the translation process. The tool offers an interface which allows the user to specify keywords, as well as filters and unit measures, and presents the results with the help of a table and a graph. Finally, the paper describes experiments which show that the tool achieves very good performance for the real-world industrial dataset and meets users’ expectations. The tool was further validated against full versions of the IMDb and Mondial datasets.

Proceedings ArticleDOI
20 Mar 2017
TL;DR: This work focuses on a common setting in which the matching function is a set of rules where each rule is in conjunctive normal form (CNF), and proposes the use of “early exit” and “dynamic memoing” to avoid unnecessary and redundant computations.
Abstract: Entity Matching (EM) identifies pairs of records referring to the same real-world entity. In practice, this is often accomplished by employing analysts to iteratively design and maintain sets of matching rules. An important task for such analysts is a “debugging” cycle in which they make a modification to the matching rules, apply the modified rules to a labeled subset of the data, inspect the result, and then perhaps make another change. Our goal is to make this process interactive by minimizing the time required to apply the modified rules. We focus on a common setting in which the matching function is a set of rules where each rule is in conjunctive normal form (CNF). We propose the use of “early exit” and “dynamic memoing” to avoid unnecessary and redundant computations. These techniques create a new optimization problem, and accordingly we develop a cost model and study the optimal ordering of rules and predicates in this context. We also provide techniques to reuse previous results and limit the computation required to apply incremental changes. Through experiments on six real-world data sets we demonstrate that our approach can yield a significant reduction in matching time and provide interactive response times.
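The evaluation trick can be illustrated on a tiny rule set: a CNF rule is a conjunction of clauses, each a disjunction of predicates, so a failed clause ends the match early and a cached predicate result is never recomputed. The predicates, attributes, and thresholds below are hypothetical, and the memo table is a simplified stand-in for the paper's dynamic memoing.

```python
# Hypothetical similarity predicates over a record pair; names and thresholds are illustrative.
def name_jaccard(a, b):
    ta, tb = set(a["name"].lower().split()), set(b["name"].lower().split())
    return len(ta & tb) / len(ta | tb)

def same_zip(a, b):
    return a["zip"] == b["zip"]

def same_phone(a, b):
    return a["phone"] == b["phone"]

# A CNF rule: a conjunction of clauses, each clause a disjunction of predicates.
RULE = [
    [lambda a, b: name_jaccard(a, b) > 0.5, same_zip],   # clause 1
    [same_phone, same_zip],                              # clause 2
]

def match(a, b, rule):
    """Evaluate a CNF rule with early exit (stop at the first failed clause, stop a
    clause at the first satisfied predicate) and a memo table so a predicate shared
    by several clauses is computed at most once for this record pair."""
    memo = {}
    for clause in rule:
        satisfied = False
        for pred in clause:
            if pred not in memo:
                memo[pred] = pred(a, b)
            if memo[pred]:
                satisfied = True
                break            # early exit within the disjunction
        if not satisfied:
            return False         # early exit: one failed clause sinks the whole rule
    return True

r1 = {"name": "Jane A. Doe", "zip": "10115", "phone": "123"}
r2 = {"name": "Jane Doe", "zip": "10115", "phone": "999"}
print(match(r1, r2, RULE))   # True: clause 1 via the name predicate, clause 2 via same_zip
```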

Proceedings Article
01 Jan 2017
TL;DR: This paper argues that main-memory database systems are well-suited for analytical streaming workloads and identifies potential extensions to database systems to match the performance and usability of streaming systems.
Abstract: Today’s streaming applications demand increasingly high event throughput rates and are often subject to strict latency constraints. To allow for more complex workloads, such as window-based aggregations, streaming systems need to support stateful event processing. This introduces new challenges for streaming engines as the state needs to be maintained in a consistent and durable manner and simultaneously accessed by complex queries for real-time analytics. Modern streaming systems, such as Apache Flink, do not allow for efficiently exposing the state to analytical queries. Thus, data engineers are forced to keep the state in external data stores, which significantly increases the latencies until events are visible to analytical queries. Proprietary solutions have been created to meet data freshness constraints. These solutions are expensive, error-prone, and difficult to maintain. Main-memory database systems, such as HyPer, achieve extremely low query response times while maintaining high update rates, which makes them well-suited for analytical streaming workloads. In this paper, we identify potential extensions to database systems to match the performance and usability of streaming systems.

Proceedings ArticleDOI
01 Jan 2017
TL;DR: This paper proposes fairness axioms that generalize existing work and pave the way to studying fairness for task assignment, task completion, and worker compensation, and discusses how fairness and transparency could be enforced and evaluated in a crowdsourcing platform.
Abstract: Despite the success of crowdsourcing, the question of ethics has not yet been addressed in its entirety. Existing efforts have studied fairness in worker compensation and in helping requesters detect malevolent workers. In this paper, we propose fairness axioms that generalize existing work and pave the way to studying fairness for task assignment, task completion, and worker compensation. Transparency, on the other hand, has been addressed with the development of plug-ins and forums to track workers’ performance and rate requesters. Similarly to fairness, we define transparency axioms and advocate the need to address it in a holistic manner by providing declarative specifications. We also discuss how fairness and transparency could be enforced and evaluated in a crowdsourcing platform.


Proceedings ArticleDOI
01 Mar 2017
TL;DR: This research presents an advanced MapReduce-based parallel solution to efficiently address spatial skyline queries on large datasets and proposes a novel concept called independent regions, for parallelizing the process of spatial skyline evaluation.
Abstract: This research presents an advanced MapReduce-based parallel solution to efficiently address spatial skyline queries on large datasets. In particular, given a set of data points and a set of query points, we first generate the convex hull of the query points in the first MapReduce phase. Then, we propose a novel concept called independent regions, for parallelizing the process of spatial skyline evaluation. Spatial skyline candidates in an independent region do not depend on any data point in other independent regions. Thus, we calculate the independent regions based on the input data points and the convex hull of the query points in the second phase. With the independent regions, spatial skylines are evaluated in parallel in the third phase, in which data points are partitioned by their associated independent regions in the map functions, and spatial skyline candidates are calculated by reduce functions. The results of the spatial skyline queries are the union of outputs from the reduce functions. Due to high cost of the spatial dominance test, which requires comparing the distance from data points to all convex points, we propose a concept of pruning regions in independent regions. All data points in pruning regions can be discarded without the dominance test. Our experimental results show the efficiency and effectiveness of the proposed parallel spatial skyline solution utilizing MapReduce on large-scale real-world and synthetic datasets.
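The underlying dominance test is easy to state: a point spatially dominates another if it is at least as close to every query point and strictly closer to at least one. The naive quadratic evaluation below shows the cost that the convex hull, independent regions, and pruning regions are designed to avoid; point coordinates and names are illustrative.

```python
import math

def spatially_dominates(p1, p2, query_points):
    """p1 spatially dominates p2 if it is at least as close to every query point
    and strictly closer to at least one of them."""
    strictly_closer = False
    for q in query_points:
        d1, d2 = math.dist(p1, q), math.dist(p2, q)
        if d1 > d2:
            return False
        if d1 < d2:
            strictly_closer = True
    return strictly_closer

def spatial_skyline(data_points, query_points):
    """Naive quadratic evaluation: keep every point not dominated by another one.
    The MapReduce solution avoids exactly this cost via the convex hull of the
    query points, independent regions, and pruning regions."""
    return [p for p in data_points
            if not any(spatially_dominates(o, p, query_points)
                       for o in data_points if o != p)]

queries = [(0, 0), (4, 0)]
data = [(2, 1), (2, 3), (0, 1), (5, 1), (2, 0)]
print(spatial_skyline(data, queries))   # [(0, 1), (5, 1), (2, 0)]
```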

Proceedings ArticleDOI
01 Jan 2017
TL;DR: The results of the algorithm can be used to build efficient influence oracles for solving the influence maximization problem, which deals with finding the top-k seed nodes such that the information spread from these nodes is maximized.
Abstract: We study the potential flow of information in interaction networks, that is, networks in which the interactions between the nodes are being recorded. The central notion in our study is that of an information channel. An information channel is a sequence of interactions between nodes forming a path in the network which respects the time order. As such, an information channel represents a potential way information could have flown in the interaction network. We propose algorithms to estimate information channels of limited time span from every node to other nodes in the network. We present one exact and one more efficient approximate algorithm. Both algorithms are one-pass algorithms. The approximation algorithm is based on an adaptation of the HyperLogLog sketch, which allows easily combining the sketches of individual nodes in order to get estimates of how many unique nodes can be reached from groups of nodes as well. We show how the results of our algorithm can be used to build efficient influence oracles for solving the influence maximization problem, which deals with finding the top-k seed nodes such that the information spread from these nodes is maximized. Experiments show that the use of information channels is an interesting data-driven and model-independent way to find the top-k influential nodes in interaction networks.
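A compact way to see the one-pass idea: process interactions in reverse time order and maintain, per node, a summary of everything reachable through time-respecting channels; group estimates then come from merging summaries. The sketch below uses exact sets as a stand-in for the HyperLogLog sketches of the paper and ignores the limited-time-span constraint; names are illustrative.

```python
from collections import defaultdict

def reachability_summaries(interactions):
    """One reverse pass over time-stamped interactions (u, v, t): reach[u] collects
    the nodes u can reach through time-respecting channels (interaction sequences
    with increasing timestamps). The paper keeps a HyperLogLog sketch per node
    instead of an exact set, so summaries stay small yet remain mergeable; exact
    sets are used here as a stand-in and the limited-time-span window is ignored."""
    reach = defaultdict(set)
    for u, v, t in sorted(interactions, key=lambda x: x[2], reverse=True):
        reach[u] |= {v} | reach[v]   # v itself plus everything reachable from v after t
    return reach

def group_spread(seeds, reach):
    """Estimated spread of a seed group: merge (union) the per-node summaries."""
    covered = set()
    for s in seeds:
        covered |= reach[s]
    return len(covered - set(seeds))

# Interactions as (sender, receiver, timestamp).
log = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("d", "e", 1), ("e", "b", 5)]
reach = reachability_summaries(log)
print(sorted(reach["a"]))               # ['b', 'c', 'd']; e is missed since d->e happens at t=1
print(group_spread({"a", "d"}, reach))  # 3: merging the summaries of a and d covers b, c, e
```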

Proceedings Article
01 Jan 2017
TL;DR: The STARK project is demonstrated, which adds the required data types and operators, such as spatio-temporal filter and join with various predicates to Spark, and includes k nearest neighbor search and a density based clustering operator for data analysis tasks as well as spatial partitioning and indexing techniques for efficient processing.
Abstract: For Big Data processing, Apache Spark has been widely accepted. However, when dealing with events or any other spatio-temporal data sets, Spark becomes very inefficient as it does not include any spatial or temporal data types and operators. In this paper we demonstrate our STARK project that adds the required data types and operators, such as spatio-temporal filter and join with various predicates to Spark. Additionally, it includes k nearest neighbor search and a density based clustering operator for data analysis tasks as well as spatial partitioning and indexing techniques for efficient processing. During the demo, programs can be created on real world event data sets using STARK’s Scala API or our Pig Latin derivative Piglet in a web front end which also visualizes the results.

Proceedings Article
21 Mar 2017
TL;DR: This paper describes a pseudo-polynomial heuristic to pick the negation closest in size to the initial query and to construct a balanced learning set whose positive examples correspond to the results desired by analysts and whose negative examples correspond to those they do not want.
Abstract: Nowadays data scientists have access to gigantic data sets, many of them accessible through SQL. Despite the inherent simplicity of SQL, writing relevant and efficient SQL queries is known to be difficult, especially for databases having a large number of attributes or meaningless attribute names. In this paper, we propose a "rewriting" technique to help data scientists formulate SQL queries, to rapidly and intuitively explore their big data, while keeping user input at a minimum, with no manual tuple specification or labeling. For a user-specified query, we define a negation query, which produces tuples that are not wanted in the initial query's answer. Since there is an exponential number of such negation queries, we describe a pseudo-polynomial heuristic to pick the negation closest in size to the initial query, and construct a balanced learning set whose positive examples correspond to the results desired by analysts, and negative examples to those they do not want. The initial query is reformulated using machine learning techniques and a new query, more efficient and diverse, is obtained. We have implemented a prototype and conducted experiments on real-life datasets and synthetic query workloads to assess the scalability and precision of our proposition. A preliminary qualitative experiment conducted with astrophysicists is also described.