Showing papers presented at "Extending Database Technology" in 2010


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The problem of optimizing the shares, given a fixed number of Reduce processes, is studied, and an algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given.
Abstract: Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
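To make the share-based routing concrete, here is a minimal Python sketch (illustrative names and structures assumed, not the authors' implementation) of how a Map task could decide which Reduce processes receive a tuple, given a share for each map-key attribute:

```python
from itertools import product

def reducer_ids(tuple_values, shares, hash_fn=hash):
    """List the Reduce-process coordinates a single tuple is sent to.

    tuple_values: dict attr -> value for the attributes in the relation's schema;
    shares:       dict map-key attr -> number of hash buckets (its "share").
    A map-key attribute missing from the tuple's schema forces replication:
    the tuple is sent to every bucket along that dimension.
    """
    axes = []
    for attr, share in shares.items():
        if attr in tuple_values:
            axes.append([hash_fn(tuple_values[attr]) % share])   # exactly one bucket
        else:
            axes.append(range(share))                            # replicate over all buckets
    return list(product(*axes))

# Toy chain join R(A,B) JOIN S(B,C) JOIN T(C,D) with map-key {B, C}, shares 4 and 3:
shares = {"B": 4, "C": 3}
print(reducer_ids({"B": "b7"}, shares))             # an R-tuple, replicated over C's 3 buckets
print(reducer_ids({"B": "b7", "C": "c2"}, shares))  # an S-tuple, sent to exactly one reducer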

382 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: A family of novel approximate SimRank computation algorithms for static and dynamic information networks are developed and their corresponding theoretical justification and analysis are given.
Abstract: Information networks are ubiquitous in many applications and analysis on such networks has attracted significant attention in the academic communities. One of the most important aspects of information network analysis is to measure similarity between nodes in a network. SimRank is a simple and influential measure of this kind, based on a solid theoretical "random surfer" model. Existing work computes SimRank similarity scores in an iterative mode. We argue that the iterative method can be infeasible and inefficient when, as in many real-world scenarios, the networks change dynamically and frequently. We envision a non-iterative method to bridge the gap. It allows users not only to update the similarity scores incrementally, but also to derive similarity scores for an arbitrary subset of nodes. To enable the non-iterative computation, we propose to rewrite the SimRank equation into a non-iterative form by using the Kronecker product and vectorization operators. Based on this, we develop a family of novel approximate SimRank computation algorithms for static and dynamic information networks, and give their corresponding theoretical justification and analysis. The non-iterative method supports efficient processing of various node analyses, including similarity tracking and centrality tracking, on evolving information networks. The effectiveness and efficiency of our proposed methods are evaluated on synthetic and real data sets.
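As a rough illustration of the rewriting step (a standard vectorization identity under assumed notation, not necessarily the paper's exact formulation), the recursive SimRank equation can be turned into a linear system via the Kronecker product:

```latex
% Assumed notation: W is the column-normalized adjacency matrix, c the decay
% factor, S the SimRank matrix, with S = c\,W^{\top} S W + (1-c)\,I.
% Applying vec(ABC) = (C^{\top} \otimes A)\,vec(B):
\operatorname{vec}(S) = c\,\bigl(W^{\top} \otimes W^{\top}\bigr)\operatorname{vec}(S) + (1-c)\operatorname{vec}(I)
\;\Longrightarrow\;
\operatorname{vec}(S) = (1-c)\,\bigl(I - c\,W^{\top} \otimes W^{\top}\bigr)^{-1}\operatorname{vec}(I)
```

In this form the scores are given by one linear system rather than by iteration, which is what makes incremental updates and per-subset evaluation possible.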

171 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: Experiments conducted on the real-world Census-income dataset show that, although the proposed methods provide strong privacy, their effectiveness in reducing matching cost is not far from that of k-anonymity based counterparts.
Abstract: Private matching between datasets owned by distinct parties is a challenging problem with several applications. Private matching allows two parties to identify the records that are close to each other according to some distance functions, such that no additional information other than the join result is disclosed to any party. Private matching can be solved securely and accurately using secure multi-party computation (SMC) techniques, but such an approach is prohibitively expensive in practice. Previous work proposed the release of sanitized versions of the sensitive datasets, which allows blocking, i.e., filtering out subsets of records that cannot be part of the join result. This way, SMC is applied only to a small fraction of record pairs, reducing the matching cost to acceptable levels. The blocking step is essential for the privacy, accuracy and efficiency of matching. However, the state-of-the-art focuses on sanitization based on k-anonymity, which does not provide sufficient privacy. We propose an alternative design centered on differential privacy, a novel paradigm that provides strong privacy guarantees. The realization of the new model presents difficult challenges, such as the evaluation of distance-based matching conditions with the help of only a statistical queries interface. Specialized versions of data indexing structures (e.g., kd-trees) also need to be devised, in order to comply with differential privacy. Experiments conducted on the real-world Census-income dataset show that, although our methods provide strong privacy, their effectiveness in reducing matching cost is not far from that of k-anonymity based counterparts.
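For context, the statistical queries interface that differential privacy permits is commonly realized with the Laplace mechanism; below is a minimal, generic Python sketch of a noisy count (the standard mechanism, not the paper's specific sanitization or kd-tree variants) of the kind such a blocking step could issue:

```python
import random

def noisy_count(records, predicate, epsilon):
    """epsilon-differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1, so adding Laplace noise with scale
    1/epsilon to the true count satisfies epsilon-differential privacy for
    this single query (budget composition across queries is ignored here).
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# e.g. a noisy population count for one cell of a space-partitioning index:
# noisy_count(dataset, lambda r: 30 <= r["age"] < 40, epsilon=0.1)
```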

171 citations


Proceedings ArticleDOI
Wentao Wu1, Yanghua Xiao1, Wei Wang1, Zhenying He1, Zhihui Wang1 
22 Mar 2010
TL;DR: The k-symmetry model is proposed, which modifies a naively-anonymized network so that for any vertex in the network there exist at least k - 1 structurally equivalent counterparts; properties of the original network can be recovered through aggregations on quite a small number of sample graphs.
Abstract: With more and more social network data being released, protecting the sensitive information within social networks from leakage has become an important concern of publishers. Adversaries with some background structural knowledge about a target individual can easily re-identify him from the network, even if the identifiers have been replaced by randomized integers (i.e., the network is naively-anonymized). Since numerous kinds of topological information can be used to attack a victim's privacy, resisting such structural re-identification becomes a great challenge. Previous works only investigated a minority of such structural attacks, without considering protection against re-identification under any potential structural knowledge about a target. To achieve this objective, in this paper we propose the k-symmetry model, which modifies a naively-anonymized network so that for any vertex in the network, there exist at least k - 1 structurally equivalent counterparts. We also propose sampling methods to extract approximate versions of the original network from the anonymized network so that statistical properties of the original network can be evaluated. Extensive experiments show that we can successfully recover a variety of such properties of the original network through aggregations on quite a small number of sample graphs.

144 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The P* algorithm, a best-first search method based on a novel hierarchical partition tree index, and three effective heuristic evaluation functions are devised to evaluate probabilistic path queries efficiently.
Abstract: Path queries such as "finding the shortest path in travel time from my hotel to the airport" are heavily used in many applications of road networks. Currently, simple statistic aggregates such as the average travel time between two vertices are often used to answer path queries. However, such simple aggregates often cannot capture the uncertainty inherent in traffic. In this paper, we study how to take traffic uncertainty into account in answering path queries in road networks. To capture the uncertainty in traffic such as the travel time between two vertices, the weight of an edge is modeled as a random variable and is approximated by a set of samples. We propose three novel types of probabilistic path queries using basic probability principles: (1) a probabilistic path query like "what are the paths from my hotel to the airport whose travel time is at most 30 minutes with a probability of at least 90%?"; (2) a weight-threshold top-k path query like "what are the top-3 paths from my hotel to the airport with the highest probabilities to take at most 30 minutes?"; and (3) a probability-threshold top-k path query like "what are the top-3 shortest paths from my hotel to the airport whose travel time is guaranteed by a probability of at least 90%?" To evaluate probabilistic path queries efficiently, we develop three efficient probability calculation methods: an exact algorithm, a constant factor approximation method and a sampling based approach. Moreover, we devise the P* algorithm, a best-first search method based on a novel hierarchical partition tree index and three effective heuristic evaluation functions. An extensive empirical study using real road networks and synthetic data sets shows the effectiveness of the proposed path queries and the efficiency of the query evaluation methods.
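As a rough illustration of the sampling-based evaluation (a generic Monte Carlo sketch over assumed data structures, not the paper's exact algorithms or the P* index), the probability that a fixed path meets a travel-time budget can be estimated from per-edge samples:

```python
import random

def path_probability(path_edges, edge_samples, budget, trials=10_000):
    """Estimate P(total travel time of `path_edges` <= budget).

    edge_samples: dict edge -> list of observed travel times; each trial
    draws one sample per edge independently, mirroring the model in which
    every edge weight is a random variable approximated by samples.
    """
    hits = 0
    for _ in range(trials):
        total = sum(random.choice(edge_samples[e]) for e in path_edges)
        if total <= budget:
            hits += 1
    return hits / trials

# e.g. path_probability([("hotel", "x"), ("x", "airport")], samples, budget=30)
```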

138 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The syntax and semantics of the C-SPARQL language are shown, a query graph model is introduced as an intermediate representation of queries devoted to optimization, and optimizations are introduced in terms of rewriting rules applied to the query graph model.
Abstract: Continuous SPARQL (C-SPARQL) is proposed as a new language for continuous queries over streams of RDF data. It covers a gap in the Semantic Web abstractions which is needed for many emerging applications, including our focus on Urban Computing. In this domain, sensor-based information on roads must be processed to deduce localized traffic conditions and then produce traffic management strategies. Executing C-SPARQL queries requires the effective integration of SPARQL and streaming technologies, which capitalize over a decade of research and development; such integration poses several nontrivial challenges. In this paper we (a) show the syntax and semantics of the C-SPARQL language together with some examples; (b) introduce a query graph model which is an intermediate representation of queries devoted to optimization; (c) discuss the features of an execution environment that leverages existing technologies; (d) introduce optimizations in terms of rewriting rules applied to the query graph model, so as to efficiently exploit the execution environment; and (e) show evidence of the effectiveness of our optimizations on a prototype of the execution environment.

137 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: Adaptive merging, as discussed by the authors, is an adaptive, incremental, and efficient technique for index creation that focuses on key ranges used in actual queries; its sort efficiency is comparable to that of traditional B-tree creation.
Abstract: In a relational data warehouse with many tables, the number of possible and promising indexes exceeds human comprehension and requires automatic index tuning. While monitoring and reactive index tuning have been proposed, adaptive indexing focuses on adapting the physical database layout for and by actual queries. "Database cracking" is one such technique. Only if and when a column is used in query predicates, an index for the column is created; and only if and when a key range is queried, the index is optimized for this key range. The effect is akin to a sort that is adaptive and incremental. This sort is, however, very inefficient, particularly when applied on block-access devices. In contrast, traditional index creation sorts data with an efficient merge sort optimized for block-access devices, but it is neither adaptive nor incremental. We propose adaptive merging, an adaptive, incremental, and efficient technique for index creation. Index optimization focuses on key ranges used in actual queries. The resulting index adapts more quickly to new data and to new query patterns than database cracking. Sort efficiency is comparable to that of traditional B-tree creation. Nonetheless, the new technique promises better query performance than database cracking, both in memory and on block-access storage.
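As a toy sketch of the mechanism (an assumed in-memory simplification in Python, not the authors' B-tree based implementation), the data starts out as sorted runs and each range query merges only the requested key range out of the runs into the final index, so frequently queried ranges get optimized first:

```python
import bisect

class AdaptiveMergeIndex:
    """Illustrative adaptive merging: queries drive incremental merging."""

    def __init__(self, runs):
        self.runs = [sorted(r) for r in runs]   # initial sorted runs (cheap to build)
        self.merged = []                        # final, fully merged key sequence

    def range_query(self, lo, hi):
        """Answer a range query and, as a side effect, merge that key range."""
        moved = []
        for run in self.runs:
            i = bisect.bisect_left(run, lo)
            j = bisect.bisect_right(run, hi)
            moved.extend(run[i:j])
            del run[i:j]                        # keys leave the runs once merged
        for key in moved:
            bisect.insort(self.merged, key)     # land in the final structure
        i = bisect.bisect_left(self.merged, lo)
        j = bisect.bisect_right(self.merged, hi)
        return self.merged[i:j]

idx = AdaptiveMergeIndex([[5, 1, 9], [7, 3, 2], [8, 4, 6]])
print(idx.range_query(3, 6))   # [3, 4, 5, 6]; only this range has been merged so far
```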

123 citations


Proceedings ArticleDOI
Yafang Wang1, Mingjie Zhu1, Lizhen Qu1, Marc Spaniol1, Gerhard Weikum1 
22 Mar 2010
TL;DR: This paper introduces Timely YAGO, which extends the previously built knowledge base YAGO with temporal aspects, and extracts temporal facts from Wikipedia infoboxes, categories, and lists in articles, and integrates these into the Timely YAGO knowledge base.
Abstract: Recent progress in information extraction has shown how to automatically build large ontologies from high-quality sources like Wikipedia. But knowledge evolves over time; facts have associated validity intervals. Therefore, ontologies should include time as a first-class dimension. In this paper, we introduce Timely YAGO, which extends our previously built knowledge base YAGO with temporal aspects. This prototype system extracts temporal facts from Wikipedia infoboxes, categories, and lists in articles, and integrates these into the Timely YAGO knowledge base. We also support querying temporal facts, by temporal predicates in a SPARQL-style language. Visualization of query results is provided in order to better understand the dynamic nature of knowledge.

109 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The Position List Word Aligned Hybrid (PLWAH) compression scheme is presented, which improves significantly over WAH compression by better utilizing the available bits and new CPU instructions.
Abstract: Compressed bitmap indexes are increasingly used for efficiently querying very large and complex databases. The Word Aligned Hybrid (WAH) bitmap compression scheme is commonly recognized as the most efficient compression scheme in terms of CPU efficiency. However, WAH compressed bitmaps use a lot of storage space. This paper presents the Position List Word Aligned Hybrid (PLWAH) compression scheme that improves significantly over WAH compression by better utilizing the available bits and new CPU instructions. For typical bit distributions, PLWAH compressed bitmaps are often half the size of WAH bitmaps and, at the same time, offer an even better CPU efficiency. The results are verified by theoretical estimates and extensive experiments on large amounts of both synthetic and real-world data.
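To convey the core idea only (a heavily simplified Python sketch with symbolic output, not the bit-exact 32-bit PLWAH word layout), runs of empty 31-bit chunks become fill words, and a sufficiently sparse chunk that follows a fill run is folded into the fill word as a short position list instead of being emitted as a separate literal:

```python
WORD_BITS = 31  # payload bits per 32-bit word, as in WAH

def chunks(bits):
    """Split a bit list into 31-bit groups, zero-padding the last group."""
    for i in range(0, len(bits), WORD_BITS):
        group = bits[i:i + WORD_BITS]
        yield group + [0] * (WORD_BITS - len(group))

def plwah_encode(bits, max_positions=1):
    """Simplified PLWAH-style encoding with symbolic words (illustrative only).

    Runs of all-zero chunks become ("zero_fill", run_length, positions); if the
    chunk right after the run has at most `max_positions` set bits, its bit
    positions ride along in the fill word instead of forming a new literal.
    """
    words, run, groups, i = [], 0, list(chunks(bits)), 0
    while i < len(groups):
        g = groups[i]
        if not any(g):                                    # all-zero chunk extends the run
            run += 1
            i += 1
            continue
        positions = tuple(j for j, b in enumerate(g) if b)
        if run and len(positions) <= max_positions:
            words.append(("zero_fill", run, positions))   # the PLWAH piggyback trick
        else:
            if run:
                words.append(("zero_fill", run, ()))
            words.append(("literal", positions))
        run = 0
        i += 1
    if run:
        words.append(("zero_fill", run, ()))
    return words

# A single set bit deep inside a sparse bitmap folds into one fill word:
print(plwah_encode([0] * 200 + [1]))   # [('zero_fill', 6, (14,))]
```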

97 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work proposes Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking, and develops a suite of I-LSH-based RL algorithms, named HARRA, which is thoroughly validated using various real data sets.
Abstract: We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).
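To illustrate the blocking idea in isolation (a generic MinHash-based LSH sketch with assumed parameters, not HARRA or I-LSH themselves), records are hashed so that similar ones tend to collide in the same buckets and only co-bucketed pairs proceed to matching:

```python
import random
from collections import defaultdict

def minhash_signature(tokens, hash_seeds):
    """One MinHash value per seed; similar token sets get similar signatures."""
    return tuple(min(hash((seed, t)) for t in tokens) for seed in hash_seeds)

def lsh_blocks(records, bands=4, rows=2, seed=42):
    """Group record ids into candidate blocks via banded MinHash LSH."""
    rng = random.Random(seed)
    seeds = [rng.random() for _ in range(bands * rows)]
    buckets = defaultdict(set)
    for rid, tokens in records.items():
        sig = minhash_signature(tokens, seeds)
        for b in range(bands):
            band_key = (b, sig[b * rows:(b + 1) * rows])
            buckets[band_key].add(rid)
    return [ids for ids in buckets.values() if len(ids) > 1]

records = {1: {"john", "smith", "ny"}, 2: {"jon", "smith", "ny"}, 3: {"alice", "lee"}}
print(lsh_blocks(records))   # records 1 and 2 likely share a block; record 3 stays apart
```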

96 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work provides a formal semantics for the language, presents novel techniques for efficiently evaluating lineage queries, and shows that these strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.
Abstract: A key advantage of scientific workflow systems over traditional scripting approaches is their ability to automatically record data and process dependencies introduced during workflow runs. This information is often represented through provenance graphs, which can be used by scientists to better understand, reproduce, and verify scientific results. However, while most systems record and store data and process dependencies, few provide easy-to-use and efficient approaches for accessing and querying provenance information. Instead, users formulate provenance graph queries directly against physical data representations (e.g., relational, XML, or RDF), leading to queries that are difficult to express and expensive to evaluate. We address these problems through a high-level query language tailored for expressing provenance graph queries. The language is based on a general model of provenance supporting scientific workflows that process XML data and employ update semantics. Query constructs are provided for querying both structure and lineage information. Unlike other languages that return sets of nodes as answers, our query language is closed, i.e., answers to lineage queries are sets of lineage dependencies (edges) allowing answers to be further queried. We provide a formal semantics for the language and present novel techniques for efficiently evaluating lineage queries. Experimental results on real and synthetic provenance traces demonstrate that our lineage based optimizations outperform an in-memory and standard database implementation by orders of magnitude. We also show that our strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: The new concept of minimal indoor walking distance (MIWD) is proposed along with algorithms and data structures for distance computation and storage; the states of indoor moving objects are differentiated based on a positioning device deployment graph, these states are utilized in effective object indexing structures, and the uncertainty of object locations is captured.
Abstract: The availability of indoor positioning renders it possible to deploy location-based services in indoor spaces. Many such services will benefit from the efficient support for k nearest neighbor (kNN) queries over large populations of indoor moving objects. However, existing kNN techniques fall short in indoor spaces because these differ from Euclidean and spatial network spaces and because of the limited capabilities of indoor positioning technologies. To contend with indoor settings, we propose the new concept of minimal indoor walking distance (MIWD) along with algorithms and data structures for distance computing and storage; and we differentiate the states of indoor moving objects based on a positioning device deployment graph, utilize these states in effective object indexing structures, and capture the uncertainty of object locations. On these foundations, we study the probabilistic threshold kNN (PTkNN) query. Given a query location q and a probability threshold T, this query returns all subsets of k objects that have probability larger than T of containing the kNN query result of q. We propose a combination of three techniques for processing this query. The first uses the MIWD metric to prune objects that are too far away. The second uses fast probability estimates to prune unqualified objects and candidate result subsets. The third uses efficient probability evaluation for computing the final result on the remaining candidate subsets. An empirical study using both synthetic and real data shows that the techniques are efficient.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: A private filter is developed that determines the actual group nearest neighbor from the retrieved candidate answers without revealing user locations to any involved party, including the LSP.
Abstract: User privacy in location-based services has attracted great interest in the research community. We introduce a novel framework based on a decentralized architecture for privacy preserving group nearest neighbor queries. A group nearest neighbor (GNN) query returns the location of a meeting place that minimizes the aggregate distance from a spread out group of users; for example, a group of users can ask for a restaurant that minimizes the total travel distance from them. We identify the challenges in preserving user privacy for GNN queries and provide a comprehensive solution to this problem. In our approach, users provide their locations as regions instead of exact points to a location service provider (LSP) to preserve their privacy. The LSP returns a set of candidate answers that includes the actual group nearest neighbor. We develop a private filter that determines the actual group nearest neighbor from the retrieved candidate answers without revealing user locations to any involved party, including the LSP. We also propose an efficient algorithm to evaluate GNN queries with respect to the provided set of regions (the users' imprecise locations). An extensive experimental study shows the effectiveness of our proposed technique.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper proposes an algorithm named Hash Count to find ELCAs under the ELCA (Exclusive LCA) semantics, which was first proposed by Guo et al. and afterwards named by Xu and Papakonstantinou, and compares it with state-of-the-art algorithms.
Abstract: Keyword search is integrated in many applications on account of the convenience to convey users' query intention. Recently, answering keyword queries on XML data has drawn the attention of web and database communities, because the success of this research will relieve users from learning complex XML query languages, such as XPath/XQuery, and/or knowing the underlying schema of the queried XML data. As a result, information in XML data can be discovered much more easily. To model the result of answering keyword queries on XML data, many LCA (lowest common ancestor) based notions have been proposed. In this paper, we focus on the ELCA (Exclusive LCA) semantics, which was first proposed by Guo et al. and afterwards named by Xu and Papakonstantinou. We propose an algorithm named Hash Count to find ELCAs efficiently. Our analysis shows that the complexity of the Hash Count algorithm is O(kd|S1|), where k is the number of keywords, d is the depth of the queried XML document and |S1| is the frequency of the rarest keyword. This complexity is the best result known so far. We also evaluate the algorithm on a real DBLP dataset, and compare it with the state-of-the-art algorithms. The experimental results demonstrate the advantage of the Hash Count algorithm in practice.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper provides an approach to provenance querying that avoids joins over provenance logs by using information about the workflow definition to inform the construction of queries that directly target relevant lineage results, and provides fine-grained provenance querying, even for workflows that create and consume collections.
Abstract: The management and querying of workflow provenance data underpins a collection of activities, including the analysis of workflow results, and the debugging of workflows or services. Such activities require efficient evaluation of lineage queries over potentially complex and voluminous provenance logs. Naive implementations of lineage queries navigate provenance logs by joining tables that represent the flow of data between connected processors invoked from workflows. In this paper we provide an approach to provenance querying that: (i) avoids joins over provenance logs by using information about the workflow definition to inform the construction of queries that directly target relevant lineage results; (ii) provides fine-grained provenance querying, even for workflows that create and consume collections; and (iii) scales effectively to address complex workflows, workflows with large intermediate data sets, and queries over multiple workflows.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: Deduce is presented, which extends IBM's System S stream processing middleware with support for MapReduce by providing language and runtime support for easily specifying and embedding MapReduce jobs as elements of a larger data-flow.
Abstract: MapReduce and stream processing are two emerging, but different, paradigms for analyzing, processing and making sense of large volumes of modern day data. While MapReduce offers the capability to analyze several terabytes of stored data, stream processing solutions offer the ability to process, possibly, a few million updates every second. However, there is an increasing number of data processing applications which need a solution that effectively and efficiently combines the benefits of MapReduce and stream processing to address their data processing needs. For example, in the automated stock trading domain, applications usually require periodic analysis of large amounts of stored data to generate a model using MapReduce, which is then used to process a stream of incident updates using a stream processing system. This paper presents Deduce, which extends IBM's System S stream processing middleware with support for MapReduce by providing (1) language and runtime support for easily specifying and embedding MapReduce jobs as elements of a larger data-flow, (2) capability to describe reusable modules that can be used as map and reduce tasks, and (3) configuration parameters that can be tweaked to control and manage the usage of shared resources by the MapReduce and stream processing components. We describe the motivation for Deduce and the design and implementation of the MapReduce extensions for System S, and then present experimental results.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work proposes that data incomparability should be treated as another key factor in optimizing skyline computation, and identifies common modules shared by existing non-index skyline algorithms to develop a cost model that guides balanced pivot point selection.
Abstract: Skyline queries have gained a lot of attention for multi-criteria analysis in large-scale datasets. While existing skyline algorithms have focused mostly on exploiting data dominance to achieve efficiency, we propose that data incomparability should be treated as another key factor in optimizing skyline computation. Specifically, to optimize both factors, we first identify common modules shared by existing non-index skyline algorithms, and then analyze them to develop a cost model to guide a balanced pivot point selection. Based on the cost model, we lastly implement our balanced pivot selection in two algorithms, BSkyTree-S and BSkyTree-P, treating both dominance and incomparability as key factors. Our experimental results demonstrate that proposed algorithms outperform state-of-the-art skyline algorithms up to two orders of magnitude.
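For reference, the two relations the paper balances, dominance and incomparability, can be stated directly (a minimal Python sketch of the definitions plus a naive skyline, not the BSkyTree-S or BSkyTree-P algorithms):

```python
def dominates(p, q):
    """p dominates q if p is at least as good on every dimension and
    strictly better on at least one (here: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def incomparable(p, q):
    """Neither point dominates the other; comparing them prunes nothing,
    which is why incomparability is a cost factor worth optimizing."""
    return not dominates(p, q) and not dominates(q, p)

def skyline(points):
    """Naive O(n^2) skyline, kept only to make the definitions concrete."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

print(skyline([(1, 9), (3, 3), (9, 1), (5, 5)]))   # (5, 5) is dominated by (3, 3)
```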

Proceedings ArticleDOI
22 Mar 2010
TL;DR: First, it is shown that optimal l-diverse generalization is NP-hard even when there are only 3 distinct sensitive values in the microdata, and an (l · d)-approximation algorithm is developed, which is the first known algorithm with a non-trivial bound on information loss.
Abstract: The existing solutions to privacy preserving publication can be classified into the theoretical and heuristic categories. The former guarantees provably low information loss, whereas the latter incurs gigantic loss in the worst case, but is shown empirically to perform well on many real inputs. While numerous heuristic algorithms have been developed to satisfy advanced privacy principles such as l-diversity, t-closeness, etc., the theoretical category is currently limited to k-anonymity, which is the earliest principle known to have severe vulnerability to privacy attacks. Motivated by this, we present the first theoretical study on l-diversity, a popular principle that is widely adopted in the literature. First, we show that optimal l-diverse generalization is NP-hard even when there are only 3 distinct sensitive values in the microdata. Then, an (l · d)-approximation algorithm is developed, where d is the dimensionality of the underlying dataset. This is the first known algorithm with a non-trivial bound on information loss. Extensive experiments with real datasets validate the effectiveness and efficiency of the proposed solution.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work proposes an estimation-based approach to compute the promising result types for a keyword query, which can help a user quickly narrow down to her specific information need, and designs new algorithms based on the indexes to be built.
Abstract: Although keyword queries enable inexperienced users to easily search XML databases with no specific knowledge of complex structured query languages or XML data schemas, the ambiguity of a keyword query may result in a great number of results that may be classified into different types. For users, each result type implies a possible search intention. To improve the performance of keyword queries, it is desirable to efficiently work out the most relevant result type from the data to be retrieved. Several recent research works have focused on this interesting problem by using data schema information or pure IR-style statistical information. However, this problem is still open due to some requirements. (1) The data to be retrieved may not contain schema information; (2) Relevant result types should be efficiently computed before keyword query evaluation; (3) The correlation between a result type and a keyword query should be measured by analyzing the distribution of relevant values and structures within the data. To our knowledge, none of the existing work satisfies the above three requirements together. To address the problem, we propose an estimation-based approach to compute the promising result types for a keyword query, which can help a user quickly narrow down to her specific information need. To speed up the computation, we design new algorithms based on the indexes to be built. Finally, we present a set of experimental results that evaluate the proposed algorithms and show the potential of this work.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: B-Fabric is a system infrastructure supporting on-the-fly coupling of user applications, and thus serving as an extensible platform for fast-paced, cutting-edge, collaborative research.
Abstract: This paper demonstrates B-Fabric, an all-in-one solution for two major purposes in life sciences. On the one hand, it is a system for the integrated management of experimental data and scientific annotations. On the other hand, it is a system infrastructure supporting on-the-fly coupling of user applications, and thus serving as an extensible platform for fast-paced, cutting-edge, collaborative research.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: Analytical and experimental evaluations convey the scalability of P3Q for top-k query processing, and show that on a 10,000-user Delicious trace, with little storage at each user, the queries are accurately computed within reasonable time and bandwidth consumption.
Abstract: This paper presents P3Q, a fully decentralized gossip-based protocol to personalize query processing in social tagging systems. P3Q dynamically associates each user with social acquaintances sharing similar tagging behaviours. Queries are gossiped among such acquaintances, computed on the fly in a collaborative, yet partitioned manner, and results are iteratively refined and returned to the querier. Analytical and experimental evaluations convey the scalability of P3Q for top-k query processing. More specifically, we show that on a 10,000-user Delicious trace, with little storage at each user, the queries are accurately computed within reasonable time and bandwidth consumption. We also report on the inherent ability of P3Q to cope with users updating profiles and departing.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: An approach for incrementally annotating schema mappings using feedback obtained from end users is explored, together with a method for selecting, from the set of candidate mappings, those to be used for query evaluation, considering user requirements in terms of precision and recall.
Abstract: The specification of schema mappings has proved to be time and resource consuming, and has been recognized as a critical bottleneck to the large scale deployment of data integration systems. In an attempt to address this issue, dataspaces have been proposed as a data management abstraction that aims to reduce the up-front cost required to setup a data integration system by gradually specifying schema mappings through interaction with end users in a pay-as-you-go fashion. As a step in this direction, we explore an approach for incrementally annotating schema mappings using feedback obtained from end users. In doing so, we do not expect users to examine mapping specifications; rather, they comment on results to queries evaluated using the mappings. Using annotations computed on the basis of user feedback, we present a method for selecting from the set of candidate mappings, those to be used for query evaluation considering user requirements in terms of precision and recall. In doing so, we cast mapping selection as an optimization problem. Mapping annotations may reveal that the quality of schema mappings is poor. We also show how feedback can be used to support the derivation of better quality mappings from existing mappings through refinement. An evolutionary algorithm is used to efficiently and effectively explore the large space of mappings that can be obtained through refinement. The results of evaluation exercises show the effectiveness of our solution for annotating, selecting and refining schema mappings.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper introduces a framework for efficient processing of flexible pattern queries that includes an underlying indexing structure and algorithms for query processing using different evaluation strategies; an extensive performance evaluation shows significant performance improvement compared to existing solutions.
Abstract: The wide adaptation of GPS and cellular technologies has created many applications that collect and maintain large repositories of data in the form of trajectories. Previous work on querying/analyzing trajectorial data typically falls into methods that either address spatial range and NN queries, or, similarity based queries. Nevertheless, trajectories are complex objects whose behavior over time and space can be better captured as a sequence of interesting events. We thus facilitate the use of motion "pattern" queries which allow the user to select trajectories based on specific motion patterns. Such patterns are described as regular expressions over a spatial alphabet that can be implicitly or explicitly anchored to the time domain. Moreover, we are interested in "flexible" patterns that allow the user to include "variables" in the query pattern and thus greatly increase its expressive power. In this paper we introduce a framework for efficient processing of flexible pattern queries. The framework includes an underlying indexing structure and algorithms for query processing using different evaluation strategies. An extensive performance evaluation of this framework shows significant performance improvement when compared to existing solutions.
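To make the pattern idea concrete (a toy Python sketch over an assumed spatial alphabet, not the paper's index or evaluation strategies), a trajectory reduced to the sequence of regions it crosses can be matched against a regular expression, with a query variable expressed as a backreference:

```python
import re

# A trajectory reduced to the sequence of region labels it crosses
# (assumed spatial alphabet A-E; in practice regions come from a spatial index).
trajectory = "AABBCCE"

# "Start in A, later cross some region @x at least twice, and end in E";
# the variable @x is expressed as a named group plus a backreference.
pattern = re.compile(r"^A.*(?P<x>[B-D]).*(?P=x).*E$")

m = pattern.match(trajectory)
if m:
    print("match, variable @x bound to region", m.group("x"))   # binds @x to C here
```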

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper presents an algorithm for processing preference queries that uses the preferential order between keywords to direct the joining of relevant tuples from multiple relations, and shows how to reduce the complexity of this algorithm by sharing computational steps.
Abstract: Keyword-based search in relational databases allows users to discover relevant information without knowing the database schema or using complicated queries. However, such searches may return an overwhelming number of results, often loosely related to the user intent. In this paper, we propose personalizing keyword database search by utilizing user preferences. Query results are ranked based on both their relevance to the query and their preference degree for the user. To further increase the quality of results, we consider two new metrics that evaluate the goodness of the result as a set, namely coverage of many user interests and content diversity. We present an algorithm for processing preference queries that uses the preferential order between keywords to direct the joining of relevant tuples from multiple relations. We then show how to reduce the complexity of this algorithm by sharing computational steps. Finally, we report evaluation results of the efficiency and effectiveness of our approach.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: A novel rewrite-based optimization technique that is generally applicable to different types of matching processes is introduced, together with filter-based rewrite rules similar to predicate push-down in query optimization.
Abstract: A recurring manual task in data integration, ontology alignment or model management is finding mappings between complex meta data structures. In order to reduce the manual effort, many matching algorithms for semi-automatically computing mappings were introduced. Unfortunately, current matching systems severely lack performance when matching large schemas. Recently, some systems tried to tackle the performance problem within individual matching approaches. However, none of them developed solutions on the level of matching processes. In this paper we introduce a novel rewrite-based optimization technique that is generally applicable to different types of matching processes. We introduce filter-based rewrite rules similar to predicate push-down in query optimization. In addition we introduce a modeling tool and recommendation system for rewriting matching processes. Our evaluation on matching large web service message types shows significant performance improvements without losing the quality of automatically computed results.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: A data-centric view of pervasive environments is defined: the classical notion of database is extended to come up with a broader notion, defined as a relational pervasive environment, integrating data, streams and active/passive services; the so-called Serena algebra is proposed with operators to homogeneously handle data and services.
Abstract: Querying non-conventional data is recognized as a major issue in new environments and applications such as those occurring in pervasive computing. A key issue is the ability to query data, streams and services in a declarative way. Our overall objective is to make the development of pervasive applications easier through database principles. In this paper, through the notion of virtual attributes and binding patterns, we define a data-centric view of pervasive environments: the classical notion of database is extended to come up with a broader notion, defined as relational pervasive environment, integrating data, streams and active/passive services. Then, the so-called Serena algebra is proposed with operators to homogeneously handle data and services. Moreover, the notion of stream can also be smoothly integrated into this algebra. A prototype of Pervasive Environment Management System has been implemented on which first experiments have been conducted to validate our approach.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work surveys techniques to optimize query processing on the Deep Web, in a setting where data are represented in the relational model, and illustrates optimizations both at query plan generation time and at runtime, highlighting the role of integrity constraints.
Abstract: Data stored outside Web pages and accessible from the Web, typically through HTML forms, constitute the so-called Deep Web. Such data are of great value, but difficult to query and search. We survey techniques to optimize query processing on the Deep Web, in a setting where data are represented in the relational model. We illustrate optimizations both at query plan generation time and at runtime, highlighting the role of integrity constraints. We discuss several prototype systems that address the query processing problem.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper proposes two schemes, the first in the literature, for authenticating DAGs and directed cyclic graphs without leakage, based on the structure of the graph as defined by depth-first graph traversals and on aggregate signatures.
Abstract: Secure data sharing in multi-party environments requires that both authenticity and confidentiality of the data be assured. Digital signature schemes are commonly employed for authentication of data. However, no such technique exists for directed graphs, even though such graphs are one of the most widely used data organization structures. Existing schemes for DAGs are authenticity-preserving but not confidentiality-preserving, and lead to leakage of sensitive information during authentication. In this paper, we propose two schemes for authenticating DAGs and directed cyclic graphs without leaking; these are the first such schemes in the literature. They are based on the structure of the graph as defined by depth-first graph traversals and on aggregate signatures. Graphs are structurally different from trees in that they have four types of edges: tree, forward, cross, and back-edges in a depth-first traversal. The fact that an edge is a forward, cross or a back-edge conveys information that is sensitive in several contexts. Moreover, back-edges pose a more difficult problem than the one posed by forward and cross-edges, primarily because back-edges add bidirectional properties to graphs. We prove that the proposed technique is both authenticity-preserving and non-leaking. While providing such strong security properties, our scheme is also efficient, as supported by the performance results.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper proposes an approach that can translate the evaluation of any query into extensional operators, followed by some post-processing that requires probabilistic inference, and uses characteristics of the data to adapt smoothly between the two evaluation strategies.
Abstract: There are two broad approaches to query evaluation over probabilistic databases: (1) Intensional Methods proceed by manipulating expressions over symbolic events associated with uncertain tuples. This approach is very general and can be applied to any query, but requires an expensive postprocessing phase, which involves some general-purpose probabilistic inference. (2) Extensional Methods, on the other hand, evaluate the query by translating operations over symbolic events to a query plan; extensional methods scale well, but they are restricted to safe queries. In this paper, we bridge this gap by proposing an approach that can translate the evaluation of any query into extensional operators, followed by some post-processing that requires probabilistic inference. Our approach uses characteristics of the data to adapt smoothly between the two evaluation strategies. If the query is safe or becomes safe because of the data instance, then the evaluation is completely extensional and inside the database. If the query/data combination departs from the ideal setting of a safe query, then some intensional processing is performed, whose complexity depends only on the distance from the ideal setting.
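For orientation, extensional operators manipulate probabilities directly using the usual rules for independent tuple events (generic textbook notation, not necessarily the operators or notation used in the paper):

```latex
% Independent join: the joined tuple exists iff both inputs do.
P(t_1 \bowtie t_2) = P(t_1)\,P(t_2)
% Independent project: t appears iff at least one of the tuples
% t_1,\dots,t_n that project onto it appears.
P(t) = 1 - \prod_{i=1}^{n}\bigl(1 - P(t_i)\bigr)
```

When these independence assumptions fail, which is the unsafe case, the remaining correlations are what the intensional post-processing step must resolve.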

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper presents an experiment on a real world scenario that demonstrates the strong analytical power of massive, raw trajectory data made available as a by-product of telecom services, in unveiling the complexity of urban mobility.
Abstract: The growing availability of mobile devices produces an enormous quantity of personal tracks which calls for advanced analysis methods capable of extracting knowledge out of massive trajectories datasets. In this paper we present an experiment on a real world scenario that demonstrates the strong analytical power of massive, raw trajectory data made available as a by-product of telecom services, in unveiling the complexity of urban mobility. The experiment has been made possible by the GeoPKDD system, an integrated platform for complex analysis of mobility data. The system combines spatio-temporal querying capabilities with data mining and semantic technologies, thus providing a full support for the Mobility Knowledge Discovery process.