Showing papers presented at "Extending Database Technology" in 2010


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The problem of optimizing the shares, given a fixed number of Reduce processes, is studied, and an algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given.
Abstract: Implementations of map-reduce are being used to perform many operations on very large data. We examine strategies for joining several relations in the map-reduce environment. Our new approach begins by identifying the "map-key," the set of attributes that identify the Reduce process to which a Map process must send a particular tuple. Each attribute of the map-key gets a "share," which is the number of buckets into which its values are hashed, to form a component of the identifier of a Reduce process. Relations have their tuples replicated in limited fashion, the degree of replication depending on the shares for those map-key attributes that are missing from their schema. We study the problem of optimizing the shares, given a fixed number of Reduce processes. An algorithm for detecting and fixing problems where an attribute is "mistakenly" included in the map-key is given. Then, we consider two important special cases: chain joins and star joins. In each case we are able to determine the map-key and determine the shares that yield the least replication. While the method we propose is not always superior to the conventional way of using map-reduce to implement joins, there are some important cases involving large-scale data where our method wins, including: (1) analytic queries in which a very large fact table is joined with smaller dimension tables, and (2) queries involving paths through graphs with high out-degree, such as the Web or a social network.
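To make the share-based routing concrete, here is a minimal Python sketch (illustrative names and structures assumed, not the authors' implementation) of how a Map task could decide which Reduce processes receive a tuple, given a share for each map-key attribute:

```python
from itertools import product

def reducer_ids(tuple_values, shares, hash_fn=hash):
    """List the Reduce-process coordinates a single tuple is sent to.

    tuple_values: dict attr -> value for the attributes in the relation's schema;
    shares:       dict map-key attr -> number of hash buckets (its "share").
    A map-key attribute missing from the tuple's schema forces replication:
    the tuple is sent to every bucket along that dimension.
    """
    axes = []
    for attr, share in shares.items():
        if attr in tuple_values:
            axes.append([hash_fn(tuple_values[attr]) % share])   # exactly one bucket
        else:
            axes.append(range(share))                            # replicate over all buckets
    return list(product(*axes))

# Toy chain join R(A,B) JOIN S(B,C) JOIN T(C,D) with map-key {B, C}, shares 4 and 3:
shares = {"B": 4, "C": 3}
print(reducer_ids({"B": "b7"}, shares))             # an R-tuple, replicated over C's 3 buckets
print(reducer_ids({"B": "b7", "C": "c2"}, shares))  # an S-tuple, sent to exactly one reducer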

382 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: A family of novel approximate SimRank computation algorithms for static and dynamic information networks are developed and their corresponding theoretical justification and analysis are given.
Abstract: Information networks are ubiquitous in many applications and analysis on such networks has attracted significant attention in the academic communities. One of the most important aspects of information network analysis is to measure similarity between nodes in a network. SimRank is a simple and influential measure of this kind, based on a solid theoretical "random surfer" model. Existing work computes SimRank similarity scores in an iterative mode. We argue that the iterative method can be infeasible and inefficient when, as in many real-world scenarios, the networks change dynamically and frequently. We envision a non-iterative method to bridge the gap. It allows users not only to update the similarity scores incrementally, but also to derive similarity scores for an arbitrary subset of nodes. To enable the non-iterative computation, we propose to rewrite the SimRank equation into a non-iterative form by using the Kronecker product and vectorization operators. Based on this, we develop a family of novel approximate SimRank computation algorithms for static and dynamic information networks, and give their corresponding theoretical justification and analysis. The non-iterative method supports efficient processing of various node analyses, including similarity tracking and centrality tracking, on evolving information networks. The effectiveness and efficiency of our proposed methods are evaluated on synthetic and real data sets.
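As a rough illustration of the rewriting step (a standard vectorization identity under assumed notation, not necessarily the paper's exact formulation), the recursive SimRank equation can be turned into a linear system via the Kronecker product:

```latex
% Assumed notation: W is the column-normalized adjacency matrix, c the decay
% factor, S the SimRank matrix, with S = c\,W^{\top} S W + (1-c)\,I.
% Applying vec(ABC) = (C^{\top} \otimes A)\,vec(B):
\operatorname{vec}(S) = c\,\bigl(W^{\top} \otimes W^{\top}\bigr)\operatorname{vec}(S) + (1-c)\operatorname{vec}(I)
\;\Longrightarrow\;
\operatorname{vec}(S) = (1-c)\,\bigl(I - c\,W^{\top} \otimes W^{\top}\bigr)^{-1}\operatorname{vec}(I)
```

In this form the scores are given by one linear system rather than by iteration, which is what makes incremental updates and per-subset evaluation possible.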

171 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: Experiments conducted on the real-world Census-income dataset show that, although the proposed methods provide strong privacy, their effectiveness in reducing matching cost is not far from that of k-anonymity based counterparts.
Abstract: Private matching between datasets owned by distinct parties is a challenging problem with several applications. Private matching allows two parties to identify the records that are close to each other according to some distance functions, such that no additional information other than the join result is disclosed to any party. Private matching can be solved securely and accurately using secure multi-party computation (SMC) techniques, but such an approach is prohibitively expensive in practice. Previous work proposed the release of sanitized versions of the sensitive datasets, which allows blocking, i.e., filtering out subsets of records that cannot be part of the join result. This way, SMC is applied only to a small fraction of record pairs, reducing the matching cost to acceptable levels. The blocking step is essential for the privacy, accuracy and efficiency of matching. However, the state-of-the-art focuses on sanitization based on k-anonymity, which does not provide sufficient privacy. We propose an alternative design centered on differential privacy, a novel paradigm that provides strong privacy guarantees. The realization of the new model presents difficult challenges, such as the evaluation of distance-based matching conditions with the help of only a statistical queries interface. Specialized versions of data indexing structures (e.g., kd-trees) also need to be devised, in order to comply with differential privacy. Experiments conducted on the real-world Census-income dataset show that, although our methods provide strong privacy, their effectiveness in reducing matching cost is not far from that of k-anonymity based counterparts.
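For context, the statistical queries interface that differential privacy permits is commonly realized with the Laplace mechanism; below is a minimal, generic Python sketch of a noisy count (the standard mechanism, not the paper's specific sanitization or kd-tree variants) of the kind such a blocking step could issue:

```python
import random

def noisy_count(records, predicate, epsilon):
    """epsilon-differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1, so adding Laplace noise with scale
    1/epsilon to the true count satisfies epsilon-differential privacy for
    this single query (budget composition across queries is ignored here).
    """
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two exponentials with rate epsilon is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise

# e.g. a noisy population count for one cell of a space-partitioning index:
# noisy_count(dataset, lambda r: 30 <= r["age"] < 40, epsilon=0.1)
```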

171 citations


Proceedings ArticleDOI
Wentao Wu1, Yanghua Xiao1, Wei Wang1, Zhenying He1, Zhihui Wang1 
22 Mar 2010
TL;DR: The k-symmetry model is proposed, which modifies a naively-anonymized network so that for any vertex in the network there exist at least k - 1 structurally equivalent counterparts; properties of the original network can be recovered through aggregations on quite a small number of sample graphs.
Abstract: With more and more social network data being released, protecting the sensitive information within social networks from leakage has become an important concern of publishers. Adversaries with some background structural knowledge about a target individual can easily re-identify him from the network, even if the identifiers have been replaced by randomized integers (i.e., the network is naively-anonymized). Since numerous kinds of topological information can be used to attack a victim's privacy, resisting such structural re-identification becomes a great challenge. Previous works only investigated a minority of such structural attacks, without considering protection against re-identification under any potential structural knowledge about a target. To achieve this objective, in this paper we propose the k-symmetry model, which modifies a naively-anonymized network so that for any vertex in the network, there exist at least k - 1 structurally equivalent counterparts. We also propose sampling methods to extract approximate versions of the original network from the anonymized network so that statistical properties of the original network can be evaluated. Extensive experiments show that we can successfully recover a variety of such properties of the original network through aggregations on quite a small number of sample graphs.

144 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The P* algorithm, a best-first search method based on a novel hierarchical partition tree index, and three effective heuristic evaluation functions are devised to evaluate probabilistic path queries efficiently.
Abstract: Path queries such as "finding the shortest path in travel time from my hotel to the airport" are heavily used in many applications of road networks. Currently, simple statistic aggregates such as the average travel time between two vertices are often used to answer path queries. However, such simple aggregates often cannot capture the uncertainty inherent in traffic. In this paper, we study how to take traffic uncertainty into account in answering path queries in road networks. To capture the uncertainty in traffic such as the travel time between two vertices, the weight of an edge is modeled as a random variable and is approximated by a set of samples. We propose three novel types of probabilistic path queries using basic probability principles: (1) a probabilistic path query like "what are the paths from my hotel to the airport whose travel time is at most 30 minutes with a probability of at least 90%?"; (2) a weight-threshold top-k path query like "what are the top-3 paths from my hotel to the airport with the highest probabilities to take at most 30 minutes?"; and (3) a probability-threshold top-k path query like "what are the top-3 shortest paths from my hotel to the airport whose travel time is guaranteed by a probability of at least 90%?" To evaluate probabilistic path queries efficiently, we develop three efficient probability calculation methods: an exact algorithm, a constant factor approximation method and a sampling based approach. Moreover, we devise the P* algorithm, a best-first search method based on a novel hierarchical partition tree index and three effective heuristic evaluation functions. An extensive empirical study using real road networks and synthetic data sets shows the effectiveness of the proposed path queries and the efficiency of the query evaluation methods.
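As a rough illustration of the sampling-based evaluation (a generic Monte Carlo sketch over assumed data structures, not the paper's exact algorithms or the P* index), the probability that a fixed path meets a travel-time budget can be estimated from per-edge samples:

```python
import random

def path_probability(path_edges, edge_samples, budget, trials=10_000):
    """Estimate P(total travel time of `path_edges` <= budget).

    edge_samples: dict edge -> list of observed travel times; each trial
    draws one sample per edge independently, mirroring the model in which
    every edge weight is a random variable approximated by samples.
    """
    hits = 0
    for _ in range(trials):
        total = sum(random.choice(edge_samples[e]) for e in path_edges)
        if total <= budget:
            hits += 1
    return hits / trials

# e.g. path_probability([("hotel", "x"), ("x", "airport")], samples, budget=30)
```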

138 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The syntax and semantics of the C-SPARQL language are shown, a query graph model is introduced as an intermediate representation of queries devoted to optimization, and optimizations are introduced in terms of rewriting rules applied to the query graph model.
Abstract: Continuous SPARQL (C-SPARQL) is proposed as a new language for continuous queries over streams of RDF data. It covers a gap in the Semantic Web abstractions which is needed for many emerging applications, including our focus on Urban Computing. In this domain, sensor-based information on roads must be processed to deduce localized traffic conditions and then produce traffic management strategies. Executing C-SPARQL queries requires the effective integration of SPARQL and streaming technologies, which capitalize over a decade of research and development; such integration poses several nontrivial challenges. In this paper we (a) show the syntax and semantics of the C-SPARQL language together with some examples; (b) introduce a query graph model which is an intermediate representation of queries devoted to optimization; (c) discuss the features of an execution environment that leverages existing technologies; (d) introduce optimizations in terms of rewriting rules applied to the query graph model, so as to efficiently exploit the execution environment; and (e) show evidence of the effectiveness of our optimizations on a prototype of the execution environment.

137 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: Adaptive merging, as discussed by the authors, is an adaptive, incremental, and efficient technique for index creation that focuses on key ranges used in actual queries; its sort efficiency is comparable to that of traditional B-tree creation.
Abstract: In a relational data warehouse with many tables, the number of possible and promising indexes exceeds human comprehension and requires automatic index tuning. While monitoring and reactive index tuning have been proposed, adaptive indexing focuses on adapting the physical database layout for and by actual queries. "Database cracking" is one such technique. Only if and when a column is used in query predicates, an index for the column is created; and only if and when a key range is queried, the index is optimized for this key range. The effect is akin to a sort that is adaptive and incremental. This sort is, however, very inefficient, particularly when applied on block-access devices. In contrast, traditional index creation sorts data with an efficient merge sort optimized for block-access devices, but it is neither adaptive nor incremental. We propose adaptive merging, an adaptive, incremental, and efficient technique for index creation. Index optimization focuses on key ranges used in actual queries. The resulting index adapts more quickly to new data and to new query patterns than database cracking. Sort efficiency is comparable to that of traditional B-tree creation. Nonetheless, the new technique promises better query performance than database cracking, both in memory and on block-access storage.
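As a toy sketch of the mechanism (an assumed in-memory simplification in Python, not the authors' B-tree based implementation), the data starts out as sorted runs and each range query merges only the requested key range out of the runs into the final index, so frequently queried ranges get optimized first:

```python
import bisect

class AdaptiveMergeIndex:
    """Illustrative adaptive merging: queries drive incremental merging."""

    def __init__(self, runs):
        self.runs = [sorted(r) for r in runs]   # initial sorted runs (cheap to build)
        self.merged = []                        # final, fully merged key sequence

    def range_query(self, lo, hi):
        """Answer a range query and, as a side effect, merge that key range."""
        moved = []
        for run in self.runs:
            i = bisect.bisect_left(run, lo)
            j = bisect.bisect_right(run, hi)
            moved.extend(run[i:j])
            del run[i:j]                        # keys leave the runs once merged
        for key in moved:
            bisect.insort(self.merged, key)     # land in the final structure
        i = bisect.bisect_left(self.merged, lo)
        j = bisect.bisect_right(self.merged, hi)
        return self.merged[i:j]

idx = AdaptiveMergeIndex([[5, 1, 9], [7, 3, 2], [8, 4, 6]])
print(idx.range_query(3, 6))   # [3, 4, 5, 6]; only this range has been merged so far
```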

123 citations


Proceedings ArticleDOI
Yafang Wang1, Mingjie Zhu1, Lizhen Qu1, Marc Spaniol1, Gerhard Weikum1 
22 Mar 2010
TL;DR: This paper introduces Timely YAGO, which extends the previously built knowledge base YAGO with temporal aspects, and extracts temporal facts from Wikipedia infoboxes, categories, and lists in articles, and integrates these into the Timely YAGO knowledge base.
Abstract: Recent progress in information extraction has shown how to automatically build large ontologies from high-quality sources like Wikipedia. But knowledge evolves over time; facts have associated validity intervals. Therefore, ontologies should include time as a first-class dimension. In this paper, we introduce Timely YAGO, which extends our previously built knowledge base YAGO with temporal aspects. This prototype system extracts temporal facts from Wikipedia infoboxes, categories, and lists in articles, and integrates these into the Timely YAGO knowledge base. We also support querying temporal facts, by temporal predicates in a SPARQL-style language. Visualization of query results is provided in order to better understand the dynamic nature of knowledge.

109 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: The Position List Word Aligned Hybrid (PLWAH) compression scheme is presented, which improves significantly over WAH compression by better utilizing the available bits and new CPU instructions.
Abstract: Compressed bitmap indexes are increasingly used for efficiently querying very large and complex databases. The Word Aligned Hybrid (WAH) bitmap compression scheme is commonly recognized as the most efficient compression scheme in terms of CPU efficiency. However, WAH compressed bitmaps use a lot of storage space. This paper presents the Position List Word Aligned Hybrid (PLWAH) compression scheme that improves significantly over WAH compression by better utilizing the available bits and new CPU instructions. For typical bit distributions, PLWAH compressed bitmaps are often half the size of WAH bitmaps and, at the same time, offer an even better CPU efficiency. The results are verified by theoretical estimates and extensive experiments on large amounts of both synthetic and real-world data.
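To convey the core idea only (a heavily simplified Python sketch with symbolic output, not the bit-exact 32-bit PLWAH word layout), runs of empty 31-bit chunks become fill words, and a sufficiently sparse chunk that follows a fill run is folded into the fill word as a short position list instead of being emitted as a separate literal:

```python
WORD_BITS = 31  # payload bits per 32-bit word, as in WAH

def chunks(bits):
    """Split a bit list into 31-bit groups, zero-padding the last group."""
    for i in range(0, len(bits), WORD_BITS):
        group = bits[i:i + WORD_BITS]
        yield group + [0] * (WORD_BITS - len(group))

def plwah_encode(bits, max_positions=1):
    """Simplified PLWAH-style encoding with symbolic words (illustrative only).

    Runs of all-zero chunks become ("zero_fill", run_length, positions); if the
    chunk right after the run has at most `max_positions` set bits, its bit
    positions ride along in the fill word instead of forming a new literal.
    """
    words, run, groups, i = [], 0, list(chunks(bits)), 0
    while i < len(groups):
        g = groups[i]
        if not any(g):                                    # all-zero chunk extends the run
            run += 1
            i += 1
            continue
        positions = tuple(j for j, b in enumerate(g) if b)
        if run and len(positions) <= max_positions:
            words.append(("zero_fill", run, positions))   # the PLWAH piggyback trick
        else:
            if run:
                words.append(("zero_fill", run, ()))
            words.append(("literal", positions))
        run = 0
        i += 1
    if run:
        words.append(("zero_fill", run, ()))
    return words

# A single set bit deep inside a sparse bitmap folds into one fill word:
print(plwah_encode([0] * 200 + [1]))   # [('zero_fill', 6, (14,))]
```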

97 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work proposes Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking, and develops a suite of I-LSH-based RL algorithms, named HARRA, which is thoroughly validated using various real data sets.
Abstract: We study the performance issue of the "iterative" record linkage (RL) problem, where match and merge operations may occur together in iterations until convergence emerges. We first propose Iterative Locality-Sensitive Hashing (I-LSH), which dynamically merges LSH-based hash tables for quick and accurate blocking. Then, by exploiting inherent characteristics within/across data sets, we develop a suite of I-LSH-based RL algorithms, named HARRA (HAshed RecoRd linkAge). The superiority of HARRA in speed over competing RL solutions is thoroughly validated using various real data sets. While maintaining equivalent or comparable accuracy levels, for instance, HARRA runs: (1) 4.5 and 10.5 times faster than StringMap and R-Swoosh in iteratively linking 4,000 x 4,000 short records (i.e., one of the small test cases), and (2) 5.6 and 3.4 times faster than basic LSH and Multi-Probe LSH algorithms in iteratively linking 400,000 x 400,000 long records (i.e., the largest test case).
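To illustrate the blocking idea in isolation (a generic MinHash-based LSH sketch with assumed parameters, not HARRA or I-LSH themselves), records are hashed so that similar ones tend to collide in the same buckets and only co-bucketed pairs proceed to matching:

```python
import random
from collections import defaultdict

def minhash_signature(tokens, hash_seeds):
    """One MinHash value per seed; similar token sets get similar signatures."""
    return tuple(min(hash((seed, t)) for t in tokens) for seed in hash_seeds)

def lsh_blocks(records, bands=4, rows=2, seed=42):
    """Group record ids into candidate blocks via banded MinHash LSH."""
    rng = random.Random(seed)
    seeds = [rng.random() for _ in range(bands * rows)]
    buckets = defaultdict(set)
    for rid, tokens in records.items():
        sig = minhash_signature(tokens, seeds)
        for b in range(bands):
            band_key = (b, sig[b * rows:(b + 1) * rows])
            buckets[band_key].add(rid)
    return [ids for ids in buckets.values() if len(ids) > 1]

records = {1: {"john", "smith", "ny"}, 2: {"jon", "smith", "ny"}, 3: {"alice", "lee"}}
print(lsh_blocks(records))   # records 1 and 2 likely share a block; record 3 stays apart
```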

96 citations


Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work provides a formal semantics for the language, presents novel techniques for efficiently evaluating lineage queries, and shows that these strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.
Abstract: A key advantage of scientific workflow systems over traditional scripting approaches is their ability to automatically record data and process dependencies introduced during workflow runs. This information is often represented through provenance graphs, which can be used by scientists to better understand, reproduce, and verify scientific results. However, while most systems record and store data and process dependencies, few provide easy-to-use and efficient approaches for accessing and querying provenance information. Instead, users formulate provenance graph queries directly against physical data representations (e.g., relational, XML, or RDF), leading to queries that are difficult to express and expensive to evaluate. We address these problems through a high-level query language tailored for expressing provenance graph queries. The language is based on a general model of provenance supporting scientific workflows that process XML data and employ update semantics. Query constructs are provided for querying both structure and lineage information. Unlike other languages that return sets of nodes as answers, our query language is closed, i.e., answers to lineage queries are sets of lineage dependencies (edges) allowing answers to be further queried. We provide a formal semantics for the language and present novel techniques for efficiently evaluating lineage queries. Experimental results on real and synthetic provenance traces demonstrate that our lineage based optimizations outperform an in-memory and standard database implementation by orders of magnitude. We also show that our strategies are feasible and can significantly reduce both provenance storage size and query execution time when compared with standard approaches.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: The new concept of minimal indoor walking distance (MIWD) is proposed along with algorithms and data structures for distance computation and storage; the states of indoor moving objects are differentiated based on a positioning device deployment graph, these states are utilized in effective object indexing structures, and the uncertainty of object locations is captured.
Abstract: The availability of indoor positioning renders it possible to deploy location-based services in indoor spaces. Many such services will benefit from the efficient support for k nearest neighbor (kNN) queries over large populations of indoor moving objects. However, existing kNN techniques fall short in indoor spaces because these differ from Euclidean and spatial network spaces and because of the limited capabilities of indoor positioning technologies. To contend with indoor settings, we propose the new concept of minimal indoor walking distance (MIWD) along with algorithms and data structures for distance computing and storage; and we differentiate the states of indoor moving objects based on a positioning device deployment graph, utilize these states in effective object indexing structures, and capture the uncertainty of object locations. On these foundations, we study the probabilistic threshold kNN (PTkNN) query. Given a query location q and a probability threshold T, this query returns all subsets of k objects that have probability larger than T of containing the kNN query result of q. We propose a combination of three techniques for processing this query. The first uses the MIWD metric to prune objects that are too far away. The second uses fast probability estimates to prune unqualified objects and candidate result subsets. The third uses efficient probability evaluation for computing the final result on the remaining candidate subsets. An empirical study using both synthetic and real data shows that the techniques are efficient.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: A private filter is developed that determines the actual group nearest neighbor from the retrieved candidate answers without revealing user locations to any involved party, including the LSP.
Abstract: User privacy in location-based services has attracted great interest in the research community. We introduce a novel framework based on a decentralized architecture for privacy preserving group nearest neighbor queries. A group nearest neighbor (GNN) query returns the location of a meeting place that minimizes the aggregate distance from a spread out group of users; for example, a group of users can ask for a restaurant that minimizes the total travel distance from them. We identify the challenges in preserving user privacy for GNN queries and provide a comprehensive solution to this problem. In our approach, users provide their locations as regions instead of exact points to a location service provider (LSP) to preserve their privacy. The LSP returns a set of candidate answers that includes the actual group nearest neighbor. We develop a private filter that determines the actual group nearest neighbor from the retrieved candidate answers without revealing user locations to any involved party, including the LSP. We also propose an efficient algorithm to evaluate GNN queries with respect to the provided set of regions (the users' imprecise locations). An extensive experimental study shows the effectiveness of our proposed technique.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper proposes an algorithm named Hash Count to find ELCAs under the ELCA (Exclusive LCA) semantics, which was first proposed by Guo et al. and afterwards named by Xu and Papakonstantinou, and compares it with state-of-the-art algorithms.
Abstract: Keyword search is integrated in many applications on account of the convenience to convey users' query intention. Recently, answering keyword queries on XML data has drawn the attention of web and database communities, because the success of this research will relieve users from learning complex XML query languages, such as XPath/XQuery, and/or knowing the underlying schema of the queried XML data. As a result, information in XML data can be discovered much more easily. To model the result of answering keyword queries on XML data, many LCA (lowest common ancestor) based notions have been proposed. In this paper, we focus on the ELCA (Exclusive LCA) semantics, which was first proposed by Guo et al. and afterwards named by Xu and Papakonstantinou. We propose an algorithm named Hash Count to find ELCAs efficiently. Our analysis shows that the complexity of the Hash Count algorithm is O(kd|S1|), where k is the number of keywords, d is the depth of the queried XML document and |S1| is the frequency of the rarest keyword. This complexity is the best result known so far. We also evaluate the algorithm on a real DBLP dataset, and compare it with the state-of-the-art algorithms. The experimental results demonstrate the advantage of the Hash Count algorithm in practice.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper provides an approach to provenance querying that avoids joins over provenance logs by using information about the workflow definition to inform the construction of queries that directly target relevant lineage results, and provides fine-grained provenance querying, even for workflows that create and consume collections.
Abstract: The management and querying of workflow provenance data underpins a collection of activities, including the analysis of workflow results, and the debugging of workflows or services. Such activities require efficient evaluation of lineage queries over potentially complex and voluminous provenance logs. Naive implementations of lineage queries navigate provenance logs by joining tables that represent the flow of data between connected processors invoked from workflows. In this paper we provide an approach to provenance querying that: (i) avoids joins over provenance logs by using information about the workflow definition to inform the construction of queries that directly target relevant lineage results; (ii) provides fine-grained provenance querying, even for workflows that create and consume collections; and (iii) scales effectively to address complex workflows, workflows with large intermediate data sets, and queries over multiple workflows.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: Deduce is presented, which extends IBM's System S stream processing middleware with support for MapReduce by providing language and runtime support for easily specifying and embedding MapReduce jobs as elements of a larger data-flow.
Abstract: MapReduce and stream processing are two emerging, but different, paradigms for analyzing, processing and making sense of large volumes of modern day data. While MapReduce offers the capability to analyze several terabytes of stored data, stream processing solutions offer the ability to process, possibly, a few million updates every second. However, there is an increasing number of data processing applications which need a solution that effectively and efficiently combines the benefits of MapReduce and stream processing to address their data processing needs. For example, in the automated stock trading domain, applications usually require periodic analysis of large amounts of stored data to generate a model using MapReduce, which is then used to process a stream of incident updates using a stream processing system. This paper presents Deduce, which extends IBM's System S stream processing middleware with support for MapReduce by providing (1) language and runtime support for easily specifying and embedding MapReduce jobs as elements of a larger data-flow, (2) capability to describe reusable modules that can be used as map and reduce tasks, and (3) configuration parameters that can be tweaked to control and manage the usage of shared resources by the MapReduce and stream processing components. We describe the motivation for Deduce and the design and implementation of the MapReduce extensions for System S, and then present experimental results.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work proposes that data incomparability should be treated as another key factor in optimizing skyline computation, and identifies common modules shared by existing non-index skyline algorithms to develop a cost model that guides balanced pivot point selection.
Abstract: Skyline queries have gained a lot of attention for multi-criteria analysis in large-scale datasets. While existing skyline algorithms have focused mostly on exploiting data dominance to achieve efficiency, we propose that data incomparability should be treated as another key factor in optimizing skyline computation. Specifically, to optimize both factors, we first identify common modules shared by existing non-index skyline algorithms, and then analyze them to develop a cost model to guide a balanced pivot point selection. Based on the cost model, we lastly implement our balanced pivot selection in two algorithms, BSkyTree-S and BSkyTree-P, treating both dominance and incomparability as key factors. Our experimental results demonstrate that proposed algorithms outperform state-of-the-art skyline algorithms up to two orders of magnitude.
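For reference, the two relations the paper balances, dominance and incomparability, can be stated directly (a minimal Python sketch of the definitions plus a naive skyline, not the BSkyTree-S or BSkyTree-P algorithms):

```python
def dominates(p, q):
    """p dominates q if p is at least as good on every dimension and
    strictly better on at least one (here: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def incomparable(p, q):
    """Neither point dominates the other; comparing them prunes nothing,
    which is why incomparability is a cost factor worth optimizing."""
    return not dominates(p, q) and not dominates(q, p)

def skyline(points):
    """Naive O(n^2) skyline, kept only to make the definitions concrete."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

print(skyline([(1, 9), (3, 3), (9, 1), (5, 5)]))   # (5, 5) is dominated by (3, 3)
```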

Proceedings ArticleDOI
22 Mar 2010
TL;DR: First, it is shown that optimal l-diverse generalization is NP-hard even when there are only 3 distinct sensitive values in the microdata, and an (l · d)-approximation algorithm is developed, which is the first known algorithm with a non-trivial bound on information loss.
Abstract: The existing solutions to privacy preserving publication can be classified into the theoretical and heuristic categories. The former guarantees provably low information loss, whereas the latter incurs gigantic loss in the worst case, but is shown empirically to perform well on many real inputs. While numerous heuristic algorithms have been developed to satisfy advanced privacy principles such as l-diversity, t-closeness, etc., the theoretical category is currently limited to k-anonymity, which is the earliest principle known to have severe vulnerability to privacy attacks. Motivated by this, we present the first theoretical study on l-diversity, a popular principle that is widely adopted in the literature. First, we show that optimal l-diverse generalization is NP-hard even when there are only 3 distinct sensitive values in the microdata. Then, an (l · d)-approximation algorithm is developed, where d is the dimensionality of the underlying dataset. This is the first known algorithm with a non-trivial bound on information loss. Extensive experiments with real datasets validate the effectiveness and efficiency of the proposed solution.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work proposes an estimation-based approach to compute the promising result types for a keyword query, which can help a user quickly narrow down to her specific information need, and designs new algorithms based on the indexes to be built.
Abstract: Although keyword queries enable inexperienced users to easily search XML databases with no specific knowledge of complex structured query languages or XML data schemas, the ambiguity of a keyword query may result in a great number of results that may be classified into different types. For users, each result type implies a possible search intention. To improve the performance of keyword queries, it is desirable to efficiently work out the most relevant result type from the data to be retrieved. Several recent research works have focused on this interesting problem by using data schema information or pure IR-style statistical information. However, this problem is still open due to some requirements. (1) The data to be retrieved may not contain schema information; (2) Relevant result types should be efficiently computed before keyword query evaluation; (3) The correlation between a result type and a keyword query should be measured by analyzing the distribution of relevant values and structures within the data. To our knowledge, none of the existing work satisfies the above three requirements together. To address the problem, we propose an estimation-based approach to compute the promising result types for a keyword query, which can help a user quickly narrow down to her specific information need. To speed up the computation, we design new algorithms based on the indexes to be built. Finally, we present a set of experimental results that evaluate the proposed algorithms and show the potential of this work.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: B-Fabric is a system infrastructure supporting on-the-fly coupling of user applications, and thus serving as an extensible platform for fast-paced, cutting-edge, collaborative research.
Abstract: This paper demonstrates B-Fabric, an all-in-one solution for two major purposes in life sciences. On the one hand, it is a system for the integrated management of experimental data and scientific annotations. On the other hand, it is a system infrastructure supporting on-the-fly coupling of user applications, and thus serving as an extensible platform for fast-paced, cutting-edge, collaborative research.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: Analytical and experimental evaluations convey the scalability of P3Q for top-k query processing, and show that on a 10,000-user Delicious trace, with little storage at each user, the queries are accurately computed within reasonable time and bandwidth consumption.
Abstract: This paper presents P3Q, a fully decentralized gossip-based protocol to personalize query processing in social tagging systems. P3Q dynamically associates each user with social acquaintances sharing similar tagging behaviours. Queries are gossiped among such acquaintances, computed on the fly in a collaborative, yet partitioned manner, and results are iteratively refined and returned to the querier. Analytical and experimental evaluations convey the scalability of P3Q for top-k query processing. More specifically, we show that on a 10,000-user Delicious trace, with little storage at each user, the queries are accurately computed within reasonable time and bandwidth consumption. We also report on the inherent ability of P3Q to cope with users updating profiles and departing.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: An approach for incrementally annotating schema mappings using feedback obtained from end users is explored, together with a method for selecting, from the set of candidate mappings, those to be used for query evaluation, considering user requirements in terms of precision and recall.
Abstract: The specification of schema mappings has proved to be time and resource consuming, and has been recognized as a critical bottleneck to the large scale deployment of data integration systems. In an attempt to address this issue, dataspaces have been proposed as a data management abstraction that aims to reduce the up-front cost required to setup a data integration system by gradually specifying schema mappings through interaction with end users in a pay-as-you-go fashion. As a step in this direction, we explore an approach for incrementally annotating schema mappings using feedback obtained from end users. In doing so, we do not expect users to examine mapping specifications; rather, they comment on results to queries evaluated using the mappings. Using annotations computed on the basis of user feedback, we present a method for selecting from the set of candidate mappings, those to be used for query evaluation considering user requirements in terms of precision and recall. In doing so, we cast mapping selection as an optimization problem. Mapping annotations may reveal that the quality of schema mappings is poor. We also show how feedback can be used to support the derivation of better quality mappings from existing mappings through refinement. An evolutionary algorithm is used to efficiently and effectively explore the large space of mappings that can be obtained through refinement. The results of evaluation exercises show the effectiveness of our solution for annotating, selecting and refining schema mappings.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper introduces a framework for efficient processing of flexible pattern queries that includes an underlying indexing structure and algorithms for query processing using different evaluation strategies; an extensive performance evaluation shows significant performance improvement compared to existing solutions.
Abstract: The wide adaptation of GPS and cellular technologies has created many applications that collect and maintain large repositories of data in the form of trajectories. Previous work on querying/analyzing trajectorial data typically falls into methods that either address spatial range and NN queries, or, similarity based queries. Nevertheless, trajectories are complex objects whose behavior over time and space can be better captured as a sequence of interesting events. We thus facilitate the use of motion "pattern" queries which allow the user to select trajectories based on specific motion patterns. Such patterns are described as regular expressions over a spatial alphabet that can be implicitly or explicitly anchored to the time domain. Moreover, we are interested in "flexible" patterns that allow the user to include "variables" in the query pattern and thus greatly increase its expressive power. In this paper we introduce a framework for efficient processing of flexible pattern queries. The framework includes an underlying indexing structure and algorithms for query processing using different evaluation strategies. An extensive performance evaluation of this framework shows significant performance improvement when compared to existing solutions.
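To make the pattern idea concrete (a toy Python sketch over an assumed spatial alphabet, not the paper's index or evaluation strategies), a trajectory reduced to the sequence of regions it crosses can be matched against a regular expression, with a query variable expressed as a backreference:

```python
import re

# A trajectory reduced to the sequence of region labels it crosses
# (assumed spatial alphabet A-E; in practice regions come from a spatial index).
trajectory = "AABBCCE"

# "Start in A, later cross some region @x at least twice, and end in E";
# the variable @x is expressed as a named group plus a backreference.
pattern = re.compile(r"^A.*(?P<x>[B-D]).*(?P=x).*E$")

m = pattern.match(trajectory)
if m:
    print("match, variable @x bound to region", m.group("x"))   # binds @x to C here
```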

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper presents an algorithm for processing preference queries that uses the preferential order between keywords to direct the joining of relevant tuples from multiple relations, and shows how to reduce the complexity of this algorithm by sharing computational steps.
Abstract: Keyword-based search in relational databases allows users to discover relevant information without knowing the database schema or using complicated queries. However, such searches may return an overwhelming number of results, often loosely related to the user intent. In this paper, we propose personalizing keyword database search by utilizing user preferences. Query results are ranked based on both their relevance to the query and their preference degree for the user. To further increase the quality of results, we consider two new metrics that evaluate the goodness of the result as a set, namely coverage of many user interests and content diversity. We present an algorithm for processing preference queries that uses the preferential order between keywords to direct the joining of relevant tuples from multiple relations. We then show how to reduce the complexity of this algorithm by sharing computational steps. Finally, we report evaluation results of the efficiency and effectiveness of our approach.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: A novel rewrite-based optimization technique that is generally applicable to different types of matching processes is introduced, together with filter-based rewrite rules similar to predicate push-down in query optimization.
Abstract: A recurring manual task in data integration, ontology alignment or model management is finding mappings between complex meta data structures. In order to reduce the manual effort, many matching algorithms for semi-automatically computing mappings were introduced. Unfortunately, current matching systems severely lack performance when matching large schemas. Recently, some systems tried to tackle the performance problem within individual matching approaches. However, none of them developed solutions on the level of matching processes. In this paper we introduce a novel rewrite-based optimization technique that is generally applicable to different types of matching processes. We introduce filter-based rewrite rules similar to predicate push-down in query optimization. In addition we introduce a modeling tool and recommendation system for rewriting matching processes. Our evaluation on matching large web service message types shows significant performance improvements without losing the quality of automatically computed results.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: A data-centric view of pervasive environments is defined: the classical notion of database is extended to come up with a broader notion, defined as a relational pervasive environment, integrating data, streams and active/passive services; the so-called Serena algebra is proposed with operators to homogeneously handle data and services.
Abstract: Querying non-conventional data is recognized as a major issue in new environments and applications such as those occurring in pervasive computing. A key issue is the ability to query data, streams and services in a declarative way. Our overall objective is to make the development of pervasive applications easier through database principles. In this paper, through the notion of virtual attributes and binding patterns, we define a data-centric view of pervasive environments: the classical notion of database is extended to come up with a broader notion, defined as relational pervasive environment, integrating data, streams and active/passive services. Then, the so-called Serena algebra is proposed with operators to homogeneously handle data and services. Moreover, the notion of stream can also be smoothly integrated into this algebra. A prototype of Pervasive Environment Management System has been implemented on which first experiments have been conducted to validate our approach.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work surveys techniques to optimize query processing on the Deep Web, in a setting where data are represented in the relational model, and illustrates optimizations both at query plan generation time and at runtime, highlighting the role of integrity constraints.
Abstract: Data stored outside Web pages and accessible from the Web, typically through HTML forms, constitute the so-called Deep Web. Such data are of great value, but difficult to query and search. We survey techniques to optimize query processing on the Deep Web, in a setting where data are represented in the relational model. We illustrate optimizations both at query plan generation time and at runtime, highlighting the role of integrity constraints. We discuss several prototype systems that address the query processing problem.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper proposes two schemes, the first in the literature, for authenticating DAGs and directed cyclic graphs without leakage, based on the structure of the graph as defined by depth-first graph traversals and on aggregate signatures.
Abstract: Secure data sharing in multi-party environments requires that both authenticity and confidentiality of the data be assured. Digital signature schemes are commonly employed for authentication of data. However, no such technique exists for directed graphs, even though such graphs are one of the most widely used data organization structures. Existing schemes for DAGs are authenticity-preserving but not confidentiality-preserving, and lead to leakage of sensitive information during authentication. In this paper, we propose two schemes for authenticating DAGs and directed cyclic graphs without leaking; these are the first such schemes in the literature. They are based on the structure of the graph as defined by depth-first graph traversals and on aggregate signatures. Graphs are structurally different from trees in that they have four types of edges: tree, forward, cross, and back-edges in a depth-first traversal. The fact that an edge is a forward, cross or a back-edge conveys information that is sensitive in several contexts. Moreover, back-edges pose a more difficult problem than the one posed by forward and cross-edges, primarily because back-edges add bidirectional properties to graphs. We prove that the proposed technique is both authenticity-preserving and non-leaking. While providing such strong security properties, our scheme is also efficient, as supported by the performance results.

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper proposes an approach that can translate the evaluation of any query into extensional operators, followed by some post-processing that requires probabilistic inference, and uses characteristics of the data to adapt smoothly between the two evaluation strategies.
Abstract: There are two broad approaches to query evaluation over probabilistic databases: (1) Intensional Methods proceed by manipulating expressions over symbolic events associated with uncertain tuples. This approach is very general and can be applied to any query, but requires an expensive postprocessing phase, which involves some general-purpose probabilistic inference. (2) Extensional Methods, on the other hand, evaluate the query by translating operations over symbolic events to a query plan; extensional methods scale well, but they are restricted to safe queries. In this paper, we bridge this gap by proposing an approach that can translate the evaluation of any query into extensional operators, followed by some post-processing that requires probabilistic inference. Our approach uses characteristics of the data to adapt smoothly between the two evaluation strategies. If the query is safe or becomes safe because of the data instance, then the evaluation is completely extensional and inside the database. If the query/data combination departs from the ideal setting of a safe query, then some intensional processing is performed, whose complexity depends only on the distance from the ideal setting.
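For orientation, extensional operators manipulate probabilities directly using the usual rules for independent tuple events (generic textbook notation, not necessarily the operators or notation used in the paper):

```latex
% Independent join: the joined tuple exists iff both inputs do.
P(t_1 \bowtie t_2) = P(t_1)\,P(t_2)
% Independent project: t appears iff at least one of the tuples
% t_1,\dots,t_n that project onto it appears.
P(t) = 1 - \prod_{i=1}^{n}\bigl(1 - P(t_i)\bigr)
```

When these independence assumptions fail, which is the unsafe case, the remaining correlations are what the intensional post-processing step must resolve.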

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This paper presents an experiment on a real world scenario that demonstrates the strong analytical power of massive, raw trajectory data made available as a by-product of telecom services, in unveiling the complexity of urban mobility.
Abstract: The growing availability of mobile devices produces an enormous quantity of personal tracks which calls for advanced analysis methods capable of extracting knowledge out of massive trajectories datasets. In this paper we present an experiment on a real world scenario that demonstrates the strong analytical power of massive, raw trajectory data made available as a by-product of telecom services, in unveiling the complexity of urban mobility. The experiment has been made possible by the GeoPKDD system, an integrated platform for complex analysis of mobility data. The system combines spatio-temporal querying capabilities with data mining and semantic technologies, thus providing a full support for the Mobility Knowledge Discovery process.