Showing papers presented at "Extending Database Technology in 2009"

PDF

Open Access

Proceedings Article•DOI•

RankClus: integrating clustering with ranking for heterogeneous information network analysis

[...]

Yizhou Sun¹, Jiawei Han¹, Peixiang Zhao¹, Zhijun Yin¹, Hong Cheng², Tianyi Wu¹ - Show less +2 more•Institutions (2)

University of Illinois at Urbana–Champaign¹, The Chinese University of Hong Kong²

24 Mar 2009

TL;DR: This paper addresses the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multi-typed information network, and proposes a novel clustering framework called RankClus that directly generates clusters integrated with ranking.

...read moreread less

Abstract: As information networks become ubiquitous, extracting knowledge from information networks has become an important task. Both ranking and clustering can provide overall views on information network data, and each has been a hot topic by itself. However, ranking objects globally without considering which clusters they belong to often leads to dumb results, e.g., ranking database and computer architecture conferences together may not make much sense. Similarly, clustering a huge number of objects (e.g., thousands of authors) in one huge cluster without distinction is dull as well.In this paper, we address the problem of generating clusters for a specified type of objects, as well as ranking information for all types of objects based on these clusters in a multi-typed (i.e., heterogeneous) information network. A novel clustering framework called RankClus is proposed that directly generates clusters integrated with ranking. Based on initial K clusters, ranking is applied separately, which serves as a good measure for each cluster. Then, we use a mixture model to decompose each object into a K-dimensional vector, where each dimension is a component coefficient with respect to a cluster, which is measured by rank distribution. Objects then are reassigned to the nearest cluster under the new measure space to improve clustering. As a result, quality of clustering and ranking are mutually enhanced, which means that the clusters are getting more accurate and the ranking is getting more meaningful. Such a progressive refinement process iterates until little change can be made. Our experiment results show that RankClus can generate more accurate clusters and in a more efficient way than the state-of-the-art link-based clustering methods. Moreover, the clustering results with ranks can provide more informative views of data compared with traditional clustering.

...read moreread less

399 citations

Proceedings Article•DOI•

GADDI: distance index based subgraph matching in biological networks

[...]

Shijie Zhang¹, Shirong Li¹, Jiong Yang¹•Institutions (1)

Case Western Reserve University¹

24 Mar 2009

TL;DR: A novel distance measurement is proposed which reintroduces the idea of frequent substructures in a single large graph in a given large graph of thousands of vertices and the novel structure distance based approach (GADDI) is devised to efficiently find matches of the query graph.

...read moreread less

Abstract: Currently, a huge amount of biological data can be naturally represented by graphs, e.g., protein interaction networks, gene regulatory networks, etc. The need for indexing large graphs is an urgent research problem of great practical importance. The main challenge is size. Each graph may contain thousands (or more) vertices. Most of the previous work focuses on indexing a set of small or medium sized database graphs (with only tens of vertices) and finding whether a query graph occurs in any of these. In this paper, we are interested in finding all the matches of a query graph in a given large graph of thousands of vertices, which is a very important task in many biological applications. This increases the complexity significantly. We propose a novel distance measurement which reintroduces the idea of frequent substructures in a single large graph. We devise the novel structure distance based approach (GADDI) to efficiently find matches of the query graph. GADDI is further optimized by the use of a dynamic matching scheme to minimize redundant calculations. Last but not least, a number of real and synthetic data sets are used to evaluate the efficiency and scalability of our proposed method.

...read moreread less

243 citations

Proceedings Article•DOI•

Shore-MT: a scalable storage manager for the multicore era

[...]

Ryan Johnson¹, Ippokratis Pandis², Nikos Hardavellas², Anastasia Ailamaki¹, Babak Falsafi¹ - Show less +1 more•Institutions (2)

École Polytechnique Fédérale de Lausanne¹, Carnegie Mellon University²

24 Mar 2009

TL;DR: Shore-MT is presented, a multithreaded and highly scalable version of Shore which was developed by identifying and successively removing internal bottlenecks, and exhibits superior scalability and 2--4 times higher absolute throughput than its peers.

...read moreread less

Abstract: Database storage managers have long been able to efficiently handle multiple concurrent requests. Until recently, however, a computer contained only a few single-core CPUs, and therefore only a few transactions could simultaneously access the storage manager's internal structures. This allowed storage managers to use non-scalable approaches without any penalty. With the arrival of multicore chips, however, this situation is rapidly changing. More and more threads can run in parallel, stressing the internal scalability of the storage manager. Systems optimized for high performance at a limited number of cores are not assured similarly high performance at a higher core count, because unanticipated scalability obstacles arise.We benchmark four popular open-source storage managers (Shore, BerkeleyDB, MySQL, and PostgreSQL) on a modern multicore machine, and find that they all suffer in terms of scalability. We briefly examine the bottlenecks in the various storage engines. We then present Shore-MT, a multithreaded and highly scalable version of Shore which we developed by identifying and successively removing internal bottlenecks. When compared to other DBMS, Shore-MT exhibits superior scalability and 2--4 times higher absolute throughput than its peers. We also show that designers should favor scalability to single-thread performance, and highlight important principles for writing scalable storage engines, illustrated with real examples from the development of Shore-MT.

...read moreread less

242 citations

Proceedings Article•DOI•

Data integration flows for business intelligence

[...]

Umeshwar Dayal¹, Malu Castellanos¹, Alkis Simitsis¹, Kevin Wilkinson¹•Institutions (1)

Hewlett-Packard¹

24 Mar 2009

TL;DR: The requirements for data integration flows in this next generation of operational BI system are described, the limitations of current technologies, the research challenges in meeting these requirements, and a framework for addressing these challenges are described.

...read moreread less

Abstract: Business Intelligence (BI) refers to technologies, tools, and practices for collecting, integrating, analyzing, and presenting large volumes of information to enable better decision making. Today's BI architecture typically consists of a data warehouse (or one or more data marts), which consolidates data from several operational databases, and serves a variety of front-end querying, reporting, and analytic tools. The back-end of the architecture is a data integration pipeline for populating the data warehouse by extracting data from distributed and usually heterogeneous operational sources; cleansing, integrating and transforming the data; and loading it into the data warehouse. Since BI systems have been used primarily for off-line, strategic decision making, the traditional data integration pipeline is a oneway, batch process, usually implemented by extract-transform-load (ETL) tools. The design and implementation of the ETL pipeline is largely a labor-intensive activity, and typically consumes a large fraction of the effort in data warehousing projects. Increasingly, as enterprises become more automated, data-driven, and real-time, the BI architecture is evolving to support operational decision making. This imposes additional requirements and tradeoffs, resulting in even more complexity in the design of data integration flows. These include reducing the latency so that near real-time data can be delivered to the data warehouse, extracting information from a wider variety of data sources, extending the rigidly serial ETL pipeline to more general data flows, and considering alternative physical implementations. We describe the requirements for data integration flows in this next generation of operational BI system, the limitations of current technologies, the research challenges in meeting these requirements, and a framework for addressing these challenges. The goal is to facilitate the design and implementation of optimal flows to meet business requirements.

...read moreread less

201 citations

Proceedings Article•DOI•

It takes variety to make a world: diversification in recommender systems

[...]

Cong Yu¹, Laks V. S. Lakshmanan², Sihem Amer-Yahia¹•Institutions (2)

Yahoo!¹, University of British Columbia²

24 Mar 2009

TL;DR: This paper develops efficient diversification algorithms built upon the notion of explanation-based diversity and demonstrates their efficiency and effectiveness in diversification on two real life data sets: del.icio.us and Yahoo! Movies.

...read moreread less

Abstract: Recommendations in collaborative tagging sites such as del.icio.us and Yahoo! Movies, are becoming increasingly important, due to the proliferation of general queries on those sites and the ineffectiveness of the traditional search paradigm to address those queries. Regardless of the underlying recommendation strategy, item-based or user-based, one of the key concerns in producing recommendations, is over-specialization, which results in returning items that are too homogeneous. Traditional solutions rely on post-processing returned items to identify those which differ in their attribute values (e.g., genre and actors for movies). Such approaches are not always applicable when intrinsic attributes are not available (e.g., URLs in del.icio.us). In a recent paper [20], we introduced the notion of explanation-based diversity and formalized the diversification problem as a compromise between accuracy and diversity. In this paper, we develop efficient diversification algorithms built upon this notion. The algorithms explore compromises between accuracy and diversity. We demonstrate their efficiency and effectiveness in diversification on two real life data sets: del.icio.us and Yahoo! Movies.

...read moreread less

196 citations

Proceedings Article•DOI•

Anonymizing moving objects: how to hide a MOB in a crowd?

[...]

Roman Yarovoy¹, Francesco Bonchi², Laks V. S. Lakshmanan¹, Wendy Hui Wang³•Institutions (3)

University of British Columbia¹, Yahoo!², Stevens Institute of Technology³

24 Mar 2009

TL;DR: This paper argues that in MOD, there does not exist a fixed set of quasi-identifier (QID) attributes for all the MOBs, and proposes two approaches, namely extreme-union and symmetric anonymization, to build anonymization groups that provably satisfy the proposed k-anonymity requirement, as well as yield low information loss.

...read moreread less

Abstract: Moving object databases (MOD) have gained much interest in recent years due to the advances in mobile communications and positioning technologies. Study of MOD can reveal useful information (e.g., traffic patterns and congestion trends) that can be used in applications for the common benefit. In order to mine and/or analyze the data, MOD must be published, which can pose a threat to the location privacy of a user. Indeed, based on prior knowledge of a user's location at several time points, an attacker can potentially associate that user to a specific moving object (MOB) in the published database and learn her position information at other time points.In this paper, we study the problem of privacy-preserving publishing of moving object database. Unlike in microdata, we argue that in MOD, there does not exist a fixed set of quasi-identifier (QID) attributes for all the MOBs. Consequently the anonymization groups of MOBs (i.e., the sets of other MOBs within which to hide) may not be disjoint. Thus, there may exist MOBs that can be identified explicitly by combining different anonymization groups. We illustrate the pitfalls of simple adaptations of classical k-anonymity and develop a notion which we prove is robust against privacy attacks. We propose two approaches, namely extreme-union and symmetric anonymization, to build anonymization groups that provably satisfy our proposed k-anonymity requirement, as well as yield low information loss. We ran an extensive set of experiments on large real-world and synthetic datasets of vehicular traffic. Our results demonstrate the effectiveness of our approach.

...read moreread less

159 citations

Proceedings Article•DOI•

Zerber+R: top-k retrieval from a confidential index

[...]

Sergej Zerr¹, Daniel Olmedilla¹, Wolfgang Nejdl¹, Wolf Siberski¹•Institutions (1)

Leibniz University of Hanover¹

24 Mar 2009

TL;DR: This paper presents Zerber+R -- a ranking model which allows for privacy-preserving top-k retrieval from an outsourced inverted index and proposes a relevance score transformation function which makes relevance scores of different terms indistinguishable, such that even if stored on an untrusted server they do not reveal information about the indexed data.

...read moreread less

Abstract: Privacy-preserving document exchange among collaboration groups in an enterprise as well as across enterprises requires techniques for sharing and search of access-controlled information through largely untrusted servers. In these settings search systems need to provide confidentiality guarantees for shared information while offering IR properties comparable to the ordinary search engines. Top-k is a standard IR technique which enables fast query execution on very large indexes and makes systems highly scalable. However, indexing access-controlled information for top-k retrieval is a challenging task due to the sensitivity of the term statistics used for ranking.In this paper we present Zerber+R -- a ranking model which allows for privacy-preserving top-k retrieval from an outsourced inverted index. We propose a relevance score transformation function which makes relevance scores of different terms indistinguishable, such that even if stored on an untrusted server they do not reveal information about the indexed data. Experiments on two real-world data sets show that Zerber+R makes economical usage of bandwidth and offers retrieval properties comparable with an ordinary inverted index.

...read moreread less

148 citations

Proceedings Article•DOI•

Neighbor-based pattern detection for windows over streaming data

[...]

Di Yang¹, Elke A. Rundensteiner¹, Matthew O. Ward¹•Institutions (1)

Worcester Polytechnic Institute¹

24 Mar 2009

TL;DR: This work develops the first solution for incremental detection of neighbor-based patterns specific to sliding window scenarios, exploiting the "predictability" property of sliding windows to elegantly discount the effect of expiring objects on the remaining pattern structures.

...read moreread less

Abstract: The discovery of complex patterns such as clusters, outliers, and associations from huge volumes of streaming data has been recognized as critical for many domains. However, pattern detection with sliding window semantics, as required by applications ranging from stock market analysis to moving object tracking remains largely unexplored. Applying static pattern detection algorithms from scratch to every window is prohibitively expensive due to their high algorithmic complexity. This work tackles this problem by developing the first solution for incremental detection of neighbor-based patterns specific to sliding window scenarios. The specific pattern types covered in this work include density-based clusters and distance-based outliers. Incremental pattern computation in highly dynamic streaming environments is challenging, because purging a large amount of to-be-expired data from previously formed patterns may cause complex pattern changes including migration, splitting, merging and termination of these patterns. Previous incremental neighbor-based pattern detection algorithms, which were typically not designed to handle sliding windows, such as incremental DBSCAN, are not able to solve this problem efficiently in terms of both CPU and memory consumption. To overcome this, we exploit the "predictability" property of sliding windows to elegantly discount the effect of expiring objects on the remaining pattern structures. Our solution achieves minimal CPU utilization, while still keeping the memory utilization linear in the number of objects in the window. Our comprehensive experimental study, using both synthetic as well as real data from domains of stock trades and moving object monitoring, demonstrates superiority of our proposed strategies over alternate methods in both CPU and memory utilization.

...read moreread less

122 citations

Proceedings Article•DOI•

Interactive query refinement

[...]

Chaitanya Mishra¹, Nick Koudas¹•Institutions (1)

University of Toronto¹

24 Mar 2009

TL;DR: This work formalizes the problem of query refinement and proposes a framework to support it in a database system, and introduces an interactive model of refinement that incorporates user feedback to best capture user preferences.

...read moreread less

Abstract: We investigate the problem of refining SQL queries to satisfy cardinality constraints on the query result. This has applications to the many/few answers problems often faced by database users. We formalize the problem of query refinement and propose a framework to support it in a database system. We introduce an interactive model of refinement that incorporates user feedback to best capture user preferences. Our techniques are designed to handle queries having range and equality predicates on numerical and categorical attributes. We present an experimental evaluation of our framework implemented in an open source data manager and demonstrate the feasibility and practical utility of our approach.

...read moreread less

104 citations

Proceedings Article•DOI•

Data clouds: summarizing keyword search results over structured data

[...]

Georgia Koutrika¹, Zahra Mohammadi Zadeh¹, Hector Garcia-Molina¹•Institutions (1)

Stanford University¹

24 Mar 2009

TL;DR: This paper proposes coupling the flexibility of keyword searches over structured data with the summarization and navigation capabilities of tag clouds to help users access a database and presents a system that offers a unified search and browse interface to a course database.

...read moreread less

Abstract: Keyword searches are attractive because they facilitate users searching structured databases. On the other hand, tag clouds are popular for navigation and visualization purposes over unstructured data because they can highlight the most significant concepts and hidden relationships in the underlying content dynamically. In this paper, we propose coupling the flexibility of keyword searches over structured data with the summarization and navigation capabilities of tag clouds to help users access a database. We propose using clouds over structured data (data clouds) to summarize the results of keyword searches over structured data and to guide users to refine their searches. The cloud presents the most significant words associated with the search results. Our keyword search model allows searching for entities than can span multiple tables in the database rather than just tuples, as existing keyword searches over databases do. We present several methods to compute the scores both for the entities and for the terms in the search results. We describe algorithms for keyword searches with data clouds and we present our system, CourseCloud, that offers a unified search and browse interface to a course database. We present experimental results showing (a) the appropriateness of the methods used for scoring terms, (b) the performance of the proposed algorithms, and (c) the effectiveness of CourseCloud compared to typical search and browse interfaces to a course database.

...read moreread less

98 citations

Proceedings Article•DOI•

[...]

Parisa Haghani¹, Sebastian Michel¹, Karl Aberer¹•Institutions (1)

École Polytechnique Fédérale de Lausanne¹

24 Mar 2009

TL;DR: This paper considers distributed K-Nearest Neighbor (KNN) search and range query processing in high dimensional data and shows how to leverage the linearly aligned data for efficient KNN search and how to efficiently process range queries which is not possible in existing LSH schemes.

...read moreread less

Abstract: In this paper we consider distributed K-Nearest Neighbor (KNN) search and range query processing in high dimensional data. Our approach is based on Locality Sensitive Hashing (LSH) which has proven very efficient in answering KNN queries in centralized settings. We consider mappings from the multi-dimensional LSH bucket space to the linearly ordered set of peers that jointly maintain the indexed data and derive requirements to achieve high quality search results and limit the number of network accesses. We put forward two such mappings that come with these salient properties: being locality preserving so that buckets likely to hold similar data are stored on the same or neighboring peers and having a predictable output distribution to ensure fair load balancing. We show how to leverage the linearly aligned data for efficient KNN search and how to efficiently process range queries which is, to the best of our knowledge, not possible in existing LSH schemes. We show by comprehensive performance evaluations using real world data that our approach brings major performance and accuracy gains compared to state-of-the-art.

...read moreread less

Proceedings Article•DOI•

Continuous privacy preserving publishing of data streams

[...]

Bin Zhou¹, Yi Han², Jian Pei¹, Bin Jiang¹, Yufei Tao³, Yan Jia² - Show less +2 more•Institutions (3)

Simon Fraser University¹, National University of Defense Technology², The Chinese University of Hong Kong³

24 Mar 2009

TL;DR: This paper develops a novel approach which considers both the distribution of the data entries to be published and the statistical distribution ofthe data stream to solve an emerging problem of continuous privacy preserving publishing of data streams which cannot be solved by any straightforward extensions of the existing privacy preserving Publishing methods on static data.

...read moreread less

Abstract: Recently, privacy preserving data publishing has received a lot of attention in both research and applications. Most of the previous studies, however, focus on static data sets. In this paper, we study an emerging problem of continuous privacy preserving publishing of data streams which cannot be solved by any straightforward extensions of the existing privacy preserving publishing methods on static data. To tackle the problem, we develop a novel approach which considers both the distribution of the data entries to be published and the statistical distribution of the data stream. An extensive performance study using both real data sets and synthetic data sets verifies the effectiveness and the efficiency of our methods.

...read moreread less

Proceedings Article•DOI•

Evaluating probability threshold k-nearest-neighbor queries over uncertain data

[...]

Reynold Cheng¹, Lei Chen², Jinchuan Chen³, Xike Xie¹•Institutions (3)

University of Hong Kong¹, Hong Kong University of Science and Technology², Hong Kong Polytechnic University³

24 Mar 2009

TL;DR: The Probabilistic Threshold k-Nearest-Neighbor Query (T-k-PNN), which returns sets of k objects that satisfy the query with probabilities higher than some threshold T, and can be applied to uncertain data with arbitrary probability density functions.

...read moreread less

Abstract: In emerging applications such as location-based services, sensor monitoring and biological management systems, the values of the database items are naturally imprecise. For these uncertain databases, an important query is the Probabilistic k-Nearest-Neighbor Query (k-PNN), which computes the probabilities of sets of k objects for being the closest to a given query point. The evaluation of this query can be both computationally- and I/O-expensive, since there is an exponentially large number of k object-sets, and numerical integration is required. Often a user may not be concerned about the exact probability values. For example, he may only need answers that have sufficiently high confidence. We thus propose the Probabilistic Threshold k-Nearest-Neighbor Query (T-k-PNN), which returns sets of k objects that satisfy the query with probabilities higher than some threshold T. Three steps are proposed to handle this query efficiently. In the first stage, objects that cannot constitute an answer are filtered with the aid of a spatial index. The second step, called probabilistic candidate selection, significantly prunes a number of candidate sets to be examined. The remaining sets are sent for verification, which derives the lower and upper bounds of answer probabilities, so that a candidate set can be quickly decided on whether it should be included in the answer. We also examine spatially-efficient data structures that support these methods. Our solution can be applied to uncertain data with arbitrary probability density functions. We have also performed extensive experiments to examine the effectiveness of our methods.

...read moreread less

Proceedings Article•DOI•

On-line exact shortest distance query processing

[...]

Jiefeng Cheng¹, Jeffrey Xu Yu¹•Institutions (1)

The Chinese University of Hong Kong¹

24 Mar 2009

TL;DR: This work focuses on fast computing distance-aware 2-hop covers, which can encode the all-pairs shortest paths of a graph in O(|V|·|E|1/2) space and exploits strongly connected components collapsing and graph partitioning to gain speed, while it can overcome the challenges in correctly retaining node distance information.

...read moreread less

Abstract: Shortest-path query processing not only serves as a long established routine for numerous applications in the past but also is of increasing popularity to support novel graph applications in very large databases nowadays. For a large graph, there is the new scenario to query intensively against arbitrary nodes, asking to quickly return node distance or even shortest paths. And traditional main memory algorithms and shortest paths materialization become inadequate. We are interested in graph labelings to encode the underlying graphs and assign labels to nodes to support efficient query processing. Surprisingly, the existing work of this category mainly emphasizes on reachability query processing, while no sufficient effort has been given to distance labelings to support querying exact shortest distances between nodes. Distance labelings must be developed on the graph in whole to correctly retain node distance information. It makes many existing methods to be inapplicable. We focus on fast computing distance-aware 2-hop covers, which can encode the all-pairs shortest paths of a graph in O(|V|·|E|1/2) space. Our approach exploits strongly connected components collapsing and graph partitioning to gain speed, while it can overcome the challenges in correctly retaining node distance information and appropriately encoding all-pairs shortest paths with small overhead. Furthermore, our approach avoids pre-computing all-pairs shortest paths, which can be prohibitive over large graphs. We conducted extensive performance studies, and confirm the efficiency of our proposed new approaches.

...read moreread less

Proceedings Article•DOI•

Continuous probabilistic nearest-neighbor queries for uncertain trajectories

[...]

Goce Trajcevski¹, Roberto Tamassia², Hui Ding¹, Peter Scheuermann¹, Isabel F. Cruz³ - Show less +1 more•Institutions (3)

Northwestern University¹, Brown University², University of Illinois at Chicago³

24 Mar 2009

TL;DR: This work formalizes the impact of uncertainty on the answers to the continuous probabilistic NN-queries, provides a compact structure for their representation and efficient algorithms for constructing that structure.

...read moreread less

Abstract: This work addresses the problem of processing continuous nearest neighbor (NN) queries for moving objects trajectories when the exact position of a given object at a particular time instant is not known, but is bounded by an uncertainty region. As has already been observed in the literature, the answers to continuous NN-queries in spatio-temporal settings are time parameterized in the sense that the objects in the answer vary over time. Incorporating uncertainty in the model yields additional attributes that affect the semantics of the answer to this type of queries. In this work, we formalize the impact of uncertainty on the answers to the continuous probabilistic NN-queries, provide a compact structure for their representation and efficient algorithms for constructing that structure. We also identify syntactic constructs for several qualitative variants of continuous probabilistic NN-queries for uncertain trajectories and present efficient algorithms for their processing.

...read moreread less

Proceedings Article•DOI•

Top-k dominating queries in uncertain databases

[...]

Xiang Lian¹, Lei Chen¹•Institutions (1)

Hong Kong University of Science and Technology¹

24 Mar 2009

TL;DR: An effective pruning approach to reduce the PTD search space is proposed, and an efficient query procedure to answer PTD queries is presented to answer probabilistic top-k dominating (PTD) queries.

...read moreread less

Abstract: Due to the existence of uncertain data in a wide spectrum of real applications, uncertain query processing has become increasingly important, which dramatically differs from handling certain data in a traditional database. In this paper, we formulate and tackle an important query, namely probabilistic top-k dominating (PTD) query, in the uncertain database. In particular, a PTD query retrieves k uncertain objects that are expected to dynamically dominate the largest number of uncertain objects. We propose an effective pruning approach to reduce the PTD search space, and present an efficient query procedure to answer PTD queries. Furthermore, approximate PTD query processing and the case where the PTD query is issued from an uncertain query object are also discussed. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed PTD query processing approaches.

...read moreread less

Proceedings Article•DOI•

Efficient provenance storage over nested data collections

[...]

Manish Kumar Anand¹, Shawn Bowers¹, Timothy M. McPhillips¹, Bertram Ludäscher¹•Institutions (1)

University of California, Davis¹

24 Mar 2009

TL;DR: This work presents a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections and adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies.

...read moreread less

Abstract: Scientific workflow systems are increasingly used to automate complex data analyses, largely due to their benefits over traditional approaches for workflow design, optimization, and provenance recording. Many workflow systems employ a simple dependency model to represent the provenance of data produced by workflow runs. Although commonly adopted, this model does not capture explicit data dependencies introduced by "provenance-aware" processes, and it can lead to inefficient storage when workflow data is complex or structured. We present a provenance model, extending the conventional approach, that supports (i) explicit data dependencies and (ii) nested data collections. Our model adopts techniques from reference-based XML versioning, adding annotations for process and data dependencies. We present strategies and reduction techniques to store immediate and transitive provenance information within our model, and examine trade-offs among update time, storage size, and query response time. We evaluate our approach on real-world and synthetic workflow execution traces, demonstrating significant reductions in storage size, while also reducing the time required to store and query provenance information.

...read moreread less

Proceedings Article•DOI•

Privacy-preserving data mashup

[...]

Noman Mohammed¹, Benjamin C. M. Fung¹, Ke Wang², Patrick C. K. Hung³•Institutions (3)

Concordia University¹, Simon Fraser University², University of Ontario Institute of Technology³

24 Mar 2009

TL;DR: A privacy-preserving data mashup algorithm to securely integrate private data from different data providers, whereas the integrated data still retains the essential information for supporting general data exploration or a specific data mining task, such as classification analysis is proposed.

...read moreread less

Abstract: Mashup is a web technology that combines information from more than one source into a single web application. This technique provides a new platform for different data providers to flexibly integrate their expertise and deliver highly customizable services to their customers. Nonetheless, combining data from different sources could potentially reveal person-specific sensitive information. In this paper, we study and resolve a real-life privacy problem in a data mashup application for the financial industry in Sweden, and propose a privacy-preserving data mashup (PPMashup) algorithm to securely integrate private data from different data providers, whereas the integrated data still retains the essential information for supporting general data exploration or a specific data mining task, such as classification analysis. Experiments on real-life data suggest that our proposed method is effective for simultaneously preserving both privacy and information usefulness, and is scalable for handling large volume of data.

...read moreread less

Proceedings Article•DOI•

Efficiently indexing shortest paths by exploiting symmetry in graphs

[...]

Yanghua Xiao¹, Wentao Wu¹, Jian Pei², Wei Wang¹, Zhenying He¹ - Show less +1 more•Institutions (2)

Fudan University¹, Simon Fraser University²

24 Mar 2009

TL;DR: This paper develops a framework to index a large graph at the orbit level instead of the vertex level so that the number of breadth-first search trees materialized is reduced from O(N) to O(|Δ|), where |Δ | ≤ N is the numberof orbits in the graph.

...read moreread less

Abstract: Shortest path queries (SPQ) are essential in many graph analysis and mining tasks. However, answering shortest path queries on-the-fly on large graphs is costly. To online answer shortest path queries, we may materialize and index shortest paths. However, a straightforward index of all shortest paths in a graph of N vertices takes O(N2) space. In this paper, we tackle the problem of indexing shortest paths and online answering shortest path queries. As many large real graphs are shown richly symmetric, the central idea of our approach is to use graph symmetry to reduce the index size while retaining the correctness and the efficiency of shortest path query answering. Technically, we develop a framework to index a large graph at the orbit level instead of the vertex level so that the number of breadth-first search trees materialized is reduced from O(N) to O(|Δ|), where |Δ| ≤ N is the number of orbits in the graph. We explore orbit adjacency and local symmetry to obtain compact breadth-first-search trees (compact BFS-trees). An extensive empirical study using both synthetic data and real data shows that compact BFS-trees can be built efficiently and the space cost can be reduced substantially. Moreover, online shortest path query answering can be achieved using compact BFS-trees.

...read moreread less

Proceedings Article•DOI•

PROUD: a probabilistic approach to processing similarity queries over uncertain data streams

[...]

Mi-Yen Yeh¹, Kun-Lung Wu², Philip S. Yu³, Ming-Syan Chen¹•Institutions (3)

National Taiwan University¹, IBM², University of Illinois at Chicago³

24 Mar 2009

TL;DR: The results show that, compared with Det, PROUD offers a flexible trade-off between false positives and false negatives by controlling a threshold, while maintaining a similar computation cost.

...read moreread less

Abstract: We present PROUD -- A PRObabilistic approach to processing similarity queries over Uncertain Data streams, where the data streams here are mainly time series streams. In contrast to data with certainty, an uncertain series is an ordered sequence of random variables. The distance between two uncertain series is also a random variable. We use a general uncertain data model, where only the mean and the deviation of each random variable at each timestamp are available. We derive mathematical conditions for progressively pruning candidates to reduce the computation cost. We then apply PROUD to a streaming environment where only sketches of streams, like wavelet synopses, are available. Extensive experiments are conducted to evaluate the effectiveness of PROUD and compare it with Det, a deterministic approach that directly processes data without considering uncertainty. The results show that, compared with Det, PROUD offers a flexible trade-off between false positives and false negatives by controlling a threshold, while maintaining a similar computation cost. In contrast, Det does not provide such flexibility. This trade-off is important as in some applications false negatives are more costly, while in others, it is more critical to keep the false positives low.

...read moreread less

Proceedings Article•DOI•

Top-k dominant web services under multi-criteria matching

[...]

Dimitrios Skoutas¹, Dimitris Sacharidis¹, Alkis Simitsis², Verena Kantere³, Timos Sellis¹ - Show less +1 more•Institutions (3)

National Technical University of Athens¹, Hewlett-Packard², École Polytechnique Fédérale de Lausanne³

24 Mar 2009

TL;DR: This work introduces an objective measure that assigns a dominance score to each advertised Web service, and investigates three distinct definitions of dominance score, and devise efficient algorithms that retrieve the top-k most dominant Web services in each case.

...read moreread less

Abstract: As we move from a Web of data to a Web of services, enhancing the capabilities of the current Web search engines with effective and efficient techniques for Web services retrieval and selection becomes an important issue. Traditionally, the relevance of a Web service advertisement to a service request is determined by computing an overall score that aggregates individual matching scores among the various parameters in their descriptions. Two drawbacks characterize such approaches. First, there is no single matching criterion that is optimal for determining the similarity between parameters. Instead, there are numerous approaches ranging from using Information Retrieval similarity metrics up to semantic logic-based inference rules. Second, the reduction of individual scores to an overall similarity leads to significant information loss. Since there is no consensus on how to weight these scores, existing methods are typically pessimistic, adopting a worst-case scenario. As a consequence, several services, e.g., those having a single unrelated parameter, can be excluded from the result set, even though they are potentially good alternatives. In this work, we present a methodology that overcomes both deficiencies. Given a request, we introduce an objective measure that assigns a dominance score to each advertised Web service. This score takes into consideration all the available criteria for each parameter in the request. We investigate three distinct definitions of dominance score, and we devise efficient algorithms that retrieve the top-k most dominant Web services in each case. Extensive experimental evaluation on real requests and relevance sets, as well as on synthetically generated scenarios, demonstrates both the effectiveness of the proposed technique and the efficiency of the algorithms.

...read moreread less

Proceedings Article•DOI•

Retrieving meaningful relaxed tightest fragments for XML keyword search

[...]

Lingbo Kong¹, Rémi Gilleron¹, Aurélien Lemay Mostrare¹•Institutions (1)

French Institute for Research in Computer Science and Automation¹

24 Mar 2009

TL;DR: The ValidRTF algorithm not only overcomes those two problems in MaxMatch, but also satisfies the axiomatic properties deduced in [1] that an XKS technique should satisfy.

...read moreread less

Abstract: Adapting keyword search to XML data has been attractive recently, generalized as XML keyword search (XKS). One of its key tasks is to return the meaningful fragments as the result. [1] is the latest work following this trend, and it focuses on returning the fragments rooted at SLCA (Smallest LCA -- Lowest Common Ancestor) nodes. To guarantee that the fragments only contain interesting nodes, [1] proposes a contributor-based filtering mechanism in its MaxMatch algorithm. However, the filtering mechanism is not sufficient. It will commit the false positive problem (discarding interesting nodes) and the redundancy problem (keeping uninteresting nodes).In this paper, our interest is to propose a framework of retrieving meaningful fragments rooted at not only the SLCA nodes, but all LCA nodes. We begin by introducing the concept of Relaxed Tightest Fragment (RTF) as the basic result type. Then we propose a new filtering mechanism to overcome those two problems in Max-Match. Its kernel is the concept of valid contributor, which helps to distinguish the interesting children of a node. The new filtering mechanism is then to prune the nodes in a RTF which are not valid contributors to their parents. Based on the valid contributor concept, our ValidRTF algorithm not only overcomes those two problems in MaxMatch, but also satisfies the axiomatic properties deduced in [1] that an XKS technique should satisfy. We compare ValidRTF with MaxMatch on real and synthetic XML data. The result verifies our claims, and shows the effectiveness of our valid-contributor-based filtering mechanism.

...read moreread less

Proceedings Article•DOI•

Reverse k-nearest neighbor search in dynamic and general metric databases

[...]

Elke Achtert¹, Hans-Peter Kriegel¹, Peer Kröger¹, Matthias Renz¹, Andreas Züfle¹ - Show less +1 more•Institutions (1)

Ludwig Maximilian University of Munich¹

24 Mar 2009

TL;DR: The most general solution for the general reverse k-nearest neighbor (RkNN) search problem is proposed and outperforms existing methods in terms of query execution times because it exploits different strategies for pruning false drops and identifying true hits as soon as possible.

...read moreread less

Abstract: In this paper, we propose an original solution for the general reverse k-nearest neighbor (RkNN) search problem. Compared to the limitations of existing methods for the RkNN search, our approach works on top of any hierarchically organized tree-like index structure and, thus, is applicable to any type of data as long as a metric distance function is defined on the data objects. We will exemplarily show how our approach works on top of the most prevalent index structures for Euclidean and metric data, the R-Tree and the M-Tree, respectively. Our solution is applicable for arbitrary values of k and can also be applied in dynamic environments where updates of the database frequently occur. Although being the most general solution for the RkNN problem, our solution outperforms existing methods in terms of query execution times because it exploits different strategies for pruning false drops and identifying true hits as soon as possible.

...read moreread less

Proceedings Article•DOI•

Indexing density models for incremental learning and anytime classification on data streams

[...]

Thomas Seidl¹, Ira Assent², Philipp Kranen¹, Ralph Krieger¹, Jennifer Herrmann¹ - Show less +1 more•Institutions (2)

RWTH Aachen University¹, Aalborg University²

24 Mar 2009

TL;DR: This work proposes a novel index-based technique that can handle all three of the above challenges using the established Bayes classifier on effective kernel density estimators and demonstrates the anytime learning performance of the Bayes tree.

...read moreread less

Abstract: Classification of streaming data faces three basic challenges: it has to deal with huge amounts of data, the varying time between two stream data items must be used best possible (anytime classification) and additional training data must be incrementally learned (anytime learning) for applying the classifier consistently to fast data streams. In this work, we propose a novel index-based technique that can handle all three of the above challenges using the established Bayes classifier on effective kernel density estimators. Our novel Bayes tree automatically generates (adapted efficiently to the individual object to be classified) a hierarchy of mixture densities that represent kernel density estimators at successively coarser levels. Our probability density queries together with novel classification improvement strategies provide the necessary information for very effective classification at any point of interruption. Moreover, we propose a novel evaluation method for anytime classification using Poisson streams and demonstrate the anytime learning performance of the Bayes tree.

...read moreread less

Proceedings Article•DOI•

Rule-based multi-query optimization

[...]

Mingsheng Hong¹, Mirek Riedewald¹, Christoph Koch¹, Johannes Gehrke¹, Alan Demers¹ - Show less +1 more•Institutions (1)

Cornell University¹

24 Mar 2009

TL;DR: This work lays the foundation for a powerful multi-query optimizer for data stream processing, using a rule-based MQO framework that incorporates a set of new abstractions, extending their counterparts, physical operators, transformation rules, and streams, in a traditional RDBMS or stream processing system.

...read moreread less

Abstract: Data stream management systems usually have to process many long-running queries that are active at the same time. Multiple queries can be evaluated more efficiently together than independently, because it is often possible to share state and computation. Motivated by this observation, various Multi-Query Optimization (MQO) techniques have been proposed. However, these approaches suffer from two limitations. First, they focus on very specialized workloads. Second, integrating MQO techniques for CQL-style stream engines and those for event pattern detection engines is even harder, as the processing models of these two types of stream engines are radically different.In this paper, we propose a rule-based MQO framework. This framework incorporates a set of new abstractions, extending their counterparts, physical operators, transformation rules, and streams, in a traditional RDBMS or stream processing system. Within this framework, we can integrate new and existing MQO techniques through the use of transformation rules. This allows us to build an expressive and scalable stream system. Just as relational optimizers are crucial for the success of RDBMSes, a powerful multi-query optimizer is needed for data stream processing. This work lays the foundation for such a multi-query optimizer, creating opportunities for future research. We experimentally demonstrate the efficacy of our approach.

...read moreread less

Proceedings Article•DOI•

Exploiting the power of relational databases for efficient stream processing

[...]

Erietta Liarou, Romulo Goncalves, Stratos Idreos

24 Mar 2009

TL;DR: A complete architecture is proposed, the DataCell, which is implemented on top of an open-source column-oriented DBMS, which allows batch processing of tuples and selectively pick tuples from a basket based on the query requirements exploiting a novel query component, the basket expressions.

...read moreread less

Abstract: Stream applications gained significant popularity over the last years that lead to the development of specialized stream engines. These systems are designed from scratch with a different philosophy than nowadays database engines in order to cope with the stream applications requirements. However, this means that they lack the power and sophisticated techniques of a full fledged database system that exploits techniques and algorithms accumulated over many years of database research.In this paper, we take the opposite route and design a stream engine directly on top of a database kernel. Incoming tuples are directly stored upon arrival in a new kind of system tables, called baskets. A continuous query can then be evaluated over its relevant baskets as a typical one-time query exploiting the power of the relational engine. Once a tuple has been seen by all relevant queries/operators, it is dropped from its basket. A basket can be the input to a single or multiple similar query plans. Furthermore, a query plan can be split into multiple parts each one with its own input/output baskets allowing for flexible load sharing query scheduling. Contrary to traditional stream engines, that process one tuple at a time, this model allows batch processing of tuples, e.g., query a basket only after x tuples arrive or after a time threshold has passed. Furthermore, we are not restricted to process tuples in the order they arrive. Instead, we can selectively pick tuples from a basket based on the query requirements exploiting a novel query component, the basket expressions.We investigate the opportunities and challenges that arise with such a direction and we show that it carries significant advantages. We propose a complete architecture, the DataCell, which we implemented on top of an open-source column-oriented DBMS. A detailed analysis and experimental evaluation of the core algorithms using both micro benchmarks and the standard Linear Road benchmark demonstrate the potential of this new approach.

...read moreread less

Proceedings Article•DOI•

Fast object search on road networks

[...]

Ken C. K. Lee¹, Wang-Chien Lee¹, Baihua Zheng²•Institutions (2)

Pennsylvania State University¹, Singapore Management University²

24 Mar 2009

TL;DR: The experiment result shows the superiority of ROAD over the state-of-the-art approaches, and several properties useful to construct Rnet hierarchy.

...read moreread less

Abstract: In this paper, we present ROAD, a general framework to evaluate Location-Dependent Spatial Queries (LDSQ)s that searches for spatial objects on road networks. By exploiting search space pruning technique and providing a dynamic object mapping mechanism, ROAD is very efficient and flexible for various types of queries, namely, range search and nearest neighbor search, on objects over large-scale networks. ROAD is named after its two components, namely, Route Overlay and Association Directory, designed to address the network traversal and object access aspects of the framework. In ROAD, a large road network is organized as a hierarchy of interconnected regional sub-networks (called Rnets) augmented with 1) shortcuts for accelerating network traversals; and 2) object abstracts for guiding traversals. In this paper, we present (i) the Rnet hierarchy and several properties useful to construct Rnet hierarchy, (ii) the design and implementation of the ROAD framework, (iii) efficient object search algorithms for various queries, and (iv) incremental update techniques for framework maintenance in presence of object and network changes. We conducted extensive experiments with real road networks to evaluate ROAD. The experiment result shows the superiority of ROAD over the state-of-the-art approaches.

...read moreread less

Proceedings Article•DOI•

LCS-Hist: taming massive high-dimensional data cube compression

[...]

Alfredo Cuzzocrea¹, Paolo Serafino¹•Institutions (1)

University of Calabria¹

24 Mar 2009

TL;DR: This paper proposes LCS-Hist, an innovative multidimensional histogram devising a complex methodology that combines intelligent data modeling and processing techniques in order to tame the annoying problem of compressing massive high-dimensional data cubes.

...read moreread less

Abstract: The problem of efficiently compressing massive high-dimensional data cubes still waits for efficient solutions capable of overcoming well-recognized scalability limitations of state-of-the-art histogram-based techniques, which perform well on small-in-size low-dimensional data cubes, whereas their performance in both representing the input data domain and efficiently supporting approximate query answering against the generated compressed data structure decreases dramatically when data cubes grow in dimension number and size. To overcome this relevant research challenge, in this paper we propose LCS-Hist, an innovative multidimensional histogram devising a complex methodology that combines intelligent data modeling and processing techniques in order to tame the annoying problem of compressing massive high-dimensional data cubes. With respect to similar histogram-based proposals, our technique introduces (i) a surprising consumption of the storage space available to house the compressed representation of the input data cube, and (ii) a superior scalability on high-dimensional data cubes. Finally, several experimental results performed against various classes of data cubes confirm the advantages of LCS-Hist, even in comparison with those given by state-of-the-art similar techniques.

...read moreread less

Proceedings Article•DOI•

A novel approach for efficient supergraph query processing on graph databases

[...]

Shuo Zhang¹, Jianzhong Li¹, Hong Gao¹, Zhaonian Zou¹•Institutions (1)

Harbin Institute of Technology¹

24 Mar 2009

TL;DR: An optimal compact method for organizing graph databases is proposed, a novel algorithm of testing subgraph isomorphisms from multiple graphs to one graph is presented, and a query processing method is proposed based on these techniques.

...read moreread less

Abstract: In recent years, large amount of data modeled by graphs, namely graph data, have been collected in various domains. Efficiently processing queries on graph databases has attracted a lot of research attentions. Supergraph query is a kind of new and important queries in practice. A supergraph query, q, on a graph database D is to retrieve all graphs in D such that q is a supergraph of them. Because the number of graphs in databases is large and subgraph isomorphism testing is NP-complete, efficiently processing such queries is a big challenge. This paper first proposes an optimal compact method for organizing graph databases. Common subgraphs of the graphs in a database are stored only once in the compact organization of the database, in order to reduce the overall cost of subgraph isomorphism testings from stored graphs to queries during query processing. Then, an exact algorithm and an approximate algorithm for generating significant feature set with optimal order are proposed to construct indices on graph databases. The optimal order on the feature set is to reduce the number of subgraph isomorphism testings during query processing. Based on the compact organization of graph databases, a novel algorithm of testing subgraph isomorphisms from multiple graphs to one graph is presented. Finally, based on all these techniques, a query processing method is proposed. Analytical and experimental results show that the proposed algorithms outper-form the existing similar algorithms by one to two orders of magnitude.

...read moreread less

Proceedings Article•DOI•

Efficient skyline computation in metric space

[...]

David Fuhry¹, Ruoming Jin¹, Donghui Zhang²•Institutions (2)

Kent State University¹, North East University²

24 Mar 2009

TL;DR: New algorithms to aggressively prune non-skyline points from the search space are developed and two new optimization techniques are contributed to reduce the number of distance computations and dominance tests.

...read moreread less

Abstract: Given a set of n query points in a general metric space, a metric-space skyline (MSS) query asks what are the closest points to all these query points in the database. Here, consider for any point p, if there are no other points in the database which have less or equal distance to all the query points, then p is denoted as one of the closest points to the query points. This problem is a direct generalization of the recently proposed spatial-skyline query problem, where all the points are located in two or three dimensional Euclidean space. It is also closely related with the nearest neighbor (NN) query, the range query and the common skyline query problem. In this paper, we have developed new algorithms to aggressively prune non-skyline points from the search space. We also contribute two new optimization techniques to reduce the number of distance computations and dominance tests. Our experimental evaluation has shown the effectiveness and efficiency of our approach.

...read moreread less