
Showing papers in "ACM Transactions on Database Systems in 2003"


Journal ArticleDOI
TL;DR: A simple, exact algorithm for identifying in a multiset the items with frequency more than a threshold θ, which requires two passes, linear time, and space 1/θ.
Abstract: We present a simple, exact algorithm for identifying in a multiset the items with frequency more than a threshold θ. The algorithm requires two passes, linear time, and space 1/θ. The first pass is an on-line algorithm, generalizing a well-known algorithm for finding a majority element, for identifying a set of at most 1/θ items that includes, possibly among others, all items with frequency greater than θ.
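To make the two passes concrete, here is a minimal Python sketch of the scheme the abstract describes (the generalization of the majority algorithm, in the Misra-Gries style); function and variable names are ours:

```python
import math
from collections import Counter

def frequent_candidates(stream, theta):
    """First pass: return at most 1/theta candidate items, guaranteed to
    include every item whose relative frequency exceeds theta. With
    theta = 1/2 this degenerates to the classic majority algorithm."""
    k = math.ceil(1 / theta)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free counter: decrement all, dropping those that hit zero
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return set(counters)

def frequent_items(stream, theta):
    """Second pass: count the candidates exactly and keep the true hits."""
    candidates = frequent_candidates(stream, theta)
    counts = Counter(x for x in stream if x in candidates)
    return {x for x, c in counts.items() if c > theta * len(stream)}
```

The space bound comes from the first pass holding fewer than 1/θ counters at any time; the second pass is needed because the candidate set may contain false positives.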

613 citations


Journal ArticleDOI
TL;DR: A simple, natural embedding of preference formulas into relational algebra (and SQL) through a single winnow operator parameterized by a preference formula is proposed, which makes possible the formulation of complex preference queries by piggybacking on existing SQL constructs.
Abstract: The handling of user preferences is becoming an increasingly important issue in present-day information systems. Among other uses, preferences serve information filtering and extraction to reduce the volume of data presented to the user. They are also used to keep track of user profiles and to formulate policies that improve and automate decision making. We propose here a simple, logical framework for formulating preferences as preference formulas. The framework does not impose any restrictions on the preference relations, and allows arbitrary operation and predicate signatures in preference formulas. It also makes the composition of preference relations straightforward. We propose a simple, natural embedding of preference formulas into relational algebra (and SQL) through a single winnow operator parameterized by a preference formula. The embedding makes possible the formulation of complex preference queries, for example, involving aggregation, by piggybacking on existing SQL constructs. It also leads in a natural way to the definition of further, preference-related concepts like ranking. Finally, we present general algebraic laws governing the winnow operator and its interactions with other relational algebra operators. The preconditions on the applicability of the laws are captured by logical formulas. The laws provide a formal foundation for the algebraic optimization of preference queries. We demonstrate the usefulness of our approach through numerous examples.
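As a toy illustration of the winnow operator, here is a naive quadratic evaluation in Python (not the algebraically optimized evaluation the article develops; the car example is hypothetical):

```python
def winnow(relation, prefers):
    """Keep exactly the tuples not dominated under the preference
    relation: t survives iff no other tuple is preferred over it."""
    return [t for t in relation
            if not any(prefers(u, t) for u in relation if u is not t)]

# Preference formula: prefer a cheaper car of the same make.
cars = [("vw", 2002, 15000), ("vw", 2002, 12000), ("bmw", 2001, 20000)]
cheaper_same_make = lambda a, b: a[0] == b[0] and a[2] < b[2]
print(winnow(cars, cheaper_same_make))
# [('vw', 2002, 12000), ('bmw', 2001, 20000)]
```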

497 citations


Journal ArticleDOI
TL;DR: This article focuses on methods for similarity search that make the general assumption that similarity is represented with a distance metric d, and presents algorithms for common types of queries that operate on an arbitrary "search hierarchy."
Abstract: Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary "search hierarchy." These algorithms can be applied to each of the methods presented, provided a suitable search hierarchy is defined.
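The search-hierarchy framework can be sketched as a best-first traversal driven by distance lower bounds; the following Python outline is our paraphrase of that idea, with the node interface (dist_lb, is_object, children) as assumed names:

```python
import heapq, itertools

def incremental_search(root, query, dist_lb, is_object, children):
    """Best-first traversal of an arbitrary search hierarchy.
    dist_lb(e, q): lower bound on d(q, o) for any object o under e
                   (and the exact distance when e is itself an object);
    is_object(e):  True if e is a data object, False for index nodes;
    children(e):   the child elements of an index node.
    Yields data objects in order of increasing distance to the query."""
    tie = itertools.count()  # tie-breaker so the heap never compares elements
    heap = [(dist_lb(root, query), next(tie), root)]
    while heap:
        d, _, e = heapq.heappop(heap)
        if is_object(e):
            yield e, d      # stop after k results for a k-NN query
        else:
            for c in children(e):
                heapq.heappush(heap, (dist_lb(c, query), next(tie), c))
```

Because the bounds never overestimate, objects are emitted in true distance order, so k-nearest-neighbor, range, and incremental ranking queries all fall out of the same traversal.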

480 citations


Journal ArticleDOI
TL;DR: The results show that the path sharing employed by YFilter can provide order-of-magnitude performance benefits; two alternative techniques for extending YFilter's shared structure matching with support for value-based predicates are proposed, and their performance is compared.
Abstract: XML filtering systems aim to provide fast, on-the-fly matching of XML-encoded data to large numbers of query specifications containing constraints on both structure and content. It is now well accepted that approaches using event-based parsing and Finite State Machines (FSMs) can provide the basis for highly scalable structure-oriented XML filtering systems. The XFilter system [Altinel and Franklin 2000] was the first published FSM-based XML filtering approach. XFilter used a separate FSM per path query and a novel indexing mechanism to allow all of the FSMs to be executed simultaneously during the processing of a document. Building on the insights of the XFilter work, we describe a new method, called "YFilter", that combines all of the path queries into a single Nondeterministic Finite Automaton (NFA). YFilter exploits commonality among queries by merging common prefixes of the query paths such that they are processed at most once. The resulting shared processing provides tremendous improvements in structure matching performance but complicates the handling of value-based predicates. In this article, we first describe the XFilter and YFilter approaches and present results of a detailed performance comparison of structure matching for these algorithms as well as a hybrid approach. The results show that the path sharing employed by YFilter can provide order-of-magnitude performance benefits. We then propose two alternative techniques for extending YFilter's shared structure matching with support for value-based predicates, and compare the performance of these two techniques. The results of this latter study demonstrate some key differences between shared XML filtering and traditional database query processing. Finally, we describe how the YFilter approach is extended to handle more complicated queries containing nested path expressions.
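The prefix sharing can be illustrated for the simplest case, linear /a/b/c paths, where the shared automaton degenerates to a trie; handling '//' and '*' steps is what makes the real YFilter machine nondeterministic and is omitted from this sketch:

```python
def build_shared_trie(path_queries):
    """Merge simple /a/b/c path queries into one prefix-shared structure,
    so a common prefix is matched at most once for all queries."""
    root = {"accept": [], "next": {}}
    for qid, path in enumerate(path_queries):
        node = root
        for step in path:
            node = node["next"].setdefault(step, {"accept": [], "next": {}})
        node["accept"].append(qid)
    return root

def match(root, element_path):
    """Ids of the queries matched by one root-to-leaf element path."""
    node = root
    for tag in element_path:
        node = node["next"].get(tag)
        if node is None:
            return []
    return node["accept"]

queries = [["catalog", "book", "title"], ["catalog", "book", "price"]]
trie = build_shared_trie(queries)
print(match(trie, ["catalog", "book", "title"]))   # [0]
```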

422 citations


Journal ArticleDOI
TL;DR: This article proposes various refresh policies and studies their effectiveness, showing that a Poisson process is a good model for the changes of Web pages and that the proposed policies improve the "freshness" of data very significantly.
Abstract: In this article, we study how we can keep local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to keep the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date. This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model to describe the changes of Web pages, and we also show that our proposed refresh policies improve the "freshness" of data very significantly. In certain cases, we got orders of magnitude improvement over existing policies.
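Under the Poisson change model, the freshness of a copy refreshed at a fixed interval has a closed form; a small sketch (our derivation of the standard formula, consistent with the model described above):

```python
import math

def expected_freshness(change_rate, refresh_interval):
    """Time-averaged probability that the copy is up-to-date when the
    source changes as a Poisson process with rate `change_rate` and the
    copy is re-fetched every `refresh_interval` time units.
    Freshness t units after a refresh is e^(-lambda*t); averaging over
    one interval I gives (1 - e^(-lambda*I)) / (lambda*I)."""
    x = change_rate * refresh_interval
    return (1 - math.exp(-x)) / x

# A page that changes on average once a week, crawled once a week:
print(expected_freshness(1 / 7, 7))   # ~0.63
```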

294 citations


Journal ArticleDOI
TL;DR: The analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case, and it is shown that the Dualize and Advance algorithm has worst-case running time that is sub-exponential in the output size.
Abstract: Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem. The a priori algorithm has been successfully applied to many instances of the problem. We analyze this algorithm, prove that it is optimal when the maximally specific sentences are "small", and point out its limitations. We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has worst-case running time that is sub-exponential in the output size (i.e., the number of maximally specific sentences). We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.
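For reference, the levelwise a priori baseline that the article analyzes can be sketched in a few lines of Python, with frequent itemset mining as the canonical instance of finding maximally specific sentences (this is the baseline, not the Dualize and Advance algorithm):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Levelwise search: a k-itemset can be frequent only if all its
    (k-1)-subsets are, so each level prunes the candidates of the next."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = {x for t in transactions for x in t}
    level = {frozenset([x]) for x in items if support(frozenset([x])) >= min_support}
    frequent = set(level)
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
    return frequent

data = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
print(sorted(map(sorted, apriori(data, 2))))
# [['a'], ['a', 'b'], ['b'], ['c'], ['d']]
```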

256 citations


Journal ArticleDOI
TL;DR: An evaluation of several quorum types indicates that the conventional read-one/write-all-available approach is the best choice for a large range of applications requiring data replication; this matters for anyone developing code for computing clusters, since read-one/write-all-available is much simpler to implement and more flexible than quorum-based approaches.
Abstract: Data replication is playing an increasingly important role in the design of parallel information systems. In particular, the widespread use of cluster architectures often requires replicating data for performance and availability reasons. However, maintaining the consistency of the different replicas is known to cause severe scalability problems. To address this limitation, quorums are often suggested as a way to reduce the overall overhead of replication. In this article, we analyze several quorum types in order to better understand their behavior in practice. The results obtained challenge many of the assumptions behind quorum-based replication. Our evaluation indicates that the conventional read-one/write-all-available approach is the best choice for a large range of applications requiring data replication. We believe this is an important result for anybody developing code for computing clusters, as the read-one/write-all-available strategy is much simpler to implement and more flexible than quorum-based approaches. We also show that it is the best choice under a number of other selection criteria.
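The overhead gap the study measures can be illustrated with a back-of-the-envelope count of replicas touched per operation (a toy model only; the article's evaluation also accounts for messages, conflicts, and failures):

```python
def avg_replicas_touched(n, read_fraction, scheme):
    """Average number of replicas contacted per operation, a rough
    proxy for replication overhead."""
    if scheme == "majority":
        wq = n // 2 + 1          # write quorum
        rq = n - wq + 1          # read quorum; rq + wq > n guarantees overlap
    else:                        # read-one / write-all-available
        rq, wq = 1, n            # assuming all n replicas are reachable
    return read_fraction * rq + (1 - read_fraction) * wq

for scheme in ("majority", "rowaa"):
    print(scheme, avg_replicas_touched(5, 0.8, scheme))
# majority 3.0 / rowaa 1.8: on read-heavy workloads ROWAA touches fewer
# replicas on average, in line with the direction of the article's result.
```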

147 citations


Journal ArticleDOI
TL;DR: Time-parameterized and continuous versions of the most common spatial queries (i.e., window queries, nearest neighbors, spatial joins) are studied, and efficient processing algorithms and accurate cost models are proposed.
Abstract: Conventional spatial queries are usually meaningless in dynamic environments since their results may be invalidated as soon as the query or data objects move. In this paper we formulate two novel query types, time-parameterized and continuous queries, applicable in such environments. A time-parameterized query retrieves the actual result at the time when the query is issued, the expiry time of the result given the current motion of the query and database objects, and the change that causes the expiration. A continuous query retrieves tuples of the form ⟨R, T⟩, where each result R is accompanied by a future interval T during which it is valid. We study time-parameterized and continuous versions of the most common spatial queries (i.e., window queries, nearest neighbors, spatial joins), proposing efficient processing algorithms and accurate cost models.
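For intuition, the expiry-time computation can be sketched in the simplest possible setting, 1-D points moving linearly past a static window (our toy reduction of the TP window query idea; the article's algorithms handle general window, nearest-neighbor, and join queries):

```python
def tp_window_1d(points, window, horizon=float("inf")):
    """Time-parameterized window query over 1-D points (pos, vel):
    return the current result, the time at which it expires, and the
    point whose boundary crossing causes the expiration."""
    lo, hi = window
    result = [p for p in points if lo <= p[0] <= hi]
    expiry, cause = horizon, None
    for pos, vel in points:
        if vel == 0:
            continue
        inside = lo <= pos <= hi
        # boundary this point is heading for, given its current motion
        boundary = (hi if vel > 0 else lo) if inside else (lo if vel > 0 else hi)
        t = (boundary - pos) / vel
        if 0 < t < expiry:        # the earliest future event changes the result
            expiry, cause = t, (pos, vel)
    return result, expiry, cause

print(tp_window_1d([(1, 1), (5, -2)], (0, 4)))
# ([(1, 1)], 0.5, (5, -2)): the point at 5 moving left enters at t = 0.5
```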

81 citations


Journal ArticleDOI
TL;DR: This article develops an algorithm, called DCF, for Dynamic Constrained Frequent-set computation; DCF is enhanced with optimizations that exploit a lightweight structure called a segment support map, enabling it to obtain sharper bounds on the support of sets of items and to better exploit properties of constraints.
Abstract: Data mining is supposed to be an iterative and exploratory process. In this context, we are working on a project with the overall objective of developing a practical computing environment for the human-centered exploratory mining of frequent sets. One critical component of such an environment is the support for the dynamic mining of constrained frequent sets of items. Constraints enable users to impose a certain focus on the mining process; dynamic means that, in the middle of the computation, users are able to (i) change (e.g., tighten or relax) the constraints and/or (ii) change the minimum support threshold, thus having a decisive influence on subsequent computations. In a real-life situation, the available buffer space may be limited, adding another complication to the problem. In this article, we develop an algorithm, called DCF, for Dynamic Constrained Frequent-set computation. This algorithm is enhanced with a few optimizations, exploiting a lightweight structure called a segment support map. It enables DCF to (i) obtain sharper bounds on the support of sets of items, and (ii) better exploit properties of constraints. Furthermore, when handling dynamic changes to constraints, DCF relies on the concept of a delta member generating function, which generates precisely the sets of items that satisfy the new but not the old constraints. Our experimental results show the effectiveness of these enhancements.
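A segment support map can be sketched directly (a minimal illustration of how it yields support upper bounds; the DCF-specific machinery, such as delta member generating functions, is not shown):

```python
def build_segment_support_map(transactions, segments):
    """For each segment (a block of the item-domain partition), count
    the transactions containing at least one of its items."""
    return [sum(bool(seg & t) for t in transactions) for seg in segments]

def support_upper_bound(itemset, segments, ssm):
    """An itemset's support cannot exceed the support of any segment one
    of its items falls in, so the minimum over those segments bounds it."""
    return min(ssm[i] for i, seg in enumerate(segments) if seg & itemset)

transactions = [frozenset("ab"), frozenset("ac"), frozenset("d")]
segments = [frozenset("ab"), frozenset("cd")]
ssm = build_segment_support_map(transactions, segments)
print(support_upper_bound(frozenset("ad"), segments, ssm))  # 2 (true support: 0)
```

Sharper bounds let the miner prune candidate sets earlier when constraints or the support threshold change mid-computation.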

75 citations


Journal ArticleDOI
TL;DR: Two novel partitioning algorithms, the Adaptive Pick-and-Sweep Join (APSJ) and the Adaptive Divide-and-Conquer Join (ADCJ), are proposed; APSJ is shown to outperform previously suggested algorithms for many data sets, often by an order of magnitude.
Abstract: A set containment join is a join between set-valued attributes of two relations, whose join condition is specified using the subset (⊆) operator. Set containment joins are deployed in many database applications, even those that do not support set-valued attributes. In this article, we propose two novel partitioning algorithms, called the Adaptive Pick-and-Sweep Join (APSJ) and the Adaptive Divide-and-Conquer Join (ADCJ), which allow computing set containment joins efficiently. We show that APSJ outperforms previously suggested algorithms for many data sets, often by an order of magnitude. We present a detailed analysis of the algorithms and study their performance on real and synthetic data using an implemented testbed.
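A set containment join is easiest to state as its naive nested-loop specification, shown below in Python (APSJ and ADCJ get their speedups by partitioning the inputs so that most pairs are never compared):

```python
def set_containment_join(R, S):
    """R join S on r.set ⊆ s.set, with (key, set) pairs as tuples.
    This quadratic loop is the specification, not the fast algorithm."""
    return [(rk, sk) for rk, rset in R for sk, sset in S if rset <= sset]

R = [("r1", frozenset("ab")), ("r2", frozenset("abc"))]
S = [("s1", frozenset("abcd")), ("s2", frozenset("ab"))]
print(set_containment_join(R, S))
# [('r1', 's1'), ('r1', 's2'), ('r2', 's1')]
```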

74 citations


Journal ArticleDOI
TL;DR: Probabilistic cost models that estimate the selectivity of spatio-temporal window queries and joins, and the expected distance between a query and its nearest neighbor(s) are presented.
Abstract: Given a set of objects S, a spatio-temporal window query q retrieves the objects of S that will intersect the window during the (future) interval qT. A nearest neighbor query q retrieves the objects of S closest to q during qT. Given a threshold d, a spatio-temporal join retrieves the pairs of objects from two datasets that will come within distance d from each other during qT. In this article, we present probabilistic cost models that estimate the selectivity of spatio-temporal window queries and joins, and the expected distance between a query and its nearest neighbor(s). Our models capture any query/object mobility combination (moving queries, moving objects or both) and any data type (points and rectangles) in arbitrary dimensionality. In addition, we develop specialized spatio-temporal histograms, which take into account both location and velocity information, and can be incrementally maintained. Extensive performance evaluation verifies that the proposed techniques produce highly accurate estimates on both uniform and non-uniform data.
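The flavor of these models can be conveyed with the simplest special case: a window moving at constant velocity over uniformly distributed static points, where selectivity is just the fraction of space the window sweeps (a toy instance only; the article's models also cover moving objects, rectangles, arbitrary dimensionality, and non-uniform data via the histograms):

```python
def moving_window_selectivity(window, velocity, duration):
    """Selectivity of a moving window over uniform static points in the
    unit square: the area swept during the query interval. The swept
    region is the window enlarged by v*duration along each moving axis,
    clipped to the unit square (exact for axis-parallel motion; an
    upper bound for diagonal motion)."""
    (x1, y1, x2, y2), (vx, vy) = window, velocity
    x1 += min(vx, 0) * duration; x2 += max(vx, 0) * duration
    y1 += min(vy, 0) * duration; y2 += max(vy, 0) * duration
    w = max(0.0, min(x2, 1) - max(x1, 0))
    h = max(0.0, min(y2, 1) - max(y1, 0))
    return w * h

print(moving_window_selectivity((0.4, 0.4, 0.5, 0.5), (0.1, 0.0), 1.0))
# 0.02: a 0.1 x 0.1 window sweeping 0.1 to the right covers 0.2 x 0.1
```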

Journal ArticleDOI
TL;DR: In this article, it is shown how to compute reasonably small (and in special cases even minimal) complements for a large class of relational views.
Abstract: Views as a means to describe parts of a given data collection play an important role in many database applications. In dynamic environments where data is updated, not only information provided by views, but also information provided by data sources yet missing from views turns out to be relevant: Previously, this missing information has been characterized in terms of view complements; recently, it has been shown that view complements can be exploited in the context of data warehouses to guarantee desirable warehouse properties such as independence and self-maintainability. As the complete source information is a trivial complement for any view, a natural interest for "small" or even "minimal" complements arises. However, the computation of minimal complements is still not very well understood. In this article, it is shown how to compute reasonably small (and in special cases even minimal) complements for a large class of relational views.
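The defining property, that a view together with its complement determines the base relation, can be checked mechanically for projection views (a toy verification in Python; the article is about computing small complements, not about checking candidates by enumeration):

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of attribute dicts."""
    common = set(r1[0]) & set(r2[0])
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if all(t1[a] == t2[a] for a in common)]

def project(rel, attrs):
    """Duplicate-eliminating projection onto the given attributes."""
    return [dict(t) for t in {tuple(sorted((a, row[a]) for a in attrs)) for row in rel}]

def is_complement(base, view_attrs, comp_attrs):
    """True iff the two projections together reconstruct the base
    relation exactly (no lost and no spurious tuples)."""
    joined = natural_join(project(base, view_attrs), project(base, comp_attrs))
    as_set = lambda rel: {tuple(sorted(t.items())) for t in rel}
    return as_set(joined) == as_set(base)

emp = [{"name": "ann", "dept": "db", "salary": 50},
       {"name": "bob", "dept": "os", "salary": 40}]
# The dept column is missing from the salary view; projecting it out
# together with the key 'name' yields a (small) complement.
print(is_complement(emp, ["name", "salary"], ["name", "dept"]))  # True
```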

Journal ArticleDOI
TL;DR: A general method for semantic query optimization in the framework of Object-Oriented Database Systems, implemented in a tool with an ODMG-compliant interface that allows full interaction with OQL queries while hiding the underlying Description Logic representation and techniques from the user.
Abstract: Semantic query optimization uses semantic knowledge (i.e., integrity constraints) to transform a query into an equivalent one that may be answered more efficiently. This article proposes a general method for semantic query optimization in the framework of Object-Oriented Database Systems. The method is effective for a large class of queries, including conjunctive recursive queries expressed with regular path expressions, and is based on three ingredients. The first is a Description Logic, ODLRE, providing a type system capable of expressing class descriptions, queries, views, and integrity constraint rules, together with inference techniques such as incoherence detection and subsumption computation. The second is a semantic expansion function for queries, which incorporates into one query the restrictions logically implied by the query and the schema (classes + rules). The third is an optimal rewriting method that rewrites a query into an equivalent one with respect to the schema classes, by determining more specialized classes to be accessed and by reducing the number of factors. We implemented the method in a tool providing an ODMG-compliant interface that allows full interaction with OQL queries, hiding the underlying Description Logic representation and techniques from the user.
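The core idea, folding integrity constraints into the query and then discarding redundant factors, can be caricatured with one-attribute range constraints (a deliberately tiny sketch; the article's method works on ODLRE class descriptions and subsumption, not on this ad hoc representation):

```python
def drop_implied_predicates(query_preds, constraints):
    """Remove every query predicate already implied by an integrity
    constraint, so fewer factors are evaluated. Both are modeled here
    as (attr, ">=", value) lower bounds only."""
    def implied(pred):
        attr, op, val = pred
        return any(c_attr == attr and op == c_op == ">=" and c_val >= val
                   for c_attr, c_op, c_val in constraints)
    return [p for p in query_preds if not implied(p)]

constraints = [("age", ">=", 18)]              # e.g. every Employee is an adult
query = [("age", ">=", 10), ("salary", ">=", 30000)]
print(drop_implied_predicates(query, constraints))
# [('salary', '>=', 30000)] -- the age test is implied, so it is removed
```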

Journal ArticleDOI
TL;DR: The Iterative Spatial Join is based on a plane-sweep algorithm and, when the data set does not fit in internal memory, simply makes additional passes over the data; it overcomes the performance limitations of the other algorithms for data sets of all sizes as well as differing amounts of internal memory.
Abstract: The key issue in performing spatial joins is finding the pairs of intersecting rectangles. For unindexed data sets, this is usually resolved by partitioning the data and then performing a plane sweep on the individual partitions. The resulting join can be viewed as a two-step process where the partition corresponds to a hash-based join while the plane-sweep corresponds to a sort-merge join. In this article, we look at extending the idea of the sort-merge join for one-dimensional data to multiple dimensions and introduce the Iterative Spatial Join. As with the sort-merge join, the Iterative Spatial Join is best suited to cases where the data is already sorted. However, as we show in the experiments, the Iterative Spatial Join performs well when internal memory is limited, compared to the partitioning methods. This suggests that the Iterative Spatial Join would be useful for very large data sets or in situations where internal memory is a shared resource and is therefore limited, such as with today's database engines which share internal memory amongst several queries. Furthermore, the performance of the Iterative Spatial Join is predictable and has no parameters which need to be tuned, unlike other algorithms. The Iterative Spatial Join is based on a plane sweep algorithm, which requires the entire data set to fit in internal memory. When internal memory overflows, the Iterative Spatial Join simply makes additional passes on the data, thereby exhibiting only a gradual performance degradation. To demonstrate the use and efficacy of the Iterative Spatial Join, we first examine and analyze current approaches to performing spatial joins, and then give a detailed analysis of the Iterative Spatial Join as well as present the results of extensive testing of the algorithm, including a comparison with partitioning-based spatial join methods. These tests show that the Iterative Spatial Join overcomes the performance limitations of the other algorithms for data sets of all sizes as well as differing amounts of internal memory.
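The plane-sweep core of the method, finding the pairs of intersecting rectangles, can be sketched as follows (a minimal in-memory version with a list-based active set; the article's contribution is making repeated sweeps work gracefully when memory overflows):

```python
def plane_sweep_intersections(rects):
    """Report pairs of intersecting axis-aligned rectangles
    (x1, y1, x2, y2): sweep left to right, keep rectangles whose
    x-range covers the sweep position, and test y-overlap only there."""
    order = sorted(range(len(rects)), key=lambda i: rects[i][0])
    active, out = [], []
    for i in order:
        x1 = rects[i][0]
        active = [j for j in active if rects[j][2] >= x1]  # drop ended x-ranges
        for j in active:                                   # x-ranges overlap by construction
            if rects[i][1] <= rects[j][3] and rects[j][1] <= rects[i][3]:
                out.append((j, i))
        active.append(i)
    return out

rects = [(0, 0, 2, 2), (1, 1, 3, 3), (4, 0, 5, 1)]
print(plane_sweep_intersections(rects))   # [(0, 1)]
```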

Journal ArticleDOI
TL;DR: This work presents two options for removing permissions in FAF and provides details on the option that is representation-independent.
Abstract: The Flexible Authorization Framework (FAF) defined by Jajodia et al. [2001] provides a policy-neutral framework for specifying access control policies that is expressive enough to specify many known access control policies. Although the original formulation of FAF indicated how rules could be added to or deleted from a FAF specification, it did not address the removal of access permissions from users. We present two options for removing permissions in FAF and provide details on the option that is representation-independent.