
Showing papers in "ACM Transactions on Database Systems in 2003"


Journal ArticleDOI
TL;DR: A simple, exact algorithm for identifying in a multiset the items with frequency more than a threshold θ, which requires two passes, linear time, and space 1/θ.
Abstract: We present a simple, exact algorithm for identifying in a multiset the items with frequency more than a threshold θ. The algorithm requires two passes, linear time, and space 1/θ. The first pass is an on-line algorithm, generalizing a well-known algorithm for finding a majority element, for identifying a set of at most 1/θ items that includes, possibly among others, all items with frequency greater than θ.
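To make the two passes concrete, here is a minimal Python sketch of the scheme the abstract describes (the generalization of the majority algorithm, in the Misra-Gries style); function and variable names are ours:

```python
import math
from collections import Counter

def frequent_candidates(stream, theta):
    """First pass: return at most 1/theta candidate items, guaranteed to
    include every item whose relative frequency exceeds theta. With
    theta = 1/2 this degenerates to the classic majority algorithm."""
    k = math.ceil(1 / theta)
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # no free counter: decrement all, dropping those that hit zero
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return set(counters)

def frequent_items(stream, theta):
    """Second pass: count the candidates exactly and keep the true hits."""
    candidates = frequent_candidates(stream, theta)
    counts = Counter(x for x in stream if x in candidates)
    return {x for x, c in counts.items() if c > theta * len(stream)}
```

The space bound comes from the first pass holding fewer than 1/θ counters at any time; the second pass is needed because the candidate set may contain false positives.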

613 citations


Journal ArticleDOI
TL;DR: A simple, natural embedding of preference formulas into relational algebra (and SQL) through a single winnow operator parameterized by a preference formula is proposed, which makes possible the formulation of complex preference queries by piggybacking on existing SQL constructs.
Abstract: The handling of user preferences is becoming an increasingly important issue in present-day information systems. Among other uses, preferences serve information filtering and extraction to reduce the volume of data presented to the user. They are also used to keep track of user profiles and to formulate policies that improve and automate decision making. We propose here a simple, logical framework for formulating preferences as preference formulas. The framework does not impose any restrictions on the preference relations, and allows arbitrary operation and predicate signatures in preference formulas. It also makes the composition of preference relations straightforward. We propose a simple, natural embedding of preference formulas into relational algebra (and SQL) through a single winnow operator parameterized by a preference formula. The embedding makes possible the formulation of complex preference queries, for example, involving aggregation, by piggybacking on existing SQL constructs. It also leads in a natural way to the definition of further, preference-related concepts like ranking. Finally, we present general algebraic laws governing the winnow operator and its interactions with other relational algebra operators. The preconditions on the applicability of the laws are captured by logical formulas. The laws provide a formal foundation for the algebraic optimization of preference queries. We demonstrate the usefulness of our approach through numerous examples.
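As a toy illustration of the winnow operator, here is a naive quadratic evaluation in Python (not the algebraically optimized evaluation the article develops; the car example is hypothetical):

```python
def winnow(relation, prefers):
    """Keep exactly the tuples not dominated under the preference
    relation: t survives iff no other tuple is preferred over it."""
    return [t for t in relation
            if not any(prefers(u, t) for u in relation if u is not t)]

# Preference formula: prefer a cheaper car of the same make.
cars = [("vw", 2002, 15000), ("vw", 2002, 12000), ("bmw", 2001, 20000)]
cheaper_same_make = lambda a, b: a[0] == b[0] and a[2] < b[2]
print(winnow(cars, cheaper_same_make))
# [('vw', 2002, 12000), ('bmw', 2001, 20000)]
```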

497 citations


Journal ArticleDOI
TL;DR: This article focuses on methods for similarity search that make the general assumption that similarity is represented with a distance metric d, and presents algorithms for common types of queries that operate on an arbitrary "search hierarchy."
Abstract: Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary "search hierarchy." These algorithms can be applied to each of the methods presented, provided a suitable search hierarchy is defined.
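The search-hierarchy framework can be sketched as a best-first traversal driven by distance lower bounds; the following Python outline is our paraphrase of that idea, with the node interface (dist_lb, is_object, children) as assumed names:

```python
import heapq, itertools

def incremental_search(root, query, dist_lb, is_object, children):
    """Best-first traversal of an arbitrary search hierarchy.
    dist_lb(e, q): lower bound on d(q, o) for any object o under e
                   (and the exact distance when e is itself an object);
    is_object(e):  True if e is a data object, False for index nodes;
    children(e):   the child elements of an index node.
    Yields data objects in order of increasing distance to the query."""
    tie = itertools.count()  # tie-breaker so the heap never compares elements
    heap = [(dist_lb(root, query), next(tie), root)]
    while heap:
        d, _, e = heapq.heappop(heap)
        if is_object(e):
            yield e, d      # stop after k results for a k-NN query
        else:
            for c in children(e):
                heapq.heappush(heap, (dist_lb(c, query), next(tie), c))
```

Because the bounds never overestimate, objects are emitted in true distance order, so k-nearest-neighbor, range, and incremental ranking queries all fall out of the same traversal.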

480 citations


Journal ArticleDOI
TL;DR: The results show that the path sharing employed by YFilter can provide order-of-magnitude performance benefits; two alternative techniques for extending YFilter's shared structure matching with support for value-based predicates are proposed, and their performance is compared.
Abstract: XML filtering systems aim to provide fast, on-the-fly matching of XML-encoded data to large numbers of query specifications containing constraints on both structure and content. It is now well accepted that approaches using event-based parsing and Finite State Machines (FSMs) can provide the basis for highly scalable structure-oriented XML filtering systems. The XFilter system [Altinel and Franklin 2000] was the first published FSM-based XML filtering approach. XFilter used a separate FSM per path query and a novel indexing mechanism to allow all of the FSMs to be executed simultaneously during the processing of a document. Building on the insights of the XFilter work, we describe a new method, called "YFilter", that combines all of the path queries into a single Nondeterministic Finite Automaton (NFA). YFilter exploits commonality among queries by merging common prefixes of the query paths such that they are processed at most once. The resulting shared processing provides tremendous improvements in structure matching performance but complicates the handling of value-based predicates. In this article, we first describe the XFilter and YFilter approaches and present results of a detailed performance comparison of structure matching for these algorithms as well as a hybrid approach. The results show that the path sharing employed by YFilter can provide order-of-magnitude performance benefits. We then propose two alternative techniques for extending YFilter's shared structure matching with support for value-based predicates, and compare the performance of these two techniques. The results of this latter study demonstrate some key differences between shared XML filtering and traditional database query processing. Finally, we describe how the YFilter approach is extended to handle more complicated queries containing nested path expressions.
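The prefix sharing can be illustrated for the simplest case, linear /a/b/c paths, where the shared automaton degenerates to a trie; handling '//' and '*' steps is what makes the real YFilter machine nondeterministic and is omitted from this sketch:

```python
def build_shared_trie(path_queries):
    """Merge simple /a/b/c path queries into one prefix-shared structure,
    so a common prefix is matched at most once for all queries."""
    root = {"accept": [], "next": {}}
    for qid, path in enumerate(path_queries):
        node = root
        for step in path:
            node = node["next"].setdefault(step, {"accept": [], "next": {}})
        node["accept"].append(qid)
    return root

def match(root, element_path):
    """Ids of the queries matched by one root-to-leaf element path."""
    node = root
    for tag in element_path:
        node = node["next"].get(tag)
        if node is None:
            return []
    return node["accept"]

queries = [["catalog", "book", "title"], ["catalog", "book", "price"]]
trie = build_shared_trie(queries)
print(match(trie, ["catalog", "book", "title"]))   # [0]
```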

422 citations


Journal ArticleDOI
TL;DR: This article proposes various refresh policies and studies their effectiveness, showing that a Poisson process is a good model for the changes of Web pages and that the proposed policies improve the "freshness" of data very significantly.
Abstract: In this article, we study how we can keep local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Websites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to keep the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date. This article proposes various refresh policies and studies their effectiveness. We first formalize the notion of "freshness" of copied data by defining two freshness metrics, and we propose a Poisson process as the change model of data sources. Based on this framework, we examine the effectiveness of the proposed refresh policies analytically and experimentally. We show that a Poisson process is a good model to describe the changes of Web pages, and we also show that our proposed refresh policies improve the "freshness" of data very significantly. In certain cases, we got orders of magnitude improvement over existing policies.
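Under the Poisson change model, the freshness of a copy refreshed at a fixed interval has a closed form; a small sketch (our derivation of the standard formula, consistent with the model described above):

```python
import math

def expected_freshness(change_rate, refresh_interval):
    """Time-averaged probability that the copy is up-to-date when the
    source changes as a Poisson process with rate `change_rate` and the
    copy is re-fetched every `refresh_interval` time units.
    Freshness t units after a refresh is e^(-lambda*t); averaging over
    one interval I gives (1 - e^(-lambda*I)) / (lambda*I)."""
    x = change_rate * refresh_interval
    return (1 - math.exp(-x)) / x

# A page that changes on average once a week, crawled once a week:
print(expected_freshness(1 / 7, 7))   # ~0.63
```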

294 citations


Journal ArticleDOI
TL;DR: The analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case, and it is shown that the Dualize and Advance algorithm has worst-case running time that is sub-exponential in the output size.
Abstract: Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem. The a priori algorithm has been successfully applied to many instances of the problem. We analyze this algorithm, prove that it is optimal when the maximally specific sentences are "small", and point out its limitations. We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has worst-case running time that is sub-exponential in the output size (i.e., the number of maximally specific sentences). We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.
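For reference, the levelwise a priori baseline that the article analyzes can be sketched in a few lines of Python, with frequent itemset mining as the canonical instance of finding maximally specific sentences (this is the baseline, not the Dualize and Advance algorithm):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Levelwise search: a k-itemset can be frequent only if all its
    (k-1)-subsets are, so each level prunes the candidates of the next."""
    def support(itemset):
        return sum(itemset <= t for t in transactions)

    items = {x for t in transactions for x in t}
    level = {frozenset([x]) for x in items if support(frozenset([x])) >= min_support}
    frequent = set(level)
    while level:
        candidates = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, len(c) - 1))}
        level = {c for c in candidates if support(c) >= min_support}
        frequent |= level
    return frequent

data = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
print(sorted(map(sorted, apriori(data, 2))))
# [['a'], ['a', 'b'], ['b'], ['c'], ['d']]
```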

256 citations


Journal ArticleDOI
TL;DR: An evaluation of several quorum types indicates that the conventional read-one/write-all-available approach is the best choice for a large range of applications requiring data replication; this matters for anyone developing code for computing clusters, since read-one/write-all-available is much simpler to implement and more flexible than quorum-based approaches.
Abstract: Data replication is playing an increasingly important role in the design of parallel information systems. In particular, the widespread use of cluster architectures often requires replicating data for performance and availability reasons. However, maintaining the consistency of the different replicas is known to cause severe scalability problems. To address this limitation, quorums are often suggested as a way to reduce the overall overhead of replication. In this article, we analyze several quorum types in order to better understand their behavior in practice. The results obtained challenge many of the assumptions behind quorum-based replication. Our evaluation indicates that the conventional read-one/write-all-available approach is the best choice for a large range of applications requiring data replication. We believe this is an important result for anybody developing code for computing clusters, as the read-one/write-all-available strategy is much simpler to implement and more flexible than quorum-based approaches. We also show that it is the best choice under a number of other selection criteria.
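The overhead gap the study measures can be illustrated with a back-of-the-envelope count of replicas touched per operation (a toy model only; the article's evaluation also accounts for messages, conflicts, and failures):

```python
def avg_replicas_touched(n, read_fraction, scheme):
    """Average number of replicas contacted per operation, a rough
    proxy for replication overhead."""
    if scheme == "majority":
        wq = n // 2 + 1          # write quorum
        rq = n - wq + 1          # read quorum; rq + wq > n guarantees overlap
    else:                        # read-one / write-all-available
        rq, wq = 1, n            # assuming all n replicas are reachable
    return read_fraction * rq + (1 - read_fraction) * wq

for scheme in ("majority", "rowaa"):
    print(scheme, avg_replicas_touched(5, 0.8, scheme))
# majority 3.0 / rowaa 1.8: on read-heavy workloads ROWAA touches fewer
# replicas on average, in line with the direction of the article's result.
```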

147 citations


Journal ArticleDOI
TL;DR: Time-parameterized and continuous versions of the most common spatial queries (i.e., window queries, nearest neighbors, spatial joins) are studied, and efficient processing algorithms and accurate cost models are proposed.
Abstract: Conventional spatial queries are usually meaningless in dynamic environments since their results may be invalidated as soon as the query or data objects move. In this paper we formulate two novel query types, time-parameterized and continuous queries, applicable in such environments. A time-parameterized query retrieves the actual result at the time when the query is issued, the expiry time of the result given the current motion of the query and database objects, and the change that causes the expiration. A continuous query retrieves tuples of the form ⟨R, T⟩, where each result R is accompanied by a future interval T during which it is valid. We study time-parameterized and continuous versions of the most common spatial queries (i.e., window queries, nearest neighbors, spatial joins), proposing efficient processing algorithms and accurate cost models.
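For intuition, the expiry-time computation can be sketched in the simplest possible setting, 1-D points moving linearly past a static window (our toy reduction of the TP window query idea; the article's algorithms handle general window, nearest-neighbor, and join queries):

```python
def tp_window_1d(points, window, horizon=float("inf")):
    """Time-parameterized window query over 1-D points (pos, vel):
    return the current result, the time at which it expires, and the
    point whose boundary crossing causes the expiration."""
    lo, hi = window
    result = [p for p in points if lo <= p[0] <= hi]
    expiry, cause = horizon, None
    for pos, vel in points:
        if vel == 0:
            continue
        inside = lo <= pos <= hi
        # boundary this point is heading for, given its current motion
        boundary = (hi if vel > 0 else lo) if inside else (lo if vel > 0 else hi)
        t = (boundary - pos) / vel
        if 0 < t < expiry:        # the earliest future event changes the result
            expiry, cause = t, (pos, vel)
    return result, expiry, cause

print(tp_window_1d([(1, 1), (5, -2)], (0, 4)))
# ([(1, 1)], 0.5, (5, -2)): the point at 5 moving left enters at t = 0.5
```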

81 citations


Journal ArticleDOI
TL;DR: This article develops an algorithm, called DCF, for Dynamic Constrained Frequent-set computation; DCF is enhanced with optimizations that exploit a lightweight structure called a segment support map, enabling it to obtain sharper bounds on the support of sets of items and to better exploit properties of constraints.
Abstract: Data mining is supposed to be an iterative and exploratory process. In this context, we are working on a project with the overall objective of developing a practical computing environment for the human-centered exploratory mining of frequent sets. One critical component of such an environment is the support for the dynamic mining of constrained frequent sets of items. Constraints enable users to impose a certain focus on the mining process; dynamic means that, in the middle of the computation, users are able to (i) change (e.g., tighten or relax) the constraints and/or (ii) change the minimum support threshold, thus having a decisive influence on subsequent computations. In a real-life situation, the available buffer space may be limited, adding another complication to the problem. In this article, we develop an algorithm, called DCF, for Dynamic Constrained Frequent-set computation. This algorithm is enhanced with a few optimizations, exploiting a lightweight structure called a segment support map. It enables DCF to (i) obtain sharper bounds on the support of sets of items, and (ii) better exploit properties of constraints. Furthermore, when handling dynamic changes to constraints, DCF relies on the concept of a delta member generating function, which generates precisely the sets of items that satisfy the new but not the old constraints. Our experimental results show the effectiveness of these enhancements.
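A segment support map can be sketched directly (a minimal illustration of how it yields support upper bounds; the DCF-specific machinery, such as delta member generating functions, is not shown):

```python
def build_segment_support_map(transactions, segments):
    """For each segment (a block of the item-domain partition), count
    the transactions containing at least one of its items."""
    return [sum(bool(seg & t) for t in transactions) for seg in segments]

def support_upper_bound(itemset, segments, ssm):
    """An itemset's support cannot exceed the support of any segment one
    of its items falls in, so the minimum over those segments bounds it."""
    return min(ssm[i] for i, seg in enumerate(segments) if seg & itemset)

transactions = [frozenset("ab"), frozenset("ac"), frozenset("d")]
segments = [frozenset("ab"), frozenset("cd")]
ssm = build_segment_support_map(transactions, segments)
print(support_upper_bound(frozenset("ad"), segments, ssm))  # 2 (true support: 0)
```

Sharper bounds let the miner prune candidate sets earlier when constraints or the support threshold change mid-computation.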

75 citations


Journal ArticleDOI
TL;DR: Two novel partitioning algorithms, the Adaptive Pick-and-Sweep Join (APSJ) and the Adaptive Divide-and-Conquer Join (ADCJ), are proposed; APSJ is shown to outperform previously suggested algorithms for many data sets, often by an order of magnitude.
Abstract: A set containment join is a join between set-valued attributes of two relations, whose join condition is specified using the subset (⊆) operator. Set containment joins are deployed in many database applications, even those that do not support set-valued attributes. In this article, we propose two novel partitioning algorithms, called the Adaptive Pick-and-Sweep Join (APSJ) and the Adaptive Divide-and-Conquer Join (ADCJ), which allow computing set containment joins efficiently. We show that APSJ outperforms previously suggested algorithms for many data sets, often by an order of magnitude. We present a detailed analysis of the algorithms and study their performance on real and synthetic data using an implemented testbed.
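A set containment join is easiest to state as its naive nested-loop specification, shown below in Python (APSJ and ADCJ get their speedups by partitioning the inputs so that most pairs are never compared):

```python
def set_containment_join(R, S):
    """R join S on r.set ⊆ s.set, with (key, set) pairs as tuples.
    This quadratic loop is the specification, not the fast algorithm."""
    return [(rk, sk) for rk, rset in R for sk, sset in S if rset <= sset]

R = [("r1", frozenset("ab")), ("r2", frozenset("abc"))]
S = [("s1", frozenset("abcd")), ("s2", frozenset("ab"))]
print(set_containment_join(R, S))
# [('r1', 's1'), ('r1', 's2'), ('r2', 's1')]
```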

74 citations


Journal ArticleDOI
TL;DR: Probabilistic cost models that estimate the selectivity of spatio-temporal window queries and joins, and the expected distance between a query and its nearest neighbor(s) are presented.
Abstract: Given a set of objects S, a spatio-temporal window query q retrieves the objects of S that will intersect the window during the (future) interval qT. A nearest neighbor query q retrieves the objects of S closest to q during qT. Given a threshold d, a spatio-temporal join retrieves the pairs of objects from two datasets that will come within distance d from each other during qT. In this article, we present probabilistic cost models that estimate the selectivity of spatio-temporal window queries and joins, and the expected distance between a query and its nearest neighbor(s). Our models capture any query/object mobility combination (moving queries, moving objects or both) and any data type (points and rectangles) in arbitrary dimensionality. In addition, we develop specialized spatio-temporal histograms, which take into account both location and velocity information, and can be incrementally maintained. Extensive performance evaluation verifies that the proposed techniques produce highly accurate estimates on both uniform and non-uniform data.
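The flavor of these models can be conveyed with the simplest special case: a window moving at constant velocity over uniformly distributed static points, where selectivity is just the fraction of space the window sweeps (a toy instance only; the article's models also cover moving objects, rectangles, arbitrary dimensionality, and non-uniform data via the histograms):

```python
def moving_window_selectivity(window, velocity, duration):
    """Selectivity of a moving window over uniform static points in the
    unit square: the area swept during the query interval. The swept
    region is the window enlarged by v*duration along each moving axis,
    clipped to the unit square (exact for axis-parallel motion; an
    upper bound for diagonal motion)."""
    (x1, y1, x2, y2), (vx, vy) = window, velocity
    x1 += min(vx, 0) * duration; x2 += max(vx, 0) * duration
    y1 += min(vy, 0) * duration; y2 += max(vy, 0) * duration
    w = max(0.0, min(x2, 1) - max(x1, 0))
    h = max(0.0, min(y2, 1) - max(y1, 0))
    return w * h

print(moving_window_selectivity((0.4, 0.4, 0.5, 0.5), (0.1, 0.0), 1.0))
# 0.02: a 0.1 x 0.1 window sweeping 0.1 to the right covers 0.2 x 0.1
```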

Journal ArticleDOI
TL;DR: In this article, it is shown how to compute reasonably small (and in special cases even minimal) complements for a large class of relational views.
Abstract: Views as a means to describe parts of a given data collection play an important role in many database applications. In dynamic environments where data is updated, not only information provided by views, but also information provided by data sources yet missing from views turns out to be relevant: Previously, this missing information has been characterized in terms of view complements; recently, it has been shown that view complements can be exploited in the context of data warehouses to guarantee desirable warehouse properties such as independence and self-maintainability. As the complete source information is a trivial complement for any view, a natural interest for "small" or even "minimal" complements arises. However, the computation of minimal complements is still not very well understood. In this article, it is shown how to compute reasonably small (and in special cases even minimal) complements for a large class of relational views.
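The defining property, that a view together with its complement determines the base relation, can be checked mechanically for projection views (a toy verification in Python; the article is about computing small complements, not about checking candidates by enumeration):

```python
def natural_join(r1, r2):
    """Natural join of two relations given as lists of attribute dicts."""
    common = set(r1[0]) & set(r2[0])
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if all(t1[a] == t2[a] for a in common)]

def project(rel, attrs):
    """Duplicate-eliminating projection onto the given attributes."""
    return [dict(t) for t in {tuple(sorted((a, row[a]) for a in attrs)) for row in rel}]

def is_complement(base, view_attrs, comp_attrs):
    """True iff the two projections together reconstruct the base
    relation exactly (no lost and no spurious tuples)."""
    joined = natural_join(project(base, view_attrs), project(base, comp_attrs))
    as_set = lambda rel: {tuple(sorted(t.items())) for t in rel}
    return as_set(joined) == as_set(base)

emp = [{"name": "ann", "dept": "db", "salary": 50},
       {"name": "bob", "dept": "os", "salary": 40}]
# The dept column is missing from the salary view; projecting it out
# together with the key 'name' yields a (small) complement.
print(is_complement(emp, ["name", "salary"], ["name", "dept"]))  # True
```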

Journal ArticleDOI
TL;DR: A general method for semantic query optimization in the framework of Object-Oriented Database Systems, implemented in a tool with an ODMG-compliant interface that allows full interaction with OQL queries while hiding the underlying Description Logic representation and techniques from the user.
Abstract: Semantic query optimization uses semantic knowledge (i.e., integrity constraints) to transform a query into an equivalent one that may be answered more efficiently. This article proposes a general method for semantic query optimization in the framework of Object-Oriented Database Systems. The method is effective for a large class of queries, including conjunctive recursive queries expressed with regular path expressions, and is based on three ingredients. The first is a Description Logic, ODLRE, providing a type system capable of expressing class descriptions, queries, views, and integrity constraint rules, together with inference techniques such as incoherence detection and subsumption computation. The second is a semantic expansion function for queries, which incorporates into one query the restrictions logically implied by the query and the schema (classes + rules). The third is an optimal rewriting method that rewrites a query into an equivalent one with respect to the schema classes, by determining more specialized classes to be accessed and by reducing the number of factors. We implemented the method in a tool providing an ODMG-compliant interface that allows full interaction with OQL queries, hiding the underlying Description Logic representation and techniques from the user.
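The core idea, folding integrity constraints into the query and then discarding redundant factors, can be caricatured with one-attribute range constraints (a deliberately tiny sketch; the article's method works on ODLRE class descriptions and subsumption, not on this ad hoc representation):

```python
def drop_implied_predicates(query_preds, constraints):
    """Remove every query predicate already implied by an integrity
    constraint, so fewer factors are evaluated. Both are modeled here
    as (attr, ">=", value) lower bounds only."""
    def implied(pred):
        attr, op, val = pred
        return any(c_attr == attr and op == c_op == ">=" and c_val >= val
                   for c_attr, c_op, c_val in constraints)
    return [p for p in query_preds if not implied(p)]

constraints = [("age", ">=", 18)]              # e.g. every Employee is an adult
query = [("age", ">=", 10), ("salary", ">=", 30000)]
print(drop_implied_predicates(query, constraints))
# [('salary', '>=', 30000)] -- the age test is implied, so it is removed
```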

Journal ArticleDOI
TL;DR: The Iterative Spatial Join is based on a plane-sweep algorithm and, when the data set does not fit in internal memory, simply makes additional passes over the data; it overcomes the performance limitations of the other algorithms for data sets of all sizes as well as differing amounts of internal memory.
Abstract: The key issue in performing spatial joins is finding the pairs of intersecting rectangles. For unindexed data sets, this is usually resolved by partitioning the data and then performing a plane sweep on the individual partitions. The resulting join can be viewed as a two-step process where the partition corresponds to a hash-based join while the plane-sweep corresponds to a sort-merge join. In this article, we look at extending the idea of the sort-merge join for one-dimensional data to multiple dimensions and introduce the Iterative Spatial Join. As with the sort-merge join, the Iterative Spatial Join is best suited to cases where the data is already sorted. However, as we show in the experiments, the Iterative Spatial Join performs well when internal memory is limited, compared to the partitioning methods. This suggests that the Iterative Spatial Join would be useful for very large data sets or in situations where internal memory is a shared resource and is therefore limited, such as with today's database engines which share internal memory amongst several queries. Furthermore, the performance of the Iterative Spatial Join is predictable and has no parameters which need to be tuned, unlike other algorithms. The Iterative Spatial Join is based on a plane sweep algorithm, which requires the entire data set to fit in internal memory. When internal memory overflows, the Iterative Spatial Join simply makes additional passes on the data, thereby exhibiting only a gradual performance degradation. To demonstrate the use and efficacy of the Iterative Spatial Join, we first examine and analyze current approaches to performing spatial joins, and then give a detailed analysis of the Iterative Spatial Join as well as present the results of extensive testing of the algorithm, including a comparison with partitioning-based spatial join methods. These tests show that the Iterative Spatial Join overcomes the performance limitations of the other algorithms for data sets of all sizes as well as differing amounts of internal memory.
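The plane-sweep core of the method, finding the pairs of intersecting rectangles, can be sketched as follows (a minimal in-memory version with a list-based active set; the article's contribution is making repeated sweeps work gracefully when memory overflows):

```python
def plane_sweep_intersections(rects):
    """Report pairs of intersecting axis-aligned rectangles
    (x1, y1, x2, y2): sweep left to right, keep rectangles whose
    x-range covers the sweep position, and test y-overlap only there."""
    order = sorted(range(len(rects)), key=lambda i: rects[i][0])
    active, out = [], []
    for i in order:
        x1 = rects[i][0]
        active = [j for j in active if rects[j][2] >= x1]  # drop ended x-ranges
        for j in active:                                   # x-ranges overlap by construction
            if rects[i][1] <= rects[j][3] and rects[j][1] <= rects[i][3]:
                out.append((j, i))
        active.append(i)
    return out

rects = [(0, 0, 2, 2), (1, 1, 3, 3), (4, 0, 5, 1)]
print(plane_sweep_intersections(rects))   # [(0, 1)]
```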

Journal ArticleDOI
TL;DR: This work presents two options for removing permissions in FAF and provides details on the option that is representation-independent.
Abstract: The Flexible Authorization Framework (FAF) defined by Jajodia et al. [2001] provides a policy-neutral framework for specifying access control policies that is expressive enough to specify many known access control policies. Although the original formulation of FAF indicated how rules could be added to or deleted from a FAF specification, it did not address the removal of access permissions from users. We present two options for removing permissions in FAF and provide details on the option that is representation-independent.