
Showing papers in "ACM Transactions on Database Systems in 2005"


Journal ArticleDOI
TL;DR: iDistance is an efficient B+-tree based indexing method for K-nearest neighbor (KNN) search in a high-dimensional metric space; it partitions the data based on a space- or data-partitioning strategy and selects a reference point for each partition.
Abstract: In this article, we present an efficient B+-tree based indexing method, called iDistance, for K-nearest neighbor (KNN) search in a high-dimensional metric space. iDistance partitions the data based on a space- or data-partitioning strategy, and selects a reference point for each partition. The data points in each partition are transformed into a single dimensional value based on their similarity with respect to the reference point. This allows the points to be indexed using a B+-tree structure and KNN search to be performed using one-dimensional range search. The choice of partition and reference points adapts the index structure to the data distribution. We conducted extensive experiments to evaluate the iDistance technique, and report results demonstrating its effectiveness. We also present a cost model for iDistance KNN search, which can be exploited in query optimization.
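
The heart of the method is a one-dimensional key that any ordered index can store. Below is a minimal sketch of that transformation under stated assumptions: a sorted Python list stands in for the B+-tree, and the stride constant C and all helper names are illustrative, not from the article.

```python
import bisect
import math

C = 1000.0  # partition stride; must exceed any within-partition distance

def idistance_key(point, pid, refs):
    """iDistance transformation: 1D key = partition offset plus the
    point's distance to that partition's reference point."""
    return pid * C + math.dist(point, refs[pid])

def build_index(points, refs):
    """Assign each point to its closest reference point; the sorted
    (key, point) list stands in for the paper's B+-tree."""
    index = []
    for p in points:
        pid = min(range(len(refs)), key=lambda i: math.dist(p, refs[i]))
        index.append((idistance_key(p, pid, refs), p))
    index.sort()
    return index

def candidates(index, refs, q, pid, r):
    """Probe with radius r: by the triangle inequality, only points whose
    distance to the reference lies in [d - r, d + r] (d being the query's
    distance to the reference) can be within r of q, so a one-dimensional
    range search over the keys of partition pid suffices."""
    d = math.dist(q, refs[pid])
    keys = [k for k, _ in index]
    lo = bisect.bisect_left(keys, pid * C + max(d - r, 0.0))
    hi = bisect.bisect_right(keys, pid * C + d + r)
    return [p for _, p in index[lo:hi]]

refs = [(0.2, 0.2), (0.8, 0.8)]
idx = build_index([(0.1, 0.3), (0.7, 0.9), (0.85, 0.75)], refs)
print(candidates(idx, refs, (0.15, 0.25), 0, r=0.2))  # [(0.1, 0.3)]
```

KNN search then repeatedly enlarges r until K answers are confirmed, which is how the one-dimensional range searches accumulate into a K-nearest-neighbor result.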

607 citations


Journal ArticleDOI
TL;DR: A theory is developed that characterizes when nonserializable executions of applications can occur under Snapshot Isolation, and it is applied to demonstrate that the TPC-C benchmark application has no serialization anomalies under SI, and how this demonstration can be generalized to other applications.
Abstract: Snapshot Isolation (SI) is a multiversion concurrency control algorithm, first described in Berenson et al. [1995]. SI is attractive because it provides an isolation level that avoids many of the common concurrency anomalies, and has been implemented by Oracle and Microsoft SQL Server (with certain minor variations). SI does not guarantee serializability in all cases, but the TPC-C benchmark application [TPC-C], for example, executes under SI without serialization anomalies. All major database system products are delivered with default nonserializable isolation levels, often ones that encounter serialization anomalies more commonly than SI, and we suspect that numerous isolation errors occur each day at many large sites because of this, leading to corrupt data sometimes noted in data warehouse applications. The classical justification for lower isolation levels is that applications can be run under such levels to improve efficiency when they can be shown not to result in serious errors, but little or no guidance has been offered to application programmers and DBAs by vendors as to how to avoid such errors. This article develops a theory that characterizes when nonserializable executions of applications can occur under SI. Near the end of the article, we apply this theory to demonstrate that the TPC-C benchmark application has no serialization anomalies under SI, and then discuss how this demonstration can be generalized to other applications. We also present a discussion on how to modify the program logic of applications that are nonserializable under SI so that serializability will be guaranteed.
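
A minimal simulation of the anomaly class the theory characterizes may help. The sketch below reproduces classic write skew under a toy snapshot-isolation scheduler; the account names, the invariant x + y >= 0, and the scheduler itself are illustrative assumptions, not material from the article.

```python
db = {"x": 50, "y": 50}

def run_under_si(db, t1, t2):
    """Toy SI scheduler: each transaction reads its own snapshot, and
    both commit if their write sets are disjoint (first-committer-wins
    handles write-write conflicts)."""
    snap1, snap2 = dict(db), dict(db)
    w1, w2 = t1(snap1), t2(snap2)
    if set(w1) & set(w2):
        raise RuntimeError("write-write conflict: one transaction aborts")
    db.update(w1)
    db.update(w2)

def withdraw(account, amount):
    """Withdraw from one account if the snapshot keeps x + y nonnegative."""
    def txn(snapshot):
        assert snapshot["x"] + snapshot["y"] >= amount
        return {account: snapshot[account] - amount}
    return txn

run_under_si(db, withdraw("x", 60), withdraw("y", 60))
print(db)  # {'x': -10, 'y': -10}: each snapshot looked consistent and the
           # write sets were disjoint, yet x + y >= 0 is now violated
```

No write-write conflict occurs, so SI admits the schedule; this is exactly the kind of nonserializable execution the article's theory is designed to detect.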

351 citations


Journal ArticleDOI
TL;DR: Assuming Q fits in memory and P is indexed by an R-tree, algorithms are developed for aggregate nearest neighbor queries that capture several versions of the problem, including weighted queries and incremental reporting of results.
Abstract: Given two spatial datasets P (e.g., facilities) and Q (queries), an aggregate nearest neighbor (ANN) query retrieves the point(s) of P with the smallest aggregate distance(s) to points in Q. Assuming, for example, n users at locations q1, …, qn, an ANN query outputs the facility p ∈ P that minimizes the sum of distances |pqi| for 1 ≤ i ≤ n that the users have to travel in order to meet there. Similarly, another ANN query may report the point p ∈ P that minimizes the maximum distance that any user has to travel, or the minimum distance from some user to his/her closest facility. If Q fits in memory and P is indexed by an R-tree, we develop algorithms for aggregate nearest neighbors that capture several versions of the problem, including weighted queries and incremental reporting of results. Then, we analyze their performance and propose cost models for query optimization. Finally, we extend our techniques for disk-resident queries and approximate ANN retrieval. The efficiency of the algorithms and the accuracy of the cost models are evaluated through extensive experiments with real and synthetic datasets.
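
The variants differ only in the aggregate applied to the per-user distances, which a brute-force baseline makes concrete; the article's contribution is answering these queries efficiently via the R-tree rather than by this exhaustive scan.

```python
import math

def ann(P, Q, agg=sum):
    """Brute-force aggregate nearest neighbor: the point of P that
    minimizes the aggregate of its distances to the query points Q.
    agg=sum is the meeting-point variant, agg=max minimizes the
    worst-off user's travel, agg=min favors proximity to any one user."""
    return min(P, key=lambda p: agg(math.dist(p, q) for q in Q))

P = [(0, 0), (5, 5), (9, 1)]
Q = [(4, 4), (6, 6)]
print(ann(P, Q), ann(P, Q, agg=max))  # (5, 5) wins under both aggregates
```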

283 citations


Journal ArticleDOI
TL;DR: It is shown that XPath can be processed much more efficiently, and main-memory algorithms for this problem with polynomial-time combined query evaluation complexity are proposed, whose main ideas can be profitably integrated into existing XPath processors.
Abstract: Our experimental analysis of several popular XPath processors reveals a striking fact: Query evaluation in each of the systems requires time exponential in the size of queries in the worst case. We show that XPath can be processed much more efficiently, and propose main-memory algorithms for this problem with polynomial-time combined query evaluation complexity. Moreover, we show how the main ideas of our algorithm can be profitably integrated into existing XPath processors. Finally, we present two fragments of XPath for which linear-time query processing algorithms exist and another fragment with linear-space/quadratic-time query processing.
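
The source of the exponential behavior is per-node recursive re-evaluation of subexpressions; the polynomial algorithms instead evaluate each location step once for a whole set of context nodes. Below is a minimal sketch in that spirit, with a hypothetical Node class and only two axes supported.

```python
class Node:
    def __init__(self, tag, children=()):
        self.tag, self.children = tag, list(children)

def descendants(n):
    for c in n.children:
        yield c
        yield from descendants(c)

def step(context, axis, tag):
    """Map a set of context nodes to the set of nodes reached by one
    location step; deduplication keeps every step polynomial."""
    out, seen = [], set()
    for n in context:
        for m in (n.children if axis == "child" else descendants(n)):
            if m.tag == tag and id(m) not in seen:
                seen.add(id(m))
                out.append(m)
    return out

def evaluate(root, path):
    """path is a list of (axis, tag) steps; the analogue of //a/b is
    [("descendant", "a"), ("child", "b")]."""
    ctx = [root]
    for axis, tag in path:
        ctx = step(ctx, axis, tag)
    return ctx

tree = Node("r", [Node("a", [Node("b"), Node("a", [Node("b")])])])
print(len(evaluate(tree, [("descendant", "a"), ("child", "b")])))  # 2
```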

222 citations


Journal ArticleDOI
Jef Wijsen1
TL;DR: This work proposes a theoretical framework that also covers updates as a repair primitive, and introduces the construct of a nucleus: a single database that yields consistent answers to a class of queries, without the need for query rewriting.
Abstract: Repairing a database means bringing the database in accordance with a given set of integrity constraints by applying some minimal change. If a database can be repaired in more than one way, then the consistent answer to a query is defined as the intersection of the query answers on all repaired versions of the database. Earlier approaches have confined the repair work to deletions and insertions of entire tuples. We propose a theoretical framework that also covers updates as a repair primitive. Update-based repairing is interesting in that it allows rectifying an error within a tuple without deleting the tuple, thereby preserving consistent values in the tuple. Another novel idea is the construct of nucleus: a single database that yields consistent answers to a class of queries, without the need for query rewriting. We show the construction of nuclei for full dependencies and conjunctive queries. Consistent query answering and constructing nuclei are generally intractable under update-based repairing. Nevertheless, we also show some tractable cases of practical interest.
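
For intuition, the classical semantics can be reproduced in a few lines: enumerate the minimal repairs and intersect the query answers across them. The toy relation and the deletion-based repairs below are illustrative; the article's point is precisely that update-based repairs can retain more of a tuple, and that a nucleus avoids enumerating repairs at query time.

```python
from itertools import product

# Toy relation emp(name, dept) violating the key name -> dept.
emp = [("ann", "cs"), ("ann", "math"), ("bob", "cs")]

def deletion_repairs(rel):
    """Classical deletion-based repairs: keep exactly one tuple from
    each group of tuples that share a key value."""
    groups = {}
    for t in rel:
        groups.setdefault(t[0], []).append(t)
    for choice in product(*groups.values()):
        yield set(choice)

def consistent_answers(rel, query):
    """Consistent answer = intersection of the answers on all repairs."""
    return set.intersection(*(query(r) for r in deletion_repairs(rel)))

# Every repair keeps both ann and bob, so both names are consistent
# answers even though ann's dept differs across repairs.
print(consistent_answers(emp, lambda r: {name for name, _ in r}))
```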

212 citations


Journal ArticleDOI
TL;DR: A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema).
Abstract: A schema mapping is a specification that describes how data structured under one schema (the source schema) is to be transformed into data structured under a different schema (the target schema). A...

162 citations


Journal ArticleDOI
TL;DR: XQuery By Example (XQBE), as discussed by the authors, is a visual query language for expressing a large subset of XQuery, designed both for unskilled users and for expert users who want to speed up the construction of their queries.
Abstract: The spreading of XML data in many contexts of modern computing infrastructures and systems causes a pressing need for adequate XML querying capabilities; to address this need, the W3C is proposing XQuery as the standard query language for XML, with a language paradigm and a syntactic flavor comparable to the SQL relational language. XQuery is designed for meeting the requirements of skilled database programmers; its inherent complexity makes the new language unsuited to unskilled users. In this article we present XQBE (XQuery By Example), a visual query language for expressing a large subset of XQuery in a visual form. In designing XQBE, we targeted both unskilled users and expert users wishing to speed up the construction of their queries; we have been inspired by QBE, a relational language initially proposed as an alternative to SQL, which is supported by Microsoft Access. QBE is extremely successful among users who are not computer professionals and do not understand the subtleties of query languages, as well as among professionals who can draft their queries very quickly. According to the hierarchical nature of XML, XQBE's main graphical elements are trees. One or more trees denote the documents assumed as query input, and one tree denotes the document produced by the query. Similar to QBE, trees are annotated so as to express selection predicates, joins, and the passing of information from the input trees to the output tree. This article formally defines the syntax and semantics of XQBE, provides a large set of examples, and presents a prototype implementation.

118 citations


Journal ArticleDOI
TL;DR: A sound and complete algorithm is given for solving the implication of dimension constraints, using heuristics based on the structure of the dimension and the constraints to speed up its execution.
Abstract: In multidimensional data models intended for online analytic processing (OLAP), data are viewed as points in a multidimensional space. Each dimension has structure, described by a directed graph of categories, a set of members for each category, and a child/parent relation between members. An important application of this structure is to use it to infer summarizability, that is, whether an aggregate view defined for some category can be correctly derived from a set of precomputed views defined for other categories. A dimension is called structurally heterogeneous if two members in a given category are allowed to have ancestors in different categories. In this article, we propose a class of integrity constraints, dimension constraints, that allow us to reason about summarizability in heterogeneous dimensions. We introduce the notion of frozen dimensions which are minimal homogeneous dimension instances representing the different structures that are implicitly combined in a heterogeneous dimension. Frozen dimensions provide the basis for efficiently testing the implication of dimension constraints and are a useful aid to understanding heterogeneous dimensions. We give a sound and complete algorithm for solving the implication of dimension constraints that uses heuristics based on the structure of the dimension and the constraints to speed up its execution. We study the intrinsic complexity of the implication problem and the running time of our algorithm.
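
In code, checking summarizability for a simple sum and performing the rollup is a one-pass computation over the child/parent relation; the city/region dimension below is a toy example of ours, and a heterogeneous dimension would be one where some city lacks a region parent, making the check fail.

```python
# Precomputed view at category "city" and the child/parent relation
# from cities to regions (illustrative data).
city_sales = {"paris": 10, "lyon": 5, "geneva": 7}
city_to_region = {"paris": "fr", "lyon": "fr", "geneva": "ch"}

def summarizable(members, parent_of):
    """The region view is derivable from the city view only if every
    city rolls up to exactly one region."""
    return all(m in parent_of for m in members)

def rollup(view, parent_of):
    """Derive the parent-category aggregate from the precomputed view."""
    out = {}
    for member, value in view.items():
        out[parent_of[member]] = out.get(parent_of[member], 0) + value
    return out

assert summarizable(city_sales, city_to_region)
print(rollup(city_sales, city_to_region))  # {'fr': 15, 'ch': 7}
```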

98 citations


Journal ArticleDOI
TL;DR: A detailed experimental study is presented that characterizes the performance of XSQ and related systems, and that illustrates the performance implications of XPath features such as closures.
Abstract: We have implemented and released the XSQ system for evaluating XPath queries on streaming XML data. XSQ supports XPath features such as multiple predicates, closures, and aggregation, which pose interesting challenges for streaming evaluation. Our implementation is based on a hierarchical arrangement of augmented finite state automata. A design goal of XSQ is to buffer data for as short a time as possible. We present a detailed experimental study that characterizes the performance of XSQ and related systems, and that illustrates the performance implications of XPath features such as closures.
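
For a taste of streaming evaluation, the sketch below matches the fixed path /a//b over SAX-like events without building the document tree. It ignores the predicates, aggregation, and buffering concerns that XSQ's augmented automata handle, and all names are illustrative.

```python
def stream_match(events):
    """Yield the depth of each //b match below a top-level <a>,
    consuming (kind, tag) start/end events in document order."""
    depth = 0
    a_depth = None          # depth at which /a is currently open
    for kind, tag in events:
        if kind == "start":
            depth += 1
            if depth == 1 and tag == "a":
                a_depth = depth
            elif a_depth is not None and tag == "b":
                yield depth  # a //b match anywhere below the open <a>
        else:
            if depth == a_depth:
                a_depth = None
            depth -= 1

events = [("start", "a"), ("start", "x"), ("start", "b"), ("end", "b"),
          ("end", "x"), ("end", "a")]
print(list(stream_match(events)))  # [3]
```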

70 citations


Journal ArticleDOI
TL;DR: While the main contributions are conceptual, the federated model, FISQL/FIRA, and the notion of transformational completeness nevertheless have important applications to data integration and OLAP.
Abstract: In this article, we develop a relational algebra for metadata integration, Federated Interoperable Relational Algebra (FIRA). FIRA has many desirable properties such as compositionality, closure, a deterministic semantics, a modest complexity, support for nested queries, a subalgebra equivalent to canonical Relational Algebra (RA), and robustness under certain classes of schema evolution. Beyond this, FIRA queries are capable of producing fully dynamic output schemas, where the number of relations and/or the number of columns in relations of the output varies dynamically with the input instance. Among existing query languages for relational metadata integration, only FIRA provides generalized dynamic output schemas, where the values in any (fixed) number of input columns can determine output schemas. Further contributions of this article include development of an extended relational model for metadata integration, the Federated Relational Data Model, which is strictly downward compatible with the relational model. Additionally, we define the notion of Transformational Completeness for relational query languages and postulate FIRA as a canonical transformationally complete language. We also give a declarative, SQL-like query language that is equivalent to FIRA, called Federated Interoperable Structured Query Language (FISQL). While our main contributions are conceptual, the federated model, FISQL/FIRA, and the notion of transformational completeness nevertheless have important applications to data integration and OLAP. In addition to summarizing these applications, we illustrate the use of FIRA to optimize FISQL queries using rule-based transformations that directly parallel their canonical relational counterparts. We conclude the article with an extended discussion of related work as well as an indication of current and future work on FISQL/FIRA.
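
A pivot makes the idea of a dynamic output schema tangible: the output columns are drawn from the values of an input column, so the schema varies with the instance in a way a fixed-schema relational query cannot express. The sketch below is an illustration of the concept, not FIRA or FISQL syntax.

```python
rows = [("2004", "q1", 10), ("2004", "q2", 12), ("2005", "q1", 9)]

def pivot(rows):
    """Promote the values of the second column to output column names;
    the output schema is determined by the input instance."""
    cols = sorted({quarter for _, quarter, _ in rows})
    table = {}
    for year, quarter, v in rows:
        table.setdefault(year, {})[quarter] = v
    return cols, table

cols, table = pivot(rows)
print(["year"] + cols)  # ['year', 'q1', 'q2']
for year in sorted(table):
    # Missing (year, quarter) combinations surface as None.
    print([year] + [table[year].get(c) for c in cols])
```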

Journal ArticleDOI
TL;DR: In this paper, a high-availability scalable distributed data structure (LH*RS) is proposed, in which the value of k transparently grows with the file to offset the reliability decline and only the number of the storage nodes potentially limits the file growth.
Abstract: LH*RS is a high-availability scalable distributed data structure (SDDS). An LH*RS file is hash partitioned over the distributed RAM of a multicomputer, for example, a network of PCs, and supports the unavailability of any k ≥ 1 of its server nodes. The value of k transparently grows with the file to offset the reliability decline. Only the number of the storage nodes potentially limits the file growth. The high-availability management uses a novel parity calculus that we have developed, based on Reed-Solomon erasure correcting coding. The resulting parity storage overhead is about the lowest possible. The parity encoding and decoding are faster than for any other candidate coding we are aware of. We present our scheme and its performance analysis, including experiments with a prototype implementation on Wintel PCs. The capabilities of LH*RS offer new perspectives to data intensive applications, including the emerging ones of grids and of P2P computing.
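
The parity idea is easiest to see in the single-failure case, where erasure coding degenerates to XOR. The sketch below covers only k = 1 and uses an invented bucket layout; the article's Reed-Solomon calculus is what generalizes this to arbitrary k with near-minimal storage overhead.

```python
def parity(buckets):
    """XOR all data buckets byte by byte into one parity bucket."""
    out = bytearray(len(buckets[0]))
    for b in buckets:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def recover(surviving, parity_bucket):
    """XOR of the parity bucket with all surviving buckets rebuilds
    the single lost bucket."""
    return parity(surviving + [parity_bucket])

data = [b"abcd", b"efgh", b"ijkl"]
p = parity(data)
assert recover([data[0], data[2]], p) == data[1]  # bucket 1 lost, rebuilt
```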

Journal ArticleDOI
TL;DR: A metric called mapping redundancy is introduced to characterize the efficiency of a mapping method in terms of disk page accesses, and its behavior is analyzed for point, range, and kNN queries.
Abstract: Multidimensional data points can be mapped to one-dimensional space to exploit single dimensional indexing structures such as the B+-tree. In this article we present a Generalized structure for data Mapping and query Processing (GiMP), which supports extensible mapping methods and query processing. GiMP can be easily customized to behave like many competent indexing mechanisms for multi-dimensional indexing, such as the UB-Tree, the Pyramid technique, the iMinMax, and the iDistance. Besides being an extendible indexing structure, GiMP also serves as a framework to study the characteristics of the mapping and hence the efficiency of the indexing scheme. Specifically, we introduce a metric called mapping redundancy to characterize the efficiency of a mapping method in terms of disk page accesses and analyze its behavior for point, range and kNN queries. We also address the fundamental problem of whether an efficient mapping exists and how to define such a mapping for a given data set.
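
To see why such a metric matters, consider the classic bit-interleaving (Z-order) mapping used by the UB-Tree: a 2D window query becomes one 1D key interval, but the interval can cover many keys outside the window, and those extra accesses are exactly what a redundancy measure penalizes. The example below is ours, not the article's definition.

```python
def z_key(x, y, bits=8):
    """Interleave the bits of x and y into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
    return key

points = [(x, y) for x in range(16) for y in range(16)]
keys = sorted((z_key(x, y), (x, y)) for x, y in points)

# The window 3 <= x, y <= 6 maps to the key interval [z(3,3), z(6,6)],
# since interleaving is monotone in each coordinate.
lo, hi = z_key(3, 3), z_key(6, 6)
fetched = [p for k, p in keys if lo <= k <= hi]
hits = [p for p in fetched if 3 <= p[0] <= 6 and 3 <= p[1] <= 6]
print(len(fetched), len(hits))  # 46 fetched for only 16 true answers
```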

Journal ArticleDOI
TL;DR: This work provides tight upper bounds for the maximal number of candidate patterns that can be generated on the next level of the standard levelwise algorithm, derived from a combinatorial result from the sixties by Kruskal and Katona.
Abstract: In the context of mining for frequent patterns using the standard levelwise algorithm, the following question arises: given the current level and the current set of frequent patterns, what is the maximal number of candidate patterns that can be generated on the next level? We answer this question by providing tight upper bounds, derived from a combinatorial result from the sixties by Kruskal and Katona. Our result is useful to secure existing algorithms from a combinatorial explosion of the number of candidate patterns.
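
The recipe is mechanical: write the number of frequent k-patterns in its unique k-cascade (canonical binomial) representation, then shift every binomial coefficient up one level. The implementation below is our reading of that recipe and is worth checking against the article for edge cases.

```python
from math import comb

def cascade(n, k):
    """k-canonical (cascade) representation of n: the unique sum
    n = C(m_k, k) + C(m_{k-1}, k-1) + ... with m_k > m_{k-1} > ..."""
    terms = []
    while n > 0 and k > 0:
        m = k
        while comb(m + 1, k) <= n:
            m += 1
        terms.append((m, k))
        n -= comb(m, k)
        k -= 1
    return terms

def max_candidates(n_frequent, k):
    """Upper bound (via Kruskal-Katona) on the number of (k+1)-candidates
    the levelwise algorithm can generate from n_frequent k-patterns."""
    return sum(comb(m, j + 1) for m, j in cascade(n_frequent, k))

print(max_candidates(10, 2))  # 10 = C(5,2), so at most C(5,3) = 10 candidates
```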

Journal ArticleDOI
TL;DR: This work considers the view maintenance problem for the situation when the database contains a weighted graph and the view is either the transitive closure or the answer to the all-pairs shortest-distance problem (APSD).
Abstract: Given a database, the view maintenance problem is concerned with the efficient computation of the new contents of a given view when updates to the database happen. We consider the view maintenance problem for the situation when the database contains a weighted graph and the view is either the transitive closure or the answer to the all-pairs shortest-distance problem (APSD). We give incremental algorithms for APSD, which support both edge insertions and deletions. For transitive closure, the algorithm is applicable to a more general class of graphs than those previously explored. Our algorithms use first-order queries, along with addition (+) and less-than (<).
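
For the insertion half, the standard observation (assuming nonnegative weights) is that any improved shortest path must use the new edge, which yields a quadratic distance-matrix update; the sketch below shows that baseline in ordinary Python rather than the article's first-order-query formulation, and deletions are considerably harder.

```python
INF = float("inf")

def insert_edge(d, u, v, w):
    """After inserting edge (u, v) with weight w >= 0, any improved
    shortest distance must route through it, so trying
    d[a][u] + w + d[v][b] for every pair (a, b) suffices."""
    for a in d:
        for b in d:
            cand = d[a][u] + w + d[v][b]
            if cand < d[a][b]:
                d[a][b] = cand

d = {a: {b: (0 if a == b else INF) for b in "xyz"} for a in "xyz"}
insert_edge(d, "x", "y", 1)
insert_edge(d, "y", "z", 2)
print(d["x"]["z"])  # 3: x -> y -> z, maintained without recomputing APSD
```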

Journal ArticleDOI
TL;DR: A new approach is proposed based on the recent trend of self-tuning DBMS by which the cost model is maintained dynamically and incrementally as UDFs are being executed online.
Abstract: Query optimizers in object-relational database management systems typically require users to provide the execution cost models of user-defined functions (UDFs). Despite this need, however, there has been little work done to provide such a model. The existing approaches are static in that they require users to train the model a priori with pregenerated UDF execution cost data. Static approaches can not adapt to changing UDF execution patterns and thus degrade in accuracy when the UDF executions used for generating training data do not reflect the patterns of those performed during operation. This article proposes a new approach based on the recent trend of self-tuning DBMS by which the cost model is maintained dynamically and incrementally as UDFs are being executed online. In the context of UDF cost modeling, our approach faces a number of challenges, that is, it should work with limited memory, work with limited computation time, and adjust to the fluctuations in the execution costs (e.g., caching effect). In this article, we first provide a set of guidelines for developing techniques that meet these challenges, while achieving accurate and fast cost prediction with small overheads. Then, we present two concrete techniques developed under the guidelines. One is an instance-based technique based on the conventional k-nearest neighbor (KNN) technique which uses a multidimensional index like the R*-tree. The other is a summary-based technique which uses the quadtree to store summary values at multiple resolutions. We have performed extensive performance evaluations comparing these two techniques against existing histogram-based techniques and the KNN technique, using both real and synthetic UDFs/data sets. The results show our techniques provide better performance in most situations considered.
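
The instance-based flavor can be sketched in a few lines: record (arguments, cost) pairs as the UDF executes and predict by averaging the k nearest recorded neighbors. The class below is a simplification of ours; it keeps samples in a flat list, whereas the article bounds memory and computation and copes with cost fluctuations.

```python
import heapq
import math

class KnnCostModel:
    """Online instance-based cost model sketch: observe argument/cost
    pairs as the UDF runs, predict as the mean cost of the k samples
    whose argument vectors are nearest the new call's arguments."""
    def __init__(self, k=3):
        self.k, self.samples = k, []

    def observe(self, args, cost):
        self.samples.append((tuple(args), cost))

    def predict(self, args):
        nearest = heapq.nsmallest(
            self.k, self.samples, key=lambda s: math.dist(s[0], args))
        return sum(c for _, c in nearest) / len(nearest)

model = KnnCostModel(k=2)
for x in range(10):
    model.observe((x,), 2.0 * x)  # pretend cost grows with the argument
print(model.predict((4.2,)))      # 9.0, the mean of the costs at x=4 and x=5
```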

Journal ArticleDOI
TL;DR: A number of alternative synopsis structures have been proposed, but histograms, as discussed by the authors, have been the most successful synopsis structures for query optimization in the past few years.
Abstract: Database systems use precomputed synopses of data to estimate the cost of alternative plans during query optimization. A number of alternative synopsis structures have been proposed, but histograms...
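
As a reminder of the mechanics such synopses support, here is a minimal equi-depth histogram estimating the selectivity of a one-sided range predicate by interpolating inside the bucket containing the constant; all names and parameter choices are ours.

```python
def equi_depth(values, nbuckets):
    """Bucket boundaries such that each bucket holds ~|values|/nbuckets
    values (the classic equi-depth construction)."""
    values = sorted(values)
    step = len(values) / nbuckets
    return [values[min(int(step * i), len(values) - 1)]
            for i in range(nbuckets + 1)]

def est_selectivity(bounds, c):
    """Estimated fraction of values <= c: whole buckets below c plus a
    linear interpolation within the bucket containing c."""
    if c <= bounds[0]:
        return 0.0
    nb = len(bounds) - 1
    for i in range(nb):
        if c <= bounds[i + 1]:
            frac = (c - bounds[i]) / (bounds[i + 1] - bounds[i] or 1)
            return (i + frac) / nb
    return 1.0

data = list(range(1000))
bounds = equi_depth(data, 10)
print(est_selectivity(bounds, 250))  # 0.25 for this uniform data
```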