
Showing papers by "Yufei Tao" published in 2005


Journal ArticleDOI
01 Mar 2005
TL;DR: In this paper, a branch-and-bound skyline (BBS) algorithm based on nearest-neighbor search is proposed, which is I/O optimal and performs a single access only to those nodes that may contain skyline points.
Abstract: The skyline of a d-dimensional dataset contains the points that are not dominated by any other point on all dimensions. Skyline computation has recently received considerable attention in the database community, especially for progressive methods that can quickly return the initial results without reading the entire database. All the existing algorithms, however, have some serious shortcomings which limit their applicability in practice. In this article we develop branch-and-bound skyline (BBS), an algorithm based on nearest-neighbor search, which is I/O optimal, that is, it performs a single access only to those nodes that may contain skyline points. BBS is simple to implement and supports all types of progressive processing (e.g., user preferences, arbitrary dimensionality, etc.). Furthermore, we propose several interesting variations of skyline computation, and show how BBS can be applied for their efficient processing.
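
To illustrate the skyline definition in the abstract above, here is a minimal brute-force sketch. It is not the BBS algorithm: it uses no index and compares every pair of points. The names `dominates` and `naive_skyline` are hypothetical, and the sketch assumes that smaller values are preferred on every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q.

    Assumption: smaller is better on every dimension. p dominates q when p is
    no worse than q everywhere and strictly better on at least one dimension.
    """
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def naive_skyline(points: List[Point]) -> List[Point]:
    """Quadratic-time skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    data = [(1, 9), (3, 3), (9, 1), (4, 4), (5, 2)]
    print(naive_skyline(data))
```

In this toy input, (4, 4) is dropped because (3, 3) is at least as good on both dimensions and strictly better on each.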

905 citations


Proceedings Article
30 Aug 2005
TL;DR: The U-tree is proposed, an access method designed to optimize both the I/O and CPU time of range retrieval on multi-dimensional imprecise data and is fully dynamic, and does not place any constraints on the data pdfs.
Abstract: In an "uncertain database", an object o is associated with a multi-dimensional probability density function (pdf), which describes the likelihood that o appears at each position in the data space. A fundamental operation is the "probabilistic range search" which, given a value pq and a rectangular area rq, retrieves the objects that appear in rq with probabilities at least pq. In this paper, we propose the U-tree, an access method designed to optimize both the I/O and CPU time of range retrieval on multi-dimensional imprecise data. The new structure is fully dynamic (i.e., objects can be incrementally inserted/deleted in any order), and does not place any constraints on the data pdfs. We verify the query and update efficiency of U-trees with extensive experiments.
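
The probabilistic range search described above can be sketched with a linear scan. This is only an illustration of the query semantics, not the U-tree. As a simplifying assumption, each object's pdf is represented by a discrete list of weighted positions whose weights sum to 1; the function name `prob_range_search` and the data layout are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: a pdf is approximated by (position, probability) pairs.
DiscretePdf = List[Tuple[Sequence[float], float]]
Rect = Tuple[Sequence[float], Sequence[float]]  # (lower corner, upper corner)


def appearance_probability(samples: DiscretePdf, rq: Rect) -> float:
    """Total probability mass of the positions that fall inside rectangle rq."""
    lo, hi = rq
    return sum(
        weight
        for pos, weight in samples
        if all(l <= x <= h for l, x, h in zip(lo, pos, hi))
    )


def prob_range_search(objects: Dict[str, DiscretePdf], rq: Rect, pq: float) -> List[str]:
    """Return ids of objects that appear in rq with probability at least pq."""
    return [
        oid
        for oid, samples in objects.items()
        if appearance_probability(samples, rq) >= pq
    ]


if __name__ == "__main__":
    objects = {
        "a": [((1, 1), 0.5), ((2, 2), 0.25), ((8, 8), 0.25)],
        "b": [((9, 9), 0.5), ((1, 1), 0.5)],
    }
    print(prob_range_search(objects, ((0, 0), (3, 3)), 0.6))
```

Every object is scanned here, which is exactly the cost that an access method for uncertain data is meant to avoid.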

310 citations


Journal ArticleDOI
TL;DR: If Q fits in memory and
Abstract: Given two spatial datasets P (e.g., facilities) and Q (queries), an aggregate nearest neighbor (ANN) query retrieves the point(s) of P with the smallest aggregate distance(s) to points in Q. Assuming, for example, n users at locations q1, …, qn, an ANN query outputs the facility p ∈ P that minimizes the sum of distances |pqi| for 1 ≤ i ≤ n that the users have to travel in order to meet there. Similarly, another ANN query may report the point p ∈ P that minimizes the maximum distance that any user has to travel, or the minimum distance from some user to his/her closest facility. If Q fits in memory and P is indexed by an R-tree, we develop algorithms for aggregate nearest neighbors that capture several versions of the problem, including weighted queries and incremental reporting of results. Then, we analyze their performance and propose cost models for query optimization. Finally, we extend our techniques for disk-resident queries and approximate ANN retrieval. The efficiency of the algorithms and the accuracy of the cost models are evaluated through extensive experiments with real and synthetic datasets.
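
The three aggregate functions mentioned above (sum, max, min) can be illustrated with a brute-force scan over P. This sketch does not use an R-tree and is not one of the paper's algorithms; the function name `aggregate_nn` is hypothetical and Euclidean distance is assumed.

```python
import math
from typing import Callable, Iterable, Sequence, Tuple

Point = Tuple[float, ...]


def aggregate_nn(
    P: Sequence[Point],
    Q: Sequence[Point],
    agg: Callable[[Iterable[float]], float] = sum,
) -> Point:
    """Return the point of P with the smallest aggregate distance to Q.

    agg is the aggregate applied to the distances from a candidate p to all
    query points: sum, max, or min. Ties go to the earliest point in P.
    """
    return min(P, key=lambda p: agg(math.dist(p, q) for q in Q))


if __name__ == "__main__":
    facilities = [(0, 0), (4, 0), (9, 0)]
    users = [(1, 0), (2, 0), (9, 0)]
    print(aggregate_nn(facilities, users, sum))  # minimizes total travel
    print(aggregate_nn(facilities, users, max))  # minimizes the longest trip
    print(aggregate_nn(facilities, users, min))  # closest to some user
```

With these inputs the sum and max aggregates both pick (4, 0), while the min aggregate picks (9, 0) because one user is located exactly there.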

283 citations


Proceedings Article
30 Aug 2005
TL;DR: The semantics of skylines are investigated, the subspace skyline analysis is proposed, and a novel notion of skyline group is introduced which essentially is a group of objects that are coincidentally in the skyline of some subspaces.
Abstract: The skyline operator is important for multi-criteria decision making applications. Although many recent studies developed efficient methods to compute skyline objects in a specific space, the fundamental problem on the semantics of skylines remains open: Why and in which subspaces is (or is not) an object in the skyline? Practically, users may also be interested in the skylines in any subspaces. Then, what is the relationship between the skylines in the subspaces and those in the super-spaces? How can we effectively analyze the subspace skylines? Can we efficiently compute skylines in various subspaces? In this paper, we investigate the semantics of skylines, propose the subspace skyline analysis, and extend the full-space skyline computation to subspace skyline computation. We introduce a novel notion of skyline group which essentially is a group of objects that are coincidentally in the skylines of some subspaces. We identify the decisive subspaces that qualify skyline groups in the subspace skylines. The new notions concisely capture the semantics and the structures of skylines in various subspaces. Multidimensional roll-up and drill-down analysis is introduced. We also develop an efficient algorithm, Skyey, to compute the set of skyline groups and, for each subspace, the set of objects that are in the subspace skyline. A performance study is reported to evaluate our approach.

271 citations


Journal ArticleDOI
TL;DR: This work presents a threshold-based algorithm for the continuous monitoring of nearest neighbors that minimizes the communication overhead between the server and the data objects and can be used with multiple, static, or moving queries, for any distance definition.
Abstract: Assume a set of moving objects and a central server that monitors their positions over time, while processing continuous nearest neighbor queries from geographically distributed clients. In order to always report up-to-date results, the server could constantly obtain the most recent position of all objects. However, this naive solution requires the transmission of a large number of rapid data streams corresponding to location updates. Intuitively, current information is necessary only for objects that may influence some query result (i.e., they may be included in the nearest neighbor set of some client). Motivated by this observation, we present a threshold-based algorithm for the continuous monitoring of nearest neighbors that minimizes the communication overhead between the server and the data objects. The proposed method can be used with multiple, static, or moving queries, for any distance definition, and does not require additional knowledge (e.g., velocity vectors) besides object locations.

112 citations


Book ChapterDOI
22 Aug 2005
TL;DR: This work proposes adaptations of spatial access methods and search algorithms for probabilistic versions of range queries and nearest neighbors and conducts an extensive experimental study, which evaluates the effectiveness of proposed solutions.
Abstract: We study the problem of answering spatial queries in databases where objects exist with some uncertainty and they are associated with an existential probability. The goal of a thresholding probabilistic spatial query is to retrieve the objects that qualify the spatial predicates with probability that exceeds a threshold. Accordingly, a ranking probabilistic spatial query selects the objects with the highest probabilities to qualify the spatial predicates. We propose adaptations of spatial access methods and search algorithms for probabilistic versions of range queries and nearest neighbors and conduct an extensive experimental study, which evaluates the effectiveness of proposed solutions.

102 citations


Proceedings ArticleDOI
14 Jun 2005
TL;DR: A new algorithm RPJ, which maximizes the output rate by optimizing its execution according to the characteristics of the join relations (e.g., data distribution, tuple arrival pattern, etc.).
Abstract: We consider the problem of "progressively" joining relations whose records are continuously retrieved from remote sources through an unstable network that may incur temporary failures. The objectives are to (i) start reporting the first output tuples as soon as possible (before the participating relations are completely received), and (ii) produce the remaining results at a fast rate. We develop a new algorithm RPJ (Rate-based Progressive Join) based on solid theoretical analysis. RPJ maximizes the output rate by optimizing its execution according to the characteristics of the join relations (e.g., data distribution, tuple arrival pattern, etc.). Extensive experiments prove that our technique delivers results significantly faster than the previous methods.

82 citations


Journal ArticleDOI
TL;DR: Specialized methods, which integrate spatio-temporal indexing with pre-aggregation for the efficient processing of historical aggregate queries without a priori knowledge of grouping hierarchies are presented.
Abstract: Spatio-temporal databases store information about the positions of individual objects over time. However, in many applications such as traffic supervision or mobile communication systems, only summarized data, like the number of cars in an area for a specific period, or phone-calls serviced by a cell each day, is required. Although this information can be obtained from operational databases, its computation is expensive, rendering online processing inapplicable. In this paper, we present specialized methods, which integrate spatio-temporal indexing with pre-aggregation. The methods support dynamic spatio-temporal dimensions for the efficient processing of historical aggregate queries without a priori knowledge of grouping hierarchies. The superiority of the proposed techniques over existing methods is demonstrated through a comprehensive probabilistic analysis and an extensive experimental evaluation.

71 citations


Proceedings ArticleDOI
05 Apr 2005
TL;DR: Algorithms and optimization techniques for RNN queries are proposed by utilizing some characteristics of networks to solve reverse nearest neighbor queries in large graphs.
Abstract: A reverse nearest neighbor query returns the data objects that have a query point as their nearest neighbor. Although such queries have been studied quite extensively in Euclidean spaces, there is no previous work in the context of large graphs. In this paper, we propose algorithms and optimization techniques for RNN queries by utilizing some characteristics of networks.
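
The reverse nearest neighbor definition above can be made concrete on a small weighted graph. The sketch below is a naive baseline that runs Dijkstra's algorithm from every data object; it does not reproduce the paper's algorithms or optimizations. The adjacency-list layout, the function names, and the choice to count ties in favor of the query are all assumptions.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, List[Tuple[Hashable, float]]]  # node -> [(neighbor, weight)]
INF = float("inf")


def shortest_dists(graph: Graph, src: Hashable) -> Dict[Hashable, float]:
    """Dijkstra's algorithm: network distance from src to every reachable node."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, INF):
            continue  # stale heap entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, INF):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def naive_rnn(graph: Graph, objects: List[Hashable], q: Hashable) -> List[Hashable]:
    """Return the data objects whose nearest neighbor, among q and the other objects, is q."""
    result = []
    for o in objects:
        dist = shortest_dists(graph, o)
        d_q = dist.get(q, INF)
        nearest_other = min((dist.get(x, INF) for x in objects if x != o), default=INF)
        # Assumption: a tie between q and another object counts for q.
        if d_q < INF and d_q <= nearest_other:
            result.append(o)
    return result


if __name__ == "__main__":
    # Undirected path a - b - q - c - d with edge weights 1, 1, 3, 1.
    g = {
        "a": [("b", 1)],
        "b": [("a", 1), ("q", 1)],
        "q": [("b", 1), ("c", 3)],
        "c": [("q", 3), ("d", 1)],
        "d": [("c", 1)],
    }
    print(naive_rnn(g, ["a", "c", "d"], "q"))
```

Here only "a" qualifies: "c" and "d" are closer to each other than either is to the query node.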

60 citations


Proceedings ArticleDOI
05 Apr 2005
TL;DR: Venn sampling (VS), a novel estimation method optimized for a set of "pivot queries" that reflect the distribution of actual ones, is developed, which permits the development of a novel "query-driven" update policy, which reduces the update cost of conventional policies significantly.
Abstract: Given a region q_R and a future timestamp q_T, a "range aggregate" query estimates the number of objects expected to appear in q_R at time q_T. Currently the only methods for processing such queries are based on spatio-temporal histograms, which have several serious problems. First, they consume considerable space in order to provide accurate estimation. Second, they incur high evaluation cost. Third, their efficiency continuously deteriorates with time. Fourth, their maintenance requires significant update overhead. Motivated by this, we develop Venn sampling (VS), a novel estimation method optimized for a set of "pivot queries" that reflect the distribution of actual ones. In particular, given m pivot queries, VS achieves perfect estimation with only O(m) samples, as opposed to O(2^m) required by the current state of the art in workload-aware sampling. Compared with histograms, our technique is much more accurate (given the same space), produces estimates with negligible cost, and does not deteriorate with time. Furthermore, it permits the development of a novel "query-driven" update policy, which reduces the update cost of conventional policies significantly.

21 citations