Showing papers by "Reynold Cheng published in 2008"


Proceedings ArticleDOI
07 Apr 2008
TL;DR: The constrained probabilistic nearest-neighbor query (C-PNN) is proposed, which returns the IDs of objects whose probabilities of being the nearest neighbor exceed a given threshold, subject to a given error bound on the answers.
Abstract: In applications like location-based services, sensor monitoring and biological databases, the values of the database items are inherently uncertain in nature. An important query for uncertain objects is the probabilistic nearest-neighbor query (PNN), which computes the probability of each object being the nearest neighbor of a query point. Evaluating this query is computationally expensive, since it must consider the relationships among uncertain objects and requires numerical integration or Monte Carlo methods. Sometimes, a query user may not be concerned about the exact probability values; for example, they may only need answers that have sufficiently high confidence. We thus propose the constrained probabilistic nearest-neighbor query (C-PNN), which returns the IDs of objects whose probabilities are higher than some threshold, with a given error bound in the answers. The C-PNN can be answered efficiently with probabilistic verifiers. These are methods that derive lower and upper bounds on answer probabilities, so that an object can be quickly decided on whether it should be included in the answer. We have developed three probabilistic verifiers, which can be used on uncertain data with arbitrary probability density functions. Extensive experiments were performed to examine the effectiveness of these approaches.
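
To make the verifier idea concrete, here is a minimal sketch of bound-based filtering in Python; the function name, the tolerance semantics, and the threshold handling are assumptions for illustration, not taken from the paper:

```python
# Hypothetical sketch: classify an object from bounds on its probability of
# being the nearest neighbor, without computing the exact probability.
def verify(lower: float, upper: float, threshold: float, delta: float) -> str:
    """`delta` is the error tolerance: objects whose probability lies in
    [threshold - delta, threshold) may be reported either way."""
    if lower >= threshold:
        return "accept"     # the probability surely clears the threshold
    if upper < threshold - delta:
        return "reject"     # cannot qualify even within the error bound
    return "unknown"        # bounds too loose; refine (e.g., integrate numerically)

print(verify(0.85, 0.95, threshold=0.8, delta=0.05))  # -> accept
```

An object classified as "unknown" would be passed to a more expensive refinement step, which is exactly the cost the verifiers aim to avoid for most objects.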

171 citations


Proceedings ArticleDOI
07 Apr 2008
TL;DR: This paper presents a model for handling arbitrary probabilistic uncertain data natively at the database level; the model is consistent with possible worlds semantics and closed under basic relational operators.
Abstract: The inherent uncertainty of data in numerous applications such as sensor databases, text annotations, and information retrieval motivates the need to handle imprecise data at the database level. Uncertainty can be at the attribute or tuple level and is present in both continuous and discrete data domains. This paper presents a model for handling arbitrary probabilistic uncertain data (both discrete and continuous) natively at the database level. Our approach leads to a natural and efficient representation for probabilistic data. We develop a model that is consistent with possible worlds semantics and closed under basic relational operators. This is the first model that accurately and efficiently handles both continuous and discrete uncertainty. The model is implemented in a real database system (PostgreSQL), and the effectiveness and efficiency of our approach are validated experimentally.
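
As a rough illustration of attribute-level uncertainty that can be either discrete or continuous, the following Python sketch (all names are assumptions; the paper's actual representation is more sophisticated) stores a distribution per attribute, and drawing one value per attribute instantiates a single possible world:

```python
import random
from dataclasses import dataclass
from typing import Dict

@dataclass
class DiscretePDF:
    """Finite set of alternatives whose probabilities sum to 1."""
    pmf: Dict[str, float]
    def sample(self) -> str:
        return random.choices(list(self.pmf), weights=list(self.pmf.values()))[0]

@dataclass
class UniformPDF:
    """Continuous uncertainty modeled as a uniform density over [lo, hi]."""
    lo: float
    hi: float
    def sample(self) -> float:
        return random.uniform(self.lo, self.hi)

# One possible world of a two-attribute uncertain tuple:
temp = UniformPDF(20.0, 25.0)                     # sensor reading with bounded error
label = DiscretePDF({"car": 0.7, "truck": 0.3})   # uncertain text annotation
world = (temp.sample(), label.sample())
```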

122 citations


Journal ArticleDOI
01 Aug 2008
TL;DR: This work presents the PWS-quality metric, a universal measure that quantifies the ambiguity of query answers under the possible world semantics, and investigates how such a metric can be used for data cleaning purposes.
Abstract: Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to "clean" the database (e.g., by probing some sensors to obtain their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries); and (2) queries that require knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are presented as well. Experiments, performed on both real and synthetic datasets, show that the PWS-quality metric can be evaluated quickly, and that our cleaning algorithm provides an optimal solution with high efficiency. To the best of our knowledge, this is the first work that develops a quality metric for a probabilistic database and investigates how such a metric can be used for data cleaning purposes.
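
The abstract does not spell out the metric's formula, but an entropy-style reading of "ambiguity under the possible world semantics" can be sketched as follows (an assumption for illustration, not necessarily the paper's exact definition):

```python
import math

def pws_ambiguity(answer_probs: dict) -> float:
    """Shannon entropy (in bits) of the distribution over query answers.
    Keys are distinct answer sets; values are the total probability of the
    possible worlds producing each answer set (values sum to 1).
    0.0 means the answer is certain; larger values mean more ambiguity."""
    return -sum(p * math.log2(p) for p in answer_probs.values() if p > 0)

# A range query that returns {t1} with probability 0.9 and {} with 0.1:
print(pws_ambiguity({("t1",): 0.9, (): 0.1}))  # ~0.469 bits
```

Under such a reading, cleaning an object is worthwhile exactly when it is expected to reduce this entropy the most for the query at hand.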

97 citations


Proceedings ArticleDOI
04 Nov 2008
TL;DR: A framework for preserving location privacy is proposed, based on the idea of sending suitably modified location information to the service provider; it not only prevents the service provider from knowing the exact locations of users, but also protects information about user movements and locations from being disclosed to other users who are not authorized to access it.
Abstract: The expanding use of location-based services has profound implications on the privacy of personal information. In this paper, we propose a framework for preserving location privacy based on the idea of sending to the service provider suitably modified location information. Agents execute data transformation and the service provider directly processes the transformed dataset. Our technique not only prevents the service provider from knowing the exact locations of users, but also protects information about user movements and locations from being disclosed to other users who are not authorized to access this information. We also define a privacy model to analyze our framework, and examine our approach experimentally.
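
The abstract does not describe the transformation itself; purely as an illustration of the architecture (trusted agents hold a secret transformation, and the provider processes transformed points directly), one conceivable distance-preserving choice is a secret rotation plus translation, sketched below with assumed names and parameters:

```python
import math

class Agent:
    """Trusted agent that transforms locations before they reach the provider."""
    def __init__(self, theta: float, dx: float, dy: float):
        self.theta, self.dx, self.dy = theta, dx, dy   # secret key, never shared

    def transform(self, x: float, y: float) -> tuple:
        # Rotation + translation preserves distances, so the provider can
        # still answer distance-based queries over the transformed dataset.
        c, s = math.cos(self.theta), math.sin(self.theta)
        return (c * x - s * y + self.dx, s * x + c * y + self.dy)

agent = Agent(theta=1.234, dx=5071.0, dy=-883.0)
print(agent.transform(22.28, 114.16))   # what the service provider sees
```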

20 citations


Book ChapterDOI
09 Jul 2008
TL;DR: An entropy-based metric is presented to quantify the degree of ambiguity of probabilistic query answers due to data uncertainty, and a new method is developed to improve query answer quality.
Abstract: In applications like sensor network monitoring and location-based services, due to limited network bandwidth and battery power, a system cannot always acquire accurate and fresh data from the external environment. To capture data errors in these environments, recent research has proposed modeling uncertainty as a probability density function (pdf), together with the notion of probabilistic queries, which provide statistical guarantees on answer correctness. In this paper, we present an entropy-based metric to quantify the degree of ambiguity of probabilistic query answers due to data uncertainty. Based on this metric, we develop a new method to improve query answer quality. The main idea is to acquire (or probe) data from a selected set of sensing devices, in order to reduce data uncertainty and improve the quality of a query answer. Given that a query is assigned a limited number of probing resources, we investigate how to achieve the optimal improvement in answer quality. To improve the efficiency of our solution, we further present heuristics that achieve near-optimal quality improvement. We generalize our solution to handle multiple queries. An experimental simulation over a realistic dataset is performed to validate our approaches.
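
A minimal sketch of the probing step (the paper's optimal algorithm and heuristics are more involved; the greedy rule and names below are assumptions): with a budget of k probes, pick the sensors whose refreshed readings are expected to shrink the answer's entropy the most.

```python
def choose_probes(expected_reduction: dict, budget: int) -> list:
    """Greedily select up to `budget` sensor ids, ranked by the expected
    drop in answer entropy (assumed precomputed from the current pdfs)."""
    ranked = sorted(expected_reduction, key=expected_reduction.get, reverse=True)
    return ranked[:budget]

print(choose_probes({"s1": 0.42, "s2": 0.10, "s3": 0.31}, budget=2))  # ['s1', 's3']
```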

11 citations


Journal ArticleDOI
01 Dec 2008
TL;DR: This paper proposes a novel index structure, the Mean Variance Tree (MVTree), which is built on the mean and variance of the data rather than on the actual, continuously changing values, thereby significantly reducing the index update cost.
Abstract: Traditional spatial indexes like the R-tree usually assume that the database is not updated frequently. In applications like location-based services and sensor networks, this assumption no longer holds, since data updates can be numerous and frequent. As a result, these indexes can suffer from a high update overhead, leading to poor performance. In this paper, we propose a novel index structure, the Mean Variance Tree (MVTree), which is built on the mean and variance of the data instead of the actual data values, which can change continuously. Since the mean and variance are relatively stable features compared to the actual values, the MVTree significantly reduces the index update cost. The mean and variance of a data item can be dynamically adjusted to match the observed fluctuation of the data. Our experiments show that the MVTree substantially improves index update performance while maintaining satisfactory query performance.
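
The update-saving idea can be sketched as follows (a hypothetical fragment; the entry layout and the width factor k are assumptions): an index entry is keyed on a value's mean and variance, so an incoming reading triggers a structural update only when it falls outside the range those statistics predict.

```python
import math

class MVEntry:
    """Index entry keyed on a data item's mean and variance."""
    def __init__(self, mean: float, var: float, k: float = 3.0):
        self.mean, self.var, self.k = mean, var, k

    def covers(self, value: float) -> bool:
        """True if `value` lies within mean +/- k*stddev, i.e. the reading
        can be absorbed without touching the tree structure."""
        return abs(value - self.mean) <= self.k * math.sqrt(self.var)

entry = MVEntry(mean=25.0, var=0.5)
print(entry.covers(25.9))   # True: no index update needed
print(entry.covers(30.0))   # False: adjust the statistics and update the index
```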

6 citations


Proceedings Article
01 Jan 2008

1 citation