
Showing papers by "Reynold Cheng published in 2005"


Proceedings Article
30 Aug 2005
TL;DR: The U-tree is proposed, an access method designed to optimize both the I/O and CPU time of range retrieval on multi-dimensional imprecise data; it is fully dynamic and does not place any constraints on the data pdfs.
Abstract: In an "uncertain database", an object o is associated with a multi-dimensional probability density function(pdf), which describes the likelihood that o appears at each position in the data space. A fundamental operation is the "probabilistic range search" which, given a value pq and a rectangular area rq, retrieves the objects that appear in rq with probabilities at least pq. In this paper, we propose the U-tree, an access method designed to optimize both the I/O and CPU time of range retrieval on multi-dimensional imprecise data. The new structure is fully dynamic (i.e., objects can be incrementally inserted/deleted in any order), and does not place any constraints on the data pdfs. We verify the query and update efficiency of U-trees with extensive experiments.
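The abstract does not describe the U-tree's internals, but the probabilistic range predicate it accelerates can be sketched by estimating each object's appearance probability from samples drawn from its pdf. The function names below are hypothetical, not the paper's API, and the linear scan is exactly what the U-tree exists to avoid:

```python
def appearance_probability(samples, rect):
    """Estimate the probability that an object lies inside rect, given
    points sampled from its pdf. rect = (xlo, ylo, xhi, yhi)."""
    xlo, ylo, xhi, yhi = rect
    inside = sum(1 for x, y in samples if xlo <= x <= xhi and ylo <= y <= yhi)
    return inside / len(samples)

def probabilistic_range_search(objects, rect, p_q):
    """Return ids of objects appearing in rect with probability >= p_q.
    A brute-force scan; an index like the U-tree prunes most of these checks."""
    return sorted(oid for oid, samples in objects.items()
                  if appearance_probability(samples, rect) >= p_q)
```

With the threshold p_q = 0.5, an object with half of its probability mass inside the query rectangle still qualifies; raising p_q to 0.9 filters it out.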

310 citations


Proceedings Article
30 Aug 2005
TL;DR: U-DBMS extends the database system with uncertainty management functionalities, and each data value is represented as an interval and a probability distribution function, and it can be processed with probabilistic query operators to produce imprecise answers.
Abstract: In many systems, sensors are used to acquire information from external environments such as temperature, pressure and locations. Due to continuous changes in these values, and limited resources (e.g., network bandwidth and battery power), it is often infeasible for the database to store the exact values at all times. Queries that use these old values can produce invalid results. In order to manage the uncertainty between the actual sensor value and the database value, we propose a system called U-DBMS. U-DBMS extends the database system with uncertainty management functionalities. In particular, each data value is represented as an interval and a probability distribution function, and it can be processed with probabilistic query operators to produce imprecise (but correct) answers. This demonstration presents a PostgreSQL-based system that handles uncertainty and probabilistic queries for constantly-evolving data.
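As a minimal illustration of the interval-plus-pdf representation, assuming the simplest possible pdf (uniform over the interval; U-DBMS itself supports general distributions), a probabilistic selection operator returns each row with a qualification probability instead of a hard true/false:

```python
def prob_above(interval, threshold):
    """P(value > threshold) when the value is uniform over interval."""
    lo, hi = interval
    if threshold <= lo:
        return 1.0
    if threshold >= hi:
        return 0.0
    return (hi - threshold) / (hi - lo)

def probabilistic_select(rows, threshold):
    """Imprecise-but-correct answer: map each row name to the probability
    that its uncertain value exceeds the threshold."""
    return {name: prob_above(interval, threshold)
            for name, interval in rows.items()}
```

A room whose last reported temperature has drifted into the interval [20, 30] answers the query "temperature above 28?" with probability 0.2 rather than a possibly invalid yes/no.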

116 citations


01 Jan 2005
TL;DR: This paper proposes that when data mining is performed on uncertain data, data uncertainty has to be considered in order to obtain high quality data mining results, and presents the UK-means clustering algorithm as an example to illustrate how the traditional K-means algorithm can be modified to handle data uncertainty in data mining.
Abstract: Data uncertainty is often found in real-world applications due to reasons such as imprecise measurement, outdated sources, or sampling errors. Recently, much research has been published in the area of managing data uncertainty in databases. We propose that when data mining is performed on uncertain data, data uncertainty has to be considered in order to obtain high quality data mining results. We call this the "Uncertain Data Mining" problem. In this paper, we present a framework for possible research directions in this area. We also present the UK-means clustering algorithm as an example to illustrate how the traditional K-means algorithm can be modified to handle data uncertainty in data mining.
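A minimal sketch of the UK-means idea: each uncertain object is represented by samples of its pdf, and cluster assignment minimizes the expected distance to a center rather than the distance from a single value. The representation and loop structure here are illustrative, not the paper's implementation:

```python
def expected_sq_dist(samples, center):
    """Expected squared Euclidean distance from an uncertain object
    (given as pdf samples) to a cluster center."""
    cx, cy = center
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in samples) / len(samples)

def uk_means(objects, centers, iterations=10):
    """UK-means sketch: assign each uncertain object to the center with the
    smallest expected distance, then recompute each center as the mean of
    its members' sample points."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for samples in objects:
            best = min(range(len(centers)),
                       key=lambda i: expected_sq_dist(samples, centers[i]))
            clusters[best].append(samples)
        for i, members in enumerate(clusters):
            if members:
                pts = [p for s in members for p in s]
                centers[i] = (sum(x for x, _ in pts) / len(pts),
                              sum(y for _, y in pts) / len(pts))
    labels = [min(range(len(centers)),
                  key=lambda i: expected_sq_dist(s, centers[i]))
              for s in objects]
    return centers, labels
```

When every object has a single sample (no uncertainty), this degenerates to ordinary K-means, which is the relationship the paper exploits.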

52 citations


01 Jan 2005
TL;DR: This paper investigates different non-value-based error tolerance definitions and discusses how they are applied to two classes of entity-based queries: non-rank-based and rank-based queries.
Abstract: We study the problem of applying adaptive filters for approximate query processing in a distributed stream environment. We propose filter bound assignment protocols with the objective of reducing communication cost. Most previous works focus on value-based queries (e.g., average) with numerical error tolerance. In this paper, we cover entity-based queries (e.g., a nearest neighbor query returns object names rather than a single value). In particular, we study non-value-based tolerance (e.g., the answer to the nearest-neighbor query should rank third or above). We investigate different non-value-based error tolerance definitions and discuss how they are applied to two classes of entity-based queries: non-rank-based and rank-based queries. Extensive experiments show that our protocols achieve significant savings in both communication overhead and server computation.

45 citations


Proceedings Article
30 Aug 2005
TL;DR: In this article, the problem of applying adaptive filters for approximate query processing in a distributed stream environment is studied and filter bound assignment protocols with the objective of reducing communication cost are proposed.
Abstract: We study the problem of applying adaptive filters for approximate query processing in a distributed stream environment. We propose filter bound assignment protocols with the objective of reducing communication cost. Most previous works focus on value-based queries (e.g., average) with numerical error tolerance. In this paper, we cover entity-based queries (e.g., nearest neighbor) with non-value-based error tolerance. We investigate different non-value-based error tolerance definitions and discuss how they are applied to two classes of entity-based queries: non-rank-based and rank-based queries. Extensive experiments show that our protocols achieve significant savings in both communication overhead and server computation.

44 citations


Proceedings ArticleDOI
05 Apr 2005
TL;DR: This paper proposes an index structure explicitly designed to perform well for both querying and updating; it observes that objects often stay in a region for an extended amount of time, and exploits this phenomenon to optimize an index for both updates and queries.
Abstract: Index structures are designed to optimize search performance, while at the same time supporting efficient data updates. Although not explicit, existing index structures are typically based upon the assumption that the rate of updates will be small compared to the rate of querying. This assumption is not valid in streaming data environments such as sensor and moving object databases, where updates are received incessantly. In fact, for many applications, the rate of updates may well exceed the rate of querying. In such environments, index structures suffer from poor performance due to the large overhead of keeping the index updated with the latest data. Recent efforts at indexing moving object data assume objects move in a restrictive manner (e.g. in straight lines with constant velocity). In this paper, we propose an index structure explicitly designed to perform well for both querying and updating. We assume a more relaxed model of object movement. In particular, we observe that objects often stay in a region (e.g., building) for an extended amount of time, and exploit this phenomenon to optimize an index for both updates and queries. The paper is developed with the example of R-trees, but the ideas can be extended to other index structures as well. We present the design of the change tolerant R-tree, and an experimental evaluation.

32 citations


Proceedings ArticleDOI
17 Aug 2005
TL;DR: This paper proposes a statistical approach to decide which sensors should be used to answer a query with the aid of continuous probabilistic query (CPQ), which is originally used to manage uncertain data and is associated with a probabilistic guarantee on the query result.
Abstract: An approach to improve the reliability of query results based on error-prone sensors is to use redundant sensors. However, this approach is expensive; moreover, some sensors may malfunction and their readings need to be discarded. In this paper, we propose a statistical approach to decide which sensors should be used to answer a query. In particular, we propose to solve the problem with the aid of continuous probabilistic query (CPQ), which is originally used to manage uncertain data and is associated with a probabilistic guarantee on the query result. Based on the historical data values from the sensors, the query type, and the requirement on the query, we present methods to select an appropriate set of sensors and provide reliable answers for aggregate queries. Our algorithm is demonstrated in simulation experiments to provide accurate and robust query results.
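The abstract does not spell out the selection criterion, so the sketch below uses a crude stand-in: keep only sensors whose historical readings are stable (sample standard deviation under a bound) before answering an aggregate query. This is purely illustrative of the "select sensors from history, then aggregate" flow, not the paper's statistical test:

```python
def select_sensors(history, max_std):
    """Keep sensors whose historical readings vary little; drop the rest as
    likely malfunctioning. A toy proxy for the paper's statistical approach."""
    chosen = []
    for sid, readings in history.items():
        mean = sum(readings) / len(readings)
        var = sum((r - mean) ** 2 for r in readings) / len(readings)
        if var ** 0.5 <= max_std:
            chosen.append(sid)
    return sorted(chosen)

def robust_average(history, max_std):
    """Aggregate query (average) answered only over the selected sensors,
    using each selected sensor's latest reading."""
    ids = select_sensors(history, max_std)
    latest = [history[sid][-1] for sid in ids]
    return sum(latest) / len(latest)
```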

13 citations


Proceedings ArticleDOI
Yuni Xia, Sunil Prabhakar, Shan Lei, Reynold Cheng, Rahul Shah
13 Mar 2005
TL;DR: A novel index structure, the MVTree, which is built based on the mean and variance of the data instead of the actual data values that are in constant flux is proposed, which significantly reduces the index update cost.
Abstract: Constantly evolving data arise in various mobile applications such as location-based services and sensor networks. The problem of indexing the data for efficient query processing is of increasing importance. Due to the constant changing nature of the data, traditional indexes suffer from a high update overhead which leads to poor performance. In this paper, we propose a novel index structure, the MVTree, which is built based on the mean and variance of the data instead of the actual data values that are in constant flux. Since the mean and variance are relatively stable features compared to the actual values, the MVTree significantly reduces the index update cost. The distribution interval and probability distribution function of the data are not required to be known a priori. The mean and variance for each data item can be dynamically adjusted to match the observed fluctuation of the data. Experiments show that compared to traditional index schemes, the MVTree substantially improves index update performance while maintaining satisfactory query performance.
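The property the abstract attributes to indexing on (mean, variance) rather than raw values is that most incoming readings leave the index untouched. A toy entry that only triggers a reinsert when a reading drifts outside a mean-plus-or-minus-k-standard-deviations band, with an exponentially weighted adjustment of the summary, is sketched below; the class and its parameters are hypothetical, not the MVTree's actual update rule:

```python
class MVEntry:
    """Sketch of an index entry keyed on (mean, std) rather than raw values.
    The entry is reinserted into the index only when a reading falls outside
    the band mean +/- k*std, so most updates cost nothing."""

    def __init__(self, mean, std, k=3.0):
        self.mean, self.std, self.k = mean, std, k
        self.reinserts = 0  # how many updates actually touched the index

    def update(self, value, alpha=0.1):
        if abs(value - self.mean) > self.k * self.std:
            # Out of band: the stable summary no longer fits the data,
            # so the entry would be reinserted under a new (mean, std) key.
            self.reinserts += 1
        # Exponentially weighted adjustment of mean and variance, so the
        # summary tracks the observed fluctuation of the data over time.
        d = value - self.mean
        self.mean += alpha * d
        self.std = ((1 - alpha) * (self.std ** 2 + alpha * d * d)) ** 0.5
```

Readings that fluctuate within the band update the summary cheaply in place; only a genuine regime change (e.g., a sensor jumping from 10 to 100) forces index work.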

13 citations


Book ChapterDOI
01 Jan 2005
TL;DR: Sensors are often used to monitor the status of an environment continuously, and if the value of an entity being monitored is constantly evolving, the recorded data value may differ from the actual value.
Abstract: Sensors are often used to monitor the status of an environment continuously. The sensor readings are reported to the application for making decisions and answering user queries. For example, a fire-alarm system in a building employs temperature sensors to detect any abrupt change in temperature. An aircraft is equipped with sensors to track the wind speed, and radars are used to report the aircraft’s location to a military application. These applications usually include a database or server to which the sensor readings are sent. Limited network bandwidth and battery power imply that it is often not practical for the server to record the exact status of an entity it monitors at every time instant. In particular, if the value of an entity (e.g., temperature, location) being monitored is constantly evolving, the recorded data value may differ from the actual value. Querying the database can then produce incorrect results. Consider a simple example where a user asks the database: “which room has a temperature between 10

7 citations


01 Jan 2005
TL;DR: This paper proposes using imprecise queries to hide the location of the query issuer and evaluate uncertain information, and suggests a framework where uncertainty can be controlled to provide high quality and privacy-preserving services.
Abstract: Location-based services, such as finding the nearest gas station, require users to supply their location information. However, a user’s location can be tracked without her consent or knowledge. Lowering the spatial and temporal resolution of location data sent to the server has been proposed as a solution. Although this technique is effective in protecting privacy, it may be overkill and the quality of desired services can be severely affected. In this paper, we investigate the relationship between uncertainty, privacy, and quality of services. We propose using imprecise queries to hide the location of the query issuer and evaluate uncertain information. We also suggest a framework where uncertainty can be controlled to provide high quality and privacy-preserving services. We study how the idea can be applied to a moving range query over moving objects. We further investigate how the linkability of the proposed solution can be protected against trajectory-tracing.

4 citations


01 Jan 2005
TL;DR: This paper presents equality and inequality operators for uncertain data, introduces the concept of "approximation" in these comparison operators, and develops three sets of pruning techniques, namely item-level, page-level and index-level pruning.
Abstract: In database systems that collect information about the external environment, such as temperature and location values, it is often infeasible to obtain accurate information due to measurement and sampling errors, and resource limitations. Queries evaluated over these inaccurate data can potentially yield incorrect results. To avoid these problems, the idea of using uncertainty models (such as an interval associated with a probability density function) instead of a single value for modeling a data item has been explored in recent years. These works have focused on simple queries such as range and nearest-neighbor queries. Queries that join multiple relations have not been addressed in earlier work despite the significance of joins in databases. In this paper we address join queries over uncertain data. As with other queries over uncertain data, these joins return probabilistic answers. A Probabilistic Join Query (PJQ) augments the results with probability guarantees to indicate the likelihood of each join tuple being part of the result. Traditional join operators, such as equality and inequality, need to be extended to support uncertain data. In this paper, we present the notion of equality and inequality operators for uncertainty. We also introduce the concept of "approximation" in these comparison operators. Although PJQs are more informative than traditional joins, they are expensive to evaluate. To overcome this problem, we observe that often it is only necessary to know whether the probability of the results exceeds a given threshold, instead of the precise probability value. By incorporating this constraint into PJQ, it is possible to achieve much better performance. In particular, we develop three sets of optimization techniques, namely item-level, page-level and index-level pruning, for different join operators. These techniques facilitate pruning with little space and time overhead, and are easily adapted to most join algorithms.
Extensive simulation results show that these techniques improve the performance of joins significantly.
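A toy version of item-level pruning for an equality-within-resolution-c join predicate, assuming independent uniform pdfs (the paper's operators and pruning techniques are more general): a pair can be skipped outright when its intervals, widened by c, do not even overlap, since the join probability is then exactly zero.

```python
def overlap_len(a, b):
    """Length of the overlap between two intervals a = (lo, hi), b = (lo, hi)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def may_join(x, y, c):
    """Item-level pruning: P(|X - Y| <= c) can be nonzero only if the
    intervals, widened by c, overlap."""
    return overlap_len((x[0] - c, x[1] + c), y) > 0

def join_prob_uniform(x, y, c, steps=1000):
    """Numerically estimate P(|X - Y| <= c) for independent X, Y uniform
    over intervals x and y, by midpoint integration over X."""
    xl, xh = x
    yl, yh = y
    total = 0.0
    for i in range(steps):
        v = xl + (i + 0.5) * (xh - xl) / steps   # midpoint sample of X
        lo, hi = max(yl, v - c), min(yh, v + c)  # portion of Y within c of v
        total += max(0.0, hi - lo) / (yh - yl)
    return total / steps
```

The cheap `may_join` test stands in for the threshold idea: when an upper bound on the probability already falls below the query threshold, the expensive integration is never run.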