
Showing papers by "Reynold Cheng" published in 2013


Journal ArticleDOI
01 Dec 2013
TL;DR: This paper focuses on optimizing the refinement phase of EMD-based similarity search by adapting an efficient min-cost flow algorithm (SIA) for EMD computation, proposing a dynamic distance bound that can terminate an EMD refinement early, and proposing a dynamic refinement order for the candidates which, paired with a concurrent EMD refinement strategy, reduces the amount of needless computation.
Abstract: Earth Mover's Distance (EMD), as a similarity measure, has received a lot of attention in the fields of multimedia and probabilistic databases, computer vision, image retrieval, machine learning, etc. EMD on multidimensional histograms provides better distinguishability between the objects approximated by the histograms (e.g., images), compared to classic measures like Euclidean distance. Despite its usefulness, EMD has a high computational cost; therefore, a number of effective filtering methods have been proposed, to reduce the pairs of histograms for which the exact EMD has to be computed, during similarity search. Still, EMD calculations in the refinement step remain the bottleneck of the whole similarity search process. In this paper, we focus on optimizing the refinement phase of EMD-based similarity search by (i) adapting an efficient min-cost flow algorithm (SIA) for EMD computation, (ii) proposing a dynamic distance bound, which can be used to terminate an EMD refinement early, and (iii) proposing a dynamic refinement order for the candidates which, paired with a concurrent EMD refinement strategy, reduces the amount of needless computations. Our proposed techniques are orthogonal to and can be easily integrated with the state-of-the-art filtering techniques, reducing the cost of EMD-based similarity queries by orders of magnitude.
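To make the filter-and-refine pipeline concrete, here is a minimal Python sketch for the 1-D special case, where the exact EMD reduces to the L1 distance between the histograms' CDFs and the difference of histogram means gives a cheap lower bound. It illustrates only the generic filter-then-refine structure; the paper's SIA adaptation, dynamic distance bound, and refinement ordering are not reproduced.

```python
import numpy as np

def emd_1d(p, q):
    """Exact EMD between two normalized 1-D histograms with
    ground distance |i - j|: the L1 distance of their CDFs."""
    return np.abs(np.cumsum(p - q)).sum()

def centroid_lower_bound(p, q):
    """Cheap lower bound on 1-D EMD: |difference of histogram means|."""
    bins = np.arange(len(p))
    return abs(bins @ p - bins @ q)

def range_query(query, database, eps):
    """Filter-and-refine: compute the exact EMD only for histograms
    whose lower bound does not already exceed the threshold."""
    results = []
    for i, h in enumerate(database):
        if centroid_lower_bound(query, h) > eps:
            continue                      # filtered: cannot qualify
        if emd_1d(query, h) <= eps:       # refinement step
            results.append(i)
    return results

rng = np.random.default_rng(0)
db = [rng.dirichlet(np.ones(16)) for _ in range(1000)]
print(range_query(db[0], db, eps=0.5))
```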

44 citations


Proceedings ArticleDOI
Luyi Mo, Reynold Cheng, Xiang Li, David W. Cheung, Xuan S. Yang
08 Apr 2013
TL;DR: This paper develops efficient algorithms to compute the quality of a probabilistic top-k query under the possible world semantics, and addresses the cleaning of a probabilistic database in order to improve top-k query quality.
Abstract: The information managed in emerging applications, such as sensor networks, location-based services, and data integration, is inherently imprecise. To handle data uncertainty, probabilistic databases have been recently developed. In this paper, we study how to quantify the ambiguity of answers returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible world semantics. We further address the cleaning of a probabilistic database, in order to improve top-k query quality. Cleaning involves the reduction of ambiguity associated with the database entities. For example, the uncertainty of a temperature value acquired from a sensor can be reduced, or cleaned, by requesting its newest value from the sensor. While this “cleaning operation” may produce a better query result, it may involve a cost and fail. We investigate the problem of selecting entities to be cleaned under a limited budget. Particularly, we propose an optimal solution and several heuristics. Experiments show that the greedy algorithm is efficient and close to optimal.
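A minimal sketch of the budgeted-cleaning idea, where the per-entity quality gain, cleaning cost, and success probability are hypothetical stand-ins for the paper's quality model; the greedy rule picks the best expected gain per unit cost, mirroring the paper's observation that greedy is close to optimal:

```python
import heapq

def greedy_clean(entities, budget):
    """Greedy heuristic: repeatedly clean the entity with the best
    expected quality gain per unit cost until the budget runs out.
    Each entity is (name, gain, cost, success_prob) -- hypothetical
    fields standing in for the paper's quality model."""
    heap = [(-gain * p_ok / cost, (name, gain, cost, p_ok))
            for name, gain, cost, p_ok in entities]
    heapq.heapify(heap)
    plan, spent = [], 0.0
    while heap:
        _, (name, gain, cost, p_ok) = heapq.heappop(heap)
        if spent + cost > budget:
            continue                      # too expensive; try cheaper ones
        plan.append(name)
        spent += cost
    return plan

entities = [("sensor_a", 0.9, 2.0, 0.8), ("sensor_b", 0.4, 1.0, 0.95)]
print(greedy_clean(entities, budget=2.5))
```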

34 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper studies how to derive an axis-parallel hyper-rectangle (called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell, and develops the PV-index, a structure that stores UBRs, to evaluate probabilistic nearest neighbor queries over uncertain data.
Abstract: In Voronoi-based nearest neighbor search, the Voronoi cell of every point p in a database can be used to check whether p is the closest to some query point q. We extend the notion of Voronoi cells to support uncertain objects, whose attribute values are inexact. Particularly, we propose the Possible Voronoi cell (or PV-cell). A PV-cell of a multi-dimensional uncertain object o is a region R, such that for any point p ∈ R, o may be the nearest neighbor of p. If the PV-cells of all objects in a database S are known, they can be used to identify objects that have a chance to be the nearest neighbor of q. However, there is no efficient algorithm for computing an exact PV-cell. We hence study how to derive an axis-parallel hyper-rectangle (called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell. We further develop the PV-index, a structure that stores UBRs, to evaluate probabilistic nearest neighbor queries over uncertain data. An advantage of the PV-index is that upon updates on S, it can be incrementally updated. Extensive experiments on both synthetic and real datasets are carried out to validate the performance of the PV-index.
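The pruning role of bounding rectangles can be illustrated with a standard min/max-distance test: an object may be the NN of q only if its minimum possible distance does not exceed the smallest maximum possible distance over all objects. This is a generic sketch, not the paper's PV-index:

```python
import numpy as np

def mindist(q, lo, hi):
    """Smallest possible distance from point q to a rectangle [lo, hi]."""
    d = np.maximum(lo - q, 0) + np.maximum(q - hi, 0)
    return np.linalg.norm(d)

def maxdist(q, lo, hi):
    """Largest possible distance from point q to a rectangle [lo, hi]."""
    d = np.maximum(np.abs(q - lo), np.abs(q - hi))
    return np.linalg.norm(d)

def possible_nn(q, rects):
    """Objects that may be the NN of q: those whose minimum distance
    does not exceed the smallest maximum distance over all objects."""
    best_max = min(maxdist(q, lo, hi) for lo, hi in rects)
    return [i for i, (lo, hi) in enumerate(rects)
            if mindist(q, lo, hi) <= best_max]

rects = [(np.array([0, 0]), np.array([1, 1])),
         (np.array([2, 2]), np.array([3, 3])),
         (np.array([9, 9]), np.array([10, 10]))]
print(possible_nn(np.array([1.5, 1.5]), rects))   # far object is pruned
```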

34 citations


Journal ArticleDOI
01 Jun 2013
TL;DR: This paper proposes the Uncertain-Voronoi diagram (or UV-diagram), which divides the data space into disjoint "UV-partitions", and uses a set of UV-cells to design the UV-index, which supports different queries and can be constructed in polynomial time.
Abstract: The Voronoi diagram is an important technique for answering nearest-neighbor queries for spatial databases. We study how the Voronoi diagram can be used for uncertain spatial data, which are inherent in scientific and business applications. Specifically, we propose the Uncertain-Voronoi diagram (or UV-diagram), which divides the data space into disjoint "UV-partitions". Each UV-partition P is associated with a set S of objects, such that any point q located in P has the set S as its nearest neighbor with nonzero probabilities. The UV-diagram enables queries that return objects with nonzero chances of being the nearest neighbor (NN) of a given point q. It supports "continuous nearest-neighbor search", which refreshes the set of NN objects of q, as the position of q changes. It also allows the analysis of nearest-neighbor information, for example, to find out the number of objects that are the nearest neighbors of any point in a given area. A UV-diagram requires exponential construction and storage costs. To tackle these problems, we devise an alternative representation of a UV-diagram, by using a set of UV-cells. A UV-cell of an object o is the extent e for which o can be the nearest neighbor of any point q ∈ e. We study how to speed up the derivation of UV-cells by considering its nearby objects. We also use the UV-cells to design the UV-index, which supports different queries, and can be constructed in polynomial time. We have performed extensive experiments on both real and synthetic data to validate the efficiency of our approaches.
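A small sketch of the "possible NN" test behind UV-cells, here with circular uncertainty regions (an assumption made for brevity), refreshed as the query point moves along a path to mimic continuous nearest-neighbor search:

```python
import numpy as np

def nn_candidates(q, objects):
    """Objects with a nonzero chance to be NN of q, modelled here with
    circular uncertainty regions (center c, radius r): o qualifies iff
    its minimum distance is at most every object's maximum distance."""
    dmin = [max(np.linalg.norm(q - c) - r, 0.0) for c, r in objects]
    dmax = [np.linalg.norm(q - c) + r for c, r in objects]
    return [i for i, lo in enumerate(dmin) if lo <= min(dmax)]

objects = [(np.array([0.0, 0.0]), 1.0),
           (np.array([4.0, 0.0]), 0.5),
           (np.array([9.0, 9.0]), 0.2)]

# "Continuous" NN: refresh the candidate set as q moves along a path.
for t in np.linspace(0, 1, 5):
    q = (1 - t) * np.array([0.0, 0.0]) + t * np.array([9.0, 9.0])
    print(round(t, 2), nn_candidates(q, objects))
```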

25 citations


Proceedings ArticleDOI
27 Oct 2013
TL;DR: This work proposes a dynamic programming (DP) algorithm for solving the plurality assignment problem (PAP) and identifies two interesting properties, namely, monotonicity and diminishing return, which are satisfied by a HIT if the quality of the HIT's answer increases monotonically at a decreasing rate with its plurality.
Abstract: In a crowdsourcing system, Human Intelligence Tasks (HITs) (e.g., translating sentences, matching photos, tagging videos with keywords) can be conveniently specified. HITs are made available to a large pool of workers, who are paid upon completing the HITs they have selected. Since workers may have different capabilities, some difficult HITs may not be satisfactorily performed by a single worker. If more workers are employed to perform a HIT, the quality of the HIT's answer could be statistically improved. Given a set of HITs and a fixed "budget", we address the important problem of determining the number of workers (or plurality) of each HIT so that the overall answer quality is optimized. We propose a dynamic programming (DP) algorithm for solving the plurality assignment problem (PAP). We identify two interesting properties, namely, monotonicity and diminishing return, which are satisfied by a HIT if the quality of the HIT's answer increases monotonically at a decreasing rate with its plurality. We show that, for HITs satisfying the two properties (e.g., multiple-choice-question HITs), the PAP is approximable, and we propose an efficient greedy algorithm for this case. We conduct extensive experiments on synthetic and real datasets to evaluate our algorithms. Our experiments show that our greedy algorithm provides close-to-optimal solutions in practice.
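A sketch of the greedy idea under a unit cost per worker, assuming the hypothetical quality model q_i(k) = 1 - (1 - p_i)^k, which is monotone with diminishing returns; those two properties are exactly what makes the greedy choice safe:

```python
import heapq

def assign_plurality(hits, budget):
    """Greedy plurality assignment: repeatedly give one more worker to
    the HIT with the largest marginal quality gain.  Each HIT i is
    described by p_i, the per-worker success probability, under the
    assumed model q_i(k) = 1 - (1 - p_i)**k."""
    def gain(p, k):                      # marginal gain of worker k+1
        return (1 - (1 - p) ** (k + 1)) - (1 - (1 - p) ** k)

    counts = [0] * len(hits)
    heap = [(-gain(p, 0), i) for i, p in enumerate(hits)]
    heapq.heapify(heap)
    for _ in range(budget):              # one worker assigned per step
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-gain(hits[i], counts[i]), i))
    return counts

print(assign_plurality([0.9, 0.6, 0.55], budget=10))
```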

24 citations


Journal ArticleDOI
TL;DR: This paper proposes novel methods that capture the itemset mining process as a probability distribution function, taking two models into account: the Poisson distribution and the normal distribution, and gives an intuition of which model-based approach fits best for different types of data sets.
Abstract: Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose novel methods that capture the itemset mining process as a probability distribution function, taking two models into account: the Poisson distribution and the normal distribution. These model-based approaches extract frequent itemsets with a high degree of accuracy and …
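A sketch of the two model-based approximations, assuming each transaction contains the itemset independently with a known existential probability, so that the itemset's support follows a Poisson-binomial distribution that the Poisson and normal models approximate:

```python
from scipy import stats

def frequentness_prob(probs, minsup, model="poisson"):
    """P(support >= minsup), where the support of an itemset is the sum
    of independent Bernoulli indicators, one per transaction, with the
    given existential probabilities.  The exact distribution is
    Poisson-binomial; the two models below approximate it."""
    mu = sum(probs)
    if model == "poisson":
        return stats.poisson.sf(minsup - 1, mu)          # P(X >= minsup)
    var = sum(p * (1 - p) for p in probs)
    return stats.norm.sf(minsup - 0.5, mu, var ** 0.5)   # continuity-corrected

probs = [0.9, 0.8, 0.5, 0.4, 0.3]        # per-transaction probabilities
for m in ("poisson", "normal"):
    print(m, frequentness_prob(probs, minsup=3, model=m))
```

An itemset would then be reported as probabilistically frequent when this probability exceeds a user-given confidence threshold.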

21 citations


Proceedings ArticleDOI
Xuan S. Yang, Reynold Cheng, Luyi Mo, Ben Kao, David W. Cheung
08 Apr 2013
TL;DR: This work formalizes the concepts of tagging quality (TQ) and tagging stability (TS) in measuring the quality of a resource's tag description and proposes a theoretically optimal algorithm given a fixed “budget” (i.e., the amount of money paid for tagging resources).
Abstract: A social tagging system, such as del.icio.us and Flickr, allows users to annotate resources (e.g., web pages and photos) with text descriptions called tags. Tags have proven to be invaluable information for searching, mining, and recommending resources. In practice, however, not all resources receive the same attention from users. As a result, while some highly-popular resources are over-tagged, most of the resources are under-tagged. Incomplete tagging on resources severely affects the effectiveness of all tag-based techniques and applications. We address an interesting question: if users are paid to tag specific resources, how can we allocate incentives to resources in a crowd-sourcing environment so as to maximize the tagging quality of resources? We address this question by observing that the tagging quality of a resource becomes stable after it has been tagged a sufficient number of times. We formalize the concepts of tagging quality (TQ) and tagging stability (TS) in measuring the quality of a resource's tag description. We propose a theoretically optimal algorithm given a fixed “budget” (i.e., the amount of money paid for tagging resources). This solution decides the amount of rewards that should be invested on each resource in order to maximize tagging stability. We further propose a few simple, practical, and efficient incentive allocation strategies. On a dataset from del.icio.us, our best strategy provides resources with a close-to-optimal gain in tagging stability.
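A sketch of the stability intuition, using cosine similarity between tag distributions as a hypothetical stand-in for the paper's TS measure: once the distribution barely moves when new tags arrive, the resource's description is stable and further incentives are better spent elsewhere.

```python
from collections import Counter
import math

def tag_distribution(tags):
    """Normalize a list of tags into a relative-frequency distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {t: k / total for t, k in counts.items()}

def similarity(d1, d2):
    """Cosine similarity between two tag distributions -- a hypothetical
    stand-in for the paper's tagging-stability measure."""
    dot = sum(d1.get(t, 0) * d2.get(t, 0) for t in set(d1) | set(d2))
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2)

tags = ["python", "code", "python", "tutorial", "python", "code"]
before = tag_distribution(tags[:4])
after = tag_distribution(tags)          # after two more paid tags
print(similarity(before, after))        # close to 1 => description is stable
```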

19 citations


Journal ArticleDOI
TL;DR: This paper proposes the probabilistic filter protocol, which helps remote sensor devices decide whether collected values should be reported to the query server, and can significantly reduce the communication and energy costs of sensor devices.
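A minimal sketch of the filter idea, assuming simple interval filters installed on the sensor (the paper's protocol is probabilistic and more elaborate): readings inside the filter are suppressed, so no communication is spent on them.

```python
def should_report(value, filter_lo, filter_hi):
    """Sensor-side filter: transmit only when the new reading falls
    outside the interval installed by the server, so that in-range
    readings cost no communication."""
    return not (filter_lo <= value <= filter_hi)

readings = [20.1, 20.4, 20.2, 23.7, 20.3]
reported = [v for v in readings if should_report(v, 19.5, 21.5)]
print(reported)   # only the outlier 23.7 is sent
```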

10 citations


Proceedings ArticleDOI
27 Oct 2013
TL;DR: This paper proposes an efficient approach to identify and evaluate iceberg cells of an s-cuboid, a multidimensional array of cells, each associated with a pattern instantiated from the query's pattern template.
Abstract: A Sequence OLAP (S-OLAP) system provides a platform on which pattern-based aggregate (PBA) queries on a sequence database are evaluated. In its simplest form, a PBA query consists of a pattern template T and an aggregate function F. A pattern template is a sequence of variables, each defined over a domain. For example, the template T = (X, Y, Y, X) consists of two variables X and Y. Each variable is instantiated with all possible values in its corresponding domain to derive all possible patterns of the template. Sequences are grouped based on the patterns they possess. The answer to a PBA query is a sequence cuboid (s-cuboid), which is a multidimensional array of cells. Each cell is associated with a pattern instantiated from the query's pattern template. The value of each s-cuboid cell is obtained by applying the aggregate function F to the set of data sequences that belong to that cell. Since a pattern template can involve many variables and can be arbitrarily long, the induced s-cuboid for a PBA query can be huge. For most analytical tasks, however, only iceberg cells with very large aggregate values are of interest. This paper proposes an efficient approach to identify and evaluate iceberg cells of s-cuboids. Experimental results show that our algorithms are orders of magnitude faster than existing approaches.
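A small sketch of pattern instantiation and iceberg filtering for a COUNT aggregate: it groups sequences by the variable binding they induce and keeps only cells meeting the threshold, without the paper's optimizations.

```python
from collections import Counter

def instantiate(seq, template):
    """Bind the template's variables against a sequence; return the
    binding (the s-cuboid cell) or None if the sequence conflicts,
    i.e., repeated variables see different values."""
    if len(seq) != len(template):
        return None
    binding = {}
    for var, item in zip(template, seq):
        if binding.setdefault(var, item) != item:
            return None
    return tuple(binding[v] for v in sorted(binding))

def iceberg_cells(sequences, template, min_count):
    """Group sequences by instantiated pattern and keep only cells
    whose aggregate (here: COUNT) reaches the iceberg threshold."""
    cells = Counter()
    for seq in sequences:
        cell = instantiate(seq, template)
        if cell is not None:
            cells[cell] += 1
    return {c: n for c, n in cells.items() if n >= min_count}

seqs = [("a", "b", "b", "a"), ("a", "b", "b", "a"), ("a", "c", "c", "a"),
        ("a", "b", "c", "a")]                  # last one conflicts on Y
print(iceberg_cells(seqs, ("X", "Y", "Y", "X"), min_count=2))
```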

3 citations


Proceedings ArticleDOI
18 Dec 2013
TL;DR: This paper considers the popular navigation application, where users may continuously query different location-based servers during their movements, and recommends a privacy-preserving path based on a set of metrics on privacy, distance, and the quality of service that an LBS requester often desires.
Abstract: With the increasing adoption of location-based services, privacy is becoming a major concern. To hide the identity and location of a request on a location-based service, most methods consider a set of users in a reasonable region so as to confuse their requests. When there are not enough users, the cloaking region needs to be expanded to a larger area, or the response needs to be delayed; either way degrades the quality of service. In this paper, we tackle the privacy problem predictively, by recommending a privacy-preserving path for a requester. We consider the popular navigation application, where users may continuously query different location-based servers during their movements. Based on a set of metrics on privacy, distance, and the quality of service that an LBS requester often desires, a secure path is computed for each request according to the user's preferences, and can be dynamically adjusted when the situation changes. A set of experiments is performed to verify our method, and the relationships between parameters are discussed in detail. We also discuss how to apply our method to practical applications.
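A sketch of the path-selection intuition, with hypothetical weights and normalized scores standing in for the paper's metrics on privacy, distance, and quality of service:

```python
def path_score(privacy, distance, qos, w=(0.5, 0.3, 0.2)):
    """Hypothetical scoring of a candidate path: combine a privacy score,
    a normalized (and inverted) detour cost, and a quality-of-service
    score with user-preference weights.  All inputs are in [0, 1]."""
    return w[0] * privacy + w[1] * (1 - distance) + w[2] * qos

candidates = {
    "main_road":  dict(privacy=0.3, distance=0.2, qos=0.9),
    "side_route": dict(privacy=0.8, distance=0.5, qos=0.7),
}
best = max(candidates, key=lambda k: path_score(**candidates[k]))
print(best)   # the more private route wins under these weights
```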

2 citations


Book ChapterDOI
01 Jan 2013
TL;DR: This work presents the PWS-quality metric, a universal measure that quantifies the ambiguity of query answers under the possible world semantics, and proposes a polynomial-time solution to achieve an optimal improvement in PWS-quality.
Abstract: Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries) and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are also examined.
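For the tuple-independent case (e.g., range queries), the ambiguity of the answer set decomposes into per-tuple binary entropies, which makes both the quality computation and a simple cleaning heuristic easy to sketch; the uniform-cost greedy below is an illustration, not the paper's optimal budgeted solution.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli variable with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pws_quality(probs):
    """For a range query over tuple-independent data, the answer set's
    entropy is the sum of per-tuple binary entropies; quality is its
    negation (0 is best, i.e., no ambiguity at all)."""
    return -sum(binary_entropy(p) for p in probs)

def greedy_clean(probs, k):
    """Clean (resolve to 0/1) the k tuples whose entropy terms are
    largest -- the biggest improvement under uniform cleaning costs."""
    order = sorted(range(len(probs)), key=lambda i: -binary_entropy(probs[i]))
    return order[:k]

probs = [0.5, 0.9, 0.2, 0.45]           # per-tuple query-satisfaction probs
print(pws_quality(probs), greedy_clean(probs, k=2))
```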