
Showing papers by "Reynold Cheng" published in 2013


Journal ArticleDOI
01 Dec 2013
TL;DR: This paper focuses on optimizing the refinement phase of EMD-based similarity search by adapting an efficient min-cost flow algorithm (SIA) for EMD computation, proposing a dynamic distance bound that can terminate an EMD refinement early, and proposing a dynamic refinement order for the candidates which, paired with a concurrent EMD refinement strategy, reduces the amount of needless computation.
Abstract: Earth Mover's Distance (EMD), as a similarity measure, has received a lot of attention in the fields of multimedia and probabilistic databases, computer vision, image retrieval, machine learning, etc. EMD on multidimensional histograms provides better distinguishability between the objects approximated by the histograms (e.g., images), compared to classic measures like Euclidean distance. Despite its usefulness, EMD has a high computational cost; therefore, a number of effective filtering methods have been proposed, to reduce the pairs of histograms for which the exact EMD has to be computed, during similarity search. Still, EMD calculations in the refinement step remain the bottleneck of the whole similarity search process. In this paper, we focus on optimizing the refinement phase of EMD-based similarity search by (i) adapting an efficient min-cost flow algorithm (SIA) for EMD computation, (ii) proposing a dynamic distance bound, which can be used to terminate an EMD refinement early, and (iii) proposing a dynamic refinement order for the candidates which, paired with a concurrent EMD refinement strategy, reduces the amount of needless computations. Our proposed techniques are orthogonal to and can be easily integrated with the state-of-the-art filtering techniques, reducing the cost of EMD-based similarity queries by orders of magnitude.
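To make the filter-and-refine pipeline concrete, here is a minimal Python sketch for the 1-D special case, where the exact EMD reduces to the L1 distance between the histograms' CDFs and the difference of histogram means gives a cheap lower bound. It illustrates only the generic filter-then-refine structure; the paper's SIA adaptation, dynamic distance bound, and refinement ordering are not reproduced.

```python
import numpy as np

def emd_1d(p, q):
    """Exact EMD between two normalized 1-D histograms with
    ground distance |i - j|: the L1 distance of their CDFs."""
    return np.abs(np.cumsum(p - q)).sum()

def centroid_lower_bound(p, q):
    """Cheap lower bound on 1-D EMD: |difference of histogram means|."""
    bins = np.arange(len(p))
    return abs(bins @ p - bins @ q)

def range_query(query, database, eps):
    """Filter-and-refine: compute the exact EMD only for histograms
    whose lower bound does not already exceed the threshold."""
    results = []
    for i, h in enumerate(database):
        if centroid_lower_bound(query, h) > eps:
            continue                      # filtered: cannot qualify
        if emd_1d(query, h) <= eps:       # refinement step
            results.append(i)
    return results

rng = np.random.default_rng(0)
db = [rng.dirichlet(np.ones(16)) for _ in range(1000)]
print(range_query(db[0], db, eps=0.5))
```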

44 citations


Proceedings ArticleDOI
Luyi Mo, Reynold Cheng, Xiang Li, David W. Cheung, Xuan S. Yang
08 Apr 2013
TL;DR: This paper develops efficient algorithms to compute the quality of a probabilistic top-k query under the possible world semantics, and addresses the cleaning of a probabilistic database in order to improve top-k query quality.
Abstract: The information managed in emerging applications, such as sensor networks, location-based services, and data integration, is inherently imprecise. To handle data uncertainty, probabilistic databases have been recently developed. In this paper, we study how to quantify the ambiguity of answers returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible world semantics. We further address the cleaning of a probabilistic database, in order to improve top-k query quality. Cleaning involves the reduction of ambiguity associated with the database entities. For example, the uncertainty of a temperature value acquired from a sensor can be reduced, or cleaned, by requesting its newest value from the sensor. While this “cleaning operation” may produce a better query result, it may involve a cost and fail. We investigate the problem of selecting entities to be cleaned under a limited budget. Particularly, we propose an optimal solution and several heuristics. Experiments show that the greedy algorithm is efficient and close to optimal.
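A minimal sketch of the budgeted-cleaning idea, where the per-entity quality gain, cleaning cost, and success probability are hypothetical stand-ins for the paper's quality model; the greedy rule picks the best expected gain per unit cost, mirroring the paper's observation that greedy is close to optimal:

```python
import heapq

def greedy_clean(entities, budget):
    """Greedy heuristic: repeatedly clean the entity with the best
    expected quality gain per unit cost until the budget runs out.
    Each entity is (name, gain, cost, success_prob) -- hypothetical
    fields standing in for the paper's quality model."""
    heap = [(-gain * p_ok / cost, (name, gain, cost, p_ok))
            for name, gain, cost, p_ok in entities]
    heapq.heapify(heap)
    plan, spent = [], 0.0
    while heap:
        _, (name, gain, cost, p_ok) = heapq.heappop(heap)
        if spent + cost > budget:
            continue                      # too expensive; try cheaper ones
        plan.append(name)
        spent += cost
    return plan

entities = [("sensor_a", 0.9, 2.0, 0.8), ("sensor_b", 0.4, 1.0, 0.95)]
print(greedy_clean(entities, budget=2.5))
```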

34 citations


Proceedings ArticleDOI
08 Apr 2013
TL;DR: This paper studies how to derive an axis-parallel hyper-rectangle (called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell, and develops the PV-index, a structure that stores UBRs, to evaluate probabilistic nearest neighbor queries over uncertain data.
Abstract: In Voronoi-based nearest neighbor search, the Voronoi cell of every point p in a database can be used to check whether p is the closest to some query point q. We extend the notion of Voronoi cells to support uncertain objects, whose attribute values are inexact. Particularly, we propose the Possible Voronoi cell (or PV-cell). A PV-cell of a multi-dimensional uncertain object o is a region R, such that for any point p ∈ R, o may be the nearest neighbor of p. If the PV-cells of all objects in a database S are known, they can be used to identify objects that have a chance to be the nearest neighbor of q. However, there is no efficient algorithm for computing an exact PV-cell. We hence study how to derive an axis-parallel hyper-rectangle (called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell. We further develop the PV-index, a structure that stores UBRs, to evaluate probabilistic nearest neighbor queries over uncertain data. An advantage of the PV-index is that upon updates on S, it can be incrementally updated. Extensive experiments on both synthetic and real datasets are carried out to validate the performance of the PV-index.
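The pruning role of bounding rectangles can be illustrated with a standard min/max-distance test: an object may be the NN of q only if its minimum possible distance does not exceed the smallest maximum possible distance over all objects. This is a generic sketch, not the paper's PV-index:

```python
import numpy as np

def mindist(q, lo, hi):
    """Smallest possible distance from point q to a rectangle [lo, hi]."""
    d = np.maximum(lo - q, 0) + np.maximum(q - hi, 0)
    return np.linalg.norm(d)

def maxdist(q, lo, hi):
    """Largest possible distance from point q to a rectangle [lo, hi]."""
    d = np.maximum(np.abs(q - lo), np.abs(q - hi))
    return np.linalg.norm(d)

def possible_nn(q, rects):
    """Objects that may be the NN of q: those whose minimum distance
    does not exceed the smallest maximum distance over all objects."""
    best_max = min(maxdist(q, lo, hi) for lo, hi in rects)
    return [i for i, (lo, hi) in enumerate(rects)
            if mindist(q, lo, hi) <= best_max]

rects = [(np.array([0, 0]), np.array([1, 1])),
         (np.array([2, 2]), np.array([3, 3])),
         (np.array([9, 9]), np.array([10, 10]))]
print(possible_nn(np.array([1.5, 1.5]), rects))   # far object is pruned
```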

34 citations


Journal ArticleDOI
01 Jun 2013
TL;DR: This paper proposes the Uncertain-Voronoi diagram (or UV-diagram), which divides the data space into disjoint "UV-partitions", and uses a set of UV-cells to design the UV-index, which supports different queries and can be constructed in polynomial time.
Abstract: The Voronoi diagram is an important technique for answering nearest-neighbor queries for spatial databases. We study how the Voronoi diagram can be used for uncertain spatial data, which are inherent in scientific and business applications. Specifically, we propose the Uncertain-Voronoi diagram (or UV-diagram), which divides the data space into disjoint "UV-partitions". Each UV-partition P is associated with a set S of objects, such that any point q located in P has the set S as its nearest neighbor with nonzero probabilities. The UV-diagram enables queries that return objects with nonzero chances of being the nearest neighbor (NN) of a given point q. It supports "continuous nearest-neighbor search", which refreshes the set of NN objects of q, as the position of q changes. It also allows the analysis of nearest-neighbor information, for example, to find out the number of objects that are the nearest neighbors of any point in a given area. A UV-diagram requires exponential construction and storage costs. To tackle these problems, we devise an alternative representation of a UV-diagram, by using a set of UV-cells. A UV-cell of an object o is the extent e for which o can be the nearest neighbor of any point q ∈ e. We study how to speed up the derivation of UV-cells by considering its nearby objects. We also use the UV-cells to design the UV-index, which supports different queries, and can be constructed in polynomial time. We have performed extensive experiments on both real and synthetic data to validate the efficiency of our approaches.
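A small sketch of the "possible NN" test behind UV-cells, here with circular uncertainty regions (an assumption made for brevity), refreshed as the query point moves along a path to mimic continuous nearest-neighbor search:

```python
import numpy as np

def nn_candidates(q, objects):
    """Objects with a nonzero chance to be NN of q, modelled here with
    circular uncertainty regions (center c, radius r): o qualifies iff
    its minimum distance is at most every object's maximum distance."""
    dmin = [max(np.linalg.norm(q - c) - r, 0.0) for c, r in objects]
    dmax = [np.linalg.norm(q - c) + r for c, r in objects]
    return [i for i, lo in enumerate(dmin) if lo <= min(dmax)]

objects = [(np.array([0.0, 0.0]), 1.0),
           (np.array([4.0, 0.0]), 0.5),
           (np.array([9.0, 9.0]), 0.2)]

# "Continuous" NN: refresh the candidate set as q moves along a path.
for t in np.linspace(0, 1, 5):
    q = (1 - t) * np.array([0.0, 0.0]) + t * np.array([9.0, 9.0])
    print(round(t, 2), nn_candidates(q, objects))
```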

25 citations


Proceedings ArticleDOI
27 Oct 2013
TL;DR: This work proposes a dynamic programming (DP) algorithm for solving the plurality assignment problem (PAP) and identifies two interesting properties, namely, monotonicity and diminishing return, which are satisfied by a HIT if the quality of the HIT's answer increases monotonically at a decreasing rate with its plurality.
Abstract: In a crowdsourcing system, Human Intelligence Tasks (HITs) (e.g., translating sentences, matching photos, tagging videos with keywords) can be conveniently specified. HITs are made available to a large pool of workers, who are paid upon completing the HITs they have selected. Since workers may have different capabilities, some difficult HITs may not be satisfactorily performed by a single worker. If more workers are employed to perform a HIT, the quality of the HIT's answer could be statistically improved. Given a set of HITs and a fixed "budget", we address the important problem of determining the number of workers (or plurality) of each HIT so that the overall answer quality is optimized. We propose a dynamic programming (DP) algorithm for solving the plurality assignment problem (PAP). We identify two interesting properties, namely, monotonicity and diminishing return, which are satisfied by a HIT if the quality of the HIT's answer increases monotonically at a decreasing rate with its plurality. We show that, for HITs satisfying the two properties (e.g., multiple-choice-question HITs), the PAP is approximable, and we propose an efficient greedy algorithm for this case. We conduct extensive experiments on synthetic and real datasets to evaluate our algorithms. Our experiments show that our greedy algorithm provides close-to-optimal solutions in practice.
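A sketch of the greedy idea under a unit cost per worker, assuming the hypothetical quality model q_i(k) = 1 - (1 - p_i)^k, which is monotone with diminishing returns; those two properties are exactly what makes the greedy choice safe:

```python
import heapq

def assign_plurality(hits, budget):
    """Greedy plurality assignment: repeatedly give one more worker to
    the HIT with the largest marginal quality gain.  Each HIT i is
    described by p_i, the per-worker success probability, under the
    assumed model q_i(k) = 1 - (1 - p_i)**k."""
    def gain(p, k):                      # marginal gain of worker k+1
        return (1 - (1 - p) ** (k + 1)) - (1 - (1 - p) ** k)

    counts = [0] * len(hits)
    heap = [(-gain(p, 0), i) for i, p in enumerate(hits)]
    heapq.heapify(heap)
    for _ in range(budget):              # one worker assigned per step
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-gain(hits[i], counts[i]), i))
    return counts

print(assign_plurality([0.9, 0.6, 0.55], budget=10))
```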

24 citations


Journal ArticleDOI
TL;DR: This paper proposes novel methods that capture the itemset mining process as a probability distribution function, taking two models into account: the Poisson distribution and the normal distribution, and gives an intuition of which model-based approach fits best for different types of data sets.
Abstract: Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. To handle a large amount of imprecise information, uncertain databases have been recently developed. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose novel methods that capture the itemset mining process as a probability distribution function, taking two models into account: the Poisson distribution and the normal distribution. These model-based approaches extract frequent itemsets with a high degree of accuracy and …
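A sketch of the two model-based approximations, assuming each transaction contains the itemset independently with a known existential probability, so that the itemset's support follows a Poisson-binomial distribution that the Poisson and normal models approximate:

```python
from scipy import stats

def frequentness_prob(probs, minsup, model="poisson"):
    """P(support >= minsup), where the support of an itemset is the sum
    of independent Bernoulli indicators, one per transaction, with the
    given existential probabilities.  The exact distribution is
    Poisson-binomial; the two models below approximate it."""
    mu = sum(probs)
    if model == "poisson":
        return stats.poisson.sf(minsup - 1, mu)          # P(X >= minsup)
    var = sum(p * (1 - p) for p in probs)
    return stats.norm.sf(minsup - 0.5, mu, var ** 0.5)   # continuity-corrected

probs = [0.9, 0.8, 0.5, 0.4, 0.3]        # per-transaction probabilities
for m in ("poisson", "normal"):
    print(m, frequentness_prob(probs, minsup=3, model=m))
```

An itemset would then be reported as probabilistically frequent when this probability exceeds a user-given confidence threshold.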

21 citations


Proceedings ArticleDOI
Xuan S. Yang, Reynold Cheng, Luyi Mo, Ben Kao, David W. Cheung
08 Apr 2013
TL;DR: This work formalizes the concepts of tagging quality (TQ) and tagging stability (TS) in measuring the quality of a resource's tag description and proposes a theoretically optimal algorithm given a fixed “budget” (i.e., the amount of money paid for tagging resources).
Abstract: A social tagging system, such as del.icio.us and Flickr, allows users to annotate resources (e.g., web pages and photos) with text descriptions called tags. Tags have proven to be invaluable information for searching, mining, and recommending resources. In practice, however, not all resources receive the same attention from users. As a result, while some highly-popular resources are over-tagged, most of the resources are under-tagged. Incomplete tagging on resources severely affects the effectiveness of all tag-based techniques and applications. We address an interesting question: if users are paid to tag specific resources, how can we allocate incentives to resources in a crowd-sourcing environment so as to maximize the tagging quality of resources? We address this question by observing that the tagging quality of a resource becomes stable after it has been tagged a sufficient number of times. We formalize the concepts of tagging quality (TQ) and tagging stability (TS) in measuring the quality of a resource's tag description. We propose a theoretically optimal algorithm given a fixed “budget” (i.e., the amount of money paid for tagging resources). This solution decides the amount of rewards that should be invested on each resource in order to maximize tagging stability. We further propose a few simple, practical, and efficient incentive allocation strategies. On a dataset from del.icio.us, our best strategy provides resources with a close-to-optimal gain in tagging stability.
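A sketch of the stability intuition, using cosine similarity between tag distributions as a hypothetical stand-in for the paper's TS measure: once the distribution barely moves when new tags arrive, the resource's description is stable and further incentives are better spent elsewhere.

```python
from collections import Counter
import math

def tag_distribution(tags):
    """Normalize a list of tags into a relative-frequency distribution."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {t: k / total for t, k in counts.items()}

def similarity(d1, d2):
    """Cosine similarity between two tag distributions -- a hypothetical
    stand-in for the paper's tagging-stability measure."""
    dot = sum(d1.get(t, 0) * d2.get(t, 0) for t in set(d1) | set(d2))
    n1 = math.sqrt(sum(v * v for v in d1.values()))
    n2 = math.sqrt(sum(v * v for v in d2.values()))
    return dot / (n1 * n2)

tags = ["python", "code", "python", "tutorial", "python", "code"]
before = tag_distribution(tags[:4])
after = tag_distribution(tags)          # after two more paid tags
print(similarity(before, after))        # close to 1 => description is stable
```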

19 citations


Journal ArticleDOI
TL;DR: This paper proposes the probabilistic filter protocol, which helps remote sensor devices decide whether collected values should be reported to the query server, and can significantly reduce the communication and energy costs of sensor devices.
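A minimal sketch of the filter idea, assuming simple interval filters installed on the sensor (the paper's protocol is probabilistic and more elaborate): readings inside the filter are suppressed, so no communication is spent on them.

```python
def should_report(value, filter_lo, filter_hi):
    """Sensor-side filter: transmit only when the new reading falls
    outside the interval installed by the server, so that in-range
    readings cost no communication."""
    return not (filter_lo <= value <= filter_hi)

readings = [20.1, 20.4, 20.2, 23.7, 20.3]
reported = [v for v in readings if should_report(v, 19.5, 21.5)]
print(reported)   # only the outlier 23.7 is sent
```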

10 citations


Proceedings ArticleDOI
27 Oct 2013
TL;DR: This paper proposes an efficient approach to identify and evaluate iceberg cells of an s-cuboid, a multidimensional array of cells, each associated with a pattern instantiated from the query's pattern template.
Abstract: A Sequence OLAP (S-OLAP) system provides a platform on which pattern-based aggregate (PBA) queries on a sequence database are evaluated. In its simplest form, a PBA query consists of a pattern template T and an aggregate function F. A pattern template is a sequence of variables, each defined over a domain. For example, the template T = (X, Y, Y, X) consists of two variables X and Y. Each variable is instantiated with all possible values in its corresponding domain to derive all possible patterns of the template. Sequences are grouped based on the patterns they possess. The answer to a PBA query is a sequence cuboid (s-cuboid), which is a multidimensional array of cells. Each cell is associated with a pattern instantiated from the query's pattern template. The value of each s-cuboid cell is obtained by applying the aggregate function F to the set of data sequences that belong to that cell. Since a pattern template can involve many variables and can be arbitrarily long, the induced s-cuboid for a PBA query can be huge. For most analytical tasks, however, only iceberg cells with very large aggregate values are of interest. This paper proposes an efficient approach to identify and evaluate iceberg cells of s-cuboids. Experimental results show that our algorithms are orders of magnitude faster than existing approaches.
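A small sketch of pattern instantiation and iceberg filtering for a COUNT aggregate: it groups sequences by the variable binding they induce and keeps only cells meeting the threshold, without the paper's optimizations.

```python
from collections import Counter

def instantiate(seq, template):
    """Bind the template's variables against a sequence; return the
    binding (the s-cuboid cell) or None if the sequence conflicts,
    i.e., repeated variables see different values."""
    if len(seq) != len(template):
        return None
    binding = {}
    for var, item in zip(template, seq):
        if binding.setdefault(var, item) != item:
            return None
    return tuple(binding[v] for v in sorted(binding))

def iceberg_cells(sequences, template, min_count):
    """Group sequences by instantiated pattern and keep only cells
    whose aggregate (here: COUNT) reaches the iceberg threshold."""
    cells = Counter()
    for seq in sequences:
        cell = instantiate(seq, template)
        if cell is not None:
            cells[cell] += 1
    return {c: n for c, n in cells.items() if n >= min_count}

seqs = [("a", "b", "b", "a"), ("a", "b", "b", "a"), ("a", "c", "c", "a"),
        ("a", "b", "c", "a")]                  # last one conflicts on Y
print(iceberg_cells(seqs, ("X", "Y", "Y", "X"), min_count=2))
```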

3 citations


Proceedings ArticleDOI
18 Dec 2013
TL;DR: This paper considers the popular navigation application, where users may continuously query different location-based servers during their movements, and recommends a privacy-preserving path based on a set of metrics on privacy, distance, and the quality of service that an LBS requester often desires.
Abstract: With the increasing adoption of location-based services, privacy is becoming a major concern. To hide the identity and location of a request on a location-based service, most methods consider a set of users in a reasonable region so as to confuse their requests. When there are not enough users, the cloaking region needs to be expanded to a larger area, or the response needs to be delayed; either way degrades the quality of service. In this paper, we tackle the privacy problem predictively, by recommending a privacy-preserving path for a requester. We consider the popular navigation application, where users may continuously query different location-based servers during their movements. Based on a set of metrics on privacy, distance, and the quality of service that an LBS requester often desires, a secure path is computed for each request according to the user's preferences, and can be dynamically adjusted when the situation changes. A set of experiments is performed to verify our method, and the relationships between parameters are discussed in detail. We also discuss how to apply our method to practical applications.
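A sketch of the path-selection intuition, with hypothetical weights and normalized scores standing in for the paper's metrics on privacy, distance, and quality of service:

```python
def path_score(privacy, distance, qos, w=(0.5, 0.3, 0.2)):
    """Hypothetical scoring of a candidate path: combine a privacy score,
    a normalized (and inverted) detour cost, and a quality-of-service
    score with user-preference weights.  All inputs are in [0, 1]."""
    return w[0] * privacy + w[1] * (1 - distance) + w[2] * qos

candidates = {
    "main_road":  dict(privacy=0.3, distance=0.2, qos=0.9),
    "side_route": dict(privacy=0.8, distance=0.5, qos=0.7),
}
best = max(candidates, key=lambda k: path_score(**candidates[k]))
print(best)   # the more private route wins under these weights
```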

2 citations


Book ChapterDOI
01 Jan 2013
TL;DR: This work presents the PWS-quality metric, a universal measure that quantifies the ambiguity of query answers under the possible world semantics, and proposes a polynomial-time solution to achieve an optimal improvement in PWS-quality.
Abstract: Uncertain or imprecise data are pervasive in applications like location-based services, sensor monitoring, and data collection and integration. For these applications, probabilistic databases can be used to store uncertain data, and querying facilities are provided to yield answers with statistical confidence. Given that a limited amount of resources is available to “clean” the database (e.g., by probing some sensor data values to get their latest values), we address the problem of choosing the set of uncertain objects to be cleaned, in order to achieve the best improvement in the quality of query answers. For this purpose, we present the PWS-quality metric, which is a universal measure that quantifies the ambiguity of query answers under the possible world semantics. We study how PWS-quality can be efficiently evaluated for two major query classes: (1) queries that examine the satisfiability of tuples independent of other tuples (e.g., range queries) and (2) queries that require the knowledge of the relative ranking of the tuples (e.g., MAX queries). We then propose a polynomial-time solution to achieve an optimal improvement in PWS-quality. Other fast heuristics are also examined.
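For the tuple-independent case (e.g., range queries), the ambiguity of the answer set decomposes into per-tuple binary entropies, which makes both the quality computation and a simple cleaning heuristic easy to sketch; the uniform-cost greedy below is an illustration, not the paper's optimal budgeted solution.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli variable with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pws_quality(probs):
    """For a range query over tuple-independent data, the answer set's
    entropy is the sum of per-tuple binary entropies; quality is its
    negation (0 is best, i.e., no ambiguity at all)."""
    return -sum(binary_entropy(p) for p in probs)

def greedy_clean(probs, k):
    """Clean (resolve to 0/1) the k tuples whose entropy terms are
    largest -- the biggest improvement under uniform cleaning costs."""
    order = sorted(range(len(probs)), key=lambda i: -binary_entropy(probs[i]))
    return order[:k]

probs = [0.5, 0.9, 0.2, 0.45]           # per-tuple query-satisfaction probs
print(pws_quality(probs), greedy_clean(probs, k=2))
```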