Showing papers by "Yufei Tao" published in 2013

Open Access

Journal Article•DOI•

Clustering Uncertain Data Based on Probability Distribution Similarity

Bin Jiang¹, Jian Pei¹, Yufei Tao², Xuemin Lin³•Institutions (3)

Simon Fraser University¹, The Chinese University of Hong Kong², University of New South Wales³

01 Apr 2013-IEEE Transactions on Knowledge and Data Engineering

TL;DR: This work uses the well-known Kullback-Leibler divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrates it into partitioning and density-based clustering methods to cluster uncertain objects.

Abstract: Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges in both modeling similarity between uncertain objects and developing efficient computational methods. The previous methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between uncertain objects. In this paper, we systematically model uncertain objects in both continuous and discrete domains, where an uncertain object is modeled as a continuous and discrete random variable, respectively. We use the well-known Kullback-Leibler divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. Nevertheless, a naive implementation is very costly. In particular, computing exact KL divergence in the continuous case is very costly or even infeasible. To tackle the problem, we estimate KL divergence in the continuous case by kernel density estimation and employ the fast Gauss transform technique to further speed up the computation. Our extensive experiment results verify the effectiveness, efficiency, and scalability of our approaches.
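
To make the motivating example concrete, the following minimal sketch computes the discrete Kullback-Leibler divergence between two rating distributions that share the same mean but differ in spread. The function name `kl_divergence` and the rating vectors are illustrative assumptions, not taken from the paper, and the sketch covers only the discrete case (no kernel density estimation or fast Gauss transform).

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D(p || q) over a shared finite domain.

    Assumes p and q are probability vectors of equal length and that
    q[i] > 0 wherever p[i] > 0 (otherwise the divergence is infinite).
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_rating(p):
    """Expected rating when index i stands for the score i + 1."""
    return sum((i + 1) * pi for i, pi in enumerate(p))

# Hypothetical rating distributions over the scores 1..5.
# Both have mean 3, but product B is far more polarized than product A.
product_a = [0.05, 0.10, 0.70, 0.10, 0.05]
product_b = [0.40, 0.05, 0.10, 0.05, 0.40]

print(mean_rating(product_a), mean_rating(product_b))
print(kl_divergence(product_a, product_b))
```

A distance based only on the means would treat the two products as identical, whereas the divergence is strictly positive. Note that KL divergence is not symmetric in general, so `kl_divergence(p, q)` and `kl_divergence(q, p)` may differ.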

149 citations

Proceedings Article•DOI•

Massive graph triangulation

Xiaocheng Hu¹, Yufei Tao¹, Chin-Wan Chung²•Institutions (2)

The Chinese University of Hong Kong¹, KAIST²

22 Jun 2013

TL;DR: A new algorithm is developed that is provably I/O and CPU efficient at the same time, without making any assumption on the input G at all, and outperformed the existing competitors by a factor over an order of magnitude in extensive experimentation.

Abstract: This paper studies I/O-efficient algorithms for settling the classic triangle listing problem, whose solution is a basic operator in dealing with many other graph problems. Specifically, given an undirected graph G, the objective of triangle listing is to find all the cliques involving 3 vertices in G. The problem has been well studied in internal memory, but remains an urgent difficult challenge when G does not fit in memory, rendering any algorithm to entail frequent I/O accesses. Although previous research has attempted to tackle the challenge, the state-of-the-art solutions rely on a set of crippling assumptions to guarantee good performance. Motivated by this, we develop a new algorithm that is provably I/O and CPU efficient at the same time, without making any assumption on the input G at all. The algorithm uses ideas drastically different from all the previous approaches, and outperformed the existing competitors by a factor over an order of magnitude in our extensive experimentation.
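
For reference, the triangle listing problem itself can be stated in a few lines when the graph fits in memory. The sketch below is a plain in-memory baseline using adjacency-set intersection; it is not the I/O-efficient algorithm of the paper, and the name `list_triangles` is an assumption made for illustration. Vertices are assumed to be mutually comparable (e.g., integers).

```python
def list_triangles(edges):
    """Return every triangle (u, v, w) with u < v < w in an undirected graph.

    `edges` is an iterable of vertex pairs; self-loops and duplicate
    edges are ignored.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # a self-loop cannot be part of a triangle
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    triangles = []
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Common neighbours of u and v close a triangle; requiring
                # v < w reports each triangle exactly once.
                for w in adj[u] & adj[v]:
                    if v < w:
                        triangles.append((u, v, w))
    return sorted(triangles)

print(list_triangles([(1, 2), (2, 3), (1, 3), (3, 4)]))
```

When the adjacency sets do not fit in memory, each set lookup may become a random disk access, which is the difficulty the abstract refers to.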

109 citations

Proceedings Article•DOI•

Minimal MapReduce algorithms

Yufei Tao¹, Wenqing Lin², Xiaokui Xiao²•Institutions (2)

The Chinese University of Hong Kong¹, Nanyang Technological University²

22 Jun 2013

TL;DR: The notion of minimal algorithm is presented, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor.

Abstract: MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of tera-bytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine. Although these principles have guided the development of MapReduce algorithms, limited emphasis has been placed on enforcing serious constraints on the aforementioned metrics simultaneously. This paper presents the notion of minimal algorithm, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor. We show the existence of elegant minimal algorithms for a set of fundamental database problems, and demonstrate their excellent performance with extensive experiments.

80 citations

Journal Article•DOI•

Approximate MaxRS in spatial databases

Yufei Tao¹, Xiaocheng Hu², Dong-Wan Choi¹, Chin-Wan Chung¹•Institutions (2)

KAIST¹, The Chinese University of Hong Kong²

01 Aug 2013

TL;DR: The present paper studies the (1 - ε)-approximate MaxRS problem, which admits the same inputs as MaxRS, but aims instead to return a rectangle whose covered weight is at least (1 - ε)m*, where m* is the optimal covered weight, and ε can be an arbitrarily small constant between 0 and 1.

Abstract: In the maximizing range sum (MaxRS) problem, given (i) a set P of 2D points each of which is associated with a positive weight, and (ii) a rectangle r of specific extents, we need to decide where to place r in order to maximize the covered weight of r, that is, the total weight of the data points covered by r. Algorithms solving the problem exactly entail expensive CPU or I/O cost. In practice, exact answers are often not compulsory in a MaxRS application, where slight imprecision can often be comfortably tolerated, provided that approximate answers can be computed considerably faster. Motivated by this, the present paper studies the (1 - ε)-approximate MaxRS problem, which admits the same inputs as MaxRS, but aims instead to return a rectangle whose covered weight is at least (1 - ε)m*, where m* is the optimal covered weight, and ε can be an arbitrarily small constant between 0 and 1. We present fast algorithms that settle this problem with strong theoretical guarantees.
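
The problem definition and the approximation guarantee can be illustrated with a brute-force sketch. It is not one of the paper's algorithms: it simply tries every placement whose left edge and bottom edge pass through data coordinates, which suffices because any placement can be shifted right and up until it touches covered points without losing any of them. The function names and the convention that the rectangle is closed (points on its boundary count as covered) are assumptions made here.

```python
def covered_weight(points, x0, y0, width, height):
    """Total weight of points (x, y, w) inside the closed rectangle
    with lower-left corner (x0, y0) and the given extents."""
    return sum(w for x, y, w in points
               if x0 <= x <= x0 + width and y0 <= y <= y0 + height)

def max_rs_bruteforce(points, width, height):
    """Exact MaxRS by exhaustive search, cubic in the number of points.

    Returns (best_weight, lower_left_corner); the corner is None when
    `points` is empty.
    """
    best, best_pos = 0, None
    for x0, _, _ in points:
        for _, y0, _ in points:
            cw = covered_weight(points, x0, y0, width, height)
            if cw > best:
                best, best_pos = cw, (x0, y0)
    return best, best_pos

def is_approximate_answer(points, pos, width, height, eps):
    """Check the (1 - eps) guarantee: covered weight >= (1 - eps) * m*."""
    m_star, _ = max_rs_bruteforce(points, width, height)
    return covered_weight(points, pos[0], pos[1], width, height) >= (1 - eps) * m_star

# Hypothetical weighted points (x, y, weight) and a 2 x 2 rectangle.
pts = [(0, 0, 1), (1, 1, 2), (1.5, 0.5, 1), (5, 5, 3)]
print(max_rs_bruteforce(pts, 2, 2))
print(is_approximate_answer(pts, (5, 5), 2, 2, eps=0.25))
```

In this example the placement at (5, 5) covers weight 3 against an optimum of 4, so it is an acceptable answer for ε = 0.25 but not for ε = 0.1.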

50 citations

Proceedings Article•DOI•

Optimal splitters for temporal and multi-version databases

Wangchao Le¹, Feifei Li¹, Yufei Tao², Robert Christensen¹•Institutions (2)

University of Utah¹, KAIST²

22 Jun 2013

TL;DR: This work introduces the concept of optimal splitters for temporal and multi-version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket be minimized among all possible configurations, given a budget for the desired number of buckets.

Abstract: Temporal and multi-version databases are ideal candidates for a distributed store, which offers large storage space, and parallel and distributed processing power from a cluster of (commodity) machines. A key challenge is to achieve a good load balancing algorithm for storage and processing of these data, which is done by partitioning the database. We introduce the concept of optimal splitters for temporal and multi-version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket be minimized among all possible configurations, given a budget for the desired number of buckets. We design efficient methods for memory- and disk resident data respectively, and show that they significantly outperform competing baseline methods both theoretically and empirically on large real data sets.

15 citations

Posted Content•

I/O-Efficient Planar Range Skyline and Attrition Priority Queues

Casper Kejlberg-Rasmussen¹, Yufei Tao², Konstantinos Tsakalidis³, Kostas Tsichlas⁴, Jeonghun Yoon⁵•Institutions (5)

Aarhus University¹, The Chinese University of Hong Kong², Hong Kong University of Science and Technology³, Aristotle University of Thessaloniki⁴, KAIST⁵

12 Jun 2013-arXiv: Data Structures and Algorithms

TL;DR: In this paper, the authors give a dynamic structure for top-open queries, which leads to a dynamic structure for 4-sided queries with optimal query cost O((n/B)^ε + k/B) and amortized update cost O(log(n/B)).

Abstract: In the planar range skyline reporting problem, we store a set P of n 2D points in a structure such that, given a query rectangle Q = [a_1, a_2] × [b_1, b_2], the maxima (a.k.a. skyline) of P ∩ Q can be reported efficiently. The query is 3-sided if an edge of Q is grounded, giving rise to two variants: top-open (b_2 = ∞) and left-open (a_1 = -∞) queries. All our results are in external memory under the O(n/B) space budget, for both the static and dynamic settings: (1) For static P, we give structures that answer top-open queries in O(log_B n + k/B), O(log log_B U + k/B), and O(1 + k/B) I/Os when the universe is R^2, a U × U grid, and a rank space grid [O(n)]^2, respectively (where k is the number of reported points). The query complexity is optimal in all cases. (2) We show that the left-open case is harder, such that any linear-size structure must incur Ω((n/B)^ε + k/B) I/Os for a query. We show that this case is as difficult as the general 4-sided queries, for which we give a static structure with the optimal query cost O((n/B)^ε + k/B). (3) We give a dynamic structure that supports top-open queries in O(log_{2B^ε}(n/B) + k/B^{1-ε}) I/Os, and updates in O(log_{2B^ε}(n/B)) I/Os, for any ε satisfying 0 ≤ ε ≤ 1. This leads to a dynamic structure for 4-sided queries with optimal query cost O((n/B)^ε + k/B), and amortized update cost O(log(n/B)). As a contribution of independent interest, we propose an I/O-efficient version of the fundamental structure priority queue with attrition (PQA). Our PQA supports FindMin, DeleteMin, and InsertAndAttrite all in O(1) worst case I/Os, and O(1/B) amortized I/Os per operation. We also add the new CatenateAndAttrite operation that catenates two PQAs in O(1) worst case and O(1/B) amortized I/Os. This operation is a non-trivial extension to the classic PQA of Sundar, even in internal memory.
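
The PQA operations named above can be sketched in internal memory as follows. This is a simple deque-based illustration with amortized O(1) cost per operation, not the worst-case or I/O-efficient structure of the paper. The class name `AttritionQueue` is hypothetical, and the semantics are assumptions stated here: InsertAndAttrite(x) discards every stored element that is greater than or equal to x before inserting x, and CatenateAndAttrite(Q1, Q2) discards every element of Q1 that is greater than or equal to the minimum of Q2 before appending Q2.

```python
from collections import deque

class AttritionQueue:
    """In-memory priority queue with attrition (illustrative sketch).

    Invariant: the stored elements are strictly increasing from front
    to back, so the minimum is always at the front.
    """

    def __init__(self, items=()):
        self._items = deque()
        for x in items:
            self.insert_and_attrite(x)

    def find_min(self):
        """Return the minimum element, or None if the queue is empty."""
        return self._items[0] if self._items else None

    def delete_min(self):
        """Remove and return the minimum element, or None if empty."""
        return self._items.popleft() if self._items else None

    def insert_and_attrite(self, x):
        """Insert x after discarding every element >= x."""
        while self._items and self._items[-1] >= x:
            self._items.pop()
        self._items.append(x)

    def catenate_and_attrite(self, other):
        """Append `other`, discarding own elements >= min(other).

        `other` is emptied by this call.
        """
        if other._items:
            head = other._items[0]
            while self._items and self._items[-1] >= head:
                self._items.pop()
            self._items.extend(other._items)
            other._items.clear()

    def to_list(self):
        return list(self._items)

q = AttritionQueue([5, 7, 9])
q.insert_and_attrite(6)   # 7 and 9 are attrited
print(q.to_list())
```

Each element is appended once and removed at most once, which is where the amortized constant cost comes from in this sketch.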

11 citations

Proceedings Article•DOI•

I/O-efficient planar range skyline and attrition priority queues

Casper Kejlberg-Rasmussen¹, Yufei Tao², Konstantinos Tsakalidis³, Kostas Tsichlas⁴, Jeonghun Yoon⁵•Institutions (5)

Aarhus University¹, The Chinese University of Hong Kong², Hong Kong University of Science and Technology³, Aristotle University of Thessaloniki⁴, KAIST⁵

22 Jun 2013

TL;DR: The first dynamic linear space data structure that supports top-open queries in O(log_{2B^ε}(n/B) + k/B^{1-ε}) I/Os is presented, and the lower and upper bounds coincide with those of the planar orthogonal range reporting problem, i.e., the skyline requirement does not alter the problem difficulty at all.

Abstract: We study the static and dynamic planar range skyline reporting problem in the external memory model with block size B, under a linear space budget. The problem asks for an O(n/B) space data structure that stores n points in the plane, and supports reporting the k maximal input points (a.k.a. skyline) among the points that lie within a given query rectangle Q = [α1, α2] × [β1, β2]. When Q is 3-sided, i.e. one of its edges is grounded, two variants arise: top-open for β2 = ∞ and left-open for α1 = -∞ (symmetrically bottom-open and right-open) queries. We present optimal static data structures for top-open queries, for the cases where the universe is R^2, a U × U grid, and rank space [O(n)]^2. We also show that left-open queries are harder, as they require Ω((n/B)^ε + k/B) I/Os for ε > 0, when only linear space is allowed. We show that the lower bound is tight, by a structure that supports 4-sided queries in matching complexities. Interestingly, these lower and upper bounds coincide with those of the planar orthogonal range reporting problem, i.e., the skyline requirement does not alter the problem difficulty at all! Finally, we present the first dynamic linear space data structure that supports top-open queries in O(log_{2B^ε}(n/B) + k/B^{1-ε}) and updates in O(log_{2B^ε}(n/B)) worst case I/Os, for ε ∈ [0, 1]. This also yields a linear space data structure for 4-sided queries with optimal query I/Os and O(log(n/B)) amortized update I/Os. We consider of independent interest the main component of our dynamic structures, a new real-time I/O-efficient and catenable variant of the fundamental structure priority queue with attrition by Sundar.
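
What a range skyline query returns can be shown with a brute-force sketch: filter the points that fall inside the query rectangle, then sweep them from right to left while tracking the highest y-coordinate seen so far. This scans all points per query and is only meant to pin down the query semantics; it is not one of the paper's structures, and the function names are assumptions. A point is taken to be maximal when no other point in the range is at least as large in both coordinates.

```python
import math

def skyline(points):
    """Maxima of a set of 2D points, ordered by decreasing x."""
    result = []
    best_y = -math.inf
    # Visit points from right to left; on equal x, the higher one first.
    for x, y in sorted(points, key=lambda p: (-p[0], -p[1])):
        if y > best_y:  # nothing to the right is at least as high
            result.append((x, y))
            best_y = y
    return result

def range_skyline(points, a1, a2, b1, b2):
    """Skyline of the points inside the closed rectangle [a1, a2] x [b1, b2]."""
    inside = [(x, y) for x, y in points if a1 <= x <= a2 and b1 <= y <= b2]
    return skyline(inside)

# Hypothetical input; the query below is top-open (b2 = infinity).
pts = [(1, 5), (2, 4), (3, 1), (2, 2), (0, 6), (4, 0.5)]
print(range_skyline(pts, 0.5, 3.5, 0.8, math.inf))
```

A left-open query is obtained in the same way by passing `-math.inf` for `a1`.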

9 citations

Proceedings Article•DOI•

Output-sensitive skyline algorithms in external memory

Xiaocheng Hu¹, Cheng Sheng¹, Yufei Tao², Yi Yang³, Shuigeng Zhou³•Institutions (3)

The Chinese University of Hong Kong¹, KAIST², Fudan University³

06 Jan 2013

TL;DR: New results in external memory for finding the skyline (a.k.a. maxima) of N points in d-dimensional space are presented, and a deterministic algorithm for solving the M-pivot problem in O(N/B) I/Os is given.

Abstract: This paper presents new results in external memory for finding the skyline (a.k.a. maxima) of N points in d-dimensional space. The state of the art uses O((N/B) log_{M/B}^{d-2}(N/B)) I/Os for fixed d ≥ 3, and O((N/B) log_{M/B}(N/B)) I/Os for d = 2, where M and B are the sizes (in words) of memory and a disk block, respectively. We give algorithms whose running time depends on the number K of points in the skyline. Specifically, we achieve O((N/B) log_{M/B}^{d-2}(K/B)) expected cost for fixed d ≥ 3, and O((N/B) log_{M/B}(K/B)) worst-case cost for d = 2. As a side product, we solve two problems both of independent interest. The first one, the M-skyline problem, aims at reporting M arbitrary skyline points, or the entire skyline if its size is at most M. We settle this problem in O(N/B) expected time in any fixed dimensionality d. The second one, the M-pivot problem, is more fundamental: given a set S of N elements drawn from an ordered domain, it outputs M evenly scattered elements (called pivots) from S, namely, S has asymptotically the same number of elements between each pair of consecutive pivots. We give a deterministic algorithm for solving the problem in O(N/B) I/Os.
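
The output of the M-pivot problem can be illustrated by the obvious in-memory approach of sorting and taking evenly spaced ranks. This is only a specification-level sketch: it sorts the whole input, whereas the paper obtains the pivots in O(N/B) I/Os. The function name `evenly_spaced_pivots` and the exact rank formula are assumptions chosen for illustration.

```python
def evenly_spaced_pivots(elements, m):
    """Return up to m pivots splitting the sorted input into near-equal runs.

    The i-th pivot (0-based) is the element at sorted rank
    floor((i + 1) * n / (m + 1)), where n is the input size.
    """
    s = sorted(elements)
    n = len(s)
    if n == 0 or m <= 0:
        return []
    m = min(m, n)  # cannot return more pivots than elements
    return [s[((i + 1) * n) // (m + 1)] for i in range(m)]

print(evenly_spaced_pivots(range(10), 4))
```

With 10 elements and 4 pivots, the pivots sit at ranks 2, 4, 6, and 8, so the number of elements between consecutive pivots differs by at most a small constant.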

7 citations

Journal Article•DOI•

Deep Web and MapReduce

Yufei Tao¹•Institutions (1)

KAIST¹

30 Sep 2013-Journal of computing science and engineering

TL;DR: Algorithms for exploring the deep Web, which refers to the collection of Web pages that cannot be reached by conventional Web crawlers, are discussed and sorting algorithms on the MapReduce system are discussed.

Abstract: This invited paper introduces results on Web science and technology obtained during work with the Korea Advanced Institute of Science and Technology. In the first part, we discuss algorithms for exploring the deep Web, which refers to the collection of Web pages that cannot be reached by conventional Web crawlers. In the second part, we discuss sorting algorithms on the MapReduce system, which has become a dominant paradigm for massive parallel computing.

1 citation