
Showing papers by "Yufei Tao" published in 2013


Journal ArticleDOI
TL;DR: This work uses the well-known Kullback-Leibler divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrates it into partitioning and density-based clustering methods to cluster uncertain objects.
Abstract: Clustering on uncertain data, one of the essential tasks in mining uncertain data, poses significant challenges in both modeling similarity between uncertain objects and developing efficient computational methods. Previous methods extend traditional partitioning clustering methods like k-means and density-based clustering methods like DBSCAN to uncertain data, and thus rely on geometric distances between objects. Such methods cannot handle uncertain objects that are geometrically indistinguishable, such as products with the same mean but very different variances in customer ratings. Surprisingly, probability distributions, which are essential characteristics of uncertain objects, have not been considered in measuring similarity between uncertain objects. In this paper, we systematically model uncertain objects in both continuous and discrete domains, where an uncertain object is modeled as a continuous and discrete random variable, respectively. We use the well-known Kullback-Leibler divergence to measure similarity between uncertain objects in both the continuous and discrete cases, and integrate it into partitioning and density-based clustering methods to cluster uncertain objects. Nevertheless, a naive implementation is very costly. In particular, computing exact KL divergence in the continuous case is very costly or even infeasible. To tackle the problem, we estimate KL divergence in the continuous case by kernel density estimation and employ the fast Gauss transform technique to further speed up the computation. Our extensive experimental results verify the effectiveness, efficiency, and scalability of our approaches.

149 citations
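
To make the abstract's motivating example concrete, the sketch below computes the discrete Kullback-Leibler divergence between two rating distributions that share the same mean but differ in variance. The function name `kl_divergence` and the two distributions `product_a` and `product_b` are hypothetical illustrations, not taken from the paper, and the sketch covers only the discrete case (no kernel density estimation or fast Gauss transform).

```python
import math

def kl_divergence(p, q):
    """D(p || q) for two discrete distributions given as equal-length
    probability lists. Assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mean_rating(dist, values):
    """Expected value of a discrete distribution over the given values."""
    return sum(v * p for v, p in zip(values, dist))

# Hypothetical rating distributions over 1..5 stars (illustrative only).
stars = [1, 2, 3, 4, 5]
product_a = [0.05, 0.10, 0.70, 0.10, 0.05]  # ratings concentrated at 3 stars
product_b = [0.30, 0.15, 0.10, 0.15, 0.30]  # polarized ratings

# Both products have the same mean rating, so a distance based on the
# mean alone cannot separate them, while KL divergence is clearly positive.
same_mean = math.isclose(mean_rating(product_a, stars),
                         mean_rating(product_b, stars))
d_ab = kl_divergence(product_a, product_b)
d_ba = kl_divergence(product_b, product_a)
```

Note that KL divergence is asymmetric (`d_ab` and `d_ba` differ), which any clustering method built on it has to account for.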


Proceedings ArticleDOI
22 Jun 2013
TL;DR: A new algorithm is developed that is provably I/O and CPU efficient at the same time, without making any assumption on the input G at all, and outperformed the existing competitors by a factor over an order of magnitude in extensive experimentation.
Abstract: This paper studies I/O-efficient algorithms for settling the classic triangle listing problem, whose solution is a basic operator in dealing with many other graph problems. Specifically, given an undirected graph G, the objective of triangle listing is to find all the cliques involving 3 vertices in G. The problem has been well studied in internal memory, but remains an urgent difficult challenge when G does not fit in memory, rendering any algorithm to entail frequent I/O accesses. Although previous research has attempted to tackle the challenge, the state-of-the-art solutions rely on a set of crippling assumptions to guarantee good performance. Motivated by this, we develop a new algorithm that is provably I/O and CPU efficient at the same time, without making any assumption on the input G at all. The algorithm uses ideas drastically different from all the previous approaches, and outperformed the existing competitors by a factor over an order of magnitude in our extensive experimentation.

109 citations
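
As a point of reference for the problem statement above, here is a minimal in-memory sketch of triangle listing. It is not the I/O-efficient algorithm of the paper: it assumes the whole graph fits in memory and that vertex labels are mutually comparable. The function name `list_triangles` is a hypothetical choice.

```python
def list_triangles(edges):
    """Return every 3-clique (u, v, w) with u < v < w in an undirected graph
    given as an iterable of vertex pairs. In-memory baseline only."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    triangles = []
    for u in adj:
        for v in adj[u]:
            if u < v:
                # Any common neighbor w of u and v closes a triangle;
                # requiring v < w reports each triangle exactly once.
                for w in adj[u] & adj[v]:
                    if v < w:
                        triangles.append((u, v, w))
    return sorted(triangles)

example_edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
example_triangles = list_triangles(example_edges)
```

The set intersections here rely on random access to adjacency lists, which is exactly what becomes expensive once the graph no longer fits in memory.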


Proceedings ArticleDOI
22 Jun 2013
TL;DR: The notion of minimal algorithm is presented, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor.
Abstract: MapReduce has become a dominant parallel computing paradigm for big data, i.e., colossal datasets at the scale of tera-bytes or higher. Ideally, a MapReduce system should achieve a high degree of load balancing among the participating machines, and minimize the space usage, CPU and I/O time, and network transfer at each machine. Although these principles have guided the development of MapReduce algorithms, limited emphasis has been placed on enforcing serious constraints on the aforementioned metrics simultaneously. This paper presents the notion of minimal algorithm, that is, an algorithm that guarantees the best parallelization in multiple aspects at the same time, up to a small constant factor. We show the existence of elegant minimal algorithms for a set of fundamental database problems, and demonstrate their excellent performance with extensive experiments.

80 citations


Journal ArticleDOI
01 Aug 2013
TL;DR: The present paper studies the (1 - ε)-approximate MaxRS problem, which admits the same inputs as MaxRS, but aims instead to return a rectangle whose covered weight is at least (1 - ε)m*, where m* is the optimal covered weight, and ε can be an arbitrarily small constant between 0 and 1.
Abstract: In the maximizing range sum (MaxRS) problem, given (i) a set P of 2D points each of which is associated with a positive weight, and (ii) a rectangle r of specific extents, we need to decide where to place r in order to maximize the covered weight of r, that is, the total weight of the data points covered by r. Algorithms solving the problem exactly entail expensive CPU or I/O cost. In practice, exact answers are often not compulsory in a MaxRS application, where slight imprecision can often be comfortably tolerated, provided that approximate answers can be computed considerably faster. Motivated by this, the present paper studies the (1 - ε)-approximate MaxRS problem, which admits the same inputs as MaxRS, but aims instead to return a rectangle whose covered weight is at least (1 - ε)m*, where m* is the optimal covered weight, and ε can be an arbitrarily small constant between 0 and 1. We present fast algorithms that settle this problem with strong theoretical guarantees.

50 citations
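
The problem definition above can be illustrated with a brute-force exact baseline. This is a sketch for small inputs only and is not one of the paper's algorithms. It assumes closed rectangles and relies on the observation that some optimal placement has its left edge on a point's x-coordinate and its bottom edge on a point's y-coordinate. The names `max_range_sum` and `is_acceptable` are hypothetical.

```python
def max_range_sum(points, width, height):
    """Exact MaxRS by brute force, O(n^3) time.
    points: list of (x, y, weight) with positive weights.
    Returns (best covered weight, lower-left corner of a best rectangle)."""
    best_weight, best_corner = 0, None
    xs = {x for x, _, _ in points}
    ys = {y for _, y, _ in points}
    for x0 in xs:
        for y0 in ys:
            covered = sum(w for x, y, w in points
                          if x0 <= x <= x0 + width and y0 <= y <= y0 + height)
            if covered > best_weight:
                best_weight, best_corner = covered, (x0, y0)
    return best_weight, best_corner

def is_acceptable(covered_weight, optimal_weight, eps):
    """Check the (1 - eps)-approximation guarantee from the abstract."""
    return covered_weight >= (1 - eps) * optimal_weight

# Hypothetical data: three nearby points and one heavy, isolated point.
example_points = [(0, 0, 1), (1, 1, 2), (1.5, 0.5, 1), (5, 5, 3)]
m_star, corner = max_range_sum(example_points, width=2, height=2)
```

With `eps = 0.25`, any placement covering weight at least `0.75 * m_star` would count as a valid approximate answer.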


Proceedings ArticleDOI
22 Jun 2013
TL;DR: This work introduces the concept of optimal splitters for temporal and multi-version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket be minimized among all possible configurations, given a budget for the desired number of buckets.
Abstract: Temporal and multi-version databases are ideal candidates for a distributed store, which offers large storage space, and parallel and distributed processing power from a cluster of (commodity) machines. A key challenge is to achieve good load balancing for the storage and processing of these data, which is done by partitioning the database. We introduce the concept of optimal splitters for temporal and multi-version databases, which induce a partition of the input data set, and guarantee that the size of the maximum bucket is minimized among all possible configurations, given a budget for the desired number of buckets. We design efficient methods for memory-resident and disk-resident data respectively, and show that they significantly outperform competing baseline methods both theoretically and empirically on large real data sets.

15 citations


Posted Content
TL;DR: In this paper, the authors give a dynamic structure for 4-sided queries with optimal query cost O((n/B)^\epsilon + k/B), and amortized update cost O(log (n/B)).
Abstract: In the planar range skyline reporting problem, we store a set P of n 2D points in a structure such that, given a query rectangle Q = [a_1, a_2] x [b_1, b_2], the maxima (a.k.a. skyline) of P \cap Q can be reported efficiently. The query is 3-sided if an edge of Q is grounded, giving rise to two variants: top-open (b_2 = \infty) and left-open (a_1 = -\infty) queries. All our results are in external memory under the O(n/B) space budget, for both the static and dynamic settings:
* For static P, we give structures that answer top-open queries in O(log_B n + k/B), O(loglog_B U + k/B), and O(1 + k/B) I/Os when the universe is R^2, a U x U grid, and a rank space grid [O(n)]^2, respectively (where k is the number of reported points). The query complexity is optimal in all cases.
* We show that the left-open case is harder, such that any linear-size structure must incur \Omega((n/B)^\epsilon + k/B) I/Os for a query. We show that this case is as difficult as the general 4-sided queries, for which we give a static structure with the optimal query cost O((n/B)^\epsilon + k/B).
* We give a dynamic structure that supports top-open queries in O(log_{2B^\epsilon}(n/B) + k/B^{1-\epsilon}) I/Os, and updates in O(log_{2B^\epsilon}(n/B)) I/Os, for any \epsilon satisfying 0 \le \epsilon \le 1. This leads to a dynamic structure for 4-sided queries with optimal query cost O((n/B)^\epsilon + k/B), and amortized update cost O(log (n/B)).
As a contribution of independent interest, we propose an I/O-efficient version of the fundamental structure priority queue with attrition (PQA). Our PQA supports FindMin, DeleteMin, and InsertAndAttrite all in O(1) worst case I/Os, and O(1/B) amortized I/Os per operation. We also add the new CatenateAndAttrite operation that catenates two PQAs in O(1) worst case and O(1/B) amortized I/Os. This operation is a non-trivial extension to the classic PQA of Sundar, even in internal memory.

11 citations


Proceedings ArticleDOI
22 Jun 2013
TL;DR: The first dynamic linear-space data structure that supports top-open queries is presented, and the lower and upper bounds coincide with those of the planar orthogonal range reporting problem, i.e., the skyline requirement does not alter the problem difficulty at all.
Abstract: We study the static and dynamic planar range skyline reporting problem in the external memory model with block size B, under a linear space budget. The problem asks for an O(n/B) space data structure that stores n points in the plane, and supports reporting the k maximal input points (a.k.a. skyline) among the points that lie within a given query rectangle Q = [α1, α2] × [β1, β2]. When Q is 3-sided, i.e., one of its edges is grounded, two variants arise: top-open for β2 = ∞ and left-open for α1 = -∞ (symmetrically bottom-open and right-open) queries. We present optimal static data structures for top-open queries, for the cases where the universe is R^2, a U × U grid, and rank space [O(n)]^2. We also show that left-open queries are harder, as they require Ω((n/B)^ε + k/B) I/Os for ε > 0, when only linear space is allowed. We show that the lower bound is tight, by a structure that supports 4-sided queries in matching complexities. Interestingly, these lower and upper bounds coincide with those of the planar orthogonal range reporting problem, i.e., the skyline requirement does not alter the problem difficulty at all! Finally, we present the first dynamic linear space data structure that supports top-open queries in O(log_{2B^ε}(n/B) + k/B^{1-ε}) and updates in O(log_{2B^ε}(n/B)) worst case I/Os, for ε ∈ [0, 1]. This also yields a linear space data structure for 4-sided queries with optimal query I/Os and O(log(n/B)) amortized update I/Os. We consider of independent interest the main component of our dynamic structures, a new real-time I/O-efficient and catenable variant of the fundamental structure priority queue with attrition by Sundar.

9 citations


Proceedings ArticleDOI
06 Jan 2013
TL;DR: New results in external memory for finding the skyline (a.k.a. maxima) of N points in d-dimensional space are presented and a deterministic algorithm for solving the problem in O(N/B) I/Os is given.
Abstract: This paper presents new results in external memory for finding the skyline (a.k.a. maxima) of N points in d-dimensional space. The state of the art uses O((N/B) log^{d−2}_{M/B}(N/B)) I/Os for fixed d ≥ 3, and O((N/B) log_{M/B}(N/B)) I/Os for d = 2, where M and B are the sizes (in words) of memory and a disk block, respectively. We give algorithms whose running time depends on the number K of points in the skyline. Specifically, we achieve O((N/B) log^{d−2}_{M/B}(K/B)) expected cost for fixed d ≥ 3, and O((N/B) log_{M/B}(K/B)) worst-case cost for d = 2. As a side product, we solve two problems both of independent interest. The first one, the M-skyline problem, aims at reporting M arbitrary skyline points, or the entire skyline if its size is at most M. We settle this problem in O(N/B) expected time in any fixed dimensionality d. The second one, the M-pivot problem, is more fundamental: given a set S of N elements drawn from an ordered domain, it outputs M evenly scattered elements (called pivots) from S, namely, S has asymptotically the same number of elements between each pair of consecutive pivots. We give a deterministic algorithm for solving the problem in O(N/B) I/Os.

7 citations
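
For readers unfamiliar with the skyline (maxima) operator that this entry and the two range skyline entries above build on, here is a minimal in-memory sketch for d = 2. It is a plain sort-and-sweep, not the output-sensitive external-memory algorithm of the paper. It assumes the usual convention that a point is dominated when another distinct point is at least as large in both coordinates. The function name `skyline_2d` is hypothetical.

```python
def skyline_2d(points):
    """Return the maxima of a set of 2D points: those points for which no
    other distinct point has both x and y at least as large.
    Runs in O(N log N) time in internal memory."""
    result = []
    best_y = float("-inf")
    # Sweep from the largest x to the smallest; among equal x, larger y first.
    for x, y in sorted(set(points), key=lambda p: (-p[0], -p[1])):
        # A point survives only if it is strictly higher than every point
        # already seen, all of which have x at least as large.
        if y > best_y:
            result.append((x, y))
            best_y = y
    return result

example_points = [(1, 5), (2, 4), (3, 1), (2, 2), (0, 0), (3, 3)]
example_skyline = skyline_2d(example_points)
```

The output size `len(example_skyline)` plays the role of K in the bounds quoted in the abstract.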


Journal ArticleDOI
Yufei Tao
TL;DR: Algorithms for exploring the deep Web, which refers to the collection of Web pages that cannot be reached by conventional Web crawlers, are discussed and sorting algorithms on the MapReduce system are discussed.
Abstract: This invited paper introduces results on Web science and technology obtained during work with the Korea Advanced Institute of Science and Technology. In the first part, we discuss algorithms for exploring the deep Web, which refers to the collection of Web pages that cannot be reached by conventional Web crawlers. In the second part, we discuss sorting algorithms on the MapReduce system, which has become a dominant paradigm for massive parallel computing.

1 citation