
Showing papers by "Yufei Tao published in 2019"


Journal ArticleDOI
TL;DR: Output-optimal algorithms are designed for a large class of similarity joins, and a lower bound is presented that essentially eliminates the possibility of output-optimal algorithms for any join on more than two relations.
Abstract: Parallel join algorithms have received much attention in recent years due to the rapid development of massively parallel systems such as MapReduce and Spark. In the database theory community, most efforts have been focused on studying worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on the hard instances having very large output sizes. In the case of a two-relation join, the hard instance is just a Cartesian product, with an output size that is quadratic in the input size. In practice, however, the output size is usually much smaller. One recent parallel join algorithm by Beame et al. has achieved output-optimality (i.e., its cost is optimal in terms of both the input size and the output size), but their algorithm only works for a two-relation equi-join and has some imperfections. In this article, we first improve their algorithm to true optimality. Then we design output-optimal algorithms for a large class of similarity joins. Finally, we present a lower bound, which essentially eliminates the possibility of having output-optimal algorithms for any join on more than two relations.
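As a point of reference for output-sensitivity, the sketch below shows a single-machine hash join in Python whose running time is O(|R| + |S| + OUT). The relation contents are hypothetical, and this is only the classical sequential baseline, not the parallel algorithm the article develops.

    from collections import defaultdict

    def equi_join(r, s):
        """Hash join of two relations on their join attribute.

        r, s: lists of (join_key, payload) tuples.
        Runs in O(|r| + |s| + OUT) time, where OUT is the output size:
        the cost is output-sensitive, mirroring the notion of
        output-optimality discussed above (on one machine, not in the
        massively parallel setting the paper studies).
        """
        index = defaultdict(list)
        for key, payload in r:
            index[key].append(payload)
        out = []
        for key, payload in s:
            for match in index.get(key, []):
                out.append((key, match, payload))
        return out

    # A Cartesian-product-like instance: every key matches every tuple,
    # so OUT is quadratic in the input size (the worst-case hard instance).
    r = [(1, f"r{i}") for i in range(3)]
    s = [(1, f"s{j}") for j in range(3)]
    print(len(equi_join(r, s)))  # 9 = 3 * 3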

22 citations


Proceedings ArticleDOI
25 Jun 2019
TL;DR: Algorithms that solve the IGS problem by asking a provably small number of questions are described, and lower bounds indicating that the algorithms are optimal up to a small additive factor are established.
Abstract: We study interactive graph search (IGS), with the conceptual objective of departing from the conventional "top-down" strategy in searching a poly-hierarchy, a.k.a. a decision graph. In IGS, a machine assists a human in looking for a target node z in an acyclic directed graph G, by repetitively asking questions. In each question, the machine picks a node u in G, asks the human "is there a path from u to z?", and takes a boolean answer from the human. The efficiency goal is to locate z with as few questions as possible. We describe algorithms that solve the problem by asking a provably small number of questions, and establish lower bounds indicating that the algorithms are optimal up to a small additive factor. An experimental evaluation is presented to demonstrate the usefulness of our solutions in real-world scenarios.
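To make the question model concrete, here is a minimal Python sketch for the degenerate case where G is a single directed path 0 -> 1 -> ... -> n-1; there, the reachability question reduces to a comparison, and binary search locates z in about log2(n) questions. This is only an illustration of the interaction, not the paper's algorithm for general DAGs.

    def locate_on_path(n, reaches_target):
        """Locate the target node z on a directed path 0 -> 1 -> ... -> n-1.

        reaches_target(u) answers the IGS question "is there a path from
        u to z?", which on a path simply means u <= z.  Binary search
        finds z with about log2(n) questions -- a special case only;
        the paper's algorithms handle general poly-hierarchies.
        """
        lo, hi = 0, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if reaches_target(mid):   # z lies at mid or further down the path
                lo = mid
            else:                     # z lies strictly before mid
                hi = mid - 1
        return lo

    z = 13
    print(locate_on_path(100, lambda u: u <= z))  # 13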

13 citations


Journal ArticleDOI
19 Dec 2019
TL;DR: A list of techniques for designing top-k search indexes with strong performance guarantees is introduced and several promising directions for future work are discussed.
Abstract: Top-k search, which reports the k elements of the highest importance from all the elements in an underlying dataset that satisfy a certain predicate, has attracted significant attention from the database community. The search efficiency crucially depends on the quality of an index structure that can be utilized to filter the underlying data by both the user-specified predicate and the ranking of importance. This article introduces the reader to a list of techniques for designing such indexes with strong performance guarantees. Several promising directions for future work are also discussed.
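For intuition, the following Python sketch is the index-free baseline that such structures aim to beat: a linear scan that filters by the predicate and keeps the k most important survivors in a heap, in O(n log k) time. The hotel data and attribute names are hypothetical.

    import heapq

    def top_k(elements, predicate, importance, k):
        """Report the k most important elements satisfying the predicate.

        A full-scan baseline; the indexes surveyed in the article exist
        precisely to avoid touching all n elements by filtering on the
        predicate and the importance ranking simultaneously.
        """
        return heapq.nlargest(k, (e for e in elements if predicate(e)),
                              key=importance)

    hotels = [{"price": 80, "rating": 4.5}, {"price": 120, "rating": 4.9},
              {"price": 60, "rating": 4.2}, {"price": 95, "rating": 4.7}]
    # Top-2 highest-rated hotels priced at most 100.
    print(top_k(hotels, lambda h: h["price"] <= 100,
                lambda h: h["rating"], 2))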

6 citations


Journal ArticleDOI
TL;DR: This work presents the first algorithm that deterministically constructs an external memory data structure on a planar subdivision formed by n segments to answer point location queries optimally in $O(\log_B n)$ I/Os.
Abstract: We revisit the problem of constructing an external memory data structure on a planar subdivision formed by n segments to answer point location queries optimally in $O(\log_B n)$ I/Os. The objective is to achieve the I/O cost of $sort(n) = O(\frac{n}{B} \log_{M/B} \frac{n}{B})$, where B is the number of words in a disk block and M is the number of words in memory. The previous algorithms are able to achieve this either in expectation or under the tall cache assumption of $M \ge B^2$. We present the first algorithm that solves the problem deterministically for all values of M and B satisfying $M \ge 2B$.
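To make the target cost concrete, this small Python helper evaluates the sorting bound for hypothetical parameter values, ignoring the hidden constants; it is only a numerical illustration of the formula above.

    import math

    def sort_io_bound(n, B, M):
        """Evaluate sort(n) = (n/B) * log_{M/B}(n/B), the I/O cost that
        the construction algorithm above matches (constants ignored)."""
        return (n / B) * math.log(n / B, M / B)

    # Illustrative, hypothetical numbers: 2^30 items, 2^10-word blocks,
    # 2^20 words of memory -- note M >= 2B easily holds here.
    print(sort_io_bound(2**30, 2**10, 2**20))  # 2^20 * 2 = 2097152.0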

3 citations


Journal ArticleDOI
05 Nov 2019
TL;DR: This article describes an entity-matching algorithm based on the methodology of active monotone classification, which ensures an asymptotically optimal tradeoff between the number of pairs inspected and the number of mistakes made.
Abstract: Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ∈ X × Y. In many scenarios, the only way to ensure perfect accuracy is to launch a costly inspection procedure on every (x, y), whereas performing the procedure |X| · |Y| times is prohibitively expensive. It is, therefore, important to design an algorithm that carries out the procedure on only some pairs, and renders the verdicts on the other pairs automatically with as few mistakes as possible. This article describes an algorithm that achieves the purpose using the methodology of active monotone classification. The algorithm ensures an asymptotically optimal tradeoff between the number of pairs inspected and the number of mistakes made.
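As a toy illustration of the monotone-classification idea, the Python sketch below assumes pairs can be ordered by a similarity score such that the true matches form a suffix of that order; binary search then finds the threshold with O(log n) inspections and labels all remaining pairs automatically. The scores and the strict-monotonicity assumption are hypothetical simplifications; the paper's setting tolerates mistakes and is considerably more general.

    def match_by_threshold(pairs, similarity, inspect):
        """Toy active matcher under a strict monotonicity assumption:
        if a pair of similarity s is a true match, every pair with
        similarity >= s also matches.  Binary search then needs only
        O(log n) calls to the costly inspection procedure; every other
        pair is labeled automatically."""
        order = sorted(pairs, key=similarity)   # ascending similarity
        lo, hi = 0, len(order)                  # threshold index lies in [lo, hi]
        while lo < hi:
            mid = (lo + hi) // 2
            if inspect(order[mid]):             # costly inspection of one pair
                hi = mid                        # matches extend to the right
            else:
                lo = mid + 1
        return {p: (i >= lo) for i, p in enumerate(order)}

    # Hypothetical scores; inspect() plays the role of the inspection procedure.
    sim = {("x1", "y1"): 0.2, ("x2", "y2"): 0.6, ("x3", "y3"): 0.9}
    print(match_by_threshold(list(sim), sim.get, lambda p: sim[p] >= 0.5))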

1 citation


Posted Content
TL;DR: A more general bound, sensitive to the content of $X$, is shown. When $\mathcal{R}$ is constrained to the set of halfspaces in $\mathbb{R}^d$ for a constant $d$, this yields the first formal justification of why the term $1/\rho$ is not compulsory for "realistic" inputs.
Abstract: A family $\mathcal{R}$ of ranges and a set $X$ of points together define a range space $(X, \mathcal{R}|_X)$, where $\mathcal{R}|_X = \{X \cap h \mid h \in \mathcal{R}\}$. We want to find a structure to estimate the quantity $|X \cap h|/|X|$ for any range $h \in \mathcal{R}$ with the $(\rho, \epsilon)$-guarantee: (i) if $|X \cap h|/|X| > \rho$, the estimate must have a relative error $\epsilon$; (ii) otherwise, the estimate must have an absolute error $\rho \epsilon$. The objective is to minimize the size of the structure. Currently, the dominant solution is to compute a relative $(\rho, \epsilon)$-approximation, which is a subset of $X$ with $\tilde{O}(\lambda/(\rho \epsilon^2))$ points, where $\lambda$ is the VC-dimension of $(X, \mathcal{R}|_X)$, and $\tilde{O}$ hides polylog factors. This paper shows a more general bound sensitive to the content of $X$. We give a structure that stores $O(\log (1/\rho))$ integers plus $\tilde{O}(\theta \cdot (\lambda/\epsilon^2))$ points of $X$, where $\theta$, called the disagreement coefficient, measures how much the ranges differ from each other in their intersections with $X$. The value of $\theta$ is between 1 and $1/\rho$, such that our space bound is never worse than that of relative $(\rho, \epsilon)$-approximations, but we improve the latter's $1/\rho$ term whenever $\theta = o(\frac{1}{\rho \log (1/\rho)})$. We also prove that, in the worst case, summaries with the $(\rho, 1/2)$-guarantee must consume $\Omega(\theta)$ words even for $d = 2$ and $\lambda \le 3$. We then constrain $\mathcal{R}$ to be the set of halfspaces in $\mathbb{R}^d$ for a constant $d$, and prove the existence of structures with $o(1/(\rho \epsilon^2))$ size offering $(\rho,\epsilon)$-guarantees, when $X$ is generated from various stochastic distributions. This is the first formal justification of why the term $1/\rho$ is not compulsory for "realistic" inputs.
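For context, the sketch below implements the sampling baseline that underlies relative $(\rho, \epsilon)$-approximations: estimate $|X \cap h|/|X|$ by the empirical fraction in a random sample. The point set and halfspace are hypothetical, and this is the baseline the paper improves upon, not its disagreement-coefficient structure.

    import random

    def estimate_fraction(X, sample_size, in_range):
        """Baseline estimator behind relative (rho, eps)-approximations:
        draw a random sample S of X and return |S cap h| / |S| as an
        estimate of |X cap h| / |X|.  A sample of roughly
        lambda / (rho * eps^2) points (up to polylog factors) gives the
        (rho, eps)-guarantee; the paper's theta-sensitive structure can
        be much smaller."""
        S = random.sample(X, sample_size)
        return sum(1 for x in S if in_range(x)) / sample_size

    # Halfspace range h: x + y >= 1, over hypothetical points drawn
    # uniformly from the unit square.
    X = [(random.random(), random.random()) for _ in range(100_000)]
    est = estimate_fraction(X, 5_000, lambda p: p[0] + p[1] >= 1.0)
    print(est)  # should be close to the true fraction, 0.5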