
Showing papers by "Yufei Tao published in 2019"


Journal ArticleDOI
TL;DR: Output-optimal algorithms are designed for a large class of similarity joins, and a lower bound is presented that essentially eliminates the possibility of output-optimal algorithms for any join on more than two relations.
Abstract: Parallel join algorithms have received much attention in recent years due to the rapid development of massively parallel systems such as MapReduce and Spark. In the database theory community, most efforts have been focused on studying worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on the hard instances having very large output sizes. In the case of a two-relation join, the hard instance is just a Cartesian product, with an output size that is quadratic in the input size. In practice, however, the output size is usually much smaller. One recent parallel join algorithm by Beame et al. has achieved output-optimality (i.e., its cost is optimal in terms of both the input size and the output size), but their algorithm only works for a two-relation equi-join and has some imperfections. In this article, we first improve their algorithm to true optimality. Then we design output-optimal algorithms for a large class of similarity joins. Finally, we present a lower bound, which essentially eliminates the possibility of having output-optimal algorithms for any join on more than two relations.
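As a point of reference for output-sensitivity, the sketch below shows a single-machine hash join in Python whose running time is O(|R| + |S| + OUT). The relation contents are hypothetical, and this is only the classical sequential baseline, not the parallel algorithm the article develops.

    from collections import defaultdict

    def equi_join(r, s):
        """Hash join of two relations on their join attribute.

        r, s: lists of (join_key, payload) tuples.
        Runs in O(|r| + |s| + OUT) time, where OUT is the output size:
        the cost is output-sensitive, mirroring the notion of
        output-optimality discussed above (on one machine, not in the
        massively parallel setting the paper studies).
        """
        index = defaultdict(list)
        for key, payload in r:
            index[key].append(payload)
        out = []
        for key, payload in s:
            for match in index.get(key, []):
                out.append((key, match, payload))
        return out

    # A Cartesian-product-like instance: every key matches every tuple,
    # so OUT is quadratic in the input size (the worst-case hard instance).
    r = [(1, f"r{i}") for i in range(3)]
    s = [(1, f"s{j}") for j in range(3)]
    print(len(equi_join(r, s)))  # 9 = 3 * 3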

22 citations


Proceedings ArticleDOI
25 Jun 2019
TL;DR: Algorithms that solve the IGS problem by asking a provably small number of questions are described, and lower bounds indicating that the algorithms are optimal up to a small additive factor are established.
Abstract: We study interactive graph search (IGS), with the conceptual objective of departing from the conventional "top-down" strategy in searching a poly-hierarchy, a.k.a. a decision graph. In IGS, a machine assists a human in looking for a target node z in an acyclic directed graph G, by repetitively asking questions. In each question, the machine picks a node u in G, asks the human "is there a path from u to z?", and takes a boolean answer from the human. The efficiency goal is to locate z with as few questions as possible. We describe algorithms that solve the problem by asking a provably small number of questions, and establish lower bounds indicating that the algorithms are optimal up to a small additive factor. An experimental evaluation is presented to demonstrate the usefulness of our solutions in real-world scenarios.
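To make the question model concrete, here is a minimal Python sketch for the degenerate case where G is a single directed path 0 -> 1 -> ... -> n-1; there, the reachability question reduces to a comparison, and binary search locates z in about log2(n) questions. This is only an illustration of the interaction, not the paper's algorithm for general DAGs.

    def locate_on_path(n, reaches_target):
        """Locate the target node z on a directed path 0 -> 1 -> ... -> n-1.

        reaches_target(u) answers the IGS question "is there a path from
        u to z?", which on a path simply means u <= z.  Binary search
        finds z with about log2(n) questions -- a special case only;
        the paper's algorithms handle general poly-hierarchies.
        """
        lo, hi = 0, n - 1
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if reaches_target(mid):   # z lies at mid or further down the path
                lo = mid
            else:                     # z lies strictly before mid
                hi = mid - 1
        return lo

    z = 13
    print(locate_on_path(100, lambda u: u <= z))  # 13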

13 citations


Journal ArticleDOI
19 Dec 2019
TL;DR: A list of techniques for designing top-k search indexes with strong performance guarantees is introduced and several promising directions for future work are discussed.
Abstract: Top-k search, which reports the k elements of the highest importance from all the elements in an underlying dataset that satisfy a certain predicate, has attracted significant attention from the database community. The search efficiency crucially depends on the quality of an index structure that can be utilized to filter the underlying data by both the user-specified predicate and the ranking of importance. This article introduces the reader to a list of techniques for designing such indexes with strong performance guarantees. Several promising directions for future work are also discussed.
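For intuition, the following Python sketch is the index-free baseline that such structures aim to beat: a linear scan that filters by the predicate and keeps the k most important survivors in a heap, in O(n log k) time. The hotel data and attribute names are hypothetical.

    import heapq

    def top_k(elements, predicate, importance, k):
        """Report the k most important elements satisfying the predicate.

        A full-scan baseline; the indexes surveyed in the article exist
        precisely to avoid touching all n elements by filtering on the
        predicate and the importance ranking simultaneously.
        """
        return heapq.nlargest(k, (e for e in elements if predicate(e)),
                              key=importance)

    hotels = [{"price": 80, "rating": 4.5}, {"price": 120, "rating": 4.9},
              {"price": 60, "rating": 4.2}, {"price": 95, "rating": 4.7}]
    # Top-2 highest-rated hotels priced at most 100.
    print(top_k(hotels, lambda h: h["price"] <= 100,
                lambda h: h["rating"], 2))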

6 citations


Journal ArticleDOI
TL;DR: This work presents the first algorithm that deterministically constructs an external memory data structure on a planar subdivision formed by n segments to answer point location queries optimally in $O(\log_B n)$ I/Os.
Abstract: We revisit the problem of constructing an external memory data structure on a planar subdivision formed by n segments to answer point location queries optimally in $O(\log_B n)$ I/Os. The objective is to achieve the I/O cost of $sort(n) = O(\frac{n}{B} \log_{M/B} \frac{n}{B})$, where B is the number of words in a disk block and M is the number of words in memory. The previous algorithms are able to achieve this either in expectation or under the tall cache assumption of $M \ge B^2$. We present the first algorithm that solves the problem deterministically for all values of M and B satisfying $M \ge 2B$.
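To make the target cost concrete, this small Python helper evaluates the sorting bound for hypothetical parameter values, ignoring the hidden constants; it is only a numerical illustration of the formula above.

    import math

    def sort_io_bound(n, B, M):
        """Evaluate sort(n) = (n/B) * log_{M/B}(n/B), the I/O cost that
        the construction algorithm above matches (constants ignored)."""
        return (n / B) * math.log(n / B, M / B)

    # Illustrative, hypothetical numbers: 2^30 items, 2^10-word blocks,
    # 2^20 words of memory -- note M >= 2B easily holds here.
    print(sort_io_bound(2**30, 2**10, 2**20))  # 2^20 * 2 = 2097152.0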

3 citations


Journal ArticleDOI
05 Nov 2019
TL;DR: This article describes an entity-matching algorithm based on the methodology of active monotone classification, which ensures an asymptotically optimal tradeoff between the number of pairs inspected and the number of mistakes made.
Abstract: Given two sets of entities X and Y, entity matching aims to decide whether x and y represent the same entity for each pair (x, y) ∈ X × Y. In many scenarios, the only way to ensure perfect accuracy is to launch a costly inspection procedure on every (x, y), whereas performing the procedure |X| · |Y| times is prohibitively expensive. It is, therefore, important to design an algorithm that carries out the procedure on only some pairs, and renders the verdicts on the other pairs automatically with as few mistakes as possible. This article describes an algorithm that achieves the purpose using the methodology of active monotone classification. The algorithm ensures an asymptotically optimal tradeoff between the number of pairs inspected and the number of mistakes made.
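As a toy illustration of the monotone-classification idea, the Python sketch below assumes pairs can be ordered by a similarity score such that the true matches form a suffix of that order; binary search then finds the threshold with O(log n) inspections and labels all remaining pairs automatically. The scores and the strict-monotonicity assumption are hypothetical simplifications; the paper's setting tolerates mistakes and is considerably more general.

    def match_by_threshold(pairs, similarity, inspect):
        """Toy active matcher under a strict monotonicity assumption:
        if a pair of similarity s is a true match, every pair with
        similarity >= s also matches.  Binary search then needs only
        O(log n) calls to the costly inspection procedure; every other
        pair is labeled automatically."""
        order = sorted(pairs, key=similarity)   # ascending similarity
        lo, hi = 0, len(order)                  # threshold index lies in [lo, hi]
        while lo < hi:
            mid = (lo + hi) // 2
            if inspect(order[mid]):             # costly inspection of one pair
                hi = mid                        # matches extend to the right
            else:
                lo = mid + 1
        return {p: (i >= lo) for i, p in enumerate(order)}

    # Hypothetical scores; inspect() plays the role of the inspection procedure.
    sim = {("x1", "y1"): 0.2, ("x2", "y2"): 0.6, ("x3", "y3"): 0.9}
    print(match_by_threshold(list(sim), sim.get, lambda p: sim[p] >= 0.5))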

1 citation


Posted Content
TL;DR: A more general bound, sensitive to the content of $X$, is shown. When $\mathcal{R}$ is constrained to the set of halfspaces in $\mathbb{R}^d$ for a constant $d$, this yields the first formal justification of why the term $1/\rho$ is not compulsory for "realistic" inputs.
Abstract: A family $\mathcal{R}$ of ranges and a set $X$ of points together define a range space $(X, \mathcal{R}|_X)$, where $\mathcal{R}|_X = \{X \cap h \mid h \in \mathcal{R}\}$. We want to find a structure to estimate the quantity $|X \cap h|/|X|$ for any range $h \in \mathcal{R}$ with the $(\rho, \epsilon)$-guarantee: (i) if $|X \cap h|/|X| > \rho$, the estimate must have a relative error $\epsilon$; (ii) otherwise, the estimate must have an absolute error $\rho \epsilon$. The objective is to minimize the size of the structure. Currently, the dominant solution is to compute a relative $(\rho, \epsilon)$-approximation, which is a subset of $X$ with $\tilde{O}(\lambda/(\rho \epsilon^2))$ points, where $\lambda$ is the VC-dimension of $(X, \mathcal{R}|_X)$, and $\tilde{O}$ hides polylog factors. This paper shows a more general bound sensitive to the content of $X$. We give a structure that stores $O(\log (1/\rho))$ integers plus $\tilde{O}(\theta \cdot (\lambda/\epsilon^2))$ points of $X$, where $\theta$, called the disagreement coefficient, measures how much the ranges differ from each other in their intersections with $X$. The value of $\theta$ is between 1 and $1/\rho$, such that our space bound is never worse than that of relative $(\rho, \epsilon)$-approximations, but we improve the latter's $1/\rho$ term whenever $\theta = o(\frac{1}{\rho \log (1/\rho)})$. We also prove that, in the worst case, summaries with the $(\rho, 1/2)$-guarantee must consume $\Omega(\theta)$ words even for $d = 2$ and $\lambda \le 3$. We then constrain $\mathcal{R}$ to be the set of halfspaces in $\mathbb{R}^d$ for a constant $d$, and prove the existence of structures with $o(1/(\rho \epsilon^2))$ size offering $(\rho,\epsilon)$-guarantees, when $X$ is generated from various stochastic distributions. This is the first formal justification of why the term $1/\rho$ is not compulsory for "realistic" inputs.
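For context, the sketch below implements the sampling baseline that underlies relative $(\rho, \epsilon)$-approximations: estimate $|X \cap h|/|X|$ by the empirical fraction in a random sample. The point set and halfspace are hypothetical, and this is the baseline the paper improves upon, not its disagreement-coefficient structure.

    import random

    def estimate_fraction(X, sample_size, in_range):
        """Baseline estimator behind relative (rho, eps)-approximations:
        draw a random sample S of X and return |S cap h| / |S| as an
        estimate of |X cap h| / |X|.  A sample of roughly
        lambda / (rho * eps^2) points (up to polylog factors) gives the
        (rho, eps)-guarantee; the paper's theta-sensitive structure can
        be much smaller."""
        S = random.sample(X, sample_size)
        return sum(1 for x in S if in_range(x)) / sample_size

    # Halfspace range h: x + y >= 1, over hypothetical points drawn
    # uniformly from the unit square.
    X = [(random.random(), random.random()) for _ in range(100_000)]
    est = estimate_fraction(X, 5_000, lambda p: p[0] + p[1] >= 1.0)
    print(est)  # should be close to the true fraction, 0.5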