Journal ArticleDOI

Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem

TLDR
A nearly linear in $n$ lower bound on the query complexity is proved, applicable even when the number of distinct elements is large (up to linear in $n$) and even for approximation with additive error.
Abstract
We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least $\frac{1}{n}$. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length $n$. Charikar, Chaudhuri, Motwani, and Narasayya [in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 268-279] and Bar-Yossef, Kumar, and Sivakumar [in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM Press, New York, 2001, pp. 266-275] proved that multiplicative approximation for these problems within a factor $\alpha>1$ requires $\Theta(\frac{n}{\alpha^2})$ queries to the input sequence. Their lower bound applies only when the number of distinct elements (or the support size of a distribution) is very small. For both problems, we prove a nearly linear in $n$ lower bound on the query complexity, applicable even when the number of distinct elements is large (up to linear in $n$) and even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, $\mathsf{X}_1$ and $\mathsf{X}_2$, with very different expectations and the following condition on the first $k$ moments: $\mathsf{E}[\mathsf{X}_1]/\mathsf{E}[\mathsf{X}_2] = \mathsf{E}[\mathsf{X}_1^2]/\mathsf{E}[\mathsf{X}_2^2] = \cdots = \mathsf{E}[\mathsf{X}_1^k]/\mathsf{E}[\mathsf{X}_2^k]$. It is related to a well-studied mathematical question, the truncated Hamburger moment problem, but differs in the requirement that our random variables must be supported on integers. Our lower bound method is also applicable to other problems and, in particular, gives a new lower bound for the sample complexity of approximating the entropy of a distribution.
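To make the moment condition concrete, here is a toy illustration with hand-picked values (our own sketch, not the paper's construction, which matches the ratios of the first $k$ moments while keeping the expectations far apart): two positive integer random variables whose first two moments have the same ratio even though their expectations differ.

    # Illustrative only: a hand-picked pair of positive integer random variables
    # whose first two moments are proportional even though their expectations differ.
    # The paper's construction is far stronger: it matches the ratio of the first
    # k moments while keeping the expectations very different.

    def moment(dist, i):
        # i-th moment of a finitely supported distribution {value: probability}
        return sum(p * v ** i for v, p in dist.items())

    X1 = {2: 1.0}              # constant 2:  E[X1] = 2,   E[X1^2] = 4
    X2 = {1: 0.75, 3: 0.25}    # E[X2] = 1.5, E[X2^2] = 3

    for i in (1, 2):
        print(f"E[X1^{i}] / E[X2^{i}] = {moment(X1, i) / moment(X2, i):.4f}")  # both 1.3333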


Citations
Book

Introduction to Property Testing

TL;DR: This book presents a wide range of algorithmic techniques for the design and analysis of tests for algebraic properties, properties of Boolean functions, graph properties, and properties of distributions.
Proceedings ArticleDOI

Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs

TL;DR: A new approach to characterizing the unobserved portion of a distribution is introduced, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size.
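For contrast, the naive plug-in estimators sketched below (our illustration, not the estimator from this work) simply evaluate entropy and support size on the empirical distribution and therefore need roughly a linear number of samples; the cited paper shows that n/log(n) samples suffice.

    import math
    from collections import Counter

    def plugin_estimates(samples):
        # Naive plug-in estimates: entropy (in nats) and support size of the
        # empirical distribution. Elements never observed contribute nothing,
        # which is why these estimates need many samples to become accurate.
        counts = Counter(samples)
        m = len(samples)
        entropy = -sum((c / m) * math.log(c / m) for c in counts.values())
        support = len(counts)
        return entropy, support

    print(plugin_estimates(["a", "b", "a", "c", "a"]))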
Book

Algorithmic and Analysis Techniques in Property Testing

TL;DR: This monograph surveys results in property testing, where the emphasis is on common analysis and algorithmic techniques.
Journal ArticleDOI

Testing Closeness of Discrete Distributions

TL;DR: In this article, the authors present an algorithm which uses sublinear in $n$, specifically $O(n^{2/3}\epsilon^{-8/3}\log n)$, independent samples from each distribution, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small (less than $\max\{\epsilon^{4/3}n^{-1/3}/32, \epsilon n^{-1/2}/4\}$) or large (more than $\epsilon$) in $\ell_1$ distance.
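The brute-force check below spells out the decision task (a simplified illustration of ours, not the cited tester, which is sublinear in n and never builds the full empirical distributions): given samples from two distributions over {0, ..., n-1}, declare them close or far in l1 distance.

    from collections import Counter

    def empirical_l1(samples_p, samples_q, n):
        # l1 distance between the two empirical distributions over {0, ..., n-1}.
        # Reliable only once the sample sizes are on the order of n, which is
        # exactly the cost the cited sublinear tester avoids.
        p, q = Counter(samples_p), Counter(samples_q)
        mp, mq = len(samples_p), len(samples_q)
        return sum(abs(p[i] / mp - q[i] / mq) for i in range(n))

    def naive_closeness_test(samples_p, samples_q, n, eps):
        # Declare "far" when the empirical distance exceeds the threshold eps.
        return "far" if empirical_l1(samples_p, samples_q, n) > eps else "close"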
Proceedings ArticleDOI

The Power of Linear Estimators

TL;DR: The main result is that for any property in a broad class of practically relevant distribution properties, there exists a near-optimal linear estimator, together with a practical, polynomial-time algorithm for constructing such estimators for any given parameters.
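A linear estimator here is a fixed linear function of the sample's fingerprint, where F[i] is the number of distinct elements observed exactly i times; the sketch below shows that general form with placeholder coefficients (the cited paper is about how to choose near-optimal coefficients, which we do not reproduce).

    from collections import Counter

    def fingerprint(samples):
        # F[i] = number of distinct elements that appear exactly i times.
        return Counter(Counter(samples).values())

    def linear_estimator(samples, coeff):
        # General form of a linear estimator: sum_i coeff(i) * F[i].
        # The coefficients below are placeholders, not the near-optimal ones
        # constructed in the cited paper.
        return sum(coeff(i) * f for i, f in fingerprint(samples).items())

    # With coeff(i) = 1 this recovers the naive distinct-element count.
    print(linear_estimator(["a", "b", "a", "c"], lambda i: 1))   # -> 3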
References
Book

Elements of information theory

TL;DR: The authors examine the role of entropy, inequality, and randomness in the design and construction of codes.
Journal ArticleDOI

A universal algorithm for sequential data compression

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
Book

All of Statistics: A Concise Course in Statistical Inference

TL;DR: This book covers a much wider range of topics than a typical introductory text on mathematical statistics, including modern topics such as nonparametric curve estimation, bootstrapping, and classification, which are usually relegated to follow-up courses.
Journal ArticleDOI

The Space Complexity of Approximating the Frequency Moments

TL;DR: In this paper, the authors consider the space complexity of randomized algorithms that approximate the frequency moments of a sequence whose elements are given one by one and cannot be stored.
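For reference, the k-th frequency moment of a sequence is F_k = sum_i m_i^k, where m_i counts the occurrences of element i; F_0 is exactly the number of distinct elements, which links this line of work to the problem above. The snippet below computes the moments offline for illustration (the whole point of the cited paper is to approximate them in small space without storing the sequence).

    from collections import Counter

    def frequency_moment(seq, k):
        # F_k = sum over distinct elements of (occurrence count)^k, computed
        # offline for illustration. F_0 is the number of distinct elements and
        # F_1 is the length of the sequence.
        return sum(m ** k for m in Counter(seq).values())

    seq = ["a", "b", "a", "c", "a", "b"]
    print([frequency_moment(seq, k) for k in (0, 1, 2)])   # -> [3, 6, 14]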
Journal ArticleDOI

Probabilistic counting algorithms for data base applications

TL;DR: This paper introduces a class of probabilistic counting algorithms for estimating the number of distinct elements in a large collection of data in a single pass, using only a small amount of additional storage and only a few operations per element scanned.
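The sketch below is a simplified single-hash estimator in the spirit of probabilistic counting (not the exact algorithm from the paper, which combines many such sketches to control the variance): it tracks the largest number of trailing zero bits among the hashed elements and returns 2 raised to that maximum as a rough distinct-count estimate.

    import hashlib

    def trailing_zeros(x):
        # Number of trailing zero bits of a positive integer.
        return (x & -x).bit_length() - 1

    def rough_distinct_count(stream):
        # Single pass, constant memory: remember only the largest number of
        # trailing zeros seen among the hashed items. A single sketch like this
        # is noisy; the original algorithm averages many of them.
        max_r = 0
        for item in stream:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            if h:
                max_r = max(max_r, trailing_zeros(h))
        return 2 ** max_r

    print(rough_distinct_count(range(1000)))   # typically within a small factor of 1000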