Journal ArticleDOI

Strong Lower Bounds for Approximating Distribution Support Size and the Distinct Elements Problem

TLDR
A nearly linear in $n$ lower bound on the query complexity is proved, applicable even when the number of distinct elements is large (up to linear in $n$) and even for approximation with additive error.
Abstract
We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least $\frac{1}{n}$. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length $n$. Charikar, Chaudhuri, Motwani, and Narasayya [in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 268-279] and Bar-Yossef, Kumar, and Sivakumar [in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM Press, New York, 2001, pp. 266-275] proved that multiplicative approximation for these problems within a factor $\alpha>1$ requires $\Theta(\frac{n}{\alpha^2})$ queries to the input sequence. Their lower bound applies only when the number of distinct elements (or the support size of a distribution) is very small. For both problems, we prove a nearly linear in $n$ lower bound on the query complexity, applicable even when the number of distinct elements is large (up to linear in $n$) and even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, $\mathsf{X}_1$ and $\mathsf{X}_2$, with very different expectations and the following condition on the first $k$ moments: $\mathsf{E}[\mathsf{X}_1]/\mathsf{E}[\mathsf{X}_2] = \mathsf{E}[\mathsf{X}_1^2]/\mathsf{E}[\mathsf{X}_2^2] = \cdots = \mathsf{E}[\mathsf{X}_1^k]/\mathsf{E}[\mathsf{X}_2^k]$. It is related to a well-studied mathematical question, the truncated Hamburger moment problem, but differs in the requirement that our random variables must be supported on integers. Our lower bound method is also applicable to other problems and, in particular, gives a new lower bound for the sample complexity of approximating the entropy of a distribution.
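To make the moment condition concrete, here is a toy illustration with hand-picked values (our own sketch, not the paper's construction, which matches the ratios of the first $k$ moments while keeping the expectations far apart): two positive integer random variables whose first two moments have the same ratio even though their expectations differ.

    # Illustrative only: a hand-picked pair of positive integer random variables
    # whose first two moments are proportional even though their expectations differ.
    # The paper's construction is far stronger: it matches the ratio of the first
    # k moments while keeping the expectations very different.

    def moment(dist, i):
        # i-th moment of a finitely supported distribution {value: probability}
        return sum(p * v ** i for v, p in dist.items())

    X1 = {2: 1.0}              # constant 2:  E[X1] = 2,   E[X1^2] = 4
    X2 = {1: 0.75, 3: 0.25}    # E[X2] = 1.5, E[X2^2] = 3

    for i in (1, 2):
        print(f"E[X1^{i}] / E[X2^{i}] = {moment(X1, i) / moment(X2, i):.4f}")  # both 1.3333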


Citations
Book

Introduction to Property Testing

TL;DR: This book presents a wide range of algorithmic techniques for the design and analysis of tests for algebraic properties, properties of Boolean functions, graph properties, and properties of distributions.
Proceedings ArticleDOI

Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs

TL;DR: A new approach to characterizing the unobserved portion of a distribution is introduced, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of properties that includes entropy and distribution support size.
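For contrast, the naive plug-in estimators sketched below (our illustration, not the estimator from this work) simply evaluate entropy and support size on the empirical distribution and therefore need roughly a linear number of samples; the cited paper shows that n/log(n) samples suffice.

    import math
    from collections import Counter

    def plugin_estimates(samples):
        # Naive plug-in estimates: entropy (in nats) and support size of the
        # empirical distribution. Elements never observed contribute nothing,
        # which is why these estimates need many samples to become accurate.
        counts = Counter(samples)
        m = len(samples)
        entropy = -sum((c / m) * math.log(c / m) for c in counts.values())
        support = len(counts)
        return entropy, support

    print(plugin_estimates(["a", "b", "a", "c", "a"]))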
Book

Algorithmic and Analysis Techniques in Property Testing

TL;DR: This monograph surveys results in property testing, where the emphasis is on common analysis and algorithmic techniques.
Journal ArticleDOI

Testing Closeness of Discrete Distributions

TL;DR: In this article, the authors present an algorithm which uses sublinear in $n$, specifically $O(n^{2/3}\epsilon^{-8/3}\log n)$, independent samples from each distribution, runs in time linear in the sample size, makes no assumptions about the structure of the distributions, and distinguishes the cases when the distance between the distributions is small (less than $\max\{\epsilon^{4/3}n^{-1/3}/32, \epsilon n^{-1/2}/4\}$) or large (more than $\epsilon$) in $\ell_1$ distance.
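The brute-force check below spells out the decision task (a simplified illustration of ours, not the cited tester, which is sublinear in n and never builds the full empirical distributions): given samples from two distributions over {0, ..., n-1}, declare them close or far in l1 distance.

    from collections import Counter

    def empirical_l1(samples_p, samples_q, n):
        # l1 distance between the two empirical distributions over {0, ..., n-1}.
        # Reliable only once the sample sizes are on the order of n, which is
        # exactly the cost the cited sublinear tester avoids.
        p, q = Counter(samples_p), Counter(samples_q)
        mp, mq = len(samples_p), len(samples_q)
        return sum(abs(p[i] / mp - q[i] / mq) for i in range(n))

    def naive_closeness_test(samples_p, samples_q, n, eps):
        # Declare "far" when the empirical distance exceeds the threshold eps.
        return "far" if empirical_l1(samples_p, samples_q, n) > eps else "close"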
Proceedings ArticleDOI

The Power of Linear Estimators

TL;DR: The main result is that for any property in a broad class of practically relevant distribution properties, there exists a near-optimal linear estimator, together with a practical, polynomial-time algorithm for constructing such estimators for any given parameters.
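A linear estimator here is a fixed linear function of the sample's fingerprint, where F[i] is the number of distinct elements observed exactly i times; the sketch below shows that general form with placeholder coefficients (the cited paper is about how to choose near-optimal coefficients, which we do not reproduce).

    from collections import Counter

    def fingerprint(samples):
        # F[i] = number of distinct elements that appear exactly i times.
        return Counter(Counter(samples).values())

    def linear_estimator(samples, coeff):
        # General form of a linear estimator: sum_i coeff(i) * F[i].
        # The coefficients below are placeholders, not the near-optimal ones
        # constructed in the cited paper.
        return sum(coeff(i) * f for i, f in fingerprint(samples).items())

    # With coeff(i) = 1 this recovers the naive distinct-element count.
    print(linear_estimator(["a", "b", "a", "c"], lambda i: 1))   # -> 3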
References
Book

Elements of information theory

TL;DR: The authors examine the role of entropy, inequality, and randomness in the design and construction of codes.
Journal ArticleDOI

A universal algorithm for sequential data compression

TL;DR: The compression ratio achieved by the proposed universal code uniformly approaches the lower bounds on the compression ratios attainable by block-to-variable codes and variable-to-block codes designed to match a completely specified source.
Book

All of Statistics: A Concise Course in Statistical Inference

TL;DR: This book covers a much wider range of topics than a typical introductory text on mathematical statistics, including modern topics such as nonparametric curve estimation, bootstrapping, and classification, which are usually relegated to follow-up courses.
Journal ArticleDOI

The Space Complexity of Approximating the Frequency Moments

TL;DR: In this paper, the authors consider the space complexity of randomized algorithms that approximate the frequency moments of a sequence whose elements are given one by one and cannot be stored.
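For reference, the k-th frequency moment of a sequence is F_k = sum_i m_i^k, where m_i counts the occurrences of element i; F_0 is exactly the number of distinct elements, which links this line of work to the problem above. The snippet below computes the moments offline for illustration (the whole point of the cited paper is to approximate them in small space without storing the sequence).

    from collections import Counter

    def frequency_moment(seq, k):
        # F_k = sum over distinct elements of (occurrence count)^k, computed
        # offline for illustration. F_0 is the number of distinct elements and
        # F_1 is the length of the sequence.
        return sum(m ** k for m in Counter(seq).values())

    seq = ["a", "b", "a", "c", "a", "b"]
    print([frequency_moment(seq, k) for k in (0, 1, 2)])   # -> [3, 6, 14]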
Journal ArticleDOI

Probabilistic counting algorithms for data base applications

TL;DR: This paper introduces a class of probabilistic counting algorithms for estimating the number of distinct elements in a large collection of data in a single pass, using only a small amount of additional storage and only a few operations per element scanned.
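The sketch below is a simplified single-hash estimator in the spirit of probabilistic counting (not the exact algorithm from the paper, which combines many such sketches to control the variance): it tracks the largest number of trailing zero bits among the hashed elements and returns 2 raised to that maximum as a rough distinct-count estimate.

    import hashlib

    def trailing_zeros(x):
        # Number of trailing zero bits of a positive integer.
        return (x & -x).bit_length() - 1

    def rough_distinct_count(stream):
        # Single pass, constant memory: remember only the largest number of
        # trailing zeros seen among the hashed items. A single sketch like this
        # is noisy; the original algorithm averages many of them.
        max_r = 0
        for item in stream:
            h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
            if h:
                max_r = max(max_r, trailing_zeros(h))
        return 2 ** max_r

    print(rough_distinct_count(range(1000)))   # typically within a small factor of 1000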