Convergence of Chao Unseen Species Estimator
01 Jul 2019, pp. 46-50
TL;DR: In this article, the authors analyze the Chao estimator and show that its worst case mean squared error (MSE) is smaller than the MSE of the plug-in estimator by a factor of $\mathcal{O}\left((k/n)^2\right)$.
Abstract: Support size estimation and the related problem of unseen species estimation have wide applications in ecology and database analysis. Perhaps the most widely used support size estimator is the Chao estimator. Despite its widespread use, little is known about its theoretical properties. We analyze the Chao estimator and show that its worst case mean squared error (MSE) is smaller than the MSE of the plug-in estimator by a factor of $\mathcal{O}\left((k/n)^2\right)$. Our main technical contribution is a new method to analyze rational estimators for discrete distribution properties, which may be of independent interest.
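For concreteness, here is a minimal sketch of the two estimators being compared, using the standard Chao1 form $\hat{S} = S_{\mathrm{obs}} + f_1^2/(2f_2)$, where $f_1$ and $f_2$ count the symbols seen exactly once and exactly twice; the exact variant and the worst-case analysis are in the paper itself.

```python
from collections import Counter

def plug_in_estimate(sample):
    """Plug-in support size estimate: the number of distinct observed symbols."""
    return len(set(sample))

def chao_estimate(sample):
    """Standard Chao1 estimator: observed support plus f1^2 / (2*f2), where
    f1 = number of symbols seen exactly once, f2 = number seen exactly twice.
    Uses the common bias-corrected fallback f1*(f1-1)/2 when f2 == 0."""
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    observed = len(counts)
    if f2 > 0:
        return observed + f1 * f1 / (2 * f2)
    return observed + f1 * (f1 - 1) / 2

sample = ["a", "b", "a", "c", "d", "d", "e"]
print(plug_in_estimate(sample))  # 5
print(chao_estimate(sample))     # 5 + 3*3/(2*2) = 7.25
```

The plug-in estimator simply reports the number of distinct symbols observed, which is why it is biased downward whenever appreciable probability mass remains unseen; the Chao correction uses the singleton and doubleton counts to account for that missing mass.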
Citations
01 Jan 2008
TL;DR: Efron and Thisted, as discussed by the authors, studied the frequency distribution of words in the Shakespearean canon, modeling an author's vocabulary as a distribution G of word species from which text is sampled, and derived the expected number of words that occur x ≥ 1 times in a large sample of n words.
Abstract: This paper is the first of two written by Brad Efron and Ron Thisted studying the frequency distribution of words in the Shakespearean canon. The key idea, due to Fisher in the context of sampling of species, is simple and elegant. When applied to Shakespeare the idea appears preposterous: an author has a personal vocabulary of word species represented by a distribution G, and text is generated by sampling from this distribution. Most results do not require successive words to be sampled independently, which leaves room for individual style and context, but stationarity is needed for prediction and inference. The expected number of words that occur x ≥ 1 times in a large sample of n words then takes a mixed-Poisson form (reconstructed below).
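The formula itself is cut off above; under Fisher's mixed-Poisson model it is standardly written as follows, with the intensity λ absorbing the sample size n and G a (not necessarily normalized) measure on intensities. This is a reconstruction from the standard model, not a verbatim quote of the paper:

$$\mathrm{E}[n_x] = \int_0^{\infty} \frac{e^{-\lambda}\,\lambda^{x}}{x!}\, dG(\lambda), \qquad x \ge 1,$$

where $n_x$ denotes the number of distinct words observed exactly x times.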
199 citations
References
05 Dec 2013
TL;DR: This work proposes a novel modification of the Good-Turing frequency estimation scheme, which seeks to estimate the shape of the unobserved portion of the distribution, and is robust, general, and theoretically principled; it is expected that it may be fruitfully used as a component within larger machine learning and data analysis systems.
Abstract: Recently, Valiant and Valiant [1, 2] showed that a class of distributional properties, which includes such practically relevant properties as entropy, the number of distinct elements, and distance metrics between pairs of distributions, can be estimated given a sublinear sized sample. Specifically, given a sample consisting of independent draws from any distribution over at most $n$ distinct elements, these properties can be estimated accurately using a sample of size $O(n/\log n)$. We propose a novel modification of this approach and show: 1) theoretically, this estimator is optimal (to constant factors, over worst-case instances), and 2) in practice, it performs exceptionally well for a variety of estimation tasks, on a variety of natural distributions, for a wide range of parameters. Perhaps unsurprisingly, the key step in our approach is to first use the sample to characterize the "unseen" portion of the distribution. This goes beyond such tools as the Good-Turing frequency estimation scheme, which estimates the total probability mass of the unobserved portion of the distribution: we seek to estimate the shape of the unobserved portion of the distribution. This approach is robust, general, and theoretically principled; we expect that it may be fruitfully used as a component within larger machine learning and data analysis systems.
128 citations
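The Good-Turing scheme that this work goes beyond estimates only the total unseen probability mass, which is simply the fraction of the sample occupied by singletons. A minimal sketch of that baseline (standard Good-Turing, not the authors' full unseen-distribution estimator):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the total probability mass of unseen symbols:
    (number of symbols appearing exactly once) / (sample size)."""
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return f1 / len(sample)

print(good_turing_missing_mass(["a", "b", "a", "c", "d", "d", "e"]))  # 3/7
```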
TL;DR: A class of simple algorithms is obtained that provably predicts U all of the way up to t ∝ log n samples, and it is shown that this range is the best possible and that the estimator's mean-square error is near optimal for any t.
Abstract: Estimating the number of unseen species is an important problem in many scientific endeavors. Its most popular formulation, introduced by Fisher et al. [Fisher RA, Corbet AS, Williams CB (1943) J Animal Ecol 12(1):42−58], uses n samples to predict the number U of hitherto unseen species that would be observed if t ⋅ n new samples were collected. Of considerable interest is the largest ratio t between the number of new and existing samples for which U can be accurately predicted. In seminal works, Good and Toulmin [Good I, Toulmin G (1956) Biometrika 43(1−2):45−63] constructed an intriguing estimator that predicts U for all t ≤ 1. Subsequently, Efron and Thisted [Efron B, Thisted R (1976) Biometrika 63(3):435−447] proposed a modification that empirically predicts U even for some t > 1, but without provable guarantees. We derive a class of estimators that provably predict U all of the way up to t ∝ log n. We also show that this range is the best possible and that the estimator's mean-square error is near optimal for any t. Our approach yields a provable guarantee for the Efron−Thisted estimator and, in addition, a variant with stronger theoretical and experimental performance than existing methodologies on a variety of synthetic and real datasets. The estimators are simple, linear, computationally efficient, and scalable to massive datasets. Their performance guarantees hold uniformly for all distributions, and apply to all four standard sampling models commonly used across various scientific disciplines: multinomial, Poisson, hypergeometric, and Bernoulli product.
115 citations
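The Good-Toulmin estimator referenced in the abstract is a simple alternating series in the frequencies of frequencies. A minimal sketch of the plain series, which is reliable for t ≤ 1 (the paper's smoothed variants, which extend the range to t ∝ log n, modify these coefficients):

```python
from collections import Counter

def good_toulmin_unseen(sample, t):
    """Good-Toulmin prediction of the number of new species expected in t*n
    additional samples: U(t) = -sum_{i>=1} (-t)^i * f_i, where f_i is the
    number of species observed exactly i times. The series diverges in
    practice for t substantially greater than 1."""
    freq_of_freq = Counter(Counter(sample).values())
    return -sum(((-t) ** i) * f for i, f in freq_of_freq.items())

sample = ["a", "b", "a", "c", "d", "d", "e"]
print(good_toulmin_unseen(sample, 1.0))  # f1 - f2 = 3 - 2 = 1
```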
TL;DR: A lower bound on the query complexity that is nearly linear in $n$ is proved, applicable even when the number of distinct elements is large (up to linear in $n$) and even for approximation with additive error.
Abstract: We consider the problem of approximating the support size of a distribution from a small number of samples, when each element in the distribution appears with probability at least $\frac{1}{n}$. This problem is closely related to the problem of approximating the number of distinct elements in a sequence of length $n$. Charikar, Chaudhuri, Motwani, and Narasayya [in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2000, pp. 268-279] and Bar-Yossef, Kumar, and Sivakumar [in Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, ACM Press, New York, 2001, pp. 266-275] proved that multiplicative approximation for these problems within a factor $\alpha>1$ requires $\Theta(\frac{n}{\alpha^2})$ queries to the input sequence. Their lower bound applies only when the number of distinct elements (or the support size of a distribution) is very small. For both problems, we prove a nearly linear in $n$ lower bound on the query complexity, applicable even when the number of distinct elements is large (up to linear in $n$) and even for approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, $\mathsf{X}_1$ and $\mathsf{X}_2$, with very different expectations and the following condition on the first $k$ moments: $\mathsf{E}[\mathsf{X}_1]/\mathsf{E}[\mathsf{X}_2] = \mathsf{E}[\mathsf{X}_1^2]/\mathsf{E}[\mathsf{X}_2^2] = \cdots = \mathsf{E}[\mathsf{X}_1^k]/\mathsf{E}[\mathsf{X}_2^k]$. It is related to a well-studied mathematical question, the truncated Hamburger moment problem, but differs in the requirement that our random variables have to be supported on integers. Our lower bound method is also applicable to other problems and, in particular, gives a new lower bound for the sample complexity of approximating the entropy of a distribution.
115 citations
TL;DR: In this paper, the authors considered the problem of estimating the support size of a discrete distribution whose minimum nonzero mass is at least $\frac{1}{k}$, and showed that the sample complexity to achieve an additive error of $\varepsilon k$ with probability at least 0.1 is within universal constant factors of $\frac{k}{\log k}\log^{2}\frac{1}{\varepsilon}$.
Abstract: We consider the problem of estimating the support size of a discrete distribution whose minimum nonzero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample complexity, that is, the minimal sample size to achieve an additive error of $\varepsilon k$ with probability at least 0.1, is within universal constant factors of $\frac{k}{\log k}\log^{2}\frac{1}{\varepsilon}$, which improves the state-of-the-art result of $\frac{k}{\varepsilon^{2}\log k}$ in [In Advances in Neural Information Processing Systems (2013) 2157–2165]. Similar characterization of the minimax risk is also obtained. Our procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in $O(n+\log^{2}k)$ time and attains the sample complexity within constant factors. The superiority of the proposed estimator in terms of accuracy, computational efficiency and scalability is demonstrated in a variety of synthetic and real datasets.
84 citations
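The estimator above is linear in the sample's fingerprint, i.e. of the form $\hat{S} = \sum_j g_j f_j$, where $f_j$ counts symbols seen exactly $j$ times. A minimal sketch of that template with a placeholder coefficient function (the actual Chebyshev-polynomial-based coefficients are constructed in the paper):

```python
from collections import Counter

def fingerprint(sample):
    """f[j] = number of distinct symbols appearing exactly j times."""
    return Counter(Counter(sample).values())

def linear_support_estimate(sample, g):
    """Evaluate a linear estimator S_hat = sum_j g(j) * f_j over the
    fingerprint. With g(j) = 1 this reduces to the plug-in estimator;
    the paper derives coefficients from Chebyshev polynomials instead."""
    return sum(g(j) * f for j, f in fingerprint(sample).items())

sample = ["a", "b", "a", "c", "d", "d", "e"]
print(linear_support_estimate(sample, lambda j: 1))  # plug-in: 5
```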