Proceedings ArticleDOI

Convergence of Chao Unseen Species Estimator

TL;DR: In this article, the authors analyze the Chao estimator and show that its worst case mean squared error (MSE) is smaller than the MSE of the plug-in estimator by a factor of $\mathcal{O}\left((k/n)^2\right)$.
Abstract: Support size estimation and the related problem of unseen species estimation have wide applications in ecology and database analysis. Perhaps the most used support size estimator is the Chao estimator. Despite its widespread use, little is known about its theoretical properties. We analyze the Chao estimator and show that its worst case mean squared error (MSE) is smaller than the MSE of the plug-in estimator by a factor of $\mathcal{O}\left((k/n)^2\right)$. Our main technical contribution is a new method to analyze rational estimators for discrete distribution properties, which may be of independent interest.
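As an illustration of the estimators compared in the abstract, here is a minimal sketch contrasting the plug-in estimate of support size with the standard Chao1 form of the Chao estimator; the function names and the synthetic uniform distribution are assumptions for illustration, and the paper's exact estimator and analysis may differ.

```python
import numpy as np
from collections import Counter

def plugin_estimate(sample):
    """Plug-in estimate of the support size: the number of distinct symbols observed."""
    return len(set(sample))

def chao1_estimate(sample):
    """Chao1 estimator: observed support plus F1^2 / (2*F2), where F1 and F2 are the
    numbers of symbols seen exactly once and exactly twice in the sample."""
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    f2 = sum(1 for c in counts.values() if c == 2)
    observed = len(counts)
    if f2 == 0:
        # Bias-corrected variant, commonly used when no symbol appears twice.
        return observed + f1 * (f1 - 1) / 2
    return observed + f1 * f1 / (2 * f2)

# Illustrative comparison on a uniform distribution over k symbols (assumption).
rng = np.random.default_rng(0)
k, n = 10_000, 3_000
sample = rng.integers(0, k, size=n)
print("plug-in :", plugin_estimate(sample))
print("Chao1   :", chao1_estimate(sample))
print("truth   :", k)
```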
Citations
Book ChapterDOI
01 Jan 2008
TL;DR: Efron and Thisted, as discussed by the authors, studied the frequency distribution of words in the Shakespearean canon and derived the expected number of words that occur x ≥ 1 times in a large sample of n words.
Abstract: This paper is the first of two written by Brad Efron and Ron Thisted studying the frequency distribution of words in the Shakespearean canon. The key idea, due to Fisher in the context of sampling of species, is simple and elegant. When applied to Shakespeare the idea appears preposterous: an author has a personal vocabulary of word species represented by a distribution G, and text is generated by sampling from this distribution. Most results do not require successive words to be sampled independently, which leaves room for individual style and context, but stationarity is needed for prediction and inference. The model then yields an expression for the expected number of words that occur x ≥ 1 times in a large sample of n words.

199 citations
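As an illustration of the species-sampling idea described above, here is a minimal sketch, assuming a plain word list as input, of computing the frequency-of-frequencies profile (how many distinct words occur exactly x times) together with the classical alternating-series prediction of new word types from the Efron-Thisted / Good-Toulmin line of work; the function names and the tiny sample text are illustrative assumptions.

```python
from collections import Counter

def frequency_of_frequencies(words):
    """Return eta[x] = number of distinct words that occur exactly x times."""
    word_counts = Counter(words)
    return dict(sorted(Counter(word_counts.values()).items()))

def new_words_when_doubled(eta, terms=20):
    """Alternating-series prediction eta_1 - eta_2 + eta_3 - ... of the number of new
    word types expected if the sample size were doubled (a sketch of the
    Efron-Thisted / Good-Toulmin idea, truncated after `terms` terms)."""
    return sum((-1) ** (x + 1) * eta.get(x, 0) for x in range(1, terms + 1))

# Tiny illustrative sample (assumption; not the Shakespearean canon).
words = "to be or not to be that is the question whether tis nobler".split()
eta = frequency_of_frequencies(words)
print(eta)                           # {1: 9, 2: 2}: 9 words seen once, 2 seen twice
print(new_words_when_doubled(eta))   # 9 - 2 = 7 predicted new word types
```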

References
Proceedings Article
07 Dec 2015
TL;DR: The first universally near-optimal probability estimators are described; they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the exact distribution, but, like all natural estimators, are restricted to assigning the same probability to all symbols appearing the same number of times.
Abstract: Estimating distributions over large alphabets is a fundamental machine-learning tenet. Yet no method is known to estimate all distributions well. For example, add-constant estimators are nearly min-max optimal but often perform poorly in practice, and practical estimators such as absolute discounting, Jelinek-Mercer, and Good-Turing are not known to be near optimal for essentially any distribution. We describe the first universally near-optimal probability estimators. For every discrete distribution, they are provably nearly the best in the following two competitive ways. First, they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the distribution up to a permutation. Second, they estimate every distribution nearly as well as the best estimator designed with prior knowledge of the exact distribution, but, as all natural estimators, restricted to assign the same probability to all symbols appearing the same number of times. Specifically, for distributions over k symbols and n samples, we show that for both comparisons, a simple variant of the Good-Turing estimator is always within KL divergence of $\left(3 + o_n(1)\right)/n^{1/3}$ from the best estimator, and that a more involved estimator is within $\mathcal{O}_n\left(\min(k/n, 1/\sqrt{n})\right)$. Conversely, we show that any estimator must have a KL divergence at least $\Omega_n\left(\min(k/n, 1/n^{2/3})\right)$ over the best estimator for the first comparison, and at least $\Omega_n\left(\min(k/n, 1/\sqrt{n})\right)$ for the second.

70 citations
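For context on the Good-Turing estimator mentioned in the abstract above, here is a minimal sketch of the classical Good-Turing count adjustment; it is not the modified estimator analyzed in that paper, and the function name and example string are assumptions for illustration.

```python
from collections import Counter

def good_turing_probabilities(sample):
    """Classical Good-Turing estimate: the total probability assigned to all symbols
    appearing exactly r times is (r + 1) * N_{r+1} / n, shared equally among them.
    This is the textbook form, not the paper's modified estimator."""
    n = len(sample)
    counts = Counter(sample)
    freq_of_freq = Counter(counts.values())      # N_r = number of symbols seen r times
    probs = {}
    for symbol, r in counts.items():
        n_r = freq_of_freq[r]
        n_r1 = freq_of_freq.get(r + 1, 0)
        if n_r1 > 0:
            probs[symbol] = (r + 1) * n_r1 / (n * n_r)
        else:
            probs[symbol] = r / n                # fall back to the empirical estimate
    missing_mass = freq_of_freq.get(1, 0) / n    # estimated probability of unseen symbols
    return probs, missing_mass

probs, p0 = good_turing_probabilities("abracadabra")
print(p0)                          # estimated mass of unseen letters
print(sum(probs.values()) + p0)    # need not be exactly 1 without renormalization
```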