Proceedings Article

# Convergence of Chao Unseen Species Estimator

01 Jul 2019, pp. 46-50

TL;DR: The authors analyze the Chao estimator and show that its worst-case mean squared error (MSE) is smaller than the MSE of the plug-in estimator by a factor of $\mathcal{O}\left((k/n)^2\right)$.

Abstract: Support size estimation and the related problem of unseen species estimation have wide applications in ecology and database analysis. Perhaps the most used support size estimator is the Chao estimator. Despite its widespread use, little is known about its theoretical properties. We analyze the Chao estimator and show that its worst-case mean squared error (MSE) is smaller than the MSE of the plug-in estimator by a factor of $\mathcal{O}\left((k/n)^2\right)$. Our main technical contribution is a new method to analyze rational estimators for discrete distribution properties, which may be of independent interest.
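To make the two estimators being compared concrete, here is a minimal Python sketch. It assumes the commonly stated form of the Chao estimator, observed species plus $f_1^2 / (2 f_2)$, where $f_1$ and $f_2$ are the numbers of symbols seen exactly once and exactly twice. The function names and the fallback used when $f_2 = 0$ are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def plug_in_estimate(sample):
    """Plug-in support size: the number of distinct symbols observed."""
    return len(set(sample))


def chao_estimate(sample):
    """Chao-style support size estimate: observed + f1^2 / (2 * f2).

    f1 = number of symbols seen exactly once (singletons)
    f2 = number of symbols seen exactly twice (doubletons)
    """
    counts = Counter(sample)                # symbol -> number of occurrences
    prevalences = Counter(counts.values())  # r -> number of symbols seen r times
    f1 = prevalences[1]
    f2 = prevalences[2]
    observed = len(counts)
    if f2 == 0:
        # Assumption: with no doubletons the ratio is undefined, so this
        # sketch falls back to the bias-corrected term f1 * (f1 - 1) / 2.
        return observed + f1 * (f1 - 1) / 2
    return observed + f1 * f1 / (2 * f2)


sample = list("aaabbcdde")  # a:3, b:2, c:1, d:2, e:1, so f1 = 2 and f2 = 2
print(plug_in_estimate(sample))  # 5
print(chao_estimate(sample))     # 6.0
```

The plug-in estimator can never exceed the number of observed symbols, whereas the Chao-style correction adds an estimate of the unseen symbols based on how many rare symbols appear in the sample.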


##### Citations
Book Chapter
01 Jan 2008
TL;DR: Efron and Thisted studied the frequency distribution of words in the Shakespearean canon and found that the expected number of words that occur x ≥ 1 times in a large sample of n words is
Abstract: This paper is the first of two written by Brad Efron and Ron Thisted studying the frequency distribution of words in the Shakespearean canon. The key idea due to Fisher in the context of sampling of species is simple and elegant. When applied to Shakespeare the idea appears to be preposterous: an author has a personal vocabulary of word species represented by a distribution G, and text is generated by sampling from this distribution. Most results do not require successive words to be sampled independently, which leaves room for individual style and context, but stationarity is needed for prediction and inference. The expected number of words that occur x ≥ 1 times in a large sample of n words is

182 citations

##### References
Journal Article
TL;DR: Efron's method (1981, 1982) is applied to the construction of confidence intervals based on bootstrap distributions.
Abstract: Efron's method (1981, 1982) is applied to the construction of confidence intervals based on bootstrap distributions.

3,461 citations

Journal Article

TL;DR: The purpose of this study was to determine the bacterial diversity in the human subgingival plaque by using culture-independent molecular methods as part of an ongoing effort to obtain full 16S rRNA sequences for all cultivable and not-yet-cultivated species of human oral bacteria.
Abstract: The purpose of this study was to determine the bacterial diversity in the human subgingival plaque by using culture-independent molecular methods as part of an ongoing effort to obtain full 16S rRNA sequences for all cultivable and not-yet-cultivated species of human oral bacteria. Subgingival plaque was analyzed from healthy subjects and subjects with refractory periodontitis, adult periodontitis, human immunodeficiency virus periodontitis, and acute necrotizing ulcerative gingivitis. 16S ribosomal DNA (rDNA) bacterial genes from DNA isolated from subgingival plaque samples were PCR amplified with all-bacterial or selective primers and cloned into Escherichia coli. The sequences of cloned 16S rDNA inserts were used to determine species identity or closest relatives by comparison with sequences of known species. A total of 2,522 clones were analyzed. Nearly complete sequences of approximately 1,500 bases were obtained for putative new species. About 60% of the clones fell into 132 known species, 70 of which were identified from multiple subjects. About 40% of the clones were novel phylotypes. Of the 215 novel phylotypes, 75 were identified from multiple subjects. Known putative periodontal pathogens such as Porphyromonas gingivalis, Bacteroides forsythus, and Treponema denticola were identified from multiple subjects, but typically as a minor component of the plaque as seen in cultivable studies. Several phylotypes fell into two recently described phyla previously associated with extreme natural environments, for which there are no cultivable species. A number of species or phylotypes were found only in subjects with disease, and a few were found only in healthy subjects. The organisms identified only from diseased sites deserve further study as potential pathogens. Based on the sequence data in this study, the predominant subgingival microbial community consisted of 347 species or phylotypes that fall into 9 bacterial phyla. 
Based on the 347 species seen in our sample of 2,522 clones, we estimate that there are 68 additional unseen species, for a total estimate of 415 species in the subgingival plaque. When organisms found on other oral surfaces such as the cheek, tongue, and teeth are added to this number, the best estimate of the total species diversity in the oral cavity is approximately 500 species, as previously proposed.

1,835 citations

Journal Article

TL;DR: The authors unify interpolation (rarefaction) and extrapolation of species richness under multinomial, Poisson, and Bernoulli product sampling models, and provide new unconditional variance estimators for classical, individual-based rarefaction and for Coleman rarefaction.
Abstract: Aims In ecology and conservation biology, the number of species counted in a biodiversity study is a key metric but is usually a biased underestimate of total species richness because many rare species are not detected. Moreover, comparing species richness among sites or samples is a statistical challenge because the observed number of species is sensitive to the number of individuals counted or the area sampled. For individual-based data, we treat a single, empirical sample of species abundances from an investigator-defined species assemblage or community as a reference point for two estimation objectives under two sampling models: estimating the expected number of species (and its unconditional variance) in a random sample of (i) a smaller number of individuals (multinomial model) or a smaller area sampled (Poisson model) and (ii) a larger number of individuals or a larger area sampled. For sample-based incidence (presence–absence) data, under a Bernoulli product model, we treat a single set of species incidence frequencies as the reference point to estimate richness for smaller and larger numbers of sampling units. Methods The first objective is a problem in interpolation that we address with classical rarefaction (multinomial model) and Coleman rarefaction (Poisson model) for individual-based data and with sample-based rarefaction (Bernoulli product model) for incidence frequencies. The second is a problem in extrapolation that we address with sampling-theoretic predictors for the number of species in a larger sample (multinomial model), a larger area (Poisson model) or a larger number of sampling units (Bernoulli product model), based on an estimate of asymptotic species richness. Although published methods exist for many of these objectives, we bring them together here with some new estimators under a unified statistical and notational framework. 
This novel integration of mathematically distinct approaches allowed us to link interpolated (rarefaction) curves and extrapolated curves to plot a unified species accumulation curve for empirical examples. We provide new, unconditional variance estimators for classical, individual-based rarefaction and for Coleman rarefaction, long missing from the toolkit of biodiversity measurement. We illustrate these methods with datasets for tropical beetles, tropical trees and tropical ants.

1,225 citations

Journal Article
TL;DR: This article addresses how well a sample reflects a community's “true” diversity, noting that new genetic techniques have revealed extensive microbial diversity previously undetected with culture-dependent methods.
Abstract: All biologists who sample natural communities are plagued with the problem of how well a sample reflects a community's “true” diversity. New genetic techniques have revealed extensive microbial diversity that was previously undetected with culture-dependent methods and morphological

1,143 citations

Proceedings Article
08 May 2007
TL;DR: The study involved half a million users over a three-month period and yields extremely detailed data on password strength, the types and lengths of passwords chosen, and how they vary by site.
Abstract: We report the results of a large-scale study of password use and password re-use habits. The study involved half a million users over a three-month period. A client component on users' machines recorded a variety of password strength, usage and frequency metrics. This allows us to measure or estimate such quantities as the average number of passwords and average number of accounts each user has, how many passwords she types per day, how often passwords are shared among sites, and how often they are forgotten. We get extremely detailed data on password strength, the types and lengths of passwords chosen, and how they vary by site. The data is the first large-scale study of its kind, and yields numerous other insights into the role the passwords play in users' online experience.

1,011 citations