Journal Article

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

01 Aug 2011-Nucleic Acids Research (Oxford University Press)-Vol. 39, Iss: 14
TL;DR: A new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work and exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
Abstract: Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
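The idea of representing a cluster as a probabilistic sequence can be illustrated with a small sketch. This is not the authors' implementation: it assumes the sequences are already aligned to a common length (the paper aligns the probabilistic sequences themselves), and the names `ALPHABET`, `profile` and `profile_distance` are hypothetical. The distance used here, the expected per-column mismatch probability, is likewise an illustrative choice rather than the genetic distance defined in the paper.

```python
from collections import Counter

# Hypothetical alphabet: four nucleotides plus a gap symbol.
ALPHABET = "ACGT-"


def profile(aligned_seqs):
    """Summarise a cluster as per-column symbol frequencies.

    Assumes every sequence has already been aligned to the same length.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    n = len(aligned_seqs)
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        columns.append({a: counts.get(a, 0) / n for a in ALPHABET})
    return columns


def profile_distance(p, q):
    """Average probability that symbols drawn from p and q differ in a column."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same length")
    mismatch = 0.0
    for col_p, col_q in zip(p, q):
        match = sum(col_p[a] * col_q[a] for a in ALPHABET)
        mismatch += 1.0 - match
    return mismatch / len(p)


cluster_a = profile(["ACGT", "ACGA"])
cluster_b = profile(["ACGT"])
distance = profile_distance(cluster_a, cluster_b)
```

Comparing two such profiles costs time proportional to the alignment length, regardless of how many sequences each cluster holds, which is the point of avoiding exhaustive pairwise distances between cluster members.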


Citations
Journal Article
TL;DR: The Poisson tree processes (PTP) model is introduced to infer putative species boundaries on a given phylogenetic input tree and yields more accurate results than de novo species delimitation methods.
Abstract: Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets. Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP requires neither an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales on large datasets because it relies on the parallel implementations of the EPA and RAxML, thereby allowing species to be delimited in high-throughput sequencing data. Availability and implementation: The code is freely available at www.

1,868 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...De novo OTU-picking usually relies on unsupervised machine learning methods (Cai and Sun, 2011; Edgar, 2010; Fu et al., 2012) that...


Journal Article
TL;DR: This Galaxy-supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation, with a multi-affiliation output that highlights database conflicts and uncertainties.
Abstract: Motivation: Metagenomics leads to major advances in microbial ecology and biologists need user-friendly tools to analyze their data on their own. Results: This Galaxy-supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation. The clustering uses Swarm. The chimera removal uses VSEARCH, combined with original cross-sample validation. The taxonomic affiliation returns an innovative multi-affiliation output to highlight database conflicts and uncertainties. Statistical results and numerous graphical illustrations are produced along the way to monitor the pipeline. FROGS was tested for the detection and quantification of OTUs on real and in silico datasets and proved to be rapid, robust and highly sensitive. It compares favorably with the widespread mothur, UPARSE and QIIME. Availability and implementation: Source code and instructions for installation: https://github.com/geraldinepascal/FROGS.git. A companion website: http://frogs.toulouse.inra.fr. Contact: geraldine.pascal@inra.fr. Supplementary information: Supplementary data are available at Bioinformatics online.

527 citations


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...Particularly, Illumina data, with dozens of samples routinely sequenced at depths over 100 000 reads, are hard to process in a reasonable time (Cai and Sun, 2011; Fu et al., 2012)....


Journal Article
TL;DR: Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into ‘MED nodes’, which represent homogeneous operational taxonomic units and enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision.
Abstract: Molecular microbial ecology investigations often employ large marker gene datasets, for example, ribosomal RNAs, to represent the occurrence of single-cell genomes in microbial communities. Massively parallel DNA sequencing technologies enable extensive surveys of marker gene libraries that sometimes include nearly identical sequences. Computational approaches that rely on pairwise sequence alignments for similarity assessment and de novo clustering with de facto similarity thresholds to partition high-throughput sequencing datasets constrain fine-scale resolution descriptions of microbial communities. Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into 'MED nodes', which represent homogeneous operational taxonomic units. By employing Shannon entropy, MED uses only the information-rich nucleotide positions across reads and iteratively partitions large datasets while omitting stochastic variation. When applied to analyses of microbiomes from two deep-sea cryptic sponges Hexadella dedritifera and Hexadella cf. dedritifera, MED resolved a key Gammaproteobacteria cluster into multiple MED nodes that are specific to different sponges, and revealed that these closely related sympatric sponge species maintain distinct microbial communities. MED analysis of a previously published human oral microbiome dataset also revealed that taxa separated by less than 1% sequence variation distributed to distinct niches in the oral cavity. The information theory-guided decomposition process behind the MED algorithm enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision.
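The role of Shannon entropy in picking out information-rich nucleotide positions can be shown with a minimal sketch. It assumes reads that are already aligned to equal length, and the function names `column_entropy` and `entropy_profile` are hypothetical; it only computes per-position entropy and does not reproduce the iterative partitioning performed by MED.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols observed at one position."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(aligned_reads):
    """Entropy of every alignment column, assuming equal-length reads."""
    return [column_entropy(col) for col in zip(*aligned_reads)]


reads = ["ACGT", "ACGA", "ACCT", "ACCA"]
entropies = entropy_profile(reads)
# Positions whose entropy is above zero vary across reads.
variable_positions = [i for i, h in enumerate(entropies) if h > 0]
```

Columns where every read agrees have zero entropy and carry no information for splitting the dataset, while columns with an even mix of symbols have the highest entropy.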

472 citations


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...…(Matias Rodrigues and von Mering, 2014)) as well as greedy but more computationally efficient heuristics that perform sequence comparison and OTU identification simultaneously (i.e., ESPRIT-Tree (Cai and Sun, 2011), CD-HIT (Li et al., 2001), UCLUST (Edgar, 2010), DySC (Zheng et al., 2012))....


Journal Article
14 Mar 2012-PLOS ONE
TL;DR: Although the bacterial taxa may vary considerably between cow rumens, they appear to be phylogenetically related, which suggests that the functional requirement imposed by the rumen ecological niche selects taxa that potentially share similar genetic features.
Abstract: The bovine rumen houses a complex microbiota which is responsible for cattle's remarkable ability to convert indigestible plant mass into food products. Despite this ecosystem's enormous significance for humans, the composition and similarity of bacterial communities across different animals and the possible presence of some bacterial taxa in all animals' rumens have yet to be determined. We characterized the rumen bacterial populations of 16 individual lactating cows using tag amplicon pyrosequencing. Our data showed 51% similarity in bacterial taxa across samples when abundance and occurrence were analyzed using the Bray-Curtis metric. By adding taxon phylogeny to the analysis using a weighted UniFrac metric, the similarity increased to 82%. We also counted 32 genera that are shared by all samples, exhibiting high variability in abundance across samples. Taken together, our results suggest a core microbiome in the bovine rumen. Furthermore, although the bacterial taxa may vary considerably between cow rumens, they appear to be phylogenetically related. This suggests that the functional requirement imposed by the rumen ecological niche selects taxa that potentially share similar genetic features.
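For readers unfamiliar with the Bray-Curtis measure mentioned in this abstract, the following sketch computes it from two abundance vectors. The function name and the example counts are hypothetical, and the study's own pipeline and data are not reproduced here; similarity is taken as one minus the dissimilarity.

```python
def bray_curtis_dissimilarity(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    if len(u) != len(v):
        raise ValueError("abundance vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        raise ValueError("at least one abundance must be non-zero")
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical taxon counts for two samples (same taxa, same order).
sample_1 = [6, 2, 2]
sample_2 = [2, 6, 2]
dissimilarity = bray_curtis_dissimilarity(sample_1, sample_2)
similarity = 1.0 - dissimilarity
```

The measure uses abundances only, so two samples dominated by different but closely related taxa look dissimilar; a phylogeny-aware measure such as weighted UniFrac accounts for that relatedness.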

470 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...Therefore, we used three different clustering methods for OTU generation: UCLUST [25], ESPRIT-tree [26] and CD_HIT_OTU [27] (Table S1), which have been proven to generate satisfactory and comparable numbers of OTUs [24]....


Journal Article
TL;DR: Using a large set of high‐quality 16S rRNA sequences from finished genomes, the correspondence of OTUs to species is assessed for five representative clustering algorithms using four accuracy metrics and all algorithms had comparable accuracy when tuned to a given metric.
Abstract: Motivation: The 16S ribosomal RNA (rRNA) gene is widely used to survey microbial communities. Sequences are often clustered into Operational Taxonomic Units (OTUs) as proxies for species. The canonical clustering threshold is 97% identity, which was proposed in 1994 when few 16S rRNA sequences were available, motivating a reassessment on current data. Results: Using a large set of high-quality 16S rRNA sequences from finished genomes, I assessed the correspondence of OTUs to species for five representative clustering algorithms using four accuracy metrics. All algorithms had comparable accuracy when tuned to a given metric. Optimal identity thresholds were ∼99% for full-length sequences and ∼100% for the V4 hypervariable region. Availability and implementation: Reference sequences and source code are provided in the Supplementary Material. Supplementary information: Supplementary data are available at Bioinformatics online.

443 citations

References
Journal Article
TL;DR: In this issue of PNAS, Fierer et al. (9) advance our knowledge of skin-associated bacterial communities and provide a forensic approach that opens new doors of inquiry in the Human Microbiome Project.
Abstract: Since the earliest days of microbiology, it has been clear that all humans carry many of the same microbial lineages. We now call the collection of microbes living in or on our bodies the human microbiome (1), and a large international effort, called the Human Microbiome Project (HMP) by the National Institutes of Health (2), and known more broadly as the International Human Microbiome Consortium (IHMC), is aimed at its characterization. Studies over the past 5 years focusing on the bacteria present in human skin, for example, indicate that in addition to conserved organisms, there is enormous diversity in identities and abundances (3–6). In part, our appreciation of the conservation and diversity is a function of the level at which the microbiome is being observed. At the phylum level, humans are remarkably similar to one another [and to other mammals (7)], whereas at the genus, species, and strain population levels, the diversity is highly specific for each individual (3–6). A very recent study shows that even for 57 gut bacterial species present in >90% of 124 sampled persons, their estimated abundances varied by 12- to 2,000-fold (8). The challenge is: How can we make sense of our microbial and metagenomic diversity and, importantly, use the information to improve the human condition? In this issue of PNAS, Fierer et al. (9) advance our knowledge of skin-associated bacterial communities and provide a forensic approach that opens new doors of inquiry. Animals and their resident microbes have coexisted for at least a billion years (10). To understand their relationships, we can ask a series of fundamental biological questions (Fig. 1). Currently, there is extensive interest in obtaining a census of bacteria populating the different niches in various individuals (question 1). This area is advancing well, at least in part because the analytical and computational tools are well formed to answer such questions (8, 11). 
More important is to understand the physiochemical and metabolic activities (8, 12) of the microbiota (question 2) because these are more biologically significant than census taking. For example, the skin has many important functions critical to host survival, including protection against pathogens and physical agents, metabolic synthesis and storage, heat regulation, and sensation. What roles do our cutaneous microbiota have in these processes, up until now considered exclusively in the host domain?
[Fig. 1. Five fundamental biological questions that underlie the Human Microbiome Project. Initially, answers are phenomenologic but ultimately must be understood in the context of evolutionary processes. Along the way, myriad applications will be discovered.]
Host responses (question 3) may be immunologic, metabolic, and/or physical (e.g., peristaltic motion, sloughing of cells), and ultimately are important in understanding both health and disease. The conservation of microbes across most hosts, and over long time periods, implies a biological equilibrium (question 4), but there has been little analysis of the forces that create and maintain specific equilibria (13). As more information is revealed, the framework for the relationship may become more apparent. In total, each individual is the summation of all of the ground rules, circumstances, activities, and interactions (question 5); the characteristics of our genome and microbiome and their interactions in large measure define our individual uniqueness (14–16). After the 2001 anthrax bioterrorism attacks, scientists saw the need for having reliable and validated procedures for tracking microbial “suspects” and identifying their “hideout,” but also considered that the need for such procedures is much more general (17, 18). How can one “fingerprint” a microbe (17)? 
Microbial forensics had its origins in molecular epidemiology, using molecular (often DNA based) techniques to solve questions of disease extent and the spread and transmission of individual microbial species. In this issue of PNAS, Fierer et al. (9) explore whether the microbial characteristics of the residue of human fingers and hands left on inanimate objects leave a sufficient pattern that can be used to create a microbial fingerprint useful for forensic purposes. This is a novel application, and is particularly remarkable because most prior studies that examined the conjunction of bacteria and fingertips sought ways to remove bacteria, or to prevent their transmission, especially nosocomial pathogens, from fingers to mouth, for example (19, 20). Although preliminary, the work of Fierer et al. (9) is soundly based on techniques and analyses that the group has pioneered (4, 6, 7, 11), and points the way in which microbiome analysis can be used to further microbial forensic technology. Future advances in this field might encompass greater sequencing depth, microbial genes beyond 16S rRNA, or inanimate objects such as glass, ceramic, or even clothing. Just as bloodhounds can detect the unique spoor of an individual, we can harness nucleic acid technologies and informatics to follow a microbial trail. The work of Fierer et al. (9) leads us along an interesting new path in forensics and biology. Can changes in microbial compositions provide clues about poisonings? Could we purposefully store samples of our microbiome from saliva or stool, or in the form of our new fingerprints, to provide another level of proof of identity? Surely, such analysis will lead to new ethical dilemmas. One important issue is whether washing or antibiotic treatments, for example, sufficiently alter our profile so that identity can be hidden or is intrinsically unstable. 
Or in contrast, can we identify a core individual-specific signature that withstands such adventitious or purposeful perturbations? Just as we have an international HapMap project to understand the variation in the human genome (21), would a similarly planned project to understand worldwide conservation and diversity in the human microbiome (2) be a critical platform for future forensics? Could we also use such analysis to outline changes in the human microbiome associated with socioeconomic development (22) that might have important consequences for health and disease? In our increasingly sanitized world, can such studies help us approach the contexts in which personal hand-sanitizers are beneficial or harmful? More than a billion years ago, animals began domesticating microbes and allowing them permanent residence. Although we might ask who domesticated whom, we are learning how important these residents are to host survival, and how to further understand them to deepen our knowledge of personal health, identity, and social interaction. Our microbial fingerprints (9) are a fine example of our advancing knowledge.

64 citations

Journal Article
TL;DR: In this paper, the authors gave a data structure of size O(n) that maintains a closest pair of n points in O(log n) time per insertion and deletion.
Abstract: Given a set S of n points in k-dimensional space, and an L_t metric, the dynamic closest-pair problem is defined as follows: find a closest pair of S after each update of S (the insertion or the deletion of a point). For fixed dimension k and fixed metric L_t, we give a data structure of size O(n) that maintains a closest pair of S in O(log n) time per insertion and deletion. The running time of the algorithm is optimal up to a constant factor because Ω(log n) is a lower bound, in an algebraic decision-tree model of computation, on the time complexity of any algorithm that maintains the closest pair (for k = 1). The algorithm is based on the fair-split tree. The constant factor in the update time is exponential in the dimension. We modify the fair-split tree to reduce it.

58 citations

Journal Article
TL;DR: Describes a novel analytical strategy, including discriminant and topology analyses, that enables researchers to deeply investigate the hidden world of microbial communities, far beyond basic microbial diversity estimation.
Abstract: With the aid of next-generation sequencing technology, researchers can now obtain millions of microbial signature sequences for diverse applications ranging from human epidemiological studies to global ocean surveys. The development of advanced computational strategies to maximally extract pertinent information from massive nucleotide data has become a major focus of the bioinformatics community. Here, we describe a novel analytical strategy including discriminant and topology analyses that enables researchers to deeply investigate the hidden world of microbial communities, far beyond basic microbial diversity estimation. We demonstrate the utility of our approach through a computational study performed on a previously published massive human gut 16S rRNA data set. The application of discriminant and topology analyses enabled us to derive quantitative disease-associated microbial signatures and describe microbial community structure in far more detail than previously achievable. Our approach provides rigorous statistical tools for sequence-based studies aimed at elucidating associations between known or unknown organisms and a variety of physiological or environmental conditions.

45 citations


"ESPRIT-Tree: hierarchical clusterin..." refers background or methods in this paper

  • ...We recently developed a new algorithm, referred to as ESPRIT, that enables researchers to handle up to one million sequences by using a computer cluster (12,13)....


  • ...We have previously applied ESPRIT to the same gut data set using a computer cluster of 100 processors (12)....


  • ...In addition to microbial diversity estimation, there is currently increased interest in applying taxonomy-independent analysis to analyze millions of sequences for comparative microbial community analysis (11,12)....


Journal Article
TL;DR: Outlines the strategies of COML projects to efficiently reveal the 95% of the biosphere beneath the waves, from microbes to whales, and to develop tools for monitoring and managing future ocean ecosystems so that they maintain their capacity to provide crucial services.
Abstract: The Census of Marine Life aims to assess and explain the changing diversity, distribution, and abundance of marine species from the past to the present, and to project future ocean life. It assembles known historical data back to 1500 in an online Ocean Biogeographic Information System (OBIS) and has over 1000 scientists from 70 countries using advanced technologies to quantify and discover unknown life in under-explored ocean realms. Over 99% of the 6 million records now in OBIS are from the top 1000m of the water column, so the mid-waters and floor of the open ocean and the polar ice oceans are special targets. Even where the species are known, their distributions and abundance are largely speculative. This report outlines the strategies of COML projects to efficiently reveal the 95% of the biosphere beneath the waves, from microbes to whales. Open access to the OBIS data set will improve capacity to predict future impacts of climate and human activity. The baseline created by 2010 and the calibrated techniques developed will become important tools for monitoring and managing future ocean ecosystems to maintain their capacity to provide crucial services to our blue planet.

32 citations

Journal Article
06 Sep 2002
TL;DR: It is shown that the complete linkage clustering of n points in R^d, where d ≥ 1 is a constant, can be computed in optimal O(n log n) time and linear space, under the L_1 and L_∞ metrics.
Abstract: It is shown that the complete linkage clustering of n points in R^d, where d ≥ 1 is a constant, can be computed in optimal O(n log n) time and linear space, under the L_1 and L_∞ metrics. Furthermore, for every other fixed L_t metric, it is shown that it can be approximated within an arbitrarily small constant factor in O(n log n) time and linear space.
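As a reference point for what complete linkage clustering computes, here is a deliberately naive sketch under the L_1 metric. It is not the O(n log n) algorithm described in this abstract: it rescans all cluster pairs at every merge and takes roughly cubic time or worse. The function names and the stopping `threshold` parameter are assumptions added for illustration.

```python
def l1_distance(p, q):
    """L_1 (Manhattan) distance between two points of equal dimension."""
    return sum(abs(a - b) for a, b in zip(p, q))


def complete_linkage(points, threshold):
    """Merge clusters while the closest pair is within `threshold`.

    The distance between two clusters is the largest distance between
    any of their members (complete linkage). Returns lists of indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None  # (distance, index of first cluster, index of second)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = max(
                    l1_distance(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0, 0), (1, 0), (10, 10), (11, 10)]
clusters = complete_linkage(points, threshold=2)
```

The repeated full scan is what makes the naive approach impractical for large inputs and what sub-quadratic methods are designed to avoid.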

26 citations


"ESPRIT-Tree: hierarchical clusterin..." refers background in this paper


  • ...In the last decade, researchers developed several approximate hierarchical clustering algorithms with sub-quadratic time complexity (19,20)....
