scispace - formally typeset
Search or ask a question
Journal ArticleDOI

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

01 Aug 2011-Nucleic Acids Research (Oxford University Press)-Vol. 39, Iss: 14
TL;DR: A new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work and exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
Abstract: Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.

Content maybe subject to copyright    Report

Citations
More filters
Journal ArticleDOI
TL;DR: The Poisson tree processes (PTP) model is introduced to infer putative species boundaries on a given phylogenetic input tree and yields more accurate results than de novo species delimitation methods.
Abstract: Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets. Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GYMC as well as OTU-picking methods when evolutionary distances between species are small. PTP neither requires an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales on large datasets because it relies on the parallel implementations of the EPA and RAxML, thereby allowing to delimit species in high-throughput sequencing data. Availability and implementation: The code is freely available at www.

1,868 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...De novo OTU-picking usually relies on unsupervised machine learning methods (Cai and Sun, 2011; Edgar, 2010; Fu et al., 2012) that *To whom correspondence should be addressed....

    [...]

  • ...De novo OTU-picking usually relies on unsupervised machine learning methods (Cai and Sun, 2011; Edgar, 2010; Fu et al., 2012) that*To whom correspondence should be addressed....

    [...]

Journal ArticleDOI
TL;DR: This Galaxy‐supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation to highlight databases conflicts and uncertainties.
Abstract: Motivation Metagenomics leads to major advances in microbial ecology and biologists need user friendly tools to analyze their data on their own. Results This Galaxy-supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation. The clustering uses Swarm. The chimera removal uses VSEARCH, combined with original cross-sample validation. The taxonomic affiliation returns an innovative multi-affiliation output to highlight databases conflicts and uncertainties. Statistical results and numerous graphical illustrations are produced along the way to monitor the pipeline. FROGS was tested for the detection and quantification of OTUs on real and in silico datasets and proved to be rapid, robust and highly sensitive. It compares favorably with the widespread mothur, UPARSE and QIIME. Availability and implementation Source code and instructions for installation: https://github.com/geraldinepascal/FROGS.git. A companion website: http://frogs.toulouse.inra.fr. Contact geraldine.pascal@inra.fr. Supplementary information Supplementary data are available at Bioinformatics online.

527 citations


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...Particularly, Illumina data, with dozens of samples routinely sequenced at depths over 100 000 reads, are hard to process in a reasonable time (Cai and Sun, 2011; Fu et al., 2012)....

    [...]

Journal ArticleDOI
TL;DR: Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into ‘MED nodes’, which represent homogeneous operational taxonomic units and enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision.
Abstract: Molecular microbial ecology investigations often employ large marker gene datasets, for example, ribosomal RNAs, to represent the occurrence of single-cell genomes in microbial communities. Massively parallel DNA sequencing technologies enable extensive surveys of marker gene libraries that sometimes include nearly identical sequences. Computational approaches that rely on pairwise sequence alignments for similarity assessment and de novo clustering with de facto similarity thresholds to partition high-throughput sequencing datasets constrain fine-scale resolution descriptions of microbial communities. Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into 'MED nodes', which represent homogeneous operational taxonomic units. By employing Shannon entropy, MED uses only the information-rich nucleotide positions across reads and iteratively partitions large datasets while omitting stochastic variation. When applied to analyses of microbiomes from two deep-sea cryptic sponges Hexadella dedritifera and Hexadella cf. dedritifera, MED resolved a key Gammaproteobacteria cluster into multiple MED nodes that are specific to different sponges, and revealed that these closely related sympatric sponge species maintain distinct microbial communities. MED analysis of a previously published human oral microbiome dataset also revealed that taxa separated by less than 1% sequence variation distributed to distinct niches in the oral cavity. The information theory-guided decomposition process behind the MED algorithm enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision.

472 citations


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...…(Matias Rodrigues and von Mering, 2014)) as well as greedy but more computationally efficient heuristics that perform sequence comparison and OTU identification simultaneously (i.e., ESPRIT-Tree (Cai and Sun, 2011), CD-HIT (Li et al., 2001), UCLUST (Edgar, 2010), DySC (Zheng et al., 2012))....

    [...]

  • ...E-mail: meren@mbl.edu Received 13 June 2014; revised 2 September 2014; accepted 7 September 2014; published online 17 October 2014 The ISME Journal (2015) 9, 968–979 & 2015 International Society for Microbial Ecology All rights reserved 1751-7362/15 www.nature.com/ismej HPC-CLUST (Matias Rodrigues and von Mering, 2014)) as well as greedy but more computationally efficient heuristics that perform sequence comparison and OTU identification simultaneously (i.e., ESPRIT-Tree (Cai and Sun, 2011), CD-HIT (Li et al., 2001), UCLUST (Edgar, 2010), DySC (Zheng et al., 2012))....

    [...]

Journal ArticleDOI
14 Mar 2012-PLOS ONE
TL;DR: Although the bacterial taxa may vary considerably between cow rumens, they appear to be phylogenetically related, which suggests that the functional requirement imposed by the rumen ecological niche selects taxa that potentially share similar genetic features.
Abstract: The bovine rumen houses a complex microbiota which is responsible for cattle's remarkable ability to convert indigestible plant mass into food products. Despite this ecosystem's enormous significance for humans, the composition and similarity of bacterial communities across different animals and the possible presence of some bacterial taxa in all animals' rumens have yet to be determined. We characterized the rumen bacterial populations of 16 individual lactating cows using tag amplicon pyrosequencing. Our data showed 51% similarity in bacterial taxa across samples when abundance and occurrence were analyzed using the Bray-Curtis metric. By adding taxon phylogeny to the analysis using a weighted UniFrac metric, the similarity increased to 82%. We also counted 32 genera that are shared by all samples, exhibiting high variability in abundance across samples. Taken together, our results suggest a core microbiome in the bovine rumen. Furthermore, although the bacterial taxa may vary considerably between cow rumens, they appear to be phylogenetically related. This suggests that the functional requirement imposed by the rumen ecological niche selects taxa that potentially share similar genetic features.

470 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...Therefore, we used three different clustering methods for OTU generation: UCLUST [25], ESPRIT-tree [26] and CD_HIT_OTU [27] (Table S1), which have been proven to generate satisfactory and comparable numbers of OTUs [24]....

    [...]

Journal ArticleDOI
TL;DR: Using a large set of high‐quality 16S rRNA sequences from finished genomes, the correspondence of OTUs to species is assessed for five representative clustering algorithms using four accuracy metrics and all algorithms had comparable accuracy when tuned to a given metric.
Abstract: Motivation The 16S ribosomal RNA (rRNA) gene is widely used to survey microbial communities Sequences are often clustered into Operational Taxonomic Units (OTUs) as proxies for species The canonical clustering threshold is 97% identity, which was proposed in 1994 when few 16S rRNA sequences were available, motivating a reassessment on current data Results Using a large set of high-quality 16S rRNA sequences from finished genomes, I assessed the correspondence of OTUs to species for five representative clustering algorithms using four accuracy metrics All algorithms had comparable accuracy when tuned to a given metric Optimal identity thresholds were ∼99% for full-length sequences and ∼100% for the V4 hypervariable region Availability and implementation Reference sequences and source code are provided in the Supplementary Material Supplementary information Supplementary data are available at Bioinformatics online

443 citations

References
More filters
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the logexpectation score, and refinement using treedependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5. com/muscle.

37,524 citations

Journal ArticleDOI
TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.
Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

28,911 citations


"ESPRIT-Tree: hierarchical clusterin..." refers background in this paper

  • ...In addition to microbial diversity estimation, there is currently increased interest in applying taxonomyindependent analysis to analyze millions of sequences for comparative microbial community analysis (11,12)....

    [...]

  • ...05 level 241 (7) 268 (6) 362 (11) 314 (9) peak NMI-species 402 (9) 400 (9) 590 (13) 314 (9) peak NMI-genus 190 (5) 176 (7) 216 (6) 243 (7)...

    [...]

Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition,this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition,Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity,and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition,this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further,the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm,a design technique,an application area,or a related topic. The chapters are not dependent on one another,so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally,the new edition offers a 25% increase over the first edition in the number of problems,giving the book 155 problems and over 900 exercises thatreinforcethe concepts the students are learning.

21,651 citations

Journal ArticleDOI
TL;DR: In this paper, a procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical.
Abstract: A procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure permits their reduction to n − 1 mutually exclusive sets by considering the union of all possible n(n − 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained. A general flowchart helpful in computer programming and a numerical example are included.

17,405 citations

Related Papers (5)