Journal ArticleDOI

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

01 Aug 2011-Nucleic Acids Research (Oxford University Press)-Vol. 39, Iss: 14

TL;DR: A new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work and exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.

Abstract: Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
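The abstract does not spell out how a probabilistic sequence is built or compared, so the following is only a minimal sketch of the general idea, under stated assumptions: each cluster is summarized as per-column symbol frequencies over already-aligned, equal-length sequences, and the distance between two clusters is the average probability that symbols drawn from the two profiles differ. The names `profile` and `expected_mismatch`, the alphabet, and the distance itself are illustrative choices, not the paper's definitions, and the alignment step the paper describes is skipped here.

```python
from collections import Counter

# Assumed alphabet: four nucleotides plus a gap symbol.
ALPHABET = "ACGT-"


def profile(aligned_seqs):
    """Summarize a cluster as per-column symbol frequencies.

    Assumes the sequences are already aligned to the same length.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be aligned to the same length")
    n = len(aligned_seqs)
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        columns.append({a: counts.get(a, 0) / n for a in ALPHABET})
    return columns


def expected_mismatch(p, q):
    """Average, over columns, of the probability that symbols drawn
    independently from profiles p and q differ (a hypothetical distance)."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of columns")
    total = 0.0
    for col_p, col_q in zip(p, q):
        match = sum(col_p[a] * col_q[a] for a in ALPHABET)
        total += 1.0 - match
    return total / len(p)


# Toy clusters (made-up data, not from the paper).
cluster_a = ["ACGT", "ACGT", "ACGA"]
cluster_b = ["ACCT", "ACCT"]
pa, pb = profile(cluster_a), profile(cluster_b)
d = expected_mismatch(pa, pb)
```

The point of such a summary is that comparing two clusters costs one pass over the profile columns, regardless of how many sequences each cluster contains, which is the kind of saving the abstract alludes to when it mentions avoiding exhaustive pairwise distance computation.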

Citations

Journal ArticleDOI
TL;DR: The Poisson tree processes (PTP) model is introduced to infer putative species boundaries on a given phylogenetic input tree and yields more accurate results than de novo species delimitation methods.
Abstract: Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets. Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP neither requires an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales on large datasets because it relies on the parallel implementations of the EPA and RAxML, thereby allowing to delimit species in high-throughput sequencing data. Availability and implementation: The code is freely available at www.

1,380 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...De novo OTU-picking usually relies on unsupervised machine learning methods (Cai and Sun, 2011; Edgar, 2010; Fu et al., 2012) that...

    [...]



Journal ArticleDOI
TL;DR: Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into ‘MED nodes’, which represent homogeneous operational taxonomic units and enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision.
Abstract: Molecular microbial ecology investigations often employ large marker gene datasets, for example, ribosomal RNAs, to represent the occurrence of single-cell genomes in microbial communities. Massively parallel DNA sequencing technologies enable extensive surveys of marker gene libraries that sometimes include nearly identical sequences. Computational approaches that rely on pairwise sequence alignments for similarity assessment and de novo clustering with de facto similarity thresholds to partition high-throughput sequencing datasets constrain fine-scale resolution descriptions of microbial communities. Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into 'MED nodes', which represent homogeneous operational taxonomic units. By employing Shannon entropy, MED uses only the information-rich nucleotide positions across reads and iteratively partitions large datasets while omitting stochastic variation. When applied to analyses of microbiomes from two deep-sea cryptic sponges Hexadella dedritifera and Hexadella cf. dedritifera, MED resolved a key Gammaproteobacteria cluster into multiple MED nodes that are specific to different sponges, and revealed that these closely related sympatric sponge species maintain distinct microbial communities. MED analysis of a previously published human oral microbiome dataset also revealed that taxa separated by less than 1% sequence variation distributed to distinct niches in the oral cavity. The information theory-guided decomposition process behind the MED algorithm enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision.
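The abstract says MED uses Shannon entropy to find information-rich nucleotide positions but does not give the procedure. As a rough illustration only, the per-column entropy of a set of aligned reads can be computed as below; the function names and the toy reads are assumptions for this sketch, and the iterative partitioning that MED performs is not shown.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(aligned_reads):
    """Entropy of every column; assumes all reads have the same length."""
    return [column_entropy(col) for col in zip(*aligned_reads)]


# Toy reads (made-up data): only the third position varies.
reads = ["ACGT", "ACGT", "ACTT", "ACTT"]
entropies = entropy_profile(reads)
```

Columns where every read agrees have zero entropy, while a column split evenly between two nucleotides carries one bit, so high-entropy positions are natural candidates for separating reads into groups.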

404 citations


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...…(Matias Rodrigues and von Mering, 2014)) as well as greedy but more computationally efficient heuristics that perform sequence comparison and OTU identification simultaneously (i.e., ESPRIT-Tree (Cai and Sun, 2011), CD-HIT (Li et al., 2001), UCLUST (Edgar, 2010), DySC (Zheng et al., 2012))....

    [...]



Journal ArticleDOI
14 Mar 2012-PLOS ONE
TL;DR: Although the bacterial taxa may vary considerably between cow rumens, they appear to be phylogenetically related, which suggests that the functional requirement imposed by the rumen ecological niche selects taxa that potentially share similar genetic features.
Abstract: The bovine rumen houses a complex microbiota which is responsible for cattle's remarkable ability to convert indigestible plant mass into food products. Despite this ecosystem's enormous significance for humans, the composition and similarity of bacterial communities across different animals and the possible presence of some bacterial taxa in all animals' rumens have yet to be determined. We characterized the rumen bacterial populations of 16 individual lactating cows using tag amplicon pyrosequencing. Our data showed 51% similarity in bacterial taxa across samples when abundance and occurrence were analyzed using the Bray-Curtis metric. By adding taxon phylogeny to the analysis using a weighted UniFrac metric, the similarity increased to 82%. We also counted 32 genera that are shared by all samples, exhibiting high variability in abundance across samples. Taken together, our results suggest a core microbiome in the bovine rumen. Furthermore, although the bacterial taxa may vary considerably between cow rumens, they appear to be phylogenetically related. This suggests that the functional requirement imposed by the rumen ecological niche selects taxa that potentially share similar genetic features.

397 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...Therefore, we used three different clustering methods for OTU generation: UCLUST [25], ESPRIT-tree [26] and CD_HIT_OTU [27] (Table S1), which have been proven to generate satisfactory and comparable numbers of OTUs [24]....

    [...]


Journal ArticleDOI
TL;DR: Despite a promising outlook, the field of eukaryotic marker gene surveys faces significant challenges: how to generate data that are most useful to the community, especially in the face of evolving sequencing technologies and bioinformatics pipelines, and how to incorporate an expanding number of target genes.
Abstract: Microscopic eukaryotes are abundant, diverse and fill critical ecological roles across every ecosystem on Earth, yet there is a well-recognized gap in understanding of their global biodiversity. Fundamental advances in DNA sequencing and bioinformatics now allow accurate en masse biodiversity assessments of microscopic eukaryotes from environmental samples. Despite a promising outlook, the field of eukaryotic marker gene surveys faces significant challenges: how to generate data that are most useful to the community, especially in the face of evolving sequencing technologies and bioinformatics pipelines, and how to incorporate an expanding number of target genes.

382 citations


Journal ArticleDOI
TL;DR: This Galaxy‐supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation to highlight databases conflicts and uncertainties.
Abstract: Motivation: Metagenomics leads to major advances in microbial ecology and biologists need user friendly tools to analyze their data on their own. Results: This Galaxy-supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation. The clustering uses Swarm. The chimera removal uses VSEARCH, combined with original cross-sample validation. The taxonomic affiliation returns an innovative multi-affiliation output to highlight databases conflicts and uncertainties. Statistical results and numerous graphical illustrations are produced along the way to monitor the pipeline. FROGS was tested for the detection and quantification of OTUs on real and in silico datasets and proved to be rapid, robust and highly sensitive. It compares favorably with the widespread mothur, UPARSE and QIIME. Availability and implementation: Source code and instructions for installation: https://github.com/geraldinepascal/FROGS.git. A companion website: http://frogs.toulouse.inra.fr. Contact: geraldine.pascal@inra.fr. Supplementary information: Supplementary data are available at Bioinformatics online.

324 citations


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...Particularly, Illumina data, with dozens of samples routinely sequenced at depths over 100 000 reads, are hard to process in a reasonable time (Cai and Sun, 2011; Fu et al., 2012)....

    [...]


References

Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.
Abstract: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score. Recent mathematical results on the stochastic properties of MSP scores allow an analysis of the performance of this method as well as the statistical significance of alignments it generates. The basic algorithm is simple and robust; it can be implemented in a number of ways and applied in a variety of contexts including straight-forward DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. In addition to its flexibility and tractability to mathematical analysis, BLAST is an order of magnitude faster than existing sequence comparison tools of comparable sensitivity.

81,150 citations


Journal ArticleDOI
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

32,394 citations


Journal ArticleDOI
TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.
Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

24,116 citations


"ESPRIT-Tree: hierarchical clusterin..." refers background in this paper

  • ...In addition to microbial diversity estimation, there is currently increased interest in applying taxonomy-independent analysis to analyze millions of sequences for comparative microbial community analysis (11,12)....

    [...]

  • ...05 level 241 (7) 268 (6) 362 (11) 314 (9) peak NMI-species 402 (9) 400 (9) 590 (13) 314 (9) peak NMI-genus 190 (5) 176 (7) 216 (6) 243 (7)...

    [...]


Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,642 citations


Journal ArticleDOI
TL;DR: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression, finding in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function.
Abstract: A system of cluster analysis for genome-wide expression data from DNA microarray hybridization is described that uses standard statistical algorithms to arrange genes according to similarity in pattern of gene expression. The output is displayed graphically, conveying the clustering and the underlying expression data simultaneously in a form intuitive for biologists. We have found in the budding yeast Saccharomyces cerevisiae that clustering gene expression data groups together efficiently genes of known similar function, and we find a similar tendency in human data. Thus patterns seen in genome-wide expression experiments can be interpreted as indications of the status of cellular processes. Also, coexpression of genes of known function with poorly characterized or novel genes may provide a simple means of gaining leads to the functions of many genes for which information is not available currently.

16,000 citations

