ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

doi:10.1093/NAR/GKR349

Yunpeng Cai, +1 more

01 Aug 2011, Nucleic Acids Research, Vol. 39, Iss. 14

TL;DR: A new online learning-based algorithm simultaneously addresses the space and computational issues of prior work. It exhibits quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving accuracy similar to the standard hierarchical clustering algorithm.

Abstract:

Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
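The idea of summarizing a cluster as a probabilistic sequence can be illustrated with a minimal sketch. The abstract does not specify the paper's alignment operations or its genetic distance, so everything below is an assumption made for illustration: the sequences are taken to be already aligned to equal length, the names `profile` and `expected_mismatch` are hypothetical, and the distance used is simply the mean per-column probability that two symbols drawn from the two clusters differ.

```python
from collections import Counter

# Assumed alphabet: four nucleotides plus a gap symbol.
# Symbols outside this alphabet (e.g. "N") are ignored by this sketch.
ALPHABET = "ACGT-"


def profile(aligned_seqs):
    """Column-wise symbol frequencies for equal-length, pre-aligned sequences."""
    if not aligned_seqs:
        raise ValueError("need at least one sequence")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    n = len(aligned_seqs)
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        columns.append({b: counts.get(b, 0) / n for b in ALPHABET})
    return columns


def expected_mismatch(p, q):
    """Mean probability, over columns, that symbols drawn from p and q differ.

    This is an illustrative stand-in, not the distance defined in the paper.
    """
    if len(p) != len(q):
        raise ValueError("profiles must have equal length")
    total = 0.0
    for col_p, col_q in zip(p, q):
        total += 1.0 - sum(col_p[b] * col_q[b] for b in ALPHABET)
    return total / len(p)


# Two toy clusters: one with two members, one with a single member.
cluster_a = profile(["ACGT", "ACGA"])
cluster_b = profile(["ACGT"])
d = expected_mismatch(cluster_a, cluster_b)
```

The point of such a summary is that comparing two clusters costs one pass over the columns rather than one comparison per pair of member sequences. Under this particular definition a mixed cluster has a non-zero distance to itself, which is one reason a real implementation would likely use a different measure.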

Citations

A general species delimitation method with applications to phylogenetic placements

Jiajie Zhang, +3 more

- 15 Nov 2013 -

Bioinformatics

TL;DR: The Poisson tree processes (PTP) model is introduced to infer putative species boundaries on a given phylogenetic input tree and yields more accurate results than de novo species delimitation methods.

FROGS: Find, Rapidly, OTUs with Galaxy Solution.

Frédéric Escudié, +9 more

- 15 Apr 2018 -

Bioinformatics

TL;DR: This Galaxy-supported pipeline, called FROGS, is designed to analyze large sets of amplicon sequences and produce abundance tables of Operational Taxonomic Units (OTUs) and their taxonomic affiliation to highlight database conflicts and uncertainties.

Minimum entropy decomposition: unsupervised oligotyping for sensitive partitioning of high-throughput marker gene sequences.

A. Murat Eren, +5 more

- 17 Mar 2015 -

The ISME Journal

TL;DR: Minimum Entropy Decomposition (MED) provides a computationally efficient means to partition marker gene datasets into ‘MED nodes’, which represent homogeneous operational taxonomic units and enables sensitive discrimination of closely related organisms in marker gene amplicon datasets without relying on extensive computational heuristics and user supervision.

Composition and Similarity of Bovine Rumen Microbiota across Individual Animals

Elie Jami, +1 more

- 14 Mar 2012 -

PLOS ONE

TL;DR: Although the bacterial taxa may vary considerably between cow rumens, they appear to be phylogenetically related, which suggests that the functional requirement imposed by the rumen ecological niche selects taxa that potentially share similar genetic features.

Updating the 97% identity threshold for 16S ribosomal RNA OTUs.

Robert C. Edgar

- 15 Jul 2018 -

Bioinformatics

TL;DR: Using a large set of high-quality 16S rRNA sequences from finished genomes, the correspondence of OTUs to species is assessed for five representative clustering algorithms using four accuracy metrics; all algorithms had comparable accuracy when tuned to a given metric.

References

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Weizhong Li, +1 more

- 01 Jul 2006 -

Bioinformatics

TL;DR: Cd-hit-2d compares two protein datasets and reports similar matches between them; cd-hit-est clusters a DNA/RNA sequence database, and cd-hit-est-2d compares two nucleotide datasets.

Diversity of the human intestinal microbial flora.

Paul B. Eckburg, +10 more

- 10 Jun 2005 -

Science

TL;DR: A majority of the bacterial sequences corresponded to uncultivated species and novel microorganisms, and significant intersubject variability and differences between stool and mucosa community composition were discovered.

A core gut microbiome in obese and lean twins

Peter J. Turnbaugh, +14 more

- 22 Jan 2009 -

Nature

TL;DR: The faecal microbial communities of adult female monozygotic and dizygotic twin pairs concordant for leanness or obesity, and of their mothers, are characterized to address how host genotype, environmental exposure and host adiposity influence the gut microbiome.

A greedy algorithm for aligning DNA sequences.

Zheng Zhang, +3 more

- 01 Feb 2000 -

Journal of Computational Biology

TL;DR: A new greedy alignment algorithm is introduced with particularly good performance and it is shown that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data.

The Ribosomal Database Project: improved alignments and new tools for rRNA analysis

James R. Cole, +10 more

- 01 Jan 2009 -

Nucleic Acids Research

TL;DR: An improved alignment strategy uses the Infernal secondary structure aware aligner to provide a more consistent higher quality alignment and faster processing of user sequences, and a new Pyrosequencing Pipeline that provides tools to support analysis of ultra high-throughput rRNA sequencing data.

Related Papers (5)

Search and clustering orders of magnitude faster than BLAST

Robert C. Edgar

- 01 Oct 2010 -

Bioinformatics

UCHIME improves sensitivity and speed of chimera detection

Robert C. Edgar, +4 more

- 01 Aug 2011 -

Bioinformatics

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Weizhong Li, +1 more

- 01 Jul 2006 -

Bioinformatics

Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities

QIIME allows analysis of high-throughput community sequencing data.