ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

doi:10.1093/NAR/GKR349

Yunpeng Cai, +1 more

- 01 Aug 2011 -

Nucleic Acids Research

- Vol. 39, Iss: 14

TLDR

The paper proposes a new online learning-based algorithm that addresses both the space and computational issues of prior work. It exhibits quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving accuracy similar to the standard hierarchical clustering algorithm.

Abstract:

Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
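The abstract's idea of summarizing each cluster as a probabilistic sequence can be illustrated with a minimal sketch. This is not the paper's method: it assumes the sequences are already aligned to equal length (the paper aligns the probabilistic sequences themselves), and the function names `profile_from_sequences`, `merge_profiles`, and `profile_distance`, as well as the expected-mismatch distance, are hypothetical choices made only for illustration.

```python
from collections import Counter

# Assumed alphabet: four nucleotides plus a gap symbol.
ALPHABET = "ACGT-"


def profile_from_sequences(seqs):
    """Summarize pre-aligned, equal-length sequences as per-column frequencies."""
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    if any(ch not in ALPHABET for s in seqs for ch in s):
        raise ValueError("unexpected symbol in sequence")
    profile = []
    for col in range(length):
        counts = Counter(s[col] for s in seqs)
        profile.append({ch: counts.get(ch, 0) / len(seqs) for ch in ALPHABET})
    return profile


def merge_profiles(p, n_p, q, n_q):
    """Combine two cluster profiles, weighting each by its cluster size."""
    total = n_p + n_q
    return [
        {ch: (a[ch] * n_p + b[ch] * n_q) / total for ch in ALPHABET}
        for a, b in zip(p, q)
    ]


def profile_distance(p, q):
    """Average per-column probability that a random member of one cluster
    differs from a random member of the other (an illustrative distance)."""
    mismatch = 0.0
    for a, b in zip(p, q):
        match = sum(a[ch] * b[ch] for ch in ALPHABET)
        mismatch += 1.0 - match
    return mismatch / len(p)


if __name__ == "__main__":
    cluster_a = profile_from_sequences(["ACGT", "ACGA"])
    cluster_b = profile_from_sequences(["ACGT"])
    print(profile_distance(cluster_a, cluster_b))
    merged = merge_profiles(cluster_a, 2, cluster_b, 1)
    print(merged[3])
```

The point of the sketch is that one profile comparison replaces many pairwise sequence comparisons between two clusters, and that merging two clusters only needs their profiles and sizes, not their member sequences.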

Citations

Proceedings Article

Parallel Hierarchical Clustering in Linearithmic Time for Large-Scale Sequence Analysis

Qi Mao, +5 more

TL;DR: A new hierarchical clustering method that achieves good clustering performance and high scalability on large sequence datasets is proposed and can recover the true hierarchy with a high probability under some mild conditions and has a linearithmic time complexity with respect to the number of input sequences.

Journal Article

DMclust, a Density‐based Modularity Method for Accurate OTU Picking of 16S rRNA Sequences

Ze-Gang Wei, +2 more

- 06 Jun 2017 -

Molecular Informatics

TL;DR: A novel density‐based modularity clustering method, called DMclust, is proposed in this paper to bin 16S rRNA sequences into OTUs with high clustering accuracy and acceptable memory usage.

Journal Article

Skin Microbiome Differences in Atopic Dermatitis and Healthy Controls in Egyptian Children and Adults, and Association with Serum Immunoglobulin E.

Mohammed A. Ramadan, +5 more

- 17 May 2019 -

Omics A Journal of Integrative Biology

TL;DR: The first microbiome study of, and new insights on, the relationship between skin microbiota variation and AD susceptibility in a population sample from Egypt are reported; the findings attest to the promise of microbiome science and metagenomic analysis in AD specifically, and clinical dermatology broadly.

Journal Article

FunFrame: Functional Gene Ecological Analysis Pipeline

David Weisman, +2 more

- 01 May 2013 -

Bioinformatics

TL;DR: FunFrame is described, an R-based data-analysis pipeline that uses recently described algorithms to de-noise functional gene pyrosequences and performs ecological analysis on the de-noised sequence data; it reduced spurious diversity while retaining more sequences than a commonly used de-noising method that discards sequences with frameshift errors.

Journal Article

Marine Oxygen-Deficient Zones Harbor Depauperate Denitrifying Communities Compared to Novel Genetic Diversity in Coastal Sediments

Jennifer L. Bowen, +5 more

- 28 Feb 2015 -

Microbial Ecology

TL;DR: Examination of the community structure of bacteria containing the nirS gene from estuarine and salt marsh sediments and from the water column of two of the world’s largest marine oxygen-deficient zones indicates that ODZs are remarkably depauperate in nirS genes compared to the remarkable genetic richness found in coastal sediments.

References

Journal Article

Basic Local Alignment Search Tool

Stephen F. Altschul, +4 more

- 01 Oct 1990 -

Journal of Molecular Biology

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

Journal Article

MUSCLE: multiple sequence alignment with high accuracy and high throughput

Robert C. Edgar

- 01 Mar 2004 -

Nucleic Acids Research

TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.

Journal Article

QIIME allows analysis of high-throughput community sequencing data.

J. Gregory Caporaso, +27 more

- 11 Apr 2010 -

Nature Methods

TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.

Book

Introduction to Algorithms

Thomas H. Cormen, +2 more

TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.

Journal Article

Hierarchical Grouping to Optimize an Objective Function

Joe H. Ward

- 01 Mar 1963 -

Journal of the American Statistical Asso...

TL;DR: In this paper, a procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical.

Related Papers (5)

Search and clustering orders of magnitude faster than BLAST

Robert C. Edgar

- 01 Oct 2010 -

Bioinformatics

UCHIME improves sensitivity and speed of chimera detection

Robert C. Edgar, +4 more

- 01 Aug 2011 -

Bioinformatics

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences

Weizhong Li, +1 more

- 01 Jul 2006 -

Bioinformatics

Introducing mothur: Open-Source, Platform-Independent, Community-Supported Software for Describing and Comparing Microbial Communities

QIIME allows analysis of high-throughput community sequencing data.