Journal Article

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

01 Aug 2011-Nucleic Acids Research (Oxford University Press)-Vol. 39, Iss: 14
TL;DR: A new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work and exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
Abstract: Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
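
The idea of summarizing a cluster as a probabilistic sequence can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the sequences are already aligned to equal length, uses a hypothetical five-symbol alphabet (`ACGT-`), and takes the expected mismatch probability as a stand-in for the genetic distance; the function names `profile` and `expected_mismatch` are illustrative only.

```python
from collections import Counter

# Assumed alphabet: four nucleotides plus a gap symbol.
ALPHABET = "ACGT-"


def profile(aligned_seqs):
    """Column-wise symbol frequencies for pre-aligned, equal-length sequences.

    Assumes every symbol is in ALPHABET; each column is a dict of probabilities.
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        total = sum(counts.values())
        columns.append({a: counts.get(a, 0) / total for a in ALPHABET})
    return columns


def expected_mismatch(p, q):
    """Mean probability that symbols drawn independently from p and q differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of columns")
    per_column = [
        1.0 - sum(cp[a] * cq[a] for a in ALPHABET) for cp, cq in zip(p, q)
    ]
    return sum(per_column) / len(per_column)
```

With this representation, comparing two clusters costs one pass over the profile columns rather than one distance computation per pair of member sequences, which is the kind of saving the abstract alludes to.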


Citations
Book
01 Jan 2015
TL;DR: This work focuses on the exploration and exploitation of rumen microbes, an underutilized niche for industrially important enzymes in a non-ruminant gut, and the implications for ruminant health and welfare.
Abstract: Part 1 - Overview of rumen and ruminants.- 1. Rumen Microbiology: An Overview.- 2. Rumen Microbial Ecosystem of Domesticated Ruminants.- 3. Domesticated Rare Animals (Yak, Mithun and Camel): Rumen Microbial Diversity.- 4. Wild Ruminants.- 5. Structure-and-function of a non-ruminant gut: a porcine model.- Part 2 - Rumen microbial diversity.- 6. Rumen bacteria.- 7. Rumen fungi.- 8. Rumen Protozoa.- 9. Ruminal viruses (Bacteriophages, Archaeaphages).- 10. Rumen Methanogens.- Part 3 - Rumen manipulation.- 11. Plant Secondary Metabolites.- 12. Microbial feed additives.- 13. Utilization of organic acids to manipulate ruminal fermentation and improve ruminant productivity.- 14. Selective inhibition of harmful rumen microbes.- 15. Various 'Omics' approaches to understand and manipulate rumen microbial function.- Part 4 - Exploration and exploitation of rumen microbes.- 16. Rumen Metagenomics.- 17. Rumen: an underutilized niche for industrially important enzymes.- 18. Ruminal Fermentations to Produce Liquid and Gaseous Fuels.- 19. Commercial application of rumen microbial enzymes.- 20. Molecular characterization of Euryarcheal community within an anaerobic digester.- Part 5 - Intestinal disorders and rumen microbes.- 21. Acidosis in cattle.- 22. Urea/ammonia metabolism in the rumen and toxicity in ruminants.- 23. Nitrate/nitrite toxicity and possibilities of their use in ruminant diet.- Part 6 - Future prospects of rumen microbiology.- 24. The Revolution in Rumen Microbiology

112 citations

Journal Article
TL;DR: In this paper, a review of the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities, is presented.
Abstract: The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.

105 citations

Journal Article
TL;DR: This work developed a simple and accurate method of classifying large datasets of pmoA sequences, a common marker for methanotrophic bacteria, using a naïve Bayesian classifier and a lowest common ancestor (LCA) algorithm.
Abstract: The classification of high-throughput sequencing data of protein-encoding genes is not as well established as for 16S rRNA. The objective of this work was to develop a simple and accurate method of classifying large datasets of pmoA sequences, a common marker for methanotrophic bacteria. A taxonomic system for pmoA was developed based on a phylogenetic analysis of available sequences. The taxonomy incorporates the known diversity of pmoA present in public databases, including both sequences from cultivated and uncultivated organisms. Representative sequences from closely related genes, such as those encoding the bacterial ammonia monooxygenase, were also included in the pmoA taxonomy. In total, 53 low-level taxa (genus-level) are included. Using previously published datasets of high-throughput pmoA amplicon sequence data, we tested two approaches for classifying pmoA: a naive Bayesian classifier and BLAST. Classification of pmoA sequences based on BLAST analyses was performed using the lowest common ancestor (LCA) algorithm in MEGAN, a software program commonly used for the analysis of metagenomic data. Both the naive Bayesian and BLAST methods were able to classify pmoA sequences and provided similar classifications; however, the naive Bayesian classifier was prone to misclassifying contaminant sequences present in the datasets. Another advantage of the BLAST/LCA method was that it provided a user-interpretable output and enabled novelty detection at various levels, from highly divergent pmoA sequences to genus-level novelty.

98 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...The objective of this work was to develop a simple and accurate method of classifying large datasets of pmoA sequences, a common marker for methanotrophic bacteria....

    [...]

  • ...The taxonomy-independent approach includes methods to compare sequence alignments and analyze operational taxonomic units (OTUs) based on sequence dissimilarity (Schloss and Handelsman, 2004; Cai and Sun, 2011)....

    [...]

Journal Article
TL;DR: A new denoising algorithm that is more accurate and over an order of magnitude faster than AmpliconNoise is introduced, which eliminates the need for training data to establish error parameters, fully utilizes sequence-abundance information, and enables inclusion of context-dependent PCR error rates.
Abstract: PCR amplification and high-throughput sequencing theoretically enable the characterization of the finest-scale diversity in natural microbial and viral populations, but each of these methods introduces random errors that are difficult to distinguish from genuine biological diversity. Several approaches have been proposed to denoise these data but lack either speed or accuracy. We introduce a new denoising algorithm that we call DADA (Divisive Amplicon Denoising Algorithm). Without training data, DADA infers both the sample genotypes and error parameters that produced a metagenome data set. We demonstrate performance on control data sequenced on Roche’s 454 platform, and compare the results to the most accurate denoising software currently available, AmpliconNoise. DADA is more accurate and over an order of magnitude faster than AmpliconNoise. It eliminates the need for training data to establish error parameters, fully utilizes sequence-abundance information, and enables inclusion of context-dependent PCR error rates. It should be readily extensible to other sequencing platforms such as Illumina.

97 citations


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...A newer version of ESPRIT promises to be released soon that may dramatically lower this time [18]....

    [...]

Journal Article
09 Aug 2013-PLOS ONE
TL;DR: High resolution analysis of the intestinal microbiota composition in uninfected mice from the two facilities by deep sequencing of partial 16S rRNA amplicons found significant differences in microbiota composition, highlighting the importance of characterizing the intestinal microbiome when studying murine models of IBD.
Abstract: The mouse pathobiont Helicobacter hepaticus can induce typhlocolitis in interleukin-10-deficient mice, and H. hepaticus infection of immunodeficient mice is widely used as a model to study the role of pathogens and commensal bacteria in the pathogenesis of inflammatory bowel disease. C57BL/6J Il10−/− mice kept under specific pathogen-free conditions in two different facilities (MHH and MIT), displayed strong differences with respect to their susceptibilities to H. hepaticus-induced intestinal pathology. Mice at MIT developed robust typhlocolitis after infection with H. hepaticus, while mice at MHH developed no significant pathology after infection with the same H. hepaticus strain. We hypothesized that the intestinal microbiota might be responsible for these differences and therefore performed high resolution analysis of the intestinal microbiota composition in uninfected mice from the two facilities by deep sequencing of partial 16S rRNA amplicons. The microbiota composition differed markedly between mice from both facilities. Significant differences were also detected between two groups of MHH mice born in different years. Of the 119 operational taxonomic units (OTUs) that occurred in at least half the cecum or colon samples of at least one mouse group, 24 were only found in MIT mice, and another 13 OTUs could only be found in MHH samples. While most of the MHH-specific OTUs could only be identified to class or family level, the MIT-specific set contained OTUs identified to genus or species level, including the opportunistic pathogen, Bilophila wadsworthia. The susceptibility to H. hepaticus-induced colitis differed considerably between Il10−/− mice originating from the two institutions. This was associated with significant differences in microbiota composition, highlighting the importance of characterizing the intestinal microbiome when studying murine models of IBD.

96 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...OTUs were calculated using ESPRIT-Tree [70]....

    [...]

  • ...Following Cai and Sun [70], the cutoff OTU level to best represent species level was determined using the normalized mutual information (NMI) criterion [71] on a subset of sequences that could be identified to species....

    [...]

  • ...Supporting Information Figure S1 Normalized mutual information (NMI) value between species and OTUs obtained for 30 difference levels using ESPRIT-Tree....

    [...]
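
The normalized mutual information (NMI) criterion mentioned in the excerpts above compares two labelings of the same sequences, for example known species labels against OTU assignments at a given distance level. The sketch below is a generic illustration rather than the cited papers' code: it assumes the arithmetic-mean normalization, 2·I(U;V) / (H(U) + H(V)), which is one of several conventions in use, and the function names are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a labeling."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information between two labelings of the same items."""
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(
        (c / n) * log(c * n / (cu[a] * cv[b])) for (a, b), c in cuv.items()
    )


def nmi(u, v):
    """NMI with arithmetic-mean normalization (an assumed convention)."""
    if len(u) != len(v):
        raise ValueError("labelings must cover the same items")
    denom = entropy(u) + entropy(v)
    # Two single-cluster labelings carry no information; treat them as identical.
    return 1.0 if denom == 0 else 2 * mutual_information(u, v) / denom
```

Under these assumptions, one would compute `nmi(species_labels, otu_labels)` for each candidate distance level and keep the level with the highest score.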

References
Journal Article
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal Article
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

37,524 citations

Journal Article
TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.
Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

28,911 citations


"ESPRIT-Tree: hierarchical clusterin..." refers background in this paper

  • ...In addition to microbial diversity estimation, there is currently increased interest in applying taxonomy-independent analysis to analyze millions of sequences for comparative microbial community analysis (11,12)....

    [...]

  • ...05 level 241 (7) 268 (6) 362 (11) 314 (9) peak NMI-species 402 (9) 400 (9) 590 (13) 314 (9) peak NMI-genus 190 (5) 176 (7) 216 (6) 243 (7)...

    [...]

Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,651 citations

Journal Article
TL;DR: In this paper, a procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical.
Abstract: A procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure permits their reduction to n − 1 mutually exclusive sets by considering the union of all possible n(n − 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained. A general flowchart helpful in computer programming and a numerical example are included.
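
The grouping procedure in this abstract can be sketched as a naive loop that scans all pairs at every step. The sketch makes several assumptions not stated in the abstract: items are one-dimensional numbers, the objective is the increase in within-group sum of squares (the criterion commonly associated with this method), and the best union is the one with minimal loss, which is equivalent to maximizing the negated loss. The names `sum_sq`, `merge_cost` and `agglomerate` are illustrative.

```python
from itertools import combinations


def sum_sq(group):
    """Within-group sum of squared deviations from the group mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)


def merge_cost(a, b):
    """Increase in within-group sum of squares if groups a and b are merged."""
    return sum_sq(a + b) - sum_sq(a) - sum_sq(b)


def agglomerate(points, cost=merge_cost):
    """Merge groups pairwise until one remains; return (group, group, loss) steps."""
    groups = [[p] for p in points]
    history = []
    while len(groups) > 1:
        # Examine all n(n - 1)/2 candidate unions and keep the cheapest one.
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: cost(groups[ij[0]], groups[ij[1]]),
        )
        loss = cost(groups[i], groups[j])
        history.append((groups[i], groups[j], loss))
        merged = groups[i] + groups[j]
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return history
```

Scanning every pair at every step is what makes this approach impractical for large inputs, and it is the cost that the quasilinear method described at the top of this page is designed to avoid.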

17,405 citations
