Journal Article

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time.

01 Aug 2011-Nucleic Acids Research (Oxford University Press)-Vol. 39, Iss: 14
TL;DR: A new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work and exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
Abstract: Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.
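To illustrate the idea of representing a cluster as a probabilistic sequence, here is a minimal Python sketch; it is not the authors' implementation. It assumes the sequences in each cluster are already aligned to equal length (the paper instead defines operations for aligning probabilistic sequences). The function names `profile` and `expected_mismatch` and the five-symbol alphabet are hypothetical choices made for this example.

```python
from collections import Counter

# Assumed alphabet: four nucleotides plus a gap symbol.
ALPHABET = "ACGT-"

def profile(aligned_seqs):
    """Summarize a cluster as per-column symbol frequencies."""
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    n = len(aligned_seqs)
    columns = []
    for i in range(length):
        counts = Counter(s[i] for s in aligned_seqs)
        # Symbols outside ALPHABET (e.g. 'N') are ignored in this sketch.
        columns.append({a: counts.get(a, 0) / n for a in ALPHABET})
    return columns

def expected_mismatch(p, q):
    """Mean probability that symbols drawn from the two profiles differ."""
    if len(p) != len(q):
        raise ValueError("profiles must have the same number of columns")
    total = 0.0
    for col_p, col_q in zip(p, q):
        match = sum(col_p[a] * col_q[a] for a in ALPHABET)
        total += 1.0 - match
    return total / len(p)

cluster_a = profile(["ACGT", "ACGT"])
cluster_b = profile(["ACGA", "ACGT"])
print(expected_mismatch(cluster_a, cluster_b))  # 0.125
```

In this sketch, comparing two profiles costs time proportional to the alignment length, regardless of how many sequences each cluster holds, which is the motivation for avoiding exhaustive pairwise distances between clusters.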


Citations
Dissertation
23 Mar 2017
TL;DR: This thesis proposes ways to improve targeted metagenomics studies through the development of innovative tools and methods, providing a better understanding of the analysis biases inherent in such studies and better experimental design.
Abstract: Targeted metagenomics, the study of the composition and diversity of the microbial communities present in different biological samples on the basis of a genomic marker, has grown rapidly over the last decade thanks to the arrival of high-throughput sequencing. Drawing on tools from molecular biology and bioinformatics, it has driven substantial progress in the fields of microbial evolution and diversity. However, high-throughput sequencing has brought new problems: the exponential generation of data raises issues for bioinformatic analysis, which must be adapted to the experimental designs and the associated biological questions. This thesis proposes ways to improve targeted metagenomics studies through the development of innovative tools and methods, providing a better understanding of the analysis biases inherent in such studies and better experimental design. First, an expert assessment was carried out of the analysis pipeline used in production on the PEGASE-biosciences platform. This assessment revealed the need for a formal method of evaluating targeted metagenomics data analysis pipelines, which was developed on the basis of simulated and real data and of suitable evaluation metrics. The method was applied to several analysis pipelines commonly used by the community, as well as to new analysis approaches never before used in such a context. This evaluation led to a better understanding of the experimental design biases that can affect the results and the associated biological conclusions. One of the major biases is the choice of amplification primers for the target; primer design software adapted to the experimental design was developed specifically to minimize this bias. Finally, recommendations for setting up experimental designs and analyses were issued in order to improve the robustness of targeted metagenomics studies.

2 citations


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...To reduce analysis times nonetheless, ESPRIT was replaced by its successor ESPRIT-Tree [Cai & Sun, 2011], which lightens computation time while promising analysis results of similar quality....

    [...]

  • ...The authors of ESPRIT developed an approach intermediate between hierarchical clustering and centroid-based clustering, namely ESPRIT-Tree [Cai & Sun 2011]....

    [...]

Proceedings Article
01 Dec 2013
TL;DR: Empirical studies on synthetic and real environmental microbial community datasets show that the proposed model makes better predictions on test data than existing methods such as Lasso, Elastic Net, the dirty model and rMTFL (robust multi-task feature learning).
Abstract: Feature selection is important for many biological studies, especially when the number of available samples is limited (on the order of hundreds) while the number of input features is large (on the order of millions), as in eQTL (expression quantitative trait loci) mapping, GWAS (genome-wide association studies) and environmental microbial community studies. We study the problem of multiple-output regression, which leverages the underlying common relationship shared by multiple output features, and propose an efficient and accurate approach to feature selection. Our approach considers both intra- and intergroup sparsity. Intergroup sparsity assumes that only a small set of input features is related to the output features. Intragroup sparsity assumes that each input feature may relate to multiple output features, which should have different kinds of sparsity. Most existing methods do not model intragroup sparsity well, either assuming uniform regularization on each group (i.e. each input feature relates to a similar number of output features) or requiring prior knowledge of the relationship between input and output features. By modelling the regression coefficients as a mixture of Laplacian and Gaussian distributions, we can adaptively shrink group regression coefficients and learn the intergroup sparsity, intragroup sparsity and shrinkage estimation patterns. Empirical studies on synthetic and real environmental microbial community datasets show that our model makes better predictions on test data than existing methods such as Lasso, Elastic Net, the dirty model and rMTFL (robust multi-task feature learning). Moreover, by using least angle regression, or coordinate descent and projected gradient descent, for optimization, we can obtain the optimal regression efficiently.

2 citations


Cites methods from "ESPRIT-Tree: hierarchical clusterin..."

  • ...1) Preprocessing of ICoMM Data: We employ ESPRIT-Tree [8] to cluster sequences into OTUs at various distance levels....

    [...]

Posted Content
26 Mar 2016-bioRxiv
TL;DR: The Matthews correlation coefficient is applied to assess the ability of 15 reference-independent and reference-dependent clustering algorithms to assign sequences to OTUs; the most consistently robust method was the average neighbor algorithm, although for some datasets other algorithms matched its performance.
Abstract: Assigning 16S rRNA gene sequences to operational taxonomic units (OTUs) allows microbial ecologists to overcome the inconsistencies and biases within bacterial taxonomy and provides a strategy for clustering similar sequences that do not have representatives in a reference database. I have applied the Matthews correlation coefficient to assess the ability of 15 reference-independent and reference-dependent clustering algorithms to assign sequences to OTUs. This metric quantifies the ability of an algorithm to reflect the relationships between sequences without the use of a reference and can be applied to any dataset or method. The most consistently robust method was the average neighbor algorithm; however, for some datasets other algorithms matched its performance.
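The abstract does not spell out how the coefficient is computed, so the following Python sketch rests on an assumption: each pair of sequences is labelled as a true positive (within the distance threshold and in the same OTU), false positive (same OTU but farther apart than the threshold), false negative (within the threshold but in different OTUs) or true negative (neither). The function name `mcc_for_otus`, the input layout and the toy distances are hypothetical.

```python
from itertools import combinations
from math import sqrt

def mcc_for_otus(dist, otu_of, threshold):
    """Matthews correlation coefficient over all sequence pairs.

    dist maps a sorted (id_a, id_b) tuple to a pairwise distance;
    otu_of maps each sequence id to its assigned OTU label.
    """
    tp = tn = fp = fn = 0
    for a, b in combinations(sorted(otu_of), 2):
        close = dist[(a, b)] <= threshold
        same = otu_of[a] == otu_of[b]
        if close and same:
            tp += 1
        elif same:
            fp += 1
        elif close:
            fn += 1
        else:
            tn += 1
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention in this sketch, an undefined MCC is reported as 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Toy data: s1/s2 are close, s3/s4 are close, every other pair is far.
ids = ["s1", "s2", "s3", "s4"]
dist = {pair: 0.2 for pair in combinations(ids, 2)}
dist[("s1", "s2")] = 0.01
dist[("s3", "s4")] = 0.02

good = {"s1": 0, "s2": 0, "s3": 1, "s4": 1}
print(mcc_for_otus(dist, good, threshold=0.03))  # 1.0
```

Because the score depends only on pairwise distances and OTU labels, a sketch like this can be applied to the output of any clustering method without a reference database.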

1 citation


Cites background from "ESPRIT-Tree: hierarchical clusterin..."

  • ...In a second approach, developers have compared the time and memory required to cluster sequences in a dataset (6, 13, 17, 18)....

    [...]

References
Journal Article
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal Article
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
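The following Python sketch gives a rough picture of distance estimation by kmer counting. It uses a generic shared-kmer fraction and is an assumption for illustration, not MUSCLE's exact formula; the names `kmer_counts` and `kmer_distance` and the default `k=3` are hypothetical.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping substring of length k."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(x, y, k=3):
    """One minus the fraction of kmers the two sequences share."""
    denom = min(len(x), len(y)) - k + 1
    if denom <= 0:
        raise ValueError("both sequences must be at least k long")
    # Counter intersection keeps the minimum count of each common kmer.
    shared = sum((kmer_counts(x, k) & kmer_counts(y, k)).values())
    return 1.0 - shared / denom

print(kmer_distance("ACGTACGT", "ACGTACGT"))  # 0.0
print(kmer_distance("AAAA", "CCCC"))          # 1.0
```

No alignment is computed here, which is why counting-based estimates of this kind are fast enough to use as a first pass over many sequences.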

37,524 citations

Journal Article
TL;DR: An overview of the analysis pipeline and links to raw data and processed output from the runs with and without denoising are provided.
Abstract: Supplementary Figure 1 Overview of the analysis pipeline. Supplementary Table 1 Details of conventionally raised and conventionalized mouse samples. Supplementary Discussion Expanded discussion of QIIME analyses presented in the main text; Sequencing of 16S rRNA gene amplicons; QIIME analysis notes; Expanded Figure 1 legend; Links to raw data and processed output from the runs with and without denoising.

28,911 citations


"ESPRIT-Tree: hierarchical clusterin..." refers background in this paper

  • ...In addition to microbial diversity estimation, there is currently increased interest in applying taxonomy-independent analysis to analyze millions of sequences for comparative microbial community analysis (11,12)....

    [...]

  • ...05 level: 241 (7), 268 (6), 362 (11), 314 (9); peak NMI-species: 402 (9), 400 (9), 590 (13), 314 (9); peak NMI-genus: 190 (5), 176 (7), 216 (6), 243 (7)...

    [...]

Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition, this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition, Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity, and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition, this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further, the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm, a design technique, an application area, or a related topic. The chapters are not dependent on one another, so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally, the new edition offers a 25% increase over the first edition in the number of problems, giving the book 155 problems and over 900 exercises that reinforce the concepts the students are learning.

21,651 citations

Journal Article
TL;DR: In this paper, a procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical.
Abstract: A procedure for forming hierarchical groups of mutually exclusive subsets, each of which has members that are maximally similar with respect to specified characteristics, is suggested for use in large-scale (n > 100) studies when a precise optimal solution for a specified number of groups is not practical. Given n sets, this procedure permits their reduction to n − 1 mutually exclusive sets by considering the union of all possible n(n − 1)/2 pairs and selecting a union having a maximal value for the functional relation, or objective function, that reflects the criterion chosen by the investigator. By repeating this process until only one group remains, the complete hierarchical structure and a quantitative estimate of the loss associated with each stage in the grouping can be obtained. A general flowchart helpful in computer programming and a numerical example are included.
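The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the objective function is left to the investigator in the abstract, so the sketch assumes one-dimensional values and picks the union with the smallest increase in within-group sum of squared errors (equivalently, the maximal value of its negative). The names `sse` and `hierarchical_grouping` are hypothetical.

```python
from itertools import combinations

def sse(group):
    """Within-group sum of squared deviations from the group mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)

def hierarchical_grouping(values):
    """Merge groups pairwise until one remains; return (merged group, loss) per stage."""
    groups = [[v] for v in values]
    history = []
    while len(groups) > 1:
        best = None
        # Consider the union of every one of the n(n - 1)/2 pairs of groups.
        for i, j in combinations(range(len(groups)), 2):
            loss = sse(groups[i] + groups[j]) - sse(groups[i]) - sse(groups[j])
            if best is None or loss < best[0]:
                best = (loss, i, j)
        loss, i, j = best
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
        history.append((merged, loss))
    return history

for merged, loss in hierarchical_grouping([1.0, 1.1, 5.0, 5.2]):
    print(sorted(merged), round(loss, 4))
```

Each stage scans every pair of remaining groups, so the cost grows at least quadratically with the number of items; the returned history records the loss associated with each stage of the grouping.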

17,405 citations
