Journal ArticleDOI

FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix

TL;DR: FastTree is a method for constructing large phylogenies and estimating their reliability; instead of storing a distance matrix, it stores sequence profiles of internal nodes in the tree, uses those profiles to implement Neighbor-Joining, and uses heuristics to quickly identify candidate joins.
Abstract: Gene families are growing rapidly, but standard methods for inferring phylogenies do not scale to alignments with over 10,000 sequences. We present FastTree, a method for constructing large phylogenies and for estimating their reliability. Instead of storing a distance matrix, FastTree stores sequence profiles of internal nodes in the tree. FastTree uses these profiles to implement Neighbor-Joining and uses heuristics to quickly identify candidate joins. FastTree then uses nearest neighbor interchanges to reduce the length of the tree. For an alignment with N sequences, L sites, and a different characters, a distance matrix requires O(N²) space and O(N²L) time, but FastTree requires just O(NLa + N√N) memory and O(N√N log(N)La) time. To estimate the tree's reliability, FastTree uses local bootstrapping, which gives another 100-fold speedup over a distance matrix. For example, FastTree computed a tree and support values for 158,022 distinct 16S ribosomal RNAs in 17 h and 2.4 GB of memory. Just computing pairwise Jukes–Cantor distances and storing them, without inferring a tree or bootstrapping, would require 17 h and 50 GB of memory. In simulations, FastTree was slightly more accurate than Neighbor-Joining, BIONJ, or FastME; on genuine alignments, FastTree's topologies had higher likelihoods. FastTree is available at http://microbesonline.org/fasttree.
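
To make the profile idea concrete, here is a minimal illustrative sketch in Python/NumPy. It is not FastTree's actual code: the function names and the nucleotide alphabet are assumptions, and FastTree itself uses weighted joins and corrected distances. The sketch shows the two operations the abstract describes: computing an expected fraction of differing characters directly from two profiles, and replacing two joined nodes with a single profile.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed alphabet for this sketch

def seq_to_profile(seq):
    """One-hot frequency profile for a leaf sequence: shape (L, a)."""
    idx = {c: i for i, c in enumerate(ALPHABET)}
    prof = np.zeros((len(seq), len(ALPHABET)))
    for s, ch in enumerate(seq.upper()):
        if ch in idx:                     # gaps/ambiguities stay all-zero
            prof[s, idx[ch]] = 1.0
    return prof

def profile_distance(p, q):
    """Expected fraction of differing characters between two profiles:
    the mean over sites of 1 - sum_c p[s, c] * q[s, c]."""
    return float(np.mean(1.0 - np.sum(p * q, axis=1)))

def join_profiles(p, q):
    """Profile of the new internal node as the average of its children.
    (FastTree weights this average; it is unweighted here for brevity.)"""
    return 0.5 * (p + q)

# Each join replaces two profiles with one, so only O(N) profiles are
# alive at any time instead of an O(N^2) distance matrix.
ab = join_profiles(seq_to_profile("ACGTACGT"), seq_to_profile("ACGAACGT"))
print(profile_distance(ab, seq_to_profile("ACGTTCGT")))
```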


Citations
Journal ArticleDOI
10 Mar 2010-PLOS ONE
TL;DR: Improvements to FastTree are described that increase its accuracy without sacrificing scalability; FastTree 2 allows the inference of maximum-likelihood phylogenies for huge alignments.
Abstract: Background We recently described FastTree, a tool for inferring phylogenies for alignments with up to hundreds of thousands of sequences. Here, we describe improvements to FastTree that improve its accuracy without sacrificing scalability.

10,010 citations


Cites methods or results from "FastTree: Computing Large Minimum E..."

  • ...The simulated protein alignments and the genuine COG alignments were described previously [2]....

  • ...0 is more accurate than most other minimum-evolution methods, but not as accurate as maximum-likelihood methods [2]....

  • ...We tested FastTree on simulated protein alignments with 250 to 5,000 sequences [2]....

  • ...Nevertheless, FastTree with NNIs and FastME with NNIs give very similar results [2], and computing the exact change in total tree length does not improve the accuracy of FastTree’s SPRs (data not shown)....

  • ...For example, on simulated protein alignments with just 10 sequences (from [2]), adding the CAT model improves FastTree’s accuracy from 76....

Journal ArticleDOI
TL;DR: This work sequences a diverse array of 25 environmental samples and three known “mock communities” at a depth averaging 3.1 million reads per sample, demonstrating excellent consistency in taxonomic recovery and recapturing diversity patterns that were previously reported on the basis of meta-analysis of many studies from the literature.
Abstract: The ongoing revolution in high-throughput sequencing continues to democratize the ability of small groups of investigators to map the microbial component of the biosphere. In particular, the coevolution of new sequencing platforms and new software tools allows data acquisition and analysis on an unprecedented scale. Here we report the next stage in this coevolutionary arms race, using the Illumina GAIIx platform to sequence a diverse array of 25 environmental samples and three known “mock communities” at a depth averaging 3.1 million reads per sample. We demonstrate excellent consistency in taxonomic recovery and recapture diversity patterns that were previously reported on the basis of meta-analysis of many studies from the literature (notably, the saline/nonsaline split in environmental samples and the split between host-associated and free-living communities). We also demonstrate that 2,000 Illumina single-end reads are sufficient to recapture the same relationships among samples that we observe with the full dataset. The results thus open up the possibility of conducting large-scale studies analyzing thousands of samples simultaneously to survey microbial communities at an unprecedented spatial and temporal resolution.

6,767 citations


Cites methods from "FastTree: Computing Large Minimum E..."

  • ...reference collection using fasttree (23) was used for the calculation of phylogeny-based α and β diversity metrics....

Journal ArticleDOI
TL;DR: An objective measure of genome quality is proposed that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities; the underlying method, CheckM, is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches.
Abstract: Large-scale recovery of genomes from isolates, single cells, and metagenomic data has been made possible by advances in computational methods and substantial reductions in sequencing costs. Although this increasing breadth of draft genomes is providing key information regarding the evolutionary and functional diversity of microbial life, it has become impractical to finish all available reference genomes. Making robust biological inferences from draft genomes requires accurate estimates of their completeness and contamination. Current methods for assessing genome quality are ad hoc and generally make use of a limited number of “marker” genes conserved across all bacterial or archaeal genomes. Here we introduce CheckM, an automated method for assessing the quality of a genome using a broader set of marker genes specific to the position of a genome within a reference genome tree and information about the collocation of these genes. We demonstrate the effectiveness of CheckM using synthetic data and a wide range of isolate-, single-cell-, and metagenome-derived genomes. CheckM is shown to provide accurate estimates of genome completeness and contamination and to outperform existing approaches. Using CheckM, we identify a diverse range of errors currently impacting publicly available isolate genomes and demonstrate that genomes obtained from single cells and metagenomic data vary substantially in quality. In order to facilitate the use of draft genomes, we propose an objective measure of genome quality that can be used to select genomes suitable for specific gene- and genome-centric analyses of microbial communities.
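
To give a feel for the marker-gene idea, here is a deliberately simplified sketch. CheckM's actual model is lineage-specific and uses the collocation of marker sets, so the function and the marker names below are hypothetical illustrations only:

```python
def completeness_contamination(expected_markers, found_counts):
    """Toy marker-gene quality estimate (illustrative only; not
    CheckM's model). Completeness: fraction of expected single-copy
    markers found at least once. Contamination: extra copies of
    those markers, as a fraction of the expected set."""
    n = len(expected_markers)
    found = sum(1 for m in expected_markers if found_counts.get(m, 0) >= 1)
    extra = sum(max(found_counts.get(m, 0) - 1, 0) for m in expected_markers)
    return found / n, extra / n

# A draft genome missing one of four markers and carrying a duplicated
# one: 75% complete, 25% contamination by this toy measure.
print(completeness_contamination(
    ["rpoB", "gyrA", "recA", "dnaK"],
    {"rpoB": 1, "gyrA": 2, "recA": 1}))
```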

5,788 citations


Cites background from "FastTree: Computing Large Minimum E..."

  • ...1.3 (Price et al. 2009) under the WAG (Whelan and Goldman 2001) and GAMMA (Yang 1994) models....

Journal ArticleDOI
TL;DR: It is found that in direct contrast to the highly differentiated communities of their mothers, neonates harbored bacterial communities that were undifferentiated across multiple body habitats, regardless of delivery mode.
Abstract: Upon delivery, the neonate is exposed for the first time to a wide array of microbes from a variety of sources, including maternal bacteria. Although prior studies have suggested that delivery mode shapes the microbiota's establishment and, subsequently, its role in child health, most researchers have focused on specific bacterial taxa or on a single body habitat, the gut. Thus, the initiation stage of human microbiome development remains obscure. The goal of the present study was to obtain a community-wide perspective on the influence of delivery mode and body habitat on the neonate's first microbiota. We used multiplexed 16S rRNA gene pyrosequencing to characterize bacterial communities from mothers and their newborn babies, four born vaginally and six born via Cesarean section. Mothers' skin, oral mucosa, and vagina were sampled 1 h before delivery, and neonates' skin, oral mucosa, and nasopharyngeal aspirate were sampled <5 min, and meconium <24 h, after delivery. We found that in direct contrast to the highly differentiated communities of their mothers, neonates harbored bacterial communities that were undifferentiated across multiple body habitats, regardless of delivery mode. Our results also show that vaginally delivered infants acquired bacterial communities resembling their own mother's vaginal microbiota, dominated by Lactobacillus, Prevotella, or Sneathia spp., and C-section infants harbored bacterial communities similar to those found on the skin surface, dominated by Staphylococcus, Corynebacterium, and Propionibacterium spp. These findings establish an important baseline for studies tracking the human microbiome's successional development in different body habitats following different delivery modes, and their associated effects on infant health.

3,640 citations


Cites methods from "FastTree: Computing Large Minimum E..."

  • ...Taxonomy was assigned using the Ribosomal Database Project (RDP) classifier with a minimum support threshold of 60% (42) and the RDP taxonomic nomenclature....

Journal ArticleDOI
TL;DR: Soils collected across a long-term liming experiment were used to investigate the direct influence of pH on the abundance and composition of the two major soil microbial taxa, fungi and bacteria; both the relative abundance and diversity of bacteria were positively related to pH.
Abstract: Soils collected across a long-term liming experiment (pH 4.0-8.3), in which variation in factors other than pH have been minimized, were used to investigate the direct influence of pH on the abundance and composition of the two major soil microbial taxa, fungi and bacteria. We hypothesized that bacterial communities would be more strongly influenced by pH than fungal communities. To determine the relative abundance of bacteria and fungi, we used quantitative PCR (qPCR), and to analyze the composition and diversity of the bacterial and fungal communities, we used a bar-coded pyrosequencing technique. Both the relative abundance and diversity of bacteria were positively related to pH, the latter nearly doubling between pH 4 and 8. In contrast, the relative abundance of fungi was unaffected by pH and fungal diversity was only weakly related with pH. The composition of the bacterial communities was closely defined by soil pH; there was as much variability in bacterial community composition across the 180-m distance of this liming experiment as across soils collected from a wide range of biomes in North and South America, emphasizing the dominance of pH in structuring bacterial communities. The apparent direct influence of pH on bacterial community composition is probably due to the narrow pH ranges for optimal growth of bacteria. Fungal community composition was less strongly affected by pH, which is consistent with pure culture studies, demonstrating that fungi generally exhibit wider pH ranges for optimal growth.

2,966 citations


Cites methods from "FastTree: Computing Large Minimum E..."

  • ...Phylogenetic trees were then built from all representative sequences using the FastTree algorithm (Price et al., 2009)....

References
Journal ArticleDOI
TL;DR: The neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods for reconstructing phylogenetic trees from evolutionary distance data.
Abstract: A new method called the neighbor-joining method is proposed for reconstructing phylogenetic trees from evolutionary distance data. The principle of this method is to find pairs of operational taxonomic units (OTUs [= neighbors]) that minimize the total branch length at each stage of clustering of OTUs starting with a starlike tree. The branch lengths as well as the topology of a parsimonious tree can quickly be obtained by using this method. Using computer simulation, we studied the efficiency of this method in obtaining the correct unrooted tree in comparison with that of five other tree-making methods: the unweighted pair group method of analysis, Farris's method, Sattath and Tversky's method, Li's method, and Tateno et al.'s modified Farris method. The new, neighbor-joining method and Sattath and Tversky's method are shown to be generally better than the other methods.
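
To make "minimize the total branch length at each stage" concrete, the pair-selection step can be sketched in a few lines using the Studier and Keppler (1988) formulation of the criterion. This is an illustrative sketch, not code from either paper:

```python
import numpy as np

def nj_pick_pair(D):
    """Pick the next pair to join under the neighbor-joining criterion
    Q(i, j) = (n - 2) * D[i, j] - r_i - r_j, with r_i = sum_k D[i, k].
    Minimizing Q over all pairs minimizes total branch length at this
    stage of clustering. D: symmetric (n, n) matrix, zero diagonal."""
    n = D.shape[0]
    r = D.sum(axis=1)
    Q = (n - 2) * D - r[:, None] - r[None, :]
    np.fill_diagonal(Q, np.inf)   # a node cannot be joined with itself
    i, j = divmod(int(np.argmin(Q)), n)
    return min(i, j), max(i, j)

# Four taxa where (0, 1) and (2, 3) are the true neighbor pairs:
D = np.array([[0., 2., 7., 7.],
              [2., 0., 7., 7.],
              [7., 7., 0., 2.],
              [7., 7., 2., 0.]])
print(nj_pick_pair(D))   # (0, 1); (2, 3) ties it as the other true pair
```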

57,055 citations


"FastTree: Computing Large Minimum E..." refers methods in this paper

  • ...Given an alignment, Neighbor-Joining and related minimum evolution methods are the fastest and most scalable approaches for inferring phylogenies (Saitou and Nei, 1987; Studier and Keppler, 1988; Desper and Gascuel, 2002)....

  • ...FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix...

Journal ArticleDOI
TL;DR: The recently developed statistical method known as the “bootstrap” can be used to place confidence intervals on phylogenies; when all characters are perfectly compatible, it shows significant evidence for a group if that group is defined by three or more characters.
Abstract: The recently-developed statistical method known as the "bootstrap" can be used to place confidence intervals on phylogenies. It involves resampling points from one's own data, with replacement, to create a series of bootstrap samples of the same size as the original data. Each of these is analyzed, and the variation among the resulting estimates taken to indicate the size of the error involved in making estimates from the original data. In the case of phylogenies, it is argued that the proper method of resampling is to keep all of the original species while sampling characters with replacement, under the assumption that the characters have been independently drawn by the systematist and have evolved independently. Majority-rule consensus trees can be used to construct a phylogeny showing all of the inferred monophyletic groups that occurred in a majority of the bootstrap samples. If a group shows up 95% of the time or more, the evidence for it is taken to be statistically significant. Existing computer programs can be used to analyze different bootstrap samples by using weights on the characters, the weight of a character being how many times it was drawn in bootstrap sampling. When all characters are perfectly compatible, as envisioned by Hennig, bootstrap sampling becomes unnecessary; the bootstrap method would show significant evidence for a group if it is defined by three or more characters.
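
As a concrete sketch of this resampling scheme (with an illustrative function name, not from any particular package): keep every taxon and draw columns with replacement.

```python
import numpy as np

def bootstrap_columns(alignment, n_reps=100, seed=0):
    """Yield bootstrap replicates of an alignment, following the
    scheme argued for here: keep every taxon (row) and resample
    characters (columns) with replacement, each replicate having
    as many columns as the original."""
    rng = np.random.default_rng(seed)
    n_cols = alignment.shape[1]
    for _ in range(n_reps):
        cols = rng.integers(0, n_cols, size=n_cols)
        yield alignment[:, cols]

# Typical use: infer a tree from each replicate, then report, for each
# split of the original tree, the fraction of replicates containing it
# (95% or more is conventionally taken as statistically significant).
```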

40,349 citations


"FastTree: Computing Large Minimum E..." refers methods in this paper

  • ...is to use the bootstrap: to resample the columns of the alignment, to rerun the method 100–1,000 times, to compare the resulting trees to each other or to the tree inferred from the full alignment, and to count the number of times that each split occurs in the resulting trees (Felsenstein 1985)....

  • ...FastTree: Computing Large Minimum Evolution Trees with Profiles instead of a Distance Matrix...

Journal ArticleDOI
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.
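
To illustrate the kmer-counting idea behind fast distance estimation, here is a toy alignment-free distance. It is a hedged sketch, not MUSCLE's exact formula or code:

```python
from collections import Counter

def kmer_distance(seq_a, seq_b, k=3):
    """Alignment-free distance in the spirit of MUSCLE's fast kmer
    distance estimation (illustrative only; not MUSCLE's exact
    formula): one minus the fraction of shared k-mers, counted
    with multiplicity and normalized by the shorter sequence."""
    count_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    count_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    shared = sum(min(count_a[w], count_b[w]) for w in count_a)
    denom = min(sum(count_a.values()), sum(count_b.values()))
    return 1.0 - shared / denom if denom else 1.0

print(kmer_distance("MKVLITGAGSG", "MKVLLTGAGSG"))  # small distance
```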

37,524 citations


"FastTree: Computing Large Minimum E..." refers background in this paper

  • ...A faster UPGMA variant of FastTree is available at http://www.microbesonline.org/fasttree and might be useful for this purpose, both because of its speed and because UPGMA guide trees may lead to better alignments (Edgar, 2004)....

Journal ArticleDOI
TL;DR: A nonparametric approach to the analysis of areas under correlated ROC curves is presented, by using the theory on generalized U-statistics to generate an estimated covariance matrix.
Abstract: Methods of evaluating and comparing the performance of diagnostic tests are of increasing importance as new tests are developed and marketed. When a test is based on an observed variable that lies on a continuous or graded scale, an assessment of the overall value of the test can be made through the use of a receiver operating characteristic (ROC) curve. The curve is constructed by varying the cutpoint used to determine which values of the observed variable will be considered abnormal and then plotting the resulting sensitivities against the corresponding false positive rates. When two or more empirical curves are constructed based on tests performed on the same individuals, statistical analysis on differences between curves must take into account the correlated nature of the data. This paper presents a nonparametric approach to the analysis of areas under correlated ROC curves, by using the theory on generalized U-statistics to generate an estimated covariance matrix.
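
The estimate at the heart of the method is the area under the empirical ROC curve, which equals the Mann-Whitney U-statistic. A minimal sketch of that estimate follows; DeLong's covariance machinery for comparing correlated curves on the same individuals is omitted:

```python
def auc_u_statistic(pos_scores, neg_scores):
    """Nonparametric area under the ROC curve as a generalized
    U-statistic: the probability that a random positive case scores
    above a random negative case, counting ties as 1/2. (The full
    method also derives the covariance matrix of several such AUCs
    computed on the same individuals; not shown here.)"""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

print(auc_u_statistic([0.9, 0.8, 0.7], [0.6, 0.8]))  # 4.5 / 6 = 0.75
```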

16,496 citations


"FastTree: Computing Large Minimum E..." refers methods in this paper

  • ...To quantify how effective the measures were in distinguishing correct splits, we used the area under the receiver operating characteristic curve (AUC, DeLong and Clarke-Pearson (1988))....

Journal ArticleDOI
TL;DR: This work has used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches.
Abstract: The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/. (Algorithm; computer simulations; maximum likelihood; phylogeny; rbcL; RDPII project.)
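
The "nearest neighbor interchanges" used during tree rearrangement are easy to state concretely. A toy sketch follows, with trees as nested tuples; the representation and the function name are illustrative assumptions, not PHYML's internals:

```python
def nni_variants(left, right):
    """Given an internal edge whose two endpoints carry subtree pairs
    (a, b) and (c, d), the two nearest-neighbor-interchange
    rearrangements each swap one subtree across the edge."""
    (a, b), (c, d) = left, right
    return [((a, c), (b, d)), ((a, d), (b, c))]

# Around the central edge of ((A, B), (C, D)), the two NNI alternatives
# regroup the leaves as ((A, C), (B, D)) and ((A, D), (B, C)).
print(nni_variants(("A", "B"), ("C", "D")))
```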

16,261 citations


"FastTree: Computing Large Minimum E..." refers methods in this paper

  • ...To quantify the quality of each topology, we used PhyML to optimize the branch lengths and compute the log likelihood....

  • ...To quantify the quality of each topology, we used PhyML with the Hasegawa–Kishino–Yano 85 model, which accounts for the higher rate of transitions over transversions, and four categories of gamma-distributed rates....

  • ...(Despite the high usage of virtual memory by PhyML, both PhyML and RAxML ran at over 99% CPU utilization.)...

  • ...We ran PhyML with the Jones, Taylor, and Thornton (JTT) model of amino acid substitution and four categories of gamma-distributed rates....

  • ...Even for COG alignments of just 1,250 proteins, PhyML 3 typically took over a week....
