Journal ArticleDOI

Inferring Species Trees Directly from Biallelic Genetic Markers: Bypassing Gene Trees in a Full Coalescent Analysis

TL;DR: The authors describe a polynomial-time algorithm that computes the likelihood of a species tree directly from the markers under a finite-sites model of mutation, effectively integrating over all possible gene trees.
Abstract: The multispecies coalescent provides an elegant theoretical framework for estimating species trees and species demographics from genetic markers. However, practical applications of the multispecies coalescent model are limited by the need to integrate or sample over all gene trees possible for each genetic marker. Here we describe a polynomial-time algorithm that computes the likelihood of a species tree directly from the markers under a finite-sites model of mutation effectively integrating over all possible gene trees. The method applies to independent (unlinked) biallelic markers such as well-spaced single nucleotide polymorphisms, and we have implemented it in SNAPP, a Markov chain Monte Carlo sampler for inferring species trees, divergence dates, and population sizes. We report results from simulation experiments and from an analysis of 1997 amplified fragment length polymorphism loci in 69 individuals sampled from six species of Ourisia (New Zealand native foxglove).


Citations
Journal ArticleDOI
TL;DR: BEAST 2 now has a fully developed package management system that allows third party developers to write additional functionality that can be directly installed to the BEAST 2 analysis platform via a package manager without requiring a new software release of the platform.
Abstract: We present a new open source, extensible and flexible software platform for Bayesian evolutionary analysis called BEAST 2. This software platform is a re-design of the popular BEAST 1 platform to correct structural deficiencies that became evident as the BEAST 1 software evolved. Key among those deficiencies was the lack of post-deployment extensibility. BEAST 2 now has a fully developed package management system that allows third party developers to write additional functionality that can be directly installed to the BEAST 2 analysis platform via a package manager without requiring a new software release of the platform. This package architecture is showcased with a number of recently published new models encompassing birth-death-sampling tree priors, phylodynamics and model averaging for substitution models and site partitioning. A second major improvement is the ability to read/write the entire state of the MCMC chain to/from disk allowing it to be easily shared between multiple instances of the BEAST software. This facilitates checkpointing and better support for multi-processor and high-end computing extensions. Finally, the functionality in new packages can be easily added to the user interface (BEAUti 2) by a simple XML template-based mechanism because BEAST 2 has been re-designed to provide greater integration between the analysis engine and the user interface so that, for example BEAST and BEAUti use exactly the same XML file format.

5,183 citations


Cites methods from "Inferring Species Trees Directly fr..."

  • ...The SNAPP package implements a multi-species coalescent for SNP and AFLP data [33]....

    [...]

Journal ArticleDOI
TL;DR: A series of major new developments in the BEAST 2 core platform and model hierarchy that have occurred since the first release of the software, culminating in the recent 2.5 release are described.
Abstract: Elaboration of Bayesian phylogenetic inference methods has continued at pace in recent years with major new advances in nearly all aspects of the joint modelling of evolutionary data. It is increasingly appreciated that some evolutionary questions can only be adequately answered by combining evidence from multiple independent sources of data, including genome sequences, sampling dates, phenotypic data, radiocarbon dates, fossil occurrences, and biogeographic range information among others. Including all relevant data into a single joint model is very challenging both conceptually and computationally. Advanced computational software packages that allow robust development of compatible (sub-)models which can be composed into a full model hierarchy have played a key role in these developments. Developing such software frameworks is increasingly a major scientific activity in its own right, and comes with specific challenges, from practical software design, development and engineering challenges to statistical and conceptual modelling challenges. BEAST 2 is one such computational software platform, and was first announced over 4 years ago. Here we describe a series of major new developments in the BEAST 2 core platform and model hierarchy that have occurred since the first release of the software, culminating in the recent 2.5 release.

2,045 citations

Journal ArticleDOI
TL;DR: A method to infer relationships among quartets of taxa under the coalescent model using techniques from algebraic statistics is developed, and uncertainty in the estimated relationships is quantified using the nonparametric bootstrap.
Abstract: Motivation: Increasing attention has been devoted to estimation of species-level phylogenetic relationships under the coalescent model. However, existing methods either use summary statistics (gene trees) to carry out estimation, ignoring an important source of variability in the estimates, or involve computationally intensive Bayesian Markov chain Monte Carlo algorithms that do not scale well to whole-genome datasets. Results: We develop a method to infer relationships among quartets of taxa under the coalescent model using techniques from algebraic statistics. Uncertainty in the estimated relationships is quantified using the nonparametric bootstrap. The performance of our method is assessed with simulated data. We then describe how our method could be used for species tree inference in larger taxon samples, and demonstrate its utility using datasets for Sistrurus rattlesnakes and for soybeans. Availability and implementation: The method to infer the phylogenetic relationship among quartets is implemented in the software SVDquartets, available at www.stat.osu.edu/∼lkubatko/software/SVDquartets. Contact: lkubatko@stat.osu.edu Supplementary information: Supplementary data are available at Bioinformatics online.

908 citations


Cites methods from "Inferring Species Trees Directly fr..."

  • ...The three most common methods in this group, BEST (Liu and Pearl, 2007), *BEAST (Heled and Drummond, 2010), and SNAPP (Bryant et al., 2012), all seek to estimate the posterior distribution for the species tree using Markov chain Monte Carlo (MCMC), but differ in some details of the implementation....

    [...]

  • ...SNAPP infers the species tree using the coalescent model and is designed for biallelic data consisting of unlinked SNPs (Bryant et al., 2012)....

    [...]

  • ...We also carried out computations in SNAPP (Bryant et al., 2012), which is suitable for the soybean dataset as it consists of SNP (rather than multi-locus) data, to compare the run times....

    [...]

  • ...Much recent effort has been devoted to the development of methods to estimate species-level phylogenies from multi-locus data under the coalescent model (Bryant et al., 2012; Heled and Drummond, 2010; Kubatko et al., 2009; Liu and Pearl, 2007; Liu et al., 2009b; Than and Nakhleh, 2009)....

    [...]


Journal ArticleDOI
TL;DR: In this article, the authors outline some of the major obstacles specific to the application of NGS to phylogeography and phylogenetics, including the focus on non-model organisms, the necessity of obtaining orthologous loci in a cost-effective manner, and the predominant use of gene trees in these fields.

586 citations

Journal ArticleDOI
TL;DR: This article proposes a fast algorithm to compute quartet-based support for each branch of a given species tree with regard to a given set of gene trees and evaluates the precision and recall of the local PP on a wide set of simulated and biological datasets.
Abstract: Species tree reconstruction is complicated by effects of incomplete lineage sorting, commonly modeled by the multi-species coalescent model (MSC). While there has been substantial progress in developing methods that estimate a species tree given a collection of gene trees, less attention has been paid to fast and accurate methods of quantifying support. In this article, we propose a fast algorithm to compute quartet-based support for each branch of a given species tree with regard to a given set of gene trees. We then show how the quartet support can be used in the context of the MSC to compute (1) the local posterior probability (PP) that the branch is in the species tree and (2) the length of the branch in coalescent units. We evaluate the precision and recall of the local PP on a wide set of simulated and biological datasets, and show that it has very high precision and improved recall compared with multi-locus bootstrapping. The estimated branch lengths are highly accurate when gene tree estimation error is low, but are underestimated when gene tree estimation error increases. Computation of both the branch length and local PP is implemented as new features in ASTRAL.

578 citations


Cites methods from "Inferring Species Trees Directly fr..."

  • ...In the MSC model, the quartet topology found in the true species tree has the highest probability of appearing in gene trees (Allman et al. 2011), and the two alternative topologies have identical probabilities....

    [...]

  • ...On real data, we need to instead estimate gene trees from sequence data, and further, it is not always clear that our sample is unbiased, nor that gene trees are generated by the MSC. Importantly, we further assume that all four clusters around the branch we are scoring are correct....

    [...]

  • ...The most scalable family of MSC-based methods are based on a two-step process where gene trees are first estimated independently for each gene and are then combined to build the species tree using a summary method....

    [...]

  • ...On the other hand, considering only the MSC and ignoring issues such as long branch attraction, long branches can be easily reconstructed confidently even with few genes....

    [...]

  • ...We now conclude: Theorem 1. Given (1) a set of $n$ gene trees generated by the MSC on a model species tree generated by the Yule process with rate $\lambda$ and (2) an internal branch represented by a quadripartition $Q$ where the four clusters around $Q$ are each present in the species tree, let $z = (z_1, z_2, z_3)$ be the average quartet frequencies around $Q$ (where $z_1$ corresponds to the topology of $Q$); the local PP that the species tree has the topology given by $Q$ is: $P(Q \mid Z = z) = \dfrac{h(z_1)}{h(z_1) + 2^{z_2 - z_1}\, h(z_2) + 2^{z_3 - z_1}\, h(z_3)}$ (6) for $h(x) = B(x+1,\, n-x+2\lambda)\,\bigl(1 - I_{1/3}(x+1,\, n-x+2\lambda)\bigr)$....

    [...]
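
The local posterior probability in the Theorem 1 excerpt above can be evaluated numerically. The sketch below is a rough illustration, not the ASTRAL implementation. It reads the formula as h(z1) / (h(z1) + 2^(z2 - z1) h(z2) + 2^(z3 - z1) h(z3)), where h(x) is the integral of t^x (1 - t)^(n - x + 2*rate - 1) over [1/3, 1], which equals B(a, b)(1 - I_{1/3}(a, b)) for a = x + 1 and b = n - x + 2*rate. That reading, the function names, the Simpson integration, and the default Yule rate of 0.5 are all assumptions made for illustration.

```python
def upper_beta_integral(a, b, lower=1.0 / 3.0, steps=2000):
    """Simpson's rule for the integral of t^(a-1) * (1-t)^(b-1) over [lower, 1].

    This equals B(a, b) * (1 - I_lower(a, b)). The sketch assumes a >= 1 and
    b >= 1 so that the integrand stays finite at both endpoints.
    """
    if a < 1 or b < 1:
        raise ValueError("this sketch only handles a >= 1 and b >= 1")
    width = (1.0 - lower) / steps  # steps must be even for Simpson's rule

    def integrand(t):
        return t ** (a - 1) * (1.0 - t) ** (b - 1)

    total = integrand(lower) + integrand(1.0)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * integrand(lower + i * width)
    return total * width / 3.0


def local_posterior(z1, z2, z3, n, yule_rate=0.5):
    """Local posterior probability of the topology whose quartet frequency is z1.

    n is the number of gene trees; yule_rate is the rate of the Yule prior
    (hypothetical default; it must be at least 0.5 for this sketch).
    """
    def h(x):
        return upper_beta_integral(x + 1, n - x + 2 * yule_rate)

    h1, h2, h3 = h(z1), h(z2), h(z3)
    return h1 / (h1 + 2 ** (z2 - z1) * h2 + 2 ** (z3 - z1) * h3)


# Equal support for all three topologies versus unanimous support for the first.
balanced = local_posterior(4, 4, 4, n=12)
unanimous = local_posterior(10, 0, 0, n=10)
print(balanced, unanimous)
```

When the three quartet frequencies are equal, the three terms in the denominator coincide and the result is one third, which is a quick sanity check on any implementation. For large n the plain integrand underflows, so a production version would work with log-scale incomplete beta functions instead.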

References
Book
01 Feb 1987
TL;DR: Recent developments of statistical methods in molecular phylogenetics are reviewed and it is shown that the mathematical foundations of these methods are not well established, but computer simulations and empirical data indicate that currently used methods produce reasonably good phylogenetic trees when a sufficiently large number of nucleotides or amino acids are used.
Abstract: Recent developments of statistical methods in molecular phylogenetics are reviewed. It is shown that the mathematical foundations of these methods are not well established, but computer simulations and empirical data indicate that currently used methods such as neighbor joining, minimum evolution, likelihood, and parsimony methods produce reasonably good phylogenetic trees when a sufficiently large number of nucleotides or amino acids are used. However, when the rate of evolution varies extensively from branch to branch, many methods may fail to recover the true topology. Solid statistical tests for examining the accuracy of trees obtained by neighbor joining, minimum evolution, and least-squares methods are available, but the methods for likelihood and parsimony trees are yet to be refined. Parsimony, likelihood, and distance methods can all be used for inferring amino acid sequences of the proteins of ancestral organisms that have become extinct.

15,840 citations


"Inferring Species Trees Directly fr..." refers background or methods in this paper


  • ...This ability represents a qualitative difference between SNAPP and the methods of Nielsen et al. (1998) and RoyChoudhury et al. (2008). A more difficult and complex problem, and one beyond the scope of this paper, would be to properly characterize the situations in which the θ values can be reliably inferred....

    [...]

  • ...Early contributions to the development of multispecies models built on the branches of a species tree were made by Hudson (1983), Tajima (1983), Takahata and Nei (1985), Nei (1987), Pamilo and Nei (1988), and Takahata (1989)....

    [...]

Journal ArticleDOI
TL;DR: The focus is on applied inference for Bayesian posterior distributions in real problems, which often tend toward normality after transformations and marginalization, and the results are derived as normal-theory approximations to exact Bayesian inference, conditional on the observed simulations.
Abstract: The Gibbs sampler, the algorithm of Metropolis and similar iterative simulation methods are potentially very helpful for summarizing multivariate distributions. Used naively, however, iterative simulation can give misleading answers. Our methods are simple and generally applicable to the output of any iterative simulation; they are designed for researchers primarily interested in the science underlying the data and models they are analyzing, rather than for researchers interested in the probability theory underlying the iterative simulations themselves. Our recommended strategy is to use several independent sequences, with starting points sampled from an overdispersed distribution. At each step of the iterative simulation, we obtain, for each univariate estimand of interest, a distributional estimate and an estimate of how much sharper the distributional estimate might become if the simulations were continued indefinitely. Because our focus is on applied inference for Bayesian posterior distributions in real problems, which often tend toward normality after transformations and marginalization, we derive our results as normal-theory approximations to exact Bayesian inference, conditional on the observed simulations. The methods are illustrated on a random-effects mixture model applied to experimental measurements of reaction times of normal and schizophrenic patients.

13,884 citations
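
The multiple-sequence strategy in the abstract above is usually summarized by a potential scale reduction factor that compares between-chain and within-chain variance. The following is a minimal sketch of the commonly used simplified form, which omits the degrees-of-freedom correction; the function name and the toy chains are assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, variance


def potential_scale_reduction(chains):
    """Simplified potential scale reduction factor for one scalar estimand.

    `chains` is a list of equal-length sequences of draws, one per
    independent simulation. Values near 1 suggest the chains agree;
    values well above 1 suggest the simulation should continue.
    """
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length >= 2")

    chain_means = [mean(chain) for chain in chains]
    between = n * variance(chain_means)                  # B
    within = mean([variance(chain) for chain in chains])  # W

    # Pooled variance estimate: a weighted mix of W and B.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Two chains exploring the same region versus two stuck in different regions.
agreeing = [[0.0, 1.0, 2.0, 3.0], [0.5, 1.5, 2.5, 3.5]]
separated = [[0.0, 1.0, 2.0, 3.0], [10.0, 11.0, 12.0, 13.0]]
print(potential_scale_reduction(agreeing))
print(potential_scale_reduction(separated))
```

Starting the chains from overdispersed points matters here: if all chains start close together, the between-chain variance can look small even when none of them has explored the full posterior.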

Journal ArticleDOI
TL;DR: A computationally feasible method for finding such maximum likelihood estimates is developed, and a computer program is available that allows the testing of hypotheses about the constancy of evolutionary rates by likelihood ratio tests.
Abstract: The application of maximum likelihood techniques to the estimation of evolutionary trees from nucleic acid sequence data is discussed. A computationally feasible method for finding such maximum likelihood estimates is developed, and a computer program is available. This method has advantages over the traditional parsimony algorithms, which can give misleading results if rates of evolution differ in different lineages. It also allows the testing of hypotheses about the constancy of evolutionary rates by likelihood ratio tests, and gives rough indication of the error of the estimate of the tree.

13,111 citations


"Inferring Species Trees Directly fr..." refers background or methods in this paper



  • ...See Felsenstein (2004), Degnan and Rosenberg (2009), and Heled and Drummond (2010) for general introductions to the multispecies coalescent. Early contributions to the development of multispecies models built on the branches of a species tree were made by Hudson (1983), Tajima (1983), Takahata and Nei (1985), Nei (1987), Pamilo and Nei (1988), and Takahata (1989). The multispecies coalescent determines a distribution for gene trees and their branch lengths, conditional on a species tree....

    [...]

  • ...The algorithm works in a similar manner to Felsenstein’s pruning algorithm (Felsenstein 1981) for computing the likelihood of a gene tree: we define partial likelihoods that focus only on a specific subtree; the partial likelihoods are then computed starting at the leaves (of the species tree), working upward to the root....

    [...]
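
The pruning recursion mentioned in the excerpts above can be sketched for a single biallelic site on a fixed gene tree. This is Felsenstein's classic algorithm applied to a gene tree, not the species-tree algorithm implemented in SNAPP. The symmetric two-state mutation model, the rate `mu`, the nested-list tree encoding, and all function names are assumptions made for illustration.

```python
import math


def transition_prob(same_state, branch_length, mu=1.0):
    """Symmetric two-state model: probability of ending in the same or the other state."""
    decay = math.exp(-2.0 * mu * branch_length)
    return 0.5 + 0.5 * decay if same_state else 0.5 - 0.5 * decay


def partial_likelihood(node, tip_states, mu=1.0):
    """Return [L(0), L(1)]: probability of the tip data below `node` given its state.

    A leaf is a string name; an internal node is a list of
    (child, branch_length) pairs.
    """
    if isinstance(node, str):
        observed = tip_states[node]
        return [1.0 if state == observed else 0.0 for state in (0, 1)]

    result = [1.0, 1.0]
    for child, branch_length in node:
        child_partial = partial_likelihood(child, tip_states, mu)
        for parent_state in (0, 1):
            # Sum over the child's possible states, weighted by the
            # transition probability along the connecting branch.
            result[parent_state] *= sum(
                transition_prob(parent_state == child_state, branch_length, mu)
                * child_partial[child_state]
                for child_state in (0, 1)
            )
    return result


def tree_likelihood(root, tip_states, mu=1.0):
    """Combine the root partial likelihoods with a uniform root distribution."""
    partial = partial_likelihood(root, tip_states, mu)
    return 0.5 * partial[0] + 0.5 * partial[1]


# Gene tree ((A:0.1, B:0.1):0.2, C:0.3) with one biallelic site.
tree = [([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.3)]
print(tree_likelihood(tree, {"A": 0, "B": 0, "C": 1}))
```

Each node is visited once, working from the leaves toward the root, so the cost grows linearly with the number of nodes rather than with the number of possible ancestral state assignments.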

Journal ArticleDOI
TL;DR: BEAST is a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree that provides models for DNA and protein sequence evolution, highly parametric coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical alignment and a wide range of options for prior distributions.
Abstract: The evolutionary analysis of molecular sequence variation is a statistical enterprise. This is reflected in the increased use of probabilistic models for phylogenetic inference, multiple sequence alignment, and molecular population genetics. Here we present BEAST: a fast, flexible software architecture for Bayesian analysis of molecular sequences related by an evolutionary tree. A large number of popular stochastic models of sequence evolution are provided and tree-based models suitable for both within- and between-species sequence data are implemented. BEAST version 1.4.6 consists of 81000 lines of Java source code, 779 classes and 81 packages. It provides models for DNA and protein sequence evolution, highly parametric coalescent analysis, relaxed clock phylogenetics, non-contemporaneous sequence data, statistical alignment and a wide range of options for prior distributions. BEAST source code is object-oriented, modular in design and freely available at http://beast-mcmc.googlecode.com/ under the GNU LGPL license. BEAST is a powerful and flexible evolutionary analysis package for molecular sequence variation. It also provides a resource for the further development of new models and statistical methods of evolutionary analysis.

11,916 citations


"Inferring Species Trees Directly fr..." refers methods in this paper

  • ...Following Drummond and Rambaut (2007), we assume a pure birth (Yule) model for the species tree topology and species divergence times, with a hyperparameter λ equal to the birth rate of the species tree. This hyperparameter is either fixed or allowed to vary with an improper uniform hyperprior. 3. Following Rannala and Yang (2003), we use independent gamma prior distributions for the population size parameters θ....

    [...]

  • ...This is the approach taken by BATWING (Wilson et al. 2003), BEST (Liu and Pearl 2007), and STAR-BEAST (Heled and Drummond 2010), among others....

    [...]

  • ...SNAPP, which interfaces with the BEAST package (Drummond and Rambaut 2007), takes a range of biallelic data types as input and returns a sample of species trees with (relative) divergence times and population sizes....

    [...]

  • ...The SNAPP sampler differs from methods such as BEST (Liu and Pearl 2007) and STAR-BEAST (Heled and Drummond 2010), which sample gene trees explicitly....

    [...]

  • ...The MCMC proposal functions implemented in SNAPP are standard and are a subset of those available in BEAST (Drummond and Rambaut 2007) when sampling from molecular clock trees....

    [...]