Journal Article

Delimiting Species Using Single-Locus Data and the Generalized Mixed Yule Coalescent Approach: A Revised Method and Evaluation on Simulated Data Sets

01 Sep 2013-Systematic Biology (Oxford University Press)-Vol. 62, Iss: 5, pp 707-724
TL;DR: The findings support the robustness of GMYC as a tool for delimiting species when only single-locus information is available. The single-threshold version outperforms the multiple-threshold version on simulated data, which the authors argue might represent a fundamental limit due to the nature of evidence used to delimit species in this approach.
Abstract: DNA barcoding-type studies assemble single-locus data from large samples of individuals and species, and have provided new kinds of data for evolutionary surveys of diversity. An important goal of many such studies is to delimit evolutionarily significant species units, especially in biodiversity surveys from environmental DNA samples. The Generalized Mixed Yule Coalescent (GMYC) method is a likelihood method for delimiting species by fitting within- and between-species branching models to reconstructed gene trees. Although the method has been widely used, it has not previously been described in detail or evaluated fully against simulations of alternative scenarios of true patterns of population variation and divergence between species. Here, we present important reformulations to the GMYC method as originally specified, and demonstrate its robustness to a range of departures from its simplifying assumptions. The main factor affecting the accuracy of delimitation is the mean population size of species relative to divergence times between them. Other departures from the model assumptions, such as varying population sizes among species, alternative scenarios for speciation and extinction, and population growth or subdivision within species, have relatively smaller effects. Our simulations demonstrate that support measures derived from the likelihood function provide a robust indication of when the model performs well and when it leads to inaccurate delimitations. Finally, the so-called single-threshold version of the method outperforms the multiple-threshold version of the method on simulated data: we argue that this might represent a fundamental limit due to the nature of evidence used to delimit species in this approach. Together with other studies comparing its performance relative to other methods, our findings support the robustness of GMYC as a tool for delimiting species when only single-locus information is available.
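The abstract mentions support measures derived from the likelihood function. One common way to turn the likelihoods of alternative models into relative support values is Akaike weights, from the information-theoretic approach to multi-model inference. The sketch below illustrates that general idea only; that it matches the paper's exact support measure is an assumption, and the function name and input values are hypothetical.

```python
import math


def akaike_weights(log_likelihoods, n_params):
    """Return Akaike weights for a set of candidate models.

    log_likelihoods: maximized log-likelihood of each candidate model
    n_params: number of free parameters of each candidate model
    """
    # AIC = 2k - 2 ln L for each model
    aic = [2 * k - 2 * ll for ll, k in zip(log_likelihoods, n_params)]
    best = min(aic)
    # Relative likelihood of each model given the data: exp(-delta_AIC / 2)
    rel = [math.exp(-(a - best) / 2) for a in aic]
    total = sum(rel)
    # Normalize so the weights sum to 1
    return [r / total for r in rel]


if __name__ == "__main__":
    # Hypothetical log-likelihoods for three alternative delimitations
    # (illustrative numbers, not results from the paper).
    weights = akaike_weights([-120.0, -112.5, -111.9], [2, 4, 4])
    for i, w in enumerate(weights):
        print(f"model {i}: weight = {w:.3f}")
```

A weight close to 1 for one model would indicate that the alternatives are poorly supported relative to it. Weights spread evenly across models would signal an uncertain delimitation.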

Citations
Journal Article
TL;DR: The Poisson tree processes (PTP) model is introduced to infer putative species boundaries on a given phylogenetic input tree and yields more accurate results than de novo species delimitation methods.
Abstract: Motivation: Sequence-based methods to delimit species are central to DNA taxonomy, microbial community surveys and DNA metabarcoding studies. Current approaches either rely on simple sequence similarity thresholds (OTU-picking) or on complex and compute-intensive evolutionary models. The OTU-picking methods scale well on large datasets, but the results are highly sensitive to the similarity threshold. Coalescent-based species delimitation approaches often rely on Bayesian statistics and Markov Chain Monte Carlo sampling, and can therefore only be applied to small datasets. Results: We introduce the Poisson tree processes (PTP) model to infer putative species boundaries on a given phylogenetic input tree. We also integrate PTP with our evolutionary placement algorithm (EPA-PTP) to count the number of species in phylogenetic placements. We compare our approaches with popular OTU-picking methods and the General Mixed Yule Coalescent (GMYC) model. For de novo species delimitation, the stand-alone PTP model generally outperforms GMYC as well as OTU-picking methods when evolutionary distances between species are small. PTP neither requires an ultrametric input tree nor a sequence similarity threshold as input. In the open reference species delimitation approach, EPA-PTP yields more accurate results than de novo species delimitation methods. Finally, EPA-PTP scales on large datasets because it relies on the parallel implementations of the EPA and RAxML, thereby allowing species to be delimited in high-throughput sequencing data. Availability and implementation: The code is freely available at www.

1,868 citations


Cites background or methods from "Delimiting Species Using Single-Loc..."

  • ...The General Mixed Yule Coalescent (GMYC) model (Fujisawa and Barraclough, 2013; Pons et al., 2006) for delimiting species on single genes is frequently used in empirical studies (Carstens and Dewey, 2010; Fontaneto et al., 2007; Monaghan et al., 2009; Powell, 2012; Vuataz et al., 2011)....

    [...]

  • ...The total number of possible delimitations on a rooted binary tree with m tips ranges between m (caterpillar tree) and 1.502^m, depending on the actual tree shape (Fujisawa and Barraclough, 2013)....

    [...]
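The range quoted in the excerpt above (m delimitations for a caterpillar tree, growing exponentially for bushier shapes) can be reproduced with a short recursion. This is a minimal sketch under an assumed counting rule: each clade is either kept whole as a single species, or split at its root and each child subtree delimited independently. The nested-tuple tree encoding and the function names are illustrative choices, not taken from the cited papers.

```python
def count_delimitations(tree):
    """Count delimitations of a rooted binary tree.

    A leaf is any non-tuple value; an internal node is a (left, right) tuple.
    A leaf admits exactly one delimitation (itself as a species). An internal
    node admits one delimitation where the whole clade is a single species,
    plus every combination of delimitations of its two children.
    """
    if not isinstance(tree, tuple):
        return 1
    left, right = tree
    return 1 + count_delimitations(left) * count_delimitations(right)


def caterpillar(m):
    """Build a fully unbalanced (caterpillar) tree with m tips."""
    tree = 0
    for tip in range(1, m):
        tree = (tree, tip)
    return tree


def balanced(m):
    """Build a fully balanced tree with m tips (m must be a power of two)."""
    if m == 1:
        return 0
    half = balanced(m // 2)
    return (half, half)


if __name__ == "__main__":
    for m in (2, 4, 8, 16):
        print(m, count_delimitations(caterpillar(m)),
              count_delimitations(balanced(m)))
```

Under this rule the caterpillar count grows linearly with the number of tips, while the balanced-tree count roughly squares each time the number of tips doubles, which is what makes exhaustive search over delimitations impractical on large trees.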

Journal Article
TL;DR: MALDI-TOF mass spectrometry readily distinguishes the newly recognized species, which differ in aspects of pathogenicity, prevalence for patient groups, as well as biochemical and physiological aspects, such as susceptibility to antifungals.

543 citations

Journal Article
TL;DR: The multi-rate PTP is introduced, an improved method that alleviates the theoretical and technical shortcomings of PTP and consistently yields more accurate delimitations with respect to the taxonomy (i.e., identifies more taxonomic species, infers species numbers closer to the taxonomy).
Abstract: Motivation: In recent years, molecular species delimitation has become a routine approach for quantifying and classifying biodiversity. Barcoding methods are of particular importance in large-scale surveys as they promote fast species discovery and biodiversity estimates. Among those, distance-based methods are the most common choice as they scale well with large datasets; however, they are sensitive to similarity threshold parameters and they ignore evolutionary relationships. The recently introduced "Poisson Tree Processes" (PTP) method is a phylogeny-aware approach that does not rely on such thresholds. Yet, two weaknesses of PTP impact its accuracy and practicality when applied to large datasets; it does not account for divergent intraspecific variation and is slow for a large number of sequences. Results: We introduce the multi-rate PTP (mPTP), an improved method that alleviates the theoretical and technical shortcomings of PTP. It incorporates different levels of intraspecific genetic diversity deriving from differences in either the evolutionary history or sampling of each species. Results on empirical data suggest that mPTP is superior to PTP and popular distance-based methods as it consistently yields more accurate delimitations with respect to the taxonomy (i.e., identifies more taxonomic species, infers species numbers closer to the taxonomy). Moreover, mPTP does not require any similarity threshold as input. The novel dynamic programming algorithm attains a speedup of at least five orders of magnitude compared to PTP, allowing it to delimit species in large (meta-) barcoding data. In addition, Markov Chain Monte Carlo sampling provides a comprehensive evaluation of the inferred delimitation in just a few seconds for millions of steps, independently of tree size. Availability and Implementation: mPTP is implemented in C and is available for download at http://github.com/Pas-Kapli/mptp under the GNU Affero 3 license. A web-service is available at http://mptp.h-its.org. Contact: paschalia.kapli@h-its.org or alexandros.stamatakis@h-its.org or tomas.flouri@h-its.org. Supplementary information: Supplementary data are available at Bioinformatics online.

535 citations


Cites methods from "Delimiting Species Using Single-Loc..."

  • ...The GMYC method (Fujisawa and Barraclough, 2013) uses a speciation (Yule, 1925) and a neutral coalescent model (Hudson, 1990)....

    [...]

  • ...Several algorithms and implementations exist for this purpose, most of which are inspired by the phylogenetic species concept (Fujisawa and Barraclough, 2013; Yang and Rannala, 2014; Zhang et al., 2013) and the DNA barcoding concept (Hao et al., 2011; Edgar, 2010; Puillandre et al., 2012)....

    [...]

  • ...The General Mixed Yule Coalescent (GMYC; Fujisawa and Barraclough, 2013; Pons et al., 2006) and the recently introduced Poisson Tree Processes (PTP; Zhang et al., 2013) are two similar models that bridge the gap between “species-tree” and distance-based methods....

    [...]

  • ...With the introduction of DNA-barcoding (Hebert et al., 2003) and the advances in coalescent models (Fujisawa and Barraclough, 2013; Yang and Rannala, 2014), genetic data became the most popular data source for delimiting species....

    [...]

  • ...Finally, we avoided comparisons with time-based delimitation methods (Fujisawa and Barraclough, 2013; Jones et al., 2015; Yang and Rannala, 2014) which are time consuming and heavily dependent on the factorization accuracy of branch lengths into time and evolutionary rate....

    [...]

Journal Article
TL;DR: It is demonstrated that ASAP has the potential to become a major tool for taxonomists as it proposes rapidly in a full graphical exploratory interface relevant species hypothesis as a first step of the integrative taxonomy process.
Abstract: Here, we describe Assemble Species by Automatic Partitioning (ASAP), a new method to build species partitions from single locus sequence alignments (i.e., barcode data sets). ASAP is efficient enough to split data sets as large as 10^4 sequences into putative species in several minutes. Although grounded in evolutionary theory, ASAP is the implementation of a hierarchical clustering algorithm that only uses pairwise genetic distances, avoiding the computational burden of phylogenetic reconstruction. Importantly, ASAP proposes species partitions ranked by a new scoring system that uses no biological prior insight of intraspecific diversity. ASAP is a stand-alone program that can be used either through a graphical web-interface or that can be downloaded and compiled for local usage. We have assessed its power along with three other programs (ABGD, PTP and GMYC) on 10 real COI barcode data sets representing various degrees of challenge (from small and easy cases to large and complicated data sets). We also used Monte-Carlo simulations of a multispecies coalescent framework to assess the strengths and weaknesses of ASAP and the other programs. Through these analyses, we demonstrate that ASAP has the potential to become a major tool for taxonomists as it proposes rapidly in a full graphical exploratory interface relevant species hypothesis as a first step of the integrative taxonomy process.

393 citations


Cites background from "Delimiting Species Using Single-Loc..."

  • ...Conversely, the multiple-threshold version of GMYC is particularly prone to oversplitting (Fujisawa & Barraclough, 2013; Kekkonen & Hebert, 2014)....

    [...]

  • ...…is similar to ASAP first–second, that PTP and mPTP tend to not perform very well, that GMYC performs very well provided that the number of species is not too large and that, as previously reported in the literature, mGMYC generally oversplits (Fujisawa & Barraclough, 2013; Kekkonen & Hebert, 2014)....

    [...]

References
Journal Article
TL;DR: Copyright (©) 1999–2012 R Foundation for Statistical Computing; permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and permission notice are preserved on all copies.
Abstract: Copyright (©) 1999–2012 R Foundation for Statistical Computing. Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies. Permission is granted to copy and distribute modified versions of this manual under the conditions for verbatim copying, provided that the entire resulting derived work is distributed under the terms of a permission notice identical to this one. Permission is granted to copy and distribute translations of this manual into another language, under the above conditions for modified versions, except that this permission notice may be stated in a translation approved by the R Core Team.

272,030 citations

Book
19 Jun 2013
TL;DR: The second edition of this book is unique in that it focuses on methods for making formal statistical inference from all the models in an a priori set (Multi-Model Inference).
Abstract: Introduction * Information and Likelihood Theory: A Basis for Model Selection and Inference * Basic Use of the Information-Theoretic Approach * Formal Inference From More Than One Model: Multi-Model Inference (MMI) * Monte Carlo Insights and Extended Examples * Statistical Theory and Numerical Results * Summary

36,993 citations

Journal Article
TL;DR: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML) that has been used to compute ML trees on two of the largest alignments to date.
Abstract: Summary: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Γ yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets ≥4000 taxa it also runs 2-3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25 057 (1463 bp) and 2182 (51 089 bp) taxa, respectively. Availability: icwww.epfl.ch/~stamatak Contact: Alexandros.Stamatakis@epfl.ch Supplementary information: Supplementary data are available at Bioinformatics online.

14,847 citations


"Delimiting Species Using Single-Loc..." refers methods in this paper

  • ...Gene trees were inferred from the simulated sequences using RAxML (Stamatakis 2006) with 100 bootstrap pseudoreplicates, then made ultrametric with the molecular clock assumption using the Langley–Fitch method implemented in r8s (Sanderson 2003)....

    [...]

Journal Article
TL;DR: Analysis of Phylogenetics and Evolution (APE) is a package written in the R language for use in molecular evolution and phylogenetics that provides both utility functions for reading and writing data and manipulating phylogenetic trees.
Abstract: Summary: Analysis of Phylogenetics and Evolution (APE) is a package written in the R language for use in molecular evolution and phylogenetics. APE provides both utility functions for reading and writing data and manipulating phylogenetic trees, as well as several advanced methods for phylogenetic and evolutionary analysis (e.g. comparative and population genetic methods). APE takes advantage of the many R functions for statistics and graphics, and also provides a flexible framework for developing and implementing further statistical methods for the analysis of evolutionary processes. Availability: The program is free and available from the official R package archive at http://cran.r-project.org/src/contrib/PACKAGES.html#ape. APE is licensed under the GNU General Public License.

10,818 citations


"Delimiting Species Using Single-Loc..." refers methods in this paper

  • ...All data processing and analyses were performed in R (R Development Core Team 2010) using the splits package (Ezard et al. 2009) with custom scripts, and the APE and apTreeshape packages (Paradis et al. 2004; Bortolussi et al. 2006)....

    [...]

Journal Article
TL;DR: It is established that the mitochondrial gene cytochrome c oxidase I (COI) can serve as the core of a global bioidentification system for animals and will provide a reliable, cost-effective and accessible solution to the current problem of species identification.
Abstract: Although much biological research depends upon species diagnoses, taxonomic expertise is collapsing. We are convinced that the sole prospect for a sustainable identification capability lies in the construction of systems that employ DNA sequences as taxon 'barcodes'. We establish that the mitochondrial gene cytochrome c oxidase I (COI) can serve as the core of a global bioidentification system for animals. First, we demonstrate that COI profiles, derived from the low-density sampling of higher taxonomic categories, ordinarily assign newly analysed taxa to the appropriate phylum or order. Second, we demonstrate that species-level assignments can be obtained by creating comprehensive COI profiles. A model COI profile, based upon the analysis of a single individual from each of 200 closely allied species of lepidopterans, was 100% successful in correctly identifying subsequent specimens. When fully developed, a COI identification system will provide a reliable, cost-effective and accessible solution to the current problem of species identification. Its assembly will also generate important new insights into the diversification of life and the rules of molecular evolution.

9,879 citations