Showing papers in "Systematic Biology in 2015"


Journal ArticleDOI
TL;DR: It is shown that for many empirical phylogenies, characters simulated in the absence of state-dependent diversification exhibit an even higher Type I error rate, indicating that the method is susceptible to additional, unknown model inadequacies.
Abstract: Species richness varies widely across the tree of life, and there is great interest in identifying ecological, geographic, and other factors that affect rates of species proliferation. Recent methods for explicitly modeling the relationships among character states, speciation rates, and extinction rates on phylogenetic trees— BiSSE, QuaSSE, GeoSSE, and related models— have been widely used to test hypotheses about character state-dependent diversification rates. Here, we document the disconcerting ease with which neutral traits are inferred to have statistically significant associations with speciation rate. We first demonstrate this unfortunate effect for a known model assumption violation: shifts in speciation rate associated with a character not included in the model. We further show that for many empirical phylogenies, characters simulated in the absence of state-dependent diversification exhibit an even higher Type I error rate, indicating that the method is susceptible to additional, unknown model inadequacies. For traits that evolve slowly, the root cause appears to be a statistical framework that does not require replicated shifts in character state and diversification. However, spurious associations between character state and speciation rate arise even for traits that lack phylogenetic signal, suggesting that phylogenetic pseudoreplication alone cannot fully explain the problem. The surprising severity of this phenomenon suggests that many trait-diversification relationships reported in the literature may not be real. More generally, we highlight the need for diagnosing and understanding the consequences of model inadequacy in phylogenetic comparative methods. (Character evolution; extinction; macroevolution; speciation; statistics.)

383 citations


Journal ArticleDOI
TL;DR: This paper aims to demonstrate the efforts towards in-situ applicability of EMMARM, as to provide real-time information about the response of the immune system to Representative Tournaisian infectious disease.
Abstract: (1) Departments of Zoology and Botany and Beaty Biodiversity Museum, University of British Columbia, 6270 University Boulevard, Vancouver, British Columbia, Canada V6T 1Z4; (2) Departments of Zoology and Biodiversity Research Centre, University of British Columbia, 6270 University Boulevard, Vancouver, British Columbia, Canada V6T 1Z4; and (3) Department of Biological Sciences, Macquarie University, Sydney NSW 2109, Australia. Correspondence to be sent to: Department of Zoology, University of British Columbia, 6270 University Boulevard, Vancouver, British Columbia, Canada V6T 1Z4; E-mail: wayne.maddison@ubc.ca.

376 citations


Journal ArticleDOI
TL;DR: Recent approaches to modeling the evolution of cancer, including population dynamics models of tumor initiation and progression, phylogenetic methods to model the evolutionary relationship between tumor subclones, and probabilistic graphical models to describe dependencies among mutations are reviewed.
Abstract: Cancer is a somatic evolutionary process characterized by the accumulation of mutations, which contribute to tumor growth, clinical progression, immune escape, and drug resistance development. Evolutionary theory can be used to analyze the dynamics of tumor cell populations and to make inference about the evolutionary history of a tumor from molecular data. We review recent approaches to modeling the evolution of cancer, including population dynamics models of tumor initiation and progression, phylogenetic methods to model the evolutionary relationship between tumor subclones, and probabilistic graphical models to describe dependencies among mutations. Evolutionary modeling helps to understand how tumors arise and will also play an increasingly important prognostic role in predicting disease progression and the outcome of medical interventions, such as targeted therapy.

344 citations


Journal ArticleDOI
TL;DR: Phylogenetic analyses of the empirical data using concatenation or a coalescent-based species tree approach provide strong support for many of the accepted relationships among phrynosomatid lizards, suggesting that RAD loci contain useful phylogenetic signal across a range of divergence times despite the presence of missing data.
Abstract: Single nucleotide polymorphisms (SNPs) are useful markers for phylogenetic studies owing in part to their ubiquity throughout the genome and ease of collection. Restriction site associated DNA sequencing (RADseq) methods are becoming increasingly popular for SNP data collection, but an assessment of the best practises for using these data in phylogenetics is lacking. We use computer simulations, and new double digest RADseq (ddRADseq) data for the lizard family Phrynosomatidae, to investigate the accuracy of RAD loci for phylogenetic inference. We compare the two primary ways RAD loci are used during phylogenetic analysis, including the analysis of full sequences (i.e., SNPs together with invariant sites), or the analysis of SNPs on their own after excluding invariant sites. We find that using full sequences rather than just SNPs is preferable from the perspectives of branch length and topological accuracy, but not of computational time. We introduce two new acquisition bias corrections for dealing with alignments composed exclusively of SNPs, a conditional likelihood method and a reconstituted DNA approach. The conditional likelihood method conditions on the presence of variable characters only (the number of invariant sites that are unsampled but known to exist is not considered), while the reconstituted DNA approach requires the user to specify the exact number of unsampled invariant sites prior to the analysis. Under simulation, branch length biases increase with the amount of missing data for both acquisition bias correction methods, but branch length accuracy is much improved in the reconstituted DNA approach compared to the conditional likelihood approach. 
Phylogenetic analyses of the empirical data using concatenation or a coalescent-based species tree approach provide strong support for many of the accepted relationships among phrynosomatid lizards, suggesting that RAD loci contain useful phylogenetic signal across a range of divergence times despite the presence of missing data. Phylogenetic analysis of RAD loci requires careful attention to model assumptions, especially if downstream analyses depend on branch lengths.

273 citations
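
The conditional likelihood idea in the abstract above (conditioning on the presence of variable characters only) can be illustrated with a minimal Python sketch. It assumes one standard form of acquisition bias correction, in which each site likelihood is divided by the probability that a site is variable; this is an assumption for illustration, not necessarily the exact formulation used in the paper. The function name and its inputs are hypothetical, and the per-site likelihoods are assumed to come from some existing phylogenetic likelihood routine.

```python
import math


def conditional_log_likelihood(site_log_likes, invariant_pattern_log_likes):
    """Log-likelihood of an SNP-only alignment, conditioned on variability.

    site_log_likes: log-likelihood of each observed (variable) site.
    invariant_pattern_log_likes: log-likelihood of every possible invariant
        pattern (for DNA: all-A, all-C, all-G, all-T) under the same tree
        and substitution model. Both inputs are assumed to be precomputed.
    """
    # Probability that a site is invariant under the model.
    p_invariant = sum(math.exp(x) for x in invariant_pattern_log_likes)
    if not 0.0 <= p_invariant < 1.0:
        raise ValueError("invariant-site probability must lie in [0, 1)")
    # log(1 - p_invariant), computed stably.
    log_p_variable = math.log1p(-p_invariant)
    # Each site contributes log L(site) - log P(site is variable).
    return sum(site_log_likes) - len(site_log_likes) * log_p_variable


# Hypothetical numbers: two variable sites, and invariant patterns that
# together have probability 0.8 under the model.
sites = [math.log(0.01), math.log(0.02)]
invariant = [math.log(0.2)] * 4
corrected = conditional_log_likelihood(sites, invariant)
```

Note that this sketch never needs the number of unsampled invariant sites, which is the distinction the abstract draws against the reconstituted DNA approach.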


Journal ArticleDOI
TL;DR: The aim of this article is to test the hypothesis that integrative taxonomy, and in particular the use of molecular data, helped to alleviate the taxonomic impediment by delimiting and describing new species; the authors review part of the integrative taxonomy literature of 2006-2013 and test whether authors who delimit new species also name them.
Abstract: Integrative taxonomy was formally introduced in 2005 as a comprehensive framework to delimit and describe taxa by integrating information from different types of data and methodologies (Dayrat 2005; Will et al. 2005). Even if debate remains about the hierarchy of the types of characters and criteria to use for species delimitation (Schlick-Steiner et al., 2009; Padial et al., 2010; Yeates et al., 2011), most, if not all taxonomists agree that objectively evaluating several lines of evidence within a formalized framework is the most efficient and theoretically-grounded approach to defining robust species hypotheses (Samadi and Barberousse 2006; de Queiroz 2007). The last ten years have seen a renewal of taxonomy, illustrated by the increasing number of published articles related to species concepts, species delimitation methodology and its application. In the early 90s, many systematists began to suspect that the majority of species would remain undescribed (Costello et al. 2013a; Erwin 1982; Mora et al. 2011 – but see Costello et al. 2013b) and that some of them will probably go extinct before we have a chance to describe them (Barnosky et al., 2011; Leakey and Lewin, 1995; Pimm et al., 2006). The use of molecular data, and in particular molecular barcoding (Hebert et al., 2003), was presented as one answer to this “taxonomic impediment” (as defined in Rodman and Cody, 2003), and welcomed as such by taxonomists. It thus adds to the toolkit of taxonomy, which continues its development as a synergic discipline involving morphological taxonomists, field ecologists, naturalists, and statisticians (Knapp 2008).
Integrative taxonomy, used for many decades by taxonomists but only recently formalized concomitantly with the molecular revolution, is organised following a three-step workflow (see also Evenhuis 2007): first, we need to accumulate data on numerous specimens (from various types of data: DNA, morphology, ecology…); second, we need to circumscribe groups of organisms using concepts that ensure that these groups correspond to species (this second step may be coupled with the first, as biological data are continuously accumulated and species hypotheses re-discussed); and third, we need to provide a species description, i.e. a diagnosis and a name for the species recognized as new. Naming new species is a fundamental step when describing biodiversity and is the only way to ensure that scientists are talking about the same entity, and that all the data linked to conspecific specimens but produced by different researchers (or amateurs) can be associated in a comparative analysis (Patterson et al., 2010; Satler et al., 2013; Schlick-Steiner et al., 2007). Not linking biological data (whether they are molecular, morphological, or ecological) to a formal species name will result in these data losing tremendous value (Goldstein and DeSalle 2011). Indeed, when authors publish data on entities that are not defined within the framework of a referencing system (e.g. solely identified by an alphanumeric label), they make it very difficult for other authors to build on these data. The best example is the need for taxa to be named to have a chance to be listed in an endangered species list and to benefit from a conservation program: no name, no surviving (Mace 2004). Beyond the need for communication among scientists, names are also key to communicating with non-scientist audiences.
While it is now widely recognized that integrating several lines of evidence is the most efficient and theoretically grounded way to delimit new species (e.g. de Queiroz, 2007; Schlick-Steiner et al., 2009; Yeates et al., 2011), the formal naming of new entities may have become decoupled from species delimitation. Indeed, we noted that in several cases newly delimited species were not accompanied by formal species description (see also Goldstein and DeSalle 2011). The aim of this article is therefore to test the hypothesis that integrative taxonomy, as defined in 2005 (Dayrat 2005; Will et al. 2005), and in particular the use of molecular data, helped to alleviate the taxonomic impediment by delimiting and describing new species. We reviewed part of the “integrative taxonomy” literature of the last eight years (2006-2013) and tested whether authors who delimit new species also name them. We also looked at how the number and type of characters used, across different taxa, vary across articles.

230 citations


Journal ArticleDOI
TL;DR: New DFOIL tests are proposed as an integrated framework to infer both the taxa involved in and the direction of introgression for a symmetric five-taxon phylogeny and are computationally inexpensive to calculate and can easily be applied to phylogenomic data sets.
Abstract: When multiple speciation events occur rapidly in succession, discordant genealogies due to incomplete lineage sorting (ILS) can complicate the detection of introgression. A variety of methods, including the D-statistic (a.k.a. the "ABBA-BABA test"), have been proposed to infer introgression in the presence of ILS for a four-taxon clade. However, no integrated method exists to detect introgression using allelic patterns for more complex phylogenies. Here we explore the issues associated with previous systems of applying D-statistics to a larger tree topology, and propose new DFOIL tests as an integrated framework to infer both the taxa involved in and the direction of introgression for a symmetric five-taxon phylogeny. Using theory and simulations, we show that the DFOIL statistics correctly identify the introgression donor and recipient lineages, even at low rates of introgression. DFOIL is also shown to have extremely low false-positive rates. The DFOIL tests are computationally inexpensive to calculate and can easily be applied to phylogenomic data sets, both genome-wide and in windows of the genome. In addition, we explore both the principles and problems of introgression detection in even more complex phylogenies. (ABBA-BABA; D-statistics; genomics; hybridization; incomplete lineage sorting; introgression; phylogenetics; phylogenomics.)

220 citations
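
For readers unfamiliar with the four-taxon D-statistic that the abstract above builds on, the following Python sketch counts ABBA and BABA site patterns and computes D = (ABBA - BABA) / (ABBA + BABA). It illustrates only the classic four-taxon test, not the five-taxon DFOIL statistics introduced in the paper. The function names and the toy sequences are hypothetical, and the sketch assumes the tree (((P1, P2), P3), outgroup) with the outgroup carrying the ancestral allele.

```python
def count_abba_baba(p1, p2, p3, outgroup):
    """Count ABBA and BABA patterns across four aligned sequences."""
    abba = baba = 0
    for a, b, c, o in zip(p1, p2, p3, outgroup):
        # Skip gaps and ambiguity codes.
        if any(base not in "ACGT" for base in (a, b, c, o)):
            continue
        if a == o and b == c and b != o:
            abba += 1  # P2 and P3 share the derived allele
        elif b == o and a == c and a != o:
            baba += 1  # P1 and P3 share the derived allele
    return abba, baba


def d_statistic(abba, baba):
    """Four-taxon D: 0 is expected under ILS alone, without introgression."""
    total = abba + baba
    if total == 0:
        raise ValueError("no ABBA or BABA sites to compare")
    return (abba - baba) / total


# Toy alignment (hypothetical data): two ABBA sites and one BABA site.
counts = count_abba_baba("AAGTG", "CAGAC", "CAAAG", "AAATC")
d_value = d_statistic(*counts)
```

A real analysis would also need a significance test, for example a block jackknife over genomic windows, which is omitted here.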


Journal ArticleDOI
TL;DR: It is demonstrated precisely how using standard PCA can mislead inferences: the first few principal components of traits evolved under constant-rate multivariate Brownian motion will appear to have evolved via an "early burst" process.
Abstract: Most existing methods for modeling trait evolution are univariate, although researchers are often interested in investigating evolutionary patterns and processes across multiple traits. Principal components analysis (PCA) is commonly used to reduce the dimensionality of multivariate data so that univariate trait models can be fit to individual principal components. The problem with using standard PCA on phylogenetically structured data has been previously pointed out yet it continues to be widely used in the literature. Here we demonstrate precisely how using standard PCA can mislead inferences: The first few principal components of traits evolved under constant-rate multivariate Brownian motion will appear to have evolved via an "early burst" process. A phylogenetic PCA (pPCA) has been proposed to alleviate these issues. However, when the true model of trait evolution deviates from the model assumed in the calculation of the pPCA axes, we find that the use of pPCA suffers from similar artifacts as standard PCA. We show that data sets with high effective dimensionality are particularly likely to lead to erroneous inferences. Ultimately, all of the problems we report stem from the same underlying issue: by considering only the first few principal components as univariate traits, we are effectively examining a biased sample of a multivariate pattern. These results highlight the need for truly multivariate phylogenetic comparative methods. As these methods are still being developed, we discuss potential alternative strategies for using and interpreting models fit to univariate axes of multivariate data. (Brownian motion; early burst; multivariate evolution; Ornstein-Uhlenbeck; phylogenetic comparative methods; principal components analysis; quantitative genetics)

212 citations


Journal ArticleDOI
TL;DR: The first comprehensive, time-calibrated phylogeny of this group is used to test the hypotheses of a diversification rate increase driven by the dramatic environmental changes in the Neotropics over the past 23 myr, or changes caused by diversity-dependent effects on the rate of diversification.
Abstract: Müllerian mimicry among Neotropical Heliconiini butterflies is an excellent example of natural selection, associated with the diversification of a large continental-scale radiation. Some of the processes driving the evolution of mimicry rings are likely to generate incongruent phylogenetic signals across the assemblage, and thus pose a challenge for systematics. We use a data set of 22 mitochondrial and nuclear markers from 92% of species in the tribe, obtained by Sanger sequencing and de novo assembly of short read data, to re-examine the phylogeny of Heliconiini with both supermatrix and multispecies coalescent approaches, characterize the patterns of conflicting signal, and compare the performance of various methodological approaches to reflect the heterogeneity across the data. Despite the large extent of reticulate signal and strong conflict between markers, nearly identical topologies are consistently recovered by most of the analyses, although the supermatrix approach failed to reflect the underlying variation in the history of individual loci. However, the supermatrix represents a useful approximation where multiple rare species represented by short sequences can be incorporated easily. The first comprehensive, time-calibrated phylogeny of this group is used to test the hypotheses of a diversification rate increase driven by the dramatic environmental changes in the Neotropics over the past 23 myr, or changes caused by diversity-dependent effects on the rate of diversification. We find that the rate of diversification has increased on the branch leading to the presently most species-rich genus Heliconius, but the change occurred gradually and cannot be unequivocally attributed to a specific environmental driver. Our study provides a comprehensive comparison of philosophically distinct species tree reconstruction methods and provides insights into the diversification of an important insect radiation in the most biodiverse region of the planet. (Amazonia; diversification rate; incongruence; Lepidoptera; Miocene; mimicry; multispecies coalescent.)

201 citations


Journal ArticleDOI
TL;DR: It is shown that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs and that alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong.
Abstract: Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.

196 citations


Journal ArticleDOI
TL;DR: A review of the various models that have been used to describe the relationships between gene trees and species trees can be found in this article, where the authors predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.
Abstract: This article reviews the various models that have been used to describe the relationships between gene trees and species trees. Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can coexist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a more reliable basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution. (Algorithm; amalgamation; Bayesian inference; birth-death model; coalescent; dynamic programming; gene duplication; gene loss; gene transfer; gene tree; hybridization; maximum likelihood; phylogenetics; species tree.)

191 citations


Journal ArticleDOI
TL;DR: This paper explores the mathematical guarantees of coalescent-based methods when analyzing estimated rather than true gene trees, and provides some insight into the differences between their promise in theory and their performance in practice.
Abstract: The estimation of species trees using multiple loci has become increasingly common. Because different loci can have different phylogenetic histories (reflected in different gene tree topologies) for multiple biological causes, new approaches to species tree

Journal ArticleDOI
TL;DR: The results illuminate the challenges of estimating a bifurcating tree in a rapid and recent radiation, providing a rare empirical example of a nearly simultaneous series of speciation events in a terrestrial animal lineage as it spreads across an oceanic archipelago.
Abstract: Phylogenetic relationships in recent, rapid radiations can be difficult to resolve due to incomplete lineage sorting and reliance on genetic markers that evolve slowly relative to the rate of speciation. By incorporating hundreds to thousands of unlinked loci, phylogenomic analyses have the potential to mitigate these difficulties. Here, we attempt to resolve phylogenetic relationships among eight shrew species (genus Crocidura) from the Philippines, a phylogenetic problem that has proven intractable with small (< 10 loci) data sets. We sequenced hundreds of ultraconserved elements and whole mitochondrial genomes in these species and estimated phylogenies using concatenation, summary coalescent, and hierarchical coalescent methods. The concatenated approach recovered a maximally supported and fully resolved tree. In contrast, the coalescent-based approaches produced similar topologies, but each had several poorly supported nodes. Using simulations, we demonstrate that the concatenated tree could be positively misleading. Our simulations also show that the tree shape we tend to infer, which involves a series of short internal branches, is difficult to resolve, even if substitution models are known and multiple individuals per species are sampled. As such, the low support we obtained for backbone relationships in our coalescent-based inferences reflects a real and appropriate lack of certainty. Our results illuminate the challenges of estimating a bifurcating tree in a rapid and recent radiation, providing a rare empirical example of a nearly simultaneous series of speciation events in a terrestrial animal lineage as it spreads across an oceanic archipelago. (Coalescence; concatenation; Crocidura; Philippines; SNPs; Soricidae; species tree; ultraconserved elements.)

Journal ArticleDOI
TL;DR: This study introduces a new approach to incorporate into biogeographic inference the temporal, spatial, and environmental information provided by the fossil record, as direct evidence of the extinct biodiversity fraction, and shows that an integrative approach to historical biogeography could help to obtain more accurate reconstructions of ancient evolutionary history.
Abstract: In disciplines such as macroevolution that are not amenable to experimentation, scientists usually rely on current observations to test hypotheses about historical events, assuming that "the present is the key to the past." Biogeographers, for example, used this assumption to reconstruct ancestral ranges from the distribution of extant species. Yet, under scenarios of high extinction rates, the biodiversity we observe today might not be representative of the historical diversity and this could result in incorrect biogeographic reconstructions. Here, we introduce a new approach to incorporate into biogeographic inference the temporal, spatial, and environmental information provided by the fossil record, as direct evidence of the extinct biodiversity fraction. First, inferences of ancestral ranges for those nodes in the phylogeny calibrated with the fossil record are constrained to include the geographic distribution of the fossil. Second, we use fossil distribution and past climate data to reconstruct the climatic preferences and potential distribution of ancestral lineages over time, and use this information to build a biogeographic model that takes into account "ecological connectivity" through time. To show the power of this approach, we reconstruct the biogeographic history of the large angiosperm genus Hypericum, which has a fossil record extending back to the Early Cenozoic. Unlike previous reconstructions based on extant species distributions, our results reveal that Hypericum stem lineages were already distributed in the Holarctic before diversification of its crown-group, and that the geographic distribution of the genus has been relatively stable throughout the climatic oscillations of the Cenozoic. Geographical movement was mediated by the existence of climatic corridors, like Beringia, whereas the equatorial tropical belt acted as a climatic barrier, preventing Hypericum lineages from reaching the southern temperate regions.
Our study shows that an integrative approach to historical biogeography—that combines sources of evidence as diverse as paleontology, ecology, and phylogenetics—could help us obtain more accurate reconstructions of ancient evolutionary history. It also reveals the confounding effect different rates of extinction across regions have in biogeography, sometimes leading to ancestral areas being erroneously inferred as recent colonization events. (Biogeography; Cenozoic climate change; environmental niche modeling; extinction; fossils; Hypericum; phylogenetics.)

Journal ArticleDOI
TL;DR: A new analysis of crocodylomorphs with increased outgroup sampling recovers Thalattosuchia as the sister group to Crocodyliformes, distantly related to long-snouted crocodyliforms, and demonstrates the importance of careful outgroup sampling and character construction, and their profound effect on the position of labile clades.
Abstract: Outgroup sampling is a central issue in phylogenetic analysis. However, good justification is rarely given for outgroup selection in published analyses. Recent advances in our understanding of archosaur phylogeny suggest that many previous studies of crocodylomorph and crocodyliform relationships have rooted trees on outgroup taxa that are only very distantly related to the ingroup (e.g., Gracilisuchus stipanicicorum), or might actually belong within the ingroup. Thalattosuchia, a group of Mesozoic marine crocodylomorphs, has a controversial phylogenetic position--they are recovered as either the sister group to Crocodyliformes, in a basal position within Crocodyliformes, or nested high in the crocodyliform tree. Thalattosuchians lack several crocodyliform apomorphies, but share several character states with derived long-snouted forms with a similar ecological habit, suggesting their derived position may be the result of convergent evolution. Several of these "shared" characters may result from ambiguously worded character state definitions--structures that are superficially similar but anatomically different in detail are identically coded. A new analysis of crocodylomorphs with increased outgroup sampling recovers Thalattosuchia as the sister group to Crocodyliformes, distantly related to long-snouted crocodyliforms. I also demonstrate that expanding the outgroup sampling of previously published matrices results in the recovery of thalattosuchians as sister to Crocodyliformes. The exclusion of thalattosuchians from Crocodyliformes has numerous implications for large-scale evolutionary trends within the group, including extensive convergence in the evolution of the secondary palate characteristic of the group. These results demonstrate the importance of careful outgroup sampling and character construction, and their profound effect on the position of labile clades.

Journal ArticleDOI
TL;DR: The Phylogenetic Likelihood Library is introduced, a highly optimized application programming interface for developing likelihood-based phylogenetic inference and postanalysis software that improves the sequential performance of current software by a factor of 2–10 while requiring only 1 month of programming time for integration.
Abstract: We introduce the Phylogenetic Likelihood Library (PLL), a highly optimized application programming interface for developing likelihood-based phylogenetic inference and postanalysis software. The PLL implements appropriate data structures and functions that allow users to quickly implement common, error-prone, and labor-intensive tasks, such as likelihood calculations, model parameter as well as branch length optimization, and tree space exploration. The highly optimized and parallelized implementation of the phylogenetic likelihood function and a thorough documentation provide a framework for rapid development of scalable parallel phylogenetic software. By example of two likelihood-based phylogenetic codes we show that the PLL improves the sequential performance of current software by a factor of 2-10 while requiring only 1 month of programming time for integration. We show that, when numerical scaling for preventing floating point underflow is enabled, the double precision likelihood calculations in the PLL are up to 1.9 times faster than those in BEAGLE. On an empirical DNA dataset with 2000 taxa the AVX version of PLL is 4 times faster than BEAGLE (scaling enabled and required). The PLL is available at http://www.libpll.org under the GNU General Public License (GPL).

Journal ArticleDOI
TL;DR: The most parsimonious network topology is identified from a set of five competing scenarios differing in the interpretation of homoeolog extinctions and lineage sorting, based on (i) fewest possible ghost subgenome lineages, (ii) fewest possible polyploidization events, and (iii) least possible deviation from expected ploidy as inferred from available chromosome counts of the involved polyploid taxa.
Abstract: Allopolyploidization accounts for a significant fraction of speciation events in many eukaryotic lineages. However, existing phylogenetic and dating methods require tree-like topologies and are unable to handle the network-like phylogenetic relationships of lineages containing allopolyploids. No explicit framework has so far been established for evaluating competing network topologies, and few attempts have been made to date phylogenetic networks. We used a four-step approach to generate a dated polyploid species network for the cosmopolitan angiosperm genus Viola L. (Violaceae Batch.). The genus contains ca 600 species and both recent (neo-) and more ancient (meso-) polyploid lineages distributed over 16 sections. First, we obtained DNA sequences of three low-copy nuclear genes and one chloroplast region, from 42 species representing all 16 sections. Second, we obtained fossil-calibrated chronograms for each nuclear gene marker. Third, we determined the most parsimonious multilabeled genome tree and its corresponding network, resolved at the section (not the species) level. Reconstructing the "correct" network for a set of polyploids depends on recovering all homoeologs, i.e., all subgenomes, in these polyploids. Assuming the presence of Viola subgenome lineages that were not detected by the nuclear gene phylogenies ("ghost subgenome lineages") significantly reduced the number of inferred polyploidization events. We identified the most parsimonious network topology from a set of five competing scenarios differing in the interpretation of homoeolog extinctions and lineage sorting, based on (i) fewest possible ghost subgenome lineages, (ii) fewest possible polyploidization events, and (iii) least possible deviation from expected ploidy as inferred from available chromosome counts of the involved polyploid taxa. Finally, we estimated the homoploid and polyploid speciation times of the most parsimonious network. 
Homoploid speciation times were estimated by coalescent analysis of gene tree node ages. Polyploid speciation times were estimated by comparing branch lengths and speciation rates of lineages with and without ploidy shifts. Our analyses recognize Viola as an old genus (crown age 31 Ma) whose evolutionary history has been profoundly affected by allopolyploidy. Between 16 and 21 allopolyploidizations are necessary to explain the diversification of the 16 major lineages (sections) of Viola, suggesting that allopolyploidy has accounted for a high percentage—between 67% and 88%—of the speciation events at this level. The theoretical and methodological approaches presented here for (i) constructing networks and (ii) dating speciation events within a network, have general applicability for phylogenetic studies of groups where allopolyploidization has occurred. They make explicit use of a hitherto underexplored source of ploidy information from chromosome counts to help resolve phylogenetic cases where incomplete sequence data hampers network inference. Importantly, the coalescent-based method used herein circumvents the assumption of tree-like evolution required by most techniques for dating speciation events. (Dating; low-copy nuclear gene; polyploidy; species network; Viola; Violaceae.)

Journal ArticleDOI
TL;DR: A binary phylogenetic network may or may not be obtainable from a tree by the addition of directed edges (arcs) between tree arcs; this paper establishes a precise and easily tested criterion (based on 2-SAT) that efficiently determines whether or not any given network can be realized in this way.
Abstract: A binary phylogenetic network may or may not be obtainable from a tree by the addition of directed edges (arcs) between tree arcs. Here, we establish a precise and easily tested criterion (based on “2-SAT”) that efficiently determines whether or not any given network can be realized in this way. Moreover, the proof provides a polynomial-time algorithm for finding one or more trees (when they exist) on which the network can be based. A number of interesting consequences are presented as corollaries; these lead to some further relevant questions and observations, which we outline in the conclusion.
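
The abstract does not say how a network is translated into 2-SAT clauses, so the following is only a generic sketch of why a 2-SAT-based criterion is easy to test: any 2-SAT instance can be decided in linear time from the strongly connected components of its implication graph. The function name `solve_2sat` and the integer literal encoding are assumptions made for illustration, not part of the paper.

```python
def solve_2sat(num_vars, clauses):
    """Return a satisfying assignment (list of bools) or None if unsatisfiable.

    Literals are nonzero ints: +k means variable k is true, -k means it is
    false. Variables are numbered 1..num_vars. (Hypothetical encoding.)
    """
    def node(lit):
        # Vertex index of a literal: 2*(k-1) for +k, 2*(k-1)+1 for -k.
        return 2 * (abs(lit) - 1) + (1 if lit < 0 else 0)

    n = 2 * num_vars
    graph = [[] for _ in range(n)]
    reverse = [[] for _ in range(n)]
    for a, b in clauses:
        # (a or b) is equivalent to (not a -> b) and (not b -> a).
        for src, dst in ((node(-a), node(b)), (node(-b), node(a))):
            graph[src].append(dst)
            reverse[dst].append(src)

    # Kosaraju pass 1: iterative DFS that records vertices by finish time.
    visited = [False] * n
    order = []
    for start in range(n):
        if visited[start]:
            continue
        visited[start] = True
        stack = [(start, iter(graph[start]))]
        while stack:
            v, successors = stack[-1]
            for w in successors:
                if not visited[w]:
                    visited[w] = True
                    stack.append((w, iter(graph[w])))
                    break
            else:
                order.append(v)
                stack.pop()

    # Pass 2: label components on the reversed graph, latest finish first.
    # Labels then follow a topological order of the implication graph.
    comp = [-1] * n
    label = 0
    for start in reversed(order):
        if comp[start] != -1:
            continue
        comp[start] = label
        stack = [start]
        while stack:
            v = stack.pop()
            for w in reverse[v]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1

    assignment = []
    for k in range(1, num_vars + 1):
        pos, neg = comp[node(k)], comp[node(-k)]
        if pos == neg:
            return None  # k and not-k imply each other: unsatisfiable
        assignment.append(pos > neg)
    return assignment
```

A call such as `solve_2sat(3, [(1, 2), (-1, 2), (-2, 3)])` returns one satisfying assignment when it exists; applying this to tree-based networks would additionally require the clause construction given in the paper itself.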

Journal ArticleDOI
TL;DR: This method provides generally well-supported relationships at interspecific and intergeneric levels that agree with results from more standard phylogenetic analyses of commonly used markers, and it is proposed that this methodology may prove especially useful in groups where there is little genetic differentiation in standard phylogenetics markers.
Abstract: A large proportion of genomic information, particularly repetitive elements, is usually ignored when researchers are using next-generation sequencing. Here we demonstrate the usefulness of this repetitive fraction in phylogenetic analyses, utilizing comparative graph-based clustering of next-generation sequence reads, which results in abundance estimates of different classes of genomic repeats. Phylogenetic trees are then inferred based on the genome-wide abundance of different repeat types treated as continuously varying characters; such repeats are scattered across chromosomes and in angiosperms can constitute a majority of nuclear genomic DNA. In six diverse examples, five angiosperms and one insect, this method provides generally well-supported relationships at interspecific and intergeneric levels that agree with results from more standard phylogenetic analyses of commonly used markers. We propose that this methodology may prove especially useful in groups where there is little genetic differentiation in standard phylogenetic markers. At the same time as providing data for phylogenetic inference, this method additionally yields a wealth of data for comparative studies of genome evolution. (Repetitive DNA; continuous characters; genomics; next-generation sequencing; phylogenetics; molecular systematics.)

Journal ArticleDOI
TL;DR: A series of simulations is used to explore the possibility that the older age estimates obtained using current relaxed-clock methods are a consequence of (i) major shifts in the rate of sequence evolution near the base of the angiosperms and/or (ii) the representative taxon sampling strategy employed in such studies.
Abstract: Dating analyses based on molecular data imply that crown angiosperms existed in the Triassic, long before their undisputed appearance in the fossil record in the Early Cretaceous. Following a re-analysis of the age of angiosperms using updated sequences and fossil calibrations, we use a series of simulations to explore the possibility that the older age estimates are a consequence of (i) major shifts in the rate of sequence evolution near the base of the angiosperms and/or (ii) the representative taxon sampling strategy employed in such studies. We show that both of these factors do tend to yield substantially older age estimates. These analyses do not prove that younger age estimates based on the fossil record are correct, but they do suggest caution in accepting the older age estimates obtained using current relaxed-clock methods. Although we have focused here on the angiosperms, we suspect that these results will shed light on dating discrepancies in other major clades.

Journal ArticleDOI
TL;DR: This study highlights the importance of gene selection in phylogenomic analyses, suggesting that simply using a large amount of data cannot guarantee correct results, and constructing question-specific data sets may be more powerful for resolving problematic nodes.
Abstract: Incongruence between different phylogenomic analyses is the main challenge faced by phylogeneticists in the genomic era. To reduce incongruence, phylogenomic studies normally adopt some data filtering approaches, such as reducing missing data or using slowly evolving genes, to improve the signal quality of data. Here, we assembled a phylogenomic data set of 58 jawed vertebrate taxa and 4682 genes to investigate the backbone phylogeny of jawed vertebrates under both concatenation and coalescent-based frameworks. To evaluate the efficiency of extracting phylogenetic signals among different data filtering methods, we chose six highly intractable internodes within the backbone phylogeny of jawed vertebrates as our test questions. We found that our phylogenomic data set exhibits substantial conflicting signal among genes for these questions. Our analyses showed that non-specific data sets that are generated without bias toward specific questions are not sufficient to produce consistent results when there are several difficult nodes within a phylogeny. Moreover, phylogenetic accuracy based on non-specific data is considerably influenced by the size of data and the choice of tree inference methods. To address such incongruences, we selected genes that resolve a given internode but not the entire phylogeny. Notably, not only can this strategy yield correct relationships for the question, but it also reduces inconsistency associated with data sizes and inference methods. Our study highlights the importance of gene selection in phylogenomic analyses, suggesting that simply using a large amount of data cannot guarantee correct results. Constructing question-specific data sets may be more powerful for resolving problematic nodes.

Journal ArticleDOI
TL;DR: It is shown that regions of reduced effective population size can produce positively misleading species tree topologies; the findings disclose the pitfalls of using loci potentially under selection as phylogenetic markers and highlight the potential of modeling approaches to disentangle species relationships in systems with large effective population sizes and post-divergence gene flow.
Abstract: Using genetic data to resolve the evolutionary relationships of species is of major interest in evolutionary and systematic biology. However, reconstructing the sequence of speciation events, the so-called species tree, in closely related and potentially hybridizing species is very challenging. Processes such as incomplete lineage sorting and interspecific gene flow result in local gene genealogies that differ in their topology from the species tree, and analyses of few loci with a single sequence per species are likely to produce conflicting or even misleading results. To study these phenomena on a full phylogenomic scale, we use whole-genome sequence data from 200 individuals of four black-and-white flycatcher species with so far unresolved phylogenetic relationships to infer gene tree topologies and visualize genome-wide patterns of gene tree incongruence. Using phylogenetic analysis in nonoverlapping 10-kb windows, we show that gene tree topologies are extremely diverse and change on a very small physical scale. Moreover, we find strong evidence for gene flow among flycatcher species, with distinct patterns of reduced introgression on the Z chromosome. To resolve species relationships on the background of widespread gene tree incongruence, we used four complementary coalescent-based methods for species tree reconstruction, including complex modeling approaches that incorporate post-divergence gene flow among species. This allowed us to infer the most likely species tree with high confidence. Based on this finding, we show that regions of reduced effective population size, which have been suggested as particularly useful for species tree inference, can produce positively misleading species tree topologies. 
Our findings disclose the pitfalls of using loci potentially under selection as phylogenetic markers and highlight the potential of modeling approaches to disentangle species relationships in systems with large effective population sizes and post-divergence gene flow.

Journal ArticleDOI
TL;DR: This study proves that it is possible to obtain a multilocus species-level phylogeny for di- and polyploid taxa by combining PCR with next-generation sequencing, without cloning and without creating a heavy load of sequence data.
Abstract: Polyploidization is an important speciation mechanism in the barley genus Hordeum. To analyze evolutionary changes after allopolyploidization, knowledge of parental relationships is essential. One chloroplast and 12 nuclear single-copy loci were amplified by polymerase chain reaction (PCR) in all Hordeum plus six out-group species. Amplicons from each of 96 individuals were pooled, sheared, labeled with individual-specific barcodes and sequenced in a single run on a 454 platform. Reference sequences were obtained by cloning and Sanger sequencing of all loci for nine supplementary individuals. The 454 reads were assembled into contigs representing the 13 loci and, for polyploids, also homoeologues. Phylogenetic analyses were conducted for all loci separately and for a concatenated data matrix of all loci. For diploid taxa, a Bayesian concordance analysis and a coalescent-based dated species tree was inferred from all gene trees. Chloroplast matK was used to determine the maternal parent in allopolyploid taxa. The relative performance of different multilocus analyses in the presence of incomplete lineage sorting and hybridization was also assessed. The resulting multilocus phylogeny reveals for the first time species phylogeny and progenitor-derivative relationships of all di- and polyploid Hordeum taxa within a single analysis. Our study proves that it is possible to obtain a multilocus species-level phylogeny for di- and polyploid taxa by combining PCR with next-generation sequencing, without cloning and without creating a heavy load of sequence data.

Journal ArticleDOI
TL;DR: A new method for joint analyses of quantitative traits within and between species, the Expression Variance and Evolution (EVE) model, is developed; it parameterizes the ratio of population to evolutionary expression variance, facilitating a wide variety of analyses.
Abstract: A number of methods have been developed for modeling the evolution of a quantitative trait on a phylogeny. These methods have received renewed interest in the context of genome-wide studies of gene expression, in which the expression levels of many genes can be modeled as quantitative traits. We here develop a new method for joint analyses of quantitative traits within- and between species, the Expression Variance and Evolution (EVE) model. The model parameterizes the ratio of population to evolutionary expression variance, facilitating a wide variety of analyses, including a test for lineage-specific shifts in expression level, and a phylogenetic ANOVA that can detect genes with increased or decreased ratios of expression divergence to diversity, analogous to the famous Hudson Kreitman Aguade (HKA) test used to detect selection at the DNA level. We use simulations to explore the properties of these tests under a variety of circumstances and show that the phylogenetic ANOVA is more accurate than the standard ANOVA (no accounting for phylogeny) sometimes used in transcriptomics. We then apply the EVE model to a mammalian phylogeny of 15 species typed for expression levels in liver tissue. We identify genes with high expression divergence between species as candidates for expression level adaptation, and genes with high expression diversity within species as candidates for expression level conservation and/or plasticity. Using the test for lineage-specific expression shifts, we identify several candidate genes for expression level adaptation on the catarrhine and human lineages, including genes putatively related to dietary changes in humans. We compare these results to those reported previously using a model which ignores expression variance within species, uncovering important differences in performance. 
We demonstrate the necessity for a phylogenetic model in comparative expression studies and show the utility of the EVE model to detect expression divergence, diversity, and branch-specific shifts.

Journal ArticleDOI
TL;DR: This paper conducted an extensive simulation study to quantify the statistical properties of a class of models toward the simpler end of the spectrum that model phenotypic evolution using Ornstein-Uhlenbeck processes and found that model selection power can be high even in regions that were previously thought to be difficult, such as when tree size is small.
Abstract: Phylogenetic comparative analysis is an approach to inferring evolutionary process from a combination of phylogenetic and phenotypic data. The last few years have seen increasingly sophisticated models employed in the evaluation of more and more detailed evolutionary hypotheses, including adaptive hypotheses with multiple selective optima and hypotheses with rate variation within and across lineages. The statistical performance of these sophisticated models has received relatively little systematic attention, however. We conducted an extensive simulation study to quantify the statistical properties of a class of models toward the simpler end of the spectrum that model phenotypic evolution using Ornstein-Uhlenbeck processes. We focused on identifying where, how, and why these methods break down so that users can apply them with greater understanding of their strengths and weaknesses. Our analysis identifies three key determinants of performance: a discriminability ratio, a signal-to-noise ratio, and the number of taxa sampled. Interestingly, we find that model-selection power can be high even in regions that were previously thought to be difficult, such as when tree size is small. On the other hand, we find that model parameters are in many circumstances difficult to estimate accurately, indicating a relative paucity of information in the data relative to these parameters. Nevertheless, we note that accurate model selection is often possible when parameters are only weakly identified. Our results have implications for more sophisticated methods inasmuch as the latter are generalizations of the case we study.
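
As background for the Ornstein-Uhlenbeck (OU) models mentioned above, here is a minimal sketch of the standard transition moments of a single-optimum OU process, dX = alpha * (theta - X) dt + sigma dW, along one branch. The function name `ou_moments` and the parameter names are illustrative assumptions; the abstract does not give the paper's parameterization or define its discriminability and signal-to-noise ratios, so none of that is reproduced here.

```python
import math


def ou_moments(x0, theta, alpha, sigma, t):
    """Mean and variance of X(t) given X(0) = x0 under an OU process.

    theta: selective optimum; alpha > 0: strength of attraction to theta;
    sigma: diffusion (noise) intensity; t >= 0: elapsed time (branch length).
    """
    if alpha <= 0 or t < 0:
        raise ValueError("alpha must be positive and t non-negative")
    decay = math.exp(-alpha * t)
    # The expected trait value relaxes exponentially from x0 toward theta.
    mean = theta + (x0 - theta) * decay
    # The variance grows from 0 toward the stationary value sigma^2 / (2 alpha).
    variance = sigma ** 2 / (2 * alpha) * (1 - decay ** 2)
    return mean, variance
```

For long branches the mean approaches `theta` and the variance approaches `sigma**2 / (2 * alpha)`, so the starting value leaves little trace in the tip data, which is one intuitive reason OU parameters can be hard to estimate even when model selection succeeds.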

Journal ArticleDOI
TL;DR: It is shown conclusively that topological peaks do occur in Bayesian phylogenetic posteriors from real data sets as sampled with standard MCMC approaches, and the efficiency of Metropolis-coupled MCMC (MCMCMC) in traversing the valleys between peaks is investigated.
Abstract: In order to gain an understanding of the effectiveness of phylogenetic Markov chain Monte Carlo (MCMC), it is important to understand how quickly the empirical distribution of the MCMC converges to the posterior distribution. In this article, we investigate this problem on phylogenetic tree topologies with a metric that is especially well suited to the task: the subtree prune-and-regraft (SPR) metric. This metric directly corresponds to the minimum number of MCMC rearrangements required to move between trees in common phylogenetic MCMC implementations. We develop a novel graph-based approach to analyze tree posteriors and find that the SPR metric is much more informative than simpler metrics that are unrelated to MCMC moves. In doing so, we show conclusively that topological peaks do occur in Bayesian phylogenetic posteriors from real data sets as sampled with standard MCMC approaches, investigate the efficiency of Metropolis-coupled MCMC (MCMCMC) in traversing the valleys between peaks, and show that conditional clade distribution (CCD) can have systematic problems when there are multiple peaks.

Journal ArticleDOI
TL;DR: The results show that barcode gap detection and GMYC models are unable to delineate species properly in data sets composed of one or two species, two situations in which haplowebs outperform them, suggesting that multilocus approaches may be necessary to tackle such cases.
Abstract: Most single-locus molecular approaches to species delimitation available to date have been designed and tested on data sets comprising at least tens of species, whereas the opposite case (species-poor data sets for which the hypothesis that all individuals are conspecific cannot be rejected beforehand) has rarely been the focus of such attempts. Here we compare the performance of barcode gap detection, haplowebs and generalized mixed Yule-coalescent (GMYC) models to delineate chimpanzees and bonobos using nuclear sequence markers, then apply these single-locus species delimitation methods to data sets of one, three, or six species simulated under a wide range of population sizes, speciation rates, mutation rates and sampling efforts. Our results show that barcode gap detection and GMYC models are unable to delineate species properly in data sets composed of one or two species, two situations in which haplowebs outperform them. For data sets composed of three or six species, bGMYC and haplowebs outperform the single-threshold and multiple-threshold versions of GMYC, whereas a clear barcode gap is only observed when population sizes and speciation rates are both small. The latter conditions represent a "sweet spot" for molecular taxonomy where all the single-locus approaches tested work well; however, the performance of these methods decreases strongly when population sizes and speciation rates are high, suggesting that multilocus approaches may be necessary to tackle such cases.

Journal ArticleDOI
TL;DR: It is shown how failure to account for such variable evolutionary rates can cause highly anomalous results, while three methods that accommodate rate variability yield the opposite, more plausible, and more robust reconstructions.
Abstract: Virtually all models for reconstructing ancestral states for discrete characters make the crucial assumption that the trait of interest evolves at a uniform rate across the entire tree. However, this assumption is unlikely to hold in many situations, particularly as ancestral state reconstructions are being performed on increasingly large phylogenies. Here, we show how failure to account for such variable evolutionary rates can cause highly anomalous (and likely incorrect) results, while three methods that accommodate rate variability yield the opposite, more plausible, and more robust reconstructions. The random local clock method, implemented in BEAST, estimates the position and magnitude of rate changes on the tree; split BiSSE estimates separate rate parameters for pre-specified clades; and the hidden rates model partitions each character state into a number of rate categories. Simulations show the inadequacy of traditional models when characters evolve with both asymmetry (different rates of change between states within a character) and heterotachy (different rates of character evolution across different clades). The importance of accounting for rate heterogeneity in ancestral state reconstruction is highlighted empirically with a new analysis of the evolution of viviparity in squamate reptiles, which reveals a predominance of forward (oviparous-viviparous) transitions and very few reversals.

Journal ArticleDOI
TL;DR: It is suggested that filtering genes according to their clock-likeness or posterior predictive effect size (PPES, an inference-based measure of model violation) improves phylogenetic reliability and congruence and that this approach can yield a collection of genes with more reliable phylogenetic signal.
Abstract: Topological heterogeneity among gene trees is widely observed in phylogenomic analyses and some of this variation is likely caused by systematic error in gene tree estimation. Systematic error can be mitigated by improving models of sequence evolution to account for all evolutionary processes relevant to each gene or identifying those genes whose evolution best conforms to existing models. However, the best method for identifying such genes is not well established. Here, we ask if filtering genes according to their clock-likeness or posterior predictive effect size (PPES, an inference-based measure of model violation) improves phylogenetic reliability and congruence. We compared these approaches to each other, and to the common practice of filtering based on rate of evolution, using two different metrics. First, we compared gene-tree topologies to accepted reference topologies. Second, we examined topological similarity among gene trees in filtered sets. Our results suggest that filtering genes based on clock-likeness and PPES can yield a collection of genes with more reliable phylogenetic signal. For the two exemplar data sets we explored, from yeast and amniotes, clock-likeness and PPES outperformed rate-based filtering in both congruence and reliability.

Journal ArticleDOI
TL;DR: The results suggest that, with the fossil calibrations fixed, analyzing multiple loci or site partitions is the most effective way of improving the precision of posterior time estimation; however, even if a huge amount of sequence data is analyzed, considerable uncertainty will persist in time estimates.
Abstract: Genetic sequence data provide information about the distances between species or branch lengths in a phylogeny, but not about the absolute divergence times or the evolutionary rates directly. Bayesian methods for dating species divergences estimate times and rates by assigning priors on them. In particular, the prior on times (node ages on the phylogeny) incorporates information in the fossil record to calibrate the molecular tree. Because times and rates are confounded, our posterior time estimates will not approach point values even if an infinite amount of sequence data are used in the analysis. In a previous study we developed a finite-sites theory to characterize the uncertainty in Bayesian divergence time estimation in analysis of large but finite sequence data sets under a strict molecular clock. As most modern clock dating analyses use more than one locus and are conducted under relaxed clock models, here we extend the theory to the case of relaxed clock analysis of data from multiple loci (site partitions). Uncertainty in posterior time estimates is partitioned into three sources: Sampling errors in the estimates of branch lengths in the tree for each locus due to limited sequence length, variation of substitution rates among lineages and among loci, and uncertainty in fossil calibrations. Using a simple but analogous estimation problem involving the multivariate normal distribution, we predict that as the number of loci (L) goes to infinity, the variance in posterior time estimates decreases and approaches the infinite-data limit at the rate of 1/L, and the limit is independent of the number of sites in the sequence alignment. We then confirmed the predictions by using computer simulation on phylogenies of two or three species, and by analyzing a real genomic data set for six primate species. 
Our results suggest that with the fossil calibrations fixed, analyzing multiple loci or site partitions is the most effective way for improving the precision of posterior time estimation. However, even if a huge amount of sequence data is analyzed, considerable uncertainty will persist in time estimates.
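
The 1/L behavior described in the abstract can be illustrated with a toy calculation. This is a sketch under a strong simplifying assumption, namely that the posterior variance is the sum of a fixed floor (the infinite-data limit, set by the fossil calibrations) and a rate-variation term that shrinks in proportion to 1/L. The function name `posterior_variance` and both variance arguments are hypothetical and are not quantities reported in the paper.

```python
def posterior_variance(num_loci, limit_variance, per_locus_variance):
    """Toy model: variance of a time estimate as a function of locus count L.

    limit_variance: variance remaining with infinitely many loci (assumed to
        come from uncertainty in the fossil calibrations).
    per_locus_variance: contribution of among-locus rate variation for L = 1.
    """
    if num_loci < 1:
        raise ValueError("num_loci must be at least 1")
    # The excess over the limit decays as 1/L; the limit itself never shrinks.
    return limit_variance + per_locus_variance / num_loci


if __name__ == "__main__":
    for loci in (1, 10, 100, 1000):
        print(loci, posterior_variance(loci, 0.04, 0.5))
```

In this toy model, doubling the number of loci halves the excess variance above the limit, while no amount of additional sequence data pushes the variance below `limit_variance`, which mirrors the abstract's conclusion that considerable uncertainty persists.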