Journal Article

Inferring phylogeny and introgression using RADseq data: an example from flowering plants (Pedicularis: Orobanchaceae).

01 Sep 2013-Systematic Biology (Oxford University Press)-Vol. 62, Iss: 5, pp 689-706
TL;DR: An important role for geographic isolation in the emergence of species barriers is suggested, by facilitating local adaptation and differentiation in the absence of homogenizing gene flow.
Abstract: Phylogenetic relationships among recently diverged species are often difficult to resolve due to insufficient phylogenetic signal in available markers and/or conflict among gene trees. Here we explore the use of reduced-representation genome sequencing, specifically in the form of restriction-site associated DNA (RAD), for phylogenetic inference and the detection of ancestral hybridization in non-model organisms. As a case study, we investigate Pedicularis section Cyathophora, a systematically recalcitrant clade of flowering plants in the broomrape family (Orobanchaceae). Two methods of phylogenetic inference, maximum likelihood and Bayesian concordance, were applied to data sets that included as many as 40,000 RAD loci. Both methods yielded similar topologies that included two major clades: a “rex-thamnophila” clade, composed of two species and several subspecies with relatively low floral diversity, and geographically widespread distributions at lower elevations, and a “superba” clade, composed of three species characterized by relatively high floral diversity and isolated geographic distributions at higher elevations. Levels of molecular divergence between subspecies in the rex-thamnophila clade are similar to those between species in the superba clade. Using Patterson’s D-statistic test, including a novel extension of the method that enables finer-grained resolution of introgression among multiple candidate taxa by removing the effect of their shared ancestry, we detect significant introgression among nearly all taxa in the rex-thamnophila clade, but not between clades or among taxa within the superba clade. These results suggest an important role for geographic isolation in the emergence of species barriers, by facilitating local adaptation and differentiation in the absence of homogenizing gene flow. [Concordance factors; genotyping-by-sequencing; hybridization; partitioned D-statistic test; Pedicularis; restriction-site associated DNA.]
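
Patterson's D-statistic, named in the abstract, compares the counts of two discordant site patterns, usually called ABBA and BABA, across a four-taxon tree. The sketch below is a minimal illustration of that ratio, not the authors' implementation. The function names, the per-locus input layout, and the choice to judge significance by resampling loci with replacement are all assumptions made for this example.

```python
import random


def d_statistic(abba, baba):
    """Patterson's D = (ABBA - BABA) / (ABBA + BABA); 0.0 if no informative sites."""
    total = abba + baba
    if total == 0:
        return 0.0
    return (abba - baba) / total


def bootstrap_d(per_locus_counts, n_reps=1000, seed=0):
    """Return (D, bootstrap SD, Z) from a list of per-locus (abba, baba) counts.

    Assumption for illustration: loci are resampled with replacement and
    Z is |D| divided by the standard deviation of the bootstrap replicates.
    """
    rng = random.Random(seed)
    n = len(per_locus_counts)
    d_obs = d_statistic(
        sum(a for a, _ in per_locus_counts),
        sum(b for _, b in per_locus_counts),
    )
    reps = []
    for _ in range(n_reps):
        sample = [per_locus_counts[rng.randrange(n)] for _ in range(n)]
        reps.append(
            d_statistic(sum(a for a, _ in sample), sum(b for _, b in sample))
        )
    mean = sum(reps) / n_reps
    sd = (sum((r - mean) ** 2 for r in reps) / (n_reps - 1)) ** 0.5
    if sd > 0:
        z = abs(d_obs) / sd
    else:
        # No variation among replicates: Z is undefined unless D itself is 0.
        z = float("inf") if d_obs != 0 else 0.0
    return d_obs, sd, z


if __name__ == "__main__":
    # Hypothetical counts for four loci, not data from the paper.
    loci = [(5, 1), (3, 2), (4, 0), (2, 2)]
    print(bootstrap_d(loci, n_reps=200))
```

A positive D indicates an excess of ABBA over BABA patterns and a negative D the reverse. Which pair of taxa that implicates depends on how the four taxa are ordered.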


Citations
Journal Article
TL;DR: This Review provides a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.
Abstract: High-throughput techniques based on restriction site-associated DNA sequencing (RADseq) are enabling the low-cost discovery and genotyping of thousands of genetic markers for any species, including non-model organisms, which is revolutionizing ecological, evolutionary and conservation genetics. Technical differences among these methods lead to important considerations for all steps of genomics studies, from the specific scientific questions that can be addressed, and the costs of library preparation and sequencing, to the types of bias and error inherent in the resulting data. In this Review, we provide a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.

1,102 citations

Journal Article
TL;DR: It is shown through reanalysis of an empirical RADseq dataset that indels are a common feature of such data, even at shallow phylogenetic scales, and PyRAD uses parallel processing as well as an optional hierarchical clustering method, which allows it to rapidly assemble phylogenetic datasets with hundreds of sampled individuals.
Abstract: Motivation: Restriction-site associated genomic markers are a powerful tool for investigating evolutionary questions at the population level, but are limited in their utility at deeper phylogenetic scales where fewer orthologous loci are typically recovered across disparate taxa. While this limitation stems in part from mutations to restriction recognition sites that disrupt data generation, an additional source of data loss comes from the failure to identify homology during bioinformatic analyses. Clustering methods that allow for lower similarity thresholds and the inclusion of indel variation will perform better at assembling RADseq loci at the phylogenetic scale. Results: PyRAD is a pipeline to assemble de novo RADseq loci with the aim of optimizing coverage across phylogenetic data sets. It utilizes a wrapper around an alignment-clustering algorithm which allows for indel variation within and between samples, as well as for incomplete overlap among reads (e.g., paired-end). Here I compare PyRAD with the program Stacks in their performance analyzing a simulated RADseq data set that includes indel variation. Indels disrupt clustering of homologous loci in Stacks but not in PyRAD, such that the latter recovers more shared loci across disparate taxa. I show through re-analysis of an empirical RADseq data set that indels are a common feature of such data, even at shallow phylogenetic scales. PyRAD utilizes parallel processing as well as an optional hierarchical clustering method which allow it to rapidly assemble phylogenetic data sets with hundreds of sampled individuals. Availability: Software is written in Python and freely available at http://www.dereneaton.com/software/ Supplement: Scripts to completely reproduce all simulated and empirical analyses are available in the Supplementary Materials.

710 citations


Cites background or methods or result from "Inferring phylogeny and introgressi..."

  • ...…as deep as 60 million years (Rubin et al., 2012; Cariou et al., 2013), however, to date all empirical RADseq studies conducted above the species-level were done at much shallower scales (The Heliconius Genome Consortium, 2012; Wagner et al., 2013; Wang et al., 2013; Eaton and Ree, 2013)....

    [...]

  • ..., 2012); however, to date, all empirical RADseq studies conducted above the species level were done at much shallower scales (Bergey et al., 2013; Eaton and Ree, 2013; Jones et al., 2013; Keller et al., 2013; Lexer et al., 2013; Nadeau et al., 2013; Stölting et al., 2013; The Heliconius Genome Consortium, 2012; Wagner et al., 2013; Wang et al., 2013)....

    [...]

  • ...This currently includes measurement of D-statistics (Durand et al., 2011) and related tests for genomic introgression (Eaton and Ree, 2013)....

    [...]

  • ...A published RADseq data set from Eaton and Ree (2013) was downloaded from the NCBI sequence read archive (SRA072507)....

    [...]

Journal Article
01 Dec 2008

636 citations

Journal Article
TL;DR: It is demonstrated that hybridization between two divergent lineages facilitated this process by providing genetic variation that subsequently became recombined and sorted into many new species, indicating rapid and extensive adaptive radiation.
Abstract: Understanding why some evolutionary lineages generate exceptionally high species diversity is an important goal in evolutionary biology. Haplochromine cichlid fishes of Africa’s Lake Victoria region encompass >700 diverse species that all evolved in the last 150,000 years. How this ‘Lake Victoria Region Superflock’ could evolve on such rapid timescales is an enduring question. Here, we demonstrate that hybridization between two divergent lineages facilitated this process by providing genetic variation that subsequently became recombined and sorted into many new species. Notably, the hybridization event generated exceptional allelic variation at an opsin gene known to be involved in adaptation and speciation. More generally, differentiation between new species is accentuated around variants that were fixed differences between the parental lineages, and that now appear in many new combinations in the radiation species. We conclude that hybridization between divergent lineages, when coincident with ecological opportunity, may facilitate rapid and extensive adaptive radiation. Cichlids underwent a rapid diversification in the Lake Victoria region, expanding to more than 700 species within 150,000 years. Here, Meier and colleagues show that an ancient hybridization between two divergent cichlid lineages generated high genetic diversity that facilitated the rapid radiation.

496 citations

Journal Article
TL;DR: It is found that D is unreliable in this situation as it gives inflated values when effective population size is low, causing D outliers to cluster in genomic regions of reduced diversity, and a related statistic f_d is proposed, a modified version of a statistic originally developed to estimate the genome-wide fraction of admixture.
Abstract: Several methods have been proposed to test for introgression across genomes. One method tests for a genome-wide excess of shared derived alleles between taxa using Patterson's D statistic, but does not establish which loci show such an excess or whether the excess is due to introgression or ancestral population structure. Several recent studies have extended the use of D by applying the statistic to small genomic regions, rather than genome-wide. Here, we use simulations and whole-genome data from Heliconius butterflies to investigate the behavior of D in small genomic regions. We find that D is unreliable in this situation as it gives inflated values when effective population size is low, causing D outliers to cluster in genomic regions of reduced diversity. As an alternative, we propose a related statistic f_d, a modified version of a statistic originally developed to estimate the genome-wide fraction of admixture. f_d is not subject to the same biases as D, and is better at identifying introgressed loci. Finally, we show that both D and f_d outliers tend to cluster in regions of low absolute divergence (d_XY), which can confound a recently proposed test for differentiating introgression from shared ancestral variation at individual loci.
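
The abstract above contrasts D with the related f_d statistic but gives no formulas. The sketch below shows one way the two can be computed from per-site derived-allele frequencies in three ingroup populations. It assumes the outgroup is fixed for the ancestral allele, and it assumes the f_d denominator replaces both P2 and P3 with whichever has the higher derived-allele frequency at each site. Both assumptions, and all function names, are supplied here for illustration and should be checked against the cited paper before use.

```python
def abba_baba_sums(p1, p2, p3):
    """Sum expected ABBA and BABA pattern frequencies over sites.

    p1, p2, p3 are equal-length lists of derived-allele frequencies.
    Assumption: the outgroup carries only the ancestral allele.
    """
    abba = sum((1 - a) * b * c for a, b, c in zip(p1, p2, p3))
    baba = sum(a * (1 - b) * c for a, b, c in zip(p1, p2, p3))
    return abba, baba


def patterson_d(p1, p2, p3):
    """D = (ABBA - BABA) / (ABBA + BABA), using frequency-based sums."""
    abba, baba = abba_baba_sums(p1, p2, p3)
    total = abba + baba
    return (abba - baba) / total if total > 0 else 0.0


def f_d(p1, p2, p3):
    """Observed ABBA - BABA scaled by the value under complete sharing.

    Assumption: the scaling term substitutes max(p2, p3) for both P2 and P3
    at every site.
    """
    abba, baba = abba_baba_sums(p1, p2, p3)
    pd = [max(b, c) for b, c in zip(p2, p3)]
    max_abba, max_baba = abba_baba_sums(p1, pd, pd)
    denom = max_abba - max_baba
    return (abba - baba) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical frequencies at two sites, not data from any study.
    print(patterson_d([0.0, 0.0], [0.5, 0.5], [1.0, 1.0]))
    print(f_d([0.0, 0.0], [0.5, 0.5], [1.0, 1.0]))
```

In the toy input, D reaches its maximum even though only half of the P2 alleles are shared with P3, while the scaled statistic is lower. This illustrates why a scaled quantity can be more informative about the amount of sharing.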

489 citations


Cites background or methods from "Inferring phylogeny and introgressi..."

  • ...…behavior of Patterson’s D statistic, a test for gene flow based on detecting an inequality in the numbers of ABBA and BABA patterns, using whole-genome analyses across large numbers of informative sites (Green et al. 2010; Yang et al. 2012; Eaton and Ree 2013; Martin et al. 2013; Wall et al. 2013)....

    [...]

  • ...The robustness of the D statistic for detecting a genome-wide excess of shared derived alleles has been thoroughly explored (Green et al. 2010; Durand et al. 2011; Yang et al. 2012; Eaton and Ree 2013; Martin et al. 2013)....

    [...]

References
Journal Article
TL;DR: MUSCLE is a new computer program for creating multiple alignments of protein sequences that includes fast distance estimation using kmer counting, progressive alignment using a new profile function the authors call the log-expectation score, and refinement using tree-dependent restricted partitioning.
Abstract: We describe MUSCLE, a new computer program for creating multiple alignments of protein sequences. Elements of the algorithm include fast distance estimation using kmer counting, progressive alignment using a new profile function we call the log-expectation score, and refinement using tree-dependent restricted partitioning. The speed and accuracy of MUSCLE are compared with T-Coffee, MAFFT and CLUSTALW on four test sets of reference alignments: BAliBASE, SABmark, SMART and a new benchmark, PREFAB. MUSCLE achieves the highest, or joint highest, rank in accuracy on each of these sets. Without refinement, MUSCLE achieves average accuracy statistically indistinguishable from T-Coffee and MAFFT, and is the fastest of the tested methods for large numbers of sequences, aligning 5000 sequences of average length 350 in 7 min on a current desktop computer. The MUSCLE program, source code and PREFAB test data are freely available at http://www.drive5.com/muscle.

37,524 citations


"Inferring phylogeny and introgressi..." refers background in this paper

  • ...The resulting clusters representing putative RAD loci shared across samples are then aligned with Muscle (Edgar 2004)....

    [...]

Journal Article
TL;DR: The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly, and provides more output options than previously, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates.
Abstract: Since its introduction in 2001, MrBayes has grown in popularity as a software package for Bayesian phylogenetic inference using Markov chain Monte Carlo (MCMC) methods. With this note, we announce the release of version 3.2, a major upgrade to the latest official release presented in 2003. The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly. The introduction of new proposals and automatic optimization of tuning parameters has improved convergence for many problems. The new version also sports significantly faster likelihood calculations through streaming single-instruction-multiple-data extensions (SSE) and support of the BEAGLE library, allowing likelihood calculations to be delegated to graphics processing units (GPUs) on compatible hardware. Speedup factors range from around 2 with SSE code to more than 50 with BEAGLE for codon problems. Checkpointing across all models allows long runs to be completed even when an analysis is prematurely terminated. New models include relaxed clocks, dating, model averaging across time-reversible substitution models, and support for hard, negative, and partial (backbone) tree constraints. Inference of species trees from gene trees is supported by full incorporation of the Bayesian estimation of species trees (BEST) algorithms. Marginal model likelihoods for Bayes factor tests can be estimated accurately across the entire model space using the stepping stone method. The new version provides more output options than previously, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates. A wide range of statistics on tree parameters can also be output for visualization in FigTree and compatible software.

18,718 citations


"Inferring phylogeny and introgressi..." refers methods in this paper

  • ...For each locus, we executed two independent runs of MrBayes 3.2.1 (Ronquist et al. 2012) using the GTR+Γ substitution model, each run with four chains for 1,010,000 generations....

    [...]

Journal Article
TL;DR: UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters and offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets.
Abstract: Motivation: Biological sequence data is accumulating rapidly, motivating the development of improved high-throughput methods for sequence classification. Results: UBLAST and USEARCH are new algorithms enabling sensitive local and global search of large sequence databases at exceptionally high speeds. They are often orders of magnitude faster than BLAST in practical applications, though sensitivity to distant protein relationships is lower. UCLUST is a new clustering method that exploits USEARCH to assign sequences to clusters. UCLUST offers several advantages over the widely used program CD-HIT, including higher speed, lower memory use, improved sensitivity, clustering at lower identities and classification of much larger datasets. Availability: Binaries are available at no charge for non-commercial use at http://www.drive5.com/usearch Supplementary information: Supplementary data are available at Bioinformatics online.

17,301 citations


"Inferring phylogeny and introgressi..." refers methods in this paper

  • ...For each sample, sequences are clustered by similarity (here, 90%) using the uclust function in USEARCH (Edgar 2010) with heuristics turned off, yielding clusters representing putative loci....

    [...]
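
The excerpt above describes grouping each sample's reads into putative loci at a 90% similarity threshold. The sketch below illustrates the general idea with a simple greedy, seed-based pass over equal-length reads. It is not UCLUST and does not reproduce its algorithm: it compares reads position by position without alignment, so it cannot handle indels, and the function names and toy reads are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length reads."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_cluster(reads, threshold=0.90):
    """Assign each read to the first cluster whose seed it matches.

    A read that matches no existing seed at or above the threshold
    starts a new cluster and becomes that cluster's seed.
    """
    clusters = []  # list of (seed, members) pairs
    for read in reads:
        for seed, members in clusters:
            if identity(seed, read) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((read, [read]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    # Toy 10-bp reads, invented for this example.
    reads = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC", "AAAAAAAATT"]
    for members in greedy_cluster(reads):
        print(members)
```

Because assignment depends on which read becomes a seed first, the result of a greedy pass like this one can change with input order.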

Journal Article
TL;DR: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML) that has been used to compute ML trees on two of the largest alignments to date.
Abstract: Summary: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Γ yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets ≥4000 taxa it also runs 2-3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25 057 (1463 bp) and 2182 (51 089 bp) taxa, respectively. Availability: icwww.epfl.ch/~stamatak Contact: Alexandros.Stamatakis@epfl.ch Supplementary information: Supplementary data are available at Bioinformatics online.

14,847 citations


"Inferring phylogeny and introgressi..." refers methods in this paper

  • ...Maximum likelihood trees were inferred for the minimum-taxa and full-taxa data sets using RAxML 7.2.8 (Stamatakis 2006), with bootstrap support estimated from 200 replicate searches with random starting trees using the GTR+Γ nucleotide substitution model....

    [...]

Journal Article
04 May 2011-PLOS ONE
TL;DR: A procedure for constructing GBS libraries based on reducing genome complexity with restriction enzymes (REs) is reported, which is simple, quick, extremely specific, highly reproducible, and may reach important regions of the genome that are inaccessible to sequence capture approaches.
Abstract: Advances in next generation technologies have driven the costs of DNA sequencing down to the point that genotyping-by-sequencing (GBS) is now feasible for high diversity, large genome species. Here, we report a procedure for constructing GBS libraries based on reducing genome complexity with restriction enzymes (REs). This approach is simple, quick, extremely specific, highly reproducible, and may reach important regions of the genome that are inaccessible to sequence capture approaches. By using methylation-sensitive REs, repetitive regions of genomes can be avoided and lower copy regions targeted with two to three fold higher efficiency. This tremendously simplifies computationally challenging alignment problems in species with high levels of genetic diversity. The GBS procedure is demonstrated with maize (IBM) and barley (Oregon Wolfe Barley) recombinant inbred populations where roughly 200,000 and 25,000 sequence tags were mapped, respectively. An advantage in species like barley that lack a complete genome sequence is that a reference map need only be developed around the restriction sites, and this can be done in the process of sample genotyping. In such cases, the consensus of the read clusters across the sequence tagged sites becomes the reference. Alternatively, for kinship analyses in the absence of a reference genome, the sequence tags can simply be treated as dominant markers. Future application of GBS to breeding, conservation, and global species and population surveys may allow plant breeders to conduct genomic selection on a novel germplasm or species without first having to develop any prior molecular tools, or conservation biologists to determine population structure without prior knowledge of the genome or diversity in the species.

5,163 citations


"Inferring phylogeny and introgressi..." refers background in this paper

  • ...…as restriction-site associated DNA sequencing (RADseq; Miller et al. 2007; Baird et al. 2008; Rowe et al. 2011), or genotyping-by-sequencing (GBS; Elshire et al. 2011), allow the regions adjacent to restriction sites to be surveyed with deep coverage at a high ratio of samples to sequencing…...

    [...]
