Journal ArticleDOI

Genome-wide RAD sequence data provide unprecedented resolution of species boundaries and relationships in the Lake Victoria cichlid adaptive radiation

TL;DR: This work uses NGS data generated from reduced representation genomic libraries of restriction-site-associated DNA (RAD) markers to infer phylogenetic relationships among 16 species of cichlid fishes from a single rocky island community within Lake Victoria's cichlid adaptive radiation, and produces phylogenetic trees with unprecedented resolution for this group.
Abstract: Although population genomic studies using next generation sequencing (NGS) data are becoming increasingly common, studies focusing on phylogenetic inference using these data are in their infancy. Here, we use NGS data generated from reduced representation genomic libraries of restriction-site-associated DNA (RAD) markers to infer phylogenetic relationships among 16 species of cichlid fishes from a single rocky island community within Lake Victoria's cichlid adaptive radiation. Previous attempts at sequence-based phylogenetic analyses in Victoria cichlids have shown extensive sharing of genetic variation among species and no resolution of species or higher-level relationships. These patterns have generally been attributed to the very recent origin ( 5.8 million base pairs in width), species are reciprocally monophyletic with high bootstrap support, and the majority of internal branches on the tree have high support. Given the difficulty of the phylogenetic problem that the Lake Victoria cichlid adaptive radiation represents, these results are striking. The strict interpretation of the topologies we present here warrants caution because many questions remain about phylogenetic inference with very large genomic data sets and because, with the current analysis, we cannot distinguish between the effects of shared ancestry and post-speciation gene flow. However, these results provide the first conclusive evidence for the monophyly of species in the Lake Victoria cichlid radiation and demonstrate the power that NGS data sets hold to resolve even the most difficult of phylogenetic challenges.
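
The abstract's claim that species are "reciprocally monophyletic" can be made concrete with a small check. The following is a minimal sketch, not the authors' method: it assumes a rooted tree written as nested Python tuples with string leaf labels, and the sample labels (`spA_1`, `spB_1`, ...) are hypothetical. A species counts as monophyletic here if some subtree contains exactly its individuals and nothing else.

```python
def collect_clades(tree, clades):
    """Return the leaf set of `tree` and record the leaf set of every subtree."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(collect_clades(child, clades) for child in tree))
    clades.add(leaves)
    return leaves


def is_monophyletic(tree, members):
    """True if the individuals in `members` form exactly one clade of the rooted tree."""
    clades = set()
    collect_clades(tree, clades)
    return frozenset(members) in clades


# Hypothetical rooted tree: two individuals each of species A and B, one of C.
toy_tree = ((("spA_1", "spA_2"), ("spB_1", "spB_2")), "spC_1")
species_a_ok = is_monophyletic(toy_tree, {"spA_1", "spA_2"})    # forms a clade
mixed_group_ok = is_monophyletic(toy_tree, {"spA_1", "spB_1"})  # does not
```

On real data the tree would come from a phylogenetics package and each internal branch would also carry a bootstrap value; this sketch only covers the topological test.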
Citations
Journal ArticleDOI
TL;DR: The expanded population genomics functions in Stacks will make it a useful tool to harness the newest generation of massively parallel genotyping data for ecological and evolutionary genetics.
Abstract: Massively parallel short-read sequencing technologies, coupled with powerful software platforms, are enabling investigators to analyse tens of thousands of genetic markers. This wealth of data is rapidly expanding and allowing biological questions to be addressed with unprecedented scope and precision. The sizes of the data sets are now posing significant data processing and analysis challenges. Here we describe an extension of the Stacks software package to efficiently use genotype-by-sequencing data for studies of populations of organisms. Stacks now produces core population genomic summary statistics and SNP-by-SNP statistical tests. These statistics can be analysed across a reference genome using a smoothed sliding window. Stacks also now provides several output formats for several commonly used downstream analysis packages. The expanded population genomics functions in Stacks will make it a useful tool to harness the newest generation of massively parallel genotyping data for ecological and evolutionary genetics.

2,958 citations


Cites background from "Genome-wide RAD sequence data provi..."

  • ...Phylogeographic studies using GBS markers have recently been completed in the pitcher plant mosquito Wyeomyia smithii (Emerson et al. 2010; Merz et al. 2013), carnivorous plant Sarracenia alata (Zellmer et al. 2012), cichlid fishes in Lake Victoria (Wagner et al. 2013), ninespine stickleback Pungitius pungitius in Scandinavia (Bruneaux et al. 2013) and recently diverged species of birds (McCormack et al. 2012)....

  • ...…et al. 2013), Heliconius butterflies (Nadeau et al. 2013), trees in the genus Populus (Stolting et al. 2013), cichlid species (Keller et al. 2013; Wagner et al. 2013), different lineages of trout (Hohenlohe et al. 2011; Amish et al. 2012; Everett et al. 2012; Hecht et al. 2012, 2013; Miller et…...
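
The Stacks abstract above mentions analysing statistics "across a reference genome using a smoothed sliding window". One common way to do this is a Gaussian-weighted average of per-site values around each window centre. The sketch below illustrates that idea only; it is not Stacks' code, and the function name, the 3-sigma cutoff, and the parameter names are assumptions made for illustration.

```python
import math


def smoothed_window(positions, values, center, sigma):
    """Gaussian-weighted mean of per-site values around `center`.

    positions: site coordinates (bp) on one chromosome or scaffold
    values:    one statistic per site (for example a per-SNP diversity value)
    sigma:     kernel width in bp; sites farther than 3 * sigma are ignored
    Returns None when no site falls inside the window.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for pos, val in zip(positions, values):
        dist = pos - center
        if abs(dist) > 3 * sigma:
            continue
        weight = math.exp(-(dist * dist) / (2 * sigma * sigma))
        weighted_sum += weight * val
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None
```

Sliding the centre along the genome in fixed steps and calling `smoothed_window` at each step gives the smoothed profile; sites close to the centre dominate, and distant sites contribute little.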

Journal ArticleDOI
TL;DR: This Review provides a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.
Abstract: High-throughput techniques based on restriction site-associated DNA sequencing (RADseq) are enabling the low-cost discovery and genotyping of thousands of genetic markers for any species, including non-model organisms, which is revolutionizing ecological, evolutionary and conservation genetics. Technical differences among these methods lead to important considerations for all steps of genomics studies, from the specific scientific questions that can be addressed, and the costs of library preparation and sequencing, to the types of bias and error inherent in the resulting data. In this Review, we provide a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.

1,102 citations

Journal ArticleDOI
David Brawand, Catherine E. Wagner, Yang I. Li, Milan Malinsky, Irene Keller, Shaohua Fan, Oleg Simakov, Alvin Yu Jin Ng, Zhi Wei Lim, Etienne Bezault, Jason Turner-Maier, Jeremy A. Johnson, Rosa Alcazar, Hyun Ji Noh, Pamela Russell, Bronwen Aken, Jessica Alföldi, Chris T. Amemiya, Naoual Azzouzi, Jean-François Baroiller, Frédérique Barloy-Hubler, Aaron M. Berlin, Ryan F. Bloomquist, Karen L. Carleton, Matthew A. Conte, Helena D'Cotta, Orly Eshel, Leslie Gaffney, Francis Galibert, Hugo F. Gante, Sante Gnerre, Lucie Greuter, Richard Guyon, Natalie S. Haddad, Wilfried Haerty, Robert M Harris, Hans A. Hofmann, Thibaut Hourlier, Gideon Hulata, David B. Jaffe, Marcia Lara, Alison P. Lee, Iain MacCallum, Salome Mwaiko, Masato Nikaido, Hidenori Nishihara, Catherine Ozouf-Costaz, David J. Penman, Dariusz Przybylski, Michaelle Rakotomanga, Suzy C. P. Renn, Filipe J. Ribeiro, Micha Ron, Walter Salzburger, Luis Sanchez-Pulido, M. Emília Santos, Steve Searle, Ted Sharpe, Ross Swofford, Frederick J. Tan, Louise Williams, Sarah Young, Shuangye Yin, Norihiro Okada, Thomas D. Kocher, Eric A. Miska, Eric S. Lander, Byrappa Venkatesh, Russell D. Fernald, Axel Meyer, Chris P. Ponting, J. Todd Streelman, Kerstin Lindblad-Toh, Ole Seehausen, Federica Di Palma
18 Sep 2014, Nature
TL;DR: This article found an excess of gene duplications in the East African lineage compared to Nile tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs.
Abstract: Cichlid fishes are famous for large, diverse and replicated adaptive radiations in the Great Lakes of East Africa. To understand the molecular mechanisms underlying cichlid phenotypic diversity, we sequenced the genomes and transcriptomes of five lineages of African cichlids: the Nile tilapia (Oreochromis niloticus), an ancestral lineage with low diversity; and four members of the East African lineage: Neolamprologus brichardi/pulcher (older radiation, Lake Tanganyika), Metriaclima zebra (recent radiation, Lake Malawi), Pundamilia nyererei (very recent radiation, Lake Victoria), and Astatotilapia burtoni (riverine species around Lake Tanganyika). We found an excess of gene duplications in the East African lineage compared to tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs. In addition, we analysed sequence data from sixty individuals representing six closely related species from Lake Victoria, and show genome-wide diversifying selection on coding and regulatory variants, some of which were recruited from ancient polymorphisms. We conclude that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.

832 citations

Journal ArticleDOI
TL;DR: It is shown through reanalysis of an empirical RADseq dataset that indels are a common feature of such data, even at shallow phylogenetic scales, and PyRAD uses parallel processing as well as an optional hierarchical clustering method, which allows it to rapidly assemble phylogenetic datasets with hundreds of sampled individuals.
Abstract: Motivation: Restriction-site associated genomic markers are a powerful tool for investigating evolutionary questions at the population level, but are limited in their utility at deeper phylogenetic scales where fewer orthologous loci are typically recovered across disparate taxa. While this limitation stems in part from mutations to restriction recognition sites that disrupt data generation, an additional source of data loss comes from the failure to identify homology during bioinformatic analyses. Clustering methods that allow for lower similarity thresholds and the inclusion of indel variation will perform better at assembling RADseq loci at the phylogenetic scale. Results: PyRAD is a pipeline to assemble de novo RADseq loci with the aim of optimizing coverage across phylogenetic data sets. It utilizes a wrapper around an alignment-clustering algorithm which allows for indel variation within and between samples, as well as for incomplete overlap among reads (e.g., paired-end). Here I compare PyRAD with the program Stacks in their performance analyzing a simulated RADseq data set that includes indel variation. Indels disrupt clustering of homologous loci in Stacks but not in PyRAD, such that the latter recovers more shared loci across disparate taxa. I show through re-analysis of an empirical RADseq data set that indels are a common feature of such data, even at shallow phylogenetic scales. PyRAD utilizes parallel processing as well as an optional hierarchical clustering method which allow it to rapidly assemble phylogenetic data sets with hundreds of sampled individuals. Availability: Software is written in Python and freely available at http://www.dereneaton.com/software/ Supplement: Scripts to completely reproduce all simulated and empirical analyses are available in the Supplementary Materials.

710 citations


Cites background from "Genome-wide RAD sequence data provi..."

  • ..., 2013), however, to date all empirical RADseq studies conducted above the species-level were done at much shallower scales (The Heliconius Genome Consortium, 2012; Wagner et al., 2013; Wang et al., 2013; Eaton and Ree, 2013)....
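
The PyRAD abstract above argues that indels disrupt clustering of homologous loci unless the clustering method allows for them. A small comparison shows why. This is a sketch of the general point, not PyRAD's or Stacks' algorithm: it compares a position-by-position mismatch count with a standard edit (Levenshtein) distance on two made-up reads that differ by a single deleted base.

```python
def positional_mismatches(a, b):
    """Count mismatches position by position, with no gaps allowed."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def edit_distance(a, b):
    """Levenshtein distance: substitutions, insertions and deletions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x from a
                           cur[j - 1] + 1,           # insert y into a
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


# Made-up reads: read_b is read_a with the first T deleted and one base read on.
read_a = "ACGTACGTAC"
read_b = "ACGACGTACG"
gapless = positional_mismatches(read_a, read_b)  # 7: every base after the indel is shifted
gapped = edit_distance(read_a, read_b)           # 2: one deletion plus one trailing base
```

A gapless similarity threshold would treat these two reads as different loci, whereas an indel-aware comparison would still group them.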

David Brawand, Catherine E. Wagner, Yang I. Li, Milan Malinsky, Irene Keller, Shaohua Fan, Oleg Simakov, Alvin Yu Jin Ng, Zhi Wei Lim, Etienne Bezault, Jason Turner-Maier, Jeremy A. Johnson, Rosa Alcazar, Hyun Ji Noh, Pamela Russell, Bronwen Aken, Jessica Alföldi, Chris T. Amemiya, Naoual Azzouzi, Jean-François Baroiller, Frédérique Barloy-Hubler, Aaron M. Berlin, Ryan F. Bloomquist, Karen L. Carleton, Matthew A. Conte, Helena D'Cotta, Orly Eshel, Leslie Gaffney, Francis Galibert, Hugo F. Gante, Sante Gnerre, Lucie Greuter, Richard Guyon, Natalie S. Haddad, Wilfried Haerty, Robert M Harris, Hans A. Hofmann, Thibaut Hourlier, Gideon Hulata, David B. Jaffe, Marcia Lara, Alison P. Lee, Iain MacCallum, Salome Mwaiko, Masato Nikaido, Hidenori Nishihara, Catherine Ozouf-Costaz, David J. Penman, Dariusz Przybylski, Michaelle Rakotomanga, Suzy C. P. Renn, Filipe J. Ribeiro, Micha Ron, Walter Salzburger, Luis Sanchez-Pulido, M. Emília Santos, Steve Searle, Ted Sharpe, Ross Swofford, Frederick J. Tan, Louise Williams, Sarah Young, Shuangye Yin, Norihiro Okada, Thomas D. Kocher, Eric A. Miska, Eric S. Lander, Byrappa Venkatesh, Russell D. Fernald, Axel Meyer, Chris P. Ponting, J. Todd Streelman, Kerstin Lindblad-Toh, Ole Seehausen, Federica Di Palma
01 Sep 2014
TL;DR: It is concluded that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.
Abstract: Cichlid fishes are famous for large, diverse and replicated adaptive radiations in the Great Lakes of East Africa. To understand the molecular mechanisms underlying cichlid phenotypic diversity, we sequenced the genomes and transcriptomes of five lineages of African cichlids: the Nile tilapia (Oreochromis niloticus), an ancestral lineage with low diversity; and four members of the East African lineage: Neolamprologus brichardi/pulcher (older radiation, Lake Tanganyika), Metriaclima zebra (recent radiation, Lake Malawi), Pundamilia nyererei (very recent radiation, Lake Victoria), and Astatotilapia burtoni (riverine species around Lake Tanganyika). We found an excess of gene duplications in the East African lineage compared to tilapia and other teleosts, an abundance of non-coding element divergence, accelerated coding sequence evolution, expression divergence associated with transposable element insertions, and regulation by novel microRNAs. In addition, we analysed sequence data from sixty individuals representing six closely related species from Lake Victoria, and show genome-wide diversifying selection on coding and regulatory variants, some of which were recruited from ancient polymorphisms. We conclude that a number of molecular mechanisms shaped East African cichlid genomes, and that amassing of standing variation during periods of relaxed purifying selection may have been important in facilitating subsequent evolutionary diversification.

666 citations

References
Journal ArticleDOI
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.
Abstract: Next-generation DNA sequencing (NGS) projects, such as the 1000 Genomes Project, are already revolutionizing our understanding of genetic variation among individuals. However, the massive data sets generated by NGS—the 1000 Genome pilot alone includes nearly five terabases—make writing feature-rich, efficient, and robust analysis tools difficult for even computationally sophisticated individuals. Indeed, many professionals are limited in the scope and the ease with which they can answer scientific questions by the complexity of accessing and manipulating the data produced by these machines. Here, we discuss our Genome Analysis Toolkit (GATK), a structured programming framework designed to ease the development of efficient and robust analysis tools for next-generation DNA sequencers using the functional programming philosophy of MapReduce. The GATK provides a small but rich set of data access patterns that encompass the majority of analysis tool needs. Separating specific analysis calculations from common data management infrastructure enables us to optimize the GATK framework for correctness, stability, and CPU and memory efficiency and to enable distributed and shared memory parallelization. We highlight the capabilities of the GATK by describing the implementation and application of robust, scale-tolerant tools like coverage calculators and single nucleotide polymorphism (SNP) calling. We conclude that the GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

20,557 citations


"Genome-wide RAD sequence data provi..." refers methods in this paper

  • ...Genotypes were called for the complete sample of individuals with Unified Genotyper from the Genome Analysis Tool kit v.1.4–19, using the SNP genotype likelihood model (GATK; DePristo et al. 2011; McKenna et al. 2010)....
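
The GATK abstract above describes analysis tools built on "the functional programming philosophy of MapReduce" and names coverage calculators as one example. The sketch below shows that style in miniature. It is an illustration only and does not use or imitate the GATK API: reads are represented as hypothetical `(start, length)` tuples, the map step turns each read into per-position counts, and the reduce step sums them.

```python
from collections import Counter
from functools import reduce


def map_read(read):
    """Map step: one aligned read -> a count of 1 at every position it covers."""
    start, length = read
    return Counter(range(start, start + length))


def reduce_coverage(total, partial):
    """Reduce step: add one read's counts into the running coverage."""
    total.update(partial)  # Counter.update adds counts instead of replacing them
    return total


def coverage(reads):
    """Per-position read depth for (start, length) reads on one reference sequence."""
    return reduce(reduce_coverage, map(map_read, reads), Counter())


# Two overlapping toy reads: positions 0-2 and 2-4, so only position 2 has depth 2.
depth = coverage([(0, 3), (2, 3)])
```

Because the map step touches each read independently and the reduce step is a plain sum, the work can in principle be split across workers and merged afterwards, which is the property the abstract attributes to the framework.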

Journal ArticleDOI
TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches, and multiple processor cores can be used simultaneously to achieve even greater alignment speeds.
Abstract: Bowtie is an ultrafast, memory-efficient alignment program for aligning short DNA sequence reads to large genomes. For the human genome, Burrows-Wheeler indexing allows Bowtie to align more than 25 million reads per CPU hour with a memory footprint of approximately 1.3 gigabytes. Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches. Multiple processor cores can be used simultaneously to achieve even greater alignment speeds. Bowtie is open source http://bowtie.cbcb.umd.edu.

20,335 citations


"Genome-wide RAD sequence data provi..." refers background or methods in this paper

  • ...Using the de novo assembly constructed in ustacks, we mapped the quality filtered, demultiplexed reads from each individual separately to the consensus sequences from these loci in bowtie v. 0.12.7 (Langmead et al. 2009), allowing no more than two mismatches....
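
The Bowtie abstract above rests on Burrows-Wheeler indexing. The sketch below shows the core idea on a toy scale: build a Burrows-Wheeler transform with its lookup tables, then locate exact occurrences of a read by backward search. It is a teaching sketch under stated assumptions, not Bowtie's implementation. In particular it handles exact matches only and omits the quality-aware backtracking that lets Bowtie tolerate mismatches, and the function names are made up for illustration.

```python
from collections import Counter


def build_index(text):
    """Build a toy FM-index: suffix array, first-column offsets, occurrence table."""
    text += "$"  # sentinel; assumed absent from the text and smaller than A, C, G, T
    # Sorting whole suffixes is fine for toy inputs but far too slow for genomes.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = Counter(text)
    first, running = {}, 0
    for char in sorted(counts):  # first[c] = number of characters smaller than c
        first[char] = running
        running += counts[char]
    occ = {char: [0] for char in counts}  # occ[c][i] = occurrences of c in bwt[:i]
    for bwt_char in bwt:
        for char in occ:
            occ[char].append(occ[char][-1] + (bwt_char == char))
    return suffix_array, first, occ


def find(pattern, index):
    """Return sorted start positions of exact matches of `pattern` (backward search)."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for char in reversed(pattern):
        if char not in first:
            return []
        lo = first[char] + occ[char][lo]
        hi = first[char] + occ[char][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = build_index("GATTACAGATTACA")
hits = find("ATTA", index)  # [1, 8]
```

Each step of the backward search narrows the range of matching suffixes using only table lookups, which is what makes index-based read mapping fast once the index has been built.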

Journal ArticleDOI
TL;DR: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML) that has been used to compute ML trees on two of the largest alignments to date.
Abstract: Summary: RAxML-VI-HPC (randomized axelerated maximum likelihood for high performance computing) is a sequential and parallel program for inference of large phylogenies with maximum likelihood (ML). Low-level technical optimizations, a modification of the search algorithm, and the use of the GTR+CAT approximation as replacement for GTR+Γ yield a program that is between 2.7 and 52 times faster than the previous version of RAxML. A large-scale performance comparison with GARLI, PHYML, IQPNNI and MrBayes on real data containing 1000 up to 6722 taxa shows that RAxML requires at least 5.6 times less main memory and yields better trees in similar times than the best competing program (GARLI) on datasets up to 2500 taxa. On datasets ≥4000 taxa it also runs 2-3 times faster than GARLI. RAxML has been parallelized with MPI to conduct parallel multiple bootstraps and inferences on distinct starting trees. The program has been used to compute ML trees on two of the largest alignments to date containing 25 057 (1463 bp) and 2182 (51 089 bp) taxa, respectively. Availability: icwww.epfl.ch/~stamatak Contact: Alexandros.Stamatakis@epfl.ch Supplementary information: Supplementary data are available at Bioinformatics online.

14,847 citations


"Genome-wide RAD sequence data provi..." refers methods in this paper

  • ...Because of its ability to efficiently handle very large data sets, we used a maximum likelihood approach in RAxML 7.2.8 for phylogenetic analyses (Stamatakis 2006)....

Journal ArticleDOI
TL;DR: A unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs is presented.
Abstract: Recent advances in sequencing technology make it possible to comprehensively catalogue genetic variation in population samples, creating a foundation for understanding human disease, ancestry and evolution. The amounts of raw data produced are prodigious and many computational steps are required to translate this output into high-quality variant calls. We present a unified analytic framework to discover and genotype variation among multiple samples simultaneously that achieves sensitive and specific results across five sequencing technologies and three distinct, canonical experimental designs. Our process includes (1) initial read mapping; (2) local realignment around indels; (3) base quality score recalibration; (4) SNP discovery and genotyping to find all potential variants; and (5) machine learning to separate true segregating variation from machine artifacts common to next-generation sequencing technologies. We discuss the application of these tools, instantiated in the Genome Analysis Toolkit (GATK), to deep whole-genome, whole-exome capture, and multi-sample low-pass (~4×) 1000 Genomes Project datasets.

10,056 citations


"Genome-wide RAD sequence data provi..." refers methods in this paper

  • ...Genotypes were called for the complete sample of individuals with Unified Genotyper from the Genome Analysis Tool kit v.1.4–19, using the SNP genotype likelihood model (GATK; DePristo et al. 2011; McKenna et al. 2010)....

Journal ArticleDOI
TL;DR: This work developed, implemented, and thoroughly tested rapid bootstrap heuristics in RAxML (Randomized Axelerated Maximum Likelihood) that are more than an order of magnitude faster than current algorithms and can contribute to resolving the computational bottleneck and improve current methodology in phylogenetic analyses.
Abstract: Despite recent advances achieved by application of high-performance computing methods and novel algorithmic techniques to maximum likelihood (ML)-based inference programs, the major computational bottleneck still consists in the computation of bootstrap support values. Conducting a probably insufficient number of 100 bootstrap (BS) analyses with current ML programs on large datasets—either with respect to the number of taxa or base pairs—can easily require a month of run time. Therefore, we have developed, implemented, and thoroughly tested rapid bootstrap heuristics in RAxML (Randomized Axelerated Maximum Likelihood) that are more than an order of magnitude faster than current algorithms. These new heuristics can contribute to resolving the computational bottleneck and improve current methodology in phylogenetic analyses. Computational experiments to assess the performance and relative accuracy of these heuristics were conducted on 22 diverse DNA and AA (amino acid), single gene as well as multigene, real-world alignments containing 125 up to 7764 sequences. The standard BS (SBS) and rapid BS (RBS) values drawn on the best-scoring ML tree are highly correlated and show almost identical average support values. The weighted RF (Robinson-Foulds) distance between SBS- and RBS-based consensus trees was smaller than 6% in all cases (average 4%). More importantly, RBS inferences are between 8 and 20 times faster (average 14.73) than SBS analyses with RAxML and between 18 and 495 times faster than BS analyses with competing programs, such as PHYML or GARLI. Moreover, this performance improvement increases with alignment size. Finally, we have set up two freely accessible Web servers for this significantly improved version of RAxML that provide access to the 200-CPU cluster of the Vital-IT unit at the Swiss Institute of Bioinformatics and the 128-CPU cluster of the CIPRES project at the San Diego Supercomputer Center. 
These Web servers offer the possibility to conduct large-scale phylogenetic inferences to a large part of the community that does not have access to, or the expertise to use, high-performance computing resources. (Maximum likelihood; phylogenetic inference; rapid bootstrap; RAxML; support values.)

6,585 citations


"Genome-wide RAD sequence data provi..." refers methods in this paper

  • ...…gamma model of sequence evolution, as recommended and justified by the authors of the program in the version 7.0.4 manual, for single full ML tree searches, and 100 replicates of RAxML’s rapid bootstrap algorithm to account for uncertainty in the estimation of the topology (Stamatakis et al. 2008)....
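
The excerpt above refers to 100 bootstrap replicates used to account for uncertainty in the topology. The resampling step behind any such analysis can be sketched briefly. Note that this shows the standard nonparametric bootstrap (sampling alignment columns with replacement), not RAxML's rapid bootstrap heuristic, and the alignment format (a dict of equal-length sequences) and function names are assumptions made for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length.

    alignment: dict mapping taxon name -> aligned sequence (all the same length)
    rng:       a random.Random instance, so replicates are reproducible
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {taxon: "".join(seq[i] for i in columns) for taxon, seq in alignment.items()}


def clade_support(clade, replicate_clades):
    """Fraction of replicate trees (each given as a set of frozenset clades) with `clade`."""
    clade = frozenset(clade)
    return sum(clade in clades for clades in replicate_clades) / len(replicate_clades)
```

In a full analysis, a tree would be inferred from every resampled alignment, and `clade_support` would then report how often each clade of the best tree reappears across replicates.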
