Journal ArticleDOI

Genotyping‐by‐sequencing in ecological and conservation genomics

01 Jun 2013-Molecular Ecology (NIH Public Access)-Vol. 22, Iss: 11, pp 2841-2847
TL;DR: This special issue on ‘Genotyping-by-Sequencing in Ecological and Conservation Genomics’ represents a diverse set of empirical and theoretical studies that demonstrate both the utility and some of the challenges of GBS in ecological and conservation genomics.
Abstract: The fields of ecological and conservation genetics have developed greatly in recent decades through the use of molecular markers to investigate organisms in their natural habitat and to evaluate the effect of anthropogenic disturbances. However, many of these studies have been limited to narrow regions of the genome, allowing for limited inferences but making it difficult to generalize about the organisms and their evolutionary history. Tremendous advances in sequencing technology over the last decade (i.e. next-generation sequencing; NGS) have led to the ability to sample the genome much more densely and to observe the patterns of genetic variation that result from the full range of evolutionary processes acting across the genome (Allendorf et al. 2010; Stapley et al. 2010; Li et al. 2012). These studies are transforming molecular ecology by making many long-standing questions much more easily accessible in almost any organism. When studying the genetics of wild populations, it is desirable to sample tens, hundreds or even thousands of individuals. While it is now possible to sequence whole genomes for tens of individuals with small genome sizes, the sequencing of hundreds of individuals with large genomes remains prohibitively expensive, particularly where the genome sequence is unknown. Further, for the purpose of many studies, complete genomic sequence data for all individuals would be unnecessary and simply inflate the computational and bioinformatic costs. A major recent advance has been the development of genotyping-by-sequencing (GBS) approaches that allow a targeted fraction of the genome (a reduced representation library) to be sequenced with next-generation technology rather than the entire genome, even in species with little or no previous genomic information and large genomes.
The subset of the genome to be sequenced in these GBS approaches may be targeted using restriction enzymes or capture probes or by sequencing the transcriptome (reviewed in Davey et al. 2011). In the future, as sequencing technology and computational and bioinformatic methods develop further, whole-genome resequencing may become the predominant method for ecological and conservation genomics. Currently, reduced representation approaches offer the ability to not only discover genetic variants such as SNPs but also genotype individuals at these newly discovered loci in the same data. This special issue on ‘Genotyping-by-Sequencing in Ecological and Conservation Genomics’ represents a diverse set of empirical and theoretical studies that demonstrate both the utility and some of the challenges of GBS in ecological and conservation genomics. The empirical studies include demonstrations of the utility of GBS for population genomics and association mapping, as well as the development of genomic resources (i.e. large SNP data sets) for target species. The studies also illustrate some of the differences between GBS methods, in particular, aligning paired-end reads to achieve longer consensus sequences in contrast to single-end reads with shorter alignments, and double-digest versus sonication methods to fragment DNA. In addition, several papers describe advanced data pipelines for handling GBS-related sequence data and critically evaluate best practices for GBS methods and potential biases and novel features associated with GBS data. Overall, this compilation of papers emphasizes that GBS has been quickly adopted by the scientific community and is expected to become a common tool for studies in molecular ecology.

Citations
Journal ArticleDOI
28 Feb 2014-PLOS ONE
TL;DR: The tassel-gbs pipeline, designed for the efficient processing of raw GBS sequence data into SNP genotypes, is described and benchmarked on a large-scale, species-wide analysis in maize, where the average error rate was reduced to 0.0042.
Abstract: Genotyping by sequencing (GBS) is a next generation sequencing based method that takes advantage of reduced representation to enable high throughput genotyping of large numbers of individuals at a large number of SNP markers. The relatively straightforward, robust, and cost-effective GBS protocol is currently being applied in numerous species by a large number of researchers. Herein we describe a bioinformatics pipeline, tassel-gbs, designed for the efficient processing of raw GBS sequence data into SNP genotypes. The tassel-gbs pipeline successfully fulfills the following key design criteria: (1) Ability to run on the modest computing resources that are typically available to small breeding or ecological research programs, including desktop or laptop machines with only 8–16 GB of RAM, (2) Scalability from small to extremely large studies, where hundreds of thousands or even millions of SNPs can be scored in up to 100,000 individuals (e.g., for large breeding programs or genetic surveys), and (3) Applicability in an accelerated breeding context, requiring rapid turnover from tissue collection to genotypes. Although a reference genome is required, the pipeline can also be run with an unfinished “pseudo-reference” consisting of numerous contigs. We describe the tassel-gbs pipeline in detail and benchmark it based upon a large scale, species wide analysis in maize (Zea mays), where the average error rate was reduced to 0.0042 through application of population genetic-based SNP filters. Overall, the GBS assay and the tassel-gbs pipeline provide robust tools for studying genomic diversity.

1,315 citations

Journal ArticleDOI
TL;DR: This Review provides a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.
Abstract: High-throughput techniques based on restriction site-associated DNA sequencing (RADseq) are enabling the low-cost discovery and genotyping of thousands of genetic markers for any species, including non-model organisms, which is revolutionizing ecological, evolutionary and conservation genetics. Technical differences among these methods lead to important considerations for all steps of genomics studies, from the specific scientific questions that can be addressed, and the costs of library preparation and sequencing, to the types of bias and error inherent in the resulting data. In this Review, we provide a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.

1,102 citations

Journal ArticleDOI
TL;DR: It is shown through reanalysis of an empirical RADseq dataset that indels are a common feature of such data, even at shallow phylogenetic scales, and PyRAD uses parallel processing as well as an optional hierarchical clustering method, which allows it to rapidly assemble phylogenetic datasets with hundreds of sampled individuals.
Abstract: Motivation: Restriction-site associated genomic markers are a powerful tool for investigating evolutionary questions at the population level, but are limited in their utility at deeper phylogenetic scales where fewer orthologous loci are typically recovered across disparate taxa. While this limitation stems in part from mutations to restriction recognition sites that disrupt data generation, an additional source of data loss comes from the failure to identify homology during bioinformatic analyses. Clustering methods that allow for lower similarity thresholds and the inclusion of indel variation will perform better at assembling RADseq loci at the phylogenetic scale. Results: PyRAD is a pipeline to assemble de novo RADseq loci with the aim of optimizing coverage across phylogenetic data sets. It utilizes a wrapper around an alignment-clustering algorithm which allows for indel variation within and between samples, as well as for incomplete overlap among reads (e.g., paired-end). Here I compare PyRAD with the program Stacks in their performance analyzing a simulated RADseq data set that includes indel variation. Indels disrupt clustering of homologous loci in Stacks but not in PyRAD, such that the latter recovers more shared loci across disparate taxa. I show through re-analysis of an empirical RADseq data set that indels are a common feature of such data, even at shallow phylogenetic scales. PyRAD utilizes parallel processing as well as an optional hierarchical clustering method which allow it to rapidly assemble phylogenetic data sets with hundreds of sampled individuals. Availability: Software is written in Python and freely available at http://www.dereneaton.com/software/ Supplement: Scripts to completely reproduce all simulated and empirical analyses are available in the Supplementary Materials.

710 citations

Journal ArticleDOI
Hans Ellegren
TL;DR: High-throughput sequencing technologies are revolutionizing the life sciences, and the past 12 months have seen a burst of genome sequences from non-model organisms, in each case representing a fundamental source of data of significant importance to biological research.
Abstract: High-throughput sequencing technologies are revolutionizing the life sciences. The past 12 months have seen a burst of genome sequences from non-model organisms, in each case representing a fundamental source of data of significant importance to biological research. This has bearing on several aspects of evolutionary biology, and we are now beginning to see patterns emerging from these studies. These include significant heterogeneity in the rate of recombination that affects adaptive evolution and base composition, the role of population size in adaptive evolution, and the importance of expansion of gene families in lineage-specific adaptation. Moreover, resequencing of population samples (population genomics) has enabled the identification of the genetic basis of critical phenotypes and cast light on the landscape of genomic divergence during speciation.

607 citations

Journal ArticleDOI
TL;DR: Genotyping-by-sequencing (GBS) has been developed and applied in sequencing multiplexed samples that combine molecular marker discovery and genotyping, and has been successfully used for genome-wide association studies (GWAS), genomic diversity studies, genetic linkage analysis, molecular marker discovery and genomic selection in large-scale plant breeding programs.
Abstract: Marker-assisted selection (MAS) refers to the use of molecular markers to assist phenotypic selections in crop improvement. Several types of molecular markers, such as single nucleotide polymorphism (SNP), have been identified and effectively used in plant breeding. The application of next-generation sequencing (NGS) technologies has led to remarkable advances in whole genome sequencing, which provides ultra-throughput sequences to revolutionize plant genotyping and breeding. To further broaden NGS usages to large crop genomes such as maize and wheat, genotyping-by-sequencing (GBS) has been developed and applied in sequencing multiplexed samples that combine molecular marker discovery and genotyping. GBS is a novel application of NGS protocols for discovering and genotyping SNPs in crop genomes and populations. The GBS approach includes the digestion of genomic DNA with restriction enzymes followed by the ligation of barcode adapter, PCR amplification and sequencing of the amplified DNA pool on a single lane of flow cells. Bioinformatic pipelines are needed to analyze and interpret GBS datasets. As an ultimate MAS tool and a cost-effective technique, GBS has been successfully used in implementing genome-wide association study (GWAS), genomic diversity study, genetic linkage analysis, molecular marker discovery and genomic selection under a large scale of plant breeding programs.

500 citations

References
Journal ArticleDOI
04 May 2011-PLOS ONE
TL;DR: A procedure for constructing GBS libraries based on reducing genome complexity with restriction enzymes (REs) is reported, which is simple, quick, extremely specific, highly reproducible, and may reach important regions of the genome that are inaccessible to sequence capture approaches.
Abstract: Advances in next generation technologies have driven the costs of DNA sequencing down to the point that genotyping-by-sequencing (GBS) is now feasible for high diversity, large genome species. Here, we report a procedure for constructing GBS libraries based on reducing genome complexity with restriction enzymes (REs). This approach is simple, quick, extremely specific, highly reproducible, and may reach important regions of the genome that are inaccessible to sequence capture approaches. By using methylation-sensitive REs, repetitive regions of genomes can be avoided and lower copy regions targeted with two to three fold higher efficiency. This tremendously simplifies computationally challenging alignment problems in species with high levels of genetic diversity. The GBS procedure is demonstrated with maize (IBM) and barley (Oregon Wolfe Barley) recombinant inbred populations where roughly 200,000 and 25,000 sequence tags were mapped, respectively. An advantage in species like barley that lack a complete genome sequence is that a reference map need only be developed around the restriction sites, and this can be done in the process of sample genotyping. In such cases, the consensus of the read clusters across the sequence tagged sites becomes the reference. Alternatively, for kinship analyses in the absence of a reference genome, the sequence tags can simply be treated as dominant markers. Future application of GBS to breeding, conservation, and global species and population surveys may allow plant breeders to conduct genomic selection on a novel germplasm or species without first having to develop any prior molecular tools, or conservation biologists to determine population structure without prior knowledge of the genome or diversity in the species.
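The complexity reduction described in this abstract can be illustrated with a small in silico digest. This is a minimal sketch under stated assumptions, not the authors' protocol: the function names `digest` and `size_select`, the toy sequence, the degenerate recognition site `GC[AT]GC`, the simplification that the enzyme cuts at the start of the site, and the size window are all hypothetical choices made for illustration.

```python
import re


def digest(sequence, site_regex):
    """Split `sequence` at the start of every occurrence of `site_regex`.

    A lookahead is used so that overlapping recognition sites are all found.
    Cutting exactly at the start of the site is a simplification.
    """
    cut_positions = [m.start() for m in re.finditer(f"(?={site_regex})", sequence)]
    bounds = [0] + cut_positions + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]


def size_select(fragments, min_len, max_len):
    """Keep only fragments whose length falls inside the chosen window."""
    return [f for f in fragments if min_len <= len(f) <= max_len]


# Toy "genome" built from labelled pieces so the cut sites are easy to see.
toy_genome = "AAAA" + "GCAGC" + "TTTTTT" + "GCTGC" + "AA" + "GCAGC" + "CCCCCCCC"

fragments = digest(toy_genome, r"GC[AT]GC")   # hypothetical degenerate site
selected = size_select(fragments, 5, 12)       # hypothetical size window

# Fraction of the toy genome retained in the reduced representation.
fraction_retained = sum(len(f) for f in selected) / len(toy_genome)
print([len(f) for f in fragments], round(fraction_retained, 3))
```

Only the size-selected fragments would be sequenced, which is why a reference is needed only around the restriction sites.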

5,163 citations


"Genotyping‐by‐sequencing in ecologi..." refers methods in this paper

  • ...Genotyping-by-sequencing methods using restriction enzymes (Miller et al. 2007; Baird et al. 2008; van Orsouw et al. 2007; Andolfatto et al. 2011; Elshire et al. 2011; Peterson et al. 2012; Parchman et al. 2012) can produce data with unique characteristics, resulting from factors such as restriction-site polymorphism or correlations of restriction fragment length with read depth....

    [...]

  • ...‡Elshire et al. 2011....

    [...]

Journal ArticleDOI
TL;DR: An online catalog of SNP-trait associations from published genome-wide association studies for use in investigating genomic characteristics of trait/disease-associated SNPs (TASs) is developed, well-suited to guide future investigations of the role of common variants in complex disease etiology.
Abstract: We have developed an online catalog of SNP-trait associations from published genome-wide association studies for use in investigating genomic characteristics of trait/disease-associated SNPs (TASs). Reported TASs were common [median risk allele frequency 36%, interquartile range (IQR) 21%−53%] and were associated with modest effect sizes [median odds ratio (OR) 1.33, IQR 1.20–1.61]. Among 20 genomic annotation sets, reported TASs were significantly overrepresented only in nonsynonymous sites [OR = 3.9 (2.2−7.0), p = 3.5 × 10−7] and 5kb-promoter regions [OR = 2.3 (1.5−3.6), p = 3 × 10−4] compared to SNPs randomly selected from genotyping arrays. Although 88% of TASs were intronic (45%) or intergenic (43%), TASs were not overrepresented in introns and were significantly depleted in intergenic regions [OR = 0.44 (0.34−0.58), p = 2.0 × 10−9]. Only slightly more TASs than expected by chance were predicted to be in regions under positive selection [OR = 1.3 (0.8−2.1), p = 0.2]. This new online resource, together with bioinformatic predictions of the underlying functionality at trait/disease-associated loci, is well-suited to guide future investigations of the role of common variants in complex disease etiology.

4,041 citations


"Genotyping‐by‐sequencing in ecologi..." refers background in this paper

  • ...In the last decade, these approaches have been utilized extensively in humans to identify specific genes and pathways involved in human health (Hindorff et al. 2009) and to discover disease alleles in model organisms (Flint & Eskin 2012)....

    [...]

Journal ArticleDOI
13 Oct 2008-PLOS ONE
TL;DR: The sequencing of restriction-site associated DNA (RAD) tags is described; the approach identified more than 13,000 SNPs and mapped three traits in two model organisms, using less than half the capacity of one Illumina sequencing run.
Abstract: Single nucleotide polymorphism (SNP) discovery and genotyping are essential to genetic mapping. There remains a need for a simple, inexpensive platform that allows high-density SNP discovery and genotyping in large populations. Here we describe the sequencing of restriction-site associated DNA (RAD) tags, which identified more than 13,000 SNPs, and mapped three traits in two model organisms, using less than half the capacity of one Illumina sequencing run. We demonstrated that different marker densities can be attained by choice of restriction enzyme. Furthermore, we developed a barcoding system for sample multiplexing and fine mapped the genetic basis of lateral plate armor loss in threespine stickleback by identifying recombinant breakpoints in F2 individuals. Barcoding also facilitated mapping of a second trait, a reduction of pelvic structure, by in silico re-sorting of individuals. To further demonstrate the ease of the RAD sequencing approach we identified polymorphic markers and mapped an induced mutation in Neurospora crassa. Sequencing of RAD markers is an integrated platform for SNP discovery and genotyping. This approach should be widely applicable to genetic mapping in a variety of organisms.
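The barcoding and in silico re-sorting mentioned in this abstract amount to assigning each read to a sample by its leading barcode. The following is a minimal sketch under stated assumptions, not the published software: the function name `demultiplex`, the barcodes, the sample names and the reads are hypothetical, barcodes are assumed to sit at the very start of each read, exact matching is used (no error correction), and no barcode is assumed to be a prefix of another.

```python
from collections import defaultdict


def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using an exact match on the leading barcode.

    The barcode is trimmed from each assigned read. Reads that match no
    barcode are collected under the key "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                bins[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            bins["unassigned"].append(read)
    return dict(bins)


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "fish_1", "TGCA": "fish_2"}
reads = ["ACGTTTAGG", "TGCACCCGG", "ACGTAAAAA", "GGGGTTTTT"]

by_sample = demultiplex(reads, barcodes)
print(by_sample)
```

Re-sorting individuals for a second trait then only requires regrouping the per-sample bins, not re-sequencing.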

3,112 citations


"Genotyping‐by‐sequencing in ecologi..." refers background or methods in this paper

  • ...Genotyping-by-sequencing methods using restriction enzymes (Miller et al. 2007; Baird et al. 2008; van Orsouw et al. 2007; Andolfatto et al. 2011; Elshire et al. 2011; Peterson et al. 2012; Parchman et al. 2012) can produce data with unique characteristics, resulting from factors such as restriction-site polymorphism or correlations of restriction fragment length with read depth....

    [...]

  • ...Table 1 Continued

    | Study | Organism | Method | # loci analysed | # samples | # groups | Study goals |
    | --- | --- | --- | --- | --- | --- | --- |
    | Wang et al. | Birch (Betula spp.) | Single-end RAD-seq* | ~43 000 | 15 inds | n/a | SNP discovery |
    | White et al. | Bank vole (Myodes glareolus) | Genotyping-by-Sequencing‡ | 5979 | 281 inds | 14 pops | Genetic diversity |

    *Baird et al. 2008....

    [...]

Journal ArticleDOI
TL;DR: The expanded population genomics functions in Stacks will make it a useful tool to harness the newest generation of massively parallel genotyping data for ecological and evolutionary genetics.
Abstract: Massively parallel short-read sequencing technologies, coupled with powerful software platforms, are enabling investigators to analyse tens of thousands of genetic markers. This wealth of data is rapidly expanding and allowing biological questions to be addressed with unprecedented scope and precision. The sizes of the data sets are now posing significant data processing and analysis challenges. Here we describe an extension of the Stacks software package to efficiently use genotype-by-sequencing data for studies of populations of organisms. Stacks now produces core population genomic summary statistics and SNP-by-SNP statistical tests. These statistics can be analysed across a reference genome using a smoothed sliding window. Stacks also now provides several output formats for several commonly used downstream analysis packages. The expanded population genomics functions in Stacks will make it a useful tool to harness the newest generation of massively parallel genotyping data for ecological and evolutionary genetics.
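To make the idea of per-SNP statistics analysed with a smoothed sliding window concrete, here is a minimal sketch. It is not the Stacks implementation: the function names `site_pi` and `smooth`, the Gaussian weighting, the truncation at three standard deviations, and all positions and allele counts are assumptions chosen for illustration. The per-site diversity uses the standard unbiased estimator 1 - Σ c_i(c_i - 1) / (n(n - 1)), where c_i are allele counts and n is the number of sampled alleles.

```python
import math


def site_pi(allele_counts):
    """Unbiased nucleotide diversity at one site from its allele counts."""
    n = sum(allele_counts)
    if n < 2:
        return 0.0
    homozygosity = sum(c * (c - 1) for c in allele_counts) / (n * (n - 1))
    return 1.0 - homozygosity


def smooth(positions, values, centre, sigma):
    """Gaussian-weighted mean of `values` around `centre`.

    Sites farther than 3 * sigma from the centre are ignored. Returns NaN
    when no site falls inside the window.
    """
    weighted_sum = total_weight = 0.0
    for pos, val in zip(positions, values):
        distance = abs(pos - centre)
        if distance <= 3 * sigma:
            weight = math.exp(-(distance ** 2) / (2 * sigma ** 2))
            weighted_sum += weight * val
            total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else float("nan")


# Hypothetical SNP positions (bp) and allele counts at each SNP.
positions = [1000, 1500, 4000]
counts = [[5, 5], [10, 0], [8, 2]]

pis = [site_pi(c) for c in counts]
smoothed = smooth(positions, pis, centre=1200, sigma=500)
print(pis, smoothed)
```

Evaluating `smooth` at a grid of centres along a chromosome gives the smoothed profile; the kernel width `sigma` controls how local the average is.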

2,958 citations


"Genotyping‐by‐sequencing in ecologi..." refers background or methods in this paper

  • ...…mammals, making novel inferences regarding selection in natural populations in addition to measuring demographic parameters using neutral markers (Catchen et al. 2013b; Corander et al. 2013; De Wit & Palumbi 2013; Hess et al. 2013; Hyma & Fay 2013; Keller et al. 2013; Reitzel et al. 2013; Roda…...

    [...]

  • ...The most comprehensive pipeline for handling GBS data is Stacks (Catchen et al. 2011), and in this issue, Catchen et al. (2013a) describe new features in Stacks to calculate population genomic statistics (such as FST and nucleotide diversity), create smoothed distributions using sliding window…...

    [...]

Journal ArticleDOI
31 May 2012-PLOS ONE
TL;DR: This modified RADseq approach requires no prior genomic knowledge and achieves per-site and per-individual costs below that of current SNP chip technology, while requiring similar hands-on time investment, comparable amounts of input DNA, and downstream analysis times on the order of hours.
Abstract: The ability to efficiently and accurately determine genotypes is a keystone technology in modern genetics, crucial to studies ranging from clinical diagnostics, to genotype-phenotype association, to reconstruction of ancestry and the detection of selection. To date, high capacity, low cost genotyping has been largely achieved via “SNP chip” microarray-based platforms which require substantial prior knowledge of both genome sequence and variability, and once designed are suitable only for those targeted variable nucleotide sites. This method introduces substantial ascertainment bias and inherently precludes detection of rare or population-specific variants, a major source of information for both population history and genotype-phenotype association. Recent developments in reduced-representation genome sequencing experiments on massively parallel sequencers (commonly referred to as RAD-tag or RADseq) have brought direct sequencing to the problem of population genotyping, but increased cost and procedural and analytical complexity have limited their widespread adoption. Here, we describe a complete laboratory protocol, including a custom combinatorial indexing method, and accompanying software tools to facilitate genotyping across large numbers (hundreds or more) of individuals for a range of markers (hundreds to hundreds of thousands). Our method requires no prior genomic knowledge and achieves per-site and per-individual costs below that of current SNP chip technology, while requiring similar hands-on time investment, comparable amounts of input DNA, and downstream analysis times on the order of hours. Finally, we provide empirical results from the application of this method to both genotyping in a laboratory cross and in wild populations. Because of its flexibility, this modified RADseq approach promises to be applicable to a diversity of biological questions in a wide range of organisms.

2,734 citations


"Genotyping‐by‐sequencing in ecologi..." refers methods in this paper

  • ...Arnold et al. (2013) evaluate several additional population genetics statistics, demonstrate that the choice of restriction enzyme and allele dropout can have substantial effects on these estimates, and assess the double-digest RAD-seq method (Peterson et al. 2012) as well as standard RAD-seq....

    [...]

  • ...…methods using restriction enzymes (Miller et al. 2007; Baird et al. 2008; van Orsouw et al. 2007; Andolfatto et al. 2011; Elshire et al. 2011; Peterson et al. 2012; Parchman et al. 2012) can produce data with unique characteristics, resulting from factors such as restriction-site…...

    [...]

  • ...†Peterson et al. 2012....

    [...]