Journal Article

RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling

01 Jun 2013-Molecular Ecology (Mol Ecol)-Vol. 22, Iss: 11, pp 3179-3190
TL;DR: It is shown that loci with missing haplotypes have estimated summary statistic values that can deviate dramatically from true values and are also enriched for particular genealogical histories, which are sensitive to nonequilibrium demography, such as bottlenecks and population expansion.
Abstract: Reduced representation genome-sequencing approaches based on restriction digestion are enabling large-scale marker generation and facilitating genomic studies in a wide range of model and nonmodel systems. However, sampling chromosomes based on restriction digestion may introduce a bias in allele frequency estimation due to polymorphisms in restriction sites. To explore the effects of this nonrandom sampling and its sensitivity to different evolutionary parameters, we developed a coalescent-simulation framework to mimic the biased recovery of chromosomes in restriction-based short-read sequencing experiments (RADseq). We analysed simulated DNA sequence datasets and compared known values from simulations with those that would be estimated using a RADseq approach from the same samples. We compare these 'true' and 'estimated' values of commonly used summary statistics, π, θ(w), Tajima's D and F(ST). We show that loci with missing haplotypes have estimated summary statistic values that can deviate dramatically from true values and are also enriched for particular genealogical histories. These biases are sensitive to nonequilibrium demography, such as bottlenecks and population expansion. In silico digests with 102 completely sequenced Drosophila melanogaster genomes yielded results similar to our findings from coalescent simulations. Though the potential of RADseq for marker discovery and trait mapping in nonmodel systems remains undisputed, our results urge caution when applying this technique to make population genetic inferences.
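As a toy illustration of the bias described in the abstract (not the authors' coalescent-simulation framework), the sketch below computes π and Watterson's θ on a small hand-made set of haplotypes, then recomputes them after dropping the haplotypes flagged as carrying a restriction-site mutation. The haplotype strings, the `cut_site_lost` flags and all function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def segregating_sites(haps):
    """Count alignment columns in which the haplotypes are not all identical."""
    return sum(1 for column in zip(*haps) if len(set(column)) > 1)


def pi(haps):
    """Average number of pairwise differences between haplotypes."""
    pairs = list(combinations(haps, 2))
    total = sum(sum(a != b for a, b in zip(h1, h2)) for h1, h2 in pairs)
    return total / len(pairs)


def watterson_theta(haps):
    """Watterson's estimator: S divided by the harmonic number a1 = sum_{i<n} 1/i."""
    n = len(haps)
    a1 = sum(1.0 / i for i in range(1, n))
    return segregating_sites(haps) / a1


# Hypothetical locus: six haplotypes, five SNP columns (0 = ancestral, 1 = derived).
haplotypes = ["00000", "00000", "01000", "01100", "10011", "10011"]

# Hypothetical flags: True means the haplotype carries a mutation in the
# restriction site, so a RADseq-style experiment would not recover it.
cut_site_lost = [False, False, False, False, True, True]

recovered = [h for h, lost in zip(haplotypes, cut_site_lost) if not lost]

print("true      pi, theta_w:", pi(haplotypes), watterson_theta(haplotypes))
print("estimated pi, theta_w:", pi(recovered), watterson_theta(recovered))
```

In this toy locus the unrecovered haplotypes form the most divergent lineage, so both estimates fall once they are removed. That matches the direction of bias named in the title, although real magnitudes depend on the genealogy and demography.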
Citations
Journal Article
Fumio Tajima
30 Oct 1989-Genomics
TL;DR: It is suggested that the natural selection against large insertion/deletion is so weak that a large amount of variation is maintained in a population.

11,521 citations

Journal Article
28 Feb 2014-PLOS ONE
TL;DR: The tassel-gbs pipeline, designed for the efficient processing of raw GBS sequence data into SNP genotypes, is described and benchmarked on a large-scale, species-wide analysis in maize, where the average error rate was reduced to 0.0042.
Abstract: Genotyping by sequencing (GBS) is a next generation sequencing based method that takes advantage of reduced representation to enable high throughput genotyping of large numbers of individuals at a large number of SNP markers. The relatively straightforward, robust, and cost-effective GBS protocol is currently being applied in numerous species by a large number of researchers. Herein we describe a bioinformatics pipeline, tassel-gbs, designed for the efficient processing of raw GBS sequence data into SNP genotypes. The tassel-gbs pipeline successfully fulfills the following key design criteria: (1) Ability to run on the modest computing resources that are typically available to small breeding or ecological research programs, including desktop or laptop machines with only 8–16 GB of RAM, (2) Scalability from small to extremely large studies, where hundreds of thousands or even millions of SNPs can be scored in up to 100,000 individuals (e.g., for large breeding programs or genetic surveys), and (3) Applicability in an accelerated breeding context, requiring rapid turnover from tissue collection to genotypes. Although a reference genome is required, the pipeline can also be run with an unfinished “pseudo-reference” consisting of numerous contigs. We describe the tassel-gbs pipeline in detail and benchmark it based upon a large scale, species wide analysis in maize (Zea mays), where the average error rate was reduced to 0.0042 through application of population genetic-based SNP filters. Overall, the GBS assay and the tassel-gbs pipeline provide robust tools for studying genomic diversity.

1,315 citations


Cites background from "RADseq underestimates diversity and..."

  • ...However, there might still be some subtle biases in either pipeline, caused by factors such as null alleles [63,64], alignment to the reference, and the use of inbreds only (rather than the full set of samples) to filter SNPs for FIT....


Journal Article
TL;DR: This Review provides a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.
Abstract: High-throughput techniques based on restriction site-associated DNA sequencing (RADseq) are enabling the low-cost discovery and genotyping of thousands of genetic markers for any species, including non-model organisms, which is revolutionizing ecological, evolutionary and conservation genetics. Technical differences among these methods lead to important considerations for all steps of genomics studies, from the specific scientific questions that can be addressed, and the costs of library preparation and sequencing, to the types of bias and error inherent in the resulting data. In this Review, we provide a comprehensive discussion of RADseq methods to aid researchers in choosing among the many different approaches and avoiding erroneous scientific conclusions from RADseq data, a problem that has plagued other genetic marker types in the past.

1,102 citations

Journal Article
TL;DR: This Review demonstrates the breadth of questions that are being addressed by Pool-seq but also discusses its limitations and provides guidelines for users.
Abstract: The analysis of polymorphism data is becoming increasingly important as a complementary tool to classical genetic analyses. Nevertheless, despite plunging sequencing costs, genomic sequencing of individuals at the population scale is still restricted to a few model species. Whole-genome sequencing of pools of individuals (Pool-seq) provides a cost-effective alternative to sequencing individuals separately. With the availability of custom-tailored software tools, Pool-seq is being increasingly used for population genomic research on both model and non-model organisms. In this Review, we not only demonstrate the breadth of questions that are being addressed by Pool-seq but also discuss its limitations and provide guidelines for users.

642 citations

Journal Article
TL;DR: The R package pcadapt performs genome scans to detect genes under selection based on population genomic data and is compared to other computer programs for genome scans, finding that pcadapt and hapflk are the most powerful in scenarios of population divergence and range expansion.
Abstract: The R package pcadapt performs genome scans to detect genes under selection based on population genomic data. It assumes that candidate markers are outliers with respect to how they are related to population structure. Because population structure is ascertained with principal component analysis, the package is fast and works with large-scale data. It can handle missing data and pooled sequencing data. By contrast to population-based approaches, the package handles admixed individuals and does not require grouping individuals into populations. Since its first release, pcadapt has evolved in terms of both statistical approach and software implementation. We present results obtained with robust Mahalanobis distance, which is a new statistic for genome scans available in the 2.0 and later versions of the package. When hierarchical population structure occurs, Mahalanobis distance is more powerful than the communality statistic that was implemented in the first version of the package. Using simulated data, we compare pcadapt to other computer programs for genome scans (BayeScan, hapflk, OutFLANK, sNMF). We find that the proportion of false discoveries is around a nominal false discovery rate set at 10% with the exception of BayeScan that generates 40% of false discoveries. We also find that the power of BayeScan is severely impacted by the presence of admixed individuals whereas pcadapt is not impacted. Last, we find that pcadapt and hapflk are the most powerful in scenarios of population divergence and range expansion. Because pcadapt handles next-generation sequencing data, it is a valuable tool for data analysis in molecular ecology.

594 citations


Cites background from "RADseq underestimates diversity and..."

  • ...Additionally, next-generation sequencing data may contain a substantial proportion of missing data that should be accounted for (Arnold et al. 2013; Gautier et al. 2013)....


References
Journal Article
TL;DR: The purpose of this discussion is to offer some unity to various estimation formulae and to point out that correlations of genes in structured populations, with which F-statistics are concerned, are expressed very conveniently with a set of parameters treated by Cockerham (1969, 1973).
Abstract: This journal frequently contains papers that report values of F-statistics estimated from genetic data collected from several populations. These parameters, FST, FIT, and FIS, were introduced by Wright (1951), and offer a convenient means of summarizing population structure. While there is some disagreement about the interpretation of the quantities, there is considerably more disagreement on the method of evaluating them. Different authors make different assumptions about sample sizes or numbers of populations and handle the difficulties of multiple alleles and unequal sample sizes in different ways. Wright himself, for example, did not consider the effects of finite sample size. The purpose of this discussion is to offer some unity to various estimation formulae and to point out that correlations of genes in structured populations, with which F-statistics are concerned, are expressed very conveniently with a set of parameters treated by Cockerham (1969, 1973). We start with the parameters and construct appropriate estimators for them, rather than beginning the discussion with various data functions. The extension of Cockerham's work to multiple alleles and loci will be made explicit, and the use of jackknife procedures for estimating variances will be advocated. All of this may be regarded as an extension of a recent treatment of estimating the coancestry coefficient to serve as a mea-

17,890 citations


"RADseq underestimates diversity and..." refers methods in this paper

  • ...We then used these to calculate typical summaries of the data such as average number of pairwise differences (π, Tajima 1983), Watterson’s θ (θw, Watterson 1975), Tajima’s D (Tajima 1989) and FST (Weir & Cockerham 1984)....


Journal Article
Fumio Tajima
30 Oct 1989-Genomics
TL;DR: It is suggested that the natural selection against large insertion/deletion is so weak that a large amount of variation is maintained in a population.

11,521 citations


"RADseq underestimates diversity and..." refers methods in this paper

  • ...We then used these to calculate typical summaries of the data such as average number of pairwise differences (π, Tajima 1983), Watterson’s θ (θw, Watterson 1975), Tajima’s D (Tajima 1989) and FST (Weir & Cockerham 1984)....


Journal Article
Fumio Tajima
01 Nov 1989-Genetics
TL;DR: The relationship between the two estimates of genetic variation at the DNA level, namely the number of segregating sites and the average number of nucleotide differences estimated from pairwise comparison, is investigated in this article.
Abstract: The relationship between the two estimates of genetic variation at the DNA level, namely the number of segregating sites and the average number of nucleotide differences estimated from pairwise comparison, is investigated. It is found that the correlation between these two estimates is large when the sample size is small, and decreases slowly as the sample size increases. Using the relationship obtained, a statistical method for testing the neutral mutation hypothesis is developed. This method needs only the data of DNA polymorphism, namely the genetic variation within population at the DNA level. A simple method of computer simulation, that was used in order to obtain the distribution of a new statistic developed, is also presented. Applying this statistical method to the five regions of DNA sequences in Drosophila melanogaster, it is found that large insertion/deletion (greater than 100 bp) is deleterious. It is suggested that the natural selection against large insertion/deletion is so weak that a large amount of variation is maintained in a population.

11,417 citations
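
The test statistic described in this abstract compares the two estimates of variation: the average number of pairwise differences and the number of segregating sites scaled by a harmonic number. A minimal sketch using the standard constants for Tajima's D follows. The function name, argument names and the example numbers are assumptions for illustration, and returning NaN when the statistic is undefined is a choice made here, not something the abstract specifies.

```python
import math


def tajimas_d(n, S, pi):
    """Tajima's D from sample size n, segregating sites S and mean pairwise differences pi.

    Returns NaN when the statistic is undefined (no segregating sites or zero variance).
    """
    if n < 2 or S <= 0:
        return math.nan

    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / (i * i) for i in range(1, n))

    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))

    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / (a1 * a1)

    e1 = c1 / a1
    e2 = c2 / (a1 * a1 + a2)

    variance = e1 * S + e2 * S * (S - 1)
    if variance <= 0:
        return math.nan

    # Numerator: difference between pi and Watterson's estimator S / a1.
    return (pi - S / a1) / math.sqrt(variance)


# Arbitrary example values: 6 sequences, 5 segregating sites, pi = 37/15.
print(tajimas_d(6, 5, 37 / 15))
```

A positive value means the pairwise-difference estimate exceeds the estimate based on segregating sites, and a negative value means the reverse. Both estimates coincide in expectation under the neutral equilibrium model.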

Journal Article
TL;DR: A new basis for constructing a genetic linkage map of the human genome is described: random single-copy DNA probes, developed by recombinant DNA techniques, detect DNA sequence polymorphisms when hybridized to restriction digests of an individual's DNA.
Abstract: We describe a new basis for the construction of a genetic linkage map of the human genome. The basic principle of the mapping scheme is to develop, by recombinant DNA techniques, random single-copy DNA probes capable of detecting DNA sequence polymorphisms, when hybridized to restriction digests of an individual's DNA. Each of these probes will define a locus. Loci can be expanded or contracted to include more or less polymorphism by further application of recombinant DNA technology. Suitably polymorphic loci can be tested for linkage relationships in human pedigrees by established methods; and loci can be arranged into linkage groups to form a true genetic map of "DNA marker loci." Pedigrees in which inherited traits are known to be segregating can then be analyzed, making possible the mapping of the gene(s) responsible for the trait with respect to the DNA marker loci, without requiring direct access to a specified gene's DNA. For inherited diseases mapped in this way, linked DNA marker loci can be used predictively for genetic counseling.

7,853 citations


"RADseq underestimates diversity and..." refers background in this paper

  • ...Though it may seem comparatively rare for a mutation to occur within a recognition sequence, these variants are frequent enough to enable detailed population genetic analyses (e.g. Restriction Fragment Length Polymorphisms, Botstein et al. 1980)....
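
To make the excerpt above concrete, here is a small in silico digest sketch showing how a single mutation inside a recognition sequence removes a cut and changes the fragment pattern. The EcoRI site (GAATTC, cut after the first G) is a well-known example. The toy sequences and function names are hypothetical, and the sketch treats the molecule as a single linear strand.

```python
def cut_positions(seq, site):
    """Return 0-based start positions of every occurrence of `site` in `seq`."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def fragment_lengths(seq, site, cut_offset):
    """Lengths of the fragments produced by cutting `cut_offset` bases into each site."""
    cuts = [p + cut_offset for p in cut_positions(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


ECORI_SITE = "GAATTC"   # recognition sequence
ECORI_OFFSET = 1        # EcoRI cuts between G and AATTC

# Hypothetical 25 bp reference haplotype with two EcoRI sites.
reference = "ACGT" + "GAATTC" + "TTAGGCC" + "GAATTC" + "AA"

# Hypothetical variant haplotype: one substitution (A -> C) inside the second site.
variant = reference[:19] + "C" + reference[20:]

print(fragment_lengths(reference, ECORI_SITE, ECORI_OFFSET))
print(fragment_lengths(variant, ECORI_SITE, ECORI_OFFSET))
```

The variant haplotype loses its second cut site, so the two shorter fragments on that side merge into one longer fragment. In a RADseq-style experiment the reads anchored at that site would be missing for this haplotype.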


Journal Article
TL;DR: The distribution is obtained for the number of segregating sites observed in a sample from a population which is subject to recurring, new, mutations but not subject to recombination, and applies approximately to three population models.

3,870 citations


"RADseq underestimates diversity and..." refers methods in this paper

  • ...We then used these to calculate typical summaries of the data such as average number of pairwise differences (π, Tajima 1983), Watterson’s θ (θw, Watterson 1975), Tajima’s D (Tajima 1989) and FST (Weir & Cockerham 1984)....
