Journal Article

SweeD: Likelihood-Based Detection of Selective Sweeps in Thousands of Genomes

TL;DR: It is shown that an increase of sample size results in more precise detection of positive selection and the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection.
Abstract: The advent of modern DNA sequencing technology is the driving force in obtaining complete intra-specific genomes that can be used to detect loci that have been subject to positive selection in the recent past. Based on selective sweep theory, beneficial loci can be detected by examining the single nucleotide polymorphism patterns in intraspecific genome alignments. In the last decade, a plethora of algorithms for identifying selective sweeps have been developed. However, the majority of these algorithms have not been designed for analyzing whole-genome data. We present SweeD (Sweep Detector), an open-source tool for the rapid detection of selective sweeps in whole genomes. It analyzes site frequency spectra and represents a substantial extension of the widely used SweepFinder program. The sequential version of SweeD is up to 22 times faster than SweepFinder and, more importantly, is able to analyze thousands of sequences. We also provide a parallel implementation of SweeD for multi-core processors. Furthermore, we implemented a checkpointing mechanism that allows SweeD to be deployed on cluster systems with queue execution time restrictions, as well as to resume long-running analyses after processor failures. In addition, the user can specify various demographic models via the command line to calculate their theoretically expected site frequency spectra. Therefore, (in contrast to SweepFinder) the neutral site frequencies can optionally be directly calculated from a given demographic model. We show that an increase of sample size results in more precise detection of positive selection. Thus, the ability to analyze substantially larger sample sizes by using SweeD leads to more accurate sweep detection. We validate SweeD via simulations and by scanning the first chromosome from the 1000 human Genomes project for selective sweeps. We compare SweeD results with results from a linkage-disequilibrium-based approach and identify common outliers.


Citations
Journal Article
TL;DR: PopGenome is a population genomics package for the R software environment that offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination.
Abstract: Although many computer programs can perform population genetics calculations, they are typically limited in the analyses and data input formats they offer; few applications can process the large data sets produced by whole-genome resequencing projects. Furthermore, there is no coherent framework for the easy integration of new statistics into existing pipelines, hindering the development and application of new population genetics and genomics approaches. Here, we present PopGenome, a population genomics package for the R software environment (a de facto standard for statistical analyses). PopGenome can efficiently process genome-scale data as well as large sets of individual loci. It reads DNA alignments and single-nucleotide polymorphism (SNP) data sets in most common formats, including those used by the HapMap, 1000 human genomes, and 1001 Arabidopsis genomes projects. PopGenome also reads associated annotation files in GFF format, enabling users to easily define regions or classify SNPs based on their annotation; all analyses can also be applied to sliding windows. PopGenome offers a wide range of diverse population genetics analyses, including neutrality tests as well as statistics for population differentiation, linkage disequilibrium, and recombination. PopGenome is linked to Hudson’s MS and Ewing’s MSMS programs to assess statistical significance based on coalescent simulations. PopGenome’s integration in R facilitates effortless and reproducible downstream analyses as well as the production of publication-quality graphics. Developers can easily incorporate new analysis methods into the PopGenome framework. PopGenome and R are freely available from CRAN (http://cran.r-project.org/) for all major operating systems under the GNU General Public License.

761 citations


Cites methods from "SweeD: Likelihood-Based Detection o..."

  • ...In the next release of PopGenome, we aim to incorporate methods for detecting recent selective sweeps, such as the algorithm implemented in the software SweeD (Pavlidis et al. 2013)....

    [...]

Journal Article
TL;DR: Evidence for artificial selection at a genome-wide scale, as well as with a set of O. glaberrima genes orthologous to O. sativa genes that are known to be associated with domestication, is detected, indicating convergent yet independent selection of a common set of genes during two geographically and culturally distinct domestication processes.
Abstract: The cultivation of rice in Africa dates back more than 3,000 years. Interestingly, African rice is not of the same origin as Asian rice (Oryza sativa L.) but rather is an entirely different species (i.e., Oryza glaberrima Steud.). Here we present a high-quality assembly and annotation of the O. glaberrima genome and detailed analyses of its evolutionary history of domestication and selection. Population genomics analyses of 20 O. glaberrima and 94 Oryza barthii accessions support the hypothesis that O. glaberrima was domesticated in a single region along the Niger river as opposed to noncentric domestication events across Africa. We detected evidence for artificial selection at a genome-wide scale, as well as with a set of O. glaberrima genes orthologous to O. sativa genes that are known to be associated with domestication, thus indicating convergent yet independent selection of a common set of genes during two geographically and culturally distinct domestication processes.

328 citations

Journal Article
TL;DR: It is shown that recent adaptation has operated almost exclusively on standing variation, and that patterns of adaptive mutations predict diverse effects on protein function, and provided evidence that chemosensory proteins have experienced relaxed constraint, and argued that this has been important for their rapid adaptation over short timescales.
Abstract: How organisms adapt to new environments is of fundamental biological interest, but poorly understood at the genetic level. Chemosensory systems provide attractive models to address this problem, because they lie between external environmental signals and internal physiological responses. To investigate how selection has shaped the well-characterized chemosensory system of Drosophila melanogaster, we have analysed genome-wide data from five diverse populations. By couching population genomic analyses of chemosensory protein families within parallel analyses of other large families, we demonstrate that chemosensory proteins are not outliers for adaptive divergence between species. However, chemosensory families often display the strongest genome-wide signals of recent selection within D. melanogaster. We show that recent adaptation has operated almost exclusively on standing variation, and that patterns of adaptive mutations predict diverse effects on protein function. Finally, we provide evidence that chemosensory proteins have experienced relaxed constraint, and argue that this has been important for their rapid adaptation over short timescales.

266 citations

Journal Article
TL;DR: It is argued that recurrent selection for domestic traits likely counteracted the homogenizing effect of gene flow from wild boars and created 'islands of domestication' in the genome.
Abstract: Traditionally, the process of domestication is assumed to be initiated by humans, involve few individuals and rely on reproductive isolation between wild and domestic forms. We analyzed pig domestication using over 100 genome sequences and tested whether pig domestication followed a traditional linear model or a more complex, reticulate model. We found that the assumptions of traditional models, such as reproductive isolation and strong domestication bottlenecks, are incompatible with the genetic data. In addition, our results show that, despite gene flow, the genomes of domestic pigs have strong signatures of selection at loci that affect behavior and morphology. We argue that recurrent selection for domestic traits likely counteracted the homogenizing effect of gene flow from wild boars and created 'islands of domestication' in the genome. Our results have major ramifications for the understanding of animal domestication and suggest that future studies should employ models that do not assume reproductive isolation.

232 citations

Journal Article
22 Jun 2018-Science
TL;DR: It is shown that cis-regulatory variation controlling seasonal expression of the Agouti gene underlies this adaptive winter camouflage polymorphism, and shows that introgression of genetic variants that underlie key ecological traits can seed past and ongoing adaptation to rapidly changing environments.
Abstract: Snowshoe hares (Lepus americanus) maintain seasonal camouflage by molting to a white winter coat, but some hares remain brown during the winter in regions with low snow cover. We show that cis-regulatory variation controlling seasonal expression of the Agouti gene underlies this adaptive winter camouflage polymorphism. Genetic variation at Agouti clustered by winter coat color across multiple hare and jackrabbit species, revealing a history of recurrent interspecific gene flow. Brown winter coats in snowshoe hares likely originated from an introgressed black-tailed jackrabbit allele that has swept to high frequency in mild winter environments. These discoveries show that introgression of genetic variants that underlie key ecological traits can seed past and ongoing adaptation to rapidly changing environments.

227 citations

References
Journal Article
01 Nov 2012-Nature
TL;DR: It is shown that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites.
Abstract: By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

7,710 citations

Book
01 Jan 2009
TL;DR: The aim of this book is to provide a Discussion of Constrained Optimization and its Applications to Linear Programming and Other Optimization Problems.
Abstract: Preface; Table of Notation. Part 1: Unconstrained Optimization: Introduction; Structure of Methods; Newton-like Methods; Conjugate Direction Methods; Restricted Step Methods; Sums of Squares and Nonlinear Equations. Part 2: Constrained Optimization: Introduction; Linear Programming; The Theory of Constrained Optimization; Quadratic Programming; General Linearly Constrained Optimization; Nonlinear Programming; Other Optimization Problems; Non-Smooth Optimization. References; Subject Index.

7,278 citations


"SweeD: Likelihood-Based Detection o..." refers background in this paper

  • ...However, when the number of SNPs is small with respect to the number of sequences, substantially more iterations (and hence thread synchronization events) are required for the BFGS algorithm to converge....

    [...]

  • ...This is due to the small proportion of SNPs in the comparatively large number of sequences, which in turn leads to a significantly larger amount of time spent in the BFGS (Broyden–Fletcher–Goldfarb–Shanno; Fletcher 1987) algorithm that optimizes the neutral SFS....

    [...]

  • ...For example, when we analyze the data set with 10,000 sequences and 10,000 SNPs, the BFGS algorithm computes the likelihood of the input data set conditional on the SFS 4,477,114 times, whereas only 396 such likelihood calculations are required for the data set with 100 sequences and 10,000 SNPs....

    [...]

  • ...More specifically, the BFGS algorithm estimates the neutral SFS that maximizes the probability of the data set (i.e., the overall likelihood) given the input SFS and the data....

    [...]

Journal Article
TL;DR: It is shown that, if the selective coefficients at the linked locus are small compared to those at the substituted locus, the probability of complete fixation at the linked locus is approximately exp(−Nc), where c is the recombinant fraction and N the population size.
Abstract: When a selectively favourable gene substitution occurs in a population, changes in gene frequencies will occur at closely linked loci. In the case of a neutral polymorphism, average heterozygosity will be reduced to an extent which varies with distance from the substituted locus. The aggregate effect of substitution on neutral polymorphism is estimated; in populations of total size 10^6 or more (and perhaps of 10^4 or more), this effect will be more important than that of random fixation. This may explain why the extent of polymorphism in natural populations does not vary as much as one would expect from a consideration of the equilibrium between mutation and random fixation in populations of different sizes. For a selectively maintained polymorphism at a linked locus, this process will only be important in the long run if it leads to complete fixation. If the selective coefficients at the linked locus are small compared to those at the substituted locus, it is shown that the probability of complete fixation at the linked locus is approximately exp(−Nc), where c is the recombinant fraction and N the population size. It follows that in a large population a selective substitution can occur in a cistron without eliminating a selectively maintained polymorphism in the same cistron.
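
To give a feel for the approximation quoted in this abstract, the short sketch below evaluates exp(−Nc) for a few values of the recombinant fraction c. The function name and the chosen values of N and c are illustrative assumptions, not figures from the paper.

```python
import math

def fixation_probability(population_size, recombinant_fraction):
    """Approximate probability of complete fixation at the linked locus, exp(-N*c)."""
    return math.exp(-population_size * recombinant_fraction)

# Hypothetical values: N = 10^6 and three recombinant fractions.
N = 10**6
for c in (1e-8, 1e-6, 1e-4):
    print(f"N = {N}, c = {c:g}: exp(-Nc) = {fixation_probability(N, c):.3g}")
```

With N fixed, the probability falls off quickly once c grows past about 1/N, which matches the abstract's conclusion that in a large population a substitution need not eliminate a polymorphism within the same cistron.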

2,726 citations

Journal Article
TL;DR: A Monte Carlo computer program is available to generate samples drawn from a population evolving according to a Wright-Fisher neutral model, and the samples produced can be used to investigate the sampling properties of any sample statistic under these neutral models.
Abstract: A Monte Carlo computer program is available to generate samples drawn from a population evolving according to a Wright-Fisher neutral model. The program assumes an infinite-sites model of mutation, and allows recombination, gene conversion, symmetric migration among subpopulations, and a variety of demographic histories. The samples produced can be used to investigate the sampling properties of any sample statistic under these neutral models.

2,566 citations


"SweeD: Likelihood-Based Detection o..." refers background or methods in this paper

  • ...With respect to simulated data sets, SweeD supports ms (Hudson 2002) and MaCS (Chen et al. 2009) formats....

    [...]

  • ...Kim and Stephan (2002) interpreted fn,i as the probability of observing a single site where i derived alleles are found in a sample of size n....

    [...]

  • ...Simulations are usually performed using coalescent-based software such as Hudson’s ms (Hudson 2002) or msms (Ewing and Hermisson 2010)....

    [...]
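
The second excerpt above describes f_{n,i} as the probability of observing a site with i derived alleles in a sample of size n. As a hedged illustration, the sketch below computes that probability under one specific assumption that the excerpts do not state: the standard neutral model with constant population size, conditional on the site being polymorphic, where the probability is proportional to 1/i. The function name is hypothetical, and a tool may instead estimate these frequencies from the data or from a demographic model.

```python
from fractions import Fraction

def neutral_sfs_probabilities(n):
    """Return [f_{n,1}, ..., f_{n,n-1}] under the standard neutral model.

    Assumes constant population size and conditions on the site being
    polymorphic, so that f_{n,i} = (1/i) / sum_{j=1}^{n-1} (1/j).
    """
    harmonic = sum(Fraction(1, j) for j in range(1, n))
    return [Fraction(1, i) / harmonic for i in range(1, n)]

probs = neutral_sfs_probabilities(5)
for i, p in enumerate(probs, start=1):
    print(f"f_5,{i} = {p} = {float(p):.3f}")
```

Exact fractions are used so that the probabilities can be checked to sum to one without floating-point rounding.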