Journal ArticleDOI

fastsimcoal: a continuous-time coalescent simulator of genomic diversity under arbitrarily complex evolutionary scenarios.

01 May 2011 - Bioinformatics (Oxford University Press) - Vol. 27, Iss. 9, pp. 1332-1334
TL;DR: A new coalescent-based simulation program fastsimcoal is presented, which is able to quickly simulate a variety of genetic markers scattered over very long genomic regions with arbitrary recombination patterns under complex evolutionary scenarios.
Abstract: Motivation: Genetic studies focus on increasingly larger genomic regions of both extant and ancient DNA, and there is a need for simulation software to match these technological advances. We present here a new coalescent-based simulation program fastsimcoal, which is able to quickly simulate a variety of genetic markers scattered over very long genomic regions with arbitrary recombination patterns under complex evolutionary scenarios. Availability and Implementation: fastsimcoal is a C++ program compiled for Windows, Mac OS X and Linux platforms. It is freely available at cmpg.unibe.ch/software/fastsimcoal/, together with its detailed user manual and example input files.
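At the heart of such a simulator is the standard continuous-time (Kingman) coalescent. As a rough illustration of the waiting-time logic only, and not fastsimcoal's actual implementation (which adds recombination, migration and demographic events), here is a minimal Python sketch:

```python
import random

def coalescent_times(n, N):
    """Sample successive coalescence times (in generations) for n lineages
    in a Wright-Fisher population of constant diploid size N. With k
    lineages left, the waiting time is exponential with rate k*(k-1)/2
    on the coalescent time scale, i.e. in units of 2N generations."""
    times, t, k = [], 0.0, n
    while k > 1:
        rate = k * (k - 1) / 2.0               # pairwise coalescence rate
        t += random.expovariate(rate) * 2 * N  # rescale to generations
        times.append(t)
        k -= 1
    return times

print(coalescent_times(10, 10_000))  # 9 coalescence times; the last is the TMRCA
```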


Citations
Journal ArticleDOI
TL;DR: A flexible and robust simulation-based framework to infer demographic parameters from the site frequency spectrum (SFS) computed on large genomic datasets and shows that it allows one to study evolutionary models of arbitrary complexity, which cannot be tackled by other current likelihood-based methods.
Abstract: We introduce a flexible and robust simulation-based framework to infer demographic parameters from the site frequency spectrum (SFS) computed on large genomic datasets. We show that our composite-likelihood approach allows one to study evolutionary models of arbitrary complexity, which cannot be tackled by other current likelihood-based methods. For simple scenarios, our approach compares favorably in terms of accuracy and speed with ∂a∂i, the current reference in the field, while showing better convergence properties for complex models. We first apply our methodology to non-coding genomic SNP data from four human populations. To infer their demographic history, we compare neutral evolutionary models of increasing complexity, including unsampled populations. We further show the versatility of our framework by extending it to the inference of demographic parameters from SNP chips with known ascertainment, such as that recently released by Affymetrix to study human origins. Whereas previous ways of handling ascertained SNPs were either restricted to a single population or only allowed the inference of divergence time between a pair of populations, our framework can correctly infer parameters of more complex models including the divergence of several populations, bottlenecks and migration. We apply this approach to the reconstruction of African demography using two distinct ascertained human SNP panels studied under two evolutionary models. The two SNP panels lead to globally very similar estimates and confidence intervals, and suggest an ancient divergence (>110 Ky) between Yoruba and San populations. Our methodology appears well suited to the study of complex scenarios from large genomic data sets.
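The composite likelihood described here treats the SFS entries as independent multinomial cells, with expected proportions obtained from coalescent simulations under the candidate model. A minimal sketch of that likelihood computation (illustrative only; names are not from the actual software):

```python
import math

def sfs_composite_loglik(observed_counts, expected_probs):
    """Composite (multinomial) log-likelihood of an observed SFS:
    log L = sum_i m_i * log(p_i), where m_i is the number of sites in
    SFS entry i and p_i its expected proportion, e.g. estimated from
    a large number of coalescent simulations under the tested model."""
    eps = 1e-20  # guard against log(0) for entries never seen in simulations
    return sum(m * math.log(max(p, eps))
               for m, p in zip(observed_counts, expected_probs))

# toy example: two SFS entries (e.g. singletons and doubletons)
print(sfs_composite_loglik([120, 45], [0.7, 0.3]))
```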

1,199 citations


Cites methods from "fastsimcoal: a continuous-time coal..."

  • ...Coalescent simulations, estimation of the SFS, likelihood computations and its maximization were all done with fastsimcoal2, a modified version of the fastsimcoal program [82]....


Journal ArticleDOI
06 Jul 2012 - Science
TL;DR: It is concluded that because of rapid population growth and weak purifying selection, human populations harbor an abundance of rare variants, many of which are deleterious and have relevance to understanding disease risk.
Abstract: Rare genetic variants contribute to complex disease risk; however, the abundance of rare variants in human populations remains unknown. We explored this spectrum of variation by sequencing 202 genes encoding drug targets in 14,002 individuals. We find rare variants are abundant (1 every 17 bases) and geographically localized, so that even with large sample sizes, rare variant catalogs will be largely incomplete. We used the observed patterns of variation to estimate population growth parameters, the proportion of variants in a given frequency class that are putatively deleterious, and mutation rates for each gene. We conclude that because of rapid population growth and weak purifying selection, human populations harbor an abundance of rare variants, many of which are deleterious and have relevance to understanding disease risk.

724 citations

Journal ArticleDOI
TL;DR: As discussed by the authors, 'abc' is an R package that implements several approximate Bayesian computation (ABC) algorithms for parameter estimation and model selection, in particular the recently developed nonlinear heteroscedastic regression methods for ABC.
Abstract: Summary
1. Many recent statistical applications involve inference under complex models, where it is computationally prohibitive to calculate likelihoods but possible to simulate data. Approximate Bayesian computation (ABC) is devoted to these complex models because it bypasses the evaluation of the likelihood function by comparing observed and simulated data.
2. We introduce the R package ‘abc’ that implements several ABC algorithms for performing parameter estimation and model selection. In particular, the recently developed nonlinear heteroscedastic regression methods for ABC are implemented. The ‘abc’ package also includes a cross-validation tool for measuring the accuracy of ABC estimates and for calculating the misclassification probabilities when performing model selection. The main functions are accompanied by appropriate summary and plotting tools.
3. R is already widely used in bioinformatics and several fields of biology. The R package ‘abc’ will make the ABC algorithms available to a large number of R users. ‘abc’ is a freely available R package under the GPL license, and it can be downloaded at http://cran.r-project.org/web/packages/
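For readers unfamiliar with the basic algorithm that the 'abc' package builds on, a plain ABC rejection sampler can be sketched in a few lines of Python (the package itself is in R and adds regression adjustment, cross-validation and model selection; all names below are illustrative):

```python
import random
import statistics

def abc_rejection(observed, prior_sampler, simulate, n_draws, tol):
    """Plain ABC rejection: draw a parameter from the prior, simulate a
    summary statistic, and keep the draw when it lands within tol of
    the observed statistic."""
    return [theta for theta in (prior_sampler() for _ in range(n_draws))
            if abs(simulate(theta) - observed) < tol]

# toy example: infer the mean of a normal with known sd = 1
post = abc_rejection(
    observed=2.0,
    prior_sampler=lambda: random.uniform(-5, 5),
    simulate=lambda th: statistics.mean(random.gauss(th, 1) for _ in range(50)),
    n_draws=20_000,
    tol=0.1,
)
print(len(post), statistics.mean(post))  # posterior sample, roughly centred on 2
```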

622 citations

Journal ArticleDOI
TL;DR: Sparse trees and coalescence records are introduced as the key units of genealogical analysis and exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods.
Abstract: A central challenge in the analysis of genetic variation is to provide realistic genome simulation across millions of samples. Present-day coalescent simulations do not scale well, or use approximations that fail to capture important long-range linkage properties. Analysing the results of simulations also presents a substantial challenge, as current methods to store genealogies consume a great deal of space, are slow to parse and do not take advantage of shared structure in correlated trees. We solve these problems by introducing sparse trees and coalescence records as the key units of genealogical analysis. Using these tools, exact simulation of the coalescent with recombination for chromosome-sized regions over hundreds of thousands of samples is possible, and substantially faster than present-day approximate methods. We can also analyse the results orders of magnitude more quickly than with existing methods.
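The key data structure can be pictured as a set of interval-annotated coalescence events, from which the tree at any genomic position is recovered by filtering. A hypothetical Python rendering (field names are illustrative, not the library's actual API):

```python
from dataclasses import dataclass

@dataclass
class CoalescenceRecord:
    """One coalescence event, in the spirit of the records described in
    the paper: over the genomic interval [left, right), `children`
    merge into node `parent` at time `time`."""
    left: float
    right: float
    parent: int
    children: tuple
    time: float

def parents_at(records, x):
    """Recover the genealogy at position x: keep only records whose
    interval covers x, yielding child -> parent pointers. Correlated
    trees along the genome share most of their records, which is what
    makes the representation compact."""
    return {c: r.parent for r in records if r.left <= x < r.right
            for c in r.children}

records = [
    CoalescenceRecord(0.0, 10_000.0, 4, (0, 1), 0.3),
    CoalescenceRecord(0.0, 25_000.0, 5, (2, 3), 0.5),
    CoalescenceRecord(0.0, 25_000.0, 6, (4, 5), 1.2),
]
print(parents_at(records, 5_000.0))
```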

564 citations


Cites background from "fastsimcoal: a continuous-time coal..."

  • ...Also, for these larger sequence lengths and sample sizes, ms is unreliable and crashes [15, 47]....


Journal ArticleDOI
01 Jun 2013 - Genetics
TL;DR: Refined IBD allows for IBD reporting on a haplotype level, which facilitates determination of multi-individual IBD and allows for haplotype-based downstream analyses and is implemented in Beagle version 4.
Abstract: Segments of identity-by-descent (IBD) detected from high-density genetic data are useful for many applications, including long-range phase determination, phasing family data, imputation, IBD mapping, and heritability analysis in founder populations. We present Refined IBD, a new method for IBD segment detection. Refined IBD achieves both computational efficiency and highly accurate IBD segment reporting by searching for IBD in two steps. The first step (identification) uses the GERMLINE algorithm to find shared haplotypes exceeding a length threshold. The second step (refinement) evaluates candidate segments with a probabilistic approach to assess the evidence for IBD. Like GERMLINE, Refined IBD allows for IBD reporting on a haplotype level, which facilitates determination of multi-individual IBD and allows for haplotype-based downstream analyses. To investigate the properties of Refined IBD, we simulate SNP data from a model with recent superexponential population growth that is designed to match United Kingdom data. The simulation results show that Refined IBD achieves a better power/accuracy profile than fastIBD or GERMLINE. We find that a single run of Refined IBD achieves greater power than 10 runs of fastIBD. We also apply Refined IBD to SNP data for samples from the United Kingdom and from Northern Finland and describe the IBD sharing in these data sets. Refined IBD is powerful, highly accurate, and easy to use and is implemented in Beagle version 4.
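The refinement step can be understood as a likelihood-ratio test on each candidate segment. A toy sketch of that idea (Refined IBD's actual models are haplotype-frequency based, via Beagle; the per-marker likelihoods below are placeholders):

```python
import math

def segment_lod(p_ibd, p_background):
    """Score a candidate segment as the log10 likelihood ratio of the
    observed genotype data under an IBD model versus a non-IBD
    (background sharing) model; segments whose LOD exceeds a chosen
    threshold are reported."""
    return sum(math.log10(a / b) for a, b in zip(p_ibd, p_background))

# candidate segment with per-marker likelihoods under each model
lod = segment_lod([0.99, 0.98, 0.97], [0.60, 0.50, 0.55])
print(lod)  # positive values favour the IBD model
```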

524 citations

References
Journal ArticleDOI
TL;DR: The main innovations of the new version of the Arlequin program include enhanced outputs in XML format, the possibility to embed graphics displaying computation results directly into output files, and the implementation of a new method to detect loci under selection from genome scans.
Abstract: We present here a new version of the Arlequin program available under three different forms: a Windows graphical version (Winarl35), a console version of Arlequin (arlecore), and a specific console version to compute summary statistics (arlsumstat). The command-line versions run under both Linux and Windows. The main innovations of the new version include enhanced outputs in XML format, the possibility to embed graphics displaying computation results directly into output files, and the implementation of a new method to detect loci under selection from genome scans. Command-line versions are designed to handle large series of files, and arlsumstat can be used to generate summary statistics from simulated data sets within an Approximate Bayesian Computation framework.
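In the ABC setting mentioned above, arlsumstat's role is to reduce each simulated dataset to a vector of summary statistics. As an illustration of the kind of statistic involved, here is a sketch of average pairwise nucleotide diversity (an independent toy reimplementation, not Arlequin code):

```python
from itertools import combinations

def nucleotide_diversity(haplotypes):
    """Average number of pairwise differences per site (pi), one of the
    standard summary statistics a tool like arlsumstat writes out for
    each simulated dataset in an ABC loop."""
    length = len(haplotypes[0])
    pair_diffs = [sum(a != b for a, b in zip(h1, h2))
                  for h1, h2 in combinations(haplotypes, 2)]
    return sum(pair_diffs) / len(pair_diffs) / length

print(nucleotide_diversity(["ACGTACGT", "ACGAACGT", "TCGAACGT"]))
```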

13,581 citations


Additional excerpts

  • ...arp) that can then be processed with arlequin or arlsumstat (Excoffier and Lischer 2010) to get distributions of various summary statistics. Additional options of fastsimcoal2 can be specified on the command line (type "fastsimcoal2 -h" for help on command line options). fastsimcoal2 can handle very complex evolutionary scenarios including an arbitrary migration matrix between samples, historical events allowing for population resize, population fusion and fission, admixture events, changes in migration matrix, or changes in population growth rates. The time of sampling can be specified independently for each sample, allowing for serial sampling in the same or in different populations. Different markers, such as DNA sequences, SNP, STR (microsatellite) or multi-locus allelic data, can be generated under a variety of mutation models (e.g. finite- and infinite-site models for DNA sequences, stepwise or generalized stepwise mutation models for STR data, infinite-allele model for standard multi-allelic data). fastsimcoal2 can simulate data in genomic regions with arbitrary recombination rates, thus allowing for recombination hotspots of different intensities at any position. fastsimcoal2 implements a new approximation to the ancestral recombination graph in the form of a sequential Markov coalescent, allowing it to very quickly generate genetic diversity for >100 Mb genomic segments. Compiled versions of fastsimcoal2 for Windows, Linux or Mac OS X are available on http://cmpg.unibe.ch/software/fastsimcoal2. Since fastsimcoal2 output is meant to be interfaced with Arlequin or arlsumstat, the reader may also want to get more information on Arlequin on http://cmpg.unibe.ch/software/arlequin35. Since ver 2.1, fastsimcoal2 can be used to estimate demographic parameters from the (joint) SFS, as described in Excoffier et al. (2013)...

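Among the mutation models listed in the excerpt above, the generalized stepwise model for STRs is easy to sketch (parameter names are illustrative, not fastsimcoal2's actual settings):

```python
import random

def gsm_mutate(allele, n_mutations, p_step=0.7):
    """Generalized stepwise mutation for an STR allele: each mutation
    moves the repeat count up or down by a geometrically distributed
    number of units (p_step = 1.0 gives the strict stepwise model)."""
    for _ in range(n_mutations):
        step = 1
        while random.random() > p_step:   # geometric(p_step) step size
            step += 1
        allele += step if random.random() < 0.5 else -step
    return allele

print(gsm_mutate(20, n_mutations=5))
```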

Book
31 Jan 1986
TL;DR: Numerical Recipes: The Art of Scientific Computing, as discussed by the authors, is a complete text and reference book on scientific computing with over 100 new routines (now well over 300 in all), plus upgraded versions of many of the original routines, with many new topics presented at the same accessible level.
Abstract: From the Publisher: This is the revised and greatly expanded Second Edition of the hugely popular Numerical Recipes: The Art of Scientific Computing. The product of a unique collaboration among four leading scientists in academic research and industry, Numerical Recipes is a complete text and reference book on scientific computing. In a self-contained manner it proceeds from mathematical and theoretical considerations to actual practical computer routines. With over 100 new routines (now well over 300 in all), plus upgraded versions of many of the original routines, this book is more than ever the most practical, comprehensive handbook of scientific computing available today. The book retains the informal, easy-to-read style that made the first edition so popular, with many new topics presented at the same accessible level. In addition, some sections of more advanced material have been introduced, set off in small type from the main body of the text. Numerical Recipes is an ideal textbook for scientists and engineers and an indispensable reference for anyone who works in scientific computing. Highlights of the new material include a new chapter on integral equations and inverse methods; multigrid methods for solving partial differential equations; improved random number routines; wavelet transforms; the statistical bootstrap method; a new chapter on "less-numerical" algorithms including compression coding and arbitrary precision arithmetic; band diagonal linear systems; linear algebra on sparse matrices; Cholesky and QR decomposition; calculation of numerical derivatives; Padé approximants, and rational Chebyshev approximation; new special functions; Monte Carlo integration in high-dimensional spaces; globally convergent methods for sets of nonlinear equations; an expanded chapter on fast Fourier methods; spectral analysis on unevenly sampled data; Savitzky-Golay smoothing filters; and two-dimensional Kolmogorov-Smirnov tests. All this is in addition to material on such basic topics…

12,662 citations

Journal ArticleDOI
28 Oct 2010 - Nature
TL;DR: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype; this paper presents the results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms.
Abstract: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10^-8 per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

7,538 citations