Journal ArticleDOI

Improving the Accuracy and Efficiency of Identity by Descent Detection in Population Data

01 Jun 2013-Genetics (Genetics Society of America)-Vol. 194, Iss: 2, pp 459-471
TL;DR: Refined IBD allows for IBD reporting on a haplotype level, which facilitates determination of multi-individual IBD and enables haplotype-based downstream analyses; it is implemented in Beagle version 4.
Abstract: Segments of identity-by-descent (IBD) detected from high-density genetic data are useful for many applications, including long-range phase determination, phasing family data, imputation, IBD mapping, and heritability analysis in founder populations. We present Refined IBD, a new method for IBD segment detection. Refined IBD achieves both computational efficiency and highly accurate IBD segment reporting by searching for IBD in two steps. The first step (identification) uses the GERMLINE algorithm to find shared haplotypes exceeding a length threshold. The second step (refinement) evaluates candidate segments with a probabilistic approach to assess the evidence for IBD. Like GERMLINE, Refined IBD allows for IBD reporting on a haplotype level, which facilitates determination of multi-individual IBD and allows for haplotype-based downstream analyses. To investigate the properties of Refined IBD, we simulate SNP data from a model with recent superexponential population growth that is designed to match United Kingdom data. The simulation results show that Refined IBD achieves a better power/accuracy profile than fastIBD or GERMLINE. We find that a single run of Refined IBD achieves greater power than 10 runs of fastIBD. We also apply Refined IBD to SNP data for samples from the United Kingdom and from Northern Finland and describe the IBD sharing in these data sets. Refined IBD is powerful, highly accurate, and easy to use and is implemented in Beagle version 4.
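To make the two-step design concrete, the sketch below illustrates the general shape of the approach in Python: a GERMLINE-style hashing pass proposes long shared-haplotype candidates, and a simplified LOD score then weighs the evidence for IBD on each candidate. The window sizes, thresholds, scoring model, and all names here are illustrative assumptions, not Beagle's implementation; Refined IBD's actual refinement step uses a haplotype-frequency model rather than the independent-allele score shown.

```python
# Sketch of two-step IBD detection: hash-based candidate identification
# (GERMLINE-style), then a toy LOD-score refinement. Illustrative only.
from collections import defaultdict
from math import log10

WINDOW = 8        # markers per hash window (assumption, not Beagle's value)
MIN_WINDOWS = 4   # minimum consecutive matching windows for a candidate

def find_candidates(haplotypes):
    """Step 1: bucket identical windows, then chain consecutive matches."""
    n_markers = len(next(iter(haplotypes.values())))
    n_windows = n_markers // WINDOW
    runs = defaultdict(int)       # (hap_a, hap_b) -> current match-run length
    candidates = []
    for w in range(n_windows):
        buckets = defaultdict(list)
        for name, hap in haplotypes.items():
            buckets[tuple(hap[w * WINDOW:(w + 1) * WINDOW])].append(name)
        matched = set()
        for names in buckets.values():
            for i in range(len(names)):
                for j in range(i + 1, len(names)):
                    pair = (names[i], names[j])
                    runs[pair] += 1
                    matched.add(pair)
        for pair in list(runs):   # flush runs that ended at this window
            if pair not in matched:
                if runs[pair] >= MIN_WINDOWS:
                    candidates.append(
                        (pair, (w - runs[pair]) * WINDOW, w * WINDOW))
                del runs[pair]
    for pair, run in runs.items():  # runs that reach the final window
        if run >= MIN_WINDOWS:
            candidates.append(
                (pair, (n_windows - run) * WINDOW, n_windows * WINDOW))
    return candidates

def lod_score(hap_a, hap_b, freqs, start, end):
    """Step 2 (toy): log10 of P(data | IBD) / P(data | not IBD), treating
    markers as independent and ignoring genotype error on matches."""
    lod = 0.0
    for m in range(start, end):
        if hap_a[m] == hap_b[m]:
            p = freqs[m] if hap_a[m] == 1 else 1.0 - freqs[m]
            lod += log10(1.0 / max(p, 1e-6))  # chance sharing has prob ~p
    return lod

haps = {"A": [0, 1] * 40, "B": [0, 1] * 40, "C": [1, 0] * 40}
cands = find_candidates(haps)            # [(('A', 'B'), 0, 80)]
pair, s, e = cands[0]
print(pair, lod_score(haps[pair[0]], haps[pair[1]], [0.5] * 80, s, e))
```

In the real method, candidates whose LOD score clears a user-chosen threshold are reported as IBD segments (the analyses quoted further down use LOD scores of 4 and 5).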
Citations
Journal ArticleDOI
Adam Auton1, Gonçalo R. Abecasis2, David Altshuler3, Richard Durbin4  +514 more · Institutions (90)
01 Oct 2015-Nature
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations

Journal ArticleDOI
TL;DR: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility, and for the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.
Abstract: Background: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for faster and scalable implementations of key functions, such as logistic regression, linkage disequilibrium estimation, and genomic distance evaluation. In addition, GWAS and population-genetic data now frequently contain genotype likelihoods, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1’s primary data format. Findings: To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(√n)-time/constant-space Hardy-Weinberg equilibrium and Fisher’s exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. We have also developed an extension to the data format which adds low-overhead support for genotype likelihoods, phase, multiallelic variants, and reference vs. alternate alleles, which is the basis of our planned second release (PLINK 2.0). Conclusions: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

7,038 citations
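The bit-level parallelism credited above for much of PLINK 1.9's speedup is easy to illustrate: genotypes packed two bits per sample let allele counts collapse into a few mask-and-popcount operations over whole words. The following Python sketch shows the idea under a simplified 0/1/2 dosage encoding with no missing data; PLINK's actual on-disk format uses a different 2-bit code that also represents missingness.

```python
# Illustration of 2-bit packed genotypes and popcount-based allele counting,
# in the spirit of PLINK 1.9's bit-level parallelism (not its actual format).

def pack_genotypes(genos):
    """Pack 0/1/2 alt-allele dosages into one integer, 2 bits per sample."""
    packed = 0
    for i, g in enumerate(genos):
        packed |= g << (2 * i)
    return packed

def count_alt_alleles(packed, n):
    """Two popcounts: the low bit of each field counts once, the high bit
    twice (dosage 2 is stored as 0b10)."""
    low_mask = int("01" * n, 2)       # mask selecting each field's low bit
    high = (packed >> 1) & low_mask
    low = packed & low_mask
    return 2 * bin(high).count("1") + bin(low).count("1")

genos = [0, 1, 2, 2, 1, 0, 2]
packed = pack_genotypes(genos)
assert count_alt_alleles(packed, len(genos)) == sum(genos)  # 8
```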


Cites methods from "Improving the Accuracy and Efficien..."

  • ...The name was originally shorthand for “population linkage”; BEAGLE: A software package capable of high-accuracy haplotype phasing, genotype imputation, and identity-by-descent estimation, developed by Browning [2]; GCTA: Genome-wide Complex Trait Analysis....


  • ...To support easier interoperation with newer software, for example BEAGLE 4 [2], IMPUTE2 [3], GATK [4], VCFtools [5], BCFtools [6] and GCTA [7], features such as the import/export of VCF and Oxford-format files and an efficient cross-platform genomic relationship matrix (GRM) calculator have been introduced....


  • ...Browning, B., Browning, S.: A fast, powerful method for detecting identity by descent....


  • ...Browning, B.: Presto: rapid calculation of order statistic distributions and multiple-testing adjusted p-values via permutation for one and two-stage genetic association studies....


  • ...Browning, B., Browning, S.: Improving the accuracy and efficiency of identity by descent detection in population data....


Journal ArticleDOI
TL;DR: PLINK 1.9, the first major release of a second-generation codebase for the widely used open-source C/C++ GWAS toolset, accelerates most operations by 1-4 orders of magnitude and can handle datasets too large to fit in RAM.
Abstract: PLINK 1 is a widely used open-source C/C++ toolset for genome-wide association studies (GWAS) and research in population genetics. However, the steady accumulation of data from imputation and whole-genome sequencing studies has exposed a strong need for even faster and more scalable implementations of key functions. In addition, GWAS and population-genetic data now frequently contain probabilistic calls, phase information, and/or multiallelic variants, none of which can be represented by PLINK 1's primary data format. To address these issues, we are developing a second-generation codebase for PLINK. The first major release from this codebase, PLINK 1.9, introduces extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic improvements. In combination, these changes accelerate most operations by 1-4 orders of magnitude, and allow the program to handle datasets too large to fit in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data format capable of efficiently representing probabilities, phase, and multiallelic variants, and (b) extensions of many functions to account for the new types of information. The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility. For the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.

3,513 citations

Journal ArticleDOI
Iosif Lazaridis1, Iosif Lazaridis2, Nick Patterson2, Alissa Mittnik3, Gabriel Renaud4, Swapan Mallick1, Swapan Mallick2, Karola Kirsanow5, Peter H. Sudmant6, Joshua G. Schraiber7, Joshua G. Schraiber6, Sergi Castellano4, Mark Lipson8, Bonnie Berger8, Bonnie Berger2, Christos Economou9, Ruth Bollongino5, Qiaomei Fu4, Kirsten I. Bos3, Susanne Nordenfelt2, Susanne Nordenfelt1, Heng Li2, Heng Li1, Cesare de Filippo4, Kay Prüfer4, Susanna Sawyer4, Cosimo Posth3, Wolfgang Haak10, Fredrik Hallgren11, Elin Fornander11, Nadin Rohland1, Nadin Rohland2, Dominique Delsate12, Michael Francken3, Jean-Michel Guinet12, Joachim Wahl, George Ayodo, Hamza A. Babiker13, Hamza A. Babiker14, Graciela Bailliet, Elena Balanovska, Oleg Balanovsky, Ramiro Barrantes15, Gabriel Bedoya16, Haim Ben-Ami17, Judit Bene18, Fouad Berrada19, Claudio M. Bravi, Francesca Brisighelli20, George B.J. Busby21, Francesco Calì, Mikhail Churnosov22, David E. C. Cole23, Daniel Corach24, Larissa Damba, George van Driem25, Stanislav Dryomov26, Jean-Michel Dugoujon27, Sardana A. Fedorova28, Irene Gallego Romero29, Marina Gubina, Michael F. Hammer30, Brenna M. Henn31, Tor Hervig32, Ugur Hodoglugil33, Aashish R. Jha29, Sena Karachanak-Yankova34, Rita Khusainova35, Elza Khusnutdinova35, Rick A. Kittles30, Toomas Kivisild36, William Klitz7, Vaidutis Kučinskas37, Alena Kushniarevich38, Leila Laredj39, Sergey Litvinov38, Theologos Loukidis40, Theologos Loukidis41, Robert W. Mahley42, Béla Melegh18, Ene Metspalu43, Julio Molina, Joanna L. Mountain, Klemetti Näkkäläjärvi44, Desislava Nesheva34, Thomas B. Nyambo45, Ludmila P. Osipova, Jüri Parik43, Fedor Platonov28, Olga L. Posukh, Valentino Romano46, Francisco Rothhammer47, Francisco Rothhammer48, Igor Rudan13, Ruslan Ruizbakiev49, Hovhannes Sahakyan50, Hovhannes Sahakyan38, Antti Sajantila51, Antonio Salas52, Elena B. Starikovskaya26, Ayele Tarekegn, Draga Toncheva34, Shahlo Turdikulova49, Ingrida Uktveryte37, Olga Utevska53, René Vasquez54, Mercedes Villena54, Mikhail Voevoda55, Cheryl A. Winkler56, Levon Yepiskoposyan50, Pierre Zalloua1, Pierre Zalloua57, Tatijana Zemunik58, Alan Cooper10, Cristian Capelli21, Mark G. Thomas41, Andres Ruiz-Linares41, Sarah A. Tishkoff59, Lalji Singh60, Kumarasamy Thangaraj61, Richard Villems43, Richard Villems62, Richard Villems38, David Comas63, Rem I. Sukernik26, Mait Metspalu38, Matthias Meyer4, Evan E. Eichler6, Joachim Burger5, Montgomery Slatkin7, Svante Pääbo4, Janet Kelso4, David Reich1, David Reich2, David Reich64, Johannes Krause4, Johannes Krause3 
Harvard University1, Broad Institute2, University of Tübingen3, Max Planck Society4, University of Mainz5, University of Washington6, University of California, Berkeley7, Massachusetts Institute of Technology8, Stockholm University9, University of Adelaide10, The Heritage Foundation11, National Museum of Natural History12, University of Edinburgh13, Sultan Qaboos University14, University of Costa Rica15, University of Antioquia16, Rambam Health Care Campus17, University of Pécs18, Al Akhawayn University19, Catholic University of the Sacred Heart20, University of Oxford21, Belgorod State University22, University of Toronto23, University of Buenos Aires24, University of Bern25, Russian Academy of Sciences26, Paul Sabatier University27, North-Eastern Federal University28, University of Chicago29, University of Arizona30, Stony Brook University31, University of Bergen32, Illumina33, Sofia Medical University34, Bashkir State University35, University of Cambridge36, Vilnius University37, Estonian Biocentre38, University of Strasbourg39, Amgen40, University College London41, Gladstone Institutes42, University of Tartu43, University of Oulu44, Muhimbili University of Health and Allied Sciences45, University of Palermo46, University of Chile47, University of Tarapacá48, Academy of Sciences of Uzbekistan49, Armenian National Academy of Sciences50, University of North Texas51, University of Santiago de Compostela52, University of Kharkiv53, Higher University of San Andrés54, Novosibirsk State University55, Leidos56, Lebanese American University57, University of Split58, University of Pennsylvania59, Banaras Hindu University60, Centre for Cellular and Molecular Biology61, Estonian Academy of Sciences62, Pompeu Fabra University63, Howard Hughes Medical Institute64
18 Sep 2014-Nature
TL;DR: It is shown that most present-day Europeans derive from at least three highly differentiated populations: west European hunter-gatherers, who contributed ancestry to all Europeans but not to Near Easterners; ancient north Eurasians related to Upper Palaeolithic Siberians; and early European farmers, who were mainly of Near Eastern origin but also harboured west European hunter-gatherer related ancestry.
Abstract: We sequenced the genomes of a ∼7,000-year-old farmer from Germany and eight ∼8,000-year-old hunter-gatherers from Luxembourg and Sweden. We analysed these and other ancient genomes with 2,345 contemporary humans to show that most present-day Europeans derive from at least three highly differentiated populations: west European hunter-gatherers, who contributed ancestry to all Europeans but not to Near Easterners; ancient north Eurasians related to Upper Palaeolithic Siberians, who contributed to both Europeans and Near Easterners; and early European farmers, who were mainly of Near Eastern origin but also harboured west European hunter-gatherer related ancestry. We model these populations' deep relationships and show that early European farmers had ∼44% ancestry from a 'basal Eurasian' population that split before the diversification of other non-African lineages.

1,077 citations

Journal ArticleDOI
Daniel Taliun1, Daniel N. Harris2, Michael D. Kessler2, Jedidiah Carlson3  +202 more · Institutions (61)
10 Feb 2021-Nature
TL;DR: The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases.
Abstract: The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes). In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%. The goals, resources and design of the NHLBI Trans-Omics for Precision Medicine (TOPMed) programme are described, and analyses of rare variants detected in the first 53,831 samples provide insights into mutational processes and recent human evolutionary history.

801 citations

References
Journal ArticleDOI
TL;DR: This work introduces PLINK, an open-source C/C++ WGAS tool set, and describes its five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation, with a focus on the estimation and use of identity-by-state and identity-by-descent information in population-based whole-genome studies.
Abstract: Whole-genome association studies (WGAS) bring new computational, as well as analytic, challenges to researchers. Many existing genetic-analysis tools are not designed to handle such large data sets in a convenient manner and do not necessarily exploit the new opportunities that whole-genome data bring. To address these issues, we developed PLINK, an open-source C/C++ WGAS tool set. With PLINK, large data sets comprising hundreds of thousands of markers genotyped for thousands of individuals can be rapidly manipulated and analyzed in their entirety. As well as providing tools to make the basic analytic steps computationally efficient, PLINK also supports some novel approaches to whole-genome data that take advantage of whole-genome coverage. We introduce PLINK and describe the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation. In particular, we focus on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies. This information can be used to detect and correct for population stratification and to identify extended chromosomal segments that are shared identical by descent between very distantly related individuals. Analysis of the patterns of segmental sharing has the potential to map disease loci that contain multiple rare variants in a population-based linkage analysis.

26,280 citations
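As a concrete instance of the identity-by-state machinery this TL;DR highlights, the sketch below computes pairwise IBS counts and the mean proportion of alleles shared identical by state, the raw ingredient of relatedness screening. It is deliberately simplified (biallelic sites, no missing data, no allele-frequency correction) and is not PLINK's --genome estimator, which additionally converts IBS sharing into IBD probabilities.

```python
# Toy identity-by-state (IBS) sharing between two genotype vectors.
# Simplified (biallelic sites, no missing data); not PLINK's estimator.

def ibs_counts(genos_a, genos_b):
    """Count markers at IBS 0, 1 and 2. With 0/1/2 alt-allele dosages
    at a biallelic site, IBS = 2 - |dose_a - dose_b|."""
    counts = [0, 0, 0]
    for a, b in zip(genos_a, genos_b):
        counts[2 - abs(a - b)] += 1
    return counts

def mean_ibs(genos_a, genos_b):
    """Average proportion of alleles shared identical by state."""
    ibs0, ibs1, ibs2 = ibs_counts(genos_a, genos_b)
    n_markers = ibs0 + ibs1 + ibs2
    return (ibs1 + 2 * ibs2) / (2 * n_markers)

print(mean_ibs([0, 1, 2, 1], [0, 1, 1, 2]))  # 0.75
```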


"Improving the Accuracy and Efficien..." refers methods in this paper

  • ...Probabilistic methods including Beagle IBD (Browning and Browning 2010), IBD_Haplo (Brown et al. 2012), RELATE (Albrechtsen et al. 2009), IBDLD (Han and Abney 2011), and PLINK (Purcell et al. 2007) fit a hidden Markov model (HMM) for IBD status and determine posterior probabilities of IBD....

    [...]

Journal ArticleDOI
Lawrence R. Rabiner1
01 Feb 1989
TL;DR: In this paper, the authors provide an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and give practical details on methods of implementation of the theory along with a description of selected applications of HMMs to distinct problems in speech recognition.
Abstract: This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described.

21,819 citations
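Several of the IBD methods discussed in this paper fit exactly the kind of discrete HMM Rabiner describes, so a compact example is useful. The forward recursion below computes the likelihood of an observation sequence; the two-coin model echoes the tutorial's coin-tossing illustration, with all probabilities invented for the example.

```python
# Forward algorithm for a discrete HMM: P(observations | model).
# Two-coin toy model in the spirit of Rabiner's coin-tossing example.

def forward(obs, pi, A, B):
    """pi[i]: initial probs; A[i][j]: transitions; B[i][o]: emissions."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]       # initialization
    for o in obs[1:]:                                      # induction
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)                                      # termination

pi = [0.5, 0.5]                    # state 0: fair coin, state 1: biased
A = [[0.9, 0.1], [0.1, 0.9]]       # sticky transitions
B = [[0.5, 0.5], [0.8, 0.2]]       # emission probs for obs 0 (H) and 1 (T)
print(forward([0, 0, 1, 0], pi, A, B))   # likelihood of H, H, T, H
```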

Journal ArticleDOI
TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net Contact: [email protected]

10,164 citations
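For a sense of what "processing VCF files" involves at the format level, here is a minimal reader for the fixed-column VCF text layout (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT plus one column per sample). It is an illustrative sketch that skips header parsing and validation; production work should use VCFtools itself, bcftools, or a library such as pysam.

```python
# Minimal reader for VCF's fixed-column text layout. Illustration only;
# assumes GT, when present, is the first FORMAT key (as the VCF
# specification requires).
import gzip

def read_vcf(path):
    """Yield (chrom, pos, ref, alts, {sample: GT string}) per record."""
    opener = gzip.open if path.endswith(".gz") else open
    samples = []
    with opener(path, "rt") as fh:
        for line in fh:
            if line.startswith("##"):
                continue                      # meta-information header
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                samples = fields[9:]          # sample names
                continue
            chrom, pos, _id, ref, alt = fields[:5]
            gts = [f.split(":")[0] for f in fields[9:]]
            yield chrom, int(pos), ref, alt.split(","), dict(zip(samples, gts))
```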


"Improving the Accuracy and Efficien..." refers background or methods in this paper

  • ...…history (Campbell et al. 2012; Gusev et al. 2012; Palamara et al. 2012; Ralph and Coop 2012), IBD mapping (Purcell et al. 2007; Gusev et al. 2011; Browning and Thompson 2012), and heritability analysis in founder populations (Price et al. 2011; Zuk et al. 2012; Browning and Browning 2013)....


  • ...We analyzed 500 individuals of the simulated unphased filtered sequence data, using Refined IBD with a minimum segment length of 0.2 cM and LOD scores of 4 and 5....


Journal ArticleDOI
28 Oct 2010-Nature
TL;DR: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype; this paper presents results of the project's pilot phase, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms.
Abstract: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10⁻⁸ per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.

7,538 citations

Journal ArticleDOI
TL;DR: The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R₀ and whose performance bears certain similarities to that of sequential decoding algorithms.
Abstract: The probability of error in decoding an optimal convolutional code transmitted over a memoryless channel is bounded from above and below as a function of the constraint length of the code. For all but pathological channels the bounds are asymptotically (exponentially) tight for rates above R₀, the computational cutoff rate of sequential decoding. As a function of constraint length the performance of optimal convolutional codes is shown to be superior to that of block codes of the same length, the relative improvement increasing with rate. The upper bound is obtained for a specific probabilistic nonsequential decoding algorithm which is shown to be asymptotically optimum for rates above R₀ and whose performance bears certain similarities to that of sequential decoding algorithms.

6,804 citations
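The "probabilistic nonsequential decoding algorithm" analysed here is what is now called Viterbi decoding. As a worked toy example, the sketch below encodes a short message with a rate-1/2, constraint-length-3 convolutional code (the classic generators 7 and 5 in octal) and recovers it by hard-decision trellis search after one injected bit error; there is no termination tail and the channel metric is plain Hamming distance.

```python
# Toy hard-decision Viterbi decoder for the rate-1/2, constraint-length-3
# convolutional code with generators (7, 5) octal. Illustration only.
G = (0b111, 0b101)                  # generator polynomials

def encode(bits):
    state, out = 0, []
    for b in bits:
        reg = (b << 2) | state      # [current bit, two previous bits]
        out += [bin(reg & g).count("1") & 1 for g in G]
        state = reg >> 1
    return out

def viterbi_decode(received, n_bits):
    """Trellis search minimizing Hamming distance over state paths."""
    metrics = {0: (0, [])}          # state -> (path metric, decoded bits)
    for t in range(n_bits):
        r = received[2 * t:2 * t + 2]
        nxt = {}
        for state, (m, path) in metrics.items():
            for b in (0, 1):
                reg = (b << 2) | state
                expected = [bin(reg & g).count("1") & 1 for g in G]
                cost = m + sum(x != y for x, y in zip(expected, r))
                s2 = reg >> 1
                if s2 not in nxt or cost < nxt[s2][0]:
                    nxt[s2] = (cost, path + [b])   # keep the survivor
        metrics = nxt
    return min(metrics.values())[1]  # best metric among final states

msg = [1, 0, 1, 1, 0]
rx = encode(msg)
rx[3] ^= 1                          # inject a single channel error
assert viterbi_decode(rx, len(msg)) == msg
```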