Journal ArticleDOI

Next-generation genotype imputation service and methods.

TL;DR: Improvements to imputation machinery are described that reduce computational requirements by more than an order of magnitude with no loss of accuracy in comparison to standard imputation tools.
Abstract: Christian Fuchsberger, Goncalo Abecasis and colleagues describe a new web-based imputation service that enables rapid imputation of large numbers of samples and allows convenient access to large reference panels of sequenced individuals. Their state space reduction provides a computationally efficient solution for genotype imputation with no loss in imputation accuracy.
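
The state space reduction works because, over a short stretch of markers, many reference haplotypes carry identical allele sequences and can be collapsed into a single HMM state. Below is a minimal Python sketch of that grouping step, assuming a binary haplotype matrix; the real method (Minimac3) also maps HMM probabilities into and out of each block, which is omitted here.

```python
from collections import defaultdict

def reduce_states(haplotypes, start, end):
    """Collapse reference haplotypes that are identical across the
    marker block [start, end) into unique states with multiplicities.
    A minimal sketch of the state-space-reduction idea; the HMM then
    runs over the unique states rather than all haplotypes."""
    groups = defaultdict(list)
    for idx, hap in enumerate(haplotypes):
        groups[tuple(hap[start:end])].append(idx)
    # each unique allele sequence becomes one HMM state for this block,
    # carrying the list of original haplotypes it represents
    return list(groups.items())

# toy reference panel: 6 haplotypes over 4 markers -> 3 unique states
panel = [[0,1,0,1], [0,1,0,1], [1,1,0,0], [0,1,0,1], [1,1,0,0], [0,0,0,0]]
for alleles, members in reduce_states(panel, 0, 4):
    print(alleles, members)
```

Within each block, the HMM cost then scales with the number of unique allele sequences rather than the full panel size, which is where the order-of-magnitude saving comes from.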
Citations
Journal ArticleDOI
TL;DR: A new phasing algorithm, Eagle2, is introduced that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform.
Abstract: Po-Ru Loh, Alkes Price and colleagues present Eagle2, a reference-based phasing algorithm that allows for highly accurate and efficient phasing of genotypes across a broad range of cohort sizes. They demonstrate an approximately 10% improvement in accuracy and 20% improvement in speed compared to a competing method, SHAPEIT2.
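
Eagle2's key data structure is the positional Burrows-Wheeler transform (PBWT). The sketch below builds the PBWT positional prefix arrays for a toy binary haplotype matrix using Durbin's stable-partition update; it illustrates the data structure only, not Eagle2's phasing algorithm itself.

```python
def pbwt_prefix_arrays(haps):
    """Build PBWT positional prefix arrays for a binary haplotype
    matrix. a[k] lists haplotype indices sorted by their reversed
    prefixes h[0..k-1], so haplotypes sharing long suffixes up to
    site k end up adjacent."""
    a = list(range(len(haps)))       # a[0]: initial (arbitrary) order
    arrays = [a]
    for k in range(len(haps[0])):
        # stable partition on the allele at site k = Durbin's update
        a = [i for i in a if haps[i][k] == 0] + \
            [i for i in a if haps[i][k] == 1]
        arrays.append(a)
    return arrays

haps = [[0,1,1,0], [1,1,1,0], [0,1,1,1], [0,0,1,0]]
print(pbwt_prefix_arrays(haps)[-1])   # order after the last site
```

Because haplotypes sharing long suffixes become adjacent in each prefix array, long matches against a large reference panel can be found in time roughly linear in the number of sites, which is what makes reference panels of HRC scale tractable.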

1,246 citations

Journal ArticleDOI
TL;DR: Evidence is reported for the involvement of many systems in tobacco and alcohol use, including genes involved in nicotinic, dopaminergic, and glutamatergic neurotransmission, which provide a solid starting point to evaluate the effects of these loci in model organisms and more precise substance use measures.
Abstract: Tobacco and alcohol use are leading causes of mortality that influence risk for many complex diseases and disorders [1]. They are heritable [2,3] and etiologically related [4,5] behaviors that have been resistant to gene discovery efforts [6-11]. In sample sizes up to 1.2 million individuals, we discovered 566 genetic variants in 406 loci associated with multiple stages of tobacco use (initiation, cessation, and heaviness) as well as alcohol use, with 150 loci evidencing pleiotropic association. Smoking phenotypes were positively genetically correlated with many health conditions, whereas alcohol use was negatively correlated with these conditions, such that increased genetic risk for alcohol use is associated with lower disease risk. We report evidence for the involvement of many systems in tobacco and alcohol use, including genes involved in nicotinic, dopaminergic, and glutamatergic neurotransmission. The results provide a solid starting point to evaluate the effects of these loci in model organisms and more precise substance use measures.

1,082 citations

Journal ArticleDOI
04 Mar 2021-Nature
TL;DR: The GenOMICC (Genetics Of Mortality In Critical Care) genome-wide association study in 2,244 critically ill COVID-19 patients from 208 UK intensive care units is reported, finding evidence in support of a causal link from low expression of IFNAR2, and high expression of TYK2, to life-threatening disease.
Abstract: Host-mediated lung inflammation is present [1], and drives mortality [2], in the critical illness caused by coronavirus disease 2019 (COVID-19). Host genetic variants associated with critical illness may identify mechanistic targets for therapeutic development [3]. Here we report the results of the GenOMICC (Genetics Of Mortality In Critical Care) genome-wide association study in 2,244 critically ill patients with COVID-19 from 208 UK intensive care units. We have identified and replicated the following new genome-wide significant associations: on chromosome 12q24.13 (rs10735079, P = 1.65 × 10−8) in a gene cluster that encodes antiviral restriction enzyme activators (OAS1, OAS2 and OAS3); on chromosome 19p13.2 (rs74956615, P = 2.3 × 10−8) near the gene that encodes tyrosine kinase 2 (TYK2); on chromosome 19p13.3 (rs2109069, P = 3.98 × 10−12) within the gene that encodes dipeptidyl peptidase 9 (DPP9); and on chromosome 21q22.1 (rs2236757, P = 4.99 × 10−8) in the interferon receptor gene IFNAR2. We identified potential targets for repurposing of licensed medications: using Mendelian randomization, we found evidence that low expression of IFNAR2, or high expression of TYK2, are associated with life-threatening disease; and transcriptome-wide association in lung tissue revealed that high expression of the monocyte–macrophage chemotactic receptor CCR2 is associated with severe COVID-19. Our results identify robust genetic signals relating to key host antiviral defence mechanisms and mediators of inflammatory organ damage in COVID-19. Both mechanisms may be amenable to targeted treatment with existing drugs. However, large-scale randomized clinical trials will be essential before any change to clinical practice. A genome-wide association study of critically ill patients with COVID-19 identifies genetic signals that relate to important host antiviral defence mechanisms and mediators of inflammatory organ damage that may be targeted by repurposing drug treatments.
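
The drug-repurposing argument rests on Mendelian randomization: genetic variants that shift IFNAR2 or TYK2 expression serve as instruments for the causal effect of expression on disease severity. As a rough illustration of the general idea (not the paper's actual multi-instrument pipeline), the simplest single-instrument estimator is the Wald ratio; all numbers in the sketch below are invented.

```python
def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Single-instrument Mendelian randomization via the Wald ratio:
    the causal effect of exposure on outcome is estimated as
    beta_outcome / beta_exposure, with a first-order delta-method SE
    (which ignores uncertainty in beta_exposure). A sketch of the
    general technique only; all inputs below are invented."""
    estimate = beta_outcome / beta_exposure
    se = abs(se_outcome / beta_exposure)
    return estimate, se

# hypothetical eQTL effect on expression, and GWAS effect on severity
est, se = wald_ratio(beta_exposure=0.30, beta_outcome=-0.12, se_outcome=0.03)
print(f"causal estimate {est:.2f} +/- {1.96 * se:.2f} (95% CI half-width)")
```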

941 citations

Journal ArticleDOI
TL;DR: A new genotype imputation method, Beagle 5.0, is presented, which greatly reduces the computational cost of imputation from large reference panels and is compared with Beagle 4.1 and Impute4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data.
Abstract: Genotype imputation is commonly performed in genome-wide association studies because it greatly increases the number of markers that can be tested for association with a trait. In general, one should perform genotype imputation using the largest reference panel that is available because the number of accurately imputed variants increases with reference panel size. However, one impediment to using larger reference panels is the increased computational cost of imputation. We present a new genotype imputation method, Beagle 5.0, which greatly reduces the computational cost of imputation from large reference panels. We compare Beagle 5.0 with Beagle 4.1, Impute4, Minimac3, and Minimac4 using 1000 Genomes Project data, Haplotype Reference Consortium data, and simulated data for 10k, 100k, 1M, and 10M reference samples. All methods produce nearly identical accuracy, but Beagle 5.0 has the lowest computation time and the best scaling of computation time with increasing reference panel size. For 10k, 100k, 1M, and 10M reference samples and 1,000 phased target samples, Beagle 5.0's computation time is 3× (10k), 12× (100k), 43× (1M), and 533× (10M) faster than the fastest alternative method. Cost data from the Amazon Elastic Compute Cloud show that Beagle 5.0 can perform genome-wide imputation from 10M reference samples into 1,000 phased target samples at a cost of less than one US cent per sample.

894 citations


Cites background, methods, or results from "Next-generation genotype imputation..."

  • ...Imputation accuracy for a variant generally increases with increasing reference panel size, and variants must be present in the reference panel in order to be accurately imputed [4,5]. Consequently, whenever a significantly larger reference panel becomes available, it is advantageous to re-impute target samples with the larger panel for subsequent analysis....

    [...]

  • ...Minimac does not require a genetic map because recombination parameters are estimated and stored when producing the m3vcf format input file for the reference data [4,24]. Beagle 4....

    [...]

  • ...The techniques developed thus far have made it possible to provide imputation using reference panels with tens of thousands of individuals as a free web service [4]. However, the total computational cost of imputation is substantial and increasing....

    [...]

  • ...Current imputation methods are able to make use of a rich palette of computational techniques, including the use of identity-by-descent [6,14,15], haplotype clustering [4,16], and linear interpolation [5] to reduce the model state space, the use of pre-phasing to reduce computational complexity [14,17,18], and the use of specialized reference file formats to reduce file size and memory footprint [4,5,19]. The techniques developed thus far have made it possible to provide imputation using reference panels with tens of thousands of individuals as a free web service....

    [...]

  • ...All the genotype imputation methods that we evaluated are based on the Li and Stephens probabilistic model [20] and have essentially the same imputation accuracy (Figure 3), which is consistent with previous reports [4,5]. The apparent difference in accuracy between Beagle 4....

    [...]
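
One of the techniques quoted above, linear interpolation, lets an imputation method evaluate its HMM only at the genotyped markers and fill in the sites between them cheaply. A minimal sketch of that idea follows; real methods interpolate per-haplotype state probabilities rather than final dosages, so this is illustrative only.

```python
def interpolated_dosage(pos, left_pos, right_pos, left_dose, right_dose):
    """Impute an allele dosage at an ungenotyped position by linear
    interpolation between the two flanking genotyped markers where
    the HMM was actually evaluated. Sketch of the 'linear
    interpolation' idea quoted above, not any tool's exact method."""
    w = (pos - left_pos) / (right_pos - left_pos)   # 0 at left, 1 at right
    return (1.0 - w) * left_dose + w * right_dose

# dosage at position 1,500 between genotyped markers at 1,000 and 2,000
print(interpolated_dosage(1500, 1000, 2000, left_dose=0.9, right_dose=0.3))
```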

Journal ArticleDOI
Daniel Taliun, Daniel N. Harris, Michael D. Kessler, Jedidiah Carlson, +202 more (61 institutions)
10 Feb 2021-Nature
TL;DR: The Trans-Omics for Precision Medicine (TOPMed) programme aims to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases.
Abstract: The Trans-Omics for Precision Medicine (TOPMed) programme seeks to elucidate the genetic architecture and biology of heart, lung, blood and sleep disorders, with the ultimate goal of improving diagnosis, treatment and prevention of these diseases. The initial phases of the programme focused on whole-genome sequencing of individuals with rich phenotypic data and diverse backgrounds. Here we describe the TOPMed goals and design as well as the available resources and early insights obtained from the sequence data. The resources include a variant browser, a genotype imputation server, and genomic and phenotypic data that are available through dbGaP (Database of Genotypes and Phenotypes) [1]. In the first 53,831 TOPMed samples, we detected more than 400 million single-nucleotide and insertion or deletion variants after alignment with the reference genome. Additional previously undescribed variants were detected through assembly of unmapped reads and customized analysis in highly variable loci. Among the more than 400 million detected variants, 97% have frequencies of less than 1% and 46% are singletons that are present in only one individual (53% among unrelated individuals). These rare variants provide insights into mutational processes and recent human evolutionary history. The extensive catalogue of genetic variation in TOPMed studies provides unique opportunities for exploring the contributions of rare and noncoding sequence variants to phenotypic variation. Furthermore, combining TOPMed haplotypes with modern imputation methods improves the power and reach of genome-wide association studies to include variants down to a frequency of approximately 0.01%. The goals, resources and design of the NHLBI Trans-Omics for Precision Medicine (TOPMed) programme are described, and analyses of rare variants detected in the first 53,831 samples provide insights into mutational processes and recent human evolutionary history.

801 citations

References
Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
06 Dec 2004
TL;DR: This paper presents the implementation of MapReduce, a programming model and an associated implementation for processing and generating large data sets that runs on a large cluster of commodity machines and is highly scalable.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google's clusters every day.
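
As a concrete illustration of the programming model (not of Google's distributed runtime), the canonical word-count example can be written in a few lines of single-process Python: a map function emits (word, 1) pairs, a shuffle step groups intermediate values by key, and a reduce function merges each group.

```python
from collections import defaultdict
from itertools import chain

def map_fn(document):
    # map: emit an intermediate (word, 1) pair for every word
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce: merge all intermediate values for the same key
    return word, sum(counts)

def mapreduce(documents):
    # shuffle: group intermediate values by intermediate key
    groups = defaultdict(list)
    for word, count in chain.from_iterable(map_fn(d) for d in documents):
        groups[word].append(count)
    return dict(reduce_fn(w, c) for w, c in groups.items())

print(mapreduce(["the quick brown fox", "the lazy brown dog"]))
# {'the': 2, 'quick': 1, 'brown': 2, 'fox': 1, 'lazy': 1, 'dog': 1}
```

The distributed system's value is that this same two-function contract lets the runtime partition the input, schedule the map and reduce tasks across thousands of machines, and re-execute failed tasks transparently.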

20,309 citations

Journal ArticleDOI
Jeffrey Dean, Sanjay Ghemawat
TL;DR: This presentation explains how the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks.
Abstract: MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

17,663 citations

Journal ArticleDOI
Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, +514 more (90 institutions)
01 Oct 2015-Nature
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations

Journal ArticleDOI
01 Nov 2012-Nature
TL;DR: It is shown that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites.
Abstract: By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

7,710 citations

Journal ArticleDOI
28 Oct 2010-Nature
TL;DR: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype; this paper presents the results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms.
Abstract: The 1000 Genomes Project aims to provide a deep characterization of human genome sequence variation as a foundation for investigating the relationship between genotype and phenotype. Here we present results of the pilot phase of the project, designed to develop and compare different strategies for genome-wide sequencing with high-throughput platforms. We undertook three projects: low-coverage whole-genome sequencing of 179 individuals from four populations; high-coverage sequencing of two mother-father-child trios; and exon-targeted sequencing of 697 individuals from seven populations. We describe the location, allele frequency and local haplotype structure of approximately 15 million single nucleotide polymorphisms, 1 million short insertions and deletions, and 20,000 structural variants, most of which were previously undescribed. We show that, because we have catalogued the vast majority of common variation, over 95% of the currently accessible variants found in any individual are present in this data set. On average, each person is found to carry approximately 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders. We demonstrate how these results can be used to inform association and functional studies. From the two trios, we directly estimate the rate of de novo germline base substitution mutations to be approximately 10(-8) per base pair per generation. We explore the data with regard to signatures of natural selection, and identify a marked reduction of genetic variation in the neighbourhood of genes, due to selection at linked sites. These methods and public data will support the next phase of human genetic research.
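
The reported mutation rate implies a simple sanity check on the expected number of de novo variants per child. Assuming roughly 2.8 × 10^9 accessible base pairs per haploid genome (an assumption, not a figure from the abstract), and one mutated haploid genome inherited from each parent:

```python
# Back-of-the-envelope check of the abstract's de novo rate.
mu = 1e-8                   # substitutions per bp per generation (abstract)
haploid_bp = 2.8e9          # assumed accessible haploid genome size
print(2 * mu * haploid_bp)  # ~56 expected de novo substitutions per child
```

which is consistent with the several dozen de novo substitutions typically observed per offspring in trio sequencing studies.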

7,538 citations