Journal Article

A global reference for human genetic variation.

Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, and 514 more authors (90 institutions)
01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. It reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
Citations
Journal Article
TL;DR: Transcriptome sequencing molecularly diagnoses 10% of mitochondriopathy patients and identifies candidate genes for the remainder; the study also provides examples of intronic loss-of-function variants with pathological relevance.
Abstract: Across a variety of Mendelian disorders, ∼50-75% of patients do not receive a genetic diagnosis by exome sequencing indicating disease-causing variants in non-coding regions. Although genome sequencing in principle reveals all genetic variants, their sizeable number and poorer annotation make prioritization challenging. Here, we demonstrate the power of transcriptome sequencing to molecularly diagnose 10% (5 of 48) of mitochondriopathy patients and identify candidate genes for the remainder. We find a median of one aberrantly expressed gene, five aberrant splicing events and six mono-allelically expressed rare variants in patient-derived fibroblasts and establish disease-causing roles for each kind. Private exons often arise from cryptic splice sites providing an important clue for variant prioritization. One such event is found in the complex I assembly factor TIMMDC1 establishing a novel disease-associated gene. In conclusion, our study expands the diagnostic tools for detecting non-exonic variants and provides examples of intronic loss-of-function variants with pathological relevance.

414 citations

Journal Article
11 Jan 2018-Cell
TL;DR: Analysis of UK National Health Service drug prescription and sales data suggests that characterizing GPCR variants could increase prescription precision, improving patients’ quality of life, and relieve the economic and societal burden due to variable drug responsiveness.

412 citations


Cites background from "A global reference for human geneti..."

  • ...We find that on average, 3.1% of the 2,504 individuals in the 1000 Genomes Project carry at least one allele with a missense variation in a known functional site in any given GPCR drug target (11.9% in known or putative functional site; Table S5)....

    [...]

  • ...Fraction of receptor length with a polymorphism or population with variant receptor. For each of the GPCR drug targets we calculated: (i) the ratio of receptor length with missense variation in known functional sites per GPCR drug target using the ExAC data (Table S5) and (ii) the fraction of affected individuals in the human population (n = 2,504; based on the 1000 Genomes Project dataset, Table S5)....

    [...]

  • ...2012 and 2017 have partial sales data and were not considered. %Individuals is the percentage of affected individuals with a missense variant in a functional site of the respective drug target(s) (n = 2,504 individuals from 1000 Genomes Project genotype data as a representative for the UK population; this data includes non-Caucasian populations as well) (Table S5). The % of affected individuals was calculated using four different criteria by considering individuals who have a variation in (i) known functional sites in both alleles (homozygous), which is the most conservative, (ii) known functional sites in at least one allele (i.e., homozygous and heterozygous), (iii) known or putative functional sites in both alleles (homozygous), and (iv) known or putative functional sites in at least one allele (i.e., homozygous and heterozygous), which is the least conservative. Known functional sites include ligand binding, effector binding, post-translational modification site, sodium binding site and micro-switches....

    [...]

  • ...An investigation of complete genotype information for 2,504 “healthy” individuals from the 1000 Genomes Project (Auton et al., 2015) showed that, on average, an individual harbors 68 missense variations within the coding region of one-third of the GPCR drug targets (Figure 2A)....

    [...]

  • ...=X fraction of known functional sites that are polymorphic for each receptor targeted by the drug. The drug score (Table S6) based on prevalence of affected individuals (i.e., 1000 Genomes Project) was calculated by: variability score for a drug; S_affected% = fraction of affected individuals with a MV in a functional site of the respective drug target(s). The fraction of affected individuals was calculated using four different criteria by considering individuals who have a variation in (i) known functional sites in both alleles (homozygous), which is the most conservative, (ii) known functional sites in at least one allele (i.e., homozygous and heterozygous), (iii) known or putative functional sites in both alleles (homozygous), and (iv) known or putative functional sites in at least one allele (i.e., homozygous and heterozygous), which is the least conservative....

    [...]
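
The four criteria described in the excerpts above can be illustrated with a minimal Python sketch. This is not the citing study's code: the `Variant` record, the `affected_fraction` function, and the toy cohort are hypothetical names and data chosen for illustration, assuming each record gives one individual's alternate-allele count at one functional site.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One missense variant observed in one individual (hypothetical record)."""
    individual: str
    site_class: str   # "known" or "putative" functional site
    alt_alleles: int  # 1 = heterozygous, 2 = homozygous


def affected_fraction(variants, n_individuals, include_putative, require_homozygous):
    """Fraction of the cohort carrying at least one qualifying variant."""
    allowed = {"known", "putative"} if include_putative else {"known"}
    min_alt = 2 if require_homozygous else 1
    # Count each individual once, however many qualifying variants they carry.
    affected = {
        v.individual
        for v in variants
        if v.site_class in allowed and v.alt_alleles >= min_alt
    }
    return len(affected) / n_individuals


# Toy cohort of four individuals (illustrative data only); "D" carries no variant.
N_INDIVIDUALS = 4
toy_variants = [
    Variant("A", "known", 1),
    Variant("B", "known", 2),
    Variant("C", "putative", 2),
]

# Criteria (i)-(iv) as (label, include_putative, require_homozygous).
criteria = [
    ("i", False, True),
    ("ii", False, False),
    ("iii", True, True),
    ("iv", True, False),
]
for label, include_putative, require_homozygous in criteria:
    fraction = affected_fraction(
        toy_variants, N_INDIVIDUALS, include_putative, require_homozygous
    )
    print(f"criterion ({label}): {fraction:.2f}")
```

In this toy cohort, criterion (i) yields the smallest fraction and criterion (iv) the largest, matching the excerpt's description of them as the most and least conservative.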

Journal Article
TL;DR: vg is a toolkit of computational methods for creating, manipulating, and using variation graphs as references at the scale of the human genome. It provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference.
Abstract: Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references represent only one version of each locus, ignoring variation in the population. Poor representation of an individual's genome sequence impacts read mapping and introduces bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation across a population, including large-scale structural variation such as inversions and duplications. Previous graph genome software implementations have been limited by scalability or topological constraints. Here we present vg, a toolkit of computational methods for creating, manipulating, and using these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and effectively removing reference bias. These capabilities make using variation graphs as references for DNA sequencing practical at a gigabase scale, or at the topological complexity of de novo assemblies.

408 citations

Journal Article
20 Oct 2016-Cell
TL;DR: It is shown that genetic effects on the immune response are strongly enriched for recent, population-specific signatures of adaptation, including for traits that are key to controlling infection.

405 citations


Cites methods from "A global reference for human geneti..."

  • ...To do so, we first calculated FST values between the Yoruba African (YRI) and the western European population (CEU) in Phase 3 data from the 1000 Genomes Project (Auton et al., 2015)....

    [...]

  • ...Bi-allelic SNPs across five European population samples (CEU, FIN, GBR, IBS, TSI), three African population samples with low levels of Eurasian ancestry (ESN, MSL, YRI), and ancestral allele were extracted from the phase 3 release of the 1,000 Genomes Project....

    [...]

  • ...The prior genotype probabilities in QuASAR are obtained from the 1000 Genomes Project minor allele frequencies assuming Hardy–Weinberg equilibrium; however, as we had the genotype information available, we manually input the prior genotype probabilities....

    [...]
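
The Hardy–Weinberg prior mentioned in the last excerpt can be written out as a short Python sketch. It shows only the textbook relationship between an allele frequency and genotype probabilities; the function name `hwe_genotype_priors` is hypothetical and this is not QuASAR's own implementation.

```python
import math


def hwe_genotype_priors(minor_allele_freq):
    """Return (P(hom. major), P(het.), P(hom. minor)) under Hardy-Weinberg equilibrium."""
    q = minor_allele_freq
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must lie in [0, 1]")
    p = 1.0 - q  # major allele frequency
    # Genotype probabilities are the terms of (p + q)^2 = p^2 + 2pq + q^2.
    return (p * p, 2.0 * p * q, q * q)


priors = hwe_genotype_priors(0.2)
# The three genotype probabilities always sum to one.
assert math.isclose(sum(priors), 1.0)
print(priors)
```

When observed genotypes are available, as in the excerpt, these frequency-derived priors can be replaced with probabilities set directly from the genotype calls.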

Journal Article
TL;DR: It is suggested that broad depression is the most tractable UK Biobank phenotype for discovering genes and gene sets that further the understanding of the biological pathways underlying depression.
Abstract: Depression is a polygenic trait that causes extensive periods of disability. Previous genetic studies have identified common risk variants which have progressively increased in number with increasing sample sizes of the respective studies. Here, we conduct a genome-wide association study in 322,580 UK Biobank participants for three depression-related phenotypes: broad depression, probable major depressive disorder (MDD), and International Classification of Diseases (ICD, version 9 or 10)-coded MDD. We identify 17 independent loci that are significantly associated (P < 5 × 10−8) across the three phenotypes. The direction of effect of these loci is consistently replicated in an independent sample, with 14 loci likely representing novel findings. Gene sets are enriched in excitatory neurotransmission, mechanosensory behaviour, post synapse, neuron spine and dendrite functions. Our findings suggest that broad depression is the most tractable UK Biobank phenotype for discovering genes and gene sets that further our understanding of the biological pathways underlying depression.

400 citations

References
Journal Article
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal Article
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net

45,957 citations

Journal Article
TL;DR: A new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format, which allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.
Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools

18,858 citations

Journal Article
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of human genes and the genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal Article
TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net

10,164 citations