Journal Article

A global reference for human genetic variation.

Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, and 514 more authors (90 institutions)
01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
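The variant totals quoted in the abstract can be checked with simple arithmetic. The following is a minimal sketch: the counts are the rounded figures from the abstract, while the dictionary name, keys, and function names are illustrative assumptions rather than project identifiers.

```python
# Rounded variant counts from the abstract, in millions.
# The dict name and keys are illustrative labels, not project identifiers.
VARIANTS_MILLIONS = {
    "snps": 84.7,
    "short_indels": 3.6,
    "structural_variants": 0.06,  # 60,000 structural variants
}


def total_variants_millions(counts):
    """Sum the per-class variant counts (all values in millions)."""
    return sum(counts.values())


def class_shares(counts):
    """Return each class's share of the total as a fraction between 0 and 1."""
    total = total_variants_millions(counts)
    return {name: value / total for name, value in counts.items()}


if __name__ == "__main__":
    print(f"total: {total_variants_millions(VARIANTS_MILLIONS):.2f} million")
    for name, share in class_shares(VARIANTS_MILLIONS).items():
        print(f"{name}: {share:.2%}")
```

The three classes sum to slightly more than 88 million, which agrees with the "over 88 million variants" stated in the abstract.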
Citations
Journal Article
TL;DR: Only a few common variants show large effects on age-specific mortality, suggesting that variants with large effects, even late-onset ones, are kept at low frequency by purifying selection; tests of sets of genetic variants that jointly influence 1 of 42 traits detect a number of strong signals.
Abstract: A number of open questions in human evolutionary genetics would become tractable if we were able to directly measure evolutionary fitness. As a step towards this goal, we developed a method to examine whether individual genetic variants, or sets of genetic variants, currently influence viability. The approach consists in testing whether the frequency of an allele varies across ages, accounting for variation in ancestry. We applied it to the Genetic Epidemiology Research on Adult Health and Aging (GERA) cohort and to the parents of participants in the UK Biobank. Across the genome, we found only a few common variants with large effects on age-specific mortality: tagging the APOE ε4 allele and near CHRNA3. These results suggest that when large, even late-onset effects are kept at low frequency by purifying selection. Testing viability effects of sets of genetic variants that jointly influence 1 of 42 traits, we detected a number of strong signals. In participants of the UK Biobank of British ancestry, we found that variants that delay puberty timing are associated with a longer parental life span (P~6.2 × 10⁻⁶ for fathers and P~2.0 × 10⁻³ for mothers), consistent with epidemiological studies. Similarly, variants associated with later age at first birth are associated with a longer maternal life span (P~1.4 × 10⁻³). Signals are also observed for variants influencing cholesterol levels, risk of coronary artery disease (CAD), body mass index, as well as risk of asthma. These signals exhibit consistent effects in the GERA cohort and among participants of the UK Biobank of non-British ancestry. We also found marked differences between males and females, most notably at the CHRNA3 locus, and variants associated with risk of CAD and cholesterol levels. Beyond our findings, the analysis serves as a proof of principle for how upcoming biomedical data sets can be used to learn about selection effects in contemporary humans.

74 citations
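The entry above describes testing whether the frequency of an allele varies across ages. As a rough illustration of that idea only, the sketch below computes a plain chi-square statistic of homogeneity on allele counts per age bin. This is a simplified stand-in, not the authors' model: it ignores the ancestry adjustment the abstract mentions, and the function name and the example counts are hypothetical.

```python
def chi_square_homogeneity(effect_counts, other_counts):
    """Chi-square statistic for a 2 x K table of allele counts by age bin.

    effect_counts[k] and other_counts[k] are the numbers of effect and
    non-effect alleles observed in age bin k. Returns (statistic, dof).
    Simplified illustration: no ancestry adjustment is performed.
    """
    if len(effect_counts) != len(other_counts) or len(effect_counts) < 2:
        raise ValueError("need two equal-length rows with at least two bins")
    row_totals = [sum(effect_counts), sum(other_counts)]
    col_totals = [e + o for e, o in zip(effect_counts, other_counts)]
    if min(row_totals) <= 0 or min(col_totals) <= 0:
        raise ValueError("every row and every age bin must be non-empty")
    grand_total = sum(row_totals)

    statistic = 0.0
    for row, row_total in zip((effect_counts, other_counts), row_totals):
        for observed, col_total in zip(row, col_totals):
            # Expected count if allele frequency were identical in all bins.
            expected = row_total * col_total / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic, len(effect_counts) - 1


if __name__ == "__main__":
    # Hypothetical allele counts in three age bins (made-up numbers).
    stat, dof = chi_square_homogeneity([120, 100, 80], [880, 900, 920])
    print(f"chi-square = {stat:.3f} on {dof} degrees of freedom")
```

A large statistic relative to a chi-square distribution with K − 1 degrees of freedom would indicate that allele frequency differs between age bins; converting it to a P value requires a chi-square survival function, which is left out here to keep the sketch dependency-free.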

Journal Article
03 Feb 2017-PLOS ONE
TL;DR: The γ-H2AX ELISA represents a novel approach to quantifying DNA damage, which may lead to a better understanding of mutagenic pathways in cancer and provide a useful biomarker for monitoring the effectiveness of DNA-damaging anticancer agents.
Abstract: Phosphorylated H2AX (γ-H2AX) is a sensitive marker for DNA double-strand breaks (DSBs), but the variability of H2AX expression in different cell and tissue types makes it difficult to interpret the meaning of the γ-H2AX level. Furthermore, the assays commonly used for γ-H2AX detection utilize laborious and low-throughput microscopy-based methods. We describe here an ELISA assay that measures both phosphorylated H2AX and total H2AX absolute amounts to determine the percentage of γ-H2AX, providing a normalized value representative of the amount of DNA damage. We demonstrate the utility of the assay to measure DSBs introduced by either ionizing radiation or DNA-damaging agents in cultured cells and in xenograft models. Furthermore, utilizing the NCI-60 cancer cell line panel, we show a correlation between the basal fraction of γ-H2AX and cellular mutation levels. This additional application highlights the ability of the assay to measure γ-H2AX levels in many extracts at once, making it possible to correlate findings with other cellular characteristics. Overall, the γ-H2AX ELISA represents a novel approach to quantifying DNA damage, which may lead to a better understanding of mutagenic pathways in cancer and provide a useful biomarker for monitoring the effectiveness of DNA-damaging anticancer agents.

74 citations
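The normalization described in the ELISA entry above, expressing phosphorylated H2AX as a percentage of total H2AX, amounts to a single ratio. The sketch below is illustrative only: the function name and the example amounts are hypothetical, and both inputs are assumed to be absolute amounts in the same unit.

```python
def percent_gamma_h2ax(phosphorylated, total):
    """Return phosphorylated H2AX as a percentage of total H2AX.

    Both arguments are absolute amounts in the same (arbitrary) unit.
    """
    if total <= 0:
        raise ValueError("total H2AX must be positive")
    if phosphorylated < 0:
        raise ValueError("phosphorylated H2AX cannot be negative")
    return 100.0 * phosphorylated / total


if __name__ == "__main__":
    # Made-up amounts for two extracts, in the same arbitrary unit.
    for label, phospho, total in [("control", 2.0, 40.0), ("treated", 9.0, 36.0)]:
        print(f"{label}: {percent_gamma_h2ax(phospho, total):.1f}% gamma-H2AX")
```

Dividing by total H2AX is what makes values comparable across cell and tissue types with different H2AX expression, which is the motivation given in the abstract.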

Journal Article
TL;DR: The results indicate that the current threshold for genome-wide significance is overly stringent for all ancestral populations except for Africans; however, a more stringent threshold should be employed when conducting a meta-analysis, regardless of the presence of African samples.
Abstract: To assess the statistical significance of associations between variants and traits, genome-wide association studies (GWAS) should employ an appropriate threshold that accounts for the massive burden of multiple testing in the study. Although most studies in the current literature commonly set a genome-wide significance threshold at the level of P=5.0 × 10⁻⁸, the adequacy of this value for respective populations has not been fully investigated. To empirically estimate thresholds for different ancestral populations, we conducted GWAS simulations using the 1000 Genomes Phase 3 data set for Africans (AFR), Europeans (EUR), Admixed Americans (AMR), East Asians (EAS) and South Asians (SAS). The estimated empirical genome-wide significance thresholds were Psig=3.24 × 10⁻⁸ (AFR), 9.26 × 10⁻⁸ (EUR), 1.83 × 10⁻⁷ (AMR), 1.61 × 10⁻⁷ (EAS) and 9.46 × 10⁻⁸ (SAS). We additionally conducted trans-ethnic meta-analyses across all populations (ALL) and all populations except for AFR (ΔAFR), which yielded Psig=3.25 × 10⁻⁸ (ALL) and 4.20 × 10⁻⁸ (ΔAFR). Our results indicate that the current threshold (P=5.0 × 10⁻⁸) is overly stringent for all ancestral populations except for Africans; however, we should employ a more stringent threshold when conducting a meta-analysis, regardless of the presence of African samples.

74 citations
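The thresholds reported in the entry above can be compared against the conventional cutoff in a few lines. The sketch below uses the values from the abstract; the Bonferroni-style reading (threshold = 0.05 divided by a number of independent tests) is an assumption made here for illustration, since the abstract says only that the thresholds were estimated empirically by simulation. All names in the code are hypothetical.

```python
CONVENTIONAL_THRESHOLD = 5.0e-8

# Empirical thresholds as reported in the abstract.
EMPIRICAL_THRESHOLDS = {
    "AFR": 3.24e-8,
    "EUR": 9.26e-8,
    "AMR": 1.83e-7,
    "EAS": 1.61e-7,
    "SAS": 9.46e-8,
    "ALL": 3.25e-8,   # trans-ethnic meta-analysis, all populations
    "ΔAFR": 4.20e-8,  # trans-ethnic meta-analysis, all except AFR
}


def implied_independent_tests(threshold, alpha=0.05):
    """Number of independent tests implied by a Bonferroni-style reading.

    Assumption for illustration: threshold = alpha / number_of_tests.
    """
    return alpha / threshold


def is_conventional_too_stringent(threshold, conventional=CONVENTIONAL_THRESHOLD):
    """True when the conventional cutoff is stricter than the empirical one."""
    return conventional < threshold


if __name__ == "__main__":
    for group, threshold in EMPIRICAL_THRESHOLDS.items():
        verdict = ("conventional cutoff is overly stringent"
                   if is_conventional_too_stringent(threshold)
                   else "conventional cutoff is not stringent enough")
        tests = implied_independent_tests(threshold)
        print(f"{group}: {threshold:.2e} (~{tests:,.0f} tests) -> {verdict}")
```

Under this reading, the conventional 5.0 × 10⁻⁸ corresponds to one million independent tests, and the comparison reproduces the abstract's conclusion: the conventional cutoff is stricter than needed for EUR, AMR, EAS and SAS, but not for AFR or for either meta-analysis.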

Journal Article
TL;DR: Nine groups are presented that meet a key need in pharmacogenetics research by enabling consistent communication of the scale of variability in global allele frequencies and are now used by Pharmacogenomics Knowledgebase (PharmGKB).
Abstract: The varying frequencies of pharmacogenetic alleles among populations have important implications for the impact of these alleles in different populations. Current population grouping methods to communicate these patterns are insufficient as they are inconsistent and fail to reflect the global distribution of genetic variability. To facilitate and standardize the reporting of variability in pharmacogenetic allele frequencies, we present seven geographically defined groups: American, Central/South Asian, East Asian, European, Near Eastern, Oceanian, and Sub-Saharan African, and two admixed groups: African American/Afro-Caribbean and Latino. These nine groups are defined by global autosomal genetic structure and based on data from large-scale sequencing initiatives. We recognize that broadly grouping global populations is an oversimplification of human diversity and does not capture complex social and cultural identity. However, these groups meet a key need in pharmacogenetics research by enabling consistent communication of the scale of variability in global allele frequencies and are now used by Pharmacogenomics Knowledgebase (PharmGKB).

74 citations

Journal Article
TL;DR: In this paper, a pooled analysis of data from newly recruited patients with an MRI-confirmed diagnosis of lacunar stroke and existing genome-wide association studies (GWAS) was performed to identify novel associations and provide mechanistic insights into the disease.
Abstract: Background: The genetic basis of lacunar stroke is poorly understood, with a single locus on 16q24 identified to date. We sought to identify novel associations and provide mechanistic insights into the disease. Methods: We did a pooled analysis of data from newly recruited patients with an MRI-confirmed diagnosis of lacunar stroke and existing genome-wide association studies (GWAS). Patients were recruited from hospitals in the UK as part of the UK DNA Lacunar Stroke studies 1 and 2 and from collaborators within the International Stroke Genetics Consortium. Cases and controls were stratified by ancestry and two meta-analyses were done: a European ancestry analysis, and a transethnic analysis that included all ancestry groups. We also did a multi-trait analysis of GWAS, in a joint analysis with a study of cerebral white matter hyperintensities (an aetiologically related radiological trait), to find additional genetic associations. We did a transcriptome-wide association study (TWAS) to detect genes for which expression is associated with lacunar stroke; identified significantly enriched pathways using multi-marker analysis of genomic annotation; and evaluated cardiovascular risk factors causally associated with the disease using mendelian randomisation. Findings: Our meta-analysis comprised studies from Europe, the USA, and Australia, including 7338 cases and 254 798 controls, of which 2987 cases (matched with 29 540 controls) were confirmed using MRI. Five loci (ICA1L-WDR12-CARF-NBEAL1, ULK4, SPI1-SLC39A13-PSMC3-RAPSN, ZCCHC14, ZBTB14-EPB41L3) were found to be associated with lacunar stroke in the European or transethnic meta-analyses. A further seven loci (SLC25A44-PMF1-BGLAP, LOX-ZNF474-LOC100505841, FOXF2-FOXQ1, VTA1-GPR126, SH3PXD2A, HTRA1-ARMS2, COL4A2) were found to be associated in the multi-trait analysis with cerebral white matter hyperintensities (n=42 310). 
Two of the identified loci contain genes (COL4A2 and HTRA1) that are involved in monogenic lacunar stroke. The TWAS identified associations between the expression of six genes (SCL25A44, ULK4, CARF, FAM117B, ICA1L, NBEAL1) and lacunar stroke. Pathway analyses implicated disruption of the extracellular matrix, phosphatidylinositol 5 phosphate binding, and roundabout binding (false discovery rate <0·05). Mendelian randomisation analyses identified positive associations of elevated blood pressure, history of smoking, and type 2 diabetes with lacunar stroke. Interpretation: Lacunar stroke has a substantial heritable component, with 12 loci now identified that could represent future treatment targets. These loci provide insights into lacunar stroke pathogenesis, highlighting disruption of the vascular extracellular matrix (COL4A2, LOX, SH3PXD2A, GPR126, HTRA1), pericyte differentiation (FOXF2, GPR126), TGF-β signalling (HTRA1), and myelination (ULK4, GPR126) in disease risk. Funding: British Heart Foundation.

73 citations

References
Journal Article
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal Article
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations
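To make the SAM format mentioned in the entry above concrete, the sketch below splits one alignment line into the eleven mandatory tab-separated fields defined by the SAM specification. It is a minimal illustration, not SAMtools code: the class and function names are hypothetical, optional tag fields are ignored, and the example record is made up.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The eleven mandatory fields of a SAM alignment line."""
    qname: str
    flag: int
    rname: str
    pos: int    # 1-based leftmost mapping position
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_sam_line(line):
    """Parse one alignment line; header lines (starting with '@') are rejected."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment record")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line has at least 11 fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10],
    )


def is_unmapped(record):
    """Bit 0x4 of FLAG marks a read as unmapped."""
    return bool(record.flag & 0x4)


if __name__ == "__main__":
    # Made-up example record with the optional fields omitted.
    example = "r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\tTTAGATAAAGGATACTG\t*"
    record = parse_sam_line(example)
    print(record.qname, record.rname, record.pos, is_unmapped(record))
```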

Journal Article
TL;DR: BEDTools is a software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format; it allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.
Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools

18,858 citations
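The core question in the entry above, whether two genomic features overlap, can be illustrated with BED-style intervals, which are 0-based and half-open. The sketch below is a naive all-pairs comparison written for clarity; it is not the algorithm BEDTools uses, and the function names and example features are hypothetical.

```python
def parse_bed_line(line):
    """Return (chrom, start, end) from the first three columns of a BED line."""
    chrom, start, end = line.rstrip("\n").split("\t")[:3]
    return chrom, int(start), int(end)


def overlap_length(a, b):
    """Number of bases shared by two (chrom, start, end) features.

    BED coordinates are 0-based and half-open, so features that merely
    touch (one ends where the other starts) share zero bases.
    """
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def intersect(features_a, features_b):
    """Naive O(n * m) list of all overlapping feature pairs."""
    return [(a, b) for a in features_a for b in features_b
            if overlap_length(a, b) > 0]


if __name__ == "__main__":
    # Made-up features for illustration.
    reads = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
    genes = [("chr1", 150, 250), ("chr1", 600, 700)]
    for read, gene in intersect(reads, genes):
        print(read, gene, overlap_length(read, gene))
```

For the large datasets the abstract has in mind, an indexed or sorted-sweep approach would be needed; the quadratic loop here only demonstrates the coordinate logic.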

Journal Article
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal Article
TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net Contact: [email protected]

10,164 citations
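As an illustration of the VCF format described in the entry above, the sketch below parses the eight fixed tab-separated columns of a VCF data line (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) and applies a simple length-based classification of each alternate allele. It is not VCFtools code: the function names, the classification labels, and the example record are hypothetical.

```python
def parse_info(info):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    if info == ".":
        return {}
    parsed = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True  # flags carry no value
    return parsed


def parse_vcf_line(line):
    """Parse the eight fixed columns of a VCF data line into a dict."""
    if line.startswith("#"):
        raise ValueError("header or meta-information line")
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 columns")
    chrom, pos, var_id, ref, alt, qual, flt, info = fields[:8]
    return {
        "chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref,
        "alts": alt.split(","), "qual": qual, "filter": flt,
        "info": parse_info(info),
    }


def classify_alt(ref, alt):
    """Rough length-based label for one alternate allele (illustrative only)."""
    if alt in (".", "*"):
        return "other"
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic"  # e.g. <DEL>, or a breakend notation
    if len(ref) == 1 and len(alt) == 1:
        return "snp"
    if len(ref) != len(alt):
        return "indel"
    return "other"  # e.g. a multi-nucleotide substitution


if __name__ == "__main__":
    # Made-up example record.
    record = parse_vcf_line("20\t14370\trs0000001\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB")
    for alt in record["alts"]:
        print(record["chrom"], record["pos"], classify_alt(record["ref"], alt))
```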