Journal ArticleDOI

A global reference for human genetic variation.

Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, +514 more · Institutions (90)
01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
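The frequency threshold quoted above (SNPs with a frequency of >1%) is computed from genotype counts across the sampled individuals. A minimal, illustrative sketch, assuming diploid genotypes coded as the per-individual count of alternate alleles (0, 1, or 2); the function name is hypothetical:

```python
def alt_allele_frequency(genotypes):
    """Alternate-allele frequency from diploid genotypes coded 0, 1, or 2
    (the number of alternate alleles carried by each individual)."""
    n = len(genotypes)
    if n == 0:
        raise ValueError("no genotypes")
    # Each diploid individual contributes two allele observations.
    return sum(genotypes) / (2 * n)

# A variant seen in heterozygous form in 5 of 100 individuals:
freq = alt_allele_frequency([1] * 5 + [0] * 95)  # 5 / 200 = 0.025
is_common = freq > 0.01  # the >1% threshold used in the text
```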
Citations
Journal ArticleDOI
TL;DR: A crucial role is identified for functional annotations such as IMPACT to improve the trans-ancestry portability of genetic data and capture consistent SNP heritability between populations, suggesting prioritization of shared functional variants.
Abstract: Poor trans-ancestry portability of polygenic risk scores is a consequence of Eurocentric genetic studies and limited knowledge of shared causal variants. Leveraging regulatory annotations may improve portability by prioritizing functional over tagging variants. We constructed a resource of 707 cell-type-specific IMPACT regulatory annotations by aggregating 5,345 epigenetic datasets to predict binding patterns of 142 transcription factors across 245 cell types. We then partitioned the common SNP heritability of 111 genome-wide association study summary statistics of European (average n ≈ 189,000) and East Asian (average n ≈ 157,000) origin. IMPACT annotations captured consistent SNP heritability between populations, suggesting prioritization of shared functional variants. Variant prioritization using IMPACT resulted in increased trans-ancestry portability of polygenic risk scores from Europeans to East Asians across all 21 phenotypes analyzed (49.9% mean relative increase in R2). Our study identifies a crucial role for functional annotations such as IMPACT to improve the trans-ancestry portability of genetic data.

101 citations

Journal ArticleDOI
TL;DR: A theoretical model of the relative accuracy (RA) of polygenic scores (PGS) across ancestries is derived; linkage disequilibrium (LD) and minor allele frequency (MAF) differences between ancestries can explain between 70 and 80% of the loss of RA of European-based PGS in African-ancestry populations.
Abstract: Polygenic scores (PGS) have been widely used to predict disease risk using variants identified from genome-wide association studies (GWAS). To date, most GWAS have been conducted in populations of European ancestry, which limits the use of GWAS-derived PGS in non-European ancestry populations. Here, we derive a theoretical model of the relative accuracy (RA) of PGS across ancestries. We show through extensive simulations that the RA of PGS based on genome-wide significant SNPs can be predicted accurately from modelling linkage disequilibrium (LD), minor allele frequencies (MAF), cross-population correlations of causal SNP effects and heritability. We find that LD and MAF differences between ancestries can explain between 70 and 80% of the loss of RA of European-based PGS in African ancestry for traits like body mass index and type 2 diabetes. Our results suggest that causal variants underlying common genetic variation identified in European ancestry GWAS are mostly shared across continents.
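The polygenic scores discussed above are, at their core, additive sums of GWAS effect sizes weighted by genotype dosage. A minimal sketch of that additive model, with illustrative numbers; real PGS pipelines additionally handle strand flips, LD clumping, and missing genotypes:

```python
def polygenic_score(dosages, effect_sizes):
    """Additive polygenic score: the dosage-weighted sum of per-variant
    effect sizes estimated by a GWAS. Dosages are alternate-allele
    counts (0, 1, or 2) for one individual, one entry per variant."""
    if len(dosages) != len(effect_sizes):
        raise ValueError("one dosage per variant required")
    return sum(d * b for d, b in zip(dosages, effect_sizes))

# Three variants with made-up effect sizes:
score = polygenic_score([0, 1, 2], [0.10, -0.05, 0.20])  # 0.35
```

The loss of trans-ancestry accuracy described in the abstract arises because the effect sizes are estimated in one population while the LD patterns and allele frequencies that determine which tagging variants reach significance differ in another.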

101 citations

Journal ArticleDOI
TL;DR: A nonparametric approach for estimating the date of origin of genetic variants in large-scale sequencing data sets is developed and used to power a rapid approach for inferring the ancestry shared between individual genomes and to quantify genealogical relationships at different points in the past.
Abstract: The origin and fate of new mutations within species is the fundamental process underlying evolution. However, while much attention has been focused on characterizing the presence, frequency, and phenotypic impact of genetic variation, the evolutionary histories of most variants are largely unexplored. We have developed a nonparametric approach for estimating the date of origin of genetic variants in large-scale sequencing data sets. The accuracy and robustness of the approach is demonstrated through simulation. Using data from two publicly available human genomic diversity resources, we estimated the age of more than 45 million single-nucleotide polymorphisms (SNPs) in the human genome and release the Atlas of Variant Age as a public online database. We characterize the relationship between variant age and frequency in different geographical regions and demonstrate the value of age information in interpreting variants of functional and selective importance. Finally, we use allele age estimates to power a rapid approach for inferring the ancestry shared between individual genomes and to quantify genealogical relationships at different points in the past, as well as to describe and explore the evolutionary history of modern human populations.

101 citations

Journal ArticleDOI
16 Oct 2018-JAMA
TL;DR: The complexity of ancestral populations requires a new approach for discussing genomics, disease risk, race and ethnicity, and social determinants of health, and there should be parallel efforts to standardize data collection methods.
Abstract: The complexities of social identity and genetic ancestry have led to confusion and consternation related to the use and interpretation of race, ethnicity, and ancestry data in biomedical research. These discussions and overt debates have intensified with advances in genomics and knowledge about how social factors interact with biology. As more information about genomic diversity becomes available, the limitations of assigning social, political, and geographic labels to individuals become clearer; these limitations have led to growing challenges for researchers to communicate information about human genomic variation. Imprecise use of race and ethnicity data as population descriptors in genomics research has the potential to miscommunicate the complex relationships among an individual’s social identity, ancestry, socioeconomic status, and health, while also perpetuating misguided notions that discrete genetic groups exist. Self-identified race and ethnicity commonly correlate with geographical ancestry and, in turn, geographical ancestry is a contributing factor to human genomic variation. While self-identified race and ethnicity correlate with the frequency of particular genomic variants at a population level, they cannot be used exclusively to predict a patient’s genotype or drug response.1 A recent analysis found significant heterogeneity among US clinical laboratories in the way race, ethnicity, and ancestry are ascertained; specifically, no 2 clinical laboratories used the same descriptive categories to designate a group or population on their requisition forms (C. Bustamante and A. Popejoy, written communication, August 2018). In light of the current realities, the complexity of ancestral populations requires a new approach for discussing genomics, disease risk, race and ethnicity, and social determinants of health.
In 2016, the National Human Genome Research Institute (NHGRI) and the National Institute on Minority Health and Health Disparities (NIMHD) of the US National Institutes of Health convened a workshop to discuss the use of self-identified race and ethnicity data in genomics, biomedical, and clinical research, and the implications of this use for minority health and health disparities.2 Several major themes emerged from that workshop. For example, while the current use of the US Office of Management and Budget’s (OMB’s) racial and ethnic categories in research is important, there was a call for researchers to increase the scientific rigor in collecting such data, especially in clinical settings. Specifically, researchers should ensure the collected data reflect the multidimensional nature of a person’s identity, especially within the context of race, ethnicity, socioeconomic status, and geographic ancestry. Further, as these data sets are curated and refined, there should be parallel efforts to standardize data collection methods. A positive step forward would involve capturing self-identified race and ethnicity data, social and cultural identity, family background, and ancestry data derived from genomic analyses. In addition, other dimensions of race should be recognized, including perceived race or ethnicity (what others believe a person to be), reflected race (the race a person believes others assume her or him to be), and the cumulative burden of discrimination. New approaches are required to minimize survey burden in the collection of such additional information because it would be a challenge to collect detailed information about each of these variables. Another theme from the workshop was to expand beyond the traditional categories used to explain population differences.2 Race and ethnicity are operationalized inappropriately when they serve as proxies for other demographic variables, such as an individual’s socioeconomic status. 
One study examined the role of African ancestry and education in association with hypertension among black patients and found that having education beyond high school was significantly associated with lower systolic blood pressure, but proportion of African ancestry was not.3 Understanding how social, demographic, and biological factors interact and affect health will require analyses that include these variables. To avoid undermining the scientific integrity of conclusions drawn from research studies, other types of data providing more nuanced insights should be collected in addition to race, ethnicity, and genetic ancestry, such as a person’s educational attainment, income, and geographic residence. The NHGRI and the NIMHD have supported work exploring how physicians and researchers collect and report race and ethnicity data as well as how such data should be used for biomedical research. The NHGRI supports implementation research in the use of ancestral data in clinical genomic reports; studies have demonstrated the need to report such ancestry data to assist clinical laboratories and health professionals in interpreting the medical relevance of genomic variants. It is time for the broader scientific community to develop and adopt consensus practices for the use of race, ethnicity, social determinants of health, and ancestry data in study design, interpretation of results, publications, and medical care.

101 citations

Journal ArticleDOI
TL;DR: Calmodulinopathies are largely characterized by adrenergically-induced life-threatening arrhythmias, and available therapies are disquietingly insufficient, especially in CALM-LQTS.
Abstract: AIMS: Calmodulinopathies are rare life-threatening arrhythmia syndromes that mostly affect young individuals and are caused by mutations in any of the three genes (CALM1-3) that encode identical calmodulin proteins. We established the International Calmodulinopathy Registry (ICalmR) to understand the natural history, clinical features, and response to therapy of patients with a CALM-mediated arrhythmia syndrome. METHODS AND RESULTS: A dedicated Case Report File was created to collect demographic, clinical, and genetic information. ICalmR has enrolled 74 subjects with a variant in the CALM1 (n = 36), CALM2 (n = 23), or CALM3 (n = 15) genes. Sixty-four (86.5%) were symptomatic and the 10-year cumulative mortality was 27%. The two prevalent phenotypes are long QT syndrome (LQTS; CALM-LQTS, n = 36, 49%) and catecholaminergic polymorphic ventricular tachycardia (CPVT; CALM-CPVT, n = 21, 28%). CALM-LQTS patients have extremely prolonged QTc intervals (594 ± 73 ms), a high prevalence (78%) of life-threatening arrhythmias with median age at onset of 1.5 years [interquartile range (IQR) 0.1-5.5 years], and poor response to therapies. Most electrocardiograms (ECGs) show late-onset peaked T waves. All CALM-CPVT patients were symptomatic, with median age of onset of 6.0 years (IQR 3.0-8.5 years). Basal ECG frequently shows prominent U waves. Other CALM-related phenotypes are idiopathic ventricular fibrillation (IVF, n = 7), sudden unexplained death (SUD, n = 4), overlapping features of CPVT/LQTS (n = 3), and predominant neurological phenotype (n = 1). Cardiac structural abnormalities and neurological features were present in 18 and 13 patients, respectively. CONCLUSION: Calmodulinopathies are largely characterized by adrenergically induced life-threatening arrhythmias. Available therapies are disquietingly insufficient, especially in CALM-LQTS. Combination therapy with drugs, sympathectomy, and devices should be considered.

101 citations

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, a variant caller, and an alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]
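The SAM format described above is a tab-separated text format with eleven mandatory fields per alignment record. A minimal, illustrative parser for those fields (this is a sketch of the published format layout, not part of SAMtools itself):

```python
def parse_sam_record(line):
    """Parse the 11 mandatory tab-separated fields of a SAM alignment
    record (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN,
    SEQ, QUAL); any remaining fields are optional tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM record needs at least 11 fields")
    return {
        "qname": fields[0],        # read name
        "flag": int(fields[1]),    # bitwise alignment flags
        "rname": fields[2],        # reference sequence name
        "pos": int(fields[3]),     # 1-based leftmost mapping position
        "mapq": int(fields[4]),    # mapping quality
        "cigar": fields[5],        # alignment CIGAR string
        "seq": fields[9],          # read sequence
    }

rec = parse_sam_record(
    "read1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII"
)
```

In practice one would read BAM/SAM files through SAMtools or a binding such as pysam rather than parsing lines by hand; the sketch only makes the field layout concrete.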

45,957 citations

Journal ArticleDOI
TL;DR: A new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format, which allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.
Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools utilities can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools
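The core operation BEDTools performs, intersecting two sets of genomic intervals, can be sketched in a few lines. This is an illustrative O(n·m) version using BED-style 0-based half-open coordinates; BEDTools itself uses far more efficient indexed algorithms:

```python
def intersect_bed(features_a, features_b):
    """Report pairs of overlapping intervals between two BED-style
    feature lists, each entry a (chrom, start, end) tuple with
    0-based half-open coordinates as in the BED format."""
    overlaps = []
    for chrom_a, start_a, end_a in features_a:
        for chrom_b, start_b, end_b in features_b:
            # Half-open intervals overlap iff each starts before
            # the other ends, on the same chromosome.
            if chrom_a == chrom_b and start_a < end_b and start_b < end_a:
                overlaps.append(((chrom_a, start_a, end_a),
                                 (chrom_b, start_b, end_b)))
    return overlaps

hits = intersect_bed([("chr1", 100, 200)],
                     [("chr1", 150, 250), ("chr2", 0, 50)])
# one overlapping pair, on chr1
```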

18,858 citations

Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements (ENCODE) project provides new insights into the organization and regulation of human genes and the genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal ArticleDOI
TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net Contact: [email protected]
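A VCF data line, as described above, has eight fixed tab-separated fields, with the INFO field holding semicolon-separated annotations. A minimal, illustrative parser for one such line (a sketch of the published format, not part of VCFtools):

```python
def parse_vcf_record(line):
    """Parse the 8 fixed tab-separated fields of a VCF data line
    (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO); INFO is a
    semicolon-separated list of KEY=VALUE pairs or bare flags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("VCF data line needs at least 8 fields")
    info = {}
    for entry in fields[7].split(";"):
        key, _, value = entry.partition("=")
        info[key] = value if value else True  # flags carry no value
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),          # 1-based position
        "ref": fields[3],
        "alt": fields[4].split(","),    # ALT may list several alleles
        "info": info,
    }

rec = parse_vcf_record("chr1\t12345\trs1\tA\tG,T\t50\tPASS\tAF=0.01;DB")
```

Real VCF files also carry a meta-information header (## lines) and per-sample genotype columns, which VCFtools and its Perl API handle; the sketch only makes the fixed-field layout concrete.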

10,164 citations