Journal ArticleDOI

A global reference for human genetic variation.

Adam Auton1, Gonçalo R. Abecasis2, David Altshuler3, Richard Durbin4  +514 moreInstitutions (90)
01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
Citations
Journal ArticleDOI
25 Apr 2017-eLife
TL;DR: It is reported here that mutational spectra differ substantially among species, human continental groups and even some closely related populations and the possibility of mapping mutational modifiers is suggested.
Abstract: DNA is a molecule that contains the information needed to build an organism. This information is stored as a code made up of four chemicals: adenine (A), guanine (G), cytosine (C), and thymine (T). Every time a cell divides and copies its DNA, it accidentally introduces ‘typos’ into the code, known as mutations. Most mutations are harmless, but some can cause damage. All cells have ways to proofread DNA, and the more resources are devoted to proofreading, the fewer mutations occur. Simple organisms such as bacteria use less energy to reduce mutations, because their genomes may tolerate more damage. More complex organisms, from yeast to humans, instead need to proofread their genomes more thoroughly. Recent research has shown that humans have a lower mutation rate than chimpanzees and gorillas, their closest living relatives. Humans and other apes copy and proofread their DNA with basically the same biological machinery as yeast, which is about a billion years old. Yet, humans and apes have only existed for a small fraction of this time, a few million years. Why then do humans need to replicate and proofread their DNA differently from apes, and could it be that the way mutations arise is still evolving?
Previous research revealed that European people experience more mutations within certain DNA motifs (specifically, the DNA sequences ‘TCC’, ‘TCT’, ‘CCC’ and ‘ACC’) than Africans or East Asians do. Now, Harris (who conducted the previous research) and Pritchard have compared how various human ethnic groups accumulate mutations and how these processes differ in different groups. Statistical analysis of the genomes of thousands of people from all over the world did indeed show that the mutation rates of many different three-letter DNA motifs have changed during the past 20,000 years of human evolution. Harris and Pritchard report that when groups of humans left Africa and settled in isolated populations across different continents, each population quickly became better at avoiding mutations in some genomic contexts, but worse in others. This suggests that the risk of passing on harmful mutations to future generations is changing and evolving at an even faster rate than was originally suspected. The results suggest that every human ethnic group carries specific variants of the genes which ensure that DNA replication and repair are accurate. These differences appear to influence which types of mutations are frequently passed down to future generations. An important next step will be to identify the genetic variants that could be controlling mutational patterns and how they affect human health.

146 citations


Cites background or methods from "A global reference for human geneti..."

  • ...Human mutation spectrum processing Mutation spectra were computed using 1000 Genomes Phase 3 SNPs (Auton et al., 2015) that are biallelic, pass all 1000 Genomes quality filters, and are not adjacent to any N’s in the hg19 reference sequence....

    [...]

  • ...Results To investigate the mutational processes in different human populations, we classified each single nucleotide variants (SNV) in the 1000 Genomes Phase 3 data (Auton et al., 2015) in terms of its ancestral allele, derived allele, and 5’ and 3’ flanking nucleotides....

    [...]
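The classification described in the snippets above (ancestral allele, derived allele, and 5’ and 3’ flanking nucleotides) can be illustrated with a minimal Python sketch. This is not the cited authors' code: the function names, the 0-based coordinates, the in-memory reference string, and the convention of folding strand-equivalent types onto an ancestral A or C are all assumptions made here for illustration.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def mutation_type(reference, pos, ancestral, derived):
    """Label one SNV by its trinucleotide context, e.g. 'TCC>T'.

    `pos` is a 0-based index into `reference` (hypothetical layout).
    Returns None for sites that should be skipped.
    """
    if pos < 1 or pos + 1 >= len(reference):
        return None  # no flanking base on one side
    five = reference[pos - 1].upper()
    three = reference[pos + 1].upper()
    # Skip sites adjacent to N and alleles that are not plain nucleotides.
    if five not in COMPLEMENT or three not in COMPLEMENT:
        return None
    if ancestral not in COMPLEMENT or derived not in COMPLEMENT:
        return None
    if ancestral == derived:
        return None
    context = five + ancestral + three
    # Assumed folding convention: report the strand whose ancestral base is A or C.
    if ancestral in ("G", "T"):
        context = "".join(COMPLEMENT[b] for b in reversed(context))
        derived = COMPLEMENT[derived]
    return f"{context}>{derived}"


def mutation_spectrum(reference, variants):
    """Return the fraction of SNVs falling into each mutation type.

    `variants` is an iterable of (pos, ancestral, derived) tuples.
    """
    counts = Counter()
    for pos, ancestral, derived in variants:
        label = mutation_type(reference, pos, ancestral, derived)
        if label is not None:
            counts[label] += 1
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}
```

For example, on the toy reference "ATCCGA", an SNV at index 2 with ancestral C and derived T is labeled with context TCC, one of the motifs mentioned in the abstract above. Comparing such spectra between populations would require real variant calls and ancestral-allele annotations, which this sketch does not provide.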

Journal ArticleDOI
TL;DR: The ability for t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.
Abstract: The t-distributed stochastic neighbor embedding (t-SNE) is a new dimension reduction and visualization technique for high-dimensional data. t-SNE is rarely applied to human genetic data, even though it is commonly used in other data-intensive biological fields, such as single-cell genomics. We explore the applicability of t-SNE to human genetic data and make these observations: (i) similar to previously used dimension reduction techniques such as principal component analysis (PCA), t-SNE is able to separate samples from different continents; (ii) unlike PCA, t-SNE is more robust with respect to the presence of outliers; (iii) t-SNE is able to display both continental and sub-continental patterns in a single plot. We conclude that the ability for t-SNE to reveal population stratification at different scales could be useful for human genetic association studies.

146 citations

Journal ArticleDOI
20 Feb 2020-Cell
TL;DR: A novel probabilistic method to identify introgressed hominin sequences, which, unlike existing approaches, does not use a modern reference population and finds that African individuals carry a stronger signal of Neanderthal ancestry than previously thought.

145 citations


Cites methods from "A global reference for human geneti..."

  • ...We tested the effect of sample size on IBDmix using the CEU (Utah Residents with Northern and Western European Ancestry) subgroup from 1000 Genomes Project....

    [...]

  • ...Neanderthal Introgressed Sequence Detected in 1000 Genomes Project Populations (A) Violin plots showing the amount of Neanderthal sequence called per individual across geographically diverse populations from the 1000 Genomes Project....

    [...]

  • ...IBDmix Reveals Substantial Amounts of Neanderthal Signal in Africans and Nearly Uniform Levels in NonAfrican Populations We applied IBDmix to samples from the 1000 Genomes Project (Auton et al., 2015), collected from geographically diverse populations, and used the Altai Neanderthal reference genome (Prüfer et al., 2014) to identify introgressed Neanderthal sequence in these individuals....

    [...]

Journal ArticleDOI
16 Jun 2017-Science
TL;DR: A complex CNV called DUP4 is associated with resistance to severe malaria and fully explains the previously reported signal of association, and a systematic catalog of CNVs is provided, describing structural diversity that may have functional importance at this locus.
Abstract: INTRODUCTION: Malaria parasites cause human disease by invading and replicating inside red blood cells. In the case of Plasmodium falciparum, this can lead to severe forms of malaria that are a major cause of childhood mortality in Africa. This species of parasite enters the red blood cell through interactions with surface proteins including the glycophorins GYPA and GYPB, which determine the polymorphic MNS blood group system. In a recent genome-wide association study, we identified alleles associated with protection against severe malaria near the cluster of genes encoding these invasion receptors. RATIONALE: Investigation of genetic variants at this locus and their relation to severe malaria is challenging because of the high sequence similarity between the neighboring glycophorin genes and the relative lack of available sequence data capturing the genetic diversity of sub-Saharan Africa. To better assess whether variation in the glycophorin genes could explain the signal of association, we generated additional sequence data from sub-Saharan African populations and developed an analytical approach to characterize structural variation at this complex locus. RESULTS: Using 765 newly sequenced human genomes from 10 African ethnic groups along with data from the 1000 Genomes Project, we generated a reference panel of haplotypes across the glycophorin region. In addition to single-nucleotide polymorphisms and short indels, we assayed large copy number variants (CNVs) using sequencing read depth and uncovered extensive structural diversity. By imputing from this reference panel into 4579 severe malaria cases and 5310 controls from three African populations, we found that a complex CNV, here called DUP4, is associated with resistance to severe malaria and fully explains the previously reported signal of association. In our sample, DUP4 is present only in east Africa, and this localization, as well as the extent of similarity between DUP4 haplotypes, suggests that it has recently increased in frequency, presumably under natural selection due to malaria. To evaluate the potential functional consequences of this structural variant, we analyzed high-coverage sequence-read data from multiple individuals to generate a model of the DUP4 chromosome structure. The DUP4 haplotype contains five glycophorin genes, including two hybrid genes that juxtapose the extracellular domain of GYPB with the transmembrane and intracellular domains of GYPA. Noting that these predicted hybrids are characteristic of the Dantu antigen in the MNS blood group system, we sequenced a Dantu positive individual and confirmed that DUP4 is the molecular basis of the Dantu NE blood group variant. CONCLUSION: Although a role for GYPA and GYPB in parasite invasion is well known, a direct link between glycophorin polymorphisms and clinical susceptibility to malaria has been elusive. Here we have provided a systematic catalog of CNVs, describing structural diversity that may have functional importance at this locus. Our results identify a specific variant that encodes hybrid glycophorin proteins and is associated with protection against severe malaria. This discovery calls for further work to determine how this particular molecular rearrangement affects parasite invasion and the red blood cell response and may lead us toward new parasite vulnerabilities that can be utilized in future interventions against this deadly disease.

145 citations


Cites background or methods from "A global reference for human geneti..."

  • ...The 2504 individuals from 26 populations in the 1000 Genomes phase 3 release (20) were analyzed....

    [...]

  • ...We first assembled a list of previously identified SNPs and short indels from the 1000 Genomes phase 3 (20), the Illumina Omni 2....

    [...]

  • ...Finally, to form a joint reference panel across all individuals, we merged the phased haplotypes with the 1000 Genomes phase 3 haplotypes (20) at the overlapping set of variants....

    [...]

  • ...Reads were mapped to the GRCh37 human reference genome with additional sequences as modified by the 1000 Genomes Project [hs37d5.fa; (20)], by BWA (52) with base quality score recalibration (BQSR) and local realignment around known indels as implemented in GATK (53, 54)....

    [...]

Journal ArticleDOI
TL;DR: It is shown that Africa harbors the greatest variation and represents the deepest history in the A. thaliana lineage, and evidence is revealed that selfing, a major defining characteristic of the species, evolved in a single geographic region, best represented today within Africa.
Abstract: Over the past 20 y, many studies have examined the history of the plant ecological and molecular model, Arabidopsis thaliana, in Europe and North America. Although these studies informed us about the recent history of the species, the early history has remained elusive. In a large-scale genomic analysis of African A. thaliana, we sequenced the genomes of 78 modern and herbarium samples from Africa and analyzed these together with over 1,000 previously sequenced Eurasian samples. In striking contrast to expectations, we find that all African individuals sampled are native to this continent, including those from sub-Saharan Africa. Moreover, we show that Africa harbors the greatest variation and represents the deepest history in the A. thaliana lineage. Our results also reveal evidence that selfing, a major defining characteristic of the species, evolved in a single geographic region, best represented today within Africa. Demographic inference supports a model in which the ancestral A. thaliana population began to split by 120–90 kya, during the last interglacial and Abbassia pluvial, and Eurasian populations subsequently separated from one another at around 40 kya. This bears striking similarities to the patterns observed for diverse species, including humans, implying a key role for climatic events during interglacial and pluvial periods in shaping the histories and current distributions of a wide range of species.

145 citations


Cites background or result from "A global reference for human geneti..."

  • ...This finding is consistent with our finding that variation in Eurasia is often a subset of variation present within Africa and is similar to the situation in humans (34)....

    [...]

  • ...thaliana bear striking similarities to those observed for human populations, particularly in the larger effective population size in Africa (34), the exodus from Africa approximately 120 kya (39, 53–55), and the splitting of major human populations in Europe and Asia (approximately 45– 35 kya) (53, 54)....

    [...]

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: SAMtools as discussed by the authors implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations

Journal ArticleDOI
TL;DR: A new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format, which allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.
Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools

18,858 citations

Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the authors' genes and genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal ArticleDOI
TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net Contact: [email protected]

10,164 citations
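
To make the VCF description above concrete, here is a minimal Python sketch that splits one tab-separated VCF data line into its fixed columns and computes an alternate-allele frequency from the per-sample genotype (GT) fields. It does not use VCFtools or its Perl API; the function names and the example record are hypothetical, and the sketch ignores headers, multi-allelic detail, and compression.

```python
def parse_vcf_line(line):
    """Split one VCF data line into its fixed columns and per-sample genotypes."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]
    # The FORMAT column (if present) names the colon-separated per-sample keys.
    format_keys = fields[8].split(":") if len(fields) > 8 else []
    genotypes = []
    if "GT" in format_keys:
        gt_index = format_keys.index("GT")
        genotypes = [sample.split(":")[gt_index] for sample in fields[9:]]
    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": variant_id,
        "ref": ref,
        "alt": alt.split(","),
        "qual": qual,
        "filter": filt,
        "info": info,
        "genotypes": genotypes,
    }


def alt_allele_frequency(genotypes):
    """Fraction of called alleles that are non-reference (any index other than 0)."""
    alleles = []
    for genotype in genotypes:
        # '|' separates phased alleles and '/' unphased ones; '.' means missing.
        for allele in genotype.replace("|", "/").split("/"):
            if allele != ".":
                alleles.append(allele)
    if not alleles:
        return None
    return sum(allele != "0" for allele in alleles) / len(alleles)


# Hypothetical record: three phased diploid samples, two of which carry one ALT allele.
example = "1\t12345\trs_example\tA\tAC\t100\tPASS\tAC=2\tGT\t0|1\t1|0\t0|0"
record = parse_vcf_line(example)
frequency = alt_allele_frequency(record["genotypes"])
```

In the example record, two of the six called alleles are non-reference, so the computed frequency is one third. For real data, an established parser would be a safer choice than this hand-rolled split.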