Journal ArticleDOI

A global reference for human genetic variation.

Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, +514 more; Institutions (90)
01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. It reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
Citations
Journal ArticleDOI
TL;DR: It is found that the number and allelic frequencies of sites that are uniquely shared between archaic humans and specific present-day populations are particularly useful for detecting adaptive introgression.
Abstract: Comparisons of DNA from archaic and modern humans show that these groups interbred, and in some cases received an evolutionary advantage from doing so. This process, adaptive introgression, may lead to a faster rate of adaptation than is predicted from models with mutation and selection alone. Within the last couple of years, a series of studies have identified regions of the genome that are likely examples of adaptive introgression. In many cases, once a region was ascertained as being introgressed, commonly used statistics based on both haplotype and allele frequency information were employed to test for positive selection. Introgression by itself, however, changes both the haplotype structure and the distribution of allele frequencies, thus confounding traditional tests for detecting positive selection. Therefore, patterns generated by introgression alone may lead to false inferences of positive selection. Here we explore models involving both introgression and positive selection to investigate the behavior of various statistics under adaptive introgression. In particular, we find that the number and allelic frequencies of sites that are uniquely shared between archaic humans and specific present-day populations are particularly useful for detecting adaptive introgression. We then examine the 1000 Genomes dataset to characterize the landscape of uniquely shared archaic alleles in human populations. Finally, we identify regions that were likely subject to adaptive introgression and discuss some of the most promising candidate genes located in these regions.

162 citations


Cites background or methods from "A global reference for human geneti..."

  • ...Indeed, population structure analyses of the 1000 Genomes samples suggest that Peruvians have the largest amount of Native American ancestry (Auton et al. 2015) and show a bottleneck with a lack of recent population growth, which could explain this pattern....

    [...]

  • ...We then apply these statistics to real human genomic data from phase 3 of the 1000 Genomes Project (Auton et al. 2015), to detect AI in human populations, and find candidate genes....

    [...]

  • ...We used each of the non-African panels in the 1000 Genomes Project phase 3 data (Auton et al. 2015) as the “target” panel (B), and chose the outgroup panel (A) to be the combination of all African populations (YRI, LWK, GWD, MSL, and ESN), excluding admixed African-Americans....

    [...]

  • ...Candidate Regions for Adaptive Introgression: To identify adaptively introgressed regions of the genome, we computed U_{A,B,C,D}(w, x, y, z) and Q95_{A,B,C,D}(w, y, z) in 40 kb nonoverlapping windows along the genome, using the low-coverage sequencing data from phase 3 of the 1000 Genomes Project (Auton et al. 2015)....

    [...]

  • ...By scanning the present-day human genomes from phase 3 of the 1000 Genomes Project (Auton et al. 2015) using these and other summary statistics, we were able to recapitulate previous AI findings (like the TLR [Dannemann et al. 2016; Deschamps et al. 2016] and OAS regions [Mendez et al. 2013]) as well as identify new candidate regions for AI in Eurasia (like the LIPA gene and the FAP/IFIH1 region)....

    [...]
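The excerpts above name windowed statistics but do not define them, so the following is only a minimal sketch of the general idea stated in the abstract: counting sites that are uniquely shared between an archaic genome and a target population, in 40 kb nonoverlapping windows. The function name, the single-archaic simplification, the thresholds `w`, `x`, `y`, and the toy data are all assumptions for illustration, not the authors' implementation.

```python
from collections import Counter

WINDOW = 40_000  # 40 kb nonoverlapping windows, as in the excerpt


def count_uniquely_shared(sites, w=0.01, x=0.2, y=1.0, window=WINDOW):
    """Count, per window, sites that look uniquely shared with an archaic genome.

    `sites` is an iterable of (position, outgroup_freq, target_freq, archaic_freq)
    with 1-based positions. A site is counted when the allele is rare in the
    outgroup (< w), common in the target (> x) and at frequency y in the archaic.
    The thresholds are hypothetical defaults chosen for this sketch.
    """
    counts = Counter()
    for pos, f_out, f_target, f_archaic in sites:
        if f_out < w and f_target > x and f_archaic == y:
            counts[(pos - 1) // window] += 1  # window index, starting at 0
    return dict(counts)


# Made-up allele frequencies, for illustration only.
toy_sites = [
    (1_000, 0.00, 0.35, 1.0),   # counted in window 0
    (25_000, 0.00, 0.10, 1.0),  # target frequency too low
    (39_999, 0.05, 0.60, 1.0),  # too common in the outgroup
    (40_001, 0.00, 0.50, 1.0),  # counted in window 1
    (79_500, 0.00, 0.25, 1.0),  # counted in window 1
    (90_000, 0.00, 0.90, 0.0),  # absent from the archaic genome
]
window_counts = count_uniquely_shared(toy_sites)
print(window_counts)
```

Windows with unusually high counts would be the candidates worth inspecting further; how to set the thresholds and assess significance is left to the cited paper.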

Journal ArticleDOI
TL;DR: The results demonstrate that systematic clinically oriented pathway-based analysis of genomic data can accelerate the discovery of rare genetic disorders.
Abstract: Histone lysine methyltransferases (KMTs) and demethylases (KDMs) underpin gene regulation. Here we demonstrate that variants causing haploinsufficiency of KMTs and KDMs are frequently encountered in individuals with developmental disorders. Using a combination of human variation databases and existing animal models, we determine 22 KMTs and KDMs as additional candidates for dominantly inherited developmental disorders. We show that KMTs and KDMs that are associated with, or are candidates for, dominant developmental disorders tend to have a higher level of transcription, longer canonical transcripts, more interactors, and a higher number and more types of post-translational modifications than other KMTs and KDMs. We provide evidence to firmly associate KMT2C, ASH1L, and KMT5B haploinsufficiency with dominant developmental disorders. Whereas KMT2C or ASH1L haploinsufficiency results in a predominantly neurodevelopmental phenotype with occasional physical anomalies, KMT5B mutations cause an overgrowth syndrome with intellectual disability. We further expand the phenotypic spectrum of KMT2B-related disorders and show that some individuals can have severe developmental delay without dystonia at least until mid-childhood. Additionally, we describe a recessive histone lysine-methylation defect caused by homozygous or compound heterozygous KDM5B variants and resulting in a recognizable syndrome with developmental delay, facial dysmorphism, and camptodactyly. Collectively, these results emphasize the significance of histone lysine methylation in normal human development and the importance of this process in human developmental disorders. Our results demonstrate that systematic clinically oriented pathway-based analysis of genomic data can accelerate the discovery of rare genetic disorders.

162 citations

Journal ArticleDOI
24 May 2019-Science
TL;DR: The characteristics of mtDNA in the human population are shaped by selective forces acting on heteroplasmy within the female germ line and are influenced by the nuclear genetic background, as indicated by population genetic evidence that selection shapes the evolving mtDNA phylogeny.
Abstract: Approximately 2.4% of the human mitochondrial DNA (mtDNA) genome exhibits common homoplasmic genetic variation. We analyzed 12,975 whole-genome sequences to show that 45.1% of individuals from 1526 mother-offspring pairs harbor a mixed population of mtDNA (heteroplasmy), but the propensity for maternal transmission differs across the mitochondrial genome. Over one generation, we observed selection both for and against variants in specific genomic regions; known variants were more likely to be transmitted than previously unknown variants. However, new heteroplasmies were more likely to match the nuclear genetic ancestry as opposed to the ancestry of the mitochondrial genome on which the mutations occurred, validating our findings in 40,325 individuals. Thus, human mtDNA at the population level is shaped by selective forces within the female germ line under nuclear genetic control, which ensures consistency between the two independent genetic lineages.

162 citations

Journal ArticleDOI
02 Jul 2020-Nature
TL;DR: A scalable pipeline is used to map and characterize structural variants in 17,795 deeply sequenced human genomes to create the largest, to the authors' knowledge, whole-genome-sequencing-based structural variant resource so far and infer the dosage sensitivity of genes and noncoding elements.
Abstract: A key goal of whole-genome sequencing for studies of human genetics is to interrogate all forms of variation, including single-nucleotide variants, small insertion or deletion (indel) variants and structural variants. However, tools and resources for the study of structural variants have lagged behind those for smaller variants. Here we used a scalable pipeline to map and characterize structural variants in 17,795 deeply sequenced human genomes. We publicly release site-frequency data to create the largest, to our knowledge, whole-genome-sequencing-based structural variant resource so far. On average, individuals carry 2.9 rare structural variants that alter coding regions; these variants affect the dosage or structure of 4.2 genes and account for 4.0-11.2% of rare high-impact coding alleles. Using a computational model, we estimate that structural variants account for 17.2% of rare alleles genome-wide, with predicted deleterious effects that are equivalent to loss-of-function coding alleles; approximately 90% of such structural variants are noncoding deletions (mean 19.1 per genome). We report 158,991 ultra-rare structural variants and show that 2% of individuals carry ultra-rare megabase-scale structural variants, nearly half of which are balanced or complex rearrangements. Finally, we infer the dosage sensitivity of genes and noncoding elements, and reveal trends that relate to element class and conservation. This work will help to guide the analysis and interpretation of structural variants in the era of whole-genome sequencing.

162 citations

Journal ArticleDOI
02 Oct 2020-Science
TL;DR: A rich landscape of mutational processes and selection in normal urothelium with large heterogeneity across clones and individuals is revealed, which suggests differential exposure to mutagens in the urine.
Abstract: The extent of somatic mutation and clonal selection in the human bladder remains unknown. We sequenced 2097 bladder microbiopsies from 20 individuals using targeted (n = 1914 microbiopsies), whole-exome (n = 655), and whole-genome (n = 88) sequencing. We found widespread positive selection in 17 genes. Chromatin remodeling genes were frequently mutated, whereas mutations were absent in several major bladder cancer genes. There was extensive interindividual variation in selection, with different driver genes dominating the clonal landscape across individuals. Mutational signatures were heterogeneous across clones and individuals, which suggests differential exposure to mutagens in the urine. Evidence of APOBEC mutagenesis was found in 22% of the microbiopsies. Sequencing multiple microbiopsies from five patients with bladder cancer enabled comparisons with cancer-free individuals and across histological features. This study reveals a rich landscape of mutational processes and selection in normal urothelium with large heterogeneity across clones and individuals.

162 citations

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net

45,957 citations

Journal ArticleDOI
TL;DR: BEDTools is a software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. It allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.
Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools
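To make "searching for overlaps between features" concrete, here is a minimal sketch of an interval intersection over BED-style records, which use 0-based, half-open coordinates. The function name and the toy intervals are hypothetical, and the nested loop is a deliberately naive stand-in: it is not how BEDTools itself is implemented and would not scale to the dataset sizes the abstract has in mind.

```python
def intersect(features_a, features_b):
    """Return the overlapping segments between two lists of intervals.

    Each interval is (chrom, start, end) with BED-style 0-based, half-open
    coordinates, so intervals that merely touch do not overlap.
    Naive O(len(a) * len(b)) comparison, for illustration only.
    """
    hits = []
    for chrom_a, start_a, end_a in features_a:
        for chrom_b, start_b, end_b in features_b:
            if chrom_a != chrom_b:
                continue
            start = max(start_a, start_b)
            end = min(end_a, end_b)
            if start < end:  # non-empty shared segment
                hits.append((chrom_a, start, end))
    return hits


# Made-up intervals, for illustration only.
reads = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
genes = [("chr1", 150, 550), ("chr2", 200, 300)]
overlaps = intersect(reads, genes)
print(overlaps)
```

The chr2 pair is not reported because the first interval ends exactly where the second begins, which is the expected behaviour for half-open coordinates.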

18,858 citations

Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the authors' genes and genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal ArticleDOI
TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging and comparing, and also provides a general Perl API.
Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net
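As a rough illustration of the kind of record the abstract describes, the sketch below parses one tab-separated VCF data line and computes the non-reference allele frequency from its genotype (GT) fields. It assumes an uncompressed line with a FORMAT column that contains GT; the function name and the example record are made up, and this is plain Python rather than VCFtools or its Perl API.

```python
import re


def alt_allele_frequency(vcf_line):
    """Return the fraction of called alleles that are non-reference.

    Expects one VCF data line: eight fixed columns, a FORMAT column, then one
    column per sample. Missing alleles (".") are skipped. Returns None when no
    allele is called.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    gt_index = fields[8].split(":").index("GT")  # position of GT within FORMAT
    alt = total = 0
    for sample in fields[9:]:
        genotype = sample.split(":")[gt_index]
        for allele in re.split(r"[|/]", genotype):  # phased "|" or unphased "/"
            if allele == ".":
                continue
            total += 1
            if allele != "0":  # "0" denotes the reference allele
                alt += 1
    return alt / total if total else None


# A made-up record with four samples, for illustration only.
line = "\t".join(
    ["1", "12345", ".", "A", "AC", "100", "PASS", ".", "GT",
     "0|1", "1|1", "0|0", ".|."]
)
freq = alt_allele_frequency(line)
print(freq)
```

For multi-allelic sites this lumps all alternate alleles together, which is a simplification chosen for this sketch.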

10,164 citations