scispace - formally typeset
Search or ask a question
Journal ArticleDOI

A global reference for human genetic variation.

Adam Auton1, Gonçalo R. Abecasis2, David Altshuler3, Richard Durbin4  +514 moreInstitutions (90)
01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-generation sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
Citations
More filters
Journal ArticleDOI
TL;DR: By allowing 3 heterozygous SNPs per ROH when dealing with W GS low coverage data, it is possible to establish meaningful comparisons between data using SNP array and WGS low coverage technologies.
Abstract: Runs of Homozygosity (ROH) are genomic regions where identical haplotypes are inherited from each parent. Since their first detection due to technological advances in the late 1990s, ROHs have been shedding light on human population history and deciphering the genetic basis of monogenic and complex traits and diseases. ROH studies have predominantly exploited SNP array data, but are gradually moving to whole genome sequence (WGS) data as it becomes available. WGS data, covering more genetic variability, can add value to ROH studies, but require additional considerations during analysis. Using SNP array and low coverage WGS data from 1885 individuals from 20 world populations, our aims were to compare ROH from the two datasets and to establish software conditions to get comparable results, thus providing guidelines for combining disparate datasets in joint ROH analyses. By allowing heterozygous SNPs per window, using the PLINK homozygosity function and non-parametric analysis, we were able to obtain non-significant differences in number ROH, mean ROH size and total sum of ROH between data sets using the different technologies for almost all populations. By allowing 3 heterozygous SNPs per ROH when dealing with WGS low coverage data, it is possible to establish meaningful comparisons between data using SNP array and WGS low coverage technologies.

80 citations

Journal ArticleDOI
17 Jun 2019-Leukemia
TL;DR: The aim of this article is to review publications that reported mutated cancer-associated genes in CML patients at various disease phases and to discuss the frequency and type of such variants at initial diagnosis and at the time of treatment failure and transformation.
Abstract: Outcomes for patients with chronic myeloid leukemia (CML) have substantially improved due to advances in drug development and rational treatment intervention strategies. Despite these significant advances there are still unanswered questions on patient management regarding how to more reliably predict treatment failure at the time of diagnosis and how to select frontline tyrosine kinase inhibitor (TKI) therapy for optimal outcome. The BCR-ABL1 transcript level at diagnosis has no established prognostic impact and cannot guide frontline TKI selection. BCR-ABL1 mutations are detected in ~50% of TKI resistant patients but are rarely responsible for primary resistance. Other resistance mechanisms are largely uncharacterized and there are no other routine molecular testing strategies to facilitate the evaluation and further stratification of TKI resistance. Advances in next-generation sequencing technology has aided the management of a growing number of other malignancies, enabling the incorporation of somatic mutation profiles in diagnosis, classification, and prognostication. A largely unexplored area in CML research is whether expanded genomic analysis at diagnosis, resistance, and disease transformation can enhance patient management decisions, as has occurred for other cancers. The aim of this article is to review publications that reported mutated cancer-associated genes in CML patients at various disease phases. We discuss the frequency and type of such variants at initial diagnosis and at the time of treatment failure and transformation. Current limitations in the evaluation of mutants and recommendations for future reporting are outlined. The collective evaluation of mutational studies over more than a decade suggests a limited set of cancer-associated genes are indeed recurrently mutated in CML and some at a relatively high frequency. Genomic studies have the potential to lay the foundation for improved diagnostic risk classification according to clinical and genomic risk, and to enable more precise early identification of TKI resistance.

80 citations


Cites background from "A global reference for human geneti..."

  • ...For some variants, sufficient information was supplied to review their frequency in population databases [62, 63]....

    [...]

Journal ArticleDOI
TL;DR: Changes in the structure and sequence of these arrays likely contribute to changes in CACNA1C function during human evolution and may modulate neuropsychiatric disease risk in modern human populations.
Abstract: Bipolar disorder (BD) and schizophrenia (SCZ) are highly heritable diseases that affect more than 3% of individuals worldwide. Genome-wide association studies have strongly and repeatedly linked risk for both of these neuropsychiatric diseases to a 100 kb interval in the third intron of the human calcium channel gene CACNA1C. However, the causative mutation is not yet known. We have identified a human-specific tandem repeat in this region that is composed of 30 bp units, often repeated hundreds of times. This large tandem repeat is unstable using standard polymerase chain reaction and bacterial cloning techniques, which may have resulted in its incorrect size in the human reference genome. The large 30-mer repeat region is polymorphic in both size and sequence in human populations. Particular sequence variants of the 30-mer are associated with risk status at several flanking single-nucleotide polymorphisms in the third intron of CACNA1C that have previously been linked to BD and SCZ. The tandem repeat arrays function as enhancers that increase reporter gene expression in a human neural progenitor cell line. Different human arrays vary in the magnitude of enhancer activity, and the 30-mer arrays associated with increased psychiatric disease risk status have decreased enhancer activity. Changes in the structure and sequence of these arrays likely contribute to changes in CACNA1C function during human evolution and may modulate neuropsychiatric disease risk in modern human populations.

80 citations

Journal ArticleDOI
TL;DR: STRait Razor v3.0 adds several key features that simplify the haplotype reporting process, including simple filters to remove low frequency haplotypes as well as merging haplotypes within a locus encoded on opposite strands of the DNA molecule.
Abstract: The short tandem repeat allele identification tool (STRait Razor), a program used to characterize the haplotypes of short tandem repeats (STRs) in massively parallel sequencing (MPS) data, was redesigned STRait Razor v30 performs ∼660× faster allele identification than its previous version (v2s), a speedup that is largely due to a novel indexing strategy used to perform "fuzzy" (approximate) string matching of anchor sequences Written in a portable compiled language, C++, STRait Razor v30 functions on all major operating systems including Microsoft Windows, and it has cross-platform multithreading support In silico estimates of precision and accuracy of STRait Razor v30 were 100% in this evaluation and results were highly concordant with those of Strait Razor v2s STRait Razor v30 adds several key features that simplify the haplotype reporting process, including simple filters to remove low frequency haplotypes as well as merging haplotypes within a locus encoded on opposite strands of the DNA molecule

80 citations

Journal ArticleDOI
TL;DR: A novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease and unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.
Abstract: Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held out and variant sequences. We further demonstrate a novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.

80 citations

References
More filters
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: SAMtools as discussed by the authors implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations

Journal ArticleDOI
TL;DR: A new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format, which allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.
Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing webbased methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools

18,858 citations

Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the authors' genes and genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal ArticleDOI
TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net Contact: [email protected]

10,164 citations