A global reference for human genetic variation.

doi:10.1038/NATURE15393

Journal Article•DOI•

Adam Auton¹, Gonçalo R. Abecasis², David Altshuler³, Richard Durbin⁴ +514 more•Institutions (90)

01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74

TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.

Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Citations

Journal Article•DOI•

Assessing runs of Homozygosity: a comparison of SNP Array and whole genome sequence low coverage data.

Francisco C. Ceballos¹, Scott Hazelhurst¹, Michèle Ramsay¹•Institutions (1)

University of the Witwatersrand¹

30 Jan 2018-BMC Genomics

TL;DR: By allowing 3 heterozygous SNPs per ROH when dealing with WGS low coverage data, it is possible to establish meaningful comparisons between data using SNP array and WGS low coverage technologies.

Abstract: Runs of Homozygosity (ROH) are genomic regions where identical haplotypes are inherited from each parent. Since their first detection due to technological advances in the late 1990s, ROHs have been shedding light on human population history and deciphering the genetic basis of monogenic and complex traits and diseases. ROH studies have predominantly exploited SNP array data, but are gradually moving to whole genome sequence (WGS) data as it becomes available. WGS data, covering more genetic variability, can add value to ROH studies, but require additional considerations during analysis. Using SNP array and low coverage WGS data from 1885 individuals from 20 world populations, our aims were to compare ROH from the two datasets and to establish software conditions to get comparable results, thus providing guidelines for combining disparate datasets in joint ROH analyses. By allowing heterozygous SNPs per window, using the PLINK homozygosity function and non-parametric analysis, we were able to obtain non-significant differences in number of ROH, mean ROH size and total sum of ROH between data sets using the different technologies for almost all populations. By allowing 3 heterozygous SNPs per ROH when dealing with WGS low coverage data, it is possible to establish meaningful comparisons between data using SNP array and WGS low coverage technologies.
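
The effect of tolerating a few heterozygous calls inside a run can be illustrated with a minimal sketch. This is not PLINK's sliding-window procedure and not the authors' pipeline; the function name `find_roh`, the 0/1/2 genotype coding (1 = heterozygous), and the greedy, non-overlapping scan are all assumptions made for illustration.

```python
def find_roh(genotypes, min_snps=5, max_het=3):
    """Greedy scan for runs of homozygosity (illustrative only).

    genotypes: list of alternate-allele counts per SNP (0, 1 or 2);
               a value of 1 is treated as a heterozygous call.
    min_snps:  minimum number of SNPs a run must span to be reported.
    max_het:   number of heterozygous calls tolerated inside one run.
    Returns a list of (start_index, end_index) pairs, inclusive.
    """
    runs = []
    start = 0   # index where the current candidate run begins
    het = 0     # heterozygous calls seen in the current candidate run
    for i, g in enumerate(genotypes):
        if g == 1:
            het += 1
            if het > max_het:
                # The tolerance is exceeded: close the run just before this SNP.
                if i - start >= min_snps:
                    runs.append((start, i - 1))
                start = i + 1
                het = 0
    # Close the final candidate run at the end of the sequence.
    if len(genotypes) - start >= min_snps:
        runs.append((start, len(genotypes) - 1))
    return runs


calls = [0, 0, 2, 1, 0, 0, 1, 1, 0]
strict = find_roh(calls, min_snps=2, max_het=0)   # no heterozygous calls allowed
lenient = find_roh(calls, min_snps=2, max_het=3)  # up to 3 allowed per run
```

With no tolerance the toy sequence is split into short runs, whereas allowing up to 3 heterozygous calls reports it as a single run, which is the kind of difference that a tolerance setting is meant to absorb when occasional miscalls are expected in low coverage data.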

80 citations

Journal Article•DOI•

Laying the foundation for genomically-based risk assessment in chronic myeloid leukemia.

Susan Branford¹, Dennis Dong Hwan Kim², Jane F. Apperley³, Christopher A. Eide⁴, Satu Mustjoki⁵, Sin Tiong Ong⁶, Georgios Nteliopoulos³, Thomas Ernst, Charles Chuah⁶,⁷, Carlo Gambacorti-Passerini⁸, Michael J. Mauro⁹, Brian J. Druker⁴,¹⁰, Dong-Kee Kim¹¹, Francois-Xavier Mahon¹², Jorge E. Cortes¹³, Jerry Radich¹⁴, A. Hochhaus, Timothy P. Hughes¹,¹⁵•Institutions (15)

South Australia Pathology¹, Princess Margaret Cancer Centre², Imperial College London³, Oregon Health & Science University⁴, University of Helsinki⁵, National University of Singapore⁶, Singapore General Hospital⁷, University of Milano-Bicocca⁸, Memorial Sloan Kettering Cancer Center⁹, Howard Hughes Medical Institute¹⁰, Catholic University of Korea¹¹, University of Bordeaux¹², University of Texas MD Anderson Cancer Center¹³, Fred Hutchinson Cancer Research Center¹⁴, University of Adelaide¹⁵

17 Jun 2019-Leukemia

TL;DR: The aim of this article is to review publications that reported mutated cancer-associated genes in CML patients at various disease phases and to discuss the frequency and type of such variants at initial diagnosis and at the time of treatment failure and transformation.

Abstract: Outcomes for patients with chronic myeloid leukemia (CML) have substantially improved due to advances in drug development and rational treatment intervention strategies. Despite these significant advances there are still unanswered questions on patient management regarding how to more reliably predict treatment failure at the time of diagnosis and how to select frontline tyrosine kinase inhibitor (TKI) therapy for optimal outcome. The BCR-ABL1 transcript level at diagnosis has no established prognostic impact and cannot guide frontline TKI selection. BCR-ABL1 mutations are detected in ~50% of TKI resistant patients but are rarely responsible for primary resistance. Other resistance mechanisms are largely uncharacterized and there are no other routine molecular testing strategies to facilitate the evaluation and further stratification of TKI resistance. Advances in next-generation sequencing technology have aided the management of a growing number of other malignancies, enabling the incorporation of somatic mutation profiles in diagnosis, classification, and prognostication. A largely unexplored area in CML research is whether expanded genomic analysis at diagnosis, resistance, and disease transformation can enhance patient management decisions, as has occurred for other cancers. The aim of this article is to review publications that reported mutated cancer-associated genes in CML patients at various disease phases. We discuss the frequency and type of such variants at initial diagnosis and at the time of treatment failure and transformation. Current limitations in the evaluation of mutants and recommendations for future reporting are outlined. The collective evaluation of mutational studies over more than a decade suggests a limited set of cancer-associated genes are indeed recurrently mutated in CML and some at a relatively high frequency. Genomic studies have the potential to lay the foundation for improved diagnostic risk classification according to clinical and genomic risk, and to enable more precise early identification of TKI resistance.

80 citations

Cites background from "A global reference for human geneti..."

...For some variants, sufficient information was supplied to review their frequency in population databases [62, 63]....

Journal Article•DOI•

Characterization of a Human-Specific Tandem Repeat Associated with Bipolar Disorder and Schizophrenia.

Janet H.T. Song¹, Craig B. Lowe¹,², David M. Kingsley¹,²•Institutions (2)

Stanford University¹, Howard Hughes Medical Institute²

06 Sep 2018-American Journal of Human Genetics

TL;DR: Changes in the structure and sequence of these arrays likely contribute to changes in CACNA1C function during human evolution and may modulate neuropsychiatric disease risk in modern human populations.

Abstract: Bipolar disorder (BD) and schizophrenia (SCZ) are highly heritable diseases that affect more than 3% of individuals worldwide. Genome-wide association studies have strongly and repeatedly linked risk for both of these neuropsychiatric diseases to a 100 kb interval in the third intron of the human calcium channel gene CACNA1C. However, the causative mutation is not yet known. We have identified a human-specific tandem repeat in this region that is composed of 30 bp units, often repeated hundreds of times. This large tandem repeat is unstable using standard polymerase chain reaction and bacterial cloning techniques, which may have resulted in its incorrect size in the human reference genome. The large 30-mer repeat region is polymorphic in both size and sequence in human populations. Particular sequence variants of the 30-mer are associated with risk status at several flanking single-nucleotide polymorphisms in the third intron of CACNA1C that have previously been linked to BD and SCZ. The tandem repeat arrays function as enhancers that increase reporter gene expression in a human neural progenitor cell line. Different human arrays vary in the magnitude of enhancer activity, and the 30-mer arrays associated with increased psychiatric disease risk status have decreased enhancer activity. Changes in the structure and sequence of these arrays likely contribute to changes in CACNA1C function during human evolution and may modulate neuropsychiatric disease risk in modern human populations.

80 citations

Journal Article•DOI•

Fast STR allele identification with STRait Razor 3.0

August E. Woerner¹, Jonathan L. King¹, Bruce Budowle¹•Institutions (1)

University of North Texas Health Science Center¹

01 Sep 2017-Forensic Science International-genetics

TL;DR: STRait Razor v3.0 adds several key features that simplify the haplotype reporting process, including simple filters to remove low frequency haplotypes as well as merging haplotypes within a locus encoded on opposite strands of the DNA molecule.

Abstract: The short tandem repeat allele identification tool (STRait Razor), a program used to characterize the haplotypes of short tandem repeats (STRs) in massively parallel sequencing (MPS) data, was redesigned. STRait Razor v3.0 performs ∼660× faster allele identification than its previous version (v2s), a speedup that is largely due to a novel indexing strategy used to perform "fuzzy" (approximate) string matching of anchor sequences. Written in a portable compiled language, C++, STRait Razor v3.0 functions on all major operating systems including Microsoft Windows, and it has cross-platform multithreading support. In silico estimates of precision and accuracy of STRait Razor v3.0 were 100% in this evaluation and results were highly concordant with those of STRait Razor v2s. STRait Razor v3.0 adds several key features that simplify the haplotype reporting process, including simple filters to remove low frequency haplotypes as well as merging haplotypes within a locus encoded on opposite strands of the DNA molecule.

80 citations

Journal Article•DOI•

Cross-species regulatory sequence activity prediction.

David R. Kelley

20 Jul 2020-PLOS Computational Biology

TL;DR: A novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease and unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.

Abstract: Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held out and variant sequences. We further demonstrate a novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.

80 citations

References

Journal Article•DOI•

Basic Local Alignment Search Tool

Stephen F. Altschul¹, Warren Gish¹, Webb Miller², Eugene W. Myers³, David J. Lipman¹•Institutions (3)

National Institutes of Health¹, Pennsylvania State University², University of Arizona³

01 Oct 1990-Journal of Molecular Biology

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal Article•DOI•

The Sequence Alignment/Map format and SAMtools

Heng Li¹, Bob Handsaker², Alec Wysoker², T. J. Fennell², Jue Ruan³, Nils Homer², Gabor T. Marth⁴, Gonçalo R. Abecasis², Richard Durbin¹•Institutions (4)

Wellcome Trust Sanger Institute¹, University of California, Los Angeles², Chinese Academy of Sciences³, Boston College⁴

01 Aug 2009-Bioinformatics

TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.

Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net

45,957 citations

Journal Article•DOI•

BEDTools: a flexible suite of utilities for comparing genomic features

Aaron R. Quinlan¹, Ira M. Hall¹•Institutions (1)

University of Virginia¹

15 Mar 2010-Bioinformatics

TL;DR: A new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format, which allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.

Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools

18,858 citations

Journal Article•DOI•

An integrated encyclopedia of DNA elements in the human genome

Principal investigators¹, Nhgri groups², Data production leads³, Lead analysts³•Institutions (3)

Wellcome Trust¹, University of Washington², Pennsylvania State University³

06 Sep 2012-Nature

TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal Article•DOI•

The variant call format and VCFtools

Petr Danecek¹, Adam Auton², Gonçalo R. Abecasis³, Cornelis A. Albers¹, Eric Banks⁴, Mark A. DePristo⁴, Robert E. Handsaker⁴, Gerton Lunter², Gabor T. Marth⁵, Stephen T. Sherry⁶, Gilean McVean², Richard Durbin¹•Institutions (6)

Wellcome Trust¹, University of Oxford², University of Michigan³, Broad Institute⁴, Boston College⁵, National Institutes of Health⁶

01 Aug 2011-Bioinformatics

TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.

Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net

10,164 citations