A global reference for human genetic variation.

doi:10.1038/NATURE15393

Journal Article•DOI•

Adam Auton¹, Gonçalo R. Abecasis², David Altshuler³, Richard Durbin⁴ +514 more•Institutions (90)

01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74

TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.

Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

Citations

Journal Article•DOI•

Resolving the full spectrum of human genome variation using Linked-Reads.

Patrick Marks, Sarah T. Garcia, Alvaro Martinez Barrio, Kamila Belhocine, Jorge Bernate, Rajiv Bharadwaj, Keith Bjornson, Claudia Catalanotti, Josh Delaney, Adrian Fehr, Ian T. Fiddes, Brendan Galvin, Haynes Heaton, Jill Herschleb, Christopher Hindson, Esty Holt¹, Cassandra B. Jabara, Susanna Jett, Nikka Keivanfar, Sofia Kyriazopoulou-Panagiotopoulou, Monkol Lek², Bill Kengli Lin, Adam Lowe, Shazia Mahamdallie¹, Shamoni Maheshwari, Tony Makarewicz, Jamie L. Marshall³, Francesca Meschi, Christopher J. O'Keefe, Heather Ordonez, Pranav Patel, Andrew D. Price, Ariel Royall, Elise Ruark¹, Sheila Seal¹, Michael Schnall-Levin, Preyas Shah, David Stafford, Stephen R. Williams, Indira Wu, Andrew Wei Xu, Nazneen Rahman¹, Daniel G. MacArthur²,³, Deanna M. Church•Institutions (3)

Institute of Cancer Research¹, Harvard University², Broad Institute³

20 Mar 2019-Genome Research

TL;DR: The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone, and allow for simultaneous detection of small and large variants from a single library.

Abstract: Large-scale population analyses coupled with advances in technology have demonstrated that the human genome is more diverse than originally thought. To date, this diversity has largely been uncovered using short-read whole-genome sequencing. However, these short-read approaches fail to give a complete picture of a genome. They struggle to identify structural events, cannot access repetitive regions, and fail to resolve the human genome into haplotypes. Here, we describe an approach that retains long range information while maintaining the advantages of short reads. Starting from ∼1 ng of high molecular weight DNA, we produce barcoded short-read libraries. Novel informatic approaches allow for the barcoded short reads to be associated with their original long molecules producing a novel data type known as "Linked-Reads". This approach allows for simultaneous detection of small and large variants from a single library. In this manuscript, we show the advantages of Linked-Reads over standard short-read approaches for reference-based analysis. Linked-Reads allow mapping to 38 Mb of sequence not accessible to short reads, adding sequence in 423 difficult-to-sequence genes including disease-relevant genes STRC, SMN1, and SMN2. Both Linked-Read whole-genome and whole-exome sequencing identify complex structural variations, including balanced events and single exon deletions and duplications. Further, Linked-Reads extend the region of high-confidence calls by 68.9 Mb. The data presented here show that Linked-Reads provide a scalable approach for comprehensive genome analysis that is not possible using short reads alone.

166 citations

Cites background from "A global reference for human geneti..."

...Since the completion of the HGP many large scale consortia studies have applied whole genome sequencing to thousands of individuals from diverse populations across the globe (Auton et al. 2015; Lek et al. 2016; Sudmant et al. 2015)....

Journal Article•DOI•

Racial Disparity in Gastrointestinal Cancer Risk

Hassan Ashktorab¹, Sonia S. Kupfer², Hassan Brim¹, John M. Carethers³•Institutions (3)

University of Washington¹, University of Chicago², University of Michigan³

01 Oct 2017-Gastroenterology

TL;DR: Cognizance of disparities in gastrointestinal cancer risk, as well as approaches that apply precision medicine methods to populations with the increased risk, may reduce the observed disparities for digestive cancers.

165 citations

Cites background from "A global reference for human geneti..."

...These genetic risk variants are only found in East Asian populations [19]....
...Moreover, variants in PLCE1 [20,21] were associated with SCC in 2 GWAS evaluations, and the allele frequencies of these variants are similar across populations [19], but have not been studied as risk factors in non-Asian populations....

Journal Article•DOI•

Ancestral Origins and Genetic History of Tibetan Highlanders

Dongsheng Lu¹,², Haiyi Lou², Kai Yuan¹,², Xiaoji Wang¹,²,³, Yuchen Wang¹,², Chao Zhang¹,², Yan Lu², Xiong Yang¹,², Lian Deng¹,², Ying Zhou¹,², Qidi Feng¹,², Ya Hu⁴, Qiliang Ding⁴, Yajun Yang⁴, Shilin Li⁴, Li Jin⁴, Yaqun Guan⁵, Bing Su⁶, Longli Kang⁷, Shuhua Xu•Institutions (7)

Chinese Academy of Sciences¹, CAS-MPG Partner Institute for Computational Biology², ShanghaiTech University³, Fudan University⁴, Xinjiang Medical University⁵, Kunming Institute of Zoology⁶, Minzu University of China⁷

01 Sep 2016-American Journal of Human Genetics

TL;DR: The results support that Tibetans arose from a mixture of multiple ancestral gene pools but that their origins are much more complicated and ancient than previously suspected.

Abstract: The origin of Tibetans remains one of the most contentious puzzles in history, anthropology, and genetics. Analyses of deeply sequenced (30×–60×) genomes of 38 Tibetan highlanders and 39 Han Chinese lowlanders, together with available data on archaic and modern humans, allow us to comprehensively characterize the ancestral makeup of Tibetans and uncover their origins. Non-modern human sequences compose ∼6% of the Tibetan gene pool and form unique haplotypes in some genomic regions, where Denisovan-like, Neanderthal-like, ancient-Siberian-like, and unknown ancestries are entangled and elevated. The shared ancestry of Tibetan-enriched sequences dates back to ∼62,000–38,000 years ago, predating the Last Glacial Maximum (LGM) and representing early colonization of the plateau. Nonetheless, most of the Tibetan gene pool is of modern human origin and diverged from that of Han Chinese ∼15,000 to ∼9,000 years ago, which can be largely attributed to post-LGM arrivals. Analysis of ∼200 contemporary populations showed that Tibetans share ancestry with populations from East Asia (∼82%), Central Asia and Siberia (∼11%), South Asia (∼6%), and western Eurasia and Oceania (∼1%). Our results support that Tibetans arose from a mixture of multiple ancestral gene pools but that their origins are much more complicated and ancient than previously suspected. We provide compelling evidence of the co-existence of Paleolithic and Neolithic ancestries in the Tibetan gene pool, indicating a genetic continuity between pre-historical highland-foragers and present-day Tibetans. In particular, highly differentiated sequences harbored in highlanders’ genomes were most likely inherited from pre-LGM settlers of multiple ancestral origins (SUNDer) and maintained in high frequency by natural selection.

165 citations

Journal Article•DOI•

High-depth African genomes inform human migration and health

Ananyo Choudhury¹, Shaun Aron¹, Laura R. Botigué², Dhriti Sengupta¹, Gerrit Botha³, Taoufik Bensellak⁴, Gordon Wells⁵, Judit Kumuthini⁵, Daniel Shriner⁶, Yasmina Jaufeerally Fakim⁷, Anisah W. Ghoorah⁷, Eileen Dareng⁸, Trust Odia⁹, Oluwadamilare Falola⁹, Ezekiel Adebiyi⁹, Scott Hazelhurst¹, Gaston K. Mazandu³, Oscar A. Nyangiri¹⁰, Mamana Mbiyavanga³, Alia Benkahla¹¹, Samar K. Kassim¹², Nicola Mulder³, Sally N. Adebamowo¹³, Emile R. Chimusa³, Donna M. Muzny¹⁴, Ginger A. Metcalf¹⁴, Richard A. Gibbs¹⁴, Charles N. Rotimi⁶, Michèle Ramsay¹,¹⁵, Adebowale Adeyemo⁶, Zané Lombard¹⁵, Neil A. Hanchard¹⁴•Institutions (15)

University of the Witwatersrand¹, Spanish National Research Council², University of Cape Town³, Abdelmalek Essaâdi University⁴, South African National Bioinformatics Institute⁵, National Institutes of Health⁶, University of Mauritius⁷, University of Cambridge⁸, Covenant University⁹, Makerere University¹⁰, Pasteur Institute¹¹, Ain Shams University¹², University of Maryland, Baltimore¹³, Baylor College of Medicine¹⁴, National Health Laboratory Service¹⁵

28 Oct 2020-Nature

TL;DR: The findings refine the current understanding of continental migration, identify gene flow and the response to human disease as strong drivers of genome-level population variation, and underscore the scientific imperative for a broader characterization of the genomic diversity of African individuals to understand human ancestry and improve health.

Abstract: The African continent is regarded as the cradle of modern humans and African genomes contain more genetic variation than those from any other continent, yet only a fraction of the genetic diversity among African individuals has been surveyed. Here we performed whole-genome sequencing analyses of 426 individuals, comprising 50 ethnolinguistic groups and including previously unsampled populations, to explore the breadth of genomic diversity across Africa. We uncovered more than 3 million previously undescribed variants, most of which were found among individuals from newly sampled ethnolinguistic groups, as well as 62 previously unreported loci that are under strong selection, which were predominantly found in genes that are involved in viral immunity, DNA repair and metabolism. We observed complex patterns of ancestral admixture and putative-damaging and novel variation, both within and between populations, alongside evidence that Zambia was a likely intermediate site along the routes of expansion of Bantu-speaking populations. Pathogenic variants in genes that are currently characterized as medically relevant were uncommon, but in other genes, variants denoted as ‘likely pathogenic’ in the ClinVar database were commonly observed. Collectively, these findings refine our current understanding of continental migration, identify gene flow and the response to human disease as strong drivers of genome-level population variation, and underscore the scientific imperative for a broader characterization of the genomic diversity of African individuals to understand human ancestry and improve health. Whole-genome sequencing analyses of African populations provide insights into continental migration, gene flow and the response to human disease, highlighting the importance of including diverse populations in genomic analyses to understand human ancestry and improve health.

165 citations

Journal Article•DOI•

Variant ribosomal RNA alleles are conserved and exhibit tissue-specific expression

Matthew Parks¹, Chad M. Kurylo¹, Randall A. Dass¹, Linda Bojmar¹,²,³, David Lyden¹,⁴, C. Theresa Vincent¹,³, Scott C. Blanchard¹•Institutions (4)

Cornell University¹, Linköping University², Karolinska Institutet³, Memorial Sloan Kettering Cancer Center⁴

01 Feb 2018-Science Advances

TL;DR: Analysis of whole-genome sequencing data finds that rDNA copy number varies widely across individuals, and pervasive intra- and interindividual nucleotide variation in the 5S, 5.8S, 18S, and 28S ribosomal RNA (rRNA) genes of both human and mouse is identified.

Abstract: The ribosome, the integration point for protein synthesis in the cell, is conventionally considered a homogeneous molecular assembly that only passively contributes to gene expression. Yet, epigenetic features of the ribosomal DNA (rDNA) operon and changes in the ribosome’s molecular composition have been associated with disease phenotypes, suggesting that the ribosome itself may possess inherent regulatory capacity. Analyzing whole-genome sequencing data from the 1000 Genomes Project and the Mouse Genomes Project, we find that rDNA copy number varies widely across individuals, and we identify pervasive intra- and interindividual nucleotide variation in the 5S, 5.8S, 18S, and 28S ribosomal RNA (rRNA) genes of both human and mouse. Conserved rRNA sequence heterogeneities map to functional centers of the assembled ribosome, variant rRNA alleles exhibit tissue-specific expression, and ribosomes bearing variant rRNA alleles are present in the actively translating ribosome pool. These findings provide a critical framework for exploring the possibility that the expression of genomically encoded variant rRNA alleles gives rise to physically and functionally heterogeneous ribosomes that contribute to mammalian physiology and human disease.

165 citations

References

Journal Article•DOI•

Basic Local Alignment Search Tool

Stephen F. Altschul¹, Warren Gish¹, Webb Miller², Eugene W. Myers³, David J. Lipman¹•Institutions (3)

National Institutes of Health¹, Pennsylvania State University², University of Arizona³

01 Oct 1990-Journal of Molecular Biology

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal Article•DOI•

The Sequence Alignment/Map format and SAMtools

Heng Li¹, Bob Handsaker², Alec Wysoker², T. J. Fennell², Jue Ruan³, Nils Homer², Gabor T. Marth⁴, Gonçalo R. Abecasis², Richard Durbin¹•Institutions (4)

Wellcome Trust Sanger Institute¹, University of California, Los Angeles², Chinese Academy of Sciences³, Boston College⁴

01 Aug 2009-Bioinformatics

TL;DR: SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant calling and alignment viewing, and thus provides universal tools for processing read alignments.

Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]

45,957 citations

Journal Article•DOI•

BEDTools: a flexible suite of utilities for comparing genomic features

Aaron R. Quinlan¹, Ira M. Hall¹•Institutions (1)

University of Virginia¹

15 Mar 2010-Bioinformatics

TL;DR: A new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format, which allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.

Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools
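
To make the idea of "searching for overlaps between features" concrete, here is a minimal Python sketch of an interval intersection. It is not BEDTools' own algorithm or API: the function names `overlap_length` and `intersect` are hypothetical, the pairwise loop is deliberately naive, and the coordinates are made-up examples. It assumes BED-style intervals, which are 0-based and half-open.

```python
from typing import List, Tuple

# A feature is (chromosome, start, end) with a 0-based, half-open interval,
# following the BED convention. The tuple layout is an assumption of this sketch.
Feature = Tuple[str, int, int]


def overlap_length(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Return the number of bases shared by two half-open intervals."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def intersect(features_a: List[Feature], features_b: List[Feature]) -> List[Feature]:
    """Naive pairwise intersection: report the shared segment of every
    overlapping pair on the same chromosome."""
    hits = []
    for chrom_a, start_a, end_a in features_a:
        for chrom_b, start_b, end_b in features_b:
            if chrom_a != chrom_b:
                continue
            if overlap_length(start_a, end_a, start_b, end_b) > 0:
                hits.append((chrom_a, max(start_a, start_b), min(end_a, end_b)))
    return hits


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    a = [("chr1", 100, 200), ("chr2", 50, 60)]
    b = [("chr1", 150, 250), ("chr1", 200, 300)]
    print(intersect(a, b))  # [('chr1', 150, 200)]
```

Because the intervals are half-open, a feature ending at 200 and one starting at 200 do not overlap. The quadratic loop is only practical for small inputs; large datasets call for an indexed or sorted-sweep approach.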

18,858 citations

Journal Article•DOI•

An integrated encyclopedia of DNA elements in the human genome

Principal investigators¹, NHGRI groups², Data production leads³, Lead analysts³•Institutions (3)

Wellcome Trust¹, University of Washington², Pennsylvania State University³

06 Sep 2012-Nature

TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the authors' genes and genome, and is an expansive resource of functional annotations for biomedical research.

Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal Article•DOI•

The variant call format and VCFtools

Petr Danecek¹, Adam Auton², Gonçalo R. Abecasis³, Cornelis A. Albers¹, Eric Banks⁴, Mark A. DePristo⁴, Robert E. Handsaker⁴, Gerton Lunter², Gabor T. Marth⁵, Stephen T. Sherry⁶, Gilean McVean², Richard Durbin¹•Institutions (6)

Wellcome Trust¹, University of Oxford², University of Michigan³, Broad Institute⁴, Boston College⁵, National Institutes of Health⁶

01 Aug 2011-Bioinformatics

TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.

Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net Contact: [email protected]
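
As a rough illustration of how a VCF data line carries SNPs, indels and structural variants together with annotations, the following Python sketch reads the eight fixed tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO) of a single record. It is not part of VCFtools and does not use its Perl API: `VcfRecord`, `parse_vcf_line` and `classify_alleles` are hypothetical names, the classification rule is a simplification, and the sample lines are invented.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass
class VcfRecord:
    chrom: str
    pos: int                             # 1-based position on the reference
    id: str
    ref: str
    alts: List[str]                      # ALT may hold several comma-separated alleles
    info: Dict[str, Union[str, bool]]    # INFO key=value pairs; bare flags map to True


def parse_vcf_line(line: str) -> VcfRecord:
    """Parse the eight fixed columns of one VCF data line (no header lines)."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, _qual, _filter, info = fields[:8]
    info_dict: Dict[str, Union[str, bool]] = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True
    return VcfRecord(chrom, int(pos), vid, ref, alt.split(","), info_dict)


def classify_alleles(record: VcfRecord) -> List[str]:
    """Simplified per-allele classification; real tools apply more rules."""
    kinds = []
    for alt in record.alts:
        if alt.startswith("<") or "[" in alt or "]" in alt:
            kinds.append("structural")   # symbolic allele or breakend notation
        elif len(alt) == 1 and len(record.ref) == 1:
            kinds.append("SNP")
        elif len(alt) != len(record.ref):
            kinds.append("indel")
        else:
            kinds.append("other")
    return kinds


if __name__ == "__main__":
    # Invented example line, not taken from any real call set.
    rec = parse_vcf_line("chr1\t100\t.\tA\tG,AT\t.\tPASS\tAF=0.5;DB\n")
    print(rec.alts, classify_alleles(rec))  # ['G', 'AT'] ['SNP', 'indel']
```

The sketch skips header lines, genotype columns and compressed or indexed input, all of which a production reader would need to handle.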

10,164 citations