The Simons Genome Diversity Project: 300 genomes from 142 diverse populations

Swapan Mallick, +104 more
13 Oct 2016, Vol. 538, Iss. 7624, pp. 201–206


UC Davis
UC Davis Previously Published Works
Title: The Simons Genome Diversity Project: 300 genomes from 142 diverse populations.
Permalink: https://escholarship.org/uc/item/71x1v1d2
Journal: Nature, 538(7624)
ISSN: 0028-0836
Authors: Mallick, Swapan; Li, Heng; Lipson, Mark; et al.
Publication Date: 2016-10-01
DOI: 10.1038/nature18964
Peer reviewed
eScholarship.org, powered by the California Digital Library, University of California

The Simons Genome Diversity Project: 300 genomes from 142 diverse populations
A full list of authors and affiliations appears at the end of the article.
Abstract
We report the Simons Genome Diversity Project (SGDP) dataset: high quality genomes from 300
individuals from 142 diverse populations. These genomes include at least 5.8 million base pairs
that are not present in the human reference genome. Our analysis reveals key features of the
landscape of human genome variation, including that the rate of accumulation of mutations has
accelerated by about 5% in non-Africans compared to Africans since divergence. We show that the
ancestors of some pairs of present-day human populations were substantially separated by 100,000
years ago, well before the archaeologically attested onset of behavioral modernity. We also
demonstrate that indigenous Australians, New Guineans and Andamanese do not derive substantial
ancestry from an early dispersal of modern humans; instead, their modern human ancestry is
consistent with coming from the same source as that in other non-Africans.
HHS Public Access. Author manuscript; available in PMC 2017 March 21.
Published in final edited form as: Nature. 2016 October 13; 538(7624): 201–206. doi:10.1038/nature18964.
Users may view, print, copy, and download text and data-mine the content in such documents, for the purposes of academic research,
subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
Correspondence and requests for materials should be addressed to S.M. (shop@genetics.med.harvard.edu) or D.R.
(reich@genetics.med.harvard.edu).
* These authors contributed equally
63 Present address: Genome Foundation, Hyderabad 500076, India
Author information and data access: Raw data for 277 genomes are available through the EBI European Nucleotide Archive under
accession numbers PRJEB9586 and ERP010710. For the remaining 23 genomes, the informed consent documentation is not consistent
with fully public data release; data for these genomes can be accessed through a password protected link by researchers who send
S.M. and D.R. a signed letter containing the following language: “With regards to the non-public samples from the Simons Genome
Diversity Project, I agree that: (a) I will not secondarily distribute the data to anyone, (b) I will not post it publicly, (c) I will make no
attempt to connect the genetic data to personal identifiers for the samples, (d) I will not use the data for any commercial purposes.”
The identifiers for these 23 genomes are BR_Kashmiri_Pandit-1, BR_Kharia-1, BR_Kurumba-1, BR_Mala-1, BR_Onge-1,
BR_Onge-2, S_Igbo-1, S_Igbo-2, S_Kongo-2, S_Lemande-1, S_Lemande-2, S_Chipewyan-1, S_Chipewyan-2, S_Cree-1, S_Cree-2,
S_Nahua-1, S_Nahua-2, T_Sherpa-2, T_Tibetan-1, T_Tibetan-2 and T_Sherpa-1, and are designated by code “Y” in the seventh
column of Supplementary Data Table 1. Compact versions of the SGDP dataset and software for accessing it are available at
(http://genetics.med.harvard.edu/reichlab/Reich_Lab/Datasets.html).
The authors declare competing financial interests. Ugur Hodoglugil is employed by NextBio, a division of Illumina Ltd.
Readers are welcome to comment on the online version of the paper.
Author contributions: S.M., Y.E., Y.S.S., S.P., J.K., N.P. and D.R. supervised the study. S.N., N.R., C.G., G.P., F.B., G.D., I.G.R.,
A.R.J., P.D., D.M.B., C.M.B., C.C., T.H., A.M.-E., O.L.P., E.B., O.B., S.K.-Y., H.S., D.T., L.Y., C.T.-S., Y.X., M.S.A., A.R.-L., C.B.,
A.D.R., C.J., E.B.S., E.M., J.P., R.V., B.M.H., U.H., R.W.M., A.S., G.S., J.T.S.W., R.K., E.K., S.L., G.A., D.C., M.H., T.K., W.K.,
C.W., D.L., M.B., L.B.J., S.A.T., W.S.W., M.M., S.D., R.S., L.S., K.T. and D.R. assembled samples. S.M., H.L., M.L., I.M., M.G.,
F.R., J.P.S., M.Z., N.C., A.T., P.S., I.L., S.S., Q.F., G.R., Y.S., N.P. and D.R. performed analyses. S.M., H.L., M.L., I.M., M.G., F.R.,
M.Z., N.P. and D.R. wrote the manuscript with help from all co-authors.

To obtain a complete picture of human diversity, it is necessary to sequence the genomes of
many individuals from diverse locations. To date, the largest whole-genome sequencing
survey, the 1000 Genomes Project, analyzed 26 populations of European, East Asian, South
Asian, American, and sub-Saharan African ancestry [1]. However, this and most other
sequencing studies have focused on demographically large populations. Such studies tend to
ignore smaller populations that are also important for understanding human diversity. In
addition, many of these studies have sequenced genomes to only 4–6-fold coverage. Here,
we report the Simons Genome Diversity Project (SGDP): deep genome sequences of 300
individuals from 142 populations chosen to span much of human genetic, linguistic, and
cultural variation (Supplementary Data Table 1).
Data set and catalog of novel variants
We sequenced the samples to an average coverage of 43-fold (range 34–83-fold) at Illumina
Ltd.; almost all samples (278) were prepared using the same PCR-free library preparation [2].
We aligned reads to the human reference genome hs37d5/hg19 using BWA-MEM
(BWA-0.7.12) [3] (Supplementary Information section 1). We genotyped each sample
separately using the Genome Analysis Toolkit (GATK) [4], with a modification to eliminate
bias toward genotypes matching the reference (Supplementary Information section 1). We
developed a filtering procedure that generates a sample-specific mask. At “filter level 1”,
which we recommend for most analyses, we retain an average of 2.13 Gb of sequence per
sample and identify 34.4 million single nucleotide polymorphisms (SNPs) and 2.1 million
insertion/deletion polymorphisms (indels) (Supplementary Information section 2). We have
made the GATK-processed data available in a file small enough to download by FTP, along
with software to analyze these data (Supplementary Information section 3). The SGDP
dataset highlights the incompleteness of current catalogs of human variation, with the
fraction of heterozygous positions not discovered by the 1000 Genomes Project being 11%
in the KhoeSan and 5% in New Guineans and Australians (Extended Data Fig. 1;
Supplementary Data Table 1). We used FermiKit [5] to map short reads against each other,
store the assemblies in a compressed form that retains all the information required for
polymorphism discovery and analysis, and identify SNPs by comparing against the human
reference. We find that FermiKit has comparable sensitivity and specificity to GATK for
SNP discovery and genotyping, and is more accurate for indels (Supplementary Information
section 4). FermiKit also identified 5.8 Mb of contigs that are present in the SGDP but
absent in the human reference genome, presumably because they are deleted there; these
contigs, which we have made publicly available, can be used as “decoys” to improve read
mapping (Supplementary Information section 5). Finally, we called copy number variants [6]
and used lobSTR [7,8] to genotype 1.6 million short tandem repeats (STRs) (Supplementary
Information section 6). The high quality of the STR genotypes (r² = 0.92 to capillary
sequencing calls) is evident from their accurate reconstruction of population relationships,
even for difficult-to-genotype mononucleotide repeats (Extended Data Fig. 2).
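
The "fraction of heterozygous positions not discovered" quoted above is a simple set comparison, and a minimal sketch may make the quantity concrete. The site representation, the function name `fraction_novel_het`, and the toy data below are assumptions made for illustration; they are not the SGDP software or file formats.

```python
from typing import Iterable, Set, Tuple

# Hypothetical site representation: (chromosome, 1-based position).
Site = Tuple[str, int]


def fraction_novel_het(het_sites: Iterable[Site], catalog: Set[Site]) -> float:
    """Fraction of a sample's heterozygous sites that are absent from a variant catalog."""
    sites = set(het_sites)
    if not sites:
        raise ValueError("sample has no heterozygous sites")
    novel = sites - catalog  # sites the catalog did not discover
    return len(novel) / len(sites)


# Toy data, invented for illustration only.
sample_hets = [("1", 100), ("1", 250), ("2", 40), ("X", 900)]
known_catalog = {("1", 100), ("2", 40), ("X", 900), ("3", 7)}

# One of the four heterozygous sites is missing from the catalog.
print(fraction_novel_het(sample_hets, known_catalog))  # 0.25
```

In a real analysis the site sets would be restricted to positions passing the sample-specific mask, so that the denominator counts only confidently called heterozygous positions.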
The structure of human genetic diversity
To obtain an overview of population relationships, we carried out ADMIXTURE [9] (Extended
Data Fig. 3) and principal component analysis [10] (Extended Data Fig. 4a). We also built
neighbor-joining trees based on pairwise divergence per nucleotide (Fig. 1a) and F_ST
(Extended Data Fig. 4b) whose topologies are consistent with previous findings that the
deepest splits among human populations are among Africans. We computed heterozygosity
– the proportion of diallelic genotypes per base pair – and recapitulate previous findings that
the highest genetic diversity is found in sub-Saharan Africa and that there is a much lower
ratio of X-to-autosome diversity in non-Africans than in Africans (Fig. 1b) [11]. A surprise is
that African “Pygmy” hunter-gatherers have reduced X-to-autosome diversity ratios relative
to all other sub-Saharan Africans. This pattern remains even after we remove the third of
chromosome X known to be subject to the strongest natural selection, suggesting that the
finding is driven by demographic history rather than by natural selection (Supplementary
Information section 7). It has been suggested that the reduced X-to-autosome heterozygosity
ratio in non-Africans is due to ongoing male-driven admixture [11,12]. Male non-Pygmy
admixture into Pygmies is well-documented [13,14], so this process could explain these
findings.
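
As a rough illustration of the two quantities discussed in this paragraph, the sketch below computes heterozygosity as heterozygous genotypes per callable base pair and then the X-to-autosome ratio. The function names and the counts are assumptions invented for the example; the toy numbers were chosen to land on 3/4, the textbook neutral expectation when equal numbers of males and females breed.

```python
def heterozygosity(het_count: int, callable_bases: int) -> float:
    """Heterozygous genotypes per callable base pair."""
    if callable_bases <= 0:
        raise ValueError("callable_bases must be positive")
    return het_count / callable_bases


def x_to_autosome_ratio(x_het: int, x_bases: int, auto_het: int, auto_bases: int) -> float:
    """Ratio of chromosome X heterozygosity to autosomal heterozygosity.

    X counts should come from female samples, since males carry a single X.
    """
    return heterozygosity(x_het, x_bases) / heterozygosity(auto_het, auto_bases)


# Toy counts, invented for illustration only.
ratio = x_to_autosome_ratio(
    x_het=75_000, x_bases=100_000_000,
    auto_het=2_000_000, auto_bases=2_000_000_000,
)
print(round(ratio, 3))
```

A ratio well below 3/4 is the kind of signal the text attributes to demographic history, for example sex-biased admixture, rather than to selection.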
Comparisons of ancient to present-day human genomes have shown that all non-Africans
today possess Neanderthal ancestry [15] with more in eastern non-Africans [16,17], and that
Australo-Melanesians and to a lesser extent other eastern non-Africans possess Denisovan
ancestry [18–20]. However, these studies only analyzed genomes from a handful of populations.
We computed statistics informative about Neanderthal and Denisovan ancestry and provide a
fine-scale view of these ancestry distributions worldwide (Fig. 1c,d; Supplementary Data Table 1;
Supplementary Information section 8). We do not detect any population with a higher
proportion of Neanderthal ancestry than is present in East Asians. However, we do find
suggestive evidence of an excess of Denisovan ancestry in some South Asians compared to
other Eurasians. This signal may not have been detected before because earlier surveys of
archaic introgression largely excluded South Asians (Fig. 1d; Supplementary Data Table 1).
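
The paragraph above does not say which statistics were computed. As a stand-in, the sketch below implements one widely used statistic of this kind, Patterson's D (the ABBA-BABA test), from derived-allele frequencies. The function name, the assumption that the outgroup is fixed for the ancestral allele, and the toy data are all illustrative choices and should not be read as the authors' method.

```python
from typing import Sequence


def d_statistic(p1: Sequence[float], p2: Sequence[float], p3: Sequence[float]) -> float:
    """Patterson's D for populations (P1, P2, P3, outgroup).

    Inputs are per-site derived-allele frequencies; the outgroup is assumed
    to carry only the ancestral allele. A positive D indicates that P2 shares
    more derived alleles with P3 than P1 does.
    """
    abba = 0.0
    baba = 0.0
    for f1, f2, f3 in zip(p1, p2, p3):
        abba += (1.0 - f1) * f2 * f3  # P2 and P3 share the derived allele
        baba += f1 * (1.0 - f2) * f3  # P1 and P3 share the derived allele
    total = abba + baba
    if total == 0:
        raise ValueError("no informative sites")
    return (abba - baba) / total


# Toy frequencies, invented for illustration only.
african = [0.0, 1.0, 0.0, 0.0]      # P1
non_african = [1.0, 0.0, 1.0, 0.0]  # P2
archaic = [1.0, 1.0, 1.0, 1.0]      # P3
print(d_statistic(african, non_african, archaic))
```

Here two sites show the ABBA pattern and one shows BABA, so D is (2 − 1) / (2 + 1) = 1/3. Real analyses also need a standard error, typically from a block jackknife, which this sketch omits.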
The time course of human population separation
We studied demographic history by leveraging the fact that variation across the genome in
divergent sites per base pair can be used to reconstruct population size changes and
separations. We used the Pairwise Sequential Markovian Coalescent (PSMC) [21] to
reconstruct population size changes, and the multiple sequentially Markovian coalescent
(MSMC) [22] to study the time course of population separations. We infer that the population
ancestral to all present-day humans began to develop substructure at least two hundred
thousand years ago (kya), which is most apparent when comparing the ancestors of some
present-day African hunter-gatherers (southern African KhoeSan and central African Mbuti
Pygmies) and other populations (Fig. 2a). However, it is also clear that this substructure
developed slowly, as all pairs of present-day populations including African hunter-gatherers
share a substantial subset of their ancestors as recently as a hundred thousand years
ago [23–26]. Quoting the time at which MSMC infers that more than 50% (25–75%) of lineages
for a pair of populations are descended from the same ancestral population, we estimate that
non-Africans separated substantially from KhoeSan 131 (82–173) kya and almost as
anciently from the Mbuti around 112 (67–171) kya. Within Africa (Fig. 2a–b), we infer that
the Yoruba separated substantially from the KhoeSan 87 (58–120) kya; from the Mbuti 56
(32–85) kya; and from the Dinka 19 (9–25) kya. We estimate a relatively rapid 21 (21–36)
kya separation of northern and southern KhoeSan [24,27] potentially reflecting isolation since
the last glacial maximum; and 38 (27–44) kya separation between western (Biaka) and
eastern (Mbuti) Pygmies, confirming very old substructure between these two central
African hunter-gatherer groups [28]. Outside Africa, the most ancient structure dates to around
50 kya (Fig. 2c) during or shortly after the deepest part of the shared non-African bottleneck
40–60 kya, consistent with the archaeological evidence of the dispersal of modern humans
into Eurasia during this period. We are not confident about the estimates of the date of
separation of Australians, New Guineans and Andamanese from other populations because
we find that these inferences change depending on the computational method we use for
phasing, likely due to these populations not being represented in the 1000 Genomes haploid
genome reference panel (Supplementary Information section 9). We caution that the date
estimates also do not take into account uncertainty about the true value of the human
mutation rate, which could plausibly be 30% higher or lower than the point estimate we
use [29].
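
The way a separation date such as "131 (82–173) kya" is read off a curve can be sketched as follows. The code assumes a hypothetical curve giving, at increasing times before present, the proportion of lineages from two populations that descend from the same ancestral population, and finds where it first crosses a threshold by linear interpolation. It also assumes, as is standard for coalescent-based dating, that an inferred date scales inversely with the mutation rate. The function names and the curve values are invented for illustration and are not MSMC output.

```python
from typing import Sequence


def crossing_time(times: Sequence[float], shared: Sequence[float], threshold: float = 0.5) -> float:
    """Time (going back from the present) at which `shared` first reaches `threshold`.

    `times` must be increasing (years ago); `shared` is the proportion of
    lineages descended from the same ancestral population at each time.
    """
    for i, s in enumerate(shared):
        if s >= threshold:
            if i == 0:
                return float(times[0])
            s_prev = shared[i - 1]
            # Linear interpolation between the two bracketing points.
            frac = (threshold - s_prev) / (s - s_prev)
            return times[i - 1] + frac * (times[i] - times[i - 1])
    raise ValueError("curve never reaches the threshold")


def rescale_date(date: float, rate_used: float, rate_alt: float) -> float:
    """Re-express a date under a different mutation rate (date is proportional to 1/rate)."""
    return date * rate_used / rate_alt


# Toy curve, invented for illustration only.
times_ya = [20_000, 50_000, 100_000, 150_000, 200_000]
shared_frac = [0.05, 0.20, 0.40, 0.60, 0.90]

point = crossing_time(times_ya, shared_frac, 0.50)
lower = crossing_time(times_ya, shared_frac, 0.25)
upper = crossing_time(times_ya, shared_frac, 0.75)
print(point, lower, upper)

# Effect of a mutation rate 30% higher or lower on the 131 kya estimate.
print(rescale_date(131, 1.0, 1.3), rescale_date(131, 1.0, 0.7))
```

Under the inverse-scaling assumption, a 131 kya point estimate would shift to roughly 101 kya with a 30% higher rate and roughly 187 kya with a 30% lower rate, which shows why the mutation-rate caveat matters for interpreting these dates.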
Early modern human dispersals contributed little to non-African populations
There is intense debate about whether present-day Australians, New Guineans and Asian
“Negrito” populations are descended from the same source population as mainland
Eurasians, or whether they also derive some ancestry from an early, independent dispersal of
modern humans into Asia [30–32]. To explore this scenario rigorously, we fit an admixture
graph [33] (a phylogenetic tree incorporating mixture events) to the allele frequency
correlations among Neanderthals, Denisovans, Upper Paleolithic Europeans, East Asians,
New Guineans, Australians, and Andamanese. We obtain a good fit to the data if we include
known Neanderthal and Denisovan introgression and model all modern human ancestry in
New Guineans, Australians and Andamanese as part of an eastern clade together with
mainland East Asians (Supplementary Information section 11; Fig. 3). Furthermore, when
we manually introduce a deeply diverging modern human lineage contributing ancestry to
Australians, New Guineans, and Andamanese (or when we repeat the analysis in a model
without Andamanese), no position or proportion of the deep lineage improves the fit. If this
putative source population branched off the main lineage leading to non-Africans more than
about 10–20 ky prior to the separation of European and East Asian ancestors, we obtain an
upper bound of a few percent for the possible contribution to Australians and New Guineans
(Fig. 3 inset; Supplementary Information section 11). These results are at odds with an
inference of substantial early dispersal ancestry in a previous analysis of an Australian
genome [32]; however, that study used a less complete model that, notably, did not include the
known Denisovan admixture into Australo-Melanesians [18]. The findings for Australians are
also unlikely to be due to some unusual feature of the individuals we sequenced, as when we
compared three different Australian samples for which there is published genome-wide data,
they are all consistent with descending from a common homogeneous population since
separation from New Guineans (Supplementary Information section 10). These results are
not in conflict with skeletal and archaeological evidence of an early modern human presence
outside of Africa [30,34], as early migrations could have occurred but not contributed
substantially to present-day populations. The possibility of populations that once flourished
but did not contribute substantially to living groups is especially plausible now that ancient
DNA from the ~45 kya Ust’-Ishim [29] and the ~40 kya Oase 1 individuals [35] has directly
documented their existence.
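
Admixture-graph fitting is considerably more involved than can be shown here, but the reason allele frequency correlations can bound a mixture proportion can be sketched with a single f4 statistic. The toy example below assumes a target population whose allele frequencies are a mixture of a hypothetical "main" source and a hypothetical "deep" source. Because f4 is linear in the target's frequencies, the mixture proportion can be recovered from three f4 values. The function name, population labels, and frequencies are invented for illustration and do not reproduce the authors' analysis.

```python
from typing import Sequence


def f4(a: Sequence[float], b: Sequence[float], c: Sequence[float], d: Sequence[float]) -> float:
    """f4(A, B; C, D): mean over sites of (a - b) * (c - d), using allele frequencies."""
    products = [(fa - fb) * (fc - fd) for fa, fb, fc, fd in zip(a, b, c, d)]
    return sum(products) / len(products)


# Toy allele frequencies at four sites, invented for illustration only.
pop_a = [0.10, 0.50, 0.90, 0.30]
pop_b = [0.20, 0.40, 0.60, 0.70]
pop_c = [0.30, 0.50, 0.80, 0.40]
main_source = [0.25, 0.45, 0.85, 0.35]  # hypothetical main non-African lineage
deep_source = [0.60, 0.10, 0.30, 0.90]  # hypothetical early-dispersal lineage

alpha = 0.03  # assumed true contribution of the deep lineage
target = [alpha * d + (1 - alpha) * m for d, m in zip(deep_source, main_source)]

f4_target = f4(pop_a, pop_b, pop_c, target)
f4_main = f4(pop_a, pop_b, pop_c, main_source)
f4_deep = f4(pop_a, pop_b, pop_c, deep_source)

# Linearity: f4_target = alpha * f4_deep + (1 - alpha) * f4_main, so solve for alpha.
alpha_hat = (f4_target - f4_main) / (f4_deep - f4_main)
print(alpha_hat)
```

With real data the sources are not observed directly and every statistic carries sampling noise, which is why the text reports an upper bound of a few percent rather than a point estimate.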

References

Fast and accurate short read alignment with Burrows–Wheeler transform
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.

PLINK: A Tool Set for Whole-Genome Association and Population-Based Linkage Analyses
TL;DR: This work introduces PLINK, an open-source C/C++ WGAS tool set, and describes the five main domains of function: data management, summary statistics, population stratification, association analysis, and identity-by-descent estimation, which focuses on the estimation and use of identity-by-state and identity-by-descent information in the context of population-based whole-genome studies.

The Genome Analysis Toolkit: A MapReduce framework for analyzing next-generation DNA sequencing data
TL;DR: The GATK programming framework enables developers and analysts to quickly and easily write efficient and robust NGS tools, many of which have already been incorporated into large-scale sequencing projects like the 1000 Genomes Project and The Cancer Genome Atlas.

An integrated map of genetic variation from 1,092 human genomes
TL;DR: It is shown that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites.

Second-generation PLINK: rising to the challenge of larger and richer datasets
TL;DR: The second-generation versions of PLINK will offer dramatic improvements in performance and compatibility, and for the first time, users without access to high-end computing resources can perform several essential analyses of the feature-rich and very large genetic datasets coming into use.