Mapping copy number variation by population-scale genome sequencing

Ryan E. Mills, +374 more
03 Feb 2011, Vol. 470, Iss. 7332, pp. 59-65
TLDR
A map of unbalanced SVs is constructed based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations, and serves as a resource for sequencing-based association studies.
Abstract
Genomic structural variants (SVs) are abundant in humans, differing from other forms of variation in extent, origin and functional impact. Despite progress in SV characterization, the nucleotide resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs (that is, copy number variants) based on whole genome DNA sequencing data from 185 human genomes, integrating evidence from complementary SV discovery approaches with extensive experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs, including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide resolution, which facilitated analysing their origin and functional impact. We examined numerous whole and partial gene deletions with a genotyping approach and observed a depletion of gene disruptions amongst high frequency deletions. Furthermore, we observed differences in the size spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV hotspots formed by common mechanisms. Our analytical framework and SV map serve as a resource for sequencing-based association studies.


Citation Mills, Ryan E., et al., "Mapping copy number variation by population-scale genome sequencing." Nature 470 (2011): p. 59-65 doi 10.1038/nature09708 ©2011 Author(s)
As Published 10.1038/nature09708
Publisher Springer Science and Business Media LLC
Version Author's final manuscript
Citable link https://hdl.handle.net/1721.1/125843
Terms of Use Creative Commons Attribution-Noncommercial-Share Alike
Detailed Terms http://creativecommons.org/licenses/by-nc-sa/4.0/

Mapping copy number variation by population scale genome
sequencing
A full list of authors and affiliations appears at the end of the article.
Summary
Genomic structural variants (SVs) are abundant in humans, differing from other variation classes
in extent, origin, and functional impact. Despite progress in SV characterization, the nucleotide
resolution architecture of most SVs remains unknown. We constructed a map of unbalanced SVs
(i.e., copy number variants) based on whole genome DNA sequencing data from 185 human
genomes, integrating evidence from complementary SV discovery approaches with extensive
experimental validations. Our map encompassed 22,025 deletions and 6,000 additional SVs,
including insertions and tandem duplications. Most SVs (53%) were mapped to nucleotide
resolution, which facilitated analyzing their origin and functional impact. We examined numerous
whole and partial gene deletions with a genotyping approach and observed a depletion of gene
disruptions amongst high frequency deletions. Furthermore, we observed differences in the size
spectra of SVs originating from distinct formation mechanisms, and constructed a map of SV
hotspots formed by common mechanisms. Our analytical framework and SV map serve as a
resource for sequencing-based association studies.
Users may view, print, copy, download and text and data- mine the content in such documents, for the purposes of academic research,
subject always to the full Conditions of use: http://www.nature.com/authors/editorial_policies/license.html#terms
@Correspondence should be addressed to jan.korbel@embl.de.
*These authors contributed equally to this work.
#Lists of participants and affiliations appear in Supplementary Information.
Author Contributions: The authors contributed to this study at different levels, as described in the following. SV discovery: K.W., C.S.,
R.H., K.C., C.A., A.A., S.C.Y., R.K.C., A.C., Y.F., I.H., F.H., Z.I., D.K., R.L., Y.L., C.L., R.L., X.J.M., H.E.P., L.D., G.T.M., J.S.,
J.W., K.Y., K.Y., E.E.E., M.B.G., M.E.H., S.A.M., and J.O.K. SV validation: R.E.M., K.W., K.C., A.A., S.C.Y., F.G., M.K.K., J.K.,
J.N., A.E.U., X.S., A.M.S., J.A.W., Y.Z., Z.Z., M.A.B., J.S., M.S., M.E.H., C.L., and J.O.K. SV genotyping: K.W., R.H., M.E.H., and
S.A.M. Data analysis: R.E.M., C.S., C.A., A.A., R.H., K.C., S.C.Y., R.K.C., A.C., D.C., Y.F., F.H., L.M.I., Z.I., J.M.K., M.K.K.,
S.K., J.K., E.K., D.K., H.Y.K.L., J.L., R.L., Y.L., C.L., R.L., X.J.M., J.N., H.E.P., T.R., A.S., X.S., M.P.S., J.A.W., J.W., Y.Z., Z.Z.,
M.A.B., L.D., G.T.M., G.M., J.S., M.S., J.W., K.Y., K.Y., E.E.E., M.B.G., M.E.H., C.L., S.A.M., and J.O.K. Preparation of
manuscript display items: R.E.M., K.W., C.S., C.A., A.A., R.H., S.C.Y., L.M.I., S.K., E.K., M.K.K., X.J.M., X.S., J.A.W., M.B.G.,
S.A.M., and J.O.K. Co-chairs of the Structural Variation Analysis group: E.E.E., M.E.H., and C.L. The following were leading
contributors to the analysis described in this paper and therefore should be considered joint first authors: R.E.M., K.W., C.S., R.H.,
K.C., C.A., A.A., S.C.Y., and K.Y. The following equally contributed to directing the described analyses and participating in the
design of the study and should be considered joint senior authors: E.E.E., M.B.G., M.E.H., C.L., S.A.M., and J.O.K. The manuscript
was written by the following authors: R.E.M. and J.O.K.
Competing interests statement: The authors declare competing financial interests. H.E.P. and Y.F. are employees of Life
Technologies, the manufacturer of the SOLiD sequencing platform. R.K.C. is an employee of Illumina Cambridge Ltd., the
manufacturer of the Illumina sequencing platform.
Data retrieval: The data sets described here can be obtained from the 1000 Genomes Project website at www.1000genomes.org (July
2010 Data Release). Individual SV discovery methods can be obtained from sources mentioned in Supplementary Table 1, or upon
request from the authors. Abbreviations used in this paper are summarized in the Supplementary Text.
HHS Public Access
Author manuscript
Nature. Author manuscript; available in PMC 2011 August 03.
Published in final edited form as:
Nature. 2011 February 3; 470(7332): 59–65. doi:10.1038/nature09708.

Introduction
Unbalanced structural variants (SVs), or copy number variants (CNVs), involving large-
scale deletions, duplications, and insertions form one of the least well studied classes of
genetic variation. The fraction of the genome affected by SVs is comparatively larger than
that accounted for by single nucleotide polymorphisms1 (SNPs), implying significant
consequences of SVs on phenotypic variation. SVs have already been associated with
diverse diseases, including autism2,3, schizophrenia4,5 and Crohn’s disease6,7.
Furthermore, locus-specific studies suggest that diverse mechanisms may form SVs de novo,
with some mechanisms involving complex rearrangements resulting in multiple
chromosomal breakpoints8,9.
Initial microarray-based SV surveys focused on large gains and losses10,11,12, with recent
advances in array technology widening the accessible size spectrum towards smaller
SVs1,13. Microarray-based surveys commonly mapped SVs to approximate genomic locations.
However, a detailed SV characterization, including analyses of SV origin and impact,
requires knowledge of precise SV sequences. Advances in sequencing technology have
enabled applying sequence-based approaches for mapping SVs at fine-
scale14,15,16,17,18,19,20,21. These approaches include: (i) paired-end mapping (or read
pair ‘RP’ analysis) based on sequencing and analysis of abnormally mapping pairs of clone
ends14,22,23,24 or high-throughput sequencing fragments15,17,18; (ii) read-depth (‘RD’)
analysis, which detects SVs by analyzing the read depth-of-coverage16,21,25,26,27; (iii)
split-read (‘SR’) analysis, which evaluates gapped sequence alignments for SV
detection28,29; and (iv) sequence assembly (‘AS’), which enables the fine-scale discovery
of SVs, including novel (non-reference) sequence insertions30,31,32. Sequence-based SV
discovery approaches have thus far been applied to a limited (<20) number of genomes,
leaving the fine-scale architecture of most common SVs unknown.
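
As an illustration of the read-pair (RP) signature described in (i), the following minimal Python sketch flags read pairs whose mapped span is much larger than the library's typical insert size, the pattern expected when a pair straddles a deletion. The function name, the median/MAD cutoff, and the toy numbers are assumptions made for illustration; they are not the parameters of any discovery method used in this study.

```python
from statistics import median

def discordant_pairs(spans, n_mads=5.0):
    """Return indices of read pairs whose mapped span is unusually large.

    spans: mapped outer distance (bp) between the two reads of each pair.
    n_mads: hypothetical cutoff, in median absolute deviations (MADs).
    """
    med = median(spans)
    # MAD is a robust spread estimate; fall back to 1.0 if there is no spread.
    mad = median(abs(s - med) for s in spans) or 1.0
    cutoff = med + n_mads * mad
    return [i for i, s in enumerate(spans) if s > cutoff]

# Toy library with ~200 bp inserts; two pairs span a putative deletion.
spans = [198, 202, 205, 195, 200, 1210, 201, 199, 1195, 203]
print(discordant_pairs(spans))  # [5, 8]
```

Real RP callers additionally cluster discordant pairs by position and orientation before reporting a candidate SV; this sketch shows only the per-pair test.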
Sequence data generated by the 1000 Genomes Project (1000GP) provide an unprecedented
opportunity to generate a comprehensive SV map. The 1000GP recently generated 4.1
Terabases of raw sequence in pilot projects targeting whole human genomes33
(Supplementary Table 1). These studies comprise a population-scale project, termed ‘low-
coverage project’, in which 179 unrelated individuals were sequenced with an average
coverage of 3.6X – including 59 Yoruba individuals from Nigeria (YRI), 60 individuals of
European ancestry from Utah (CEU), 30 of Han ancestry from Beijing (CHB), and 30 of
Japanese ancestry from Tokyo (JPT; the latter two were jointly analyzed as JPT+CHB). In
addition, a high-coverage project, termed the ‘trio project’, was carried out, with individuals
of a CEU and a YRI parent-offspring trio sequenced to 42X coverage on average.
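
The sample counts given here can be cross-checked against the 185 genomes stated in the Summary. The short Python sketch below does that arithmetic; it assumes that each parent-offspring trio consists of three individuals.

```python
# Low-coverage project sample counts as stated in the text.
low_coverage = {"YRI": 59, "CEU": 60, "CHB": 30, "JPT": 30}

# Trio project: one CEU and one YRI trio, assumed to be three individuals each.
trio_individuals = 2 * 3

n_low = sum(low_coverage.values())
n_total = n_low + trio_individuals
print(n_low, n_total)  # 179 185
```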
We report here the results of analyses undertaken by the Structural Variation Analysis
Group of the 1000GP. The group’s objectives were to discover, assemble, genotype, and
validate SVs of 50 bp and larger in size, and to assess and compare different sequence-based
SV detection approaches. The focus of the group was initially on deletions, a variant class
often associated with disease9, for which rich control datasets and diverse ascertainment
approaches exist1,13,22,28. Less focus was placed on insertions and duplications34 and
none on balanced SV forms (such as inversions). Specifically, we applied nineteen methods
to generate an SV discovery set. We further generated reference genotypes for most
deletions, assessed the SVs’ functional impact, and stratified SV formation mechanism with
respect to variant size and genomic context.
Prediction of SV candidate loci and assessment of discovery methods
We incorporated the SV discovery methods into a pipeline (Fig. 1AB), with the goal of
ascertaining different SV types and assessing each method for its ability to discover SVs.
The methods detected SVs by analyzing RD, RP, SR, and AS features, or by combining RP
and RD features (abbreviated as ‘PD’). Altogether we generated thirty-six SV callsets by
applying the methods on trio and low-coverage data, and by identifying SVs as genomic
differences relative to a human reference, corresponding to the reference genome, or to a set
of individuals (i.e. population reference; Supplementary Table 2). We initially identified
SVs as deletions, tandem duplications, novel sequence insertions, and mobile element
insertions (MEIs) relative to the human reference. Subsequent comparative analyses
involving primate genomes enabled us to classify SVs as deletions, duplications, or
insertions relative to inferred ancestral genomic loci, reflecting mechanisms of SV formation
(see below). DNA reads analyzed by SV discovery methods were initially mapped to the
human reference genome using a variety of alignment algorithms. Most of these algorithms
mapped each read to a single genomic position, although one algorithm (mrFAST16) also
considered alternative mapping positions for reads aligning onto repetitive regions (see
Supplementary Tables 2-4 for method-specific parameters and full SV callsets). We filtered
each callset by excluding SVs <50bp, which are reported elsewhere33. Many SVs exhibited
support from distinct SV discovery methods, as exemplified by a common deletion,
previously associated with body-mass index35 (BMI), that we identified with RP, RD, and
SR methods (Fig. 1C). Nonetheless, we observed notable differences between methods (Fig.
2ABC) in terms of genomic regions ascertained (Supplementary Fig. 1), accessible SV size-
range (Fig. 2A), and breakpoint precision (Fig. 2C, Supplementary Fig. 2).
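
The size filter applied to each callset can be sketched as follows. This is a minimal Python illustration, not the pipeline's code: the `SVCall` record, its field names, the half-open coordinate convention, and the example calls are all hypothetical.

```python
from dataclasses import dataclass

MIN_SV_SIZE = 50  # bp; variants below this size are reported elsewhere per the text

@dataclass(frozen=True)
class SVCall:
    chrom: str
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # exclusive
    svtype: str  # e.g. "DEL"
    method: str  # evidence type: "RP", "RD", "SR", "AS" or "PD"

    @property
    def size(self) -> int:
        return self.end - self.start

def filter_callset(calls, min_size=MIN_SV_SIZE):
    """Keep only calls that are at least min_size bp long."""
    return [c for c in calls if c.size >= min_size]

calls = [
    SVCall("chr1", 1000, 1030, "DEL", "SR"),  # 30 bp: dropped
    SVCall("chr1", 5000, 5050, "DEL", "SR"),  # 50 bp: kept
    SVCall("chr2", 8000, 9500, "DEL", "RP"),  # 1,500 bp: kept
]
kept = filter_callset(calls)
print([c.size for c in kept])  # [50, 1500]
```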
To estimate callset specificity, we carried out extensive validations (Methods), including
PCRs for over 3,000 candidate loci, and microarray data analyses for 50,000 candidate loci
(Supplementary Tables 3, 4; Supplementary Fig. 3). We combined PCR and array-based
analysis results to estimate false discovery rates (FDRs), and found that eight callsets (three
deletion, four insertion, and one tandem duplication callset) met the pre-specified specificity
threshold33 (FDR≤10%), whereas the other callsets yielded lower specificity (FDRs of
13%-89%).
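
To make the specificity criterion concrete, the sketch below computes a false discovery rate from validation tallies and compares it with the 10% threshold. It assumes the simple ratio of invalidated to conclusively tested calls; the study's actual procedure for combining PCR and array results is described in its Methods, and the callset names and counts here are hypothetical.

```python
FDR_THRESHOLD = 0.10  # pre-specified specificity threshold from the text

def estimate_fdr(n_validated, n_invalidated):
    """FDR = invalidated / (validated + invalidated) among tested calls."""
    tested = n_validated + n_invalidated
    if tested == 0:
        raise ValueError("no conclusive validation results")
    return n_invalidated / tested

# Hypothetical validation tallies per callset: (validated, invalidated).
tallies = {"callset_A": (96, 4), "callset_B": (70, 30)}
for name, (ok, bad) in tallies.items():
    fdr = estimate_fdr(ok, bad)
    status = "meets threshold" if fdr <= FDR_THRESHOLD else "below required specificity"
    print(f"{name}: FDR={fdr:.2f} ({status})")
```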
We further assessed the sensitivity of deletion discovery methods by collating data from four
earlier surveys1,13,22,28 into a gold standard (Methods, Supplementary Tables 5, 6, and
Supplementary Fig. 4A), and specifically assessing the detection sensitivity for an individual
sequenced at high-coverage (NA12878) as well as for an individual sequenced at low-
coverage (NA12156). Unsurprisingly, given the typical trade-off between sensitivity and
specificity, in the trios the highest sensitivities were achieved by RD and RP methods with
FDR>10% (Fig. 2B). By comparison, in the low-coverage data, the individual method with
the greatest accuracy (FDR=3.7%) was the second most sensitive based on our gold standard
(Fig. 2B), and the most sensitive when expanding the gold standard to a larger set of
individuals (Supplementary Fig. 4B). This method, Genome STRiP (to be described
elsewhere36), integrated both RP and RD features (PD), implying that considering different
evidence types can improve SV discovery.
Construction of a high-confidence SV discovery set
To construct our SV discovery set (“release set”), we joined calls from different discovery
methods corresponding to the same SV with a merging approach that was aware of each
callset’s precision in SV breakpoint detection (Supplementary Fig. 5 and Methods). Most
SVs in the release set (61%) were contributed by individual methods meeting the pre-
defined specificity threshold (FDR≤10%). The remaining 39% of calls were contributed by
lower specificity methods following experimental validation. Altogether, the release set
comprised 22,025 deletions, 501 tandem duplications, 5,371 MEIs, and 128 non-reference
insertions (Table 1, Supplementary Table 7). With our gold standard we estimated an overall
sensitivity of deletion discovery of 82% in the trios, and 69% in low-coverage sequence
(Fig. 2B) using a 1 bp overlap criterion. When instead applying a stringent 50% reciprocal
overlap criterion for sensitivity assessment (which required SV sizes inferred on different
experimental platforms to be in close agreement) our sensitivity estimates decreased by 12%
and 18%, respectively, in trio and low-coverage sequence (Supplementary Table 8). We
further examined an alternative approach that involved the pairwise integration of deletion
discovery methods, and tested its ability to discover SVs without relying on the inclusion of
lower specificity calls following experimental validation (“algorithm-centric set”; Fig. 1B).
While this alternative approach resulted in an increased number (by ~13%) of high-
specificity (FDR<10%) calls compared to the release set (Supplementary Text), it overall
resulted in fewer SV calls owing to its decreased sensitivity at the lower (<200bp) SV size
range. In the following analyses we thus focused on the release set.
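
The two overlap criteria used for the sensitivity estimates differ in how strictly a call must agree with a gold-standard SV. The Python sketch below contrasts them on toy data; the interval representation (half-open coordinates on a single chromosome), the function names, and the coordinates are assumptions for illustration only.

```python
def overlap_bp(a, b):
    """Overlap in bp between half-open intervals a=(start, end) and b=(start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def matches(call, gold, min_reciprocal=None):
    """True if call and gold overlap by at least 1 bp or, when min_reciprocal
    is set, if the overlap also covers that fraction of BOTH intervals."""
    ov = overlap_bp(call, gold)
    if ov < 1:
        return False
    if min_reciprocal is None:
        return True
    return (ov >= min_reciprocal * (call[1] - call[0])
            and ov >= min_reciprocal * (gold[1] - gold[0]))

def sensitivity(calls, gold_set, min_reciprocal=None):
    """Fraction of gold-standard SVs matched by at least one call."""
    hits = sum(any(matches(c, g, min_reciprocal) for c in calls) for g in gold_set)
    return hits / len(gold_set)

# Toy intervals (hypothetical coordinates).
gold = [(100, 1100), (5000, 5400), (9000, 9800)]
calls = [(150, 1050), (5300, 7000)]
print(sensitivity(calls, gold))       # 1 bp overlap criterion
print(sensitivity(calls, gold, 0.5))  # 50% reciprocal overlap criterion
```

In this toy case the second call touches a gold-standard deletion but greatly overestimates its size, so it counts under the 1 bp criterion and not under the reciprocal one. This is the kind of size disagreement that lowers the stringent estimate.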
Extent and impact of our SV discovery set
We next assessed the extent and impact of our SV discovery (release) set. The median SV
size was 729 bp (mean=8 kb), approximately four times smaller than in a recent tiling CGH
based study1, reflecting the high resolution of DNA sequence based SV discovery. We also
compared our set to a recent survey of SVs in an individual genome37 based on capillary
sequencing and array-based analyses24, and observed a similar size distribution for
deletions, but differences in the size distributions of other SV classes, reflecting underlying
differences in SV ascertainment (Supplementary Fig. 6). By comparing our SVs to databases
of structural variation and to additional personal genome datasets, we classified 15,556 SVs
in our set as novel, with an enrichment of low frequency SVs and small SVs amongst the
novel variants (Methods and Supplementary Text).
A major advantage of sequence-based SV discovery is the nucleotide resolution mapping of
SVs. We initially mapped the breakpoints of 7,066 deletions and 3,299 MEIs using SR and
AS features. Using the TIGRA-targeted assembly approach38 we further identified the
breakpoints of an additional 4,188 deletions and 160 tandem duplications, initially
discovered by RD, RP, and PD methods (Methods, Supplementary Table 2). Altogether, we
mapped ~15,000 SVs at nucleotide resolution, 48% of which were novel. Few deletion loci

Frequently Asked Questions (12)
Q1. What are the SVs classified as deletions relative to ancestral loci?

The remaining (ungrouped) SVs comprise truncated MEIs, VNTR expansion and shrinkage events, as well as NAHR-associated deletions and duplications. 

J.W.’s group was supported by the National Basic Research Program of China (973 program no. 2011CB809200), the National Natural Science Foundation of China (30725008; 30890032; 30811130531; 30221004), the Chinese 863 program (2006AA02Z177; 2006AA02Z334; 2006AA02A302; 2009AA022707), the Shenzhen Municipal Government of China (grants JC200903190767A; JC200903190772A; ZYC200903240076A; CXB200903110066A; ZYC200903240077A; ZYC200903240076A and ZYC200903240080A), and the Ole Rømer grant from the Danish Natural Science Research Council. 

Sequence and structural variation in a human genome uncovered by short-read, massively parallel ligation sequencing using two-base encoding. 

Initial genotype likelihoods were derived with a Bayesian model and imputation into a SNP genotype reference panel from the HapMap41 (Hapmap3r2) was achieved with Beagle (v3.1; http://faculty.washington.edu/browning/beagle/beagle.html).

SV formation mechanism analysis
SV breakpoints mapped at nucleotide resolution were analyzed with BreakSeq43 to classify SVs relative to putative ancestral loci and to infer SV formation mechanisms. 

The authors thank the Genome Structural Variation Consortium (http://www.sanger.ac.uk/humgen/cnv/42mio/) and the International HapMap Consortium for making available microarray data. 

Colored bars depict numbers of SV hotspots in which at least 50% of the variants were inferred to be formed by a single formation mechanism. 

The authors acknowledge the individuals participating in the 1000 Genomes Project by providing samples, including The Yoruba people of Ibadan, Nigeria, the community at Beijing Normal University, the people of Tokyo, Japan, and the people of the Utah CEPH community. 

Ye K, Schulz MH, Long Q, Apweiler R, Ning Z. Pindel: a pattern growth approach to detect break points of large deletions and medium sized insertions from paired-end short reads. 

Analysis of deletion presence and absence in two populations A-C. Deletion allele frequencies and observed sharing of alleles across populations, displayed for deletions discovered in the CEU, YRI, and JPT+CHB population samples in terms of stacked bars. 

D. Allele frequency spectra for deletions intersecting with intergenic (blue), intronic (yellow), and protein-coding sequences (red). 

The ellipses indicate MEIs, i.e., Alu (~300 bp) and L1 (~6 kb) insertions, associated with target site duplications of up to 28 bp in size at the breakpoints. 

Three groups are visible, with AS and SR, PD and RP, as well as RD and ‘RL’ (RP analysis involving relatively long range (≥1 kb) insert size libraries, resulting in a different deletion detection size range compared to the predominantly used <500 bp insert size libraries), respectively, ascertaining similar size-ranges.