Journal Article

An integrated map of structural variation in 2,504 human genomes

Peter H. Sudmant, Tobias Rausch, Eugene J. Gardner, Robert E. Handsaker, Alexej Abyzov, John Huddleston, Yan Zhang, Kai Ye, Goo Jun, Markus Hsi-Yang Fritz, Miriam K. Konkel, Ankit Malhotra, Adrian M. Stütz, Xinghua Shi, Francesco Paolo Casale, Jieming Chen, Fereydoun Hormozdiari, Gargi Dayama, Ken Chen, Maika Malig, Mark Chaisson, Klaudia Walter, Sascha Meiers, Seva Kashin, Erik Garrison, Adam Auton, Hugo Y. K. Lam, Xinmeng Jasmine Mu, Can Alkan, Danny Antaki, Taejeong Bae, Eliza Cerveira, Peter S. Chines, Zechen Chong, Laura Clarke, Elif Dal, Li Ding, S. Emery, Xian Fan, Madhusudan Gujral, Fatma Kahveci, Jeffrey M. Kidd, Yu Kong, Eric-Wubbo Lameijer, Shane A. McCarthy, Paul Flicek, Richard A. Gibbs, Gabor T. Marth, Christopher E. Mason, Androniki Menelaou, Donna M. Muzny, Bradley J. Nelson, Amina Noor, Nicholas F. Parrish, Matthew Pendleton, Andrew Quitadamo, Benjamin Raeder, Eric E. Schadt, Mallory Romanovitch, Andreas Schlattl, Robert Sebra, Andrey A. Shabalin, Andreas Untergasser, Jerilyn A. Walker, Min Wang, Fuli Yu, Chengsheng Zhang, Jing Zhang, Xiangqun Zheng-Bradley, Wanding Zhou, Thomas Zichner, Jonathan Sebat, Mark A. Batzer, Steven A. McCarroll, Ryan E. Mills, Mark Gerstein, Ali Bashir, Oliver Stegle, Scott E. Devine, Charles Lee, Evan E. Eichler, Jan O. Korbel
01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 75-81
TL;DR: The authors describe an integrated set of eight structural variant classes, comprising both balanced and unbalanced variants, constructed from short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations.
Abstract: Structural variants are implicated in numerous diseases and make up the majority of varying nucleotides among human genomes. Here we describe an integrated set of eight structural variant classes comprising both balanced and unbalanced variants, which we constructed using short-read DNA sequencing data and statistically phased onto haplotype blocks in 26 human populations. Analysing this set, we identify numerous gene-intersecting structural variants exhibiting population stratification and describe naturally occurring homozygous gene knockouts that suggest the dispensability of a variety of human genes. We demonstrate that structural variants are enriched on haplotypes identified by genome-wide association studies and exhibit enrichment for expression quantitative trait loci. Additionally, we uncover appreciable levels of structural variant complexity at different scales, including genic loci subject to clusters of repeated rearrangement and complex structural variants with multiple breakpoints likely to have formed through individual mutational events. Our catalogue will enhance future studies into structural variant demography, functional impact and disease association.


Citations
Journal Article
Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, +514 more (90 institutions)
01 Oct 2015-Nature
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations

Journal Article
TL;DR: These and other strategies are providing researchers and clinicians a variety of tools to probe genomes in greater depth, leading to an enhanced understanding of how genome sequence variants underlie phenotype and disease.
Abstract: Since the completion of the human genome project in 2003, extraordinary progress has been made in genome sequencing technologies, which has led to a decreased cost per megabase and an increase in the number and diversity of sequenced genomes. An astonishing complexity of genome architecture has been revealed, bringing these sequencing technologies to even greater advancements. Some approaches maximize the number of bases sequenced in the least amount of time, generating a wealth of data that can be used to understand increasingly complex phenotypes. Alternatively, other approaches now aim to sequence longer contiguous pieces of DNA, which are essential for resolving structurally complex regions. These and other strategies are providing researchers and clinicians a variety of tools to probe genomes in greater depth, leading to an enhanced understanding of how genome sequence variants underlie phenotype and disease.

3,096 citations

Journal Article
Peter J. Campbell, Gad Getz, Jan O. Korbel, Joshua M. Stuart, +1329 more (238 institutions)
06 Feb 2020-Nature
TL;DR: The flagship paper of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium describes the generation of the integrative analyses of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types, the structures for international data sharing and standardized analyses, and the main scientific findings from across the consortium studies.
Abstract: Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale [1–3]. Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4–5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter [4]; identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation [5,6]; analyses timings and patterns of tumour evolution [7]; describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity [8,9]; and evaluates a range of more-specialized features of cancer genomes [8,10–18].

1,600 citations

Journal Article
Ditte Demontis, Raymond K. Walters, Joanna Martin, Manuel Mattheisen, Thomas Damm Als, Esben Agerbo, Gisli Baldursson, Rich Belliveau, Jonas Bybjerg-Grauholm, Marie Bækvad-Hansen, Felecia Cerrato, Kimberly Chambert, Claire Churchhouse, Ashley Dumont, Nicholas Eriksson, Michael J. Gandal, Jacqueline I. Goldstein, Katrina L. Grasby, Jakob Grove, Olafur O Gudmundsson, Christine Søholm Hansen, Mads E. Hauberg, Mads V. Hollegaard, Daniel P. Howrigan, Hailiang Huang, Julian Maller, Alicia R. Martin, Nicholas G. Martin, Jennifer L. Moran, Jonatan Pallesen, Duncan S. Palmer, Carsten Bøcker Pedersen, Marianne Giørtz Pedersen, Timothy Poterba, Jesper Buchhave Poulsen, Stephan Ripke, Elise B. Robinson, F. Kyle Satterstrom, Hreinn Stefansson, Christine Stevens, Patrick Turley, G. Bragi Walters, Hyejung Won, Margaret J. Wright, Ole A. Andreassen, Philip Asherson, Christie L. Burton, Dorret I. Boomsma, Bru Cormand, Søren Dalsgaard, Barbara Franke, Joel Gelernter, Daniel H. Geschwind, Hakon Hakonarson, Jan Haavik, Henry R. Kranzler, Jonna Kuntsi, Kate Langley, Klaus-Peter Lesch, Christel M. Middeldorp, Andreas Reif, Luis Augusto Rohde, Panos Roussos, Russell Schachar, Pamela Sklar, Edmund J.S. Sonuga-Barke, Patrick F. Sullivan, Anita Thapar, Joyce Y. Tung, Irwin D. Waldman, Sarah E. Medland, Kari Stefansson, Merete Nordentoft, David M. Hougaard, Thomas Werge, Ole Mors, Preben Bo Mortensen, Mark J. Daly, Stephen V. Faraone, Anders D. Børglum, Benjamin M. Neale
TL;DR: A genome-wide association meta-analysis of 20,183 individuals diagnosed with ADHD and 35,191 controls identifies variants surpassing genome-wide significance in 12 independent loci and implicates neurodevelopmental pathways and conserved regions of the genome as being involved in underlying ADHD biology.
Abstract: Attention deficit/hyperactivity disorder (ADHD) is a highly heritable childhood behavioral disorder affecting 5% of children and 2.5% of adults. Common genetic variants contribute substantially to ADHD susceptibility, but no variants have been robustly associated with ADHD. We report a genome-wide association meta-analysis of 20,183 individuals diagnosed with ADHD and 35,191 controls that identifies variants surpassing genome-wide significance in 12 independent loci, finding important new information about the underlying biology of ADHD. Associations are enriched in evolutionarily constrained genomic regions and loss-of-function intolerant genes and around brain-expressed regulatory marks. Analyses of three replication studies: a cohort of individuals diagnosed with ADHD, a self-reported ADHD sample and a meta-analysis of quantitative measures of ADHD symptoms in the population, support these findings while highlighting study-specific differences on genetic overlap with educational attainment. Strong concordance with GWAS of quantitative population measures of ADHD symptoms supports that clinical diagnosis of ADHD is an extreme expression of continuous heritable traits.

1,436 citations

Journal Article
TL;DR: NGMLR and Sniffles perform highly accurate alignment and structural variation detection from long-read sequencing data and can automatically filter false events and operate on low-coverage data, thereby reducing the high costs that have hindered the application of long reads in clinical and research settings.
Abstract: Structural variations are the greatest source of genetic variation, but they remain poorly understood because of technological limitations. Single-molecule long-read sequencing has the potential to dramatically advance the field, although high error rates are a challenge with existing methods. Addressing this need, we introduce open-source methods for long-read alignment (NGMLR; https://github.com/philres/ngmlr) and structural variant identification (Sniffles; https://github.com/fritzsedlazeck/Sniffles) that provide unprecedented sensitivity and precision for variant detection, even in repeat-rich regions and for complex nested events that can have substantial effects on human health. In several long-read datasets, including healthy and cancerous human genomes, we discovered thousands of novel variants and categorized systematic errors in short-read approaches. NGMLR and Sniffles can automatically filter false events and operate on low-coverage data, thereby reducing the high costs that have hindered the application of long reads in clinical and research settings.

1,058 citations

References
Journal Article
TL;DR: Burrows-Wheeler Alignment tool (BWA) is implemented, a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps.
Abstract: Motivation: The enormous amount of short reads generated by the new DNA sequencing technologies call for the development of fast and accurate read alignment programs. A first generation of hash table-based methods has been developed, including MAQ, which is accurate, feature rich and fast enough to align short reads from a single individual. However, MAQ does not support gapped alignment for single-end reads, which makes it unsuitable for alignment of longer reads where indels may occur frequently. The speed of MAQ is also a concern when the alignment is scaled up to the resequencing of hundreds of individuals. Results: We implemented Burrows-Wheeler Alignment tool (BWA), a new read alignment package that is based on backward search with Burrows–Wheeler Transform (BWT), to efficiently align short sequencing reads against a large reference sequence such as the human genome, allowing mismatches and gaps. BWA supports both base space reads, e.g. from Illumina sequencing machines, and color space reads from AB SOLiD machines. Evaluations on both simulated and real data suggest that BWA is ~10–20× faster than MAQ, while achieving similar accuracy. In addition, BWA outputs alignment in the new standard SAM (Sequence Alignment/Map) format. Variant calling and other downstream analyses after the alignment can be achieved with the open source SAMtools software package. Availability: http://maq.sourceforge.net Contact: [email protected]

43,862 citations
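
The BWA abstract above names backward search on a Burrows–Wheeler Transform as its core technique. The following is a minimal sketch of that idea only, restricted to exact matching; it is not BWA's implementation, which also handles mismatches and gaps and uses compressed index structures. The function names `bwt_index` and `backward_search`, the `$` sentinel, and the naive suffix sort are all assumptions made for illustration.

```python
from collections import Counter


def bwt_index(text):
    """Build the suffix array and lookup tables that backward search needs."""
    # Sentinel: assumed absent from text and smaller than every other symbol.
    text += "$"
    # Naive suffix sort: fine for a toy example, far too slow for a genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The BWT is the character preceding each sorted suffix (wraps at i == 0).
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first_row[c]: number of characters in text strictly smaller than c.
    counts = Counter(text)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return suffix_array, first_row, occ


def backward_search(pattern, suffix_array, first_row, occ):
    """Return sorted start positions of exact occurrences of pattern."""
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    # Extend the match one character at a time, from the pattern's last
    # character to its first, narrowing the suffix-array range each step.
    for c in reversed(pattern):
        if c not in first_row:
            return []
        lo = first_row[c] + occ[c][lo]
        hi = first_row[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


index = bwt_index("ACGTACGTAC")
hits = backward_search("AC", *index)
```

Each step of the loop costs two table lookups regardless of reference length, which is why the search time depends on the read length rather than on the genome size once the index is built.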

Journal Article
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the authors' genes and genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal Article
Adam Auton, Gonçalo R. Abecasis, David Altshuler, Richard Durbin, +514 more (90 institutions)
01 Oct 2015-Nature
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.

12,661 citations

Journal Article
01 Jan 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of the authors' genes and genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

8,106 citations

Journal Article
01 Nov 2012-Nature
TL;DR: It is shown that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites.
Abstract: By characterizing the geographic and functional spectrum of human genetic variation, the 1000 Genomes Project aims to build a resource to help to understand the genetic contribution to disease. Here we describe the genomes of 1,092 individuals from 14 populations, constructed using a combination of low-coverage whole-genome and exome sequencing. By developing methods to integrate information across several algorithms and diverse data sources, we provide a validated haplotype map of 38 million single nucleotide polymorphisms, 1.4 million short insertions and deletions, and more than 14,000 larger deletions. We show that individuals from different populations carry different profiles of rare and common variants, and that low-frequency variants show substantial geographic differentiation, which is further increased by the action of purifying selection. We show that evolutionary conservation and coding consequence are key determinants of the strength of purifying selection, that rare-variant load varies substantially across biological pathways, and that each individual contains hundreds of rare non-coding variants at conserved sites, such as motif-disrupting changes in transcription-factor-binding sites. This resource, which captures up to 98% of accessible single nucleotide polymorphisms at a frequency of 1% in related populations, enables analysis of common and low-frequency variants in individuals from diverse, including admixed, populations.

7,710 citations

Related Papers (5)
18 Aug 2016-Nature
Monkol Lek, Konrad J. Karczewski, Eric Vallabh Minikel, Kaitlin E. Samocha, Eric Banks, Timothy Fennell, Anne H. O’Donnell-Luria, James S. Ware, Andrew J. Hill, Beryl B. Cummings, Taru Tukiainen, Daniel P. Birnbaum, Jack A. Kosmicki, Laramie E. Duncan, Karol Estrada, Fengmei Zhao, James Zou, Emma Pierce-Hoffman, Joanne Berghout, David Neil Cooper, Nicole A. Deflaux, Mark A. DePristo, Ron Do, Jason Flannick, Menachem Fromer, Laura D. Gauthier, Jackie Goldstein, Namrata Gupta, Daniel P. Howrigan, Adam Kiezun, Mitja I. Kurki, Ami Levy Moonshine, Pradeep Natarajan, Lorena Orozco, Gina M. Peloso, Ryan Poplin, Manuel A. Rivas, Valentin Ruano-Rubio, Samuel A. Rose, Douglas M. Ruderfer, Khalid Shakir, Peter D. Stenson, Christine Stevens, Brett Thomas, Grace Tiao, María Teresa Tusié-Luna, Ben Weisburd, Hong-Hee Won, Dongmei Yu, David Altshuler, Diego Ardissino, Michael Boehnke, John Danesh, Stacey Donnelly, Roberto Elosua, Jose C. Florez, Stacey Gabriel, Gad Getz, Stephen J. Glatt, Christina M. Hultman, Sekar Kathiresan, Markku Laakso, Steven A. McCarroll, Mark I. McCarthy, Dermot P.B. McGovern, Ruth McPherson, Benjamin M. Neale, Aarno Palotie, Shaun Purcell, Danish Saleheen, Jeremiah M. Scharf, Pamela Sklar, Patrick F. Sullivan, Jaakko Tuomilehto, Ming T. Tsuang, Hugh Watkins, James G. Wilson, Mark J. Daly, Daniel G. MacArthur