Journal ArticleDOI

A global reference for human genetic variation.

Adam Auton1, Gonçalo R. Abecasis2, David Altshuler3, Richard Durbin4  +514 moreInstitutions (90)
01 Oct 2015-Nature (Nature Publishing Group)-Vol. 526, Iss: 7571, pp 68-74
TL;DR: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations, and has reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping.
Abstract: The 1000 Genomes Project set out to provide a comprehensive description of common human genetic variation by applying whole-genome sequencing to a diverse set of individuals from multiple populations. Here we report completion of the project, having reconstructed the genomes of 2,504 individuals from 26 populations using a combination of low-coverage whole-genome sequencing, deep exome sequencing, and dense microarray genotyping. We characterized a broad spectrum of genetic variation, in total over 88 million variants (84.7 million single nucleotide polymorphisms (SNPs), 3.6 million short insertions/deletions (indels), and 60,000 structural variants), all phased onto high-quality haplotypes. This resource includes >99% of SNP variants with a frequency of >1% for a variety of ancestries. We describe the distribution of genetic variation across the global sample, and discuss the implications for common disease studies.
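The release call sets are distributed in the VCF format described in the References below. As a rough sketch of how the variant classes summarized in the abstract could be tallied from such a file, the following Python snippet uses the pysam library; the file name is a placeholder, and multi-allelic records are classified by their first alternate allele only, so this is a simplification rather than the project's own pipeline.

```python
import pysam  # Python bindings to htslib; assumed available

# Tally variant classes from a 1000 Genomes-style VCF/BCF file.
# "release.vcf.gz" is a placeholder path, not an actual project file name.
counts = {"SNP": 0, "indel": 0, "symbolic/other": 0}
with pysam.VariantFile("release.vcf.gz") as vcf:
    for rec in vcf:
        ref = rec.ref
        alt = (rec.alts or ("",))[0]           # simplification: first ALT only
        if len(ref) == 1 and len(alt) == 1:
            counts["SNP"] += 1
        elif alt and not alt.startswith("<"):  # plain sequence ALT of differing length
            counts["indel"] += 1
        else:                                  # symbolic ALTs such as <DEL>, <CN0>, ...
            counts["symbolic/other"] += 1
print(counts)
```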
Citations
Proceedings ArticleDOI
01 Jun 2019
TL;DR: Transfer learning as discussed by the authors is a set of methods that extend the classical supervised machine learning paradigm by leveraging data from additional domains or tasks to train models with better generalization properties, and these methods have significantly improved the state of the art on a wide range of NLP tasks.
Abstract: The classic supervised machine learning paradigm is based on learning in isolation, a single predictive model for a task using a single dataset. This approach requires a large number of training examples and performs best for well-defined and narrow tasks. Transfer learning refers to a set of methods that extend this approach by leveraging data from additional domains or tasks to train a model with better generalization properties. Over the last two years, the field of Natural Language Processing (NLP) has witnessed the emergence of several transfer learning methods and architectures which significantly improved upon the state-of-the-art on a wide range of NLP tasks. These improvements together with the wide availability and ease of integration of these methods are reminiscent of the factors that led to the success of pretrained word embeddings and ImageNet pretraining in computer vision, and indicate that these methods will likely become a common tool in the NLP landscape as well as an important research direction. We will present an overview of modern transfer learning methods in NLP, how models are pre-trained, what information the representations they learn capture, and review examples and case studies on how these models can be integrated and adapted in downstream NLP tasks.
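As a concrete, minimal illustration of the pre-train-then-adapt recipe described above, the sketch below fine-tunes a generic pretrained transformer on a toy sentence-classification task. The Hugging Face transformers library, the bert-base-uncased checkpoint, and the two example sentences are assumptions for illustration, not details taken from the tutorial itself.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load a generic pretrained encoder and attach a fresh 2-class classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["the plot was gripping", "a dull and predictable film"]   # toy downstream data
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# One fine-tuning step: the pretrained weights are adapted to the downstream task.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
optimizer.zero_grad()
outputs = model(**batch, labels=labels)   # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```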

267 citations

Journal ArticleDOI
Hugh McColl1, Fernando Racimo1, Lasse Vinner1, Fabrice Demeter2, Takashi Gakuhari3, Takashi Gakuhari4, J. Víctor Moreno-Mayar1, George van Driem5, George van Driem6, Uffe Gram Wilken1, Andaine Seguin-Orlando7, Andaine Seguin-Orlando1, Constanza de la Fuente Castro1, Sally Wasef8, Rasmi Shoocongdej9, Viengkeo Souksavatdy, Thongsa Sayavongkhamdy, Mokhtar Saidin10, Morten E. Allentoft1, Takehiro Sato3, Anna-Sapfo Malaspinas11, Farhang Aghakhanian12, Thorfinn Sand Korneliussen1, Ana Prohaska13, Ashot Margaryan14, Ashot Margaryan2, Peter de Barros Damgaard1, Supannee Kaewsutthi15, Patcharee Lertrit15, Thi Mai Huong Nguyen, Hsiao-chun Hung16, Thi Minh Tran, Huu Nghia Truong, Giang Hai Nguyen, Shaiful Shahidan10, Ketut Wiradnyana, Hiromi Matsumae4, Nobuo Shigehara17, Minoru Yoneda18, Hajime Ishida19, Tadayuki Masuyama, Yasuhiro Yamada20, Atsushi Tajima3, Hiroki Shibata21, Atsushi Toyoda22, Tsunehiko Hanihara4, Shigeki Nakagome23, Thibaut Devièse24, Anne-Marie Bacon25, Philippe Duringer26, Jean Luc Ponche26, Laura L. Shackelford27, Elise Patole-Edoumba1, Anh Nguyen, Bérénice Bellina-Pryce28, Jean Christophe Galipaud29, Rebecca Kinaston30, Rebecca Kinaston31, Hallie R. Buckley30, Christophe Pottier32, Silas Anselm Rasmussen33, Thomas Higham24, Robert Foley13, Marta Mirazón Lahr13, Ludovic Orlando1, Ludovic Orlando7, Martin Sikora1, Maude E. Phipps12, Hiroki Oota4, Charles Higham13, Charles Higham30, David M. Lambert8, Eske Willerslev34, Eske Willerslev1, Eske Willerslev13 
06 Jul 2018-Science
TL;DR: Neither interpretation fits the complexity of Southeast Asian history: Both Hòabìnhian hunter-gatherers and East Asian farmers contributed to current Southeast Asian diversity, with further migrations affecting island SEA and Vietnam.
Abstract: The human occupation history of Southeast Asia (SEA) remains heavily debated. Current evidence suggests that SEA was occupied by Hòabìnhian hunter-gatherers until ~4000 years ago, when farming economies developed and expanded, restricting foraging groups to remote habitats. Some argue that agricultural development was indigenous; others favor the "two-layer" hypothesis that posits a southward expansion of farmers giving rise to present-day Southeast Asian genetic diversity. By sequencing 26 ancient human genomes (25 from SEA, 1 Japanese Jōmon), we show that neither interpretation fits the complexity of Southeast Asian history: Both Hòabìnhian hunter-gatherers and East Asian farmers contributed to current Southeast Asian diversity, with further migrations affecting island SEA and Vietnam. Our results help resolve one of the long-standing controversies in Southeast Asian prehistory.

265 citations

Journal ArticleDOI
TL;DR: This work characterizes spindles in 11,630 individuals aged 4 to 97 years, as a prelude to future genetic studies, and identifies previously unappreciated correlates of spindle activity, including confounding by body mass index mediated by cardiac interference in the EEG.
Abstract: Sleep spindles are characteristic electroencephalogram (EEG) signatures of stage 2 non-rapid eye movement sleep. Implicated in sleep regulation and cognitive functioning, spindles may represent heritable biomarkers of neuropsychiatric disease. Here we characterize spindles in 11,630 individuals aged 4 to 97 years, as a prelude to future genetic studies. Spindle properties are highly reliable but exhibit distinct developmental trajectories. Across the night, we observe complex patterns of age- and frequency-dependent dynamics, including signatures of circadian modulation. We identify previously unappreciated correlates of spindle activity, including confounding by body mass index mediated by cardiac interference in the EEG. After taking account of these confounds, genetic factors significantly contribute to spindle and spectral sleep traits. Finally, we consider topographical differences and critical measurement issues. Taken together, our findings will lead to an increased understanding of the genetic architecture of sleep spindles and their relation to behavioural and health outcomes, including neuropsychiatric disorders. Sleep patterns vary and are associated with health and disease. Here Purcell et al. characterize sleep spindle activity in 11,630 individuals and describe age-related changes, genetic influences, and possible confounding effects, serving as a resource for further understanding the physiology of sleep.
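The abstract does not spell out the detection pipeline, but a common baseline heuristic for spindle detection is to band-pass the EEG in the sigma band and threshold the amplitude envelope. The sketch below illustrates only that generic heuristic; the frequency band, threshold, and minimum duration are illustrative defaults, not the authors' settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def detect_spindles(eeg, fs, lo=11.0, hi=16.0, thresh_sd=2.0, min_dur=0.5):
    """Generic sigma-band threshold detector; all parameters are illustrative."""
    # Band-pass the EEG into the sigma band typically associated with spindles.
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    sigma = filtfilt(b, a, eeg)
    # Amplitude envelope via the Hilbert transform.
    env = np.abs(hilbert(sigma))
    # Mark samples whose envelope exceeds mean + thresh_sd standard deviations.
    above = env > env.mean() + thresh_sd * env.std()
    # Keep runs above threshold lasting at least min_dur seconds.
    events, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if (i - start) / fs >= min_dur:
                events.append((start / fs, i / fs))
            start = None
    return events  # list of (onset_s, offset_s); a trailing open run is ignored

# Example on 30 s of synthetic noise sampled at 256 Hz (no real spindles expected).
print(detect_spindles(np.random.randn(30 * 256), fs=256))
```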

265 citations

Posted ContentDOI
25 Jan 2017-bioRxiv
TL;DR: This work develops and applies an approach that uses stratified LD score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue, and demonstrates that this polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signal.
Abstract: Genetics can provide a systematic approach to discovering the tissues and cell types relevant for a complex disease or trait. Identifying these tissues and cell types is critical for following up on non-coding allelic function, developing ex-vivo models, and identifying therapeutic targets. Here, we analyze gene expression data from several sources, including the GTEx and PsychENCODE consortia, together with genome-wide association study (GWAS) summary statistics for 48 diseases and traits with an average sample size of 86,850, to identify disease-relevant tissues and cell types. We develop and apply an approach that uses stratified LD score regression to test whether disease heritability is enriched in regions surrounding genes with the highest specific expression in a given tissue. We detect tissue-specific enrichments at FDR < 5% for 30 diseases and traits across a broad range of tissues that recapitulate known biology. In our analysis of traits with observed central nervous system enrichment, we detect an enrichment of neurons over other brain cell types for several brain-related traits, enrichment of inhibitory neurons over excitatory neurons for bipolar disorder, and enrichments in the cortex for schizophrenia and in the striatum for migraine. In our analysis of traits with observed immunological enrichment, we identify enrichments of alpha beta T cells for asthma and eczema, B cells for primary biliary cirrhosis, and myeloid cells for lupus and Alzheimer's disease. Our results demonstrate that our polygenic approach is a powerful way to leverage gene expression data for interpreting GWAS signal.
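In simplified form, stratified LD score regression models the expected GWAS chi-square statistic of a SNP as a linear function of its LD scores with respect to a set of annotations. The sketch below shows that core regression with plain, unweighted least squares; the real ldsc software adds regression weights, block-jackknife standard errors, and a full baseline model, so treat this as an illustration of the idea rather than the authors' implementation.

```python
import numpy as np

def partitioned_ldsc(chisq, ld_scores, n_samples):
    """Simplified (unweighted) stratified LD score regression.

    chisq     : (M,) GWAS chi-square statistics, one per SNP.
    ld_scores : (M, C) LD scores of each SNP with respect to C annotations,
                e.g. one column for SNPs near genes specifically expressed in a
                given tissue plus baseline annotation columns.
    n_samples : GWAS sample size N.

    Under the stratified model, E[chi2_j] ~= N * sum_c tau_c * l(j, c) + intercept,
    so regressing chi2 on the LD score columns estimates the per-annotation
    coefficients tau_c; a positive tau for the tissue annotation indicates
    heritability enrichment around that tissue's specifically expressed genes.
    """
    X = np.column_stack([n_samples * ld_scores, np.ones(len(chisq))])  # slopes + intercept
    coef, *_ = np.linalg.lstsq(X, chisq, rcond=None)
    return coef[:-1], coef[-1]   # (tau per annotation, intercept)

# Toy usage with random inputs, just to show the shapes involved.
rng = np.random.default_rng(0)
tau, intercept = partitioned_ldsc(rng.chisquare(1, 1000), rng.random((1000, 3)), 50_000)
print(tau, intercept)
```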

264 citations

Journal ArticleDOI
TL;DR: This work surveys various projects underway to build and apply graph-based structures, collectively referred to as genome graphs, and discusses the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
Abstract: The human reference genome is part of the foundation of modern human biology and a monumental scientific achievement. However, because it excludes a great deal of common human variation, it introduces a pervasive reference bias into the field of human genomics. To reduce this bias, it makes sense to draw on representative collections of human genomes, brought together into reference cohorts. There are a number of techniques to represent and organize data gleaned from these cohorts, many using ideas implicitly or explicitly borrowed from graph-based models. Here, we survey various projects underway to build and apply these graph-based structures, which we collectively refer to as genome graphs, and discuss the improvements in read mapping, variant calling, and haplotype determination that genome graphs are expected to produce.
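To make the genome-graph idea concrete, the toy sketch below represents a single SNP site as two alternative branches in a small sequence graph and spells out the sequence of a chosen path (one haplotype). The node sequences and numbering are invented for illustration and do not correspond to any particular graph toolkit.

```python
# A toy sequence graph: nodes hold DNA segments, edges connect them, and a
# haplotype is a path through the graph. The reference and an alternate allele
# become alternative branches instead of being forced onto a single sequence.
graph = {
    "nodes": {1: "ACGT", 2: "G", 3: "T", 4: "TTCA"},   # nodes 2 vs 3 form a SNP site
    "edges": {(1, 2), (1, 3), (2, 4), (3, 4)},
}

def path_sequence(graph, path):
    """Spell out the sequence of a walk (e.g. one haplotype) through the graph."""
    for a, b in zip(path, path[1:]):
        assert (a, b) in graph["edges"], f"no edge {a}->{b}"
    return "".join(graph["nodes"][n] for n in path)

print(path_sequence(graph, [1, 2, 4]))   # reference haplotype: ACGTGTTCA
print(path_sequence(graph, [1, 3, 4]))   # alternate haplotype: ACGTTTTCA
```

Mapping a read against both branches, rather than against the reference branch alone, is what reduces the reference bias discussed in the abstract.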

264 citations


Cites background from "A global reference for human geneti..."

  • ...Reference allele bias is the tendency to underreport data whose underlying DNA does not match a reference allele (Degner et al. 2009; Brandt et al. 2015)....

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.
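The MSP score rewards long, high-identity ungapped segments. A toy seed-and-extend routine in that spirit appears below; the flat match/mismatch scores, right-only extension, and drop-off cutoff are simplifications for illustration, not the actual BLAST algorithm or its parameters.

```python
def best_segment_pair(query, subject, k=3, match=2, mismatch=-1, dropoff=5):
    """Toy seed-and-extend: hash exact k-mers of the subject, locate matching
    seeds from the query, extend each hit rightwards without gaps, and keep the
    highest-scoring segment pair seen (a rough stand-in for an MSP)."""
    index = {}
    for j in range(len(subject) - k + 1):
        index.setdefault(subject[j:j + k], []).append(j)

    best_score, best_hit = 0, None
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], []):
            score = running = k * match
            length = k
            qi, sj = i + k, j + k
            while qi < len(query) and sj < len(subject):
                running += match if query[qi] == subject[sj] else mismatch
                qi, sj = qi + 1, sj + 1
                if running > score:
                    score, length = running, qi - i
                elif score - running > dropoff:   # X-drop style cutoff
                    break
            if score > best_score:
                best_score, best_hit = score, (i, j, length)
    return best_score, best_hit   # (score, (query_start, subject_start, length))

print(best_segment_pair("ACGTTGCA", "TTACGTTGGA"))
```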

88,255 citations

Journal ArticleDOI
TL;DR: SAMtools as discussed by the authors implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments.
Abstract: Summary: The Sequence Alignment/Map (SAM) format is a generic alignment format for storing read alignments against reference sequences, supporting short and long reads (up to 128 Mbp) produced by different sequencing platforms. It is flexible in style, compact in size, efficient in random access and is the format in which alignments from the 1000 Genomes Project are released. SAMtools implements various utilities for post-processing alignments in the SAM format, such as indexing, variant caller and alignment viewer, and thus provides universal tools for processing read alignments. Availability: http://samtools.sourceforge.net Contact: [email protected]
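For example, summarising the alignments in one region of an indexed BAM file might look like the sketch below, using pysam (a commonly used Python wrapper around the same htslib/samtools code); the file name, region, and MAPQ cutoff are placeholders.

```python
import pysam  # assumed available; wraps htslib/samtools functionality

# Summarise alignments in a region of a coordinate-sorted, indexed BAM file.
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    total = confident = 0
    for read in bam.fetch("chr20", 1_000_000, 1_010_000):  # requires a .bai index
        total += 1
        if not read.is_unmapped and read.mapping_quality >= 20:
            confident += 1
print(f"{confident} of {total} reads in the region aligned with MAPQ >= 20")
```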

45,957 citations

Journal ArticleDOI
TL;DR: A new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format, which allows the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks.
Abstract: Motivation: Testing for correlations between different sets of genomic features is a fundamental task in genomics research. However, searching for overlaps between features with existing web-based methods is complicated by the massive datasets that are routinely produced with current sequencing technologies. Fast and flexible tools are therefore required to ask complex questions of these data in an efficient manner. Results: This article introduces a new software suite for the comparison, manipulation and annotation of genomic features in Browser Extensible Data (BED) and General Feature Format (GFF) format. BEDTools also supports the comparison of sequence alignments in BAM format to both BED and GFF features. The tools are extremely efficient and allow the user to compare large datasets (e.g. next-generation sequencing data) with both public and custom genome annotation tracks. BEDTools can be combined with one another as well as with standard UNIX commands, thus facilitating routine genomics tasks as well as pipelines that can quickly answer intricate questions of large genomic datasets. Availability and implementation: BEDTools was written in C++. Source code and a comprehensive user manual are freely available at http://code.google.com/p/bedtools
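The core operation being described is interval intersection. The dependency-free sketch below shows what an intersect of two sets of BED-style (0-based, half-open) intervals computes; the example coordinates are invented, and the quadratic per-chromosome scan is for clarity rather than the indexed algorithms BEDTools uses on large files.

```python
# Minimal illustration of the "intersect" idea on BED-style intervals,
# independent of the BEDTools binaries themselves.
def intersect(features_a, features_b):
    """Yield (chrom, start, end) overlaps between two lists of BED intervals."""
    by_chrom = {}
    for chrom, start, end in features_b:
        by_chrom.setdefault(chrom, []).append((start, end))
    for chrom, a_start, a_end in features_a:
        for b_start, b_end in by_chrom.get(chrom, []):
            lo, hi = max(a_start, b_start), min(a_end, b_end)
            if lo < hi:                      # non-empty overlap
                yield chrom, lo, hi

peaks = [("chr1", 100, 200), ("chr2", 50, 80)]
genes = [("chr1", 150, 400), ("chr2", 300, 500)]
print(list(intersect(peaks, genes)))         # [("chr1", 150, 200)]
```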

18,858 citations

Journal ArticleDOI
06 Sep 2012-Nature
TL;DR: The Encyclopedia of DNA Elements project provides new insights into the organization and regulation of human genes and the genome, and is an expansive resource of functional annotations for biomedical research.
Abstract: The human genome encodes the blueprint of life, but the function of the vast majority of its nearly three billion bases is unknown. The Encyclopedia of DNA Elements (ENCODE) project has systematically mapped regions of transcription, transcription factor association, chromatin structure and histone modification. These data enabled us to assign biochemical functions for 80% of the genome, in particular outside of the well-studied protein-coding regions. Many discovered candidate regulatory elements are physically associated with one another and with expressed genes, providing new insights into the mechanisms of gene regulation. The newly identified elements also show a statistical correspondence to sequence variants linked to human disease, and can thereby guide interpretation of this variation. Overall, the project provides new insights into the organization and regulation of our genes and genome, and is an expansive resource of functional annotations for biomedical research.

13,548 citations

Journal ArticleDOI
TL;DR: VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API.
Abstract: Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API. Availability: http://vcftools.sourceforge.net Contact: [email protected]
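To show the record layout these utilities operate on, here is a minimal reader for the fixed VCF columns; the path is a placeholder, and genotype (FORMAT/sample) columns are ignored, so this is a format illustration rather than a substitute for VCFtools.

```python
import gzip

def vcf_records(path):
    """Yield the fixed fields (CHROM, POS, ID, REF, ALT, FILTER, INFO) of a
    bgzip/gzip-compressed VCF, skipping meta-information and header lines."""
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):                  # "##" meta lines and "#CHROM" header
                continue
            chrom, pos, vid, ref, alt, qual, flt, info = line.rstrip("\n").split("\t")[:8]
            info_map = dict(kv.split("=", 1) if "=" in kv else (kv, True)
                            for kv in info.split(";"))
            yield chrom, int(pos), vid, ref, alt.split(","), flt, info_map

# Example: print passing sites from a placeholder file.
for chrom, pos, vid, ref, alts, flt, info in vcf_records("calls.vcf.gz"):
    if flt in ("PASS", "."):
        print(chrom, pos, ref, alts)
```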

10,164 citations