
Showing papers by "Richard Durbin" published in 2017


Journal Article
TL;DR: It is asserted that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote the understanding of human biology and advance the efforts to improve health.
Abstract: The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009; it reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures, and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions, and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that although the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

643 citations


Journal Article
15 Jun 2017-Nature
TL;DR: This study outlines the major sources of genetic and phenotypic variation in iPS cells and establishes their suitability as models of complex human traits and cancer.
Abstract: Technology utilizing human induced pluripotent stem cells (iPS cells) has enormous potential to provide improved cellular models of human disease. However, variable genetic and phenotypic characterization of many existing iPS cell lines limits their potential use for research and therapy. Here we describe the systematic generation, genotyping and phenotyping of 711 iPS cell lines derived from 301 healthy individuals by the Human Induced Pluripotent Stem Cells Initiative. Our study outlines the major sources of genetic and phenotypic variation in iPS cells and establishes their suitability as models of complex human traits and cancer. Through genome-wide profiling we find that 5-46% of the variation in different iPS cell phenotypes, including differentiation capacity and cellular morphology, arises from differences between individuals. Additionally, we assess the phenotypic consequences of genomic copy-number alterations that are repeatedly observed in iPS cells. In addition, we present a comprehensive map of common regulatory variants affecting the transcriptome of human pluripotent cells.

462 citations


Journal Article
TL;DR: Long-read sequencing is used to generate end-to-end genome assemblies for 12 strains representing major subpopulations of the partially domesticated yeast Saccharomyces cerevisiae and its wild relative Saccharomyces paradoxus, enabling precise definition of chromosomal boundaries between cores and subtelomeres and a high-resolution view of evolutionary genome dynamics.
Abstract: Structural rearrangements have long been recognized as an important source of genetic variation, with implications in phenotypic diversity and disease, yet their detailed evolutionary dynamics remain elusive. Here we use long-read sequencing to generate end-to-end genome assemblies for 12 strains representing major subpopulations of the partially domesticated yeast Saccharomyces cerevisiae and its wild relative Saccharomyces paradoxus. These population-level high-quality genomes with comprehensive annotation enable precise definition of chromosomal boundaries between cores and subtelomeres and a high-resolution view of evolutionary genome dynamics. In chromosomal cores, S. paradoxus shows faster accumulation of balanced rearrangements (inversions, reciprocal translocations and transpositions), whereas S. cerevisiae accumulates unbalanced rearrangements (novel insertions, deletions and duplications) more rapidly. In subtelomeres, both species show extensive interchromosomal reshuffling, with a higher tempo in S. cerevisiae. Such striking contrasts between wild and domesticated yeasts are likely to reflect the influence of human activities on structural genome evolution.

293 citations


Journal Article
TL;DR: This paper re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq.
Abstract: Long-read sequencing technologies such as Pacific Biosciences and Oxford Nanopore MinION are capable of producing long sequencing reads with average fragment lengths of over 10,000 base-pairs and maximum lengths reaching 100,000 base-pairs. Compared with short reads, the assemblies obtained from long-read sequencing platforms have much higher contig continuity and genome completeness as long fragments are able to extend paths into problematic or repetitive regions. Many successful assembly applications of the Pacific Biosciences technology have been reported ranging from small bacterial genomes to large plant and animal genomes. Recently, genome assemblies using Oxford Nanopore MinION data have attracted much attention due to the portability and low cost of this novel sequencing instrument. In this paper, we re-sequenced a well characterized genome, the Saccharomyces cerevisiae S288C strain using three different platforms: MinION, PacBio and MiSeq. We present a comprehensive metric comparison of assemblies generated by various pipelines and discuss how the platform associated data characteristics affect the assembly quality. With a given read depth of 31X, the assemblies from both Pacific Biosciences and Oxford Nanopore MinION show excellent continuity and completeness for the 16 nuclear chromosomes, but not for the mitochondrial genome, whose reconstruction still represents a significant challenge.

141 citations


Journal Article
Ioanna Tachmazidou, Daniel Suveges, Josine L. Min, Graham R. S. Ritchie, Julia Steinberg, Klaudia Walter, Valentina Iotchkova, Jeremy Schwartzentruber, Jie Huang, Yasin Memari, Shane A. McCarthy, Andrew A Crawford, Cristina Bombieri, Massimiliano Cocca, Aliki-Eleni Farmaki, Tom R. Gaunt, Pekka Jousilahti, Marjolein N. Kooijman, Benjamin Lehne, Giovanni Malerba, Satu Männistö, Angela Matchan, Carolina Medina-Gomez, Sarah Metrustry, Abhishek Nag, Ioanna Ntalla, Lavinia Paternoster, Nigel W. Rayner, Cinzia Sala, William R. Scott, Hashem A. Shihab, Lorraine Southam, Beate St Pourcain, Michela Traglia, Katerina Trajanoska, Gialuigi Zaza, Weihua Zhang, María Soler Artigas, Narinder Bansal, Marianne Benn, Zhongsheng Chen, Petr Danecek, Wei-Yu Lin, Adam E. Locke, Jian'an Luan, Alisa K. Manning, Antonella Mulas, Carlo Sidore, Anne Tybjærg-Hansen, Anette Varbo, Magdalena Zoledziewska, Chris Finan, Konstantinos Hatzikotoulas, Audrey E. Hendricks, John P. Kemp, Alireza Moayyeri, Kalliope Panoutsopoulou, Michal Szpak, Scott Wilson, Michael Boehnke, Francesco Cucca, Emanuele Di Angelantonio, Claudia Langenberg, Cecilia M. Lindgren, Mark I. McCarthy, Andrew P. Morris, Børge G. Nordestgaard, Robert A. Scott, Martin D. Tobin, Nicholas J. Wareham, Paul Burton, John C. Chambers, George Davey Smith, George Dedoussis, Janine F. Felix, Oscar H. Franco, Giovanni Gambaro, Paolo Gasparini, Christopher J Hammond, Albert Hofman, Vincent W. V. Jaddoe, Marcus E. Kleber, Jaspal S. Kooner, Markus Perola, Caroline L Relton, Susan M. Ring, Fernando Rivadeneira, Veikko Salomaa, Tim D. Spector, Oliver Stegle, Daniela Toniolo, André G. Uitterlinden, Inês Barroso, Celia M. T. Greenwood, John R. B. Perry, Brian R. Walker, Adam S. Butterworth, Yali Xue, Richard Durbin, Kerrin S. Small, Nicole Soranzo, Nicholas J. Timpson, Eleftheria Zeggini
TL;DR: This work applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader allelic architecture of 12 anthropometric traits associated with height, body mass, and fat distribution in up to 267,616 individuals to report 106 genome-wide significant signals that have not been previously identified.
Abstract: Deep sequence-based imputation can enhance the discovery power of genome-wide association studies by assessing previously unexplored variation across the common- and low-frequency spectra. We applied a hybrid whole-genome sequencing (WGS) and deep imputation approach to examine the broader allelic architecture of 12 anthropometric traits associated with height, body mass, and fat distribution in up to 267,616 individuals. We report 106 genome-wide significant signals that have not been previously identified, including 9 low-frequency variants pointing to functional candidates. Of the 106 signals, 6 are in genomic regions that have not been implicated with related traits before, 28 are independent signals at previously reported regions, and 72 represent previously reported signals for a different anthropometric trait. 71% of signals reside within genes and fine mapping resolves 23 signals to one or two likely causal variants. We confirm genetic overlap between human monogenic and polygenic anthropometric traits and find signal enrichment in cis expression QTLs in relevant tissues. Our results highlight the potential of WGS strategies to enhance biologically relevant discoveries across the frequency spectrum.

121 citations


Journal Article
TL;DR: A multi-generational estimate from the autozygous segment in a non-European population that gives insight into the contribution of post-zygotic mutations and population-specific mutational processes is presented.
Abstract: Heterozygous mutations within homozygous sequences descended from a recent common ancestor offer a way to ascertain de novo mutations across multiple generations. Using exome sequences from 3222 British-Pakistani individuals with high parental relatedness, we estimate a mutation rate of 1.45 ± 0.05 × 10⁻⁸ per base pair per generation in autosomal coding sequence, with a corresponding non-crossover gene conversion rate of 8.75 ± 0.05 × 10⁻⁶ per base pair per generation. This is at the lower end of exome mutation rates previously estimated in parent–offspring trios, suggesting that post-zygotic mutations contribute little to the human germ-line mutation rate. We find frequent recurrence of mutations at polymorphic CpG sites, and an increase in C to T mutations in a 5ʹ CCG 3ʹ to 5ʹ CTG 3ʹ context in the Pakistani population compared to Europeans, suggesting that mutational processes have evolved rapidly between human populations. Estimates of human mutation rates differ substantially based on the approach. Here, the authors present a multi-generational estimate from the autozygous segment in a non-European population that gives insight into the contribution of post-zygotic mutations and population-specific mutational processes.

92 citations


Journal Article
14 Jun 2017
TL;DR: This study finds no evidence for a systematic reduction in recombination rates in F1 hybrids and no evidence for inversions longer than 50 kb that might be involved in generating or maintaining species barriers, suggesting that mechanisms leading to global or local reduction of recombination do not play a significant role in the maintenance of species barriers between H. melpomene and H. cydno.
Abstract: Mechanisms that suppress recombination are known to help maintain species barriers by preventing the breakup of coadapted gene combinations. The sympatric butterfly species Heliconius melpomene and Heliconius cydno are separated by many strong barriers, but the species still hybridize infrequently in the wild, and around 40% of the genome is influenced by introgression. We tested the hypothesis that genetic barriers between the species are maintained by inversions or other mechanisms that reduce between-species recombination rate. We constructed fine-scale recombination maps for Panamanian populations of both species and their hybrids to directly measure recombination rate within and between species, and generated long sequence reads to detect inversions. We find no evidence for a systematic reduction in recombination rates in F1 hybrids, and also no evidence for inversions longer than 50 kb that might be involved in generating or maintaining species barriers. This suggests that mechanisms leading to global or local reduction in recombination do not play a significant role in the maintenance of species barriers between H. melpomene and H. cydno.

85 citations


Journal Article
TL;DR: A significant depletion of variants in the rare frequency spectrum was observed in Finns compared with Britons, while loss-of-function variants and non-coding variants in conserved regions and promoters were proportionally enriched in Finns; these functional categories represent the highest a priori power for downstream association studies of rare variants using population isolates.
Abstract: Isolated populations with enrichment of variants due to recent population bottlenecks provide a powerful resource for identifying disease-associated genetic variants and genes. As a model of an isolate population, we sequenced the genomes of 1463 Finnish individuals as part of the Sequencing Initiative Suomi (SISu) Project. We compared the genomic profiles of the 1463 Finns to a sample of 1463 British individuals that were sequenced in parallel as part of the UK10K Project. Whereas there were no major differences in the allele frequency of common variants, a significant depletion of variants in the rare frequency spectrum was observed in Finns when comparing the two populations. On the other hand, we observed >2.1 million variants that were twice as frequent among Finns compared with Britons and 800 000 variants that were more than 10 times more frequent in Finns. Furthermore, in Finns we observed a relative proportional enrichment of variants in the minor allele frequency range between 2 and 5% (P < 2.2 × 10⁻¹⁶). When stratified by their functional annotations, loss-of-function variants showed the highest proportional enrichment in Finns (P=0.0291). In the non-coding part of the genome, variants in conserved regions (P=0.002) and promoters (P=0.01) were also significantly enriched in the Finnish samples. These functional categories represent the highest a priori power for downstream association studies of rare variants using population isolates.

65 citations


Journal Article
TL;DR: An isolation-index (Isx) is developed that predicts the overall level of such key genetic characteristics and can thus help guide population choice in future complex-trait association studies.
Abstract: Isolated populations often have special genetic compositions that can be leveraged for genetic association studies. Here, Xue and colleagues generate and analyse 3,059 low-depth whole-genome sequences from eight European isolated populations and two matched general popula…

63 citations


Journal Article
14 Jun 2017-Nature
TL;DR: This corrects the article DOI: 10.1038/nature22403 to note that Fiona M. Watt and Richard Durbin were also jointly supervising authors and that Oliver Stegle and Daniel J. Gaffney were also equally contributing authors.
Abstract: Nature 546, 370–375 (2017); doi:10.1038/nature22403 In this Article, the authors Fiona M. Watt and Richard Durbin should also have been included as ‘jointly supervising’ authors, and authors Oliver Stegle and Daniel J. Gaffney should also have been noted as ‘equally contributing’ authors. In addition, the Author Contributions section should have included the sentence: ‘H.K. and A.G. contributed equally to this work; O.S. and D.J.G. contributed equally to this work’, as further clarification. The original Article has been corrected online.

29 citations


Journal Article
TL;DR: No single non-synonymous (NS) single nucleotide variant (SNV) nor any gene carrying a higher burden of NS SNVs was significantly associated with PD status after multiple-testing correction, but genes whose proteins have roles in the extracellular matrix were significantly enriched amongst the top 300 genes with the most significantly associated NS SNVs.
Abstract: Parkinson’s disease (PD) is the most common neurodegenerative movement disorder, affecting 1% of the population over 65 years, and is characterized clinically by both motor and non-motor symptoms accompanied by the preferential loss of dopamine neurons in the substantia nigra pars compacta. Here, we sequenced the exomes of 244 Parkinson’s patients selected from the Oxford Parkinson’s Disease Centre Discovery Cohort and, after quality control, 228 exomes were available for analyses. The PD patient exomes were compared to 884 control exomes selected from the UK10K datasets. No single non-synonymous (NS) single nucleotide variant (SNV) nor any gene carrying a higher burden of NS SNVs was significantly associated with PD status after multiple-testing correction. However, significant enrichments of genes whose proteins have roles in the extracellular matrix were amongst the top 300 genes with the most significantly associated NS SNVs, while regions associated with PD by a recent Genome Wide Association (GWA) study were enriched in genes containing PD-associated NS SNVs. By examining genes within GWA regions possessing rare PD-associated SNVs, we identified RAD51B. The protein-product of RAD51B interacts with that of its paralogue RAD51, which is associated with congenital mirror movements, a phenotype also comorbid with PD.

Posted Content
31 May 2017-bioRxiv
TL;DR: The genomes of 73 cichlid fish species from Lake Malawi uncover evolutionary processes underlying a large adaptive evolutionary radiation, and enhance the understanding of genomic processes underlying rapid species diversification.
Abstract: The hundreds of cichlid fish species in Lake Malawi constitute the most extensive recent vertebrate adaptive radiation. Here we characterize its genomic diversity by sequencing 134 individuals covering 73 species across all major lineages. Average sequence divergence between species pairs is only 0.1-0.25%. These divergence values overlap diversity within species, with 82% of heterozygosity shared between species. Phylogenetic analyses suggest that diversification initially proceeded by serial branching from a generalist Astatotilapia-like ancestor. However, no single species tree adequately represents all species relationships, with evidence for substantial gene flow at multiple times. Common signatures of selection on visual and oxygen transport genes shared by distantly related deep water species point to both adaptive introgression and independent selection. These findings enhance our understanding of genomic processes underlying rapid species diversification, and provide a platform for future genetic analysis of the Malawi radiation.

Journal Article
TL;DR: The concept of a population BWT is introduced and used to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project and it is shown that the vast majority of variant alleles can be uniquely described by overlapping 31-mers.
Abstract: We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows–Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2705 samples from the 1000 Genomes Project. A key feature is that, as more genomes are added, identical read sequences are increasingly observed, and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37 with 13.7 Mbp added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out nonreference queries to search for the presence of all known viral genomes and discover human T-lymphotropic virus 1 integrations in six samples in a recognized epidemiological distribution.
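To make the indexing idea in the abstract above concrete, here is a minimal sketch of a BWT with FM-index backward search over a small collection of reads. It is an illustration only, not the paper's implementation: the function names (`bwt_of_reads`, `build_fm_index`, `count_occurrences`) are hypothetical, the transform is built by sorting rotations (workable only at toy scale), and each read is assumed to be terminated by a `$` symbol so that matches cannot span two reads.

```python
from collections import Counter
from typing import Dict, List, Tuple


def bwt_of_reads(reads: List[str]) -> str:
    """BWT of a read collection via sorted rotations (toy scale only)."""
    # Assumption: every read ends with "$", which sorts before A, C, G, T.
    text = "".join(read + "$" for read in reads)
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_fm_index(bwt: str) -> Tuple[Dict[str, int], Dict[str, List[int]]]:
    """Return (first, occ).

    first[c]  = number of symbols in the text that sort before c
    occ[c][i] = number of occurrences of c in bwt[:i]
    """
    counts = Counter(bwt)
    first: Dict[str, int] = {}
    total = 0
    for symbol in sorted(counts):
        first[symbol] = total
        total += counts[symbol]
    occ: Dict[str, List[int]] = {symbol: [0] for symbol in counts}
    for char in bwt:
        for symbol, column in occ.items():
            column.append(column[-1] + (symbol == char))
    return first, occ


def count_occurrences(pattern: str, first: Dict[str, int],
                      occ: Dict[str, List[int]], n: int) -> int:
    """Count occurrences of pattern across all reads by backward search."""
    lo, hi = 0, n  # half-open range of sorted rotations matching the suffix
    for symbol in reversed(pattern):
        if symbol not in first:
            return 0
        lo = first[symbol] + occ[symbol][lo]
        hi = first[symbol] + occ[symbol][hi]
        if lo >= hi:
            return 0
    return hi - lo


reads = ["ACGT", "ACGA", "CGTA"]
bwt = bwt_of_reads(reads)
first, occ = build_fm_index(bwt)
print(count_occurrences("ACG", first, occ, len(bwt)))  # 2
print(count_occurrences("TT", first, occ, len(bwt)))   # 0
```

The same backward search applies to a k-mer of any length, so a query such as a 31-mer spanning a variant allele would be counted in the same way; a production index would use a compressed, sampled occurrence table rather than the full table built here.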

Posted Content
15 Dec 2017-bioRxiv
TL;DR: The vg toolkit provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays, with improved accuracy over alignment to a linear reference, and creates data structures to support downstream variant calling and genotyping.
Abstract: Reference genomes guide our interpretation of DNA sequence data. However, conventional linear references are fundamentally limited in that they represent only one version of each locus, whereas the population may contain multiple variants. When the reference represents an individual’s genome poorly, it can impact read mapping and introduce bias. Variation graphs are bidirected DNA sequence graphs that compactly represent genetic variation, including large scale structural variation such as inversions and duplications [1]. Equivalent structures are produced by de novo genome assemblers [2,3]. Here we present vg, a toolkit of computational methods for creating, manipulating, and utilizing these structures as references at the scale of the human genome. vg provides an efficient approach to mapping reads onto arbitrary variation graphs using generalized compressed suffix arrays [4], with improved accuracy over alignment to a linear reference, creating data structures to support downstream variant calling and genotyping. These capabilities make using variation graphs as reference structures for DNA sequencing practical at the scale of vertebrate genomes, or at the topological complexity of new species assemblies.
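As a rough illustration of how a variation graph represents more than one version of a locus, the sketch below encodes a single SNP as a "bubble" of two alternative nodes and spells out the sequence of every path through it. This is a simplified assumption-laden toy, not vg's data model or API: the graph is directed and acyclic rather than bidirected, and the names `nodes`, `edges`, and `spell_paths` are hypothetical.

```python
from typing import Dict, List

# Toy graph: node id -> sequence label. Nodes 2 and 3 are the two alleles
# of one SNP; nodes 1 and 4 are the shared flanking sequence.
nodes: Dict[int, str] = {1: "ACG", 2: "T", 3: "C", 4: "GGA"}

# Directed edges only (a simplification: real variation graphs are
# bidirected, so both strands and inversions can be represented).
edges: Dict[int, List[int]] = {1: [2, 3], 2: [4], 3: [4], 4: []}


def spell_paths(nodes: Dict[int, str], edges: Dict[int, List[int]],
                start: int, end: int) -> List[str]:
    """Return the sequence spelled by every path from start to end."""
    spelled: List[str] = []

    def walk(node: int, prefix: str) -> None:
        sequence = prefix + nodes[node]
        if node == end:
            spelled.append(sequence)
            return
        for successor in edges[node]:
            walk(successor, sequence)

    walk(start, "")
    return spelled


print(spell_paths(nodes, edges, 1, 4))  # ['ACGTGGA', 'ACGCGGA']
```

A read carrying either allele matches one of the two paths exactly, whereas a linear reference containing only one allele would force a mismatch for the other; this is the reference bias the abstract refers to.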