
Showing papers by "Richard Durbin" published in 2016


Journal ArticleDOI
Shane A. McCarthy, Sayantan Das, Warren W. Kretzschmar, Olivier Delaneau, Andrew R. Wood, Alexander Teumer, Hyun Min Kang, Christian Fuchsberger, Petr Danecek, Kevin Sharp, Yang Luo, C Sidore, Alan Kwong, Nicholas J. Timpson, Seppo Koskinen, Scott I. Vrieze, Laura J. Scott, He Zhang, Anubha Mahajan, Jan H. Veldink, Ulrike Peters, Carlos N. Pato, Cornelia M. van Duijn, Christopher E. Gillies, Ilaria Gandin, Massimo Mezzavilla, Arthur Gilly, Massimiliano Cocca, Michela Traglia, Andrea Angius, Jeffrey C. Barrett, D.I. Boomsma, Kari Branham, Gerome Breen, Chad M. Brummett, Fabio Busonero, Harry Campbell, Andrew T. Chan, Sai Chen, Emily Y. Chew, Francis S. Collins, Laura J Corbin, George Davey Smith, George Dedoussis, Marcus Dörr, Aliki-Eleni Farmaki, Luigi Ferrucci, Lukas Forer, Ross M. Fraser, Stacey Gabriel, Shawn Levy, Leif Groop, Tabitha A. Harrison, Andrew T. Hattersley, Oddgeir L. Holmen, Kristian Hveem, Matthias Kretzler, James Lee, Matt McGue, Thomas Meitinger, David Melzer, Josine L. Min, Karen L. Mohlke, John B. Vincent, Matthias Nauck, Deborah A. Nickerson, Aarno Palotie, Michele T. Pato, Nicola Pirastu, Melvin G. McInnis, J. Brent Richards, Cinzia Sala, Veikko Salomaa, David Schlessinger, Sebastian Schoenherr, P. Eline Slagboom, Kerrin S. Small, Tim D. Spector, Dwight Stambolian, Marcus A. Tuke, Jaakko Tuomilehto, Leonard H. van den Berg, Wouter van Rheenen, Uwe Völker, Cisca Wijmenga, Daniela Toniolo, Eleftheria Zeggini, Paolo Gasparini, Matthew G. Sampson, James F. Wilson, Timothy M. Frayling, Paul I.W. de Bakker, Morris A. Swertz, Steven A. McCarroll, Charles Kooperberg, Annelot M. Dekker, David Altshuler, Cristen J. Willer, William G. Iacono, Samuli Ripatti, Nicole Soranzo, Klaudia Walter, Anand Swaroop, Francesco Cucca, Carl A. Anderson, Richard M. Myers, Michael Boehnke, Mark I. McCarthy, Richard Durbin, Gonçalo R. Abecasis, Jonathan Marchini
TL;DR: A reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies.
Abstract: We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

2,149 citations


Shane A. McCarthy, Sayantan Das, Warren W. Kretzschmar, Olivier Delaneau, Andrew R. Wood, Alexander Teumer, Hyun Min Kang, Christian Fuchsberger, Petr Danecek, Kevin Sharp, Yang Luo, Carlo Sidore, Alan Kwong, Nicholas J. Timpson, Seppo Koskinen, Scott I. Vrieze, Laura J. Scott, He Zhang, Anubha Mahajan, Jan H. Veldink, Ulrike Peters, Carlos N. Pato, Cornelia M. van Duijn, Christopher E. Gillies, Ilaria Gandin, Massimo Mezzavilla, Arthur Gilly, Massimiliano Cocca, Michela Traglia, Andrea Angius, Jeffrey C. Barrett, D.I. Boomsma, Kari Branham, Gerome Breen, Chad M. Brummett, Fabio Busonero, Harry Campbell, Andrew T. Chan, Sai Chen, Emily Y. Chew, Francis S. Collins, Laura J Corbin, George Davey Smith, George Dedoussis, Marcus Dörr, Aliki-Eleni Farmaki, Luigi Ferrucci, Lukas Forer, Ross M. Fraser, Stacey Gabriel, Shawn Levy, Leif Groop, Tabitha A. Harrison, Andrew T. Hattersley, Oddgeir L. Holmen, Kristian Hveem, Matthias Kretzler, James Lee, Matt McGue, Thomas Meitinger, David Melzer, Josine L. Min, Karen L. Mohlke, John B. Vincent, Matthias Nauck, Deborah A. Nickerson, Aarno Palotie, Michele T. Pato, Nicola Pirastu, Melvin G. McInnis, J. Brent Richards, Cinzia Sala, Veikko Salomaa, David Schlessinger, Sebastian Schoenherr, P. Eline Slagboom, Kerrin S. Small, Tim D. Spector, Dwight Stambolian, Marcus A. Tuke, Jaakko Tuomilehto, Leonard H. van den Berg, Wouter van Rheenen, Uwe Völker, Cisca Wijmenga, Daniela Toniolo, Eleftheria Zeggini, Paolo Gasparini, Matthew G. Sampson, James F. Wilson, Timothy M. Frayling, Paul I.W. de Bakker, Morris A. Swertz, Steven A. McCarroll, Charles Kooperberg, Annelot M. Dekker, David Altshuler, Cristen J. Willer, William G. Iacono, Samuli Ripatti, Nicole Soranzo, Klaudia Walter, Anand Swaroop, Francesco Cucca, Carl A. Anderson, Richard M. Myers, Michael Boehnke, Mark I. McCarthy, Richard Durbin, Gonçalo R. Abecasis, Jonathan Marchini
01 Jan 2016
TL;DR: In this article, a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry is presented.
Abstract: We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.

1,261 citations


Journal ArticleDOI
TL;DR: A new phasing algorithm, Eagle2, is introduced that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform.
Abstract: Po-Ru Loh, Alkes Price and colleagues present Eagle2, a reference-based phasing algorithm that allows for highly accurate and efficient phasing of genotypes across a broad range of cohort sizes. They demonstrate an approximately 10% improvement in accuracy and 20× improvement in speed compared to a competing method, SHAPEIT2.

1,246 citations


Journal ArticleDOI
TL;DR: BCFtools/RoH, an extension to the BCFtools software package, detects regions of autozygosity in sequencing data, in particular exome data, using a hidden Markov model; it is shown to have higher sensitivity and specificity than existing methods under a range of sequencing error rates and levels of autozygosity.
Abstract: Summary: Runs of homozygosity (RoHs) are genomic stretches of a diploid genome that show identical alleles on both chromosomes. Longer RoHs are unlikely to have arisen by chance but are likely to denote autozygosity, whereby both copies of the genome descend from the same recent ancestor. Early tools to detect RoH used genotype array data, but substantially more information is available from sequencing data. Here, we present and evaluate BCFtools/RoH, an extension to the BCFtools software package, that detects regions of autozygosity in sequencing data, in particular exome data, using a hidden Markov model. By applying it to simulated data and real data from the 1000 Genomes Project we estimate its accuracy and show that it has higher sensitivity and specificity than existing methods under a range of sequencing error rates and levels of autozygosity. Availability and implementation: BCFtools/RoH and its associated binary/source files are freely available from https://github.com/samtools/BCFtools. Contact: vn2@sanger.ac.uk or pd3@sanger.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.

452 citations
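The abstract above names a hidden Markov model but does not spell out its states or parameters, so the following is only a toy sketch of the general idea rather than the BCFtools/RoH model. It assumes two states (autozygous and non-autozygous), treats each site as simply heterozygous or not, and uses made-up emission and switch probabilities; the function name `viterbi_roh` and all default values are hypothetical.

```python
import math


def viterbi_roh(is_het, p_het_hw=0.3, p_het_az=0.01, p_switch=0.05):
    """Label each site as autozygous (True) or not (False) with Viterbi decoding.

    is_het: list of bools, True where the genotype call is heterozygous.
    All probabilities are illustrative assumptions, not BCFtools defaults.
    """
    if not is_het:
        return []
    states = (0, 1)  # 0 = non-autozygous, 1 = autozygous
    p_het = (p_het_hw, p_het_az)

    def emit(state, het):
        # Log-probability of observing a het / non-het call in this state.
        p = p_het[state]
        return math.log(p if het else 1.0 - p)

    log_stay = math.log(1.0 - p_switch)
    log_switch = math.log(p_switch)

    # Initialise with equal prior probability for both states.
    score = [math.log(0.5) + emit(s, is_het[0]) for s in states]
    back = []
    for het in is_het[1:]:
        new_score, pointers = [], []
        for s in states:
            cands = [
                score[prev] + (log_stay if prev == s else log_switch)
                for prev in states
            ]
            best_prev = max(states, key=lambda prev: cands[prev])
            new_score.append(cands[best_prev] + emit(s, het))
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)

    # Trace back the most likely state path.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [s == 1 for s in path]
```

In this sketch a long stretch of non-heterozygous calls is labelled autozygous because the small per-site gain in emission probability eventually outweighs the cost of switching state twice; isolated homozygous calls between heterozygous ones are not.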


Journal ArticleDOI
Anna-Sapfo Malaspinas, Michael C. Westaway, Craig Muller, Vitor C. Sousa, Oscar Lao, Isabel Alves, Anders Bergström, Georgios Athanasiadis, Jade Yu Cheng, Jacob E. Crawford, Tim H. Heupink, Enrico Macholdt, Stephan Peischl, Simon Rasmussen, Stephan Schiffels, Sankar Subramanian, Joanne L. Wright, Anders Albrechtsen, Chiara Barbieri, Isabelle Dupanloup, Anders Eriksson, Ashot Margaryan, Ida Moltke, Irina Pugach, Thorfinn Sand Korneliussen, Ivan P. Levkivskyi, J. Víctor Moreno-Mayar, Shengyu Ni, Fernando Racimo, Martin Sikora, Yali Xue, Farhang Aghakhanian, Nicolas Brucato, Søren Brunak, Paula F. Campos, Warren Clark, Sturla Ellingvåg, Gudjugudju Fourmile, Pascale Gerbault, Darren Injie, George Koki, Matthew Leavesley, Betty Logan, Aubrey Lynch, Elizabeth Matisoo-Smith, Peter McAllister, Alexander J. Mentzer, Mait Metspalu, Andrea Bamberg Migliano, Les Murgha, Maude E. Phipps, William Pomat, Doc Reynolds, François-Xavier Ricaut, Peter Siba, Mark G. Thomas, Thomas Wales, Colleen Ma Run Wall, Stephen Oppenheimer, Chris Tyler-Smith, Richard Durbin, Joe Dortch, Andrea Manica, Mikkel H. Schierup, Robert Foley, Marta Mirazón Lahr, Claire Bowern, Jeffrey D. Wall, Thomas Mailund, Mark Stoneking, Rasmus Nielsen, Manjinder S. Sandhu, Laurent Excoffier, David M. Lambert, Eske Willerslev
13 Oct 2016-Nature
TL;DR: A population expansion in northeast Australia during the Holocene epoch associated with limited gene flow from this region to the rest of Australia, consistent with the spread of the Pama–Nyungan languages is inferred.
Abstract: The population history of Aboriginal Australians remains largely uncharacterized. Here we generate high-coverage genomes for 83 Aboriginal Australians (speakers of Pama–Nyungan languages) and 25 Papuans from the New Guinea Highlands. We find that Papuan and Aboriginal Australian ancestors diversified 25–40 thousand years ago (kya), suggesting pre-Holocene population structure in the ancient continent of Sahul (Australia, New Guinea and Tasmania). However, all of the studied Aboriginal Australians descend from a single founding population that differentiated ~10–32 kya. We infer a population expansion in northeast Australia during the Holocene epoch (past 10,000 years) associated with limited gene flow from this region to the rest of Australia, consistent with the spread of the Pama–Nyungan languages. We estimate that Aboriginal Australians and Papuans diverged from Eurasians 51–72 kya, following a single out-of-Africa dispersal, and subsequently admixed with archaic populations. Finally, we report evidence of selection in Aboriginal Australians potentially associated with living in the desert.

389 citations


Journal ArticleDOI
22 Apr 2016-Science
TL;DR: Phased genome sequencing of a healthy PRDM9 knockout mother, her child and controls shows that meiotic recombination sites are localized away from PRDM9-dependent hotspots. Thus, natural LOF variants inform on essential genetic loci and demonstrate PRDM9 redundancy in humans.
Abstract: Examining complete gene knockouts within a viable organism can inform on gene function. We sequenced the exomes of 3222 British Pakistani-heritage adults with high parental relatedness, discovering 1111 rare-variant homozygous genotypes with predicted loss of gene function (knockouts) in 781 genes. We observed 13.7% fewer than expected homozygous knockout genotypes, implying an average load of 1.6 recessive-lethal-equivalent LOF variants per adult. Linking genetic data to lifelong health records, knockouts were not associated with clinical consultation or prescription rate. In this dataset we identified a healthy PRDM9 knockout mother, and performed phased genome sequencing on her, her child and controls, which showed meiotic recombination sites localized away from PRDM9-dependent hotspots. Thus, natural LOF variants inform upon essential genetic loci, and demonstrate PRDM9 redundancy in humans.

266 citations


Posted ContentDOI
30 Aug 2016-bioRxiv
TL;DR: It is asserted that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote the understanding of human biology and advance the efforts to improve health.
Abstract: The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009 and reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that while the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.

194 citations


Journal ArticleDOI
10 Jun 2016-Science
TL;DR: This data-sharing effort has led to improved variant interpretation and development of treatments for rare diseases and some cancer types, but such benefits will only be available to the general population if researchers and clinicians can access and make comparisons across data from millions of individuals.
Abstract: Silos of genome data collection are being transformed into seamlessly connected, independent systems. Early data-sharing efforts have led to improved variant interpretation and development of treatments for rare diseases and some cancer types (1–3). However, such benefits will only be available to the general population if researchers and clinicians can access and make comparisons across data from millions of individuals.

173 citations


Journal ArticleDOI
TL;DR: It is shown that NSun3 is required for deposition of m(5)C at the anticodon loop in the mitochondrially encoded transfer RNA methionine (mt-tRNA(Met)), and that f(5)C in human mitochondrial RNA is generated by oxidative processing of m(5)C.
Abstract: Epitranscriptome modifications are required for structure and function of RNA and defects in these pathways have been associated with human disease. Here we identify the RNA target for the previously uncharacterized 5-methylcytosine (m(5)C) methyltransferase NSun3 and link m(5)C RNA modifications with energy metabolism. Using whole-exome sequencing, we identified loss-of-function mutations in NSUN3 in a patient presenting with combined mitochondrial respiratory chain complex deficiency. Patient-derived fibroblasts exhibit severe defects in mitochondrial translation that can be rescued by exogenous expression of NSun3. We show that NSun3 is required for deposition of m(5)C at the anticodon loop in the mitochondrially encoded transfer RNA methionine (mt-tRNA(Met)). Further, we demonstrate that m(5)C deficiency in mt-tRNA(Met) results in the lack of 5-formylcytosine (f(5)C) at the same tRNA position. Our findings demonstrate that NSUN3 is necessary for efficient mitochondrial translation and reveal that f(5)C in human mitochondrial RNA is generated by oxidative processing of m(5)C.

172 citations


Journal ArticleDOI
TL;DR: Analysis of shared rare variants indicates that the contemporary East English population derives on average 38% of its ancestry from Anglo-Saxon migrations; a new method, rarecoal, finds that the Anglo-Saxon samples are closely related to modern Dutch and Danish populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.
Abstract: British population history has been shaped by a series of immigrations, including the early Anglo-Saxon migrations after 400 CE. It remains an open question how these events affected the genetic composition of the current British population. Here, we present whole-genome sequences from 10 individuals excavated close to Cambridge in the East of England, ranging from the late Iron Age to the middle Anglo-Saxon period. By analysing shared rare variants with hundreds of modern samples from Britain and Europe, we estimate that on average the contemporary East English population derives 38% of its ancestry from Anglo-Saxon migrations. We gain further insight with a new method, rarecoal, which infers population history and identifies fine-scale genetic ancestry from rare variants. Using rarecoal we find that the Anglo-Saxon samples are closely related to modern Dutch and Danish populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.

144 citations


Journal ArticleDOI
TL;DR: A monoclonal antibody specific to DNAH11 was designed and validated, and high-resolution immunofluorescence microscopy (IFM) was performed on both control and PCD-affected human respiratory cells, as well as samples from green fluorescent protein (GFP)-left-right dynein mice, to determine the ciliary localization of DNAH11.
Abstract: Primary ciliary dyskinesia (PCD) is a recessively inherited disease that leads to chronic respiratory disorders owing to impaired mucociliary clearance. Conventional transmission electron microscopy (TEM) is a diagnostic standard to identify ultrastructural defects in respiratory cilia but is not useful in approximately 30% of PCD cases, which have normal ciliary ultrastructure. DNAH11 mutations are a common cause of PCD with normal ciliary ultrastructure and hyperkinetic ciliary beating, but its pathophysiology remains poorly understood. We therefore characterized DNAH11 in human respiratory cilia by immunofluorescence microscopy (IFM) in the context of PCD. We used whole-exome and targeted next-generation sequence analysis as well as Sanger sequencing to identify and confirm eight novel loss-of-function DNAH11 mutations. We designed and validated a monoclonal antibody specific to DNAH11 and performed high-resolution IFM of both control and PCD-affected human respiratory cells, as well as samples from green fluorescent protein (GFP)-left-right dynein mice, to determine the ciliary localization of DNAH11. IFM analysis demonstrated native DNAH11 localization in only the proximal region of wild-type human respiratory cilia and loss of DNAH11 in individuals with PCD with certain loss-of-function DNAH11 mutations. GFP-left-right dynein mice confirmed proximal DNAH11 localization in tracheal cilia. DNAH11 retained proximal localization in respiratory cilia of individuals with PCD with distinct ultrastructural defects, such as the absence of outer dynein arms (ODAs). TEM tomography detected a partial reduction of ODAs in DNAH11-deficient cilia. DNAH11 mutations result in a subtle ODA defect in only the proximal region of respiratory cilia, which is detectable by IFM and TEM tomography.

Journal ArticleDOI
TL;DR: The results establish TANGO2 deficiency as a clinically recognizable cause of pediatric disease with multi-organ involvement, and investigation of palmitate-dependent respiration in mutant fibroblasts showed evidence of a functional defect in mitochondrial β-oxidation.
Abstract: Molecular diagnosis of mitochondrial disorders is challenging because of extreme clinical and genetic heterogeneity. By exome sequencing, we identified three different bi-allelic truncating mutations in TANGO2 in three unrelated individuals with infancy-onset episodic metabolic crises characterized by encephalopathy, hypoglycemia, rhabdomyolysis, arrhythmias, and laboratory findings suggestive of a defect in mitochondrial fatty acid oxidation. Over the course of the disease, all individuals developed global brain atrophy with cognitive impairment and pyramidal signs. TANGO2 (transport and Golgi organization 2) encodes a protein with a putative function in redistribution of Golgi membranes into the endoplasmic reticulum in Drosophila and a mitochondrial localization has been confirmed in mice. Investigation of palmitate-dependent respiration in mutant fibroblasts showed evidence of a functional defect in mitochondrial β-oxidation. Our results establish TANGO2 deficiency as a clinically recognizable cause of pediatric disease with multi-organ involvement.

Journal ArticleDOI
TL;DR: TTC25 is reported as a new member of the ODA-DC machinery in humans and mice, and loss of the ciliary ODAs in affected humans is shown via TEM and immunofluorescence analyses.
Abstract: Multiprotein complexes referred to as outer dynein arms (ODAs) develop the main mechanical force to generate the ciliary and flagellar beat. ODA defects are the most common cause of primary ciliary dyskinesia (PCD), a congenital disorder of ciliary beating, characterized by recurrent infections of the upper and lower airways, as well as by progressive lung failure and randomization of left-right body asymmetry. Using a whole-exome sequencing approach, we identified recessive loss-of-function mutations within TTC25 in three individuals from two unrelated families affected by PCD. Mice generated by CRISPR/Cas9 technology and carrying a deletion of exons 2 and 3 in Ttc25 presented with laterality defects. Consistently, we observed immotile nodal cilia and missing leftward flow via particle image velocimetry. Furthermore, transmission electron microscopy (TEM) analysis in TTC25-deficient mice revealed an absence of ODAs. Consistent with our findings in mice, we were able to show loss of the ciliary ODAs in humans via TEM and immunofluorescence (IF) analyses. Additionally, IF analyses revealed an absence of the ODA docking complex (ODA-DC), along with its known components CCDC114, CCDC151, and ARMC4. Co-immunoprecipitation revealed interaction between the ODA-DC component CCDC114 and TTC25. Thus, here we report TTC25 as a new member of the ODA-DC machinery in humans and mice.

Journal ArticleDOI
01 Mar 2016-Methods
TL;DR: In this paper, a high-content platform for phenotypic analysis of human induced pluripotent stem cell (iPSC) lines is described, in which cells are dissociated and seeded as single cells onto 96-well plates coated with fibronectin at three different concentrations.

Journal ArticleDOI
13 May 2016-PLOS ONE
TL;DR: A new method for sensitive detection of copy number alterations, aneuploidy, and contamination in cell lines using genome-wide SNP genotyping data is presented, along with results based on induced pluripotent stem cell lines obtained in the HipSci project.
Abstract: Genomic screening for chromosomal abnormalities is an important part of quality control when establishing and maintaining stem cell lines. We present a new method for sensitive detection of copy number alterations, aneuploidy, and contamination in cell lines using genome-wide SNP genotyping data. In contrast to other methods designed for identifying copy number variations in a single sample or in a sample composed of a mixture of normal and tumor cells, this new method is tailored for determining differences between cell lines and the starting material from which they were derived, which allows us to distinguish between normal and novel copy number variation. We implemented the method in the freely available BCFtools package and present results based on induced pluripotent stem cell lines obtained in the HipSci project.

Posted ContentDOI
13 Oct 2016-bioRxiv
TL;DR: The rate of false positive variants introduced by imputation of Finnish genotype data using global reference panels (the Haplotype Reference Consortium, HRC, and the 1000 Genomes Project Phase I, 1000G) is evaluated, and the results are compared to a Finnish population-specific reference panel combining whole-genome and exome sequenced samples.
Abstract: Previous studies (1,2) have shown that large multi-population imputation reference panels increase the number of well-imputed variants. However, to our knowledge, no previous studies have evaluated the rate of introduced variation in monomorphic sites of the study population when using imputation panels with admixed populations. In this study we evaluate the rate of false positive variants introduced by the imputation of Finnish genotype data using global reference panels, the Haplotype Reference Consortium (HRC) (1) and the 1000 Genomes Project Phase I (1000G) (3), and compare the results to a Finnish population-specific reference panel combining whole genome and exome sequenced samples. In sites that were monomorphic in our test set, we observed high false positive rates for the global reference panels (4.0% for 1000G and 2.6% for HRC) compared to the Finnish panel (0.26%). This rate was even higher (7.4%) when using a combination panel of 1000G and Finnish whole genome sequences with cross-panel imputation.
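The false positive rate described above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the study's pipeline: it assumes "monomorphic" means every sample in the test set is homozygous reference, that a false positive is any such site where imputation yields at least one non-reference allele, and that genotypes are available as per-sample alternate-allele counts. The function name and data layout are hypothetical.

```python
def monomorphic_false_positive_rate(true_dosages, imputed_dosages):
    """Fraction of truly monomorphic sites that imputation reports as variable.

    Both arguments map a site id to a list of per-sample alt-allele counts
    (0, 1 or 2). A site counts as monomorphic when every true count is 0
    (an assumption made for this sketch).
    """
    mono_sites = [
        site for site, counts in true_dosages.items()
        if all(c == 0 for c in counts)
    ]
    if not mono_sites:
        return 0.0
    false_positives = sum(
        1 for site in mono_sites
        if any(c > 0 for c in imputed_dosages[site])
    )
    return false_positives / len(mono_sites)
```

With four monomorphic sites of which one is imputed as carrying an alternate allele, the function returns 0.25; the percentages reported in the abstract would correspond to the same ratio computed over the study's monomorphic sites.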

Posted ContentDOI
03 May 2016-bioRxiv
TL;DR: trio-sga, a set of three algorithms, takes advantage of mother-father-offspring trio sequencing to facilitate better-quality genome assembly in organisms with moderate to high levels of heterozygosity.
Abstract: Motivation: Most DNA sequence in diploid organisms is found in two copies, one contributed by the mother and the other by the father. The high density of differences between the maternally and paternally contributed sequences (heterozygous sites) in some organisms makes de novo genome assembly very challenging, even for algorithms specifically designed to deal with these cases. Therefore, various approaches, most commonly inbreeding in the laboratory, are used to reduce heterozygosity in genomic data prior to assembly. However, many species are not amenable to these techniques. Results: We introduce trio-sga, a set of three algorithms designed to take advantage of mother-father-offspring trio sequencing to facilitate better quality genome assembly in organisms with moderate to high levels of heterozygosity. Two of the algorithms use haplotype phase information present in the trio data to eliminate the majority of heterozygous sites before the assembly commences. The third algorithm is designed to reduce sequencing costs by enabling the use of parents' reads in the assembly of the genome of the offspring. We test these algorithms on a 'simulated trio' from four haploid datasets, and further demonstrate their performance by assembling three highly heterozygous Heliconius butterfly genomes. While the implementation of trio-sga is tuned towards Illumina-generated data, we note that the trio approach to reducing heterozygosity is likely to have cross-platform utility for de novo assembly. Availability: trio-sga is an extension of the sga genome assembler. It is available at https://github.com/millanek/trio-sga, written in C++, and runs multithreaded on UNIX-based systems. Contact: millanek@gmail.com, rd@sanger.ac.uk
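To illustrate why trio data carries phase information, the sketch below resolves which allele an offspring inherited from each parent at a single site, using only Mendelian inheritance. It is not how trio-sga works: the abstract describes algorithms operating on sequencing reads before assembly, whereas this toy example assumes called genotypes. The function name and the genotype encoding are hypothetical.

```python
def phase_child_site(mother, father, child):
    """Return (maternal_allele, paternal_allele) for the child at one site.

    Each genotype is an unordered pair of alleles, e.g. (0, 1).
    Returns None when the phase is ambiguous (e.g. all three heterozygous)
    or when the genotypes are not consistent with Mendelian inheritance.
    """
    options = set()
    for m in set(mother):
        for f in set(father):
            # Keep every parental allele pair that reproduces the child genotype.
            if sorted((m, f)) == sorted(child):
                options.add((m, f))
    return options.pop() if len(options) == 1 else None
```

For example, with a homozygous mother `(0, 0)`, a heterozygous father `(0, 1)` and a heterozygous child `(0, 1)`, the only consistent assignment is allele 0 from the mother and allele 1 from the father. When both parents and the child are heterozygous, the site cannot be phased from genotypes alone.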

Posted ContentDOI
17 Jun 2016-bioRxiv
TL;DR: Exome sequences from 3,222 British-Pakistani individuals with high parental relatedness are used to estimate exome mutation rates, finding frequent recurrence of mutations at polymorphic CpG sites, and an increase in C to T mutations in the Pakistani population compared to Europeans, suggesting that mutational processes have evolved rapidly between human populations.
Abstract: Heterozygous mutations within homozygous sequences descended from a recent common ancestor offer a way to ascertain de novo mutations (DNMs) across multiple generations. Using exome sequences from 3,222 British-Pakistani individuals with high parental relatedness, we estimate a mutation rate of 1.45 ± 0.05 × 10⁻⁸ per base pair per generation in autosomal coding sequence, with a corresponding non-crossover gene conversion rate of 8.75 ± 0.05 × 10⁻⁶ per base pair per generation. This is at the lower end of exome mutation rates previously estimated in parent-offspring trios, suggesting that post-zygotic mutations contribute little to the human germline mutation rate. We found frequent recurrence of mutations at polymorphic CpG sites, and an increase in C to T mutations in a 5′ CCG 3′ → 5′ CTG 3′ context in the Pakistani population compared to Europeans, suggesting that mutational processes have evolved rapidly between human populations.

Posted ContentDOI
07 Jul 2016-bioRxiv
TL;DR: A new phasing algorithm, Eagle2, is introduced that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium, HRC) using a new data structure based on the positional Burrows-Wheeler transform.
Abstract: Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing within a genotyped cohort, an approach that can attain high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here, we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium, HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ≈20x speedup and ≈10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2x the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
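The positional Burrows-Wheeler transform mentioned above rests on keeping haplotypes sorted by their reversed prefixes at every site, so that haplotypes sharing a long match ending at a site sit next to each other. The sketch below shows only that basic ordering step for biallelic 0/1 haplotypes; it does not reproduce Eagle2's data structure, which the abstract describes only as "based on" the transform. The function name is hypothetical.

```python
def pbwt_prefix_orders(haplotypes):
    """Return the haplotype ordering before each site and after the last one.

    haplotypes: list of equal-length lists of 0/1 alleles.
    orders[k] lists haplotype indices sorted by the reversed prefix
    hap[:k][::-1], with ties kept in original index order.
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    order = list(range(len(haplotypes)))
    orders = [order]
    for k in range(n_sites):
        # Stable partition on the allele at site k: zeros first, then ones.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders
```

Each update is a single linear pass over the haplotypes, so all orderings are produced in time proportional to the number of haplotypes times the number of sites, without any explicit sorting.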

Posted ContentDOI
25 May 2016-bioRxiv
TL;DR: This study provides a comprehensive picture of the major sources of genetic and phenotypic variation in iPSCs and establishes their suitability for use in genetic studies of complex human traits and cancer.
Abstract: Induced pluripotent stem cell (iPSC) technology has enormous potential to provide improved cellular models of human disease. However, variable genetic and phenotypic characterisation of many existing iPSC lines limits their potential use for research and therapy. Here, we describe the systematic generation, genotyping and phenotyping of 522 open access human iPSCs derived from 189 healthy male and female individuals as part of the Human Induced Pluripotent Stem Cells Initiative (HipSci: http://www.hipsci.org). Our study provides a comprehensive picture of the major sources of genetic and phenotypic variation in iPSCs and establishes their suitability for use in genetic studies of complex human traits and cancer. Using a combination of genome-wide analyses we find that 5-25% of the variation in different iPSC phenotypes, including differentiation capacity and cellular morphology, arises from differences between individuals. We also assess the phenotypic effects of rare, genomic copy number mutations that are recurrently seen following iPSC reprogramming and present an initial map of common regulatory variants affecting the transcriptome of pluripotent cells in humans.

Journal ArticleDOI
TL;DR: This work confirms the isolate status of Vis population by means of whole-exome sequence and reveals the pattern of loss-of-function mutations, which resembles the trails of adaptive evolution that were found in other species.
Abstract: We have whole-exome sequenced 176 individuals from the isolated population of the island of Vis in Croatia in order to describe exonic variation architecture. We found 290 577 single nucleotide variants (SNVs), 65% of which are singletons, low frequency or rare variants. A total of 25 430 (9%) SNVs are novel, previously not catalogued in NHLBI GO Exome Sequencing Project, UK10K-Generation Scotland, 1000Genomes Project, ExAC or NCBI Reference Assembly dbSNP. The majority of these variants (76%) are singletons. Comparable to data obtained from UK10K-Generation Scotland that were sequenced and analysed using the same protocols, we detected an enrichment of potentially damaging variants (non-synonymous and loss-of-function) in the low frequency and common variant categories. On average 115 (range 93–140) genotypes with loss-of-function variants, 23 (15–34) of which were homozygous, were identified per person. The landscape of loss-of-function variants across an exome revealed that variants mainly accumulated in genes on the xenobiotic-related pathways, of which majority coded for enzymes. The frequency of loss-of-function variants was additionally increased in Vis runs of homozygosity regions where variants mainly affected signalling pathways. This work confirms the isolate status of Vis population by means of whole-exome sequence and reveals the pattern of loss-of-function mutations, which resembles the trails of adaptive evolution that were found in other species. By cataloguing the exomic variants and describing the allelic structure of the Vis population, this study will serve as a valuable resource for future genetic studies of human diseases, population genetics and evolution in this population.

Posted ContentDOI
22 Sep 2016-bioRxiv
TL;DR: A high-resolution view of structural dynamics shows that, in chromosomal cores, S. paradoxus exhibits a higher accumulation rate of balanced structural rearrangements (inversions, translocations and transpositions) whereas S. cerevisiae accumulates unbalanced rearrangements more rapidly.
Abstract: Structural rearrangements have long been recognized as an important source of genetic variation with implications in phenotypic diversity and disease, yet their evolutionary dynamics are difficult to characterize with short-read sequencing. Here, we report long-read sequencing for 12 strains representing major subpopulations of the partially domesticated yeast Saccharomyces cerevisiae and its wild relative Saccharomyces paradoxus. Complete genome assemblies and annotations generate population-level reference genomes and allow for the first explicit definition of chromosome partitioning into cores, subtelomeres and chromosome-ends. High-resolution view of structural dynamics uncovers that, in chromosomal cores, S. paradoxus exhibits higher accumulation rate of balanced structural rearrangements (inversions, translocations and transpositions) whereas S. cerevisiae accumulates unbalanced rearrangements (large insertions, deletions and duplications) more rapidly. In subtelomeres, recurrent interchromosomal reshuffling was found in both species, with higher rate in S. cerevisiae. Such striking contrasts between wild and domesticated yeasts reveal the influence of human activities on structural genome evolution.

Posted ContentDOI
12 Jul 2016-bioRxiv
TL;DR: A significant depletion of variants in the rare frequency spectrum was observed in Finns when comparing the two populations and these functional categories represent the highest a priori power for downstream association studies of rare variants using population isolates.
Abstract: Isolated populations with enrichment of variants due to recent population bottlenecks provide a powerful resource for identifying disease-associated genetic variants and genes. As a model of an isolate population, we sequenced the genomes of 1463 Finnish individuals as part of the Sequencing Initiative Suomi (SISu) Project. We compared the genomic profiles of the 1463 Finns to a sample of 1463 British individuals that were sequenced in parallel as part of the UK10K Project. Whereas there were no major differences in the allele frequency of common variants, a significant depletion of variants in the rare frequency spectrum was observed in Finns when comparing the two populations. On the other hand, we observed >2.1 million variants that were twice as frequent among Finns compared to Britons and 800,000 variants that were more than 10 times more frequent in Finns. Furthermore, in Finns we observed a relative proportional enrichment of variants in the minor allele frequency range between 2 - 5% (p

Posted ContentDOI
22 Jun 2016-bioRxiv
TL;DR: The concept of a population BWT is introduced and used to store and index the sequencing reads of 2,705 samples from the 1000 Genomes Project and it is shown that as more genomes are added, identical read sequences are increasingly observed and compression becomes more efficient.
Abstract: We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2,705 samples from the 1000 Genomes Project. A key feature is that as more genomes are added, identical read sequences are increasingly observed and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, identifying that 3.2 Mbp with population support was lost in the transition from GRCh37 with 13.7 Mbp added to GRCh38. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out non-reference queries to search for the presence of all known viral genomes, and discover human T-lymphotropic virus 1 integrations in six samples in a recognised epidemiological distribution.
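The role of the BWT and FM-index as a searchable text index can be sketched on a single short string. This is a minimal illustration only, not the population BWT described in the abstract: it builds the transform by sorting rotations, which is feasible only for tiny inputs, and the function names `bwt` and `count_occurrences` are hypothetical.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform of text, using '$' as the end sentinel."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of pattern by FM-index backward search."""
    # c_table[ch]: number of characters in the text smaller than ch.
    counts = Counter(bwt_str)
    c_table = {}
    total = 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]

    def occ(ch, i):
        # Occurrences of ch in bwt_str[:i]; a real index precomputes this.
        return bwt_str[:i].count(ch)

    lo, hi = 0, len(bwt_str)
    # Extend the match one character at a time, from the pattern's end.
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(ch, lo)
        hi = c_table[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


index = bwt("banana")
hits = count_occurrences(index, "ana")
```

The same backward search applies to a k-mer such as a 31-mer. The cost of a query grows with the pattern length rather than with the size of the indexed text, provided the occurrence counts are precomputed instead of rescanned as in this sketch.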

Posted ContentDOI
27 Oct 2016-bioRxiv
TL;DR: Deep sequencing of large crosses of butterflies is used to show that there are no long chromosome regions that are not broken up during hybridisation, and no long chromosome inversions anywhere between the two genomes, which suggests that hybridisation is rare enough and mate preference is strong enough that inversions are not necessary to maintain the species barrier.
Abstract: Mechanisms that suppress recombination are known to help maintain species barriers by preventing the breakup of co-adapted gene combinations. The sympatric butterfly species H. melpomene and H. cydno are separated by many strong barriers, but the species still hybridise infrequently in the wild, with around 40% of the genome influenced by introgression. We tested the hypothesis that genetic barriers between the species are reinforced by inversions or other mechanisms to reduce between-species recombination rate. We constructed fine-scale recombination maps for Panamanian populations of both species and hybrids to directly measure recombination rate between these species, and generated long sequence reads to detect inversions. We find no evidence for a systematic reduction in recombination rates in F1 hybrids, and also no evidence for inversions longer than 50 kb that might be involved in generating or maintaining species barriers. This suggests that mechanisms leading to global or local reduction in recombination do not play a significant role in the maintenance of species barriers between H. melpomene and H. cydno.