Showing papers by "Richard Durbin" published in 2016
••
Wellcome Trust Sanger Institute1, University of Michigan2, University of Oxford3, University of Geneva4, University of Exeter5, Greifswald University Hospital6, National Research Council7, University of Bristol8, University of Colorado Boulder9, Fred Hutchinson Cancer Research Center10, University of Washington11, SUNY Downstate Medical Center12, Erasmus University Rotterdam13, University of Trieste14, VU University Amsterdam15, King's College London16, South London and Maudsley NHS Foundation Trust17, University of Edinburgh18, Harvard University19, National Institutes of Health20, Harokopio University21, Innsbruck Medical University22, Broad Institute23, University of Helsinki24, Lund University25, Norwegian University of Science and Technology26, University of Cambridge27, University of Minnesota28, Technische Universität München29, University of North Carolina at Chapel Hill30, University of Toronto31, McGill University32, Leiden University33, University of Pennsylvania34, University of Groningen35, Utrecht University36, Churchill Hospital37
TL;DR: A reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies.
Abstract: We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
2,149 citations
01 Jan 2016
TL;DR: In this article, a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry is presented.
Abstract: We describe a reference panel of 64,976 human haplotypes at 39,235,157 SNPs constructed using whole-genome sequence data from 20 studies of predominantly European ancestry. Using this resource leads to accurate genotype imputation at minor allele frequencies as low as 0.1% and a large increase in the number of SNPs tested in association studies, and it can help to discover and refine causal loci. We describe remote server resources that allow researchers to carry out imputation and phasing consistently and efficiently.
1,261 citations
••
TL;DR: A new phasing algorithm, Eagle2, is introduced that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium; HRC) using a new data structure based on the positional Burrows-Wheeler transform.
Abstract: Po-Ru Loh, Alkes Price and colleagues present Eagle2, a reference-based phasing algorithm that allows for highly accurate and efficient phasing of genotypes across a broad range of cohort sizes. They demonstrate an approximately 10% improvement in accuracy and an approximately 20-fold improvement in speed compared to a competing method, SHAPEIT2.
1,246 citations
••
TL;DR: BCFtools/RoH, an extension to the BCFtools software package, is presented and evaluated; it detects regions of autozygosity in sequencing data, in particular exome data, using a hidden Markov model, and is shown to have higher sensitivity and specificity than existing methods under a range of sequencing error rates and levels of autozygosity.
Abstract: Summary: Runs of homozygosity (RoHs) are genomic stretches of a diploid genome that show identical alleles on both chromosomes. Longer RoHs are unlikely to have arisen by chance but are likely to denote autozygosity, whereby both copies of the genome descend from the same recent ancestor. Early tools to detect RoH used genotype array data, but substantially more information is available from sequencing data. Here, we present and evaluate BCFtools/RoH, an extension to the BCFtools software package, that detects regions of autozygosity in sequencing data, in particular exome data, using a hidden Markov model. By applying it to simulated data and real data from the 1000 Genomes Project we estimate its accuracy and show that it has higher sensitivity and specificity than existing methods under a range of sequencing error rates and levels of autozygosity.
Availability and implementation: BCFtools/RoH and its associated binary/source files are freely available from https://github.com/samtools/BCFtools.
Contact: vn2@sanger.ac.uk or pd3@sanger.ac.uk
Supplementary information: Supplementary data are available at Bioinformatics online.
452 citations
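The hidden Markov model idea in the BCFtools/RoH abstract above can be illustrated with a toy two-state Viterbi decoder. This is a minimal sketch under stated assumptions, not the BCFtools/RoH model: the function name `viterbi_roh`, the binary homozygous/heterozygous observations, and all probabilities are hypothetical values chosen for illustration.

```python
import math

def viterbi_roh(is_het, p_het_az=0.01, p_het_non=0.3, p_switch=0.05):
    """Label each site as autozygous (True) or not (False).

    is_het: sequence of 0/1 flags, 1 meaning a heterozygous genotype call.
    All probabilities are illustrative assumptions, not published values.
    """
    if not is_het:
        return []
    states = (True, False)  # True = autozygous state

    def log_emit(state, het):
        # Heterozygous calls are rare inside an autozygous stretch.
        p = p_het_az if state else p_het_non
        return math.log(p if het else 1.0 - p)

    log_stay = math.log(1.0 - p_switch)
    log_move = math.log(p_switch)

    # Initialise with a uniform prior over the two states.
    score = {s: math.log(0.5) + log_emit(s, is_het[0]) for s in states}
    back = []
    for het in is_het[1:]:
        new_score, pointers = {}, {}
        for s in states:
            def step(q):
                return score[q] + (log_stay if q == s else log_move)
            best_prev = max(states, key=step)
            new_score[s] = step(best_prev) + log_emit(s, het)
            pointers[s] = best_prev
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path

calls = [1, 0, 1, 1, 0] + [0] * 30 + [1, 0, 1, 1]
labels = viterbi_roh(calls)
print(sum(labels), "of", len(labels), "sites labelled autozygous")
```

A long run of homozygous calls is labelled autozygous because paying two state switches costs less than explaining every homozygous site under the non-autozygous state.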
••
University of Copenhagen1, Swiss Institute of Bioinformatics2, University of Bern3, Griffith University4, Pompeu Fabra University5, Instituto Gulbenkian de Ciência6, Wellcome Trust Sanger Institute7, Aarhus University8, University of California, Berkeley9, Max Planck Society10, Technical University of Denmark11, University of Cambridge12, King Abdullah University of Science and Technology13, ETH Zurich14, Monash University Malaysia Campus15, Centre national de la recherche scientifique16, University of Porto17, University College London18, Papua New Guinea Institute of Medical Research19, University of Papua New Guinea20, University of Otago21, Wellcome Trust Centre for Human Genetics22, Estonian Biocentre23, University of Oxford24, University of Western Australia25, Yale University26, University of California, San Francisco27
TL;DR: A population expansion in northeast Australia during the Holocene epoch associated with limited gene flow from this region to the rest of Australia, consistent with the spread of the Pama–Nyungan languages is inferred.
Abstract: The population history of Aboriginal Australians remains largely uncharacterized. Here we generate high-coverage genomes for 83 Aboriginal Australians (speakers of Pama–Nyungan languages) and 25 Papuans from the New Guinea Highlands. We find that Papuan and Aboriginal Australian ancestors diversified 25–40 thousand years ago (kya), suggesting pre-Holocene population structure in the ancient continent of Sahul (Australia, New Guinea and Tasmania). However, all of the studied Aboriginal Australians descend from a single founding population that differentiated ~10–32 kya. We infer a population expansion in northeast Australia during the Holocene epoch (past 10,000 years) associated with limited gene flow from this region to the rest of Australia, consistent with the spread of the Pama–Nyungan languages. We estimate that Aboriginal Australians and Papuans diverged from Eurasians 51–72 kya, following a single out-of-Africa dispersal, and subsequently admixed with archaic populations. Finally, we report evidence of selection in Aboriginal Australians potentially associated with living in the desert.
389 citations
••
Wellcome Trust Sanger Institute1, Queen Mary University of London2, National Health Service3, Broad Institute4, Harvard University5, Heart of England NHS Foundation Trust6, Aston University7, University College London8, University of Birmingham9, Cambridge University Hospitals NHS Foundation Trust10, National Institute for Health Research11, King's College London12
TL;DR: The results show that, in a healthy PRDM9 knockout mother, meiotic recombination sites are localized away from PRDM9-dependent hotspots. Thus, natural LOF variants inform on essential genetic loci and demonstrate PRDM9 redundancy in humans.
Abstract: Examining complete gene knockouts within a viable organism can inform on gene function. We sequenced the exomes of 3222 British Pakistani-heritage adults with high parental relatedness, discovering 1111 rare-variant homozygous genotypes with predicted loss of gene function (knockouts) in 781 genes. We observed 13.7% fewer than expected homozygous knockout genotypes, implying an average load of 1.6 recessive-lethal-equivalent LOF variants per adult. Linking genetic data to lifelong health records, knockouts were not associated with clinical consultation or prescription rate. In this dataset we identified a healthy PRDM9 knockout mother, and performed phased genome sequencing on her, her child and controls, which showed meiotic recombination sites localized away from PRDM9-dependent hotspots. Thus, natural LOF variants inform upon essential genetic loci, and demonstrate PRDM9 redundancy in humans.
266 citations
••
TL;DR: It is asserted that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote the understanding of human biology and advance the efforts to improve health.
Abstract: The human reference genome assembly plays a central role in nearly all aspects of today's basic and clinical research. GRCh38 is the first coordinate-changing assembly update since 2009 and reflects the resolution of roughly 1000 issues and encompasses modifications ranging from thousands of single base changes to megabase-scale path reorganizations, gap closures and localization of previously orphaned sequences. We developed a new approach to sequence generation for targeted base updates and used data from new genome mapping technologies and single haplotype resources to identify and resolve larger assembly issues. For the first time, the reference assembly contains sequence-based representations for the centromeres. We also expanded the number of alternate loci to create a reference that provides a more robust representation of human population variation. We demonstrate that the updates render the reference an improved annotation substrate, alter read alignments in unchanged regions and impact variant interpretation at clinically relevant loci. We additionally evaluated a collection of new de novo long-read haploid assemblies and conclude that while the new assemblies compare favorably to the reference with respect to continuity, error rate, and gene completeness, the reference still provides the best representation for complex genomic regions and coding sequences. We assert that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote our understanding of human biology and advance our efforts to improve health.
194 citations
••
TL;DR: This data-sharing effort has led to improved variant interpretation and development of treatments for rare diseases and some cancer types, but such benefits will only be available to the general population if researchers and clinicians can access and make comparisons across data from millions of individuals.
Abstract: Silos of genome data collection are being transformed into seamlessly connected, independent systems. Early data-sharing efforts have led to improved variant interpretation and development of treatments for rare diseases and some cancer types (1–3). However, such benefits will only be available to the general population if researchers and clinicians can access and make comparisons across data from millions of individuals.
173 citations
••
MRC Mitochondrial Biology Unit1, University of Cambridge2, Technische Universität München3, University of Bath4, University of Würzburg5, Max Planck Society6, Max Delbrück Center for Molecular Medicine7, Boston Children's Hospital8, Wellcome Trust Sanger Institute9, Paracelsus Private Medical University of Salzburg10
TL;DR: It is shown that NSun3 is required for deposition of m5C at the anticodon loop in the mitochondrially encoded transfer RNA methionine (mt-tRNAMet), and f5C in human mitochondrial RNA is generated by oxidative processing of m5C.
Abstract: Epitranscriptome modifications are required for structure and function of RNA and defects in these pathways have been associated with human disease. Here we identify the RNA target for the previously uncharacterized 5-methylcytosine (m(5)C) methyltransferase NSun3 and link m(5)C RNA modifications with energy metabolism. Using whole-exome sequencing, we identified loss-of-function mutations in NSUN3 in a patient presenting with combined mitochondrial respiratory chain complex deficiency. Patient-derived fibroblasts exhibit severe defects in mitochondrial translation that can be rescued by exogenous expression of NSun3. We show that NSun3 is required for deposition of m(5)C at the anticodon loop in the mitochondrially encoded transfer RNA methionine (mt-tRNA(Met)). Further, we demonstrate that m(5)C deficiency in mt-tRNA(Met) results in the lack of 5-formylcytosine (f(5)C) at the same tRNA position. Our findings demonstrate that NSUN3 is necessary for efficient mitochondrial translation and reveal that f(5)C in human mitochondrial RNA is generated by oxidative processing of m(5)C.
172 citations
••
TL;DR: By analysing shared rare variants, it is estimated that on average the contemporary East English population derives 38% of its ancestry from Anglo-Saxon migrations; a new method, rarecoal, finds that the Anglo-Saxon samples are closely related to modern Dutch and Danish populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.
Abstract: British population history has been shaped by a series of immigrations, including the early Anglo-Saxon migrations after 400 CE. It remains an open question how these events affected the genetic composition of the current British population. Here, we present whole-genome sequences from 10 individuals excavated close to Cambridge in the East of England, ranging from the late Iron Age to the middle Anglo-Saxon period. By analysing shared rare variants with hundreds of modern samples from Britain and Europe, we estimate that on average the contemporary East English population derives 38% of its ancestry from Anglo-Saxon migrations. We gain further insight with a new method, rarecoal, which infers population history and identifies fine-scale genetic ancestry from rare variants. Using rarecoal we find that the Anglo-Saxon samples are closely related to modern Dutch and Danish populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.
144 citations
••
TL;DR: A monoclonal antibody specific to DNAH11 was designed and validated, and high-resolution immunofluorescence microscopy (IFM) of both control and PCD-affected human respiratory cells, as well as samples from green fluorescent protein (GFP)-left-right dynein mice, was performed to determine the ciliary localization of DNAH11.
Abstract: Primary ciliary dyskinesia (PCD) is a recessively inherited disease that leads to chronic respiratory disorders owing to impaired mucociliary clearance. Conventional transmission electron microscopy (TEM) is a diagnostic standard to identify ultrastructural defects in respiratory cilia but is not useful in approximately 30% of PCD cases, which have normal ciliary ultrastructure. DNAH11 mutations are a common cause of PCD with normal ciliary ultrastructure and hyperkinetic ciliary beating, but its pathophysiology remains poorly understood. We therefore characterized DNAH11 in human respiratory cilia by immunofluorescence microscopy (IFM) in the context of PCD. We used whole-exome and targeted next-generation sequence analysis as well as Sanger sequencing to identify and confirm eight novel loss-of-function DNAH11 mutations. We designed and validated a monoclonal antibody specific to DNAH11 and performed high-resolution IFM of both control and PCD-affected human respiratory cells, as well as samples from green fluorescent protein (GFP)-left-right dynein mice, to determine the ciliary localization of DNAH11. IFM analysis demonstrated native DNAH11 localization in only the proximal region of wild-type human respiratory cilia and loss of DNAH11 in individuals with PCD with certain loss-of-function DNAH11 mutations. GFP-left-right dynein mice confirmed proximal DNAH11 localization in tracheal cilia. DNAH11 retained proximal localization in respiratory cilia of individuals with PCD with distinct ultrastructural defects, such as the absence of outer dynein arms (ODAs). TEM tomography detected a partial reduction of ODAs in DNAH11-deficient cilia. DNAH11 mutations result in a subtle ODA defect in only the proximal region of respiratory cilia, which is detectable by IFM and TEM tomography.
••
TL;DR: The results establish TANGO2 deficiency as a clinically recognizable cause of pediatric disease with multi-organ involvement, and investigation of palmitate-dependent respiration in mutant fibroblasts showed evidence of a functional defect in mitochondrial β-oxidation.
Abstract: Molecular diagnosis of mitochondrial disorders is challenging because of extreme clinical and genetic heterogeneity. By exome sequencing, we identified three different bi-allelic truncating mutations in TANGO2 in three unrelated individuals with infancy-onset episodic metabolic crises characterized by encephalopathy, hypoglycemia, rhabdomyolysis, arrhythmias, and laboratory findings suggestive of a defect in mitochondrial fatty acid oxidation. Over the course of the disease, all individuals developed global brain atrophy with cognitive impairment and pyramidal signs. TANGO2 (transport and Golgi organization 2) encodes a protein with a putative function in redistribution of Golgi membranes into the endoplasmic reticulum in Drosophila and a mitochondrial localization has been confirmed in mice. Investigation of palmitate-dependent respiration in mutant fibroblasts showed evidence of a functional defect in mitochondrial β-oxidation. Our results establish TANGO2 deficiency as a clinically recognizable cause of pediatric disease with multi-organ involvement.
••
TL;DR: TTC25 is reported as a new member of the ODA-DC machinery in humans and mice, and loss of the ciliary ODAs in humans is shown via TEM and immunofluorescence analyses.
Abstract: Multiprotein complexes referred to as outer dynein arms (ODAs) develop the main mechanical force to generate the ciliary and flagellar beat. ODA defects are the most common cause of primary ciliary dyskinesia (PCD), a congenital disorder of ciliary beating, characterized by recurrent infections of the upper and lower airways, as well as by progressive lung failure and randomization of left-right body asymmetry. Using a whole-exome sequencing approach, we identified recessive loss-of-function mutations within TTC25 in three individuals from two unrelated families affected by PCD. Mice generated by CRISPR/Cas9 technology and carrying a deletion of exons 2 and 3 in Ttc25 presented with laterality defects. Consistently, we observed immotile nodal cilia and missing leftward flow via particle image velocimetry. Furthermore, transmission electron microscopy (TEM) analysis in TTC25-deficient mice revealed an absence of ODAs. Consistent with our findings in mice, we were able to show loss of the ciliary ODAs in humans via TEM and immunofluorescence (IF) analyses. Additionally, IF analyses revealed an absence of the ODA docking complex (ODA-DC), along with its known components CCDC114, CCDC151, and ARMC4. Co-immunoprecipitation revealed interaction between the ODA-DC component CCDC114 and TTC25. Thus, here we report TTC25 as a new member of the ODA-DC machinery in humans and mice.
••
TL;DR: In this paper, a high-content platform for phenotypic analysis of human induced pluripotent stem cell (iPSC) lines is described, where cells are dissociated and seeded as single cells onto 96-well plates coated with fibronectin at three different concentrations.
••
TL;DR: A new method for sensitive detection of copy number alterations, aneuploidy, and contamination in cell lines using genome-wide SNP genotyping data is presented, along with results based on induced pluripotent stem cell lines obtained in the HipSci project.
Abstract: Genomic screening for chromosomal abnormalities is an important part of quality control when establishing and maintaining stem cell lines. We present a new method for sensitive detection of copy number alterations, aneuploidy, and contamination in cell lines using genome-wide SNP genotyping data. In contrast to other methods designed for identifying copy number variations in a single sample or in a sample composed of a mixture of normal and tumor cells, this new method is tailored for determining differences between cell lines and the starting material from which they were derived, which allows us to distinguish between normal and novel copy number variation. We implemented the method in the freely available BCFtools package and present results based on induced pluripotent stem cell lines obtained in the HipSci project.
••
TL;DR: The rate of false positive variants introduced by the imputation of Finnish genotype data using global reference panels (the Haplotype Reference Consortium, HRC, and the 1000 Genomes Project Phase I, 1000G) is evaluated, and the results are compared to a Finnish population-specific reference panel combining whole genome and exome sequenced samples.
Abstract: Previous studies [1,2] have shown that large multi-population imputation reference panels increase the number of well-imputed variants. However, to our knowledge, no previous studies have evaluated the rate of introduced variation in monomorphic sites of the study population when using imputation panels with admixed populations. In this study we evaluate the rate of false positive variants introduced by the imputation of Finnish genotype data using global reference panels (Haplotype Reference Consortium [1], HRC, and the 1000 Genomes Project Phase I [3], 1000G) and compare the results to a Finnish population-specific reference panel combining whole genome and exome sequenced samples. In sites that were monomorphic in our test set, we observed high false positive rates for the global reference panels (4.0% for 1000G and 2.6% for HRC) compared to the Finnish panel (0.26%). This rate was even higher (7.4%) when using a combination panel of 1000G and Finnish whole genome sequences with cross-panel imputation.
••
TL;DR: trio-sga, a set of three algorithms, is introduced that takes advantage of mother-father-offspring trio sequencing to facilitate better quality genome assembly in organisms with moderate to high levels of heterozygosity.
Abstract: Motivation: Most DNA sequence in diploid organisms is found in two copies, one contributed by the mother and the other by the father. The high density of differences between the maternally and paternally contributed sequences (heterozygous sites) in some organisms makes de novo genome assembly very challenging, even for algorithms specifically designed to deal with these cases. Therefore, various approaches, most commonly inbreeding in the laboratory, are used to reduce heterozygosity in genomic data prior to assembly. However, many species are not amenable to these techniques.
Results: We introduce trio-sga, a set of three algorithms designed to take advantage of mother-father-offspring trio sequencing to facilitate better quality genome assembly in organisms with moderate to high levels of heterozygosity. Two of the algorithms use haplotype phase information present in the trio data to eliminate the majority of heterozygous sites before the assembly commences. The third algorithm is designed to reduce sequencing costs by enabling the use of parents' reads in the assembly of the genome of the offspring. We test these algorithms on a 'simulated trio' from four haploid datasets, and further demonstrate their performance by assembling three highly heterozygous Heliconius butterfly genomes. While the implementation of trio-sga is tuned towards Illumina-generated data, we note that the trio approach to reducing heterozygosity is likely to have cross-platform utility for de novo assembly.
Availability: trio-sga is an extension of the sga genome assembler. It is available at https://github.com/millanek/trio-sga, written in C++, and runs multithreaded on UNIX-based systems.
Contact: millanek@gmail.com, rd@sanger.ac.uk
••
TL;DR: Exome sequences from 3,222 British-Pakistani individuals with high parental relatedness are used to estimate exome mutation rates, finding frequent recurrence of mutations at polymorphic CpG sites, and an increase in C to T mutations in the Pakistani population compared to Europeans, suggesting that mutational processes have evolved rapidly between human populations.
Abstract: Heterozygous mutations within homozygous sequences descended from a recent common ancestor offer a way to ascertain de novo mutations (DNMs) across multiple generations. Using exome sequences from 3,222 British-Pakistani individuals with high parental relatedness, we estimate a mutation rate of 1.45 ± 0.05 × 10⁻⁸ per base pair per generation in autosomal coding sequence, with a corresponding non-crossover gene conversion rate of 8.75 ± 0.05 × 10⁻⁶ per base pair per generation. This is at the lower end of exome mutation rates previously estimated in parent-offspring trios, suggesting that post-zygotic mutations contribute little to the human germline mutation rate. We found frequent recurrence of mutations at polymorphic CpG sites, and an increase in C to T mutations in a 5′ CCG 3′ → 5′ CTG 3′ context in the Pakistani population compared to Europeans, suggesting that mutational processes have evolved rapidly between human populations.
••
TL;DR: A new phasing algorithm, Eagle2, is introduced that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium, HRC) using a new data structure based on the positional Burrows-Wheeler transform.
Abstract: Haplotype phasing is a fundamental problem in medical and population genetics. Phasing is generally performed via statistical phasing within a genotyped cohort, an approach that can attain high accuracy in very large cohorts but attains lower accuracy in smaller cohorts. Here, we instead explore the paradigm of reference-based phasing. We introduce a new phasing algorithm, Eagle2, that attains high accuracy across a broad range of cohort sizes by efficiently leveraging information from large external reference panels (such as the Haplotype Reference Consortium, HRC) using a new data structure based on the positional Burrows-Wheeler transform. We demonstrate that Eagle2 attains a ≈20x speedup and ≈10% increase in accuracy compared to reference-based phasing using SHAPEIT2. On European-ancestry samples, Eagle2 with the HRC panel achieves >2x the accuracy of 1000 Genomes-based phasing. Eagle2 is open source and freely available for HRC-based phasing via the Sanger Imputation Service and the Michigan Imputation Server.
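The positional Burrows-Wheeler transform mentioned in the abstract above rests on one simple idea: at each site, haplotypes are kept sorted by their reversed prefixes, so haplotypes that match over a long stretch ending at that site sit next to each other. The following is a minimal sketch of that ordering step only, assuming biallelic 0/1 haplotypes; it is not Eagle2's data structure, and the function name `pbwt_prefix_arrays` is hypothetical.

```python
def pbwt_prefix_arrays(haplotypes):
    """Return the haplotype ordering before site 0 and after each site.

    haplotypes: list of equal-length lists of 0/1 alleles.
    After site k, the ordering sorts haplotypes by their reversed
    prefixes ending at site k (ties keep their earlier relative order).
    """
    n_sites = len(haplotypes[0]) if haplotypes else 0
    order = list(range(len(haplotypes)))
    arrays = [order[:]]
    for k in range(n_sites):
        # Stable partition: carriers of allele 0 first, then allele 1.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones
        arrays.append(order[:])
    return arrays

haps = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
print(pbwt_prefix_arrays(haps)[-1])  # [0, 3, 1, 2]
```

In the example, haplotypes 0 and 3 are identical and end up adjacent in the final ordering. Each site costs one linear pass, which is what makes this kind of ordering attractive for very large reference panels.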
••
TL;DR: This study provides a comprehensive picture of the major sources of genetic and phenotypic variation in iPSCs and establishes their suitability for use in genetic studies of complex human traits and cancer.
Abstract: Induced pluripotent stem cell (iPSC) technology has enormous potential to provide improved cellular models of human disease. However, variable genetic and phenotypic characterisation of many existing iPSC lines limits their potential use for research and therapy. Here, we describe the systematic generation, genotyping and phenotyping of 522 open access human iPSCs derived from 189 healthy male and female individuals as part of the Human Induced Pluripotent Stem Cells Initiative (HipSci: http://www.hipsci.org). Our study provides a comprehensive picture of the major sources of genetic and phenotypic variation in iPSCs and establishes their suitability for use in genetic studies of complex human traits and cancer. Using a combination of genome-wide analyses we find that 5-25% of the variation in different iPSC phenotypes, including differentiation capacity and cellular morphology, arises from differences between individuals. We also assess the phenotypic effects of rare, genomic copy number mutations that are recurrently seen following iPSC reprogramming and present an initial map of common regulatory variants affecting the transcriptome of pluripotent cells in humans.
••
TL;DR: This work confirms the isolate status of Vis population by means of whole-exome sequence and reveals the pattern of loss-of-function mutations, which resembles the trails of adaptive evolution that were found in other species.
Abstract: We have whole-exome sequenced 176 individuals from the isolated population of the island of Vis in Croatia in order to describe exonic variation architecture. We found 290 577 single nucleotide variants (SNVs), 65% of which are singletons, low frequency or rare variants. A total of 25 430 (9%) SNVs are novel, previously not catalogued in NHLBI GO Exome Sequencing Project, UK10K-Generation Scotland, 1000Genomes Project, ExAC or NCBI Reference Assembly dbSNP. The majority of these variants (76%) are singletons. Comparable to data obtained from UK10K-Generation Scotland that were sequenced and analysed using the same protocols, we detected an enrichment of potentially damaging variants (non-synonymous and loss-of-function) in the low frequency and common variant categories. On average 115 (range 93–140) genotypes with loss-of-function variants, 23 (15–34) of which were homozygous, were identified per person. The landscape of loss-of-function variants across an exome revealed that variants mainly accumulated in genes on the xenobiotic-related pathways, of which majority coded for enzymes. The frequency of loss-of-function variants was additionally increased in Vis runs of homozygosity regions where variants mainly affected signalling pathways. This work confirms the isolate status of Vis population by means of whole-exome sequence and reveals the pattern of loss-of-function mutations, which resembles the trails of adaptive evolution that were found in other species. By cataloguing the exomic variants and describing the allelic structure of the Vis population, this study will serve as a valuable resource for future genetic studies of human diseases, population genetics and evolution in this population.
••
TL;DR: A high-resolution view of structural dynamics reveals that, in chromosomal cores, S. paradoxus exhibits a higher accumulation rate of balanced structural rearrangements (inversions, translocations and transpositions) whereas S. cerevisiae accumulates unbalanced rearrangements more rapidly.
Abstract: Structural rearrangements have long been recognized as an important source of genetic variation with implications in phenotypic diversity and disease, yet their evolutionary dynamics are difficult to characterize with short-read sequencing. Here, we report long-read sequencing for 12 strains representing major subpopulations of the partially domesticated yeast Saccharomyces cerevisiae and its wild relative Saccharomyces paradoxus. Complete genome assemblies and annotations generate population-level reference genomes and allow for the first explicit definition of chromosome partitioning into cores, subtelomeres and chromosome-ends. High-resolution view of structural dynamics uncovers that, in chromosomal cores, S. paradoxus exhibits higher accumulation rate of balanced structural rearrangements (inversions, translocations and transpositions) whereas S. cerevisiae accumulates unbalanced rearrangements (large insertions, deletions and duplications) more rapidly. In subtelomeres, recurrent interchromosomal reshuffling was found in both species, with higher rate in S. cerevisiae. Such striking contrasts between wild and domesticated yeasts reveal the influence of human activities on structural genome evolution.
••
TL;DR: A significant depletion of variants in the rare frequency spectrum was observed in Finns when comparing the two populations and these functional categories represent the highest a priori power for downstream association studies of rare variants using population isolates.
Abstract: Isolated populations with enrichment of variants due to recent population bottlenecks provide a powerful resource for identifying disease-associated genetic variants and genes. As a model of an isolate population, we sequenced the genomes of 1463 Finnish individuals as part of the Sequencing Initiative Suomi (SISu) Project. We compared the genomic profiles of the 1463 Finns to a sample of 1463 British individuals that were sequenced in parallel as part of the UK10K Project. Whereas there were no major differences in the allele frequency of common variants, a significant depletion of variants in the rare frequency spectrum was observed in Finns when comparing the two populations. On the other hand, we observed >2.1 million variants that were twice as frequent among Finns compared to Britons and 800,000 variants that were more than 10 times more frequent in Finns. Furthermore, in Finns we observed a relative proportional enrichment of variants in the minor allele frequency range between 2 - 5% (p
••
TL;DR: The concept of a population BWT is introduced and used to store and index the sequencing reads of 2,705 samples from the 1000 Genomes Project and it is shown that as more genomes are added, identical read sequences are increasingly observed and compression becomes more efficient.
Abstract: We are rapidly approaching the point where we have sequenced millions of human genomes. There is a pressing need for new data structures to store raw sequencing data and efficient algorithms for population scale analysis. Current reference-based data formats do not fully exploit the redundancy in population sequencing nor take advantage of shared genetic variation. In recent years, the Burrows-Wheeler transform (BWT) and FM-index have been widely employed as a full-text searchable index for read alignment and de novo assembly. We introduce the concept of a population BWT and use it to store and index the sequencing reads of 2,705 samples from the 1000 Genomes Project. A key feature is that as more genomes are added, identical read sequences are increasingly observed and compression becomes more efficient. We assess the support in the 1000 Genomes read data for every base position of two human reference assembly versions, finding that in the transition from GRCh37 to GRCh38, 3.2 Mbp with population support was lost and 13.7 Mbp was added. We show that the vast majority of variant alleles can be uniquely described by overlapping 31-mers and show how rapid and accurate SNP and indel genotyping can be carried out across the genomes in the population BWT. We use the population BWT to carry out non-reference queries to search for the presence of all known viral genomes, and discover human T-lymphotropic virus 1 integrations in six samples in a recognised epidemiological distribution.
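As a rough illustration of the kind of query a BWT/FM-index over a read collection supports, the following Python sketch builds a naive BWT of a few toy reads and counts how many times a k-mer occurs across them using backward search. It is not the population BWT implementation described in the abstract: the class name `ToyFMIndex`, the `$` read terminator, the rotation-sorting construction and the toy reads are all assumptions made for this example, and the construction is only practical for tiny inputs.

```python
from collections import Counter


class ToyFMIndex:
    """Toy FM-index over a set of reads (hypothetical helper, not the paper's code)."""

    def __init__(self, reads):
        # Assumption: every read is terminated with '$', which sorts before A/C/G/T.
        text = "".join(read + "$" for read in reads)
        n = len(text)
        # Naive construction: sort all cyclic rotations of the concatenated text.
        order = sorted(range(n), key=lambda i: text[i:] + text[:i])
        # The BWT is the character preceding each sorted rotation.
        self.bwt = "".join(text[i - 1] for i in order)

        # first[c]: number of characters in the text that sort before c.
        counts = Counter(text)
        self.first = {}
        total = 0
        for ch in sorted(counts):
            self.first[ch] = total
            total += counts[ch]

        # occ[c][i]: number of occurrences of c in bwt[:i].
        self.occ = {ch: [0] for ch in counts}
        for bwt_char in self.bwt:
            for ch, column in self.occ.items():
                column.append(column[-1] + (ch == bwt_char))

    def count(self, pattern):
        """Count occurrences of pattern across all reads via backward search."""
        lo, hi = 0, len(self.bwt)
        for ch in reversed(pattern):
            if ch not in self.first:
                return 0
            lo = self.first[ch] + self.occ[ch][lo]
            hi = self.first[ch] + self.occ[ch][hi]
            if lo >= hi:
                return 0
        return hi - lo


if __name__ == "__main__":
    index = ToyFMIndex(["ACGTACGT", "CGTAC", "ACGTACGT"])
    for kmer in ("ACG", "CGTAC", "GTC"):
        print(kmer, index.count(kmer))
```

Because each read ends in `$`, a query made only of nucleotides cannot match across a read boundary. In this toy setting, checking for a variant allele would amount to counting the k-mers that overlap it (the abstract uses 31-mers); the sketch counts occurrences only and does not record which sample each read came from.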
••
TL;DR: Deep sequencing of large crosses of butterflies is used to show that there are no long chromosomal regions that remain unbroken during hybridisation, and no long chromosomal inversions anywhere between the two genomes, which suggests that hybridisation is rare enough and mate preference is strong enough that inversions are not necessary to maintain the species barrier.
Abstract: Mechanisms that suppress recombination are known to help maintain species barriers by preventing the breakup of co-adapted gene combinations. The sympatric butterfly species H. melpomene and H. cydno are separated by many strong barriers, but the species still hybridise infrequently in the wild, with around 40% of the genome influenced by introgression. We tested the hypothesis that genetic barriers between the species are reinforced by inversions or other mechanisms to reduce between-species recombination rate. We constructed fine-scale recombination maps for Panamanian populations of both species and hybrids to directly measure recombination rate between these species, and generated long sequence reads to detect inversions. We find no evidence for a systematic reduction in recombination rates in F1 hybrids, and also no evidence for inversions longer than 50 kb that might be involved in generating or maintaining species barriers. This suggests that mechanisms leading to global or local reduction in recombination do not play a significant role in the maintenance of species barriers between H. melpomene and H. cydno.