Showing papers by "Wellcome Trust Sanger Institute" published in 2020
••
TL;DR: A catalogue of predicted loss-of-function variants in 125,748 whole-exome and 15,708 whole-genome sequencing datasets from the Genome Aggregation Database (gnomAD) reveals the spectrum of mutational constraints that affect these human protein-coding genes.
Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes1. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.
4,913 citations
••
TL;DR: This paper surveys the expression of viral entry-associated genes in single-cell RNA-sequencing data from multiple tissues from healthy human donors and co-detects these transcripts in specific respiratory, corneal and intestinal epithelial cells, potentially explaining the high efficiency of SARS-CoV-2 transmission.
Abstract: We investigated SARS-CoV-2 potential tropism by surveying expression of viral entry-associated genes in single-cell RNA-sequencing data from multiple tissues from healthy human donors. We co-detected these transcripts in specific respiratory, corneal and intestinal epithelial cells, potentially explaining the high efficiency of SARS-CoV-2 transmission. These genes are co-expressed in nasal epithelial cells with genes involved in innate immunity, highlighting the cells' potential role in initial viral infection, spread and clearance. The study offers a useful resource for further lines of inquiry with valuable clinical samples from COVID-19 patients and we provide our data in a comprehensive, open and user-friendly fashion at www.covid19cellatlas.org.
2,024 citations
••
TL;DR: Analysis of the compendium of data points to a particularly relevant role for nasal goblet and ciliated cells as early viral targets and potential reservoirs of SARS-CoV-2 infection and underscores the importance of the availability of the Human Cell Atlas as a reference dataset.
Abstract: The SARS-CoV-2 coronavirus, the etiologic agent responsible for COVID-19 coronavirus disease, is a global threat. To better understand viral tropism, we assessed the RNA expression of the coronavirus receptor, ACE2, as well as the viral S protein priming protease TMPRSS2 thought to govern viral entry in single-cell RNA-sequencing (scRNA-seq) datasets from healthy individuals generated by the Human Cell Atlas consortium. We found that ACE2, as well as the protease TMPRSS2, are differentially expressed in respiratory and gut epithelial cells. In-depth analysis of epithelial cells in the respiratory tree reveals that nasal epithelial cells, specifically goblet/secretory cells and ciliated cells, display the highest ACE2 expression of all the epithelial cells analyzed. The skewed expression of viral receptors/entry-associated proteins towards the upper airway may be correlated with enhanced transmissivity. Finally, we showed that many of the top genes associated with ACE2 airway epithelial expression are innate immune-associated, antiviral genes, highly enriched in the nasal epithelial cells. This association with immune pathways might have clinical implications for the course of infection and viral pathology, and highlights the specific significance of nasal epithelia in viral infection. Our findings underscore the importance of the availability of the Human Cell Atlas as a reference dataset. In this instance, analysis of the compendium of data points to a particularly relevant role for nasal goblet and ciliated cells as early viral targets and potential reservoirs of SARS-CoV-2 infection. This, in turn, serves as a biological framework for dissecting viral transmission and developing clinical strategies for prevention and therapy.
1,602 citations
••
TL;DR: The flagship paper of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium describes the generation of the integrative analyses of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types, the structures for international data sharing and standardized analyses, and the main scientific findings from across the consortium studies.
Abstract: Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale1,2,3. Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4–5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter4; identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation5,6; analyses timings and patterns of tumour evolution7; describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity8,9; and evaluates a range of more-specialized features of cancer genomes8,10,11,12,13,14,15,16,17,18.
1,600 citations
••
University of California, San Diego1, Broad Institute2, Harvard University3, National University of Singapore4, Baylor College of Medicine5, National Institutes of Health6, Pompeu Fabra University7, Catalan Institution for Research and Advanced Studies8, Wellcome Trust Sanger Institute9, National Centre for Biological Sciences10, University of Helsinki11
TL;DR: The characterization of 4,645 whole-genome and 19,184 exome sequences, covering most types of cancer, identifies 81 single-base substitution, doublet-base substitution and small-insertion-and-deletion mutational signatures, providing a systematic overview of the mutational processes that contribute to cancer development.
Abstract: Somatic mutations in cancer genomes are caused by multiple mutational processes, each of which generates a characteristic mutational signature1. Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium2 of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), we characterized mutational signatures using 84,729,690 somatic mutations from 4,645 whole-genome and 19,184 exome sequences that encompass most types of cancer. We identified 49 single-base-substitution, 11 doublet-base-substitution, 4 clustered-base-substitution and 17 small insertion-and-deletion signatures. The substantial size of our dataset, compared with previous analyses3–15, enabled the discovery of new signatures, the separation of overlapping signatures and the decomposition of signatures into components that may represent associated, but distinct, DNA damage, repair and/or replication mechanisms. By estimating the contribution of each signature to the mutational catalogues of individual cancer genomes, we revealed associations of signatures to exogenous or endogenous exposures, as well as to defective DNA-maintenance processes. However, many signatures are of unknown cause. This analysis provides a systematic perspective on the repertoire of mutational processes that contribute to the development of human cancer.
1,521 citations
••
TL;DR: The structure and content of CellPhoneDB is outlined, procedures for inferring cell–cell communication networks from single-cell RNA sequencing data are provided and a practical step-by-step guide to help implement the protocol is presented.
Abstract: Cell–cell communication mediated by ligand–receptor complexes is critical to coordinating diverse biological processes, such as development, differentiation and inflammation. To investigate how the context-dependent crosstalk of different cell types enables physiological processes to proceed, we developed CellPhoneDB, a novel repository of ligands, receptors and their interactions. In contrast to other repositories, our database takes into account the subunit architecture of both ligands and receptors, representing heteromeric complexes accurately. We integrated our resource with a statistical framework that predicts enriched cellular interactions between two cell types from single-cell transcriptomics data. Here, we outline the structure and content of our repository, provide procedures for inferring cell–cell communication networks from single-cell RNA sequencing data and present a practical step-by-step guide to help implement the protocol. CellPhoneDB v.2.0 is an updated version of our resource that incorporates additional functionalities to enable users to introduce new interacting molecules and reduces the time and resources needed to interrogate large datasets. CellPhoneDB v.2.0 is publicly available, both as code and as a user-friendly web interface; it can be used by both experts and researchers with little experience in computational genomics. In our protocol, we demonstrate how to evaluate meaningful biological interactions with CellPhoneDB v.2.0 using published datasets. This protocol typically takes ~2 h to complete, from installation to statistical analysis and visualization, for a dataset of ~10 GB, 10,000 cells and 19 cell types, and using five threads. CellPhoneDB combines an interactive database and a statistical framework for the exploration of ligand–receptor interactions inferred from single-cell transcriptomics measurements.
1,392 citations
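The abstract above describes two ideas: ligands and receptors are treated as heteromeric complexes, and a statistical framework tests whether an interaction is enriched between two cell types. The following is a minimal sketch of how such a test could look, not CellPhoneDB's actual code or API. It assumes that a complex is limited by its least-expressed subunit, that the pair score is the mean of the two complex levels, and that significance comes from shuffling cell-type labels. All function names, gene names and toy values are hypothetical.

```python
import random
from statistics import mean


def cluster_mean(values, labels, cell_type):
    """Mean expression of one gene over the cells carrying a given label."""
    picked = [v for v, lab in zip(values, labels) if lab == cell_type]
    return mean(picked) if picked else 0.0


def complex_level(expr, labels, subunits, cell_type):
    """Assumption: a heteromeric complex is limited by its lowest subunit."""
    return min(cluster_mean(expr[g], labels, cell_type) for g in subunits)


def pair_score(expr, labels, ligand, receptor, sender, receiver):
    """Assumption: score = mean of ligand level in sender and receptor level in receiver."""
    lig = complex_level(expr, labels, ligand, sender)
    rec = complex_level(expr, labels, receptor, receiver)
    return (lig + rec) / 2


def permutation_p_value(expr, labels, ligand, receptor, sender, receiver,
                        n_perm=1000, seed=0):
    """Empirical p-value from shuffling cell-type labels across cells."""
    rng = random.Random(seed)
    observed = pair_score(expr, labels, ligand, receptor, sender, receiver)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pair_score(expr, shuffled, ligand, receptor, sender, receiver) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy data: six cells, two cell types, one ligand, a two-subunit receptor.
toy_expr = {
    "LIG": [5, 6, 5, 0, 0, 0],
    "REC_A": [0, 0, 0, 4, 5, 4],
    "REC_B": [0, 0, 0, 1, 1, 1],
}
toy_labels = ["T", "T", "T", "M", "M", "M"]

score, p_value = permutation_p_value(
    toy_expr, toy_labels, ("LIG",), ("REC_A", "REC_B"), "T", "M"
)
print(f"score={score:.3f} p={p_value:.3f}")
```

In this toy case the receptor level is set by the weakly expressed `REC_B` subunit rather than by `REC_A`, which is the practical consequence of modelling the complex rather than a single gene.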
••
F. Kyle Satterstrom1,2, Jack A. Kosmicki, Jiebiao Wang3 +198 more • Institutions (53)
TL;DR: The largest exome sequencing study of autism spectrum disorder (ASD) to date, using an enhanced analytical framework to integrate de novo and case-control rare variation, identifies 102 risk genes at a false discovery rate of 0.1 or less, consistent with multiple paths to an excitatory-inhibitory imbalance underlying ASD.
1,169 citations
••
TL;DR: A novel tool, purge_dups, is presented that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps, and that can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly.
Abstract: Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines. Availability and implementation: The source code is written in C and is available at https://github.com/dfguan/purge_dups. Supplementary information: Supplementary data are available at Bioinformatics online.
728 citations
••
Wellcome Trust Sanger Institute1, Max Delbrück Center for Molecular Medicine2, European Bioinformatics Institute3, Harvard University4, University of Hamburg5, Sapporo Medical University6, Technische Universität München7, National Institutes of Health8, Brigham and Women's Hospital9, Howard Hughes Medical Institute10, University of Cambridge11, Sun Yat-sen University12, University of Alberta13, British Heart Foundation14
TL;DR: State-of-the-art analyses of large-scale single-cell and single-nucleus transcriptomes are used to construct a cellular atlas of the human heart that will aid further research into cardiac physiology and disease and provide a valuable reference for future studies.
Abstract: Cardiovascular disease is the leading cause of death worldwide. Advanced insights into disease mechanisms and therapeutic strategies require a deeper understanding of the molecular processes involved in the healthy heart. Knowledge of the full repertoire of cardiac cells and their gene expression profiles is a fundamental first step in this endeavour. Here, using state-of-the-art analyses of large-scale single-cell and single-nucleus transcriptomes, we characterize six anatomical adult heart regions. Our results highlight the cellular heterogeneity of cardiomyocytes, pericytes and fibroblasts, and reveal distinct atrial and ventricular subsets of cells with diverse developmental origins and specialized properties. We define the complexity of the cardiac vasculature and its changes along the arterio-venous axis. In the immune compartment, we identify cardiac-resident macrophages with inflammatory and protective transcriptional signatures. Furthermore, analyses of cell-to-cell interactions highlight different networks of macrophages, fibroblasts and cardiomyocytes between atria and ventricles that are distinct from those of skeletal muscle. Our human cardiac cell atlas improves our understanding of the human heart and provides a valuable reference for future studies.
703 citations
••
University of Duisburg-Essen1, University of Düsseldorf2, Harvard University3, University of Warsaw4, St. Vincent's Institute of Medical Research5, University of Melbourne6, Johns Hopkins University7, Swiss Institute of Bioinformatics8, The Turing Institute9, Western General Hospital10, BC Cancer Agency11, University of British Columbia12, ETH Zurich13, Delft University of Technology14, Leiden University Medical Center15, Broad Institute16, Georgia State University17, Heidelberg Institute for Theoretical Studies18, Karlsruhe Institute of Technology19, Centrum Wiskunde & Informatica20, Utrecht University21, University of Amsterdam22, Imperial College London23, Radboud University Nijmegen24, University Medical Center Groningen25, Wageningen University and Research Centre26, University of Connecticut27, Wellcome Trust Sanger Institute28, University of Cambridge29, European Bioinformatics Institute30, Max Planck Society31, Saarland University32, Zuse Institute Berlin33, German Cancer Research Center34, Leiden University35, I.M. Sechenov First Moscow State Medical University36, Princeton University37, Memorial Sloan Kettering Cancer Center38
TL;DR: This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years in single-cell data science.
Abstract: The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
677 citations
••
National Institutes of Health1, Wellcome Trust Sanger Institute2, Rockefeller University3, University of California, Davis4, European Bioinformatics Institute5, Seoul National University6, Max Planck Society7, Durham University8, University of Massachusetts Amherst9, University of Adelaide10, University of Missouri11, East Carolina University12, University of Queensland13, Queen Mary University of London14, Wellington Management Company15, University of Arizona16, Natural History Museum17, Bangor University18, University of Konstanz19, Northeastern University20, Naturalis21, University of Graz22, Florida Museum of Natural History23, University of California, Santa Cruz24, Pacific Biosciences25, University of Maryland, College Park26, Harbin Institute of Technology27, University of Chicago28, Oregon Health & Science University29, Monash University Malaysia Campus30, University of Milan31, University of Copenhagen32, Pennsylvania State University33, University of Los Andes34, Agency for Science, Technology and Research35, Royal Ontario Museum36, Smithsonian Conservation Biology Institute37, University of East Anglia38, Pompeu Fabra University39, University College Dublin40, University of Illinois at Urbana–Champaign41, La Trobe University42, University of California, San Diego43, UPRRP College of Natural Sciences44, Dresden University of Technology45
TL;DR: Drawing on lessons from sequencing and assembling 16 species, the authors have embarked on the Vertebrate Genomes Project, an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.
Abstract: High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are only available for a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling the most accurate and complete reference genomes to date. Here we summarize these developments, introduce a set of quality standards, and present lessons learned from sequencing and assembling 16 species representing major vertebrate lineages (mammals, birds, reptiles, amphibians, teleost fishes and cartilaginous fishes). We confirm that long-read sequencing technologies are essential for maximizing genome quality and that unresolved complex repeats and haplotype heterozygosity are major sources of error in assemblies. Our new assemblies identify and correct substantial errors in some of the best historical reference genomes. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.
••
Wellcome Trust Sanger Institute1, European Bioinformatics Institute2, Francis Crick Institute3, Broad Institute4, University of Oxford5, University of Cambridge6, University of Toronto7, Oregon Health & Science University8, University of Texas MD Anderson Cancer Center9, German Cancer Research Center10, Heidelberg University11, University of Ljubljana12, NorthShore University HealthSystem13, Vancouver Prostate Centre14, Simon Fraser University15, Walter and Eliza Hall Institute of Medical Research16, University of Melbourne17, Katholieke Universiteit Leuven18, Cornell University19, University of California, Santa Cruz20, Ontario Institute for Cancer Research21, University of California, Los Angeles22, Peter MacCallum Cancer Centre23, Harvard University24, Indiana University25, University of Chicago26, University of Cologne27, University of Helsinki28, University of Glasgow29
TL;DR: Whole-genome sequencing data for 2,778 cancer samples from 2,658 unique donors is used to reconstruct the evolutionary history of cancer, revealing that driver mutations can precede diagnosis by several years to decades.
Abstract: Cancer develops through a process of somatic evolution1,2. Sequencing data from a single biopsy represent a snapshot of this process that can reveal the timing of specific genomic aberrations and the changing influence of mutational processes3. Here, by whole-genome sequencing analysis of 2,658 cancers as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA)4, we reconstruct the life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer. Early oncogenesis is characterized by mutations in a constrained set of driver genes, and specific copy number gains, such as trisomy 7 in glioblastoma and isochromosome 17q in medulloblastoma. The mutational spectrum changes significantly throughout tumour evolution in 40% of samples. A nearly fourfold diversification of driver genes and increased genomic instability are features of later stages. Copy number alterations often occur in mitotic crises, and lead to simultaneous gains of chromosomal segments. Timing analyses suggest that driver mutations often precede diagnosis by many years, if not decades. Together, these results determine the evolutionary trajectories of cancer, and highlight opportunities for early cancer detection.
••
University of California, Santa Cruz1, National Institutes of Health2, University of Washington3, Johns Hopkins University4, University of California, San Diego5, Stowers Institute for Medical Research6, Wellcome Trust Sanger Institute7, Washington University in St. Louis8, University of California, Davis9, University of Birmingham10, University of Nottingham11, University of Pittsburgh12, Duke University13
TL;DR: High-coverage, ultra-long-read nanopore sequencing is used to create a new human genome assembly that improves on the coverage and accuracy of the current reference (GRCh38) and includes the gap-free, telomere-to-telomere sequence of the X chromosome.
Abstract: After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist1,2. Here we present a human genome assembly that surpasses the continuity of GRCh382, along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome3, we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.
••
TL;DR: The authors identified shared biology and host-directed drug targets to prioritize therapeutics with potential for rapid deployment against current and future coronavirus outbreaks, and found that individuals with genotypes corresponding to higher soluble IL17RA levels in plasma are at decreased risk of COVID-19 hospitalization.
Abstract: The COVID-19 pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a grave threat to public health and the global economy. SARS-CoV-2 is closely related to the more lethal but less transmissible coronaviruses SARS-CoV-1 and Middle East respiratory syndrome coronavirus (MERS-CoV). Here, we have carried out comparative viral-human protein-protein interaction and viral protein localization analyses for all three viruses. Subsequent functional genetic screening identified host factors that functionally impinge on coronavirus proliferation, including Tom70, a mitochondrial chaperone protein that interacts with both SARS-CoV-1 and SARS-CoV-2 ORF9b, an interaction we structurally characterized using cryo-electron microscopy. Combining genetically validated host factors with both COVID-19 patient genetic data and medical billing records identified molecular mechanisms and potential drug treatments that merit further molecular and clinical study.
••
TL;DR: Whole-genome sequencing data from more than 2,500 cancers of 38 tumour types reveal 16 signatures that can be used to classify somatic structural variants, highlighting the diversity of genomic rearrangements in cancer.
Abstract: A key mutational process in cancer is structural variation, in which rearrangements delete, amplify or reorder genomic segments that range in size from kilobases to whole chromosomes1–7. Here we develop methods to group, classify and describe somatic structural variants, using data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), which aggregated whole-genome sequencing data from 2,658 cancers across 38 tumour types8. Sixteen signatures of structural variation emerged. Deletions have a multimodal size distribution, assort unevenly across tumour types and patients, are enriched in late-replicating regions and correlate with inversions. Tandem duplications also have a multimodal size distribution, but are enriched in early-replicating regions, as are unbalanced translocations. Replication-based mechanisms of rearrangement generate varied chromosomal structures with low-level copy-number gains and frequent inverted rearrangements. One prominent structure consists of 2–7 templates copied from distinct regions of the genome strung together within one locus. Such cycles of templated insertions correlate with tandem duplications and, in liver cancer, frequently activate the telomerase gene TERT. A wide variety of rearrangement processes are active in cancer, which generate complex configurations of the genome upon which selection can act.
••
TL;DR: BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies, is presented, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the browser-based Viewer.
Abstract: Reconstruction of target genomes from sequence data produced by instruments that are agnostic as to the species-of-origin may be confounded by contaminant DNA. Whether introduced during sample processing or through co-extraction alongside the target DNA, if insufficient care is taken during the assembly process, the final assembled genome may be a mixture of data from several species. Such assemblies can confound sequence-based biological inference and, when deposited in public databases, may be included in downstream analyses by users unaware of underlying problems. We present BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies. BlobToolKit can be used to process assembly, read and analysis files for fully reproducible interactive exploration in the browser-based Viewer. BlobToolKit can be used during assembly to filter non-target DNA, helping researchers produce assemblies with high biological credibility. We have been running an automated BlobToolKit pipeline on eukaryotic assemblies publicly available in the International Nucleotide Sequence Data Collaboration and are making the results available through a public instance of the Viewer at https://blobtoolkit.genomehubs.org/view. We aim to complete analysis of all publicly available genomes and then maintain currency with the flow of new genomes. We have worked to embed these views into the presentation of genome assemblies at the European Nucleotide Archive, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the Viewer.
••
TL;DR: The utility of comprehensive screening of HCWs with minimal or no symptoms for SARS-CoV-2 testing is demonstrated, and this approach will be critical for protecting patients and hospital staff.
Abstract: Significant differences exist in the availability of healthcare worker (HCW) SARS-CoV-2 testing between countries, and existing programmes focus on screening symptomatic rather than asymptomatic staff. Over a 3-week period (April 2020), 1032 asymptomatic HCWs were screened for SARS-CoV-2 in a large UK teaching hospital. Symptomatic staff and symptomatic household contacts were additionally tested. Real-time RT-PCR was used to detect viral RNA from a throat+nose self-swab. 3% of HCWs in the asymptomatic screening group tested positive for SARS-CoV-2. 17/30 (57%) were truly asymptomatic/pauci-symptomatic. 12/30 (40%) had experienced symptoms compatible with coronavirus disease 2019 (COVID-19) >7 days prior to testing, most self-isolating, returning well. Clusters of HCW infection were discovered on two independent wards. Viral genome sequencing showed that the majority of HCWs had the dominant lineage B.1. Our data demonstrate the utility of comprehensive screening of HCWs with minimal or no symptoms. This approach will be critical for protecting patients and hospital staff.
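The percentages in this abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes that the denominator of 30 in "17/30" and "12/30" is the number of positives in the asymptomatic screening group; the abstract implies this but does not state it outright. The variable names are illustrative only.

```python
# Figures quoted in the abstract.
screened = 1032            # asymptomatic HCWs screened
positives = 30             # assumption: inferred from the "x/30" denominators
truly_asymptomatic = 17    # asymptomatic or pauci-symptomatic
prior_symptoms = 12        # symptoms more than 7 days before testing


def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


print(pct(positives, screened))            # positivity in the screening group
print(pct(truly_asymptomatic, positives))  # share with no or minimal symptoms
print(pct(prior_symptoms, positives))      # share with earlier symptoms

# Positives not covered by the two categories quoted in the abstract.
uncategorised = positives - truly_asymptomatic - prior_symptoms
print(uncategorised)
```

Under that assumption the rounded values agree with the quoted 3%, 57% and 40%, and one positive case falls outside the two categories the abstract reports.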
••
TL;DR: The authors' study adds data about African, Oceanian, and Amerindian populations and indicates that diversity tends to result from differences at the single-nucleotide level rather than copy number variation.
Abstract: Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented common genetic variation private to southern Africa, central Africa, Oceania, and the Americas, but an absence of such variants fixed between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the past 10,000 years, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations.
••
TL;DR: Using single cell transcriptome sequencing, the authors identify multiple astrocyte subtypes in the adult mouse CNS, which map to distinct spatial locations and show correlations to cell morphology and physiology.
Abstract: Astrocytes, a major cell type found throughout the central nervous system, have general roles in the modulation of synapse formation and synaptic transmission, blood-brain barrier formation, and regulation of blood flow, as well as metabolic support of other brain resident cells. Crucially, emerging evidence shows specific adaptations and astrocyte-encoded functions in regions, such as the spinal cord and cerebellum. To investigate the true extent of astrocyte molecular diversity across forebrain regions, we used single-cell RNA sequencing. Our analysis identifies five transcriptomically distinct astrocyte subtypes in adult mouse cortex and hippocampus. Validation of our data in situ reveals distinct spatial positioning of defined subtypes, reflecting the distribution of morphologically and physiologically distinct astrocyte populations. Our findings are evidence for specialized astrocyte subtypes between and within brain regions. The data are available through an online database (https://holt-sc.glialab.org/), providing a resource on which to base explorations of local astrocyte diversity and function in the brain.
••
TL;DR: SoupX, a tool for removing ambient RNA contamination from droplet-based single-cell RNA sequencing experiments, has broad applicability, and its application can improve the biological utility of existing and future datasets.
Abstract: Background: Droplet-based single-cell RNA sequence analyses assume that all acquired RNAs are endogenous to cells. However, any cell-free RNAs contained within the input solution are also captured by these assays. This sequencing of cell-free RNA constitutes a background contamination that confounds the biological interpretation of single-cell transcriptomic data. Results: We demonstrate that contamination from this "soup" of cell-free RNAs is ubiquitous, with experiment-specific variations in composition and magnitude. We present a method, SoupX, for quantifying the extent of the contamination and estimating "background-corrected" cell expression profiles that seamlessly integrate with existing downstream analysis tools. Applying this method to several datasets using multiple droplet sequencing technologies, we demonstrate that its application improves biological interpretation of otherwise misleading data, as well as improving quality control metrics. Conclusions: We present SoupX, a tool for removing ambient RNA contamination from droplet-based single-cell RNA sequencing experiments. This tool has broad applicability, and its application can improve the biological utility of existing and future datasets.
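The background-correction idea in this abstract can be illustrated with a minimal sketch. It is not the SoupX implementation (SoupX is an R package and its estimation procedure is more involved); it only assumes that the ambient profile can be estimated from near-empty droplets and that a contamination fraction `rho` is already known. The function names, the `rho` parameter and the toy counts are all hypothetical.

```python
def estimate_soup_profile(empty_droplets):
    """Estimate each gene's share of the ambient RNA from near-empty droplets.

    empty_droplets: list of per-droplet count lists, one entry per gene.
    Returns a list of fractions that sums to 1.
    """
    totals = [sum(col) for col in zip(*empty_droplets)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def correct_ambient(cell_counts, soup_profile, rho):
    """Subtract the expected ambient counts from one cell's counts.

    rho is the assumed fraction of the cell's molecules that come from the
    soup; corrected values are floored at zero.
    """
    n_umis = sum(cell_counts)
    return [max(c - rho * n_umis * p, 0.0)
            for c, p in zip(cell_counts, soup_profile)]


# Toy data: three genes, two near-empty droplets, one cell.
profile = estimate_soup_profile([[1, 0, 1], [1, 0, 1]])
corrected = correct_ambient([10, 10, 0], profile, rho=0.1)
```

In this toy case the cell's second gene is absent from the soup and is left untouched, while the first gene loses its expected ambient share.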
••
Broad Institute1, University of Hohenheim2, Beth Israel Deaconess Medical Center3, Icahn School of Medicine at Mount Sinai4, University of Oxford5, Vanderbilt University Medical Center6, University of Geneva7, Wellcome Trust Sanger Institute8, Harvard University9, University of Melbourne10, Baylor College of Medicine11, Boston Children's Hospital12, University of California, San Francisco13, National Institutes of Health14, University of Washington15, Howard Hughes Medical Institute16, University of Cambridge17
TL;DR: Progress is described in the study of human genetics, in which rapid advances in technology, foundational genomic resources and analytical tools have contributed to the understanding of the mechanisms responsible for many rare and common diseases and to preventative and therapeutic strategies for many of these conditions.
Abstract: A primary goal of human genetics is to identify DNA sequence variants that influence biomedical traits, particularly those related to the onset and progression of human disease. Over the past 25 years, progress in realizing this objective has been transformed by advances in technology, foundational genomic resources and analytical tools, and by access to vast amounts of genotype and phenotype data. Genetic discoveries have substantially improved our understanding of the mechanisms responsible for many rare and common diseases and driven development of novel preventative and therapeutic strategies. Medical innovation will increasingly focus on delivering care tailored to individual patterns of genetic predisposition.
••
TL;DR: Real-time genomic surveillance of SARS-CoV-2 in a UK hospital was established and showed the benefit of combined genomic and epidemiological analysis for the investigation of health-care associated COVID-19 cases.
Abstract: Background: The burden and influence of health-care associated severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections is unknown. We aimed to examine the use of rapid SARS-CoV-2 sequencing combined with detailed epidemiological analysis to investigate health-care associated SARS-CoV-2 infections and inform infection control measures. Methods: In this prospective surveillance study, we set up rapid SARS-CoV-2 nanopore sequencing from PCR-positive diagnostic samples collected from our hospital (Cambridge, UK) and a random selection from hospitals in the East of England, enabling sample-to-sequence in less than 24 h. We established a weekly review and reporting system with integration of genomic and epidemiological data to investigate suspected health-care associated COVID-19 cases. Findings: Between March 13 and April 24, 2020, we collected clinical data and samples from 5613 patients with COVID-19 from across the East of England. We sequenced 1000 samples producing 747 high-quality genomes. We combined epidemiological and genomic analysis of the 299 patients from our hospital and identified 35 clusters of identical viruses involving 159 patients. 92 (58%) of 159 patients had strong epidemiological links and 32 (20%) patients had plausible epidemiological links. These results were fed back to clinical, infection control, and hospital management teams, leading to infection-control interventions and informing patient safety reporting. Interpretation: We established real-time genomic surveillance of SARS-CoV-2 in a UK hospital and showed the benefit of combined genomic and epidemiological analysis for the investigation of health-care associated COVID-19. This approach enabled us to detect cryptic transmission events and identify opportunities to target infection-control interventions to further reduce health-care associated infections. Our findings have important implications for national public health policy as they enable rapid tracking and investigation of infections in hospital and community settings. Funding: COVID-19 Genomics UK (supported by UK Research and Innovation, the National Institute of Health Research, the Wellcome Sanger Institute), the Wellcome Trust, the Academy of Medical Sciences and the Health Foundation, and the National Institute for Health Research Cambridge Biomedical Research Centre.
••
Wellcome Trust Sanger Institute1, Newcastle University2, Ghent University3, University of Cambridge4, Wellcome Trust/Cancer Research UK Gurdon Institute5, Francis Crick Institute6, University College London7, Laboratory of Molecular Biology8, Freeman Hospital9, Cambridge University Hospitals NHS Foundation Trust10
TL;DR: The authors' single-cell transcriptome profile of the thymus across the human lifetime and across species provides a high-resolution census of T cell development within the native tissue microenvironment, and identifies novel subpopulations of human thymic fibroblasts and epithelial cells and locates them in situ.
Abstract: The thymus provides a nurturing environment for the differentiation and selection of T cells, a process orchestrated by their interaction with multiple thymic cell types. We used single-cell RNA sequencing to create a cell census of the human thymus across the life span and to reconstruct T cell differentiation trajectories and T cell receptor (TCR) recombination kinetics. Using this approach, we identified and located in situ CD8αα+ T cell populations, thymic fibroblast subtypes, and activated dendritic cell states. In addition, we reveal a bias in TCR recombination and selection, which is attributed to genomic position and the kinetics of lineage commitment. Taken together, our data provide a comprehensive atlas of the human thymus across the life span with new insights into human T cell development.
••
TL;DR: This work describes the tried and tested approach for assembly curation using gEVAL, the genome evaluation browser, and outlines the procedures applied to genome curation using gEVAL as well as recommendations for assembly curation in a gEVAL-independent context, to facilitate the uptake of genome curation in the wider community.
Abstract: Background
Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes.
Results
Whilst working towards improved data sets and fully automated pipelines, assembly evaluation and curation is actively employed to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality.
Conclusions
We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.
••
27 Jul 2020
TL;DR: Deep transfer learning is used to quantify histopathological patterns across 17,355 hematoxylin and eosin-stained histopathology slide images from 28 cancer types and correlate these with matched genomic, transcriptomic and survival data, showing the remarkable potential of computer vision in characterizing the molecular basis of tumor histopathology.
Abstract: We use deep transfer learning to quantify histopathological patterns across 17,355 hematoxylin and eosin-stained histopathology slide images from 28 cancer types and correlate these with matched genomic, transcriptomic and survival data. This approach accurately classifies cancer types and provides spatially resolved tumor and normal tissue distinction. Automatically learned computational histopathological features correlate with a large range of recurrent genetic aberrations across cancer types. This includes whole-genome duplications, which display universal features across cancer types, individual chromosomal aneuploidies, focal amplifications and deletions, as well as driver gene mutations. There are widespread associations between bulk gene expression levels and histopathology, which reflect tumor composition and enable the localization of transcriptomically defined tumor-infiltrating lymphocytes. Computational histopathology augments prognosis based on histopathological subtyping and grading, and highlights prognostically relevant areas such as necrosis or lymphocytic aggregates. These findings show the remarkable potential of computer vision in characterizing the molecular basis of tumor histopathology. Two papers by Kather and colleagues and Gerstung and colleagues develop workflows to predict a wide range of molecular alterations from pan-cancer digital pathology slides.
••
TL;DR: In this article, the effect of cold storage on fresh healthy spleen, esophagus, and lung from ≥ 5 donors over 72 h was assessed, and robust protocols for tissue preservation for up to 24 h prior to scRNA-seq analysis were presented.
Abstract: The Human Cell Atlas is a large international collaborative effort to map all cell types of the human body. Single-cell RNA sequencing can generate high-quality data for the delivery of such an atlas. However, delays between fresh sample collection and processing may lead to poor data and difficulties in experimental design. This study assesses the effect of cold storage on fresh healthy spleen, esophagus, and lung from ≥ 5 donors over 72 h. We collect 240,000 high-quality single-cell transcriptomes with detailed cell type annotations and whole genome sequences of donors, enabling future eQTL studies. Our data provide a valuable resource for the study of these 3 organs and will allow cross-organ comparison of cell types. We see little effect of cold ischemic time on cell yield, total number of reads per cell, and other quality control metrics in any of the tissues within the first 24 h. However, we observe a decrease in the proportions of lung T cells at 72 h, higher percentage of mitochondrial reads, and increased contamination by background ambient RNA reads in the 72-h samples in the spleen, which is cell type specific. In conclusion, we present robust protocols for tissue preservation for up to 24 h prior to scRNA-seq analysis. This greatly facilitates the logistics of sample collection for Human Cell Atlas or clinical studies since it increases the time frames for sample processing.
••
TL;DR: This work presents Multi-Omics Factor Analysis v2 (MOFA+), a statistical framework for the comprehensive and scalable integration of single-cell multi-modal data that reconstructs a low-dimensional representation of the data using computationally efficient variational inference and supports flexible sparsity constraints.
Abstract: Technological advances have enabled the profiling of multiple molecular layers at single-cell resolution, assaying cells from multiple samples or conditions. Consequently, there is a growing need for computational strategies to analyze data from complex experimental designs that include multiple data modalities and multiple groups of samples. We present Multi-Omics Factor Analysis v2 (MOFA+), a statistical framework for the comprehensive and scalable integration of single-cell multi-modal data. MOFA+ reconstructs a low-dimensional representation of the data using computationally efficient variational inference and supports flexible sparsity constraints, allowing variation to be modeled jointly across multiple sample groups and data modalities.
••
Princess Margaret Cancer Centre1, Ontario Institute for Cancer Research2, University Health Network3, Lunenfeld-Tanenbaum Research Institute4, University of California, San Diego5, Cold Spring Harbor Laboratory6, Lustgarten Foundation7, Imperial College London8, Vancouver General Hospital9, University of British Columbia10, University of Toronto11, McGill University Health Centre12, McGill University13, University of Cambridge14, Wellcome Trust Sanger Institute15
TL;DR: Whole-genome sequencing, transcriptome sequencing and single-cell analysis of primary and metastatic pancreatic adenocarcinoma identify molecular subtypes and intratumor heterogeneity, and support the premise that the constellation of genomic aberrations in the tumor gives rise to the molecular subtype.
Abstract: Pancreatic adenocarcinoma presents as a spectrum of a highly aggressive disease in patients. The basis of this disease heterogeneity has proved difficult to resolve due to poor tumor cellularity and extensive genomic instability. To address this, a dataset of whole genomes and transcriptomes was generated from purified epithelium of primary and metastatic tumors. Transcriptome analysis demonstrated that molecular subtypes are a product of a gene expression continuum driven by a mixture of intratumoral subpopulations, which was confirmed by single-cell analysis. Integrated whole-genome analysis uncovered that molecular subtypes are linked to specific copy number aberrations in genes such as mutant KRAS and GATA6. By mapping tumor genetic histories, tetraploidization emerged as a key mutational process behind these events. Taken together, these data support the premise that the constellation of genomic aberrations in the tumor gives rise to the molecular subtype, and that disease heterogeneity is due to ongoing genomic instability during progression.
••
TL;DR: To identify novel DD-associated genes, healthcare and research exome sequences are integrated on 31,058 DD parent-offspring trios, and a simulation-based statistical test is developed to identify gene-specific enrichments of DNMs.
Abstract: De novo mutations in protein-coding genes are a well-established cause of developmental disorders1. However, genes known to be associated with developmental disorders account for only a minority of the observed excess of such de novo mutations1,2. Here, to identify previously undescribed genes associated with developmental disorders, we integrate healthcare and research exome-sequence data from 31,058 parent-offspring trios of individuals with developmental disorders, and develop a simulation-based statistical test to identify gene-specific enrichment of de novo mutations. We identified 285 genes that were significantly associated with developmental disorders, including 28 that had not previously been robustly associated with developmental disorders. Although we detected more genes associated with developmental disorders, much of the excess of de novo mutations in protein-coding genes remains unaccounted for. Modelling suggests that more than 1,000 genes associated with developmental disorders have not yet been described, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of genes associated with developmental disorders.
••
National Institute for Health Research1, Harvard University2, Montreal Heart Institute3, University of North Carolina at Chapel Hill4, Wellcome Trust Sanger Institute5, VA Boston Healthcare System6, Osaka University7, Icahn School of Medicine at Mount Sinai8, University of Wisconsin–Milwaukee9, Kyushu University10, University of Washington11, University of Bristol12, University of Copenhagen13, Erasmus University Medical Center14, National Institutes of Health15, Veterans Health Administration16, Kaiser Permanente17, International Agency for Research on Cancer18, Wake Forest University19, Imperial College London20, Broad Institute21, Greifswald University Hospital22, University of Pennsylvania23, British Heart Foundation24, Fred Hutchinson Cancer Research Center25, Chinese National Human Genome Center26, Technische Universität München27, University of Tampere28, University of Tokyo29, University of Ioannina30, University of Colorado Denver31, Duke University32, University of Virginia33, University of Minnesota34, Turku University Hospital35, Los Angeles Biomedical Research Institute36, Stanford University37, Mashhad University of Medical Sciences38, NHS Blood and Transplant39, Brigham and Women's Hospital40, University of Oxford41, University of Liège42, European Bioinformatics Institute43, John Radcliffe Hospital44
TL;DR: The results show the power of large-scale blood cell trait GWAS to interrogate clinically meaningful variants across a wide allelic spectrum of human variation.