Showing papers by "Wellcome Trust Sanger Institute" published in 2020


Journal Article
27 May 2020-Nature
TL;DR: A catalogue of predicted loss-of-function variants in 125,748 whole-exome and 15,708 whole-genome sequencing datasets from the Genome Aggregation Database (gnomAD) reveals the spectrum of mutational constraints that affect these human protein-coding genes.
Abstract: Genetic variants that inactivate protein-coding genes are a powerful source of information about the phenotypic consequences of gene disruption: genes that are crucial for the function of an organism will be depleted of such variants in natural populations, whereas non-essential genes will tolerate their accumulation. However, predicted loss-of-function variants are enriched for annotation errors, and tend to be found at extremely low frequencies, so their analysis requires careful variant annotation and very large sample sizes. Here we describe the aggregation of 125,748 exomes and 15,708 genomes from human sequencing studies into the Genome Aggregation Database (gnomAD). We identify 443,769 high-confidence predicted loss-of-function variants in this cohort after filtering for artefacts caused by sequencing and annotation errors. Using an improved model of human mutation rates, we classify human protein-coding genes along a spectrum that represents tolerance to inactivation, validate this classification using data from model organisms and engineered human cells, and show that it can be used to improve the power of gene discovery for both common and rare diseases.

4,913 citations


Journal Article
TL;DR: The expression of viral entry-associated genes was surveyed in single-cell RNA-sequencing data from multiple tissues from healthy human donors, and these transcripts were co-detected in specific respiratory, corneal and intestinal epithelial cells, potentially explaining the high efficiency of SARS-CoV-2 transmission.
Abstract: We investigated SARS-CoV-2 potential tropism by surveying expression of viral entry-associated genes in single-cell RNA-sequencing data from multiple tissues from healthy human donors. We co-detected these transcripts in specific respiratory, corneal and intestinal epithelial cells, potentially explaining the high efficiency of SARS-CoV-2 transmission. These genes are co-expressed in nasal epithelial cells with genes involved in innate immunity, highlighting the cells' potential role in initial viral infection, spread and clearance. The study offers a useful resource for further lines of inquiry with valuable clinical samples from COVID-19 patients and we provide our data in a comprehensive, open and user-friendly fashion at www.covid19cellatlas.org.

2,024 citations


Journal Article
TL;DR: Analysis of the compendium of data points to a particularly relevant role for nasal goblet and ciliated cells as early viral targets and potential reservoirs of SARS-CoV-2 infection and underscores the importance of the availability of the Human Cell Atlas as a reference dataset.
Abstract: The SARS-CoV-2 coronavirus, the etiologic agent responsible for COVID-19 coronavirus disease, is a global threat. To better understand viral tropism, we assessed the RNA expression of the coronavirus receptor, ACE2, as well as the viral S protein priming protease TMPRSS2 thought to govern viral entry in single-cell RNA-sequencing (scRNA-seq) datasets from healthy individuals generated by the Human Cell Atlas consortium. We found that ACE2, as well as the protease TMPRSS2, are differentially expressed in respiratory and gut epithelial cells. In-depth analysis of epithelial cells in the respiratory tree reveals that nasal epithelial cells, specifically goblet/secretory cells and ciliated cells, display the highest ACE2 expression of all the epithelial cells analyzed. The skewed expression of viral receptors/entry-associated proteins towards the upper airway may be correlated with enhanced transmissivity. Finally, we showed that many of the top genes associated with ACE2 airway epithelial expression are innate immune-associated, antiviral genes, highly enriched in the nasal epithelial cells. This association with immune pathways might have clinical implications for the course of infection and viral pathology, and highlights the specific significance of nasal epithelia in viral infection. Our findings underscore the importance of the availability of the Human Cell Atlas as a reference dataset. In this instance, analysis of the compendium of data points to a particularly relevant role for nasal goblet and ciliated cells as early viral targets and potential reservoirs of SARS-CoV-2 infection. This, in turn, serves as a biological framework for dissecting viral transmission and developing clinical strategies for prevention and therapy.

1,602 citations


Journal Article
Peter J. Campbell, Gad Getz, Jan O. Korbel, Joshua M. Stuart, +1329 more (238 institutions)
06 Feb 2020-Nature
TL;DR: The flagship paper of the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Consortium describes the generation of the integrative analyses of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types, the structures for international data sharing and standardized analyses, and the main scientific findings from across the consortium studies.
Abstract: Cancer is driven by genetic change, and the advent of massively parallel sequencing has enabled systematic documentation of this variation at the whole-genome scale. Here we report the integrative analysis of 2,658 whole-cancer genomes and their matching normal tissues across 38 tumour types from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA). We describe the generation of the PCAWG resource, facilitated by international data sharing using compute clouds. On average, cancer genomes contained 4–5 driver mutations when combining coding and non-coding genomic elements; however, in around 5% of cases no drivers were identified, suggesting that cancer driver discovery is not yet complete. Chromothripsis, in which many clustered structural variants arise in a single catastrophic event, is frequently an early event in tumour evolution; in acral melanoma, for example, these events precede most somatic point mutations and affect several cancer-associated genes simultaneously. Cancers with abnormal telomere maintenance often originate from tissues with low replicative activity and show several mechanisms of preventing telomere attrition to critical levels. Common and rare germline variants affect patterns of somatic mutation, including point mutations, structural variants and somatic retrotransposition. A collection of papers from the PCAWG Consortium describes non-coding mutations that drive cancer beyond those in the TERT promoter; identifies new signatures of mutational processes that cause base substitutions, small insertions and deletions and structural variation; analyses timings and patterns of tumour evolution; describes the diverse transcriptional consequences of somatic mutation on splicing, expression levels, fusion genes and promoter activity; and evaluates a range of more-specialized features of cancer genomes.

1,600 citations


Journal Article
05 Feb 2020-Nature
TL;DR: The characterization of 4,645 whole-genome and 19,184 exome sequences, covering most types of cancer, identifies 81 single-base substitution, doublet-base substitution and small-insertion-and-deletion mutational signatures, providing a systematic overview of the mutational processes that contribute to cancer development.
Abstract: Somatic mutations in cancer genomes are caused by multiple mutational processes, each of which generates a characteristic mutational signature. Here, as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), we characterized mutational signatures using 84,729,690 somatic mutations from 4,645 whole-genome and 19,184 exome sequences that encompass most types of cancer. We identified 49 single-base-substitution, 11 doublet-base-substitution, 4 clustered-base-substitution and 17 small insertion-and-deletion signatures. The substantial size of our dataset, compared with previous analyses, enabled the discovery of new signatures, the separation of overlapping signatures and the decomposition of signatures into components that may represent associated, but distinct, DNA damage, repair and/or replication mechanisms. By estimating the contribution of each signature to the mutational catalogues of individual cancer genomes, we revealed associations of signatures to exogenous or endogenous exposures, as well as to defective DNA-maintenance processes. However, many signatures are of unknown cause. This analysis provides a systematic perspective on the repertoire of mutational processes that contribute to the development of human cancer.

1,521 citations
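
The abstract above mentions estimating the contribution of each signature to the mutational catalogue of an individual genome. As a rough illustration only, that step can be framed as a non-negative fit of one catalogue to a set of fixed signature profiles. The sketch below does this with multiplicative updates in plain Python. The function name `fit_exposures`, the four-channel toy signatures and the counts are all hypothetical; the abstract does not describe the consortium's actual fitting procedure, and this is not it.

```python
def fit_exposures(signatures, catalogue, iterations=500):
    """Estimate non-negative exposures e[k] such that
    sum_k e[k] * signatures[k][i] approximates catalogue[i].

    Uses multiplicative updates for non-negative least squares
    (an assumption for this sketch, not the published method).
    """
    n_sig = len(signatures)
    n_channels = len(catalogue)
    exposures = [1.0] * n_sig

    # W^T v does not change between iterations, so compute it once.
    numer = [
        sum(signatures[k][i] * catalogue[i] for i in range(n_channels))
        for k in range(n_sig)
    ]

    for _ in range(iterations):
        # Current reconstruction of the catalogue from the exposures.
        recon = [
            sum(exposures[k] * signatures[k][i] for k in range(n_sig))
            for i in range(n_channels)
        ]
        # W^T (W e), the denominator of the multiplicative update.
        denom = [
            sum(signatures[k][i] * recon[i] for i in range(n_channels))
            for k in range(n_sig)
        ]
        exposures = [
            exposures[k] * numer[k] / denom[k] if denom[k] > 0 else 0.0
            for k in range(n_sig)
        ]
    return exposures


# Hypothetical toy data: two signatures over four mutation channels.
toy_signatures = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
# Catalogue built as 100 * first signature + 50 * second signature.
toy_catalogue = [75, 15, 15, 45]
toy_exposures = fit_exposures(toy_signatures, toy_catalogue)
```

On this toy input the fitted exposures should come out close to 100 and 50, the values used to build the catalogue. Real catalogues are noisy and use far more channels, so a production analysis would need a dedicated tool rather than this sketch.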


Journal Article
TL;DR: The structure and content of CellPhoneDB is outlined, procedures for inferring cell–cell communication networks from single-cell RNA sequencing data are provided and a practical step-by-step guide to help implement the protocol is presented.
Abstract: Cell–cell communication mediated by ligand–receptor complexes is critical to coordinating diverse biological processes, such as development, differentiation and inflammation. To investigate how the context-dependent crosstalk of different cell types enables physiological processes to proceed, we developed CellPhoneDB, a novel repository of ligands, receptors and their interactions. In contrast to other repositories, our database takes into account the subunit architecture of both ligands and receptors, representing heteromeric complexes accurately. We integrated our resource with a statistical framework that predicts enriched cellular interactions between two cell types from single-cell transcriptomics data. Here, we outline the structure and content of our repository, provide procedures for inferring cell–cell communication networks from single-cell RNA sequencing data and present a practical step-by-step guide to help implement the protocol. CellPhoneDB v.2.0 is an updated version of our resource that incorporates additional functionalities to enable users to introduce new interacting molecules and reduces the time and resources needed to interrogate large datasets. CellPhoneDB v.2.0 is publicly available, both as code and as a user-friendly web interface; it can be used by both experts and researchers with little experience in computational genomics. In our protocol, we demonstrate how to evaluate meaningful biological interactions with CellPhoneDB v.2.0 using published datasets. This protocol typically takes ~2 h to complete, from installation to statistical analysis and visualization, for a dataset of ~10 GB, 10,000 cells and 19 cell types, and using five threads. CellPhoneDB combines an interactive database and a statistical framework for the exploration of ligand–receptor interactions inferred from single-cell transcriptomics measurements.

1,392 citations
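
The abstract above refers to a statistical framework that predicts enriched interactions between two cell types. One common way to test this kind of enrichment, sketched here as an assumption rather than as the tool's own code, is a label-permutation test: score a ligand-receptor pair for a pair of clusters, then shuffle the cluster labels to see how often a score at least as large arises by chance. The gene names `LIG` and `REC`, the cluster names and the function names are hypothetical.

```python
import random
from statistics import mean


def interaction_score(cells, ligand, receptor, sender, receiver):
    """Mean of the sender's average ligand level and the receiver's
    average receptor level."""
    lig = [c["expr"].get(ligand, 0.0) for c in cells if c["cluster"] == sender]
    rec = [c["expr"].get(receptor, 0.0) for c in cells if c["cluster"] == receiver]
    if not lig or not rec:
        return 0.0
    return (mean(lig) + mean(rec)) / 2


def permutation_p_value(cells, ligand, receptor, sender, receiver,
                        n_perm=1000, seed=0):
    """Empirical p-value from shuffling cluster labels across cells."""
    rng = random.Random(seed)
    observed = interaction_score(cells, ligand, receptor, sender, receiver)
    labels = [c["cluster"] for c in cells]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        shuffled = [
            {"cluster": label, "expr": c["expr"]}
            for label, c in zip(labels, cells)
        ]
        if interaction_score(shuffled, ligand, receptor, sender, receiver) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: cluster A expresses the ligand, B the receptor.
cells = (
    [{"cluster": "A", "expr": {"LIG": 10.0}} for _ in range(5)]
    + [{"cluster": "B", "expr": {"REC": 10.0}} for _ in range(5)]
    + [{"cluster": "C", "expr": {}} for _ in range(10)]
)
p = permutation_p_value(cells, "LIG", "REC", "A", "B")
```

Shuffling labels keeps cluster sizes fixed while breaking the link between a cell's label and its expression, which is what makes the shuffled scores a usable null distribution. Details such as expression thresholds and handling of multi-subunit complexes are left out of this sketch.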


Journal Article
06 Feb 2020-Cell
TL;DR: The largest exome sequencing study of autism spectrum disorder (ASD) to date, using an enhanced analytical framework to integrate de novo and case-control rare variation, identifies 102 risk genes at a false discovery rate of 0.1 or less, consistent with multiple paths to an excitatory-inhibitory imbalance underlying ASD.

1,169 citations


Journal Article
TL;DR: A novel tool, purge_dups, is presented that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps; it can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly.
Abstract: Motivation: Rapid development in long-read sequencing and scaffolding technologies is accelerating the production of reference-quality assemblies for large eukaryotic genomes. However, haplotype divergence in regions of high heterozygosity often results in assemblers creating two copies rather than one copy of a region, leading to breaks in contiguity and compromising downstream steps such as gene annotation. Several tools have been developed to resolve this problem. However, they either focus only on removing contained duplicate regions, also known as haplotigs, or fail to use all the relevant information and hence make errors. Results: Here we present a novel tool, purge_dups, that uses sequence similarity and read depth to automatically identify and remove both haplotigs and heterozygous overlaps. In comparison with current tools, we demonstrate that purge_dups can reduce heterozygous duplication and increase assembly continuity while maintaining completeness of the primary assembly. Moreover, purge_dups is fully automatic and can easily be integrated into assembly pipelines. Availability and implementation: The source code is written in C and is available at https://github.com/dfguan/purge_dups. Supplementary information: Supplementary data are available at Bioinformatics online.

728 citations
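
The abstract above says that sequence similarity and read depth are used together to find haplotigs. A very simplified way to picture combining the two signals is shown below: a contig is flagged only if its read depth sits near half the diploid depth and it also closely matches another contig. The function name `classify_contig`, both thresholds and the example numbers are hypothetical. The real tool is written in C, and the abstract does not describe how it derives its cutoffs or alignments, so none of that is reproduced here.

```python
def classify_contig(mean_depth, diploid_depth, best_identity,
                    depth_tolerance=0.25, min_identity=0.9):
    """Label a contig 'haplotig' or 'primary' from two signals.

    mean_depth     -- average read depth over the contig
    diploid_depth  -- depth expected where both haplotypes are collapsed
    best_identity  -- identity (0..1) of the best match to another contig
    The tolerance and identity thresholds are illustrative guesses.
    """
    half_depth = diploid_depth / 2
    # Reads from one haplotype only give roughly half the diploid depth.
    looks_haploid = abs(mean_depth - half_depth) <= depth_tolerance * half_depth
    # A haplotig should also closely resemble its partner region.
    has_close_match = best_identity >= min_identity
    if looks_haploid and has_close_match:
        return "haplotig"
    return "primary"


# Hypothetical examples with an assumed diploid depth of 60x.
example_labels = [
    classify_contig(31, 60, 0.97),  # half depth, close match
    classify_contig(59, 60, 0.97),  # full depth, close match
    classify_contig(30, 60, 0.50),  # half depth, no close match
]
```

Requiring both signals is the point of the sketch: depth alone would flag any low-coverage contig, and similarity alone would flag genuine repeats.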


Journal Article
24 Sep 2020-Nature
TL;DR: State-of-the-art analyses of large-scale single-cell and single-nucleus transcriptomes are used to construct a cellular atlas of the human heart that will aid further research into cardiac physiology and disease and provide a valuable reference for future studies.
Abstract: Cardiovascular disease is the leading cause of death worldwide. Advanced insights into disease mechanisms and therapeutic strategies require a deeper understanding of the molecular processes involved in the healthy heart. Knowledge of the full repertoire of cardiac cells and their gene expression profiles is a fundamental first step in this endeavour. Here, using state-of-the-art analyses of large-scale single-cell and single-nucleus transcriptomes, we characterize six anatomical adult heart regions. Our results highlight the cellular heterogeneity of cardiomyocytes, pericytes and fibroblasts, and reveal distinct atrial and ventricular subsets of cells with diverse developmental origins and specialized properties. We define the complexity of the cardiac vasculature and its changes along the arterio-venous axis. In the immune compartment, we identify cardiac-resident macrophages with inflammatory and protective transcriptional signatures. Furthermore, analyses of cell-to-cell interactions highlight different networks of macrophages, fibroblasts and cardiomyocytes between atria and ventricles that are distinct from those of skeletal muscle. Our human cardiac cell atlas improves our understanding of the human heart and provides a valuable reference for future studies.

703 citations


Journal Article
TL;DR: This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years in single-cell data science.
Abstract: The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands, or even millions, of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.

677 citations


Posted Content
Arang Rhie, Shane A. McCarthy, Olivier Fedrigo, Joana Damas, Giulio Formenti, Sergey Koren, Marcela Uliano-Silva, William Chow, Arkarachai Fungtammasan, Gregory Gedman, Lindsey J. Cantin, Françoise Thibaud-Nissen, Leanne Haggerty, Chul Hee Lee, Byung June Ko, J. H. Kim, Iliana Bista, Michelle Smith, Bettina Haase, Jacquelyn Mountcastle, Sylke Winkler, Sadye Paez, Jason T. Howard, Sonja C. Vernes, Tanya M. Lama, Frank Grützner, Wesley C. Warren, Christopher N. Balakrishnan, Dave W Burt, Jimin George, Matthew T. Biegler, David Iorns, Andrew Digby, Daryl Eason, Taylor Edwards, Mark Wilkinson, George F. Turner, Axel Meyer, Andreas F. Kautt, Paolo Franchini, H. William Detrich, Hannes Svardal, Maximilian Wagner, Gavin J. P. Naylor, Martin Pippel, Milan Malinsky, Mark Mooney, Maria Simbirsky, Brett T. Hannigan, Trevor Pesout, Marlys L. Houck, Ann C Misuraca, Sarah B. Kingan, Richard Hall, Zev N. Kronenberg, Jonas Korlach, Ivan Sović, Christopher Dunn, Zemin Ning, Alex Hastie, Joyce V. Lee, Siddarth Selvaraj, Richard E. Green, Nicholas H. Putnam, Jay Ghurye, Erik Garrison, Ying Sims, Joanna Collins, Sarah Pelan, James Torrance, Alan Tracey, Jonathan Wood, Dengfeng Guan, Sarah E. London, David F. Clayton, Claudio V. Mello, Samantha R. Friedrich, Peter V. Lovell, Ekaterina Osipova, Farooq O. Al-Ajli, Simona Secomandi, Heebal Kim, Constantina Theofanopoulou, Yang Zhou, Robert S. Harris, Kateryna D. Makova, Paul Medvedev, Jinna Hoffman, Patrick Masterson, Karen Clark, Fergal J. Martin, Kevin L. Howe, Paul Flicek, Brian P. Walenz, Woori Kwak, Hiram Clawson, Mark Diekhans, Luis R Nassar, Benedict Paten, Robert H. S. Kraus, Harris A. Lewin, Andrew J. Crawford, M. Thomas P. Gilbert, Guojie Zhang, Byrappa Venkatesh, Robert W. Murphy, Klaus-Peter Koepfli, Beth Shapiro, Warren E. Johnson, Federica Di Palma, Tomas Marques-Bonet, Emma C. Teeling, Tandy Warnow, Jennifer A. Marshall Graves, Oliver A. Ryder, David Haussler, Stephen J. O'Brien, Kerstin Howe, Eugene W. Myers, Richard Durbin, Adam M. Phillippy, Erich D. Jarvis
23 May 2020-bioRxiv
TL;DR: The Vertebrate Genomes Project (VGP) is launched as an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.
Abstract: High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are only available for a few non-microbial species. To address this issue, the international Genome 10K (G10K) consortium has worked over a five-year period to evaluate and develop cost-effective methods for assembling the most accurate and complete reference genomes to date. Here we summarize these developments, introduce a set of quality standards, and present lessons learned from sequencing and assembling 16 species representing major vertebrate lineages (mammals, birds, reptiles, amphibians, teleost fishes and cartilaginous fishes). We confirm that long-read sequencing technologies are essential for maximizing genome quality and that unresolved complex repeats and haplotype heterozygosity are major sources of error in assemblies. Our new assemblies identify and correct substantial errors in some of the best historical reference genomes. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an effort to generate high-quality, complete reference genomes for all ~70,000 extant vertebrate species and help enable a new era of discovery across the life sciences.

Journal Article
06 Feb 2020-Nature
TL;DR: Whole-genome sequencing data for 2,778 cancer samples from 2,658 unique donors is used to reconstruct the evolutionary history of cancer, revealing that driver mutations can precede diagnosis by several years to decades.
Abstract: Cancer develops through a process of somatic evolution. Sequencing data from a single biopsy represent a snapshot of this process that can reveal the timing of specific genomic aberrations and the changing influence of mutational processes. Here, by whole-genome sequencing analysis of 2,658 cancers as part of the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), we reconstruct the life history and evolution of mutational processes and driver mutation sequences of 38 types of cancer. Early oncogenesis is characterized by mutations in a constrained set of driver genes, and specific copy number gains, such as trisomy 7 in glioblastoma and isochromosome 17q in medulloblastoma. The mutational spectrum changes significantly throughout tumour evolution in 40% of samples. A nearly fourfold diversification of driver genes and increased genomic instability are features of later stages. Copy number alterations often occur in mitotic crises, and lead to simultaneous gains of chromosomal segments. Timing analyses suggest that driver mutations often precede diagnosis by many years, if not decades. Together, these results determine the evolutionary trajectories of cancer, and highlight opportunities for early cancer detection.

Journal Article
03 Sep 2020-Nature
TL;DR: High-coverage, ultra-long-read nanopore sequencing is used to create a new human genome assembly that improves on the coverage and accuracy of the current reference (GRCh38) and includes the gap-free, telomere-to-telomere sequence of the X chromosome.
Abstract: After two decades of improvements, the current human reference genome (GRCh38) is the most accurate and complete vertebrate genome ever produced. However, no single chromosome has been finished end to end, and hundreds of unresolved gaps persist. Here we present a human genome assembly that surpasses the continuity of GRCh38, along with a gapless, telomere-to-telomere assembly of a human chromosome. This was enabled by high-coverage, ultra-long-read nanopore sequencing of the complete hydatidiform mole CHM13 genome, combined with complementary technologies for quality improvement and validation. Focusing our efforts on the human X chromosome, we reconstructed the centromeric satellite DNA array (approximately 3.1 Mb) and closed the 29 remaining gaps in the current reference, including new sequences from the human pseudoautosomal regions and from cancer-testis ampliconic gene families (CT-X and GAGE). These sequences will be integrated into future human reference genome releases. In addition, the complete chromosome X, combined with the ultra-long nanopore data, allowed us to map methylation patterns across complex tandem repeats and satellite arrays. Our results demonstrate that finishing the entire human genome is now within reach, and the data presented here will facilitate ongoing efforts to complete the other human chromosomes.

Journal Article
04 Dec 2020-Science
TL;DR: The authors identified shared biology and host-directed drug targets to prioritize therapeutics with potential for rapid deployment against current and future coronavirus outbreaks, and found that individuals with genotypes corresponding to higher soluble IL17RA levels in plasma are at decreased risk of COVID-19 hospitalization.
Abstract: The COVID-19 pandemic, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is a grave threat to public health and the global economy. SARS-CoV-2 is closely related to the more lethal but less transmissible coronaviruses SARS-CoV-1 and Middle East respiratory syndrome coronavirus (MERS-CoV). Here, we have carried out comparative viral-human protein-protein interaction and viral protein localization analyses for all three viruses. Subsequent functional genetic screening identified host factors that functionally impinge on coronavirus proliferation, including Tom70, a mitochondrial chaperone protein that interacts with both SARS-CoV-1 and SARS-CoV-2 ORF9b, an interaction we structurally characterized using cryo-electron microscopy. Combining genetically validated host factors with both COVID-19 patient genetic data and medical billing records identified molecular mechanisms and potential drug treatments that merit further molecular and clinical study.

Journal Article
05 Feb 2020-Nature
TL;DR: Whole-genome sequencing data from more than 2,500 cancers of 38 tumour types reveal 16 signatures that can be used to classify somatic structural variants, highlighting the diversity of genomic rearrangements in cancer.
Abstract: A key mutational process in cancer is structural variation, in which rearrangements delete, amplify or reorder genomic segments that range in size from kilobases to whole chromosomes. Here we develop methods to group, classify and describe somatic structural variants, using data from the Pan-Cancer Analysis of Whole Genomes (PCAWG) Consortium of the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA), which aggregated whole-genome sequencing data from 2,658 cancers across 38 tumour types. Sixteen signatures of structural variation emerged. Deletions have a multimodal size distribution, assort unevenly across tumour types and patients, are enriched in late-replicating regions and correlate with inversions. Tandem duplications also have a multimodal size distribution, but are enriched in early-replicating regions, as are unbalanced translocations. Replication-based mechanisms of rearrangement generate varied chromosomal structures with low-level copy-number gains and frequent inverted rearrangements. One prominent structure consists of 2-7 templates copied from distinct regions of the genome strung together within one locus. Such cycles of templated insertions correlate with tandem duplications and, in liver cancer, frequently activate the telomerase gene TERT. A wide variety of rearrangement processes are active in cancer, which generate complex configurations of the genome upon which selection can act.

Journal Article
TL;DR: BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies, is presented, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the browser-based Viewer.
Abstract: Reconstruction of target genomes from sequence data produced by instruments that are agnostic as to the species-of-origin may be confounded by contaminant DNA. Whether introduced during sample processing or through co-extraction alongside the target DNA, if insufficient care is taken during the assembly process, the final assembled genome may be a mixture of data from several species. Such assemblies can confound sequence-based biological inference and, when deposited in public databases, may be included in downstream analyses by users unaware of underlying problems. We present BlobToolKit, a software suite to aid researchers in identifying and isolating non-target data in draft and publicly available genome assemblies. BlobToolKit can be used to process assembly, read and analysis files for fully reproducible interactive exploration in the browser-based Viewer. BlobToolKit can be used during assembly to filter non-target DNA, helping researchers produce assemblies with high biological credibility. We have been running an automated BlobToolKit pipeline on eukaryotic assemblies publicly available in the International Nucleotide Sequence Data Collaboration and are making the results available through a public instance of the Viewer at https://blobtoolkit.genomehubs.org/view. We aim to complete analysis of all publicly available genomes and then maintain currency with the flow of new genomes. We have worked to embed these views into the presentation of genome assemblies at the European Nucleotide Archive, providing an indication of assembly quality alongside the public record with links out to allow full exploration in the Viewer.

Journal Article
11 May 2020-eLife
TL;DR: The utility of comprehensive screening of HCWs with minimal or no symptoms for SARS-CoV-2 testing is demonstrated, and this approach will be critical for protecting patients and hospital staff.
Abstract: Significant differences exist in the availability of healthcare worker (HCW) SARS-CoV-2 testing between countries, and existing programmes focus on screening symptomatic rather than asymptomatic staff. Over a 3-week period (April 2020), 1032 asymptomatic HCWs were screened for SARS-CoV-2 in a large UK teaching hospital. Symptomatic staff and symptomatic household contacts were additionally tested. Real-time RT-PCR was used to detect viral RNA from a throat+nose self-swab. 3% of HCWs in the asymptomatic screening group tested positive for SARS-CoV-2. 17/30 (57%) were truly asymptomatic/pauci-symptomatic. 12/30 (40%) had experienced symptoms compatible with coronavirus disease 2019 (COVID-19) more than 7 days prior to testing, most self-isolating, returning well. Clusters of HCW infection were discovered on two independent wards. Viral genome sequencing showed that the majority of HCWs had the dominant lineage B.1. Our data demonstrate the utility of comprehensive screening of HCWs with minimal or no symptoms. This approach will be critical for protecting patients and hospital staff.

Journal Article
20 Mar 2020-Science
TL;DR: The authors' study adds data about African, Oceanian, and Amerindian populations and indicates that diversity tends to result from differences at the single-nucleotide level rather than copy number variation.
Abstract: Genome sequences from diverse human groups are needed to understand the structure of genetic variation in our species and the history of, and relationships between, different populations. We present 929 high-coverage genome sequences from 54 diverse human populations, 26 of which are physically phased using linked-read sequencing. Analyses of these genomes reveal an excess of previously undocumented common genetic variation private to southern Africa, central Africa, Oceania, and the Americas, but an absence of such variants fixed between major geographical regions. We also find deep and gradual population separations within Africa, contrasting population size histories between hunter-gatherer and agriculturalist groups in the past 10,000 years, and a contrast between single Neanderthal but multiple Denisovan source populations contributing to present-day human populations.

Journal Article
TL;DR: Using single cell transcriptome sequencing, the authors identify multiple astrocyte subtypes in the adult mouse CNS, which map to distinct spatial locations and show correlations to cell morphology and physiology.
Abstract: Astrocytes, a major cell type found throughout the central nervous system, have general roles in the modulation of synapse formation and synaptic transmission, blood-brain barrier formation, and regulation of blood flow, as well as metabolic support of other brain resident cells. Crucially, emerging evidence shows specific adaptations and astrocyte-encoded functions in regions, such as the spinal cord and cerebellum. To investigate the true extent of astrocyte molecular diversity across forebrain regions, we used single-cell RNA sequencing. Our analysis identifies five transcriptomically distinct astrocyte subtypes in adult mouse cortex and hippocampus. Validation of our data in situ reveals distinct spatial positioning of defined subtypes, reflecting the distribution of morphologically and physiologically distinct astrocyte populations. Our findings are evidence for specialized astrocyte subtypes between and within brain regions. The data are available through an online database (https://holt-sc.glialab.org/), providing a resource on which to base explorations of local astrocyte diversity and function in the brain.

Journal ArticleDOI
TL;DR: SoupX, a tool for removing ambient RNA contamination from droplet-based single-cell RNA sequencing experiments, has broad applicability, and its application can improve the biological utility of existing and future datasets.
Abstract: Background Droplet-based single-cell RNA sequence analyses assume that all acquired RNAs are endogenous to cells. However, any cell-free RNAs contained within the input solution are also captured by these assays. This sequencing of cell-free RNA constitutes a background contamination that confounds the biological interpretation of single-cell transcriptomic data. Results We demonstrate that contamination from this "soup" of cell-free RNAs is ubiquitous, with experiment-specific variations in composition and magnitude. We present a method, SoupX, for quantifying the extent of the contamination and estimating "background-corrected" cell expression profiles that seamlessly integrate with existing downstream analysis tools. Applying this method to several datasets using multiple droplet sequencing technologies, we demonstrate that its application improves biological interpretation of otherwise misleading data, as well as improving quality control metrics. Conclusions We present SoupX, a tool for removing ambient RNA contamination from droplet-based single-cell RNA sequencing experiments. This tool has broad applicability, and its application can improve the biological utility of existing and future datasets.

Journal ArticleDOI
08 Jan 2020-Nature
TL;DR: Progress is described in the study of human genetics, in which rapid advances in technology, foundational genomic resources and analytical tools have contributed to the understanding of the mechanisms responsible for many rare and common diseases and to preventative and therapeutic strategies for many of these conditions.
Abstract: A primary goal of human genetics is to identify DNA sequence variants that influence biomedical traits, particularly those related to the onset and progression of human disease. Over the past 25 years, progress in realizing this objective has been transformed by advances in technology, foundational genomic resources and analytical tools, and by access to vast amounts of genotype and phenotype data. Genetic discoveries have substantially improved our understanding of the mechanisms responsible for many rare and common diseases and driven development of novel preventative and therapeutic strategies. Medical innovation will increasingly focus on delivering care tailored to individual patterns of genetic predisposition.

Journal ArticleDOI
TL;DR: Real-time genomic surveillance of SARS-CoV-2 in a UK hospital was established and showed the benefit of combined genomic and epidemiological analysis for the investigation of health-care associated COVID-19 cases.
Abstract: Summary Background The burden and influence of health-care associated severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infections is unknown. We aimed to examine the use of rapid SARS-CoV-2 sequencing combined with detailed epidemiological analysis to investigate health-care associated SARS-CoV-2 infections and inform infection control measures. Methods In this prospective surveillance study, we set up rapid SARS-CoV-2 nanopore sequencing from PCR-positive diagnostic samples collected from our hospital (Cambridge, UK) and a random selection from hospitals in the East of England, enabling sample-to-sequence in less than 24 h. We established a weekly review and reporting system with integration of genomic and epidemiological data to investigate suspected health-care associated COVID-19 cases. Findings Between March 13 and April 24, 2020, we collected clinical data and samples from 5613 patients with COVID-19 from across the East of England. We sequenced 1000 samples producing 747 high-quality genomes. We combined epidemiological and genomic analysis of the 299 patients from our hospital and identified 35 clusters of identical viruses involving 159 patients. 92 (58%) of 159 patients had strong epidemiological links and 32 (20%) patients had plausible epidemiological links. These results were fed back to clinical, infection control, and hospital management teams, leading to infection-control interventions and informing patient safety reporting. Interpretation We established real-time genomic surveillance of SARS-CoV-2 in a UK hospital and showed the benefit of combined genomic and epidemiological analysis for the investigation of health-care associated COVID-19. This approach enabled us to detect cryptic transmission events and identify opportunities to target infection-control interventions to further reduce health-care associated infections. Our findings have important implications for national public health policy as they enable rapid tracking and investigation of infections in hospital and community settings. Funding COVID-19 Genomics UK (supported by UK Research and Innovation, the National Institute of Health Research, the Wellcome Sanger Institute), the Wellcome Trust, the Academy of Medical Sciences and the Health Foundation, and the National Institute for Health Research Cambridge Biomedical Research Centre.

Journal ArticleDOI
21 Feb 2020-Science
TL;DR: The authors' single-cell transcriptome profile of the thymus across the human lifetime and across species provides a high-resolution census of T cell development within the native tissue microenvironment, identifies novel subpopulations of human thymic fibroblasts and epithelial cells, and locates them in situ.
Abstract: The thymus provides a nurturing environment for the differentiation and selection of T cells, a process orchestrated by their interaction with multiple thymic cell types. We used single-cell RNA sequencing to create a cell census of the human thymus across the life span and to reconstruct T cell differentiation trajectories and T cell receptor (TCR) recombination kinetics. Using this approach, we identified and located in situ CD8αα+ T cell populations, thymic fibroblast subtypes, and activated dendritic cell states. In addition, we reveal a bias in TCR recombination and selection, which is attributed to genomic position and the kinetics of lineage commitment. Taken together, our data provide a comprehensive atlas of the human thymus across the life span with new insights into human T cell development.

Posted ContentDOI
13 Aug 2020-bioRxiv
TL;DR: This work describes a tried and tested approach for assembly curation using gEVAL, the genome evaluation browser, and outlines the procedures applied to genome curation using gEVAL, together with recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.
Abstract: Background Genome sequence assemblies provide the basis for our understanding of biology. Generating error-free assemblies is therefore the ultimate, but sadly still unachieved goal of a multitude of research projects. Despite the ever-advancing improvements in data generation, assembly algorithms and pipelines, no automated approach has so far reliably generated near error-free genome assemblies for eukaryotes. Results Whilst working towards improved data sets and fully automated pipelines, assembly evaluation and curation is actively employed to bridge this shortcoming and significantly reduce the number of assembly errors. In addition to this increase in product value, the insights gained from assembly curation are fed back into the automated assembly strategy and contribute to notable improvements in genome assembly quality. Conclusions We describe our tried and tested approach for assembly curation using gEVAL, the genome evaluation browser. We outline the procedures applied to genome curation using gEVAL and also our recommendations for assembly curation in a gEVAL-independent context to facilitate the uptake of genome curation in the wider community.

Journal ArticleDOI
27 Jul 2020
TL;DR: Deep transfer learning is used to quantify histopathological patterns across 17,355 hematoxylin and eosin-stained histopathology slide images from 28 cancer types and correlate these with matched genomic, transcriptomic and survival data, showing the remarkable potential of computer vision in characterizing the molecular basis of tumor histopathology.
Abstract: We use deep transfer learning to quantify histopathological patterns across 17,355 hematoxylin and eosin-stained histopathology slide images from 28 cancer types and correlate these with matched genomic, transcriptomic and survival data. This approach accurately classifies cancer types and provides spatially resolved tumor and normal tissue distinction. Automatically learned computational histopathological features correlate with a large range of recurrent genetic aberrations across cancer types. This includes whole-genome duplications, which display universal features across cancer types, individual chromosomal aneuploidies, focal amplifications and deletions, as well as driver gene mutations. There are widespread associations between bulk gene expression levels and histopathology, which reflect tumor composition and enable the localization of transcriptomically defined tumor-infiltrating lymphocytes. Computational histopathology augments prognosis based on histopathological subtyping and grading, and highlights prognostically relevant areas such as necrosis or lymphocytic aggregates. These findings show the remarkable potential of computer vision in characterizing the molecular basis of tumor histopathology. Two papers by Kather and colleagues and Gerstung and colleagues develop workflows to predict a wide range of molecular alterations from pan-cancer digital pathology slides.

Journal ArticleDOI
TL;DR: In this article, the effect of cold storage on fresh healthy spleen, esophagus, and lung from ≥ 5 donors over 72 h was assessed, and robust protocols for tissue preservation for up to 24 h prior to scRNA-seq analysis were presented.
Abstract: The Human Cell Atlas is a large international collaborative effort to map all cell types of the human body. Single-cell RNA sequencing can generate high-quality data for the delivery of such an atlas. However, delays between fresh sample collection and processing may lead to poor data and difficulties in experimental design. This study assesses the effect of cold storage on fresh healthy spleen, esophagus, and lung from ≥ 5 donors over 72 h. We collect 240,000 high-quality single-cell transcriptomes with detailed cell type annotations and whole genome sequences of donors, enabling future eQTL studies. Our data provide a valuable resource for the study of these 3 organs and will allow cross-organ comparison of cell types. We see little effect of cold ischemic time on cell yield, total number of reads per cell, and other quality control metrics in any of the tissues within the first 24 h. However, we observe a decrease in the proportions of lung T cells at 72 h, higher percentage of mitochondrial reads, and increased contamination by background ambient RNA reads in the 72-h samples in the spleen, which is cell type specific. In conclusion, we present robust protocols for tissue preservation for up to 24 h prior to scRNA-seq analysis. This greatly facilitates the logistics of sample collection for Human Cell Atlas or clinical studies since it increases the time frames for sample processing.

Journal ArticleDOI
TL;DR: This work presents Multi-Omics Factor Analysis v2 (MOFA+), a statistical framework for the comprehensive and scalable integration of single-cell multi-modal data that reconstructs a low-dimensional representation of the data using computationally efficient variational inference and supports flexible sparsity constraints.
Abstract: Technological advances have enabled the profiling of multiple molecular layers at single-cell resolution, assaying cells from multiple samples or conditions. Consequently, there is a growing need for computational strategies to analyze data from complex experimental designs that include multiple data modalities and multiple groups of samples. We present Multi-Omics Factor Analysis v2 (MOFA+), a statistical framework for the comprehensive and scalable integration of single-cell multi-modal data. MOFA+ reconstructs a low-dimensional representation of the data using computationally efficient variational inference and supports flexible sparsity constraints, allowing joint modeling of variation across multiple sample groups and data modalities.

Journal ArticleDOI
TL;DR: Whole-genome sequencing, transcriptome sequencing and single-cell analysis of primary and metastatic pancreatic adenocarcinoma identify molecular subtypes and intratumor heterogeneity, and support the premise that the constellation of genomic aberrations in the tumor gives rise to the molecular subtype.
Abstract: Pancreatic adenocarcinoma presents as a spectrum of a highly aggressive disease in patients. The basis of this disease heterogeneity has proved difficult to resolve due to poor tumor cellularity and extensive genomic instability. To address this, a dataset of whole genomes and transcriptomes was generated from purified epithelium of primary and metastatic tumors. Transcriptome analysis demonstrated that molecular subtypes are a product of a gene expression continuum driven by a mixture of intratumoral subpopulations, which was confirmed by single-cell analysis. Integrated whole-genome analysis uncovered that molecular subtypes are linked to specific copy number aberrations in genes such as mutant KRAS and GATA6. By mapping tumor genetic histories, tetraploidization emerged as a key mutational process behind these events. Taken together, these data support the premise that the constellation of genomic aberrations in the tumor gives rise to the molecular subtype, and that disease heterogeneity is due to ongoing genomic instability during progression.

Journal ArticleDOI
14 Oct 2020-Nature
TL;DR: To identify novel genes associated with developmental disorders (DD), healthcare and research exome sequences from 31,058 DD parent-offspring trios are integrated, and a simulation-based statistical test is developed to identify gene-specific enrichment of de novo mutations (DNMs).
Abstract: De novo mutations in protein-coding genes are a well-established cause of developmental disorders1. However, genes known to be associated with developmental disorders account for only a minority of the observed excess of such de novo mutations1,2. Here, to identify previously undescribed genes associated with developmental disorders, we integrate healthcare and research exome-sequence data from 31,058 parent-offspring trios of individuals with developmental disorders, and develop a simulation-based statistical test to identify gene-specific enrichment of de novo mutations. We identified 285 genes that were significantly associated with developmental disorders, including 28 that had not previously been robustly associated with developmental disorders. Although we detected more genes associated with developmental disorders, much of the excess of de novo mutations in protein-coding genes remains unaccounted for. Modelling suggests that more than 1,000 genes associated with developmental disorders have not yet been described, many of which are likely to be less penetrant than the currently known genes. Research access to clinical diagnostic datasets will be critical for completing the map of genes associated with developmental disorders.

Journal ArticleDOI
Dragana Vuckovic1, Erik L. Bao2, Parsa Akbari1, Caleb A. Lareau2, Abdou Mousas3, Tao Jiang1, Ming-Huei Chen, Laura M. Raffield4, Manuel Tardaguila5, Jennifer E. Huffman6, Scott C. Ritchie1, Karyn Megy1, Hannes Ponstingl5, Christopher J. Penkett1, Patrick K. Albers5, Emilie M. Wigdor5, Saori Sakaue7, Arden Moscati8, Regina Manansala9, Ken Sin Lo3, Huijun Qian4, Masato Akiyama10, Traci M. Bartz11, Yoav Ben-Shlomo12, Andrew D Beswick12, Jette Bork-Jensen13, Erwin P. Bottinger8, Jennifer A. Brody11, Frank J. A. van Rooij14, Kumaraswamy Naidu Chitrala15, Peter W.F. Wilson16, Hélène Choquet17, John Danesh, Emanuele Di Angelantonio, Niki Dimou18, Jingzhong Ding19, Paul Elliott20, Tõnu Esko21, Michele K. Evans15, Stephan B. Felix22, James S. Floyd11, Linda Broer14, Niels Grarup13, Michael H. Guo23, Qi Guo24, Andreas Greinacher22, Jeffrey Haessler25, Torben Hansen13, J. M. M. Howson1, Wei Huang26, Eric Jorgenson17, Tim Kacprowski27, Mika Kähönen28, Yoichiro Kamatani29, Masahiro Kanai2, Savita Karthikeyan24, Fotios Koskeridis30, Leslie A. Lange31, Terho Lehtimäki, Allan Linneberg13, Yongmei Liu32, Leo-Pekka Lyytikäinen, Ani Manichaikul33, Koichi Matsuda29, Karen L. Mohlke4, Nina Mononen, Yoshinori Murakami29, Girish N. Nadkarni8, Kjell Nikus28, Nathan Pankratz34, Oluf Pedersen13, Michael Preuss8, Bruce M. Psaty11, Olli T. Raitakari35, Stephen S. Rich33, Benjamin Rodriguez, Jonathan D. Rosen4, Jerome I. Rotter36, Petra Schubert6, Cassandra N. Spracklen4, Praveen Surendran5, Hua Tang37, Jean-Claude Tardif3, Mohsen Ghanbari38, Uwe Völker22, Henry Völzke22, Nicholas A. Watkins39, Stefan Weiss22, VA Million Veteran Program5, Na Cai5, Kousik Kundu5, Stephen B. Watt5, Klaudia Walter5, Alan B. Zonderman15, Kelly Cho40, Yun Li4, Ruth J. F. Loos8, Julian C. Knight41, Michel Georges42, Oliver Stegle43, Evangelos Evangelou20, Yukinori Okada7, David J. Roberts44, Michael Inouye, Andrew D. Johnson, Paul L. Auer9, William J. Astle1, Alexander P. Reiner11, Adam S. Butterworth, Willem H. Ouwehand1, Guillaume Lettre3, Vijay G. Sankaran2, Vijay G. Sankaran21, Nicole Soranzo
03 Sep 2020-Cell
TL;DR: The results show the power of large-scale blood cell trait GWAS to interrogate clinically meaningful variants across a wide allelic spectrum of human variation.