Showing papers on "Genome" published in 2021


Journal ArticleDOI
TL;DR: The KEGG pathway maps are now integrated with network variation maps in the NETWORK database, as well as with conserved functional units of KEGG modules and reaction modules in the MODULE database, and the KO database for functional orthologs continues to be improved.
Abstract: KEGG (https://www.kegg.jp/) is a manually curated resource integrating eighteen databases categorized into systems, genomic, chemical and health information. It also provides KEGG mapping tools, which enable understanding of cellular and organism-level functions from genome sequences and other molecular datasets. KEGG mapping is a predictive method of reconstructing molecular network systems from molecular building blocks based on the concept of functional orthologs. Since the introduction of the KEGG NETWORK database, various diseases have been associated with network variants, which are perturbed molecular networks caused by human gene variants, viruses, other pathogens and environmental factors. The network variation maps are created as aligned sets of related networks showing, for example, how different viruses inhibit or activate specific cellular signaling pathways. The KEGG pathway maps are now integrated with network variation maps in the NETWORK database, as well as with conserved functional units of KEGG modules and reaction modules in the MODULE database. The KO database for functional orthologs continues to be improved and virus KOs are being expanded for better understanding of virus-cell interactions and for enabling prediction of viral perturbations.

2,087 citations


Journal ArticleDOI
TL;DR: The genetic and phenotypic structure of COVID-19 is important in pathogenesis, and this article highlights the most important of these features compared to other Betacoronaviruses.
Abstract: COVID-19 is a novel coronavirus that emerged with an outbreak of unusual viral pneumonia in Wuhan, China, and then became pandemic. Based on its phylogenetic relationships and genomic structure, COVID-19 belongs to the genus Betacoronavirus. Human Betacoronaviruses (SARS-CoV-2, SARS-CoV, and MERS-CoV) have many similarities, but also have differences in their genomic and phenotypic structure that can influence their pathogenesis. COVID-19 contains single-stranded (positive-sense) RNA associated with a nucleoprotein within a capsid comprised of matrix protein. A typical CoV contains at least six ORFs in its genome. All the structural and accessory proteins are translated from the sgRNAs of CoVs. Four main structural proteins are encoded by ORFs 10, 11 on the one-third of the genome near the 3'-terminus. The genetic and phenotypic structure of COVID-19 is important in pathogenesis. This article highlights the most important of these features compared to other Betacoronaviruses.

670 citations


Journal ArticleDOI
TL;DR: The results suggest SARS-CoV-2 may continue to circulate among the human populations despite herd immunity due to natural infection or vaccination, and further studies of patients with re-infection will shed light on protective correlates important for vaccine design.
Abstract: BACKGROUND: Waning immunity occurs in patients who have recovered from Coronavirus Disease 2019 (COVID-19). However, it remains unclear whether true re-infection occurs. METHODS: Whole genome sequencing was performed directly on respiratory specimens collected during 2 episodes of COVID-19 in a patient. Comparative genome analysis was conducted to differentiate re-infection from persistent viral shedding. Laboratory results, including RT-PCR Ct values and serum Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) IgG, were analyzed. RESULTS: The second episode of asymptomatic infection occurred 142 days after the first symptomatic episode in an apparently immunocompetent patient. During the second episode, there was evidence of acute infection including elevated C-reactive protein and SARS-CoV-2 IgG seroconversion. Viral genomes from first and second episodes belong to different clades/lineages. The virus genome from the first episode contained a stop codon at position 64 of ORF8, leading to a truncation of 58 amino acids. Another 23 nucleotide and 13 amino acid differences located in 9 different proteins, including positions of B and T cell epitopes, were found between viruses from the first and second episodes. Compared to viral genomes in GISAID, the first virus genome was phylogenetically closely related to strains collected in March/April 2020, while the second virus genome was closely related to strains collected in July/August 2020. CONCLUSIONS: Epidemiological, clinical, serological, and genomic analyses confirmed that the patient had re-infection instead of persistent viral shedding from first infection. Our results suggest SARS-CoV-2 may continue to circulate among humans despite herd immunity due to natural infection. Further studies of patients with re-infection will shed light on protective immunological correlates for guiding vaccine design.

670 citations
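
The comparative genome analysis in the abstract above rests on counting differences between two viral genomes. The abstract does not describe the tooling used, so the following is only a minimal sketch of the idea, assuming the two genomes have already been aligned to equal length. The function name `nucleotide_differences` and the choice to skip ambiguous (`N`) and gap (`-`) positions are illustrative assumptions, not the authors' method.

```python
def nucleotide_differences(seq_a, seq_b, ignore="N-"):
    """List (position, base_a, base_b) for every mismatch between two
    pre-aligned sequences. Positions are 1-based. Columns where either
    sequence has an ambiguous base or a gap are skipped (an assumption
    made for this sketch)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    diffs = []
    for pos, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper()), start=1):
        if a in ignore or b in ignore:
            continue  # no confident call at this column
        if a != b:
            diffs.append((pos, a, b))
    return diffs


# Toy example with made-up sequences (not real SARS-CoV-2 data).
first_episode = "ACGTN"
second_episode = "ACCTA"
print(nucleotide_differences(first_episode, second_episode))
```

A small number of differences between episodes would be consistent with persistent shedding of one virus, while many differences spread across the genome, as reported in the abstract, point to a distinct second infection.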


Journal ArticleDOI
Arang Rhie1, Shane A. McCarthy2, Shane A. McCarthy3, Olivier Fedrigo4, Joana Damas5, Giulio Formenti4, Sergey Koren1, Marcela Uliano-Silva6, William Chow3, Arkarachai Fungtammasan, J. H. Kim7, Chul Hee Lee7, Byung June Ko7, Mark Chaisson8, Gregory Gedman4, Lindsey J. Cantin4, Françoise Thibaud-Nissen1, Leanne Haggerty9, Iliana Bista2, Iliana Bista3, Michelle Smith3, Bettina Haase4, Jacquelyn Mountcastle4, Sylke Winkler10, Sylke Winkler11, Sadye Paez4, Jason T. Howard, Sonja C. Vernes12, Sonja C. Vernes13, Sonja C. Vernes10, Tanya M. Lama14, Frank Grützner15, Wesley C. Warren16, Christopher N. Balakrishnan17, Dave W Burt18, Jimin George19, Matthew T. Biegler4, David Iorns, Andrew Digby, Daryl Eason, Bruce C. Robertson20, Taylor Edwards21, Mark Wilkinson22, George F. Turner23, Axel Meyer24, Andreas F. Kautt24, Andreas F. Kautt25, Paolo Franchini24, H. William Detrich26, Hannes Svardal27, Hannes Svardal28, Maximilian Wagner29, Gavin J. P. Naylor30, Martin Pippel10, Milan Malinsky31, Milan Malinsky3, Mark Mooney, Maria Simbirsky, Brett T. Hannigan, Trevor Pesout32, Marlys L. Houck33, Ann C Misuraca33, Sarah B. Kingan34, Richard Hall34, Zev N. Kronenberg34, Ivan Sović34, Christopher Dunn34, Zemin Ning3, Alex Hastie, Joyce V. Lee, Siddarth Selvaraj, Richard E. Green32, Nicholas H. Putnam, Ivo Gut35, Jay Ghurye36, Erik Garrison32, Ying Sims3, Joanna Collins3, Sarah Pelan3, James Torrance3, Alan Tracey3, Jonathan Wood3, Robel E. Dagnew8, Dengfeng Guan37, Dengfeng Guan2, Sarah E. London38, David F. Clayton19, Claudio V. Mello39, Samantha R. Friedrich39, Peter V. Lovell39, Ekaterina Osipova10, Farooq O. Al-Ajli40, Farooq O. Al-Ajli41, Simona Secomandi42, Heebal Kim7, Constantina Theofanopoulou4, Michael Hiller43, Yang Zhou, Robert S. Harris44, Kateryna D. Makova44, Paul Medvedev44, Jinna Hoffman1, Patrick Masterson1, Karen Clark1, Fergal J. Martin9, Kevin L. Howe9, Paul Flicek9, Brian P. Walenz1, Woori Kwak, Hiram Clawson32, Mark Diekhans32, Luis R Nassar32, Benedict Paten32, Robert H. S. Kraus24, Robert H. S. Kraus10, Andrew J. Crawford45, M. Thomas P. Gilbert46, M. Thomas P. Gilbert47, Guojie Zhang, Byrappa Venkatesh48, Robert W. Murphy49, Klaus-Peter Koepfli50, Beth Shapiro32, Beth Shapiro51, Warren E. Johnson52, Warren E. Johnson50, Federica Di Palma53, Tomas Marques-Bonet, Emma C. Teeling54, Tandy Warnow55, Jennifer A. Marshall Graves56, Oliver A. Ryder57, Oliver A. Ryder33, David Haussler32, Stephen J. O'Brien58, Jonas Korlach34, Harris A. Lewin5, Kerstin Howe3, Eugene W. Myers11, Eugene W. Myers10, Richard Durbin3, Richard Durbin2, Adam M. Phillippy1, Erich D. Jarvis4, Erich D. Jarvis51
National Institutes of Health1, University of Cambridge2, Wellcome Trust Sanger Institute3, Rockefeller University4, University of California, Davis5, Leibniz Association6, Seoul National University7, University of Southern California8, European Bioinformatics Institute9, Max Planck Society10, Dresden University of Technology11, University of St Andrews12, Radboud University Nijmegen13, University of Massachusetts Amherst14, University of Adelaide15, University of Missouri16, East Carolina University17, University of Queensland18, Clemson University19, University of Otago20, University of Arizona21, Natural History Museum22, Bangor University23, University of Konstanz24, Harvard University25, Northeastern University26, National Museum of Natural History27, University of Antwerp28, University of Graz29, University of Florida30, University of Basel31, University of California, Santa Cruz32, Zoological Society of San Diego33, Pacific Biosciences34, Pompeu Fabra University35, University of Maryland, College Park36, Harbin Institute of Technology37, University of Chicago38, Oregon Health & Science University39, Monash University Malaysia Campus40, Qatar Airways41, University of Milan42, Goethe University Frankfurt43, Pennsylvania State University44, University of Los Andes45, Norwegian University of Science and Technology46, University of Copenhagen47, Agency for Science, Technology and Research48, Royal Ontario Museum49, Smithsonian Institution50, Howard Hughes Medical Institute51, Walter Reed Army Institute of Research52, University of East Anglia53, University College Dublin54, University of Illinois at Urbana–Champaign55, La Trobe University56, University of California, San Diego57, Nova Southeastern University58
28 Apr 2021-Nature
TL;DR: The Vertebrate Genomes Project (VGP) is an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help enable a new era of discovery across the life sciences.
Abstract: High-quality and complete reference genome assemblies are fundamental for the application of genomics to biology, disease, and biodiversity conservation. However, such assemblies are available for only a few non-microbial species [1-4]. To address this issue, the international Genome 10K (G10K) consortium [5,6] has worked over a five-year period to evaluate and develop cost-effective methods for assembling highly accurate and nearly complete reference genomes. Here we present lessons learned from generating assemblies for 16 species that represent six major vertebrate lineages. We confirm that long-read sequencing technologies are essential for maximizing genome quality, and that unresolved complex repeats and haplotype heterozygosity are major sources of assembly error when not handled correctly. Our assemblies correct substantial errors, add missing sequence in some of the best historical reference genomes, and reveal biological discoveries. These include the identification of many false gene duplications, increases in gene sizes, chromosome rearrangements that are specific to lineages, a repeated independent chromosome breakpoint in bat genomes, and a canonical GC-rich pattern in protein-coding genes and their regulatory regions. Adopting these lessons, we have embarked on the Vertebrate Genomes Project (VGP), an international effort to generate high-quality, complete reference genomes for all of the roughly 70,000 extant vertebrate species and to help to enable a new era of discovery across the life sciences.

647 citations


Journal ArticleDOI
TL;DR: The Unified Human Gastrointestinal Genome (UHGG) collection, comprising 204,938 nonredundant genomes from 4,644 gut prokaryotes, is presented, providing comprehensive resources for microbiome researchers.
Abstract: Comprehensive, high-quality reference genomes are required for functional characterization and taxonomic assignment of the human gut microbiota. We present the Unified Human Gastrointestinal Genome (UHGG) collection, comprising 204,938 nonredundant genomes from 4,644 gut prokaryotes. These genomes encode >170 million protein sequences, which we collated in the Unified Human Gastrointestinal Protein (UHGP) catalog. The UHGP more than doubles the number of gut proteins in comparison to those present in the Integrated Gene Catalog. More than 70% of the UHGG species lack cultured representatives, and 40% of the UHGP lack functional annotations. Intraspecies genomic variation analyses revealed a large reservoir of accessory genes and single-nucleotide variants, many of which are specific to individual human populations. The UHGG and UHGP collections will enable studies linking genotypes to phenotypes in the human gut microbiome.

485 citations


Journal ArticleDOI
06 Jan 2021
TL;DR: The BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS, and it compares favorably with other pipelines, e.g. MAKER2, in terms of accuracy and performance.
Abstract: The task of eukaryotic genome annotation remains challenging. Only a few genomes could serve as standards of annotation achieved through a tremendous investment of human curation efforts. Still, the correctness of all alternative isoforms, even in the best-annotated genomes, could be a good subject for further investigation. The new BRAKER2 pipeline generates and integrates external protein support into the iterative process of training and gene prediction by GeneMark-EP+ and AUGUSTUS. BRAKER2 continues the line started by BRAKER1 where self-training GeneMark-ET and AUGUSTUS made gene predictions supported by transcriptomic data. Among the challenges addressed by the new pipeline was a generation of reliable hints to protein-coding exon boundaries from likely homologous but evolutionarily distant proteins. In comparison with other pipelines for eukaryotic genome annotation, BRAKER2 is fully automatic. It is favorably compared under equal conditions with other pipelines, e.g. MAKER2, in terms of accuracy and performance. Development of BRAKER2 should facilitate solving the task of harmonization of annotation of protein-coding genes in genomes of different eukaryotic species. However, we fully understand that several more innovations are needed in transcriptomic and proteomic technologies as well as in algorithmic development to reach the goal of highly accurate annotation of eukaryotic genomes.

455 citations


Journal ArticleDOI
TL;DR: The Genome Taxonomy Database (GTDB) provides a phylogenetically consistent and rank normalized genome-based taxonomy for prokaryotic genomes sourced from the NCBI Assembly database.
Abstract: The Genome Taxonomy Database (GTDB; https://gtdb.ecogenomic.org) provides a phylogenetically consistent and rank normalized genome-based taxonomy for prokaryotic genomes sourced from the NCBI Assembly database. GTDB R06-RS202 spans 254 090 bacterial and 4316 archaeal genomes, a 270% increase since the introduction of the GTDB in November, 2017. These genomes are organized into 45 555 bacterial and 2339 archaeal species clusters which is a 200% increase since the integration of species clusters into the GTDB in June, 2019. Here, we explore prokaryotic diversity from the perspective of the GTDB and highlight the importance of metagenome-assembled genomes in expanding available genomic representation. We also discuss improvements to the GTDB website which allow tracking of taxonomic changes, easy assessment of genome assembly quality, and identification of genomes assembled from type material or used as species representatives. Methodological updates and policy changes made since the inception of the GTDB are then described along with the procedure used to update species clusters in the GTDB. We conclude with a discussion on the use of average nucleotide identities as a pragmatic approach for delineating prokaryotic species.

339 citations


Journal ArticleDOI
TL;DR: The UCSC Genome Browser database has provided high-quality genomics data visualization and genome annotations to the research community for more than two decades, and new features released this past year include a Hi-C heatmap display, a phased family trio display for VCF files, and various track visualization improvements.
Abstract: For more than two decades, the UCSC Genome Browser database (https://genome.ucsc.edu) has provided high-quality genomics data visualization and genome annotations to the research community. As the field of genomics grows and more data become available, new modes of display are required to accommodate new technologies. New features released this past year include a Hi-C heatmap display, a phased family trio display for VCF files, and various track visualization improvements. Striving to keep data up-to-date, new updates to gene annotations include GENCODE Genes, NCBI RefSeq Genes, and Ensembl Genes. New data tracks added for human and mouse genomes include the ENCODE registry of candidate cis-regulatory elements, promoters from the Eukaryotic Promoter Database, and NCBI RefSeq Select and Matched Annotation from NCBI and EMBL-EBI (MANE). Within weeks of learning about the outbreak of coronavirus, UCSC released a genome browser, with detailed annotation tracks, for the SARS-CoV-2 RNA reference assembly.

310 citations


Journal ArticleDOI
02 Apr 2021-Science
TL;DR: In this article, the authors present 64 assembled haplotypes from 32 diverse human genomes, which integrate all forms of genetic variation, even across complex loci, and identify 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing.
Abstract: Long-read and strand-specific sequencing technologies together facilitate the de novo assembly of high-quality haplotype-resolved human genomes without parent-child trio data. We present 64 assembled haplotypes from 32 diverse human genomes. These highly contiguous haplotype assemblies (average minimum contig length needed to cover 50% of the genome: 26 million base pairs) integrate all forms of genetic variation, even across complex loci. We identified 107,590 structural variants (SVs), of which 68% were not discovered with short-read sequencing, and 278 SV hotspots (spanning megabases of gene-rich sequence). We characterized 130 of the most active mobile element source elements and found that 63% of all SVs arise through homology-mediated mechanisms. This resource enables reliable graph-based genotyping from short reads of up to 50,340 SVs, resulting in the identification of 1526 expression quantitative trait loci as well as SV candidates for adaptive selection within the human population.

289 citations
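
The contiguity figure quoted in the abstract above ("minimum contig length needed to cover 50% of the genome") is the statistic commonly called N50, or NG50 when the denominator is an estimated genome size instead of the assembly size. As a minimal sketch of how such a value is computed, assuming only a list of contig lengths is available (the function name `n50` and the optional `genome_size` parameter are illustrative, and the toy numbers are not from the paper):

```python
def n50(contig_lengths, genome_size=None):
    """Return the length of the shortest contig in the smallest set of
    longest contigs that together cover at least half of the target size.

    With genome_size=None the target is the total assembly length (N50).
    With an explicit genome_size the result is the NG50 variant.
    """
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("contig_lengths must not be empty")
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    covered = 0
    for length in lengths:
        covered += length
        if covered >= target:
            return length
    raise ValueError("contigs do not cover half of the target size")


# Toy assembly: total length 32, so half is 16; the two 8-unit contigs reach it.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

The abstract reports an average of this statistic across the haplotype assemblies, so a per-assembly value like the one above would be computed for each haplotype and then averaged.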


Journal ArticleDOI
12 Feb 2021-Science
TL;DR: The United Kingdom's COVID-19 epidemic during early 2020 was one of the world's largest and was unusually well represented by virus genomic sampling. The authors determined the fine-scale genetic lineage structure of this epidemic through analysis of 50,887 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes, including 26,181 from the UK sampled throughout the country's first wave of infection.
Abstract: The United Kingdom’s COVID-19 epidemic during early 2020 was one of the world’s largest and was unusually well represented by virus genomic sampling. We determined the fine-scale genetic lineage structure of this epidemic through analysis of 50,887 severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) genomes, including 26,181 from the UK sampled throughout the country’s first wave of infection. Using large-scale phylogenetic analyses combined with epidemiological and travel data, we quantified the size, spatiotemporal origins, and persistence of genetically distinct UK transmission lineages. Rapid fluctuations in virus importation rates resulted in >1000 lineages; those introduced prior to national lockdown tended to be larger and more dispersed. Lineage importation and regional lineage diversity declined after lockdown, whereas lineage elimination was size-dependent. We discuss the implications of our genetic perspective on transmission dynamics for COVID-19 epidemiology and control.

285 citations


Journal ArticleDOI
Yunhao Wang1, Yue Zhao1, Audrey Bollas1, Yuru Wang1, Kin Fai Au1 
TL;DR: Nanopore sequencing is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance.
Abstract: Rapid advances in nanopore technologies for sequencing single long DNA and RNA molecules have led to substantial improvements in accuracy, read length and throughput. These breakthroughs have required extensive development of experimental and bioinformatics methods to fully exploit nanopore long reads for investigations of genomes, transcriptomes, epigenomes and epitranscriptomes. Nanopore sequencing is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance. Many opportunities remain for improving data quality and analytical approaches through the development of new nanopores, base-calling methods and experimental protocols tailored to particular applications. Au and colleagues outline the field of nanopore sequencing.

Journal ArticleDOI
19 Jan 2021-Mbio
TL;DR: Using a pipeline for single nucleotide variant calling in a metagenomic context, the authors characterized minor SARS-CoV-2 alleles in wastewater and detected viral genotypes that were also found within clinical genomes throughout California.
Abstract: Viral genome sequencing has guided our understanding of the spread and extent of genetic diversity of SARS-CoV-2 during the COVID-19 pandemic. SARS-CoV-2 viral genomes are usually sequenced from nasopharyngeal swabs of individual patients to track viral spread. Recently, RT-qPCR of municipal wastewater has been used to quantify the abundance of SARS-CoV-2 in several regions globally. However, metatranscriptomic sequencing of wastewater can be used to profile the viral genetic diversity across infected communities. Here, we sequenced RNA directly from sewage collected by municipal utility districts in the San Francisco Bay Area to generate complete and nearly complete SARS-CoV-2 genomes. The major consensus SARS-CoV-2 genotypes detected in the sewage were identical to clinical genomes from the region. Using a pipeline for single nucleotide variant calling in a metagenomic context, we characterized minor SARS-CoV-2 alleles in the wastewater and detected viral genotypes which were also found within clinical genomes throughout California. Observed wastewater variants were more similar to local California patient-derived genotypes than they were to those from other regions within the United States or globally. Additional variants detected in wastewater have only been identified in genomes from patients sampled outside California, indicating that wastewater sequencing can provide evidence for recent introductions of viral lineages before they are detected by local clinical sequencing. These results demonstrate that epidemiological surveillance through wastewater sequencing can aid in tracking exact viral strains in an epidemic context.

Journal ArticleDOI
07 Apr 2021-Nature
TL;DR: In this article, the activity-by-contact (ABC) model was applied to create enhancer-gene maps in 131 human cell types and tissues, and use these maps to interpret the functions of GWAS variants.
Abstract: Genome-wide association studies (GWAS) have identified thousands of noncoding loci that are associated with human diseases and complex traits, each of which could reveal insights into the mechanisms of disease [1]. Many of the underlying causal variants may affect enhancers [2,3], but we lack accurate maps of enhancers and their target genes to interpret such variants. We recently developed the activity-by-contact (ABC) model to predict which enhancers regulate which genes and validated the model using CRISPR perturbations in several cell types [4]. Here we apply this ABC model to create enhancer–gene maps in 131 human cell types and tissues, and use these maps to interpret the functions of GWAS variants. Across 72 diseases and complex traits, ABC links 5,036 GWAS signals to 2,249 unique genes, including a class of 577 genes that appear to influence multiple phenotypes through variants in enhancers that act in different cell types. In inflammatory bowel disease (IBD), causal variants are enriched in predicted enhancers by more than 20-fold in particular cell types such as dendritic cells, and ABC achieves higher precision than other regulatory methods at connecting noncoding variants to target genes. These variant-to-function maps reveal an enhancer that contains an IBD risk variant and that regulates the expression of PPIF to alter the membrane potential of mitochondria in macrophages. Our study reveals principles of genome regulation, identifies genes that affect IBD and provides a resource and generalizable strategy to connect risk variants of common diseases to their molecular and cellular functions. Mapping enhancer regulation across human cell types and tissues illuminates genome function and provides a resource to connect risk variants for common diseases to their molecular and cellular functions.
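
The abstract above names the activity-by-contact model but does not state how the score is formed. In the form usually described for the ABC model, each candidate element's score for a gene is its activity multiplied by its contact frequency with the gene's promoter, divided by the sum of the same product over all candidate elements near that gene. The sketch below assumes that form; the function name `abc_scores`, the dictionary inputs, and the toy values are illustrative assumptions and are not taken from the paper.

```python
def abc_scores(activity, contact):
    """Compute an activity-by-contact style score for each candidate element.

    activity: dict mapping element id -> enhancer activity (non-negative)
    contact:  dict mapping element id -> contact frequency with the gene's
              promoter (non-negative)

    Each score is (activity * contact) for the element, normalized by the
    sum of that product over all elements, so the scores sum to 1 whenever
    at least one product is non-zero.
    """
    products = {e: activity[e] * contact[e] for e in activity}
    total = sum(products.values())
    if total == 0:
        return {e: 0.0 for e in products}
    return {e: p / total for e, p in products.items()}


# Toy example: a strong distal element and a weak proximal one contribute equally.
activity = {"enhancer_1": 2.0, "enhancer_2": 1.0}
contact = {"enhancer_1": 0.5, "enhancer_2": 1.0}
print(abc_scores(activity, contact))
```

Because the score is a share of the total, an element is only called a likely regulator of a gene when its activity-contact product is large compared with the other elements competing for the same promoter.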

Journal ArticleDOI
TL;DR: Open Targets Genetics offers tools that enable users to prioritise causal variants and genes at disease-associated loci and access systematic cross-disease and disease-molecular trait colocalization analysis across 92 cell types and tissues including the eQTL Catalogue.
Abstract: Open Targets Genetics (https://genetics.opentargets.org) is an open-access integrative resource that aggregates human GWAS and functional genomics data including gene expression, protein abundance, chromatin interaction and conformation data from a wide range of cell types and tissues to make robust connections between GWAS-associated loci, variants and likely causal genes. This enables systematic identification and prioritisation of likely causal variants and genes across all published trait-associated loci. In this paper, we describe the public resources we aggregate, the technology and analyses we use, and the functionality that the portal offers. Open Targets Genetics can be searched by variant, gene or study/phenotype. It offers tools that enable users to prioritise causal variants and genes at disease-associated loci and access systematic cross-disease and disease-molecular trait colocalization analysis across 92 cell types and tissues including the eQTL Catalogue. Data visualizations such as Manhattan-like plots, regional plots, credible sets overlap between studies and PheWAS plots enable users to explore GWAS signals in depth. The integrated data is made available through the web portal, for bulk download and via a GraphQL API, and the software is open source. Applications of this integrated data include identification of novel targets for drug discovery and drug repurposing.

Journal ArticleDOI
TL;DR: Liftoff is described, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene.
Abstract: MOTIVATION: Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the creation of high-quality genome assemblies of many species. To understand the biology of these genomes, annotation of gene features and other functional elements is essential; however for most species, only the reference genome is well-annotated. RESULTS: One strategy to annotate new or improved genome assemblies is to map or 'lift over' the genes from a previously-annotated reference genome. Here we describe Liftoff, a new genome annotation lift-over tool capable of mapping genes between two assemblies of the same or closely-related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. We show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome with an average sequence identity >99.9%. We also show that Liftoff can map genes across species by successfully lifting over 98.3% of human protein-coding genes to a chimpanzee genome assembly with 98.2% sequence identity. AVAILABILITY AND IMPLEMENTATION: Liftoff can be installed via bioconda and PyPI. Additionally, the source code for Liftoff is available at https://github.com/agshumate/Liftoff. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
Stefan C. Dentro1, Stefan C. Dentro2, Stefan C. Dentro3, Ignaty Leshchiner4, Kerstin Haase2, Maxime Tarabichi1, Maxime Tarabichi2, Jeff Wintersinger5, Amit G. Deshwar5, Kaixian Yu6, Yulia Rubanova5, Geoff Macintyre7, Jonas Demeulemeester2, Jonas Demeulemeester8, Ignacio Vázquez-García, Kortine Kleinheinz9, Kortine Kleinheinz10, Dimitri Livitz4, Salem Malikic, Nilgun Donmez11, Nilgun Donmez12, Subhajit Sengupta13, Pavana Anur14, Clemency Jolly2, Marek Cmero15, Marek Cmero16, Daniel Rosebrock4, Steven E. Schumacher4, Yu Fan6, Matthew Fittall2, Ruben M. Drews7, Xiaotong Yao17, Thomas B.K. Watkins2, Juhee Lee18, Matthias Schlesner10, Hongtu Zhu6, David J. Adams1, Nicholas McGranahan19, Charles Swanton19, Charles Swanton2, Gad Getz, Paul C. Boutros20, Paul C. Boutros21, Paul C. Boutros5, Marcin Imielinski17, Rameen Beroukhim4, Rameen Beroukhim22, S. Cenk Sahinalp, Yuan Ji23, Yuan Ji13, Martin Peifer24, Inigo Martincorena1, Florian Markowetz7, Ville Mustonen25, Ke Yuan26, Ke Yuan7, Moritz Gerstung1, Moritz Gerstung27, Paul T. Spellman14, Wenyi Wang6, Quaid Morris, David C. Wedge28, David C. Wedge3, Peter Van Loo2, Santiago Gonzalez, David D.L. Bowtell, Peter J. Campbell, Shaolong Cao, Elizabeth L. Christie, Yupeng Cun, Kevin J. Dawson, Roland Eils, Dale W. Garsed, Gavin Ha, Lara Jerman, Henry Lee-Six, Thomas J. Mitchell, Layla Oesper, Myron Peto, Benjamin J. Raphael, Adriana Salcedo, Ruian Shi, Seung Jun Shin, Lincoln Stein, Oliver Spiro, Shankar Vembu, David A. Wheeler, Tsun-Po Yang 
15 Apr 2021-Cell
TL;DR: In this article, the authors extensively characterize intra-tumor heterogeneity (ITH) across whole-genome sequences of 2,658 cancer samples spanning 38 cancer types and identify cancer type-specific subclonal patterns of driver gene mutations, fusions, structural variants, and copy number alterations.

Journal ArticleDOI
TL;DR: The third version of IMG/VR is presented, composed of 18 373 cultivated and 2 314 329 uncultivated viral genomes (UViGs), nearly tripling the total number of sequences compared to the previous version, and annotated with a new standardized pipeline including genome quality estimation using CheckV and expanded host taxonomy prediction.
Abstract: Viruses are integral components of all ecosystems and microbiomes on Earth. Through pervasive infections of their cellular hosts, viruses can reshape microbial community structure and drive global nutrient cycling. Over the past decade, viral sequences identified from genomes and metagenomes have provided an unprecedented view of viral genome diversity in nature. Since 2016, the IMG/VR database has provided access to the largest collection of viral sequences obtained from (meta)genomes. Here, we present the third version of IMG/VR, composed of 18 373 cultivated and 2 314 329 uncultivated viral genomes (UViGs), nearly tripling the total number of sequences compared to the previous version. These clustered into 935 362 viral Operational Taxonomic Units (vOTUs), including 188 930 with two or more members. UViGs in IMG/VR are now reported as single viral contigs, integrated proviruses or genome bins, and are annotated with a new standardized pipeline including genome quality estimation using CheckV, taxonomic classification reflecting the latest ICTV update, and expanded host taxonomy prediction. The new IMG/VR interface enables users to efficiently browse, search, and select UViGs based on genome features and/or sequence similarity. IMG/VR v3 is available at https://img.jgi.doe.gov/vr, and the underlying data are available to download at https://genome.jgi.doe.gov/portal/IMG_VR.

Journal ArticleDOI
06 Aug 2021-Science
TL;DR: In this article, de novo genome assemblies, transcriptomes, annotations, and methylomes for the 26 inbreds that serve as the founders for the maize nested association mapping population were reported.
Abstract: We report de novo genome assemblies, transcriptomes, annotations, and methylomes for the 26 inbreds that serve as the founders for the maize nested association mapping population. The number of pan-genes in these diverse genomes exceeds 103,000, with approximately a third found across all genotypes. The results demonstrate that the ancient tetraploid character of maize continues to degrade by fractionation to the present day. Excellent contiguity over repeat arrays and complete annotation of centromeres revealed additional variation in major cytological landmarks. We show that combining structural variation with single-nucleotide polymorphisms can improve the power of quantitative mapping studies. We also document variation at the level of DNA methylation and demonstrate that unmethylated regions are enriched for cis-regulatory elements that contribute to phenotypic variation.
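As a rough illustration of how a figure such as "approximately a third found across all genotypes" can be tallied, the sketch below labels pan-genes from a presence/absence table. The function names, gene names, genotype names, and toy data are all hypothetical and chosen only for illustration; they are not taken from the study, whose own pipeline is not described in the abstract.

```python
from typing import Dict, Set

def classify_pan_genes(presence: Dict[str, Set[str]],
                       genotypes: Set[str]) -> Dict[str, str]:
    """Label a pan-gene 'core' if every genotype carries it, else 'dispensable'."""
    labels = {}
    for gene, carriers in presence.items():
        # genotypes <= carriers is True when all genotypes are among the carriers
        labels[gene] = "core" if genotypes <= carriers else "dispensable"
    return labels

def core_fraction(labels: Dict[str, str]) -> float:
    """Fraction of pan-genes labelled 'core' (0.0 for an empty table)."""
    if not labels:
        return 0.0
    return sum(1 for label in labels.values() if label == "core") / len(labels)

# Hypothetical toy data: three genotypes, three pan-genes.
GENOTYPES = {"line_A", "line_B", "line_C"}
PRESENCE = {
    "gene_1": {"line_A", "line_B", "line_C"},  # present everywhere
    "gene_2": {"line_A"},                      # present in one line only
    "gene_3": {"line_B", "line_C"},            # present in two lines
}

labels = classify_pan_genes(PRESENCE, GENOTYPES)
fraction = core_fraction(labels)
```

In this toy table one of the three genes is present in every line, so the core fraction is one third; real pan-gene tables would also need a rule for fragmented or partially annotated genes.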

Journal ArticleDOI
07 Apr 2021-Nature
TL;DR: In this article, the authors used complementary long-read sequencing technologies to complete the linear assembly of human chromosome 8, including a 2.08-Mb centromeric α-satellite array, a 644-kb copy number polymorphism in the β-defensin gene cluster that is important for disease risk, and an 863-kb variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere.
Abstract: The complete assembly of each human chromosome is essential for understanding human biology and evolution. Here we use complementary long-read sequencing technologies to complete the linear assembly of human chromosome 8. Our assembly resolves the sequence of five previously long-standing gaps, including a 2.08-Mb centromeric α-satellite array, a 644-kb copy number polymorphism in the β-defensin gene cluster that is important for disease risk, and an 863-kb variable number tandem repeat at chromosome 8q21.2 that can function as a neocentromere. We show that the centromeric α-satellite array is generally methylated except for a 73-kb hypomethylated region of diverse higher-order α-satellites enriched with CENP-A nucleosomes, consistent with the location of the kinetochore. In addition, we confirm the overall organization and methylation pattern of the centromere in a diploid human genome. Using a dual long-read sequencing approach, we complete high-quality draft assemblies of the orthologous centromere from chromosome 8 in chimpanzee, orangutan and macaque to reconstruct its evolutionary history. Comparative and phylogenetic analyses show that the higher-order α-satellite structure evolved in the great ape ancestor with a layered symmetry, in which more ancient higher-order repeats locate peripherally to monomeric α-satellites. We estimate that the mutation rate of centromeric satellite DNA is accelerated by more than 2.2-fold compared to the unique portions of the genome, and this acceleration extends into the flanking sequence. The complete assembly of human chromosome 8 resolves previous gaps and reveals hidden complex forms of genetic variation, enabling functional and evolutionary characterization of primate centromeres.

Journal ArticleDOI
TL;DR: This work shows enrichment of specific chromosomes from the human genome and of low-abundance organisms in mixed populations without a priori knowledge of sample composition, and enriches targeted panels comprising 25,600 exons from 10,000 human genes and 717 genes implicated in cancer.
Abstract: Nanopore sequencers can be used to selectively sequence certain DNA molecules in a pool by reversing the voltage across individual nanopores to reject specific sequences, enabling enrichment and depletion to address biological questions. Previously, we achieved this using dynamic time warping to map the signal to a reference genome, but the method required substantial computational resources and did not scale to gigabase-sized references. Here we overcome this limitation by using graphical processing unit (GPU) base-calling. We show enrichment of specific chromosomes from the human genome and of low-abundance organisms in mixed populations without a priori knowledge of sample composition. Finally, we enrich targeted panels comprising 25,600 exons from 10,000 human genes and 717 genes implicated in cancer, identifying PML–RARA fusions in the NB4 cell line in <15 h sequencing. These methods can be used to efficiently screen any target panel of genes without specialized sample preparation using any computer and a suitable GPU. Our toolkit, readfish, is available at https://www.github.com/looselab/readfish . A nanopore sequencer achieves selective sequencing of any region in the human genome.
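The selection step described in this abstract (base-call the first stretch of signal, map it to a reference, then keep or reject the molecule) can be sketched roughly as below. This is a toy illustration and not the readfish API: the function `decide`, the action labels, and the contig names are assumptions made for this sketch, and base-calling and mapping are represented only by an already computed mapping result.

```python
from typing import Optional, Set

# Hypothetical action labels for this sketch.
KEEP = "keep_sequencing"
REJECT = "reject"                # reverse the pore voltage to eject the molecule
WAIT = "wait_for_more_signal"    # not enough information to decide yet

def decide(mapped_contig: Optional[str],
           targets: Set[str],
           enrich: bool = True) -> str:
    """Choose an action for one read from where its first chunk mapped.

    mapped_contig: reference contig the chunk mapped to, or None if unmapped.
    targets:       contigs of interest.
    enrich:        True to keep on-target reads (enrichment),
                   False to reject them (depletion).
    """
    if mapped_contig is None:
        # Unmapped so far: collect more signal before deciding.
        return WAIT
    on_target = mapped_contig in targets
    if enrich:
        return KEEP if on_target else REJECT
    return REJECT if on_target else KEEP

# Hypothetical usage: enrich for chromosome 8, or deplete a host contig.
enrich_hit = decide("chr8", {"chr8"})
enrich_miss = decide("chr1", {"chr8"})
deplete_hit = decide("host_contig", {"host_contig"}, enrich=False)
unmapped = decide(None, {"chr8"})
```

A real implementation would also bound how long a read may stay in the waiting state, since every extra chunk sequenced from an off-target molecule reduces the benefit of rejecting it.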

Journal ArticleDOI
TL;DR: Current knowledge of the role of mtDNA copy number regulation in various types of human diseases, including mitochondrial disorders, neurodegenerative disorders and cancer, and during ageing is critically discussed.

Journal ArticleDOI
TL;DR: In this article, a long amplicon strategy was used to determine the secondary structure of the SARS-CoV-2 RNA genome at single-nucleotide resolution in infected cells.

Journal ArticleDOI
TL;DR: This article used a direct label and stain (DLS) optical map of the Chinese Spring (CS) genome together with a prior nick, label, repair and stain optical map, and sequence contigs assembled with Pacific Biosciences long reads, to refine the v1.0 assembly.
Abstract: Until recently, achieving a reference-quality genome sequence for bread wheat was long thought beyond the limits of genome sequencing and assembly technology, primarily due to the large genome size and > 80% repetitive sequence content. The release of the chromosome scale 14.5-Gb IWGSC RefSeq v1.0 genome sequence of bread wheat cv. Chinese Spring (CS) was, therefore, a milestone. Here, we used a direct label and stain (DLS) optical map of the CS genome together with a prior nick, label, repair and stain (NLRS) optical map, and sequence contigs assembled with Pacific Biosciences long reads, to refine the v1.0 assembly. Inconsistencies between the sequence and maps were reconciled and gaps were closed. Gap filling and anchoring of 279 unplaced scaffolds increased the total length of pseudomolecules by 168 Mb (excluding Ns). Positions and orientations were corrected for 233 and 354 scaffolds, respectively, representing 10% of the genome sequence. The accuracy of the remaining 90% of the assembly was validated. As a result of the increased contiguity, the numbers of transposable elements (TEs) and intact TEs have increased in IWGSC RefSeq v2.1 compared with v1.0. In total, 98% of the gene models identified in v1.0 were mapped onto this new assembly through development of a dedicated approach implemented in the MAGAAT pipeline. The numbers of high-confidence genes on pseudomolecules have increased from 105 319 to 105 534. The reconciled assembly enhances the utility of the sequence for genetic mapping, comparative genomics, gene annotation and isolation, and more general studies on the biology of wheat.

Journal ArticleDOI
TL;DR: In this paper, the authors developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts.
Abstract: Motivation: Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. Gene regulatory code is highly complex due to the existence of polysemy and distant semantic relationship, which previous informatics methods often fail to capture, especially in data-scarce scenarios. Results: To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, to capture global and transferable understanding of genomic DNA sequences based on up and downstream nucleotide contexts. We compared DNABERT to the most widely used programs for genome-wide regulatory elements prediction and demonstrate its ease of use, accuracy, and efficiency. We show that the single pre-trained transformers model can simultaneously achieve state-of-the-art performance on prediction of promoters, splice sites, and transcription factor binding sites, after easy fine-tuning using small task-specific labelled data. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationship within input sequences for better interpretability and accurate identification of conserved sequence motifs and functional genetic variant candidates. Finally, we demonstrate that pre-trained DNABERT with human genome can even be readily applied to other organisms with exceptional performance. We anticipate that the pre-trained DNABERT model can be fine-tuned to many other sequence analysis tasks. Availability: The source code, pretrained and finetuned model for DNABERT are available at GitHub (https://github.com/jerryji1993/DNABERT). Supplementary information: Supplementary data are available at Bioinformatics online.
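The abstract does not say how a DNA sequence is turned into model input. One common scheme for transformer models on DNA is to split the sequence into overlapping k-mers, so that each token carries some of its neighbouring context. The sketch below assumes that scheme; the function name `kmer_tokens` and the choice k = 3 are illustrative and not taken from the abstract.

```python
from typing import List

def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split a DNA sequence into overlapping k-mers with a stride of one base."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    seq = sequence.upper()
    # A sequence shorter than k yields no tokens (the range is empty).
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

# Hypothetical usage on a five-base toy sequence.
tokens = kmer_tokens("atgca", k=3)   # ['ATG', 'TGC', 'GCA']
```

With a stride of one, a sequence of length n gives n - k + 1 tokens, and each base away from the ends appears in k of them.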

Journal ArticleDOI
24 Jun 2021-Cell
TL;DR: In this paper, pan-genome-scale genomic resources, including a graph-based genome providing access to rice genomic variations, were developed to facilitate rice breeding as well as plant functional genomics and evolutionary biology research.

Journal ArticleDOI
TL;DR: In this article, the authors investigated the possibility that SARS-CoV-2 RNAs can be reverse-transcribed and integrated into the DNA of human cells in culture and that transcription of the integrated sequences might account for some of the positive PCR tests seen in patients.
Abstract: Prolonged detection of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) RNA and recurrence of PCR-positive tests have been widely reported in patients after recovery from COVID-19, but some of these patients do not appear to shed infectious virus. We investigated the possibility that SARS-CoV-2 RNAs can be reverse-transcribed and integrated into the DNA of human cells in culture and that transcription of the integrated sequences might account for some of the positive PCR tests seen in patients. In support of this hypothesis, we found that DNA copies of SARS-CoV-2 sequences can be integrated into the genome of infected human cells. We found target site duplications flanking the viral sequences and consensus LINE1 endonuclease recognition sequences at the integration sites, consistent with a LINE1 retrotransposon-mediated, target-primed reverse transcription and retroposition mechanism. We also found, in some patient-derived tissues, evidence suggesting that a large fraction of the viral sequences is transcribed from integrated DNA copies of viral sequences, generating viral-host chimeric transcripts. The integration and transcription of viral sequences may thus contribute to the detection of viral RNA by PCR in patients after infection and clinical recovery. Because we have detected only subgenomic sequences derived mainly from the 3' end of the viral genome integrated into the DNA of the host cell, infectious virus cannot be produced from the integrated subgenomic SARS-CoV-2 sequences.

Journal ArticleDOI
TL;DR: In this article, the authors presented a new open resource of genome-wide association study summary statistics, using the 2020 data release, almost tripling the discovery sample size, including the X chromosome and new classes of imaging-derived phenotypes.
Abstract: UK Biobank is a major prospective epidemiological study, including multimodal brain imaging, genetics and ongoing health outcomes. Previously, we published genome-wide associations of 3,144 brain imaging-derived phenotypes, with a discovery sample of 8,428 individuals. Here we present a new open resource of genome-wide association study summary statistics, using the 2020 data release, almost tripling the discovery sample size. We now include the X chromosome and new classes of imaging-derived phenotypes (subcortical volumes and tissue contrast). Previously, we found 148 replicated clusters of associations between genetic variants and imaging phenotypes; in this study, we found 692, including 12 on the X chromosome. We describe some of the newly found associations, focusing on the X chromosome and autosomal associations involving the new classes of imaging-derived phenotypes. Our novel associations implicate, for example, pathways involved in the rare X-linked STAR (syndactyly, telecanthus and anogenital and renal malformations) syndrome, Alzheimer's disease and mitochondrial disorders.

Journal ArticleDOI
TL;DR: The authors conducted a large multi-ethnic meta-analysis of genome-wide association studies on a total of 34,179 cases and 349,321 controls, identifying 44 previously unreported risk loci and confirming 83 loci that were previously known.
Abstract: Primary open-angle glaucoma (POAG) is a heritable common cause of blindness worldwide. To identify risk loci, we conduct a large multi-ethnic meta-analysis of genome-wide association studies on a total of 34,179 cases and 349,321 controls, identifying 44 previously unreported risk loci and confirming 83 loci that were previously known. The majority of loci have broadly consistent effects across European, Asian and African ancestries. Cross-ancestry data improve fine-mapping of causal variants for several loci. Integration of multiple lines of genetic evidence supports the functional relevance of the identified POAG risk loci and highlights potential contributions of several genes to POAG pathogenesis, including SVEP1, RERE, VCAM1, ZNF638, CLIC5, SLC2A12, YAP1, MXRA5, and SMAD6. Several drug compounds targeting POAG risk genes may be potential glaucoma therapeutic candidates.

Journal ArticleDOI
TL;DR: The authors proposed a standardized archaeal taxonomy that is derived from a 122-concatenated-protein phylogeny that resolves polyphyletic groups and normalizes ranks based on relative evolutionary divergence.
Abstract: The accrual of genomic data from both cultured and uncultured microorganisms provides new opportunities to develop systematic taxonomies based on evolutionary relationships. Previously, we established a bacterial taxonomy through the Genome Taxonomy Database. Here, we propose a standardized archaeal taxonomy that is derived from a 122-concatenated-protein phylogeny that resolves polyphyletic groups and normalizes ranks based on relative evolutionary divergence. The resulting archaeal taxonomy, which forms part of the Genome Taxonomy Database, is stable for a range of phylogenetic variables including marker gene selection, inference methods, corrections for rate heterogeneity and compositional bias, tree rooting scenarios and expansion of the genome database. Rank normalization is shown to robustly correct for substitution rates varying up to 30-fold using simulated datasets. Taxonomic curation follows the rules of the International Code of Nomenclature of Prokaryotes while taking into account proposals to formally recognize the rank of phylum and to use genome sequences as type material. This taxonomy is based on 2,392 archaeal genomes, 93.3% of which required one or more changes to their existing taxonomy, mainly owing to incomplete classification. We identify 16 archaeal phyla and reclassify 3 major monophyletic units from the former Euryarchaeota and one phylum that unites the Thaumarchaeota-Aigarchaeota-Crenarchaeota-Korarchaeota (TACK) superphylum into a single phylum.

Journal ArticleDOI
TL;DR: This Review discusses the relationship between genome structure and gene regulation, with a focus on whether genome organization has an instructive role or largely reflects the activity of regulatory elements.
Abstract: Precise patterns of gene expression in metazoans are controlled by three classes of regulatory elements: promoters, enhancers and boundary elements. During differentiation and development, these elements form specific interactions in dynamic higher-order chromatin structures. However, the relationship between genome structure and its function in gene regulation is not completely understood. Here we review recent progress in this field and discuss whether genome structure plays an instructive role in regulating gene expression or is a reflection of the activity of the regulatory elements of the genome.