Showing papers in "Genome Research in 2020"

Open Access

Journal Article

The EnteroBase user's guide, with case studies on Salmonella transmissions, Yersinia pestis phylogeny, and Escherichia core genomic diversity

Zhemin Zhou¹, Nabil-Fareed Alikhan, Khaled M. Mohamed, Yulei Fan, Mark Achtman

University of Warwick¹

01 Jan 2020-Genome Research

TL;DR: An overview on how EnteroBase works, what it can do, and its future prospects is provided.

Abstract: EnteroBase is an integrated software environment that supports the identification of global population structures within several bacterial genera that include pathogens. Here, we provide an overview of how EnteroBase works, what it can do, and its future prospects. EnteroBase has currently assembled more than 300,000 genomes from Illumina short reads from Salmonella, Escherichia, Yersinia, Clostridioides, Helicobacter, Vibrio, and Moraxella and genotyped those assemblies by core genome multilocus sequence typing (cgMLST). Hierarchical clustering of cgMLST sequence types allows mapping a new bacterial strain to predefined population structures at multiple levels of resolution within a few hours after uploading its short reads. Case Study 1 illustrates this process for local transmissions of Salmonella enterica serovar Agama between neighboring social groups of badgers and humans. EnteroBase also supports single nucleotide polymorphism (SNP) calls from both genomic assemblies and after extraction from metagenomic sequences, as illustrated by Case Study 2 which summarizes the microevolution of Yersinia pestis over the last 5000 years of pandemic plague. EnteroBase can also provide a global overview of the genomic diversity within an entire genus, as illustrated by Case Study 3, which presents a novel, global overview of the population structure of all of the species, subspecies, and clades within Escherichia.
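The abstract describes hierarchical clustering of cgMLST sequence types at multiple levels of resolution. The sketch below is only an illustration of that idea, not EnteroBase's implementation: it assumes single-linkage clustering on allelic distance (the number of loci whose allele numbers differ), and the strain names, profiles, and thresholds are hypothetical.

```python
# Illustrative sketch only: single-linkage clustering of cgMLST-style
# allelic profiles at a chosen distance threshold. The clustering rule,
# profiles, and thresholds are assumptions, not taken from EnteroBase.

def allelic_distance(a, b):
    """Count loci at which two allelic profiles carry different alleles."""
    return sum(1 for x, y in zip(a, b) if x != y)


def single_linkage_clusters(profiles, threshold):
    """Group strains whose profiles are linked by distances <= threshold."""
    names = list(profiles)
    parent = {name: name for name in names}

    def find(name):
        # Walk up to the cluster representative, shortening the path as we go.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if allelic_distance(profiles[a], profiles[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), set()).add(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical four-locus profiles (real cgMLST schemes use thousands of loci).
example_profiles = {
    "s1": [1, 1, 1, 1],
    "s2": [1, 1, 1, 2],
    "s3": [1, 1, 2, 3],
    "s4": [5, 6, 7, 8],
}
```

Running `single_linkage_clusters(example_profiles, t)` for increasing values of `t` gives progressively coarser groupings, which is the sense in which one set of profiles can be read at several levels of resolution.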

469 citations

Journal Article

HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

Sergey Nurk¹, Brian P. Walenz¹, Arang Rhie¹, Mitchell R. Vollger², Glennis A. Logsdon², Robert Grothe³, Karen H. Miga⁴, Evan E. Eichler², Adam M. Phillippy¹, Sergey Koren¹

National Institutes of Health¹, University of Washington², Pacific Biosciences³, University of California, Santa Cruz⁴

14 Aug 2020-Genome Research

TL;DR: This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions, a significant advance towards the complete assembly of human genomes.

Abstract: Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.
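Two ideas in this abstract can be made concrete with a short sketch. Homopolymer compression collapses each run of identical bases to a single base, and the pairing of 99.999% consensus accuracy with QV50 follows from the Phred scale, QV = -10 · log10(error rate). The function names below are hypothetical and the code is not taken from HiCanu.

```python
# Minimal sketch, not HiCanu's code: function names are hypothetical.
import math
from itertools import groupby


def homopolymer_compress(seq):
    """Collapse every run of identical bases to one base (AAACC -> AC)."""
    return "".join(base for base, _run in groupby(seq))


def phred_qv(accuracy):
    """Convert per-base accuracy (0 < accuracy < 1) to a Phred-scaled QV."""
    error_rate = 1.0 - accuracy
    return -10.0 * math.log10(error_rate)


compressed = homopolymer_compress("AAACCGTTTT")
qv = phred_qv(0.99999)
```

Here `compressed` is `"ACGT"` and `qv` is 50 up to floating-point rounding, matching the QV50 figure quoted for the CHM13 assembly.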

342 citations

Journal Article

Accurate and complete genomes from metagenomes

Lin-Xing Chen¹, Karthik Anantharaman¹, Alon Shaiber², A. Murat Eren²,³, Jillian F. Banfield¹,⁴

University of California, Berkeley¹, University of Chicago², Marine Biological Laboratory³, Lawrence Berkeley National Laboratory⁴

18 Mar 2020-Genome Research

TL;DR: The analysis of cumulative GC skew identified potential misassemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them, and methods that could be implemented in bioinformatic approaches for curation are discussed to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.

Abstract: Genomes are an integral component of the biological information about an organism; thus, the more complete the genome, the more informative it is. Historically, bacterial and archaeal genomes were reconstructed from pure (monoclonal) cultures, and the first reported sequences were manually curated to completion. However, the bottleneck imposed by the requirement for isolates precluded genomic insights for the vast majority of microbial life. Shotgun sequencing of microbial communities, referred to initially as community genomics and subsequently as genome-resolved metagenomics, can circumvent this limitation by obtaining metagenome-assembled genomes (MAGs); but gaps, local assembly errors, chimeras, and contamination by fragments from other genomes limit the value of these genomes. Here, we discuss genome curation to improve and, in some cases, achieve complete (circularized, no gaps) MAGs (CMAGs). To date, few CMAGs have been generated, although notably some are from very complex systems such as soil and sediment. Through analysis of about 7000 published complete bacterial isolate genomes, we verify the value of cumulative GC skew in combination with other metrics to establish bacterial genome sequence accuracy. The analysis of cumulative GC skew identified potential misassemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them. We discuss methods that could be implemented in bioinformatic approaches for curation to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.
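Cumulative GC skew, the metric this abstract relies on, is commonly computed by taking (G − C)/(G + C) in consecutive windows along the sequence and keeping a running sum. The sketch below follows that common definition; the window size and function name are assumptions, and it is not the authors' pipeline.

```python
# Minimal sketch of windowed cumulative GC skew. The window size and the
# function name are illustrative assumptions, not the authors' pipeline.

def cumulative_gc_skew(seq, window=1000):
    """Return the running sum of (G - C) / (G + C) over consecutive windows."""
    running_total = 0.0
    cumulative = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window].upper()
        g = chunk.count("G")
        c = chunk.count("C")
        # Windows without any G or C contribute zero skew.
        skew = (g - c) / (g + c) if (g + c) else 0.0
        running_total += skew
        cumulative.append(running_total)
    return cumulative


# Toy input: a G-rich window followed by a C-rich window.
toy_curve = cumulative_gc_skew("GGGG" + "CCCC", window=4)
```

For many circular bacterial genomes the resulting curve is expected to show a single minimum and a single maximum, typically near the replication origin and terminus, so abrupt departures from that shape are one signal worth checking as a possible misassembly.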

105 citations

Journal Article

Mapping genome variation of SARS-CoV-2 worldwide highlights the impact of COVID-19 super-spreaders.

Alberto Gómez-Carballa¹, Xabier Bello¹, Jacobo Pardo-Seco¹, Federico Martinón-Torres¹, Antonio Salas¹

University of Santiago de Compostela¹

02 Sep 2020-Genome Research

TL;DR: Phylogenetic analysis indicates a TMRCA for SARS-CoV-2 genomes dating to 12 November 2019, matching epidemiological records, and reveals a rather uniform mutation occurrence along branches that could have implications for diagnostics and the design of future vaccines.

Abstract: The human pathogen severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the major pandemic of the twenty-first century. We analyzed more than 4700 SARS-CoV-2 genomes and associated metadata retrieved from public repositories. SARS-CoV-2 sequences have a high sequence identity (>99.9%), which drops to >96% when compared to bat coronavirus genome. We built a mutation-annotated reference SARS-CoV-2 phylogeny with two main macro-haplogroups, A and B, both of Asian origin, and more than 160 sub-branches representing virus strains of variable geographical origins worldwide, revealing a rather uniform mutation occurrence along branches that could have implications for diagnostics and the design of future vaccines. Identification of the root of SARS-CoV-2 genomes is not without problems, owing to conflicting interpretations derived from either using the bat coronavirus genomes as an outgroup or relying on the sampling chronology of the SARS-CoV-2 genomes and TMRCA estimates; however, the overall scenario favors haplogroup A as the ancestral node. Phylogenetic analysis indicates a TMRCA for SARS-CoV-2 genomes dating to November 12, 2019, thus matching epidemiological records. Sub-haplogroup A2 most likely originated in Europe from an Asian ancestor and gave rise to subclade A2a, which represents the major non-Asian outbreak, especially in Africa and Europe. Multiple founder effect episodes, most likely associated with super-spreader hosts, might explain COVID-19 pandemic to a large extent.

97 citations

Journal Article

Functional annotation of human long noncoding RNAs via molecular phenotyping

Jordan A. Ramilowski, Chi Wai Yip, Saumya Agrawal, Jen-Chien Chang, Yari Ciani, Ivan V. Kulakovskiy¹, Mickaël Mendez², Jasmine Li Ching Ooi, John F. Ouyang³, Nicholas J. Parkinson⁴, Andreas Petri⁵, Leonie Roos⁶, Jessica Severin, Kayoko Yasuzawa, Imad Abugessaisa, Altuna Akalin, Ivan Antonov⁷, Erik Arner, Alessandro Bonetti, Hidemasa Bono⁸, Beatrice Borsari, Frank Brombacher⁹, Christopher J. F. Cameron¹⁰, Carlo Vittorio Cannistraci¹¹, Ryan Cardenas¹², Melissa Cardon, Howard Y. Chang¹³, Josée Dostie¹⁰, Luca Ducoli¹⁴, Alexander V. Favorov⁷, Alexandre Fort, Diego Garrido, Noa Gil¹⁵, Juliette Gimenez, Reto Guler⁹, Lusy Handoko, Jayson Harshbarger, Akira Hasegawa, Yuki Hasegawa, Kosuke Hashimoto, Norihito Hayatsu, Peter Heutink¹⁶, Tetsuro Hirose¹⁷, Eddie Luidy Imada¹⁸, Masayoshi Itoh, Bogumil Kaczkowski, Aditi Kanhere¹², Emily Kawabata, Hideya Kawaji, Tsugumi Kawashima, S. Thomas Kelly, Miki Kojima, Naoto Kondo, Haruhiko Koseki, Tsukasa Kouno, Anton Kratz, Mariola Kurowska-Stolarska¹⁹, Andrew T. Kwon, Jeffrey T. Leek¹⁸, Andreas Lennartsson²⁰, Marina Lizio, Fernando López-Redondo, Joachim Luginbühl, Shiori Maeda, Vsevolod J. Makeev⁷,²¹, Luigi Marchionni¹⁸, Yulia A. Medvedeva⁷,²¹, Aki Minoda, Ferenc Müller¹², Manuel Muñoz-Aguirre, Mitsuyoshi Murata, Hiromi Nishiyori, Kazuhiro R. Nitta, Shuhei Noguchi, Yukihiko Noro, Ramil N. Nurtdinov, Yasushi Okazaki, Valerio Orlando²², Denis Paquette¹⁰, Callum J.C. Parr, Owen J. L. Rackham³, Patrizia Rizzu¹⁶, Diego Fernando Sánchez Martinez¹⁸, Albin Sandelin²³, Pillay Sanjana¹², Colin A. Semple⁴, Youtaro Shibayama, Divya M. Sivaraman, Takahiro Suzuki, Suzannah C. Szumowski, Michihira Tagami, Martin S. Taylor⁴, Chikashi Terao, Malte Thodberg²³, Supat Thongjuea, Vidisha Tripathi, Igor Ulitsky¹⁵, Roberto Verardo, Ilya E. Vorontsov⁷, Chinatsu Yamamoto, Robert Young⁴, J Kenneth Baillie⁴, Alistair R. R. Forrest, Roderic Guigó, Michael M. Hoffman²⁴, Chung-Chau Hon, Takeya Kasukawa, Sakari Kauppinen⁵, Juha Kere²⁰, Boris Lenhard⁶, Claudio Schneider²⁵, Harukazu Suzuki, Ken Yagi, Michiel J. L. de Hoon, Jay W. Shin, Piero Carninci

27 Jul 2020-Genome Research

TL;DR: The largest-to-date lncRNA knockdown data set with molecular phenotyping is disseminated for further exploration and functional roles for ZNF213-AS1 and lnc-KHDC3L-2 are highlighted.

Abstract: Long noncoding RNAs (lncRNAs) constitute the majority of transcripts in the mammalian genomes, and yet, their functions remain largely unknown. As part of the FANTOM6 project, we systematically knocked down the expression of 285 lncRNAs in human dermal fibroblasts and quantified cellular growth, morphological changes, and transcriptomic responses using Capped Analysis of Gene Expression (CAGE). Antisense oligonucleotides targeting the same lncRNAs exhibited global concordance, and the molecular phenotype, measured by CAGE, recapitulated the observed cellular phenotypes while providing additional insights on the affected genes and pathways. Here, we disseminate the largest-to-date lncRNA knockdown data set with molecular phenotyping (over 1000 CAGE deep-sequencing libraries) for further exploration and highlight functional roles for ZNF213-AS1 and lnc-KHDC3L-2.

97 citations

Journal Article

Comprehensive analyses of 723 transcriptomes enhance genetic and biological interpretations for complex traits in cattle

Lingzhao Fang, Wentao Cai¹,², Shuli Liu²,³, Oriol Canela-Xandri⁴, Yahui Gao¹,³, Jicai Jiang¹, Konrad Rawlik⁴, Bingjie Li³, Steven G. Schroeder³, Benjamin D. Rosen³, Congjun Li³, Tad S. Sonstegard, Leeson J. Alexander³, Curtis P. Van Tassell³, Paul M. VanRaden³, John B. Cole³, Ying Yu², Shengli Zhang², Albert Tenesa⁴, Li Ma¹, George E. Liu³

University of Maryland, College Park¹, China Agricultural University², Agricultural Research Service³, University of Edinburgh⁴

18 May 2020-Genome Research

TL;DR: The results demonstrated that tissue-specific genes are significantly associated with the tissue-relevant biology, and the transcriptome atlas can serve as a primary source for biological interpretation, functional validation, studies of adaptive evolution, and genomic improvement in livestock.

Abstract: By uniformly analyzing 723 RNA-seq data from 91 tissues and cell types, we built a comprehensive gene atlas and studied tissue specificity of genes in cattle. We demonstrated that tissue-specific genes significantly reflected the tissue-relevant biology, showing distinct promoter methylation and evolution patterns (e.g., brain-specific genes evolve slowest, whereas testis-specific genes evolve fastest). Through integrative analyses of those tissue-specific genes with large-scale genome-wide association studies, we detected relevant tissues/cell types and candidate genes for 45 economically important traits in cattle, including blood/immune system (e.g., CCDC88C) for male fertility, brain (e.g., TRIM46 and RAB6A) for milk production, and multiple growth-related tissues (e.g., FGF6 and CCND2) for body conformation. We validated these findings by using epigenomic data across major somatic tissues and sperm. Collectively, our findings provided novel insights into the genetic and biological mechanisms underlying complex traits in cattle, and our transcriptome atlas can serve as a primary source for biological interpretation, functional validation, studies of adaptive evolution, and genomic improvement in livestock.

93 citations

Journal Article

A comparison of gene expression and DNA methylation patterns across tissues and species.

Lauren E. Blake¹, Julien Roux¹,²,³, Irene Hernando-Herraez⁴, Nicholas E. Banovich¹, Raquel Garcia Perez⁴, Chiaowen Joyce Hsiao¹, Ittai E. Eres¹, Claudia Cuevas¹, Tomas Marques-Bonet, Yoav Gilad¹

University of Chicago¹, University of Basel², Swiss Institute of Bioinformatics³, Pompeu Fabra University⁴

17 Jan 2020-Genome Research

TL;DR: It is estimated that, overall, interspecies and inter-tissue differences in gene expression levels can only modestly be accounted for by corresponding differences in promoter DNA methylation. However, the expression pattern of genes with conserved inter-tissue expression differences can more often be explained by corresponding interspecies methylation changes.

Abstract: Previously published comparative functional genomic data sets from primates using frozen tissue samples, including many data sets from our own group, were often collected and analyzed using nonoptimal study designs and analysis approaches. In addition, when samples from multiple tissues were studied in a comparative framework, individuals and tissues were confounded. We designed a multitissue comparative study of gene expression and DNA methylation in primates that minimizes confounding effects by using a balanced design with respect to species, tissues, and individuals. We also developed a comparative analysis pipeline that minimizes biases attributable to sequence divergence. Thus, we present the most comprehensive catalog of similarities and differences in gene expression and DNA methylation levels between livers, kidneys, hearts, and lungs, in humans, chimpanzees, and rhesus macaques. We estimate that overall, interspecies and inter-tissue differences in gene expression levels can only modestly be accounted for by corresponding differences in promoter DNA methylation. However, the expression pattern of genes with conserved inter-tissue expression differences can be explained by corresponding interspecies methylation changes more often. Finally, we show that genes whose tissue-specific regulatory patterns are consistent with the action of natural selection are highly connected in both gene regulatory and protein-protein interaction networks.

84 citations

Journal Article

Single-cell-resolution transcriptome map of human, chimpanzee, bonobo, and macaque brains.

Ekaterina Khrameeva¹, Ilia Kurochkin¹, Dingding Han², Patricia Guijarro³, Sabina Kanton⁴, Malgorzata Santel⁴, Zhengzong Qian³, Shen Rong³, Pavel V. Mazin¹,⁵, Marat Sabirov⁵, Matvei Bulat¹, Olga Efimova¹, Anna Tkachev¹,⁵, Song Guo¹,³, Chet C. Sherwood⁶, J. Gray Camp, Svante Pääbo⁴, Barbara Treutlein⁷, Philipp Khaitovich

Skolkovo Institute of Science and Technology¹, Guangzhou Medical University², CAS-MPG Partner Institute for Computational Biology³, Max Planck Society⁴, Russian Academy of Sciences⁵, George Washington University⁶, École Polytechnique Fédérale de Lausanne⁷

01 May 2020-Genome Research

TL;DR: Comparison of the bulk tissue and single nuclei sequencing revealed that conventional RNA sequencing did not detect up to two-thirds of cell-type-specific evolutionary differences in the human evolutionary lineage.

Abstract: Identification of gene expression traits unique to the human brain sheds light on the molecular mechanisms underlying human evolution. Here, we searched for uniquely human gene expression traits by analyzing 422 brain samples from humans, chimpanzees, bonobos, and macaques representing 33 anatomical regions, as well as 88,047 cell nuclei composing three of these regions. Among 33 regions, cerebral cortex areas, hypothalamus, and cerebellar gray and white matter evolved rapidly in humans. At the cellular level, astrocytes and oligodendrocyte progenitors displayed more differences in the human evolutionary lineage than the neurons. Comparison of the bulk tissue and single-nuclei sequencing revealed that conventional RNA sequencing did not detect up to two-thirds of cell-type-specific evolutionary differences.

83 citations

Journal Article

Single-cell analysis of human embryos reveals diverse patterns of aneuploidy and mosaicism.

Margaret R. Starostik¹, Olukayode A. Sosina¹, Rajiv C. McCoy¹

Johns Hopkins University¹

01 Jun 2020-Genome Research

TL;DR: This work provides a high-resolution view of aneuploidy in preimplantation embryos, and supports the conclusion that low-level mosaicism is a common feature of early human development.

Abstract: Less than half of human zygotes survive to birth, primarily due to aneuploidies of meiotic or mitotic origin. Mitotic errors generate chromosomal mosaicism, defined by multiple cell lineages with distinct chromosome complements. The incidence and impacts of mosaicism in human embryos remain controversial, with most previous studies based on bulk DNA assays or comparisons of multiple biopsies of few embryonic cells. Single-cell genomic data provide an opportunity to quantify mosaicism on an embryo-wide scale. To this end, we extended an approach to infer aneuploidies based on dosage-associated changes in gene expression by integrating signatures of allelic imbalance. We applied this method to published single-cell RNA sequencing data from 74 human embryos, spanning the morula to blastocyst stages. Our analysis revealed widespread mosaic aneuploidies, with 59 of 74 (80%) embryos harboring at least one putative aneuploid cell (1% FDR). By clustering copy number calls, we reconstructed histories of chromosome segregation, inferring that 55 (74%) embryos possessed mitotic aneuploidies and 23 (31%) embryos possessed meiotic aneuploidies. We found no significant enrichment of aneuploid cells in the trophectoderm compared to the inner cell mass, although we do detect such enrichment in data from later postimplantation stages. Finally, we observed that aneuploid cells up-regulate immune response genes and down-regulate genes involved in proliferation, metabolism, and protein processing, consistent with stress responses documented in other stages and systems. Together, our work provides a high-resolution view of aneuploidy in preimplantation embryos, and supports the conclusion that low-level mosaicism is a common feature of early human development.

78 citations

Journal Article

Assembly-free single-molecule sequencing recovers complete virus genomes from natural microbial communities.

John Beaulaurier, Elaine Luo¹, John M. Eppley¹, Paul Den Uyl¹, Xiaoguang Dai, Andrew Burger¹, Daniel J. Turner, Matthew Pendelton, Sissel Juul, Eoghan D. Harrington, Edward F. DeLong¹

University of Hawaii¹

19 Feb 2020-Genome Research

TL;DR: An assembly-free, single-molecule nanopore sequencing approach enabling direct recovery of complete viral genome sequences from environmental samples is developed, which can provide previously unavailable information about the genome structures, population biology, and ecology of naturally occurring viruses and viral parasites.

Abstract: Viruses are the most abundant biological entities on Earth and play key roles in host ecology, evolution, and horizontal gene transfer. Despite recent progress in viral metagenomics, the inherent genetic complexity of virus populations still poses technical difficulties for recovering complete virus genomes from natural assemblages. To address these challenges, we developed an assembly-free, single-molecule nanopore sequencing approach, enabling direct recovery of complete virus genome sequences from environmental samples. Our method yielded thousands of full-length, high-quality draft virus genome sequences that were not recovered using standard short-read assembly approaches. Additionally, our analyses discriminated between populations whose genomes had identical direct terminal repeats versus those with circularly permuted repeats at their termini, thus providing new insight into native virus reproduction and genome packaging. Novel DNA sequences were discovered, whose repeat structures, gene contents, and concatemer lengths suggest they are phage-inducible chromosomal islands, which are packaged as concatemers in phage particles, with lengths that match the size ranges of co-occurring phage genomes. Our new virus sequencing strategy can provide previously unavailable information about the genome structures, population biology, and ecology of naturally occurring viruses and viral parasites.

76 citations

Journal Article

Sequencing identifies multiple early introductions of SARS-CoV-2 to the New York City region.

United States Department of Energy Office of Science¹, Imperial College London²

22 Oct 2020-Genome Research

TL;DR: Analysis of 864 SARS-CoV-2 sequences from cases in the New York City metropolitan area during the COVID-19 outbreak in spring 2020 showed that early transmission was most linked to cases from Europe.

Abstract: Effective public response to a pandemic relies upon accurate measurement of the extent and dynamics of an outbreak. Viral genome sequencing has emerged as a powerful approach to link seemingly unrelated cases, and large-scale sequencing surveillance can inform on critical epidemiological parameters. Here, we report the analysis of 864 SARS-CoV-2 sequences from cases in the New York City metropolitan area during the COVID-19 outbreak in spring 2020. The majority of cases had no recent travel history or known exposure, and genetically linked cases were spread throughout the region. Comparison to global viral sequences showed that early transmission was most linked to cases from Europe. Our data are consistent with numerous seeds from multiple sources and a prolonged period of unrecognized community spreading. This work highlights the complementary role of genomic surveillance in addition to traditional epidemiological indicators.

Journal Article

Genome variation and population structure among 1142 mosquitoes of the African malaria vector species Anopheles gambiae and Anopheles coluzzii

Anopheles Gambiae Genomes¹

Wellcome Trust Sanger Institute¹

28 Sep 2020-Genome Research

TL;DR: This data resource provides a foundation for developing new operational systems for molecular surveillance and for accelerating research and development of new vector control tools, and provides a unique resource for the study of population genomics and evolutionary biology in eukaryotic species with high levels of genetic diversity under strong anthropogenic evolutionary pressures.

Abstract: Mosquito control remains a central pillar of efforts to reduce malaria burden in sub-Saharan Africa. However, insecticide resistance is entrenched in malaria vector populations, and countries with a high malaria burden face a daunting challenge to sustain malaria control with a limited set of surveillance and intervention tools. Here we report on the second phase of a project to build an open resource of high-quality data on genome variation among natural populations of the major African malaria vector species Anopheles gambiae and Anopheles coluzzii. We analyzed whole genomes of 1142 individual mosquitoes sampled from the wild in 13 African countries, as well as a further 234 individuals comprising parents and progeny of 11 laboratory crosses. The data resource includes high-confidence single-nucleotide polymorphism (SNP) calls at 57 million variable sites, genome-wide copy number variation (CNV) calls, and haplotypes phased at biallelic SNPs. We use these data to analyze genetic population structure and characterize genetic diversity within and between populations. We illustrate the utility of these data by investigating species differences in isolation by distance, genetic variation within proposed gene drive target sequences, and patterns of resistance to pyrethroid insecticides. This data resource provides a foundation for developing new operational systems for molecular surveillance and for accelerating research and development of new vector control tools. It also provides a unique resource for the study of population genomics and evolutionary biology in eukaryotic species with high levels of genetic diversity under strong anthropogenic evolutionary pressures.

Journal Article

Deep neural networks for interpreting RNA-binding protein target preferences.

Mahsa Ghanbari¹, Uwe Ohler¹,²

Max Delbrück Center for Molecular Medicine¹, Humboldt University of Berlin²

28 Jan 2020-Genome Research

TL;DR: A multitask and multimodal deep neural network is designed for characterizing in vivo RBP targets; by learning across multiple RBPs at once, it avoids experimental biases and identifies the RNA sequence motifs and transcript context patterns that are most important for the predictions of each individual RBP.

Abstract: Deep learning has become a powerful paradigm to analyze the binding sites of regulatory factors including RNA-binding proteins (RBPs), owing to its strength to learn complex features from possibly multiple sources of raw data. However, the interpretability of these models, which is crucial to improve our understanding of RBP binding preferences and functions, has not yet been investigated in significant detail. We have designed a multitask and multimodal deep neural network for characterizing in vivo RBP targets. The model incorporates not only the sequence but also the region type of the binding sites as input, which helps the model to boost the prediction performance. To interpret the model, we quantified the contribution of the input features to the predictive score of each RBP. Learning across multiple RBPs at once, we are able to avoid experimental biases and to identify the RNA sequence motifs and transcript context patterns that are the most important for the predictions of each individual RBP. Our findings are consistent with known motifs and binding behaviors and can provide new insights about the regulatory functions of RBPs.

Journal Article

Comprehensive analysis of structural variants in breast cancer genomes using single-molecule sequencing.

Sergey Aganezov¹, Sara Goodwin², Rachel M. Sherman¹, Fritz J. Sedlazeck³, Gayatri Arun², Sonam Bhatia², Isac Lee¹, Melanie Kirsche¹, Robert Wappel², Melissa Kramer², Karen Kostroff, David L. Spector², Winston Timp¹, W. Richard McCombie², Michael C. Schatz¹,²

Johns Hopkins University¹, Cold Spring Harbor Laboratory², Baylor College of Medicine³

04 Sep 2020-Genome Research

TL;DR: This work performs whole-genome sequencing of the SKBR3 breast cancer cell line and of patient-derived tumor and normal organoids from two breast cancer patients, infers SVs and large-scale allele-specific copy number variants (CNVs) using an ensemble of methods, and highlights the need for long-read sequencing of cancer genomes for the precise analysis of their genetic instability.

Abstract: Improved identification of structural variants (SVs) in cancer can lead to more targeted and effective treatment options as well as advance our basic understanding of the disease and its progression. We performed whole-genome sequencing of the SKBR3 breast cancer cell line and patient-derived tumor and normal organoids from two breast cancer patients using Illumina/10x Genomics, Pacific Biosciences (PacBio), and Oxford Nanopore Technologies (ONT) sequencing. We then inferred SVs and large-scale allele-specific copy number variants (CNVs) using an ensemble of methods. Our findings show that long-read sequencing allows for substantially more accurate and sensitive SV detection, with between 90% and 95% of variants supported by each long-read technology also supported by the other. We also report high accuracy for long reads even at relatively low coverage (25×-30×). Furthermore, we integrated SV and CNV data into a unifying karyotype-graph structure to present a more accurate representation of the mutated cancer genomes. We find hundreds of variants within known cancer-related genes detectable only through long-read sequencing. These findings highlight the need for long-read sequencing of cancer genomes for the precise analysis of their genetic instability.

Journal Article

Heterodimerization of TFAP2 pioneer factors drives epigenomic remodeling during neural crest specification.

Megan Rothstein¹, Marcos Simoes-Costa¹

Cornell University¹

01 Jan 2020-Genome Research

TL;DR: The findings show how pioneer factors regulate distinct genomic targets in a stage-specific manner, and highlight how paralogy can serve as an evolutionary strategy to diversify the function of the regulators that control embryonic development.

Abstract: Cell fate commitment involves the progressive restriction of developmental potential. Recent studies have shown that this process requires not only shifts in gene expression but also an extensive remodeling of the epigenomic landscape. To examine how chromatin states are reorganized during cellular specification in an in vivo system, we examined the function of pioneer factor TFAP2A at discrete stages of neural crest development. Our results show that TFAP2A activates distinct sets of genomic regions during induction of the neural plate border and specification of neural crest cells. Genomic occupancy analysis revealed that the repertoire of TFAP2A targets depends upon its dimerization with paralogous proteins TFAP2C and TFAP2B. During gastrula stages, TFAP2A/C heterodimers activate components of the neural plate border induction program. As neurulation begins, TFAP2A trades partners, and TFAP2A/B heterodimers reorganize the epigenomic landscape of progenitor cells to promote neural crest specification. We propose that this molecular switch acts to drive progressive cell commitment, remodeling the epigenomic landscape to define the presumptive neural crest. Our findings show how pioneer factors regulate distinct genomic targets in a stage-specific manner and highlight how paralogy can serve as an evolutionary strategy to diversify the function of the regulators that control embryonic development.

Journal Article•DOI•

The full-length transcriptome of C. elegans using direct RNA sequencing.

Nathan P. Roach¹, Norah Sadowski¹, Amelia F. Alessi¹, Winston Timp¹, James Taylor¹, John Kim¹•Institutions (1)

Johns Hopkins University¹

01 Feb 2020-Genome Research

TL;DR: Nanopore-based direct RNA sequencing is applied to characterize the developmental polyadenylated transcriptome of C. elegans, providing support for 23,865 splice isoforms across 14,611 genes without the need for computational reconstruction of gene models, and showing that poly(A) tail lengths of transcripts vary across development, as do the strengths of previously reported correlations between poly(A) tail length and expression level.

Abstract: Current transcriptome annotations have largely relied on short read lengths intrinsic to the most widely used high-throughput cDNA sequencing technologies. For example, in the annotation of the Caenorhabditis elegans transcriptome, more than half of the transcript isoforms lack full-length support and instead rely on inference from short reads that do not span the full length of the isoform. We applied nanopore-based direct RNA sequencing to characterize the developmental polyadenylated transcriptome of C. elegans. Taking advantage of long reads spanning the full length of mRNA transcripts, we provide support for 23,865 splice isoforms across 14,611 genes, without the need for computational reconstruction of gene models. Of the isoforms identified, 3452 are novel splice isoforms not present in the WormBase WS265 annotation. Furthermore, we identified 16,342 isoforms in the 3' untranslated region (3' UTR), 2640 of which are novel and do not fall within 10 bp of existing 3'-UTR data sets and annotations. Combining 3' UTRs and splice isoforms, we identified 28,858 full-length transcript isoforms. We also determined that poly(A) tail lengths of transcripts vary across development, as do the strengths of previously reported correlations between poly(A) tail length and expression level, and poly(A) tail length and 3'-UTR length. Finally, we have formatted this data as a publicly accessible track hub, enabling researchers to explore this data set easily in a genome browser.
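
The "within 10 bp" novelty criterion for 3'-UTR isoforms can be illustrated with a small sketch. This is not the authors' pipeline: the coordinates are invented, the function name `is_novel` is hypothetical, and chromosome and strand are ignored for brevity. It only shows how an observed 3' end might be flagged as novel when no annotated end lies within a fixed window.

```python
from bisect import bisect_left

def is_novel(site, annotated_sorted, window=10):
    """Return True if no annotated position lies within `window` bp of `site`.

    `annotated_sorted` must be sorted in ascending order.
    """
    # Index of the first annotated position >= site - window.
    i = bisect_left(annotated_sorted, site - window)
    # The site is "known" if that position is also <= site + window.
    return not (i < len(annotated_sorted) and annotated_sorted[i] <= site + window)

# Hypothetical coordinates on a single chromosome and strand.
annotated = sorted([1000, 1500, 2300])
observed = [1004, 1511, 2290, 3000]

novel = [s for s in observed if is_novel(s, annotated)]
print(novel)  # [1511, 3000]
```

Sorting the annotation once and using binary search keeps each lookup logarithmic, which matters when many thousands of observed ends are compared against a large annotation.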

Journal Article•DOI•

Cross-species analysis of enhancer logic using deep learning

Liesbeth Minnoye¹, Ibrahim Ihsan Taskiran¹, David Mauduit¹, Maurizio Fazio²,³, Linde Van Aerschot¹, Gert Hulselmans¹, Valerie Christiaens¹, Samira Makhzami¹, Monika Seltenhammer⁴,⁵, Panagiotis Karras¹, Aline Primot⁶, Edouard Cadieu⁶, Ellen van Rooijen²,³, Jean-Christophe Marine¹, Giorgia Egidy⁷, Ghanem Elias Ghanem⁸, Leonard I. Zon²,³, Jasper Wouters¹, Stein Aerts¹•Institutions (8)

Katholieke Universiteit Leuven¹, Harvard University², Howard Hughes Medical Institute³, University of Natural Resources and Life Sciences, Vienna⁴, Medical University of Vienna⁵, University of Rennes⁶, Université Paris-Saclay⁷, Université libre de Bruxelles⁸

30 Jul 2020-Genome Research

TL;DR: The combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models is explored, and the accuracy of DeepMEL predictions on the CAGI5 challenge is shown, where it significantly outperforms existing models on the melanoma enhancer of IRF4.

Abstract: Deciphering the genomic regulatory code of enhancers is a key challenge in biology because this code underlies cellular identity. A better understanding of how enhancers work will improve the interpretation of noncoding genome variation and empower the generation of cell type-specific drivers for gene therapy. Here, we explore the combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models. We apply this strategy to decipher the enhancer code in melanoma, a relevant case study owing to the presence of distinct melanoma cell states. We trained and validated a deep learning model, called DeepMEL, using chromatin accessibility data of 26 melanoma samples across six different species. We show the accuracy of DeepMEL predictions on the CAGI5 challenge, where it significantly outperforms existing models on the melanoma enhancer of IRF4. Next, we exploit DeepMEL to analyze enhancer architectures and identify accurate transcription factor binding sites for the core regulatory complexes in the two different melanoma states, with distinct roles for each transcription factor, in terms of nucleosome displacement or enhancer activation. Finally, DeepMEL identifies orthologous enhancers across distantly related species, where sequence alignment fails, and the model highlights specific nucleotide substitutions that underlie enhancer turnover. DeepMEL can be used from the Kipoi database to predict and optimize candidate enhancers and to prioritize enhancer mutations. In addition, our computational strategy can be applied to other cancer or normal cell types.

Journal Article•DOI•

netNMF-sc: leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis.

Rebecca Elyanow¹, Bianca Dumitrascu², Barbara E. Engelhardt², Benjamin J. Raphael²•Institutions (2)

Brown University¹, Princeton University²

28 Jan 2020-Genome Research

TL;DR: netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization, which encourages pairs of genes with known interactions to be near each other in the low-dimensional representation.

Abstract: Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in single cells. However, because of technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells in a lower-dimensional space, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization. The network regularization takes advantage of prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be nearby each other in the low-dimensional representation. The resulting matrix factorization imputes gene abundance for both zero and nonzero counts and can be used to cluster cells into meaningful subpopulations. We show that netNMF-sc outperforms existing methods at clustering cells and estimating gene-gene covariance using both simulated and real scRNA-seq data, with increasing advantages at higher dropout rates (e.g., >60%). We also show that the results from netNMF-sc are robust to variation in the input network, with more representative networks leading to greater performance gains.
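
The idea of network-regularized non-negative matrix factorization can be sketched in a few lines of NumPy. The code below is a generic graph-regularized NMF with multiplicative updates, chosen here for brevity; it is an assumption for illustration and not the netNMF-sc objective, optimizer, or code. The function name `graph_regularized_nmf`, the toy count matrix, the toy gene network, and the penalty weight `lam` are all hypothetical.

```python
import numpy as np

def graph_regularized_nmf(X, A, k, lam=1.0, n_iter=200, seed=0, eps=1e-9):
    """Factor X (genes x cells) as W @ H with W, H >= 0.

    Minimizes ||X - W H||_F^2 + lam * tr(W.T @ L @ W), where L = D - A is the
    Laplacian of the gene-gene adjacency matrix A. The penalty pulls the
    low-dimensional rows of W for connected genes toward each other.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_cells)) + eps
    D = np.diag(A.sum(axis=1))
    L = D - A
    losses = []
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative.
        W *= (X @ H.T + lam * (A @ W)) / (W @ (H @ H.T) + lam * (D @ W) + eps)
        H *= (W.T @ X) / ((W.T @ W) @ H + eps)
        losses.append(np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ L @ W))
    return W, H, losses

# Hypothetical toy data: 6 genes x 8 cells, with two small gene modules.
rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(6, 8)).astype(float)
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (3, 4), (4, 5)]:
    A[i, j] = A[j, i] = 1.0

W, H, losses = graph_regularized_nmf(X, A, k=2)
imputed = W @ H  # dense estimate for every entry, including observed zeros
```

The product `W @ H` is dense, which is how a factorization of this kind yields values for zero counts as well as nonzero ones.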

Journal Article•DOI•

Ultralow-input single-tube linked-read library method enables short-read second-generation sequencing systems to routinely generate highly accurate and economical long-range sequencing information.

Zhoutao Chen, Long Pham, Wu Tsai-Chin, Guoya Mo, Yu Xia, Peter L. Chang, Devin Porter, Tan Phan, Huu Che, Hao Tran¹, Vikas Bansal², Justin P. Shaffer², Pedro Belda-Ferre², Greg Humphrey², Rob Knight², Pavel A. Pevzner², Son Pham, Yong Wang, Ming Lei•Institutions (2)

Vietnam National University, Ho Chi Minh City¹, University of California, San Diego²

01 Jun 2020-Genome Research

TL;DR: A single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-seq) technology is developed, which enables a low-cost, high-accuracy, and high-throughput short-read second-generation sequencer to generate over 100 kb of long-range sequencing information with as little as 0.1 ng of input material.

Abstract: Long-range sequencing information is required for haplotype phasing, de novo assembly, and structural variation detection. Current long-read sequencing technologies can provide valuable long-range information but at a high cost with low accuracy and high DNA input requirements. We have developed a single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-seq) technology, which enables a low-cost, high-accuracy, and high-throughput short-read second-generation sequencer to generate over 100 kb of long-range sequencing information with as little as 0.1 ng input material. In a PCR tube, millions of clonally barcoded beads are used to uniquely barcode long DNA molecules in an open bulk reaction without dilution and compartmentation. The barcoded linked-reads are used to successfully assemble genomes ranging from microbes to human. These linked-reads also generate megabase-long phased blocks and provide a cost-effective tool for detecting structural variants in a genome, which are important to identify compound heterozygosity in recessive Mendelian diseases and discover genetic drivers and diagnostic biomarkers in cancers.

Journal Article•DOI•

Analysis of Hi-C data using SIP effectively identifies loops in organisms from C. elegans to mammals

M. Jordan Rowley¹, Axel Poulet¹, Michael H. Nichols¹, Brianna J. Bixler¹, Adrian L. Sanborn², Elizabeth A. Brouhard³, Karen Hermetz¹, Hannah Linsenbaum¹, Györgyi Csankovszki³, Erez Lieberman Aiden²,⁴, Victor G. Corces¹•Institutions (4)

Emory University¹, Baylor College of Medicine², University of Michigan³, Rice University⁴

03 Mar 2020-Genome Research

TL;DR: It is shown that SIP is resistant to noise and sequencing depth and can be used to detect loops that were previously missed in human cells as well as loops in other organisms, and that it can lead to biological insights by characterizing the contribution of several transcription factors to CTCF loop stability in human cells.

Abstract: Chromatin loops are a major component of 3D nuclear organization, visually apparent as intense point-to-point interactions in Hi-C maps. Identification of these loops is a critical part of most Hi-C analyses. However, current methods often miss visually evident CTCF loops in Hi-C data sets from mammals, and they completely fail to identify high intensity loops in other organisms. We present SIP, Significant Interaction Peak caller, and SIPMeta, which are platform independent programs to identify and characterize these loops in a time- and memory-efficient manner. We show that SIP is resistant to noise and sequencing depth, and can be used to detect loops that were previously missed in human cells as well as loops in other organisms. SIPMeta corrects for a common visualization artifact by accounting for Manhattan distance to create average plots of Hi-C and HiChIP data. We then demonstrate that the use of SIP and SIPMeta can lead to biological insights by characterizing the contribution of several transcription factors to CTCF loop stability in human cells. We also annotate loops associated with the SMC component of the dosage compensation complex (DCC) in Caenorhabditis elegans and demonstrate that loop anchors represent bidirectional blocks for symmetrical loop extrusion. This is in contrast to the asymmetrical extrusion until unidirectional blockage by CTCF that is presumed to occur in mammals. Using HiChIP and multiway ligation events, we then show that DCC loops form a network of strong interactions that may contribute to X Chromosome-wide condensation in C. elegans hermaphrodites.

Journal Article•DOI•

Detection and characterization of jagged ends of double-stranded DNA in plasma.

Peiyong Jiang¹, Tingting Xie¹, Spencer C. Ding¹, Ze Zhou¹, Suk Hang Cheng¹, Rebecca W.Y. Chan¹, Wing-Shan Lee¹, Wenlei Peng¹, John Wong¹, Vincent Wai-Sun Wong¹, Henry Lik-Yuen Chan¹, Stephen L. Chan¹, Liona C. Poon¹, Tak Yeung Leung¹, K.C. Allen Chan¹, Rossa W.K. Chiu¹, Y.M. Dennis Lo¹•Institutions (1)

The Chinese University of Hong Kong¹

01 Aug 2020-Genome Research

TL;DR: Plasma DNA jagged ends represent an intrinsic property of plasma DNA and provide a link between nuclease activities and the fragmentation of plasma DNA.

Abstract: Cell-free DNA in plasma has been used for noninvasive prenatal testing and cancer liquid biopsy. The physical properties of cell-free DNA fragments in plasma, such as fragment sizes and ends, have attracted much recent interest, leading to the emerging field of cell-free DNA fragmentomics. However, one aspect of plasma DNA fragmentomics as to whether double-stranded plasma molecules might carry single-stranded ends, termed a jagged end in this study, remains underexplored. We have developed two approaches for investigating the presence of jagged ends in a plasma DNA pool. These approaches utilized DNA end repair to introduce differential methylation signals between the original sequence and the jagged ends, depending on whether unmethylated or methylated cytosines were used in the DNA end-repair procedure. The majority of plasma DNA molecules (87.8%) were found to bear jagged ends. The jaggedness varied according to plasma DNA fragment sizes and appeared to be in association with nucleosomal patterns. In the plasma of pregnant women, the jaggedness of fetal DNA molecules was higher than that of the maternal counterparts. The jaggedness of plasma DNA correlated with the fetal DNA fraction. Similarly, in the plasma of cancer patients, tumor-derived DNA molecules in patients with hepatocellular carcinoma showed an elevated jaggedness compared with nontumoral DNA. In mouse models, knocking out of the Dnase1 gene reduced jaggedness, whereas knocking out of the Dnase1l3 gene enhanced jaggedness. Hence, plasma DNA jagged ends represent an intrinsic property of plasma DNA and provide a link between nuclease activities and the fragmentation of plasma DNA.

Journal Article•DOI•

Characterizing and inferring quantitative cell cycle phase in single-cell RNA-seq data analysis.

Chiaowen Joyce Hsiao¹, Po-Yuan Tung¹, John D. Blischak¹, Jonathan E. Burnett¹, Kenneth Barr¹, Kushal K. Dey², Matthew Stephens¹, Yoav Gilad¹•Institutions (2)

University of Chicago¹, Harvard University²

01 Apr 2020-Genome Research

TL;DR: It is found that, on average, scRNA-seq data from only five genes predicted a cell's position on the cell cycle continuum to within 14% of the entire cycle and that using more genes did not improve this accuracy, which can directly help future studies to account for cell cycle-related heterogeneity in iPSCs.

Abstract: Cellular heterogeneity in gene expression is driven by cellular processes, such as cell cycle and cell-type identity, and cellular environment such as spatial location. The cell cycle, in particular, is thought to be a key driver of cell-to-cell heterogeneity in gene expression, even in otherwise homogeneous cell populations. Recent advances in single-cell RNA-sequencing (scRNA-seq) facilitate detailed characterization of gene expression heterogeneity and can thus shed new light on the processes driving heterogeneity. Here, we combined fluorescence imaging with scRNA-seq to measure cell cycle phase and gene expression levels in human induced pluripotent stem cells (iPSCs). By using these data, we developed a novel approach to characterize cell cycle progression. Although standard methods assign cells to discrete cell cycle stages, our method goes beyond this and quantifies cell cycle progression on a continuum. We found that, on average, scRNA-seq data from only five genes predicted a cell's position on the cell cycle continuum to within 14% of the entire cycle and that using more genes did not improve this accuracy. Our data and predictor of cell cycle phase can directly help future studies to account for cell cycle-related heterogeneity in iPSCs. Our results and methods also provide a foundation for future work to characterize the effects of the cell cycle on expression heterogeneity in other cell types.
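
A prediction error expressed as a percentage of the entire cycle implies a circular distance: positions just before and just after the wrap-around point are close, not far apart. The sketch below shows one way such an error could be computed. The abstract does not give the exact definition used in the study, so the function name `circular_error_fraction` and the radian parameterization are assumptions.

```python
import math

def circular_error_fraction(predicted, observed, period=2 * math.pi):
    """Shortest distance between two positions on a cycle,
    expressed as a fraction of the full cycle (between 0 and 0.5)."""
    diff = abs(predicted - observed) % period
    return min(diff, period - diff) / period

# Two positions on either side of the wrap-around point are close together.
near_wrap = circular_error_fraction(0.1, 2 * math.pi - 0.1)
# Two positions on opposite sides of the cycle are maximally far apart.
opposite = circular_error_fraction(0.0, math.pi)
print(round(near_wrap, 4), opposite)  # 0.0318 0.5
```

Under this definition the largest possible error is 50% of the cycle, and a uniformly random guess has an expected error of 25%, which gives some context for an average error of 14%.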

Journal Article•DOI•

A genome-wide long noncoding RNA CRISPRi screen identifies PRANCR as a novel regulator of epidermal homeostasis.

Pengfei Cai¹, Auke B C Otten², Binbin Cheng², Mitsuhiro A. Ishii², Wen Zhang¹, Beibei Huang¹, Kun Qu¹, Bryan K. Sun²•Institutions (2)

University of Science and Technology of China¹, University of California, San Diego²

01 Jan 2020-Genome Research

TL;DR: It is demonstrated that PRANCR is a novel lncRNA regulating epidermal homeostasis, and other lncRNA candidates that may also have roles in this process are identified.

Abstract: Genome-wide association studies indicate that many disease susceptibility regions reside in non-protein-coding regions of the genome. Long noncoding RNAs (lncRNAs) are a major component of the noncoding genome, but their biological impacts are not fully understood. Here, we performed a CRISPR interference (CRISPRi) screen on 2263 epidermis-expressed lncRNAs and identified nine novel candidate lncRNAs regulating keratinocyte proliferation. We further characterized a top hit from the screen, progenitor renewal associated non-coding RNA (PRANCR), using RNA interference-mediated knockdown and phenotypic analysis in organotypic human tissue. PRANCR regulates keratinocyte proliferation, cell cycle progression, and clonogenicity. PRANCR-deficient epidermis displayed impaired stratification with reduced expression of differentiation genes that are altered in human skin diseases, including keratins 1 and 10, filaggrin, and loricrin. Transcriptome analysis showed that PRANCR controls the expression of 1136 genes, with strong enrichment for late cell cycle genes containing a CHR promoter element. In addition, PRANCR depletion led to increased levels of both total and nuclear CDKN1A (also known as p21), which is known to govern both keratinocyte proliferation and differentiation. Collectively, these data show that PRANCR is a novel lncRNA regulating epidermal homeostasis and identify other lncRNA candidates that may have roles in this process as well.

Journal Article•DOI•

SHARP: hyperfast and accurate processing of single-cell RNA-seq data via ensemble random projection.

Shibiao Wan¹, Junil Kim, Kyoung-Jae Won•Institutions (1)

University of Pennsylvania¹

28 Jan 2020-Genome Research

TL;DR: To process large-scale single-cell RNA-sequencing (scRNA-seq) data effectively without excessive distortion during dimension reduction, SHARP is presented, an ensemble random projection-based algorithm which is scalable to clustering 10 million cells.

Abstract: To process large-scale single-cell RNA-sequencing (scRNA-seq) data effectively without excessive distortion during dimension reduction, we present SHARP, an ensemble random projection-based algorithm that is scalable to clustering 10 million cells. Comprehensive benchmarking tests on 17 public scRNA-seq data sets show that SHARP outperforms existing methods in terms of speed and accuracy. Particularly, for large-size data sets (more than 40,000 cells), SHARP runs faster than other competitors while maintaining high clustering accuracy and robustness. To the best of our knowledge, SHARP is the only R-based tool that is scalable to clustering scRNA-seq data with 10 million cells.
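
Random projection reduces dimensionality by multiplying the data by a random matrix, and an ensemble repeats this with several independent matrices before combining the results. The sketch below illustrates only that general idea on synthetic data. It is not SHARP itself (which is R-based): the Achlioptas-style sparse matrix, the matrix sizes, and the combination step (averaging squared-distance estimates instead of a clustering-based consensus) are assumptions made to keep the example short.

```python
import numpy as np

def sparse_random_projection(n_features, n_components, rng, s=3):
    """Achlioptas-style projection matrix: entries are +1, 0, -1 with
    probabilities 1/(2s), 1 - 1/s, 1/(2s), scaled so that squared
    distances are preserved in expectation."""
    R = rng.choice([1.0, 0.0, -1.0], size=(n_features, n_components),
                   p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return R * np.sqrt(s / n_components)

def squared_distances(Z):
    """All pairwise squared Euclidean distances between rows of Z."""
    sq = (Z ** 2).sum(axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2 * Z @ Z.T, 0.0)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))  # hypothetical cells x genes matrix

# Ensemble: five independent projections from 2000 to 200 dimensions.
ensemble = [X @ sparse_random_projection(2000, 200, rng) for _ in range(5)]

# Combine the ensemble by averaging the per-projection distance estimates.
approx = np.mean([squared_distances(Z) for Z in ensemble], axis=0)
exact = squared_distances(X)

off_diagonal = ~np.eye(50, dtype=bool)
rel_err = np.abs(approx[off_diagonal] - exact[off_diagonal]) / exact[off_diagonal]
print(rel_err.mean())
```

Each projected matrix is ten times narrower than the input, so downstream steps such as clustering operate on much smaller data, while averaging across the ensemble reduces the distortion introduced by any single projection.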

Journal Article•DOI•

Integrated annotations and analyses of small RNA-producing loci from 47 diverse plants.

Alice Lunardon¹, Nathan R. Johnson¹, Emily Hagerott², Tamia Phifer², Seth Polydore¹, Ceyda Coruh¹, Michael J. Axtell¹•Institutions (2)

Pennsylvania State University¹, Knox College²

16 Mar 2020-Genome Research

TL;DR: A large data set of published and newly generated sRNA sequencing data and a uniform bioinformatic pipeline are used to produce comprehensive sRNA locus annotations of 47 diverse plants, yielding more than 2.7 million sRNA loci.

Abstract: Plant endogenous small RNAs (sRNAs) are important regulators of gene expression. There are two broad categories of plant sRNAs: microRNAs (miRNAs) and endogenous short interfering RNAs (siRNAs). MicroRNA loci are relatively well-annotated but compose only a small minority of the total sRNA pool; siRNA locus annotations have lagged far behind. Here, we used a large data set of published and newly generated sRNA sequencing data (1333 sRNA-seq libraries containing more than 20 billion reads) and a uniform bioinformatic pipeline to produce comprehensive sRNA locus annotations of 47 diverse plants, yielding more than 2.7 million sRNA loci. The two most numerous classes of siRNA loci produced mainly 24- and 21-nucleotide (nt) siRNAs, respectively. Most often, 24-nt-dominated siRNA loci occurred in intergenic regions, especially at the 5'-flanking regions of protein-coding genes. In contrast, 21-nt-dominated siRNA loci were most often derived from double-stranded RNA precursors copied from spliced mRNAs. Genic 21-nt-dominated loci were especially common from disease resistance genes, including from a large number of monocots. Individual siRNA sequences of all types showed very little conservation across species, whereas mature miRNAs were more likely to be conserved. We developed a web server where our data and several search and analysis tools are freely accessible.
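
Describing a locus as "24-nt-dominated" or "21-nt-dominated" amounts to asking which read length is most abundant among the reads aligned there. The sketch below shows one simple way to express that. The function name `dominant_size`, the 50% threshold, and the example read lengths are hypothetical and are not taken from the authors' pipeline.

```python
from collections import Counter

def dominant_size(read_lengths, min_fraction=0.5):
    """Return the most common read length at a locus if it accounts for at
    least `min_fraction` of the reads, otherwise None."""
    if not read_lengths:
        return None
    length, count = Counter(read_lengths).most_common(1)[0]
    return length if count / len(read_lengths) >= min_fraction else None

# Hypothetical read-length lists for three loci.
loci = {
    "locus_A": [24] * 8 + [21, 22],
    "locus_B": [21] * 6 + [22, 24, 24, 20],
    "locus_C": [20, 21, 22, 23, 24],
}
labels = {name: dominant_size(lengths) for name, lengths in loci.items()}
print(labels)  # {'locus_A': 24, 'locus_B': 21, 'locus_C': None}
```

A locus with no clearly dominant size, such as `locus_C` here, is left unlabeled instead of being forced into a class.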

Journal Article•DOI•

Transcriptome-wide sites of collided ribosomes reveal principles of translational pausing.

Alaaddin Bulak Arpat¹,², Angélica Liechti¹, Mara De Matos¹, René Dreos¹,², Peggy Janich¹, David Gatfield¹•Institutions (2)

University of Lausanne¹, Swiss Institute of Bioinformatics²

23 Jul 2020-Genome Research

TL;DR: It is uncovered that the stacking of an elongating onto a paused ribosome occurs frequently and scales with translation rate, trapping ∼10% of translating ribosomes in the disome state.

Abstract: Translation initiation is the major regulatory step defining the rate of protein production from an mRNA. Meanwhile, the impact of nonuniform ribosomal elongation rates is largely unknown. Using a modified ribosome profiling protocol based on footprints from two closely packed ribosomes (disomes), we have mapped ribosomal collisions transcriptome-wide in mouse liver. We uncover that the stacking of an elongating onto a paused ribosome occurs frequently and scales with translation rate, trapping ∼10% of translating ribosomes in the disome state. A distinct class of pause sites is indicative of deterministic pausing signals. Pause site association with specific amino acids, peptide motifs, and nascent polypeptide structure is suggestive of programmed pausing as a widespread mechanism associated with protein folding. Evolutionary conservation at disome sites indicates functional relevance of translational pausing. Collectively, our disome profiling approach allows unique insights into gene regulation occurring at the step of translation elongation.

Journal Article•DOI•

Redundant and specific roles of cohesin STAG subunits in chromatin looping and transcriptional control

Valentina Casa¹, Macarena Moronta Gines¹, Eduardo G. Gusmao²,³, Johan A. Slotman¹, Anne Zirkel³, Natasa Josipovic²,³, Edwin Oole¹, Wilfred F. J. van IJcken¹, Adriaan B. Houtsmuller¹, Argyris Papantonis²,³, Kerstin S. Wendt¹•Institutions (3)

Erasmus University Rotterdam¹, University of Göttingen², University of Cologne³

01 Jan 2020-Genome Research

TL;DR: High-resolution Hi-C, gene expression, and sequential ChIP studies show that STAG1 and STAG2 do not co-occupy individual binding sites and affect looping and gene expression in distinct ways.

Abstract: Cohesin is a ring-shaped multiprotein complex that is crucial for 3D genome organization and transcriptional regulation during differentiation and development. It also confers sister chromatid cohesion and facilitates DNA damage repair. Besides its core subunits SMC3, SMC1A, and RAD21, cohesin in somatic cells contains one of two orthologous STAG subunits, STAG1 or STAG2. How these variable subunits affect the function of the cohesin complex is still unclear. STAG1- and STAG2-cohesin were initially proposed to organize cohesion at telomeres and centromeres, respectively. Here, we uncover redundant and specific roles of STAG1 and STAG2 in gene regulation and chromatin looping using HCT116 cells with an auxin-inducible degron (AID) tag fused to either STAG1 or STAG2. Following rapid depletion of either subunit, we perform high-resolution Hi-C, gene expression, and sequential ChIP studies to show that STAG1 and STAG2 do not co-occupy individual binding sites and have distinct ways by which they affect looping and gene expression. These findings are further supported by single-molecule localizations via direct stochastic optical reconstruction microscopy (dSTORM) super-resolution imaging. Since somatic and congenital mutations of the STAG subunits are associated with cancer (STAG2) and intellectual disability syndromes with congenital abnormalities (STAG1 and STAG2), we verified STAG1-/STAG2-dependencies using human neural stem cells, hence highlighting their importance in particular disease contexts.

Journal Article•DOI•

An efficient RNA-seq-based segregation analysis identifies the sex chromosomes of Cannabis sativa.

Djivan Prentout¹, Olga V. Razumova², Bénédicte Rhoné¹,³, Hélène Badouin¹, Hélène Henri¹, Cong Feng⁴, Jos Käfer¹, Gennady I. Karlov, Gabriel A. B. Marais¹•Institutions (4)

University of Lyon¹, Russian Academy of Sciences², Institut de recherche pour le développement³, Chongqing Medical University⁴

01 Feb 2020-Genome Research

TL;DR: This study RNA-sequenced a C. sativa family and identified >500 sex-linked genes and revealed that old plant sex chromosomes can have large, highly divergent nonrecombining regions, yet still be roughly homomorphic.

Abstract: Cannabis sativa-derived tetrahydrocannabinol (THC) production is increasing very fast worldwide. C. sativa is a dioecious plant with XY Chromosomes, and only females (XX) are useful for THC production. Identifying the sex chromosome sequence would improve early sexing and better management of this crop; however, the C. sativa genome projects have failed to do so. Moreover, as dioecy in the Cannabaceae family is ancestral, C. sativa sex chromosomes are potentially old and thus very interesting to study, as little is known about old plant sex chromosomes. Here, we RNA-sequenced a C. sativa family (two parents and 10 male and female offspring, 576 million reads) and performed a segregation analysis for all C. sativa genes using the probabilistic method SEX-DETector. We identified >500 sex-linked genes. Mapping of these sex-linked genes to a C. sativa genome assembly identified the largest chromosome pair being the sex chromosomes. We found that the X-specific region (not recombining between X and Y) is large compared to other plant systems. Further analysis of the sex-linked genes revealed that C. sativa has a strongly degenerated Y Chromosome and may represent the oldest plant sex chromosome system documented so far. Our study revealed that old plant sex chromosomes can have large, highly divergent nonrecombining regions, yet still be roughly homomorphic.

Journal Article•DOI•

Versatile and robust genome editing with Streptococcus thermophilus CRISPR1-Cas9

Daniel Agudelo¹, Sophie Carter¹, Minja Velimirovic¹, Alexis Duringer¹, Jean-François Rivest¹, Sébastien Lévesque¹, Jeremy Loehr¹, Mathilde Mouchiroud¹, Denis Cyr², Paula J. Waters², Mathieu Laplante¹, Sylvain Moineau¹, Adeline Goulet³, Yannick Doyon¹•Institutions (3)

Laval University¹, Centre Hospitalier Universitaire de Sherbrooke², Centre national de la recherche scientifique³

03 Jan 2020-Genome Research

TL;DR: It is found that St1Cas9 strain variants enable targeting to five distinct A-rich PAMs and provide a structural basis for their specificities.

Abstract: Targeting definite genomic locations using CRISPR-Cas systems requires a set of enzymes with unique protospacer adjacent motif (PAM) compatibilities. To expand this repertoire, we engineered nucleases, cytosine base editors, and adenine base editors from the archetypal Streptococcus thermophilus CRISPR1-Cas9 (St1Cas9) system. We found that St1Cas9 strain variants enable targeting to five distinct A-rich PAMs and provide a structural basis for their specificities. The small size of this ortholog enables expression of the holoenzyme from a single adeno-associated viral vector for in vivo editing applications. Delivery of St1Cas9 to the neonatal liver efficiently rewired metabolic pathways, leading to phenotypic rescue in a mouse model of hereditary tyrosinemia. These robust enzymes expand and complement current editing platforms available for tailoring mammalian genomes.

Journal Article•DOI•

Quantitative analysis of Y-Chromosome gene expression across 36 human tissues

Alexander K. Godfrey¹, Sahin Naqvi¹, Lukáš Chmátal¹, Joel M. Chick², Richard N. Mitchell³, Steven P. Gygi², Helen Skaletsky¹,⁴, David C. Page¹,⁴•Institutions (4)

Massachusetts Institute of Technology¹, Harvard University², Brigham and Women's Hospital³, Howard Hughes Medical Institute⁴

27 May 2020-Genome Research

TL;DR: Divergence between the X and Y Chromosomes in regulatory sequence can lead to tissue-specific, Y-Chromosome-driven sex biases in expression of critical, dosage-sensitive regulatory genes.

Abstract: Little is known about how human Y-Chromosome gene expression directly contributes to differences between XX (female) and XY (male) individuals in nonreproductive tissues. Here, we analyzed quantitative profiles of Y-Chromosome gene expression across 36 human tissues from hundreds of individuals. Although it is often said that Y-Chromosome genes are lowly expressed outside the testis, we report many instances of elevated Y-Chromosome gene expression in a nonreproductive tissue. A notable example is EIF1AY, which encodes eukaryotic translation initiation factor 1A Y-linked, together with its X-linked homolog EIF1AX. Evolutionary loss of a Y-linked microRNA target site enabled up-regulation of EIF1AY, but not of EIF1AX, in the heart. Consequently, this essential translation initiation factor is nearly twice as abundant in male as in female heart tissue at the protein level. Divergence between the X and Y Chromosomes in regulatory sequence can therefore lead to tissue-specific Y-Chromosome-driven sex biases in expression of critical, dosage-sensitive regulatory genes.
