Showing papers in "Genome Research in 2020"


Journal ArticleDOI
TL;DR: An overview on how EnteroBase works, what it can do, and its future prospects is provided.
Abstract: EnteroBase is an integrated software environment that supports the identification of global population structures within several bacterial genera that include pathogens. Here, we provide an overview of how EnteroBase works, what it can do, and its future prospects. EnteroBase has currently assembled more than 300,000 genomes from Illumina short reads from Salmonella, Escherichia, Yersinia, Clostridioides, Helicobacter, Vibrio, and Moraxella and genotyped those assemblies by core genome multilocus sequence typing (cgMLST). Hierarchical clustering of cgMLST sequence types allows mapping a new bacterial strain to predefined population structures at multiple levels of resolution within a few hours after uploading its short reads. Case Study 1 illustrates this process for local transmissions of Salmonella enterica serovar Agama between neighboring social groups of badgers and humans. EnteroBase also supports single nucleotide polymorphism (SNP) calls from both genomic assemblies and after extraction from metagenomic sequences, as illustrated by Case Study 2 which summarizes the microevolution of Yersinia pestis over the last 5000 years of pandemic plague. EnteroBase can also provide a global overview of the genomic diversity within an entire genus, as illustrated by Case Study 3, which presents a novel, global overview of the population structure of all of the species, subspecies, and clades within Escherichia.

469 citations
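
The hierarchical clustering of cgMLST sequence types described in the EnteroBase abstract above can be illustrated with a small sketch. This is not EnteroBase's implementation: it assumes that profiles are compared by counting differing alleles and that clusters are formed by single linkage at a chosen threshold. The names `allelic_distance` and `single_linkage_clusters` and the toy profiles are hypothetical.

```python
def allelic_distance(p, q):
    """Count loci where both profiles have a called allele and the alleles differ.

    Assumption: missing calls are encoded as None and are ignored.
    """
    return sum(
        1 for a, b in zip(p, q)
        if a is not None and b is not None and a != b
    )


def single_linkage_clusters(profiles, threshold):
    """Group strains whose allelic distance is <= threshold (single linkage)."""
    names = list(profiles)
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if allelic_distance(profiles[a], profiles[b]) <= threshold:
                parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


# Toy allele profiles over four loci (illustrative values only).
profiles = {"s1": [1, 1, 1, 1], "s2": [1, 1, 1, 2], "s3": [5, 6, 7, 8]}
tight = single_linkage_clusters(profiles, threshold=1)
loose = single_linkage_clusters(profiles, threshold=4)
```

Running the same function at several thresholds gives nested groupings, which is one way to picture "multiple levels of resolution".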


Journal ArticleDOI
TL;DR: This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of 9 complete human centromeric regions, a significant advance towards the complete assembly of human genomes.
Abstract: Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.

342 citations
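
Two ideas from the HiCanu abstract above lend themselves to a short illustration: homopolymer compression and the relation between per-base accuracy and QV. The sketch below is not HiCanu code. It assumes that compression collapses each run of identical bases to a single base and that QV follows the usual Phred scale; the function names are hypothetical.

```python
import math
from itertools import groupby


def homopolymer_compress(seq: str) -> str:
    """Collapse every run of identical bases to a single base."""
    return "".join(base for base, _run in groupby(seq))


def phred_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate."""
    return -10.0 * math.log10(error_rate)


compressed = homopolymer_compress("AAACCGTTTT")  # runs collapse to "ACGT"
qv = phred_qv(1e-5)  # 99.999% accuracy corresponds to an error rate of 1e-5
```

Under this scale, the 99.999% consensus accuracy quoted in the abstract corresponds to QV50.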


Journal ArticleDOI
TL;DR: The analysis of cumulative GC skew identified potential misassemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them, and methods that could be implemented in bioinformatic approaches for curation are discussed to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.
Abstract: Genomes are an integral component of the biological information about an organism; thus, the more complete the genome, the more informative it is. Historically, bacterial and archaeal genomes were reconstructed from pure (monoclonal) cultures, and the first reported sequences were manually curated to completion. However, the bottleneck imposed by the requirement for isolates precluded genomic insights for the vast majority of microbial life. Shotgun sequencing of microbial communities, referred to initially as community genomics and subsequently as genome-resolved metagenomics, can circumvent this limitation by obtaining metagenome-assembled genomes (MAGs); but gaps, local assembly errors, chimeras, and contamination by fragments from other genomes limit the value of these genomes. Here, we discuss genome curation to improve and, in some cases, achieve complete (circularized, no gaps) MAGs (CMAGs). To date, few CMAGs have been generated, although notably some are from very complex systems such as soil and sediment. Through analysis of about 7000 published complete bacterial isolate genomes, we verify the value of cumulative GC skew in combination with other metrics to establish bacterial genome sequence accuracy. The analysis of cumulative GC skew identified potential misassemblies in some reference genomes of isolated bacteria and the repeat sequences that likely gave rise to them. We discuss methods that could be implemented in bioinformatic approaches for curation to ensure that metabolic and evolutionary analyses can be based on very high-quality genomes.

105 citations
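
The cumulative GC skew used in the abstract above can be computed in a few lines. This is a minimal sketch of the standard definition, skew = (G - C) / (G + C) per window, summed along the sequence. The window size and function names are illustrative assumptions, not the authors' pipeline.

```python
from itertools import accumulate


def gc_skew_windows(seq: str, window: int = 1000) -> list:
    """GC skew, (G - C) / (G + C), for consecutive non-overlapping windows."""
    seq = seq.upper()
    skews = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        g, c = chunk.count("G"), chunk.count("C")
        # Windows with no G or C contribute zero skew.
        skews.append((g - c) / (g + c) if (g + c) else 0.0)
    return skews


def cumulative_gc_skew(seq: str, window: int = 1000) -> list:
    """Running sum of the windowed GC skew along the sequence."""
    return list(accumulate(gc_skew_windows(seq, window)))


# Toy sequence: a G-rich window followed by a C-rich window.
curve = cumulative_gc_skew("GGGG" + "CCCC", window=4)
```

For many circular bacterial genomes, the cumulative curve is expected to show a single minimum and a single maximum, typically interpreted as the replication origin and terminus. Abrupt departures from that pattern are the kind of signal that can point to a possible misassembly.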


Journal ArticleDOI
TL;DR: Phylogenetic analysis indicates a TMRCA for SARS-CoV-2 genomes dating to 12 November 2019, thus matching epidemiological records, and reveals a rather uniform mutation occurrence along branches that could have implications for diagnostics and the design of future vaccines.
Abstract: The human pathogen severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is responsible for the major pandemic of the twenty-first century. We analyzed more than 4700 SARS-CoV-2 genomes and associated metadata retrieved from public repositories. SARS-CoV-2 sequences have a high sequence identity (>99.9%), which drops to >96% when compared to bat coronavirus genome. We built a mutation-annotated reference SARS-CoV-2 phylogeny with two main macro-haplogroups, A and B, both of Asian origin, and more than 160 sub-branches representing virus strains of variable geographical origins worldwide, revealing a rather uniform mutation occurrence along branches that could have implications for diagnostics and the design of future vaccines. Identification of the root of SARS-CoV-2 genomes is not without problems, owing to conflicting interpretations derived from either using the bat coronavirus genomes as an outgroup or relying on the sampling chronology of the SARS-CoV-2 genomes and TMRCA estimates; however, the overall scenario favors haplogroup A as the ancestral node. Phylogenetic analysis indicates a TMRCA for SARS-CoV-2 genomes dating to November 12, 2019, thus matching epidemiological records. Sub-haplogroup A2 most likely originated in Europe from an Asian ancestor and gave rise to subclade A2a, which represents the major non-Asian outbreak, especially in Africa and Europe. Multiple founder effect episodes, most likely associated with super-spreader hosts, might explain COVID-19 pandemic to a large extent.

97 citations


Journal ArticleDOI
Jordan A. Ramilowski, Chi Wai Yip, Saumya Agrawal, Jen-Chien Chang, Yari Ciani, Ivan V. Kulakovskiy, Mickaël Mendez, Jasmine Li Ching Ooi, John F. Ouyang, Nicholas J. Parkinson, Andreas Petri, Leonie Roos, Jessica Severin, Kayoko Yasuzawa, Imad Abugessaisa, Altuna Akalin, Ivan Antonov, Erik Arner, Alessandro Bonetti, Hidemasa Bono, Beatrice Borsari, Frank Brombacher, Christopher J. F. Cameron, Carlo Vittorio Cannistraci, Ryan Cardenas, Melissa Cardon, Howard Y. Chang, Josée Dostie, Luca Ducoli, Alexander V. Favorov, Alexandre Fort, Diego Garrido, Noa Gil, Juliette Gimenez, Reto Guler, Lusy Handoko, Jayson Harshbarger, Akira Hasegawa, Yuki Hasegawa, Kosuke Hashimoto, Norihito Hayatsu, Peter Heutink, Tetsuro Hirose, Eddie Luidy Imada, Masayoshi Itoh, Bogumil Kaczkowski, Aditi Kanhere, Emily Kawabata, Hideya Kawaji, Tsugumi Kawashima, S. Thomas Kelly, Miki Kojima, Naoto Kondo, Haruhiko Koseki, Tsukasa Kouno, Anton Kratz, Mariola Kurowska-Stolarska, Andrew T. Kwon, Jeffrey T. Leek, Andreas Lennartsson, Marina Lizio, Fernando López-Redondo, Joachim Luginbühl, Shiori Maeda, Vsevolod J. Makeev, Luigi Marchionni, Yulia A. Medvedeva, Aki Minoda, Ferenc Müller, Manuel Muñoz-Aguirre, Mitsuyoshi Murata, Hiromi Nishiyori, Kazuhiro R. Nitta, Shuhei Noguchi, Yukihiko Noro, Ramil N. Nurtdinov, Yasushi Okazaki, Valerio Orlando, Denis Paquette, Callum J. C. Parr, Owen J. L. Rackham, Patrizia Rizzu, Diego Fernando Sánchez Martinez, Albin Sandelin, Pillay Sanjana, Colin A. Semple, Youtaro Shibayama, Divya M. Sivaraman, Takahiro Suzuki, Suzannah C. Szumowski, Michihira Tagami, Martin S. Taylor, Chikashi Terao, Malte Thodberg, Supat Thongjuea, Vidisha Tripathi, Igor Ulitsky, Roberto Verardo, Ilya E. Vorontsov, Chinatsu Yamamoto, Robert Young, J. Kenneth Baillie, Alistair R. R. Forrest, Roderic Guigó, Michael M. Hoffman, Chung-Chau Hon, Takeya Kasukawa, Sakari Kauppinen, Juha Kere, Boris Lenhard, Claudio Schneider, Harukazu Suzuki, Ken Yagi, Michiel J. L. de Hoon, Jay W. Shin, Piero Carninci
TL;DR: The largest-to-date lncRNA knockdown data set with molecular phenotyping is disseminated for further exploration and functional roles for ZNF213-AS1 and lnc-KHDC3L-2 are highlighted.
Abstract: Long noncoding RNAs (lncRNAs) constitute the majority of transcripts in the mammalian genomes, and yet, their functions remain largely unknown. As part of the FANTOM6 project, we systematically knocked down the expression of 285 lncRNAs in human dermal fibroblasts and quantified cellular growth, morphological changes, and transcriptomic responses using Capped Analysis of Gene Expression (CAGE). Antisense oligonucleotides targeting the same lncRNAs exhibited global concordance, and the molecular phenotype, measured by CAGE, recapitulated the observed cellular phenotypes while providing additional insights on the affected genes and pathways. Here, we disseminate the largest-to-date lncRNA knockdown data set with molecular phenotyping (over 1000 CAGE deep-sequencing libraries) for further exploration and highlight functional roles for ZNF213-AS1 and lnc-KHDC3L-2.

97 citations


Journal ArticleDOI
TL;DR: The results demonstrated that tissue-specific genes are significantly associated with the tissue-relevant biology, and the transcriptome atlas can serve as a primary source for biological interpretation, functional validation, studies of adaptive evolution, and genomic improvement in livestock.
Abstract: By uniformly analyzing 723 RNA-seq data from 91 tissues and cell types, we built a comprehensive gene atlas and studied tissue specificity of genes in cattle. We demonstrated that tissue-specific genes significantly reflected the tissue-relevant biology, showing distinct promoter methylation and evolution patterns (e.g., brain-specific genes evolve slowest, whereas testis-specific genes evolve fastest). Through integrative analyses of those tissue-specific genes with large-scale genome-wide association studies, we detected relevant tissues/cell types and candidate genes for 45 economically important traits in cattle, including blood/immune system (e.g., CCDC88C) for male fertility, brain (e.g., TRIM46 and RAB6A) for milk production, and multiple growth-related tissues (e.g., FGF6 and CCND2) for body conformation. We validated these findings by using epigenomic data across major somatic tissues and sperm. Collectively, our findings provided novel insights into the genetic and biological mechanisms underlying complex traits in cattle, and our transcriptome atlas can serve as a primary source for biological interpretation, functional validation, studies of adaptive evolution, and genomic improvement in livestock.

93 citations


Journal ArticleDOI
TL;DR: It is estimated that, overall, interspecies and inter-tissue differences in gene expression levels can only modestly be accounted for by corresponding differences in promoter DNA methylation; however, the expression pattern of genes with conserved inter-tissue expression differences can more often be explained by corresponding interspecies methylation changes.
Abstract: Previously published comparative functional genomic data sets from primates using frozen tissue samples, including many data sets from our own group, were often collected and analyzed using nonoptimal study designs and analysis approaches. In addition, when samples from multiple tissues were studied in a comparative framework, individuals and tissues were confounded. We designed a multitissue comparative study of gene expression and DNA methylation in primates that minimizes confounding effects by using a balanced design with respect to species, tissues, and individuals. We also developed a comparative analysis pipeline that minimizes biases attributable to sequence divergence. Thus, we present the most comprehensive catalog of similarities and differences in gene expression and DNA methylation levels between livers, kidneys, hearts, and lungs, in humans, chimpanzees, and rhesus macaques. We estimate that overall, interspecies and inter-tissue differences in gene expression levels can only modestly be accounted for by corresponding differences in promoter DNA methylation. However, the expression pattern of genes with conserved inter-tissue expression differences can be explained by corresponding interspecies methylation changes more often. Finally, we show that genes whose tissue-specific regulatory patterns are consistent with the action of natural selection are highly connected in both gene regulatory and protein-protein interaction networks.

84 citations


Journal ArticleDOI
TL;DR: Comparison of the bulk tissue and single nuclei sequencing revealed that conventional RNA sequencing did not detect up to two-thirds of cell-type-specific evolutionary differences in the human evolutionary lineage.
Abstract: Identification of gene expression traits unique to the human brain sheds light on the molecular mechanisms underlying human evolution. Here, we searched for uniquely human gene expression traits by analyzing 422 brain samples from humans, chimpanzees, bonobos, and macaques representing 33 anatomical regions, as well as 88,047 cell nuclei composing three of these regions. Among 33 regions, cerebral cortex areas, hypothalamus, and cerebellar gray and white matter evolved rapidly in humans. At the cellular level, astrocytes and oligodendrocyte progenitors displayed more differences in the human evolutionary lineage than the neurons. Comparison of the bulk tissue and single-nuclei sequencing revealed that conventional RNA sequencing did not detect up to two-thirds of cell-type-specific evolutionary differences.

83 citations


Journal ArticleDOI
TL;DR: This work provides a high-resolution view of aneuploidy in preimplantation embryos, and supports the conclusion that low-level mosaicism is a common feature of early human development.
Abstract: Less than half of human zygotes survive to birth, primarily due to aneuploidies of meiotic or mitotic origin. Mitotic errors generate chromosomal mosaicism, defined by multiple cell lineages with distinct chromosome complements. The incidence and impacts of mosaicism in human embryos remain controversial, with most previous studies based on bulk DNA assays or comparisons of multiple biopsies of few embryonic cells. Single-cell genomic data provide an opportunity to quantify mosaicism on an embryo-wide scale. To this end, we extended an approach to infer aneuploidies based on dosage-associated changes in gene expression by integrating signatures of allelic imbalance. We applied this method to published single-cell RNA sequencing data from 74 human embryos, spanning the morula to blastocyst stages. Our analysis revealed widespread mosaic aneuploidies, with 59 of 74 (80%) embryos harboring at least one putative aneuploid cell (1% FDR). By clustering copy number calls, we reconstructed histories of chromosome segregation, inferring that 55 (74%) embryos possessed mitotic aneuploidies and 23 (31%) embryos possessed meiotic aneuploidies. We found no significant enrichment of aneuploid cells in the trophectoderm compared to the inner cell mass, although we do detect such enrichment in data from later postimplantation stages. Finally, we observed that aneuploid cells up-regulate immune response genes and down-regulate genes involved in proliferation, metabolism, and protein processing, consistent with stress responses documented in other stages and systems. Together, our work provides a high-resolution view of aneuploidy in preimplantation embryos, and supports the conclusion that low-level mosaicism is a common feature of early human development.

78 citations


Journal ArticleDOI
TL;DR: An assembly-free, single-molecule nanopore sequencing approach enabling direct recovery of complete viral genome sequences from environmental samples is developed, which can provide previously unavailable information about the genome structures, population biology, and ecology of naturally occurring viruses and viral parasites.
Abstract: Viruses are the most abundant biological entities on Earth and play key roles in host ecology, evolution, and horizontal gene transfer. Despite recent progress in viral metagenomics, the inherent genetic complexity of virus populations still poses technical difficulties for recovering complete virus genomes from natural assemblages. To address these challenges, we developed an assembly-free, single-molecule nanopore sequencing approach, enabling direct recovery of complete virus genome sequences from environmental samples. Our method yielded thousands of full-length, high-quality draft virus genome sequences that were not recovered using standard short-read assembly approaches. Additionally, our analyses discriminated between populations whose genomes had identical direct terminal repeats versus those with circularly permuted repeats at their termini, thus providing new insight into native virus reproduction and genome packaging. Novel DNA sequences were discovered, whose repeat structures, gene contents, and concatemer lengths suggest they are phage-inducible chromosomal islands, which are packaged as concatemers in phage particles, with lengths that match the size ranges of co-occurring phage genomes. Our new virus sequencing strategy can provide previously unavailable information about the genome structures, population biology, and ecology of naturally occurring viruses and viral parasites.

76 citations


Journal ArticleDOI
TL;DR: Analysis of 864 SARS-CoV-2 sequences from cases in the New York City metropolitan area during the COVID-19 outbreak in spring 2020 showed that early transmission was most linked to cases from Europe.
Abstract: Effective public response to a pandemic relies upon accurate measurement of the extent and dynamics of an outbreak. Viral genome sequencing has emerged as a powerful approach to link seemingly unrelated cases, and large-scale sequencing surveillance can inform on critical epidemiological parameters. Here, we report the analysis of 864 SARS-CoV-2 sequences from cases in the New York City metropolitan area during the COVID-19 outbreak in spring 2020. The majority of cases had no recent travel history or known exposure, and genetically linked cases were spread throughout the region. Comparison to global viral sequences showed that early transmission was most linked to cases from Europe. Our data are consistent with numerous seeds from multiple sources and a prolonged period of unrecognized community spreading. This work highlights the complementary role of genomic surveillance in addition to traditional epidemiological indicators.

Journal ArticleDOI
TL;DR: This data resource provides a foundation for developing new operational systems for molecular surveillance and for accelerating research and development of new vector control tools, and provides a unique resource for the study of population genomics and evolutionary biology in eukaryotic species with high levels of genetic diversity under strong anthropogenic evolutionary pressures.
Abstract: Mosquito control remains a central pillar of efforts to reduce malaria burden in sub-Saharan Africa. However, insecticide resistance is entrenched in malaria vector populations, and countries with a high malaria burden face a daunting challenge to sustain malaria control with a limited set of surveillance and intervention tools. Here we report on the second phase of a project to build an open resource of high-quality data on genome variation among natural populations of the major African malaria vector species Anopheles gambiae and Anopheles coluzzii. We analyzed whole genomes of 1142 individual mosquitoes sampled from the wild in 13 African countries, as well as a further 234 individuals comprising parents and progeny of 11 laboratory crosses. The data resource includes high-confidence single-nucleotide polymorphism (SNP) calls at 57 million variable sites, genome-wide copy number variation (CNV) calls, and haplotypes phased at biallelic SNPs. We use these data to analyze genetic population structure and characterize genetic diversity within and between populations. We illustrate the utility of these data by investigating species differences in isolation by distance, genetic variation within proposed gene drive target sequences, and patterns of resistance to pyrethroid insecticides. This data resource provides a foundation for developing new operational systems for molecular surveillance and for accelerating research and development of new vector control tools. It also provides a unique resource for the study of population genomics and evolutionary biology in eukaryotic species with high levels of genetic diversity under strong anthropogenic evolutionary pressures.

Journal ArticleDOI
TL;DR: A multitask and multimodal deep neural network is designed for characterizing in vivo RBP targets; by learning across multiple RBPs at once, it is able to avoid experimental biases and to identify the RNA sequence motifs and transcript context patterns that are the most important for the predictions of each individual RBP.
Abstract: Deep learning has become a powerful paradigm to analyze the binding sites of regulatory factors including RNA-binding proteins (RBPs), owing to its strength to learn complex features from possibly multiple sources of raw data. However, the interpretability of these models, which is crucial to improve our understanding of RBP binding preferences and functions, has not yet been investigated in significant detail. We have designed a multitask and multimodal deep neural network for characterizing in vivo RBP targets. The model incorporates not only the sequence but also the region type of the binding sites as input, which helps the model to boost the prediction performance. To interpret the model, we quantified the contribution of the input features to the predictive score of each RBP. Learning across multiple RBPs at once, we are able to avoid experimental biases and to identify the RNA sequence motifs and transcript context patterns that are the most important for the predictions of each individual RBP. Our findings are consistent with known motifs and binding behaviors and can provide new insights about the regulatory functions of RBPs.

Journal ArticleDOI
TL;DR: Whole-genome sequencing of the SKBR3 breast cancer cell line and of patient-derived tumor and normal organoids from two breast cancer patients, with SVs and large-scale allele-specific copy number variants (CNVs) inferred using an ensemble of methods, highlights the need for long-read sequencing of cancer genomes for the precise analysis of their genetic instability.
Abstract: Improved identification of structural variants (SVs) in cancer can lead to more targeted and effective treatment options as well as advance our basic understanding of the disease and its progression. We performed whole-genome sequencing of the SKBR3 breast cancer cell line and patient-derived tumor and normal organoids from two breast cancer patients using Illumina/10x Genomics, Pacific Biosciences (PacBio), and Oxford Nanopore Technologies (ONT) sequencing. We then inferred SVs and large-scale allele-specific copy number variants (CNVs) using an ensemble of methods. Our findings show that long-read sequencing allows for substantially more accurate and sensitive SV detection, with between 90% and 95% of variants supported by each long-read technology also supported by the other. We also report high accuracy for long reads even at relatively low coverage (25×-30×). Furthermore, we integrated SV and CNV data into a unifying karyotype-graph structure to present a more accurate representation of the mutated cancer genomes. We find hundreds of variants within known cancer-related genes detectable only through long-read sequencing. These findings highlight the need for long-read sequencing of cancer genomes for the precise analysis of their genetic instability.

Journal ArticleDOI
TL;DR: The findings show how pioneer factors regulate distinct genomic targets in a stage-specific manner, and highlight how paralogy can serve as an evolutionary strategy to diversify the function of the regulators that control embryonic development.
Abstract: Cell fate commitment involves the progressive restriction of developmental potential. Recent studies have shown that this process requires not only shifts in gene expression but also an extensive remodeling of the epigenomic landscape. To examine how chromatin states are reorganized during cellular specification in an in vivo system, we examined the function of pioneer factor TFAP2A at discrete stages of neural crest development. Our results show that TFAP2A activates distinct sets of genomic regions during induction of the neural plate border and specification of neural crest cells. Genomic occupancy analysis revealed that the repertoire of TFAP2A targets depends upon its dimerization with paralogous proteins TFAP2C and TFAP2B. During gastrula stages, TFAP2A/C heterodimers activate components of the neural plate border induction program. As neurulation begins, TFAP2A trades partners, and TFAP2A/B heterodimers reorganize the epigenomic landscape of progenitor cells to promote neural crest specification. We propose that this molecular switch acts to drive progressive cell commitment, remodeling the epigenomic landscape to define the presumptive neural crest. Our findings show how pioneer factors regulate distinct genomic targets in a stage-specific manner and highlight how paralogy can serve as an evolutionary strategy to diversify the function of the regulators that control embryonic development.

Journal ArticleDOI
TL;DR: Nanopore-based direct RNA sequencing is applied to characterize the developmental polyadenylated transcriptome of C. elegans, providing support for 23,865 splice isoforms across 14,611 genes without the need for computational reconstruction of gene models, and showing that poly(A) tail lengths of transcripts vary across development, as do the strengths of previously reported correlations between poly(A) tail length and expression level.
Abstract: Current transcriptome annotations have largely relied on short read lengths intrinsic to the most widely used high-throughput cDNA sequencing technologies. For example, in the annotation of the Caenorhabditis elegans transcriptome, more than half of the transcript isoforms lack full-length support and instead rely on inference from short reads that do not span the full length of the isoform. We applied nanopore-based direct RNA sequencing to characterize the developmental polyadenylated transcriptome of C. elegans. Taking advantage of long reads spanning the full length of mRNA transcripts, we provide support for 23,865 splice isoforms across 14,611 genes, without the need for computational reconstruction of gene models. Of the isoforms identified, 3452 are novel splice isoforms not present in the WormBase WS265 annotation. Furthermore, we identified 16,342 isoforms in the 3' untranslated region (3' UTR), 2640 of which are novel and do not fall within 10 bp of existing 3'-UTR data sets and annotations. Combining 3' UTRs and splice isoforms, we identified 28,858 full-length transcript isoforms. We also determined that poly(A) tail lengths of transcripts vary across development, as do the strengths of previously reported correlations between poly(A) tail length and expression level, and poly(A) tail length and 3'-UTR length. Finally, we have formatted this data as a publicly accessible track hub, enabling researchers to explore this data set easily in a genome browser.

Journal ArticleDOI
TL;DR: The combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models is explored, and the accuracy of DeepMEL predictions on the CAGI5 challenge is shown, where it significantly outperforms existing models on the melanoma enhancer of IRF4.
Abstract: Deciphering the genomic regulatory code of enhancers is a key challenge in biology because this code underlies cellular identity. A better understanding of how enhancers work will improve the interpretation of noncoding genome variation and empower the generation of cell type-specific drivers for gene therapy. Here, we explore the combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models. We apply this strategy to decipher the enhancer code in melanoma, a relevant case study owing to the presence of distinct melanoma cell states. We trained and validated a deep learning model, called DeepMEL, using chromatin accessibility data of 26 melanoma samples across six different species. We show the accuracy of DeepMEL predictions on the CAGI5 challenge, where it significantly outperforms existing models on the melanoma enhancer of IRF4. Next, we exploit DeepMEL to analyze enhancer architectures and identify accurate transcription factor binding sites for the core regulatory complexes in the two different melanoma states, with distinct roles for each transcription factor, in terms of nucleosome displacement or enhancer activation. Finally, DeepMEL identifies orthologous enhancers across distantly related species, where sequence alignment fails, and the model highlights specific nucleotide substitutions that underlie enhancer turnover. DeepMEL can be used from the Kipoi database to predict and optimize candidate enhancers and to prioritize enhancer mutations. In addition, our computational strategy can be applied to other cancer or normal cell types.

Journal ArticleDOI
TL;DR: netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization, which encourages pairs of genes with known interactions to be near each other in the low-dimensional representation.
Abstract: Single-cell RNA-sequencing (scRNA-seq) enables high-throughput measurement of RNA expression in single cells. However, because of technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells in a lower-dimensional space, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc learns a low-dimensional representation of scRNA-seq transcript counts using network-regularized non-negative matrix factorization. The network regularization takes advantage of prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be nearby each other in the low-dimensional representation. The resulting matrix factorization imputes gene abundance for both zero and nonzero counts and can be used to cluster cells into meaningful subpopulations. We show that netNMF-sc outperforms existing methods at clustering cells and estimating gene-gene covariance using both simulated and real scRNA-seq data, with increasing advantages at higher dropout rates (e.g., >60%). We also show that the results from netNMF-sc are robust to variation in the input network, with more representative networks leading to greater performance gains.
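
The network regularization described in the abstract above can be pictured with a small sketch. The abstract does not state the exact objective, so this assumes the common graph-Laplacian form, in which the penalty is the sum of squared distances between the low-dimensional vectors of interacting genes. The function name, gene labels, and embedding values are hypothetical.

```python
def network_penalty(embedding, edges):
    """Sum of squared Euclidean distances between embeddings of linked genes.

    Assumption: `embedding` maps each gene to its low-dimensional vector and
    `edges` lists pairs of genes with a known interaction.
    """
    total = 0.0
    for gene_a, gene_b in edges:
        vec_a, vec_b = embedding[gene_a], embedding[gene_b]
        total += sum((x - y) ** 2 for x, y in zip(vec_a, vec_b))
    return total


# Toy two-dimensional embedding for three genes (illustrative values only).
embedding = {"g1": [0.0, 0.0], "g2": [3.0, 4.0], "g3": [0.0, 0.0]}
far_pair = network_penalty(embedding, [("g1", "g2")])
near_pair = network_penalty(embedding, [("g1", "g3")])
```

Adding a term like this to a factorization loss penalizes solutions that place interacting genes far apart, which matches the behavior the abstract describes.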

Journal ArticleDOI
TL;DR: A single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-seq) technology is developed, which enables a low-cost, high-accuracy, and high-throughput short-read second-generation sequencer to generate over 100 kb of long-range sequencing information with as little as 0.1 ng input material.
Abstract: Long-range sequencing information is required for haplotype phasing, de novo assembly, and structural variation detection. Current long-read sequencing technologies can provide valuable long-range information but at a high cost with low accuracy and high DNA input requirements. We have developed a single-tube Transposase Enzyme Linked Long-read Sequencing (TELL-seq) technology, which enables a low-cost, high-accuracy, and high-throughput short-read second-generation sequencer to generate over 100 kb of long-range sequencing information with as little as 0.1 ng input material. In a PCR tube, millions of clonally barcoded beads are used to uniquely barcode long DNA molecules in an open bulk reaction without dilution and compartmentation. The barcoded linked-reads are used to successfully assemble genomes ranging from microbes to human. These linked-reads also generate megabase-long phased blocks and provide a cost-effective tool for detecting structural variants in a genome, which are important to identify compound heterozygosity in recessive Mendelian diseases and discover genetic drivers and diagnostic biomarkers in cancers.

Journal ArticleDOI
TL;DR: It is shown that SIP is resistant to noise and sequencing depth and can be used to detect loops that were previously missed in human cells as well as loops in other organisms, and that it can lead to biological insights by characterizing the contribution of several transcription factors to CTCF loop stability in human cells.
Abstract: Chromatin loops are a major component of 3D nuclear organization, visually apparent as intense point-to-point interactions in Hi-C maps. Identification of these loops is a critical part of most Hi-C analyses. However, current methods often miss visually evident CTCF loops in Hi-C data sets from mammals, and they completely fail to identify high intensity loops in other organisms. We present SIP, Significant Interaction Peak caller, and SIPMeta, which are platform independent programs to identify and characterize these loops in a time- and memory-efficient manner. We show that SIP is resistant to noise and sequencing depth, and can be used to detect loops that were previously missed in human cells as well as loops in other organisms. SIPMeta corrects for a common visualization artifact by accounting for Manhattan distance to create average plots of Hi-C and HiChIP data. We then demonstrate that the use of SIP and SIPMeta can lead to biological insights by characterizing the contribution of several transcription factors to CTCF loop stability in human cells. We also annotate loops associated with the SMC component of the dosage compensation complex (DCC) in Caenorhabditis elegans and demonstrate that loop anchors represent bidirectional blocks for symmetrical loop extrusion. This is in contrast to the asymmetrical extrusion until unidirectional blockage by CTCF that is presumed to occur in mammals. Using HiChIP and multiway ligation events, we then show that DCC loops form a network of strong interactions that may contribute to X Chromosome-wide condensation in C. elegans hermaphrodites.

Journal ArticleDOI
TL;DR: Plasma DNA jagged ends represent an intrinsic property of plasma DNA and provide a link between nuclease activities and the fragmentation of plasma DNA.
Abstract: Cell-free DNA in plasma has been used for noninvasive prenatal testing and cancer liquid biopsy. The physical properties of cell-free DNA fragments in plasma, such as fragment sizes and ends, have attracted much recent interest, leading to the emerging field of cell-free DNA fragmentomics. However, one aspect of plasma DNA fragmentomics as to whether double-stranded plasma molecules might carry single-stranded ends, termed a jagged end in this study, remains underexplored. We have developed two approaches for investigating the presence of jagged ends in a plasma DNA pool. These approaches utilized DNA end repair to introduce differential methylation signals between the original sequence and the jagged ends, depending on whether unmethylated or methylated cytosines were used in the DNA end-repair procedure. The majority of plasma DNA molecules (87.8%) were found to bear jagged ends. The jaggedness varied according to plasma DNA fragment sizes and appeared to be in association with nucleosomal patterns. In the plasma of pregnant women, the jaggedness of fetal DNA molecules was higher than that of the maternal counterparts. The jaggedness of plasma DNA correlated with the fetal DNA fraction. Similarly, in the plasma of cancer patients, tumor-derived DNA molecules in patients with hepatocellular carcinoma showed an elevated jaggedness compared with nontumoral DNA. In mouse models, knocking out of the Dnase1 gene reduced jaggedness, whereas knocking out of the Dnase1l3 gene enhanced jaggedness. Hence, plasma DNA jagged ends represent an intrinsic property of plasma DNA and provide a link between nuclease activities and the fragmentation of plasma DNA.

Journal ArticleDOI
TL;DR: It is found that, on average, scRNA-seq data from only five genes predicted a cell's position on the cell cycle continuum to within 14% of the entire cycle and that using more genes did not improve this accuracy, which can directly help future studies to account for cell cycle-related heterogeneity in iPSCs.
Abstract: Cellular heterogeneity in gene expression is driven by cellular processes, such as cell cycle and cell-type identity, and cellular environment such as spatial location. The cell cycle, in particular, is thought to be a key driver of cell-to-cell heterogeneity in gene expression, even in otherwise homogeneous cell populations. Recent advances in single-cell RNA-sequencing (scRNA-seq) facilitate detailed characterization of gene expression heterogeneity and can thus shed new light on the processes driving heterogeneity. Here, we combined fluorescence imaging with scRNA-seq to measure cell cycle phase and gene expression levels in human induced pluripotent stem cells (iPSCs). By using these data, we developed a novel approach to characterize cell cycle progression. Although standard methods assign cells to discrete cell cycle stages, our method goes beyond this and quantifies cell cycle progression on a continuum. We found that, on average, scRNA-seq data from only five genes predicted a cell's position on the cell cycle continuum to within 14% of the entire cycle and that using more genes did not improve this accuracy. Our data and predictor of cell cycle phase can directly help future studies to account for cell cycle-related heterogeneity in iPSCs. Our results and methods also provide a foundation for future work to characterize the effects of the cell cycle on expression heterogeneity in other cell types.
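The abstract does not state how the "within 14% of the entire cycle" figure is computed. One natural reading, sketched below as an assumption rather than the authors' definition, treats cell cycle position as an angle in radians and reports the shortest distance around the circle as a fraction of a full turn. The function name `cycle_error_fraction` is hypothetical.

```python
import math


def cycle_error_fraction(predicted, observed):
    """Shortest angular distance between two phases (radians), as a fraction of 2*pi."""
    diff = abs(predicted - observed) % (2 * math.pi)
    return min(diff, 2 * math.pi - diff) / (2 * math.pi)


# Phases just either side of the 0 / 2*pi boundary are close on the circle.
near_wrap = cycle_error_fraction(0.1, 2 * math.pi - 0.1)

# Opposite phases give the largest possible error, half the cycle.
opposite = cycle_error_fraction(0.0, math.pi)
```

Under this definition the error can never exceed 0.5, so a value of 0.14 would sit roughly a quarter of the way between a perfect prediction and the worst possible one.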

Journal ArticleDOI
TL;DR: It is demonstrated that PRANCR is a novel lncRNA regulating epidermal homeostasis, and other lncRNA candidates that may have roles in this process as well are identified.
Abstract: Genome-wide association studies indicate that many disease susceptibility regions reside in non-protein-coding regions of the genome. Long noncoding RNAs (lncRNAs) are a major component of the noncoding genome, but their biological impacts are not fully understood. Here, we performed a CRISPR interference (CRISPRi) screen on 2263 epidermis-expressed lncRNAs and identified nine novel candidate lncRNAs regulating keratinocyte proliferation. We further characterized a top hit from the screen, progenitor renewal associated non-coding RNA (PRANCR), using RNA interference-mediated knockdown and phenotypic analysis in organotypic human tissue. PRANCR regulates keratinocyte proliferation, cell cycle progression, and clonogenicity. PRANCR-deficient epidermis displayed impaired stratification with reduced expression of differentiation genes that are altered in human skin diseases, including keratins 1 and 10, filaggrin, and loricrin. Transcriptome analysis showed that PRANCR controls the expression of 1136 genes, with strong enrichment for late cell cycle genes containing a CHR promoter element. In addition, PRANCR depletion led to increased levels of both total and nuclear CDKN1A (also known as p21), which is known to govern both keratinocyte proliferation and differentiation. Collectively, these data show that PRANCR is a novel lncRNA regulating epidermal homeostasis and identify other lncRNA candidates that may have roles in this process as well.

Journal ArticleDOI
TL;DR: To process large-scale single-cell RNA-sequencing (scRNA-seq) data effectively without excessive distortion during dimension reduction, SHARP is presented, an ensemble random projection-based algorithm which is scalable to clustering 10 million cells.
Abstract: To process large-scale single-cell RNA-sequencing (scRNA-seq) data effectively without excessive distortion during dimension reduction, we present SHARP, an ensemble random projection-based algorithm that is scalable to clustering 10 million cells. Comprehensive benchmarking tests on 17 public scRNA-seq data sets show that SHARP outperforms existing methods in terms of speed and accuracy. Particularly, for large-size data sets (more than 40,000 cells), SHARP runs faster than other competitors while maintaining high clustering accuracy and robustness. To the best of our knowledge, SHARP is the only R-based tool that is scalable to clustering scRNA-seq data with 10 million cells.
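The idea of an ensemble of random projections can be sketched in a few lines. The code below does not reproduce SHARP's pipeline. It only illustrates, on synthetic data, that projecting a cells-by-genes matrix to far fewer dimensions with a random Gaussian matrix roughly preserves pairwise distances, and that averaging over several independent projections tightens the approximation. The helper names, the matrix sizes, and the choice of a dense Gaussian projection are all assumptions.

```python
import numpy as np


def random_projection(X, n_dims, rng):
    """Project rows of X (cells x genes) to n_dims dimensions with a Gaussian matrix."""
    R = rng.normal(0.0, 1.0 / np.sqrt(n_dims), size=(X.shape[1], n_dims))
    return X @ R


def pairwise_sq_dists(Y):
    """Squared Euclidean distances between all rows of Y."""
    sq = (Y ** 2).sum(axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * (Y @ Y.T), 0.0)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2000))  # toy stand-in for a cells x genes matrix
original = pairwise_sq_dists(X)

# Ensemble: average the distance matrices from several independent projections.
n_projections, n_dims = 10, 200
ensemble = np.mean(
    [pairwise_sq_dists(random_projection(X, n_dims, rng)) for _ in range(n_projections)],
    axis=0,
)

# Ratio of projected to original squared distance for every pair of cells.
upper = np.triu_indices(X.shape[0], k=1)
ratios = ensemble[upper] / original[upper]
```

Ratios close to 1 indicate that the reduced representation distorts distances very little, which is the property a downstream clustering step relies on.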

Journal ArticleDOI
TL;DR: A large data set of published and newly generated sRNA sequencing data and a uniform bioinformatic pipeline are used to produce comprehensive sRNA locus annotations of 47 diverse plants, yielding more than 2.7 million sRNA loci.
Abstract: Plant endogenous small RNAs (sRNAs) are important regulators of gene expression. There are two broad categories of plant sRNAs: microRNAs (miRNAs) and endogenous short interfering RNAs (siRNAs). MicroRNA loci are relatively well-annotated but compose only a small minority of the total sRNA pool; siRNA locus annotations have lagged far behind. Here, we used a large data set of published and newly generated sRNA sequencing data (1333 sRNA-seq libraries containing more than 20 billion reads) and a uniform bioinformatic pipeline to produce comprehensive sRNA locus annotations of 47 diverse plants, yielding more than 2.7 million sRNA loci. The two most numerous classes of siRNA loci produced mainly 24- and 21-nucleotide (nt) siRNAs, respectively. Most often, 24-nt-dominated siRNA loci occurred in intergenic regions, especially at the 5'-flanking regions of protein-coding genes. In contrast, 21-nt-dominated siRNA loci were most often derived from double-stranded RNA precursors copied from spliced mRNAs. Genic 21-nt-dominated loci were especially common from disease resistance genes, including from a large number of monocots. Individual siRNA sequences of all types showed very little conservation across species, whereas mature miRNAs were more likely to be conserved. We developed a web server where our data and several search and analysis tools are freely accessible.

Journal ArticleDOI
TL;DR: It is uncovered that the stacking of an elongating onto a paused ribosome occurs frequently and scales with translation rate, trapping ∼10% of translating ribosomes in the disome state.
Abstract: Translation initiation is the major regulatory step defining the rate of protein production from an mRNA. Meanwhile, the impact of nonuniform ribosomal elongation rates is largely unknown. Using a modified ribosome profiling protocol based on footprints from two closely packed ribosomes (disomes), we have mapped ribosomal collisions transcriptome-wide in mouse liver. We uncover that the stacking of an elongating onto a paused ribosome occurs frequently and scales with translation rate, trapping ∼10% of translating ribosomes in the disome state. A distinct class of pause sites is indicative of deterministic pausing signals. Pause site association with specific amino acids, peptide motifs, and nascent polypeptide structure is suggestive of programmed pausing as a widespread mechanism associated with protein folding. Evolutionary conservation at disome sites indicates functional relevance of translational pausing. Collectively, our disome profiling approach allows unique insights into gene regulation occurring at the step of translation elongation.

Journal ArticleDOI
TL;DR: High resolution Hi-C, gene expression and sequential ChIP studies show that STAG1 and STAG2 do not co-occupy individual binding sites and have distinct ways by which they affect looping and gene expression.
Abstract: Cohesin is a ring-shaped multiprotein complex that is crucial for 3D genome organization and transcriptional regulation during differentiation and development. It also confers sister chromatid cohesion and facilitates DNA damage repair. Besides its core subunits SMC3, SMC1A, and RAD21, cohesin in somatic cells contains one of two orthologous STAG subunits, STAG1 or STAG2. How these variable subunits affect the function of the cohesin complex is still unclear. STAG1- and STAG2-cohesin were initially proposed to organize cohesion at telomeres and centromeres, respectively. Here, we uncover redundant and specific roles of STAG1 and STAG2 in gene regulation and chromatin looping using HCT116 cells with an auxin-inducible degron (AID) tag fused to either STAG1 or STAG2. Following rapid depletion of either subunit, we perform high-resolution Hi-C, gene expression, and sequential ChIP studies to show that STAG1 and STAG2 do not co-occupy individual binding sites and have distinct ways by which they affect looping and gene expression. These findings are further supported by single-molecule localizations via direct stochastic optical reconstruction microscopy (dSTORM) super-resolution imaging. Since somatic and congenital mutations of the STAG subunits are associated with cancer (STAG2) and intellectual disability syndromes with congenital abnormalities (STAG1 and STAG2), we verified STAG1-/STAG2-dependencies using human neural stem cells, hence highlighting their importance in particular disease contexts.

Journal ArticleDOI
TL;DR: This study RNA-sequenced a C. sativa family and identified >500 sex-linked genes and revealed that old plant sex chromosomes can have large, highly divergent nonrecombining regions, yet still be roughly homomorphic.
Abstract: Cannabis sativa-derived tetrahydrocannabinol (THC) production is increasing very fast worldwide. C. sativa is a dioecious plant with XY Chromosomes, and only females (XX) are useful for THC production. Identifying the sex chromosome sequence would improve early sexing and management of this crop; however, the C. sativa genome projects have failed to do so. Moreover, as dioecy in the Cannabaceae family is ancestral, C. sativa sex chromosomes are potentially old and thus very interesting to study, as little is known about old plant sex chromosomes. Here, we RNA-sequenced a C. sativa family (two parents and 10 male and female offspring, 576 million reads) and performed a segregation analysis for all C. sativa genes using the probabilistic method SEX-DETector. We identified >500 sex-linked genes. Mapping of these sex-linked genes to a C. sativa genome assembly identified the largest chromosome pair as the sex chromosomes. We found that the X-specific region (not recombining between X and Y) is large compared to other plant systems. Further analysis of the sex-linked genes revealed that C. sativa has a strongly degenerated Y Chromosome and may represent the oldest plant sex chromosome system documented so far. Our study revealed that old plant sex chromosomes can have large, highly divergent nonrecombining regions, yet still be roughly homomorphic.

Journal ArticleDOI
TL;DR: It is found that St1Cas9 strain variants enable targeting to five distinct A-rich PAMs and provide a structural basis for their specificities.
Abstract: Targeting definite genomic locations using CRISPR-Cas systems requires a set of enzymes with unique protospacer adjacent motif (PAM) compatibilities. To expand this repertoire, we engineered nucleases, cytosine base editors, and adenine base editors from the archetypal Streptococcus thermophilus CRISPR1-Cas9 (St1Cas9) system. We found that St1Cas9 strain variants enable targeting to five distinct A-rich PAMs and provide a structural basis for their specificities. The small size of this ortholog enables expression of the holoenzyme from a single adeno-associated viral vector for in vivo editing applications. Delivery of St1Cas9 to the neonatal liver efficiently rewired metabolic pathways, leading to phenotypic rescue in a mouse model of hereditary tyrosinemia. These robust enzymes expand and complement current editing platforms available for tailoring mammalian genomes.

Journal ArticleDOI
TL;DR: Divergence between the X and Y Chromosomes in regulatory sequence can lead to tissue-specific, Y-Chromosome-driven sex biases in expression of critical, dosage-sensitive regulatory genes.
Abstract: Little is known about how human Y-Chromosome gene expression directly contributes to differences between XX (female) and XY (male) individuals in nonreproductive tissues. Here, we analyzed quantitative profiles of Y-Chromosome gene expression across 36 human tissues from hundreds of individuals. Although it is often said that Y-Chromosome genes are lowly expressed outside the testis, we report many instances of elevated Y-Chromosome gene expression in a nonreproductive tissue. A notable example is EIF1AY, which encodes eukaryotic translation initiation factor 1A Y-linked, together with its X-linked homolog EIF1AX. Evolutionary loss of a Y-linked microRNA target site enabled up-regulation of EIF1AY, but not of EIF1AX, in the heart. Consequently, this essential translation initiation factor is nearly twice as abundant in male as in female heart tissue at the protein level. Divergence between the X and Y Chromosomes in regulatory sequence can therefore lead to tissue-specific Y-Chromosome-driven sex biases in expression of critical, dosage-sensitive regulatory genes.