Journal ArticleDOI

TranscriptomeReconstructoR: data-driven annotation of complex transcriptomes

31 May 2021-BMC Bioinformatics (BioMed Central)-Vol. 22, Iss: 1, pp 290-290
TL;DR: In this paper, the authors developed a pipeline for automated transcriptome annotation based on integrating features from independent and complementary datasets, such as full-length RNA-seq for detecting splicing patterns and high-throughput 5′ and 3′ tag sequencing data for accurate definition of gene borders.
Abstract: The quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing amount of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data. We developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: (i) full-length RNA-seq for detection of splicing patterns and (ii) high-throughput 5′ and 3′ tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild-type Arabidopsis thaliana seedlings and Saccharomyces cerevisiae cells as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the most commonly used community gene models, TAIR10 and Araport11 for A. thaliana and SacCer3 for S. cerevisiae. In particular, we identify multiple transient transcripts missing from the existing annotations. Our new annotations promise to improve the quality of A. thaliana and S. cerevisiae genome research. Our proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge of the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.
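
The integration idea in the abstract (long reads supply the exon structure, 5′ and 3′ tag data supply the gene borders) can be illustrated with a minimal Python sketch. This is not the package's actual algorithm or API: the function names `nearest` and `snap_read`, the `max_dist` threshold of 100 bp, and the coordinates are all hypothetical, and only forward-strand reads are handled.

```python
from bisect import bisect_left


def nearest(sorted_positions, pos, max_dist):
    """Return the position closest to `pos`, or None if it is farther than `max_dist`."""
    if not sorted_positions:
        return None
    i = bisect_left(sorted_positions, pos)
    # Only the neighbours just left and right of the insertion point can be closest.
    candidates = sorted_positions[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda p: abs(p - pos))
    return best if abs(best - pos) <= max_dist else None


def snap_read(read_start, read_end, tss_summits, pas_summits, max_dist=100):
    """Move the ends of a forward-strand long read to nearby 5' and 3' tag summits.

    Returns (start, end, complete); `complete` is True only when both ends
    were supported by a tag summit (hypothetical criterion for this sketch).
    """
    tss = nearest(tss_summits, read_start, max_dist)
    pas = nearest(pas_summits, read_end, max_dist)
    start = tss if tss is not None else read_start
    end = pas if pas is not None else read_end
    return start, end, tss is not None and pas is not None


# Hypothetical, sorted summit coordinates from 5' and 3' tag sequencing.
tss_summits = [1000, 5000]
pas_summits = [2500, 7200]

print(snap_read(1030, 2460, tss_summits, pas_summits))  # both ends supported
print(snap_read(3000, 4000, tss_summits, pas_summits))  # neither end supported
```

Reads whose ends both fall near a tag summit are treated here as full-length evidence with corrected borders, while the remaining reads keep their original coordinates and are flagged as incomplete.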


Citations
Journal ArticleDOI
TL;DR: In this article, the authors identify ∼130,000 long intergenic noncoding RNAs (lincRNAs) in four Brassicaceae: Arabidopsis thaliana, Camelina sativa, Brassica rapa, and Eutrema salsugineum.
Abstract: Long intergenic noncoding RNAs (lincRNAs) are a large yet enigmatic class of eukaryotic transcripts that can have critical biological functions. The wealth of RNA-sequencing (RNA-seq) data available for plants provides the opportunity to implement a harmonized identification and annotation effort for lincRNAs that enables cross-species functional and genomic comparisons as well as prioritization of functional candidates. In this study, we processed >24 Tera base pairs of RNA-seq data from >16,000 experiments to identify ∼130,000 lincRNAs in four Brassicaceae: Arabidopsis thaliana, Camelina sativa, Brassica rapa, and Eutrema salsugineum. We used nanopore RNA-seq, transcriptome-wide structural information, peptide data, and epigenomic data to characterize these lincRNAs and identify conserved motifs. We then used comparative genomic and transcriptomic approaches to highlight lincRNAs in our data set with sequence or transcriptional conservation. Finally, we used guilt-by-association analyses to assign putative functions to lincRNAs within our data set. We tested this approach on a subset of lincRNAs associated with germination and seed development, observing germination defects for Arabidopsis lines harboring T-DNA insertions at these loci. LincRNAs with Brassicaceae-conserved putative miRNA binding motifs, small open reading frames, or abiotic-stress modulated expression are a few of the annotations that will guide functional analyses into this cryptic portion of the transcriptome.

13 citations

Journal ArticleDOI
TL;DR: In this article, a time-course study of the establishment of PAMP-triggered immunity (PTI) was performed using cap analysis of gene expression. The authors found that around 15% of all transcription start sites (TSSs) rapidly induced during PTI define alternative transcription initiation events.
Abstract: Immune responses triggered by pathogen-associated molecular patterns (PAMPs) are key to pathogen defense, but drivers and stabilizers of the growth-to-defense genetic reprogramming remain incompletely understood in plants. Here, we report a time-course study of the establishment of PAMP-triggered immunity (PTI) using cap analysis of gene expression. We show that around 15% of all transcription start sites (TSSs) rapidly induced during PTI define alternative transcription initiation events. From these, we identify clear examples of regulatory TSS change via alternative inclusion of target peptides or domains in encoded proteins, or of upstream open reading frames in mRNA leader sequences. We also find that 60% of PAMP response genes respond earlier than previously thought. In particular, a cluster of rapidly and transiently PAMP-induced genes is enriched in transcription factors (TFs) whose functions, previously associated with biological processes as diverse as abiotic stress adaptation and stem cell activity, appear to converge on growth restriction. Furthermore, examples of known potentiators of PTI, in one case under direct mitogen-activated protein kinase control, support the notion that the rapidly induced TFs could constitute direct links to PTI signaling pathways and drive gene expression changes underlying establishment of the immune state.

7 citations

Journal ArticleDOI
TL;DR: In this article, the Hda1 histone deacetylase complex (Hda1C) was identified as a repressor of divergent non-coding (DNC) transcription.
Abstract: Nucleosome-depleted regions (NDRs) at gene promoters support initiation of RNA polymerase II transcription. Interestingly, transcription often initiates in both directions, resulting in an mRNA and a divergent non-coding (DNC) transcript of unclear purpose. Here, we characterized the genetic architecture and molecular mechanism of DNC transcription in budding yeast. Using high-throughput reverse genetic screens based on quantitative single-cell fluorescence measurements, we identified the Hda1 histone deacetylase complex (Hda1C) as a repressor of DNC transcription. Nascent transcription profiling showed a genome-wide role of Hda1C in repression of DNC transcription. Live-cell imaging of transcription revealed that mutations in the Hda3 subunit increased the frequency of DNC transcription. Hda1C contributed to decreased acetylation of histone H3 in DNC transcription regions, supporting DNC transcription repression by histone deacetylation. Our data support the interpretation that DNC transcription results as a consequence of the NDR-based architecture of eukaryotic promoters, but that it is governed by locus-specific repression to maintain genome fidelity.

5 citations

Posted ContentDOI
16 Nov 2021-bioRxiv
TL;DR: In this article, a flowering associated intergenic lncRNA (FLAIL) that represses flowering in Arabidopsis was identified, likely through mediating transcriptional regulation of genes directly bound by FLAIL.
Abstract: Eukaryotic genomes give rise to thousands of long non-coding RNAs (lncRNAs), yet the purpose of lncRNAs remains largely enigmatic. Functional characterization of lncRNAs is challenging due to multiple orthogonal hypotheses for molecular activities of lncRNA loci. Here, we identified a flowering associated intergenic lncRNA (FLAIL) that represses flowering in Arabidopsis. An allelic series of flail loss-of-function mutants generated by CRISPR/Cas9 and T-DNA mutagenesis showed an early flowering phenotype. Gene expression analyses in flail mutants revealed differentially expressed genes linked to the regulation of flowering. A genomic rescue fragment of FLAIL introduced in flail mutants complemented gene expression defects and early flowering, consistent with trans-acting effects of the FLAIL RNA. Knock-down of FLAIL RNA levels using the artificial microRNA approach revealed an early flowering phenotype shared with genomic mutations, indicating a trans-acting role of FLAIL RNA in the repression of flowering time. Genome-wide detection of FLAIL-DNA interactions by ChIRP-seq suggested that FLAIL may directly bind genomic regions. FLAIL bound to genes involved in regulation of flowering that were differentially expressed in flail, consistent with the interpretation of FLAIL as a trans-acting lncRNA directly shaping gene expression. Our findings highlight FLAIL as a trans-acting lncRNA that affects flowering in Arabidopsis, likely through mediating transcriptional regulation of genes directly bound by FLAIL.

2 citations

Journal ArticleDOI
TL;DR: In this paper, the functional impact of sequence variations in the non-coding genome on plant biology is reviewed in the context of crop breeding and agricultural traits. The review focuses on examples of non-coding sequence variation with particularly convincing functional support.
Abstract: The growing world population in combination with the anticipated effects of climate change puts pressure on food security. Plants display an impressive arsenal of cellular mechanisms for resilience to adverse environmental conditions and we rely on those mechanisms for stable food production. The elucidation of the molecular basis of the mechanisms plants use to achieve resilience promises knowledge-based approaches to enhance food security. DNA sequence polymorphisms can reveal genomic regions that are linked to beneficial traits of plants. However, our ability to interpret how a given DNA sequence polymorphism confers a fitness advantage on the molecular level often remains poor. A key factor is that these polymorphisms largely localize to the enigmatic non-coding genome. Here, we review the functional impact of sequence variations in the non-coding genome on plant biology in the context of crop breeding and agricultural traits. We focus our review on examples of non-coding sequence variation with particularly convincing functional support. Our survey combines findings that are consistent with the view that the non-coding genome contributes to cellular mechanisms assisting many plant traits. Understanding how DNA sequence polymorphisms in the non-coding genome shape plant traits on the molecular level offers a largely unexplored reservoir of solutions to address future challenges in plant growth and resilience.

2 citations

References
Journal ArticleDOI
TL;DR: The RNA-Seq approach to transcriptome profiling that uses deep-sequencing technologies provides a far more precise measurement of levels of transcripts and their isoforms than other methods.
Abstract: RNA-Seq is a recently developed approach to transcriptome profiling that uses deep-sequencing technologies. Studies using this method have already altered our view of the extent and complexity of eukaryotic transcriptomes. RNA-Seq also provides a far more precise measurement of levels of transcripts and their isoforms than other methods. This article describes the RNA-Seq approach, the challenges associated with its application, and the advances made so far in characterizing several eukaryote transcriptomes.

11,528 citations


"TranscriptomeReconstructoR: data-dri..." refers background in this paper

  • ...First, the RNA-seq coverage gradually decreases towards gene borders, which limits accurate definition of 5′ and 3′ ends of transcripts [12]....


Journal ArticleDOI
Heng Li1
TL;DR: Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment.
Abstract: Motivation Recent advances in sequencing technologies promise ultra-long reads of ∼100 kb in average, full-length mRNA or cDNA reads in high throughput and genomic contigs over 100 Mb in length. Existing alignment programs are unable or inefficient to process such data at scale, which presses for the development of new alignment algorithms. Results Minimap2 is a general-purpose alignment program to map DNA or long mRNA sequences against a large reference database. It works with accurate short reads of ≥100 bp in length, ≥1 kb genomic reads at error rate ∼15%, full-length noisy Direct RNA or cDNA reads and assembly contigs or closely related full chromosomes of hundreds of megabases in length. Minimap2 does split-read alignment, employs concave gap cost for long insertions and deletions and introduces new heuristics to reduce spurious alignments. It is 3-4 times as fast as mainstream short-read mappers at comparable accuracy, and is ≥30 times faster than long-read genomic or cDNA mappers at higher accuracy, surpassing most aligners specialized in one type of alignment. Availability and implementation https://github.com/lh3/minimap2. Supplementary information Supplementary data are available at Bioinformatics online.

6,264 citations


"TranscriptomeReconstructoR: data-dri..." refers background in this paper

  • ...A common problem with full-length RNA sequencing reads aligned by dedicated aligners such as Minimap2 [23] is the under-splitting, i....


Journal ArticleDOI
TL;DR: This work describes Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions, including those for sequence analysis, differential expression analysis and visualization.
Abstract: We describe Bioconductor infrastructure for representing and computing on annotated genomic ranges and integrating genomic data with the statistical computing features of R and its extensions. At the core of the infrastructure are three packages: IRanges, GenomicRanges, and GenomicFeatures. These packages provide scalable data structures for representing annotated ranges on the genome, with special support for transcript structures, read alignments and coverage vectors. Computational facilities include efficient algorithms for overlap and nearest neighbor detection, coverage calculation and other range operations. This infrastructure directly supports more than 80 other Bioconductor packages, including those for sequence analysis, differential expression analysis and visualization.
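
Two of the range operations named in this abstract, overlap detection and coverage calculation, can be sketched in a few lines of plain Python. This is only an illustration of the concepts on closed, 1-based intervals; it does not reproduce the data structures or algorithms used by IRanges or GenomicRanges, and the function names `overlaps` and `coverage` are chosen for this example.

```python
def overlaps(a, b):
    """True if closed intervals a=(start, end) and b=(start, end) share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def coverage(intervals, length):
    """Per-base depth over positions 1..length for closed, 1-based intervals.

    Uses a difference array: +1 where an interval starts, -1 just past its end,
    then a running sum gives the depth at every position.
    Assumes every interval lies within 1..length.
    """
    diff = [0] * (length + 2)
    for start, end in intervals:
        diff[start] += 1
        diff[end + 1] -= 1
    depth, cov = 0, []
    for pos in range(1, length + 1):
        depth += diff[pos]
        cov.append(depth)
    return cov


reads = [(1, 3), (2, 5)]
print(overlaps(reads[0], reads[1]))  # True: positions 2-3 are shared
print(coverage(reads, 6))            # [1, 2, 2, 1, 1, 0]
```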

3,005 citations

Journal ArticleDOI
19 Dec 2008-Science
TL;DR: Global run-on sequencing, GRO-seq, shows that peaks of promoter-proximal polymerase reside on ∼30% of human genes, transcription extends beyond pre-messenger RNA 3′ cleavage, and antisense transcription is prevalent.
Abstract: RNA polymerases are highly regulated molecular machines. We present a method (global run-on sequencing, GRO-seq) that maps the position, amount, and orientation of transcriptionally engaged RNA polymerases genome-wide. In this method, nuclear run-on RNA molecules are subjected to large-scale parallel sequencing and mapped to the genome. We show that peaks of promoter-proximal polymerase reside on ∼30% of human genes, transcription extends beyond pre-messenger RNA 3′ cleavage, and antisense transcription is prevalent. Additionally, most promoters have an engaged polymerase upstream and in an orientation opposite to the annotated gene. This divergent polymerase is associated with active genes but does not elongate effectively beyond the promoter. These results imply that the interplay between polymerases and regulators over broad promoter regions dictates the orientation and efficiency of productive transcription.

1,945 citations

Journal ArticleDOI
TL;DR: Recent developments include several new genome releases, progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis which provides the capability to carry out full text searches on a large body of research literature.
Abstract: The Arabidopsis Information Resource (TAIR, http://arabidopsis.org) is a genome database for Arabidopsis thaliana, an important reference organism for many fundamental aspects of biology as well as basic and applied plant biology research. TAIR serves as a central access point for Arabidopsis data, annotates gene function and expression patterns using controlled vocabulary terms, and maintains and updates the A. thaliana genome assembly and annotation. TAIR also provides researchers with an extensive set of visualization and analysis tools. Recent developments include several new genome releases (TAIR8, TAIR9 and TAIR10) in which the A. thaliana assembly was updated, pseudogenes and transposon genes were re-annotated, and new data from proteomics and next generation transcriptome sequencing were incorporated into gene models and splice variants. Other highlights include progress on functional annotation of the genome and the release of several new tools including Textpresso for Arabidopsis which provides the capability to carry out full text searches on a large body of research literature.

1,874 citations
