Open Access · Proceedings Article

Identification of protein coding regions in RNA transcripts

TLDR
It is demonstrated that GeneMarkS-T self-training is robust to errors in assembled transcripts, and that the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting gene starts compares favorably with other existing methods.
Abstract
Massively parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) is a powerful method of generating critically important data for discovering the structure and function of eukaryotic genes. The transcripts may or may not carry protein-coding regions. If a protein-coding region is present, it should be a continuous (spliced) open reading frame. Gene finding in transcripts can be done by statistical (alignment-free) as well as by alignment-based methods. We describe a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions, complete or incomplete, in RNA transcripts assembled from RNA-Seq reads. An important feature of GeneMarkS-T is unsupervised estimation of the algorithm's parameters, which makes several conventional steps of gene prediction protocols unnecessary, most importantly the manual curation of training sets. We demonstrate that (i) GeneMarkS-T self-training is robust to errors in assembled transcripts and (ii) the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting gene starts compares favorably with other existing methods.
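
To make the notion of a continuous open reading frame in a transcript concrete, here is a minimal Python sketch that scans the three forward frames of a sequence for ATG-initiated ORFs. It is not the GeneMarkS-T algorithm, which the abstract describes as a statistical method with unsupervised parameter estimation; this is only a naive baseline. The function names (`find_orfs`, `longest_orf`), the `min_codons` threshold, the use of the standard stop codons, and the forward-strand-only scan are all assumptions made for illustration. An ORF that runs off the 3' end without a stop codon is reported as incomplete; ORFs missing their start codon are not handled.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)
START_CODON = "ATG"


def find_orfs(transcript, min_codons=2):
    """Return (start, end, frame, complete) tuples for forward-strand ORFs.

    start/end are 0-based, end-exclusive; a complete ORF includes its stop
    codon. An ORF with a start but no in-frame stop is marked incomplete.
    """
    seq = transcript.upper().replace("U", "T")  # accept RNA or DNA letters
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i  # first in-frame ATG after the previous stop
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame, True))
                start = None
        if start is not None:
            # Open at the 3' end: trim to a whole number of codons.
            end = start + ((len(seq) - start) // 3) * 3
            if (end - start) // 3 >= min_codons:
                orfs.append((start, end, frame, False))
    return orfs


def longest_orf(transcript):
    """Return the longest ORF found by find_orfs, or None if there is none."""
    orfs = find_orfs(transcript)
    return max(orfs, key=lambda o: o[1] - o[0]) if orfs else None
```

Picking the longest ORF is a common heuristic, but it ignores coding statistics and the sequence context of the start codon, which is where a model-based gene finder would be expected to differ.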
