GENCODE: The reference human genome annotation for The ENCODE Project

TLDR
This work examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters, 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.
Abstract
The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available, with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two-exon models, indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

