GENCODE: The reference human genome annotation for The ENCODE Project

doi:10.1101/GR.135350.111

Jennifer Harrow, +40 more

- 01 Sep 2012 -

Genome Research

- Vol. 22, Iss: 9, pp 1760-1774

TLDR

This work examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters, 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.

Abstract:

The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.

Citations

PhenoScanner V2: an expanded tool for searching human genotype-phenotype associations

Mihir A Kamat, +7 more

- 01 Nov 2019 -

Bioinformatics

TL;DR: A major update of PhenoScanner is presented, including over 150 million genetic variants and more than 65 billion associations with diseases and traits, gene expression, metabolite and protein levels, and epigenetic markers.

Evaluation of GRCh38 and de novo haploid genome assemblies demonstrates the enduring quality of the reference assembly

Valerie A. Schneider, +37 more

- 01 May 2017 -

Genome Research

TL;DR: It is asserted that the collected updates in GRCh38 make the newer assembly a more robust substrate for comprehensive analyses that will promote the understanding of human biology and advance the efforts to improve health.

Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs.

Matthew J. Hangauer, +2 more

- 20 Jun 2013 -

PLOS Genetics

TL;DR: Using a large set of RNA-seq data covering a wide array of human tissue types, the authors observe that the majority of the genome is indeed transcribed, corroborating recent observations by the ENCODE project, and find that intergenic regions encode far more long intergenic noncoding RNAs (lincRNAs) than previously described, helping to resolve the discrepancy between the vast amount of observed intergenic transcription and the limited number of previously known lincRNAs.

Correlation of circular RNA abundance with proliferation--exemplified with colorectal and ovarian cancer, idiopathic lung fibrosis, and normal human tissues.

Anna Bachmayr-Heyda, +9 more

- 27 Jan 2015 -

Scientific Reports

TL;DR: This study is the first to report a global reduction of circular RNA abundance in colorectal cancer cell lines and cancer compared to normal tissues, and it finds a negative correlation between global circular RNA abundance and proliferation.

The UCSC Genome Browser database: 2016 update.

Matthew L. Speir, +22 more

- 04 Jan 2016 -

Nucleic Acids Research

TL;DR: The UCSC Genome Browser has greatly expanded the data sets available on the most recent human assembly, hg38/GRCh38, to include updated gene prediction sets from GENCODE, more phenotype- and disease-associated variants from ClinVar and ClinGen, more genomic regulatory data, and a new multiple genome alignment.

References

Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.

Stephen F. Altschul, +6 more

- 01 Sep 1997 -

Nucleic Acids Research

TL;DR: A new criterion for triggering the extension of word hits, combined with a new heuristic for generating gapped alignments, yields a gapped BLAST program that runs at approximately three times the speed of the original.

The Protein Data Bank

Helen M. Berman, +7 more

- 01 Jan 2000 -

Nucleic Acids Research

TL;DR: Describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and plans for the future development of the resource.

Ultrafast and memory-efficient alignment of short DNA sequences to the human genome

Ben Langmead, +3 more

- 04 Mar 2009 -

Genome Biology

TL;DR: Bowtie extends previous Burrows-Wheeler techniques with a novel quality-aware backtracking algorithm that permits mismatches; multiple processor cores can be used simultaneously to achieve even greater alignment speeds.

The Pfam protein families database

Marco Punta, +15 more

- 01 Jan 2012 -

Nucleic Acids Research

TL;DR: The definition and use of family-specific, manually curated gathering thresholds are explained and some of the features of domains of unknown function (also known as DUFs) are discussed, which constitute a rapidly growing class of families within Pfam.

Pfam: the protein families database.

Robert D. Finn, +12 more

- 01 Jan 2014 -

Nucleic Acids Research

TL;DR: Pfam is a widely used database of protein families, containing 14,831 manually curated entries in its current release (version 27.0), and has been updated several times since 2012.
