GENCODE: The reference human genome annotation for The ENCODE Project
Citations
30,684 citations
13,548 citations
8,106 citations
5,365 citations
Cites methods from "GENCODE: The reference human genome..."
...have longer tandem isoforms, we extended them accordingly, using additional annotations provided by (i) the ‘comprehensive’ set of Gencode gene models (Harrow et al., 2012), (ii) all mRNAs in the RefSeq database (Pruitt et al., 2012), downloaded from the refGene database through the UCSC table...
[...]
...Prior to use, our PCT scores were updated to take advantage of improvements in both mouse and human 3′-UTR annotations (Harrow et al., 2012; Flicek et al., 2014), the additional sequenced vertebrate genomes aligned to the mouse and human genomes (Karolchik et al., 2014), and our expanded set of...
[...]
...The human and mouse databases started with Gencode annotations (Harrow et al., 2012), for which 3′ UTRs were extended, when possible, using RefSeq annotations (Pruitt et al., 2012), recently identified long 3′-UTR isoforms (Miura et al., 2013), and 3P-seq clusters marking more distal cleavage and...
[...]
...We updated human PCT scores using the following datasets: (i) 3′ UTRs derived from 19,800 human protein-coding genes annotated in Gencode version 19 (Harrow et al., 2012), and (ii) 3′-UTR multiple sequence alignments (MSAs) across 84 vertebrate species derived from the 100-way multiz alignments in...
[...]
...3′-UTR profiles for TargetScan7 predictions: To build databases of human and mouse 3′-UTR profiles, we began with the ‘basic’ set of protein-coding gene models deposited in Gencode v19 (human hg19 assembly) and Gencode vM3 (mouse mm10 assembly), respectively (Harrow et al., 2012)....
[...]
4,658 citations
Cites background from "GENCODE: The reference human genome..."
...There are two major sources of Homo sapiens annotation: GENCODE [17] and Reference Sequence (RefSeq) [18] at the National Center for Biotechnology Information (NCBI)....
[...]
...For Homo sapiens and Mus musculus this is the GENCODE gene set, which denotes that it is a full merge of Ensembl’s evidence-based transcript predictions with manual annotation to create the most extensive set of transcript isoforms for these species [36]....
[...]
...For example, in H. sapiens and M. musculus the filtered GENCODE Basic transcript set includes the vast majority of transcripts identified as dominantly expressed [36] and consensus coding sequence (CCDS) annotation highlights transcripts having the same CDS in both RefSeq and Ensembl....
[...]
...GENCODE’s aim is to create a comprehensive transcript set to represent expression of each isoform across any tissue and stage of development and, as a result, there are, on average, nearly four transcript isoforms per protein-coding gene....
[...]
...There are differences in how the transcript sets are produced: GENCODE annotation is genome-based while RefSeq transcripts are independent of the reference genome....
[...]
References
70,111 citations
"GENCODE: The reference human genome..." refers methods in this paper
...The finished genomic sequence is analyzed using a modified Ensembl pipeline (Searle et al. 2004), and BLAST results of cDNAs/ESTs and proteins, along with various ab initio predictions, can be analyzed manually in the annotation browser tool Otterlace (http://www.sanger.ac.uk/resources/software/otterlace/)....
[...]
...These data were aligned to the individual BAC clones that make up the reference genome sequence using BLAST (Altschul et al. 1997) with a subsequent realignment of transcript data by Est2Genome (Mott 1997)....
[...]
34,239 citations
"GENCODE: The reference human genome..." refers background in this paper
...• Matador3D is locally installed and checks for structural homologs for each transcript in the PDB (Berman et al. 2000)....
[...]
...Twenty-six thousand nine hundred fifty-five isoforms (31.9% of all isoforms or 42.3% of alternative isoforms) would have lost or damaged structural domains, based on alignments with Protein Data Bank (PDB) structures, and 16,540 isoforms (19.6% of all isoforms or 26% of alternative isoforms) would lose functionally important residues....
[...]
20,335 citations
Additional excerpts
...Mapping and validation of amplified exon–exon junction: Thirty-five- or 75-nucleotide (nt)-long reads were mapped both on to the reference human genome (hg19) and the predicted spliced amplicons with Bowtie 0.12.5 (Langmead et al. 2009)....
[...]
14,075 citations
"GENCODE: The reference human genome..." refers background or methods in this paper
...(Finn et al. 2010) or with damaged Pfam domains with respect to the constitutional variant for the same gene....
[...]
...• SPADE uses a locally installed version of the program Pfamscan (Finn et al. 2010) to identify the conservation of protein functional domains....
[...]
...Once the correct transcript structure had been ascertained, the protein-coding potential of the transcript was determined on the basis of similarity to known protein sequences, the sequences of orthologous and paralogous proteins, the presence of Pfam functional domains (Finn et al. 2010), possible alternative ORFs, the presence of retained intronic sequence, and the likely susceptibility of the transcript to NMD (Lewis et al....
[...]
9,415 citations
"GENCODE: The reference human genome..." refers background or methods in this paper
...Once the correct transcript structure had been ascertained, the protein-coding potential of the transcript was determined on the basis of similarity to known protein sequences, the sequences of orthologous and paralogous proteins, the presence of Pfam functional domains (Finn et al. 2010), possible alternative ORFs, the presence of retained intronic sequence, and the likely susceptibility of the transcript to NMD (Lewis et al....
[...]
...• SPADE uses a locally installed version of the program Pfamscan (Finn et al. 2010) to identify the conservation of protein functional domains....
[...]
...as translated in the GENCODE 7 release, 30,148 (35.7% of all transcripts or 47.3% of alternative transcripts) would generate protein isoforms either with fewer Pfam functional domains (Finn et al. 2010) or with damaged Pfam domains with respect to the constitutional variant for the same gene....
[...]