Performance Evaluation of SpliceAI for the Prediction of Splicing of NF1 Variants

doi:10.3390/GENES12091308

Changhee Ha¹, Jong-Won Kim¹, Ja-Hyun Jang¹

Samsung Medical Center¹

25 Aug 2021-Genes (Multidisciplinary Digital Publishing Institute)-Vol. 12, Iss: 9, pp 1308

TL;DR: The authors investigated the sensitivity and specificity of SpliceAI, a recently introduced in silico splicing prediction algorithm, in conjunction with other in silico tools.

Abstract: Neurofibromatosis type 1, characterized by neurofibromas and café-au-lait macules, is one of the most common genetic disorders and is caused by pathogenic NF1 variants. Because of the high proportion of splicing mutations in NF1, identifying variants that alter splicing may be an essential task for laboratories. Here, we investigated the sensitivity and specificity of SpliceAI, a recently introduced in silico splicing prediction algorithm, in conjunction with other in silico tools. We evaluated 285 NF1 variants identified from 653 patients. The effect of variants on splicing was confirmed by complementary DNA sequencing followed by genomic DNA sequencing. For in silico prediction of splicing effects, we used SpliceAI, MaxEntScan (MES), and Splice Site Finder-like (SSF). The sensitivity and specificity of SpliceAI were 94.5% and 94.3%, respectively, with a cut-off value of Δ Score > 0.22. The area under the curve of SpliceAI was 0.975 (p < 0.0001). Combined analysis of MES/SSF showed a sensitivity of 83.6% and specificity of 82.5%. The concordance rate between SpliceAI and MES/SSF was 84.2%. SpliceAI showed better performance for the prediction of splicing alteration for NF1 variants compared with MES/SSF. As a convenient web-based tool, SpliceAI may be helpful in clinical laboratories conducting DNA-based NF1 sequencing.
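The metrics reported in the abstract (sensitivity and specificity at a Δ Score cut-off, area under the ROC curve, and concordance between two tools) can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names are hypothetical, each variant is assumed to carry a single Δ Score plus a boolean label from cDNA sequencing, and the AUC is computed with the standard rank-based (Mann-Whitney) formulation.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    scores: Sequence[float], labels: Sequence[bool], cutoff: float
) -> Tuple[float, float]:
    """Call a variant splice-altering when its score is strictly above the cutoff.

    Assumes at least one positive and one negative label are present.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if y and s > cutoff)
    fn = sum(1 for s, y in pairs if y and s <= cutoff)
    tn = sum(1 for s, y in pairs if not y and s <= cutoff)
    fp = sum(1 for s, y in pairs if not y and s > cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def auc(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Probability that a random positive scores higher than a random negative.

    Ties count as half a win (rank-based estimate of the ROC area).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def concordance(calls_a: Sequence[bool], calls_b: Sequence[bool]) -> float:
    """Fraction of variants on which two tools make the same call."""
    return sum(a == b for a, b in zip(calls_a, calls_b)) / len(calls_a)


if __name__ == "__main__":
    # Toy, made-up scores for illustration only (not data from the study).
    toy_scores = [0.95, 0.80, 0.30, 0.10, 0.25, 0.05, 0.01, 0.00]
    toy_labels = [True, True, True, True, False, False, False, False]
    print(sensitivity_specificity(toy_scores, toy_labels, cutoff=0.22))
    print(auc(toy_scores, toy_labels))
```

The strict `>` comparison mirrors the "Δ Score > 0.22" wording of the abstract; a different convention (for example `>=`) would shift borderline variants between classes.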

Citations

Minigene‐based splicing analysis and ACMG/AMP‐based tentative classification of 56 ATM variants

E. Bueno‐Martínez, Lara Sanoguera-Miralles, Alberto Valenzuela-Palomo, Ada Esteban-Sánchez, Víctor Lorca, Inés Llinares-Burguet, Jamie Allen, Alicia Garcia-Alvarez, Pedro Pérez-Segura, Mercedes Durán, Douglas F. Easton, Peter Devilee, Maaike P.G. Vreeswijk, M. De La Hoya, Eladio A Velasco-Sampedro

18 Jun 2022-The journal of pathology

TL;DR: Circumstantial evidence in three ATM variants (leakiness uncovered by the mgATM analysis together with clinical data) provides some support for a dosage‐sensitive expression model in which variants producing ≥30% of FL‐transcripts would be predicted benign, while variants producing ≤13% of FL‐transcripts might be pathogenic.

Abstract: The ataxia telangiectasia‐mutated (ATM) protein is a major coordinator of the DNA damage response pathway. ATM loss‐of‐function variants are associated with 2‐fold increased breast cancer risk. We aimed at identifying and classifying spliceogenic ATM variants detected in subjects of the large‐scale sequencing project BRIDGES. A total of 381 variants at the intron–exon boundaries were identified, 128 of which were predicted to be spliceogenic. After further filtering, we ended up selecting 56 variants for splicing analysis. Four functional minigenes (mgATM) spanning exons 4–9, 11–17, 25–29, and 49–52 were constructed in the splicing plasmid pSAD. Selected variants were genetically engineered into the four constructs and assayed in MCF‐7/HeLa cells. Forty‐eight variants (85.7%) impaired splicing, 32 of which did not show any trace of the full‐length (FL) transcript. A total of 43 transcripts were identified where the most prevalent event was exon/multi‐exon skipping. Twenty‐seven transcripts were predicted to truncate the ATM protein. A tentative ACMG/AMP (American College of Medical Genetics and Genomics/Association for Molecular Pathology)‐based classification scheme that integrates mgATM data allowed us to classify 29 ATM variants as pathogenic/likely pathogenic and seven variants as likely benign. Interestingly, the likely pathogenic variant c.1898+2T>G generated 13% of the minigene FL‐transcript due to the use of a noncanonical GG‐5’‐splice‐site (0.014% of human donor sites). Circumstantial evidence in three ATM variants (leakiness uncovered by our mgATM analysis together with clinical data) provides some support for a dosage‐sensitive expression model in which variants producing ≥30% of FL‐transcripts would be predicted benign, while variants producing ≤13% of FL‐transcripts might be pathogenic. © 2022 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.

3 citations

Comparison of In Silico Tools for Splice-Altering Variant Prediction Using Established Spliceogenic Variants: An End-User's Point of View

Woori Jang, Joon-Hee Park, Hyojin Chae, Myungshin Kim

13 Oct 2022-International journal of genomics

TL;DR: Deep learning algorithms, especially those of SpliceAI, are validated at a significantly higher rate than other in silico tools for clinically relevant NF1 variants, suggesting that deep learning algorithms outperform traditional probabilistic approaches and classical machine learning tools in predicting the de novo and cryptic splice sites.

Abstract: Assessing the impact of variants of unknown significance on splicing has become a critical issue and a bottleneck, especially with the widespread implementation of whole-genome or exome sequencing. Although multiple in silico tools are available, the interpretation and application of these tools are difficult and practical guidelines are still lacking. A streamlined decision-making process can facilitate the downstream RNA analysis in a more efficient manner. Therefore, we evaluated the performance of 8 in silico tools (Splice Site Finder, MaxEntScan, Splice-site prediction by neural network, GeneSplicer, Human Splicing Finder, SpliceAI, Splicing Predictions in Consensus Elements, and SpliceRover) using 114 NF1 spliceogenic variants, experimentally validated at the mRNA level. The change in the predicted score incurred by the variant of the nearest wild-type splice site was analyzed, and for type II, III, and IV splice variants, the change in the prediction score of de novo or cryptic splice site was also analyzed. SpliceAI and SpliceRover, tools based on deep learning, outperformed all other tools, with AUCs of 0.972 and 0.924, respectively. For de novo and cryptic splice sites, SpliceAI outperformed all other tools and showed a sensitivity of 95.7% at an optimal cut-off of 0.02 score change. Our results show that deep learning algorithms, especially those of SpliceAI, are validated at a significantly higher rate than other in silico tools for clinically relevant NF1 variants. This suggests that deep learning algorithms outperform traditional probabilistic approaches and classical machine learning tools in predicting the de novo and cryptic splice sites.
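The abstract above reports an "optimal cut-off" for the score change but does not state which criterion defined it. One common choice is Youden's J statistic (sensitivity + specificity - 1), and the Python sketch below shows how such a cut-off could be selected under that assumption. The function name and the toy values are hypothetical and are not taken from the study.

```python
from typing import Optional, Sequence, Tuple


def youden_optimal_cutoff(
    score_changes: Sequence[float], labels: Sequence[bool]
) -> Tuple[Optional[float], float]:
    """Return (cutoff, J) maximizing Youden's J = sensitivity + specificity - 1.

    A variant is called spliceogenic when its score change is >= cutoff.
    Candidate cutoffs are the observed score changes; on ties in J the
    lowest cutoff is kept. Assumes both classes are present.
    """
    pairs = list(zip(score_changes, labels))
    n_pos = sum(1 for _, y in pairs if y)
    n_neg = len(pairs) - n_pos
    best_cutoff: Optional[float] = None
    best_j = -1.0
    for cutoff in sorted(set(score_changes)):
        tp = sum(1 for s, y in pairs if y and s >= cutoff)
        tn = sum(1 for s, y in pairs if not y and s < cutoff)
        j = tp / n_pos + tn / n_neg - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


if __name__ == "__main__":
    # Made-up score changes: first four are spliceogenic, last four are not.
    toy_changes = [0.40, 0.10, 0.03, 0.01, 0.02, 0.01, 0.00, 0.00]
    toy_labels = [True, True, True, True, False, False, False, False]
    print(youden_optimal_cutoff(toy_changes, toy_labels))
```

Other criteria, such as fixing a minimum sensitivity and then maximizing specificity, would generally give a different cut-off on the same data.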

3 citations

SpliceAI-visual: a free online tool to improve SpliceAI splicing variant interpretation

Jean-Madeleine de Sainte Agathe, Mathilde Filser, Bertrand Isidor, Thomas Besnard, Paul Gueguen, Aurélien Perrin, C. Van Goethem, Camille Verebi, Marion Masingue, John Rendu, Mireille Cossée, Anne Bergougnoux, Laurent Frobert, Julien Buratti, Elodie Lejeune, Éric Le Guern, Florence Pasquier, Fabienne Clot, Vasiliki Kalatzis, Anne-Françoise Roux, Benjamin Cogné, David Baux

10 Feb 2023-Human Genomics

TL;DR: SpliceAI-visual is a free online tool based on the SpliceAI algorithm that is able to annotate complex variants (e.g., complex delins); the authors demonstrate its relevance in the assessment/modulation of the PVS1 classification criteria.

Abstract: SpliceAI is an open-source deep learning splicing prediction algorithm that has demonstrated in the past few years its high ability to predict splicing defects caused by DNA variations. However, its outputs present several drawbacks: (1) although the numerical values are very convenient for batch filtering, their precise interpretation can be difficult, (2) the outputs are delta scores which can sometimes mask a severe consequence, and (3) complex delins are most often not handled. We present here SpliceAI-visual, a free online tool based on the SpliceAI algorithm, and show how it complements the traditional SpliceAI analysis. First, SpliceAI-visual manipulates raw scores and not delta scores, as the latter can be misleading in certain circumstances. Second, the outcome of SpliceAI-visual is user-friendly thanks to the graphical presentation. Third, SpliceAI-visual is currently one of the only SpliceAI-derived implementations able to annotate complex variants (e.g., complex delins). We report here the benefits of using SpliceAI-visual and demonstrate its relevance in the assessment/modulation of the PVS1 classification criteria. We also show how SpliceAI-visual can elucidate several complex splicing defects taken from the literature but also from unpublished cases. SpliceAI-visual is available as a Google Colab notebook and has also been fully integrated in a free online variant interpretation tool, MobiDetails (https://mobidetails.iurc.montp.inserm.fr/MD).

3 citations

Alternative Splicing in Human Physiology and Disease

Pinelopi I Artemaki, Christos K. Kontos

01 Oct 2022-Genes

TL;DR: Since the discovery of alternative splicing in the late 1970s, a great number of alternatively spliced transcripts have emerged; this number has exponentially increased with the advances in transcriptomics and massive parallel sequencing technologies.

Abstract: Since the discovery of alternative splicing in the late 1970s, a great number of alternatively spliced transcripts have emerged; this number has exponentially increased with the advances in transcriptomics and massive parallel sequencing technologies [...].

2 citations

SpliceAI-10k calculator for the prediction of pseudoexonization, intron retention, and exon deletion

Daffodil M. Canson, Olga Kondrashova, M. De La Hoya, Michael T. Parsons, Dylan M. Glubb, Amanda B. Spurdle

14 Dec 2022-Bioinformatics

TL;DR: The SpliceAI-10k calculator (SAI-10k-calc) was developed to extend the use of SpliceAI, a deep learning-based tool, to predict the aberration type and the size of the inserted or deleted sequence, using an analysis window of 10,000 nucleotides.

Abstract: Summary: SpliceAI is a widely used splicing prediction tool and its most common application relies on the maximum delta score to assign variant impact on splicing. We developed the SpliceAI-10k calculator (SAI-10k-calc) to extend use of this tool to predict: the splicing aberration type including pseudoexonization, intron retention, partial exon deletion, and (multi)exon skipping using a 10 kb analysis window; the size of inserted or deleted sequence; the effect on reading frame; and the altered amino acid sequence. SAI-10k-calc has 95% sensitivity and 96% specificity for predicting variants that impact splicing, computed from a control dataset of 1,212 single nucleotide variants (SNVs) with curated splicing assay results. Notably, it has high performance (≥84% accuracy) for predicting pseudoexon and partial intron retention. The automated amino acid sequence prediction allows for efficient identification of variants that are expected to result in mRNA nonsense-mediated decay or translation of truncated proteins. Availability and implementation: SAI-10k-calc is implemented in R (https://github.com/adavi4/SAI-10k-calc) and also available as a Microsoft Excel spreadsheet. Users can adjust the default thresholds to suit their target performance values. Supplementary information: Supplementary data are available online.
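Among the outputs listed in this abstract, the effect on reading frame follows directly from the size of the inserted or deleted sequence: a net change that is a multiple of three nucleotides preserves the frame, and any other non-zero change shifts it. The Python sketch below illustrates only that arithmetic. It is not the SAI-10k-calc implementation (which the abstract says is written in R), and the function name and return labels are hypothetical.

```python
def reading_frame_effect(net_change_nt: int) -> str:
    """Classify the frame consequence of a net transcript length change.

    net_change_nt is positive for inserted sequence (e.g. a pseudoexon or
    retained intronic segment) and negative for deleted sequence (e.g. exon
    skipping or a partial exon deletion).
    """
    if net_change_nt == 0:
        return "unchanged"
    # Python's % always returns a non-negative result for a positive modulus,
    # so deletions (negative sizes) are handled by the same test.
    return "in-frame" if net_change_nt % 3 == 0 else "frameshift"


if __name__ == "__main__":
    # Hypothetical event sizes, chosen only to exercise each branch.
    for size in (-123, -52, 90, 71, 0):
        print(size, reading_frame_effect(size))
```

An in-frame change can still be damaging (for example if the inserted sequence contains a stop codon), so a frame call of this kind is only a first filter.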

2 citations
