Performance Evaluation of SpliceAI for the Prediction of Splicing of NF1 Variants
TL;DR: The authors investigated the sensitivity and specificity of SpliceAI, a recently introduced in silico splicing prediction algorithm, in conjunction with other in silico tools.
Abstract: Neurofibromatosis type 1, characterized by neurofibromas and café-au-lait macules, is one of the most common genetic disorders caused by pathogenic NF1 variants. Because of the high proportion of splicing mutations in NF1, identifying variants that alter splicing may be an essential issue for laboratories. Here, we investigated the sensitivity and specificity of SpliceAI, a recently introduced in silico splicing prediction algorithm, in conjunction with other in silico tools. We evaluated 285 NF1 variants identified from 653 patients. The effect of variants on splicing was confirmed by complementary DNA sequencing followed by genomic DNA sequencing. For in silico prediction of splicing effects, we used SpliceAI, MaxEntScan (MES), and Splice Site Finder-like (SSF). The sensitivity and specificity of SpliceAI were 94.5% and 94.3%, respectively, with a cut-off value of Δ Score > 0.22. The area under the curve of SpliceAI was 0.975 (p < 0.0001). Combined analysis of MES/SSF showed a sensitivity of 83.6% and specificity of 82.5%. The concordance rate between SpliceAI and MES/SSF was 84.2%. SpliceAI showed better performance for the prediction of splicing alteration for NF1 variants compared with MES/SSF. As a convenient web-based tool, SpliceAI may be helpful in clinical laboratories conducting DNA-based NF1 sequencing.
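The sensitivity and specificity reported above depend on the chosen cut-off (Δ Score > 0.22). The following is a minimal sketch of how such figures can be computed from per-variant scores and experimentally confirmed splicing status. The function name, variable names, and toy data are hypothetical illustrations and are not taken from the study.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    delta_scores: Sequence[float],
    alters_splicing: Sequence[bool],
    cutoff: float = 0.22,
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when 'delta score > cutoff' is called positive."""
    if len(delta_scores) != len(alters_splicing):
        raise ValueError("scores and labels must have the same length")
    tp = fn = tn = fp = 0
    for score, truth in zip(delta_scores, alters_splicing):
        predicted = score > cutoff
        if truth and predicted:
            tp += 1  # splicing variant correctly flagged
        elif truth:
            fn += 1  # splicing variant missed
        elif predicted:
            fp += 1  # non-splicing variant wrongly flagged
        else:
            tn += 1  # non-splicing variant correctly passed
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative variant")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical toy data, not taken from the study.
toy_scores = [0.91, 0.35, 0.10, 0.05, 0.30, 0.01]
toy_labels = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(toy_scores, toy_labels)
print(f"sensitivity={sens:.3f}, specificity={spec:.3f}")
```

Raising the cut-off trades sensitivity for specificity, which is why a single operating point such as 0.22 is usually reported alongside the area under the curve.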
Citations
TL;DR: Circumstantial evidence in three ATM variants (leakiness uncovered by the mgATM analysis together with clinical data) provides some support for a dosage‐sensitive expression model in which variants producing ≥30% of FL‐transcripts would be predicted benign, while variants producing ≤13% of FL‐transcripts might be pathogenic.
Abstract: The ataxia telangiectasia‐mutated (ATM) protein is a major coordinator of the DNA damage response pathway. ATM loss‐of‐function variants are associated with 2‐fold increased breast cancer risk. We aimed at identifying and classifying spliceogenic ATM variants detected in subjects of the large‐scale sequencing project BRIDGES. A total of 381 variants at the intron–exon boundaries were identified, 128 of which were predicted to be spliceogenic. After further filtering, we ended up selecting 56 variants for splicing analysis. Four functional minigenes (mgATM) spanning exons 4–9, 11–17, 25–29, and 49–52 were constructed in the splicing plasmid pSAD. Selected variants were genetically engineered into the four constructs and assayed in MCF‐7/HeLa cells. Forty‐eight variants (85.7%) impaired splicing, 32 of which did not show any trace of the full‐length (FL) transcript. A total of 43 transcripts were identified where the most prevalent event was exon/multi‐exon skipping. Twenty‐seven transcripts were predicted to truncate the ATM protein. A tentative ACMG/AMP (American College of Medical Genetics and Genomics/Association for Molecular Pathology)‐based classification scheme that integrates mgATM data allowed us to classify 29 ATM variants as pathogenic/likely pathogenic and seven variants as likely benign. Interestingly, the likely pathogenic variant c.1898+2T>G generated 13% of the minigene FL‐transcript due to the use of a noncanonical GG‐5’‐splice‐site (0.014% of human donor sites). Circumstantial evidence in three ATM variants (leakiness uncovered by our mgATM analysis together with clinical data) provides some support for a dosage‐sensitive expression model in which variants producing ≥30% of FL‐transcripts would be predicted benign, while variants producing ≤13% of FL‐transcripts might be pathogenic. © 2022 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
3 citations
••
TL;DR: Deep learning algorithms, especially those of SpliceAI, are validated at a significantly higher rate than other in silico tools for clinically relevant NF1 variants, suggesting that deep learning algorithms outperform traditional probabilistic approaches and classical machine learning tools in predicting the de novo and cryptic splice sites.
Abstract: Assessing the impact of variants of unknown significance on splicing has become a critical issue and a bottleneck, especially with the widespread implementation of whole-genome or exome sequencing. Although multiple in silico tools are available, the interpretation and application of these tools are difficult and practical guidelines are still lacking. A streamlined decision-making process can facilitate the downstream RNA analysis in a more efficient manner. Therefore, we evaluated the performance of 8 in silico tools (Splice Site Finder, MaxEntScan, Splice-site prediction by neural network, GeneSplicer, Human Splicing Finder, SpliceAI, Splicing Predictions in Consensus Elements, and SpliceRover) using 114 NF1 spliceogenic variants, experimentally validated at the mRNA level. The change in the predicted score incurred by the variant of the nearest wild-type splice site was analyzed, and for type II, III, and IV splice variants, the change in the prediction score of de novo or cryptic splice site was also analyzed. SpliceAI and SpliceRover, tools based on deep learning, outperformed all other tools, with AUCs of 0.972 and 0.924, respectively. For de novo and cryptic splice sites, SpliceAI outperformed all other tools and showed a sensitivity of 95.7% at an optimal cut-off of 0.02 score change. Our results show that deep learning algorithms, especially those of SpliceAI, are validated at a significantly higher rate than other in silico tools for clinically relevant NF1 variants. This suggests that deep learning algorithms outperform traditional probabilistic approaches and classical machine learning tools in predicting the de novo and cryptic splice sites.
3 citations
••
TL;DR: SpliceAI-visual is a free online tool based on the SpliceAI algorithm that can annotate complex variants (e.g., complex delins); the authors demonstrate its relevance in the assessment/modulation of the PVS1 classification criteria.
Abstract: SpliceAI is an open-source deep learning splicing prediction algorithm that has demonstrated in the past few years its high ability to predict splicing defects caused by DNA variations. However, its outputs present several drawbacks: (1) although the numerical values are very convenient for batch filtering, their precise interpretation can be difficult, (2) the outputs are delta scores which can sometimes mask a severe consequence, and (3) complex delins are most often not handled. We present here SpliceAI-visual, a free online tool based on the SpliceAI algorithm, and show how it complements the traditional SpliceAI analysis. First, SpliceAI-visual manipulates raw scores and not delta scores, as the latter can be misleading in certain circumstances. Second, the outcome of SpliceAI-visual is user-friendly thanks to the graphical presentation. Third, SpliceAI-visual is currently one of the only SpliceAI-derived implementations able to annotate complex variants (e.g., complex delins). We report here the benefits of using SpliceAI-visual and demonstrate its relevance in the assessment/modulation of the PVS1 classification criteria. We also show how SpliceAI-visual can elucidate several complex splicing defects taken from the literature but also from unpublished cases. SpliceAI-visual is available as a Google Colab notebook and has also been fully integrated in a free online variant interpretation tool, MobiDetails ( https://mobidetails.iurc.montp.inserm.fr/MD ).
3 citations
••
TL;DR: Since the discovery of alternative splicing in the late 1970s, a great number of alternatively spliced transcripts have emerged; this number has exponentially increased with the advances in transcriptomics and massive parallel sequencing technologies.
Abstract: Since the discovery of alternative splicing in the late 1970s, a great number of alternatively spliced transcripts have emerged; this number has exponentially increased with the advances in transcriptomics and massive parallel sequencing technologies [...].
2 citations
••
TL;DR: The SpliceAI-10k calculator (SAI-10k-calc) is developed to extend use of this deep learning-based tool to predict aberration type and size of inserted or deleted sequence, using an analysis window of 10,000 nucleotides.
Abstract: Summary: SpliceAI is a widely used splicing prediction tool and its most common application relies on the maximum delta score to assign variant impact on splicing. We developed the SpliceAI-10k calculator (SAI-10k-calc) to extend use of this tool to predict: the splicing aberration type including pseudoexonization, intron retention, partial exon deletion, and (multi)exon skipping using a 10 kb analysis window; the size of inserted or deleted sequence; the effect on reading frame; and the altered amino acid sequence. SAI-10k-calc has 95% sensitivity and 96% specificity for predicting variants that impact splicing, computed from a control dataset of 1,212 single nucleotide variants (SNVs) with curated splicing assay results. Notably, it has high performance (≥84% accuracy) for predicting pseudoexon and partial intron retention. The automated amino acid sequence prediction allows for efficient identification of variants that are expected to result in mRNA nonsense-mediated decay or translation of truncated proteins. Availability and implementation: SAI-10k-calc is implemented in R (https://github.com/adavi4/SAI-10k-calc) and also available as a Microsoft Excel spreadsheet. Users can adjust the default thresholds to suit their target performance values. Supplementary information: Supplementary data are available online.
2 citations
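The abstract above notes that the most common use of SpliceAI relies on the maximum delta score. As a rough sketch, the snippet below extracts that maximum from one SpliceAI annotation string. It assumes the pipe-delimited field order used in SpliceAI's VCF output (ALLELE|SYMBOL|DS_AG|DS_AL|DS_DG|DS_DL|DP_AG|DP_AL|DP_DG|DP_DL). The function name and the example values are hypothetical, and this is not the SAI-10k-calc implementation.

```python
# Field order assumed from SpliceAI's VCF INFO annotation.
FIELDS = (
    "ALLELE", "SYMBOL",
    "DS_AG", "DS_AL", "DS_DG", "DS_DL",  # delta scores: acceptor/donor gain/loss
    "DP_AG", "DP_AL", "DP_DG", "DP_DL",  # positions relative to the variant
)
DELTA_SCORE_KEYS = ("DS_AG", "DS_AL", "DS_DG", "DS_DL")


def max_delta_score(annotation: str) -> float:
    """Return the largest of the four delta scores in one annotation string."""
    parts = annotation.split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    record = dict(zip(FIELDS, parts))
    return max(float(record[key]) for key in DELTA_SCORE_KEYS)


# Hypothetical annotation for illustration only.
example = "A|NF1|0.01|0.00|0.85|0.02|-5|10|2|30"
top = max_delta_score(example)
print(f"max delta score = {top:.2f}")
```

The maximum alone says that splicing is likely affected but not how; the aberration type and size that the abstract describes require the positional fields as well.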
References
TL;DR: A representation and interpretation of the area under a receiver operating characteristic (ROC) curve obtained by the "rating" method, or by mathematical predictions based on patient characteristics, is presented and it is shown that in such a setting the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject.
Abstract: A representation and interpretation of the area under a receiver operating characteristic (ROC) curve obtained by the "rating" method, or by mathematical predictions based on patient characteristics, is presented. It is shown that in such a setting the area represents the probability that a randomly chosen diseased subject is (correctly) rated or ranked with greater suspicion than a randomly chosen non-diseased subject. Moreover, this probability of a correct ranking is the same quantity that is estimated by the already well-studied nonparametric Wilcoxon statistic. These two relationships are exploited to (a) provide rapid closed-form expressions for the approximate magnitude of the sampling variability, i.e., standard error that one uses to accompany the area under a smoothed ROC curve, (b) guide in determining the size of the sample required to provide a sufficiently reliable estimate of this area, and (c) determine how large sample sizes should be to ensure that one can statistically detect difference...
19,398 citations
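The interpretation described in the abstract above (the area under the ROC curve equals the probability that a randomly chosen diseased subject is ranked above a randomly chosen non-diseased one) can be illustrated with a short pairwise count, which corresponds to the normalized Wilcoxon/Mann-Whitney statistic. This is a minimal sketch on hypothetical scores. Ties are counted as one half, a common convention that is assumed here.

```python
from typing import Sequence


def auc_by_ranking(diseased: Sequence[float], healthy: Sequence[float]) -> float:
    """Estimate AUC as the fraction of (diseased, healthy) pairs ranked correctly."""
    if not diseased or not healthy:
        raise ValueError("both groups must be non-empty")
    correct = 0.0
    for d in diseased:
        for h in healthy:
            if d > h:
                correct += 1.0  # diseased subject rated with greater suspicion
            elif d == h:
                correct += 0.5  # tie counted as half (assumed convention)
    return correct / (len(diseased) * len(healthy))


# Hypothetical scores for illustration only.
diseased_scores = [0.9, 0.8, 0.4]
healthy_scores = [0.5, 0.3, 0.1]
auc = auc_by_ranking(diseased_scores, healthy_scores)
print(f"AUC = {auc:.3f}")
```

The double loop is quadratic in the sample size; it is written this way to mirror the probabilistic definition, not for efficiency.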
••
TL;DR: Because of the increased complexity of analysis and interpretation of clinical genetic testing described in this report, the ACMG strongly recommends that clinical molecular genetic testing should be performed in a Clinical Laboratory Improvement Amendments–approved laboratory, with results interpreted by a board-certified clinical molecular geneticist or molecular genetic pathologist or the equivalent.
17,834 citations
••
Drexel University, Yeshiva University, Roswell Park Cancer Institute, Virginia Commonwealth University, Van Andel Institute, Science Applications International Corporation, Massachusetts Institute of Technology, Harvard University, University of Miami, Icahn School of Medicine at Mount Sinai, University of Chicago, Howard Hughes Medical Institute, University of Geneva, Stanford University, University of Oxford, University of North Carolina at Chapel Hill, National Institutes of Health
TL;DR: The Genotype-Tissue Expression (GTEx) project is described, which will establish a resource database and associated tissue bank for the scientific community to study the relationship between genetic variation and gene expression in human tissues.
Abstract: Genome-wide association studies have identified thousands of loci for common diseases, but, for the majority of these, the mechanisms underlying disease susceptibility remain unknown. Most associated variants are not correlated with protein-coding changes, suggesting that polymorphisms in regulatory regions probably contribute to many disease phenotypes. Here we describe the Genotype-Tissue Expression (GTEx) project, which will establish a resource database and associated tissue bank for the scientific community to study the relationship between genetic variation and gene expression in human tissues.
6,545 citations
••
TL;DR: The landscape of gene expression across tissues is described, thousands of tissue-specific and shared regulatory expression quantitative trait loci (eQTL) variants are cataloged, complex network relationships are described, and signals from genome-wide association studies explained by eQTLs are identified.
Abstract: Understanding the functional consequences of genetic variation, and how it affects complex human disease and quantitative traits, remains a critical challenge for biomedicine. We present an analysi...
4,418 citations
••
TL;DR: This work has examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites, and over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas.
Abstract: The GENCODE Consortium aims to identify all gene features in the human genome using a combination of computational analysis, manual annotation, and experimental validation. Since the first public release of this annotation data set, few new protein-coding loci have been added, yet the number of alternative splicing transcripts annotated has steadily increased. The GENCODE 7 release contains 20,687 protein-coding and 9640 long noncoding RNA loci and has 33,977 coding transcripts not represented in UCSC genes and RefSeq. It also has the most comprehensive annotation of long noncoding RNA (lncRNA) loci publicly available with the predominant transcript form consisting of two exons. We have examined the completeness of the transcript annotation and found that 35% of transcriptional start sites are supported by CAGE clusters and 62% of protein-coding genes have annotated polyA sites. Over one-third of GENCODE protein-coding genes are supported by peptide hits derived from mass spectrometry spectra submitted to Peptide Atlas. New models derived from the Illumina Body Map 2.0 RNA-seq data identify 3689 new loci not currently in GENCODE, of which 3127 consist of two exon models indicating that they are possibly unannotated long noncoding loci. GENCODE 7 is publicly available from gencodegenes.org and via the Ensembl and UCSC Genome Browsers.
4,281 citations