
A unified analytic framework for prioritization of non-coding variants of uncertain significance in heritable breast and ovarian cancer



TECHNICAL ADVANCE Open Access
Eliseos J. Mucaki¹, Natasha G. Caminsky¹, Ami M. Perri¹, Ruipeng Lu², Alain Laederach³, Matthew Halvorsen⁴,
Joan H. M. Knoll⁵,⁶ and Peter K. Rogan¹,²,⁶,⁷*
Abstract
Background: Sequencing of both healthy and disease singletons yields many novel and low frequency variants of
uncertain significance (VUS). Complete gene and genome sequencing by next generation sequencing (NGS)
significantly increases the number of VUS detected. While prior studies have emphasized protein coding variants,
non-coding sequence variants have also been proven to significantly contribute to high penetrance disorders, such
as hereditary breast and ovarian cancer (HBOC). We present a strategy for analyzing different functional classes of
non-coding variants based on information theory (IT) and prioritizing patients with large intragenic deletions.
Methods: We captured and enriched for coding and non-coding variants in genes known to harbor mutations that
increase HBOC risk. Custom oligonucleotide baits spanning the complete coding, non-coding, and intergenic
regions 10 kb up- and downstream of ATM, BRCA1, BRCA2, CDH1, CHEK2, PALB2, and TP53 were synthesized for
solution hybridization enrichment. Unique and divergent repetitive sequences were sequenced in 102 high-risk,
anonymized patients without identified mutations in BRCA1/2. Aside from protein coding and copy number
changes, IT-based sequence analysis was used to identify and prioritize pathogenic non-coding variants that
occurred within sequence elements predicted to be recognized by proteins or protein complexes involved in
mRNA splicing, transcription, and untranslated region (UTR) binding and structure. This approach was
supplemented by in silico and laboratory analysis of UTR structure.
Results: 15,311 unique variants were identified, of which 245 occurred in coding regions. With the unified IT-
framework, 132 variants were identified and 87 functionally significant VUS were further prioritized. An intragenic
32.1 kb interval in BRCA2 that was likely hemizygous was detected in one patient. We also identified 4 stop-gain
variants and 3 reading-frame altering exonic insertions/deletions (indels).
Conclusions: We have presented a strategy for complete gene sequence analysis followed by a unified framework
for interpreting non-coding variants that may affect gene expression. This approach distills large numbers of
variants detected by NGS to a limited set of variants prioritized as potential deleterious changes.
Keywords: Information theory, Hereditary breast and ovarian cancer, Transcription factor binding, RNA-binding
protein, Prioritization, Variants of uncertain significance, Splicing, Non-coding, Next-generation sequencing
* Correspondence: progan@uwo.ca
EJM and NGC should be considered to be joint first authors.
¹Department of Biochemistry, Schulich School of Medicine and Dentistry,
Western University, London, ON N6A 2C1, Canada
²Department of Computer Science, Faculty of Science, Western University,
London N6A 2C1, Canada
Full list of author information is available at the end of the article
© 2016 Mucaki et al. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and
reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to
the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver
(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Mucaki et al. BMC Medical Genomics (2016) 9:19
DOI 10.1186/s12920-016-0178-5

Background
Advances in NGS have enabled panels of genes, whole
exomes, and even whole genomes to be sequenced for
multiple individuals in parallel. These platforms have be-
come so cost-effective and accurate that they are begin-
ning to be adopted in clinical settings, as evidenced by
recent FDA approvals [1, 2]. However, the overwhelming
number of gene variants revealed in each individual has
challenged interpretation of clinically significant genetic
variation [3–5].
After common variants, which are rarely pathogenic, are eliminated,
the number of VUS in the residual set remains substantial. Assessment
of pathogenicity is not trivial, considering that nearly half of the
unique variants are novel and cannot be resolved using published
literature and variant databases [6]. Furthermore, loss-of-function
variants (those resulting in protein truncation, which are most likely
to be deleterious) represent a very small proportion of identified
variants. The remaining variants are missense and synonymous variants
in the exon, single nucleotide changes, or in frame insertions or
deletions in intervening and intergenic regions. Functional analysis
of large numbers of these variants often cannot be performed, due to
lack of relevant tissues, and the cost, time, and labor required for
each variant. Another problem is that in silico protein coding
prediction tools exhibit inconsistent accuracy and are thus
problematic for clinical risk evaluation [7–9]. Consequently, many
HBOC patients undergoing genetic susceptibility testing will receive
either an inconclusive (no BRCA variant identified) or an uncertain
(BRCA VUS) result. The former has been reported in up to 80 % of cases
and depends on the number of genes tested [10]. The occurrence of
uncertain BRCA mutations varies greatly (as high as 46 % in African
American populations and as low as 2.1 %) among tested individuals
depending on the laboratory and the patient's ethnicity [11–13]. The
inconsistency in diagnostic yield is significant, considering that
HBOC accounts for 5–10 % of all breast/ovarian cancer [14, 15].
One strategy to improve variant interpretation in patients
is to reduce the full set of variants to a manageable list of
potentially pathogenic variants. Evidence for pathogenicity
of VUS in genetic disease is often limited to amino acid
coding changes [16, 17], and mutations affecting splicing,
transcriptional activation, and mRNA stability tend to be
underreported [18–24]. Splicing errors are estimated to represent
15 % of disease-causing mutations [25], but this proportion may be
much higher [26, 27]. The impact of a single nucleotide change in a
recognition sequence can range from insignificant to complete
abolition of a protein binding site. Aberrant splicing events causing
frameshifts often disrupt protein function; the effects of in-frame
changes depend on gene context. The complexity of interpretation of
non-coding sequence variants benefits from computational approaches
[28] and direct functional analyses [29–33] that may each support
evidence of pathogenicity.
Ex vivo transfection assays developed to determine the pathogenicity
of VUS predicted (using in silico tools) to lead to splicing
aberrations have been successful in identifying pathogenic sequence
variants [34, 35]. IT-based analysis of splicing variants has proven
to be robust and accurate (as determined by functional assays for
mRNA expression or binding assays) at analyzing splice site (SS)
variants, including splicing regulatory factor binding sites
(SRFBSs), and in distinguishing them from polymorphisms in both rare
and common diseases [36–39]. However, IT can be applied to any
sequence recognized and bound by another factor [40], such as
transcription factor binding sites (TFBSs) and RNA-binding protein
binding sites (RBBSs). IT is used as a measure of sequence
conservation and is more accurate than consensus sequences [41]. The
individual information ($R_i$) of a base is related to thermodynamic
entropy, and therefore free energy of binding, and is measured on a
logarithmic scale (in bits). By comparing the change in information
($\Delta R_i$) for a nucleotide variation of a bound sequence, the
resulting change in binding affinity is $\geq 2^{\Delta R_i}$, such
that a 1 bit change in information will result in at least a 2-fold
change in binding affinity [42].
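The relationship between a change in individual information and binding affinity can be illustrated with a minimal Python sketch. The aligned sites are invented toy data, the function names are hypothetical, a uniform background of 2 bits per position is assumed, and the small-sample correction used in published information weight matrices is replaced by a simple pseudocount, so the numbers are illustrative only.

```python
import math

BASES = "ACGT"

def weight_matrix(sites):
    """Information weight matrix: w(b, l) = 2 + log2 f(b, l) bits.

    Assumes a uniform background (2 bits per position) and adds a
    pseudocount of 1 so that unseen bases get a finite weight.
    """
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + len(BASES)
        matrix.append({b: 2.0 + math.log2((column.count(b) + 1) / total)
                       for b in BASES})
    return matrix

def individual_information(matrix, seq):
    """R_i of one sequence: sum of the per-position weights (bits)."""
    return sum(matrix[pos][base] for pos, base in enumerate(seq))

def min_fold_change(delta_ri):
    """Minimum fold change in binding affinity implied by |delta R_i| bits."""
    return 2.0 ** abs(delta_ri)

# Toy set of aligned binding sites (hypothetical, not from the study).
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT", "CAGGTGAGC"]
matrix = weight_matrix(toy_sites)

ri_ref = individual_information(matrix, "CAGGTAAGT")
ri_var = individual_information(matrix, "CAGATAAGT")  # G>A at position 4
delta = ri_var - ri_ref
print(f"R_i reference = {ri_ref:.2f} bits, variant = {ri_var:.2f} bits")
print(f"delta R_i = {delta:.2f} bits, "
      f"at least {min_fold_change(delta):.1f}-fold change in affinity")
```

Because only one position differs between the two sequences, the change in information reduces to the difference between two weights in a single column of the matrix.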
IT measures nucleotide sequence conservation and does not provide
information on effects of variants on mRNA secondary (2°) structure,
nor can it accurately predict effects of amino acid sequence changes.
Associations of structural changes in untranslated regions (UTR) of
mRNA with disease justify including predicted effects of these
changes on structure in the comprehensive analysis of sequence
variants [43]. Other in silico methods have attempted to address
these deficiencies. For example, Halvorsen et al. (2010) introduced
an algorithm called SNPfold, which computes the potential effect of a
single nucleotide variant (SNV) on mRNA structure [20]. Predictions
made by SNPfold can be tested by the SHAPE assay (Selective
2′-Hydroxyl Acylation analyzed by Primer Extension) [44], which
provides evidence for sequence variants that lead to structural
changes in mRNA by detection of covalent adducts in mRNA.
The implications of improved VUS interpretation are particularly
relevant for HBOC due to its incidence and the adoption of panel
testing for these individuals [45, 46]. It has been suggested that
uninformative results in patients with a high risk profile imply that
deleterious variants lie in untested regions of BRCA1/2 or in
untested genes, or are unrecognized [47, 48]. This is also supported
by studies where families with linkage to BRCA1/2 had no detectable
pathogenic mutation (however, it is noteworthy that detection rates
of BRCA mutations in families with documented linkage to these loci
appear to vary by ascertainment, inclusion criteria, and technology
used to identify the mutations) [49, 50]. The concept of non-BRCA
gene association has been demonstrated by the identification of
low-to-moderate risk HBOC genes, and variants within coding and
non-coding regions affecting splicing and regulatory factor binding
[51, 52]. Consequently, VUS, which include rare missense changes and
other coding and non-coding changes in all of these genes, greatly
outnumber the catalog of known deleterious mutations [53].
Here, we develop and evaluate IT-based models to predict potential
non-coding sequence mutations in SSs, TFBSs, and RBBSs in 7 genes
sequenced in their entirety. These models were used to analyze 102
anonymous HBOC patients who did not exhibit known BRCA1/2 coding
mutations at the time of initial testing, despite meeting the
criteria for BRCA genetic testing. The genes are ATM, BRCA1, BRCA2,
CDH1, CHEK2, PALB2, and TP53, all of which have been reported to
harbor mutations that increase HBOC risk [54–76]. We apply these
IT-based methods to analyze variants in the complete sequences of
coding, non-coding, and up- and downstream regions of the 7 genes. In
this study, we established and applied a unified IT-based framework,
first filtering out common variants, then flagging potentially
deleterious ones. Then, using context-specific criteria and
information from the published literature, we prioritized likely
candidates.
Methods
Design of tiled capture array for HBOC gene panel
Nucleic acid hybridization capture reagents designed from
genomic sequences generally avoid repetitive sequence
content to avoid cross hybridization [77]. Complete gene
sequences harbor numerous repetitive sequences, and an
excess of denatured C₀t-1 DNA is usually added to
hybridization to prevent inclusion of these sequences [78].
RepeatMasker software completely masks all repetitive
and low-complexity sequences [79]. We increased se-
quence coverage in complete genes with capture probes
by enriching for both single-copy and divergent repeat
(>30 % divergence) regions, such that, under the correct
hybridization and wash conditions, all probes hybridize
only to their correct genomic locations [77]. This step was
incorporated into a modified version of Gnirke and colleagues'
(2009) in-solution hybridization enrichment protocol, in which the
majority of library preparation, pull-down, and wash steps were
automated using a Biomek FXP Automation Workstation (Beckman
Coulter, Mississauga, Canada) [80].
Genes ATM (RefSeq: NM_000051.3, NP_000042.3),
BRCA1 (RefSeq: NM_007294.3, NP_009225.1), BRCA2
(RefSeq: NM_000059.3, NP_000050.2), CDH1 (RefSeq:
NM_004360.3, NP_004351.1), CHEK2 (RefSeq: NM_145862.2,
NP_665861.1), PALB2 (RefSeq: NM_024675.3, NP_078951.2),
and TP53 (RefSeq: NM_000546.5, NP_000537.3) were
selected for capture probe design by targeting
single copy or highly divergent repeat regions
(spanning 10 kb up- and downstream of each gene rela-
tive to the most upstream first exon and most down-
stream final exon in RefSeq) using an ab initio approach
[77]. If a region was excluded by ab initio but lacked a
conserved repeat element (i.e. divergence > 30 %) [79],
the region was added back into the probe-design se-
quence file. Probe sequences were selected using PICKY
2.2 software [81]. These probes were used in solution
hybridization to capture our target sequences, followed
by NGS on an Illumina Genome Analyzer IIx (Add-
itional file 1: Methods).
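The add-back rule for divergent repeats could be sketched as follows. The interval tuples, divergence values, and function name are hypothetical; the only value taken from the text is the 30 % divergence cut-off, and treating a repeat as conserved at exactly 30 % is an assumption.

```python
DIVERGENCE_CUTOFF = 30.0  # percent divergence, from the text

def regions_to_add_back(excluded_regions, repeats):
    """Return excluded regions that contain no conserved repeat element.

    excluded_regions: list of (start, end) half-open intervals
    repeats: list of (start, end, percent_divergence) annotations
    A repeat counts as conserved when its divergence is <= 30 %.
    """
    kept = []
    for r_start, r_end in excluded_regions:
        has_conserved = any(
            rep_start < r_end and rep_end > r_start   # intervals overlap
            and divergence <= DIVERGENCE_CUTOFF       # repeat is conserved
            for rep_start, rep_end, divergence in repeats
        )
        if not has_conserved:
            kept.append((r_start, r_end))
    return kept

# Toy intervals (hypothetical coordinates, not from the study).
excluded = [(100, 400), (900, 1200), (2000, 2300)]
repeat_annotations = [
    (150, 350, 12.5),    # conserved repeat: region stays excluded
    (950, 1100, 34.2),   # divergent repeat: region is added back
]
added_back = regions_to_add_back(excluded, repeat_annotations)
print(added_back)
```

In this toy input the third region has no overlapping repeat annotation at all, so it is also returned to the probe-design set.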
Genomic sequences from both strands were captured
using overlapping oligonucleotide sequence designs cover-
ing 342,075 nt among the 7 genes (Fig. 1). In total, 11,841
oligonucleotides were synthesized from the transcribed
strand consisting of the complete, single copy coding, and
flanking regions of ATM (3513), BRCA1 (1587), BRCA2
(2386), CDH1 (1867), CHEK2 (889), PALB2 (811), and
TP53 (788). Additionally, 11,828 antisense strand oligos
were synthesized (3497 ATM, 1591 BRCA1, 2395 BRCA2,
1860 CDH1, 883 CHEK2, 826 PALB2, and 776 TP53). Any
intronic or intergenic regions without probe coverage are
most likely due to the presence of conserved repetitive el-
ements or other paralogous sequences.
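As a quick arithmetic check, the per-gene oligonucleotide counts reported above can be tallied against the stated totals. This is only a consistency sketch; the dictionary names are illustrative.

```python
# Per-gene probe counts as reported in the text.
sense = {"ATM": 3513, "BRCA1": 1587, "BRCA2": 2386, "CDH1": 1867,
         "CHEK2": 889, "PALB2": 811, "TP53": 788}
antisense = {"ATM": 3497, "BRCA1": 1591, "BRCA2": 2395, "CDH1": 1860,
             "CHEK2": 883, "PALB2": 826, "TP53": 776}

sense_total = sum(sense.values())
antisense_total = sum(antisense.values())
print(sense_total, antisense_total, sense_total + antisense_total)
```

The two sums reproduce the reported totals of 11,841 and 11,828 oligonucleotides.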
For regions lacking probe coverage (of ≥ 10 nt, N = 141;
8 in ATM, 26 in BRCA1, 10 in BRCA2, 29 in CDH1, 36 in
CHEK2, 15 in PALB2, and 17 in TP53), probes were
selected based on predicted $T_m$ values similar to other probes,
limited alignment to other sequences in the transcriptome
(<10 times), and avoidance of stable, base-paired
structures (with UNAFold) [82, 83]. The average coverage of
these sequenced regions was 14.1–24.9 % lower than other
probe sets, indicating that capture was less efficient,
though still successful.
HBOC samples for oligo capture and high-throughput
sequencing
Genomic DNA from 102 patients previously tested for
inherited breast/ovarian cancer without evidence of a pre-
disposing genetic mutation, was obtained from the Molecu-
lar Genetics Laboratory (MGL) at the London Health
Sciences Centre in London, Ontario, Canada. Patients
qualified for genetic susceptibility testing as determined by
the Ontario Ministry of Health and Long-Term Care
BRCA1 and BRCA2 genetic testing criteria [84] (see Add-
itional file 2). The University of Western Ontario research
ethics board (REB) approved this anonymized study of
these individuals to evaluate the analytical methods pre-
sented here. BRCA1 and BRCA2 were previously analyzed
by Protein Truncation Test (PTT) and Multiplex Ligation-
dependent Probe Amplification (MLPA). The exons of several
patients (N = 14) had also been Sanger sequenced. No
pathogenic sequence change was found in any of these
individuals. In addition, one patient with a known pathogenic
BRCA variant was re-sequenced by NGS as a positive
control.
Fig. 1 Capture Probe Coverage over Sequenced Genes. The genomic structure of the 7 genes chosen is displayed with the UCSC Genome
Browser. The top row for each gene is a custom track with the dense visualization modality selected, with black regions indicating the intervals
covered by the oligonucleotide capture reagent. Regions without probe coverage contain conserved repetitive sequences or correspond to
paralogous sequences that are unsuitable for probe design
Sequence alignment and variant calling
Variant analysis involved the steps of detection, filtering,
IT-based and coding sequence analysis, and prioritization
(Fig. 2). Sequencing data were demultiplexed and aligned to
the specific chromosomes of our sequenced genes (hg19)
using both CASAVA (Consensus Assessment of Sequencing
and Variation; v1.8.2) [85] and CRAC (Complex Reads
Analysis and Classification; v1.3.0) [86] software.
Alignments were prepared for variant calling using Picard [87]
and variant calling was performed on both versions of the
aligned sequences using the UnifiedGenotyper tool in the
Genome Analysis Toolkit (GATK) [88]. We used the
recommended minimum phred base quality score of 30, and
results were exported in variant call format (VCF; v4.1). A
software program was developed to exclude variants called
outside of targeted capture regions and those with quality
scores < 50. Variants flagged by bioinformatic analysis
(described below) were also assessed by manually inspecting
the reads in the region using the Integrative Genomics
Viewer (IGV; version 2.3) [89, 90] to note and eliminate
obvious false positives (i.e. variants called due to
polyhomonucleotide run dephasing, or PCR duplicates that were not
eliminated by Picard). Finally, common variants (≥ 1 % allele
frequency based on dbSNP 142 or > 5 individuals in our
study cohort) were not prioritized.
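The filtering steps in this section can be sketched in Python as follows. This is not the authors' software: the class and function names, the toy coordinates, and the toy variants are all hypothetical, and the 1 % cut-off is assumed to mean an allele frequency at or above 1 %.

```python
from dataclasses import dataclass
from typing import Optional

MIN_QUAL = 50.0          # variants with quality scores < 50 were excluded
COMMON_FREQ = 0.01       # assumed: "common" means allele frequency >= 1 %
MAX_COHORT_COUNT = 5     # variants seen in > 5 cohort individuals are common

@dataclass
class Variant:
    chrom: str
    pos: int                      # 1-based position
    qual: float                   # variant quality score from the VCF
    allele_freq: Optional[float]  # population frequency; None if unreported
    cohort_count: int             # individuals in the cohort carrying it

def in_targets(variant, targets):
    """True if the variant lies inside any (start, end) inclusive interval."""
    return any(start <= variant.pos <= end
               for start, end in targets.get(variant.chrom, []))

def is_candidate(variant, targets):
    """Apply the capture-region, quality, and frequency filters in order."""
    if not in_targets(variant, targets):
        return False
    if variant.qual < MIN_QUAL:
        return False
    if variant.allele_freq is not None and variant.allele_freq >= COMMON_FREQ:
        return False
    return variant.cohort_count <= MAX_COHORT_COUNT

# Toy coordinates and variants (hypothetical, not from the study).
targets = {"chr13": [(1000, 5000)]}
variants = [
    Variant("chr13", 1200, 88.0, None, 1),    # novel, rare in cohort: kept
    Variant("chr13", 1300, 42.0, None, 1),    # low quality: dropped
    Variant("chr13", 2500, 95.0, 0.12, 2),    # common in population: dropped
    Variant("chr13", 3000, 70.0, 0.002, 9),   # common in cohort: dropped
    Variant("chr13", 9000, 99.0, None, 1),    # outside capture region: dropped
]
kept = [v for v in variants if is_candidate(v, targets)]
print([v.pos for v in kept])
```

Manual inspection of reads in IGV is not modeled here, since it is a visual review step rather than a numeric threshold.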
IT-based variant analysis
All variants were analyzed using the Shannon Human
Splicing Mutation Pipeline, a genome-scale variant
analysis program that predicts the effects of variants on
mRNA splicing [91, 92]. Variants were flagged based on
criteria reported in Shirley et al. (2013): weakened natural
site ≥ 1.0 bits, or strengthened cryptic site (within
300 nt of the nearest exon) where cryptic site strength is
equivalent or greater than the nearest natural site of the
same phase [91]. The effects of flagged variants were fur-
ther analyzed in detail using the Automated Splice Site
and Exon Definition Analysis (ASSEDA) server [38].
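The flagging criteria described above can be expressed as a short Python sketch. It is not the Shannon pipeline itself: the function name, argument names, and example values are hypothetical, and the natural-site threshold is assumed to mean a loss of at least 1.0 bits.

```python
NATURAL_LOSS_BITS = 1.0   # natural site weakened by at least 1.0 bits (assumed)
CRYPTIC_WINDOW_NT = 300   # cryptic site must lie within 300 nt of the nearest exon

def flag_splice_variant(site_type, ri_initial, ri_final,
                        distance_to_exon=0, natural_ri=None):
    """Return True when a variant meets the flagging criteria in the text.

    site_type: "natural" or "cryptic"
    ri_initial, ri_final: site strength (bits) before and after the variant
    distance_to_exon: nt between a cryptic site and the nearest exon
    natural_ri: strength (bits) of the nearest natural site of the same phase
    """
    if site_type == "natural":
        return (ri_initial - ri_final) >= NATURAL_LOSS_BITS
    if site_type == "cryptic":
        return (ri_final > ri_initial
                and distance_to_exon <= CRYPTIC_WINDOW_NT
                and natural_ri is not None
                and ri_final >= natural_ri)
    raise ValueError(f"unknown site type: {site_type}")

# Invented example values, for illustration only.
print(flag_splice_variant("natural", 8.4, 6.9))                # loss of 1.5 bits
print(flag_splice_variant("natural", 8.4, 7.9))                # loss of 0.5 bits
print(flag_splice_variant("cryptic", 3.0, 9.1,
                          distance_to_exon=120, natural_ri=8.4))
print(flag_splice_variant("cryptic", 3.0, 9.1,
                          distance_to_exon=450, natural_ri=8.4))
```

Variants that pass this kind of first-pass flag would still need the detailed isoform-level analysis that the text assigns to ASSEDA.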
Exonic variants and those found within 500 nt of an
exon were assessed for their effects, if any, on SRFBSs
[38]. Sequence logos for splicing regulatory factors (SRFs)
(SRSF1, SRSF2, SRSF5, SRSF6, hnRNPH, hnRNPA1,
ELAVL1, TIA1, and PTB) and their $R_{sequence}$ values (the
mean information content [93]) are provided in Caminsky
et al. (2015) [36]. Because these motifs occur frequently in
Fig. 2 Framework for the Identification of Potentially Pathogenic Variants. Integrated laboratory processing and bioinformatic analysis procedures
for comprehensive complete gene variant determination and analysis. Intermediate datasets resulting from filtering are represented in yellow and
final datasets in green. Non-bioinformatic steps, such as sample preparation, are represented in blue and prediction programs in purple. Sequencing
analysis yields base calls for all samples. CASAVA [85] and CRAC [86] were used to align these sequencing results to hg19. GATK [88] was used to call
variants from these data against the GRCh37 release of the reference human genome. Variants with a quality score < 50 and/or call confidence score < 30
were eliminated along with variants falling outside of our target regions. SNPnexus [112–114] was used to identify the genomic location of the variants.
Nonsense variants and indels were noted and prediction tools were used to assess the potential pathogenicity of missense variants. The Shannon Pipeline [91]
evaluated the effect of a variant on natural and cryptic SSs, as well as SRFBSs. ASSEDA [38] was used to predict the potential isoforms as a result of
these variants. PWMs for 83 TFs were built using an information weight matrix generator based on Bipad [106]. Mutation Analyzer evaluated the effect
of variants found 10 kb upstream up to the first intron on protein binding. Bit thresholds ($R_i$ values) for filtering variants on software program outputs
are indicated. Variants falling within the UTR sequences were assessed using SNPfold [20], and the most probable variants that alter mRNA
structure (p < 0.1) were then processed using mFold to predict the effect on stability [83]. All UTR variants were scanned with a modified
version of the Shannon Pipeline, which uses PWMs computed from nucleotide frequencies for 28 RBPs in RBPDB [109] and 76 RBPs in
CISBP-RNA [110]. All variants meeting these filtering criteria were verified with IGV [89, 90]. *Sanger sequencing was only performed for
protein truncating, splicing, and selected missense variants
