A novel phylogenetic analysis combined with a machine learning approach predicts human mitochondrial variant pathogenicity
Summary (2 min read)
INTRODUCTION
- Because of the critical roles that mitochondria play in metabolism and bioenergetics, mutation of mitochondria-localized proteins and ribonucleic acids can adversely affect human health (Alston et al, 2017; Suomalainen & Battersby, 2018; Khan et al, 2020; Russell et al, 2020).
- Simple tabulation of mtDNA variants found among healthy or sick individuals (Whiffin et al, 2017) may be of limited utility in predicting how harmful a variant may be.
- First, while knowledge of amino acid physico-chemical properties is widely considered to be informative regarding whether an amino acid substitution may or may not have a damaging effect on protein function (Dayhoff et al, 1978), the site-specific acceptability of a given substitution is ultimately decided within the context of its local protein environment (Zuckerkandl & Pauling, 1965).
RESULTS
Mapping apparent substitutions to a phylogenetic tree allows calculation of relative positional conservation in mtDNA-encoded proteins and RNAs
- Using the sequences of extant species and the predicted ancestral node values, the authors subsequently analyzed each edge of the tree for the presence or absence of substitutions at each aligned human position.
- When calculated for protein and RNA sites encoded by mammalian mtDNA, it is clear that the TSS (and the ISS, not shown) provides an excellent readout of relative conservation at, and consequent functional importance of, each alignment position.
Substitution scores and inferred direct substitutions can be linked to human mtDNA variant pathogenicity
- Since summation of detected substitutions across a phylogenetic tree provides a robust measure of relative conservation at different macromolecular positions, the authors were confident that a phylogenetic analysis that includes TSSs would also provide information about the pathogenicity of human mtDNA variants.
- Even so, the distribution of variant frequencies among full-length sequences in GenBank was strikingly different for those mutations for which an IIDS could be identified in their mammalian trees of proteins, and even tRNAs, when compared to those for which an IIDS could not be identified.
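The edge-by-edge counting described above can be illustrated with a minimal sketch. It assumes a simplified score that counts, for each alignment position, the tree edges on which parent and child differ, plus a lookup of whether a specific change was ever observed along an edge. The function names, the toy edges, and the gap handling are hypothetical; the authors' exact TSS, ISS, and IIDS definitions may differ.

```python
from collections import Counter


def substitution_scores(edges):
    """Count, per alignment position, the edges on which parent and child differ."""
    length = len(edges[0][0])
    scores = [0] * length
    for parent, child in edges:
        for pos in range(length):
            a, b = parent[pos], child[pos]
            # Assumption: gaps are skipped so indels are not counted as substitutions.
            if a != b and a != "-" and b != "-":
                scores[pos] += 1
    return scores


def observed_changes(edges):
    """Collect every (position, from_char, to_char) change seen along an edge."""
    changes = Counter()
    for parent, child in edges:
        for pos, (a, b) in enumerate(zip(parent, child)):
            if a != b and a != "-" and b != "-":
                changes[(pos, a, b)] += 1
    return changes


# Hypothetical toy tree: each edge is (parent_sequence, child_sequence),
# with ancestral sequences assumed to be already reconstructed and aligned.
toy_edges = [
    ("MTLA", "MTLA"),
    ("MTLA", "MSLA"),
    ("MSLA", "MSLV"),
    ("MTLA", "MTIA"),
    ("MTIA", "MTIV"),
]

scores = substitution_scores(toy_edges)
changes = observed_changes(toy_edges)

print(scores)                        # low counts suggest conserved positions
print((3, "A", "V") in changes)      # was A -> V at position 3 seen on any edge?
```

In this toy setting, a position with a score of zero never changed anywhere in the tree, and a candidate variant can be checked against `changes` to see whether the same substitution was inferred along at least one edge.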
A support vector machine predicts harmful mtDNA variants
- Given the clear presence of deleterious substitutions among so far uncharacterized variants, the authors sought a high-throughput method that could, with confidence, identify these potentially deleterious substitutions.
- MitoCAP also scored best against their training set when considering most auxiliary measures of prediction proficiency.
- To further investigate this possibility, the authors first plotted the level of agreement between MitoCAP and other methods when assessing all classified variants, and they noted a pronounced lack of overlap between their MitoCAP predictions and the predictions of other methods.
- When heteroplasmy data for unannotated variants in HelixMTdb are analyzed for other prediction methods, as performed above for MitoCAP, MitoCAP best separated variants into classes with different heteroplasmy propensities and achieved the highest Kolmogorov-Smirnov D score.
- Taken together, their analyses indicate that MitoCAP appears to be the most proficient among the compared methods in predicting pathogenicity of variants in mtDNA-encoded proteins, while alternative methods may outperform MitoCAP during classification of tRNA variants.
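The Kolmogorov-Smirnov D score mentioned above is the largest vertical gap between two empirical cumulative distribution functions. The following is a minimal sketch of the standard two-sample statistic. The input values are invented for illustration and are not HelixMTdb data, and the way the authors grouped variants before computing D is not reproduced here.

```python
def ks_d_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    i = j = 0
    d = 0.0
    # Walk through the pooled values in increasing order, advancing both
    # empirical CDFs past each value before comparing them.
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    # Once one sample is exhausted the gap can only shrink, so d is final.
    return d


# Hypothetical per-variant heteroplasmy measures for two predicted classes.
predicted_benign = [1, 2, 3, 4]
predicted_pathogenic = [3, 4, 5, 6]

d_score = ks_d_statistic(predicted_benign, predicted_pathogenic)
print(d_score)
```

A D value near 1 means the two predicted classes have almost non-overlapping distributions, while a value near 0 means the prediction does not separate them.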
DISCUSSION
- The authors describe here a methodology that allows improved quantification of the relative conservation of sites within and between genes, RNAs, and proteins.
- Even nearly identical sequences can be utilized by their approach, allowing for an ever-increasing input dataset that can be deployed toward calculation of site-specific conservation.
- The authors note that focusing upon IIDSs, rather than the simple presence or absence of a character at a site, can indirectly integrate information about potential epistatic interactions that permit or block a substitution from being successfully established within a lineage.
- The MitoCAP predictions that the authors provide allow for improved comprehension of which mtDNA variants identified within a patient may be linked to mitochondrial disease.
- Concordantly, their data suggest a strong propensity for heteroplasmy in the set of substitutions that the authors predict to be pathogenic, but that are not yet clinically annotated as disease-associated.
METHODOLOGY
Mitochondrial DNA sequence acquisition and conservation analysis
- Mammalian mtDNA sequences were retrieved from the National Center for Biotechnology Information database of organelle genomes (https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles/ on September 26, 2019).
- The PAGAN output was then analyzed using "binary-table-by-edges-v2.2" and "addconvention-to-binarytable-v1.1.py" (https://github.com/corydunnlab/hummingbird).
- For proteins, the negative training sets consisted of 50 mtDNA substitutions (encoding 51 protein variants) from the reference sequence.
- Predictions for the ROC curve were collected using the ‘mining’ function of the rminer package (Cortez, 2015), with the optimized parameters, during 10 runs of 5-fold cross-validation [model="ksvm", task = "prob", method = c("kfold", 5), Runs = 10].
Comparison of selected, alternative prediction methods with MitoCAP
- Pathogenicity predictions for their training and test set variants were compared to predictions made by PolyPhen-2 (Adzhubei et al, 2013), PROVEAN (Choi et al, 2012), Panther-PSEP (Tang & Thomas, 2016b), Mitoclass (Martín-Navarro et al, 2017) and MitImpact (Castellana et al, 2015).
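The evaluation scheme named in the methodology, 10 runs of 5-fold cross-validation feeding a ROC analysis, can be sketched independently of the classifier. The Python below is not the rminer or kernlab implementation and does not train a support vector machine. It only shows how repeated k-fold splits are formed and how an area under the ROC curve can be computed from held-out scores. The function names, the seed, and the example labels and scores are hypothetical.

```python
import random


def repeated_kfold(n_items, k=5, runs=10, seed=0):
    """Yield (run, fold, train_indices, test_indices) for repeated k-fold cross-validation."""
    rng = random.Random(seed)
    for run in range(runs):
        order = list(range(n_items))
        rng.shuffle(order)  # a fresh random partition for every run
        folds = [order[f::k] for f in range(k)]
        for f, test in enumerate(folds):
            train = [idx for g, fold in enumerate(folds) if g != f for idx in fold]
            yield run, f, train, test


def roc_auc(labels, scores):
    """Area under the ROC curve as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 10 runs x 5 folds gives 50 train/test splits over a hypothetical set of 23 variants.
splits = list(repeated_kfold(n_items=23, k=5, runs=10))
print(len(splits))

# Hypothetical held-out probabilities: 1 = pathogenic, 0 = benign.
example_auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(example_auc)
```

In a full pipeline, a model would be fitted on each `train` index set and its predicted probabilities for the matching `test` set would be pooled before computing the ROC curve.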
AUTHOR CONTRIBUTIONS
- B.A.A. developed software, analyzed data, and edited the manuscript.
- P.O.C. and V.O.P. analyzed data and edited the manuscript.
- C.D.D. conceived of the classification approach, supervised the project, analyzed data, prepared figures, and wrote the manuscript.