A novel phylogenetic analysis combined with a machine learning approach predicts human mitochondrial variant pathogenicity
Summary
INTRODUCTION
- Because of the critical roles that mitochondria play in metabolism and bioenergetics, mutation of mitochondria-localized proteins and ribonucleic acids can adversely affect human health (Alston et al, 2017; Suomalainen & Battersby, 2018; Khan et al, 2020; Russell et al, 2020).
- Simple tabulation of mtDNA variants found among healthy or sick individuals (Whiffin et al, 2017) may be of limited utility in predicting how harmful a variant may be.
- First, while knowledge of amino acid physico-chemical properties is widely considered to be informative regarding whether an amino acid substitution may or may not have a damaging effect on protein function (Dayhoff et al, 1978), the site-specific acceptability of a given substitution is ultimately decided within the context of its local protein environment (Zuckerkandl & Pauling, 1965).
RESULTS
Mapping apparent substitutions to a phylogenetic tree allows calculation of relative positional conservation in mtDNA-encoded proteins and RNAs
- Using the sequences of extant species and the predicted ancestral node values, the authors subsequently analyzed each edge of the tree for the presence or absence of substitutions at each aligned human position.
- When calculated for protein and RNA sites encoded by mammalian mtDNA, it is clear that the TSS (and the ISS, not shown) provides an excellent readout of relative conservation at, and consequent functional importance of, each alignment position.
Substitution scores and inferred direct substitutions can be linked to human mtDNA variant pathogenicity
- Since summation of detected substitutions across a phylogenetic tree provides a robust measure of relative conservation at different macromolecular positions, the authors were confident that a phylogenetic analysis that includes TSSs would also provide information about the pathogenicity of human mtDNA variants.
- Even so, the distribution of variant frequencies among full-length sequences in GenBank was strikingly different for those mutations for which an IIDS could be identified in their mammalian trees of proteins, and even tRNAs, when compared to those for which an IIDS could not be identified.
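The edge-by-edge counting described above can be illustrated with a minimal Python sketch. It is not the authors' code: the toy tree, the sequences, and the function names are hypothetical, the per-position score is assumed to be a plain sum of edges on which parent and child characters differ, and a "direct substitution" is assumed to mean a single edge whose parent carries the reference character and whose child carries the variant character.

```python
# Toy rooted tree as (parent, child) edges. Internal nodes hold predicted
# ancestral sequences, leaves hold extant sequences. All values are made up.
EDGES = [
    ("root", "n1"), ("root", "n2"),
    ("n1", "human"), ("n1", "mouse"),
    ("n2", "cow"), ("n2", "whale"),
]
SEQS = {
    "root": "MAL", "n1": "MAL", "n2": "MVL",
    "human": "MAL", "mouse": "MAI", "cow": "MVL", "whale": "MTL",
}


def substitution_scores(edges, seqs, length):
    """Count, for each alignment position, the edges showing a substitution."""
    scores = [0] * length
    for parent, child in edges:
        for pos in range(length):
            a, b = seqs[parent][pos], seqs[child][pos]
            # Assumption: gap characters are not counted as substitutions.
            if a != b and "-" not in (a, b):
                scores[pos] += 1
    return scores


def has_direct_substitution(edges, seqs, pos, ref_char, alt_char):
    """True if any single edge goes from ref_char to alt_char at pos."""
    return any(
        seqs[parent][pos] == ref_char and seqs[child][pos] == alt_char
        for parent, child in edges
    )


scores = substitution_scores(EDGES, SEQS, 3)
a_to_v = has_direct_substitution(EDGES, SEQS, 1, "A", "V")
a_to_t = has_direct_substitution(EDGES, SEQS, 1, "A", "T")
```

In this toy tree the first position never changes, so it scores as most conserved. At the second position, T is present in one species but is reached from V rather than directly from A, so a check based only on which characters appear at a site would treat A-to-T differently than this edge-based check does.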
A support vector machine predicts harmful mtDNA variants
- Given the clear presence of deleterious substitutions among so far uncharacterized variants, the authors sought a high-throughput method that could, with confidence, identify these potentially deleterious substitutions.
- MitoCAP also scored best against their training set when considering most auxiliary measures of prediction proficiency.
- To further investigate this possibility, the authors first plotted the level of agreement between MitoCAP and other methods when assessing all classified variants, and they noted a pronounced lack of overlap between their MitoCAP predictions and the predictions of other methods.
- When heteroplasmy data for unannotated variants in HelixMTdb were analyzed for other prediction methods, as performed above for MitoCAP, MitoCAP best separated variants into classes with different heteroplasmy propensities and achieved the highest Kolmogorov-Smirnov D score.
- Taken together, their analyses indicate that MitoCAP appears to be the most proficient among the compared methods in predicting pathogenicity of variants in mtDNA-encoded proteins, while alternative methods may outperform MitoCAP during classification of tRNA variants.
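The Kolmogorov-Smirnov D score mentioned above is the largest vertical gap between the empirical cumulative distributions of two samples. The sketch below computes it in plain Python. The function name and the heteroplasmy values are hypothetical and serve only to show how a larger D corresponds to better separation of two predicted classes; how the authors derived per-variant heteroplasmy values is not shown here.

```python
from bisect import bisect_right


def ks_d(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The empirical CDFs only change at observed values, so checking
    # those points is sufficient to find the maximum gap.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


# Hypothetical per-variant heteroplasmy fractions for two predicted classes.
PREDICTED_PATHOGENIC = [0.4, 0.6, 0.8, 0.9]
PREDICTED_BENIGN = [0.0, 0.0, 0.1, 0.5]

d_score = ks_d(PREDICTED_PATHOGENIC, PREDICTED_BENIGN)
```

A D of 0 means the two classes have identical distributions, and a D of 1 means they do not overlap at all.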
DISCUSSION
- The authors describe here a methodology that allows improved quantification of the relative conservation of sites within and between genes, RNAs, and proteins.
- Even nearly identical sequences can be utilized by their approach, allowing for an ever-increasing input dataset that can be deployed toward calculation of site-specific conservation.
- The authors note that focusing upon IIDSs, rather than the simple presence or absence of a character at a site, can indirectly integrate information about potential epistatic interactions that permit or block a substitution from being successfully established within a lineage.
- The MitoCAP predictions that the authors provide allow for improved comprehension of which mtDNA variants identified within a patient may be linked to mitochondrial disease.
- Concordantly, their data suggest a strong propensity for heteroplasmy in the set of substitutions that the authors predict to be pathogenic but that are not yet clinically annotated as disease-associated.
METHODOLOGY
Mitochondrial DNA sequence acquisition and conservation analysis
- Mammalian mtDNA sequences were retrieved from the National Center for Biotechnology Information database of organelle genomes (https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles/ on September 26, 2019).
- The PAGAN output was then analyzed using "binary-table-by-edges-v2.2" and "addconvention-to-binarytable-v1.1.py" (https://github.com/corydunnlab/hummingbird).
- For proteins, the negative training sets consisted of 50 mtDNA substitutions (encoding 51 protein variants) from the reference sequence.
- Predictions for the ROC curve were collected using the 'mining' function of the rminer package (Cortez, 2015), with the optimized parameters, during 10 runs of 5-fold cross-validation [model="ksvm", task = "prob", method = c("kfold", 5), Runs = 10].
Comparison of selected, alternative prediction methods with MitoCAP
- Pathogenicity predictions for their training and test set variants were compared to predictions made by PolyPhen-2 (Adzhubei et al, 2013), PROVEAN (Choi et al, 2012), Panther-PSEP (Tang & Thomas, 2016b), Mitoclass (Martín-Navarro et al, 2017) and MitImpact (Castellana et al, 2015).
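The repeated cross-validation used to collect ROC predictions (10 runs of 5-fold) can be sketched in plain Python to show the mechanics. This is not the rminer workflow and does not train a support vector machine: the stand-in model is a simple midpoint threshold on one feature, and the data, function names, and seed are all hypothetical. The point is only how held-out scores are pooled across folds and runs before an area under the ROC curve is computed.

```python
import random


def repeated_kfold(n_items, n_folds=5, n_runs=10, seed=0):
    """Yield (run, train_idx, test_idx) for n_runs reshuffled k-fold splits."""
    rng = random.Random(seed)
    for run in range(n_runs):
        order = list(range(n_items))
        rng.shuffle(order)
        for k in range(n_folds):
            test_idx = order[k::n_folds]  # every n_folds-th item is held out
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield run, train_idx, test_idx


def fit_midpoint(features, labels):
    """Stand-in model (not an SVM): midpoint between the two class means."""
    pos = [x for x, y in zip(features, labels) if y == 1]
    neg = [x for x, y in zip(features, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def roc_auc(labels, scores):
    """AUC as the chance a positive outscores a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(features, labels, n_folds=5, n_runs=10, seed=0):
    """Pool held-out scores over all folds and runs, then compute one AUC."""
    pooled_labels, pooled_scores = [], []
    for _run, train_idx, test_idx in repeated_kfold(
        len(labels), n_folds, n_runs, seed
    ):
        midpoint = fit_midpoint(
            [features[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        for i in test_idx:
            pooled_labels.append(labels[i])
            pooled_scores.append(features[i] - midpoint)
    return roc_auc(pooled_labels, pooled_scores)


# Hypothetical single-feature data: 1 = pathogenic, 0 = benign.
FEATURES = [8, 9, 10, 11, 12] * 2 + [0, 1, 2, 1, 0] * 2
LABELS = [1] * 10 + [0] * 10

auc = cross_validated_auc(FEATURES, LABELS)
```

The splits here are not stratified, so a very small or unbalanced dataset could leave a training fold without one of the classes. With real data, a stratified splitter and a probabilistic classifier would take the place of the stand-in pieces.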
AUTHOR CONTRIBUTIONS
- B.A.A. developed software, analyzed data, and edited the manuscript.
- P.O.C. and V.O.P. analyzed data and edited the manuscript.
- C.D.D. conceived of the classification approach, supervised the project, analyzed data, prepared figures, and wrote the manuscript.