
A novel phylogenetic analysis combined with a machine learning approach predicts human mitochondrial variant pathogenicity

Cory D. Dunn, Ani Akpinar, Paul O. Carlson

11 Jan 2020 - bioRxiv (Cold Spring Harbor Laboratory)

TL;DR: The authors describe a novel, empirical approach for assessing site-specific conservation and variant acceptability that depends upon phylogenetic analysis and ancestral prediction and minimizes current alignment limitations. They find that a substantial portion of encountered mtDNA alleles not yet characterized as harmful are, in fact, likely to be deleterious.


Abstract: Linking mitochondrial DNA (mtDNA) mutations to patient outcomes has been a serious challenge. The multicopy nature and potential heteroplasmy of the mitochondrial genome, differential distribution of mutant mtDNAs among various tissues, genetic interactions among alleles, and environmental effects can hamper clinicians as they try to inform patients regarding the etiology of their metabolic disease. Multiple sequence alignments using samples ranging across multiple organisms and taxa are often deployed to assess the overall conservation of any site within a mtDNA-encoded macromolecule and to determine the acceptability of any given variant at a particular position. However, the utility of multiple sequence alignments in pathogenicity prediction can be restricted by factors including sample set bias, alignment errors, and sequencing errors. Here, we describe a novel and empirical approach for assessing site-specific conservation and variant acceptability that depends upon phylogenetic analysis and ancestral prediction and minimizes current alignment limitations. Next, we use machine learning to predict the pathogenicity of thousands of so-far-uncharacterized human alleles catalogued in the clinic. Our work demonstrates that a substantial portion of encountered mtDNA alleles not yet characterized as harmful are, in fact, likely to be deleterious. Beyond general applications of our methodology that lie outside of mitochondrial studies, our findings are likely to be of direct relevance to those at risk of mitochondria-associated illness.


Summary (2 min read)


INTRODUCTION

Because of the critical roles that mitochondria play in metabolism and bioenergetics, mutation of mitochondria-localized proteins and ribonucleic acids can adversely affect human health (Alston et al, 2017; Suomalainen & Battersby, 2018; Khan et al, 2020; Russell et al, 2020).
Simple tabulation of mtDNA variants found among healthy or sick individuals (Whiffin et al, 2017) may be of limited utility in predicting how harmful a variant may be.
First, while knowledge of amino acid physico-chemical properties is widely considered to be informative regarding whether an amino acid substitution may or may not have a damaging effect on protein function (Dayhoff et al, 1978), the site-specific acceptability of a given substitution is ultimately decided within the context of its local protein environment (Zuckerkandl & Pauling, 1965).

RESULTS

Mapping apparent substitutions to a phylogenetic tree allows calculation of relative positional conservation in mtDNA-encoded proteins and RNAs

Using the sequences of extant species and the predicted ancestral node values, the authors analyzed each edge of the tree for the presence or absence of substitutions at each aligned human position.
Summing the substitutions found at a position across all tree edges gives the total substitution score (TSS); summing across internal edges only gives the internal substitution score (ISS).
When calculated for protein and RNA sites encoded by mammalian mtDNA, the TSS (and the ISS, not shown) provides an excellent readout of relative conservation at, and consequent functional importance of, each alignment position.

Substitution scores and inferred direct substitutions can be linked to human mtDNA variant pathogenicity

Since summation of detected substitutions across a phylogenetic tree provides a robust measure of relative conservation at different macromolecular positions, the authors were confident that a phylogenetic analysis that includes TSSs would also provide information about the pathogenicity of human mtDNA variants.
Even so, the distribution of variant frequencies among full-length sequences in GenBank was strikingly different for those mutations for which an IIDS could be identified in the authors' mammalian trees of proteins, and even tRNAs, when compared to those for which an IIDS could not be identified.
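
The check for a matching direct substitution can be sketched in Python. This is a minimal illustration and not the authors' software: it assumes that an IIDS corresponds to a substitution between the human reference character and the variant character (in either direction) inferred along an edge of the tree. The edge list, the node names, the 0-based alignment position, and the function name `has_direct_substitution` are all hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (parent node, child node)


def has_direct_substitution(
    edges: List[Edge],
    states: Dict[str, str],
    position: int,
    ref_char: str,
    variant_char: str,
) -> bool:
    """Return True if any edge carries ref<->variant at the given position.

    Both directions are accepted, which assumes time-reversible substitutions.
    """
    if ref_char == variant_char:
        raise ValueError("reference and variant characters must differ")
    wanted = {(ref_char, variant_char), (variant_char, ref_char)}
    for parent, child in edges:
        pair = (states[parent][position], states[child][position])
        if pair in wanted:
            return True
    return False


# Hypothetical toy tree: "root" and "n1" are reconstructed ancestral nodes,
# "A", "B" and "C" are extant sequences. All sequences are made up.
toy_edges = [("root", "n1"), ("root", "C"), ("n1", "A"), ("n1", "B")]
toy_states = {"root": "MKL", "n1": "MRL", "A": "MRL", "B": "MRV", "C": "MKL"}

seen = has_direct_substitution(toy_edges, toy_states, 1, "K", "R")        # True
never_seen = has_direct_substitution(toy_edges, toy_states, 1, "K", "E")  # False
```

In this toy tree, K to R at position 1 is inferred along the edge from `root` to `n1`, whereas K to E is never observed, so only the first variant would be flagged as having a matching direct substitution.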

A support vector machine predicts harmful mtDNA variants

Given the clear presence of deleterious substitutions among so far uncharacterized variants, the authors sought a high-throughput method that could, with confidence, identify these potentially deleterious substitutions.
MitoCAP also scored best against their training set when considering most auxiliary measures of prediction proficiency.
To further investigate this possibility, the authors first plotted the level of agreement between MitoCAP and other methods when assessing all classified variants, and they noted a pronounced lack of overlap between their MitoCAP predictions and the predictions of other methods.
When heteroplasmy data for unannotated variants in HelixMTdb are analyzed for other prediction methods, as performed above for MitoCAP, MitoCAP best separated variants into classes with different heteroplasmy propensities and achieved the highest Kolmogorov-Smirnov D score.
Taken together, their analyses indicate that MitoCAP appears to be the most proficient among the compared methods in predicting pathogenicity of variants in mtDNA-encoded proteins, while alternative methods may outperform MitoCAP during classification of tRNA variants.
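
For readers unfamiliar with the Kolmogorov-Smirnov D score mentioned above, the following Python sketch computes the two-sample D statistic, that is, the largest vertical gap between two empirical cumulative distribution functions. It is an illustration only: the function name `ks_d_statistic` and the example values are hypothetical and are not taken from HelixMTdb or from the authors' analysis.

```python
from typing import Sequence


def ks_d_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov D: max |ECDF_a(x) - ECDF_b(x)| over x."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x so ties are handled together.
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


# Made-up per-variant values for two predicted classes.
predicted_benign = [1.0, 2.0, 3.0, 4.0]
predicted_pathogenic = [3.0, 4.0, 5.0, 6.0]
d_score = ks_d_statistic(predicted_benign, predicted_pathogenic)  # 0.5
```

A larger D indicates that a prediction method splits variants into two classes whose value distributions differ more strongly; D ranges from 0 (identical distributions) to 1 (no overlap).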

DISCUSSION

The authors describe here a methodology that allows improved quantification of the relative conservation of sites within and between genes, RNAs, and proteins.
Even nearly identical sequences can be utilized by their approach, allowing for an ever-increasing input dataset that can be deployed toward calculation of site-specific conservation.
The authors note that focusing upon IIDSs, rather than the simple presence or absence of a character at a site, can indirectly integrate information about potential epistatic interactions that permit or block a substitution from being successfully established within a lineage.
The MitoCAP predictions that the authors provide allow for improved comprehension of which mtDNA variants identified within a patient may be linked to mitochondrial disease.
Concordantly, their data suggest a strong propensity for heteroplasmy in the set of substitutions that the authors predict to be pathogenic but that are not yet clinically annotated as disease-associated.

METHODOLOGY

Mitochondrial DNA sequence acquisition and conservation analysis

Mammalian mtDNA sequences were retrieved from the National Center for Biotechnology Information database of organelle genomes (https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles/ on September 26, 2019).
The PAGAN output was then analyzed using “binary-table-by-edges-v2.2” and "addconvention-to-binarytable-v1.1.py" (https://github.com/corydunnlab/hummingbird).
For proteins, the negative training sets consisted of 50 mtDNA substitutions (encoding 51 protein variants) from the reference sequence.
Predictions for the ROC curve were collected using the ‘mining’ function of the rminer package (Cortez, 2015), with the optimized parameters, during 10 runs of 5-fold cross-validation [model="ksvm", task = "prob", method = c("kfold", 5), Runs = 10].
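
The resampling scheme named above (10 runs of 5-fold cross-validation) can be illustrated with a short, standard-library Python sketch. This is not rminer and does not train a support vector machine; it only shows how the train/test index splits for such a scheme might be generated. The function name `repeated_kfold_indices`, the seed, and the sample count are hypothetical, and no stratification by class is attempted.

```python
import random
from typing import Iterator, List, Tuple


def repeated_kfold_indices(
    n_samples: int, k: int = 5, runs: int = 10, seed: int = 0
) -> Iterator[Tuple[int, int, List[int], List[int]]]:
    """Yield (run, fold, train_indices, test_indices) for repeated k-fold CV."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    rng = random.Random(seed)
    for run in range(runs):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every run
        folds = [order[f::k] for f in range(k)]
        for f in range(k):
            test = folds[f]
            train = [idx for g, fold in enumerate(folds) if g != f for idx in fold]
            yield run, f, train, test


# 23 hypothetical training variants, 10 runs of 5 folds: 50 splits in total.
splits = list(repeated_kfold_indices(23, k=5, runs=10))
```

Within each run every sample is held out exactly once, so pooling the held-out predictions across folds and runs gives the values from which a ROC curve can be drawn.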

Comparison of selected, alternative prediction methods with MitoCAP

Pathogenicity predictions for their training and test set variants were compared to predictions made by PolyPhen-2 (Adzhubei et al, 2013), PROVEAN (Choi et al, 2012), Panther-PSEP (Tang & Thomas, 2016b), Mitoclass (Martín-Navarro et al, 2017) and MitImpact (Castellana et al, 2015).

AUTHOR CONTRIBUTIONS

B.A.A. developed software, analyzed data, and edited the manuscript.
P.O.C. and V.O.P. analyzed data and edited the manuscript.
C.D.D. conceived of the classification approach, supervised the project, analyzed data, prepared figures, and wrote the manuscript.


A novel phylogenetic analysis and machine learning predict pathogenicity of human mtDNA variants

Bala Anı Akpınar¹†, Paul O. Carlson, Ville O. Paavilainen, and Cory D. Dunn¹†

Institute of Biotechnology, Helsinki Institute of Life Science, University of Helsinki, Helsinki, 00014, Finland

† Corresponding authors

Correspondence:

Bala Anı Akpınar, Ph.D.
P.O. Box 56
University of Helsinki
00014 Finland
Email: ani.akpinar@helsinki.fi
Phone: +358 50 311 9307

Cory Dunn, Ph.D.
P.O. Box 56
University of Helsinki
00014 Finland
Email: cory.dunn@helsinki.fi
Phone: +358 50 311 9307

bioRxiv preprint doi: https://doi.org/10.1101/2020.01.10.902239; this version posted October 10, 2020.

ABSTRACT

Linking mitochondrial DNA (mtDNA) variation to clinical outcomes remains a formidable challenge. Diagnosis of mitochondrial disease is hampered by the multicopy nature and potential heteroplasmy of the mitochondrial genome, differential distribution of mutant mtDNAs among various tissues, genetic interactions among alleles, and environmental effects. Here, we describe a new approach to the assessment of which mtDNA variants may be pathogenic. Our method takes advantage of site-specific conservation and variant acceptability metrics that minimize previous classification limitations. Using our novel features, we deploy machine learning to predict the pathogenicity of thousands of human mtDNA variants. Our work demonstrates that a substantial fraction of mtDNA changes not yet characterized as harmful are, in fact, likely to be deleterious. Our findings will be of direct relevance to those at risk of mitochondria-associated metabolic disease.


INTRODUCTION

Because of the critical roles that mitochondria play in metabolism and bioenergetics, mutation of mitochondria-localized proteins and ribonucleic acids can adversely affect human health (Alston et al, 2017; Suomalainen & Battersby, 2018; Khan et al, 2020; Russell et al, 2020). Indeed, at least one in 5000 people (Gorman et al, 2015) is estimated to be overtly affected by mitochondrial disease. While a very limited number of mitochondrial DNA (mtDNA) lesions can be directly linked to human illness, the clinical outcome for many other mtDNA changes remains ambiguous (Vento & Pappa, 2013). Heteroplasmy among the hundreds of mitochondrial DNA (mtDNA) molecules found within a cell (Stewart & Chinnery, 2015; Hahn & Zuryn, 2019; Wei & Chinnery, 2020), differential distribution of disease-causing mtDNA among tissues (Boulet et al, 1992), and modifier alleles within the mitochondrial genome (Wei et al, 2017; Elliott et al, 2008) magnify the difficulty of interpreting different mtDNA alterations. Mito-nuclear interactions and environmental effects may also determine the outcome of mitochondrial DNA mutations (Wolff et al, 2014; Hill et al, 2019; Matilainen et al, 2017; Turnbull et al, 2018). Beyond the obvious importance of resolving the genetic etiology of symptoms presented in a clinical setting, the rapidly increasing prominence of direct-to-consumer genetic testing (Phillips et al, 2018) calls for an improved understanding of which mtDNA polymorphisms might affect human health (Blell & Hunter, 2019).

Simple tabulation of mtDNA variants found among healthy or sick individuals (Whiffin et al, 2017) may be of limited utility in predicting how harmful a variant may be. Differing, strand-specific mutational propensities for mtDNA nucleotides at different locations within the molecule (Tanaka & Ozawa, 1994; Faith & Pollock, 2003; Reyes et al, 1998) should be taken into account when assessing population-wide data, yet allele frequencies are rarely, if ever, normalized in this way. Population sampling biases and recent population bottleneck effects can lead to misinterpretation of variant frequencies (Zuk et al, 2014; Chheda et al, 2017; Keinan & Clark, 2012; Landry et al, 2018; Pirastu et al, 2020). Mildly deleterious variants arising in a population are slow to be removed by selection (Nachman, 1998; Nachman et al, 1996), leading to a false prediction of variant benignancy. Finally, a lack of selection against variants that might act in a deleterious manner at the post-reproductive stage of life also makes likely the possibility that some mtDNA changes will contribute to age-related phenotypes while avoiding overt association with mitochondrial disease (Maklakov et al, 2015; Medawar, 1952; Cui et al, 2019; Williams, 1957; Wallace, 1994).

Examining evolutionary conservation by use of multiple sequence alignments offers important assistance when predicting a variant’s potential pathogenicity (Raychaudhuri, 2011; Tang & Thomas, 2016a). However, caveats are also associated with predicting mutation outcome by the use of these alignments. First, while knowledge of amino acid physico-chemical properties is widely considered to be informative regarding whether an amino acid substitution may or may not have a damaging effect on protein function (Dayhoff et al, 1978), the site-specific acceptability of a given substitution is ultimately decided within the context of its local protein environment (Zuckerkandl & Pauling, 1965). Second, sampling biases and improper clade selection may lead to inaccurate clinical interpretations regarding the relative acceptability of specific variants (Zuk et al, 2014; Chheda et al, 2017; Keinan & Clark, 2012; Landry et al, 2018). Third, alignment (Kawrykow et al, 2012; Iantorno et al, 2014) and sequencing errors (Chen et al, 2017; Smith, 2019) may falsely indicate the acceptability of a particular mtDNA substitution.

Here, we have deployed a methodology to calculate, by a novel analysis of available mammalian genomes, the relative conservation of human mtDNA-encoded positions. Moreover, we infer ancestral direct substitutions within mammals and test whether they match substitutions from the human reference sequence, providing further knowledge regarding the potential pathogenicity of any human mtDNA substitution. By subsequent application of machine learning, we demonstrate that a surprising number of uncharacterized mtDNA mutations carried by humans are likely to promote disease. We provide our predictions, which should be of great utility to clinicians and to those studying mitochondrial disease.

RESULTS

Mapping apparent substitutions to a phylogenetic tree allows calculation of relative positional conservation in mtDNA-encoded proteins and RNAs

We previously developed an empirical method for detection and quantification of mtDNA substitutions mapped to the edges of a phylogenetic tree (Dunn et al, 2020). Here, we have extended our approach toward prediction of human mitochondrial variant pathogenicity. First, we retrieved full mammalian mtDNA sequences from the National Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database and extracted each RNA or protein-coding gene using the Homo sapiens reference mtDNA as a guide. Next, we aligned the resulting protein, tRNA, and rRNA sequences, concatenated the sequences of each species based upon molecule class, and generated phylogenetic trees using a maximum likelihood approach. Following tree generation, we performed ancestral prediction to reconstruct the character values of each position at every bifurcating node. Using the sequences of extant species and the predicted ancestral node values, we subsequently analyzed each edge of the tree for the presence or absence of substitutions at each aligned human position. We subsequently sum all substitutions at a given position that occur along all tree edges to generate a new metric, the total substitution score (TSS, Figure 1A). The TSS should surpass metrics that consider positional character frequencies derived from multiple sequence alignments as a proxy of conservation, as character frequencies are highly sensitive to sampling biases among input sequences. Moreover, many site-specific measurements of variability, such as Shannon entropy, are limited in dynamic range and benefit minimally from the rapid increase in available genomic information. In contrast, the dynamic range of the TSS is very wide, and potentially unlimited, continuously benefitting from the accretion of new sequence information.
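
The summation that defines the TSS can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the software used in this work: the tree is represented as a hypothetical list of (parent, child) edges, every node (extant or reconstructed) is assumed to carry an aligned sequence of equal length, and positions where either character is a gap are skipped, which is an assumption of this sketch.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (parent node, child node)


def total_substitution_scores(
    edges: List[Edge], states: Dict[str, str], gap: str = "-"
) -> List[int]:
    """Count, for each alignment position, the edges on which the character changes."""
    length = len(next(iter(states.values())))
    scores = [0] * length
    for parent, child in edges:
        parent_seq, child_seq = states[parent], states[child]
        for i in range(length):
            a, b = parent_seq[i], child_seq[i]
            # Skipping gapped positions is an assumption made for this sketch.
            if a != b and gap not in (a, b):
                scores[i] += 1
    return scores


# Hypothetical toy tree: "root" and "n1" are reconstructed ancestral nodes,
# "A", "B" and "C" are extant sequences. All sequences are made up.
toy_edges = [("root", "n1"), ("root", "C"), ("n1", "A"), ("n1", "B")]
toy_states = {"root": "MKL", "n1": "MKL", "A": "MRL", "B": "MKL", "C": "MRV"}

toy_tss = total_substitution_scores(toy_edges, toy_states)  # [0, 2, 1]
```

Position 0 never changes (score 0), position 1 changes independently on two edges (score 2), and position 2 changes once (score 1). Because the score is a count over edges, it can keep growing as more sequences, and therefore more edges, are added to the tree.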

Furthermore, by excluding edges from analysis that lead directly to extant sequences, one can further minimize effects of alignment errors and sequencing errors that may lead to eventual misinterpretation of variant pathogenicity. Moreover, mutations mapped to internal edges are more likely to represent fixed changes informative for the purposes of disease prediction, while polymorphisms that have not yet been subject to selection of sufficient strength or duration might be expected to complicate predictions of variant pathogenicity (Nachman et al, 1996; Nachman, 1998). Summation of substitutions only at these internal edges provides an internal substitution score (ISS, Figure 1B).
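
Restricting the count to internal edges can be sketched as follows. Again, this is only an illustration and not the software used in this work: the (parent, child) edge list and node names are hypothetical, an extant sequence is assumed to be any node that never appears as a parent, and gapped positions are skipped by assumption.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (parent node, child node)


def internal_substitution_scores(
    edges: List[Edge], states: Dict[str, str], gap: str = "-"
) -> List[int]:
    """Count substitutions per position, ignoring edges that end at an extant sequence."""
    internal_nodes = {parent for parent, _ in edges}
    length = len(next(iter(states.values())))
    scores = [0] * length
    for parent, child in edges:
        if child not in internal_nodes:
            continue  # terminal edge leading directly to an extant sequence
        parent_seq, child_seq = states[parent], states[child]
        for i in range(length):
            a, b = parent_seq[i], child_seq[i]
            if a != b and gap not in (a, b):
                scores[i] += 1
    return scores


# Hypothetical toy tree with two internal edges (root->n1, root->n2)
# and four terminal edges. All sequences are made up.
toy_edges = [
    ("root", "n1"), ("root", "n2"),
    ("n1", "A"), ("n1", "B"),
    ("n2", "C"), ("n2", "D"),
]
toy_states = {
    "root": "MKL", "n1": "MRL", "n2": "MKL",
    "A": "MRL", "B": "MRV", "C": "MKL", "D": "QKL",
}

toy_iss = internal_substitution_scores(toy_edges, toy_states)  # [0, 1, 0]
```

The changes seen only in the extant sequences B (position 2) and D (position 0) sit on terminal edges and are therefore not counted, which mirrors the motivation given above: changes private to a single extant sequence are the ones most likely to reflect sequencing errors, alignment errors, or unfixed polymorphisms.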

When calculated for protein and RNA sites encoded by mammalian mtDNA, it is clear that the TSS (and the ISS, not shown) provides an excellent readout of relative conservation at, and consequent functional importance of, each alignment position. When comparing TSS data from different mtDNA-encoded proteins, our findings are consistent with previous results, obtained by alternative methodologies, demonstrating that the core, mtDNA-encoded subunits of Complexes III and IV tend to be the most conserved, while positions within the mtDNA-encoded polypeptides of Complex I and Complex V tend to be less well conserved (da Fonseca et al, 2008; Nabholz et al, 2013) (Figure 2A). Examination of the structures of these complexes indicates that, indeed, the most conserved residues are preferentially localized near the key catalytic regions of each complex (not shown). Within each protein, there was, as expected, a spectrum of site conservation values, also illustrated by plotting a distribution of TSS values across each polypeptide (Figure S1). Nearly all analyzed protein positions appeared to be under some selective pressure and are not saturated with mutations, with TSS values existing far from the maximal values that can be achieved within this phylogenetic analysis of mammals. Selective pressure on most aligned sites is also observed when examining mtDNA-encoded tRNAs and rRNAs (Figure 2B and Figure S2).

Beyond summation of substitutions across a phylogenetic tree, the inferred ancestral and descendent characters at each edge of the phylogenetic tree can also be examined following generation of the substitution map and can provide important information regarding what changes to mtDNA-encoded macromolecules might be deleterious or not. Specifically, if an inferred direct substitution from the human reference character to the mutant character (or the inverse, assuming the time-reversibility of character substitutions) is predicted along the edge of a phylogenetic tree, then such a change at a given position might be expected to be less deleterious than an inferred direct substitution to or from the human character that was never encountered over the evolutionary history of a clade. In contrast, the simple presence or absence of a character at an alignment position, without


