
A novel phylogenetic analysis combined with a machine learning approach predicts human mitochondrial variant pathogenicity

11 Jan 2020 - bioRxiv (Cold Spring Harbor Laboratory)
TL;DR: The authors describe a novel, empirical approach for assessing site-specific conservation and variant acceptability that depends upon phylogenetic analysis and ancestral prediction and minimizes current alignment limitations, and they show that a substantial portion of encountered mtDNA alleles not yet characterized as harmful are, in fact, likely to be deleterious.
Abstract: Linking mitochondrial DNA (mtDNA) mutations to patient outcomes has been a serious challenge. The multicopy nature and potential heteroplasmy of the mitochondrial genome, differential distribution of mutant mtDNAs among various tissues, genetic interactions among alleles, and environmental effects can hamper clinicians as they try to inform patients regarding the etiology of their metabolic disease. Multiple sequence alignments using samples ranging across multiple organisms and taxa are often deployed to assess the overall conservation of any site within a mtDNA-encoded macromolecule and to determine the acceptability of any given variant at a particular position. However, the utility of multiple sequence alignments in pathogenicity prediction can be restricted by factors including sample set bias, alignment errors, and sequencing errors. Here, we describe a novel and empirical approach for assessing site-specific conservation and variant acceptability that depends upon phylogenetic analysis and ancestral prediction and minimizes current alignment limitations. Next, we use machine learning to predict the pathogenicity of thousands of so-far-uncharacterized human alleles catalogued in the clinic. Our work demonstrates that a substantial portion of encountered mtDNA alleles not yet characterized as harmful are, in fact, likely to be deleterious. Beyond general applications of our methodology that lie outside of mitochondrial studies, our findings are likely to be of direct relevance to those at risk of mitochondria-associated illness.

Summary (2 min read)

INTRODUCTION

  • Because of the critical roles that mitochondria play in metabolism and bioenergetics, mutation of mitochondria-localized proteins and ribonucleic acids can adversely affect human health (Alston et al, 2017; Suomalainen & Battersby, 2018; Khan et al, 2020; Russell et al, 2020).
  • Simple tabulation of mtDNA variants found among healthy or sick individuals (Whiffin et al, 2017) may be of limited utility in predicting how harmful a variant may be.
  • First, while knowledge of amino acid physico-chemical properties is widely considered to be informative regarding whether an amino acid substitution may or may not have a damaging effect on protein function (Dayhoff et al, 1978), the site-specific acceptability of a given substitution is ultimately decided within the context of its local protein environment (Zuckerkandl & Pauling, 1965).

RESULTS

  • Mapping apparent substitutions to a phylogenetic tree allows calculation of relative positional conservation in mtDNA-encoded proteins and RNAs: Using the sequences of extant species and the predicted ancestral node values, the authors subsequently analyzed each edge of the tree for the presence or absence of substitutions at each aligned human position.
  • When calculated for protein and RNA sites encoded by mammalian mtDNA, it is clear that the TSS (and the ISS, not shown) provides an excellent readout of relative conservation at, and consequent functional importance of, each alignment position.
  • Substitution scores and inferred direct substitutions can be linked to human mtDNA variant pathogenicity: Since summation of detected substitutions across a phylogenetic tree provides a robust measure of relative conservation at different macromolecular positions, the authors were confident that a phylogenetic analysis that includes TSSs would also provide information about the pathogenicity of human mtDNA variants.
  • Even so, the distribution of variant frequencies among full-length sequences in GenBank was strikingly different for those mutations for which an IIDS could be identified in their mammalian trees of proteins, and even tRNAs, when compared to those for which an IIDS could not be identified.
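
The check for a direct substitution along a tree edge can be illustrated with a small Python sketch. It is not the authors' code: the tree, node names, and sequences are invented, and the summary does not expand the abbreviation IIDS, so the sketch assumes it denotes an inferred direct substitution observed on an internal edge (one that does not end in an extant sequence). Both directions of a change are accepted, which assumes time-reversible substitution.

```python
# Hypothetical toy data: edges are (parent, child) pairs; A0 and A1 are
# inferred ancestral nodes, sp1..sp3 are extant species. All names and
# sequences are invented for illustration.
EDGES = [("A0", "A1"), ("A0", "sp3"), ("A1", "sp1"), ("A1", "sp2")]
SEQS = {"A0": "MKVL", "A1": "MKIL", "sp1": "MKIL", "sp2": "MRIL", "sp3": "MKVF"}


def has_direct_substitution(edges, seqs, pos, ref, alt, internal_only=True):
    """Return True if ref<->alt is observed across a single edge at `pos`.

    Both directions are accepted (assumes time-reversible substitution).
    With internal_only=True, edges that end in a leaf are ignored.
    """
    parents = {parent for parent, _ in edges}
    for parent, child in edges:
        if internal_only and child not in parents:
            continue  # child is an extant sequence: terminal edge
        pair = (seqs[parent][pos], seqs[child][pos])
        if pair in ((ref, alt), (alt, ref)):
            return True
    return False


# Position 2 (0-based): V -> I occurs on the internal edge A0 -> A1.
print(has_direct_substitution(EDGES, SEQS, 2, "I", "V"))  # True
# Position 1: K -> R occurs only on the terminal edge A1 -> sp2.
print(has_direct_substitution(EDGES, SEQS, 1, "K", "R"))  # False
```

Restricting the search to internal edges is a design choice of this sketch; passing `internal_only=False` also counts changes on edges leading to extant sequences.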

A support vector machine predicts harmful mtDNA variants

  • Given the clear presence of deleterious substitutions among so far uncharacterized variants, the authors sought a high-throughput method that could, with confidence, identify these potentially deleterious substitutions.
  • MitoCAP also scored best against their training set when considering most auxiliary measures of prediction proficiency.
  • To further investigate this possibility, the authors first plotted the level of agreement between MitoCAP and other methods when assessing all classified variants, and they noted a pronounced lack of overlap between their MitoCAP predictions and the predictions of other methods.
  • When heteroplasmy data for unannotated variants in HelixMTdb are analyzed for other prediction methods, as performed above for MitoCAP, MitoCAP best separated variants into classes with different heteroplasmy propensities and achieved the highest Kolmogorov-Smirnov D score.
  • Taken together, their analyses indicate that MitoCAP appears to be the most proficient among the compared methods in predicting pathogenicity of variants in mtDNA-encoded proteins, while alternative methods may outperform MitoCAP during classification of tRNA variants.
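
For readers unfamiliar with the Kolmogorov-Smirnov D score mentioned above, the following Python sketch computes the two-sample statistic: the largest vertical gap between the empirical cumulative distributions of two samples. The input values are invented placeholders, not HelixMTdb data, and the sketch says nothing about how the authors grouped variants.

```python
from bisect import bisect_right


def ks_d_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov D: max |ECDF_a(x) - ECDF_b(x)|."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The ECDFs only change at observed values, so checking those suffices.
    for value in sorted(set(a) | set(b)):
        ecdf_a = bisect_right(a, value) / len(a)
        ecdf_b = bisect_right(b, value) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


# Hypothetical per-variant values for two predicted classes.
predicted_benign = [0.1, 0.2, 0.3, 0.4]
predicted_pathogenic = [0.3, 0.4, 0.5, 0.6]
print(ks_d_statistic(predicted_benign, predicted_pathogenic))  # 0.5
```

A D value of 0 means the two empirical distributions coincide, and a value of 1 means they do not overlap at all, so a larger D indicates a cleaner separation between the two predicted classes.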

DISCUSSION

  • The authors describe here a methodology that allows improved quantification of the relative conservation of sites within and between genes, RNAs, and proteins.
  • Even nearly identical sequences can be utilized by their approach, allowing for an ever-increasing input dataset that can be deployed toward calculation of site-specific conservation.
  • The authors note that focusing upon IIDSs, rather than the simple presence or absence of a character at a site, can indirectly integrate information about potential epistatic interactions that permit or block a substitution from being successfully established within a lineage.
  • The MitoCAP predictions that the authors provide allow for improved comprehension of which mtDNA variants identified within a patient may be linked to mitochondrial disease.
  • Concordantly, their data suggest a strong propensity for heteroplasmy in the set of substitutions that the authors predict to be pathogenic, but are not yet clinically annotated as disease-associated.

METHODOLOGY

  • Mitochondrial DNA sequence acquisition and conservation analysis: Mammalian mtDNA sequences were retrieved from the National Center for Biotechnology Information database of organelle genomes (https://www.ncbi.nlm.nih.gov/genome/browse#!/organelles/ on September 26, 2019).
  • The PAGAN output was then analyzed using “binary-table-by-edges-v2.2” and "addconvention-to-binarytable-v1.1.py" (https://github.com/corydunnlab/hummingbird).
  • For proteins, the negative training sets consisted of 50 mtDNA substitutions (encoding 51 protein variants) from the reference sequence.
  • Predictions for the ROC curve were collected using the ‘mining’ function of the rminer package (Cortez, 2015), with the optimized parameters, during 10 runs of 5-fold cross-validation [model="ksvm", task = "prob", method = c("kfold", 5), Runs = 10].
  • Comparison of selected, alternative prediction methods with MitoCAP: Pathogenicity predictions for their training and test set variants were compared to predictions made by PolyPhen-2 (Adzhubei et al, 2013), PROVEAN (Choi et al, 2012), Panther-PSEP (Tang & Thomas, 2016b), Mitoclass (Martín-Navarro et al, 2017) and MitImpact (Castellana et al, 2015).
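
The evaluation scheme named above (10 runs of 5-fold cross-validation, with predicted probabilities feeding a ROC curve) can be sketched in Python. This does not reproduce rminer or its `ksvm` model, and no classifier is trained here: the helper names `repeated_kfold_indices` and `roc_auc` are hypothetical, and the labels and scores are invented.

```python
import random


def repeated_kfold_indices(n_samples, k=5, runs=10, seed=0):
    """Yield (run, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for run in range(runs):
        order = list(range(n_samples))
        rng.shuffle(order)  # a fresh random partition for every run
        folds = [order[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield run, f, train_idx, test_idx


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


splits = list(repeated_kfold_indices(n_samples=20))
print(len(splits))  # 50 train/test splits: 10 runs x 5 folds

# Invented labels (1 = pathogenic) and predicted probabilities.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))  # 0.75
```

In a full pipeline, a model would be fitted on each `train_idx`, scored on the matching `test_idx`, and the held-out scores would then be passed to a ROC routine such as `roc_auc`.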

AUTHOR CONTRIBUTIONS

  • B.A.A. developed software, analyzed data, and edited the manuscript.
  • P.O.C. and V.O.P. analyzed data and edited the manuscript.
  • C.D.D. conceived of the classification approach, supervised the project, analyzed data, prepared figures, and wrote the manuscript.


A novel phylogenetic analysis and machine learning
predict pathogenicity of human mtDNA variants
Bala Anı Akpınar¹†, Paul O. Carlson¹, Ville O. Paavilainen¹, and Cory D. Dunn¹†
¹ Institute of Biotechnology, Helsinki Institute of Life Science, University of Helsinki, Helsinki, 00014, Finland
† Corresponding authors
Correspondence:
Bala Anı Akpınar, Ph.D.
P.O. Box 56
University of Helsinki
00014 Finland
Email: ani.akpinar@helsinki.fi
Phone: +358 50 311 9307
or
Cory Dunn, Ph.D.
P.O. Box 56
University of Helsinki
00014 Finland
Email: cory.dunn@helsinki.fi
Phone: +358 50 311 9307
bioRxiv preprint doi: https://doi.org/10.1101/2020.01.10.902239; this version posted October 10, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder. All rights reserved. No reuse allowed without permission.

ABSTRACT
Linking mitochondrial DNA (mtDNA) variation to clinical outcomes remains a formidable
challenge. Diagnosis of mitochondrial disease is hampered by the multicopy nature and
potential heteroplasmy of the mitochondrial genome, differential distribution of mutant
mtDNAs among various tissues, genetic interactions among alleles, and environmental
effects. Here, we describe a new approach to the assessment of which mtDNA variants may
be pathogenic. Our method takes advantage of site-specific conservation and variant
acceptability metrics that minimize previous classification limitations. Using our novel
features, we deploy machine learning to predict the pathogenicity of thousands of human
mtDNA variants. Our work demonstrates that a substantial fraction of mtDNA changes not
yet characterized as harmful are, in fact, likely to be deleterious. Our findings will be of direct
relevance to those at risk of mitochondria-associated metabolic disease.

INTRODUCTION
Because of the critical roles that mitochondria play in metabolism and bioenergetics,
mutation of mitochondria-localized proteins and ribonucleic acids can adversely affect
human health (Alston et al, 2017; Suomalainen & Battersby, 2018; Khan et al, 2020; Russell
et al, 2020). Indeed, at least one in 5000 people (Gorman et al, 2015) is estimated to be
overtly affected by mitochondrial disease. While a very limited number of mitochondrial DNA
(mtDNA) lesions can be directly linked to human illness, the clinical outcome for many other
mtDNA changes remains ambiguous (Vento & Pappa, 2013). Heteroplasmy among the
hundreds of mtDNA molecules found within a cell (Stewart & Chinnery,
2015; Hahn & Zuryn, 2019; Wei & Chinnery, 2020), differential distribution of disease-causing
mtDNA among tissues (Boulet et al, 1992), and modifier alleles within the mitochondrial
genome (Wei et al, 2017; Elliott et al, 2008) magnify the difficulty of interpreting different
mtDNA alterations. Mito-nuclear interactions and environmental effects may also determine
the outcome of mitochondrial DNA mutations (Wolff et al, 2014; Hill et al, 2019; Matilainen et
al, 2017; Turnbull et al, 2018). Beyond the obvious importance of resolving the genetic
etiology of symptoms presented in a clinical setting, the rapidly increasing prominence of
direct-to-consumer genetic testing (Phillips et al, 2018) calls for an improved understanding
of which mtDNA polymorphisms might affect human health (Blell & Hunter, 2019).
Simple tabulation of mtDNA variants found among healthy or sick individuals (Whiffin
et al, 2017) may be of limited utility in predicting how harmful a variant may be. Differing,
strand-specific mutational propensities for mtDNA nucleotides at different locations within
the molecule (Tanaka & Ozawa, 1994; Faith & Pollock, 2003; Reyes et al, 1998) should be
taken into account when assessing population-wide data, yet allele frequencies are rarely, if
ever, normalized in this way. Population sampling biases and recent population bottleneck
effects can lead to misinterpretation of variant frequencies (Zuk et al, 2014; Chheda et al,
2017; Keinan & Clark, 2012; Landry et al, 2018; Pirastu et al, 2020). Mildly deleterious
variants arising in a population are slow to be removed by selection (Nachman, 1998;
Nachman et al, 1996), leading to a false prediction of variant benignancy. Finally, a lack of
selection against variants that might act in a deleterious manner at the post-reproductive
stage of life also makes likely the possibility that some mtDNA changes will contribute to
age-related phenotypes while avoiding overt association with mitochondrial disease
(Maklakov et al, 2015; Medawar, 1952; Cui et al, 2019; Williams, 1957; Wallace, 1994).
Examining evolutionary conservation by use of multiple sequence alignments offers
important assistance when predicting a variant’s potential pathogenicity (Raychaudhuri,
2011; Tang & Thomas, 2016a). However, caveats are also associated with predicting
mutation outcome by the use of these alignments. First, while knowledge of amino acid
physico-chemical properties is widely considered to be informative regarding whether an
amino acid substitution may or may not have a damaging effect on protein function (Dayhoff
et al, 1978), the site-specific acceptability of a given substitution is ultimately decided within
the context of its local protein environment (Zuckerkandl & Pauling, 1965). Second, sampling
biases and improper clade selection may lead to inaccurate clinical interpretations regarding
the relative acceptability of specific variants (Zuk et al, 2014; Chheda et al, 2017; Keinan &
Clark, 2012; Landry et al, 2018). Third, alignment (Kawrykow et al, 2012; Iantorno et al, 2014)
and sequencing errors (Chen et al, 2017; Smith, 2019) may falsely indicate the acceptability
of a particular mtDNA substitution.
Here, we have deployed a methodology to calculate, by a novel analysis of available
mammalian genomes, the relative conservation of human mtDNA-encoded positions.
Moreover, we infer ancestral direct substitutions within mammals and test whether they
match substitutions from the human reference sequence, providing further knowledge
regarding the potential pathogenicity of any human mtDNA substitution. By subsequent
application of machine learning, we demonstrate that a surprising number of
uncharacterized mtDNA mutations carried by humans are likely to promote disease. We
provide our predictions, which should be of great utility to clinicians and to those studying
mitochondrial disease.
RESULTS
Mapping apparent substitutions to a phylogenetic tree allows calculation of relative
positional conservation in mtDNA-encoded proteins and RNAs
We previously developed an empirical method for detection and quantification of
mtDNA substitutions mapped to the edges of a phylogenetic tree (Dunn et al, 2020). Here,
we have extended our approach toward prediction of human mitochondrial variant
pathogenicity. First, we retrieved full mammalian mtDNA sequences from the National
Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database and
extracted each RNA or protein-coding gene using the Homo sapiens reference mtDNA as a
guide. Next, we aligned the resulting protein, tRNA, and rRNA sequences, concatenated the
sequences of each species based upon molecule class, and generated phylogenetic trees
using a maximum likelihood approach. Following tree generation, we performed ancestral
prediction to reconstruct the character values of each position at every bifurcating node.
Using the sequences of extant species and the predicted ancestral node values, we
subsequently analyzed each edge of the tree for the presence or absence of substitutions at
each aligned human position. We then summed all substitutions at a given position that
occur along all tree edges to generate a new metric, the total substitution score (TSS, Figure
1A). The TSS should surpass metrics that consider positional character frequencies derived
from multiple sequence alignments as a proxy of conservation, as character frequencies are
highly sensitive to sampling biases among input sequences. Moreover, many site-specific
measurements of variability, such as Shannon entropy, are limited in dynamic range and
benefit minimally from the rapid increase in available genomic information. In contrast, the
dynamic range of the TSS is very wide, and potentially unlimited, continuously benefitting
from the accretion of new sequence information.
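
A minimal Python sketch of the TSS calculation described above follows. It is an illustration under stated assumptions rather than our pipeline: the four-edge tree, the node names, and the sequences are invented, the ancestral sequences stand in for the output of an ancestral reconstruction step, and gaps are ignored. A Shannon entropy helper is included for contrast, since entropy over a 20-letter amino acid alphabet cannot exceed log2(20), about 4.32 bits, whereas an edge count can keep growing as edges are added.

```python
from collections import Counter
from math import log2

# Hypothetical toy tree: edges are (parent, child) pairs; A0 and A1 are
# inferred ancestral nodes, sp1..sp3 are extant species.
EDGES = [("A0", "A1"), ("A0", "sp3"), ("A1", "sp1"), ("A1", "sp2")]
SEQS = {"A0": "MKVL", "A1": "MKIL", "sp1": "MKIL", "sp2": "MRIL", "sp3": "MKVF"}


def total_substitution_scores(edges, seqs):
    """For each aligned position, count edges where parent and child differ."""
    length = len(next(iter(seqs.values())))
    scores = [0] * length
    for parent, child in edges:
        for i, (a, b) in enumerate(zip(seqs[parent], seqs[child])):
            if a != b:
                scores[i] += 1
    return scores


def shannon_entropy(column):
    """Entropy (bits) of the character frequencies in one alignment column."""
    counts = Counter(column)
    n = len(column)
    return -sum((c / n) * log2(c / n) for c in counts.values())


print(total_substitution_scores(EDGES, SEQS))  # [0, 1, 1, 1]

# Entropy uses only the extant sequences and is bounded by the alphabet size.
extant = [SEQS[name] for name in ("sp1", "sp2", "sp3")]
print([round(shannon_entropy(col), 3) for col in zip(*extant)])
```

In this toy case the invariant first position scores 0 under both measures, while the remaining positions each record one substitution event on the tree.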
Furthermore, by excluding edges from analysis that lead directly to extant sequences,
one can further minimize effects of alignment errors and sequencing errors that may lead to
eventual misinterpretation of variant pathogenicity. Moreover, mutations mapped to internal
edges are more likely to represent fixed changes informative for the purposes of disease
prediction, while polymorphisms that have not yet been subject to selection of sufficient
strength or duration might be expected to complicate predictions of variant pathogenicity
(Nachman et al, 1996; Nachman, 1998). Summation of substitutions only at these internal
edges provides an internal substitution score (ISS, Figure 1B).
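
The restriction to internal edges can be sketched in Python as follows. This is an illustrative toy, not our implementation: the tree, node names, and sequences are invented, and an edge is treated as internal when its child node is itself the parent of another edge.

```python
# Hypothetical toy tree: edges are (parent, child) pairs; A0 and A1 are
# inferred ancestral nodes, sp1..sp3 are extant species.
EDGES = [("A0", "A1"), ("A0", "sp3"), ("A1", "sp1"), ("A1", "sp2")]
SEQS = {"A0": "MKVL", "A1": "MKIL", "sp1": "MKIL", "sp2": "MRIL", "sp3": "MKVF"}


def internal_substitution_scores(edges, seqs):
    """Count substitutions per position, skipping edges that end in a leaf."""
    parents = {parent for parent, _ in edges}
    length = len(next(iter(seqs.values())))
    scores = [0] * length
    for parent, child in edges:
        if child not in parents:
            continue  # edge leads directly to an extant sequence
        for i, (a, b) in enumerate(zip(seqs[parent], seqs[child])):
            if a != b:
                scores[i] += 1
    return scores


print(internal_substitution_scores(EDGES, SEQS))  # [0, 0, 1, 0]
```

Only the change on the edge from A0 to A1 is counted here; the two differences that appear on edges leading to sp2 and sp3 are excluded, which is how terminal-edge noise such as sequencing errors would be kept out of the score.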
When calculated for protein and RNA sites encoded by mammalian mtDNA, it is clear
that the TSS (and the ISS, not shown) provides an excellent readout of relative conservation
at, and consequent functional importance of, each alignment position. When comparing TSS
data from different mtDNA-encoded proteins, our findings are consistent with previous
results, obtained by alternative methodologies, demonstrating that the core, mtDNA-
encoded subunits of Complexes III and IV tend to be the most conserved, while positions
within the mtDNA-encoded polypeptides of Complex I and Complex V tend to be less well
conserved (da Fonseca et al, 2008; Nabholz et al, 2013) (Figure 2A). Examination of the
structures of these complexes indicates that, indeed, the most conserved residues are
preferentially localized near the key catalytic regions of each complex (not shown). Within
each protein, there was, as expected, a spectrum of site conservation values, also illustrated
by plotting a distribution of TSS values across each polypeptide (Figure S1). Nearly all
analyzed protein positions appeared to be under some selective pressure and are not
saturated with mutations, with TSS values existing far from the maximal values that can be
achieved within this phylogenetic analysis of mammals. Selective pressure on most aligned
sites is also observed when examining mtDNA-encoded tRNAs and rRNAs (Figure 2B and
Figure S2).
Beyond summation of substitutions across a phylogenetic tree, the inferred ancestral
and descendent characters at each edge of the phylogenetic tree can also be examined
following generation of the substitution map and can provide important information
regarding what changes to mtDNA-encoded macromolecules might be deleterious or not.
Specifically, if an inferred direct substitution from the human reference character to the
mutant character (or the inverse, assuming the time-reversibility of character substitutions) is
predicted along the edge of a phylogenetic tree, then such a change at a given position
might be expected to be less deleterious than an inferred direct substitution to or from the
human character that was never encountered over the evolutionary history of a clade. In
contrast, the simple presence or absence of a character at an alignment position, without

References
Journal Article
TL;DR: TrimAl is a tool for automated alignment trimming, which is especially suited for large-scale phylogenetic analyses and can automatically select the parameters to be used in each specific alignment so that the signal-to-noise ratio is optimized.
Abstract: Multiple sequence alignments (MSA) are central to many areas of bioinformatics, including phylogenetics, homology modeling, database searches and motif finding. Recently, such MSA-based techniques have been incorporated in high-throughput pipelines such as genome annotation and phylogenomics analyses. In all these applications, the reliability and accuracy of the analyses depend critically on the quality of the underlying alignments. A plethora of computer programs and algorithms for MSA are currently available (Notredame, 2007), which implement different heuristics to find mathematically optimal solutions to the MSA problem. Accuracies of 80–90% have been reported for the best algorithms, but even the best scoring alignment algorithms may fail with certain protein families or at specific regions in the alignment. The situation worsens in large-scale analyses, where faster but less reliable algorithms and large numbers of automatically selected sequences are used. It is therefore generally assumed that trimming the alignment, so that poorly aligned regions are eliminated, increases the accuracy of the resulting MSA-based applications (Talavera and Castresana, 2007). Some programs such as G-blocks (Castresana, 2000) have been developed to assist in the MSA trimming phase by selecting blocks of conserved regions. They have become very popular and are extensively used, with good performance, in small-to-medium scale datasets, where several parameters can be tested manually (Talavera and Castresana, 2007). However, their use over larger datasets is hampered by the need for defining, prior to the analysis, the set of parameters that will be used for all sequence families. Here, we present trimAl, a tool for automated alignment trimming. 
Its speed and the possibility for automatically adjusting the parameters to improve the phylogenetic signal-to-noise ratio, makes trimAl especially suited for large-scale phylogenomic analyses, involving thousands of large alignments. trimAl has been developed in a GNU/Linux environment using C++ programming language and has been tested on various UNIX, Mac and Windows platforms. Moreover, we have developed a web server to run trimAl online (http://phylemon2.bioinfo.cipf.es/), which has been included in the Phylemon suite for phylogenetic and phylogenomic tools (Tarraga et al., 2007). The documentation, source files and additional information for trimAl are available through a wiki page (http://trimal.cgenomics.org). trimAl reads and renders protein or nucleotide alignments in several standard formats. trimAl starts by reading all columns in an alignment and computes a score (Sx) for each of them. This score can be a gap score (Sg), a similarity score (Ss) or a consistency score (Sc). The score for each column can be computed based only on the information from that column or, if a window size of w is specified, it corresponds to the average value of w columns around the position considered. The gap score (Sg) for a column is the fraction of sequences without a gap in that position. The residue similarity score (Ss) consists of mean distance (MD) scores as described in Thompson et al. (2001) and Supplementary Material. This score uses the MD between pairs of residues, as defined by a given scoring matrix. Finally, the consistency score (Sc) can only be computed when more than one alignment for the same set of sequences is provided. Details on how these scores are computed are provided in the Supplementary Material. In brief, Sc measures the level of consistency of all the residue pairs found in a column as compared with the other alignments. 
The alignment with the highest consistency is chosen and then trimmed to remove the columns that are less conserved, according to Sc or other thresholds set by the user. Once all column scores have been computed trimAl can proceed in two ways. If both a score and a minimum conservation threshold are provided, trimAl renders a trimmed alignment in which only the columns with scores above the score threshold are included, as far as the number of selected columns is above a conservation threshold defined by the user. If this number is below the conservation threshold, trimAl will add more columns to the trimmed alignment in a decreasing order of scores until the conservation threshold is reached. The conservation threshold corresponds to the minimum percentage of columns, from the original alignment, which the user wants to include in the trimmed alignment. Alternatively, if the automatic selection of parameters options is selected, trimAl will compute specific score thresholds depending on the inherent characteristics of each alignment. So far, trimAl incorporates three modes for the automated selection of parameters, gappyout, strict and strictplus, which are based on the different use of gap and similarity scores. Moreover, the option automated1 implements a heuristic to decide the most appropriate mode depending on the alignment characteristics. The heuristics to define such parameters have been designed based on the results of a benchmark. Details on the heuristics and the benchmark can be found in the online documentation of the program. In brief, the automatic selection of parameters approximate optimal cutoffs by plotting, internally, the cumulative graphs of gap and similarity scores of the columns in the alignment (see online documentation). 
We expanded, using ROSE simulations (Stoye et al., 1998) a benchmark set that has been used previously to test the improvement in phylogenetic performance after an alignment trimming phase (Talavera and Castresana, 2007). This dataset simulates several evolutionary scenarios varying in the number and length of the sequences, the topology of the underlying tree and the level of sequence divergence considered. We compared the results obtained from MUSCLE alignments before and after trimming with trimAl using automated selection of parameters. The accuracy of the resulting trees was measured by comparing them with the original trees used to generate the sequence sets, and measuring the Robinson Foulds distance (Robinson and Foulds, 1981). We observed an overall improvement of the phylogenetic accuracy after trimming. Using -automated1 option of trimAl, the trimmed alignment always produced Maximum Likelihood trees that were of equal (36%) or significantly better (64%) quality as compared with the tree derived from the complete alignment. For Neighbor Joining reconstruction the -strictplus option of trimAl worked best, improving the phylogenetic accuracy in 89% of the scenarios. In most scenarios (90%), trimAl outperformed Gblocks v0.91b with default parameters. Most importantly, the use of Gblocks default parameters diminished the accuracy of the subsequent tree reconstruction in half of the scenarios considered. In contrast, the use of trimAl automated methods rarely (1.5%) undermined the topological accuracy of the resulting phylogenetic tree (see Supplementary Material for more details). To test the applicability of trimAl on real datasets as well as its suitability for large-scale phylogenetic datasets, we ran trimAl on the complete set of MUSCLE alignments generated for the Human Phylome project (Huerta-Cepas et al., 2007). This includes a total of 31 182 alignments, containing, on average, 67 sequences of 1472 positions of length. 
Trimming these alignments using the -gappyout and automated1 options used 5 min 45 s and 125 min, 2 s, respectively, on a computer with an Intel QuadCore XEON E5410 processors and 8 GB of RAM. trimAl has been used previously in a pipeline to reconstruct complete collections of gene trees. In this case, the parameter sets used were a minimum conservation threshold of 60% and a gap threshold of 90% (-cons 60 -gt 0.9). Complete and trimmed alignments used to generate the phylomes included in PhylomeDB (Huerta-Cepas et al., 2008) can be viewed through this database.

6,807 citations


Additional excerpts

  • ...4.rev22 (Capella-Gutiérrez et al. 2009)....

    [...]

Journal Article
TL;DR: An efficient means for generating mutation data matrices from large numbers of protein sequences is presented, by means of an approximate peptide-based sequence comparison algorithm, which is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and is fast enough to generate a matrix from a specific family or class of proteins in minutes.
Abstract: An efficient means for generating mutation data matrices from large numbers of protein sequences is presented here. By means of an approximate peptide-based sequence comparison algorithm, the set sequences are clustered at the 85% identity level. The closest relating pairs of sequences are aligned, and observed amino acid exchanges tallied in a matrix. The raw mutation frequency matrix is processed in a similar way to that described by Dayhoff et al. (1978), and so the resulting matrices may be easily used in current sequence analysis applications, in place of the standard mutation data matrices, which have not been updated for 13 years. The method is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and is fast enough to generate a matrix from a specific family or class of proteins in minutes. Differences observed between our 250 PAM mutation data matrix and the matrix calculated by Dayhoff et al. are briefly discussed.

6,355 citations

Journal Article
20 Nov 1943-Genetics
TL;DR: This article reported Luria and Delbruck's breakthrough study in which they established that viruses do not induce mutations in bacteria, but that virus-resisting mutations are spontaneous.
Abstract: This article reported Luria and Delbruck's breakthrough study in which they established that viruses do not induce mutations in bacteria, but that virus-resisting mutations are spontaneous. Their "fluctuation test" theory demonstrated that bacteria were ideal subjects for genetic research.

3,460 citations


"A novel phylogenetic analysis combi..." refers background in this paper

  • ...…due to potential sampling 7 biases and population bottlenecks (Auer and Lettre 2015), timing of mutation arrival within an expanding population (Luria and Delbrück 1943), and the divergent nucleotide- and strand-specific mutational propensities of mtDNA (Tanaka and Ozawa 1994; Reyes et al.…...

    [...]

Book Chapter
01 Jan 1965
TL;DR: The evaluation of the amount of differences between two organisms as derived from sequences in structural genes or in their polypeptide translation is likely to lead to quantities different from those obtained on the basis of observations made at any other, higher level of biological integration.
Abstract: Publisher Summary Informational macromolecules, or semantides, play a unique role in determining the properties of living matter in the perspectives that differ by the magnitude of time required for the processes involved—the short-timed biochemical reaction, the medium-timed ontogenetic event, and the long-timed evolutionary event. Although the slower processes should be broken down into linked faster processes, if one loses sight of the slower processes one also loses the links between the component faster processes. The relative importance of the contributions to evolution of changes in functional properties of polypeptides through their structural modification on the one hand, and of changes in the timing and the rate of synthesis of these polypeptides on the other hand, constitutes a problem that justifies the study of evolution at the level of informational macromolecules. The evaluation of the amount of differences between two organisms as derived from sequences in structural genes or in their polypeptide translation is likely to lead to quantities different from those obtained on the basis of observations made at any other, higher level of biological integration.

2,677 citations


"A novel phylogenetic analysis combi..." refers background in this paper

  • ...…informative regarding whether an amino acid substitution may or may not have a damaging effect on protein function (Dayhoff et al. 1978), the site-specific acceptability of a given substitution is ultimately decided within the context of its local protein environment (Zuckerkandl and Pauling 1965)....

    [...]

  • ...…of our edge-focused substitution analysis, like other approaches investigating substitution, obscures the development of any Poisson distribution (Zuckerkandl and Pauling 1965; Goldman 1990) based on repeated degenerate site substitutions, so we expected the total number of degenerate sites…...

    [...]

  • ...…may be harmful or not (Dayhoff et al. 1978; Jones et al. 1992), yet amino acid exchangeability matrices change across clades (Zou and Zhang 2019), and successful substitution of any given character obviously occurs in the context of a very specific local environment (Zuckerkandl and Pauling 1965)....

    [...]

Journal Article
TL;DR: AliView is an alignment viewer and editor designed to meet the requirements of next-generation sequencing era phylogenetic datasets and works as an easy-to-use alignment editor for small as well as large datasets.
Abstract: Summary: AliView is an alignment viewer and editor designed to meet the requirements of next-generation sequencing era phylogenetic datasets. AliView handles alignments of unlimited size in the formats most commonly used, i.e. FASTA, Phylip, Nexus, Clustal and MSF. The intuitive graphical interface makes it easy to inspect, sort, delete, merge and realign sequences as part of the manual filtering process of large datasets. AliView also works as an easy-to-use alignment editor for small as well as large datasets. Availability and implementation: AliView is released as open-source software under the GNU General Public License, version 3.0 (GPLv3), and is available at GitHub (www.github.com/AliView). The program is cross-platform and extensively tested on Linux, Mac OS X and Windows systems. Downloads and help are available at http://ormbunkar.se/aliview Contact: anders.larsson@ebc.uu.se Supplementary information: Supplementary data are available at Bioinformatics online.

2,071 citations


"A novel phylogenetic analysis combi..." refers methods in this paper

  • ...After gap removal, translation of protein coding genes was performed using the vertebrate mitochondrial codon table in AliView (Larsson 2014)....

    [...]

Frequently Asked Questions (1)
Q1. What are the contributions in "A novel phylogenetic analysis and machine learning predict pathogenicity of human mtDNA variants"?

Here, the authors describe a new approach to the assessment of which mtDNA variants may be pathogenic.