The rapid generation of mutation data matrices from protein sequences

doi:10.1093/BIOINFORMATICS/8.3.275

Journal Article•DOI•

David T. Jones¹, William R. Taylor, Janet M. Thornton•Institutions (1)

University College London¹

01 Jun 1992-Bioinformatics (Oxford University Press)-Vol. 8, Iss: 3, pp 275-282

TL;DR: An efficient means for generating mutation data matrices from large numbers of protein sequences is presented, by means of an approximate peptide-based sequence comparison algorithm, which is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and is fast enough to generate a matrix from a specific family or class of proteins in minutes.

Abstract: An efficient means for generating mutation data matrices from large numbers of protein sequences is presented here. By means of an approximate peptide-based sequence comparison algorithm, the set of sequences is clustered at the 85% identity level. The most closely related pairs of sequences are aligned, and the observed amino acid exchanges are tallied in a matrix. The raw mutation frequency matrix is processed in a similar way to that described by Dayhoff et al. (1978), so the resulting matrices may be easily used in current sequence analysis applications in place of the standard mutation data matrices, which have not been updated for 13 years. The method is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and is fast enough to generate a matrix from a specific family or class of proteins in minutes. Differences observed between our 250 PAM mutation data matrix and the matrix calculated by Dayhoff et al. are briefly discussed.
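The tallying step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the sequence pairs have already been aligned (the approximate peptide-based comparison and the clustering are not reproduced), it counts each exchange in both directions, and the names `percent_identity` and `tally_exchanges` are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"  # the 20 standard residues


def percent_identity(a, b):
    """Percent identity over the ungapped columns of two aligned strings."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def tally_exchanges(aligned_pairs, min_identity=85.0):
    """Count observed amino acid exchanges in closely related aligned pairs.

    Assumption: exchanges are treated as undirected, so every observed
    difference x/y increments both (x, y) and (y, x).
    """
    counts = Counter()
    for a, b in aligned_pairs:
        if len(a) != len(b):
            raise ValueError("aligned sequences must have equal length")
        if percent_identity(a, b) < min_identity:
            continue  # pair is too divergent for this sketch
        for x, y in zip(a, b):
            if x == y or x not in AMINO_ACIDS or y not in AMINO_ACIDS:
                continue  # skip identities, gaps and non-standard symbols
            counts[(x, y)] += 1
            counts[(y, x)] += 1
    return counts


if __name__ == "__main__":
    pairs = [("ACDEFGHIKLMNPQRSTVWY", "ACDEFGHIKLMNPQRSTVWF")]
    print(tally_exchanges(pairs))
```

The 85% threshold is taken from the abstract; applying it per aligned pair, rather than during clustering, is a simplification made for this sketch.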

Citations

Journal Article•DOI•

MEGA5: Molecular Evolutionary Genetics Analysis using Maximum Likelihood, Evolutionary Distance, and Maximum Parsimony Methods

Koichiro Tamura¹, Daniel S. Peterson², Nicholas Peterson², Glen Stecher², Masatoshi Nei³, Sudhir Kumar²•Institutions (3)

Tokyo Metropolitan University¹, Arizona State University², Pennsylvania State University³

01 Oct 2011-Molecular Biology and Evolution

TL;DR: The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models, inferring ancestral states and sequences, and estimating evolutionary rates site-by-site.

Abstract: Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and inferring the nature and extent of selective forces shaping the evolution of genes and species. Here, we announce the release of Molecular Evolutionary Genetics Analysis version 5 (MEGA5), which is a user-friendly software for mining online databases, building sequence alignments and phylogenetic trees, and using methods of evolutionary bioinformatics in basic biology, biomedicine, and evolution. The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models (nucleotide or amino acid), inferring ancestral states and sequences (along with probabilities), and estimating evolutionary rates site-by-site. In computer simulation analyses, ML tree inference algorithms in MEGA5 compared favorably with other software packages in terms of computational efficiency and the accuracy of the estimates of phylogenetic trees, substitution parameters, and rate variation among sites. The MEGA user interface has now been enhanced to be activity driven to make it easier for the use of both beginners and experienced scientists. This version of MEGA is intended for the Windows platform, and it has been configured for effective use on Mac OS X and Linux desktops. It is available free of charge from http://www.megasoftware.net.

39,110 citations

Cites methods from "The rapid generation of mutation da..."

...MEGA5 automatically infers the evolutionary tree by the Neighbor-Joining (NJ) algorithm that uses a matrix of pairwise distances estimated under the Jones–Thornton–Taylor (JTT) model for amino acid sequences or the Tamura and Nei (1993) model for nucleotide sequences (Saitou and Nei 1987; Jones et al. 1992; Tamura and Nei 1993; Tamura et al. 2004)....

...The initial tree for the ML search can be supplied by the user (Newick format) or generated automatically by applying NJ and BIONJ algorithms to a matrix of pairwise distances estimated using a maximum composite likelihood approach for nucleotide sequences and a JTT model for amino acid sequences (Saitou and Nei 1987; Jones et al. 1992; Gascuel 1997; Tamura et al. 2004)....

Journal Article•DOI•

MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability

Kazutaka Katoh¹, Daron M. Standley¹•Institutions (1)

Osaka University¹

01 Apr 2013-Molecular Biology and Evolution

TL;DR: This version of MAFFT has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update.

Abstract: We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.

27,771 citations

Cites methods from "The rapid generation of mutation da..."

...For example, in a benchmark using simulated protein sequences (Löytynoja et al. 2012) generated by INDELiBLE (Fletcher and Yang 2009), when we tested a more stringent scoring matrix, JTT 1PAM (Jones et al. 1992) with weaker gap penalties than the default, the benchmark scores were considerably improved....

...The --bl, --jtt, and --tm options mean BLOSUM (Henikoff S and Henikoff JG 1992), JTT (Jones et al. 1992), and a transmembrane model (Jones et al. 1994), respectively....

Journal Article•DOI•

A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood.

Stéphane Guindon¹, Olivier Gascuel¹•Institutions (1)

Centre national de la recherche scientifique¹

01 Oct 2003-Systematic Biology

TL;DR: This work has used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches.

Abstract: The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/. (Algorithm; computer simulations; maximum likelihood; phylogeny; rbcL; RDPII project.)

The size of homologous sequence data sets has increased dramatically in recent years, and many of these data sets now involve several hundreds of taxa. Moreover, current probabilistic sequence evolution models (Swofford et al., 1996; Page and Holmes, 1998), notably those including rate variation among sites (Uzzell and Corbin, 1971; Jin and Nei, 1990; Yang, 1996), require an increasing number of calculations. Therefore, the speed of phylogeny reconstruction methods is becoming a significant requirement and good compromises between speed and accuracy must be found.

The maximum likelihood (ML) approach is especially accurate for building molecular phylogenies. Felsenstein (1981) brought this framework to nucleotide-based phylogenetic inference, and it was later also applied to amino acid sequences (Kishino et al., 1990). Several variants were proposed, most notably the Bayesian methods (Rannala and Yang 1996; and see below), and the discrete Fourier analysis of Hendy et al. (1994), for example. Numerous computer studies (Huelsenbeck and Hillis, 1993; Kuhner and Felsenstein, 1994; Huelsenbeck, 1995; Rosenberg and Kumar, 2001; Ranwez and Gascuel, 2002) have shown that ML programs can recover the correct tree from simulated data sets more frequently than other methods can. Another important advantage of the ML approach is the ability to compare different trees and evolutionary models within a statistical framework (see Whelan et al., 2001, for a review). However, like all optimality criterion-based phylogenetic reconstruction approaches, ML is hampered by computational difficulties, making it impossible to obtain the optimal tree with certainty from even moderate data sets (Swofford et al., 1996). Therefore, all practical methods rely on heuristics that obtain near-optimal trees in reasonable computing time. Moreover, the computation problem is especially difficult with ML, because the tree likelihood not only depends on the tree topology but also on numerical parameters, including branch lengths. Even computing the optimal values of these parameters on a single tree is not an easy task, particularly because of possible local optima (Chor et al., 2000).

The usual heuristic method, implemented in the popular PHYLIP (Felsenstein, 1993) and PAUP* (Swofford, 1999) packages, is based on hill climbing. It combines stepwise insertion of taxa in a growing tree and topological rearrangement. For each possible insertion position and rearrangement, the branch lengths of the resulting tree are optimized and the tree likelihood is computed. When the rearrangement improves the current tree or when the position insertion is the best among all possible positions, the corresponding tree becomes the new current tree. Simple rearrangements are used during tree growing, namely "nearest neighbor interchanges" (see below), while more intense rearrangements can be used once all taxa have been inserted. The procedure stops when no rearrangement improves the current best tree. Despite significant decreases in computing times, notably in fastDNAml (Olsen et al., 1994), this heuristic becomes impracticable with several hundreds of taxa. This is mainly due to the two-level strategy, which separates branch lengths and tree topology optimization. Indeed, most calculations are done to optimize the branch lengths and evaluate the likelihood of trees that are finally rejected.

New methods have thus been proposed. Strimmer and von Haeseler (1996) and others have assembled four-taxon (quartet) trees inferred by ML, in order to reconstruct a complete tree. However, the results of this approach have not been very satisfactory to date (Ranwez and Gascuel, 2001). Ota and Li (2000, 2001) described

16,261 citations

Cites background or methods from "The rapid generation of mutation da..."

...The Dayhoff (Dayhoff et al., 1978) and JTT (Jones et al., 1992) models for proteins are also available and run quickly, requiring about 3 min to analyze a data set comprising 50 mammalian sequences and 1,729 sites (F. Delsuc, pers. com.)....

Journal Article•DOI•

MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment

Sudhir Kumar¹, Koichiro Tamura², Masatoshi Nei³•Institutions (3)

Biodesign Institute¹, Tokyo Metropolitan University², Pennsylvania State University³

01 Jun 2004-Briefings in Bioinformatics

TL;DR: An overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA is provided.

Abstract: With its theoretical basis firmly established in molecular evolutionary and population genetics, the comparative DNA and protein sequence analysis plays a central role in reconstructing the evolutionary histories of species and multigene families, estimating rates of molecular evolution, and inferring the nature and extent of selective forces shaping the evolution of genes and genomes. The scope of these investigations has now expanded greatly owing to the development of high-throughput sequencing techniques and novel statistical and computational methods. These methods require easy-to-use computer programs. One such effort has been to produce Molecular Evolutionary Genetics Analysis (MEGA) software, with its focus on facilitating the exploration and analysis of the DNA and protein sequence variation from an evolutionary perspective. Currently in its third major release, MEGA3 contains facilities for automatic and manual sequence alignment, web-based mining of databases, inference of the phylogenetic trees, estimation of evolutionary distances and testing evolutionary hypotheses. This paper provides an overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA.

12,124 citations

Journal Article•DOI•

MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform

Kazutaka Katoh¹, Kazuharu Misawa, Kei-ichi Kuma¹, Takashi Miyata¹•Institutions (1)

Kyoto University¹

15 Jul 2002-Nucleic Acids Research

TL;DR: A simplified scoring system is proposed that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length.

Abstract: A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.

12,003 citations

Cites background from "The rapid generation of mutation da..."

...(22), Sop (gap opening penalty, defined below) is 2....

...(22), fa is the frequency of occurrence for amino acid a calculated by Jones et al....

...(22) with two modifications; 20 amino acids are grouped into six physico-chemical groups (24), and the number Tij of 6-tuples shared by sequence i and sequence j is counted....

References

Journal Article•DOI•

Basic Local Alignment Search Tool

Stephen F. Altschul¹, Warren Gish¹, Webb Miller², Eugene W. Myers³, David J. Lipman¹•Institutions (3)

National Institutes of Health¹, Pennsylvania State University², University of Arizona³

01 Oct 1990-Journal of Molecular Biology

TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal Article•DOI•

Amino acid substitutions in structurally related proteins a pattern recognition approach: Determination of a new and efficient scoring matrix

Jean-Loup Risler¹, M.O. Delorme¹, Hervé Delacroix¹, Alain Hénaut¹•Institutions (1)

Centre national de la recherche scientifique¹

20 Dec 1988-Journal of Molecular Biology

TL;DR: Amino acid substitutions in evolutionarily related proteins have been studied from a structural point of view and the distance matrix determined in this study seems to be very efficient for aligning distantly related protein sequences.

324 citations

Journal Article•DOI•

Tertiary structural constraints on protein evolutionary diversity: templates, key residues and structure prediction.

John P. Overington¹, Mark S. Johnson¹, Andrej Sali¹, Tom L. Blundell¹•Institutions (1)

Birkbeck, University of London¹

22 Aug 1990-Proceedings of The Royal Society B: Biological Sciences

TL;DR: A comparative analysis of families of homologous globular proteins to characterize and quantify the structural constraints and to identify ‘key’ residues if one or more structures are known.

Abstract: The pattern of residue substitution in divergently evolving families of globular proteins is highly variable. At each position in a fold there are constraints on the identities of amino acids from both the three-dimensional structure and the function of the protein. To characterize and quantify the structural constraints, we have made a comparative analysis of families of homologous globular proteins. Residues are classified according to amino acid type, secondary structure, accessibility of the sidechain, and existence of hydrogen bonds from sidechain to other sidechains or peptide carbonyl or amide functions. There are distinct patterns of substitution especially where residues are both solvent inaccessible and hydrogen bonded through their sidechains. The patterns of residue substitution can be used to construct templates or to identify ‘key’ residues if one or more structures are known. Conversely, analysis of conservation and substitution across a large family of aligned sequences in terms of substitution profiles can allow prediction of tertiary environment or indicate a functional role. Similar analyses can be used to test the validity of putative structures if several homologous sequences are available.

223 citations

Book Chapter•DOI•

Mutation data matrix and its uses.

David G. George, Winona C. Barker, Lois T. Hunt

01 Jan 1990-Methods in Enzymology

TL;DR: This chapter describes the mutation data matrix (MDM) and its application for comparing protein sequences and the concept of an alignment that defines the relationship between sequences on a residue-by-residue basis.

Abstract: This chapter describes the mutation data matrix (MDM) and its application for comparing protein sequences. Basic to all sequence comparison is the concept of an alignment that defines the relationship between sequences on a residue-by-residue basis. Sequence comparison methods use a scoring matrix that assigns a value to each possible pair of aligned amino acids. One of the most widely used similarity measures is the mutation data matrix (MDM) developed by Dayhoff and colleagues. The first MDM, published in 1968, was derived from over 400 accepted point mutations between present-day sequences and inferred ancestral sequences. Within the Markovian model, the MDM is derived from a transition probability matrix in which each matrix element gives the probability that amino acid A will be replaced by amino acid B in one unit of evolutionary change. The diagonal elements give the probabilities that the amino acids will remain unchanged. The probability of an amino acid being replaced is estimated as its relative mutability, which is calculated as the ratio of the number of observed changes of an amino acid to its total exposure to change.
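The relationship between exchange counts, relative mutability and the transition probability matrix can be sketched in Python as follows. This is an illustration under stated assumptions, not the chapter's procedure: `occurrences` (how often each amino acid was observed) stands in for "exposure to change", the scaling to one unit of change (1% of residues replaced, controlled by the hypothetical `pam` parameter) follows the usual Dayhoff-style normalisation, and all function names are hypothetical.

```python
def relative_mutability(counts, occurrences):
    """Observed changes of each amino acid divided by its exposure to change.

    counts[(a, b)] holds symmetric exchange counts; occurrences[a] is an
    assumed proxy for the exposure of amino acid a.
    """
    changes = {}
    for (a, _b), n in counts.items():
        changes[a] = changes.get(a, 0) + n
    return {a: changes.get(a, 0) / occ for a, occ in occurrences.items() if occ > 0}


def mutation_probability_matrix(counts, occurrences, pam=0.01):
    """Build M[(i, j)], the probability that amino acid j is replaced by i.

    Diagonal entries M[(j, j)] give the probability that j stays unchanged.
    The factor lam scales the matrix so that the expected fraction of
    residues changed in one step equals `pam` (an assumed normalisation).
    """
    m = relative_mutability(counts, occurrences)
    total = sum(occurrences[a] for a in m)
    freq = {a: occurrences[a] / total for a in m}
    weighted = sum(freq[a] * m[a] for a in m)
    if weighted == 0:
        raise ValueError("no observed changes to scale")
    lam = pam / weighted
    matrix = {}
    for j in m:
        col_total = sum(n for (a, _b), n in counts.items() if a == j)
        for i in m:
            if i == j:
                matrix[(i, j)] = 1.0 - lam * m[j]
            elif col_total > 0:
                matrix[(i, j)] = lam * m[j] * counts.get((j, i), 0) / col_total
            else:
                matrix[(i, j)] = 0.0
    return matrix


if __name__ == "__main__":
    counts = {("A", "S"): 2, ("S", "A"): 2}
    occurrences = {"A": 100, "S": 50}
    print(relative_mutability(counts, occurrences))
    print(mutation_probability_matrix(counts, occurrences))
```

Each column of the resulting matrix sums to 1, so repeated multiplication of the matrix by itself models larger evolutionary distances under the Markovian assumption mentioned above.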

105 citations

Book Chapter•DOI•

Maximum parsimony approach to construction of evolutionary trees from aligned homologous sequences.

John Czelusniak, Morris Goodman, Nancy D. Moncrief, Suzanne M. Kehoe

01 Jan 1990-Methods in Enzymology

TL;DR: The significance of maximum parsimony approach to the construction of evolutionary trees from aligned homologous sequences is described, which maximizes the genetic likenesses associated with common ancestry while minimizing the incidence of convergent mutations.

Abstract: This chapter describes the significance of the maximum parsimony approach to the construction of evolutionary trees from aligned homologous sequences. A maximum parsimony tree accounts for the evolutionary descent of related sequences by the fewest possible genic changes. Such a tree maximizes the genetic likenesses associated with common ancestry while minimizing the incidence of convergent mutations. Calculation of tree length is simplified by removing the root from the tree. Such an unrooted tree or network still retains the interior nodes and the exterior nodes (the OTUs). The maximum parsimony procedure can reconstruct ancestral sequences for each interior node of a tree but cannot determine which interior node or which pair of adjacent interior nodes is closest to the root. The problem of finding the maximum parsimony tree can be broken down into two parts. The first part proved to be easy and was solved by Fitch for homologous nucleotide sequences. The algorithm requires as input data both the OTUs, which are contemporary homologous nucleotide sequences already aligned against one another, and the instructions for a tree or dendrogram specifying any one of the possible dichotomous branching orders for the OTUs.

52 citations