Journal ArticleDOI

The rapid generation of mutation data matrices from protein sequences

01 Jun 1992-Bioinformatics (Oxford University Press)-Vol. 8, Iss: 3, pp 275-282
TL;DR: An efficient means for generating mutation data matrices from large numbers of protein sequences is presented, by means of an approximate peptide-based sequence comparison algorithm, which is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and is fast enough to generate a matrix from a specific family or class of proteins in minutes.
Abstract: An efficient means for generating mutation data matrices from large numbers of protein sequences is presented here. By means of an approximate peptide-based sequence comparison algorithm, the set of sequences is clustered at the 85% identity level. The most closely related pairs of sequences are aligned, and the observed amino acid exchanges are tallied in a matrix. The raw mutation frequency matrix is processed in a similar way to that described by Dayhoff et al. (1978), and so the resulting matrices may be easily used in current sequence analysis applications, in place of the standard mutation data matrices, which have not been updated for 13 years. The method is fast enough to process the entire SWISS-PROT databank in 20 h on a Sun SPARCstation 1, and is fast enough to generate a matrix from a specific family or class of proteins in minutes. Differences observed between our 250 PAM mutation data matrix and the matrix calculated by Dayhoff et al. are briefly discussed.
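The tallying step described in the abstract can be illustrated with a minimal sketch. It assumes the sequence pairs have already been clustered and aligned. The function name `tally_exchanges`, the gap character `-`, and the choice to count each exchange once per unordered residue pair are illustrative assumptions, not details taken from the paper.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def tally_exchanges(aligned_pairs):
    """Count observed amino acid exchanges over aligned sequence pairs.

    aligned_pairs: iterable of (seq_a, seq_b) strings of equal length,
    where '-' marks a gap (an assumption of this sketch).
    Returns a Counter keyed by alphabetically sorted residue pairs.
    """
    counts = Counter()
    for seq_a, seq_b in aligned_pairs:
        if len(seq_a) != len(seq_b):
            raise ValueError("aligned sequences must have equal length")
        for a, b in zip(seq_a, seq_b):
            # Skip identical positions, gaps, and non-standard symbols.
            if a == b or a not in AMINO_ACIDS or b not in AMINO_ACIDS:
                continue
            counts[tuple(sorted((a, b)))] += 1
    return counts


# Hypothetical toy alignments, used only to exercise the function.
pairs = [("ACDE-G", "ACEEFG"), ("MKV", "MRV")]
counts = tally_exchanges(pairs)
```

Storing unordered pairs keeps the raw count matrix symmetric, so an exchange is recorded without assuming which residue was ancestral.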
Citations
Journal ArticleDOI
TL;DR: The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models, inferring ancestral states and sequences, and estimating evolutionary rates site-by-site.
Abstract: Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and inferring the nature and extent of selective forces shaping the evolution of genes and species. Here, we announce the release of Molecular Evolutionary Genetics Analysis version 5 (MEGA5), which is a user-friendly software for mining online databases, building sequence alignments and phylogenetic trees, and using methods of evolutionary bioinformatics in basic biology, biomedicine, and evolution. The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models (nucleotide or amino acid), inferring ancestral states and sequences (along with probabilities), and estimating evolutionary rates site-by-site. In computer simulation analyses, ML tree inference algorithms in MEGA5 compared favorably with other software packages in terms of computational efficiency and the accuracy of the estimates of phylogenetic trees, substitution parameters, and rate variation among sites. The MEGA user interface has now been enhanced to be activity driven to make it easier for the use of both beginners and experienced scientists. This version of MEGA is intended for the Windows platform, and it has been configured for effective use on Mac OS X and Linux desktops. It is available free of charge from http://www.megasoftware.net.

39,110 citations


Cites methods from "The rapid generation of mutation da..."

  • ...MEGA5 automatically infers the evolutionary tree by the Neighbor-Joining (NJ) algorithm that uses a matrix of pairwise distances estimated under the Jones–Thornton–Taylor (JTT) model for amino acid sequences or the Tamura and Nei (1993) model for nucleotide sequences (Saitou and Nei 1987; Jones et al. 1992; Tamura and Nei 1993; Tamura et al. 2004)....


  • ...The initial tree for the ML search can be supplied by the user (Newick format) or generated automatically by applying NJ and BIONJ algorithms to a matrix of pairwise distances estimated using a maximum composite likelihood approach for nucleotide sequences and a JTT model for amino acid sequences (Saitou and Nei 1987; Jones et al. 1992; Gascuel 1997; Tamura et al. 2004)....


Journal ArticleDOI
TL;DR: This version of MAFFT has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update.
Abstract: We report a major update of the MAFFT multiple sequence alignment program. This version has several new features, including options for adding unaligned sequences into an existing alignment, adjustment of direction in nucleotide alignment, constrained alignment and parallel processing, which were implemented after the previous major update. This report shows actual examples to explain how these features work, alone and in combination. Some examples incorrectly aligned by MAFFT are also shown to clarify its limitations. We discuss how to avoid misalignments, and our ongoing efforts to overcome such limitations.

27,771 citations


Cites methods from "The rapid generation of mutation da..."


  • ...The --bl, --jtt, and --tm options mean BLOSUM (Henikoff S and Henikoff JG 1992), JTT (Jones et al. 1992), and a transmembrane model (Jones et al. 1994), respectively....


  • ...For example, in a benchmark using simulated protein sequences (Löytynoja et al. 2012) generated by INDELiBLE (Fletcher and Yang 2009), when we tested a more stringent scoring matrix, JTT 1PAM (Jones et al. 1992) with weaker gap penalties than the default, the benchmark scores were considerably improved....


Journal ArticleDOI
TL;DR: This work has used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches.
Abstract: The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/. (Algorithm; computer simulations; maximum likelihood; phylogeny; rbcL; RDPII project.)
The size of homologous sequence data sets has increased dramatically in recent years, and many of these data sets now involve several hundreds of taxa. Moreover, current probabilistic sequence evolution models (Swofford et al., 1996; Page and Holmes, 1998), notably those including rate variation among sites (Uzzell and Corbin, 1971; Jin and Nei, 1990; Yang, 1996), require an increasing number of calculations. Therefore, the speed of phylogeny reconstruction methods is becoming a significant requirement and good compromises between speed and accuracy must be found.
The maximum likelihood (ML) approach is especially accurate for building molecular phylogenies. Felsenstein (1981) brought this framework to nucleotide-based phylogenetic inference, and it was later also applied to amino acid sequences (Kishino et al., 1990). Several variants were proposed, most notably the Bayesian methods (Rannala and Yang 1996; and see below), and the discrete Fourier analysis of Hendy et al. (1994), for example. Numerous computer studies (Huelsenbeck and Hillis, 1993; Kuhner and Felsenstein, 1994; Huelsenbeck, 1995; Rosenberg and Kumar, 2001; Ranwez and Gascuel, 2002) have shown that ML programs can recover the correct tree from simulated data sets more frequently than other methods can. Another important advantage of the ML approach is the ability to compare different trees and evolutionary models within a statistical framework (see Whelan et al., 2001, for a review). However, like all optimality criterion-based phylogenetic reconstruction approaches, ML is hampered by computational difficulties, making it impossible to obtain the optimal tree with certainty from even moderate data sets (Swofford et al., 1996). Therefore, all practical methods rely on heuristics that obtain near-optimal trees in reasonable computing time. Moreover, the computation problem is especially difficult with ML, because the tree likelihood not only depends on the tree topology but also on numerical parameters, including branch lengths. Even computing the optimal values of these parameters on a single tree is not an easy task, particularly because of possible local optima (Chor et al., 2000).
The usual heuristic method, implemented in the popular PHYLIP (Felsenstein, 1993) and PAUP* (Swofford, 1999) packages, is based on hill climbing. It combines stepwise insertion of taxa in a growing tree and topological rearrangement. For each possible insertion position and rearrangement, the branch lengths of the resulting tree are optimized and the tree likelihood is computed. When the rearrangement improves the current tree or when the position insertion is the best among all possible positions, the corresponding tree becomes the new current tree. Simple rearrangements are used during tree growing, namely "nearest neighbor interchanges" (see below), while more intense rearrangements can be used once all taxa have been inserted. The procedure stops when no rearrangement improves the current best tree. Despite significant decreases in computing times, notably in fastDNAml (Olsen et al., 1994), this heuristic becomes impracticable with several hundreds of taxa. This is mainly due to the two-level strategy, which separates branch lengths and tree topology optimization. Indeed, most calculations are done to optimize the branch lengths and evaluate the likelihood of trees that are finally rejected.
New methods have thus been proposed. Strimmer and von Haeseler (1996) and others have assembled four-taxon (quartet) trees inferred by ML, in order to reconstruct a complete tree. However, the results of this approach have not been very satisfactory to date (Ranwez and Gascuel, 2001). Ota and Li (2000, 2001) described

16,261 citations


Cites background or methods from "The rapid generation of mutation da..."


  • ...The Dayhoff (Dayhoff et al., 1978) and JTT (Jones et al., 1992) models for proteins are also available and run quickly, requiring about 3 min to analyze a data set comprising 50 mammalian sequences and 1,729 sites (F. Delsuc, pers. com.)....


Journal ArticleDOI
TL;DR: An overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA is provided.
Abstract: With its theoretical basis firmly established in molecular evolutionary and population genetics, the comparative DNA and protein sequence analysis plays a central role in reconstructing the evolutionary histories of species and multigene families, estimating rates of molecular evolution, and inferring the nature and extent of selective forces shaping the evolution of genes and genomes. The scope of these investigations has now expanded greatly owing to the development of high-throughput sequencing techniques and novel statistical and computational methods. These methods require easy-to-use computer programs. One such effort has been to produce Molecular Evolutionary Genetics Analysis (MEGA) software, with its focus on facilitating the exploration and analysis of the DNA and protein sequence variation from an evolutionary perspective. Currently in its third major release, MEGA3 contains facilities for automatic and manual sequence alignment, web-based mining of databases, inference of the phylogenetic trees, estimation of evolutionary distances and testing evolutionary hypotheses. This paper provides an overview of the statistical methods, computational tools, and visual exploration modules for data input and the results obtainable in MEGA.

12,124 citations

Journal ArticleDOI
TL;DR: A simplified scoring system is proposed that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length.
Abstract: A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.

12,003 citations


Cites background from "The rapid generation of mutation da..."

  • ...(22), Sop (gap opening penalty, defined below) is 2....

  • ...(22), fa is the frequency of occurrence for amino acid a calculated by Jones et al....

  • ...(22) with two modifications; 20 amino acids are grouped into six physico-chemical groups (24), and the number Tij of 6-tuples shared by sequence i and sequence j is counted....

References
Journal ArticleDOI
TL;DR: A new approach to rapid sequence comparison, basic local alignment search tool (BLAST), directly approximates alignments that optimize a measure of local similarity, the maximal segment pair (MSP) score.

88,255 citations

Journal ArticleDOI
TL;DR: Amino acid substitutions in evolutionarily related proteins have been studied from a structural point of view and the distance matrix determined in this study seems to be very efficient for aligning distantly related protein sequences.

324 citations

Journal ArticleDOI
TL;DR: A comparative analysis of families of homologous globular proteins to characterize and quantify the structural constraints and to identify ‘key’ residues if one or more structures are known.
Abstract: The pattern of residue substitution in divergently evolving families of globular proteins is highly variable. At each position in a fold there are constraints on the identities of amino acids from both the three-dimensional structure and the function of the protein. To characterize and quantify the structural constraints, we have made a comparative analysis of families of homologous globular proteins. Residues are classified according to amino acid type, secondary structure, accessibility of the sidechain, and existence of hydrogen bonds from sidechain to other sidechains or peptide carbonyl or amide functions. There are distinct patterns of substitution, especially where residues are both solvent inaccessible and hydrogen bonded through their sidechains. The patterns of residue substitution can be used to construct templates or to identify 'key' residues if one or more structures are known. Conversely, analysis of conservation and substitution across a large family of aligned sequences in terms of substitution profiles can allow prediction of tertiary environment or indicate a functional role. Similar analyses can be used to test the validity of putative structures if several homologous sequences are available.

223 citations

Book ChapterDOI
TL;DR: This chapter describes the mutation data matrix (MDM) and its application for comparing protein sequences and the concept of an alignment that defines the relationship between sequences on a residue-by-residue basis.
Abstract: Publisher Summary This chapter describes the mutation data matrix (MDM) and its application for comparing protein sequences. Basic to all sequence comparison is the concept of an alignment that defines the relationship between sequences on a residue-by-residue basis. Sequence comparison methods use a scoring matrix that assigns a value to each possible pair of aligned amino acids. One of the most widely used similarity measures is the mutation data matrix (MDM) developed by Dayhoff and colleagues. The first MDM, published in 1968, was derived from over 400 accepted point mutations between present-day sequences and inferred ancestral sequences. Within the Markovian model, the MDM is derived from a transition probability matrix in which each matrix element gives the probability that amino acid A will be replaced by amino acid B in one unit of evolutionary change. The diagonal elements give the probabilities that the amino acids will remain unchanged. The probability of an amino acid being replaced is estimated as its relative mutability, which is calculated as the ratio of the number of observed changes of an amino acid to its total exposure to change.

105 citations
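The construction summarised in this chapter abstract (relative mutability as observed changes divided by exposure, and a transition probability matrix whose diagonal gives the probability of no change) can be sketched as follows. This is a minimal illustration of the standard Dayhoff-style construction, not code from the chapter. The function name, the argument layout, the scaling of one step to a 1% expected change, and the three-letter toy alphabet are all assumptions of the sketch.

```python
def mutation_probability_matrix(counts, exposure, freqs, step_fraction=0.01):
    """Build a one-step mutation probability matrix.

    counts[i][j]: symmetric exchange counts between residues i and j
                  (diagonal entries are ignored).
    exposure[j]:  total exposure of residue j to change.
    freqs[j]:     normalised frequency of residue j.
    step_fraction: expected fraction of residues changed in one step
                   (0.01 is an assumption of this sketch).
    Returns M, where M[i][j] is the probability that residue j is
    replaced by residue i in one step.
    """
    n = len(freqs)
    # Observed changes involving residue j.
    changes = [sum(counts[i][j] for i in range(n) if i != j) for j in range(n)]
    # Relative mutability: observed changes divided by exposure to change.
    mutability = [changes[j] / exposure[j] for j in range(n)]
    # Scale so the frequency-weighted chance of change equals step_fraction.
    lam = step_fraction / sum(freqs[j] * mutability[j] for j in range(n))
    matrix = [[0.0] * n for _ in range(n)]
    for j in range(n):
        for i in range(n):
            if i != j and changes[j] > 0:
                matrix[i][j] = lam * mutability[j] * counts[i][j] / changes[j]
        # Diagonal: probability that residue j remains unchanged.
        matrix[j][j] = 1.0 - lam * mutability[j]
    return matrix


# Hypothetical three-residue toy data, used only to exercise the function.
toy_counts = [[0, 30, 10], [30, 0, 20], [10, 20, 0]]
toy_exposure = [400, 500, 300]
toy_freqs = [0.4, 0.4, 0.2]
M = mutation_probability_matrix(toy_counts, toy_exposure, toy_freqs)
```

Each column of the result sums to 1, because the off-diagonal entries of a column distribute exactly the probability mass removed from its diagonal entry.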

Book ChapterDOI
TL;DR: The significance of maximum parsimony approach to the construction of evolutionary trees from aligned homologous sequences is described, which maximizes the genetic likenesses associated with common ancestry while minimizing the incidence of convergent mutations.
Abstract: Publisher Summary This chapter describes the significance of the maximum parsimony approach to the construction of evolutionary trees from aligned homologous sequences. A maximum parsimony tree accounts for the evolutionary descent of related sequences by the fewest possible genic changes. Such a tree maximizes the genetic likenesses associated with common ancestry while minimizing the incidence of convergent mutations. Calculation of tree length is simplified by removing the root from the tree. Such an unrooted tree or network still retains the interior nodes and the exterior nodes (the OTUs). The maximum parsimony procedure can reconstruct ancestral sequences for each interior node of a tree but cannot determine which interior node or which pair of adjacent interior nodes is closest to the root. The problem of finding the maximum parsimony tree can be broken down into two parts. The first part proved to be easy and was solved by Fitch for homologous nucleotide sequences. The algorithm requires as input data both the OTUs, which are contemporary homologous nucleotide sequences already aligned against one another, and the instructions for a tree or dendrogram specifying any one of the possible dichotomous branching orders for the OTUs.

52 citations