
# Masami Hasegawa

Other affiliations: Fudan University, Graduate University for Advanced Studies, National Institute of Genetics ...

Bio: Masami Hasegawa is an academic researcher from Toho University. The author has contributed to research in topics: Phylogenetic tree & Phylogenetics. The author has an h-index of 72 and has co-authored 199 publications receiving 33,107 citations. Previous affiliations of Masami Hasegawa include Fudan University and the Graduate University for Advanced Studies.

##### Papers


••

TL;DR: A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed, and this dating may pose a problem for the widely believed hypothesis that the bipedal creature Australopithecus afarensis, which lived some 3.7 million years ago, was ancestral to man and evolved after the human-ape splitting.

Abstract: A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed. This method takes into account effectively the information contained in a set of DNA sequence data. The molecular clock of mitochondrial DNA (mtDNA) was calibrated by setting the date of divergence between primates and ungulates at the Cretaceous-Tertiary boundary (65 million years ago), when the extinction of dinosaurs occurred. A generalized least-squares method was applied in fitting a model to mtDNA sequence data, and the clock gave dates of 92.3 +/- 11.7, 13.3 +/- 1.5, 10.9 +/- 1.2, 3.7 +/- 0.6, and 2.7 +/- 0.6 million years ago (where the second of each pair of numbers is the standard deviation) for the separation of mouse, gibbon, orangutan, gorilla, and chimpanzee, respectively, from the line leading to humans. Although there is some uncertainty in the clock, this dating may pose a problem for the widely believed hypothesis that the bipedal creature Australopithecus afarensis, which lived some 3.7 million years ago at Laetoli in Tanzania and at Hadar in Ethiopia, was ancestral to man and evolved after the human-ape splitting. Another likelier possibility is that mtDNA was transferred through hybridization between a proto-human and a proto-chimpanzee after the former had developed bipedalism.

8,124 citations

••

TL;DR: A modification of the KH test that takes into account a multiplicity of testings is presented; the test was designed for comparing two topologies but is often used for comparing many topologies.

Abstract: The maximum-likelihood method for inferring molecular phylogeny (Felsenstein 1981) is being widely used. The probabilistic model for generating the molecular sequences is specified by the substitution process and the tree topology. The parameters for the substitution process and the branch lengths are estimated by maximizing the likelihood, and then the tree topology is estimated by maximizing the maximized likelihood. To obtain the confidence limit of the topology, the test of Kishino and Hasegawa (1989), referred to as the KH test, is often used in practice. The same idea that is the basis for the KH test is also found in the statistical literature (Linhart 1988; Vuong 1989). The KH test was designed for comparing two topologies but is often used for comparing many topologies. This use of the KH test leads to overconfidence for a wrong tree, because the sampling error due to the selection of the topology is overlooked in it. In this note, we present a modification of the KH test to take into account a multiplicity of testings. Let a index the topologies and L

4,049 citations

••

TL;DR: A new method for estimating the variance of the difference between log likelihood of different tree topologies is developed by expressing it explicitly in order to evaluate the maximum likelihood branching order among Hominoidea.

Abstract: A maximum likelihood method for inferring evolutionary trees from DNA sequence data was developed by Felsenstein (1981). In evaluating the extent to which the maximum likelihood tree is a significantly better representation of the true tree, it is important to estimate the variance of the difference between log likelihood of different tree topologies. Bootstrap resampling can be used for this purpose (Hasegawa et al. 1988; Hasegawa and Kishino 1989), but it imposes a great computation burden. To overcome this difficulty, we developed a new method for estimating the variance by expressing it explicitly. The method was applied to DNA sequence data from primates in order to evaluate the maximum likelihood branching order among Hominoidea. It was shown that, although the orangutan is convincingly placed as an outgroup of a human and African apes clade, the branching order among human, chimpanzee, and gorilla cannot be determined confidently from the DNA sequence data presently available when the evolutionary rate constancy is not assumed.

3,157 citations
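The explicit variance estimate described in the abstract above can be illustrated with a small sketch. It assumes that per-site log-likelihoods for two topologies are already available (for example, exported from a phylogeny package) and that sites are independent. The function name `kh_statistic` and the normal approximation for the p-value are illustrative choices, not the authors' code.

```python
import math


def kh_statistic(site_lnl_a, site_lnl_b):
    """Compare two topologies from their per-site log-likelihoods.

    Returns (delta, variance, z, p_one_sided), where delta is the total
    log-likelihood difference (A minus B) and variance is estimated
    explicitly from the per-site differences rather than by bootstrap.
    """
    if len(site_lnl_a) != len(site_lnl_b):
        raise ValueError("both topologies need the same number of sites")
    n = len(site_lnl_a)
    if n < 2:
        raise ValueError("at least two sites are required")

    # Per-site differences in log-likelihood between the two topologies.
    diffs = [a - b for a, b in zip(site_lnl_a, site_lnl_b)]
    delta = sum(diffs)
    mean = delta / n

    # Variance of the summed difference, assuming independent sites.
    variance = n / (n - 1) * sum((d - mean) ** 2 for d in diffs)
    if variance == 0.0:
        raise ValueError("zero variance: the test statistic is undefined")

    z = delta / math.sqrt(variance)
    # One-sided p-value under a normal approximation (an assumption here).
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2.0))
    return delta, variance, z, p_one_sided


# Toy per-site values, invented purely for illustration.
delta, variance, z, p = kh_statistic(
    [-1.0, -2.0, -3.0, -4.0],
    [-1.5, -2.0, -3.5, -4.0],
)
print(delta, variance, z, p)
```

As the later abstract on multiple comparisons notes, a pairwise statistic of this kind is appropriate for two topologies chosen in advance; applying it to the best of many candidate trees overstates confidence.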

••

TL;DR: CONSEL is a program that assesses the confidence of tree selection by giving p-values for the trees; its main p-value, from the Approximately Unbiased (AU) test computed with the multi-scale bootstrap technique, is less biased than other conventional p-values.

Abstract: Summary: CONSEL is a program to assess the confidence of the tree selection by giving the p-values for the trees. The main thrust of the program is to calculate the p-value of the Approximately Unbiased (AU) test using the multi-scale bootstrap technique. This p-value is less biased than the other conventional p-values such as the Bootstrap Probability (BP), the Kishino-Hasegawa (KH) test, the Shimodaira-Hasegawa (SH) test, and the Weighted Shimodaira-Hasegawa (WSH) test. CONSEL calculates all these p-values from the output of the phylogeny program packages such as Molphy, PAML, and PAUP*. Furthermore, CONSEL is applicable to a wide class of problems where the BPs are available. Availability: The programs are written in C language. The source code for Unix and the executable binary for DOS are found at http://www.ism.ac.jp/~shimo/

2,037 citations

••

TL;DR: A phylogeny of chloroplast genomes inferred from 41 proteins and 8,303 amino acid sites indicates that at least two independent secondary endosymbiotic events have occurred involving red algae and that amino acid composition bias in chloroplast proteins strongly affects plastid genome phylogeny.

Abstract: Chloroplasts were once free-living cyanobacteria that became endosymbionts, but the genomes of contemporary plastids encode only ≈5–10% as many genes as those of their free-living cousins, indicating that many genes were either lost from plastids or transferred to the nucleus during the course of plant evolution. Previous estimates have suggested that between 800 and perhaps as many as 2,000 genes in the Arabidopsis genome might come from cyanobacteria, but genome-wide phylogenetic surveys that could provide direct estimates of this number are lacking. We compared 24,990 proteins encoded in the Arabidopsis genome to the proteins from three cyanobacterial genomes, 16 other prokaryotic reference genomes, and yeast. Of 9,368 Arabidopsis proteins sufficiently conserved for primary sequence comparison, 866 detected homologues only among cyanobacteria and 834 others branched with cyanobacterial homologues in phylogenetic trees. Extrapolating from these conserved proteins to the whole genome, the data suggest that ≈4,500 of Arabidopsis protein-coding genes (≈18% of the total) were acquired from the cyanobacterial ancestor of plastids. These proteins encompass all functional classes, and the majority of them are targeted to cell compartments other than the chloroplast. Analysis of 15 sequenced chloroplast genomes revealed 117 nuclear-encoded proteins that are also still present in at least one chloroplast genome. A phylogeny of chloroplast genomes inferred from 41 proteins and 8,303 amino acid sites indicates that at least two independent secondary endosymbiotic events have occurred involving red algae and that amino acid composition bias in chloroplast proteins strongly affects plastid genome phylogeny.

1,134 citations

##### Cited by


••

TL;DR: The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models, inferring ancestral states and sequences, and estimating evolutionary rates site-by-site.

Abstract: Comparative analysis of molecular sequence data is essential for reconstructing the evolutionary histories of species and inferring the nature and extent of selective forces shaping the evolution of genes and species. Here, we announce the release of Molecular Evolutionary Genetics Analysis version 5 (MEGA5), which is a user-friendly software for mining online databases, building sequence alignments and phylogenetic trees, and using methods of evolutionary bioinformatics in basic biology, biomedicine, and evolution. The newest addition in MEGA5 is a collection of maximum likelihood (ML) analyses for inferring evolutionary trees, selecting best-fit substitution models (nucleotide or amino acid), inferring ancestral states and sequences (along with probabilities), and estimating evolutionary rates site-by-site. In computer simulation analyses, ML tree inference algorithms in MEGA5 compared favorably with other software packages in terms of computational efficiency and the accuracy of the estimates of phylogenetic trees, substitution parameters, and rate variation among sites. The MEGA user interface has now been enhanced to be activity driven to make it easier for the use of both beginners and experienced scientists. This version of MEGA is intended for the Windows platform, and it has been configured for effective use on Mac OS X and Linux desktops. It is available free of charge from http://www.megasoftware.net.

39,110 citations

••

TL;DR: The program MODELTEST uses log likelihood scores to establish the model of DNA evolution that best fits the data.

Abstract: Summary: The program MODELTEST uses log likelihood scores to establish the model of DNA evolution that best fits the data. Availability: The MODELTEST package, including the source code and some documentation is available at http://bioag.byu.edu/zoology/crandall―lab/modeltest.html. Contact: dp47@email.byu.edu.

20,105 citations

••

TL;DR: This work has used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches.

Abstract: The increase in the number of large data sets and the complexity of current probabilistic sequence evolution models necessitates fast and reliable phylogeny reconstruction methods. We describe a new approach, based on the maximum-likelihood principle, which clearly satisfies these requirements. The core of this method is a simple hill-climbing algorithm that adjusts tree topology and branch lengths simultaneously. This algorithm starts from an initial tree built by a fast distance-based method and modifies this tree to improve its likelihood at each iteration. Due to this simultaneous adjustment of the topology and branch lengths, only a few iterations are sufficient to reach an optimum. We used extensive and realistic computer simulations to show that the topological accuracy of this new method is at least as high as that of the existing maximum-likelihood programs and much higher than the performance of distance-based and parsimony approaches. The reduction of computing time is dramatic in comparison with other maximum-likelihood packages, while the likelihood maximization ability tends to be higher. For example, only 12 min were required on a standard personal computer to analyze a data set consisting of 500 rbcL sequences with 1,428 base pairs from plant plastids, thus reaching a speed of the same order as some popular distance-based and parsimony algorithms. This new method is implemented in the PHYML program, which is freely available on our web page: http://www.lirmm.fr/w3ifa/MAAS/. (Algorithm; computer simulations; maximum likelihood; phylogeny; rbcL; RDPII project.)

The size of homologous sequence data sets has increased dramatically in recent years, and many of these data sets now involve several hundreds of taxa. Moreover, current probabilistic sequence evolution models (Swofford et al., 1996; Page and Holmes, 1998), notably those including rate variation among sites (Uzzell and Corbin, 1971; Jin and Nei, 1990; Yang, 1996), require an increasing number of calculations. Therefore, the speed of phylogeny reconstruction methods is becoming a significant requirement and good compromises between speed and accuracy must be found.

The maximum likelihood (ML) approach is especially accurate for building molecular phylogenies. Felsenstein (1981) brought this framework to nucleotide-based phylogenetic inference, and it was later also applied to amino acid sequences (Kishino et al., 1990). Several variants were proposed, most notably the Bayesian methods (Rannala and Yang 1996; and see below), and the discrete Fourier analysis of Hendy et al. (1994), for example. Numerous computer studies (Huelsenbeck and Hillis, 1993; Kuhner and Felsenstein, 1994; Huelsenbeck, 1995; Rosenberg and Kumar, 2001; Ranwez and Gascuel, 2002) have shown that ML programs can recover the correct tree from simulated data sets more frequently than other methods can. Another important advantage of the ML approach is the ability to compare different trees and evolutionary models within a statistical framework (see Whelan et al., 2001, for a review). However, like all optimality criterion-based phylogenetic reconstruction approaches, ML is hampered by computational difficulties, making it impossible to obtain the optimal tree with certainty from even moderate data sets (Swofford et al., 1996). Therefore, all practical methods rely on heuristics that obtain near-optimal trees in reasonable computing time. Moreover, the computation problem is especially difficult with ML, because the tree likelihood not only depends on the tree topology but also on numerical parameters, including branch lengths. Even computing the optimal values of these parameters on a single tree is not an easy task, particularly because of possible local optima (Chor et al., 2000).

The usual heuristic method, implemented in the popular PHYLIP (Felsenstein, 1993) and PAUP* (Swofford, 1999) packages, is based on hill climbing. It combines stepwise insertion of taxa in a growing tree and topological rearrangement. For each possible insertion position and rearrangement, the branch lengths of the resulting tree are optimized and the tree likelihood is computed. When the rearrangement improves the current tree or when the position insertion is the best among all possible positions, the corresponding tree becomes the new current tree. Simple rearrangements are used during tree growing, namely "nearest neighbor interchanges" (see below), while more intense rearrangements can be used once all taxa have been inserted. The procedure stops when no rearrangement improves the current best tree. Despite significant decreases in computing times, notably in fastDNAml (Olsen et al., 1994), this heuristic becomes impracticable with several hundreds of taxa. This is mainly due to the two-level strategy, which separates branch lengths and tree topology optimization. Indeed, most calculations are done to optimize the branch lengths and evaluate the likelihood of trees that are finally rejected.

New methods have thus been proposed. Strimmer and von Haeseler (1996) and others have assembled four-taxon (quartet) trees inferred by ML, in order to reconstruct a complete tree. However, the results of this approach have not been very satisfactory to date (Ranwez and Gascuel, 2001). Ota and Li (2000, 2001) described

16,261 citations

••

TL;DR: A new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves and a new test to assess the support of the data for internal branches of a phylogeny are introduced.

Abstract: PhyML is a phylogeny software based on the maximum-likelihood principle. Early PhyML versions used a fast algorithm performing nearest neighbor interchanges to improve a reasonable starting tree topology. Since the original publication (Guindon S., Gascuel O. 2003. A simple, fast and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol. 52:696-704), PhyML has been widely used (>2500 citations in ISI Web of Science) because of its simplicity and a fair compromise between accuracy and speed. In the meantime, research around PhyML has continued, and this article describes the new algorithms and methods implemented in the program. First, we introduce a new algorithm to search the tree space with user-defined intensity using subtree pruning and regrafting topological moves. The parsimony criterion is used here to filter out the least promising topology modifications with respect to the likelihood function. The analysis of a large collection of real nucleotide and amino acid data sets of various sizes demonstrates the good performance of this method. Second, we describe a new test to assess the support of the data for internal branches of a phylogeny. This approach extends the recently proposed approximate likelihood-ratio test and relies on a nonparametric, Shimodaira-Hasegawa-like procedure. A detailed analysis of real alignments sheds light on the links between this new approach and the more classical nonparametric bootstrap method. Overall, our tests show that the last version (3.0) of PhyML is fast, accurate, stable, and ready to use. A Web server and binary files are available from http://www.atgc-montpellier.fr/phyml/.

14,385 citations

••

Kyoto University

TL;DR: A simplified scoring system is proposed that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length.

Abstract: A multiple sequence alignment program, MAFFT, has been developed. The CPU time is drastically reduced as compared with existing methods. MAFFT includes two novel techniques. (i) Homologous regions are rapidly identified by the fast Fourier transform (FFT), in which an amino acid sequence is converted to a sequence composed of volume and polarity values of each amino acid residue. (ii) We propose a simplified scoring system that performs well for reducing CPU time and increasing the accuracy of alignments even for sequences having large insertions or extensions as well as distantly related sequences of similar length. Two different heuristics, the progressive method (FFT-NS-2) and the iterative refinement method (FFT-NS-i), are implemented in MAFFT. The performances of FFT-NS-2 and FFT-NS-i were compared with other methods by computer simulations and benchmark tests; the CPU time of FFT-NS-2 is drastically reduced as compared with CLUSTALW with comparable accuracy. FFT-NS-i is over 100 times faster than T-COFFEE, when the number of input sequences exceeds 60, without sacrificing the accuracy.

12,003 citations
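The first technique in the MAFFT abstract above, locating homologous regions by correlating numeric encodings of two sequences with the FFT, can be sketched as follows. This is a minimal illustration and not MAFFT's implementation: it uses one scalar per residue instead of separate volume and polarity components, and the `TOY_SCALE` values and all function names are hypothetical placeholders.

```python
import cmath

# Hypothetical per-residue values, for illustration only; these are NOT
# the volume or polarity values that MAFFT uses.
TOY_SCALE = {"G": 0.5, "A": 1.0, "K": 2.0, "W": 3.0}


def encode(sequence):
    """Turn an amino acid string into a list of numeric values."""
    return [TOY_SCALE[residue] for residue in sequence]


def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two.

    The inverse transform is left unnormalised (caller divides by n).
    """
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def cross_correlation(x, y):
    """Return {lag k: sum_i x[i] * y[i + k]} computed via the FFT."""
    size = 1
    while size < len(x) + len(y):  # zero-pad to avoid wrap-around
        size *= 2
    fx = fft([complex(v) for v in x] + [0j] * (size - len(x)))
    fy = fft([complex(v) for v in y] + [0j] * (size - len(y)))
    product = [a.conjugate() * b for a, b in zip(fx, fy)]
    circular = [v.real / size for v in fft(product, inverse=True)]
    # Negative lags are stored at the end of the circular result.
    return {k: circular[k % size] for k in range(-(len(x) - 1), len(y))}


def best_lag(correlation):
    """Lag with the highest correlation: a candidate homologous offset."""
    return max(correlation, key=correlation.get)


corr = cross_correlation(encode("AKW"), encode("GGAKW"))
print(best_lag(corr))
```

A peak in the correlation only suggests an offset at which the two sequences may share a similar region; an aligner would still need to confirm and extend the match with its scoring system.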