Showing papers on "Tree rearrangement" published in 2010


Journal ArticleDOI
TL;DR: Although the pseudo-likelihood is derived from coalescent theory, and assumes no gene flow or horizontal gene transfer (HGT), the MP-EST method is robust to a small amount of HGT in the dataset and can consistently estimate the topology and branch lengths of the species tree.
Abstract: Several phylogenetic approaches have been developed to estimate species trees from collections of gene trees. However, maximum likelihood approaches for estimating species trees under the coalescent model are limited. Although the likelihood of a species tree under the multispecies coalescent model has already been derived by Rannala and Yang, it can be shown that the maximum likelihood estimate (MLE) of the species tree (topology, branch lengths, and population sizes) from gene trees under this formula does not exist. In this paper, we develop a pseudo-likelihood function of the species tree to obtain maximum pseudo-likelihood estimates (MPE) of species trees, with branch lengths of the species tree in coalescent units. We show that the MPE of the species tree is statistically consistent as the number M of genes goes to infinity. In addition, the probability that the MPE of the species tree matches the true species tree converges to 1 at rate O(M^{-1}). The simulation results confirm that the maximum pseudo-likelihood approach is statistically consistent even when the species tree is in the anomaly zone. We applied our method, Maximum Pseudo-likelihood for Estimating Species Trees (MP-EST) to a mammal dataset. The four major clades found in the MP-EST tree are consistent with those in the Bayesian concatenation tree. The bootstrap supports for the species tree estimated by the MP-EST method are more reasonable than the posterior probability supports given by the Bayesian concatenation method in reflecting the level of uncertainty in gene trees and controversies over the relationship of four major groups of placental mammals. MP-EST can consistently estimate the topology and branch lengths (in coalescent units) of the species tree. Although the pseudo-likelihood is derived from coalescent theory, and assumes no gene flow or horizontal gene transfer (HGT), the MP-EST method is robust to a small amount of HGT in the dataset. In addition, increasing the number of genes does not increase the computational time substantially. The MP-EST method is fast for analyzing datasets that involve a large number of genes but a moderate number of species.

599 citations


Journal ArticleDOI
Tanja Stadler
TL;DR: The derived tree density can be used as a tree prior in a Bayesian method to reconstruct the evolutionary past of the sequence data on a calendar timescale, and for simulating trees with a given number of sampled extant and extinct individuals, which is essential for testing evolutionary hypotheses for the considered datasets.

377 citations


Journal ArticleDOI
TL;DR: This paper quantifies the difficulty of jointly finding the division of samples to species and estimating a species tree without constraining the possible assignments a priori and introduces a parametric and a nonparametric method to do this delimitation and tree inference using individual gene trees as input.
Abstract: Species delimitation and species tree inference are difficult problems in cases of recent divergence, especially when different loci have different histories. This paper quantifies the difficulty of jointly finding the division of samples to species and estimating a species tree without constraining the possible assignments a priori. It introduces a parametric and a nonparametric method, including new heuristic search strategies, to do this delimitation and tree inference using individual gene trees as input. The new methods were evaluated using thousands of simulations and 4 empirical data sets. These analyses suggest that the new methods, especially the nonparametric one, may provide useful insights for systematists working at the species level with molecular data. However, they still often return incorrect results.

266 citations


Book
01 Jan 2010
TL;DR: This book covers concepts, models, and methods for estimating species trees, including Bayesian estimation of species trees, reconstruction of concordance trees, minimizing deep coalescences, accommodating hybridization, and sampling strategies, together with empirical case studies.
Abstract: Preface. Contributors. Chapter 1 Estimating Species Trees: An Introduction to Concepts and Models (L. Lacey Knowles and Laura S. Kubatko). 1.1 Introduction. 1.1.1 Different Tree Types and Their Relationship to Phylogeny. 1.2 The Relationship Between Gene Trees and Species Trees. 1.2.1 Evolutionary Mechanisms for Gene Tree Discord. 1.2.2 The Coalescent Process and Gene Tree Distributions. 1.2.3 Phylogenetic Extensions of the Coalescent Model. 1.3 The Relationship Between Sequence Data and Gene Trees. 1.3.1 Modeling DNA Sequence Evolution along a Gene Tree. 1.4 Statistical Inference of Species Trees. 1.4.1 ML. 1.4.2 Bayesian Analysis. 1.5 Collecting DNA Sequence Data. 1.6 Conclusions. References. Chapter 2 Bayesian Estimation of Species Trees: A Practical Guide to Optimal Sampling and Analysis (Santiago Castillo-Ramirez, Liang Liu, Dennis Pearl and Scott V. Edwards). 2.1 Introduction. 2.1.1 Empirical Examples Using BEST. 2.2 Factors Influencing Confidence in Estimated Species Trees Using BEST. 2.2.1 Simulation Protocol. 2.2.2 Results of Simulations on Number and Length of Loci. 2.2.3 Multifactorial Prediction of Confidence in Species Trees. 2.2.4 Effect of the Number of Alleles Sampled per Locus on Species Tree Estimation. 2.2.5 Effect of Recombination on Species Tree Inference. 2.3 Some Tips on Running the BEST MCMC Algorithm. 2.4 Conclusions and Challenges. Acknowledgments. References. Chapter 3 Reconstructing Concordance Trees and Testing the Coalescent Model from Genome-Wide Data Sets (Cecile Ane). 3.1 Introduction. 3.2 BCA: Background. 3.2.1 Sharing of Information across Gene Trees. 3.2.2 How to Choose the A Priori Level of Discordance alpha. 3.2.3 The Choice of an Infinite alpha in BCA. 3.2.4 A Nonparametric Prior Distribution on Gene Trees. 3.3 Genomic Support versus Statistical Support. 3.4 Comparing CFs of Contradicting Clades for Reconstructing the Dominant History. 3.5 Testing the Hypothesis That All Discordance is Due to ILS. 
3.6 Species Tree Reconstruction from CFs. 3.7 The Challenge of Determining Loci on Whole-Genome Alignments. 3.7.1 The Assumption of Homogeneous, Unlinked Loci for GT/ST Reconstruction. 3.7.2 Detecting Recombination Breakpoints for GT/ST Reconstruction. 3.7.3 A Minimum Description Length (MDL) Information Criterion. 3.7.4 Comparisons with Other Partitioning Criteria. Acknowledgments. References. Chapter 4 Probabilities of Gene Tree Topologies with Intraspecific Sampling Given a Species Tree (James H. Degnan). 4.1 Introduction. 4.2 Background and Terminology. 4.2.1 Incomplete Lineage Sorting. 4.2.2 Notation. 4.3 Gene Tree Topology Probabilities-Theory. 4.3.1 Enumerating Coalescent Histories. 4.3.2 The Probability of a Coalescent History. 4.3.3 Probability Mass Function for Gene Tree Topologies. 4.4 Gene Tree Topology Probabilities-Examples. 4.4.1 Enumeration of Coalescent Histories. 4.4.2 Calculation of Probabilities of Coalescent Histories. 4.5 Applications. 4.5.1 Probabilities of Multilabeled Trees. 4.5.2 Probability of Monophyletic Concordance. 4.5.3 AGTs. 4.6 Conclusions. References. Appendix: Using Coal. Using the Software. Setting Up Species Tree Branch Lengths. Chapter 5 Inference of Parsimonious Species Tree from Multilocus Data by Minimizing Deep Coalescences (Cuong Than and Luay Nakhleh). 5.1 Introduction. 5.2 Trees, Clusters, and the Compatibility Graph. 5.3 Valid Coalescent Histories, Extra Lineages, and the MDC Criterion. 5.4 Exact Algorithms for the MDC Problem. 5.4.1 An ILP Algorithm. 5.4.2 A DP Algorithm. 5.5 Handling Special Cases. 5.5.1 Multiple Individuals per Species. 5.5.2 Nonbinary Trees. 5.6 Performance of MDC. 5.7 Inference from The Clusters of The Gene Trees. 5.8 Using PhyloNet. 5.8.1 Using PhyloNet to Count Valid Coalescent Histories. 5.8.2 Using PhyloNet to Infer Species Trees Under MDC. 5.9 Conclusions. Acknowledgments. References. 
Chapter 6 Accommodating Hybridization in a Multilocus Phylogenetic Framework (Laura S. Kubatko and Chen Meng). 6.1 Introduction. 6.2 Methods for Detecting Hybridization in The Presence of Incomplete Lineage Sorting. 6.3 A Phylogenetic Model for Hybridization in The Presence of Incomplete Lineage Sorting. 6.3.1 Estimation and Testing for the Hybridization Parameters: Gene Tree Data. 6.3.2 Estimation and Testing for the Hybridization Parameters: Sequence Data. 6.3.3 Comparison of Hybrid Species Phylogenies Using Gene Tree Data. 6.4 Application: Hybridization in the Heliconius Butterflies. 6.4.1 Estimation and Testing for the Hybridization Parameters: Application to the Estimated Gene Trees in Heliconius. 6.4.2 Estimation and Testing for the Hybridization Parameters: Application to Sequence Data in Heliconius. 6.4.3 Comparison of Hybrid Species Phylogenies for the Heliconius Gene Tree Data. 6.5 Conclusions and Future Directions. Acknowledgment. References. Chapter 7 The Influence of Hybrid Zones on Species Tree Inference in Manakins (Robb T. Brumfield and Matthew D. Carling). 7.1 Introduction. 7.2 The Manacus Manakins. 7.2.1 Distribution. 7.2.2 Hybrid Zone between M. vitellinus and M. candei. 7.2.3 Two Contact Zones between M. vitellinus and M. manacus. 7.2.4 Inferring a Manacus Species Tree. 7.3 Is Introgression Across the Hybrid Zones Influencing the Species Tree Inference? 7.4 Conclusions. Acknowledgments. References. Chapter 8 Summarizing Gene Tree Incongruence at Multiple Phylogenetic Depths (Karen A. Cranston). 8.1 Introduction. 8.2 Sample Data: Rice, Flies, and Yeast. 8.3 Bayesian Inference of Gene Trees. 8.4 Detecting Convergence Across Hundreds of Genes. 8.5 A Note on Combining Trees. 8.6 BCA. 8.7 gsi. 8.8 Triplet Analysis. 8.9 Missing Data. 8.10 Genomic Distribution of Gene Tree Incongruence. 8.11 Visualization of Gene Tree Incongruence. 8.12 Concluding Remarks. Acknowledgments. References. Chapter 9 Species Tree Estimation for Complex Divergence Histories: A Case Study in Neodiprion Sawflies (Catherine R. Linnen). 9.1 Introduction. 
9.2 Study System: Neodiprion Sawflies. 9.3 Sampling Strategy. 9.4 Determining the Source of Mitonuclear Discordance. 9.5 Approaches for Species Tree Estimation. 9.5.1 Concatenation with Monophyly Constraints (CMC). 9.5.2 Minimize Deep Coalescences (MDC). 9.5.3 Shallowest Divergences (SD). 9.5.4 Bayesian Estimation of Species Trees (BEST). 9.6 Comparison of Species Tree Estimates. 9.7 Comparison of Gene Trees to Species Trees. 9.8 Conclusions and Future Directions. References. Chapter 10 Sampling Strategies for Species Tree Estimation (L. Lacey Knowles). 10.1 Introduction. 10.2 Information Content in DNA Sequences for Species Tree Inference. 10.3 Why Phylogenetic History Dictates Appropriate Sampling Strategy. 10.4 Properties of the Data That Impact Sampling Decisions. 10.5 Making Informed Decisions about Sampling Strategies. 10.5.1 Where Does the Initial Species Tree Come from? 10.5.2 Is There Consistency in the Estimated Species Tree Given the Data? 10.6 Summary. Acknowledgments. References. Chapter 11 Developing Nuclear Sequences for Species Tree Estimation in Nonmodel Organisms: Insights from a Case Study of Bottae's Pocket Gopher, Thomomys Bottae (Natalia M. Belfiore). 11.1 Introduction. 11.2 Pocket Gophers. 11.3 Marker Generation Approach and Methodological Comments. 11.3.1 Library Construction. 11.3.2 Subtraction of High-Copy-Number Regions. 11.3.3 Locus Characterization by Genomic Approaches. 11.3.4 Primer Design Experiments. 11.3.5 Locus Evaluation for Inclusion in the Study. 11.3.6 Variation within the Library Construction Species. 11.3.7 Inclusion of Loci and Data Generation within the Genus. 11.4 Data Management and Analysis. 11.4.1 Handling Data and Choosing Analysis Programs. 11.4.2 Phylogenetic Analysis. 11.5 Conclusions. Acknowledgments. References. Chapter 12 Estimating Species Relationships and Taxon Distinctiveness in Sistrurus Rattlesnakes Using Multilocus Data (Laura S. Kubatko and H. Lisle Gibbs). 12.1 Introduction. 
12.1.1 Sistrurus Rattlesnakes. 12.2 Analysis of Species and Subspecific Relationships. 12.2.1 Estimation of the Species Phylogeny. 12.2.2 Distinctiveness of Subspecies. 12.2.3 Phased versus Unphased Data. 12.3 Species Tree Estimation. 12.3.1 Estimation Using Gene Trees as Data. 12.3.2 Estimation Using Sequences as Data. 12.4 Distinctiveness Among Species and Subspecies. 12.4.1 Phased Data. 12.4.2 Unphased Data and the Effect of Sample Size. 12.5 Evolutionary and Conservation Implications. 12.6 Conclusions. Acknowledgments. References. Index.

178 citations


Journal ArticleDOI
TL;DR: The bias produced by the simple sampling approach (SSA) is explored, an alternative general sampling approach (GSA) is provided that can be applied to most other models, and a constant-rate birth-death model sampling approach is introduced that samples trees very efficiently from a widely used class of models.
Abstract: A wide range of evolutionary models for species-level (and higher) diversification have been developed. These models can be used to test evolutionary hypotheses and provide comparisons with phylogenetic trees constructed from real data. To carry out these tests and comparisons, it is often necessary to sample, or simulate, trees from the evolutionary models. Sampling trees from these models is more complicated than it may appear at first glance, necessitating careful consideration and mathematical rigor. Seemingly straightforward sampling methods may produce trees that have systematically biased shapes or branch lengths. This is particularly problematic as there is no simple method for determining whether the sampled trees are appropriate. In this paper, we show why a commonly used simple sampling approach (SSA)-simulating trees forward in time until n species are first reached-should only be applied to the simplest pure birth model, the Yule model. We provide an alternative general sampling approach (GSA) that can be applied to most other models. Furthermore, we introduce the constant-rate birth-death model sampling approach, which samples trees very efficiently from a widely used class of models. We explore the bias produced by SSA and identify situations in which this bias is particularly pronounced. We show that using SSA can lead to erroneous conclusions: When using the inappropriate SSA, the variance of a gradually evolving trait does not correlate with the age of the tree; when the correct GSA is used, the trait variance correlates with tree age. The algorithms presented here are available in the Perl Bio::Phylo package, as a stand-alone program TreeSample, and in the R TreeSim package.

98 citations
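
As a rough illustration of the simple sampling approach (SSA) described in the abstract above, the sketch below grows a pure-birth (Yule) tree forward in time and stops as soon as n species are first reached, which is the one setting where the abstract says SSA is appropriate. The function name, the event representation, and the lineage ids are hypothetical choices made for this example; this is not the code of Bio::Phylo, TreeSample, or TreeSim.

```python
import random


def simulate_yule_ssa(n, birth_rate, rng):
    """Grow a Yule tree forward in time until n species first exist.

    Returns (events, tips): events is a list of
    (time, parent_id, (child_id, child_id)) tuples in time order,
    and tips is the list of lineage ids alive when the simulation stops.
    """
    if n < 1 or birth_rate <= 0:
        raise ValueError("need n >= 1 and birth_rate > 0")
    t = 0.0
    lineages = [0]      # ids of the lineages currently alive
    next_id = 1
    events = []
    while len(lineages) < n:
        # With k lineages alive, the wait to the next speciation
        # is exponential with rate k * birth_rate.
        t += rng.expovariate(len(lineages) * birth_rate)
        # Every living lineage is equally likely to be the one that splits.
        parent = lineages.pop(rng.randrange(len(lineages)))
        children = (next_id, next_id + 1)
        next_id += 2
        lineages.extend(children)
        events.append((t, parent, children))
    return events, lineages


if __name__ == "__main__":
    events, tips = simulate_yule_ssa(6, 1.0, random.Random(0))
    for time, parent, children in events:
        print(f"t={time:.3f}: lineage {parent} splits into {children}")
    print("tips:", tips)
```

The sketch deliberately omits extinction. According to the abstract, once a death rate is added, stopping at the first time n species are reached biases the sampled trees, which is the case the GSA is meant to handle.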


Journal ArticleDOI
TL;DR: iGTP enables, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication, duplication-loss, and deep coalescence reconciliation costs, all from within a convenient graphical user interface.
Abstract: The ever-increasing wealth of genomic sequence information provides an unprecedented opportunity for large-scale phylogenetic analysis. However, species phylogeny inference is obfuscated by incongruence among gene trees due to evolutionary events such as gene duplication and loss, incomplete lineage sorting (deep coalescence), and horizontal gene transfer. Gene tree parsimony (GTP) addresses this issue by seeking a species tree that requires the minimum number of evolutionary events to reconcile a given set of incongruent gene trees. Despite its promise, the use of gene tree parsimony has been limited by the fact that existing software is either not fast enough to tackle large data sets or is restricted in the range of evolutionary events it can handle. We introduce iGTP, a platform-independent software program that implements state-of-the-art algorithms that greatly speed up species tree inference under the duplication, duplication-loss, and deep coalescence reconciliation costs. iGTP significantly extends and improves the functionality and performance of existing gene tree parsimony software and offers advanced features such as building effective initial trees using stepwise leaf addition and the ability to have unrooted gene trees in the input. Moreover, iGTP provides a user-friendly graphical interface with integrated tree visualization software to facilitate analysis of the results. iGTP enables, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication, duplication-loss, and deep coalescence reconciliation costs, all from within a convenient graphical user interface.

97 citations


Journal ArticleDOI
TL;DR: This article shows that Tree Containment is polynomial-time solvable for normal networks, for binary tree-child networks, and for level-k networks, and shows that, even for tree-sibling, time-consistent, regular networks, both Tree Containment and Cluster Containment remain NP-complete.

87 citations


Journal ArticleDOI
TL;DR: It is found that SMRT-ML converges to the correct species tree in many cases in which ML on the full concatenated data set fails to do so, and is therefore a computationally efficient and statistically consistent estimator of the species tree when gene trees are distributed according to the multispecies coalescent model.
Abstract: Concatenated sequence alignments are often used to infer species-level relationships. Previous studies have shown that analysis of concatenated data using maximum likelihood (ML) can produce misleading results when loci have differing gene tree topologies due to incomplete lineage sorting. Here, we develop a polynomial time method that utilizes the modified mincut supertree algorithm to construct an estimated species tree from inferred rooted triples of concatenated alignments. We term this method SuperMatrix Rooted Triple (SMRT) and use the notation SMRT-ML when rooted triples are inferred by ML. We use simulations to investigate the performance of SMRT-ML under Jukes–Cantor and general time-reversible substitution models for four- and five-taxon species trees and also apply the method to an empirical data set of yeast genes. We find that SMRT-ML converges to the correct species tree in many cases in which ML on the full concatenated data set fails to do so. SMRT-ML can be conservative in that its output tree is often partially unresolved for problematic clades. We show analytically that when the species tree is clocklike and mutations occur under the Cavender–Farris–Neyman substitution model, as the number of genes increases, SMRT-ML is increasingly likely to infer the correct species tree even when the most likely gene tree does not match the species tree. SMRT-ML is therefore a computationally efficient and statistically consistent estimator of the species tree when gene trees are distributed according to the multispecies coalescent model.

79 citations


Journal ArticleDOI
TL;DR: It can be shown that the MT is a consistent estimator of the species tree even when the MT is built upon the estimates of the true gene trees, if the gene tree estimates are statistically consistent.
Abstract: We propose a model based approach to use multiple gene trees to estimate the species tree. The coalescent process requires that gene divergences occur earlier than species divergences when there is any polymorphism in the ancestral species. Under this scenario, speciation times are restricted to be smaller than the corresponding gene split times. The maximum tree (MT) is the tree with the largest possible speciation times in the space of species trees restricted by available gene trees. If all populations have the same population size, the MT is the maximum likelihood estimate of the species tree. It can be shown the MT is a consistent estimator of the species tree even when the MT is built upon the estimates of the true gene trees if the gene tree estimates are statistically consistent. The MT converges in probability to the true species tree at an exponential rate.

74 citations
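
One way to read the maximum tree (MT) definition in the abstract above: each gene tree caps the divergence time of every species pair at that pair's coalescence time, so the largest admissible speciation time for a pair is the minimum over all gene trees, and the tree itself can then be assembled by single-linkage clustering on those minima. The sketch below follows that reading. It is an interpretation rather than the authors' implementation, and the input format (one dict of pairwise coalescence times per gene tree, one allele per species, every pair present) and the function names are assumptions made for this example.

```python
from itertools import combinations


def min_pairwise_times(gene_tree_times):
    """For each species pair, keep the smallest coalescence time seen in any gene tree."""
    bound = {}
    for times in gene_tree_times:
        for pair, t in times.items():
            key = frozenset(pair)
            bound[key] = min(t, bound.get(key, float("inf")))
    return bound


def maximum_tree(species, bound):
    """Single-linkage clustering on the pairwise bounds.

    Returns a list of (speciation_time, merged_clade) in merge order.
    """
    clusters = [frozenset([s]) for s in species]
    merges = []
    while len(clusters) > 1:
        # Distance between two clades: the smallest bound across the clades.
        t, x, y = min(
            (
                (min(bound[frozenset((a, b))] for a in x for b in y), x, y)
                for x, y in combinations(clusters, 2)
            ),
            key=lambda item: item[0],
        )
        clusters = [c for c in clusters if c not in (x, y)] + [x | y]
        merges.append((t, x | y))
    return merges


if __name__ == "__main__":
    # Two hypothetical gene trees on species A, B, C (times in arbitrary units).
    genes = [
        {("A", "B"): 1.0, ("A", "C"): 3.0, ("B", "C"): 3.0},
        {("A", "B"): 2.5, ("A", "C"): 2.0, ("B", "C"): 2.5},
    ]
    for t, clade in maximum_tree("ABC", min_pairwise_times(genes)):
        print(t, sorted(clade))
```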


Journal ArticleDOI
TL;DR: These new algorithms enable, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication-loss and deep coalescence reconciliation costs.
Abstract: Genomic data provide a wealth of new information for phylogenetic analysis. Yet making use of this data requires phylogenetic methods that can efficiently analyze extremely large data sets and account for processes of gene evolution, such as gene duplication and loss, incomplete lineage sorting (deep coalescence), or horizontal gene transfer, that cause incongruence among gene trees. One such approach is gene tree parsimony, which, given a set of gene trees, seeks a species tree that requires the smallest number of evolutionary events to explain the incongruence of the gene trees. However, the only existing algorithms for gene tree parsimony under the duplication-loss or deep coalescence reconciliation cost are prohibitively slow for large datasets. We describe novel algorithms for SPR and TBR based local search heuristics under the duplication-loss cost, and we show how they can be adapted for the deep coalescence cost. These algorithms improve upon the best existing algorithms for these problems by a factor of n, where n is the number of species in the collection of gene trees. We implemented our new SPR based local search algorithm for the duplication-loss cost and demonstrate the tremendous improvement in runtime and scalability it provides compared to existing implementations. We also evaluate the performance of our algorithm on three large-scale genomic data sets. Our new algorithms enable, for the first time, gene tree parsimony analyses of thousands of genes from hundreds of taxa using the duplication-loss and deep coalescence reconciliation costs. Thus, this work expands both the size of data sets and the range of evolutionary models that can be incorporated into genome-scale phylogenetic analyses.

69 citations


Journal ArticleDOI
TL;DR: In this article, the authors present the new Cass algorithm that can combine any set of clusters into a phylogenetic network, which is guaranteed to produce a network with at most two reticulations per biconnected component whenever such a network exists.
Abstract: Phylogenetic trees are widely used to display estimates of how groups of species are evolved. Each phylogenetic tree can be seen as a collection of clusters, subgroups of the species that evolved from a common ancestor. When phylogenetic trees are obtained for several datasets (e.g. for different genes), then their clusters are often contradicting. Consequently, the set of all clusters of such a dataset cannot be combined into a single phylogenetic tree. Phylogenetic networks are a generalization of phylogenetic trees that can be used to display more complex evolutionary histories, including reticulate events, such as hybridizations, recombinations and horizontal gene transfers. Here, we present the new Cass algorithm that can combine any set of clusters into a phylogenetic network. We show that the networks constructed by Cass are usually simpler than networks constructed by other available methods. Moreover, we show that Cass is guaranteed to produce a network with at most two reticulations per biconnected component, whenever such a network exists. We have implemented Cass and integrated it into the freely available Dendroscope software. Contact: l.j.j.v.iersel@gmail.com Supplementary information: Supplementary data are available at Bioinformatics online.

Journal ArticleDOI
TL;DR: This paper studies the problem of approximating trees by trees with a particular self-nested structure, and shows that the nearest embedding self-nested tree (NEST) of a real plant may be interpreted in biological terms and may be used to reveal important aspects of the plant growth.
Abstract: In this paper, we are interested in the problem of approximating trees by trees with a particular self-nested structure. Self-nested trees are such that all their subtrees of a given height are isomorphic. We show that these trees present remarkable compression properties, with high compression rates. In order to measure how far a tree is from being a self-nested tree, we then study how to quantify the degree of self-nestedness of any tree. For this, we define a measure of the self-nestedness of a tree by constructing a self-nested tree that minimizes the distance of the original tree to the set of self-nested trees that embed the initial tree. We show that this measure can be computed in polynomial time and depict the corresponding algorithm. The distance to this nearest embedding self-nested tree (NEST) is then used to define compression coefficients that reflect the compressibility of a tree. To illustrate this approach, we then apply these notions to the analysis of plant branching structures. Based on a database of simulated theoretical plants in which different levels of noise have been introduced, we evaluate the method and show that the NESTs of such branching structures restore partly or completely the original, noiseless, branching structures. The whole approach is then applied to the analysis of a real plant (a rice panicle) whose topological structure was completely measured. We show that the NEST of this plant may be interpreted in biological terms and may be used to reveal important aspects of the plant growth.
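
The definition of a self-nested tree given above (all subtrees of a given height are isomorphic) can be checked directly. The sketch below is a minimal check for unordered rooted trees, assuming a hypothetical encoding in which a node is a tuple of its children and a leaf is the empty tuple. It does not compute the NEST or the compression coefficients from the paper.

```python
def is_self_nested(tree):
    """Return True if all subtrees of the same height are isomorphic.

    A tree is a tuple of child trees; a leaf is the empty tuple ().
    Children are treated as unordered.
    """
    shape_at_height = {}  # height -> canonical form of the subtrees seen there
    consistent = True

    def visit(node):
        nonlocal consistent
        results = [visit(child) for child in node]
        # A leaf has height 0; otherwise one more than the tallest child.
        height = 1 + max((h for h, _ in results), default=-1)
        # Sorting the children's canonical forms makes the form order-independent.
        canon = tuple(sorted(c for _, c in results))
        if shape_at_height.setdefault(height, canon) != canon:
            consistent = False
        return height, canon

    visit(tree)
    return consistent


if __name__ == "__main__":
    leaf = ()
    cherry = (leaf, leaf)
    print(is_self_nested((cherry, cherry)))   # both height-1 subtrees are cherries
    print(is_self_nested((cherry, (leaf,))))  # height-1 subtrees differ in shape
```

The recursion is fine for small examples; very deep trees would need an iterative traversal.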

Journal ArticleDOI
TL;DR: It is shown that, for each n ≥ 7, there exist a species tree topology S and a gene tree topology G ≠ S, both with n leaves, for which the number of coalescent histories exceeds the corresponding number of assemblage histories.

Journal ArticleDOI
Tatsuya Akutsu
TL;DR: Key results and recent results on the tree edit distance problem and related problems are reviewed, including polynomial time exact algorithms and more efficient approximation algorithms for the edit distance problem for ordered trees, and approximation algorithms for the largest common sub-tree problem for unordered trees.
Abstract: Tree structured data often appear in bioinformatics. For example, glycans, RNA secondary structures and phylogenetic trees usually have tree structures. Comparison of trees is one of the fundamental tasks in the analysis of these data. Various distance measures have been proposed and utilized for comparison of trees, among which tree edit distance has been studied extensively. In this paper, we review key results and our recent results on the tree edit distance problem and related problems. In particular, we review polynomial time exact algorithms and more efficient approximation algorithms for the edit distance problem for ordered trees, and approximation algorithms for the largest common sub-tree problem for unordered trees. We also review applications of tree edit distance and its variants to bioinformatics, with a focus on comparison of glycan structures.

Journal ArticleDOI
TL;DR: This paper introduces and characterize a new consensus method that refines the majority-rule tree by adding certain compatible clusters satisfying a simple criterion.
Abstract: The construction of a consensus tree to summarize the information of a given set of phylogenetic trees is now routinely a part of many studies in systematic biology. One popular method is the majority-rule consensus tree. In this paper we introduce and characterize a new consensus method that refines the majority-rule tree by adding certain compatible clusters satisfying a simple criterion.
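
For context, the baseline that the new method refines, the majority-rule consensus, keeps every cluster that appears in more than half of the input trees. The sketch below shows only that baseline step, assuming each tree is already given as a collection of clusters (frozensets of taxon names). The function name is hypothetical, and the paper's refinement criterion is not stated in the abstract, so it is not reproduced here.

```python
from collections import Counter


def majority_rule_clusters(trees):
    """Return the clusters found in strictly more than half of the trees.

    Each tree is an iterable of clusters, each cluster a frozenset of taxa.
    """
    # set(tree) guards against a cluster being listed twice in one tree.
    counts = Counter(cluster for tree in trees for cluster in set(tree))
    return {cluster for cluster, k in counts.items() if 2 * k > len(trees)}


if __name__ == "__main__":
    ab, bc = frozenset("AB"), frozenset("BC")
    abc, abd = frozenset("ABC"), frozenset("ABD")
    # Three hypothetical trees on taxa A, B, C, D (nontrivial clusters only).
    trees = [{ab, abc}, {ab, abd}, {bc, abc}]
    for cluster in sorted(majority_rule_clusters(trees), key=len):
        print(sorted(cluster))
```

Any two clusters that each occur in more than half of the trees must occur together in at least one tree, so the retained clusters are always pairwise compatible and define a tree.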

Book ChapterDOI
01 Feb 2010
TL;DR: A multi-objective approach for phylogenetic reconstruction using maximum parsimony (Fitch, 1972) and maximum likelihood (Felsenstein, 1981) criteria is proposed and preliminary results were presented.
Abstract: Phylogenetic inference is one of the central problems in computational biology. It consists in finding the best tree that explains the evolutionary history of species from a given dataset. Various phylogenetic reconstruction methods have been proposed in the literature. Most of them use one optimality criterion (or objective function) to evaluate possible solutions in order to determine the best tree. On the other hand, several researches (Huelsenbeck, 1995; Kuhner & Felsenstein, 1994; Tateno et al., 1994) have shown important differences in the results obtained by applying distinct reconstruction methods to the same input data. Rokas et al. (2003) pointed out that there are several sources of incongruity in phylogenetic analysis: the optimality criterion employed, the data sets used and the evolutionary assumptions concerning data. In other words, according to the literature, the selection of the reconstruction method has a great influence on the results. In this context, a multi-objective approach can be a relevant contribution since it can search for phylogenies using more than one criterion and produce trees which are consistent with all employed criteria. Recently, Handl et al. (2006) discussed the current and future applications of multi-objective optimization in bioinformatics and computational biology problems. Poladian & Jermiin (2006) showed how multi-objective optimization can be used in phylogenetic inference from various conflicting datasets. The authors highlighted that this approach reveals sources of such conflicts and provides useful information for a robust inference. Coelho et al. (2007) propose a multi-objective Artificial Immune System (De Castro & Timmis, 2002) approach for the reconstruction of phylogenetic trees. The developed algorithm, called omni-aiNet, was employed to find a set of Pareto-optimal trees that represent a trade-off between the minimum evolution (Kidd & Sgaramella, 1971) and the least-squares criteria (Cavalli-Sforza & Edwards, 1967). Compared to the tree found by Neighbor Joining (NJ) algorithm (Saitou & Nei, 1987), solutions obtained by omni-aiNet have better minimum evolution and least squares scores. In this paper, we propose a multi-objective approach for phylogenetic reconstruction using maximum parsimony (Fitch, 1972) and maximum likelihood (Felsenstein, 1981) criteria. The basis of this approach and preliminary results were presented in (Cancino & Delbem, 2007a,b). The proposed technique, called PhyloMOEA, is a multi-objective evolutionary algorithm (MOEA) based on the NSGA-II (Deb, 2001). The PhyloMOEA output is a set of

Journal ArticleDOI
TL;DR: It is shown that, for a tree with 4 lineages where 2 nonsister taxa undergo a change in the proportion of variable sites, tree reconstruction under the best-fitting model, which is chosen using a relative test, often results in the wrong tree.
Abstract: Commonly used phylogenetic models assume a homogeneous process through time in all parts of the tree. However, it is known that these models can be too simplistic as they do not account for nonhomogeneous lineage-specific properties. In particular, it is now widely recognized that as constraints on sequences evolve, the proportion and positions of variable sites can vary between lineages causing heterotachy. The extent to which this model misspecification affects tree reconstruction is still unknown. Here, we evaluate the effect of changes in the proportions and positions of variable sites on model fit and tree estimation. We consider 5 current models of nucleotide sequence evolution in a Bayesian Markov chain Monte Carlo framework as well as maximum parsimony (MP). We show that, for a tree with 4 lineages where 2 nonsister taxa undergo a change in the proportion of variable sites, tree reconstruction under the best-fitting model, which is chosen using a relative test, often results in the wrong tree. In this case, we found that an absolute test of model fit is a better predictor of tree estimation accuracy. We also found further evidence that MP is not immune to heterotachy. In addition, we show that increased sampling of taxa that have undergone a change in proportion and positions of variable sites is critical for accurate tree reconstruction.

Journal ArticleDOI
TL;DR: The gist of the approach is the succinct characterization of Steiner trees for a small number of leaves for the two distances that enables the use of known Steiner tree approximation algorithms.
Abstract: We explore the maximum parsimony (MP) and ancestral maximum likelihood (AML) criteria in phylogenetic tree reconstruction. Both problems are NP-hard, so we seek approximate solutions. We formulate the two problems as Steiner tree problems under appropriate distances. The gist of our approach is the succinct characterization of Steiner trees for a small number of leaves for the two distances. This enables the use of known Steiner tree approximation algorithms. The approach leads to a 16/9 approximation ratio for AML and asymptotically to a 1.55 approximation ratio for MP.
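
As a rough illustration of the maximum parsimony criterion named in this abstract, the Python sketch below scores a fixed rooted binary tree with Fitch's small-parsimony pass. It is not the Steiner tree approximation from the paper. The nested-tuple tree encoding, the function names, and the toy alignment are all assumptions made for this example.

```python
def fitch_score(tree, leaf_states):
    """Minimum number of state changes for one character on a fixed tree.

    tree: nested 2-tuples with leaf names (str) at the tips (assumed encoding).
    leaf_states: dict mapping leaf name -> observed state at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):
            return {leaf_states[node]}
        left, right = node
        a, b = visit(left), visit(right)
        common = a & b
        if common:
            # The children agree on at least one state: no change needed here.
            return common
        # Disjoint state sets: one change is required on this node's subtree.
        changes += 1
        return a | b

    visit(tree)
    return changes


def parsimony_length(tree, alignment):
    """Sum the Fitch score over all sites of an alignment (dict name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_score(tree, {name: seq[i] for name, seq in alignment.items()})
        for i in range(n_sites)
    )


# Hypothetical toy data: four taxa, three sites.
alignment = {"A": "AAG", "B": "AAG", "C": "ACT", "D": "ACG"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_1, alignment), parsimony_length(tree_2, alignment))
```

On this toy alignment the first topology needs fewer changes than the second, so it would be preferred under MP. Finding the best topology over all trees is the NP-hard part that the abstract addresses.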

Proceedings ArticleDOI
22 Mar 2010
TL;DR: This work proposes the use of the evolutionary algorithms paradigm (EA) as an alternate heuristic to generate model trees in order to improve the convergence to global optimal solutions.
Abstract: Model trees are a particular case of decision trees employed to solve regression problems. They have the advantage of presenting an interpretable output with an acceptable level of predictive performance. Since generating optimal model trees is an NP-complete problem, the traditional model tree induction algorithms make use of a greedy heuristic, which may not converge to the global optimal solution. We propose the use of the evolutionary algorithms paradigm (EA) as an alternate heuristic to generate model trees in order to improve the convergence to global optimal solutions. We test the predictive performance of this new approach using public UCI datasets, and compare the results with traditional greedy regression/model trees induction algorithms.

Journal ArticleDOI
TL;DR: It is shown that the star-decomposition algorithm is a special case of two other popular tree-search algorithms, subtree pruning and regrafting and tree bisection and reconnection, which means that these two algorithms can efficiently escape when they encounter multifurcations.
Abstract: Phylogenetic tree-search is a major aspect of many evolutionary studies. Several tree rearrangement algorithms are available for tree-search, but it is hard to draw general conclusions about their relative performance because many effects are data set specific and can be highly dependent on individual implementations (e.g., RAxML or phyml). Using only the structure of the rearrangements proposed by the Nearest Neighbor Interchange (NNI) algorithm, we show tree-search can prematurely terminate if it encounters multifurcating trees. We validate the relevance of this result by demonstrating that in real data the majority of possible bifurcating trees potentially encountered during tree-search are actually multifurcations, which suggests NNI would be expected to perform poorly. We also show that the star-decomposition algorithm is a special case of two other popular tree-search algorithms, subtree pruning and regrafting (SPR) and tree bisection and reconnection (TBR), which means that these two algorithms can efficiently escape when they encounter multifurcations. We caution against the use of the NNI algorithm and for most applications we recommend the use of more robust tree-search algorithms, such as SPR and TBR.
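
To make the NNI rearrangement concrete, here is a minimal Python sketch of the move around a single internal edge. It assumes the edge separates four subtrees, written as `((A, B), (C, D))`. This encoding and the function name are hypothetical and are not taken from RAxML, phyml, or the paper.

```python
def nni_neighbors(quartet):
    """Return the two NNI alternatives around one internal edge.

    quartet: ((a, b), (c, d)), the four subtrees attached to the two
    ends of the edge (assumed encoding). Subtrees may be any objects.
    """
    (a, b), (c, d) = quartet
    # Swap one subtree across the edge; only two distinct results exist.
    return [((a, c), (b, d)), ((a, d), (b, c))]


for neighbor in nni_neighbors((("A", "B"), ("C", "D"))):
    print(neighbor)
```

Each internal edge yields only these two alternatives, so NNI explores a much smaller neighborhood per step than SPR or TBR.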

Book ChapterDOI
07 Apr 2010
TL;DR: This paper presents the PhyloMOEA parallel version developed using the ParadisEO framework, and shows significant speedup in the execution time for the employed datasets.
Abstract: The inference of the phylogenetic tree that best expresses the evolutionary relationships in the data is one of the central problems of bioinformatics. Several single optimality criteria have been proposed for the phylogenetic reconstruction problem. However, different criteria may lead to conflicting phylogenies. In this scenario, a multi-objective approach can be useful since it can produce a set of optimal trees according to multiple criteria. PhyloMOEA is a multi-objective evolutionary approach applied to phylogenetic inference using maximum parsimony and maximum likelihood criteria. However, the computational power required for phylogenetic inference of large alignments easily surpasses the capabilities of single machines. In this context, the parallelization of the heuristic reconstruction methods can not only help to reduce the inference execution time but also improve the quality of the results and the robustness of the search. The parallelization of PhyloMOEA therefore represents the next development step in order to reduce the execution time. In this paper, we present the PhyloMOEA parallel version developed using the ParadisEO framework. The experiments conducted show significant speedup in the execution time for the employed datasets.

Journal ArticleDOI
TL;DR: In this paper, it was shown that for trees with at least eight leaves, assuming the tree topology is already known, seven leaves suffice for identifiability of the numerical parameters.

Journal ArticleDOI
TL;DR: In this paper, it was shown that the maximum likelihood and maximum parsimony methods are equivalent for sequences of characters under a simple symmetric model of substitution with no common mechanism and that small changes to the model assumptions suffice to make the two methods inequivalent.

Proceedings ArticleDOI
02 Aug 2010
TL;DR: Adjusted gene tree parsimony reflects a potentially more realistic and, at least for small data sets, computationally feasible model for counting gene duplication events than treating each duplication independently or minimizing the number of possible duplication episodes.
Abstract: Gene tree parsimony, which infers a species tree that implies the fewest gene duplications across a collection of gene trees, is a method for inferring phylogenetic trees from paralogous genes. However, it assumes that all duplications are independent, and therefore, it does not account for large-scale gene duplication events like whole genome duplications. We describe two methods to infer species trees based on gene duplication events that may involve multiple genes. First, gene episode parsimony seeks the species tree that implies the fewest possible gene duplication episodes. Second, adjusted gene tree parsimony corrects the number of gene duplications at each node in the species tree by treating the largest possible gene duplication episode as a single duplication. We test both new methods, as well as gene tree parsimony, using 7,091 gene trees representing 7 plant taxa. Gene tree parsimony and adjusted gene tree parsimony both perform well, returning the species tree after an exhaustive search of the tree space. By contrast, gene episode parsimony fails to rank the true species tree within the top third of all possible topologies. Furthermore, gene trees with randomly permuted leaf labels can imply fewer duplication episodes than gene trees with the correct leaf labels. Adjusted gene tree parsimony reflects a potentially more realistic and, at least for small data sets, computationally feasible model for counting gene duplication events than treating each duplication independently or minimizing the number of possible duplication episodes.
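
The duplication count that plain gene tree parsimony minimizes can be sketched with the standard LCA-mapping reconciliation. The Python example below is illustrative only and does not implement the episode-based or adjusted variants described in the abstract. The nested-tuple trees, the `species_of` labelling convention (`"A_1"` belongs to species `"A"`), and all function names are assumptions.

```python
def clades(tree):
    """Leaf sets (frozensets) of every node in a nested-tuple tree."""
    out = []

    def visit(node):
        if isinstance(node, str):
            s = frozenset([node])
        else:
            s = frozenset().union(*(visit(child) for child in node))
        out.append(s)
        return s

    visit(tree)
    return out


def count_duplications(gene_tree, species_tree,
                       species_of=lambda gene: gene.split("_")[0]):
    """Count gene tree nodes inferred as duplications under LCA mapping."""
    species_clades = clades(species_tree)

    def lca(species_set):
        # Smallest species tree clade that contains the given species.
        return min((c for c in species_clades if species_set <= c), key=len)

    dups = 0

    def visit(node):
        nonlocal dups
        if isinstance(node, str):
            return lca(frozenset([species_of(node)]))
        left, right = node
        m_left, m_right = visit(left), visit(right)
        m = lca(m_left | m_right)
        # A node mapping to the same species node as a child is a duplication.
        if m == m_left or m == m_right:
            dups += 1
        return m

    visit(gene_tree)
    return dups


species_tree = (("A", "B"), "C")
print(count_duplications((("A_1", "B_1"), "C_1"), species_tree))
print(count_duplications((("A_1", "B_1"), ("A_2", "C_1")), species_tree))
```

Summing this count over a collection of gene trees gives a reconciliation cost for a candidate species tree. Treating each such duplication as independent is the assumption the abstract identifies as a limitation.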

Journal ArticleDOI
TL;DR: The long branch extraction method seems to mask the majority of the search space rendering it ineffective as a detection method of LBA, and a proposed alternative, the long branch shortening method, is also ineffective in predicting long branch attraction for all tree topologies.
Abstract: Background: Long branch attraction (LBA) is a problem that afflicts both the parsimony and maximum likelihood phylogenetic analysis techniques. Research has shown that parsimony is particularly vulnerable to inferring the wrong tree in Felsenstein topologies. The long branch extraction method is a procedure to detect a data set suffering from this problem so that Maximum Likelihood could be used instead of Maximum Parsimony.

Journal ArticleDOI
TL;DR: This paper presents non-parallel optimizations which establish their implementation as the fastest exact implementation in phylogenetics, and their novel parallelized routines are the first of their kind.

Book ChapterDOI
13 Jun 2010
TL;DR: The preliminary experimental validation is promising as the resulting trees can be significantly less complex with at least comparable performance to the classical top-down counterpart.
Abstract: In the paper a new evolutionary algorithm for induction of univariate regression trees is proposed. In contrast to typical top-down approaches it globally searches for the best tree structure and tests in internal nodes. The population of initial trees is created with diverse top-down methods on randomly chosen sub-samples of the training data. Specialized genetic operators allow the algorithm to efficiently evolve regression trees. The complexity term introduced in the fitness function helps to mitigate the over-fitting problem. The preliminary experimental validation is promising as the resulting trees can be significantly less complex with at least comparable performance to the classical top-down counterpart.

Book ChapterDOI
11 Sep 2010
TL;DR: A new evolutionary algorithm for induction of univariate regression trees that associate leaves with simple linear regression models that can be significantly less complex with at least comparable performance to the classical top-down counterparts is proposed.
Abstract: In the paper we propose a new evolutionary algorithm for induction of univariate regression trees that associate leaves with simple linear regression models. In contrast to typical top-down approaches it globally searches for the best tree structure, tests in internal nodes and models in leaves. The population of initial trees is created with diverse top-down methods on randomly chosen subsamples of the training data. Specialized genetic operators allow the algorithm to efficiently evolve regression trees. Akaike's information criterion (AIC) as the fitness function helps to mitigate the overfitting problem. The preliminary experimental validation is promising as the resulting trees can be significantly less complex with at least comparable performance to the classical top-down counterparts.
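
The abstract does not give the exact form of the fitness function, so the Python sketch below only shows how an AIC-style score can trade fit against complexity. It uses the common Gaussian-error form, n ln(RSS/n) + 2k, with additive constants dropped. The function name, the parameter count `k`, and the numbers are hypothetical.

```python
import math


def aic_gaussian(rss, n, k):
    """AIC up to an additive constant, assuming Gaussian residuals.

    rss: residual sum of squares of the tree on the training data
    n:   number of training instances
    k:   number of free parameters counted for the tree (an assumption;
         the paper may count parameters differently)
    """
    return n * math.log(rss / n) + 2 * k


# Hypothetical comparison: a small tree versus a larger, slightly better-fitting one.
small_tree = aic_gaussian(rss=10.0, n=100, k=3)
large_tree = aic_gaussian(rss=9.9, n=100, k=20)
print(small_tree < large_tree)
```

Lower values are better. In this made-up case the marginal gain in fit does not offset the extra parameters, so the smaller tree would be preferred.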

Proceedings ArticleDOI
01 Nov 2010
TL;DR: The design is sufficiently generic to support any possible input data type, that is, DNA, RNA secondary structure, or protein data, and is able to calculate log-likelihood scores and perform numerical scaling to maintain numerical stability on large datasets.
Abstract: Likelihood-based reconstruction of phylogenetic (evolutionary) trees from molecular sequence data exhibits extreme resource requirements because of the high computational cost of the phylogenetic likelihood function. We propose a dedicated computer architecture for the inference of phylogenies under the maximum likelihood criterion. Our design is sufficiently generic to support any possible input data type, that is, DNA, RNA secondary structure, or protein data. Furthermore, the architecture is able to calculate log-likelihood scores and perform numerical scaling to maintain numerical stability on large datasets. It can also optimize the branch lengths of tree topologies and calculate transition probability matrices. We used FPGA technology to verify the correctness of our architecture.

Dissertation
01 Jan 2010
TL;DR: This dissertation presents parsimony-based algorithms for reconciling species/gene tree incongruence that is assumed to be due solely to lineage sorting, and describes a unified framework for detecting hybridization despite lineage sorting.
Abstract: The main focus of this dissertation is the inference of species phylogenies, i.e. evolutionary histories of species. Species phylogenies allow us to gain insights into the mechanisms of evolution and to hypothesize past evolutionary events. They also find applications in medicine, for example, the understanding of antibiotic resistance in bacteria. The reconstruction of species phylogenies is, therefore, of both biological and practical importance. In the traditional method for inferring species trees from genetic data, we sequence a single locus in species genomes, reconstruct a gene tree, and report it as the species tree. Biologists have long acknowledged that a gene tree can be different from a species tree, thus implying that this traditional method might infer the wrong species tree. Moreover, reticulate events such as horizontal gene transfer and hybridization make the evolution of species no longer tree-like. The availability of multi-locus data provides us with excellent opportunities to resolve those long standing problems. In this dissertation, we present parsimony-based algorithms for reconciling species/gene tree incongruence that is assumed to be due solely to lineage sorting. We also describe a unified framework for detecting hybridization despite lineage sorting. To address the first problem of species/gene tree incongruence caused by lineage sorting, we present three algorithms. In Chapter 3, we present an algorithm based on an integer-linear programming (ILP) formula to infer the species tree's topology and divergence times from multiple gene trees. In Chapter 4, we describe two methods that infer the species tree by minimizing deep coalescences (MDC), a criterion introduced by Maddison in 1997. The first method is also based on an ILP formula, but it eliminates the enumeration phase of candidate species trees of the algorithm in Chapter 3. 
The second algorithm further eliminates the dependence on external ILP solvers by employing dynamic programming. We ran those methods on both biological and simulated data, and experimental results demonstrate their high accuracy and speed in species tree inference, which makes them suitable for analyzing multi-locus data. The second problem this dissertation deals with is reticulation (e.g., horizontal gene transfer, hybridization) detection despite lineage sorting. The phylogeny-based approach compares the evolutionary histories of different genomic regions and tests them for incongruence that would indicate hybridization. However, since species tree and gene tree incongruence can also be due to lineage sorting, phylogeny-based hybridization methods might overestimate the amount of hybridization. We present in this dissertation a framework that can handle both hybridization and lineage sorting simultaneously. In this framework, we extend the MDC criterion to phylogenetic networks, and use it to propose a heuristic to detect hybridization despite lineage sorting. Empirical results on a simulated and a yeast data set show its promising performance, as well as several directions for future research.