Advances in Phylogeny Reconstruction from
Gene Order and Content Data
Bernard M.E. Moret
Department of Computer Science, University of New Mexico, Albuquerque NM
87131
Tandy Warnow
Department of Computer Sciences, University of Texas, Austin TX 78712
Abstract
Genomes can be viewed in terms of their gene content and the order in which the
genes appear along each chromosome. Evolutionary events that affect the gene order
or content are “rare genomic events” (rarer than events that affect the composition
of the nucleotide sequences) and have been advocated by systematists for inferring
deep evolutionary histories. This chapter surveys recent developments in the recon-
struction of phylogenies from gene order and content, focusing on their performance
under various stochastic models of evolution. Because such methods are currently
quite restricted in the type of data they can analyze, we also present current research
aimed at handling the full range of whole-genome data.
Key words:
1 Introduction: Molecular Sequence Phylogenetics
A phylogeny represents the evolutionary history of a collection of organisms,
usually in the form of a tree. Sequence data are by far the most common
form of molecular data used in phylogenetic analyses. We begin by briefly re-
viewing techniques for estimating phylogenies from molecular sequences, with
emphasis on the computational and statistical issues involved.
Email addresses: moret@cs.unm.edu (Bernard M.E. Moret),
tandy@cs.utexas.edu (Tandy Warnow).
URLs: www.cs.unm.edu/moret/ (Bernard M.E. Moret),
www.cs.utexas.edu/users/tandy/ (Tandy Warnow).
Preprint submitted to Elsevier Science 18 October 2004

1.1 Model trees and stochastic models of evolution
Most algorithms for phylogenetic reconstruction attempt to reverse a model of
evolution. Such a model embodies certain knowledge and assumptions about
the process of evolution, such as characteristics of speciation and details about
evolutionary changes that affect the content of molecular sequences. Mod-
els of evolution vary in their complexity; in particular, they require different
numbers of parameters. For instance, the Jukes-Cantor model, which assumes
that all sites evolve identically and independently and that all substitutions
are equally likely, requires just one parameter per edge of the tree, viz., the
expected number of changes of a random site on that edge. Overall, then,
a rooted Jukes-Cantor tree with n leaves requires 2n − 2 parameters. Under
more complex models of evolution, the process operating on a single edge can
require up to 12 parameters (for the General Markov model), although these
models still require Θ(n) parameters overall. If edge “lengths” are drawn from
a distribution, however, the complexity can be reduced, since the evolution-
ary process operating on the model tree can then be described just by the
parameters of the distribution.
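
As a concrete illustration (a minimal sketch of our own, not code from the
chapter), the single Jukes-Cantor edge parameter, namely the expected number
of changes of a random site on that edge, translates into a substitution
probability that suffices to simulate a site across the edge:

```python
import math
import random

def jc_substitution_prob(edge_length: float) -> float:
    """Probability that a site differs across an edge under Jukes-Cantor,
    where edge_length is the expected number of changes per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * edge_length / 3.0))

def evolve_site(base: str, edge_length: float) -> str:
    """Evolve one nucleotide across an edge: with the probability above it
    is replaced by one of the three other bases, chosen uniformly."""
    if random.random() < jc_substitution_prob(edge_length):
        return random.choice([b for b in "ACGT" if b != base])
    return base
```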
These parameters describe how a single site evolves down the tree and so
require additional assumptions in order to describe how different sites evolve.
Usually the sites are assumed to evolve independently; sometimes they are
also assumed to evolve identically. Moreover, the different sites are assumed
either to evolve under the same process or to have rates of evolution that
vary depending upon the site. In the latter case (in which each site has its
own rate), an additional k parameters are needed, where k is the number
of sites. However, if the rates are presumed to be drawn from a distribution
(typically, the gamma distribution), then a single additional parameter suffices
to describe the evolutionary process operating on the tree; furthermore, in this
case, the sites still evolve under the i.i.d. assumption.
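
The gamma-rates assumption can be made concrete in a few lines (a sketch
under the usual convention that rates have mean 1, so a single shape
parameter alpha captures all of the variation across sites):

```python
import random

def sample_site_rates(num_sites: int, alpha: float) -> list:
    """Draw i.i.d. per-site rate multipliers from a gamma distribution with
    shape alpha and scale 1/alpha, hence mean 1; small alpha means strong
    rate heterogeneity. A site with rate r evolves on an edge of length t
    as if the edge had length r * t."""
    return [random.gammavariate(alpha, 1.0 / alpha) for _ in range(num_sites)]
```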
Tree generation models typically have parameters regulating speciation rates,
but also inheritance characteristics, etc. For more on stochastic models of (se-
quence) evolution, see Felsenstein (1981), Kim and Warnow (1999), Li (1997),
and Swofford et al. (1996); for an interesting discussion of models of tree gen-
eration, see Heard (1996) and Mooers and Heard (1997).
By studying the performance of methods under explicit stochastic models
of evolution, it becomes possible to assess the relative strengths of different
methods, as well as to understand how methods can fail. Such studies can
be theoretical, for instance proving statistical consistency: given long enough
sequences, the method will return the true tree with arbitrarily high proba-
bility. Others can use simulations to study the performance of the methods
under conditions closely approximating practice. In a simulation, sequences
are evolved down different model trees and then given to different methods for
reconstruction; the reconstructions can then be compared against the model
trees that generated the data. Such studies provide important quantifications
of the relative merits of phylogenetic reconstruction methods.
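
The chapter does not spell out a comparison measure here; a standard choice
is the Robinson-Foulds distance, which counts the bipartitions (splits of
the leaf set induced by removing an edge) present in one tree but not the
other. A minimal sketch, assuming each tree has already been summarized as
its set of nontrivial splits:

```python
def robinson_foulds(splits_true: set, splits_est: set) -> int:
    """Robinson-Foulds distance: the size of the symmetric difference of
    the two trees' sets of nontrivial bipartitions."""
    return len(splits_true ^ splits_est)

# Example: each unrooted binary tree on four leaves has exactly one
# nontrivial split; these two trees disagree, giving the maximum value 2.
t1 = {frozenset({frozenset("AB"), frozenset("CD")})}
t2 = {frozenset({frozenset("AC"), frozenset("BD")})}
print(robinson_foulds(t1, t2))  # -> 2
```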
1.2 Phylogeny reconstruction from molecular sequences
Three main types of methods are used to reconstruct phylogenies from molec-
ular sequences: distance-based methods, maximum parsimony heuristics, and
maximum likelihood heuristics.
1.2.1 Distance-based methods
Of the three types of methods, only distance-based methods include algorithms
that run in polynomial time. Distance-based methods operate in two phases:
(1) Pairwise distances between every pair of taxa are estimated.
(2) An algorithm is applied to the matrix of pairwise distances to compute
an edge-weighted tree T .
The statistical consistency (if any) of such two-phase procedures rests on two
assumptions: first, that a statistically consistent distance estimator is used in
the first phase and, second, that an appropriate distance-based algorithm is
used in the second phase. The requirement that the first phase be statisti-
cally consistent means that the distance estimator should return a value that
approaches the expected number of times a random site changes on the path
between the two taxa. Thus, the estimation of pairwise distances must be done
with respect to some assumed stochastic model of evolution. As an example,
in the Jukes-Cantor model of evolution, the estimated distance between
sequences $s_i$ and $s_j$ is given by the formula
$$d_{ij} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{H_{ij}}{k}\right),$$
where $k$ is the sequence length and $H_{ij}$ denotes the Hamming distance
(the number of positions in which $s_i$ and $s_j$ differ, which is the edit
distance under mutation operations).
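
A minimal sketch of this correction applied to two aligned sequences (the
function name is ours; note that the estimate diverges as the normalized
Hamming distance approaches 3/4, the saturation point of the model):

```python
import math

def jc_distance(s_i: str, s_j: str) -> float:
    """Jukes-Cantor corrected distance between two aligned sequences of
    equal length: the estimated expected number of changes per site."""
    k = len(s_i)
    p = sum(a != b for a, b in zip(s_i, s_j)) / k  # normalized Hamming distance
    if p >= 0.75:                                  # correction undefined: saturated
        return float("inf")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```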
Algorithms that attempt to reconstruct trees from distance matrices are guar-
anteed to produce accurate reconstructions of the trees only when the distance
matrix entries approach very closely the actual number of changes between the
pair of sequences. (In the context of estimating model trees, this requirement
means that the estimated distances need to be extremely close to the model
distances, defined to be the expected number of times a random site changes
on a leaf-to-leaf path. See Atteson (1999) and Kim and Warnow (1999) for
more on this issue.) Naïvely defined distances, such as the Hamming dis-
tance, typically underestimate the number of changes that took place in the
evolutionary history; thus the first step of a distance-based method is to cor-
rect the naïvely defined distance into one that accurately accounts for the
expected number of unseen back-and-forth changes in a site. Such corrections
are not without problems: as the measured distance grows larger, the variance
in the estimator increases, causing increasing errors in reconstruction.
The most commonly used, and simplest, distance-based method is the neighbor-
joining (NJ) algorithm of Saitou and Nei (1987); improved versions of this ba-
sic method include BioNJ (Gascuel, 1997) and a version known as Weighbor,
which requires an estimate of the variance of the distance estimator (Bruno
et al., 2000). NJ is known to be statistically consistent under most models of
evolution.
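
For illustration, the pair-selection criterion at the heart of one
neighbor-joining step can be sketched as follows (the agglomeration and
distance-update steps that complete the algorithm are omitted):

```python
def nj_pick_pair(D: list) -> tuple:
    """Pick the pair (i, j) minimizing the neighbor-joining criterion
    Q(i, j) = (n - 2) * D[i][j] - sum_k D[i][k] - sum_k D[j][k],
    which corrects raw distances for each taxon's average divergence."""
    n = len(D)
    row_sums = [sum(row) for row in D]
    best_q, best_pair = float("inf"), (0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_q, best_pair = q, (i, j)
    return best_pair
```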
1.2.2 Maximum Parsimony
Parsimony-based methods seek the tree, along with sequences labelling its in-
ternal nodes, that together minimize the total number of evolutionary changes
(viewed as distances summed along all edges of the tree). Put formally, the
problem is as follows: Given a set $S$ of sequences in a multiple alignment,
each of length $k$, find a tree $T$ and a set of additional sequences $S'$,
all also of length $k$, so that, with the leaves of $T$ labelled by $S$ and
its internal nodes by $S'$, the value
$$\sum_{e \in E(T)} \mathrm{Hamming}(e)$$
is minimized, where $\mathrm{Hamming}(e)$ denotes the Hamming distance
between the sequences labelling the endpoints of $e$. (Weighted or
distance-corrected versions can also be defined.)
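
Scoring a fully labelled tree under this objective is straightforward; the
hard part of MP is the search over trees and internal labels. A minimal
sketch (the representation and names are ours):

```python
def hamming(u: str, v: str) -> int:
    """Number of positions in which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(u, v))

def parsimony_length(edges: list, labels: dict) -> int:
    """Total parsimony length of a tree whose nodes, leaves and internal
    nodes alike, already carry sequences: the sum over all edges of the
    Hamming distance between the endpoint labels."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)
```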
The maximum parsimony problem (MP) is thus an optimization problem—
and a hard one: finding the best tree is provably NP-hard (Day, 1983). This
property effectively rules out exact solutions for all but the smallest instances;
indeed, in practice, exact solvers run within reasonable time on at most 30
taxa. Thus heuristics are the normal approach to the problem; most are based
on iterative improvement techniques and appear to return very good solu-
tions for up to a few hundred taxa. Many software packages implement such
heuristics, among them MEGA (Kumar et al., 2001), PAUP* (Swofford, 2001),
Phylip (Felsenstein, 1993), and TNT (Goloboff, 1999).
1.2.3 Maximum Likelihood
Like maximum parsimony, maximum likelihood (ML) is an optimization prob-
lem. ML seeks the tree and associated model parameter values that maximize
the probability of producing the given set of sequences. ML thus depends ex-
plicitly on the assumed model of evolution. For example, the ML problem
under the Jukes-Cantor model needs to estimate one parameter (the substi-
tution probability) for each edge of the tree, while under the General Markov
model 12 parameters must be estimated on each edge. ML is much more com-
putationally expensive than MP: even the problem of point estimation (scoring
a tree), i.e., finding optimal edge parameters, for the simplest (Jukes-Cantor)
model of evolution on a fixed tree is of unknown computational complexity,
and computationally expensive without being provably accurate in practice
(see Steel (1994) for a discussion), whereas it is easily accomplished in linear
time for MP using Fitch’s algorithm (Fitch, 1977). Provably correct solutions
to ML are currently limited to some special cases of four-leaf model trees,
exhaustive searches through tree space that use heuristics for scoring trees
are limited to about ten taxa, and heuristic searches through tree space us-
ing similar heuristics for scoring trees are typically limited to fewer than 100
taxa. Various software packages provide heuristics for ML, including PAUP*
(Swofford, 2001), Phylip (Felsenstein, 1993), FastDNAml (Olsen et al., 1994),
PhyML (Guindon and Gascuel, 2003), and TrExML (Wolf et al., 2000).
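
To make the contrast with ML scoring concrete, here is a minimal sketch of
Fitch's algorithm for a single site on a rooted binary tree (given as a
nested tuple with leaf names at the tips), which computes the optimal
parsimony score of a fixed tree in time linear in the number of nodes:

```python
def fitch_score(tree, leaf_states: dict) -> int:
    """Minimum number of state changes for one site on a fixed rooted
    binary tree, by Fitch's bottom-up state-set computation."""
    def post_order(node):
        if isinstance(node, str):                 # leaf: singleton state set
            return {leaf_states[node]}, 0
        left_set, left_cost = post_order(node[0])
        right_set, right_cost = post_order(node[1])
        if left_set & right_set:                  # intersection: no extra change
            return left_set & right_set, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return post_order(tree)[1]

# Example: the tree ((a,b),(c,d)) with states A, A, G, G needs one change.
print(fitch_score((("a", "b"), ("c", "d")),
                  {"a": "A", "b": "A", "c": "G", "d": "G"}))  # -> 1
```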
1.3 Performance issues
Methods can be compared in terms of their performance guarantees, in terms
of their resource requirements, and in terms of the quality of the trees they
produce. Very few methods offer any performance guarantees, except in purely
theoretical terms. For instance, while ML is known to be statistically consistent
under most models, the same cannot be said of its heuristic implementation;
and even neighbor-joining, which is statistically consistent and is implemented
exactly, may return very poor trees—the guarantee of statistical consistency
only implies good performance in the limit, as sequence lengths become suffi-
ciently large. In terms of computational requirements, the comparison is easy:
distance-based methods are efficient (running in polynomial time with low co-
efficients); parsimony is much harder to solve (systematists are accustomed
to running MP for weeks on a dataset of modest size); and maximum likeli-
hood is much harder again than MP. These comparisons, however, all have
limited value: as we saw, statistical consistency is a very weak guarantee,
while a guarantee of fast running times is worthless if the returned solution
is poor. Thus experimental studies are our best tool in the study of the rela-
tive performance of methods. Simulation studies, in particular, can establish
the absolute accuracy of methods (whereas studies conducted with biological
datasets can only assess relative performance in terms of the optimization cri-
terion). Such studies have shown that MP methods can produce reasonably
good trees under conditions where neighbor-joining can have high topologi-
cal error (significantly worse than MP); this possibly surprising performance
holds under many realistic model conditions—in particular, when the model
tree has a high evolutionary diameter (Moret et al., 2002b; Nakhleh et al.,
References

Felsenstein, J., 1981. Evolutionary trees from DNA sequences: a maximum
likelihood approach. J. Mol. Evol. 17, 368–376.

Guindon, S., Gascuel, O., 2003. A simple, fast, and accurate algorithm to
estimate large phylogenies by maximum likelihood. Syst. Biol. 52 (5),
696–704.

Kumar, S., Tamura, K., Jakobsen, I.B., Nei, M., 2001. MEGA2: molecular
evolutionary genetics analysis software. Bioinformatics 17 (12), 1244–1245.

Saitou, N., Nei, M., 1987. The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Mol. Biol. Evol. 4 (4), 406–425.