Advances in Phylogeny Reconstruction from
Gene Order and Content Data
Bernard M.E. Moret
Department of Computer Science, University of New Mexico, Albuquerque NM
87131
Tandy Warnow
Department of Computer Sciences, University of Texas, Austin TX 78712
Abstract
Genomes can be viewed in terms of their gene content and the order in which the
genes appear along each chromosome. Evolutionary events that affect the gene order
or content are “rare genomic events” (rarer than events that affect the composition
of the nucleotide sequences) and have been advocated by systematists for inferring
deep evolutionary histories. This chapter surveys recent developments in the recon-
struction of phylogenies from gene order and content, focusing on their performance
under various stochastic models of evolution. Because such methods are currently
quite restricted in the type of data they can analyze, we also present current research
aimed at handling the full range of whole-genome data.
Key words:
1 Introduction: Molecular Sequence Phylogenetics
A phylogeny represents the evolutionary history of a collection of organisms,
usually in the form of a tree. Sequence data are by far the most common
form of molecular data used in phylogenetic analyses. We begin by briefly re-
viewing techniques for estimating phylogenies from molecular sequences, with
emphasis on the computational and statistical issues involved.
Email addresses: moret@cs.unm.edu (Bernard M.E. Moret),
tandy@cs.utexas.edu (Tandy Warnow).
URLs: www.cs.unm.edu/moret/ (Bernard M.E. Moret),
www.cs.utexas.edu/users/tandy/ (Tandy Warnow).
Preprint submitted to Elsevier Science 18 October 2004

1.1 Model trees and stochastic models of evolution
Most algorithms for phylogenetic reconstruction attempt to reverse a model of
evolution. Such a model embodies certain knowledge and assumptions about
the process of evolution, such as characteristics of speciation and details about
evolutionary changes that affect the content of molecular sequences. Mod-
els of evolution vary in their complexity; in particular, they require different
numbers of parameters. For instance, the Jukes-Cantor model, which assumes
that all sites evolve identically and independently and that all substitutions
are equally likely, requires just one parameter per edge of the tree, viz., the
expected number of changes of a random site on that edge. Overall, then,
a rooted Jukes-Cantor tree with n leaves requires 2n − 2 parameters. Under
more complex models of evolution, the process operating on a single edge can
require up to 12 parameters (for the General Markov model), although these
models still require Θ(n) parameters overall. If edge “lengths” are drawn from
a distribution, however, the complexity can be reduced, since the evolution-
ary process operating on the model tree can then be described just by the
parameters of the distribution.
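
As a concrete illustration (a minimal sketch of our own, not code from the
chapter), the single Jukes-Cantor edge parameter, namely the expected number
of changes of a random site on that edge, translates into a substitution
probability that suffices to simulate a site across the edge:

```python
import math
import random

def jc_substitution_prob(edge_length: float) -> float:
    """Probability that a site differs across an edge under Jukes-Cantor,
    where edge_length is the expected number of changes per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * edge_length / 3.0))

def evolve_site(base: str, edge_length: float) -> str:
    """Evolve one nucleotide across an edge: with the probability above it
    is replaced by one of the three other bases, chosen uniformly."""
    if random.random() < jc_substitution_prob(edge_length):
        return random.choice([b for b in "ACGT" if b != base])
    return base
```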
These parameters describe how a single site evolves down the tree and so
require additional assumptions in order to describe how different sites evolve.
Usually the sites are assumed to evolve independently; sometimes they are
also assumed to evolve identically. Moreover, the different sites are assumed
either to evolve under the same process or to have rates of evolution that
vary depending upon the site. In the latter case (in which each site has its
own rate), an additional k parameters are needed, where k is the number
of sites. However, if the rates are presumed to be drawn from a distribution
(typically, the gamma distribution), then a single additional parameter suffices
to describe the evolutionary process operating on the tree; furthermore, in this
case, the sites still evolve under the i.i.d. assumption.
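
The gamma-rates assumption can be made concrete in a few lines (a sketch
under the usual convention that rates have mean 1, so a single shape
parameter alpha captures all of the variation across sites):

```python
import random

def sample_site_rates(num_sites: int, alpha: float) -> list:
    """Draw i.i.d. per-site rate multipliers from a gamma distribution with
    shape alpha and scale 1/alpha, hence mean 1; small alpha means strong
    rate heterogeneity. A site with rate r evolves on an edge of length t
    as if the edge had length r * t."""
    return [random.gammavariate(alpha, 1.0 / alpha) for _ in range(num_sites)]
```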
Tree generation models typically have parameters regulating speciation rates,
but also inheritance characteristics, etc. For more on stochastic models of (se-
quence) evolution, see Felsenstein (1981), Kim and Warnow (1999), Li (1997),
and Swofford et al. (1996); for an interesting discussion of models of tree gen-
eration, see Heard (1996) and Mooers and Heard (1997).
By studying the performance of methods under explicit stochastic models
of evolution, it becomes possible to assess the relative strengths of different
methods, as well as to understand how methods can fail. Such studies can
be theoretical, for instance proving statistical consistency: given long enough
sequences, the method will return the true tree with arbitrarily high proba-
bility. Others can use simulations to study the performance of the methods
under conditions closely approximating practice. In a simulation, sequences
are evolved down different model trees and then given to different methods for
reconstruction; the reconstructions can then be compared against the model
trees that generated the data. Such studies provide important quantifications
of the relative merits of phylogenetic reconstruction methods.
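
The chapter does not spell out a comparison measure here; a standard choice
is the Robinson-Foulds distance, which counts the bipartitions (splits of
the leaf set induced by removing an edge) present in one tree but not the
other. A minimal sketch, assuming each tree has already been summarized as
its set of nontrivial splits:

```python
def robinson_foulds(splits_true: set, splits_est: set) -> int:
    """Robinson-Foulds distance: the size of the symmetric difference of
    the two trees' sets of nontrivial bipartitions."""
    return len(splits_true ^ splits_est)

# Example: each unrooted binary tree on four leaves has exactly one
# nontrivial split; these two trees disagree, giving the maximum value 2.
t1 = {frozenset({frozenset("AB"), frozenset("CD")})}
t2 = {frozenset({frozenset("AC"), frozenset("BD")})}
print(robinson_foulds(t1, t2))  # -> 2
```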
1.2 Phylogeny reconstruction from molecular sequences
Three main types of methods are used to reconstruct phylogenies from molec-
ular sequences: distance-based methods, maximum parsimony heuristics, and
maximum likelihood heuristics.
1.2.1 Distance-based methods
Of the three types of methods, only distance-based methods include algorithms
that run in polynomial time. Distance-based methods operate in two phases:
(1) Pairwise distances between every pair of taxa are estimated.
(2) An algorithm is applied to the matrix of pairwise distances to compute
an edge-weighted tree T .
The statistical consistency (if any) of such two-phase procedures rests on two
assumptions: first, that a statistically consistent distance estimator is used in
the first phase and, second, that an appropriate distance-based algorithm is
used in the second phase. The requirement that the first phase be statisti-
cally consistent means that the distance estimator should return a value that
approaches the expected number of times a random site changes on the path
between the two taxa. Thus, the estimation of pairwise distances must be done
with respect to some assumed stochastic model of evolution. As an example,
in the Jukes-Cantor model of evolution, the estimated distance between
sequences $s_i$ and $s_j$ is given by the formula
$$d_{ij} = -\frac{3}{4}\ln\left(1 - \frac{4}{3}\frac{H_{ij}}{k}\right),$$
where $k$ is the sequence length and $H_{ij}$ denotes the Hamming distance
(the number of positions in which $s_i$ and $s_j$ differ, which is the edit
distance under mutation operations).
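
A minimal sketch of this correction applied to two aligned sequences (the
function name is ours; note that the estimate diverges as the normalized
Hamming distance approaches 3/4, the saturation point of the model):

```python
import math

def jc_distance(s_i: str, s_j: str) -> float:
    """Jukes-Cantor corrected distance between two aligned sequences of
    equal length: the estimated expected number of changes per site."""
    k = len(s_i)
    p = sum(a != b for a, b in zip(s_i, s_j)) / k  # normalized Hamming distance
    if p >= 0.75:                                  # correction undefined: saturated
        return float("inf")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)
```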
Algorithms that attempt to reconstruct trees from distance matrices are guar-
anteed to produce accurate reconstructions of the trees only when the distance
matrix entries approach very closely the actual number of changes between the
pair of sequences. (In the context of estimating model trees, this requirement
means that the estimated distances need to be extremely close to the model
distances, defined to be the expected number of times a random site changes
on a leaf-to-leaf path. See Atteson (1999) and Kim and Warnow (1999) for
more on this issue.) Naïvely defined distances, such as the Hamming dis-
tance, typically underestimate the number of changes that took place in the
evolutionary history; thus the first step of a distance-based method is to cor-
rect the naïvely defined distance into one that accurately accounts for the
expected number of unseen back-and-forth changes in a site. Such corrections
are not without problems: as the measured distance grows larger, the variance
in the estimator increases, causing increasing errors in reconstruction.
The most commonly used, and simplest, distance-based method is the neighbor-
joining (NJ) algorithm of Saitou and Nei (1987); improved versions of this ba-
sic method include BioNJ (Gascuel, 1997) and a version known as Weighbor,
which requires an estimate of the variance of the distance estimator (Bruno
et al., 2000). NJ is known to be statistically consistent under most models of
evolution.
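
For illustration, the pair-selection criterion at the heart of one
neighbor-joining step can be sketched as follows (the agglomeration and
distance-update steps that complete the algorithm are omitted):

```python
def nj_pick_pair(D: list) -> tuple:
    """Pick the pair (i, j) minimizing the neighbor-joining criterion
    Q(i, j) = (n - 2) * D[i][j] - sum_k D[i][k] - sum_k D[j][k],
    which corrects raw distances for each taxon's average divergence."""
    n = len(D)
    row_sums = [sum(row) for row in D]
    best_q, best_pair = float("inf"), (0, 1)
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_q, best_pair = q, (i, j)
    return best_pair
```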
1.2.2 Maximum Parsimony
Parsimony-based methods seek the tree, along with sequences labelling its in-
ternal nodes, that together minimize the total number of evolutionary changes
(viewed as distances summed along all edges of the tree). Put formally, the
problem is as follows: Given a set $S$ of sequences in a multiple alignment,
each of length $k$, find a tree $T$ and a set of additional sequences $S'$,
all also of length $k$, so that, with the leaves of $T$ labelled by $S$ and
its internal nodes by $S'$, the value
$$\sum_{e \in E(T)} \mathrm{Hamming}(e)$$
is minimized, where $\mathrm{Hamming}(e)$ denotes the Hamming distance
between the sequences labelling the endpoints of $e$. (Weighted or
distance-corrected versions can also be defined.)
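
Scoring a fully labelled tree under this objective is straightforward; the
hard part of MP is the search over trees and internal labels. A minimal
sketch (the representation and names are ours):

```python
def hamming(u: str, v: str) -> int:
    """Number of positions in which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(u, v))

def parsimony_length(edges: list, labels: dict) -> int:
    """Total parsimony length of a tree whose nodes, leaves and internal
    nodes alike, already carry sequences: the sum over all edges of the
    Hamming distance between the endpoint labels."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)
```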
The maximum parsimony problem (MP) is thus an optimization problem—
and a hard one: finding the best tree is provably NP-hard (Day, 1983). This
property effectively rules out exact solutions for all but the smallest instances;
indeed, in practice, exact solvers run within reasonable time on at most 30
taxa. Thus heuristics are the normal approach to the problem; most are based
on iterative improvement techniques and appear to return very good solu-
tions for up to a few hundred taxa. Many software packages implement such
heuristics, among them MEGA (Kumar et al., 2001), PAUP* (Swofford, 2001),
Phylip (Felsenstein, 1993), and TNT (Goloboff, 1999).
1.2.3 Maximum Likelihood
Like maximum parsimony, maximum likelihood (ML) is an optimization prob-
lem. ML seeks the tree and associated model parameter values that maximize
the probability of producing the given set of sequences. ML thus depends ex-
plicitly on the assumed model of evolution. For example, the ML problem
under the Jukes-Cantor model needs to estimate one parameter (the substi-
tution probability) for each edge of the tree, while under the General Markov
model 12 parameters must be estimated on each edge. ML is much more com-
putationally expensive than MP: even the problem of point estimation (scoring
a tree), i.e., finding optimal edge parameters, for the simplest (Jukes-Cantor)
model of evolution on a fixed tree is of unknown computational complexity,
and computationally expensive without being provably accurate in practice
(see Steel (1994) for a discussion), whereas it is easily accomplished in linear
time for MP using Fitch’s algorithm (Fitch, 1977). Provably correct solutions
to ML are currently limited to some special cases of four-leaf model trees,
exhaustive searches through tree space that use heuristics for scoring trees
are limited to about ten taxa, and heuristic searches through tree space us-
ing similar heuristics for scoring trees are typically limited to fewer than 100
taxa. Various software packages provide heuristics for ML, including PAUP*
(Swofford, 2001), Phylip (Felsenstein, 1993), FastDNAml (Olsen et al., 1994),
PhyML (Guindon and Gascuel, 2003), and TrExML (Wolf et al., 2000).
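
To make the contrast with ML scoring concrete, here is a minimal sketch of
Fitch's algorithm for a single site on a rooted binary tree (given as a
nested tuple with leaf names at the tips), which computes the optimal
parsimony score of a fixed tree in time linear in the number of nodes:

```python
def fitch_score(tree, leaf_states: dict) -> int:
    """Minimum number of state changes for one site on a fixed rooted
    binary tree, by Fitch's bottom-up state-set computation."""
    def post_order(node):
        if isinstance(node, str):                 # leaf: singleton state set
            return {leaf_states[node]}, 0
        left_set, left_cost = post_order(node[0])
        right_set, right_cost = post_order(node[1])
        if left_set & right_set:                  # intersection: no extra change
            return left_set & right_set, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1
    return post_order(tree)[1]

# Example: the tree ((a,b),(c,d)) with states A, A, G, G needs one change.
print(fitch_score((("a", "b"), ("c", "d")),
                  {"a": "A", "b": "A", "c": "G", "d": "G"}))  # -> 1
```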
1.3 Performance issues
Methods can be compared in terms of their performance guarantees, in terms
of their resource requirements, and in terms of the quality of the trees they
produce. Very few methods offer any performance guarantees, except in purely
theoretical terms. For instance, while ML is known to be statistically consistent
under most models, the same cannot be said of its heuristic implementation;
and even neighbor-joining, which is statistically consistent and is implemented
exactly, may return very poor trees—the guarantee of statistical consistency
only implies good performance in the limit, as sequence lengths become suffi-
ciently large. In terms of computational requirements, the comparison is easy:
distance-based methods are efficient (running in polynomial time with low co-
efficients); parsimony is much harder to solve (systematists are accustomed
to running MP for weeks on a dataset of modest size); and maximum likeli-
hood is much harder again than MP. These comparisons, however, all have
limited value: as we saw, statistical consistency is a very weak guarantee,
while a guarantee of fast running times is worthless if the returned solution
is poor. Thus experimental studies are our best tool in the study of the rela-
tive performance of methods. Simulation studies, in particular, can establish
the absolute accuracy of methods (whereas studies conducted with biological
datasets can only assess relative performance in terms of the optimization cri-
terion). Such studies have shown that MP methods can produce reasonably
good trees under conditions where neighbor-joining can have high topologi-
cal error (significantly worse than MP); this possibly surprising performance
holds under many realistic model conditions—in particular, when the model
tree has a high evolutionary diameter (Moret et al., 2002b; Nakhleh et al.,
References

Felsenstein, J., 1981. Evolutionary trees from DNA sequences: a maximum
likelihood approach. J. Mol. Evol. 17, 368–376.

Guindon, S., Gascuel, O., 2003. A simple, fast, and accurate algorithm to
estimate large phylogenies by maximum likelihood. Syst. Biol. 52 (5),
696–704.

Kumar, S., Tamura, K., Jakobsen, I.B., Nei, M., 2001. MEGA2: molecular
evolutionary genetics analysis software. Bioinformatics 17 (12), 1244–1245.

Saitou, N., Nei, M., 1987. The neighbor-joining method: a new method for
reconstructing phylogenetic trees. Mol. Biol. Evol. 4 (4), 406–425.