PartitionFinder: Combined Selection of Partitioning Schemes and Substitution Models for Phylogenetic Analyses


HAL Id: lirmm-00705211
https://hal-lirmm.ccsd.cnrs.fr/lirmm-00705211
Submitted on 16 Jun 2021
Distributed under a Creative Commons Attribution 4.0 International License
Stéphane Guindon, Robert Lanfear, Brett Calcott, Simon Y.W. Ho
To cite this version:
Stéphane Guindon, Robert Lanfear, Brett Calcott, Simon Y.W. Ho. PartitionFinder: Combined Selection
of Partitioning Schemes and Substitution Models for Phylogenetic Analyses. Molecular Biology
and Evolution, Oxford University Press (OUP), 2012, 29 (6), pp.1695-1701. ⟨10.1093/molbev/mss020⟩.
⟨lirmm-00705211⟩

PartitionFinder: Combined Selection of Partitioning Schemes
and Substitution Models for Phylogenetic Analyses
Robert Lanfear,*,1 Brett Calcott,1,2 Simon Y. W. Ho,3 and Stephane Guindon4
1 Centre for Macroevolution and Macroecology, Ecology Evolution and Genetics, Research School of Biology, Australian National
University, Canberra, Australian Capital Territory, Australia
2 Philosophy Program, Research School of Social Sciences, Australian National University, Canberra, Australian Capital Territory,
Australia
3 School of Biological Sciences, University of Sydney, Sydney, New South Wales, Australia
4 Department of Statistics, University of Auckland, Auckland, New Zealand
*Corresponding author: E-mail: rob.lanfear@anu.edu.au.
Associate editor: Sudhir Kumar
Abstract
In phylogenetic analyses of molecular sequence data, partitioning involves estimating independent models of molecular
evolution for different sets of sites in a sequence alignment. Choosing an appropriate partitioning scheme is an important
step in most analyses because it can affect the accuracy of phylogenetic reconstruction. Despite this, partitioning schemes
are often chosen without explicit statistical justification. Here, we describe two new objective methods for the combined
selection of best-fit partitioning schemes and nucleotide substitution models. These methods allow millions of partitioning
schemes to be compared in realistic time frames and so permit the objective selection of partitioning schemes even for
large multilocus DNA data sets. We demonstrate that these methods significantly outperform previous approaches,
including both the ad hoc selection of partitioning schemes (e.g., partitioning by gene or codon position) and a recently
proposed hierarchical clustering method. We have implemented these methods in an open-source program,
PartitionFinder. This program allows users to select partitioning schemes and substitution models using a range of
information-theoretic metrics (e.g., the Bayesian information criterion, Akaike information criterion [AIC], and corrected
AIC). We hope that PartitionFinder will encourage the objective selection of partitioning schemes and thus lead to
improvements in phylogenetic analyses. PartitionFinder is written in Python and runs under Mac OSX 10.4 and above. The
program, source code, and a detailed manual are freely available from www.robertlanfear.com/partitionfinder.
Key words: partitioning, AIC, BIC, AICc, model selection, molecular evolution.
Introduction
Molecular phylogenetics provides a wealth of important
information for evolutionary biologists. However, the accuracy
of molecular phylogenetic inference depends on having an
appropriate model of molecular evolution (Sullivan and
Joyce 2005; Simon et al. 2006). Because of this, there is a great
deal of interest in developing methods to select evolutionary
models and assess their adequacy (Ripplinger and Sullivan
2010; Jayaswal et al. 2011; Nguyen et al. 2011). The goal of
model selection is to identify a model that is sufficiently
complex to capture the evolutionary processes that have
occurred but to avoid models with more parameters than
can be reliably estimated from the available data
(overparameterization). One of the most important aspects of
models of molecular evolution is how they account for
variation in evolutionary processes among the sites of an
alignment, because the failure to correctly account for this
variation can seriously mislead phylogenetic analyses
(Buckley et al. 2001; Telford and Copley 2011).
There are two ways to incorporate the variation in
evolutionary processes among different sites using
currently available phylogenetic methods: mixture models
and partitioning. With mixture models, the likelihood of
each site is calculated under more than one substitution
model (e.g., Le et al. 2008). The parameters of these
substitution models, as well as the probability with which
each model applies to each site, can be determined directly
from the data (Pagel and Meade 2004). With partitioning,
the user first groups together sites that are assumed to have
evolved under similar processes and then estimates inde-
pendent (i.e., unlinked) substitution models for each group
of sites (e.g., Nylander et al. 2004; Brandley et al. 2005;
McGuire et al. 2007). In contrast to mixture models, par-
titioning requires the a priori definition of appropriate
groups of sites. Although mixture models are implemented
in an increasing variety of phylogenetic software (e.g., Pagel
and Meade 2004; Stamatakis 2006; Le et al. 2008), partition-
ing remains by far the most common approach to
incorporating heterogeneity in evolutionary processes
among sites (Blair and Murphy 2011).
Choosing an appropriate partitioning scheme is a central
problem for most phylogenetic analyses (Brandley et al.
2005; Shapiro et al. 2006; McGuire et al. 2007; Li et al.
2008; Blair and Murphy 2011). Typically, phylogeneticists
use their biological intuition to group together similar sites
in an alignment into putatively homogeneous data blocks.
© The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please
e-mail: journals.permissions@oup.com
Mol. Biol. Evol. 29(6):1695–1701. 2012 doi:10.1093/molbev/mss020 Advance Access publication January 20, 2012
Research article
This often involves defining data blocks on the basis of
genes and codon positions (e.g., Shapiro et al. 2006; Ho
and Lanfear 2010). For example, in an analysis of four
protein-coding genes, one could define 12 data blocks—
one for each codon position in each gene. This
approach is biologically justified because differences be-
tween codon positions and genes are expected to account
for much of the heterogeneity in evolutionary processes
among sites (Shapiro et al. 2006). However, many studies
have shown that this approach can lead to overparamete-
rization, and that phylogenetic reconstruction can be
improved by merging certain data blocks together, thus de-
fining a partitioning scheme that requires the estimation of
fewer independent substitution models (Brandley et al.
2005; Brown and Lemmon 2007; McGuire et al. 2007; Li
et al. 2008). For example, the second codon positions in
two similar nuclear genes may experience similar rates
and patterns of substitution and so might be better ana-
lyzed together rather than independently. Of course, it is
not always straightforward to identify which data blocks
should be merged and which should be analyzed indepen-
dently. One solution to this problem is to compare all
possible partitioning schemes for a given data set. However,
this approach is usually computationally intractable
because the number of possible partitioning schemes is
astronomical even for relatively small numbers of data
blocks (Li et al. 2008). As a result, most researchers either
choose a single partitioning scheme a priori or select the
best-fit scheme from a handful of candidate schemes
(Brandley et al. 2005; McGuire et al. 2007). Thus, despite
significant advances in phylogenetic methods in recent
years, the accuracy of the inferences we can make from
partitioned phylogenetic analyses remains limited by our
ability to select appropriate partitioning schemes.
In this study, we describe two new methods that solve
many of the problems associated with selecting partition-
ing schemes. These methods increase the efficiency of com-
paring partitioning schemes by many orders of magnitude,
allowing many millions of schemes to be compared in re-
alistic time frames. We describe these new methods below
and assess their performance on a range of published data
sets. We show that our methods select significantly better
partitioning schemes than previous approaches—including
the ad hoc selection of partitioning schemes and previously
suggested objective approaches. We have implemented
these methods in an open-source program, PartitionFinder.
This program has flexible options and allows users to effi-
ciently and objectively find best-fit partitioning schemes
and nucleotide substitution models, even for large data
sets. PartitionFinder, its source code, and a detailed manual
are available from www.robertlanfear.com/partitionfinder.
Materials and Methods
We use the following definitions throughout this article.
We define a “data block” as a user-defined set of sites
in an alignment; a “subset” as a set of one or more data
blocks; and a “partitioning scheme” as a set of subsets that
includes all sites in the alignment once and only once. For
clarity, we avoid the use of the term “partition,” as this has
different and potentially very confusing meanings in the
mathematical and molecular phylogenetics literature (in
the mathematical literature, a partition is equivalent to
our use of “partitioning scheme” here, whereas in the
molecular phylogenetics literature, it is equivalent to our
use of “subset” here). In the majority of cases, users will
specify data blocks based on genes and codon positions—
for example, by defining 12 data blocks for an alignment of
four protein-coding genes. The sites in a data block need
not be contiguous in the alignment, but a single site can be
a member of only one data block. A subset can comprise
a single data block (e.g., first codon sites from a protein-
coding gene) or multiple data blocks (e.g., first and second
codon sites from a protein-coding gene). For example,
consider an alignment of four protein-coding genes for
which the user has defined 12 data blocks, one for each
codon position in each gene. One possible partitioning
scheme for this data set involves treating each codon
position in each gene independently. This partitioning
scheme has 12 subsets, and so 12 unlinked substitution
models would be estimated from the data during the
phylogenetic analysis. Another possible partitioning
scheme involves treating each codon position indepen-
dently but merging the codon positions across genes. This
partitioning scheme has three subsets (one for each codon
position), and so three unlinked substitution models would
be estimated from the data during the phylogenetic anal-
ysis. The challenge is to find the best-fit partitioning
scheme for a given nucleotide alignment, given the prede-
fined set of data blocks.
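
The definitions above can be illustrated with a minimal Python sketch. The gene names, the block labels, and the helper `is_valid_scheme` are hypothetical and are not taken from PartitionFinder itself; the sketch only shows one way to represent data blocks, subsets, and partitioning schemes, and to check that a scheme includes every data block once and only once.

```python
from itertools import chain

def is_valid_scheme(data_blocks, scheme):
    """Return True if `scheme` (an iterable of subsets) contains every
    data block exactly once."""
    assigned = list(chain.from_iterable(scheme))
    no_duplicates = len(assigned) == len(set(assigned))
    covers_all = set(assigned) == set(data_blocks)
    return no_duplicates and covers_all

# Twelve data blocks: one per codon position in each of four
# hypothetical protein-coding genes.
genes = ["geneA", "geneB", "geneC", "geneD"]
data_blocks = [f"{g}_pos{p}" for g in genes for p in (1, 2, 3)]

# Scheme with 12 subsets: every data block is treated independently.
by_block = [frozenset([b]) for b in data_blocks]

# Scheme with 3 subsets: codon positions merged across genes.
by_codon = [
    frozenset(b for b in data_blocks if b.endswith(f"pos{p}"))
    for p in (1, 2, 3)
]

print(len(by_block), is_valid_scheme(data_blocks, by_block))
print(len(by_codon), is_valid_scheme(data_blocks, by_codon))
```

Using `frozenset` for subsets makes them hashable, which is convenient if subsets are later used as dictionary keys or cache keys.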
The number of possible partitioning schemes for a set of
n data blocks is equivalent to the number of ways of
putting n different-colored balls into one or more indistin-
guishable boxes. This relationship is known as a Bell
number (Bell 1934) and can be described by the following
relationship, where $B_n$ is the number of possible partitioning
schemes given n user-defined data blocks (Li et al.
2008), and the curly brackets define a Stirling number of
the second kind:
$$B_n = \sum_{k=0}^{n} {n \brace k}.$$
The number of possible partitioning schemes can be
astronomical even for relatively modest data sets. For example,
in an analysis of four protein-coding genes (4 genes × 3
codon positions = 12 data blocks), there are
$B_{12} = 4.2 \times 10^{6}$ possible partitioning schemes,
and for an analysis of 20 protein-coding genes (20 genes × 3
codon positions = 60 data blocks), there are
$B_{60} = 9.8 \times 10^{59}$ possible partitioning schemes.
The set of partitioning schemes will be made up of
a smaller number of possible subsets because most subsets
will be included in many different partitioning schemes.
Specifically, the number of possible subsets, $S_n$, that can be
created from a set of n user-defined data blocks is the
number of possible nonempty subsets that can be
generated from a set of size n:
$$S_n = 2^n - 1.$$
For example, in an analysis of four protein-coding genes
(12 data blocks), there are $S_{12} = 4{,}095$ possible subsets, and
in an analysis of 20 protein-coding genes (60 data blocks),
there are $S_{60} = 1.2 \times 10^{18}$ possible subsets.
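
The counts quoted above can be checked with a short Python sketch. It builds the Stirling numbers of the second kind with the standard recurrence S(m, k) = k·S(m−1, k) + S(m−1, k−1) and sums them to obtain the Bell number. The function names are illustrative only and are not part of PartitionFinder.

```python
def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)], the Stirling numbers of the
    second kind for a set of size n."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        new = [0] * (m + 1)
        for k in range(1, m + 1):
            above = row[k] if k < len(row) else 0  # S(m-1, k)
            new[k] = k * above + row[k - 1]        # + S(m-1, k-1)
        row = new
    return row

def bell(n):
    """Number of partitioning schemes for n data blocks."""
    return sum(stirling2_row(n))

def n_subsets(n):
    """Number of nonempty subsets of n data blocks."""
    return 2 ** n - 1

for n in (12, 60):
    print(n, f"{bell(n):.3g}", f"{n_subsets(n):.3g}")
```

Python integers have arbitrary precision, so the 60-digit value of the Bell number for 60 data blocks is computed exactly before being formatted for display.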
The PartitionFinder Algorithm
Previous approaches to comparing partitioning schemes
have been both labor-intensive and computationally inten-
sive because they have required a full likelihood or Bayesian
analysis for each partitioning scheme under consideration
(see e.g., McGuire et al. 2007; Li et al. 2008). This has fun-
damentally limited the number of partitioning schemes
that have been compared in most studies, as comparing
large numbers (e.g., hundreds) of partitioning schemes
in this way is simply not feasible for most data sets. This
approach is also highly inefficient because it involves re-
peatedly recalculating the likelihood of every site in the
alignment, despite the fact that the substitution models
applied to those sites will be the same for many partition-
ing schemes. The PartitionFinder algorithm improves the
efficiency of finding best-fit partitioning schemes by calcu-
lating the log likelihood of each subset of sites only once.
The log likelihood of each partitioning scheme is then cal-
culated by summing the log likelihoods of the subsets that
make up that scheme.
An outline of the PartitionFinder algorithm is as follows:
1. Estimate a phylogenetic tree of sequences;
2. Select the best-fit substitution model for each possible
subset;
3. Calculate the log likelihood of each partitioning scheme by
summing the log likelihoods of the subsets that make up
that scheme;
4. Select a partitioning scheme using information-theoretic
metrics.
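
The efficiency gain in steps 2 and 3 comes from evaluating each subset once and reusing the result in every scheme that contains it. The following Python sketch illustrates that idea with memoization. The per-block log likelihoods and the merging penalty are made-up placeholder numbers standing in for the model fitting that the program delegates to PhyML, and `subset_lnl` and `scheme_lnl` are hypothetical names.

```python
from functools import lru_cache

# Made-up log likelihoods for three data blocks (placeholder values only).
BLOCK_LNL = {"b1": -1200.0, "b2": -950.0, "b3": -2100.0}
stats = {"evaluations": 0}

@lru_cache(maxsize=None)
def subset_lnl(subset):
    """Toy stand-in for fitting the best model to one subset on the
    fixed tree. Cached, so each distinct subset is evaluated once."""
    stats["evaluations"] += 1
    # Placeholder rule: merging costs 5 log-likelihood units per extra block.
    return sum(BLOCK_LNL[b] for b in subset) - 5.0 * (len(subset) - 1)

def scheme_lnl(scheme):
    """Log likelihood of a scheme = sum over its subsets (step 3)."""
    return sum(subset_lnl(s) for s in scheme)

scheme_a = (frozenset({"b1"}), frozenset({"b2"}), frozenset({"b3"}))
scheme_b = (frozenset({"b1", "b2"}), frozenset({"b3"}))

print(scheme_lnl(scheme_a), scheme_lnl(scheme_b), stats["evaluations"])
```

The two schemes contain five subsets in total, but only four distinct ones, so the expensive function runs four times. Summing subset likelihoods in this way is valid only because the tree topology is held fixed, as the text explains.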
All likelihood calculations are performed using a modi-
fied version of PhyML 3.0 (Guindon et al. 2010), available
from the authors and as part of the PartitionFinder pro-
gram. Tree estimation (step 1) is performed using the BioNJ
algorithm implemented in PhyML 3.0 (Guindon et al. 2010),
using the combined data from all of the user-defined data
blocks. PartitionFinder also allows the user to specify a tree
topology for step 1. The tree topology from step 1 is then
fixed for the rest of the analysis. This differs from previous
approaches, which coestimate the tree topology and the
likelihood of each partitioning scheme. This is a computa-
tionally intensive method that has limited the number of
partitioning schemes that can be compared (see above).
Using a fixed tree topology allows likelihoods from different
subsets to be combined, which increases the efficiency by
many orders of magnitude and allows many millions of par-
titioning schemes to be compared in a single run. Fixing the
tree topology is unlikely to adversely affect the results of
comparing partitioning schemes, as previous studies have
shown that doing so does not affect the results of model
selection procedures as long as a nonrandom tree topology
is used (Posada and Crandall 2001).
Model selection (step 2) is performed on a user-specified
set of up to 56 substitution models from the general time
reversible (GTR) family, and our approach is similar to
other model selection algorithms (e.g., Keane et al. 2006;
Posada 2008). During model selection, we first calculate
the likelihood of each candidate substitution model,
conditioned on the tree topology from step 1. We then
select the best-fit model according to one of three
user-specified information-theoretic metrics: the Akaike
information criterion (AIC), the corrected AIC (AICc), or
the Bayesian information criterion (BIC) (Sullivan and Joyce
2005). PartitionFinder implements almost all of the models
of nucleotide evolution included in the most commonly
used phylogenetic tree estimation programs such as PhyML
(Guindon et al. 2010), RAxML (Stamatakis 2006), MrBayes
(Ronquist and Huelsenbeck 2003), and BEAST (Drummond
and Rambaut 2007). This means that the output from
PartitionFinder can be used to directly set up a phylogenetic
analysis in any of these programs. However, all of these
models and programs assume that the data evolved under
a time-reversible, stationary, and homogeneous process,
and they should not be used if the data violate any of these
assumptions.
PartitionFinder includes an option for either linked or
unlinked branch lengths between subsets. When branch
lengths are linked, step 1 includes the reestimation of
branch lengths on the BioNJ topology using a GTR
substitution model, with a proportion of invariant sites
and gamma distributed rates across sites estimated from
the data. The likelihood of each model for each subset (step
2) is then calculated conditioned on this topology and
these branch lengths, with each model afforded an
independent rate multiplier that can increase or decrease
all branch lengths by the same factor. Thus, linked branch
lengths allow for subset-specific substitution rates, but all
subsets share a single set of relative branch lengths. By
contrast, when branch lengths are unlinked, model selec-
tion (step 2) is conditioned on the topology from step 1,
but all branch lengths are estimated independently for each
model in each subset.
The log likelihood of each partitioning scheme (step 3) is
calculated by summing the log likelihoods of the best-fit
model for each subset in the partitioning scheme. Finally,
the best-fit partitioning scheme is selected (step 4) using
one of three information-theoretic measures: the AIC,
AICc, or BIC.
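
Steps 3 and 4 can be sketched in Python using the standard textbook definitions of the three metrics. This is an illustration under stated assumptions, not the program's own code: the sample size is taken to be the number of alignment sites, and the parameter count is the sum of the per-subset model parameters plus an optional `shared_params` term (for example, branch lengths counted once when they are linked). How PartitionFinder itself counts parameters is not described in this section, so both choices are assumptions.

```python
import math

def aic(lnl, k):
    """Akaike information criterion."""
    return 2 * k - 2 * lnl

def aicc(lnl, k, n):
    """AIC with the small-sample correction; requires n > k + 1."""
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)

def bic(lnl, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * lnl

def score_scheme(subsets, n_sites, metric="bic", shared_params=0):
    """Score a scheme from (log likelihood, parameter count) pairs,
    one pair per subset. Lower scores are better for all three metrics."""
    lnl = sum(l for l, _ in subsets)
    k = sum(p for _, p in subsets) + shared_params
    if metric == "aic":
        return aic(lnl, k)
    if metric == "aicc":
        return aicc(lnl, k, n_sites)
    if metric == "bic":
        return bic(lnl, k, n_sites)
    raise ValueError(f"unknown metric: {metric}")

# Hypothetical numbers: two subsets with 9 and 5 free model parameters.
print(score_scheme([(-2155.0, 9), (-2100.0, 5)], n_sites=3000, metric="bic"))
```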
A Greedy Heuristic Algorithm to Search for
Partitioning Schemes
Even using the algorithm described above, exhaustive
searches on desktop computers are practically limited to
data sets for which 12 or fewer data blocks are defined
(corresponding to data sets with 4.2 million or fewer pos-
sible partitioning schemes). Therefore, heuristic searches
among partitioning schemes are necessary for larger data
sets, even though they cannot be guaranteed to find the
optimum partitioning scheme (Li et al. 2008).
The heuristic search algorithm we describe below incor-
porates the increases in efficiency described above but
hugely reduces the number of partitioning schemes that
need to be considered for a given data set. Our method
builds on a recently proposed method (Li et al. 2008) that
involves estimating GTR+G model parameters for each
data block and then progressively merging the data blocks
with the most similar parameter estimates using hierarchical
cluster analysis. For a set of n data blocks, the hierarchical
clustering method objectively defines n partitioning schemes
that range from having n subsets (all data blocks treated in-
dependently) to having a single subset (all data blocks
merged together). The optimal scheme is then selected from
this set of n schemes using an information-theoretic metric
(e.g., the AIC, AICc, or BIC).
Because the hierarchical clustering approach combines
data blocks based on model parameter estimates, it relies
on those parameter estimates being accurate. For many
data blocks, there will be limited information available
for estimating many of the GTR+G model parameters. This
will result in these estimates being associated with high
variance because the value of the parameters will have
very little effect on the overall likelihood score. Since the
subsequent hierarchical clustering method treats all
parameters as equally important, uncertain parameter es-
timates might limit the ability of the hierarchical clustering
approach to find optimal partitioning schemes. The
algorithm we propose below overcomes this limitation
by merging data blocks based directly on information-
theoretic comparisons between partitioning schemes.
These metrics are calculated directly from the likelihood
so they implicitly incorporate the relative importance
of different model parameters and so avoid problems
associated with error-prone parameter estimates.
In an analysis with n data blocks, our greedy heuristic
algorithm begins by calculating the information-theoretic
score (e.g., AIC, AICc, or BIC) of the partitioning scheme
with n subsets, that is, the scheme in which each data block
is treated independently ($P_{\mathrm{start}}$). It then calculates the score
of all partitioning schemes with n − 1 subsets, that is, all
schemes that merge two subsets of $P_{\mathrm{start}}$, and selects the
scheme with the best score ($P_{\mathrm{merged}}$). If $P_{\mathrm{merged}}$ has a better
score than $P_{\mathrm{start}}$, $P_{\mathrm{merged}}$ replaces $P_{\mathrm{start}}$, and the algorithm
iterates. The algorithm continues until either $P_{\mathrm{merged}}$ does
not have a better score than $P_{\mathrm{start}}$, or until all data blocks
have been merged into one subset. This process results in
a greedy hill-climbing algorithm that optimizes the
information-theoretic score of interest while searching
for partitioning schemes.
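
The greedy search described above can be sketched in a few lines of Python. The function `greedy_search` and the toy scoring function are hypothetical: the toy score penalizes each extra subset and penalizes merging blocks with dissimilar made-up "rates", standing in for an information-theoretic score computed from real likelihoods. Lower scores are treated as better, as with AIC, AICc, and BIC.

```python
from itertools import combinations

def greedy_search(data_blocks, score):
    """Greedy hill-climbing over partitioning schemes.
    `score(scheme)` returns a number where lower is better."""
    # Start with every data block in its own subset.
    current = frozenset(frozenset([b]) for b in data_blocks)
    current_score = score(current)
    while len(current) > 1:
        # Score every scheme that merges two subsets of the current scheme.
        candidates = []
        for a, b in combinations(current, 2):
            merged = (current - {a, b}) | {a | b}
            candidates.append((score(merged), merged))
        best_score, best = min(candidates, key=lambda c: c[0])
        if best_score >= current_score:
            break  # no merge improves the score
        current, current_score = best, best_score
    return current, current_score

# Toy example with made-up rates for four data blocks.
RATES = {"a": 1.0, "b": 1.1, "c": 5.0, "d": 5.2}

def toy_score(scheme):
    misfit = 0.0
    for subset in scheme:
        mean = sum(RATES[x] for x in subset) / len(subset)
        misfit += sum((RATES[x] - mean) ** 2 for x in subset)
    return 100.0 * misfit + 10.0 * len(scheme)

best_scheme, best_score = greedy_search(list(RATES), toy_score)
print(sorted(sorted(s) for s in best_scheme), best_score)
```

In this toy case the search merges the two slow blocks and the two fast blocks and then stops, because merging the remaining two subsets would worsen the score. In practice the scoring function would reuse cached subset likelihoods so that each subset is analyzed only once.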
We can calculate the maximum number of partitioning
schemes ($P_{n\_\mathrm{greedy}}$) that would need to be examined by
this algorithm as follows. In addition to the starting
scheme, each round of the algorithm involves calculating
the likelihood of k choose two schemes, where k is the
number of subsets in the best scheme from the previous
round. In the worst case, the algorithm has to continue
until k = 2, at which point the partitioning scheme under
consideration has all data blocks merged into one subset.
Thus, in an analysis with n data blocks, the maximum
number of partitioning schemes $P_{n\_\mathrm{greedy}}$ considered by
this algorithm is:
$$P_{n\_\mathrm{greedy}} = 1 + \sum_{k=2}^{n} \binom{k}{2} = 1 + \frac{n(n^{2} - 1)}{6}.$$
The maximum number of subsets that need to be
examined by this algorithm ($S_{n\_\mathrm{greedy}}$) is smaller than the
maximum number of partitioning schemes because many
subsets are contained in more than one scheme. $S_{n\_\mathrm{greedy}}$
can be calculated as follows. The starting scheme involves
examining n subsets. In the next round of the algorithm, we
examine all n choose two subsets that merge two data
blocks of the starting scheme. In subsequent rounds, we
need only examine the k − 2 novel subsets that can be
created by merging the most recently created subset with
the remaining subsets in the current partitioning scheme.
Thus, the maximum number of subsets that need to be
considered by this algorithm is:
$$S_{n\_\mathrm{greedy}} = n^{2} - n + 1.$$
The greedy algorithm can be many orders of magnitude
more efficient than an exhaustive search. For instance,
a data set with 60 data blocks requires the analysis of
$B_{60} = 9.77 \times 10^{59}$ partitioning schemes and
$S_{60} = 1.15 \times 10^{18}$ subsets for an exhaustive search, but at most
$P_{60\_\mathrm{greedy}} = 35{,}991$ partitioning schemes and
$S_{60\_\mathrm{greedy}} = 3{,}541$ subsets with the heuristic algorithm described here.
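
As a quick arithmetic check, the worst-case counts for the greedy algorithm can be computed both by direct summation and from the closed-form expressions. The function names below are illustrative only. The subset count is built from the n starting subsets, the n choose two pairwise merges of the first round, and one novel subset for each of the other subsets in the current scheme in every later round.

```python
from math import comb

def max_schemes_greedy(n):
    """Starting scheme plus (k choose 2) schemes for k = n, n-1, ..., 2."""
    return 1 + sum(comb(k, 2) for k in range(2, n + 1))

def max_subsets_greedy(n):
    """n starting subsets, C(n, 2) first-round merges, then
    n-2, n-3, ..., 1 novel merges in the later rounds."""
    return n + comb(n, 2) + sum(range(1, n - 1))

n = 60
print(max_schemes_greedy(n), 1 + n * (n**2 - 1) // 6)
print(max_subsets_greedy(n), n**2 - n + 1)
```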
Comparing Exhaustive and Heuristic Searches in
PartitionFinder
We tested the ability of our heuristic algorithm to find
optimal partitioning schemes for ten data sets obtained
from Data Dryad (www.datadryad.org) and TreeBase
(www.treebase.org; table 1). The data sets we used range
from 13 to 164 taxa, from 1,896 to 9,005 bp, and from 6
to 12 data blocks (table 1). They include a range of introns,
protein-coding genes, and RNA genes from the mitochon-
drial and nuclear genomes and are typical of the multilocus
data sets routinely used for phylogenetic analyses.
For each nucleotide sequence alignment (table 1), we
excluded sites that had been excluded by the authors of
the original study and then defined data blocks based
on genes and codon positions, treating transfer RNAs
(tRNAs) as a single data block. For some data sets, we ex-
cluded certain genes used in the original studies in order to
limit the size of each data set to a maximum of 12 data
blocks, thus permitting an exhaustive search of partitioning
schemes. To find the optimal partitioning scheme, we used
the algorithm described above, implemented in Partition-
Finder, to perform an exhaustive search of all possible par-
titioning schemes on each data set. We then used