RWTY (R We There Yet): An R Package for Examining Convergence of Bayesian Phylogenetic Analyses
Dan L. Warren,*,1 Anthony J. Geneva,2 and Robert Lanfear1,3
1 Department of Biological Sciences, Macquarie University, Sydney, Australia
2 Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA
3 Division of Evolution, Ecology, and Genetics, Australian National University, Canberra, Australia
*Corresponding author: E-mail: dan.l.warren@gmail.com.
Associate Editor: Michael Rosenberg
Abstract
Bayesian inference using Markov chain Monte Carlo (MCMC) has become one of the primary methods used to infer
phylogenies from sequence data. Assessing convergence is a crucial component of these analyses, as it establishes the
reliability of the posterior distribution estimates of the tree topology and model parameters sampled from the MCMC.
Numerous tests and visualizations have been developed for this purpose, but many of the most popular methods are
implemented in ways that make them inconvenient to use for large data sets. RWTY is an R package that implements
established and new methods for diagnosing phylogenetic MCMC convergence in a single convenient interface.
Key words: MCMC, phylogenetics, topology, convergence, stationarity.
Bayesian inference using Markov chain Monte Carlo (MCMC) in phylogenetics involves inferring posterior distributions (e.g., of phylogenetic trees and model parameters) given a set of prior beliefs and a molecular sequence alignment. This approach has become one of the primary methods used to infer phylogenies from molecular data (Ronquist et al. 2012; Bouckaert et al. 2014). However, practical applications of MCMC to phylogenetic problems are complicated by a problem inherent to MCMC methods: it is frequently difficult to determine whether the chain has undergone enough iterations and whether enough samples have been taken to accurately infer the posterior distributions of clades and model parameters.
MCMC methods allow researchers to make inferences about the parameters of interest, such as the phylogenetic tree topology, while integrating out the uncertainty in other parameters, such as the model of molecular evolution (Gilks et al. 1996). It is not necessary to explore the entire space of possible solutions to do this; rather, an MCMC chain is said to have “converged” when further exploration of the solution space does not change the inferred posterior probability distributions beyond some user-specified tolerance. Failure to appropriately diagnose non-convergence can lead to premature termination of chains, resulting in inappropriate estimates of the tree topology, clade support values, and model parameters.
There are attributes of the phylogeny problem that make achieving and assessing convergence difficult. Phylogenetic problems often involve estimating many interacting model parameters (such as rates of evolution and dates of divergence), as well as the topology of the phylogenetic tree (which may itself interact with inferred model parameters). Interactions among continuous parameters can make exploring the space of possible solutions difficult, because efficient exploration requires coordinated changes among more than one parameter. On top of this, exploring the space of all possible phylogenetic tree topologies (“tree space”) is also difficult; the number of possible tree topologies is astronomical even for small numbers of taxa, and adjacent solutions can differ considerably in their posterior probability. As such, tree space can contain local optima (Whidden and Matsen 2015) in which the MCMC can become stuck, and so fail to converge. Interactions among parameters mean that the failure of a single parameter to converge can lead to poor inferences of other parameters. Thus, assessing MCMC convergence requires the analysis of all parameters, including the tree topology itself.
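To make the size of tree space concrete, the short sketch below counts distinct unrooted, fully bifurcating topologies using the standard double-factorial formula (2n − 5)!! for n taxa. The function name is hypothetical and chosen only for illustration; restricting the count to unrooted binary trees is an assumption of this example, not something the text above specifies.

```python
def count_unrooted_topologies(n_taxa: int) -> int:
    """Number of distinct unrooted, fully bifurcating topologies: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 5 (empty product for n = 3).
    for k in range(3, 2 * n_taxa - 4, 2):
        count *= k
    return count


for n in (4, 5, 10, 20, 50):
    print(n, count_unrooted_topologies(n))
```

Even at 10 taxa the count is already in the millions, which is why an MCMC chain can only ever visit a tiny fraction of tree space.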
Initially, MCMC convergence in phylogenetics was primarily diagnosed using plots of log likelihood as a function of chain length (fig. 1, panel C). Although a converged chain will have a relatively flat likelihood trace, this is not a sufficient condition for diagnosing chain convergence; a chain that is stuck on a single local optimum will also produce a relatively flat likelihood trace while potentially being far from convergence. Similarly, a chain that is exploring multiple local optima with similar likelihoods may produce a flat likelihood trace without producing accurate posterior probability distributions. More recently, these methods have been extended, and it is now standard practice to assess convergence by examining the traces and posterior distributions of all continuous parameters in the analysis (Höhna and Drummond 2012; Rambaut et al. 2014). However, these approaches still fail to address a key question: whether or not the MCMC has adequately sampled the space of potential topologies.
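As an illustration of the kind of summary computed for continuous parameters, the sketch below estimates an effective sample size (ESS) from the autocorrelation of a trace. It is a minimal example under stated assumptions: the estimator N / (1 + 2 Σ ρ_k), truncated at the first non-positive autocorrelation, is one simple heuristic and is not claimed to be the estimator used by Tracer, CODA, or RWTY. The function names and the two simulated traces are hypothetical.

```python
import random


def autocorrelation(trace, lag):
    """Sample autocorrelation of `trace` at a given positive lag."""
    n = len(trace)
    mean = sum(trace) / n
    variance = sum((x - mean) ** 2 for x in trace) / n
    if variance == 0:
        return 0.0  # a constant trace carries no autocorrelation signal
    covariance = sum(
        (trace[i] - mean) * (trace[i + lag] - mean) for i in range(n - lag)
    ) / n
    return covariance / variance


def effective_sample_size(trace):
    """ESS = N / (1 + 2 * sum of autocorrelations), truncated at the first
    non-positive autocorrelation (a simple heuristic, assumed here)."""
    n = len(trace)
    rho_sum = 0.0
    for lag in range(1, n):
        rho = autocorrelation(trace, lag)
        if rho <= 0:
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)


rng = random.Random(1)

# Simulated trace of independent draws: ESS should be close to the sample count.
independent = [rng.gauss(0, 1) for _ in range(500)]

# Simulated, strongly autocorrelated trace (AR(1) process): ESS should be much lower.
sticky = [0.0]
for _ in range(499):
    sticky.append(0.95 * sticky[-1] + rng.gauss(0, 1))

print(round(effective_sample_size(independent)), round(effective_sample_size(sticky)))
```

A high ESS for the likelihood or a model parameter says nothing about how well the topology has been sampled, which is the gap described in the paragraph above.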
In order to deal with the shortcomings of convergence diagnostics based on likelihoods and model parameters, AWTY (Are We There Yet, Nylander et al. 2008) provided novel convergence diagnostics based on directly examining
Letter
© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
All rights reserved. For permissions, please e-mail: journals.permissions@oup.com
Mol. Biol. Evol. 34(4):1016–1020 doi:10.1093/molbev/msw279 Advance Access publication January 19, 2017

[Figure 1: eight panels for chains AMOTL2.run1 and AMOTL2.run2. (A) Tree space heatmap for 100 trees; (B) Tree space for 100 trees; (C) LnL trace (LnL against generation); (D) LnL distribution; (E) Tree topology trace (topological distance of tree from focal tree against generation); (F) histogram of topological distance of tree from focal tree; (G) Cumulative Change in Split Frequencies; (H) Cumulative Split Frequencies for 20 clades with Highest WCSF.]
FIG. 1. RWTY plots examining the behavior of individual chains. Data are two chains from Williams et al. (2013). RWTY allows users to visualize the amount of time chains have spent in different areas of tree space both as a heatmap (A) and a scatter plot (B). RWTY also allows users to visualize likelihoods and model parameters (C and D) and tree topology (E and F) as a function of chain length. In panel G, the cumulative change in split frequencies is plotted as a function of chain length, where the solid line gives the mean standard deviation of split frequencies as a function of chain length and increasingly lighter ribbons give the limits of the 75%, 95%, and 100% quantiles. RWTY also plots the cumulative posterior probability estimates for clades as a function of chain length, and highlights splits that are likely to be problematic using a metric (WCSF) that weights changes in split frequencies by their position in the chain (H).

the posterior probabilities of clades as a function of chain length. These diagnostics help users detect when multiple topological optima are being explored, and can help estimate the number of trees necessary to achieve accurate posterior estimates (fig. 1, panel H). Comparison of such plots from multiple replicate chains can further assist users in diagnosing problems with phylogenetic MCMC analyses because well-behaved replicate chains will infer similar posterior distributions. It must be noted that none of these diagnostic plots is sufficient to positively diagnose convergence, but at minimum they represent a much stricter set of necessary conditions for accepting the output of an MCMC when compared to simply examining likelihood or parameter plots alone.
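The idea behind a cumulative clade-probability plot can be sketched in a few lines. The example below is illustrative only: it assumes each sampled tree has already been reduced to a set of clades (each clade a frozenset of taxon names), and the function name and toy chain are hypothetical rather than taken from AWTY or RWTY.

```python
def cumulative_clade_frequency(sampled_trees, clade):
    """Running posterior probability estimate of `clade` after each sample."""
    seen = 0
    running = []
    for i, tree_clades in enumerate(sampled_trees, start=1):
        if clade in tree_clades:
            seen += 1
        running.append(seen / i)
    return running


# Toy data: two competing resolutions of taxa A, B, C, D.
AB = frozenset({"A", "B"})
AC = frozenset({"A", "C"})
tree_1 = {AB}
tree_2 = {AC}

# A hypothetical chain that briefly visits the alternative topology.
chain = [tree_1, tree_1, tree_2, tree_1]

print(cumulative_clade_frequency(chain, AB))
```

Plotted against generation, a curve like this should flatten out as the chain lengthens; a curve that is still drifting suggests the posterior probability of that clade has not yet stabilized.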
The AWTY software (Nylander et al. 2008) has been widely used since its release, and is now part of the standard toolbox of investigators using MCMC methods in phylogenetics. However, the package is only available through an online interface, it runs slowly on large tree files, it does not read tree files from the latest phylogenetic MCMC software, and its code is not open source and so cannot be extended or further developed by the community.
RWTY is a new package that leverages the functionality of R packages for phylogenetics (Paradis et al. 2004; Schliep 2011), statistical analysis (Plummer et al. 2006; Wickham 2007; Schloerke et al. 2011; R Core Team 2016), and visualization (Wickham 2009; Schloerke et al. 2011; de Vries and Ripley 2013; Garnier 2016) to provide a suite of functions for visualizing and analyzing the performance of MCMC chains. RWTY provides a single environment in which to analyze the convergence of all parameters in a phylogenetic MCMC analysis, including continuous parameters and those associated with the tree topology. RWTY accepts input from popular phylogenetic MCMC packages, currently including MrBayes (Ronquist et al. 2012), BEAST (Bouckaert et al. 2014), and RevBayes (Höhna et al. 2016). In addition, trees may be manually loaded from any format that can be coerced into an ape multiPhylo object, and parameter data from any format that can be converted to an R data frame. RWTY provides access to many existing and new methods for assessing convergence of phylogenetic MCMC analyses. For example, it produces plots of the posterior probability of sampled clades similar to those produced by AWTY (Nylander et al. 2008) (fig. 1, panel H), it allows visualization of MCMC exploration of tree space in a manner similar to TreeSetViz (Amenta and Klingner 2002) (fig. 1, panels A and B), and it allows users to examine traces, posterior probability distributions, and effective sample sizes (ESS) of model parameters similar to Tracer (Rambaut et al. 2014) (fig. 1, panels C–F).
RWTY also implements several new methods that focus specifically on assessing the adequacy with which the MCMC has sampled the phylogenetic tree topology space. These include new visualizations of the trace and distribution of tree topologies sampled by the MCMC (Lanfear et al. 2016), visualizations of the similarity of tree topologies sampled by different chains (fig. 2), visualizations of changes in split frequencies within chains and differences in split frequencies between chains as the MCMC progressed, methods to
[Figure 2: three panels for chains AMOTL2.run1, AMOTL2.run2, LHX2.run1, LHX2.run2, TRMT5.run1, and TRMT5.run2. Top: Split frequency comparisons (pairwise scatter plots, with the correlation r and ASDSF reported for each pair of chains). Center: Chain similarity dendrogram (pairwise ASDSF). Bottom: Average Standard Deviation of Split Frequencies (standard deviation of split frequencies against generation).]
FIG. 2. RWTY plots for between-chain comparisons. Data are six MCMC chains from an analysis of three loci, two runs per locus (Williams et al. 2013). In the top panel, posterior probability estimates from each pair of chains are shown in a scatter plot (below diagonal), and summary statistics are given above the diagonal. The center panel shows a chain comparison dendrogram, in which the branch length separating each pair of chains represents the average standard deviation of split frequencies between those chains. In the lower panel, the mean and 75%, 95%, and 100% quantiles of the standard deviation of split frequencies across all chains are plotted against chain length.

calculate the ESS of the tree topologies sampled (Lanfear et al. 2016), and visualizations of the autocorrelation of tree topologies sampled from each chain (Lanfear et al. 2016). All functionality can be accessed directly using single-purpose functions, but users will typically interact with RWTY using a single omnibus function, analyze.rwty. The analyze.rwty function automatically determines which plots are possible with the provided data, and produces an R object containing all plots. Examples of some plot types output by RWTY are presented in figs. 1–3. The software and a comprehensive vignette (supplementary data S1 and S2, Supplementary Material online) are available at https://github.com/danlwarren/RWTY, or from the Comprehensive R Archive Network (https://cran.r-project.org/). R users with an internet connection can install the package using the command “install.packages("rwty")” or “library(devtools); install_github("danlwarren/RWTY")”.
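The between-chain comparison summarized by the average standard deviation of split frequencies (ASDSF) can be sketched as follows. This is a conceptual example, not RWTY's implementation: it assumes each sampled tree is represented as a set of splits (frozensets of taxon names), uses the sample standard deviation (n − 1 denominator), and ignores splits whose frequency never reaches a hypothetical `min_freq` threshold in any chain. All function names and the toy chains are made up for illustration.

```python
from statistics import stdev


def split_frequencies(sampled_trees):
    """Fraction of sampled trees in one chain that contain each split."""
    counts = {}
    for tree_splits in sampled_trees:
        for split in tree_splits:
            counts[split] = counts.get(split, 0) + 1
    n = len(sampled_trees)
    return {split: c / n for split, c in counts.items()}


def asdsf(chains, min_freq=0.1):
    """Average, over splits, of the standard deviation of each split's
    frequency across chains (assumed conventions; see the lead-in)."""
    if len(chains) < 2:
        raise ValueError("need at least two chains to compare")
    freqs = [split_frequencies(chain) for chain in chains]
    all_splits = set().union(*freqs)
    sds = []
    for split in all_splits:
        values = [f.get(split, 0.0) for f in freqs]
        if max(values) >= min_freq:
            sds.append(stdev(values))
    return sum(sds) / len(sds) if sds else 0.0


AB = frozenset({"A", "B"})
AC = frozenset({"A", "C"})

chain_a = [{AB}, {AB}, {AB}, {AB}]  # hypothetical chain that only samples split AB
chain_b = [{AC}, {AC}, {AC}, {AC}]  # hypothetical chain that only samples split AC

print(asdsf([chain_a, chain_a]))  # identical chains agree perfectly
print(asdsf([chain_a, chain_b]))  # chains stuck on different topologies disagree
```

Values near zero indicate that the chains assign similar support to the same splits; larger values indicate that the chains have sampled different regions of tree space.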
Supplementary Material
Supplementary data are available at Molecular Biology and Evolution online.
Acknowledgments
This work was funded in part by Australian Research Council
fellowship awards to D.L.W. and R.L.
References
Amenta N, Klingner J. 2002. Case study: visualizing sets of evolutionary
trees. In: IEEE symposium on information visualization, 2002.
INFOVIS 2002. Washington (DC): IEEE. p. 71–74.
Bouckaert R, Heled J, Kühnert D, Vaughan T, Wu C-H, Xie D, Suchard
MA, Rambaut A, Drummond AJ. 2014. BEAST 2: a software platform
for Bayesian evolutionary analysis. PLoS Comput Biol. 10:3537.
de Vries A, Ripley B. 2013. Ggdendro: tools for extracting dendrogram
and tree diagram plot data for use with ggplot. R package version 0.1-
12. Available from: http://cran.r-project.org/package=ggdendro.
Garnier S. 2016. viridis: default color maps from 'matplotlib'. R package
version 0.3.4.
Gilks WR, Richardson S, Spiegelhalter DJ. 1996. Introducing Markov
chain Monte Carlo. Markov Chain Monte Carlo Practice 1:19.
Hibbett DS, Pine EM, Langer E, Langer G, Donoghue MJ. 1997. Evolution
of gilled mushrooms and puffballs inferred from ribosomal DNA
sequences. Proc Natl Acad Sci. 94:12002–12006.
Höhna S, Drummond AJ. 2012. Guided tree topology proposals for
Bayesian phylogenetic inference. Syst Biol. 61:1–11.
[Figure 3: pairwise plot matrix for chain Fungus.Run1 with variables LnL, TL, pi.A., pi.T., and topological.distance.]
FIG. 3. Pairwise examination of topology, model parameters, and likelihoods for a single chain from an analysis of a fungus dataset (Hibbett et al. 1997). Histograms along the diagonal plot the posterior probability distribution for each parameter, with red indicating values outside of the 95% confidence interval. Relationships between parameters are displayed as scatter plots below the diagonal and contour plots above the diagonal. Points in the scatter plots are colored according to generation from the MCMC chain, with lighter colors representing points later in the chain.

Höhna S, Landis M, Heath T, Boussau B, Lartillot N, Moore B,
Huelsenbeck J, Ronquist F. 2016. RevBayes: a flexible framework for
Bayesian inference of phylogeny. Syst Biol. 64:726–736.
Lanfear R, Hua X, Warren DL. 2016. Estimating the effective sample size of
tree topologies from Bayesian phylogenetic analyses. Genome Biol
Evol. 8:2319–2332.
Nylander JA, Wilgenbusch JC, Warren DL, Swofford DL. 2008. AWTY
(are we there yet?): a system for graphical exploration of
MCMC convergence in Bayesian phylogenetics. Bioinformatics
24:581–583.
Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics and
evolution in R language. Bioinformatics 20:289–290.
Plummer M, Best N, Cowles K, Vines K. 2006. CODA: convergence
diagnosis and output analysis for MCMC. R News. 6:7–11.
R Core Team. 2016. R: a language and environment for statistical
computing. Vienna, Austria. Available from: http://cran.r-project.org/.
Rambaut A, Suchard M, Xie D, Drummond A. 2014. Tracer v1.6.
Available from: http://tree.bio.ed.ac.uk/software/tracer/.
Ronquist F, Teslenko M, van der Mark P, Ayres DL, Darling A, Höhna S,
Larget B, Liu L, Suchard MA, Huelsenbeck JP. 2012. MrBayes 3.2:
efficient Bayesian phylogenetic inference and model choice across
a large model space. Syst Biol. 61:539–542.
Schliep KP. 2011. phangorn: phylogenetic analysis in R. Bioinformatics
27:592–593.
Schloerke B, Crowley J, Cook D, Hofmann H, Wickham H, Briatte F,
Marbach M, Thoen E. 2011. Ggally: extension to ggplot2.
Whidden C, Matsen FA. 2015. Quantifying MCMC exploration of phy-
logenetic tree space. Syst Biol. 64:472–491.
Wickham H. 2009. ggplot2: elegant graphics for data analysis. New York:
Springer Science & Business Media.
Wickham H. 2007. Reshaping data with the reshape package. J Stat Softw.
21:1–20.
Williams JS, Niedzwiecki JH, Weisrock DW. 2013. Species tree
reconstruction of a poorly resolved clade of salamanders
(Ambystomatidae) using multiple nuclear loci. Mol Phylogenet
Evol. 68:671–682.