
Scalable total-evidence inference from molecular and continuous characters in a Bayesian framework

TLDR
This work implements, benchmarks and validates popular phylogenetic models for the study of paleontological and neontological continuous trait data, incorporating these models into the BEAST2 platform and illustrating and advancing the paradigm of Bayesian, probabilistic total evidence.


Rong Zhang¹,², Alexei J. Drummond¹,²,³, and Fábio K. Mendes¹,³
¹ Centre for Computational Evolution, The University of Auckland, 1010, New Zealand
² School of Computer Science, The University of Auckland, Auckland, 1010, New Zealand
³ School of Biological Sciences, The University of Auckland, Auckland, 1010, New Zealand
*Correspondence to be sent to: School of Biological Sciences, The University of Auckland, 3A Symonds St., Auckland, New Zealand; E-mail: f.mendes@auckland.ac.nz
Abstract
Time-scaled phylogenetic trees are both an ultimate goal of evolutionary biology and a necessary ingredient in comparative studies. While accumulating genomic data has moved the field closer to a full description of the tree of life, the relative timing of certain evolutionary events remains challenging even when this data is abundant, and absolute timing is impossible without external information such as fossil ages and morphology. The field of phylogenetics lacks efficient tools integrating probabilistic models for these kinds of data into unified frameworks for estimating phylogenies. Here, we implement, benchmark and validate popular phylogenetic models for the study of paleontological and neontological continuous trait data, incorporating these models into the BEAST2 platform. Our methods scale well with number of taxa and of characters. We tip-date and estimate the topology of a phylogeny of Carnivora, comparing results from different configurations of integrative models capable of leveraging ages, as well as molecular and continuous morphological data from living and extinct species. Our results illustrate and advance the paradigm of Bayesian, probabilistic total evidence, in which explanatory models are fully defined, and inferential uncertainty in all their dimensions is accounted for.
[Continuous trait, Brownian motion, total evidence, Carnivora]
bioRxiv preprint doi: https://doi.org/10.1101/2021.04.21.440863; this version posted April 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

The advent of molecular sequencing has unquestionably revolutionized comparative biology, giving phylogeneticists unprecedented power to recover species relationships and date important evolutionary events (e.g., Jarvis et al., 2014; Zhang et al., 2014; Suh et al., 2015; Pease et al., 2016; Kawahara et al., 2019; Vanderpool et al., 2020), describe drivers of diversification (Condamine et al., 2013; Morlon, 2014; Sánchez-Reyes et al., 2017; Condamine et al., 2019), and their relationship with ecologically relevant traits (Goldberg and Igić, 2012; Burin et al., 2016; de Alencar et al., 2017). The accumulation of genomic data further allowed the identification of problems or gaps in molecular evolution models (or their usage; e.g., Sullivan and Swofford 1997; Kolaczkowski and Thornton 2004; Mendes and Hahn 2018), which led to improvements in their realism (Yang, 2006; Rannala and Yang, 2003; Degnan and Salter, 2007), as well as the development of a plethora of computational tools for empiricists wishing to use such models (e.g., Lartillot and Philippe, 2004; Stamatakis, 2014; Nguyen et al., 2015; Chifman and Kubatko, 2015; Höhna et al., 2016; Zhang et al., 2018; Suchard et al., 2018; Bouckaert et al., 2019).
Despite all progress, abundant genomic sequences and more complex substitution models have not been a panacea for phylogenetic studies, in which species trees measured in absolute time are either the ultimate goal (Philippe et al., 2011) or a critical ingredient for downstream analyses (Felsenstein, 1985; Uyeda et al., 2018). First, while molecular data informs us on the relative timing of evolutionary events, inferring mutation rates remains challenging (Kong et al., 2012; Besenbacher et al., 2015; Wang et al., 2020), as does reconciling estimates obtained at different evolutionary timescales (Ho et al., 2005; Penny, 2005; Ho et al., 2007). Second, dating the tree of life in absolute time is complicated by the absence of a universal strict molecular clock (Zuckerkandl and Pauling, 1965; Ayala, 1997; Lanfear et al., 2010). Molecular rates have been shown to vary among loci and species (Li, 1997; Larracuente et al., 2008; Bromham, 2009), and to correlate with phenotypic and natural history traits (Martin and Palumbi, 1993; Smith and Donoghue, 2008), the environment (Bleiweiss, 1998; Wright et al., 2006; Gillman et al., 2009), and even the process of speciation (Webster et al., 2003; Witt and Brumfield, 2003; Venditti and Pagel, 2010). Finally, although non-contemporaneous DNA can help circumvent the aforementioned issues and improve the estimation of substitution rates and divergence times (Rieux and Balloux, 2016), extracting DNA from well-preserved ancient remains has so far been limited to evolutionarily young material. This process is also non-trivial and prone to contamination, usually yielding fragmentary data (Cooper and Poinar, 2000; Hagelberg et al., 2015). These complications often lead to phylogenetic trees being reported in lengths of expected substitutions per site; in these “substitution trees”, time and evolutionary rates are conflated.
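The conflation of time and rate in substitution trees can be illustrated with a minimal sketch. It assumes the Jukes-Cantor (JC69) substitution model purely for convenience; the function name and the two rate and time scenarios are hypothetical and are not taken from this study. Because the model depends on rate and time only through their product, a fast lineage over a short time and a slow lineage over a long time are indistinguishable from sequence data alone.

```python
import math

def jc69_p_diff(rate, time):
    """Probability that a site differs between the two ends of a branch under JC69."""
    d = rate * time  # branch length in expected substitutions per site
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))

# Two hypothetical scenarios with the same product rate * time (0.1 subs/site)
fast_and_short = jc69_p_diff(rate=0.02, time=5.0)    # 0.02 subs/site/Myr over 5 Myr
slow_and_long = jc69_p_diff(rate=0.005, time=20.0)   # 0.005 subs/site/Myr over 20 Myr

print(fast_and_short, slow_and_long)
```

Both calls yield the same probability, which is why external information such as sample or fossil ages is needed to separate rate from time.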
As a reaction to these findings, the past few decades saw improved descriptions of the substitution process from more realistic clock models (Thorne and Kishino, 2005; Ho and Duchêne, 2014), as well as the development of methods for calibrating substitution trees into time-scaled trees. “Node dating” (as dubbed by Ronquist et al. 2012), for example, refers to a collection of techniques whereby a specialist determines an age (range) for a node using fossil occurrence or biogeographical data (Ho and Phillips, 2009). Node dating is complicated by the difficulty in estimating the age of fossils, choosing which fossils to use (Parham et al., 2012; in many cases, information is lost because younger fossils of a group are excluded in favor of the oldest one) and what nodes to assign them to, and choosing probability distributions for their age ranges, a crucial ad hoc step that can introduce bias and circularity to an analysis (Warnock et al., 2011; Field et al., 2020). These issues are further compounded by the sensitivity of analyses to node-time priors (Welch et al., 2005), unclear implicit prior probabilities on node times (Heled and Drummond, 2012), and overly simplistic molecular clock models (Berv and Field, 2018).
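As a hedged illustration of what a node-dating calibration can look like, the sketch below evaluates an offset log-normal density, one commonly used family of node-age priors. The function name, the 50 Myr minimum age, and the `mu` and `sigma` values are hypothetical choices for illustration and do not come from this study.

```python
import math

def offset_lognormal_pdf(age, offset, mu, sigma):
    """Density of a node age whose excess over `offset` is log-normally distributed."""
    excess = age - offset
    if excess <= 0.0:
        return 0.0  # the node cannot be younger than the fossil-based minimum age
    z = (math.log(excess) - mu) / sigma
    return math.exp(-0.5 * z * z) / (excess * sigma * math.sqrt(2.0 * math.pi))

# Hypothetical calibration: the oldest fossil assigned to the clade is 50 Myr old
MIN_AGE = 50.0
density_below_minimum = offset_lognormal_pdf(45.0, MIN_AGE, mu=0.0, sigma=1.0)
density_just_above = offset_lognormal_pdf(51.0, MIN_AGE, mu=0.0, sigma=1.0)
density_far_above = offset_lognormal_pdf(80.0, MIN_AGE, mu=0.0, sigma=1.0)
```

Changing `mu` and `sigma` alters how much prior mass sits on ages much older than the fossil, which is the kind of ad hoc choice the paragraph above describes.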
As an alternative to node dating, the “tip-dating” approach consists of making direct use of heterochronous data (sample ages and character data) in order to calibrate and place taxa in the phylogeny. Tip dating was first employed for divergence time estimation at shorter time scales, in the context of viral phylodynamics (Rambaut, 2000; Drummond et al., 2002), where sample times are usually known with good precision and molecular data can be abundant. When used at macroevolutionary time scales, tip dating has also been dubbed “total-evidence” dating (Ronquist et al., 2012), likely as a reference to the original total evidence paradigm proposed by Kluge (1989). As in molecular tip dating, total-evidence dating (Pyron, 2011; Ronquist et al., 2012) allows the data (fossil age estimates and morphological characters) to directly inform fossil affinities and calibrate phylogenies, precluding the somewhat arbitrary specialist input that characterizes node dating. For the purposes of the present study, we use the term “total evidence” to mean “probabilistic” total evidence, that is, the analysis of combined data using integrative probabilistic models, as opposed to methods rooted in parsimony or other heuristics (e.g., Giribet et al., 2001; Nylander et al., 2004; Grant et al., 2006; Manos et al., 2007; Arango and Wheeler, 2007).
The success of total-evidence tip dating depends on the quality and size of morphological data sets (number of characters and phylogenetic coverage), and on how well evolutionary models capture the real processes generating the data, i.e., how good the model fit is. Because obtaining molecular data from extinct species is usually hard, one should strive to obtain as many morphological characters as possible from both extinct and extant species, across and along the phylogeny. Crucially, these species will “link” the phylogenetic signal coming from morphological data together with that coming from molecular sequences, allowing a single phylogeny to be informed by both. Furthermore, evolutionary models should meet a delicate balance between realism, utility, and practicality. By being very realistic, models run the risk of being overly complex, hindering the researcher’s ability to draw general, useful conclusions. Very complex models also tend to be computationally onerous and technically hard to implement.
Continuous-time Markov models are routinely used in phylogenetics for the study of both discrete and continuous characters. In the case of discrete traits, the ‘Mk’ and ‘Mkv’ models (Lewis, 2001) have received the most attention (e.g., Danforth et al., 2006; Bracken-Grissom et al., 2014) and criticism (e.g., O’Reilly et al., 2016, 2018; Goloboff et al., 2019). Key issues with these models (or with how they are implemented and normally used) include their assumption that discrete characters evolve in uncorrelated fashion and at the same rate, and the fact that autapomorphic characters are usually not represented in character matrices. Solutions for these problems exist, but can be computationally expensive. Continuous characters, on the other hand, are scored at a resolution that usually makes them variable within and across species. Popular continuous character phylogenetic models are based on Brownian motion (BM; Felsenstein, 1973; Hansen and Martins, 1996; but see Blomberg et al., 2020) and can incorporate correlated evolution among traits, which are assumed to evolve as a random walk whose diffusion rate is the evolutionary rate. Using continuous characters in total-evidence tip dating thus not only has the potential to improve phylogenetic inference by enhancing morphological data sets (Parins-Fukuchi, 2018b; Álvarez Carretero et al., 2019; cf. Varón-González et al., 2020 for some criticism), but also provides natural workarounds for the issues observed under discrete-character models.
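To make the Brownian motion description concrete, the following minimal sketch computes the covariance among tip values of a single trait under BM, where the covariance of two tips equals the diffusion rate times the evolutionary time they share between the root and their most recent common ancestor. The three-taxon tree, its branch lengths, the node names, and the rate value are hypothetical, and the code is an illustration under these assumptions rather than the implementation described in this paper.

```python
# Hypothetical three-taxon tree: ((A:1, B:1):2, F:1.5); F is an extinct (fossil) tip.
# Each node maps to (parent, branch length in time units); the root has no parent.
TREE = {
    "root": (None, 0.0),
    "AB": ("root", 2.0),
    "A": ("AB", 1.0),
    "B": ("AB", 1.0),
    "F": ("root", 1.5),
}

def path_to_root(node):
    """Return (node, branch_length) pairs for every branch from node up to the root."""
    path = []
    while TREE[node][0] is not None:
        path.append((node, TREE[node][1]))
        node = TREE[node][0]
    return path

def shared_time(tip_i, tip_j):
    """Time shared by two tips between the root and their most recent common ancestor."""
    branches_j = {name for name, _ in path_to_root(tip_j)}
    return sum(length for name, length in path_to_root(tip_i) if name in branches_j)

def bm_covariance(tips, sigma2):
    """Tip covariance matrix under BM: sigma2 times the shared evolutionary time."""
    return [[sigma2 * shared_time(i, j) for j in tips] for i in tips]

cov = bm_covariance(["A", "B", "F"], sigma2=0.5)
for row in cov:
    print(row)
```

The fossil tip F has a smaller variance than A and B because it had less time to evolve, which is how sample ages enter the model. For several correlated traits, the full covariance is usually written as the Kronecker product of an among-trait rate matrix and this tree matrix.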
While many computational methods exist for the study of morphological character evolution (e.g., Revell, 2012; Pennell et al., 2014; Clavel et al., 2015; Caetano and Harmon, 2017; Mitov et al., 2020), tools capable of jointly modeling molecular and morphological characters are still lacking, particularly those that simultaneously account for uncertainty in species tree topology and branch lengths. With few exceptions, comparative analysis of morphological characters requires a species tree point estimate (e.g., Adams et al., 2009; Lister, 2013; Gibson and Fuentes-G., 2015; or more rarely, a posterior distribution, e.g., Silvestro et al., 2018; Fuentes-G. et al., 2020) to be available and assumed as the truth. Such species trees will have almost invariably been estimated in previous studies using different data sets, often molecular ones. In such cases, the morphological data is then analyzed on a phylogenetic “Procrustean bed”, a species tree that might not represent the morphological evolutionary history (Hahn and Nakhleh, 2016). One way forward should be
References

RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.
TL;DR: This work presents some of the most notable new features and extensions of RAxML, such as a substantial extension of substitution models and supported data types, the introduction of SSE3, AVX and AVX2 vector intrinsics, techniques for reducing the memory requirements of the code and a plethora of operations for conducting post-analyses on sets of trees.

IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies
TL;DR: It is shown that a combination of hill-climbing approaches and a stochastic perturbation method can be time-efficiently implemented and found higher likelihoods between 62.2% and 87.1% of the studied alignments, thus efficiently exploring the tree-space.

Phylogenies and the Comparative Method
TL;DR: A method of correcting for the phylogeny has been proposed, which specifies a set of contrasts among species, contrasts that are statistically independent and can be used in regression or correlation studies.

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.
TL;DR: A new statistical method for estimating divergence dates of species from DNA sequence data by a molecular clock approach is developed, and this dating may pose a problem for the widely believed hypothesis that the bipedal creature Australopithecus afarensis, which lived some 3.7 million years ago, was ancestral to man and evolved after the human-ape splitting.

Genetics and Analysis of Quantitative Traits (Michael Lynch, +1 more)
TL;DR: This book discusses the genetic Basis of Quantitative Variation, Properties of Distributions, Covariance, Regression, and Correlation, and Properties of Single Loci, and Sources of Genetic Variation for Multilocus Traits.