(Open Access) Scalable total-evidence inference from molecular and continuous characters in a Bayesian framework (2021) | Rong Zhang

Q: What have the authors contributed in "Scalable total-evidence inference from molecular and continuous characters in a Bayesian framework"?

Here, the authors implement, benchmark, and validate popular phylogenetic models for the study of paleontological and neontological continuous trait data, incorporating these models into the BEAST2 platform.

Scalable total-evidence inference from molecular and continuous characters in a Bayesian framework

Rong Zhang (1,2), Alexei J. Drummond (1,2,3), and Fabio K. Mendes (1,3,*)

(1) Centre for Computational Evolution, The University of Auckland, 1010, New Zealand

(2) School of Computer Science, The University of Auckland, Auckland, 1010, New Zealand

(3) School of Biological Sciences, The University of Auckland, Auckland, 1010, New Zealand

(*) Correspondence to be sent to: School of Biological Sciences, The University of Auckland, 3A Symonds St., Auckland, New Zealand; E-mail: f.mendes@auckland.ac.nz

Abstract

Time-scaled phylogenetic trees are both an ultimate goal of evolutionary biology and a necessary ingredient in comparative studies. While accumulating genomic data has moved the field closer to a full description of the tree of life, the relative timing of certain evolutionary events remains challenging even when this data is abundant, and absolute timing is impossible without external information such as fossil ages and morphology. The field of phylogenetics lacks efficient tools integrating probabilistic models for these kinds of data into unified frameworks for estimating phylogenies. Here, we implement, benchmark and validate popular phylogenetic models for the study of paleontological and neontological continuous trait data, incorporating these models into the BEAST2 platform. Our methods scale well with number of taxa and of characters. We tip-date and estimate the topology of a phylogeny of Carnivora, comparing results from different configurations of integrative models capable of leveraging ages, as well as molecular and continuous morphological data from living and extinct species. Our results illustrate and advance the paradigm of Bayesian, probabilistic total evidence, in which explanatory models are fully defined, and inferential uncertainty in all their dimensions is accounted for.

[Continuous trait, Brownian motion, total evidence, Carnivora]

bioRxiv preprint doi: https://doi.org/10.1101/2021.04.21.440863; this version posted April 22, 2021. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.


The advent of molecular sequencing has unquestionably revolutionized comparative biology, giving phylogeneticists unprecedented power to recover species relationships and date important evolutionary events (e.g., Jarvis et al., 2014; Zhang et al., 2014; Suh et al., 2015; Pease et al., 2016; Kawahara et al., 2019; Vanderpool et al., 2020), describe drivers of diversification (Condamine et al., 2013; Morlon, 2014; Sánchez-Reyes et al., 2017; Condamine et al., 2019), and their relationship with ecologically relevant traits (Goldberg and Igić, 2012; Burin et al., 2016; de Alencar et al., 2017). The accumulation of genomic data further allowed the identification of problems or gaps in molecular evolution models (or their usage; e.g., Sullivan and Swofford 1997; Kolaczkowski and Thornton 2004; Mendes and Hahn 2018), which led to improvements in their realism (Yang, 2006; Rannala and Yang, 2003; Degnan and Salter, 2007), as well as the development of a plethora of computational tools for empiricists wishing to use such models (e.g., Lartillot and Philippe, 2004; Stamatakis, 2014; Nguyen et al., 2015; Chifman and Kubatko, 2015; Höhna et al., 2016; Zhang et al., 2018; Suchard et al., 2018; Bouckaert et al., 2019).

Despite all progress, abundant genomic sequences and more complex substitution models have not been a panacea for phylogenetic studies, in which species trees measured in absolute time are either the ultimate goal (Philippe et al., 2011) or a critical ingredient for downstream analyses (Felsenstein, 1985; Uyeda et al., 2018). First, while molecular data informs us on the relative timing of evolutionary events, inferring mutation rates remains challenging (Kong et al., 2012; Besenbacher et al., 2015; Wang et al., 2020), as does reconciling estimates obtained at different evolutionary timescales (Ho et al., 2005; Penny, 2005; Ho et al., 2007). Second, dating the tree of life in absolute time is complicated by the absence of a universal strict molecular clock (Zuckerkandl and Pauling, 1965; Ayala, 1997; Lanfear et al., 2010). Molecular rates have been shown to vary among loci and species (Li, 1997; Larracuente et al., 2008; Bromham, 2009), and to correlate with phenotypic and natural history traits (Martin and Palumbi, 1993; Smith and Donoghue, 2008), the environment (Bleiweiss, 1998; Wright et al., 2006; Gillman et al., 2009), and even the process of speciation (Webster et al., 2003; Witt and Brumfield, 2003; Venditti and Pagel, 2010). Finally, although non-contemporaneous DNA can help circumvent the aforementioned issues and improve the estimation of substitution rates and divergence times (Rieux and Balloux, 2016), extracting DNA from well-preserved ancient remains has so far been limited to evolutionarily young material. This process is also non-trivial and prone to contamination, usually yielding fragmentary data (Cooper and Poinar, 2000; Hagelberg et al., 2015). These complications often lead to phylogenetic trees being reported in lengths of expected substitutions per site – in these “substitution trees”, time and evolutionary rates are conflated.
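To make the conflation of time and rate concrete, the following is a minimal sketch, not taken from the paper, that assumes the textbook Jukes-Cantor (JC69) substitution model and a pair of aligned sequences. The function names and the example counts (10 differences in 100 sites) are hypothetical. Under this model the data enter the likelihood only through the product of rate and time, so a slow rate over a long time and a fast rate over a short time cannot be told apart without external calibration.

```python
import math


def jc69_p_diff(rate, time):
    """Probability that two aligned sites differ under JC69.

    `time` is the total time separating the two sequences and `rate`
    is the substitution rate per site per unit time (both hypothetical).
    """
    d = rate * time  # expected substitutions per site: all the data can inform
    return 0.75 * (1.0 - math.exp(-4.0 * d / 3.0))


def jc69_pair_log_likelihood(n_diff, n_sites, rate, time):
    """Log-likelihood of a pairwise alignment with n_diff differing sites."""
    p = jc69_p_diff(rate, time)
    n_same = n_sites - n_diff
    # Per site: stationary frequency (1/4) times the transition probability.
    return (n_diff * math.log(0.25 * p / 3.0)
            + n_same * math.log(0.25 * (1.0 - p)))


# Two very different (rate, time) pairs with the same product.
slow_and_old = jc69_pair_log_likelihood(10, 100, rate=0.01, time=10.0)
fast_and_young = jc69_pair_log_likelihood(10, 100, rate=0.1, time=1.0)
print(math.isclose(slow_and_old, fast_and_young))
```

Separating the two factors requires information beyond contemporaneous sequences, such as sample or fossil ages.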

As a reaction to these findings, the past few decades saw improved descriptions of the substitution process from more realistic clock models (Thorne and Kishino, 2005; Ho and Duchêne, 2014), as well as the development of methods for calibrating substitution trees into time-scaled trees. “Node dating” (as dubbed by Ronquist et al. 2012), for example, refers to a collection of techniques whereby a specialist determines an age (range) for a node using fossil occurrence or biogeographical data (Ho and Phillips, 2009). Node dating is complicated by the difficulty in estimating the age of fossils, choosing which fossils to use (Parham et al., 2012) – in many cases information is lost because younger fossils of a group are excluded in favor of the oldest one – and what nodes to assign them to, and choosing probability distributions for their age ranges, a crucial ad hoc step that can introduce bias and circularity to an analysis (Warnock et al., 2011; Field et al., 2020). These issues are further compounded by the sensitivity of analyses to node-time priors (Welch et al., 2005), unclear implicit prior probabilities on node times (Heled and Drummond, 2012), and overly simplistic molecular clock models (Berv and Field, 2018).

As an alternative to node dating, the “tip-dating” approach consists of making direct use of heterochronous data – sample ages and character data – in order to calibrate and place taxa in the phylogeny. Tip dating was first employed for divergence time estimation at shorter time scales, in the context of viral phylodynamics (Rambaut, 2000; Drummond et al., 2002), where sample times are usually known with good precision and molecular data can be abundant. When used at macroevolutionary time scales, tip dating has also been dubbed “total-evidence” dating (Ronquist et al., 2012), likely as a reference to the original total evidence paradigm proposed by Kluge (1989). As in molecular tip dating, total-evidence dating (Pyron, 2011; Ronquist et al., 2012) allows the data – fossil age estimates and morphological characters – to directly inform fossil affinities and calibrate phylogenies, precluding the somewhat arbitrary specialist input that characterizes node dating. For the purposes of the present study, we use the term “total evidence” to mean “probabilistic” total evidence, the analysis of combined data using integrative probabilistic models, as opposed to methods rooted in parsimony or other heuristics (e.g., Giribet et al., 2001; Nylander et al., 2004; Grant et al., 2006; Manos et al., 2007; Arango and Wheeler, 2007).

The success of total-evidence tip dating depends on the quality and size of morphological data sets (number of characters and phylogenetic coverage), and on how well evolutionary models capture the real processes generating the data, i.e., how good the model fit is. Because obtaining molecular data from extinct species is usually hard, one should strive to obtain as many morphological characters as possible from both extinct and extant species, across and along the phylogeny. Crucially, these species will “link” the phylogenetic signal coming from morphological data together with that coming from molecular sequences, allowing a single phylogeny to be informed by both. Furthermore, evolutionary models should meet a delicate balance between realism, utility, and practicality. By being very realistic, models run the risk of being overly complex, hindering the researcher’s ability to draw general, useful conclusions. Very complex models also tend to be computationally onerous and technically hard to implement.

Continuous-time Markov models are routinely used in phylogenetics for the study of both discrete and continuous characters. In the case of discrete traits, the ‘Mk’ and ‘Mkv’ models (Lewis, 2001) have received the most attention (e.g., Danforth et al., 2006; Bracken-Grissom et al., 2014) and criticism (e.g., O’Reilly et al., 2016, 2018; Goloboff et al., 2019). Key issues with these models – or with how they are implemented and normally used – include their assumption that discrete characters evolve in an uncorrelated fashion and at the same rate, and the fact that autapomorphic characters are usually not represented in character matrices. Solutions for these problems exist, but can be computationally expensive. Continuous characters, on the other hand, are scored at a resolution that usually makes them variable within and across species. Popular continuous character phylogenetic models are based on Brownian motion (BM; Felsenstein, 1973; Hansen and Martins, 1996; but see Blomberg et al., 2020) and can incorporate correlated evolution among traits, which are assumed to evolve as a random walk whose diffusion rate is the evolutionary rate. Using continuous characters in total-evidence tip dating thus not only has the potential to improve phylogenetic inference by enhancing morphological data sets (Parins-Fukuchi, 2018b; Álvarez-Carretero et al., 2019; cf. Varón-González et al., 2020 for some criticism), but also provides natural workarounds for the issues observed under discrete-character models.
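As an illustration of how correlated trait evolution enters a BM model, the sketch below evaluates the multivariate BM log-likelihood of tip values with the textbook dense-covariance formulation. It is not the authors' BEAST2 implementation, which may use a different and more scalable algorithm. It assumes that tip covariances for a single trait are proportional to the time two taxa share from the root, and that among-trait covariances are given by an evolutionary rate matrix. The function name, the three-taxon tree, and all numbers are hypothetical.

```python
import numpy as np


def mvbm_log_likelihood(Y, T, R, root):
    """Log-likelihood of tip traits under multivariate Brownian motion.

    Y    : (n taxa x k traits) observed tip values
    T    : (n x n) shared root-to-ancestor times between pairs of taxa
    R    : (k x k) evolutionary rate (variance-covariance) matrix of traits
    root : length-k vector of trait values at the root
    """
    Y = np.asarray(Y, dtype=float)
    n, k = Y.shape
    # Stacking Y trait by trait (column-major) gives covariance R (x) T.
    V = np.kron(np.asarray(R, dtype=float), np.asarray(T, dtype=float))
    resid = (Y - np.asarray(root, dtype=float)).flatten(order="F")
    _, logdet = np.linalg.slogdet(V)
    quad = resid @ np.linalg.solve(V, resid)
    return -0.5 * (n * k * np.log(2.0 * np.pi) + logdet + quad)


# Hypothetical ultrametric tree ((A:1,B:1):1,C:2); A and B share 1 time unit.
T = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
# Two traits with positive evolutionary covariance (made-up values).
R = np.array([[1.0, 0.5],
              [0.5, 2.0]])
Y = np.array([[0.3, 1.1],
              [0.1, 0.9],
              [-0.8, -0.2]])
print(mvbm_log_likelihood(Y, T, R, root=[0.0, 0.0]))
```

Setting the off-diagonal entries of `R` to zero recovers independent traits, in which case the log-likelihood is the sum of the single-trait log-likelihoods. The dense matrix grows with the product of taxa and traits, so this form is for exposition only.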

While many computational methods exist for the study of morphological character evolution (e.g., Revell, 2012; Pennell et al., 2014; Clavel et al., 2015; Caetano and Harmon, 2017; Mitov et al., 2020), tools capable of jointly modeling molecular and morphological characters are still lacking, particularly those that simultaneously account for uncertainty in species tree topology and branch lengths. With few exceptions, comparative analysis of morphological characters requires a species tree point estimate (e.g., Adams et al., 2009; Lister, 2013; Gibson and Fuentes-G., 2015; or more rarely, a posterior distribution, e.g., Silvestro et al., 2018; Fuentes-G. et al., 2020) to be available and assumed as the truth. Such species trees will have almost invariably been estimated in previous studies using different data sets, often molecular ones. In such cases, the morphological data is then analyzed on a phylogenetic “Procrustean bed”, a species tree that might not represent the morphological evolutionary history (Hahn and Nakhleh, 2016). One way forward should be


References

RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies

Phylogenies and the Comparative Method

Dating of the human-ape splitting by a molecular clock of mitochondrial DNA.

Genetics and Analysis of Quantitative Traits
