Inference of Ancient Whole-Genome Duplications and the Evolution of Gene Duplication and Loss Rates

Arthur Zwaenepoel, +1 more
- 01 Jul 2019 - 
- Vol. 36, Iss: 7, pp 1384-1404
Abstract
Gene tree-species tree reconciliation methods have been employed for studying ancient whole-genome duplication (WGD) events across the eukaryotic tree of life. Most approaches have relied on using maximum likelihood trees and the maximum parsimony reconciliation thereof to count duplication events on specific branches of interest in a reference species tree. Such approaches do not account for uncertainty in the gene tree and reconciliation, or do so only heuristically. The effects of these simplifications on the inference of ancient WGDs are unclear. In particular, the effects of variation in gene duplication and loss rates across the species tree have not been considered. Here, we developed a full probabilistic approach for phylogenomic reconciliation-based WGD inference, accounting for both gene tree and reconciliation uncertainty using a method based on the principle of amalgamated likelihood estimation. The model and methods are implemented in a maximum likelihood and Bayesian setting and account for variation of duplication and loss rates across the species tree, using methods inspired by phylogenetic divergence time estimation. We applied our newly developed framework to ancient WGDs in land plants and investigated the effects of duplication and loss rate variation on reconciliation and gene count based assessment of these earlier proposed WGDs.


This is a post-peer-review, pre-copyedit version of an article published in
Molecular Biology & Evolution. The final authenticated version is available
online at: https://doi.org/10.1093/molbev/msz088

Inference of ancient whole genome duplications and the
evolution of the gene duplication and loss rate
Arthur Zwaenepoel (1, 2, 3, *), Yves Van de Peer (1, 2, 3, 4, *)
1. Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium
2. Center for Plant Systems Biology, VIB, 9052 Ghent, Belgium
3. Bioinformatics Institute Ghent, 9052 Ghent, Belgium
4. Department of Biochemistry, Genetics and Microbiology, University of Pretoria, Pretoria 0028, South Africa
* Corresponding author: arzwa@psb.vib-ugent.be, yvpee@psb.vib-ugent.be
Introduction
In the past decades, examination of genomic data has revealed many signatures of ancient whole genome
duplications (WGDs) across the eukaryotic tree of life (reviewed in Van de Peer et al. 2017). These findings
have initiated an active research field concerned with the evolutionary importance of polyploidy, especially
in plants where polyploidization seems to have been rampant. The apparent widespread incidence of
polyploidy in the phylogeny of land plants is often cited as strong evidence for the evolutionary importance
of polyploidy. But just how widespread is ancient polyploidy in land plants? While there has been strong
evidence for many (relatively recent) WGD events, inference of these events, especially very ancient ones,
remains highly challenging. Evidently, the signal of an ancient WGD event erodes through time, and all
current methods suffer a strong loss of power the more ancient the hypothesized event. As a result, some of
the claimed ancient WGD events have been contested, such as the two hypothesized events in land plants
(Jiao et al. 2011; Ruprecht et al. 2017), the 2R hypothesis in vertebrates (Abbasi 2010; Van de Peer et al.
2010; Smith and Keinath 2015), or ancient WGD events in hexapods (Zheng Li et al. 2018; Li et al. 2019;
Nakatani and McLysaght 2019).
Methods for unveiling ancient WGDs can be classified crudely into three main approaches. The first
approach takes advantage of the expectation that a WGD leaves a signature in the distribution of duplicate
divergence times. One commonly estimates the synonymous distance ($K_S$ or $d_S$, which serves as a proxy
for the divergence time) for all paralogous pairs in a genome and visualizes the resulting distribution. In
such a $K_S$ distribution, ancient WGDs will be visible as peaks against the background exponential decay
distribution from small scale duplication (SSD) events (Lynch and Conery 2000; Blanc and Wolfe 2004).
There are a couple of pitfalls with this approach, which have been discussed in detail (Vanneste et al. 2013;
Tiley et al. 2018; Zwaenepoel et al. 2018). Importantly, these distributions are not suitable for inferring very
ancient events due to saturation of the synonymous distance. The second main approach is based on the
expectation that a WGD should lead to large co-linear blocks in the genome. Such co-linearity or synteny
based information has often been considered as the strongest evidence for ancient WGDs. In particular, the
combination of syntenic and $K_S$ information has been vital for the discrimination of WGD-derived and
SSD-derived paralogs. The major drawback is however that high quality genome assemblies are required,
and these are still non-trivial to obtain. Nevertheless, even with high quality assemblies, interpretation
of syntenic signal for very ancient putative WGDs is not always unequivocal. In particular the temporal
(either relative or absolute) framing of a WGD event based on syntenic data is complicated and requires
high quality genomes of multiple related lineages. The last set of methods is united by the use of
phylogenetic information in individual gene families. Both methods using gene counts and gene tree
topologies have been used, either in a model-based or heuristic framework. Especially heuristic gene tree -
species tree reconciliation methods have been widely employed to unveil evidence for ancient WGDs (e.g.
Jiao et al. 2011; Li et al. 2015; Li et al. 2016; McKain et al. 2016; Thomas et al. 2017; Zheng Li et al. 2018;
Yang et al. 2018), often in combination with other sources of evidence. Here, we take heuristic to mean that
the gene tree is inferred independently from its reconciliation. In these approaches, a larger than expected
number of duplication events inferred for a particular branch of the species tree is regarded as indicative
of an ancient WGD. So far, most of the support for very ancient WGD events has been obtained using gene
tree reconciliation approaches. These approaches naturally provide a temporal view on the hypothesized
event as they assume a known, either dated or undated, species tree.
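The first approach described above, a WGD peak superimposed on an exponentially decaying SSD background in the $K_S$ distribution, can be illustrated with a small simulation. This is a minimal sketch under assumed values: the mixture proportions, the background mean of 0.5, the peak location of 1.5 and the function names `simulate_ks` and `histogram` are all hypothetical choices for illustration, not estimates from any data set.

```python
import random

def simulate_ks(n_ssd=3000, n_wgd=1000, seed=1):
    """Draw toy K_S values: exponential SSD background plus a WGD peak.

    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    # SSD background: exponential decay with mean K_S of 0.5
    background = [rng.expovariate(1 / 0.5) for _ in range(n_ssd)]
    # WGD-derived pairs: narrow peak centred at K_S = 1.5
    peak = [rng.gauss(1.5, 0.1) for _ in range(n_wgd)]
    return background + peak

def histogram(values, width=0.25, upper=3.0):
    """Count values in equal-width bins on [0, upper)."""
    n_bins = int(upper / width)
    counts = [0] * n_bins
    for v in values:
        if 0 <= v < upper:
            counts[int(v / width)] += 1
    return counts

if __name__ == "__main__":
    for i, c in enumerate(histogram(simulate_ks())):
        print(f"{i * 0.25:4.2f}  {'#' * (c // 20)}")
```

In the printed histogram the bins around 1.5 rise above the decaying background. For a much older event the peak would sit at high $K_S$ values, where saturation of the synonymous distance makes it unreliable, which is the limitation noted above.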
There are however several potential pitfalls when employing heuristic gene tree - species tree reconciliation
approaches. The first, and probably most obvious, is the need for an arbitrary cut-off on the number of
duplications before some species tree branch is associated with a WGD. This is especially troubling for
putative WGD events on tip branches, as large numbers of SSD events can easily be confused with a WGD
event (Zwaenepoel et al. 2018). The number of duplication events inferred for specific branches can also be
very sensitive to taxon sampling, and some signal for a putative WGD event on a particular branch may be
absent or weakened when the branch is subdivided by adding more taxa to the analysis. Perhaps more
important are the problems with the methodology per se. In most cases, reconciliation approaches rely on a
single gene tree topology for every gene family, inferred by maximum likelihood methods, and a single
reconciliation thereof, typically employing a least common ancestor (LCA) approach which minimizes the
total number of duplication and loss events (Zmasek and Eddy 2001). A gene tree topology is however
a probabilistic model of the phylogeny of that gene family, and for a single gene family there may be a
considerable number of different topologies with near equal support (Salter 2001). Similarly, a reconciliation
of a gene tree to a species tree can also be considered probabilistically, and, although less well studied,
relying on the single most parsimonious reconciliation may be similarly problematic. In particular the
joint effects of these two issues may be of crucial importance, as the reconciliation of uncertain topologies
by means of LCA reconciliation will result in conflicting views on the evolution of the gene family and
systematic biases (see e.g. Hahn 2007). To overcome some of these problems, researchers have typically
filtered out nodes with low bootstrap support, evaluated some type of duplication consistency score, or
have used heuristic branch swapping methods in the reconciliation step (as implemented for example in
Notung (Chen et al. 2000)).
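To make the LCA approach and its sensitivity to gene tree topology concrete, the following is a minimal sketch of LCA reconciliation that counts inferred duplications per species tree clade. It assumes rooted binary trees written as nested tuples; the species tree, both gene trees and all function names are hypothetical, and losses are not counted. It is not the implementation used by Notung or any other cited tool.

```python
def leaf_set(tree):
    """Set of leaf labels below a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def species_clades(tree):
    """All clades (as leaf sets) of the species tree, leaves included."""
    result = [leaf_set(tree)]
    if not isinstance(tree, str):
        for child in tree:
            result.extend(species_clades(child))
    return result

def lca(clades, species):
    """Smallest species tree clade that contains all of `species`."""
    return min((c for c in clades if species <= c), key=len)

def count_duplications(gene_tree, clades, gene_to_species, counts=None):
    """Map each gene tree node to its LCA clade and count duplications.

    A node is called a duplication when it maps to the same clade
    as at least one of its children.
    """
    if counts is None:
        counts = {}
    if isinstance(gene_tree, str):
        return frozenset([gene_to_species[gene_tree]]), counts
    child_maps = [count_duplications(c, clades, gene_to_species, counts)[0]
                  for c in gene_tree]
    here = lca(clades, frozenset().union(*child_maps))
    if any(m == here for m in child_maps):
        counts[here] = counts.get(here, 0) + 1
    return here, counts

# Hypothetical example: one duplication on the branch leading to (A, B)
SPECIES_TREE = (("A", "B"), "C")
GENE_TO_SPECIES = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
GENE_TREE_TRUE = ((("a1", "b1"), ("a2", "b2")), "c1")
# Same genes, but c1 is misplaced next to b1
GENE_TREE_WRONG = (("a1", ("b1", "c1")), ("a2", "b2"))

if __name__ == "__main__":
    clades = species_clades(SPECIES_TREE)
    for name, tree in [("true", GENE_TREE_TRUE), ("wrong", GENE_TREE_WRONG)]:
        _, counts = count_duplications(tree, clades, GENE_TO_SPECIES)
        print(name, {tuple(sorted(k)): v for k, v in counts.items()})
```

With the first topology a single duplication is placed on the clade (A, B). Misplacing one gene yields two duplications at the root clade instead, so the event moves to an older branch and is counted twice, which is the kind of bias that relying on one topology and one parsimonious reconciliation can introduce.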
Probabilistic methods for WGD inference in a phylogenetic context, both employing gene trees and gene
counts, were recently proposed by Rabier et al. (2014). In a gene count based method (Hahn et al. 2005),
one does not employ topological information but effectively integrates over all possible gene trees that
could have generated the observed counts at the species tree leaves. Such methods therefore naturally
handle uncertainty in the gene tree, albeit in a somewhat crude fashion. The observed gene family is
modeled as the outcome of a birth-death Markov chain, allowing likelihood based inference of duplication
and loss rates, ancestral gene counts and, in the framework of Rabier et al. (2014), WGD retention rates.
While they have yielded great insights into genome evolution, gene count based methods do not consider
all of the information in genomic data sets, as sequence data for the genes provides information about
their phylogeny. Therefore, a gene tree - species tree reconciliation approach that estimates parameters of a
model of gene family evolution is expected to be more accurate. We expect this in particular when models
are employed that allow variation in the duplication and loss rate across the species tree. Additionally,
reconciliation based methods have the obvious advantage of providing the researcher with an actual
reconciled gene tree, i.e. a tree with nodes labeled as either a speciation, duplication or loss node. In our
case, this labeling should also include whether a particular duplication node is inferred to be a WGD or
SSD-derived duplication. This therefore also provides a model-based framework for selecting gene families
for Bayesian molecular dating analyses to estimate absolute ages of ancient WGDs (as in e.g. Vanneste
et al. 2014; and Clark and Donoghue 2017) or to study functional biases in gene retention patterns (e.g.
Li et al. 2016). Contrary to expectations, Rabier et al. (2014) reported a lower power to detect WGDs for
their reconciliation approach compared to their gene count approach, and they recommend usage of the
gene count method for testing WGD hypotheses in a phylogenetic context. However, they attributed these
observations mainly to computational limitations in their reconciliation method.
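The birth-death Markov chain underlying gene count methods can be illustrated with the standard transition probabilities of a linear birth-death process started from a single lineage. The sketch below is a textbook formula, not the code of any cited method; the function name `bd_transition_prob` and the example rates are assumptions, and time is taken to be in the same units as the rates.

```python
import math

def bd_transition_prob(n, lam, mu, t):
    """P(n lineages at time t | 1 lineage at time 0), linear birth-death.

    lam: duplication rate, mu: loss rate (per lineage, per unit time).
    """
    if math.isclose(lam, mu):
        alpha = beta = lam * t / (1.0 + lam * t)
    else:
        e = math.exp((lam - mu) * t)
        alpha = mu * (e - 1.0) / (lam * e - mu)    # extinction probability
        beta = lam * (e - 1.0) / (lam * e - mu)
    if n == 0:
        return alpha
    # Conditional on survival, the count is geometric with parameter 1 - beta
    return (1.0 - alpha) * (1.0 - beta) * beta ** (n - 1)

if __name__ == "__main__":
    lam, mu, t = 0.3, 0.2, 1.0   # illustrative rates only
    probs = [bd_transition_prob(n, lam, mu, t) for n in range(200)]
    print("P(loss)      =", round(probs[0], 4))
    print("P(one copy)  =", round(probs[1], 4))
    print("expected copies =", round(sum(n * p for n, p in enumerate(probs)), 4))
```

The expected number of copies equals exp((λ - µ)t), so counts alone mainly constrain the difference between the two rates along a branch. This is one way to see why the tree topologies used in reconciliation carry additional information.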
Here we introduce a novel method for WGD inference using gene trees designed to overcome the issues
of the reconciliation method in Rabier et al. (2014). We draw inspiration from the growing body of
literature on gene tree inference under a known species tree (reviewed in Szöllősi et al. 2015) and develop
an approach that makes it possible to assess the statistical support for WGD hypotheses from alignments of
multi-copy gene families. Our approach is based on the principle of amalgamated likelihood estimation
(ALE) for probabilistic gene tree - species tree reconciliation, first proposed and developed by Szöllősi,
Rosikiewicz, et al. (2013). We develop an ALE approach, called Whale, employing the probabilistic model
of Rabier et al. (2014) to estimate duplication, loss and WGD retention rates and test WGD hypotheses in
a phylogenetic context. By using the amalgamation principle with a probabilistic model of gene family
evolution in the presence of WGDs, Whale jointly accounts for uncertainty in the gene tree topology and
reconciliation. As in Szöllősi, Rosikiewicz, et al. (2013), our approach is fully probabilistic, and does not
employ parsimony-guided reconciliation as in Rabier et al. (2014). We employ the Whale method both
in a maximum likelihood and Bayesian setting and reveal the crucial importance of considering duplication
and loss rate heterogeneity across the species tree when assessing WGD hypotheses. To accommodate
this, we implemented models of duplication and loss rate evolution inspired by molecular divergence time
estimation. Revisiting some of the ancient WGDs reported in the land plant phylogeny, we evaluate our
new approaches and discuss caveats when assessing WGDs using gene tree reconciliation.
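The amalgamation principle referred to above can be illustrated by estimating conditional clade probabilities from a sample of gene tree topologies and multiplying them to approximate the probability of any tree, sampled or not. The sketch below is a simplification under stated assumptions: it uses rooted binary trees as nested tuples, whereas ALE-type methods work from samples of unrooted trees and combine these probabilities with a reconciliation likelihood. The function names and the toy sample are hypothetical.

```python
from collections import Counter

def leaf_set(tree):
    """Set of leaf labels below a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Yield (clade, split) for every internal node of a rooted binary tree."""
    if isinstance(tree, str):
        return
    left, right = tree
    yield leaf_set(tree), frozenset([leaf_set(left), leaf_set(right)])
    yield from splits(left)
    yield from splits(right)

def conditional_clade_probs(sample):
    """Estimate P(split | clade) as observed frequencies in the tree sample."""
    split_counts, clade_counts = Counter(), Counter()
    for tree in sample:
        for clade, split in splits(tree):
            split_counts[(clade, split)] += 1
            clade_counts[clade] += 1
    return {key: n / clade_counts[key[0]] for key, n in split_counts.items()}

def tree_probability(tree, ccp):
    """Approximate tree probability as a product of conditional clade probabilities."""
    p = 1.0
    for clade, split in splits(tree):
        p *= ccp.get((clade, split), 0.0)
    return p

# Toy sample of two topologies that disagree within both subclades
T1 = ((("A", "B"), "C"), (("D", "E"), "F"))
T2 = ((("A", "C"), "B"), (("D", "F"), "E"))
# A topology that was never sampled but combines observed clades
T3 = ((("A", "B"), "C"), (("D", "F"), "E"))

if __name__ == "__main__":
    ccp = conditional_clade_probs([T1, T2])
    for name, tree in [("T1", T1), ("T2", T2), ("T3", T3)]:
        print(name, tree_probability(tree, ccp))
```

The unsampled topology T3 receives the same probability as the sampled ones because all of its clades were observed. This is why a modest sample of topologies can represent uncertainty over a much larger set of gene trees.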
New approaches
- We implemented algorithms to compute the joint gene tree - reconciliation likelihood under the
  probabilistic model of Rabier et al. (2014) using the principle of amalgamation.
- Through analysis of simulated and empirical data sets we show that likelihood based inference of
  whole genome duplications (WGDs) sensu Rabier et al. (2014) is very sensitive to rate variation across
  branches of the species tree. This also has implications for simulation-based assessment of putative
  ‘bursts’ in the number of duplications in data sets of reconciled gene trees.
- We implemented models that can accommodate variation in duplication and loss rates inspired by
  Bayesian divergence time estimation and employ these to study the evolution of the duplication and
  loss rate together with putative ancient WGDs.
Results
Validation using simulated data
The ALE approach and the dynamic programming algorithm for probabilistic reconciliation inference
have been extensively validated using simulations (Szöllősi et al. 2012; Szöllősi, Rosikiewicz, et al. 2013;
Szöllősi, Tannier, et al. 2013). However, our adoption of these methods is considerably different from
these studies, which focused mainly on horizontal gene transfer and improved gene tree inference under
a known species tree. We introduce the WGD model as well as the prior distribution on the number of
lineages at the root first developed by Rabier et al. (2014) in the ALE context, and estimate duplication
and loss rates not family-wise as in ALE (Szöllősi, Rosikiewicz, et al. 2013), but across families, similar to
Rasmussen and Kellis (2011) and Rabier et al. (2014) (see methods). We verified the correctness of our
new approach and its implementation using simulated data. Importantly, while the ALE approach takes
gene tree uncertainty into consideration by employing samples from the posterior distribution for gene
tree topologies, we simulated only a single unrooted gene tree topology per family, and do not consider
gene tree uncertainty here. This was already done using extensive simulations in Szöllősi,
Rosikiewicz, et al. (2013), where the basic merits of an ALE approach were shown, and we do not revisit
these highly computationally intensive simulation studies here. Note that all reported rate estimates are
dependent on the time scale used in the species tree, which in our case is in units of 100 million years.
Numerical optimization of the likelihood under the basic constant-rates duplication-loss (DL) model
(i.e. using a single duplication ($\lambda$) and loss ($\mu$) rate for the full species tree) with a geometric prior
distribution on the number of lineages at the root provides accurate maximum likelihood estimates (MLEs)
for the simulated duplication and loss rates (Figure S1). In general, rates are estimated more accurately
when the duplication and loss rates are similar, whereas slight biases are observed when the rates are quite
different. If the loss rate is higher than the duplication rate, both rates tend to be underestimated. If the
duplication rate is higher than the loss rate, the duplication rate seems to be slightly overestimated. Not
unexpectedly, our simulations suggest that the variance of the MLEs increases with the rate. Estimates of the
duplication ($\lambda$) and loss ($\mu$) rate are quite robust to the parametrization of the geometric prior distribution
on the number of genes at the root (Figure S2). As expected, assuming a very low prior probability on
multiple genes at the root ($1/\eta \approx 1$) leads to overestimation of $\lambda$ and underestimation of $\mu$. Conversely,
assigning a strong prior on multiple ancestral lineages ($1/\eta \gg 1$) leads to an underestimation of $\lambda$ and
overestimation of $\mu$ to compensate for unobserved lineages assumed at the root. These observations hold
References

MrBayes 3.2: Efficient Bayesian Phylogenetic Inference and Model Choice across a Large Model Space (journal article)
TL;DR: The new version provides convergence diagnostics and allows multiple analyses to be run in parallel with convergence progress monitored on the fly, and provides more output options than previously, including samples of ancestral states, site rates, site dN/dS ratios, branch rates, and node dates.

The evolutionary fate and consequences of duplicate genes (journal article)
TL;DR: Although duplicate genes may only rarely evolve new functions, the stochastic silencing of such genes may play a significant role in the passive origin of new species.

The genome of black cottonwood, Populus trichocarpa (Torr. & Gray) (journal article; Gerald A. Tuskan, +115 more; 15 Sep 2006)
TL;DR: The draft genome of the black cottonwood tree, Populus trichocarpa, has been reported in this paper, with more than 45,000 putative protein-coding genes identified.

Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods (journal article)
TL;DR: Two approximate methods are proposed for maximum likelihood phylogenetic estimation, which allow variable rates of substitution across nucleotide sites, and one of them uses several categories of rates to approximate the gamma distribution, with equal probability for each category.

MCMC using Hamiltonian dynamics (book; Radford M. Neal; 09 Jun 2012)
TL;DR: In this paper, the authors discuss theoretical and practical aspects of Hamiltonian Monte Carlo, and present some of its variations, including using windows of states for deciding on acceptance or rejection, computing trajectories using fast approximations, tempering during the course of a trajectory to handle isolated modes, and short-cut methods that prevent useless trajectories from taking much computation time.
Frequently Asked Questions (9)
Q1. What are the contributions in "Inference of ancient whole genome duplications and the evolution of the gene duplication and loss rate" ?

The authors applied their newly developed framework to ancient WGDs in land plants and investigated the effects of duplication and loss rate variation on reconciliation and gene count based assessment of these earlier proposed WGDs. 

Accounting for these complexities in a probabilistic framework is another challenge for future research, and would require more sophisticated models that explicitly model the polyploid phase of the lineage under consideration. In particular, their model does not account for incomplete lineage sorting, and incorporating the multi-species coalescent in their framework to account for the possibility of deep coalescence would be an interesting future development. The authors believe these might be fruitful further research directions. In particular genome-scale molecular dating would be a promising avenue, where the temporal signal from both the gene family and sequence evolution process could be employed using relaxed clock priors on both duplication, loss and substitution rates to date species divergence times and WGDs in an integrative fashion. 

The authors note that, besides being very efficient, the amalgamation approach has the merit that it only requires a sample from the posterior distribution over gene tree topologies. 

A possible explanation for the conflicting signals for the putative gymnosperm WGD in the nine-taxon and five-taxon analyses may be that it is an artifact due to a strong drop in duplication rate in the Ginkgo lineage compared to the gymnosperm stem and the lineage leading to P. abies. 

To assess whether a particular number of duplicates corresponds to a significant increase in the number of duplications (possibly stemming from a WGD) they simulated gene tree topologies under the species tree of interest using a constant-rates DL model, with four sets of duplication and loss rates, which are estimated using gene count data. 

As expected, assuming a very low prior probability on multiple genes at the root (1/η ≈ 1) leads to overestimation of λ and underestimation of µ. 

The authors next consider the special considerations needed when handling the root of S and the ubiquitous clade Γ. Prior on the number of lineages at the root and conditioning. A fundamental issue in probabilistic gene tree - species tree reconciliation is that an explicit or implicit assumption on the number of lineages present at the root of the species tree is required. 

The propagation probability is the probability that a single lineage entering a time slice at time t ‘propagates’ through the time slice to generate exactly one lineage at the end of the time slice (time t′) which has observed descendants at the present (t0 = 0).