Cache Oblivious Algorithms for Computing the Triplet Distance Between Trees

Gerth Stølting Brodal¹ and Konstantinos Mampentzidis²

1 Department of Computer Science, Aarhus University, Aarhus, Denmark
  gerth@cs.au.dk
2 Department of Computer Science, Aarhus University, Aarhus, Denmark
  kmampent@cs.au.dk
Abstract

We study the problem of computing the triplet distance between two rooted unordered trees with n labeled leaves. Introduced by Dobson in 1975, the triplet distance is the number of leaf triples that induce different topologies in the two trees. The current theoretically best algorithm is an O(n log n) time algorithm by Brodal et al. (SODA 2013). Recently Jansson et al. proposed a new algorithm that, while slower in theory, requiring O(n log³ n) time, in practice outperforms the theoretically faster O(n log n) algorithm. Neither algorithm scales to external memory. We present two cache oblivious algorithms that combine the best of both worlds. The first algorithm is for the case when the two input trees are binary trees, and the second is a generalized algorithm for two input trees of arbitrary degree. Analyzed in the RAM model, both algorithms require O(n log n) time, and in the cache oblivious model O(n/B · log₂(n/M)) I/Os. Their relative simplicity and the fact that they scale to external memory make them achieve the best practical performance. We note that these are the first algorithms for this problem that scale to external memory, both in theory and in practice.
1998 ACM Subject Classification G.2.2 Trees, G.2.1 Combinatorial Algorithms

Keywords and phrases Phylogenetic tree, tree comparison, triplet distance, cache oblivious algorithm

Digital Object Identifier 10.4230/LIPIcs.ESA.2017.21
1 Introduction

Background. Trees are data structures that are often used to represent relationships. For example, in the field of biology, a tree can be used to represent evolutionary relationships, with the leaves corresponding to species that exist today, and internal nodes to ancestor species that existed in the past. For a fixed set of n species, different data or construction methods (e.g. Q* [2], neighbor joining [13]) can lead to trees that look structurally different. An interesting question that arises then is: given two trees T1 and T2 over n species, how different are they? An answer to this question could potentially be used to determine whether the difference is statistically significant or not, which in turn could help with evolutionary inferences. Several ways of comparing two trees have been proposed in the past, with different types of trees (e.g. rooted versus unrooted, binary versus arbitrary degree) having different distance measures (e.g. Robinson-Foulds distance [12], triplet distance [6], quartet distance [7]). In this paper we focus on the triplet distance computation, which is defined for rooted trees.

Research supported by the Danish National Research Foundation, grant DNRF84, Center for Massive Data Algorithmics (MADALGO). An extended version of the paper is available on arXiv [4].
© Gerth Stølting Brodal and Konstantinos Mampentzidis;
licensed under Creative Commons License CC-BY
25th Annual European Symposium on Algorithms (ESA 2017).
Editors: Kirk Pruhs and Christian Sohler; Article No. 21; pp. 21:1–21:14
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany

Figure 1 The four possible triplet topologies: (a) xy|z, (b) xz|y, (c) yz|x, (d) xyz.
Problem Definition. For a given rooted unordered tree T where each leaf has a unique label, a triplet is defined by a set of three leaf labels x, y and z and their induced topology in T. The four possible topologies are illustrated in Figure 1. For two such trees T1 and T2 that are built on n identical leaf labels, the triplet distance D(T1, T2) is the number of triplets that are different in T1 and T2. Let S(T1, T2) be the number of shared triplets in the two trees, i.e. leaf triples with identical topologies in the two trees. We have the relationship D(T1, T2) + S(T1, T2) = C(n, 3), where C(n, 3) denotes the binomial coefficient.
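As a small illustration of how a leaf triple induces one of the four topologies of Figure 1, the following sketch (not from the paper; a parent-pointer tree representation is assumed) classifies a triple by computing the three pairwise lowest common ancestors: the strictly deepest one identifies the pair that branches off together, and if all three coincide the triple is unresolved (xyz).

```python
def lca(parent, depth, a, b):
    """Lowest common ancestor via parent pointers and node depths."""
    while depth[a] > depth[b]:
        a = parent[a]
    while depth[b] > depth[a]:
        b = parent[b]
    while a != b:
        a, b = parent[a], parent[b]
    return a

def induced_topology(parent, depth, x, y, z):
    """Returns 'xy|z', 'xz|y', 'yz|x', or 'xyz' for the triple {x, y, z}."""
    pairs = {"xy|z": lca(parent, depth, x, y),
             "xz|y": lca(parent, depth, x, z),
             "yz|x": lca(parent, depth, y, z)}
    if len(set(pairs.values())) == 1:
        return "xyz"  # all pairwise LCAs coincide: unresolved topology
    # exactly one pairwise LCA is strictly deeper; it names the topology
    return max(pairs, key=lambda t: depth[pairs[t]])

# Toy tree: root 0 with children {1, 4}; node 1 with leaf children 2 and 3.
parent = {1: 0, 4: 0, 2: 1, 3: 1}
depth = {0: 0, 1: 1, 4: 1, 2: 2, 3: 2}
print(induced_topology(parent, depth, 2, 3, 4))  # -> xy|z
```
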
Results. All related work can be found in [5, 1, 14, 3, 15, 9, 10, 11]. Previous and new results are shown in the table below. For the cache oblivious model [8], the papers [5, 1, 14, 3, 10, 11] do not provide an analysis, so here we provide an upper bound.
Year  Reference               Time         I/Os               Space       Non-Binary Trees
1996  Critchlow et al. [5]    O(n²)        O(n²)              O(n²)       no
2011  Bansal et al. [1]       O(n²)        O(n²)              O(n²)       yes
2013  Brodal et al. [14]      O(n log² n)  O(n log² n)        O(n)        no
2013  Brodal et al. [3]       O(n log n)   O(n log n)         O(n log n)  yes
2015  Jansson et al. [10,11]  O(n log³ n)  O(n log³ n)        O(n log n)  yes
2017  new                     O(n log n)   O(n/B · log₂(n/M)) O(n)        yes
The common main bottleneck of all previous approaches is that the data structures used rely intensively on Ω(n log n) random memory accesses. This means that all algorithms are penalized by cache performance and thus do not scale to external memory. We address this limitation by proposing new algorithms for computing the triplet distance on binary and non-binary trees that match the previous best O(n log n) time and O(n) space bounds in the RAM model, but for the first time also scale to external memory. More specifically, in the cache oblivious model, the total number of I/Os required is O(n/B · log₂(n/M)). The basic idea is to replace the dependency on random access to data structures by scanning contracted versions of the input trees. A careful implementation of the algorithms is shown to achieve the best practical performance, thus documenting that the theoretical results carry over to practice.
2 Previous Approaches

A naive algorithm would enumerate all C(n, 3) sets of 3 labels and determine for each set whether the induced topologies in T1 and T2 differ, giving an O(n³) algorithm. This naive approach does not exploit the fact that the triplets are not completely independent. For example, the triplets xy|z and xy|u share the leaves x and y, and in both the lowest common ancestor of x and y lies strictly below the lowest common ancestor of z with either x or y, and below the lowest common ancestor of u with either x or y. Dependencies like this can be exploited to count the number of shared triplets faster.
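The naive algorithm above can be transcribed directly. A minimal sketch (hypothetical, not the paper's code), where `topo1` and `topo2` are assumed callbacks returning the topology a leaf triple induces in T1 and T2 respectively:

```python
from itertools import combinations

def naive_triplet_distance(leaves, topo1, topo2):
    """O(n^3): count leaf triples whose induced topologies differ."""
    return sum(1 for t in combinations(leaves, 3) if topo1(*t) != topo2(*t))

# Stub topologies on 4 leaves: the two trees disagree on exactly the
# C(3, 2) + ... = 3 triples that contain leaf "d".
topo1 = lambda x, y, z: "xy|z"
topo2 = lambda x, y, z: "xyz" if "d" in (x, y, z) else "xy|z"
print(naive_triplet_distance("abcd", topo1, topo2))  # -> 3
```
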

Critchlow et al. [5] exploit the depth of the leaves' ancestors to achieve the first improvement over the naive approach. Bansal et al. [1] exploit the shared leaves between subtrees and reduce the problem to computing the intersection size (number of shared leaves) of all pairs of subtrees, one from T1 and one from T2, which can be solved with dynamic programming.
The O(n²) Algorithm for Binary Trees in [14]. The algorithm for binary trees in [14] is the basis for all subsequent improvements [14, 3, 10], including ours, so we describe it in more detail here. The dependency that is exploited is the same as in [1], but the procedure for counting the shared triplets is completely different. More specifically, each triplet in T1 and T2, defined by the leaves i, j and k, is implicitly anchored in the lowest common ancestor of i, j and k. For a node u in T1 and v in T2, let s(u) and s(v) be the sets of triplets that are anchored in u and v respectively. For the number of shared triplets S(T1, T2) we then have

    S(T1, T2) = Σ_{u ∈ T1} Σ_{v ∈ T2} |s(u) ∩ s(v)| .
For the algorithm to be O(n²), the value |s(u) ∩ s(v)| must be computed in O(1) time. This is achieved by a leaf colouring procedure as follows: fix a node u in T1, color the leaves in the left subtree of u red and the leaves in the right subtree of u blue, let every other leaf have no color, and then transfer this coloring to the leaves in T2, i.e. identically labelled leaves get the same color. To compute |s(u) ∩ s(v)| we do as follows: let l and r be the left and right children of v, and let w_red and w_blue be the number of red and blue leaves in the subtree rooted at a node w in T2. We then have

    |s(u) ∩ s(v)| = C(l_red, 2) · r_blue + C(l_blue, 2) · r_red + C(r_red, 2) · l_blue + C(r_blue, 2) · l_red .   (1)
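Equation (1) transcribes directly into code. A minimal sketch, assuming the four colour counts at the children of v have already been computed by the caller:

```python
from math import comb

def shared_at_v(l_red, l_blue, r_red, r_blue):
    """Equation (1): triplets anchored both in the fixed node u of T1 and in
    the node v of T2 whose left/right children hold the given colour counts."""
    return (comb(l_red, 2) * r_blue + comb(l_blue, 2) * r_red
            + comb(r_red, 2) * l_blue + comb(r_blue, 2) * l_red)

# Two red leaves under v's left child and one blue leaf under its right child
# form exactly one shared triplet of the form xy|z.
print(shared_at_v(2, 0, 0, 1))  # -> 1
```

Summing this quantity over all nodes v of T2, for each fixed u, is exactly the double sum for S(T1, T2) above.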
Subquadratic Algorithms. To reduce the time, Brodal et al. [14] applied the smaller half trick, which specifies a depth first order in which to visit the nodes u of T1, so that each leaf in T1 changes color at most O(log n) times. To count shared triplets efficiently without scanning T2 completely for each node u in T1, the tree T2 is stored in a data structure called a hierarchical decomposition tree (HDT). This HDT maintains, for the currently visited node u in T1, the sum Σ_{v ∈ T2} |s(u) ∩ s(v)| according to (1), so that each color change in T1 can be reflected efficiently in T2. In [14] the HDT is a binary tree of height O(log n) and every update can be done in a leaf-to-root path traversal of the HDT, which in total gives O(n log² n) time. In [3] the HDT is generalized to also handle non-binary trees, each query operates the same, and due to a contraction scheme of the HDT the total time is reduced to O(n log n). Finally, in [10] the so-called heavy-light tree decomposition is used as the HDT. Note that the only difference between all O(n polylog n) results available until now is the type of HDT used.

In terms of external memory efficiency, every O(n polylog n) algorithm performs Θ(n log n) updates to an HDT data structure, which means that for sufficiently large input trees every algorithm requires Ω(n log n) I/Os.
3 The New Algorithm for Binary Trees

Overview. We use the O(n²) algorithm described in Section 2 as a basis. The main difference lies in the order in which we visit the nodes of T1 and in how we process T2 when we count. We propose a new order of visiting the nodes of T1, which we find by applying a hierarchical decomposition to T1. Every component in this decomposition corresponds to a connected part of T1 and a contracted version of T2. In simple terms, if Λ is the set of leaves in a component of T1, the contracted version of T2 is a binary tree on Λ that preserves the topologies induced by Λ in T2 and has size O(|Λ|). To count shared triplets, every component of T1 has a representative node u that we use to scan the corresponding contracted version of T2 in order to find Σ_{v ∈ T2} |s(u) ∩ s(v)|. Unlike previous algorithms, we do not store T2 in a data structure. We process T2 by contracting and counting, both of which can be done by scanning. At the same time, even though we apply a hierarchical decomposition to T1, we do so only to find the order in which to visit the nodes of T1. This means that we do not need to store T1 in a data structure either. Thus, we completely remove the need for data structures (and thereby random memory accesses), and scanning becomes the basic primitive of the algorithm. To make our algorithm I/O efficient, all that remains is to use a proper layout to store the trees in memory, so that every time we scan a tree of size s we spend O(s/B) I/Os.
Preprocessing. As a preprocessing step, we first make T1 left heavy: by a depth first traversal, we swap children so that for every node u in T1 the left subtree is at least as large as the right subtree. Second, we change the leaf labels of T1, which can also be done by a depth first traversal of T1, so that the leaves are numbered 1 to n from left to right. This step takes O(n) time in the RAM model. The second step is done to simplify the process of transferring the leaf colors between T1 and T2. The coloring of a subtree in T1 will correspond to assigning the same color to a contiguous range of leaf labels. Determining the color of a leaf in T2 then requires one if-statement to find in which range (red or blue) its label belongs.
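The two preprocessing passes can be sketched as follows (a hypothetical nested-list representation is assumed: an internal node is a two-element list, a leaf is any other value):

```python
def make_left_heavy(t):
    """Swap children so the larger subtree is on the left.
    Returns (tree, size), where size counts internal nodes and leaves."""
    if not isinstance(t, list):          # leaf
        return t, 1
    left, nl = make_left_heavy(t[0])
    right, nr = make_left_heavy(t[1])
    if nr > nl:
        left, right = right, left        # keep the larger subtree left
    return [left, right], nl + nr + 1

def relabel_leaves(t, next_label=1):
    """Number the leaves 1..n in left-to-right order.
    Returns (tree, next unused label)."""
    if not isinstance(t, list):
        return next_label, next_label + 1
    left, next_label = relabel_leaves(t[0], next_label)
    right, next_label = relabel_leaves(t[1], next_label)
    return [left, right], next_label

t, _ = make_left_heavy(["a", [["b", "c"], "d"]])
t, _ = relabel_leaves(t)
print(t)  # -> [[[1, 2], 3], 4]
```

After these passes, coloring a subtree of T1 red or blue corresponds to a contiguous label range, as described above.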
Centroid Decomposition. For a given rooted binary tree T we let |T| denote the number of nodes in T (internal nodes and leaves). For a node u in T we let l and r be the left and right children of u, and p the parent. Removing u from T partitions T into three (possibly empty) connected components T_l, T_r and T_p containing l, r and p, respectively. A centroid is a node u in T such that max{|T_l|, |T_r|, |T_p|} ≤ |T|/2. A centroid always exists and can be found by starting from the root of T and iteratively visiting the child with the largest subtree; eventually we reach a centroid. Finding the size of every subtree and identifying u takes O(|T|) time in the RAM model. By recursively finding centroids in each of the three components, we in the end get a ternary tree of centroids, which is called the centroid decomposition of T, denoted CD(T). We can generate a level of CD(T) in O(|T|) time, given the decomposition of T into components by the previous level. Since we have to generate at most 1 + log₂(|T|) levels, the total time required to build CD(T) is O(|T| log |T|), hence we get Lemma 1.

Lemma 1. For any rooted binary tree T with n leaves, building CD(T) takes O(n log n) time in the RAM model.
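The centroid search just described can be sketched as follows (a children-dict representation is assumed; this is an illustration, not the paper's implementation): one DFS computes subtree sizes, then we walk from the root towards the larger child until no child's subtree exceeds |T|/2.

```python
def find_centroid(children, root):
    """Return a node u with max{|T_l|, |T_r|, |T_p|} <= |T|/2."""
    size = {}
    def dfs(u):                          # first pass: subtree sizes
        size[u] = 1 + sum(dfs(c) for c in children.get(u, []))
        return size[u]
    total = dfs(root)
    u = root
    while True:
        # child with the largest subtree, if any
        heavy = max(children.get(u, []), key=lambda c: size[c], default=None)
        if heavy is None or size[heavy] <= total // 2:
            return u                     # parent side is < total/2 by the walk
        u = heavy

# 7-node tree: removing node 1 leaves components of sizes 3, 1 and 2.
print(find_centroid({0: [1, 2], 1: [3, 4], 3: [5, 6]}, 0))  # -> 1
```

Whenever the walk descends from u to its heavy child c, we had size[c] > |T|/2, so the component on the parent side of c has size |T| − size[c] < |T|/2, which is why checking only the children at each step suffices.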
A component in a centroid decomposition CD(T) might have many edges crossing its boundary (connecting nodes inside and outside the component). The modified centroid decomposition below, denoted MCD(T), generates components with at most two edges crossing the boundary: one going up towards the root and one going down to exactly one subtree.

Modified Centroid Decomposition. An MCD(T) is built recursively as follows: if a component C has no edge from below, we select the centroid c of C as a splitting node as described above. Otherwise, let (x, y) be the edge that crosses the boundary from below, where x is in C, and let c be the centroid of C. As a splitting node choose the lowest common ancestor of x and c. By induction every component has at most one edge from below and one edge from above. A useful property of MCD(T) is captured by the following lemma:
Lemma 2. For any rooted binary tree T with n leaves, we have that h(MCD(T)) ≤ 2 + 2 log₂ n, where h(MCD(T)) denotes the height of MCD(T).

Since each level of MCD(T) can be constructed in O(n) time, we have

Theorem 3. For any rooted binary tree T with n leaves, building MCD(T) takes O(n log n) time in the RAM model.
Returning to our original problem, we visit the nodes of T1 in the order given by the depth first traversal of the ternary tree MCD(T1), where the children of every node u in MCD(T1) are visited from left to right. For every such node u we process T2 in two phases, the contraction phase and the counting phase.
Contraction. Let L(T2) denote the set of leaves in T2 and Λ ⊆ L(T2). In the contraction phase, T2 is compressed into a binary tree of size O(|Λ|) whose leaf set is Λ. The contraction is done in a way such that all the topologies induced by Λ in T2 are preserved in the compressed binary tree. This is achieved by the following three sequential steps: prune all leaves of T2 that are not in Λ, repeatedly prune all internal nodes of T2 with no children, and repeatedly contract unary internal nodes, i.e. nodes having exactly one child.
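The three contraction steps can be sketched in one recursive pass (again a hypothetical nested-list representation, leaves as labels; an illustration, not the paper's scanning implementation):

```python
def contract(t, keep):
    """Compress tree t to the leaf subset `keep`, preserving induced topologies."""
    if not isinstance(t, list):               # leaf
        return t if t in keep else None       # step 1: prune foreign leaves
    kids = [c for c in (contract(c, keep) for c in t) if c is not None]
    if not kids:                              # step 2: prune childless internals
        return None
    if len(kids) == 1:                        # step 3: contract unary nodes
        return kids[0]
    return kids

print(contract([["a", "b"], ["c", ["d", "e"]]], {"a", "d", "e"}))
# -> ['a', ['d', 'e']]
```

The result has leaf set Λ and size O(|Λ|), since every remaining internal node has at least two children.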
Let u be a node of MCD(T1) and C_u the corresponding component of T1. For every such node u we have a contracted version of T2, from now on referred to as T2(u), where L(T2(u)) = L(C_u). The goal is to augment T2(u) with counters (see the counting phase below), so that we can find Σ_{v ∈ T2} |s(u) ∩ s(v)| by scanning T2(u). One can imagine MCD(T1) as being a tree where each node u is augmented with T2(u). To generate all contractions of T2 for level i of MCD(T1), which correspond to a set of disjoint connected components of T1, we can reuse the contractions of T2 at level i−1 of MCD(T1). This means that we have to spend O(n) time to generate the contractions of level i, so to generate all contractions of T2 we need O(n log n) time. Note that by explicitly storing all contractions, we would also need O(n log n) space. For our problem, we traverse MCD(T1) in a depth first manner, so we only have to store a stack of contractions corresponding to the stack of nodes of MCD(T1) that we have to remember during the traversal. Since the components at every second level of MCD(T1) have at most half the size of the components two levels above, Lemma 4 states that the size of this stack is always O(n).
Lemma 4. Let T1 and T2 be two rooted binary trees with n leaves and u_1, u_2, ..., u_k a root to leaf path of MCD(T1). For the corresponding contracted versions T2(u_1), T2(u_2), ..., T2(u_k) we have that Σ_{i=1}^{k} |T2(u_i)| = O(n).
Counting. In the counting phase, we find Σ_{v ∈ T2} |s(u) ∩ s(v)| by scanning T2(u) instead of T2. This makes the total time of the algorithm in the RAM model O(n log n). We consider the following two cases:

C_u has no edge from below. In this case C_u corresponds to a complete subtree of T1. We act exactly as in the O(n²) algorithm (Section 2), but now instead of scanning T2 we scan T2(u).

C_u has one edge from below. In this case C_u does not correspond to a complete subtree of T1, since the edge from below C_u points to a subtree X_u that is located outside of C_u (for an illustration of
