Cache Oblivious Algorithms for Computing the Triplet Distance Between Trees

Gerth Stølting Brodal¹ and Konstantinos Mampentzidis²

1 Department of Computer Science, Aarhus University, Aarhus, Denmark
  gerth@cs.au.dk
2 Department of Computer Science, Aarhus University, Aarhus, Denmark
  kmampent@cs.au.dk
Abstract

We study the problem of computing the triplet distance between two rooted unordered trees with n labeled leaves. Introduced by Dobson in 1975, the triplet distance is the number of leaf triples that induce different topologies in the two trees. The current theoretically best algorithm is an O(n log n) time algorithm by Brodal et al. (SODA 2013). Recently Jansson et al. proposed a new algorithm that, while slower in theory, requiring O(n log³ n) time, in practice outperforms the theoretically faster O(n log n) algorithm. Neither algorithm scales to external memory. We present two cache oblivious algorithms that combine the best of both worlds. The first algorithm is for the case when the two input trees are binary trees, and the second is a generalized algorithm for two input trees of arbitrary degree. Analyzed in the RAM model, both algorithms require O(n log n) time, and in the cache oblivious model O(n/B · log₂(n/M)) I/Os. Their relative simplicity and the fact that they scale to external memory make them achieve the best practical performance. We note that these are the first algorithms for this problem that scale to external memory, both in theory and in practice.
1998 ACM Subject Classification G.2.2 Trees, G.2.1 Combinatorial Algorithms

Keywords and phrases Phylogenetic tree, tree comparison, triplet distance, cache oblivious algorithm

Digital Object Identifier 10.4230/LIPIcs.ESA.2017.21
1 Introduction

Background. Trees are data structures that are often used to represent relationships. For example, in the field of biology, a tree can be used to represent evolutionary relationships, with the leaves corresponding to species that exist today, and internal nodes to ancestor species that existed in the past. For a fixed set of n species, different data or construction methods (e.g. Q* [2], neighbor joining [13]) can lead to trees that look structurally different. An interesting question that arises then is: given two trees T1 and T2 over n species, how different are they? An answer to this question could potentially be used to determine whether the difference is statistically significant or not, which in turn could help with evolutionary inferences. Several ways of comparing two trees have been proposed in the past, with different types of trees (e.g. rooted versus unrooted, binary versus arbitrary degree) having different distance measures (e.g. Robinson-Foulds distance [12], triplet distance [6], quartet distance [7]). In this paper we focus on the triplet distance computation, which is defined for rooted trees.

Research supported by the Danish National Research Foundation, grant DNRF84, Center for Massive Data Algorithmics (MADALGO). An extended version of the paper is available on arXiv [4].
© Gerth Stølting Brodal and Konstantinos Mampentzidis;
licensed under Creative Commons License CC-BY
25th Annual European Symposium on Algorithms (ESA 2017).
Editors: Kirk Pruhs and Christian Sohler; Article No. 21; pp. 21:1–21:14
Leibniz International Proceedings in Informatics
Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany

Figure 1 The four possible triplet topologies: (a) xy|z, (b) xz|y, (c) yz|x, (d) xyz.
Problem Definition. For a given rooted unordered tree T where each leaf has a unique label, a triplet is defined by a set of three leaf labels x, y and z and their induced topology in T. The four possible topologies are illustrated in Figure 1. For two such trees T1 and T2 that are built on n identical leaf labels, the triplet distance D(T1, T2) is the number of triplets that are different in T1 and T2. Let S(T1, T2) be the number of shared triplets in the two trees, i.e. leaf triples with identical topologies in the two trees. We have the relationship D(T1, T2) + S(T1, T2) = C(n, 3), where C(n, 3) denotes the binomial coefficient.
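As a small illustration of how a leaf triple induces one of the four topologies of Figure 1, the following sketch (not from the paper; a parent-pointer tree representation is assumed) classifies a triple by computing the three pairwise lowest common ancestors: the strictly deepest one identifies the pair that branches off together, and if all three coincide the triple is unresolved (xyz).

```python
def lca(parent, depth, a, b):
    """Lowest common ancestor via parent pointers and node depths."""
    while depth[a] > depth[b]:
        a = parent[a]
    while depth[b] > depth[a]:
        b = parent[b]
    while a != b:
        a, b = parent[a], parent[b]
    return a

def induced_topology(parent, depth, x, y, z):
    """Returns 'xy|z', 'xz|y', 'yz|x', or 'xyz' for the triple {x, y, z}."""
    pairs = {"xy|z": lca(parent, depth, x, y),
             "xz|y": lca(parent, depth, x, z),
             "yz|x": lca(parent, depth, y, z)}
    if len(set(pairs.values())) == 1:
        return "xyz"  # all pairwise LCAs coincide: unresolved topology
    # exactly one pairwise LCA is strictly deeper; it names the topology
    return max(pairs, key=lambda t: depth[pairs[t]])

# Toy tree: root 0 with children {1, 4}; node 1 with leaf children 2 and 3.
parent = {1: 0, 4: 0, 2: 1, 3: 1}
depth = {0: 0, 1: 1, 4: 1, 2: 2, 3: 2}
print(induced_topology(parent, depth, 2, 3, 4))  # -> xy|z
```
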
Results. All related work can be found in [5, 1, 14, 3, 15, 9, 10, 11]. Previous and new results are shown in the table below. For the cache oblivious model [8], the papers [5, 1, 14, 3, 10, 11] do not provide an analysis, so here we provide an upper bound.
Year  Reference               Time         I/Os               Space       Non-Binary Trees
1996  Critchlow et al. [5]    O(n²)        O(n²)              O(n²)       no
2011  Bansal et al. [1]       O(n²)        O(n²)              O(n²)       yes
2013  Brodal et al. [14]      O(n log² n)  O(n log² n)        O(n)        no
2013  Brodal et al. [3]       O(n log n)   O(n log n)         O(n log n)  yes
2015  Jansson et al. [10,11]  O(n log³ n)  O(n log³ n)        O(n log n)  yes
2017  new                     O(n log n)   O(n/B · log₂(n/M)) O(n)        yes
The common main bottleneck of all previous approaches is that the data structures used rely intensively on Ω(n log n) random memory accesses. This means that all algorithms are penalized by cache performance and thus do not scale to external memory. We address this limitation by proposing new algorithms for computing the triplet distance on binary and non-binary trees that match the previous best O(n log n) time and O(n) space bounds in the RAM model, but for the first time also scale to external memory. More specifically, in the cache oblivious model, the total number of I/Os required is O(n/B · log₂(n/M)). The basic idea is to replace the dependency on random access to data structures by scanning contracted versions of the input trees. A careful implementation of the algorithms is shown to achieve the best practical performance, thus documenting that the theoretical results carry over to practice.
2 Previous Approaches

A naive algorithm would enumerate all C(n, 3) sets of 3 labels and determine for each set whether the induced topologies in T1 and T2 differ, giving an O(n³) algorithm. This naive approach does not exploit the fact that the triplets are not completely independent. For example, the triplets xy|z and xy|u share the leaves x and y, and in both the lowest common ancestor of x and y lies strictly below the lowest common ancestor of z with either x or y, and below the lowest common ancestor of u with either x or y. Dependencies like this can be exploited to count the number of shared triplets faster.
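The naive algorithm above can be transcribed directly. A minimal sketch (hypothetical, not the paper's code), where `topo1` and `topo2` are assumed callbacks returning the topology a leaf triple induces in T1 and T2 respectively:

```python
from itertools import combinations

def naive_triplet_distance(leaves, topo1, topo2):
    """O(n^3): count leaf triples whose induced topologies differ."""
    return sum(1 for t in combinations(leaves, 3) if topo1(*t) != topo2(*t))

# Stub topologies on 4 leaves: the two trees disagree on exactly the
# C(3, 2) + ... = 3 triples that contain leaf "d".
topo1 = lambda x, y, z: "xy|z"
topo2 = lambda x, y, z: "xyz" if "d" in (x, y, z) else "xy|z"
print(naive_triplet_distance("abcd", topo1, topo2))  # -> 3
```
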

Critchlow et al. [5] exploit the depth of the leaves' ancestors to achieve the first improvement over the naive approach. Bansal et al. [1] exploit the shared leaves between subtrees and reduce the problem to computing the intersection size (number of shared leaves) of all pairs of subtrees, one from T1 and one from T2, which can be solved with dynamic programming.
The O(n²) Algorithm for Binary Trees in [14]. The algorithm for binary trees in [14] is the basis for all subsequent improvements [14, 3, 10], including ours, so we describe it in more detail here. The dependency that is exploited is the same as in [1], but the procedure for counting the shared triplets is completely different. More specifically, each triplet in T1 and T2, defined by the leaves i, j and k, is implicitly anchored in the lowest common ancestor of i, j and k. For a node u in T1 and v in T2, let s(u) and s(v) be the sets of triplets that are anchored in u and v respectively. For the number of shared triplets S(T1, T2) we then have

    S(T1, T2) = Σ_{u ∈ T1} Σ_{v ∈ T2} |s(u) ∩ s(v)| .
For the algorithm to be O(n²), the value |s(u) ∩ s(v)| must be computed in O(1) time. This is achieved by a leaf colouring procedure as follows: fix a node u in T1, color the leaves in the left subtree of u red and the leaves in the right subtree of u blue, let every other leaf have no color, and then transfer this coloring to the leaves in T2, i.e. identically labelled leaves get the same color. To compute |s(u) ∩ s(v)| we do as follows: let l and r be the left and right children of v, and let w_red and w_blue be the number of red and blue leaves in the subtree rooted at a node w in T2. We then have

    |s(u) ∩ s(v)| = C(l_red, 2) · r_blue + C(l_blue, 2) · r_red + C(r_red, 2) · l_blue + C(r_blue, 2) · l_red .   (1)
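Equation (1) transcribes directly into code. A minimal sketch, assuming the four colour counts at the children of v have already been computed by the caller:

```python
from math import comb

def shared_at_v(l_red, l_blue, r_red, r_blue):
    """Equation (1): triplets anchored both in the fixed node u of T1 and in
    the node v of T2 whose left/right children hold the given colour counts."""
    return (comb(l_red, 2) * r_blue + comb(l_blue, 2) * r_red
            + comb(r_red, 2) * l_blue + comb(r_blue, 2) * l_red)

# Two red leaves under v's left child and one blue leaf under its right child
# form exactly one shared triplet of the form xy|z.
print(shared_at_v(2, 0, 0, 1))  # -> 1
```

Summing this quantity over all nodes v of T2, for each fixed u, is exactly the double sum for S(T1, T2) above.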
Subquadratic Algorithms. To reduce the time, Brodal et al. [14] applied the smaller half trick, which specifies a depth first order in which to visit the nodes u of T1, so that each leaf in T1 changes color at most O(log n) times. To count shared triplets efficiently without scanning T2 completely for each node u in T1, the tree T2 is stored in a data structure called a hierarchical decomposition tree (HDT). This HDT maintains, for the currently visited node u in T1, the sum Σ_{v ∈ T2} |s(u) ∩ s(v)| according to (1), so that each color change in T1 can be reflected efficiently in T2. In [14] the HDT is a binary tree of height O(log n) and every update can be done in a leaf-to-root path traversal of the HDT, which in total gives O(n log² n) time. In [3] the HDT is generalized to also handle non-binary trees, each query operates the same, and due to a contraction scheme of the HDT the total time is reduced to O(n log n). Finally, in [10] the so-called heavy-light tree decomposition is used as the HDT. Note that the only difference between all O(n polylog n) results available until now is the type of HDT used.

In terms of external memory efficiency, every O(n polylog n) algorithm performs Θ(n log n) updates to an HDT data structure, which means that for sufficiently large input trees every algorithm requires Ω(n log n) I/Os.
3 The New Algorithm for Binary Trees

Overview. We use the O(n²) algorithm described in Section 2 as a basis. The main difference lies in the order in which we visit the nodes of T1 and in how we process T2 when we count. We propose a new order of visiting the nodes of T1, which we find by applying a hierarchical decomposition to T1. Every component in this decomposition corresponds to a connected part of T1 and a contracted version of T2. In simple terms, if Λ is the set of leaves in a component of T1, the contracted version of T2 is a binary tree on Λ that preserves the topologies induced by Λ in T2 and has size O(|Λ|). To count shared triplets, every component of T1 has a representative node u that we use to scan the corresponding contracted version of T2 in order to find Σ_{v ∈ T2} |s(u) ∩ s(v)|. Unlike previous algorithms, we do not store T2 in a data structure. We process T2 by contracting and counting, both of which can be done by scanning. At the same time, even though we apply a hierarchical decomposition to T1, we do so only to find the order in which to visit the nodes of T1. This means that we do not need to store T1 in a data structure either. Thus, we completely remove the need for data structures (and thereby random memory accesses), and scanning becomes the basic primitive of the algorithm. To make our algorithm I/O efficient, all that remains is to use a proper layout to store the trees in memory, so that every time we scan a tree of size s we spend O(s/B) I/Os.
Preprocessing. As a preprocessing step, we first make T1 left heavy: by a depth first traversal, we swap children so that for every node u in T1 the left subtree is at least as large as the right subtree. Second, we change the leaf labels of T1, which can also be done by a depth first traversal of T1, so that the leaves are numbered 1 to n from left to right. This step takes O(n) time in the RAM model. The second step is done to simplify the process of transferring the leaf colors between T1 and T2. The coloring of a subtree in T1 will correspond to assigning the same color to a contiguous range of leaf labels. Determining the color of a leaf in T2 then requires one if-statement to find in which range (red or blue) its label belongs.
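The two preprocessing passes can be sketched as follows (a hypothetical nested-list representation is assumed: an internal node is a two-element list, a leaf is any other value):

```python
def make_left_heavy(t):
    """Swap children so the larger subtree is on the left.
    Returns (tree, size), where size counts internal nodes and leaves."""
    if not isinstance(t, list):          # leaf
        return t, 1
    left, nl = make_left_heavy(t[0])
    right, nr = make_left_heavy(t[1])
    if nr > nl:
        left, right = right, left        # keep the larger subtree left
    return [left, right], nl + nr + 1

def relabel_leaves(t, next_label=1):
    """Number the leaves 1..n in left-to-right order.
    Returns (tree, next unused label)."""
    if not isinstance(t, list):
        return next_label, next_label + 1
    left, next_label = relabel_leaves(t[0], next_label)
    right, next_label = relabel_leaves(t[1], next_label)
    return [left, right], next_label

t, _ = make_left_heavy(["a", [["b", "c"], "d"]])
t, _ = relabel_leaves(t)
print(t)  # -> [[[1, 2], 3], 4]
```

After these passes, coloring a subtree of T1 red or blue corresponds to a contiguous label range, as described above.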
Centroid Decomposition. For a given rooted binary tree T we let |T| denote the number of nodes in T (internal nodes and leaves). For a node u in T we let l and r be the left and right children of u, and p the parent. Removing u from T partitions T into three (possibly empty) connected components T_l, T_r and T_p containing l, r and p, respectively. A centroid is a node u in T such that max{|T_l|, |T_r|, |T_p|} ≤ |T|/2. A centroid always exists and can be found by starting from the root of T and iteratively visiting the child with the largest subtree; eventually we reach a centroid. Finding the size of every subtree and identifying u takes O(|T|) time in the RAM model. By recursively finding centroids in each of the three components, we in the end get a ternary tree of centroids, which is called the centroid decomposition of T, denoted CD(T). We can generate a level of CD(T) in O(|T|) time, given the decomposition of T into components by the previous level. Since we have to generate at most 1 + log₂(|T|) levels, the total time required to build CD(T) is O(|T| log |T|), hence we get Lemma 1.

Lemma 1. For any rooted binary tree T with n leaves, building CD(T) takes O(n log n) time in the RAM model.
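The centroid search just described can be sketched as follows (a children-dict representation is assumed; this is an illustration, not the paper's implementation): one DFS computes subtree sizes, then we walk from the root towards the larger child until no child's subtree exceeds |T|/2.

```python
def find_centroid(children, root):
    """Return a node u with max{|T_l|, |T_r|, |T_p|} <= |T|/2."""
    size = {}
    def dfs(u):                          # first pass: subtree sizes
        size[u] = 1 + sum(dfs(c) for c in children.get(u, []))
        return size[u]
    total = dfs(root)
    u = root
    while True:
        # child with the largest subtree, if any
        heavy = max(children.get(u, []), key=lambda c: size[c], default=None)
        if heavy is None or size[heavy] <= total // 2:
            return u                     # parent side is < total/2 by the walk
        u = heavy

# 7-node tree: removing node 1 leaves components of sizes 3, 1 and 2.
print(find_centroid({0: [1, 2], 1: [3, 4], 3: [5, 6]}, 0))  # -> 1
```

Whenever the walk descends from u to its heavy child c, we had size[c] > |T|/2, so the component on the parent side of c has size |T| − size[c] < |T|/2, which is why checking only the children at each step suffices.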
A component in a centroid decomposition CD(T) might have many edges crossing its boundary (connecting nodes inside and outside the component). The modified centroid decomposition below, denoted MCD(T), generates components with at most two edges crossing the boundary: one going up towards the root and one going down to exactly one subtree.

Modified Centroid Decomposition. An MCD(T) is built recursively as follows: if a component C has no edge from below, we select the centroid c of C as a splitting node as described above. Otherwise, let (x, y) be the edge that crosses the boundary from below, where x is in C, and let c be the centroid of C. As a splitting node choose the lowest common ancestor of x and c. By induction every component has at most one edge from below and one edge from above. A useful property of MCD(T) is captured by the following lemma:
Lemma 2. For any rooted binary tree T with n leaves, we have that h(MCD(T)) ≤ 2 + 2 log₂ n, where h(MCD(T)) denotes the height of MCD(T).

Since each level of MCD(T) can be constructed in O(n) time, we have

Theorem 3. For any rooted binary tree T with n leaves, building MCD(T) takes O(n log n) time in the RAM model.
Returning to our original problem, we visit the nodes of T1 in the order given by the depth first traversal of the ternary tree MCD(T1), where the children of every node u in MCD(T1) are visited from left to right. For every such node u we process T2 in two phases, the contraction phase and the counting phase.
Contraction. Let L(T2) denote the set of leaves in T2 and Λ ⊆ L(T2). In the contraction phase, T2 is compressed into a binary tree of size O(|Λ|) whose leaf set is Λ. The contraction is done in a way such that all the topologies induced by Λ in T2 are preserved in the compressed binary tree. This is achieved by the following three sequential steps: prune all leaves of T2 that are not in Λ, repeatedly prune all internal nodes of T2 with no children, and repeatedly contract unary internal nodes, i.e. nodes having exactly one child.
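The three contraction steps can be sketched in one recursive pass (again a hypothetical nested-list representation, leaves as labels; an illustration, not the paper's scanning implementation):

```python
def contract(t, keep):
    """Compress tree t to the leaf subset `keep`, preserving induced topologies."""
    if not isinstance(t, list):               # leaf
        return t if t in keep else None       # step 1: prune foreign leaves
    kids = [c for c in (contract(c, keep) for c in t) if c is not None]
    if not kids:                              # step 2: prune childless internals
        return None
    if len(kids) == 1:                        # step 3: contract unary nodes
        return kids[0]
    return kids

print(contract([["a", "b"], ["c", ["d", "e"]]], {"a", "d", "e"}))
# -> ['a', ['d', 'e']]
```

The result has leaf set Λ and size O(|Λ|), since every remaining internal node has at least two children.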
Let u be a node of MCD(T1) and C_u the corresponding component of T1. For every such node u we have a contracted version of T2, from now on referred to as T2(u), where L(T2(u)) = L(C_u). The goal is to augment T2(u) with counters (see the counting phase below), so that we can find Σ_{v ∈ T2} |s(u) ∩ s(v)| by scanning T2(u). One can imagine MCD(T1) as being a tree where each node u is augmented with T2(u). To generate all contractions of T2 for level i of MCD(T1), which correspond to a set of disjoint connected components of T1, we can reuse the contractions of T2 at level i−1 of MCD(T1). This means that we have to spend O(n) time to generate the contractions of level i, so to generate all contractions of T2 we need O(n log n) time. Note that by explicitly storing all contractions, we would also need O(n log n) space. For our problem, we traverse MCD(T1) in a depth first manner, so we only have to store a stack of contractions corresponding to the stack of nodes of MCD(T1) that we have to remember during the traversal. Since the components at every second level of MCD(T1) have at most half the size of the components two levels above, Lemma 4 states that the size of this stack is always O(n).
Lemma 4. Let T1 and T2 be two rooted binary trees with n leaves and u_1, u_2, ..., u_k a root to leaf path of MCD(T1). For the corresponding contracted versions T2(u_1), T2(u_2), ..., T2(u_k) we have that Σ_{i=1}^{k} |T2(u_i)| = O(n).
Counting. In the counting phase, we find Σ_{v ∈ T2} |s(u) ∩ s(v)| by scanning T2(u) instead of T2. This makes the total time of the algorithm in the RAM model O(n log n). We consider the following two cases:

C_u has no edge from below. In this case C_u corresponds to a complete subtree of T1. We act exactly as in the O(n²) algorithm (Section 2), but now instead of scanning T2 we scan T2(u).

C_u has one edge from below. In this case C_u does not correspond to a complete subtree of T1, since the edge from below C_u points to a subtree X_u that is located outside of C_u (for an illustration of
