Deterministic versus Adaptive Routing in Fat-Trees

doi:10.1109/IPDPS.2007.370482

∗

C. Gómez, F. Gilabert, M.E. Gómez, P. López and J. Duato

Dept. of Computer Engineering

Universidad Politécnica de Valencia

Camino de Vera, 14, 46071–Valencia, Spain

{crigore, fragivil}@gap.upv.es and {megomez, plopez, jduato}@disca.upv.es

Abstract

Clusters of PCs have become very popular for building high-performance computers. These machines use commodity PCs linked by a high-speed interconnect. Routing is one of the most important design issues of interconnection networks. Adaptive routing usually balances network traffic better, thus allowing the network to obtain a higher throughput. However, adaptive routing introduces out-of-order packet delivery, which is unacceptable for some applications. Concerning topology, most of the commercially available interconnects are based on fat-trees. Fat-trees offer rich connectivity among nodes, making it possible to obtain paths between all source-destination pairs that do not share any link. We exploit this idea to propose a deterministic routing algorithm for fat-trees, comparing it with adaptive routing under several workloads. The results show that deterministic routing can achieve a level of performance similar to, and in some scenarios higher than, that of adaptive routing, while providing in-order packet delivery.

1 Introduction

In large parallel computers, high-performance interconnection networks are crucial to achieve the maximum performance. Routing is one of the critical design issues of interconnection networks [4]. The routing strategy determines the path that each packet follows between a source–destination pair. In deterministic routing schemes, an injected packet traverses a fixed, predetermined path between source and destination, while in adaptive routing schemes the packet may traverse one of a number of alternative paths. Deterministic routing algorithms usually do a poor job of balancing traffic among the network links, but they are usually easier to implement and easier to make deadlock-free. Moreover, for networks in which the ordering of messages between particular source–destination pairs is important, deterministic routing is often a simple way to guarantee in-order delivery. This is the case, for example, for certain cache coherence protocols and some communication libraries. On the other hand, adaptive routing algorithms take into account the status of the network in order to make the routing decisions. This information may include the status of links or the queue lengths. Adaptive routing balances network traffic better, thus allowing the network to obtain a higher throughput. A good adaptive routing algorithm should outperform a deterministic one, since it uses network state information that is not available to deterministic routing.

∗ This work was supported by the Spanish MCYT under Grant TIN2006-15516-C04-01, by CONSOLIDER-INGENIO 2010 under Grant CSD2006-00046 and by the European Commission in the context of the SCALA integrated project #27648 (FP6).

1-4244-0910-1/07/$20.00 © 2007 IEEE.

Cluster-based machines use any of the commercial high-performance switch-based point-to-point interconnects. Either regular direct networks (tori and meshes) or indirect multistage networks (MINs) are the usual choice. In particular, fat-trees have risen in popularity in the past few years (e.g., Myrinet [11], InfiniBand [8], Quadrics [12]).

The authors in [1] compare an oblivious routing algorithm with an adaptive routing algorithm for multistage networks, and conclude that the adaptive algorithm achieves higher performance. Readers should take into account that oblivious routing is not the same as deterministic routing [4, 3], since oblivious routing can provide several paths for a source–destination pair, with the routing decision taken without considering (oblivious to) network status.

In this paper, we focus on routing in fat-trees. In particular, we propose a deterministic routing algorithm for fat-trees. As stated above, deterministic routing has some advantages over adaptive routing, for instance simplicity and in-order packet delivery. We will show that the proposed deterministic routing can be implemented in a compact way and that it is able to obtain similar or even better performance than adaptive routing in fat-trees. The rest of the paper is organized as follows. Section 2 reviews the fat-tree topology and presents the notation and assumptions used in the following sections. Section 3 describes an adaptive routing algorithm for fat-trees and a possible implementation using Interval Routing (IR) [2]. Section 4 presents the proposed deterministic routing algorithm for fat-trees and shows how it can be implemented using Flexible Interval Routing (FIR) [7]. Section 5 evaluates its performance. Finally, some conclusions are drawn.

2 Fat-Tree Topology

The k-ary n-trees are a parametric family of regular multistage topologies. The number of stages is $n$, and $k$ is the arity, that is, the number of links of a switch that connect to the previous or to the next stage (i.e., the switch degree is $2k$). A k-ary n-tree is able to connect $N = k^n$ processing nodes using $nk^{n-1}$ switches.

Each processing node is represented as an n-tuple in $\{0, 1, \ldots, k-1\}^n$, and each switch is defined as a pair $\langle s, o \rangle$, where $s \in \{0..n-1\}$ is the stage at which the switch is located, and $o$ is an (n−1)-tuple in $\{0, 1, \ldots, k-1\}^{n-1}$ which identifies the switch inside the stage. Figure 1 shows a 2-ary 4-tree, with 16 processing nodes and 32 switches.

In a fat-tree, two switches $\langle s, o_{n-2}, \ldots, o_1, o_0 \rangle$ and $\langle s', o'_{n-2}, \ldots, o'_1, o'_0 \rangle$ are connected by an edge if $s' = s + 1$ and $o_i = o'_i$ for all $i \neq s$. On the other hand, there is an edge between the switch $\langle 0, o_{n-2}, \ldots, o_1, o_0 \rangle$ and the processing node $p_{n-1}, \ldots, p_1, p_0$ if $o_i = p_{i+1}$ for all $i \in \{n-2, \ldots, 1, 0\}$. This edge is labeled with $p_0$ in stage 0. In what follows, we will assume that descending links are labeled from $0$ to $k-1$, and ascending links from $k$ to $2k-1$.
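The definitions above can be checked with a small enumeration. The following Python sketch is only illustrative: the function names `fat_tree` and `attached_switch` and the tuple encoding (most significant component first) are assumptions made here, not part of the paper.

```python
from itertools import product

def fat_tree(k, n):
    """Enumerate nodes, switches and switch-to-switch edges of a k-ary n-tree."""
    nodes = list(product(range(k), repeat=n))        # (p_{n-1}, ..., p_0)
    labels = list(product(range(k), repeat=n - 1))   # (o_{n-2}, ..., o_0)
    switches = [(s, o) for s in range(n) for o in labels]
    edges = []
    for s in range(n - 1):
        for o in labels:
            for o2 in labels:
                # Component o_i is stored at tuple index n-2-i.
                # Edge rule: s' = s + 1 and o_i = o'_i for all i != s.
                if all(o[n - 2 - i] == o2[n - 2 - i]
                       for i in range(n - 1) if i != s):
                    edges.append(((s, o), (s + 1, o2)))
    return nodes, switches, edges

def attached_switch(p):
    """Stage-0 switch of node p: o_i = p_{i+1}, i.e. p without its last component."""
    return (0, p[:-1])

nodes, switches, edges = fat_tree(2, 4)
print(len(nodes), len(switches), len(edges))
```

For the 2-ary 4-tree of Figure 1 this yields the 16 nodes and 32 switches stated in the text; each switch below the top stage has $k$ upward edges, since only component $o_s$ may change between stages $s$ and $s+1$.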

3 Adaptive Routing in Fat–trees

In k-ary n-trees, minimal routing from a source to a destination can be accomplished by sending packets upwards to one of the nearest common ancestors of the source and destination nodes and then, from there, downwards to the destination. When crossing stages in the upwards direction, several paths are possible, thus providing adaptive routing. In fact, each switch can select any of its up output ports. Once a nearest common ancestor has been reached, the packet is turned around and sent downwards to its destination, and just a single path is available.

The stage up to which the packet must be forwarded is obtained by comparing the source and destination components beginning from the most significant one. The first pair of components that differs indicates the last stage to which the packet is forwarded up. For instance, in order to send a packet from node $p_{n-1}, \ldots, p_1, p_0$ to node $p'_{n-1}, \ldots, p'_1, p'_0$, the packet must be sent up to stage $i$ if $p_j = p'_j$ for $j \in \{n-1..i+1\}$ and $p_i \neq p'_i$. Once in stage $i$, the descending path is deterministic. At each stage, the descending link to choose is indicated by the component corresponding to that stage in the destination n-tuple. In the example, from stage $i$ the packet must be forwarded through link $p'_i$; from stage $i-1$ through link $p'_{i-1}$, and so on.
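The turnaround rule and the descending phase can be sketched in a few lines of Python. The helper names (`digits`, `turnaround_stage`, `descending_links`) and the choice of integer node identifiers are assumptions of this sketch, not notation from the paper.

```python
def digits(node, k, n):
    """Components (p_{n-1}, ..., p_0) of a node identifier, most significant first."""
    return [(node // k**i) % k for i in reversed(range(n))]

def turnaround_stage(src, dst, k, n):
    """Stage i such that p_j = p'_j for all j > i and p_i != p'_i."""
    p, q = digits(src, k, n), digits(dst, k, n)
    for pos in range(n):
        if p[pos] != q[pos]:
            return n - 1 - pos   # list position pos holds component n-1-pos
    raise ValueError("source and destination are the same node")

def descending_links(src, dst, k, n):
    """Descending link taken at stages i, i-1, ..., 0: the components p'_i, ..., p'_0."""
    i = turnaround_stage(src, dst, k, n)
    q = digits(dst, k, n)
    return [q[n - 1 - s] for s in range(i, -1, -1)]

print(turnaround_stage(1, 4, 2, 4), descending_links(1, 4, 2, 4))
```

For a packet from node 1 (0001) to node 4 (0100) in a 2-ary 4-tree, the most significant differing component is $p_2$, so the packet turns around at stage 2 and then descends through links 1, 0 and 0.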

3.1 Adaptive Routing Implementation

This routing algorithm can be easily implemented using Interval Routing (IR) [2]. In IR, each switch output port has one associated interval. Each packet is forwarded through the output port whose interval contains the destination of the packet. The interval associated with each output port can be cyclic and is implemented with two registers. We will refer to these two registers as First Interval (FI) and Last Interval (LI). Moreover, this scheme requires only simple hardware, at most a pair of comparators for each output link, and is therefore also very fast.
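As a rough software analogue of the two comparators per port, the following Python sketch tests a destination against a possibly cyclic interval. The function names and the register values used in the example (a stage-0 switch of a 16-node network whose ascending interval wraps around) are assumptions chosen for illustration.

```python
def in_interval(dest, fi, li):
    """True if dest lies in [fi, li]; the interval wraps around when fi > li."""
    if fi <= li:
        return fi <= dest <= li
    return dest >= fi or dest <= li

def ir_candidates(ports, dest):
    """Indices of the output ports whose (FI, LI) interval contains dest."""
    return [port for port, (fi, li) in enumerate(ports)
            if in_interval(dest, fi, li)]

# Hypothetical switch: links 0 and 1 descend to nodes 4 and 5,
# links 2 and 3 ascend and share the cyclic interval 6..3.
ports = [(4, 4), (5, 5), (6, 3), (6, 3)]
print(ir_candidates(ports, 4), ir_candidates(ports, 1))
```

When more than one port is returned, as for destination 1 here, a selection function has to pick one of them; that step is outside this sketch.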

Figure 1 presents an example configuration of the IR registers for a 16-node 2-ary 4-tree. As can be seen, the interval associated with some output ports must be cyclic. As an example, we describe how to route with IR a packet from node 1 to node 4. Switch 0 can use both ascending links (links 2 and 3) to route the packet, since destination 4 is included in the intervals associated with both links. Assume that the selection function selects link 3, so switch 9 is reached. Switch 9 can route the packet through any of its ascending links. Assume link 2 is selected; the packet then reaches switch 17. At switch 17, only link 1 is allowed to route the packet destined to node 4, so the packet arrives at switch 11. At this switch, only link 0 is allowed to route the packet, and switch 2 is reached. Finally, switch 2 delivers the packet to node 4 through link 0.

Next, we present a general algorithm to fill the FI and LI registers to support adaptive routing in fat-trees. Figure 2 shows the prototyped FI and LI configuration for the links of a generic switch of a k-ary n-tree. The switch is labeled as $\langle s, o_{n-2}, o_{n-3}, \ldots, o_1, o_0 \rangle$, so it is located at stage $s$. First, we identify the destinations reachable through the descending links, and later the rest of the destinations, which are reachable through all the ascending links.

The nodes $\langle p_{n-1}, \ldots, p_1, p_0 \rangle$ that are reachable through the descending links can be easily computed from the switch components. In particular, $p_i = o_{i-1}$ for $i \in \{n-1, .., s+1\}$. This set of destinations is split into several subsets, one per descending link, depending on the $p_s$ component, $s$ being the switch stage. The nodes with $p_s = 0$ are reachable through link 0, the destinations with $p_s = 1$ are reachable through link 1, and so on. As an example, link 0 of switch $\langle s, o_{n-2}, \ldots, o_1, o_0 \rangle$ forwards packets destined to nodes $o_{n-2}, \ldots, o_s, 0, X \ldots X$, that is, $FI_{desc.} = o_{n-2}, \ldots, o_s, 0, 0 \ldots 0$ and $LI_{desc.} = o_{n-2}, \ldots, o_s, 0, k-1 \ldots k-1$.

[Figure 1: diagram of a 2-ary 4-tree with 16 processing nodes (0000 to 1111) and 32 switches labeled ⟨stage, index⟩ (0,000 to 3,111), with each output port annotated with its FI..LI interval.]

Figure 1. Adaptive routing with IR in a 2-ary 4-tree.

The destinations that are not reachable through the descending links, but are reachable through the ascending links, are the ones that do not meet $p_i = o_{i-1}$ for $i \in \{n-1, .., s+1\}$. Indeed, this set is reachable through all the ascending links. The FI register of the ascending links must store the destination that follows the largest one reachable through the descending links, that is, the destination after the one stored in the LI register of descending link $k-1$. Likewise, the LI register of the ascending links must store the destination that precedes the smallest one reachable through the descending links. That is, $FI_{asc.} = (LI_{link\,k-1} + 1) \bmod k^n$ and $LI_{asc.} = (FI_{link\,0} - 1 + k^n) \bmod k^n$, $k^n$ being the number of nodes in the network$^1$. This interval is valid for all the ascending links of the switch and can result in a cyclic interval.

$^1$ For $LI_{asc.}$ we add $k^n$ in order to avoid setting a negative destination value.
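The register-filling rules for the adaptive case can be summarized in a short Python sketch. The function name `ir_registers` and the encoding of the switch label as a tuple $(o_{n-2}, \ldots, o_0)$ are assumptions of this sketch; destinations are treated as base-$k$ integers.

```python
def ir_registers(k, n, s, o):
    """FI/LI values for the switch <s, o>, with o = (o_{n-2}, ..., o_0).

    Returns (descending, ascending): a list of (FI, LI) pairs for
    descending links 0..k-1, and the single (FI, LI) pair shared by
    all ascending links.
    """
    total = k ** n
    prefix = 0
    for digit in o[: n - 1 - s]:       # components o_{n-2}, ..., o_s
        prefix = prefix * k + digit
    block = k ** s                     # destinations below each descending link
    descending = []
    for link in range(k):
        fi = (prefix * k + link) * block   # o_{n-2},...,o_s, link, 0...0
        li = fi + block - 1                # o_{n-2},...,o_s, link, k-1...k-1
        descending.append((fi, li))
    fi_asc = (descending[k - 1][1] + 1) % total
    li_asc = (descending[0][0] - 1 + total) % total
    return descending, (fi_asc, li_asc)

print(ir_registers(2, 4, 0, (0, 1, 0)))
```

For the stage-0 switch ⟨0,010⟩ of a 2-ary 4-tree, the descending links cover destinations 4 and 5, and the ascending interval is the cyclic interval from 6 to 3. Switches of the top stage have no ascending links in use, so the ascending pair returned for them can be ignored.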

4 Deterministic Routing in Fat–trees

In this section, we propose a deterministic routing algorithm for fat-trees. Our challenge is to propose an efficient mechanism to reduce the multiple ascending paths in a fat-tree to a single one for each source–destination pair. The path reduction should be done so as to balance network link utilization: all the links of a given stage should be used by a similar number of paths. This is easy to achieve in the ascending phase. A simple idea is to divide the adaptive up interval of a switch into k sub-intervals of the same size, but with this mechanism the descending links are not balanced. We analyzed several approaches, trying to balance both routing phases, and found that a good alternative is to shuffle, at each switch, consecutive destinations in the ascending phase. In other words, consecutive destinations are distributed among the different ascending links, reaching different switches in the next stage.

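[Figure 2: generic switch $\langle s, o_{n-2}, o_{n-3}, \ldots, o_1, o_0 \rangle$ with its register values. Descending link 0: FI = $(o_{n-2}, \ldots, o_s, 0, 0, \ldots, 0)$, LI = $(o_{n-2}, \ldots, o_s, 0, k-1, \ldots, k-1)$. Descending link $k-1$: FI = $(o_{n-2}, \ldots, o_s, k-1, 0, \ldots, 0)$, LI = $(o_{n-2}, \ldots, o_s, k-1, k-1, \ldots, k-1)$. Ascending links $k$ to $2k-1$: FI = $((o_{n-2}, \ldots, o_s, k-1, k-1, \ldots, k-1) + 1) \bmod k^n$, LI = $((o_{n-2}, \ldots, o_s, 0, 0, \ldots, 0) + k^n - 1) \bmod k^n$.]

Figure 2. Prototyped register configuration.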

We will explain the mechanism by using an example. Figure 3 shows the destination node distribution in the ascending and descending links of a 2-ary 3-tree using our proposal. In the figure, each ascending link has been labeled (in bold italic) with the destinations whose packets will be forwarded through it. In the first stage, consecutive destinations are shuffled between the two up links. To do that, the least significant component of the packet destination address (the least significant bit) is used to select the ascending output port. That is, packets that must be forwarded upwards select the ascending output port indicated by the least significant component of the packet destination ($p_0$). Therefore, consecutive destinations are sent to different switches in the next stage. At the second stage, all the destinations that reach a switch have the same least significant component. Hence, the component to consider in the selection of the up output port in this stage is the next one in the destination address. For instance, only packets destined to nodes 0, 2, 4 and 6 reach switch 4, and of these only packets destined to nodes 4 and 6 must be forwarded upwards. Packets destined to node 4 select the first up link, and packets destined to node 6 the other one.

Considering all the switches of stage 1, packets destined to nodes 0, 1, 4 and 5 use the first up output port of the switches and packets destined to nodes 2, 3, 6 and 7 use the second output port. That is, the second least significant component of the packet destination is used. This mechanism distributes the traffic destined to different nodes, as shown in Figure 3. As can be seen, packets destined to the same node reach the same switch at the last stage independently of their source node. Each switch of the last stage receives packets addressed to only two destinations, and packets destined to each one are forwarded through a different descending link.

The bottom of Figure 3 shows the number of paths (source–destination pairs) that use each link at each stage. Both the ascending and descending links of a given stage are used by the same number of paths, so traffic in the network is completely balanced.
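The balance property can be checked by enumerating all source–destination pairs. The Python sketch below is a rough model, not the authors' tool: the names `deterministic_path` and `paths_per_link` are hypothetical, and it assumes that up port $j$ of a stage-$s$ switch leads to the next-stage switch whose component $o_s$ equals $j$.

```python
from collections import Counter
from itertools import permutations

def digits(node, k, n):
    """Components of a node identifier, least significant first (index i is p_i)."""
    return [(node // k**i) % k for i in range(n)]

def deterministic_path(src, dst, k, n):
    """Links used from src to dst, each as (direction, stage, switch, port)."""
    p, q = digits(src, k, n), digits(dst, k, n)
    top = max(i for i in range(n) if p[i] != q[i])   # turnaround stage
    o = p[1:]                    # stage-0 switch of the source: o_i = p_{i+1}
    links = []
    for s in range(top):         # ascending: port chosen by destination component s
        links.append(("up", s, tuple(o), q[s]))
        o[s] = q[s]              # assumption: up port j reaches the switch with o_s = j
    for s in range(top, 0, -1):  # descending: link chosen by destination component s
        links.append(("down", s, tuple(o), q[s]))
        o[s - 1] = q[s]
    links.append(("down", 0, tuple(o), q[0]))        # delivery to the node
    return links

def paths_per_link(k, n):
    """Number of source-destination pairs whose path crosses each link."""
    usage = Counter()
    for src, dst in permutations(range(k**n), 2):
        usage.update(deterministic_path(src, dst, k, n))
    return usage

usage = paths_per_link(2, 3)
by_level = {}
for (direction, stage, _, _), count in usage.items():
    by_level.setdefault((direction, stage), set()).add(count)
print(by_level)
```

Under these assumptions, every ascending link leaving a given stage carries the same number of paths, and that number matches the descending links entering the same stage from above, which is the balance property described in the text.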

[Figure 3: diagram of a 2-ary 3-tree with 8 processing nodes (000 to 111) and 12 switches labeled ⟨stage, index⟩ (0,00 to 2,11) in stages 0 to 2. Each port is annotated with its FI..LI interval and MR value; RRR = 0011 for ascending links and RRR = 0000 for descending links. Ascending links are labeled with the destinations forwarded through them, and the bottom of the figure gives the number of paths per link.]

Figure 3. Deterministic routing in a 2-ary 3-tree with FIR registers.

4.1 Deterministic Routing Implementation

In this section, we show how the proposed deterministic routing strategy for fat-trees can be easily implemented using Flexible Interval Routing (FIR) [7]. To make this paper self-contained, we first summarize FIR.

FIR can implement the most commonly used routing algorithms in meshes and tori. In FIR, as in IR, each output port has an associated cyclic interval, which is implemented with the FI and LI registers. In order to add flexibility, however, additional registers are associated with the output ports. In particular, each output port has a Mask Register (MR). This register indicates which bits of the packet destination address are compared with the output port bounds (provided by the FI and LI registers).

In order to guarantee deadlock freedom, some routing restrictions must usually be applied. These routing restrictions are taken into account in FIR by means of the Routing Restrictions Register (RRR), which defines, for each output port, which other output ports of the switch should be selected in preference to this one. This register has one bit per output port. For a given output port i, bit j of the RRR indicates whether output port j has preference over output port i (bit set to 1) or not (0). Thus, the final routing decision for an output port i is obtained not only by comparing the masked destination with the interval bounds, but also by checking the bits in its RRR.

Now, we show how these registers can be configured to route packets in fat-trees following the proposed deterministic routing algorithm. Notice that the adaptive routing algorithm described in Section 3 can also be implemented with FIR, since IR is a subset of FIR.

[Figure 4: register values for ascending link $L$ at stage $s$, where $k$ is the arity of the switch, $n$ the number of stages, $r = \log(k)$ the number of bits used by each stage in the destination addresses, and $n \cdot r$ the number of bits in destination identifiers.
FI..LI = $\{0\}^{n \cdot r - (s+1) \cdot r}\,(L-k)\,\{0\}^{s \cdot r}$ .. $\{0\}^{n \cdot r - (s+1) \cdot r}\,(L-k)\,\{0\}^{s \cdot r}$ (the bits corresponding to the current stage are given by the link identifier; they must be equal to $L-k$, represented by $r$ bits).
MR = $\{0\}^{n \cdot r - (s+1) \cdot r}\,\{1\}^{r}\,\{0\}^{s \cdot r}$ (selection of the bits corresponding to the current stage).
RRR = $\{0\}^{k}\,\{1\}^{k}$.]

Figure 4. Register configuration in the ascending.

We begin by explaining how to configure the ascending links. They are configured in a very different way than in the adaptive case, since the number of ascending paths is reduced to one for each source–destination pair. At each switch, the ascending link to use is obtained from the packet destination component corresponding to the stage at which the switch is located. A given packet is only forwarded upwards through that up link. That is, at stage 0 the least significant component is used, at stage 1 the next one, and so on. To extract from the destination identifier the component corresponding to the switch stage, we use the Mask Register (MR). The MR of each ascending link sets to 1 the bits corresponding to the component associated with the switch stage. Figure 3 shows the FIR register configuration for a 2-ary 3-tree. In the first stage, MR is set to 001, as only the least significant bit is selected and compared with FI and LI. Packets are forwarded through the ascending output port selected by the least significant component of their destination. At stage 1, the next bit or component is considered (MR is set to 010), and so on. In k-ary n-trees with k > 2 the components have more than one bit and, thus, more than one bit of the MR is set to 1 to select the component given by the switch stage.

Descending links have the same values stored in FI and LI as with adaptive routing, since the path reduction is only done in the upwards phase. As the MR is not used in the downwards phase, it is set to all 1s to select all the bits in the destination address. Notice that the same downwards paths valid in the adaptive case are also valid in the deterministic case, but only one is actually used.

Notice that in Figure 3 the destinations reachable through the descending links of a switch (for instance, destinations 0 and 1 at switch 0) are also included in the ascending intervals, so packets destined to those nodes could be incorrectly forwarded through the upwards links. To avoid this problem, the RRR is used to give preference to the descending links over the ascending ones and guarantee a minimal path. In the RRR, the least significant half of the bits corresponds to the descending links and the most significant half to the ascending ones. Therefore, in the ascending links (links 2 and 3) the RRR stores 0011, to give preference to links 0 and 1. In this way, as an example, when routing a packet to destination 0 at switch 0, both output ports 0 and 2 may be allowed, since this destination, after being masked, is included in the intervals associated with both output ports; but as output port 2 gives preference to output port 0, output port 2 is not finally returned.
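The routing decision just described can be modeled in software as follows. This is a minimal sketch of one reading of FIR, in which a port is returned only if its masked interval check succeeds and no port it gives preference to also matches; the `PortRegs` class and the `fir_route` function are hypothetical names, and the register values are those described in the text for switch 0 of the 2-ary 3-tree.

```python
from dataclasses import dataclass

@dataclass
class PortRegs:
    fi: int    # First Interval
    li: int    # Last Interval
    mr: int    # Mask Register: destination bits compared with FI and LI
    rrr: int   # Routing Restrictions Register: bit j set => port j is preferred

def in_interval(x, fi, li):
    """True if x lies in [fi, li]; the interval wraps around when fi > li."""
    if fi <= li:
        return fi <= x <= li
    return x >= fi or x <= li

def fir_route(ports, dest):
    """Output ports returned for dest after applying MR and RRR."""
    match = [in_interval(dest & p.mr, p.fi, p.li) for p in ports]
    allowed = []
    for i, p in enumerate(ports):
        preferred_matches = any(match[j] for j in range(len(ports))
                                if (p.rrr >> j) & 1)
        if match[i] and not preferred_matches:
            allowed.append(i)
    return allowed

switch0 = [
    PortRegs(0b000, 0b000, 0b111, 0b0000),   # descending link 0 (node 0)
    PortRegs(0b001, 0b001, 0b111, 0b0000),   # descending link 1 (node 1)
    PortRegs(0b000, 0b000, 0b001, 0b0011),   # ascending link 2
    PortRegs(0b001, 0b001, 0b001, 0b0011),   # ascending link 3
]
print(fir_route(switch0, 0), fir_route(switch0, 4))
```

Destination 0 matches both port 0 and, after masking, port 2, but only port 0 is returned because port 2 defers to the descending links. Destination 4 matches no descending link, so it is sent up through link 2.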

Figure 4 shows an algorithm for configuring the FIR registers of the ascending links. The RRRs have the least significant half of their bits set to 1 and the most significant half set to 0, to give preference to the descending links. Notice that with k > 2, the MRs will have as many bits set to 1 as the bits needed to represent a component in the destination address, that is, log(k). These 1s will be located in the position corresponding to the switch stage and will select these bits in the destination identifiers. The rest of the MR will be set to 0. On the other hand, the FI and LI registers of an ascending link will select the destinations that have the identifier of the ascending link in the position of the component given by the switch stage (s in Figure 4). Since ascending links are labeled from k to 2k − 1, the value in the destination identifier component will be L − k, L being the link identifier. The configuration for the descending links is not shown because it is very simple. The FI and LI registers are the same as in the adaptive case, and the MR is set to all 1s to select all the bits in the destination address for comparison with FI and LI. The RRR is set to all 0s, since no preference is given to the other output ports.
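The ascending-link configuration can be written down as a short Python sketch. The function name `ascending_fir_registers` is hypothetical, and the sketch assumes that $k$ is a power of two so that each component occupies exactly $r = \log_2 k$ bits.

```python
def ascending_fir_registers(k, n, s, link):
    """FIR registers of ascending link `link` (k <= link <= 2k-1) at stage s.

    Assumes k is a power of two, so each destination component takes
    r = log2(k) bits of the n*r-bit destination identifier.
    """
    if not k <= link <= 2 * k - 1:
        raise ValueError("ascending links are labeled from k to 2k-1")
    r = (k - 1).bit_length()           # bits per component
    shift = s * r                      # position of component s
    value = (link - k) << shift        # component s must equal L - k
    return {
        "FI": value,
        "LI": value,
        "MR": ((1 << r) - 1) << shift, # select only the bits of component s
        "RRR": (1 << k) - 1,           # low k bits set: prefer descending links
    }

for stage in range(2):
    for link in (2, 3):
        regs = ascending_fir_registers(2, 3, stage, link)
        print(stage, link, {name: format(v, "04b") for name, v in regs.items()})
```

For the 2-ary 3-tree this gives MR = 001 at stage 0 and MR = 010 at stage 1, with RRR = 0011 on every ascending link, in line with the values described for Figure 3. Descending links would keep the FI and LI values of the adaptive case, with MR set to all 1s and RRR set to all 0s.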

5 Performance Evaluation

5.1 Network Model

To evaluate the routing algorithm proposed above, a detailed event-driven simulator has been implemented. The simulator models a k-ary n-tree with FIR routing and virtual cut-through switching. Each router has a full crossbar with queues at both the input and output ports. We assumed that it takes 20 clock cycles to apply the routing algorithm; switch and link bandwidth has been assumed to be one flit per clock cycle; and fly time has been assumed to be 8 clock cycles. These values were used to model Myrinet networks in [5]. Credits are used to implement the flow control mechanism. Also, each port link has a two-packet buffer.

When adaptive routing is used, a selection function must be applied after applying the routing function. Remember that FIR implements the routing function. In [1], the authors compare several selection functions for adaptive routing, but they only consider the peak throughput in the evaluation, so they conclude that the selection function is not critical. However, in [6], several selection functions for fat-trees


