
Mailbox Switch: A Scalable Two-stage Switch
Architecture for Conflict Resolution of Ordered
Packets
Cheng-Shang Chang, Duan-Shin Lee, Ying-Ju Shih, and Chao-Lin Yu
Institute of Communications Engineering
National Tsing Hua University
Hsinchu 300, Taiwan, R.O.C.
Email: cschang@ee.nthu.edu.tw, lds@cs.nthu.edu.tw, yjshih@gibbs.ee.nthu.edu.tw, clyu@gibbs.ee.nthu.edu.tw
Abstract
Traditionally, conflict resolution in an input-buffered switch is solved by finding a matching between inputs
and outputs per time slot, which incurs unscalable computation and communication overheads. The main objective
of this paper is to propose a scalable solution, called the mailbox switch, that solves the out-of-sequence problem
in the two-stage switch architecture. The key idea of the mailbox switch is to use a set of symmetric connection
patterns to create a feedback path for packet departure times. With the information of packet departure times, the
mailbox switch can schedule packets so that they depart in the order of their arrivals. Despite the simplicity of the
mailbox switch, we show via both the theoretical models and simulations that the throughput of the mailbox switch
can be as high as 75%. With limited resequencing delay, a modified version of the mailbox switch achieves 95%
throughput. We also propose a recursive way to construct the switch fabrics for the set of symmetric connection
patterns. If the number of inputs, N, is a power of 2, we show that the switch fabric for the mailbox switch can
be built with (N/2) log_2 N 2 × 2 switches.
Index Terms
Birkhoff-von Neumann switches, input-buffered switches, conflict resolution, two-stage switches
I. INTRODUCTION
As the parallel input buffers of input-buffered switches provide the needed speedup in memory access, input-buffered
switches are known to be more scalable than shared-memory switches. However, synchronized parallel
transmissions among parallel input buffers in every time slot require careful coordination to avoid conflicts.
Thus, finding a scalable method (and architecture) for conflict resolution becomes the fundamental design
problem of input-buffered switches.
Traditionally, conflict resolution is solved by finding a matching between inputs and outputs per time
slot (see e.g., [11], [1], [25], [17], [18], [19], [9], [15]). Two steps are needed for finding a matching.
(i) Communication overhead: one has to gather the information of the buffers at the inputs.
(ii) Computation overhead: based on the gathered information, one then applies a certain algorithm
to find a matching.
This research is supported in part by the National Science Council, Taiwan, R.O.C., under Contract NSC-91-2219-E007-003, and the
program for promoting academic excellence of universities NSC 93-2752-E007-002-PAE. A conference version of this paper was presented
in IEEE INFOCOM 2004.

Most of the works in the literature pay more attention to reducing the computation overhead by finding
scalable matching algorithms, e.g., wavefront arbitration in [25], PIM in [1], SLIP in [17], and DRRM in
[15]. However, in our view, it is the communication overhead that makes matching per time slot difficult
to scale. To see this, suppose that there are N inputs/outputs and each input implements N virtual output
queues (VOQ). If we use a single bit to indicate whether a VOQ is empty, then we have to transmit N
bits from each input (to a central arbiter or to an output) in every time slot. For instance, transmitting
such N-bit information in PIM and SLIP is implemented by an independent circuit that sends out parallel
requests. Suppose that the packet size is chosen to be 64 bytes, i.e., 512 bits. Then in a switch with more than
512 inputs/outputs, the per-slot communication overhead exceeds the transmission of the data itself.
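As a quick worked check of this crossover (our arithmetic, shown only for illustration), one bit of VOQ state per output means each input signals N bits per time slot, which matches the size of a 64-byte packet exactly at N = 512:

```latex
% Crossover between per-slot VOQ signaling and one packet payload (illustrative).
\[
  \underbrace{N \ \text{bits/slot}}_{\text{VOQ occupancy bits per input}}
  \;\ge\;
  \underbrace{64 \times 8 = 512 \ \text{bits/slot}}_{\text{one 64-byte packet}}
  \quad\Longleftrightarrow\quad
  N \ge 512 .
\]
```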
To reduce the communication overhead, one approach is to gather the long term statistics of the
VOQs, e.g., the average arrival rates, and then use such information to find a sequence of pre-determined
connection patterns (see e.g., [1], [14], [10], [4], [5], [2]). Most of the works along this line are based
on the well-known Birkhoff-von Neumann algorithm [3], [27] that decomposes a doubly substochastic
matrix into a convex combination of (sub)permutation matrices. For an N × N switch, the computation
complexity for the Birkhoff-von Neumann decomposition is O(N^4.5) and the number of permutation
matrices produced by the decomposition is O(N^2) (see e.g., [4], [5]). The need for storing the O(N^2)
number of permutation matrices in the Birkhoff-von Neumann switch makes it difficult to scale for a
large N. Even though there are decomposition methods that reduce the number of permutation matrices
(see e.g., [13]), they in general do not have good throughput. For instance, the throughput in [13] is
O(1/ log N), which tends to 0 as N grows large. Another problem of using long term statistics is that the
switch does not adapt well to traffic fluctuation.
It would be ideal if there were a switch architecture that yields good throughput without the need for gathering
traffic information (no communication overhead) or computing connection patterns (no computation
overhead). Recent works on the two-stage switches (see e.g., [6], [7], [12], [8]) shed some light along this
direction. The switch architecture in [6], called the load balanced Birkhoff-von Neumann switch, consists
of two crossbar switch fabrics and parallel buffers between them. In a time slot, both crossbar switch
fabrics set up connection patterns corresponding to permutation matrices that are periodically generated
from a one-cycle permutation matrix. By so doing, the first stage performs load balancing for the incoming
traffic so that the traffic coming into the second stage is uniform. As such, it suffices to use the same
periodic connection patterns as in the first stage to perform switching at the second stage. In the load
balanced Birkhoff-von Neumann switch, there is no need to gather the traffic information. Also, as the
connection patterns are periodically generated, no computation is needed at all. More importantly, it can
be shown to achieve 100% throughput for any non-uniform traffic under a minor technical assumption.
However, the main drawback of the load balanced Birkhoff-von Neumann switch in [6] is that packets
might be out of sequence. To solve the out-of-sequence problem in the two-stage switches, two approaches

have been proposed. The first one uses sophisticated scheduling in the buffers between the two switch
fabrics (see e.g., [7], [12]) and hence it may require complicated hardware implementation and non-
scalable computation overhead. The second one is to use the rate information for controlling the traffic
entering the switch (see e.g., [8]). However, this requires communication overhead and it also does not
adapt well to large traffic fluctuation.
One of the main objectives of this paper is to solve the out-of-sequence problem in the two-stage switch
without non-scalable computation and communication overhead. For this, we propose a switch architecture,
called the mailbox switch. The mailbox switch has the same architecture as the load balanced Birkhoff-
von Neumann switch. Instead of using an arbitrary set of periodic connection patterns generated by a
one-cycle permutation matrix, the key idea in the mailbox switch is to use a set of symmetric connection
patterns. As an input and its corresponding output are usually built on the same line card, the symmetric
connection patterns set up a feedback path from the central buffers (called mailboxes in this paper) to an
input/output port. Since everything inside the switch is pre-determined and periodic, the scheduled packet
departure times can then be fed back to inputs to compute the waiting time for the next packet so that
packets can depart in sequence. Thus, the communication overhead incurred by this scheme is the transmission
of the packet departure time information, a constant amount per input port in every time slot that is
independent of the size of the switch. On the other hand, the computation overhead incurred is the computation
of the waiting time, which also requires only a constant number of operations.
Simplicity comes at the cost of throughput. The throughput of the mailbox switch is no longer 100%.
There are two key factors that limit the throughput of the mailbox switch: (i) the head-of-line (HOL)
blocking problem at the input buffers, and (ii) the stability of the waiting times. Under the usual uniform
traffic model, we provide exact analysis for two special cases. In the first special case, there is only the
HOL blocking problem, and the throughput reduces to that of the classical head-of-line blocking switch in
[11], namely 58%. In the second special case, there is only the stability problem of waiting
times, and we show that the mailbox switch achieves 68% throughput. By balancing these two
constraints, the mailbox switch can achieve more than 75% throughput. These analytical results are also
verified by simulations. By allowing limited resequencing delay, a modified version of the mailbox switch
can achieve more than 95% throughput.
In this paper, we also propose a recursive way to construct the switch fabrics for the set of symmetric
connection patterns. If the number of inputs, N, is a power of 2, we show that the switch fabric for the
mailbox switch can be built with (N/2) log_2 N 2 × 2 switches.

II. THE SWITCH ARCHITECTURE
A. Generic mailbox switch
In this paper, we assume that packets are of the same size. Also, time is slotted and synchronized so that
a packet can be transmitted within a time slot. As in the load balanced Birkhoff-von-Neumann switch, the
N ×N mailbox switch consists of two N ×N crossbar switch fabrics (see Figure 1) and buffers between
the two crossbar switch fabrics. The buffers between the two switch fabrics are called mailboxes. There
are N mailboxes, indexed from 1 to N. Each mailbox contains N bins (indexed from 1 to N), and each
bin contains F cells (indexed from 1 to F). Each cell can store exactly one packet. Cells in the i-th bin of
a mailbox are used for storing packets that are destined for the i-th output port of the second switch. In
addition to these, a First In First Out (FIFO) queue is added in front of each input port of the first stage.
Now we describe how the connection patterns of these two crossbar switch fabrics are set up. In every
time slot, both crossbar switches in Figure 1 have the same connection pattern. During the t-th time slot,
input port i is connected to output port j if

(i + j) mod N = (t + 1) mod N.   (1)

In particular, at t = 1, we have input port 1 connected to output port 1, input port 2 connected to output
port N, ..., and input port N connected to output port 2. Clearly, such connection patterns are periodic
with period N. Moreover, each input port is connected to each of the N output ports exactly once in
every N time slots. Specifically, input port i is connected to output port 1 at time i, output port 2 at time
i + 1, ..., and output port N at time i + N - 1. Also, we note from (1) that such connection patterns are
symmetric, i.e., input port i and output port j are connected if and only if input port j and output port i
are connected. As such, we call a switch fabric that implements the connection patterns in (1) a symmetric
Time Division Multiplexing (TDM) switch. Note that one can solve for j in (1) by the following function:

j = h(i, t) = ((t - i) mod N) + 1.   (2)

Thus, during the t-th time slot, the i-th input port is connected to the h(i, t)-th output port of these two
crossbar switch fabrics.
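As an illustration (ours, not code from the paper), a minimal Python sketch of this symmetric TDM pattern, using the same 1-indexed ports as in (1) and (2), is given below; the assertions check the symmetry and period-N properties claimed above.

```python
# Minimal sketch of the symmetric TDM connection pattern in Eqs. (1)-(2).
# Ports and time slots are 1-indexed, as in the text; N is the switch size.

def h(i, t, N):
    """Output port connected to input port i during the t-th time slot (Eq. (2))."""
    return ((t - i) % N) + 1

if __name__ == "__main__":
    N = 8
    for t in range(1, N + 1):
        pattern = {i: h(i, t, N) for i in range(1, N + 1)}
        # Eq. (1): every connected pair (i, j) satisfies (i + j) mod N == (t + 1) mod N.
        assert all((i + j) % N == (t + 1) % N for i, j in pattern.items())
        # Symmetry: input i connects to output j iff input j connects to output i.
        assert all(pattern[pattern[i]] == i for i in pattern)
    # Periodicity: the pattern at time t + N equals the pattern at time t.
    assert all(h(i, 1, N) == h(i, 1 + N, N) for i in range(1, N + 1))
```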
As input port i of the first switch and output port i of the second switch are on the same line card,
the symmetric property then enables us to establish a bi-directional communication link between a line
card and a mailbox. As we will see later, such a property plays an important role in keeping packets in
sequence.
As the connection patterns in the mailbox switch are a special case of those in the load-balanced Birkhoff-von
Neumann switch with one-stage buffering [6], one might expect that it also approaches 100% throughput
if we use the FIFO policy for each bin and increase the bin size F to ∞. However, doing so brings back
the out-of-sequence problem. Packets that have the same input port at the first switch and

the same output port at the second switch may be routed to different mailboxes and depart in a sequence
that is different from the sequence of their arrivals at the input port of the first switch.
To solve the out-of-sequence problem, one may add a resequencing buffer and adopt a more careful load
balancing mechanism as in the load balanced Birkhoff-von Neumann switch with multi-stage buffering
[7]. However, such an approach requires complicated scheduling and jitter control in order to have a
bounded resequencing delay. Here we take a much simpler approach. The idea is that the packet
departure time is known once the packet is placed in a mailbox, as the connection patterns are deterministic and
periodic. Also, recall that by building an input port and the output port of the same index on a line card,
the symmetric TDM connection patterns provide a bi-directional feedback path between a line card and
a mailbox. Thus, every input port maintains the delay of the last successfully transmitted packet from
this input port to every output port. For input port i to transmit a HOL packet, input port i transmits
the delay information of the last packet destined for the same output port along with the HOL packet.
If the connected mailbox has an empty cell whose corresponding departure time is larger than that of the
previous packet, the HOL packet is placed in that cell and removed from the HOL of input
port i. Furthermore, the departure time of the newly transmitted packet is fed back to
line card i, and the delay information at input port i is updated. Otherwise, the transmission is blocked and
the packet remains at the HOL position of input port i.
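Read literally, this placement rule can be sketched as follows; this is a hypothetical Python fragment of ours, and the data structures and names (cells, departure_time_of, last_departure) are assumptions, not the paper's.

```python
# Hypothetical sketch of the "sending mails" rule described above. cells is the
# bin of the currently connected mailbox for output j, with cells[f] == None
# meaning cell f is empty; departure_time_of(f) gives the (deterministic,
# periodic) departure time of cell f; last_departure[j] is the departure time of
# the last packet this input successfully sent toward output j.

def try_place_hol_packet(cells, departure_time_of, last_departure, j, packet):
    """Place the HOL packet destined for output j into an empty cell whose departure
    time exceeds that of the previously sent packet (here: the earliest such cell).
    Return the scheduled departure time, or None if the transmission is blocked."""
    for f in range(len(cells)):
        if cells[f] is None and departure_time_of(f) > last_departure[j]:
            cells[f] = packet                          # packet leaves the HOL position
            last_departure[j] = departure_time_of(f)   # fed back to the line card
            return last_departure[j]
    return None                                        # blocked; packet stays at the HOL
```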
To be specific, define flow (i, j) as the sequence of packets that arrive at the i-th input port of the first
switch and are destined for the j-th output port of the second switch. Let V_{i,j}(t) be the number of time
slots that a packet of flow (i, j) has to wait in a mailbox for ordered delivery, once it is transmitted, as
the head-of-line (HOL) packet of the FIFO queue at the i-th input port of the first switch, to the j-th bin of
the h(i, t)-th mailbox at time t. Following the terminology in queueing theory, we call V_{i,j}(t) the virtual
waiting time of flow (i, j). Now we describe how the mailbox switch works to keep packets of the same
flow in sequence. At each input port i, we keep the information of V_{i,j}(t) for j = 1, 2, ..., N. Initially, we
set V_{i,j}(0) = 0 for all (i, j). At each time slot t, the following operations are executed.
(iA) Retrieving mails: at time t, the j-th output port of the second switch is connected to the h(j, t)-th
mailbox. The packet in the first cell of the j-th bin is transmitted to the j-th output port. Packets
in cells 2, 3, ..., F of the j-th bin are moved forward to cells 1, 2, ..., F - 1. According to
(1), the j-th output port of the second switch will be connected to the k-th mailbox at time
t + ((k - h(j, t) - 1) mod N) + 1. Hence, the packet in the f-th cell of the j-th bin of the
k-th mailbox at time t will be transmitted to the j-th output port of the second switch at time
t + (f - 1)N + ((k - h(j, t) - 1) mod N) + 1. This means that the packet departure time can
be determined once a packet is placed in a mailbox (a small code sketch of this computation follows these steps).
(iiA) Sending mails: suppose that the HOL packet of the i-th input port of the first switch is from
flow (i, j). Note that the i-th input port of the first switch is connected to the h(i, t)-th mailbox.
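To make the departure-time expression in step (iA) concrete, the following small Python sketch computes it directly; this is our own illustration, the function names are assumptions, and ports, mailboxes, bins, and cells are 1-indexed as in the text.

```python
# Sketch of the departure-time rule in step (iA); names and structure are ours.

def h(j, t, N):
    """Mailbox (equivalently, port) connected to port j during the t-th time slot (Eq. (2))."""
    return ((t - j) % N) + 1

def departure_time(k, j, f, t, N):
    """Time at which the packet sitting in cell f of the j-th bin of the k-th
    mailbox at time t is transmitted to output port j of the second switch."""
    return t + (f - 1) * N + ((k - h(j, t, N) - 1) % N) + 1

if __name__ == "__main__":
    N, t, j = 8, 5, 3
    # The first cell of bin j in the mailbox that output j visits next departs at t + 1.
    assert departure_time(h(j, t + 1, N), j, 1, t, N) == t + 1
    # A packet in cell f waits (f - 1) extra rounds of N time slots.
    assert departure_time(h(j, t + 1, N), j, 2, t, N) == t + 1 + N
```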

Citations

Optimal load-balancing (conference paper): explores whether this particular method of load balancing is optimal in the sense that it achieves the highest throughput for a given capacity of interconnect.
Byte-focal: a practical load balanced switch (conference paper): presents a practical load balanced switch, called the byte-focal switch, which uses packet-by-packet scheduling to significantly improve the delay performance over switches of comparable complexity.
Feedback-based scheduling for load-balanced two-stage switches (journal article): proposes a framework for designing feedback-based scheduling algorithms that solves the packet mis-sequencing problem of a load-balanced switch and shows that the efforts made in load balancing and keeping packets in order can complement each other.
CR switch: a load-balanced switch with contention and reservation (conference and journal versions): proposes a new switch architecture, the contention and reservation (CR) switch, that delivers packets in order and guarantees 100% throughput.
References

Input Versus Output Queueing on a Space-Division Packet Switch (journal article).
R. M. Loynes, The stability of a queue with non-independent inter-arrival and service times (journal article).
High-speed switch scheduling for local-area networks (journal article).
Achieving 100% throughput in an input-queued switch (conference paper).