
Mailbox Switch: A Scalable Two-stage Switch
Architecture for Conflict Resolution of Ordered
Packets
Cheng-Shang Chang, Duan-Shin Lee, Ying-Ju Shih, and Chao-Lin Yu
Institute of Communications Engineering
National Tsing Hua University
Hsinchu 300, Taiwan, R.O.C.
Email: cschang@ee.nthu.edu.tw, lds@cs.nthu.edu.tw, yjshih@gibbs.ee.nthu.edu.tw, clyu@gibbs.ee.nthu.edu.tw
Abstract
Traditionally, conflict resolution in an input-buffered switch is solved by finding a matching between inputs
and outputs per time slot, which incurs unscalable computation and communication overheads. The main objective
of this paper is to propose a scalable solution, called the mailbox switch, that solves the out-of-sequence problem
in the two-stage switch architecture. The key idea of the mailbox switch is to use a set of symmetric connection
patterns to create a feedback path for packet departure times. With the information of packet departure times, the
mailbox switch can schedule packets so that they depart in the order of their arrivals. Despite the simplicity of the
mailbox switch, we show via both the theoretical models and simulations that the throughput of the mailbox switch
can be as high as 75%. With limited resequencing delay, a modified version of the mailbox switch achieves 95%
throughput. We also propose a recursive way to construct the switch fabrics for the set of symmetric connection
patterns. If the number of inputs, N, is a power of 2, we show that the switch fabric for the mailbox switch can
be built with (N/2) log_2 N 2 × 2 switches.
Index Terms
Birkhoff-von Neumann switches, input-buffered switches, conflict resolution, two-stage switches
I. INTRODUCTION
As the parallel input buffers of input-buffered switches provide the needed speedup in memory access, input-buffered
switches are known to be more scalable than shared-memory switches. However, synchronized parallel
transmissions among parallel input buffers in every time slot require careful coordination to avoid conflicts.
Thus, finding a scalable method (and architecture) for conflict resolution becomes the fundamental design
problem of input-buffered switches.
Traditionally, conflict resolution is solved by finding a matching between inputs and outputs per time
slot (see e.g., [11], [1], [25], [17], [18], [19], [9], [15]). Two steps are needed for finding a matching.
(i) Communication overhead: one has to gather the information of the buffers at the inputs.
(ii) Computation overhead: based on the gathered information, one then applies a certain algorithm
to find a matching.
This research is supported in part by the National Science Council, Taiwan, R.O.C., under Contract NSC-91-2219-E007-003, and the
program for promoting academic excellence of universities NSC 93-2752-E007-002-PAE. A conference version of this paper was presented
in IEEE INFOCOM 2004.

Most of the works in the literature pay more attention to reducing the computation overhead by finding
scalable matching algorithms, e.g., wavefront arbitration in [25], PIM in [1], SLIP in [17], and DRRM in
[15]. However, in our view, it is the communication overhead that makes matching per time slot difficult
to scale. To see this, suppose that there are N inputs/outputs and each input implements N virtual output
queues (VOQ). If we use a single bit to indicate whether a VOQ is empty, then we have to transmit N
bits from each input (to a central arbiter or to an output) in every time slot. For instance, transmitting
such N-bit information in PIM and SLIP is implemented by an independent circuit that sends out parallel
requests. Suppose that the packet size is chosen to be 64 bytes, i.e., 512 bits. Then in a switch with more than
512 inputs/outputs, the per-slot communication overhead exceeds the transmission of the data itself.
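As a quick worked check of this crossover (our arithmetic, shown only for illustration), one bit of VOQ state per output means each input signals N bits per time slot, which matches the size of a 64-byte packet exactly at N = 512:

```latex
% Crossover between per-slot VOQ signaling and one packet payload (illustrative).
\[
  \underbrace{N \ \text{bits/slot}}_{\text{VOQ occupancy bits per input}}
  \;\ge\;
  \underbrace{64 \times 8 = 512 \ \text{bits/slot}}_{\text{one 64-byte packet}}
  \quad\Longleftrightarrow\quad
  N \ge 512 .
\]
```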
To reduce the communication overhead, one approach is to gather the long term statistics of the
VOQs, e.g., the average arrival rates, and then use such information to find a sequence of pre-determined
connection patterns (see e.g., [1], [14], [10], [4], [5], [2]). Most of the works along this line are based
on the well-known Birkhoff-von Neumann algorithm [3], [27] that decomposes a doubly substochastic
matrix into a convex combination of (sub)permutation matrices. For an N × N switch, the computation
complexity for the Birkhoff-von Neumann decomposition is O(N^4.5) and the number of permutation
matrices produced by the decomposition is O(N^2) (see e.g., [4], [5]). The need for storing the O(N^2)
number of permutation matrices in the Birkhoff-von Neumann switch makes it difficult to scale for a
large N. Even though there are decomposition methods that reduce the number of permutation matrices
(see e.g., [13]), they in general do not have good throughput. For instance, the throughput in [13] is
O(1/ log N), which tends to 0 as N grows large. Another problem of using long term statistics is that the
switch does not adapt well to traffic fluctuation.
It would be ideal if there were a switch architecture that yields good throughput without the need for gathering
traffic information (no communication overhead) or computing connection patterns (no computation
overhead). Recent works on the two-stage switches (see e.g., [6], [7], [12], [8]) shed some light along this
direction. The switch architecture in [6], called the load balanced Birkhoff-von Neumann switch, consists
of two crossbar switch fabrics and parallel buffers between them. In a time slot, both crossbar switch
fabrics set up connection patterns corresponding to permutation matrices that are periodically generated
from a one-cycle permutation matrix. By so doing, the first stage performs load balancing for the incoming
traffic so that the traffic coming into the second stage is uniform. As such, it suffices to use the same
periodic connection patterns as in the first stage to perform switching at the second stage. In the load
balanced Birkhoff-von Neumann switch, there is no need to gather the traffic information. Also, as the
connection patterns are periodically generated, no computation is needed at all. More importantly, it can
be shown to achieve 100% throughput for any non-uniform traffic under a minor technical assumption.
However, the main drawback of the load balanced Birkhoff-von Neumann switch in [6] is that packets
might be out of sequence. To solve the out-of-sequence problem in the two-stage switches, two approaches

have been proposed. The first one uses sophisticated scheduling in the buffers between the two switch
fabrics (see e.g., [7], [12]) and hence it may require complicated hardware implementation and non-
scalable computation overhead. The second one is to use the rate information for controlling the traffic
entering the switch (see e.g., [8]). However, this requires communication overhead and it also does not
adapt well to large traffic fluctuation.
One of the main objectives of this paper is to solve the out-of-sequence problem in the two-stage switch
without non-scalable computation and communication overhead. For this, we propose a switch architecture,
called the mailbox switch. The mailbox switch has the same architecture as the load balanced Birkhoff-
von Neumann switch. Instead of using an arbitrary set of periodic connection patterns generated by a
one-cycle permutation matrix, the key idea in the mailbox switch is to use a set of symmetric connection
patterns. As an input and its corresponding output are usually built on the same line card, the symmetric
connection patterns set up a feedback path from the central buffers (called mailboxes in this paper) to an
input/output port. Since everything inside the switch is pre-determined and periodic, the scheduled packet
departure times can then be fed back to inputs to compute the waiting time for the next packet so that
packets can depart in sequence. Thus, the communication overhead incurred by this scheme is the transmission
of the packet departure time information, a constant amount per input port in every time slot that is
independent of the size of the switch. On the other hand, the computation overhead incurred is the computation
of the waiting time, which also requires only a constant number of operations.
Simplicity comes at the cost of throughput. The throughput of the mailbox switch is no longer 100%.
There are two key factors that limit the throughput of the mailbox switch: (i) the head-of-line (HOL)
blocking problem at the input buffers, and (ii) the stability of the waiting times. Under the usual uniform
traffic model, we provide exact analysis for two special cases. In the first special case, there is only the
HOL blocking problem, and the throughput reduces to that of the classical head-of-line blocking switch in
[11], namely 58%. In the second special case, there is only the stability problem of waiting
times, and we show that the mailbox switch achieves 68% throughput. By balancing these two
constraints, the mailbox switch can achieve more than 75% throughput. These analytical results are also
verified by simulations. By allowing limited resequencing delay, a modified version of the mailbox switch
can achieve more than 95% throughput.
In this paper, we also propose a recursive way to construct the switch fabrics for the set of symmetric
connection patterns. If the number of inputs, N, is a power of 2, we show that the switch fabric for the
mailbox switch can be built with (N/2) log_2 N 2 × 2 switches.

II. THE SWITCH ARCHITECTURE
A. Generic mailbox switch
In this paper, we assume that packets are of the same size. Also, time is slotted and synchronized so that
a packet can be transmitted within a time slot. As in the load balanced Birkhoff-von-Neumann switch, the
N ×N mailbox switch consists of two N ×N crossbar switch fabrics (see Figure 1) and buffers between
the two crossbar switch fabrics. The buffers between the two switch fabrics are called mailboxes. There
are N mailboxes, indexed from 1 to N. Each mailbox contains N bins (indexed from 1 to N), and each
bin contains F cells (indexed from 1 to F). Each cell can store exactly one packet. Cells in the i-th bin of
a mailbox are used for storing packets that are destined for the i-th output port of the second switch. In
addition to these, a First In First Out (FIFO) queue is added in front of each input port of the first stage.
Now we describe how the connection patterns of these two crossbar switch fabrics are set up. In every
time slot, both crossbar switches in Figure 1 have the same connection pattern. During the t-th time slot,
input port i is connected to output port j if

(i + j) mod N = (t + 1) mod N.   (1)

In particular, at t = 1, we have input port 1 connected to output port 1, input port 2 connected to output
port N, ..., and input port N connected to output port 2. Clearly, such connection patterns are periodic
with period N. Moreover, each input port is connected to each of the N output ports exactly once in
every N time slots. Specifically, input port i is connected to output port 1 at time i, output port 2 at time
i + 1, ..., and output port N at time i + N - 1. Also, we note from (1) that such connection patterns are
symmetric, i.e., input port i and output port j are connected if and only if input port j and output port i
are connected. As such, we call a switch fabric that implements the connection patterns in (1) a symmetric
Time Division Multiplexing (TDM) switch. Note that one can solve for j in (1) by the following function:

j = h(i, t) = ((t - i) mod N) + 1.   (2)

Thus, during the t-th time slot, the i-th input port is connected to the h(i, t)-th output port of these two
crossbar switch fabrics.
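As an illustration (ours, not code from the paper), a minimal Python sketch of this symmetric TDM pattern, using the same 1-indexed ports as in (1) and (2), is given below; the assertions check the symmetry and period-N properties claimed above.

```python
# Minimal sketch of the symmetric TDM connection pattern in Eqs. (1)-(2).
# Ports and time slots are 1-indexed, as in the text; N is the switch size.

def h(i, t, N):
    """Output port connected to input port i during the t-th time slot (Eq. (2))."""
    return ((t - i) % N) + 1

if __name__ == "__main__":
    N = 8
    for t in range(1, N + 1):
        pattern = {i: h(i, t, N) for i in range(1, N + 1)}
        # Eq. (1): every connected pair (i, j) satisfies (i + j) mod N == (t + 1) mod N.
        assert all((i + j) % N == (t + 1) % N for i, j in pattern.items())
        # Symmetry: input i connects to output j iff input j connects to output i.
        assert all(pattern[pattern[i]] == i for i in pattern)
    # Periodicity: the pattern at time t + N equals the pattern at time t.
    assert all(h(i, 1, N) == h(i, 1 + N, N) for i in range(1, N + 1))
```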
As input port i of the first switch and output port i of the second switch are on the same line card,
the symmetric property then enables us to establish a bi-directional communication link between a line
card and a mailbox. As we will see later, such a property plays an important role in keeping packets in
sequence.
As the connection patterns in the mailbox switch are a special case of those in the load-balanced Birkhoff-von
Neumann switch with one-stage buffering [6], one might expect that it also approaches 100% throughput
if we use the FIFO policy for each bin and increase the bin size F to ∞. However, doing so brings back
the out-of-sequence problem. Packets that have the same input port at the first switch and

the same output port at the second switch may be routed to different mailboxes and depart in a sequence
that is different from the sequence of their arrivals at the input port of the first switch.
To solve the out-of-sequence problem, one may add a resequencing buffer and adopt a more careful load
balancing mechanism as in the load balanced Birkhoff-von Neumann switch with multi-stage buffering
[7]. However, such an approach requires complicated scheduling and jitter control in order to have a
bounded resequencing delay. Here we take a much simpler approach. The idea is that the packet
departure time is known once the packet is placed in a mailbox, as the connection patterns are deterministic and
periodic. Also, recall that by building an input port and the output port of the same index on a line card,
the symmetric TDM connection patterns provide a bi-directional feedback path between a line card and
a mailbox. Thus, every input port maintains the delay of the last successfully transmitted packet from
this input port to every output port. For input port i to transmit a HOL packet, input port i transmits
the delay information of the last packet destined for the same output port along with the HOL packet.
If the connected mailbox has an empty cell whose corresponding departure time is larger than that of the
previous packet, the HOL packet is placed in that cell and removed from the HOL of input
port i. Furthermore, the departure time of the newly transmitted packet is fed back to
line card i, and the delay information at input port i is updated. Otherwise, the transmission is blocked and
the packet remains at the HOL position of input port i.
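Read literally, this placement rule can be sketched as follows; this is a hypothetical Python fragment of ours, and the data structures and names (cells, departure_time_of, last_departure) are assumptions, not the paper's.

```python
# Hypothetical sketch of the "sending mails" rule described above. cells is the
# bin of the currently connected mailbox for output j, with cells[f] == None
# meaning cell f is empty; departure_time_of(f) gives the (deterministic,
# periodic) departure time of cell f; last_departure[j] is the departure time of
# the last packet this input successfully sent toward output j.

def try_place_hol_packet(cells, departure_time_of, last_departure, j, packet):
    """Place the HOL packet destined for output j into an empty cell whose departure
    time exceeds that of the previously sent packet (here: the earliest such cell).
    Return the scheduled departure time, or None if the transmission is blocked."""
    for f in range(len(cells)):
        if cells[f] is None and departure_time_of(f) > last_departure[j]:
            cells[f] = packet                          # packet leaves the HOL position
            last_departure[j] = departure_time_of(f)   # fed back to the line card
            return last_departure[j]
    return None                                        # blocked; packet stays at the HOL
```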
To be specific, define flow (i, j) as the sequence of packets that arrive at the i-th input port of the first
switch and are destined for the j-th output port of the second switch. Let V_{i,j}(t) be the number of time
slots that a packet of flow (i, j) has to wait in a mailbox for ordered delivery, once it is transmitted, as
the head-of-line (HOL) packet of the FIFO queue at the i-th input port of the first switch, to the j-th bin of
the h(i, t)-th mailbox at time t. Following the terminology in queueing theory, we call V_{i,j}(t) the virtual
waiting time of flow (i, j). Now we describe how the mailbox switch works to keep packets of the same
flow in sequence. At each input port i, we keep the information of V_{i,j}(t) for j = 1, 2, ..., N. Initially, we
set V_{i,j}(0) = 0 for all (i, j). At each time slot t, the following operations are executed.
(iA) Retrieving mails: at time t, the j-th output port of the second switch is connected to the h(j, t)-th
mailbox. The packet in the first cell of the j-th bin is transmitted to the j-th output port. Packets
in cells 2, 3, ..., F of the j-th bin are moved forward to cells 1, 2, ..., F - 1. According to
(1), the j-th output port of the second switch will be connected to the k-th mailbox at time
t + ((k - h(j, t) - 1) mod N) + 1. Hence, the packet in the f-th cell of the j-th bin of the
k-th mailbox at time t will be transmitted to the j-th output port of the second switch at time
t + (f - 1)N + ((k - h(j, t) - 1) mod N) + 1. This means that the packet departure time can
be determined once a packet is placed in a mailbox (a small code sketch of this computation follows these steps).
(iiA) Sending mails: suppose that the HOL packet of the i-th input port of the first switch is from
flow (i, j). Note that the i-th input port of the first switch is connected to the h(i, t)-th mailbox.
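To make the departure-time expression in step (iA) concrete, the following small Python sketch computes it directly; this is our own illustration, the function names are assumptions, and ports, mailboxes, bins, and cells are 1-indexed as in the text.

```python
# Sketch of the departure-time rule in step (iA); names and structure are ours.

def h(j, t, N):
    """Mailbox (equivalently, port) connected to port j during the t-th time slot (Eq. (2))."""
    return ((t - j) % N) + 1

def departure_time(k, j, f, t, N):
    """Time at which the packet sitting in cell f of the j-th bin of the k-th
    mailbox at time t is transmitted to output port j of the second switch."""
    return t + (f - 1) * N + ((k - h(j, t, N) - 1) % N) + 1

if __name__ == "__main__":
    N, t, j = 8, 5, 3
    # The first cell of bin j in the mailbox that output j visits next departs at t + 1.
    assert departure_time(h(j, t + 1, N), j, 1, t, N) == t + 1
    # A packet in cell f waits (f - 1) extra rounds of N time slots.
    assert departure_time(h(j, t + 1, N), j, 2, t, N) == t + 1 + N
```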

Citations

Optimal load-balancing (conference paper): explores whether this particular method of load balancing is optimal in the sense that it achieves the highest throughput for a given capacity of interconnect.
Byte-focal: a practical load balanced switch (conference paper): presents a practical load balanced switch, called the byte-focal switch, which uses packet-by-packet scheduling to significantly improve the delay performance over switches of comparable complexity.
Feedback-based scheduling for load-balanced two-stage switches (journal article): proposes a framework for designing feedback-based scheduling algorithms that solves the packet mis-sequencing problem of a load-balanced switch and shows that the efforts made in load balancing and keeping packets in order can complement each other.
CR switch: a load-balanced switch with contention and reservation (conference and journal versions): proposes a new switch architecture, the contention and reservation (CR) switch, that delivers packets in order and guarantees 100% throughput.
References

Input Versus Output Queueing on a Space-Division Packet Switch (journal article).
R. M. Loynes, The stability of a queue with non-independent inter-arrival and service times (journal article).
High-speed switch scheduling for local-area networks (journal article).
Achieving 100% throughput in an input-queued switch (conference paper).