Multicast Scheduling for Input-Queued Switches

Balaji Prabhakar (BRIMS, Hewlett-Packard Labs, Bristol)
Nick McKeown and Ritesh Ahuja (Dept. of Elec. Engg./Comp. Sc., Stanford University)
balaji@hplb.hpl.hp.com, nickm@ee.stanford.edu, ritesh@cs.stanford.edu
Abstract

This paper presents the design of the scheduler for an M × N input-queued multicast switch. It is assumed that: (i) each input maintains a single queue for arriving multicast cells, and (ii) only the cell at the head of line (HOL) can be observed and scheduled at one time. The scheduler is required to be: (i) work-conserving, which means that no output port may be idle as long as there is an input cell destined to it, and (ii) fair, which means that no input cell may be held at HOL for more than a fixed number of cell times. The aim of our work is to find a work-conserving, fair policy that delivers maximum throughput and minimizes input queue latency, and yet is simple to implement in hardware. When a scheduling policy decides which cells to schedule, contention may require that it leave a residue of cells to be scheduled in the next cell time. The selection of where to place the residue uniquely defines the scheduling policy. Subject to a fairness constraint, we argue that a policy which always concentrates the residue on as few inputs as possible generally outperforms all other policies. We find that there is a tradeoff between concentration of residue (for high throughput), strictness of fairness (to prevent starvation), and implementational simplicity (for the design of high-speed switches). By mapping the general multicast switching problem onto a variation of the popular block-packing game, Tetris, we are able to analyze, in an intuitive and geometric fashion, various scheduling policies which possess these attributes in different proportions. We present a novel scheduling policy, called TATRA, which performs extremely well and is strict in fairness. We also present a simple weight-based algorithm, called WBA, that is simple to implement in hardware, fair, and performs well when compared to a concentrating algorithm.
1 Introduction

Due to an exponential growth in the number of users of the Internet, the demand for network bandwidth has been growing at an enormous rate. As a result, recent years have witnessed an increasing interest in high-speed, cell-based, switched networks such as ATM. In order to build such networks, a high performance switch is required to quickly deliver cells arriving on input links to the desired output links. A switch consists of three parts: (i) input queues to buffer cells arriving on input links, (ii) output queues to buffer the cells going out on output links, and (iii) a switch fabric to transfer cells from the inputs to the desired outputs. The switch fabric operates under a scheduling algorithm which arbitrates among cells from different inputs destined to the same output. A number of approaches have been taken in designing these three parts of a switch [9, 20, 19, 17, 14, 16], each with its own set of advantages and disadvantages.

It is well known that when FIFO queues are used, the throughput of an input-queued switch with unicast traffic can be limited due to HOL blocking [4], [5]. So the standard approach has been to abandon input queueing and instead to use output queueing: by increasing the bandwidth of the fabric, multiple cells can be forwarded at the same time to the same output, and queued there for transmission on the output link. However, this approach requires that the output queues and the internal interconnect have a bandwidth equal to M times (for an M × N switch) the line rate. Since memory bandwidth is not increasing as fast as the demand for network bandwidth, this architecture becomes impractical for very high-speed switches. Moreover, numerous papers have indicated that by using non-FIFO input queues and by using good scheduling policies, much higher throughputs are possible [9, 10, 11, 12, 13, 14, 16, 17]. Therefore, input-queued switches are finding a growing interest in the research and development community.
An increasing proportion of traffic on the Internet is multicast, with users distributing a wide variety of audio and video material. This dramatic change in the use of the Internet has been facilitated by the MBONE [1, 2, 3]. A number of different architectures and implementations have been proposed for multicast switches [6, 7, 8]. However, since we are interested in the design of very high-speed ATM switches, we restrict our attention to input-queued architectures. This input-queued switch should schedule multicast cells so as to maximize throughput and minimize latency. It is important that it be simple to implement in hardware. For example, a switch running at a line rate of 2.4 Gb/s (OC-48c) must make 6 million scheduling decisions every second.
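The scheduling rate quoted above follows directly from the cell size. A quick sanity check in Python, assuming the standard 53-byte ATM cell (the cell size is not stated in this excerpt):

```python
ATM_CELL_BITS = 53 * 8  # standard ATM cell size; assumption, not stated above

def decisions_per_second(line_rate_bps):
    """One scheduling decision is needed per cell time on the line."""
    return line_rate_bps / ATM_CELL_BITS

oc48c = decisions_per_second(2.4e9)  # about 5.66 million, i.e. roughly 6 million
```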
In this paper we consider the performance of different multicast scheduling policies for input-queued switches. Several researchers have studied the Random scheduling policy [9, 18, 21, 22], in which each output selects an input at random from among those subscribing to it. But, as may be expected, we find that the Random scheduling policy is not the optimum policy. We introduce three new scheduling algorithms: the Concentrate algorithm, TATRA, and WBA (a weight-based algorithm). We show that the Concentrate algorithm leads to high throughput and low delay. It achieves this by concentrating the cells that it leaves behind on as few inputs as possible. Unfortunately, Concentrate has two drawbacks that make it unsuitable for use in an ATM switch: it can starve input queues indefinitely, and it is difficult to implement in hardware. But Concentrate serves as a useful upper bound on throughput performance against which we can compare heuristic approximations. One such approximation, TATRA, is motivated by Tetris, the popular block-packing game. TATRA avoids starvation by using a strict definition of fairness, while comparing well to the performance of Concentrate. The second algorithm, WBA, is designed to be very simple to implement in hardware, and allows the designer to balance the tradeoff between fairness and throughput.
2 Background

2.1 Assumed Architecture

It is assumed that the switch has M input and N output ports and that each input maintains a single FIFO queue for arriving multicast cells. The input cells are assumed to contain a vector indicating which outputs the cell is to be sent to. For an M × N switch, the destination vector of a multicast cell can be any one of 2^N − 1 possible vectors. We assume that each input has a single queue and that the scheduler only observes the first cell in the queue.

Figure 1: 2 × N multicast crossbar switch with a single FIFO queue at each input.
As a simple example of our architecture, consider the 2-input, N-output switch shown in Figure 1. Queue Q_A has an input cell destined for outputs {1, 2, 3, 4} and queue Q_B has an input cell destined for outputs {3, 4, 5, 6}. The set of outputs to which an input cell wishes to be copied will be referred to as the fanout of that input cell.¹ For clarity, we distinguish an arriving input cell from its corresponding output cells. In the figure, the single input cell at the head of queue Q_A will generate four output cells.
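A destination vector of this kind is naturally held as an N-bit mask, one bit per output port. A minimal sketch of the Figure 1 example (the helper name is ours, not the paper's):

```python
N = 6  # number of outputs in the Figure 1 example

def fanout_mask(outputs):
    """Encode a set of output ports (numbered from 1) as an N-bit destination vector."""
    mask = 0
    for port in outputs:
        mask |= 1 << (port - 1)
    return mask

q_a = fanout_mask({1, 2, 3, 4})  # HOL cell of queue Q_A
q_b = fanout_mask({3, 4, 5, 6})  # HOL cell of queue Q_B

overlap = q_a & q_b              # the two cells contend only for outputs 3 and 4
assert 0 < q_a < 2 ** N          # one of the 2^N - 1 possible destination vectors
```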
We assume that an input cell must wait in line until all of the cells ahead of it have departed. A simple way to service the input queues is to replicate the input cell over multiple cell times, generating one output cell per cell time. However, this approach has two disadvantages. First, each input cell must be copied multiple times, increasing the required memory bandwidth. Second, input cells contend for access to the switch multiple times, reducing the bandwidth available to other traffic at the same input. Higher throughput can be attained if we take advantage of the natural multicast properties of a crossbar switch. So instead, we assume that one input cell can be copied to any number of outputs in a single cell time for which there is no conflict.

There are two different service disciplines that can be used. Following the description in [18], the first is no fanout-splitting, in which all of the copies of a cell must be sent in the same cell time. If any of the output cells loses contention for an output port, none of the output cells are transmitted and the cell must try again in the next cell time. The second discipline is fanout-splitting, in which output cells may be delivered to output ports over any number of cell times. Only those output cells that are unsuccessful in one cell time continue to contend for output ports in the next cell time.²

Because fanout-splitting is work conserving, it enables a higher switch throughput [21] for little increase in implementation complexity. For example, Figure 2 compares the average cell latency (via simulations) with and without fanout-splitting of the Random scheduling policy for an 8 × 8 switch under uniform loading on all inputs and an average fanout of four. The figure demonstrates that fanout-splitting can lead to approximately 40% higher throughput.
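The two disciplines can be sketched as a single cell-time step. The dictionary layout and the per-output grants map below are our own assumptions for illustration, not the paper's notation:

```python
def serve_one_cell_time(hol_fanouts, grants, fanout_splitting=True):
    """
    hol_fanouts: {input: set of outputs requested by its HOL cell}
    grants:      {output: winning input}, one winner per output,
                 as chosen by some scheduling policy.
    Returns the fanouts left at HOL for the next cell time.
    """
    remaining = {}
    for inp, fanout in hol_fanouts.items():
        won = {out for out in fanout if grants.get(out) == inp}
        if fanout_splitting:
            left = fanout - won                     # successful copies depart
        else:
            left = set() if won == fanout else fanout  # all or nothing
        if left:
            remaining[inp] = left
    return remaining
```

On the Figure 1 example, if outputs 3 and 4 both grant to Q_A, fanout-splitting leaves only {3, 4} on Q_B, while no fanout-splitting holds Q_B's entire cell back for another attempt.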
¹ We use the term fanout throughout this paper to denote both the constitution and the cardinality of the input vector. For example, in Figure 1, the input cell at the head of each queue is said to have a fanout of four.

² It might appear that fanout-splitting is much more difficult to implement than no fanout-splitting. However, this is not the case. In order to support fanout-splitting, we need one extra signal from the scheduler to inform each input port when a cell at its HOL is completely served.
Figure 2: Average cell latency (in number of cell times) as a function of offered load for an 8 × 8 switch (with uniform input traffic and average fanout of four). The graph compares the Random scheduling policy with and without fanout-splitting.
2.2 Denition of Terms
Here we make precise some of the terminology used throughout the paper. Some terms
have already b een lo osely dened, but a few new ones are intro duced.
Denition 1 (Residue):
The
residue
is the set of al l output cel ls that lose contention for
output ports and remain at the
HOL
of the input queues at the end of each cel l time.
It is important to note that given a set of requests, every work-conserving p olicy will
leave the same residue. However, it is up to the p olicy to determine how the residue is
distributed over the inputs.
Denition 2 (Concentrating Policy):
A multicast scheduling policy is said to be
con-
centrating
if, at the end of every cel l time, it leaves the residue on the smal lest possible
number of input ports.
Denition 3 (Distributing Policy):
A multicast scheduling policy is said to be
distribut-
ing
if, at the end of every cel l time, it leaves the residue on the largest possible number of
input ports.
Denition 4 (A Non-concentrating Policy):
A multicast scheduling policy is said to
be
non-concentrating
if it does not always concentrate the residue.
Denition 5 (Fairness Constraint):
A multicast scheduling policy is said to be
fair
if
each input cel l is held at the
HOL
for no more than a xed number of cel l times (this number
4

can be dierent for dierent inputs). This fairness constraint can also be thought of as a
starvation constraint.
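The observation that every work-conserving policy leaves the same residue can be checked mechanically: an output requested by k HOL cells serves one and leaves k − 1 behind, regardless of which inputs win. A sketch under our own (hypothetical) data layout; the policy only chooses where the leftover cells sit, not how many there are:

```python
from collections import Counter

def residue_size_per_output(hol_fanouts):
    """
    hol_fanouts: {input: set of outputs requested by its HOL cell}.
    Under any work-conserving policy each requested output serves exactly one
    cell, so the residue holds k - 1 output cells for an output with k requests.
    """
    demand = Counter(out for fanout in hol_fanouts.values() for out in fanout)
    return {out: k - 1 for out, k in demand.items() if k > 1}
```

For the Figure 1 example this gives one leftover output cell each for outputs 3 and 4; a concentrating policy leaves both on one input, a distributing policy leaves one on each.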
2.3 Requirements of an Algorithm

Before describing the details of various scheduling algorithms, we first look at some requirements.

1. Work conservation: The algorithm must be work conserving, which means that no output port may be idle as long as it can serve some input cell destined to it. This property is necessary for an algorithm to provide maximum throughput.

2. Fairness: The algorithm must meet the fairness constraint defined above, i.e., it must not lead to the starvation of any input.
3 The Heuristic of Residue Concentration

In this section, we describe two algorithms, the Concentrate algorithm and the Distribute algorithm, which represent the two extremes of residue placement. We present an intuitive explanation for why it is best to concentrate residue in order to achieve a high throughput.

Algorithm: Concentrate. Concentrate always concentrates the residue onto as few inputs as possible. This is achieved by performing the following steps at the beginning of each cell time.

1. Determine the residue.
2. Find the input with the most in common with the residue. If there is a choice of inputs, select the one with the input cell that has been at the HOL for the shortest time. This ensures some fairness, though not in the sense of the definition in Section 2.2 (see remark below).
3. Concentrate as much residue onto this input as possible.
4. Remove the input from further consideration.
5. Repeat steps (2)-(4) until no residue remains.

Remark: Since an input cell can remain at HOL indefinitely, this algorithm does not meet the fairness constraint. The purpose of this algorithm is to provide us with a basis for comparing the performance of other algorithms, since it achieves the highest throughput. This is demonstrated by our simulation results in Section 7.
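The steps above can be sketched as follows. The data layout (fanout sets, HOL ages) is our assumption; the paper describes Concentrate only in prose:

```python
def concentrate(hol_fanouts, residue, hol_age):
    """Place the residue on as few inputs as possible.

    hol_fanouts: {input: set of outputs requested by its HOL cell}
    residue:     set of outputs whose leftover cell must stay at some HOL
    hol_age:     {input: cell times its HOL cell has waited}
    Returns {input: residue outputs left on that input}.
    """
    placement = {}
    candidates = dict(hol_fanouts)
    remaining = set(residue)
    while remaining and candidates:
        # Step 2: input with the most in common with the residue; ties go
        # to the input whose HOL cell has waited the shortest time.
        inp = max(candidates,
                  key=lambda i: (len(candidates[i] & remaining), -hol_age[i]))
        overlap = candidates[inp] & remaining
        if not overlap:
            break                # residue not requested by any remaining input
        placement[inp] = overlap  # Step 3: concentrate residue onto this input
        remaining -= overlap
        del candidates[inp]       # Step 4: remove from further consideration
    return placement              # Step 5: loop until no residue remains
```

On the Figure 1 example with residue {3, 4}, both inputs overlap the residue fully, so the tie-break picks the newer HOL cell and the entire residue lands on one input.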
Algorithm: Distribute. Distribute always distributes the residue onto as many inputs as possible.

1. Determine the residue.
2. Find the input with at least one cell but otherwise the least in common with the residue. If there is a choice of inputs, select the one with the input cell that has been at the HOL for the shortest time.
3. Place one output cell of residue onto that input.
4. Remove the input from further consideration.
5. Repeat steps (2)-(4) until no inputs remain.
6. If residue remains, consider all the inputs again and start at step (2).
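A corresponding sketch of Distribute, under the same assumed data layout as the Concentrate sketch; on the Figure 1 example it leaves one residue cell on each input rather than both on one:

```python
def distribute(hol_fanouts, residue, hol_age):
    """Place the residue on as many inputs as possible.

    Same hypothetical layout as before: fanout sets per input, a set of
    residue outputs, and per-input HOL ages for tie-breaking.
    """
    placement = {i: set() for i in hol_fanouts}
    remaining = set(residue)
    while remaining:                       # Step 6: restart until none remains
        considered = {i for i in hol_fanouts if hol_fanouts[i] & remaining}
        if not considered:
            break                          # residue no input can hold
        while considered:
            # Step 2: least in common with the residue; ties go to the
            # input whose HOL cell has waited the shortest time.
            inp = min(considered,
                      key=lambda i: (len(hol_fanouts[i] & remaining), hol_age[i]))
            overlap = hol_fanouts[inp] & remaining
            if overlap:
                out = min(overlap)         # Step 3: one output cell of residue
                placement[inp].add(out)
                remaining.discard(out)
            considered.discard(inp)        # Step 4: remove from consideration
    return {i: p for i, p in placement.items() if p}
```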

References

- Input Versus Output Queueing on a Space-Division Packet Switch (journal article)
- Multicast routing in datagram internetworks and extended LANs (journal article)
- High-speed switch scheduling for local-area networks (journal article)
- MBONE: the multicast backbone (journal article)
- Design of a broadcast packet switching network (journal article)