Multicast Scheduling for Input-Queued Switches

Balaji Prabhakar (BRIMS, Hewlett-Packard Labs, Bristol)
Nick McKeown and Ritesh Ahuja (Dept. of Elec. Engg./Comp. Sc., Stanford University)
balaji@hplb.hpl.hp.com, nickm@ee.stanford.edu, ritesh@cs.stanford.edu
Abstract

This paper presents the design of the scheduler for an M × N input-queued multicast switch. It is assumed that: (i) each input maintains a single queue for arriving multicast cells, and (ii) only the cell at the head of line (HOL) can be observed and scheduled at one time. The scheduler is required to be: (i) work-conserving, which means that no output port may be idle as long as there is an input cell destined to it, and (ii) fair, which means that no input cell may be held at HOL for more than a fixed number of cell times. The aim of our work is to find a work-conserving, fair policy that delivers maximum throughput and minimizes input queue latency, and yet is simple to implement in hardware. When a scheduling policy decides which cells to schedule, contention may require that it leave a residue of cells to be scheduled in the next cell time. The selection of where to place the residue uniquely defines the scheduling policy. Subject to a fairness constraint, we argue that a policy which always concentrates the residue on as few inputs as possible generally outperforms all other policies. We find that there is a tradeoff between concentration of residue (for high throughput), strictness of fairness (to prevent starvation), and implementational simplicity (for the design of high-speed switches). By mapping the general multicast switching problem onto a variation of the popular block-packing game, Tetris, we are able to analyze, in an intuitive and geometric fashion, various scheduling policies which possess these attributes in different proportions. We present a novel scheduling policy, called TATRA, which performs extremely well and is strict in fairness. We also present a simple weight-based algorithm, called WBA, that is simple to implement in hardware, fair, and performs well when compared to a concentrating algorithm.
1 Introduction

Due to an exponential growth in the number of users of the Internet, the demand for network bandwidth has been growing at an enormous rate. As a result, recent years have witnessed an increasing interest in high-speed, cell-based, switched networks such as ATM. In order to build such networks, a high performance switch is required to quickly deliver cells arriving on input links to the desired output links. A switch consists of three parts: (i) input queues to buffer cells arriving on input links, (ii) output queues to buffer the cells going out on output links, and (iii) a switch fabric to transfer cells from the inputs to the desired outputs. The switch fabric operates under a scheduling algorithm which arbitrates among cells from different inputs destined to the same output. A number of approaches have been taken in designing these three parts of a switch [9, 20, 19, 17, 14, 16], each with its own set of advantages and disadvantages.

It is well known that when FIFO queues are used, the throughput of an input-queued switch with unicast traffic can be limited due to HOL blocking [4], [5]. So the standard approach has been to abandon input queueing and instead to use output queueing: by increasing the bandwidth of the fabric, multiple cells can be forwarded at the same time to the same output, and queued there for transmission on the output link. However, this approach requires that the output queues and the internal interconnect have a bandwidth equal to M times (for an M × N switch) the line rate. Since memory bandwidth is not increasing as fast as the demand for network bandwidth, this architecture becomes impractical for very high-speed switches. Moreover, numerous papers have indicated that by using non-FIFO input queues and by using good scheduling policies, much higher throughputs are possible [9, 10, 11, 12, 13, 14, 16, 17]. Therefore, input-queued switches are finding a growing interest in the research and development community.
An increasing proportion of traffic on the Internet is multicast, with users distributing a wide variety of audio and video material. This dramatic change in the use of the Internet has been facilitated by the MBONE [1, 2, 3]. A number of different architectures and implementations have been proposed for multicast switches [6, 7, 8]. However, since we are interested in the design of very high-speed ATM switches, we restrict our attention to input-queued architectures. This input-queued switch should schedule multicast cells so as to maximize throughput and minimize latency. It is important that it be simple to implement in hardware. For example, a switch running at a line rate of 2.4 Gb/s (OC-48c) must make 6 million scheduling decisions every second.
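The scheduling rate quoted above follows directly from the cell size. A quick sanity check in Python, assuming the standard 53-byte ATM cell (the cell size is not stated in this excerpt):

```python
ATM_CELL_BITS = 53 * 8  # standard ATM cell size; assumption, not stated above

def decisions_per_second(line_rate_bps):
    """One scheduling decision is needed per cell time on the line."""
    return line_rate_bps / ATM_CELL_BITS

oc48c = decisions_per_second(2.4e9)  # about 5.66 million, i.e. roughly 6 million
```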
In this paper we consider the performance of different multicast scheduling policies for input-queued switches. Several researchers have studied the Random scheduling policy [9, 18, 21, 22], in which each output selects an input at random from among those subscribing to it. But, as may be expected, we find that the Random scheduling policy is not the optimum policy. We introduce three new scheduling algorithms: the Concentrate algorithm, TATRA, and WBA (a weight-based algorithm). We show that the Concentrate algorithm leads to high throughput and low delay. It achieves this by concentrating the cells that it leaves behind on as few inputs as possible. Unfortunately, Concentrate has two drawbacks that make it unsuitable for use in an ATM switch: it can starve input queues indefinitely, and it is difficult to implement in hardware. But Concentrate serves as a useful upper bound on throughput performance against which we can compare heuristic approximations. One such approximation, TATRA, is motivated by Tetris, the popular block-packing game. TATRA avoids starvation by using a strict definition of fairness, while comparing well to the performance of Concentrate. The second algorithm, WBA, is designed to be very simple to implement in hardware, and allows the designer to balance the tradeoff between fairness and throughput.
2 Background

2.1 Assumed Architecture

It is assumed that the switch has M input and N output ports and that each input maintains a single FIFO queue for arriving multicast cells. The input cells are assumed to contain a vector indicating which outputs the cell is to be sent to. For an M × N switch, the destination vector of a multicast cell can be any one of 2^N − 1 possible vectors. We assume that each input has a single queue and that the scheduler only observes the first cell in the queue.

Figure 1: 2 × N multicast crossbar switch with a single FIFO queue at each input.
As a simple example of our architecture, consider the 2-input, N-output switch shown in Figure 1. Queue Q_A has an input cell destined for outputs {1, 2, 3, 4} and queue Q_B has an input cell destined for outputs {3, 4, 5, 6}. The set of outputs to which an input cell wishes to be copied will be referred to as the fanout of that input cell.¹ For clarity, we distinguish an arriving input cell from its corresponding output cells. In the figure, the single input cell at the head of queue Q_A will generate four output cells.
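A destination vector of this kind is naturally held as an N-bit mask, one bit per output port. A minimal sketch of the Figure 1 example (the helper name is ours, not the paper's):

```python
N = 6  # number of outputs in the Figure 1 example

def fanout_mask(outputs):
    """Encode a set of output ports (numbered from 1) as an N-bit destination vector."""
    mask = 0
    for port in outputs:
        mask |= 1 << (port - 1)
    return mask

q_a = fanout_mask({1, 2, 3, 4})  # HOL cell of queue Q_A
q_b = fanout_mask({3, 4, 5, 6})  # HOL cell of queue Q_B

overlap = q_a & q_b              # the two cells contend only for outputs 3 and 4
assert 0 < q_a < 2 ** N          # one of the 2^N - 1 possible destination vectors
```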
We assume that an input cell must wait in line until all of the cells ahead of it have departed. A simple way to service the input queues is to replicate the input cell over multiple cell times, generating one output cell per cell time. However, this approach has two disadvantages. First, each input cell must be copied multiple times, increasing the required memory bandwidth. Second, input cells contend for access to the switch multiple times, reducing the bandwidth available to other traffic at the same input. Higher throughput can be attained if we take advantage of the natural multicast properties of a crossbar switch. So instead, we assume that one input cell can be copied to any number of outputs in a single cell time for which there is no conflict.

There are two different service disciplines that can be used. Following the description in [18], the first is no fanout-splitting, in which all of the copies of a cell must be sent in the same cell time. If any of the output cells loses contention for an output port, none of the output cells are transmitted and the cell must try again in the next cell time. The second discipline is fanout-splitting, in which output cells may be delivered to output ports over any number of cell times. Only those output cells that are unsuccessful in one cell time continue to contend for output ports in the next cell time.²

Because fanout-splitting is work conserving, it enables a higher switch throughput [21] for little increase in implementation complexity. For example, Figure 2 compares the average cell latency (via simulations) with and without fanout-splitting of the Random scheduling policy for an 8 × 8 switch under uniform loading on all inputs and an average fanout of four. The figure demonstrates that fanout-splitting can lead to approximately 40% higher throughput.
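The two disciplines can be sketched as a single cell-time step. The dictionary layout and the per-output grants map below are our own assumptions for illustration, not the paper's notation:

```python
def serve_one_cell_time(hol_fanouts, grants, fanout_splitting=True):
    """
    hol_fanouts: {input: set of outputs requested by its HOL cell}
    grants:      {output: winning input}, one winner per output,
                 as chosen by some scheduling policy.
    Returns the fanouts left at HOL for the next cell time.
    """
    remaining = {}
    for inp, fanout in hol_fanouts.items():
        won = {out for out in fanout if grants.get(out) == inp}
        if fanout_splitting:
            left = fanout - won                     # successful copies depart
        else:
            left = set() if won == fanout else fanout  # all or nothing
        if left:
            remaining[inp] = left
    return remaining
```

On the Figure 1 example, if outputs 3 and 4 both grant to Q_A, fanout-splitting leaves only {3, 4} on Q_B, while no fanout-splitting holds Q_B's entire cell back for another attempt.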
¹ We use the term fanout throughout this paper to denote both the constitution and the cardinality of the input vector. For example, in Figure 1, the input cell at the head of each queue is said to have a fanout of four.

² It might appear that fanout-splitting is much more difficult to implement than no fanout-splitting. However, this is not the case. In order to support fanout-splitting, we need one extra signal from the scheduler to inform each input port when a cell at its HOL is completely served.
Figure 2: Average cell latency (in number of cell times) as a function of offered load for an 8 × 8 switch (with uniform input traffic and average fanout of four). The graph compares the Random scheduling policy with and without fanout-splitting.
2.2 Denition of Terms
Here we make precise some of the terminology used throughout the paper. Some terms
have already b een lo osely dened, but a few new ones are intro duced.
Denition 1 (Residue):
The
residue
is the set of al l output cel ls that lose contention for
output ports and remain at the
HOL
of the input queues at the end of each cel l time.
It is important to note that given a set of requests, every work-conserving p olicy will
leave the same residue. However, it is up to the p olicy to determine how the residue is
distributed over the inputs.
Denition 2 (Concentrating Policy):
A multicast scheduling policy is said to be
con-
centrating
if, at the end of every cel l time, it leaves the residue on the smal lest possible
number of input ports.
Denition 3 (Distributing Policy):
A multicast scheduling policy is said to be
distribut-
ing
if, at the end of every cel l time, it leaves the residue on the largest possible number of
input ports.
Denition 4 (A Non-concentrating Policy):
A multicast scheduling policy is said to
be
non-concentrating
if it does not always concentrate the residue.
Denition 5 (Fairness Constraint):
A multicast scheduling policy is said to be
fair
if
each input cel l is held at the
HOL
for no more than a xed number of cel l times (this number
4

can be dierent for dierent inputs). This fairness constraint can also be thought of as a
starvation constraint.
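The observation that every work-conserving policy leaves the same residue can be checked mechanically: an output requested by k HOL cells serves one and leaves k − 1 behind, regardless of which inputs win. A sketch under our own (hypothetical) data layout; the policy only chooses where the leftover cells sit, not how many there are:

```python
from collections import Counter

def residue_size_per_output(hol_fanouts):
    """
    hol_fanouts: {input: set of outputs requested by its HOL cell}.
    Under any work-conserving policy each requested output serves exactly one
    cell, so the residue holds k - 1 output cells for an output with k requests.
    """
    demand = Counter(out for fanout in hol_fanouts.values() for out in fanout)
    return {out: k - 1 for out, k in demand.items() if k > 1}
```

For the Figure 1 example this gives one leftover output cell each for outputs 3 and 4; a concentrating policy leaves both on one input, a distributing policy leaves one on each.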
2.3 Requirements of an Algorithm

Before describing the details of various scheduling algorithms, we first look at some requirements.

1. Work conservation: The algorithm must be work conserving, which means that no output port may be idle as long as it can serve some input cell destined to it. This property is necessary for an algorithm to provide maximum throughput.

2. Fairness: The algorithm must meet the fairness constraint defined above, i.e., it must not lead to the starvation of any input.
3 The Heuristic of Residue Concentration

In this section, we describe two algorithms, the Concentrate algorithm and the Distribute algorithm, which represent the two extremes of residue placement. We present an intuitive explanation for why it is best to concentrate residue in order to achieve a high throughput.

Algorithm: Concentrate. Concentrate always concentrates the residue onto as few inputs as possible. This is achieved by performing the following steps at the beginning of each cell time.

1. Determine the residue.
2. Find the input with the most in common with the residue. If there is a choice of inputs, select the one with the input cell that has been at the HOL for the shortest time. This ensures some fairness, though not in the sense of the definition in Section 2.2 (see remark below).
3. Concentrate as much residue onto this input as possible.
4. Remove the input from further consideration.
5. Repeat steps (2)-(4) until no residue remains.

Remark: Since an input cell can remain at HOL indefinitely, this algorithm does not meet the fairness constraint. The purpose of this algorithm is to provide us with a basis for comparing the performance of other algorithms, since it achieves the highest throughput. This is demonstrated by our simulation results in Section 7.
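The steps above can be sketched as follows. The data layout (fanout sets, HOL ages) is our assumption; the paper describes Concentrate only in prose:

```python
def concentrate(hol_fanouts, residue, hol_age):
    """Place the residue on as few inputs as possible.

    hol_fanouts: {input: set of outputs requested by its HOL cell}
    residue:     set of outputs whose leftover cell must stay at some HOL
    hol_age:     {input: cell times its HOL cell has waited}
    Returns {input: residue outputs left on that input}.
    """
    placement = {}
    candidates = dict(hol_fanouts)
    remaining = set(residue)
    while remaining and candidates:
        # Step 2: input with the most in common with the residue; ties go
        # to the input whose HOL cell has waited the shortest time.
        inp = max(candidates,
                  key=lambda i: (len(candidates[i] & remaining), -hol_age[i]))
        overlap = candidates[inp] & remaining
        if not overlap:
            break                # residue not requested by any remaining input
        placement[inp] = overlap  # Step 3: concentrate residue onto this input
        remaining -= overlap
        del candidates[inp]       # Step 4: remove from further consideration
    return placement              # Step 5: loop until no residue remains
```

On the Figure 1 example with residue {3, 4}, both inputs overlap the residue fully, so the tie-break picks the newer HOL cell and the entire residue lands on one input.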
Algorithm: Distribute. Distribute always distributes the residue onto as many inputs as possible.

1. Determine the residue.
2. Find the input with at least one cell but otherwise the least in common with the residue. If there is a choice of inputs, select the one with the input cell that has been at the HOL for the shortest time.
3. Place one output cell of residue onto that input.
4. Remove the input from further consideration.
5. Repeat steps (2)-(4) until no inputs remain.
6. If residue remains, consider all the inputs again and start at step (2).
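A corresponding sketch of Distribute, under the same assumed data layout as the Concentrate sketch; on the Figure 1 example it leaves one residue cell on each input rather than both on one:

```python
def distribute(hol_fanouts, residue, hol_age):
    """Place the residue on as many inputs as possible.

    Same hypothetical layout as before: fanout sets per input, a set of
    residue outputs, and per-input HOL ages for tie-breaking.
    """
    placement = {i: set() for i in hol_fanouts}
    remaining = set(residue)
    while remaining:                       # Step 6: restart until none remains
        considered = {i for i in hol_fanouts if hol_fanouts[i] & remaining}
        if not considered:
            break                          # residue no input can hold
        while considered:
            # Step 2: least in common with the residue; ties go to the
            # input whose HOL cell has waited the shortest time.
            inp = min(considered,
                      key=lambda i: (len(hol_fanouts[i] & remaining), hol_age[i]))
            overlap = hol_fanouts[inp] & remaining
            if overlap:
                out = min(overlap)         # Step 3: one output cell of residue
                placement[inp].add(out)
                remaining.discard(out)
            considered.discard(inp)        # Step 4: remove from consideration
    return {i: p for i, p in placement.items() if p}
```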

References

- Input Versus Output Queueing on a Space-Division Packet Switch (journal article)
- Multicast routing in datagram internetworks and extended LANs (journal article)
- High-speed switch scheduling for local-area networks (journal article)
- MBONE: the multicast backbone (journal article)
- Design of a broadcast packet switching network (journal article)