The iSLIP scheduling algorithm for input-queued switches
Summary (5 min read)
Introduction
- This idea is already being carried one step further, with cell switches forming the core, or backplane, of high-performance IP routers [26], [31], [6], [4].
- Before using a crossbar switch as a switching fabric, it is important to consider some of the potential drawbacks; the authors consider three here.
- But HOL blocking can be eliminated by using a simple buffering strategy at each input port.
- Rather than maintain a single FIFO queue for all cells, each input maintains a separate queue for each output, as shown in the figure; this scheme is known as virtual output queueing (VOQ).
A. Maximum Size Matching
- These algorithms attempt to maximize the number of connections made in each cell time, and hence, maximize the instantaneous allocation of bandwidth.
- There exist many maximum-size bipartite matching algorithms; the most efficient currently known converges in O(N^{5/2}) time [12].
- The algorithm should not allow a nonempty VOQ to remain unserved indefinitely.
- The authors then consider some small modifications to iSLIP for various applications, and finally consider its implementation complexity.
B. Parallel Iterative Matching
- In some literature, the maximum size matching is called the maximum cardinality matching or just the maximum bipartite matching.
- Because PIM is the basis of the iSLIP algorithm described later, the authors describe the scheme in detail and consider some of its performance characteristics.
- PIM attempts to quickly converge on a conflict-free maximal match in multiple iterations, where each iteration consists of three steps.
- If an unmatched output receives any requests, it grants to one of them, selected uniformly at random over all requests.
- Second, when the switch is oversubscribed, PIM can lead to unfairness between connections.
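The three-step PIM iteration described above can be sketched in code. This is an illustrative reconstruction, not the authors' implementation; the function name `pim_iteration` and its data structures are invented for the example.

```python
import random

def pim_iteration(requests, in_match, out_match):
    """One iteration of (a sketch of) PIM's request/grant/accept steps.

    requests[i] is the set of outputs input i has queued cells for.
    in_match / out_match map matched inputs/outputs to their partners
    (None while unmatched).
    """
    # Step 1: Request. Every unmatched input requests all outputs it has
    # cells for; only unmatched outputs collect requests.
    reqs = {o: [] for o in out_match if out_match[o] is None}
    for i, outs in requests.items():
        if in_match[i] is None:
            for o in outs:
                if o in reqs:
                    reqs[o].append(i)
    # Step 2: Grant. Each unmatched output that received requests grants
    # to one of them, chosen uniformly at random.
    grants = {}  # input -> list of outputs that granted to it
    for o, inputs in reqs.items():
        if inputs:
            grants.setdefault(random.choice(inputs), []).append(o)
    # Step 3: Accept. Each input accepts one grant, chosen uniformly at
    # random; the pair becomes part of the match for this cell time.
    for i, outs in grants.items():
        o = random.choice(outs)
        in_match[i], out_match[o] = o, i
```

With every input requesting every output, each iteration adds at least one connection, so a fully loaded N-port switch is fully matched within N iterations (in expectation PIM needs far fewer).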
II. THE iSLIP ALGORITHM WITH A SINGLE ITERATION
- In this section the authors describe and evaluate the iSLIP algorithm.
- This section concentrates on the behavior of iSLIP with just a single iteration per cell time.
- The iSLIP algorithm uses rotating priority ("round-robin") arbitration to schedule each active input and output in turn.
- The authors find that the performance of iSLIP for uniform traffic is high; for uniform i.i.d. Bernoulli arrivals, the throughput approaches 100%.
- This is the result of a phenomenon that the authors encounter repeatedly: the arbiters in iSLIP have a tendency to desynchronize with respect to one another.
A. Basic Round-Robin Matching Algorithm
- iSLIP is a variation of the simple basic round-robin matching (RRM) algorithm.
- The RRM algorithm, like PIM, consists of three steps.
- The three steps of arbitration are: Step 1: Request.
- Each input sends a request to every output for which it has a queued cell.
- The pointer to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input.
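The RRM steps above can be sketched as follows; this is an illustrative reconstruction under stated assumptions, with invented names (`rrm_schedule`, `next_in_rr`). Running it on a fully loaded switch shows the synchronization problem: every output grants to the same input, and all grant pointers move in lockstep.

```python
def next_in_rr(candidates, pointer, n):
    """Pick the candidate appearing next in round-robin order from pointer."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None

def rrm_schedule(requests, grant_ptr, accept_ptr, n):
    """One cell time of (a sketch of) basic RRM: request, grant, accept.

    requests[i] is the set of outputs input i has queued cells for;
    grant_ptr[o] / accept_ptr[i] are the round-robin pointers.
    """
    # Step 2: Grant. Each output grants to the requesting input appearing
    # next in its round-robin schedule; the pointer then moves one beyond
    # the granted input whether or not the grant is accepted (unlike iSLIP).
    grants = {}  # input -> set of outputs that granted to it
    for o in range(n):
        inputs = {i for i in range(n) if o in requests[i]}
        g = next_in_rr(inputs, grant_ptr[o], n)
        if g is not None:
            grants.setdefault(g, set()).add(o)
            grant_ptr[o] = (g + 1) % n
    # Step 3: Accept. Each input accepts the granting output appearing next
    # in its own round-robin schedule, then advances its pointer.
    match = {}
    for i, outs in grants.items():
        a = next_in_rr(outs, accept_ptr[i], n)
        match[i] = a
        accept_ptr[i] = (a + 1) % n
    return match
```

With all pointers starting at zero and every input requesting every output, only one connection is made in the first cell time, and all four grant pointers advance to the same position.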
B. Performance of RRM for Bernoulli Arrivals
- As an introduction to the performance of the RRM algorithm, Fig. 5 shows the average delay as a function of offered load for uniform i.i.d. Bernoulli arrivals.
- This synchronization phenomenon leads to a maximum throughput of just 50% for this traffic pattern.
- Fig. 8 shows the number of synchronized output arbiters as a function of offered load.
- Under heavy offered load, cells arriving for an output find its arbiter in an effectively random position, equally likely to grant to any input. The probability that an input remains ungranted is therefore ((N − 1)/N)^N; hence, as N increases, the throughput tends to 1 − 1/e ≈ 63%.
- This agrees well with the simulation results for RRM shown in the figure.
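The 63% limit can be checked numerically: the ungranted probability ((N − 1)/N)^N converges to 1/e as N grows, so the throughput converges to 1 − 1/e.

```python
import math

# Throughput of a randomly positioned arbiter: 1 - P(input ungranted),
# where P(ungranted) = ((N - 1) / N) ** N.
for n in (4, 16, 64, 256):
    p_ungranted = ((n - 1) / n) ** n
    throughput = 1 - p_ungranted
    print(n, round(throughput, 4))

print(round(1 - 1 / math.e, 4))  # limiting value, about 0.6321
```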
III. THE iSLIP ALGORITHM
- iSLIP improves upon RRM by reducing the synchronization of the output arbiters.
- iSLIP is identical to RRM except for a condition placed on updating the grant pointers: a grant pointer is updated only if the grant is accepted.
- This is because when the arbiters move their pointers, the most recently granted input becomes the lowest priority at that output.
- The output will serve at most N − 1 other inputs first, waiting at most N cell times to be accepted by each input.
- Under heavy load, all queues with a common output have the same throughput (Property 3).
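A minimal sketch of single-iteration iSLIP, under the same setup as the RRM sketch and with invented names (`islip_schedule`, `next_in_rr`). The only change is that a grant pointer advances solely when its grant is accepted; on a fully loaded 4-port switch this lets the arbiters desynchronize, and the match grows to a full 4-connection match within four cell times.

```python
def next_in_rr(candidates, pointer, n):
    """Pick the candidate appearing next in round-robin order from pointer."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None

def islip_schedule(requests, grant_ptr, accept_ptr, n):
    """One cell time of single-iteration iSLIP (illustrative sketch)."""
    # Step 2: Grant. Each output grants to the requesting input appearing
    # next in round-robin order from its grant pointer.
    grants, granted_by = {}, {}
    for o in range(n):
        inputs = {i for i in range(n) if o in requests[i]}
        g = next_in_rr(inputs, grant_ptr[o], n)
        if g is not None:
            grants.setdefault(g, set()).add(o)
            granted_by[o] = g
    # Step 3: Accept. Each input accepts the granting output appearing next
    # in round-robin order from its accept pointer, then advances it.
    match = {}
    for i, outs in grants.items():
        a = next_in_rr(outs, accept_ptr[i], n)
        match[i] = a
        accept_ptr[i] = (a + 1) % n
    # Key difference from RRM: a grant pointer moves one beyond the granted
    # input only if that grant was accepted.
    for o, g in granted_by.items():
        if match.get(g) == o:
            grant_ptr[o] = (g + 1) % n
    return match
```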
C. As a Function of Switch Size
- Fig. 11 shows the average latency imposed by a scheduler as a function of offered load for switches with 4, 8, 16, and 32 ports.
- As the authors might expect, the performance degrades with the number of ports.
- The performance degrades differently under low and heavy loads.
- For a fixed low offered load, the queueing delay converges to a constant value.
- Ignoring the queueing delay under low offered load, the number of cells contending for each output converges to a constant as the switch size N increases.
D. Burstiness Reduction
- Intuitively, if a switch decreases the average burst length of traffic that it forwards, then the authors can expect it to improve the performance of its downstream neighbor.
- The authors expect any scheduling policy that uses round-robin arbiters to be burst-reducing; this includes iSLIP, which under heavy load becomes a deterministic algorithm serving each connection in strict rotation.
- The authors use the same measure of burstiness that they use when generating traffic: the average burst length.
- The authors define a burst of cells at the output of a switch as the number of consecutive cells that entered the switch at the same input.
- This indicates that the output arbiters have become desynchronized and are operating as time-division multiplexers, serving each input in turn.
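The burstiness measure defined above (average run of consecutive cells from the same input) is easy to compute; this helper is an invented illustration, not the authors' code.

```python
def avg_burst_length(output_stream):
    """Average burst length of a cell stream leaving one output.

    output_stream is the sequence of input-port IDs of departing cells; a
    burst is a maximal run of consecutive cells from the same input.
    """
    if not output_stream:
        return 0.0
    bursts = 1
    for prev, cur in zip(output_stream, output_stream[1:]):
        if cur != prev:
            bursts += 1
    return len(output_stream) / bursts
```

A stream like `[1, 1, 1, 2, 2, 3]` has three bursts and an average burst length of 2.0, while a perfectly interleaved (time-division multiplexed) stream such as `[0, 1, 2, 3, 0, 1, 2, 3, ...]` has an average burst length of 1.0, matching the desynchronized behavior described above.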
V. ANALYSIS OF SLIP PERFORMANCE
- In general, it is difficult to accurately analyze the performance of a switch, even for the simplest traffic models.
- Under uniform load and either very low or very high offered load, the authors can readily approximate and understand the way in which iSLIP operates.
- When arrivals are infrequent, the authors can assume that the arbiters act independently and that arriving cells are successfully scheduled with very low delay.
- At the other extreme, when the switch becomes uniformly backlogged, the authors can see that desynchronization will lead the arbiters to find an efficient time division multiplexing scheme and operate without contention.
- But when the traffic is nonuniform, or when the offered load is at neither extreme, the interaction between the arbiters becomes difficult to describe.
A. Convergence to Time-Division Multiplexing Under Heavy Load
- Under heavy load, iSLIP will behave similarly to an M/D/1 queue with the corresponding arrival rate and a deterministic service time of N cell times.
- So, under a heavy load of Bernoulli arrivals, the delay will be approximated by (2).
- This is because the service policy is not constant; when a queue changes between empty and nonempty, the scheduler must adapt to the new set of queues that require service.
- This adaptation takes place over many cell times while the arbiters desynchronize again.
- During this time, the throughput will be worse than for the M/D/1 queue and the queue length will increase.
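The paper's equation (2) is not reproduced in this summary; as a reference point, the standard Pollaczek–Khinchine result for an M/D/1 queue gives the mean waiting time used in this kind of heavy-load approximation. The helper name and parameter choices below are illustrative.

```python
def md1_wait(rho, service_time):
    """Mean waiting time of an M/D/1 queue.

    Pollaczek-Khinchine for deterministic service: W = rho * D / (2 * (1 - rho)),
    where rho is the utilization and D the fixed service time -- here,
    roughly N cell times once iSLIP settles into its rotation under heavy load.
    """
    assert 0 <= rho < 1, "queue is unstable at rho >= 1"
    return rho * service_time / (2 * (1 - rho))
```

For example, at 90% utilization with a 16-cell-time service period the mean wait is 72 cell times, and the delay grows without bound as the utilization approaches 1, consistent with the heavy-load behavior described above.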
VI. THE SLIP ALGORITHM WITH MULTIPLE ITERATIONS
- Until now, the authors have only considered the operation of iSLIP with a single iteration.
- Once again, the authors shall see that desynchronization of the output arbiters plays an important role in achieving low latency.
- When multiple iterations are used, it is necessary to modify the algorithm.
- If an unmatched output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element.
- The pointer to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input if and only if the grant is accepted in Step 3 of the first iteration.
A. Updating Pointers
- Note that the grant and accept pointers are only updated for matches found in the first iteration.
- Connections made in subsequent iterations do not cause the pointers to be updated.
- To understand how starvation can occur, the authors refer to the example of a 3 × 3 switch with five active and heavily loaded connections, shown in Fig. 15.
- The switch is scheduled using two iterations of the algorithm, except in this case, the pointers are updated after both iterations.
- Each time the round-robin arbiter at output 2 grants to input 1, input 1 chooses to accept output 1 instead.
B. Properties
- With multiple iterations, the algorithm has the following properties: Property 1: Connections matched in the first iteration become the lowest priority in the next cell time.
- Because pointers are not updated after the first iteration, an output will continue to grant to the highest priority requesting input until it is successful.
- For iSLIP with more than one iteration, and under heavy load, queues with a common output may each have a different throughput (Property 3).
- If zero connections are scheduled in an iteration, then the algorithm has converged; no more connections can be added with more iterations.
- The algorithm will not necessarily converge to a maximum sized match (Property 5).
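The multi-iteration loop and the pointer-update rule behind Properties 1 and 2 can be sketched as follows; this is an illustrative reconstruction with invented names (`islip_multi`, `next_in_rr`), not the authors' implementation.

```python
def next_in_rr(candidates, pointer, n):
    """Pick the candidate appearing next in round-robin order from pointer."""
    for k in range(n):
        c = (pointer + k) % n
        if c in candidates:
            return c
    return None

def islip_multi(requests, grant_ptr, accept_ptr, n, iterations):
    """Multi-iteration iSLIP sketch: pointers move only on first-iteration matches."""
    in_match = {i: None for i in range(n)}
    out_match = {o: None for o in range(n)}
    for it in range(iterations):
        added = 0
        # Grant: each unmatched output grants the next requesting, still
        # unmatched input in round-robin order from its pointer.
        grants = {}
        for o in range(n):
            if out_match[o] is not None:
                continue
            inputs = {i for i in range(n)
                      if o in requests[i] and in_match[i] is None}
            g = next_in_rr(inputs, grant_ptr[o], n)
            if g is not None:
                grants.setdefault(g, set()).add(o)
        # Accept: each input accepts the next granting output in round-robin
        # order; pointers are updated only in the first iteration.
        for i, outs in grants.items():
            a = next_in_rr(outs, accept_ptr[i], n)
            in_match[i], out_match[a] = a, i
            added += 1
            if it == 0:
                accept_ptr[i] = (a + 1) % n
                grant_ptr[a] = (i + 1) % n
        if added == 0:  # converged: no more connections can be added
            break
    return in_match
```

Each iteration only adds connections among still-unmatched ports, so once an iteration adds nothing the loop can stop early (Property: convergence in at most N iterations).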
A. How Many Iterations?
- When implementing iSLIP with multiple iterations, the authors need to decide how many iterations to perform during each cell time.
- Ideally, from Property 4 above, the authors would like to perform N iterations.
- After a number of cell times, the arbiters become totally desynchronized and the algorithm will converge in a single iteration.
- In some applications this may be acceptable.
- This relation held for all the stationary arrival processes the authors tried; however, they have not been able to prove that it holds in general.
A. Prioritized iSLIP
- Many applications use multiple classes of traffic with different priority levels.
- The Prioritized iSLIP algorithm gives strict priority to the highest priority request in each cell time.
- The pointer is incremented (modulo N) to one location beyond the granted input if and only if the input accepts the output in Step 3 of the first iteration.
- The input then chooses one output among only those that have requested at that priority level.
- The input arbiter maintains a separate pointer for each priority level.
B. Threshold iSLIP
- Scheduling algorithms that find a maximum weight match outperform those that find a maximum sized match.
- In particular, if the weight of the edge between an input and an output is the occupancy of the corresponding input queue, then the authors conjecture that the algorithm can achieve 100% throughput for all i.i.d. arrival processes.
- In the Threshold iSLIP algorithm, the authors make a compromise between the maximum-sized match and the maximum-weight match by quantizing the queue occupancy according to a set of threshold levels.
- The threshold level is then used to determine the priority level in the Prioritized iSLIP algorithm.
- If the queue occupancy exceeds the threshold for level l (but not the next higher threshold), then the input makes a request of level l.
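The quantization step can be sketched in a few lines; the function name `request_level` and the particular thresholds are invented for the example.

```python
def request_level(queue_occupancy, thresholds):
    """Map a VOQ occupancy to a request priority level (Threshold iSLIP sketch).

    thresholds is an ascending list; each threshold the occupancy exceeds
    raises the request by one level, so longer queues request at higher
    priority, approximating a weight without comparing exact occupancies.
    """
    level = 0
    for t in thresholds:
        if queue_occupancy > t:
            level += 1
    return level
```

For instance, with thresholds `[0, 4, 16]`, an empty queue requests at level 0, a queue of 3 cells at level 1, and a queue of 100 cells at level 3, which the Prioritized iSLIP arbitration then serves first.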
C. Weighted iSLIP
- In some applications, the strict priority scheme of Prioritized iSLIP may be undesirable, leading to starvation of low-priority traffic.
- As illustrated in Fig. 20, each arbiter consists of a priority encoder with a programmable highest priority, a register to hold the highest priority value, and an incrementer to move the pointer after it has been updated.
- The grant decision from each grant arbiter is then passed to the accept arbiters, where each arbiter selects at most one output on behalf of an input, implementing Step 3.
- Finally, the authors have observed that the complexity of the implementation is independent of the number of iterations.
- These values were obtained from a VHDL design that was synthesized using the Synopsys design tools and compiled for the Texas Instruments TSC5000 0.25-µm CMOS ASIC process.
X. CONCLUSION
- The Internet requires fast switches and routers to handle the increasing congestion.
- The authors believe that these switches will use virtual output queueing, and hence will need fast, simple, fair, and efficient scheduling algorithms to arbitrate access to the switching fabric.
- To this end, the authors have introduced the iSLIP algorithm, an iterative algorithm that achieves high throughput, yet is simple to implement in hardware and operate at high speed.
- When the traffic is nonuniform, the algorithm quickly adapts to an efficient round-robin policy among the busy queues.
- The simplicity of the algorithm allows the arbiter for a 32-port switch to be placed on a single chip, and to make close to 100 million arbitration decisions per second.
Frequently Asked Questions (13)
Q2. How does the synchronization effect affect the output?
As shown in Fig. 8, as the offered load increases, synchronized output arbiters tend to move in lockstep and the degree of synchronization changes only slightly.
Q3. What is the effect of bursty arrivals on the performance of a switch?
With bursty arrivals, the performance of an input-queued switch becomes more and more like an output-queued switch under the same arrival conditions [9].
Q4. How many gates are needed to implement a 32-port scheduler?
The number of gates for a 32-port scheduler is less than 100 000, making it readily implementable in current CMOS technologies, and the total number of gates grows approximately with N².
Q5. What is the effect of burst size on the queueing delay?
As the authors would expect, the increased burst size leads to a higher queueing delay whereas an increased number of iterations leads to a lower queueing delay.
Q6. Why is the service policy not constant?
This is because the service policy is not constant; when a queue changes between empty and nonempty, the scheduler must adapt to the new set of queues that require service.
Q7. What is the approximation for the expected number of unmatched inputs at time?
The approximation is based on two assumptions: 1) inputs that are unmatched at a given time are uniformly distributed over all inputs; 2) the number of unmatched inputs at that time has zero variance.
Q8. How does the algorithm calculate the maximum weight of the input queue?
In particular, if the weight of the edge between an input and an output is the occupancy of the corresponding input queue, then the authors conjecture that the algorithm can achieve 100% throughput for all i.i.d. arrival processes.
Q9. How many iterations does it take to converge?
In practice there may be insufficient time for N iterations, so the authors consider the penalty of performing only i iterations, where i < N. In fact, because of the desynchronization of the arbiters, iSLIP will usually converge in fewer than N iterations.
Q10. How can the basic algorithm be extended to include requests at multiple priority levels?
The basic algorithm can be extended to include requests at multiple priority levels with only a small performance and complexity penalty.
Q11. What is the problem with the interaction between the arbiters?
But when the traffic is nonuniform, or when the offered load is at neither extreme, the interaction between the arbiters becomes difficult to describe.
Q12. What is the pointer to the highest priority element of the round-robin schedule?
The pointer to the highest priority element of the round-robin schedule is incremented (modulo N) to one location beyond the granted input if and only if the grant is accepted in Step 3 of the first iteration.
Q13. How does PIM achieve a conflict-free maximal match?
PIM attempts to quickly converge on a conflict-free maximal match in multiple iterations, where each iteration consists of three steps.