Journal ArticleDOI

The iSLIP scheduling algorithm for input-queued switches

Nick McKeown
01 Apr 1999 - IEEE/ACM Transactions on Networking - Vol. 7, Iss. 2, pp. 188-201
TL;DR: This paper presents a scheduling algorithm called iSLIP, an iterative, round-robin algorithm that can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware, and describes the implementation complexity of the algorithm.
Abstract: An increasing number of high performance internetworking protocol routers, LAN and asynchronous transfer mode (ATM) switches use a switched backplane based on a crossbar switch. Most often, these systems use input queues to hold packets waiting to traverse the switching fabric. It is well known that if simple first in first out (FIFO) input queues are used to hold packets then, even under benign conditions, head-of-line (HOL) blocking limits the achievable bandwidth to approximately 58.6% of the maximum. HOL blocking can be overcome by the use of virtual output queueing, which is described in this paper. A scheduling algorithm is used to configure the crossbar switch, deciding the order in which packets will be served. Previous results have shown that with a suitable scheduling algorithm, 100% throughput can be achieved. In this paper, we present a scheduling algorithm called iSLIP. An iterative, round-robin algorithm, iSLIP can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware. Iterative and noniterative versions of the algorithms are presented, along with modified versions for prioritized traffic. Simulation results are presented to indicate the performance of iSLIP under benign and bursty traffic conditions. Prototype and commercial implementations of iSLIP exist in systems with aggregate bandwidths ranging from 50 to 500 Gb/s. When the traffic is nonuniform, iSLIP quickly adapts to a fair scheduling policy that is guaranteed never to starve an input queue. Finally, we describe the implementation complexity of iSLIP. Based on a two-dimensional (2-D) array of priority encoders, single-chip schedulers have been built supporting up to 32 ports, and making approximately 100 million scheduling decisions per second.

Summary (5 min read)

Introduction

  • This idea is already being carried one step further, with cell switches forming the core, or backplane, of high-performance IP routers [26], [31], [6], [4].
  • Before using a crossbar switch as a switching fabric, it is important to consider some of the potential drawbacks; the authors consider three here.
  • But HOL blocking can be eliminated by using a simple buffering strategy at each input port.
  • Rather than maintain a single FIFO queue for all cells, each input maintains a separate queue for each output as shown in Fig. 1.
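As a concrete illustration of virtual output queueing, the following minimal Python sketch (class and method names are ours, not the paper's) keeps one FIFO per (input, output) pair, so a cell can only wait behind cells headed to the same output, which is what eliminates HOL blocking:

```python
from collections import deque

class VOQSwitch:
    """Input-queued switch state with one virtual output queue (VOQ)
    per (input, output) pair. A sketch, not the paper's implementation."""
    def __init__(self, n_ports):
        self.n = n_ports
        # voq[i][j] holds cells that arrived at input i, destined to output j
        self.voq = [[deque() for _ in range(n_ports)] for _ in range(n_ports)]

    def enqueue(self, input_port, output_port, cell):
        self.voq[input_port][output_port].append(cell)

    def requests(self):
        """Boolean request matrix handed to the scheduler each cell time."""
        return [[len(q) > 0 for q in row] for row in self.voq]
```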

A. Maximum Size Matching

  • These algorithms attempt to maximize the number of connections made in each cell time, and hence, maximize the instantaneous allocation of bandwidth.
  • There exist many maximum-size bipartite matching algorithms, and the most efficient currently known converges in $O(n^{5/2})$ time [12].
  • The algorithm should not allow a nonempty VOQ to remain unserved indefinitely.
  • The authors then consider some small modifications to iSLIP for various applications, and finally consider its implementation complexity.

B. Parallel Iterative Matching

  • In some literature, the maximum size matching is called the maximum cardinality matching or just the maximum bipartite matching.
  • Because PIM forms the basis of the iSLIP algorithm described later, the authors describe the scheme in detail and consider some of its performance characteristics.
  • PIM attempts to quickly converge on a conflict-free maximal match in multiple iterations, where each iteration consists of three steps.
  • If an unmatched output receives any requests, it grants to one by randomly selecting a request uniformly over all requests.
  • Second, when the switch is oversubscribed, PIM can lead to unfairness between connections.

II. THE iSLIP ALGORITHM WITH A SINGLE ITERATION

  • In this section the authors describe and evaluate the iSLIP algorithm.
  • This section concentrates on the behavior of iSLIP with just a single iteration per cell time.
  • The iSLIP algorithm uses rotating priority (“round-robin”) arbitration to schedule each active input and output in turn.
  • The authors find that the performance of iSLIP for uniform traffic is high; for uniform independent identically distributed (i.i.d.) Bernoulli arrivals, iSLIP with a single iteration can achieve 100% throughput.
  • This is the result of a phenomenon that the authors encounter repeatedly; the arbiters in iSLIP have a tendency to desynchronize with respect to one another.

A. Basic Round-Robin Matching Algorithm

  • iSLIP is a variation of the simple basic round-robin matching algorithm (RRM).
  • The RRM algorithm, like PIM, consists of three steps.
  • The three steps of arbitration are: Step 1: Request.
  • Each input sends a request to every output for which it has a queued cell.
  • The pointer to the highest priority element of the round-robin schedule is incremented (modulo $N$) to one location beyond the granted input.

B. Performance of RRM for Bernoulli Arrivals

  • As an introduction to the performance of the RRM algorithm, Fig. 5 shows the average delay as a function of offered load for uniform independent and identically distributed (i.i.d.) Bernoulli arrivals.
  • This synchronization phenomenon leads to a maximum throughput of just 50% for this traffic pattern.
  • Fig. 8 shows the number of synchronized output arbiters as a function of offered load.
  • The probability that an input will remain ungranted is $((N-1)/N)^N$; hence, as $N$ increases, the throughput tends to $1-(1/e) \approx 63\%$.
  • Under low offered load, cells arriving for an output will find its grant pointer in a random position, equally likely to grant to any input; this agrees well with the simulation result for RRM in Fig. 8.

III. THE iSLIP ALGORITHM

  • The iSLIP algorithm improves upon RRM by reducing the synchronization of the output arbiters.
  • iSLIP is identical to RRM except for a condition placed on updating the grant pointers.
  • This is because when the arbiters move their pointers, the most recently granted (accepted) input (output) becomes the lowest priority at that output (input).
  • The output will serve at most $N-1$ other inputs first, waiting at most $N$ cell times to be accepted by each input.
  • Under heavy load, all queues with a common output have the same throughput (Property 3).

C. As a Function of Switch Size

  • Fig. 11 shows the average latency imposed by a scheduler as a function of offered load for switches with 4, 8, 16, and 32 ports.
  • As the authors might expect, the performance degrades with the number of ports.
  • The performance degrades differently under low and heavy loads.
  • For a fixed low offered load, the queueing delay converges to a constant value.
  • This is because, under low offered load, the number of cells contending for each output is approximately binomially distributed, which converges with increasing $N$ to a fixed (Poisson) distribution.

D. Burstiness Reduction

  • Intuitively, if a switch decreases the average burst length of traffic that it forwards, then the authors can expect it to improve the performance of its downstream neighbor.
  • The authors expect any scheduling policy that uses round-robin arbiters to be burst-reducing; this is especially so for iSLIP which, under heavy load, behaves as a deterministic algorithm serving each connection in strict rotation.
  • The authors use the same measure of burstiness that they use when generating traffic: the average burst length.
  • The authors define a burst of cells at the output of a switch as the number of consecutive cells that entered the switch at the same input.
  • This indicates that the output arbiters have become desynchronized and are operating as time-division multiplexers, serving each input in turn.
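Given that definition of a burst, the average burst length at an output can be measured as in this short sketch (Python; the function name and data layout are ours):

```python
def average_burst_length(departures):
    """departures: sequence of input-port IDs of consecutive cells
    leaving one output. Returns the mean run length of identical IDs."""
    if not departures:
        return 0.0
    bursts = 1
    for prev, cur in zip(departures, departures[1:]):
        if cur != prev:          # a new burst starts whenever the input changes
            bursts += 1
    return len(departures) / bursts
```

For example, the departure sequence [1, 1, 2, 2, 2, 3] contains three bursts and yields an average burst length of 2.0.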

V. ANALYSIS OF iSLIP PERFORMANCE

  • In general, it is difficult to accurately analyze the performance of a switch, even for the simplest traffic models.
  • Under uniform load and either very low or very high offered load, the authors can readily approximate and understand the way in which iSLIP operates.
  • When arrivals are infrequent, the authors can assume that the arbiters act independently and that arriving cells are successfully scheduled with very low delay.
  • At the other extreme, when the switch becomes uniformly backlogged, the authors can see that desynchronization will lead the arbiters to find an efficient time division multiplexing scheme and operate without contention.
  • But when the traffic is nonuniform, or when the offered load is at neither extreme, the interaction between the arbiters becomes difficult to describe.

A. Convergence to Time-Division Multiplexing Under Heavy Load

  • Under heavy load, iSLIP will behave similarly to an M/D/1 queue with the same arrival rate and a deterministic service time of $N$ cell times.
  • So, under a heavy load of Bernoulli arrivals, the delay will be approximated by (2).
  • This is because the service policy is not constant; when a queue changes between empty and nonempty, the scheduler must adapt to the new set of queues that require service.
  • This adaptation takes place over many cell times while the arbiters desynchronize again.
  • During this time, the throughput will be worse than for the M/D/1 queue and the queue length will increase.
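For reference, the heavy-load approximation invokes the standard M/D/1 mean waiting time (the Pollaczek-Khinchine formula with zero service-time variance). The paper's equation (2) is not reproduced in this extract, but for arrival rate $\lambda$ and deterministic service time $s$ the classical result is:

$$ W = \frac{\rho\, s}{2(1-\rho)}, \qquad \rho = \lambda s < 1, $$

so as the offered load $\rho$ approaches 1, the waiting time grows as $1/(1-\rho)$.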

VI. THE iSLIP ALGORITHM WITH MULTIPLE ITERATIONS

  • Until now, the authors have only considered the operation of iSLIP with a single iteration.
  • Once again, the authors shall see that desynchronization of the output arbiters plays an important role in achieving low latency.
  • When multiple iterations are used, it is necessary to modify the algorithm.
  • If an unmatched output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element.
  • The pointer to the highest priority element of the round-robin schedule is incremented (modulo $N$) to one location beyond the granted input if and only if the grant is accepted in Step 3 of the first iteration.
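A sketch of the resulting control loop (Python; `one_iteration` is a hypothetical callback standing in for a single grant/accept round over still-unmatched ports, and must itself enforce the first-iteration pointer rule described above):

```python
def islip_schedule(requests, one_iteration, max_iters):
    """Run up to max_iters iterations of iSLIP (a sketch).
    one_iteration(requests, match, matched_outputs, first_iter) performs
    one grant/accept round over unmatched ports, returns the number of
    connections added, and updates pointers only when first_iter is True."""
    match, matched_outputs = {}, set()
    for it in range(max_iters):
        added = one_iteration(requests, match, matched_outputs,
                              first_iter=(it == 0))
        if added == 0:        # converged: further iterations add nothing
            break
    return match
```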

A. Updating Pointers

  • Note that the grant and accept pointers are only updated for matches found in the first iteration.
  • Connections made in subsequent iterations do not cause the pointers to be updated.
  • To understand how starvation can occur, the authors refer to the example of a $3 \times 3$ switch with five active and heavily loaded connections, shown in Fig. 15.
  • The switch is scheduled using two iterations of the algorithm, except in this case, the pointers are updated after both iterations.
  • Each time the round-robin arbiter at output 2 grants to input 1, input 1 chooses to accept output 1 instead.

B. Properties

  • With multiple iterations, the algorithm has the following properties: Property 1: Connections matched in the first iteration become the lowest priority in the next cell time.
  • Because pointers are not updated after the first iteration, an output will continue to grant to the highest priority requesting input until it is successful.
  • For iSLIP with more than one iteration, and under heavy load, queues with a common output may each have a different throughput (Property 3).
  • If zero connections are scheduled in an iteration, then the algorithm has converged; no more connections can be added with more iterations.
  • The algorithm will not necessarily converge to a maximum sized match (Property 5).

A. How Many Iterations?

  • When implementing iSLIP with multiple iterations, the authors need to decide how many iterations to perform during each cell time.
  • Ideally, from Property 4 above, the authors would like to perform $N$ iterations.
  • After several cell times, the arbiters have become totally desynchronized and the algorithm will converge in a single iteration.
  • In some applications this may be acceptable.
  • For all the stationary arrival processes the authors have tried, $\log_2 N$ iterations suffice; however, they have not been able to prove that this relation holds in general.

A. Prioritized iSLIP

  • Many applications use multiple classes of traffic with different priority levels.
  • The Prioritized iSLIP algorithm gives strict priority to the highest priority request in each cell time.
  • The pointer is incremented (modulo $N$) to one location beyond the granted input if and only if the input accepts the output in Step 3 of the first iteration.
  • The input then chooses one output among only those that granted at the highest priority level.
  • The input arbiter maintains a separate pointer for each priority level.
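A sketch of the level-filtering step (Python; names are ours): an arbiter first restricts attention to the highest priority level with outstanding requests, then applies the ordinary round-robin selection within that level, using the pointer kept for that level.

```python
def highest_requesting_level(requests_by_level):
    """requests_by_level[l] is the set of requesting ports at priority
    level l (0 = highest). Returns the highest nonempty level and its
    requesters; round-robin selection then runs within that level only,
    using that level's own pointer."""
    for level, requesters in enumerate(requests_by_level):
        if requesters:
            return level, requesters
    return None, set()
```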

B. Threshold iSLIP

  • Scheduling algorithms that find a maximum weight match outperform those that find a maximum sized match.
  • In particular, if the weight of the edge between input $i$ and output $j$ is the occupancy of the corresponding input queue, then the authors conjecture that the algorithm can achieve 100% throughput for all i.i.d. arrival processes.
  • In the Threshold iSLIP algorithm, the authors make a compromise between the maximum-sized match and the maximum weight match by quantizing the queue occupancy according to a set of threshold levels.
  • The threshold level is then used to determine the priority level in the Prioritized iSLIP algorithm.
  • If the occupancy of a VOQ exceeds threshold level $k$, then the input makes a request at priority level $k$.
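The quantization can be sketched in a couple of lines; the threshold values below are placeholders, not taken from the paper:

```python
import bisect

def request_level(occupancy, thresholds=(1, 4, 16, 64)):
    """Map a VOQ's occupancy to a priority level: the number of
    thresholds met. Level 0 means the queue is empty (no request)."""
    return bisect.bisect_right(thresholds, occupancy)
```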

C. Weighted iSLIP

  • In some applications, the strict priority scheme of Prioritized iSLIP may be undesirable, leading to starvation of low-priority traffic.
  • As illustrated in Fig. 20, each arbiter consists of a priority encoder with a programmable highest priority, a register to hold the highest priority value, and an incrementer to move the pointer after it has been updated.
  • The grant decision from each grant arbiter is then passed to the accept arbiters, where each arbiter selects at most one output on behalf of an input, implementing Step 3.
  • Finally, the authors have observed that the complexity of the implementation is independent of the number of iterations.
  • These values were obtained from a VHDL design that was synthesized using the Synopsys design tools, and compiled for the Texas Instruments TSC5000 0.25-µm CMOS ASIC process.
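As described in the bullet above referencing Fig. 20, each arbiter is a priority encoder with a programmable highest-priority input, a pointer register, and an incrementer. A behavioral Python model of that datapath (a software sketch, not the VHDL) is:

```python
class RoundRobinArbiter:
    """Behavioral model of a round-robin arbiter built from a priority
    encoder with a programmable highest priority (cf. Fig. 20)."""
    def __init__(self, n):
        self.n = n
        self.pointer = 0                 # register: highest-priority index

    def select(self, requests):
        """Priority-encode: first asserted request at or after the pointer."""
        for k in range(self.n):
            idx = (self.pointer + k) % self.n
            if requests[idx]:
                return idx
        return None

    def update(self, granted_idx):
        """Incrementer: in iSLIP, called only when the grant is accepted."""
        self.pointer = (granted_idx + 1) % self.n
```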

X. CONCLUSION

  • The Internet requires fast switches and routers to handle the increasing congestion.
  • The authors believe that these switches will use virtual output queueing, and hence will need fast, simple, fair, and efficient scheduling algorithms to arbitrate access to the switching fabric.
  • To this end, the authors have introduced the iSLIP algorithm, an iterative algorithm that achieves high throughput, yet is simple to implement in hardware and can operate at high speed.
  • When the traffic is nonuniform, the algorithm quickly adapts to an efficient round-robin policy among the busy queues.
  • The simplicity of the algorithm allows the arbiter for a 32-port switch to be placed on a single chip, and to make close to 100 million arbitration decisions per second.


The iSLIP Scheduling Algorithm
for Input-Queued Switches
Nick McKeown, Senior Member, IEEE
Abstract: An increasing number of high performance internetworking protocol routers, LAN and asynchronous transfer mode (ATM) switches use a switched backplane based on a crossbar switch. Most often, these systems use input queues to hold packets waiting to traverse the switching fabric. It is well known that if simple first in first out (FIFO) input queues are used to hold packets then, even under benign conditions, head-of-line (HOL) blocking limits the achievable bandwidth to approximately 58.6% of the maximum. HOL blocking can be overcome by the use of virtual output queueing, which is described in this paper. A scheduling algorithm is used to configure the crossbar switch, deciding the order in which packets will be served. Recent results have shown that with a suitable scheduling algorithm, 100% throughput can be achieved. In this paper, we present a scheduling algorithm called iSLIP. An iterative, round-robin algorithm, iSLIP can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware. Iterative and noniterative versions of the algorithms are presented, along with modified versions for prioritized traffic. Simulation results are presented to indicate the performance of iSLIP under benign and bursty traffic conditions. Prototype and commercial implementations of iSLIP exist in systems with aggregate bandwidths ranging from 50 to 500 Gb/s. When the traffic is nonuniform, iSLIP quickly adapts to a fair scheduling policy that is guaranteed never to starve an input queue. Finally, we describe the implementation complexity of iSLIP. Based on a two-dimensional (2-D) array of priority encoders, single-chip schedulers have been built supporting up to 32 ports, and making approximately 100 million scheduling decisions per second.

Index Terms: ATM switch, crossbar switch, input-queueing, IP router, scheduling.
I. INTRODUCTION
IN AN ATTEMPT to take advantage of the cell-switching
capacity of the asynchronous transfer mode (ATM), there
has recently been a merging of ATM switches and Inter-
net Protocol (IP) routers [29], [32]. This idea is already
being carried one step further, with cell switches forming
the core, or backplane, of high-performance IP routers [26],
[31], [6], [4]. Each of these high-speed switches and routers
is built around a crossbar switch that is configured using a
centralized scheduler, and each uses a fixed-size cell as a
transfer unit. Variable-length packets are segmented as they
arrive, transferred across the central switching fabric, and
then reassembled again into packets before they depart. A
crossbar switch is used because it is simple to implement and
is nonblocking; it allows multiple cells to be transferred across the fabric simultaneously, alleviating the congestion found on a conventional shared backplane. In this paper, we describe an algorithm that is designed to configure a crossbar switch using a single-chip centralized scheduler. The algorithm presented here attempts to achieve high throughput for best-effort unicast traffic, and is designed to be simple to implement in hardware. Our work was motivated by the design of two such systems: the Cisco 12000 GSR, a 50-Gb/s IP router, and the Tiny Tera: a 0.5-Tb/s MPLS switch [7].

Manuscript received November 19, 1996; revised February 9, 1998; approved by IEEE/ACM TRANSACTIONS ON NETWORKING Editor H. J. Chao. The author is with the Department of Electrical Engineering, Stanford University, Stanford, CA 94305-9030 USA (e-mail: nickm@stanford.edu). Publisher Item Identifier S 1063-6692(99)03593-1.
Before using a crossbar switch as a switching fabric, it is
important to consider some of the potential drawbacks; we
consider three here. First, the implementation complexity of an $N$-port crossbar switch increases with $N^2$, making crossbars impractical for systems with a very large number of ports. Fortunately, the majority of high-performance switches and routers today have only a relatively small number of ports (usually between 8 and 32). This is because the highest performance devices are used at aggregation points where port density is low.¹ Our work is, therefore, focused on systems
with low port density. A second potential drawback of crossbar
switches is that they make it difficult to provide guaranteed
qualities of service. This is because cells arriving to the switch
must contend for access to the fabric with cells at both the
input and the output. The time at which they leave the input
queues and enter the crossbar switching fabric is dependent on
other traffic in the system, making it difficult to control when
a cell will depart. There are two common ways to mitigate
this problem. One is to schedule the transfer of cells from
inputs to outputs in a similar manner to that used in a time-
slot interchanger, providing peak bandwidth allocation for
reserved flows. This method has been implemented in at least
two commercial switches and routers.² The second approach
is to employ “speedup,” in which the core of the switch
runs faster than the connected lines. Simulation and analytical
results indicate that with a small speedup, a switch will deliver
cells quickly to their outgoing port, apparently independent of
contending traffic [27], [37]–[41]. While these techniques are
of growing importance, we restrict our focus in this paper to
the efficient and fast scheduling of best-effort traffic.
¹Some people believe that this situation will change in the future, and that switches and routers with large aggregate bandwidths will support hundreds or even thousands of ports. If these systems become real, then crossbar switches, and the techniques that follow in this paper, may not be suitable. However, the techniques described here will be suitable for a few years hence.
²A peak-rate allocation method was supported by the DEC AN2 Gigaswitch/ATM [2] and the Cisco Systems LS2020 ATM Switch.

Fig. 1. An input-queued switch with VOQ. Note that head of line blocking
is eliminated by using a separate queue for each output at each input.
A third potential drawback of crossbar switches is that they
(usually) employ input queues. When a cell arrives, it is placed
in an input queue where it waits its turn to be transferred across
the crossbar fabric. There is a popular perception that input-
queued switches suffer from inherently low performance due
to head-of-line (HOL) blocking. HOL blocking arises when
the input buffer is arranged as a single first in first out (FIFO)
queue: a cell destined to an output that is free may be held
up in line behind a cell that is waiting for an output that is
busy. Even with benign traffic, it is well known that HOL blocking can limit throughput to just $2 - \sqrt{2} \approx 58.6\%$ [16]. Many techniques have been suggested for reducing HOL blocking, for example by considering the first $K$ cells in the FIFO queue, where $K > 1$ [8], [13], [17]. Although these schemes can improve
throughput, they are sensitive to traffic arrival patterns and
may perform no better than regular FIFO queueing when the
traffic is bursty. But HOL blocking can be eliminated by using
a simple buffering strategy at each input port. Rather than
maintain a single FIFO queue for all cells, each input maintains
a separate queue for each output as shown in Fig. 1. This
scheme is called virtual output queueing (VOQ) and was first
introduced by Tamir et al. in [34]. HOL blocking is eliminated
because cells only queue behind cells that are destined to
the same output; no cell can be held up by a cell ahead of
it that is destined to a different output. When VOQ’s are
used, it has been shown possible to increase the throughput
of an input-queued switch from 58.6% to 100% for both
uniform and nonuniform traffic [25], [28]. Crossbar switches
that use VOQ’s have been employed in a number of studies
[1], [14], [19], [23], [34], research prototypes [26], [31], [33],
and commercial products [2], [6]. For the rest of this paper,
we will be considering crossbar switches that use VOQ’s.
When we use a crossbar switch, we require a scheduling
algorithm that configures the fabric during each cell time and
decides which inputs will be connected to which outputs; this
determines which of the $N^2$ VOQ's are served in each cell time. At the beginning of each cell time, a scheduler examines the contents of the $N^2$ input queues and determines a conflict-free match $M$ between inputs and outputs. This is equivalent to finding a bipartite matching on a graph with $2N$ vertices [2], [25], [35]. For example, the algorithms described in [25] and [28] that achieve 100% throughput use maximum weight bipartite matching algorithms [35], which have a running-time complexity of $O(N^3 \log N)$.
A. Maximum Size Matching
Most scheduling algorithms described previously are heuristic algorithms that approximate a maximum size³ matching [1], [2], [5], [8], [18], [30], [36]. These algorithms attempt
to maximize the number of connections made in each cell
time, and hence, maximize the instantaneous allocation of
bandwidth. The maximum size matching for a bipartite graph
can be found by solving an equivalent network flow problem
[35]; we call the algorithm that does this maxsize. There exist many maximum-size bipartite matching algorithms, and the most efficient currently known converges in $O(n^{5/2})$ time [12].⁴
The problem with this algorithm is that although it is
guaranteed to find a maximum match, for our application it
is too complex to implement in hardware and takes too long
to complete.
One question worth asking is “Does the maxsize algorithm
maximize the throughput of an input-queued switch?” The
answer is no; maxsize can cause some queues to be starved of
service indefinitely. Furthermore, when the traffic is nonuni-
form, maxsize cannot sustain very high throughput [25]. This is
because it does not consider the backlog of cells in the VOQ’s,
or the time that cells have been waiting in line to be served.
For practical high-performance systems, we desire algo-
rithms with the following properties.
High Throughput: An algorithm that keeps the backlog
low in the VOQ’s; ideally, the algorithm will sustain an
offered load up to 100% on each input and output.
Starvation Free: The algorithm should not allow a
nonempty VOQ to remain unserved indefinitely.
Fast: To achieve the highest bandwidth switch, it is im-
portant that the scheduling algorithm does not become the
performance bottleneck; the algorithm should therefore
find a match as quickly as possible.
Simple to Implement: If the algorithm is to be fast
in practice, it must be implemented in special-purpose
hardware, preferably within a single chip.
The iSLIP algorithm presented in this paper is designed to meet these goals, and is currently implemented in a 16-port commercial IP router with an aggregate bandwidth of 50 Gb/s [6], and a 32-port prototype switch with an aggregate bandwidth of 0.5 Tb/s [26]. iSLIP is based on the parallel iterative matching algorithm (PIM) [2], and so to understand its operation, we start by describing PIM. Then, in Section II, we describe iSLIP and its performance. We then consider some small modifications to iSLIP for various applications, and finally consider its implementation complexity.
B. Parallel Iterative Matching
PIM was developed by DEC Systems Research Center for the 16-port, 16-Gb/s AN2 switch [2].⁵ Because it forms the basis of the iSLIP algorithm described later, we will describe the scheme in detail and consider some of its performance characteristics.

³In some literature, the maximum size matching is called the maximum cardinality matching or just the maximum bipartite matching.
⁴This algorithm is equivalent to Dinic's algorithm [9].
⁵This switch was commercialized as the Gigaswitch/ATM.

Fig. 2. An example of the three steps that make up one iteration of the PIM scheduling algorithm [2]. In this example, the first iteration does not match input 4 to output 4, even though it does not conflict with other connections. This connection would be made in the second iteration. (a) Step 1: Request. Each input makes a request to each output for which it has a cell. This is shown here as a graph with all weights $w_{ij} = 1$. (b) Step 2: Grant. Each output selects an input uniformly among those that requested it. In this example, inputs 1 and 3 both requested output 2. Output 2 chose to grant to input 3. (c) Step 3: Accept. Each input selects an output uniformly among those that granted to it. In this example, outputs 2 and 4 both granted to input 3. Input 3 chose to accept output 2.
PIM uses randomness to avoid starvation and reduce the
number of iterations needed to converge on a maximal-sized
match. A maximal-sized match (a type of on-line match) is
one that adds connections incrementally, without removing
connections made earlier in the matching process. In general, a
maximal match is smaller than a maximum-sized match, but is
much simpler to implement. PIM attempts to quickly converge
on a conflict-free maximal match in multiple iterations, where
each iteration consists of three steps. All inputs and outputs
are initially unmatched and only those inputs and outputs not
matched at the end of one iteration are eligible for matching in
the next. The three steps of each iteration operate in parallel on
each output and input and are shown in Fig. 2. The steps are:
Step 1: Request. Each unmatched input sends a request to
every output for which it has a queued cell.
Step 2: Grant. If an unmatched output receives any re-
quests, it grants to one by randomly selecting a request
uniformly over all requests.
Step 3: Accept. If an input receives a grant, it accepts one by selecting an output randomly among those that granted to this input.
By considering only unmatched inputs and outputs, each
iteration only considers connections not made by earlier iter-
ations.
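In code, one PIM iteration might look like the sketch below (Python; `random.choice` stands in for the hardware's uniform random selection, and the data layout is ours, not from [2]):

```python
import random

def pim_iteration(requests, match, matched_outputs):
    """One PIM request/grant/accept round over unmatched ports (a sketch).
    requests[i][j]: input i has a cell for output j.
    match: dict input -> output, mutated in place.
    Returns the number of connections added this iteration."""
    n = len(requests)
    grants = {}                                  # input -> outputs granting it
    for j in range(n):                           # Step 2: Grant
        if j in matched_outputs:
            continue
        # Step 1 (Request) is implicit: unmatched inputs with a queued cell.
        requesters = [i for i in range(n) if requests[i][j] and i not in match]
        if requesters:
            grants.setdefault(random.choice(requesters), []).append(j)
    added = 0
    for i, outs in grants.items():               # Step 3: Accept
        j = random.choice(outs)
        match[i] = j
        matched_outputs.add(j)
        added += 1
    return added
```

Repeating `pim_iteration` until it returns zero yields a maximal (not maximum) match.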
Note that the independent output arbiters randomly select
a request among contending requests. This has three effects:
first, the authors in [2] show that each iteration will match or eliminate, on average, at least $3/4$ of the remaining possible connections, and thus, the algorithm will converge to a maximal match, on average, in $O(\log N)$ iterations. Second,
it ensures that all requests will eventually be granted, ensuring that no input queue is starved of service. Third, it means that no memory or state is used to keep track of how recently a connection was made in the past. At the beginning of each cell time, the match begins over, independently of the matches that were made in previous cell times. Not only does this simplify our understanding of the algorithm, but it also makes analysis of the performance straightforward; there is no time-varying state to consider, except for the occupancy of the input queues.

Fig. 3. Example of unfairness for PIM under heavy oversubscribed load with more than one iteration. Because of the random and independent selection by the arbiters, output 1 will grant to each input with probability 1/2, yet input 1 will only accept output 1 a quarter of the time. This leads to different rates at each output.
Using randomness comes with its problems, however. First,
it is difficult and expensive to implement at high speed; each
arbiter must make a random selection among the members of
a time-varying set. Second, when the switch is oversubscribed,
PIM can lead to unfairness between connections. An extreme
example of unfairness for a $2 \times 2$ switch when the inputs are
oversubscribed is shown in Fig. 3. We will see examples later
for which PIM and some other algorithms are unfair when
no input or output is oversubscribed. Finally, PIM does not
perform well for a single iteration; it limits the throughput
to approximately 63%, only slightly higher than for a FIFO
switch. This is because the probability that an input will
remain ungranted is
hence as increases, the
throughput tends to
Although the algorithm
will often converge to a good match after several iterations,
the time to converge may affect the rate at which the switch
can operate. We would prefer an algorithm that performs well
with just a single iteration.
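The limit behind the 63% figure is a one-line calculation: if each output grants uniformly and independently among all $N$ inputs, a given input receives no grant with probability

$$\left(\frac{N-1}{N}\right)^{N} \xrightarrow{\;N \to \infty\;} e^{-1} \approx 0.37, \qquad \text{so throughput} \to 1 - e^{-1} \approx 63\%.$$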
II. THE iSLIP ALGORITHM WITH A SINGLE ITERATION

In this section we describe and evaluate the iSLIP algorithm. This section concentrates on the behavior of iSLIP with just a single iteration per cell time. Later, we will consider iSLIP with multiple iterations.

The iSLIP algorithm uses rotating priority (“round-robin”) arbitration to schedule each active input and output in turn. The main characteristic of iSLIP is its simplicity; it is readily implemented in hardware and can operate at high speed. We find that the performance of iSLIP for uniform traffic is high; for uniform independent identically distributed (i.i.d.) Bernoulli arrivals, iSLIP with a single iteration can achieve 100% throughput. This is the result of a phenomenon that we encounter repeatedly; the arbiters in iSLIP have a tendency to desynchronize with respect to one another.
A. Basic Round-Robin Matching Algorithm
iSLIP is a variation of the simple basic round-robin matching algorithm (RRM). RRM is perhaps the simplest and most obvious form of iterative round-robin scheduling algorithms, comprising a 2-D array of round-robin arbiters; cells are scheduled by round-robin arbiters at each output, and at each input. As we shall see, RRM does not perform well, but it helps us to understand how iSLIP performs, so we start here with a description of RRM. RRM potentially overcomes two problems in PIM: complexity and unfairness. Implemented as priority encoders, the round-robin arbiters are much simpler and can perform faster than random arbiters. The rotating priority aids the algorithm in assigning bandwidth equally and more fairly among requesting connections.

Fig. 4. Example of the three steps of the RRM matching algorithm. (a) Step 1: Request. Each input makes a request to each output for which it has a cell. Step 2: Grant. Each output selects the next requesting input at or after the pointer in the round-robin schedule. Arbiters are shown here for outputs 2 and 4. Inputs 1 and 3 both requested output 2. Since $g_2 = 1$, output 2 grants to input 1. $g_2$ and $g_4$ are updated to favor the input after the one that is granted. (b) Step 3: Accept. Each input selects at most one output. The arbiter for input 1 is shown. Since $a_1 = 1$, input 1 accepts output 1. $a_1$ is updated to point to output 2. (c) When the arbitration is completed, a matching of size two has been found. Note that this is less than the maximum sized matching of three.

The RRM algorithm, like PIM, consists of three steps. As shown in Fig. 4, for an $N \times N$ switch, each round-robin schedule contains $N$ ordered elements. The three steps of arbitration
are:
Step 1: Request. Each input sends a request to every output
for which it has a queued cell.
Step 2: Grant. If an output receives any requests, it chooses the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The output notifies each input whether or not its request was granted. The pointer $g_i$ to the highest priority element of the round-robin schedule is incremented (modulo $N$) to one location beyond the granted input.
Step 3: Accept. If an input receives a grant, it accepts the one that appears next in a fixed, round-robin schedule starting from the highest priority element. The pointer $a_i$ to the highest priority element of the round-robin schedule is incremented (modulo $N$) to one location beyond the accepted output.
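The three steps translate directly into code. The following Python sketch (data layout ours) implements one RRM cell time; note that the grant pointer advances whether or not the grant is accepted, which is the rule behind the synchronization problem discussed in the next subsection:

```python
def rrm_cell_time(requests, g, a):
    """One RRM cell time. requests[i][j]: input i has a cell for output j.
    g[j]: grant pointer at output j; a[i]: accept pointer at input i.
    Returns a dict mapping matched input -> output. A sketch."""
    n = len(requests)
    grants = {}                          # input -> outputs that granted to it
    for j in range(n):                   # Step 2: Grant
        for k in range(n):
            i = (g[j] + k) % n
            if requests[i][j]:
                grants.setdefault(i, []).append(j)
                g[j] = (i + 1) % n       # advances unconditionally (RRM rule)
                break
    match = {}
    for i, outs in grants.items():       # Step 3: Accept
        for k in range(n):
            j = (a[i] + k) % n
            if j in outs:
                match[i] = j
                a[i] = (j + 1) % n
                break
    return match
```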
B. Performance of RRM for Bernoulli Arrivals
As an introduction to the performance of the RRM algorithm, Fig. 5 shows the average delay as a function of offered load for uniform independent and identically distributed (i.i.d.) Bernoulli arrivals. For an offered load of just 63% RRM becomes unstable.⁶

Fig. 5. Performance of RRM and iSLIP compared with PIM for i.i.d. Bernoulli arrivals with destinations uniformly distributed over all outputs. Results obtained using simulation for a 16 × 16 switch. The graph shows the average delay per cell, measured in cell times, between arriving at the input buffers and departing from the switch.

Fig. 6. A 2 × 2 switch with the RRM algorithm under heavy load. In the example of Fig. 7, synchronization of output arbiters leads to a throughput of just 50%.
The reason for the poor performance of RRM lies in the
rules for updating the pointers at the output arbiters. We
illustrate this with an example, shown in Fig. 6. Both inputs
1 and 2 are under heavy load and receive a new cell for
both outputs during every cell time. But because the output
schedulers move in lock-step, only one input is served during
each cell time. The sequence of requests, grants, and accepts
for four consecutive cell times are shown in Fig. 7. Note that
the grant pointers change in lock-step: in cell time 1, both
point to input 1, and during cell time 2, both point to input
2, etc. This synchronization phenomenon leads to a maximum
throughput of just 50% for this traffic pattern.
Synchronization of the grant pointers also limits perfor-
mance with random arrival patterns. Fig. 8 shows the number
of synchronized output arbiters as a function of offered load.
The graph plots the number of nonunique $g_i$'s, i.e., the number of output arbiters that clash with another arbiter. Under low offered load, cells arriving for output $j$ will find $g_j$ in a random position, equally likely to grant to any input. The probability that $g_i \neq g_j$ for all $j \neq i$ is $((N-1)/N)^{N-1}$, which for $N = 16$ implies that the expected number of arbiters with the same highest priority value is 9.9. This agrees well with the simulation result for RRM in Fig. 8. As the offered load increases, synchronized output arbiters tend to move in lockstep and the degree of synchronization changes only slightly.

⁶The probability that an input will remain ungranted is $((N-1)/N)^N$; hence as $N$ increases, the throughput tends to $1-(1/e) \approx 63\%$.

Fig. 7. Illustration of low throughput for RRM caused by synchronization of output arbiters. Note that pointers $g_i$ stay synchronized, leading to a maximum throughput of just 50%.

Fig. 8. Synchronization of output arbiters for RRM and iSLIP for i.i.d. Bernoulli arrivals with destinations uniformly distributed over all outputs. Results obtained using simulation for a 16 × 16 switch.
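The 9.9 figure quoted above for a 16-port switch can be reproduced in a few lines (our calculation, under the stated assumption that each pointer is independently uniform over the $N$ positions):

```python
N = 16
# Probability that a given grant pointer differs from all N-1 others.
p_unique = ((N - 1) / N) ** (N - 1)
expected_nonunique = N * (1 - p_unique)
print(f"{expected_nonunique:.1f}")   # prints 9.9
```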
III. THE iSLIP ALGORITHM

The iSLIP algorithm improves upon RRM by reducing the synchronization of the output arbiters. iSLIP achieves this by not moving the grant pointers unless the grant is accepted. iSLIP
updating the grant pointers. The Grant step of RRM is changed
to:
Step 2: Grant. If an output receives any requests, it chooses
the one that appears next in a fixed round-robin schedule,
starting from the highest priority element. The output notifies
each input whether or not its request was granted. The pointer $g_i$ to the highest priority element of the round-robin schedule is incremented (modulo $N$) to one location beyond the granted input if, and only if, the grant is accepted in Step 3.
This small change to the algorithm leads to the following
properties of iSLIP with one iteration:
Property 1: Lowest priority is given to the most recently made connection. This is because when the arbiters move their pointers, the most recently granted (accepted) input (output) becomes the lowest priority at that output (input). If input $i$ successfully connects to output $j$, both $a_i$ and $g_j$ are updated and the connection from input $i$ to output $j$ becomes the lowest priority connection in the next cell time.
Property 2: No connection is starved. This is because an
input will continue to request an output until it is successful.
The output will serve at most $N-1$ other inputs first, waiting at most $N$ cell times to be accepted by each input. Therefore, a requesting input is always served in less than $N^2$ cell times.
Property 3: Under heavy load, all queues with a common
output have the same throughput. This is a consequence of
Property 2: the output pointer moves to each requesting input
in a fixed order, thus providing each with the same throughput.
But most importantly, this small change prevents the output
arbiters from moving in lock-step leading to a large improve-
ment in performance.
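Relative to the RRM sketch given earlier, the modification is only in when the pointers move. In code form (again a sketch, with the changed lines marked):

```python
def islip_cell_time(requests, g, a):
    """One iSLIP cell time (single iteration). Identical to the RRM
    sketch except that g[j] and a[i] advance ONLY when a grant is
    accepted - the one condition that breaks arbiter synchronization."""
    n = len(requests)
    grants = {}                          # input -> outputs that granted to it
    for j in range(n):                   # Step 2: Grant (pointer NOT moved yet)
        for k in range(n):
            i = (g[j] + k) % n
            if requests[i][j]:
                grants.setdefault(i, []).append(j)
                break
    match = {}
    for i, outs in grants.items():       # Step 3: Accept
        for k in range(n):
            j = (a[i] + k) % n
            if j in outs:
                match[i] = j
                g[j] = (i + 1) % n       # CHANGED: update on accepted grants
                a[i] = (j + 1) % n       # only; the winner gets lowest priority
                break
    return match
```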
IV. SIMULATED PERFORMANCE OF iSLIP
A. With Benign Bernoulli Arrivals
Fig. 5 shows the performance improvement of iSLIP over RRM. Under low load, iSLIP's performance is almost identical to RRM and FIFO; arriving cells usually find empty input queues, and on average there are only a small number of inputs requesting a given output. As the load increases, the number of synchronized arbiters decreases (see Fig. 8), leading to a large-sized match. In other words, as the load increases, we can expect the pointers to move away from each other, making it more likely that a large match will be found quickly in the next cell time. In fact, under uniform 100% offered load, the iSLIP arbiters adapt to a time-division multiplexing scheme, providing a perfect match and 100% throughput. Fig. 9 is an

Citations
Book
01 Jan 2004
TL;DR: This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies, allowing a designer to see all the steps of the process from abstract design to concrete implementation.
Abstract: One of the greatest challenges faced by designers of digital systems is optimizing the communication and interconnection between system components. Interconnection networks offer an attractive and economical solution to this communication crisis and are fast becoming pervasive in digital systems. Current trends suggest that this communication bottleneck will be even more problematic when designing future generations of machines. Consequently, the anatomy of an interconnection network router and science of interconnection network design will only grow in importance in the coming years. This book offers a detailed and comprehensive presentation of the basic principles of interconnection network design, clearly illustrating them with numerous examples, chapter exercises, and case studies. It incorporates hardware-level descriptions of concepts, allowing a designer to see all the steps of the process from abstract design to concrete implementation. ·Case studies throughout the book draw on extensive author experience in designing interconnection networks over a period of more than twenty years, providing real world examples of what works, and what doesn't. ·Tightly couples concepts with implementation costs to facilitate a deeper understanding of the tradeoffs in the design of a practical network. ·A set of examples and exercises in every chapter help the reader to fully understand all the implications of every design decision. Table of Contents Chapter 1 Introduction to Interconnection Networks 1.1 Three Questions About Interconnection Networks 1.2 Uses of Interconnection Networks 1.3 Network Basics 1.4 History 1.5 Organization of this Book Chapter 2 A Simple Interconnection Network 2.1 Network Specifications and Constraints 2.2 Topology 2.3 Routing 2.4 Flow Control 2.5 Router Design 2.6 Performance Analysis 2.7 Exercises Chapter 3 Topology Basics 3.1 Nomenclature 3.2 Traffic Patterns 3.3 Performance 3.4 Packaging Cost 3.5 Case Study: The SGI Origin 2000 3.6 Bibliographic Notes 3.7 Exercises Chapter 4 Butterfly Networks 4.1 The Structure of Butterfly Networks 4.2 Isomorphic Butterflies 4.3 Performance and Packaging Cost 4.4 Path Diversity and Extra Stages 4.5 Case Study: The BBN Butterfly 4.6 Bibliographic Notes 4.7 Exercises Chapter 5 Torus Networks 5.1 The Structure of Torus Networks 5.2 Performance 5.3 Building Mesh and Torus Networks 5.4 Express Cubes 5.5 Case Study: The MIT J-Machine 5.6 Bibliographic Notes 5.7 Exercises Chapter 6 Non-Blocking Networks 6.1 Non-Blocking vs. 
Non-Interfering Networks 6.2 Crossbar Networks 6.3 Clos Networks 6.4 Benes Networks 6.5 Sorting Networks 6.6 Case Study: The Velio VC2002 (Zeus) Grooming Switch 6.7 Bibliographic Notes 6.8 Exercises Chapter 7 Slicing and Dicing 7.1 Concentrators and Distributors 7.2 Slicing and Dicing 7.3 Slicing Multistage Networks 7.4 Case Study: Bit Slicing in the Tiny Tera 7.5 Bibliographic Notes 7.6 Exercises Chapter 8 Routing Basics 8.1 A Routing Example 8.2 Taxonomy of Routing Algorithms 8.3 The Routing Relation 8.4 Deterministic Routing 8.5 Case Study: Dimension-Order Routing in the Cray T3D 8.6 Bibliographic Notes 8.7 Exercises Chapter 9 Oblivious Routing 9.1 Valiant's Randomized Routing Algorithm 9.2 Minimal Oblivious Routing 9.3 Load-Balanced Oblivious Routing 9.4 Analysis of Oblivious Routing 9.5 Case Study: Oblivious Routing in the Avici Terabit Switch Router(TSR) 9.6 Bibliographic Notes 9.7 Exercises Chapter 10 Adaptive Routing 10.1 Adaptive Routing Basics 10.2 Minimal Adaptive Routing 10.3 Fully Adaptive Routing 10.4 Load-Balanced Adaptive Routing 10.5 Search-Based Routing 10.6 Case Study: Adaptive Routing in the Thinking Machines CM-5 10.7 Bibliographic Notes 10.8 Exercises Chapter 11 Routing Mechanics 11.1 Table-Based Routing 11.2 Algorithmic Routing 11.3 Case Study: Oblivious Source Routing in the IBM Vulcan Network 11.4 Bibliographic Notes 11.5 Exercises Chapter 12 Flow Control Basics 12.1 Resources and Allocation Units 12.2 Bufferless Flow Control 12.3 Circuit Switching 12.4 Bibliographic Notes 12.5 Exercises Chapter 13 Buffered Flow Control 13.1 Packet-Buffer Flow Control 13.2 Flit-Buffer Flow Control 13.3 Buffer Management and Backpressure 13.4 Flit-Reservation Flow Control 13.5 Bibliographic Notes 13.6 Exercises Chapter 14 Deadlock and Livelock 14.1 Deadlock 14.2 Deadlock Avoidance 14.3 Adaptive Routing 14.4 Deadlock Recovery 14.5 Livelock 14.6 Case Study: Deadlock Avoidance in the Cray T3E 14.7 Bibliographic Notes 14.8 Exercises Chapter 15 Quality of Service 15.1 Service Classes and Service Contracts 15.2 Burstiness and Network Delays 15.3 Implementation of Guaranteed Services 15.4 Implementation of Best-Effort Services 15.5 Separation of Resources 15.6 Case Study: ATM Service Classes 15.7 Case Study: Virtual Networks in the Avici TSR 15.8 Bibliographic Notes 15.9 Exercises Chapter 16 Router Architecture 16.1 Basic Router Architecture 16.2 Stalls 16.3 Closing the Loop with Credits 16.4 Reallocating a Channel 16.5 Speculation and Lookahead 16.6 Flit and Credit Encoding 16.7 Case Study: The Alpha 21364 Router 16.8 Bibliographic Notes 16.9 Exercises Chapter 17 Router Datapath Components 17.1 Input Buffer Organization 17.2 Switches 17.3 Output Organization 17.4 Case Study: The Datapath of the IBM Colony Router 17.5 Bibliographic Notes 17.6 Exercises Chapter 18 Arbitration 18.1 Arbitration Timing 18.2 Fairness 18.3 Fixed Priority Arbiter 18.4 Variable Priority Iterative Arbiters 18.5 Matrix Arbiter 18.6 Queuing Arbiter 18.7 Exercises Chapter 19 Allocation 19.1 Representations 19.2 Exact Algorithms 19.3 Separable Allocators 19.4 Wavefront Allocator 19.5 Incremental vs. 
Batch Allocation 19.6 Multistage Allocation 19.7 Performance of Allocators 19.8 Case Study: The Tiny Tera Allocator 19.9 Bibliographic Notes 19.10 Exercises Chapter 20 Network Interfaces 20.1 Processor-Network Interface 20.2 Shared-Memory Interface 20.3 Line-Fabric Interface 20.4 Case Study: The MIT M-Machine Network Interface 20.5 Bibliographic Notes 20.6 Exercises Chapter 21 Error Control 411 21.1 Know Thy Enemy: Failure Modes and Fault Models 21.2 The Error Control Process: Detection, Containment, and Recovery 21.3 Link Level Error Control 21.4 Router Error Control 21.5 Network-Level Error Control 21.6 End-to-end Error Control 21.7 Bibliographic Notes 21.8 Exercises Chapter 22 Buses 22.1 Bus Basics 22.2 Bus Arbitration 22.3 High Performance Bus Protocol 22.4 From Buses to Networks 22.5 Case Study: The PCI Bus 22.6 Bibliographic Notes 22.7 Exercises Chapter 23 Performance Analysis 23.1 Measures of Interconnection Network Performance 23.2 Analysis 23.3 Validation 23.4 Case Study: Efficiency and Loss in the BBN Monarch Network 23.5 Bibliographic Notes 23.6 Exercises Chapter 24 Simulation 24.1 Levels of Detail 24.2 Network Workloads 24.3 Simulation Measurements 24.4 Simulator Design 24.5 Bibliographic Notes 24.6 Exercises Chapter 25 Simulation Examples 495 25.1 Routing 25.2 Flow Control Performance 25.3 Fault Tolerance Appendix A Nomenclature Appendix B Glossary Appendix C Network Simulator

3,233 citations

Book
05 Apr 2006
TL;DR: In this article, the authors present abstract models that capture the cross-layer interaction from the physical to transport layer in wireless network architectures including cellular, ad-hoc and sensor networks as well as hybrid wireless-wireline.
Abstract: Information flow in a telecommunication network is accomplished through the interaction of mechanisms at various design layers with the end goal of supporting the information exchange needs of the applications. In wireless networks in particular, the different layers interact in a nontrivial manner in order to support information transfer. In this text we will present abstract models that capture the cross-layer interaction from the physical to transport layer in wireless network architectures including cellular, ad-hoc and sensor networks as well as hybrid wireless-wireline. The model allows for arbitrary network topologies as well as traffic forwarding modes, including datagrams and virtual circuits. Furthermore the time varying nature of a wireless network, due either to fading channels or to changing connectivity due to mobility, is adequately captured in our model to allow for state dependent network control policies. Quantitative performance measures that capture the quality of service requirements in these systems depending on the supported applications are discussed, including throughput maximization, energy consumption minimization, rate utility function maximization as well as general performance functionals. Cross-layer control algorithms with optimal or suboptimal performance with respect to the above measures are presented and analyzed. A detailed exposition of the related analysis and design techniques is provided.

1,612 citations


Cites background from "The iSLIP scheduling algorithm for ..."

  • ...Multi-step arbitration schemes are frequently used in packet switches for computer systems [41] [3] [103] [139]....


Proceedings ArticleDOI
30 Aug 2010
TL;DR: This work presents Helios, a hybrid electrical/optical switch architecture that can deliver significant reductions in the number of switching elements, cabling, cost, and power consumption relative to recently proposed data center network architectures.
Abstract: The basic building block of ever larger data centers has shifted from a rack to a modular container with hundreds or even thousands of servers. Delivering scalable bandwidth among such containers is a challenge. A number of recent efforts promise full bisection bandwidth between all servers, though with significant cost, complexity, and power consumption. We present Helios, a hybrid electrical/optical switch architecture that can deliver significant reductions in the number of switching elements, cabling, cost, and power consumption relative to recently proposed data center network architectures. We explore architectural trade offs and challenges associated with realizing these benefits through the evaluation of a fully functional Helios prototype.

1,045 citations


Cites background from "The iSLIP scheduling algorithm for ..."

  • ...PIM [3] and iSLIP [18] schedule packets or cells across a crossbar switch fabric over very short timescales....


Proceedings ArticleDOI
30 Aug 2010
TL;DR: This work proposes a hybrid packet and circuit switched data center network architecture (or HyPaC) which augments the traditional hierarchy of packet switches with a high speed, low complexity, rack-to-rack optical circuit-switched network to supply high bandwidth to applications.
Abstract: Data-intensive applications that operate on large volumes of data have motivated a fresh look at the design of data center networks. The first wave of proposals focused on designing pure packet-switched networks that provide full bisection bandwidth. However, these proposals significantly increase network complexity in terms of the number of links and switches required and the restricted rules to wire them up. On the other hand, optical circuit switching technology holds a very large bandwidth advantage over packet switching technology. This fact motivates us to explore how optical circuit switching technology could benefit a data center network. In particular, we propose a hybrid packet and circuit switched data center network architecture (or HyPaC for short) which augments the traditional hierarchy of packet switches with a high speed, low complexity, rack-to-rack optical circuit-switched network to supply high bandwidth to applications. We discuss the fundamental requirements of this hybrid architecture and their design options. To demonstrate the potential benefits of the hybrid architecture, we have built a prototype system called c-Through. c-Through represents a design point where the responsibility for traffic demand estimation and traffic demultiplexing resides in end hosts, making it compatible with existing packet switches. Our emulation experiments show that the hybrid architecture can provide large benefits to unmodified popular data center applications at a modest scale. Furthermore, our experimental experience provides useful insights on the applicability of the hybrid architecture across a range of deployment scenarios.

680 citations


Cites background or methods from "The iSLIP scheduling algorithm for ..."

  • ...Several algorithms can compute the maximum matching efficiently [23], and several approximation algorithms [23, 32] can compute near-optimal solutions even faster....


  • ...At larger scales or if switching times drop drastically, faster but slightly heuristic algorithms such as iSLIP could be used [32]....


Proceedings ArticleDOI
21 Apr 2013
TL;DR: The simulator, BookSim, is designed for simulation flexibility and accurate modeling of network components and offers a large set of configurable network parameters in terms of topology, routing algorithm, flow control, and router microarchitecture, including buffer management and allocation schemes.
Abstract: Network-on-Chips (NoCs) are becoming integral parts of modern microprocessors as the number of cores and modules integrated on a single chip continues to increase. Research and development of future NoC technology relies on accurate modeling and simulations to evaluate the performance impact and analyze the cost of novel NoC architectures. In this work, we present BookSim, a cycle-accurate simulator for NoCs. The simulator is designed for simulation flexibility and accurate modeling of network components. It features a modular design and offers a large set of configurable network parameters in terms of topology, routing algorithm, flow control, and router microarchitecture, including buffer management and allocation schemes. BookSim furthermore emphasizes detailed implementations of network components that accurately model the behavior of actual hardware. We have validated the accuracy of the simulator against RTL implementations of NoC routers.

645 citations


Cites background from "The iSLIP scheduling algorithm for ..."

  • ...Efficient utilization of the buffer resources is thus critical in minimizing the total cost of the network for a given set of performance constraints....


References
Journal ArticleDOI
TL;DR: It is demonstrated that Ethernet LAN traffic is statistically self-similar, that none of the commonly used traffic models is able to capture this fractal-like behavior, and that such behavior has serious implications for the design, control, and analysis of high-speed, cell-based networks.
Abstract: Demonstrates that Ethernet LAN traffic is statistically self-similar, that none of the commonly used traffic models is able to capture this fractal-like behavior, that such behavior has serious implications for the design, control, and analysis of high-speed, cell-based networks, and that aggregating streams of such traffic typically intensifies the self-similarity ("burstiness") instead of smoothing it. These conclusions are supported by a rigorous statistical analysis of hundreds of millions of high quality Ethernet traffic measurements collected between 1989 and 1992, coupled with a discussion of the underlying mathematical and statistical properties of self-similarity and their relationship with actual network behavior. The authors also present traffic models based on self-similar stochastic processes that provide simple, accurate, and realistic descriptions of traffic scenarios expected during B-ISDN deployment. >

5,567 citations

Journal ArticleDOI
TL;DR: This paper shows how to construct a maximum matching in a bipartite graph with n vertices and m edges in a number of computation steps proportional to $(m + n)\sqrt n $.
Abstract: The present paper shows how to construct a maximum matching in a bipartite graph with n vertices and m edges in a number of computation steps proportional to $(m + n)\sqrt n $.

2,785 citations

Book
Robert E. Tarjan
01 Jan 1983
TL;DR: A textbook treatment of data structures and network algorithms, covering disjoint sets, heaps, search trees, link/cut trees, minimum spanning trees, shortest paths, network flows, and matchings.
Abstract: Foundations Disjoint Sets Heaps Search Trees Linking and Cutting Trees Minimum Spanning Trees Shortest Paths Network Flows Matchings

2,120 citations


"The iSLIP scheduling algorithm for ..." refers background or methods in this paper

  • ...The maximum size matching for a bipartite graph can be found by solving an equivalent network flow problem [35]; we call the algorithm that does this maxsize....


  • ...This is equivalent to finding a bipartite matching on a graph with vertices [2], [25], [35]....


  • ...But maximum weight matches are significantly harder to calculate than maximum sized matches [35], and to be practical, must be implemented using an upper limit on the number of bits used to represent...


  • ...For example, the algorithms described in [25] and [28] that achieve 100% throughput, use maximum weight bipartite matching algorithms [35], which have a running-time complexity of...


Journal ArticleDOI
TL;DR: A calculus is developed for obtaining bounds on delay and buffering requirements in a communication network operating in a packet switched mode under a fixed routing strategy, and burstiness constraints satisfied by the traffic that exits the element are derived.
Abstract: A calculus is developed for obtaining bounds on delay and buffering requirements in a communication network operating in a packet switched mode under a fixed routing strategy. The theory developed is different from traditional approaches to analyzing delay because the model used to describe the entry of data into the network is nonprobabilistic. It is supposed that the data stream entered into the network by any given user satisfies burstiness constraints. A data stream is said to satisfy a burstiness constraint if the quantity of data from the stream contained in any interval of time is less than a value that depends on the length of the interval. Several network elements are defined that can be used as building blocks to model a wide variety of communication networks. Each type of network element is analyzed by assuming that the traffic entering it satisfies bursting constraints. Under this assumption, bounds are obtained on delay and buffering requirements for the network element; burstiness constraints satisfied by the traffic that exits the element are derived. >

2,049 citations


"The iSLIP scheduling algorithm for ..." refers background in this paper

  • ...8 There are many definitions of burstiness, for example the coefficient of variation [36], burstiness curves [20], maximum burst length [10], or effective bandwidth [21]....


Journal ArticleDOI
TL;DR: Two simple models of queueing on an N \times N space-division packet switch are examined, and it is possible to slightly increase utilization of the output trunks and drop interfering packets at the end of each time slot, rather than storing them in the input queues.
Abstract: Two simple models of queueing on an N \times N space-division packet switch are examined. The switch operates synchronously with fixed-length packets; during each time slot, packets may arrive on any inputs addressed to any outputs. Because packet arrivals to the switch are unscheduled, more than one packet may arrive for the same output during the same time slot, making queueing unavoidable. Mean queue lengths are always greater for queueing on inputs than for queueing on outputs, and the output queues saturate only as the utilization approaches unity. Input queues, on the other hand, saturate at a utilization that depends on N , but is approximately (2 -\sqrt{2}) = 0.586 when N is large. If output trunk utilization is the primary consideration, it is possible to slightly increase utilization of the output trunks-upto (1 - e^{-1}) = 0.632 as N \rightarrow \infty -by dropping interfering packets at the end of each time slot, rather than storing them in the input queues. This improvement is possible, however, only when the utilization of the input trunks exceeds a second critical threshold-approximately ln (1 +\sqrt{2}) = 0.881 for large N .

1,592 citations

Frequently Asked Questions (13)
Q1. What are the contributions mentioned in the paper "The islip scheduling algorithm for input-queued switches" ?

HOL blocking can be overcome by the use of virtual output queueing, which is described in this paper. In this paper, the authors present a scheduling algorithm called iSLIP. Finally, the authors describe the implementation complexity of iSLIP. 

As the offered load increases, synchronized output arbiters tend to move in lockstep and the degree of synchronization changes only slightly.

With bursty arrivals, the performance of an input-queued switch becomes more and more like an output-queued switch under the same arrival conditions [9].

The number of gates for a 32-port scheduler is less than 100 000, making it readily implementable in current CMOS technologies, and the total number of gates grows approximately with $N^2$.

As the authors would expect, the increased burst size leads to a higher queueing delay whereas an increased number of iterations leads to a lower queueing delay. 

This is because the service policy is not constant; when a queue changes between empty and nonempty, the scheduler must adapt to the new set of queues that require service. 

The approximation is based on two assumptions: 1) inputs that are unmatched at any given time are uniformly distributed over all inputs; and 2) the number of unmatched inputs at any given time has zero variance.

In particular, if the weight of the edge between input $i$ and output $j$ is the occupancy of the corresponding input queue, then the authors conjecture that the algorithm can achieve 100% throughput for all i.i.d. arrival processes.

In practice there may be insufficient time for $N$ iterations, and so the authors consider the penalty of performing only $i$ iterations, where $i < N$. In fact, because of the desynchronization of the arbiters, iSLIP will usually converge in fewer than $N$ iterations.

The basic algorithm can be extended to include requests at multiple priority levels with only a small performance and complexity penalty. 

But when the traffic is nonuniform, or when the offered load is at neither extreme, the interaction between the arbiters becomes difficult to describe. 

The pointer to the highest priority element of the round-robin schedule is incremented (modulo $N$) to one location beyond the granted input if and only if the grant is accepted in Step 3 of the first iteration.

PIM attempts to quickly converge on a conflict-free maximal match in multiple iterations, where each iteration consists of three steps.