Proceedings ArticleDOI

A practical scheduling algorithm to achieve 100% throughput in input-queued switches

29 Mar 1998, Vol. 2, pp. 792-799
TL;DR: This work introduces a new algorithm called longest port first (LPF), which is designed to overcome the complexity problems of LQF, and can be implemented in hardware at high speed.
Abstract: Input queueing is becoming increasingly used for high-bandwidth switches and routers. In previous work, it was proved that it is possible to achieve 100% throughput for input-queued switches using a combination of virtual output queueing and a scheduling algorithm called LQF. However, this is only a theoretical result: LQF is too complex to implement in hardware. We introduce a new algorithm called longest port first (LPF), which is designed to overcome the complexity problems of LQF, and can be implemented in hardware at high speed. By giving preferential service based on queue lengths, we prove that LPF can achieve 100% throughput.

Summary (3 min read)

1 Introduction

  • Traditionally, switches and routers have been most often designed as a collection of line-cards connected to a single shared bus.
  • If the aggregate bandwidths of the bus and memory are high enough, the system is able to keep all of the outgoing links continuously busy, making the system highly efficient.
  • Furthermore, the system is able to control packet departure times and hence provides guaranteed qualities-of-service (QoS) [3][15][20][21].
  • Switch and router designers are finding that the continued growth in bandwidth is making it increasingly difficult to design a shared bus and centralized memory that run fast enough.
  • The data rate of a shared bus is limited by electrical considerations, such as the loading on the bus, and reflections from connectors.

  • And the data rate of a centralized shared memory is limited because it requires buffer memories that run N times faster than the line rate, where N is the number of switch ports.

  • Increasingly, a passive shared bus is being replaced by an active non-blocking switch fabric, most often a crossbar switch.
  • The very fastest switches and routers usually transfer packets across the switching fabric in fixed-size units, which the authors refer to as "cells".
  • Increased overflows occur because a maximum size matching algorithm does not consider queue lengths when deciding which input queues to service.
  • With LPF their goal is to combine the benefits of a maximum size matching algorithm with those of a maximum weight algorithm, while lending itself to simple implementation in hardware.
  • This enables LPF to take advantage of both the high instantaneous throughput of a maximum size matching algorithm, and the ability of a maximum weight matching algorithm to achieve high throughput, and a small number of overflows even when the arriving traffic is non-uniform.

LPF has a running-time complexity of $O(N^{2.5})$, lower than that of LQF.

  • Furthermore, the comparators that limit the performance of LQF are removed from the critical path of the LPF algorithm.
  • In fact, the heart of the LPF algorithm uses a slightly modified maximum size matching algorithm, for which there are a variety of existing, heuristic approximations [1][9][10][17].
  • In Section 3, the authors describe LPF and its properties before presenting their performance analysis.

2 Our Switch Model

  • Figure 1 shows an input-queued switch consisting of M input and N output ports, a non-blocking switching fabric and a scheduler.
  • The scheduler determines which inputs and outputs are connected during each slot.

3 The LPF Algorithm

  • Together, the sum of the input and output occupancies represents the work load or congestion that a cell faces as it competes for transmission to its output.
  • The authors call this sum the port occupancy; LPF favors queues with high port occupancy.

Property 1: The total weight of an LPF match is equal to the occupancy sum of all matched inputs and outputs, i.e., $\sum_{i,j} S_{i,j}(n)\, w_{i,j}(n) = \sum_{i \in I} R_i + \sum_{j \in J} C_j$, where $I$ and $J$ are the set of matched inputs and matched outputs respectively.
  • LPF finds a match that is both maximum size and maximum weight (Theorem 1).

3.1 Finding an LPF Match Using a Maximum Size Matching Algorithm

  • Existing maximum size matching algorithms cannot be used to implement LPF because they are unable to select the maximum size match with the largest weight.
  • Then the authors use a modified Edmonds-Karp maximum size matching algorithm [2][19] to find the LPF match.
  • First, LPFS builds a tree with t as its root.
  • Initially every input and output is colored white (undiscovered), then is grayed when it is discovered, and finally is blackened when it is finished.
  • From the tree, an augmenting path from s to t, which must go through an unmatched input, can be found by walking the predecessor list which begins at a selected unmatched input.

3.2 A Practical Approximation to LPF

  • LPF can be adapted to run at higher speed using simple heuristic approximations.
  • The second step consists of a double for-loop used to find a maximal size match.
  • Since the requests have already been ordered in the first step, the maximal size matching in the second step does not need to compare request weights.
  • The authors' exploratory design work suggests that the second step can be implemented using simple hardware; for a 32 × 32 switch, their synthesized design can make a scheduling decision in just 10ns using a commercial 0.25 µm CMOS ASIC technology.
  • The first step, which requires simple integer arithmetic, can also run in 10ns, allowing the switch to run at a line rate of 20 Gb/s.

Figure 6:

  • First, the algorithm builds a sorted list of all inputs and outputs based on their occupancies.
  • Then, starting from the largest output and input, the algorithm finds a maximal size match.

Iterative LPF algorithm

  • Step 1: (1) sort inputs & outputs based on their occupancies; (2) reorder requests according to their input and output occupancies. Step 2: maximal size matching.
  • The authors define a switch to be stable for a particular arrival process if the expected length of the input queues does not grow without bound, i.e., $E[L_{i,j}(n)] < \infty$ for all $i$, $j$, $n$.
  • A switch can achieve 100% throughput if it is stable for all independent and admissible arrivals (Definition 5).
  • The LPF algorithm is stable for all admissible independent arrival processes (Theorem 3).

3.4 Stability With a Finite Pipeline Delay

  • Because the modified maximum size matching algorithm requires the inputs and outputs to be pre-ordered, LPF and iLPF need sorting networks to sort all inputs and outputs.
  • Due to the relatively high complexity of the sorting networks, they could dominate the running time of the algorithm.
  • This means that the maximum size matching algorithm is operating on weights that are now one slot out of date.
  • In the pipelined design, inputs and outputs are pre-sorted by the two sorter networks, and raw requests (requests with weights removed) are given in matrix form.
  • The match needs to be permuted back to its natural order.


  • Because of the speed benefits of pipelining, the authors consider here its effect on throughput.
  • A k-slot pipeline delay is equivalent to non-pipelined LPF but with k-slot-old weights.
  • Hence, it finds the match that maximizes $\sum_{i,j} S_{i,j}(n)\, w_{i,j}(n-k)$.
  • Perhaps surprisingly, the authors can verify the following. Theorem 4: Using k-slot-old weights, the LPF algorithm is stable for all admissible independent arrival processes, for any finite k.
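A minimal sketch of this pipelining idea (our own illustration; the helper and its name are hypothetical, not from the paper): feed the matcher occupancies recorded k slots earlier by keeping a short history buffer.

    from collections import deque

    def k_slot_old_weights(history: deque, L_now, k: int = 1):
        # history holds past occupancy snapshots, newest last; the matcher
        # is driven by the snapshot taken k slots ago (k-slot-old weights).
        history.append([row[:] for row in L_now])
        if len(history) > k + 1:
            history.popleft()
        return history[0]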

4 Conclusion

  • Input-queued non-blocking switches offer much higher aggregate bandwidth than systems based on shared buses and centralized shared memory.
  • While VOQs make it theoretically possible for an input-queued switch to achieve high throughput, most existing scheduling algorithms yield low throughput or are too complex to run at high speed.
  • The authors' new scheduling algorithm, LPF, is practical and can achieve 100% throughput for all traffic with independent arrivals.
  • Because LPF uses a maximum size matching algorithm, it leads to a fast, iterative, heuristic algorithm called iLPF that is simple to implement in hardware.
  • Initial investigation suggests that iLPF can configure a switch in 10ns using today's ASIC technology.




Abstract

Input queueing is becoming increasingly used for high-bandwidth switches and routers. In previous work, it was proved that it is possible to achieve 100% throughput for input-queued switches using a combination of virtual output queueing and a scheduling algorithm called LQF. However, this is only a theoretical result: LQF is too complex to implement in hardware. In this paper we introduce a new algorithm called Longest Port First (LPF), which is designed to overcome the complexity problems of LQF, and can be implemented in hardware at high speed. By giving preferential service based on queue lengths, we prove that LPF can achieve 100% throughput.
1 Introduction
Traditionally, switches and routers have been most often designed as a collection of line-cards connected to a single shared bus. Packets waiting to be transmitted on outgoing links are stored in a centralized, shared pool of memory. If the aggregate bandwidths of the bus and memory are high enough, the system is able to keep all of the outgoing links continuously busy, making the system highly efficient. Furthermore, the system is able to control packet departure times and hence provides guaranteed qualities-of-service (QoS) [3][15][20][21]. However, switch and router designers are finding that the continued growth in bandwidth is making it increasingly difficult to design a shared bus and centralized memory that run fast enough. The data rate of a shared bus is limited by electrical considerations, such as the loading on the bus, and reflections from connectors. And the data rate of a centralized shared memory is limited because it requires buffer memories that run N times faster than the line rate, where N is the number of switch ports.
Increasingly, a passive shared bus is being replaced by an active non-blocking switch fabric — most often a crossbar switch. Each line card is connected by a dedicated point-to-point link to the central switch fabric, and therefore has fewer electrical limitations due to loading and reflections. More importantly, each connection to the switch need run only as fast as the line rate, rather than at the aggregate bandwidth of the switch. Centralized shared memory is also being replaced — by separate queues at each input of the switching fabric. Input queues need only run at the line rate, and therefore allow a faster overall system to be built [6][11].
The very fastest switches and routers usually transfer packets across the switching fabric in fixed size units, that we shall refer to as "cells." Variable length packets are segmented into cells upon arrival, transferred across the switch fabric and then reassembled again before they depart. At the beginning of each cell time, a (usually centralized) scheduler selects a configuration for the switching fabric and then transfers cells from inputs to outputs. Using fixed sized cells simplifies the switch design, and makes it easier for the scheduler to configure the switch fabric for high throughput.

But systems that use input queues have two potential problems: low throughput due to head-of-line (HOL) blocking and the difficulty of controlling cell delay. In this paper, we focus on the first problem: achieving high throughput.

It is well known that if an input-queued switch employs a single FIFO queue at each input, HOL blocking limits the throughput to just 58.6% of the maximum [7]. But HOL blocking can be eliminated entirely using a queueing technique known as virtual output queueing (VOQ) in which each input maintains a separate queue for each output [1][10][12][13][17]. It has been shown that with a suitable centralized scheduling algorithm, the throughput can be increased from 58.6% to 100% [12].
Adisak Mekkittikul and Nick McKeown
Computer Systems Laboratory
Stanford University, Stanford, CA 94305-9030
{adisak, nickm}@stanford.edu

This work was funded by a fellowship from National Semiconductor and also by Texas Instruments, Cisco Systems, the Alfred P. Sloan Foundation and a Robert N. Noyce faculty fellowship.

Unfortunately, the algorithms known to-date (LQF [12] and OCF [13]) are too complex to implement in hardware, and are therefore unsuitable for switches operating at high speed. Instead, most switches and routers use a much simpler scheduling algorithm to configure the switch fabric [1][10][18]. Typically, a configuration is selected in an attempt to maximize the number of connections made during each cell time. Such an algorithm is called a maximum size bipartite matching algorithm, and is found to perform well when the arriving traffic is uniformly distributed over all the switch outputs.

But real traffic is not uniform: traffic tends to be focused on a relatively small number of active ports. And unfortunately, a maximum size matching algorithm is known to perform poorly when traffic is non-uniform [12]. The algorithm performs poorly in two (albeit related) ways: increased buffer overflows, and reduced throughput. Increased overflows occur because a maximum size matching algorithm does not consider queue lengths when deciding which input queues to service. When traffic is non-uniform, the occupancies of the various input queues can differ greatly, and queues with heavy traffic can overflow while ones with light traffic remain empty most of the time. The reason for reduced throughput is a little more complex. For a given number of cells in the system, if the traffic is non-uniform, the cells are concentrated on a relatively small number of VOQs. This reduces the number of configurations available to the scheduler, and therefore reduces the size of the maximum size match. If instead the traffic was uniform, the cells in the system would be distributed uniformly over a relatively large number of VOQs, making available a larger number of configurations for the scheduler to choose from.

In earlier work [12][13], it was found that LQF (longest queue first) can achieve 100% for both uniform and non-uniform traffic by considering the occupancies of the queues. LQF gives preferential service to long queues by using a maximum weight matching algorithm, where each weight is set to the corresponding queue length. But LQF is very difficult to implement in hardware at high speed. First of all, it takes too long to run — the most efficient algorithm known to-date has a running-time complexity of $O(N^3 \log N)$. Second, an implementation requires a large number of multi-bit comparators to perform many weight comparisons in parallel. Attempts to implement LQF (and even heuristic approximations [10]) have been limited by the design of a single-chip scheduler that: (i) has fast enough comparators, (ii) can support a sufficient number of comparators, and (iii) can interconnect them in a rich enough pattern.
Motivated by the desire to overcome the impracticalities of LQF, yet achieve its high performance, we propose a new algorithm: LPF (longest port first). With LPF our goal is to combine the benefits of a maximum size matching algorithm, with those of a maximum weight algorithm, while lending itself to simple implementation in hardware. LPF effectively finds the set of maximum size matches, and from among this set chooses the match with the largest total weight. In LPF each weight is a function of queue lengths (we shall see later that the weights in LPF are not exactly equal to the queue lengths, but are similar). This enables LPF to take advantage of both the high instantaneous throughput of a maximum size matching algorithm, and the ability of a maximum weight matching algorithm to achieve high throughput, and a small number of overflows even when the arriving traffic is non-uniform. We find that LPF — like LQF — can achieve 100% throughput for both uniform and non-uniform traffic.

LPF has a running-time complexity of $O(N^{2.5})$, lower than LQF. Furthermore, the comparators that limit the performance of LQF are removed from the critical path of the LPF algorithm. In fact, the heart of the LPF algorithm uses a slightly modified maximum size matching algorithm, for which there are a variety of existing, heuristic approximations [1][9][10][17].

The paper is organized as follows. In Section 2, we provide some definitions. In Section 3, we describe LPF and its properties before presenting our performance analysis.
2 Our Switch Model
We follow the general definitions used in [12]. Figure 1 shows an $M \times N$ input-queued switch consisting of $M$ input and $N$ output ports, a non-blocking switching fabric and a scheduler. To eliminate head-of-line (HOL) blocking, each input maintains $N$ FIFO virtual output queues, one for each output. $Q_{i,j}$ denotes the VOQ at input $i$ containing cells destined to output $j$. Arrivals are fixed size packets or cells, allowing us to split time into discrete cell times, or slots. During any given slot, there is at most one arrival to and departure from each input, and similarly for each output. $A_{i,j}(n)$ is the arrival process of cells to input $i$ destined to output $j$ at rate $\lambda_{i,j}$. Consequently, $A_i(n)$ is the aggregate process of all arrivals to input $i$ at rate $\lambda_i = \sum_{j=1}^{N} \lambda_{i,j}$.

Definition 1: An arrival process is said to be admissible when no input or output is oversubscribed, i.e., when

$$\sum_{i=1}^{M} \lambda_{i,j} < 1, \qquad \sum_{j=1}^{N} \lambda_{i,j} < 1, \qquad \lambda_{i,j} \ge 0.$$

Definition 2: The traffic is uniform if all arrival processes have the same arrival rate, and if the destinations of cells are uniformly distributed over all outputs. Otherwise the traffic is non-uniform.
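As a concrete illustration of Definition 1, the sketch below (our own Python fragment, not from the paper; the function name is ours) checks a rate matrix for admissibility:

    import numpy as np

    def is_admissible(rates: np.ndarray) -> bool:
        # Definition 1: every rate is non-negative and no input row or
        # output column of the M x N rate matrix sums to 1 or more.
        if (rates < 0).any():
            return False
        input_loads = rates.sum(axis=1)    # sum_j lambda_ij for each input i
        output_loads = rates.sum(axis=0)   # sum_i lambda_ij for each output j
        return bool((input_loads < 1.0).all() and (output_loads < 1.0).all())

    uniform = np.full((3, 3), 0.3)         # uniform traffic at 90% load
    assert is_admissible(uniform)
    hot_spot = uniform.copy()
    hot_spot[:, 0] = 0.4                   # oversubscribes output 0 (load 1.2)
    assert not is_admissible(hot_spot)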
Figure 1: A Simple Model of VOQ Switches. (The figure shows inputs 1 through M, each holding VOQs $Q_{i,1}, \dots, Q_{i,N}$ fed by arrival process $A_i(t)$; a crossbar fabric configured by the scheduler; and outputs 1 through N with departure processes $D_1(t), \dots, D_N(t)$.)

The scheduler determines which inputs and outputs are connected during each slot. The scheduling problem can be viewed as a bipartite graph matching problem [2][19], an example of which is shown in Figure 2. Each input makes a request to every output for which it has cells queued. An edge in the graph represents a request from $Q_{i,j}$ with weight $w_{i,j}(n)$ (denoted in Figure 2 as $w_{i,j}$). Let $S_{i,j}(n)$ be a service indicator such that $\sum_{i=1}^{M} S_{i,j}(n) \le 1$ and $\sum_{j=1}^{N} S_{i,j}(n) \le 1$; a value of one indicates that input $i$ is matched to output $j$, i.e., $Q_{i,j}$ is allowed to forward one cell to its output.

Figure 2: A request graph and a matching graph of an $M \times N$ switch. Define $G = [V, E]$ as an undirected graph connecting the set of vertices $V$ with the set of edges $E$. The edge connecting vertices $i$, $1 \le i \le M$, and $j$, $1 \le j \le N$, has an associated weight denoted $w_{i,j}$. Graph $G$ is bipartite if the set of inputs $I = \{i : 1 \le i \le M\}$ and outputs $J = \{j : 1 \le j \le N\}$ partition $V$ such that every edge has one end in $I$ and one end in $J$. Matching $M$ on $G$ is any subset of $E$ such that no two edges in $M$ have a common vertex.

Definition 3: A maximum size match is one that maximizes $\sum_{i,j} S_{i,j}(n)$, i.e., the number of connections.

Definition 4: A maximum weight match is one that maximizes $\sum_{i,j} S_{i,j}(n)\, w_{i,j}(n)$, i.e., the total weight.

Alternatively, a bipartite graph matching problem can be easily solved and understood by transforming it into a flow network [2][19], as illustrated in Figure 3.

Figure 3: Transformation of a request graph into a flow network. (a) A weighted request graph. (b) The corresponding flow network, $G$, whose edges all have unit capacity. A source $s$ and a target $t$ are added. The cost of every edge from $s$ and to $t$ is set to zero. The costs of all other edges are equal to the negated values of the corresponding weights.
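To make Definitions 3 and 4 concrete, the following small Python fragment (ours, for illustration only; the function name is hypothetical) checks the matching constraints on a service-indicator matrix S and evaluates both objectives:

    import numpy as np

    def match_size_and_weight(S: np.ndarray, W: np.ndarray):
        # S is an M x N 0/1 service-indicator matrix, W the weight matrix.
        # Matching constraint: at most one 1 in every row and every column.
        assert ((S == 0) | (S == 1)).all()
        assert (S.sum(axis=0) <= 1).all() and (S.sum(axis=1) <= 1).all()
        return int(S.sum()), int((S * W).sum())

    W = np.array([[5, 0, 2],
                  [3, 0, 0],
                  [0, 4, 1]])
    S = np.array([[0, 0, 1],
                  [1, 0, 0],
                  [0, 1, 0]])
    print(match_size_and_weight(S, W))   # (3, 9): three connections, weight 2+3+4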
3 The LPF Algorithm
Although in practice LPF can be thought of as a special maximum size matching algorithm, in theory it is easier to consider LPF as a maximum weight matching algorithm. Each LPF request weight, $w_{i,j}(n)$, for a request from input $i$ to output $j$ is defined as follows:

$$w_{i,j}(n) = \begin{cases} R_i(n) + C_j(n), & L_{i,j}(n) > 0, \\ 0, & \text{otherwise}, \end{cases} \qquad (1)$$

where $L_{i,j}(n)$ is the occupancy of $Q_{i,j}$ at slot $n$, $R_i(n) = \sum_{j=1}^{N} L_{i,j}(n)$, and $C_j(n) = \sum_{i=1}^{M} L_{i,j}(n)$. $R_i(n)$, which we call the input occupancy, is the total number of cells that are currently waiting at input $i$ to be forwarded to their respective outputs. Similarly, $C_j(n)$, the output occupancy, is the total number of cells at all inputs waiting to be forwarded to output $j$. Together, the sum of the input and output occupancies represents the work load or congestion that a cell faces as it competes for transmission to its output. We call this sum the port occupancy; LPF favors queues with high port occupancy.
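As a small worked example of these definitions and of Equation 1 (our own Python illustration, not from the paper):

    import numpy as np

    # L[i, j] is the occupancy of VOQ Q_ij at the current slot.
    L = np.array([[2, 0, 1],
                  [0, 3, 0],
                  [1, 0, 0]])

    R = L.sum(axis=1)    # input occupancies R_i  -> [3, 3, 1]
    C = L.sum(axis=0)    # output occupancies C_j -> [3, 3, 1]

    # Equation 1: w_ij = R_i + C_j (the port occupancy) if L_ij > 0, else 0.
    W = np.where(L > 0, R[:, None] + C[None, :], 0)
    # W == [[6, 0, 4],
    #       [0, 6, 0],
    #       [4, 0, 0]]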
Property 1: The total weight of an LPF match is equal to the occupancy sum of all matched inputs and outputs, i.e.,

$$\sum_{i,j} S_{i,j}(n)\, w_{i,j}(n) = \sum_{i \in I} R_i + \sum_{j \in J} C_j,$$

where $I$ and $J$ are the set of matched inputs and matched outputs respectively.

We now show that LPF is a special case of a maximum size matching algorithm.

Theorem 1: LPF finds a match that is both maximum size and maximum weight.

Proof: see Appendix A.

Since an LPF match is a maximum size match, we can use a maximum size matching algorithm to find an LPF match. But we need to make sure that among all possible maximum size matches we choose one with the largest total weight.
3.1 Finding an LPF Match Using a Maximum Size Matching Algorithm

Existing maximum size matching algorithms cannot be used to implement LPF because they are unable to select the maximum size match with the largest weight. A simple modification is called for. First, in order to keep the algorithm free of complex magnitude comparisons, all inputs and outputs are pre-ordered according to their LPF weights prior to running the maximum size matching algorithm. Then we use a modified Edmonds-Karp maximum size matching algorithm [2][19] to find the LPF match (see Figure 4). A breadth-first search (BFS) in the Edmonds-Karp algorithm is replaced by a largest-unmatched-port first search (LPFS) described in Figure 5. LPFS enables the modified algorithm to search for a maximum weight match while performing path augmentation [19] to find a maximum size match. As a result, line 2 of the LPFS-Visit does not involve any magnitude comparison. It is proved in [14] that the modified algorithm finds an LPF match.

Figure 4: Modified Edmonds-Karp algorithm [2]. $G$ is a flow network or graph constructed as described in Figure 3. $E[G]$ is the set of all edges in $G$; $u$ or $v$ is a vertex in $G$ representing an input or output; $(u,v)$ is an edge from $u$ to $v$; $f$ is the total flow through the network; $f[u,v]$ denotes a flow from $u$ to $v$. $G_f$ is a residual network [2][19], also called a residual graph. LPFS is a largest unmatched port first search.

Modified Edmonds-Karp algorithm
1  for each edge (u,v) ∈ E[G]
2    do f[u,v] ← 0
3       f[v,u] ← 0
4  while LPFS finds a path p from s to t in the residual network G_f
5    for each edge (u,v) in p
6      do if f[v,u] = 0
7           then f[u,v] ← c[u,v]
8           else f[v,u] ← 0
Figure 5: A largest-unmatched-port first search (LPFS). First, LPFS builds a tree with $t$ as its root. Initially every input and output is colored white — undiscovered, then is grayed when it is discovered, and finally is blackened when it is finished. $\pi[v]$ is the predecessor of $v$. From the tree, an augmenting path from $s$ to $t$, which must go through an unmatched input, can be found by walking the predecessor list which begins at a selected unmatched input.

LPFS(G)
1  for each vertex u ∈ V[G]
2    do color[u] ← white
3       π[u] ← nil
4  LPFS-Visit(t)

LPFS-Visit(u)
1  color[u] ← gray
2  for each v ∈ Adjacent[u], starting from the largest to the smallest
3    do if color[v] = white
4         then π[v] ← u
5              LPFS-Visit(v)
6  color[u] ← black

Theorem 2: The maximum size match found by the modified Edmonds-Karp algorithm is also a maximum weight match with weights as defined in Equation 1.

Proof: see reference [14].
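The following Python sketch (our reconstruction of the idea, not the authors' implementation) shows an augmenting-path maximum size matching in which the search simply visits ports in decreasing occupancy order; with ports pre-sorted, the inner loop needs no magnitude comparisons, which is the point of the modification:

    def lpf_style_match(requests, R, C):
        # requests[i][j] is True if Q_ij is non-empty; R and C are the
        # input and output occupancies, used only to fix the visit order.
        M, N = len(R), len(C)
        owner = [None] * N                        # output j -> matched input
        outputs_by_occ = sorted(range(N), key=lambda j: -C[j])

        def augment(i, visited):
            for j in outputs_by_occ:              # largest output first
                if requests[i][j] and j not in visited:
                    visited.add(j)
                    if owner[j] is None or augment(owner[j], visited):
                        owner[j] = i
                        return True
            return False

        for i in sorted(range(M), key=lambda i: -R[i]):  # largest input first
            augment(i, set())
        return {j: i for j, i in enumerate(owner) if i is not None}

Because augmentation starts from the heaviest ports, ties among maximum size matches tend to be resolved in favor of high port occupancy; the exact tie-breaking rule of LPF is the LPFS order given above.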
3.2 A Practical Approximation to LPF

LPF can be adapted to run at higher speed using simple heuristic approximations. Shown in Figure 6 is an iterative algorithm called iLPF that approximates LPF. All weight processing is done in step 1 prior to the iterative steps. The second step consists of a double for-loop used to find a maximal size match. Since the requests have already been ordered in the first step, the maximal size matching in the second step does not need to compare request weights. Figure 7 shows the schematic of a hardware implementation of iLPF. Our exploratory design work suggests that the second step can be implemented using simple hardware; for a 32 × 32 switch, our synthesized design can make a scheduling decision in just 10ns using a commercial 0.25 µm CMOS ASIC technology. The first step, which requires simple integer arithmetic, can also run in 10ns, allowing the switch to run at a line rate of 20 Gb/s (calculated based on the size of an ATM cell).

Figure 6: An iterative LPF algorithm. First, the algorithm builds a sorted list of all inputs and outputs based on their occupancies. Then, starting from the largest output and input, the algorithm finds a maximal size match.

Iterative LPF algorithm
Step 1.
1  Sort inputs & outputs based on their occupancies
2  Reorder requests according to their input and output occupancies
Step 2. Maximal size matching
1  for each output, from largest to smallest
2    for each input, from largest to smallest
3      if (there is a request) and (both input and output unmatched)
4        then match them
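In software, the two steps of Figure 6 can be sketched as follows (an illustrative Python rendering of ours, not the synthesized hardware design):

    import numpy as np

    def ilpf(L: np.ndarray):
        # L[i, j] is the occupancy of VOQ Q_ij.
        R = L.sum(axis=1)                  # input occupancies
        C = L.sum(axis=0)                  # output occupancies
        inputs = np.argsort(-R)            # Step 1: sort ports by occupancy
        outputs = np.argsort(-C)
        input_free = np.ones(len(R), dtype=bool)
        match = {}                         # output j -> input i
        for j in outputs:                  # Step 2: double for-loop,
            for i in inputs:               # largest ports first
                if L[i, j] > 0 and input_free[i]:
                    match[int(j)] = int(i) # no weight comparisons needed here
                    input_free[i] = False
                    break
        return match

The result is a maximal (not necessarily maximum) size match, which is what the double for-loop in the hardware produces.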
3.3 Stability
We now prove that LPF can achieve 100% throughput for all traffic patterns with independent arrivals, using the notion of stability [8]. We define a switch to be stable for a particular arrival process if the expected length of the input queues does not grow without bound, i.e.,

$$E[L_{i,j}(n)] < \infty, \qquad \forall\, i, j, n. \qquad (2)$$
Definition 5: A switch can achieve 100% throughput if it is stable for all independent and admissible arrivals.

Theorem 3: The LPF algorithm is stable for all admissible independent arrival processes.

Proof: see Appendix B.
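Stability in the sense of Equation 2 can also be probed empirically. The toy simulation below (our own illustration, not part of the paper's proof) applies the greedy occupancy-ordered match to admissible uniform Bernoulli traffic and reports the backlog, which should stay bounded under a stable policy:

    import numpy as np

    rng = np.random.default_rng(0)
    N, slots, load = 4, 20000, 0.95        # load < 1, so traffic is admissible
    L = np.zeros((N, N), dtype=int)        # VOQ occupancies L_ij

    for _ in range(slots):
        L += rng.random((N, N)) < load / N # Bernoulli arrivals, uniform outputs
        R, C = L.sum(axis=1), L.sum(axis=0)
        free_in = set(range(N))
        for j in sorted(range(N), key=lambda j: -C[j]):
            for i in sorted(range(N), key=lambda i: -R[i]):
                if i in free_in and L[i, j] > 0:
                    L[i, j] -= 1           # serve one cell from Q_ij
                    free_in.remove(i)
                    break

    print("mean cells per VOQ after", slots, "slots:", L.sum() / N**2)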
3.4 Stability With a Finite Pipeline Delay
Because the modified maximum size matching algorithm requires the inputs and outputs to be pre-ordered, LPF and iLPF need sorting networks to sort all inputs and outputs. Due to the relatively high complexity of the sorting networks, they could dominate the running time of the algorithm. Alternatively, we can pipeline the design to reduce its running time; the sorting networks can operate in one slot, and the maximum size matching algorithm in the next. This means that the maximum size matching algorithm is operating on weights that are now one slot out of date — it is possible for the algorithm to favor the …

Figure 7: A block diagram of iLPF. Referring to the algorithm in Figure 6, inputs and outputs are pre-sorted by the two sorter networks. Raw requests (requests with weights removed) are given in matrix form. Request reordering is done by the two crossbars, which are configured by the sorting results. The maximal size matching block, which implements the double for-loop, finds a maximal size match that approximates an LPF match. The match needs to be permuted back to its natural order.

Citations
Journal ArticleDOI
Nick McKeown
TL;DR: This paper presents a scheduling algorithm called iSLIP, an iterative, round-robin algorithm that can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware, and describes the implementation complexity of the algorithm.
Abstract: An increasing number of high performance internetworking protocol routers, LAN and asynchronous transfer mode (ATM) switches use a switched backplane based on a crossbar switch. Most often, these systems use input queues to hold packets waiting to traverse the switching fabric. It is well known that if simple first in first out (FIFO) input queues are used to hold packets then, even under benign conditions, head-of-line (HOL) blocking limits the achievable bandwidth to approximately 58.6% of the maximum. HOL blocking can be overcome by the use of virtual output queueing, which is described in this paper. A scheduling algorithm is used to configure the crossbar switch, deciding the order in which packets will be served. Previous results have shown that with a suitable scheduling algorithm, 100% throughput can be achieved. In this paper, we present a scheduling algorithm called iSLIP. An iterative, round-robin algorithm, iSLIP can achieve 100% throughput for uniform traffic, yet is simple to implement in hardware. Iterative and noniterative versions of the algorithms are presented, along with modified versions for prioritized traffic. Simulation results are presented to indicate the performance of iSLIP under benign and bursty traffic conditions. Prototype and commercial implementations of iSLIP exist in systems with aggregate bandwidths ranging from 50 to 500 Gb/s. When the traffic is nonuniform, iSLIP quickly adapts to a fair scheduling policy that is guaranteed never to starve an input queue. Finally, we describe the implementation complexity of iSLIP. Based on a two-dimensional (2-D) array of priority encoders, single-chip schedulers have been built supporting up to 32 ports, and making approximately 100 million scheduling decisions per second.

1,277 citations


Cites methods from "A practical scheduling algorithm to..."

  • ...For example, the algorithms described in [25] and [ 28 ] that achieve 100% throughput, use maximum weight bipartite matching algorithms [35], which have a running-time complexity of...


  • ...When VOQ’s are used, it has been shown possible to increase the throughput of an input-queued switch from 58.6% to 100% for both uniform and nonuniform traffic [25], [ 28 ]....


Journal ArticleDOI
TL;DR: The main objective of this sequel is to solve the out-of-sequence problem that occurs in the load balanced Birkhoff-von Neumann switch with one-stage buffering by adding a load-balancing buffer in front of the first stage and a resequencing-and-output buffer after the second stage.

328 citations

Journal ArticleDOI
TL;DR: A power-allocation policy is developed which stabilizes the system whenever the rate vector lies within the capacity region and provides a performance bound for the Choose-the-K-Largest-Connected-Queues policy.
Abstract: We consider power and server allocation in a multibeam satellite downlink which transmits data to N different ground locations over N time-varying channels. Packets destined for each ground location are stored in separate queues and the server rate for each queue, i, depends on the power, p/sub i/(t), allocated to that server and the channel state, c/sub i/(t), according to a concave rate-power curve /spl mu//sub i/(p/sub i/,c/sub i/). We establish the capacity region of all arrival rate vectors (/spl lambda//sub 1/,...,/spl lambda//sub N/) which admit a stabilizable system. We then develop a power-allocation policy which stabilizes the system whenever the rate vector lies within the capacity region. Such stability is guaranteed even if the channel model and the specific arrival rates are unknown. Furthermore, the algorithm is shown to be robust to arbitrary variations in the input rates and a bound on average delay is established. As a special case, this analysis verifies stability and provides a performance bound for the choose-the-K-largest-connected-queues policy when channels can be in one of two states (ON or OFF ) and K servers are allocated at every timestep (K

314 citations


Cites background from "A practical scheduling algorithm to..."

  • ...This policy is shown to maintain average queue occupancy within a fixed upper bound and is robust to arbitrary changes in the input rates....


Dissertation
01 Jan 2003
TL;DR: The notion of network layer capacity is developed and capacity achieving power allocation and routing algorithms for general networks with wireless links and adaptive transmission rates are described and a fundamental rate-delay tradeoff curve is established.
Abstract: Satellite and wireless networks operate over time varying channels that depend on attenuation conditions, power allocation decisions, and inter-channel interference. In order to reliably integrate these systems into a high speed data network and meet the increasing demand for high throughput and low delay, it is necessary to develop efficient network layer strategies that fully utilize the physical layer capabilities of each network element. In this thesis, we develop the notion of network layer capacity and describe capacity achieving power allocation and routing algorithms for general networks with wireless links and adaptive transmission rates. Fundamental issues of delay, throughput optimality, fairness, implementation complexity, and robustness to time varying channel conditions and changing user demands are discussed. Analysis is performed at the packet level and fully considers the queueing dynamics in systems with arbitrary, potentially bursty, arrival processes. Applications of this research are examined for the specific cases of satellite networks and ad-hoc wireless networks. Indeed, in Chapter 3 we consider a multi-beam satellite downlink and develop a dynamic power allocation algorithm that allocates power to each link in reaction to queue backlog and current channel conditions. The algorithm operates without knowledge of the arriving traffic or channel statistics, and is shown to achieve maximum throughput while maintaining average delay guarantees. At the end of Chapter 4, a crosslinked collection of such satellites is considered and a satellite separation principle is developed, demonstrating that joint optimal control can be implemented with separate algorithms for the downlinks and crosslinks. Ad-hoc wireless networks are given special attention in Chapter 6. A simple cell-partitioned model for a mobile ad-hoc network with N users is constructed, and exact expressions for capacity and delay are derived. End-to-end delay is shown to be O(N), and hence grows large as the size of the network is increased. To reduce delay, a transmission protocol which sends redundant packet information over multiple paths is developed and shown to provide O( N ) delay at the cost of reducing throughput. A fundamental rate-delay tradeoff curve is established, and the given protocols for achieving O(N) and O( N ) delay are shown to operate on distinct boundary points of this curve. In Chapters 4 and 5 we consider optimal control for a general time-varying network. A cross-layer strategy is developed that stabilizes the network whenever possible, and makes fair decisions about which data to serve when inputs exceed capacity. The strategy is decoupled into separate algorithms for dynamic flow control, power allocation, and routing, and allows for each user to make greedy decisions independent of the actions of others. The combined strategy is shown to yield data rates that are arbitrarily close to the optimally fair operating point that is achieved when all network controllers are coordinated and have perfect knowledge of future events. The cost of approaching this fair operating point is an end-to-end delay increase for data that is served by the network. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

311 citations


Cites background or methods from "A practical scheduling algorithm to..."

  • ...Maximum weight metrics are also considered in the switching and scheduling literature [95] [97] [88] [132] [81] [62], and recently for multi-access uplink communication in [149] [84] and for a single server downlink with heavy traffic in [124]....


  • ...Such a technique has been recently used for establishing stability in an uplink with static channels in [149], [84], in a one-hop static network in [71], and in the switching literature [97] [95] [75] [88] [109]....


Patent
17 Oct 2005
TL;DR: In this article, the authors present methods and devices for implementing a Low Latency Ethernet (LLE) solution, referred to herein as a Data Center Ethernet (DCE) solution which simplifies the connectivity of data centers and provides a high bandwidth, low latency network for carrying Ethernet and storage traffic.
Abstract: The present invention provides methods and devices for implementing a Low Latency Ethernet (“LLE”) solution, also referred to herein as a Data Center Ethernet (“DCE”) solution, which simplifies the connectivity of data centers and provides a high bandwidth, low latency network for carrying Ethernet and storage traffic. Some aspects of the invention involve transforming FC frames into a format suitable for transport on an Ethernet. Some preferred implementations of the invention implement multiple virtual lanes (“VLs”) in a single physical connection of a data center or similar network. Some VLs are “drop” VLs, with Ethernet-like behavior, and others are “no-drop” lanes with FC-like behavior. Some preferred implementations of the invention provide guaranteed bandwidth based on credits and VL. Active buffer management allows for both high reliability and low latency while using small frame buffers. Preferably, the rules for active buffer management are different for drop and no drop VLs.

260 citations

References
Book
01 Jan 1990
TL;DR: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures and presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers.
Abstract: From the Publisher: The updated new edition of the classic Introduction to Algorithms is intended primarily for use in undergraduate or graduate courses in algorithms or data structures. Like the first edition,this text can also be used for self-study by technical professionals since it discusses engineering issues in algorithm design as well as the mathematical aspects. In its new edition,Introduction to Algorithms continues to provide a comprehensive introduction to the modern study of algorithms. The revision has been updated to reflect changes in the years since the book's original publication. New chapters on the role of algorithms in computing and on probabilistic analysis and randomized algorithms have been included. Sections throughout the book have been rewritten for increased clarity,and material has been added wherever a fuller explanation has seemed useful or new information warrants expanded coverage. As in the classic first edition,this new edition of Introduction to Algorithms presents a rich variety of algorithms and covers them in considerable depth while making their design and analysis accessible to all levels of readers. Further,the algorithms are presented in pseudocode to make the book easily accessible to students from all programming language backgrounds. Each chapter presents an algorithm,a design technique,an application area,or a related topic. The chapters are not dependent on one another,so the instructor can organize his or her use of the book in the way that best suits the course's needs. Additionally,the new edition offers a 25% increase over the first edition in the number of problems,giving the book 155 problems and over 900 exercises thatreinforcethe concepts the students are learning.

21,651 citations

Journal ArticleDOI
TL;DR: In this article, a language similar to logo is used to draw geometric pictures using this language and programs are developed to draw geometrical pictures using it, which is similar to the one we use in this paper.
Abstract: The primary purpose of a programming language is to assist the programmer in the practice of her art. Each language is either designed for a class of problems or supports a different style of programming. In other words, a programming language turns the computer into a ‘virtual machine’ whose features and capabilities are unlimited. In this article, we illustrate these aspects through a language similar to logo. Programs are developed to draw geometric pictures using this language.

5,749 citations

Journal ArticleDOI
Abhay Parekh, Robert G. Gallager
TL;DR: Worst-case bounds on delay and backlog are derived for leaky bucket constrained sessions in arbitrary topology networks of generalized processor sharing (GPS) servers and the effectiveness of PGPS in guaranteeing worst-case session delay is demonstrated under certain assignments.
Abstract: Worst-case bounds on delay and backlog are derived for leaky bucket constrained sessions in arbitrary topology networks of generalized processor sharing (GPS) servers. The inherent flexibility of the service discipline is exploited to analyze broad classes of networks. When only a subset of the sessions are leaky bucket constrained, we give succinct per-session bounds that are independent of the behavior of the other sessions and also of the network topology. However, these bounds are only shown to hold for each session that is guaranteed a backlog clearing rate that exceeds the token arrival rate of its leaky bucket. A much broader class of networks, called consistent relative session treatment (CRST) networks is analyzed for the case in which all of the sessions are leaky bucket constrained. First, an algorithm is presented that characterizes the internal traffic in terms of average rate and burstiness, and it is shown that all CRST networks are stable. Next, a method is presented that yields bounds on session delay and backlog given this internal traffic characterization. The links of a route are treated collectively, yielding tighter bounds than those that result from adding the worst-case delays (backlogs) at each of the links in the route. The bounds on delay and backlog for each session are efficiently computed from a universal service curve, and it is shown that these bounds are achieved by "staggered" greedy regimes when an independent sessions relaxation holds. Propagation delay is also incorporated into the model. Finally, the analysis of arbitrary topology GPS networks is related to Packet GPS networks (PGPS). The PGPS scheme was first proposed by Demers, Shenker and Keshav (1991) under the name of weighted fair queueing. For small packet sizes, the behavior of the two schemes is seen to be virtually identical, and the effectiveness of PGPS in guaranteeing worst-case session delay is demonstrated under certain assignments. >

3,967 citations


"A practical scheduling algorithm to..." refers background in this paper

  • ...Furthermore, the system is able to control packet departure times and hence provides guaranteed qualities-of-service (QoS) [3][ 15 ][20][21]....


Journal ArticleDOI
TL;DR: This paper shows how to construct a maximum matching in a bipartite graph with n vertices and m edges in a number of computation steps proportional to $(m + n)\sqrt n $.
Abstract: The present paper shows how to construct a maximum matching in a bipartite graph with n vertices and m edges in a number of computation steps proportional to $(m + n)\sqrt n $.

2,785 citations

Journal ArticleDOI
TL;DR: In this article, a fair gateway queueing algorithm based on an earlier suggestion by Nagle is proposed to control congestion in datagram networks, based on the idea of fair queueing.
Abstract: We discuss gateway queueing algorithms and their role in controlling congestion in datagram networks. A fair queueing algorithm, based on an earlier suggestion by Nagle, is proposed. Analysis and s...

2,639 citations