Matching Output Queueing with a Combined Input Output Queued Switch¹
Shang-Tse Chuang
Ashish Goel
Nick McKeown
Balaji Prabhakar
Stanford University
Abstract — The Internet is facing two problems simultaneously: there is a need for a
faster switching/routing infrastructure, and a need to introduce guaranteed qualities of
service (QoS). Each problem can be solved independently: switches and routers can be
made faster by using input-queued crossbars, instead of shared memory systems; and QoS
can be provided using WFQ-based packet scheduling. However, until now, the two
solutions have been mutually exclusive — all of the work on WFQ-based scheduling
algorithms has required that switches/routers use output-queueing, or centralized shared
memory. This paper demonstrates that a Combined Input Output Queueing (CIOQ) switch
running twice as fast as an input-queued switch can provide precise emulation of a broad
class of packet scheduling algorithms, including WFQ and strict priorities. More
precisely, we show that for an N × N switch, a “speedup” of 2 - 1/N is necessary and a
speedup of two is sufficient for this exact emulation. Perhaps most interestingly, this result
holds for all traffic arrival patterns. On its own, the result is primarily a theoretical
observation; it shows that it is possible to emulate purely OQ switches with CIOQ
switches running at approximately twice the line-rate. To make the result more practical,
we introduce several scheduling algorithms that, with a speedup of two, can emulate an
OQ switch. We focus our attention on the simplest of these algorithms, Critical Cells First
(CCF), and consider its running-time and implementation complexity. We conclude that
additional techniques are required to make the scheduling algorithms implementable at
high speed, and propose two specific strategies.
1 Introduction
Many commercial switches and routers today employ output-queueing.² When a packet
arrives at an output-queued (OQ) switch, it is immediately placed in a queue that is dedicated to its
outgoing line, where it waits until departing from the switch. This approach is known to maximize
the throughput of the switch: so long as no input or output is oversubscribed, the switch is able to
support the traffic and the occupancies of queues remain bounded. Furthermore, by carefully
scheduling the time a packet is placed onto the outgoing line, a switch or router can control the
packet’s latency, and hence provide quality-of-service (QoS) guarantees. But output queueing is
1. This paper was presented at Infocom ‘99, New York, USA.
2. When we refer to output-queueing in this paper, we include designs that employ centralized shared memory.
impractical for switches with high line rates and/or with a large number of ports, since the fabric
and memory of an N × N switch must run N times as fast as the line rate. Unfortunately, at high
line rates, memories with sufficient bandwidth are simply not available.
On the other hand, the fabric and the memory of an input queued (IQ) switch need only run as
fast as the line rate. This makes input queueing very appealing for switches with fast line rates, or
with a large number of ports. For this reason, the highest performance switches and routers use
input-queued crossbar switches [3][4]. But IQ switches can suffer from head-of-line (HOL) block-
ing, which can have a severe effect on throughput. It is well-known that if each input maintains a
single FIFO, then HOL blocking can limit the throughput to just 58.6% [5].
One method that has been proposed to reduce HOL blocking is to increase the “speedup” of a
switch. A switch with a speedup of S can remove up to S packets from each input and deliver up
to S packets to each output within a time slot, where a time slot is the time between packet arrivals
at input ports. Hence, an OQ switch has a speedup of N while an IQ switch has a speedup of one.
For values of S between 1 and N, packets need to be buffered at the inputs before switching as
well as at the outputs after switching. We call this architecture a combined input and output queued
(CIOQ) switch.
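The role of speedup within a time slot can be sketched in code. The phase structure below, the function names, and the naive head-of-line scheduler are illustrative assumptions of ours, not part of the paper (which develops far more careful scheduling algorithms later):

```python
# Illustrative sketch of one time slot in a CIOQ switch with speedup S.
# A cell is a (destination, id) pair. The naive head-of-line scheduler
# here is an assumption for illustration only.

def head_of_line_matching(inputs):
    """Match each input's head cell to its destination, at most one input
    per output and one output per input."""
    matching, used = {}, set()
    for i, queue in enumerate(inputs):
        if queue and queue[0][0] not in used:
            matching[i] = queue[0][0]
            used.add(queue[0][0])
    return matching

def one_time_slot(inputs, outputs, S):
    """Run S scheduling phases, then one departure per output line.
    (The arrival phase is omitted for brevity.)"""
    for _ in range(S):
        for i, j in head_of_line_matching(inputs).items():
            # transfer the first cell at input i destined to output j
            for k, cell in enumerate(inputs[i]):
                if cell[0] == j:
                    outputs[j].append(inputs[i].pop(k))
                    break
    return [q.pop(0) for q in outputs if q]  # cells departing this slot
```

With S = 1 this degenerates to a FIFO input-queued switch (and so exhibits HOL blocking); larger S lets more cells cross the crossbar per slot at the cost of a faster fabric.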
Both analytical and simulation studies of a CIOQ switch that maintains a single FIFO at
each input have been conducted for various values of speedup [6][7][8][9]. A common conclusion
of these studies is that with S = 4 or 5 one can achieve about 99% throughput when arrivals are
independent and identically distributed at each input, and the distribution of packet destinations is
uniform across the outputs. While these studies consider average delay (and simplistic input
traffic patterns), they make no guarantees about the delay of individual packets. This is particularly
important if a switch or router is to offer QoS guarantees.
We believe that a well-designed network switch should perform predictably in the face of all
types of arrival process¹ and allow the delay of individual packets to be controlled. Hence our
approach is quite different: rather than find values of speedup that work well on average, or with
simplistic and unrealistic traffic models, we find the minimum speedup such that a CIOQ switch
behaves identically to an OQ switch for all types of traffic. (Here, “behave identically” means that
when the same inputs are applied to both the OQ switch and to the CIOQ switch, the correspond-
ing output processes from the two switches are completely indistinguishable.) This approach was
first formulated in the recent work of Prabhakar and McKeown [12]. They show that a CIOQ
switch with a speedup of four can behave identically to a FIFO OQ switch for arbitrary input traf-
fic patterns and switch sizes. In this sense, this paper builds upon and extends the results in [12], as
described in the next paragraph. A number of researchers have recently considered various aspects
of the speedup problem, most notably [18] which obtains packet delay bounds and [19] which
finds sufficient conditions for maximizing throughput through work conservation and mimicking
of output queueing.²
In this paper, we show that a CIOQ switch with a speedup of two can behave identically to an
OQ switch. The result holds for switches with an arbitrary number of ports, and for any traffic
1. The need for a switch that can deliver a certain grade of service, irrespective of the applied traffic is par-
ticularly important given the number of recent studies that show how little we understand network traffic
processes [11]. Indeed, a sobering conclusion of these studies is that it is not yet possible to accurately
model or simulate a trace of actual network traffic. Furthermore, new applications, protocols or data-cod-
ing mechanisms may bring new traffic types in future years.
2. [20] aimed to extend the results of [12], but the algorithms and proofs presented there are incorrect and do
not solve the speedup problem. See http://www.cs.cmu.edu/~istoica/IWQoS98-fix.html for a discussion
of the errors.
arrival pattern. It is also found to be true for a broad class of widely used output link scheduling
algorithms such as weighted fair queueing, strict priorities, and FIFO. We introduce some specific
scheduling algorithms that achieve this result. We also show more generally that a speedup of
2 - 1/N is both necessary and sufficient for a CIOQ switch to behave identically to a FIFO OQ
switch.
It is worth briefly considering the implications of this result. It demonstrates that it is possible
to emulate an OQ switch using buffer memory operating at only twice the speed of the
external line. Previously, an OQ switch could only be implemented with memories operating at N
times the speed of the external line. However, the advantages do not come for free. In essence, the
memory bandwidth is reduced at the expense of a fast cell scheduling algorithm that is required to
configure the crossbar. As we shall see, the scheduling algorithms are complex, the best known-to-
date having a running-time complexity of N. (We discuss the implementation complexity in some
detail in Section 5). This means that it is not yet practicable to emulate fast OQ switches with a
large number of ports. While we propose some strategies in this paper, this is a topic for further
research.
1.1 Background
Consider the single stage, N × N switch shown in Figure 1. Throughout the paper we assume
that packets begin to arrive at the switch from time t = 1, the switch having been empty before
that time. Although packets arriving to the switch or router may have variable length, we will
assume that they are treated internally as fixed length “cells”. This is common practice in high per-
formance LAN switches and routers; variable length packets are segmented into cells as they
arrive, carried across the switch as cells, and reassembled back into packets again before they
depart [4][3]. We take the arrival time between cells as the basic time unit and refer to it as a time
slot. The switch is said to have a speedup of S, for S ∈ {1, 2, …, N}, if it can remove up to S
cells from each input and transfer at most S cells to each output in a time slot. A speedup of S
requires the switch fabric to run S times as fast as the input or output line rate. For 1 < S < N,
buffering is required both at the inputs and at the outputs, and leads to a combined input and output
queued (CIOQ) architecture. The following is the problem we wish to solve.
The speedup problem: Determine the smallest value of S and an appropriate cell scheduling
algorithm that
1. allows a CIOQ switch to exactly mimic the performance of an output-queued switch (in a
sense that will be made precise),
2. achieves this for any arbitrary input traffic pattern,
3. is independent of switch size.
In an OQ switch, arriving cells are immediately forwarded to their corresponding outputs.
This (a) ensures that the switch is work-conserving, i.e. an output never idles so long as there is a
cell destined for it in the system, and (b) allows the departure of cells to be scheduled to meet
latency constraints.¹ We will require that any solution of the speedup problem possess these two
desirable features; that is, a CIOQ switch must behave identically to an OQ switch in the following
sense:
1. For ease of exposition, we will at times assume that the output uses a FIFO queueing discipline, i.e. cells depart from
the output in the same order that they arrived to the inputs of the switch. However, we are interested in a broader class
of queueing disciplines: ones that allow cells to depart in time to meet particular bandwidth and delay guarantees.
Identical Behavior: A CIOQ switch is said to behave identically to an OQ switch if, under iden-
tical inputs, the departure time of every cell from both switches is identical.
Figure 1: A General Combined Input and Output Queued (CIOQ) switch.
As a benchmark with which to compare our CIOQ switch, we will assume there exists a
shadow OQ switch that is fed the same input traffic pattern as the CIOQ switch. Our goal is
to arrange for each cell to depart from the CIOQ switch at exactly the same time as its counterpart
cell departs from the OQ switch. In the CIOQ switch, the sequence in which cells are transferred
from their input queues to the output queue is determined by a scheduling algorithm. In each time
slot, the scheduling algorithm matches each non-empty input with at most one output and, con-
versely, each output is matched with at most one input. The matching is used to configure the
crossbar fabric before cells are transferred from the input side to the output side. A CIOQ switch
with a speedup of S is able to make S such transfers during each time slot.
Selecting the appropriate scheduling algorithm is the key to achieving identical behavior
between the CIOQ switch and its shadow OQ switch.
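The matching constraint just described (each non-empty input paired with at most one output, and each output with at most one input) can be checked mechanically. The helper below is a hypothetical illustration under our own naming, not code from the paper:

```python
# Illustrative check that a crossbar configuration is a valid matching:
# each input and each output may appear at most once.

def is_valid_matching(matching):
    """matching: dict mapping input index -> output index.
    Dict keys already guarantee each input appears at most once,
    so it remains to check that no output is claimed twice."""
    outputs = list(matching.values())
    return len(outputs) == len(set(outputs))
```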
1.2 Push-in Queues
Throughout this paper, we will make repeated use of what we will call a push-in queue. Simi-
lar to a discrete-event queue, a push-in queue is one in which an arriving cell is inserted at an arbi-
trary location in the queue based on some criterion. For example, each cell may carry with it a
departure time, and is placed in the queue ahead of all cells with a later departure time, yet behind
cells with an earlier departure time. The only property that defines a push-in queue is that once
placed in the queue, cells may not switch places with other cells. In other words, their relative
ordering remains unchanged. In general, we distinguish two types of push-in queues: (1) “Push-In
First-Out” (PIFO) queues, in which arriving cells are placed at an arbitrary location, and the cell at
the head of the queue is always the next to depart. PIFO queues are quite general — for example, a
WFQ scheduling discipline operating at an output queued switch is a special case of a PIFO queue.
(2) “Push-In Arbitrary-Out” (PIAO) queues, in which cells are removed from the queue in an arbi-
trary order, i.e. it is not necessarily the case that the next cell to depart is the one currently at the
head of the queue.
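A PIFO queue can be sketched directly from this definition. The class below is a minimal illustration under our own naming, keyed here by a cell's departure time; it is not an implementation proposed by the paper:

```python
# Minimal "Push-In First-Out" (PIFO) queue sketch. An arriving cell is
# inserted ahead of all cells with a later departure time and behind
# those with an earlier (or equal) one. Once queued, cells never swap
# places, and the head is always the next to depart.

class PIFOQueue:
    def __init__(self):
        self._cells = []  # (cell, departure_time), increasing departure time

    def push(self, cell, departure_time):
        # find the first queued cell with a strictly later departure time
        i = 0
        while i < len(self._cells) and self._cells[i][1] <= departure_time:
            i += 1
        self._cells.insert(i, (cell, departure_time))

    def pop(self):
        return self._cells.pop(0)[0]  # the head always departs first
```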
It is assumed that each input of the CIOQ switch maintains a queue, which can be thought of
as an ordered set of cells waiting at the input port. In general, the CIOQ switches that we consider
can all be described using PIAO input queues.¹ Many orderings of the cells are possible, each
ordering leading to a different switch scheduling algorithm, as we shall see.
Each output maintains a queue for the cells waiting to depart from the switch. In addition, each
output also maintains an output priority list: an ordered list of cells at the inputs waiting to be
transferred to this particular output. The output priority list is drawn in the order in which the cells
would depart from the OQ switch we wish to emulate (i.e. the shadow OQ switch). This priority
list will depend on the queueing policy followed by the OQ switch (FIFO, WFQ, strict priorities
etc.).
1.3 Definitions
The following definitions are crucial to the rest of the paper.
Definition 1: Time to Leave — The “time to leave” for cell c, TL(c), is the time slot at which c
will leave the shadow OQ switch. Note that it is possible for TL(c) to increase. This happens if
new cells arrive to the switch, destined for c’s output, and have a higher priority than c. (Of
course, TL(c) is also the time slot in which c must leave from our CIOQ switch for the identical
behavior to be achieved.)
Definition 2: Output Cushion — At any time, the “output cushion of a cell c”, OC(c), is the
number of cells waiting in the output buffer at cell c’s output port with a smaller time to leave
value than cell c.
Notice that if a cell is still on the input side and has a small (or zero) output cushion, the sched-
uling algorithm must urgently deliver the cell to its output so that it may depart on time. Since the
switch is work-conserving, a cell’s output cushion decreases by one during every time slot, and can
only be increased by newly arriving cells that are destined to the same output and have a more
urgent time to leave.
Definition 3: Input Thread — At any time, the “input thread of cell c”, IT(c), is the number of
cells ahead of cell c in its input priority list.
In other words, IT(c) represents the number of cells currently at the input that need to be trans-
ferred to their outputs more urgently than cell c. A cell’s input thread is decremented only when a
cell ahead of it is transferred from the input, and is possibly incremented by newly arriving cells.
Notice that it would be undesirable for a cell to simultaneously have a large input thread and a
small output cushion — the cells ahead of it at the input may prevent it from reaching its output
before its time to leave. This motivates our definition of slackness.
Definition 4: Slackness — At any time, the “slackness of cell c”, L(c), equals the output cushion
of cell c minus its input thread, i.e. L(c) = OC(c) − IT(c).
Slackness is a measure of how large a cell’s output cushion is with respect to its input thread. If
a cell’s slackness is small, then it urgently needs to be transferred to its output. Conversely, if a cell
has a large slackness, then it may languish at the input without fear of missing its time to leave.¹
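Putting Definitions 2 through 4 together, slackness can be computed from a snapshot of the queues. The function names and snapshot representation below are illustrative assumptions of ours:

```python
# Illustrative computation of L(c) = OC(c) - IT(c) from a snapshot.
# output_tls: time-to-leave values of cells already at c's output queue.
# cells_ahead: number of cells ahead of c in its input priority list.

def output_cushion(cell_tl, output_tls):
    # OC(c): cells at the output that will leave before c
    return sum(1 for tl in output_tls if tl < cell_tl)

def slackness(cell_tl, output_tls, cells_ahead):
    # L(c) = OC(c) - IT(c)
    return output_cushion(cell_tl, output_tls) - cells_ahead
```

In this sketch, a small slackness flags a cell that urgently needs to cross to its output, matching the intuition in the text.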
Figure 2: A snapshot of a CIOQ switch
1. In practice, we need not necessarily use a PIAO queue to implement these techniques. But we will use the PIAO
queue as a general way of describing the input queueing mechanism.
1. Note that a cell’s input thread and slackness are only defined when the cell is waiting at the input side of
the switch.