A 3-stage interconnection structure for very large packet switches

doi:10.1109/ICC.1990.117181

A 3-STAGE INTERCONNECTION STRUCTURE FOR VERY LARGE PACKET SWITCHES

Soung C. Liew and Kevin W. Lu

Bell Communications Research
445 South Street
Morristown, NJ 07960-1910, U.S.A.

Abstract

This paper proposes a 3-stage broadband packet switch architecture with more than 16,000 ports for a future central office. The switch is constructed by interconnecting many independent switch modules of small size, which can be implemented using modifications of various well-studied switch fabric designs. Multiple paths are provided for each input-output pair, and the technique of channel grouping is used to decrease delay and increase throughput. A datagram packet routing approach is adopted in order to eliminate the table lookup that would be required by virtual-circuit routing. Ways of guaranteeing the sequence integrity of packets are discussed. We estimate from our performance analyses that switches of 32,768 ports with acceptable performance can be constructed based on switch fabrics of no more than 128 ports.

I. Introduction

"Divide-and-Conquer" is a popular engineering technique for designing systems of large scale. A good approach to designing a very large switching system, for example, is to interconnect many independent small switch modules in a way that satisfies the overall switching requirements. Most studies of packet-switch implementations [1], [2], switch control [1], [3], [4], performance analysis [3], [5], and prototypes [1], [6] have been restricted to small switch fabrics that do not scale easily. It is now time to address the challenge of building a very large packet switch based on small switch fabrics.

Toward this end, the present paper proposes a packet switch architecture which consists of three stages of small switch modules, each of which in turn can be realized by several switch fabrics. Several points of interest regarding this work are presented below.

• The proposed switch structure has a modular design that scales easily to dimensions larger than 16,000 ports $\times$ 16,000 ports. Thus, a packet switch with Terabit capacity can be achieved with a per-port transmission capacity of only 150 to 200 Mb/s. It is difficult to realize a Batcher-banyan switch [1], [6] of this scale because of stringent synchronization requirements for the self-routing elements at each stage of the switch. The modular growth of the Knockout switch described in [2] will also encounter significant difficulties for 16,000 ports, since each bus in the architecture must then have a fanout of more than 16,000.

• Lee has proposed a 2-stage nonblocking modular switch [7] in which the interconnection complexity grows faster than linearly with the overall switch size. For a $128^2 \times 128^2$ 2-stage modular switch based on switch fabrics of 128 input ports, the number of interconnections is $128^3 = 2{,}097{,}152$. While it is interesting and challenging to explore the possibility of building a very large nonblocking switch, the nonblocking property may be too stringent from an engineering viewpoint. By abandoning the nonblocking property in our 3-stage design, we can achieve a switch of similar size with only 131,072 interconnections while maintaining acceptable performance.

• The technique of channel grouping (i.e., providing more than one physical output port for each physical destination address) [8] is used to improve the performance of the individual switch modules. This paper presents several switch module designs, and shows that for a given switch size, a switch module with channel grouping is simpler to realize than one without channel grouping. For instance, a Batcher-banyan switch module with channel grouping can be realized simply by truncating the last few stages of the original Batcher-banyan network.

• There are multiple paths between any input and any output in our overall switch architecture. We assume datagram routing, in which packets of a given service session may travel over different paths within the switch architecture. This means packets may arrive at the output out of sequence, due to delay differences of the paths. However, this paper shows that sequence integrity of packets can be maintained automatically by proper design of the individual switch modules. The same technique can be extended to other multistage switches with channel grouping.

II. The General 3-Stage Switch Architecture

The general switch architecture we propose is illustrated in Fig. 1. The dimensions of the first-stage, second-stage, and third-stage switch modules are $n \times m$ ($m \ge n$), $l \times l'$, and $m' \times n'$ ($m' \ge n'$), respectively. The switch modules are nonblocking internally. For interconnection of the first-stage and second-stage modules, the modules are divided into $g$ partitions called the first-stage-second-stage partitions. The figure shows the 1st and the $g$th partitions enclosed in dashed boxes. There are no interconnections between partitions, but within each partition, each first-stage module is connected to each second-stage module via $r$ links.

CH2829-0/90/0000-0771 $1.00 © 1990 IEEE

Authorized licensed use limited to: IEEE Xplore. Downloaded on February 6, 2009 at 03:39 from IEEE Xplore. Restrictions apply.

Figure 1: General structure of a 3-stage switch.

We shall refer to the links that interconnect two switch modules as a channel group, and the number of links within each channel group as the channel group size. In each partition, there are $h$ first-stage modules, $g'$ second-stage modules, and $hg'$ channel groups of size $r$.

A similar, but not identical, partitioning strategy is used for interconnection of the second-stage and third-stage modules. Specifically, the third-stage modules are divided into $g'$ partitions. One second-stage module out of each first-stage-second-stage partition is chosen to be included in each of the $g'$ second-stage-third-stage partitions. As before, there are no interconnections between partitions, but each second-stage module is connected to each third-stage module within the same partition via a channel group of size $r'$. In Fig. 1, two second-stage-third-stage partitions are shown. The darkened lines show the interconnections of one partition, while the undarkened lines show the interconnections of the other. In each partition, there are $g$ second-stage modules, $h'$ third-stage modules, and $h'g$ channel groups of size $r'$. Note that although there is more than one path between a first-stage module and a third-stage module, these paths belong to the same path group consisting of two channel groups.

It is interesting to note that when $l = l' = 1$ (i.e., the second-stage switch modules are removed), $r = r' = 1$, $h = h' = 1$, $m = g'$, $m' = g$, and $n' = 1$, the 3-stage architecture reduces to Lee's 2-stage architecture [7].

When $n = n' = m = m' = l = l'$ (i.e., switch modules of all three stages are of the same size), switch performance will be degraded at the first and second stages under high input loads because of congestion on the interconnections. By having $m > n$ and $m' > n'$, the loading on the interconnections can be reduced, and therefore the throughput of the switch can be increased.

It is not necessary to have $r > n$ or $r' > n'$ because no more than $n$ or $n'$ packets will travel simultaneously through a first-stage or third-stage switch module, respectively. With $r = n$ and $r' = n'$, there is always enough capacity to carry the packets to their destinations, but there will be a very large number of interconnections (N'YZ). With $r < n$ or $r' < n'$, the switch is blocking because there may not be enough capacity to carry the traffic between two modules. However, if the switch is properly designed, the likelihood of this event can be made extremely small.

The rest of this paper will consider only the symmetric case in which $n = n'$, $m = m'$, $l = l'$, $g = g'$, $h = h'$, and $r = r'$. The total number of switch ports is

$$N = ngh. \tag{1}$$

A total of $\log_2 N$ bits is necessary for routing purposes, and of these, $\log_2 g$ bits are used for routing at the first stage, $\log_2 h$ bits at the second stage, and $\log_2 n$ at the third stage. The total number of interconnections is

$$I = 2mgh. \tag{2}$$

From the above, we get

$$\frac{I}{N} = \frac{2m}{n}. \tag{3}$$

We shall call $m/n$ the expansion ratio, referring to the fact that this is the ratio of the number of intermediate links in the switch architecture to the number of inputs or outputs. Equation (3) shows that the only way to increase $N$ without increasing $I$ is to decrease the expansion ratio $m/n$, but this will result in degradation in performance. Thus, we see that although an optimal switch should have large $N$, small $I$, and good performance, these objectives conflict with each other. Our task, therefore, is to design the overall switch to achieve the best compromise between these objectives.
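As a quick numerical check of Eqs. (1)-(3), the following minimal Python sketch evaluates the port count, the interconnection count, and the routing-bit split for one parameter set. The function name `switch_dimensions` and the values n = 32, m = 128, g = 32, h = 16 are hypothetical choices made for illustration; they happen to give N = 16,384 and I = 131,072, the figures quoted in the introduction, but the paper does not state which parameters produced those figures.

```python
from math import log2

def switch_dimensions(n: int, m: int, g: int, h: int) -> dict:
    """Evaluate Eqs. (1)-(3) for the symmetric 3-stage switch."""
    ports = n * g * h        # Eq. (1): N = ngh
    links = 2 * m * g * h    # Eq. (2): I = 2mgh
    return {
        "N": ports,
        "I": links,
        "links_per_port": links / ports,   # Eq. (3): I/N = 2m/n
        "expansion_ratio": m / n,
        # Routing bits consumed at the first, second, and third stages.
        "routing_bits": (log2(g), log2(h), log2(n)),
    }

# Hypothetical parameter set (an assumption, not taken from the paper).
example = switch_dimensions(n=32, m=128, g=32, h=16)
print(example)
```

With an expansion ratio of 4, every port costs eight inter-module links, which is the trade-off that Eq. (3) expresses.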

Global FIFO Property

Because of channel grouping, there are $r$ possible paths in our switch architecture over which a packet can travel from input to output. It is not difficult to envision situations in which two packets are reversed in order at the output because they have traveled over different paths. On the other hand, it is easier to achieve the first-in-first-out (FIFO) property in our switch structure than in a switch structure with more than one intermediate switch module interconnecting a first-stage switch module and a third-stage switch module. In such a case, packets of an input-output pair may travel through different intermediate switch modules at different times, and sequence integrity is difficult to maintain unless the intermediate modules are somehow coordinated and synchronized. For our switch architecture, on the other hand, we only need to focus on one path group to tackle the FIFO problem, thus allowing the intermediate switch modules to function independently.

Whether the proposed 3-stage switch architecture is FIFO depends on the buffering strategies used by switch modules at the different stages, as well as on the switch designs themselves. This paper considers three different buffering strategies: input queueing, output queueing, and packet dropping [5]. With input queueing, an arriving packet enters a FIFO buffer on its input and waits for its turn to access its destination output. When there is channel grouping on the output, the packet may be routed to any output of the destination output group. The channel group size is the maximum number of packets that can be cleared from the same output address simultaneously. With output queueing, a FIFO buffer is allocated to each output group, and arriving packets destined for this output group are immediately placed in this buffer. Again, packets may be routed to any output of their destination output group. With packet dropping, there is no queueing at input or output ports. Any packets not cleared in one time slot are simply dropped from the system.

Figure 2: (a) Batcher-banyan implementation of a first-stage input-buffered switch module, (b) Expansion network.

For a structure with input queueing at the first stage, packet dropping at the second, and output queueing at the third, it is not difficult to see that sequence integrity of undropped packets is maintained, regardless of the design details of the switch modules. In fact, it can be shown that in general at least one of the three stages must be packet-dropping in order to maintain FIFO. However, with the particular output-queueing switch modules proposed in the next section, the first-stage modules and/or the second-stage modules can also be output queueing without violating FIFO. Specifically, the proposed output-queueing switch modules assume an implicit time relationship between packets of the same channel group that arrive in the same time slot, and this implicit time relationship is automatically preserved even when multiple stages of these output-queueing switch modules are cascaded.
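The role of the channel group size in the buffering strategies described above can be illustrated with a small Python sketch of a single time slot. The function `clear_slot` and the rule that lower-numbered inputs win contention are assumptions made for illustration; the paper does not specify how contention is resolved.

```python
from collections import defaultdict

def clear_slot(requests, r):
    """requests: (input_port, output_group) pairs for the head-of-line packets.
    At most r packets are cleared per output group in one time slot.
    Returns (cleared, blocked)."""
    used = defaultdict(int)
    cleared, blocked = [], []
    for port, group in requests:   # assumption: lower-numbered inputs win
        if used[group] < r:
            used[group] += 1
            cleared.append((port, group))
        else:
            blocked.append((port, group))
    return cleared, blocked

requests = [(0, 2), (1, 2), (2, 2), (3, 0)]
cleared, blocked = clear_slot(requests, r=2)
print(cleared, blocked)
```

Under packet dropping the blocked packets would be discarded, whereas under input queueing they would remain at the head of their input FIFO buffers and contend again in the next time slot.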

III. Switch Module Designs

We now consider the internal structures of the switch modules. For this paper, the switch hierarchy is ordered as: very large switch → switch module → switch fabric → switch element. This section addresses the lowest three levels of this hierarchy. The design of switch modules is basically similar to the design of small switches, and as such, results and insights gained from other studies [1], [2], [7] can be drawn upon to design the switch modules. We show in the rest of this section ways in which channel grouping allows us to simplify switch designs.

Input-buffered Switch Module with Channel Grouping

Only input-buffered switch modules at the first stage will be

considered here. Two possibilities are discussed below.

(i) Modified Batcher-Banyan Network: The structure of the first design is basically a modification based on the expansion Batcher-banyan network described in [7]. The Batcher network [10] sorts the packets according to their destination addresses, either in ascending order or descending order, and the expansion banyan network then routes the packets to their destinations. It is well known that this switch structure is nonblocking [1], [7].

The expansion banyan network is basically a $gr \times gr$ banyan network with some of the $2 \times 2$ switch elements omitted [7]. A schematic of the expansion banyan network is illustrated in Fig. 2(b), where $ns = gr = m$, and $n$, $s$, $g$ and $r$ are all multiples of 2. By using a tree-branching interconnection of switch cells, each output from the Batcher network is expanded into $s$ lines. Then, $s$ banyan networks of size $n \times n$ are used to further route the packets. With a control scheme that makes sure that no more than $r$ packets destined for the same output group enter the Batcher network in the same time slot (e.g., modifications based on [3], [4]), the network is nonblocking if $\log_2 gr$ bits are used for routing [7]. To route packets in the banyan network, we could generate another $\log_2 r$ address bits (in addition to the $\log_2 g$ bits of the original output-group address) in such a way that packets originally having the same output address are routed to different output ports of the same output group.

Because the output lines of a given output group are indistinguishable to packets, the above scheme is more complex than is actually needed. The following can be shown [11]:

• A truncated $n \times gr$ banyan network with the last $\log_2 r$ stages removed is nonblocking if the input packets are presorted and at most $r$ packets are destined for each of the $g$ output groups.

With this simplification, only the $\log_2 g$ original output address bits are required for routing.

(ii) Expansion-Concentration Network: The second design consists of an expansion stage followed by a concentration stage (see Fig. 3(a)). Unlike the Batcher network of the previous design, in which all $\log_2 g$ bits of the destination address are examined in each sorting element, all switch cells in this design are similar to those in a banyan network in that only one address bit needs to be examined.

Each input port has an associated expansion network, and packets are routed according to their destination output groups.

Figure 3: (a) Expansion-concentration implementation of a first-stage input-buffered switch module, (b) Implementation of a concentrator.

Note, in particular, that the number of output ports for each expansion network is $g$, the number of output groups for the switch module, rather than $gr$, the number of output ports for the switch module. This reduces the fanout by a factor of $r$. Packets destined for the same output group are then collected and fed into a running-adder address generator [1], [12], before they are cleared from the switch module at the $r$ output ports of a concentrator. The address generator produces a set of adjacent destination addresses in the concentrator for the active packets. One possible addressing scheme is to give the first active packet (from top to bottom) the first output address, the second active packet the second output address, and so on. If there is an activity bit for each packet, then the address assigned to a packet is the sum of all activity bits belonging to the packets above it.
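The top-to-bottom addressing rule just described amounts to an exclusive prefix sum over the activity bits. The sketch below is a software illustration only (the function name `running_adder_addresses` is hypothetical, and the paper's address generator is a hardware running adder rather than a Python routine).

```python
from itertools import accumulate

def running_adder_addresses(activity):
    """activity[i] is 1 if input i (top to bottom) holds a packet, else 0.
    Returns the concentrator output address of each active input and
    None for idle inputs."""
    # Exclusive prefix sum: number of active packets strictly above input i.
    above = [0] + list(accumulate(activity))[:-1]
    return [s if bit else None for bit, s in zip(activity, above)]

addresses = running_adder_addresses([0, 1, 1, 0, 1, 0, 0, 1])
print(addresses)
```

The active packets receive the adjacent addresses 0, 1, 2, 3 in top-to-bottom order, which is the compact pattern the concentrator expects.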

The structure of a concentrator, as shown in Fig. 3(b), consists basically of an interconnection of several reversed banyan networks. A reversed banyan network is a mirror image of a regular banyan network in which the inputs and outputs are reversed. The $n$ inputs to the concentrator are partitioned into $n/r$ groups of $r$ inputs each, and each group is fed into an $r \times r$ reversed banyan network. The corresponding outputs of the reversed banyan networks (from top to bottom) are connected to common buses, resulting in exactly $r$ output ports. Given at most $r$ packets to each output group, the concentrator is nonblocking if the routing at the $k$th stage is done according to the $k$th least significant bit (as opposed to the $k$th most significant bit in the regular banyan network) [11], [12].

The address-generation scheme discussed above is top-to-bottom, and when there are fewer than $r$ packets, the packets tend to concentrate on the upper portion of the outputs. If we want to distribute the packets more evenly over the outputs (e.g., when the switch modules at the next stage are input-buffered), the address-generation scheme above can be easily modified by adding an offset quantity $a$, $0 \le a \le r - 1$, to the sum of the activity bits of the upper inputs. That is, if $S$ is the sum of the activity bits of the upper inputs, the generated output address will be $a + S \pmod{r}$. It can be shown (using Theorem 5 and Theorem 13 in [11]) that the same routing scheme is still nonblocking. The $a$ value can then be varied from time slot to time slot in order to distribute the packets more evenly among the outputs.
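The offset addressing rule and the nonblocking claim can be explored numerically with the sketch below. It rests on a modeling assumption that is not taken from the paper: a single $r \times r$ network is treated as a butterfly in which stage $k$ overwrites one bit of a packet's line index with the matching destination bit, so a conflict occurs when two packets occupy the same line after some stage. The names `offset_addresses` and `has_conflict` are hypothetical, and the check is an illustration rather than a substitute for the theorems cited in [11].

```python
def offset_addresses(activity, a, r):
    """Output address (a + S) mod r for each active input, where S counts
    the active inputs above it; idle inputs get None."""
    out, s = [], 0
    for bit in activity:
        out.append((a + s) % r if bit else None)
        s += bit
    return out

def has_conflict(activity, a, lsb_first=True):
    """Return True if two packets share an internal line at some stage.
    Model assumption: stage k fixes the k-th least significant bit of the
    line index (lsb_first) or the k-th most significant bit (otherwise)."""
    r = len(activity)
    bits = r.bit_length() - 1          # r is assumed to be a power of 2
    dst = offset_addresses(activity, a, r)
    packets = [(src, d) for src, d in enumerate(dst) if d is not None]
    for k in range(1, bits + 1):
        fixed = (1 << k) - 1           # bits already set by stages 1..k
        if not lsb_first:
            fixed <<= bits - k
        lines = [(src & ~fixed) | (d & fixed) for src, d in packets]
        if len(set(lines)) < len(lines):
            return True
    return False

print(has_conflict([1, 0, 1, 0], a=0, lsb_first=True))
print(has_conflict([1, 0, 1, 0], a=0, lsb_first=False))
```

In this model, least-significant-bit-first routing stays conflict-free for every offset $a$, while most-significant-bit-first routing already collides for two packets on inputs 0 and 2 of a $4 \times 4$ network.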

The expansion-concentration network just described eliminates the sorting requirement of a Batcher-banyan switch design. This is accomplished, however, through the addition of address generators. As in the Batcher-banyan design, a control scheme is required to make sure no more than $r$ packets for the same output group enter the switch simultaneously.

Input-buffered switch modules for the second and third stages can be implemented based on modifications of the above designs. Because of channel grouping on the input ports, however, additional control mechanisms will be required to maintain the FIFO property of the modules. These FIFO-guaranteeing mechanisms will not be discussed in this paper.

Packet Dropping

Packet-dropping switch modules can be designed as above, except that buffers at the inputs are omitted. For illustration, consider a packet-dropping switch module at the second stage with dimensions $hr \times hr$. If we use the Batcher-banyan design described earlier in this section, then $n \to hr$, $g \to h$, and $s \to 1$. In particular, the banyan network is $hr \times hr$ and not expanded, but as before, the last $\log_2 r$ stages of the banyan network can be omitted.

Output Queueing

Output-buffered switch modules can be implemented based on modifications of the second design of the input-buffered switch modules. We first consider an output-buffered switch module at the third stage.

Figure 4(a) shows the general structure of the switch module. There are $gr$ expansion networks, one for each input. The $\log_2 n$ address bits of a packet are used to route the packet to one output of the expansion network. The corresponding outputs of the expansion networks are collected and fed to the logical FIFO output queues, which simply clear the buffered packets on a first-in-first-out basis.

To implement a logical FIFO output queue, an approach like the one used in the Knockout Switch [2] can be adopted, in which a concentrator, shifter, and multiple packet buffers are used. In particular, the concentrator is used to reduce the number of inputs that need to be buffered simultaneously. Here, we follow the same general approach, but propose a different structure to replace the concentrator and shifter, which yields a smaller count of switch elements.

A schematic diagram for the FIFO output queue is shown in Fig. 4(b). The $n$ input lines are concentrated to $L$ lines by a reversed-banyan concentrator. At any given time slot, a packet is selected from one of the $L$ buffers and transmitted to the output over a common bus. To maintain the FIFO property, packets at the buffers exit the switch in a round-robin fashion, proceeding from top to bottom and wrapping back. Packets also enter the buffers in round-robin fashion, from top to bottom, so that packets from earlier time slots always precede packets from later time slots. To ensure this, a running-adder address generator is cascaded with a reversed banyan network. The running-adder address generator assigns each packet an output address which is equal to $a + S \pmod{L}$, where $0 \le a, S \le L - 1$. Here, $a$ is set to the tail-of-queue (TOQ), i.e., the buffer just below the one entered by the last packet in the previous time slot. $S$ is the total number of active packets at inputs above the input of the packet concerned. The quantity $a$ in the next time slot is simply the sum of $a$ and the number of packets in the current time slot. The number of routing switch cells required for each reversed-banyan concentrator is the sum of those in the $gr/L$ banyan networks of size $L \times L$ (i.e., $n \to gr$, $r \to L$ in Fig. 3(b)). This is $(gr/2) \log_2 L$, which is of smaller order than $grL$ in the Knockout Switch [2].

Figure 4: (a) Implementation of a third-stage output-buffered switch module, (b) Implementation of a logical FIFO queue at the third stage.
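The round-robin entry and exit discipline of the logical FIFO output queue can be sketched in software as follows. The class `LogicalFifoQueue`, its method names, and the unbounded buffer depth are assumptions made for illustration; the concentrator hardware is abstracted away, and each call to `enqueue_slot` is assumed to carry at most $L$ packets.

```python
from collections import deque

class LogicalFifoQueue:
    """Sketch of L parallel buffers that together behave as one FIFO queue."""

    def __init__(self, L):
        self.L = L
        self.buffers = [deque() for _ in range(L)]
        self.toq = 0   # tail-of-queue: buffer that receives the next packet
        self.hoq = 0   # head-of-queue: buffer that is served next

    def enqueue_slot(self, packets):
        """packets: arrivals of one time slot, listed top to bottom."""
        for s, pkt in enumerate(packets):   # s plays the role of S
            self.buffers[(self.toq + s) % self.L].append(pkt)
        # The offset a for the next slot is a plus the number of arrivals.
        self.toq = (self.toq + len(packets)) % self.L

    def dequeue(self):
        """Serve one packet in round-robin order, or None if the queue is empty."""
        if not self.buffers[self.hoq]:
            return None
        pkt = self.buffers[self.hoq].popleft()
        self.hoq = (self.hoq + 1) % self.L
        return pkt

q = LogicalFifoQueue(4)
q.enqueue_slot(["a", "b", "c"])
first = q.dequeue()
q.enqueue_slot(["d", "e", "f"])
rest = [q.dequeue() for _ in range(6)]
print(first, rest)
```

Because the k-th packet ever accepted always lands in buffer k mod L and is read from the same position, the arrival order is preserved even though no single buffer sees all the traffic.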

Under conditions of homogeneous traffic, it is unlikely (with

probability

<

that there will be more than

8 new incoming packets for the same output in a given time slot, regardless of $g'$ [2]. In practice, we can take into account an occasional surge in instantaneous traffic by fixing $L$ at 16. The buffers do not have to be very deep, either. Reference [2] shows that it is sufficient to have enough storage for 40 packets per output. Thus, for $L = 16$, each buffer needs to be only 3 packets deep ($\lceil 40/16 \rceil = 3$). In general, the buffering requirement in output-buffered switch designs is no worse than that in input-buffered switch designs.

Figure 5: Implementation of a logical FIFO queue at the second stage.

For output-queueing switch modules at the first stage, $gr \to n$ and $n \to g$ in Fig. 4(a). Similarly, for output-queueing switch modules at the second stage, $gr \to hr$ and $n \to h$ in the figure. In either case, there is channel grouping on the outputs, so the logical FIFO queue structure in Fig. 4(b) must be modified. Suppose the channel group size is given by $r = L/2^i$ for some $i \in \{0, 1, 2, \ldots\}$, and suppose we want to preserve the order of packets in the logical FIFO queue by transmitting the first packet on the top output, the second packet on the next output, and so on. The incentive for this is that the switch module at the next stage will then know the implicit sequence of the simultaneously received packets, and it will put the packets in the FIFO output buffers according to their implicit sequence. In this way, the sequence integrity of packets is preserved even when multiple stages of output-buffered switch modules are cascaded together.

Figure 5 illustrates a way of achieving this at the second stage by cascading an address generator and a banyan concentrator after the buffers. The banyan concentrator is similar to that in Fig. 3(b), except here we have $2^i$ banyan networks of size $r \times r$ connected by common buses. Furthermore, both regular and reversed banyan networks can be used here. The head-of-queue (HOQ) is the input containing the first packet of the logical FIFO queue. To map the first $r$ packets to the $r$ outputs of the next concentrator, the address generator performs the following mapping: output address = input address $-$ HOQ (mod $L$). Only packets with output addresses in the range of 0 to $r - 1$ will be transmitted through the concentrator. The HOQ in the next time slot is simply HOQ $+ \min(S, r)$ (mod $L$), where $S$ is the sum of the activity bits of the $L$ head-of-line packets in the $L$ buffers, so that the HOQ advances by exactly the number of packets transmitted. For the desired cyclic mapping, this banyan concentrator is nonblocking [11].
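The cyclic mapping used at the second stage can be sketched as follows. The function `transmit_slot` is a hypothetical helper; it assumes that the occupied buffers form a cyclically contiguous run starting at HOQ (which follows from round-robin entry), and it advances HOQ by the number of packets actually transmitted, i.e. the smaller of $S$ and $r$.

```python
def transmit_slot(active, hoq, r):
    """active[i] is 1 if buffer i holds a head-of-line packet.
    Returns ({buffer index: output line}, HOQ for the next time slot)."""
    L = len(active)
    sent = {}
    for i, bit in enumerate(active):
        out = (i - hoq) % L      # output address = input address - HOQ (mod L)
        if bit and out < r:      # only addresses 0..r-1 pass the concentrator
            sent[i] = out
    s = sum(active)              # S: number of head-of-line packets
    next_hoq = (hoq + min(s, r)) % L
    return sent, next_hoq

# L = 8 buffers, channel group size r = 4, packets waiting in buffers 6, 7, 0, 1, 2.
sent, next_hoq = transmit_slot([1, 1, 1, 0, 0, 0, 1, 1], hoq=6, r=4)
print(sent, next_hoq)
```

In this example the four oldest packets leave on output lines 0 to 3 in queue order, and the new HOQ points at buffer 2, which holds the oldest packet still waiting.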

The reversed banyan networks in Fig. 4(b) and Fig. 5 may be difficult to implement when $gr$ and $hr$ are large. Figure 6 shows a decomposition method for solving this problem, using a third-stage module as an example.

IV. Switch Performance

We now consider the performance of the 3-stage switch architecture in order to derive reasonable ranges for the various switch parameters. Only the results of our analyses and simulations will be presented here.

To determine the merits of the 3-stage switch, we consider the number of interconnections between modules, and performance


References

Sorting networks and their applications

Access and Alignment of Data in an Array Processor

Queueing in high-performance packet switching

The Knockout Switch: a simple, modular architecture for high-performance packet switching

A Broadband Packet Switch for Integrated Transport
