
A three-stage ATM switch with cell-level path allocation

01 Jun 1997-IEEE Transactions on Communications (Institute of Electrical and Electronics Engineers)-Vol. 45, Iss: 6, pp 701-709
TL;DR: A method is described for performing routing in three-stage asynchronous transfer mode (ATM) switches which feature multiple channels between the switch modules in adjacent stages, which allows cell-level routing to be performed, whereby routes are updated in each time slot.
Abstract: A method is described for performing routing in three-stage asynchronous transfer mode (ATM) switches which feature multiple channels between the switch modules in adjacent stages. The method is suited to hardware implementation using parallelism to achieve a very short execution time. This allows cell-level routing to be performed, whereby routes are updated in each time slot. The algorithm allows a contention-free routing to be performed, so that buffering is not required in the intermediate stage. An algorithm with this property, which preserves the cell sequence, is referred to as a path allocation algorithm. A detailed description of the necessary hardware is presented. This hardware uses a novel circuit to count the number of cells requesting each output module, it allocates a path through the intermediate stage of the switch to each cell, and it generates a routing tag for each cell, indicating the path assigned to it. The method of routing tag assignment described employs a nonblocking copy network. The use of highly parallel hardware reduces the clock rate required of the circuitry, for a given switch size. The performance of ATM switches using this path allocation algorithm has been evaluated by simulation, and is described.

Summary (4 min read)

I. INTRODUCTION

  • The throughput achievable (in bits/second) in an asynchronous transfer mode (ATM) switch depends heavily on the process used to fabricate it.
  • Some method of routing is then necessary, to select among the available paths from source to destination, through the second stage of the switch.
  • In one approach (call-level routing), all cells belonging to a virtual connection ("call") are allocated the same route.
  • The algorithm described here requires fewer iterations than that in [6], does not require input buffering (which degrades the throughput), unlike [7], and is fairer than that presented in [5], in addition to readily supporting intermediate channel grouping.

A. The Objectives of a Path Allocation Algorithm

  • There are routes from each input module to each intermediate module.
  • There are routes from each intermediate module to each output module.
  • The authors must choose, for every input cell (if possible) an intermediate switch module through which to pass on the way to the selected destination, such that no input module attempts to route more than cells via any intermediate module, and no intermediate module attempts to route more than cells to any output module, in any one time slot.
  • It will be assumed, for simplicity, that all input ports of the switch operate at the same rate, and thus that the duration of the time slot (the interval between successive cell boundaries) is the same for every cell.

B. Basic Principles of the Path Allocation Algorithm

  • A new and efficient algorithm will now be described.
  • Note that and need only be local to the input module.
  • The procedure determines the capacity available from input module to output module via intermediate switch module (i.e., the minimum of and .
  • The number of requests which can be satisfied is equal to the minimum of the number of requests outstanding and the available capacity.
  • A parallel implementation requires multiple processors, each executing the procedure for a different set of procedure parameters, subject to the following constraints: no two processors shall simultaneously require access to the same quantity.

C. Implementation of the Algorithm

  • Suppose that there are modules in each stage of the switch.
  • The processor in the th row (numbered from the right) and th column (numbered from the bottom) of the array is labeled .
  • The values stored in the processor array are shown in Fig. 2 (a) for the case where .
  • The algorithm then requires iterations (iterations zero through .
  • Specifically, processor is initialized as follows: otherwise.

otherwise.

  • An examination of the operation of the resulting algorithm reveals that the processors in row or higher and in column or above never modify the and values they receive, and thus may be replaced by simple delays.
  • If , each column requires additional registers.
  • Hence a relatively high clock speed will be required in the array, so as to complete iterations of the algorithm in the time available (which is less than the duration of one time slot).
  • A switch with intermediate channel grouping affords the possibility of reducing cell loss probability by increasing and , rather than by increasing Thus, the proposed algorithm is fairer than that described in [5] .

D. Implementation Issues

  • The processor must execute the procedure, and thus must perform two types of operation: 1) find the minimum of three numbers; 2) perform three subtractions.
  • The and values are obtained from (and forwarded to) adjacent processors.
  • A fast implementation using bit-serial arithmetic, and which does not require the calculation of the minimum of three numbers, was described in [10] .
  • The input and output port controllers must perform the necessary bit rate adaptation (and multiplexing/demultiplexing) for links operating at other rates, so that cells traverse the switch fabric at a common rate.
  • This requires the path allocation algorithm to preferentially allocate paths to cells with the CLP bit set to zero.

III. A FAST METHOD OF REQUEST COUNTING

  • Suitable hardware to simultaneously calculate (the number of requests from input module for output module for all values of will now be described.
  • The execution time for this hardware is clock cycles.
  • Under these circumstances, it may readily be shown that where is the number of data cells requesting output module , and is fixed, since the Batcher network processes only requests from input module .
  • A total of control packets is thus simultaneously launched into the concentrator, and these are routed to the serial adders at outputs zero through without blocking.
  • The concentrated list of values is then read by these serial adders, the lower input (as shown in Fig. 5 ) being inverted.

A. Principles of Operation

  • The processor generates a sequence of values, one after every iteration of the path allocation algorithm, commencing with (the initial value of determined by the request counting hardware) and decrementing, after every iteration, in accordance with the procedure, as paths are allocated to cells.
  • Thus represents the number of outstanding requests from input module for output module .
  • When the path allocation process is complete, a special null token is broadcast to the cells which have lost contention.
  • During each iteration of the algorithm, submits a routing packet to the network, to be broadcast to address generators through containing in the data field the token address, i.e., the address of the intermediate switch module through which a route has been allocated.
  • Two bits (one each from the upper and lower address), in addition to the activity bit, must be processed at each node of the network.

TABLE I PATTERN OF REQUESTS AND POSSIBLE OUTCOME OF PATH ALLOCATION PROCESS

  • Changes after the first iteration of the algorithm [16] .
  • Hence, on subsequent iterations of the algorithm, there is no need to distribute the lower address, so that the header on the routing packet may be shortened, reducing the delay through the copy network.

B. An Example of Routing Tag Assignment

  • Table I indicates the number of cells from input module 0 which have requested each of the four output modules and a possible pattern of path allocations which might be generated by the processors.
  • The copy network must be initialized before path allocation commences.
  • After each iteration of the path allocation algorithm (i.e., iterations 0, 1, 2 and 3), the corresponding iteration of the routing tag assignment algorithm is performed (iterations and respectively).
  • Also shown are the lower address bits processed by each switch element.
  • The token address is not broadcast, except during the first iteration.

V. PERFORMANCE OF THE PATH ALLOCATION ALGORITHM

  • The performance of a three-stage switch using the cell-level path allocation algorithm described above will now be evaluated.
  • The simulation model is based on the following assumptions.
  • 3) The destination of each cell is drawn from a uniform distribution; all output modules receive the same load.
  • The probability of an individual cell being lost is obviously much less, but cannot be evaluated without knowing how the probability of a given cell losing contention, and the corresponding probabilities for the cells with which it contends, are correlated.
  • These graphs can be used to find the maximum number of input ports which a switch with a given capacity in the intermediate stage can support, for a given probability of cell loss during path allocation.

VI. A DESIGN EXAMPLE

  • The resulting switch has a cell loss probability (due to loss of contention during path allocation) below 10 even in the presence of a nonuniform load [14] .
  • The input modules must accept data from the address generators in Fig. 4 , and so must have 128 inputs, even though at most 96 data cells will be present.
  • One execution of the procedure will require nine clock cycles, using the efficient implementation described in [10] .
  • The number of processors required is 1024 (32 × 32), but the IC count should be relatively low because of the simplicity of the processor design.
  • The complexity of the path allocation circuitry is relatively high, but the switch modules in the first and second stages are of simple design, because of the avoidance of output contention.

VII. CONCLUSIONS

  • A new algorithm for path allocation in three-stage broadband networks has been described.
  • A complete hardware implementation of this algorithm has been presented, including a method for generating the initial data required by the algorithm, and for forwarding the results to each cell at the input side of the switch, in the form of a routing tag.
  • The operating speed required of the design appears within the capabilities of VLSI technology in the short term.
  • The resulting switch offers the delay performance of an output-buffered switch, unlike either three-stage switches featuring call-level routing, which buffer the cells at each stage, or those featuring input buffers.
  • It avoids the fairness problem intrinsic to the "cell scheduling" algorithm of the Growable Packet Switch [5] .


IEEE TRANSACTIONS ON COMMUNICATIONS, VOL. 45, NO. 6, JUNE 1997 701
A Three-Stage ATM Switch
with Cell-Level Path Allocation
Martin Collier, Member, IEEE
Abstract: A method is described for performing routing in
three-stage asynchronous transfer mode (ATM) switches which
feature multiple channels between the switch modules in adjacent
stages. The method is suited to hardware implementation using
parallelism to achieve a very short execution time. This allows
cell-level routing to be performed, whereby routes are updated in
each time slot. The algorithm allows a contention-free routing to
be performed, so that buffering is not required in the intermediate
stage. An algorithm with this property, which preserves the cell
sequence, is referred to here as a path allocation algorithm.
A detailed description of the necessary hardware is presented.
This hardware uses a novel circuit to count the number of cells
requesting each output module, it allocates a path through the
intermediate stage of the switch to each cell, and it generates
a routing tag for each cell, indicating the path assigned to
it. The method of routing tag assignment described employs a
nonblocking copy network. The use of highly parallel hardware
reduces the clock rate required of the circuitry, for a given switch
size. The performance of ATM switches using this path allocation
algorithm has been evaluated by simulation, and is described
here.
Index Terms— Asynchronous transfer mode, communication
switching, communication system routing.
I. INTRODUCTION
THE THROUGHPUT achievable (in bits/second) in an asynchronous
transfer mode (ATM) switch depends heavily on the process used
to fabricate it. For example, Bianchini and Kim [1] have described
a single-board switch prototype with 155-Mb/s link rate and a
throughput of 2.48 Gb/s, constructed using “off-the-shelf”
integrated circuits and PLD’s. Collivignarelli et al. [2] have
described a 16 × 16 switch chip with a 311-Mb/s link rate (and
hence, with a throughput close to 5 Gb/s) fabricated using a
0.8-μm BiCMOS process, which dissipates 7 W. Merayo et al. [3]
have reported a switch with a 10-Gb/s throughput and a 2.5-Gb/s
link rate, using a 0.7-μm BiCMOS process and requiring
approximately twenty chips. Hino et al. [4] have developed a
4 × 4 switching element (for a rerouting banyan network) with
link rates of 10 Gb/s using a 0.2-μm GaAs MESFET technology.
The power dissipated by this switch (some 30 W) necessitates its
implementation on three integrated circuits.
It may be concluded, from the results reported above,
which are typical of the current state of the art, that the
tradeoffs to be performed between circuit complexity, power
dissipation and process cost in designing ATM switches are
such as to restrict single-chip and single-board switch fabrics
to throughputs below perhaps 40 Gb/s for the foreseeable
future, even when using leading-edge (and thus expensive) IC
technologies. Hence, a large switch fabric (i.e., a switch with
a throughput exceeding, say, 200 Gb/s) will require a modular
architecture, allowing the switch fabric to be distributed across
multiple boards or cabinets.
Paper approved by G. P. O’Reilly, the Editor for Communications
Switching of the IEEE Communications Society. Manuscript received
July 3, 1995; revised December 1, 1995.
The author is with the School of Electronic Engineering, Dublin City
University, Glasnevin, Dublin 9, Ireland.
Publisher Item Identifier S 0090-6778(97)04172-X.
An obvious method of implementing a large switch, given
these constraints, is to design the switch with three stages,
where each stage consists of smaller switch modules. Many
authors have proposed such switches [5]–[9]. This approach
typically introduces a new problem (not present in a single-
stage switch) whereby multiple paths from source to desti-
nation become available. Thus even if the individual switch
modules possess the self-routing feature, this feature is not
retained by the overall switch. Some method of routing is then
necessary, to select among the available paths from source to
destination, through the second stage of the switch.
Routing may be performed over a number of time scales.
In one approach (call-level routing), all cells belonging to a
virtual connection (“call”) are allocated the same route. Thus
the routing decision is made at connection setup time, and
this route is fixed for the duration of the connection. Cell-
level routing is performed if the routing decision is made
independently in each time slot. The process of determining
a routing pattern such that no blocking can occur in the
second stage of the switch is referred to here as cell-level
path allocation.
This paper considers cell-level path allocation, and, specif-
ically, the problem of implementing a cell-level algorithm for
path allocation in the channel-grouped three stage network of
Fig. 1. This is an
switch, with , and
modules in the input, intermediate and output stages, respec-
tively. There are
links in the channel group connecting input
and intermediate stage modules, and
links in the channel
group connecting intermediate and output stage modules. The
use of channel grouping allows additional flexibility when
dimensioning the three-stage switch. Cell-level path allocation
has been proposed by a number of authors [5]–[7]. The
algorithm described here requires fewer iterations than that
in [6], does not require input buffering (which degrades the
throughput), unlike [7], and is fairer than that presented in
[5], in addition to readily supporting intermediate channel
grouping.
The path allocation algorithm and the hardware necessary
to implement it are described in Section II of this paper.
The algorithm requires ancillary hardware to count incoming
cells and to deliver routing tags to them. Suitable hardware
is described in Sections III and IV of this paper. The switch
performance is discussed in Section V.
Fig. 1. A three-stage switch with intermediate channel grouping.
II. AN ALGORITHM FOR PATH ALLOCATION AT CELL LEVEL
A. The Objectives of a Path Allocation Algorithm
There are
routes from each input module to each inter-
mediate module. There are
routes from each intermediate
module to each output module. We must choose, for every
input cell (if possible) an intermediate switch module through
which to pass on the way to the selected destination, such that
no input module attempts to route more than
cells via any
intermediate module, and no intermediate module attempts to
route more than
cells to any output module, in any one
time slot. This strategy ensures that:
1) the intermediate stage can never be congested;
2) no queueing occurs in the intermediate stage; thus the
delay through the intermediate stage is uniform, regard-
less of the path taken; this makes it possible to preserve
cell sequence on a virtual connection;
3) contention can never occur in the intermediate stage,
simplifying its design.
An algorithm to implement this strategy will now be de-
scribed. It will be assumed, for simplicity, that all input ports of
the switch operate at the same rate, and thus that the duration of
the time slot (the interval between successive cell boundaries)
is the same for every cell.
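
As an illustrative sketch only, the two constraints stated above can be written as a check on a proposed allocation for one time slot. The names `routes`, `in_links` and `out_links` are assumptions introduced for this example (the paper's own symbols for the channel group sizes are not reproduced here): `routes[i][j][k]` is taken to be the number of cells sent from input module i to output module j via intermediate module k.

```python
def is_contention_free(routes, in_links, out_links):
    """Return True if no channel group is oversubscribed in this time slot.

    routes[i][j][k]: cells from input module i to output module j routed
    via intermediate module k (hypothetical layout chosen for this sketch).
    in_links:  channels per group between input and intermediate modules.
    out_links: channels per group between intermediate and output modules.
    """
    n_in = len(routes)
    n_out = len(routes[0])
    m = len(routes[0][0])
    for k in range(m):
        # Load offered by each input module to intermediate module k.
        for i in range(n_in):
            if sum(routes[i][j][k] for j in range(n_out)) > in_links:
                return False
        # Load offered by intermediate module k to each output module.
        for j in range(n_out):
            if sum(routes[i][j][k] for i in range(n_in)) > out_links:
                return False
    return True
```

An allocation that passes this check needs no buffering in the intermediate stage, which is the property the strategy above is designed to guarantee.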
B. Basic Principles of the Path Allocation Algorithm
A new and efficient algorithm will now be described. It is
suitable for use in a channel-grouped three-stage switch and
requires only knowledge obtainable at the input side of the
switch. It operates on the following quantities:
number of channels available from input module to
intermediate switch module
number of channels available from intermediate switch
module
to output module
number of requests from input module for output
module
.
Fig. 2. Examples of the processor array (a) showing contents of processors
during Iteration Zero ($L_1 = L_2 = m = 4$) and (b) showing initial conditions
for $L_1 = 2$, $m = 4$, and $L_2 = 3$.
Note that and need only be local to the input
module. The
’s must be forwarded to each input module
in turn. Let
be the number of cells to be routed from
input module
to output module via intermediate switch
module
The values of and are updated using
the procedure
described below:
This procedure is “atomic” in the sense that it is the basic
building block from which the path allocation algorithm is
constructed. The procedure determines the capacity available
from input module
to output module via intermediate
switch module
(i.e., the minimum of and . The
number of requests which can be satisfied is equal to the
minimum of the number of requests outstanding
and
the available capacity.
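
The behaviour just described can be sketched in a few lines. This is a minimal illustration rather than the paper's own listing, which is not reproduced in this text; the argument names `a`, `b` and `r` are assumptions standing for the channels still available on the input-side group, the channels still available on the output-side group, and the outstanding requests.

```python
def atomic(a, b, r):
    """One execution of the basic allocation step (illustrative sketch).

    a: channels still free from the input module to the intermediate module
    b: channels still free from the intermediate module to the output module
    r: outstanding requests from the input module for the output module
    Returns the updated (a, b, r) and the number of paths granted.
    """
    # Capacity via this intermediate module is min(a, b); the number of
    # requests that can be satisfied is the smaller of that and r.
    granted = min(a, b, r)
    # All three quantities are reduced by the number of paths granted.
    return a - granted, b - granted, r - granted, granted
```

For example, `atomic(2, 1, 3)` grants one path, because the output-side group has only one free channel, and leaves two requests outstanding.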
A parallel implementation requires multiple processors, each
executing the
procedure for a different set of
procedure parameters, subject to the following constraints:
no two processors shall simultaneously require
access to the same quantity. For example,
uses and so that neither
nor
can be executed concurrently with for
any
;
the data required by a processor for the next iteration of
the algorithm should be available locally, or from adjacent
processors.
An implementation satisfying these two constraints will now
be presented.
Fig. 3. Implementation of the atomic() processor. Min: Calculator of minimum;
$D_x$: Delay (needed to synchronise arrival times; may be zero).
C. Implementation of the Algorithm
Suppose that there are
modules in each stage of the
switch. An array of
processors is used. The processor
in the
th row (numbered from the right) and th column
(numbered from the bottom) of the array is labeled
.
Processor
is initialized by loading the following three
values:
1) initial value of
;
2) initial value of
(i.e., ;
3) initial value of
(i.e., .
The values stored in the processor array are shown in
Fig. 2(a) for the case where
.
The algorithm then requires
iterations (iterations zero
through
. Processor executes
during iteration ; after each iteration
forwards the updated value of to and of
to , and retains .
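
The iteration scheme can be illustrated with a small software model. This is a sketch under stated assumptions, not the paper's array: the index rule `k = (i + j + t) % m` is one assignment that satisfies the no-shared-access constraint (the exact rule is not recoverable from this text), and the names `allocate_paths`, `in_links` and `out_links` are hypothetical.

```python
def allocate_paths(requests, m, in_links, out_links):
    """Model m iterations of a conflict-free parallel path allocation.

    requests[i][j]: cells at input module i requesting output module j.
    m: number of intermediate modules (assumed >= both module counts).
    Returns (routes, leftover): routes[i][j][k] paths granted via
    intermediate module k, and the requests left unsatisfied.
    """
    n_in, n_out = len(requests), len(requests[0])
    if n_in > m or n_out > m:
        raise ValueError("this sketch assumes at most m modules per stage")
    a = [[in_links] * m for _ in range(n_in)]      # a[i][k]: free channels
    b = [[out_links] * n_out for _ in range(m)]    # b[k][j]: free channels
    r = [row[:] for row in requests]               # outstanding requests
    routes = [[[0] * m for _ in range(n_out)] for _ in range(n_in)]
    for t in range(m):
        used_a, used_b = set(), set()
        for i in range(n_in):
            for j in range(n_out):
                # Assumed schedule: each processor tries a different
                # intermediate module in every iteration.
                k = (i + j + t) % m
                # No two processors may touch the same quantity at once.
                assert (i, k) not in used_a and (k, j) not in used_b
                used_a.add((i, k))
                used_b.add((k, j))
                granted = min(a[i][k], b[k][j], r[i][j])
                a[i][k] -= granted
                b[k][j] -= granted
                r[i][j] -= granted
                routes[i][j][k] += granted
    return routes, r
```

With this rule every processor considers each intermediate module exactly once over the m iterations, so any request left in `leftover` corresponds to a cell that loses contention.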
If we choose
the same algorithm
may be used for a switch with an arbitrary number of modules
in each stage. Suppose that a square array of
processors is used. Some of the processor registers must be
initialized to zero if their contents pertain to a nonexistent
switch module. Specifically, processor
is initialized as
follows:
otherwise.
otherwise.
otherwise.
where
.
An examination of the operation of the resulting algorithm
reveals that the processors in row
or higher and in column
or above never modify the and values they receive,
and thus may be replaced by simple delays.
In general, a switch with
input modules and output
modules requires a processor array with
rows and
columns. If , each column requires additional
registers. If , each row requires additional
registers. The initial conditions in the array for the case where
and are shown in Fig. 2(b).
A unichannel architecture may require a large value for
to obtain low cell loss probabilities. Hence a relatively
high clock speed will be required in the array, so as to
complete
iterations of the algorithm in the time available
(which is less than the duration of one time slot). A switch
with intermediate channel grouping affords the possibility
of reducing cell loss probability by increasing
and ,
rather than by increasing
This can reduce the clock speed
requirements. Note that, unlike the cell scheduling algorithm
in [5], this algorithm attempts to allocate a path to each cell
at the switch inputs during every iteration of the algorithm.
Thus, the proposed algorithm is fairer than that described
in [5].
D. Implementation Issues
The processor must execute the
procedure, and
thus must perform two types of operation:
1) find the minimum of three numbers;
2) perform three subtractions.
Hence, in principle, the processor may be implemented as
shown in Fig. 3. The value of
is stored locally. The
and values are obtained from (and forwarded to) adjacent
processors. The simple structure of the
processor
ensures that many copies of it may be constructed on a single
integrated circuit (IC), and also ensures that it can operate at
high speed. A fast implementation using bit-serial arithmetic,
and which does not require the calculation of the minimum of
three numbers, was described in [10].
Fig. 4. The circuitry for request counting and routing tag assignment. CG: Count generator; RPG: Routing packet generator; AG: Address generator.
Hardware is also needed in each input module to perform the
following tasks before and during the path allocation process:
to count the number of requests for each output module
so as to obtain the initial values of the
’s;
to forward a routing tag based on the results of path
allocation to each input cell.
The circuitry to implement these functions is shown in
Fig. 4. Its operation will be described in Sections III and IV.
It is assumed that cells losing contention are discarded. If
this is not the case, additional hardware will be required to
forward acknowledgments to the input port controllers, and
this circuitry will introduce an additional delay.
The switch fabric, as described above, operates at a
single rate (which will typically be the OC-3/STM-1 rate
of 155 Mb/s). The input and output port controllers must
perform the necessary bit rate adaptation (and multiplex-
ing/demultiplexing) for links operating at other rates, so
that cells traverse the switch fabric at a common rate. The
demultiplexing of incoming cell streams of high bit rate to a
number of switch fabric inputs has implications for the switch
performance (since correlations are then possible between
the arrival processes on adjacent input ports), and for cell
sequence preservation, which will be addressed in a future
paper.
The switch will be required to support multiple loss priorities
in practice. This requires the path allocation algorithm
to preferentially allocate paths to cells with the CLP bit
set to zero. The simplest way of modifying the described
algorithm to achieve this is to perform path allocation twice,
once for cells with CLP = 0, and a second time for the
cells tolerating higher loss rates, with the initialization of the
processor array being appropriately modified. However, this
approach doubles the required operating speed of the array,
which may be impractical in many cases. A less expensive
method for introducing differentials in loss probabilities is
described in [11].
III. A FAST METHOD OF REQUEST COUNTING
Suitable hardware to simultaneously calculate (the
number of requests from input module
for output module
for all values of will now be described.
The execution time for this hardware is
clock cycles. A slower solution, requiring less hardware, was
described in [12].
The hardware required is shown in Fig. 5. Data cells from
the
input ports associated with input module are merged
with
control packets (one per output module) by a Batcher
sorting network. The merge operation is performed in such a
way that idle cells (i.e., empty cells from inactive input ports)
are sorted to the highest output ports of the Batcher network.
If the control packet for output module
appears at output
of the Batcher network, then the data cells (if any) requesting
that output module appear at lower output ports of the sorter
(ports
etc.), as shown in Fig. 5.
Under these circumstances, it may readily be shown that
where is the number of data cells requesting output
module
, and is fixed, since the Batcher network processes
only requests from input module
.
The key to this method of request counting is the obser-
vation that
The necessary subtraction can be performed very efficiently,
since
where is the 1’s complement of obtained by bitwise
inversion of
It follows that the value of can be
generated using a serial adder, and can then be stored in the
register of the appropriate processor (i.e., of
It is necessary to generate a concentrated list of the values
as input data for the serial adders.
These values are obviously available at the sorter outputs
which have received control packets (since, for example, con-
trol packet 4 appears at output
, but are not concentrated
onto contiguous outputs. Hence a concentrator is required. This
is the purpose of the binary self-routing network shown in
Fig. 5, which is often called the “reverse banyan” [13]. A
well-known property of this network is that it is nonblocking
when acting as a concentrator. A formal proof that blocking
cannot occur in Fig. 5 was given in [14].
Fig. 5. An example of request counting. CG: Count generator.
The count generators forward only control packets to this
network. Count generators which have received a data cell
or an idle cell through the Batcher network submit an inactive
packet to the concentrator. The count generator which receives
control packet
from output of the Batcher network
appends a data field to the packet containing the value of
This packet is then routed to output of the concentrator.
A total of
control packets is thus simultaneously launched
into the concentrator, and these are routed to the serial adders
at outputs zero through
without blocking.
The concentrated list of
values is then read by these
serial adders, the lower input (as shown in Fig. 5) being
inverted. Hence the
values are generated, and passed to the
processors. The example considered in Fig. 5 shows
three requests for output module zero, two for output module
one, and none for output module two. It can be seen that
the correct values (i.e., 3, 2 and 0) are returned to processors
and , respectively.
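
The counting trick can be sketched in software. The example below rests on assumptions: it takes the sorter output port of each control packet as input, assumes the data cells for a module sit immediately below its control packet, and uses the standard identity that subtracting and then subtracting one more equals adding the 1's complement. The port values 3, 6 and 7 are inferred from the three-module example above, not read from Fig. 5, and the function name is hypothetical.

```python
def count_requests(control_ports, width):
    """Recover per-module request counts from control packet positions.

    control_ports[j]: sorter output port at which the control packet for
    output module j appears (assumed to follow that module's data cells).
    width: word length in bits of the modelled serial adder.
    """
    mask = (1 << width) - 1
    counts = []
    prev = -1  # virtual control packet just below port 0
    for port in control_ports:
        # port - prev - 1 computed as port + (1's complement of prev),
        # truncated to the adder word length.
        counts.append((port + (~prev & mask)) & mask)
        prev = port
    return counts
```

Calling `count_requests([3, 6, 7], 4)` returns `[3, 2, 0]`, matching the three requests for output module zero, two for module one, and none for module two described above.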
The submitted packets take two cycles to propagate through
each stage of the concentrator (one cycle to identify if the
packet is active, and another to determine where to route it)
and an additional clock cycle is required before the serial adder
generates the least significant bit of the appropriate
value.
Thus the number of clock cycles required by the request count
hardware before path allocation can commence is
Hence, for a switch with and , the number
of clock cycles required is just 15.
IV. ROUTING TAG ASSIGNMENT
A. Principles of Operation
The
processor generates a sequence of
values, one after every iteration of the path allocation
algorithm, commencing with
(the initial value of
determined by the request counting hardware) and decrement-
ing, after every iteration, in accordance with the
procedure, as paths are allocated to cells. Thus represents
the number of outstanding requests from input module
for
output module
. The relevant cells must be informed of the
path through the intermediate stage which they have been
assigned. The relevant information is obtained from the
output of the processor shown in Fig. 3. After each iteration
of the
algorithm, tokens are broadcast to cells
by the circuitry for routing tag assignment. A cell may receive
multiple tokens, but only the last token it receives contains
valid routing information. When the path allocation process is
complete, a special null token is broadcast to the cells which
have lost contention. The address generator then prefixes a
routing tag to each data cell whose value equals the token
value. Cells losing contention are marked as inactive.
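
The "last token wins" rule can be illustrated with a short model. This is one scheme consistent with the description above, not a reconstruction of the hardware: it assumes that after each iteration the token is broadcast to every cell that was still outstanding before that iteration, so that cells granted a path stop receiving tokens afterwards. The function name and the `(module, granted)` pairs are hypothetical.

```python
def assign_tags(initial_requests, allocations):
    """Model token broadcasting for one (input, output) module pair.

    initial_requests: number of cells requesting this output module.
    allocations: list of (module, granted) pairs, one per iteration, giving
    the intermediate module tried and the number of paths granted.
    Returns the final tag per cell; None marks a cell that lost contention.
    """
    tags = [None] * initial_requests
    outstanding = initial_requests
    for module, granted in allocations:
        if granted > outstanding:
            raise ValueError("cannot grant more paths than requests")
        # Broadcast this iteration's token to all cells outstanding so far;
        # earlier tokens held by those cells are overwritten.
        for cell in range(outstanding):
            tags[cell] = module
        # The top `granted` cells keep this token: later broadcasts no
        # longer reach them.
        outstanding -= granted
    # Final null token for the cells that were never allocated a path.
    for cell in range(outstanding):
        tags[cell] = None
    return tags
```

For five requests and grants of 2, 0, 1 and 1 paths via intermediate modules 0 to 3, the result is `[None, 3, 2, 0, 0]`: one cell loses contention and the rest keep the last token they received.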
The broadcasting is done by the copy network shown in
Fig. 4. This must copy tokens and perform routing in such a
way that the token required by the data cell at a given Batcher
network output in Fig. 4 appears at the corresponding copy
network output, and is thus received by the correct address
generator.
The copy network has
inputs and outputs. The
routing packet generators are connected to
of the copy
network inputs, and the remaining inputs are idle. Routing
packet generator
receives the value of from the
appropriate
processor.
The cells requesting output module
appear at outputs
through of the Batcher network, where
(as before)
The routing packet generator for output module
must forward the relevant routing tokens to the data cells at
outputs
through of the Batcher network.
The value of
is readily obtainable from the request
counting hardware.

Citations
01 Jan 1998

16 citations


Additional excerpts

  • ...This is our core expertise (cf [3])....

    [...]

Journal ArticleDOI
TL;DR: A new space-division grid-based ATM architecture with fault tolerant characteristics and minimal number of switching elements (SE's) is proposed.

2 citations

Proceedings ArticleDOI
29 May 2001
TL;DR: This paper describes a technique for implementing the switch fabric of a high-speed router (with a throughput in excess of 600 Gb/s based on the current state of the art), with the following properties: delay performance is virtually identical to that of a standard output-buffered switch, and the switch fabric preserves the packet sequence, so that no resequencing is required for segmented packets.
Abstract: This paper describes a technique for implementing the switch fabric of a high-speed router (with a throughput in excess of 600 Gb/s based on the current state of the art), with the following properties. Delay performance is virtually identical to that of a standard output-buffered switch, and the switch fabric preserves the packet sequence, so that no resequencing is required for segmented packets. Clock rates are moderate except at ingress and egress points. This is achieved by distributing traffic across a number of crossbar switches operating at a low bit rate. The techniques used to resolve contention in the crossbar switches are described, and the bottlenecks limiting the capacity of the switch are discussed.
References
Journal ArticleDOI
TL;DR: A nonblocking, self-routing copy network with constant latency is proposed, capable of packet replication and switching, which is usually a serial combination of a copy network and a point-to-point switch.
Abstract: In addition to handling point-to-point connections, a broadband packet network should be able to provide multipoint communications that are required by a wide range of applications. The essential component to enhance the connection capability of a packet network is a multicast packet switch, capable of packet replication and switching, which is usually a serial combination of a copy network and a point-to-point switch. The copy network replicates input packets from various sources simultaneously, after which copies of broadcast packets are routed to their final destination by the switch. A nonblocking, self-routing copy network with constant latency is proposed. Packet replications are accomplished by an encoding process and a decoding process. The encoding process transforms the set of copy numbers, specified in the headers of incoming packets, into a set of monotone address intervals which form new packet headers. The decoding process performs the packet replication according to the Boolean interval splitting algorithm through the broadcast banyan network; the decision making is based on two-bit header information. This yields minimum complexity in the switch nodes.

387 citations

Book
02 Jan 1991

155 citations

Journal ArticleDOI
TL;DR: A growable switch architecture is presented that is based on three key principles: a generalized knockout principle exploits the statistical behaviour of packet arrivals and thereby reduces the interconnect complexity, output queuing yields the best possible delay/throughput performance, and distributed intelligence in routing packets through the interconnect fabric eliminates internal path conflicts.
Abstract: The problem of designing a large high-performance, broadband packet or ATM (asynchronous transfer mode) switch is discussed. Ways to construct arbitrarily large switches out of modest-size packet switches without sacrificing overall delay/throughput performance are presented. A growable switch architecture is presented that is based on three key principles: a generalized knockout principle exploits the statistical behaviour of packet arrivals and thereby reduces the interconnect complexity, output queuing yields the best possible delay/throughput performance, and distributed intelligence in routing packets through the interconnect fabric eliminates internal path conflicts. Features of the architecture include the guarantee of first-in-first-out packet sequence, broadcast and multicast capabilities, and compatibility with variable-length packets, which avoids the need for packet-size standardization. As a broadband ISDN example, a 2048*2048 configuration with building blocks of 42*16 packet switch modules and 128*128 interconnect modules, both of which fall within existing hardware capabilities, is presented.

145 citations


"A three-stage ATM switch with cell-..." refers background or methods in this paper

  • ...The author is currently investigating the practical implementation of the path allocation circuitry, with a view to confirming that the overall complexity of the switch is no greater than that of competing architectures, such as those in [5-9]....

    [...]

  • ...Note that, unlike the cell scheduling algorithm in [5], this algorithm attempts to allocate a path to each cell at the switch inputs during every iteration of the algorithm....

    [...]

  • ...Many authors have proposed such switches [5-9]....

    [...]

  • ...[5] K....

    [...]

  • ...Cell-level path allocation has been proposed by a number of authors [5-7]....

    [...]

Proceedings ArticleDOI
Kai Y. Eng1, Mark J. Karol1, Y.S. Yeh1
27 Nov 1989
TL;DR: A growable switch architecture is proposed based on a generalized knockout principle which exploits the statistical behavior of packet arrivals and thereby reduces the interconnect complexity and output queuing, which yields the best possible delay/throughput performance.
Abstract: The authors consider the generic problem of designing a large N*N (N>1000) high-performance, broadband packet (or asynchronous transfer mode) switch. They provide ways to construct arbitrarily large switches out of modest-size packet switches, without sacrificing overall delay/throughput performance. They propose and study a growable switch architecture based on three key principles: (a) a generalized knockout principle which exploits the statistical behavior of packet arrivals and thereby reduces the interconnect complexity; (b) output queuing, which yields the best possible delay/throughput performance; and (c) distributed intelligence in routing packets through the interconnect fabric. Other features include the guarantee of a first-in first-out packet sequence, broadcast and multicast capabilities, and compatibility with variable-length packets. In a broadband ISDN (integrated services digital network) example, the authors show a 2048*2048 switch configuration with building blocks of 42*16 packet switch modules and 128*128 interconnect modules.

132 citations


"A three-stage ATM switch with cell-..." refers background or methods in this paper

  • ...Note that, unlike the cell scheduling algorithm in [5], this algorithm attempts to allocate a path to each cell at the switch inputs during every iteration of the algorithm....

    [...]

  • ...The author is currently investigating the practical implementation of the path allocation circuitry, with a view to confirming that the overall complexity of the switch is no greater than that of competing architectures, such as those in [5]–[9]....

    [...]

  • ...It avoids the fairness problem intrinsic to the “cell scheduling” algorithm of the Growable Packet Switch [5]....

    [...]

  • ...[5], in addition to readily supporting intermediate channel grouping....

    [...]

  • ...has been proposed by a number of authors [5]–[7]....

    [...]

Journal ArticleDOI
TL;DR: Owing to novel parallel structures inside the switch element, VLSI implementation is possible for transmission rates on the order of a gigabit per second per port and for a switch in a single-stage configuration as well as for the case of a three-stage switch fabric.
Abstract: This paper presents the architecture of a very high-speed VLSI packet switch and its performance. The switch, called PRIZMA, is suited for broadband telecommunications, based on ATM, the Asynchronous Transfer Mode. However, the concept is not restricted to ATM-oriented architectural environments. There may be applications within private networks, independent of whether they are ATM-based. There may also be other potential applications such as multiprocessor interconnection. The architecture of the PRIZMA switch follows the architecture of its lower-speed earlier version (H. Ahmadi et al., Int. J. Digital Analog Cabled Syst. 2 (4) (1989) 277–287) to a large degree: It is based on a single-chip switch element that exploits the performance advantage of output queuing and from which larger, self-routing single-stage or multistage switch fabrics can be constructed in a modular way. However, compared to the precursor, higher performance is achieved by output queues that now are configured as a dynamically shared memory. This shared memory can also be expanded by linking multiple switch elements. Owing to novel parallel structures inside the switch element, VLSI implementation is possible for transmission rates on the order of a gigabit per second per port. In the last section of this paper, performance results are presented for a switch in a single-stage configuration as well as for the case of a three-stage switch fabric.

74 citations


"A three-stage ATM switch with cell-..." refers background in this paper

  • ...The author is currently investigating the practical implementation of the path allocation circuitry, with a view to confirming that the overall complexity of the switch is no greater than that of competing architectures, such as those in [5]–[9]....

    [...]

  • ...Many authors have proposed such switches [5]–[9]....

    [...]