Sorting networks and their applications
by
K. E. BATCHER
Goodyear Aerospace Corporation
Akron, Ohio
INTRODUCTION
To achieve high throughput rates today's computers perform several operations simultaneously. Not only are I/O operations performed concurrently with computing, but also, in multiprocessors, several computing operations are done concurrently. A major problem in the design of such a computing system is the connecting together of the various parts of the system (the I/O devices, memories, processing units, etc.) in such a way that all the required data transfers can be accommodated. One common scheme is a high-speed bus which is time-shared by the various parts; speed of available hardware limits this scheme. Another scheme is a cross-bar switch or matrix; limiting factors here are the amount of hardware (an m × n matrix requires m × n cross-points) and the fan-in and fan-out of the hardware.
This paper describes networks that have a fast sorting or ordering capability (sorting networks or sorting memories). In $\frac{1}{2}p(p+1)$ steps $2^p$ words can be ordered. A sorting network can be used as a multiple-input, multiple-output switching network. It has the advantages over a normal crossbar of requiring less hardware (an n-input n-output switching network can be built with approximately $\frac{1}{4}n(\log_2 n)^2$ elements versus $n^2$ in a normal crossbar) and of having a constant fan-in and fan-out requirement on its elements. Thus, a sorting network should be useful as a flexible means of tying together the various parts of a large-scale computing system. Thousands of input and output lines can be accommodated with a reasonable amount of hardware.
Other applications of sorting memories are as a switching network with buffering, a multi-access memory, a multi-access content-addressable memory and as a multi-processor. Of course, the networks also may be used just for sorting and merging.
Comparison elements
The basic element of sorting networks is the comparison element (Figure 1). It receives two numbers over its inputs, A and B, and presents their minimum on its L output and their maximum on its H output.
Figure 1 - Symbol for a comparison element (inputs A and B; outputs L = MIN(A,B) and H = MAX(A,B); the bi-directional version adds inputs H' and L' and outputs A' and B')
If the numbers in and out of the element are transmitted serially, most-significant bit first, the element has the state diagram of Figure 2. A reset input places the element in the A = B state and as long as the A and B bits agree it remains in this state with its outputs equal to its inputs. When the A and B bits disagree the element goes to the A < B or the A > B state and remains there until the next reset input. In the A > B state the output H equals the input A and the output L equals the input B. In the A < B state the opposite situation occurs.
Figure 2 - State diagram for a serial comparison element (most-significant-bit first). RESET enters the state A = B (H = L = A); input bits A = 0, B = 1 lead to the state A < B (H = B, L = A); input bits A = 1, B = 0 lead to the state A > B (H = A, L = B).
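The state diagram can be simulated in a few lines. The following is only an illustrative sketch in Python, not the 13-NOR circuit itself; the function name `serial_compare` and the helpers `to_bits` and `from_bits` are hypothetical names introduced here.

```python
def serial_compare(a_bits, b_bits):
    """Simulate one serial comparison element, most-significant bit first.

    Returns (h_bits, l_bits), the bit streams on the H and L outputs.
    """
    state = "A=B"  # a reset places the element in the A = B state
    h_bits, l_bits = [], []
    for a, b in zip(a_bits, b_bits):
        if state == "A=B" and a != b:
            # The first disagreement fixes the state until the next reset.
            state = "A>B" if a > b else "A<B"
        if state == "A<B":
            h, l = b, a  # H = B, L = A
        else:
            h, l = a, b  # A > B: H = A, L = B; A = B: bits agree, so H = L = A
        h_bits.append(h)
        l_bits.append(l)
    return h_bits, l_bits


def to_bits(value, width):
    """Hypothetical helper: unsigned integer to a list of bits, MSB first."""
    return [(value >> shift) & 1 for shift in range(width - 1, -1, -1)]


def from_bits(bits):
    """Hypothetical helper: list of bits (MSB first) back to an integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value


h, l = serial_compare(to_bits(5, 4), to_bits(9, 4))
print(from_bits(h), from_bits(l))  # the maximum, then the minimum
```

Because the decision is taken at the first differing bit, each output bit is available as soon as the corresponding input bits arrive, which is what allows the elements to be chained serially.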

308 Spring Joint Computer Conference, 1968
A serial comparison element can be implemented with 13 NORs and can be put on one integrated circuit chip. When used in sorting networks each H and L output will feed an A or B input of another element so the fan-out is constant regardless of network size; this fact could be used to simplify the design of the chip. With several of the currently available logic families speeds of 100 nanoseconds/bit with a propagation delay from inputs to outputs of 40 nanoseconds are easily achieved.
Faster operation can be attained by treating several bits in parallel in each step with more complex comparison elements.
Some of the applications described below will require "bi-directional" comparison elements. Besides the A and B inputs and the H and L outputs there are H' and L' inputs and A' and B' outputs (see Figure 1). If A > B then B' = L' and A' = H', if A < B then B' = H' and A' = L', otherwise A' and B' are left undefined. Information flows from left-to-right over the solid lines and from right-to-left over the dotted lines.
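As a rough illustration of the bi-directional behaviour, the sketch below models one element as a Python function. The name `bidirectional_element` and the use of `None` for the undefined case are assumptions made for this example only.

```python
def bidirectional_element(a, b, h_back, l_back):
    """Model one bi-directional comparison element.

    a, b           -- forward inputs A and B
    h_back, l_back -- return signals arriving on the H' and L' inputs
    Returns (h, l, a_back, b_back): forward outputs H and L, and the
    return outputs A' and B'.
    """
    h, l = max(a, b), min(a, b)
    if a > b:
        a_back, b_back = h_back, l_back  # A' = H', B' = L'
    elif a < b:
        a_back, b_back = l_back, h_back  # A' = L', B' = H'
    else:
        a_back = b_back = None           # left undefined when A = B
    return h, l, a_back, b_back


# The return path retraces the route taken by the forward data.
print(bidirectional_element(3, 7, "reply-for-high", "reply-for-low"))
```

The return signal on H' always reaches whichever input supplied the larger number, so a reply travels back along the same path the data took.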
Odd-even merging networks
Merging is the process of arranging two ascendingly-ordered lists of numbers into one ascendingly-ordered list. Figure 3 shows a symbol for an "s by t" merging network in which the s numbers of one ascendingly-ordered list, $a_1, a_2, \ldots, a_s$, are presented over s inputs simultaneously with the t numbers of another ascendingly-ordered list, $b_1, b_2, \ldots, b_t$, over another t inputs. The s + t outputs of the merging network present the s + t numbers of the merged lists in ascending order, $c_1, c_2, \ldots, c_{s+t}$.
A "1 by 1" merging network is simply one comparison element. Larger networks can be built by using the iterative rule shown in Figure 4. An "s by t" merging network can be built by presenting the odd-indexed numbers of the two input lists to one small merging network (the odd merge), presenting the even-indexed numbers to another small merging network (the even merge) and then comparing the outputs of these small merges with a row of comparison elements.¹ The lowest output of the odd merge is left alone and becomes the lowest number of the final list. The $i$th output of the even merge is compared with the $(i+1)$th output of the odd merge to form the $2i$th and $(2i+1)$th numbers of the final list for all applicable i's. This may or may not exhaust all the outputs of the odd and even merges; if an output remains in the odd or even merge it is left alone and becomes the highest number in the final list.
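The iterative rule translates directly into a recursive procedure. The sketch below is a software illustration only: the names `compare` and `odd_even_merge` are hypothetical, and treating an empty input list as a pass-through is an assumption added so that the recursion terminates for unequal list lengths.

```python
def compare(x, y):
    """One comparison element: returns (L, H) = (minimum, maximum)."""
    return (x, y) if x <= y else (y, x)


def odd_even_merge(a, b):
    """Merge two ascending lists using only comparison elements."""
    if not a:
        return list(b)  # assumption: nothing to merge, wires pass straight through
    if not b:
        return list(a)
    if len(a) == 1 and len(b) == 1:
        return list(compare(a[0], b[0]))  # a "1 by 1" merge is one element

    # 1-based odd-indexed numbers sit at 0-based even positions.
    d = odd_even_merge(a[0::2], b[0::2])  # the odd merge
    e = odd_even_merge(a[1::2], b[1::2])  # the even merge

    c = [d[0]]  # the lowest output of the odd merge is left alone
    pairs = min(len(e), len(d) - 1)
    for i in range(pairs):
        low, high = compare(e[i], d[i + 1])  # i-th even output with (i+1)-th odd output
        c.extend([low, high])
    # At most one output remains; it becomes the highest number of the list.
    c.extend(d[pairs + 1:])
    c.extend(e[pairs:])
    return c


print(odd_even_merge([1, 4, 6], [0, 2, 3, 5, 7]))
```

The sequence of comparisons does not depend on the data values, only on the list lengths, which is what makes the procedure realizable as a fixed network.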
Figure 3 - Symbol for an "s by t" merging network (inputs $a_1 \le a_2 \le \ldots \le a_s$ and $b_1 \le b_2 \le \ldots \le b_t$; outputs $c_1 \le c_2 \le \ldots \le c_{s+t}$)
Appendix A sketches the proof of this iterative rule. Figure 5 shows a "2 by 2" and a "4 by 4" merging network constructed by this rule.
A "$2^p$ by $2^p$" merging network constructed by this rule uses $p \cdot 2^p + 1$ comparison elements. The longest path goes through p + 1 comparison elements and the shortest path through one element. Doubling the size of a merge only increases the longest path by unity so the merging time increases slowly with the size of the network.
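These counts follow from a short recurrence. In the sketch below, $M(p)$ and $D(p)$ are symbols introduced here (they do not appear in the text) for the number of elements and the longest path of a "$2^p$ by $2^p$" merge. The odd and even merges are each "$2^{p-1}$ by $2^{p-1}$", and the final row needs $2^p - 1$ elements because the lowest odd-merge output and the highest even-merge output are left alone.

```latex
\begin{align*}
M(0) &= 1, \qquad M(p) = 2\,M(p-1) + (2^p - 1) \quad (p \ge 1),\\
M(p) - 1 &= 2\bigl(M(p-1) - 1\bigr) + 2^p
  \;\Longrightarrow\; M(p) - 1 = p \cdot 2^p,\\
M(p) &= p \cdot 2^p + 1,\\
D(0) &= 1, \qquad D(p) = D(p-1) + 1 \;\Longrightarrow\; D(p) = p + 1.
\end{align*}
```

For example, $M(1) = 3$ and $M(2) = 9$ for the "2 by 2" and "4 by 4" merges.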

Figure 4 - Iterative rule for odd-even merging networks (the odd merge takes $a_1, a_3, a_5, \ldots$ and $b_1, b_3, b_5, \ldots$ and produces $d_1, d_2, \ldots$; the even merge takes $a_2, a_4, a_6, \ldots$ and $b_2, b_4, b_6, \ldots$ and produces $e_1, e_2, \ldots$; a row of comparison elements forms $c_1, \ldots, c_{s+t}$)
Figure 5 - Construction of "2 by 2" and "4 by 4" odd-even merging networks
Bitonic sorters
Another way of constructing merging networks from comparison elements is presented here. While requiring somewhat more elements than the odd-even merging networks, they have the advantage of flexibility (one network can accommodate input lists of various lengths) and of modularity (a large network can be split up into several identical modules).²
We will call a sequence of numbers bitonic if it is the juxtaposition of two monotonic sequences, one ascending, the other descending. We also say it remains bitonic if it is split anywhere and the two parts interchanged. Since any two monotonic sequences can be put together to form a bitonic sequence a network which rearranges a bitonic sequence into monotonic order (a bitonic sorter) can be used as a merging network.
Appendix B shows that if a sequence of 2n numbers, $a_1, a_2, \ldots, a_{2n}$, is bitonic and if we form the two n-number sequences

$$\min(a_1, a_{n+1}),\ \min(a_2, a_{n+2}),\ \ldots,\ \min(a_n, a_{2n}) \qquad (1)$$

and

$$\max(a_1, a_{n+1}),\ \max(a_2, a_{n+2}),\ \ldots,\ \max(a_n, a_{2n}) \qquad (2)$$

then each of these sequences is bitonic and no number of (1) is greater than any number of (2).
This fact gives us the iterative rule illustrated in Figure 6. A bitonic sorter for 2n numbers can be constructed from n comparison elements and two bitonic sorters for n numbers. The comparison elements form the sequences (1) and (2) and since each is bitonic they are sorted by the two n-number bitonic sorters. Since no number of (1) is greater than any number of (2) the output of one bitonic sorter is the lower half of the sort and the output of the other is the upper half.
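The rule can be written as a short recursive function. This is a minimal sketch, assuming the input length is a power of two; the names `bitonic_sorter` and `bitonic_merge` are hypothetical, and reversing the second list to form a bitonic sequence is one convenient choice rather than the only one.

```python
def bitonic_sorter(seq):
    """Rearrange a bitonic sequence (length a power of two) into ascending order."""
    n = len(seq)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(seq)
    half = n // 2
    # One level of n/2 comparison elements forms sequences (1) and (2).
    lows = [min(seq[i], seq[i + half]) for i in range(half)]
    highs = [max(seq[i], seq[i + half]) for i in range(half)]
    # Each half is bitonic, and no number in lows exceeds any number in highs.
    return bitonic_sorter(lows) + bitonic_sorter(highs)


def bitonic_merge(ascending_a, ascending_b):
    """Merge two ascending lists whose total length is a power of two."""
    # An ascending list followed by a descending list is bitonic.
    return bitonic_sorter(list(ascending_a) + list(ascending_b)[::-1])


print(bitonic_merge([1, 4, 6], [0, 2, 3, 5, 7]))
```

The two input lists in the usage line have lengths 3 and 5, which illustrates that the same 8-number sorter serves input lists of various lengths.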
A bitonic sorter for 2 numbers is simply a comparison element and using the iterative rule bitonic sorters for $2^p$ numbers can be constructed for any p. Figure 7 shows bitonic sorters for 4 numbers and 8 numbers. A $2^p$-number bitonic sorter requires p levels of $2^{p-1}$ elements each for a total of $p \cdot 2^{p-1}$ elements. It can act as a merging network for any two input lists whose total length equals $2^p$.
Large bitonic sorters can be constructed from a number of smaller bitonic sorters; for instance, a 16-number bitonic sorter can be constructed from eight 4-number bitonic sorters, as shown in Fig. 8. This allows large networks to be built of standard modules of convenient size.
Readers may recognize the similarity between the topologies of the bitonic sort and the fast Fourier transform.
Figure 6 - Iterative rule for bitonic sorters (the input $a_1, a_2, \ldots, a_{2n}$ is bitonic; n comparison elements feed two n-item bitonic sorters, which produce $c_1 \le c_2 \le \ldots \le c_n$ and $c_{n+1} \le c_{n+2} \le \ldots \le c_{2n}$)
Sorting networks
A sorter for arbitrary sequences can be constructed from odd-even merges or bitonic sorters using the well-known sorting-by-merging scheme: the numbers are combined two at a time to form ordered lists of length two; these lists are merged two at a time to form ordered lists of length four, etc., until all numbers are merged into one ordered list.
To sort $2^p$ numbers using odd-even merges requires $2^{p-1}$ comparison elements followed by $2^{p-2}$ "2-by-2" merging networks followed by $2^{p-3}$ "4-by-4" merging networks, etc. The longest path will go through $\frac{1}{2}p(p+1)$ elements and the shortest path through p elements. The network requires $(p^2 - p + 4)2^{p-2} - 1$ comparison elements.
To sort $2^p$ numbers using bitonic sorters requires $\frac{1}{2}p(p+1)$ levels each with $2^{p-1}$ elements for $(p^2 + p)2^{p-2}$ elements. Each path goes through $\frac{1}{2}p(p+1)$ levels.
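The level and element counts can be checked by building a network explicitly. The sketch below is an illustration under assumptions: it uses the common index formulation of the bitonic sort (partner = i XOR j, with alternating sort directions), which is one of several equivalent wirings and not necessarily the one drawn in the figures. All function names are hypothetical.

```python
def bitonic_sorting_network(p):
    """Return the levels of a bitonic sorting network for 2**p numbers.

    Each level is a list of (low, high) index pairs: after the comparison
    the smaller number is at index `low`, the larger at index `high`.
    """
    n = 1 << p
    levels = []
    k = 2
    while k <= n:          # merge stage producing sorted runs of length k
        j = k // 2
        while j >= 1:      # one level of n/2 comparison elements
            level = []
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    if (i & k) == 0:
                        level.append((i, partner))  # ascending block
                    else:
                        level.append((partner, i))  # descending block
            levels.append(level)
            j //= 2
        k *= 2
    return levels


def run_network(levels, values):
    """Apply the comparison elements of each level to a list of numbers."""
    out = list(values)
    for level in levels:
        for low, high in level:
            if out[low] > out[high]:
                out[low], out[high] = out[high], out[low]
    return out


def odd_even_elements(p):
    """Element count quoted in the text for odd-even merges."""
    return (p * p - p + 4) * 2 ** p // 4 - 1


def bitonic_elements(p):
    """Element count quoted in the text for bitonic sorters."""
    return (p * p + p) * 2 ** p // 4


levels = bitonic_sorting_network(10)
print(len(levels), sum(len(level) for level in levels))  # 55 28160
print(odd_even_elements(10), bitonic_elements(10))       # 24063 28160
print(run_network(bitonic_sorting_network(3), [5, 2, 7, 0, 6, 1, 4, 3]))
```

The formulas are written with `2 ** p // 4` rather than `2 ** (p - 2)` so that they stay in integer arithmetic for p = 1.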
Figure 7 - Construction of bitonic sorters for 4 numbers and for 8 numbers
Figure 8 - A 16-number bitonic sorter constructed from eight 4-number bitonic sorters
A sorter of 1024 numbers will have 55 levels and 24,063 elements with odd-even merges or 28,160 elements with bitonic sorters. With a 40 nanosecond propagation delay per level the total delay is 2.2 microseconds. Serial transmission of the bits would require about this much time between successive bits of the numbers unless re-clocking occurs within the network. Parallel-input-parallel-output registers of 1024 bits each can be placed between certain levels to perform this task or the re-clocking may be incorporated within each comparison element with a pair of flip-flops on the outputs. The latter scheme does not add to the terminal count of the comparison element so the cost of the added flip-flops on the comparison element chip is small. One can use any of the familiar techniques for driving shift registers such as the "A-B" technique where successive levels are clocked out-of-phase with each other. With present circuit and wiring techniques a bit rate of 10 megahertz may be possible with 50 nanosecond delay per level (2.75 microsecond delay from input to output of a 1024-word sorter).
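The figures quoted for the 1024-number sorter can be reproduced with the level formula for $p = 10$; the worked arithmetic below only restates numbers given in the text.

```latex
\begin{align*}
\text{levels} &= \tfrac{1}{2}\,p\,(p+1) = \tfrac{1}{2}\cdot 10 \cdot 11 = 55
  \qquad (2^{10} = 1024),\\
55 \times 40\ \text{ns} &= 2200\ \text{ns} = 2.2\ \mu\text{s},\\
55 \times 50\ \text{ns} &= 2750\ \text{ns} = 2.75\ \mu\text{s},\\
\frac{1}{10\ \text{MHz}} &= 100\ \text{ns per bit}.
\end{align*}
```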
With re-clocking in the element and odd-even merges, extra elements are needed to balance the unequal-length paths. Bitonic sorters do not have this problem.
Applications
The fast sorting capability of these networks allows
their use in solving other problems where large sets of
data must be manipulated. Some of these applications
are sketched below.
Switching network
A sorting network can connect its input lines to its output lines with any permutation. The connection is made by numbering the output lines in order and presenting the desired output address for each input line at the input. The sorting network sorts the addresses and in the process makes a connection from each input line to its desired output line for the transmission of data. Bi-directional paths will be obtained if bi-directional comparison elements are used.
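The routing effect can be illustrated with a small model. In this sketch Python's built-in sort stands in for the sorting network (an assumption made to keep the example short), output lines are numbered from 0, and the function name `route` is hypothetical.

```python
def route(desired_outputs, payloads):
    """Deliver each payload to the output line its input line asked for.

    desired_outputs[i] is the 0-based output line wanted by input line i.
    Returns a list whose j-th entry is the payload arriving at output line j.
    """
    n = len(desired_outputs)
    if sorted(desired_outputs) != list(range(n)):
        raise ValueError("each input must request a distinct output line")
    # Each input presents a word made of its desired address and its data.
    words = list(zip(desired_outputs, payloads))
    # Stand-in for the sorting network: order the words by address.
    words.sort(key=lambda word: word[0])
    # After sorting, the word with address j sits on output line j.
    return [payload for _, payload in words]


print(route([2, 0, 3, 1], ["a", "b", "c", "d"]))  # ['b', 'd', 'a', 'c']
```

Sorting the addresses is the whole set-up procedure: no separate routing computation is needed before the data are sent.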
An alternative permuting network has been shown in the recent literature³ which has fewer elements [$(p - 1)2^p + 1$ versus $(p^2 - p + 4)2^{p-2} - 1$ for permuting $2^p$ items] but a more complex set-up algorithm.
Switching network with conflict resolution
The aforementioned switching network assumes each input wants a unique output line. In many applications conflicts between inputs occur and must be resolved by inhibiting conflicting inputs. Figure 9 sketches an m-input, n-output network that performs this task. Each input line inserts a word containing the output address desired (or zeroes if the line is inactive), a control bit equal to 1 and a priority number into an m-item sorting network with bi-directional elements. This orders the items so input items with the same output address are grouped together and ordered by their priority number. The ordered set of m input items is merged with a set of n items, each containing a fixed output address and a control bit equal to 0. At the right side of the m by n merge the m + n items are in one ordered list; each address-inserter item will be directly below any input items with the same address. The adjacent word transfer network, looking at the control bits, connects each address-inserter item to the input item directly above it if one exists (the input item with lowest priority number is picked in each case). The elements in the sort and the merge are bi-directional so two-way paths are formed from input to output. The adjacent word transfer sends back signals over each path to signal each input and output line whether or not a connection has been established. Data can then be transmitted over each of the connected input lines.
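The scheme can be modelled in software as follows. This is a sketch under several assumptions: Python's built-in sort stands in for the sorting and merging networks, words are ordered by (address, control bit, priority), output addresses are numbered from 1 so that address 0 can mark an inactive line, and "directly above" is taken to mean the next item in ascending order. The function name `resolve_conflicts` is hypothetical.

```python
def resolve_conflicts(requests, n_outputs):
    """Decide which input line is connected to each output line.

    requests[i] is (output_address, priority) for input line i, or None if
    the line is inactive. Returns a dict: output address -> input line.
    """
    input_items = []
    for line, request in enumerate(requests):
        address, priority = request if request is not None else (0, 0)
        input_items.append((address, 1, priority, line))  # control bit = 1
    # Address-inserter items: fixed output address, control bit = 0.
    inserters = [(address, 0, 0, None) for address in range(1, n_outputs + 1)]

    # Stand-in for the m-item sort followed by the "m by n" merge.
    merged = sorted(input_items + inserters, key=lambda word: word[:3])

    # Adjacent word transfer: an inserter connects to the neighbouring input
    # item if that item carries the same address.
    connections = {}
    for position, (address, control, _, _) in enumerate(merged[:-1]):
        neighbour = merged[position + 1]
        if control == 0 and neighbour[1] == 1 and neighbour[0] == address:
            connections[address] = neighbour[3]
    return connections


requests = [(2, 5), (1, 1), (2, 3), None, (1, 4)]
print(resolve_conflicts(requests, n_outputs=3))  # {1: 1, 2: 2}
```

In the usage line, input lines 0 and 2 both ask for output 2 and line 2 wins with the lower priority number; output 3 receives no request and stays unconnected.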
Figure 9 - An m-input, n-output switching network with conflict resolution (input items: desired output address, control bit 1, priority; address-inserter items: output address, control bit 0, zeroes; an m-item sorting network feeds an "m by n" merging network followed by the adjacent word transfer)
Multi-access memory
Re-clocking delays in the comparison elements give a sorting network some storage capability which can be augmented if needed with shift registers on the outputs. When the output lines are fed back to the input lines a recirculating self-sorting store is created (Figure 10). In each recirculation cycle word positions are changed to keep the memory in order.
Inputs to the memory can be made by breaking the recirculation paths of some words and inserting new words. To prevent destroying old information during input we use the convention that words with all bits equal to "one" are "empty" and contain no information: these will automatically collect at the "high-end" of memory where input lines can use them to insert new words.
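The input convention can be mimicked with a small model. The sketch below makes several assumptions for illustration: words are 8 bits wide, Python's built-in `sorted` stands in for one pass through the sorting network, and the names `WIDTH`, `EMPTY`, `recirculate` and `insert_words` are hypothetical.

```python
WIDTH = 8                 # hypothetical word width in bits
EMPTY = (1 << WIDTH) - 1  # all bits equal to one: an "empty" word


def recirculate(memory):
    """One recirculation cycle: the network puts the words back in order."""
    return sorted(memory)  # stand-in for the sorting network


def insert_words(memory, new_words):
    """Insert new words by overwriting empty words at the high end."""
    memory = recirculate(memory)
    if any(word == EMPTY for word in new_words):
        raise ValueError("the all-ones pattern is reserved for empty words")
    if len(new_words) > memory.count(EMPTY):
        raise ValueError("not enough empty words")
    # Empty words have collected at the high end, so no data are destroyed.
    for offset, word in enumerate(new_words):
        memory[len(memory) - 1 - offset] = word
    return recirculate(memory)


memory = [EMPTY] * 8
memory = insert_words(memory, [40, 7, 19])
memory = insert_words(memory, [3])
print(memory)  # [3, 7, 19, 40, 255, 255, 255, 255]
```

Because the all-ones pattern is the largest possible word, the empty words always sort to the high end, so an input line at a fixed position can always find one there while any remain.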
Outputs from the memory can be accommodated by reserving the most-significant-bit (MSB) of each word: "1" for normal words and "0" for words to be outputted. Words for output will automatically collect at the "low end" of memory where output lines can read them. Selection of which words to output is accommodated by reserving the least-significant-bit (LSB) of each word; "1" for normal words and "0"

References
On the Synthesis of Signal Switching Networks with Transient Blocking