
Scalable parallel computers for real-time signal processing

Kai Hwang and Zhiwei Xu
IEEE Signal Processing Magazine, Vol. 13, Iss. 4, pp. 50-66, 01 Jul 1996
Abstract
We assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing. We review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing. We evaluate the IBM SP2 at MHPCC, the Intel Paragon at SDSC, the Cray T3D at the Cray Eagan Center, and the Cray T3E and ASCI TeraFLOP system proposed by Intel. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are also introduced.


KAI HWANG and ZHIWEI XU

In this article, we assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing.
First, we review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing.
In particular, we evaluate the IBM SP2 at MHPCC [33], the Intel Paragon at SDSC [38], the Cray T3D at the Cray Eagan Center [1], and the Cray T3E and ASCI TeraFLOP system recently proposed by Intel [32]. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are introduced.
Our experiences in porting the MIT Lincoln Laboratory STAP (space-time adaptive processing) benchmark programs onto the SP2, T3D, and Paragon are reported. Benchmark performance results are presented along with some scalability analysis on machine and problem sizes. Finally, we comment on using these scalable computers for signal processing in the future.
Scalable Parallel Computers
A computer system, including hardware, system software, and applications software, is called scalable if it can scale up to accommodate ever-increasing user demand, or scale down to improve cost-effectiveness. We are most interested in scaling up by improving hardware and software resources to expect a proportional increase in performance. Scalability is a multidimensional concept, ranging from resource and application to technology [12, 27, 37].
Resource scalability refers to gaining higher performance or functionality by increasing the machine size (i.e., the number of processors), investing in more storage (cache, main memory, disks), and improving the software. Commercial MPPs have limited resource scalability. For instance, the normal configuration of the IBM SP2 only allows for up to 128 processors. The largest SP2 system installed to date is the 512-node system at the Cornell Theory Center [14], requiring a special configuration.
Technology scalability refers to a scalable system which can adapt to changes in technology. It should be generation scalable: when part of the system is upgraded to the next generation, the rest of the system should still work. For instance, the most rapidly changing component is the processor. When the processor is upgraded, the system should be able to provide increased performance, using the existing components (memory, disk, network, OS, application software, etc.) in the remaining system. A scalable system should enable integration of hardware and software components from different sources or vendors. This will reduce the cost and expand the system's usability. This heterogeneity scalability concept is called portability when used for software. It calls for using components with an open, standard architecture and interface. An ideal scalable system should also allow space scalability. It should allow scaling up from a desktop machine to a multi-rack machine to provide higher performance, or scaling down to a board or even a chip to fit in an embedded signal processing system.
To fully exploit the power of scalable parallel computers, the application programs must also be scalable. Scalability over machine size measures how well the performance will improve with additional processors. Scalability over problem size indicates how well the system can handle large problems with large data size and workload.

Table 1: Architectural Attributes of Five Parallel Computer Categories

Attribute     | PVP                  | SMP                   | MPP                    | DSM                     | COW
Example       | Cray C-90, Cray T-90 | Cray CS6400, DEC 8000 | Intel Paragon, IBM SP2 | Stanford DASH, Cray T3D | Berkeley NOW, Alpha Farm
Memory        | Centralized, shared  | Centralized, shared   | Distributed, unshared  | Distributed, shared     | Distributed, unshared
Address space | Single               | Single                | Multiple               | Single                  | Multiple
Access model  | UMA                  | UMA                   | NORMA                  | NUMA                    | NORMA
Interconnect  | Custom crossbar      | Bus or crossbar       | Custom network         | Custom network          | Commodity network
Most real parallel applications have limited scalability in both machine size and problem size. For instance, a coarse-grain parallel radar signal processing program may use at most 256 processors to handle at most 100 radar channels. These limitations cannot be removed by simply increasing machine resources. The program has to be significantly modified to handle more processors or more radar channels.
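As a rough illustration of these two notions (a sketch using the classical fixed-workload and scaled-workload formulations of Amdahl and Gustafson, not the analytical definitions given in the article), let f be the serial fraction of a program and n the number of processors:

    % fixed-workload (machine-size) speedup vs. scaled-workload (problem-size) speedup
    S_{\mathrm{fixed}}(n) \;=\; \frac{T(1)}{T(n)} \;\le\; \frac{1}{\,f + (1-f)/n\,},
    \qquad
    S_{\mathrm{scaled}}(n) \;=\; f + (1-f)\,n .

Here T(n) is the execution time on n processors. The first bound shows why a fixed-size problem, such as a fixed set of radar channels, eventually stops benefiting from additional processors; the second grows the workload with the machine, which is the sense in which scalability over problem size is measured.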
Large-scale computer systems are generally classified into six architectural categories [25]: the single-instruction-multiple-data (SIMD) machines, the parallel vector processors (PVPs), the symmetric multiprocessors (SMPs), the massively parallel processors (MPPs), the clusters of workstations (COWs), and the distributed shared memory multiprocessors (DSMs). SIMD computers are mostly for special-purpose applications, which are beyond the scope of this paper. The remaining categories are all MIMD (multiple-instruction-multiple-data) machines.
Important common features in these parallel computer
architectures are characterized below:
• Commodity Components: Most systems use commercially off-the-shelf, commodity components such as microprocessors, memory chips, disks, and key software.
• MIMD: Parallel machines are moving towards the MIMD architecture for general-purpose applications. A parallel program running on such a machine consists of multiple processes, each executing a possibly different code on a processor autonomously.
• Asynchrony: Each process executes at its own pace, independent of the speed of other processes. The processes can be forced to wait for one another through special synchronization operations, such as semaphores, barriers, blocking-mode communications, etc.; a minimal sketch of two of these operations appears after this list.
• Distributed Memory: Highly scalable computers all use distributed memory, either shared or unshared. Most of the distributed memories are accessed by the non-uniform memory access (NUMA) model; machines whose nodes cannot directly access remote memory follow the no remote memory access (NORMA) model. The conventional PVPs and SMPs use the centralized, uniform memory access (UMA) shared memory, which may limit scalability.
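The synchronization operations named above can be made concrete with a minimal sketch in C using MPI (an illustrative assumption; the article does not prescribe MPI for this purpose, and semaphores are omitted):

    #include <mpi.h>
    #include <stdio.h>

    /* Two of the synchronization operations mentioned in the list above:
       a barrier, which holds every process until all have arrived, and a
       blocking receive, which makes one process wait for another's message.
       Run with at least two processes. */
    int main(int argc, char **argv)
    {
        int rank, nprocs, token = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) {                  /* not enough processes to demonstrate */
            MPI_Finalize();
            return 1;
        }

        MPI_Barrier(MPI_COMM_WORLD);       /* every process waits here */

        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* blocking-mode communication: returns only after the message arrives */
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process 1 received %d from process 0\n", token);
        }

        MPI_Finalize();
        return 0;
    }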
Parallel Vector Processors
The structure of a typical PVP is shown in Fig. 1a. Examples of PVPs include the Cray C-90 and T-90. Such a system contains a small number of powerful custom-designed vector processors (VPs), each capable of at least 1 Gflop/s performance. A custom-designed, high-bandwidth crossbar switch connects these vector processors to a number of shared memory (SM) modules. For instance, in the T-90, the shared memory can supply data to a processor at 14 GB/s. Such machines normally do not use caches, but they use a large number of vector registers and an instruction buffer.
Symmetric Multiprocessors
The SMP architecture is shown in Fig. 1b. Examples include the Cray CS6400, the IBM R30, the SGI Power Challenge, and the DEC AlphaServer 8000. Unlike a PVP, an SMP system uses commodity microprocessors with on-chip and off-chip caches. These processors are connected to a shared memory through a high-speed bus. On some SMPs, a crossbar switch is also used in addition to the bus. SMP systems are heavily used in commercial applications, such as database systems, on-line transaction systems, and data warehouses. It is important for the system to be symmetric, in that every processor has equal access to the shared memory, the I/O devices, and the operating system. This way, a higher degree of parallelism can be released, which is not possible in an asymmetric (or master-slave) multiprocessor system.
Massively Parallel Processors
To take advantage of the higher parallelism available in applications such as signal processing, we need to use more scalable computer platforms by exploiting the distributed memory architectures, such as MPPs, DSMs, and COWs. The term MPP generally refers to a large-scale computer system that has the following features:
• It uses commodity microprocessors in processing nodes.
• It uses physically distributed memory over processing nodes.
• It uses an interconnect with high communication bandwidth and low latency.
• It can be scaled up to hundreds or even thousands of processors.
By this definition, MPPs, DSMs, and even some COWs in Table 1 qualify to be called MPPs. The MPP modeled in Fig. 1c is more restricted, representing machines such as the Intel Paragon. Such a machine consists of a number of processing nodes, each containing one or more microprocessors interconnected by a high-speed memory bus to a local memory and a network interface circuitry (NIC). The nodes are interconnected by a high-speed, proprietary communication network.
Distributed Shared Memory Systems
DSM machines are modeled in Fig. 1d, based on the Stanford DASH architecture. A cache directory (DIR) is used to support distributed coherent caches [30]. The Cray T3D is also a DSM machine, but it does not use the DIR to implement coherent caches. Instead, the T3D relies on special hardware and software extensions to achieve the DSM at an arbitrary block-size level, ranging from words to large pages of shared data. The main difference of DSM machines from SMPs is that the memory is physically distributed among different nodes. However, the system hardware and software create an illusion of a single address space to application users.
1. Conceptual architectures of five categories of scalable parallel computers: (a) parallel vector processor, (b) symmetric multiprocessor, (c) massively parallel processor, (d) distributed shared memory machine, (e) cluster of workstations. Legend: P/C: microprocessor and cache; SM: shared memory; LM: local memory; LD: local disk; NIC: network interface circuitry; DIR: cache directory; MB: memory bus; IOB: I/O bus; Bridge: interface between memory bus and I/O bus; VP: vector processor. The PVP and SMP use a crossbar switch or bus, the MPP and DSM use a custom-designed network, and the COW uses a commodity network (Ethernet, ATM, etc.).

Clusters of Workstations
The COW concept is shown in Fig. 1e. Examples of COWs include the Digital Alpha Farm [16] and the Berkeley NOW [5]. COWs are a low-cost variation of MPPs. Important distinctions are listed below [36]:
• Each node of a COW is a complete workstation, minus the peripherals.
• The nodes are connected through a low-cost (compared to the proprietary network of an MPP) commodity network, such as Ethernet, FDDI, Fiber-Channel, or an ATM switch.
• The network interface is loosely coupled to the I/O bus. This is in contrast to the tightly coupled network interface which is connected to the memory bus of an MPP processing node.
• There is always a local disk, which may be absent in an MPP node.
• A complete operating system resides on each node, as compared to some MPPs where only a microkernel exists. The OS of a COW is the same UNIX as on a workstation, plus an add-on software layer to support parallelism, communication, and load balancing.
The boundary between MPPs and COWs is becoming fuzzy these days. The IBM SP2 is considered an MPP, but it also has a COW architecture, except that a proprietary High-Performance Switch is used as the communication network. COWs have many cost-performance advantages over the MPPs. Clustering of workstations, SMPs, and/or PCs is becoming a trend in developing scalable parallel computers [36].
MPP Architectural Evaluation

Architectural features of five MPPs are summarized in Table 2. The configurations of the SP2, T3D, and Paragon are based on the current systems to which our USC team has actually ported the STAP benchmarks. Both the SP2 and the Paragon are message-passing multicomputers with the NORMA memory access model [26]. Internode communication relies on explicit message passing in these NORMA machines. The ASCI TeraFLOP system is the successor of the Paragon. The T3D and its successor, the T3E, are both MPPs based on the DSM model.

MPP Architectures
Among the three existing MPPs, the SP2 has the most powerful processors for floating-point operations. Each POWER2 processor has a peak speed of 267 Mflop/s, almost two to three times higher than each Alpha processor in the T3D and each i860 processor in the Paragon, respectively. The Pentium Pro processor in the ASCI TFLOPS machine has the potential to compete with the POWER2 processor in the future. The successor of the T3D (the T3E) will use the new Alpha 21164, which has the potential to deliver 600 Mflop/s with a 300 MHz clock. The T3E and TFLOPS are scheduled to appear in late 1996.
The Intel MPPs (the Paragon and the TFLOPS) continue using the 2-D mesh network, which is the most scalable interconnect among all existing MPP architectures. This is evidenced by the fact that the Paragon scales to 4536 nodes (9072 Pentium Pro processors) in the TFLOPS.
Table 2: Architectural Features of Five MPPs

IBM SP2
  Large sample configuration: 400-node, 100 Gflop/s, at MHPCC
  CPU type: 67 MHz, 267 Mflop/s POWER2
  Node architecture: 1 processor, 64 MB-2 GB local memory, 1-4.5 GB local disk
  Interconnect and memory: Multistage network, NORMA
  Operating system on compute node: Complete AIX (IBM Unix)
  Native programming mechanism: Message passing (MPL)
  Other programming models: MPI, PVM, HPF, Linda
  Point-to-point latency and bandwidth: 40 μs, 35 MB/s

Cray T3D
  Large sample configuration: 512-node, 153 Gflop/s, at NSA
  CPU type: 150 MHz, 150 Mflop/s Alpha 21064
  Node architecture: 2 processors, 64 MB memory, 50 GB shared disk
  Interconnect and memory: 3-D torus, DSM
  Operating system on compute node: Microkernel
  Native programming mechanism: Shared variable and message passing, PVM
  Other programming models: MPI, HPF
  Point-to-point latency and bandwidth: 2 μs, 150 MB/s

Cray T3E
  Large sample configuration: Maximal 512-node, 1.2 Tflop/s
  CPU type: 300 MHz, 600 Mflop/s Alpha 21164
  Node architecture: 4-8 processors, 256 MB-16 GB DSM memory, shared disk
  Interconnect and memory: 3-D torus, DSM
  Operating system on compute node: Microkernel based on Chorus
  Native programming mechanism: Shared variable and message passing, PVM
  Other programming models: MPI, HPF
  Point-to-point latency and bandwidth: 480 MB/s

Intel Paragon
  Large sample configuration: 400-node, 40 Gflop/s, at SDSC
  CPU type: 50 MHz, 100 Mflop/s Intel i860
  Node architecture: 1-2 processors, 16-128 MB local memory, 48 GB shared disk
  Interconnect and memory: 2-D mesh, NORMA
  Operating system on compute node: Microkernel
  Native programming mechanism: Message passing (NX)
  Other programming models: SUNMOS, MPI, PVM
  Point-to-point latency and bandwidth: 30 μs, 175 MB/s

Intel ASCI TeraFLOPS
  Large sample configuration: 4536-node, 1.8 Tflop/s, at SNL
  CPU type: 200 MHz, 200 Mflop/s Pentium Pro
  Node architecture: 2 processors, 32-256 MB local memory, shared disk
  Interconnect and memory: Split 2-D mesh, NORMA
  Operating system on compute node: Light-Weighted Kernel (LWK)
  Native programming mechanism: Message passing (MPI based on NX, PVM)
  Point-to-point latency and bandwidth: 10 μs, 380 MB/s

2. Improvement trends of various performance attributes (clock rate, total memory, machine size, processor speed, total speed, bandwidth, latency) in Cray supercomputers and Intel MPPs: (a) Cray vector supercomputers (Cray 1, X-MP, Y-MP, C-90, T-90; 1979-1995), (b) Intel MPPs (iPSC/1, iPSC/2, iPSC/860, Paragon, TeraFLOP; 1985-1996).
The Cray T3D/T3E use a 3-D torus network. The IBM SP2 uses a multistage Omega network. The latency and bandwidth numbers are for one-way, point-to-point communication between two node processes. The latency is the time to send an empty message. The bandwidth refers to the asymptotic bandwidth for sending large messages. While the bandwidth is mainly limited by the communication hardware, the latency is mainly limited by the software overhead. The distributed shared memory design of the T3D allows it to achieve the lowest latency of only 2 μs.
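Such one-way latency and asymptotic bandwidth figures are commonly obtained with a ping-pong test between two node processes. The sketch below, in C with MPI, is illustrative only (the 1 MB message size and repetition count are assumptions, and it is not the instrumentation behind the numbers quoted here): it halves the round-trip time of an empty message for latency and times a large message for bandwidth.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong between processes 0 and 1.  Half of the average round-trip
       time of an empty message approximates the one-way latency; the transfer
       rate of a large message approximates the asymptotic bandwidth. */
    int main(int argc, char **argv)
    {
        const int reps = 1000;
        const int sizes[2] = { 0, 1 << 20 };      /* empty message and 1 MB */
        char *buf = malloc(1 << 20);
        int rank, nprocs;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) {                         /* needs two communicating processes */
            MPI_Finalize();
            return 1;
        }

        for (int s = 0; s < 2; s++) {
            int n = sizes[s];
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
                } else if (rank == 1) {
                    MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                    MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double one_way = (MPI_Wtime() - t0) / (2.0 * reps);   /* seconds */
            if (rank == 0) {
                if (n == 0)
                    printf("one-way latency: %.1f microseconds\n", one_way * 1e6);
                else
                    printf("asymptotic bandwidth: %.1f MB/s\n", n / one_way / 1e6);
            }
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }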
Message passing is supported as a native programming model in all three MPPs. The T3D is the most flexible machine in terms of programmability. Its native MPP programming language (called Cray Craft) supports three models: the data-parallel Fortran 90, shared-variable extensions, and message-passing PVM [18]. All MPPs also support the standard Message-Passing Interface (MPI) library [20]. We have used MPI to code the parallel STAP benchmark programs. This approach makes them portable among all three MPPs.
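As a minimal sketch of this message-passing style in C (an illustrative program only, not part of the STAP benchmark suite; the local computation is a placeholder partial sum), the same MPI calls compile unchanged on the SP2, T3D, and Paragon:

    #include <mpi.h>
    #include <stdio.h>

    /* Each process computes a local partial result; process 0 collects the
       global sum through the MPI reduction call. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        local = (double)(rank + 1);          /* placeholder for real local work */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d processes = %g\n", nprocs, global);

        MPI_Finalize();
        return 0;
    }

Because only standard MPI calls appear, moving such a program to another of these machines amounts to recompiling against that vendor's MPI library.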
Our MPI-based STAP benchmarks are readily portable to the next generation of MPPs, namely the T3E, the ASCI, and the successor to the SP2. In 1996 and beyond, this implies that the portable STAP benchmark suite can be used to evaluate these new MPPs. Our experience with the STAP radar benchmarks can also be extended to convert SAR (synthetic aperture radar) and ATR (automatic target recognition) programs for parallel execution on future MPPs.
Hot CPU Chips
Most current systems use commodity microprocessors. With the widespread use of microprocessors, the chip companies can afford to invest huge resources into research and development on microprocessor-based hardware, software, and applications. Consequently, the low-cost commodity microprocessors are approaching the performance of the custom-designed processors used in Cray supercomputers. The speed performance of commodity microprocessors has been increasing steadily, almost doubling every 18 months during the past decade.
From Table 3, the Alpha 21164A is by far the fastest microprocessor announced in late 1995 [17]. All high-performance CPU chips are made with CMOS technology and consist of 5M to 20M transistors. With a low-voltage supply from 2.2 V to 3.3 V, the power consumption falls between 20 W and 30 W. All five CPUs are superscalar processors, issuing 3 or 4 instructions per cycle. The clock rate increases beyond 200 MHz and approaches 417 MHz for the 21164A. All processors use dynamic branch prediction along with an out-of-order RISC execution core. The Alpha 21164A, UltraSPARC II, and R10000 have comparable floating-point speeds approaching 600 SPECfp92.
Scalable Growth Trends
Table 4 and Fig. 2 illustrate the evolution trends of the Cray supercomputer family and of the Intel MPP family. Commodity microprocessors have been improving at a much faster rate than custom-designed processors. The peak speed of Cray processors has improved 12.5 times in 16 years, half of which comes from faster clock rates. In 10 years, the peak speed of the Intel microprocessors has increased 5000 times, of which only a factor of 25 comes from faster clock rates; the remaining factor of 200 comes from advances in the processor architecture. In the same time period, the one-way, point-to-point communication bandwidth for the Intel MPPs has increased 740 times, and the latency has improved by 86.2 times. Cray supercomputers use fast SRAMs as the main memory. The custom-designed crossbar provides high bandwidth and low communication latency. As a consequence, applications run-
