
Scalable parallel computers for real-time signal processing

Kai Hwang and Zhiwei Xu
IEEE Signal Processing Magazine, Vol. 13, Iss. 4, pp. 50-66, 01 Jul 1996
Abstract
We assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing. We review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing. We evaluate the IBM SP2 at MHPCC, the Intel Paragon at SDSC, the Cray T3D at the Cray Eagan Center, and the Cray T3E and ASCI TeraFLOP system proposed by Intel. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are also introduced.


KAI HWANG and ZHIWEI XU

In this article, we assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing.
First, we review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing.
In particular, we evaluate the IBM SP2 at MHPCC [33], the Intel Paragon at SDSC [38], the Cray T3D at the Cray Eagan Center [1], and the Cray T3E and ASCI TeraFLOP system recently proposed by Intel [32]. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are introduced.
Our experiences in porting the MIT Lincoln Laboratory STAP (space-time adaptive processing) benchmark programs onto the SP2, T3D, and Paragon are reported. Benchmark performance results are presented along with some scalability analysis on machine and problem sizes. Finally, we comment on using these scalable computers for signal processing in the future.
Scalable Parallel Computers
A computer system, including hardware, system software, and applications software, is called scalable if it can scale up to accommodate ever-increasing user demand, or scale down to improve cost-effectiveness. We are most interested in scaling up by improving hardware and software resources to expect a proportional increase in performance. Scalability is a multidimensional concept, ranging from resource and application to technology [12, 27, 37].
Resource scalability refers to gaining higher performance or functionality by increasing the machine size (i.e., the number of processors), investing in more storage (cache, main memory, disks), and improving the software. Commercial MPPs have limited resource scalability. For instance, the normal configuration of the IBM SP2 only allows for up to 128 processors. The largest SP2 system installed to date is the 512-node system at the Cornell Theory Center [14], requiring a special configuration.
Technology scalability refers to a scalable system which can adapt to changes in technology. It should be generation scalable: when part of the system is upgraded to the next generation, the rest of the system should still work. For instance, the most rapidly changing component is the processor. When the processor is upgraded, the system should be able to provide increased performance, using the existing components (memory, disk, network, OS, application software, etc.) in the remaining system. A scalable system should enable integration of hardware and software components from different sources or vendors. This will reduce the cost and expand the system's usability. This heterogeneity scalability concept is called portability when used for software. It calls for using components with an open, standard architecture and interface. An ideal scalable system should also allow space scalability. It should allow scaling up from a desktop machine to a multi-rack machine to provide higher performance, or scaling down to a board or even a chip to fit in an embedded signal processing system.
To fully exploit the power of scalable parallel computers, the application programs must also be scalable. Scalability over machine size measures how well the performance will improve with additional processors. Scalability over problem size indicates how well the system can handle large problems with large data size and workload.

Table 1: Architectural Attributes of Five Parallel Computer Categories

Attribute     | PVP                  | SMP                   | MPP                    | DSM                     | COW
Example       | Cray C-90, Cray T-90 | Cray CS6400, DEC 8000 | Intel Paragon, IBM SP2 | Stanford DASH, Cray T3D | Berkeley NOW, Alpha Farm
Memory        | Centralized, shared  | Centralized, shared   | Distributed, unshared  | Distributed, shared     | Distributed, unshared
Address space | Single               | Single                | Multiple               | Single                  | Multiple
Access model  | UMA                  | UMA                   | NORMA                  | NUMA                    | NORMA
Interconnect  | Custom crossbar      | Bus or crossbar       | Custom network         | Custom network          | Commodity network
Most real parallel applications have limited scalability in both machine size and problem size. For instance, a coarse-grain parallel radar signal processing program may use at most 256 processors to handle at most 100 radar channels. These limitations cannot be removed by simply increasing machine resources. The program has to be significantly modified to handle more processors or more radar channels.
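As a rough illustration of these two notions (a sketch using the classical fixed-workload and scaled-workload formulations of Amdahl and Gustafson, not the analytical definitions given in the article), let f be the serial fraction of a program and n the number of processors:

    % fixed-workload (machine-size) speedup vs. scaled-workload (problem-size) speedup
    S_{\mathrm{fixed}}(n) \;=\; \frac{T(1)}{T(n)} \;\le\; \frac{1}{\,f + (1-f)/n\,},
    \qquad
    S_{\mathrm{scaled}}(n) \;=\; f + (1-f)\,n .

Here T(n) is the execution time on n processors. The first bound shows why a fixed-size problem, such as a fixed set of radar channels, eventually stops benefiting from additional processors; the second grows the workload with the machine, which is the sense in which scalability over problem size is measured.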
Large-scale computer systems are generally classified into six architectural categories [25]: the single-instruction-multiple-data (SIMD) machines, the parallel vector processors (PVPs), the symmetric multiprocessors (SMPs), the massively parallel processors (MPPs), the clusters of workstations (COWs), and the distributed shared memory multiprocessors (DSMs). SIMD computers are mostly for special-purpose applications, which are beyond the scope of this paper. The remaining categories are all MIMD (multiple-instruction-multiple-data) machines.
Important common features in these parallel computer
architectures are characterized below:
• Commodity Components: Most systems use commercially off-the-shelf, commodity components such as microprocessors, memory chips, disks, and key software.
• MIMD: Parallel machines are moving towards the MIMD architecture for general-purpose applications. A parallel program running on such a machine consists of multiple processes, each executing a possibly different code on a processor autonomously.
• Asynchrony: Each process executes at its own pace, independent of the speed of other processes. The processes can be forced to wait for one another through special synchronization operations, such as semaphores, barriers, blocking-mode communications, etc.; a minimal sketch of two of these operations appears after this list.
• Distributed Memory: Highly scalable computers all use distributed memory, either shared or unshared. Most of the distributed memories are accessed by the non-uniform memory access (NUMA) model; machines whose nodes cannot directly access remote memory follow the no remote memory access (NORMA) model. The conventional PVPs and SMPs use the centralized, uniform memory access (UMA) shared memory, which may limit scalability.
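The synchronization operations named above can be made concrete with a minimal sketch in C using MPI (an illustrative assumption; the article does not prescribe MPI for this purpose, and semaphores are omitted):

    #include <mpi.h>
    #include <stdio.h>

    /* Two of the synchronization operations mentioned in the list above:
       a barrier, which holds every process until all have arrived, and a
       blocking receive, which makes one process wait for another's message.
       Run with at least two processes. */
    int main(int argc, char **argv)
    {
        int rank, nprocs, token = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) {                  /* not enough processes to demonstrate */
            MPI_Finalize();
            return 1;
        }

        MPI_Barrier(MPI_COMM_WORLD);       /* every process waits here */

        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* blocking-mode communication: returns only after the message arrives */
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("process 1 received %d from process 0\n", token);
        }

        MPI_Finalize();
        return 0;
    }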
Parallel Vector Processors
The structure of a typical PVP is shown in Fig. 1a. Examples of PVPs include the Cray C-90 and T-90. Such a system contains a small number of powerful custom-designed vector processors (VPs), each capable of at least 1 Gflop/s performance. A custom-designed, high-bandwidth crossbar switch connects these vector processors to a number of shared memory (SM) modules. For instance, in the T-90, the shared memory can supply data to a processor at 14 GB/s. Such machines normally do not use caches, but they use a large number of vector registers and an instruction buffer.
Symmetric Multiprocessors
The SMP architecture is shown in Fig. 1b. Examples include the Cray CS6400, the IBM R30, the SGI Power Challenge, and the DEC AlphaServer 8000. Unlike a PVP, an SMP system uses commodity microprocessors with on-chip and off-chip caches. These processors are connected to a shared memory through a high-speed bus. On some SMPs, a crossbar switch is also used in addition to the bus. SMP systems are heavily used in commercial applications, such as database systems, on-line transaction systems, and data warehouses. It is important for the system to be symmetric, in that every processor has equal access to the shared memory, the I/O devices, and the operating system. This way, a higher degree of parallelism can be released, which is not possible in an asymmetric (or master-slave) multiprocessor system.
Massively Parallel Processors
To take advantage of the higher parallelism available in applications such as signal processing, we need to use more scalable computer platforms by exploiting the distributed memory architectures, such as MPPs, DSMs, and COWs. The term MPP generally refers to a large-scale computer system that has the following features:
• It uses commodity microprocessors in processing nodes.
• It uses physically distributed memory over processing nodes.
• It uses an interconnect with high communication bandwidth and low latency.
• It can be scaled up to hundreds or even thousands of processors.
By this definition, MPPs, DSMs, and even some COWs in Table 1 qualify to be called MPPs. The MPP modeled in Fig. 1c is more restricted, representing machines such as the Intel Paragon. Such a machine consists of a number of processing nodes, each containing one or more microprocessors interconnected by a high-speed memory bus to a local memory and a network interface circuitry (NIC). The nodes are interconnected by a high-speed, proprietary communication network.
Distributed Shared Memory Systems
DSM machines are modeled in Fig. 1d, based on the Stanford DASH architecture. A cache directory (DIR) is used to support distributed coherent caches [30]. The Cray T3D is also a DSM machine, but it does not use the DIR to implement coherent caches. Instead, the T3D relies on special hardware and software extensions to achieve the DSM at an arbitrary block-size level, ranging from words to large pages of shared data. The main difference of DSM machines from SMPs is that the memory is physically distributed among different nodes. However, the system hardware and software create an illusion of a single address space to application users.
1. Conceptual architectures of five categories of scalable parallel computers: (a) parallel vector processor, (b) symmetric multiprocessor, (c) massively parallel processor, (d) distributed shared memory machine, (e) cluster of workstations. Legend: P/C: microprocessor and cache; SM: shared memory; LM: local memory; LD: local disk; NIC: network interface circuitry; DIR: cache directory; MB: memory bus; IOB: I/O bus; Bridge: interface between memory bus and I/O bus; VP: vector processor. The PVP and SMP use a crossbar switch or bus, the MPP and DSM use a custom-designed network, and the COW uses a commodity network (Ethernet, ATM, etc.).

Clusters of Workstations
The COW concept is shown in Fig. 1e. Examples of COWs include the Digital Alpha Farm [16] and the Berkeley NOW [5]. COWs are a low-cost variation of MPPs. Important distinctions are listed below [36]:
• Each node of a COW is a complete workstation, minus the peripherals.
• The nodes are connected through a low-cost (compared to the proprietary network of an MPP) commodity network, such as Ethernet, FDDI, Fiber-Channel, or an ATM switch.
• The network interface is loosely coupled to the I/O bus. This is in contrast to the tightly coupled network interface which is connected to the memory bus of an MPP processing node.
• There is always a local disk, which may be absent in an MPP node.
• A complete operating system resides on each node, as compared to some MPPs where only a microkernel exists. The OS of a COW is the same UNIX as on a workstation, plus an add-on software layer to support parallelism, communication, and load balancing.
The boundary between MPPs and COWs is becoming fuzzy these days. The IBM SP2 is considered an MPP, but it also has a COW architecture, except that a proprietary High-Performance Switch is used as the communication network. COWs have many cost-performance advantages over the MPPs. Clustering of workstations, SMPs, and/or PCs is becoming a trend in developing scalable parallel computers [36].
MPP Architectural Evaluation

Architectural features of five MPPs are summarized in Table 2. The configurations of the SP2, T3D, and Paragon are based on the current systems to which our USC team has actually ported the STAP benchmarks. Both the SP2 and the Paragon are message-passing multicomputers with the NORMA memory access model [26]. Internode communication relies on explicit message passing in these NORMA machines. The ASCI TeraFLOP system is the successor of the Paragon. The T3D and its successor, the T3E, are both MPPs based on the DSM model.

MPP Architectures
Among the three existing MPPs, the SP2 has the most powerful processors for floating-point operations. Each POWER2 processor has a peak speed of 267 Mflop/s, almost two to three times higher than each Alpha processor in the T3D and each i860 processor in the Paragon, respectively. The Pentium Pro processor in the ASCI TFLOPS machine has the potential to compete with the POWER2 processor in the future. The successor of the T3D (the T3E) will use the new Alpha 21164, which has the potential to deliver 600 Mflop/s with a 300 MHz clock. The T3E and TFLOPS are scheduled to appear in late 1996.
The Intel MPPs (the Paragon and the TFLOPS) continue using the 2-D mesh network, which is the most scalable interconnect among all existing MPP architectures. This is evidenced by the fact that the Paragon scales to 4536 nodes (9072 Pentium Pro processors) in the TFLOPS.
Table 2: Architectural Features of Five MPPs

IBM SP2
  Large sample configuration: 400-node, 100 Gflop/s, at MHPCC
  CPU type: 67 MHz, 267 Mflop/s POWER2
  Node architecture: 1 processor, 64 MB-2 GB local memory, 1-4.5 GB local disk
  Interconnect and memory: Multistage network, NORMA
  Operating system on compute node: Complete AIX (IBM Unix)
  Native programming mechanism: Message passing (MPL)
  Other programming models: MPI, PVM, HPF, Linda
  Point-to-point latency and bandwidth: 40 μs, 35 MB/s

Cray T3D
  Large sample configuration: 512-node, 153 Gflop/s, at NSA
  CPU type: 150 MHz, 150 Mflop/s Alpha 21064
  Node architecture: 2 processors, 64 MB memory, 50 GB shared disk
  Interconnect and memory: 3-D torus, DSM
  Operating system on compute node: Microkernel
  Native programming mechanism: Shared variable and message passing, PVM
  Other programming models: MPI, HPF
  Point-to-point latency and bandwidth: 2 μs, 150 MB/s

Cray T3E
  Large sample configuration: Maximal 512-node, 1.2 Tflop/s
  CPU type: 300 MHz, 600 Mflop/s Alpha 21164
  Node architecture: 4-8 processors, 256 MB-16 GB DSM memory, shared disk
  Interconnect and memory: 3-D torus, DSM
  Operating system on compute node: Microkernel based on Chorus
  Native programming mechanism: Shared variable and message passing, PVM
  Other programming models: MPI, HPF
  Point-to-point latency and bandwidth: 480 MB/s

Intel Paragon
  Large sample configuration: 400-node, 40 Gflop/s, at SDSC
  CPU type: 50 MHz, 100 Mflop/s Intel i860
  Node architecture: 1-2 processors, 16-128 MB local memory, 48 GB shared disk
  Interconnect and memory: 2-D mesh, NORMA
  Operating system on compute node: Microkernel
  Native programming mechanism: Message passing (NX)
  Other programming models: SUNMOS, MPI, PVM
  Point-to-point latency and bandwidth: 30 μs, 175 MB/s

Intel ASCI TeraFLOPS
  Large sample configuration: 4536-node, 1.8 Tflop/s, at SNL
  CPU type: 200 MHz, 200 Mflop/s Pentium Pro
  Node architecture: 2 processors, 32-256 MB local memory, shared disk
  Interconnect and memory: Split 2-D mesh, NORMA
  Operating system on compute node: Light-Weighted Kernel (LWK)
  Native programming mechanism: Message passing (MPI based on NX, PVM)
  Point-to-point latency and bandwidth: 10 μs, 380 MB/s

2. Improvement trends of various performance attributes (clock rate, total memory, machine size, processor speed, total speed, bandwidth, latency) in Cray supercomputers and Intel MPPs: (a) Cray vector supercomputers (Cray 1, X-MP, Y-MP, C-90, T-90; 1979-1995), (b) Intel MPPs (iPSC/1, iPSC/2, iPSC/860, Paragon, TeraFLOP; 1985-1996).
The Cray T3D/T3E use a 3-D torus network. The IBM SP2 uses a multistage Omega network. The latency and bandwidth numbers are for one-way, point-to-point communication between two node processes. The latency is the time to send an empty message. The bandwidth refers to the asymptotic bandwidth for sending large messages. While the bandwidth is mainly limited by the communication hardware, the latency is mainly limited by the software overhead. The distributed shared memory design of the T3D allows it to achieve the lowest latency of only 2 μs.
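Such one-way latency and asymptotic bandwidth figures are commonly obtained with a ping-pong test between two node processes. The sketch below, in C with MPI, is illustrative only (the 1 MB message size and repetition count are assumptions, and it is not the instrumentation behind the numbers quoted here): it halves the round-trip time of an empty message for latency and times a large message for bandwidth.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Ping-pong between processes 0 and 1.  Half of the average round-trip
       time of an empty message approximates the one-way latency; the transfer
       rate of a large message approximates the asymptotic bandwidth. */
    int main(int argc, char **argv)
    {
        const int reps = 1000;
        const int sizes[2] = { 0, 1 << 20 };      /* empty message and 1 MB */
        char *buf = malloc(1 << 20);
        int rank, nprocs;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        if (nprocs < 2) {                         /* needs two communicating processes */
            MPI_Finalize();
            return 1;
        }

        for (int s = 0; s < 2; s++) {
            int n = sizes[s];
            MPI_Barrier(MPI_COMM_WORLD);
            double t0 = MPI_Wtime();
            for (int i = 0; i < reps; i++) {
                if (rank == 0) {
                    MPI_Send(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                    MPI_Recv(buf, n, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
                } else if (rank == 1) {
                    MPI_Recv(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                    MPI_Send(buf, n, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
                }
            }
            double one_way = (MPI_Wtime() - t0) / (2.0 * reps);   /* seconds */
            if (rank == 0) {
                if (n == 0)
                    printf("one-way latency: %.1f microseconds\n", one_way * 1e6);
                else
                    printf("asymptotic bandwidth: %.1f MB/s\n", n / one_way / 1e6);
            }
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }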
Message passing is supported as a native programming model in all three MPPs. The T3D is the most flexible machine in terms of programmability. Its native MPP programming language (called Cray Craft) supports three models: the data-parallel Fortran 90, shared-variable extensions, and message-passing PVM [18]. All MPPs also support the standard Message-Passing Interface (MPI) library [20]. We have used MPI to code the parallel STAP benchmark programs. This approach makes them portable among all three MPPs.
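As a minimal sketch of this message-passing style in C (an illustrative program only, not part of the STAP benchmark suite; the local computation is a placeholder partial sum), the same MPI calls compile unchanged on the SP2, T3D, and Paragon:

    #include <mpi.h>
    #include <stdio.h>

    /* Each process computes a local partial result; process 0 collects the
       global sum through the MPI reduction call. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        double local, global;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        local = (double)(rank + 1);          /* placeholder for real local work */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum over %d processes = %g\n", nprocs, global);

        MPI_Finalize();
        return 0;
    }

Because only standard MPI calls appear, moving such a program to another of these machines amounts to recompiling against that vendor's MPI library.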
Our MPI-based STAP benchmarks are readily portable to the next generation of MPPs, namely the T3E, the ASCI, and the successor to the SP2. In 1996 and beyond, this implies that the portable STAP benchmark suite can be used to evaluate these new MPPs. Our experience with the STAP radar benchmarks can also be extended to convert SAR (synthetic aperture radar) and ATR (automatic target recognition) programs for parallel execution on future MPPs.
Hot CPU Chips
Most current systems use commodity microprocessors. With the widespread use of microprocessors, the chip companies can afford to invest huge resources into research and development on microprocessor-based hardware, software, and applications. Consequently, the low-cost commodity microprocessors are approaching the performance of the custom-designed processors used in Cray supercomputers. The speed performance of commodity microprocessors has been increasing steadily, almost doubling every 18 months during the past decade.
From Table 3, the Alpha 21164A is by far the fastest microprocessor announced in late 1995 [17]. All high-performance CPU chips are made with CMOS technology and consist of 5M to 20M transistors. With a low-voltage supply from 2.2 V to 3.3 V, the power consumption falls between 20 W and 30 W. All five CPUs are superscalar processors, issuing 3 or 4 instructions per cycle. The clock rate increases beyond 200 MHz and approaches 417 MHz for the 21164A. All processors use dynamic branch prediction along with an out-of-order RISC execution core. The Alpha 21164A, UltraSPARC II, and R10000 have comparable floating-point speeds approaching 600 SPECfp92.
Scalable Growth Trends
Table 4 and Fig. 2 illustrate the evolution trends of the Cray supercomputer family and of the Intel MPP family. Commodity microprocessors have been improving at a much faster rate than custom-designed processors. The peak speed of Cray processors has improved 12.5 times in 16 years, half of which comes from faster clock rates. In 10 years, the peak speed of the Intel microprocessors has increased 5000 times, of which only a factor of 25 comes from faster clock rates; the remaining factor of 200 comes from advances in the processor architecture. In the same time period, the one-way, point-to-point communication bandwidth for the Intel MPPs has increased 740 times, and the latency has improved by 86.2 times. Cray supercomputers use fast SRAMs as the main memory. The custom-designed crossbar provides high bandwidth and low communication latency. As a consequence, applications run-
