Journal ArticleDOI

Scalable parallel computers for real-time signal processing

01 Jul 1996-IEEE Signal Processing Magazine (IEEE)-Vol. 13, Iss: 4, pp 50-66
TL;DR: The purpose is to reveal the capabilities, limits, and effectiveness of massively parallel processors, compared with symmetric multiprocessors and clusters of workstations, in signal processing.
Abstract: We assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing. We review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing. We evaluate the IBM SP2 at MHPCC, the Intel Paragon at SDSC, the Cray T3D at the Cray Eagan Center, and the Cray T3E and ASCI TeraFLOP system proposed by Intel. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are also introduced.

Summary (5 min read)

Scalable Parallel Computers

  • A computer system, including hardware, system software, and applications software, is called scalable if it can scale up to accommodate ever-increasing user demand, or scale down to improve cost-effectiveness.
  • Scalability is a multidimensional concept, ranging from resource and application to technology [12, 27, 37].
  • When the processor is upgraded, the system should be able to provide increased performance, using existing components (memory, disk, network, OS, and application software, etc.) in the remaining system.
  • The program has to be significantly modified to handle more processors or more radar channels.
  • Large-scale computer systems are generally classified into six architectural categories [25] : the single-instruction-multiple-data (SIMD) machines, the parallel vector processors (PVPs), the symmetric multiprocessors (SMPs), the massively parallel processors (MPPs), the clusters of workstations (COWs), and the distributed shared memory multiprocessors (DSMs).

Parallel Vector Processors

  • Such a system contains a small number of powerful custom-designed vector processors (VPs), each capable of at least 1 Gflop/s performance.
  • A custom-designed, high-bandwidth crossbar switch connects these vector processors to a number of shared memory (SM) modules.
  • In the T-90, the shared memory can supply data to a processor at 14 GB/s.
  • Such machines normally do not use caches, but they use a large number of vector registers and an instruction buffer.

Symmetric Multiprocessors

  • Examples include the Cray CS6400, the IBM R30, the SGI Power Challenge, and the DEC Alphaserver 8000.
  • Unlike a PVP, an SMP system uses commodity microprocessors with on-chip and off-chip caches.
  • These processors are connected to a shared memory though a high-speed bus.
  • On some SMPs, a crossbar switch is also used in addition to the bus.
  • It is important for the system to be symmetric, in that every processor has equal access to the shared memory, the I/O devices, and the operating system.

Massively Parallel Processors

  • To take advantage of the higher parallelism available in applications such as signal processing, more scalable distributed-memory platforms are needed, such as MPPs, DSMs, and COWs.
  • The Cray T3D is also a DSM machine, but it does not use a cache directory (DIR) to implement coherent caches [30].
  • Instead, the T3D relies on special hardware and software extensions to achieve the DSM at arbitrary block-size level, ranging from words to large pages of shared data.
  • The main difference of DSM machines from SMP is that the memory is physically distributed among different nodes.
  • The system hardware and software create an illusion of a single address space to application users.

Clusters of Workstations

  • Architectural features of five MPPs are summarized in Table 2 .
  • The configurations of the SP2, T3D, and Paragon are based on the current systems to which the USC team has actually ported the STAP benchmarks.
  • Both SP2 and Paragon are message-passing multicomputers with the NORMA memory access model [26] .
  • A complete operating system resides on each node, as compared to some MPPs where only a microkernel exists.
  • The OS of a COW is the same UNIX as on a stand-alone workstation, plus an add-on software layer to support parallelism, communication, and load balancing.

MPP Architectures

  • Among the three existing MPPs, the SP2 has the most powerful processors for floating-point operations.
  • Each POWER2 processor has a peak speed of 267 Mflop/s, almost two to three times higher than each Alpha processor in the T3D and each i860 processor in the Paragon, respectively.
  • T3E and TFLOPS are scheduled to appear in late 1996.
  • The Intel MPPs (Paragon and TFLOPS) continue using the 2-D mesh network, which is the most scalable interconnect among all existing MPP architectures.

Intel ASCI TeraFLOPS

  • The latency is the time to send an empty message.
  • While the bandwidth is mainly limited by the communication hardware, the latency is mainly limited by the software overhead.
  • Message passing is supported as a native programming model in all three MPPs.
  • The T3D is the most flexible machine in terms of programmability.
  • The authors' MPI-based STAP benchmarks are readily portable to the next generation of MPPs, namely the T3E, the ASCI, and the successor to the SP2.

Hot CPU Chips

  • With wide-spread use of microprocessors, the chip companies can afford to invest huge resources into research and development on microprocessor-based hardware, software, and applications.
  • The low-cost commodity microprocessors are approaching the performance of customdesigned processors used in Cray supercomputers.
  • The speed performance of commodity microprocessors has been increasing steadily, almost doubling every 18 months during the past decade; a rough compounding check follows this list.
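
As a back-of-the-envelope check on this rate (an arithmetic aside, not a figure from the article): doubling every 18 months compounds to roughly two orders of magnitude per decade,

```latex
2^{\,120/18} \;=\; 2^{6.67} \;\approx\; 1.0 \times 10^{2},
```

which is consistent with commodity microprocessors closing the gap with custom-designed vector processors over the same period.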

Performance Metrics for Parallel Applications

  • The authors define below performance metrics used on scalable parallel computers.
  • The terminology is consistent with that proposed by the Parkbench group [25] , which is consistent with the conventions used in other scientific fields, such as physics.
  • These metrics are summarized in Table 5 .

Performance Metrics

  • The parallel computational steps in a typical scientific or signal processing application are illustrated in Fig. 3 .
  • The authors assume all interactions (communication and synchronization operations) happen between the consecutive steps.
  • Traditionally, four metrics have been used to measure the performance of a parallel program: the parallel execution time, the speed (or sustained speed), the speedup, and the efficiency, as shown in Table 5.
  • The utilization metric does not have this problem.
  • The critical path and the average parallelism are two extreme-value metrics, providing a lower bound for execution time and an upper bound for speedup, respectively; the conventional definitions behind these metrics are sketched below.
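
A minimal sketch of the conventional definitions behind these metrics (Parkbench-style notation; the symbols here are illustrative and Table 5 itself is not reproduced):

```latex
% n = number of processors, W = total workload (e.g., in flop),
% P_peak = peak speed of a single node.  Illustrative, Parkbench-style definitions.
\begin{aligned}
T_n &: \ \text{parallel execution time on } n \text{ nodes} \\
\text{speed:}       &\quad P_n = W / T_n \\
\text{speedup:}     &\quad S_n = T_1 / T_n \\
\text{efficiency:}  &\quad E_n = S_n / n \\
\text{utilization:} &\quad U_n = P_n / (n \cdot P_{\text{peak}})
\end{aligned}
```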

Communication Overhead

  • Xu and Hwang [43] have shown that the time of a communication operation can be estimated by a general timing model, t(m, n) = t_0(n) + m / r_∞(n), where m is the message length in bytes, and the latency t_0(n) and the asymptotic bandwidth r_∞(n) can be linear or nonlinear functions of the machine size n.
  • Timing expressions are obtained for some MPL message-passing operations on the SP2, as shown in Table 6.
  • Details on how to derive these and other expressions are treated in [43], where the MPI performance on the SP2 is also compared to the native IBM MPL operations.
  • T_o is the sum of the times of all interaction operations occurring in a parallel program; a small numerical sketch of the timing model follows this list.
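
A small numerical sketch of this timing model in C (the coefficient functions below are hypothetical placeholders chosen for illustration, not the measured SP2 expressions of Table 6):

```c
/* Sketch of the general timing model t(m, n) = t0(n) + m / r_inf(n).
 * The coefficient functions are hypothetical, not measured values. */
#include <stdio.h>

double t0(int n)    { return 50e-6 + 10e-6 * n; }  /* latency in seconds (assumed) */
double r_inf(int n) { (void)n; return 35e6; }      /* asymptotic bandwidth, bytes/s (assumed) */

/* estimated time of one communication operation for an m-byte message on n nodes */
double comm_time(double m, int n) { return t0(n) + m / r_inf(n); }

int main(void) {
    for (int n = 2; n <= 256; n *= 4)
        printf("n = %3d: t(1 MB) = %.3f ms\n", n, 1e3 * comm_time(1e6, n));
    return 0;
}
```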

Parallel Programming Models

  • Four models for parallel programming are widely used on parallel computers: implicit, data parallel, message-passing, and shared variable.
  • Table 7 compares these four models from a user's perspective.
  • A four-star (****) entry indicates that the model is the most advantageous with respect to a particular issue, while a one-star (*) entry corresponds to the weakest model.
  • Parallelism issues are related to how to exploit and manage parallelism, such as process creation/termination, context switching, and inquiring about the number of processes.
  • Interaction issues address how to allocate workload, how to distribute data to different processors, and how to synchronize/communicate among the processors.
  • Semantic issues consider termination, determinacy, and correctness properties.
  • They can also be indeterminate: the same input could produce different results.
  • Parallel programs are also more difficult to test, debug, or prove correct.
  • Programmability issues refer to whether a programming model facilitates the development of portable and efficient application codes.

The Implicit Model

  • Programmers write codes using a familiar sequential programming language (e.g., C or Fortran).
  • Examples of such compilers include KAP from Kuck and Associates [29] and FORGE from Advanced Parallel Research [7] .
  • Compared to explicit parallel programs, sequential programs have simpler semantics: (1) they do not deadlock or livelock; (2) they are always determinate: the same input always produces the same result.
  • Therefore, the implicit approach suffers in performance.

The Data Parallel Model

  • The data parallel programming model is used in standard languages such as Fortran 90 and High-Performance Fortran (HPF) [24] and proprietary languages such as CM-5 C*.
  • This model is characterized by the following features:.
  • In other words, as far as control flow is concerned, a data parallel program is just like a sequential program.
  • There is an implicit or explicit synchronization after every statement.
  • This is in contrast to the message passing approach, where variables may reside in different address spaces.

The Shared Variable Model

  • Shared-variable programming is the native model for PVP, SMP, and DSM machines.
  • There is an ANSI standard for this model.
  • Explicit interactions: the programmer must resolve all the interaction issues, including data mapping, communication, and synchronization; a minimal shared-variable sketch follows this list.
  • Both shared-variable and message-passing approaches can achieve high performance.
  • For signal processing, the authors often require the highest performance.
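
A minimal shared-variable sketch in C using POSIX threads (a generic illustration of the model, not code from the paper or from any vendor's shared-variable extension):

```c
/* Generic shared-variable sketch: several threads update one shared
 * accumulator protected by a mutex.  Illustrative only. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static double shared_sum = 0.0;                 /* the shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int id = *(int *)arg;
    double local = 0.0;
    for (int i = 0; i < 1000; i++)              /* local (parallel) work */
        local += id + i * 1e-3;
    pthread_mutex_lock(&lock);                  /* explicit synchronization */
    shared_sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    int ids[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        ids[t] = t;
        pthread_create(&tid[t], NULL, worker, &ids[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("shared_sum = %f\n", shared_sum);
    return 0;
}
```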

The Message Passing Model

  • The message passing programming model is the native model for MPPs and COWs; a minimal sketch is given after this list.
  • The programming language is extended with some new constructs to support parallelism and interaction.
  • This approach leaves error checking to the user.
  • These are formatted comments, called compiler directives or pragmas, that help the compiler do a better job in optimization and parallelization.
  • The APT performs a Householder transform to generate a triangular learning matrix, which is used in a beamforming step to null the jammers and the clutter; in the HO-PD program, the two adaptive beamforming steps are combined into one step.
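
A minimal message-passing sketch in C with MPI (a generic two-process exchange for illustration; it is not code from the STAP benchmarks):

```c
/* Generic MPI message-passing sketch: rank 0 sends a buffer to rank 1. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    double buf[1024] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0 && nprocs > 1) {
        buf[0] = 3.14;  /* data produced on rank 0 */
        MPI_Send(buf, 1024, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buf, 1024, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", buf[0]);
    }

    MPI_Finalize();
    return 0;
}
```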

Parallelization of STAP Programs

  • The authors have used three MPPs (IBM SP2, Intel Paragon, and Cray T3D) to execute the STAP benchmarks.
  • Performance results on the Paragon and T3D are yet to be released.
  • The sequential HO-PD program was parallelized to run on the IBM SP2, the Intel Paragon, and the Cray T3D.
  • The collection of radar signals forms a three-dimensional data cube, coordinated by the numbers of antenna elements (EL), pulse repetition intervals (PRI), and range gates (RNG); a sketch of one possible partitioning of this cube appears after this list.
  • The program was run in batch mode to have dedicated use of the nodes.
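
A hypothetical sketch, in C, of dividing such an EL x PRI x RNG data cube among nodes by blocks of range gates (the dimensions, names, and choice of partitioning axis are illustrative; the actual STAP decomposition used in the paper may differ):

```c
/* Hypothetical block partitioning of an EL x PRI x RNG radar data cube
 * across nproc nodes along the range-gate axis.  Sizes are illustrative. */
#include <stdio.h>

#define EL   32     /* antenna elements (assumed)           */
#define PRI  128    /* pulse repetition intervals (assumed) */
#define RNG  1000   /* range gates (assumed)                */

int main(void) {
    int nproc = 16;  /* number of processing nodes (assumed) */
    for (int p = 0; p < nproc; p++) {
        int lo = (RNG * p) / nproc;        /* first range gate owned by node p */
        int hi = (RNG * (p + 1)) / nproc;  /* one past the last range gate     */
        printf("node %2d: range gates [%4d, %4d), sub-cube of %ld samples\n",
               p, lo, hi, (long)EL * PRI * (hi - lo));
    }
    return 0;
}
```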

STAP Benchmark Performance

  • To demonstrate the performance of MPPs for signal processing, the authors choose to port the space-time adaptive processing (STAP) benchmark programs, originally developed by MIT Lincoln Laboratory for real time radar signal processing on UNIX workstations in sequential C code [34] .
  • The authors have to parallelize these C codes on all three target MPPs.
  • The STAP benchmark consists of five radar signal processing programs: Adaptive Processing Testbed (APT), High-Order Post-Doppler (HO-PD), Element-Space PRI-Staggered Post-Doppler (EL-Stag), Beam-Space PRI-Staggered Post-Doppler (BM-Stag), and General (GEN).
  • These benchmarks were written to test the STAP algorithms for adaptive radar signal processing.
  • These programs start with Doppler processing (DP).

Measured Benchmark Results

  • Figure 6 shows the measured parallel execution time, speed, and utilization as a function of machine size.
  • Only the HO-PD performance is shown here.
  • The degradation of Paragon performance when the number of nodes is less than 16 is due to the use of small local memory (16 MB/node in the SDSC Paragon, of which only 8 MB is available to user applications).
  • This results in excessive paging when a few nodes are used.
  • The SP2's high performance is further explained by Fig. 6c , which shows the utilization of the three machines.

Execution Timing Analysis

  • In Table 8 , the authors show the breakdown of the communication overhead and the computation time of the HO-PD program in all three MPPs.
  • The parallel HO-PD program is a computation-intensive application.
  • There, excessive paging drastically increases both the computational and communication times.
  • Afterwards, the communication time decreases as n increases.
  • This is attributed to the decreasing message size (m is about 50/n Mbyte) as the machine size n increases; a rough estimate follows this list.
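
Plugging this shrinking message size into the timing model introduced earlier gives a rough picture of the trend (the coefficients t_0(n) and r_∞(n) are machine dependent and not reproduced here):

```latex
T_{\text{comm}}(n) \;\approx\; t_0(n) + \frac{m}{r_\infty(n)}
\;\approx\; t_0(n) + \frac{(50/n)\,\text{MB}}{r_\infty(n)},
```

so for a roughly constant per-operation latency, the bandwidth term falls approximately as 1/n.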

Scalability over Machine Size

  • In an MPP, the total memory capacity increases with the number of nodes available.
  • Assume every node has the same memory capacity of M bytes.
  • On an n-node MPP, the total memory capacity is nM.
  • This total workload has a sequential portion, α, and a parallelizable portion, 1 − α.
  • Three approaches have been used to get better performance as the machine size increases; they are formulated as three scalable performance laws, sketched below.
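
The standard forms of the three scaling regimes usually associated with such laws, under fixed-workload, fixed-time, and memory-bounded assumptions (the paper's exact notation may differ; α is the sequential fraction, n the machine size, and G(n) the workload growth permitted by the nM-byte aggregate memory):

```latex
% Standard fixed-workload, fixed-time, and memory-bounded speedup expressions.
\begin{aligned}
\text{Fixed workload (Amdahl):} \quad
  & S_n = \frac{1}{\alpha + (1-\alpha)/n} \;\le\; \frac{1}{\alpha} \\[4pt]
\text{Fixed time (Gustafson):} \quad
  & S_n = \alpha + (1-\alpha)\,n \\[4pt]
\text{Memory-bounded (Sun--Ni):} \quad
  & S_n = \frac{\alpha + (1-\alpha)\,G(n)}{\alpha + (1-\alpha)\,G(n)/n}
\end{aligned}
```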

Lessons Learned and Conclusions

  • The authors summarize below important lessons learned from their MPP/STAP benchmark experiments.
  • Then the authors make a number of suggestions towards general-purpose signal processing on scalable parallel computer platforms including MPPs, DSMs, and COWs.
  • None of these systems is supported by a real-time operating system.
  • Developing an MPP application is a time-consuming task.
  • Topology independence: for portability reasons, the code should be independent of any specific network topology.


KAI HWANG and ZHIWEI XU

In this article, we assess the state-of-the-art technology in massively parallel processors (MPPs) and their variations in different architectural platforms. Architectural and programming issues are identified in using MPPs for time-critical applications such as adaptive radar signal processing.

First, we review the enabling technologies. These include high-performance CPU chips and system interconnects, distributed memory architectures, and various latency hiding mechanisms. We characterize the concept of scalability in three areas: resources, applications, and technology. Scalable performance attributes are analytically defined. Then we compare MPPs with symmetric multiprocessors (SMPs) and clusters of workstations (COWs). The purpose is to reveal their capabilities, limits, and effectiveness in signal processing. In particular, we evaluate the IBM SP2 at MHPCC [33], the Intel Paragon at SDSC [38], the Cray T3D at the Cray Eagan Center [1], and the Cray T3E and ASCI TeraFLOP system recently proposed by Intel [32]. On the software and programming side, we evaluate existing parallel programming environments, including the models, languages, compilers, software tools, and operating systems. Some guidelines for program parallelization are provided. We examine data-parallel, shared-variable, message-passing, and implicit programming models. Communication functions and their performance overhead are discussed. Available software tools and communication libraries are introduced.

Our experiences in porting the MIT Lincoln Laboratory STAP (space-time adaptive processing) benchmark programs onto the SP2, T3D, and Paragon are reported. Benchmark performance results are presented along with some scalability analysis on machine and problem sizes. Finally, we comment on using these scalable computers for signal processing in the future.

Scalable Parallel Computers

A computer system, including hardware, system software, and applications software, is called scalable if it can scale up to accommodate ever-increasing user demand, or scale down to improve cost-effectiveness. We are most interested in scaling up by improving hardware and software resources to expect a proportional increase in performance. Scalability is a multidimensional concept, ranging from resource and application to technology [12, 27, 37].

Resource scalability refers to gaining higher performance or functionality by increasing the machine size (i.e., the number of processors), investing in more storage (cache, main memory, disks), and improving the software. Commercial MPPs have limited resource scalability. For instance, the normal configuration of the IBM SP2 only allows for up to 128 processors. The largest SP2 system installed to date is the 512-node system at the Cornell Theory Center [14], requiring a special configuration.

Technology scalability refers to a scalable system which can adapt to changes in technology. It should be generation scalable: when part of the system is upgraded to the next generation, the rest of the system should still work. For instance, the most rapidly changing component is the processor. When the processor is upgraded, the system should be able to provide increased performance, using the existing components (memory, disk, network, OS, application software, etc.) in the remaining system. A scalable system should enable integration of hardware and software components from different sources or vendors. This will reduce the cost and expand the system's usability. This heterogeneity scalability concept is called portability when used for software. It calls for using components with an open, standard architecture and interface. An ideal scalable system should also allow space scalability. It should allow scaling up from a desktop machine to a multi-rack machine to provide higher performance, or scaling down to a board or even a chip to fit in an embedded signal processing system.

To fully exploit the power of scalable parallel computers, the application programs must also be scalable. Scalability over machine size measures how well the performance will improve with additional processors. Scalability over problem size indicates how well the system can handle large problems with large data size and workload.

Table 1. Architectural Attributes of Five Parallel Computer Categories

• PVP — example: Cray C-90; memory: centralized, shared; address space: single; access model: UMA; interconnect: custom crossbar.
• SMP — examples: Cray CS6400, DEC 8000; memory: centralized, shared; address space: single; access model: UMA; interconnect: bus or crossbar.
• MPP — memory: distributed, unshared; address space: multiple; access model: NORMA; interconnect: custom network.
• DSM — example: Stanford DASH; memory: distributed, shared; address space: single; access model: NUMA; interconnect: custom network.
• COW — examples: Berkeley NOW, Alpha Farm; memory: distributed, unshared; address space: multiple; access model: NORMA; interconnect: commodity network.

Most real parallel applications have limited scalability in both machine size and problem size. For instance, some coarse-grain parallel radar signal processing program may use at most 256 processors to handle at most 100 radar channels. These limitations cannot be removed by simply increasing machine resources. The program has to be significantly modified to handle more processors or more radar channels.

Large-scale computer systems are generally classified into six architectural categories [25]: the single-instruction-multiple-data (SIMD) machines, the parallel vector processors (PVPs), the symmetric multiprocessors (SMPs), the massively parallel processors (MPPs), the clusters of workstations (COWs), and the distributed shared memory multiprocessors (DSMs). SIMD computers are mostly for special-purpose applications, which are beyond the scope of this paper. The remaining categories are all MIMD (multiple-instruction-multiple-data) machines.

Important common features in these parallel computer architectures are characterized below:
• Commodity Components: Most systems use commercially off-the-shelf, commodity components such as microprocessors, memory chips, disks, and key software.
• MIMD: Parallel machines are moving towards the MIMD architecture for general-purpose applications. A parallel program running on such a machine consists of multiple processes, each executing a possibly different code on a processor autonomously.
• Asynchrony: Each process executes at its own pace, independent of the speed of other processes. The processes can be forced to wait for one another through special synchronization operations, such as semaphores, barriers, blocking-mode communications, etc.
• Distributed Memory: Highly scalable computers all use distributed memory, either shared or unshared. Most of the distributed memories are accessed by the nonuniform memory access (NUMA) model. Most of the NUMA machines support no remote memory access (NORMA). The conventional PVPs and SMPs use the centralized, uniform memory access (UMA) shared memory, which may limit scalability.

Parallel Vector Processors

The structure of a typical PVP is shown in Fig. 1a. Examples of PVPs include the Cray C-90 and T-90. Such a system contains a small number of powerful custom-designed vector processors (VPs), each capable of at least 1 Gflop/s performance. A custom-designed, high-bandwidth crossbar switch connects these vector processors to a number of shared memory (SM) modules. For instance, in the T-90, the shared memory can supply data to a processor at 14 GB/s. Such machines normally do not use caches, but they use a large number of vector registers and an instruction buffer.

Symmetric Multiprocessors

The SMP architecture is shown in Fig. 1b. Examples include the Cray CS6400, the IBM R30, the SGI Power Challenge, and the DEC AlphaServer 8000. Unlike a PVP, an SMP system uses commodity microprocessors with on-chip and off-chip caches. These processors are connected to a shared memory through a high-speed bus. On some SMPs, a crossbar switch is also used in addition to the bus. SMP systems are heavily used in commercial applications, such as database systems, on-line transaction systems, and data warehouses. It is important for the system to be symmetric, in that every processor has equal access to the shared memory, the I/O devices, and the operating system. This way, a higher degree of parallelism can be released, which is not possible in an asymmetric (or master-slave) multiprocessor system.

Massively Parallel Processors

To take advantage of the higher parallelism available in applications such as signal processing, we need to use more scalable computer platforms by exploiting the distributed memory architectures, such as MPPs, DSMs, and COWs. The term MPP generally refers to a large-scale computer system that has the following features:
• It uses commodity microprocessors in processing nodes.
• It uses physically distributed memory over processing nodes.
• It uses an interconnect with high communication bandwidth and low latency.
• It can be scaled up to hundreds or even thousands of processors.

By this definition, MPPs, DSMs, and even some COWs in Table 1 are qualified to be called MPPs. The MPP modeled in Fig. 1c is more restricted, representing machines such as the Intel Paragon. Such a machine consists of a number of processing nodes, each containing one or more microprocessors interconnected by a high-speed memory bus to a local memory and a network interface circuitry (NIC). The nodes are interconnected by a high-speed, proprietary communication network.

Distributed Shared Memory Systems

DSM machines are modeled in Fig. 1d, based on the Stanford DASH architecture. A cache directory (DIR) is used to support distributed coherent caches [30]. The Cray T3D is also a DSM machine, but it does not use the DIR to implement coherent caches. Instead, the T3D relies on special hardware and software extensions to achieve the DSM at an arbitrary block-size level, ranging from words to large pages of shared data. The main difference of DSM machines from SMPs is that the memory is physically distributed among different nodes. However, the system hardware and software create an illusion of a single address space to application users.

Fig. 1. Conceptual architectures of five categories of scalable parallel computers: (a) parallel vector processor, (b) symmetric multiprocessor, (c) massively parallel processor, (d) distributed shared memory machine, (e) cluster of workstations. (Legend: P/C: microprocessor and cache; VP: vector processor; SM: shared memory; LM: local memory; LD: local disk; MB: memory bus; IOB: I/O bus; NIC: network interface circuitry; DIR: cache directory; Bridge: interface between memory bus and I/O bus.)

Clusters of Workstations

The COW concept is shown in Fig. 1e. Examples of COWs include the Digital Alpha Farm [16] and the Berkeley NOW [8]. COWs are a low-cost variation of MPPs. Important distinctions are listed below [36]:
• Each node of a COW is a complete workstation, minus the peripherals.
• The nodes are connected through a low-cost (compared to the proprietary network of an MPP) commodity network, such as Ethernet, FDDI, Fiber-Channel, and ATM switch.
• The network interface is loosely coupled to the I/O bus. This is in contrast to the tightly coupled network interface which is connected to the memory bus of a processing node.
• There is always a local disk, which may be absent in an MPP node.
• A complete operating system resides on each node, as compared to some MPPs where only a microkernel exists. The OS of a COW is the same UNIX as on a stand-alone workstation, plus an add-on software layer to support parallelism, communication, and load balancing.

The boundary between MPPs and COWs is becoming fuzzy these days. The IBM SP2 is considered an MPP. But it also has a COW architecture, except that a proprietary High-Performance Switch is used as the communication network. COWs have many cost-performance advantages over the MPPs. Clustering of workstations, SMPs, and/or PCs is becoming a trend in developing scalable parallel computers [36].

MPP Architectural Evaluation

Architectural features of five MPPs are summarized in Table 2. The configurations of the SP2, T3D, and Paragon are based on the current systems to which our USC team has actually ported the STAP benchmarks. Both the SP2 and the Paragon are message-passing multicomputers with the NORMA memory access model [26]. Internode communication relies on explicit message passing in these NORMA machines. The ASCI TeraFLOP system is the successor of the Paragon. The T3D and its successor T3E are both MPPs based on the DSM model.

MPP Architectures

Among the three existing MPPs, the SP2 has the most powerful processors for floating-point operations. Each POWER2 processor has a peak speed of 267 Mflop/s, almost two to three times higher than each Alpha processor in the T3D and each i860 processor in the Paragon, respectively. The Pentium Pro processor in the ASCI TFLOPS machine has the potential to compete with the POWER2 processor in the future. The successor of the T3D (the T3E) will use the new Alpha 21164, which has the potential to deliver 600 Mflop/s with a 300 MHz clock. The T3E and TFLOPS are scheduled to appear in late 1996.

The Intel MPPs (Paragon and TFLOPS) continue using the 2-D mesh network, which is the most scalable interconnect among all existing MPP architectures.

Table 2. Architectural Features of Five MPPs

• IBM SP2 — sample configuration: 400 nodes, 100 Gflop/s, at MHPCC; CPU: 67 MHz, 267 Mflop/s POWER2; node: 1 processor, 64 MB-2 GB local memory, 1-4.5 GB local disk; interconnect and memory: multistage network, NORMA; compute-node OS: complete AIX (IBM UNIX); native programming: message passing (MPL); other models: MPI, PVM, HPF, Linda; point-to-point latency and bandwidth: 40 µs, 35 MB/s.
• Cray T3D — sample configuration: 512 nodes, 153 Gflop/s, at NSA; CPU: 150 MHz, 150 Mflop/s Alpha 21064; node: 2 processors, 64 MB memory, shared disk; interconnect and memory: 3-D torus, DSM; compute-node OS: microkernel; native programming: shared variable and message passing (PVM); other models: MPI, HPF; latency and bandwidth: 2 µs, 150 MB/s.
• Cray T3E — sample configuration: maximal 512 nodes, 1.2 Tflop/s; CPU: 300 MHz, 600 Mflop/s Alpha 21164; node: 4-8 processors, 256 MB-16 GB DSM memory, shared disk; interconnect and memory: 3-D torus, DSM; compute-node OS: microkernel based on Chorus; native programming: shared variable and message passing (PVM); other models: MPI, HPF; bandwidth: 480 MB/s.
• Intel Paragon — sample configuration: 400 nodes, 40 Gflop/s, at SDSC; CPU: 50 MHz, 100 Mflop/s Intel i860; node: 1-2 processors, 16-128 MB local memory, 48 GB shared disk; interconnect and memory: 2-D mesh, NORMA; compute-node OS: microkernel; native programming: message passing (NX); other models: SUNMOS, MPI, PVM; latency and bandwidth: 30 µs, 175 MB/s.
• Intel ASCI TeraFLOPS — sample configuration: 4536 nodes, 1.8 Tflop/s, at SNL; CPU: 200 MHz, 200 Mflop/s Pentium Pro; node: 2 processors, 32-256 MB local memory, shared disk; interconnect and memory: split 2-D mesh, NORMA; compute-node OS: light-weight kernel (LWK); native programming: message passing (MPI based on NX, PVM); latency and bandwidth: 10 µs, 380 MB/s.

Fig. 2. Improvement trends of various performance attributes (clock rate, processor speed, total speed, total memory, machine size, communication bandwidth, and latency) in (a) Cray vector supercomputers (Cray 1, X-MP, Y-MP, C-90, T-90; 1979-1995) and (b) Intel MPPs (iPSC/1, iPSC/2, iPSC/860, Paragon, TeraFLOP; 1985-1996).

This is evidenced by the fact that the Paragon scales to 4536 nodes (9072 Pentium Pro processors) in the TFLOPS. The Cray T3D/T3E use a 3-D torus network. The IBM SP2 uses a multistage Omega network. The latency and bandwidth numbers are for one-way, point-to-point communication between two node processes. The latency is the time to send an empty message. The bandwidth refers to the asymptotic bandwidth for sending large messages. While the bandwidth is mainly limited by the communication hardware, the latency is mainly limited by the software overhead. The distributed shared memory design of the T3D allows it to achieve the lowest latency of only 2 µs.

Message passing is supported as a native programming model in all three MPPs. The T3D is the most flexible machine in terms of programmability. Its native MPP programming language (called Cray Craft) supports three models: the data parallel Fortran 90, shared-variable extensions, and message-passing PVM [18]. All MPPs also support the standard Message-Passing Interface (MPI) library [20]. We have used MPI to code the parallel STAP benchmark programs. This approach makes them portable among all three MPPs.

Our MPI-based STAP benchmarks are readily portable to the next generation of MPPs, namely the T3E, the ASCI, and the successor to the SP2. In 1996 and beyond, this implies that the portable STAP benchmark suite can be used to evaluate these new MPPs. Our experience with the STAP radar benchmarks can also be extended to convert SAR (synthetic aperture radar) and ATR (automatic target recognition) programs for parallel execution on future MPPs.

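A minimal sketch of the kind of ping-pong microbenchmark commonly used to measure such one-way latency and asymptotic bandwidth figures (written against standard MPI purely for illustration; the numbers in Table 2 come from the vendors' native communication mechanisms, so this code is not expected to reproduce them exactly):

```c
/* Ping-pong sketch: estimates one-way message time and bandwidth between
 * ranks 0 and 1.  Illustrative only; run with at least two MPI processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, reps = 1000;
    int m = (argc > 1) ? atoi(argv[1]) : 0;        /* message size in bytes */
    char *buf = calloc(m > 0 ? m : 1, 1);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t_start = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, m, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, m, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t_start) / (2.0 * reps);  /* s per one-way trip */

    if (rank == 0)
        printf("m = %d bytes: one-way %.2f us, bandwidth %.1f MB/s\n",
               m, one_way * 1e6, m > 0 ? m / one_way / 1e6 : 0.0);

    MPI_Finalize();
    free(buf);
    return 0;
}
```
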
Hot CPU Chips

Most current systems use commodity microprocessors. With the widespread use of microprocessors, the chip companies can afford to invest huge resources into research and development on microprocessor-based hardware, software, and applications. Consequently, the low-cost commodity microprocessors are approaching the performance of custom-designed processors used in Cray supercomputers. The speed performance of commodity microprocessors has been increasing steadily, almost doubling every 18 months during the past decade.

From Table 3, the Alpha 21164A is by far the fastest microprocessor announced in late 1995 [17]. All high-performance CPU chips are made from CMOS technology consisting of 5M to 20M transistors. With a low-voltage supply from 2.2 V to 3.3 V, the power consumption falls between 20 W and 30 W. All five CPUs are superscalar processors, issuing 3 or 4 instructions per cycle. The clock rate increases beyond 200 MHz and approaches 417 MHz for the 21164A. All processors use dynamic branch prediction along with an out-of-order RISC execution core. The Alpha 21164A, UltraSPARC II, and R10000 have comparable floating-point speed approaching 600 SPECfp92.

Scalable Growth Trends

Table 4 and Fig. 2 illustrate the evolution trends of the Cray supercomputer family and of the Intel MPP family. Commodity microprocessors have been improving at a much faster rate than custom-designed processors. The peak speed of Cray processors has improved 12.5 times in 16 years, half of which comes from faster clock rates. In 10 years, the peak speed of the Intel microprocessors has increased 5000 times, of which only 25 times come from faster clock rates; the remaining factor of 200 comes from advances in the processor architecture. In the same time period, the one-way, point-to-point communication bandwidth for the Intel MPPs has increased 740 times, and the latency has improved by 86.2 times. Cray supercomputers use fast SRAMs as the main memory. The custom-designed crossbar provides high bandwidth and low communication latency. As a consequence, applications run...

Citations
Journal ArticleDOI
TL;DR: This paper evaluates the IBM SP2 architecture, the AIX parallel programming environment, and the IBM message-passing library through STAP (Space-Time Adaptive Processing) benchmark experiments, and conducts a scalability analysis to reveal the performance growth rate as a function of machine size and STAP problem size.
Abstract: This paper evaluates the IBM SP2 architecture, the AIX parallel programming environment, and the IBM message-passing library (MPL) through STAP (Space-Time Adaptive Processing) benchmark experiments. Only coarse-grain parallelism was exploited on the SP2 due to its high communication overhead. A new parallelization scheme is developed for programming message passing multicomputers. Parallel STAP benchmark structures are illustrated with domain decomposition, efficient mapping of partitioned programs, and optimization of collective communication operations. We measure the SP2 performance in terms of execution time, Gflop/s rate, speedup over a single SP2 node, and overall system utilization. With 256 nodes, the Maui SP2 demonstrated the best performance of 23 Gflop/s in executing the High-Order Post-Doppler program, corresponding to a 34% system utilization. We have conducted a scalability analysis to reveal the performance growth rate as a function of machine size and STAP problem size. Important lessons learned from these parallel processing benchmark experiments are discussed in the context of real-time, adaptive, radar signal processing on massively parallel processors (MPP).

46 citations

Journal ArticleDOI
01 Oct 1996
TL;DR: The main contribution of this work lies in providing a systematic procedure to estimate the computational work-load, to determine the application attributes, and to reveal the communication overhead in using these MPPs.
Abstract: The performance of Massively Parallel Processors (MPPs) is attributed to a large number of machine and program factors. Software development for MPP applications is often very costly. The high cost is partially caused by a lack of early prediction of MPP performance. The program development cycle may iterate many times before achieving the desired performance level. In this paper, we present an early prediction scheme we have developed at the University of Southern California for reducing the cost of application software development. Using workload analysis and overhead estimation, our scheme optimizes the design of parallel algorithm before entering the tedious coding, debugging, and testing cycle of the applications. The scheme is generally applied at user/programmer level, not tied to any particular machine platform or any specific software environment. We have tested the effectiveness of this early performance prediction scheme by running the MIT/STAP benchmark programs on a 400-node IBM SP2 system at the Maui High-Performance Computing Center (MHPCC), on a 400-node Intel Paragon system at the San Diego Supercomputing Center (SDSC), and on a 128-node Cray T3D at the Cray Research Eagan Center in Wisconsin. Our prediction shows to be rather accurate compared with the actual performance measured on these machines. We use the SP2 data to illustrate the early prediction scheme. The main contribution of this work lies in providing a systematic procedure to estimate the computational work-load, to determine the application attributes, and to reveal the communication overhead in using these MPPs. These results can be applied to develop any MPP applications other than the STAP benchmarks by which this prediction scheme was developed.

34 citations

Journal ArticleDOI
TL;DR: This paper proposes an algorithm that considers the index computation time and the I/O time and reduces the overall execution time and results in an overall reduction in the execution time due to the elimination of the expensive index computation.
Abstract: Efficient transposition of out-of-core matrices has been widely studied. These efforts have focused on reducing the number of I/O operations. However, in state-of-the-art architectures, the memory-memory data transfer time and the index computation time are also significant components of the overall time. In this paper, we propose an algorithm that considers the index computation time and the I/O time and reduces the overall execution time. Our algorithm reduces the total execution time by reducing the number of I/O operations and eliminating the index computation. In doing so, two techniques are employed: writing the data on to disk in pre-defined patterns and balancing the number of disk read and write operations. The index computation time, which is an expensive operation involving two divisions and a multiplication, is eliminated by partitioning the memory into read and write buffers. The expensive in-processor permutation is replaced by data collection from the read buffer to the write buffer. Even though this partitioning may increase the number of I/O operations for some cases, it results in an overall reduction in the execution time due to the elimination of the expensive index computation. Our algorithm is analyzed using the well-known linear model and the parallel disk model. The experimental results on a Sun Enterprise, an SGI R12000 and a Pentium III show that our algorithm reduces the overall execution time by up to 50% compared with the best known algorithms in the literature.

29 citations


Cites background from "Scalable parallel computers for real-time signal processing":

  • "Matrix transpose is also a fundamental operation in adaptive signal processing [3, 9, 16, 21, 22]."

Journal ArticleDOI
TL;DR: This article describes an ESP application, an adaptive sonar beamformer, and shows a task mapping methodology for application software development based on the execution model (Lee et al., 1998), which uses a novel stage partitioning technique to exploit the independent activities in a processing stage.
Abstract: The main focus of this article is the design of embedded signal processing (ESP) application software. We identify the characteristics of such applications in terms of their computational requirements, data layouts, and latency and throughput constraints. We describe an ESP application, an adaptive sonar beamformer. Then, we briefly survey the state-of-the-art in high performance computing (HPC) technology and address the advantages and challenges in using HPC technology for implementing ESP applications. To describe the software design issues in this context, we define a task model to capture the features of ESP applications. This model specifies the independent activities in each processing stage. We also identify various optimization problems in parallelizing ESP applications. We address the key issues in developing scalable and portable algorithms for ESP applications. We focus on the algorithmic issues in exploiting coarse-grain parallelism. These issues include data layout design and task mapping. We show a task mapping methodology for application software development based on our execution model (Lee et al., 1998). This uses a novel stage partitioning technique to exploit the independent activities in a processing stage. We use our methodology to maximize the throughput of an ESP application for a given platform size. The resulting application software using this methodology is called a software task pipeline. An adaptive sonar beamformer has been implemented using this design methodology.

23 citations

Journal ArticleDOI
TL;DR: The parallelization of the H.261 video coding algorithm on the IBM SP2(R) multiprocessor system is described and the spatial-temporal algorithms achieved average speedup performance, but are most scalable for large n, with efficiency up to 70%.
Abstract: The parallelization of the H.261 video coding algorithm on the IBM SP2(R) multiprocessor system is described. The effect of parallelizing computations and communications in the spatial, temporal, and both spatial-temporal domains is considered through the study of frame rate, speedup, and implementation efficiency, which are modeled and measured with respect to the number of nodes (n) and parallel methods used. Four parallel algorithms were developed, of which the first two exploited the spatial parallelism in each frame, and the last two exploited both the temporal and spatial parallelism over a sequence of frames. The two spatial algorithms differ in that one utilizes a single communication master, while the other attempts to distribute communications across three masters. On the other hand, the spatial-temporal algorithms use a pipeline structure for exploiting the temporal parallelism together with either a single master or multiple masters. The best median speedup (frame rate) achieved was close to 15 [15 frames per second (fps)] for 352×240 video on 24 nodes, and 13 (37 fps) for QCIF video, by the spatial algorithm with distributed communications. For n 10, with efficiency up to 70%. The spatial-temporal algorithms achieved average speedup performance, but are most scalable for large n.

20 citations

References
Proceedings ArticleDOI
Gene Myron Amdahl1
18 Apr 1967
TL;DR: In this paper, the authors argue that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution.
Abstract: For over a decade prophets have voiced the contention that the organization of a single computer has reached its limits and that truly significant advances can be made only by interconnection of a multiplicity of computers in such a manner as to permit cooperative solution. Variously the proper direction has been pointed out as general purpose computers with a generalized interconnection of memories, or as specialized computers with geometrically related memory interconnections and controlled by one or more instruction streams.

3,653 citations

Journal ArticleDOI
01 Sep 1991
TL;DR: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercom puters that mimic the computation and data move ment characteristics of large-scale computational fluid dynamics applications.
Abstract: A new set of benchmarks has been developed for the performance evaluation of highly parallel supercom puters. These consist of five "parallel kernel" bench marks and three "simulated application" benchmarks. Together they mimic the computation and data move ment characteristics of large-scale computational fluid dynamics applications. The principal distinguishing feature of these benchmarks is their "pencil and paper" specification-all details of these benchmarks are specified only algorithmically. In this way many of the difficulties associated with conventional bench- marking approaches on highly parallel systems are avoided.

2,246 citations

Journal ArticleDOI
TL;DR: The PVM system, a heterogeneous network computing trends in distributed computing PVM overview other packages, and troubleshooting: geting PVM installed getting PVM running compiling applications running applications debugging and tracing debugging the system.
Abstract: Part 1 Introduction: heterogeneous network computing trends in distributed computing PVM overview other packages. Part 2 The PVM system. Part 3 Using PVM: how to obtain the PVM software setup to use PVM setup summary starting PVM common startup problems running PVM programs PVM console details host file options. Part 4 Basic programming techniques: common parallel programming paradigms workload allocation porting existing applications to PVM. Part 5 PVM user interface: process control information dynamic configuration signalling setting and getting options message passing dynamic process groups. Part 6 Program examples: fork-join dot product failure matrix multiply one-dimensional heat equation. Part 7 How PVM works: components messages PVM daemon libpvm library protocols message routing task environment console program resource limitations multiprocessor systems. Part 8 Advanced topics: XPVM porting PVM to new architectures. Part 9 Troubleshooting: geting PVM installed getting PVM running compiling applications running applications debugging and tracing debugging the system. Appendices: history of PVM versions PVM 3 routines.

2,060 citations

Book
01 Jan 1995
TL;DR: This work argues that the computer science community must overcome its "mental block" against massive parallelism, a block imposed by a misuse of Amdahl's formula.
Abstract: This article discusses the importance, for the computer science community, of overcoming the "mental block" against massive parallelism imposed by a misuse of Amdahl's formula.

1,342 citations

Journal ArticleDOI
TL;DR: In this article, the authors describe the "mental block" against massive parallelism imposed by a misuse of Amdahl's formula.
Abstract: This article discusses the importance, for the computer science community, of overcoming the "mental block" against massive parallelism imposed by a misuse of Amdahl's formula.

1,280 citations