5
Computer Architecture and Design
5.1 Server Computer Architecture
    Introduction • Client–Server Computing • Server Types • Server Deployment Considerations • Server Architecture • Challenges in Server Design • Summary
5.2 Very Large Instruction Word Architectures
    What Is a VLIW Processor? • Different Flavors of Parallelism • A Brief History of VLIW Processors • Defoe: An Example VLIW Architecture • The Intel Itanium Processor • The Transmeta Crusoe Processor • Scheduling Algorithms for VLIW
5.3 Vector Processing
    Introduction • Data Parallelism • History of Data Parallel Machines • Basic Vector Register Architecture • Vector Instruction Set Advantages • Lanes: Parallel Execution Units • Vector Register File Organization • Traditional Vector Computers versus Microprocessor Multimedia Extensions • Memory System Design • Future Directions • Conclusions
5.4 Multithreading, Multiprocessing
    Introduction • Parallel Processing Software Framework • Parallel Processing Hardware Framework • Concluding Remarks • To Probe Further • Acknowledgments
5.5 Survey of Parallel Systems
    Introduction • Single Instruction Multiple Processors (SIMD) • Multiple Instruction Multiple Data (MIMD) • Vector Machines • Dataflow Machine • Out of Order Execution Concept • Multithreading • Very Long Instruction Word (VLIW) • Interconnection Network • Conclusion
5.6 Virtual Memory Systems and TLB Structures
    Virtual Memory, a Third of a Century Later • Caching the Process Address Space • An Example Page Table Organization • Translation Lookaside Buffers: Caching the Page Table
Introduction
Jean-Luc Gaudiot
It is a truism that computers have become ubiquitous and portable in the modern world: Personal Digital
Assistants, as well as many other kinds of mobile computing devices, are easily available at low cost.
This is also true because of the ever-increasing availability of World Wide Web connectivity. One should
not forget, however, that these life-changing applications have only been made possible by the phenomenal
Jean-Luc Gaudiot, University of Southern California
Siamack Haghighi, Intel Corporation
Binu Matthew, University of Utah
Krste Asanovic, MIT Laboratory for Computer Science
Manoj Franklin, University of Maryland
Donna Quammen, George Mason University
Bruce Jacob, University of Maryland
5.3 Vector Processing
Krste Asanovic
Introduction
For nearly 30 years, vector processing has been used in the world's fastest supercomputers to accelerate
applications in scientific and technical computing. More recently, vector-like extensions have become
popular on desktop and embedded microprocessors to accelerate multimedia applications. In both cases,
architects are motivated to include data parallel instructions because they enable large increases in
performance at much lower cost than alternative approaches to exploiting application parallelism. This
chapter reviews the development of data parallel instruction sets from the early SIMD (single instruction,
multiple data) machines, through the vector supercomputers, to the new multimedia instruction sets.
Data Parallelism
An application is said to contain data parallelism when the same operation can be carried out across
arrays of operands, for example, when two vectors are added element by element to produce a result
vector. Data parallel operations are usually expressed as loops in sequential programming languages. If
each loop iteration is independent of the others, data parallel instructions can be used to execute the
code. The following vector add code, written in C, is a simple example of a data parallel loop:

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

Provided that the result array C does not overlap the source arrays A and B, the individual loop iterations
can be run in parallel. Many compute-intensive applications are built around such data parallel loop
kernels. One of the most important factors in determining the performance of data parallel programs is
the range of vector lengths observed for typical data sets. Vector lengths vary depending on the application,
how the application is coded, and also on the input data for each run. In general, the longer the vectors,
the greater the performance achieved by a data parallel architecture, as any loop startup overheads will
be amortized over a larger number of elements.
The performance of a piece of vector code running on a data parallel machine can be summarized with
a few key parameters. R_n is the rate of execution (for example, in MFLOPS) for a vector of length n.
R_∞ is the maximum rate of execution achieved assuming infinite-length vectors. N_1/2 is the number
of elements at which vector performance reaches one half of R_∞. N_1/2 indirectly measures startup
overhead, as it gives the vector length at which the time lost to overheads is equal to the time taken to
execute the vector operation at peak speed ignoring overheads. The larger the N_1/2 for a code kernel
running on a particular machine, the longer the vectors must be to achieve close to peak performance.
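As a minimal illustration of how these parameters interact (this sketch is not from the original text; the peak rate and startup overhead are hypothetical figures), the common linear timing model t(n) = t_start + n/R_∞ gives R_n = n/t(n) and N_1/2 = t_start · R_∞:

    #include <stdio.h>

    /* Simple vector performance model: time for an n-element vector
     * operation is t(n) = t_start + n / R_inf. At n = N_1/2 the startup
     * overhead equals the peak-rate execution time, so R_n = R_inf / 2. */
    int main(void) {
        double r_inf   = 1000.0;          /* peak rate, MFLOPS (hypothetical) */
        double t_start = 0.064;           /* startup overhead, microseconds (hypothetical) */
        double n_half  = t_start * r_inf; /* = 64 elements for these figures */

        for (int n = 1; n <= 1024; n *= 4) {
            double r_n = n / (t_start + n / r_inf);  /* achieved MFLOPS */
            printf("n = %4d  R_n = %6.1f MFLOPS (%4.1f%% of peak)\n",
                   n, r_n, 100.0 * r_n / r_inf);
        }
        printf("N_1/2 = %.0f elements\n", n_half);
        return 0;
    }

Running this shows performance climbing toward R_∞ as vectors lengthen, reaching exactly half of peak at N_1/2 elements, which is why long vectors amortize startup overheads.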
History of Data Parallel Machines
Data parallel architectures were first developed to provide high throughput for supercomputing appli-
cations. There are two main classes of data parallel architectures: distributed memory SIMD (single
instruction, multiple data [1]) architectures and shared memory vector architectures. An early example
of a distributed memory SIMD (DM-SIMD) architecture is the Illiac-IV [2]. A typical DM-SIMD
architecture has a general-purpose scalar processor acting as the central controller and an array of
processing elements (PEs), each with its own private memory, as shown in Fig. 5.8. The central processor
executes arbitrary scalar code and also fetches instructions and broadcasts them across the array of PEs,
which execute the operations in parallel and in lockstep. Usually the local memories of the PE array are
mapped into the central processor's address space so that it can read and write any word in the entire
machine. PEs can communicate with each other using a separate parallel inter-PE data network. Many
DM-SIMD machines, including the ICL DAP [3] and the Goodyear MPP [4], used single-bit processors
connected in a 2-D mesh, providing communication well-matched to image processing or scientific
simulations that could be mapped to a regular grid. The later Connection Machine design [5] added a
more flexible router to allow arbitrary communication between single-bit PEs, although at much slower
rates than the 2-D mesh connect. One advantage of single-bit PEs is that the number of cycles taken to
perform a primitive operation, such as an add, can scale with the precision of the operands, making them
well suited to tasks such as image processing where low-precision operands are common. An alternative
approach was taken in the Illiac-IV, where wide 64-bit PEs could be subdivided into multiple 32-bit or
8-bit PEs to give higher performance on reduced-precision operands. This approach reduces N_1/2 for
calculations on vectors with wider operands but requires more complex PEs. This same technique of
subdividing wide datapaths has been carried over into the new generation of multimedia extensions
(referred to as MX in the rest of this chapter) for microprocessors. The main attraction of DM-SIMD
machines is that the PEs can be much simpler than the central processor because they do not need to
fetch and decode instructions. This allows large arrays of simple PEs to be constructed, for example, up
to 65,536 single-bit PEs in the original Connection Machine design.
FIGURE 5.8 Structure of a distributed memory SIMD (DM-SIMD) processor.
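To make the broadcast-and-lockstep model concrete, here is a minimal software sketch (not from the original text; the PE count, memory size, and two-operation instruction set are invented for illustration). The central controller broadcasts one operation at a time, and every PE applies it to its own private memory:

    #include <stddef.h>

    #define NUM_PES  64   /* hypothetical array size          */
    #define MEM_SIZE 256  /* words of private memory per PE   */

    /* Each PE holds only local state; it never fetches or decodes
     * instructions itself, which is what keeps PEs simple. */
    typedef struct {
        int mem[MEM_SIZE];
    } PE;

    typedef enum { OP_ADD, OP_SUB } Op;

    /* The controller broadcasts (op, dst, a, b); all PEs execute it in
     * lockstep on their private memories. The sequential loop stands in
     * for what the hardware does simultaneously. */
    static void broadcast(PE pes[], Op op, size_t dst, size_t a, size_t b) {
        for (size_t p = 0; p < NUM_PES; p++) {
            switch (op) {
            case OP_ADD: pes[p].mem[dst] = pes[p].mem[a] + pes[p].mem[b]; break;
            case OP_SUB: pes[p].mem[dst] = pes[p].mem[a] - pes[p].mem[b]; break;
            }
        }
    }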
Shared-memory vector architectures (henceforth abbreviated to just vector architectures) also belong
to the class of SIMD machines, as they apply a single instruction to multiple data items. The primary
difference in the programming model of vector machines versus DM-SIMD machines is that vector
machines allow any PE to access any word in the system's main memory. Because it is difficult to construct
machines that allow a large number of simple processors to share a large central memory, vector machines
typically have a smaller number of highly pipelined PEs.
The two earliest commercial vector architectures were the CDC STAR-100 [6] and the TI ASC [7]. Both of
these machines were vector memory-memory architectures, where the vector operands to a vector instruc-
tion were streamed in and out of memory. For example, a vector add instruction would specify the start
addresses of both source vectors and the destination vector, and during execution elements were fetched
from memory before being operated on by the arithmetic unit, which produced a set of results to write
back to main memory.
The Cray-1 [8] was the first commercially successful vector architecture and introduced the idea of
vector registers. A vector register architecture provides vector arithmetic operations that can only take
operands from vector registers, with vector load and store instructions that only move data between the
vector registers and memory. Vector registers hold short vectors close to the vector functional units,
shortening instruction latencies and allowing vector operands to be reused from registers, thereby reducing
memory bandwidth requirements. These advantages have led to the dominance of vector register archi-
tectures, and vector memory-memory machines are ignored for the rest of this section.
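The contrast between the two styles can be sketched in C (illustrative only; the vreg struct and the fixed register length are invented here, and a real machine expresses each of these as a single instruction rather than a loop):

    #define VLEN 64  /* hypothetical vector register length */

    typedef struct { double elt[VLEN]; } vreg;  /* one vector register */

    /* Memory-memory style (STAR-100, TI ASC): one instruction names the
     * memory addresses of both sources and the destination, so every
     * operand element streams through main memory. */
    void vadd_memmem(double *dst, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Register style (Cray-1): arithmetic reads and writes vector
     * registers; separate loads and stores move data, so a loaded operand
     * can feed several arithmetic operations without another memory trip. */
    void vload(vreg *v, const double *mem) {
        for (int i = 0; i < VLEN; i++) v->elt[i] = mem[i];
    }
    void vstore(const vreg *v, double *mem) {
        for (int i = 0; i < VLEN; i++) mem[i] = v->elt[i];
    }
    void vadd(vreg *d, const vreg *a, const vreg *b) {
        for (int i = 0; i < VLEN; i++) d->elt[i] = a->elt[i] + b->elt[i];
    }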
DM-SIMD machines have two primary disadvantages compared to vector supercomputers when writing
applications. The first is that the programmer has to be extremely careful in selecting algorithms and mapping
data arrays across the machine to ensure that each PE can satisfy almost all of its data accesses from its local
memory, while ensuring the local data set still fits into the limited local memory of each PE. In contrast,
the PEs in a vector machine have equal access to all of main memory, and the programmer only has to
ensure that data accesses are spread across all the interleaved memory banks in the memory subsystem.
The second disadvantage is that DM-SIMD machines typically have a large number of simple PEs, and
so to avoid having many PEs sit idle, applications must have long vectors. For the large-scale DM-SIMD
machines, N_1/2 can be in the range of tens of thousands of elements. In contrast, the vector supercomputers
contain a few highly pipelined PEs and have N_1/2 in the range of tens to hundreds of elements.
To make effective use of a DM-SIMD machine, the programmer has to find a way to restructure code
to contain very long vector lengths, while simultaneously mapping data structures to the distributed small
local memories in each PE. Achieving high performance under these constraints has proven difficult
except for a few specialized applications. In contrast, the vector supercomputers do not require data
partitioning and provide reasonable performance on much shorter vectors, and so require much less
effort to port and tune applications. Although DM-SIMD machines can provide much higher peak
performance than vector supercomputers, sustained performance was often similar or lower and pro-
gramming effort was much higher. As a result, although they achieved some popularity in the 1980s,
DM-SIMD machines have disappeared from the high-end, general-purpose computing market with no
current commercial manufacturers, while there are still several manufacturers of high-end vector super-
computers with sufficient revenue to fund continued development of new implementations. DM-SIMD
architectures remain popular in a few niche special-purpose areas, particularly in image processing and
in graphics rendering, where the natural application parallelism maps well onto the DM-SIMD array,
providing extremely high throughput at low cost.
Although data parallel instructions were originally introduced for high-end supercomputers, they can
be applied to many applications outside of scientific and technical supercomputing. Beginning with the
Intel i860 released in 1989, microprocessor manufacturers have introduced data parallel instruction set
extensions that allow a small number of parallel SIMD operations to be specified in a single instruction. These
microprocessor SIMD ISA (instruction set architecture) extensions were originally targeted at multimedia
applications and supported only limited-precision, fixed-point arithmetic, but now support single and
double precision floating-point and hence a much wider range of applications. In this chapter, SIMD ISA
extensions are viewed as a form of short vector instruction to allow a unified discussion of design trade-offs.
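As a concrete illustration of such a short-vector extension (this example is not from the original text; it uses Intel's SSE intrinsics, a later extension than the i860's, chosen only because they are widely available), a single instruction here adds four single-precision elements at once:

    #include <xmmintrin.h>  /* Intel SSE intrinsics */

    /* Add four floats per iteration with one SIMD add (_mm_add_ps).
     * Assumes n is a multiple of 4 for brevity; real code handles the
     * leftover tail elements with scalar code. */
    void vec_add4(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 elements */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* c[i..i+3] = a+b */
        }
    }

In the short-vector view taken in this chapter, this is simply a vector add with a hardwired vector length of four.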
Basic Vector Register Architecture
Vector processors contain a conventional scalar processor that executes general-purpose code together
with a vector processing unit that handles data parallel code. Figure 5.9 shows the general architecture
of a typical vector machine. The vector processing unit includes a set of vector registers and a set of vector
functional units that operate on the vector registers. Each vector register contains a set of two or more
data elements. A typical vector arithmetic instruction reads source operand vectors from two vector
registers, performs an operation pair-wise on all elements in each vector register, and writes a result vector
to a destination vector register, as shown in Fig. 5.10. Often, versions of vector instructions are provided
that replace one vector operand with a scalar value; these are termed vector-scalar instructions. The
scalar value is used as one of the operand inputs at each element position.
FIGURE 5.9 Structure of a vector machine. This example has a central vector register file, two vector arithmetic
units (VAU), one vector load/store unit (VMU), and one vector mask unit (VFU) that operates on the mask registers.
(Adapted from Asanovic, K., Vector Microprocessors, 1998. With permission.)
FIGURE 5.10 Operation of a vector add instruction. Here, the instruction is adding vector registers 1 and 2 to give
a result in vector register 3.
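A vector-scalar instruction can be sketched in the same style as the earlier vector register model (again an illustrative model, not from the original text; vreg and VLEN are the invented types from the sketch above):

    /* Vector-scalar add: the scalar s stands in for one operand at every
     * element position, e.g. for source code like C[i] = A[i] + k. */
    void vadd_vs(vreg *d, const vreg *a, double s) {
        for (int i = 0; i < VLEN; i++)
            d->elt[i] = a->elt[i] + s;
    }

Providing the scalar form avoids first broadcasting the scalar into a vector register, saving an instruction and a register.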