5
Computer Architecture and Design
5.1 Server Computer Architecture
    Introduction • Client–Server Computing • Server Types • Server Deployment Considerations • Server Architecture • Challenges in Server Design • Summary
5.2 Very Large Instruction Word Architectures
    What Is a VLIW Processor? • Different Flavors of Parallelism • A Brief History of VLIW Processors • Defoe: An Example VLIW Architecture • The Intel Itanium Processor • The Transmeta Crusoe Processor • Scheduling Algorithms for VLIW
5.3 Vector Processing
    Introduction • Data Parallelism • History of Data Parallel Machines • Basic Vector Register Architecture • Vector Instruction Set Advantages • Lanes: Parallel Execution Units • Vector Register File Organization • Traditional Vector Computers versus Microprocessor Multimedia Extensions • Memory System Design • Future Directions • Conclusions
5.4 Multithreading, Multiprocessing
    Introduction • Parallel Processing Software Framework • Parallel Processing Hardware Framework • Concluding Remarks • To Probe Further • Acknowledgments
5.5 Survey of Parallel Systems
    Introduction • Single Instruction Multiple Processors (SIMD) • Multiple Instruction Multiple Data (MIMD) • Vector Machines • Dataflow Machine • Out of Order Execution Concept • Multithreading • Very Long Instruction Word (VLIW) • Interconnection Network • Conclusion
5.6 Virtual Memory Systems and TLB Structures
    Virtual Memory, a Third of a Century Later • Caching the Process Address Space • An Example Page Table Organization • Translation Lookaside Buffers: Caching the Page Table
Introduction
Jean-Luc Gaudiot
It is a truism that computers have become ubiquitous and portable in the modern world: Personal Digital
Assistants, as well as many other kinds of mobile computing devices, are easily available at low cost.
This is also true because of the ever-increasing availability of World Wide Web connectivity. One should
not forget, however, that these life-changing applications have only been made possible by the phenomenal
Jean-Luc Gaudiot, University of Southern California
Siamack Haghighi, Intel Corporation
Binu Matthew, University of Utah
Krste Asanovic, MIT Laboratory for Computer Science
Manoj Franklin, University of Maryland
Donna Quammen, George Mason University
Bruce Jacob, University of Maryland
5.3 Vector Processing
Krste Asanovic
Introduction
For nearly 30 years, vector processing has been used in the world's fastest supercomputers to accelerate
applications in scientific and technical computing. More recently, vector-like extensions have become
popular on desktop and embedded microprocessors to accelerate multimedia applications. In both cases,
architects are motivated to include data parallel instructions because they enable large increases in
performance at much lower cost than alternative approaches to exploiting application parallelism. This
chapter reviews the development of data parallel instruction sets from the early SIMD (single instruction,
multiple data) machines, through the vector supercomputers, to the new multimedia instruction sets.
Data Parallelism
An application is said to contain data parallelism when the same operation can be carried out across
arrays of operands, for example, when two vectors are added element by element to produce a result
vector. Data parallel operations are usually expressed as loops in sequential programming languages. If
each loop iteration is independent of the others, data parallel instructions can be used to execute the
code. The following vector add code, written in C, is a simple example of a data parallel loop:

    for (i = 0; i < N; i++)
        C[i] = A[i] + B[i];

Provided that the result array C does not overlap the source arrays A and B, the individual loop iterations
can be run in parallel. Many compute-intensive applications are built around such data parallel loop
kernels. One of the most important factors in determining the performance of data parallel programs is
the range of vector lengths observed for typical data sets. Vector lengths vary depending on the application,
how the application is coded, and also on the input data for each run. In general, the longer the vectors,
the greater the performance achieved by a data parallel architecture, as any loop startup overheads will
be amortized over a larger number of elements.
The performance of a piece of vector code running on a data parallel machine can be summarized with
a few key parameters. R_n is the rate of execution (for example, in MFLOPS) for a vector of length n.
R_∞ is the maximum rate of execution achieved assuming infinite-length vectors. N_1/2 is the number
of elements at which vector performance reaches one half of R_∞. N_1/2 indirectly measures startup
overhead, as it gives the vector length at which the time lost to overheads is equal to the time taken to
execute the vector operation at peak speed ignoring overheads. The larger the N_1/2 for a code kernel
running on a particular machine, the longer the vectors must be to achieve close to peak performance.
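As a minimal illustration of how these parameters interact (this sketch is not from the original text; the peak rate and startup overhead are hypothetical figures), the common linear timing model t(n) = t_start + n/R_∞ gives R_n = n/t(n) and N_1/2 = t_start · R_∞:

    #include <stdio.h>

    /* Simple vector performance model: time for an n-element vector
     * operation is t(n) = t_start + n / R_inf. At n = N_1/2 the startup
     * overhead equals the peak-rate execution time, so R_n = R_inf / 2. */
    int main(void) {
        double r_inf   = 1000.0;          /* peak rate, MFLOPS (hypothetical) */
        double t_start = 0.064;           /* startup overhead, microseconds (hypothetical) */
        double n_half  = t_start * r_inf; /* = 64 elements for these figures */

        for (int n = 1; n <= 1024; n *= 4) {
            double r_n = n / (t_start + n / r_inf);  /* achieved MFLOPS */
            printf("n = %4d  R_n = %6.1f MFLOPS (%4.1f%% of peak)\n",
                   n, r_n, 100.0 * r_n / r_inf);
        }
        printf("N_1/2 = %.0f elements\n", n_half);
        return 0;
    }

Running this shows performance climbing toward R_∞ as vectors lengthen, reaching exactly half of peak at N_1/2 elements, which is why long vectors amortize startup overheads.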
History of Data Parallel Machines
Data parallel architectures were first developed to provide high throughput for supercomputing appli-
cations. There are two main classes of data parallel architectures: distributed memory SIMD (single
instruction, multiple data [1]) architectures and shared memory vector architectures. An early example
of a distributed memory SIMD (DM-SIMD) architecture is the Illiac-IV [2]. A typical DM-SIMD
architecture has a general-purpose scalar processor acting as the central controller and an array of
processing elements (PEs), each with its own private memory, as shown in Fig. 5.8. The central processor
executes arbitrary scalar code and also fetches instructions and broadcasts them across the array of PEs,
which execute the operations in parallel and in lockstep. Usually the local memories of the PE array are
mapped into the central processor's address space so that it can read and write any word in the entire
machine. PEs can communicate with each other using a separate parallel inter-PE data network. Many
DM-SIMD machines, including the ICL DAP [3] and the Goodyear MPP [4], used single-bit processors
connected in a 2-D mesh, providing communication well-matched to image processing or scientific
simulations that could be mapped to a regular grid. The later Connection Machine design [5] added a
more flexible router to allow arbitrary communication between single-bit PEs, although at much slower
rates than the 2-D mesh connect. One advantage of single-bit PEs is that the number of cycles taken to
perform a primitive operation, such as an add, can scale with the precision of the operands, making them
well suited to tasks such as image processing where low-precision operands are common. An alternative
approach was taken in the Illiac-IV, where wide 64-bit PEs could be subdivided into multiple 32-bit or
8-bit PEs to give higher performance on reduced-precision operands. This approach reduces N_1/2 for
calculations on vectors with wider operands but requires more complex PEs. This same technique of
subdividing wide datapaths has been carried over into the new generation of multimedia extensions
(referred to as MX in the rest of this chapter) for microprocessors. The main attraction of DM-SIMD
machines is that the PEs can be much simpler than the central processor because they do not need to
fetch and decode instructions. This allows large arrays of simple PEs to be constructed, for example, up
to 65,536 single-bit PEs in the original Connection Machine design.
FIGURE 5.8 Structure of a distributed memory SIMD (DM-SIMD) processor.
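To make the broadcast-and-lockstep model concrete, here is a minimal software sketch (not from the original text; the PE count, memory size, and two-operation instruction set are invented for illustration). The central controller broadcasts one operation at a time, and every PE applies it to its own private memory:

    #include <stddef.h>

    #define NUM_PES  64   /* hypothetical array size          */
    #define MEM_SIZE 256  /* words of private memory per PE   */

    /* Each PE holds only local state; it never fetches or decodes
     * instructions itself, which is what keeps PEs simple. */
    typedef struct {
        int mem[MEM_SIZE];
    } PE;

    typedef enum { OP_ADD, OP_SUB } Op;

    /* The controller broadcasts (op, dst, a, b); all PEs execute it in
     * lockstep on their private memories. The sequential loop stands in
     * for what the hardware does simultaneously. */
    static void broadcast(PE pes[], Op op, size_t dst, size_t a, size_t b) {
        for (size_t p = 0; p < NUM_PES; p++) {
            switch (op) {
            case OP_ADD: pes[p].mem[dst] = pes[p].mem[a] + pes[p].mem[b]; break;
            case OP_SUB: pes[p].mem[dst] = pes[p].mem[a] - pes[p].mem[b]; break;
            }
        }
    }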
Shared-memory vector architectures (henceforth abbreviated to just vector architectures) also belong
to the class of SIMD machines, as they apply a single instruction to multiple data items. The primary
difference in the programming model of vector machines versus DM-SIMD machines is that vector
machines allow any PE to access any word in the system's main memory. Because it is difficult to construct
machines that allow a large number of simple processors to share a large central memory, vector machines
typically have a smaller number of highly pipelined PEs.
The two earliest commercial vector architectures were the CDC STAR-100 [6] and the TI ASC [7]. Both of
these machines were vector memory-memory architectures, where the vector operands to a vector instruc-
tion were streamed in and out of memory. For example, a vector add instruction would specify the start
addresses of both source vectors and the destination vector, and during execution elements were fetched
from memory before being operated on by the arithmetic unit, which produced a set of results to write
back to main memory.
The Cray-1 [8] was the first commercially successful vector architecture and introduced the idea of
vector registers. A vector register architecture provides vector arithmetic operations that can only take
operands from vector registers, with vector load and store instructions that only move data between the
vector registers and memory. Vector registers hold short vectors close to the vector functional units,
shortening instruction latencies and allowing vector operands to be reused from registers, thereby reducing
memory bandwidth requirements. These advantages have led to the dominance of vector register archi-
tectures, and vector memory-memory machines are ignored for the rest of this section.
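The contrast between the two styles can be sketched in C (illustrative only; the vreg struct and the fixed register length are invented here, and a real machine expresses each of these as a single instruction rather than a loop):

    #define VLEN 64  /* hypothetical vector register length */

    typedef struct { double elt[VLEN]; } vreg;  /* one vector register */

    /* Memory-memory style (STAR-100, TI ASC): one instruction names the
     * memory addresses of both sources and the destination, so every
     * operand element streams through main memory. */
    void vadd_memmem(double *dst, const double *a, const double *b, int n) {
        for (int i = 0; i < n; i++)
            dst[i] = a[i] + b[i];
    }

    /* Register style (Cray-1): arithmetic reads and writes vector
     * registers; separate loads and stores move data, so a loaded operand
     * can feed several arithmetic operations without another memory trip. */
    void vload(vreg *v, const double *mem) {
        for (int i = 0; i < VLEN; i++) v->elt[i] = mem[i];
    }
    void vstore(const vreg *v, double *mem) {
        for (int i = 0; i < VLEN; i++) mem[i] = v->elt[i];
    }
    void vadd(vreg *d, const vreg *a, const vreg *b) {
        for (int i = 0; i < VLEN; i++) d->elt[i] = a->elt[i] + b->elt[i];
    }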
DM-SIMD machines have two primary disadvantages compared to vector supercomputers when writing
applications. The first is that the programmer has to be extremely careful in selecting algorithms and mapping
data arrays across the machine to ensure that each PE can satisfy almost all of its data accesses from its local
memory, while ensuring the local data set still fits into the limited local memory of each PE. In contrast,
the PEs in a vector machine have equal access to all of main memory, and the programmer only has to
ensure that data accesses are spread across all the interleaved memory banks in the memory subsystem.
The second disadvantage is that DM-SIMD machines typically have a large number of simple PEs, and
so to avoid having many PEs sit idle, applications must have long vectors. For the large-scale DM-SIMD
machines, N_1/2 can be in the range of tens of thousands of elements. In contrast, the vector supercomputers
contain a few highly pipelined PEs and have N_1/2 in the range of tens to hundreds of elements.
To make effective use of a DM-SIMD machine, the programmer has to find a way to restructure code
to contain very long vector lengths, while simultaneously mapping data structures to the distributed small
local memories in each PE. Achieving high performance under these constraints has proven difficult
except for a few specialized applications. In contrast, the vector supercomputers do not require data
partitioning and provide reasonable performance on much shorter vectors, and so require much less
effort to port and tune applications. Although DM-SIMD machines can provide much higher peak
performance than vector supercomputers, sustained performance was often similar or lower and pro-
gramming effort was much higher. As a result, although they achieved some popularity in the 1980s,
DM-SIMD machines have disappeared from the high-end, general-purpose computing market with no
current commercial manufacturers, while there are still several manufacturers of high-end vector super-
computers with sufficient revenue to fund continued development of new implementations. DM-SIMD
architectures remain popular in a few niche special-purpose areas, particularly in image processing and
in graphics rendering, where the natural application parallelism maps well onto the DM-SIMD array,
providing extremely high throughput at low cost.
Although data parallel instructions were originally introduced for high-end supercomputers, they can
be applied to many applications outside of scientific and technical supercomputing. Beginning with the
Intel i860 released in 1989, microprocessor manufacturers have introduced data parallel instruction set
extensions that allow a small number of parallel SIMD operations to be specified in a single instruction. These
microprocessor SIMD ISA (instruction set architecture) extensions were originally targeted at multimedia
applications and supported only limited-precision, fixed-point arithmetic, but now support single and
double precision floating-point and hence a much wider range of applications. In this chapter, SIMD ISA
extensions are viewed as a form of short vector instruction to allow a unified discussion of design trade-offs.
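As a concrete illustration of such a short-vector extension (this example is not from the original text; it uses Intel's SSE intrinsics, a later extension than the i860's, chosen only because they are widely available), a single instruction here adds four single-precision elements at once:

    #include <xmmintrin.h>  /* Intel SSE intrinsics */

    /* Add four floats per iteration with one SIMD add (_mm_add_ps).
     * Assumes n is a multiple of 4 for brevity; real code handles the
     * leftover tail elements with scalar code. */
    void vec_add4(float *c, const float *a, const float *b, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 elements */
            __m128 vb = _mm_loadu_ps(&b[i]);
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* c[i..i+3] = a+b */
        }
    }

In the short-vector view taken in this chapter, this is simply a vector add with a hardwired vector length of four.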
Basic Vector Register Architecture
Vector processors contain a conventional scalar processor that executes general-purpose code together
with a vector processing unit that handles data parallel code. Figure 5.9 shows the general architecture
of a typical vector machine. The vector processing unit includes a set of vector registers and a set of vector
functional units that operate on the vector registers. Each vector register contains a set of two or more
data elements. A typical vector arithmetic instruction reads source operand vectors from two vector
registers, performs an operation pair-wise on all elements in each vector register, and writes a result vector
to a destination vector register, as shown in Fig. 5.10. Often, versions of vector instructions are provided
that replace one vector operand with a scalar value; these are termed vector-scalar instructions. The
scalar value is used as one of the operand inputs at each element position.
FIGURE 5.9 Structure of a vector machine. This example has a central vector register file, two vector arithmetic
units (VAU), one vector load/store unit (VMU), and one vector mask unit (VFU) that operates on the mask registers.
(Adapted from Asanovic, K., Vector Microprocessors, 1998. With permission.)
FIGURE 5.10 Operation of a vector add instruction. Here, the instruction is adding vector registers 1 and 2 to give
a result in vector register 3.
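A vector-scalar instruction can be sketched in the same style as the earlier vector register model (again an illustrative model, not from the original text; vreg and VLEN are the invented types from the sketch above):

    /* Vector-scalar add: the scalar s stands in for one operand at every
     * element position, e.g. for source code like C[i] = A[i] + k. */
    void vadd_vs(vreg *d, const vreg *a, double s) {
        for (int i = 0; i < VLEN; i++)
            d->elt[i] = a->elt[i] + s;
    }

Providing the scalar form avoids first broadcasting the scalar into a vector register, saving an instruction and a register.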