THE LASS HARDWARE PROCESSOR\*

Paul F. Kunz, Richard N. Fall, Michael F. Gravina

Stanford Linear Accelerator Center, Stanford University, Stanford, California 94305, U.S.A.

and

Hanoch Brafman

The Weizmann Institute, Rehovot, Israel

## ABSTRACT

The problems of data analysis with hardware processors are reviewed and a description is given of a programmable processor. This processor, the 168/E, has been designed for use in the LASS multi-processor system; it has an execution speed comparable to that of the IBM 370/168 and uses the subset of IBM 370 instructions appropriate to the LASS analysis task.

## 1. INTRODUCTION.

In the interest of performing systematic studies in high energy nuclear physics, several large spectrometers have been constructed at CERN, BNL, and SLAC [1]. These spectrometers are capable of taking data at such a rate that the amount of computing time required for the data analysis is becoming a major problem. At SLAC, for example, the Large Aperture Solenoid Spectrometer (LASS) has the capability of recording events on magnetic tape at an average rate of one event every 10 milliseconds [2]. However, the mean time required for processing an event at the SLAC computer center is of the order of 100 milliseconds.\*\* The goal of the LASS processor project is to process the events so as to cut down significantly the amount of computer time required to support a LASS experiment. With the advent of large detectors and relatively inexpensive read-out electronics, the problem of computer support for the data analysis in LASS is fast becoming a familiar one faced by many experimenters in high energy nuclear physics.

Section 2 of this paper discusses the criteria imposed on hardware processing in general, while section 3 reviews the components available for implementation. Section 4 describes the programmable processor designed for use with LASS. Finally, section 5 is a summary.

\* Work supported by the U.S. Department of Energy.

\*\* The SLAC computer center consists of a triplex system with two IBM 370/168's and one IBM 360/91.

SLAC-PUB-?198, September 1978

## 2. CRITERIA.

In order to specify the criteria for the LASS hardware processor, a study was made of the data analysis task. It was quickly realized that the inherent structure of the task lent itself naturally to a multi-processor system, since the overall task can be broken down into distinct sub-tasks such as unpacking the raw data, finding space points, finding line segments, etc. These sub-tasks generally try all combinations of a pair of coordinates in two planes and search for a match with coordinates in the remaining planes. It is these sub-tasks which account for the considerable amount of computer time required for execution.
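The combinatorial character of such a sub-task can be sketched as follows. This is an illustration only and not the LASS analysis code: the straight-line track model, the plane positions `z`, the tolerance `tol`, and the name `find_line_segments` are all assumptions introduced for the example.

```python
from itertools import product

def find_line_segments(hits, z, tol=0.5):
    """Try every pair of hits in the first and last planes and look for
    matching hits in the remaining planes (hypothetical straight-line model).

    hits: hits[p] is the list of coordinates measured in plane p
    z:    position of each plane along the beam (assumed known)
    """
    first, last = 0, len(hits) - 1
    segments = []
    for x0, x1 in product(hits[first], hits[last]):
        slope = (x1 - x0) / (z[last] - z[first])
        matched = [x0]
        for p in range(1, last):
            predicted = x0 + slope * (z[p] - z[first])
            # Innermost search: scan the coordinate list of plane p.
            close = [x for x in hits[p] if abs(x - predicted) <= tol]
            if not close:
                break                      # no match, abandon this pair
            matched.append(min(close, key=lambda x: abs(x - predicted)))
        else:
            matched.append(x1)             # every intermediate plane matched
            segments.append(matched)
    return segments

planes = [[0.0, 10.0], [1.0, 20.0], [2.0, 5.0]]
print(find_line_segments(planes, z=[0.0, 1.0, 2.0]))
```

The number of pairs grows as the product of the hit counts in the two seed planes, and each pair triggers a scan of the other planes, which is why the innermost scan dominates the running time.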

The question remained, however, as to what set of criteria one should use in selecting the individual processors. The criteria used for LASS, outlined in the following sections, are similar to those one might use in other projects, whether they be single or multi-processor systems.

## 2.1 SPEED OF EXECUTION.

The effective execution speed of the overall system must be about an order of magnitude faster than a large scale computer such as SLAC's IBM 370/168. Such speeds are difficult to achieve since the 370/168 has a cycle time of 80 nanoseconds and can do a memory to register ADD in 4 cycles.
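The figures quoted here and in the introduction can be checked with a few lines of arithmetic. The sketch below uses only numbers from the text; the 100 millisecond processing time is quoted as an order of magnitude, so the resulting factor is approximate.

```python
CYCLE_NS = 80            # 370/168 cycle time
ADD_CYCLES = 4           # cycles for a memory to register ADD
EVENT_PERIOD_MS = 10     # LASS records one event every 10 ms on average
PROCESS_TIME_MS = 100    # mean processing time per event (order of magnitude)

add_time_ns = CYCLE_NS * ADD_CYCLES
required_speedup = PROCESS_TIME_MS / EVENT_PERIOD_MS

print(f"memory to register ADD on the 370/168: {add_time_ns} ns")
print(f"speed-up needed to keep pace with data taking: about {required_speedup:.0f}x")
```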

## 2.2 SPEED OF PROGRAMMING.

An aspect frequently underestimated in hardware processing projects is the programming time. Since the intention is to duplicate in hardware a complex algorithm which normally requires a considerable effort in software on a large computer, the means by which one will understand, write, debug, and support the program in a hardware processor is an important consideration.

(To be presented at the 11th Annual Microprogramming Workshop, Asilomar, Pacific Grove, CA., November 19-22, 1978)

## 2.3 FLEXIBILITY.

Program algorithms frequently change as one gains experience, encounters unforeseen problems, or changes the focus of the experiment in light of preliminary data. The programs may also change as new or modified detectors are brought into the apparatus, or as different experiments are run on the same apparatus. It is important therefore that a processor's program can be easily modified.

#### 2.4 RELIABILITY.

The processors and the system in which they are contained should be as simple as possible in order to achieve a high degree of reliability. A modular system would allow easy replacement of faulty modules or the introduction of upgraded ones. The modules should be made of components which can be replaced, if faulty, by parts readily available from stock.

#### 2.5 SPEED OF FABRICATION.

The fabrication time, which includes the time it takes to design, build and debug the processing system, must take into account the talents of the people involved in the project. To be practical one should make maximum use of technologies which are already well known.

## 2.6 COST AND SIZE.

The cost and size are important criteria if one is going to have many parallel processors.

#### 2.7 COMPATIBILITY.

One would like to have a system which has maximum compatibility with existing equipment, including the format in which the data is presented by the detectors and the physical configuration of the apparatus.

## 3. REVIEW OF AVAILABLE COMPONENTS.

A frequently used approach to hardware processing is to build hardwired boxes with point-to-point logic.\* It allows one to design an extremely fast processor. For example, one can do the calculation

$$\mathrm{XP} = A\,X(i) + B\,Y(i)$$

in the time it takes to do one multiplication and one addition by building two parallel multipliers and separate memory banks for X and Y. Furthermore, 16 by 16 bit multiplication time can be reduced to under 200 nanoseconds by using specialized integrated circuits. This approach has been made considerably easier in recent years with the availability of MSI and LSI integrated circuits. But since the program is effectively contained in the point-to-point wiring, it suffers from certain severe disadvantages. For example, the writing of a program takes a considerable logic design effort, and the debugging or changing of the program usually involves rewiring sections of the circuit. Consequently, these processors take a long time to build and debug. Flexibility is limited when the algorithm one would like to use has been simplified in order to be implemented in hardware, and only a limited range of program changes is allowed without a major reworking of the circuitry. Also, reliability is impaired by the fact that the circuits are one of a kind and hence cannot be easily replaced and must be repaired by an expert when faulty. Cost and size of such processors may be reasonable, but frequently they are not compatible with existing equipment. For the above reasons it was decided that hardwired processors were undesirable for a large spectrometer facility such as LASS.

\* For an excellent review with a large bibliography see C. Verkerk, "Special Purpose Processors", Proc. 1974 CERN School of Computing, Godøysund, Norway, August 1974.

The required fast effective execution speed can also be achieved by an array of programmable processors. There are many inexpensive programmable processors commercially available today which one might consider as elements in a multi-processor system. Since in recent years the cost of minicomputers has dropped considerably and their speed has increased, one might also consider their use in such a system.

In order to compare various processors, a study was made of the execution time of a simple DO-LOOP which frequently occurs in the data analysis task as the innermost DO-LOOP of many of the sub-tasks. Our studies show that for a space-point or line-finding subroutine, the repeated execution of this DO-LOOP can account for about half of the total execution time. The execution time for this DO-LOOP gives us a rough idea of the execution speed of various processors without running benchmark programs.

The equivalent FORTRAN statements for the DO-LOOP studied are:

```
      DO 100 I=1,N
      IF (X(I) .LT. XP) GO TO 200
  100 CONTINUE
      .
  200 CONTINUE
```

In machine code the DO-LOOP consists of only four operations:

- a COMPARE of a measured coordinate with a predicted coordinate in memory;
- a BRANCH if the compare was low;
- a DECrement of the coordinate index; and
- a BRANCH to the top of the loop if one has not exhausted the coordinate list.
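The four operations can be mirrored one for one in a short sketch. The function name `scan_coordinates` is hypothetical, and the index is counted downward as in the machine code description rather than upward as in the FORTRAN statements.

```python
def scan_coordinates(x, xp):
    """Return the index of the first coordinate, scanning downward, that is
    lower than the predicted coordinate xp, or None if the list is exhausted."""
    if not x:
        return None
    i = len(x) - 1
    while True:
        low = x[i] < xp      # 1) COMPARE the measured X(i) with the predicted XP
        if low:              # 2) BRANCH out of the loop if the compare was low
            return i
        i -= 1               # 3) DECrement the coordinate index
        if i < 0:            # 4) BRANCH back to the top unless the list is exhausted
            return None

print(scan_coordinates([5, 3, 9, 7], 4))
```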

Table I shows the execution time of this simple loop for various programmable processors. For each processor except the right-most two, the code was optimized in assembly language with 16 bit integer data. The approximate cost of each processor, relative to the Intel 8080, is also given.

The two popular MOS micro-processors suffer in this comparison because of their 8-bit word size; thus the LSI-11 has a clear advantage over them. The execution time of a typical mini-computer is represented by the PDP-11/40, while that of an advanced mini-computer with MOS memory is represented by the PDP-11/45. The PDP-11's are both micro-programmed processors, so they also represent roughly the kind of performance one could achieve by designing a mini-computer with an LSI bipolar micro-processor slice such as the 2901 series. The execution time on an IBM 370/168 is shown for comparison. One should bear in mind that to meet the real-time data rate of LASS one needs a system which is an order of magnitude faster than the 370/168. Thus in a system of parallel processors one would need 10 370/168's, 30 PDP-11/45's, 70 PDP-11/40's, 160 LSI-11's or 200 Intel 8080's. None of these options is within our budget, and even if it were, it is deemed extremely difficult to organize the interconnection of so many processors into a workable system.

TABLE I.

COMPARISON OF PROGRAMMABLE PROCESSORS

| Manufacturer<br>Model<br>Program Code | Intel<br>8080 | Motorola<br>6800 | DEC<br>LSI-11 | DEC<br>PDP-11/40 | DEC<br>PDP-11/45 | IBM<br>370/168 | SLAC<br>168/E |
|---------------------------------------|---------------|------------------|---------------|------------------|------------------|----------------|---------------|
| COMpare Xi and Xp                     | 16.0 µs       | 16.0 µs          | 4.9 µs        | 2.5 µs           | 0.9 µs           | 0.32 µs        | 0.45 µs       |
| BRANCH LOW                            | 5.0           | 4.0              | 3.5           | 1.4              | 0.5              | 0.24           | 0.15          |
| DECrement i                           | 2.5           | 4.0              | 4.2           | 1.0              | 0.5              | 0.08           | 0.15          |
| BRANCH Greater                        | 5.0           | 4.0              | 3.5           | 1.8              | 0.9              | 0.36           | 0.15          |
| Total Time                            | 28.5          | 28.0             | 16.1          | 6.7              | 2.8              | 1.00           | 0.90          |
| Relative Cost                         | 1             | 1                | 1.2           | 10               | 30               | 3000           | 2             |
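The processor counts quoted in the text follow from the total loop times in the table. The sketch below reproduces that estimate under an idealized assumption that is not in the original: perfect parallel scaling with no interconnection overhead. The helper name `processors_needed` is hypothetical, and the text quotes the results as round numbers.

```python
import math

# Total loop times in microseconds, taken from the "Total Time" row.
total_us = {
    "Intel 8080": 28.5,
    "Motorola 6800": 28.0,
    "DEC LSI-11": 16.1,
    "DEC PDP-11/40": 6.7,
    "DEC PDP-11/45": 2.8,
    "IBM 370/168": 1.00,
    "SLAC 168/E": 0.90,
}

SPEEDUP = 10  # the system must be an order of magnitude faster than one 370/168

def processors_needed(name):
    """Ideal number of parallel processors, ignoring all system overhead."""
    ratio = total_us[name] * SPEEDUP / total_us["IBM 370/168"]
    return math.ceil(ratio - 1e-9)   # small guard against floating-point noise

for name in total_us:
    print(f"{name:14s} {processors_needed(name):4d}")
```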

Another aspect of this comparison of processors is the difference in their instruction sets. For example, the 6800 matches the performance of the 8080, in spite of its longer cycle time, because it requires only 7 instructions, rather than 9, for the DO-LOOP. The LSI-11, PDP-11's and IBM 370 require only 4 instructions. In general, an average programmer can produce faster and more efficient code with a processor that has a more flexible instruction set. The IBM 370/168 has a certain advantage over the other processors considered, having 16 working registers which can be used either as accumulators or index registers.

None of the available inexpensive processors is fast enough, nor do they have a sufficiently powerful instruction set. Consequently a programmable processor has been designed which would meet our needs. The remainder of this paper discusses the features of this processor.

## 4. THE LASS PROGRAMMABLE PROCESSOR: 168/E

The LASS hardware processors have been designed so that they are very fast, easily programmed, and relatively low in cost. Each processor has an execution speed comparable to that of an IBM 370/168 and, in order to minimize the programming task, the processors have been designed to emulate efficiently a subset of the 370 machine instructions.

#### 4.1 PROCESSOR HARDWARE.

The processors, which have been given the name 168/E, are divided into four parts: a program memory 24 bits wide, a data memory 32 bits wide, an integer processing unit, and a floating-point processing unit. The separation of program and data memories, which allows simultaneous access to them, is an important feature for the speed of execution. Figure 1 shows a block diagram of the processing unit, and the following paragraphs discuss its various features.

The integer processing unit is contained on one circuit board. Its basic sections are the micro-processor slice array with its associated control logic, the branch control logic, and the data memory control logic. The most significant bits of the program memory (the control field) determine the control section within the integer processing unit which will execute the instruction, with data for the instruction in the remaining bits (the data field).



Figure 1 : Block diagram of the 168/E

The heart of the processing unit is an array of 8 bipolar LSI micro-processor 4-bit slices, the 2901A. As shown in figure 2, it comprises an 8 function Arithmetic Logic Unit; 16 addressable registers with dual port readout; an auxiliary register Q which is used for double precision shifts and multiplication; and a shifting network at the register file input ports. In addition there are status outputs to indicate CARRY or OVERFLOW conditions and ZERO or NEGATIVE results.

The micro-processor slices require 18 bits of information to execute an instruction: 3 bits to define the source operands, 3 bits to define the function, 1 bit for the CARRY input, 3 bits to define the destination, and 4 bits each to define the two read addresses of the register file. A register is used between the program memory and the micro-processor slices so that the instruction fetch cycle can be pipelined. The simultaneous fetch and execution feature of this processor is another of the reasons for its high execution speed.
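The 18-bit control word can be illustrated by packing and unpacking the six fields. The field widths come from the text; the field names, their ordering within the word, and the helpers `pack` and `unpack` are assumptions made for this sketch.

```python
from collections import namedtuple

# Widths are from the text; names and bit ordering are assumed for illustration.
FIELDS = (("source", 3), ("function", 3), ("carry_in", 1),
          ("destination", 3), ("a_address", 4), ("b_address", 4))
SliceControl = namedtuple("SliceControl", [name for name, _ in FIELDS])

def pack(ctrl):
    """Pack the six fields into one 18-bit integer, first field most significant."""
    word = 0
    for (name, width), value in zip(FIELDS, ctrl):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        word = (word << width) | value
    return word

def unpack(word):
    """Inverse of pack(): split an 18-bit integer back into its fields."""
    values = []
    for _, width in reversed(FIELDS):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return SliceControl(*reversed(values))

total_bits = sum(width for _, width in FIELDS)
example = SliceControl(source=1, function=0, carry_in=0,
                       destination=3, a_address=9, b_address=10)
print(total_bits, unpack(pack(example)) == example)
```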

External to the slice array is a 15 bit binary program counter. Normally, the processor clock steps the processor sequentially through the program memory. An unconditional BRANCH instruction is executed by a parallel load to the counter from the data field of the program memory or the data output of the slices. The status bits from the slice are not bit for bit the same as the 370/168 condition code bits, but with a few logic gates the 2901A status bits can be changed to match those of the 370/168 exactly. These modified status bits can then be loaded into the condition code register in the integer processor. A conditional BRANCH instruction is executed by placing the counter in the parallel load mode if the status bits of the slice match those in the current processor instruction.
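The behaviour of the program counter can be modelled in a few lines. This is a sketch under stated assumptions: the 4-bit mask follows the IBM 370 BRANCH ON CONDITION convention, and the text does not say whether the 168/E program word encodes the condition in this form. The function name `next_pc` is hypothetical.

```python
PC_BITS = 15
PC_MASK = (1 << PC_BITS) - 1

def next_pc(pc, branch=False, target=0, mask=0b1111, cond_code=0):
    """Return the next program counter value.

    mask uses the IBM 370 convention (assumed here): its leftmost bit selects
    condition code 0 and its rightmost bit selects condition code 3, so
    mask=0b1111 describes an unconditional branch.
    """
    taken = branch and bool(mask & (0b1000 >> cond_code))
    if taken:
        return target & PC_MASK       # parallel load of the counter
    return (pc + 1) & PC_MASK         # normal sequential step

# BRANCH LOW after a compare: taken only when the condition code is 1.
print(next_pc(5, branch=True, target=100, mask=0b0100, cond_code=1))
print(next_pc(5, branch=True, target=100, mask=0b0100, cond_code=2))
```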



Figure 2 : Block diagram of the 2901A

The programs for these processors can almost always be written in such a way that the BRANCH addresses are known at load time, and this address can be placed in the data field of the program memory. Thus, most BRANCH instructions can be executed in one machine cycle.

A hardware multiplication and division algorithm has been implemented in the 168/E processors. It is done by momentarily stopping the program counter clock while cycling the slices through conditional ADD and SHIFT instructions. In order to allow for efficient indexing of the data memory, the data memory address is formed by an ADD of bits from the data field of the program memory and from the outputs of the slices. Data may be written to the memory from the slices, and similarly data may be presented to the direct inputs of the slices from the data memory or from the program memory.
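A multiplication built from conditional ADD and SHIFT steps can be sketched as below. This is the textbook shift-and-add method for unsigned operands, shown only to illustrate the idea; the text does not give the details of the 168/E hardware sequence, and the handling of signed operands is not modelled. The 32-bit default width corresponds to 8 slices of 4 bits, and `q` plays the role of an auxiliary shift register.

```python
def shift_add_multiply(a, b, bits=32):
    """Unsigned multiply using one conditional ADD and one SHIFT per multiplier bit."""
    mask = (1 << bits) - 1
    if not (0 <= a <= mask and 0 <= b <= mask):
        raise ValueError("operands must fit in the word width")
    acc, q = 0, b                     # accumulator and multiplier register
    for _ in range(bits):
        if q & 1:                     # conditional ADD of the multiplicand
            acc += a
        # SHIFT the double-length pair (acc, q) right by one bit.
        q = (q >> 1) | ((acc & 1) << (bits - 1))
        acc >>= 1
    return (acc << bits) | q          # double-length product

print(shift_add_multiply(12345, 67890))
```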

All of the 168/E instructions that manipulate floating-point quantities are executed by the floating-point processing unit (not shown in figure 1). This unit comprises a two-port register file, an ALU, a control unit, and a hardware shifting network.

When a floating-point instruction is encountered by the 168/E, the clock to the integer processor is stopped, and the floating-point processor is allowed to execute the instruction. If the instruction is one that calls for a setting of the condition codes, the appropriate information from the floating-point processor is strobed into the condition code register in the integer processing unit. If the floating-point instruction requires multiple clock cycles, the floating-point control unit stops the clock to the integer processor for as long as is necessary. In the case of floating-point instructions that require data memory accesses, the data memory address is generated by the integer processor, and the data is strobed into a working register in the floating-point unit.

The floating-point processor can manipulate quantities in either 32 or 48 bit precision. The 32 bit precision yields results identical to the single precision of the 370/168, while the 48 bit length is a pseudo-double precision that has been found to be sufficient for most calculations done in LASS experiments. The precision of the floating-point unit could have been extended by widening the data paths on the floating-point processing unit, but since the interconnection complexity grows rapidly above 48 bits, a compromise between cost and precision was made. This 48 bit precision mode is the only place where the 168/E does not match the 370/168 in the results it produces.

Since the floating-point processing unit is constructed separately from the integer processing unit, its inclusion in a user's 168/E processor is optional.

#### 4.2 PROCESSOR SOFTWARE.

Important aspects of the processor's structure will become apparent by comparison of the program code generated to perform the DO-LOOP described above. Table II shows the DO-LOOP implemented on the IBM 370/168 and the 168/E processor.

The first instruction on the 370/168, cf. Table II, is a COMPARE between the contents of a memory location and register 0. The memory address is formed by the sum of register 9 (the index register), register 10 (the base register), and 12 bits from the instruction (the displacement ED). The 168/E performs the same operation in three micro-instruction cycles. In the first cycle, the slices execute an instruction which places the sum of registers 9 and 10 at their output. In the second, the displacement from the data field of the program memory is added to the outputs of the slices and loaded into the memory address register. In the same cycle, the memory is switched to the read mode and the data is strobed into a working register of the floating-point unit. In the third cycle, the comparison is made in the floating-point unit between register 0 and the working register. If the instruction had been an integer compare instead of a floating-point compare, then the integer processor would have operated on the data memory contents. As shown in Table II, the remaining three instructions on the 370/168 can be implemented in one cycle each on the 168/E, using the integer CPU only. Thus one sees that the structure of the LASS processor allows it to emulate the IBM 370 efficiently. Emulation is possible because both the 370/168 and the 168/E have the same number of working registers, can perform the same arithmetic and logic operations, and have the same form of data memory addressing and branching.

| Program Step      | Code for 370/168   | Action of code for 168/E |
|-------------------|--------------------|--------------------------|
| COMpare Xi and Xp | LOOP CE 0,ED(9,10) | Slice: Reg. 9 + Reg. 10 → Slice-out<br>Memory: ED + Slice-out → MAR, and<br>F.P.P.: Data Memory → Working Register<br>F.P.P.: compare Reg. 0 with Working Register → C.C. Register |
| BRANCH LOW        | BL GOTTE           | Branch: If less than 0, GOTE → P.C. |
| DECrement i       | SR 9,1             | Slice: Reg. 9 - Reg. 1 → Reg. 9, and<br>Cond. Code → C.C. Register |
| BRANCH Greater    | BNM LOOP           | Branch: If greater than 0, LOOP → P.C. |

TABLE II.

Comparison of Program Code Generated for 370/168 and 168/E
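The cycle counts described for Table II can be tallied against the 150 nanosecond cycle time quoted in section 4.4. The sketch below is a bookkeeping exercise only; the dictionary keys and the helper `effective_address` are hypothetical names, and the register values in the usage line are arbitrary.

```python
CYCLE_NS = 150  # 168/E cycle time quoted in section 4.4

# 168/E micro-instruction cycles per emulated 370 instruction, as described
# in the text accompanying Table II.
cycles = {
    "COMpare Xi and Xp": 3,
    "BRANCH LOW": 1,
    "DECrement i": 1,
    "BRANCH Greater": 1,
}

def effective_address(index_reg, base_reg, displacement):
    """Address formation for the COMPARE, split over the first two cycles."""
    slice_out = index_reg + base_reg             # cycle 1: sum formed by the slices
    return slice_out + (displacement & 0xFFF)    # cycle 2: add the 12-bit displacement

compare_ns = cycles["COMpare Xi and Xp"] * CYCLE_NS
loop_ns = sum(cycles.values()) * CYCLE_NS

print(f"COMPARE: {compare_ns} ns, whole loop: {loop_ns} ns")
print(hex(effective_address(index_reg=8, base_reg=0x1000, displacement=0x10)))
```

Six cycles at 150 nanoseconds give 0.90 microseconds for the loop and 0.45 microseconds for the COMPARE, in agreement with the 168/E column of Table I.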

A translator has been written which takes object code produced by the IBM FORTRAN H Optimizing Compiler and converts it to relocatable program and data modules in 168/E format. An important aspect of this process is the splitting of program instructions and local data constants and variables into separate areas, ready for loading into the 168/E's separate program and data memories. When possible, advantage is taken of the 168/E's direct program memory addressing scheme by changing IBM BRANCH instructions from their displacement plus base-register addressing format to absolute 168/E program memory addressing. The execution time saved by direct addressing applies also to the leading portion of the local constants and variables in data memory, where fixed base registers may be dispensed with entirely.

After translation a linker program is used. It functions exactly like the IBM Linkage Editor, taking all the 168/E object modules needed to compose one complete load module and linking them together. Address constants and direct addresses are filled in or adjusted at this time. By means of control statements, the user may, if he wishes, position common blocks, local data constants, and program code in specific locations, according to a predefined plan.

Not all of the 370 instruction set can be emulated by the 168/E, but all those instructions needed for track reconstruction have been emulated. In fact, the IBM FORTRAN compiler requires about the same subset of 370 instructions as those implemented in the 168/E. By implementing only this subset of the 370 instructions, one has reduced the cost and complexity of this processor while increasing its speed. The goal of this project is to build fast programmable processors for physics applications and not to build a general purpose computer with the entire instruction set of the IBM 370.

## 4.3 WHY EMULATE?

One could have built a processor with its own unique instruction set, tailored to one's needs. Instead, the 168/E is based on the architecture of the 370 for several important reasons. First of all, the writing and debugging of programs for the 168/E can be done on the 370/168 at the SLAC computer center. Once a program is running on real or simulated data, it is easily translated to the instruction set of the 168/E. Secondly, programmers who are familiar with the 370 do not require any special understanding of the 168/E processor in order to produce fast and efficient code. Even the experimenters can write programs for the 168/E because of its FORTRAN capability. In addition, the programs do not need to be debugged on a hardware box with limited I/O capabilities.

## 4.4 SPEED AND COST

Emulation of the 370 has greatly reduced the burden of programming the LASS processors. The question remaining, however, is how much emulation of the 370 has cost us in execution speed and in the dollar cost of the processors. The cycle time of the 168/E is 150 nanoseconds, which is slower than the 370/168, but, as Table I shows, the execution time of the 168/E is actually competitive with the 370/168. Fast execution on the 168/E for the DO-LOOP shown comes mainly from the fact that BRANCH instructions can be done in one cycle. The 370/168 operates in a multi-programming environment, so that it spends many cycles calculating the absolute address of the BRANCH. The basic DO-LOOP of Table I is biased in favor of the 168/E because of its BRANCH instructions, so comparisons have also been made with complete programs. Programs were written on the 370/168 using the FORTRAN H (OPT=2) compiler. With real data the execution time for the program on the 168/E is no worse than a factor of 2 slower.

Most of the cost of the 168/E is in the program and data memories for the processor. The cost of 16K bytes of data memory is about \$1000, while 8K bytes (equivalent) of program memory cost around 5780 at current memory prices. The integer and floating-point processing units cost ... and \$1000, respectively. These prices include components, circuit boards, and power supplies, but exclude labor for assembly. Thus a complete 168/E with both processors and 96K bytes of data memory (roughly the largest amount needed for LASS experiments) would cost about \$10,000. The important point is that the speed-cost ratio of the 168/E is sufficiently high that a multi-processor system that meets the needs of LASS is economically feasible.

## 5. SUMMARY.

The computer support for the data analysis in a high data rate physics experiment is becoming a familiar problem. The fact that the analysis task can be broken down into many simple sub-tasks has led many experimenters to think about using hardware processors. The processors should have, however, both high execution speed and programmability. Hardwired processors can be extremely fast but take a considerable effort to design and maintain. Commercially available programmable processors are either too slow or too costly to meet our needs, even in a multi-processor system.

The hardware processing system in LASS will be based on an array of fast programmable processors. Each processor has an execution speed comparable to that of an IBM 370/168 and emulates a subset of the 370 machine instructions. The programs for the processors can thus be written and debugged on an IBM 370 before translation and loading into the hardware processors. The task of programming the processors appears to be no more difficult than that of programming a large computer, since programs can be written in FORTRAN. Only a small number of processors are needed to meet the goal of executing at speeds an order of magnitude greater than a large computer such as the IBM 370/168.

# Acknowledgements

We would like to thank D.W.G.S. Leith for his support and encouragement.

# BIBLIOGRAPHY

- Michelini, A., *Int'l. Conf. on Instrumentation for High Energy Physics*, Frascati, Italy, May 1973.
- Armstrong, G., et al., *IEEE Trans. Nucl. Sci.* NS-20, No. 1, February 1973.
- Dhawan, S., et al., *Report of the NIM/CAMAC Committee on Data Rate Requirements for Physics Applications*, November 1975 (unpublished).
