Showing papers on "Pipeline (computing) published in 1983"


Journal ArticleDOI
TL;DR: The basic problem of reorganization of machine-level instructions at compile time is shown to be NP-complete; a heuristic algorithm is proposed, and its properties and effectiveness are explored.
Abstract: Pipeline interlocks are used in a pipelined architecture to prevent the execution of a machine instruction before its operands are available. An alternative to this complex piece of hardware is to rearrange the instructions at compile time to avoid pipeline interlocks. This problem is called code reorganization and is studied. The basic problem of reorganization of machine-level instructions at compile time is shown to be NP-complete. A heuristic algorithm is proposed, and its properties and effectiveness are explored. Empirical data from MIPS, a VLSI processor design, are given. The impact of code reorganization techniques on the rest of a compiler system is discussed. 30 references.

250 citations
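
To make the approach concrete, here is a minimal sketch of compile-time code reorganization: a greedy list scheduler that issues the earliest ready instruction in program order and falls back to a NOP when an interlock cannot be avoided. The instruction format, the one-cycle load delay, and the neglect of anti-dependences are illustrative assumptions, not details taken from the paper.

```python
# A hedged sketch of code reorganization: reorder instructions so that no
# instruction issues before its operands have cleared the pipeline delay.
# Anti-dependences (WAR/WAW) are ignored for brevity.

LOAD_DELAY = 1  # cycles a consumer must wait after its producer issues

def reorganize(instrs):
    """instrs: list of (text, dests, srcs) tuples in program order."""
    scheduled, issue_cycle = [], {}
    remaining = list(instrs)
    cycle = 0
    while remaining:
        picked = None
        for ins in remaining:  # earliest ready instruction, program order
            text, dests, srcs = ins
            if all(issue_cycle.get(r, -LOAD_DELAY - 1) + LOAD_DELAY < cycle
                   for r in srcs):
                picked = ins
                break
        if picked is None:
            scheduled.append(("NOP", (), ()))  # unavoidable interlock
        else:
            remaining.remove(picked)
            scheduled.append(picked)
            for r in picked[1]:
                issue_cycle[r] = cycle
        cycle += 1
    return scheduled

prog = [("load r1,A", ("r1",), ()),
        ("load r2,B", ("r2",), ()),
        ("add r3,r1,r2", ("r3",), ("r1", "r2")),
        ("load r4,C", ("r4",), ())]
for ins in reorganize(prog):
    print(ins[0])   # the load of C is moved into the add's delay slot
```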


Patent
Tangu Hao Shii, Huei Ling, Howard E. Sachar, Jeffrey Weiss, Yannis John Yamour
14 Mar 1983
TL;DR: In this paper, a secondary data flow facility with additional capability, to emulate the simultaneous processing of the prerequisite instruction and the dependent instruction, is proposed to improve simultaneous pipeline processing of inherently sequential instructions (k)-at-a-time, by eliminating delays for calculating prerequisite operands.
Abstract: Equipping a secondary data flow facility with additional capability, to emulate for certain operations the simultaneous processing of the prerequisite instruction and the dependent instruction, significantly improves simultaneous pipeline processing of inherently sequential instructions (k)-at-a-time, by eliminating delays for calculating prerequisite operands. For example, Instruction A+B=Z1 followed by Instruction Z1+C=Z2 is inherently sequential, with A+B=Z1 the prerequisite instruction and Z1+C=Z2 the dependent instruction. The specially equipped secondary data flow facility does not wait for Z1, the apparent input operand from the prerequisite instruction; it simulates Z1 instead, performing A+B+C=Z2 in parallel with A+B=Z1. All data flow facilities need not be fully equipped for all instructions; the secondary data flow facility may be generally less massive than a primary data flow facility, but is more sophisticated in a critical organ, such as the adder. The three-input adder of the secondary data flow facility emulates the result of a two-input adder of a primary data flow facility, occurring simultaneously in the two-input primary data flow facility adder, adding the third operand to the emulated result, without delay. The instruction unit decodes the instruction sequence normally to control (k)-at-a-time execution where there are no instruction interlocks or dependencies; to delay execution of dependent instructions until operands become available; and to reinstate (k)-at-a-time execution in a limited number of cases by using the additional capability of the secondary data flow facility to emulate the prerequisite operands. A control unit performs housekeeping to execute the instructions.

137 citations
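
A tiny sketch of the central trick may help: the secondary unit's three-input adder computes the dependent result directly from the primary instruction's source operands, so it never waits for Z1. The function names and single-cycle framing are illustrative, not the patent's circuitry.

```python
# Hedged model of collapsing a dependent add: while the primary two-input
# adder produces Z1 = A + B, the secondary three-input adder produces
# Z2 = A + B + C in the same cycle, emulating Z1 + C without waiting.

def primary_add(a, b):
    return a + b            # primary data flow facility: two-input adder

def secondary_add3(a, b, c):
    return a + b + c        # secondary facility: three-input adder

A, B, C = 5, 7, 11
Z1 = primary_add(A, B)            # prerequisite instruction
Z2 = secondary_add3(A, B, C)      # dependent instruction, issued in parallel
assert Z2 == Z1 + C               # matches the sequential result
print(Z1, Z2)                     # 12 23
```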


01 Nov 1983
TL;DR: The authors address two important issues in systolic array designs: fault-tolerance and two-level pipelining, and show that both problems can be reduced to the same mathematical problem of incorporating extra delays on certain data paths in originally correct systolic designs.
Abstract: The authors address two important issues in systolic array designs: fault-tolerance and two-level pipelining. The proposed systolic fault-tolerant scheme maintains the original data flow pattern by bypassing defective cells with a few registers. As a result, many of the desirable properties of systolic arrays (such as local and regular communication between cells) are preserved. Two-level pipelining refers to the use of pipelined functional units in the implementation of systolic cells. Their paper addresses the problem of efficiently utilizing pipelined units to increase the overall system throughput. They show that both of these problems can be reduced to the same mathematical problem of incorporating extra delays on certain data paths in originally correct systolic designs. They introduce the mathematical notion of a cut which enables them to handle this problem effectively. The results obtained by applying the techniques described are encouraging. When applied to systolic arrays without feedback cycles, the arrays can tolerate large numbers of failures (with the addition of very little hardware) while maintaining the original throughput. Furthermore, all of the pipeline stages in the cells can be kept fully utilized through the addition of a small number of delay registers. However, adding delays to systolic arrays with cycles typically induces a significant decrease in throughput. In response to this, they have derived a new class of systolic algorithms in which the data cycle around a ring of processing cells. The systolic ring architecture has the property that its performance degrades gracefully as cells fail. Using the cut theory for arrays without feedback and the ring architecture approach for those with feedback, they have effective fault-tolerant and two-level pipelining schemes for most systolic arrays. 24 references.

114 citations
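
A minimal functional model of the bypass scheme, under stated assumptions: the array has spare cells, the weights of a multiply-accumulate array are mapped onto the healthy cells, and each defective cell simply passes its accumulating value through a register. Timing is ignored; the cell function and spare-cell layout are illustrative, not taken from the paper.

```python
# Hedged sketch: a defective cell is bypassed by a register, and its work
# is remapped to a spare cell, so the array's results are unchanged.

def run_array(xs, weights, n_cells, defective=frozenset()):
    healthy = [i for i in range(n_cells) if i not in defective]
    assert len(healthy) >= len(weights), "not enough working cells"
    w_of = dict(zip(healthy, weights))  # weights mapped onto healthy cells
    results = []
    for x in xs:
        acc = 0
        for i in range(n_cells):
            if i in w_of:
                acc += w_of[i] * x      # working cell: multiply-accumulate
            # defective or unused cell: acc just crosses a bypass register
        results.append(acc)
    return results

print(run_array([1, 2, 3], [10, 20, 30], n_cells=4))                 # healthy
print(run_array([1, 2, 3], [10, 20, 30], n_cells=4, defective={1}))  # same
```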


Patent
27 May 1983
TL;DR: In this paper, the address generator, the pipeline control sequencer, and the master processing unit are configured in parallel, and a sign latch micro-instruction control is operative to provide the arithmetic and logical unit with a data dependent decision making capability.
Abstract: A full floating point vector processor includes a master processing unit having DMA I/O means, a wide bandwidth data memory having static RAM and/or interleaved dynamic RAM, an address generator operative to provide address generation for data loaded in the data memory, a concurrently operating pipeline control sequencer operative to provide fully programmable horizontal format microinstructions synchronously with the addresses generated by the address generator, and a pipelined arithmetic and logical unit responsive to the addressed data and to the synchronously provided microinstructions and operative to evaluate one of a user selectable plurality of computationally intensive functions. The address generator, the pipeline control sequencer, and the master processing unit are configured in parallel. The address generator includes means operative to provide pipeline input and output data dependent address generation. The microinstruction controlled pipelined arithmetic and logical unit includes two register files controllably interconnectable over feedforward and feedback data flow paths, a user selectable fixed or floating point format multiplier, a user selectable fixed or floating point format arithmetic and logical unit, and a sign latch coupled between the arithmetic and logical unit and one of the register files. The sign latch microinstruction control is operative to provide the arithmetic and logical unit with a data dependent decision making capability. A microinstruction controlled write address FIFO and a read address FIFO are coupled to the data memory.

85 citations


Proceedings ArticleDOI
13 Jun 1983
TL;DR: Although no direct comparisons are made with other computers, the low pipeline idle time in this machine indicates that this architectural technique may be more beneficial in an MIMD machine than in either SISD or SIMD machines.
Abstract: A pipelined implementation of MIMD operation is embodied in the HEP computer. This architectural concept should be carefully evaluated now that such a computer is available commercially. This paper studies the degree of utilization of pipelines in the MIMD environment. A detailed analysis of two extreme cases indicates that pipeline utilization is quite high. Although no direct comparisons are made with other computers, the low pipeline idle time in this machine indicates that this architectural technique may be more beneficial in an MIMD machine than in either SISD or SIMD machines.

61 citations


Journal ArticleDOI
TL;DR: A number of concepts related to computer studies of transient flow in pipeline systems are addressed in this paper, including organizational concepts for system data handling, and ideas to realize certain storage and computational efficiencies when using the method of characteristics as the computational procedure.
Abstract: A number of concepts related to computer studies of transient flow in pipeline systems are addressed. The topics are directed to applications on microcomputers, but are not limited to that special purpose. The topics include organizational concepts for system data handling, and ideas to realize certain storage and computational efficiencies when using the method of characteristics as the computational procedure. Alternatives to the method of specified time intervals, namely a staggered grid and an algebraic treatment, are discussed. A simple improved modification in the friction term is also emphasized. Two common elements in hydraulic systems that influence the form of pressure waves, series connections and lossy elements, are focused upon with a view to providing an improved visualization of their response characteristics. Equations and graphs are presented for cases with and without initial through flow. Finally, an example of the failure of a physical system is presented to emphasize the importance of unsteady flow visualization in hydraulic system design.

43 citations
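
Since the paper's computational procedure is the method of characteristics, a small worked sketch may be useful. It simulates water hammer after instantaneous valve closure at the downstream end of a single reservoir-fed pipe; the pipe data, grid size, and boundary conditions are illustrative assumptions, not the paper's examples.

```python
# Hedged sketch of the method of characteristics for transient pipe flow.
import math

a, g = 1000.0, 9.81           # wave speed (m/s), gravity (m/s^2)
L, D, f = 1000.0, 0.5, 0.02   # length (m), diameter (m), friction factor
A = math.pi * D**2 / 4        # flow area
N = 10                        # reaches
dx = L / N
dt = dx / a                   # Courant condition: dt = dx / a
B = a / (g * A)               # characteristic impedance
R = f * dx / (2 * g * D * A**2)

H0, Q0 = 100.0, 0.2           # reservoir head (m), initial flow (m^3/s)
H = [H0 - R * Q0 * abs(Q0) * i for i in range(N + 1)]  # steady state
Q = [Q0] * (N + 1)

for step in range(50):        # valve at x = L closes at t = 0
    Hn, Qn = H[:], Q[:]
    for i in range(1, N):     # interior points: intersect C+ and C-
        Cp = Hn[i-1] + B * Qn[i-1] - R * Qn[i-1] * abs(Qn[i-1])
        Cm = Hn[i+1] - B * Qn[i+1] + R * Qn[i+1] * abs(Qn[i+1])
        Q[i] = (Cp - Cm) / (2 * B)
        H[i] = (Cp + Cm) / 2
    # upstream reservoir: head fixed, flow from the C- characteristic
    Cm = Hn[1] - B * Qn[1] + R * Qn[1] * abs(Qn[1])
    H[0], Q[0] = H0, (H0 - Cm) / B
    # downstream closed valve: flow zero, head from the C+ characteristic
    Cp = Hn[N-1] + B * Qn[N-1] - R * Qn[N-1] * abs(Qn[N-1])
    Q[N], H[N] = 0.0, Cp

print(f"head at the valve after {50 * dt:.1f} s: {H[N]:.1f} m")
```

The staggered-grid and algebraic alternatives the paper discusses change how these characteristic equations are organized, not the equations themselves.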


PatentDOI
TL;DR: The PLA contains microcode which enables programs in external memory to be loaded from any location in memory and run by command from the console, and which enables the operator to halt user program execution, read the pertinent internal registers, and then continue program execution, so that single-step execution for debugging purposes is possible.

42 citations


Patent
28 Jun 1983
TL;DR: In this paper, an arithmetic system includes an arithmetic unit of a pipeline structure for executing arithmetic operations for instructions which require different arithmetic cycles, and the arithmetic unit executes arithmetic operations in pipeline for up to N instructions at a time.
Abstract: An arithmetic system includes an arithmetic unit of a pipeline structure for executing arithmetic operations for instructions which require different arithmetic cycles. The arithmetic unit executes arithmetic operations in pipeline for up to N instructions at a time. Initiation of an arithmetic operation for a new instruction in the arithmetic unit is indicated by an indicator which detects that each of the instructions executing in the arithmetic unit is N cycles before completion of its execution, and allows the arithmetic operation for the new instruction to be initiated in the succeeding cycle.

35 citations


Proceedings Article
01 Jan 1983
TL;DR: The authors show that, for certain programs in the VAL language, it is possible to construct machine-level data flow programs that support fully pipelined computation.
Abstract: Data flow computers are a radical departure from conventional computer architecture, and new methodologies are required for generating efficient machine-level programs from high level user programming languages. The authors show that, for certain programs in the VAL language, it is possible to construct machine-level data flow programs that support fully pipelined computation. A VAL program in the class considered consists of blocks of code each of which defines a new array value either by a forall expression in which each element may be computed independently, or by a for-iter expression that defines array elements by a first-order recurrence relation. 7 references.

35 citations
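
The two block forms the paper considers can be modeled directly, with illustrative function bodies: a forall block whose elements are mutually independent (and hence fully pipelinable on a data flow machine), and a for-iter block defined by a first-order recurrence.

```python
# Hedged sketch of the two VAL code-block forms, in Python for brevity.

def forall_block(n):
    # forall: each element depends only on its index, so all n element
    # computations can proceed through the pipeline independently
    return [i * i for i in range(n)]

def for_iter_block(n, a, b):
    # for-iter: first-order recurrence x[i] = a * x[i-1] + b[i]; elements
    # are ordered, but the recurrence still admits pipelined evaluation
    x = [0.0] * n
    for i in range(1, n):
        x[i] = a * x[i - 1] + b[i]
    return x

print(forall_block(5))                          # [0, 1, 4, 9, 16]
print(for_iter_block(5, 0.5, [1, 2, 3, 4, 5]))  # [0.0, 2.0, 4.0, 6.0, 8.0]
```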


Proceedings ArticleDOI
13 Jun 1983
TL;DR: The experimental system adopts a low-key technology and yet is capable of executing about 0.7 million instructions per second through the benchmarks, implying that data flow computers can be an alternative to conventional von Neumann computers if state-of-the-art technologies are adequately introduced.
Abstract: This paper describes an architecture of a data flow computer named the Distributed Data Driven Processor (DDDP), and presents an experimental system and the results of experiments using several benchmarks. The experimental system has four processing elements connected by a ring bus, and a structured data memory. The main features of our system are that each processing element is provided with a hardware hashing mechanism to implement token coloring, and a ring bus is used to pass tokens concurrently among processing elements. A hardware monitor was used to measure the performance of the experimental system. The experimental system adopts a low-key technology and yet is capable of executing about 0.7 million instructions per second through the benchmarks. This implies that data flow computers can be an alternative to conventional von Neumann computers if state-of-the-art technologies are adequately introduced.

33 citations
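
A small sketch of the token-matching mechanism each DDDP processing element implements in hardware: tokens carrying the same color and destination are paired in a hash-addressed matching store, so separate loop instances can share one instruction without confusion. The token fields and two-operand assumption are illustrative.

```python
# Hedged model of token coloring with a hashed matching store.
matching_store = {}   # keyed by (destination instruction, color)

def arrive(token):
    """token = (dest, color, port, value); fires dest once both operand
    tokens with the same color have arrived."""
    dest, color, port, value = token
    key = (dest, color)                        # the hardware hashes this key
    waiting = matching_store.pop(key, None)
    if waiting is None:
        matching_store[key] = (port, value)    # wait for the partner token
        return None
    ops = {port: value, waiting[0]: waiting[1]}
    return (dest, color, ops[0], ops[1])       # matched pair: node fires

print(arrive(("add", 1, 0, 10)))   # None: waits for its partner
print(arrive(("add", 2, 0, 30)))   # None: different color, also waits
print(arrive(("add", 1, 1, 5)))    # ('add', 1, 10, 5): instance 1 fires
print(arrive(("add", 2, 1, 7)))    # ('add', 2, 30, 7): instance 2 fires
```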


Patent
Nukiyama Tomoji
31 May 1983
TL;DR: In this article, a pipeline bus serially connects pipeline stages such that input data supplied through an input unit can be serially transported through the several pipeline stages and finally to an output unit.
Abstract: A pipeline processing apparatus has a plurality of pipeline stages, each stage including a pipeline latch and a pipeline processing circuit. A pipeline bus serially connects the several pipeline stages such that input data supplied through an input unit can be serially transported through the several pipeline stages and finally to an output unit. To facilitate testing the pipeline processing apparatus, and specifically the individual pipeline stages and the data passing through these individual stages independently of the pipeline processing cycle, there is provided a common bus coupled to the input unit, the output unit and selectively to each of the pipeline stages. A designated pipeline stage is selectively coupled to the common bus, causing test data to be supplied to the designated pipeline stage and subsequently read out from the designated stage.
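
A minimal sketch of the testing arrangement, with illustrative stage functions: in normal operation data flows serially through every stage, while the common bus lets test data be driven into any one designated stage and its result read back, independent of the pipeline cycle.

```python
# Hedged model of per-stage testing over a common bus.

class TestablePipeline:
    def __init__(self, stages):
        self.stages = stages              # per-stage processing circuits

    def run(self, x):
        """Normal serial operation over the pipeline bus."""
        for stage in self.stages:
            x = stage(x)
        return x

    def test_stage(self, i, test_data):
        """Common-bus test: inject data into stage i alone, read it out."""
        return self.stages[i](test_data)

p = TestablePipeline([lambda x: x + 1, lambda x: 2 * x, lambda x: x - 3])
print(p.run(5))             # ((5 + 1) * 2) - 3 = 9
print(p.test_stage(1, 7))   # stage 1 in isolation: 14
```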

01 Jan 1983
TL;DR: A new rearrangeability proof for interconnection networks is developed, with the same lower bound hardware requirement as the Benes network but for a general configuration; the proof technique can be applied to any lower bound rearrangeable interconnection network.
Abstract: In this thesis the rearrangeability of interconnection networks and the data movement between the global memory and the processor local memory are studied. A new rearrangeability proof for interconnection networks is developed, with the same lower bound hardware requirement as the Benes network but for a general configuration. This new proof technique is universal, in the sense that it can be applied to any lower bound rearrangeable interconnection network. It is also a constructive proof which yields a control algorithm. Another problem studied is the effect of global delays on system speed, caused by the traffic between local memory and global memory in parallel processor systems. The memory bandwidth, memory conflicts and interconnection conflicts contribute to global delays. A Prefetch/Execute/Poststore pipeline is introduced to reduce the performance degradation due to global delays for innermost vector loops. The analyzing vectorizer PARAFRASE is used to measure the speedup loss on 31 scientific programs with and without the pipeline.
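
The Prefetch/Execute/Poststore idea can be sketched as a software-pipelined vector loop: while block i executes out of local memory, block i+1 is being prefetched from global memory and block i-1's results are being stored back. The block size, the vector operation, and the sequential stand-in for overlapped stages are illustrative assumptions.

```python
# Hedged sketch of a Prefetch/Execute/Poststore pipeline for a vector loop.

def pep_pipeline(global_in, op, block=4):
    blocks = [global_in[i:i + block]
              for i in range(0, len(global_in), block)]
    fetched = iter(blocks)
    prefetched = next(fetched, None)  # prime the pipeline
    local_in = local_out = None       # local-memory buffers
    out = []
    while prefetched is not None or local_in is not None:
        # on real hardware the three stages below overlap in time
        if local_out is not None:
            out.extend(local_out)                       # Poststore block i-1
        local_out = [op(x) for x in local_in] if local_in else None  # Execute
        local_in = prefetched                           # Prefetch block i+1
        prefetched = next(fetched, None)
    if local_out is not None:
        out.extend(local_out)                           # drain the pipeline
    return out

print(pep_pipeline(list(range(10)), lambda x: 2 * x))
```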

Journal ArticleDOI
Yoshimune Hagiwara, Y. Kita, T. Miyamoto, Y. Toba, H. Hara, T. Akazawa
TL;DR: HSP architecture, LSI design, and a speech analysis application are described, which makes it possible to construct a compact speech analysis circuit by the LPC (PARCOR) method with two HSP's.
Abstract: A single chip high-performance digital signal processor (HSP) has been developed for speech, telecommunication, and other applications. The HSP uses 3 μm CMOS technology and its architecture features floating point arithmetic and pipeline structure. By adoption of floating point arithmetic, data covering a wide dynamic range (up to 32 bits) can be manipulated. The input clock frequency is 16 MHz, and the instruction cycle time is 250 ns. Efficient signal processing instructions and a large internal memory (program ROM: 512 words; data RAM: 200 words; data ROM: 128 words) make it possible to construct a compact speech analysis circuit by the LPC (PARCOR) method with two HSP's. This paper describes HSP architecture, LSI design, and a speech analysis application.

Proceedings ArticleDOI
13 Jun 1983
TL;DR: The design philosophy of the data flow processor array system presented in this paper is to achieve high performance by adapting a system structure to operational characteristics of application programs, and also to attain flexibility through executing instructions based on a data driven mechanism.
Abstract: This paper presents the architecture of a highly parallel processor array system which executes programs by means of a data driven control mechanism. The data driven control mechanism makes it easy to construct an MIMD (multiple instruction stream and multiple data stream) system, since it unifies inter-processor data transfer and intra-processor execution control. The design philosophy of the data flow processor array system presented in this paper is to achieve high performance by adapting a system structure to operational characteristics of application programs, and also to attain flexibility through executing instructions based on a data driven mechanism. The operational characteristics of the proposed system are analyzed using a probability model of the system behavior. Comparing the analytical results with the simulation results through an experimental hardware system, the results of the analysis clarify the principal effectiveness of the proposed system. This system can achieve high operation rates and is neither sensitive to inter-processor communication delay nor sensitive to system load imbalance.

Proceedings ArticleDOI
28 Nov 1983
TL;DR: The fault-tolerant scheme proposed maintains the original data flow patterns by simply by-passing defective cells with a small number of registers, preserving the desirable properties of systolic arrays, such as local and regular communication, massive parallelism and high data throughput.
Abstract: This paper addresses two important problems in systolic arrays: fault-tolerance and two-level pipelining. The fault-tolerant scheme we propose maintains the original data flow patterns by simply by-passing defective cells with a small number of registers. As a result, the desirable properties of systolic arrays, such as local and regular communication, massive parallelism and high data throughput, are all preserved. Two-level pipelining refers to the use of pipelined functional units in the implementation of systolic cells. This paper also addresses the problem of efficiently utilizing such units to increase overall system throughput. We show that both of these problems can be reduced to the same mathematical problem of incorporating extra delays on certain data paths in originally correct systolic designs. We introduce the mathematical notion of a cut which enables us to handle this problem systematically. The results obtained by applying the techniques described in this paper are encouraging. When applied to systolic arrays without feedback cycles, the arrays can tolerate large numbers of faults with the addition of very little hardware, while maintaining the original throughput. Furthermore, all of the pipeline stages in the cells can be kept fully utilized through the addition of a small number of delay registers. However, adding delays to systolic arrays with cycles typically induces a significant decrease in throughput. In response to this, we have derived a new class of systolic algorithms in which the data cycle around a ring of processing cells. The systolic ring architecture has the property of degrading gracefully as cells fail. It can be used in place of many systolic arrays with feedback cycles. Using our cut theory for arrays without feedback and the ring architecture approach for arrays with feedback, we have an effective fault-tolerant scheme for every systolic array that we have considered. Furthermore, as by-products of the ring architecture approach we have derived new systolic algorithms. These algorithms generally require only one-third to one-half of the number of cells used in previous designs to achieve the same throughput. Included in these new systolic algorithms are ones for LU-decomposition, QR-decomposition and the solution of triangular linear systems.
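
The cut notion at the heart of both versions of this work can be shown with a two-path example, which is an illustrative network rather than one of the paper's arrays: adding the same delay to every edge crossing a cut of a feedback-free design leaves the computed values intact (merely shifted in time), whereas delaying only some crossing edges misaligns operands.

```python
# Hedged sketch of the cut condition on a two-path synchronous network.

def run(xs, delay_path1, delay_path2):
    """Two paths feed a combining cell; each path is a register chain."""
    line1 = [0] * delay_path1       # registers on path 1
    line2 = [0] * delay_path2       # registers on path 2
    out = []
    for x in xs:
        a, b = 2 * x, 3 * x         # cells on the two paths across the cut
        line1.append(a); a = line1.pop(0)
        line2.append(b); b = line2.pop(0)
        out.append(a + b)           # combining cell needs aligned operands
    return out

xs = [1, 2, 3, 4, 5, 6]
print(run(xs, 0, 0))  # no extra delay: [5, 10, 15, 20, 25, 30]
print(run(xs, 1, 1))  # both crossing edges delayed: same values, shifted
print(run(xs, 1, 0))  # only one edge delayed: operands misaligned
```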

Journal ArticleDOI
TL;DR: A family of VLSI circuits is presented to perform open convolution, i.e., polynomial multiplication, and, depending on the degree of parallelism or pipelining, they range from a compact but slow convolver to a large but very fast convolver.
Abstract: A family of VLSI circuits is presented to perform open convolution, i.e., polynomial multiplication. The circuits are all based on a recursive construction and are therefore particularly well adapted to automated design. All the circuits presented are optimal with respect to the area–time² tradeoff, and, depending on the degree of parallelism or pipelining, they range from a compact but slow convolver to a large but very fast convolver.
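
The recursive construction can be illustrated with a divide-and-conquer polynomial multiplier in software; whether the circuits use this particular (Karatsuba-style) recursion is an assumption, but it shows how a size-n convolver is assembled from half-size convolvers, which is what makes the family amenable to automated, recursive layout.

```python
# Hedged sketch: recursive polynomial multiplication (open convolution).

def poly_mul(p, q):
    """Multiply polynomials given as coefficient lists, lowest degree
    first, by splitting into halves and recursing (Karatsuba)."""
    n = 1
    while n < max(len(p), len(q)):
        n *= 2                              # pad sizes to a power of two
    if n == 1:
        return [p[0] * q[0]]
    p = p + [0] * (n - len(p))
    q = q + [0] * (n - len(q))
    h = n // 2
    p0, p1, q0, q1 = p[:h], p[h:], q[:h], q[h:]
    low = poly_mul(p0, q0)                  # three half-size products
    high = poly_mul(p1, q1)
    mid = poly_mul([a + b for a, b in zip(p0, p1)],
                   [a + b for a, b in zip(q0, q1)])
    mid = [m - l - hh for m, l, hh in zip(mid, low, high)]
    out = [0] * (2 * n - 1)                 # recombine the three products
    for i, c in enumerate(low):
        out[i] += c
    for i, c in enumerate(mid):
        out[i + h] += c
    for i, c in enumerate(high):
        out[i + 2 * h] += c
    return out

# (1 + 2x)(3 + 4x + 5x^2) = 3 + 10x + 13x^2 + 10x^3 (trailing zeros
# come from the power-of-two padding)
print(poly_mul([1, 2], [3, 4, 5]))  # [3, 10, 13, 10, 0, 0, 0]
```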

Patent
28 Feb 1983
TL;DR: In this paper, a method of perforating a main pipeline while in service, and split joints suited to this method, in order to connect a branch pipeline to the main pipeline, is described.
Abstract: The disclosed invention provides a novel method of perforating a main pipeline while in service, and split joints suited to this method, in order to connect a branch pipeline thereto. According to this method, perforating work is carried out from an opposite side across the main pipeline to where the branch pipeline is connected, as distinct from known methods in which perforating work is carried out on the side of the main pipeline to which a branch pipeline is connected.

Patent
29 Sep 1983
TL;DR: In this article, an entry control store in a central processing unit (CPU) is addressed by the next macroinstruction to be executed by the CPU and fetches the microcode for the first line of that macro-instruction.
Abstract: An entry control store in a central processing unit (CPU) is addressed by the next macroinstruction to be executed by the CPU and fetches the microcode for the first line of that macroinstruction. Subsequent lines of microcode for that macroinstruction are fetched from a main control store.
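
A toy model of the two-store arrangement, with hypothetical opcodes and microinstruction strings: the entry control store, addressed directly by the next macroinstruction, supplies the first line of microcode (and the continuation address), and the main control store supplies the rest.

```python
# Hedged sketch of the entry-control-store / main-control-store split.

entry_control_store = {"ADD": ("add-step-0", 0x10),   # first line plus the
                       "MUL": ("mul-step-0", 0x20)}   # continuation address
main_control_store = {0x10: ["add-step-1", "add-step-2"],
                      0x20: ["mul-step-1"]}

def microcode_for(opcode):
    first, cont = entry_control_store[opcode]  # fast first-line fetch
    return [first] + main_control_store[cont]  # remainder from main store

print(microcode_for("ADD"))  # ['add-step-0', 'add-step-1', 'add-step-2']
```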

Proceedings ArticleDOI
20 Jun 1983
TL;DR: Two new vector reduction methods, symmetric and asymmetric, are proposed and analyzed for pipelined processing and compare favorably with the known recursive reduction method in achieving higher pipeline utilization and in eliminating large memory for intermediate results.
Abstract: Vector reduction arithmetic accepts a vector as input and produces a scalar output. This class of vector operations forms the basis of many scientific computations. In a pipelined processor, a feedback loop is required to reduce vectors. Since the output of the pipeline depends on previous outputs, improper control of the feedback loop will destroy the benefit from pipelining. A generalized computing model is proposed to schedule the activities in a vector reduction pipeline. Two new vector reduction methods, symmetric and asymmetric, are proposed and analyzed for pipelined processing. These two methods compare favorably with the known recursive reduction method in achieving higher pipeline utilization and in eliminating large memory for intermediate results. An interleaving method is proposed to reduce multiple vectors to multiple scalars in a single arithmetic pipeline. The pipeline can be fully utilized by interleaved multiple vector processing.
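
The feedback problem and the benefit of keeping several partial results in flight can be sketched with a simple cycle-counting model of a K-stage pipelined adder. This illustrates the general principle behind the paper's methods rather than its specific symmetric and asymmetric schedules; the stage count and cost model are assumptions.

```python
# Hedged sketch: reducing a vector through a K-stage pipelined adder.
K = 4  # adder pipeline depth

def naive_reduce(v):
    """One running sum: every add waits K cycles for the previous result."""
    s, cycles = 0, 0
    for x in v:
        s += x
        cycles += K
    return s, cycles

def interleaved_reduce(v):
    """K independent partial sums keep one add starting every cycle."""
    partial = [0] * K
    cycles = 0
    for i, x in enumerate(v):
        partial[i % K] += x       # consecutive adds hit different sums
        cycles += 1
    s = 0
    for p in partial:             # final K-way combine
        s += p
        cycles += K
    return s, cycles

v = list(range(100))
print(naive_reduce(v))        # (4950, 400)
print(interleaved_reduce(v))  # (4950, 116)
```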

Patent
13 Oct 1983
TL;DR: In this article, a collector for the results of a pipelined central processing unit of a digital data processing system is presented, where results of the execution of each instruction are stored in a result stack 38, 40, 42, 44 associated with each execution unit.
Abstract: A collector for the results of a pipelined central processing unit of a digital data processing system. The processor has a plurality of execution units 24, 26, 28, 30, with each execution unit executing a different set of instructions of the instruction repertoire of the processor. The execution units execute instructions issued to them in order of issuance by the pipeline 12 and in parallel. As instructions are issued to the execution units, the operation code identifying each instruction is also issued in program order to an instruction execution queue 18 of the collector. The results of the execution of each instruction by an execution unit are stored in a result stack 38, 40, 42, 44 associated with each execution unit. Collector control 46 causes the results of the execution of instructions to program visible registers to be stored in a master safe store register 48 in program order which is determined by the order of instructions stored in the instruction execution stack on a first-in, first-out basis. The collector also issues write commands to write results of the execution of instructions into memory in program order via a store stack 50.
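
The collector's ordering discipline can be modeled with queues, since each execution unit completes its own instructions in issue order: the instruction execution queue records program order at issue time, and commit simply drains it, popping each unit's result stack first-in, first-out. Unit names and results are illustrative.

```python
# Hedged sketch of in-order result collection from parallel execution units.
from collections import deque

instruction_queue = deque()             # program-order record of issue
result_stacks = {"alu": deque(), "mul": deque()}

def issue(unit, result):
    instruction_queue.append(unit)      # recorded in program order
    result_stacks[unit].append(result)  # each unit is FIFO within itself

def collect():
    """Commit results to architectural state in program order."""
    committed = []
    while instruction_queue:
        unit = instruction_queue.popleft()
        committed.append(result_stacks[unit].popleft())
    return committed

issue("alu", "I0"); issue("mul", "I1"); issue("alu", "I2")
print(collect())  # ['I0', 'I1', 'I2'] regardless of completion order
```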


Proceedings ArticleDOI
30 Nov 1983
TL;DR: Recent developments in the design of integrated optical circuits (IOCs) for performing optical numerical computations are discussed, along with the natural marriage of IOCs with the systolic concept.
Abstract: The development of integrated optical circuits (IOCs) for numerical-computation applications is reviewed, with a focus on the use of systolic architectures. The basic architecture criteria for optical processors are shown to be the same as those proposed by Kung (1982) for VLSI design, and the advantages of IOCs over bulk techniques are indicated. The operation and fabrication of electrooptic grating structures are outlined, and the application of IOCs of this type to an existing 32-bit, 32-Mbit/sec digital correlator, a proposed matrix multiplier, and a proposed pipeline processor for polynomial evaluation is discussed. The problems arising from the inherent nonlinearity of electrooptic gratings are considered. Diagrams and drawings of the application concepts are provided.

PatentDOI
TL;DR: In this article, a method of detecting inadequately supported sections or overloaded points in a pipeline including the steps of traversing the interior of the pipeline with an instrumentation pig, sequentially striking or vibrating the wall of a pipeline by means carried by the pig to introduce vibratory signals into the pipeline, receiving said signals from within the pipeline by listening to the sounds generated as a consequence of the striking of the interior wall and detecting preselected characteristics of received sound which are indicative of unsupported sections or of points of load and stress concentration in the pipeline.
Abstract: A method of detecting inadequately supported sections or overloaded points in a pipeline including the steps of traversing the interior of the pipeline with an instrumentation pig, sequentially striking or vibrating the wall of the pipeline by means carried by the pig to introduce vibratory signals into the pipeline, receiving said signals from within the pipeline by listening to the sounds generated as a consequence of the striking of the interior wall, and detecting preselected characteristics of received sound which are indicative of unsupported sections or of points of load and stress concentration in the pipeline.

Proceedings ArticleDOI
01 Jan 1983
TL;DR: A realtime system using parallel processing and pipeline techniques to implement the pattern alignment operation is reported; the chip has been fabricated in 5 μm NMOS technology.
Abstract: A realtime system using parallel processing and pipeline techniques to implement the pattern alignment operation is reported. A 3.5 mm × 4.2 mm chip has been fabricated in 5 μm NMOS technology.

Patent
21 Jan 1983
TL;DR: In this paper, a microprogram-controlled type arithmetic control apparatus which performs pipeline processing is provided, in which a microinstruction corresponding to a macro-instruction to be processed is read out of a control storage at least one stage prior to an OF (operand fetch) stage.
Abstract: A microprogram-controlled type arithmetic control apparatus which performs pipeline processing is provided. In the apparatus, a microinstruction corresponding to a macroinstruction to be processed is read out of a control storage at least one stage prior to an OF (operand fetch) stage. The microinstruction and data (source data) are respectively output onto a microinstruction bus and a data bus at the OF stage. By the beginning of an E (instruction execution) stage, the data (source data) on the data bus and the microinstruction on the microinstruction bus are loaded into first and second registers, respectively. Then, at the beginning of the E stage, an arithmetic operation by an arithmetic and logic unit can be performed immediately.

Proceedings ArticleDOI
07 Nov 1983
TL;DR: New paradigms for the construction of efficient parallel graph algorithms, called filtration and funnelled pipelining, are introduced and illustrated with VLSI circuits for computing connected components, minimum spanning forests, and biconnected components.
Abstract: We introduce new paradigms for the construction of efficient parallel graph algorithms. These paradigms, called filtration and funnelled pipelining, are illustrated with VLSI circuits for computing connected components, minimum spanning forests, and biconnected components. These circuits use realistic I/O schedules and require time and area of O(n^(1+ε)). Thus they are essentially optimal. Filtration is a technique used to rapidly discard irrelevant input data. This greatly reduces storage, time, and communications costs in a wide variety of problems. A funnelled pipeline is obtained by building a series of increasingly thorough filter stages. Transition times along such a pipeline of filters form an exponentially increasing sequence. The increasing amount of time exactly balances the increasing degree of filtration. This balance makes possible the cascaded filtration critical to the minimum spanning forest and the biconnected components algorithms.
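
The balance the abstract describes can be demonstrated with a generic funnelled pipeline: each stage discards about half of its input, so a stage that is twice as thorough (twice the per-item cost) still does the same total work as its predecessor. The predicates and halving rate are illustrative assumptions, not the paper's graph circuits.

```python
# Hedged sketch of filtration with exponentially increasing stage costs.

def funnelled_pipeline(items, filters):
    """filters: predicates, cheapest and crudest first."""
    work_per_stage = []
    for depth, keep in enumerate(filters):
        cost_per_item = 2 ** depth          # later stages are more thorough
        work_per_stage.append(len(items) * cost_per_item)
        items = [x for x in items if keep(x)]   # filtration halves the data
    return items, work_per_stage

data = list(range(1024))
stages = [lambda x: x % 2 == 0,     # quick, crude filter
          lambda x: x % 4 == 0,     # slower, more selective
          lambda x: x % 8 == 0]     # most thorough
survivors, work = funnelled_pipeline(data, stages)
print(len(survivors), work)         # 128 [1024, 1024, 1024]: balanced stages
```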

01 May 1983
TL;DR: A technique is described that signals a storage access-exception condition for a data word after an execution unit pipeline has completed processing all preceding elements.
Abstract: A technique is described that signals a storage access-exception condition for a data word after an execution unit pipeline has completed processing all preceding elements.

Journal ArticleDOI
TL;DR: In this paper, a model for the estimation of liquid-hydrogen pressure and temperature along a pipeline equipped with cryogenic insulation is presented together with a numerical simulation scheme and the summary of a sensitivity analysis.

Patent
09 Sep 1983
TL;DR: In this paper, a pipelined parallel vector processor is described, where the vector registers are subdivided into a plurality of smaller registers, and an element processor, functioning in a pipeline mode, is associated with each smaller register for processing the M elements of the vectors stored in the smaller register and generating results of the processing.
Abstract: A pipelined parallel vector processor is disclosed. In order to increase the performance of the parallel vector processor, the present invention decreases the time required to process a pair of vectors stored in a pair of vector registers. The vector registers are subdivided into a plurality of smaller registers. A vector, stored in a vector register, comprises N elements; however, each of the smaller registers stores M elements of the vector, where M is less than N. An element processor, functioning in a pipeline mode, is associated with each smaller register for processing the M elements of the vector stored in the smaller register and generating results of the processing, the results being stored in one of the vector registers. The smaller registers of the vector registers, and their corresponding element processors, are structurally configured in a parallel fashion. The element processors and their associated smaller registers operate simultaneously. Consequently, processing of the N-element vectors stored in the vector registers is completed in the time required to process the M elements of one smaller register.
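
A functional sketch of the partitioning, using threads only as a stand-in for the simultaneous element processors: an N-element vector is split across N/M smaller registers, each served by its own pipelined element processor, so the whole operation takes roughly the time of an M-element one. The operation and sizes are illustrative assumptions.

```python
# Hedged sketch of a partitioned vector register with parallel
# element processors.
from concurrent.futures import ThreadPoolExecutor

N, M = 16, 4   # vector length, elements per smaller register

def element_processor(chunk_a, chunk_b):
    """One element processor handles its M-element slice in pipeline."""
    return [a + b for a, b in zip(chunk_a, chunk_b)]

def vector_add(va, vb):
    chunks = [(va[i:i + M], vb[i:i + M]) for i in range(0, N, M)]
    with ThreadPoolExecutor(max_workers=N // M) as pool:
        results = pool.map(lambda ab: element_processor(*ab), chunks)
    out = []
    for r in results:          # results land back in a vector register
        out.extend(r)
    return out

print(vector_add(list(range(N)), [10] * N))
```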