
Showing papers on "Pipeline (computing)" published in 1994


Journal ArticleDOI
TL;DR: The CFPP architecture and a proposal for an asynchronous implementation are presented; the architecture seeks geometric regularity in processor chip layout, purely local control to avoid the performance limitations of complex global pipeline stall signals, and simplicity that might lead to provably correct processor designs.
Abstract: The counterflow pipeline processor architecture (CFPP) is a proposal for a family of microarchitectures for RISC processors. The architecture derives its name from its fundamental feature, namely that instructions and results flow in opposite directions within a pipeline and interact as they pass. The architecture seeks geometric regularity in processor chip layout, purely local control to avoid performance limitations of complex global pipeline stall signals, and simplicity that might lead to provably correct processor designs. Moreover, CFPP designs allow asynchronous implementations, in contrast to conventional pipeline designs where the synchronization required for operand forwarding makes asynchronous designs unattractive. This paper presents the CFPP architecture and a proposal for an asynchronous implementation. Detailed performance simulations of a complete processor design are not yet available.
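
A toy Python model makes the counterflow principle concrete: instructions move up the pipeline while results move down, and an instruction picks up a matching operand from any result it meets or crosses. This is an illustrative sketch of the idea only; the stage structure and field names are assumptions, not the paper's design.

```python
from dataclasses import dataclass, field

@dataclass
class Instr:
    dst: str
    srcs: dict = field(default_factory=dict)  # source reg -> value (None = pending)

@dataclass
class Result:
    reg: str
    value: int

def counterflow_step(instr_pipe, result_pipe):
    """One cycle: instructions flow up, results flow down, interacting as they pass."""
    n = len(instr_pipe)
    for i, instr in enumerate(instr_pipe):
        if instr is None:
            continue
        # The instruction meets a result in its own stage, or crosses one moving
        # the opposite way; either way it picks up a matching pending operand.
        for j in (i, i + 1):
            res = result_pipe[j] if j < n else None
            if res and res.reg in instr.srcs and instr.srcs[res.reg] is None:
                instr.srcs[res.reg] = res.value
    # Shift: instructions toward higher stages, results toward lower ones.
    return [None] + instr_pipe[:-1], result_pipe[1:] + [None]

pipe_i = [Instr("r3", {"r1": None}), None, None]
pipe_r = [None, None, Result("r1", 42)]
for _ in range(2):
    pipe_i, pipe_r = counterflow_step(pipe_i, pipe_r)
print(pipe_i[2].srcs)  # {'r1': 42} -- operand picked up as the two streams met
```

Note that each stage consults only its own occupants and its neighbor's, which is exactly the purely local control the abstract describes.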

180 citations


Patent
21 Oct 1994
TL;DR: In this paper, an external command mode for directly accessing the execution unit, responsive to externally generated commands and instructions, is presented, where the user can examine and modify registers, memory, and I/O space without otherwise affecting their contents.
Abstract: A microprocessor is disclosed herein having an external command mode for directly accessing the execution unit, responsive to externally generated commands and instructions. An external instruction path is provided, as well as a conventional processor-driven instruction path. A multiplexer is provided that selects which of the instruction paths is actually supplied to the execution unit. Using the external command mode, the user can examine and modify registers, memory, and I/O space without otherwise affecting their contents. Any instruction executable by the execution unit is executable in the external command mode. Because direct access is provided into the execution unit, there is no implicit updating that would otherwise affect the state of the processor and require saving to an alternate memory. The present invention is implemented with a conventional test access port designed in accordance with the IEEE 1149.1 boundary scan standard, modified to include an instruction register, a data register, and control logic. The external command mode is applicable to single and multiple pipeline processors. The circuit described herein includes several selectors for selecting between the probe mode and the processor-driven mode of operation, including an external pin, an external command, and a debug exception. For ascertaining whether the circuit is in the external command mode, an acknowledge pin is provided to indicate when the execution unit is ready to accept an instruction in the probe mode.

173 citations


Journal ArticleDOI
01 Dec 1994
TL;DR: The implementation of a 200 MHz 13.3 mm² 8×8 2-D DCT macrocell capable of HDTV rates, based on a direct realization of the DCT and using distributed arithmetic, is presented.
Abstract: The two-dimensional discrete cosine transform (2-D DCT) has been widely recognized as a key processing unit for image data compression/decompression. In this paper, the implementation of a 200 MHz 13.3 mm² 8×8 2-D DCT macrocell capable of HDTV rates, based on a direct realization of the DCT and using distributed arithmetic, is presented. The macrocell, fabricated using 0.8 μm base-rule CMOS technology and 0.5 μm MOSFETs, performs the DCT processing with a one-sample-(pixel)-per-clock throughput. The fast speed and small area are achieved by a novel sense-amplifying pipeline flip-flop (SA-F/F) circuit technique in combination with nMOS differential logic. The SA-F/F, a class of delay flip-flops, can be used as a differential synchronous sense amplifier, and can amplify dual-rail inputs with swings lower than 100 mV. A 1.6 ns 20-bit carry-skip adder used in the DCT macrocell, designed by the same scheme, is also described. The adder is 50% faster and 30% smaller than a conventional CMOS carry-lookahead adder, which reduces the macrocell size by 15% compared to a conventional CMOS implementation.
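
Distributed arithmetic, the technique behind the macrocell, replaces multipliers with a precomputed lookup table addressed by one bit-slice of all inputs per cycle. Below is a minimal Python sketch of the idea, assuming unsigned inputs for brevity; a real DCT datapath would use two's complement, with the sign bit-slice subtracted rather than added.

```python
def da_rom(coeffs):
    """ROM[j] = sum of the fixed coefficients selected by the bits of index j."""
    n = len(coeffs)
    return [sum(c for k, c in enumerate(coeffs) if (j >> k) & 1)
            for j in range(1 << n)]

def da_inner_product(coeffs, xs, nbits):
    """Bit-serial inner product sum(c_k * x_k) via ROM lookups and shift-adds."""
    rom = da_rom(coeffs)
    acc = 0
    for b in range(nbits - 1, -1, -1):  # MSB first: shift-and-accumulate
        addr = sum(((x >> b) & 1) << k for k, x in enumerate(xs))
        acc = (acc << 1) + rom[addr]
    return acc

coeffs, xs = [3, 5, 7, 2], [9, 4, 1, 6]
assert da_inner_product(coeffs, xs, nbits=4) == sum(c * x for c, x in zip(coeffs, xs))
```

The only per-cycle hardware is a ROM access and an add, which is what makes a one-sample-per-clock pipelined DCT practical.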

156 citations


Proceedings ArticleDOI
10 Oct 1994
TL;DR: A behavioral synthesis method targeting low power consumption for data-dominated CMOS circuits is presented, considering loops, conditional branches, and scheduling constructs such as multicycling, chaining, and structural pipelining.
Abstract: We present a behavioral synthesis method targeting low power consumption for data-dominated CMOS circuits. A study of how the high-level synthesis process affects power consumption is presented, based on which we have developed the first allocation method for low power. We also present a method of optimizing the controller to reduce data path power dissipation. We consider loops, conditional branches, and scheduling constructs such as multicycling, chaining, and structural pipelining. The techniques were implemented within the framework of an existing behavioral synthesis system. Experiments performed on various examples and benchmarks show that low power circuits can be synthesized by our method with very low or zero overheads.

151 citations


Patent
Prathima Agrawal1, Soumitra Bose1
19 Dec 1994
TL;DR: In this article, test vectors for a circuit containing both logic gates and memory blocks are evaluated by applying candidate test vectors to good and faulty versions of the circuit in a computer simulation.
Abstract: Test vectors for a circuit containing both logic gates and memory blocks are evaluated by applying candidate test vectors to good and faulty versions of the circuit in a computer simulation. The functions of the gates and interconnections in the circuit are stored in memory and the operation of the good and faulty circuits is simulated concurrently. During the simulation, a memory record is created for storing the state of a circuit element in a faulty circuit if the fault is visible at the element. Such records are removed when no longer needed, which speeds up the simulation. A multiprocessor in a pipeline configuration is disclosed for performing the simulation. A first branch in the pipeline simulates the logic gates in the circuit; a second branch simulates the memory blocks.
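
The record-keeping idea can be illustrated with a small sketch: per-fault state is stored only for elements where the fault is visible, and the record disappears once good and faulty values agree. The netlist format and names below are illustrative assumptions, not the patent's implementation.

```python
def simulate(gates, inputs, fault=None):
    """Evaluate a netlist given as {net: (op, in1, in2)} in topological order."""
    values = dict(inputs)
    for net, (op, a, b) in gates.items():  # dicts preserve insertion order
        v = (values[a] & values[b]) if op == "AND" else (values[a] | values[b])
        if fault is not None and net == fault[0]:
            v = fault[1]  # stuck-at fault overrides the computed value
        values[net] = v
    return values

def concurrent_records(gates, inputs, faults):
    """Keep a record per fault only for nets where the fault is visible."""
    good = simulate(gates, inputs)
    records = {}
    for f in faults:
        diff = {net: v for net, v in simulate(gates, inputs, f).items()
                if v != good[net]}
        if diff:  # drop the record entirely when the fault is invisible
            records[f] = diff
    return records

gates = {"n1": ("AND", "a", "b"), "n2": ("OR", "n1", "c")}
print(concurrent_records(gates, {"a": 1, "b": 1, "c": 0},
                         faults=[("n1", 0), ("n1", 1)]))
# {('n1', 0): {'n1': 0, 'n2': 0}} -- the stuck-at-1 fault produced no record
```

Discarding invisible-fault records is what keeps memory bounded and, as the abstract notes, speeds up the simulation.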

136 citations


Patent
Siamak Arya1, Howard G. Sachs1
27 Oct 1994
TL;DR: In this paper, a computing system is described in which groups of individual instructions are executed in parallel by processing pipelines, and instructions to be executed by different pipelines are supplied to the pipelines simultaneously.
Abstract: A computing system is described in which groups of individual instructions are executable in parallel by processing pipelines, and instructions to be executed in parallel by different pipelines are supplied to the pipelines simultaneously. During compilation of the instructions, those which can be executed in parallel are identified. The system includes a register for storing an arbitrary number of the instructions to be executed. The instructions to be executed are tagged with pipeline identification tags and group identification tags indicative of the pipeline to which they should be dispatched and the group of instructions which may be dispatched during the same operation. The pipeline and group identification tags are used to dispatch the appropriate groups of instructions simultaneously to the differing pipelines.
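
A minimal sketch of tag-driven dispatch, assuming the compile-time tagging has already happened: instructions sharing a group tag issue in the same cycle, each routed to the pipeline its pipeline tag names. The tuple format and pipeline names are assumptions for illustration.

```python
from itertools import groupby

def dispatch(tagged_instrs):
    """tagged_instrs: list of (group_tag, pipeline_tag, instr) in program order."""
    for group, members in groupby(tagged_instrs, key=lambda t: t[0]):
        # All members of one group leave the instruction register together.
        issue = {pipe: instr for _, pipe, instr in members}
        print(f"cycle for group {group}: {issue}")

dispatch([(0, "ALU0", "add r1,r2,r3"), (0, "ALU1", "sub r4,r5,r6"),
          (1, "MEM",  "load r7,[r1]")])
# cycle for group 0: {'ALU0': 'add r1,r2,r3', 'ALU1': 'sub r4,r5,r6'}
# cycle for group 1: {'MEM': 'load r7,[r1]'}
```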

129 citations


Patent
23 Dec 1994
TL;DR: In this paper, a data stream processing unit comprises a CPU which comprises an ALU, a shift/extract unit, timers, a scheduler, an event system, a plurality of sets of general purpose registers, masquerade registers, a pipeline controller, a memory controller, and a pair of internal buses.

Abstract: A data stream processing unit comprises a CPU which comprises an ALU, a shift/extract unit, timers, a scheduler, an event system, a plurality of sets of general purpose registers, a plurality of sets of special purpose registers, masquerade registers, a pipeline controller, a memory controller, and a pair of internal buses. The multiple sets of general and special purpose registers improve the speed of the CPU in switching between environments. The pipeline controller, the scheduler, the event system, and the masquerade registers facilitate the implementation and execution of the methods of the present invention, such as efficient thread scheduling, branch delay handling, and elimination of delay slots after stores, which provide further increases in performance and bandwidth.

119 citations


Proceedings ArticleDOI
01 Apr 1994
TL;DR: Trace-driven simulation shows that the proposed design, which uses fewer resources, offers better performance than previously proposed alternatives for most programs; the paper also indicates how to further improve this design.

Abstract: Accurate branch prediction is critical to performance; mispredicted branches mean that tens of cycles may be wasted in superscalar architectures. Architectures combining very effective branch prediction mechanisms with modified branch target buffers (BTBs) have been proposed for wide-issue processors. These mechanisms require considerable processor resources. Concurrently, the larger address space of 64-bit architectures introduces new obstacles and opportunities. A larger address space means branch target buffers become more expensive. In this paper, we show how a combination of less expensive mechanisms can achieve better performance than BTBs. This combination relies on a number of design choices described in the paper. We used trace-driven simulation to show that our proposed design, which uses fewer resources, offers better performance than previously proposed alternatives for most programs, and we indicate how to further improve this design.

117 citations


Patent
Roger W. Swanson1
31 Mar 1994
TL;DR: In this article, context switching for each of the pipelined processing circuits within a pipeline is accomplished without flushing data from the pipeline, by sending commands and data together through the pipeline and differentiating them with a flag added to each word.
Abstract: A pipelined processing system in which context switching for each of the pipelined processing circuits within the pipeline may be accomplished without flushing the data from the pipeline. This is accomplished by sending the pipeline commands and data together through the pipeline and differentiating the commands from the data using a flag added to the commands and data which specifies whether the associated data word is a command or data. During operation of the pipeline, when the input data is received by one of the pipelined processing circuits in the pipeline, the flag is checked to see if the associated data word includes a command. If the associated data word includes data to be processed, it is processed in accordance with the current configuration of the pipeline. However, if the associated data word includes a command for setup and control and the like, each pipelined processing circuit within the pipeline compares its identification value with a tag field in the command to determine whether it is to be reconfigured by that command. If it is to be reconfigured by that command, the appropriate context switching and the like takes place. However, if the current pipelined processing circuit is not to be reconfigured by that command, that command is passed through the current pipelined processing circuit unprocessed so that a similar determination may be made by the next pipelined processing circuit in the pipeline. As a result, setup and control commands for the pipelined processing circuits may be passed through the data processing pipeline along with the data in the desired processing order such that a pipeline data flush is not necessary between reconfigurations of the pipelined processing circuits. Since the pipeline need not be flushed when processes are changed, processing efficiency and throughput are substantially improved.
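
The flag-and-tag mechanism can be sketched in a few lines: every word carries a command/data flag, and a command carries a tag that each stage compares with its own ID before either reconfiguring itself or passing the command along unprocessed. Field names and the toy "add a constant" configuration are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class Word:
    is_command: bool
    payload: int        # data value, or new configuration for a stage
    tag: int = -1       # target stage ID when is_command is True

class Stage:
    def __init__(self, stage_id):
        self.stage_id, self.config = stage_id, 0

    def process(self, word):
        """Return the word to pass downstream, or None if consumed here."""
        if word.is_command:
            if word.tag == self.stage_id:
                self.config = word.payload  # context switch, no pipeline flush
                return None                 # command consumed at its target
            return word                     # not ours: pass through unprocessed
        return Word(False, word.payload + self.config)  # data: apply config

stages = [Stage(0), Stage(1)]
stream = [Word(True, 10, tag=1), Word(False, 5), Word(True, 3, tag=0), Word(False, 5)]
for w in stream:
    for s in stages:
        w = s.process(w)
        if w is None:
            break
    if w is not None:
        print("out:", w.payload)   # out: 15, then out: 18
```

Because commands travel in order with the data, the second data word is processed under the new stage-0 configuration without any flush, which is the throughput win the abstract claims.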

110 citations


Proceedings ArticleDOI
01 May 1994
TL;DR: In this article, a 10-bit 20-MS/s pipeline A/D converter implemented in 1.2 μm CMOS technology achieves a power dissipation of 35 mW at full-speed operation.

Abstract: This paper describes a 10-bit 20-MS/s pipeline A/D converter implemented in 1.2 μm CMOS technology which achieves a power dissipation of 35 mW at full-speed operation. Circuit techniques used to achieve this level of power dissipation include operation on a 3.3 V power supply, optimum scaling of capacitor values through the pipeline, and digital correction to allow the use of dynamic comparators. Measured performance includes 0.6 LSB of INL and 59.1 dB of SNDR for a 100 kHz input at 20 MS/s. At Nyquist sampling (10 MHz input), SNDR is 55.0 dB.

94 citations


01 Jan 1994
TL;DR: The technique can sustain maximum communication bandwidth while achieving an arbitrarily low, non-zero probability of synchronization failure, P_f, with the price in both latency and chip area being O(log 1/P_f).

Abstract: Pipeline synchronization is a simple, low-cost, high-bandwidth, high-reliability solution to interfaces between synchronous and asynchronous systems, or between synchronous systems operating from different clocks. The technique can sustain maximum communication bandwidth while achieving an arbitrarily low, non-zero probability of synchronization failure, P_f, with the price in both latency and chip area being O(log 1/P_f). Pipeline synchronization has been successfully applied to high-performance inter-computer communication in multicomputers and local-area networks.
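
The O(log 1/P_f) price follows from the standard synchronizer reliability argument; this is a generic derivation, not necessarily the paper's exact notation. A metastable latch with resolution time constant τ fails to settle within time t with probability

```latex
P_f \approx e^{-t/\tau}
\qquad\Longrightarrow\qquad
t \approx \tau \ln\frac{1}{P_f}
```

so the settling time, and with it the latency and area of the pipeline-stage cascade that provides it, grows only logarithmically in 1/P_f.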

Patent
01 Mar 1994
TL;DR: In this paper, an instruction sequencer issues into the pipeline an instruction that computes the register value minus a pre-determined number of iterations, followed by that number of iterations; the instruction later returns with the calculated count.
Abstract: In a pipelined processor, an apparatus for handling string operations. When a string operation is received by the processor, the length of the string as specified by the programmer is stored in a register. Next, an instruction sequencer issues an instruction that computes the register value minus a pre-determined number of iterations to be issued into the pipeline. Following the instruction, the pre-determined number of iterations are issued to the pipeline. When the instruction returns with the calculated number, the instruction sequencer then knows exactly how many iterations should be executed. Any extra iterations that had initially been issued are canceled by the execution unit, and additional iterations are issued as necessary. A loop counter in the instruction sequencer is used to track the number of iterations.
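
Once the issued instruction returns with length minus the speculative count, the resolution step is simple arithmetic. A sketch, where the speculative count of 4 is an assumed value:

```python
SPECULATIVE_ITERS = 4  # pre-determined iteration count issued up front (assumed)

def resolve_string_op(length):
    """Decide what to do once the pipeline returns length - SPECULATIVE_ITERS."""
    remaining = length - SPECULATIVE_ITERS
    if remaining < 0:
        return f"cancel {-remaining} of the {SPECULATIVE_ITERS} issued iterations"
    return f"issue {remaining} additional iterations"

print(resolve_string_op(2))  # cancel 2 of the 4 issued iterations
print(resolve_string_op(7))  # issue 3 additional iterations
```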

Patent
31 Aug 1994
TL;DR: In this paper, a target finder array in the instruction cache contains a lower portion of the target address and a block encoding indicating whether the target address is within the same 2K-byte block as the branch instruction, or in the next or previous 2K-byte block.
Abstract: A target finder array in the instruction cache contains a lower portion of the target address and a block encoding indicating if the target address is within the same 2K-byte block that the branch instruction is in, or if the target address is in the next or previous 2K-byte block. The upper portion of the target address, its block number, which corresponds to the starting address of a 2K block, is generated from the target finder simply by taking the upper portion or block number of the branch instruction and incrementing and decrementing it, and using the block encoding in the finder to select either the unmodified block number of the branch instruction, or the incremented or decremented block number of the branch instruction. The lower portion of the target address that was stored in the finder is concatenated with the selected block number to get the predicted target address. The target address can be predicted in parallel with reading an instruction out of the cache, making the target available at the same time the branch instruction is available, eliminating pipeline stalls for correctly predicted branches. The initially predicted target address in the finder is generated by a quick decode of the instruction and is written when the cache is loaded from memory. The initial prediction does not have to be accurate because branch resolution logic will update the finder on each branch resolution. Register indirect branches and exceptions may also be predicted. Two instruction sets may be accommodated by different block encodings to indicate the instruction set. By using the block encoding, the finder array is small and inexpensive.
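
A sketch of the target reconstruction under stated assumptions: 2K-byte blocks imply 11 low-order address bits, and a 2-bit block encoding distinguishes same/next/previous block. The encoding values below are illustrative; the patent only requires that the three cases be distinguishable.

```python
BLOCK_BITS = 11          # 2K-byte block => 11 low-order address bits
SAME, NEXT, PREV = 0, 1, 2

def predict_target(branch_addr, finder_low, block_enc):
    """Concatenate the selected block number with the stored lower portion."""
    block = branch_addr >> BLOCK_BITS
    block = {SAME: block, NEXT: block + 1, PREV: block - 1}[block_enc]
    return (block << BLOCK_BITS) | finder_low

# Branch at 0x1FFE jumping just past its 2K block boundary:
print(hex(predict_target(0x1FFE, finder_low=0x004, block_enc=NEXT)))  # 0x2004
```

Only an increment, a decrement, and a 3-way select stand between the finder entry and the predicted target, which is why the prediction can run in parallel with the cache read.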

Journal ArticleDOI
TL;DR: In this article, the authors present a new mitigation design approach that not only reduces AC voltages effectively and economically but also provides cathodic protection for the protected pipeline; its performance is illustrated with computer simulations, which show how important an accurate electrical model of the soil structure is in any interference study.

Abstract: In joint-use corridors where both pipelines and AC electric transmission lines are present, a portion of the energy contained in the electromagnetic field surrounding the electric transmission lines is captured by each pipeline, resulting in induced AC voltages which vary in magnitude throughout the length of each pipeline. During a fault on any of the transmission lines, energization of the earth by supporting structures near the fault can result in large voltages appearing locally between the earth and the steel wall of any nearby pipeline. Some form of mitigation is usually required to reduce these voltages to acceptable levels for the protection of personnel and of the pipeline itself. This paper presents a new mitigation design approach which not only reduces AC voltages effectively and economically, but also provides cathodic protection for the protected pipeline. Performance of this new mitigation method is illustrated with results from computer simulations, which show how important it is to have an accurate electrical model of the soil structure in any interference study. Results from large-scale mitigation design studies performed for ANR Pipeline Company and other gas transmission companies are presented.

Patent
22 Feb 1994
TL;DR: In this article, a method for improving CGSI imaging system throughput by selectively decimating rows and/or columns of object image data prior to warp transformation in order to more closely approach a 1:1 compression ratio is presented.
Abstract: A method for improving CGSI imaging system throughput by selectively decimating rows and/or columns of object image data prior to warp transformation, in order to more closely approach a 1:1 compression ratio. The best results are obtained when decimation is performed on compressed object image data prior to decompression.
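
One plausible reading of the decimation step as code, with the 2:1 threshold and the nested-list image format assumed for illustration: if the warp will shrink an axis by half or more, every other row or column is dropped first, so the warp itself operates near a 1:1 ratio.

```python
def decimate(image, scale_y, scale_x):
    """image: list of rows; scale < 1.0 means the warp shrinks that axis."""
    step_y = 2 if scale_y <= 0.5 else 1   # drop every other row if shrinking >= 2:1
    step_x = 2 if scale_x <= 0.5 else 1   # likewise for columns
    return [row[::step_x] for row in image[::step_y]]

img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(decimate(img, scale_y=0.4, scale_x=0.9))
# [[1, 2, 3, 4], [9, 10, 11, 12]] -- rows halved, columns untouched
```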

Journal ArticleDOI
16 Feb 1994
TL;DR: A video DSP with a macroblock-level pipeline and a SIMD-type vector-pipeline architecture (VDSP2) has been developed, using 0.5 μm triple-layer-metal CMOS technology.

Abstract: A video DSP with a macroblock-level pipeline and a SIMD-type vector-pipeline architecture (VDSP2) has been developed, using 0.5 μm triple-layer-metal CMOS technology. This 17.00 mm × 15.00 mm chip consists of 2.5 M transistors and operates at 100 MHz. The real-time encoder and decoder specified in the MPEG-2 main profile at main level can be realized with two VDSP2s and a motion estimation (ME) unit, and with one VDSP2, respectively, at an 80 MHz clock rate, with a total power dissipation of 4.2 W at 3.3 V.

Patent
Joerg Schepers1
01 Mar 1994
TL;DR: In this article, instructions whose predecessor instructions have already been processed are selected and examined to determine whether a minimum number of delay cycles must elapse before their execution; a heuristic selection process then chooses among them.

Abstract: To execute a program rapidly on superscalar microprocessors, the individual instructions of the program must be divided into instruction groups that can be processed in parallel by the processing units of the microprocessor. In doing so, data-flow dependences, control-flow dependences, and pipeline conflicts must be taken into account. For this purpose, the first step is to select the instructions whose predecessor instructions have already been processed and to determine for each whether a minimum number of delay cycles must elapse before its execution; the instructions are stored in a list together with this minimum number. From these instructions, one is selected using a heuristic selection process and placed into an instruction group in which it can be processed in the earliest possible execution cycle.
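
The selection loop resembles classic list scheduling. The sketch below picks one instruction per cycle for brevity (the patent packs several into a group), and the most-successors-first heuristic is an assumed stand-in for the patent's unspecified heuristic.

```python
def schedule(deps, delays, succ_count):
    """deps: instr -> set of predecessors; delays: instr -> min delay cycles."""
    done, cycle_of, cycle = set(), {}, 0
    while len(done) < len(deps):
        # Ready: every predecessor already scheduled.
        ready = [i for i in deps if i not in done and deps[i] <= done]
        # Legal: the required delay after each predecessor has elapsed.
        legal = [i for i in ready
                 if all(cycle_of[p] + delays[i] <= cycle for p in deps[i])]
        if legal:
            pick = max(legal, key=lambda i: succ_count[i])  # heuristic choice
            cycle_of[pick] = cycle
            done.add(pick)
        cycle += 1
    return cycle_of

deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(schedule(deps, delays={"a": 0, "b": 1, "c": 1, "d": 1},
               succ_count={"a": 2, "b": 1, "c": 1, "d": 0}))
# {'a': 0, 'b': 1, 'c': 2, 'd': 3}
```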

Patent
04 Mar 1994
TL;DR: In this article, a single line buffer in a motion video card is used for both vertical reduction of the pixel image before storage in a video memory buffer and vertical expansion after being outputted by the video buffer.
Abstract: A single line buffer in a motion video card is used for both vertical reduction of the pixel image before storage in a video memory buffer and vertical expansion of the pixel image after being outputted by the video memory buffer. When the desired display size is smaller than the original pixel image size, then the line buffer is used by the input pipeline to reduce the image. When the desired display size is larger than the original pixel image size (or larger than the image stored in the memory buffer), then the line buffer is used by the output pipeline to enlarge the image.

Patent
11 Jan 1994
TL;DR: In this paper, an address pipeline is provided to hold the addresses of the instructions presently in the instruction pipeline, which facilitates tracing only executed instructions, and permits stopping the data processor during a branch delay slot without losing the branch information.
Abstract: In a pipelined data processor (11), an address pipeline (39, 41) is provided to hold the addresses of the instructions presently in the instruction pipeline (23, 25). The address pipeline facilitates tracing only executed instructions, and permits stopping the data processor during a branch delay slot without losing the branch information.

Journal ArticleDOI
TL;DR: A general-purpose fuzzy processor is presented, the core of which is based on an analog-numerical approach combining the inherent advantages of analog and digital implementations, above all as regards noise margins.

Abstract: In this paper we present a design for a general-purpose fuzzy processor, the core of which is based on an analog-numerical approach combining the inherent advantages of analog and digital implementations, above all as regards noise margins. The architectural model proposed was chosen so as to obtain a processor capable of working with a considerable degree of parallelism. The internal structure of the processor is organized as a cascade of pipeline stages which perform parallel execution of the processes into which each inference can be decomposed. A particular feature of the project is the definition of a 'fuzzy gate', which executes elementary fuzzy computations and on which construction of the whole core of the processor is based. Designed in CMOS technology, the core can be integrated into a single chip and can easily be extended. The obtainable performance, on the order of 50 million fuzzy rules per second, is considerable.

Patent
08 Mar 1994
TL;DR: A pipelined memory (20) as mentioned in this paper includes output registers (34) and output enable registers (48) which are used to electrically switch between the asynchronous operating mode and the synchronous operating mode.
Abstract: A pipelined memory (20) has a synchronous operating mode and an asynchronous operating mode. The memory (20) includes output registers (34) and output enable registers (48) which are used to electrically switch between the asynchronous operating mode and the synchronous operating mode. In addition, in the synchronous operating mode, the depth of pipelining can be changed between a three stage pipeline and a two stage pipeline. By changing the depth of pipelining, the memory (20) can operate using a greater range of clock frequencies. In addition, the operating frequency can be changed to facilitate testing and debugging of the memory (20).

Journal ArticleDOI
TL;DR: A timing constraint formulation for the correct clocking of wave-pipelined systems and implications and motivations for the use of accurate delay models and exact timing analysis in the determination of combinational logic delays are given.
Abstract: Wave-pipelining is a timing methodology used in digital systems to achieve maximal rate operation. Using this technique, new data are applied to the inputs of a combinational block before the previous outputs are available, thus effectively pipelining the combinational logic and maximizing the utilization of the logic without inserting registers. This paper presents a timing constraint formulation for the correct clocking of wave-pipelined systems. Both single- and multiple-stage systems, including feedback, are considered. Based on the formulation of this paper, several important new results are presented relating to performance limits of wave-pipelined circuits. These results include the specification of distinct and disjoint regions of valid operation dependent on the clock period, intentional clock skew, and the global clock latency. Implications and motivations for the use of accurate delay models and exact timing analysis in the determination of combinational logic delays are also given, and an analogous relationship between the multi-stage system and the single-stage system in terms of performance limits is shown. The minimum clock period is obtained by clock skew optimization formulated as a linear program. In addition, important special cases are examined and their relative performance limits are analyzed.
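
A simplified single-stage form of such constraints can be stated directly (this omits the paper's skew, latency, and feedback terms). With maximum and minimum combinational delays D_max and D_min, register setup and hold times t_su and t_h, and the output sampled L cycles after launch, the sampling edge must land after the slowest wave settles but before the fastest part of the next wave arrives:

```latex
D_{\max} + t_{su} \;\le\; L\,T_{clk} \;\le\; T_{clk} + D_{\min} - t_{h}
\quad\Longrightarrow\quad
T_{clk} \;\ge\; D_{\max} - D_{\min} + t_{su} + t_{h}
```

The clock period is thus bounded by the delay spread rather than the total delay, which is precisely why the paper stresses accurate delay models and exact timing analysis.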

Patent
Son Dao-Trong1, Juergen Haas1, Rolf Mueller1
20 Jul 1994
TL;DR: A pipelined floating point processor is proposed in which the addition pipeline is reorganized so that no wait cycle is needed when the addition uses the result of an immediately preceding multiplication (fast multiply-add instruction).
Abstract: A pipelined floating point processor in which the addition pipeline is reorganized so that no wait cycle is needed when the addition uses the result of an immediately preceding multiplication (fast multiply-add instruction). The reorganization implies the following changes to the existing data flow of the pipelined floating point processor: feed-back of normalized data from the multiplier M into the aligners AL1 and AL2 via path ND; a shift-left-one-digit feature on both sides of the data path to account for a possible leading zero digit of the product, together with special zeroing of potential guard digits by Z1 and Z2; and an exponent extended to 9 bits for overflow and underflow recognition, with the exponent result reset to zero on the fly by a true zero unit (T/C) in case of underflow.

Journal ArticleDOI
01 Mar 1994
TL;DR: Simulation studies show that application of pipelining techniques can provide an effective throughput of one 32-bit addition every 1.6 ns using minimal hardware.
Abstract: Negative differential resistance characteristics of several new quantum electronic devices have been used to design high-speed logic gates with the latching property. These latching gates form the basis of the ultrafast pipelined adder circuit described in this paper. The latching or memory feature of these circuits, which was previously considered to be a nuisance in the design of combinational circuits, is exploited to overcome the pipeline overheads of area and time. Simulation studies show that application of pipelining techniques can provide an effective throughput of one 32-bit addition every 1.6 ns using minimal hardware.

Patent
Alain Artieri1
24 May 1994
TL;DR: In this article, a system that processes compressed data arriving in packets corresponding to picture blocks, the packets being separated by headers containing decoding parameters of the packets, is described, where a memory bus is controlled by a memory controller to exchange data between the processing elements and a picture memory.
Abstract: A system that processes compressed data arriving in packets corresponding to picture blocks, the packets being separated by headers containing decoding parameters of the packets. A memory bus is controlled by a memory controller to exchange data between the processing elements and a picture memory. A pipeline circuit contains a plurality of processing elements. A parameter bus provides packets to be processed to the pipeline circuit, as well as the decoding parameters to elements of the system. The parameter bus is controlled by a variable length decoder that receives the compressed data from the memory bus and that extracts the packets and the decoding parameters therefrom.

Patent
28 Sep 1994
TL;DR: A pipeline processing device that enables optimized pipeline processing with a simple configuration and control method is presented, together with a clipping processing device that uses it.
Abstract: An objective of this invention is to provide a pipeline processing device that enables the implementation of optimized pipeline processing while having a simple configuration and control method, and a clipping processing device that uses this pipeline processing device. Data is sequentially transferred to pipeline register sections (500 to 506), but only when there is processing data in each previous stage, and the given data processing is performed in data processing sections (520 to 524). After the end of input of processing data in which a plurality of data items is formed into one string D[0:3], this data is automatically extracted from the pipeline register sections (500 to 506). These transfer and automatic-extraction operations in the pipeline control sections (530 to 536) are controlled by an LD signal, which is formed from ENIN and FLASHIN signals.

Journal ArticleDOI
TL;DR: A parallelizing compiler that, given a sequential program and a memory layout of its data, performs process decomposition while balancing parallelism against locality of reference and several message optimizations that address the issues of overhead and synchronization in message transmission.
Abstract: The lack of high-level languages and good compilers for parallel machines hinders their widespread acceptance and use. Programmers must address issues such as process decomposition, synchronization, and load balancing. We have developed a parallelizing compiler that, given a sequential program and a memory layout of its data, performs process decomposition while balancing parallelism against locality of reference. A process decomposition is obtained by specializing the program for each processor to the data that resides on that processor. If this analysis fails, the compiler falls back to a simple but inefficient scheme called run-time resolution. Each process's role in the computation is determined by examining the data required for execution at run-time. Thus, our approach to process decomposition is data-driven rather than program-driven. We discuss several message optimizations that address the issues of overhead and synchronization in message transmission. Accumulation reorganizes the computation of a commutative and associative operator to reduce message traffic. Pipelining sends a value as close to its computation as possible to increase parallelism. Vectorization of messages combines messages with the same source and the same destination to reduce overhead. Our results from experiments in parallelizing SIMPLE, a large hydrodynamics benchmark, for the Intel iPSC/2, show a speedup within 60% to 70% of handwritten code.
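
Two of the message optimizations are easy to sketch; the tuple format below is an assumption for illustration. Vectorization merges messages sharing a source and a destination into one larger message; accumulation applies a commutative, associative operator locally so that only one value per destination is sent.

```python
from collections import defaultdict

def vectorize(messages):
    """messages: list of (src, dst, value) -> one combined message per (src, dst)."""
    combined = defaultdict(list)
    for src, dst, value in messages:
        combined[(src, dst)].append(value)
    return [(src, dst, vals) for (src, dst), vals in combined.items()]

def accumulate(messages, op):
    """Reduce all values per destination locally; send one value per dst."""
    acc = {}
    for _, dst, value in messages:
        acc[dst] = op(acc[dst], value) if dst in acc else value
    return acc

msgs = [(0, 1, 5), (0, 1, 7), (0, 2, 3)]
print(vectorize(msgs))                           # [(0, 1, [5, 7]), (0, 2, [3])]
print(accumulate(msgs, op=lambda a, b: a + b))   # {1: 12, 2: 3}
```

Both trade per-message overhead for a little local computation, which is where the reported gains over naive message passing come from.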


Patent
30 Mar 1994
TL;DR: In this article, a processing apparatus which adaptively performs image compensation and encoding/expansion and decoding processing such as discrete cosine transformation (DCT), inner product computation, image data addition, and image data differential processing, etc.
Abstract: A processing apparatus which adaptively performs image compensation and encoding/expansion and decoding processing, such as discrete cosine transformation (DCT)/inverse discrete cosine transformation (IDCT), inner-product computation, image data addition, and image data differential processing, for blocks of image data of size m×n. It is provided with (a) a plurality of parallel processing units 1 to 4, each of which performs addition, subtraction, various types of logical computation, magnitude comparison, computation of absolute values of differences, butterfly addition/subtraction, multiplication, and accumulation; (b) mutually connected pipeline memories 5 to 7, which are disposed so as to connect adjoining processing units; and (c) data selectors 41 to 44, which selectively apply input data to the processing units 1 to 4. Adjoining processing units are coupled via the mutually connected pipeline memories, and an internal pipeline memory in each processing unit is selected to constitute a predetermined data flow path, thereby performing DCT or other desired video signal processing.

Patent
19 Oct 1994
TL;DR: In this article, a processor has an execution pipeline, a register file and a controller, and the controller makes the first result stored in the register file available in the event that the first results are needed for the execution of a subsequent instruction.
Abstract: A processor method and apparatus. The processor has an execution pipeline, a register file and a controller. The execution pipeline is for executing an instruction and has a first stage for generating a first result and a last stage for generating a final result. The register file is for storing the first result and the final result. The controller makes the first result stored in the register file available in the event that the first result is needed for the execution of a subsequent instruction. By storing the result of the first stage in the register file, the length of the execution pipeline is reduced from that of the prior art. Furthermore, logic required for providing inputs to the execution pipeline is greatly simplified over that required by the prior art.