scispace - formally typeset
Search or ask a question

Showing papers on "Pipeline (computing) published in 1996"


Proceedings ArticleDOI
15 Apr 1996
TL;DR: A new VLSI architecture for a real-time pipeline FFT processor is proposed, derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach, which has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the Radix-2 algorithm.
Abstract: A new VLSI architecture for a real-time pipeline FFT processor is proposed. A hardware-oriented radix-2/sup 2/ algorithm is derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach. The radix-2/sup 2/ algorithm has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-N DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: log/sub 4/N-1 complexity multipliers and N-1 complexity data memory. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.

410 citations


Patent
28 May 1996
TL;DR: The Adaptive Logic Processor (ALP) as mentioned in this paper uses a programmable logic structure called an adaptive logic processor, similar to an extendible field programmable gate array (FPGA) and is optimized for the implementation of program specific pipeline functions.
Abstract: An architecture for information processing devices which allows the construction of low cost, high performance systems for specialized computing applications involving sensor data processing. The reconfigurable processor architecture of the invention uses a programmable logic structure called an Adaptive Logic Processor (ALP). This structure is similar to an extendible field programmable gate array (FPGA) and is optimized for the implementation of program specific pipeline functions, where the function may be changed any number of times during the progress of a computation. A Reconfigurable Pipeline Instruction Control (RPIC) unit is used for loading the pipeline functions into the ALP during the configuration process and coordinating the operations of the ALP with other information processing structures, such as memory, I/O devices, and arithmetic processing units. Multiple components having the reconfigurable architecture of the present invention may be combined to produce high performance parallel processing systems based on the Single Instruction Multiple Data (SIMD) architecture concept.

232 citations


Proceedings ArticleDOI
01 May 1996
TL;DR: This paper introduces a new hybrid branch predictor and shows that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches.
Abstract: Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, including the Two-Level Adaptive Branch Predictor, and more recently, two-component hybrid branch predictors.In a less idealized environment, such as a time-shared system, code of interest involves context switches. Context switches, even at fairly large intervals, can seriously degrade the performance of many of the most accurate branch prediction schemes. In this paper, we introduce a new hybrid branch predictor and show that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches.

180 citations


Proceedings ArticleDOI
30 Oct 1996
TL;DR: The development of a new language was necessary in order to cover the gap between coarse ISA models used in compilers, and instruction set simulators on the one hand, and detailed models used for hardware design on the other.
Abstract: A machine description language is presented The language, LISA, and its generic machine model are able to produce bit- and cycle/phase-accurate processor models covering the specific needs of HW/SW codesign, and cosimulation environments The development of a new language was necessary in order to cover the gap between coarse ISA models used in compilers, and instruction set simulators on the one hand, and detailed models used for hardware design on the other The main part of the paper is devoted to behavioral pipeline modeling The pipeline controller of the generic machine model is represented as an ASAP (as soon as possible) sequencer parameterized by precedence and resource constraints of operations of each instruction The standard pipeline description based on reservation tables and Gantt charts was extended by additional operation descriptors which enable the detection of data and control hazards, and permit modeling of pipeline flushes Using the newly introduced L-charts we reduced the parameterization of the pipeline controller to a minimum and at the same time covered typical pipeline controls found in state of the art signal processors As an example, the application of the LISA model on the TI-TMS320C54x signal processor is presented

151 citations


Journal ArticleDOI
TL;DR: An adaptive pipeline (APL) technique is described, which is a new pipeline scheme capable of compensating for device-parameter deviations and for operating-environment variations, and it is shown that MOS current-mode logic circuits are suitable for a low-noise variable delay circuit.
Abstract: This paper describes an adaptive pipeline (APL) technique, which is a new pipeline scheme capable of compensating for device-parameter deviations and for operating-environment variations. This technique can also compensate for clock skew and eliminate excessive power dissipation in current-mode logic (CML) circuits. The APL technique is here applied to a 0.4-/spl mu/m MOS 1.6-V 1-GHz 64-bit double-stage pipeline adder, and this paper shows that the adder can operate accurately on condition that the clock has 20% skew. The APL technique uses MOS current-mode logic (MCML) circuits, whose propagation delay time can be varied by the control ports. MCML circuits can operate with lower signal voltage swing and higher operating frequency at lower supply voltage than CMOS circuits can. This paper also shows that MCML circuits are suitable for a low-noise variable delay circuit. Measurement results show that jitter of MCML circuits is about 65% that of CMOS circuits.

146 citations


Journal ArticleDOI
J. Christiansen1
TL;DR: In this article, the authors describe the architecture and performance of a new high resolution timing generator used as a building block for time-to-digital converters (TDC) and clock alignment functions.
Abstract: This paper describes the architecture and performance of a new high resolution timing generator used as a building block for time-to-digital converters (TDC) and clock alignment functions. The timing generator is implemented as an array of delay locked loops. This architecture enables a timing generator with subgate delay resolution to be implemented in a standard digital CMOS process. The TDC function is implemented by storing the state of the timing generator signals in an asynchronous pipeline buffer when a hit signal is asserted. The clock alignment function is obtained by selecting one of the timing generator signals as an output clock. The proposed timing generator has been mapped into a 1.0 /spl mu/m CMOS process and an r.m.s. error of the time taps of 48 ps has been measured with a bin size of 0.15 ns. Used as a TDC device, an r.m.s. error of 76 ps has been obtained, A short overview of the basic principles of major TDC and timing generator architectures is given to compare the merits of the proposed scheme to other alternatives.

145 citations


Journal ArticleDOI
TL;DR: Logic embedded memory is an emerging technology that combines high transfer rates and computing power and Texram implements this technology and a new filtering algorithm to achieve high speed, high quality texture mapping.
Abstract: Logic embedded memory is an emerging technology that combines high transfer rates and computing power. Texram implements this technology and a new filtering algorithm to achieve high speed, high quality texture mapping. Integrating arithmetic units and large memory arrays on the same chip and thus exploiting the enormous internal transfer rates provides an elegant solution to the memory access bottleneck of high quality texture mapping. Using this technology, we can not only achieve higher texturing speed at lower system costs, we can also incorporate new functionalities such as detail mapping and footprint assembly to produce higher quality images at real time rendering speeds. Environment and video mapping are also integrated on the Texram, which therefore represents an autonomous and versatile texturing coprocessor. Logic enhanced memories might become the computing paradigm of the future, not just in graphics applications. Technological advances will foster this trend by providing an ever increasing amount of memory capacity and chip space for arithmetic units. As the ultimate solution, we can expect a complete 3D graphics pipeline including all memory systems integrated on a single chip.

131 citations


01 Jan 1996
TL;DR: Analysis indicates that window (wakeup and select) logic and operand bypass logic are likely to be the most critical in the future of superscalar processors.
Abstract: To characterize future performance limitations of superscalar processors, the delays of key pipeline structures in superscalar processors are studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0:8 m, 0:35 m, and 0:18 m. Performance (delay) results and trends are expressed in terms of issue width and window size. This analysis indicates that window (wakeup and select) logic and operand bypass logic are likely to be the most critical in the future.

125 citations


Proceedings Article
01 Jan 1996
TL;DR: In this paper, a new computational model, called a linear array with a reconfigurable pipelined bus system (LARPBS), has been proposed as a feasible and efficient parallel computational model based on current optical technologies.
Abstract: A new computational model, called a linear array with a reconfigurable pipelined bus system (LARPBS), has been proposed as a feasible and efficient parallel computational model based on current optical technologies. In this paper, we further study this model by proposing several basic data movement operations on the model. These operations include broadcast, multicast, compression, split, binary prefix sum, maximum finding. Using these basic operations, several image processing algorithms are also presented for the model. We show that all algorithms can be executed efficiently on the LARPBS model. It is our hope that the LARPBS model can be used as a new and practical parallel computational model for designing parallel algorithms.

119 citations


Proceedings ArticleDOI
01 Sep 1996
TL;DR: This approach overcomes the instruction fetch bottle-neck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle.
Abstract: A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current instruction block is not used for predicting the address of the next instruction block, but rather for predicting the block following the next instruction block.This approach overcomes the instruction fetch bottle-neck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle. Furthermore, pipelining the branch prediction process can also be done by means of our predictor for "speed demon" processors to achieve higher clock rate or to improve the prediction accuracy by means of bigger prediction structures.Moreover, and unlike the previously-proposed multiple predictor schemes, multiple-block ahead branch predictors can use any of the branch prediction schemes to perform the very accurate predictions required to achieve high-performance on superscalar processors.

110 citations


Patent
19 Apr 1996
TL;DR: In this paper, a display FIFO pipeline processes background graphics display data and a separate overlay display pipeline processes overlay display data stored in an off-screen part of a graphics memory.
Abstract: A system and method for processing overlay display data. A display FIFO pipeline processes background graphics display data and a separate overlay FIFO pipeline processes overlay display data stored in an off-screen part of a graphics memory. The overlay FIFO pipeline performs format conversion, interpolation and scaling on the overlay display data and outputs it to an overlay mux. The overlay mux selects between the outputs of the display FIFO pipeline and the overlay FIFO pipeline in the processing of each scan line.

Proceedings ArticleDOI
20 Oct 1996
TL;DR: This paper shows how this technique reduces pattern history table interference for two versions of the 2-level branch predictor and that this significantly improves branch prediction accuracy for the SPEC 95 benchmarks.
Abstract: Today's deeply pipelined, superscalar processors rely on accurate branch prediction in order to approach their performance potential. Branch mispredictions result in a flushing of the speculative information in the pipeline, thus limiting the amount of useful work that can be done. The 2-level branch predictors have been shown to achieve high prediction accuracy. However, it has also been shown that there is a high degree of pattern history table interference in 2-level branch predictors and that the interference generally has a negative effect on the prediction accuracy. This paper introduces a method for reducing the pattern history table interference by dynamically identifying some easily predictable branches and inhibiting the pattern history table update for these branches. We show how this technique reduces pattern history table interference for two versions of the 2-level branch predictor and that this significantly improves branch prediction accuracy for the SPEC 95 benchmarks. In particular, we eliminate up to 30% of the mispredictions for the gcc benchmark.

Patent
16 May 1996
TL;DR: A superscalar microprocessor includes a scheduler which contains storage for information related to operations and scan logic for selecting operations for out-of-order execution by a set of execution units as mentioned in this paper.
Abstract: A superscalar microprocessor includes a scheduler which contains storage for information related to operations and scan logic for selecting operations for out-of-order execution by a set of execution units. To provide fast operation, the selection is made without regard for the availability of operands which are required for execution of the operation but may be unavailable pending completion of an operation. An operand forward stage, which follows the issue stage, selects sources for an operand which may be a register file or a sourcing operation in the scheduler, completed or not. The scheduler contains all information describing the sourcing operations and forwards an operand value and information indicating the state of a sourcing operations. The state information indicates whether the sourcing operation is complete and execution of the issued operation can continue. The state also indicates a wait until the sourcing operation will complete. If the wait is too long, the issued operation is bumped so that another operation can be executed. This reduces pipeline hold ups and increase execution unit utilization.

Proceedings ArticleDOI
18 Mar 1996
TL;DR: This paper presents design and simulation results of two high-performance asynchronous pipeline circuits that uses pseudo-static Svensson-style double edge-triggered D-flip-flops for data storage in place of traditional transmission gate latches or Sutherland's capture-pass latches.
Abstract: This paper presents design and simulation results of two high-performance asynchronous pipeline circuits. The first circuit is a two-phase micropipeline but uses pseudo-static Svensson-style double edge-triggered D-flip-flops (DETDFF) for data storage in place of traditional transmission gate latches or Sutherland's capture-pass latches. The second circuit is a four-phase micropipeline with burst-mode control circuits. We compare our DETDFF and four-phase implementations of a FIFO buffer with the current state-of-the-art micropipeline implementation using four-phase controllers designed by Day and Woods for the AMULET-2 processor. We implemented Day and Woods's design and both of our designs in the MOSIS 1.2 /spl mu/m CMOS process and simulated them with a 4.6 V power supply and at 100/spl deg/C. Our SPICE simulations show that our DETDFF and four-phase designs have 70% and 30% higher throughput respectively than Day and Woods's design. This higher throughput for the DETDFF design is due to latching the data on both edges of the latch control, removing the need of a reset phase and simplifying the control structures. Our four-phase design, on the other hand, has higher throughput because of the simplified control structures and the removal of the latch enable buffers from the critical path. The four-phase design, though not quite as fast as the DETDFF design, requires much smaller area for data storage.

Patent
Yoshiyuki Okada1
27 Dec 1996
TL;DR: In this article, a data compression system uses a pipeline control unit to enable an occurrence frequency modeling unit and entropy coding unit to operate in pipelining for handling data having a fixed-order context.
Abstract: For handling data having a fixed-order context, a data compression system uses a pipeline control unit to enable an occurrence frequency modeling unit and entropy coding unit to operate in pipelining. A data restoration system uses a pipeline control unit to enable an entropy decoding unit and occurrence frequency modeling unit to operate in pipelining. For handling data having a blend context, occurrence frequency modeling units associated with orders are operated in parallel for data compression or data restoration. Furthermore, word data is separated byte by byte, and byte data items are encoded or restored on the basis of the correlation thereof in a word-stream direction.

Patent
29 Jan 1996
TL;DR: In this article, a run-time system collects profile data in response to execution of the native instructions to determine execution characteristics of the non-native instruction and then feeds them to a binary translator operating in a background mode and which is responsive to the profile data generated by the runtime system to form a translated native image.
Abstract: A computer system for executing a binary image conversion system which converts instructions from a instruction set of a first, non native computer system to a second, different, native computer system, includes an run-time system which in response to a non-native image of an application program written for a non-native instruction set provides an native instruction or a native instruction routine. The run-time system collects profile data in response to execution of the native instructions to determine execution characteristics of the non-native instruction. Thereafter, the non-native instructions and the profile statistics are fed to a binary translator operating in a background mode and which is responsive to the profile data generated by the run-time system to form a translated native image. The run-time system and the binary translator are under the control of a server process. The non-native image is executed in two different enviroments with first portion executed as an interpreted image and remaining portions as a translated image. The run-time system includes an interpreter which is capable of handling condition codes corresponding to the non-native architecute. A technique is also provided to jacket calls between the two execution enviroments and to support object based services. Preferred techniques are also provide to determine interprocedural translation units. Further, intermixed translation/optimization techniques are discussed.

Proceedings ArticleDOI
07 Oct 1996
TL;DR: In this article, the focus of the non-restoring is on the "partial remainder", not on "each bit of the square root", with each iteration, and it only requires one traditional adder/subtracter in each iteration.
Abstract: We present a new non-restoring square root algorithm that is very efficient to implement. The new algorithm presented here has the following features unlike other square root algorithms. First, the focus of the "non-restoring" is on the "partial remainder", not on "each bit of the square root", with each iteration. Second, it only requires one traditional adder/subtracter in each iteration, i.e., it does not require other hardware components, such as seed generators, multipliers, or even multiplexors. Third, it generates the correct resulting value even in the last bit position. Next, based on the resulting value of the last bit, a precise remainder can be obtained immediately without any correction or addition operation. And finally, it can be implemented at very fast clock rate because of the very simple operations at each iteration. We illustrate two VLSI implementations of the new algorithm. One is a fully pipelined high-performance implementation that can accept a new square-root instruction each clock cycle with each pipeline stage requiring a minimum number of gate counts. The other is a low-cost implementation that uses only a single adder/subtractor for iterative operation.

Journal ArticleDOI
TL;DR: This paper proposes branch classification, a methodology for building more accurate branch predictors, and an example classification scheme is given and a new hybrid predictor is built based on this scheme which achieves a higher prediction accuracy than any branch predictor previously reported in the literature.
Abstract: There is wide agreement that one of the most significant impediments to the performance of current and future pipelined superscalar processors is the presence of conditional branches in the instruction stream. Speculative execution is one solution to the branch problem, but speculative work is discarded if a branch is mispredicted. For it to be effective, speculative execution requires a very accurate branch predictor; 95% accuracy is not good enough. This paper proposes branch classification, a methodology for building more accurate branch predictors. Branch classification allows an individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor can be constructed such that each component branch predictor predicts those branches for which it is best suited. To demonstrate the usefulness of branch classification, an example classification scheme is given and a new hybrid predictor is built based on this scheme which achieves a higher prediction accuracy than any branch predictor previously reported in the literature.

Patent
15 May 1996
TL;DR: In this paper, a return stack buffer mechanism that uses two separate return stack buffers is described, the speculative return stack and the actual return stack, which are updated using speculatively fetched instructions.
Abstract: A return stack buffer mechanism that uses two separate return stack buffers is disclosed. The first return stack buffer is the Speculative Return Stack Buffer. The Speculative Return Stack Buffer is updated using speculatively fetched instructions. Thus, the Speculative Return Stack Buffer may become corrupted when incorrect instructions are fetched. The second return stack buffer is the Actual Return Stack Buffer. The Actual Return Stack Buffer is updated using information from fully executed branch instructions. When a branch misprediction causes a pipeline flush, the contents of the Actual Return Stack Buffer is copied into the Speculative Return Stack Buffer to correct any corrupted information.

Journal ArticleDOI
TL;DR: An important and counter-intuitive result is shown, which proves that the authors can always obtain full-parallelism for MDFGs with more than one dimension.
Abstract: Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming can not always achieve full parallelism. Other existing optimization techniques for nested loops also can not always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full-parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms.

Proceedings ArticleDOI
01 Jun 1996
TL;DR: A new optimization technique called architectural retiming is presented which is able to improve the performance of many latency-constrained circuits by increasing the number of registers on the latency- Constrained path while preserving the functionality and latency of the circuit.
Abstract: This paper presents a new optimization technique called architectural retiming which is able to improve the performance of many latency-constrained circuits. Architectural retiming achieves this by increasing the number of registers on the latency-constrained path while preserving the functionality and latency of the circuit. This is done using the concept of a negative register, which can be implemented using precomputation and prediction. We use the name architectural retiming since it both reschedules operations in time and modifies the structure of the circuit to preserve its functionality. We illustrate the use of architectural retiming on two realistic examples and present performance improvement results for a number of sample circuits.

Patent
14 May 1996
TL;DR: In this paper, the master cache has a tag pipeline for accessing the tag RAM array, and a data pipeline is optimized for overall data transfer bandwidth, and the tag pipeline and the data pipeline are bound together for retrieving the first sub-line of a new miss from the slave cache.
Abstract: A cache system has a large master cache and smaller slave caches. The slave caches are coupled to the processor's pipelines and are kept small and simple to increase their speed. The master cache is set-associative and performs many of the complex cache management operations for the slave caches, freeing the slaves of these bandwidth-robbing duties. The master cache has a tag pipeline for accessing the tag RAM array, and a data pipeline for accessing the data RAM array. The tag pipeline is optimized for fast access of the tag RAM array, while the data pipeline is optimized for overall data transfer bandwidth. The tag pipeline and the data pipeline are bound together for retrieving the first sub-line of a new miss from the slave cache. Subsequent sub-lines only use the data pipeline, freeing the tag pipeline for other operations. Bus snoops and cache management operations can use just the tag pipeline without impacting data bandwidth. Loop-back flows are performed which cancel an intervening flow in the tag pipeline when the index portions of the addresses match.

Patent
19 Aug 1996
TL;DR: In this paper, an apparatus and method for improving the performance of pipelined computer processors which have segment bits for specifying the operand size, the address size for memory reference, and the stack size is presented.
Abstract: An apparatus and method for improving the performance of pipelined computer processors which have segment bits for specifying the operand size, the address size for memory reference, and the stack size, and which can run self-modifying code. The processor predicts segment bits based on previously used segment bits. Actual segment bits are later determined during execution of an instruction. The predicted segment bits are compared with the actual segment bits, and the pipeline is flushed if they do not match. Also, an instruction verification method is provided to determine if self-modifying code has modified instructions already in the pipeline. Upon execution of a write instruction, each instruction address in the pipeline is compared with the write address. If a match is found, the pipeline is flushed.

Patent
20 Sep 1996
TL;DR: In this article, the authors present a processor architecture in which each processor has its own memory, strategically distributed along the stages of an execution pipeline of the processor, to provide fast access to often used information, such as the contents of the address and data registers, the program counter, etc.
Abstract: A computer system architecture in which each processor has its own memory, strategically distributed along the stages of an execution pipeline of the processor, to provide fast access to often used information, such as the contents of the address and data registers, the program counter, etc. Memory storage is strategically located in close physical proximity to a stage in an execution pipeline at which memory is commonly or repeatedly accessed. Coupled to the pipeline at various stages are small memory cells for storing information that is consistently and repeatedly requested at that stage in the execution pipeline. The speed of the execution pipeline in a processor is critical to overall performance of the processor and the computer architecture of the present invention as a whole. To that end, the clock cycle time at which the pipeline is operated is increased as much as the operating characteristics of the logic and associated circuitry will allow. Generally, access times for memory are slower than the clock cycle times at which the pipeline logic can operate. Thus, there is a point of diminishing return at which increasing the clock cycle time of the pipeline is less advantageous if the pipeline must wait for memory access to complete. Thus, there is provided two sets of strategically located memory cells distributed along the execution pipeline of a processor, and alternately accesses the memory cells.

Journal ArticleDOI
TL;DR: This paper shows that other schemes can be designed, based on the idea of pipelining a serial-input adder or a ripple-carry adder, to obtain pipelined adders for more than two numbers.
Abstract: A well-known scheme for obtaining high throughput adders is a pipeline in which each stage contains an array of half-adders performing a carry-save addition. This paper shows that other schemes can be designed, based on the idea of pipelining a serial-input adder or a ripple-carry adder. Such schemes offer a considerable savings of components while preserving high throughput. These schemes can be generalized by using (p,q) parallel counters to obtain pipelined adders for more than two numbers.

Patent
26 Feb 1996
TL;DR: A real-time pipeline processor based on a hardware oriented radix-22 algorithm derived by integrating a twiddle factor decomposition technique in a divide and conquer approach is presented in this article.
Abstract: A real-time pipeline processor, which is particularly suited for VLSI implementation, is based on a hardware oriented radix-22 algorithm derived by integrating a twiddle factor decomposition technique in a divide and conquer approach. The radix-22 algorithm has the same multiplicative complexity as a radix-4 algorithm, but retains the butterfly structure of a radix-2 algorithm. A single-path delay-feedback architecture is used in order to exploit the spatial regularity in the signal flow graph of the algorithm. For a length-N DFT transform, the hardware requirements of the processor proposed by the present invention is minimal on both dominant components: Log4N-1 complex multipliers, and N-1 complex data memory.

Patent
26 Sep 1996
TL;DR: In this article, the carry-save-add circuit is coupled with a set of carry propagate adder circuits to propagate a carry generated by the carry save-add adder circuit.
Abstract: A pipeline processor having an add circuit configured to execute separate atomic add instructions in consecutive clock cycles, wherein each separate atomic add instructions can be updating the same memory address location. In one embodiment, the add circuit includes a carry-save-add circuit coupled to a set of carry propagate adder circuits. The carry-save-add circuit is configured to perform an add operation in one processor clock cycle and the set of carry propagate adder circuits are configured to propagate, in subsequent clock cycles, a carry generated by the carry-save-add circuit. The add circuit is further configured to feedforward partially propagated sums to the carry-save-add circuit as at least one operand for subsequent atomic add instructions. In one embodiment, the pipeline processor is implemented on a multitasking computer system architecture supporting multiple independent processors dedicated to processing data packets.


Journal ArticleDOI
TL;DR: An efficient means of synchronizing and pipelining asynchronous circuits implemented using differential cascode voltage switch logic (DCVSL) precharged function blocks using a modified version of this logic, based on the use of a private asynchronous standard cell library, fully compatible with an existing CMOS standard cell Library provided by the foundry.
Abstract: This paper describes an efficient means of synchronizing and pipelining asynchronous circuits implemented using differential cascode voltage switch logic (DCVSL) precharged function blocks. A modified version of this logic, called LDCVSL (latch differential cascode voltage switch logic), which is similar to the LCDL (latched CMOS differential logic), or DCVSL with NORA-latch, is used to improve the storage capability of the precharged function blocks. Improving the storage capability of the building blocks allows the design of an efficient pipeline scheme which is described in detail. Following a description of its potential performance, the pipeline scheme is applied to the design of self-timed rings. It is shown that more compact ring structures can be obtained without loss of performance. Our design methodology is then presented. It is based on the use of a private asynchronous standard cell library, fully compatible with an existing CMOS standard cell library provided by the foundry. Our approach allows the rapid design of standard cell based asynchronous circuits. Finally, both the pipeline scheme and design approach are illustrated through the design of a 32-b self-timed ring divider. The division algorithm is first briefly presented. The chip architecture is then described with the results obtained after fabrication. The test chip has been fabricated using the CNET/SGS-Thomson 0.5 /spl mu/m three metal layer technology. The 0.7 mm/sup 2/ chip computes 32-b divisions in 101 ns with a power consumption of 30 mW at a throughput of 10 million operations per second.

Patent
22 Oct 1996
TL;DR: The synchronous DRAM as mentioned in this paper allows switching of bit configuration, and it takes 2-bits prefetch configuration in x8 configuration mode, and signal pipeline configuration in X16 configuration mode.
Abstract: A synchronous DRAM includes a selector which supplies 2 bits of serial data signals from one data input/output terminal to two input/output line pairs as parallel data signals in x8 configuration mode, and supplies 2 bits of parallel data signals from both data input/output terminals directly to two input/output line pairs in x16 configuration mode. Therefore, the synchronous DRAM allows switching of bit configuration, and it takes 2-bits prefetch configuration in x8 configuration mode, and signal pipeline configuration in x16 configuration mode.