Showing papers on "Pipeline (computing)" published in 1996

Proceedings Article•DOI•

A new approach to pipeline FFT processor

Shousheng He¹, M. Torkelson¹•Institutions (1)

15 Apr 1996

TL;DR: A new VLSI architecture for a real-time pipeline FFT processor is proposed, based on an algorithm derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach; the algorithm has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm.

Abstract: A new VLSI architecture for a real-time pipeline FFT processor is proposed. A hardware-oriented radix-2² algorithm is derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach. The radix-2² algorithm has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-N DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: log₄N − 1 complex multipliers and N − 1 complex data memory. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.
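
The hardware figures quoted in this abstract can be tabulated with a small sketch. It only evaluates the two expressions given above (log₄N − 1 multipliers and N − 1 words of data memory); the function name `pipeline_fft_hardware` and the restriction to N being a power of 4 are assumptions made for illustration.

```python
def pipeline_fft_hardware(n: int) -> tuple[int, int]:
    """Return (multipliers, memory_words) for a length-n transform.

    Hypothetical helper: evaluates log4(n) - 1 and n - 1 as quoted in
    the abstract. Assumes n is a power of 4 and n >= 4.
    """
    if n < 4:
        raise ValueError("n must be a power of 4 and at least 4")
    stages = 0  # becomes log4(n), computed with integer arithmetic only
    m = n
    while m > 1:
        if m % 4 != 0:
            raise ValueError("n must be a power of 4")
        m //= 4
        stages += 1
    return stages - 1, n - 1


if __name__ == "__main__":
    for n in (16, 256, 1024):
        mults, mem = pipeline_fft_hardware(n)
        print(f"N={n}: {mults} multipliers, {mem} memory words")
```

For example, a length-256 transform gives 3 multipliers and 255 memory words under these expressions.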

410 citations

Patent•

Reconfigurable computer architecture for use in signal processing applications

Charle R. Rupp¹•Institutions (1)

National Semiconductor¹

28 May 1996

TL;DR: The reconfigurable processor architecture uses a programmable logic structure called an Adaptive Logic Processor (ALP), which is similar to an extendible field programmable gate array (FPGA) and is optimized for the implementation of program specific pipeline functions.

Abstract: An architecture for information processing devices which allows the construction of low cost, high performance systems for specialized computing applications involving sensor data processing. The reconfigurable processor architecture of the invention uses a programmable logic structure called an Adaptive Logic Processor (ALP). This structure is similar to an extendible field programmable gate array (FPGA) and is optimized for the implementation of program specific pipeline functions, where the function may be changed any number of times during the progress of a computation. A Reconfigurable Pipeline Instruction Control (RPIC) unit is used for loading the pipeline functions into the ALP during the configuration process and coordinating the operations of the ALP with other information processing structures, such as memory, I/O devices, and arithmetic processing units. Multiple components having the reconfigurable architecture of the present invention may be combined to produce high performance parallel processing systems based on the Single Instruction Multiple Data (SIMD) architecture concept.

232 citations

Proceedings Article•DOI•

Using Hybrid Branch Predictors to Improve Branch Prediction Accuracy in the Presence of Context Switches

Marius Evers¹, Po-Yung Chang¹, Yale N. Patt¹•Institutions (1)

University of Michigan¹

01 May 1996

TL;DR: This paper introduces a new hybrid branch predictor and shows that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches.

Abstract: Pipeline stalls due to conditional branches represent one of the most significant impediments to realizing the performance potential of deeply pipelined, superscalar processors. Many branch predictors have been proposed to help alleviate this problem, including the Two-Level Adaptive Branch Predictor, and more recently, two-component hybrid branch predictors. In a less idealized environment, such as a time-shared system, code of interest involves context switches. Context switches, even at fairly large intervals, can seriously degrade the performance of many of the most accurate branch prediction schemes. In this paper, we introduce a new hybrid branch predictor and show that it is more accurate (for a given cost) than any previously published scheme, especially if the branch histories are periodically flushed due to the presence of context switches.

180 citations

Proceedings Article•DOI•

LISA-machine description language and generic machine model for HW/SW co-design

V. Zivojnovic¹, Stefan Pees, Heinrich Meyr•Institutions (1)

RWTH Aachen University¹

30 Oct 1996

TL;DR: The development of a new language was necessary in order to cover the gap between coarse ISA models used in compilers, and instruction set simulators on the one hand, and detailed models used for hardware design on the other.

Abstract: A machine description language is presented. The language, LISA, and its generic machine model are able to produce bit- and cycle/phase-accurate processor models covering the specific needs of HW/SW codesign and cosimulation environments. The development of a new language was necessary in order to cover the gap between coarse ISA models used in compilers and instruction set simulators on the one hand, and detailed models used for hardware design on the other. The main part of the paper is devoted to behavioral pipeline modeling. The pipeline controller of the generic machine model is represented as an ASAP (as soon as possible) sequencer parameterized by precedence and resource constraints of operations of each instruction. The standard pipeline description based on reservation tables and Gantt charts was extended by additional operation descriptors which enable the detection of data and control hazards, and permit modeling of pipeline flushes. Using the newly introduced L-charts we reduced the parameterization of the pipeline controller to a minimum and at the same time covered typical pipeline controls found in state of the art signal processors. As an example, the application of the LISA model on the TI-TMS320C54x signal processor is presented.

151 citations

Journal Article•DOI•

A GHz MOS adaptive pipeline technique using MOS current-mode logic

Masayuki Mizuno¹, Masakazu Yamashina¹, Koichiro Furuta¹, H. Igura¹, Hitoshi Abiko¹, K. Okabe¹, Atsuki Ono¹, Hachiro Yamada¹•Institutions (1)

NEC¹

01 Jun 1996-IEEE Journal of Solid-state Circuits

TL;DR: An adaptive pipeline (APL) technique is described, which is a new pipeline scheme capable of compensating for device-parameter deviations and for operating-environment variations, and it is shown that MOS current-mode logic circuits are suitable for a low-noise variable delay circuit.

Abstract: This paper describes an adaptive pipeline (APL) technique, which is a new pipeline scheme capable of compensating for device-parameter deviations and for operating-environment variations. This technique can also compensate for clock skew and eliminate excessive power dissipation in current-mode logic (CML) circuits. The APL technique is here applied to a 0.4-µm MOS 1.6-V 1-GHz 64-bit double-stage pipeline adder, and this paper shows that the adder can operate accurately on condition that the clock has 20% skew. The APL technique uses MOS current-mode logic (MCML) circuits, whose propagation delay time can be varied by the control ports. MCML circuits can operate with lower signal voltage swing and higher operating frequency at lower supply voltage than CMOS circuits can. This paper also shows that MCML circuits are suitable for a low-noise variable delay circuit. Measurement results show that jitter of MCML circuits is about 65% that of CMOS circuits.

146 citations

Journal Article•DOI•

An integrated high resolution CMOS timing generator based on an array of delay locked loops

J. Christiansen¹•Institutions (1)

CERN¹

01 Jul 1996-IEEE Journal of Solid-state Circuits

TL;DR: In this article, the authors describe the architecture and performance of a new high resolution timing generator used as a building block for time-to-digital converters (TDC) and clock alignment functions.

Abstract: This paper describes the architecture and performance of a new high resolution timing generator used as a building block for time-to-digital converters (TDC) and clock alignment functions. The timing generator is implemented as an array of delay locked loops. This architecture enables a timing generator with subgate delay resolution to be implemented in a standard digital CMOS process. The TDC function is implemented by storing the state of the timing generator signals in an asynchronous pipeline buffer when a hit signal is asserted. The clock alignment function is obtained by selecting one of the timing generator signals as an output clock. The proposed timing generator has been mapped into a 1.0 µm CMOS process and an r.m.s. error of the time taps of 48 ps has been measured with a bin size of 0.15 ns. Used as a TDC device, an r.m.s. error of 76 ps has been obtained. A short overview of the basic principles of major TDC and timing generator architectures is given to compare the merits of the proposed scheme to other alternatives.

145 citations

Journal Article•DOI•

Texram: a smart memory for texturing

Andreas Schilling¹, Günter Knittel¹, Wolfgang Strasser¹•Institutions (1)

University of Tübingen¹

01 May 1996-IEEE Computer Graphics and Applications

TL;DR: Logic embedded memory is an emerging technology that combines high transfer rates and computing power and Texram implements this technology and a new filtering algorithm to achieve high speed, high quality texture mapping.

Abstract: Logic embedded memory is an emerging technology that combines high transfer rates and computing power. Texram implements this technology and a new filtering algorithm to achieve high speed, high quality texture mapping. Integrating arithmetic units and large memory arrays on the same chip and thus exploiting the enormous internal transfer rates provides an elegant solution to the memory access bottleneck of high quality texture mapping. Using this technology, we can not only achieve higher texturing speed at lower system costs, we can also incorporate new functionalities such as detail mapping and footprint assembly to produce higher quality images at real time rendering speeds. Environment and video mapping are also integrated on the Texram, which therefore represents an autonomous and versatile texturing coprocessor. Logic enhanced memories might become the computing paradigm of the future, not just in graphics applications. Technological advances will foster this trend by providing an ever increasing amount of memory capacity and chip space for arithmetic units. As the ultimate solution, we can expect a complete 3D graphics pipeline including all memory systems integrated on a single chip.

131 citations

Quantifying the Complexity of Superscalar Processors

Subbarao Palacharla¹, Norman P. Jouppi, James E. Smith¹•Institutions (1)

University of Wisconsin-Madison¹

01 Jan 1996

TL;DR: Analysis indicates that window (wakeup and select) logic and operand bypass logic are likely to be the most critical in the future of superscalar processors.

Abstract: To characterize future performance limitations of superscalar processors, the delays of key pipeline structures in superscalar processors are studied. First, a generic superscalar pipeline is defined. Then the specific areas of register renaming, instruction window wakeup and selection logic, and operand bypassing are analyzed. Each is modeled and Spice simulated for feature sizes of 0.8 µm, 0.35 µm, and 0.18 µm. Performance (delay) results and trends are expressed in terms of issue width and window size. This analysis indicates that window (wakeup and select) logic and operand bypass logic are likely to be the most critical in the future.

125 citations

Proceedings Article•

Linear Array with a Reconfigurable Pipeline Bus System - Concepts and Applications.

Yi Pan, Keqin Li

01 Jan 1996

TL;DR: In this paper, a new computational model, called a linear array with a reconfigurable pipelined bus system (LARPBS), has been proposed as a feasible and efficient parallel computational model based on current optical technologies.

Abstract: A new computational model, called a linear array with a reconfigurable pipelined bus system (LARPBS), has been proposed as a feasible and efficient parallel computational model based on current optical technologies. In this paper, we further study this model by proposing several basic data movement operations on the model. These operations include broadcast, multicast, compression, split, binary prefix sum, maximum finding. Using these basic operations, several image processing algorithms are also presented for the model. We show that all algorithms can be executed efficiently on the LARPBS model. It is our hope that the LARPBS model can be used as a new and practical parallel computational model for designing parallel algorithms.

119 citations

Proceedings Article•DOI•

Multiple-block ahead branch predictors

André Seznec, Stephan Jourdan¹, Pascal Sainrat¹, Pierre Michaud•Institutions (1)

Paul Sabatier University¹

01 Sep 1996

TL;DR: This approach overcomes the instruction fetch bottle-neck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle.

Abstract: A basic rule in computer architecture is that a processor cannot execute an application faster than it fetches its instructions. This paper presents a novel cost-effective mechanism called the two-block ahead branch predictor. Information from the current instruction block is not used for predicting the address of the next instruction block, but rather for predicting the block following the next instruction block. This approach overcomes the instruction fetch bottle-neck exhibited by wide-dispatch "brainiac" processors by enabling them to efficiently predict addresses of two instruction blocks in a single cycle. Furthermore, pipelining the branch prediction process can also be done by means of our predictor for "speed demon" processors to achieve higher clock rate or to improve the prediction accuracy by means of bigger prediction structures. Moreover, and unlike the previously-proposed multiple predictor schemes, multiple-block ahead branch predictors can use any of the branch prediction schemes to perform the very accurate predictions required to achieve high-performance on superscalar processors.

110 citations

Patent•

System and method for implementing an overlay pathway

Lawrence P. Chee¹, John David Mulvenna¹•Institutions (1)

Epson¹

19 Apr 1996

TL;DR: A display FIFO pipeline processes background graphics display data and a separate overlay FIFO pipeline processes overlay display data stored in an off-screen part of a graphics memory.

Abstract: A system and method for processing overlay display data. A display FIFO pipeline processes background graphics display data and a separate overlay FIFO pipeline processes overlay display data stored in an off-screen part of a graphics memory. The overlay FIFO pipeline performs format conversion, interpolation and scaling on the overlay display data and outputs it to an overlay mux. The overlay mux selects between the outputs of the display FIFO pipeline and the overlay FIFO pipeline in the processing of each scan line.

Proceedings Article•DOI•

Improving branch prediction accuracy by reducing pattern history table interference

Po-Yung Chang¹, Marius Evers¹, Yale N. Patt¹•Institutions (1)

University of Michigan¹

20 Oct 1996

TL;DR: This paper shows how this technique reduces pattern history table interference for two versions of the 2-level branch predictor and that this significantly improves branch prediction accuracy for the SPEC 95 benchmarks.

Abstract: Today's deeply pipelined, superscalar processors rely on accurate branch prediction in order to approach their performance potential. Branch mispredictions result in a flushing of the speculative information in the pipeline, thus limiting the amount of useful work that can be done. The 2-level branch predictors have been shown to achieve high prediction accuracy. However, it has also been shown that there is a high degree of pattern history table interference in 2-level branch predictors and that the interference generally has a negative effect on the prediction accuracy. This paper introduces a method for reducing the pattern history table interference by dynamically identifying some easily predictable branches and inhibiting the pattern history table update for these branches. We show how this technique reduces pattern history table interference for two versions of the 2-level branch predictor and that this significantly improves branch prediction accuracy for the SPEC 95 benchmarks. In particular, we eliminate up to 30% of the mispredictions for the gcc benchmark.
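
The idea of inhibiting pattern history table (PHT) updates for easily predictable branches can be illustrated with a rough sketch. Everything concrete here is an assumption for illustration, not the paper's design: the gshare-style indexing, the per-branch run-length counter used to flag a branch as "easy", the threshold value, and all class and attribute names.

```python
class FilteredTwoLevelPredictor:
    """Illustrative 2-level predictor that skips PHT updates for 'easy' branches.

    Assumed details (not from the abstract): gshare-style index, a
    branch counts as easy once it has gone the same direction
    `easy_threshold` times in a row.
    """

    def __init__(self, history_bits: int = 8, easy_threshold: int = 8):
        self.mask = (1 << history_bits) - 1
        self.history = 0                        # global branch history register
        self.pht = [2] * (1 << history_bits)    # 2-bit counters, start weakly taken
        self.easy_threshold = easy_threshold
        self.last_dir = {}                      # per-branch last outcome
        self.run_len = {}                       # per-branch run of identical outcomes

    def _index(self, pc: int) -> int:
        return (pc ^ self.history) & self.mask

    def is_easy(self, pc: int) -> bool:
        return self.run_len.get(pc, 0) >= self.easy_threshold

    def predict(self, pc: int) -> bool:
        if self.is_easy(pc):
            return self.last_dir[pc]            # bypass the PHT entirely
        return self.pht[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        if not self.is_easy(pc):                # easy branches leave the PHT untouched
            i = self._index(pc)
            if taken:
                self.pht[i] = min(3, self.pht[i] + 1)
            else:
                self.pht[i] = max(0, self.pht[i] - 1)
        if self.last_dir.get(pc) == taken:
            self.run_len[pc] = self.run_len.get(pc, 0) + 1
        else:
            self.last_dir[pc] = taken
            self.run_len[pc] = 1
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

In this sketch a branch that changes direction loses its "easy" status immediately and goes back to training the PHT; whether the paper uses the same policy is not stated in the abstract.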

Patent•

Out-of-order processing that removes an issued operation from an execution pipeline upon determining that the operation would cause a lengthy pipeline delay

John G. Favor¹, Amos Ben-Meir¹•Institutions (1)

Advanced Micro Devices¹

16 May 1996

TL;DR: A superscalar microprocessor includes a scheduler which contains storage for information related to operations and scan logic for selecting operations for out-of-order execution by a set of execution units.

Abstract: A superscalar microprocessor includes a scheduler which contains storage for information related to operations and scan logic for selecting operations for out-of-order execution by a set of execution units. To provide fast operation, the selection is made without regard for the availability of operands which are required for execution of the operation but may be unavailable pending completion of an operation. An operand forward stage, which follows the issue stage, selects sources for an operand which may be a register file or a sourcing operation in the scheduler, completed or not. The scheduler contains all information describing the sourcing operations and forwards an operand value and information indicating the state of a sourcing operation. The state information indicates whether the sourcing operation is complete and execution of the issued operation can continue. The state also indicates a wait until the sourcing operation will complete. If the wait is too long, the issued operation is bumped so that another operation can be executed. This reduces pipeline hold-ups and increases execution unit utilization.

Proceedings Article•DOI•

High-performance asynchronous pipeline circuits

K.Y. Yun¹, Peter A. Beerel¹, Juan Carlos Arceo¹•Institutions (1)

University of California, San Diego¹

18 Mar 1996

TL;DR: This paper presents design and simulation results of two high-performance asynchronous pipeline circuits, one of which uses pseudo-static Svensson-style double edge-triggered D-flip-flops for data storage in place of traditional transmission gate latches or Sutherland's capture-pass latches.

Abstract: This paper presents design and simulation results of two high-performance asynchronous pipeline circuits. The first circuit is a two-phase micropipeline but uses pseudo-static Svensson-style double edge-triggered D-flip-flops (DETDFF) for data storage in place of traditional transmission gate latches or Sutherland's capture-pass latches. The second circuit is a four-phase micropipeline with burst-mode control circuits. We compare our DETDFF and four-phase implementations of a FIFO buffer with the current state-of-the-art micropipeline implementation using four-phase controllers designed by Day and Woods for the AMULET-2 processor. We implemented Day and Woods's design and both of our designs in the MOSIS 1.2 µm CMOS process and simulated them with a 4.6 V power supply and at 100°C. Our SPICE simulations show that our DETDFF and four-phase designs have 70% and 30% higher throughput respectively than Day and Woods's design. This higher throughput for the DETDFF design is due to latching the data on both edges of the latch control, removing the need of a reset phase and simplifying the control structures. Our four-phase design, on the other hand, has higher throughput because of the simplified control structures and the removal of the latch enable buffers from the critical path. The four-phase design, though not quite as fast as the DETDFF design, requires much smaller area for data storage.

Patent•

Data compression and restoration system for encoding an input character on the basis of a conditional appearance rate obtained in relation to an immediately preceding character string

Yoshiyuki Okada¹•Institutions (1)

Fujitsu¹

27 Dec 1996

TL;DR: For handling data having a fixed-order context, a data compression system uses a pipeline control unit to enable an occurrence frequency modeling unit and an entropy coding unit to operate in a pipelined manner.

Abstract: For handling data having a fixed-order context, a data compression system uses a pipeline control unit to enable an occurrence frequency modeling unit and an entropy coding unit to operate in a pipelined manner. A data restoration system uses a pipeline control unit to enable an entropy decoding unit and an occurrence frequency modeling unit to operate in a pipelined manner. For handling data having a blend context, occurrence frequency modeling units associated with orders are operated in parallel for data compression or data restoration. Furthermore, word data is separated byte by byte, and byte data items are encoded or restored on the basis of the correlation thereof in a word-stream direction.

Patent•

Method for providing a pipeline interpreter for a variable length instruction set

John S. Yates, Stephen C. Root

29 Jan 1996

TL;DR: A run-time system collects profile data in response to execution of the native instructions to determine execution characteristics of the non-native instructions; the non-native instructions and the profile statistics are then fed to a binary translator, operating in a background mode, which uses the profile data to form a translated native image.

Abstract: A computer system for executing a binary image conversion system which converts instructions from an instruction set of a first, non-native computer system to a second, different, native computer system, includes a run-time system which in response to a non-native image of an application program written for a non-native instruction set provides a native instruction or a native instruction routine. The run-time system collects profile data in response to execution of the native instructions to determine execution characteristics of the non-native instruction. Thereafter, the non-native instructions and the profile statistics are fed to a binary translator operating in a background mode and which is responsive to the profile data generated by the run-time system to form a translated native image. The run-time system and the binary translator are under the control of a server process. The non-native image is executed in two different environments, with a first portion executed as an interpreted image and remaining portions as a translated image. The run-time system includes an interpreter which is capable of handling condition codes corresponding to the non-native architecture. A technique is also provided to jacket calls between the two execution environments and to support object based services. Preferred techniques are also provided to determine interprocedural translation units. Further, intermixed translation/optimization techniques are discussed.

Proceedings Article•DOI•

A new non-restoring square root algorithm and its VLSI implementations

Yamin Li¹, Wanming Chu¹•Institutions (1)

University of Aizu¹

07 Oct 1996

TL;DR: In the new algorithm, the focus of the "non-restoring" is on the "partial remainder", not on "each bit of the square root", and each iteration requires only one traditional adder/subtractor.

Abstract: We present a new non-restoring square root algorithm that is very efficient to implement. The new algorithm presented here has the following features unlike other square root algorithms. First, the focus of the "non-restoring" is on the "partial remainder", not on "each bit of the square root", with each iteration. Second, it only requires one traditional adder/subtractor in each iteration, i.e., it does not require other hardware components, such as seed generators, multipliers, or even multiplexors. Third, it generates the correct resulting value even in the last bit position. Next, based on the resulting value of the last bit, a precise remainder can be obtained immediately without any correction or addition operation. And finally, it can be implemented at a very fast clock rate because of the very simple operations at each iteration. We illustrate two VLSI implementations of the new algorithm. One is a fully pipelined high-performance implementation that can accept a new square-root instruction each clock cycle, with each pipeline stage requiring a minimum gate count. The other is a low-cost implementation that uses only a single adder/subtractor for iterative operation.
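
A software sketch of a non-restoring integer square root, in the spirit of the features listed above, might look as follows. This is a reconstruction under assumptions rather than the authors' exact formulation: the function name, the `bits` parameter, and the final remainder fix-up step are illustrative choices. Each iteration performs exactly one addition or subtraction on the partial remainder, chosen by the remainder's sign, and never restores it.

```python
def nonrestoring_isqrt(d: int, bits: int = 32) -> tuple[int, int]:
    """Return (q, r) with q = floor(sqrt(d)) and r = d - q*q.

    Illustrative sketch: `bits` is the assumed radicand width (even),
    so the loop runs bits // 2 times, producing one root bit each time.
    """
    if bits % 2 != 0 or not (0 <= d < (1 << bits)):
        raise ValueError("d must fit in an even number of bits")
    q = 0  # partial root
    r = 0  # partial remainder; may go negative and is not restored
    for i in range(bits // 2 - 1, -1, -1):
        pair = (d >> (2 * i)) & 3        # next two radicand bits
        if r >= 0:
            r = 4 * r + pair - (4 * q + 1)   # subtract when remainder is non-negative
        else:
            r = 4 * r + pair + (4 * q + 3)   # add when remainder is negative
        q = 2 * q + (1 if r >= 0 else 0)     # new root bit from the sign of r
    if r < 0:
        r += 2 * q + 1                   # software-only fix-up for an exact remainder
    return q, r


if __name__ == "__main__":
    print(nonrestoring_isqrt(1000))
```

The loop body uses only shifts and one add or subtract, which matches the abstract's point that no multipliers or multiplexors are needed per iteration; the explicit fix-up at the end is a convenience of this sketch.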

Journal Article•DOI•

Branch Classification: A New Mechanism for Improving Branch Predictor Performance

Po-Yung Chang¹, Eric Hao¹, Tse-Yu Yeh², Yale N. Patt¹•Institutions (2)

University of Michigan¹, Intel²

01 Apr 1996-International Journal of Parallel Programming

TL;DR: This paper proposes branch classification, a methodology for building more accurate branch predictors, and an example classification scheme is given and a new hybrid predictor is built based on this scheme which achieves a higher prediction accuracy than any branch predictor previously reported in the literature.

Abstract: There is wide agreement that one of the most significant impediments to the performance of current and future pipelined superscalar processors is the presence of conditional branches in the instruction stream. Speculative execution is one solution to the branch problem, but speculative work is discarded if a branch is mispredicted. For it to be effective, speculative execution requires a very accurate branch predictor; 95% accuracy is not good enough. This paper proposes branch classification, a methodology for building more accurate branch predictors. Branch classification allows an individual branch instruction to be associated with the branch predictor best suited to predict its direction. Using this approach, a hybrid branch predictor can be constructed such that each component branch predictor predicts those branches for which it is best suited. To demonstrate the usefulness of branch classification, an example classification scheme is given and a new hybrid predictor is built based on this scheme which achieves a higher prediction accuracy than any branch predictor previously reported in the literature.

Patent•

Method and apparatus for implementing a speculative return stack buffer

Simcha Gochman¹, Nicolas Kacevas¹, Farah Jubran¹•Institutions (1)

Intel¹

15 May 1996

TL;DR: A return stack buffer mechanism that uses two separate return stack buffers is described: the Speculative Return Stack Buffer, which is updated using speculatively fetched instructions, and the Actual Return Stack Buffer, which is updated using information from fully executed branch instructions.

Abstract: A return stack buffer mechanism that uses two separate return stack buffers is disclosed. The first return stack buffer is the Speculative Return Stack Buffer. The Speculative Return Stack Buffer is updated using speculatively fetched instructions. Thus, the Speculative Return Stack Buffer may become corrupted when incorrect instructions are fetched. The second return stack buffer is the Actual Return Stack Buffer. The Actual Return Stack Buffer is updated using information from fully executed branch instructions. When a branch misprediction causes a pipeline flush, the contents of the Actual Return Stack Buffer are copied into the Speculative Return Stack Buffer to correct any corrupted information.
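
The two-buffer mechanism described above can be modeled in a few lines. This is a behavioral sketch only: the class and method names, the fixed `depth`, and the policy of dropping the oldest entry on overflow are assumptions for illustration and do not come from the patent text.

```python
class ReturnStackBuffers:
    """Behavioral model of a speculative and an actual return stack buffer."""

    def __init__(self, depth: int = 16):
        self.depth = depth        # assumed capacity of each buffer
        self.speculative = []     # updated at fetch time, may be corrupted
        self.actual = []          # updated only by fully executed branches

    def _push(self, stack: list, return_addr: int) -> None:
        if len(stack) == self.depth:
            stack.pop(0)          # assumed overflow policy: drop the oldest entry
        stack.append(return_addr)

    def fetch_call(self, return_addr: int) -> None:
        """A call was speculatively fetched."""
        self._push(self.speculative, return_addr)

    def fetch_return(self):
        """A return was speculatively fetched; yield the predicted target."""
        return self.speculative.pop() if self.speculative else None

    def retire_call(self, return_addr: int) -> None:
        """A call has fully executed."""
        self._push(self.actual, return_addr)

    def retire_return(self) -> None:
        """A return has fully executed."""
        if self.actual:
            self.actual.pop()

    def flush(self) -> None:
        """Misprediction: copy the actual buffer over the speculative one."""
        self.speculative = list(self.actual)


if __name__ == "__main__":
    rsb = ReturnStackBuffers()
    rsb.fetch_call(0x100)
    rsb.retire_call(0x100)
    rsb.fetch_call(0xBAD)         # fetched down a wrong path
    rsb.flush()                   # pipeline flush repairs the speculative buffer
    print(hex(rsb.fetch_return()))
```

After the flush, the wrong-path entry is gone and the next predicted return target is the one recorded by the fully executed call.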

Journal Article•DOI•

Achieving full parallelism using multidimensional retiming

N.L. Passos¹, Edwin H.-M. Sha²•Institutions (2)

Midwestern State University¹, University of Notre Dame²

01 Nov 1996-IEEE Transactions on Parallel and Distributed Systems

TL;DR: An important and counter-intuitive result is shown, which proves that the authors can always obtain full-parallelism for MDFGs with more than one dimension.

Abstract: Most scientific and digital signal processing (DSP) applications are recursive or iterative. Transformation techniques are usually applied to get optimal execution rates in parallel and/or pipeline systems. The retiming technique is a common and valuable transformation tool in one-dimensional problems, when loops are represented by data flow graphs (DFGs). In this paper, uniform nested loops are modeled as multidimensional data flow graphs (MDFGs). Full parallelism of the loop body, i.e., all nodes in the MDFG executed in parallel, substantially decreases the overall computation time. It is well known that, for one-dimensional DFGs, retiming cannot always achieve full parallelism. Other existing optimization techniques for nested loops also cannot always achieve full parallelism. This paper shows an important and counter-intuitive result, which proves that we can always obtain full parallelism for MDFGs with more than one dimension. This result is obtained by transforming the MDFG into a new structure. The restructuring process is based on a multidimensional retiming technique. The theory and two algorithms to obtain full parallelism are presented in this paper. Examples of optimization of nested loops and digital signal processing designs are shown to demonstrate the effectiveness of the algorithms.

Proceedings Article•DOI•

Architectural retiming: pipelining latency-constrained circuits

Soha Hassoun¹, Carl Ebeling¹•Institutions (1)

University of Washington¹

01 Jun 1996

TL;DR: A new optimization technique called architectural retiming is presented which is able to improve the performance of many latency-constrained circuits by increasing the number of registers on the latency-constrained path while preserving the functionality and latency of the circuit.

Abstract: This paper presents a new optimization technique called architectural retiming which is able to improve the performance of many latency-constrained circuits. Architectural retiming achieves this by increasing the number of registers on the latency-constrained path while preserving the functionality and latency of the circuit. This is done using the concept of a negative register, which can be implemented using precomputation and prediction. We use the name architectural retiming since it both reschedules operations in time and modifies the structure of the circuit to preserve its functionality. We illustrate the use of architectural retiming on two realistic examples and present performance improvement results for a number of sample circuits.

Patent•

Master-slave cache system with de-coupled data and tag pipelines and loop-back

Earl T. Cohen, Jay C. Pattin

14 May 1996

TL;DR: The master cache has a tag pipeline for accessing the tag RAM array and a data pipeline optimized for overall data transfer bandwidth; the two pipelines are bound together to retrieve the first sub-line of a new miss from the slave cache.

Abstract: A cache system has a large master cache and smaller slave caches. The slave caches are coupled to the processor's pipelines and are kept small and simple to increase their speed. The master cache is set-associative and performs many of the complex cache management operations for the slave caches, freeing the slaves of these bandwidth-robbing duties. The master cache has a tag pipeline for accessing the tag RAM array, and a data pipeline for accessing the data RAM array. The tag pipeline is optimized for fast access of the tag RAM array, while the data pipeline is optimized for overall data transfer bandwidth. The tag pipeline and the data pipeline are bound together for retrieving the first sub-line of a new miss from the slave cache. Subsequent sub-lines only use the data pipeline, freeing the tag pipeline for other operations. Bus snoops and cache management operations can use just the tag pipeline without impacting data bandwidth. Loop-back flows are performed which cancel an intervening flow in the tag pipeline when the index portions of the addresses match.

Patent•

Method for verifying the correct processing of pipelined instructions including branch instructions and self-modifying code in a microprocessor

Edward T. Grochowski¹, Donald B. Alpert¹•Institutions (1)

Intel¹

19 Aug 1996

TL;DR: An apparatus and method are presented for improving the performance of pipelined computer processors that have segment bits specifying the operand size, the address size for memory references, and the stack size.

Abstract: An apparatus and method for improving the performance of pipelined computer processors which have segment bits for specifying the operand size, the address size for memory reference, and the stack size, and which can run self-modifying code. The processor predicts segment bits based on previously used segment bits. Actual segment bits are later determined during execution of an instruction. The predicted segment bits are compared with the actual segment bits, and the pipeline is flushed if they do not match. Also, an instruction verification method is provided to determine if self-modifying code has modified instructions already in the pipeline. Upon execution of a write instruction, each instruction address in the pipeline is compared with the write address. If a match is found, the pipeline is flushed.
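
The instruction verification step described in this abstract can be sketched as a toy model. This is only an illustration under stated assumptions: the `Pipeline` class, its method names, and the use of exact address equality are hypothetical and are not taken from the patent, which does not specify the comparison granularity here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Pipeline:
    """Toy model (hypothetical): tracks addresses of in-flight instructions."""
    in_flight: List[int] = field(default_factory=list)
    flushes: int = 0

    def issue(self, address: int) -> None:
        # An instruction fetched from `address` enters the pipeline.
        self.in_flight.append(address)

    def flush(self) -> None:
        # Discard every in-flight instruction and count the flush.
        self.in_flight.clear()
        self.flushes += 1

    def on_write(self, write_address: int) -> bool:
        """Compare the write address with each in-flight instruction address.

        Returns True if a match was found and the pipeline was flushed.
        Assumption: addresses match only when exactly equal.
        """
        if write_address in self.in_flight:
            self.flush()
            return True
        return False
```

A write to an address that is not in flight leaves the pipeline untouched, while a write that hits an in-flight instruction address empties it, so the modified instruction would be refetched.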

Patent•

Distributed pipeline memory architecture for a computer system with even and odd pids

Richard L. Angle¹, Edward S. Harriman¹, Geoffrey B. Ladwig¹•Institutions (1)

Nortel¹

20 Sep 1996

TL;DR: A computer system architecture in which each processor has its own memory, strategically distributed along the stages of the processor's execution pipeline, to provide fast access to often-used information such as the contents of the address and data registers and the program counter.

Abstract: A computer system architecture in which each processor has its own memory, strategically distributed along the stages of an execution pipeline of the processor, to provide fast access to often used information, such as the contents of the address and data registers, the program counter, etc. Memory storage is strategically located in close physical proximity to a stage in an execution pipeline at which memory is commonly or repeatedly accessed. Coupled to the pipeline at various stages are small memory cells for storing information that is consistently and repeatedly requested at that stage in the execution pipeline. The speed of the execution pipeline in a processor is critical to overall performance of the processor and the computer architecture of the present invention as a whole. To that end, the clock cycle time at which the pipeline is operated is increased as much as the operating characteristics of the logic and associated circuitry will allow. Generally, access times for memory are slower than the clock cycle times at which the pipeline logic can operate. Thus, there is a point of diminishing return at which increasing the clock cycle time of the pipeline is less advantageous if the pipeline must wait for memory access to complete. Thus, two sets of strategically located memory cells are provided, distributed along the execution pipeline of a processor, and the memory cells are accessed alternately.

Journal Article•DOI•

Pipelined adders

L. Dadda¹, Vincenzo Piuri¹•Institutions (1)

Polytechnic University of Milan¹

01 Mar 1996-IEEE Transactions on Computers

TL;DR: This paper shows that other schemes can be designed, based on the idea of pipelining a serial-input adder or a ripple-carry adder, to obtain pipelined adders for more than two numbers.

Abstract: A well-known scheme for obtaining high throughput adders is a pipeline in which each stage contains an array of half-adders performing a carry-save addition. This paper shows that other schemes can be designed, based on the idea of pipelining a serial-input adder or a ripple-carry adder. Such schemes offer considerable savings of components while preserving high throughput. These schemes can be generalized by using (p,q) parallel counters to obtain pipelined adders for more than two numbers.
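
The well-known half-adder scheme mentioned at the start of this abstract can be sketched in software as follows. This is a minimal sketch, not the new schemes the paper proposes: the function names and the fixed word `width` are assumptions, and each loop iteration stands in for one pipeline stage.

```python
def half_adder_stage(s: int, c: int, width: int) -> tuple:
    """One pipeline stage: a row of half-adders applied bitwise.

    Each bit position produces a sum bit (XOR) and a carry bit (AND)
    that is passed one position to the left for the next stage.
    """
    mask = (1 << width) - 1
    return (s ^ c) & mask, ((s & c) << 1) & mask


def pipelined_add(a: int, b: int, width: int) -> int:
    """Add two `width`-bit numbers modulo 2**width using `width` stages.

    After k stages the carry word has its k low bits cleared, so after
    `width` stages no carries remain and the sum word is the result.
    """
    mask = (1 << width) - 1
    s, c = a & mask, b & mask
    for _ in range(width):  # one iteration per pipeline stage
        s, c = half_adder_stage(s, c, width)
    return s
```

In hardware each stage would hold a different pair of operands at the same time, which is where the throughput comes from; the loop above only follows a single pair through the stages.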

Patent•

Real-time pipeline fast fourier transform processors

Shousheng He, Mats Torkelsson

26 Feb 1996

TL;DR: A real-time pipeline processor based on a hardware-oriented radix-2² algorithm, derived by integrating a twiddle factor decomposition technique in a divide-and-conquer approach, is presented.

Abstract: A real-time pipeline processor, which is particularly suited for VLSI implementation, is based on a hardware-oriented radix-2² algorithm derived by integrating a twiddle factor decomposition technique in a divide-and-conquer approach. The radix-2² algorithm has the same multiplicative complexity as a radix-4 algorithm, but retains the butterfly structure of a radix-2 algorithm. A single-path delay-feedback architecture is used in order to exploit the spatial regularity in the signal flow graph of the algorithm. For a length-N DFT, the hardware requirements of the processor proposed by the present invention are minimal on both dominant components: log₄N − 1 complex multipliers and N − 1 complex data memory.
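
The two hardware counts quoted in this abstract can be tabulated for a given transform length. The sketch below is an illustration only: the function name and dictionary keys are hypothetical, N is assumed to be a power of 4, and the per-stage feedback delay lengths N/2, N/4, ..., 1 are an assumption based on the usual single-path delay-feedback arrangement rather than something stated in the abstract.

```python
def sdf_resources(n: int) -> dict:
    """Resource counts for a length-n radix-2^2 delay-feedback pipeline.

    Uses the figures from the abstract: log4(n) - 1 complex multipliers
    and n - 1 words of complex data memory. Assumes n is a power of 4.
    """
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of 4")
    log2n = n.bit_length() - 1
    if log2n % 2:
        raise ValueError("n must be a power of 4")
    # Assumed delay-line lengths, one per butterfly stage: n/2, n/4, ..., 1.
    feedback_delays = [n >> k for k in range(1, log2n + 1)]
    return {
        "complex_multipliers": log2n // 2 - 1,
        "complex_memory_words": n - 1,
        "feedback_delays": feedback_delays,
    }
```

Under that assumption the delay lines sum to N − 1, which matches the memory figure in the abstract; for N = 256 the counts are 3 complex multipliers and 255 memory words.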

Patent•

Apparatus for performing an atomic add instructions

Edward S. Harriman Jr.

26 Sep 1996

TL;DR: The add circuit of a pipeline processor couples a carry-save-add circuit to a set of carry propagate adder circuits, which propagate in subsequent clock cycles the carry generated by the carry-save-add circuit.

Abstract: A pipeline processor having an add circuit configured to execute separate atomic add instructions in consecutive clock cycles, wherein each separate atomic add instruction can update the same memory address location. In one embodiment, the add circuit includes a carry-save-add circuit coupled to a set of carry propagate adder circuits. The carry-save-add circuit is configured to perform an add operation in one processor clock cycle and the set of carry propagate adder circuits are configured to propagate, in subsequent clock cycles, a carry generated by the carry-save-add circuit. The add circuit is further configured to feedforward partially propagated sums to the carry-save-add circuit as at least one operand for subsequent atomic add instructions. In one embodiment, the pipeline processor is implemented on a multitasking computer system architecture supporting multiple independent processors dedicated to processing data packets.
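
The idea of adding in one cycle and resolving carries later can be illustrated with a redundant (sum, carry) accumulator. This is a loose software sketch, not the patented circuit: the function names, the 3:2 compression step, and the fixed `width` are assumptions chosen for illustration.

```python
def carry_save_add(s: int, c: int, x: int, width: int) -> tuple:
    """Fold operand x into the (sum, carry) pair without carry propagation.

    A 3:2 compression per bit: the new sum is the XOR of the three inputs
    and the new carry is their majority, shifted one position left.
    The pair still represents s + c + x modulo 2**width.
    """
    mask = (1 << width) - 1
    new_s = (s ^ c ^ x) & mask
    new_c = (((s & c) | (s & x) | (c & x)) << 1) & mask
    return new_s, new_c


def propagate_step(s: int, c: int, width: int) -> tuple:
    """One later-cycle step that moves pending carries one stage along."""
    mask = (1 << width) - 1
    return (s ^ c) & mask, ((s & c) << 1) & mask


def resolve(s: int, c: int, width: int) -> int:
    """Repeat propagation steps until no carries remain (at most `width`)."""
    while c:
        s, c = propagate_step(s, c, width)
    return s
```

Back-to-back adds can reuse the partially resolved pair as input, which loosely mirrors the feedforward of partially propagated sums described in the abstract.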

Proceedings Article•DOI•

Prevention of Severe Slugging in the Dunbar 16" Multiphase Pipeline

A. Courbot

01 Jan 1996

Journal Article•DOI•

A new asynchronous pipeline scheme: application to the design of a self-timed ring divider

M. Renaudin¹, B.E. Hassan¹, A. Guyot•Institutions (1)

École nationale supérieure des télécommunications de Bretagne¹

01 Jul 1996-IEEE Journal of Solid-state Circuits

TL;DR: An efficient means of synchronizing and pipelining asynchronous circuits implemented using differential cascode voltage switch logic (DCVSL) precharged function blocks is described; the design methodology is based on a private asynchronous standard cell library that is fully compatible with an existing CMOS standard cell library provided by the foundry.

Abstract: This paper describes an efficient means of synchronizing and pipelining asynchronous circuits implemented using differential cascode voltage switch logic (DCVSL) precharged function blocks. A modified version of this logic, called LDCVSL (latch differential cascode voltage switch logic), which is similar to the LCDL (latched CMOS differential logic), or DCVSL with NORA-latch, is used to improve the storage capability of the precharged function blocks. Improving the storage capability of the building blocks allows the design of an efficient pipeline scheme which is described in detail. Following a description of its potential performance, the pipeline scheme is applied to the design of self-timed rings. It is shown that more compact ring structures can be obtained without loss of performance. Our design methodology is then presented. It is based on the use of a private asynchronous standard cell library, fully compatible with an existing CMOS standard cell library provided by the foundry. Our approach allows the rapid design of standard cell based asynchronous circuits. Finally, both the pipeline scheme and design approach are illustrated through the design of a 32-b self-timed ring divider. The division algorithm is first briefly presented. The chip architecture is then described with the results obtained after fabrication. The test chip has been fabricated using the CNET/SGS-Thomson 0.5 μm three metal layer technology. The 0.7 mm² chip computes 32-b divisions in 101 ns with a power consumption of 30 mW at a throughput of 10 million operations per second.

Patent•

Synchronous semiconductor memory device which allows switching of bit configuration

Hisashi Iwamoto¹, Yasuhiro Konishi¹•Institutions (1)

Mitsubishi¹

22 Oct 1996

TL;DR: The synchronous DRAM allows switching of bit configuration: it takes a 2-bit prefetch configuration in x8 configuration mode and a signal pipeline configuration in x16 configuration mode.

Abstract: A synchronous DRAM includes a selector which supplies 2 bits of serial data signals from one data input/output terminal to two input/output line pairs as parallel data signals in x8 configuration mode, and supplies 2 bits of parallel data signals from both data input/output terminals directly to two input/output line pairs in x16 configuration mode. Therefore, the synchronous DRAM allows switching of bit configuration: it takes a 2-bit prefetch configuration in x8 configuration mode and a signal pipeline configuration in x16 configuration mode.
