Journal ArticleDOI

SPIM: a pipelined 64×64-bit iterative multiplier

01 Apr 1989-IEEE Journal of Solid-state Circuits (IEEE)-Vol. 24, Iss: 2, pp 487-493
TL;DR: A 64×64-bit iterative multiplier, the Stanford pipelined iterative multiplier (SPIM), is presented. Its pipelined array consists of a small tree of 4:2 adders, a structure more regular than a Wallace tree and therefore better suited to a VLSI implementation.
Abstract: A 64×64-bit iterative multiplier, the Stanford pipelined iterative multiplier (SPIM), is presented. The pipelined array consists of a small tree of 4:2 adders. The 4:2 tree is better suited than a Wallace tree for a VLSI implementation because it is a more regular structure. A 4:2 carry-save accumulator at the bottom of the array iteratively accumulates partial products, allowing a partial array to be used, which reduces area. SPIM was fabricated in a 1.6-µm CMOS process. It has a core size of 3.8 mm × 6.5 mm and contains 41 000 transistors. The on-chip clock generator runs at an internal clock frequency of 85 MHz. The latency for a 64×64-bit fractional multiply is under 120 ns, with a pipeline rate of one multiply every 47 ns.
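The 4:2 adder (compressor) named in the abstract reduces four partial-product bits plus a lateral carry-in to a sum bit, a carry bit, and a carry-out. A minimal bit-level sketch, built from two cascaded full adders (one common construction; not necessarily SPIM's actual circuit):

```python
def full_adder(a, b, c):
    # Classic 3:2 counter: sum has weight 1, carry has weight 2.
    s = a ^ b ^ c
    cout = (a & b) | (b & c) | (a & c)
    return s, cout

def compressor_4_2(x1, x2, x3, x4, cin):
    # First full adder compresses x1..x3; its carry leaves as cout,
    # which is independent of cin (the key property for ripple-free columns).
    s1, cout = full_adder(x1, x2, x3)
    # Second full adder folds in x4 and the lateral carry-in.
    s, c = full_adder(s1, x4, cin)
    return s, c, cout  # s: weight 1; c and cout: weight 2
```

The invariant is that `s + 2*(c + cout)` always equals the number of 1s among the five inputs, so columns of these cells reduce four partial-product rows to two in carry-save form.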
Citations
Dissertation
01 Jan 1996
TL;DR: MATRIX, the first architecture to defer the binding of instruction resources until run-time, is developed, allowing the application to organize resources according to its needs; on sample image-processing tasks, MATRIX yields 10-20× the computational density of conventional processors.
Abstract: General-purpose computing devices allow us to (1) customize computation after fabrication and (2) conserve area by reusing expensive active circuitry for different functions in time. We define RP-space, a restricted domain of the general-purpose architectural space focused on reconfigurable computing architectures. Two dominant features differentiate reconfigurable from special-purpose architectures and account for most of the area overhead associated with RP devices: (1) instructions which tell the device how to behave, and (2) flexible interconnect which supports task-dependent dataflow between operations. We can characterize RP-space by the allocation and structure of these resources and compare the efficiencies of architectural points across broad application characteristics. Conventional FPGAs fall at one extreme end of this space and their efficiency ranges over two orders of magnitude across the space of application characteristics. Understanding RP-space and its consequences allows us to pick the best architecture for a task and to search for more robust design points in the space. Our DPGA, a fine-grained computing device which adds small, on-chip instruction memories to FPGAs, is one such design point. For typical logic applications and finite-state machines, a DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the DPGA, while reducing typical physical mapping times from hours to seconds. Rigid, fabrication-time organization of instruction resources significantly narrows the range of efficiency for conventional architectures. To avoid this performance brittleness, we developed MATRIX, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs.
Our focus MATRIX design point is based on an array of 8-bit ALU and register-file building blocks interconnected via a byte-wide network. With today's silicon, a single chip MATRIX array can deliver over 10 Gop/s (8-bit ops). On sample image processing tasks, we show that MATRIX yields 10-20× the computational density of conventional processors. Understanding the cost structure of RP-space helps us identify these intermediate architectural points and may provide useful insight more broadly in guiding our continual search for robust and efficient general-purpose computing structures. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

435 citations

Journal ArticleDOI
TL;DR: The proposed method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known, and it is easy to incorporate this method in silicon compilation or logic synthesis tools.
Abstract: This paper presents a method and an algorithm for generation of a parallel multiplier, which is optimized for speed. This method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment. It uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1-µm CMOS ASIC technology.
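The claim about a minimal number of cells in the partial product reduction tree can be put in context with Dadda's classic column-height sequence, which bounds how many 3:2 (full-adder) reduction stages an n-row partial-product matrix needs. This is a generic sketch of that bound for illustration, not the paper's own algorithm:

```python
def dadda_stages(n):
    # Dadda's sequence of maximum column heights: d1 = 2, d_{j+1} = floor(1.5 * d_j).
    # Reduction runs the sequence backwards: each 3:2 stage compresses the
    # matrix down to the next height, until only two rows remain for the
    # final carry-propagate adder.
    d, stages, heights = 2, 0, []
    while d < n:
        heights.append(d)
        d = d * 3 // 2
        stages += 1
    return stages, heights[::-1]  # stage count, and target heights largest-first
```

For a 64-row matrix this gives ten reduction stages (63, 42, 28, 19, 13, 9, 6, 4, 3, 2), which is why partial arrays with an iterative accumulator, as in SPIM, are attractive for large operand sizes.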

370 citations

Patent
01 Jul 2011
TL;DR: A method and apparatus for including in a processor instructions that perform multiply-add operations on packed data are described.
Abstract: A method and apparatus for including in a processor instructions for performing multiply-add operations on packed data. In one embodiment, a processor is coupled to a memory. The memory has stored therein a first packed data and a second packed data. The processor performs operations on data elements in said first packed data and said second packed data to generate a third packed data in response to receiving an instruction. At least two of the data elements in this third packed data store the result of performing multiply-add operations on data elements in the first and second packed data.
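The packed multiply-add described here, in the style of SIMD instructions such as x86 `PMADDWD`, multiplies corresponding elements and sums adjacent products into wider result lanes. A sketch of those semantics; the 16-bit lane width and four-lane count are illustrative choices, not fixed by the patent:

```python
def packed_multiply_add(a, b, elem_bits=16, lanes=4):
    # Interpret a and b as packed two's-complement elements.
    mask = (1 << elem_bits) - 1
    sign = 1 << (elem_bits - 1)

    def elems(x):
        out = []
        for i in range(lanes):
            v = (x >> (i * elem_bits)) & mask
            out.append(v - (1 << elem_bits) if v >= sign else v)
        return out

    ea, eb = elems(a), elems(b)
    # Each result element is the sum of two adjacent products.
    return [ea[i] * eb[i] + ea[i + 1] * eb[i + 1] for i in range(0, lanes, 2)]
```

For example, packing (1, 2, 3, 4) and (5, 6, 7, 8) yields the two sums 1·5 + 2·6 and 3·7 + 4·8.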

334 citations

Journal ArticleDOI
13 Feb 1991
TL;DR: The authors describe the design of a custom integrated circuit for the arithmetic operation of division that uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches.
Abstract: The authors describe the design of a custom integrated circuit for the arithmetic operation of division. The chip uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches. Internal stages form a ring that cycles without any external signaling. The self-timed control introduces no serial overhead, making the total chip latency equal just the combinational logic delays of the data elements. The ring's data path uses embedded completion encoding and generates the mantissa of a 54-b (floating-point IEEE double-precision) result. Fabricated in 1.2-µm CMOS, the ring occupies 7 mm² and generates a quotient and done indication in 45 to 160 ns, depending on the particular data operands.
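Dividers of this kind typically iterate a redundant quotient-digit recurrence in which each digit is selected from only a rough estimate of the partial remainder, keeping the per-stage logic short. A radix-2 SRT-style sketch for normalized fractional operands, illustrative only and not the chip's actual data path:

```python
def srt_divide(n, d, bits=16):
    # Radix-2 SRT-style recurrence: quotient digit in {-1, 0, 1}, chosen by
    # comparing the partial remainder against coarse thresholds, so no
    # full-width comparison is needed per step.
    assert 0.5 <= d < 1.0 and 0.0 <= n < d  # normalized fractions
    r, q = n, 0.0
    for i in range(1, bits + 1):
        if r >= 0.25:
            qd = 1
        elif r < -0.25:
            qd = -1
        else:
            qd = 0
        q += qd * 2.0 ** -i
        r = 2 * r - qd * d  # remainder stays bounded in [-d, d]
    return q
```

The redundancy in the digit set {-1, 0, 1} is what tolerates an imprecise remainder estimate; after `bits` iterations the accumulated quotient is within 2^-bits of n/d.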

205 citations

Journal ArticleDOI
TL;DR: In this article, a digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented.
Abstract: A voltage scaling technique for energy-efficient operation requires an adaptive power-supply regulator to significantly reduce dynamic power consumption in synchronous digital circuits. A digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented. This paper investigates the issues involved in designing a fully digital power converter and describes a design fabricated in a MOSIS 0.8-µm process. A variable-frequency digital controller design takes advantage of the power savings available through adaptive supply-voltage scaling and demonstrates converter efficiency greater than 90% over a dynamic range of regulated voltage levels.
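The control loop described here can be sketched as a bang-bang regulator: compare the ring oscillator's frequency against the target and nudge the supply up or down accordingly. The linear delay model `ring` below is a hypothetical stand-in for the real oscillator, and the step size and iteration count are illustrative, not taken from the paper:

```python
def regulate_supply(f_target, f_of_v, v=1.0, v_step=0.01, steps=200):
    # Bang-bang digital loop: a ring oscillator's frequency f_of_v(v) tracks
    # circuit speed at supply v. Raise Vdd when the ring runs slow, lower it
    # when it runs fast, so v settles at roughly the minimum meeting f_target.
    for _ in range(steps):
        if f_of_v(v) < f_target:
            v += v_step
        else:
            v -= v_step
    return v

# Hypothetical delay model for illustration: frequency in MHz vs. supply in volts.
ring = lambda v: 100.0 * (v - 0.5)
```

Once converged, the loop dithers within one step of the minimum supply, which is the source of the dynamic power savings: energy scales roughly with Vdd squared, so shaving excess margin pays off directly.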

193 citations


Cites background from "SPIM: a pipelined 64*64-bit iterati..."

  • ...Since this type of converter’s performance has been thoroughly documented in [11]–[16], the reader is referred to these references for detailed description and analysis of this and other power-converter designs....


References
Journal ArticleDOI
TL;DR: A design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step, using straightforward diode-transistor logic.
Abstract: It is suggested that the economics of present large-scale scientific computers could benefit from a greater investment in hardware to mechanize multiplication and division than is now common. As a move in this direction, a design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step. Using straightforward diode-transistor logic, it appears presently possible to obtain products in under 1 µsec, and quotients in 3 µsec. A rapid square-root process is also outlined. Approximate component counts are given for the proposed design, and it is found that the cost of the unit would be about 10 per cent of the cost of a modern large-scale computer.
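The one-gating-step multiplier described here can be modeled as carry-save reduction: generate all shifted partial products, then compress rows three at a time with full-adder logic until a single carry-propagate addition finishes the product. A bit-parallel Python sketch of the reduction idea, not the original diode-transistor design:

```python
def wallace_multiply(a, b, n=8):
    # Partial products: one shifted copy of a per set bit of b (n-bit operands).
    rows = [a << i for i in range(n) if (b >> i) & 1]
    # Carry-save reduction: each pass replaces three rows with a sum row and
    # a carry row, preserving the total while shrinking the row count by one.
    while len(rows) > 2:
        x, y, z = rows.pop(), rows.pop(), rows.pop()
        s = x ^ y ^ z                           # bitwise sum (weight 1)
        c = ((x & y) | (y & z) | (x & z)) << 1  # bitwise carry (weight 2)
        rows += [s, c]
    # One final carry-propagate addition of the two remaining rows.
    return sum(rows)
```

Because every compression pass is pure combinational logic and the passes can be laid out as a tree, the depth grows only logarithmically in the operand width, which is what makes sub-microsecond products plausible with the era's logic.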

1,750 citations

Journal ArticleDOI
TL;DR: The principal requirement for the Model 91 floating-point execution unit was that it be designed to support the instructionissuing rate of the processor, so separate, instruction-oriented algorithms for the add, multiply, and divide functions were developed.
Abstract: The principal requirement for the Model 91 floating-point execution unit was that it be designed to support the instructionissuing rate of the processor. The chosen solution was to develop separate, instruction-oriented algorithms for the add, multiply, and divide functions. Linked together by the floating-point instruction unit, the multiple execution units provide concurrent instruction execution at the burst rate of one instruction per cycle.

226 citations

Journal ArticleDOI
TL;DR: Three floating-point arithmetic chips have been implemented in 1.5-µm NMOS technology utilizing several novel circuit designs, and a method is presented for constructing balanced delay trees that have a better area-time product than binary trees.
Abstract: Three floating-point arithmetic chips have been implemented in 1.5-µm NMOS technology utilizing several novel circuit designs. The theories behind two of these are presented. A method is presented for constructing balanced delay trees that have a better area-time product than binary trees; one important application of these trees is in the construction of fast multipliers. Also presented is a technique for doing redundant digital division that lends itself to implementation in combinatorial VLSI.

54 citations

Proceedings ArticleDOI
01 Jan 1986
TL;DR: Three floating point arithmetic chips have been developed in a 1.5μm NMOS process and they are an adder, modified Wallace Tree multiplier, and a combinatorial divider.
Abstract: Three floating point arithmetic chips have been developed in a 1.5-µm NMOS process. They are an adder, a modified Wallace tree multiplier, and a combinatorial divider. Speed of scalar operation is 490 ns, 660 ns, and 1610 ns, respectively.

16 citations