Journal ArticleDOI

SPIM: a pipelined 64×64-bit iterative multiplier

01 Apr 1989-IEEE Journal of Solid-state Circuits (IEEE)-Vol. 24, Iss: 2, pp 487-493
TL;DR: A 64×64-bit iterative multiplier, the Stanford pipelined iterative multiplier (SPIM), is presented. Its pipelined array consists of a small tree of 4:2 adders, a structure more regular than a Wallace tree and therefore better suited to a VLSI implementation.
Abstract: A 64×64-bit iterative multiplier, the Stanford pipelined iterative multiplier (SPIM), is presented. The pipelined array consists of a small tree of 4:2 adders. The 4:2 tree is better suited than a Wallace tree for a VLSI implementation because it is a more regular structure. A 4:2 carry-save accumulator at the bottom of the array iteratively accumulates partial products, allowing a partial array to be used, which reduces area. SPIM was fabricated in a 1.6-µm CMOS process. It has a core size of 3.8 mm × 6.5 mm and contains 41 000 transistors. The on-chip clock generator runs at an internal clock frequency of 85 MHz. The latency for a 64×64-bit fractional multiply is under 120 ns, with a pipeline rate of one multiply every 47 ns.
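The 4:2 adder (compressor) named in the abstract reduces four partial-product bits plus a lateral carry-in to a sum bit, a carry bit, and a carry-out. A minimal bit-level sketch, built from two cascaded full adders (one common construction; not necessarily SPIM's actual circuit):

```python
def full_adder(a, b, c):
    # Classic 3:2 counter: sum has weight 1, carry has weight 2.
    s = a ^ b ^ c
    cout = (a & b) | (b & c) | (a & c)
    return s, cout

def compressor_4_2(x1, x2, x3, x4, cin):
    # First full adder compresses x1..x3; its carry leaves as cout,
    # which is independent of cin (the key property for ripple-free columns).
    s1, cout = full_adder(x1, x2, x3)
    # Second full adder folds in x4 and the lateral carry-in.
    s, c = full_adder(s1, x4, cin)
    return s, c, cout  # s: weight 1; c and cout: weight 2
```

The invariant is that `s + 2*(c + cout)` always equals the number of 1s among the five inputs, so columns of these cells reduce four partial-product rows to two in carry-save form.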
Citations
Dissertation
01 Jan 1996
TL;DR: MATRIX, the first architecture to defer the binding of instruction resources until run-time, is developed, allowing the application to organize resources according to its needs; on sample image-processing tasks, MATRIX yields 10-20× the computational density of conventional processors.
Abstract: General-purpose computing devices allow us to (1) customize computation after fabrication and (2) conserve area by reusing expensive active circuitry for different functions in time. We define RP-space, a restricted domain of the general-purpose architectural space focused on reconfigurable computing architectures. Two dominant features differentiate reconfigurable from special-purpose architectures and account for most of the area overhead associated with RP devices: (1) instructions which tell the device how to behave, and (2) flexible interconnect which supports task-dependent dataflow between operations. We can characterize RP-space by the allocation and structure of these resources and compare the efficiencies of architectural points across broad application characteristics. Conventional FPGAs fall at one extreme end of this space and their efficiency ranges over two orders of magnitude across the space of application characteristics. Understanding RP-space and its consequences allows us to pick the best architecture for a task and to search for more robust design points in the space. Our DPGA, a fine-grained computing device which adds small, on-chip instruction memories to FPGAs, is one such design point. For typical logic applications and finite-state machines, a DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the DPGA, while reducing typical physical mapping times from hours to seconds. Rigid, fabrication-time organization of instruction resources significantly narrows the range of efficiency for conventional architectures. To avoid this performance brittleness, we developed MATRIX, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs.
Our focus MATRIX design point is based on an array of 8-bit ALU and register-file building blocks interconnected via a byte-wide network. With today's silicon, a single chip MATRIX array can deliver over 10 Gop/s (8-bit ops). On sample image processing tasks, we show that MATRIX yields 10-20× the computational density of conventional processors. Understanding the cost structure of RP-space helps us identify these intermediate architectural points and may provide useful insight more broadly in guiding our continual search for robust and efficient general-purpose computing structures. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

435 citations

Journal ArticleDOI
TL;DR: The proposed method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known, and it is easy to incorporate this method in silicon compilation or logic synthesis tools.
Abstract: This paper presents a method and an algorithm for generation of a parallel multiplier, which is optimized for speed. This method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment. It uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1-µm CMOS ASIC technology.
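The claim about a minimal number of cells in the partial product reduction tree can be put in context with Dadda's classic column-height sequence, which bounds how many 3:2 (full-adder) reduction stages an n-row partial-product matrix needs. This is a generic sketch of that bound for illustration, not the paper's own algorithm:

```python
def dadda_stages(n):
    # Dadda's sequence of maximum column heights: d1 = 2, d_{j+1} = floor(1.5 * d_j).
    # Reduction runs the sequence backwards: each 3:2 stage compresses the
    # matrix down to the next height, until only two rows remain for the
    # final carry-propagate adder.
    d, stages, heights = 2, 0, []
    while d < n:
        heights.append(d)
        d = d * 3 // 2
        stages += 1
    return stages, heights[::-1]  # stage count, and target heights largest-first
```

For a 64-row matrix this gives ten reduction stages (63, 42, 28, 19, 13, 9, 6, 4, 3, 2), which is why partial arrays with an iterative accumulator, as in SPIM, are attractive for large operand sizes.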

370 citations

Patent
01 Jul 2011
TL;DR: A method and apparatus for including in a processor instructions that perform multiply-add operations on packed data are described.
Abstract: A method and apparatus for including in a processor instructions for performing multiply-add operations on packed data. In one embodiment, a processor is coupled to a memory. The memory has stored therein a first packed data and a second packed data. The processor performs operations on data elements in said first packed data and said second packed data to generate a third packed data in response to receiving an instruction. At least two of the data elements in this third packed data store the result of performing multiply-add operations on data elements in the first and second packed data.
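The packed multiply-add described here, in the style of SIMD instructions such as x86 `PMADDWD`, multiplies corresponding elements and sums adjacent products into wider result lanes. A sketch of those semantics; the 16-bit lane width and four-lane count are illustrative choices, not fixed by the patent:

```python
def packed_multiply_add(a, b, elem_bits=16, lanes=4):
    # Interpret a and b as packed two's-complement elements.
    mask = (1 << elem_bits) - 1
    sign = 1 << (elem_bits - 1)

    def elems(x):
        out = []
        for i in range(lanes):
            v = (x >> (i * elem_bits)) & mask
            out.append(v - (1 << elem_bits) if v >= sign else v)
        return out

    ea, eb = elems(a), elems(b)
    # Each result element is the sum of two adjacent products.
    return [ea[i] * eb[i] + ea[i + 1] * eb[i + 1] for i in range(0, lanes, 2)]
```

For example, packing (1, 2, 3, 4) and (5, 6, 7, 8) yields the two sums 1·5 + 2·6 and 3·7 + 4·8.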

334 citations

Journal ArticleDOI
13 Feb 1991
TL;DR: The authors describe the design of a custom integrated circuit for the arithmetic operation of division that uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches.
Abstract: The authors describe the design of a custom integrated circuit for the arithmetic operation of division. The chip uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches. Internal stages form a ring that cycles without any external signaling. The self-timed control introduces no serial overhead, making the total chip latency equal just the combinational logic delays of the data elements. The ring's data path uses embedded completion encoding and generates the mantissa of a 54-b (floating-point IEEE double-precision) result. Fabricated in 1.2-µm CMOS, the ring occupies 7 mm² and generates a quotient and done indication in 45 to 160 ns, depending on the particular data operands.
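Dividers of this kind typically iterate a redundant quotient-digit recurrence in which each digit is selected from only a rough estimate of the partial remainder, keeping the per-stage logic short. A radix-2 SRT-style sketch for normalized fractional operands, illustrative only and not the chip's actual data path:

```python
def srt_divide(n, d, bits=16):
    # Radix-2 SRT-style recurrence: quotient digit in {-1, 0, 1}, chosen by
    # comparing the partial remainder against coarse thresholds, so no
    # full-width comparison is needed per step.
    assert 0.5 <= d < 1.0 and 0.0 <= n < d  # normalized fractions
    r, q = n, 0.0
    for i in range(1, bits + 1):
        if r >= 0.25:
            qd = 1
        elif r < -0.25:
            qd = -1
        else:
            qd = 0
        q += qd * 2.0 ** -i
        r = 2 * r - qd * d  # remainder stays bounded in [-d, d]
    return q
```

The redundancy in the digit set {-1, 0, 1} is what tolerates an imprecise remainder estimate; after `bits` iterations the accumulated quotient is within 2^-bits of n/d.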

205 citations

Journal ArticleDOI
TL;DR: In this article, a digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented.
Abstract: A voltage scaling technique for energy-efficient operation requires an adaptive power-supply regulator to significantly reduce dynamic power consumption in synchronous digital circuits. A digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented. This paper investigates the issues involved in designing a fully digital power converter and describes a design fabricated in a MOSIS 0.8-µm process. A variable-frequency digital controller design takes advantage of the power savings available through adaptive supply-voltage scaling and demonstrates converter efficiency greater than 90% over a dynamic range of regulated voltage levels.
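The control loop described here can be sketched as a bang-bang regulator: compare the ring oscillator's frequency against the target and nudge the supply up or down accordingly. The linear delay model `ring` below is a hypothetical stand-in for the real oscillator, and the step size and iteration count are illustrative, not taken from the paper:

```python
def regulate_supply(f_target, f_of_v, v=1.0, v_step=0.01, steps=200):
    # Bang-bang digital loop: a ring oscillator's frequency f_of_v(v) tracks
    # circuit speed at supply v. Raise Vdd when the ring runs slow, lower it
    # when it runs fast, so v settles at roughly the minimum meeting f_target.
    for _ in range(steps):
        if f_of_v(v) < f_target:
            v += v_step
        else:
            v -= v_step
    return v

# Hypothetical delay model for illustration: frequency in MHz vs. supply in volts.
ring = lambda v: 100.0 * (v - 0.5)
```

Once converged, the loop dithers within one step of the minimum supply, which is the source of the dynamic power savings: energy scales roughly with Vdd squared, so shaving excess margin pays off directly.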

193 citations


Cites background from "SPIM: a pipelined 64*64-bit iterati..."

  • ...Since this type of converter’s performance has been thoroughly documented in [11]–[16], the reader is referred to these references for detailed description and analysis of this and other power-converter designs....


References
Journal ArticleDOI
TL;DR: A design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step, using straightforward diode-transistor logic.
Abstract: It is suggested that the economics of present large-scale scientific computers could benefit from a greater investment in hardware to mechanize multiplication and division than is now common. As a move in this direction, a design is developed for a multiplier which generates the product of two numbers using purely combinational logic, i.e., in one gating step. Using straightforward diode-transistor logic, it appears presently possible to obtain products in under 1 µsec, and quotients in 3 µsec. A rapid square-root process is also outlined. Approximate component counts are given for the proposed design, and it is found that the cost of the unit would be about 10 per cent of the cost of a modern large-scale computer.
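The one-gating-step multiplier described here can be modeled as carry-save reduction: generate all shifted partial products, then compress rows three at a time with full-adder logic until a single carry-propagate addition finishes the product. A bit-parallel Python sketch of the reduction idea, not the original diode-transistor design:

```python
def wallace_multiply(a, b, n=8):
    # Partial products: one shifted copy of a per set bit of b (n-bit operands).
    rows = [a << i for i in range(n) if (b >> i) & 1]
    # Carry-save reduction: each pass replaces three rows with a sum row and
    # a carry row, preserving the total while shrinking the row count by one.
    while len(rows) > 2:
        x, y, z = rows.pop(), rows.pop(), rows.pop()
        s = x ^ y ^ z                           # bitwise sum (weight 1)
        c = ((x & y) | (y & z) | (x & z)) << 1  # bitwise carry (weight 2)
        rows += [s, c]
    # One final carry-propagate addition of the two remaining rows.
    return sum(rows)
```

Because every compression pass is pure combinational logic and the passes can be laid out as a tree, the depth grows only logarithmically in the operand width, which is what makes sub-microsecond products plausible with the era's logic.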

1,750 citations

Journal ArticleDOI
TL;DR: The principal requirement for the Model 91 floating-point execution unit was that it be designed to support the instructionissuing rate of the processor, so separate, instruction-oriented algorithms for the add, multiply, and divide functions were developed.
Abstract: The principal requirement for the Model 91 floating-point execution unit was that it be designed to support the instructionissuing rate of the processor. The chosen solution was to develop separate, instruction-oriented algorithms for the add, multiply, and divide functions. Linked together by the floating-point instruction unit, the multiple execution units provide concurrent instruction execution at the burst rate of one instruction per cycle.

226 citations

Journal ArticleDOI
TL;DR: Three floating-point arithmetic chips have been implemented in 1.5-µm NMOS technology utilizing several novel circuit designs, and a method is presented for constructing balanced delay trees that have a better area-time product than binary trees.
Abstract: Three floating-point arithmetic chips have been implemented in 1.5-µm NMOS technology utilizing several novel circuit designs. The theories behind two of these are presented. A method is presented for constructing balanced delay trees that have a better area-time product than binary trees; one important application of these trees is in the construction of fast multipliers. Also presented is a technique for doing redundant digital division that lends itself to implementation in combinatorial VLSI.

54 citations

Proceedings ArticleDOI
01 Jan 1986
TL;DR: Three floating point arithmetic chips have been developed in a 1.5μm NMOS process and they are an adder, modified Wallace Tree multiplier, and a combinatorial divider.
Abstract: Three floating point arithmetic chips have been developed in a 1.5-µm NMOS process. They are an adder, a modified Wallace tree multiplier, and a combinatorial divider. Speed of scalar operation is 490 ns, 660 ns, and 1610 ns, respectively.

16 citations