Author

M.R. Santoro

Bio: M.R. Santoro is an academic researcher from Stanford University. The author has contributed to research in topics: Multiplier (economics) & Clock rate. The author has an h-index of 1 and has co-authored 1 publication receiving 125 citations.

Topics: Multiplier (economics), Clock rate, Wallace tree, Clock generator

Papers

Journal Article•DOI•

SPIM: a pipelined 64×64-bit iterative multiplier

M.R. Santoro¹, Mark Horowitz¹

Stanford University¹

01 Apr 1989-IEEE Journal of Solid-state Circuits

TL;DR: The Stanford pipelined iterative multiplier (SPIM), a 64×64-bit iterative multiplier, is presented. It consists of a small tree of 4:2 adders, which is better suited than a Wallace tree for a VLSI implementation because it is a more regular structure.

Abstract: A 64×64-bit iterating multiplier, the Stanford pipelined iterative multiplier (SPIM), is presented. The pipelined array consists of a small tree of 4:2 adders. The 4:2 tree is better suited than a Wallace tree for a VLSI implementation because it is a more regular structure. A 4:2 carry-save accumulator at the bottom of the array is used to iteratively accumulate partial products, allowing a partial array to be used, which reduces area. SPIM was fabricated in a 1.6-μm CMOS process. It has a core size of 3.8 mm × 6.5 mm and contains 41000 transistors. The on-chip clock generator runs at an internal clock frequency of 85 MHz. The latency for a 64×64-bit fractional multiply is under 120 ns, with a pipeline rate of one multiply every 47 ns.
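
The carry-save accumulation mentioned in the abstract can be illustrated with a small word-level sketch. This is not the SPIM design: it assumes unsigned operands, feeds only two plain (non-Booth) partial products per step, ignores pipelining, and models the 4:2 adder as two chained 3:2 carry-save adders. The function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """3:2 carry-save adder on whole words: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z                            # bitwise sum without carry propagation
    c = ((x & y) | (x & z) | (y & z)) << 1   # majority bits, shifted to the next column
    return s, c


def compress_4_2(a, b, c, d):
    """4:2 compression modeled as two chained 3:2 adders (an assumption of this sketch).

    Returns (s, k) with s + k == a + b + c + d.
    """
    s1, c1 = carry_save_add(a, b, c)
    return carry_save_add(s1, c1, d)


def iterative_multiply(a, b, width=64):
    """Multiply two unsigned width-bit integers by iterative carry-save accumulation.

    Each step compresses the running (sum, carry) pair together with two new
    partial products. Only the final step performs a carry-propagate addition.
    Assumes width is even.
    """
    acc_sum, acc_carry = 0, 0
    for i in range(0, width, 2):
        pp0 = (a << i) if (b >> i) & 1 else 0
        pp1 = (a << (i + 1)) if (b >> (i + 1)) & 1 else 0
        acc_sum, acc_carry = compress_4_2(acc_sum, acc_carry, pp0, pp1)
    return acc_sum + acc_carry               # single carry-propagate add at the end


if __name__ == "__main__":
    x, y = 0xFFFF_FFFF_FFFF_FFFF, 0x1234_5678_9ABC_DEF0
    print(iterative_multiply(x, y) == x * y)
```

Because the accumulator stays in redundant (sum, carry) form, the same small compressor can be reused on every iteration, which is the area saving the abstract attributes to using a partial array.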

127 citations

Cited by

Dissertation•

Reconfigurable Architectures for General-Purpose Computing

André DeHon, Thomas F. Knight

01 Jan 1996

TL;DR: MATRIX is developed, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs, and it is shown that MATRIX yields 10-20$\times$ the computational density of conventional processors.

Abstract: General-purpose computing devices allow us to (1) customize computation after fabrication and (2) conserve area by reusing expensive active circuitry for different functions in time. We define RP-space, a restricted domain of the general-purpose architectural space focussed on reconfigurable computing architectures. Two dominant features differentiate reconfigurable from special-purpose architectures and account for most of the area overhead associated with RP devices: (1) instructions which tell the device how to behave, and (2) flexible interconnect which supports task dependent dataflow between operations. We can characterize RP-space by the allocation and structure of these resources and compare the efficiencies of architectural points across broad application characteristics. Conventional FPGAs fall at one extreme end of this space and their efficiency ranges over two orders of magnitude across the space of application characteristics. Understanding RP-space and its consequences allows us to pick the best architecture for a task and to search for more robust design points in the space. Our DPGA, a fine-grained computing device which adds small, on-chip instruction memories to FPGAs is one such design point. For typical logic applications and finite-state machines, a DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the DPGA, while reducing typical physical mapping times from hours to seconds. Rigid, fabrication-time organization of instruction resources significantly narrows the range of efficiency for conventional architectures. To avoid this performance brittleness, we developed MATRIX, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs. 
Our focus MATRIX design point is based on an array of 8-bit ALU and register-file building blocks interconnected via a byte-wide network. With today's silicon, a single chip MATRIX array can deliver over 10 Gop/s (8-bit ops). On sample image processing tasks, we show that MATRIX yields 10-20$\times$ the computational density of conventional processors. Understanding the cost structure of RP-space helps us identify these intermediate architectural points and may provide useful insight more broadly in guiding our continual search for robust and efficient general-purpose computing structures. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

435 citations

Journal Article•DOI•

A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach

Vojin G. Oklobdzija, D. Villeger¹, S.S. Liu²

École Normale Supérieure¹, Advanced Micro Devices²

01 Mar 1996-IEEE Transactions on Computers

TL;DR: The proposed method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known, and it is easy to incorporate this method in silicon compilation or logic synthesis tools.

Abstract: This paper presents a method and an algorithm for generation of a parallel multiplier, which is optimized for speed. This method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment. It uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1-μm CMOS ASIC technology.

370 citations

Patent•

Method and apparatus for performing multiply-add operations on packed data

Alexander D. Peleg¹, Millind Mittal¹, Larry M. Mennemeier¹, Benny Eitan¹, Carole Dulong¹, Eiichi Kowashi¹, Wolf Witt¹

Intel¹

01 Jul 2011

TL;DR: A method and apparatus for including in a processor instructions for performing multiply-add operations on packed data are described.

Abstract: A method and apparatus for including in a processor instructions for performing multiply-add operations on packed data are described. In one embodiment, a processor is coupled to a memory. The memory has stored therein a first packed data and a second packed data. The processor performs operations on data elements in said first packed data and said second packed data to generate a third packed data in response to receiving an instruction. At least two of the data elements in this third packed data store the result of performing multiply-add operations on data elements in the first and second packed data.
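
A packed multiply-add of the kind the abstract describes can be sketched as follows. The abstract does not give element widths or pairing rules, so the choices here (signed 16-bit inputs, adjacent pairs multiplied and summed into wrapping signed 32-bit results) are assumptions for illustration, and `packed_multiply_add` is a hypothetical name rather than anything taken from the patent.

```python
def wrap_signed32(value):
    """Wrap an integer into the signed 32-bit range (two's complement)."""
    return ((value + 2**31) % 2**32) - 2**31


def packed_multiply_add(first, second):
    """Multiply corresponding elements and add adjacent products.

    first, second: equal-length lists with an even number of signed 16-bit values
    (the element width is an assumption of this sketch).
    Returns a list half as long, one signed 32-bit result per input pair.
    """
    if len(first) != len(second) or len(first) % 2 != 0:
        raise ValueError("inputs must have the same even length")
    for v in first + second:
        if not -2**15 <= v < 2**15:
            raise ValueError("elements must fit in signed 16 bits")

    result = []
    for i in range(0, len(first), 2):
        # Two products are formed and summed into one wider result element.
        total = first[i] * second[i] + first[i + 1] * second[i + 1]
        result.append(wrap_signed32(total))
    return result


if __name__ == "__main__":
    print(packed_multiply_add([1, 2, 3, 4], [5, 6, 7, 8]))
```

With four elements per operand, the third packed data holds two results, matching the abstract's statement that at least two of its data elements store multiply-add results.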

334 citations

Journal Article•DOI•

A zero-overhead self-timed 160-ns 54-b CMOS divider

T.E. Williams¹, Mark Horowitz

Stanford University¹

13 Feb 1991

TL;DR: The authors describe the design of a custom integrated circuit for the arithmetic operation of division that uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches.

Abstract: The authors describe the design of a custom integrated circuit for the arithmetic operation of division. The chip uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches. Internal stages form a ring that cycles without any external signaling. The self-timed control introduces no serial overhead, making the total chip latency equal just the combinational logic delays of the data elements. The ring's data path uses embedded completion encoding and generates the mantissa of a 54-b (floating-point IEEE double-precision) result. Fabricated in 1.2-μm CMOS, the ring occupies 7 mm² and generates a quotient and done indication in 45 to 160 ns, depending on the particular data operands.

205 citations

Journal Article•DOI•

A fully digital, energy-efficient, adaptive power-supply regulator

Gu-Yeon Wei¹, Mark Horowitz¹

Stanford University¹

01 Apr 1999-IEEE Journal of Solid-state Circuits

TL;DR: In this article, a digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented.

Abstract: A voltage scaling technique for energy-efficient operation requires an adaptive power-supply regulator to significantly reduce dynamic power consumption in synchronous digital circuits. A digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented. This paper investigates the issues involved in designing a fully digital power converter and describes a design fabricated in a MOSIS 0.8-μm process. A variable-frequency digital controller design takes advantage of the power savings available through adaptive supply-voltage scaling and demonstrates converter efficiency greater than 90% over a dynamic range of regulated voltage levels.

193 citations
