scispace - formally typeset
Search or ask a question
Author

M.R. Santoro

Bio: M.R. Santoro is an academic researcher from Stanford University. The author has contributed to research in topics: Multiplier (economics) & Clock rate. The author has an hindex of 1, co-authored 1 publications receiving 125 citations.

Papers
More filters
Journal ArticleDOI
TL;DR: A 64*64-bit iterating multiplier, the Stanford pipelined iterative multiplier (SPIM), is presented, which consists of a small tree of 4:2 adders that is better suited than a Wallace tree for a VLSI implementation because it is a more regular structure.
Abstract: A 64*64-bit iterating multiplier, the Stanford pipelined iterative multiplier (SPIM), is presented. The pipelined array consists of a small tree of 4:2 adders. The 4:2 tree is better suited than a Wallace tree for a VLSI implementation because it is a more regular structure. A 4:2 carry-save accumulator at the bottom of the array is used to iteratively accumulate partial products, allowing a partial array to be used, which reduces area. SPIM was fabricated in a 1.6- mu m CMOS process. It has a core size of 3.8 mm*6.5 mm and contains 41000 transistors. The on-chip clock generator runs at an internal clock frequency of 85 MHz. The latency for a 64*64-bit fractional multiply is under 120 ns, with a pipeline rate of one multiply every 47 ns. >

127 citations


Cited by
More filters
Dissertation
01 Jan 1996
TL;DR: MATRIX is developed, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs, and it is shown that MATRIX yields 10-20$\times the computational density of conventional processors.
Abstract: General-purpose computing devices allow us to (1) customize computation after fabrication and (2) conserve area by reusing expensive active circuitry for different functions in time. We define RP-space, a restricted domain of the general-purpose architectural space focussed on reconfigurable computing architectures. Two dominant features differentiate reconfigurable from special-purpose architectures and account for most of the area overhead associated with RP devices: (1) instructions which tell the device how to behave, and (2) flexible interconnect which supports task dependent dataflow between operations. We can characterize RP-space by the allocation and structure of these resources and compare the efficiencies of architectural points across broad application characteristics. Conventional FPGAs fall at one extreme end of this space and their efficiency ranges over two orders of magnitude across the space of application characteristics. Understanding RP-space and its consequences allows us to pick the best architecture for a task and to search for more robust design points in the space. Our DPGA, a fine-grained computing device which adds small, on-chip instruction memories to FPGAs is one such design point. For typical logic applications and finite-state machines, a DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the DPGA, while reducing typical physical mapping times from hours to seconds. Rigid, fabrication-time organization of instruction resources significantly narrows the range of efficiency for conventional architectures. To avoid this performance brittleness, we developed MATRIX, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs. Our focus MATRIX design point is based on an array of 8-bit ALU and register-file building blocks interconnected via a byte-wide network. With today's silicon, a single chip MATRIX array can deliver over 10 Gop/s (8-bit ops). On sample image processing tasks, we show that MATRIX yields 10-20$\times$ the computational density of conventional processors. Understanding the cost structure of RP-space helps us identify these intermediate architectural points and may provide useful insight more broadly in guiding our continual search for robust and efficient general-purpose computing structures. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

435 citations

Journal ArticleDOI
TL;DR: The proposed method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known, and it is easy to incorporate this method in silicon compilation or logic synthesis tools.
Abstract: This paper presents a method and an algorithm for generation of a parallel multiplier, which is optimized for speed. This method is applicable to any multiplier size and adaptable to any technology for which speed parameters are known. Most importantly, it is easy to incorporate this method in silicon compilation or logic synthesis tools. The parallel multiplier produced by the proposed method outperforms other schemes used for comparison in our experiment. It uses the minimal number of cells in the partial product reduction tree. These findings are tested on design examples simulated in 1 /spl mu/ CMOS ASIC technology.

370 citations

Patent
01 Jul 2011
TL;DR: In this article, a method and apparatus for including in a processor instructions for performing multiply-add operations on packed data is described. But it is not shown how to include such instructions in the instructions themselves.
Abstract: A method and apparatus for including in a processor instructions for performing multiply-add operations on packed data. In one embodiment, a processor is coupled to a memory. The memory has stored therein a first packed data and a second packed data. The processor performs operations on data elements in said first packed data and said second packed data to generate a third packed data in response to receiving an instruction. At least two of the data elements in this third packed data storing the result of performing multiply-add operations on data elements in the first and second packed data.

334 citations

Journal ArticleDOI
13 Feb 1991
TL;DR: The authors describe the design of a custom integrated circuit for the arithmetic operation of division that uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches.
Abstract: The authors describe the design of a custom integrated circuit for the arithmetic operation of division. The chip uses self-timing to avoid the need for high-speed clocks and directly concatenates precharged function blocks without latches. Internal stages form a ring that cycles without any external signaling. The self-timed control introduces no serial overhead, making the total chip latency equal just the combinational logic delays of the data elements. The ring's data path uses embedded completion encoding and generates the mantissa of a 54-b (floating-point IEEE double-precision) result. Fabricated in 1.2- mu m CMOS, the ring occupies 7 mm/sup 2/ and generates a quotient and done indication in 45 to 160 ns, depending on the particular data operands. >

205 citations

Journal ArticleDOI
TL;DR: In this article, a digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented.
Abstract: A voltage scaling technique for energy-efficient operation requires an adaptive power-supply regulator to significantly reduce dynamic power consumption in synchronous digital circuits. A digitally controlled power converter that dynamically tracks circuit performance with a ring oscillator and regulates the supply voltage to the minimum required to operate at a desired frequency is presented. This paper investigates the issues involved in designing a fully digital power converter and describes a design fabricated in a MOSIS 0.8-/spl mu/m process. A variable-frequency digital controller design takes advantage of the power savings available through adaptive supply-voltage scaling and demonstrates converter efficiency greater than 90% over a dynamic range of regulated voltage levels.

193 citations