scispace - formally typeset
Search or ask a question
Author

M. Nakajima

Bio: M. Nakajima is an academic researcher from Fujitsu. The author has contributed to research in topics: Tree structure & Wallace tree. The author has an hindex of 1, co-authored 1 publications receiving 134 citations.

Papers
More filters
Journal ArticleDOI
G. Goto1, T. Sato1, M. Nakajima1, T. Sukemura1
TL;DR: By using recurring wire shifters, the authors can expand the level of repeated blocks to cover the entire adder tree, which simplifies the complicated Wallace tree wiring scheme.
Abstract: A 54-b*54-b parallel multiplier was implemented in 0.88- mu m CMOS using the new, regularly structured tree (RST) design approach. The circuit is basically a Wallace tree, but the tree and the set of partial-product-bit generators are combined into a recurring block which generates seven partial-product bits and compresses them to a pair of bits for the sum and carry signals. This block is used repeatedly to construct an RST block in which even wiring among blocks included in wire shifters is designed as recurring units. By using recurring wire shifters, the authors can expand the level of repeated blocks to cover the entire adder tree, which simplifies the complicated Wallace tree wiring scheme. In addition, to design time savings, layout density is increased by 70% to 6400 transistors/mm/sup 2/, and the multiplication time is decreased by 30% to 13 ns. >

138 citations


Cited by
More filters
Dissertation
01 Jan 1996
TL;DR: MATRIX is developed, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs, and it is shown that MATRIX yields 10-20$\times the computational density of conventional processors.
Abstract: General-purpose computing devices allow us to (1) customize computation after fabrication and (2) conserve area by reusing expensive active circuitry for different functions in time. We define RP-space, a restricted domain of the general-purpose architectural space focussed on reconfigurable computing architectures. Two dominant features differentiate reconfigurable from special-purpose architectures and account for most of the area overhead associated with RP devices: (1) instructions which tell the device how to behave, and (2) flexible interconnect which supports task dependent dataflow between operations. We can characterize RP-space by the allocation and structure of these resources and compare the efficiencies of architectural points across broad application characteristics. Conventional FPGAs fall at one extreme end of this space and their efficiency ranges over two orders of magnitude across the space of application characteristics. Understanding RP-space and its consequences allows us to pick the best architecture for a task and to search for more robust design points in the space. Our DPGA, a fine-grained computing device which adds small, on-chip instruction memories to FPGAs is one such design point. For typical logic applications and finite-state machines, a DPGA can implement tasks in one-third the area of a traditional FPGA. TSFPGA, a variant of the DPGA which focuses on heavily time-switched interconnect, achieves circuit densities close to the DPGA, while reducing typical physical mapping times from hours to seconds. Rigid, fabrication-time organization of instruction resources significantly narrows the range of efficiency for conventional architectures. To avoid this performance brittleness, we developed MATRIX, the first architecture to defer the binding of instruction resources until run-time, allowing the application to organize resources according to its needs. Our focus MATRIX design point is based on an array of 8-bit ALU and register-file building blocks interconnected via a byte-wide network. With today's silicon, a single chip MATRIX array can deliver over 10 Gop/s (8-bit ops). On sample image processing tasks, we show that MATRIX yields 10-20$\times$ the computational density of conventional processors. Understanding the cost structure of RP-space helps us identify these intermediate architectural points and may provide useful insight more broadly in guiding our continual search for robust and efficient general-purpose computing structures. (Copies available exclusively from MIT Libraries, Rm. 14-0551, Cambridge, MA 02139-4307. Ph. 617-253-5668; Fax 617-253-1690.)

435 citations

Journal ArticleDOI
TL;DR: The results show that the area, delay, and power dissipation are improved by LEAP and that the value-cost ratio is improved by a factor of three, demonstrating that LEAP has the potential to achieve a quantum leap in value of LSI's while reducing the cost.
Abstract: The pass-transistor based cell library and synthesis tool are constructed, for the first time, to clarify the potential of top-down pass-transistor logic. The entire scheme is called LEAP (Lean Integration with Pass-Transistors). The feature of a pass-transistor based cell is its multiplexer function and the open-drain structure. This cell has the flexibility of transistor level circuit design and compatibility with conventional cell based design. An extremely simple cell library with only seven cells combined with a synthesis tool called "circuit inventor" is compared with the conventional CMOS library that has over 60 cells combined with the state-of-the-art logic synthesis. The results show that the area, delay, and power dissipation are improved by LEAP and that the value-cost ratio is improved by a factor of three. This demonstrates that LEAP has the potential to achieve a quantum leap in value of LSI's while reducing the cost. Key issues which have to be cleared before pass transistor logic is used as the generic logic scheme replacing CMOS are also discussed.

270 citations

Journal ArticleDOI
TL;DR: A new modified Booth encoding (MBE) scheme is proposed to improve the performance of traditional MBE schemes and a new algorithm is developed to construct multiple-level conditional-sum adder (MLCSMA).
Abstract: This paper presents a design methodology for high-speed Booth encoded parallel multiplier. For partial product generation, we propose a new modified Booth encoding (MBE) scheme to improve the performance of traditional MBE schemes. For final addition, a new algorithm is developed to construct multiple-level conditional-sum adder (MLCSMA). The proposed algorithm can optimize final adder according to the given cell properties and input delay profile. Compared with a binary tree-based conditional-sum adder, the speed performance improvement is up to 25 percent. On average, the design developed herein reduces the total delay by 8 percent for parallel multiplier. The whole design has been verified by gate level simulation.

263 citations

Journal ArticleDOI
TL;DR: A 54/spl times/54-b multiplier using pass-transistor multiplexers has been fabricated by 0.25 /spl mu/m CMOS technology and a new 4-2 compressor and a carry lookahead adder (CLA) have been developed to enhance the speed performance.
Abstract: A 54/spl times/54-b multiplier using pass-transistor multiplexers has been fabricated by 0.25 /spl mu/m CMOS technology. To enhance the speed performance, a new 4-2 compressor and a carry lookahead adder (CLA), both featuring pass-transistor multiplexers, have been developed. The new circuits have a speed advantage over conventional CMOS circuits because the number of critical-path gate stages is minimized due to the high logic functionality of pass-transistor multiplexers. The active size of the 54/spl times/54-b multiplier is 3.77/spl times/3.41 mm. The multiplication time is 4.4 ns at a 3.5-V power supply. >

179 citations

01 Jan 1994
TL;DR: Methods of implementing binary multiplication with the smallest possible latency are investigated, and traditional Booth encoded multipliers are superior in layout area, power, and delay to non-Booth encode multipliers.
Abstract: This thesis investigates methods of implementing binary multiplication with the smallest possible latency. The principle area of concentration is on multipliers with lengths of 53 bits, which makes the results suitable for IEEE-754 double precision multiplication. Low latency demands high performance circuitry, and small physical size to limit propagation delays. VLSI implementations are the only available means for meeting these two requirements, but efficient algorithms are also crucial. An extension to Booth's algorithm for multiplication (redundant Booth) has been developed, which represents partial products in a partially redundant form. This redundant representation can reduce or eliminate the time required to produce "hard" multiples (multiples that require a carry propagate addition) required by the traditional higher order Booth algorithms. This extension reduces the area and power requirements of fully parallel implementations, but is also as fast as any multiplication method yet reported. In order to evaluate various multiplication algorithms, a software tool has been developed which automates the layout and optimization of parallel multiplier trees. The tool takes into consideration wire and asymmetric input delays, as well as gate delays, as the tree is built. The tool is used to design multipliers based upon various algorithms, using both Booth encoded, non-Booth encoded and the new extended Booth algorithms. The designs are then compared on the basis of delay, power, and area. For maximum speed, the designs are based upon a 0.6$\mu$ BiCMOS process using emitter coupled logic (ECL). The algorithms developed in this thesis make possible 53 x 53 multipliers with a latency of less than 2.6 nanoseconds @ 10.5 Watts and a layout area of 13mm$\sp2$. Smaller and lower power designs are also possible, as illustrated by an example with a latency of 3.6 nanoseconds @ 5.8 W, and an area of 8.9mm$\sp2$. The conclusions based upon ECL designs are extended where possible to other technologies (CMOS). Crucial to the performance of multipliers are high speed carry propagate adders. A number of high speed adder designs have been developed, and the algorithms and design of these adders are discussed. The implementations developed for this study indicate that traditional Booth encoded multipliers are superior in layout area, power, and delay to non-Booth encoded multipliers. Redundant Booth encoding further reduces the area and power requirements. Finally, only half of the total multiplier delay was found to be due to the summation of the partial products. The remaining delay was due to wires and carry propagate adder delays.

158 citations