Journal Article

Area-time-power tradeoffs in parallel adders

TL;DR: A uniform static CMOS layout methodology, with short-circuit power minimization as the optimization criterion, is adopted, and a large adder design space is formulated from which an architect can choose an adder with the desired characteristics.
Abstract: In this paper, several classes of parallel, synchronous adders are surveyed based on their power, delay and area characteristics. The adders studied include the linear-time ripple-carry and Manchester carry-chain adders, the square-root-time carry-skip and carry-select adders, the logarithmic-time carry-lookahead adder and its variations, and the constant-time signed-digit and carry-save adders. Most of the research in the last few decades has concentrated on reducing the delay of addition. With the rising popularity of portable computers, however, the emphasis has shifted to both high-speed and low-power operation. In this paper we adopt a uniform static CMOS layout methodology in which short-circuit power minimization is used as the optimization criterion. The relative merits of the different adders are evaluated by performing a detailed transistor-level simulation of the adders using HSPICE. Among the two's complement adders, a variation of the carry-lookahead adder, called ELM, was found to have the best power-delay product. Based on the results of our experiments, a large adder design space is formulated from which an architect can choose an adder with the desired characteristics.
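
The abstract groups adders by how their delay grows with word length. As a rough illustration of the square-root-time idea, the Python sketch below models a carry-select adder behaviourally: each block precomputes its sum for a carry-in of 0 and of 1, and the incoming carry only selects between them. Bit widths, block size and all function names are our assumptions; the model checks logic only, not the power or delay figures reported in the paper.

```python
def to_bits(x, n):
    """Little-endian list of the n low-order bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def from_bits(bits):
    """Integer value of a little-endian bit list."""
    return sum(bit << i for i, bit in enumerate(bits))


def ripple_add(a_bits, b_bits, cin):
    """Ripple-carry addition of two equal-length bit lists; returns (sum_bits, carry_out)."""
    out = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def carry_select_add(x, y, n=16, block=4):
    """n-bit carry-select addition: every block precomputes sums for carry-in 0 and 1,
    then the real incoming carry selects one of them (one multiplexer per block)."""
    a, b = to_bits(x, n), to_bits(y, n)
    result = []
    carry = 0
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        # In hardware both candidates are computed in parallel; only the select waits on 'carry'.
        sum0, c0 = ripple_add(a[lo:hi], b[lo:hi], 0)
        sum1, c1 = ripple_add(a[lo:hi], b[lo:hi], 1)
        result += sum1 if carry else sum0
        carry = c1 if carry else c0
    return from_bits(result), carry
```

For example, carry_select_add(0xFFFF, 0x0001) returns (0, 1). With uniform blocks of b bits the carry path is roughly one block ripple plus n/b selects, which is smallest near b ≈ √n; the duplicated block adders are the area and power price.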

Citations
Journal Article
TL;DR: It is shown that these proposed Bio-inspired Imprecise Computational blocks (BICs) can be exploited to efficiently implement a three-layer face recognition neural network and the hardware defuzzification block of a fuzzy processor.
Abstract: The conventional digital hardware computational blocks with different structures are designed to compute the precise results of the assigned calculations. The main contribution of our proposed Bio-inspired Imprecise Computational blocks (BICs) is that they are designed to provide an applicable estimation of the result instead of its precise value at a lower cost. These novel structures are more efficient in terms of area, speed, and power consumption with respect to their precise rivals. Complete descriptions of sample BIC adder and multiplier structures as well as their error behaviors and synthesis results are introduced in this paper. It is then shown that these BIC structures can be exploited to efficiently implement a three-layer face recognition neural network and the hardware defuzzification block of a fuzzy processor.
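
The abstract does not describe the internals of the BIC adder. As one illustration of the general idea of trading exactness for cost, the Python sketch below models a lower-part-OR style approximate adder; treating it as representative of the BIC adder is our assumption, and the name loa_add and the split point k are illustrative.

```python
def loa_add(x, y, k=4):
    """Approximate addition of non-negative integers x and y.

    The k low-order bits are ORed instead of added; the exact upper adder gets
    as carry-in the AND of bit k-1 of both operands. With this choice the result
    differs from x + y by at most 2**(k - 1).
    """
    assert k >= 1
    lower = (x | y) & ((1 << k) - 1)            # cheap OR gates, no carry chain
    cin = (x >> (k - 1)) & (y >> (k - 1)) & 1   # a single AND gate guesses the carry
    upper = (x >> k) + (y >> k) + cin           # exact addition of the high part
    return (upper << k) | lower
```

For instance, loa_add(7, 1) returns 7 instead of 8. The error bound follows from x + y = (x | y) + (x & y) on the low part: the guessed carry over- or under-counts the dropped AND term by at most 2^(k-1).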

458 citations


Cites background from "Area-time-power tradeoffs in parall..."

  • ...A comparison of the unit-gate model area/delay results with precise transistor-level synthesis results [31] shows that this model yields acceptable relative accuracy at a high abstraction level using only simple analytic calculations [30]....
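
The unit-gate model mentioned in this excerpt estimates area and delay by counting gates. A minimal Python sketch, assuming the usual convention that two-input AND and OR gates cost one unit of area and delay while an XOR costs two, applied to a ripple-carry adder; the cost table and the full-adder decomposition are our assumptions, not taken from the citing paper.

```python
# Assumed unit-gate costs: (area, delay) per two-input gate type.
UNIT_GATE = {"and": (1, 1), "or": (1, 1), "xor": (2, 2)}


def ripple_carry_cost(n):
    """Unit-gate area and worst-case delay of an n-bit ripple-carry adder built from
    full adders with p = a ^ b, s = p ^ c, c_out = (a & b) | (p & c).
    All operand bits and the carry-in are assumed to arrive at time 0."""
    and_a, and_d = UNIT_GATE["and"]
    or_a, or_d = UNIT_GATE["or"]
    xor_a, xor_d = UNIT_GATE["xor"]
    area = n * (2 * xor_a + 2 * and_a + or_a)  # per bit: 2 XOR, 2 AND, 1 OR
    carry_t = 0  # arrival time of the carry into the current bit
    latest = 0   # latest arrival over all sum and carry outputs
    for _ in range(n):
        p_t = xor_d                                           # propagate p = a ^ b
        g_t = and_d                                           # generate g = a & b
        s_t = max(p_t, carry_t) + xor_d                       # sum s = p ^ c
        carry_t = max(g_t, max(p_t, carry_t) + and_d) + or_d  # c_out = g | (p & c)
        latest = max(latest, s_t, carry_t)
    return area, latest
```

Under these assumptions the model gives 7n area units and 2n + 2 delay units, e.g. (224, 66) for n = 32, which is the linear growth the surveyed paper attributes to ripple-carry designs.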

Journal Article
TL;DR: Simulation results show that the 4-2 compressor with the proposed XOR-XNOR module and the new fast 5-2 compressor architecture are able to function at supply voltage as low as 0.6 V, and outperform many other architectures including the classical CMOS logic compressors and variants of compressors constructed with various combinations of recently reported superior low-power logic cells.
Abstract: This paper presents several architectures and designs of low-power 4-2 and 5-2 compressors capable of operating at ultra low supply voltages. These compressor architectures are anatomized into their constituent modules and different static logic styles based on the same deep submicrometer CMOS process model are used to realize them. Different configurations of each architecture, which include a number of novel 4-2 and 5-2 compressor designs, are prototyped and simulated to evaluate their performance in speed, power dissipation and power-delay product. The newly developed circuits are based on various configurations of the novel 5-2 compressor architecture with the new carry generator circuit, or existing architectures configured with the proposed circuit for the exclusive OR (XOR) and exclusive NOR (XNOR) [XOR-XNOR] module. The proposed new circuit for the XOR-XNOR module eliminates the weak logic on the internal nodes of pass transistors with a pair of feedback PMOS-NMOS transistors. Driving capability has been considered in the design as well as in the simulation setup so that these 4-2 and 5-2 compressor cells can operate reliably in any tree structured parallel multiplier at very low supply voltages. Two new simulation environments are created to ensure that the performances reflect the realistic circuit operation in the system to which these cells are integrated. Simulation results show that the 4-2 compressor with the proposed XOR-XNOR module and the new fast 5-2 compressor architecture are able to function at supply voltage as low as 0.6 V, and outperform many other architectures including the classical CMOS logic compressors and variants of compressors constructed with various combinations of recently reported superior low-power logic cells.
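
A 4-2 compressor reduces four partial-product bits (plus a carry from the neighbouring column) to two output weights. The Python model below uses the textbook construction from two cascaded full adders and checks the counting identity x1 + x2 + x3 + x4 + cin = sum + 2(carry + cout); it does not model the paper's XOR-XNOR circuit or its 5-2 architecture, and the function names are illustrative.

```python
def full_adder(a, b, c):
    """3:2 counter on single bits: returns (sum, carry) with a + b + c == sum + 2 * carry."""
    s = a ^ b ^ c
    carry = (a & b) | (a & c) | (b & c)
    return s, carry


def compressor_4_2(x1, x2, x3, x4, cin):
    """4-2 compressor built from two cascaded full adders.
    cout depends only on x1..x3, never on cin, so a row of these cells
    has no carry rippling along it."""
    s1, cout = full_adder(x1, x2, x3)
    total, carry = full_adder(s1, x4, cin)
    return total, carry, cout
```

The independence of cout from cin is what lets a compressor tree keep a fixed depth regardless of operand width.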

349 citations


Cites background from "Area-time-power tradeoffs in parall..."

  • ...…a Booth encoder for the generation of a reduced number of partial products; a carry save structured accumulator for a further reduction of the partial products’ matrix to only the addition of two operands; and a fast carry propagation adder (CPA) [9] for the computation of the final binary…...
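
The excerpt mentions a Booth encoder that reduces the number of partial products. Assuming the common radix-4 (modified) Booth scheme, the Python sketch below recodes an n-bit two's-complement multiplier into n/2 digits from {-2, -1, 0, 1, 2}, so a 16-bit multiply needs 8 partial products instead of 16. The carry-save accumulator and final carry-propagate adder are stood in for by Python's sum(), and the function names are illustrative.

```python
def booth_radix4_digits(y, n):
    """Radix-4 Booth recoding of an n-bit two's-complement multiplier y (n even).
    Returns digits d_j in {-2, -1, 0, 1, 2} with y == sum(d_j * 4**j)."""
    assert n % 2 == 0
    bits = [(y >> i) & 1 for i in range(n)]  # two's-complement bits of y
    digits = []
    prev = 0  # implicit bit y_{-1} = 0
    for j in range(0, n, 2):
        # Each overlapping triplet (y_{j+1}, y_j, y_{j-1}) maps to one signed digit.
        digits.append(-2 * bits[j + 1] + bits[j] + prev)
        prev = bits[j + 1]
    return digits


def booth_multiply(x, y, n=16):
    """Multiply x by the n-bit signed multiplier y using n/2 Booth partial products."""
    partial_products = [(d * x) << (2 * j) for j, d in enumerate(booth_radix4_digits(y, n))]
    return sum(partial_products)  # stands in for the carry-save tree plus final adder
```

Each digit only selects 0, ±x or ±2x, so a partial product is a shift and an optional negation of x rather than a full multiplication.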

Dissertation
01 Jan 1997
TL;DR: It is found that the ripple-carry, the carry-lookahead, and the proposed carry-increment adders show the best overall performance characteristics for cell-based design.
Abstract: The addition of two binary numbers is the fundamental and most often used arithmetic operation on microprocessors, digital signal processors (DSP), and data-processing application-specific integrated circuits (ASIC). Therefore, binary adders are crucial building blocks in very large-scale integrated (VLSI) circuits. Their efficient implementation is not trivial because a costly carry-propagation operation involving all operand bits has to be performed. Many different circuit architectures for binary addition have been proposed over the last decades, covering a wide range of performance characteristics. Also, their realization at the transistor level for full-custom circuit implementations has been addressed intensively. However, the suitability of adder architectures for cell-based design and hardware synthesis (both prerequisites for the ever-increasing productivity in ASIC design) was hardly investigated. Based on the various speed-up schemes for binary addition, a comprehensive overview and a qualitative evaluation of the different existing adder architectures are given in this thesis. In addition, a new multilevel carry-increment adder architecture is proposed. It is found that the ripple-carry, the carry-lookahead, and the proposed carry-increment adders show the best overall performance characteristics for cell-based design. These three adder architectures, which together cover the entire range of possible area vs. delay trade-offs, are all contained in the more general prefix adder architecture reported in the literature. It is shown that this universal and flexible prefix adder structure also allows the realization of various customized adders and of adders fulfilling arbitrary timing and area constraints. A non-heuristic algorithm for the synthesis and optimization of prefix adders is proposed. It allows the runtime-efficient generation of area-optimal adders for given timing constraints.
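
The prefix adder structure referred to here computes all carries with the associative operator (G, P) ∘ (G', P') = (G + P·G', P·P') on generate/propagate pairs. As a sketch of that idea only (not of the thesis's synthesis algorithm), the Python model below uses a Sklansky prefix network, which needs log2(n) operator levels; the choice of topology and the function name are our assumptions.

```python
def prefix_add(x, y, n=16):
    """n-bit adder built on a Sklansky (divide-and-conquer) parallel-prefix network.
    Returns (sum, carry_out, number_of_prefix_levels); carry-in is 0."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]    # bit generate
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]  # bit propagate
    G, P = g[:], p[:]
    levels = 0
    span = 1
    while span < n:
        # Every bit in the upper half of each 2*span block combines with the last
        # bit of its lower half: (G, P) o (G', P') = (G | P & G', P & P').
        for i in range(n):
            if i & span:
                j = (i & ~(span - 1)) - 1  # last bit of the lower half (unchanged this level)
                G[i], P[i] = G[i] | (P[i] & G[j]), P[i] & P[j]
        span *= 2
        levels += 1
    # Now G[i] is the group generate of bits 0..i, i.e. the carry out of bit i.
    carries = [0] + G[:-1]
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, G[n - 1], levels
```

Swapping the loop for a serial chain of operators would give the ripple-carry case at the other end of the same area vs. delay range.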

268 citations

Proceedings Article
Meiqi Wang, Siyuan Lu, Danyang Zhu, Jun Lin, Zhongfeng Wang
01 Oct 2018
TL;DR: This paper presents an efficient hardware implementation of the softmax function, coded in a hardware description language (HDL) and synthesized in TSMC 28-nm CMOS technology; synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data.
Abstract: Recently, significant improvements have been achieved in the hardware architecture design of deep neural networks (DNNs). However, the hardware implementation of the widely used softmax function, which involves expensive division and exponentiation units, has not been much investigated. This paper presents an efficient hardware implementation of the softmax function. Mathematical transformations and linear fitting are used to simplify this function. Multiple algorithmic strength-reduction strategies and fast addition methods are employed to optimize the architecture. With these techniques, complicated logic units such as multipliers are eliminated and memory consumption is greatly reduced, while the accuracy loss is negligible. The proposed design is coded in a hardware description language (HDL) and synthesized in TSMC 28-nm CMOS technology. Synthesis results show that the architecture achieves a throughput of 6.976 G/s for 8-bit input data. A power efficiency of 463.04 Gb/(mm²·mW) is achieved, and the design occupies only 0.015 mm² of area. To the best of our knowledge, this is the first work on efficient hardware implementation of softmax in the open literature.
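
The abstract only says that "mathematical transformations and linear fitting" are used. One common transformation of this kind rewrites e^x as 2^(x·log2 e), implements the integer part of the exponent as a shift, and replaces 2^f on [0, 1) by the straight line 1 + f. The Python sketch below follows that assumption; it is not claimed to be the paper's datapath, and division is kept as ordinary division.

```python
import math

LOG2_E = 1.0 / math.log(2.0)  # log2(e)


def exp_base2_linear(x):
    """Approximate e**x as 2**k * (1 + f), where x * log2(e) = k + f with integer k
    and 0 <= f < 1. In hardware, 2**k is a shift and 1 + f a linear fit of 2**f."""
    y = x * LOG2_E
    k = math.floor(y)
    f = y - k
    return math.ldexp(1.0 + f, k)


def approx_softmax(xs):
    """Softmax with the maximum subtracted first (so every exponent is <= 0) and
    exponentials replaced by the shift-plus-linear-fit approximation above."""
    m = max(xs)
    es = [exp_base2_linear(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]
```

Because (1 + f)/2^f stays between 1 and about 1.06 on [0, 1), each approximated exponential is off by at most about 6%, and normalization keeps every output within a few hundredths of the exact softmax.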

84 citations


Cites methods from "Area-time-power tradeoffs in parall..."

  • ...Adders are further optimized using carry-save method [11]-[12]....
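
Carry-save addition, cited here as an adder optimization, postpones carry propagation: each 3:2 step turns three operands into a sum word and a carry word without any carry chain, and only the last two words pass through a carry-propagate adder. A minimal Python sketch for non-negative integers; the names are illustrative, and the steps run one after another rather than as a hardware tree.

```python
def carry_save_add(a, b, c):
    """3:2 carry-save step: a + b + c == s + carry.
    Every bit position is computed independently, so the delay does not grow with word length."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def multi_operand_sum(values):
    """Reduce a list of non-negative operands with carry-save steps, then do one carry-propagate add."""
    ops = list(values)
    while len(ops) > 2:
        s, carry = carry_save_add(ops.pop(), ops.pop(), ops.pop())
        ops += [s, carry]
    return sum(ops)  # the single final carry-propagate addition (0, 1 or 2 operands left)
```

For example, carry_save_add(5, 6, 7) returns (4, 14), and 4 + 14 equals 5 + 6 + 7.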

Book
14 Feb 2011
TL;DR: Digital Design of Signal Processing Systems discusses a spectrum of architectures and methods for effective implementation of algorithms in hardware (HW) and includes conversion of algorithms from floating-point to fixed-point format, parallel architectures for basic computational blocks, Verilog Hardware Description Language (HDL), SystemVerilog and coding guidelines for synthesis.
Abstract: Digital Design of Signal Processing Systems discusses a spectrum of architectures and methods for effective implementation of algorithms in hardware (HW). Encompassing all facets of the subject this book includes conversion of algorithms from floating-point to fixed-point format, parallel architectures for basic computational blocks, Verilog Hardware Description Language (HDL), SystemVerilog and coding guidelines for synthesis.

82 citations


Cites background or methods from "Area-time-power tradeoffs in parall..."

  • ...A multiprocessor system on chip (MPSoC) is another design of choice for many modern high-throughput signal processing and multimedia applications [10, 11]....

  • ...15 format
    end
    always @(posedge clk)
    begin
      prod[0] <= xn[0] * b0;
      prod[1] <= xn[2] * b1;
      prod[2] <= xn[4] * b2;
      prod[3] <= xn[6] * b3;
      prod[4] <= xn[8] * b4;
      prod[5] <= xn[10] * b5;
      prod[6] <= xn[12] * b6;
      prod[7] <= xn[14] * b7;
    end
    always @(posedge clk)
    begin
      mac[0] <= prod[0];
      for (i = 0; i < 7; i = i + 1)
        mac[i+1] <= mac[i] + prod[i+1];
    end
    assign yn = mac[7];
    endmodule...

  • ...This design methodology is discussed in [10] and [11]....

  • ...This is especially true for more complex signal processing applications where keeping the numerical accuracy intact is considered critical [10]....

  • ...A high-speed bus such as the AMBA High-performance Bus (AHB) is used in these systems [10]....

References
Book
01 Jan 1978

2,993 citations

Journal Article
TL;DR: In this paper, techniques for low power operation are presented which use the lowest possible supply voltage coupled with architectural, logic style, circuit, and technology optimizations to reduce power consumption in CMOS digital circuits while maintaining computational throughput.
Abstract: Motivated by emerging battery-operated applications that demand intensive computation in portable environments, techniques are investigated which reduce power consumption in CMOS digital circuits while maintaining computational throughput. Techniques for low-power operation are shown which use the lowest possible supply voltage coupled with architectural, logic style, circuit, and technology optimizations. An architecturally based scaling strategy is presented which indicates that the optimum voltage is much lower than that determined by other scaling considerations. This optimum is achieved by trading increased silicon area for reduced power consumption.
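
The trade of silicon area for power can be checked with the standard dynamic-power relation P = C·V²·f. In the Python sketch below, the ratios for a two-way parallel datapath are assumed, illustrative values rather than figures quoted on this page: switched capacitance grows to 2.15× (two copies plus routing overhead), each copy runs at half the clock, and the resulting timing slack lets the supply drop to 0.58× of the reference.

```python
def relative_dynamic_power(cap_ratio, voltage_ratio, freq_ratio):
    """Dynamic power P = C * V**2 * f, expressed relative to a reference design."""
    return cap_ratio * voltage_ratio ** 2 * freq_ratio


# Assumed example: two parallel units, 2.15x capacitance, clock halved, supply scaled to 0.58x.
parallel = relative_dynamic_power(cap_ratio=2.15, voltage_ratio=0.58, freq_ratio=0.5)
print(f"parallel design uses about {parallel:.2f}x the reference power")
# prints: parallel design uses about 0.36x the reference power
```

The quadratic dependence on V is what makes the extra area pay off: doubling the hardware roughly doubles C, but the voltage reduction it enables more than compensates.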

2,690 citations

Journal Article
TL;DR: An architecturally based scaling strategy is presented which indicates that the optimum voltage is much lower than that determined by other scaling considerations, and is achieved by trading increased silicon area for reduced power consumption.
Abstract: Motivated by emerging battery-operated applications that demand intensive computation in portable environments, techniques are investigated which reduce power consumption in CMOS digital circuits while maintaining computational throughput. Techniques for low-power operation are shown which use the lowest possible supply voltage coupled with architectural, logic style, circuit, and technology optimizations. An architecturally based scaling strategy is presented which indicates that the optimum voltage is much lower than that determined by other scaling considerations. This optimum is achieved by trading increased silicon area for reduced power consumption.

2,337 citations

Journal Article
TL;DR: Signed-digit representations limit carry propagation to one position to the left during addition and subtraction in digital computers; arithmetic operations with signed-digit numbers (addition, subtraction, multiplication, division and roundoff) are discussed.
Abstract: This paper describes a class of number representations which are called signed-digit representations. Signed-digit representations limit carry-propagation to one position to the left during the operations of addition and subtraction in digital computers. Carry-propagation chains are eliminated by the use of redundant representations for the operands. Redundancy in the number representation allows a method of fast addition and subtraction in which each sum (or difference) digit is the function only of the digits in two adjacent digital positions of the operands. The addition time for signed-digit numbers of any length is equal to the addition time for two digits. The paper discusses the properties of signed-digit representations and arithmetic operations with signed-digit numbers: addition, subtraction, multiplication, division and roundoff. A brief discussion of logical design problems for a signed-digit adder concludes the presentation.
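
The claim that each sum digit depends only on two adjacent positions can be made concrete. The Python sketch below assumes radix 10 with the redundant digit set {-6, ..., 6}; the paper treats signed-digit representations in general, so this particular radix, digit set and the helper names are our choices. A transfer digit is chosen from each position sum, and sum digit i is formed only from the interim sum at i and the transfer from i-1, so no carry travels more than one position.

```python
RADIX = 10
MAX_DIGIT = 6  # assumed digit set {-6, ..., 6}: 13 values for radix 10, hence redundant


def to_signed_digits(n, width):
    """Little-endian radix-10 signed-digit form of integer n, using digits in [-4, 5]."""
    digits = []
    for _ in range(width):
        d = n % RADIX
        if d > 5:
            d -= RADIX
        digits.append(d)
        n = (n - d) // RADIX
    assert n == 0, "width too small"
    return digits


def from_signed_digits(digits):
    """Integer value of a little-endian signed-digit list."""
    return sum(d * RADIX ** i for i, d in enumerate(digits))


def sd_add(xs, ys):
    """Carry-free addition: sum digit i depends only on positions i and i-1."""
    w = [x + y for x, y in zip(xs, ys)]  # position sums in [-12, 12]
    t = [1 if v >= MAX_DIGIT else -1 if v <= -MAX_DIGIT else 0 for v in w]  # transfer digits
    u = [v - RADIX * ti for v, ti in zip(w, t)]  # interim sums in [-5, 5]
    # Each transfer moves exactly one position left and is absorbed there without overflow.
    return [u[i] + (t[i - 1] if i > 0 else 0) for i in range(len(w))] + [t[-1]]
```

Since every interim sum lies in [-5, 5] and every transfer in [-1, 1], each result digit stays inside {-6, ..., 6}, which is why the addition time does not depend on the number length.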

1,232 citations

Journal Article
TL;DR: It is shown that addition of n-bit binary numbers can be performed on a chip with a regular layout in time proportional to log n and with area proportional to n.
Abstract: With VLSI architecture, the chip area and design regularity represent a better measure of cost than the conventional gate count. We show that addition of n-bit binary numbers can be performed on a chip with a regular layout in time proportional to log n and with area proportional to n.
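
The log-time, linear-area result matches the prefix network commonly known as the Brent-Kung adder; we assume that construction here, since the abstract does not spell it out. The Python sketch below computes the carries with an up-sweep over aligned blocks and a down-sweep that fills in the remaining prefixes, and counts the prefix-operator nodes, which come to 2n - 2 - log2(n) for n = 2^k bits, with 2·log2(n) - 1 levels of depth. The names and the power-of-two restriction are our assumptions.

```python
def brent_kung_carries(g, p):
    """Brent-Kung parallel prefix over (generate, propagate) bits for n = 2**k bits.
    Returns the group generates G[i] (carry out of bits 0..i) and the number of
    prefix-operator nodes used."""
    n = len(g)
    assert n & (n - 1) == 0, "sketch assumes a power-of-two width"
    G, P = list(g), list(p)
    nodes = 0

    def combine(hi, lo):
        # (G, P) o (G', P') = (G | P & G', P & P')
        nonlocal nodes
        G[hi], P[hi] = G[hi] | (P[hi] & G[lo]), P[hi] & P[lo]
        nodes += 1

    # Up-sweep: group signals for aligned blocks of size 2, 4, 8, ...
    span = 1
    while span < n:
        for i in range(2 * span - 1, n, 2 * span):
            combine(i, i - span)
        span *= 2
    # Down-sweep: fill in the remaining prefixes from the block results.
    span = n // 4
    while span >= 1:
        for i in range(3 * span - 1, n, 2 * span):
            combine(i, i - span)
        span //= 2
    return G, nodes


def brent_kung_add(x, y, n=16):
    """n-bit addition (carry-in 0) using the carries above; returns (sum, carry_out, nodes)."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(n)]
    G, nodes = brent_kung_carries(g, p)
    carries = [0] + G[:-1]
    s = sum((p[i] ^ carries[i]) << i for i in range(n))
    return s, G[-1], nodes
```

For 16 bits the network uses 26 operator nodes, growing linearly with n, while the depth grows only with log2(n), consistent with the area and time bounds stated above.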

1,147 citations