Proceedings Article

An efficient floating point multiplier design for high speed applications using Karatsuba algorithm and Urdhva-Tiryagbhyam algorithm

TL;DR: A combination of the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm is used to implement an unsigned binary multiplier for mantissa multiplication, which gives a better implementation in terms of delay and power.
Abstract: Floating point multiplication is a crucial operation in high power computing applications such as image processing and signal processing. Multiplication is also the most time- and power-consuming operation. This paper proposes an efficient method for IEEE 754 floating point multiplication which gives a better implementation in terms of delay and power. A combination of the Karatsuba algorithm and the Urdhva-Tiryagbhyam algorithm (Vedic Mathematics) is used to implement an unsigned binary multiplier for mantissa multiplication. The multiplier is implemented in Verilog HDL and targeted at Spartan-3E and Virtex-4 FPGAs.
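The flow described in the abstract can be illustrated with a minimal software sketch. It is not the paper's Verilog design. It assumes single-precision operands that are normal numbers, truncates instead of rounding, and ignores exponent overflow and underflow. The function names `karatsuba` and `fp32_mul` and the `base_bits` cutoff are hypothetical, and the small base-case product stands in for an Urdhva-Tiryagbhyam block.

```python
import struct

def karatsuba(x, y, bits=24, base_bits=6):
    """Multiply unsigned integers x, y (< 2**bits) using three half-width products."""
    if bits <= base_bits:
        return x * y  # assumption: stand-in for a small Urdhva-Tiryagbhyam block
    half = bits // 2
    mask = (1 << half) - 1
    xh, xl = x >> half, x & mask
    yh, yl = y >> half, y & mask
    hi = karatsuba(xh, yh, bits - half, base_bits)
    lo = karatsuba(xl, yl, half, base_bits)
    # One extra product replaces the two cross products xh*yl and xl*yh.
    mid = karatsuba(xh + xl, yh + yl, bits - half + 1, base_bits) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo

def fp32_mul(a, b):
    """Multiply two normal float32 values; truncating, no special cases."""
    ua = struct.unpack(">I", struct.pack(">f", a))[0]
    ub = struct.unpack(">I", struct.pack(">f", b))[0]
    sign = (ua ^ ub) >> 31                      # sign is the XOR of the input signs
    exp = ((ua >> 23) & 0xFF) + ((ub >> 23) & 0xFF) - 127  # add exponents, remove one bias
    ma = (ua & 0x7FFFFF) | 0x800000             # restore the hidden leading 1
    mb = (ub & 0x7FFFFF) | 0x800000
    prod = karatsuba(ma, mb)                    # 48-bit mantissa product
    if prod >> 47:                              # product in [2, 4): normalize
        prod >>= 1
        exp += 1
    frac = (prod >> 23) & 0x7FFFFF              # drop hidden bit, truncate low bits
    bits = (sign << 31) | (exp << 23) | frac
    return struct.unpack(">f", struct.pack(">I", bits))[0]
```

For example, `fp32_mul(1.5, 2.5)` returns `3.75`, since that product is exactly representable and truncation loses nothing.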
Citations
Journal Article
TL;DR: The proposed solution achieves state-of-the-art performance in terms of processing speed, with a critical path delay of about 2 ns both on a Xilinx Virtex 7 and with CMOS 90-nm standard cells.
Abstract: A new radix-3 partitioning method of natural numbers, derived from weight partition theory, is employed to build a multiplierless circuit that is well suited for multimedia filtering applications. The partitioning method allows conveniently premultiplying 32-b floating-point filter coefficients with the smallest set of parts composing an unsigned integer input. In this way, as in distributed arithmetic, the shifters and recoding circuitry typical of other well-known multiplier circuits are completely replaced with simplified floating-point adders. Compared to the existing literature, targeting both field-programmable gate array and standard-cell technology, the proposed solution achieves state-of-the-art performance in terms of processing speed, with a critical path delay of about 2 ns both on a Xilinx Virtex 7 and with CMOS 90-nm standard cells.

32 citations


Cites background or methods from "An efficient floating point multipl..."

  • ...The mapped physical resources are approximately 30% lower than those in [23], while the delay is about one-half, although this value is not representative because of the technology differences between the two target platforms....

  • ...The work in [23] for FPGA and that in [16] and [17] for std_cells have been used as comparative terms....

Journal Article
TL;DR: This paper proposes a novel hardware architecture for efficient field-programmable gate array (FPGA) implementation of finite-field multipliers for ECC, which results in a lower combinational delay and area–delay product, indicating the efficiency of the design.
Abstract: Cryptography systems have become inseparable parts of almost every communication device. Among cryptography algorithms, public-key cryptography, and in particular elliptic curve cryptography (ECC), has become the most dominant protocol at this time. In ECC systems, polynomial multiplication is considered to be the slowest and most area-consuming operation. This article proposes a novel hardware architecture for efficient field-programmable gate array (FPGA) implementation of finite-field multipliers for ECC. The proposed hardware was implemented on different FPGA devices for various operand sizes, and performance parameters were determined. Compared to state-of-the-art works, the proposed method resulted in a lower combinational delay and area–delay product, indicating the efficiency of the design.

12 citations

Proceedings Article
01 Dec 2018
TL;DR: Presents the VLSI implementation of a new radix-2 Decimation In Time (DIT) Fast Fourier Transform (FFT) algorithm with reduced arithmetic complexity, based on scaling the twiddle factor; results show that the proposed architecture significantly reduces hardware area and power consumption.
Abstract: In this paper we discuss the VLSI implementation of the new radix-2 Decimation In Time (DIT) Fast Fourier Transform (FFT) algorithm with reduced arithmetic complexity, which is based on scaling the twiddle factor. Some signal processing applications require high-performance FFT processors, and to meet these performance requirements the processor needs to be pipelined and parallelized. An optimized ASIC design is derived from this new radix-2 algorithm with fewer multipliers, and it adopts a fully parallel and pipelined architecture for hardware implementation of a 64-point FFT. The implementation results show that the proposed architecture reduces the hardware area by 13.74 percent and power consumption by 16 percent when compared to the standard FFT architecture. Simulation of design units is done in Xilinx ISE WebPack 13.1, and synthesis is done using Cadence Encounter RTL Compiler.
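For orientation, the textbook radix-2 DIT recursion that such designs start from can be sketched as follows. This is the standard algorithm only; the twiddle-factor scaling that the abstract credits for the reduced multiplier count is not reproduced here. The function name `fft_dit` is hypothetical, and the input length is assumed to be a power of two.

```python
import cmath

def fft_dit(x):
    """Standard recursive radix-2 decimation-in-time FFT (length must be a power of two)."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])   # DFT of even-indexed samples
    odd = fft_dit(x[1::2])    # DFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: one complex multiplication by the twiddle factor W_n^k.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

Each butterfly costs one complex multiplication, which is why reducing or simplifying the twiddle multiplications translates directly into fewer hardware multipliers.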

9 citations


Cites background from "An efficient floating point multipl..."

  • ...[16] for power and area efficient computation of...

Proceedings Article
12 Jun 2019
TL;DR: This paper proposes an efficient method for signed binary multiplication using the Urdhva-Tiryagbhyam technique, the Karatsuba algorithm and an efficient carry select adder to design a binary multiplier with lower area, power and delay.
Abstract: In DSP processors and other applications that use multiply-accumulate (MAC) units, multiplication of large numbers is the main bottleneck. Multiplying two n-bit binary numbers requires n(n − 1) adders and n² AND gates, which consumes more time, power and area for large n, since the hardware scales as the square of n. There is therefore a need for a binary multiplier with lower area, power and delay, although in general there is a tradeoff among the three. As technology shrinks, area can be slightly compromised. This paper proposes an efficient method for signed binary multiplication using the Urdhva-Tiryagbhyam technique, the Karatsuba algorithm and an efficient carry select adder. The Urdhva-Tiryagbhyam technique is known for its low delay [8], as it produces the partial products at the same instant and sums them up. It is best suited when the number of bits in the multiplier and multiplicand is less than 16 [8], [14], whereas the Karatsuba algorithm is applicable to multiplication of larger numbers of bits [5]. The proposed multiplier is implemented for a first-stage butterfly unit [12] of a radix-2 FFT algorithm. The design is implemented in Verilog HDL and synthesized in both 90 nm and 45 nm technology at the multiplier level, and in 45 nm technology for the butterfly unit, using Cadence RTL Compiler, and the results are compared.
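The division of labor described above, Urdhva-Tiryagbhyam below 16 bits and Karatsuba above, can be sketched in software. This is an illustrative model under stated assumptions, not the paper's design: the sign is handled by a sign-magnitude wrapper (the abstract does not say how the paper handles it), the carry select adder is not modeled, and all function names and the `threshold` parameter are hypothetical.

```python
def urdhva_tiryagbhyam(a, b, n):
    """Vertical-and-crosswise product of two unsigned n-bit integers."""
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    result, carry = 0, 0
    for k in range(2 * n - 1):
        # Column k collects every crosswise AND term a_i & b_j with i + j == k.
        # In hardware all columns are formed at the same instant.
        column = sum(a_bits[i] & b_bits[k - i]
                     for i in range(max(0, k - n + 1), min(k, n - 1) + 1))
        total = column + carry
        result |= (total & 1) << k
        carry = total >> 1
    return result | (carry << (2 * n - 1))

def hybrid_multiply(x, y, n, threshold=16):
    """Karatsuba split for wide operands, Urdhva-Tiryagbhyam below the threshold."""
    if n < threshold:
        return urdhva_tiryagbhyam(x, y, n)
    half = n // 2
    mask = (1 << half) - 1
    xh, xl = x >> half, x & mask
    yh, yl = y >> half, y & mask
    hi = hybrid_multiply(xh, yh, n - half, threshold)
    lo = hybrid_multiply(xl, yl, half, threshold)
    # The operand sums need one extra bit of width.
    mid = hybrid_multiply(xh + xl, yh + yl, n - half + 1, threshold) - hi - lo
    return (hi << (2 * half)) + (mid << half) + lo

def signed_multiply(x, y, n=32):
    """Assumed sign-magnitude wrapper: multiply magnitudes, then fix the sign."""
    negative = (x < 0) != (y < 0)
    magnitude = hybrid_multiply(abs(x), abs(y), n)
    return -magnitude if negative else magnitude
```

Operand magnitudes are assumed to fit in `n` bits. With `n=32` the recursion splits down to 8-, 9- and 10-bit Urdhva-Tiryagbhyam blocks.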

4 citations

Proceedings Article
01 Jun 2022
TL;DR: Nonuniform-to-Uniform Quantization (N2UQ) learns flexible inequidistant input thresholds to better fit the underlying distribution while quantizing real-valued inputs into equidistant output levels.
Abstract: The nonuniform quantization strategy for compressing neural networks usually achieves better performance than its counterpart, i.e., uniform strategy, due to its superior representational capacity. However, many nonuniform quantization methods overlook the complicated projection process in implementing the nonuniformly quantized weights/activations, which incurs non-negligible time and space overhead in hardware deployment. In this study, we propose Nonuniform-to-Uniform Quantization (N2UQ), a method that can maintain the strong representation ability of nonuniform methods while being hardware-friendly and efficient as the uniform quantization for model inference. We achieve this through learning the flexible inequidistant input thresholds to better fit the underlying distribution while quantizing these real-valued inputs into equidistant output levels. To train the quantized network with learnable input thresholds, we introduce a generalized straight-through estimator (G-STE) for intractable backward derivative calculation w.r.t. threshold parameters. Additionally, we consider entropy preserving regularization to further reduce information loss in weight quantization. Even under this adverse constraint of imposing uniformly quantized weights and activations, our N2UQ outperforms state-of-the-art nonuniform quantization methods by 0.5 ~ 1.7% on ImageNet, demonstrating the contribution of N2UQ design. Code and models are available at: https://github.com/liuzechun/Nonuniform-to-Uniform-Quantization.
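The core idea of mapping inequidistant input thresholds to equidistant output levels can be illustrated with a forward-pass sketch. It is an assumption-laden illustration and not the authors' released code: the thresholds are fixed example values instead of learned parameters, the G-STE backward pass and the entropy regularization are omitted, and the function name `quantize` and the `step` parameter are hypothetical.

```python
from bisect import bisect_right

def quantize(x, thresholds, step=1.0):
    """Map a real input to an equidistant level using sorted, unevenly spaced thresholds."""
    # The level index is the number of thresholds at or below x.
    level = bisect_right(thresholds, x)
    return level * step

# Hypothetical thresholds for a 2-bit quantizer (4 output levels: 0, 1, 2, 3).
example_thresholds = [0.1, 0.4, 1.2]
```

With these thresholds, `quantize(0.3, example_thresholds)` returns `1.0`: the input intervals have unequal widths, but the outputs are evenly spaced, so downstream arithmetic can stay uniform.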

3 citations

References
Standard
01 Jan 2008

1,354 citations

Journal Article
TL;DR: A reduced-bit multiplication algorithm based on the ancient Vedic multiplication formulae, Urdhva tiryakbhyam and Nikhilam, is proposed and is further optimized by use of some general arithmetic operations such as expansion and bit-shifting to take advantage of bit-reduction in multiplication.
Abstract: A reduced-bit multiplication algorithm based on the ancient Vedic multiplication formulae is proposed in this paper. Both the Vedic multiplication formulae, Urdhva tiryakbhyam and Nikhilam, are first discussed in detail. Urdhva tiryakbhyam, being a general multiplication formula, is equally applicable to all cases of multiplication. It is applied to the digital arithmetic and is shown to yield a multiplier architecture which is very similar to the popular array multiplier. Due to its structure, it leads to a high carry propagation delay in case of multiplication of large numbers. Nikhilam Sutra, on the other hand, is more efficient in the multiplication of large numbers as it reduces the multiplication of two large numbers to that of two smaller numbers. The framework of the proposed algorithm is taken from this Sutra and is further optimized by use of some general arithmetic operations such as expansion and bit-shifting to take advantage of bit-reduction in multiplication. We illustrate the proposed algorithm by reducing a general 4×4-bit multiplication to a single 2×2-bit multiplication operation.

105 citations


"An efficient floating point multipl..." refers methods in this paper

  • ...A more optimized hard [9, 10] is shown in Fig....

Proceedings Article
01 Dec 2012
TL;DR: An improved Wallace tree multiplier architecture using a Booth recoder and compressor adders is proposed; results show that it is around 67 percent faster than the existing Wallace tree multiplier.
Abstract: A Wallace tree multiplier using a Booth recoder is proposed in this paper. It is an improved version of the tree-based Wallace tree multiplier architecture. This paper aims at further reduction of the latency and area of the Wallace tree multiplier. This is accomplished by the use of the Booth algorithm and compressor adders. The coding is done in Verilog HDL and synthesized for a Xilinx Virtex 6 FPGA device. The results show that the proposed architecture is around 67 percent faster than the existing Wallace tree multiplier, 53 percent faster than the Vedic multiplier, 22 percent faster than the radix-8 Booth multiplier, and 18 percent faster than the radix-16 Booth multiplier. In terms of area, the proposed multiplier is also much more efficient.

50 citations

Journal Article
TL;DR: A simple digital multiplier architecture based on the Urdhva Tiryakbhyam (vertically and crosswise) Sutra of Vedic Mathematics is presented, and an improved technique for a low-power, high-speed multiplier of two binary numbers (16 bits each) is developed.
Abstract: High-speed parallel multipliers are one of the key components in RISCs (Reduced Instruction Set Computers), DSPs (Digital Signal Processors), graphics accelerators and so on. The array multiplier, Booth multiplier and Wallace tree multiplier are some of the standard approaches to binary multiplication that are suitable for VLSI implementation. A simple digital multiplier architecture (henceforth referred to as the Vedic Multiplier, VM for short) based on the Urdhva Tiryakbhyam (vertically and crosswise) Sutra of Vedic Mathematics is presented. An improved technique for a low-power, high-speed multiplier of two binary numbers (16 bits each) is developed. An algorithm is proposed and implemented in 16nm CMOS technology. The designed 16x16 bit multiplier dissipates a power of 0.17 mW. The propagation delay of the proposed architecture is 27.15 ns. These results improve on the power dissipation and delay figures reported in the literature for Vedic and Booth multipliers.

32 citations

26 Mar 2017
TL;DR: In this article, a Vedic multiplier using the Urdhva Tiryagbhyam sutra is designed in Xilinx ISE; the design takes less time per operation than currently available multipliers.
Abstract: Today's technology has raised demand for fast and real-time signal processing operations. Multiplication is one of the most important arithmetic operations. In this paper, we propose the design of a Vedic multiplier using the Urdhva Tiryagbhyam sutra in Xilinx ISE. This design takes less time per operation than currently available multipliers. It covers a wide area of image processing and digital signal processing more efficiently, with an increase in speed leading to higher performance.

24 citations