# Speed Optimization of 32 Bit Single Precision Floating Point Multiplier through Pipelining in VHDL

Tajinder Pal Singh\*

\* Electrical and Electronics Engineering, Faculty of Electrical Engineering, Chitkara University, India

(tajinderpal.singh@chitkarauniversity.edu.in)

<sup>‡</sup>Corresponding Author Tajinderpal Singh, Chitkara University, Tel: +91

8872446979,tajinderpal.singh@chitkarauniversity.edu.in

Received: 03.08.2018 Accepted: 28.12.2018

Abstract- The paper is presenting the architectural method for speed optimization of floating point multiplier involves increasing the frequency by implementing pipelines in the design using VHDL language. The whole algorithm of IEEE 754 standard 32 bit single precision floating point multiplier have two pipelining stages, which improve the frequency rate of clock to 422.556 MHz and result the output in 2.367ns. The design also handles the overflow/ underflow cases with normalization for the better accuracy of the result. Xilinx vertex 5 FPGA is targeted for the design and the simulation is done on Xilinx ISE simulator.

Keywords Floating Point multiplier; VHDL; Pipelining; Xilinx ISE tool.

#### 1. Introduction

This paper explores the effect of pipelining on the performance of the proposed design of Floating point multiplier. Mostly latency in the clock affects highly the design performance. So the branch prediction and fast branch recovery will continue to increase in importance, which is called pipelining. In VHDL pipelining the multiple instructions are runs in parallel. The pipeline is divided in stages. Each stage completes a part of an instruction in parallel. The pipe stages are hooked together in which instructions enter at one end, progress through the stages, and exit at the other end. Hence increase the instruction throughput.

In floating point multiplier, multiplication required the following steps; first to multiply the mantissa or significand of two multiplicand, second is to add the exponent of both numbers and subtract the bais and third is to XORing the sign bit of both numbers. In these steps most of the calculation time is spent in significand multiplication, moderate in addition and very less in sign calculation. The pipelining helps to divide the critical path of floating multiplier into small stages which improve the overall performance of the design.

In our design of single precision floating point multiplier we introduce two pipelining stages in the architecture which achieves the frequency rate of 422.556MHz.The two stage pipelining increase the instruction throughput of the proposed design and result the output in 2.367ns. The complete algorithm & pipelining in floating point multiplier is discussed in next sections.

#### 2. Operation of Floating point multiplication

Let A & B are two floating point numbers to be multiplied. E is representing Bias Exponent value of A or B. M is Mantissa or fraction value of the numbers and S is representing sign bit. The whole process takes following steps for the operation of multiplication as shown in the flow chart given below in figure 1a.



# Fig. 1. Flow chart for Floating Point Multiplier with pipelining

1. First step in multiply the two floating point numbers is to check whether the both multiplicand is in the range of the format according to IEEE754 standard or not as in figure 2 given below.

The value of bias exponent (E) should lie in between 0 to 254(8 bit only). If not the bias exponent is represented by a special values. For example If the value of bias exponent (Ebias) is more than 254 than its turn to not a number (NaN) on the other hand if the value of bias exponent is less than 1 its turn to zero as shown in table 1.

2. Significand multiplication is the second step of the whole operation, if only both the multiplicands come in the range of IEEE754 standard. This operation is performed on the 23 bit mantissa (or fraction) part of the numbers which gives the 48 bit result in output. The 48 bit result of the mantissa multiplication is than round off to 23 bit so that it will fit in the range of IEEE754 32 bit floating point number format.

With that rounding, normalization is also done to represent the number in scientific notation, which also increases the accuracy of the result.

3. The result of significand multiplication is normalized so that the product will have leading bit one on the left side of decimal point. If the both multiplicands are the normalized numbers so the result will have leading one either on 47 bit or on 46 bit.

• If the leading one is at 46 bit or on the left side of decimal point the product don't require any shift. In the other words we can say there will not any increment in exponent value.

For example 1 10000111 1.001011000 (normalized form)



Fig. 2. IEEE 754 standard for 32 bit

|                     | Bias Exponent            | Special Values    |
|---------------------|--------------------------|-------------------|
| Bias Exponent Range | If $E_{Bias} > 254$      | Not a number(NaN) |
| (0 to 254)          | If $E_{\text{Bias}} < 1$ | Zero              |

| Table 1. Ap | opearance | properties | of accepted | l manuscripts |
|-------------|-----------|------------|-------------|---------------|
|             | 1         | 1 1        |             |               |

| M                                          | Mmult_n00701 |              |  |  |  |  |
|--------------------------------------------|--------------|--------------|--|--|--|--|
| x(31:0) DataA(23:0)<br>y(31:0) DataB(23:0) |              | Result(47:0) |  |  |  |  |
| M                                          | mult_n007    | 01           |  |  |  |  |

Fig. 3. Schematic diagram for significand multiplier.

For example

If the leading one is at 47 bit the product require shift on right which will increment the exponent by one.

For Example 1 10000110 1001011000 (Non normalized form)

If the both multiplicand are not a normalized numbers than for the requirement of normalize the result so that there is a 1 just before the radix point (decimal point) the radix point move one place to the left increments the exponent by 1; moving one place to the right decrements the exponent by 1.

4. Next step is to round off the result obtained from significand multiplication. In rounding the intermediate product obtained from significand multiplication become of 23 bit from 46 bits.

5. Adding the exponent of two numbers is done on 8 bit part of floating point number. After addition the constant value of 127 (bias) is subtracted from the addition result.

6. The result obtained from exponent addition is checked for overflow and underflow. If overflow occurs the result turned to infinity and when underflow occurs result

turned to zero. The overflow and underflow can be compensated only under following situations

| Table 2. Overflow and underflow detection with effect of |
|----------------------------------------------------------|
| normalization                                            |

| Exponent                     | Over | Under | Compensated   |
|------------------------------|------|-------|---------------|
| Value                        |      |       | or not in     |
| (E <sub>result</sub> )       | flow | flow  | normalization |
| E >255                       | Vac  | No    | No            |
| $E_{\text{result}} \leq 255$ | 1 65 | INO   | INO           |
|                              |      |       |               |
| E <sub>result</sub> =0       | No   | Yes   | Yes           |
| F 1<0                        | No   | Ves   | No            |
| L result V                   | 110  | 103   | NO            |

## 3. Pipelining in Floating point multiplier

Time performance enhancement is the requirement of the design. It has been achieved through the VHDL pipelining. Pipelining in the proposed design is accomplished at the two different stages. These two stages pipelining involving parallel processing of the data improve the frequency rate of the clock. Hence decrease the overall time consumption of floating point multiplier and gives better speed.

The pipelining comes at two different locations is discussed below in detail in figure 3.

- One after the significand multiplier and before the exponent adder & rounding.
- Second after the exponent addition and before the overflow/underflow detection & final result of multiplication



Fig. 4. Proposed architecture with pipelining

# INTERNATIONAL JOURNAL of ENGINEERING SCIENCE AND APPLICATION Singh, Vol.2, No.4, 2018

Significand multiplier is one of the critical parts of floating point multiplier. In this unit most of the calculation takes place which decrease the overall speed of multiplier. A parallel branch in between the significand multiplier and before the exponent adder & rounding helps to overcome the problem of speed optimization. After that the other parallel branch in between the exponent addition and before the overflow/underflow detection add on in improving the clock rate ( up to 422.556MHz) and output the result in just 2.367ns. The figure 04 given below represents the maximum path delay in the design in between two parallel stages.

05 given below representing the Frequency rate, clock timing and the combinational path delay of the floating point multiplier. This result shows that the proposed design is 40%

faster response than previous deigns & 90% faster than

| Fiming constraint: D<br>Clock period: 2.36<br>Total number of par        | efault pe<br>7ns (freq<br>ths / des                     | riod analy<br>uency: 422<br>tination ;                                          | ysis fo<br>2.556MH<br>ports:            | r Clock 'GND_5_o_GND_5_o_OR_9_o'<br>z)<br>45 / 9                                                                                                                                  |
|--------------------------------------------------------------------------|---------------------------------------------------------|---------------------------------------------------------------------------------|-----------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Delay:<br>Source:<br>Destination:<br>Source Clock:<br>Destination Clock: | 2.367ns<br>aux1_47_<br>exponent<br>GND_5_0_<br>GND_5_0_ | (Levels o:<br>1 (LATCH)<br>_sum1_8 (1<br>GND_5_0_09<br>GND_5_0_09<br>GND_5_0_09 | f Logic<br>LATCH)<br>R_9_0 r<br>R_9_0 r | = 4)<br>ising<br>ising                                                                                                                                                            |
| Data Path: aux1_47                                                       | 1 to exp                                                | onent_sumi<br>Gate                                                              | 1_8<br>Net                              |                                                                                                                                                                                   |
| Cell:in->out                                                             | fanout                                                  | Delay                                                                           | Delay                                   | Logical Name (Net Name)                                                                                                                                                           |
| LDE_1:G->Q                                                               | 1                                                       | 0.475                                                                           | 0.339                                   | aux1_47_1 (aux1_47_1)                                                                                                                                                             |
| XORCY:CI->O                                                              | 8                                                       | 0.023                                                                           | 0.610                                   | Madd_GND_5_0_GND_5_0_add_9_001_Madd_cy<0> (Madd_GND_5_0_GND_5_0_add_9_001_Madd_cy<0>)<br>Madd_GND_5_0_GND_5_0_add_9_00T_Madd_xor<1> (Maub_GND_5_0_GND_5_0_aub_11_00T<8:0>_lut<1>) |
| LUT4:I1->0                                                               | 1                                                       | 0.097                                                                           | 0.355                                   | Msub_GND_5_o_GND_5_o_sub_11_OUT<8:0>_cy<5>11_SW2 (N9)                                                                                                                             |
| LUT6:15->0                                                               | 1                                                       | 0.097                                                                           | 0.000                                   | Msub_GND_5_o_GND_5_o_sub_11_OUT<8:0>_xor<8>11 (GND_5_o_GND_5_o_sub_11_OUT<8>)                                                                                                     |
| LDE_1:D                                                                  |                                                         | -0.028                                                                          |                                         | exponent_sum1_8                                                                                                                                                                   |
| Total                                                                    |                                                         | 2.367ns                                                                         | (1.062                                  | ns logic, 1.305ns route)                                                                                                                                                          |

Fig. 5. Path delay in between two pipelining stages

#### 4. Result

The resultant section of the paper is representing the different figures of results which justifying the design, performance and accuracy of the proposed design. The figure

conventional design.

| Name                   | Value                                  | 1 | 2,099,992 ps | 2,099,993.ps | 2,099,994 ps | 2,099,995 ps     | 2,099,996 ps                            | 2,099,997.ps | 2,099,998 ps | 2,099,999 ps |
|------------------------|----------------------------------------|---|--------------|--------------|--------------|------------------|-----------------------------------------|--------------|--------------|--------------|
| 🕨 🕌 x(31x1)            | 011111110111100000                     |   |              |              | 0            | 1111110111100000 | 000000000000000000000000000000000000000 |              |              |              |
| ) 🎽 y(31x0)            | 01111111100000000                      |   |              |              | 0            | 111111100000000  | 000000000000000000000000000000000000000 |              |              |              |
| 🕨 🕌 z(31x0)            | 011111111000000000                     |   |              |              | 0            | 111111100000000  | 000000000000000000000000000000000000000 |              |              |              |
| 🕨 🕌 auxt[47:0]         | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX |   |              |              |              |                  |                                         | 0000         |              |              |
| 🕨 🕌 exponent_sum1(8x0) | XXXXXXXXXX                             |   |              |              |              | XXXXXX           | άX                                      |              |              |              |
|                        |                                        |   |              |              |              |                  |                                         |              |              |              |
|                        |                                        |   |              |              |              |                  |                                         |              |              |              |

Fig. 6. Timing summary of the proposed design

| Name                   | Value                                  | 2,099,992 ps | 12,099,993 ps | 2,099,994 ps | 2,099,995 ps                            | 2,099,996 ps                            | 2,099,997 ps | 2,099,998 ps | 2,099,999 ps |
|------------------------|----------------------------------------|--------------|---------------|--------------|-----------------------------------------|-----------------------------------------|--------------|--------------|--------------|
| 🕨 🕌 x(31x0)            | 011111110111100000                     |              |               | 0            | 1111110111100000                        | 000000000000000000000000000000000000000 |              |              |              |
| ) 📲 y(31x0)            | 0111111100000000                       |              |               | 0            | 1111110000000                           | 000000000000000000000000000000000000000 |              |              |              |
| ▶ 👫 z[31x0]            | 01111111000000000                      |              |               | 0            | 11111110000000                          | 000000000000000000000000000000000000000 |              |              |              |
| 🕨 👹 auxt[47:0]         | X0000000000000000000000000000000000000 |              |               | 000000       |                                         |                                         | 00000        |              |              |
| 🕨 🙀 exponent_sum1(8:0) | XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX |              |               |              | ,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, | ¢Χ                                      |              |              |              |
|                        |                                        |              |               |              |                                         |                                         |              |              |              |
|                        |                                        |              |               |              |                                         |                                         |              |              |              |

Fig. 7. Simulation result of multiplitication of two normalized numbers

Fig. 8. Simulation result of multiplitication of two numbers results in infinity

### 5. Conclusion

The paper presents an implementation of floating point multiplier, which follow IEEE754 binary interchange format. The both normalization and rounding techniques are implemented in the design. The proposed design gives the output in latency of two clock cycles and the frequency rate of 422.556MHz is achieved. The whole designed is targeted to Xilinx vertex 5 FPGA.

## References

- [1] IEEE 754-2008, IEEE Standard for Floating-Point Arithmetic, 2008
- [2] Mohamed Al-Ashrafy, Ashraf Salem and Wagdy Anis, "An efficient implementation of floating point multiplier" 2011 IEEE

- [3] H. Poor, An Introduction to Signal Detection and Estimation, New York: Springer-Verlag, 1985, ch. 4. (Book Chapter)
- [4] B.Sreenivasa Ganesh, J.E.N. Abhilash and G.Rajesh Kumar, " *Design and implemntation of floating point multiplier for better response*" IJARCET,vol 1, issue 17, sep 2012.
- [5] Naresh Grover, M.K Soni., "Design of FPGA based 32 bit floating point airthmetic unit and verfication of its VHDL code using MATLAB" I.J information engineering and electronics business ,feb 2014,MECS press
- [6] Ameya deshmukh and Pooja hatwalne, "Design and implementatio time efficient floating point multiplier using VHDL, international journal of latestest trend in engineering and technology, vol 8, issue 03, pp 084-090.