
A New High Radix-$2^r$ (r ≥ 8) Multibit Recoding Algorithm for Large Operand Size (N ≥ 32) Multipliers

A. K. Oudjida, +3 more

- 01 Apr 2013 -

Journal of Low Power Electronics

- Vol. 9, Iss: 1, pp 50-62

TLDR

A new recursive recoding algorithm is proposed that shortens the critical path of the multiplier, reduces the hardware complexity of the partial-product generators, and provides an optimal space/time partitioning of the multiplier architecture for any operand size N.

Abstract:

This paper addresses the problem of multiplication with large operand sizes (N≥32). We propose a new recursive recoding algorithm that shortens the critical path of the multiplier and also reduces the hardware complexity of the partial-product generators. The new recoding algorithm provides an optimal space/time partitioning of the multiplier architecture for any size N of the operands. As a result, the critical path is drastically reduced to $3\sqrt[3]{N/2}-3$ adder stages with no area overhead in comparison to the modified Booth algorithm, which shows a critical path of N/2 adder stages. For instance, only 7 adder stages are needed for a 64-bit two's complement multiplier. Compared with reference algorithms for N=64, gain ratios of 1.62, 1.71, and 2.64 are obtained in terms of multiply-time, energy consumption per multiply-operation, and total gate count, respectively.

HAL Id: hal-00872326

https://hal.archives-ouvertes.fr/hal-00872326

Submitted on 11 Oct 2013

A New High Radix-$2^r$ (r ≥ 8) Multibit Recoding Algorithm for Large Operand Size (N ≥ 32) Multipliers

Abdelkrim K. Oudjida, Nicolas Chaillet, Mohamed L. Berrandjia, Ahmed Liacha

To cite this version:

Abdelkrim K. Oudjida, Nicolas Chaillet, Mohamed L. Berrandjia, Ahmed Liacha. A New High Radix-$2^r$ (r ≥ 8) Multibit Recoding Algorithm for Large Operand Size (N ≥ 32) Multipliers. Journal of Low Power Electronics, American Scientific Publishers, 2013, 9, pp.50-62. hal-00872326

Index Terms: High-Radix Multiplication, Low-Power Multiplication, Multibit Recoding Multiplication, Partial Product Generator (PPG), Register-Transfer-Level (RTL)

I. BACKGROUND AND MOTIVATION

In multiplication-intensive applications, such as digital signal processing or process control, multiply-time is a critical factor that limits the performance of the whole system. When these types of applications are embedded, energy consumption per multiply operation becomes an additional critical issue. Furthermore, in large-operand-size applications (N≥32), a scalable architecture is essential to ensure a linear increase O(N) of multiply-time while multiplier size grows quadratically, $O(N^2)$, with operand bit-length N. Consequently, high speed, low power, and high scalability are the three major requirements for today's general-purpose multipliers [1].

However, large-operand-size multipliers are very time consuming. To comply with the time constraint of a given application, we need a multiplication algorithm that allows, to some extent, a parameterized reduction (N/r) of the multiply-time without sacrificing area. This is achieved if, and only if, the total critical path can be properly shortened by reducing the number of partial products (PPs) and exploiting inherent parallelism. Theoretically, only the signed multibit recoding multiplication algorithm [2] is capable of such a drastic reduction (N/r) of the PP number, given that r+1 is the number of bits of the multiplier that are simultaneously treated (1<r≤N/2). Unfortunately, this algorithm requires the pre-computation of a number of odd-multiples of the multiplicand (up to $(2^{r-1}-1).X$) that grows rapidly with r: it doubles each time r increases by one (Table I). The large number of odd-multiples not only requires a considerable amount of multiplexers to perform the necessary complex recoding into partial product generators (PPG), but dramatically increases the routing density as well. Therefore, a reverse effect occurs that offsets the speed and power benefits of the compression factor N/r. This is the main reason why the multibit recoding algorithm was abandoned. Moreover, commercial designs in industry do not exceed r=4 (radix-16). A hybrid radix-4/-8 is proposed in [3] for low-power multimedia applications. To increase the speed of the multiplier, most older processors employed radix-8, such as Fchip [4], IBM S/390 [5], Alpha RISC [6], IA-32 [7], and AMD K7 [8], while radix-16 is used only in the most recent Intel processors: Intel 64 and IA-32 [9], and Itanium-Poulson [10].

In research, the highest-radix algorithms are proposed in the works of Seidel et al. [11] and Dimitrov et al. [12]. Both works rely upon advanced arithmetic to determine minimal number-bases that are representative of the digits resulting from larger multibit recoding. The objective is to eliminate information redundancy inside r+1 bit-length slices for a more compact PPG. This is achievable as long as no, or just very few, odd-multiples are required.

Seidel introduced a secondary recoding of digits issued from an initial multibit recoding for 5≤r≤16. The recoding scheme is based on a balanced complete residue system. Though it significantly reduces the number of partial products (N/r for 5≤r≤16), it requires some odd-multiples for r≥8. Dimitrov proposed a new recoding scheme based on the double-base number system for 6≤r≤11. The algorithm is limited to unsigned multiplication and requires a larger number of odd-multiples. Both algorithms [11][12] require a PPG that includes a number of adders to accumulate intermediary partial products corresponding to recoded elementary digits.

In fact, odd-multiples are not the only obstacle to a compact PPG. Recoding large slices (r≥8) in a mono-bloc PPG, such as in [11][12], requires the use of an RTL “case statement” with r+1 entries. In this case, $2^{r+1}$ combinations must be processed, which leads to a huge amount of multiplexer resources. Thus, mono-bloc PPG recoding is incompatible with the high-radix (r≥8) approach, whose purpose is to reduce the multiply-time (N/r) of large-operand-size (N≥32) multipliers.

The objective of this paper is to overcome the two above-mentioned shortcomings. To achieve this goal, the multibit recoding multiplication algorithm [2] is revisited. Its design space is extended by the introduction of a new recursive version that solves the hard problem of radix-$2^r$ two's complement multiplication for any value of r. The solution consists essentially in dividing the high radix-$2^r$ mono-bloc PPG$_j$ (Fig. 1.a) into a number of lower sub-radix-$2^s$ odd-multiple free PPG$_{ji}$ (Fig. 1.b), such that s is a divisor of r. The direct benefits of the partitioning of Fig. 1.b are:

• there is no need to pre-compute odd-multiples of the multiplicand, which drastically reduces the required amount of hardware resources and routing;

• since the size of a PPG$_{ji}$ entry is much smaller than that of a PPG$_j$ entry (s≤r/2), the total multiplexing logic required by RTL “case statements” to recode the entries is greatly reduced;

• the possibility to simultaneously process larger bit slices (r≥16) radically shortens the critical path in terms of adder levels, especially for very large operand sizes (N≥64).

Fig. 1. Generalized N×N bit radix-$2^r$ parallel multiplier. (a) Critical path in conventional [2][4][5][6][7][8] and recent [3][9][10][11][12] radix-$2^r$ multipliers. $O(X) = \{3X, 5X, \ldots, (2^{r-1}-1)X\}$ is the necessary set of odd-multiples corresponding to radix-$2^r$ recoding. PPG$_j$ in [11][12] includes a number of adders to accumulate intermediary partial products. (b) Critical path in our proposed radix-$2^r$ multipliers. Main features are: no odd-multiples, much more compact PPG$_j$, much shorter critical path. $2^r$ is the main radix and $2^s$ is the sub-radix. PP: Partial Product.

A.K. Oudjida, N. Chaillet, M.L. Berrandjia, and A. Liacha

(1) Centre de Développement des Technologies Avancées, Algiers, Algeria

(2) Institut FEMTO-ST, Besançon, France

Guided by accurate area heuristics, the final result of an optimization process, gradually undertaken in this paper, delivers for each value of N (N=8..8192) the appropriate radix-$2^r$ (r=8..512) and sub-radix-$2^s$ (s=4..32) that lead to the architecture with the shortest critical path ($3\sqrt[3]{N/2}-3$) in adder stages. The couple (r,s) serves to partition the architecture so that maximum parallelism is exploited. As for area, our proposed architectures require as many hardware resources as the modified Booth algorithm [13], whose critical path is N/2 [14][15][16][17]. For instance, a 64-bit two's complement finely pipelined multiplier requires a latency of only seven clock cycles (critical path composed of a series of 7 adders). FPGA implementation on a Virtex-6 circuit of our 64-bit two's complement radix-$2^{32}$ multiplier shows important gain ratios over the Seidel [11] and Dimitrov [12] radix-$2^8$ algorithms: 1.62, 1.71, 2.64 over Seidel and 1.83, 1.71, 3.32 over Dimitrov, in terms of multiply-time, energy consumption per multiply-operation, and total gate count, respectively.
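As a quick numerical check of the figures quoted above, the following Python sketch evaluates the critical-path expression $3\sqrt[3]{N/2}-3$ next to the N/2 adder stages of the modified Booth algorithm. Rounding up to a whole number of adder stages is an assumption made here for illustration, as are the function names; the sketch does not compute the optimal (r,s) partitioning itself.

```python
import math


def proposed_adder_stages(n: int) -> int:
    """Critical path 3*cbrt(N/2) - 3, rounded up to whole adder stages (assumption)."""
    return math.ceil(3 * (n / 2) ** (1 / 3) - 3)


def modified_booth_adder_stages(n: int) -> int:
    """Critical path of the modified Booth algorithm as stated in the text: N/2."""
    return n // 2


if __name__ == "__main__":
    # Print N, proposed stages, modified Booth stages for a few operand sizes.
    for n in (32, 64, 128, 256):
        print(n, proposed_adder_stages(n), modified_booth_adder_stages(n))
```

For N=64 the first function returns 7, which matches the seven adder stages stated in the text, against 32 for the modified Booth algorithm.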

The paper is organized as follows. Section I outlines the main requirement specifications for a generalized radix-$2^r$ multiplication. Section II introduces the new recursive multibit recoding multiplication algorithm, illustrated by two high-radix ($2^8$ and $2^{16}$) recoding examples in Section III. Section IV introduces some preliminary steps toward an optimal partitioning of the multiplier architecture, while the optimal partitioning is presented in Section V. Section VI compares and discusses the implementation results. Finally, Section VII provides some concluding remarks and suggestions for future work.

II. THE NEW RECURSIVE MULTIBIT RECODING MULTIPLICATION ALGORITHM

Equation (2.1.2) of the original multibit recoding algorithm presented in [2] does not offer hardware visibility. Let us rewrite it in a simpler, hardware-friendly form, as follows:

$$Y = \sum_{j=0}^{(N/r)-1} \left( y_{rj-1} + 2^{0} y_{rj} + 2^{1} y_{rj+1} + 2^{2} y_{rj+2} + \cdots + 2^{r-2} y_{rj+r-2} - 2^{r-1} y_{rj+r-1} \right) 2^{rj} = \sum_{j=0}^{(N/r)-1} Q_j\, 2^{rj} \qquad (1)$$

where $y_{-1} = 0$ and $r \in \mathbb{N}$. For simplicity and without loss of generality, we assume that r is a divisor of N.

In equation (1), the two's complement representation of the multiplier Y is split into N/r two's complement slices ($Q_j$), each of r+1 bit length. Each pair of contiguous slices has one overlapping bit. In the literature, equation (1) is referred to as the radix-$2^r$ equation, to which corresponds a digit set $D(2^r)$ such that $Q_j \in D(2^r) = \{-2^{r-1}, \ldots, 0, \ldots, 2^{r-1}\}$.
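The radix-$2^r$ recoding can be sketched in a few lines of Python. This is only an illustrative software model, not the RTL discussed in the paper: each digit $Q_j$ is formed from the overlapping bit $y_{rj-1}$, the r-1 positively weighted bits, and the negatively weighted top bit $y_{rj+r-1}$, with $y_{-1}=0$. The function names `recode_radix_2r` and `value_of` are hypothetical.

```python
def recode_radix_2r(y: int, n: int, r: int) -> list[int]:
    """Split the n-bit two's complement value y into n/r signed digits Q_j."""
    if n % r != 0:
        raise ValueError("r must divide n")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    # y_0 .. y_{n-1}; Python's >> sign-extends negative integers.
    bits = [(y >> k) & 1 for k in range(n)]

    def bit(k: int) -> int:
        return 0 if k < 0 else bits[k]  # y_{-1} = 0

    digits = []
    for j in range(n // r):
        base = r * j
        q = bit(base - 1)                                    # overlapping bit
        q += sum(bit(base + k) << k for k in range(r - 1))   # 2^0 .. 2^(r-2) weights
        q -= bit(base + r - 1) << (r - 1)                    # negative top weight
        digits.append(q)
    return digits


def value_of(digits: list[int], r: int) -> int:
    """Recombine the digits: sum of Q_j * 2^(r*j)."""
    return sum(q << (r * j) for j, q in enumerate(digits))


if __name__ == "__main__":
    print(recode_radix_2r(-77, 8, 4), value_of(recode_radix_2r(-77, 8, 4), 4))
```

With r=1 the same function reduces to Booth recoding, and with r=2 to modified Booth recoding, since both are special cases of the same slice formula.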

Thus, the signed multiplication between X and Y becomes:

$$X \cdot Y = \sum_{j=0}^{(N/r)-1} X \cdot Q_j\, 2^{rj} \qquad (2)$$

where each partial product can be expressed as $X \cdot Q_j \cdot 2^{rj} = (-1)^{e} \cdot 2^{f} \cdot (m \cdot X)$, with $m \in O(2^r) = \{1, 3, \ldots, 2^{r-1}-1\}$. $O(2^r)$ represents the required set of odd-multiples of the multiplicand (m.X) for radix-$2^r$. Hence, the partial-product generation process consists first in selecting one odd-multiple (m.X) among the whole set of pre-computed odd-multiples, which is then submitted to a hardwired shift of f positions, and finally conditionally complemented, $(-1)^e$, depending on the sign bit e of the $Q_j$ term.

Table I shows how the number of odd-multiples grows as the radix becomes higher. While lower odd-multiples can be obtained using just one addition (3X=2X+1X), the calculation of higher ones may require several computation steps (11X=8X+2X+1X).

To bypass the hard problem of odd-multiples, we exploit the fact that the N+1 bit-length two's complement multiplier Y, on which equation (1) is applied, is composed of a series (N/r) of r+1 bit-length two's complement slices ($Q_j$ digits), on which equation (1) can be recursively applied again. Based on this observation, let us state the following two theorems, whose proofs are given in the Appendix.

TABLE I
MAIN FEATURES OF THE MULTIBIT RECODING MULTIPLICATION ALGORITHM

| Radix | Nbr. of Partial Products | Odd Multiples (m.X) |
|---|---|---|
| $2^1$ | N | 1X |
| $2^2$ | N/2 | 1X |
| $2^3$ | N/3 | 1X, 3X |
| $2^4$ | N/4 | 1X, 3X, 5X, 7X |
| $2^5$ | N/5 | 1X, 3X, 5X, 7X, 9X, 11X, 13X, 15X |

$|O(2^{r+1})| = 2 \times |O(2^r)|$. In radix-$2^r$, the multiplier Y is divided into N/r slices, each of r+1 bit length. Each pair of contiguous slices has one overlapping bit.
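The odd-multiple column of Table I can be reproduced with a short Python sketch, assuming that the required odd-multiples are the odd parts of the magnitudes 1 to $2^{r-1}$ in the digit set $D(2^r)$. The helper `partial_product_terms` additionally illustrates the sign/shift/odd-multiple form $(-1)^e \cdot 2^f \cdot m$ of a digit weighted by $2^{rj}$. All function names are hypothetical.

```python
def odd_part(q: int) -> int:
    """Largest odd divisor of |q| (0 when q is 0)."""
    q = abs(q)
    while q and q % 2 == 0:
        q //= 2
    return q


def odd_multiples(r: int) -> list[int]:
    """Odd multiples m needed by the digit set D(2^r) = {-2^(r-1), ..., 2^(r-1)}."""
    top = 1 << (r - 1)
    return sorted({odd_part(q) for q in range(1, top + 1)})


def partial_product_terms(q: int, r: int, j: int) -> tuple[int, int, int]:
    """Return (e, f, m) such that q * 2^(r*j) == (-1)^e * 2^f * m."""
    if q == 0:
        return 0, 0, 0
    m = odd_part(q)
    # Shift inside the digit (log2 of the power-of-two factor) plus slice offset r*j.
    f = (abs(q) // m).bit_length() - 1 + r * j
    e = 1 if q < 0 else 0
    return e, f, m


if __name__ == "__main__":
    for r in range(1, 6):
        print(f"radix-2^{r}:", [f"{m}X" for m in odd_multiples(r)])
```

From r=2 onward the list length doubles with each increment of r, which is the growth that makes high-radix mono-bloc recoding costly.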

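Theorem 1. Any digit $Q_j \in D(2^r)$ can be represented as a combination of digits $P_{ji} \in D(2^s)$, such that s is a divisor of r.

When theorem (1) is applied to equation (1), it gives:

$$Y = \sum_{j=0}^{(N/r)-1} \left[ \sum_{i=0}^{(r/s)-1} P_{ji}\, 2^{si} \right] 2^{rj} \qquad (3)$$

where $P_{ji} \in D(2^s) = \{-2^{s-1}, \ldots, 0, \ldots, 2^{s-1}\}$ with $O(2^s) = \{1, 3, \ldots, 2^{s-1}-1\}$, and

$$X \cdot Y = \sum_{j=0}^{(N/r)-1} \left[ \sum_{i=0}^{(r/s)-1} X \cdot P_{ji}\, 2^{si} \right] 2^{rj} \qquad (4)$$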
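The recursive step behind Theorem 1 can be illustrated with a minimal Python sketch: an (r+1)-bit slice, given as a list of bits whose first element is the overlapping bit, is cut into r/s overlapping (s+1)-bit sub-slices, and each sub-slice is evaluated with the same signed-digit rule as equation (1). The names `digit` and `split_digit` are hypothetical, and the list-of-bits representation is an assumption made for readability.

```python
from itertools import product


def digit(window: list[int]) -> int:
    """Signed value of an (s+1)-bit window [overlap, b_0, ..., b_(s-1)]."""
    s = len(window) - 1
    value = window[0]                                        # overlapping bit
    value += sum(window[k + 1] << k for k in range(s - 1))   # positive weights
    value -= window[s] << (s - 1)                            # negative top weight
    return value


def split_digit(window: list[int], s: int) -> list[int]:
    """Cut an (r+1)-bit window into r/s sub-digits P_i with Q = sum(P_i * 2^(s*i))."""
    r = len(window) - 1
    if r % s != 0:
        raise ValueError("s must divide r")
    return [digit(window[s * i: s * i + s + 1]) for i in range(r // s)]


if __name__ == "__main__":
    # Exhaustively confirm the identity for every 8+1 bit slice and several s.
    for bits in product((0, 1), repeat=9):
        for s in (1, 2, 4, 8):
            parts = split_digit(list(bits), s)
            assert sum(p << (s * i) for i, p in enumerate(parts)) == digit(list(bits))
    print("Q_j equals the weighted sum of its sub-digits for all 8+1 bit slices")
```

The identity holds because the negatively weighted top bit of one sub-slice and the overlapping bit of the next sub-slice combine into a single positive weight, exactly as between neighbouring slices in equation (1).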

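Theorem 2. Any digit $Q_j \in D(2^r)$ can be represented as a combination of digits $P_{ji}$ and $T_{ji}$ such that $P_{ji} \in D(2^s)$ and $T_{ji} \in D(2^t)$, with s+t a divisor of r and t < s.

Likewise, when theorem (2) is applied to equation (1), we obtain:

$$Y = \sum_{j=0}^{(N/r)-1} \left[ \sum_{i=0}^{r/(s+t)-1} \left( P_{ji} + 2^{s}\, T_{ji} \right) 2^{(s+t)i} \right] 2^{rj} \qquad (5)$$

where $P_{ji} \in D(2^s) = \{-2^{s-1}, \ldots, 0, \ldots, 2^{s-1}\}$ with $O(2^s) = \{1, 3, \ldots, 2^{s-1}-1\}$, and $T_{ji} \in D(2^t) = \{-2^{t-1}, \ldots, 0, \ldots, 2^{t-1}\}$ with $O(2^t) = \{1, 3, \ldots, 2^{t-1}-1\}$, and

$$X \cdot Y = \sum_{j=0}^{(N/r)-1} \left[ \sum_{i=0}^{r/(s+t)-1} \left( X \cdot P_{ji} + 2^{s}\, X \cdot T_{ji} \right) 2^{(s+t)i} \right] 2^{rj} \qquad (6)$$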

Theorems (1) and (2) allow an exponential reduction ($1/2^{ks}$ and $1/2^{k(s+t)}$, resp.) of the number of odd-multiples in equations (4) and (6) in comparison to equation (2), but at the expense of a linear increase (ks-1 and k(s+t)-1, resp.) in the number of additions. The advantage by far outweighs the cost, as practically shown in the next section.

The translation of equation (4) into architecture is depicted in Fig. 1.b, where each PPG$_j$ ($Q_j$) is built up using r/s identical PPG$_{ji}$ ($P_{ji}$). This is not the case for equation (6), which requires two different PPG$_{ji}$ ($P_{ji}$ and $T_{ji}$). Theorems (1) and (2) can be merged together to produce a PPG$_j$ made of a number of different PPG$_{ji}$ ($P_{ji}$, $T_{ji}$, ...). This is the general case that is thoroughly studied in the next sections in order to determine the optimal multiplier.

III. TWO HIGH RADIX ($2^8$ AND $2^{16}$) ILLUSTRATIVE EXAMPLES

Theorems (1) and (2) make it possible to build up any high radix-$2^r$ multiplication algorithm based on lower sub-radices, employing far fewer odd-multiples. The objective hereafter is to generate high radix-$2^r$ multiplication without odd-multiples for a maximum reduction of multiplexer complexity inside PPG$_j$. To achieve this goal, a number of odd-multiple free low-radix algorithms are used, such as the Booth algorithm (radix-$2^1$) [18], the modified Booth algorithm (radix-$2^2$) [13], and the Seidel et al. algorithms (radix-$2^5$ and radix-$2^8$) [11][19]. Booth and modified Booth recoding (McSorley algorithm [13]) can be derived from equation (3) for (r,s)=(1,1) and (r,s)=(2,2), respectively. They are respectively summarized as follows:

$$Y = \sum_{j=0}^{N-1} \left( y_{j-1} - y_{j} \right) 2^{j} = \sum_{j=0}^{N-1} Q_j\, 2^{j} \qquad (7)$$

with $D(2^1) = \{-1, 0, 1\}$ and $O(2^1) = \{1\}$;

$$Y = \sum_{j=0}^{(N/2)-1} \left( y_{2j-1} + y_{2j} - 2\, y_{2j+1} \right) 2^{2j} = \sum_{j=0}^{(N/2)-1} Q_j\, 2^{2j} \qquad (8)$$

with $D(2^2) = \{-2, -1, 0, 1, 2\}$ and $O(2^2) = \{1\}$.

Seidel radix-$2^5$ recoding [11][19] is described as follows:

$$Y = \sum_{j=0}^{(N/5)-1} \left[ 7\, Q_j + P_j \right] 2^{5j} \qquad (9)$$

with $Q_j \in \{-2, -1, 0, 1, 2\}$; $P_j \in \{-4, -2, -1, 0, 1, 2, 4\}$ and $O(2^5) = \{1\}$.

Seidel radix-$2^8$ recoding is given by the following equation:

$$Y = \sum_{j=0}^{(N/8)-1} \left[ 11^{2}\, Q_j + 11\, P_j + T_j \right] 2^{8j} \qquad (10)$$

with $Q_j \in \{-2, -1, 0, 1, 2\}$; $P_j, T_j \in \{-16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16\}$ and $O(2^8) = \{1\}$. Note that while equations (9) and (10) are odd-multiple free, since all included digits are powers of 2, they require a post-accumulation to deal with the odd numbers (7, 11 and 121). Thus, a number of extra adders are needed.
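A brute-force Python check can confirm that the digit sets quoted for the two Seidel recodings are large enough, i.e. that every value of $D(2^5)$ can be written as $7Q+P$ and every value of $D(2^8)$ as $11^2 Q + 11P + T$ with power-of-two digits. This is only an enumeration over the sets as listed in the text; it is not the recoding table used in [11][19], and the names below are hypothetical.

```python
from itertools import product

Q_SET = (-2, -1, 0, 1, 2)
P5_SET = (-4, -2, -1, 0, 1, 2, 4)
PT8_SET = (-16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16)


def covers_digit_set(r: int, values: set[int]) -> bool:
    """True if `values` contains every digit of D(2^r) = {-2^(r-1), ..., 2^(r-1)}."""
    top = 1 << (r - 1)
    return set(range(-top, top + 1)) <= values


# All values reachable by the radix-2^5 form 7*Q + P.
radix_2_5 = {7 * q + p for q, p in product(Q_SET, P5_SET)}
# All values reachable by the radix-2^8 form 121*Q + 11*P + T.
radix_2_8 = {121 * q + 11 * p + t for q, p, t in product(Q_SET, PT8_SET, PT8_SET)}

if __name__ == "__main__":
    print("radix-2^5 covered:", covers_digit_set(5, radix_2_5))
    print("radix-2^8 covered:", covers_digit_set(8, radix_2_8))
```

Every digit in these forms is 0 or a signed power of two, so each product with X is a shift; the multiplications by 7, 11, and 121 are what the post-accumulation adders implement.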

Optimized higher radices are obtained as follows.

A. Our new radix-$2^8$ recoding

Based on theorem (2), each 8+1 bit slice is split into 5+1, 2+1, and 1+1 overlapping slices using the Seidel radix-$2^5$, McSorley radix-$2^2$, and Booth radix-$2^1$ algorithms, respectively. The new recoding is given by the following equation:

$$Y = \sum_{j=0}^{(N/8)-1} \left[ \left( 7\, Q_j + P_j \right) + 2^{5} R_j + 2^{7} S_j \right] 2^{8j} \qquad (11)$$

with $Q_j \in \{-2, -1, 0, 1, 2\}$; $P_j \in \{-4, -2, -1, 0, 1, 2, 4\}$; $R_j \in \{-2, -1, 0, 1, 2\}$; $S_j \in \{-1, 0, 1\}$ and $O(2^8) = \{1\}$.
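The 5+1 / 2+1 / 1+1 split of an 8+1 bit slice can be modelled in Python as follows. The sketch assumes the slice is a list of nine bits starting with the overlapping bit, and it finds the pair (Q, P) with $7Q+P$ equal to the 5+1 bit sub-digit by table lookup built through exhaustive search, which stands in for the actual Seidel recoding logic. The names `digit`, `SEIDEL_5`, and `recode_radix_2_8` are hypothetical.

```python
from itertools import product

Q_SET = (-2, -1, 0, 1, 2)
P_SET = (-4, -2, -1, 0, 1, 2, 4)

# Lookup table value -> (Q, P) with 7*Q + P == value (first match kept).
SEIDEL_5: dict[int, tuple[int, int]] = {}
for q, p in product(Q_SET, P_SET):
    SEIDEL_5.setdefault(7 * q + p, (q, p))


def digit(window: list[int]) -> int:
    """Signed value of an (s+1)-bit window [overlap, b_0, ..., b_(s-1)]."""
    s = len(window) - 1
    value = window[0]
    value += sum(window[k + 1] << k for k in range(s - 1))
    value -= window[s] << (s - 1)
    return value


def recode_radix_2_8(window: list[int]) -> tuple[int, int, int, int]:
    """Return (Q, P, R, S) with digit(window) == (7Q + P) + R*2^5 + S*2^7."""
    if len(window) != 9:
        raise ValueError("expected an 8+1 bit slice")
    q, p = SEIDEL_5[digit(window[0:6])]   # 5+1 bits, Seidel radix-2^5
    r_digit = digit(window[5:8])          # 2+1 bits, McSorley radix-2^2
    s_digit = digit(window[7:9])          # 1+1 bits, Booth radix-2^1
    return q, p, r_digit, s_digit


if __name__ == "__main__":
    example = [1, 0, 1, 1, 0, 1, 0, 1, 1]
    print(recode_radix_2_8(example), digit(example))
```

Each returned digit is 0 or a signed power of two, so the corresponding partial products need only shifts and conditional negation of X.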

B. Our new radix-$2^{16}$ recoding

Likewise, using theorem (2), each 16+1 bit slice is split into 8+1, 5+1, 2+1, and 1+1 overlapping slices using the Seidel radix-$2^8$ and radix-$2^5$, McSorley radix-$2^2$, and Booth radix-$2^1$ algorithms, respectively. The new recoding is described by the following equation:

$$Y = \sum_{j=0}^{(N/16)-1} \left[ \left( 11^{2}\, Q_j + 11\, P_j + T_j \right) + \left( 7\, R_j + S_j \right) 2^{8} + 2^{13}\, U_j + 2^{15}\, V_j \right] 2^{16j} \qquad (12)$$

with $Q_j \in \{-2, -1, 0, 1, 2\}$; $P_j, T_j \in \{-16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16\}$; $R_j \in \{-2, -1, 0, 1, 2\}$; $S_j \in \{-4, -2, -1, 0, 1, 2, 4\}$; $U_j \in \{-2, -1, 0, 1, 2\}$; $V_j \in \{-1, 0, 1\}$ and $O(2^{16}) = \{1\}$.

In our preceding work [20], we pursued this combination process further and generated a series of higher-radix recoding schemes with $O(2^r) = \{1\}$. However, what still remains unknown is how to determine, for a given N value, the proper radix ($2^r$) that leads to the optimal architecture.

The translation of equations (11) and (12) into architectures is depicted in Fig. 2.a and 2.b, respectively.

All Dimitrov algorithms developed in [12] are unsigned. For an equitable comparison, we had to develop a new two's complement radix-$2^8$ recoding version with $O(2^8) = \{1, 3, 5, 7\}$, based on the Dimitrov unsigned radix-$2^7$ recoding (mult_7b2d in [12]) with $O(2^7) = \{1, 3, 5, 7\}$. The new recoding is:

$$Y = \sum_{j=0}^{(N/8)-1} \left[ (-1)^{e}\, 2^{k}\, Q_j + (-1)^{e}\, 2^{h}\, P_j \right] 2^{8j} \qquad (13)$$

with $Q_j, P_j \in \{1, 3, 5, 7\}$; $k, h \in \{0, 1, 2, 3, 4, 5, 6, 7\}$ and $e \in \{0, 1\}$.

For the comparative study, our proposed algorithms (eq. 11 and 12) as well as the Seidel and Dimitrov algorithms (eq. 10 and 13, resp.) are first analytically characterized and then physically implemented.

C. Analytical characterization of area and speed

Prior to implementation, we need to develop a generalized theoretical model that predicts the area and speed features of each recoding algorithm with respect to the N and r values.

1) Area

Three basic components are necessary for the implementation of RTL multipliers:

• multiplexers (Mux1) to recode the digit terms ($Q_j$, $P_j$, …) included in the recoding expression;

• shifters (Mux2) for partial product generation;

• and adders for partial product summation.

Whereas the exact number of adders can be known in advance, we need to develop heuristics for the other two.

The total multiplexer complexity (Mux1) of a radix-$2^r$ multiplier depends on:

• the number (N/r) of PPG$_j$;

• the number (i) of lower sub-radices used to build up the higher radix-$2^r$. To each sub-radix-$2^s$ used (PPG$_{ji}$) corresponds an RTL “case statement” that recodes the digit terms ($Q_j$, $P_j$, …) present in the equation;

• the number of entries ($e_s$+1) in each “case statement” corresponding to each sub-radix-$2^s$;

• the number ($d_s$) of digit terms ($Q_j$, $P_j$, …) that figure in each “case statement”;

• and the number of necessary odd-multiples ($|O_s|$) used to calculate the digit terms.

Hence, we can state that:

$$Mux1 = \frac{N}{r} \sum_{s=1}^{i} 2^{e_s+1} \cdot d_s \cdot |O_s|$$

For the Dimitrov algorithm (eq. 13), this gives: r=8, i=1, $e_1$=8, $d_1$=2, and $|O_1|$=4. Thus, Mux1 = (N/8)·$2^9$·2·4 = 512N.

The synthesis of the RTL “shift statement” infers multiplexers whose complexity depends on the number ($p_{sj}$) of different shift positions for all odd-multiples involved in the calculation of each digit term (j). Thus, we can write:

$$Mux2 = \frac{N}{r} \sum_{s=1}^{i} \sum_{j} p_{sj} \cdot |O_{sj}|$$

For the Dimitrov algorithm (eq. 13), this gives: r=8, i=1, j=2, $p_{11}$=$p_{12}$=8, and $|O_{11}|$=$|O_{12}|$=4. Thus, Mux2 = (N/8)·(8·4+8·4) = 8N. Hence, the total multiplexer complexity becomes: $Mux_T$ = Mux1+Mux2 = 520N.
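The two multiplexer heuristics can be evaluated with a small Python sketch. The closed forms used here, Mux1 = (N/r)·Σ 2^(e+1)·d·|O| and Mux2 = (N/r)·ΣΣ p·|O|, are a reading of the heuristics that reproduces the 512N and 8N figures quoted for the Dimitrov recoding; they should be treated as an assumption rather than as the authors' exact expressions. Function and parameter names are hypothetical.

```python
from fractions import Fraction


def mux1(n: int, r: int, sub_radices: list[tuple[int, int, int]]) -> Fraction:
    """Recoding multiplexers; each sub-radix is described by (e, d, |O|)."""
    return Fraction(n, r) * sum(2 ** (e + 1) * d * o for e, d, o in sub_radices)


def mux2(n: int, r: int, shift_terms: list[list[tuple[int, int]]]) -> Fraction:
    """Shifter multiplexers; each sub-radix lists (p, |O|) per digit term."""
    return Fraction(n, r) * sum(p * o for terms in shift_terms for p, o in terms)


if __name__ == "__main__":
    n = 64
    # Dimitrov recoding (eq. 13): r=8, one sub-radix, e=8, d=2, |O|=4,
    # and two digit terms with p=8 shift positions and |O|=4 each.
    m1 = mux1(n, 8, [(8, 2, 4)])
    m2 = mux2(n, 8, [[(8, 4), (8, 4)]])
    print(m1, m2, m1 + m2)
```

For N=64 this yields 512·64, 8·64, and 520·64 for Mux1, Mux2, and their sum, in line with the per-N coefficients given in the text.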

An N-bit radix-$2^r$ multiplier generates N/r PP. Thus, the total number of adders comprises:

• (N/r − 1) adders to sum the N/r PP;

• plus the necessary adders inside each PPG$_j$ to accumulate the intermediate PP issuing from PPG$_{ji}$;

• plus a number of adders included inside each PPG$_{ji}$, depending on the recoding scheme used.

For example, in the Seidel algorithm (eq. 10), the term $11^2 Q_{ji} + 11 P_{ji} + T_{ji}$ is calculated as $(2^7 Q_{ji} - 2^3 Q_{ji} + Q_{ji}) + (2^3 P_{ji} + 2^2 P_{ji} - P_{ji}) + T_{ji}$, which requires 6 adders for the post-accumulation operation [11][19]. Hence, the total number of necessary adders is:

$$Add_T = (N/8 - 1) + 6\,(N/8) = 7N/8 - 1$$
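The shift-and-add decomposition of the Seidel term and the resulting adder count can be checked with a few lines of Python. The sketch assumes the decomposition 121 = 2^7 − 2^3 + 1 and 11 = 2^3 + 2^2 − 1, i.e. seven shifted terms combined by six adders; the function names are hypothetical.

```python
from itertools import product

Q_SET = (-2, -1, 0, 1, 2)
PT_SET = (-16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16)


def seidel_term(q: int, p: int, t: int) -> int:
    """Shift-and-add form of 121*q + 11*p + t: seven terms, six additions."""
    return ((q << 7) - (q << 3) + q) + ((p << 3) + (p << 2) - p) + t


def total_adders(n: int, r: int = 8, adders_per_ppg: int = 6) -> int:
    """(N/r - 1) adders to sum the PPs plus the adders inside each of the N/r PPGs."""
    return (n // r - 1) + adders_per_ppg * (n // r)


if __name__ == "__main__":
    assert all(seidel_term(q, p, t) == 121 * q + 11 * p + t
               for q, p, t in product(Q_SET, PT_SET, PT_SET))
    print(total_adders(64))  # equals 7*64/8 - 1
```

For N=64 the count is 55 adders, which agrees with 7N/8 − 1.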

Fig. 2. Two's complement 64×64 bit multiplier. (a) Radix-$2^8$ multiplier. Space partitioning according to equation (11): each 8+1 bit slice is divided into the overlapping bit ranges (8j−1, 8j+4), (8j+4, 8j+6), and (8j+6, 8j+7). (b) Radix-$2^{16}$ multiplier. Space partitioning according to equation (12): each 16+1 bit slice is divided into the overlapping bit ranges (16j−1, 16j+7), (16j+7, 16j+12), (16j+12, 16j+14), and (16j+14, 16j+15). The product occupies bits 127 to 0. Critical path: $Del_T = N/r - 1 + Del + d_s$, where $Del_T$ is the delay in adder levels of the total critical path, Del is the delay in adder levels inside PPG$_j$ (which includes a fixed number of adders), and $d_s$ is the delay due to multiplexer logic inside PPG$_{ji}$.

Figures

TABLE III IMPLEMENTATION RESULTS OF A TWO’S COMPLEMENT 64-BIT PARALLEL MULTIPLIER ON XILINX XC6VSX475T-2FF1156 CIRCUIT

TABLE V OPTIMAL PPGj SOLUTION (a,b,c,d) LEADING TO THE OPTIMAL

TABLE VI DELAY AND MULTIPLEXER COMPLEXITY OF THE NEW BASIC RADICES: STEP #2

TABLE IV DELAY AND MULTIPLEXER COMPLEXITY OF BASIC RADICES: STEP #1

Fig. 3. Critical path (Del+di) inside a generalized PPGj

A new high radix-2r (r≥8) multibit recoding algorithm for large operand size (N≥32) multipliers

A. K. Oudjida, +3 more

- 05 Dec 2012 -

ACM Sigarch Computer Architecture News

New high-speed and low-power radix-2 r multiplication algorithms

A. K. Oudjida, +3 more

On the implementation of a three-operand multiplier

R. McIlhenny, +1 more

Efficient Design for Radix-8 Booth Multiplier and Its Application in Lifting 2-D DWT

Basant Kumar Mohanty, +1 more

- 01 Mar 2017 -

Circuits Systems and Signal Processing

An Efficient Single Precision Floating Point Multiplier Architecture based on Classical Recoding Algorithm

J. Jean Jenifer Nesam, +1 more

- 09 Feb 2016 -

Indian journal of science and technology

Frequently Asked Questions (20)

Q1. What are the contributions in "A new high radix-$2^r$ (r ≥ 8) multibit recoding algorithm for large operand size (N ≥ 32) multipliers"?

This paper addresses the problem of multiplication with large operand sizes (N≥32). The authors propose a new recursive recoding algorithm that shortens the critical path of the multiplier and reduces the hardware complexity of partial-product generators as well. The new recoding algorithm provides an optimal space/time partitioning of the multiplier architecture for any size N of the operands.

Q2. What are the three major requirements for today’s multiplication-intensive applications?

In large-operand-size applications (N≥32), a scalable architecture is essential to ensure a linear increase O(N) of multiply-time while multiplier size grows quadratically, $O(N^2)$, with operand bit-length N. Consequently, high speed, low power, and high scalability are the three major requirements for today’s general-purpose multipliers [1].

Q3. How many clock cycles does a two’s complement multiplier require?

For instance, a 64-bit two’s complement finely pipelined multiplier requires a latency of only seven clock cycles (critical path composed of a series of 7 adders).

Q4. What is the critical path of the multiplier in terms of logic levels?

Based on the total number of adders ($Add_T$), the critical path of the multiplier in terms of logic levels is $Del_T = N/r - 1 + Del + d_s$, where Del is the delay due to adder stages inside PPG$_j$ and $d_s$ is the delay due to multiplexer logic inside PPG$_{ji}$.

Q5. What are the basic components of a recoding algorithm?

Area: Three basic components are necessary for the implementation of RTL multipliers: multiplexers (Mux1) to recode the digit terms ($Q_j$, $P_j$, …) included in the recoding expression; shifters (Mux2) for partial product generation; and adders for partial product summation.

Q6. What is the purpose of the recoding of large slices in a mono-bloc?

Recoding large slices (r≥8) in a mono-bloc PPG, such as in [11][12], requires the use of an RTL “case statement” with r+1 entries.

Q7. What is the simplest way to reduce the number of bits of the multiplier?

To comply with the time constraint of a given application, the authors need a multiplication algorithm that allows, to some extent, a parameterized reduction (N/r) of the multiply-time without sacrificing area.

Q8. What is the tradeoff for a recoding scheme?

Based on theory and implementation results, the authors conclude that the best tradeoff related to their recoding schemes depends on the N and r values.

Q9. Why is the solution space explored with a deterministic C-program?

Because of an explosive number of possible combinations (N>>), the solution space is exhaustively explored using a deterministic C-program for r varying from 8 to 1024.

Q10. What is the important reason for the radix-$2^8$ PPG$_{ji}$?

The radix-$2^8$ PPG$_{ji}$ of equation (15) is the least area consumer because it does not employ odd-multiples and requires a small amount of multiplexers, as the total number of input combinations in each radix-$2^8$ PPG$_{ji}$ is equal to 8+8+8+8=32.

Q11. Why is the look-up table based multiplication algorithm so fast?

Because it exploits the maximum parallelism inherent in the multiply operation, their look-up-table based multiplier (eq. 15) is even speed-competitive with Xilinx’s hardwired multiplier employing DSP-Slices (18×18 bit full-custom multipliers).

Q12. What is the topology of the proposed recoding schemes?

The topology of their proposed recoding schemes shows high capabilities for pipelining, which can be finely or coarsely grained to satisfy both high-throughput and low-latency applications.

Q13. What is the way to solve the problem of radix-$2^r$?

Guided by accurate area heuristics, the final result of an optimization process, gradually undertaken in this paper, delivers for each value of N (N=8..8192) the appropriate radix-$2^r$ (r=8..512) and sub-radix-$2^s$ (s=4..32) that lead to the architecture with the shortest critical path ($3\sqrt[3]{N/2}-3$) in adder stages.

Q14. What is the solution to the problem of radix-$2^r$ two’s complement multiplication?

The solution consists essentially in dividing the high radix-$2^r$ mono-bloc PPG$_j$ (Fig. 1.a) into a number of lower sub-radix-$2^s$ odd-multiple free PPG$_{ji}$ (Fig. 1.b), such that s is a divisor of r.

Q15. What is the digit set corresponding to the radix-$2^r$ equation?

In the literature, equation (1) is referred to as the radix-$2^r$ equation, to which corresponds a digit set $D(2^r)$ such that $Q_j \in D(2^r) = \{-2^{r-1}, \ldots, 0, \ldots, 2^{r-1}\}$.

Q16. What is the advantage of partitioning PPG$_j$?

The direct benefits of the partitioning of Fig. 1.b are: there is no need to pre-compute odd-multiples of the multiplicand, which drastically reduces the required amount of hardware resources and routing; and since the size of a PPG$_{ji}$ entry is much smaller than that of a PPG$_j$ entry (s≤r/2), the total multiplexing logic required by RTL “case statements” to recode the entries is greatly reduced.

Q17. What is the important reason for the recoding of equations?

Based on theory (Table II) and implementation results (Table III), Dimitrov recoding is the most space consuming due to the use of odd-multiples of the multiplicand.

Q18. What is the purpose of a mono-bloc PPG?

Mono-bloc PPG recoding is incompatible with the high-radix (r≥8) approach, whose purpose is to reduce the multiply-time (N/r) of large-operand-size (N≥32) multipliers.

Q19. What is the reason why the solution space is not balanced?

Even the “balanced” solution is not really balanced enough, since the mean values of Del and Mux are 1.4×$Del_{min}$ and 5.2×$Mux_{min}$, respectively.

Q20. What is the important reason for the radix-$2^{32}$ algorithm?

Area occupation: For operand size N=64, equation (15) is a composite radix-$2^{32}$ algorithm (Table X), where each PPG$_j$ simultaneously processes 32+1 inputs that are split over four sub-radix-$2^8$ PPG$_{ji}$, each made of four instances ($C_{jik}$) of the McSorley algorithm (Fig. 4).

Citations

Radix- $2^{r}$ Arithmetic for Multiplication by a Constant

Radix-2 r Arithmetic for Multiplication by a Constant: Further Results and Improvements

Some Algorithms for Computing Short-Length Linear Convolution

A new binary arithmetic for finite-word-length linear controllers: MEMS applications

Radix-2 w Arithmetic for Scalar Multiplication in Elliptic Curve Cryptography
