
A New High Radix-$2^r$ ($r \geq 8$) Multibit Recoding Algorithm for Large Operand Size ($N \geq 32$) Multipliers

Abstract
This paper addresses the problem of multiplication with large operand sizes ($N \geq 32$). We propose a new recursive recoding algorithm that shortens the critical path of the multiplier and also reduces the hardware complexity of the partial-product generators. The new recoding algorithm provides an optimal space/time partitioning of the multiplier architecture for any operand size $N$. As a result, the critical path is drastically reduced to $3\sqrt[3]{N/2} - 3$ adder stages, with no area overhead in comparison to the modified Booth algorithm, which shows a critical path of $N/2$ adder stages. For instance, only 7 adder stages are needed for a 64-bit two's complement multiplier. Compared with reference algorithms for $N = 64$, important gain ratios of 1.62, 1.71, and 2.64 are obtained in terms of multiply-time, energy consumption per multiply-operation, and total gate count, respectively.


HAL Id: hal-00872326
https://hal.archives-ouvertes.fr/hal-00872326
Submitted on 11 Oct 2013
Abdelkrim K. Oudjida, Nicolas Chaillet, Mohamed L. Berrandjia, Ahmed Liacha
To cite this version:
Abdelkrim K. Oudjida, Nicolas Chaillet, Mohamed L. Berrandjia, Ahmed Liacha. A New High Radix-$2^r$ ($r \geq 8$) Multibit Recoding Algorithm for Large Operand Size ($N \geq 32$) Multipliers. Journal of Low Power Electronics, American Scientific Publishers, 2013, 9, pp.50-62. hal-00872326

Index Terms— High-Radix Multiplication, Low-Power
Multiplication, Multibit Recoding Multiplication, Partial
Product Generator (PPG), Register-Transfer-Level (RTL)
I. BACKGROUND AND MOTIVATION
In multiplication-intensive applications, such as digital signal processing or process control, multiply-time is a critical factor that limits overall system performance. When these applications are embedded, energy consumption per multiply operation becomes an additional critical issue. Furthermore, in large-operand-size applications ($N \geq 32$), a scalable architecture is essential to ensure a linear increase $O(N)$ of multiply-time while multiplier size grows quadratically, $O(N^2)$, with operand bit-length $N$. Consequently, high speed, low power, and high scalability are the three major requirements for today's general-purpose multipliers [1].
However, large-operand-size multipliers are very time consuming. To comply with the time constraint of a given application, we need a multiplication algorithm that allows, to some extent, a parameterized reduction ($N/r$) of the multiply-time without sacrificing area. This is achieved if, and only if, the total critical path can be properly shortened by reducing the number of partial products (PPs) and exploiting inherent parallelism. Theoretically, only the signed multibit recoding multiplication algorithm [2] is capable of such a drastic reduction ($N/r$) of the PP number, given that $r+1$ is the number of bits of the multiplier that are simultaneously treated ($1 < r \leq N/2$). Unfortunately, this algorithm requires the pre-computation of a number of odd-multiples of the multiplicand (up to $(2^{r-1}-1) \cdot X$) that scales linearly with the radix $2^r$, that is, it doubles with each increment of $r$ (Table I). The large number of odd-multiples not only requires a considerable amount of multiplexers to perform the necessary complex recoding in the partial product generators (PPGs), but also dramatically increases the routing density. Therefore, a reverse effect occurs that offsets the speed and power benefits of the compression factor $N/r$. This is the main reason why the multibit recoding algorithm was abandoned. Moreover, commercial designs in industry do not exceed $r=4$ (radix-16). A hybrid radix-4/-8 is proposed in [3] for low-power multimedia applications. To increase the speed of the multiplier, most older processors employed radix-8, such as Fchip [4], IBM S/390 [5], Alpha RISC [6], IA-32 [7], and AMD K7 [8], while radix-16 is used only in the most recent Intel processors: Intel 64 and IA-32 [9], and Itanium-Poulson [10].
In research, the highest radix algorithms are proposed in the works of Seidel et al. [11] and Dimitrov et al. [12]. Both works rely upon advanced arithmetic to determine minimal number-bases that are representative of the digits resulting from larger multibit recoding. The objective is to eliminate information redundancy inside $r+1$ bit-length slices for a more compact PPG. This is achievable as long as no or just very few odd-multiples are required.
Seidel introduced a secondary recoding of digits issued from an initial multibit recoding for $5 \leq r \leq 16$. The recoding scheme is based on a balanced complete residue system. Though it significantly reduces the number of partial products ($N/r$ for $5 \leq r \leq 16$), it requires some odd-multiples for $r \geq 8$. Dimitrov proposed a new recoding scheme based on the double-base number system for $6 \leq r \leq 11$. The algorithm is limited to unsigned multiplication and requires a larger number of odd-multiples. Both algorithms [11][12] require a PPG that includes a number of adders to accumulate intermediary partial products corresponding to recoded elementary digits.
In fact, odd-multiples are not the only obstacle to a compact PPG. Recoding large slices ($r \geq 8$) in a mono-bloc PPG, such as in [11][12], requires the use of an RTL “case statement” with $r+1$ entries. In this case, $2^{r+1}$ combinations must be processed, which leads to a huge amount of multiplexer resources. Thus, mono-bloc PPG recoding is incompatible with the high-radix ($r \geq 8$) approach, whose purpose is to reduce the multiply-time ($N/r$) of large-operand-size ($N \geq 32$) multipliers.
The objective of this paper is to overcome the two above-mentioned shortcomings. To this end, the multibit recoding multiplication algorithm [2] is revisited. Its design space is extended by the introduction of a new recursive version that solves the hard problem of radix-$2^r$ two's complement multiplication for any value of $r$. The solution consists essentially in dividing the high radix-$2^r$ mono-bloc $PPG_j$ (Fig. 1.a) into a number of lower sub-radix-$2^s$ odd-multiple-free $PPG_{ji}$ (Fig. 1.b), such that $s$ is a divisor of $r$. The direct benefits of the partitioning of Fig. 1.b are:
• there is no need to pre-compute odd-multiples of the multiplicand, which drastically reduces the required amount of hardware resources and routing;
• since the size of a $PPG_{ji}$ entry is much smaller than that of a $PPG_j$ entry ($s \leq r/2$), the total multiplexing logic required by RTL “case statements” to recode the entries is greatly reduced;
• the possibility to simultaneously process larger bit slices ($r \geq 16$) radically shortens the critical path in terms of adder levels, especially for very large operand sizes ($N \geq 64$).

Fig. 1. Generalized $N \times N$ bit radix-$2^r$ parallel multiplier. (a) Critical path in conventional [2][4][5][6][7][8] and recent [3][9][10][11][12] radix-$2^r$ multipliers. $O(X) = \{3X, 5X, \ldots, (2^{r-1}-1)X\}$ is the necessary set of odd-multiples corresponding to radix-$2^r$ recoding. $PPG_j$ of [11][12] includes a number of adders to accumulate intermediary partial products. (b) Critical path in our proposed radix-$2^r$ multipliers. Main features are: no odd-multiples, much more compact $PPG_j$, much shorter critical path. $2^r$ is the main radix and $2^s$ is the sub-radix. PP: partial product. The critical path is denoted $Del_T$.

A.K. Oudjida (1), N. Chaillet (2), M.L. Berrandjia (1), and A. Liacha (1)
(1) Centre de Développement des Technologies Avancées, Algiers, Algeria
(2) Institut FEMTO-ST, Besançon, France
Guided by accurate area heuristics, the final result of an optimization process, gradually undertaken in this paper, delivers for each value of $N$ ($N = 8..8192$) the appropriate radix-$2^r$ ($r = 8..512$) and sub-radix-$2^s$ ($s = 4..32$) that lead to the architecture with the shortest critical path ($3\sqrt[3]{N/2} - 3$) in adder stages. The couple $(r,s)$ serves to partition the architecture so that maximum parallelism is exploited. As for area, our proposed architectures require as many hardware resources as the modified Booth algorithm [13], which has a critical path of $N/2$ [14][15][16][17]. For instance, a 64-bit two's complement finely pipelined multiplier requires a latency of only seven clock cycles (critical path composed of a series of 7 adders). FPGA implementation on a Virtex-6 circuit of our 64-bit two's complement radix-$2^{32}$ multiplier shows important gain ratios over the Seidel [11] and Dimitrov [12] radix-$2^8$ algorithms: 1.62, 1.71, 2.64 and 1.83, 1.71, 3.32, respectively, in terms of multiply-time, energy consumption per multiply-operation, and total gate count.
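To give a feel for the two critical-path figures quoted above, the following minimal sketch tabulates them for a few operand sizes. It assumes the proposed bound reads $3\sqrt[3]{N/2} - 3$ and that the stage count is obtained by rounding up; the function names are hypothetical and not taken from the paper.

```python
import math


def modified_booth_stages(n: int) -> float:
    """Critical path of the modified Booth multiplier in adder stages (N/2)."""
    return n / 2


def proposed_stages(n: int) -> float:
    """Assumed closed form 3 * cbrt(N/2) - 3 for the proposed architecture."""
    return 3 * (n / 2) ** (1 / 3) - 3


if __name__ == "__main__":
    for n in (32, 64, 128, 1024):
        # Rounding up to a whole number of adder stages is an assumption.
        print(n, modified_booth_stages(n), math.ceil(proposed_stages(n)))
```

Under these assumptions, the 64-bit case rounds up to the seven adder stages mentioned in the text.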
The paper is organized as follows. Section I outlines the main requirement specifications for a generalized radix-$2^r$ multiplication. Section II introduces the new recursive multibit recoding multiplication algorithm, illustrated by two high-radix ($2^8$ and $2^{16}$) recoding examples in Section III. Section IV introduces some preliminary steps toward an optimal partitioning of the multiplier architecture, while the optimal partitioning is presented in Section V. Section VI compares and discusses the implementation results. Finally, Section VII provides some concluding remarks and suggestions for future work.
II. THE NEW RECURSIVE MULTIBIT RECODING MULTIPLICATION ALGORITHM
Equation (2.1.2) of the original multibit recoding algorithm presented in [2] does not offer hardware visibility. Let us rewrite it in a simpler, hardware-friendly form:

$$Y = \sum_{j=0}^{N/r-1} \left( y_{rj-1} + 2^{0} y_{rj} + 2^{1} y_{rj+1} + 2^{2} y_{rj+2} + \cdots + 2^{r-2} y_{rj+r-2} - 2^{r-1} y_{rj+r-1} \right) 2^{rj} = \sum_{j=0}^{N/r-1} Q_j \, 2^{rj} \qquad (1)$$

where $y_{-1} = 0$ and $r \in \mathbb{N}^*$. For simplicity and without loss of generality, we assume that $r$ is a divisor of $N$.
In equation (1), the two's complement representation of the multiplier $Y$ is split into $N/r$ two's complement slices ($Q_j$), each of $r+1$ bit length. Each pair of two contiguous slices has one overlapping bit. In the literature, equation (1) is referred to as the radix-$2^r$ equation, to which corresponds a digit set $D(2^r)$ such that $Q_j \in D(2^r) = \{-2^{r-1}, \ldots, 0, \ldots, 2^{r-1}\}$.
Thus, the signed multiplication between $X$ and $Y$ becomes:

$$X \cdot Y = \sum_{j=0}^{N/r-1} X \cdot Q_j \, 2^{rj} \qquad (2)$$

where each partial product can be expressed as $X \cdot Q_j \cdot 2^{rj} = (-1)^{e} \cdot 2^{f} \cdot (m \cdot X)$, with $m \in O(2^r) = \{1, 3, \ldots, 2^{r-1}-1\}$ such that $|O(2^r)| = 2^{r-2}$. $O(2^r)$ represents the required set of odd-multiples of the multiplicand ($m \cdot X$) for radix-$2^r$. Hence, the partial-product generation process consists first in selecting one odd-multiple ($m \cdot X$) among the whole set of pre-computed odd-multiples, which is then submitted to a hardwired shift of $f$ positions, and finally conditionally complemented, $(-1)^{e}$, depending on the sign bit $e$ of the $Q_j$ term. Table I shows how the number of odd-multiples grows as the radix becomes higher. While lower $m \cdot X$ can be obtained using just one addition ($3X = 2X + 1X$), the calculation of higher ones may require several computation steps ($11X = 8X + 2X + 1X$).
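The radix-$2^r$ slicing of equation (1) and the sign/shift/odd-multiple view of a digit can be sketched in a few lines of Python. This is an illustrative software model only, not the authors' RTL; the names `recode_radix_2r` and `split_digit` are hypothetical, and the shift `f` returned here is the shift inside one digit (it does not include the $2^{rj}$ weight).

```python
def recode_radix_2r(y: int, n: int, r: int) -> list[int]:
    """Return the n/r signed digits Q_j of an n-bit two's complement value y."""
    if n % r != 0:
        raise ValueError("r must divide n")
    if not -(1 << (n - 1)) <= y < (1 << (n - 1)):
        raise ValueError("y does not fit in n bits")
    bits = y & ((1 << n) - 1)  # two's complement bit pattern of y

    def bit(k: int) -> int:
        return 0 if k < 0 else (bits >> k) & 1  # y_{-1} = 0

    digits = []
    for j in range(n // r):
        q = bit(r * j - 1)                      # overlapping bit
        for i in range(r - 1):
            q += bit(r * j + i) << i            # positively weighted bits
        q -= bit(r * j + r - 1) << (r - 1)      # negatively weighted top bit
        digits.append(q)
    return digits


def split_digit(q: int) -> tuple[int, int, int]:
    """Write a nonzero digit q as (-1)**e * 2**f * m with m odd; return (e, f, m)."""
    if q == 0:
        raise ValueError("zero digit yields a null partial product")
    e = 1 if q < 0 else 0
    mag = abs(q)
    f = (mag & -mag).bit_length() - 1  # number of trailing zero bits
    return e, f, mag >> f
```

For example, `recode_radix_2r(y, 64, 8)` returns eight digits whose weighted sum $\sum_j Q_j 2^{8j}$ reproduces `y`, and `split_digit` exposes which odd multiple a monolithic PPG would have to select.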
To bypass the hard problem of odd-multiples, we exploit the fact that the $N+1$ bit-length two's complement multiplier $Y$, on which equation (1) is applied, is composed of a series ($N/r$) of $r+1$ bit-length two's complement slices ($Q_j$ digits) on which equation (1) can be recursively applied again. Based on this observation, let us state the following two theorems, whose proofs are given in the Appendix.
TABLE I
MAIN FEATURES OF THE MULTIBIT RECODING MULTIPLICATION ALGORITHM

Radix    Nbr. of Partial Products    Odd Multiples (m.X)
$2^1$    N                           1X
$2^2$    N/2                         1X
$2^3$    N/3                         1X, 3X
$2^4$    N/4                         1X, 3X, 5X, 7X
$2^5$    N/5                         1X, 3X, 5X, 7X, 9X, 11X, 13X, 15X

$|O(2^{r+1})| = 2 \times |O(2^r)|$. In radix-$2^r$, the multiplier $Y$ is divided into $N/r$ slices, each of $r+1$ bit length. Each pair of two contiguous slices has one overlapping bit.
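The odd-multiple column of Table I can be regenerated by collecting the odd parts of all digit magnitudes $1..2^{r-1}$. The sketch below is only a bookkeeping aid; `odd_multiples` is a hypothetical helper name.

```python
def odd_multiples(r: int) -> list[int]:
    """Odd multiples m needed by radix-2**r digits: odd parts of 1..2**(r-1)."""
    top = 1 << (r - 1)
    odd_parts = set()
    for d in range(1, top + 1):
        while d % 2 == 0:  # strip the power-of-two factor (a hardwired shift)
            d //= 2
        odd_parts.add(d)
    return sorted(odd_parts)


if __name__ == "__main__":
    for r in range(1, 6):
        print(f"radix 2^{r}:", ", ".join(f"{m}X" for m in odd_multiples(r)))
```

From $r = 2$ onward the set size doubles with each increment of $r$, which is the growth the recursive scheme is designed to avoid.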

Theorem 1. Any digit $Q_j \in D(2^r)$ can be represented as a combination of digits $P_{ji} \in D(2^s)$, such that $s$ is a divisor of $r$.
When theorem (1) is applied to equation (1), it gives:

$$Y = \sum_{j=0}^{N/r-1} \sum_{i=0}^{r/s-1} P_{ji} \, 2^{si} \, 2^{rj} \qquad (3)$$

where $P_{ji} \in D(2^s) = \{-2^{s-1}, \ldots, 0, \ldots, 2^{s-1}\}$ with $O(2^s) = \{1, 3, \ldots, 2^{s-1}-1\}$, such that $|O(2^r)| / |O(2^s)| = 2^{ks}$, and

$$X \cdot Y = \sum_{j=0}^{N/r-1} \sum_{i=0}^{r/s-1} X \cdot P_{ji} \, 2^{si} \, 2^{rj} \qquad (4)$$
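Theorem 1 can be checked exhaustively for small parameters. The sketch below models one $(r+1)$-bit window as an integer whose bit 0 is the overlapping bit, splits it into $r/s$ overlapping $(s+1)$-bit sub-windows, and recombines the sub-digits as in equation (3). Function names are hypothetical.

```python
def window_digit(window: int, width: int) -> int:
    """Signed digit of a (width+1)-bit window; bit 0 is the overlapping bit."""
    low = window & 1
    body = (window >> 1) & ((1 << width) - 1)
    msb = body >> (width - 1)
    # The top bit carries weight -2**(width-1) instead of +2**(width-1).
    return low + body - (msb << width)


def split_uniform(window: int, r: int, s: int) -> list[int]:
    """Digits P_i in D(2**s) of an (r+1)-bit window, with s a divisor of r."""
    if r % s != 0:
        raise ValueError("s must divide r")
    mask = (1 << (s + 1)) - 1
    return [window_digit((window >> (s * i)) & mask, s) for i in range(r // s)]


def recombine(digits: list[int], s: int) -> int:
    """Evaluate sum_i P_i * 2**(s*i)."""
    return sum(p * (1 << (s * i)) for i, p in enumerate(digits))
```

With $r = 8$ and $s = 2$, every one of the 512 possible windows satisfies `recombine(split_uniform(w, 8, 2), 2) == window_digit(w, 8)`, and each sub-digit stays in $\{-2, \ldots, 2\}$, so only the multiple $1X$ is needed.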
Theorem 2. Any digit $Q_j \in D(2^r)$ can be represented as a combination of digits $P_{ji} + T_{jk}$ such that $P_{ji} \in D(2^s)$ and $T_{jk} \in D(2^t)$, with $s+t$ a divisor of $r$, and $t < s$.
Likewise, when theorem (2) is applied to equation (1), we obtain:

$$Y = \sum_{j=0}^{N/r-1} \left[ \sum_{i=0}^{r/(s+t)-1} \left( P_{ji} + 2^{s} T_{ji} \right) 2^{(s+t)i} \right] 2^{rj} \qquad (5)$$

where $P_{ji} \in D(2^s) = \{-2^{s-1}, \ldots, 0, \ldots, 2^{s-1}\}$ with $O(2^s) = \{1, 3, \ldots, 2^{s-1}-1\}$, and $T_{ji} \in D(2^t) = \{-2^{t-1}, \ldots, 0, \ldots, 2^{t-1}\}$ with $O(2^t) = \{1, 3, \ldots, 2^{t-1}-1\}$, such that $|O(2^r)| / |O(2^{s+t})| = 2^{k(s+t)}$, and

$$X \cdot Y = \sum_{j=0}^{N/r-1} \left[ \sum_{i=0}^{r/(s+t)-1} \left( X \cdot P_{ji} + 2^{s} X \cdot T_{ji} \right) 2^{(s+t)i} \right] 2^{rj} \qquad (6)$$
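The mixed split of Theorem 2 can be exercised in the same way: each group of $s+t$ bits is cut into an $s$-bit window (digit $P$) and a $t$-bit window (digit $T$, weighted by $2^s$). This is a software sketch under the same window convention (bit 0 is the overlapping bit); the names are hypothetical.

```python
def window_digit(window: int, width: int) -> int:
    """Signed digit of a (width+1)-bit window; bit 0 is the overlapping bit."""
    low = window & 1
    body = (window >> 1) & ((1 << width) - 1)
    msb = body >> (width - 1)
    return low + body - (msb << width)


def split_mixed(window: int, r: int, s: int, t: int) -> list[tuple[int, int]]:
    """Pairs (P_i, T_i) of an (r+1)-bit window, with (s+t) dividing r and t < s."""
    if r % (s + t) != 0 or not t < s:
        raise ValueError("need (s + t) | r and t < s")
    pairs = []
    for i in range(r // (s + t)):
        base = (s + t) * i
        p = window_digit((window >> base) & ((1 << (s + 1)) - 1), s)
        td = window_digit((window >> (base + s)) & ((1 << (t + 1)) - 1), t)
        pairs.append((p, td))
    return pairs


def recombine_mixed(pairs: list[tuple[int, int]], s: int, t: int) -> int:
    """Evaluate sum_i (P_i + 2**s * T_i) * 2**((s+t)*i), as in equation (5)."""
    return sum((p + td * 2**s) * 2 ** ((s + t) * i) for i, (p, td) in enumerate(pairs))
```

For instance, with $(r, s, t) = (6, 2, 1)$ each 7-bit window becomes two modified-Booth digits and two Booth digits whose weighted sum equals the original digit.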
Theorems (1) and (2) allow an exponential reduction ($1/2^{ks}$ and $1/2^{k(s+t)}$, resp.) of the number of odd-multiples in equations (4) and (6) in comparison to equation (2), but at the expense of a linear increase ($ks-1$ and $k(s+t)-1$, resp.) in the number of additions. The advantage by far outweighs the cost, as shown in practice in the next section.
The translation of equation (4) into architecture is depicted in Fig. 1.b, where each $PPG_j$ ($Q_j$) is built using $r/s$ identical $PPG_{ji}$ ($P_{ji}$). This is not the case for equation (6), which requires two different $PPG_{ji}$ ($P_{ji}$ and $T_{ji}$). Theorems (1) and (2) can be merged to produce a $PPG_j$ made of a number of different $PPG_{ji}$ ($P_{ji}$, $T_{ji}$, $U_{ji}$, $V_{ji}$, ...). This is the general case, which is thoroughly studied in the next sections in order to determine the optimal multiplier.
III. TWO HIGH RADIX ($2^8$ AND $2^{16}$) ILLUSTRATIVE EXAMPLES
Theorems (1) and (2) make it possible to build any high radix-$2^r$ multiplication algorithm from lower sub-radices, employing far fewer odd-multiples. The objective hereafter is to generate high radix-$2^r$ multiplication without odd-multiples for a maximum reduction of multiplexer complexity inside $PPG_j$. To achieve this goal, a number of odd-multiple-free low-radix algorithms are used, such as the Booth algorithm (radix-$2^1$) [18], the modified Booth algorithm (radix-$2^2$) [13], and the Seidel et al. algorithms (radix-$2^5$ and radix-$2^8$) [11][19]. Booth and modified Booth recoding (McSorley algorithm [13]) can be derived from equation (3) for $(r,s)=(1,1)$ and $(r,s)=(2,2)$, respectively. They are summarized as follows:
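$$Y = \sum_{j=0}^{N-1} \left( y_{j-1} - y_{j} \right) 2^{j} = \sum_{j=0}^{N-1} Q_j \, 2^{j} \qquad (7)$$

with $D(2^1) = \{-1, 0, 1\}$ and $O(2^1) = \{1\}$.

$$Y = \sum_{j=0}^{N/2-1} \left( y_{2j-1} + y_{2j} - 2 y_{2j+1} \right) 2^{2j} = \sum_{j=0}^{N/2-1} Q_j \, 2^{2j} \qquad (8)$$

with $D(2^2) = \{-2, -1, 0, 1, 2\}$ and $O(2^2) = \{1\}$.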
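Equations (7) and (8) translate directly into code. The following sketch recodes an $n$-bit two's complement integer with Booth and modified Booth digits; it is a behavioural illustration with hypothetical function names, not a hardware description.

```python
def _bit(bits: int, k: int) -> int:
    """Bit k of the pattern, with y_{-1} = 0."""
    return 0 if k < 0 else (bits >> k) & 1


def booth_digits(y: int, n: int) -> list[int]:
    """Radix-2 Booth digits Q_j = y_{j-1} - y_j, each in {-1, 0, 1}."""
    bits = y & ((1 << n) - 1)
    return [_bit(bits, j - 1) - _bit(bits, j) for j in range(n)]


def modified_booth_digits(y: int, n: int) -> list[int]:
    """Radix-4 digits Q_j = y_{2j-1} + y_{2j} - 2*y_{2j+1}, each in {-2, ..., 2}."""
    if n % 2 != 0:
        raise ValueError("n must be even")
    bits = y & ((1 << n) - 1)
    return [
        _bit(bits, 2 * j - 1) + _bit(bits, 2 * j) - 2 * _bit(bits, 2 * j + 1)
        for j in range(n // 2)
    ]
```

Summing `d * 2**j` over the Booth digits, or `d * 4**j` over the modified Booth digits, gives back `y` for any value that fits in `n` bits.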
Seidel radix-$2^5$ recoding [11][19] is described as follows:

$$Y = \sum_{j=0}^{N/5-1} \left[ 7 \cdot Q_j + P_j \right] 2^{5j} \qquad (9)$$

with $Q_j \in \{-2, -1, 0, 1, 2\}$; $P_j \in \{-4, -2, -1, 0, 1, 2, 4\}$ and $O(2^5) = \{1\}$.
And Seidel radix-$2^8$ recoding is given by the following equation:

$$Y = \sum_{j=0}^{N/8-1} \left[ 11^2 \cdot Q_j + 11 \cdot P_j + T_j \right] 2^{8j} \qquad (10)$$

with $Q_j \in \{-2, -1, 0, 1, 2\}$; $P_j, T_j \in \{-16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16\}$ and $O(2^8) = \{1\}$. Note that while equations (9) and (10) are odd-multiple free, since all included digits are powers of 2, they require a post-accumulation to deal with the odd numbers (7, 11 and 121). Thus, a number of extra adders are needed.
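That the digit sets of equations (9) and (10) reach every value of $D(2^5)$ and $D(2^8)$ can be confirmed by brute force. The sketch below enumerates all weighted digit combinations and lists the values that are not reachable; `missing_values` is a hypothetical helper, and the enumeration says nothing about how Seidel's hardware actually selects a representation.

```python
from itertools import product

Q_SET = (-2, -1, 0, 1, 2)
P5_SET = (-4, -2, -1, 0, 1, 2, 4)
P8_SET = (-16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16)


def missing_values(lo, hi, weights, digit_sets):
    """Values in [lo, hi] that no weighted combination of the digits reaches."""
    reachable = {
        sum(w * d for w, d in zip(weights, combo))
        for combo in product(*digit_sets)
    }
    return [v for v in range(lo, hi + 1) if v not in reachable]


if __name__ == "__main__":
    # Radix-2^5: 7*Q + P must cover -16..16.
    print(missing_values(-16, 16, (7, 1), (Q_SET, P5_SET)))
    # Radix-2^8: 121*Q + 11*P + T must cover -128..128.
    print(missing_values(-128, 128, (121, 11, 1), (Q_SET, P8_SET, P8_SET)))
```

Both calls should return an empty list, meaning no digit is left without a representation.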
Optimized higher radices are obtained as follows.
A. Our new radix-$2^8$ recoding
Based on theorem (2), each 8+1 bit slice is split into 5+1, 2+1, and 1+1 overlapping slices using the Seidel radix-$2^5$, McSorley radix-$2^2$, and Booth radix-$2^1$ algorithms, respectively. The new recoding is given by the following equation:

$$Y = \sum_{j=0}^{N/8-1} \left[ \left( 7 \cdot Q_j + P_j \right) + 2^{5} \cdot \left( R_j + 2^{2} \cdot S_j \right) \right] 2^{8j} \qquad (11)$$

with $Q_j \in \{-2, -1, 0, 1, 2\}$; $P_j \in \{-4, -2, -1, 0, 1, 2, 4\}$; $R_j \in \{-2, -1, 0, 1, 2\}$; $S_j \in \{-1, 0, 1\}$ and $O(2^8) = \{1\}$.
B. Our new radix-$2^{16}$ recoding
Likewise, using theorem (2), each 16+1 bit slice is split into 8+1, 5+1, 2+1, and 1+1 overlapping slices using the Seidel radix-$2^8$ and radix-$2^5$, McSorley radix-$2^2$, and Booth radix-$2^1$ algorithms, respectively. The new recoding is described by the following equation:

$$Y = \sum_{j=0}^{N/16-1} \left[ \left( 11^2 \cdot Q_j + 11 \cdot P_j + T_j \right) + 2^{8} \cdot \left( 7 \cdot R_j + S_j \right) + 2^{13} \cdot \left( U_j + 2^{2} \cdot V_j \right) \right] 2^{16j} \qquad (12)$$

with $Q_j \in \{-2, -1, 0, 1, 2\}$; $P_j, T_j \in \{-16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16\}$; $R_j \in \{-2, -1, 0, 1, 2\}$; $S_j \in \{-4, -2, -1, 0, 1, 2, 4\}$; $U_j \in \{-2, -1, 0, 1, 2\}$; $V_j \in \{-1, 0, 1\}$ and $O(2^{16}) = \{1\}$.
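The composite recodings of equations (11) and (12) can be modelled in software by slicing each 8-bit or 16-bit group into the sub-windows named above. In this sketch the Seidel digits are obtained from brute-force lookup tables, which is an assumption made for illustration only (the actual recoders in [11][19] are combinational circuits); all function and table names are hypothetical.

```python
from itertools import product

SMALL = (-2, -1, 0, 1, 2)
POW5 = (-4, -2, -1, 0, 1, 2, 4)
POW8 = (-16, -8, -4, -2, -1, 0, 1, 2, 4, 8, 16)

# Lookup tables: digit value -> recoded terms (first match kept; choice is arbitrary).
SEIDEL5 = {}
for q, p in product(SMALL, POW5):
    SEIDEL5.setdefault(7 * q + p, (q, p))
SEIDEL8 = {}
for q, p, t in product(SMALL, POW8, POW8):
    SEIDEL8.setdefault(121 * q + 11 * p + t, (q, p, t))


def window_digit(bits: int, lo: int, width: int) -> int:
    """Signed digit of bits y_{lo-1} .. y_{lo+width-1}, with y_{-1} = 0."""
    def bit(k: int) -> int:
        return 0 if k < 0 else (bits >> k) & 1
    digit = bit(lo - 1) - (bit(lo + width - 1) << (width - 1))
    for i in range(width - 1):
        digit += bit(lo + i) << i
    return digit


def recode_eq11(y: int, n: int):
    """Digits (Q, P, R, S) per 8-bit group and the value they rebuild (eq. 11)."""
    bits, digits, total = y & ((1 << n) - 1), [], 0
    for j in range(n // 8):
        b = 8 * j
        q, p = SEIDEL5[window_digit(bits, b, 5)]   # Seidel radix-2^5
        r = window_digit(bits, b + 5, 2)           # McSorley radix-2^2
        s = window_digit(bits, b + 7, 1)           # Booth radix-2^1
        digits.append((q, p, r, s))
        total += ((7 * q + p) + 2**5 * (r + 2**2 * s)) * 2 ** (8 * j)
    return digits, total


def recode_eq12(y: int, n: int):
    """Digits (Q, P, T, R, S, U, V) per 16-bit group and their value (eq. 12)."""
    bits, digits, total = y & ((1 << n) - 1), [], 0
    for j in range(n // 16):
        b = 16 * j
        q, p, t = SEIDEL8[window_digit(bits, b, 8)]    # Seidel radix-2^8
        r, s = SEIDEL5[window_digit(bits, b + 8, 5)]   # Seidel radix-2^5
        u = window_digit(bits, b + 13, 2)              # McSorley radix-2^2
        v = window_digit(bits, b + 15, 1)              # Booth radix-2^1
        digits.append((q, p, t, r, s, u, v))
        group = (121 * q + 11 * p + t) + 2**8 * (7 * r + s) + 2**13 * (u + 2**2 * v)
        total += group * 2 ** (16 * j)
    return digits, total
```

For a 64-bit operand, `recode_eq11` yields eight digit groups and `recode_eq12` four, and in both cases the rebuilt value equals the original multiplier, which matches the window boundaries shown in Fig. 2.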
In our preceding work [20], we pursued this combination process further and generated a series of higher radix ($2^{24}$, $2^{32}$, …) recoding schemes with $O(2^r) = \{1\}$. However, what still remains unknown is how to determine, for a given value of $N$, the proper radix ($2^r$) that leads to the optimal architecture.

The translation of equations (11) and (12) into architectures is depicted in Fig. 2.a and 2.b, respectively.
All Dimitrov algorithms developed in [12] are unsigned. For an equitable comparison, we had to develop a new two's complement radix-$2^8$ recoding version with $O(2^8) = \{1, 3, 5, 7\}$, based on the Dimitrov unsigned radix-$2^7$ recoding (mult_7b2d in [12]) with $O(2^7) = \{1, 3, 5, 7\}$. The new recoding is:
$$Y = \sum_{j=0}^{N/8-1} \left[ Q_j \cdot 2^{k} \, (-1)^{e} + P_j \cdot 2^{h} \, (-1)^{e} \right] 2^{8j} \qquad (13)$$

with $Q_j, P_j \in \{1, 3, 5, 7\}$; $k, h \in \{0, 1, 2, 3, 4, 5, 6, 7\}$ and $e \in \{0, 1\}$.
For the comparative study, our proposed algorithms (eq. 11 and 12) as well as the Seidel and Dimitrov algorithms (eq. 10 and 13, resp.) are first analytically characterized and then physically implemented.
C. Analytical characterization of area and speed
Prior to implementation, we need to develop a generalized theoretical model that predicts the area and speed features of each recoding algorithm with respect to the values of $N$ and $r$.
1) Area
Three basic components are necessary for the implementation of RTL multipliers:
• multiplexers (Mux1) to recode the digit terms ($Q_j$, $P_j$, …) included in the recoding expression;
• shifters (Mux2) for partial product generation;
• and adders for partial product summation.
Whereas the exact number of adders can be known in advance, we need to develop heuristics for the other two.
The total multiplexer complexity (Mux1) of a radix-$2^r$ multiplier depends on:
• the number ($N/r$) of $PPG_j$;
• the number ($i$) of lower sub-radices ($2^1$, $2^2$, $2^5$, and $2^8$) used to build up the higher radix-$2^r$. To each sub-radix-$2^s$ used ($PPG_{ji}$) corresponds an RTL “case statement” that recodes the digit terms ($Q_{ji}$, $P_{ji}$, $T_{ji}$, …) present in the equation;
• the number of entries ($e_s+1$) in each “case statement” corresponding to each sub-radix-$2^s$;
• the number ($d_s$) of digit terms ($Q_{ji}$, $P_{ji}$, $T_{ji}$, …) that figure in each “case statement”;
• and the number of necessary odd-multiples ($|O_s|$) used to calculate the digit terms.
Hence, we can state that:

$$Mux1 = \frac{N}{r} \sum_{i} 2^{e_s+1} \cdot d_s \cdot |O_s|$$

For the Dimitrov algorithm (eq. 13), this gives: $r=8$, $i=1$, $e_s=8$, $d_s=2$, and $|O_s|=4$. Thus, $Mux1 = (N/8) \cdot 2^{9} \cdot 2 \cdot 4 = 512N$.
The synthesis of the RTL “shift statement” infers multiplexers whose complexity depends on the number ($p_{sj}$) of different shift positions for all odd-multiples involved in the calculation of each digit term ($j$). Thus, we can write:

$$Mux2 = \frac{N}{r} \sum_{i} \sum_{j} p_{sj} \cdot |O_{sj}|$$

For the Dimitrov algorithm (eq. 13), this gives: $r=8$, $i=1$, $j=2$, $p_{s1}=p_{s2}=8$, and $|O_{s1}| = |O_{s2}| = 4$. Thus, $Mux2 = (N/8)(8 \cdot 4 + 8 \cdot 4) = 8N$. Hence, the total multiplexer complexity becomes $Mux_T = Mux1 + Mux2 = 520N$.
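The two multiplexer heuristics are simple enough to evaluate programmatically. The sketch below encodes them as read from the worked Dimitrov example ($Mux1 = 512N$, $Mux2 = 8N$); the data layout (one tuple per sub-radix) and the function names are assumptions made for illustration.

```python
from fractions import Fraction


def mux1(n: int, r: int, sub_radices) -> Fraction:
    """Mux1 = (N/r) * sum over sub-radices of 2**(e_s + 1) * d_s * |O_s|.

    sub_radices: iterable of (e_s, d_s, o_s) tuples, one per sub-radix.
    """
    return Fraction(n, r) * sum(2 ** (e + 1) * d * o for e, d, o in sub_radices)


def mux2(n: int, r: int, sub_radices) -> Fraction:
    """Mux2 = (N/r) * sum over sub-radices and digit terms of p_sj * |O_sj|.

    sub_radices: iterable of lists of (p_sj, o_sj) tuples, one list per sub-radix.
    """
    return Fraction(n, r) * sum(p * o for terms in sub_radices for p, o in terms)


if __name__ == "__main__":
    n = 64
    # Dimitrov radix-2^8 (eq. 13): one sub-radix, e_s = 8, d_s = 2, |O_s| = 4,
    # and two digit terms with 8 shift positions and 4 odd-multiples each.
    m1 = mux1(n, 8, [(8, 2, 4)])
    m2 = mux2(n, 8, [[(8, 4), (8, 4)]])
    print(m1, m2, m1 + m2)
```

Plugging in other $(e_s, d_s, |O_s|)$ triples gives a quick way to compare candidate partitions before synthesis, under the same heuristic.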
An $N$-bit radix-$2^r$ multiplier generates $N/r$ PPs. Thus, the total number of adders comprises:
• $(N/r - 1)$ adders to sum the $N/r$ PPs;
• plus the necessary adders inside each $PPG_j$ to accumulate the intermediate PPs issuing from the $PPG_{ji}$;
• plus a number of adders included inside each $PPG_{ji}$, depending on the recoding scheme used.
For example, in the Seidel algorithm (eq. 10), the term $11^2 Q_{ji} + 11 P_{ji} + T_{ji}$ is calculated as $\left( 2^{7} Q_{ji} - 2^{3} Q_{ji} + Q_{ji} \right) + \left( 2^{3} P_{ji} + 2^{2} P_{ji} - P_{ji} \right) + T_{ji}$, which requires 6 adders for the post-accumulation operation [11][19]. Hence, the total number of necessary adders is $Add_T = (N/8 - 1) + 6 (N/8) = 7N/8 - 1$.
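The shift-and-add form of the Seidel term and the resulting adder count can be verified with a short script. This is a numerical check under the decomposition $121 = 2^7 - 2^3 + 1$ and $11 = 2^3 + 2^2 - 1$; the function names are hypothetical.

```python
def seidel8_term(q: int, p: int, t: int) -> int:
    """121*q + 11*p + t using only shifts and six add/subtract operations."""
    q_part = (q << 7) - (q << 3) + q   # 121*q = (128 - 8 + 1) * q
    p_part = (p << 3) + (p << 2) - p   # 11*p  = (8 + 4 - 1) * p
    return q_part + p_part + t


def add_total_seidel8(n: int) -> int:
    """Adders for the radix-2^8 Seidel multiplier: (N/8 - 1) + 6 * (N/8)."""
    groups = n // 8
    return (groups - 1) + 6 * groups


if __name__ == "__main__":
    print(seidel8_term(2, -16, 8), add_total_seidel8(64))
```

For $N = 64$ the count is $7 \cdot 64/8 - 1 = 55$ adders, of which 48 sit inside the partial product generators.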
Fig. 2. Two's complement 64×64 bit multiplier. (a) Radix-$2^8$ multiplier: space partitioning according to equation (11). The eight generators $PPG_0$ to $PPG_7$ each recode three overlapping windows of $Y$ (for $PPG_0$: $Y_{-1,4}$, $Y_{4,6}$, $Y_{6,7}$) into the terms $Q_j$, $P_j$, $R_j$, $S_j$ and produce $PP_0$ to $PP_7$, which are summed into $P_{127-0}$. (b) Radix-$2^{16}$ multiplier: space partitioning according to equation (12). The four generators $PPG_0$ to $PPG_3$ each recode four overlapping windows of $Y$ (for $PPG_0$: $Y_{-1,7}$, $Y_{7,12}$, $Y_{12,14}$, $Y_{14,15}$) into the terms $Q_j$, $P_j$, $T_j$, $R_j$, $S_j$, $U_j$, $V_j$ and produce $PP_0$ to $PP_3$, which are summed into $P_{127-0}$. Each $PPG_{ji}$ includes a fixed number of adders. Critical path: $Del_T = N/r - 1 + Del + d_s$, where $Del_T$ is the delay in adder levels of the total critical path, $Del$ is the delay in adder levels inside $PPG_j$, and $d_s$ is the delay due to multiplexer logic inside $PPG_{ji}$.
