A New Family of High–Performance Parallel Decimal Multipliers
Alvaro Vázquez, Elisardo Antelo
University of Santiago de Compostela
Dept. of Electronic and Computer Science
15782 Santiago de Compostela, Spain
alvaro@dec.usc.es, elisardo@dec.usc.es
Paolo Montuschi
Politecnico di Torino
Dept. of Computer Engineering
10129 Torino, Italy
montuschi@polito.it
Abstract
This paper introduces two novel architectures for parallel decimal multipliers. Our multipliers are based on a new algorithm for decimal carry–save multioperand addition that uses a novel BCD–4221 recoding for decimal digits. It significantly improves the area and latency of the partial product reduction tree with respect to previous proposals. We also present three schemes for fast and efficient generation of partial products in parallel. The recoding of the BCD–8421 multiplier operand into minimally redundant signed–digit radix–10, radix–4 and radix–5 representations using new recoders reduces the complexity of partial product generation. In addition, SD radix–4 and radix–5 recodings allow the reuse of a conventional parallel binary radix–4 multiplier to perform combined binary/decimal multiplications. Evaluation results show that the proposed architectures have interesting area–delay figures compared to conventional Booth radix–4 and radix–8 parallel binary multipliers and other representative alternatives for decimal multiplication.
1. Introduction
Providing hardware support for decimal arithmetic is becoming a topic of interest. Specifically, the revision of the IEEE–754 Standard for Floating–Point Arithmetic (IEEE–754r) [1] already incorporates specifications for decimal arithmetic. Thus, it is expected that microprocessor manufacturers will include decimal floating–point units in their products oriented to mainframe servers to satisfy the high performance demands of current financial, commercial and user–oriented applications [3].

An important and frequent operation in decimal computations is multiplication. However, due to the inherent inefficiency of decimal arithmetic implementations in binary logic, practically all the proposed decimal multipliers are sequential units [2, 4, 7, 9, 11, 16]. Recently, the first implementation of a parallel decimal multiplier was presented in [8]. Parallel multipliers are used extensively in most binary floating–point units [10, 13] and are of interest for decimal applications to scale performance.

A. Vázquez and E. Antelo were supported in part by the Ministry of Science and Technology of Spain under contract TIN2004-07797-C02 and Xunta de Galicia under contract PGIDT03TIC10502PR.
In this paper, we introduce new methods for the efficient implementation of decimal parallel multiplication by a parallel generation of partial products and the reduction of these partial products using a novel decimal carry–save addition tree. We present the architectures of two different high–performance parallel multipliers that implement these methods. The second architecture also allows an effective implementation of a combined binary/decimal multiplier. These high–performance implementations have similar hardware complexity or a moderate increment in area with respect to the equivalent binary parallel multipliers.

The paper is organized as follows. Section 2 outlines the most representative previous work on decimal multiplication. In Section 3 we introduce our proposals for an efficient implementation of decimal parallel multiplication. The proposed techniques for the generation of partial products are detailed in Section 4, while the reduction of partial products is fully discussed in Section 5. We describe the two resulting architectures and some variants in Section 6. In Section 7 we provide rough area–delay evaluation results for 64–bit (16 decimal digits) decimal and binary parallel multipliers. We compare these results with some other representative works. Finally we summarize the main conclusions in Section 8.
2. An overview of decimal multiplication
Multiplication consists of three stages: generation of partial products, fast reduction (addition) of partial products to two operands, and a final carry–propagate addition. Decimal multiplication is more complex than binary multiplication mainly for two reasons: the higher range of decimal digits ([0, 9]), which increases the number of multiplicand multiples, and the inefficiency of representing decimal values in systems based on binary logic using BCD–8421 (since only 10 out of the 16 possible 4–bit combinations represent a valid decimal digit). These issues complicate the generation and reduction of partial products.

18th IEEE Symposium on Computer Arithmetic (ARITH'07)
0-7695-2854-6/07 $20.00 © 2007
Proposed methods for the generation of decimal partial products follow two approaches. The first alternative [2, 4] generates and stores all the required multiplicand multiples. Next, multiples are distributed to the reduction stage through multiplexers controlled by the multiplier digits. This approach requires more than a cycle to generate some complex BCD–8421 multiplicand multiples (3X, 6X, 7X, 8X, 9X). To avoid complicated multiples the multiplier can be recoded. In [8] each multiplier digit is recoded as $Y_i = Y_H \cdot 5 + Y_L$, with $Y_H \in \{0, 1, 2\}$ and $Y_L \in \{-2, -1, 0, 1, 2\}$. Multiples 2X and 5X can be computed without a carry propagation over the whole number. Negative multiples require an additional 9's complement addition. The second approach generates only the partial products as needed using digit–by–digit lookup table methods [9, 16]. In a recent work [5], a magnitude range reduction of the operand digits by a radix–10 signed–digit recoding (from [0, 9] to [−5, 5]) is suggested. This recoding of both operands speeds up and simplifies the generation of partial products. Then, overlapped signed–digit partial products¹ are generated using simplified tables and a set of multiplexers and XOR gates.
First attempts to improve decimal multiplication performed the reduction of decimal partial products using some scheme for decimal carry–propagate addition such as direct decimal addition [12]. Proposals to perform the reduction of decimal partial products using multioperand carry–free addition were suggested in [9] (carry–save) and [15] (signed–digit). Recently several techniques have been proposed that improve these previous works. In [5] a signed–digit decimal adder based on [15] is used. Redundant binary coded decimal (RBCD) adders [14] can also perform decimal carry–free additions using a signed–digit representation of decimal digits ([−7, 7]). In [11] a scheme of two levels of 3–2 binary carry–save adders (CSA) is used to add the partial products iteratively. Since it uses BCD–8421 to represent decimal digits, a digit addition of +6 or +12 (modulo 16) is required to obtain the decimal carry and to correct the sum digit. Logic for detection of decimal carries and sum digit correction is in the critical path (sum path). In order to eliminate decimal corrections from the critical path of the binary CSA, three different techniques were proposed in [6]. Among these proposals, non–speculative adders present the best area–delay figures and are the most suitable for multioperand addition using a CSA tree. Non–speculative adders reduce the BCD–8421 input operands using a binary CSA tree. Preliminary sum digits are then obtained using a level of 4–bit carry–propagate adders. Finally, decimal carry and sum digit corrections are determined from the preliminary sum digit and the carries passed to the next more significant digit position in the binary CSA tree². Decimal correction is performed using combinational logic (its complexity depends on the number of input operands added) and a 3–bit carry–propagate adder per digit.

¹ Two overlapped digits in the ranges [−5, 5] and [−2, 2] are generated for each partial product digit position.
Another representative technique [4] uses an array of 4–bit decimal carry–propagate adders based on direct decimal addition. This adder takes two BCD–8421 digits and a 1–bit input carry and generates a 1–bit decimal carry and the BCD–8421 sum digit. An iterative decimal multiplier based on a refinement of [4] is presented in [7]. It uses BCD–8421 invalid combinations to simplify the sum digit logic. A combinational radix–10 CSA tree is implemented in [8] using these 4–bit decimal carry–propagate adders. To optimize the partial product reduction they also use an array of decimal digit counters. Each counter adds 8 decimal carries of the same weight and produces a BCD–8421 digit.
3. Proposed techniques for decimal parallel multiplication
We assume that multiplicand X and multiplier Y are unsigned decimal integer words. Extension to decimal floating–point multiplication involves exponent addition, rounding of X · Y to fit the required precision and sign calculations. We represent the decimal digits of any d–digit decimal integer operand $Z = \sum_{i=0}^{d-1} Z_i \cdot 10^i$ as

$$Z_i = \sum_{j=0}^{3} z_{i,j} \cdot r_j$$

where $Z_i \in [0, 9]$ is the $i$th decimal digit, $z_{i,j}$ is the $j$th bit of the $i$th BCD digit and $r_j$ is the weight of the $j$th bit. Table 1 shows several BCD codings. For BCD–8421, $r_j = 2^j$. BCD–4221 and BCD–5211 are new codings introduced in this paper characterized by the use of redundancy in decimal digit representation.

Digit  BCD-8421  BCD-5421  BCD-4221     BCD-5211
0      0000      0000      0000         0000
1      0001      0001      0001         0001 | 0010
2      0010      0010      0010 | 0100  0100 | 0011
3      0011      0011      0011 | 0101  0101 | 0110
4      0100      0100      1000 | 0110  0111
5      0101      1000      1001 | 0111  1000
6      0110      1001      1010 | 1100  1001 | 1010
7      0111      1010      1011 | 1101  1100 | 1011
8      1000      1011      1110         1110 | 1101
9      1001      1100      1111         1111
Table 1. BCD codings

² A +6 must be added each time a carry is passed to the next more significant digit position.

As we have mentioned, the use of BCD–8421 to represent decimal digits means introducing costly decimal corrections in the partial product reduction binary CSA tree to obtain the correct decimal carry and sum. To avoid these corrections we use the BCD–4221 coding of Table 1 to represent partial product digits. Thus, we can perform fast decimal carry–save addition using an ordinary 4–bit binary 3:2 CSA as

$$A_i + B_i + C_i = \sum_{j=0}^{3} (s_{i,j} + 2 h_{i,j})\, r_j = \sum_{j=0}^{3} s_{i,j}\, r_j + 2 \sum_{j=0}^{3} h_{i,j}\, r_j = S_i + 2 H_i$$

with $(r_3, r_2, r_1, r_0) = (4, 2, 2, 1)$ and

$$s_{i,j} = a_{i,j} \oplus b_{i,j} \oplus c_{i,j}$$
$$h_{i,j} = a_{i,j} \cdot b_{i,j} \vee (a_{i,j} \oplus b_{i,j}) \cdot c_{i,j}$$
$H_i \in [0, 9]$ and $S_i \in [0, 9]$ are the decimal carry and sum digits at position $i$, while the symbols $\vee$, $\cdot$ and $\oplus$ indicate the binary operators OR, AND and XOR respectively. No decimal correction is required because $H_i$ and $S_i$ are valid decimal digits in BCD–4221 code. However a decimal multiplication by 2 is required before using the carry digit for the computations. This can be performed in a simple way by a digit recoding to BCD–5211 (shown in Table 1) followed by a 1–bit wired left shift:

$$2H_i = \mathrm{L1shift}(W_i) = w_{i,3} \cdot 10 + w_{i,2} \cdot 4 + w_{i,1} \cdot 2 + w_{i,0} \cdot 2$$

where

$$W_i = w_{i,3} \cdot 5 + w_{i,2} \cdot 2 + w_{i,1} + w_{i,0}$$

is the BCD–5211 recoded decimal carry digit. Moreover, this operation is in the fast path (carry path of a full–adder). Note that the 1–bit left shift of $W_i$ produces a carry output ($w_{i,3}$) to the next decimal digit ($i+1$), while the least significant bit position is occupied by the carry input ($w_{i-1,3}$) of the previous digit $W_{i-1}$. Logical expressions for BCD–4221 to BCD–5211 recoding are given by
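$$w_{i,3} = h_{i,3} \cdot (h_{i,2} \vee h_{i,1} \vee h_{i,0}) \vee h_{i,2} \cdot h_{i,1} \cdot h_{i,0}$$
$$w_{i,2} = h_{i,2} \cdot h_{i,1} \cdot \overline{(h_{i,3} \oplus h_{i,0})} \vee \left( (h_{i,3} \cdot \overline{h_{i,0}}) \oplus h_{i,2} \oplus h_{i,1} \right)$$
$$w_{i,1} = h_{i,2} \cdot h_{i,1} \cdot \overline{(h_{i,3} \oplus h_{i,0})} \vee h_{i,3} \cdot \overline{h_{i,0}} \cdot \overline{(h_{i,2} \oplus h_{i,1})}$$
$$w_{i,0} = (h_{i,2} \cdot h_{i,1}) \oplus h_{i,3} \oplus h_{i,0}$$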
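The recoding expressions can be checked by brute force over all 16 input codes. The following Python sketch is illustrative only: the names `value`, `recode_4221_to_5211` and `times2` are not from the paper, and the expressions are written with explicit complements and parentheses under the assumption that complement bars and OR/XOR symbols sit where the arithmetic requires them.

```python
from itertools import product

W4221 = (4, 2, 2, 1)   # weights of bits 3..0 in BCD-4221
W5211 = (5, 2, 1, 1)   # weights of bits 3..0 in BCD-5211

def value(bits, weights):
    """Decimal value of a 4-bit code given as (b3, b2, b1, b0)."""
    return sum(b * w for b, w in zip(bits, weights))

def recode_4221_to_5211(h):
    """Recode a BCD-4221 digit h = (h3, h2, h1, h0) into BCD-5211 (hypothetical helper)."""
    h3, h2, h1, h0 = h
    xnor30 = 1 ^ h3 ^ h0                      # complement of (h3 XOR h0)
    w3 = (h3 & (h2 | h1 | h0)) | (h2 & h1 & h0)
    w2 = (h2 & h1 & xnor30) | ((h3 & (1 ^ h0)) ^ h2 ^ h1)
    w1 = (h2 & h1 & xnor30) | (h3 & (1 ^ h0) & (1 ^ h2 ^ h1))
    w0 = (h2 & h1) ^ h3 ^ h0
    return (w3, w2, w1, w0)

def times2(h):
    """2*H via recoding plus a 1-bit left shift.

    Returns (carry_out, (b3, b2, b1)): w3 leaves the digit with weight 10,
    and (w2, w1, w0) land on the BCD-4221 weights (4, 2, 2).
    """
    w3, w2, w1, w0 = recode_4221_to_5211(h)
    return w3, (w2, w1, w0)

# Exhaustive check: the recoding preserves the digit value, and the shift doubles it.
for h in product((0, 1), repeat=4):
    w = recode_4221_to_5211(h)
    assert value(w, W5211) == value(h, W4221)
    carry, (b3, b2, b1) = times2(h)
    assert 10 * carry + 4 * b3 + 2 * b2 + 2 * b1 == 2 * value(h, W4221)
```

Because every 4–bit pattern is a valid BCD–4221 digit, the loop covers the full input space of the recoder.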
Nevertheless, due to the redundancy of BCD–4221 and BCD–5211 codings, there are several choices with different area–delay trade–offs for the logical implementation of this digit recoding. This decimal carry–save algorithm leads to the fast and area–optimized decimal carry–save tree adders detailed in Section 5. Furthermore, conversions between BCD–8421 and BCD–4221 codings can be performed using simple gate–level logic.
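The carry–save step as a whole can be illustrated with a small word–level model. This is a minimal sketch, not the paper's hardware: the helper names (`encode`, `to_digits`, `to_int`, `decimal_csa`) are hypothetical, and the BCD–4221 to BCD–5211 recoder is modelled arithmetically (decode, then re–encode) instead of with gates.

```python
W4221 = (4, 2, 2, 1)
W5211 = (5, 2, 1, 1)

def val(bits, weights):
    return sum(b * w for b, w in zip(bits, weights))

def encode(d, weights):
    """Greedy encoding of a digit d in 0..9; picks one of the valid redundant codes."""
    bits = []
    for w in weights:
        bit = 1 if d >= w else 0
        bits.append(bit)
        d -= bit * w
    return tuple(bits)

def to_digits(n, d):
    """d-digit BCD-4221 representation of n, least significant digit first."""
    return [encode((n // 10**i) % 10, W4221) for i in range(d)]

def to_int(digits):
    return sum(val(dg, W4221) * 10**i for i, dg in enumerate(digits))

def decimal_csa(A, B, C):
    """Reduce three BCD-4221 operands to a sum word S and a doubled carry word T = 2H."""
    S, W = [], []
    for a, b, c in zip(A, B, C):
        # Bitwise full adder on each of the four bit positions of the digit.
        S.append(tuple(x ^ y ^ z for x, y, z in zip(a, b, c)))
        h = tuple((x & y) | ((x ^ y) & z) for x, y, z in zip(a, b, c))
        W.append(encode(val(h, W4221), W5211))   # stand-in for the gate-level recoder
    # 1-bit left shift across the whole word: w3 becomes the LSB of the next digit.
    T, carry_in = [], 0
    for w3, w2, w1, w0 in W:
        T.append((w2, w1, w0, carry_in))
        carry_in = w3
    T.append((0, 0, 0, carry_in))
    return S, T

for a, b, c in [(0, 0, 0), (9999, 9999, 9999), (1234, 5678, 9012), (4096, 777, 5)]:
    S, T = decimal_csa(to_digits(a, 4), to_digits(b, 4), to_digits(c, 4))
    assert to_int(S) + to_int(T) == a + b + c
```

No correction step appears anywhere in the loop: any 4–bit pattern produced by the full adders is already a valid BCD–4221 digit, which is the point of choosing this coding.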
To generate all the partial products in parallel, we obtain all the required multiples. We aim for a fast generation of a reduced number of partial products. This is achieved with the recoding of the multiplier. We have developed three different recodings for the multiplier with good trade–offs between fast generation of partial products and the number of partial products generated. A minimally redundant signed–digit (SD) radix–10 recoding (digits in [−5, 5]) produces only d + 1 partial products but requires a carry–propagate addition to generate the complex multiples 3X and −3X. Minimally redundant signed–digit (SD) radix–4 and radix–5 recodings (with digits in [−2, 2]) produce 2d partial products (2 digits per radix–10 digit) but multiplicand multiples are produced in a few levels of combinational logic. Furthermore, another advantage of using BCD–4221 to represent partial product digits is that the 9's complement of each digit can be obtained by bit inverting each digit. This simplifies the generation of the negative multiplicand multiples. The proposed BCD–8421 to SD recoders and the generation and selection of multiples are detailed in Section 4.
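The bit–inversion property follows from the BCD–4221 weights summing to 9. A short Python check, with illustrative helper names, contrasts it with BCD–8421, where an addition of 6 is needed before inverting:

```python
W4221 = (4, 2, 2, 1)

def val(bits, weights):
    return sum(b * w for b, w in zip(bits, weights))

def invert(bits):
    return tuple(1 - b for b in bits)

# BCD-4221: inverting any of the 16 codes yields a code for 9 - value.
for code in range(16):
    bits = tuple((code >> k) & 1 for k in (3, 2, 1, 0))
    assert val(invert(bits), W4221) == 9 - val(bits, W4221)

def nines_complement_8421(digit):
    """9 - digit for a BCD-8421 digit: add 6, then invert the 4-bit result."""
    return ~(digit + 6) & 0xF

for d in range(10):
    assert nines_complement_8421(d) == 9 - d
```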
For the final decimal carry–propagate addition we use a binary quaternary tree (Q–T) adder modified to perform decimal additions [17]. Decimal quaternary tree adders based on conditional speculative decimal addition present low latency (about 10% more than the fastest binary adders) and require less hardware than other alternatives.
4. Generation of partial products
4.1. Multiplier recoding
A. Signed–Digit Radix–10 Recoding.
This recoding transforms the digit set {0,...,9} into the signed–digit (SD) set {−5,...,5} to perform the selection of multiples in a similar way as modified Booth recoding. Fig. 1 shows a block diagram of the recoding and the multiplicand multiple selection units.

[Figure 1. Partial product generation for SD radix–10. A BCD–8421 to SD radix–10 recoder takes digit Y_i and the overlapped digit Y_{i−1} and produces the signals ys_i, y1_i, ..., y5_i; a 5:1 multiplexer (Mux–5) selects among X, 2X, 3X, 4X and 5X to form partial product i.]

We denote by $Y^*_i = y^*_{i,3} \cdot 5 + \sum_{j=0}^{2} y^*_{i,j}\, 2^j$ the digits of the multiplier coded in BCD–5421 (see Table 1). The recoded SD radix–10 multiplier can be expressed in terms of $Y^*_i$ as

$$Y = \sum_{i=0}^{d-1} \left( y^*_{i,3} \cdot 10 - y^*_{i,3} \cdot 5 + \sum_{j=0}^{2} y^*_{i,j}\, 2^j \right) 10^i = y^*_{d-1,3} \cdot 10^d + \sum_{i=0}^{d-1} Yb_i \cdot 10^i$$

where the value of each SD radix–10 digit $Yb_i \in [-5, 5]$ is given by

$$Yb_i = -y^*_{i,3} \cdot 5 + \sum_{j=0}^{2} y^*_{i,j}\, 2^j + y^*_{i-1,3}$$

with $y^*_{-1,3} = 0$. Control signals (in "hot–one" code) can be obtained directly from input BCD–8421 multiplier digits using the following logical expressions:
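$$ys_i = y_{i,3} \vee y_{i,2} \cdot (y_{i,1} \vee y_{i,0})$$
$$y5_i = y_{i,2} \cdot \overline{y_{i,1}} \cdot (y_{i,0} \oplus ys_{i-1})$$
$$y4_i = ys_{i-1} \cdot y_{i,0} \cdot (y_{i,2} \oplus y_{i,1}) \vee \overline{ys_{i-1}} \cdot \overline{y_{i,0}} \cdot y_{i,2}$$
$$y3_i = y_{i,1} \cdot (y_{i,0} \oplus ys_{i-1})$$
$$y2_i = \overline{y_{i,0}} \cdot \overline{ys_{i-1}} \cdot (y_{i,3} \vee \overline{y_{i,2}} \cdot y_{i,1}) \vee \overline{y_{i,3}} \cdot ys_{i-1} \cdot y_{i,0} \cdot \overline{(y_{i,2} \oplus y_{i,1})}$$
$$y1_i = \overline{(y_{i,2} \vee y_{i,1})} \cdot (y_{i,0} \oplus ys_{i-1})$$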
Since multiplicand multiples are recoded to BCD–4221, negative multiples can be generated by the XOR of $ys_i$ with the corresponding positive multiple as shown in the multiplicand multiple selector of Fig. 1.
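A behavioural model helps confirm that the selection signals implement the SD radix–10 recoding. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, $ys_{i-1}$ is taken to play the role of the overlapped bit $y^*_{i-1,3}$, and the two–term forms used for `s4` and `s2` are derived here from the digit definition $Yb_i$ and are not necessarily the gate structure used by the authors.

```python
def sd_radix10_signals(y, ys_prev):
    """Selection signals for BCD-8421 digit y (0..9) and the previous sign ys_prev."""
    y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
    n = lambda b: 1 - b            # bit complement
    x = y0 ^ ys_prev
    ys = y3 | (y2 & (y1 | y0))     # digit >= 5 gives a negative SD digit and a carry
    s5 = y2 & n(y1) & x
    s4 = (ys_prev & y0 & (y2 ^ y1)) | (n(ys_prev) & n(y0) & y2)
    s3 = y1 & x
    s2 = (n(y0) & n(ys_prev) & (y3 | (n(y2) & y1))) | \
         (n(y3) & ys_prev & y0 & n(y2 ^ y1))
    s1 = n(y2 | y1) & x
    return ys, (s1, s2, s3, s4, s5)

def recode_sd_radix10(Y, d):
    """Return (SD digits, LSD first, each in [-5, 5]) and the carry of weight 10**d."""
    out, ys_prev = [], 0
    for i in range(d):
        ys, sel = sd_radix10_signals((Y // 10**i) % 10, ys_prev)
        assert sum(sel) <= 1       # at most one multiple is selected
        mag = sum(k * s for k, s in zip(range(1, 6), sel))
        out.append(-mag if ys else mag)
        ys_prev = ys
    return out, ys_prev

for Y in range(10000):
    digits, carry = recode_sd_radix10(Y, 4)
    assert all(-5 <= v <= 5 for v in digits)
    assert carry * 10**4 + sum(v * 10**i for i, v in enumerate(digits)) == Y
```

The final carry corresponds to the $y^*_{d-1,3} \cdot 10^d$ term, which is why this recoding yields d + 1 partial products.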
B. Signed–Digit Radix–4 Recoding.
Two SD radix–4 digits $Y^U_i \in \{0, 1, 2\}$ (upper) and $Y^L_i \in \{-2, -1, 0, 1, 2\}$ (lower) are generated for each BCD–8421 digit ($Y_i = Y^U_i \cdot 4 + Y^L_i$). We obtain the SD radix–4 selection signals directly from the BCD–8421 digits as

($Y^U_i$)
$$ys^U_i = y_{i,3}$$
$$y2^U_i = \overline{y_{i,3}} \cdot y_{i,2} \cdot y_{i,1}$$
$$y1^U_i = \overline{y_{i,3}} \cdot (y_{i,2} \oplus y_{i,1})$$

($Y^L_i$)
$$ys^L_i = y_{i,3} \vee y_{i,1}$$
$$y2^L_i = ys^L_i \cdot \overline{y_{i,0}} \cdot \overline{y_{i-1,3}} \vee \overline{ys^L_i} \cdot y_{i,0} \cdot y_{i-1,3}$$
$$y1^L_i = y_{i,0} \oplus y_{i-1,3}$$
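The radix–4 signals can be exercised the same way. The sketch below rests on one interpretive assumption that should be read as such: bit $y_{i,3}$ (weight 8) is treated as a carry of weight 10 into the next decimal digit, leaving $-2 y_{i,3}$ in the current digit, which is how the $y_{i-1,3}$ input of the lower digit is understood here. Function names are hypothetical.

```python
def sd_radix4_digit(y, c):
    """(upper, lower) SD radix-4 digits for BCD-8421 digit y and carry-in c = y_{i-1,3}."""
    y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
    n = lambda b: 1 - b
    ysU = y3
    y2U = n(y3) & y2 & y1
    y1U = n(y3) & (y2 ^ y1)
    ysL = y3 | y1
    y2L = (ysL & n(y0) & n(c)) | (n(ysL) & y0 & c)
    y1L = y0 ^ c
    upper = (-1 if ysU else 1) * (2 * y2U + y1U)   # magnitude is 0 whenever ysU = 1
    lower = (-1 if ysL else 1) * (2 * y2L + y1L)
    return upper, lower

def recode_sd_radix4(Y, d):
    """Return the list of (upper, lower) pairs, LSD first, and the final carry."""
    out, c = [], 0
    for i in range(d):
        y = (Y // 10**i) % 10
        out.append(sd_radix4_digit(y, c))
        c = (y >> 3) & 1
    return out, c

for Y in range(10000):
    pairs, carry = recode_sd_radix4(Y, 4)
    assert all(0 <= u <= 2 and -2 <= l <= 2 for u, l in pairs)
    total = carry * 10**4 + sum((4 * u + l) * 10**i for i, (u, l) in enumerate(pairs))
    assert total == Y
```

Under this reading each decimal digit contributes one multiple from {4X, 8X} and one from {±X, ±2X}, matching the 2d partial products stated in Section 3.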
The block diagram of a 4–bit combined binary/decimal recoder and the corresponding multiplicand multiple selector are shown in Fig. 2, where control signal $d_M$ is true for decimal multiplication. The combined SD radix–4 recoder implements the decimal selection signals and the conventional Booth radix–4 selection signals. Upper signals select multiples ±8X and ±4X while lower signals select multiples {−2X, −X, X, 2X}. Although the resulting combined SD radix–4 recoders and multiple selectors are simple, obtaining decimal multiples 4X and 8X requires double and triple the latency of obtaining the decimal 2X multiple.

[Figure 2. Partial product generation for SD radix–4. A combined binary/BCD–8421 to SD radix–4 recoder produces the upper and lower selection signals; 2:1 multiplexers controlled by d_M choose between the binary and BCD versions of X, 2X, 4X and 8X, and two Mux–2 selectors form the upper and lower partial products.]
C. Signed–Digit Radix–5 Recoding.
This recoding uses a different set of multiplicand multiples (5X, 10X instead of 4X, 8X) for decimal partial product generation that have a similar latency to 2X and X. Each BCD–8421 digit of the multiplier is encoded into two radix–5 digits ($Y_i = Y^U_i \cdot 5 + Y^L_i$) with $Y^U_i \in \{0, 1, 2\}$ and $Y^L_i \in \{-2, -1, 0, 1, 2\}$.
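SD radix–5 selection signals are obtained from the BCD–8421 input digits using:

($Y^U_i$)
$$ys^U_i = 0$$
$$y2^U_i = y_{i,3}$$
$$y1^U_i = y_{i,2} \vee y_{i,1} \cdot y_{i,0}$$

($Y^L_i$)
$$ys^L_i = y_{i,3} \vee \overline{y_{i,2}} \cdot y_{i,1} \cdot y_{i,0} \vee y_{i,2} \cdot \overline{y_{i,1}} \cdot \overline{y_{i,0}}$$
$$y2^L_i = y_{i,1} \cdot (\overline{y_{i,2}} \vee y_{i,0}) \vee y_{i,3} \cdot \overline{y_{i,0}}$$
$$y1^L_i = y_{i,2} \cdot \overline{y_{i,0}} \vee \overline{y_{i,2}} \cdot \overline{y_{i,1}} \cdot y_{i,0}$$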
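Since the radix–5 recoding has no dependence between digits, ten cases suffice to check it. The following Python sketch is illustrative (the function name is hypothetical); the expression used for `y2L` is one form derived from the requirement that the lower digit has magnitude 2 exactly for the digits 2, 3, 7 and 8, and other equivalent forms exist.

```python
def sd_radix5_digit(y):
    """(upper, lower) with y == 5*upper + lower, upper in {0,1,2}, lower in [-2, 2]."""
    y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
    n = lambda b: 1 - b
    y2U = y3                                   # selects 10X (digits 8 and 9)
    y1U = y2 | (y1 & y0)                       # selects 5X (digits 3..7)
    ysL = y3 | (n(y2) & y1 & y0) | (y2 & n(y1) & n(y0))
    y2L = (y1 & (n(y2) | y0)) | (y3 & n(y0))   # selects 2X (digits 2, 3, 7, 8)
    y1L = (y2 & n(y0)) | (n(y2) & n(y1) & y0)  # selects X  (digits 1, 4, 6, 9)
    upper = 2 * y2U + y1U
    lower = (-1 if ysL else 1) * (2 * y2L + y1L)
    return upper, lower

for y in range(10):
    upper, lower = sd_radix5_digit(y)
    assert upper in (0, 1, 2) and -2 <= lower <= 2
    assert 5 * upper + lower == y
```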
The block diagram of the digit recoder and multiples selector is shown in Fig. 3(a). A combined binary radix–4/decimal radix–5 block diagram for the partial product generation is proposed in Fig. 3(b). Multiplexers controlled by $d_M$ select the operands required by binary or decimal multiplications. Although BCD to SD radix–4 encoding is slightly simpler than radix–5, partial product generation for decimal SD radix–5 is faster and comparable in latency with binary SD radix–4, due to a faster generation of multiplicand multiples as we show in the following subsection.
4.2. Generation of multiplicand multiples
Decimal multiplicand multiples 2X and 5X are obtained in a few levels of logic using recoding and wired left shifts. Any other multiple is generated using these multiples or from multiplicand X. The generation sequence of 2X is as follows. Each BCD–8421 digit is first recoded to BCD–5211 using

$$w_{i,3} = h_{i,3} \vee h_{i,2} \cdot (h_{i,1} \vee h_{i,0})$$
$$w_{i,2} = h_{i,3} \vee \left( h_{i,1} \oplus (h_{i,2} \cdot \overline{h_{i,0}}) \right)$$
$$w_{i,1} = h_{i,3} \cdot h_{i,0} \vee h_{i,2} \cdot \overline{(h_{i,1} \vee h_{i,0})}$$
$$w_{i,0} = h_{i,3} \vee (h_{i,2} \oplus h_{i,0})$$

Then a wired 1–bit left shift is performed over the recoded multiplicand, obtaining the 2X multiple in BCD–4221.

[Figure 3. Partial product generation for SD radix–5. (a) Decimal SD radix–5 recoding: a BCD–8421 to SD radix–5 recoder drives two Mux–2 selectors choosing between 5X and 10X (upper) and between X and 2X (lower). (b) Combined binary/decimal to SD radix–4/radix–5 recoding: multiplexers controlled by d_M choose between the binary multiples (X, 2X, 4X, 8X) and the BCD multiples (X, 2X, 5X, 10X).]
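The 2X generation sequence can be modelled at word level as follows. This is a sketch under the assumption that the recoding expressions carry complements and OR/XOR operators as written in the code; `bits`, `recode_8421_to_5211`, `times2` and `value_4221` are illustrative names.

```python
W4221 = (4, 2, 2, 1)

def bits(d):
    """BCD-8421 digit as (b3, b2, b1, b0)."""
    return ((d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1)

def recode_8421_to_5211(h):
    h3, h2, h1, h0 = h
    w3 = h3 | (h2 & (h1 | h0))
    w2 = h3 | (h1 ^ (h2 & (1 ^ h0)))
    w1 = (h3 & h0) | (h2 & (1 ^ (h1 | h0)))
    w0 = h3 | (h2 ^ h0)
    return (w3, w2, w1, w0)

def times2(X, d):
    """2X as d+1 BCD-4221 digits (LSD first): recode each digit, then shift left 1 bit."""
    out, carry_in = [], 0
    for i in range(d):
        w3, w2, w1, w0 = recode_8421_to_5211(bits((X // 10**i) % 10))
        out.append((w2, w1, w0, carry_in))   # bits now sit on weights (4, 2, 2, 1)
        carry_in = w3                        # weight 10: LSB of the next digit
    out.append((0, 0, 0, carry_in))
    return out

def value_4221(digits):
    return sum(sum(b * w for b, w in zip(dg, W4221)) * 10**i
               for i, dg in enumerate(digits))

for X in range(10000):
    assert value_4221(times2(X, 4)) == 2 * X
```

No carry ripples between digits: each output digit depends only on its own input digit and the top recoded bit of its lower neighbour.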
The 5X multiple is obtained by a simple 3–bit left shift of the multiplicand, but with resultant digits coded in BCD–5421. Thus a digit recoding from BCD–5421 to BCD–4221 is performed using the expressions

$$w_{i,3} = h_{i,3} \vee h_{i,2}$$
$$w_{i,2} = h_{i,3} \cdot (h_{i,2} \vee (h_{i,1} \cdot h_{i,0}))$$
$$w_{i,1} = h_{i,1} \vee h_{i,3} \cdot (h_{i,2} \vee h_{i,0})$$
$$w_{i,0} = h_{i,3} \oplus h_{i,0}$$
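The 5X path can be sketched in the same style. After a 3–bit left shift of the BCD–8421 word, digit $i$ holds the bits $(x_{i,0}, x_{i-1,3}, x_{i-1,2}, x_{i-1,1})$, which read with weights (5, 4, 2, 1) give the corresponding digit of 5X. The helper names below are hypothetical, and the recoding is written in a plain OR/AND/XOR form that was checked against all ten BCD–5421 codes.

```python
W4221 = (4, 2, 2, 1)

def bits(d):
    """BCD-8421 digit as (b3, b2, b1, b0)."""
    return ((d >> 3) & 1, (d >> 2) & 1, (d >> 1) & 1, d & 1)

def recode_5421_to_4221(h):
    h3, h2, h1, h0 = h
    w3 = h3 | h2
    w2 = h3 & (h2 | (h1 & h0))
    w1 = h1 | (h3 & (h2 | h0))
    w0 = h3 ^ h0
    return (w3, w2, w1, w0)

def times5(X, d):
    """5X as d+1 BCD-4221 digits (LSD first) via a 3-bit left shift plus recoding."""
    xs = [bits((X // 10**i) % 10) for i in range(d)] + [(0, 0, 0, 0)]
    out, prev = [], (0, 0, 0, 0)
    for x in xs:
        # (x_{i,0}, x_{i-1,3}, x_{i-1,2}, x_{i-1,1}) is a BCD-5421 digit of 5X
        h = (x[3], prev[0], prev[1], prev[2])
        out.append(recode_5421_to_4221(h))
        prev = x
    return out

def value_4221(digits):
    return sum(sum(b * w for b, w in zip(dg, W4221)) * 10**i
               for i, dg in enumerate(digits))

for X in range(10000):
    assert value_4221(times5(X, 4)) == 5 * X
```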
The generation of negative multiples is performed by evaluating the 10's complement of positive multiples as

$$-X = \sum_{i=0}^{d-1} (9 - X_i) \cdot 10^i + 1 \pmod{10^d}$$

For BCD–8421 the 9's complement of each digit is obtained by a digit addition of +6 followed by a bit–complement operation, since $9 - X_i = \overline{X_i + 6}$. For BCD–4221 it is obtained simply by bit–complementing the positive multiple, since $9 - X_i = \overline{X_i}$. The +1 of the 10's complement is added in the partial product reduction tree by a tail encoding bit, since each partial product is 4–bit (or at least 1–bit) left shifted from the previous one. To avoid sign extension and thus to reduce the complexity, the partial product signs $sg_i$ are encoded in each leading digit position as

$$-\sum_{i=0}^{d-1} sg_i\, 10^{i+d} = -10^{2d} + \sum_{i=0}^{d-1} (9 - sg_i)\, 10^{i+d} + 10^d = -10^{2d} + \sum_{i=1}^{d-1} (8 + \overline{sg_i})\, 10^{i+d} + (\overline{sg_0} \cdot 10 + sg_0 \cdot 9)\, 10^d$$

Each partial product is at most d + 3 digits long, due to the three extra digit positions required for the encoded sign, the tail encoding bit and the left shifting.
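The sign–encoding identity can be verified numerically. The sketch below assumes that the left–hand side is the negated sum of sign digits (each negative partial product contributes $-10^{i+d}$ through its sign extension) and that the constant collected from the 9's is $10^d$; the function names are illustrative.

```python
from itertools import product

def sign_extension_sum(sg):
    """Total contribution of the sign extensions: -sum(sg_i * 10^(i+d))."""
    d = len(sg)
    return -sum(s * 10**(i + d) for i, s in enumerate(sg))

def encoded_signs(sg):
    """Same quantity written with one encoded digit per partial product."""
    d = len(sg)
    body = sum((8 + (1 - sg[i])) * 10**(i + d) for i in range(1, d))
    first = ((1 - sg[0]) * 10 + sg[0] * 9) * 10**d
    return -10**(2 * d) + body + first

for d in range(1, 7):
    for sg in product((0, 1), repeat=d):
        assert sign_extension_sum(sg) == encoded_signs(sg)
```

The constant $-10^{2d}$ falls outside the 2d–digit product and can be dropped, so each row only needs a single leading digit (8 or 9, with 9 or 10 for the first row) instead of a run of sign digits.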
Fig. 4(a) shows the block diagram for the generation of multiplicand multiples for SD radix–10 encoding. Multiple 4X is obtained as 2 × 2X. Multiple 3X is evaluated by a carry–propagate addition of multiples X and 2X in a decimal quaternary tree [17]. The latency of the partial product generation is constrained by the generation of 3X. The SD radix–10 multiple selector of Fig. 1 uses the XOR operation to select positive or negative multiples as a function of the SD radix–10 control signal $ys_i$.

Fig. 4(b) shows the generation of multiples for the case of decimal SD radix–4 recoding. Multiple 8X is obtained as 2 × 2 × 2X, so the latency of multiplicand multiples generation is about three times the latency of the 2X operation. On the other hand, generation of radix–5 multiples is faster (approximately the latency of 2X), as shown in Fig. 4(c).
5. Reduction of partial products
To implement the algorithm for carry–save addition formulated in Section 3 we propose a decimal 3:2 CSA that reduces 3 BCD–4221 digits to a carry and a sum BCD–4221 digit. This module consists of a 4–bit binary 3:2 CSA plus a BCD–4221 to BCD–5211 digit recoder. From this module we construct p:2 (p ≥ 3) decimal CSAs, optimizing the critical path delay using fast inputs and outputs.
5.1. Decimal 3:2 carry-save adder
The block diagram of the proposed 4–bit 3:2 CSA is shown in Fig. 5(a). The block labeled ×2 performs the multiplication of the carry digit by 2. For decimal multiplication the ×2 module is detailed in Fig. 5(b). It consists of a BCD–4221 to BCD–5211 digit recoder and a 1–bit wired left shift. A combined binary/decimal 3:2 CSA is shown in Fig. 5(c). A 4–bit 2:1 multiplexer controlled by $d_M$ selects
References
Decimal floating-point: algorism for computers
A 4.4 ns CMOS 54×54-b multiplier using pass-transistor multiplexer
Decimal multiplication via carry-save addition
Decimal multiplication with efficient partial product generation