What have the authors contributed in "A new family of high–performance parallel decimal multipliers"?

This paper introduces two novel architectures for parallel decimal multipliers. The authors also present three schemes for fast and efficient generation of partial products in parallel.

How fast is the proposed SD radix–5 scheme?

The proposed SD radix–5 scheme is 1.7 times faster than [5] but generates 32 partial products, while the proposed SD radix–10 scheme is 1.3 times slower than [5].

What is the critical path delay of the radix–4 binary multiplier?

Synthesis results given in [8] show a critical path delay of 2.65 ns and an equivalent area of 68,000 NAND2 gates, with ratios of 1.90 for delay and 1.50 for area with respect to a radix–4 binary multiplier.

What is the way to compare the CMOS gates with other proposed architectures?

The authors have used an area–delay model for static CMOS gates based on logical effort to evaluate the area–delay figures of the proposed architectures and two representative binary parallel multipliers [10, 13].

What is the ×2 multiplication for the final decimal carry operand?

The ×2 multiplication for the final decimal carry operand is performed in parallel with the first stage of the decimal carry–propagate adder (+6 digit addition).

What are the area–delay figures of the decimal SD radix–10 multiplier?

The area–delay figures from a comparative study including conventional binary parallel multipliers and other representative decimal proposals show that their decimal SD radix–10 multiplier is an interesting option for high performance with moderate area.

What is the recoding of the SD radix–10 multiplier?

The recoded SD radix–10 multiplier can be expressed in terms of $Y_i^*$ as

$$Y = \sum_{i=0}^{d-1} \Big( y^*_{i,3}\, 10 - y^*_{i,3}\, 5 + \sum_{j=0}^{2} y^*_{i,j}\, 2^j \Big) 10^i = y^*_{d-1,3}\, 10^d + \sum_{i=0}^{d-1} Yb_i\, 10^i$$

where the value of each SD radix–10 digit $Yb_i \in [-5, 5]$ is $Yb_i = -y^*_{i,3}\, 5 + \sum_{j=0}^{2} y^*_{i,j}\, 2^j + y^*_{i-1,3}$, with $y^*_{-1,3} = 0$.

18th IEEE Symposium on Computer Arithmetic (ARITH'07), 0-7695-2854-6/07 $20.00 © 2007

(Open Access) A New Family of High–Performance Parallel Decimal Multipliers (2007) | Alvaro Vazquez

Q: What is the way to add a digit to a decimal floating point?

Extension to decimal floating–point multiplication involves exponent addition, rounding of X · Y to fit the required precision and sign calculations.

Q: How long does multiple 8X take to generate?

Multiple 8X is obtained as 2 × 2 × 2X , so the latency of multiplicand multiples generation is about three times the latency of 2X operation.

Q: What are the three techniques proposed in the binary CSA?

In order to eliminate decimal corrections from the critical path of the binary CSA, three different techniques were proposed in [6].

A New Family of High–Performance Parallel Decimal Multipliers∗

Álvaro Vázquez, Elisardo Antelo

University of Santiago de Compostela

Dept. of Electronic and Computer Science

15782 Santiago de Compostela, Spain

alvaro@dec.usc.es, elisardo@dec.usc.es

Paolo Montuschi

Politecnico di Torino

Dept. of Computer Engineering

10129 Torino, Italy

montuschi@polito.it

Abstract

This paper introduces two novel architectures for parallel decimal multipliers. Our multipliers are based on a new algorithm for decimal carry–save multioperand addition that uses a novel BCD–4221 recoding for decimal digits. It significantly improves the area and latency of the partial product reduction tree with respect to previous proposals. We also present three schemes for fast and efficient generation of partial products in parallel. The recoding of the BCD–8421 multiplier operand into minimally redundant signed–digit radix–10, radix–4 and radix–5 representations using new recoders reduces the complexity of partial product generation. In addition, SD radix–4 and radix–5 recodings allow the reuse of a conventional parallel binary radix–4 multiplier to perform combined binary/decimal multiplications. Evaluation results show that the proposed architectures have interesting area–delay figures compared to conventional Booth radix–4 and radix–8 parallel binary multipliers and other representative alternatives for decimal multiplication.

1. Introduction

Providing hardware support for decimal arithmetic is becoming a topic of interest. Specifically, the revision of the IEEE–754 Standard for Floating–Point Arithmetic (IEEE–754r) [1] already incorporates specifications for decimal arithmetic. Thus, it is expected that microprocessor manufacturers will include decimal floating–point units in their products oriented to mainframe servers to satisfy the high performance demands of current financial, commercial and user–oriented applications [3].

An important and frequent operation in decimal computations is multiplication. However, due to the inherent inefficiency of decimal arithmetic implementations in binary logic, practically all the proposed decimal multipliers are sequential units [2, 4, 7, 9, 11, 16]. Recently, the first implementation of a parallel decimal multiplier was presented in [8]. Parallel multipliers are used extensively in most binary floating–point units [10, 13] and are of interest for decimal applications to scale performance.

∗ A. Vázquez and E. Antelo were supported in part by the Ministry of Science and Technology of Spain under contract TIN2004-07797-C02 and Xunta de Galicia under contract PGIDT03TIC10502PR.

In this paper, we introduce new methods for the efficient implementation of decimal parallel multiplication by a parallel generation of partial products and the reduction of these partial products using a novel decimal carry–save addition tree. We present the architectures of two different high–performance parallel multipliers that implement these methods. The second architecture also allows an effective implementation of a combined binary/decimal multiplier. These high–performance implementations have similar hardware complexity or a moderate increment in area with respect to the equivalent binary parallel multipliers.

The paper is organized as follows. Section 2 outlines the most representative previous work on decimal multiplication. In Section 3 we introduce our proposals for an efficient implementation of decimal parallel multiplication. The proposed techniques for the generation of partial products are detailed in Section 4, while the reduction of partial products is fully discussed in Section 5. We describe the two resulting architectures and some variants in Section 6. In Section 7 we provide rough area–delay evaluation results for 64–bit (16 decimal digits) decimal and binary parallel multipliers. We compare these results with some other representative works. Finally, we summarize the main conclusions in Section 8.

2. An overview of decimal multiplication

Multiplication consists of three stages: generation of partial products, fast reduction (addition) of partial products to two operands, and a final carry–propagate addition. Decimal multiplication is more complex than binary multiplication mainly for two reasons: the higher range of decimal digits ([0, 9]), which increases the number of multiplicand multiples, and the inefficiency of representing decimal values in systems based on binary logic using BCD–8421 (since only 10 out of the 16 possible 4–bit combinations represent a valid decimal digit). These issues complicate the generation and reduction of partial products.

Proposed methods for the generation of decimal partial products follow two approaches. The first alternative [2, 4] generates and stores all the required multiplicand multiples. Next, multiples are distributed to the reduction stage through multiplexers controlled by the multiplier digits. This approach requires more than a cycle to generate some complex BCD–8421 multiplicand multiples (3X, 6X, 7X, 8X, 9X). To avoid complicated multiples the multiplier can be recoded. In [8] each multiplier digit is recoded as $Y_i = Y_i^U \cdot 5 + Y_i^L$, with $Y_i^U \in \{0, 1\}$ and $Y_i^L \in \{-2, -1, 0, 1, 2\}$. Multiples 2X and 5X can be computed without a carry propagation over the whole number. Negative multiples require an additional 9's complement addition. The second approach generates only the partial products as needed using digit–by–digit lookup table methods [9, 16]. In a recent work [5], a magnitude range reduction of the operand digits by a radix–10 signed–digit recoding (from [0, 9] to [−5, 5]) is suggested. This recoding of both operands speeds up and simplifies the generation of partial products. Then, overlapped signed–digit partial products¹ are generated using simplified tables and a set of multiplexers and XOR gates.

First attempts to improve decimal multiplication performed the reduction of decimal partial products using some scheme for decimal carry–propagate addition such as direct decimal addition [12]. Proposals to perform the reduction of decimal partial products using multioperand carry–free addition were suggested in [9] (carry–save) and [15] (signed–digit). Recently several techniques have been proposed that improve these previous works. In [5] a signed–digit decimal adder based on [15] is used. Redundant binary coded decimal (RBCD) adders [14] can also perform decimal carry–free additions using a signed–digit representation of decimal digits (∈ [−7, 7]). In [11] a scheme of two levels of 3–2 binary carry–save adders (CSA) is used to add the partial products iteratively. Since it uses BCD–8421 to represent decimal digits, a digit addition of +6 or +12 (modulo 16) is required to obtain the decimal carry and to correct the sum digit. Logic for detection of decimal carries and sum digit is in the critical path (sum path). In order to eliminate decimal corrections from the critical path of the binary CSA, three different techniques were proposed in [6]. Among these proposals, non–speculative adders present the best area–delay figures and are the most suitable for multioperand addition using a CSA tree. Non–speculative adders reduce the BCD–8421 input operands using a binary CSA tree. Preliminary sum digits are then obtained using a level of 4–bit carry–propagate adders. Finally, decimal carry and sum digit corrections are determined from the preliminary sum digit and the carries passed to the next more significant digit position in the binary CSA tree². Decimal correction is performed using combinational logic (its complexity depends on the number of input operands added) and a 3–bit carry–propagate adder per digit.

¹ Two overlapped digits in the range of [−5, 5] and [−2, 2] are generated for each partial product digit position.

Another representative technique [4] uses an array of 4–bit decimal carry–propagate adders based on direct decimal addition. This adder takes two BCD–8421 digits and a 1–bit input carry and generates a 1–bit decimal carry and the BCD–8421 sum digit. An iterative decimal multiplier based on a refinement of [4] is presented in [7]. It uses BCD–8421 invalid combinations to simplify the sum digit logic. A combinational radix–10 CSA tree is implemented in [8] using these 4–bit decimal carry–propagate adders. To optimize the partial product reduction they also use an array of decimal digit counters. Each counter adds 8 decimal carries of the same weight and produces a BCD–8421 digit.

3. Proposed techniques for decimal parallel multiplication

We assume that multiplicand X and multiplier Y are unsigned decimal integer words. Extension to decimal floating–point multiplication involves exponent addition, rounding of X · Y to fit the required precision and sign calculations. We represent the decimal digits of any d–digit decimal integer operand $Z = \sum_{i=0}^{d-1} Z_i \cdot 10^i$ by

$$Z_i = \sum_{j=0}^{3} z_{i,j} \cdot r_j$$

where $Z_i \in [0, 9]$ is the $i$-th decimal digit, $z_{i,j}$ is the $j$-th bit of the $i$-th BCD digit and $r_j$ is the weight of the $j$-th bit. Table 1 lists several BCD codings. For BCD–8421, $r_j = 2^j$. BCD–4221 and BCD–5211 are new codings introduced in this paper characterized by the use of redundancy in decimal digit representation.

Digit   BCD-8421   BCD-5421   BCD-4221      BCD-5211
0       0000       0000       0000          0000
1       0001       0001       0001          0001 | 0010
2       0010       0010       0010 | 0100   0100 | 0011
3       0011       0011       0011 | 0101   0101 | 0110
4       0100       0100       1000 | 0110   0111
5       0101       1000       1001 | 0111   1000
6       0110       1001       1010 | 1100   1001 | 1010
7       0111       1010       1011 | 1101   1100 | 1011
8       1000       1011       1110          1110 | 1101
9       1001       1100       1111          1111

Table 1. BCD codings

² A +6 must be added each time a carry is passed to the next more significant digit position.

As we have mentioned, the use of BCD–8421 to represent decimal digits means introducing costly decimal corrections in the partial product reduction binary CSA tree to obtain the correct decimal carry and sum. To avoid these corrections we use the BCD–4221 coding of Table 1 to represent partial product digits. Thus, we can perform fast decimal carry–save addition using an ordinary 4–bit binary 3:2 CSA as

$$A_i + B_i + C_i = \sum_{j=0}^{3} (a_{i,j} + b_{i,j} + c_{i,j})\, r_j = \sum_{j=0}^{3} (s_{i,j} + 2 h_{i,j})\, r_j = \sum_{j=0}^{3} s_{i,j}\, r_j + 2 \sum_{j=0}^{3} h_{i,j}\, r_j = S_i + 2 H_i$$

with $(r_3, r_2, r_1, r_0) = (4, 2, 2, 1)$ and

$$s_{i,j} = a_{i,j} \oplus b_{i,j} \oplus c_{i,j}$$

$$h_{i,j} = a_{i,j} \cdot b_{i,j} \vee (a_{i,j} \vee b_{i,j}) \cdot c_{i,j}$$

$H_i \in [0, 9]$ and $S_i \in [0, 9]$ are the decimal carry and sum digits at position $i$, while the symbols ∨, · and ⊕ indicate the binary operators OR, AND and XOR respectively. No decimal correction is required because $H_i$ and $S_i$ are valid decimal digits in BCD–4221 code. However, a decimal multiplication by 2 is required before using the carry digit for the computations.

This can be performed in a simple way by a digit recoding to BCD–5211 (shown in Table 1) followed by a 1–bit wired left shift:

$$2 H_i = \mathrm{L1shift}(W_i) = w_{i,3}\, 10 + w_{i,2}\, 4 + w_{i,1}\, 2 + w_{i,0}\, 2$$

where

$$W_i = w_{i,3}\, 5 + w_{i,2}\, 2 + w_{i,1} + w_{i,0}$$

is the BCD–5211 recoded decimal carry digit. Moreover, this operation is in the fast path (carry path of a full–adder). Note that the 1–bit left shift of $W_i$ produces a carry output ($w_{i,3}$) to the next decimal digit ($i + 1$), while the least significant bit position is occupied by the carry input ($w_{i-1,3}$) of the previous digit $W_{i-1}$. Logical expressions for BCD–4221 to BCD–5211 recoding are given by

$$w_{i,3} = h_{i,3} \cdot (h_{i,2} \vee h_{i,1} \vee h_{i,0}) \vee h_{i,2} \cdot h_{i,1} \cdot h_{i,0}$$

$$w_{i,2} = h_{i,2} \cdot h_{i,1} \cdot \overline{(h_{i,3} \oplus h_{i,0})} \vee \big( (h_{i,3} \cdot \overline{h_{i,0}}) \oplus h_{i,2} \oplus h_{i,1} \big)$$

$$w_{i,1} = h_{i,2} \cdot h_{i,1} \cdot \overline{(h_{i,3} \oplus h_{i,0})} \vee h_{i,3} \cdot \overline{h_{i,0}} \cdot \overline{(h_{i,2} \oplus h_{i,1})}$$

$$w_{i,0} = (h_{i,2} \cdot h_{i,1}) \oplus h_{i,3} \oplus h_{i,0}$$

Nevertheless, due to the redundancy of BCD–4221 and BCD–5211 codings, there are several choices with different area–delay trade–offs for the logical implementation of this digit recoding. This decimal carry–save algorithm leads to fast and area–optimized decimal carry–save tree adders, detailed in Section 5. Furthermore, conversions between BCD–8421 and BCD–4221 codings can be performed using a simple level of gates.
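The carry–save step described above can be sketched in Python as a sanity check. This is only an illustrative model, not the authors' hardware: the function names are hypothetical, digits are tuples `(b3, b2, b1, b0)` listed least significant digit first, and the complemented terms in the recoder are an assumption chosen so that every BCD–4221 code maps to a BCD–5211 code of the same value.

```python
from itertools import product

W4221 = (4, 2, 2, 1)  # BCD-4221 bit weights, most significant bit first


def digit_value(bits, weights=W4221):
    """Decimal value of one 4-bit digit given as (b3, b2, b1, b0)."""
    return sum(b * w for b, w in zip(bits, weights))


def word_value(digits):
    """Value of a BCD-4221 word; digits are listed least significant first."""
    return sum(digit_value(d) * 10**i for i, d in enumerate(digits))


def recode_4221_to_5211(h):
    """One valid BCD-4221 -> BCD-5211 recoding (same value, new weights)."""
    h3, h2, h1, h0 = h
    xnor30 = 1 ^ h3 ^ h0                       # NOT (h3 XOR h0)
    w3 = (h3 & (h2 | h1 | h0)) | (h2 & h1 & h0)
    w2 = (h2 & h1 & xnor30) | ((h3 & (1 ^ h0)) ^ h2 ^ h1)
    w1 = (h2 & h1 & xnor30) | (h3 & (1 ^ h0) & (1 ^ h2 ^ h1))
    w0 = (h2 & h1) ^ h3 ^ h0
    return (w3, w2, w1, w0)


def times2(h_digits):
    """2*H: recode each digit to BCD-5211, then shift the word left 1 bit."""
    out, carry_in = [], 0
    for w3, w2, w1, w0 in map(recode_4221_to_5211, h_digits):
        out.append((w2, w1, w0, carry_in))  # read again with 4221 weights
        carry_in = w3                        # w3 moves into the next digit
    out.append((0, 0, 0, carry_in))
    return out


def decimal_csa(a, b, c):
    """Decimal 3:2 carry-save step on three equal-length BCD-4221 words."""
    s = [tuple(x ^ y ^ z for x, y, z in zip(*t)) for t in zip(a, b, c)]
    h = [tuple((x & y) | ((x | y) & z) for x, y, z in zip(*t))
         for t in zip(a, b, c)]
    return s, times2(h)


if __name__ == "__main__":
    # Exhaustive single-digit check: S + 2H always equals A + B + C.
    codes = list(product((0, 1), repeat=4))
    assert all(
        word_value(s) + word_value(h2) == sum(map(digit_value, (a, b, c)))
        for a, b, c in product(codes, repeat=3)
        for s, h2 in [decimal_csa([a], [b], [c])]
    )
```

Because all 16 bit patterns are valid BCD–4221 digits, the sum and carry words never need a decimal correction; the only decimal-specific work is the recoding inside `times2`.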

To generate all the partial products in parallel, we obtain all the required multiples. We aim for a fast generation of a reduced number of partial products. This is achieved with the recoding of the multiplier. We have developed three different recodings for the multiplier with good trade–offs between fast generation of partial products and the number of partial products generated. A minimally redundant signed–digit (SD) radix–10 recoding (digits in [−5, 5]) produces only d + 1 partial products but requires a carry–propagate addition to generate complex multiples 3X and −3X. Minimally redundant signed–digit (SD) radix–4 and radix–5 recodings (with digits in [−2, 2]) produce 2d partial products (2 digits per radix–10 digit) but multiplicand multiples are produced in a few levels of combinational logic. Furthermore, another advantage of using BCD–4221 to represent partial product digits is that the 9's complement of each digit can be obtained by bit inverting each digit. This simplifies the generation of the negative multiplicand multiples. The proposed BCD–8421 to SD recoders and the generation and selection of multiples are detailed in Section 4.

For the final decimal carry–propagate addition we use a binary quaternary tree (Q-T) adder modified to perform decimal additions [17]. Decimal quaternary tree adders based on conditional speculative decimal addition present low latency (about 10% more than the fastest binary adders) and require less hardware than other alternatives.

4. Generation of partial products

4.1. Multiplier recoding

A. Signed–Digit Radix–10 Recoding.

This recoding transforms the digit set {0, ..., 9} into the signed–digit (SD) set {−5, ..., 5} to perform the selection of multiples in a similar way as modified Booth recoding. Fig. 1 shows a block diagram of the recoding and the multiplicand multiple selection units.

Figure 1. Partial product generation for SD radix–10.

We denote by $Y_i^* = y^*_{i,3}\, 5 + \sum_{j=0}^{2} y^*_{i,j}\, 2^j$ the digits of the multiplier coded in BCD–5421 (see Table 1). The recoded SD radix–10 multiplier can be expressed in terms of $Y_i^*$ as

$$Y = \sum_{i=0}^{d-1} \Big( y^*_{i,3}\, 10 - y^*_{i,3}\, 5 + \sum_{j=0}^{2} y^*_{i,j}\, 2^j \Big) 10^i = y^*_{d-1,3}\, 10^d + \sum_{i=0}^{d-1} Yb_i\, 10^i$$

where the value of each SD radix–10 digit $Yb_i \in [-5, 5]$ is given by

$$Yb_i = -y^*_{i,3}\, 5 + \sum_{j=0}^{2} y^*_{i,j}\, 2^j + y^*_{i-1,3}$$

with $y^*_{-1,3} = 0$. Control signals (in "hot–one" code) can be obtained directly from input BCD–8421 multiplier digits using the following logical expressions:

= y

i,3

∨ y

i,2

· (y

i,1

∨ y

i,0

)

= y

i,2

· y

i,1

· (y

i,0

⊕ ys

i−1

)

= ys

i−1

· y

i,0

· (y

i,2

⊕ y

i,1

)

= y

i,1

· (y

i,0

⊕ ys

i−1

)

= y

i,0

∨ ys

i−1

· (y

i,3

∨ y

i,2

· y

i,1

)

= y

i,2

∨ y

i,1

· (y

i,0

⊕ ys

i−1

)

Since multiplicand multiples are recoded to BCD–4221, negative multiples can be generated by the XOR of $ys_i$ with the corresponding positive multiple, as shown in the multiplicand multiple selector of Fig. 1.
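The SD radix–10 recoding can be illustrated with a short Python sketch. It is a behavioural model under stated assumptions rather than the recoder of Fig. 1: the function names are hypothetical, digits are plain integers listed least significant first, and the sign signal is taken to be `y3 | (y2 & (y1 | y0))`, which is 1 exactly for digits 5 to 9 and therefore plays the role of $y^*_{i,3}$.

```python
def sd_radix10_recode(y_digits):
    """Recode decimal digits (least significant first) to SD radix-10.

    Returns (yb, top): every yb[i] is in [-5, 5] and top is the carry out
    of the most significant digit, so Y == top*10**d + sum(yb[i]*10**i).
    """
    yb, ys_prev = [], 0
    for y in y_digits:
        y3, y2, y1, y0 = (y >> 3) & 1, (y >> 2) & 1, (y >> 1) & 1, y & 1
        ys = y3 | (y2 & (y1 | y0))        # 1 exactly when y >= 5
        yb.append(y - 10 * ys + ys_prev)  # Yb_i = Y_i - 10*ys_i + ys_{i-1}
        ys_prev = ys
    return yb, ys_prev


def multiply(x, y, d):
    """X*Y as the sum of d+1 partial products picked from 0, ±X, ..., ±5X."""
    y_digits = [(y // 10**i) % 10 for i in range(d)]
    yb, top = sd_radix10_recode(y_digits)
    multiples = [m * x for m in range(6)]  # 0, X, 2X, 3X, 4X, 5X
    partial = [(-1 if v < 0 else 1) * multiples[abs(v)] * 10**i
               for i, v in enumerate(yb)]
    partial.append(multiples[top] * 10**d)  # the extra (d+1)-th product
    return sum(partial)
```

For example, the multiplier 9475 (digits `[5, 7, 4, 9]`, least significant first) recodes to `([-5, -2, 5, -1], 1)`, that is 10000 − 1000 + 500 − 20 − 5.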

B. Signed–Digit Radix–4 Recoding.

Two SD radix–4 digits, $Y_i^U \in \{0, 1, 2\}$ (upper) and $Y_i^L \in \{-2, -1, 0, 1, 2\}$ (lower), are generated for each BCD–8421 digit ($Y_i = Y_i^U \cdot 4 + Y_i^L$). We obtain the SD radix–4 selection signals directly from the BCD–8421 digits as

)







= y

i,3

= y

i,3

· y

i,2

· y

i,1

= y

i,3

· y

i,2

⊕ y

i,1

)







= y

i,3

∨ y

i,1

= ys

· y

i,0

· y

i−1,3

∨ ys

· y

i,0

· y

i−1,3

= y

i,0

⊕ y

i−1,3

The block diagram of a 4–bit combined binary/decimal recoder and the corresponding multiplicand multiple selector are shown in Fig. 2, where control signal d is true for decimal multiplication. The combined SD radix–4 recoder implements the decimal selection signals and the conventional Booth radix–4 selection signals. Upper signals select multiples ±8X and ±4X while lower signals select multiples {−2X, −X, X, 2X}. Although the resulting combined SD radix–4 recoders and multiple selectors are simple, obtaining decimal multiples 4X and 8X requires double and triple latency with respect to obtaining the decimal 2X multiple.

Figure 2. Partial product generation for SD radix–4.

C. Signed–Digit Radix–5 Recoding.

This recoding uses a different set of multiplicand multiples (5X, 10X instead of 4X, 8X) for decimal partial product generation that have a similar latency to 2X and X. Each BCD–8421 digit of the multiplier is encoded into two radix–5 digits ($Y_i = Y_i^U \cdot 5 + Y_i^L$) with $Y_i^U \in \{0, 1, 2\}$ and $Y_i^L \in \{-2, -1, 0, 1, 2\}$; the upper value 2 selects the 10X multiple, which digits 8 = 10 − 2 and 9 = 10 − 1 require.

SD radix–5 selection signals are obtained from the BCD–8421 input digits using:

)







= y

i,3

= y

i,2

∨ y

i,1

· y

i,0

)







= y

i,3

∨ y

i,2

· y

i,1

· y

i,0

∨ y

i,2

· y

i,1

· y

i,0

= y

i,0

· (y

i,3

∨ y

i,1

) ∨ y

i,2

· y

i,1

= y

i,2

· y

i,0

∨ y

i,2

· y

i,1

· y

i,0

The block diagram of the digit recoder and multiples selector is shown in Fig. 3(a). A combined binary radix–4/decimal radix–5 block diagram for the partial product generation is proposed in Fig. 3(b). Multiplexers controlled by d select the operands required by binary or decimal multiplications. Although BCD to SD radix–4 encoding is slightly simpler than radix–5, partial product generation for decimal SD radix–5 is faster and comparable in latency with binary SD radix–4, due to a faster generation of multiplicand multiples as we show in the following subsection.
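A small Python sketch of the SD radix–5 digit split may help here. It is a behavioural model only, with hypothetical function names, and it assumes that the upper digit may take the value 2 so that digits 8 and 9 select the 10X multiple; the gate-level selection signals of the recoder are not modelled.

```python
def sd_radix5_recode(y):
    """Split a decimal digit y (0..9) into (upper, lower), y == 5*upper + lower."""
    upper = (y + 2) // 5      # 0 for 0..2, 1 for 3..7, 2 for 8..9
    lower = y - 5 * upper     # always in {-2, -1, 0, 1, 2}
    return upper, lower


def partial_products_radix5(x, y, d):
    """The 2d partial products of X*Y for a d-digit multiplier Y."""
    products = []
    for i in range(d):
        upper, lower = sd_radix5_recode((y // 10**i) % 10)
        products.append(5 * upper * x * 10**i)  # selects 0, 5X or 10X
        products.append(lower * x * 10**i)      # selects -2X, -X, 0, X or 2X
    return products
```

Only the multiples X, 2X, 5X and 10X (plus negated X and 2X) are ever selected, which is why this scheme avoids the slower 4X and 8X chains of the radix–4 variant at the cost of 2d partial products.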

4.2. Generation of multiplicand multiples

Decimal multiplicand multiples 2X and 5X are obtained in a few levels of logic using recoding and wired left shifts. Any other multiple is generated using these multiples or from multiplicand X. The generation sequence of 2X is as follows. Each BCD–8421 digit is first recoded to BCD–5211 using

$$w_{i,3} = h_{i,3} \vee h_{i,2} \cdot (h_{i,1} \vee h_{i,0})$$

$$w_{i,2} = h_{i,3} \vee (h_{i,1} \oplus (h_{i,2} \cdot \overline{h_{i,0}}))$$

$$w_{i,1} = h_{i,3} \cdot h_{i,0} \vee h_{i,2} \cdot \overline{(h_{i,1} \vee h_{i,0})}$$

$$w_{i,0} = h_{i,3} \vee (h_{i,2} \oplus h_{i,0})$$

where $h_{i,j}$ are the bits of the input BCD–8421 digit and $w_{i,j}$ the bits of the recoded BCD–5211 digit. Then a wired 1–bit left shift is performed over the recoded multiplicand, obtaining the 2X multiple in BCD–4221.

Figure 3. Partial product generation for SD radix–5. (a) Decimal SD radix–5 recoding. (b) Combined binary/decimal to SD radix–4/radix–5 recoding.
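The 2X generation sequence can be checked with the following Python sketch. It is an illustrative model with hypothetical names; the complemented inputs in the recoder are an assumption, chosen so that each of the ten BCD–8421 digits maps to a BCD–5211 code of the same value.

```python
def bits8421(v):
    """(x3, x2, x1, x0) of a decimal digit v in BCD-8421."""
    return ((v >> 3) & 1, (v >> 2) & 1, (v >> 1) & 1, v & 1)


def recode_8421_to_5211(x):
    """BCD-8421 -> BCD-5211 recoding of one digit (valid inputs 0..9 only)."""
    x3, x2, x1, x0 = x
    w3 = x3 | (x2 & (x1 | x0))
    w2 = x3 | (x1 ^ (x2 & (1 ^ x0)))
    w1 = (x3 & x0) | (x2 & (1 ^ (x1 | x0)))
    w0 = x3 | (x2 ^ x0)
    return (w3, w2, w1, w0)


def double(x, d):
    """2X for a d-digit X, as d+1 BCD-4221 digits (least significant first)."""
    out, carry_in = [], 0
    for i in range(d):
        w3, w2, w1, w0 = recode_8421_to_5211(bits8421((x // 10**i) % 10))
        out.append((w2, w1, w0, carry_in))  # 1-bit left shift, 4221 weights
        carry_in = w3                        # w3 becomes bit 0 of digit i+1
    out.append((0, 0, 0, carry_in))
    return out


def value4221(digits):
    """Integer value of a BCD-4221 word (least significant digit first)."""
    return sum((4 * b3 + 2 * b2 + 2 * b1 + b0) * 10**i
               for i, (b3, b2, b1, b0) in enumerate(digits))
```

The shift works because doubling the BCD–5211 weights (5, 2, 1, 1) gives (10, 4, 2, 2): the top bit becomes the unit bit of the next digit and the remaining three bits land on the 4, 2, 2 positions of a BCD–4221 digit.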

The 5X multiple is obtained by a simple 3–bit left shift of the multiplicand, but with resultant digits coded in BCD–5421. Thus a digit recoding from BCD–5421 to BCD–4221 is performed using expressions

$$w_{i,3} = h_{i,3} \vee h_{i,2}$$

$$w_{i,2} = h_{i,3} \cdot (h_{i,2} \vee (h_{i,1} \cdot h_{i,0}))$$

$$w_{i,1} = h_{i,1} \vee h_{i,3} \cdot (h_{i,2} \vee h_{i,0})$$

$$w_{i,0} = h_{i,3} \oplus h_{i,0}$$
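The 5X generation can be sketched in Python in the same way. The names are hypothetical and the recoder below is one assumed gate-level choice that maps each of the ten BCD–5421 codes of Table 1 to a BCD–4221 code of equal value. After the 3–bit shift, bit 0 of each digit stays in place with weight 5, while bits 3, 2 and 1 of the previous digit arrive with weights 4, 2 and 1.

```python
def bits8421(v):
    """(x3, x2, x1, x0) of a decimal digit v in BCD-8421."""
    return ((v >> 3) & 1, (v >> 2) & 1, (v >> 1) & 1, v & 1)


def five_x_5421(x, d):
    """5X as d+1 BCD-5421 digits (least significant first), by a 3-bit shift."""
    out, prev = [], (0, 0, 0, 0)
    for i in range(d + 1):
        cur = bits8421((x // 10**i) % 10)  # digit d is 0 for a d-digit X
        out.append((cur[3], prev[0], prev[1], prev[2]))
        prev = cur
    return out


def recode_5421_to_4221(h):
    """BCD-5421 -> BCD-4221 recoding of one digit."""
    h3, h2, h1, h0 = h
    w3 = h3 | h2
    w2 = h3 & (h2 | (h1 & h0))
    w1 = h1 | (h3 & (h2 | h0))
    w0 = h3 ^ h0
    return (w3, w2, w1, w0)


def five_x_4221(x, d):
    """5X as d+1 BCD-4221 digits (least significant first)."""
    return [recode_5421_to_4221(h) for h in five_x_5421(x, d)]


def value(digits, weights):
    """Integer value of a word whose digits use the given bit weights."""
    return sum(sum(b * w for b, w in zip(dig, weights)) * 10**i
               for i, dig in enumerate(digits))
```

No carry propagates between digits because the three bits coming from the previous digit never exceed the value 4, so every shifted digit is already a valid BCD–5421 code.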

The generation of negative multiples is performed by evaluating the 10's complement of positive multiples as

$$-X = -10^d + \sum_{i=0}^{d-1} (9 - X_i) \cdot 10^i + 1$$

For BCD–8421 the 9's complement of each digit is obtained by a digit addition of +6 followed by a bit–complement operation, since $9 - X_i = \overline{X_i + 6}$. For BCD–4221, it is obtained simply by bit–complementing the positive multiple, since $9 - X_i = \overline{X_i}$. Addition of the 10's complement +1 is performed in the partial product reduction tree by a tail encoding bit, since each partial product is 4–bit (or at least 1–bit) left shifted from the previous one. To avoid sign extension and thus to reduce the complexity, the partial product signs are encoded in each leading digit position as

$$-\sum_{i=0}^{d-1} sg_i\, 10^{i+d} = -10^{2d} + \sum_{i=0}^{d-1} (9 - sg_i)\, 10^{i+d} + 10^d = -10^{2d} + \sum_{i=1}^{d-1} (8 + \overline{sg_i})\, 10^{i+d} + (\overline{sg_0}\, 10 + sg_0\, 9)\, 10^d$$

Each partial product is at most d + 3 digits long, due to the three extra digit positions required for the encoded sign, the tail encoding bit and the left shifting.
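The complement and sign–encoding identities above can be checked numerically with a short Python sketch. The function names are hypothetical, and the sign–encoding expression is written in the form $-10^{2d} + \sum_{i \ge 1} (8 + \overline{sg_i}) 10^{i+d} + (\overline{sg_0} \cdot 10 + sg_0 \cdot 9) 10^d$, which is an assumed reading of the formula.

```python
def complement_4221(bits):
    """9's complement of a BCD-4221 digit: invert every bit."""
    return tuple(1 - b for b in bits)


def complement_8421(v):
    """9's complement of a BCD-8421 digit: add 6, then invert the 4 bits."""
    return ~(v + 6) & 0xF


def negate(x, d):
    """-X from the 10's complement: -10**d + sum((9 - X_i) * 10**i) + 1."""
    nines = sum((9 - (x // 10**i) % 10) * 10**i for i in range(d))
    return -10**d + nines + 1


def encoded_signs(sg):
    """Sign-extension-free encoding of the partial product signs sg[0..d-1]."""
    d = len(sg)
    head = ((1 - sg[0]) * 10 + sg[0] * 9) * 10**d
    tail = sum((8 + (1 - sg[i])) * 10**(i + d) for i in range(1, d))
    return -10**(2 * d) + tail + head
```

Bit inversion works for BCD–4221 because its weights add up to 4 + 2 + 2 + 1 = 9, so inverting all four bits of a digit of value v always yields a digit of value 9 − v.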

Fig. 4(a) shows the block diagram for the generation of multiplicand multiples for SD radix–10 encoding. Multiple 4X is obtained as 2 × 2X. Multiple 3X is evaluated by a carry–propagate addition of multiples X and 2X in a decimal quaternary tree [17]. The latency of the partial product generation is constrained by the generation of 3X. The SD radix–10 multiple selector of Fig. 1 uses the XOR operation to select positive or negative multiples as a function of the SD radix–10 control signal $ys_i$.

Fig. 4(b) shows the generation of multiples for the case of decimal SD radix–4 recoding. Multiple 8X is obtained as 2 × 2 × 2X, so the latency of multiplicand multiples generation is about three times the latency of the 2X operation. On the other hand, generation of radix–5 multiples is faster (approximately the latency of 2X), as shown in Fig. 4(c).

5. Reduction of partial products

To implement the algorithm for carry–save addition formulated in Section 3 we propose a decimal 3:2 CSA that reduces 3 BCD–4221 digits to a carry and a sum BCD–4221 digit. This module consists of a 4–bit binary 3:2 CSA plus a BCD–4221 to BCD–5211 digit recoder. From this module we construct p:2 (p ≥ 3) decimal CSAs, optimizing the critical path delay using fast inputs and outputs.

5.1. Decimal 3:2 carry-save adder

The block diagram of the proposed 4–bit 3:2 CSA is shown in Fig. 5(a). The block labeled ×2 performs the multiplication of the carry digit by 2. For decimal multiplication the ×2 module is detailed in Fig. 5(b). It consists of a BCD–4221 to BCD–5211 digit recoder and a 1–bit wired left shift. A combined binary/decimal 3:2 CSA is shown in Fig. 5(c). A 4–bit 2:1 multiplexer controlled by d selects



References

IEEE Standard for Floating-Point Arithmetic

Decimal floating-point: algorism for computers

A 4.4 ns CMOS 54/spl times/54-b multiplier using pass-transistor multiplexer

Decimal multiplication via carry-save addition

Decimal multiplication with efficient partial product generation
