What are the future works mentioned in the paper "Pipelined fpga adders" ?

Future work also includes extending the optimization options to include operator latency, and possibly combinations such as “ LUTs and latency ”.

What is the effect of this optimization on the number of registers?

In addition to latency reduction, this optimization brings the following gains: the number of registers is reduced by the carry propagation size (which now needs no registering), the LUT count is reduced by approximately w, and the number of slices by approximately w/2.

What is the worst case relative error of the estimation formulae?

The worst case relative error is of the order of $10^{-2}$ (one percent), which makes them sufficiently accurate for estimation formulae.

How many slices are in a VirtexII-Pro device?

All the slices in a VirtexII-Pro device were similar to sliceM, but they were reduced to half the total number of slices for Virtex4 and Spartan3, and about a quarter in Virtex5 and Virtex6 devices (with higher density at the input of the DSP48E blocks).

What is the option for a low-latency operator?

For both alternative and low-latency architectures, there are two options: either perform all additions using chunk size γ, or buffer the inputs and perform computations using chunk size α.

How is the proposed adder generation implemented?

Work is under way to integrate the proposed adders in all the coarser cores of the FloPoCo project, and to support more FPGA targets.

(Open Access) Pipelined FPGA Adders (2010) | Florent de Dinechin

Q: What is the common solution for a tight frequency-driven pipelining?

A tight frequency-driven pipelining is obtained by first determining the maximal addition size α in equation 1 for which the critical path delay is less than the target period T: $\alpha = 1 + \left\lfloor \frac{T - \delta_{LUT} - \delta_{xor}}{\delta_{carry}} \right\rfloor$.

Q: What is the novel feature of the proposed low-latency addition architecture?

In this section the authors propose a scalable low-latency addition architecture based on the textbook carry-select architecture, whose novel feature is to make efficient use of the fast-carry chains for the carry-bit computations.

HAL Id: ensl-00475780

https://hal-ens-lyon.archives-ouvertes.fr/ensl-00475780v2

Submitted on 1 Nov 2010


Pipelined FPGA Adders

Florent de Dinechin, Hong Diep Nguyen, Bogdan Pasca

To cite this version:

Florent de Dinechin, Hong Diep Nguyen, Bogdan Pasca. Pipelined FPGA Adders. International Conference on Field Programmable Logic and Applications, Aug 2010, Milano, Italy. pp. 422-427, ⟨10.1109/FPL.2010.87⟩. ⟨ensl-00475780v2⟩

LIP Research Report RR2010-16

LIP, projet Arénaire

ENS de Lyon

46 allée d’Italie, 69364 Lyon Cedex 07, France

Email: {Florent.de.Dinechin,Hong.Diep.Nguyen,Bogdan.Pasca}@ens-lyon.fr

Abstract: Integer addition is a universal building block, and applications such as quad-precision floating-point or elliptic curve cryptography now demand precisions well beyond 64 bits. This study explores the trade-offs between size, latency and frequency for pipelined large-precision adders on FPGA. It compares three pipelined adder architectures: the classical pipelined ripple-carry adder, a variation that reduces register count, and an FPGA-specific implementation of the carry-select adder capable of providing lower latency additions at a comparable price. For each of these architectures, resource estimation models are defined, and used in an adder generator that selects the best architecture considering the target FPGA, the target operating frequency, and the addition bit width.

Keywords: addition; pipeline; low-latency; FPGA

I. INTRODUCTION

Integer addition is used as a building block in many coarser operators. Examples which require large adders include integer multipliers, most floating-point operators, and modular adders used in some cryptographic applications. In floating-point, the demand in precision is now moving from double (64-bit) to the recently standardized quadruple precision (128-bit format, including 112 bits for the significand) [1]. In elliptic-curve cryptography, the size of modular additions is currently above 150 bits for acceptable security.

This study presents an operator generator for binary integer addition that is based on resource estimation models of possible implementations. Given a specification including a target frequency, the generator queries the implementation models in order to select the one matching this frequency at minimal cost. Once found, the VHDL code of the selected implementation is generated.

Adders differ in the way they propagate carries. Modern FPGAs include special hardware dedicated to carry propagation [2], [3], [4], [5], [6]. Sending a carry to a neighbouring cell through the dedicated carry line is much faster than sending a bit to the same cell through the general reconfigurable routing fabric. Therefore, proven solutions for VLSI designs [7] bring little speed improvement on FPGAs over the ripple-carry adder (RCA) except for addition sizes exceeding 64 bits [8]. These speed improvements are small, and they come at a cost penalty exceeding a factor of 2 over the RCA. Therefore, a binary addition is expressed in VHDL as a + and is implemented by default as an RCA.

This article re-evaluates this situation when a pipelined adder is needed. Pipelining is used for cutting the critical path in order to increase operator frequency. To the best of our knowledge, there is no IP core generator or VHDL/Verilog library that provides high-performance pipelined binary adders for FPGAs. This work introduces the adder generator used in the FloPoCo project as a building block of most other operators.

The main contributions of this work are:

• an alternative pipelining of the ripple-carry adder;

• a novel short-latency pipelined adder;

• resource estimation models including slice, register and LUT count for three adder architectures;

• integration of these models into an addition operator generator that takes as input a list of user specifications, and returns the VHDL code of the best operator.

A. Related Work

The simplest pipelining of binary addition [9], [10], [7] consists in buffering the carry-out of each full-adder (FA) along the carry propagation path, and inserting synchronization registers for I/O. This technique is wasteful when the objective period is larger than the delay of a 1-bit carry propagation. For these cases, a better version [11], [7], [12] consists in registering carries only every α FA cells. This technique will be detailed in Section II-A, and is referred to as the classical RCA pipelining technique.

Faster techniques than the previous classical architecture have been developed for VLSI. A first idea is to speed up the logic on the carry propagation path [13], [10]. Other, more algorithmic approaches include carry-select, carry-skip, and the family of prefix adders [7]. These designs map poorly to FPGAs; however, they have served as an initial source of inspiration for the proposed pipelining techniques from Section II-C.

A complete study on unpipelined binary FPGA addition is presented in [8]. The authors present FPGA-specific optimization opportunities for carry-skip and carry-select adders and show that optimized versions of these adders can be faster than the RCA for large addition sizes. However, these faster versions come at a significant size penalty, which recommends them only for delay-critical applications. Moreover, pipelining is not covered. The present article extends this previous study to pipelined addition.

FloPoCo project page: http://www.ens-lyon.fr/LIP/Arenaire/Ware/FloPoCo/

B. FPGA addition in the FloPoCo context

FloPoCo is a generator of arithmetic cores (Floating-Point Cores, but not only) for FPGAs. FloPoCo also provides a framework for arithmetic operator development that is, to our knowledge, the easiest way to design complex operators with flexible pipelines [14]. The operators presented in this paper have been developed using the FloPoCo framework and are essential building blocks of most complex FloPoCo operators.

FloPoCo generates arithmetic operators in human-readable synthesizable VHDL starting from a list of user specifications (see Figure 1). These specifications include operator parameters (operand width for binary addition), deployment FPGA target, target frequency and others. One of the original features of FloPoCo is that operator generation is frequency-driven. Instead of generating the fastest possible operator, the FloPoCo philosophy is to provide the smallest operator meeting a frequency constraint. This approach has the advantage of being compositional: a larger operator working at frequency f may be assembled out of sub-components working at frequency f. This study formalizes frequency-driven addition pipelining.

C. Design-space exploration by resource estimation

Modern FPGA resources are heterogeneous, including LUT-based logic, embedded memories, embedded DSP blocks, and others. For addition, we only need to estimate logic and registers. This study gives resource estimation formulae for these resources for several Xilinx FPGAs. Altera targets are currently only partially supported. This doesn’t mean that FloPoCo operators do not work on Altera, just that they are not optimized accurately.

The formulae allow for a fast and exhaustive design-space exploration, where only the selected architecture will be generated and synthesized. For this method to be valid, we will check in Section III-A that these formulae effectively predict the performance and resource consumption of the operator after synthesis and technology mapping. Addition and register mapping is simple enough for these formulae to be accurate to about one percent in all cases.

D. FPGA targets

In the FloPoCo framework, each FPGA is abstracted to a list of essential attributes: LUT features, routing delays, DSP configurations, on-chip memory, etc.

Fig. 1. FloPoCo adder generator (diagram labels: width, input delays, deployment FPGA, target frequency, VHDL, output delays)

The Xilinx VirtexII-Pro [2], Spartan3 [3] and Virtex-4 [4] FPGAs have very similar slice structure (Figure 2): two 4-input LUTs with corresponding flip-flops and arithmetic logic for carry-bit computation and propagation. Carry-bit propagation is accomplished by means of dedicated carry-chains running vertically through the FPGA layout.

This is the default slice type and is denoted by sliceL. In addition, a secondary slice type featuring a superset of functionalities is available. The sliceM cell allows the LUT to be configured as a variable-length shift-register (SRL16). When this configuration is used, shift registers of up to 16 bits can be absorbed in one half-slice. This feature, when available, allows minimizing input/output synchronization cost.

The Virtex-5 and Virtex-6 slices [5] are similar with respect to addition. However, they allow independent use of the LUTs and registers, which means that estimation formulae have to count them separately.

II. PIPELINED ADDITION ON FPGA

Let X, Y be two integers on w bits (in the range $\{0, \ldots, 2^w - 1\}$) and $c_{in}$ a carry-in bit. The sum of X, Y and $c_{in}$ is noted $R = X + Y + c_{in}$. It is in $[0, 2^{w+1} - 1]$ and is representable on w + 1 bits. Note that all the following also applies to signed integers in 2’s complement notation.

The RCA delay is proportional to the addition size. It has three components. First, the LUT delay, $\delta_{LUT}$, used to precompute the carry multiplexer select signal. Then there is a worst-case delay of $(w - 1)\,\delta_{carry}$ for carry propagation. Finally, $\delta_{xor}$, the delay of the xor gate used to compute the MSB sum bit.

$$\delta_{RCA} = \delta_{LUT} + (w - 1)\,\delta_{carry} + \delta_{xor} \qquad (1)$$
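As a rough illustration of equation (1), the following Python sketch evaluates the RCA critical path and checks whether an unpipelined adder of width w fits a target period. The delay values are hypothetical placeholders expressed in picoseconds, not figures from any datasheet, and the function names are introduced here only for the example.

```python
# Hypothetical delays in picoseconds (illustrative placeholders only,
# not taken from any FPGA datasheet).
D_LUT, D_CARRY, D_XOR = 1000, 50, 500


def rca_delay_ps(w, d_lut=D_LUT, d_carry=D_CARRY, d_xor=D_XOR):
    """Critical path of a w-bit ripple-carry adder, following equation (1)."""
    return d_lut + (w - 1) * d_carry + d_xor


def meets_period(w, period_ps):
    """True if the unpipelined w-bit RCA fits within the target period."""
    return rca_delay_ps(w) <= period_ps


if __name__ == "__main__":
    period_ps = 5000  # a 200 MHz target
    for w in (64, 128):
        print(w, rca_delay_ps(w), meets_period(w, period_ps))
```

With these assumed delays, a 64-bit addition fits in a 5000 ps period while a 128-bit addition does not, which is the situation where pipelining becomes necessary.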

As w increases, the addition frequency decreases, as illustrated in Figure 3 for three FPGAs.

In the context of frequency-driven pipelining, a pair (w, f) which is under the corresponding curve in Figure 3 meets the frequency constraint. There are two solutions for additions not meeting this constraint. We can choose a different addition architecture that is able to reach the frequency without too much of a cost penalty [8]. This solution is unable to cover the entire (w, f) space. Another solution is to pipeline the adder design such that the critical path of the circuit is less than the target period T = 1/f. This study focuses on the second solution, because it is more scalable and often consumes fewer resources.

Fig. 2. sliceM (VirtexII-Pro, Spartan3 and Virtex-4); diagram labels: LUT4, RAM16, SRL16

A. Classical RCA Pipelining

A tight frequency-driven pipelining is obtained by first determining the maximal addition size α in equation 1 for which the critical path delay is less than the target period T:

$$\alpha = 1 + \left\lfloor \frac{T - \delta_{LUT} - \delta_{xor}}{\delta_{carry}} \right\rfloor$$

Next, the addition is split into k chunks of α bits (except the last chunk, whose size is denoted by β, with β ≤ α) such that w = (k − 1)α + β.
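The two steps above (choosing α from the target period, then splitting w into k chunks) can be sketched in a few lines of Python. This is only an illustration: the function names are ours, delays are assumed to be integers in picoseconds, and the numeric values used in the usage note are hypothetical.

```python
def chunk_size(period_ps, d_lut, d_carry, d_xor):
    """Largest alpha whose RCA critical path (equation 1) fits in the period."""
    if period_ps < d_lut + d_xor:
        raise ValueError("period too short for even a 1-bit addition")
    return 1 + (period_ps - d_lut - d_xor) // d_carry


def split_width(w, alpha):
    """Return (k, beta) with w = (k - 1) * alpha + beta and 1 <= beta <= alpha."""
    k = -(-w // alpha)           # ceiling division
    beta = w - (k - 1) * alpha   # width of the last (possibly shorter) chunk
    return k, beta


if __name__ == "__main__":
    # Hypothetical delays: 1000 ps LUT, 50 ps per carry bit, 500 ps xor.
    alpha = chunk_size(5000, 1000, 50, 500)
    print(alpha, split_width(256, alpha))
```

Under these assumed delays, a 5000 ps period gives α = 71, and a 256-bit addition is split into k = 4 chunks with β = 43.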

An instantiation of this architecture highlighting the previously discussed parameters is presented in Figure 4 for k = 4. As k decreases, the number of registers used for synchronization decreases. When the critical path of the w-bit addition is ≤ T, no pipelining is required (k = 1) and the addition may be expressed as a simple + in VHDL.

The column labelled Classical in Table II presents the resource estimation formulae as functions of α, β, w, k, respectively with and without allowing shift-register packing in LUTs (SRL). Let us now explain how such formulae were built.

B. Resource estimation techniques

Let us take as a running example the previous classical architecture, annotated in Figure 5.

The LUTs of the Xilinx FPGAs can be used either as a function generator or as a variable-length shift-register, as previously presented in Section I-D.

For the classical architecture, the addition diagonal uses w LUTs configured as function generators (Figure 5, σ). The LUT SRL configuration is used wherever two or more flip-flops are cascaded to form a shift register. This is the case of the (k − 2)α SRLs under the addition diagonal (Figure 5, ξ), together with the 2β SRLs corresponding to the last column of width β (Figure 5, µ) and of the 2(k − 3)α SRLs above the diagonal (Figure 5, θ). These sum up to w + (3k − 8)α + 2β = (4k − 9)α + 3β, which is the value reported in Table II.

There is one consideration to be made before counting registers: each time an SRL is used, the corresponding slice flip-flop is also used. In other words, for a p-level shift-register, p − 1 levels are pushed into the SRL and one into the flip-flop. Hence, we count (3k − 8)α + 2β registers for the same number of SRLs, and, in addition, α registers (Figure 5, φ) under and 2α registers (Figure 5, ρ) above the diagonal, plus the k − 1 registers for the carry-bit propagation. These total (3k − 5)α + 2β + k − 1, the value reported in Table II.

Fig. 3. Ripple-Carry Addition Frequency for VirtexIV, Virtex5 and Spartan3E (frequency in MHz versus width in bits)

The next task is to count slices. We choose to count half-slices and divide this number by 2, rounding upwards. This corresponds to a dense placement of the pipelined adder, which the tools are expected to favor. Experimental results given in Section III-A will validate this assumption.

The number of half-slices used by the classical implementation is: w for the diagonal addition, (3k − 8)α + 2β for the SRLs and corresponding flip-flops, and 3α + k − 1 for the independent registers. However, we subtract α as the left-most addition of α bits includes the registers in the same slice as the LUT. The number totals (4k − 7)α + 3β + k − 1, which is reported in Table II.
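The counting argument of this section can be replayed numerically. The sketch below, a hedged illustration rather than the generator's actual code, sums the individual contributions named in the text for the classical architecture with SRLs allowed, and compares them with the closed forms quoted above. The function name and the returned dictionary layout are assumptions made for the example; the formulae are used only for k ≥ 3, the range for which the text states they were deduced.

```python
def classical_resources(w, alpha):
    """Estimated LUTs, registers and slices of the classical pipelined RCA
    (SRL packing allowed), summed term by term as in the text."""
    k = -(-w // alpha)              # number of chunks (ceiling division)
    beta = w - (k - 1) * alpha      # width of the last chunk
    if k < 3:
        raise ValueError("formulae are only used here for k >= 3")

    srls = (k - 2) * alpha + 2 * beta + 2 * (k - 3) * alpha
    luts = w + srls                                   # adders + SRLs
    registers = srls + alpha + 2 * alpha + (k - 1)    # SRL flip-flops + others
    half_slices = w + srls + (3 * alpha + k - 1) - alpha
    slices = (half_slices + 1) // 2                   # divide by 2, round up
    return {"k": k, "beta": beta, "luts": luts,
            "registers": registers, "slices": slices}


if __name__ == "__main__":
    print(classical_resources(256, 71))
```

For w = 256 and α = 71 (so k = 4, β = 43), the term-by-term sums agree with the closed forms (4k − 9)α + 3β, (3k − 5)α + 2β + k − 1 and (4k − 7)α + 3β + k − 1.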

All the formulae presented in this paper were deduced using these techniques. Relative errors of these estimation formulae are given in Table III. The worst case relative error is of the order of $10^{-2}$ (one percent), which makes them sufficiently accurate for estimation formulae.

C. Alternative RCA Pipelining

The classical pipelining technique requires a significant number of registers for input synchronization. This number may be lowered by performing the chunk additions at the first pipeline level and then propagating these sums instead. When no SRLs are allowed, the number of registers propagated above the diagonal will be approximately halved, and may still be packed in shift registers. An instantiation of this architecture for k = 4 is presented in Figure 6.

Each adder on the addition diagonal takes as input an operand on α + 1 bits and a 1-bit carry-in and returns an (α + 1)-bit wide result. This addition does not overflow, as the (α + 1)-bit input was the result of an addition of two α-bit numbers with a carry-in of 0.
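A purely functional model (ignoring pipeline registers and timing) may help to see why this scheme computes the right sum. The Python sketch below is an illustration under our own naming, not the generator's code: it first forms all chunk sums with a carry-in of 0, then ripples a 1-bit carry through the (α + 1)-bit sums, checking along the way that no diagonal addition overflows α + 1 bits.

```python
def alternative_add(x, y, w, alpha, cin=0):
    """Functional model of the alternative pipelining: chunk sums first,
    then carry propagation through the (alpha + 1)-bit sums."""
    mask = (1 << alpha) - 1
    k = -(-w // alpha)  # number of chunks (ceiling division)

    # First level: add each pair of alpha-bit chunks with a carry-in of 0.
    sums = [((x >> (i * alpha)) & mask) + ((y >> (i * alpha)) & mask)
            for i in range(k)]

    # Diagonal: add the incoming carry to each (alpha + 1)-bit chunk sum.
    result, carry = 0, cin
    for i, s in enumerate(sums):
        t = s + carry
        assert t < (1 << (alpha + 1))   # never overflows alpha + 1 bits
        result |= (t & mask) << (i * alpha)
        carry = t >> alpha
    return result | (carry << (k * alpha))


if __name__ == "__main__":
    x, y = (1 << 256) - 1, 1
    print(alternative_add(x, y, 256, 71) == x + y)
```

The model treats the last chunk as α bits wide with zero padding, which gives the same result as a dedicated β-bit chunk.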

The resource estimation formulae for this architecture are presented in Table II.

D. Short-Latency Addition Architecture

Given a target frequency f, the pipeline depth of the previously presented architectures increases linearly with addition size. In this section we propose a scalable low-latency addition architecture based on the textbook carry-select architecture, whose novel feature is to make efficient use of the fast-carry chains for the carry-bit computations.

The algorithm first determines the chunk size α as per Section II-A. Next, two sums are computed for each pair of chunks: $X_i + Y_i$ and $X_i + Y_i + 1$. The final result R is a combination of the corresponding sub-sums and is found in a space of $2^k$ combinations. Selecting the appropriate sub-sum is done by using a carry-in bit. The novel idea in this algorithm is the use of the dedicated fast-carry chains to compute the carry-bits for the result selection.

Fig. 4. Classical addition architecture [7]

Fig. 5. Annotated classical architecture

Fig. 6. Proposed FPGA architecture

Actually, for each chunk, a pair (sum, carry-out) is computed for both possible values of the carry-in. We use the following notations to denote the concatenation of the sub-sums and their corresponding carry-out bits:

$$c_i^0\,S_i^0 = X_i + Y_i$$

$$c_i^1\,S_i^1 = X_i + Y_i + 1$$

We denote by $R_i$ the $i$-th sub-result such that $R = R_{k-1} \ldots R_0$. The value of $R_i$ can be expressed in the following way knowing $S_i^0$, $S_i^1$ and $c_{i-1}$:

if ($c_{i-1} = 0$) then $R_i \leftarrow S_i^0$ else $R_i \leftarrow S_i^1$
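The selection rule above can be modelled functionally in Python. This is a minimal sketch under our own naming, with no notion of pipelining or of the fast-carry chain: for each chunk it precomputes both candidate sums, then selects one of them from the incoming carry, and takes the carry-out of the selected sum as the next carry-in.

```python
def carry_select_add(x, y, w, alpha, cin=0):
    """Functional model of chunk-wise carry-select addition."""
    mask = (1 << alpha) - 1
    k = -(-w // alpha)  # number of chunks (ceiling division)

    result, carry = 0, cin
    for i in range(k):
        xi = (x >> (i * alpha)) & mask
        yi = (y >> (i * alpha)) & mask
        s0 = xi + yi        # carry-out and sum assuming carry-in = 0
        s1 = xi + yi + 1    # carry-out and sum assuming carry-in = 1
        chosen = s1 if carry else s0
        result |= (chosen & mask) << (i * alpha)
        carry = chosen >> alpha   # carry-out of the selected candidate
    return result | (carry << (k * alpha))


if __name__ == "__main__":
    x, y = (1 << 256) - 1, 1
    print(carry_select_add(x, y, 256, 71) == x + y)
```

In hardware, the point of the architecture is that the candidate sums are all computed in parallel; the loop here is sequential only because the model is a software check of the arithmetic.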

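The carry-out bit $c_i$ for a chunk is computed from its carry-in $c_{i-1}$ and the two precomputed carries $c_i^0$ and $c_i^1$. The circuit used to compute them is particularly designed to take advantage of the fast carry chains of the FPGA by expressing the carry-out computation under the form of an addition (Figure 7):

$$c_i\,\lnot c_i\,s_i = c_{i-1} + c_i^0 + c_i^1 + 2$$

Here the left-hand side is read as a 3-bit binary number whose bits are, from most to least significant, $c_i$, $\lnot c_i$ and $s_i$.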

One can verify the correctness of the carry generation by checking the truth table presented in Table I. Note that the greyed-out rows of the table will never be needed, as $c_i^0 = 1$ implies $c_i^1 = 1$ (it is not possible that $X_i + Y_i$ overflows and $X_i + Y_i + 1$ doesn’t). The value of $s_i$ is not used further but is necessary for correct inference and mapping of the addition on the fast-carry chains of the FPGA.
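The truth-table check can also be done mechanically. The sketch below assumes, consistently with the rows of Table I, that the carry-add cell outputs the 3-bit value of $c_{i-1} + c_i^0 + c_i^1 + 2$, and compares its top bit with the carry-select multiplexer it is meant to replace. The function name `cac` and the argument order are ours.

```python
from itertools import product


def cac(c_prev, c0, c1):
    """Carry-add cell modelled as an addition: returns (c_i, not_c_i, s_i),
    the three bits of c_prev + c0 + c1 + 2."""
    total = c_prev + c0 + c1 + 2
    return (total >> 2) & 1, (total >> 1) & 1, total & 1


def check_cac():
    """Compare the cell with the multiplexer 'c0 if c_prev == 0 else c1'."""
    for c_prev, c0, c1 in product((0, 1), repeat=3):
        if c0 == 1 and c1 == 0:
            continue  # impossible combination: c0 = 1 implies c1 = 1
        c_i, not_c_i, _ = cac(c_prev, c0, c1)
        expected = c1 if c_prev else c0
        if c_i != expected or not_c_i != 1 - expected:
            return False
    return True


if __name__ == "__main__":
    print(check_cac())
```

The two excluded input combinations are exactly the rows the text describes as never needed.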

It should be noted that a strong point of this approach is that this carry propagation is expressed as an addition, and therefore portable (no need for vendor-specific low-level LUT-filling primitives). For instance, porting it to Altera chips should simply involve choosing the appropriate values for the delay-related parameters influencing the chunk size.

The formulae presented in Table II are deduced for k ≥ 3. To use them we thus have to ensure w ≥ 2α + 1, possibly by reducing α with respect to the optimal α deduced from the target frequency.

The short-latency architecture depicted in Figure 8 has a constant latency of two cycles. In addition, for lower frequency operators, the second register levels can be discarded. However, choosing the correct splitting for the inputs is not trivial, as we have to ensure that the critical path length is smaller than the target period T. Considering that the first sums are registered, we have to find the correct sizes for splitting the inputs, such that the critical path length that includes the carry generation circuit and the final additions is less than T.

Fig. 7. Carry-Add-Cell (CAC) implementation and representation

TABLE I
CAC TRUTH TABLE. GREYED-OUT ROWS (MARKED *) ARE NOT NEEDED

c_{i-1}  c_i^0  c_i^1  |  c_i  ¬c_i  s_i
   0       0      0    |   0    1     0
   0       0      1    |   0    1     1
   0       1      0    |   0    1     1   *
   0       1      1    |   1    0     0
   1       0      0    |   0    1     1
   1       0      1    |   1    0     0
   1       1      0    |   1    0     0   *
   1       1      1    |   1    0     1

Intuitively, as the index of the chunks added is higher, the length of the corresponding carry bit propagation is longer and thus the length of the final addition has to be smaller. We use a greedy algorithm that, at index i, finds the maximum addition size such that the carry propagation for index i and the final addition for this index is smaller than T. However, it is

Fig. 8. Short-Latency Addition architecture


References

IEEE Standard for Floating-Point Arithmetic

Digital arithmetic

FPGA adders: performance evaluation and optimal design

Generating high-performance custom floating-point pipelines

Arithmétique des ordinateurs
