Pipelined FPGA Adders

TLDR
This study compares three pipelined adder architectures: the classical pipelined ripple-carry adder, a variation that reduces register count, and an FPGA-specific implementation of the carry-select adder capable of providing lower latency additions at a comparable price.
Abstract
Integer addition is a universal building block, and applications such as quad-precision floating-point or elliptic curve cryptography now demand precisions well beyond 64 bits. This study explores the trade-offs between size, latency and frequency for pipelined large-precision adders on FPGA. It compares three pipelined adder architectures: the classical pipelined ripple-carry adder, a variation that reduces register count, and an FPGA-specific implementation of the carry-select adder capable of providing lower latency additions at a comparable price. For each of these architectures, resource estimation models are defined, and used in an adder generator that selects the best architecture considering the target FPGA, the target operating frequency, and the addition bit width.


HAL Id: ensl-00475780
https://hal-ens-lyon.archives-ouvertes.fr/ensl-00475780v2
Submitted on 1 Nov 2010
Florent de Dinechin, Hong Diep Nguyen, Bogdan Pasca
To cite this version:
Florent de Dinechin, Hong Diep Nguyen, Bogdan Pasca. Pipelined FPGA Adders. International
Conference on Field Programmable Logic and Applications, Aug 2010, Milano, Italy. pp.422-427,
⟨10.1109/FPL.2010.87⟩. ⟨ensl-00475780v2⟩

LIP Research Report RR2010-16
LIP, projet Arénaire
ENS de Lyon
46 allée d’Italie, 69364 Lyon Cedex 07, France
Email: {Florent.de.Dinechin,Hong.Diep.Nguyen,Bogdan.Pasca}@ens-lyon.fr
Keywords: addition; pipeline; low-latency; FPGA
I. INTRODUCTION
Integer addition is used as a building block in many coarser
operators. Examples which require large adders include integer
multipliers, most floating-point operators, and modular adders
used in some cryptographic applications. In floating-point, the
demand in precision is now moving from double (64-bit) to
the recently standardized quadruple precision (128-bit format,
including 112 bits for the significand) [1]. In elliptic-curve
cryptography, the size of modular additions is currently above
150 bits for acceptable security.
This study presents an operator generator for binary integer
addition that is based on resource estimation models of possi-
ble implementations. Given a specification including a target
frequency, the generator queries the implementation models in
order to select the one matching this frequency at minimal cost.
Once found, the VHDL code of the selected implementation
is generated.
Adders differ in the way they propagate carries. Modern FP-
GAs include special hardware dedicated to carry propagation
[2], [3], [4], [5], [6]. Sending a carry to a neighbouring cell
through the dedicated carry line is much faster than sending a
bit to the same cell through the general reconfigurable routing
fabric. Therefore, proven solutions for VLSI designs [7] bring
little speed improvement on FPGAs over the ripple carry
adder (RCA) except for addition size exceeding 64 bits [8].
These speed improvements are small, and they come at a cost
penalty exceeding a factor 2 over the RCA. Therefore, a binary
addition is expressed in VHDL as a + and is implemented by
default as an RCA.
This article re-evaluates this situation when a pipelined
adder is needed. Pipelining is used for cutting the critical path
in order to increase operator frequency. To the best of our
knowledge, there is no IP core generator or VHDL/Verilog
library that provides high-performance pipelined binary adders
for FPGAs. This work introduces the adder generator used
in the FloPoCo project [footnote 1] as a building block of most
other operators.
The main contributions of this work are:
• an alternative pipelining of the ripple-carry adder;
• a novel short-latency pipelined adder;
• resource estimation models including slice, register and
  LUT count for three adder architectures;
• integration of these models into an addition operator
  generator that takes as input a list of user specifications,
  and returns the VHDL code of the best operator.
A. Related Work
The simplest pipelining of binary addition [9], [10], [7]
consists in buffering the carry-out of each full-adder (FA)
along the carry propagation path, and inserting synchronization
registers for I/O. This technique is wasteful when
the objective period is larger than the delay of a 1-bit carry
propagation. For these cases, a better version [11], [7], [12]
consists in registering carries only every α FA cells. This
technique will be detailed in section II-A, and is referred to
as the classical RCA pipelining technique.
Faster techniques than the previous classical architecture
have been developed for VLSI. A first idea is to speed up the
logic on the carry propagation path [13], [10]. Other, more
algorithmic approaches include carry-select, carry-skip, and
the family of prefix adders [7]. These designs map poorly
on FPGAs; however, they have served as an initial source of
inspiration for the proposed pipelining techniques of section
II-C.
A complete study on unpipelined binary FPGA addition is
presented in [8]. The authors present FPGA-specific optimiza-
tion opportunities for carry-skip and carry-select adders and
show that optimized versions of these adders can be faster than
the RCA for large addition sizes. However, these faster
versions come at a significant size penalty, which recommends
them only for delay-critical applications. Moreover, pipelining
is not covered. The present article extends this previous study
to pipelined addition.
Footnote 1: http://www.ens-lyon.fr/LIP/Arenaire/Ware/FloPoCo/

B. FPGA addition in the FloPoCo context
FloPoCo is a generator of arithmetic cores (Floating-Point
Cores, but not only) for FPGAs. FloPoCo also provides a
framework for arithmetic operator development that is, to our
knowledge, the easiest way to design complex operators with
flexible pipelines [14]. The operators presented in this paper
have been developed using the FloPoCo framework and are
essential building blocks of most complex FloPoCo operators.
FloPoCo generates arithmetic operators in human-readable
synthesizable VHDL starting from a list of user specifications
(see Figure 1). These specifications include: operator param-
eters (operand width for binary addition), deployment FPGA
target, target frequency and others. One of the original features
of FloPoCo is that operator generation is frequency-driven.
Instead of generating the fastest possible operator, the FloPoCo
philosophy is to provide the smallest operator meeting a
frequency constraint. This approach has the advantage of being
compositional: a larger operator working at frequency f may
be assembled out of sub-components working at frequency f.
This study formalizes frequency-driven addition pipelining.
C. Design-space exploration by resource estimation
Modern FPGA resources are heterogeneous, including LUT-
based logic, embedded memories, embedded DSP blocks, and
others. For addition, we only need to estimate logic and
registers. This study gives resource estimation formulae for
these resources for several Xilinx FPGAs. Altera targets are
currently only partially supported. This does not mean that
FloPoCo operators do not work on Altera devices, only that
they are not optimized as accurately.
The formulae allow for a fast and exhaustive design-
space exploration, where only the selected architecture will
be generated and synthesized. For this method to be valid,
we will check in III-A that these formulae effectively predict
the performance and resource consumption of the operator
after synthesis and technology mapping. Addition and register
mapping is simple enough for these formulae to be accurate
to about one percent in all cases.
D. FPGA targets
In the FloPoCo framework, each FPGA is abstracted to a
list of essential attributes: LUT features, routing delays, DSP
configurations, on-chip memory, etc.
Fig. 1. FloPoCo adder generator (diagram labels: width, input delays,
deployment FPGA, target frequency, VHDL, output delays)
The Xilinx VirtexII-Pro [2], Spartan3 [3] and Virtex-4 [4]
FPGAs have very similar slice structure (Figure 2): two 4-input
LUTs with corresponding flip-flops and arithmetic logic for
carry-bit computation and propagation. Carry-bit propagation
is accomplished by means of dedicated carry-chains running
vertically through the FPGA layout.
This is the default slice type and is denoted by sliceL.
In addition, a secondary slice type featuring a superset of
functionalities is available. The sliceM cell allows the LUT
to be configured as a variable-length shift-register (SRL16).
When this configuration is used, shift registers of up to 16 bits
can be absorbed in one half-slice. This feature, when available,
allows minimizing input/output synchronization cost.
The Virtex-5 and Virtex-6 slices [5] are similar with respect
to addition. However, they allow independent use of the LUTs
and registers, which means that estimation formulae have to
count them separately.
II. PIPELINED ADDITION ON FPGA
Let $X$, $Y$ be two integers on $w$ bits (in the range
$\{0, \dots, 2^w - 1\}$) and $c_{in}$ a carry-in bit. The sum of
$X$, $Y$ and $c_{in}$ is noted $R = X + Y + c_{in}$. It is in
$[0, 2^{w+1} - 1]$ and is representable on $w + 1$ bits. Note that
all the following also applies to signed integers in 2’s
complement notation.
The RCA delay is proportional to the addition size. It
has three components. First, the LUT delay, $\delta_{LUT}$, used to
precompute the carry multiplexer select signal. Then there is a
worst-case delay of $(w-1)\delta_{carry}$ for carry propagation.
Finally, $\delta_{xor}$, the delay of the xor gate used to compute
the MSB sum bit.
$$\delta_w = \delta_{LUT} + (w-1)\,\delta_{carry} + \delta_{xor} \tag{1}$$
As w increases the addition frequency decreases as illus-
trated in Figure 3 for three FPGAs.
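Equation (1) can be turned into a small delay model. The following Python sketch is only an illustration: the function names and the delay values (in nanoseconds) are hypothetical placeholders, not datasheet figures for any of the FPGAs discussed here.

```python
def rca_delay(w, d_lut, d_carry, d_xor):
    """Critical-path delay of a w-bit ripple-carry adder, following equation (1)."""
    return d_lut + (w - 1) * d_carry + d_xor


def rca_frequency_mhz(w, d_lut, d_carry, d_xor):
    """Maximum frequency in MHz, assuming all delays are given in nanoseconds."""
    return 1000.0 / rca_delay(w, d_lut, d_carry, d_xor)


# Hypothetical delays in nanoseconds (assumed values for illustration only).
D_LUT, D_CARRY, D_XOR = 0.5, 0.05, 0.4

# The frequency drops as the width grows, which is the trend shown in Figure 3.
for width in (8, 64, 128, 256):
    print(width, round(rca_frequency_mhz(width, D_LUT, D_CARRY, D_XOR), 1))
```

With real values for the three delays, a pair (w, f) meets the frequency constraint without pipelining when `rca_frequency_mhz(w, ...)` is at least f.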
In the context of frequency-driven pipelining, a pair (w, f)
which is under the corresponding curve in Figure 3 meets the
frequency constraint. There are two solutions for additions not
meeting this constraint. We can choose a different addition
architecture that is able to reach the frequency without too
much of a cost penalty [8]. This solution is unable to cover the
entire (w, f) space. Another solution is to pipeline the adder
design such that the critical path of the circuit is less than
the target period T = 1/f. This study focuses on the second
solution, because it is more scalable and often consumes fewer
resources.
Fig. 2. sliceM (VirtexII-Pro, Spartan3 and Virtex-4) (diagram labels: LUT4, FF, RAM16, SRL16)

A. Classical RCA Pipelining
A tight frequency-driven pipelining is obtained by first
determining the maximal addition size α in equation 1 for
which the critical path delay is less than the target period T :
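$$\alpha = 1 + \left\lfloor \frac{T - \delta_{LUT} - \delta_{xor}}{\delta_{carry}} \right\rfloor.$$
Next, the addition is split into $k$ chunks of $\alpha$ bits (except the
last chunk denoted by $\beta$, $\beta \le \alpha$) such that
$w = (k-1)\alpha + \beta$.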
An instantiation of this architecture highlighting the pre-
viously discussed parameters is presented in Figure 4 for
k = 4. As k decreases, the number of registers used for
synchronization decreases. When the critical path of the $w$-bit
addition is $\le T$, no pipelining is required ($k = 1$) and the
addition may be expressed as a simple + in VHDL.
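The chunking rule above can be sketched in a few lines of Python. This is a minimal illustration, not FloPoCo code: the function name is hypothetical, delays are assumed to be integers in picoseconds to avoid floating-point rounding in the floor, and the numeric values are placeholders.

```python
def chunk_parameters(w, period, d_lut, d_carry, d_xor):
    """Return (alpha, k, beta) for a w-bit adder and a target period.

    alpha: largest chunk whose ripple-carry delay fits in the period,
    k:     number of chunks,
    beta:  width of the last chunk, with 1 <= beta <= alpha.
    All delays are integers expressed in the same unit (picoseconds here).
    """
    alpha = 1 + (period - d_lut - d_xor) // d_carry
    if alpha < 1:
        raise ValueError("target period too short for even a 1-bit addition")
    k = -(-w // alpha)            # ceiling division
    beta = w - (k - 1) * alpha    # so that w = (k - 1) * alpha + beta
    return alpha, k, beta


# Hypothetical delays in picoseconds, target period of 2500 ps (400 MHz).
print(chunk_parameters(128, 2500, 500, 50, 400))
```

With these assumed values a 128-bit addition is split into k = 4 chunks, three of 33 bits and a last one of 29 bits, while a 16-bit addition needs no pipelining (k = 1).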
The column labelled Classical in Table II presents the
resource estimation formulae as functions of α, β, w, k, respectively
with and without allowing shift-register packing in LUTs
(SRL). Let us now explain how such formulae were built.
B. Resource estimation techniques
Let us take as a running example the previous classical
architecture, annotated on Figure 5.
The LUTs of the Xilinx FPGAs can be used either as
a function generator or as a variable-length shift-register, as
previously presented in Section I-D.
For the classical architecture, the addition diagonal uses $w$
LUTs configured as function generators (Figure 5, σ). The
LUT SRL configuration is used wherever two or more flip-flops
are cascaded to form a shift register. This is the case
of the $(k-2)\alpha$ SRLs under the addition diagonal (Figure
5, ξ), together with the $2\beta$ SRLs corresponding to the last
column of width $\beta$ (Figure 5, µ) and of the $2(k-3)\alpha$
SRLs above the diagonal (Figure 5, θ). These sum up to
$w + (3k-8)\alpha + 2\beta = (4k-9)\alpha + 3\beta$, which is the value
reported in Table II.
There is one consideration to be made before counting
registers: each time an SRL is used, the corresponding slice
flip-flop is also used. In other words, for a $p$-level shift-register,
$p-1$ levels are pushed into the SRL and one into the flip-flop.
Hence, we count $(3k-8)\alpha + 2\beta$ registers for the same
number of SRL, and, in addition, $\alpha$ registers (Figure 5, φ)
under, $2\alpha$ registers (Figure 5, ρ) above the diagonal plus
the $k-1$ registers for the carry-bit propagation. These total
$(3k-5)\alpha + 2\beta + k - 1$, the value reported in Table II.
Fig. 3. Ripple-Carry Addition Frequency for VirtexIV, Virtex5 and Spartan3E
(x-axis: Width (bits), 8 to 1024; y-axis: Frequency (MHz), 100 to 500)
The next task is to count slices. We choose to count half-
slices and divide this number by 2 rounding upwards. This
corresponds to a dense placement of the pipelined adder, which
the tools are expected to favor. Experimental results given in
section III-A will validate this assumption.
The number of half-slices used by the classical implementation
is: $w$ for the diagonal addition, $(3k-8)\alpha + 2\beta$ for
the SRL and corresponding flip-flops, and $3\alpha + k - 1$ for the
independent registers. However, we subtract $\alpha$ as the left-most
addition of $\alpha$ bits includes the registers in the same slice as
the LUT. The number totals $(4k-7)\alpha + 3\beta + k - 1$, which
is reported in Table II.
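The counting argument of this section can be checked mechanically. The Python sketch below rebuilds the LUT, register and slice counts of the classical architecture (with SRL packing) from the individual terms named in the text and compares them with the closed forms. The function name is hypothetical, and the restriction to k ≥ 3 is our own assumption, made so that every term is non-negative.

```python
def classical_resources_srl(alpha, beta, k):
    """Estimate (LUTs, registers, slices) of the classical pipelined RCA with SRLs.

    Assumption (ours): k >= 3, so that no term of the count is negative.
    """
    if k < 3:
        raise ValueError("this sketch assumes k >= 3")
    w = (k - 1) * alpha + beta

    # SRLs: under the diagonal (xi), last column (mu), above the diagonal (theta).
    srl = (k - 2) * alpha + 2 * beta + 2 * (k - 3) * alpha

    luts = w + srl                                   # adder LUTs (sigma) + SRLs
    registers = srl + alpha + 2 * alpha + (k - 1)    # SRL flip-flops + phi + rho + carries
    half_slices = w + srl + (3 * alpha + k - 1) - alpha
    slices = (half_slices + 1) // 2                  # divide by 2, rounding upwards

    # Cross-check against the closed forms given in the text.
    assert luts == (4 * k - 9) * alpha + 3 * beta
    assert registers == (3 * k - 5) * alpha + 2 * beta + k - 1
    assert half_slices == (4 * k - 7) * alpha + 3 * beta + k - 1
    return luts, registers, slices


# Example: w = 128 split as alpha = 33, beta = 29, k = 4 (illustrative values).
print(classical_resources_srl(33, 29, 4))
```

For this illustrative split the sketch gives 318 LUTs, 292 registers and 194 slices; these are outputs of the formulae, not synthesis results.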
All the formulae presented in this paper were deduced using
these techniques. Relative errors of these estimation formulae
are given in Table III. The worst case relative error is of the
order of $10^{-2}$ (one percent), which makes them sufficiently
accurate for estimation formulae.
C. Alternative RCA Pipelining
The classical pipelining technique requires a significant
amount of registers for input synchronization. This number
may be lowered by performing the chunk additions at the first
pipeline level and then propagating these sums instead. When
no SRL are allowed, the number of registers propagated above
the diagonal will be approximately halved, and may still be
packed in shift registers. An instantiation of this architecture
for k = 4 is presented in Figure 6.
Each adder on the addition diagonal takes as input an
operand on $\alpha+1$ bits and a 1-bit carry-in, and returns an
$(\alpha+1)$-bit-wide result. This addition does not overflow, as the
$(\alpha+1)$-bit input was the result of an addition of two $\alpha$-bit
numbers with a carry-in of 0, and is therefore at most $2^{\alpha+1} - 2$.
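The behaviour of this alternative pipelining can be illustrated with a bit-accurate Python model. It is a sketch under stated assumptions: it is not cycle-accurate and models no registers, the function names are hypothetical, and chunks are listed least-significant first.

```python
def split_chunks(value, widths):
    """Split an integer into chunks of the given widths, least-significant first."""
    chunks = []
    for width in widths:
        chunks.append(value & ((1 << width) - 1))
        value >>= width
    return chunks


def alternative_rca_add(x, y, c_in, widths):
    """Bit-accurate (not cycle-accurate) model of the alternative RCA pipelining."""
    # First pipeline level: every chunk sum is computed with a carry-in of 0.
    partial = [a + b for a, b in zip(split_chunks(x, widths), split_chunks(y, widths))]

    result, carry, shift = 0, c_in, 0
    for width, chunk_sum in zip(widths, partial):
        # Diagonal adder: a (width+1)-bit operand plus a 1-bit carry-in.
        total = chunk_sum + carry
        assert total < (1 << (width + 1))      # never overflows width+1 bits
        result |= (total & ((1 << width) - 1)) << shift
        carry = total >> width                 # carry-out towards the next chunk
        shift += width
    return result | (carry << shift)


widths = [33, 33, 33, 29]                      # illustrative 128-bit split
x, y = (1 << 128) - 1, 1
print(alternative_rca_add(x, y, 1, widths) == x + y + 1)
```

The internal assertion is the no-overflow property stated above: a chunk sum is at most 2^(α+1) − 2, so adding a 1-bit carry still fits on α + 1 bits.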
The resource estimation formulae for this architecture are
presented in Table II.
D. Short-Latency Addition Architecture
Given a target frequency f, the pipeline depth of the previ-
ously presented architectures increases linearly with addition
size. In this section we propose a scalable low-latency addition
architecture based on the textbook carry-select architecture,
whose novel feature is to make efficient use of the fast-carry
chains for the carry-bit computations.
The algorithm first determines the chunk size $\alpha$ as per
section II-A. Next, two sums are computed for each pair of
chunks: $X_i + Y_i$ and $X_i + Y_i + 1$. The final result $R$ is a
combination of the corresponding sub-sums and is found in a
space of $2^k$ combinations. Selecting the appropriate sub-sum is
done by using a carry-in bit. The novel idea in this algorithm
is the use of the dedicated fast-carry chains to compute the
carry-bits for the result selection.
Fig. 4. Classical addition architecture [7]
Fig. 5. Annotated classical architecture (annotations σ, ξ, θ, µ, ρ, φ)
Fig. 6. Proposed FPGA architecture
Actually, for each chunk, a pair (sum, carry-out) is computed
for both possible values of the carry-in. We use the
following notations to denote the concatenation of the sub-sums
and their corresponding carry-out bits.
$$c_i^0 S_i^0 = X_i + Y_i$$
$$c_i^1 S_i^1 = X_i + Y_i + 1$$
We denote by $R_i$ the $i$-th sub-result such that
$R = R_{k-1} \dots R_1 R_0$. The value of $R_i$ can be expressed in the
following way knowing $S_i^0$, $S_i^1$ and $c_{i-1}$.
if ($c_{i-1} = 0$) then $R_i \leftarrow S_i^0$ else $R_i \leftarrow S_i^1$
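The selection rule above can be modelled functionally in Python. This is a hedged sketch: the function name is hypothetical, chunks are processed least-significant first, and the loop is sequential only for readability, whereas in the hardware the two sums of every chunk are computed in parallel.

```python
def carry_select_add(x, y, c_in, widths):
    """Bit-accurate model of the chunked carry-select addition."""
    result, carry, shift = 0, c_in, 0
    for width in widths:
        mask = (1 << width) - 1
        x_i, y_i = (x >> shift) & mask, (y >> shift) & mask

        sum0 = x_i + y_i        # carry-out and sum assuming a carry-in of 0
        sum1 = x_i + y_i + 1    # carry-out and sum assuming a carry-in of 1

        chosen = sum1 if carry else sum0       # selection by the incoming carry
        result |= (chosen & mask) << shift     # sub-result R_i
        carry = chosen >> width                # carry-out c_i of this chunk
        shift += width
    return result | (carry << shift)


widths = [33, 33, 33, 29]                      # illustrative 128-bit split
x, y = (1 << 128) - 1, 1
print(carry_select_add(x, y, 0, widths) == x + y)
```

The model only checks functional correctness. It says nothing about latency or about how the carries are mapped to the fast-carry chains.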
The carry-out bit for a chunk $c_i$ is computed from its
carry-in $c_{i-1}$ and the two precomputed carries $c_i^0$ and $c_i^1$.
The circuit used to compute them is particularly designed
to take advantage of the fast carry chains of the FPGA by
expressing the carry-out computation under the form of an
addition (Figure 7):
$$c_i \, \neg c_i \, s'_i = c_{i-1} + c_i^0 + c_i^1 + 2$$
One can verify the correctness of the carry generation by
checking the truth table presented in Table I. Note that the
greyed-out rows of the table will never be needed, as $c_i^0 = 1$
implies $c_i^1 = 1$ (it is not possible that $X_i + Y_i$ overflows and
$X_i + Y_i + 1$ doesn’t). The value of $s'_i$ is not used further but
is necessary for correct inference and mapping of the addition
on the fast-carry chains of the FPGA.
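The truth-table check can also be done exhaustively in a few lines of Python. This sketch (function name hypothetical) evaluates the 3-bit sum of the carry-in, the two precomputed carries and the constant 2, and compares the top bit with the carry-select rule on every reachable input combination.

```python
def carry_add_cell(c_prev, c0, c1):
    """Return (c_i, not_c_i, s_prime): the three bits of c_prev + c0 + c1 + 2."""
    total = c_prev + c0 + c1 + 2
    return (total >> 2) & 1, (total >> 1) & 1, total & 1


for c_prev in (0, 1):
    for c0 in (0, 1):
        for c1 in (0, 1):
            if c0 == 1 and c1 == 0:
                # Unreachable: X_i + Y_i overflows but X_i + Y_i + 1 does not.
                continue
            c_i, not_c_i, _ = carry_add_cell(c_prev, c0, c1)
            expected = c1 if c_prev else c0    # carry-select rule
            assert c_i == expected
            assert not_c_i == 1 - c_i
print("carry-add cell matches the carry-select rule on all reachable inputs")
```

The two skipped combinations are exactly the greyed-out rows of Table I.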
It should be noted that a strong point of this approach
is that this carry propagation is expressed as an addition,
and therefore portable (no need for vendor-specific low-level
LUT-filling primitives). For instance, porting it to Altera chips
should simply involve choosing the appropriate values for the
delay-related parameters influencing the chunk size.
The formulae presented in Table II are deduced for $k \ge 3$.
To use them we thus have to ensure $w \ge 2\alpha + 1$, possibly by
reducing $\alpha$ with respect to the optimal $\alpha$ deduced from the
target frequency.
Fig. 7. Carry-Add-Cell (CAC) implementation and representation
TABLE I
CAC TRUTH TABLE. GREYED-OUT ROWS ARE NOT NEEDED
c_{i-1}  c_i^0  c_i^1 | c_i  ¬c_i  s'_i
   0       0      0   |  0     1     0
   0       0      1   |  0     1     1
   0       1      0   |  0     1     1    (greyed out)
   0       1      1   |  1     0     0
   1       0      0   |  0     1     1
   1       0      1   |  1     0     0
   1       1      0   |  1     0     0    (greyed out)
   1       1      1   |  1     0     1
The short-latency architecture depicted in Figure 8 has a
constant latency of two cycles. In addition, for lower frequency
operators, the second register levels can be discarded. However,
choosing the correct splitting for the inputs is not trivial
as we have to ensure that the critical path length is smaller
than the target period T. Considering that the first sums are
registered, we have to find the correct sizes for splitting the
inputs, such that the critical path length that includes the carry
generation circuit and the final additions is less than T.
Intuitively, as the index of the chunks added is higher, the
length of the corresponding carry bit propagation is longer
and thus the length of the final addition has to be smaller.
We use a greedy algorithm that, at index i finds the maximum
addition size such that the carry propagation for index i and the
final addition for this index is smaller than T . However, it is
Fig. 8. Short-Latency Addition architecture

References
Digital arithmetic (book)
TL;DR: In Digital Arithmetic, two of the field's leading experts deliver a unified treatment of digital arithmetic, tying underlying theory to design practice in a technology-independent manner, to help readers develop sound solutions, avoid known mistakes, and repeat successful design decisions.
FPGA adders: performance evaluation and optimal design (journal article)
TL;DR: The authors discuss costs and operational delays of fixed-point adders on Xilinx 4000 series devices and propose timing models and optimization schemes for carry-skip and carry-select adders.
Generating high-performance custom floating-point pipelines (proceedings article)
TL;DR: This generator is presented around the simple example of a collision detector, which it significantly improves in accuracy, DSP count, logic usage, frequency and latency with respect to an implementation using standard floating-point operators.
Arithmétique des ordinateurs (book)
TL;DR: This book presents various number notation systems and the various methods used by scientific machines to perform the usual arithmetic operations (addition, subtraction, multiplication, division) and to compute the main mathematical functions (sine, logarithm, exponential, square root, etc.).
Frequently Asked Questions (12)
Q1. What are the contributions mentioned in the paper "Pipelined FPGA Adders"?

This study explores the trade-offs between size, latency and frequency for pipelined large-precision adders on FPGA. 

Future work also includes extending the optimization options to include operator latency, and possibly combinations such as “LUTs and latency”.

In addition to latency reduction, this optimization brings the following gains: the number of registers is reduced by the carry propagation size (which now needs no registering), the LUT count is reduced by approximately w, and the number of slices by approximately w/2.

A tight frequency-driven pipelining is obtained by first determining the maximal addition size α in equation 1 for which the critical path delay is less than the target period T: $\alpha = 1 + \left\lfloor (T - \delta_{LUT} - \delta_{xor}) / \delta_{carry} \right\rfloor$.

The worst case relative error is of the order of $10^{-2}$ (one percent), which makes them sufficiently accurate for estimation formulae.

All the slices in a VirtexII-Pro device were similar to sliceM, but they were reduced to half the total number of slices for Virtex4 and Spartan3, and about a quarter in Virtex5 and Virtex6 devices (with higher density at the input of the DSP48E blocks). 

In this section the authors propose a scalable low-latency addition architecture based on the textbook carry-select architecture, whose novel feature is to make efficient use of the fast-carry chains for the carry-bit computations. 

When no SRL are allowed, the number of registers propagated above the diagonal will be approximately halved, and may still be packed in shift registers.

The LUTs of the Xilinx FPGAs can be used either as a function generator or as a variable-length shift-register, as previously presented in Section I-D.

For both alternative and low-latency architectures, there are two options: either perform all additions using chunk size γ, or buffer the inputs and perform computations using chunk size α.

Each adder on the addition diagonal takes as input an operand on α+1 bits and a 1-bit carry-in, and returns an (α+1)-bit-wide result.

Work is under way to integrate the proposed adders in all the coarser cores of the FloPoCo project, and to support more FPGA targets.