
A new approach to pipeline FFT processor

15 Apr 1996, pp 766-770
TL;DR: A new VLSI architecture for a real-time pipeline FFT processor is proposed. It is based on a hardware-oriented radix-2² algorithm, derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach, which has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm.
Abstract: A new VLSI architecture for a real-time pipeline FFT processor is proposed. A hardware-oriented radix-2² algorithm is derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach. The radix-2² algorithm has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-N DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: log₄ N - 1 complex multipliers and N - 1 complex data memory. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.

Summary (1 min read)

I. INTRODUCTION

  • A pipeline FFT processor is a specialized class of processor for DFT computation utilizing fast algorithms.
  • The architecture design for pipeline FFT processors had been the subject of intensive research as early as the 1970s, when real-time processing was demanded in applications such as radar signal processing [6], well before VLSI technology had advanced to the level of system integration.
  • Here different approaches are put into functional blocks with unified terminology, where the additive butterfly has been separated from the multiplier to show the hardware requirement distinctly, as in Fig. 1.
  • The input sequence is broken into two parallel data streams flowing forward, with the correct "distance" between data elements entering the butterfly scheduled by proper delays.
  • By the observations made in the last section, the most desirable hardware-oriented algorithm is one that has the same number of non-trivial multiplications at the same positions in the SFG as radix-4 algorithms, but the same butterfly structure as radix-2 algorithms.

IV. R2²SDF ARCHITECTURE

  • Mapping the radix-2² DIF FFT algorithm derived in the last section to the R2SDF architecture discussed in Section II, a new architecture, the Radix-2² Single-path Delay Feedback (R2²SDF) approach, is obtained.
  • Fig. 4 outlines an implementation of the R2²SDF architecture for N = 256; note the similarity of the datapath to R2SDF and the reduced number of multipliers.
  • A (log₂ N)-bit binary counter serves two purposes: synchronization controller and address counter for twiddle factor reading in each stage.
  • The input data from the left is directed to the shift registers until they are filled.
  • On the next N/2 cycles, the multiplexors turn to position "1", and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers.

V. CONCLUSION

  • A hardware-oriented radix-2² algorithm is derived which has the radix-4 multiplicative complexity but retains the radix-2 butterfly structure in the SFG.
  • Based on this algorithm, a new, efficient pipeline FFT architecture, the R2²SDF architecture, is put forward.
  • The hardware requirement of the proposed architecture as compared with various approaches is shown in Table 1, where not only the number of complex multipliers, adders and memory size but also the control complexity are listed for comparison.
  • For easy reading, the base-4 logarithm is used whenever applicable.
  • It shows that R2²SDF has reached the minimum requirement for both multipliers and storage, and is second only to R4SDC for adders.


A New Approach to Pipeline FFT Processor

Shousheng He and Mats Torkelson
Department of Applied Electronics, Lund University, S-22100 Lund, SWEDEN
email: he@tde.lth.se; torkel@tde.lth.se

Abstract: A new VLSI architecture for a real-time pipeline FFT processor is proposed. A hardware-oriented radix-2² algorithm is derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach. The radix-2² algorithm has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-N DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: log₄ N - 1 complex multipliers and N - 1 complex data memory. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.

1063-7133/96 $5.00 © 1996 IEEE. Proceedings of IPPS '96.

I. INTRODUCTION

A pipeline FFT processor is a specialized class of processor for DFT computation utilizing fast algorithms. It is characterized by real-time, non-stop processing as the data sequence passes through the processor. It is an AT² non-optimal approach with AT² = O(N³), since the area lower bound is O(N). However, it has been speculated [1] that a new metric should be introduced for real-time processing, since such processing is necessarily non-optimal given the time complexity of O(N). Although asymptotically almost all the feasible architectures have reached the area lower bound [2], the class of pipeline FFT processors probably has the smallest "constant factor" among the approaches that meet the time requirement, owing to its minimal number, O(log N), of Arithmetic Elements (AEs). The difference comes from the fact that an AE, especially a multiplier, takes a much larger area than a register in a digital VLSI implementation.

It is also interesting to note that at least Ω(log N) AEs are necessary to meet the real-time processing requirement, owing to the Ω(N log N) computational complexity of the FFT algorithm. This count therefore has the nature of a "lower bound" on the AE requirement: any "optimal" architecture for real-time processing will likely have Ω(log N) AEs.

Another major share of the area and energy consumption of the FFT processor comes from the memory required to buffer the input data and the intermediate results of the computation. For large transform sizes, this turns out to be dominating [3,4]. Although there is no formal proof, the area lower bound indicates that the "lower bound" on the number of registers is likely to be Ω(N). This is obviously true for any architecture implementing an FFT-based algorithm, since the butterfly at the first stage has to take data elements separated by a distance of N/r in the input sequence, where r is a small constant integer, the "radix".

Putting the above arguments together, a pipeline FFT processor necessarily has Ω(logᵣ N) AEs and Ω(N) complex word registers. The optimal architecture has to be the one that reduces the "constant factor", that is, the absolute number of AEs (multipliers and adders) and the memory size, to the minimum.

In this paper a new approach for a real-time pipeline FFT processor, the Radix-2² Single-path Delay Feedback, or R2²SDF, architecture will be presented. We will begin with a brief review of previous approaches. A hardware-oriented radix-2² algorithm is then developed by integrating a twiddle factor decomposition technique in the divide-and-conquer approach to form a spatially regular signal flow graph (SFG). Mapping the algorithm to the cascading delay feedback structure leads to the proposed architecture. Finally we conclude with a comparison of the hardware requirement of R2²SDF and several other popular pipeline architectures.

II. PIPELINE FFT PROCESSOR ARCHITECTURES

Before going into the details of the new approach, it is useful to briefly review the various architectures for pipeline FFT processors. To avoid being influenced by the sequence order, we assume that the real-time processing task only requires the input sequence to be in normal order, and that the output is allowed to be in digit-reversed (radix-2 or radix-4) order, which is permissible in applications such as IDFT-based communication systems [5]. We also stick to the Decimation-In-Frequency (DIF) type of decomposition throughout the discussion.

The architecture design for pipeline FFT processors had been the subject of intensive research as early as the 1970s, when real-time processing was demanded in applications such as radar signal processing [6], well before VLSI technology had advanced to the level of system integration. Several architectures have been proposed over the last 2 decades since then, along with the increasing interest and the leap forward of the technology. Here different approaches are put into functional blocks with unified terminology, where the additive butterfly has been separated from the multiplier to show the hardware requirement distinctly, as in Fig. 1. The control and twiddle factor reading mechanisms have also been omitted for clarity. All data and arithmetic operations are complex, and the constraint that N is a power of 4 applies.

Figure 1: Various schemes for pipeline FFT processor

R2MDC: Radix-2 Multi-path Delay Commutator [6] was probably the most straightforward approach for pipeline implementation of the radix-2 FFT algorithm. The input sequence is broken into two parallel data streams flowing forward, with the correct "distance" between data elements entering the butterfly scheduled by proper delays. Both butterflies and multipliers are in 50% utilization. log₂ N - 2 multipliers, log₂ N radix-2 butterflies and 3N/2 - 2 registers (delay elements) are required.

R2SDF: Radix-2 Single-path Delay Feedback [7] uses the registers more efficiently by storing the butterfly output in feedback shift registers. A single data stream goes through the multiplier at every stage. It has the same number of butterfly units and multipliers as the R2MDC approach, but with a much reduced memory requirement: N - 1 registers. Its memory requirement is minimal.

R4SDF: Radix-4 Single-path Delay Feedback [8] was proposed as a radix-4 version of R2SDF, employing CORDIC¹ iterations. The utilization of multipliers is increased to 75% due to the storage of 3 out of 4 radix-4 butterfly outputs. However, the utilization of the radix-4 butterfly, which is fairly complicated and contains at least 8 complex adders, drops to only 25%. It requires log₄ N - 1 multipliers, log₄ N full radix-4 butterflies and storage of size N - 1.

R4MDC: Radix-4 Multi-path Delay Commutator [6] is a radix-4 version of R2MDC. It has been used as the architecture for the initial VLSI implementation of a pipeline FFT processor [3] and for massive wafer scale integration [9]. However, it suffers from low (25%) utilization of all components, which can be compensated only in some special applications where four FFTs are being processed simultaneously. It requires 3 log₄ N multipliers, log₄ N full radix-4 butterflies and 5N/2 - 4 registers.

R4SDC: Radix-4 Single-path Delay Commutator [10] uses a modified radix-4 algorithm with programmable 1/4 radix-4 butterflies to achieve higher (75%) utilization of multipliers. A combined Delay-Commutator also reduces the memory requirement to 2N - 2 from 5N/2 - 1, that of R4MDC. The butterfly and delay-commutator become relatively complicated due to the programmability requirement. R4SDC has been used recently in building the largest ever single chip pipeline FFT processor for HDTV application [4].

¹ The Coordinate Rotational Digital Computer

A swift skimming through of the architectures listed above reveals the distinctive merits of the different approaches. First, the delay-feedback approaches are always more efficient than the corresponding delay-commutator approaches in terms of memory utilization, since the stored butterfly output can be directly used by the multipliers. Second, radix-4 algorithm based single-path architectures have higher multiplier utilization; however, radix-2 algorithm based architectures have simpler butterflies which are better utilized. The new approach developed in the following sections is highly motivated by these observations.

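To make the counts quoted in this section concrete, the following is a minimal sketch that evaluates them for a given transform length. The function name `section2_requirements` and the dictionary layout are illustrative assumptions, not part of the paper. R4SDC is left out because the text above quotes only its memory size (2N - 2).

```python
def section2_requirements(n):
    """Multiplier, butterfly and register counts as quoted in Section II.

    Hypothetical helper for illustration; n must be a power of 4,
    matching the constraint stated in the text.
    """
    log2n = n.bit_length() - 1
    if n < 4 or n & (n - 1) or log2n % 2:
        raise ValueError("n must be a power of 4")
    log4n = log2n // 2
    return {
        # radix-2 schemes use radix-2 butterflies
        "R2MDC": {"multipliers": log2n - 2, "butterflies": log2n,
                  "registers": 3 * n // 2 - 2},
        "R2SDF": {"multipliers": log2n - 2, "butterflies": log2n,
                  "registers": n - 1},
        # radix-4 schemes use full radix-4 butterflies
        "R4SDF": {"multipliers": log4n - 1, "butterflies": log4n,
                  "registers": n - 1},
        "R4MDC": {"multipliers": 3 * log4n, "butterflies": log4n,
                  "registers": 5 * n // 2 - 4},
    }


for name, counts in section2_requirements(256).items():
    print(name, counts)
```

The butterfly counts are not directly comparable across radices, since a radix-4 butterfly contains at least 8 complex adders according to the text, while a radix-2 butterfly is a single add/subtract pair.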
III. RADIX-2² DIF FFT ALGORITHM

By the observations made in the last section, the most desirable hardware-oriented algorithm is one that has the same number of non-trivial multiplications at the same positions in the SFG as radix-4 algorithms, but the same butterfly structure as radix-2 algorithms. Strictly speaking, algorithms with this feature are not completely new. An SFG with a complex "bias" factor had been obtained implicitly as the result of a constant-rotation/compensation procedure using restricted CORDIC operations [11]. Another algorithm combining radix-4 and radix-"4 + 2" in DIT form has been used to decrease the scaling error in the R2MDC architecture, without altering the multiplier requirement [12]. A clear derivation of the algorithm in DIF form, with the aim of reducing the hardware requirement in the context of the pipeline FFT processor, is however yet to be developed.

To avoid confusion with the well known radix-2/4 split-radix algorithm and the mixed radix-"4 + 2" algorithm, the notion of radix-2² algorithm is used to clearly reflect the structural relation with the radix-2 algorithm and the identical computational requirement with the radix-4 algorithm.

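The DFT of size N is defined by

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \qquad 0 \le k < N, \tag{1}$$

where $W_N = e^{-j2\pi/N}$ denotes the Nth primitive root of unity, with its exponent evaluated modulo N. To make the derivation of the new algorithm clearer, consider the first 2 steps of decomposition in the radix-2 DIF FFT together. Applying a 3-dimensional linear index map,

$$n = \left\langle \tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3 \right\rangle_N, \qquad k = \left\langle k_1 + 2k_2 + 4k_3 \right\rangle_N, \tag{2}$$

the Common Factor Algorithm (CFA) has the form of

$$X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1} \sum_{n_1=0}^{1} x\!\left(\tfrac{N}{2} n_1 + \tfrac{N}{4} n_2 + n_3\right) W_N^{(\frac{N}{2} n_1 + \frac{N}{4} n_2 + n_3)(k_1 + 2k_2 + 4k_3)}$$

$$\qquad = \sum_{n_3=0}^{N/4-1} \sum_{n_2=0}^{1} \left\{ B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right) W_N^{(\frac{N}{4} n_2 + n_3) k_1} \right\} W_N^{(\frac{N}{4} n_2 + n_3)(2k_2 + 4k_3)}, \tag{3}$$

where the butterfly structure has the form of

$$B_{N/2}^{k_1}\!\left(\tfrac{N}{4} n_2 + n_3\right) = x\!\left(\tfrac{N}{4} n_2 + n_3\right) + (-1)^{k_1}\, x\!\left(\tfrac{N}{4} n_2 + n_3 + \tfrac{N}{2}\right).$$
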
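If the expression within the braces of eqn. (3) is computed before further decomposition, an ordinary radix-2 DIF FFT results. The key idea of the new algorithm is to proceed with the second step of decomposition on the remaining DFT coefficients, including the "twiddle factor" $W_N^{(\frac{N}{4} n_2 + n_3) k_1}$, to exploit the exceptional values in multiplication before the next butterfly is constructed. Decomposing the composite twiddle factor, observe that

$$W_N^{(\frac{N}{4} n_2 + n_3)(k_1 + 2k_2 + 4k_3)} = W_N^{N n_2 k_3}\, W_N^{\frac{N}{4} n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3} = (-j)^{n_2 (k_1 + 2k_2)}\, W_N^{n_3 (k_1 + 2k_2)}\, W_N^{4 n_3 k_3}. \tag{4}$$

Substituting eqn. (4) in eqn. (3) and expanding the summation over index $n_2$, after simplification we have a set of 4 DFTs of length N/4,

$$X(k_1 + 2k_2 + 4k_3) = \sum_{n_3=0}^{N/4-1} \left[ H(k_1, k_2, n_3)\, W_N^{n_3 (k_1 + 2k_2)} \right] W_{N/4}^{n_3 k_3}, \tag{5}$$

where $H(k_1, k_2, n_3)$ is expressed in eqn. (6):

$$H(k_1, k_2, n_3) = \left[ x(n_3) + (-1)^{k_1} x\!\left(n_3 + \tfrac{N}{2}\right) \right] + (-j)^{(k_1 + 2k_2)} \left[ x\!\left(n_3 + \tfrac{N}{4}\right) + (-1)^{k_1} x\!\left(n_3 + \tfrac{3N}{4}\right) \right]. \tag{6}$$
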
Figure 2: Butterfly with decomposed twiddle factors.

Eqn. (6) represents the first two stages of butterflies with only trivial multiplications in the SFG, as BF I and BF II in Fig. 2. After these two stages, full multipliers are required to compute the product of the decomposed twiddle factor $W_N^{n_3 (k_1 + 2k_2)}$ in eqn. (5), as shown in Fig. 2. Note that the order of the twiddle factors is different from that of the radix-4 algorithm.

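The first level of this decomposition can be checked numerically. The sketch below is an illustration rather than the authors' code: it evaluates the two trivial butterfly stages (BF I and BF II), applies the decomposed twiddle factor with exponent n3(k1 + 2k2), and finishes with four direct DFTs of length N/4. The helper names `w`, `dft` and `radix22_first_level` are assumptions introduced here, and the convention W_N = exp(-j2π/N) is the one implied by W_N^(N/4) = -j.

```python
import cmath


def w(n, e):
    """Twiddle factor W_n^e = exp(-2*pi*j*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def dft(x):
    """Direct O(N^2) DFT, used as the reference and for the N/4 sub-DFTs."""
    n = len(x)
    return [sum(x[i] * w(n, i * k) for i in range(n)) for k in range(n)]


def radix22_first_level(x):
    """One level of the radix-2^2 decomposition; len(x) must be divisible by 4."""
    n = len(x)
    q = n // 4
    result = [0j] * n
    for k1 in (0, 1):
        for k2 in (0, 1):
            branch = []
            for n3 in range(q):
                # BF I: radix-2 butterflies on elements N/2 apart
                b0 = x[n3] + (-1) ** k1 * x[n3 + 2 * q]
                b1 = x[n3 + q] + (-1) ** k1 * x[n3 + 3 * q]
                # BF II: trivial factor (-j)^(k1 + 2*k2), elements N/4 apart
                h = b0 + (-1j) ** (k1 + 2 * k2) * b1
                # full multiplication by the decomposed twiddle factor
                branch.append(h * w(n, n3 * (k1 + 2 * k2)))
            # remaining length-N/4 DFT over n3, indexed by k3
            for k3, value in enumerate(dft(branch)):
                result[k1 + 2 * k2 + 4 * k3] = value
    return result


x = [complex(i, (i * i) % 7) for i in range(16)]
error = max(abs(a - b) for a, b in zip(radix22_first_level(x), dft(x)))
print("max deviation from direct DFT:", error)
```

The deviation should be at the level of floating-point rounding, which confirms that the two butterfly stages use only additions, subtractions and factors of ±1 and ±j before the single twiddle multiplication.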
Applying this CFA procedure recursively to the remaining DFTs of length N/4 in eqn. (5), the complete radix-2² DIF FFT algorithm is obtained. An N = 16 example is shown in Fig. 3, where small diamonds represent trivial multiplication by $W_N^{N/4} = -j$, which involves only real-imaginary swapping and sign inversion.

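The swap-and-negate form of the multiplication by -j follows from (a + jb)(-j) = b - ja. A minimal sketch, with the function name `times_minus_j` chosen here purely for illustration:

```python
def times_minus_j(re, im):
    """Return (re + j*im) * (-j) as a (real, imaginary) pair.

    No multiplier is used: the parts are swapped and the new
    imaginary part is sign-inverted.
    """
    return im, -re


print(times_minus_j(3, 4))
```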
Figure 3: Radix-2² DIF FFT flow graph for N = 16 (butterfly stages BF I, BF II, BF III, BF IV)

The radix-2² algorithm has the feature that it has the same multiplicative complexity as radix-4 algorithms, but still retains the radix-2 butterfly structures. The multiplicative operations are arranged such that only every other stage has non-trivial multiplications. This is a great structural advantage over other algorithms when a pipeline/cascade FFT architecture is under consideration.

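As an illustration of the complete algorithm, the sketch below applies the decomposition of eqns. (5) and (6) recursively and counts the twiddle multiplications that are not by ±1 or ±j. It is a software model under stated assumptions, not the hardware datapath: the names `fft_radix22`, `dft_direct` and the `stats` dictionary are hypothetical, the input length must be a power of 4, and the output is written in normal order for easy comparison with a direct DFT.

```python
import cmath


def dft_direct(x):
    """Direct O(N^2) DFT used only as a reference."""
    n = len(x)
    return [sum(x[i] * cmath.exp(-2j * cmath.pi * i * k / n) for i in range(n))
            for k in range(n)]


def fft_radix22(x, stats):
    """Recursive radix-2^2 DIF FFT; len(x) must be a power of 4."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    q = n // 4
    result = [0j] * n
    for k1 in (0, 1):
        for k2 in (0, 1):
            m = k1 + 2 * k2
            branch = []
            for n3 in range(q):
                # two butterfly stages with only trivial factors, eqn. (6)
                b0 = x[n3] + (-1) ** k1 * x[n3 + 2 * q]
                b1 = x[n3 + q] + (-1) ** k1 * x[n3 + 3 * q]
                h = b0 + (-1j) ** m * b1
                # decomposed twiddle factor W_N^(n3*(k1+2*k2)), eqn. (5)
                e = n3 * m
                if e % q:
                    # exponent is not a multiple of N/4: a full multiplication
                    stats["nontrivial"] += 1
                branch.append(h * cmath.exp(-2j * cmath.pi * e / n))
            # recurse on the remaining length-N/4 DFT
            for k3, value in enumerate(fft_radix22(branch, stats)):
                result[k1 + 2 * k2 + 4 * k3] = value
    return result


stats = {"nontrivial": 0}
x = [complex(i % 5, -(i % 3)) for i in range(16)]
spectrum = fft_radix22(x, stats)
print("non-trivial multiplications:", stats["nontrivial"])
```

In this model all full multiplications occur after every second butterfly stage, which is the property the text identifies as the structural advantage for a pipeline architecture.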
IV. R2²SDF ARCHITECTURE

Mapping the radix-2² DIF FFT algorithm derived in the last section to the R2SDF architecture discussed in Section II, a new architecture, the Radix-2² Single-path Delay Feedback (R2²SDF) approach, is obtained.

Fig. 4 outlines an implementation of the R2²SDF architecture for N = 256; note the similarity of the datapath to R2SDF and the reduced number of multipliers. The implementation uses two types of butterflies: one identical to that in R2SDF, the other also containing the logic to implement the trivial twiddle factor multiplication, as shown in Fig. 5-(i) and (ii) respectively. Due to the spatial regularity of the radix-2² algorithm, the synchronization control of the processor is very simple. A (log₂ N)-bit binary counter serves two purposes: synchronization controller and address counter for twiddle factor reading in each stage.

With the help of the butterfly structures shown in Fig. 5, the scheduled operation of the R2²SDF processor in Fig. 4 is as follows. On the first N/2 cycles, the 2-to-1 multiplexors in the first butterfly module switch to position "0", and the butterfly is idle. The input data from the left is directed to the shift registers until they are filled. On the next N/2 cycles, the multiplexors turn to position "1", and the butterfly computes a 2-point DFT with the incoming data and the data stored in the shift registers. The butterfly output Z1(n) is sent to apply the twiddle factor, and Z1(n + N/2) is sent back to the shift registers to be "multiplied" in the next N/2 cycles, when the first half of the next frame of the time sequence is loaded in. The operation of the second butterfly is similar to that of the first one, except that the "distance" of the butterfly input sequence is just N/4 and the trivial twiddle factor multiplication is implemented by real-imaginary swapping with a commutator and controlled add/subtract operations, as in Fig. 5-(ii), which requires a two-bit control signal from the synchronizing counter. The data then goes through a full complex multiplier, working at 75% utility, which accomplishes the result of the first level of radix-4 DFT word by word. Further processing repeats this pattern, with the distance of the input data decreasing by half at each consecutive butterfly stage. After N - 1 clock cycles, the complete DFT transform result streams out to the right, in bit-reversed order. The next frame of the transform can be computed without pausing due to the pipelined processing of each stage.

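The schedule of the first butterfly module can be sketched as a behavioural model of a single delay-feedback stage. This is a simplified illustration under assumptions made here, not the VHDL model: `sdf_stage` is a hypothetical name, the feedback shift register starts filled with zeros, twiddle multiplication is omitted, and a run of zeros is appended to flush the stored differences.

```python
from collections import deque


def sdf_stage(samples, depth):
    """One delay-feedback butterfly stage with a `depth`-word shift register.

    `depth` is N/2 for the first stage of a length-N transform.
    One word enters and one word leaves per cycle.
    """
    fifo = deque([0] * depth)      # feedback shift register, assumed zeroed
    out = []
    for cycle, word in enumerate(samples):
        stored = fifo.popleft()    # oldest word leaves the register
        if (cycle // depth) % 2 == 0:
            # multiplexors at "0": butterfly idle, incoming word is stored,
            # the word leaving the register is passed downstream
            fifo.append(word)
            out.append(stored)
        else:
            # multiplexors at "1": 2-point DFT of stored and incoming words
            out.append(stored + word)   # sum goes downstream
            fifo.append(stored - word)  # difference is fed back
    return out


frame = [1, 2, 3, 4, 5, 6, 7, 8]                 # one frame, N = 8
stream = sdf_stage(frame + [0] * 4, depth=4)     # zeros flush the register
print(stream)
```

After an initial delay of `depth` cycles the stage emits the N/2 sums followed by the N/2 differences, which is why a single data stream and N/2 registers suffice for this stage.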
In a practical implementation, pipeline registers should be inserted between each multiplier and butterfly stage to improve the performance. Shimming registers are also needed for control signals to comply with the revised timing. The latency of the output is then increased to N - 1 + 3(log₄ N - 1) without affecting the throughput rate.

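As a quick worked instance of the latency expression, the sketch below evaluates N - 1 + 3(log₄ N - 1) for a given transform length. The function name `pipelined_latency` is an assumption made for illustration, and the input is assumed to be a power of 4.

```python
def pipelined_latency(n):
    """Output latency in clock cycles, N - 1 + 3*(log4(N) - 1), for n a power of 4."""
    log2n = n.bit_length() - 1
    if n < 4 or n & (n - 1) or log2n % 2:
        raise ValueError("n must be a power of 4")
    log4n = log2n // 2
    return n - 1 + 3 * (log4n - 1)


print(pipelined_latency(256))
```

For N = 256 the added pipeline registers contribute 3 × 3 = 9 cycles on top of the 255 cycles of the unpipelined datapath.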
Figure 4: R2²SDF pipeline FFT architecture for N = 256

Figure 5: Butterfly structure for R2²SDF FFT processor: (i) BF2I, (ii) BF2II

V. CONCLUSION

In this paper, a hardware-oriented radix-2² algorithm is derived which has the radix-4 multiplicative complexity but retains the radix-2 butterfly structure in the SFG. Based on this algorithm, a new, efficient pipeline FFT architecture, the R2²SDF architecture, is put forward. The hardware requirement of the proposed architecture as compared with various approaches is shown in Table 1, where not only the number of complex multipliers, adders and memory size but also the control complexity are listed for comparison. For easy reading, the base-4 logarithm is used whenever applicable. It shows that R2²SDF has reached the minimum requirement for both multipliers and storage, and is second only to R4SDC for adders. This makes it an ideal architecture for VLSI implementation of pipeline FFT processors.

Table 1: Hardware requirement comparison

          multiplier #     adder #     memory size   control
R2MDC     2(log₄ N - 1)    4 log₄ N    3N/2 - 2      simple
R2SDF     2(log₄ N - 1)    4 log₄ N    N - 1         simple
R4SDF     log₄ N - 1       8 log₄ N    N - 1         medium
R4MDC     3(log₄ N - 1)    8 log₄ N    5N/2 - 4      simple
R4SDC     log₄ N - 1       3 log₄ N    2N - 2        complex
R2²SDF    log₄ N - 1       4 log₄ N    N - 1         simple

The architecture has been modeled in the hardware description language VHDL with generic parameters for transform size and word-length, using fixed point arithmetic and a complex array multiplier implemented with distributed arithmetic. The validity and efficiency of the proposed architecture have been verified by extensive simulation.

REFERENCES

[1] C. D. Thompson. Fourier transform in VLSI. IEEE Trans. Comput., C-32(11):1047-1057, Nov. 1983.
[2] S. He and M. Torkelson. A new expandable 2D systolic array for DFT computation based on symbiosis of 1D arrays. In Proc. ICA3PP'95, pages 12-19, Brisbane, Australia, Apr. 1995.
[3] E. E. Swartzlander, W. K. W. Young, and S. J. Joseph. A radix 4 delay commutator for fast Fourier transform processor implementation. IEEE J. Solid-State Circuits, SC-19(5):702-709, Oct. 1984.
[4] E. Bidet, D. Castelain, C. Joanblanq, and P. Stenn. A fast single-chip implementation of 8192 complex point FFT. IEEE J. Solid-State Circuits, 30(3):300-305, Mar. 1995.
[5] M. Alard and R. Lassalle. Principles of modulation and channel coding for digital broadcasting for mobile receivers. EBU Review, (224):47-69, Aug. 1987.
[6] L. R. Rabiner and B. Gold. Theory and Application of Digital Signal Processing. Prentice-Hall, Inc., 1975.
[7] E. H. Wold and A. M. Despain. Pipeline and parallel-pipeline FFT processors for VLSI implementation. IEEE Trans. Comput., C-33(5):414-426, May 1984.
[8] A. M. Despain. Fourier transform computer using CORDIC iterations. IEEE Trans. Comput., C-23(10):993-1001, Oct. 1974.
[9] E. E. Swartzlander, V. K. Jain, and H. Hikawa. A radix 8 wafer scale FFT processor. J. VLSI Signal Processing, 4(2,3):165-176, May 1992.
[10] G. Bi and E. V. Jones. A pipelined FFT processor for word-sequential data. IEEE Trans. Acoust., Speech, Signal Processing, 37(12):1982-1985, Dec. 1989.
[11] A. M. Despain. Very fast Fourier transform algorithms hardware for implementation. IEEE Trans. Comput., C-28(5):333-341, May 1979.
[12] R. Storn. Radix-2 FFT-pipeline architecture with reduced noise-to-signal ratio. IEE Proc.-Vis. Image Signal Process., 141(2):81-86, Apr. 1994.

Citations
Journal ArticleDOI
TL;DR: A formal procedure for designing FFT architectures using folding transformation and register minimization techniques is proposed and new parallel-pipelined architectures for the computation of real-valued fast Fourier transform (RFFT) are derived.
Abstract: This paper presents a novel approach to develop parallel pipelined architectures for the fast Fourier transform (FFT). A formal procedure for designing FFT architectures using folding transformation and register minimization techniques is proposed. Novel parallel-pipelined architectures for the computation of complex and real valued fast Fourier transform are derived. For complex valued Fourier transform (CFFT), the proposed architecture takes advantage of under utilized hardware in the serial architecture to derive L-parallel architectures without increasing the hardware complexity by a factor of L. The operating frequency of the proposed architecture can be decreased which in turn reduces the power consumption. Further, this paper presents new parallel-pipelined architectures for the computation of real-valued fast Fourier transform (RFFT). The proposed architectures exploit redundancy in the computation of FFT samples to reduce the hardware complexity. A comparison is drawn between the proposed designs and the previous architectures. The power consumption can be reduced up to 37% and 50% in 2-parallel CFFT and RFFT architectures, respectively. The output samples are obtained in a scrambled order in the proposed architectures. Circuits to reorder these scrambled output sequences to a desired order are presented.

163 citations


Cites methods from "A new approach to pipeline FFT proc..."

  • ...Algorithms including radix-4 [2], split-radix [3], radix- [4] have been developed based on the basic radix-2 FFT approach....

  • ...In this paper, we propose several novel parallel-pipelined architectures for the computation of RFFT based on radix- [4] and radix- algorithms [28]....

  • ...The other approaches [4] are not specific for the RFFT and can be used to calculate the CFFT....

  • ...We can simply use the Radix-2 single-path delay feedback approach presented in [4], with just modifying the first complex butterfly stage into real....

Journal ArticleDOI
TL;DR: The Spiral hardware generation framework and system for linear transforms is introduced, which automatically generates an algorithm, maps it to a datapath, and outputs a synthesizable register transfer level Verilog description suitable for FPGA or ASIC implementation.
Abstract: Linear signal transforms such as the discrete Fourier transform (DFT) are very widely used in digital signal processing and other domains. Due to high performance or efficiency requirements, these transforms are often implemented in hardware. This implementation is challenging due to the large number of algorithmic options (e.g., fast Fourier transform algorithms or FFTs), the variety of ways that a fixed algorithm can be mapped to a sequential datapath, and the design of the components of this datapath. The best choices depend heavily on the resource budget and the performance goals of the target application. Thus, it is difficult for a designer to determine which set of options will best meet a given set of requirements.In this article we introduce the Spiral hardware generation framework and system for linear transforms. The system takes a problem specification as input as well as directives that define characteristics of the desired datapath. Using a mathematical language to represent and explore transform algorithms and datapath characteristics, the system automatically generates an algorithm, maps it to a datapath, and outputs a synthesizable register transfer level Verilog description suitable for FPGA or ASIC implementation. The quality of the generated designs rivals the best available handwritten IP cores.

120 citations


Cites methods from "A new approach to pipeline FFT proc..."

  • ...[2005], and Johnson et al. [1990] use this type of representation to specify and generate software implementations of the FFT....

  • ...Similarly, Kee et al. [2008] target the FFT on an FPGA, expressing a radix 2 FFT algorithm as a double loop using National Instruments LabVIEW, where each loop can then be unrolled by the tool....

Journal ArticleDOI
TL;DR: A multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length is presented.
Abstract: This paper presents a multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use radix-Ns butterflies at each stage, where Ns is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let Ns=4 and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented with an UMC 90-nm CMOS technology with a core area of 3.1 mm². The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively in the post-layout simulation. Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption.

99 citations


Cites background from "A new approach to pipeline FFT proc..."

  • ...Such memory requirement may be forbidden if Ns is large, because the area of memory does not shrink as much as that of logic gates when fabrication technology advances, due to the use of sense amplify circuitry....

Journal ArticleDOI
TL;DR: Since the proposed algorithm can achieve higher processing rate and better efficiency than the conventional algorithm, it is very suitable for the OFDM/DMT applications such as the WLAN, DAB/DVB, and ADSL/VDSL systems.
Abstract: In this paper, we propose a new efficient FFT algorithm for OFDM/DMT applications and present its pipeline implementation results. Since the proposed algorithm is based on the radix-4 butterfly unit, the processing rate can be twice as fast as that based on the radix-2³ algorithm. Also, its implementation is more area-efficient than the implementation from conventional radix-4 algorithm due to reduced number of nontrivial multipliers like using the radix-2³ algorithm. In order to compare the proposed algorithm with the conventional radix-4 algorithm, the 64-point MDC pipelined FFT processor based on the proposed algorithm was implemented. After the logic synthesis using 0.35 µm CMOS technology, the logic gate count for the processor with the proposed algorithm is only about 70% of that for the processor with the conventional radix-4 algorithm. Since the proposed algorithm can achieve higher processing rate and better efficiency than the conventional algorithm, it is very suitable for the OFDM/DMT applications such as the WLAN, DAB/DVB, and ADSL/VDSL systems.

83 citations

Journal ArticleDOI
TL;DR: The discrete Fourier transform (DFT) matrix factorization based on the Kronecker product is proposed to express the family of radix r^k single-path delay commutator/single-path delay feedback (SDC/SDF) pipeline fast Fourier transform (FFT) architectures.
Abstract: This paper proposes to use the discrete Fourier transform (DFT) matrix factorization based on the Kronecker product to express the family of radix r^k single-path delay commutator/single-path delay feedback (SDC/SDF) pipeline fast Fourier transform (FFT) architectures. The matricial expressions of the radix r, r², r³, and r⁴ decimation-in-frequency (DIF) SDC/SDF pipeline architectures are derived. These expressions can be written using a small set of operators, resulting in a compact representation of the algorithms. The derived expressions are general in terms of r and the number of points of the FFT N. Expressions are given where it is not necessary that N is a power of r^k. The proposed set of operators can be mapped to equivalent hardware circuits. Thus, the designer can easily go from the matricial representations to their implementations and vice versa. As an example, the mapping of the operators is shown for radix 2, 2², 2³, and 2⁴, and the details of the corresponding SDC/SDF pipeline FFT architectures are presented. Furthermore, a general expression is given for the SDC/SDF radix r^k pipeline architectures when k > 4. This general expression helps the designer to efficiently handle a wider design exploration space and select the optimum single-path architecture for a given value of N.

83 citations


Cites background from "A new approach to pipeline FFT proc..."

  • ...The authors are with the CEIT and TECNUN, University of Navarra, 20018 San Sebastián, Spain (e-mail: acortes@ceit.es; ivelez@ceit.es; jfsevillano@ceit. es)....

    [...]

  • ...Multiple-path architectures, such as [1]–[5], where data is input using several parallel paths, are used when the throughput needs to be increased for a given clock frequency of the FFT processor....

    [...]

References
Book
01 Jan 1975
TL;DR: Bellman and Wing introduce the simplicity of the invariant imbedding method for tackling various problems of interest to engineers, physicists, applied mathematicians, and numerical analysts.
Abstract: sprightly style and is interesting from cover to cover. The comments, critiques, and summaries that accompany the chapters are very helpful in crystalizing the ideas and answering questions that may arise, particularly to the self-learner. The transparency in the presentation of the material in the book equips the reader to proceed quickly to a wealth of problems included at the end of each chapter. These problems, ranging from elementary to research-level, are very valuable in that a solid working knowledge of the invariant imbedding techniques is acquired as well as good insight in attacking problems in various applied areas. Furthermore, a useful selection of references is given at the end of each chapter. This book may not appeal to those mathematicians who are interested primarily in the sophistication of mathematical theory, because the authors have deliberately avoided all pseudo-sophistication in attaining transparency of exposition. Precisely for the same reason the majority of the intended readers who are applications-oriented and are eager to use the techniques quickly in their own fields will welcome and appreciate the efforts put into writing this book. From a purely mathematical point of view, some of the invariant imbedding results may be considered to be generalizations of the classical theory of first-order partial differential equations, and a part of the analysis of invariant imbedding is still at a somewhat heuristic stage despite successes in many computational applications. However, those who are concerned with mathematical rigor will find opportunities to explore the foundations of the invariant imbedding method. In conclusion, let me quote the following: "What is the best method to obtain the solution to a problem? The answer is, any way that works." (Richard P. Feynman, Engineering and Science, March 1965, Vol. XXVIII, no. 6, p. 9.)
In this well-written book, Bellman and Wing have indeed accomplished the task of introducing the simplicity of the invariant imbedding method to tackle various problems of interest to engineers, physicists, applied mathematicians, and numerical analysts.

3,249 citations

Journal ArticleDOI
TL;DR: VLSI implementations have constraints which differ from those of discrete implementations, requiring another look at some of the typical FFT algorithms in the light of these constraints.
Abstract: In some signal processing applications, it is desirable to build very high performance fast Fourier transform (FFT) processors. To meet the performance requirements, these processors are typically highly pipelined. Until the advent of VLSI, it was not possible to build a single chip which could be used to construct pipeline FFT processors of a reasonable size. However, VLSI implementations have constraints which differ from those of discrete implementations, requiring another look at some of the typical FFT algorithms in the light of these constraints.

327 citations


"A new approach to pipeline FFT proc..." refers methods in this paper

  • ...By the observations made in last section the most desirable hardware oriented algorithm will be that it has the same number of non-trivial multiplications at the same positions in the SFG as of radix-4 algorithms, but has the same butterfly structure as that of radix-2 algorithms....

    [...]

Journal ArticleDOI
TL;DR: The CORDIC iteration is applied to several Fourier transform algorithms and a new, especially attractive FFT computer architecture is presented as an example of the utility of this technique.
Abstract: The CORDIC iteration is applied to several Fourier transform algorithms. The number of operations is found as a function of transform method and radix representation. Using these representations, several hardware configurations are examined for cost, speed, and complexity tradeoffs. A new, especially attractive FFT computer architecture is presented as an example of the utility of this technique. Compensated and modified CORDIC algorithms are also developed.
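For readers unfamiliar with the technique, the following is a minimal sketch of a rotation-mode CORDIC iteration, the kind of shift-and-add rotation that can stand in for a complex twiddle-factor multiplication. It is a generic floating-point illustration, not the architecture of the paper summarized above; the function name cordic_rotate and the iteration count are assumptions made for this example, and a hardware version would use fixed-point shifts and a precomputed arctangent table.

```python
import math


def cordic_rotate(x, y, angle, iterations=24):
    """Rotate (x, y) counter-clockwise by `angle` radians using CORDIC.

    Assumes |angle| is inside the basic convergence range (about 1.74 rad).
    """
    z = angle       # residual angle still to be rotated
    gain = 1.0      # accumulated scale factor of the micro-rotations
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0
        shift = 2.0 ** -i                    # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)            # table lookup in hardware
        gain *= math.sqrt(1.0 + shift * shift)
    # Compensate the constant CORDIC gain (about 1.6468 for many iterations).
    return x / gain, y / gain


if __name__ == "__main__":
    # Multiply 1 + 0j by the 8-point twiddle factor W_8^1 = exp(-j*pi/4).
    re, im = cordic_rotate(1.0, 0.0, -math.pi / 4)
    print(re, im)
```

Because the gain depends only on the iteration count, it can be folded into a single constant scaling elsewhere in the datapath rather than applied after every rotation.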

304 citations

Journal ArticleDOI
G. Bi, E.V. Jones
TL;DR: A modified fast Fourier transform algorithm is described together with a real-time pipelined implementation that requires less data memory and only 1/3 of the number of complex multipliers of a conventional design.
Abstract: A modified fast Fourier transform algorithm is described together with a real-time pipelined implementation. The approach is particularly suited to sequentially presented input data. The method can be used for both mixed and uniform radix number implementations. For example, for the radix-4 implementation, the method requires less data memory and only 1/3 of the number of complex multipliers of a conventional design.

204 citations


"A new approach to pipeline FFT proc..." refers methods in this paper

  • ...R4SDC: Radix-4 Single-path Delay Commutator [10] uses a modified radix-4 algorithm with programmable 1/4 radix-4 butterflies to achieve higher, 75% utilization of multipliers....

    [...]