Area-Efficient Scheduling Scheme Based FFT Processor for Various OFDM
Systems
Jeong Keun Jang
Dongbu Hitek
Bucheon, Korea
jeongkeun.jang@dbhitek.com
Ho Keun Kim, Myung Hoon Sunwoo
Department of Electrical and Computer
Engineering
Ajou University
Suwon, Korea
hokeun92@ajou.ac.kr, sunwoo@ajou.ac.kr
Oscar Gustafsson
Department of Electrical Engineering
Linköping University
Linköping, Sweden
oscar.gustafsson@liu.se
Abstract— This paper presents an area-efficient fast Fourier
transform (FFT) processor for orthogonal frequency-
division multiplexing systems based on multi-path delay
commutator architecture. This paper proposes a data sched-
uling scheme to reduce the number of complex constant mul-
tipliers. The proposed mixed-radix multi-path delay commu-
tator FFT processor can support 128-, 256-, and 512-point
FFT sizes. The proposed processor was synthesized using the
Samsung 65-nm CMOS standard cell library. The proposed
processor with eight parallel data paths can achieve a high
throughput rate of up to 2.64 GSample/s at 330 MHz.
Keywords: fast Fourier transform (FFT); high throughput; low hardware complexity; mixed-radix multi-path delay commutator (MRMDC); orthogonal frequency-division multiplexing (OFDM) systems
I. INTRODUCTION
The fast Fourier transform (FFT) is a well-known algorithm for computing the discrete Fourier transform efficiently. The FFT plays an important role in fields such as communication systems, biomedical applications, and sensor and radar signal processing. Moreover, the FFT processor is one of the most computationally intensive modules in the physical layer of orthogonal frequency-division multiplexing (OFDM) applications such as IEEE 802.11n/ac/ad [1], IEEE 802.15.3c [2], and IEEE 802.16e [3]. Hence, various FFT processors have been proposed [2] to satisfy real-time processing requirements and reduce hardware complexity [3]-[11].
Most FFT architectures fall into two categories: 1) memory-based architectures and 2) pipelined architectures. Memory-based architectures were proposed to achieve a smaller area [3], whereas pipelined FFT architectures [4]-[11] achieve high throughput rates and low latency, which makes them suitable for real-time applications. According to the dataflow scheme, pipelined FFT architectures can be classified into single-path delay feedback (SDF) architectures [5], multi-path delay feedback (MDF) architectures [9]-[11], and multi-path delay commutator (MDC) architectures [6]-[8].
In current real-time applications, many parallel pipelined FFT architectures have been proposed [6]-[11] to provide very high throughput rates. The number of delay elements in MDF architectures [9]-[11] is less than that in SDF architectures [5]. Recently, parallel MDC architectures based on radix-2^n algorithms have been proposed in [6]-[8] to achieve high throughput rates and hardware efficiency, as an improvement on radix-2 and radix-4 algorithms. In [8], a radix-8 pipelined MDC architecture improved the area efficiency by using data shuffling structures. However, the radix-8 algorithm cannot handle 128- and 256-point FFTs, because these sizes are not powers of eight. In contrast, the proposed FFT processor supports 128- and 256-point FFTs as well as the 512-point FFT. Moreover, the proposed processor is designed based on the radix-4 and radix-2 algorithms, which can significantly reduce the area.
In this paper, we propose an eight-parallel mixed-radix
MDC architecture for low hardware complexity. An area-
efficient scheduling scheme is proposed to reduce the size
of read-only memories (ROMs) for storing twiddle factors.
This paper is organized as follows. Section II describes
FFT algorithms for the proposed architecture. Section III
provides the proposed mixed-radix MDC FFT architecture
in detail. Section IV presents the design and implementa-
tion results of the proposed FFT processor. Finally, the
conclusion is presented in Section V.
II. FFT ALGORITHMS
The discrete Fourier transform (DFT) of length N is defined as

$$
X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}
$$

where x(n) and X(k) denote the input and output of the DFT, respectively, and $W_N^{nk}$ denotes the twiddle factor, a power of the Nth primitive root of unity $W_N$, with its exponent evaluated modulo N [12]:

$$
W_N^{nk} = e^{-j(2\pi nk/N)} = \cos(2\pi nk/N) - j\sin(2\pi nk/N). \tag{2}
$$
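As a minimal sketch of definitions (1) and (2), the direct DFT can be written in a few lines of Python. The function names `twiddle` and `dft` are illustrative only and do not come from the paper; the code is a reference model for checking results, not a description of the hardware.

```python
import cmath

def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N); the exponent is reduced modulo N as in (2)."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def dft(x):
    """Direct O(N^2) evaluation of (1): X(k) = sum over n of x(n) * W_N^{nk}."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]

# A unit impulse at n = 0 transforms to a spectrum whose bins all equal 1.
X = dft([1, 0, 0, 0])
```

Because the exponent is taken modulo N, `twiddle(8, 9)` and `twiddle(8, 1)` return the same value, which is the property the later decompositions rely on.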
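Furthermore, (1) can be reformulated as (3) using the two-dimensional index map in (4). Accordingly, (3) consists of two DFT computations: the 64-point DFTs, which are expressed as $G(n_2, k_1)$, and an N/64-point DFT.

$$
\begin{aligned}
X(k_1 + 64k_2) &= \sum_{n_2=0}^{\frac{N}{64}-1} \sum_{n_1=0}^{63} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_N^{\left(\frac{N}{64}n_1 + n_2\right)(k_1 + 64k_2)} \\
&= \sum_{n_2=0}^{\frac{N}{64}-1} \underbrace{\left\{ \sum_{n_1=0}^{63} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_{64}^{n_1k_1} \right\}}_{G(n_2,\,k_1)} W_N^{n_2k_1}\, W_{N/64}^{n_2k_2}
\end{aligned}
\tag{3}
$$

where

$$
\begin{aligned}
n_1 &= 0, 1, \ldots, 63; & n_2 &= 0, 1, \ldots, (N/64 - 1) \\
k_1 &= 0, 1, \ldots, 63; & k_2 &= 0, 1, \ldots, (N/64 - 1) \\
N &= 128, 256, 512.
\end{aligned}
\tag{4}
$$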
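The index map in (3) and (4) can be checked numerically. The following sketch evaluates the 64-point DFTs G(n2, k1), applies the twiddle factors W_N^{n2 k1}, and finishes with the N/64-point DFT. The helper names `w`, `dft`, and `fft_64_by_m` are hypothetical, and the direct DFT is used for the 64-point part purely for brevity.

```python
import cmath

def w(N, e):
    """Twiddle factor W_N^e with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def dft(x):
    """Direct DFT used as the reference and as the 64-point building block."""
    N = len(x)
    return [sum(x[n] * w(N, n * k) for n in range(N)) for k in range(N)]

def fft_64_by_m(x):
    """Evaluate (3): 64-point DFTs, twiddles W_N^{n2*k1}, then N/64-point DFTs."""
    N = len(x)
    M = N // 64  # length of the second DFT: 2, 4, or 8
    X = [0j] * N
    for n2 in range(M):
        # G(n2, k1): 64-point DFT over n1 of the samples x((N/64)*n1 + n2)
        G = dft([x[M * n1 + n2] for n1 in range(64)])
        for k1 in range(64):
            t = G[k1] * w(N, n2 * k1)
            for k2 in range(M):
                # Output index k = k1 + 64*k2, as in the index map (4)
                X[k1 + 64 * k2] += t * w(M, n2 * k2)
    return X
```

Comparing `fft_64_by_m(x)` against `dft(x)` for a 128- or 256-sample input gives agreement up to floating-point rounding.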
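Thus, when N is 128, 256, and 512, the N/64-point DFT is a 2-, 4-, and 8-point DFT, respectively. As these DFTs can be decomposed into radix-2 butterflies, they can be calculated using radix-2, radix-2^2, and radix-2^3, respectively, as expressed in (5).

$$
X(k_1 + 64k_2) =
\begin{cases}
\displaystyle \sum_{\alpha'_1=0}^{1} G(n_2,k_1)\, \underbrace{W_N^{n_2k_1}}_{\text{Stage 4 TF}}\, \underbrace{W_2^{\alpha'_1\beta'_1}}_{\text{Stage 5 BU}}, & N = 128 \\[3ex]
\displaystyle \sum_{\alpha'_2=0}^{1}\sum_{\alpha'_1=0}^{1} G(n_2,k_1)\, \underbrace{W_N^{n_2k_1}}_{\text{Stage 4 TF}}\, \underbrace{W_2^{\alpha'_1\beta'_1}}_{\text{Stage 5 BU}}\, \underbrace{W_4^{\alpha'_2\beta'_1}}_{\text{Stage 5 TF}}\, \underbrace{W_2^{\alpha'_2\beta'_2}}_{\text{Stage 6 BU}}, & N = 256 \\[3ex]
\displaystyle \sum_{\alpha'_3=0}^{1}\sum_{\alpha'_2=0}^{1}\sum_{\alpha'_1=0}^{1} G(n_2,k_1)\, \underbrace{W_N^{n_2k_1}}_{\text{Stage 4 TF}}\, \underbrace{W_2^{\alpha'_1\beta'_1}}_{\text{Stage 5 BU}}\, \underbrace{W_4^{\alpha'_2\beta'_1}}_{\text{Stage 5 TF}}\, \underbrace{W_2^{\alpha'_2\beta'_2}}_{\text{Stage 6 BU}}\, \underbrace{W_8^{\alpha'_3(\beta'_1+2\beta'_2)}}_{\text{Stage 6 TF}}\, \underbrace{W_2^{\alpha'_3\beta'_3}}_{\text{Stage 7 BU}}, & N = 512
\end{cases}
\tag{5}
$$

Here $n_2$ and $k_2$ are written in binary digits $\alpha'_i, \beta'_i \in \{0, 1\}$: $n_2 = \alpha'_1$ and $k_2 = \beta'_1$ for N = 128; $n_2 = 2\alpha'_1 + \alpha'_2$ and $k_2 = \beta'_1 + 2\beta'_2$ for N = 256; and $n_2 = 4\alpha'_1 + 2\alpha'_2 + \alpha'_3$ and $k_2 = \beta'_1 + 2\beta'_2 + 4\beta'_3$ for N = 512.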
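The splitting of the N/64-point twiddle factor into butterfly and twiddle stages can be verified exhaustively. The sketch below assumes that n2 is read most-significant bit first (a1, a2, a3) and k2 least-significant bit first (b1, b2, b3); this digit convention and the names `split_factors` and `max_error` are assumptions made for illustration.

```python
import cmath

def w(N, e):
    """Twiddle factor W_N^e with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def split_factors(a, b):
    """Product of the stage 5 to 7 factors for digit lists a = [a1, ...] and b = [b1, ...]."""
    a1, b1 = a[0], b[0]
    p = w(2, a1 * b1)                                  # stage 5 BU
    if len(a) >= 2:
        a2, b2 = a[1], b[1]
        p *= w(4, a2 * b1) * w(2, a2 * b2)             # stage 5 TF (1 or -j), stage 6 BU
    if len(a) == 3:
        a3, b3 = a[2], b[2]
        p *= w(8, a3 * (b1 + 2 * b2)) * w(2, a3 * b3)  # stage 6 TF (W_8), stage 7 BU
    return p

def max_error(bits):
    """Largest deviation between W_M^{n2*k2} and the factored product, with M = 2**bits."""
    M = 2 ** bits
    err = 0.0
    for n2 in range(M):
        for k2 in range(M):
            a = [(n2 >> (bits - 1 - i)) & 1 for i in range(bits)]  # MSB first
            b = [(k2 >> i) & 1 for i in range(bits)]               # LSB first
            err = max(err, abs(w(M, n2 * k2) - split_factors(a, b)))
    return err
```

Calling `max_error(1)`, `max_error(2)`, and `max_error(3)` covers the 128-, 256-, and 512-point cases; each returns a value at the level of floating-point rounding.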
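$$
\begin{aligned}
G(n_2, k_1) &= \sum_{n_1=0}^{63} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_{64}^{n_1k_1} \\
&= \sum_{\alpha_4=0}^{1}\sum_{\alpha_3=0}^{1}\sum_{\alpha_2=0}^{3}\sum_{\alpha_1=0}^{3} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_{64}^{(16\alpha_1+4\alpha_2+2\alpha_3+\alpha_4)(\beta_1+4\beta_2+16\beta_3+32\beta_4)} \\
&= \sum_{\alpha_4=0}^{1}\sum_{\alpha_3=0}^{1}\sum_{\alpha_2=0}^{3}\sum_{\alpha_1=0}^{3} x\!\left(\tfrac{N}{64}n_1 + n_2\right)
\underbrace{W_4^{\alpha_1\beta_1}}_{\text{Stage 1 BU}}\,
\underbrace{W_{16}^{\alpha_2\beta_1}}_{\text{Stage 1 TF}}\,
\underbrace{W_4^{\alpha_2\beta_2}}_{\text{Stage 2 BU}} \\
&\qquad \times
\underbrace{W_{64}^{(2\alpha_3+\alpha_4)(\beta_1+4\beta_2)}}_{\text{Stage 2 TF}}\,
\underbrace{W_2^{\alpha_3\beta_3}}_{\text{Stage 3 BU}}\,
\underbrace{W_4^{\alpha_4\beta_3}}_{\text{Stage 3 TF}}\,
\underbrace{W_2^{\alpha_4\beta_4}}_{\text{Stage 4 BU}}
\end{aligned}
\tag{6}
$$

with $n_1 = 16\alpha_1 + 4\alpha_2 + 2\alpha_3 + \alpha_4$ and $k_1 = \beta_1 + 4\beta_2 + 16\beta_3 + 32\beta_4$, where

$$
\begin{aligned}
\alpha_1 &= 0, 1, 2, 3; & \alpha_2 &= 0, 1, 2, 3; & \alpha_3 &= 0, 1; & \alpha_4 &= 0, 1 \\
\beta_1 &= 0, 1, 2, 3; & \beta_2 &= 0, 1, 2, 3; & \beta_3 &= 0, 1; & \beta_4 &= 0, 1.
\end{aligned}
\tag{7}
$$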
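The stage-by-stage factorization of the 64-point twiddle factor can be confirmed by brute force over all digit combinations in (7). The sketch below uses the index maps n1 = 16a1 + 4a2 + 2a3 + a4 and k1 = b1 + 4b2 + 16b3 + 32b4; the function name `max_error_64` is illustrative.

```python
import cmath
from itertools import product

def w(N, e):
    """Twiddle factor W_N^e with the exponent reduced modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)

def max_error_64():
    """Largest deviation between W_64^{n1*k1} and the product of the stage factors."""
    err = 0.0
    for a1, a2, a3, a4 in product(range(4), range(4), range(2), range(2)):
        n1 = 16 * a1 + 4 * a2 + 2 * a3 + a4
        for b1, b2, b3, b4 in product(range(4), range(4), range(2), range(2)):
            k1 = b1 + 4 * b2 + 16 * b3 + 32 * b4
            staged = (w(4, a1 * b1)                            # stage 1 BU (radix-4)
                      * w(16, a2 * b1)                         # stage 1 TF
                      * w(4, a2 * b2)                          # stage 2 BU (radix-4)
                      * w(64, (2 * a3 + a4) * (b1 + 4 * b2))   # stage 2 TF
                      * w(2, a3 * b3)                          # stage 3 BU (radix-2)
                      * w(4, a4 * b3)                          # stage 3 TF (1 or -j)
                      * w(2, a4 * b4))                         # stage 4 BU (radix-2)
            err = max(err, abs(w(64, n1 * k1) - staged))
    return err
```

All cross terms that are multiples of 64 in the exponent drop out, which is why only seven factors remain.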
Therefore, this paper proposes a decomposition for calculating the 128-, 256-, and 512-point DFTs using (5) and (6). The twiddle factors required at each stage of these decompositions are summarized in Table I; the mixed method in Table I indicates that the twiddle factors should be calculated according to the FFT size.
III. PROPOSED FFT ARCHITECTURE
Using the radix-4^2 and radix-2^2 FFT algorithms in Module-1 and the radix-2^n FFT algorithm in Module-2, we propose the mixed-radix MDC FFT architecture illustrated in Fig. 1. To perform 128-, 256-, and 512-point FFT operations, the proposed FFT processor consists of seven stages. Stages 1, 2, 3, and 4 are used in common, whereas stages 5, 6, and 7 are selectively reconfigured according to the two selection bits shown in Table II. The proposed FFT architecture employs an MDC structure including butterfly units (BUs), complex multipliers, complex
Figure 1. Proposed mixed-radix MDC FFT architecture. [Block diagram of the FFT processor: Module-1 and Module-2 across the pipeline stages, built from radix-4 and radix-2 BUs, commutators with delay elements, -j multipliers, complex multipliers, and complex constant multipliers, with multiplexers controlled by S1 and S0.]
TABLE I. 128-, 256-, AND 512-POINT FFT TWIDDLE FACTOR COMPUTATION.

FFT size       Stage 1   Stage 2   Stage 3   Stage 4   Stage 5   Stage 6
128-point      W_16      W_64      -j        W_128
256-point      W_16      W_64      -j        W_256     -j
512-point      W_16      W_64      -j        W_512     -j        W_8
Mixed method   W_16      W_64      -j        W_512     -j        W_8
TABLE II. MULTIPLEXER SELECTION BITS

FFT size   S1   S0
128        0    0
256        0    1
512        1    1
constant multipliers, delay elements, and commutators.
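The reconfiguration described above can be summarized as a small lookup. This is only a behavioral sketch: the names `SELECT_BITS` and `configure` are hypothetical, and the code does not model which of stages 5 to 7 are bypassed, since the text does not spell that out. It only records the select bits of Table II and the number of radix-2 stages that the N/64-point DFT in Module-2 needs.

```python
# (S1, S0) select bits per FFT size, copied from Table II.
SELECT_BITS = {128: (0, 0), 256: (0, 1), 512: (1, 1)}

def configure(n_points):
    """Return the select bits and the number of radix-2 stages needed in Module-2."""
    if n_points not in SELECT_BITS:
        raise ValueError("supported FFT sizes are 128, 256, and 512")
    s1, s0 = SELECT_BITS[n_points]
    # Module-1 computes the 64-point DFT; Module-2 computes the remaining
    # N/64-point DFT, which takes log2(N/64) radix-2 stages.
    module2_stages = (n_points // 64).bit_length() - 1
    return {"S1": s1, "S0": s0, "module2_radix2_stages": module2_stages}

config_256 = configure(256)  # {'S1': 0, 'S0': 1, 'module2_radix2_stages': 2}
```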
A. Proposed data scheduling scheme in stage 2
The proposed FFT processor requires the twiddle factor $W_{64}^{(2\alpha_3+\alpha_4)(\beta_1+4\beta_2)}$ in stage 2. We modified the conventional structure by changing the connections of the commutator. The proposed commutator blocks between stage 2 and stage 3 rearrange the output data samples of the radix-4 BU, which reduces the number of multipliers.
With the new data scheduling scheme, the proposed architecture removes the complex constant multipliers in paths 1 and 5, as shown in Fig. 2. The new data scheduling scheme therefore reduces the number of complex constant multipliers from eight to six.
B. Proposed data scheduling scheme in stage 4
The twiddle factor in stage 4 is $W_{512}^{n_2(\beta_1+4\beta_2+16\beta_3+32\beta_4)} = W_{512}^{n_2k_1}$. Changing the location of the data samples in stage 4 affects the twiddle factor multiplications. As shown in Fig. 3, using the proposed scheduling scheme, three of the eight $W_{512}$ multiplications can be replaced with two $W_{256}$ and one $W_{128}$, and one of the eight $W_{512}$ is not required.
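One way to see where these counts come from is to reduce W_512^{n2 k1} for each value of n2 from 0 to 7. The sketch below assumes that the scheduling lets each of the eight parallel paths handle a single value of n2, which is an interpretation rather than something stated explicitly; the function name `smallest_rom_base` is illustrative.

```python
from math import gcd

def smallest_rom_base(n2, N=512):
    """Smallest M such that W_N^{n2*k1} can be read from a W_M table, or None if n2 = 0."""
    if n2 == 0:
        return None         # W_N^0 = 1, so no multiplication is needed
    g = gcd(n2, N)
    return N // g           # W_N^{n2*k1} = W_{N/g}^{(n2/g)*k1}

bases = [smallest_rom_base(n2) for n2 in range(8)]
# bases == [None, 512, 256, 512, 128, 512, 256, 512]
```

Under this assumption, four paths need a W_512 table, two need W_256, one needs W_128, and one needs no multiplier, which agrees with the counts given in the text.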
The twiddle factor multiplication is one of the major contributors to the area of the FFT processor because it requires both memories and complex multipliers [15]. The existing processor [10] requires ROMs with 1024 stored words, whereas the proposed FFT processor requires ROMs with only 672 stored words by using the data mapping scheme shown in Fig. 4. Therefore, the size of the twiddle factor LUTs in stage 4 is reduced by 34.4% compared with the existing structure [10].
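The word counts can be reproduced with a short calculation. The sketch assumes that each W_N ROM stores N/4 words; this per-ROM depth is not stated in the text, but it matches both quoted totals (eight W_512 ROMs give 1024 words, and four W_512, two W_256, and one W_128 give 672 words). The name `rom_words` is hypothetical.

```python
def rom_words(rom_bases, words_per_rom=lambda n: n // 4):
    """Total stored words for a list of ROMs, each identified by its twiddle base N."""
    return sum(words_per_rom(n) for n in rom_bases)

# Assumption: a W_N ROM holds N/4 words.
existing = rom_words([512] * 8)                      # 1024 words
proposed = rom_words([512] * 4 + [256] * 2 + [128])  # 672 words
reduction_percent = 100 * (existing - proposed) / existing  # 34.375
```

The result, 34.375%, rounds to the 34.4% reduction reported for stage 4.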
IV. RESULTS AND COMPARISONS
Based on fixed-point simulation results, the proposed FFT processor was designed with a 12-bit word length and synthesized using a Samsung 65-nm CMOS standard cell library. The proposed processor can operate at up to 330 MHz. For comparison across different technologies, the normalized area based on [8] is expressed in the following equation:

$$
\text{Normalized Area} = \frac{\text{Area}}{(\text{Tech.} / 65\ \text{nm})^2} \tag{8}
$$
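Equation (8) can be applied directly to the area and technology figures listed in Table III. The following sketch does so; the function name `normalized_area` and the dictionary layout are illustrative.

```python
def normalized_area(area_mm2, tech_nm, ref_nm=65.0):
    """Scale an area to the 65-nm reference node using (8)."""
    return area_mm2 / (tech_nm / ref_nm) ** 2

# (area in mm^2, technology in nm) as listed in Table III
designs = {
    "Proposed": (0.21, 65),
    "[8]": (0.88, 65),
    "[10]": (0.78, 90),
    "[11]": (1.69, 130),
}
normalized = {name: round(normalized_area(a, t), 2) for name, (a, t) in designs.items()}
# normalized: Proposed 0.21, [8] 0.88, [10] 0.41, [11] 0.42
```

The 65-nm designs are unchanged, while the 90-nm and 130-nm designs shrink by factors of about 1.92 and 4, respectively.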
Figure 2. Stages 2 and 3 of the proposed FFT architecture. [Diagram: radix-4 BUs in stage 2, W_64 complex constant multipliers, commutators with delay elements, and radix-2 BUs with -j multipliers in stage 3.]

Figure 3. Decomposition of three different FFT lengths. [Diagram: stage 4 with radix-2 BUs, commutators, complex multipliers, and seven twiddle factor ROMs: four W_512, two W_256, and one W_128.]

Figure 4. Eight regions of the mapping scheme.

TABLE III. PERFORMANCE COMPARISONS

                          Proposed           [8]        [10]                  [11]
Technology                65 nm              65 nm      90 nm                 130 nm
Architecture              MDC                MDC        MDF                   MDF
Size                      512/256/128        512        512                   512
Datapath type             8                  8          8                     8
Algorithm                 Mixed radix-2/4    Radix-8    Modified radix-2^5    Mixed radix-2^2/2^3/2^4
Word length (bits)        12                 12         12                    14
SQNR (dB)                 33                 N/A        35                    N/A
Frequency (MHz)           330                330        310                   220
Throughput (GSample/s)    2.64               2.64       2.48                  1.76
Area (mm^2)               0.21               0.88       0.78                  1.69
Normalized area (mm^2)    0.21               0.88       0.41                  0.42

As summarized in Table III, the proposed FFT processor operates at 330 MHz and its throughput is 2.64 GSample/s. The throughput is the same as that in [8] and higher than those in [10] and [11]. The normalized areas in [8], [10], [11], and the proposed FFT processor are 0.88 mm^2, 0.41 mm^2, 0.42 mm^2, and 0.21 mm^2, respectively. In summary, the proposed FFT processor additionally supports 128- and 256-point operations, unlike [8], [10], and [11]. Furthermore, its clock rate and throughput are higher than those in [10] and [11] because the proposed FFT architecture is MDC. Therefore, the proposed FFT processor achieves the smallest normalized area among the compared designs, with a throughput that matches [8] and exceeds [10] and [11]. Because it supports multiple FFT sizes, it can be applied to OFDM systems such as IEEE 802.11n/ac/ad.
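The throughput figures in Table III follow from the parallelism and the clock frequency. The sketch below assumes that each of the eight parallel paths accepts one sample per clock cycle, which is consistent with all four table entries; the function name `throughput_gsps` is illustrative.

```python
def throughput_gsps(parallel_paths, clock_mhz):
    """Throughput in GSample/s when every path accepts one sample per clock cycle."""
    return parallel_paths * clock_mhz / 1000.0

clock_mhz = {"Proposed": 330, "[8]": 330, "[10]": 310, "[11]": 220}
rates = {name: throughput_gsps(8, f) for name, f in clock_mhz.items()}
# rates: Proposed 2.64, [8] 2.64, [10] 2.48, [11] 1.76
```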
V. CONCLUSION
This paper proposed an area-efficient mixed-radix MDC FFT processor for various OFDM systems such as IEEE 802.11n/ac/ad. The proposed FFT processor can be reconfigured for 128-, 256-, and 512-point FFTs. It adopts a scheduling scheme to reduce the number of complex multipliers and complex constant multipliers. The performance results show that the proposed FFT processor can achieve 2.64 GSample/s at 330 MHz. Moreover, unlike the processors in [8], [10], and [11], it supports multiple FFT sizes. Thus, it can be applied to various OFDM systems such as IEEE 802.11n/ac/ad.
ACKNOWLEDGMENT
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00309-002) supervised by the IITP (Institute for Information & communications Technology Promotion), by the National Research Foundation of Korea under the framework of international cooperation program (NRF-2016K2A9A2A12003787), and by IDEC (IC Design Education Center).
REFERENCES
[1] IEEE P802.11-Task Group AD, http://www.ieee802.org/11/
[2] M. Garrido, F. Qureshi, J. Takala, and O. Gustafsson, Hardware architectures for the fast Fourier transform, 3rd ed. Handbook of Signal Processing Systems, Springer, 2018.
[3] S. J. Huang and S. G. Chen, “A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems,” IEEE Trans. on Circuits and Syst. I, vol. 59, no. 8, pp. 1752–1765, Aug. 2012.
[4] Fang-Li Yuan, Yi-Hsien Lin, Chih-Feng Wu, Muh-Tian Shiue, and Chorng-Kuang Wang, “A 256-Point dataflow scheduling 2×2 MIMO FFT/IFFT processor for IEEE 802.16 WMAN,” in Proc. IEEE Asian Solid-State Circuits Conference (A-SSCC), Nov. 2008, pp. 309–312, doi: 10.1109/ASSCC.2008.4708789.
[5] C. T. Lin, Y. C. Yu, and L. D. Van, “Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1058–1071, Aug. 2008.
[6] M. Garrido, J. Grajal, M. Sánchez, and O. Gustafsson, “Pipelined radix-2^k feedforward FFT architectures,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 23–32, Jan. 2013.
[7] M. Ayinala and K. K. Parhi, “Parallel Pipelined FFT Architectures with Reduced Number of Delays,” in Proc. ACM Great Lakes Symp. on VLSI (GLSVLSI), May 2012, pp. 63–66, doi: 10.1145/2206781.2206798.
[8] T. Ahmed, M. Garrido, and O. Gustafsson, “A 512-point 8-parallel pipelined feedforward FFT for WPAN,” in Proc. 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Nov. 2011, pp. 981–984, doi: 10.1109/ACSSC.2011.6190157.
[9] Y. Chen, Y.-W. Lin, Y.-C. Taso, and C.-Y. Lee, “A 2.4-Gsample/s DVFS FFT processor for MIMO OFDM communication systems,” IEEE J. of Solid-State Circuits, vol. 43, no. 5, pp. 1260–1273, May 2008.
[10] T. Cho and H. Lee, “A High-Speed Low-Complexity Modified Radix-2^5 FFT Processor for High Rate WPAN Applications,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 21, pp. 187–191, Jan. 2013.
[11] C. Wang, Y. Yan, and X. Fu, “A High-Throughput Low-complexity Radix-2^4-2^2-2^3 FFT/IFFT Processor with Parallel and Normal Input/Output Order for IEEE 802.11ad Systems,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp. 2728–2732, Nov. 2015.
[12] A. V. Oppenheim and R. W. Schafer, Discrete-time signal processing. Englewood Cliffs: Prentice Hall, 1989.
[13] F. Qureshi and O. Gustafsson, “Low-complexity reconfigurable complex constant multiplication for FFTs,” in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), 2009, pp. 1137–1140, doi: 10.1109/ISCAS.2009.5117961.
[14] M. Garrido, F. Qureshi, and O. Gustafsson, “Low-Complexity Multiplierless Constant Rotators Based on Combined Coefficient Selection and Shift-and-Add Implementation (CCSSI),” IEEE Trans. on Circuits and Syst. I, vol. 61, no. 7, pp. 2002–2012, Jul. 2014.
[15] F. Qureshi, S. A. Alam, and O. Gustafsson, “4K-Point FFT Algorithms based on optimized twiddle factor multiplication for FPGAs,” in Proc. 2010 Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics (PrimeAsia), Sep. 2010, pp. 225–228, doi: 10.1109/PRIMEASIA.2010.5604921.