
Area-Efficient Scheduling Scheme Based FFT Processor for Various OFDM Systems

Abstract
This paper presents an area-efficient fast Fourier transform (FFT) processor for orthogonal frequency-division multiplexing systems based on multi-path delay commutator architecture. This paper proposes a data scheduling scheme to reduce the number of complex constant multipliers. The proposed mixed-radix multi-path delay commutator FFT processor can support 128-, 256-, and 512-point FFT sizes. The proposed processor was synthesized using the Samsung 65-nm CMOS standard cell library. The proposed processor with eight parallel data paths can achieve a high throughput rate of up to 2.64 GSample/s at 330 MHz.


Jeong Keun Jang
Dongbu Hitek
Bucheon, Korea
jeongkeun.jang@dbhitek.com
Ho Keun Kim, Myung Hoon Sunwoo
Department of Electrical and Computer
Engineering
Ajou University
Suwon, Korea
hokeun92@ajou.ac.kr, sunwoo@ajou.ac.kr
Oscar Gustafsson
Department of Electrical Engineering
Linköping University
Linköping, Sweden
oscar.gustafsson@liu.se
Keywords: fast Fourier transform (FFT); high throughput; low hardware complexity; mixed-radix multi-path delay commutator (MRMDC); orthogonal frequency-division multiplexing (OFDM) systems
I. INTRODUCTION
The fast Fourier transform (FFT) is a well-known algorithm for computing the discrete Fourier transform. The FFT plays an important role in fields such as communication systems, biomedical applications, and sensor and radar signal processing. Moreover, the FFT processor is a computationally intensive module in the physical layer of orthogonal frequency-division multiplexing (OFDM) applications such as IEEE 802.11n/ac/ad [1], IEEE 802.15.3c [2], and IEEE 802.16e [3]. Hence, various FFT processors have been proposed [2] to satisfy real-time processing requirements and reduce hardware complexity [3]-[11].
Most FFT architectures can be divided into two categories: 1) memory-based architectures and 2) pipelined architectures. Memory-based architectures were proposed to achieve a smaller area [3], whereas pipelined FFT architectures [4]-[11] can achieve high throughput rates and low latency, which makes them suitable for real-time applications. According to the dataflow scheme, pipelined FFT architectures can be classified into single-path delay feedback (SDF) architectures [5], multi-path delay feedback (MDF) architectures [9]-[11], and multi-path delay commutator (MDC) architectures [6]-[8].
In current real-time applications, many parallel pipelined FFT architectures have been proposed [6]-[11] to provide very high throughput rates. The number of delay elements in MDF architectures [9]-[11] is less than that in SDF architectures [5]. Recently, parallel MDC architectures have been proposed in [6]-[8] for achieving high throughput rates and hardware efficiency based on radix-$2^n$ algorithms as an improvement on radix-2 and radix-4 algorithms. In [8], a radix-8 pipelined MDC architecture improved the area efficiency by using data shuffling structures. However, the radix-8 algorithm cannot handle 128- and 256-point FFTs. In contrast, the proposed FFT processor supports 128- and 256-point FFTs in addition to the 512-point FFT. Moreover, the proposed processor is designed based on the radix-4 and radix-2 algorithms, which can significantly reduce the area.
In this paper, we propose an eight-parallel mixed-radix MDC architecture with low hardware complexity. An area-efficient scheduling scheme is proposed to reduce the size of the read-only memories (ROMs) that store the twiddle factors. This paper is organized as follows. Section II describes the FFT algorithms for the proposed architecture. Section III presents the proposed mixed-radix MDC FFT architecture in detail. Section IV presents the design and implementation results of the proposed FFT processor. Finally, Section V concludes the paper.
II. FFT ALGORITHMS
The discrete Fourier transform (DFT) of length N is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \tag{1}$$

where x(n) and X(k) denote the input and output of the DFT, respectively, and $W_N^{nk}$ denotes the Nth primitive root of unity, with its exponent evaluated modulo N [12]:

$$W_N^{nk} = e^{-j(2\pi nk/N)} = \cos(2\pi nk/N) - j\sin(2\pi nk/N) \tag{2}$$
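Furthermore, (1) can be reformulated as (3) using the 2-dimensional index map in (4). Moreover, (3) consists of two DFT computations: 64-point DFTs, which are expressed as $G(n_2, k_1)$, and an N/64-point DFT.

$$
\begin{aligned}
X(k_1 + 64k_2) &= \sum_{n_2=0}^{\frac{N}{64}-1} \sum_{n_1=0}^{63} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_N^{\left(\frac{N}{64}n_1 + n_2\right)(k_1 + 64k_2)} \\
&= \sum_{n_2=0}^{\frac{N}{64}-1} \Bigg\{ \underbrace{\left[ \sum_{n_1=0}^{63} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_{64}^{n_1k_1} \right]}_{G(n_2,\,k_1)} W_N^{n_2k_1} \Bigg\}\, W_{N/64}^{n_2k_2}
\end{aligned}
\tag{3}
$$

where

$$
\begin{cases}
n = \tfrac{N}{64}n_1 + n_2, & n_1 = 0, 1, \ldots, 63;\; n_2 = 0, 1, \ldots, (N/64 - 1) \\
k = k_1 + 64k_2, & k_1 = 0, 1, \ldots, 63;\; k_2 = 0, 1, \ldots, (N/64 - 1) \\
N = 128, 256, 512.
\end{cases}
\tag{4}
$$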
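The two-step decomposition can be checked numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the DFT once directly from the definition and once as 64-point DFTs followed by N/64-point DFTs, using the index map n = (N/64)n1 + n2, k = k1 + 64k2. The function names and the test signal are illustrative assumptions.

```python
import cmath


def twiddle(N, e):
    """Return W_N^e = exp(-j*2*pi*e/N), with the exponent taken modulo N."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)


def dft_direct(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    N = len(x)
    return [sum(x[n] * twiddle(N, n * k) for n in range(N)) for k in range(N)]


def dft_64_by_m(x):
    """Evaluate the same DFT as 64-point DFTs followed by N/64-point DFTs."""
    N = len(x)
    M = N // 64  # length of the second-dimension DFT (2, 4, or 8)
    # G[n2][k1]: 64-point DFT over n1 of the decimated samples x[M*n1 + n2]
    G = [[sum(x[M * n1 + n2] * twiddle(64, n1 * k1) for n1 in range(64))
          for k1 in range(64)]
         for n2 in range(M)]
    X = [0j] * N
    for k1 in range(64):
        for k2 in range(M):
            # Inter-dimension twiddle W_N^{n2*k1}, then the M-point DFT over n2
            X[k1 + 64 * k2] = sum(
                G[n2][k1] * twiddle(N, n2 * k1) * twiddle(M, n2 * k2)
                for n2 in range(M))
    return X


def max_abs_error(N):
    """Largest deviation between the two evaluations for an arbitrary test signal."""
    x = [complex((3 * n) % 7 - 3, (5 * n) % 11 - 5) for n in range(N)]
    return max(abs(a - b) for a, b in zip(dft_direct(x), dft_64_by_m(x)))
```

Calling `max_abs_error(128)` or `max_abs_error(256)` should return a value at the level of floating-point rounding, which indicates that the two formulations agree.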
Thus, when N is 128, 256, and 512, the N/64-point DFT is a 2-, 4-, and 8-point DFT, respectively. As these 2-, 4-, and 8-point DFTs can be folded using radix-2, they can be calculated using radix-2, radix-$2^2$, and radix-$2^3$, respectively, as expressed in (5).
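$$
X(k_1 + 64k_2) =
\begin{cases}
\displaystyle \sum_{n_2=0}^{1} G(n_2, k_1)\, \underbrace{W_{128}^{n_2k_1}}_{\text{Stage 4 TF}}\, \underbrace{W_2^{n_2k_2}}_{\text{Stage 5 BU}}, & N = 128 \\[2ex]
\displaystyle \sum_{\alpha'_2=0}^{1} \sum_{\alpha'_1=0}^{1} G(n_2, k_1)\, \underbrace{W_{256}^{n_2k_1}}_{\text{Stage 4 TF}}\, \underbrace{W_2^{\alpha'_1\beta'_1}}_{\text{Stage 5 BU}}\, \underbrace{W_4^{\alpha'_2\beta'_1}}_{\text{Stage 5 TF}}\, \underbrace{W_2^{\alpha'_2\beta'_2}}_{\text{Stage 6 BU}}, & N = 256 \\[2ex]
\displaystyle \sum_{\alpha''_3=0}^{1} \sum_{\alpha''_2=0}^{1} \sum_{\alpha''_1=0}^{1} G(n_2, k_1)\, \underbrace{W_{512}^{n_2k_1}}_{\text{Stage 4 TF}}\, \underbrace{W_2^{\alpha''_1\beta''_1}}_{\text{Stage 5 BU}}\, \underbrace{W_4^{\alpha''_2\beta''_1}}_{\text{Stage 5 TF}}\, \underbrace{W_2^{\alpha''_2\beta''_2}}_{\text{Stage 6 BU}}\, \underbrace{W_8^{\alpha''_3(\beta''_1 + 2\beta''_2)}}_{\text{Stage 6 TF}}\, \underbrace{W_2^{\alpha''_3\beta''_3}}_{\text{Stage 7 BU}}, & N = 512
\end{cases}
\tag{5}
$$

where $n_2 = 2\alpha'_1 + \alpha'_2$ and $k_2 = \beta'_1 + 2\beta'_2$ for N = 256, $n_2 = 4\alpha''_1 + 2\alpha''_2 + \alpha''_3$ and $k_2 = \beta''_1 + 2\beta''_2 + 4\beta''_3$ for N = 512, and every primed index takes the values 0 and 1.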
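As an illustration of the radix-$2^3$ folding for N = 512, the sketch below computes an 8-point DFT as three radix-2 butterfly stages separated by a W_4 twiddle (1 or -j) and a W_8 twiddle, and compares it with the direct 8-point DFT. It assumes the index split n2 = 4a1 + 2a2 + a3 and k2 = b1 + 2b2 + 4b3 with binary digits; this split and all names are assumptions chosen to be consistent with the stage labels in Table I, not a transcription of the authors' hardware.

```python
import cmath
from itertools import product


def w(N, e):
    """W_N^e = exp(-j*2*pi*e/N)."""
    return cmath.exp(-2j * cmath.pi * (e % N) / N)


def dft8_direct(g):
    """Direct 8-point DFT, used as the reference."""
    return [sum(g[n2] * w(8, n2 * k2) for n2 in range(8)) for k2 in range(8)]


def dft8_staged(g):
    """8-point DFT as stages 5-7: BU, W_4 twiddle, BU, W_8 twiddle, BU."""
    # Stage 5: radix-2 butterfly over a1, then twiddle W_4^{a2*b1} (1 or -j)
    s5 = {(b1, a2, a3):
          (g[2 * a2 + a3] + (-1) ** b1 * g[4 + 2 * a2 + a3]) * w(4, a2 * b1)
          for b1, a2, a3 in product(range(2), repeat=3)}
    # Stage 6: radix-2 butterfly over a2, then twiddle W_8^{a3*(b1 + 2*b2)}
    s6 = {(b1, b2, a3):
          (s5[b1, 0, a3] + (-1) ** b2 * s5[b1, 1, a3]) * w(8, a3 * (b1 + 2 * b2))
          for b1, b2, a3 in product(range(2), repeat=3)}
    # Stage 7: radix-2 butterfly over a3; output index k2 = b1 + 2*b2 + 4*b3
    out = [0j] * 8
    for b1, b2, b3 in product(range(2), repeat=3):
        out[b1 + 2 * b2 + 4 * b3] = s6[b1, b2, 0] + (-1) ** b3 * s6[b1, b2, 1]
    return out


def staged_error(g):
    """Largest deviation between the staged and the direct 8-point DFT."""
    return max(abs(a - b) for a, b in zip(dft8_staged(g), dft8_direct(g)))
```

In this sketch the only non-trivial multiplications are the -j rotation after the first butterfly and the W_8 rotation after the second, which matches the stage 5 and stage 6 entries of the 512-point row in Table I.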
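$$
\begin{aligned}
G(n_2, k_1) &= \sum_{n_1=0}^{63} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_{64}^{n_1k_1} \\
&= \sum_{\alpha_4=0}^{1} \sum_{\alpha_3=0}^{1} \sum_{\alpha_2=0}^{3} \sum_{\alpha_1=0}^{3} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_{64}^{(16\alpha_1 + 4\alpha_2 + 2\alpha_3 + \alpha_4)(\beta_1 + 4\beta_2 + 16\beta_3 + 32\beta_4)} \\
&= \sum_{\alpha_4=0}^{1} \sum_{\alpha_3=0}^{1} \sum_{\alpha_2=0}^{3} \sum_{\alpha_1=0}^{3} x\!\left(\tfrac{N}{64}n_1 + n_2\right) \underbrace{W_4^{\alpha_1\beta_1}}_{\text{Stage 1 BU}}\, \underbrace{W_{16}^{\alpha_2\beta_1}}_{\text{Stage 1 TF}}\, \underbrace{W_4^{\alpha_2\beta_2}}_{\text{Stage 2 BU}} \\
&\qquad \times \underbrace{W_{64}^{(2\alpha_3 + \alpha_4)(\beta_1 + 4\beta_2)}}_{\text{Stage 2 TF}}\, \underbrace{W_2^{\alpha_3\beta_3}}_{\text{Stage 3 BU}}\, \underbrace{W_4^{\alpha_4\beta_3}}_{\text{Stage 3 TF}}\, \underbrace{W_2^{\alpha_4\beta_4}}_{\text{Stage 4 BU}}
\end{aligned}
\tag{6}
$$

where $n_1 = 16\alpha_1 + 4\alpha_2 + 2\alpha_3 + \alpha_4$, $k_1 = \beta_1 + 4\beta_2 + 16\beta_3 + 32\beta_4$, and

$$
\begin{aligned}
\alpha_1 &= 0, 1, 2, 3; \quad \alpha_2 = 0, 1, 2, 3; \quad \alpha_3 = 0, 1; \quad \alpha_4 = 0, 1 \\
\beta_1 &= 0, 1, 2, 3; \quad \beta_2 = 0, 1, 2, 3; \quad \beta_3 = 0, 1; \quad \beta_4 = 0, 1.
\end{aligned}
\tag{7}
$$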
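The factorization of the 64-point twiddle factor into butterfly (BU) and twiddle-factor (TF) stages can be verified exhaustively with integer arithmetic. The sketch below assumes the index split n1 = 16a1 + 4a2 + 2a3 + a4 and k1 = b1 + 4b2 + 16b3 + 32b4, with a1, a2, b1, b2 in 0..3 and a3, a4, b3, b4 in 0..1. It compares exponents modulo 64, so no floating-point tolerance is involved; the function names are illustrative.

```python
from itertools import product


def check_w64_factorization():
    """Compare W_64^{n1*k1} with the staged product, as exponents modulo 64."""
    for a1, a2, b1, b2 in product(range(4), repeat=4):
        for a3, a4, b3, b4 in product(range(2), repeat=4):
            n1 = 16 * a1 + 4 * a2 + 2 * a3 + a4
            k1 = b1 + 4 * b2 + 16 * b3 + 32 * b4
            staged = (16 * a1 * b1                       # W_4^{a1*b1},  stage 1 BU
                      + 4 * a2 * b1                      # W_16^{a2*b1}, stage 1 TF
                      + 16 * a2 * b2                     # W_4^{a2*b2},  stage 2 BU
                      + (2 * a3 + a4) * (b1 + 4 * b2)    # W_64^{...},   stage 2 TF
                      + 32 * a3 * b3                     # W_2^{a3*b3},  stage 3 BU
                      + 16 * a4 * b3                     # W_4^{a4*b3},  stage 3 TF
                      + 32 * a4 * b4)                    # W_2^{a4*b4},  stage 4 BU
            if (n1 * k1 - staged) % 64 != 0:
                return False
    return True


def distinct_tf_exponents():
    """Distinct exponents needed by the stage 1 and stage 3 twiddle factors."""
    return {
        "stage 1 (W_16)": sorted({a2 * b1 for a2, b1 in product(range(4), repeat=2)}),
        "stage 3 (W_4)": sorted({a4 * b3 for a4, b3 in product(range(2), repeat=2)}),
    }
```

The stage 3 twiddle only ever uses the exponents 0 and 1 of W_4, that is, the factors 1 and -j, which is why Table I lists -j for stage 3.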
Therefore, this paper proposes a decomposition for calculating the 128-, 256-, and 512-point DFTs using (5) and (6). The twiddle factors required at each stage of these decompositions are summarized in Table I; the mixed method in Table I indicates that the twiddle factors should be calculated according to the FFT size.
III. PROPOSED FFT ARCHITECTURE
Using the radix-$4^2$ and radix-$2^2$ FFT algorithms in Module-1 and the radix-$2^n$ FFT algorithm in Module-2, we propose the mixed-radix MDC FFT architecture illustrated in Fig. 1. To perform 128-, 256-, and 512-point FFT operations, the proposed FFT processor consists of seven stages. Stages 1, 2, 3, and 4 are used in common, whereas stages 5, 6, and 7 are selectively reconfigured according to the two selection bits listed in Table II. The proposed FFT architecture employs an MDC structure comprising butterfly units (BUs), complex multipliers, complex constant multipliers, delay elements, and commutators.

Figure 1. Proposed mixed-radix MDC FFT architecture.

TABLE I. 128-, 256-, AND 512-POINT FFT TWIDDLE FACTOR COMPUTATION

FFT size       Stage 1   Stage 2   Stage 3   Stage 4   Stage 5   Stage 6
128-point      W_16      W_64      -j        W_128
256-point      W_16      W_64      -j        W_256     -j
512-point      W_16      W_64      -j        W_512     -j        W_8
Mixed method   W_16      W_64      -j        W_512     -j        W_8

TABLE II. MULTIPLEXER SELECTION BITS

FFT size   S_1   S_0
128        0     0
256        0     1
512        1     1
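The reconfiguration described above can be summarized as a small lookup. This is a sketch only: the selection bits are taken from Table II, while the `last_stage` field is an assumption derived from the decomposition (the N/64-point DFT needs log2(N/64) radix-2 stages after the four common stages), and the names `FFTMode`, `MODES`, `select_mode`, and `active_stages` are hypothetical.

```python
from typing import NamedTuple


class FFTMode(NamedTuple):
    s1: int          # multiplexer selection bit S1 (Table II)
    s0: int          # multiplexer selection bit S0 (Table II)
    last_stage: int  # assumption: final active stage = 4 + log2(N / 64)


MODES = {
    128: FFTMode(s1=0, s0=0, last_stage=5),
    256: FFTMode(s1=0, s0=1, last_stage=6),
    512: FFTMode(s1=1, s0=1, last_stage=7),
}


def select_mode(n_points: int) -> FFTMode:
    """Return the selection bits for a supported FFT size."""
    try:
        return MODES[n_points]
    except KeyError:
        raise ValueError(f"unsupported FFT size: {n_points}") from None


def active_stages(n_points: int) -> list:
    """Stages assumed to take part in the computation for this FFT size."""
    return list(range(1, select_mode(n_points).last_stage + 1))
```

For example, `select_mode(256)` yields S1 = 0 and S0 = 1, and `active_stages(256)` lists stages 1 through 6 under the stated assumption.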
A. Proposed data scheduling scheme in stage 2
The proposed FFT processor requires the twiddle factor $W_{64}^{(2\alpha_3 + \alpha_4)(\beta_1 + 4\beta_2)}$ in stage 2. The proposed commutator modifies the conventional structure by changing its connections: the commutator blocks between stage 2 and stage 3 rearrange the output data samples of the radix-4 BUs, which reduces the number of multipliers.
With the new data scheduling scheme, the proposed architecture can remove the multipliers in paths 1 and 5, as shown in Fig. 2. The number of complex constant multipliers is therefore reduced from eight to six.
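One way to see why a rearrangement can free some paths from multiplication is to group the stage 2 twiddle exponents by the value of 2a3 + a4. The sketch below is an interpretation rather than the authors' scheduling: it only shows that every sample with 2a3 + a4 = 0 has a unity twiddle, so a schedule that gathers those samples (one quarter of the data, or two of eight paths) onto dedicated paths would need no multiplier there. The actual path assignment is given by Fig. 2, and the function names are hypothetical.

```python
from itertools import product


def stage2_exponents_by_group():
    """Group the exponents of W_64^{(2*a3 + a4)*(b1 + 4*b2)} by a = 2*a3 + a4."""
    groups = {a: set() for a in range(4)}
    for a3, a4 in product(range(2), repeat=2):
        a = 2 * a3 + a4
        for b1, b2 in product(range(4), repeat=2):
            groups[a].add((a * (b1 + 4 * b2)) % 64)
    return groups


def trivial_groups(groups):
    """Groups whose only exponent is 0, i.e. whose twiddle factor is always 1."""
    return [a for a, exps in groups.items() if exps == {0}]
```

Here `trivial_groups(stage2_exponents_by_group())` returns the single group a = 0, while each of the other three groups needs 16 distinct rotations.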
B. Proposed data scheduling scheme in stage 4
The twiddle factor in stage 4 is $W_{512}^{n_2(\beta_1 + 4\beta_2 + 16\beta_3 + 32\beta_4)} = W_{512}^{n_2k_1}$. Changing the location of the data samples in stage 4 affects the twiddle factor multiplications. As shown in Fig. 3, using the proposed scheduling scheme, three of the eight $W_{512}$ multiplications can be replaced with two $W_{256}$ and one $W_{128}$, and one of the eight $W_{512}$ is not required.
Twiddle factor multiplication, which requires both memories and complex multipliers, is one of the major contributors to the area of an FFT processor [15]. The existing processor [10] requires ROMs with 1024 stored words, whereas the proposed FFT processor requires ROMs with 672 stored words by using the data mapping scheme shown in Fig. 4. Therefore, the size of the twiddle factor LUTs in stage 4 is reduced by 34.4% compared with the existing structure [10].
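The ROM counts follow from reducing each twiddle factor $W_{512}^{n_2k_1}$ to the smallest equivalent root of unity for a fixed n2. The sketch below does this for n2 = 0..7 and then estimates the stored words. The sizing rule of N/4 words per W_N ROM is a hypothetical assumption that happens to reproduce the 1024 and 672 words stated in the text; the source does not spell out how each ROM is sized, and all function names are illustrative.

```python
from math import gcd


def stage4_rom_periods(N=512, n2_values=range(8)):
    """For each n2, the smallest M with W_N^{n2*k1} = W_M^{(n2/g)*k1}; None if n2 == 0."""
    periods = []
    for n2 in n2_values:
        if n2 == 0:
            periods.append(None)  # W^0 = 1: no ROM or multiplier needed
        else:
            periods.append(N // gcd(n2, N))
    return periods


def words_per_rom(period):
    """Hypothetical sizing rule: a W_M ROM stores M/4 words."""
    return period // 4


def rom_words(periods):
    """Total stored words over all ROMs that are actually needed."""
    return sum(words_per_rom(p) for p in periods if p is not None)


periods = stage4_rom_periods()            # [None, 512, 256, 512, 128, 512, 256, 512]
proposed_words = rom_words(periods)       # 672 under the sizing assumption
baseline_words = rom_words([512] * 8)     # 1024: eight W_512 ROMs
saving_percent = 100 * (baseline_words - proposed_words) / baseline_words  # 34.375
```

The resulting list contains four $W_{512}$, two $W_{256}$, and one $W_{128}$ entries plus one trivial entry, which agrees with the replacement described above, and the saving of 34.375% rounds to the quoted 34.4%.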
IV. RESULTS AND COMPARISONS
Based on fixed-point simulation results, a 12-bit word length was chosen for the proposed FFT processor, which was synthesized using a Samsung 65-nm CMOS standard cell library. The proposed processor can operate at up to 330 MHz. For comparison across different technologies, the normalized area, following [8], is defined as

$$\text{Normalized Area} = \frac{\text{Area}}{(\text{Tech.} / 65\ \text{nm})^2} \tag{8}$$
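As a quick check of the normalization, the sketch below applies the formula to the area and technology values reported in Table III. The function name and the dictionary layout are illustrative choices.

```python
def normalized_area(area_mm2, tech_nm, ref_nm=65.0):
    """Scale an area to the reference technology node: Area / (Tech / 65 nm)^2."""
    return area_mm2 / (tech_nm / ref_nm) ** 2


# (area in mm^2, technology in nm) as reported in Table III
designs = {
    "Proposed": (0.21, 65),
    "[8]": (0.88, 65),
    "[10]": (0.78, 90),
    "[11]": (1.69, 130),
}

normalized = {name: round(normalized_area(area, tech), 2)
              for name, (area, tech) in designs.items()}
```

The rounded results are 0.21, 0.88, 0.41, and 0.42 mm^2, which agree with the normalized-area row of Table III.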
Figure 2. Stages 2 and 3 of the proposed FFT architecture.

Figure 3. Decomposition of three different FFT lengths. (Stage 4 twiddle-factor ROMs: four W_512, two W_256, and one W_128.)

Figure 4. Eight regions of the mapping scheme.

TABLE III. PERFORMANCE COMPARISONS

                          Proposed          [8]       [10]                 [11]
Technology                65 nm             65 nm     90 nm                130 nm
Architecture              MDC               MDC       MDF                  MDF
Size                      512/256/128       512       512                  512
Parallel data paths       8                 8         8                    8
Algorithm                 Mixed radix-2/4   Radix-8   Modified radix-2^5   Mixed radix-2^2/2^3/2^4
Word length (bits)        12                12        12                   14
SQNR (dB)                 33                N/A       35                   N/A
Frequency (MHz)           330               330       310                  220
Throughput (GSample/s)    2.64              2.64      2.48                 1.76
Area (mm^2)               0.21              0.88      0.78                 1.69
Normalized area (mm^2)    0.21              0.88      0.41                 0.42

As summarized in Table III, the proposed FFT processor operates at 330 MHz with a throughput of 2.64 GSample/s. This throughput equals that of [8] and is higher than those of [10] and [11]. The normalized areas of [8], [10], [11], and the proposed FFT processor are 0.88 mm^2, 0.41 mm^2, 0.42 mm^2, and 0.21 mm^2, respectively. In addition, unlike [8], [10], and [11], which support only the 512-point FFT, the proposed FFT processor also supports 128- and 256-point operations. Furthermore, its clock rate and throughput are higher than those of [10] and [11] because the proposed FFT architecture is MDC. Therefore, the proposed FFT processor has the smallest normalized area among the compared designs while matching the highest throughput, and because it supports multiple FFT sizes, it can be applied to OFDM systems such as IEEE 802.11n/ac/ad.
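The throughput figures in Table III are consistent with one sample per parallel path per clock cycle. The following sketch assumes that relationship (it is not stated explicitly in the text) and recomputes the throughput row; the function name is illustrative.

```python
def throughput_gsps(parallel_paths, clock_mhz):
    """Throughput in GSample/s, assuming one sample per path per clock cycle."""
    return parallel_paths * clock_mhz / 1000


# (parallel data paths, clock frequency in MHz) as reported in Table III
reported = {"Proposed": (8, 330), "[8]": (8, 330), "[10]": (8, 310), "[11]": (8, 220)}
throughputs = {name: throughput_gsps(p, f) for name, (p, f) in reported.items()}
```

Under this assumption the values come out as 2.64, 2.64, 2.48, and 1.76 GSample/s, matching the table.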
V. CONCLUSION
This paper proposed an area-efficient mixed-radix MDC FFT processor for various OFDM systems such as IEEE 802.11n/ac/ad. The proposed FFT processor can be reconfigured for 128-, 256-, and 512-point FFTs and adopts a scheduling scheme to reduce the number of complex multipliers and complex constant multipliers. The synthesis results show that the proposed FFT processor can achieve 2.64 GSample/s at 330 MHz. Moreover, unlike [8], [10], and [11], the proposed FFT processor supports multiple FFT sizes. Thus, it can be applied to various OFDM systems such as IEEE 802.11n/ac/ad.
ACKNOWLEDGMENT
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00309-002) supervised by the IITP (Institute for Information & communications Technology Promotion), by the National Research Foundation of Korea under the framework of international cooperation program (NRF-2016K2A9A2A12003787), and by IDEC (IC Design Education Center).
REFERENCES
[1] IEEE P802.11-Task Group AD, http://www.ieee802.org/11/
[2] M. Garrido, F. Qureshi, J. Takala, and O. Gustafsson, “Hardware architectures for the fast Fourier transform,” in Handbook of Signal Processing Systems, 3rd ed. Springer, 2018.
[3] S. J. Huang and S. G. Chen, “A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems,” IEEE Trans. on Circuits and Syst. I, vol. 59, no. 8, pp. 1752-1765, Aug. 2012.
[4] Fang-Li Yuan, Yi-Hsien Lin, Chih-Feng Wu, Muh-Tian Shiue, and Chorng-Kuang Wang, “A 256-point dataflow scheduling 2×2 MIMO FFT/IFFT processor for IEEE 802.16 WMAN,” in Proc. IEEE Asian Solid-State Circuits Conference (A-SSCC), Nov. 2008, pp. 309-312, doi: 10.1109/ASSCC.2008.4708789.
[5] C. T. Lin, Y. C. Yu, and L. D. Van, “Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1058-1071, Aug. 2008.
[6] M. Garrido, J. Grajal, M. Sánchez, and O. Gustafsson, “Pipelined radix-$2^k$ feedforward FFT architectures,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 23-32, Jan. 2013.
[7] M. Ayinala and K. K. Parhi, “Parallel pipelined FFT architectures with reduced number of delays,” in Proc. ACM Great Lakes Symp. on VLSI (GLSVLSI), May 2012, pp. 63-66, doi: 10.1145/2206781.2206798.
[8] T. Ahmed, M. Garrido, and O. Gustafsson, “A 512-point 8-parallel pipelined feedforward FFT for WPAN,” in Proc. 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Nov. 2011, pp. 981-984, doi: 10.1109/ACSSC.2011.6190157.
[9] Y. Chen, Y.-W. Lin, Y.-C. Tsao, and C.-Y. Lee, “A 2.4-Gsample/s DVFS FFT processor for MIMO OFDM communication systems,” IEEE J. of Solid-State Circuits, vol. 43, no. 5, pp. 1260-1273, May 2008.
[10] T. Cho and H. Lee, “A high-speed low-complexity modified radix-$2^5$ FFT processor for high rate WPAN applications,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 21, pp. 187-191, Jan. 2013.
[11] C. Wang, Y. Yan, and X. Fu, “A high-throughput low-complexity radix-$2^4$-$2^2$-$2^3$ FFT/IFFT processor with parallel and normal input/output order for IEEE 802.11ad systems,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp. 2728-2732, Nov. 2015.
[12] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood Cliffs: Prentice Hall, 1989.
[13] F. Qureshi and O. Gustafsson, “Low-complexity reconfigurable complex constant multiplication for FFTs,” in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), 2009, pp. 1137-1140, doi: 10.1109/ISCAS.2009.5117961.
[14] M. Garrido, F. Qureshi, and O. Gustafsson, “Low-complexity multiplierless constant rotators based on combined coefficient selection and shift-and-add implementation (CCSSI),” IEEE Trans. on Circuits and Syst. I, vol. 61, no. 7, pp. 2002-2012, Jul. 2014.
[15] F. Qureshi, S. A. Alam, and O. Gustafsson, “4K-point FFT algorithms based on optimized twiddle factor multiplication for FPGAs,” in Proc. 2010 Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics (PrimeAsia), Sep. 2010, pp. 225-228, doi: 10.1109/PRIMEASIA.2010.5604921.