(Open Access) Area-Efficient Scheduling Scheme Based FFT Processor for Various OFDM Systems (2018) | Jeong Keun Jang

Area-Efficient Scheduling Scheme Based FFT Processor for Various OFDM Systems

Jeong Keun Jang

Dongbu Hitek

Bucheon, Korea

jeongkeun.jang@dbhitek.com

Ho Keun Kim, Myung Hoon Sunwoo

Department of Electrical and Computer

Engineering

Ajou University

Suwon, Korea

hokeun92@ajou.ac.kr, sunwoo@ajou.ac.kr

Oscar Gustafsson

Department of Electrical Engineering

Linköping University

Linköping, Sweden

oscar.gustafsson@liu.se

Abstract: This paper presents an area-efficient fast Fourier transform (FFT) processor for orthogonal frequency-division multiplexing (OFDM) systems based on a multi-path delay commutator (MDC) architecture. A data scheduling scheme is proposed to reduce the number of complex constant multipliers. The proposed mixed-radix MDC FFT processor supports 128-, 256-, and 512-point FFT sizes. The processor was synthesized using the Samsung 65-nm CMOS standard cell library and, with eight parallel data paths, achieves a throughput of up to 2.64 GSample/s at 330 MHz.

Keywords: fast Fourier transform (FFT); high throughput; low hardware complexity; mixed-radix multi-path delay commutator (MRMDC); orthogonal frequency-division multiplexing (OFDM) systems

I. INTRODUCTION

The fast Fourier transform (FFT) is a well-known algorithm for computing the discrete Fourier transform. The FFT plays an important role in fields such as communication systems, biomedical applications, and sensor and radar signal processing. Moreover, the FFT processor is a module of high computational complexity in the physical layer of orthogonal frequency-division multiplexing (OFDM) applications such as IEEE 802.11n/ac/ad [1], IEEE 802.15.3c [2], and IEEE 802.16e [3]. Hence, various FFT processors have been proposed [2] to satisfy real-time processing requirements and reduce hardware complexity [3]-[11].

Most FFT architectures can be divided into two categories: 1) memory-based architectures and 2) pipelined architectures. Memory-based architectures were proposed to achieve a smaller area [3], whereas pipelined FFT architectures [4]-[11] achieve high throughput rates and low latency, which makes them suitable for real-time applications. According to the dataflow scheme, pipelined FFT architectures can be classified into single-path delay feedback (SDF) architectures [5], multi-path delay feedback (MDF) architectures [9]-[11], and multi-path delay commutator (MDC) architectures [6]-[8].

In current real-time applications, many parallel pipelined FFT architectures have been proposed [6]-[11] to provide very high throughput rates. The number of delay elements in MDF architectures [9]-[11] is less than that in SDF architectures [5]. Recently, parallel MDC architectures have been proposed in [6]-[8] for achieving high throughput rates and hardware efficiency based on radix-$2^k$ algorithms as an improvement on radix-2 and radix-4 algorithms. In [8], a radix-8 pipelined MDC architecture improved the area efficiency by using data shuffling structures. However, the radix-8 algorithm cannot handle 128- and 256-point FFTs. In contrast, the proposed FFT processor supports both 128- and 256-point FFTs. Moreover, the proposed processor is designed based on the radix-4 and radix-2 algorithms, which can significantly reduce the area.

In this paper, we propose an eight-parallel mixed-radix MDC architecture for low hardware complexity. An area-efficient scheduling scheme is proposed to reduce the size of the read-only memories (ROMs) that store the twiddle factors.

This paper is organized as follows. Section II describes the FFT algorithms for the proposed architecture. Section III presents the proposed mixed-radix MDC FFT architecture in detail. Section IV presents the design and implementation results of the proposed FFT processor. Finally, the conclusion is presented in Section V.

II. FFT ALGORITHMS

The discrete Fourier transform (DFT) of length N is defined as

$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{nk}, \quad k = 0, 1, \ldots, N-1, \tag{1}$$

where x(n) and X(k) denote the input and output of the DFT, respectively, and $W_N$ denotes the Nth primitive root of unity, with its exponent evaluated modulo N [12]:

$$W_N^{nk} = e^{-j(2\pi/N)nk} = \cos(2\pi nk/N) - j\sin(2\pi nk/N). \tag{2}$$

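Furthermore, (1) can be reformulated as (3) using the 2-dimensional index map $n = (N/64)n_1 + n_2$, $k = k_1 + 64k_2$ with the index ranges in (4). Moreover, (3) consists of two DFT computations: the 64-point DFTs, which are expressed as $G(n_2, k_1)$, and the N/64-point DFT.

$$\begin{aligned}
X(k_1 + 64k_2) &= \sum_{n_2=0}^{N/64-1} \sum_{n_1=0}^{63} x\!\left(\tfrac{N}{64}n_1 + n_2\right) W_N^{\left(\frac{N}{64}n_1 + n_2\right)(k_1 + 64k_2)} \\
&= \sum_{n_2=0}^{N/64-1} \underbrace{\left[\sum_{n_1=0}^{63} x(n_1, n_2)\, W_{64}^{n_1 k_1}\right]}_{G(n_2,\,k_1)} W_N^{n_2 k_1}\, W_{N/64}^{n_2 k_2}
\end{aligned} \tag{3}$$

where $x(n_1, n_2)$ is shorthand for $x((N/64)n_1 + n_2)$ and

$$\begin{cases}
n_1 = 0, 1, \ldots, 63; & n_2 = 0, 1, \ldots, (N/64 - 1) \\
k_1 = 0, 1, \ldots, 63; & k_2 = 0, 1, \ldots, (N/64 - 1)
\end{cases} \quad N = 128, 256, 512. \tag{4}$$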
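As an illustrative check of the index map behind (3) and (4), the following Python sketch (standard library only) computes an N-point DFT through the 64-point DFTs $G(n_2, k_1)$, the twiddle factors $W_N^{n_2 k_1}$, and an N/64-point DFT, and compares the result with the direct definition in (1). The function names and the map $n = (N/64)n_1 + n_2$, $k = k_1 + 64k_2$ are assumptions made for this example; the sketch models the arithmetic only, not the pipelined hardware.

```python
import cmath


def w(n, e):
    """Twiddle factor W_n^e = exp(-j*2*pi*e/n), as in (2)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def dft(x):
    """Direct DFT following the definition in (1)."""
    n = len(x)
    return [sum(x[i] * w(n, i * k) for i in range(n)) for k in range(n)]


def dft_via_index_map(x):
    """N-point DFT (N = 128, 256 or 512) computed as in (3).

    Assumed index map: n = (N/64)*n1 + n2 and k = k1 + 64*k2.
    """
    n = len(x)
    m = n // 64  # length of the second DFT, N/64
    out = [0j] * n
    for k1 in range(64):
        # G(n2, k1): one 64-point DFT over n1 for every n2
        g = [sum(x[m * n1 + n2] * w(64, n1 * k1) for n1 in range(64))
             for n2 in range(m)]
        for k2 in range(m):
            # twiddle factor W_N^(n2*k1), then the N/64-point DFT over n2
            out[k1 + 64 * k2] = sum(
                g[n2] * w(n, n2 * k1) * w(m, n2 * k2) for n2 in range(m))
    return out


if __name__ == "__main__":
    samples = [complex((3 * i) % 7, (5 * i) % 11) for i in range(128)]
    direct = dft(samples)
    mapped = dft_via_index_map(samples)
    print(max(abs(a - b) for a, b in zip(direct, mapped)))
```

The printed value is the largest deviation between the two results and should be at the level of floating-point rounding error.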

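Thus, when N is 128, 256, and 512, the N/64-point DFT is a 2-, 4-, and 8-point DFT, respectively. As these 2-, 4-, and 8-point DFTs can be decomposed using radix-$2^k$, they can be calculated using radix-2, radix-$2^2$, and radix-$2^3$, respectively, as expressed in (5).

$$X(k_1 + 64k_2) =
\begin{cases}
\displaystyle\sum_{n_2=0}^{1} G(n_2, k_1)\, W_{128}^{n_2 k_1}\, W_2^{n_2 k_2}, & N = 128 \\[2ex]
\displaystyle\sum_{\alpha'_2=0}^{1}\sum_{\alpha'_1=0}^{1} G(n_2, k_1)\, W_{256}^{n_2 k_1}\, W_2^{\alpha'_1\beta'_1}\, W_4^{\alpha'_2\beta'_1}\, W_2^{\alpha'_2\beta'_2}, & N = 256 \\[2ex]
\displaystyle\sum_{\alpha'_3=0}^{1}\sum_{\alpha'_2=0}^{1}\sum_{\alpha'_1=0}^{1} G(n_2, k_1)\, W_{512}^{n_2 k_1}\, W_2^{\alpha'_1\beta'_1}\, W_4^{\alpha'_2\beta'_1}\, W_2^{\alpha'_2\beta'_2}\, W_8^{\alpha'_3(\beta'_1 + 2\beta'_2)}\, W_2^{\alpha'_3\beta'_3}, & N = 512
\end{cases} \tag{5}$$

Here $n_2 = 2\alpha'_1 + \alpha'_2$ and $k_2 = \beta'_1 + 2\beta'_2$ for N = 256, and $n_2 = 4\alpha'_1 + 2\alpha'_2 + \alpha'_3$ and $k_2 = \beta'_1 + 2\beta'_2 + 4\beta'_3$ for N = 512, with every $\alpha'_i$ and $\beta'_i$ taking the values 0 and 1. Each $W_2$ factor corresponds to a radix-2 butterfly (BU) stage, and each remaining factor, including $W_N^{n_2 k_1}$, corresponds to a twiddle factor (TF) multiplication stage.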
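The radix-$2^3$ case of (5) rests on splitting $W_8^{n_2 k_2}$ into butterfly and twiddle factor terms. The sketch below checks that splitting exhaustively, assuming the binary index maps $n_2 = 4\alpha'_1 + 2\alpha'_2 + \alpha'_3$ and $k_2 = \beta'_1 + 2\beta'_2 + 4\beta'_3$; the function names are illustrative only.

```python
import cmath
from itertools import product


def w(n, e):
    """Twiddle factor W_n^e = exp(-j*2*pi*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def radix8_max_error():
    """Largest deviation between W_8^(n2*k2) and its radix-2^3 factorization.

    Assumed index maps: n2 = 4*a1 + 2*a2 + a3, k2 = b1 + 2*b2 + 4*b3,
    with every digit in {0, 1}.
    """
    worst = 0.0
    for a1, a2, a3, b1, b2, b3 in product(range(2), repeat=6):
        n2 = 4 * a1 + 2 * a2 + a3
        k2 = b1 + 2 * b2 + 4 * b3
        direct = w(8, n2 * k2)
        staged = (w(2, a1 * b1)                 # radix-2 butterfly
                  * w(4, a2 * b1)               # trivial twiddle (1 or -j)
                  * w(2, a2 * b2)               # radix-2 butterfly
                  * w(8, a3 * (b1 + 2 * b2))    # W_8 twiddle
                  * w(2, a3 * b3))              # radix-2 butterfly
        worst = max(worst, abs(direct - staged))
    return worst


if __name__ == "__main__":
    print(radix8_max_error())
```

Dropping the factors that involve $\alpha'_3$ and $\beta'_3$ gives the radix-$2^2$ (N = 256) case in the same way.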

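$$\begin{aligned}
G(n_2, k_1) &= \sum_{\alpha_4=0}^{1}\sum_{\alpha_3=0}^{1}\sum_{\alpha_2=0}^{3}\sum_{\alpha_1=0}^{3} x(n_1, n_2)\, W_{64}^{(16\alpha_1 + 4\alpha_2 + 2\alpha_3 + \alpha_4)(\beta_1 + 4\beta_2 + 16\beta_3 + 32\beta_4)} \\
&= \sum_{\alpha_4=0}^{1}\sum_{\alpha_3=0}^{1}\sum_{\alpha_2=0}^{3}\sum_{\alpha_1=0}^{3} x(n_1, n_2)\,
\underbrace{W_4^{\alpha_1\beta_1}}_{\text{Stage 1 BU}}\,
\underbrace{W_{16}^{\alpha_2\beta_1}}_{\text{Stage 1 TF}}\,
\underbrace{W_4^{\alpha_2\beta_2}}_{\text{Stage 2 BU}}\,
\underbrace{W_{64}^{(2\alpha_3 + \alpha_4)(\beta_1 + 4\beta_2)}}_{\text{Stage 2 TF}}\,
\underbrace{W_2^{\alpha_3\beta_3}}_{\text{Stage 3 BU}}\,
\underbrace{W_4^{\alpha_4\beta_3}}_{\text{Stage 3 TF}}\,
\underbrace{W_2^{\alpha_4\beta_4}}_{\text{Stage 4 BU}}
\end{aligned} \tag{6}$$

where $n_1 = 16\alpha_1 + 4\alpha_2 + 2\alpha_3 + \alpha_4$, $k_1 = \beta_1 + 4\beta_2 + 16\beta_3 + 32\beta_4$, and

$$\begin{cases}
\alpha_1 = 0, 1, 2, 3; \ \alpha_2 = 0, 1, 2, 3; \ \alpha_3 = 0, 1; \ \alpha_4 = 0, 1 \\
\beta_1 = 0, 1, 2, 3; \ \beta_2 = 0, 1, 2, 3; \ \beta_3 = 0, 1; \ \beta_4 = 0, 1.
\end{cases} \tag{7}$$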
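The staged form of the 64-point DFT can be checked numerically. The following sketch assumes the mixed-radix digits of (7) with $n_1 = 16\alpha_1 + 4\alpha_2 + 2\alpha_3 + \alpha_4$ and $k_1 = \beta_1 + 4\beta_2 + 16\beta_3 + 32\beta_4$, and confirms that $W_{64}^{n_1 k_1}$ equals the product of the per-stage butterfly and twiddle factors for every digit combination. All names are illustrative.

```python
import cmath
from itertools import product


def w(n, e):
    """Twiddle factor W_n^e = exp(-j*2*pi*e/n)."""
    return cmath.exp(-2j * cmath.pi * e / n)


def n1_index(a1, a2, a3, a4):
    """Input index of the 64-point DFT built from its mixed-radix digits."""
    return 16 * a1 + 4 * a2 + 2 * a3 + a4


def k1_index(b1, b2, b3, b4):
    """Output index of the 64-point DFT built from its mixed-radix digits."""
    return b1 + 4 * b2 + 16 * b3 + 32 * b4


def staged_twiddle(a, b):
    """Product of the butterfly and twiddle terms of stages 1 to 4."""
    a1, a2, a3, a4 = a
    b1, b2, b3, b4 = b
    return (w(4, a1 * b1) * w(16, a2 * b1)                  # stage 1: BU, TF
            * w(4, a2 * b2)                                 # stage 2: BU
            * w(64, (2 * a3 + a4) * (b1 + 4 * b2))          # stage 2: TF
            * w(2, a3 * b3) * w(4, a4 * b3)                 # stage 3: BU, TF
            * w(2, a4 * b4))                                # stage 4: BU


# Digit ranges from (7): two radix-4 digits followed by two radix-2 digits.
DIGITS = list(product(range(4), range(4), range(2), range(2)))


def max_error():
    """Largest deviation between W_64^(n1*k1) and the staged product."""
    return max(abs(w(64, n1_index(*a) * k1_index(*b)) - staged_twiddle(a, b))
               for a in DIGITS for b in DIGITS)


if __name__ == "__main__":
    print(max_error())
```

The stage 3 twiddle factor $W_4^{\alpha_4\beta_3}$ only takes the values 1 and -j, which is why it needs no multiplier.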

Therefore, this paper proposes a decomposition for calculating the 128-, 256-, and 512-point DFTs using (5) and (6). The twiddle factors required in each stage of these decompositions are summarized in Table I; the mixed method in Table I indicates that the twiddle factors should be calculated according to the FFT size.

III. PROPOSED FFT ARCHITECTURE

Using the radix-4 and radix-2 FFT algorithms in Module-1 and the radix-2 FFT algorithm in Module-2, we propose the mixed-radix MDC FFT architecture illustrated in Fig. 1. To perform 128-, 256-, and 512-point FFT operations, the proposed FFT processor consists of seven stages. Stages 1, 2, 3, and 4 are used in common, whereas stages 5, 6, and 7 are selectively reconfigured according to two selection bits, as shown in Table II. The proposed FFT architecture employs MDC architectures including butterfly units (BUs), complex multipliers, complex constant multipliers, delay elements, and commutators.

Figure 1. Proposed mixed-radix MDC FFT architecture (Module-1 and Module-2, built from radix-4 BUs, radix-2 BUs, commutators, complex multipliers, complex constant multipliers, constant multipliers, and -j rotations).

TABLE I. 128-, 256-, AND 512-POINT FFT TWIDDLE FACTOR COMPUTATION
(Rows: 128-point, 256-point, and 512-point FFT; columns: stages 1 to 6. Entries surviving in the source: -j, the size-dependent twiddle factors $W_{128}$, $W_{256}$, and $W_{512}$, and "Mixed method".)

TABLE II. MULTIPLEXER SELECTION BITS

FFT size    S
128         0 0
256         0 1
512         1 1

A. Proposed data scheduling scheme in stage 2

The proposed FFT processor requires the twiddle factor $W_{64}^{(2\alpha_3 + \alpha_4)(\beta_1 + 4\beta_2)}$ in stage 2. The proposed commutator modifies the conventional structure by changing the connections between the data paths. As a result, the proposed commutator blocks between stage 2 and stage 3 reduce the number of multipliers by rearranging the output data samples of the radix-4 BU.

By using the new data scheduling scheme, the proposed architecture can remove the complex constant multipliers in paths 1 and 5, as shown in Fig. 2. Therefore, the new data scheduling scheme reduces the number of complex constant multipliers from eight to six.

B. Proposed data scheduling scheme in stage 4

The twiddle factor in stage 4 is $W_{512}^{n_2(\beta_1 + 4\beta_2 + 16\beta_3 + 32\beta_4)} = W_{512}^{n_2 k_1}$. Changing the location of the data samples in stage 4 affects the twiddle factor multiplications. As shown in Fig. 3, using the proposed scheduling scheme, three of the eight $W_{512}$ twiddle factor ROMs can be replaced with two $W_{256}$ ROMs and one $W_{128}$ ROM, and one of the eight $W_{512}$ ROMs is not required.

The twiddle factor multiplication is one of the major contributors to the area of the FFT processor because it requires both memories and complex multipliers [15]. The existing processor [10] requires ROMs with 1024 stored words, whereas the proposed FFT processor requires ROMs with 672 stored words by using the data mapping scheme shown in Fig. 4. Therefore, the size of the twiddle factor LUTs in stage 4 is reduced by 34.4% compared with the existing structure [10].
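The relative ROM saving follows directly from the two word counts given in the text (1024 words in [10] and 672 words in the proposed design). A minimal Python sketch of that arithmetic, with an illustrative function name:

```python
def rom_saving(existing_words, proposed_words):
    """Fraction of twiddle factor ROM words saved relative to the existing design."""
    return 1.0 - proposed_words / existing_words


if __name__ == "__main__":
    # 1 - 672/1024 = 0.34375, i.e. a reduction of about 34.4%
    print(f"{rom_saving(1024, 672):.1%}")
```

Equivalently, the proposed ROMs hold 672/1024, or about 65.6%, of the words stored in [10].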

IV. RESULTS AND COMPARISONS

Based on the fixed-point simulation results, the proposed FFT processor with a 12-bit word length was synthesized using a Samsung 65-nm CMOS standard cell library. The proposed processor can operate at up to 330 MHz. For comparison across different technologies, the normalized area based on [8] is expressed in the following equation:

$$\text{Normalized Area} = \frac{\text{Area}}{(\text{Tech.} / 65\ \text{nm})^2} \tag{8}$$
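As a quick check of the normalization, the sketch below applies it to the areas and technology nodes listed in Table III. It assumes that area scales with the square of the feature size, which reproduces the normalized values reported in the table; the function name is illustrative.

```python
def normalized_area(area_mm2, tech_nm, ref_nm=65.0):
    """Scale a reported area to the reference node.

    Assumption: area scales with the square of the feature size.
    """
    return area_mm2 / (tech_nm / ref_nm) ** 2


if __name__ == "__main__":
    # (design, reported area in mm^2, technology in nm) from Table III
    designs = [("Proposed", 0.21, 65), ("[8]", 0.88, 65),
               ("[10]", 0.78, 90), ("[11]", 1.69, 130)]
    for name, area, tech in designs:
        print(f"{name}: {normalized_area(area, tech):.2f} mm^2")
```

For example, the 0.78 mm^2 reported at 90 nm in [10] scales to about 0.41 mm^2 at 65 nm, and the 1.69 mm^2 reported at 130 nm in [11] scales to about 0.42 mm^2.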

Figure 2. Stages 2 and 3 of the proposed FFT architecture.

Figure 3. Decomposition of three different FFT lengths (stage 4 with four $W_{512}$ ROMs, two $W_{256}$ ROMs, and one $W_{128}$ ROM).

Figure 4. Eight regions of the mapping scheme.

TABLE III. PERFORMANCE COMPARISONS

                          Proposed            [8]        [10]                  [11]
Technology                65 nm               65 nm      90 nm                 130 nm
Architecture              MDC                 MDC        MDF                   MDF
Size                      512/256/128         512        512                   512
Datapath type             8                   8          8                     8
Algorithm                 Mixed radix-2/4     Radix-8    Modified radix-2^5    Mixed radix-
Word length (bits)        12                  12         12                    14
SQNR (dB)                 33                  N/A        35                    N/A
Frequency (MHz)           330                 330        310                   220
Throughput (GSample/s)    2.64                2.64       2.48                  1.76
Area (mm^2)               0.21                0.88       0.78                  1.69
Normalized area (mm^2)    0.21                0.88       0.41                  0.42

As summarized in Table III, the proposed FFT processor operates at 330 MHz and its throughput is 2.64 GSample/s. The throughput is the same as that in [8] and higher than those in [10] and [11]. The normalized areas of [8], [10], [11], and the proposed FFT processor are 0.88 mm^2, 0.41 mm^2, 0.42 mm^2, and 0.21 mm^2, respectively. In addition, unlike [8], [10], and [11], the proposed FFT processor supports 128- and 256-point operations. Furthermore, the clock rate and throughput are higher than those in [10] and [11] because the proposed FFT architecture is MDC. Therefore, the proposed FFT processor achieves the smallest normalized area together with a throughput equal to or higher than those of the FFT processors in [8], [10], and [11]. Because it supports multiple FFT sizes, it can be applied to OFDM systems such as IEEE 802.11n/ac/ad.
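The throughput figures in Table III are consistent with each of the eight parallel data paths accepting one sample per clock cycle. A minimal sketch under that assumption (the function name is illustrative):

```python
def throughput_gsps(parallel_paths, clock_mhz):
    """Throughput in GSample/s, assuming one sample per path per clock cycle."""
    return parallel_paths * clock_mhz / 1000.0


if __name__ == "__main__":
    # (design, clock frequency in MHz) from Table III, all with eight data paths
    for name, freq in [("Proposed", 330), ("[8]", 330), ("[10]", 310), ("[11]", 220)]:
        print(f"{name}: {throughput_gsps(8, freq):.2f} GSample/s")
```

With eight paths, 330 MHz, 310 MHz, and 220 MHz give 2.64, 2.48, and 1.76 GSample/s, matching the reported values.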

V. CONCLUSION

This paper proposed an area-efficient mixed-radix MDC FFT processor for various OFDM systems such as IEEE 802.11n/ac/ad. The proposed FFT processor can be reconfigured for 128-, 256-, and 512-point FFTs. It adopts a scheduling scheme to reduce the number of complex multipliers and complex constant multipliers. The implementation results show that the proposed FFT processor achieves 2.64 GSample/s at 330 MHz. Moreover, unlike the processors in [8], [10], and [11], the proposed FFT processor supports multiple FFT sizes and can therefore be applied to various OFDM systems.

ACKNOWLEDGMENT

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00309-002) supervised by the IITP (Institute for Information & communications Technology Promotion), by the National Research Foundation of Korea under the framework of international cooperation program (NRF-2016K2A9A2A12003787), and by IDEC (IC Design Education Center).

REFERENCES

[1] IEEE P802.11-Task Group AD, http://www.ieee802.org/11/

[2] M. Garrido, F. Qureshi, J. Takala, and O. Gustafsson, Hardware architectures for the fast Fourier transform, 3rd ed. Handbook of Signal Processing Systems, Springer, 2018.

[3] S. J. Huang and S. G. Chen, “A high-throughput radix-16 FFT processor with parallel and normal input/output ordering for IEEE 802.15.3c systems,” IEEE Trans. on Circuits and Syst. I, vol. 59, no. 8, pp. 1752-1765, Aug. 2012.

[4] F.-L. Yuan, Y.-H. Lin, C.-F. Wu, M.-T. Shiue, and C.-K. Wang, “A 256-point dataflow scheduling 2×2 MIMO FFT/IFFT processor for IEEE 802.16 WMAN,” in Proc. IEEE Asian Solid-State Circuits Conference (A-SSCC), Nov. 2008, pp. 309-312, doi: 10.1109/ASSCC.2008.4708789.

[5] C. T. Lin, Y. C. Yu, and L. D. Van, “Cost-effective triple-mode reconfigurable pipeline FFT/IFFT/2-D DCT processor,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 16, no. 8, pp. 1058-1071, Aug. 2008.

[6] M. Garrido, J. Grajal, M. Sánchez, and O. Gustafsson, “Pipelined radix-$2^k$ feedforward FFT architectures,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, pp. 23-32, Jan. 2013.

[7] M. Ayinala and K. K. Parhi, “Parallel pipelined FFT architectures with reduced number of delays,” in Proc. ACM Great Lakes Symp. on VLSI (GLSVLSI), May 2012, pp. 63-66, doi: 10.1145/2206781.2206798.

[8] T. Ahmed, M. Garrido, and O. Gustafsson, “A 512-point 8-parallel pipelined feedforward FFT for WPAN,” in Proc. 2011 Conference Record of the Forty Fifth Asilomar Conference on Signals, Systems and Computers (ASILOMAR), Nov. 2011, pp. 981-984, doi: 10.1109/ACSSC.2011.6190157.

[9] Y. Chen, Y.-W. Lin, Y.-C. Tsao, and C.-Y. Lee, “A 2.4-Gsample/s DVFS FFT processor for MIMO OFDM communication systems,” IEEE J. of Solid-State Circuits, vol. 43, no. 5, pp. 1260-1273, May 2008.

[10] T. Cho and H. Lee, “A high-speed low-complexity modified radix-$2^5$ FFT processor for high rate WPAN applications,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 21, pp. 187-191, Jan. 2013.

[11] C. Wang, Y. Yan, and X. Fu, “A high-throughput low-complexity radix-$2^4$-$2^2$-$2^3$ FFT/IFFT processor with parallel and normal input/output order for IEEE 802.11ad systems,” IEEE Trans. on Very Large Scale Integr. (VLSI) Syst., vol. 23, no. 11, pp. 2728-2732, Nov. 2015.

[12] A. V. Oppenheim and R. W. Schafer, Discrete-time signal processing. Englewood Cliffs: Prentice Hall, 1989.

[13] F. Qureshi and O. Gustafsson, “Low-complexity reconfigurable complex constant multiplication for FFTs,” in Proc. IEEE Int. Symp. on Circuits and Systems (ISCAS), 2009, pp. 1137-1140, doi: 10.1109/ISCAS.2009.5117961.

[14] M. Garrido, F. Qureshi, and O. Gustafsson, “Low-complexity multiplierless constant rotators based on combined coefficient selection and shift-and-add implementation (CCSSI),” IEEE Trans. on Circuits and Syst. I, vol. 61, no. 7, pp. 2002-2012, Jul. 2014.

[15] F. Qureshi, S. A. Alam, and O. Gustafsson, “4K-point FFT algorithms based on optimized twiddle factor multiplication for FPGAs,” in Proc. 2010 Asia Pacific Conference on Postgraduate Research in Microelectronics and Electronics (PrimeAsia), Sep. 2010, pp. 225-228, doi: 10.1109/PRIMEASIA.2010.5604921.
