A Block Scaling FFT/IFFT Processor for WiMAX Applications

01 Dec 2006-pp 203-206
TL;DR: A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce power consumption and hardware cost. By proper scheduling of the two data streams, the proposed design also achieves better hardware utilization.
Abstract: This paper presents a low-power design of a two-stream MIMO FFT/IFFT processor for WiMAX applications. A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce the power consumption and hardware cost. With these schemes, half the memory accesses and 64-Kbit memory can be saved. Furthermore, by proper scheduling of the two data streams, the proposed design achieves better hardware utilization and can process two 2048-point FFTs/IFFTs consecutively within 2052 cycles. A test chip of the proposed FFT/IFFT processor has been designed using UMC 0.13 μm 1P8M process with a core area of 1332×1590 μm². The SQNR performance of the 2048-point FFT/IFFT is over 48 dB for QPSK and 16/64-QAM modulations. Power dissipation of two 2048-point FFT computations is about 17.26 mW at 22.86 MHz, which meets the maximum throughput rate of WiMAX applications.



A Block Scaling FFT/IFFT Processor
for WiMAX Applications
Yuan Chen, Yu-Wei Lin, and Chen-Yi Lee
National Chiao Tung University, Hsinchu, Taiwan
MediaTek Inc., Hsinchu, Taiwan
Email: ychen@si2lab.org
Abstract-This paper presents a low-power design of a two-
stream MIMO FFT/IFFT processor for WiMAX applications. A
novel block scaling method and a new ping-pong cache-memory
architecture are proposed to reduce the power consumption and
hardware cost. With these schemes, half the memory accesses
and 64-Kbit memory can be saved. Furthermore, by proper
scheduling of the two data streams, the proposed design achieves
better hardware utilization and can process two 2048-point
FFTs/IFFTs consecutively within 2052 cycles. A test chip of the
proposed FFT/IFFT processor has been designed using UMC
0.13 μm 1P8M process with a core area of 1332×1590 μm². The
SQNR performance of the 2048-point FFT/IFFT is over 48 dB
for QPSK and 16/64-QAM modulations. Power dissipation of
two 2048-point FFT computations is about 17.26 mW at 22.86
MHz which meets the maximum throughput rate of WiMAX
applications.
I. INTRODUCTION
Multiple-input multiple-output orthogonal frequency
division multiplexing (MIMO OFDM) is considered a key
technology in high-throughput transmissions over wireless
fading channels. The emerging WiMAX/IEEE 802.16
standard has employed this technology in its physical-layer
specification to provide broadband wireless access services.
In the specification, scalable channel bandwidths from 1.25 to
20 MHz by adjusting FFT size (from 128 to 2048-point) are
employed for different applications. Three modulation types
(QPSK, 16/64-QAM) and four guard-interval modes (1/4,
1/8, 1/16, 1/32) are also supported to further increase the
system scalability. A block diagram of a 2×2 MIMO
transceiver for WiMAX applications is shown in Fig. 1. By
processing two data streams with duplicated antennas and
functional units, the peak data rate of the 2×2 MIMO
transceiver can be doubled compared with that of a
single-input single-output (SISO) transceiver.
To support a MIMO transceiver for WiMAX applications,
a variable-length FFT/IFFT processor capable of processing
multiple data streams is required. Since 2×2 MIMO with time
division duplex (TDD) mode is defined in the WiMAX
Forum Release-1 system profiles [1], a two-stream 128/256/
512/1024/2048-point FFT/IFFT processor is considered in
this paper. In addition, because power consumption is critical
for portable systems, the FFT/IFFT processor for WiMAX
applications should be power-efficient. Much research on
low-power FFT design has employed
the cached-memory architecture to reduce the memory
accesses [2], [3]. However, the increase in wordlength [2] or
idle cycles [3] still wastes power and increases
hardware cost. To solve these problems, a novel block scaling
method and a new ping-pong cache-memory architecture are
exploited in our proposed FFT/IFFT processor. With these
schemes, half the memory accesses and 64-Kbit memory (4
bits in wordlength) can be saved without inducing idle cycles.
Moreover, by proper scheduling of the two data streams, the
proposed FFT/IFFT processor avoids stalls of function units
and thus achieves better hardware utilization. Two-stream
2048-point FFTs/IFFTs can be computed consecutively
within 2052 processing cycles.
Fig. 1. Block diagram of a 2×2 MIMO transceiver for WiMAX applications.
II. ALGORITHM
The N-point discrete Fourier transform (DFT) of a complex
input sequence x(n) can be defined as:
$$X(k) = \sum_{n=0}^{N-1} x(n)\, W_N^{kn}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{1}$$
where $W_N^{kn} = e^{-j 2\pi kn/N}$ is referred to as the twiddle factor. To
reduce the number of complex multiplications, radix-8
algorithm is chosen to carry out the DFT [4]. Here we take the
longest 2048-point DFT in the design as an example. Since
2048 is not a power of 8, we decompose the 2048-point DFT
into three radix-8 stages and a final radix-4 stage as shown in
the following equation:
$$
X(k_1 + 8k_2 + 64k_3 + 512k_4) = \sum_{n_4=0}^{3} \left\{ \sum_{n_3=0}^{7} \left\{ \sum_{n_2=0}^{7} \left\{ \sum_{n_1=0}^{7} x(256n_1 + 32n_2 + 4n_3 + n_4)\, W_8^{n_1 k_1}\, W_{2048}^{k_1(32n_2 + 4n_3 + n_4)} \right\} W_8^{n_2 k_2}\, W_{256}^{k_2(4n_3 + n_4)} \right\} W_8^{n_3 k_3}\, W_{32}^{n_4 k_3} \right\} W_4^{n_4 k_4} \tag{2}
$$
where $k_1, k_2, k_3 = 0, 1, 2, \ldots, 7$ and $k_4 = 0, 1, 2, 3$. Similarly, 128/256/
512/1024-point DFT can also be decomposed to preceding
radix-8 stages and a final radix-8/4/2 stage depending on the
DFT size. Although high-radix algorithm is effective in
reducing the number of complex multiplications, its hardware
is very complex if directly implemented. Thus we employ
radix-2³ and radix-2² [4] to replace radix-8 and radix-4,
respectively. A signal flow graph (SFG) of the 32-point
radix-2³/2² FFT is shown in Fig. 2 as an example. We can find
in this figure that a full 32-point FFT is completed by one
radix-2³ and multiplication stage and one radix-2² stage. With
these steps, we can decompose the 2048-point FFT into three
radix-2³ and multiplication stages and a final radix-2² stage
for further hardware implementations.
Fig. 2. SFG of a 32-point radix-2³/2² FFT.
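The index decomposition of eq. (2) can be checked numerically. The following is a minimal sketch in plain Python, not the authors' hardware dataflow: it evaluates the three radix-8 stages and the final radix-4 stage with the index mappings n = 256n1 + 32n2 + 4n3 + n4 and k = k1 + 8k2 + 64k3 + 512k4, and compares a few output bins against the direct DFT of eq. (1). The function names `twiddle`, `dft_bin`, and `fft2048_radix8` are illustrative choices, and the twiddle factor is assumed to use the conventional negative exponent.

```python
import cmath
import random


def twiddle(N, e):
    """W_N^e = exp(-j*2*pi*e/N), the twiddle factor used in eq. (1)."""
    return cmath.exp(-2j * cmath.pi * e / N)


def dft_bin(x, k):
    """Direct evaluation of a single output bin X(k) from eq. (1)."""
    N = len(x)
    return sum(x[n] * twiddle(N, n * k) for n in range(N))


def fft2048_radix8(x):
    """2048-point DFT as three radix-8 stages and one radix-4 stage, eq. (2)."""
    assert len(x) == 2048
    r8, r4 = range(8), range(4)
    # Stage 1: radix-8 sum over n1, then twiddle W_2048^{k1(32n2+4n3+n4)}.
    a = {(k1, n2, n3, n4):
         sum(x[256 * n1 + 32 * n2 + 4 * n3 + n4] * twiddle(8, n1 * k1) for n1 in r8)
         * twiddle(2048, k1 * (32 * n2 + 4 * n3 + n4))
         for k1 in r8 for n2 in r8 for n3 in r8 for n4 in r4}
    # Stage 2: radix-8 sum over n2, then twiddle W_256^{k2(4n3+n4)}.
    b = {(k1, k2, n3, n4):
         sum(a[k1, n2, n3, n4] * twiddle(8, n2 * k2) for n2 in r8)
         * twiddle(256, k2 * (4 * n3 + n4))
         for k1 in r8 for k2 in r8 for n3 in r8 for n4 in r4}
    # Stage 3: radix-8 sum over n3, then twiddle W_32^{n4*k3}.
    c = {(k1, k2, k3, n4):
         sum(b[k1, k2, n3, n4] * twiddle(8, n3 * k3) for n3 in r8)
         * twiddle(32, n4 * k3)
         for k1 in r8 for k2 in r8 for k3 in r8 for n4 in r4}
    # Stage 4: final radix-4 sum over n4, no twiddle multiplication afterwards.
    X = [0j] * 2048
    for k1 in r8:
        for k2 in r8:
            for k3 in r8:
                for k4 in r4:
                    X[k1 + 8 * k2 + 64 * k3 + 512 * k4] = sum(
                        c[k1, k2, k3, n4] * twiddle(4, n4 * k4) for n4 in r4)
    return X


random.seed(0)
x = [complex(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(2048)]
X = fft2048_radix8(x)
worst = max(abs(X[k] - dft_bin(x, k)) for k in (0, 1, 7, 8, 513, 2047))
print(f"largest deviation on the sampled bins: {worst:.2e}")
```

Note that the last stage needs no twiddle multiplication, which is consistent with the multiplier being bypassed in the final stage of the processing engine.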
A. Block Scaling Method
Block floating-point (BFP) [5] is an efficient way to reduce
the wordlength by increasing the dynamic range compared to
the fixed-point format. The behavior of BFP is similar to that
of floating-point except a single exponent is used for a group
of data. Although BFP is often adopted in memory-based FFT
processors to save the hardware cost and power, it is not
suited to cached-memory FFT processors because of the
interleaved processing stages [5]. To solve this problem, a
dynamic scaling FFT processor [3] is proposed by employing
multiple exponents for cache-size blocks. While dynamic
scaling approach has a satisfactory result in reducing
wordlength, it still has two drawbacks. Since the exponent
position can be determined only after all cached data are
processed, some clock cycles are wasted. Also, the internal
wordlength of both arithmetic units and cache needs to be
extended to prevent overflows.
Thus we propose the block scaling method which
eliminates the increased wordlength and idle cycles by a
“detect and scale” approach. Each set of the output symbols
will be scaled right away if an overflow is detected. At the
same time, the resulting exponents are saved for data
alignment in the next processing stage. Although this method
can be realized by saving block exponents for all processing
stages, doing so is costly in hardware. To resolve this issue, we
scale the final output of FFT to a predetermined exponent,
and thus only 296 exponents need to be stored for the
longest-length 2048-point FFTs. There are two main reasons
why this fixed-exponent scheme is feasible. First, because the
input symbols are gain-controlled and have specified
modulation in OFDM systems, the maximum value of the
final FFT output can be expected in advance. Second, in most
dedicated OFDM transceiver designs, only fixed-point format
is considered due to simpler hardware implementations. As
the simulation result shows in Fig. 3, over four bits can be
reduced in wordlength by the proposed method under the
same signal-to-quantization-noise ratio (SQNR). We can also
find that more than one fourth of the memory size (from 16
bits to 12 bits) can be saved at about 50 dB SQNR.
[Figure 3 plot: SQNR (dB), 0 to 80, versus wordlength (bits), 9 to 16, for the fixed-point method and the proposed block-scaling method.]
Fig. 3. SQNR performance of the proposed block scaling method.
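The "detect and scale" idea described above can be sketched in software. This is a minimal illustration under stated assumptions, not the authors' circuit: samples are modelled as (real, imaginary) integer pairs in two's complement, scaling is a plain arithmetic right shift (truncation, whereas hardware may round), and the names `detect_and_scale` and `align` are hypothetical. The first function scales a block as soon as any component would overflow the wordlength and returns the exponent shift; the second aligns blocks with different exponents before the next stage.

```python
def detect_and_scale(block, wordlength=12):
    """Shift a block right until every component fits in `wordlength` bits.

    block: list of (re, im) integer pairs sharing one exponent.
    Returns (scaled_block, shift), where `shift` is the exponent increase.
    """
    limit = 1 << (wordlength - 1)  # representable range is [-limit, limit - 1]

    def overflows(v):
        return not (-limit <= v < limit)

    shift = 0
    while any(overflows(re >> shift) or overflows(im >> shift) for re, im in block):
        shift += 1
    return [(re >> shift, im >> shift) for re, im in block], shift


def align(blocks_with_exp):
    """Bring blocks with different exponents to the largest exponent.

    blocks_with_exp: list of (block, exponent) pairs.
    Returns (aligned_blocks, common_exponent).
    """
    target = max(e for _, e in blocks_with_exp)
    aligned = [[(re >> (target - e), im >> (target - e)) for re, im in blk]
               for blk, e in blocks_with_exp]
    return aligned, target


# One block overflows a 12-bit word and is scaled; the other is left untouched.
blk_a, exp_a = detect_and_scale([(3000, -100), (10, 20)])
blk_b, exp_b = detect_and_scale([(100, 200)])
aligned, common = align([(blk_a, exp_a), (blk_b, exp_b)])
print(blk_a, exp_a, blk_b, exp_b, aligned, common)
```

Because the shift is decided per block as outputs are produced, no pass over the whole cache is needed before the exponent is known, which is the property the text contrasts with the dynamic scaling approach of [3].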
III. ARCHITECTURE
Block diagram of the proposed FFT/IFFT processor is
depicted in Fig. 4. It consists of four FFT/IFFT control units, a
main memory unit, a processing engine (PE), and a 64-word
cache. In this design, a novel block scaling method and a new
ping-pong cache-memory architecture are proposed to reduce
the power consumption and hardware cost. Besides, since
FFT and IFFT have the same operations except for complex-
conjugated twiddle factors, we implement IFFT by simply
taking conjugates of FFT input/output [6] as shown in Fig. 4.
With these techniques and proper data scheduling, the
proposed design can realize two 2048-point FFT/IFFT
computations in 2052 clock cycles. Thus, by taking the guard
interval of WiMAX systems into account, the proposed
FFT/IFFT processor does not need to operate at a multiple
of the sampling frequency as the previous cached-memory FFT
designs do [2], [3]. The modules of the proposed design will
be described in more detail below.
Fig. 4. Block diagram of the proposed two-stream FFT/IFFT processor.
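The conjugation trick mentioned above, computing the IFFT with FFT hardware by conjugating the input and the output, can be illustrated with a short sketch. It uses a direct O(N²) DFT for clarity; the function names are illustrative, and the 1/N factor is included here only so that the round trip returns the original samples (how the chip handles that scaling is not stated in the text).

```python
import cmath


def dft(x):
    """Forward DFT with the negative-exponent twiddle factor of eq. (1)."""
    N = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / N) for n in range(N))
            for k in range(N)]


def idft_via_conjugates(X):
    """Inverse DFT computed as conj(DFT(conj(X))) / N."""
    N = len(X)
    y = dft([v.conjugate() for v in X])
    return [v.conjugate() / N for v in y]


x = [1 + 2j, -1j, 3, 0.5 - 0.5j, -2, 1j, 0.25, -1 - 1j]
round_trip = idft_via_conjugates(dft(x))
print(max(abs(a - b) for a, b in zip(x, round_trip)))
```

The same forward datapath and twiddle ROM therefore serve both directions; only the two conjugation steps differ.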
A. Main Memory
For memory-based FFT processors supporting consecutive
I/O, multiple main memories are needed as computation and
I/O buffers [7]. To reduce the total memory size, the
continuous flow (CF) memory architecture is proposed [7]
where only two N-word memories are required for N-point
FFT. Although CF FFT can reduce memory size by doing I/O
operation concurrently in a single memory, it requires
additional controls for memory addressing and butterfly units
(BU). This is because the original CF FFT adopts radix-4 and
radix-2 algorithms, which have different bit-reverse orders. In
our proposed design, however, the CF memory architecture
causes no problem since the radix-2³ and radix-2² algorithms
have the same bit-reverse order as the radix-2 algorithm [4]. As
shown in Fig. 4, one 4096-word SRAM works as the I/O
buffer while the other one works as the processing buffer, and
vice versa. Each SRAM is further partitioned into eight banks to
support eight simultaneous accesses for radix-2³ algorithms.
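The text does not give the address-to-bank mapping, so the following is only a hypothetical sketch of one well-known way to obtain eight conflict-free accesses per radix-8 butterfly: the bank index is the sum of the octal digits of the word address, modulo 8. The eight operands of a radix-8 butterfly differ in exactly one octal digit, so they always fall into eight different banks. The names `bank_of` and `butterfly_group` and the choice of mapping are assumptions made for illustration, not the authors' design.

```python
def bank_of(addr, digits=4):
    """Hypothetical bank index: sum of the octal digits of `addr`, modulo 8.

    A 4096-word memory has 12 address bits, i.e. four octal digits.
    """
    return sum((addr >> (3 * i)) & 7 for i in range(digits)) % 8


def butterfly_group(base, pos):
    """Eight addresses that differ only in octal digit `pos` of `base`."""
    cleared = base & ~(7 << (3 * pos))
    return [cleared | (d << (3 * pos)) for d in range(8)]


group = butterfly_group(base=0o1234, pos=2)
print([oct(a) for a in group])
print([bank_of(a) for a in group])
```

Under this mapping every radix-8 operand group can be read or written in a single cycle with single-port banks; other mappings with the same property exist.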
B. Ping-Pong Cache-Memory Architecture
Cached-memory FFT [2], [3] is proposed for low power
consumption by reducing the memory accesses. As shown in
Fig. 5, data are first read from main memory and then sent to
the cache. By proper data scheduling, PE can perform
multiple-stage processing by accessing local cache instead of
the main memory. Although cached-memory FFT can reduce
memory accesses effectively, a concurrent read/write cache
with complex control is required to increase the throughput.
Thus we propose the ping-pong cache-memory architecture
which uses a simple cache with single read/write operations.
As illustrated in Fig. 6, data read from the main memory are
processed by PE first and then written to the cache for future
use. After the cache is full, data in the cache are read by PE
and the computed results are stored back to the main memory.
Since radix-2³ algorithm is adopted in the proposed design, a
64-word cache is employed to support two-stage radix-2³
processing. By using this scheme, half the memory accesses
can be saved. Moreover, the ping-pong cache-memory has
shorter latency compared to the cached-memory, which is
beneficial in scheduling data streams.
Fig. 5. Cached-memory architecture.
Fig. 6. Proposed ping-pong cache-memory architecture.
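The "half the memory accesses" claim can be reproduced with a simple counting model. This is a sketch under assumptions rather than a model of the actual chip: it assumes that a conventional memory-based design reads and writes every word in main memory once per stage, while the ping-pong scheme touches main memory once per pair of stages (read before the first stage of the pair, write after the second, with the cache holding the intermediate results). The function name is illustrative.

```python
def main_memory_accesses(n_points, n_stages, stages_per_pass):
    """Main-memory reads plus writes for one FFT under a simple counting model.

    Each pass over main memory reads and writes every word once and covers
    `stages_per_pass` processing stages.
    """
    passes = -(-n_stages // stages_per_pass)  # ceiling division
    return 2 * n_points * passes


# 2048-point FFT: three radix-8 stages plus a final radix-4 stage.
per_stage = main_memory_accesses(2048, n_stages=4, stages_per_pass=1)
ping_pong = main_memory_accesses(2048, n_stages=4, stages_per_pass=2)
print(per_stage, ping_pong, ping_pong / per_stage)
```

For FFT sizes with an odd number of stages this simple model gives a saving smaller than one half, so the figure should be read as applying to the four-stage 2048-point case.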
C. Processing Engine (PE)
The PE is designed to perform radix-2³/2²/2 butterfly
operations and complex multiplications with the proposed block
scaling approach as shown in Fig. 7. Since variable-length
FFT must be supported and the final stage can be radix-2³,
radix-2², or radix-2 as described earlier, a configurable
radix-2³/2²/2 butterfly unit capable of processing one radix-2³,
two radix-2², or four radix-2 butterflies is adopted. We use the 2048-point
FFT mode to describe the control of the PE. At the first processing
stage, since the inputs have the same decimal point, data
alignments are skipped. Input data are processed by the radix-2³
BU directly and then passed to the first overflow detection
and scaling unit (ODSU1) in Fig. 7. If an overflow is detected,
all eight inputs will be scaled and the corresponding shift in
exponent is sent to the block scaling unit. Afterward, the
output of ODSU1 is sent to the complex multipliers for
twiddle factor multiplications. The outputs of the complex
multipliers are passed to the second overflow detection and
scaling unit (ODSU2) in Fig. 7, where the same operation as in
ODSU1 is performed. The second and third stages have
control flows similar to that of stage 1. For stage 4, after inputs are
aligned in decimal point for processing, two radix-2²
operations are performed. At this stage, however, only scaling
is performed in ODSU1 since the final output is
fixed-exponent in our proposed block scaling algorithm.
Complex multiplications and ODSU2 are also skipped in this
stage because no twiddle factor multiplication is required at the
final stage as shown previously in Fig. 2. The detailed control
flow for all 128~2048 FFT modes is summarized in Table I.
Fig. 7. Block diagram of the processing engine.
TABLE I. PE control for 128~2048-point FFT/IFFT.
| Unit | First Stage | Intermediate Stage(s) | Final Stage |
|---|---|---|---|
| Alignment | Bypass | ON | ON |
| Configurable BU | Radix-2³ | Radix-2³ | Radix-2³ for 512 FFT; Radix-2² for 256/2048 FFT; Radix-2 for 128/1024 FFT |
| ODSU1 | Detection & Scaling | Detection & Scaling | Scaling |
| Multiplier | ON | ON | Bypass |
| ODSU2 | Detection & Scaling | Detection & Scaling | Bypass |
| Block scaling unit | Exponent store | Alignment control & Exponent store | Alignment control & ODSU1 control |
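The final-stage radix listed in Table I follows from writing log2(N) as 3q + r: q radix-8 stages cover 3q bits, and the remainder r selects a radix-2 (r = 1) or radix-2² (r = 2) final stage. A minimal sketch of that rule, with illustrative names `stage_plan` and `BUTTERFLIES_PER_CYCLE`:

```python
# Butterflies the configurable BU handles per cycle, as stated in the text:
# one radix-8, two radix-4, or four radix-2, i.e. always eight points.
BUTTERFLIES_PER_CYCLE = {8: 1, 4: 2, 2: 4}


def stage_plan(n_points):
    """Return the radix of each stage for a 128- to 2048-point FFT."""
    bits = n_points.bit_length() - 1
    if n_points != 1 << bits or not 7 <= bits <= 11:
        raise ValueError("supported sizes are 128, 256, 512, 1024 and 2048")
    q, r = divmod(bits, 3)
    return [8] * q + ([1 << r] if r else [])


for n in (128, 256, 512, 1024, 2048):
    plan = stage_plan(n)
    print(n, plan, "final-stage butterflies per cycle:", BUTTERFLIES_PER_CYCLE[plan[-1]])
```

For 512 points the remainder is zero, so the final stage is itself a radix-2³ stage, matching the "Radix-2³ for 512 FFT" entry.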
IV. CHIP IMPLEMENTATION
A test chip of the proposed block scaling FFT/IFFT
processor (2048-point mode) is implemented using UMC
0.13 μm 1P8M CMOS technology for verification. The core
size is 1332×1590 μm² as shown in Fig. 8. From post-layout
prime power simulation, it is shown that the proposed
FFT/IFFT consumes only 17.26 mW at 22.86 MHz when
performing two 2048-point FFT computations consecutively
for WiMAX applications. The SQNR performance of the
2048-point FFT/IFFT has also been verified to exceed 48 dB
for QPSK and 16/64-QAM signals. Thus the implementation
loss of cascaded IFFT and FFT is only 0.1 dB with AWGN at
30 dB SNR which satisfies our design target for WiMAX
applications. The detailed power profiling and chip summary
are shown in Fig. 9.
[Figure 8 layout labels: two 4096-word SRAMs, BSU, ROM, two caches, BU, multipliers M1 to M7, and controller.]
Fig. 8. Chip layout of the proposed FFT/IFFT Processor.
Fig. 9. Power profiling and chip summary of the proposed processor.
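The throughput and energy figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (2052 cycles, 22.86 MHz, 17.26 mW, four stages, eight points per cycle, shortest guard interval 1/32). Two points are assumptions: that the processor clock equals the sample rate, as the statement about not needing a multiple of the sampling frequency suggests, and that the 4 cycles left over after the butterfly work are pipeline or scheduling overhead.

```python
CLOCK_HZ = 22.86e6   # reported clock rate
CYCLES = 2052        # reported cycles for two 2048-point FFTs
POWER_W = 17.26e-3   # reported power for two 2048-point FFTs
N = 2048
STAGES = 4           # three radix-8 stages plus one radix-4 stage
POINTS_PER_CYCLE = 8

time_s = CYCLES / CLOCK_HZ
energy_j = POWER_W * time_s

# Cycles spent on butterflies for two streams, and what is left of the 2052.
butterfly_cycles = 2 * STAGES * (N // POINTS_PER_CYCLE)
leftover_cycles = CYCLES - butterfly_cycles

# Shortest OFDM symbol: 2048 samples plus a 1/32 guard interval.
shortest_symbol_samples = N + N // 32

print(f"time for two FFTs   : {time_s * 1e6:.2f} us")
print(f"energy for two FFTs : {energy_j * 1e6:.3f} uJ")
print(f"butterfly cycles    : {butterfly_cycles} (+{leftover_cycles} leftover)")
print(f"fits in one symbol  : {CYCLES <= shortest_symbol_samples}")
```

Since 2052 cycles is below the 2112 sample periods of the shortest symbol, both streams can be processed within one symbol time without raising the clock above the sample rate, under the assumptions stated.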
V. COMPARISON
For comparisons, we choose two FFT processor chips
which can handle consecutive 2048-point FFT computations
[8], [9]. Since these two chips cannot support multiple data
streams and only complete results for 1024-point FFT are
listed, the comparisons of execution time and power are based
on single-stream 1024-point FFT. Besides, to compare the
FFT processor chips fabricated with different technologies,
we adopt the normalized area and FFTs per energy [2] as our
performance indices shown in eqs. (3) and (4). Note that eq.
(4) has been adapted to take account of the voltage scaling.
$$\text{Normalized Area} = \frac{\text{Area}}{(\text{Technology}/0.13\,\mu\text{m})^{2}} \tag{3}$$
$$\text{Normalized}\ \frac{\text{FFTs}}{\text{Energy}} = \frac{(\text{Technology}/0.13\,\mu\text{m}) \times (V_{DD}/1.2)^{2}}{\text{Power} \times \text{Execution Time} \times 10^{3}} \tag{4}$$
The comparison results are summarized in TABLE II. We
can find that the FFT processor [9] uses a shorter wordlength
of 12 bits since it only supports 9-bit input. The processor
[8] has employed the BFP approach and thus the wordlength
is not increased. However, neither design [8], [9] employs
a cache design to reduce the power of memory accesses. From
this comparison, it is shown that our proposal has a
satisfactory result in both normalized area and FFTs per
energy, which justifies the feasibility of the proposed method.
TABLE II. Chip comparison of various 2048-point FFT Processors.
| | This Work | Zhong [8] | Lin [9] *3 |
|---|---|---|---|
| Technology | 0.13 μm | 0.25 μm | 0.35 μm |
| Supported FFT/IFFT (consecutive) | Two 2048-point *1 FFTs/IFFTs | 8~2048-point FFT | 512~2048-point FFT |
| Cache design | Yes | No | No |
| Scaling/BFP design | Block scaling | BFP | No |
| Input bit width | 12 bits | 16 bits | 9 bits |
| Wordlength | 12 bits | 16 bits | 12 bits |
| Core voltage | 1.2 V | 2.5 V | 3.3 V |
| Clock rate | 22.86 MHz | 200 MHz | 45.45 MHz |
| Execution time (1024-point) | 22.48 μs *2 | 26.4 μs | 45.06 μs |
| Power (1024-point) | 17.26 mW *2 | 400 mW | 640 mW |
| Core area | 2.12 mm² | 11.42 mm² | 13.05 mm² |
| Normalized 1024-point FFTs/Energy | 2577 *2 | 790 | 706 |
| Normalized area | 1.06 | 3.09 | 1.80 |
*1: Can be extended to 128~2048-point by adding control modes.
*2: Normalized from data of two 2048-point FFTs.
*3: The bit-reverse memory is not included.
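The normalized indices of eqs. (3) and (4) can be recomputed from the raw rows of Table II. The sketch below is a cross-check, not part of the original paper. It assumes power is expressed in watts and execution time in seconds (the units under which the formula reproduces the tabulated values) and that core areas are in mm², which is what the 1332×1590 μm² core size implies. Function and dictionary names are illustrative.

```python
def normalized_area(area_mm2, tech_um):
    """Eq. (3): area scaled to a 0.13 um process."""
    return area_mm2 / (tech_um / 0.13) ** 2


def normalized_ffts_per_energy(tech_um, vdd, power_w, exec_time_s):
    """Eq. (4): FFTs per energy scaled for technology and supply voltage."""
    return (tech_um / 0.13) * (vdd / 1.2) ** 2 / (power_w * exec_time_s * 1e3)


chips = {
    "This work": dict(tech_um=0.13, vdd=1.2, power_w=17.26e-3,
                      exec_time_s=22.48e-6, area_mm2=2.12),
    "Zhong [8]": dict(tech_um=0.25, vdd=2.5, power_w=400e-3,
                      exec_time_s=26.4e-6, area_mm2=11.42),
    "Lin [9]":   dict(tech_um=0.35, vdd=3.3, power_w=640e-3,
                      exec_time_s=45.06e-6, area_mm2=13.05),
}

for name, c in chips.items():
    fpe = normalized_ffts_per_energy(c["tech_um"], c["vdd"],
                                     c["power_w"], c["exec_time_s"])
    area = normalized_area(c["area_mm2"], c["tech_um"])
    print(f"{name:10s} FFTs/energy = {fpe:7.1f}   normalized area = {area:.2f}")
```

The FFTs-per-energy values and the normalized areas of [8] and [9] match the table. For this work the formula gives 2.12, twice the tabulated 1.06; one possible reading, which the text does not state, is that the area is divided between the two streams.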
VI. CONCLUSION
A block scaling MIMO FFT/IFFT processor for WiMAX
applications has been proposed in this paper. It can support
two 2048-point FFT/IFFT computations simultaneously
within 2052 clock cycles. Moreover, with a novel block
scaling method and a new ping-pong cache-memory
architecture, both power consumption and hardware cost can
be greatly reduced. A test chip has been designed using UMC
0.13 μm 1P8M process. Simulation results show that the
proposed FFT processor consumes only 17.26 mW at 22.86
MHz, which meets the maximum throughput rate of WiMAX
applications.
ACKNOWLEDGMENT
This work was supported by the National Science Council
of Taiwan under Grant NSC94-2215-E-009-044 and
by ICL/ITRI under Grant 5352BA5115.
REFERENCES
[1] WiMAX Forum, Mobile WiMAX-Part I: A technical overview and
performance evaluations, Feb. 21, 2006.
[2] B. M. Baas, “A low-power, high-performance, 1024-point FFT
processor,” IEEE J. Solid-State Circuits, vol. 34, pp. 380–387, Mar.
1999.
[3] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A dynamic scaling FFT
processor for DVB-T applications,” IEEE J. Solid-State Circuits, vol.
39, pp. 2005–2013, Nov. 2004.
[4] He Shousheng and M. Torkelson, “Designing pipeline FFT processor
for OFDM (de)modulation,” In Proc. Int. Symp. Signals, Systems,
and Electronics, 29 Sept.-2 Oct. 1998, pp. 257-262.
[5] B. M. Baas, “An approach to low-power, high-performance, fast
Fourier transform processor design,” PhD Dissertation, Stanford
University, Stanford, CA, 1999.
[6] K. Maharatna, E. Grass, and U. Jagdhold, “A 64-point Fourier
transform chip for high-speed wireless LAN application using
OFDM,” IEEE J. Solid-State Circuits, vol. 39, pp. 484-493, Mar.
2004.
[7] B. G. Jo and M. H. Sunwoo, “New continuous-flow mixed radix
(CFMR) FFT using novel in-place strategy,” IEEE Trans. Circuits
Syst., vol. 52, pp. 911–919, May. 2005.
[8] G. Zhong, F. Xu, and A. N. Willson Jr., “A power-scalable
reconfigurable FFT/IFFT IC based on a multi-processor ring,” IEEE
J. Solid-State Circuits, vol. 41, pp. 483-495, Feb. 2006.
[9] Y.-T. Lin, P.-Y. Tsai, and T.-D. Chiueh, “Low-power variable-
length fast Fourier transform processor,” IEE Proc. Comput. Digit.
Tech., vol. 152, No. 4, pp. 499-506, July 2005.
Citations
Journal ArticleDOI
TL;DR: A multimode memory-based Fast Fourier Transform (FFT) processor for a medical system aimed at Fourier-domain optical coherence tomography (FD-OCT) capable of supporting wireless displays based on multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM).
Abstract: This paper presents a multimode memory-based Fast Fourier Transform (FFT) processor for a medical system aimed at Fourier-domain optical coherence tomography (FD-OCT) capable of supporting wireless displays based on multiple-input multiple-output orthogonal frequency division multiplexing (MIMO-OFDM). The proposed FFT processor enables the use of 2-stream 4096/2048/1024-point FFTs and 1- to 4-stream 128/64-point FFTs for FD-OCT and OFDM applications, respectively. Using cost-effective four-bank single-port SRAM operating in four-word data width, the proposed design provides data access for up to sixteen memory paths. In conjunction with a proposed FFT kernel devised using hardware-efficient multiplication and cache units, the proposed system allows high-throughput multimode FFT operations in an energy- and area-efficient configuration. A test chip was designed using TSMC-0.18 μm CMOS technology with a core size of 4.8 mm². Post-layout simulation performing 4096-point FFT at 80 MHz and the 128-point FFT at 40 MHz achieved throughput of 152 MS/s and 160 MS/s with power consumption of 156.2 mW and 69.9 mW, respectively. Compared to the previous approaches fully or partially supporting the specified OCT/OFDM FFTs, different degrees of area or energy efficiency improvements can be shown by our design depending on the FFT operation mode. In addition, system-level verification for practical OCT imaging was also performed using an FPGA platform.

18 citations


Cites background or methods from "A Block Scaling FFT/IFFT Processor ..."

  • ...3) Fixed-Point Performance: To improve performance in the signal-to-quantization-noise ratio (SQNR) with finite word length, we employed a block floating point (BFP) scheme [30], [28], [48] by representing a block of data values using the formula , in which M is the mantissa of the individual data value and E is a global exponent term for all data values....

    [...]

  • ...Moreover, rather than using a global operation scheme based on overall performance optimization [30]–[34], it is more helpful to minimize the power consumption (i....

    [...]

  • ...Based on the evaluation results presented in [30], [48], and [50], we determined that in fixed-point performance for OCT or OFDM FFTs, the proposed FFT structure lends directional support for SQNR for both FD-OCT and OFDM applications....

    [...]

  • ...For example, an eight-path radix-8 kernel may employ seven complex multipliers as shown in [27], [30], [31]....

    [...]

  • ...To enable a balanced comparison of hardware performance, we employed two parameters based on [22], [28], [30] and [51] in terms of “normalized N-point FFTs per energy” (N is the FFT length specified in our design) as shown in (14) and “normalized T....

    [...]

Proceedings ArticleDOI
24 May 2009
TL;DR: In this work, mixed-radix-2/4/8 algorithm and new continuous-flow method are applied to achieve variable-length of 1K/2K/4K/8K points and in-order output and ping-pong cache memory architecture and optimized data scaling strategy are also applied to reduce main memory accesses up to 50% and achieve higher SQNR requirements of 16/64-QAM signals for multi-standards.
Abstract: In this paper, we present a low power and variable-length design of fast Fourier transform (FFT) processor for flexible MIMO-OFDM applications. In this work, mixed-radix-2/4/8 algorithm and new continuous-flow method are applied to achieve variable-length of 1K/2K/4K/8K points and in-order output. Furthermore, ping-pong cache memory architecture and optimized data scaling strategy are also applied to reduce main memory accesses up to 50% and achieve higher SQNR requirements of 16/64-QAM signals for multi-standards. A test chip of the proposed FFT processor has been designed and fabricated using CMOS 0.18 µm 1P6M process with the core area of 4.96 mm². Our proposed processor can perform four independent data sequences of 2048-point FFT within 205.2 µs at operating clock rate of 20 MHz. The power consumption of calculating single 8192-point FFT sequence is only 20.88 mW at operating clock rate of 10 MHz.

16 citations


Cites background from "A Block Scaling FFT/IFFT Processor ..."

  • ...Furthermore, the ping-pong cache memory architecture is proposed to further reduce memory accesses more efficiently [8]....

    [...]

  • ...For these reasons, several dynamic data scaling approaches have been proposed to preserve effective data word-length and avoiding data overflow [3], [8]....

    [...]

  • ...Proposed Chen [8] Zhong [10] Technology 0....

    [...]

  • ...7 [8], respectively to be our baseline of performance comparison....

    [...]

Proceedings ArticleDOI
12 Dec 2008
TL;DR: By applying the proposed mixed-radix dataflow scheduling (MRDS) technique, the effective hardware utilization can be raised to 100% and the hardware complexity is significantly reduced.
Abstract: In this paper, an efficient solution of MIMO FFT/IFFT processor for IEEE 802.16 WMAN is presented. By applying the proposed mixed-radix dataflow scheduling (MRDS) technique, the effective hardware utilization can be raised to 100%. Therefore, a single butterfly unit within each pipeline stage is sufficient to deal with the two data sequences, and the hardware complexity is significantly reduced. The proposed FFT/IFFT processor has been emulated on the FPGA board. The signal-to-quantization noise ratio (SQNR) is over 44 dB for QPSK and 16/64-QAM signals. Furthermore, a test chip has been designed using standard 0.18-μm CMOS technology with a core area of 887 × 842 μm². According to the post-layout simulation results, the design consumes 46 mW at 64 MHz operating frequency, which meets the maximum throughput requirements of IEEE 802.16 WMAN.

11 citations


Cites methods from "A Block Scaling FFT/IFFT Processor ..."

  • ...The performance indices for chip area and power consumption are defined as follows [1][3]:...

    [...]

Proceedings ArticleDOI
21 Mar 2012
TL;DR: This proposed architecture applies a reconfigurable complex multiplier to achieve a ROM-less FFT/IFFT processor and to reduce the truncation error, the fixed width modified booth multiplier is adopted.
Abstract: Fast Fourier transform (FFT) processing is one of the key procedure in popular orthogonal frequency division multiplexing (OFDM) communication systems. Structured pipeline architectures, low power consumption, high speed and reduced chip area are the main concerns in this VLSI implementation. In this paper, the efficient implementation of FFT/IFFT processor for OFDM applications is presented. The processor can be used in various OFDM-based communication systems, such as Worldwide Interoperability for Microwave access (Wi-Max), digital audio broadcasting (DAB), digital video broadcasting-terrestrial (DVB-T). We adopt single-path delay feedback architecture. To eliminate the read only memories (ROM's) used to store the twiddle factors, this proposed architecture applies a reconfigurable complex multiplier to achieve a ROM-less FFT/IFFT processor and to reduce the truncation error we adopt the fixed width modified booth multiplier. The three processing elements (PE's), delay-line (DL) buffers are used for computing IFFT. Thus we consume the low power, lower hardware cost, high efficiency and reduced chip size.

10 citations

Journal ArticleDOI
TL;DR: A new method which is combining Fast Fourier Transform and Grey Relational Analyses to get best diagnosis results has been developed for an embedded system within a compact and low-cost measurement device.
Abstract: In this work, the implementation of an embedded system for real-time detection of rotor bar failures in induction motor has been realized. The device is a prototype measurement device which can detect broken rotor bars on the field without any additional setup or third-party software. This study has focused to derive a new method from previous studies on diagnosing of rotor failures and developed a microcontroller based embedded measurement device. A new method which is combining Fast Fourier Transform and Grey Relational Analyses to get best diagnosis results has been developed for an embedded system within a compact and low-cost measurement device. Although, there are some computer-based techniques and motor drivers that can detect internal failures of induction motor, there is no such a measurement device that can do all process by itself.

10 citations

References
Proceedings ArticleDOI
29 Sep 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the requirement for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized and the area/power efficiency has been enhanced.
Abstract: The FFT processor is one of the key components in the implementation of wideband OFDM systems. Architectures with a structured pipeline have been used to meet the fast, real-time processing demand and low-power consumption requirement in a mobile environment. Architectures based on new forms of FFT, the radix-2^i algorithm derived by cascade decomposition, is proposed. By exploiting the spatial regularity of the new algorithm, the requirement for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized. Progressive wordlength adjustment has been introduced to optimize the total memory size with a given signal-to-quantization-noise-ratio (SQNR) requirement in fixed-point processing. A new complex multiplier based on distributed arithmetic further enhanced the area/power efficiency of the design. A single-chip processor for 1 K complex point FFT transform is used to demonstrate the design issues under consideration.

322 citations


"A Block Scaling FFT/IFFT Processor ..." refers methods in this paper

  • ...Thus we employ radix-2³ and radix-2² [4] to replace radix-8 and radix-4, respectively....

    [...]

  • ...To reduce the number of complex multiplications, radix-8 algorithm is chosen to carry out the DFT [4]....

    [...]

  • ...Here we take the longest 2048-point DFT in the design as an example....

    [...]

  • ...Similarly, 128/256/512/1024-point DFT can also be decomposed to preceding radix-8 stages and a final radix-8/4/2 stage depending on the DFT size....

    [...]
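The stage decomposition described in the excerpts above (preceding radix-8 stages plus a final radix-8/4/2 stage) can be sketched as simple arithmetic on log2(N). The helper name `plan_stages` is hypothetical, and the sketch covers only the radix bookkeeping, not the hardware mapping.

```python
def plan_stages(n_points):
    """Split an N-point DFT (N a power of two, N >= 8) into as many radix-8
    stages as possible, followed by one final radix-4 or radix-2 stage when
    log2(N) is not a multiple of 3."""
    if n_points < 8 or n_points & (n_points - 1):
        raise ValueError("n_points must be a power of two >= 8")
    log2_n = n_points.bit_length() - 1
    full, rest = divmod(log2_n, 3)     # each radix-8 stage consumes 3 bits
    stages = [8] * full
    if rest:
        stages.append(2 ** rest)       # leftover 1 or 2 bits: radix-2 or radix-4
    return stages


for n in (128, 256, 512, 1024, 2048):
    print(n, plan_stages(n))
```

For N = 2048 this gives three radix-8 stages and a final radix-4 stage, matching the 2048-point example the excerpts refer to.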

Journal ArticleDOI
TL;DR: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor, which has been fabricated in a standard 0.7 µm CMOS process and is fully functional on first-pass silicon.
Abstract: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor. The 460000-transistor design has been fabricated in a standard 0.7 µm (L_poly = 0.6 µm) CMOS process and is fully functional on first-pass silicon. At a supply voltage of 1.1 V, it calculates a 1024-point complex FFT in 330 µs while consuming 9.5 mW, resulting in an adjusted energy efficiency more than 16 times greater than the previously most efficient known FFT processor. At 3.3 V, it operates at 173 MHz, which is a clock rate 2.6 times greater than the previously fastest rate.

319 citations


"A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper

  • ...Thus by taking the guard interval of WiMAX systems into account, the proposed FFT/IFFT processor does not need to operate in a multiple sampling frequency as the previous cached-memory FFT designs do [2], [3]....

    [...]

  • ...However, the increase in wordlength [2] or idle cycles [3] still causes wastes in power consumption and hardware cost....

    [...]

  • ...Besides, to compare the FFT processor chips fabricated with different technologies, we adopt the normalized area and FFTs per energy [2] as our performance indices shown in eqs....

    [...]

  • ...Cached-memory FFT [2], [3] is proposed for low power consumption by reducing the memory accesses....

    [...]

  • ...There have been many researches on low-power FFT designs by employing the cached-memory architecture to reduce the memory accesses [2], [3]....

    [...]

Journal ArticleDOI
TL;DR: A novel fixed-point 16-bit word-width 64-point FFT/IFFT processor developed primarily for the application in an OFDM-based IEEE 802.11a wireless LAN baseband processor that can be used for any application that requires fast operation as well as low power consumption.
Abstract: In this paper, we present a novel fixed-point 16-bit word-width 64-point FFT/IFFT processor developed primarily for the application in an OFDM-based IEEE 802.11a wireless LAN baseband processor. The 64-point FFT is realized by decomposing it into a two-dimensional structure of 8-point FFTs. This approach reduces the number of required complex multiplications compared to the conventional radix-2 64-point FFT algorithm. The complex multiplication operations are realized using shift-and-add operations. Thus, the processor does not use a two-input digital multiplier. It also does not need any RAM or ROM for internal storage of coefficients. The proposed 64-point FFT/IFFT processor has been fabricated and tested successfully using our in-house 0.25-µm BiCMOS technology. The core area of this chip is 6.8 mm². The average dynamic power consumption is 41 mW at 20 MHz operating frequency and 1.8 V supply voltage. The processor completes one parallel-to-parallel (i.e., when all input data are available in parallel and all output data are generated in parallel) 64-point FFT computation in 23 cycles. These features show that though it has been developed primarily for application in the IEEE 802.11a standard, it can be used for any application that requires fast operation as well as low power consumption.
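The two-dimensional decomposition mentioned in this abstract can be illustrated with the standard Cooley-Tukey index mapping for N = 8 × 8. The sketch below is a floating-point illustration only: the function names are hypothetical, and it does not model the shift-and-add multipliers or the fixed-point word width of the cited chip.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used both as the 8-point kernel and as a reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def fft64_as_8x8(x):
    """64-point DFT built from two passes of 8-point DFTs.

    Input index n = 8*n1 + n2, output index k = k1 + 8*k2.
    """
    if len(x) != 64:
        raise ValueError("expected 64 samples")
    # Pass 1: eight 8-point DFTs over the strided sequences x[n2], x[n2+8], ...
    cols = [dft(x[n2::8]) for n2 in range(8)]          # cols[n2][k1]
    # Inter-stage twiddle factors W_64^(n2*k1).
    for n2 in range(8):
        for k1 in range(8):
            cols[n2][k1] *= cmath.exp(-2j * cmath.pi * n2 * k1 / 64)
    # Pass 2: eight 8-point DFTs across n2, one for each k1.
    out = [0j] * 64
    for k1 in range(8):
        row = dft([cols[n2][k1] for n2 in range(8)])   # row[k2]
        for k2 in range(8):
            out[k1 + 8 * k2] = row[k2]
    return out
```

Only the inter-stage twiddles involve general complex multiplications; the 8-point kernels need only trivial factors and multiples of √2/2, which is where the reduction in multiplications comes from.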

165 citations


"A Block Scaling FFT/IFFT Processor ..." refers methods in this paper

  • ...Besides, since FFT and IFFT have the same operations except for complex-conjugated twiddle factors, we implement IFFT by simply taking conjugates of FFT input/output [6] as shown in Fig....

    [...]
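The conjugation trick in the excerpt above follows from the identity IFFT(X) = conj(FFT(conj(X))) / N. A minimal sketch follows, assuming a plain recursive radix-2 FFT as the forward transform (not the radix-2³/2² pipeline of the paper); the function names are hypothetical, and where the 1/N scaling is applied in the actual hardware is not stated in the excerpt.

```python
import cmath


def fft(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft_via_fft(spectrum):
    """Inverse transform that reuses the forward FFT unchanged:
    conjugate the input, run the FFT, conjugate the output, divide by N."""
    n = len(spectrum)
    y = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in y]


x = [complex(i, 2 - i) for i in range(8)]
roundtrip = ifft_via_fft(fft(x))
print(max(abs(a - b) for a, b in zip(roundtrip, x)))  # small rounding error
```

Because conjugation only flips the sign of the imaginary part, the same datapath and twiddle storage can serve both directions.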

Journal ArticleDOI
TL;DR: A new continuous-flow mixed-radix (CFMR) fast Fourier transform (FFT) processor that uses the MR (radix-4/2) algorithm and a novel in-place strategy that can reduce hardware complexity and computation cycles compared with existing FFT processors is proposed.
Abstract: The paper proposes a new continuous-flow mixed-radix (CFMR) fast Fourier transform (FFT) processor that uses the MR (radix-4/2) algorithm and a novel in-place strategy. The existing in-place strategy supports only a fixed-radix FFT algorithm. In contrast, the proposed in-place strategy can support the MR algorithm, which allows CF FFT computations regardless of the length of FFT. The novel in-place strategy is made by interchanging storage locations of butterfly outputs. The CFMR FFT processor provides the MR algorithm, the in-place strategy, and the CF FFT computations at the same time. The CFMR FFT processor requires only two N-word memories due to the proposed in-place strategy. In addition, it uses one butterfly unit that can perform either one radix-4 butterfly or two radix-2 butterflies. The CFMR FFT processor using the 0.18 µm SEC cell library consists of 37,000 gates excluding memories, requires only 640 clock cycles for a 512-point FFT and runs at 100 MHz. Therefore, the CFMR FFT processor can reduce hardware complexity and computation cycles compared with existing FFT processors.

128 citations


"A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper

  • ...For memory-based FFT processors supporting consecutive I/O, multiple main memories are needed as computation and I/O buffers [7]....

    [...]

  • ...To reduce the total memory size, the continuous flow (CF) memory architecture is proposed [7] where only two N-word memories are required for N-point FFT....

    [...]

Journal ArticleDOI
TL;DR: This paper presents an 8192-point FFT processor for DVB-T systems, in which a three-step radix-8 FFT algorithm, a new dynamic scaling approach, and a novel matrix prefetch buffer are exploited.
Abstract: This paper presents an 8192-point FFT processor for DVB-T systems, in which a three-step radix-8 FFT algorithm, a new dynamic scaling approach, and a novel matrix prefetch buffer are exploited. About 64 K bit memory space can be saved in the 8 K point FFT by the proposed dynamic scaling approach. Moreover, with data scheduling and pre-fetched buffering, single-port memory can be adopted without degrading throughput rate. A test chip for 8 K mode DVB-T system has been designed and fabricated using 0.18-µm single-poly six-metal CMOS process with core area of 4.84 mm². Power dissipation is about 25.2 mW at 20 MHz.

111 citations


"A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper

  • ...Thus by taking the guard interval of WiMAX systems into account, the proposed FFT/IFFT processor does not need to operate in a multiple sampling frequency as the previous cached-memory FFT designs do [2], [3]....

    [...]

  • ...However, the increase in wordlength [2] or idle cycles [3] still causes wastes in power consumption and hardware cost....

    [...]

  • ...dynamic scaling FFT processor [3] is proposed by employing multiple exponents for cache-size blocks....

    [...]

  • ...Cached-memory FFT [2], [3] is proposed for low power consumption by reducing the memory accesses....

    [...]

  • ...There have been many researches on low-power FFT designs by employing the cached-memory architecture to reduce the memory accesses [2], [3]....

    [...]

Frequently Asked Questions (1)
Q1. What are the contributions in "A Block Scaling FFT/IFFT Processor for WiMAX Applications"?

This paper presents a low-power design of a two-stream MIMO FFT/IFFT processor for WiMAX applications. Furthermore, by proper scheduling of the two data streams, the proposed design achieves better hardware utilization and can process two 2048-point FFTs/IFFTs consecutively within 2052 cycles.