A Block Scaling FFT/IFFT Processor for WiMAX Applications

01 Dec 2006

TL;DR: A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce power consumption and hardware cost. By proper scheduling of the two data streams, the proposed design also achieves better hardware utilization.

Abstract: This paper presents a low-power design of a two-stream MIMO FFT/IFFT processor for WiMAX applications. A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce the power consumption and hardware cost. With these schemes, half the memory accesses and 64-Kbit memory can be saved. Furthermore, by proper scheduling of the two data streams, the proposed design achieves better hardware utilization and can process two 2048-point FFTs/IFFTs consecutively within 2052 cycles. A test chip of the proposed FFT/IFFT processor has been designed using UMC 0.13 μm 1P8M process with a core area of 1332×1590 μm². The SQNR performance of the 2048-point FFT/IFFT is over 48 dB for QPSK and 16/64-QAM modulations. Power dissipation of two 2048-point FFT computations is about 17.26 mW at 22.86 MHz, which meets the maximum throughput rate of WiMAX applications.

## Summary (2 min read)

### Introduction

• This paper presents a low-power design of a two-stream MIMO FFT/IFFT processor for WiMAX applications.
• With these schemes, half the memory accesses and 64-Kbit memory (4 bits in wordlength) can be saved without inducing idle cycles.
• Moreover, by proper scheduling of the two data streams, the proposed FFT/IFFT processor avoids stalls of function units and thus achieves better hardware utilization.

### II. ALGORITHM

• To reduce the number of complex multiplications, the radix-8 algorithm is chosen to carry out the DFT [4].
• A high-radix algorithm is very complex in hardware if implemented directly.
• Thus the authors employ radix-2³ and radix-2² [4] to replace radix-8 and radix-4, respectively.
• With these steps, the authors decompose the 2048-point FFT into three radix-2³ and multiplication stages and a final radix-2² stage for hardware implementation.
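
The stage structure described above can be checked numerically on the 32-point case. The following is a minimal sketch, not the authors' implementation: it assumes the index maps n = 4·n1 + n2 and k = k1 + 8·k2 (a 32-point analogue of the 2048-point decomposition), uses plain 8-point and 4-point DFTs in place of the radix-2³ and radix-2² butterflies, and the helper names (`twiddle`, `dft_direct`, `dft32_two_stage`) are hypothetical.

```python
import cmath


def twiddle(base, exponent):
    """Return W_base^exponent = exp(-j*2*pi*exponent/base)."""
    return cmath.exp(-2j * cmath.pi * exponent / base)


def dft_direct(x):
    """Reference O(N^2) DFT taken straight from the definition."""
    n_points = len(x)
    return [
        sum(x[n] * twiddle(n_points, k * n) for n in range(n_points))
        for k in range(n_points)
    ]


def dft32_two_stage(x):
    """32-point DFT as one radix-8 stage, twiddles, then one radix-4 stage.

    Assumed index maps (illustrative): n = 4*n1 + n2, k = k1 + 8*k2,
    with n1, k1 in 0..7 and n2, k2 in 0..3.
    """
    assert len(x) == 32
    # Stage 1: an 8-point DFT over n1 for each n2, then multiply by W_32^(n2*k1).
    stage1 = [
        [
            sum(x[4 * n1 + n2] * twiddle(8, n1 * k1) for n1 in range(8))
            * twiddle(32, n2 * k1)
            for k1 in range(8)
        ]
        for n2 in range(4)
    ]
    # Stage 2: a 4-point DFT over n2 for each k1; no twiddle multiplication follows.
    out = [0j] * 32
    for k1 in range(8):
        for k2 in range(4):
            out[k1 + 8 * k2] = sum(
                stage1[n2][k1] * twiddle(4, n2 * k2) for n2 in range(4)
            )
    return out


samples = [complex((3 * n) % 7 - 3, (5 * n) % 11 - 5) for n in range(32)]
max_error = max(
    abs(a - b) for a, b in zip(dft32_two_stage(samples), dft_direct(samples))
)
print(f"max |two-stage - direct| = {max_error:.2e}")
```

The only twiddle multiplication sits between the two stages, which mirrors the statement that the final stage needs no multiplier.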

### A. Block Scaling Method

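• Block floating-point (BFP) [5] is an efficient way to reduce the wordlength by increasing the dynamic range compared to the fixed-point format.
• BFP is not suited to cached-memory FFT processors because of their interleaved processing stages [5]. To solve this problem, a dynamic scaling FFT processor [3] was proposed that employs multiple exponents for cache-size blocks.
• While the dynamic scaling approach reduces the wordlength satisfactorily, it still has two drawbacks: clock cycles are wasted while the exponent is determined, and the internal wordlength must be extended to prevent overflows.
• The proposed block scaling method instead scales each set of output symbols as soon as an overflow is detected. At the same time, the resulting exponents are saved for data alignment in the next processing stage.
• The final FFT output is scaled to a predetermined exponent. This is feasible because the input symbols are gain-controlled and have specified modulation in OFDM systems, so the maximum value of the final FFT output can be expected in advance, and because most dedicated OFDM transceiver designs consider only fixed-point format.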
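
The "detect and scale" idea can be sketched in a few lines. This is an illustrative model under stated assumptions rather than the paper's hardware: data are signed integers, scaling is a one-bit arithmetic right shift of the whole block, and the function names `detect_and_scale` and `align` are hypothetical.

```python
def detect_and_scale(block, exponent, wordlength=12):
    """Right-shift a whole block until every component fits in `wordlength` bits.

    `block` is a list of (real, imag) integer pairs, for example the eight
    outputs of one butterfly. Returns the scaled block and the updated
    block exponent.
    """
    low = -(1 << (wordlength - 1))
    high = (1 << (wordlength - 1)) - 1
    while any(not (low <= part <= high) for pair in block for part in pair):
        # Overflow detected: scale the whole block by one bit.
        block = [(re >> 1, im >> 1) for re, im in block]
        exponent += 1
    return block, exponent


def align(block, exponent, target_exponent):
    """Right-shift a block so that it is expressed at `target_exponent`."""
    shift = target_exponent - exponent
    if shift < 0:
        raise ValueError("target exponent must not be smaller than the block exponent")
    return [(re >> shift, im >> shift) for re, im in block]


outputs = [(2047, -2048), (3000, 10)]   # 3000 overflows a signed 12-bit word
scaled, exp_a = detect_and_scale(outputs, exponent=0)
other, exp_b = [(100, -100)], 0         # a block that needed no scaling
common = max(exp_a, exp_b)
aligned_other = align(other, exp_b, common)
print(scaled, exp_a, aligned_other)
```

Blocks that reach the next stage with different exponents are brought to the larger one before they are combined, which is the role the saved exponents play in the description above.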

### III. ARCHITECTURE

• A block diagram of the proposed FFT/IFFT processor is depicted in Fig. 4.
• It consists of four FFT/IFFT control units, a main memory unit, a processing engine (PE), and a 64-word cache.
• A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce the power consumption and hardware cost.
• With these techniques and proper data scheduling, the proposed design can realize two 2048-point FFT/IFFT computations in 2052 clock cycles.
• The modules of the proposed design will be described in more detail below.
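
A short calculation shows how the 2052-cycle figure relates to the OFDM symbol length. It is a sketch under one explicit assumption: the processor clock equals the sampling rate, so one clock cycle corresponds to one sample period. The guard-interval fractions are the four modes listed in the paper's introduction; the variable and function names are hypothetical.

```python
FFT_SIZE = 2048
CYCLES_FOR_TWO_FFTS = 2052                        # from the paper
CLOCK_HZ = 22.86e6                                # from the paper
GUARD_FRACTIONS = (1 / 4, 1 / 8, 1 / 16, 1 / 32)  # guard-interval modes in the introduction


def symbol_samples(fft_size, guard_fraction):
    """Number of samples in one OFDM symbol including its guard interval."""
    return int(fft_size * (1 + guard_fraction))


processing_time_us = CYCLES_FOR_TWO_FFTS / CLOCK_HZ * 1e6
shortest_symbol = min(symbol_samples(FFT_SIZE, g) for g in GUARD_FRACTIONS)

print(f"two 2048-point FFTs: {processing_time_us:.2f} us")
print(f"shortest symbol: {shortest_symbol} samples")
print("fits within one symbol:", CYCLES_FOR_TWO_FFTS <= shortest_symbol)
```

Under that assumption, even the shortest guard interval leaves a symbol longer than 2052 cycles, so both streams can be processed within one symbol period.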

### A. Main Memory

• For memory-based FFT processors supporting consecutive I/O, multiple main memories are needed as computation and I/O buffers [7].
• To reduce the total memory size, the continuous flow (CF) memory architecture is proposed [7] where only two N-word memories are required for N-point FFT.
• The original CF FFT requires additional control for memory addressing and butterfly units because it adopts radix-4 and radix-2 algorithms, which have different bit-reverse orders.
• In the proposed design, however, the CF memory architecture causes no problem, since the radix-2³ and radix-2² algorithms have the same bit-reverse order as the radix-2 algorithm [4].
• As shown in Fig. 4, one 4096-word SRAM works as the I/O buffer while the other one works as the processing buffer, and vice versa.
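
The 64-Kbit saving quoted in the introduction is consistent with the two 4096-word SRAMs described above. The arithmetic below is one plausible reading, not a derivation given in the paper: it assumes each memory word holds a real and an imaginary component and that each component shrinks by 4 bits (16 to 12 bits).

```python
WORDS_PER_SRAM = 4096          # from the paper
NUM_SRAMS = 2                  # I/O buffer and processing buffer
BITS_SAVED_PER_COMPONENT = 4   # wordlength reduced from 16 to 12 bits
COMPONENTS_PER_WORD = 2        # assumption: real and imaginary parts

saved_bits = (
    WORDS_PER_SRAM * NUM_SRAMS * COMPONENTS_PER_WORD * BITS_SAVED_PER_COMPONENT
)
saved_kbit = saved_bits / 1024
print(f"memory saved: {saved_bits} bits = {saved_kbit:.0f} Kbit")
```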

### B. Ping-Pong Cache-Memory Architecture

• Cached-memory FFT [2], [3] was proposed to lower power consumption by reducing memory accesses.
• However, it requires a concurrent read/write cache with complex control to increase the throughput.
• Thus the authors propose the ping-pong cache-memory architecture, which uses a simple cache with single read/write operations.
• By using this scheme, half the memory accesses can be saved.
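
The "half the memory accesses" claim can be illustrated with a simple counting model. This is a sketch under a simplifying assumption: every pass reads all N words from main memory once and writes them back once, and the cache lets the PE run a given number of stages between that read and write. The function name `count_main_memory_accesses` is hypothetical.

```python
def count_main_memory_accesses(n_points, n_stages, stages_per_pass):
    """Count main-memory word reads plus writes for one FFT.

    Simplified model (an assumption): each pass reads all `n_points`
    words once and writes them back once, and `stages_per_pass`
    processing stages are completed between the read and the write.
    """
    reads = writes = 0
    stages_done = 0
    while stages_done < n_stages:
        reads += n_points                      # fetch operands from main memory
        stages_done += min(stages_per_pass, n_stages - stages_done)
        writes += n_points                     # store results to main memory
    return reads + writes


no_cache = count_main_memory_accesses(2048, n_stages=4, stages_per_pass=1)
ping_pong = count_main_memory_accesses(2048, n_stages=4, stages_per_pass=2)
print(no_cache, ping_pong, ping_pong / no_cache)
```

With four stages for the 2048-point FFT and two stages completed per pass, the model yields half as many main-memory accesses as a design that touches main memory at every stage.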

### C. Processing Engine (PE)

• The PE is designed to perform radix-2³/2²/2 butterfly operations and complex multiplications with the proposed block scaling approach, as shown in Fig. 7.
• At the first processing stage, since the inputs have the same decimal point, data alignments are skipped.
• Afterward, the output of ODSU1 (the first overflow detection and scaling unit) is sent to the complex multipliers for twiddle factor multiplications.
• The second and third stages have similar control flows to stage 1.
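
The choice of final-stage radix for each supported FFT size follows from factoring the size into radix-8 stages plus a remainder. The helper below is a hypothetical illustration (the name `stage_plan` is not from the paper); it reproduces the final-stage radices listed in the paper's PE control table.

```python
def stage_plan(n_points):
    """Factor an FFT size into leading radix-8 stages and one final stage.

    Returns the list of radices in processing order, e.g. [8, 8, 8, 4]
    for 2048 points. Sizes are assumed to be powers of two >= 16.
    """
    radices = []
    remaining = n_points
    while remaining > 8:
        radices.append(8)
        remaining //= 8
    radices.append(remaining)   # final stage: radix 8, 4, or 2
    return radices


for size in (128, 256, 512, 1024, 2048):
    print(size, stage_plan(size))
```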

### IV. CHIP IMPLEMENTATION

• A test chip of the proposed block scaling FFT/IFFT processor (2048-point mode) is implemented using UMC 0.13 μm 1P8M CMOS technology for verification.
• From post-layout prime power simulation, it is shown that the proposed FFT/IFFT consumes only 17.26 mW at 22.86 MHz when performing two 2048-point FFT computations consecutively for WiMAX applications.
• The SQNR performance of the 2048-point FFT/IFFT has also been verified to exceed 48 dB for QPSK and 16/64-QAM signals.
• Thus the implementation loss of cascaded IFFT and FFT is only 0.1 dB with AWGN at 30 dB SNR, which satisfies the design target for WiMAX applications.
• The detailed power profiling and chip summary are shown in Fig. 9.
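
A back-of-the-envelope check relates the 48 dB SQNR to the reported implementation loss. This is a rough model, not the paper's simulation: it assumes unit signal power, treats quantization noise as additive and independent of the channel noise, and assumes the IFFT and the FFT each contribute noise at exactly 48 dB SQNR. The function names are hypothetical.

```python
import math


def db_to_linear(value_db):
    """Convert a power ratio in dB to a linear ratio."""
    return 10 ** (value_db / 10)


def snr_after_quantization(channel_snr_db, sqnr_db, n_blocks):
    """Effective SNR when `n_blocks` quantization-noise sources add to AWGN.

    Assumes unit signal power and mutually independent noise terms.
    """
    noise = 1 / db_to_linear(channel_snr_db) + n_blocks / db_to_linear(sqnr_db)
    return -10 * math.log10(noise)


loss_single = 30 - snr_after_quantization(30, 48, n_blocks=1)
loss_cascade = 30 - snr_after_quantization(30, 48, n_blocks=2)
print(f"one block: {loss_single:.3f} dB, IFFT + FFT: {loss_cascade:.3f} dB")
```

Under these assumptions the loss at 30 dB SNR comes out on the order of 0.1 dB, the same order as the figure the paper reports from its own simulation.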

### V. COMPARISON

• For comparisons, the authors choose two FFT processor chips which can handle consecutive 2048-point FFT computations [8], [9].
• To compare FFT processor chips fabricated with different technologies, the authors adopt the normalized area and FFTs per energy [2] as performance indices, as shown in eqs. (3) and (4).
• Note that eq. (4) has been adapted to take account of the voltage scaling.
• The FFT processor in [9] uses a shorter wordlength of 12 bits because it supports only 9-bit input.
• Neither design [8], [9] employs a cache to reduce the power of memory accesses.
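
The two indices can be evaluated directly from the chip data in the paper's comparison table. The sketch below follows eqs. (3) and (4) as normalized area = area / (technology / 0.13 μm)² and normalized FFTs per energy = (technology / 0.13 μm) × (V_DD / 1.2)² / (power × execution time × 10³). The units (watts, seconds, mm²) are an assumption chosen because they reproduce the tabulated values; the function names are hypothetical.

```python
def normalized_area(area_mm2, technology_um):
    """Eq. (3): core area scaled to a 0.13 um process."""
    return area_mm2 / (technology_um / 0.13) ** 2


def normalized_ffts_per_energy(technology_um, vdd_volt, power_watt, exec_time_s):
    """Eq. (4): FFTs per energy, scaled for technology and supply voltage."""
    scale = (technology_um / 0.13) * (vdd_volt / 1.2) ** 2
    return scale / (power_watt * exec_time_s * 1e3)


# (technology in um, VDD in V, power in W, 1024-point time in s, core area in mm^2)
chips = {
    "This work": (0.13, 1.2, 17.26e-3, 22.48e-6, 2.12),
    "Zhong [8]": (0.25, 2.5, 400e-3, 26.4e-6, 11.42),
    "Lin [9]": (0.35, 3.3, 640e-3, 45.06e-6, 13.05),
}

results = {}
for name, (tech, vdd, power, time_s, area) in chips.items():
    results[name] = (
        round(normalized_ffts_per_energy(tech, vdd, power, time_s)),
        round(normalized_area(area, tech), 2),
    )
    print(name, results[name])
```

For the two reference chips the sketch matches both tabulated indices. For this work it matches the FFTs-per-energy entry but returns 2.12 for eq. (3), whereas the table lists 1.06; the text does not explain that entry, so the sketch does not try to reproduce it.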

### VI. CONCLUSION

• A block scaling MIMO FFT/IFFT processor for WiMAX applications has been proposed in this paper.
• It can support two 2048-point FFT/IFFT computations simultaneously within 2052 clock cycles.
• Moreover, with a novel block scaling method and a new ping-pong cache-memory architecture, both power consumption and hardware cost can be greatly reduced.
• A test chip has been designed using UMC 0.13 μm 1P8M process.
• Simulation results show that the proposed FFT processor consumes only 17.26 mW at 22.86 MHz, which meets the maximum throughput rate of WiMAX applications.

A Block Scaling FFT/IFFT Processor for WiMAX Applications
Yuan Chen, Yu-Wei Lin, and Chen-Yi Lee
National Chiao Tung University, Hsinchu, Taiwan
MediaTek Inc., Hsinchu, Taiwan
Email: ychen@si2lab.org
Abstract-This paper presents a low-power design of a two-
stream MIMO FFT/IFFT processor for WiMAX applications. A
novel block scaling method and a new ping-pong cache-memory
architecture are proposed to reduce the power consumption and
hardware cost. With these schemes, half the memory accesses
and 64-Kbit memory can be saved. Furthermore, by proper
scheduling of the two data streams, the proposed design achieves
better hardware utilization and can process two 2048-point
FFTs/IFFTs consecutively within 2052 cycles. A test chip of the
proposed FFT/IFFT processor has been designed using UMC
0.13 μm 1P8M process with a core area of 1332×1590 μm². The
SQNR performance of the 2048-point FFT/IFFT is over 48 dB
for QPSK and 16/64-QAM modulations. Power dissipation of
two 2048-point FFT computations is about 17.26 mW at 22.86
MHz which meets the maximum throughput rate of WiMAX
applications.
I. INTRODUCTION
Multiple-input multiple-output orthogonal frequency
division multiplexing (MIMO OFDM) is considered a key
technology in high-throughput transmissions over wireless
fading channels. The emerging WiMAX/IEEE 802.16
standard has employed this technology in its physical-layer
specification to provide broadband wireless access services.
In the specification, scalable channel bandwidths from 1.25 to
20 MHz by adjusting FFT size (from 128 to 2048-point) are
employed for different applications. Three modulation types
(QPSK, 16/64-QAM) and four guard-interval modes (1/4,
1/8, 1/16, 1/32) are also supported to further increase the
system scalability. A block diagram of a 2×2 MIMO
transceiver for WiMAX applications is shown in Fig. 1. By
processing two data streams with duplicated antennas and
functional units, the peak data rate of the 2×2 MIMO
transceiver can be doubled compared to that of a
single-input single-output (SISO) transceiver.
To support a MIMO transceiver for WiMAX applications,
a variable-length FFT/IFFT processor capable of processing
multiple data streams is required. Since 2×2 MIMO with time
division duplex (TDD) mode is defined in the WiMAX
Forum Release-1 system profiles [1], a two-stream 128/256/
512/1024/2048-point FFT/IFFT processor is considered in
this paper. Moreover, because power consumption is critical
for portable systems, the FFT/IFFT processor for WiMAX
applications should be power-efficient. Much research on
low-power FFT design has employed
the cached-memory architecture to reduce memory
accesses [2], [3]. However, the increase in wordlength [2] or
idle cycles [3] still adds to power consumption and
hardware cost. To solve these problems, a novel block scaling
method and a new ping-pong cache-memory architecture are
exploited in our proposed FFT/IFFT processor. With these
schemes, half the memory accesses and 64-Kbit memory (4
bits in wordlength) can be saved without inducing idle cycles.
Moreover, by proper scheduling of the two data streams, the
proposed FFT/IFFT processor avoids stalls of function units
and thus achieves better hardware utilization. Two-stream
2048-point FFTs/IFFTs can be computed consecutively
within 2052 processing cycles.
Fig. 1. Block diagram of a 2×2 MIMO transceiver for WiMAX applications.
II. ALGORITHM
The N-point discrete Fourier transform (DFT) of a complex
input sequence x(n) can be defined as:
$$X(k) = \sum_{n=0}^{N-1} x(n)\,W_N^{kn}, \qquad k = 0, 1, 2, \ldots, N-1 \tag{1}$$
where $W_N^{kn} = e^{-j2\pi kn/N}$ is referred to as the twiddle factor. To
reduce the number of complex multiplications, radix-8
algorithm is chosen to carry out the DFT [4]. Here we take the
longest 2048-point DFT in the design as an example. Since
2048 is not a power of 8, we decompose the 2048-point DFT
into three radix-8 stages and a final radix-4 stage as shown in
the following equation:
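$$
\begin{aligned}
&X(k_1 + 8k_2 + 64k_3 + 512k_4) \\
&\quad= \sum_{n_4=0}^{3}\Biggl\{\sum_{n_3=0}^{7}\biggl\{\sum_{n_2=0}^{7}\Bigl\{\sum_{n_1=0}^{7} x(256n_1 + 32n_2 + 4n_3 + n_4)\,W_8^{k_1n_1}\Bigr\}\,W_{2048}^{k_1(32n_2+4n_3+n_4)}\,W_8^{k_2n_2}\biggr\}\,W_{256}^{k_2(4n_3+n_4)}\,W_8^{k_3n_3}\Biggr\}\,W_{32}^{k_3n_4}\,W_4^{k_4n_4}
\end{aligned}
\tag{2}
$$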
where $k_1, k_2, k_3 = 0, 1, 2, \ldots, 7$ and $k_4 = 0, 1, 2, 3$. Similarly, the 128/256/512/1024-point DFTs can also be decomposed into preceding radix-8 stages and a final radix-8/4/2 stage, depending on the DFT size. Although a high-radix algorithm is effective in reducing the number of complex multiplications, its hardware is very complex if directly implemented. Thus we employ radix-2³ and radix-2² [4] to replace radix-8 and radix-4, respectively. A signal flow graph (SFG) of the 32-point radix-2³/2² FFT is shown in Fig. 2 as an example. The figure shows that a full 32-point FFT is completed by one radix-2³ and multiplication stage and one radix-2² stage. With these steps, we can decompose the 2048-point FFT into three radix-2³ and multiplication stages and a final radix-2² stage for further hardware implementations.

0-7803-9735-5/06/20.00 ©2006 IEEE

Fig. 2. SFG of a 32-point radix-2³/2² FFT.

A. Block Scaling Method
Block floating-point (BFP) [5] is an efficient way to reduce the wordlength by increasing the dynamic range compared to the fixed-point format. The behavior of BFP is similar to that of floating-point except that a single exponent is used for a group of data. Although BFP is often adopted in memory-based FFT processors to save hardware cost and power, it is not suited to cached-memory FFT processors because of the interleaved processing stages [5]. To solve this problem, a dynamic scaling FFT processor [3] was proposed that employs multiple exponents for cache-size blocks. While the dynamic scaling approach reduces the wordlength satisfactorily, it still has two drawbacks. Since the exponent position can be determined only after all cached data are processed, some clock cycles are wasted. Also, the internal wordlength of both arithmetic units and cache needs to be extended to prevent overflows.
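
The shared-exponent idea behind BFP can be illustrated with a small integer model. This is a generic sketch, not the scheme of [3] or [5]: it assumes signed integer data, picks the smallest right shift that makes every value in the block fit the mantissa width, and uses the hypothetical names `bfp_encode` and `bfp_decode`.

```python
def bfp_encode(values, mantissa_bits):
    """Quantize a block of integers to signed mantissas sharing one exponent."""
    high = (1 << (mantissa_bits - 1)) - 1
    low = -(1 << (mantissa_bits - 1))
    exponent = 0
    # Grow the shared exponent until the largest value in the block fits.
    while any(not (low <= (v >> exponent) <= high) for v in values):
        exponent += 1
    return [v >> exponent for v in values], exponent


def bfp_decode(mantissas, exponent):
    """Reconstruct approximate integers from mantissas and the shared exponent."""
    return [m << exponent for m in mantissas]


block = [40000, -3, 512, -20000]
mantissas, shared_exponent = bfp_encode(block, mantissa_bits=12)
restored = bfp_decode(mantissas, shared_exponent)
print(mantissas, shared_exponent, restored)
```

Because the exponent is set by the largest value in the block, small values in the same block lose resolution; in this model the error of each value stays below 2 raised to the shared exponent.
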
Thus we propose the block scaling method, which eliminates the increased wordlength and idle cycles by a “detect and scale” approach. Each set of output symbols is scaled right away if an overflow is detected. At the same time, the resulting exponents are saved for data alignment in the next processing stage. Although this method can be realized by saving block exponents for all processing stages, doing so is hardware-consuming. To work out this issue, we scale the final output of the FFT to a predetermined exponent, and thus only 296 exponents need to be stored for the longest 2048-point FFTs. There are two main reasons why this fixed-exponent scheme is feasible. First, because the input symbols are gain-controlled and have specified modulation in OFDM systems, the maximum value of the final FFT output can be expected in advance. Second, in most dedicated OFDM transceiver designs, only fixed-point format is considered due to simpler hardware implementations. As the simulation result in Fig. 3 shows, over four bits can be reduced in wordlength by the proposed method under the same signal-to-quantization-noise ratio (SQNR). We can also find that more than one fourth of the memory size (from 16 bits to 12 bits) can be saved at about 50 dB SQNR.

Fig. 3. SQNR performance of the proposed block scaling method (SQNR in dB versus wordlength in bits, for the fixed-point method and the proposed block-scaling method).

III. ARCHITECTURE
A block diagram of the proposed FFT/IFFT processor is depicted in Fig. 4. It consists of four FFT/IFFT control units, a main memory unit, a processing engine (PE), and a 64-word cache. In this design, a novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce the power consumption and hardware cost.
Besides, since FFT and IFFT have the same operations except for complex-conjugated twiddle factors, we implement IFFT by simply taking conjugates of the FFT input and output [6], as shown in Fig. 4. With these techniques and proper data scheduling, the proposed design can realize two 2048-point FFT/IFFT computations in 2052 clock cycles. Thus, by taking the guard interval of WiMAX systems into account, the proposed FFT/IFFT processor does not need to operate at a multiple of the sampling frequency as the previous cached-memory FFT designs do [2], [3]. The modules of the proposed design are described in more detail below.

Fig. 4. Block diagram of the proposed two-stream FFT/IFFT processor.

A. Main Memory
For memory-based FFT processors supporting consecutive I/O, multiple main memories are needed as computation and I/O buffers [7]. To reduce the total memory size, the continuous flow (CF) memory architecture was proposed [7], in which only two N-word memories are required for an N-point FFT. Although CF FFT can reduce memory size by doing I/O operations concurrently in a single memory, it requires additional controls for memory addressing and butterfly units (BU). This is because the original CF FFT adopts radix-4 and radix-2 algorithms, which have different bit-reverse orders. In our proposed design, however, the CF memory architecture causes no problem, since the radix-2³ and radix-2² algorithms have the same bit-reverse order as the radix-2 algorithm [4]. As shown in Fig. 4, one 4096-word SRAM works as the I/O buffer while the other one works as the processing buffer, and vice versa. Each SRAM is further partitioned into eight banks to support eight simultaneous accesses for radix-2³ algorithms.

B. Ping-Pong Cache-Memory Architecture
Cached-memory FFT [2], [3] was proposed for low power consumption by reducing the memory accesses. As shown in Fig. 5, data are first read from main memory and then sent to the cache.
By proper data scheduling, the PE can perform multiple-stage processing by accessing the local cache instead of the main memory. Although cached-memory FFT can reduce memory accesses effectively, a concurrent read/write cache with complex control is required to increase the throughput. Thus we propose the ping-pong cache-memory architecture, which uses a simple cache with single read/write operations. As illustrated in Fig. 6, data read from the main memory are processed by the PE first and then written to the cache for future use. After the cache is full, data in the cache are read by the PE and the computed results are stored back to the main memory. Since the radix-2³ algorithm is adopted in the proposed design, a 64-word cache is employed to support two-stage radix-2³ processing. By using this scheme, half the memory accesses can be saved. Moreover, the ping-pong cache-memory has shorter latency compared to the cached-memory, which is beneficial in scheduling data streams.

Fig. 5. Cached-memory architecture.

Fig. 6. Proposed ping-pong cache-memory architecture.

C. Processing Engine (PE)
The PE is designed to perform radix-2³/2²/2 butterfly operations and complex multiplications with the proposed block scaling approach, as shown in Fig. 7. Since variable-length FFT must be supported and the final stage can be radix-2³, radix-2², or radix-2 as described earlier, a configurable radix-2³/2²/2 butterfly unit capable of processing one radix-2³, two radix-2², or four radix-2 operations is adopted. We use the 2048-point FFT mode to describe the control of the PE. At the first processing stage, since the inputs have the same decimal point, data alignments are skipped. Input data are processed by the radix-2³ BU directly and then passed to the first overflow detection and scaling unit (ODSU1) in Fig. 7. If an overflow is detected, all eight inputs will be scaled and the corresponding shift in exponent is sent to the block scaling unit.
Afterward, the output of ODSU1 is sent to the complex multipliers for twiddle factor multiplications. The outputs of the complex multipliers are passed to the second overflow detection and scaling unit (ODSU2) in Fig. 7, where the same operation as in ODSU1 is performed. The second and third stages have similar control flows to stage 1. For stage 4, after the inputs are aligned in decimal point for processing, two radix-2² operations are performed. At this stage, however, only scaling is performed in ODSU1, since the final output is fixed-exponent in our proposed block scaling algorithm. Complex multiplications and ODSU2 are also skipped in this stage because no twiddle factor multiplication is required at the final stage, as shown previously in Fig. 2. The detailed control flow for all 128~2048 FFT modes is summarized in Table I.

Fig. 7. Block diagram of the processing engine.

TABLE I. PE control for 128~2048-point FFT/IFFT.

| | First Stage | Intermediate Stage(s) | Final Stage |
|---|---|---|---|
| Alignment | Bypass | ON | ON |
| Configurable BU | Radix-2³ | Radix-2³ | Radix-2³ for 512 FFT; Radix-2² for 256/2048 FFT; Radix-2 for 128/1024 FFT |
| ODSU1 | Detection & Scaling | Detection & Scaling | Scaling |
| Multiplier | ON | ON | Bypass |
| ODSU2 | Detection & Scaling | Detection & Scaling | Bypass |
| Block scaling unit | Exponent store | Alignment control & Exponent store | Alignment control & ODSU1 control |

IV. CHIP IMPLEMENTATION
A test chip of the proposed block scaling FFT/IFFT processor (2048-point mode) is implemented using UMC 0.13 μm 1P8M CMOS technology for verification. The core size is 1332×1590 μm², as shown in Fig. 8. From post-layout prime power simulation, it is shown that the proposed FFT/IFFT consumes only 17.26 mW at 22.86 MHz when performing two 2048-point FFT computations consecutively for WiMAX applications. The SQNR performance of the 2048-point FFT/IFFT has also been verified to exceed 48 dB for QPSK and 16/64-QAM signals.
Thus the implementation loss of cascaded IFFT and FFT is only 0.1 dB with AWGN at 30 dB SNR, which satisfies our design target for WiMAX applications. The detailed power profiling and chip summary are shown in Fig. 9.

Fig. 8. Chip layout of the proposed FFT/IFFT processor (labeled blocks: two 4096-word SRAMs, BSU, ROM, two caches, BU, M1 to M7, and controller).

Fig. 9. Power profiling and chip summary of the proposed processor.

V. COMPARISON
For comparisons, we choose two FFT processor chips which can handle consecutive 2048-point FFT computations [8], [9]. Since these two chips cannot support multiple data streams and only complete results for the 1024-point FFT are listed, the comparisons of execution time and power are based on a single-stream 1024-point FFT. Besides, to compare FFT processor chips fabricated with different technologies, we adopt the normalized area and FFTs per energy [2] as our performance indices, shown in eqs. (3) and (4). Note that eq. (4) has been adapted to take account of the voltage scaling.

$$\text{Normalized Area} = \frac{\text{Area}}{(\text{Technology}/0.13\,\mu\text{m})^{2}} \tag{3}$$

$$\text{Normalized}\ \frac{\text{FFTs}}{\text{Energy}} = \frac{(\text{Technology}/0.13\,\mu\text{m}) \times (V_{DD}/1.2)^{2}}{\text{Power} \times \text{Execution Time} \times 10^{3}} \tag{4}$$

The comparison results are summarized in Table II. The FFT processor in [9] uses a shorter wordlength of 12 bits because it supports only 9-bit input. The processor in [8] employs the BFP approach, and thus its wordlength is not increased. However, neither design [8], [9] employs a cache to reduce the power of memory accesses. This comparison shows that our proposal achieves a satisfactory result in both normalized area and FFTs per energy, which justifies the feasibility of the proposed method.

TABLE II. Chip comparison of various 2048-point FFT processors.
| | This Work | Zhong [8] | Lin [9] *3 |
|---|---|---|---|
| Technology | 0.13 μm | 0.25 μm | 0.35 μm |
| Supported FFT/IFFT (consecutive) | Two 2048-point FFTs/IFFTs *1 | 8~2048-point FFT | 512~2048-point FFT |
| Cache design | Yes | No | No |
| Scaling/BFP design | Block scaling | BFP | No |
| Input bit width | 12 bits | 16 bits | 9 bits |
| Wordlength | 12 bits | 16 bits | 12 bits |
| Core voltage | 1.2 V | 2.5 V | 3.3 V |
| Clock rate | 22.86 MHz | 200 MHz | 45.45 MHz |
| Execution time (1024-point) | 22.48 μs *2 | 26.4 μs | 45.06 μs |
| Power (1024-point) | 17.26 mW *2 | 400 mW | 640 mW |
| Core area | 2.12 mm² | 11.42 mm² | 13.05 mm² |
| Normalized 1024-point FFTs/Energy | 2577 *2 | 790 | 706 |
| Normalized area | 1.06 | 3.09 | 1.80 |

*1: Can be extended to 128~2048-point by adding control modes.
*2: Normalized from data of two 2048-point FFTs.
*3: The bit-reverse memory is not included.

VI. CONCLUSION
A block scaling MIMO FFT/IFFT processor for WiMAX applications has been proposed in this paper. It can support two 2048-point FFT/IFFT computations simultaneously within 2052 clock cycles. Moreover, with a novel block scaling method and a new ping-pong cache-memory architecture, both power consumption and hardware cost can be greatly reduced. A test chip has been designed using UMC 0.13 μm 1P8M process. Simulation results show that the proposed FFT processor consumes only 17.26 mW at 22.86 MHz, which meets the maximum throughput rate of WiMAX applications.

ACKNOWLEDGMENT
This work was supported by the National Science Council of Taiwan under Grant NSC94-2215-E-009-044 and by ICL/ITRI under Grant 5352BA5115.

REFERENCES
[1] WiMAX Forum, Mobile WiMAX-Part I: A technical overview and performance evaluations, Feb. 21, 2006.
[2] B. M. Baas, “A low-power, high-performance, 1024-point FFT processor,” IEEE J. Solid-State Circuits, vol. 34, pp. 380-387, Mar. 1999.
[3] Y.-W. Lin, H.-Y. Liu, and C.-Y. Lee, “A dynamic scaling FFT processor for DVB-T applications,” IEEE J. Solid-State Circuits, vol. 39, pp. 2005-2013, Nov. 2004.
[4] S. He and M. Torkelson, “Designing pipeline FFT processor for OFDM (de)modulation,” in Proc. Int. Symp. Signals, Systems, and Electronics, 29 Sept.-2 Oct. 1998, pp. 257-262.
[5] B. M. Baas, “An approach to low-power, high-performance, fast Fourier transform processor design,” PhD dissertation, Stanford University, Stanford, CA, 1999.
[6] K. Maharatna, E. Grass, and U. Jagdhold, “A 64-point Fourier transform chip for high-speed wireless LAN application using OFDM,” IEEE J. Solid-State Circuits, vol. 39, pp. 484-493, Mar. 2003.
[7] B. G. Jo and M. H. Sunwoo, “New continuous-flow mixed radix (CFMR) FFT using novel in-place strategy,” IEEE Trans. Circuits Syst., vol. 52, pp. 911-919, May 2005.
[8] G. Zhong, F. Xu, and A. N. Willson Jr., “A power-scalable reconfigurable FFT/IFFT IC based on a multi-processor ring,” IEEE J. Solid-State Circuits, vol. 41, pp. 483-495, Feb. 2006.
[9] Y.-T. Lin, P.-Y. Tsai, and T.-D. Chiueh, “Low-power variable-length fast Fourier transform processor,” in Proc. Comput. Digit. Tech., vol. 152, no. 4, pp. 499-506, July 2005.