# A Block Scaling FFT/IFFT Processor for WiMAX Applications

TL;DR: A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce power consumption and hardware cost. By properly scheduling the two data streams, the proposed design also achieves better hardware utilization.

Abstract: This paper presents a low-power design of a two-stream MIMO FFT/IFFT processor for WiMAX applications. A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce the power consumption and hardware cost. With these schemes, half the memory accesses and 64-Kbit memory can be saved. Furthermore, by proper scheduling of the two data streams, the proposed design achieves better hardware utilization and can process two 2048-point FFTs/IFFTs consecutively within 2052 cycles. A test chip of the proposed FFT/IFFT processor has been designed using UMC 0.13 μm 1P8M process with a core area of 1332 × 1590 μm². The SQNR performance of the 2048-point FFT/IFFT is over 48 dB for QPSK and 16/64-QAM modulations. Power dissipation of two 2048-point FFT computations is about 17.26 mW at 22.86 MHz, which meets the maximum throughput rate of WiMAX applications.

## Summary (2 min read)

### Introduction

- This paper presents a low-power design of a two-stream MIMO FFT/IFFT processor for WiMAX applications.
- With these schemes, half the memory accesses and 64-Kbit memory (4 bits in wordlength) can be saved without inducing idle cycles.
- Moreover, by proper scheduling of the two data streams, the proposed FFT/IFFT processor avoids stalls of function units and thus achieves better hardware utilization.

### II. ALGORITHM

- To reduce the number of complex multiplications, the radix-8 algorithm is chosen to carry out the DFT [4].
- Its hardware is very complex if directly implemented.
- Thus the authors employ radix-2³ and radix-2² [4] to replace radix-8 and radix-4, respectively.
- With these steps, the authors decompose the 2048-point FFT into three radix-2³ and multiplication stages and a final radix-2² stage for further hardware implementation.
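The stage decomposition described above can be sketched as a small helper. This is only an illustration, assuming the rule "as many radix-8 stages as possible, then one final radix-4 or radix-2 stage"; the function name `stage_plan` is hypothetical and does not come from the paper.

```python
def stage_plan(n):
    """Return the radix of each FFT stage for a power-of-two size n.

    Assumption (not taken verbatim from the paper): use as many radix-8
    stages as possible, then one final radix-4 or radix-2 stage for the
    leftover factor.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two >= 2")
    bits = n.bit_length() - 1          # log2(n)
    plan = [8] * (bits // 3)           # each radix-8 stage consumes 3 bits
    leftover = bits % 3
    if leftover:
        plan.append(1 << leftover)     # final radix-2 or radix-4 stage
    return plan


if __name__ == "__main__":
    for size in (128, 256, 512, 1024, 2048):
        print(size, stage_plan(size))
```

For 2048 points the helper yields three radix-8 stages followed by one radix-4 stage, which matches the three radix-2³ stages and the final radix-2² stage named in the summary.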

### A. Block Scaling Method

- Block floating-point (BFP) [5] is an efficient way to reduce the wordlength by increasing the dynamic range compared to the fixed-point format.
- To solve this problem, a dynamic scaling FFT processor [3] is proposed by employing multiple exponents for cache-size blocks.
- While the dynamic scaling approach reduces the wordlength satisfactorily, it still has two drawbacks.
- At the same time, the resulting exponents are saved for data alignment in the next processing stage.
- First, because the input symbols are gain-controlled and have specified modulation in OFDM systems, the maximum value of the final FFT output can be expected in advance.
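To make the idea of a shared block exponent concrete, here is a minimal sketch of generic block floating-point scaling on integers. It is not the paper's block scaling method: the function names `block_scale` and `align`, the choice of right shifts, and the truncation behaviour are all assumptions made for illustration.

```python
def block_scale(block, wordlength):
    """Scale a block of integers with one shared exponent.

    Every value is right-shifted by the same amount so that the largest
    magnitude fits in a signed `wordlength`-bit word. Python's >> is an
    arithmetic shift, so negative values round toward minus infinity.
    """
    limit = (1 << (wordlength - 1)) - 1
    peak = max(abs(v) for v in block)
    exponent = 0
    while (peak >> exponent) > limit:
        exponent += 1
    return [v >> exponent for v in block], exponent


def align(mantissas, exponent, target_exponent):
    """Re-express a scaled block with a larger shared exponent.

    This mimics aligning blocks that were scaled differently before they
    are combined in a later stage.
    """
    shift = target_exponent - exponent
    if shift < 0:
        raise ValueError("target exponent must not be smaller")
    return [m >> shift for m in mantissas]


if __name__ == "__main__":
    scaled, exp = block_scale([100, -3000, 7], wordlength=8)
    print(scaled, exp)
    print(align(scaled, exp, exp + 1))
```

The stored exponent is what allows a later stage to line up blocks that were scaled by different amounts, which is the role the summary assigns to the saved exponents.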

### III. ARCHITECTURE

- Block diagram of the proposed FFT/IFFT processor is depicted in Fig.
- It consists of four FFT/IFFT control units, a main memory unit, a processing engine (PE), and a 64-word cache.
- A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce the power consumption and hardware cost.
- With these techniques and proper data scheduling, the proposed design can realize two 2048-point FFT/IFFT computations in 2052 clock cycles.
- The modules of the proposed design will be described in more detail below.

### A. Main Memory

- For memory-based FFT processors supporting consecutive I/O, multiple main memories are needed as computation and I/O buffers [7].
- To reduce the total memory size, the continuous flow (CF) memory architecture is proposed [7] where only two N-word memories are required for N-point FFT.
- This is because the original CF FFT adopts radix-4 and radix-2 algorithms which have different bit-reverse orders.
- In the proposed design, however, the CF memory architecture causes no problem, since the radix-2³ and radix-2² algorithms have the same bit-reverse order as the radix-2 algorithm [4].
- As shown in Fig. 4, one 4096-word SRAM works as the I/O buffer while the other one works as the processing buffer, and vice versa.
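The radix-2 bit-reverse order mentioned above can be illustrated with a short sketch. It shows only the standard bit-reversal permutation, not the authors' address generator; the helper name `bit_reverse` is hypothetical.

```python
def bit_reverse(index, num_bits):
    """Reverse the lowest `num_bits` bits of `index`."""
    result = 0
    for _ in range(num_bits):
        result = (result << 1) | (index & 1)   # move the lowest bit over
        index >>= 1
    return result


if __name__ == "__main__":
    # Output order of an 8-point radix-2 FFT (3 address bits).
    print([bit_reverse(i, 3) for i in range(8)])
    # A 2048-point FFT uses 11 address bits.
    print(bit_reverse(1, 11))
```

Because every stage shares this single ordering, the same address mapping applies whichever memory currently acts as the I/O buffer.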

### B. Ping-Pong Cache-Memory Architecture

- Cached-memory FFT [2], [3] is proposed for low power consumption by reducing the memory accesses.
- A concurrent read/write cache with complex control is required to increase the throughput.
- Thus the authors propose the ping-pong cache-memory architecture which uses a simple cache with single read/write operations.
- By using this scheme, half the memory accesses can be saved.

### C. Processing Engine (PE)

- The PE is designed to perform radix-2³/2²/2 butterfly operations and complex multiplications with the proposed block scaling approach as shown in Fig.
- At the first processing stage, since the inputs have the same decimal point, data alignments are skipped.
- Afterward, the output of ODSU1 is sent to the complex multipliers for twiddle factor multiplications.
- The second and third stages have similar control flows as stage 1.

| | First Stage | Intermediate Stage(s) | Final Stage |
|---|---|---|---|
| Alignment | Bypass | ON | ON |
| Configurable BU | Radix-2³ | Radix-2³ | Radix-2³ for 512 FFT; Radix-2² for 256/2048 FFT |

### IV. CHIP IMPLEMENTATION

- A test chip of the proposed block scaling FFT/IFFT processor (2048-point mode) is implemented using UMC 0.13 μm 1P8M CMOS technology for verification.
- Post-layout PrimePower simulation shows that the proposed FFT/IFFT processor consumes only 17.26 mW at 22.86 MHz when performing two 2048-point FFT computations consecutively for WiMAX applications.
- The SQNR performance of the 2048-point FFT/IFFT has also been verified to exceed 48 dB for QPSK and 16/64-QAM signals.
- Thus the implementation loss of cascaded IFFT and FFT is only 0.1 dB with AWGN at 30 dB SNR, which satisfies their design target for WiMAX applications.
- The detailed power profiling and chip summary of the proposed processor are shown in Fig. 9.
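The reported figures can be combined into a rough energy estimate. The arithmetic below is a back-of-the-envelope sketch, assuming the processor is continuously busy and that the 17.26 mW figure applies over the whole 2052-cycle window; the variable names are illustrative and the derived values are not results stated in the paper.

```python
# Figures quoted in the summary.
CYCLES = 2052          # cycles for two 2048-point FFTs
CLOCK_HZ = 22.86e6     # operating frequency
POWER_W = 17.26e-3     # reported power dissipation
POINTS = 2048
STREAMS = 2

# Derived quantities (assumption: power is constant over the window).
duration_s = CYCLES / CLOCK_HZ
energy_j = POWER_W * duration_s
samples_per_s = STREAMS * POINTS / duration_s

if __name__ == "__main__":
    print(f"duration: {duration_s * 1e6:.2f} us")
    print(f"energy for two FFTs: {energy_j * 1e6:.3f} uJ")
    print(f"throughput: {samples_per_s / 1e6:.2f} Msamples/s")
```

Under these assumptions the pair of transforms takes roughly 89.8 μs and about 1.55 μJ.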

### V. COMPARISON

- For comparisons, the authors choose two FFT processor chips which can handle consecutive 2048-point FFT computations [8], [9].
- Besides, to compare the FFT processor chips fabricated with different technologies, the authors adopt the normalized area and FFTs per energy [2] as their performance indices shown in eqs.
- Note that eq. (4) has been adapted to take account of the voltage scaling.
- The authors note that the FFT processor in [9] uses a shorter wordlength of 12 bits since it only supports 9-bit input.
- Neither design [8], [9] employs a cache to reduce the power of memory accesses.

### VI. CONCLUSION

- A block scaling MIMO FFT/IFFT processor for WiMAX applications has been proposed in this paper.
- It can support two 2048-point FFT/IFFT computations simultaneously within 2052 clock cycles.
- Moreover, with a novel block scaling method and a new ping-pong cache-memory architecture, both power consumption and hardware cost can be greatly reduced.
- A test chip has been designed using UMC 0.13 μm 1P8M process.
- Simulation results show that the proposed FFT processor consumes only 17.26 mW at 22.86 MHz, which meets the maximum throughput rate of WiMAX applications.

##### Citations

95 citations

### Cites background from "A Block Scaling FFT/IFFT Processor ..."

...There have been several studies on low-power FFT processors for MIMO OFDM systems [1], [2] which focus on reducing the peak power at maximum throughput rate....

[...]

62 citations

### Cites background from "A Block Scaling FFT/IFFT Processor ..."

...1 can be expressed as: $W_{16}^{2}(a+jb) = \sqrt{2}\,[(a+b)+j(b-a)]/2$, (2) where (a + jb) denotes a discrete-time signal in complex form....

[...]

...The radix-2 DIF FFT described above appears regularity in SFG and has less complex multipliers required....

[...]

61 citations

### Additional excerpts

...Recently, higher radix [20], [21] and/or Fig....

[...]

30 citations

### Cites methods from "A Block Scaling FFT/IFFT Processor ..."

...In [19], a ping-pong CM architecture that eliminates load/flush cycles into/from cache was proposed; in [20], the ping-pong CM architecture and a memory partition design were adopted to reduce power consumption in different operation modes in a digital video broadcasting-terrestrial/handheld (DVB-T/H) system; in [21], a three-level memory (two-level cache) architecture was presented to improve energy efficiency....

[...]

##### References

316 citations

### "A Block Scaling FFT/IFFT Processor ..." refers methods in this paper

...Thus we employ radix-2³ and radix-2² [4] to replace radix-8 and radix-4, respectively....

[...]

...To reduce the number of complex multiplications, radix-8 algorithm is chosen to carry out the DFT [4]....

[...]

...Here we take the longest 2048-point DFT in the design as an example....

[...]

...Similarly, 128/256/512/1024-point DFT can also be decomposed to preceding radix-8 stages and a final radix-8/4/2 stage depending on the DFT size.... [page footer: 0-7803-9735-5/06/$20.00 ©2006 IEEE]

[...]

311 citations

### "A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper

...Thus by taking the guard interval of WiMAX systems into account, the proposed FFT/IFFT processor does not need to operate in a multiple sampling frequency as the previous cached-memory FFT designs do [2], [3]....

[...]

...However, the increase in wordlength [2] or idle cycles [3] still causes wastes in power consumption and hardware cost....

[...]

...Besides, to compare the FFT processor chips fabricated with different technologies, we adopt the normalized area and FFTs per energy [2] as our performance indices shown in eqs....

[...]

...Cached-memory FFT [2], [3] is proposed for low power consumption by reducing the memory accesses....

[...]

...There have been many researches on low-power FFT designs by employing the cached-memory architecture to reduce the memory accesses [2], [3]....

[...]

159 citations

### "A Block Scaling FFT/IFFT Processor ..." refers methods in this paper

...Besides, since FFT and IFFT have the same operations except for complex-conjugated twiddle factors, we implement IFFT by simply taking conjugates of FFT input/output [6] as shown in Fig....

[...]

122 citations

### "A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper

...For memory-based FFT processors supporting consecutive I/O, multiple main memories are needed as computation and I/O buffers [7]....

[...]

...To reduce the total memory size, the continuous flow (CF) memory architecture is proposed [7] where only two N-word memories are required for N-point FFT....

[...]

109 citations

### "A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper

...Thus by taking the guard interval of WiMAX systems into account, the proposed FFT/IFFT processor does not need to operate in a multiple sampling frequency as the previous cached-memory FFT designs do [2], [3]....

[...]

...However, the increase in wordlength [2] or idle cycles [3] still causes wastes in power consumption and hardware cost....

[...]

...dynamic scaling FFT processor [3] is proposed by employing multiple exponents for cache-size blocks....

[...]

...Cached-memory FFT [2], [3] is proposed for low power consumption by reducing the memory accesses....

[...]

...There have been many researches on low-power FFT designs by employing the cached-memory architecture to reduce the memory accesses [2], [3]....

[...]