A Block Scaling FFT/IFFT Processor for WiMAX Applications
Summary
I. INTRODUCTION
- The paper proposes a two-stream MIMO FFT/IFFT processor for WiMAX applications.
- With the proposed block scaling method and ping-pong cache-memory architecture, half the memory accesses and 64 Kbit of memory (4 bits of wordlength) can be saved without introducing idle cycles.
- Moreover, by proper scheduling of the two data streams, the proposed FFT/IFFT processor avoids stalls of function units and thus achieves better hardware utilization.
II. ALGORITHM
- To reduce the number of complex multiplications, the radix-8 algorithm is chosen to carry out the DFT [4].
- Implemented directly, however, radix-8 hardware is very complex.
- Thus the authors employ radix-2³ and radix-2² [4] to replace radix-8 and radix-4, respectively.
- With these steps, the authors decompose the 2048-point FFT into three radix-2³ stages, each followed by a multiplication stage, and a final radix-2² stage for hardware implementation.
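The stage decomposition described above can be sketched in a few lines of Python. This is only an illustration: the helper name `decompose_radix` is hypothetical, and the sketch assumes nothing beyond a power-of-two FFT size split into radix-8 stages followed by a final radix-8/4/2 stage. In the hardware, the radix-8 and radix-4 stages correspond to radix-2³ and radix-2² stages.

```python
def decompose_radix(n):
    """Split a power-of-two FFT size into radix-8 stages plus a final stage.

    Hypothetical helper for illustration; returns the radix of each stage.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("FFT size must be a power of two >= 2")
    log2_n = n.bit_length() - 1
    full, rest = divmod(log2_n, 3)   # each radix-8 stage consumes 3 index bits
    stages = [8] * full
    if rest:
        stages.append(2 ** rest)     # leftover bits give a radix-4 or radix-2 stage
    return stages

print(decompose_radix(2048))  # [8, 8, 8, 4]
```

For the 2048-point case this gives three radix-8 stages and one radix-4 stage, matching the decomposition in the text.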
A. Block Scaling Method
- Block floating-point (BFP) [5] is an efficient way to reduce the wordlength by increasing the dynamic range compared to the fixed-point format.
- To solve this problem, a dynamic scaling FFT processor [3] is proposed by employing multiple exponents for cache-size blocks.
- While the dynamic scaling approach reduces the wordlength satisfactorily, it still has two drawbacks.
- At the same time, the resulting exponents are saved for data alignment in the next processing stage.
- First, because the input symbols are gain-controlled and have specified modulation in OFDM systems, the maximum value of the final FFT output can be expected in advance.
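To make the block floating-point idea concrete, here is a minimal sketch of encoding a block of values with one shared exponent. It shows generic BFP, not the authors' block scaling method; the function names and the 12-bit mantissa width in the usage lines are assumptions chosen for illustration.

```python
import math

def bfp_encode(block, mantissa_bits):
    """Encode real values as signed integer mantissas sharing one exponent."""
    peak = max(abs(x) for x in block)
    limit = 2 ** (mantissa_bits - 1) - 1      # largest positive mantissa
    if peak == 0:
        return [0] * len(block), 0
    # Smallest exponent for which the peak value still fits in the mantissa.
    exp = math.ceil(math.log2(peak / limit))
    mantissas = [round(x / 2 ** exp) for x in block]
    return mantissas, exp

def bfp_decode(mantissas, exp):
    """Recover approximate values: mantissa * 2**exp."""
    return [m * 2 ** exp for m in mantissas]

mantissas, exp = bfp_encode([0.5, -0.25, 0.001, 0.75], mantissa_bits=12)
print(mantissas, exp)  # [1024, -512, 2, 1536] -11
```

The small value 0.001 keeps only two significant levels because the exponent is set by the largest element in the block. This is the trade-off that per-block exponents are meant to limit.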
III. ARCHITECTURE
- Block diagram of the proposed FFT/IFFT processor is depicted in Fig.
- It consists of four FFT/IFFT control units, a main memory unit, a processing engine (PE), and a 64-word cache.
- A novel block scaling method and a new ping-pong cache-memory architecture are proposed to reduce the power consumption and hardware cost.
- With these techniques and proper data scheduling, the proposed design can realize two 2048-point FFT/IFFT computations in 2052 clock cycles.
- The modules of the proposed design will be described in more detail below.
A. Main Memory
- For memory-based FFT processors supporting consecutive I/O, multiple main memories are needed as computation and I/O buffers [7].
- To reduce the total memory size, the continuous flow (CF) memory architecture is proposed [7] where only two N-word memories are required for N-point FFT.
- This is because the original CF FFT adopts radix-4 and radix-2 algorithms which have different bit-reverse orders.
- In the proposed design, however, the CF memory architecture causes no problem, since the radix-2³ and radix-2² algorithms have the same bit-reverse order as the radix-2 algorithm [4].
- As shown in Fig. 4, one 4096-word SRAM works as the I/O buffer while the other one works as the processing buffer, and vice versa.
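The radix-2 bit-reverse order mentioned above can be illustrated with a short sketch. The function name `bit_reverse` is hypothetical, and the summary does not describe how the chip actually generates addresses. The sketch only shows the permutation and that applying it twice returns the original index.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result

N_BITS = 11  # 2048-point FFT uses 11 index bits
order = [bit_reverse(i, N_BITS) for i in range(1 << N_BITS)]
print(order[:4])  # [0, 1024, 512, 1536]
```

Because every stage in the design shares this single ordering, the same address mapping applies to all supported stage types.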
B. Ping-Pong Cache-Memory Architecture
- Cached-memory FFT [2], [3] is proposed for low power consumption by reducing the memory accesses.
- A concurrent read/write cache with complex control is required to increase the throughput.
- Thus the authors propose the ping-pong cache-memory architecture which uses a simple cache with single read/write operations.
- By using this scheme, half the memory accesses can be saved.
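As a rough illustration of why a cache reduces main-memory traffic in a cached-memory FFT, the sketch below groups consecutive stages into passes ("epochs") that fit entirely in the cache. This is a generic model built on assumptions, not the authors' ping-pong scheme: the function name is hypothetical, and it assumes each stage group is loaded once, processed in the cache, and written back once.

```python
def group_into_epochs(stage_radices, cache_words):
    """Greedily pack consecutive stages whose combined span fits the cache."""
    epochs, current, span = [], [], 1
    for radix in stage_radices:
        if radix > cache_words:
            raise ValueError("a single stage does not fit in the cache")
        if span * radix > cache_words:
            epochs.append(current)      # cache is full: start a new pass
            current, span = [], 1
        current.append(radix)
        span *= radix
    if current:
        epochs.append(current)
    return epochs

stages = [8, 8, 8, 4]                   # 2048-point decomposition
epochs = group_into_epochs(stages, cache_words=64)
print(epochs)                           # [[8, 8], [8, 4]]
print(len(stages), "stage passes vs", len(epochs), "cached passes")
```

Under this model a 64-word cache cuts four passes over main memory to two. The summary does not establish whether this is the same saving the paper attributes to the ping-pong architecture.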
C. Processing Engine (PE)
- The PE is designed to perform radix-2³/2²/2 butterfly operations and complex multiplications with the proposed block scaling approach as shown in Fig.
- At the first processing stage, since the inputs have the same decimal point, data alignments are skipped.
- Afterward, the output of ODSU1 is sent to the complex multipliers for twiddle factor multiplications.
- The second and third stages have control flows similar to that of stage 1. The per-stage configuration is:

                     First Stage   Intermediate Stage(s)   Final Stage
  Alignment          Bypass        ON                      ON
  Configurable BU    Radix-2³      Radix-2³                Radix-2³ for 512 FFT; Radix-2² for 256/2048 FFT
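The data alignment step can be pictured as bringing blocks with different exponents to a common exponent before they enter the butterfly. The sketch below is an assumption-based illustration, not the PE's actual datapath: `align_blocks` is a hypothetical name, and it simply right-shifts integer mantissas to the largest exponent in the set.

```python
def align_blocks(blocks):
    """Shift every block's mantissas to the largest exponent in the set.

    `blocks` is a list of (mantissas, exponent) pairs, where each value is
    mantissa * 2**exponent. Right shifts drop low-order bits.
    """
    common = max(exp for _, exp in blocks)
    aligned = [[m >> (common - exp) for m in mants] for mants, exp in blocks]
    return aligned, common

aligned, exp = align_blocks([([8, -8], 0), ([4], 1)])
print(aligned, exp)  # [[4, -4], [4]] 1
```

When all blocks already share one exponent, as at the first stage, every shift amount is zero and the step can be bypassed.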
IV. CHIP IMPLEMENTATION
- A test chip of the proposed block scaling FFT/IFFT processor (2048-point mode) is implemented using UMC 0.13 μm 1P8M CMOS technology for verification.
- Post-layout PrimePower simulation shows that the proposed FFT/IFFT processor consumes only 17.26 mW at 22.86 MHz when performing two 2048-point FFT computations consecutively for WiMAX applications.
- The SQNR performance of the 2048-point FFT/IFFT has also been verified to exceed 48 dB for QPSK and 16/64-QAM signals.
- Thus the implementation loss of cascaded IFFT and FFT is only 0.1 dB with AWGN at 30 dB SNR which satisfies their design target for WiMAX applications.
- The detailed power profiling and chip summary of the proposed processor are shown in Fig. 9.
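For readers unfamiliar with the SQNR figure quoted above, the following sketch shows the usual definition: signal power divided by quantization-noise power, in decibels. It is a generic measurement helper with a hypothetical name, not the authors' verification setup, and it accepts real or complex samples.

```python
import math

def sqnr_db(reference, quantized):
    """Signal-to-quantization-noise ratio in dB between two sample sequences."""
    signal = sum(abs(r) ** 2 for r in reference)
    noise = sum(abs(r - q) ** 2 for r, q in zip(reference, quantized))
    if noise == 0:
        return math.inf                 # no quantization error at all
    return 10 * math.log10(signal / noise)

# A 10% amplitude error on the only non-zero sample gives about 20 dB.
print(round(sqnr_db([1.0, 0.0], [0.9, 0.0]), 3))
```

In practice the reference would be a floating-point FFT output and the quantized sequence the fixed-point or block-scaled output for the same input.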
V. COMPARISON
- For comparisons, the authors choose two FFT processor chips which can handle consecutive 2048-point FFT computations [8], [9].
- Besides, to compare the FFT processor chips fabricated with different technologies, the authors adopt the normalized area and FFTs per energy [2] as their performance indices shown in eqs.
- Note that eq. (4) has been adapted to take account of the voltage scaling.
- The FFT processor in [9] uses a shorter wordlength of 12 bits because it supports only 9-bit input.
- Neither design [8], [9] employs a cache to reduce the power of memory accesses.
VI. CONCLUSION
- A block scaling MIMO FFT/IFFT processor for WiMAX applications has been proposed in this paper.
- It can support two 2048-point FFT/IFFT computations simultaneously within 2052 clock cycles.
- Moreover, with a novel block scaling method and a new ping-pong cache-memory architecture, both power consumption and hardware cost can be greatly reduced.
- A test chip has been designed using UMC 0.13 μm 1P8M process.
- Simulation results show that the proposed FFT processor consumes only 17.26 mW at 22.86 MHz, which meets the maximum throughput rate of WiMAX applications.
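The timing claim can be checked with simple arithmetic. The sketch below rests on two assumptions that are not stated in the summary: each stream delivers one sample per clock cycle, and the guard interval is 1/32 of the useful symbol length (taken here as the shortest cyclic-prefix ratio). All constant names are illustrative.

```python
CLOCK_HZ = 22.86e6        # clock frequency reported for the test chip
FFT_SIZE = 2048
CYCLES_PER_PAIR = 2052    # cycles for two 2048-point FFT/IFFT computations
GUARD_FRACTION = 1 / 32   # assumption: shortest guard-interval ratio

processing_us = CYCLES_PER_PAIR / CLOCK_HZ * 1e6
symbol_cycles = FFT_SIZE * (1 + GUARD_FRACTION)   # samples per OFDM symbol
symbol_us = symbol_cycles / CLOCK_HZ * 1e6

print(f"processing time: {processing_us:.2f} us")
print(f"symbol time:     {symbol_us:.2f} us")
print("fits within one symbol:", CYCLES_PER_PAIR <= symbol_cycles)
```

Under these assumptions the 2052 cycles take roughly 89.8 µs, slightly more than the 2048-sample useful symbol but less than the roughly 92.4 µs symbol including the guard interval. This suggests how the processor can run at the sampling frequency rather than a multiple of it.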
Citations
18 citations
Cites background or methods from "A Block Scaling FFT/IFFT Processor ..."
...3) Fixed-Point Performance: To improve performance in the signal-to-quantization-noise ratio (SQNR) with finite word length, we employed a block floating point (BFP) scheme [30], [28], [48] by representing a block of data values using the formula M × 2^E, in which M is the mantissa of the individual data value and E is a global exponent term for all data values....
[...]
...Moreover, rather than using a global operation scheme based on overall performance optimization [30]–[34], it is more helpful to minimize the power consumption (i....
[...]
...Based on the evaluation results presented in [30], [48], and [50], we determined that in fixed-point performance for OCT or OFDM FFTs, the proposed FFT structure lends directional support for SQNR for both FD-OCT and OFDM applications....
[...]
...For example, an eight-path radix-8 kernel may employ seven complex multipliers as shown in [27], [30], [31]....
[...]
...To enable a balanced comparison of hardware performance, we employed two parameters based on [22], [28], [30] and [51] in terms of “normalized N-point FFTs per energy” (N is the FFT length specified in our design) as shown in (14) and “normalized T....
[...]
16 citations
Cites background from "A Block Scaling FFT/IFFT Processor ..."
...Furthermore, the ping-pong cache memory architecture is proposed to further reduce memory accesses more efficiently [8]....
[...]
...For these reasons, several dynamic data scaling approaches have been proposed to preserve effective data word-length and avoiding data overflow [3], [8]....
[...]
...Proposed Chen [8] Zhong [10] Technology 0....
[...]
...7 [8], respectively to be our baseline of performance comparison....
[...]
11 citations
Cites methods from "A Block Scaling FFT/IFFT Processor ..."
...The performance indices for chip area and power consumption are defined as follows [1][3]:...
[...]
10 citations
10 citations
References
322 citations
"A Block Scaling FFT/IFFT Processor ..." refers methods in this paper
...Thus we employ radix-2³ and radix-2² [4] to replace radix-8 and radix-4, respectively....
[...]
...To reduce the number of complex multiplications, radix-8 algorithm is chosen to carry out the DFT [4]....
[...]
...Here we take the longest 2048-point DFT in the design as an example....
[...]
...Similarly, 128/256/512/1024-point DFT can also be decomposed to preceding radix-8 stages and a final radix-8/4/2 stage depending on the DFT size.... [Page footer: 0-7803-9735-5/06/$20.00 ©2006 IEEE 203]
[...]
319 citations
"A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper
...Thus by taking the guard interval of WiMAX systems into account, the proposed FFT/IFFT processor does not need to operate in a multiple sampling frequency as the previous cached-memory FFT designs do [2], [3]....
[...]
...However, the increase in wordlength [2] or idle cycles [3] still causes wastes in power consumption and hardware cost....
[...]
...Besides, to compare the FFT processor chips fabricated with different technologies, we adopt the normalized area and FFTs per energy [2] as our performance indices shown in eqs....
[...]
...Cached-memory FFT [2], [3] is proposed for low power consumption by reducing the memory accesses....
[...]
...There have been many researches on low-power FFT designs by employing the cached-memory architecture to reduce the memory accesses [2], [3]....
[...]
165 citations
"A Block Scaling FFT/IFFT Processor ..." refers methods in this paper
...Besides, since FFT and IFFT have the same operations except for complex-conjugated twiddle factors, we implement IFFT by simply taking conjugates of FFT input/output [6] as shown in Fig....
[...]
128 citations
"A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper
...For memory-based FFT processors supporting consecutive I/O, multiple main memories are needed as computation and I/O buffers [7]....
[...]
...To reduce the total memory size, the continuous flow (CF) memory architecture is proposed [7] where only two N-word memories are required for N-point FFT....
[...]
111 citations
"A Block Scaling FFT/IFFT Processor ..." refers background or methods in this paper
...Thus by taking the guard interval of WiMAX systems into account, the proposed FFT/IFFT processor does not need to operate in a multiple sampling frequency as the previous cached-memory FFT designs do [2], [3]....
[...]
...However, the increase in wordlength [2] or idle cycles [3] still causes wastes in power consumption and hardware cost....
[...]
...dynamic scaling FFT processor [3] is proposed by employing multiple exponents for cache-size blocks....
[...]
...Cached-memory FFT [2], [3] is proposed for low power consumption by reducing the memory accesses....
[...]
...There have been many researches on low-power FFT designs by employing the cached-memory architecture to reduce the memory accesses [2], [3]....
[...]