Journal ArticleDOI

A Generalized Conflict-Free Memory Addressing Scheme for Continuous-Flow Parallel-Processing FFT Processors With Rescheduling

01 Dec 2011-IEEE Transactions on Very Large Scale Integration Systems (IEEE)-Vol. 19, Iss: 12, pp 2290-2302
TL;DR: A generalized conflict-free memory addressing scheme for memory-based fast Fourier transform (FFT) processors with parallel arithmetic processing units made up of radix-$2^q$ multi-path delay commutator (MDC) is presented.
Abstract: This paper presents a generalized conflict-free memory addressing scheme for memory-based fast Fourier transform (FFT) processors with parallel arithmetic processing units made up of radix-$2^q$ multi-path delay commutator (MDC). The proposed addressing scheme considers the continuous-flow operation with minimum shared memory requirements. To improve throughput, parallel high-radix processing units are employed. We prove that the solution to non-conflict memory access satisfying the constraints of the continuous-flow, variable-size, higher-radix, and parallel-processing operations indeed exists. In addition, a rescheduling technique for twiddle-factor multiplication is developed to reduce hardware complexity and to enhance hardware efficiency. From the results, we can see that the proposed processor has high utilization and efficiency to support flexible configurability for various FFT sizes with fewer computation cycles than the conventional radix-2/radix-4 memory-based FFT processors.
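To make "conflict-free" concrete, here is a minimal Python sketch of the classic digit-sum bank assignment for a single in-place radix-r butterfly. It is not the generalized scheme of this paper, which additionally covers continuous flow, variable sizes, and parallel MDC units; it only illustrates the baseline property that the r operands of one butterfly land in r distinct memory banks. The function names (`bank_of`, `butterfly_groups`, `is_conflict_free`) are hypothetical.

```python
def digits(addr, radix, n_digits):
    """Return the base-`radix` digits of addr, least significant first."""
    out = []
    for _ in range(n_digits):
        out.append(addr % radix)
        addr //= radix
    return out


def bank_of(addr, radix, n_digits):
    """Digit-sum bank index: (sum of base-r digits of addr) mod r."""
    return sum(digits(addr, radix, n_digits)) % radix


def butterfly_groups(radix, n_digits, stage):
    """Yield the addresses read by each in-place butterfly at `stage`.

    Addresses within one group differ only in digit number `stage`.
    """
    n = radix ** n_digits
    weight = radix ** stage
    for base in range(n):
        if (base // weight) % radix == 0:
            yield [base + k * weight for k in range(radix)]


def is_conflict_free(radix, n_digits):
    """Check that every butterfly at every stage touches `radix` distinct banks."""
    for stage in range(n_digits):
        for group in butterfly_groups(radix, n_digits, stage):
            banks = {bank_of(a, radix, n_digits) for a in group}
            if len(banks) != radix:
                return False
    return True


if __name__ == "__main__":
    print(is_conflict_free(2, 4))   # 16-point radix-2
    print(is_conflict_free(4, 3))   # 64-point radix-4
```

The check passes because the operands of one butterfly take every value 0..r-1 in a single digit while all other digits stay fixed, so their digit sums are distinct modulo r.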
Citations
Journal ArticleDOI
TL;DR: A multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length are presented.
Abstract: This paper presents a multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use radix-Ns butterflies at each stage, where Ns is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let Ns=4 and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented with a UMC 90-nm CMOS technology with a core area of 3.1 mm$^2$. The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively, in the post-layout simulation. Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption.

99 citations


Cites background from "A Generalized Conflict-Free Memory ..."

  • ...Continuous-flow mixed-radix (CFMR) FFT [8], [9] utilizes two N-sample memories to generate a continuous output stream....

Journal ArticleDOI
TL;DR: This brief presents a novel scalable architecture for in-place fast Fourier transform (IFFT) computation for real-valued signals based on a modified radix-2 algorithm, which removes the redundant operations from the flow graph.
Abstract: This brief presents a novel scalable architecture for in-place fast Fourier transform (IFFT) computation for real-valued signals. The proposed computation is based on a modified radix-2 algorithm, which removes the redundant operations from the flow graph. A new processing element (PE) is proposed using two radix-2 butterflies that can process four inputs in parallel. A novel conflict-free memory-addressing scheme is proposed to ensure the continuous operation of the FFT processor. Furthermore, the addressing scheme is extended to support multiple parallel PEs. The proposed real-FFT processor simultaneously requires fewer computation cycles and lower hardware cost compared to prior work. For example, the proposed design with two PEs reduces the computation cycles by a factor of 2 for a 256-point real fast Fourier transform (RFFT) compared to a prior work while maintaining a lower hardware complexity. The number of computation cycles is reduced proportionately with the increase in the number of PEs.
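The redundancy that real-valued FFT architectures exploit comes from conjugate symmetry: for real input, X[N-k] equals the complex conjugate of X[k], so roughly half of the outputs (and the operations producing them) carry no new information. The sketch below checks that property numerically with a naive DFT. It is an illustration of the symmetry only, not of the modified radix-2 flow graph or the processing element described in the abstract; `dft` and `max_symmetry_error` are hypothetical helper names.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence x."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def max_symmetry_error(x):
    """Largest deviation from X[k] == conj(X[(N - k) mod N]) over all k."""
    spectrum = dft(x)
    n = len(x)
    return max(
        abs(spectrum[k] - spectrum[(n - k) % n].conjugate()) for k in range(n)
    )


if __name__ == "__main__":
    real_signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 4.0]
    # For real input the error is at floating-point rounding level.
    print(max_symmetry_error(real_signal))
```

For a complex input the same check generally fails, which is why a complex FFT must compute and store all N outputs.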

56 citations


Cites methods from "A Generalized Conflict-Free Memory ..."

  • ...Higher radix butterfly units and/or parallel processing can be utilized to increase the throughput [9]....

Journal ArticleDOI
TL;DR: The proposed radix-16 FFT processor is area-efficient with high data processing rate and hardware utilization efficiency, and a conflict-free multibank memory addressing scheme is devised to support up to 16-way parallel and normal-order data input/output.
Abstract: This paper presents a high-throughput FFT processor for IEEE 802.15.3c (WPANs) standard. To meet the throughput requirement of 2.59 Giga-samples/s, radix-16 FFT algorithm is adopted and reformulated to an efficient form so that the required number of butterfly stages is reduced. Specifically, the radix-16 butterfly processing element consists of two cascaded parallel/pipelined radix-4 butterfly units. It facilitates low-complexity realization of radix-16 butterfly operation and high operation speed due to its optimized pipelined structure. Besides, a new three-stage multiplier for twiddle factor multiplication is also proposed, which has lower area and power consumption than conventional complex multipliers. Moreover, a conflict-free multibank memory addressing scheme is devised to support up to 16-way parallel and normal-order data input/output. Without needing to reorder the input/output data, this scheme helps a high-throughput design result. Equipped with those new performance-boosting techniques, overall the proposed radix-16 FFT processor is area-efficient with high data processing rate and hardware utilization efficiency. The EDA synthesis results show that whole FFT processor area is 0.93 mm$^2$, and the power consumption is 42 mW with 90 nm process. The SQNR performance is 57 dB with 12-bit wordlength implementation.

54 citations

Journal ArticleDOI
TL;DR: A novel architecture for memory-based fast Fourier transform (FFT) computation for real-valued signals based on radix-2 decimation-in-frequency algorithm to minimize the computation clock cycles and maximize the utilization of the processing element (PE).
Abstract: This brief presents a novel architecture for memory-based fast Fourier transform (FFT) computation for real-valued signals based on radix-2 decimation-in-frequency algorithm. A superior strategy of stage partition for the real FFT (RFFT) is proposed to minimize the computation clock cycles and maximize the utilization of the processing element (PE). The PE employed in our RFFT architecture can process four inputs in parallel by using two radix-2 butterflies and only two multiplexers. The proposed memory-addressing scheme and control of the multiplexers can be expressed in terms of a counter according to the RFFT computation stage. Furthermore, the proposed RFFT architecture can support more PEs in two dimensions as well. Compared with prior works, the proposed RFFT processors have the advantages of fewer computation cycles and lower hardware usage. The experiment shows that the proposed processor reduces the computation cycles by 17.5% for a 32-point RFFT computation compared with a recently presented work while maintaining lower hardware usage and complexity in the PE design.

41 citations


Cites methods or result from "A Generalized Conflict-Free Memory ..."

  • ...Moreover, one obvious advantage of the proposed RFFT architecture is that the capability of the required memory can be reduced by a factor of 2, as compared with the traditional memory-based complex FFT processors in [9] and [15]....

  • ..., pipelined [8] and memory-based architectures [9]....

  • ...These architectures are adopted in many applications such as optical coherence tomography in image processing [1], orthogonal frequency-division multiplexing and discrete multitone in communication [9], and wireless sensor network [10]....

Journal ArticleDOI
TL;DR: A data relocation scheme that merges multiple banks to lower the area requirement and power dissipation of memory-based FFT architectures is proposed and the proposed memory-addressing method can effectively deal with single-port, merged-bank memory with high-radix processing elements.
Abstract: This paper explores efficient memory management schemes for memory-based architectures of the fast Fourier transform (FFT). A data relocation scheme that merges multiple banks to lower the area requirement and power dissipation of memory-based FFT architectures is proposed. The proposed memory-addressing method can effectively deal with single-port, merged-bank memory with high-radix processing elements. Compared with conventional memory-based FFT designs using dual-port memory, the derived architecture has better performance in terms of area and power consumption. The proposed scheme is extended to a cached-memory FFT architecture to further reduce power dissipation. An 8192-point cached-memory FFT processor is implemented for digital video broadcasting-terrestrial/handheld applications by using 0.18-$\mu$m 1P6M CMOS technology. Experimental results show that the proposed memory scheme consumes 10.1%–29.3% less area and 9.6%–67.9% less power compared with those of the multibank design.

39 citations

References
Proceedings ArticleDOI
29 Sep 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the requirement for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized and the area/power efficiency has been enhanced.
Abstract: The FFT processor is one of the key components in the implementation of wideband OFDM systems. Architectures with a structured pipeline have been used to meet the fast, real-time processing demand and low-power consumption requirement in a mobile environment. Architectures based on new forms of FFT, the radix-$2^i$ algorithm derived by cascade decomposition, are proposed. By exploiting the spatial regularity of the new algorithm, the requirement for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized. Progressive wordlength adjustment has been introduced to optimize the total memory size with a given signal-to-quantization-noise-ratio (SQNR) requirement in fixed-point processing. A new complex multiplier based on distributed arithmetic further enhanced the area/power efficiency of the design. A single-chip processor for 1 K complex point FFT transform is used to demonstrate the design issues under consideration.

322 citations

Journal ArticleDOI
TL;DR: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor, which has been fabricated in a standard 0.7-$\mu$m CMOS process and is fully functional on first-pass silicon.
Abstract: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor. The 460000-transistor design has been fabricated in a standard 0.7-$\mu$m ($L_{\text{poly}}$ = 0.6 $\mu$m) CMOS process and is fully functional on first-pass silicon. At a supply voltage of 1.1 V, it calculates a 1024-point complex FFT in 330 $\mu$s while consuming 9.5 mW, resulting in an adjusted energy efficiency more than 16 times greater than the previously most efficient known FFT processor. At 3.3 V, it operates at 173 MHz, which is a clock rate 2.6 times greater than the previously fastest rate.

319 citations


"A Generalized Conflict-Free Memory ..." refers background in this paper

  • ...Among them, pipelined single-path delay feedback (SDF) architecture [3]–[6] and memory-based/cache-memory-based architecture [7]–[9] are two popular solutions....

Journal ArticleDOI
TL;DR: A novel 128/64 point fast Fourier transform (FFT)/ inverse FFT (IFFT) processor for the applications in a multiple-input multiple-output orthogonal frequency-division multiplexing based IEEE 802.11n wireless local area network baseband processor.
Abstract: In this paper, we present a novel 128/64 point fast Fourier transform (FFT)/ inverse FFT (IFFT) processor for the applications in a multiple-input multiple-output orthogonal frequency-division multiplexing based IEEE 802.11n wireless local area network baseband processor. The unfolding mixed-radix multipath delay feedback FFT architecture is proposed to efficiently deal with multiple data sequences. The proposed processor not only supports the operation of FFT/IFFT in 128 points and 64 points but can also provide different throughput rates for 1-4 simultaneous data sequences to meet IEEE 802.11n requirements. Furthermore, less hardware complexity is needed in our design compared with traditional four-parallel approach. The proposed FFT/IFFT processor is designed in a 0.13-$\mu$m single-poly and eight-metal CMOS process. The core area is $660 \times 2142$ $\mu$m$^2$, including an FFT/IFFT processor and a test module. At the operation clock rate of 40 MHz, our proposed processor can calculate 128-point FFT with four independent data sequences within 3.2 $\mu$s, meeting IEEE 802.11n standard requirements.

143 citations


"A Generalized Conflict-Free Memory ..." refers background in this paper

  • ...Among them, pipelined single-path delay feedback (SDF) architecture [3]–[6] and memory-based/cache-memory-based architecture [7]–[9] are two popular solutions....

Journal ArticleDOI
TL;DR: A new continuous-flow mixed-radix (CFMR) fast Fourier transform (FFT) processor that uses the MR (radix-4/2) algorithm and a novel in-place strategy that can reduce hardware complexity and computation cycles compared with existing FFT processors is proposed.
Abstract: The paper proposes a new continuous-flow mixed-radix (CFMR) fast Fourier transform (FFT) processor that uses the MR (radix-4/2) algorithm and a novel in-place strategy. The existing in-place strategy supports only a fixed-radix FFT algorithm. In contrast, the proposed in-place strategy can support the MR algorithm, which allows CF FFT computations regardless of the length of FFT. The novel in-place strategy is made by interchanging storage locations of butterfly outputs. The CFMR FFT processor provides the MR algorithm, the in-place strategy, and the CF FFT computations at the same time. The CFMR FFT processor requires only two N-word memories due to the proposed in-place strategy. In addition, it uses one butterfly unit that can perform either one radix-4 butterfly or two radix-2 butterflies. The CFMR FFT processor using the 0.18-$\mu$m SEC cell library consists of 37,000 gates excluding memories, requires only 640 clock cycles for a 512-point FFT and runs at 100 MHz. Therefore, the CFMR FFT processor can reduce hardware complexity and computation cycles compared with existing FFT processors.
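The 640-cycle figure for a 512-point FFT can be reproduced with a simple count, sketched below in Python. The model is an assumption, not something the abstract spells out: the butterfly unit completes either one radix-4 butterfly or two radix-2 butterflies per clock cycle, the transform uses as many radix-4 stages as possible plus one radix-2 stage when log2(N) is odd, and pipeline latency and memory stalls are ignored. `cfmr_cycles` is a hypothetical name.

```python
def cfmr_cycles(n_points):
    """Butterfly cycles for an n-point radix-4/2 FFT under the stated model."""
    log2n = n_points.bit_length() - 1
    if n_points < 4 or (1 << log2n) != n_points:
        raise ValueError("n_points must be a power of two, at least 4")
    # Radix-4 stages consume two address bits each; a leftover bit
    # becomes a single radix-2 stage.
    radix4_stages, radix2_stages = divmod(log2n, 2)
    # A radix-4 stage has n/4 butterflies, one per cycle: n/4 cycles.
    # A radix-2 stage has n/2 butterflies, two per cycle: n/4 cycles.
    return (radix4_stages + radix2_stages) * (n_points // 4)


if __name__ == "__main__":
    print(cfmr_cycles(512))  # 4 radix-4 stages + 1 radix-2 stage -> 640
```

Under this model every stage costs N/4 cycles regardless of its radix, which is why pairing radix-2 butterflies keeps the unit fully occupied in the mixed-radix case.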

128 citations


"A Generalized Conflict-Free Memory ..." refers background or methods in this paper

  • ...1, it supports the continuous-flow operation and merges the input and output buffer so as to minimize the total memory requirement to as in [9] and [10]....

  • ...However, it has been shown in [9] and [10] that a continuous-flow FFT processor can minimize the storage to...

  • ...In the past, numerous FFT processors have been designed [2]–[9]....

  • ...In [9] and [11], an in-place strategy was applied for the radix-2/4 butterfly unit....

  • ...Among them, pipelined single-path delay feedback (SDF) architecture [3]–[6] and memory-based/cache-memory-based architecture [7]–[9] are two popular solutions....

Journal ArticleDOI
TL;DR: This paper presents an 8192-point FFT processor for DVB-T systems, in which a three-step radix-8 FFT algorithm, a new dynamic scaling approach, and a novel matrix prefetch buffer are exploited.
Abstract: This paper presents an 8192-point FFT processor for DVB-T systems, in which a three-step radix-8 FFT algorithm, a new dynamic scaling approach, and a novel matrix prefetch buffer are exploited. About 64 K bit memory space can be saved in the 8 K point FFT by the proposed dynamic scaling approach. Moreover, with data scheduling and pre-fetched buffering, single-port memory can be adopted without degrading throughput rate. A test chip for 8 K mode DVB-T system has been designed and fabricated using 0.18-$\mu$m single-poly six-metal CMOS process with core area of 4.84 mm$^2$. Power dissipation is about 25.2 mW at 20 MHz.

111 citations


"A Generalized Conflict-Free Memory ..." refers methods in this paper

  • ...We can see that a specific and similar rescheduling technique for one-half complex multiplications after the radix-8 butterfly has been used in [8]....
