Journal Article

48-Mode Reconfigurable Design of SDF FFT Hardware Architecture Using Radix-3^2 and Radix-2^3 Design Approaches

TL;DR: A reconfigurable (RC) fast Fourier transform (FFT) design built with a systematic design scheme is proposed; it supports FFT sizes up to 2187 points, 48 RC modes, and 32 operating modes defined in the 3GPP-LTE standard.
Abstract: In this paper, we propose a reconfigurable (RC) fast Fourier transform (FFT) design built with a systematic design scheme. The RC design bricks can be concatenated arbitrarily to support the required FFT size. We present three design techniques that make up our FFT design: a six-type RC processing element, a systematic first-in first-out (FIFO) reuse arrangement, and a section-based twiddle factor generator. In a design/implementation example, the design supports FFT sizes up to 2187 points and 48 RC modes. It also supports 32 operating modes defined in the 3GPP-LTE standard. In an application-specific integrated circuit (ASIC) implementation with TSMC 90-nm CMOS technology, our design occupies a core area of 1.664 mm² and consumes 35.2 mW at a maximum clock frequency of 188.67 MHz. The design also compares favorably with reference designs in terms of speed-area ratio and power-frequency ratio.
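A citing paper quoted further down characterizes this design as handling FFT sizes that are mixed powers of 2 and 3, and the largest supported size, 2187, equals 3^7. As a minimal sketch only (not the paper's actual stage arrangement), the Python below factors a size into 2^a·3^b and greedily groups the factors into radix-3^2 and radix-2^3 stages, using smaller radices for any leftover factors. The function names and the greedy grouping order are illustrative assumptions.

```python
def factor_2_3(n):
    """Return (a, b) such that n == 2**a * 3**b, or raise ValueError."""
    if n < 1:
        raise ValueError("FFT size must be a positive integer")
    a = b = 0
    while n % 2 == 0:
        n //= 2
        a += 1
    while n % 3 == 0:
        n //= 3
        b += 1
    if n != 1:
        raise ValueError("size has a prime factor other than 2 or 3")
    return a, b


def stage_radices(n):
    """Greedy stage split (an assumption for illustration):
    radix-9 stages first, then radix-8 stages, with radix-3 and
    radix-2/radix-4 stages absorbing leftover factors."""
    a, b = factor_2_3(n)
    stages = [9] * (b // 2) + [3] * (b % 2) + [8] * (a // 3)
    leftover = a % 3
    if leftover:
        stages.append(2 ** leftover)  # radix-2 or radix-4 remainder
    return stages


print(stage_radices(2187))  # [9, 9, 9, 3]
print(stage_radices(1536))  # [3, 8, 8, 8]
```

The product of the returned radices always equals the requested size, which is a quick way to check any alternative grouping order.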
Citations
Journal Article
TL;DR: The proposed processor has better normalized throughput per unit area than state-of-the-art designs; it is designed as a general IP and can be implemented using a processor synthesizer (application-specific instruction-set processor designer).
Abstract: A high-throughput programmable fast Fourier transform (FFT) processor is designed supporting 16- to 4096-point FFTs and 12- to 2400-point discrete Fourier transforms (DFTs) for 4G, wireless local area network, and future 5G. A 16-path data parallel memory-based architecture is selected as a tradeoff between throughput and cost. To implement a hardware-efficient high-speed processor, several improvements are provided. To maximally reuse the hardware resource, a reconfigurable butterfly unit is proposed that computes eight radix-2 butterflies in parallel, four radix-3/4 in parallel, two radix-5/8 in parallel, or one radix-16 in a single clock cycle. Twiddle factor multipliers using different schemes are optimized and compared, and a modified coordinate rotation digital computer (CORDIC) scheme is finally implemented to minimize the hardware cost while supporting both FFTs and DFTs. An optimized conflict-free data access scheme is also proposed to support multiple butterflies at any radix. The processor is designed as a general IP and can be implemented using a processor synthesizer (application-specific instruction-set processor designer). The electronic design automation synthesis result based on a 65-nm technology shows that the processor area is 1.46 mm². The processor supports 972 MS/s 4096-point FFT at 250 MHz with a power consumption of 68.64 mW and a signal-to-quantization-noise ratio of 66.1 dB. The proposed processor has better normalized throughput per unit area than the state-of-the-art available designs.
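As a rough sanity check on the figures quoted in this abstract, the throughput and clock rate together imply an average number of samples processed per clock cycle. The snippet below is plain arithmetic on those two numbers; the function name is hypothetical, and the cycles-per-transform estimate assumes back-to-back 4096-point transforms with no stalls, which the abstract does not state.

```python
def samples_per_cycle(throughput_sps, clock_hz):
    """Average samples processed per clock cycle."""
    return throughput_sps / clock_hz


# Figures quoted in the abstract: 972 MS/s at 250 MHz for a 4096-point FFT.
rate = samples_per_cycle(972e6, 250e6)

# Assumption: transforms run back to back, so cycles per FFT = N / rate.
cycles_per_fft = 4096 / rate

print(f"{rate:.3f} samples/cycle")          # 3.888 samples/cycle
print(f"{cycles_per_fft:.1f} cycles/FFT")   # 1053.5 cycles/FFT
```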

30 citations


Cites background or methods from "48-Mode Reconfigurable Design of SD..."

  • ...According to the fact that all of the 11 sizes in Table V are divisible by 8, we integrate the schemes in [24] and [15] and propose a ROM partition scheme that reduces the number of table entries to (U/8) + V (N = UV)....

    [...]

  • ...SDF processor in [16] supports 46 2^m·3^n·5^k points using a single-table approximation method (STAM) for TF generation....

    [...]

  • ...Several designs are listed for comparison, including a 64- to 4096-point FFT processor [8], two memory-based DFT processors [17], [18] using PFA algorithm, and two SDF DFT processors [15], [16]....

    [...]

  • ...However, only SDF [15], [16], MDF [6], and memory-based [17], [18] architectures support diverse Fig....

    [...]

  • ...To reduce the ROM size in the TF multiplier, many ROM-based schemes are proposed, such as ROM sharing scheme [23], ROM partition scheme [15], [24], memoryless rotator [5], [25], and trigonometric approximation [6]....

    [...]
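The ROM partition count quoted in the first excerpt above, (U/8) + V entries for N = UV, is consistent with a coarse/fine twiddle factor split, although the excerpt does not spell out the scheme. The sketch below shows one plausible reading under stated assumptions: an index k is written as k = a·V + b, so that W_N^k = W_U^a · W_N^b, with one table indexed by a and another by b. All names are hypothetical. The sketch stores all U coarse entries; the U/8 in the quoted count presumably comes from exploiting symmetry of the coarse table, which is not implemented here.

```python
import cmath


def twiddle(n, k):
    """Exact twiddle factor W_n^k = exp(-2*pi*j*k/n)."""
    return cmath.exp(-2j * cmath.pi * k / n)


def build_tables(u, v):
    """Coarse table holds W_U^a (equal to W_N^(a*V)); fine table holds W_N^b."""
    n = u * v
    coarse = [twiddle(u, a) for a in range(u)]
    fine = [twiddle(n, b) for b in range(v)]
    return coarse, fine


def lookup(k, u, v, coarse, fine):
    """Rebuild W_N^k from the two tables with one complex multiplication."""
    a, b = divmod(k % (u * v), v)
    return coarse[a] * fine[b]


U, V = 64, 32  # N = 2048, chosen only for illustration
coarse, fine = build_tables(U, V)
max_err = max(
    abs(lookup(k, U, V, coarse, fine) - twiddle(U * V, k))
    for k in range(U * V)
)
print(len(coarse) + len(fine), "stored entries instead of", U * V)
print("max reconstruction error:", max_err)
```

The trade-off in this kind of split is a much smaller lookup table in exchange for an extra complex multiplication per twiddle factor.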

Book Chapter
01 Jan 2019
TL;DR: This chapter summarizes the research on FFT hardware architectures by presenting the FFT algorithms, the building blocks in FFT hardware architectures, the architectures themselves, and the bit reversal algorithm.
Abstract: The fast Fourier transform (FFT) is a widely used algorithm in signal processing applications. FFT hardware architectures are designed to meet the requirements of the most demanding applications in terms of performance, circuit area, and/or power consumption. This chapter summarizes the research on FFT hardware architectures by presenting the FFT algorithms, the building blocks in FFT hardware architectures, the architectures themselves, and the bit reversal algorithm.

16 citations

Journal Article
TL;DR: A modified pipelined architecture with a reorganized memory structure and an efficient data scheduling mechanism for memories and butterflies is developed; it achieves a throughput of 0.6 GS/s and a latency of 1200 clock cycles, the lowest latency reported to date for single-input pipelined FFT/IFFT architectures.
Abstract: A considerable part of latency in the baseband of massive multiple-input multiple-output (MIMO) systems is introduced by orthogonal frequency division multiplexing (OFDM) (de)modulation. To address the low-latency demand of massive MIMO systems, a fast Fourier transform (FFT) processor and corresponding reordering scheme are proposed, which reduce the processing latency and reordering latency of OFDM-based systems, respectively. The main idea is to utilize the OFDM guard bands to decrease the number of required computations and thus the processing time. In the case of a 2048-point IFFT, the proposed scheme leads to a 42% reduction in latency compared to the reported pipelined schemes at the cost of 4% additional memory, which is around 2.4% of the total chip area. To realize this idea, a modified pipelined architecture with a reorganized memory structure and an efficient data scheduling mechanism for memories and butterflies are developed. Using the proposed scheme, a 2048-point FFT/IFFT processor has been implemented in a 28-nm complementary metal-oxide-semiconductor technology. The post-layout simulations show that our design achieves a throughput of 0.6 GS/s and a latency of 1200 clock cycles, the lowest latency reported to date for single-input pipelined FFT/IFFT architectures.

16 citations


Cites background from "48-Mode Reconfigurable Design of SD..."

  • ...The design has a latency of 1200 CC, demonstrating 42% reduction compared to the latency of reported designs [7], [12]....

    [...]

  • ...constitutes the dominant extra cost of our design compared to the traditional ones [7], [12]....

    [...]

  • ...This category is divided to the single-path delay feedback (SDF) architecture [7], multipath delay feedback (MDF) [8], and multi-path delay commutator (MDC) architectures [9], [10]....

    [...]

  • ...In general, the pipelined architectures can achieve the lowest latency [7]....

    [...]

  • ...In case of massive MIMO system in Table I, Extra Mem has 89 words, which is less than 4% of the total memory of traditional architectures, that contain at least N-1 = 2047 words memory [7], [12]....

    [...]

Journal Article
TL;DR: A flexible and reconfigurable changeable-radix fast Fourier transform (FFT) hardware architecture is proposed that aims to support 48 different FFT sizes, up to 4096 points, as defined in the current 3GPP-LTE communication system.
Abstract: In this paper, we propose a flexible and reconfigurable changeable-radix fast Fourier transform (FFT) hardware architecture. It aims to support 48 different FFT sizes and up to 4096 FFT points, which are defined in the current 3GPP-LTE communication system. The design structure is primarily built on a radix-5^2 single-path delay feedback FFT and up to 18 changeable radixes of FFT processing. A switchable FIFO usage technique is developed to efficiently manage the FIFO arrangement for 48 FFT modes. In addition, a coarse and fine rotating technique is designed to effectively reduce twiddle-factor circuit area. Using TSMC 40-nm CMOS technology, an FFT ASIC implementation occupies a core area of only 0.414 mm² and consumes 49.8 mW on average at a maximum operating frequency of 526.32 MHz. This design is competitive with current state-of-the-art works, especially in terms of circuit area cost and power/energy performance.

11 citations


Cites background or methods or result from "48-Mode Reconfigurable Design of SD..."

  • ...Unfortunately, [13], [22], and [23] only cover up to N = 2048 FFT points....

    [...]

  • ...On the other hand, by using NEE (energy performance index in [32]), our chip is only a little worse than [22] and [23] because our work supports longer FFT length (such as twice of FFT length in both [22] and [23])....

    [...]

  • ...By using NAE (area performance index in [32]), our chip is a little worse than [11], [22], and [23]....

    [...]

  • ...Reference [22] presents a LEGO-like constructing approach, which can handle the FFT points with the mixed powers of 2 and 3 only....

    [...]

  • ...Even [22] and [23] provides 6 and 7 modes (very limited), respectively....

    [...]

Journal Article
TL;DR: In this article, the authors propose a fully-unrolled Streaming MUltiplierLess (SMUL) fast Fourier transform (FFT) engine that performs one transform per clock cycle.
Abstract: Beamspace processing is an emerging paradigm to reduce hardware complexity in all-digital millimeter-wave (mmWave) massive multiple-input multiple-output (MIMO) basestations. This approach exploits sparsity of mmWave channels but requires spatial discrete Fourier transforms (DFTs) across the antenna array, which must be performed at the baseband sampling rate. To mitigate the resulting DFT hardware implementation bottleneck, we propose a fully-unrolled Streaming MUltiplierLess (SMUL) fast Fourier transform (FFT) engine that performs one transform per clock cycle. The proposed SMUL-FFT architecture avoids hardware multipliers by restricting the twiddle factors to a sum-of-powers-of-two, resulting in substantial power and area savings. Compared to state-of-the-art FFTs, our SMUL-FFT ASIC designs in 65-nm CMOS demonstrate more than 45% and 17% improvements in energy-efficiency and area-efficiency, respectively, without noticeably increasing the error-rate in mmWave massive MIMO systems.
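To illustrate the general idea of restricting a twiddle factor to a sum of powers of two, the sketch below approximates a real constant with a few signed power-of-two terms and then applies it to an integer sample using only shifts and additions. This is a generic greedy approximation chosen for illustration, not the SMUL-FFT authors' method; the function names, the greedy term selection (rounding in the log domain), and the three-term budget are all assumptions.

```python
import math


def spt_terms(value, num_terms):
    """Greedily approximate `value` as a sum of signed powers of two.

    Each step picks the power of two closest to the residual in the
    log domain and subtracts it. Returns a list of (sign, exponent).
    """
    terms = []
    residual = value
    for _ in range(num_terms):
        if residual == 0:
            break
        exponent = round(math.log2(abs(residual)))
        sign = 1 if residual > 0 else -1
        terms.append((sign, exponent))
        residual -= sign * 2.0 ** exponent
    return terms


def shift_add_multiply(x, terms):
    """Multiply integer x by the approximated constant with shifts and adds."""
    acc = 0
    for sign, exponent in terms:
        shifted = x << exponent if exponent >= 0 else x >> -exponent
        acc += sign * shifted
    return acc


coeff = math.cos(math.pi / 8)            # real part of a twiddle factor
terms = spt_terms(coeff, num_terms=3)
approx = sum(s * 2.0 ** e for s, e in terms)

print(terms)                             # [(1, 0), (-1, -4), (-1, -6)]
print(approx)                            # 0.921875
print(shift_add_multiply(1024, terms))   # 944
```

Using more terms reduces the approximation error at the cost of more adders, which is the basic accuracy versus hardware trade-off in a multiplierless design.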

9 citations

References
Proceedings Article
15 Apr 1996
TL;DR: A new VLSI architecture for a real-time pipeline FFT processor is proposed, based on a hardware-oriented radix-2^2 algorithm derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach; the algorithm has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm.
Abstract: A new VLSI architecture for a real-time pipeline FFT processor is proposed. A hardware-oriented radix-2^2 algorithm is derived by integrating a twiddle factor decomposition technique in the divide-and-conquer approach. The radix-2^2 algorithm has the same multiplicative complexity as the radix-4 algorithm, but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-N DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: log_4(N) - 1 complex multipliers and N - 1 complex data memory. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.
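The two hardware counts stated in this abstract, log_4(N) - 1 multipliers and N - 1 words of data memory, are easy to tabulate. The helper below simply evaluates those two expressions for a transform length that is a power of 4; the function name is hypothetical and nothing beyond the abstract's formulas is modeled.

```python
def radix22_sdf_resources(n):
    """Return (multipliers, memory_words) for a length-n radix-2^2 SDF FFT,
    using the counts given in the abstract: log4(n) - 1 and n - 1."""
    if n < 4 or n & (n - 1):
        raise ValueError("n must be a power of 4")
    log2_n = n.bit_length() - 1
    if log2_n % 2:
        raise ValueError("n must be a power of 4")
    return log2_n // 2 - 1, n - 1


for n in (16, 64, 256, 1024, 4096):
    multipliers, memory_words = radix22_sdf_resources(n)
    print(n, multipliers, memory_words)
# 16 1 15
# 64 2 63
# 256 3 255
# 1024 4 1023
# 4096 5 4095
```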

410 citations


"48-Mode Reconfigurable Design of SD..." refers methods in this paper

  • ...A regular hardware-oriented design methodology, single-path delay feedback (SDF) FFT [1], is developed in 1996, mostly focusing on the radix-2 FFT design [2]–[6]....

    [...]

Proceedings Article
29 Sep 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the requirements for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized, and the area/power efficiency has been enhanced.
Abstract: The FFT processor is one of the key components in the implementation of wideband OFDM systems. Architectures with a structured pipeline have been used to meet the fast, real-time processing demand and low-power consumption requirement in a mobile environment. Architectures based on new forms of FFT, the radix-2^i algorithm derived by cascade decomposition, are proposed. By exploiting the spatial regularity of the new algorithm, the requirements for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized. Progressive wordlength adjustment has been introduced to optimize the total memory size with a given signal-to-quantization-noise-ratio (SQNR) requirement in fixed-point processing. A new complex multiplier based on distributed arithmetic further enhances the area/power efficiency of the design. A single-chip processor for 1K complex-point FFT is used to demonstrate the design issues under consideration.

322 citations


"48-Mode Reconfigurable Design of SD..." refers background in this paper

  • ...Besides, in order to achieve lower computation complexity, radix-2^2 [11]–[13], radix-2^3 [14]–[20], radix-2^4 [21], [22], and radix-2^k [23], [24] FFT circuits are developed in sequence....

    [...]

Journal Article
TL;DR: A novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems and the proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), can provide a higher throughput rate by using the multidata-path scheme.
Abstract: In this paper, we present a novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems. The proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), can provide a higher throughput rate by using the multidata-path scheme. Furthermore, the hardware costs of memory and complex multipliers in MRMDF are only 38.9% and 44.8% of those in the known FFT processor by means of the delay feedback and the data scheduling approaches. The high-radix FFT algorithm is also realized in our processor to reduce the number of complex multiplications. A test chip for the UWB system has been designed and fabricated using 0.18-µm single-poly and six-metal CMOS process with a core area of 1.76 × 1.76 mm², including an FFT/IFFT processor and a test module. The throughput rate of this fabricated FFT processor is up to 1 Gsample/s while it consumes 175 mW. Power dissipation is 77.6 mW when its throughput rate meets UWB standard in which the FFT throughput rate is 409.6 Msample/s.

220 citations

Journal Article
TL;DR: The proposed radix-2^k feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.
Abstract: The appearance of radix-2^2 was a milestone in the design of pipelined FFT hardware architectures. Later, radix-2^2 was extended to radix-2^k. However, radix-2^k was only proposed for single-path delay feedback (SDF) architectures, but not for feedforward ones, also called multi-path delay commutator (MDC). This paper presents the radix-2^k feedforward (MDC) FFT architectures. In feedforward architectures radix-2^k can be used for any number of parallel samples which is a power of two. Furthermore, both decimation in frequency (DIF) and decimation in time (DIT) decompositions can be used. In addition to this, the designs can achieve very high throughputs, which makes them suitable for the most demanding applications. Indeed, the proposed radix-2^k feedforward architectures require fewer hardware resources than parallel feedback ones, also called multi-path delay feedback (MDF), when several samples in parallel must be processed. As a result, the proposed radix-2^k feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.

198 citations


"48-Mode Reconfigurable Design of SD..." refers background in this paper

  • ...Besides, in order to achieve lower computation complexity, radix-2^2 [11]–[13], radix-2^3 [14]–[20], radix-2^4 [21], [22], and radix-2^k [23], [24] FFT circuits are developed in sequence....

    [...]

Journal Article
TL;DR: A semicustom delay commutator circuit to support the implementation of high-speed fast Fourier transform processors based on the radix-4 pipeline FFT algorithm of J.H. McClellan and R.J. Purdy (1978) is described.
Abstract: The development is described of a semicustom delay commutator circuit to support the implementation of high-speed fast Fourier transform processors based on the radix-4 pipeline FFT algorithm of J.H. McClellan and R.J. Purdy (1978). The delay commutator is a 108000-transistor circuit comprising 12288 shift register stages and approximately 2000 gates of random logic realized with 2.5-micrometer design rule CMOS standard cell technology. It operates at a 10-MHz clock rate, which processes data at a 40-MHz rate. The delay commutator is suitable for implementing processors that compute transforms of 16, 64, 256, 1024, and 4096 (complex) points. It is implemented as a 4-bit-wide data slice to facilitate concatenation to accommodate common data word sizes and to use a standard 48-pin dual-in-line package.

141 citations


"48-Mode Reconfigurable Design of SD..." refers background in this paper

  • ...Later, radix-4 [7]–[9] and radix-8 [10] FFT hardware designs are discussed to expand the similar design concepts....

    [...]