Journal ArticleDOI

VLSI Design and Implementation of Reconfigurable 46-Mode Combined-Radix-Based FFT Hardware Architecture for 3GPP-LTE Applications

TL;DR: This paper presents a reconfigurable fast Fourier transform (FFT) hardware architecture that supports 46 different FFT sizes defined in 3GPP-LTE applications and delivers high-quality design results in terms of area- and energy-related performance indexes.
Abstract: This paper presents a reconfigurable fast Fourier transform (FFT) hardware architecture supporting 46 different FFT sizes defined in 3GPP-LTE applications. The proposed design concept is mainly based on combined radix-5, radix-3², and radix-2⁴ single-path delay feedback (SDF) FFT design approaches. To refine the hardware design, we also develop three design techniques: a reconfigurable processing kernel with seven types (RPK-ST), an efficient FIFO management scheme, and a single-table approximation method. In an ASIC implementation in TSMC 40-nm CMOS technology, the 46-mode reconfigurable FFT chip occupies a core area of only 0.36 mm², dissipates 48.46 mW, and operates at clock frequencies up to 500 MHz. Compared with other state-of-the-art works, this work delivers high-quality design results in terms of area- and energy-related performance indexes, providing a constructive FFT design prototype for 3GPP-LTE systems.
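
The combined-radix idea in the abstract can be illustrated in software. The following is a minimal sketch, assuming a plain recursive decimation-in-time Cooley-Tukey decomposition over the prime radices 5, 3, and 2; it is not the paper's SDF pipeline, RPK-ST kernel, or twiddle-factor approximation, and the function names `dft` and `mixed_radix_fft` are hypothetical.

```python
import cmath


def dft(x):
    """Direct O(N^2) DFT, used as a fallback and as a reference for checking."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def mixed_radix_fft(x, radices=(5, 3, 2)):
    """Recursive decimation-in-time FFT for sizes of the form 2^a * 3^b * 5^c.

    Assumption: radices are tried in the given order; a hardware design would
    fix the stage order and group stages (e.g. into radix-3^2 or radix-2^4).
    """
    n = len(x)
    if n == 1:
        return list(x)
    # Pick the first radix that divides the current length.
    r = next((p for p in radices if n % p == 0), None)
    if r is None:
        return dft(x)  # size has a prime factor outside the supported radices
    m = n // r
    # Split into r interleaved sub-sequences and transform each one.
    subs = [mixed_radix_fft(x[j::r], radices) for j in range(r)]
    out = [0j] * n
    for k in range(m):
        # Twiddle-factor multiplication between stages.
        t = [subs[j][k] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(r)]
        # Radix-r butterfly: an r-point DFT across the twiddled values.
        for q in range(r):
            out[k + q * m] = sum(
                t[j] * cmath.exp(-2j * cmath.pi * j * q / r) for j in range(r)
            )
    return out
```

Calling `mixed_radix_fft` on a length-60 input (60 = 2² · 3 · 5) and comparing against `dft` on the same input is a quick way to check the decomposition.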
Citations
Journal ArticleDOI
TL;DR: The proposed processor has better normalized throughput per unit area than the available state-of-the-art designs; it is designed as a general IP and can be implemented using a processor synthesizer (application-specific instruction-set processor designer).
Abstract: A high-throughput programmable fast Fourier transform (FFT) processor is designed to support 16- to 4096-point FFTs and 12- to 2400-point discrete Fourier transforms (DFTs) for 4G, wireless local area network, and future 5G. A 16-path data-parallel memory-based architecture is selected as a tradeoff between throughput and cost. To implement a hardware-efficient high-speed processor, several improvements are provided. To maximize hardware reuse, a reconfigurable butterfly unit is proposed that computes eight radix-2 butterflies in parallel, four radix-3/4 butterflies in parallel, two radix-5/8 butterflies in parallel, or one radix-16 butterfly in one clock cycle. Twiddle factor multipliers using different schemes are optimized and compared, and a modified coordinate rotation digital computer (CORDIC) scheme is finally implemented to minimize the hardware cost while supporting both FFTs and DFTs. An optimized conflict-free data access scheme is also proposed to support multiple butterflies at any radix. The processor is designed as a general IP and can be implemented using a processor synthesizer (application-specific instruction-set processor designer). The electronic design automation synthesis result based on a 65-nm technology shows that the processor area is 1.46 mm². The processor supports a 972-MS/s 4096-point FFT at 250 MHz with a power consumption of 68.64 mW and a signal-to-quantization-noise ratio of 66.1 dB. The proposed processor has better normalized throughput per unit area than the available state-of-the-art designs.

30 citations


Cites background or methods from "VLSI Design and Implementation of R..."

  • ...However, only SDF [15], [16], MDF [6], and memory-based [17], [18] architectures support diverse Fig....

    [...]

  • ...Several designs are listed for comparison, including a 64- to 4096-point FFT processor [8], two memory-based DFT processors [17], [18] using PFA algorithm, and two SDF DFT processors [15], [16]....

    [...]

  • ...SDF processor in [16] supports 46 2ᵐ3ⁿ5ᵏ points using a single-table approximation method (STAM) for TF generation....

    [...]

Journal ArticleDOI
TL;DR: A modified pipelined architecture with a reorganized memory structure and an efficient data scheduling mechanism for memories and butterflies is developed, which achieves a throughput of 0.6 GS/s and a latency of 1200 clock cycles, the lowest latency reported to date for single-input pipelined FFT/IFFT architectures.
Abstract: A considerable part of the latency in the baseband of massive multiple-input multiple-output (MIMO) systems is introduced by orthogonal frequency division multiplexing (OFDM) (de)modulation. To address the low-latency demand of massive MIMO systems, a fast Fourier transform (FFT) processor and a corresponding reordering scheme are proposed, which reduce the processing latency and reordering latency of OFDM-based systems, respectively. The main idea is to utilize the OFDM guard bands to decrease the number of required computations and thus the processing time. In the case of a 2048-point IFFT, the proposed scheme leads to a 42% reduction in latency compared to the reported pipelined schemes at the cost of 4% additional memory, which is around 2.4% of the total chip area. To realize this idea, a modified pipelined architecture with a reorganized memory structure and an efficient data scheduling mechanism for memories and butterflies are developed. Using the proposed scheme, a 2048-point FFT/IFFT processor has been implemented in a 28-nm complementary metal-oxide-semiconductor technology. The post-layout simulations show that the design achieves a throughput of 0.6 GS/s and a latency of 1200 clock cycles, the lowest latency reported to date for single-input pipelined FFT/IFFT architectures.

16 citations


Cites background from "VLSI Design and Implementation of R..."

  • ...latency of N clock cycles (or even more) [11], [12], which is...

    [...]

Journal ArticleDOI
TL;DR: Compared to benchmark designs, the new technique involves much less calibration hardware overhead, calibrates all mismatches in much less time, and addresses the needs of practical TI-ADCs well.
Abstract: Time-interleaving (TI) is a major trend for high-speed ADC designs. The major limitation of a TI-ADC is the mismatches among ADC channels. Calibration techniques have been actively pursued to compensate for the mismatches. In this paper, we present a new spectral calibration technique for TI-ADCs. Although the technique does not run in the background, it does not need external calibration signals and places no constraint on the input signal, which is similar to blind estimation. Compared to benchmark designs, the new technique involves much less calibration hardware overhead and calibrates all mismatches in much less time. In practice, this efficient technique can run repeatedly to track environmental changes. It addresses the needs of practical TI-ADCs well. The technique is verified with extensive simulations on a 2-GS/s 12-bit 16-channel ADC. With mismatch spurs over −40 dBc, the technique can reliably suppress the spurs to below −80 dBc within about 4000 samples without the need for iteration.

12 citations

Journal ArticleDOI
TL;DR: A flexible and reconfigurable changeable-radix fast Fourier transform (FFT) hardware architecture is proposed that aims to support the 48 different FFT sizes, up to 4096 FFT points, defined in the current 3GPP-LTE communication system.
Abstract: In this paper, we propose a flexible and reconfigurable changeable-radix fast Fourier transform (FFT) hardware architecture. It aims to support 48 different FFT sizes and up to 4096 FFT points, which are defined in the current 3GPP-LTE communication system. The design structure is primarily built on a radix-5² basis of single-path delay feedback FFT and up to 18 changeable radixes of FFT processing. A switchable FIFO usage technique is developed to efficiently manage the FIFO arrangement for the 48 FFT modes. In addition, a coarse and fine rotating technique is designed to effectively reduce the twiddle-factor circuit area. Using TSMC 40-nm CMOS technology, the FFT ASIC implementation occupies a core area of only 0.414 mm² and consumes 49.8 mW on average at a maximal working frequency of 526.32 MHz. This design is competitive with current state-of-the-art works, especially in terms of circuit area cost and power/energy performance.

11 citations


Cites background or methods or result from "VLSI Design and Implementation of R..."

  • ...Reference [23] uses radix-5 SDF-FFT basis to provide 46 FFT operating modes....

    [...]

  • ...Reference [23] is only workable for N≤2048 when the FIFO sub-bank partition and distribution plan are only based on supporting N = 2048 maximally....

    [...]

  • ...Unfortunately, [13], [22], and [23] only cover up to N = 2048 FFT points....

    [...]

  • ...On the other hand, by using NEE (energy performance index in [32]), our chip is only a little worse than [22] and [23] because our work supports longer FFT length (such as twice of FFT length in both [22] and [23])....

    [...]

  • ...By using NAE (area performance index in [32]), our chip is a little worse than [11], [22], and [23]....

    [...]

Journal ArticleDOI
TL;DR: A hardware design of an efficient Radix-2 FFT architecture using an optimized multiplier and a novel parallel prefix (PP) adder is proposed, resulting in a 20.19% improvement in comparison with other state-of-the-art techniques.
Abstract: The Fast Fourier Transform (FFT) is the basic building block for DSP applications where high processing speed is the critical requirement. Resource utilization and the number of computational stages in a Radix-2 FFT structure implementation can be minimized by improving the performance of the multiplier and adder blocks used. This work proposes a hardware design of an efficient Radix-2 FFT architecture using an optimized multiplier and a novel parallel prefix (PP) adder. The designed FFT architecture results in lower power and area with an increase in operation speed in comparison to the existing architectures. The proposed Radix-2 FFT implementation results in a delay of 18.218 ns (6.030 ns logic delay and 12.118 ns router delay), in comparison with 24.003 ns for the Wallace multiplier using the Kogge-Stone PP adder (M1P1), 24.162 ns for the Wallace multiplier using the Brent-Kung PP adder (M2P2), 24.889 ns for the Wallace multiplier using the Ladner-Fischer PP adder (M3P3), and 22.827 ns for the Wallace multiplier using the Han-Carlson PP adder (M4P4). The proposed adder, and hence the FFT processor, can be used in different applications where high speed, low power, and small area are required. The novel PP architecture results in a 20.19% improvement in comparison with other state-of-the-art techniques.
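
The delay figures quoted in this abstract can be turned into relative improvements with a few lines of arithmetic. The sketch below is an illustration only; the relative-reduction formula and the names `PROPOSED_NS`, `BASELINES_NS`, and `improvement_percent` are assumptions, not taken from the paper. Under that formula, the quoted 20.19% corresponds to the gap to the fastest listed baseline (Han-Carlson), which is an inference from the numbers rather than something the abstract states.

```python
# Total delays in nanoseconds, copied from the abstract above.
PROPOSED_NS = 18.218
BASELINES_NS = {
    "Wallace + Kogge-Stone (M1P1)": 24.003,
    "Wallace + Brent-Kung (M2P2)": 24.162,
    "Wallace + Ladner-Fischer (M3P3)": 24.889,
    "Wallace + Han-Carlson (M4P4)": 22.827,
}


def improvement_percent(baseline_ns, proposed_ns=PROPOSED_NS):
    """Relative delay reduction of the proposed design versus one baseline."""
    return 100.0 * (baseline_ns - proposed_ns) / baseline_ns


# Improvement against every baseline listed in the abstract.
improvements = {
    name: improvement_percent(delay) for name, delay in BASELINES_NS.items()
}

for name, pct in improvements.items():
    print(f"{name}: {pct:.2f}% lower delay")
```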

10 citations

References
Proceedings ArticleDOI
29 Sep 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the requirements for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized, and the area/power efficiency has been enhanced.
Abstract: The FFT processor is one of the key components in the implementation of wideband OFDM systems. Architectures with a structured pipeline have been used to meet the fast, real-time processing demand and low-power consumption requirement in a mobile environment. An architecture based on a new form of FFT, the radix-2ⁱ algorithm derived by cascade decomposition, is proposed. By exploiting the spatial regularity of the new algorithm, the requirements for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized. Progressive wordlength adjustment has been introduced to optimize the total memory size for a given signal-to-quantization-noise-ratio (SQNR) requirement in fixed-point processing. A new complex multiplier based on distributed arithmetic further enhances the area/power efficiency of the design. A single-chip processor for a 1K-complex-point FFT is used to demonstrate the design issues under consideration.

322 citations


"VLSI Design and Implementation of R..." refers background in this paper

  • ...In order to reduce circuit complexity of processing kernels and twiddle-factors multiplication, there exist a series of developed design structures, including radix-2² [22]–[24], radix-2³ [25]–[27], and...

    [...]

Journal ArticleDOI
TL;DR: A novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems is presented; the proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), can provide a higher throughput rate by using the multidata-path scheme.
Abstract: In this paper, we present a novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems. The proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), can provide a higher throughput rate by using the multidata-path scheme. Furthermore, by means of the delay feedback and data scheduling approaches, the hardware costs of memory and complex multipliers in MRMDF are only 38.9% and 44.8% of those in the known FFT processor. The high-radix FFT algorithm is also realized in our processor to reduce the number of complex multiplications. A test chip for the UWB system has been designed and fabricated using a 0.18-μm single-poly, six-metal CMOS process with a core area of 1.76 × 1.76 mm², including an FFT/IFFT processor and a test module. The throughput rate of the fabricated FFT processor is up to 1 Gsample/s while it consumes 175 mW. Power dissipation is 77.6 mW when its throughput rate meets the UWB standard, in which the FFT throughput rate is 409.6 Msample/s.

220 citations

Journal ArticleDOI
TL;DR: The proposed radix-2ᵏ feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.
Abstract: The appearance of radix-2² was a milestone in the design of pipelined FFT hardware architectures. Later, radix-2² was extended to radix-2ᵏ. However, radix-2ᵏ was only proposed for single-path delay feedback (SDF) architectures, but not for feedforward ones, also called multi-path delay commutator (MDC). This paper presents the radix-2ᵏ feedforward (MDC) FFT architectures. In feedforward architectures, radix-2ᵏ can be used for any number of parallel samples which is a power of two. Furthermore, both decimation in frequency (DIF) and decimation in time (DIT) decompositions can be used. In addition to this, the designs can achieve very high throughputs, which makes them suitable for the most demanding applications. Indeed, the proposed radix-2ᵏ feedforward architectures require fewer hardware resources than parallel feedback ones, also called multi-path delay feedback (MDF), when several samples in parallel must be processed. As a result, the proposed radix-2ᵏ feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.

198 citations


"VLSI Design and Implementation of R..." refers background in this paper

  • ...The similar FFT design approach even more extends to a general radix-2ᵏ [31], [32] basis....

    [...]

Journal ArticleDOI
TL;DR: A semicustom delay commutator circuit to support the implementation of high-speed fast Fourier transform processors based on the radix 4 pipeline FFT algorithm of J.H. McClellan and R.J. Purdy (1978) is described.
Abstract: The development is described of a semicustom delay commutator circuit to support the implementation of high-speed fast Fourier transform processors based on the radix 4 pipeline FFT algorithm of J.H. McClellan and R.J. Purdy (1978). The delay commutator is a 108000-transistor circuit comprising 12288 shift register stages and approximately 2000 gates of random logic realized with 2.5-micrometer design rule CMOS standard cell technology. It operates at a 10-MHz clock rate and processes data at a 40-MHz rate. The delay commutator is suitable for implementing processors that compute transforms of 16, 64, 256, 1024, and 4096 (complex) points. It is implemented as a 4-bit-wide data slice to facilitate concatenation to accommodate common data word sizes and to use a standard 48-pin dual-in-line package.

141 citations


Additional excerpts

  • ...other single radix, such as radix-3 [8], [9], radix-4 [10]–[14], radix-5 [15]–[18], radix-6 [19], [20], and radix-8 [21]....

    [...]

Journal ArticleDOI
TL;DR: A design methodology for power and area minimization of flexible FFT processors based on the power-area tradeoff space obtained by adjusting algorithm, architecture, and circuit variables is presented.
Abstract: This paper presents a design methodology for power and area minimization of flexible FFT processors. The methodology is based on the power-area tradeoff space obtained by adjusting algorithm, architecture, and circuit variables. Radix factorization is the main technique for achieving high energy efficiency with flexibility, followed by architecture parallelism and delay line circuits. The flexibility is provided by reconfigurable processing units that support radix-2/4/8/16 factorizations. As a proof of concept, a 128- to 2048-point FFT processor for the 3GPP-LTE standard has been implemented in a 65-nm CMOS process. The processor designed for minimum power-area product is integrated in 1.25 × 1.1 mm² and dissipates 4.05 mW at 0.45 V for the 20 MHz LTE bandwidth. The energy dissipation ranging from 2.5 to 103.7 nJ/FFT for 128 to 2048 points makes it the lowest energy flexible FFT.
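
To make the radix-2/4/8/16 factorization mentioned in this abstract concrete, here is a minimal sketch of how a power-of-two FFT size might be split into stages. It assumes a simple greedy, largest-radix-first policy; the paper selects factorizations from a power-area tradeoff space, which this sketch does not model, and the function name `plan_radix_stages` is hypothetical.

```python
def plan_radix_stages(n, max_bits=4):
    """Split a power-of-two FFT size into radix-2/4/8/16 stages.

    Assumption: greedy selection of the largest radix first. A real design
    would choose the split based on power and area, not just stage count.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    bits = n.bit_length() - 1  # log2(n)
    stages = []
    while bits > 0:
        step = min(bits, max_bits)  # a radix-16 stage consumes 4 bits of log2(n)
        stages.append(1 << step)
        bits -= step
    return stages
```

Under this policy, the 2048-point case decomposes into two radix-16 stages followed by one radix-8 stage, and the product of the stage radices always equals the FFT size.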

120 citations


"VLSI Design and Implementation of R..." refers background or methods in this paper

  • ...applied, we can use several performance indexes pre-defined in [40]–[42] to do a fair design comparison....

    [...]

  • ...designs in [38]–[40], they only support 4 - 6 basic cases of FFT sizes defined in 3GPP-LTE standard....

    [...]