VLSI Design and Implementation of Reconfigurable 46-Mode Combined-Radix-Based FFT Hardware Architecture for 3GPP-LTE Applications

doi:10.1109/TCSI.2017.2725338

Journal Article•DOI•

Xin-Yu Shih¹, Hong-Ru Chou², Yue-Qu Liu²•Institutions (2)

MediaTek¹, National Sun Yat-sen University²

01 Jan 2018-IEEE Transactions on Circuits and Systems I: Regular Papers (IEEE)-Vol. 65, Iss. 1, pp. 118-129

TL;DR: This paper presents a reconfigurable fast Fourier transform (FFT) hardware architecture, supporting 46 different FFT sizes defined in 3GPP-LTE applications, and delivers high-quality design results in the aspects of area- and energy-related performance indexes.

Abstract: This paper presents a reconfigurable fast Fourier transform (FFT) hardware architecture, supporting 46 different FFT sizes defined in 3GPP-LTE applications. Our proposed design concept is mainly based on combined radix-5, radix-3², and radix-2⁴ single-path delay feedback FFT design approaches. In addition, in order to elaborate our hardware design, we also develop three design techniques, such as reconfigurable processing kernel with seven types (RPK-ST), efficient FIFO management scheme, and single-table approximation method. In an ASIC implementation with TSMC 40-nm CMOS technology, our 46-mode reconfigurable FFT chip only occupies a core area of 0.36 mm², dissipates 48.46 mW, and operates up to clock frequency of 500 MHz. As compared with the other state-of-the-art works, our work delivers high-quality design results in the aspects of area- and energy-related performance indexes, providing a constructive FFT design prototyping for 3GPP-LTE systems.
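
As a rough illustration of what a combined-radix decomposition implies, the sketch below factors an FFT size into a list of stage radices. It reads the abstract's radices as radix-5, radix-3² (9), and radix-2⁴ (16), prefers those, and falls back to smaller radices for leftover factors. The function name `stage_radices` and the greedy ordering are assumptions made for illustration only; the abstract does not describe the chip's actual stage arrangement or its seven kernel types.

```python
def stage_radices(n: int) -> list:
    """Greedily split n = 2^a * 3^b * 5^c into pipeline stage radices.

    Hypothetical helper, not the paper's scheduling: it prefers
    radix-16 (2^4), radix-9 (3^2), and radix-5, then covers any
    leftover factors with radix-8, radix-4, radix-3, and radix-2.
    """
    if n < 1:
        raise ValueError("FFT size must be a positive integer")
    stages = []
    remaining = n
    for radix in (16, 9, 5, 8, 4, 3, 2):
        # Peel off this radix as long as it still divides the remainder.
        while remaining % radix == 0:
            stages.append(radix)
            remaining //= radix
    if remaining != 1:
        # A prime factor other than 2, 3, or 5 is left over.
        raise ValueError(f"{n} is not of the form 2^a * 3^b * 5^c")
    return stages


if __name__ == "__main__":
    for size in (12, 1200, 1536, 2048):
        print(size, stage_radices(size))
```

Under this greedy rule a 2048-point transform maps to two radix-16 stages and one radix-8 stage, while a size with a prime factor such as 7 is rejected.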

Citations

Journal Article•DOI•

A High-Flexible Low-Latency Memory-Based FFT Processor for 4G, WLAN, and Future 5G

Shaohan Liu¹, Dake Liu¹•Institutions (1)

Beijing Institute of Technology¹

01 Mar 2019-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: The proposed processor has better-normalized throughput per area unit than the state-of-the-art available designs and is designed as a general IP and can be implemented using a processor synthesizer (application-specific instruction-set processor designer).

Abstract: A high-throughput programmable fast Fourier transform (FFT) processor is designed supporting 16- to 4096-point FFTs and 12- to 2400-point discrete Fourier transforms (DFTs) for 4G, wireless local area network, and future 5G. A 16-path data parallel memory-based architecture is selected as a tradeoff between throughput and cost. To implement a hardware-efficient high-speed processor, several improvements are provided. To maximally reuse the hardware resource, a reconfigurable butterfly unit is proposed to support computing including eight radix-2 in parallel, four radix-3/4 in parallel, two radix-5/8 in parallel, and a radix-16 in one clock cycle. Twiddle factor multipliers using different schemes are optimized and compared, wherein modified coordinate rotation digital computer scheme is finally implemented to minimize the hardware cost while supporting both FFTs and DFTs. An optimized conflict-free data access scheme is also proposed to support multiple butterflies at any radices. The processor is designed as a general IP and can be implemented using a processor synthesizer (application-specific instruction-set processor designer). The electronic design automation synthesis result based on a 65-nm technology shows that the processor area is 1.46 mm². The processor supports 972 MS/s 4096-point FFT at 250 MHz with a power consumption of 68.64 mW and a signal-to-quantization-noise ratio of 66.1 dB. The proposed processor has better-normalized throughput per area unit than the state-of-the-art available designs.
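
The abstract above describes a butterfly unit that handles radices 2, 3, 4, 5, 8, and 16 so that non-power-of-two DFT lengths can be computed. A minimal software sketch of the underlying mixed-radix Cooley-Tukey decomposition follows, restricted to radices 2, 3, and 5 with a direct DFT fallback for any other prime factor. The names `smallest_radix` and `mixed_radix_dft` are hypothetical, and the code models the arithmetic only, not the processor's memory-based 16-path data flow or its twiddle-factor hardware.

```python
import cmath


def smallest_radix(n: int, radices=(2, 3, 5)) -> int:
    """Return the first supported radix dividing n, or n itself if none does."""
    for r in radices:
        if n % r == 0:
            return r
    return n  # no supported radix: the caller performs a direct length-n DFT


def mixed_radix_dft(x: list) -> list:
    """Recursive decimation-in-time DFT for arbitrary length (sketch).

    The input is split into r interleaved sub-sequences of length n // r,
    each is transformed recursively, and the results are combined with
    twiddle factors exp(-2*pi*i*j*k/n).
    """
    n = len(x)
    if n == 1:
        return list(x)
    r = smallest_radix(n)
    m = n // r
    # Sub-transform j covers samples x[j], x[j + r], x[j + 2r], ...
    subs = [mixed_radix_dft(x[j::r]) for j in range(r)]
    out = [0j] * n
    for k in range(n):
        acc = 0j
        for j in range(r):
            twiddle = cmath.exp(-2j * cmath.pi * j * k / n)
            acc += twiddle * subs[j][k % m]  # sub-spectra repeat with period m
        out[k] = acc
    return out


if __name__ == "__main__":
    # 12-point example: 12 = 2 * 2 * 3, a length outside the power-of-two family.
    samples = [complex(i, 0) for i in range(12)]
    for value in mixed_radix_dft(samples):
        print(f"{value.real:9.4f} {value.imag:9.4f}")
```

The recursion picks one radix per level, which loosely mirrors how a reconfigurable butterfly is switched between radices from stage to stage.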

30 citations

Cites background or methods from "VLSI Design and Implementation of R..."

...However, only SDF [15], [16], MDF [6], and memory-based [17], [18] architectures support diverse Fig....
[...]
...Several designs are listed for comparison, including a 64- to 4096-point FFT processor [8], two memory-based DFT processors [17], [18] using PFA algorithm, and two SDF DFT processors [15], [16]....
[...]
...SDF processor in [16] supports 46 $2^m 3^n 5^k$ points using a single-table approximation method (STAM) for TF generation....
[...]

Journal Article•DOI•

A Low Latency FFT/IFFT Architecture for Massive MIMO Systems Utilizing OFDM Guard Bands

Mojtaba Mahdavi¹, Ove Edfors¹, Viktor Öwall¹, Liang Liu¹•Institutions (1)

Lund University¹

27 Feb 2019-IEEE Transactions on Circuits and Systems I: Regular Papers

TL;DR: A modified pipelined architecture with a reorganized memory structure and also an efficient data scheduling mechanism for memories and butterflies are developed, which achieves a throughput of 0.6 GS/s and 1200 clock cycles latency, the lowest latency reported to-date for single-input pipelining FFT/IFFT architectures.

Abstract: A considerable part of latency in the baseband of massive multiple-input multiple-output (MIMO) systems is introduced by orthogonal frequency division multiplexing (OFDM) (de)modulation. To address the low-latency demand of massive MIMO systems, a fast Fourier transform (FFT) processor and corresponding reordering scheme are proposed, which reduce the processing latency and reordering latency of OFDM-based systems, respectively. The main idea is to utilize the OFDM guard bands to decrease the number of required computations and thus the processing time. In case of a 2048-point IFFT, the proposed scheme leads to 42% reduction in latency compared to the reported pipelined schemes at the cost of 4% additional memory, which is around 2.4% of the total chip area. To realize this idea, a modified pipelined architecture with a reorganized memory structure and also an efficient data scheduling mechanism for memories and butterflies are developed. Using the proposed scheme, a 2048-point FFT/IFFT processor has been implemented in a 28-nm complementary metal-oxide-semiconductor technology. The post-layout simulations show that our design achieves a throughput of 0.6 GS/s and 1200 clock cycles latency, the lowest latency reported to-date for single-input pipelined FFT/IFFT architectures.

16 citations

Cites background from "VLSI Design and Implementation of R..."

...latency of N clock cycles (or even more) [11], [12], which is...
[...]

Journal Article•DOI•

An Efficient Spur-Aliasing-Free Spectral Calibration Technique in Time-Interleaved ADCs

Han Niu¹, Jie Yuan¹•Institutions (1)

Hong Kong University of Science and Technology¹

25 Feb 2020-IEEE Transactions on Circuits and Systems I: Regular Papers

TL;DR: Compared to benchmark designs, the new technique involves much less calibration hardware overhead, and calibrates all mismatches with much less time, and addresses the needs from practical TI-ADCs well.

Abstract: Time-interleaving (TI) is a major trend for high-speed ADC designs. The major limitation of TI-ADC is the mismatches among ADC channels. Calibration techniques have been actively pursued to compensate the mismatches. In this paper, we present a new spectral calibration technique for TI-ADC. Although the technique does not run in the background, it does not need external calibration signals and has no constraint on the input signal, which is similar to blind estimation. Compared to benchmark designs, the new technique involves much less calibration hardware overhead, and calibrates all mismatches with much less time. In practice, this efficient technique can run repetitively to track the environment changes. It addresses the needs from practical TI-ADCs well. The technique is verified with extensive simulations on a 2GS/s 12-bit 16-channel ADC. With mismatch spurs over −40dBc, the technique can reliably suppress the spurs to be lower than −80dBc within about 4000 samples without the need of iteration.

12 citations

Journal Article•DOI•

Design and Implementation of Flexible and Reconfigurable SDF-Based FFT Chip Architecture With Changeable-Radix Processing Elements

Xin-Yu Shih¹, Hong-Ru Chou², Yue-Qu Liu²•Institutions (2)

MediaTek¹, National Sun Yat-sen University²

23 Aug 2018-IEEE Transactions on Circuits and Systems I: Regular Papers

TL;DR: A flexible and reconfigurable changeable-radix fast Fourier transform (FFT) hardware architecture that aims to support 48 different FFT sizes and up to 4096 FFT points, which are defined in current 3GPP-LTE communication system is proposed.

Abstract: In this paper, we propose a flexible and reconfigurable changeable-radix fast Fourier transform (FFT) hardware architecture. It aims to support 48 different FFT sizes and up to 4096 FFT points, which are defined in current 3GPP-LTE communication system. The built-up design structure is primarily constructed on a radix-5² basis of single-path delay feedback FFT and up to 18 various changeable radixes of FFT processing. A design technique of switchable FIFO usage approach is developed to efficiently manage FIFO arrangement for 48 FFT modes. In addition, a design technique of coarse and fine rotating is designed to effectively reduce twiddle-factor circuit area. By using TSMC 40-nm CMOS technology, an FFT ASIC implementation only has a core area occupation of 0.414 mm² and consumes 49.8 mW in average at maximal working frequency of 526.32 MHz. This innovative design work is competitive as compared to current state-of-the-art works, especially in terms of circuit area cost and power/energy performance evaluation.

11 citations

Cites background or methods or result from "VLSI Design and Implementation of R..."

...Reference [23] uses radix-5 SDF-FFT basis to provide 46 FFT operating modes....
[...]
...Reference [23] is only workable for N≤2048 when the FIFO sub-bank partition and distribution plan are only based on supporting N = 2048 maximally....
[...]
...Unfortunately, [13], [22], and [23] only cover up to N = 2048 FFT points....
[...]
...On the other hand, by using NEE (energy performance index in [32]), our chip is only a little worse than [22] and [23] because our work supports longer FFT length (such as twice of FFT length in both [22] and [23])....
[...]
...By using NAE (area performance index in [32]), our chip is a little worse than [11], [22], and [23]....
[...]

Journal Article•DOI•

A novel parallel prefix adder for optimized Radix-2 FFT processor

Garima Thakur¹, Harsh Sohal¹, Shruti Jain¹•Institutions (1)

Jaypee University of Information Technology¹

15 Mar 2021-Multidimensional Systems and Signal Processing

TL;DR: A hardware design of an efficient Radix-2 FFT architecture using optimized multiplier and novel Parallel prefix (PP) adder is proposed and results in a 20.19% improvement in comparison with other state-of-art techniques.

Abstract: The Fast Fourier Transform (FFT) is the basic building block for DSP applications where high processing speed is the critical requirement. Resource utilization and the number of computational stages in Radix-2 FFT structure implementation can be minimized by improving the performance of utilized multiplier and adder blocks. This work proposes a hardware design of an efficient Radix-2 FFT architecture using optimized multiplier and novel Parallel prefix (PP) adder. The designed FFT architecture results in low power and area with an increase in operation speed in comparison to the existing architectures. Our proposed Radix-2 FFT implementation results in 18.218 ns (6.030 ns logic delay and 12.118 ns router delay) in comparison with 24.003 ns delay for Wallace multiplier using Kogge-Stone PP adder (M1P1), 24.162 ns delay for Wallace multiplier using Brent-Kung PP adder (M2P2), 24.889 ns delay for Wallace multiplier using Ladner-Fischer PP adder (M3P3) and 22.827 ns delay for Wallace multiplier using Han-Carlson PP adder (M4P4) algorithm. The proposed adder and hence the FFT processor can be used in different applications where high speed, low power, and less area is required. The novel PP architecture results in a 20.19% improvement in comparison with other state-of-art techniques.
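
The delay figures quoted in this abstract can be checked with a short calculation. The sketch below computes the relative delay reduction of the 18.218 ns result against each listed baseline; the helper name `percent_reduction` is hypothetical. With the Han Carlson figure of 22.827 ns as the baseline the result rounds to 20.19%, which suggests (as an inference, not something the abstract states) that the headline improvement is measured against the fastest of the four baselines.

```python
def percent_reduction(baseline_ns: float, proposed_ns: float) -> float:
    """Relative delay reduction of the proposed design, in percent."""
    return 100.0 * (baseline_ns - proposed_ns) / baseline_ns


# Delay values copied from the abstract (nanoseconds).
PROPOSED_NS = 18.218
BASELINES_NS = {
    "M1P1 (Kogge Stone)": 24.003,
    "M2P2 (Brent Kung)": 24.162,
    "M3P3 (Ladner Fischer)": 24.889,
    "M4P4 (Han Carlson)": 22.827,
}

if __name__ == "__main__":
    for name, delay in BASELINES_NS.items():
        print(f"{name}: {percent_reduction(delay, PROPOSED_NS):.2f}% lower delay")
```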

10 citations

References

Proceedings Article•DOI•

Designing pipeline FFT processor for OFDM (de)modulation

Shousheng He¹, M. Torkelson•Institutions (1)

Lund University¹

29 Sep 1998

TL;DR: By exploiting the spatial regularity of the new algorithm, the requirement for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized and the area/power efficiency has been enhanced.

Abstract: The FFT processor is one of the key components in the implementation of wideband OFDM systems. Architectures with a structured pipeline have been used to meet the fast, real-time processing demand and low-power consumption requirement in a mobile environment. Architectures based on new forms of FFT, the radix-$2^i$ algorithm derived by cascade decomposition, is proposed. By exploiting the spatial regularity of the new algorithm, the requirement for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized. Progressive wordlength adjustment has been introduced to optimize the total memory size with a given signal-to-quantization-noise-ratio (SQNR) requirement in fixed-point processing. A new complex multiplier based on distributed arithmetic further enhanced the area/power efficiency of the design. A single-chip processor for 1 K complex point FFT transform is used to demonstrate the design issues under consideration.

322 citations

"VLSI Design and Implementation of R..." refers background in this paper

...In order to reduce circuit complexity of processing kernels and twiddle-factor multiplication, there exist a series of developed design structures, including radix-2² [22]–[24], radix-2³ [25]–[27], and...
[...]

Journal Article•DOI•

A 1-GS/s FFT/IFFT processor for UWB applications

Yu-Wei Lin, Hsuan-Yu Liu, Chen-Yi Lee

25 Jul 2005-IEEE Journal of Solid-State Circuits

TL;DR: A novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems and the proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), can provide a higher throughput rate by using the multidata-path scheme.

Abstract: In this paper, we present a novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems. The proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), can provide a higher throughput rate by using the multidata-path scheme. Furthermore, the hardware costs of memory and complex multipliers in MRMDF are only 38.9% and 44.8% of those in the known FFT processor by means of the delay feedback and the data scheduling approaches. The high-radix FFT algorithm is also realized in our processor to reduce the number of complex multiplications. A test chip for the UWB system has been designed and fabricated using 0.18-μm single-poly and six-metal CMOS process with a core area of 1.76 × 1.76 mm², including an FFT/IFFT processor and a test module. The throughput rate of this fabricated FFT processor is up to 1 Gsample/s while it consumes 175 mW. Power dissipation is 77.6 mW when its throughput rate meets UWB standard in which the FFT throughput rate is 409.6 Msample/s.

220 citations

Journal Article•DOI•

Pipelined Radix-$2^{k}$ Feedforward FFT Architectures

Mario Garrido¹, Jesus Grajal, Miguel Ángel Martín Sánchez, Oscar Gustafsson¹•Institutions (1)

Linköping University¹

01 Jan 2013-IEEE Transactions on Very Large Scale Integration Systems

TL;DR: The proposed radix-$2^{k}$ feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.

Abstract: The appearance of radix-2² was a milestone in the design of pipelined FFT hardware architectures. Later, radix-2² was extended to radix-$2^{k}$. However, radix-$2^{k}$ was only proposed for single-path delay feedback (SDF) architectures, but not for feedforward ones, also called multi-path delay commutator (MDC). This paper presents the radix-$2^{k}$ feedforward (MDC) FFT architectures. In feedforward architectures radix-$2^{k}$ can be used for any number of parallel samples which is a power of two. Furthermore, both decimation in frequency (DIF) and decimation in time (DIT) decompositions can be used. In addition to this, the designs can achieve very high throughputs, which makes them suitable for the most demanding applications. Indeed, the proposed radix-$2^{k}$ feedforward architectures require fewer hardware resources than parallel feedback ones, also called multi-path delay feedback (MDF), when several samples in parallel must be processed. As a result, the proposed radix-$2^{k}$ feedforward architectures not only offer an attractive solution for current applications, but also open up a new research line on feedforward structures.

198 citations

"VLSI Design and Implementation of R..." refers background in this paper

...The similar FFT design approach even more extends to a general radix-$2^{k}$ [31], [32] basis....
[...]

Journal Article•DOI•

A radix 4 delay commutator for fast Fourier transform processor implementation

E.E. Swartzlander, W.K.W. Young¹, S.J. Joseph²•Institutions (2)

Northrop Grumman Corporation¹, Alcatel-Lucent²

01 Jan 1984-IEEE Journal of Solid-State Circuits

TL;DR: A semicustom delay commutator circuit to support the implementation of high-speed fast Fourier transform processors based on the radix 4 pipeline FFT algorithm of J.H. McClellan and R.J. Purdy (1978) is described.

Abstract: The development is described of a semicustom delay commutator circuit to support the implementation of high-speed fast Fourier transform processors based on the radix 4 pipeline FFT algorithm of J.H. McClellan and R.J. Purdy (1978). The delay commutator is a 108000-transistor circuit comprising 12288 shift register stages and approximately 2000 gates of random logic realized with 2.5-micrometer design rule CMOS standard cell technology. It operates at a 10-MHz clock rate, which processes data at a 40-MHz rate. The delay commutator is suitable for implementing processors that compute transforms of 16, 64, 256, 1024, and 4096 (complex) points. It is implemented as a 4-bit-wide data slice to facilitate concatenation to accommodate common data word sizes and to use a standard 48-pin dual-in-line package.

141 citations

Additional excerpts

...other single radix, such as radix-3 [8], [9], radix-4 [10]–[14], radix-5 [15]–[18], radix-6 [19], [20], and radix-8 [21]....
[...]

Journal Article•DOI•

Power and Area Minimization of Reconfigurable FFT Processors: A 3GPP-LTE Example

Chia-Hsiang Yang¹, Tsung-Han Yu², Dejan Markovic²•Institutions (2)

National Chiao Tung University¹, University of California, Los Angeles²

01 Mar 2012-IEEE Journal of Solid-State Circuits

TL;DR: A design methodology for power and area minimization of flexible FFT processors based on the power-area tradeoff space obtained by adjusting algorithm, architecture, and circuit variables is presented.

Abstract: This paper presents a design methodology for power and area minimization of flexible FFT processors. The methodology is based on the power-area tradeoff space obtained by adjusting algorithm, architecture, and circuit variables. Radix factorization is the main technique for achieving high energy efficiency with flexibility, followed by architecture parallelism and delay line circuits. The flexibility is provided by reconfigurable processing units that support radix-2/4/8/16 factorizations. As a proof of concept, a 128- to 2048-point FFT processor for 3GPP-LTE standard has been implemented in a 65-nm CMOS process. The processor designed for minimum power-area product is integrated in 1.25 × 1.1 mm² and dissipates 4.05 mW at 0.45 V for the 20 MHz LTE bandwidth. The energy dissipation ranging from 2.5 to 103.7 nJ/FFT for 128 to 2048 points makes it the lowest energy flexible FFT.

120 citations

"VLSI Design and Implementation of R..." refers background or methods in this paper

...applied, we can use several performance indexes pre-defined in [40]–[42] to do a fair design comparison....
[...]
...designs in [38]–[40], they only support 4 - 6 basic cases of FFT sizes defined in 3GPP-LTE standard....
[...]