Journal Article

A 2.4-Gsample/s DVFS FFT Processor for MIMO OFDM Communication Systems

22 Apr 2008-IEEE Journal of Solid-state Circuits (IEEE)-Vol. 43, Iss: 5, pp 1260-1273
TL;DR: This paper presents a new dynamic voltage and frequency scaling (DVFS) FFT processor for MIMO OFDM applications and proposes a novel open-loop voltage detection and scaling (OLVDS) mechanism for fast and robust voltage management.
Abstract: This paper presents a new dynamic voltage and frequency scaling (DVFS) FFT processor for MIMO OFDM applications. With the proposed multimode multipath-delay-feedback (MMDF) architecture, our FFT processor can process 1-8-stream 256-point FFTs or a high-speed 256-point FFT in two processing domains at minimum clock frequency for DVFS operations. A parallelized radix-2^4 FFT algorithm is also employed to reduce the power consumption and hardware cost of complex multipliers. Furthermore, a novel open-loop voltage detection and scaling (OLVDS) mechanism is proposed for fast and robust voltage management. With these schemes, the proposed FFT processor can operate at adequate voltage/frequency under different configurations to support the power-aware feature. A test chip of the proposed FFT processor has been fabricated using the UMC 90 nm single-poly nine-metal CMOS process with a core area of 1.88 × 1.88 mm². The SQNR performance of this FFT chip is over 35.8 dB for QPSK/16-QAM modulation. Power dissipation of 2.4 Gsample/s 256-point FFT computations is about 119.7 mW at 0.85 V. Depending on the operation mode, power can be saved by 18%-43% with voltage scaling in the TT corner.
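The throughput and power figures quoted in the abstract can be turned into an energy figure with simple arithmetic. The following is a minimal sketch, assuming the 119.7 mW is the average power while the chip sustains the full 2.4 Gsample/s; the function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract above.
POWER_W = 119.7e-3        # power dissipation at 0.85 V, in watts
THROUGHPUT_SPS = 2.4e9    # sustained throughput, in samples per second
FFT_POINTS = 256          # transform length


def energy_per_sample(power_w: float, throughput_sps: float) -> float:
    """Average energy per processed sample, in joules (assumes steady state)."""
    return power_w / throughput_sps


def energy_per_fft(power_w: float, throughput_sps: float, points: int) -> float:
    """Average energy per FFT of the given length, in joules."""
    return energy_per_sample(power_w, throughput_sps) * points


e_sample = energy_per_sample(POWER_W, THROUGHPUT_SPS)
e_fft = energy_per_fft(POWER_W, THROUGHPUT_SPS, FFT_POINTS)

print(f"{e_sample * 1e12:.3f} pJ per sample")
print(f"{e_fft * 1e9:.3f} nJ per {FFT_POINTS}-point FFT")
```

Under that assumption the result is roughly 50 pJ per sample, or about 12.8 nJ per 256-point FFT.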
Citations
Journal Article
TL;DR: A novel simplification method to reduce the hardware cost in multiplication units of the multiple-path FFT approach is proposed and a multidata scaling scheme to reduce wordlengths while preserving the signal-to-quantization-noise ratio is presented.
Abstract: This brief presents a fast Fourier transform (FFT) processor that provides high throughput rate (T.R.) by applying the eight-data-path pipelined approach for wireless personal area network applications. The hardware costs, including the power consumption and area, increase due to multiple data paths and increased wordlength along stages. To resolve these issues, a novel simplification method to reduce the hardware cost in multiplication units of the multiple-path FFT approach is proposed. A multidata scaling scheme to reduce wordlengths while preserving the signal-to-quantization-noise ratio is also presented. Using UMC 90-nm 1P9M technology, a 2048-point FFT processor test chip has been designed, and its 128-point FFT kernel has been fabricated for ultrawideband (UWB) applications and also for verification. The 2048-point FFT processor can provide a T.R. of 2.4 GS/s at 300 MHz with a power consumption of 159 mW. Compared with the four-data-path approach, a power consumption saving of about 30% can be achieved under the same T.R. In addition, the 128-point FFT kernel test chip has a measured power consumption of 6.8 mW with a T.R. of 409.6 MS/s at 52 MHz to meet the UWB standard with a saving in power consumption of about 40%.
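The relation between data paths, clock frequency, and throughput rate in this abstract can be checked numerically. This is a rough sketch, assuming each data path accepts one sample per clock cycle and that the 128-point kernel also uses eight paths; neither assumption is stated explicitly in the abstract, and the helper names are hypothetical.

```python
def peak_throughput(paths: int, clock_hz: float) -> float:
    """Samples per second, assuming one sample per path per clock cycle."""
    return paths * clock_hz


def min_clock(paths: int, throughput_sps: float) -> float:
    """Lowest clock frequency (Hz) that sustains the given throughput."""
    return throughput_sps / paths


# 2048-point processor: eight paths at 300 MHz.
tr_2048 = peak_throughput(8, 300e6)

# The same throughput with only four paths needs twice the clock rate.
clk_four_path = min_clock(4, tr_2048)

# 128-point kernel: clock needed for 409.6 MS/s with eight paths (assumed).
clk_kernel = min_clock(8, 409.6e6)

print(f"8 paths @ 300 MHz -> {tr_2048 / 1e9:.1f} GS/s")
print(f"4 paths for the same rate -> {clk_four_path / 1e6:.0f} MHz")
print(f"409.6 MS/s over 8 paths -> {clk_kernel / 1e6:.1f} MHz")
```

Under these assumptions the kernel needs slightly less than the 52 MHz quoted in the abstract, so the measured clock leaves a small margin above the required rate.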

131 citations


Cites methods from "A 2.4-Gsample/s DVFS FFT Processor ..."

  • ...by using parallel data paths is also shown in [5] and [10]–[12]....

    [...]

  • ...For comparison with different technologies, the area and power consumption are normalized based on [12] and [15], as illustrated in the following:...

    [...]

Journal Article
TL;DR: A design methodology for power and area minimization of flexible FFT processors based on the power-area tradeoff space obtained by adjusting algorithm, architecture, and circuit variables is presented.
Abstract: This paper presents a design methodology for power and area minimization of flexible FFT processors. The methodology is based on the power-area tradeoff space obtained by adjusting algorithm, architecture, and circuit variables. Radix factorization is the main technique for achieving high energy efficiency with flexibility, followed by architecture parallelism and delay line circuits. The flexibility is provided by reconfigurable processing units that support radix-2/4/8/16 factorizations. As a proof of concept, a 128- to 2048-point FFT processor for 3GPP-LTE standard has been implemented in a 65-nm CMOS process. The processor designed for minimum power-area product is integrated in 1.25 × 1.1 mm² and dissipates 4.05 mW at 0.45 V for the 20 MHz LTE bandwidth. The energy dissipation ranging from 2.5 to 103.7 nJ/FFT for 128 to 2048 points makes it the lowest energy flexible FFT.

120 citations


Cites methods from "A 2.4-Gsample/s DVFS FFT Processor ..."

  • ...For a power-of-prime size, the Winograd Fourier Transform Algorithm (WFTA) [17] performs the decomposition efficiently using cyclic-convolution techniques....

    [...]

  • ...We start from radix-2 architecture since mixed-radix structures are built using radix-2 butterflies....

    [...]

01 Jan 2012
TL;DR: This work proposes Globally-Ratiochronous, Locally-Synchronous (GRLS) systems, where GRLS is a design style intermediate between the mesochronous and the GALS design paradigms: local frequencies in a GRLS system do not need to be identical, but are required to be rationally-related, which results in high figures of merit for GRLS systems.
Abstract: It is well recognized in the literature that the fully-synchronous design style, once the best choice due especially to the simplicity of its design flow, is not suitable for present-day systems, which contain many more gates compared to their predecessors, and has to be superseded to meet the new needs of the industry. The alternative solution that has enjoyed more success in industry and the literature consists in breaking down a system into several fully-synchronous modules clocked with independent clocks. Such systems go under the name of Globally-non-Synchronous (GnS) and make no assumption on the phase alignment between the clocks in the individual modules. GnS design styles do not require a globally balanced clock tree and employ special synchronizers to achieve latency-insensitivity. The individual modules, whose sizes are relatively small, remain fully-synchronous, thus easy to design and maintain. Two main classes of GnS systems have been proposed: the GALS (for Globally-Asynchronous, Locally-Synchronous) design style allows each module to be clocked at its own independent clock frequency; the mesochronous design style constrains all modules to run at the same frequency. GALS systems support per-module Dynamic Voltage-Frequency Scaling (DVFS), but GALS interfaces are complex and introduce high performance penalties; mesochronous systems do not support per-module DVFS but support simpler and faster interfaces. It is well recognized that neither of the two design styles can fully satisfy all the contrasting needs of the electronic industry, and often hybrid solutions are deployed as a trade-off. We propose Globally-Ratiochronous, Locally-Synchronous (GRLS) systems, where GRLS is a design style intermediate between the mesochronous and the GALS design paradigms: local frequencies in a GRLS system do not need to be identical, but are required to be rationally-related (such as one being 3/4 or 2/5 of the other).
The periodic properties of rationally-related systems allow the deployment of interfaces that do not use any form of handshake and, thanks to this, are much more performant than GALS interfaces; on the other hand, GRLS supports quantized per-module DVFS. In this work we deploy and analyse all the components of the GRLS design style: the frequency regulation system, the voltage regulation system, and the GRLS latency-insensitive interfaces. We perform a theoretical analysis of DVFS efficiency in different GRLS systems, and then study a GRLS NoC-based platform. We also develop a complete GRLS power management system for a GRLS Network-on-Chip (NoC)-based platform. Experimental results show that GRLS performances are close to those of mesochronous systems and GRLS flexibility is close to that of GALS systems, which results in high figures of merit for GRLS systems. As an example, the GRLS NoC-based platform we study in this work has at least ≈ 21% lower latency-power product compared to alternative mesochronous-GALS hybrid platforms, and respectively ≈ 32% and ≈ 48% better latency-power product compared to mesochronous and GALS platforms.
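The periodicity that rationally-related clocks provide can be illustrated with exact fractions. This is a small sketch, assuming both clocks have a rising edge at time zero (zero initial phase offset); the function names are hypothetical and are not taken from the GRLS work.

```python
from fractions import Fraction


def window_cycles(ratio: Fraction) -> tuple:
    """For f_a / f_b = ratio, return (cycles of A, cycles of B) in one
    repeating window; Fraction already reduces the ratio to lowest terms."""
    return ratio.numerator, ratio.denominator


def edge_phases(ratio: Fraction) -> list:
    """Phase of each rising edge of clock A within one window, expressed as a
    fraction of clock B's period measured from the preceding B edge."""
    p, q = window_cycles(ratio)
    # The k-th edge of A occurs at k * (q / p) periods of B; keep the
    # fractional part to get its offset from the previous B edge.
    return [Fraction(k * q, p) % 1 for k in range(p)]


ratio = Fraction(3, 4)  # clock A runs at 3/4 of clock B's frequency
print(window_cycles(ratio))  # 3 cycles of A span the same time as 4 of B
print(edge_phases(ratio))    # the same phase pattern repeats every window
```

Because the set of phases is finite and repeats exactly, an interface can in principle be scheduled around it ahead of time instead of resolving each transfer with a handshake, which is the property the abstract attributes to rationally-related systems.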

115 citations


Cites methods from "A 2.4-Gsample/s DVFS FFT Processor ..."

  • ...Internal DVFS is a well-known and widely-applied DVFS technique which can be applied to both GnS and synchronous systems [133]....

    [...]

Journal Article
TL;DR: A multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length is presented.
Abstract: This paper presents a multipath delay commutator (MDC)-based architecture and memory scheduling to implement fast Fourier transform (FFT) processors for multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use radix-Ns butterflies at each stage, where Ns is the number of data streams, so that there is only one butterfly needed in each stage. Consequently, a 100% utilization rate in computational elements is achieved. Moreover, thanks to the simple control mechanism of the MDC, we propose simple memory scheduling methods for input data and output bit/set-reversing, which again results in a full utilization rate in memory usage. Since the memory requirements usually dominate the die area of FFT/inverse fast Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the memory size and thus the die area as well. Furthermore, to apply the proposed scheme in practical applications, we let Ns=4 and implement a 4-stream FFT/IFFT processor with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution applications. The processor was implemented with a UMC 90-nm CMOS technology with a core area of 3.1 mm². The power consumption at 40 MHz was 63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively, in the post-layout simulation. Finally, we analyze the complexity and performance of the implemented processor and compare it with other processors. The results show advantages of the proposed scheme in terms of area and power consumption.

99 citations


Additional excerpts

  • ...Proposed [26] [27] [29] [31] [30] [24] [28] [25] [20] [14] [12]...

    [...]

Proceedings Article
01 Jan 2006
TL;DR: Experimental results with a 90-nm CMOS device indicate that use of the proposed power monitoring results in the successful minimizing of power consumption.
Abstract: This paper describes newly developed delay and power monitoring schemes for minimizing power consumption by means of the dynamic control of supply voltage V_DD and threshold voltage V_TH in active and standby modes. In the active mode, on the basis of delay monitoring results, either V_DD control or V_TH control is selected to avoid any oscillation problem between them. In V_DD control, on the basis of delay monitoring results, V_DD is adjusted so as to be maintained at the minimum value at which the chip is able to operate for a given clock frequency. In V_TH control, on the basis of power monitoring results, V_TH is adjusted so as to maintain a certain switching current I_SW/leakage current I_LEAK ratio known to indicate minimum power consumption. In the standby mode, the precision of power monitoring (which detects optimum body bias by comparing subthreshold current I_SUBTH to substrate current I_SUB) is improved by taking into consideration both the effects of lowering V_DD and the effects of the presence of gate-oxide leakage current. Experimental results with a 90-nm CMOS device indicate that use of the proposed power monitoring results in the successful minimizing of power consumption. It does so by making it possible to: 1) maintain the I_SW/I_LEAK ratio in the active mode and 2) detect optimum body bias conditions (I_SUBTH = I_SUB) within an error of less than 20% with respect to actual minimum leakage current values in the standby mode.

81 citations

References
Book
01 Jan 2007
TL;DR: This book discusses Digital Signal Processing Systems, Pipelining and Parallel Processing, Synchronous, Wave, and Asynchronous Pipelines, and Bit-Level Arithmetic Architectures.
Abstract: Introduction to Digital Signal Processing Systems. Iteration Bound. Pipelining and Parallel Processing. Retiming. Unfolding. Folding. Systolic Architecture Design. Fast Convolution. Algorithmic Strength Reduction in Filters and Transforms. Pipelined and Parallel Recursive and Adaptive Filters. Scaling and Roundoff Noise. Digital Lattice Filter Structures. Bit-Level Arithmetic Architectures. Redundant Arithmetic. Numerical Strength Reduction. Synchronous, Wave, and Asynchronous Pipelines. Low-Power Design. Programmable Digital Signal Processors. Appendices. Index.

1,361 citations


"A 2.4-Gsample/s DVFS FFT Processor ..." refers methods in this paper

  • ...The canonical signed digit (CSD) technique [9] can also be applied to further reduce the hardware complexity....

    [...]
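The canonical signed digit (CSD) technique mentioned in the excerpt above recodes a constant with digits from {-1, 0, +1} so that no two adjacent digits are nonzero, which reduces the number of add/subtract terms in a constant multiplier. The sketch below uses the standard recoding rule; the coefficient value is a hypothetical example and is not taken from the paper.

```python
def to_csd(n: int) -> list:
    """CSD digits of a non-negative integer, least-significant digit first.
    Each digit is -1, 0 or +1 and no two adjacent digits are nonzero."""
    digits = []
    while n != 0:
        if n & 1:
            # n mod 4 == 1 -> digit +1, n mod 4 == 3 -> digit -1.
            # Subtracting the digit makes n divisible by 4, so the
            # following digit is guaranteed to be zero.
            d = 2 - (n & 3)
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def from_csd(digits: list) -> int:
    """Rebuild the integer from its CSD digits (inverse of to_csd)."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


coefficient = 946  # hypothetical fixed-point constant, used for illustration
csd = to_csd(coefficient)
binary_terms = bin(coefficient).count("1")
csd_terms = sum(1 for d in csd if d != 0)
print(f"binary nonzero digits: {binary_terms}, CSD nonzero digits: {csd_terms}")
```

Each nonzero digit corresponds to one shifted add or subtract in a shift-and-add multiplier, so fewer nonzero digits means fewer adders for that constant.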

Journal Article
03 Jan 2005
TL;DR: New subthreshold logic and memory design methodologies are developed and demonstrated on a fast Fourier transform (FFT) processor that is designed to investigate the estimated minimum energy point.
Abstract: In emerging embedded applications such as wireless sensor networks, the key metric is minimizing energy dissipation rather than processor speed. Minimum energy analysis of CMOS circuits estimates the optimal operating point of clock frequencies, supply voltage, and threshold voltage according to A. Chandrakasan et al. (see ibid., vol. 27, no. 4, pp. 473-84, Apr. 1992). The minimum energy analysis shows that the optimal power supply typically occurs in subthreshold (e.g., supply voltages that are below device thresholds). New subthreshold logic and memory design methodologies are developed and demonstrated on a fast Fourier transform (FFT) processor. The FFT processor uses an energy-aware architecture that allows for variable FFT length (128-1024 point), variable bit-precision (8 b and 16 b) and is designed to investigate the estimated minimum energy point. The FFT processor is fabricated using a standard 0.18-μm CMOS logic process and operates down to 180 mV. The minimum energy point for the 16-b 1024-point FFT processor occurs at 350-mV supply voltage where it dissipates 155 nJ/FFT at a clock frequency of 10 kHz.

619 citations


"A 2.4-Gsample/s DVFS FFT Processor ..." refers background or methods or result in this paper

  • ...From Table VI, we can find that [14] and our proposed FFT processor have similar energy efficiency which is higher than that of [13]....

    [...]

  • ...However, it should be noted that the FFT processor in [14] is a full-custom...

    [...]

  • ...Here we choose two variable-length FFT chips supporting 256-point FFT [13], [14] to compare the energy efficiency....

    [...]

Proceedings Article
29 Sep 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the requirement for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized and the area/power efficiency has been enhanced.
Abstract: The FFT processor is one of the key components in the implementation of wideband OFDM systems. Architectures with a structured pipeline have been used to meet the fast, real-time processing demand and low-power consumption requirement in a mobile environment. Architectures based on new forms of FFT, the radix-2^i algorithm derived by cascade decomposition, are proposed. By exploiting the spatial regularity of the new algorithm, the requirement for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized. Progressive wordlength adjustment has been introduced to optimize the total memory size with a given signal-to-quantization-noise-ratio (SQNR) requirement in fixed-point processing. A new complex multiplier based on distributed arithmetic further enhanced the area/power efficiency of the design. A single-chip processor for 1 K complex point FFT transform is used to demonstrate the design issues under consideration.
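For readers unfamiliar with the underlying transform, the sketch below shows a plain recursive radix-2 decimation-in-time FFT in software and checks it against a direct DFT. It is a stand-in for illustration only: it is not the radix-2^i cascade decomposition or the pipelined hardware architecture the abstract describes, and all names are hypothetical.

```python
import cmath


def fft_radix2(x: list) -> list:
    """Recursive radix-2 decimation-in-time FFT; length must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # transform of even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: one twiddle-factor multiplication, one add, one subtract.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def dft_direct(x: list) -> list:
    """O(N^2) reference DFT used only to check the FFT result."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]


samples = [complex(m % 5, -(m % 3)) for m in range(16)]  # arbitrary test input
max_error = max(abs(a - b) for a, b in zip(fft_radix2(samples), dft_direct(samples)))
print(f"max deviation from direct DFT: {max_error:.2e}")
```

The twiddle-factor multiplication inside the butterfly loop is the complex-multiplier cost that higher-radix decompositions such as radix-2^i aim to reduce.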

322 citations


"A 2.4-Gsample/s DVFS FFT Processor ..." refers methods in this paper

  • ...For comparison, the traditional method adopting eight single-path delay feedback (SDF) structures [8] is also implemented with clock gating and the same radix-2 algorithm is adopted in this reference design....

    [...]

  • ...To implement the 16-point DFT more efficiently, we use the radix-2 FFT algorithm which can easily be derived from the radix-2 algorithm [8]....

    [...]

Journal Article
TL;DR: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor, which has been fabricated in a standard 0.7 μm CMOS process and is fully functional on first-pass silicon.
Abstract: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor. The 460000-transistor design has been fabricated in a standard 0.7 μm (L_poly = 0.6 μm) CMOS process and is fully functional on first-pass silicon. At a supply voltage of 1.1 V, it calculates a 1024-point complex FFT in 330 μs while consuming 9.5 mW, resulting in an adjusted energy efficiency more than 16 times greater than the previously most efficient known FFT processor. At 3.3 V, it operates at 173 MHz, which is a clock rate 2.6 times greater than the previously fastest rate.

319 citations

Journal Article
TL;DR: In this paper, a variable supply-voltage (VS) scheme was proposed to automatically generate minimum internal supply voltages by feedback control of a buck converter, a speed detector, and a timing controller so that they meet the demand on its operation frequency.
Abstract: This paper describes a variable supply-voltage (VS) scheme. From an external supply, the VS scheme automatically generates minimum internal supply voltages by feedback control of a buck converter, a speed detector, and a timing controller so that they meet the demand on its operation frequency. A 32-b RISC core processor is developed in a 0.4-μm CMOS technology which optimally controls the internal supply voltages with the VS scheme and the threshold voltages through substrate bias control. Performance in MIPS/W is improved by a factor of more than two compared with its conventional CMOS design.

309 citations