Author

M. Torkelson

Bio: M. Torkelson is an academic researcher from Lund University. The author has contributed to research on the topics of very-large-scale integration and pipeline (computing). The author has an h-index of 5 and has co-authored 5 publications receiving 1002 citations.

Papers
Proceedings ArticleDOI
15 Apr 1996
TL;DR: A new VLSI architecture for a real-time pipeline FFT processor is proposed, based on a radix-$2^2$ algorithm derived by integrating a twiddle factor decomposition technique into the divide-and-conquer approach. The algorithm has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm.
Abstract: A new VLSI architecture for a real-time pipeline FFT processor is proposed. A hardware-oriented radix-$2^2$ algorithm is derived by integrating a twiddle factor decomposition technique into the divide-and-conquer approach. The radix-$2^2$ algorithm has the same multiplicative complexity as the radix-4 algorithm but retains the butterfly structure of the radix-2 algorithm. The single-path delay-feedback architecture is used to exploit the spatial regularity in the signal flow graph of the algorithm. For length-$N$ DFT computation, the hardware requirement of the proposed architecture is minimal on both dominant components: $\log_4 N - 1$ complex multipliers and a complex data memory of size $N - 1$. The validity and efficiency of the architecture have been verified by simulation in the hardware description language VHDL.

410 citations
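
The index decomposition behind the radix-$2^2$ algorithm can be sketched in software. The block below is a minimal illustration, not the authors' pipeline architecture: it splits the output index as k = k1 + 2·k2 + 4·k3, so that each pair of radix-2 butterfly stages needs only a trivial rotation by a power of -j between them and one full twiddle multiplication afterwards. The function names `dft`, `fft_radix22`, and `sdf_cost` are assumptions made for this example; `sdf_cost` simply evaluates the component counts quoted in the abstract.

```python
import cmath

# Trivial rotations (-j)**m for m = 0..3; these need no general multiplier.
ROT = (1, -1j, -1, 1j)

def dft(x):
    """Direct O(N^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]

def fft_radix22(x):
    """Recursive radix-2^2 decimation-in-frequency FFT (len(x) a power of two)."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n == 2:
        return [x[0] + x[1], x[0] - x[1]]
    q = n // 4
    out = [0j] * n
    for k1 in (0, 1):
        sign = -1 if k1 else 1
        for k2 in (0, 1):
            m = k1 + 2 * k2
            sub = []
            for n3 in range(q):
                # First radix-2 butterfly: inputs N/2 apart.
                b0 = x[n3] + sign * x[n3 + 2 * q]
                b1 = x[n3 + q] + sign * x[n3 + 3 * q]
                # Second radix-2 butterfly with a trivial (-j)**m rotation.
                h = b0 + ROT[m] * b1
                # The only full twiddle multiplication per pair of stages.
                sub.append(h * cmath.exp(-2j * cmath.pi * n3 * m / n))
            y = fft_radix22(sub)  # length-N/4 transform over k3
            for k3 in range(q):
                out[m + 4 * k3] = y[k3]
    return out

def sdf_cost(n):
    """(complex multipliers, memory words) per the abstract, for n a power of 4."""
    stages = (n.bit_length() - 1) // 2
    if n < 4 or 4 ** stages != n:
        raise ValueError("n must be a power of 4")
    return stages - 1, n - 1
```

Comparing `fft_radix22(x)` against `dft(x)` on a short input is a quick sanity check of the decomposition, and `sdf_cost(1024)` gives the multiplier count for a 1K transform.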

Proceedings ArticleDOI
29 Sep 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the requirements for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized, and the area/power efficiency has been enhanced.
Abstract: The FFT processor is one of the key components in the implementation of wideband OFDM systems. Architectures with a structured pipeline have been used to meet the demand for fast, real-time processing and the low-power requirement of a mobile environment. Architectures based on new forms of FFT, the radix-$2^i$ algorithms derived by cascade decomposition, are proposed. By exploiting the spatial regularity of the new algorithm, the requirements for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized. Progressive wordlength adjustment has been introduced to optimize the total memory size for a given signal-to-quantization-noise ratio (SQNR) requirement in fixed-point processing. A new complex multiplier based on distributed arithmetic further enhances the area/power efficiency of the design. A single-chip processor for a 1K complex-point FFT is used to demonstrate the design issues under consideration.

322 citations

Proceedings ArticleDOI
11 May 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the minimal requirement for both dominant components in VLSI implementation has been achieved: only 4 complex multipliers and a 1024 complex-word data memory for the pipelined 1K FFT processor.
Abstract: The design and implementation of a 1024-point pipeline FFT processor is presented. The architecture is based on a new form of FFT, the radix-$2^2$ algorithm. By exploiting the spatial regularity of the new algorithm, the minimal requirement for both dominant components in VLSI implementation has been achieved: only 4 complex multipliers and a 1024 complex-word data memory for the pipelined 1K FFT processor. The chip has been implemented in 0.5 µm CMOS technology and takes an area of 40 mm². With a 3.3 V power supply, it can compute $2^n$, $n = 0, 1, \ldots, 10$, complex-point forward and inverse FFTs in real time with up to 30 MHz sampling frequency. The SQNR is above 50 dB for white noise input.

243 citations

Proceedings ArticleDOI
01 May 1994
TL;DR: A pipelinable bit-serial multiplier using Canonic Signed Digit (CSD) code to represent constant coefficients is introduced, and it is shown that the FPGA architecture is an ideal vehicle for bit-serial processing optimized in this way.
Abstract: A pipelinable bit-serial multiplier using Canonic Signed Digit (CSD) code to represent constant coefficients is introduced. A bit-serial module for $a(x \pm y)z^{-1}$ type computation is further developed. Optimization over the discrete power-of-two coefficient space has been retargeted to this type of multiplier to generate coefficients with a minimized number of nonzero bits. This also makes it possible to confine the latency to the data wordlength without causing a large delay in partial product sum propagation. A single-chip FPGA implementation of a full 16-bit, 31-tap Hilbert transformer is used as an example to demonstrate the application of the multiplier module with special consideration of FPGA architectures. It is shown that the FPGA architecture is an ideal vehicle for bit-serial processing optimized in this way.

38 citations
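
To make the role of CSD coefficients concrete, here is a minimal sketch of the standard CSD (non-adjacent form) recoding of an integer constant, followed by a shift-and-add multiplication that uses one add or subtract per nonzero digit. This illustrates the number representation only; it does not model the bit-serial hardware or the coefficient optimization described in the abstract. The helper names `to_csd` and `csd_multiply` are assumptions made for this example.

```python
def to_csd(value):
    """Return CSD digits of an integer, least significant first.

    Each digit is -1, 0, or +1, and no two adjacent digits are nonzero.
    """
    digits = []
    v = value
    while v != 0:
        if v & 1:
            # v mod 4 == 1 -> digit +1; v mod 4 == 3 -> digit -1.
            d = 2 - (v & 3)
            v -= d
        else:
            d = 0
        digits.append(d)
        v >>= 1
    return digits

def csd_multiply(x, digits):
    """Multiply integer x by the constant encoded in `digits` using shifts and adds."""
    acc = 0
    for i, d in enumerate(digits):
        if d == 1:
            acc += x << i
        elif d == -1:
            acc -= x << i
    return acc

def nonzero_count(digits):
    """Number of add/subtract operations the constant requires."""
    return sum(1 for d in digits if d != 0)
```

For example, the constant 7 is 111 in binary (three nonzero bits) but 8 - 1 in CSD (two nonzero digits), which is why fewer nonzero digits translate into fewer adders in a constant-coefficient multiplier.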

Proceedings ArticleDOI
05 May 1996
TL;DR: The design of an efficient array architecture for the multiplication of complex numbers using distributed arithmetic is presented. A VHDL module with generic parameters has been written and successfully simulated, which enables the complex multiplier module to be included in large designs with the required wordlengths for both operands.
Abstract: The design of an efficient array architecture for the multiplication of complex numbers using distributed arithmetic is presented. The complex multiplier takes an area just over that of two real multipliers, and its speed is almost the same as that of a single real multiplier. The texture of the design is obtained by an in-depth examination of a real multiplier structure with data in the offset binary representation. Residue error compensation and the functional requirements of various boundary cells, such as negative weight addition, are discussed in detail. A VHDL module with generic parameters has been written and successfully simulated, which enables the complex multiplier module to be included in large designs with the required wordlengths for both operands. A test chip has been implemented with a standard library in a 0.8 µm CMOS process and fabricated.

9 citations


Cited by
Journal ArticleDOI
01 May 2001
TL;DR: A survey of academic research and commercial development in reconfigurable computing for DSP systems over the past fifteen years is presented in this article, with a focus on the application domain of digital signal processing.
Abstract: Steady advances in VLSI technology and design tools have extensively expanded the application domain of digital signal processing over the past decade. While application-specific integrated circuits (ASICs) and programmable digital signal processors (PDSPs) remain the implementation mechanisms of choice for many DSP applications, increasingly new system implementations based on reconfigurable computing are being considered. These flexible platforms, which offer the functional efficiency of hardware and the programmability of software, are quickly maturing as the logic capacity of programmable devices follows Moore's Law and advanced automated design techniques become available. As initial reconfigurable technologies have emerged, new academic and commercial efforts have been initiated to support power optimization, cost reduction, and enhanced run-time performance. This paper presents a survey of academic research and commercial development in reconfigurable computing for DSP systems over the past fifteen years. This work is placed in the context of other available DSP implementation media including ASICs and PDSPs to fully document the range of design choices available to system engineers. It is shown that while contemporary reconfigurable computing can be applied to a variety of DSP applications including video, audio, speech, and control, much work remains to realize its full potential. While individual implementations of PDSP, ASIC, and reconfigurable resources each offer distinct advantages, it is likely that integrated combinations of these technologies will provide more complete solutions.

390 citations

Proceedings ArticleDOI
29 Sep 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the requirements for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized, and the area/power efficiency has been enhanced.
Abstract: The FFT processor is one of the key components in the implementation of wideband OFDM systems. Architectures with a structured pipeline have been used to meet the demand for fast, real-time processing and the low-power requirement of a mobile environment. Architectures based on new forms of FFT, the radix-$2^i$ algorithms derived by cascade decomposition, are proposed. By exploiting the spatial regularity of the new algorithm, the requirements for both dominant elements in VLSI implementation, the memory size and the number of complex multipliers, have been minimized. Progressive wordlength adjustment has been introduced to optimize the total memory size for a given signal-to-quantization-noise ratio (SQNR) requirement in fixed-point processing. A new complex multiplier based on distributed arithmetic further enhances the area/power efficiency of the design. A single-chip processor for a 1K complex-point FFT is used to demonstrate the design issues under consideration.

322 citations

Journal ArticleDOI
TL;DR: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor, which has been fabricated in a standard 0.7 µm CMOS process and is fully functional on first-pass silicon.
Abstract: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor. The 460,000-transistor design has been fabricated in a standard 0.7 µm ($L_{\text{poly}}$ = 0.6 µm) CMOS process and is fully functional on first-pass silicon. At a supply voltage of 1.1 V, it calculates a 1024-point complex FFT in 330 µs while consuming 9.5 mW, resulting in an adjusted energy efficiency more than 16 times greater than that of the previously most efficient known FFT processor. At 3.3 V, it operates at 173 MHz, a clock rate 2.6 times greater than the previously fastest rate.

319 citations
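
The figures quoted for 1.1 V operation imply an energy per transform that follows from simple arithmetic (energy = power × time). The sketch below only reproduces that arithmetic from the numbers in the abstract; it does not reproduce the "adjusted" efficiency comparison, whose normalization is not given here. The function name `energy_per_transform` is an assumption made for this example.

```python
def energy_per_transform(power_w, time_s):
    """Energy in joules for one transform: average power times execution time."""
    return power_w * time_s

# Figures taken from the abstract (1.1 V operating point).
POWER_W = 9.5e-3   # 9.5 mW
TIME_S = 330e-6    # 330 microseconds per 1024-point complex FFT

energy_j = energy_per_transform(POWER_W, TIME_S)   # about 3.1 microjoules
transforms_per_second = 1.0 / TIME_S               # about 3030 back-to-back transforms

print(f"energy per 1024-point FFT: {energy_j * 1e6:.3f} uJ")
print(f"transforms per second:     {transforms_per_second:.0f}")
```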

Proceedings ArticleDOI
11 May 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, the minimal requirement for both dominant components in VLSI implementation has been achieved: only 4 complex multipliers and a 1024 complex-word data memory for the pipelined 1K FFT processor.
Abstract: The design and implementation of a 1024-point pipeline FFT processor is presented. The architecture is based on a new form of FFT, the radix-$2^2$ algorithm. By exploiting the spatial regularity of the new algorithm, the minimal requirement for both dominant components in VLSI implementation has been achieved: only 4 complex multipliers and a 1024 complex-word data memory for the pipelined 1K FFT processor. The chip has been implemented in 0.5 µm CMOS technology and takes an area of 40 mm². With a 3.3 V power supply, it can compute $2^n$, $n = 0, 1, \ldots, 10$, complex-point forward and inverse FFTs in real time with up to 30 MHz sampling frequency. The SQNR is above 50 dB for white noise input.

243 citations

Journal ArticleDOI
TL;DR: A novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems is presented. The proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), can provide a higher throughput rate by using the multidata-path scheme.
Abstract: In this paper, we present a novel 128-point FFT/IFFT processor for ultrawideband (UWB) systems. The proposed pipelined FFT architecture, called mixed-radix multipath delay feedback (MRMDF), can provide a higher throughput rate by using the multidata-path scheme. Furthermore, by means of the delay feedback and data scheduling approaches, the hardware costs of memory and complex multipliers in MRMDF are only 38.9% and 44.8% of those in the known FFT processor. The high-radix FFT algorithm is also realized in our processor to reduce the number of complex multiplications. A test chip for the UWB system has been designed and fabricated using a 0.18-µm single-poly, six-metal CMOS process with a core area of 1.76 × 1.76 mm², including an FFT/IFFT processor and a test module. The throughput rate of the fabricated FFT processor is up to 1 Gsample/s while it consumes 175 mW. Power dissipation is 77.6 mW when the throughput rate meets the UWB standard, in which the FFT throughput rate is 409.6 Msample/s.

220 citations
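
The throughput figures in this abstract can be turned into a time budget per 128-point transform by dividing the transform length by the sample rate. The sketch below does only that arithmetic, under the assumption that the processor streams samples continuously at the quoted rate; the function names `transform_time` and `energy_per_transform` are assumptions made for this example.

```python
FFT_POINTS = 128

def transform_time(points, samples_per_second):
    """Seconds needed to stream one transform's worth of samples."""
    return points / samples_per_second

def energy_per_transform(power_w, time_s):
    """Energy in joules spent during one transform period."""
    return power_w * time_s

# Operating points quoted in the abstract.
t_uwb = transform_time(FFT_POINTS, 409.6e6)   # UWB-standard rate
t_max = transform_time(FFT_POINTS, 1e9)       # maximum reported rate

e_uwb = energy_per_transform(77.6e-3, t_uwb)  # at 77.6 mW
e_max = energy_per_transform(175e-3, t_max)   # at 175 mW

print(f"UWB rate: {t_uwb * 1e9:.1f} ns per transform, {e_uwb * 1e9:.2f} nJ")
print(f"Max rate: {t_max * 1e9:.1f} ns per transform, {e_max * 1e9:.2f} nJ")
```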