scispace - formally typeset
Search or ask a question
Proceedings ArticleDOI

A family of scalable FFT architectures and an implementation of 1024-point radix-2 FFT for real-time communications

TL;DR: A family of architectures for FFT implementation based on the decomposition of the perfect shuffle permutation is presented, which can be designed with variable number of processing elements, providing designers with a trade-off choice of speed vs. complexity.
Abstract: The paper presents a family of architectures for FFT implementation based on the decomposition of the perfect shuffle permutation, which can be designed with variable number of processing elements. This provides designers with a trade-off choice of speed vs. complexity (cost and area.). A detailed case study is provided on the implementation of 1024-point FFT with 2 processing elements using 45 nm process technology, including area, timing, power and place-and-route results.

Content maybe subject to copyright    Report

Citations
More filters
Book ChapterDOI
07 Oct 2012
TL;DR: This work makes a first step towards efficient FFT-based arithmetic for lattice-based cryptography and shows that the FFT can be implemented efficiently on reconfigurable hardware.
Abstract: In recent years lattice-based cryptography has emerged as quantum secure and theoretically elegant alternative to classical cryptographic schemes (like ECC or RSA). In addition to that, lattices are a versatile tool and play an important role in the development of efficient fully or somewhat homomorphic encryption (SHE/FHE) schemes. In practice, ideal lattices defined in the polynomial ring ℤp[x]/〈xn+1〉 allow the reduction of the generally very large key sizes of lattice constructions. Another advantage of ideal lattices is that polynomial multiplication is a basic operation that has, in theory, only quasi-linear time complexity of ${\mathcal O}(n \log{n})$ in ℤp[x]/〈xn+1〉. However, few is known about the practical performance of the FFT in this specific application domain and whether it is really an alternative. In this work we make a first step towards efficient FFT-based arithmetic for lattice-based cryptography and show that the FFT can be implemented efficiently on reconfigurable hardware. We give instantiations of recently proposed parameter sets for homomorphic and public-key encryption. In a generic setting we are able to multiply polynomials with up to 4096 coefficients and a 17-bit prime in less than 0.5 milliseconds. For a parameter set of a SHE scheme (n=1024,p=1061093377) our implementation performs 9063 polynomial multiplications per second on a mid-range Spartan-6.

157 citations

Proceedings ArticleDOI
19 Apr 2009
TL;DR: Analysis of a processor architecture tailored for radix-4 and mixed-radix FFT algorithms shows that a programmable solution can possess energy-efficiency comparable to a fixed-function ASIC.
Abstract: In this paper, we describe a processor architecture tailored for radix-4 and mixed-radix FFT algorithms, which have lower arithmetic complexity than radix-2 algorithms. The processor is based on transport triggered architecture and several optimizations have been used to improve the energy-efficiency. The processor has been synthesized on a 130nm standard cell technology and analysis show that a programmable solution can possess energy-efficiency comparable to a fixed-function ASIC.

17 citations

Journal ArticleDOI
01 Apr 2011
TL;DR: A processor architecture tailored for radix-4 and mixed-radix FFT computations is described and experiments show that a programmable solution can possess energy-efficiency comparable to fixed-function ASICs.
Abstract: In this paper, a processor architecture tailored for radix-4 and mixed-radix FFT computations is described. The processor has native support for power-of-two transform sizes. Several optimizations have been used to improve the energy-efficiency of the processor and experiments show that a programmable solution can possess energy-efficiency comparable to fixed-function ASICs.

14 citations

BookDOI
01 Jan 2012
TL;DR: This paper shows that, by specializing the construction of Shallue and van de Woestijne to BN curves, one obtains an encoding function that can be implemented rather efficiently and securely, and that is well-distributed in the sense of Farashahi et al., so that one can easily build from it a hash function that is indifferentiable from a random oracle.
Abstract: A number of recent works have considered the problem of constructing constant-time hash functions to various families of elliptic curves over finite fields. In the relevant literature, it has been occasionally asserted that constant-time hashing to certain special elliptic curves, in particular so-called BN elliptic curves, was an open problem. It turns out, however, that a suitably general encoding function was constructed by Shallue and van de Woestijne back in 2006. In this paper, we show that, by specializing the construction of Shallue and van de Woestijne to BN curves, one obtains an encoding function that can be implemented rather efficiently and securely, that reaches about 9/16ths of all points on the curve, and that is well-distributed in the sense of Farashahi et al., so that one can easily build from it a hash function that is indifferentiable from a random oracle.

14 citations

Proceedings ArticleDOI
06 Mar 2014
TL;DR: This paper demonstrates the FPGA implementation of FFT algorithm that is precisely designed to induce an efficient implementation of the parameters involving area and performance by configuring the size of F FT input points which is well suited for wireless and signal processing applications.
Abstract: This paper demonstrates the FPGA implementation of FFT algorithm that is precisely designed to induce an efficient implementation of the parameters involving area and performance by configuring the size of FFT input points which is well suited for wireless and signal processing applications. An optimized architecture is demonstrated in this paper for computing FFT of length 8/16/32/64/128/512 and 1024 using Radix-4/Radix 2∗2 FFT in FPGA and is compared with Xilinx LogiCore™ FFT IP with configurable point size. It is found that proposed design is more efficient and effective in terms of area and performance while achieving the input system configurability. A novel Address Generator architecture has been proposed which facilitates for Complex Math Processor (CMP). This single generator helps in effectively carrying out the address mapping scheme. The occurrence of hardware overheads is minimized by using the multiplexor for complex arithmetic's. The entire RTL design is described using Verilog HDL and simulated using Xilinx ISim. This experimental result is tested on Spartan-6 XC6SLX4, which is the smallest device on Spartan 6 family and found that Xilinx FFT IP core over maps the available DSP48 slices. The result shows 538 LUT's, 847 Flip Flops, 3 DSP Slices, Maximum Frequency of 217 MHz. This is about 60% improvement in resource usage and 14% upgrade in the performance thus creating a low cost Configurable FFT Processor.

5 citations

References
More filters
Journal ArticleDOI
TL;DR: Good generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series, applicable to certain problems in which one must multiply an N-vector by an N X N matrix which can be factored into m sparse matrices.
Abstract: An efficient method for the calculation of the interactions of a 2' factorial ex- periment was introduced by Yates and is widely known by his name. The generaliza- tion to 3' was given by Box et al. (1). Good (2) generalized these methods and gave elegant algorithms for which one class of applications is the calculation of Fourier series. In their full generality, Good's methods are applicable to certain problems in which one must multiply an N-vector by an N X N matrix which can be factored into m sparse matrices, where m is proportional to log N. This results inma procedure requiring a number of operations proportional to N log N rather than N2. These methods are applied here to the calculation of complex Fourier series. They are useful in situations where the number of data points is, or can be chosen to be, a highly composite number. The algorithm is here derived and presented in a rather different form. Attention is given to the choice of N. It is also shown how special advantage can be obtained in the use of a binary computer with N = 2' and how the entire calculation can be performed within the array of N data storage locations used for the given Fourier coefficients. Consider the problem of calculating the complex Fourier series N-1 (1) X(j) = EA(k)-Wjk, j = 0 1, * ,N- 1, k=0

11,795 citations


"A family of scalable FFT architectu..." refers methods in this paper

  • ...The Fast Fourier Transform (FFT) is a conventional method for an accelerated computation of the Discrete Fourier Transform (DFT) [1], which has been used in many applications such as spectrum estimation, fast convolution and correlation, signal modulation, etc....

    [...]

  • ...Also, algorithms are known in two types DIT, decimation in time, where complex multiplication occurs after the two-point DFT; and DIF, decimation in frequency, where complex multiplication occurs before the two-point DFT....

    [...]

  • ...In [1] Radix-2 FFT of N points –where N is integer power of 2-requires N log2N complex operations compared to N2 of direct DFT computation....

    [...]

  • ...Many of the FFT algorithms relate to the " butterfly structure " presented first by Cooley and Tukey [1] where separate processing element (PE) is assigned for each node of the FFT flow....

    [...]

Book
01 Jan 1975
TL;DR: Feyman and Wing as discussed by the authors introduced the simplicity of the invariant imbedding method to tackle various problems of interest to engineers, physicists, applied mathematicians, and numerical analysts.
Abstract: sprightly style and is interesting from cover to cover. The comments, critiques, and summaries that accompany the chapters are very helpful in crystalizing the ideas and answering questions that may arise, particularly to the self-learner. The transparency in the presentation of the material in the book equips the reader to proceed quickly to a wealth of problems included at the end of each chapter. These problems ranging from elementary to research-level are very valuable in that a solid working knowledge of the invariant imbedding techniques is acquired as well as good insight in attacking problems in various applied areas. Furthermore, a useful selection of references is given at the end of each chapter. This book may not appeal to those mathematicians who are interested primarily in the sophistication of mathematical theory, because the authors have deliberately avoided all pseudo-sophistication in attaining transparency of exposition. Precisely for the same reason the majority of the intended readers who are applications-oriented and are eager to use the techniques quickly in their own fields will welcome and appreciate the efforts put into writing this book. From a purely mathematical point of view, some of the invariant imbedding results may be considered to be generalizations of the classical theory of first-order partial differential equations, and a part of the analysis of invariant imbedding is still at a somewhat heuristic stage despite successes in many computational applications. However, those who are concerned with mathematical rigor will find opportunities to explore the foundations of the invariant imbedding method. In conclusion, let me quote the following: "What is the best method to obtain the solution to a problem'? The answer is, any way that works." (Richard P. Feyman, Engineering and Science, March 1965, Vol. XXVIII, no. 6, p. 9.) In this well-written book, Bellman and Wing have indeed accomplished the task of introducing the simplicity of the invariant imbedding method to tackle various problems of interest to engineers, physicists, applied mathematicians, and numerical analysts.

3,249 citations

Journal ArticleDOI
TL;DR: VLSI implementations have constraints which differ from those of discrete implementations, requiring another look at some of the typical FFT'algorithms in the light of these constraints.
Abstract: In some signal processing applications, it is desirable to build very high performance fast Fourier transform (FFT) processors. To meet the performance requirements, these processors are typically highly pipelined. Until the advent of VLSI, it was not possible to build a single chip which could be used to construct pipeline FFT processors of a reasonable size. However, VLSI implementations have constraints which differ from those of discrete implementations, requiring another look at some of the typical FFT'algorithms in the light of these constraints.

327 citations


"A family of scalable FFT architectu..." refers background in this paper

  • ...A variety of pipeline FFTs have been implemented [6]-[9]....

    [...]

Journal ArticleDOI
TL;DR: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor, which has been fabricated in a standard 0.7 /spl mu/m CMOS process and is fully functional on first-pass silicon.
Abstract: This paper presents an energy-efficient, single-chip, 1024-point fast Fourier transform (FFT) processor. The 460000-transistor design has been fabricated in a standard 0.7 /spl mu/m (L/sub poly/=0.6 /spl mu/m) CMOS process and is fully functional on first-pass silicon. At a supply voltage of 1.1 V, it calculates a 1024-point complex FFT in 330 /spl mu/s while consuming 9.5 mW, resulting in an adjusted energy efficiency more than 16 times greater than the previously most efficient known FFT processor. At 3.3 V, it operates at 173 MHz-which is a clock rate 2.6 times greater than the previously fastest rate.

319 citations

Proceedings ArticleDOI
11 May 1998
TL;DR: By exploiting the spatial regularity of the new algorithm, minimal requirement for both dominant components in VLSI implementation has been achieved: only 4 complex multipliers and 1024 complex-word data memory for the pipelined 1K FFT processor.
Abstract: The design and implementation of a 1024-point pipeline FFT processor is presented. The architecture is based on a new form of FFT, the radix-2/sup 2/ algorithm. By exploiting the spatial regularity of the new algorithm, minimal requirement for both dominant components in VLSI implementation has been achieved: only 4 complex multipliers and 1024 complex-word data memory for the pipelined 1K FFT processor. The chip has been implement in 0.5 /spl mu/m CMOS technology and takes an area of 40 mm/sup 2/. With 3.3 V power supply, it can compute 2/sup n/, n=0, 1, ..., 10 complex point forward and inverse FFT in real time with up to 30 MHz sampling frequency. The SQNR is above 50 dB for white noise input.

243 citations


"A family of scalable FFT architectu..." refers background in this paper

  • ...A variety of pipeline FFTs have been implemented [6]-[9]....

    [...]