The Design and Implementation of FFTW3
MATTEO FRIGO AND
STEVEN G. JOHNSON
Invited Paper
FFTW is an implementation of the discrete Fourier transform
(DFT) that adapts to the hardware in order to maximize perfor-
mance. This paper shows that such an approach can yield an im-
plementation that is competitive with hand-optimized libraries, and
describes the software structure that makes our current FFTW3 ver-
sion flexible and adaptive. We further discuss a new algorithm for
real-data DFTs of prime size, a new way of implementing DFTs by
means of machine-specific single-instruction, multiple-data (SIMD)
instructions, and how a special-purpose compiler can derive opti-
mized implementations of the discrete cosine and sine transforms
automatically from a DFT algorithm.
Keywords—Adaptive software, cosine transform, fast Fourier
transform (FFT), Fourier transform, Hartley transform, I/O tensor.
I. INTRODUCTION
FFTW [1] is a widely used free-software library that
computes the discrete Fourier transform (DFT) and its
various special cases. Its performance is competitive even
with vendor-optimized programs, but unlike these programs,
FFTW is not tuned to a fixed machine. Instead, FFTW
uses a
planner to adapt its algorithms to the hardware in
order to maximize performance. The input to the planner
is a problem, a multidimensional loop of multidimensional
DFTs. The planner applies a set of rules to recursively
decompose a problem into simpler subproblems of the same
type. “Sufficiently simple” problems are solved directly by
optimized, straight-line code that is automatically generated
by a special-purpose compiler. This paper describes the
overall structure of FFTW as well as the specific improve-
ments in FFTW3, our latest version.
Manuscript received November 24, 2003; revised October 15, 2004.
The work of M. Frigo was supported in part by the Defense Advanced
Research Projects Agency (DARPA) under Contract NBCH30390004.
The work of S. G. Johnson was supported in part by the Materials Research
Science and Engineering Center program of the National Science Founda-
tion under Award DMR-9400334.
M. Frigo is with the IBM Austin Research Laboratory, Austin, TX 78758
USA (e-mail: Athena@fftw.org).
S. G. Johnson is with the Massachusetts Institute of Technology, Cam-
bridge, MA 02139 USA.
Digital Object Identifier 10.1109/JPROC.2004.840301
FFTW is fast, but its speed does not come at the expense of flexibility. In fact, FFTW is probably the most flexible DFT library available.

- FFTW is written in portable C and runs well on many architectures and operating systems.
- FFTW computes DFTs in $O(n \log n)$ time for any length $n$. (Most other DFT implementations are either restricted to a subset of sizes or they become $\Theta(n^2)$ for certain values of $n$, for example, when $n$ is prime.)
- FFTW imposes no restrictions on the rank (dimensionality) of multidimensional transforms. (Most other implementations are limited to one-dimensional (1-D), or at most two-dimensional (2-D) and three-dimensional data.)
- FFTW supports multiple and/or strided DFTs; for example, to transform a three-component vector field or a portion of a multidimensional array. (Most implementations support only a single DFT of contiguous data.)
- FFTW supports DFTs of real data, as well as of real symmetric/antisymmetric data [also called the discrete cosine transform (DCT) and the discrete sine transform (DST)].
The interaction of the user with FFTW occurs in two
stages: planning, in which FFTW adapts to the hardware,
and execution, in which FFTW performs useful work for
the user. To compute a DFT, the user first invokes the
FFTW planner, specifying the problem to be solved. The
problem is a data structure that describes the “shape” of the
input data—array sizes and memory layouts—but does not
contain the data itself. In return, the planner yields a plan,
an executable data structure that accepts the input data and
computes the desired DFT. Afterwards, the user can execute
the plan as many times as desired.
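For concreteness, this two-stage interaction looks roughly as follows in FFTW3's standard C interface (a minimal sketch; real code should check that planning succeeded):

```c
#include <fftw3.h>

int main(void)
{
    int n = 1024;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Planning: FFTW adapts to the hardware.  FFTW_MEASURE times many
       candidate plans and keeps the fastest one. */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    /* ... initialize in[0..n-1] ... */

    /* Execution: the plan can be reused as many times as desired. */
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```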
The FFTW planner works by measuring the actual run-
time of many different plans and by selecting the fastest one.
This process is analogous to what a programmer would do
by hand when tuning a program to a fixed machine, but in
FFTW’s case no manual intervention is required. Because
of the repeated performance measurements, however, the
planner tends to be time-consuming. In performance-critical

applications, many transforms of the same size are typically
required and, therefore, a large one-time cost is usually
acceptable. Otherwise, FFTW provides a mode of operation
where the planner quickly returns a reasonable plan that
is not necessarily the fastest.
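In the FFTW3 API, these modes of operation correspond to planner flags: FFTW_MEASURE (or the even more exhaustive FFTW_PATIENT) performs the timed search described above, while FFTW_ESTIMATE is the quick mode that accepts a heuristically chosen plan:

```c
/* Quick planning: skip the timed search and accept a reasonable,
   heuristically chosen plan. */
fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
```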
The planner generates plans according to rules that recursively decompose a problem into simpler subproblems. When the problem becomes sufficiently simple, FFTW produces a plan that calls a fragment of optimized straight-line code that solves the problem directly. These fragments are called codelets in FFTW's lingo. You can envision a codelet as computing a small DFT, but many variations and special cases exist. For example, a codelet might be specialized to compute the DFT of real input (as opposed to complex). FFTW's speed depends, therefore, on two factors. First, the decomposition rules must produce a space of plans that is rich enough to contain good plans for most machines. Second, the codelets must be fast, since they ultimately perform all the real work.
FFTW's codelets are generated automatically by a special-purpose compiler called genfft. Most users do not interact with genfft at all: the standard FFTW distribution contains a set of about 150 pregenerated codelets that cover the most common uses. Users with special needs can use genfft to generate their own codelets. genfft is useful because of the following features. From a high-level mathematical description of a DFT algorithm, genfft derives an optimized implementation automatically. From a complex DFT algorithm, genfft automatically derives an optimized algorithm for the real-input DFT. We take advantage of this property to implement real-data DFTs (Section VII), as well as to exploit machine-specific single-instruction multiple-data (SIMD) instructions (Section IX). Similarly, genfft automatically derives codelets for the DCT and DST (Section VIII). We summarize genfft in Section VI, while a full description appears in [2].
We have produced three major implementations of FFTW, each building on the experience of the previous system. FFTW1 (1997) [3] introduced the idea of generating codelets automatically and of letting a planner search for the best combination of codelets. FFTW2 (1998) incorporated a new version of genfft [2]. genfft did not change much in FFTW3 (2003), but the runtime structure was completely rewritten to allow for a much larger space of plans. This paper describes the main ideas common to all FFTW systems, the runtime structure of FFTW3, and the modifications to genfft since FFTW2.
Previous work on adaptive systems includes [3]–[11]. In particular, SPIRAL [9], [10] is another system focused on optimization of Fourier transforms and related algorithms, but it has distinct differences from FFTW. SPIRAL searches at compile time over a space of mathematically equivalent formulas expressed in a tensor-product language, whereas FFTW searches at runtime over the formalism discussed in Section IV, which explicitly includes low-level details, such as strides and memory alignments, that are not as easily expressed using tensor products. SPIRAL generates machine-dependent code, whereas FFTW's codelets are machine independent. FFTW's search uses dynamic programming [12, Ch. 16], while the SPIRAL project has experimented with a wider range of search strategies, including machine-learning techniques [13].
The remainder of this paper is organized as follows. We begin with a general overview of fast Fourier transforms (FFTs) in Section II. Then, in Section III, we compare the performance of FFTW and other DFT implementations. Section IV describes the space of plans explored by FFTW and how the FFTW planner works. Section V describes our experiences in the practical usage of FFTW. Section VI summarizes how genfft works. Section VII explains how FFTW computes DFTs of real data. Section VIII describes how genfft generates DCT and DST codelets, as well as how FFTW handles these transforms in the general case. Section IX tells how FFTW exploits SIMD instructions.
II. FFT OVERVIEW
The (forward, 1-D) DFT of an array $X$ of $n$ complex numbers is the array $Y$ given by

$$Y[k] = \sum_{j=0}^{n-1} X[j]\, \omega_n^{jk} \qquad (1)$$

where $0 \le k < n$ and $\omega_n = \exp(-2\pi\sqrt{-1}/n)$ is a primitive $n$th root of unity. Implemented directly, (1) would require $\Theta(n^2)$ operations; FFTs are $O(n \log n)$ algorithms to compute the same result. The most important FFT (and the one primarily used in FFTW) is known as the Cooley–Tukey algorithm, after the two authors who rediscovered and popularized it in 1965 [14], although it had been previously known as early as 1805 by Gauss as well as by later reinventors [15]. The basic idea behind this FFT is that a DFT of a composite size $n = n_1 n_2$ can be reexpressed in terms of smaller DFTs of sizes $n_1$ and $n_2$: essentially, as a 2-D DFT of size $n_1 \times n_2$ where the output is transposed. The choices of factorizations of $n$, combined with the many different ways to implement the data reorderings of the transpositions, have led to numerous implementation strategies for the Cooley–Tukey FFT, with many variants distinguished by their own names [16], [17]. FFTW implements a space of many such variants, as described later, but here we derive the basic algorithm, identify its key features, and outline some important historical variations and their relation to FFTW.
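As a baseline for what the FFT improves upon, a direct $\Theta(n^2)$ evaluation of (1) takes only a few lines of C. This is an illustrative sketch, not FFTW code:

```c
#include <complex.h>
#include <math.h>

/* Direct evaluation of (1): Y[k] = sum_j X[j] * omega_n^(j*k),
   with omega_n = exp(-2*pi*i/n).  Theta(n^2) operations. */
void dft_direct(int n, const double complex *X, double complex *Y)
{
    for (int k = 0; k < n; ++k) {
        double complex sum = 0;
        for (int j = 0; j < n; ++j)
            sum += X[j] * cexp(-2 * M_PI * I * ((double)j * k) / n);
        Y[k] = sum;
    }
}
```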
The Cooley–Tukey algorithm can be derived as follows. If $n$ can be factored into $n = n_1 n_2$, (1) can be rewritten by letting $j = j_1 n_2 + j_2$ and $k = k_1 + k_2 n_1$. We then have

$$Y[k_1 + k_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{j_1 k_1} \right) \omega_n^{j_2 k_1} \right] \omega_{n_2}^{j_2 k_2}. \qquad (2)$$
Thus, the algorithm computes $n_2$ DFTs of size $n_1$ (the inner sum), multiplies the result by the so-called twiddle factors $\omega_n^{j_2 k_1}$, and finally computes $n_1$ DFTs of size $n_2$ (the outer sum). This decomposition is then continued recursively. The literature uses the term radix to describe an $n_1$ or $n_2$ that is bounded (often constant); the small DFT of the radix is traditionally called a butterfly.
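For example, with $n_2 = 2$ (radix-2 decimation in time, discussed below), (2) reduces to the textbook recursion over even- and odd-indexed samples. A minimal recursive sketch in C, for illustration only (FFTW's actual plans are far more general):

```c
#include <complex.h>
#include <math.h>

/* Recursive radix-2 DIT FFT of (1), for power-of-two n only.
   `in` is read with stride `is`; `out` is contiguous. */
static void fft2_rec(int n, const double complex *in, int is,
                     double complex *out)
{
    if (n == 1) {             /* base case: a size-1 DFT is a copy */
        out[0] = in[0];
        return;
    }
    int h = n / 2;
    fft2_rec(h, in, 2 * is, out);          /* DFT of even samples */
    fft2_rec(h, in + is, 2 * is, out + h); /* DFT of odd samples  */
    for (int k = 0; k < h; ++k) {
        /* twiddle factor omega_n^k times the odd-sample DFT */
        double complex t = cexp(-2 * M_PI * I * k / n) * out[k + h];
        double complex e = out[k];
        out[k]     = e + t;   /* butterfly */
        out[k + h] = e - t;
    }
}
```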
Many well-known variations are distinguished by the radix alone. A decimation in time (DIT) algorithm uses $n_2$ as the radix, while a decimation in frequency (DIF) algorithm uses $n_1$ as the radix. If multiple radices are used, e.g., for $n$ composite but not a prime power, the algorithm is called mixed radix. A peculiar blending of radix 2 and 4 is called split radix, which was proposed to minimize the count of arithmetic operations [16]. (Unfortunately, as we argue in this paper, minimal-arithmetic, fixed-factorization implementations tend to no longer be optimal on recent computer architectures.) FFTW implements both DIT and DIF, is mixed-radix with radices that are adapted to the hardware, and often uses much larger radices (radix-32 is typical) than were once common. (On the other end of the scale, a "radix" of roughly $\sqrt{n}$ has been called a four-step FFT [18], and we have found that one step of such a radix can be useful for large sizes in FFTW; see Section IV-D1.)
A key difficulty in implementing the Cooley–Tukey FFT is that the $n_1$ dimension corresponds to discontiguous inputs $j_1$ in $X$ but contiguous outputs $k_1$ in $Y$, and vice versa for $n_2$. This is a matrix transpose for a single decomposition stage, and the composition of all such transpositions is a (mixed-base) digit-reversal permutation (or bit-reversal, for radix-2). The resulting necessity of discontiguous memory access and data reordering hinders efficient use of hierarchical memory architectures (e.g., caches), so that the optimal execution order of an FFT for given hardware is nonobvious, and various approaches have been proposed.
One ordering distinction is between recursion and iteration. As expressed above, the Cooley–Tukey algorithm could be thought of as defining a tree of smaller and smaller DFTs; for example, a textbook radix-2 algorithm would divide size $n$ into two transforms of size $n/2$, which are divided into four transforms of size $n/4$, and so on until a base case is reached (in principle, size 1). This might naturally suggest a recursive implementation in which the tree is traversed "depth-first": one size-$n/2$ transform is solved completely before processing the other one, and so on. However, most traditional FFT implementations are nonrecursive (with rare exceptions [19]) and traverse the tree "breadth-first" [17]: in the radix-2 example, they would perform $n$ (trivial) size-1 transforms, then $n/2$ combinations into size-2 transforms, then $n/4$ combinations into size-4 transforms, and so on, thus making $\log_2 n$ passes over the whole array. In contrast, as we discuss in Section IV-D1, FFTW3 employs an explicitly recursive strategy that encompasses both depth-first and breadth-first styles, favoring the former, since it has some theoretical and practical advantages.
A second ordering distinction lies in how the digit reversal is performed. The classic approach is a single, separate digit-reversal pass following or preceding the arithmetic computations. Although this pass requires only $O(n)$ time [20], it can still be nonnegligible, especially if the data is out of cache; moreover, it neglects the possibility that data reordering during the transform may improve memory locality. Perhaps the oldest alternative is the Stockham autosort FFT [17], [21], which transforms back and forth between two arrays with each butterfly, transposing one digit each time, and was popular to improve contiguity of access for vector computers [22]. Alternatively, an explicitly recursive style, as in FFTW, performs the digit-reversal implicitly at the leaves of its computation when operating out-of-place (Section IV-D1). To operate in-place with $O(1)$ scratch storage, one can interleave small matrix transpositions with the butterflies [23]–[26], and a related strategy in FFTW is described in Section IV-D3. FFTW can also perform intermediate reorderings that blend its in-place and out-of-place strategies, as described in Section V-C.
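For the radix-2 case, the separate digit-reversal pass mentioned above is the familiar bit-reversal permutation; a standard in-place implementation (illustrative only, not FFTW's code) is:

```c
#include <complex.h>

/* In-place bit-reversal permutation of a[0..n-1], n a power of two.
   Element i is exchanged with the index obtained by reversing the
   log2(n) bits of i. */
void bit_reverse(int n, double complex *a)
{
    for (int i = 1, j = 0; i < n; ++i) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)  /* bit-reversed increment of j: */
            j ^= bit;               /* propagate the "carry"        */
        j ^= bit;
        if (i < j) {                /* swap each pair only once */
            double complex t = a[i];
            a[i] = a[j];
            a[j] = t;
        }
    }
}
```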
Finally, we should mention that there are many FFTs entirely distinct from Cooley–Tukey. Three notable such algorithms are the prime-factor algorithm for $\gcd(n_1, n_2) = 1$ [27, p. 619], along with Rader's [28] and Bluestein's [27], [29] algorithms for prime $n$. FFTW implements the first two in its codelet generator for hard-coded $n$ (Section VI) and the latter two for general prime $n$. A new generalization of Rader's algorithm for prime-size real-data transforms is also discussed in Section VII. FFTW does not employ the Winograd FFT [30], which minimizes the number of multiplications at the expense of a large number of additions. (This tradeoff is not beneficial on current processors that have specialized hardware multipliers.)
III. BENCHMARK RESULTS
We have performed extensive benchmarks of FFTW's performance, along with that of over 50 other FFT implementations, on most modern general-purpose processors, comparing complex and real-data transforms in one to three dimensions and for both single and double precisions. We generally found FFTW to be superior to other publicly available codes and comparable to vendor-tuned libraries. The complete results can be found in [1]. In this section, we present data for a small sampling of representative codes for complex-data 1-D transforms on a few machines.
We show the benchmark results as a series of graphs. Speed is measured in "MFLOPS," defined for a transform of size $n$ as $(5n \log_2 n)/t$, where $t$ is the time in microseconds for one transform, not including one-time initialization costs. This count of floating-point operations is based on the asymptotic number of operations for the radix-2 Cooley–Tukey algorithm (see [17, p. 45]), although the actual count is lower for most DFT implementations. The MFLOPS measure should, thus, be viewed as a convenient scaling factor rather than as an absolute indicator of CPU performance.
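For example, under this convention a size-1024 transform that takes $t = 10$ microseconds is reported as $(5 \cdot 1024 \cdot \log_2 1024)/10 = (5 \cdot 1024 \cdot 10)/10 = 5120$ MFLOPS.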
Fig. 1 shows the benchmark results for power-of-two sizes, in double precision, on a 2.8-GHz Pentium IV with the Intel compilers; in Fig. 2 are results for selected nonpower-of-two sizes of the form $2^a 3^b 5^c 7^d$ on the same machine; in Fig. 3 are the single-precision power-of-two results. Note that only the FFTW, MKL (Intel), IPPS (Intel), and Takahashi libraries on this machine were specifically designed to exploit the SSE/SSE2 SIMD instructions (see Section IX); for comparison, we also include FFTW (out-of-place) with SIMD disabled (fftw, no simd). In Fig. 4 are the power-of-two double-precision results on a 2-GHz PowerPC 970 (G5) with the Apple gcc 3.3 compiler. In Fig. 5 are the power-of-two double-precision results on an 833-MHz Alpha EV6 with the Compaq compilers, and in Fig. 6 are the single-precision results on the same machine.

Fig. 1. Comparison of double-precision 1-D complex DFTs, power-of-two sizes, on a 2.8-GHz Pentium IV. Intel C/Fortran compilers v. 7.1, optimization flags -O3 -xW (maximum optimization, enable automatic vectorizer).

Fig. 2. Comparison of double-precision 1-D complex DFTs, nonpower-of-two sizes, on a 2.8-GHz Pentium IV. Compiler and flags as in Fig. 1.
In addition to FFTW v. 3.0.1, the other codes benchmarked are as follows (some for only one precision or machine): arprec, "four-step" FFT implementation [18] (from the C++ ARPREC library, 2002); cxml, the vendor-tuned Compaq Extended Math Library on Alpha; fftpack, the Fortran library from [22]; green, free code by J. Green (C, 1998); mkl, the Intel Math Kernel Library v. 6.1 (DFTI interface) on the Pentium IV; ipps, the Intel Integrated Performance Primitives, Signal Processing, v. 3.0 on the Pentium IV; numerical recipes, the C four1 routine from [31]; ooura, a free code by T. Ooura (C and Fortran, 2001); singleton, a Fortran FFT [32]; sorensen, a split-radix FFT [33]; takahashi, the FFTE library v. 3.2 by D. Takahashi (Fortran, 2004) [34]; and vdsp, the Apple vDSP library on the G5.

Fig. 3. Comparison of single-precision 1-D complex DFTs, power-of-two sizes, on a 2.8-GHz Pentium IV. Compiler and flags as in Fig. 1. Note that fftpack, which was originally designed for vectorizing compilers (or vice versa), benefits somewhat from the automatic vectorization in this case.

Fig. 4. Comparison of double-precision 1-D complex DFTs, power-of-two sizes, on a 2-GHz PowerPC 970 (G5). Apple gcc v. 3.3, g77 v. 3.4 20031105 (experimental). Optimization flags -O3 -mcpu=970 -mtune=970. The Apple vDSP library uses separate real/imaginary arrays to store complex numbers and, therefore, its performance is not strictly comparable with the other codes, which use an array of real/imaginary pairs.
We now offer some remarks to aid the interpretation of the performance results. The performance of all routines drops for large problems, reflecting the cache hierarchy of the machine. Performance is low for small problems as well, because of the overhead of calling a routine to do little work. FFTW is the only library that exploits SIMD instructions for nonpower-of-two sizes, which gives it an advantage on the Pentium IV for this case. IPPS is limited to in-place contiguous inputs, whereas MKL and FFTW allow for strided input. Assuming contiguous input gives some speed advantage on a machine such as the Pentium IV, where index computation is somewhat slow.

Fig. 5. Comparison of double-precision 1-D complex DFTs, power-of-two sizes, on an 833-MHz Alpha EV6. Compaq C V6.2-505, Compaq Fortran X1.0.1-1155. Optimization flags: -newc -w0 -O5 -ansi_alias -ansi_args -fp_reorder -tune host -arch host.

Fig. 6. Comparison of single-precision 1-D complex DFTs, power-of-two sizes, on an 833-MHz Alpha EV6. Compilers and flags as in Fig. 5.
IV. STRUCTURE OF FFTW3
In this section, we discuss in detail how FFTW works. Specifically, we discuss how FFTW represents the problem to be solved (Sections IV-A and IV-B), the set of plans that the planner considers during its search (Sections IV-C and IV-D), and the internal operation of the planner (Section IV-E). For simplicity, this section considers complex DFTs only; we discuss real DFTs in Section VII.

Of these components, the representation of the problem to be solved is a critical choice. Indeed, we view our definition of a problem as a fundamental contribution of this paper. Because only problems that can be expressed can be solved, the representation of a problem determines an upper bound to the space of plans that the planner can explore; therefore, it ultimately constrains FFTW's performance.
A. Representation of Problems in FFTW
DFT problems in FFTW are expressed in terms of structures called I/O tensors, which in turn are described in terms of ancillary structures called I/O dimensions. (I/O tensors are unrelated to the tensor-product notation of SPIRAL.) In this section, we define these terms precisely.
An I/O dimension $d$ is a triple $d = (n, \iota, o)$, where $n$ is a nonnegative integer called the length, $\iota$ is an integer called the input stride, and $o$ is an integer called the output stride. An I/O tensor $t = \{d_1, d_2, \ldots, d_\rho\}$ is a set of I/O dimensions. The nonnegative integer $\rho = |t|$ is called the rank of the I/O tensor. A DFT problem, denoted by $\mathrm{dft}(N, V, I, O)$, consists of two I/O tensors $N$ and $V$ and two pointers $I$ and $O$. Roughly speaking, this describes $|V|$ nested loops of $|N|$-dimensional DFTs with input data starting at memory location $I$ and output data starting at $O$. We now give a more precise definition by induction on $|V|$, yielding a set of assignments from input to output. Conceptually, all of the right-hand sides of these assignments are evaluated before writing their values to the left-hand sides, a fiction that defines the behavior precisely, e.g., when $I = O$. (See also the examples in Section IV-B.)
- $\mathrm{dft}(N, \emptyset, I, O)$, with $N = \{(n_1, \iota_1, o_1), \ldots, (n_\rho, \iota_\rho, o_\rho)\}$, is the $\rho$-dimensional DFT, defined as follows: for all output indexes $0 \le k_\nu < n_\nu$, yield the assignment

$$O\left[\sum_{\nu=1}^{\rho} k_\nu o_\nu\right] := \sum_{j_1, \ldots, j_\rho} I\left[\sum_{\nu=1}^{\rho} j_\nu \iota_\nu\right] \prod_{\nu=1}^{\rho} \omega_{n_\nu}^{j_\nu k_\nu}$$

where each input index $j_\nu$ is summed from 0 to $n_\nu - 1$, $\omega_{n_\nu}$ is a primitive $n_\nu$th root of unity as in Section II, and $I[s]$ (respectively, $O[s]$) denotes the complex number at memory location $I + s$ (respectively, $O + s$), with pointer arithmetic in units of complex numbers. By convention, we define the zero-dimensional problem $\mathrm{dft}(\emptyset, \emptyset, I, O)$ to yield the assignment $O[0] := I[0]$.
- $\mathrm{dft}(N, \{(n, \iota, o)\} \cup V, I, O)$ is recursively defined as a "loop" of $n$ problems: for all $0 \le j < n$, yield all assignments in $\mathrm{dft}(N, V, I + j \cdot \iota, O + j \cdot o)$.

If two assignments write to the same memory location, the DFT problem is undefined. Such nonsensical problems are not normally encountered in practice, however, as discussed in Section IV-B.
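The induction above translates almost directly into code. The following reference interpreter is an illustrative sketch only: the iodim type and function names here are hypothetical, it evaluates the sums directly rather than via an FFT, and it is valid only for out-of-place problems, since it does not simulate the evaluate-all-right-hand-sides-first fiction:

```c
#include <complex.h>
#include <math.h>

typedef struct { int n, is, os; } iodim;  /* (length, input stride, output stride) */

/* Inner sums of the base case: for a fixed output multi-index k[],
   accumulate I[sum_nu j_nu*iota_nu] * prod_nu omega_{n_nu}^{j_nu k_nu},
   with omega_m = exp(-2*pi*i/m) as in Section II. */
static double complex dft_sum(int rank, const iodim *N, const int *k,
                              const double complex *in)
{
    if (rank == 0)
        return in[0];
    double complex s = 0;
    for (int j = 0; j < N[0].n; ++j)
        s += cexp(-2 * M_PI * I * (double)(j * k[0]) / N[0].n)
             * dft_sum(rank - 1, N + 1, k + 1, in + j * N[0].is);
    return s;
}

/* Base case dft(N, {}, I, O): loop over every output multi-index k[]
   and write O[sum_nu k_nu*o_nu]. */
static void dft_base(int rank, const iodim *N, int *k, int depth,
                     const double complex *in, double complex *out)
{
    if (depth == rank) {
        *out = dft_sum(rank, N, k, in);
        return;
    }
    for (int kk = 0; kk < N[depth].n; ++kk) {
        k[depth] = kk;
        dft_base(rank, N, k, depth + 1, in, out + kk * N[depth].os);
    }
}

/* dft(N, V, I, O): peel off one vector loop at a time, exactly as in
   the inductive definition, then perform the rank-|N| DFT. */
void dft_problem(int rank, const iodim *N, int vrank, const iodim *V,
                 const double complex *in, double complex *out)
{
    if (vrank > 0) {
        for (int j = 0; j < V[0].n; ++j)  /* a "loop" of n subproblems */
            dft_problem(rank, N, vrank - 1, V + 1,
                        in + j * V[0].is, out + j * V[0].os);
        return;
    }
    int k[8] = {0};  /* this sketch assumes rank <= 8 */
    dft_base(rank, N, k, 0, in, out);
}
```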
One property of this definition is the fact that an I/O tensor $t$ is equivalent to $t \cup \{(1, \iota, o)\}$. That is, length-1 DFT dimensions and length-1 loops can be eliminated. FFTW, therefore, internally canonicalizes I/O tensors by removing all I/O dimensions where $n = 1$. (Similarly, all I/O tensors of the form $\{(1, \iota, o)\}$ are equivalent.)
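This formalism maps closely onto FFTW3's "guru" planner interface, whose fftw_iodim structure is precisely an I/O dimension $(n, \iota, o)$. As a sketch (the helper function here is hypothetical), one could plan a single size-$n$ DFT wrapped in a length-3 vector loop, e.g., to transform each component of a vector field stored as interleaved complex triples:

```c
#include <fftw3.h>

/* Plan one DFT per component of an n-point vector field stored as
   interleaved complex (x, y, z) triples: the size N is the single
   I/O dimension (n, 3, 3) and the vector size V is (3, 1, 1). */
fftw_plan plan_vector_field(int n, fftw_complex *in, fftw_complex *out)
{
    fftw_iodim dims[1]    = { { n, 3, 3 } };  /* N: length n, strides 3 */
    fftw_iodim howmany[1] = { { 3, 1, 1 } };  /* V: loop over 3 components */
    return fftw_plan_guru_dft(1, dims, 1, howmany,
                              in, out, FFTW_FORWARD, FFTW_ESTIMATE);
}
```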
We call $N$ the size of the problem. The rank of a problem is defined to be the rank of its size (i.e., the dimensionality of the DFT). Similarly, we call $V$ the vector size of the problem, and the vector rank of a problem is correspondingly defined to be the rank of its vector size. One unusual feature of FFTW is that the vector rank is arbitrary: FFTW is not restricted to vector sizes of rank 1. Intuitively, the vector size can be interpreted as a set of "loops" wrapped around a single DFT, and we, therefore, refer to a single I/O dimension of $V$ as a vector loop. (Alternatively, one can view the problem as defining

References

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA: MIT Press.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing. Cambridge, U.K.: Cambridge Univ. Press.

J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297–301, 1965.

R. L. Rivest, "The MD5 message-digest algorithm," Internet RFC 1321, 1992.