The Design and Implementation of FFTW3
MATTEO FRIGO AND
STEVEN G. JOHNSON
Invited Paper
FFTW is an implementation of the discrete Fourier transform
(DFT) that adapts to the hardware in order to maximize perfor-
mance. This paper shows that such an approach can yield an im-
plementation that is competitive with hand-optimized libraries, and
describes the software structure that makes our current FFTW3 ver-
sion flexible and adaptive. We further discuss a new algorithm for
real-data DFTs of prime size, a new way of implementing DFTs by
means of machine-specific single-instruction, multiple-data (SIMD)
instructions, and how a special-purpose compiler can derive opti-
mized implementations of the discrete cosine and sine transforms
automatically from a DFT algorithm.
Keywords—Adaptive software, cosine transform, fast Fourier
transform (FFT), Fourier transform, Hartley transform, I/O tensor.
I. INTRODUCTION
FFTW [1] is a widely used free-software library that
computes the discrete Fourier transform (DFT) and its
various special cases. Its performance is competitive even
with vendor-optimized programs, but unlike these programs,
FFTW is not tuned to a fixed machine. Instead, FFTW
uses a
planner to adapt its algorithms to the hardware in
order to maximize performance. The input to the planner
is a problem, a multidimensional loop of multidimensional
DFTs. The planner applies a set of rules to recursively
decompose a problem into simpler subproblems of the same
type. “Sufficiently simple” problems are solved directly by
optimized, straight-line code that is automatically generated
by a special-purpose compiler. This paper describes the
overall structure of FFTW as well as the specific improve-
ments in FFTW3, our latest version.
Manuscript received November 24, 2003; revised October 15, 2004.
The work of M. Frigo was supported in part by the Defense Advanced
Research Projects Agency (DARPA) under Contract NBCH30390004.
The work of S. G. Johnson was supported in part by the Materials Research
Science and Engineering Center program of the National Science Founda-
tion under Award DMR-9400334.
M. Frigo is with the IBM Austin Research Laboratory, Austin, TX 78758
USA (e-mail: Athena@fftw.org).
S. G. Johnson is with the Massachusetts Institute of Technology, Cam-
bridge, MA 02139 USA.
Digital Object Identifier 10.1109/JPROC.2004.840301
FFTW is fast, but its speed does not come at the expense of flexibility. In fact, FFTW is probably the most flexible DFT library available.

- FFTW is written in portable C and runs well on many architectures and operating systems.
- FFTW computes DFTs in $O(n \log n)$ time for any length $n$. (Most other DFT implementations are either restricted to a subset of sizes or they become $\Theta(n^2)$ for certain values of $n$, for example, when $n$ is prime.)
- FFTW imposes no restrictions on the rank (dimensionality) of multidimensional transforms. (Most other implementations are limited to one-dimensional (1-D), or at most two-dimensional (2-D) and three-dimensional data.)
- FFTW supports multiple and/or strided DFTs; for example, to transform a three-component vector field or a portion of a multidimensional array. (Most implementations support only a single DFT of contiguous data.)
- FFTW supports DFTs of real data, as well as of real symmetric/antisymmetric data [also called the discrete cosine transform (DCT) and the discrete sine transform (DST)].
The interaction of the user with FFTW occurs in two
stages: planning, in which FFTW adapts to the hardware,
and execution, in which FFTW performs useful work for
the user. To compute a DFT, the user first invokes the
FFTW planner, specifying the problem to be solved. The
problem is a data structure that describes the “shape” of the
input data—array sizes and memory layouts—but does not
contain the data itself. In return, the planner yields a plan,
an executable data structure that accepts the input data and
computes the desired DFT. Afterwards, the user can execute
the plan as many times as desired.
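For concreteness, this two-stage interaction looks roughly as follows in FFTW3's standard C interface (a minimal sketch; real code should check that planning succeeded):

```c
#include <fftw3.h>

int main(void)
{
    int n = 1024;
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);

    /* Planning: FFTW adapts to the hardware.  FFTW_MEASURE times many
       candidate plans and keeps the fastest one. */
    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_MEASURE);

    /* ... initialize in[0..n-1] ... */

    /* Execution: the plan can be reused as many times as desired. */
    fftw_execute(p);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```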
The FFTW planner works by measuring the actual run-
time of many different plans and by selecting the fastest one.
This process is analogous to what a programmer would do
by hand when tuning a program to a fixed machine, but in
FFTW’s case no manual intervention is required. Because
of the repeated performance measurements, however, the
planner tends to be time-consuming. In performance-critical

applications, many transforms of the same size are typically
required and, therefore, a large one-time cost is usually
acceptable. Otherwise, FFTW provides a mode of operation
where the planner quickly returns a reasonable plan that
is not necessarily the fastest.
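In the FFTW3 API, these modes of operation correspond to planner flags: FFTW_MEASURE (or the even more exhaustive FFTW_PATIENT) performs the timed search described above, while FFTW_ESTIMATE is the quick mode that accepts a heuristically chosen plan:

```c
/* Quick planning: skip the timed search and accept a reasonable,
   heuristically chosen plan. */
fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);
```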
The planner generates plans according to rules that recursively decompose a problem into simpler subproblems. When the problem becomes sufficiently simple, FFTW produces a plan that calls a fragment of optimized straight-line code that solves the problem directly. These fragments are called codelets in FFTW's lingo. You can envision a codelet as computing a small DFT, but many variations and special cases exist. For example, a codelet might be specialized to compute the DFT of real input (as opposed to complex). FFTW's speed depends, therefore, on two factors. First, the decomposition rules must produce a space of plans that is rich enough to contain good plans for most machines. Second, the codelets must be fast, since they ultimately perform all the real work.
FFTW's codelets are generated automatically by a special-purpose compiler called genfft. Most users do not interact with genfft at all: the standard FFTW distribution contains a set of about 150 pregenerated codelets that cover the most common uses. Users with special needs can use genfft to generate their own codelets. genfft is useful because of the following features. From a high-level mathematical description of a DFT algorithm, genfft derives an optimized implementation automatically. From a complex DFT algorithm, genfft automatically derives an optimized algorithm for the real-input DFT. We take advantage of this property to implement real-data DFTs (Section VII), as well as to exploit machine-specific single-instruction multiple-data (SIMD) instructions (Section IX). Similarly, genfft automatically derives codelets for the DCT and DST (Section VIII). We summarize genfft in Section VI, while a full description appears in [2].
We have produced three major implementations of FFTW, each building on the experience of the previous system. FFTW1 (1997) [3] introduced the idea of generating codelets automatically and of letting a planner search for the best combination of codelets. FFTW2 (1998) incorporated a new version of genfft [2]. genfft did not change much in FFTW3 (2003), but the runtime structure was completely rewritten to allow for a much larger space of plans. This paper describes the main ideas common to all FFTW systems, the runtime structure of FFTW3, and the modifications to genfft since FFTW2.
Previous work on adaptive systems includes [3]–[11]. In particular, SPIRAL [9], [10] is another system focused on optimization of Fourier transforms and related algorithms, but it has distinct differences from FFTW. SPIRAL searches at compile time over a space of mathematically equivalent formulas expressed in a tensor-product language, whereas FFTW searches at runtime over the formalism discussed in Section IV, which explicitly includes low-level details, such as strides and memory alignments, that are not as easily expressed using tensor products. SPIRAL generates machine-dependent code, whereas FFTW's codelets are machine independent. FFTW's search uses dynamic programming [12, Ch. 16], while the SPIRAL project has experimented with a wider range of search strategies, including machine-learning techniques [13].
The remainder of this paper is organized as follows. We begin with a general overview of fast Fourier transforms (FFTs) in Section II. Then, in Section III, we compare the performance of FFTW and other DFT implementations. Section IV describes the space of plans explored by FFTW and how the FFTW planner works. Section V describes our experiences in the practical usage of FFTW. Section VI summarizes how genfft works. Section VII explains how FFTW computes DFTs of real data. Section VIII describes how genfft generates DCT and DST codelets, as well as how FFTW handles these transforms in the general case. Section IX tells how FFTW exploits SIMD instructions.
II. FFT OVERVIEW
The (forward, 1-D) DFT of an array $X$ of $n$ complex numbers is the array $Y$ given by

$$Y[k] = \sum_{j=0}^{n-1} X[j]\, \omega_n^{jk} \qquad (1)$$

where $0 \le k < n$ and $\omega_n = \exp(-2\pi\sqrt{-1}/n)$ is a primitive $n$th root of unity. Implemented directly, (1) would require $\Theta(n^2)$ operations; FFTs are $O(n \log n)$ algorithms to compute the same result. The most important FFT (and the one primarily used in FFTW) is known as the Cooley–Tukey algorithm, after the two authors who rediscovered and popularized it in 1965 [14], although it had been previously known as early as 1805 by Gauss as well as by later reinventors [15]. The basic idea behind this FFT is that a DFT of a composite size $n = n_1 n_2$ can be reexpressed in terms of smaller DFTs of sizes $n_1$ and $n_2$: essentially, as a 2-D DFT of size $n_1 \times n_2$ where the output is transposed. The choices of factorizations of $n$, combined with the many different ways to implement the data reorderings of the transpositions, have led to numerous implementation strategies for the Cooley–Tukey FFT, with many variants distinguished by their own names [16], [17]. FFTW implements a space of many such variants, as described later, but here we derive the basic algorithm, identify its key features, and outline some important historical variations and their relation to FFTW.
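As a baseline for what the FFT improves upon, a direct $\Theta(n^2)$ evaluation of (1) takes only a few lines of C. This is an illustrative sketch, not FFTW code:

```c
#include <complex.h>
#include <math.h>

/* Direct evaluation of (1): Y[k] = sum_j X[j] * omega_n^(j*k),
   with omega_n = exp(-2*pi*i/n).  Theta(n^2) operations. */
void dft_direct(int n, const double complex *X, double complex *Y)
{
    for (int k = 0; k < n; ++k) {
        double complex sum = 0;
        for (int j = 0; j < n; ++j)
            sum += X[j] * cexp(-2 * M_PI * I * ((double)j * k) / n);
        Y[k] = sum;
    }
}
```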
The Cooley–Tukey algorithm can be derived as follows. If $n$ can be factored into $n = n_1 n_2$, (1) can be rewritten by letting $j = j_1 n_2 + j_2$ and $k = k_1 + k_2 n_1$. We then have

$$Y[k_1 + k_2 n_1] = \sum_{j_2=0}^{n_2-1} \left[ \left( \sum_{j_1=0}^{n_1-1} X[j_1 n_2 + j_2]\, \omega_{n_1}^{j_1 k_1} \right) \omega_n^{j_2 k_1} \right] \omega_{n_2}^{j_2 k_2}. \qquad (2)$$
Thus, the algorithm computes $n_2$ DFTs of size $n_1$ (the inner sum), multiplies the result by the so-called twiddle factors $\omega_n^{j_2 k_1}$, and finally computes $n_1$ DFTs of size $n_2$ (the outer sum). This decomposition is then continued recursively. The literature uses the term radix to describe an $n_1$ or $n_2$ that is bounded (often constant); the small DFT of the radix is traditionally called a butterfly.
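For example, with $n_2 = 2$ (radix-2 decimation in time, discussed below), (2) reduces to the textbook recursion over even- and odd-indexed samples. A minimal recursive sketch in C, for illustration only (FFTW's actual plans are far more general):

```c
#include <complex.h>
#include <math.h>

/* Recursive radix-2 DIT FFT of (1), for power-of-two n only.
   `in` is read with stride `is`; `out` is contiguous. */
static void fft2_rec(int n, const double complex *in, int is,
                     double complex *out)
{
    if (n == 1) {             /* base case: a size-1 DFT is a copy */
        out[0] = in[0];
        return;
    }
    int h = n / 2;
    fft2_rec(h, in, 2 * is, out);          /* DFT of even samples */
    fft2_rec(h, in + is, 2 * is, out + h); /* DFT of odd samples  */
    for (int k = 0; k < h; ++k) {
        /* twiddle factor omega_n^k times the odd-sample DFT */
        double complex t = cexp(-2 * M_PI * I * k / n) * out[k + h];
        double complex e = out[k];
        out[k]     = e + t;   /* butterfly */
        out[k + h] = e - t;
    }
}
```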
Many well-known variations are distinguished by the radix alone. A decimation in time (DIT) algorithm uses $n_2$ as the radix, while a decimation in frequency (DIF) algorithm uses $n_1$ as the radix. If multiple radices are used, e.g., for $n$ composite but not a prime power, the algorithm is called mixed radix. A peculiar blending of radix 2 and 4 is called split radix, which was proposed to minimize the count of arithmetic operations [16]. (Unfortunately, as we argue in this paper, minimal-arithmetic, fixed-factorization implementations tend to no longer be optimal on recent computer architectures.) FFTW implements both DIT and DIF, is mixed-radix with radices that are adapted to the hardware, and often uses much larger radices (radix-32 is typical) than were once common. (On the other end of the scale, a "radix" of roughly $\sqrt{n}$ has been called a four-step FFT [18], and we have found that one step of such a radix can be useful for large sizes in FFTW; see Section IV-D1.)
A key difficulty in implementing the Cooley–Tukey FFT is that the $n_1$ dimension corresponds to discontiguous inputs $j_1$ in $X$ but contiguous outputs $k_1$ in $Y$, and vice versa for $n_2$. This is a matrix transpose for a single decomposition stage, and the composition of all such transpositions is a (mixed-base) digit-reversal permutation (or bit-reversal, for radix-2). The resulting necessity of discontiguous memory access and data reordering hinders efficient use of hierarchical memory architectures (e.g., caches), so that the optimal execution order of an FFT for given hardware is nonobvious, and various approaches have been proposed.
One ordering distinction is between recursion and iteration. As expressed above, the Cooley–Tukey algorithm could be thought of as defining a tree of smaller and smaller DFTs; for example, a textbook radix-2 algorithm would divide size $n$ into two transforms of size $n/2$, which are divided into four transforms of size $n/4$, and so on until a base case is reached (in principle, size 1). This might naturally suggest a recursive implementation in which the tree is traversed "depth-first": one size-$n/2$ transform is solved completely before processing the other one, and so on. However, most traditional FFT implementations are nonrecursive (with rare exceptions [19]) and traverse the tree "breadth-first" [17]: in the radix-2 example, they would perform $n$ (trivial) size-1 transforms, then $n/2$ combinations into size-2 transforms, then $n/4$ combinations into size-4 transforms, and so on, thus making $\log_2 n$ passes over the whole array. In contrast, as we discuss in Section IV-D1, FFTW3 employs an explicitly recursive strategy that encompasses both depth-first and breadth-first styles, favoring the former, since it has some theoretical and practical advantages.
A second ordering distinction lies in how the digit reversal is performed. The classic approach is a single, separate digit-reversal pass following or preceding the arithmetic computations. Although this pass requires only $O(n)$ time [20], it can still be nonnegligible, especially if the data is out of cache; moreover, it neglects the possibility that data reordering during the transform may improve memory locality. Perhaps the oldest alternative is the Stockham autosort FFT [17], [21], which transforms back and forth between two arrays with each butterfly, transposing one digit each time, and was popular to improve contiguity of access for vector computers [22]. Alternatively, an explicitly recursive style, as in FFTW, performs the digit-reversal implicitly at the leaves of its computation when operating out-of-place (Section IV-D1). To operate in-place with $O(1)$ scratch storage, one can interleave small matrix transpositions with the butterflies [23]–[26], and a related strategy in FFTW is described in Section IV-D3. FFTW can also perform intermediate reorderings that blend its in-place and out-of-place strategies, as described in Section V-C.
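For the radix-2 case, the separate digit-reversal pass mentioned above is the familiar bit-reversal permutation; a standard in-place implementation (illustrative only, not FFTW's code) is:

```c
#include <complex.h>

/* In-place bit-reversal permutation of a[0..n-1], n a power of two.
   Element i is exchanged with the index obtained by reversing the
   log2(n) bits of i. */
void bit_reverse(int n, double complex *a)
{
    for (int i = 1, j = 0; i < n; ++i) {
        int bit = n >> 1;
        for (; j & bit; bit >>= 1)  /* bit-reversed increment of j: */
            j ^= bit;               /* propagate the "carry"        */
        j ^= bit;
        if (i < j) {                /* swap each pair only once */
            double complex t = a[i];
            a[i] = a[j];
            a[j] = t;
        }
    }
}
```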
Finally, we should mention that there are many FFTs entirely distinct from Cooley–Tukey. Three notable such algorithms are the prime-factor algorithm for $\gcd(n_1, n_2) = 1$ [27, p. 619], along with Rader's [28] and Bluestein's [27], [29] algorithms for prime $n$. FFTW implements the first two in its codelet generator for hard-coded $n$ (Section VI) and the latter two for general prime $n$. A new generalization of Rader's algorithm for prime-size real-data transforms is also discussed in Section VII. FFTW does not employ the Winograd FFT [30], which minimizes the number of multiplications at the expense of a large number of additions. (This tradeoff is not beneficial on current processors that have specialized hardware multipliers.)
III. BENCHMARK RESULTS
We have performed extensive benchmarks of FFTW's performance, along with that of over 50 other FFT implementations, on most modern general-purpose processors, comparing complex and real-data transforms in one to three dimensions and for both single and double precisions. We generally found FFTW to be superior to other publicly available codes and comparable to vendor-tuned libraries. The complete results can be found in [1]. In this section, we present data for a small sampling of representative codes for complex-data 1-D transforms on a few machines.
We show the benchmark results as a series of graphs. Speed is measured in "MFLOPS," defined for a transform of size $n$ as $(5n \log_2 n)/t$, where $t$ is the time in microseconds for one transform, not including one-time initialization costs. This count of floating-point operations is based on the asymptotic number of operations for the radix-2 Cooley–Tukey algorithm (see [17, p. 45]), although the actual count is lower for most DFT implementations. The MFLOPS measure should, thus, be viewed as a convenient scaling factor rather than as an absolute indicator of CPU performance.
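For example, under this convention a size-1024 transform that takes $t = 10$ microseconds is reported as $(5 \cdot 1024 \cdot \log_2 1024)/10 = (5 \cdot 1024 \cdot 10)/10 = 5120$ MFLOPS.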
Fig. 1 shows the benchmark results for power-of-two sizes, in double precision, on a 2.8-GHz Pentium IV with the Intel compilers; in Fig. 2 are results for selected nonpower-of-two sizes of the form $2^a 3^b 5^c 7^d$ on the same machine; in Fig. 3 are the single-precision power-of-two results. Note that only the FFTW, MKL (Intel), IPPS (Intel), and Takahashi libraries on this machine were specifically designed to exploit the SSE/SSE2 SIMD instructions (see Section IX); for comparison, we also include FFTW (out-of-place) with SIMD disabled (fftw, no simd). In Fig. 4 are the power-of-two double-precision results on a 2-GHz PowerPC 970 (G5) with the Apple gcc 3.3 compiler. In Fig. 5 are the power-of-two double-precision results on an 833-MHz Alpha EV6 with the Compaq compilers, and in Fig. 6 are the single-precision results on the same machine.

Fig. 1. Comparison of double-precision 1-D complex DFTs, power-of-two sizes, on a 2.8-GHz Pentium IV. Intel C/Fortran compilers v. 7.1, optimization flags -O3 -xW (maximum optimization, enable automatic vectorizer).

Fig. 2. Comparison of double-precision 1-D complex DFTs, nonpower-of-two sizes, on a 2.8-GHz Pentium IV. Compiler and flags as in Fig. 1.
In addition to FFTW v. 3.0.1, the other codes benchmarked are as follows (some for only one precision or machine): arprec, "four-step" FFT implementation [18] (from the C++ ARPREC library, 2002); cxml, the vendor-tuned Compaq Extended Math Library on Alpha; fftpack, the Fortran library from [22]; green, free code by J. Green (C, 1998); mkl, the Intel Math Kernel Library v. 6.1 (DFTI interface) on the Pentium IV; ipps, the Intel Integrated Performance Primitives, Signal Processing, v. 3.0 on the Pentium IV; numerical recipes, the C four1 routine from [31]; ooura, a free code by T. Ooura (C and Fortran, 2001); singleton, a Fortran FFT [32]; sorensen, a split-radix FFT [33]; takahashi, the FFTE library v. 3.2 by D. Takahashi (Fortran, 2004) [34]; and vdsp, the Apple vDSP library on the G5.

Fig. 3. Comparison of single-precision 1-D complex DFTs, power-of-two sizes, on a 2.8-GHz Pentium IV. Compiler and flags as in Fig. 1. Note that fftpack, which was originally designed for vectorizing compilers (or vice versa), benefits somewhat from the automatic vectorization in this case.

Fig. 4. Comparison of double-precision 1-D complex DFTs, power-of-two sizes, on a 2-GHz PowerPC 970 (G5). Apple gcc v. 3.3, g77 v. 3.4 20031105 (experimental). Optimization flags -O3 -mcpu=970 -mtune=970. The Apple vDSP library uses separate real/imaginary arrays to store complex numbers and, therefore, its performance is not strictly comparable with the other codes, which use an array of real/imaginary pairs.
We now offer some remarks to aid the interpretation of the performance results. The performance of all routines drops for large problems, reflecting the cache hierarchy of the machine. Performance is low for small problems as well, because of the overhead of calling a routine to do little work. FFTW is the only library that exploits SIMD instructions for nonpower-of-two sizes, which gives it an advantage on the Pentium IV for this case. IPPS is limited to in-place contiguous inputs, whereas MKL and FFTW allow for strided input. Assuming contiguous input gives some speed advantage on a machine such as the Pentium IV, where index computation is somewhat slow.

Fig. 5. Comparison of double-precision 1-D complex DFTs, power-of-two sizes, on an 833-MHz Alpha EV6. Compaq C V6.2-505, Compaq Fortran X1.0.1-1155. Optimization flags: -newc -w0 -O5 -ansi_alias -ansi_args -fp_reorder -tune host -arch host.

Fig. 6. Comparison of single-precision 1-D complex DFTs, power-of-two sizes, on an 833-MHz Alpha EV6. Compilers and flags as in Fig. 5.
IV. STRUCTURE OF FFTW3
In this section, we discuss in detail how FFTW works. Specifically, we discuss how FFTW represents the problem to be solved (Sections IV-A and IV-B), the set of plans that the planner considers during its search (Sections IV-C and IV-D), and the internal operation of the planner (Section IV-E). For simplicity, this section considers complex DFTs only; we discuss real DFTs in Section VII.

Of these components, the representation of the problem to be solved is a critical choice. Indeed, we view our definition of a problem as a fundamental contribution of this paper. Because only problems that can be expressed can be solved, the representation of a problem determines an upper bound to the space of plans that the planner can explore; therefore, it ultimately constrains FFTW's performance.
A. Representation of Problems in FFTW
DFT problems in FFTW are expressed in terms of structures called I/O tensors, which in turn are described in terms of ancillary structures called I/O dimensions. (I/O tensors are unrelated to the tensor-product notation of SPIRAL.) In this section, we define these terms precisely.
An I/O dimension $d$ is a triple $d = (n, \iota, o)$, where $n$ is a nonnegative integer called the length, $\iota$ is an integer called the input stride, and $o$ is an integer called the output stride. An I/O tensor $t = \{d_1, d_2, \ldots, d_\rho\}$ is a set of I/O dimensions. The nonnegative integer $\rho = |t|$ is called the rank of the I/O tensor. A DFT problem, denoted by $\mathrm{dft}(N, V, I, O)$, consists of two I/O tensors $N$ and $V$ and two pointers $I$ and $O$. Roughly speaking, this describes $|V|$ nested loops of $|N|$-dimensional DFTs with input data starting at memory location $I$ and output data starting at $O$. We now give a more precise definition by induction on $|V|$, yielding a set of assignments from input to output. Conceptually, all of the right-hand sides of these assignments are evaluated before writing their values to the left-hand sides, a fiction that defines the behavior precisely, e.g., when $I = O$. (See also the examples in Section IV-B.)
- $\mathrm{dft}(N, \emptyset, I, O)$, with $N = \{(n_1, \iota_1, o_1), \ldots, (n_\rho, \iota_\rho, o_\rho)\}$, is the $\rho$-dimensional DFT, defined as follows: for all output indexes $0 \le k_\nu < n_\nu$, yield the assignment

$$O\left[\sum_{\nu=1}^{\rho} k_\nu o_\nu\right] := \sum_{j_1, \ldots, j_\rho} I\left[\sum_{\nu=1}^{\rho} j_\nu \iota_\nu\right] \prod_{\nu=1}^{\rho} \omega_{n_\nu}^{j_\nu k_\nu}$$

where each input index $j_\nu$ is summed from 0 to $n_\nu - 1$, $\omega_{n_\nu}$ is a primitive $n_\nu$th root of unity as in Section II, and $I[s]$ (respectively, $O[s]$) denotes the complex number at memory location $I + s$ (respectively, $O + s$), with pointer arithmetic in units of complex numbers. By convention, we define the zero-dimensional problem $\mathrm{dft}(\emptyset, \emptyset, I, O)$ to yield the assignment $O[0] := I[0]$.
- $\mathrm{dft}(N, \{(n, \iota, o)\} \cup V, I, O)$ is recursively defined as a "loop" of $n$ problems: for all $0 \le j < n$, yield all assignments in $\mathrm{dft}(N, V, I + j \cdot \iota, O + j \cdot o)$.

If two assignments write to the same memory location, the DFT problem is undefined. Such nonsensical problems are not normally encountered in practice, however, as discussed in Section IV-B.
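The induction above translates almost directly into code. The following reference interpreter is an illustrative sketch only: the iodim type and function names here are hypothetical, it evaluates the sums directly rather than via an FFT, and it is valid only for out-of-place problems, since it does not simulate the evaluate-all-right-hand-sides-first fiction:

```c
#include <complex.h>
#include <math.h>

typedef struct { int n, is, os; } iodim;  /* (length, input stride, output stride) */

/* Inner sums of the base case: for a fixed output multi-index k[],
   accumulate I[sum_nu j_nu*iota_nu] * prod_nu omega_{n_nu}^{j_nu k_nu},
   with omega_m = exp(-2*pi*i/m) as in Section II. */
static double complex dft_sum(int rank, const iodim *N, const int *k,
                              const double complex *in)
{
    if (rank == 0)
        return in[0];
    double complex s = 0;
    for (int j = 0; j < N[0].n; ++j)
        s += cexp(-2 * M_PI * I * (double)(j * k[0]) / N[0].n)
             * dft_sum(rank - 1, N + 1, k + 1, in + j * N[0].is);
    return s;
}

/* Base case dft(N, {}, I, O): loop over every output multi-index k[]
   and write O[sum_nu k_nu*o_nu]. */
static void dft_base(int rank, const iodim *N, int *k, int depth,
                     const double complex *in, double complex *out)
{
    if (depth == rank) {
        *out = dft_sum(rank, N, k, in);
        return;
    }
    for (int kk = 0; kk < N[depth].n; ++kk) {
        k[depth] = kk;
        dft_base(rank, N, k, depth + 1, in, out + kk * N[depth].os);
    }
}

/* dft(N, V, I, O): peel off one vector loop at a time, exactly as in
   the inductive definition, then perform the rank-|N| DFT. */
void dft_problem(int rank, const iodim *N, int vrank, const iodim *V,
                 const double complex *in, double complex *out)
{
    if (vrank > 0) {
        for (int j = 0; j < V[0].n; ++j)  /* a "loop" of n subproblems */
            dft_problem(rank, N, vrank - 1, V + 1,
                        in + j * V[0].is, out + j * V[0].os);
        return;
    }
    int k[8] = {0};  /* this sketch assumes rank <= 8 */
    dft_base(rank, N, k, 0, in, out);
}
```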
One property of this definition is the fact that an I/O tensor $t$ is equivalent to $t \cup \{(1, \iota, o)\}$. That is, length-1 DFT dimensions and length-1 loops can be eliminated. FFTW, therefore, internally canonicalizes I/O tensors by removing all I/O dimensions where $n = 1$. (Similarly, all I/O tensors of the form $\{(1, \iota, o)\}$ are equivalent.)
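This formalism maps closely onto FFTW3's "guru" planner interface, whose fftw_iodim structure is precisely an I/O dimension $(n, \iota, o)$. As a sketch (the helper function here is hypothetical), one could plan a single size-$n$ DFT wrapped in a length-3 vector loop, e.g., to transform each component of a vector field stored as interleaved complex triples:

```c
#include <fftw3.h>

/* Plan one DFT per component of an n-point vector field stored as
   interleaved complex (x, y, z) triples: the size N is the single
   I/O dimension (n, 3, 3) and the vector size V is (3, 1, 1). */
fftw_plan plan_vector_field(int n, fftw_complex *in, fftw_complex *out)
{
    fftw_iodim dims[1]    = { { n, 3, 3 } };  /* N: length n, strides 3 */
    fftw_iodim howmany[1] = { { 3, 1, 1 } };  /* V: loop over 3 components */
    return fftw_plan_guru_dft(1, dims, 1, howmany,
                              in, out, FFTW_FORWARD, FFTW_ESTIMATE);
}
```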
We call $N$ the size of the problem. The rank of a problem is defined to be the rank of its size (i.e., the dimensionality of the DFT). Similarly, we call $V$ the vector size of the problem, and the vector rank of a problem is correspondingly defined to be the rank of its vector size. One unusual feature of FFTW is that the vector rank is arbitrary: FFTW is not restricted to vector sizes of rank 1. Intuitively, the vector size can be interpreted as a set of "loops" wrapped around a single DFT, and we, therefore, refer to a single I/O dimension of $V$ as a vector loop. (Alternatively, one can view the problem as defining

References

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA: MIT Press.

W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery, Numerical Recipes in C: The Art of Scientific Computing. Cambridge, U.K.: Cambridge Univ. Press.

J. W. Cooley and J. W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput., vol. 19, pp. 297–301, 1965.

R. L. Rivest, "The MD5 message-digest algorithm," Internet RFC 1321, 1992.