Cooley–Tukey FFT algorithm

The Cooley–Tukey algorithm is the most common fast Fourier transform (FFT) algorithm. It re-expresses the discrete Fourier transform (DFT) of an arbitrary composite size n = n_{1}n_{2} in terms of smaller DFTs of sizes n_{1} and n_{2}, recursively, in order to reduce the computation time to O(n log n) for highly composite n. Because of the algorithm's importance, specific variants and implementation styles have become known by their own names, as described below.
This algorithm, including its recursive application, was already known around 1805 to Carl Friedrich Gauss, who used it to interpolate the trajectories of the asteroids Pallas and Juno, but his work was not widely recognized (being published only posthumously and in neo-Latin); Gauss did not analyze the asymptotic computational time, however. Various limited forms were also rediscovered several times throughout the 19th and early 20th centuries. FFTs became popular after J. W. Cooley of IBM and John W. Tukey of Princeton published a paper in 1965 reinventing the algorithm and describing how to perform it conveniently on a computer (including how to arrange for the output to be produced in the natural ordering).
Because the Cooley–Tukey algorithm breaks the DFT into smaller DFTs, it can be combined arbitrarily with any other algorithm for the DFT. For example, Rader's or Bluestein's algorithm can be used to handle large prime factors that cannot be decomposed by Cooley–Tukey, or the prime-factor algorithm can be exploited for greater efficiency in separating out relatively prime factors.
See also the fast Fourier transform for information on other FFT algorithms, specializations for real and/or symmetric data, and accuracy in the face of finite floating-point precision.
The radix-2 DIT case
The simplest and most common form of the Cooley–Tukey algorithm (more so in textbooks than in high-performance implementations, however) is called a radix-2 decimation-in-time (DIT) FFT: it divides the problem into two interleaved halves of size n/2 at each recursive stage.
Recall that the DFT is defined by the formula:
 <math>f_j = \sum_{k=0}^{n-1} x_k e^{-\frac{2\pi i}{n} jk }
\qquad j = 0,\dots,n-1. </math>
Radix-2 DIT first computes the Fourier transforms of the even-indexed numbers x_{0}, x_{2}, ..., x_{n-2}, and of the odd-indexed numbers x_{1}, x_{3}, ..., x_{n-1}, and then combines those two results to produce the Fourier transform of the whole sequence. This idea can then be applied recursively to reduce the overall runtime to O(n log n). This simplified form assumes that n is a power of two; since the number of sample points n can usually be chosen freely by the application, this is often not an important restriction.
More explicitly, let us write n' = n/2 and denote the DFT of the even-indexed numbers x'_{0} = x_{0}, x'_{1} = x_{2}, ..., x'_{n'-1} = x_{n-2} by f'_{0}, ..., f'_{n'-1}; and the DFT of the odd-indexed numbers x''_{0} = x_{1}, x''_{1} = x_{3}, ..., x''_{n'-1} = x_{n-1} by f''_{0}, ..., f''_{n'-1}. Then it follows:
 <math> \begin{matrix}
f_j & = & \sum_{k=0}^{\frac{n}{2}-1} x_{2k} e^{-\frac{2\pi i}{n} j(2k)}
+ \sum_{k=0}^{\frac{n}{2}-1} x_{2k+1} e^{-\frac{2\pi i}{n} j(2k+1)} \\ \\
& = & \sum_{k=0}^{n'-1} x'_{k} e^{-\frac{2\pi i}{n'} jk}
+ e^{-\frac{2\pi i}{n}j} \sum_{k=0}^{n'-1} x''_k e^{-\frac{2\pi i}{n'} jk} \\ \\
& = & \left\{ \begin{matrix}
f'_j + e^{-\frac{2\pi i}{n}j} f''_j & \mbox{if } j < n' \\ \\
f'_{j-n'} - e^{-\frac{2\pi i}{n}(j-n')} f''_{j-n'} & \mbox{if } j \geq n' \end{matrix} \right. \end{matrix} </math>
This process is an example of the general technique of divide and conquer algorithms; in many traditional implementations, however, the explicit recursion is avoided, and instead one traverses the computational tree in breadth-first fashion.
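The even/odd splitting above translates directly into code. The following is a minimal Python sketch of the recursive radix-2 DIT algorithm (the function name is illustrative; it assumes n is a power of two and makes no attempt at the optimizations discussed below):

```python
import cmath

def fft_radix2_dit(x):
    """Recursive radix-2 decimation-in-time FFT.

    Assumes len(x) is a power of two. Returns the DFT
    f_j = sum_k x_k * exp(-2*pi*i*j*k/n).
    """
    n = len(x)
    if n == 1:
        return list(x)
    # DFTs of the even- and odd-indexed subsequences (f' and f'' above)
    f_even = fft_radix2_dit(x[0::2])
    f_odd = fft_radix2_dit(x[1::2])
    f = [0] * n
    for j in range(n // 2):
        # twiddle factor times the odd transform
        t = cmath.exp(-2j * cmath.pi * j / n) * f_odd[j]
        f[j] = f_even[j] + t           # the j < n' case
        f[j + n // 2] = f_even[j] - t  # the j >= n' case
    return f
```

Note that each pair of outputs reuses one twiddle multiplication, which is exactly the size-2 butterfly described in the general-factorization section.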
The above re-expression of a size-n DFT as two size-n/2 DFTs is sometimes called the Danielson–Lanczos lemma, since the identity was noted by those two authors in 1942 (influenced by Runge's 1903 work). They applied their lemma in a "backwards" recursive fashion, repeatedly doubling the DFT size until the transform spectrum converged (although they apparently didn't realize the linearithmic asymptotic complexity they had achieved). The Danielson–Lanczos work predated widespread availability of computers and required hand calculation (possibly with mechanical aids such as adding machines); they reported a computation time of 140 minutes for a size-64 DFT operating on real inputs to 3–5 significant digits. Cooley and Tukey's 1965 paper reported a running time of 0.02 minutes for a size-2048 complex DFT on an IBM 7094 (probably in 36-bit single precision, ~8 digits). Rescaling the time by the number of operations, this corresponds roughly to a speedup factor of around 800,000. The more modern FFT library FFTW, on a 2GHz Pentium 4 in 64-bit double precision (~16 digits), can compute a size-64 real-input DFT in 0.5 μs and a size-2048 complex DFT in 50 μs, speedups of about 16,000,000,000 and 20,000 over Danielson & Lanczos and Cooley & Tukey, respectively, not even including the considerable improvements in accuracy.
(140 minutes for size 64 may sound like a long time, but it corresponds to an average of at most 16 seconds per floating-point operation, around 20% of which are multiplications... this is a fairly impressive rate for a human being to sustain for almost two and a half hours, especially when you consider the bookkeeping overhead.)
General factorizations
More generally, Cooley–Tukey algorithms recursively re-express a DFT of a composite size n = n_{1}n_{2} as:
 Perform n_{1} DFTs of size n_{2}.
 Multiply by complex roots of unity called twiddle factors.
 Perform n_{2} DFTs of size n_{1}.
Typically, either n_{1} or n_{2} is a small factor (not necessarily prime), called the radix (which can differ between stages of the recursion). If n_{1} is the radix, it is called a decimation in time (DIT) algorithm, whereas if n_{2} is the radix, it is decimation in frequency (DIF, also called the Sande–Tukey algorithm). The version presented above was a radix-2 DIT algorithm; in the final expression, the phase multiplying the odd transform is the twiddle factor, and the +/− combination (butterfly) of the even and odd transforms is a size-2 DFT. (The radix's small DFT is sometimes known as a butterfly, so called because of the shape of the dataflow diagram for the radix-2 case.)
There are many other variations on the Cooley–Tukey algorithm. Mixed-radix implementations handle composite sizes with a variety of (typically small) factors in addition to two, usually (but not always) employing the O(n^{2}) algorithm for the prime base cases of the recursion. Split-radix merges radices 2 and 4, exploiting the fact that the first transform of radix 2 requires no twiddle factor, in order to achieve the lowest known arithmetic operation count for power-of-two sizes. (On present-day computers, performance is determined more by cache and CPU pipeline considerations than by strict operation counts; well-optimized FFT implementations often employ larger radices and/or hardcoded base-case transforms of significant size.) Another way of looking at the Cooley–Tukey algorithm is that it re-expresses a size-n one-dimensional DFT as an n_{1} by n_{2} two-dimensional DFT (plus twiddles), where the output matrix is transposed. The net result of all of these transpositions, for a radix-2 algorithm, corresponds to a bit reversal of the input (DIF) or output (DIT) indices. If, instead of using a small radix, one employs a radix of roughly √n and explicit input/output matrix transpositions, it is called a four-step algorithm (or six-step, depending on the number of transpositions), initially proposed for memory locality (cache) optimization (Gentleman and Sande, 1966; also Bailey, 1990).
The general Cooley–Tukey factorization rewrites the indices j and k as j = n_{2}j_{1} + j_{2} and k = n_{1}k_{2} + k_{1}, respectively, where the indices j_{a} and k_{a} run from 0..n_{a}-1 (for a of 1 or 2). That is, it re-indexes the input (k) and output (j) as n_{1} by n_{2} two-dimensional arrays in column-major and row-major order, respectively. When this re-indexing is substituted into the DFT formula for jk, the n_{2}j_{1}n_{1}k_{2} cross term vanishes (its exponential is unity), and the remaining terms give
 <math>f_{n_2 j_1 + j_2} =
\sum_{k_1=0}^{n_1-1} \left[ e^{-\frac{2\pi i}{n} j_2 k_1 } \right] \left( \sum_{k_2=0}^{n_2-1} x_{n_1 k_2 + k_1} e^{-\frac{2\pi i}{n_2} j_2 k_2 } \right) e^{-\frac{2\pi i}{n_1} j_1 k_1 }</math>
where the inner sum is a DFT of size n_{2}, the outer sum is a DFT of size n_{1}, and the [...] bracketed term is the twiddle factor.
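As a concrete illustration of this factorization, the following Python sketch computes one Cooley–Tukey step for n = n_{1}n_{2} directly from the formula above (the helper names are ours, and a naive O(n^{2}) DFT stands in for the recursive sub-transforms):

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT, standing in for the size-n1 and size-n2 sub-transforms."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def cooley_tukey_step(x, n1, n2):
    """One Cooley-Tukey step for a size n = n1*n2 DFT.

    Inner size-n2 DFTs over k2 (input indexed as x[n1*k2 + k1]),
    twiddle factors exp(-2*pi*i*j2*k1/n), then outer size-n1 DFTs
    over k1, with output index j = n2*j1 + j2.
    """
    n = n1 * n2
    assert len(x) == n
    # inner DFTs: for each k1, transform the subsequence x[n1*k2 + k1]
    inner = [dft([x[n1 * k2 + k1] for k2 in range(n2)]) for k1 in range(n1)]
    f = [0] * n
    for j2 in range(n2):
        # apply twiddle factors, then the outer DFT over k1 for this j2
        col = [inner[k1][j2] * cmath.exp(-2j * cmath.pi * j2 * k1 / n)
               for k1 in range(n1)]
        outer = dft(col)
        for j1 in range(n1):
            f[n2 * j1 + j2] = outer[j1]
    return f
```

In a full FFT, the two calls to `dft` would themselves be factored recursively whenever n_{1} or n_{2} is composite.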
The 1965 Cooley–Tukey paper noted that one can employ an arbitrary radix r (as well as mixed radices), but failed to realize that the radix butterfly is itself a DFT that can use FFT algorithms. Hence, they reckoned the complexity to be O(r^{2} (n/r) log_{r}n), and erroneously concluded that the optimal radix was 3 (the closest integer to e). Gauss also derived the algorithm for arbitrary radices, and gave explicit examples of both radix-3 and radix-6 steps.
Data reordering, bit reversal, and in-place algorithms
Although the abstract Cooley–Tukey factorization of the DFT, above, applies in some form to all implementations of the algorithm, much greater diversity exists in the techniques for ordering and accessing the data at each stage of the FFT. Of special interest is the problem of devising an in-place algorithm that overwrites its input with its output data using only O(1) auxiliary storage.
The most well-known reordering technique involves explicit bit reversal for in-place radix-2 algorithms. Bit reversal is the permutation where the data at an index k, written in binary with digits b_{4}b_{3}b_{2}b_{1}b_{0} (e.g. 5 digits for n=32 inputs), is transferred to the index with reversed digits b_{0}b_{1}b_{2}b_{3}b_{4}. Consider the last stage of a radix-2 DIT algorithm like the one presented above, where the output is written in place over the input: when f_{j}' and f_{j}'' are combined with a size-2 DFT, those two values are overwritten by the outputs. However, the two output values should go in the first and second halves of the output array, corresponding to the most significant bit b_{4} (for n=32); whereas the two inputs f_{j}' and f_{j}'' are interleaved in the even and odd elements, corresponding to the least significant bit b_{0}. Thus, in order to get the output in the correct place, these two bits must be swapped in the input. If one includes all of the recursive stages of a radix-2 DIT algorithm, all the bits must be swapped, and thus one must preprocess the input with a bit reversal to get in-order output. Correspondingly, the reversed (dual) algorithm is radix-2 DIF, and this takes in-order input and produces bit-reversed output, requiring a bit-reversal postprocessing step. Alternatively, some applications (such as convolution) work equally well on bit-reversed data, so one can do radix-2 DIF without bit reversal, followed by processing, followed by the radix-2 DIT inverse DFT without bit reversal to produce final results in the natural order.
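The bit-reversal preprocessing followed by in-place breadth-first butterflies can be sketched as follows (an illustrative Python version for power-of-two n, with no attempt at the performance optimizations discussed below):

```python
import cmath

def bit_reverse(x):
    """In-place bit-reversal permutation of a power-of-two-length list."""
    n = len(x)
    bits = n.bit_length() - 1
    for k in range(n):
        rev = int(format(k, '0{}b'.format(bits))[::-1], 2)
        if rev > k:  # swap each pair exactly once
            x[k], x[rev] = x[rev], x[k]

def fft_inplace_dit(x):
    """Iterative in-place radix-2 DIT FFT: bit-reverse the input, then
    combine size-m/2 sub-DFTs into size-m DFTs for m = 2, 4, ..., n."""
    n = len(x)
    bit_reverse(x)
    m = 2
    while m <= n:
        for start in range(0, n, m):   # each size-m block
            for j in range(m // 2):    # butterfly on f'_j and f''_j
                w = cmath.exp(-2j * cmath.pi * j / m)
                a, b = x[start + j], x[start + j + m // 2]
                x[start + j] = a + w * b
                x[start + j + m // 2] = a - w * b
        m *= 2
    return x
```

Because the input is bit-reversed first, every stage reads and writes the same two array slots per butterfly, so the transform uses only O(1) auxiliary storage and emerges in natural order.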
Many FFT users, however, prefer natural-order outputs, and a separate, explicit bit-reversal stage can have a non-negligible impact on the computation time, even though bit reversal can be done in O(n) time and has been the subject of much research (e.g. Karp, 1996; Carter, 1998; and Rubio, 2002). Also, while the permutation is a bit reversal in the radix-2 case, it is more generally an arbitrary (mixed-base) digit reversal for the mixed-radix case, and the permutation algorithms become more complicated to implement. Moreover, it is desirable on many hardware architectures to reorder intermediate stages of the FFT algorithm so that they operate on consecutive (or at least more localized) data elements. To these ends, a number of alternative implementation schemes have been devised for the Cooley–Tukey algorithm that do not require separate bit reversal and/or involve additional permutations at intermediate stages.
The problem is greatly simplified if it is out-of-place: the output array is distinct from the input array or, equivalently, an equal-size auxiliary array is available. The Stockham autosort algorithm (Stockham, 1966) performs every stage of the FFT out-of-place, typically writing back and forth between two arrays, transposing one "digit" of the indices with each stage, and has been especially popular on SIMD architectures (Swarztrauber, 1982). Even greater potential SIMD advantages (more consecutive accesses) have been proposed for the Pease (1968) algorithm, which also reorders out-of-place with each stage, but this method requires separate bit/digit reversal and O(n log n) storage. One can also directly apply the Cooley–Tukey factorization definition with explicit (depth-first) recursion and small radices, which produces natural-order out-of-place output with no separate permutation step and can be argued to have cache-oblivious locality benefits on systems with hierarchical memory (Singleton, 1967; see also FFTW).
A typical strategy for in-place algorithms without auxiliary storage and without separate digit-reversal passes involves small matrix transpositions (which swap individual pairs of digits) at intermediate stages, which can be combined with the radix butterflies to reduce the number of passes over the data (Johnson & Burrus, 1984; Temperton, 1991; Qian et al., 1994; Hegland, 1994).
References
 James W. Cooley and John W. Tukey, "An algorithm for the machine calculation of complex Fourier series," Math. Comput. 19, 297–301 (1965).
 Carl Friedrich Gauss, "Nachlass: Theoria interpolationis methodo nova tractata," Werke band 3, 265–327 (Königliche Gesellschaft der Wissenschaften, Göttingen, 1866). See also M. T. Heideman, D. H. Johnson, and C. S. Burrus, "Gauss and the history of the fast Fourier transform," IEEE ASSP Magazine 1 (4), 14–21 (1984).
 G. C. Danielson and C. Lanczos, "Some improvements in practical Fourier analysis and their application to Xray scattering from liquids," J. Franklin Inst. 233, 365–380 and 435–452 (1942).
 W. M. Gentleman and G. Sande, "Fast Fourier transforms—for fun and profit," Proc. AFIPS 29, 563–578 (1966).
 P. Duhamel and M. Vetterli, "Fast Fourier transforms: a tutorial review and a state of the art," Signal Processing 19, 259–299 (1990).
 David H. Bailey, "FFTs in external or hierarchical memory," J. Supercomputing 4 (1), 23–35 (1990).
 Alan H. Karp, "Bit reversal on uniprocessors," SIAM Review 38 (1), 1–26 (1996).
 Larry Carter and Kang Su Gatlin, "Towards an optimal bit-reversal permutation program," Proc. 39th Ann. Symp. on Found. of Comp. Sci. (FOCS), 544–553 (1998).
 M. Rubio, P. Gómez, and K. Drouiche, "A new superfast bit reversal algorithm," Intl. J. Adaptive Control and Signal Processing 16, 703–707 (2002).
 T. G. Stockham, "High speed convolution and correlation", Spring Joint Computer Conference, Proc. AFIPS 28, 229–233 (1966).
 P. N. Swarztrauber, "Vectorizing the FFTs", in G. Rodrigue (Ed.), Parallel Computations (Academic Press, New York, 1982), pp. 51–83.
 M. C. Pease, "An adaptation of the fast Fourier transform for parallel processing", J. ACM 15 (2), 252–264 (1968).
 Richard C. Singleton, "On computing the fast Fourier transform", Commun. of the ACM 10 (1967), 647–654.
 H. W. Johnson and C. S. Burrus, "An in-place, in-order radix-2 FFT," Proc. ICASSP, 28A.2.1–28A.2.4 (1984).
 C. Temperton, "Self-sorting in-place fast Fourier transform," SIAM J. Sci. Stat. Comput. 12 (4), 808–823 (1991).
 Z. Qian, C. Lu, M. An, and R. Tolimieri, "Self-sorting in-place FFT algorithm with minimum working space," IEEE Trans. ASSP 52 (10), 2835–2836 (1994).
 M. Hegland, "A self-sorting in-place fast Fourier transform algorithm suitable for vector and parallel processing," Numerische Mathematik 68 (4), 507–547 (1994).
 Matteo Frigo and Steven G. Johnson: FFTW, http://www.fftw.org/. A free (GPL) C library for computing discrete Fourier transforms in one or more dimensions, of arbitrary size, using the CooleyTukey algorithm. Also M. Frigo and S. G. Johnson, "The Design and Implementation of FFTW3 (http://fftw.org/fftwpaperieee.pdf)," Proceedings of the IEEE 93 (2), 216–231 (2005).