DSP_fft16x16r
4-16
The function takes the twiddle factors and input data, and calculates the FFT
producing the frequency domain data in the y[ ] array. As the FFT allows every
input point to affect every output point, which causes cache thrashing in a
cache based system. This is mitigated by allowing the main FFT of size N to
be divided into several steps, allowing as much data reuse as possible. For
example, see the following function:
DSP_fft16x16r(1024,&x[0], &w[0], y,brev,4, 0,1024);
is equivalent to:
DSP_fft16x16r(1024,&x[2*0], &w[0] ,y,brev,256, 0,1024);
DSP_fft16x16r(256, &x[2*0], &w[2*768],y,brev,4, 0,1024);
DSP_fft16x16r(256, &x[2*256],&w[2*768],y,brev,4, 256,1024);
DSP_fft16x16r(256, &x[2*512],&w[2*768],y,brev,4, 512,1024);
DSP_fft16x16r(256, &x[2*768],&w[2*768],y,brev,4, 768,1024);
Notice how the first FFT function is called on the entire 1K data set. It covers
the first pass of the FFT until the butterfly size is 256.
The following 4 FFTs do 256-point FFTs 25% of the size. These continue down
to the end when the butterfly is of size 4. They use an index to the main twiddle
factor array of 0.75*2*N. This is because the twiddle factor array is composed
of successively decimated versions of the main array.
N not equal to a power of 4 can be used; i.e. 512. In this case, the following
would be needed to decompose the FFT:
DSP_fft16x16r(512, &x[0], &w[0], y,brev,2, 0,512);
is equivalent to:
DSP_fft16x16r(512, &x[0], &w[0], y,brev,128, 0,512);
DSP_fft16x16r(128, &x[2*0], &w[2*384],y,brev,2, 0,512);
DSP_fft16x16r(128, &x[2*128],&w[2*384],y,brev,2, 128,512);
DSP_fft16x16r(128, &x[2*256],&w[2*384],y,brev,2, 256,512);
DSP_fft16x16r(128, &x[2*384],&w[2*384],y,brev,2, 384,512);
The twiddle factor array is composed of log
4
(N) sets of twiddle factors, (3/4)*N,
(3/16)*N, (3/64)*N, etc. The index into this array for each stage of the FFT is
calculated by summing these indices up appropriately. For multiple FFTs, they
can share the same table by calling the small FFTs from further down in the
twiddle factor array, in the same way as the decomposition works for more data
reuse.
Thus, the above decomposition can be summarized for a general N, radix “rad”
as follows.