Introduction:
There are many ways to implement an FFT. The supplied demonstration graph
demo/ffttest/ffttest is an implementation example that breaks down the FFT to its component add, subtract, and multiply functional core. In ffttest, the use of "families" replicates each described core unit scaling the size of the application to the size of the necessary FFT. Parallelism is taken advantage of by being able to take each element and individually, or as a core unit, map them to a separate processor. This design takes advantage of the high level of replication basic functions. But this design may only have strong performance benefits on architectures such as systolic arrays or systems with large numbers of processors with complex communication topologies.
Today's processors have optimized routines to perform basic sized FFTs. It is then an advantage NOT to decompose an FFT to such a fine granularity, as in
ffttest, but to use these optimized routines. But there are occasions when the supplied routines are not adequate. There are two reasons where one may wish to use multiples of these optimized FFTs in an application.
It must be remembered that any single optimized routine may only be executed on a single processor. Parallelizing compilers can create executables for multiple processors, but you may lose control of execution characteristics.
Abstract:
A graph using FFTs in parallel for execution on multiple DSP processors is described. This allows large vector size FFTs to be implemented efficiently using the size FFTs provided by a particular DSP library. The algorithm implementation, its use in higher level graphs, and its performance are also described.
The following graph is the top level of this design. It has one source providing a modulated complex random noise signal. The output display allows you to compare the signal's frequency spectrum computed by two different implementations.
To see more details of the algorithm, double-click on either the source_vx or the vx_parfft hierarchical boxes. To avoid confusion, the local constant integers are connected to their respective inputs of both source_vx and vx_parfft .
The next figure is a random sampling of the output from the graph display on scope2. This is a quick comparison of the outputs of both branches of FFT calculation. This is an easy method of seeing if the two functions are roughly performing the same calculations. As you can see, both the single function vx_fft and parallelized vx_parfft have the same output.

Application Note last updated 3/24/98