SPRADM6 Application note

SPRADM6 December 2024 AM62D-Q1

2.1.1 Fast Fourier Transform

The Fast Fourier Transform (FFT) is one of the most common signal processing algorithms. The FFTLIB library contains a number of FFT kernels for various batch sizes and data types. These kernels currently run only on the C7x and do not yet utilize the MMA. FFTLIB is included in the FREERTOS-SDK for AM62D, and the library documentation contains instructions to build and run these kernels. Table 2-1 shows performance data for complex-input, complex-output FFTs with 16-bit complex integer and 32-bit complex float data types for various FFT sizes and batch sizes executed on the C7x. Performance is reported as the total number of C7x clock cycles needed to perform the FFTs and as the corresponding time per batch (in the table, the total time divided by the batch size) with the C7x clocked at 1.0GHz.

Table 2-1 FFT Performance on C7x

FFT Parameters		Performance for 16-bit int complex input and complex output		Performance for 32-bit float complex input and complex output
FFT Size	Batch Size	Total [Cycles]	Time per batch [µs]	Total [Cycles]	Time per batch [µs]
128	128	12806	0.1	24221	0.18
256	64	13944	0.21	24850	0.38
512	32	13793	0.43	26826	0.83
1024	16	16012	1.0	29783	1.86
2048	8	15981	1.99	31650	3.95
4096	4	17112	4.27	34834	8.70
8192	2	17063	8.53	36304	18.15
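The relationship between the two columns of each data type can be checked with a short calculation. The following Python sketch is illustrative only and is not part of FFTLIB or the SDK; the function name time_per_fft_us and the constant names are hypothetical. It assumes that the time column is the total cycle count divided by the batch size and converted to microseconds at the stated 1.0GHz clock, which agrees with the listed values to within the last printed digit.

```python
# Total cycle counts copied from Table 2-1:
# (FFT size, batch size, cycles for 16-bit int, cycles for 32-bit float)
TABLE_2_1 = [
    (128, 128, 12806, 24221),
    (256, 64, 13944, 24850),
    (512, 32, 13793, 26826),
    (1024, 16, 16012, 29783),
    (2048, 8, 15981, 31650),
    (4096, 4, 17112, 34834),
    (8192, 2, 17063, 36304),
]

C7X_CLOCK_HZ = 1.0e9  # C7x clock rate stated in the text (1.0GHz)


def time_per_fft_us(total_cycles, batch_size, clock_hz=C7X_CLOCK_HZ):
    """Return the time in microseconds spent on one FFT of the batch.

    Assumption: the table's time column is total cycles / batch size,
    converted to time at the given clock rate.
    """
    cycles_per_fft = total_cycles / batch_size
    return cycles_per_fft / clock_hz * 1e6


if __name__ == "__main__":
    # Print one line per table row: size, batch, int16 time, float32 time.
    for size, batch, cyc_i16, cyc_f32 in TABLE_2_1:
        print(f"{size:5d} {batch:4d} "
              f"{time_per_fft_us(cyc_i16, batch):7.3f} "
              f"{time_per_fft_us(cyc_f32, batch):7.3f}")
```

Because the total cycle count grows only slowly from row to row while the batch size halves, the time per FFT roughly doubles each time the FFT size doubles.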

To show the performance gain of the C7x DSP over Arm, the execution time of an FFT on the C7x DSP is compared with the execution time of the same type of FFT on an Arm Cortex-A53 core. The Arm benchmark uses the implementation from the Ne10 library, which leverages the Advanced SIMD (NEON) acceleration of the Cortex-A53. This library is not included in the SDK but can be downloaded from the official Ne10 repository. Because the library requires a Linux® OS, the benchmark was run on AM62A, which has the same Cortex-A53 cores as AM62D. Table 2-2 shows the execution time of a 1024-point single-precision floating-point complex FFT on the Arm Cortex-A53 and on the C7x DSP. The results show that the C7x DSP is approximately 11x faster than the Cortex-A53.

Table 2-2 C7x DSP vs Arm CPU executing Complex FFT

	Arm Cortex-A53 at 1.4GHz single thread / core	C7x at 1.0GHz	C7x (1.0GHz) to A53 (1.4GHz) Improvement
1024-point Complex FFT Execution Time	19.4µs	1.8µs	Approx. 11x
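The improvement factor in Table 2-2 follows directly from the two execution times. The Python sketch below is illustrative only; the function names speedup and cycles are hypothetical. The wall-clock ratio reproduces the approximate 11x figure. The clock-normalized ratio is a back-of-envelope number derived here from the table values and clock rates, not a result reported in the measurements.

```python
def speedup(baseline_time_us, accelerated_time_us):
    """Ratio of execution times; values above 1 mean the second is faster."""
    return baseline_time_us / accelerated_time_us


def cycles(time_us, clock_ghz):
    """Convert an execution time to clock cycles (1 us at 1 GHz = 1000 cycles)."""
    return time_us * clock_ghz * 1000.0


# Values copied from Table 2-2.
A53_TIME_US, A53_CLOCK_GHZ = 19.4, 1.4
C7X_TIME_US, C7X_CLOCK_GHZ = 1.8, 1.0

# Wall-clock comparison, as reported in the table (about 11x).
time_ratio = speedup(A53_TIME_US, C7X_TIME_US)

# Clock-normalized comparison (derived, not from the source):
# how many A53 cycles are spent for each C7x cycle on the same FFT.
cycle_ratio = (cycles(A53_TIME_US, A53_CLOCK_GHZ)
               / cycles(C7X_TIME_US, C7X_CLOCK_GHZ))

if __name__ == "__main__":
    print(f"time ratio:  {time_ratio:.1f}x")
    print(f"cycle ratio: {cycle_ratio:.1f}x")
```

The wall-clock ratio is the figure that matters for system latency. The cycle ratio only indicates that the C7x advantage is not explained by clock rate, since the C7x runs at the lower clock in this comparison.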