SPRADM6 December 2024 AM62D-Q1
Fast Fourier Transform (FFT) is one of the most common signal processing algorithms. The FFTLIB library contains a number of FFT kernels covering various batch sizes and data types. These kernels currently run on the C7x only and do not yet use the MMA. FFTLIB is included in the FREERTOS-SDK for AM62D, and the library documentation contains instructions to build and run these kernels. Table 2-1 shows performance data for complex-input, complex-output FFTs with 16-bit integer complex and 32-bit float complex data types, for various FFT sizes and batch sizes, executed on the C7x. Performance is reported as the total number of C7x clock cycles needed to process the whole batch and as the time per FFT within the batch (total cycles divided by batch size, listed in the "Time per batch" column) with the C7x clocked at 1.0 GHz.
| FFT Size | Batch Size | 16-bit int complex input and output: Total [Cycles] | 16-bit int complex input and output: Time per batch [µs] | 32-bit float complex input and output: Total [Cycles] | 32-bit float complex input and output: Time per batch [µs] |
|---|---|---|---|---|---|
| 128 | 128 | 12806 | 0.1 | 24221 | 0.18 |
| 256 | 64 | 13944 | 0.21 | 24850 | 0.38 |
| 512 | 32 | 13793 | 0.43 | 26826 | 0.83 |
| 1024 | 16 | 16012 | 1.0 | 29783 | 1.86 |
| 2048 | 8 | 15981 | 1.99 | 31650 | 3.95 |
| 4096 | 4 | 17112 | 4.27 | 34834 | 8.70 |
| 8192 | 2 | 17063 | 8.53 | 36304 | 18.15 |
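The relationship between the cycle counts and the microsecond figures in Table 2-1 can be checked with a short script. The following is a minimal sketch, not part of FFTLIB or the SDK: the function names `time_per_fft_us` and `truncate_2dp` are hypothetical, and the assumption that the published times are truncated (not rounded) to two decimals is inferred from the table values themselves. Note also that FFT size times batch size is 16384 complex samples in every row, so each row processes the same amount of data.

```python
import math

C7X_CLOCK_HZ = 1.0e9  # C7x clock rate stated in the text (1.0 GHz)

# (fft_size, batch_size, int16_total_cycles, float32_total_cycles)
# Values copied from Table 2-1.
TABLE_2_1 = [
    (128, 128, 12806, 24221),
    (256, 64, 13944, 24850),
    (512, 32, 13793, 26826),
    (1024, 16, 16012, 29783),
    (2048, 8, 15981, 31650),
    (4096, 4, 17112, 34834),
    (8192, 2, 17063, 36304),
]


def time_per_fft_us(total_cycles, batch_size, clock_hz=C7X_CLOCK_HZ):
    """Time for one FFT of the batch, in microseconds."""
    cycles_per_fft = total_cycles / batch_size
    cycles_per_us = clock_hz / 1e6
    return cycles_per_fft / cycles_per_us


def truncate_2dp(value):
    """Truncate to two decimals (assumed convention of the table)."""
    return math.floor(value * 100) / 100


if __name__ == "__main__":
    for fft_size, batch, cyc_i16, cyc_f32 in TABLE_2_1:
        t_i16 = truncate_2dp(time_per_fft_us(cyc_i16, batch))
        t_f32 = truncate_2dp(time_per_fft_us(cyc_f32, batch))
        print(f"{fft_size:5d} x {batch:3d}: int16 {t_i16:.2f} us, float32 {t_f32:.2f} us")
```

Passing a different `clock_hz` rescales the times for another C7x clock rate, under the assumption that the cycle counts themselves do not change with frequency.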
To show the performance gain of the C7x DSP over Arm, the execution time of the FFT on the C7x DSP is compared with the execution time of the same type of FFT on an Arm Cortex-A53 core. The Arm benchmark uses the implementation from the Ne10 library, which leverages the Advanced SIMD (NEON) acceleration of the Cortex-A53. This library is not included in the SDK but can be downloaded from the official Ne10 repository. Because the library requires a Linux® OS, the benchmark was run on AM62A, which has the same Cortex-A53 cores as AM62D. Table 2-2 shows the execution time of a 1024-point single-precision floating-point complex FFT on the Arm Cortex-A53 and on the C7x DSP. The results show an approximately 11x speedup of the C7x DSP over the Cortex-A53.
| | Arm Cortex-A53 at 1.4 GHz (single thread/core) | C7x at 1.0 GHz | C7x (1.0 GHz) to A53 (1.4 GHz) Improvement |
|---|---|---|---|
| 1024-point Complex FFT Execution Time | 19.4 µs | 1.8 µs | Approx. 11x |
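The improvement figure in Table 2-2 is a ratio of wall-clock times measured at different clock rates. The sketch below reproduces that ratio and, as an additional derived figure that does not appear in the source, converts both times to clock cycles to give a per-cycle comparison. The helper names `speedup` and `cycles` are hypothetical; the input values are taken from the table.

```python
# Values from Table 2-2 (1024-point single-precision complex FFT).
A53_TIME_US, A53_CLOCK_GHZ = 19.4, 1.4
C7X_TIME_US, C7X_CLOCK_GHZ = 1.8, 1.0


def speedup(baseline, accelerated):
    """Ratio of a baseline cost to an accelerated cost (same units)."""
    return baseline / accelerated


def cycles(time_us, clock_ghz):
    """Clock cycles elapsed in time_us at clock_ghz (1 GHz = 1000 cycles/us)."""
    return time_us * clock_ghz * 1e3


if __name__ == "__main__":
    # Wall-clock comparison, as reported in the table.
    wall_clock = speedup(A53_TIME_US, C7X_TIME_US)

    # Cycle-normalized comparison (derived here, not stated in the source).
    per_cycle = speedup(
        cycles(A53_TIME_US, A53_CLOCK_GHZ),
        cycles(C7X_TIME_US, C7X_CLOCK_GHZ),
    )

    print(f"wall-clock speedup: {wall_clock:.1f}x")
    print(f"cycle-count ratio:  {per_cycle:.1f}x")
```

The wall-clock ratio is the one relevant to system latency. The cycle-count ratio only indicates how the gap would look if both cores ran at the same frequency, which is a simplifying assumption since memory timing does not scale with the core clock.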