SPRADM6 December 2024 AM62D-Q1

 


Fast Fourier Transform

Fast Fourier Transform (FFT) is one of the most common signal processing algorithms. The FFTLIB library contains a number of FFT kernels for various batch sizes and data types. These kernels currently run on the C7x only and do not yet use the MMA. FFTLIB is included in the FREERTOS-SDK for AM62D, and the library documentation contains instructions to build and run the kernels. Table 2-1 shows performance data for complex-input, complex-output FFTs with 16-bit complex integer and 32-bit complex float data types, for various FFT sizes and batch sizes executed on the C7x. Performance is reported as the total number of C7x clock cycles needed to perform the FFT, and as the time to process each batch with the C7x clocked at 1.0GHz.

Table 2-1 FFT Performance on C7x

FFT Parameters          16-bit int complex input and complex output     32-bit float complex input and complex output
FFT Size   Batch Size   Total [Cycles]   Time per batch [µs]            Total [Cycles]   Time per batch [µs]
128        128          12806            0.1                            24221            0.18
256        64           13944            0.21                           24850            0.38
512        32           13793            0.43                           26826            0.83
1024       16           16012            1.0                            29783            1.86
2048       8            15981            1.99                           31650            3.95
4096       4            17112            4.27                           34834            8.70
8192       2            17063            8.53                           36304            18.15
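The relationship between the two columns reported for each data type in Table 2-1 can be sketched in a few lines of Python. This is an illustration only: the helper name time_per_batch_us is hypothetical, and the sketch assumes that the time per batch is the total cycle count divided by the batch size and by the 1.0GHz clock, which is consistent with the tabulated values. The last digit can differ from the table depending on how the values were rounded.

```python
def time_per_batch_us(total_cycles: int, batch_size: int, clock_hz: float = 1.0e9) -> float:
    """Convert a total cycle count into microseconds per batch.

    Assumes the total covers all batches and the core runs at clock_hz.
    """
    cycles_per_batch = total_cycles / batch_size
    return cycles_per_batch / clock_hz * 1e6


# (FFT size, batch size, total cycles) from the 16-bit integer columns of Table 2-1
rows_int16 = [(128, 128, 12806), (1024, 16, 16012), (8192, 2, 17063)]

for fft_size, batch_size, total_cycles in rows_int16:
    t_us = time_per_batch_us(total_cycles, batch_size)
    print(f"{fft_size:5d}-point FFT: {t_us:.2f} us per batch")
```

The same helper applies to the 32-bit float columns by passing the corresponding total cycle counts.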

To show the performance gain of the C7x DSP over Arm, the FFT execution time on the C7x DSP is compared with the execution time of the same type of FFT on an Arm Cortex-A53 core. The Arm benchmark uses the implementation from the Ne10 library, which leverages the Advanced SIMD (NEON) acceleration of the Cortex-A53. This library is not included in the SDK but can be downloaded from the official Ne10 repository. Because the library requires a Linux® OS, the Arm benchmark was run on AM62A, which has the same Cortex-A53 cores as AM62D. Table 2-2 shows the execution time of a 1024-point single-precision floating-point complex FFT on the Arm Cortex-A53 and on the C7x DSP. The results show that the C7x DSP is approximately 11x faster than the Cortex-A53.

Table 2-2 C7x DSP vs Arm CPU executing Complex FFT

                                        Arm Cortex-A53 at 1.4GHz, single thread/core   C7x at 1.0GHz   C7x (1.0GHz) to A53 (1.4GHz) Improvement
1024-point Complex FFT Execution Time   19.4µs                                         1.8µs           Approx. 11x
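The improvement figure in Table 2-2 is a ratio of wall-clock execution times, with the two cores running at different clock rates. The short Python sketch below reproduces that ratio from the tabulated times. As an additional illustration that is not part of the original measurements, it also derives a ratio of clock cycles, under the assumption that each core ran at its stated frequency for the whole measurement. The function and variable names are hypothetical.

```python
def speedup(baseline_us: float, accelerated_us: float) -> float:
    """Wall-clock speedup of the accelerated run over the baseline run."""
    return baseline_us / accelerated_us


def cycles(time_us: float, clock_hz: float) -> float:
    """Clock cycles elapsed in time_us microseconds at clock_hz."""
    return time_us * 1e-6 * clock_hz


# Execution times and clock rates from Table 2-2
A53_TIME_US, A53_CLOCK_HZ = 19.4, 1.4e9
C7X_TIME_US, C7X_CLOCK_HZ = 1.8, 1.0e9

# Ratio of execution times, which is the figure reported in the table
wall_clock_gain = speedup(A53_TIME_US, C7X_TIME_US)

# Derived ratio of cycle counts (an assumption-based illustration)
cycle_gain = cycles(A53_TIME_US, A53_CLOCK_HZ) / cycles(C7X_TIME_US, C7X_CLOCK_HZ)

print(f"Wall-clock speedup: {wall_clock_gain:.1f}x")
print(f"Cycle-count ratio:  {cycle_gain:.1f}x")
```

The wall-clock ratio is the relevant figure when comparing end-to-end latency. The cycle-count ratio only indicates how much of the gain remains after the difference in clock frequency is factored out.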