SPRADO9 March 2025 AM62L

 


LMBench

LMBench is a suite of microbenchmarks for processor cores and operating system primitives. The memory bandwidth and latency tests are the most relevant for modern embedded processors. Results vary slightly (less than 10%) from run to run.

The LMBench benchmark bw_mem measures achieved memory copy performance. With the cp parameter, the benchmark copies an array with an inline loop. With the bcopy parameter, the benchmark uses the runtime glibc version of the standard memcpy() function. The glibc implementation is highly optimized (for example, with SIMD instructions), which results in higher performance.

When the size parameter is equal to or smaller than the cache size at a given level, the benchmark measures the memory bandwidth that software can achieve at that level with a typical for loop or memcpy() type operation. The typical use is external memory bandwidth calculation. The bandwidth is calculated so that a byte read and then written counts as 1, which gives roughly half of the STREAM copy result.

The benchmark can also create parallel threads with the -P parameter. To get the maximum multi-core memory bandwidth, create the same number of threads as there are cores available to the operating system, which is 2 for AM62Lx Linux (-P 2). To give a full performance characterization of the AM62Lx, the LMBench tests are run on full factorial combinations of number of cores and clock frequency. The code block below shows the terminal printout of executing the LMBench commands.

root@am62lxx-evm:~# bw_mem 8M bcopy
8.00 1000.00
root@am62lxx-evm:~# bw_mem -P 2 8M bcopy
8.00 1127.87
root@am62lxx-evm:~# bw_mem 8M cp
8.00 579.54
root@am62lxx-evm:~# bw_mem -P 2 8M cp
8.00 645.71

Table 3-1 shows the measured bandwidth and the efficiency compared to the theoretical wire rate. The wire rate used is the LPDDR4 transfer rate in MT/s multiplied by the bus width in bytes and divided by two, because the read and the write that make up a copy both consume the bus.

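Equation 1. $$\text{Efficiency} = \frac{\text{Measured Speed}}{\dfrac{\text{LPDDR4 MT/s} \times \text{width}}{2}} = \frac{\text{Measured Speed}}{\dfrac{1600 \times 2\,\text{B}}{2}} = \frac{\text{Measured Speed}}{1600}$$
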
Table 3-1 LMBench Results

| Command | Description | Arm Cortex-A53 at 1.25 GHz, LPDDR4-1600 MT/s, 16-bit [MB/s] | LPDDR4 Efficiency [%] |
|---|---|---|---|
| bw_mem 8M bcopy | Single core, glibc memcpy | 1,000 | 62 |
| bw_mem -P 2 8M bcopy | Dual core, glibc memcpy | 1,127 | 70 |
| bw_mem 8M cp | Single core, inline copy loop | 579 | 36 |
| bw_mem -P 2 8M cp | Dual core, inline copy loop | 645 | 40 |
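
As a worked check of Equation 1, the short Python sketch below recomputes the efficiency column from the bw_mem printout. The function and variable names are illustrative only and are not part of LMBench; the 1600 MT/s rate and 2-byte (16-bit) width are taken from the table header.

```python
def wire_rate_mb_s(transfer_rate_mt_s, bus_width_bytes):
    """Copy wire rate in MB/s: a copy reads and writes, so the bus rate is halved."""
    return transfer_rate_mt_s * bus_width_bytes / 2


def efficiency_percent(measured_mb_s, transfer_rate_mt_s=1600, bus_width_bytes=2):
    """Equation 1 expressed as a percentage of the copy wire rate."""
    return 100.0 * measured_mb_s / wire_rate_mb_s(transfer_rate_mt_s, bus_width_bytes)


# Measured values copied from the bw_mem terminal printout (MB/s).
measured = {
    "bw_mem 8M bcopy": 1000.00,
    "bw_mem -P 2 8M bcopy": 1127.87,
    "bw_mem 8M cp": 579.54,
    "bw_mem -P 2 8M cp": 645.71,
}

for command, speed in measured.items():
    print(f"{command:22s} {speed:8.2f} MB/s  {efficiency_percent(speed):5.1f} %")
```

The percentages in Table 3-1 match these results with the fractional part dropped.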

The LMBench benchmark lat_mem_rd measures the observed memory access latency for cache hits and for external memory (LPDDR4 on AM62Lx). The two arguments are the maximum size of the memory block to access, in megabytes (64 in the code block below), and the stride of the reads, in bytes (512). These values are selected to measure the latency of the caches and external memory rather than the effect of the processor data prefetchers or other speculative execution. For many access patterns prefetching works, but this benchmark is most useful for measuring the case when it does not.
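
The access pattern behind this kind of latency test can be illustrated with a small Python sketch. This is a minimal illustration of a strided chain of dependent loads, not the LMBench implementation; the function names and the 4096-byte block size are hypothetical. Python interpreter overhead dominates any timing, so the sketch only shows the order of accesses and does not measure latency.

```python
def build_chain(array_bytes, stride_bytes):
    """Build a ring of links: slot i stores the byte offset of the slot to load next."""
    slots = array_bytes // stride_bytes
    # Slot i points at slot i + 1; the last slot wraps back to offset 0.
    return [((i + 1) % slots) * stride_bytes for i in range(slots)]


def walk(chain, stride_bytes, loads):
    """Follow the chain for a number of loads and return the offsets visited."""
    offset = 0
    visited = []
    for _ in range(loads):
        visited.append(offset)
        # Dependent load: the next offset is known only after this load completes,
        # so consecutive accesses cannot overlap.
        offset = chain[offset // stride_bytes]
    return visited


# Hypothetical 4096-byte block with the 512-byte stride used in this section.
chain = build_chain(array_bytes=4096, stride_bytes=512)
print(walk(chain, 512, 10))
```

Because each load depends on the previous one, the time per load approximates the round-trip latency of whichever memory level holds the block.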

The code block below shows the terminal printout of executing the lat_mem_rd command. The left column is the size of the accessed memory block in megabytes, and the right column is the round-trip read latency in nanoseconds. The command was executed with the Arm Cortex-A53 running at a clock frequency of 1.25 GHz.

root@am62lxx-evm:~# lat_mem_rd 64 512
"stride=512
0.00049 2.405
0.00098 2.404
0.00195 2.404
0.00293 2.405
0.00391 2.404
0.00586 2.404
0.00781 2.405
0.01172 2.404
0.01562 2.405
0.02344 2.525
0.03125 5.903
0.04688 7.172
0.06250 8.453
0.09375 9.674
0.12500 10.239
0.18750 10.838
0.25000 30.115
0.37500 106.020
0.50000 155.871
0.75000 178.770
1.00000 181.088
1.50000 182.258
2.00000 182.664
3.00000 182.958
4.00000 182.947
6.00000 183.280
8.00000 183.235
12.00000 183.265
16.00000 183.375
24.00000 183.149
32.00000 183.076
48.00000 183.513
64.00000 183.483

Figure 3-1 shows connected scatter plots of the memory latency results at 1.25 GHz. Based on the memory block size (x-axis), the plot can be divided into three regions. In the first region, the accessed memory block is smaller than the L1 cache. The data can be assumed to reside entirely in L1, so the latency in this region is a close estimate of the L1 cache latency. In the second region, the accessed memory block is bigger than the L1 cache but smaller than the L2 cache. The latency in this region is a mix of L1, L2, and LPDDR4 latency, and the latency at the middle of the region can be taken as a close representation of the L2 latency. In the third region, the accessed memory block is bigger than the L2 cache. The last reading in this region reflects the LPDDR4 latency.

Figure 3-1 Memory Read Latency

Table 3-2 summarizes the Arm Cortex-A53 read latency.

Table 3-2 Memory Read Latency Results

| Memory | Arm Cortex-A53 at 1.25 GHz [ns] |
|---|---|
| L1 cache | 2.4 |
| L2 cache | 10.2 |
| LPDDR4-1600 MT/s | 183.5 |
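
The three-region reading of the lat_mem_rd printout can be sketched in Python as follows. This is an illustration under stated assumptions, not a TI or LMBench tool: the helper names are hypothetical, and the 32 KB and 256 KB region boundaries are assumptions inferred from where the latency steps up in the printout. Only a subset of the printout lines is included.

```python
# Assumed region boundaries in MB, inferred from the latency steps in the printout.
ASSUMED_L1_MB = 32 / 1024
ASSUMED_L2_MB = 256 / 1024

# Subset of the lat_mem_rd terminal printout shown earlier.
printout = '''"stride=512
0.00049 2.405
0.01562 2.405
0.03125 5.903
0.06250 8.453
0.12500 10.239
0.18750 10.838
0.25000 30.115
1.00000 181.088
64.00000 183.483
'''


def parse_lat_mem_rd(text):
    """Return (size_mb, latency_ns) pairs, skipping the stride header line."""
    samples = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # header or blank line
        try:
            samples.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue  # not a numeric data line
    return samples


def summarize(samples, l1_mb, l2_mb):
    """Pick one representative latency per region, following the text above."""
    l1 = samples[0][1]  # smallest block, assumed to sit entirely in L1
    middle = (l1_mb + l2_mb) / 2  # middle of the L1-to-L2 region
    l2 = min(samples, key=lambda s: abs(s[0] - middle))[1]
    ddr = samples[-1][1]  # last reading, far larger than L2
    return {"L1": round(l1, 1), "L2": round(l2, 1), "LPDDR4": round(ddr, 1)}


print(summarize(parse_lat_mem_rd(printout), ASSUMED_L1_MB, ASSUMED_L2_MB))
```

With these assumptions, the sample nearest the middle of the second region is the 0.125 MB reading, which is the value reported for the L2 cache in Table 3-2.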

Latency: lat_mem_rd-stride128-szN, where N is equal to or smaller than the cache size at a given level, measures the cache miss penalty. When N is at least double the size of the last-level cache, the result is the latency to external memory.

Bandwidth: bw_mem_bcopy-N, where N is equal to or smaller than the cache size at a given level, measures the memory bandwidth that software can achieve with a memcpy() type operation. The typical use is external memory bandwidth calculation. The bandwidth is calculated so that a byte read and then written counts as 1, which gives roughly half of the STREAM copy result.
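
The difference between the two counting conventions can be shown with a short Python sketch. The function names, the byte count, and the elapsed time are hypothetical values chosen for illustration; they are not measurements from AM62Lx.

```python
def bw_mem_style(bytes_copied, seconds):
    """bw_mem convention: a byte that is read and then written counts once."""
    return bytes_copied / seconds


def stream_style(bytes_copied, seconds):
    """STREAM copy convention: the read and the write of each byte both count."""
    return 2 * bytes_copied / seconds


# Hypothetical copy: 8,000,000 bytes moved in 8 ms.
copied = 8_000_000
elapsed = 0.008

print(f"bw_mem-style: {bw_mem_style(copied, elapsed) / 1e6:.0f} MB/s")
print(f"STREAM-style: {stream_style(copied, elapsed) / 1e6:.0f} MB/s")
```

For the same copy, the STREAM-style figure is exactly twice the bw_mem-style figure. Measured results from the two benchmarks differ only roughly by that factor, since the copy loops themselves are not identical.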