SDAA175 November 2025 AM62A3, AM62A7, AM67A, TDA4VM

 

Abstract
Trademarks
1 Introduction
2 C7xMMA Cache Structure
3 Model DDR Read/Write Analysis for a Compiled TIDL Model
4 Model Optimization
  4.1 Simple Structure Models
  4.2 Complex Structure
    4.2.1 Residual Structures
    4.2.2 Parallel Branch Merge
5 Summary
6 References

2 C7xMMA Cache Structure

DDR bandwidth optimization starts with understanding the memory hierarchy of the TI C7xMMA, shown in simplified form in Figure 2-1.

Figure 2-1 C7xMMA Three-Level Cache Structure

The C7xMMA uses a three-level cache structure: in addition to external DDR, it has internal L1D, L2, and L3/MSMC caches. L1D is the smallest and closest to the compute core (typically 16 KB). L2 is farther from the core (typically 224 KB or 448 KB) but is tightly coupled to the data-movement mechanisms that feed the MMA. On TDA4x devices, L3 is the Multicore Shared Memory Controller (MSMC); on other SoCs, it is SRAM managed individually by each C7xMMA.

Note: The L1D, L2, and L3 terminology used here follows the TIDL framework's description. Chip data sheets may instead refer to L1P, L1D, L2, and, on some SoCs, L3 (for example, MSMC on TDA4VM). The sizes of the L2 and L3 regions are listed in the device_config.cfg file included in the TIDL tools.
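As a rough illustration of why the larger L3 matters for passing feature maps between layers, the Python sketch below computes the size of one feature map and compares it with the typical L2 sizes quoted above. It assumes a dense 8-bit C×H×W layout with no padding or alignment, which TIDL may add in practice, and the 2 MB L3 size is a hypothetical placeholder; take the real L2 and L3 sizes for your device from device_config.cfg.

```python
# Illustrative sizing check. L1D/L2 values are the typical sizes quoted in the
# text; HYPOTHETICAL_L3_BYTES is a placeholder, not a real device value.
KB = 1024

TYPICAL_L1D_BYTES = 16 * KB
TYPICAL_L2_BYTES = (224 * KB, 448 * KB)
HYPOTHETICAL_L3_BYTES = 2048 * KB  # replace with the L3 size from device_config.cfg


def feature_map_bytes(channels: int, height: int, width: int,
                      bytes_per_elem: int = 1) -> int:
    """Size of a dense C x H x W feature map, ignoring padding/alignment."""
    return channels * height * width * bytes_per_elem


def fits(size_bytes: int, capacity_bytes: int) -> bool:
    """True if a buffer of size_bytes fits entirely in capacity_bytes."""
    return size_bytes <= capacity_bytes


if __name__ == "__main__":
    fm = feature_map_bytes(64, 112, 112)  # 8-bit 64x112x112 feature map
    print(fm // KB, "KB")                 # 784 KB
    print([fits(fm, l2) for l2 in TYPICAL_L2_BYTES])  # [False, False]
    print(fits(fm, HYPOTHETICAL_L3_BYTES))            # True
```

A single mid-network feature map of this shape already exceeds either typical L2 size, so L2 serves as working memory while whole feature maps are held in L3 (or spill to DDR).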

Figure 2-2 shows how these caches are used while a typical layer is inferred, which involves four operations. In Operation 1, DMA transfers data directly from DDR to L2. Operation 2 moves data from L3 to L2, Operation 3 moves data from L2 to L3, and Operation 4 moves data from L3 to DDR. Operations 2 and 3 are more than ten times as efficient as Operations 1 and 4.

Reading the previous layer's output feature map as the current layer's input leads to one of three scenarios: only Operation 1, when the input layer data or the previous layer's output resides entirely in DDR; only Operation 2, when the previous feature map fits completely in L3/MSMC; or both Operations 1 and 2, when the previous output is too large for L3 and is partially stored in DDR. After the current layer's output feature map is computed, Operation 3 is used first to move it to L3. If L3 capacity is exceeded, Operation 4 stores the surplus in DDR. Weights are always stored in DDR and fetched directly into L2 when needed.

Figure 2-2 C7xMMA Cache Operations
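The four operations and three input scenarios described above can be captured in a small model. The Python sketch below is illustrative only: the function names are hypothetical, l3_free_bytes is treated as a given input, and the sketch does not model how TIDL actually allocates L3 between a layer's input and output. It follows the stated rules: L3 is filled first, any surplus goes to DDR, and weights are always read from DDR.

```python
from __future__ import annotations

OP_DDR_TO_L2 = 1  # Operation 1: DMA from DDR directly to L2
OP_L3_TO_L2 = 2   # Operation 2: L3 -> L2
OP_L2_TO_L3 = 3   # Operation 3: L2 -> L3
OP_L3_TO_DDR = 4  # Operation 4: L3 -> DDR (surplus that does not fit in L3)


def input_ops(in_bytes: int, in_l3_bytes: int) -> tuple[int, ...]:
    """Operations used to bring a layer's input feature map into L2."""
    if not 0 <= in_l3_bytes <= in_bytes:
        raise ValueError("in_l3_bytes must be between 0 and in_bytes")
    if in_l3_bytes == 0:
        return (OP_DDR_TO_L2,)               # entirely in DDR
    if in_l3_bytes == in_bytes:
        return (OP_L3_TO_L2,)                # entirely in L3
    return (OP_DDR_TO_L2, OP_L3_TO_L2)       # split between DDR and L3


def place_output(out_bytes: int, l3_free_bytes: int) -> tuple[int, int]:
    """Split a layer's output into (bytes kept in L3 via Operation 3,
    bytes spilled to DDR via Operation 4). L3 is filled first."""
    kept = min(out_bytes, max(0, l3_free_bytes))
    return kept, out_bytes - kept


def layer_ddr_bytes(in_bytes: int, in_l3_bytes: int, out_bytes: int,
                    l3_free_bytes: int, weight_bytes: int) -> tuple[int, int]:
    """(DDR bytes read, DDR bytes written) for one layer. Weights are always
    read from DDR; only the part of the input not in L3 is read from DDR."""
    _, spilled = place_output(out_bytes, l3_free_bytes)
    return (in_bytes - in_l3_bytes) + weight_bytes, spilled


if __name__ == "__main__":
    # Hypothetical numbers: a 784 KB output with 512 KB of free L3.
    kept, spilled = place_output(802816, 512 * 1024)
    print(kept, spilled)             # 524288 278528
    print(input_ops(802816, kept))   # (1, 2)
    # Next layer: 392 KB output, 512 KB free L3, 36 KB of weights.
    print(layer_ddr_bytes(802816, kept, 401408, 512 * 1024, 36864))  # (315392, 0)
```

Chaining place_output into the next layer's input_ops reproduces the third scenario: an output that only partly fits in L3 forces the next layer to use both Operations 1 and 2.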

This three-level cache architecture significantly improves inference efficiency by avoiding slow DDR reads and writes during computation and by conserving DDR bandwidth. The key to improving efficiency and saving bandwidth is to maximize L3 utilization so that feature maps are not stored in DDR. The next section describes how to analyze a model's memory usage.
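To make the bandwidth argument concrete, the Python sketch below estimates the DDR traffic caused by intermediate feature maps in a simple chain of layers and converts it to an average bandwidth. It rests on simplifying assumptions that are not TIDL's actual allocation policy: each feature map may use the whole L3 capacity, any surplus is written to DDR (Operation 4) and read back once by the next layer (Operation 1), and model input/output and weights are excluded. The 2 MB L3 size, the feature-map sizes, and the 30 fps rate are hypothetical.

```python
from __future__ import annotations

MB = 1024 * 1024


def feature_map_ddr_bytes(intermediate_sizes: list[int], l3_bytes: int) -> int:
    """DDR traffic from intermediate feature maps in a linear chain.

    Assumption: each feature map may use up to l3_bytes of L3; any surplus
    is written to DDR (Operation 4) and read back by the next layer
    (Operation 1), costing 2 * surplus bytes. Weights are excluded.
    """
    return sum(2 * max(0, size - l3_bytes) for size in intermediate_sizes)


def ddr_bandwidth_gbps(bytes_per_inference: int, fps: float) -> float:
    """Average DDR bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_per_inference * fps / 1e9


if __name__ == "__main__":
    l3 = 2 * MB  # hypothetical L3 capacity; use the value from device_config.cfg
    before = feature_map_ddr_bytes([4 * MB, 2 * MB, 1 * MB], l3)
    after = feature_map_ddr_bytes([2 * MB, 2 * MB, 1 * MB], l3)
    print(before)  # 4194304 bytes per inference
    print(after)   # 0 bytes per inference
    print(ddr_bandwidth_gbps(before, 30), ddr_bandwidth_gbps(after, 30))
```

Under these assumptions, shrinking the first intermediate feature map from 4 MB to 2 MB lets it fit in L3 and removes its feature-map DDR traffic entirely, which is the effect that maximizing L3 utilization aims for.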