
4.2.2 Parallel Branch Merge

Applications often involve multiple deep parallel branches merging into one, or one branch splitting into several deep parallel paths. This is especially common for multi-input neural networks. Figure 4-3 shows part of a classic four-input BEV network, where the paths merge after the GridSample operators.

Because the paths are deep before merging, feature maps must be placed in DDR at the merge point, so some DDR bandwidth consumption is unavoidable here. Such architectures should be used only when needed, to avoid excessive DDR bandwidth and the resulting bottlenecks. However, the DDR read bandwidth caused by weights can often be reduced. The model architecture can be modified to consolidate multiple input heads and set the batch dimension of certain model layers to a value greater than 1, so that the weights need to be loaded only once.
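As a rough illustration of why this helps, consider the weight-related DDR read traffic. The weight size, branch count, and frame rate below are hypothetical values for a sketch, not measurements from the network in the figure, and the calculation assumes the shared weights are re-fetched from DDR on every pass:

```python
# Illustrative arithmetic only; all numbers are hypothetical.
shared_backbone_weights_mb = 5.0   # weights of the backbone shared by every branch
num_branches = 4                   # parallel input heads before the merge
fps = 30                           # inference rate

# Each branch reloads the same weights from DDR when the branches run separately.
separate_weight_read = shared_backbone_weights_mb * num_branches * fps   # 600 MB/s

# With the branches folded into the batch dimension, the weights load once per frame.
batched_weight_read = shared_backbone_weights_mb * fps                   # 150 MB/s

print(f"separate: {separate_weight_read:.0f} MB/s, batched: {batched_weight_read:.0f} MB/s")
```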

Figure 4-3. Complex structure merging multiple parallel branches

For example, in the above figure, the same backbone network precedes the GridSample layers, and the feature maps at each layer of this backbone are relatively small. The four branches can be merged into two or even one, with the batch dimension adjusted accordingly. These are then followed by appropriate Slice or data-shaping layers to separate the batches again so they can be recombined by the Add layers shown. This approach reduces or even eliminates repeated loading of the same weights, thereby lowering the DDR read bandwidth overhead. This method requires attention to the size of the feature maps in the merged backbone network, because increasing the batch dimension also increases the activation footprint, which can itself generate new DDR traffic if the enlarged feature maps no longer fit in on-chip memory.
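The sketch below shows this batching transformation in PyTorch, before export and compilation. The SharedBackbone module, tensor shapes, and branch count are hypothetical placeholders rather than the actual BEV network in the figure; the point is only that concatenating the branch inputs along the batch dimension lets the shared weights be applied in a single pass, with the batch split apart again (exported as Slice/Split operators) before the Add layers.

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Small convolutional backbone shared by all input heads (illustrative only)."""
    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.body(x)

backbone = SharedBackbone().eval()

# Four feature maps arriving from the four parallel branches (hypothetical shapes).
feats = [torch.randn(1, 64, 32, 88) for _ in range(4)]

with torch.no_grad():
    # Original formulation: the backbone runs once per branch, so its weights
    # are fetched once per branch at inference time.
    outs_separate = [backbone(f) for f in feats]

    # Batched formulation: concatenate the branches along the batch dimension,
    # run the backbone once (weights applied in a single pass), then slice the
    # batch back apart for the downstream per-branch consumers.
    batched = torch.cat(feats, dim=0)                 # shape (4, 64, 32, 88)
    out = backbone(batched)
    outs_batched = list(torch.split(out, 1, dim=0))   # exported as Slice/Split ops

    # The downstream Add layers then recombine the per-branch results as before.
    merged = outs_batched[0] + outs_batched[1] + outs_batched[2] + outs_batched[3]

# Numerically, the two formulations produce the same per-branch outputs.
for a, b in zip(outs_separate, outs_batched):
    assert torch.allclose(a, b, atol=1e-6)
```

Exporting such a module produces a single backbone subgraph with batch size 4 followed by Slice/Split and Add nodes, matching the consolidated structure described above; whether this pays off in practice still depends on the batched feature maps fitting in on-chip memory.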