SPRY344A January 2022 – March 2023 TDA4VM, TDA4VM-Q1


  1.   At a glance
  2.   Authors
  3.   Introduction
  4.   Defining AI at the edge
  5.   What is an efficient edge AI system?
    1.     Selecting an SoC architecture
    2.     Programmable core types and accelerators
  6.   Designing edge AI systems with TI vision processors
    1.     Deep learning accelerator
    2.     Imaging and computer vision hardware accelerators
    3.     Smart internal bus and memory architecture
    4.     Optimized system BOM
    5.     Easy-to-use software development environment
  7.   Conclusion

Deep learning accelerator

While suitable for other tasks, CPUs and GPUs are not the best cores for accelerating deep learning: CPUs are throughput-limited and consume significant power, while GPUs consume the most power of any core type and have a large memory footprint.

TI vision AI processors integrate a deep learning accelerator that pairs a matrix multiplication accelerator (MMA), implemented as fixed-function hardware, with a programmable C71 DSP. The MMA enables high-performance (4,096 8-bit fixed-point multiply-accumulates per cycle), low-power tensor acceleration, while the C71 DSP accelerates vector and scalar operations and manages the MMA.
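As a rough sanity check, the per-cycle MAC figure can be converted into a peak TOPS number. The sketch below assumes a nominal 1 GHz clock and counts two operations per multiply-accumulate; neither assumption is stated in this section.

```python
# Rough peak-throughput estimate for an MMA-style accelerator.
# ASSUMPTIONS (not from the text): 1 GHz clock, 2 ops per MAC.
macs_per_cycle = 4096   # 4K 8-bit fixed-point multiply-accumulates per cycle
ops_per_mac = 2         # one multiply + one add
clock_hz = 1e9          # assumed nominal clock frequency

tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(f"Peak throughput: {tops:.0f} TOPS")  # prints "Peak throughput: 8 TOPS"
```

Under these assumptions the arithmetic lands at roughly 8 TOPS, consistent with the accelerator rating cited later in this section.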

The combination of the MMA and C71 DSP yields an accelerator with among the highest performance (inferences per second) and power efficiency (inferences per watt) in the industry. The programming flexibility of the C71 core helps you keep pace with edge AI innovations, and when it is not running deep learning workloads, the core can handle other computationally intensive tasks while consuming low power.

The smart memory architecture makes high utilization of the accelerator possible. The accelerator comes with its own memory subsystem; a dedicated 4D-programmable direct memory access (DMA) engine for data transfer; and specialized streaming hardware that can bring data directly from external memory to the functional units of the C71 core and MMA, bypassing the cache. Tiling and supertiling features minimize data transfer to and from external memory.
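The benefit of tiling can be illustrated with a generic blocked matrix multiply. This is a software simplification of the idea, not TI's implementation: in the actual hardware the DMA engine and streaming units move the tiles, and the reuse happens in on-chip memory rather than in a CPU cache.

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    """Blocked matrix multiply. Each (tile x tile) sub-block is loaded
    once and reused across many multiply-accumulates, which is the same
    principle tiling uses to cut traffic to external memory."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((n, m), dtype=np.result_type(a, b))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                # In hardware, these sub-blocks would be fetched by the
                # DMA engine and kept in local memory while reused here.
                c[i:i+tile, j:j+tile] += (
                    a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
                )
    return c
```

The result matches an untiled multiply; only the data-movement pattern changes, which is exactly what makes tiling attractive when external memory bandwidth is the bottleneck.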

Table 2 shows 8-bit fixed-point inference performance on the AM68A and TDA4VM, which integrate an 8-TOPS deep learning accelerator. Reported performance is with batch size 1 and a single 32-bit LPDDR4 interface.

Table 2. Inference benchmarks on MLPerf recommended models.

Network             Image Resolution    Frames Per Second (fps)
MobileNet_v1        224 × 224           741
Resnet-50 V1.5      224 × 224           162
SSD-MobileNets-V1   300 × 300           385
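Because the figures above are measured at batch size 1, throughput is approximately the inverse of per-frame latency. The short sketch below converts the Table 2 numbers to milliseconds per frame (ignoring any pipelining effects, which the text does not describe).

```python
# Approximate per-frame latency from the Table 2 throughput numbers.
# Valid at batch size 1, where latency ~= 1 / throughput.
benchmarks_fps = {
    "MobileNet_v1": 741,
    "Resnet-50 V1.5": 162,
    "SSD-MobileNets-V1": 385,
}
for network, fps in benchmarks_fps.items():
    print(f"{network}: {1000 / fps:.2f} ms/frame")
```

For example, 741 fps corresponds to roughly 1.35 ms per frame, which is the figure that matters when budgeting an end-to-end vision pipeline.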

Disclaimer: TI has used the MLPerf recommended models and guidelines for edge AI inference benchmarking. TI has not yet submitted the results to the MLCommons organization.