SPRY344A January 2022 – March 2023 TDA4VM , TDA4VM-Q1
While well suited to many other tasks, CPUs and GPUs are not the best cores for accelerating deep learning: CPUs are throughput-limited and power-hungry, while GPUs consume the most power of all cores and have a large memory footprint.
TI vision AI processors integrate a deep learning accelerator that pairs a matrix multiplication accelerator (MMA), a fixed-function hardware block, with a programmable C71 DSP. The MMA provides high-performance (4K 8-bit fixed-point multiply-accumulate operations per cycle), low-power tensor acceleration, while the C71 DSP accelerates vector and scalar operations and manages the MMA.
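As a rough sanity check, the quoted MAC rate maps to peak throughput as sketched below. The 1 GHz clock is an assumption for illustration (it is not stated in this section; confirm against the device datasheet), and each multiply-accumulate is counted as two operations, per the usual TOPS convention:

```python
# Peak-throughput estimate for the MMA.
# Assumption: 1 GHz accelerator clock (illustrative, not stated in this document).
macs_per_cycle = 4096   # "4K" 8-bit multiply-accumulates per cycle
ops_per_mac = 2         # one multiply + one accumulate
clock_hz = 1e9          # assumed 1 GHz

peak_tops = macs_per_cycle * ops_per_mac * clock_hz / 1e12
print(peak_tops)        # ~8.2, consistent with the 8-TOPS class of accelerator
```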
The combination of the MMA and C71 DSP yields an accelerator with among the highest performance (inferences per second) and power efficiency (inferences per watt) in the industry. The programmability of the C71 core lets you keep pace with edge AI innovation, and when it is not running deep learning workloads, the core can take on other computationally intensive tasks at low power.
The smart memory architecture makes high utilization of the accelerator possible. The accelerator comes with its own memory subsystem; a dedicated 4D-programmable direct memory access (DMA) engine for data transfer; and specialized streaming hardware that can bring data directly from external memory to the functional units of the C71 core and MMA, bypassing the cache. Tiling and supertiling features minimize data transfer to and from external memory.
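To illustrate the general idea behind tiling (a minimal generic sketch, not TI's implementation): a large matrix multiplication is processed in blocks small enough that each block's operands fit in on-chip memory, so each input value fetched from external memory is reused across many output elements instead of being re-read.

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Block-tiled matrix multiply: C = A @ B.

    Each (tile x tile) block of A and B is loaded once and reused,
    which is what keeps external-memory traffic low on a real
    accelerator with limited on-chip memory.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=np.result_type(A, B))
    for i in range(0, M, tile):          # rows of output block
        for j in range(0, N, tile):      # columns of output block
            for k in range(0, K, tile):  # inner (reduction) dimension
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C
```

The result is identical to a plain `A @ B`; only the order of memory accesses changes, which is the property the hardware tiling and supertiling features exploit.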
Table 2 shows 8-bit fixed-point inference performance on the AM68A and TDA4VM, each with an 8-TOPS accelerator. Reported performance is with batch size 1 and a single 32-bit LPDDR4 interface.
| Network | Image Resolution | Frames Per Second (fps) |
|---|---|---|
| MobileNet_v1 | 224 × 224 | 741 |
| ResNet-50 v1.5 | 224 × 224 | 162 |
| SSD-MobileNets-V1 | 300 × 300 | 385 |
Disclaimer: TI has used the MLPerf recommended models and guidelines for edge AI inference benchmarking. TI has not yet submitted these results to the MLCommons organization.