SPRAD74 March 2023 AM62A3, AM62A3-Q1, AM62A7, AM62A7-Q1

Abstract
Trademarks
1 Smarter Cameras at the Edge
2 AM6xA Scalable Portfolio and the AM62A
3 Smart Camera Use Cases
    3.1 Security Camera Example
4 Deep Learning on the AM62A
    4.1 Deep Learning Accelerator
    4.2 Edge AI Software
5 VPAC Vision Accelerator and ISP
6 Low-Power Performance
7 Call to Action
8 References

Deep Learning Accelerator

TI's processors use a state-of-the-art deep learning accelerator design. TI has a long history of digital signal processors (DSPs), which have become increasingly integrated into TI's SoCs; however, a DSP alone is insufficient for most vision deep learning models. TI's deep learning accelerator is a tight coupling of a C7x DSP and a custom matrix-multiply accelerator (MMA), which massively increases performance on neural networks (NNs), especially the convolutional NNs (CNNs) that are common in vision AI.

The AM62A's deep learning accelerator pairs a 256-bit C7x DSP with an MMA capable of performing a 32x32 matrix multiply on 8-bit integer values in a single clock cycle. Running at the maximum clock of 1 GHz, this provides a peak compute capacity of 2 TOPS, since the 32x32 matrix operation is 1024 multiply-accumulates (MACs, where each MAC is considered two operations). To ensure the MMA always has values to compute, the architecture includes multiple streaming engines that move 256 bits of data each clock cycle into the two input matrices and out of the single output matrix. Depending on the layers composing the neural network architecture, outputs from the MMA may be sent through the C7x to compute any non-linear functions within the layer. Developers need not program this themselves; API calls from the Arm cores reduce the complexity of programming the accelerator, as described in the Edge AI Software section.
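As a quick sanity check, the 2-TOPS figure follows directly from the MAC rate and the clock quoted above. The short sketch below reproduces that arithmetic; the variable names are illustrative only and are not part of any TI API.

```python
# Back-of-the-envelope peak-compute estimate for the AM62A accelerator,
# using only the figures quoted above (illustrative names, not a TI API).
MACS_PER_CYCLE = 32 * 32      # one 32x32 set of 8-bit MACs per clock cycle
OPS_PER_MAC = 2               # each MAC counts as a multiply plus an add
CLOCK_HZ = 1_000_000_000      # maximum clock of 1 GHz

tops = MACS_PER_CYCLE * OPS_PER_MAC * CLOCK_HZ / 1e12
print(f"Peak compute: {tops:.3f} TOPS")  # 2.048, quoted as 2 TOPS
```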

While TOPS is a common metric for quantifying machine learning performance on accelerators such as TPUs, VPUs, NPUs, and GPUs, one accelerator architecture may outperform another despite having a lower theoretical compute capacity. TI's architecture was designed to optimize power and performance by using a single large compute unit, the MMA, rather than many smaller ones in parallel. With many small units, more transfers to memory are required because less data is reused across subsequent execution cycles, and more transfers equate to higher power expenditure. Specially designed data-streaming engines ensure the 256-bit buffers within the accelerator hold the necessary data. A well-optimized application uses a model whose dimensions at each layer completely fill the MMA.
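To make the last point concrete, the sketch below estimates how completely a layer fills a 32-wide MMA when its channel counts are, or are not, multiples of 32. This is a simplified tiling model written for illustration, not a TI sizing tool; the function name and padding assumptions are the author's own.

```python
import math

def mma_utilization(ch_in: int, ch_out: int, width: int = 32) -> float:
    """Estimate MMA utilization for one layer under a simple tiling model:
    each dimension is processed in tiles of `width`, and any partially
    filled tile still consumes a full tile of compute."""
    used = ch_in * ch_out
    allocated = (math.ceil(ch_in / width) * width) * \
                (math.ceil(ch_out / width) * width)
    return used / allocated

# Channel counts that are multiples of 32 fill the MMA exactly ...
print(mma_utilization(64, 128))   # 1.0
# ... while off-size layers leave part of each tile idle.
print(mma_utilization(48, 100))   # ~0.59
```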

Figure 4-1. AI Accelerator Architecture