SDAA429 June   2026 MSPM0G5187

 


Model Memory Considerations

Developing edge AI solutions on resource-constrained embedded platforms requires balancing three competing factors: algorithmic performance (accuracy and latency), non-volatile storage (Flash/ROM), and run-time memory (SRAM).

  • Algorithmic Performance: This encompasses both inference accuracy and latency, which are primarily shaped by the model architecture, including network depth, layer width, and operator complexity.
  • Static Storage Memory (Flash/ROM): This is primarily determined by the model's total parameter count (weights and biases). In a fully quantized INT8 pipeline, each parameter maps directly to exactly one byte of Flash storage. Consequently, Flash usage grows linearly with the parameter count, so deeper networks with wider channels directly increase the static binary size. The input feature dimensions can also influence static memory usage, for example in a fully connected layer, where the weight count scales with the number of input features.
  • Dynamic Runtime Memory (SRAM/RAM): This is governed by the input feature size and the resulting activation tensors (feature maps) across intermediate layers. As a time-series slice propagates through the network, the processing core must allocate temporary workspaces to store layer inputs and outputs. Longer input temporal windows or higher feature dimensions increase this peak runtime RAM requirement roughly in proportion to the input size.
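The two memory rules above can be made concrete with a rough back-of-the-envelope estimate. The sketch below is only an illustration under stated assumptions: the toy model (one stride-1 Conv1D followed by a fully connected classifier), the channel count, the kernel size, and the class count are all hypothetical and are not the waveform classifier from this document. It follows the simplification used above of one byte per INT8 parameter and ignores code size and runtime overhead.

```python
def conv1d_params(in_ch, out_ch, kernel):
    # Weights plus one bias per output channel
    return in_ch * out_ch * kernel + out_ch


def dense_params(in_features, out_features):
    # Weights plus one bias per output unit
    return in_features * out_features + out_features


def estimate(input_len, channels=8, kernel=3, classes=4):
    """Estimate (flash_bytes, ram_bytes) for a hypothetical toy model.

    Assumptions: single-channel input, stride-1 Conv1D with 'same' padding,
    INT8 weights and activations (one byte per element).
    """
    conv = conv1d_params(1, channels, kernel)
    feature_map = channels * input_len          # conv output size in bytes
    dense = dense_params(feature_map, classes)  # depends on the input length
    flash = conv + dense                        # one byte per parameter
    ram = input_len + feature_map               # conv input and output held together
    return flash, ram


for n in (64, 128, 256):
    flash, ram = estimate(n)
    print(f"input={n:3d}  flash~{flash} B  ram~{ram} B")
```

In this toy model, doubling the input length doubles the activation memory, and it also nearly doubles Flash because the fully connected layer sees a feature map twice as large. Real deployments add compiler-generated code and buffers on top of these figures.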

Given these hardware constraints, edge deployment cannot follow an accuracy-first approach. A successful deployment instead trades raw model performance against the hard limits of static Flash and peak dynamic SRAM.

To illustrate this relationship, Table 5-5 quantifies the memory footprint and resource utilization of the waveform classifier across varying model sizes and input feature configurations.

Table 5-5 Model Memory Analysis
| Model Variant (Parameter) | Flash (ROM) Size | RAM Size @ Input = 64 | RAM Size @ Input = 128 | RAM Size @ Input = 256 |
|---|---|---|---|---|
| CLS_1K | Approximately 5.6 KB | Approximately 2.7 KB | Approximately 5.3 KB | Approximately 10.4 KB |
| CLS_4K | Approximately 9.9 KB | Approximately 1.2 KB | Approximately 1.7 KB | Approximately 2.7 KB |
| CLS_13K | Approximately 21.4 KB | Approximately 2.3 KB | Approximately 3.3 KB | Approximately 5.3 KB |
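As a quick way to read the table, the short script below computes how RAM usage compares between variants and how it grows from an input size of 64 to 256. The values are copied from Table 5-5 (approximate, in KB); the helper names `ram_ratio` and `growth` are illustrative only.

```python
# Approximate values from Table 5-5, in KB
ram_kb = {
    "CLS_1K": {64: 2.7, 128: 5.3, 256: 10.4},
    "CLS_4K": {64: 1.2, 128: 1.7, 256: 2.7},
    "CLS_13K": {64: 2.3, 128: 3.3, 256: 5.3},
}


def ram_ratio(model_a, model_b, input_len):
    """RAM of model_a divided by RAM of model_b at the same input size."""
    return ram_kb[model_a][input_len] / ram_kb[model_b][input_len]


def growth(model):
    """RAM growth factor when the input size goes from 64 to 256."""
    return ram_kb[model][256] / ram_kb[model][64]


for n in (64, 128, 256):
    print(
        f"input={n:3d}  "
        f"CLS_1K/CLS_4K={ram_ratio('CLS_1K', 'CLS_4K', n):.2f}  "
        f"CLS_1K/CLS_13K={ram_ratio('CLS_1K', 'CLS_13K', n):.2f}"
    )

for name in ram_kb:
    print(f"{name}: RAM grows {growth(name):.2f}x from input 64 to 256")
```

The growth factors show that a 4x longer input costs CLS_1K close to 4x the RAM, while the two larger variants grow by a little more than 2x.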

Across all input dimensions, the smallest model (CLS_1K) consumes more runtime RAM than either of the larger models: roughly 2.3 to 3.9 times the RAM of CLS_4K and roughly 1.2 to 2 times the RAM of CLS_13K. This counterintuitive behavior stems from the fact that parameter count and runtime memory are driven by fundamentally different architectural decisions. CLS_1K relies on a shallow topology that lacks early downsampling, causing large, high-resolution feature maps to persist in memory throughout the intermediate layers. In contrast, CLS_4K and CLS_13K adopt deeper architectures with strided convolutions at the front end of the network, which immediately reduce the temporal dimensions and significantly decrease the size of intermediate runtime tensors.
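The effect of early downsampling can be sketched with a simple peak-activation estimate. The two layer stacks below are hypothetical and are not the actual CLS_1K, CLS_4K, or CLS_13K topologies, which are not listed here. The estimate assumes INT8 activations, 'same' padding, and that each layer's input and output buffers are held in memory at the same time; a real compiler's buffer planning may differ.

```python
def peak_activation_bytes(layers, in_len, in_ch=1):
    """Peak of (layer input + layer output) bytes over a chain of Conv1D layers.

    layers: list of (out_channels, stride) tuples.
    Assumes INT8 activations and 'same' padding, so the output length
    is ceil(input_length / stride).
    """
    peak, length, ch = 0, in_len, in_ch
    for out_ch, stride in layers:
        in_bytes = ch * length
        length = -(-length // stride)  # ceiling division
        out_bytes = out_ch * length
        peak = max(peak, in_bytes + out_bytes)
        ch = out_ch
    return peak


# Hypothetical shallow network: few channels, no downsampling
shallow = [(8, 1), (8, 1)]
# Hypothetical deeper network: more channels, strided front end
strided = [(8, 4), (16, 2), (32, 2)]

for n in (64, 128, 256):
    print(
        f"input={n:3d}  shallow~{peak_activation_bytes(shallow, n)} B  "
        f"strided~{peak_activation_bytes(strided, n)} B"
    )
```

Although the strided stack has more layers and more channels (and so more parameters), its feature maps shrink immediately, so its estimated peak activation memory stays well below that of the shallow stack at every input size.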