SDAA185 February   2026

 

  2.   Abstract
  3.   Trademarks
  4. Introduction
    1. 1.1 NPU Definition and Purpose
    2. 1.2 Key Capabilities
    3. 1.3 Technical Limitations
  5. Development Flow Overview
    1. 2.1 Model Development Phase
    2. 2.2 Model Compilation Phase
    3. 2.3 Application Integration Phase
  6. Example Model Creation (Python)
    1. 3.1 Model Selection Rationale
    2. 3.2 Model Architecture Design
    3. 3.3 Training Details
      1. 3.3.1 Development Environment Setup
      2. 3.3.2 Dataset Generation
      3. 3.3.3 Model Training Configuration
      4. 3.3.4 Quantization-Aware Training Process
  7. Quantization for Embedded Platform
    1. 4.1 Quantization Approaches: QAT versus PTQ
      1. 4.1.1 Post-Training Quantization (PTQ)
      2. 4.1.2 Quantization-Aware Training (QAT)
    2. 4.2 Quantization Frameworks and Wrapper Modules
      1. 4.2.1 Generic Wrappers for CPU Quantization
      2. 4.2.2 TINPU Wrappers for NPU Hardware Acceleration
  8. Validating the Model
    1. 5.1 Two-Phase Training Strategy
      1. 5.1.1 Initial Training Phase
      2. 5.1.2 Fine-Tuning Phase
    2. 5.2 Training Phase Comparison
    3. 5.3 Validation Results and Metrics
  9. Testing the Model
    1. 6.1 Inference Setup and Methodology
      1. 6.1.1 Generic User Testing Approach
    2. 6.2 Testing Results and Visual Analysis
      1. 6.2.1 Visual Performance Assessment
    3. 6.3 Quantitative Performance Metrics
  10. Moving the Model to TI MCU (C2000 – F28P55x) [Beginner Level]
  11. Moving the Model to TI MCU (C2000 – F28P55x) [Developer Level]
    1. 8.1 Compilation Prerequisites
      1. 8.1.1 Required TI Software Components
      2. 8.1.2 Environment Setup Process
    2. 8.2 Configuration File Setup
      1. 8.2.1 Configuration File Structure
        1. 8.2.1.1 Models Requiring Dequantization Flag
      2. 8.2.2 Special Configuration for Regression Models
        1. 8.2.2.1 Output Dequantization Flag
        2. 8.2.2.2 Compiler Constants Modification
        3. 8.2.2.3 Compilation Dictionary Update
    3. 8.3 Compilation Process Flow
      1. 8.3.1 Launching the Compilation
      2. 8.3.2 Compilation Phases
      3. 8.3.3 Common Issues to Watch For
  12. Setting up the MCU Project
    1. 9.1 Creating a CCS Project for NPU Applications
    2. 9.2 Understanding the NPU Interface
      1. 9.2.1 Key Interface Components
      2. 9.2.2 Basic Usage Pattern
  13. 10 Testing the Model in the Embedded Environment
    1. 10.1 Visual Performance Assessment
    2. 10.2 Quantitative Performance Metrics
  14. 11 NPU Integration in a Real-Time Signal Chain
    1. 11.1 Application Block Diagram
    2. 11.2 Application Code Implementation
    3. 11.3 Hardware Components Utilized
    4. 11.4 Hardware Validation Results
      1. 11.4.1 Input Signal Characteristics
      2. 11.4.2 Neural Network Output Analysis
  15. 12 Key Design Decisions and Impact
    1. 12.1 NPU Handling of Numbers
      1. 12.1.1 Integer-Only Architecture
      2. 12.1.2 Working with Negative and Floating-Point Values
    2. 12.2 Supported Neural Network Layers and Constraints
      1. 12.2.1 Supported Layer Types
        1. 12.2.1.1 Convolution Layers
        2. 12.2.1.2 Other Core Layers
        3. 12.2.1.3 Flexibilities
    3. 12.3 Model Complexity and Size Limitations
      1. 12.3.1 Memory Constraints and Model Size
      2. 12.3.2 Optimization Process and Performance Trade-offs
  16. 13 Benchmarks
    1. 13.1 Model Performance Comparison
      1. 13.1.1 128 - Neuron Model
      2. 13.1.2 64 - Neuron Model
      3. 13.1.3 16 - Neuron Model
      4. 13.1.4 Reference Benchmark
    2. 13.2 Performance Analysis
      1. 13.2.1 Model Selection Trade-offs
      2. 13.2.2 CPU versus NPU Performance
    3. 13.3 Pipeline Stage Timing Measurements
  17. 14 Summary
    1. 14.1 Key Capabilities and Constraints
    2. 14.2 Development Workflow
    3. 14.3 Model Design Considerations
    4. 14.4 Implementation Challenges and Solutions
    5. 14.5 Broader Applications
  18. 15 References

Optimization Process and Performance Trade-offs

  • Progressive Size Reduction: We systematically tested different model sizes. Models with up to 128 neurons per layer can fit on the device, but the 64-neuron configuration gave the best balance between accuracy and latency.
  • Memory versus Accuracy: Memory constraints became our primary design consideration, forcing us to work backward from hardware limitations rather than forward from accuracy goals.
  • Implementation Considerations: The final model not only fit on the device but also:
    • Compiled reliably with the NPU toolchain
    • Maintained adequate precision for sine wave approximation
    • Delivered consistent inference times below 1 ms
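
To make the "work backward from hardware limitations" idea concrete, the following is a minimal sketch of how candidate layer widths could be screened against a memory budget before any training is done. It is not the procedure used in this document: the network shape (one input, two hidden layers, one output), the 1-byte-per-parameter storage, the 8192-byte budget, and the helper names `mlp_param_count` and `widths_that_fit` are all assumptions chosen for illustration.

```python
def mlp_param_count(hidden_width, hidden_layers=2, n_inputs=1, n_outputs=1):
    """Count weights and biases of a fully connected network.

    Assumed shape (hypothetical, not taken from the document):
    n_inputs -> hidden_width x hidden_layers -> n_outputs.
    """
    sizes = [n_inputs] + [hidden_width] * hidden_layers + [n_outputs]
    # Each dense layer holds fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))


def widths_that_fit(budget_bytes, candidates, bytes_per_param=1):
    """Return the candidate widths whose parameters fit in budget_bytes.

    bytes_per_param=1 assumes 8-bit quantized parameters; real storage
    (for example wider biases or scale factors) may differ.
    """
    return [w for w in candidates
            if mlp_param_count(w) * bytes_per_param <= budget_bytes]


if __name__ == "__main__":
    candidates = (128, 64, 16)   # widths discussed in this document
    budget = 8192                # hypothetical budget in bytes
    for width in candidates:
        print(width, "neurons per layer:", mlp_param_count(width), "parameters")
    print("fit within budget:", widths_that_fit(budget, candidates))
```

Because the weight count of a dense layer grows with the product of adjacent layer widths, halving the width cuts the parameter count by roughly a factor of four under these assumptions. A screen like this only narrows the candidates; accuracy and latency still have to be measured on the compiled model.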