SDAA185 February 2026

  Abstract
  Trademarks
  1 Introduction
    1.1 NPU Definition and Purpose
    1.2 Key Capabilities
    1.3 Technical Limitations
  2 Development Flow Overview
    2.1 Model Development Phase
    2.2 Model Compilation Phase
    2.3 Application Integration Phase
  3 Example Model Creation (Python)
    3.1 Model Selection Rationale
    3.2 Model Architecture Design
    3.3 Training Details
      3.3.1 Development Environment Setup
      3.3.2 Dataset Generation
      3.3.3 Model Training Configuration
      3.3.4 Quantization-Aware Training Process
  4 Quantization for Embedded Platform
    4.1 Quantization Approaches: QAT versus PTQ
      4.1.1 Post-Training Quantization (PTQ)
      4.1.2 Quantization-Aware Training (QAT)
    4.2 Quantization Frameworks and Wrapper Modules
      4.2.1 Generic Wrappers for CPU Quantization
      4.2.2 TINPU Wrappers for NPU Hardware Acceleration
  5 Validating the Model
    5.1 Two-Phase Training Strategy
      5.1.1 Initial Training Phase
      5.1.2 Fine-Tuning Phase
    5.2 Training Phase Comparison
    5.3 Validation Results and Metrics
  6 Testing the Model
    6.1 Inference Setup and Methodology
      6.1.1 Generic User Testing Approach
    6.2 Testing Results and Visual Analysis
      6.2.1 Visual Performance Assessment
    6.3 Quantitative Performance Metrics
  7 Moving the Model to TI MCU (C2000 – F28P55x) [Beginner Level]
  8 Moving the Model to TI MCU (C2000 – F28P55x) [Developer Level]
    8.1 Compilation Prerequisites
      8.1.1 Required TI Software Components
      8.1.2 Environment Setup Process
    8.2 Configuration File Setup
      8.2.1 Configuration File Structure
        8.2.1.1 Models Requiring Dequantization Flag
      8.2.2 Special Configuration for Regression Models
        8.2.2.1 Output Dequantization Flag
        8.2.2.2 Compiler Constants Modification
        8.2.2.3 Compilation Dictionary Update
    8.3 Compilation Process Flow
      8.3.1 Launching the Compilation
      8.3.2 Compilation Phases
      8.3.3 Common Issues to Watch For
  9 Setting up the MCU Project
    9.1 Creating a CCS Project for NPU Applications
    9.2 Understanding the NPU Interface
      9.2.1 Key Interface Components
      9.2.2 Basic Usage Pattern
  10 Testing the Model in the Embedded Environment
    10.1 Visual Performance Assessment
    10.2 Quantitative Performance Metrics
  11 NPU Integration in a Real-Time Signal Chain
    11.1 Application Block Diagram
    11.2 Application Code Implementation
    11.3 Hardware Components Utilized
    11.4 Hardware Validation Results
      11.4.1 Input Signal Characteristics
      11.4.2 Neural Network Output Analysis
  12 Key Design Decisions and Impact
    12.1 NPU Handling of Numbers
      12.1.1 Integer-Only Architecture
      12.1.2 Working with Negative and Floating-Point Values
    12.2 Supported Neural Network Layers and Constraints
      12.2.1 Supported Layer Types
        12.2.1.1 Convolution Layers
        12.2.1.2 Other Core Layers
        12.2.1.3 Flexibilities
    12.3 Model Complexity and Size Limitations
      12.3.1 Memory Constraints and Model Size
      12.3.2 Optimization Process and Performance Trade-offs
  13 Benchmarks
    13.1 Model Performance Comparison
      13.1.1 128-Neuron Model
      13.1.2 64-Neuron Model
      13.1.3 16-Neuron Model
      13.1.4 Reference Benchmark
    13.2 Performance Analysis
      13.2.1 Model Selection Trade-offs
      13.2.2 CPU versus NPU Performance
    13.3 Pipeline Stage Timing Measurements
  14 Summary
    14.1 Key Capabilities and Constraints
    14.2 Development Workflow
    14.3 Model Design Considerations
    14.4 Implementation Challenges and Solutions
    14.5 Broader Applications
  15 References

Technical Limitations

While powerful, the F28P55x NPU operates under several constraints that influence application design:

  • Architectural Limitations: Neural network topologies such as CNNs and MLPs with ReLU activations are better supported than more complex architectures such as LSTMs or Transformers.
  • Precision Tradeoffs: The quantization required for NPU execution introduces precision loss relative to floating-point implementations, so careful training approaches (such as quantization-aware training) are needed to maintain accuracy.
  • Development Workflow Complexity: Specific toolchain requirements for model compilation and deployment add development steps beyond those of standard microcontroller programming.
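To make the precision tradeoff concrete, the following Python sketch quantizes a set of random floating-point weights to 8-bit integers and measures the round-trip error. It assumes symmetric, per-tensor int8 quantization with a single scale factor; the helper names (`symmetric_scale`, `quantize_int8`, `dequantize`) are hypothetical and are not meant to reproduce the exact scheme used by TI's NPU toolchain.

```python
import random

def symmetric_scale(values):
    """Assumed scheme: one per-tensor scale that maps the largest magnitude to 127."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127.0 if max_abs > 0 else 1.0

def quantize_int8(values, scale):
    """Round each float to the nearest multiple of `scale`, saturating to [-127, 127]."""
    return [max(-127, min(127, round(v / scale))) for v in values]

def dequantize(q_values, scale):
    """Map the integer codes back to approximate floating-point values."""
    return [q * scale for q in q_values]

# Synthetic weights for illustration only (not data from this document).
random.seed(0)
weights = [random.gauss(0.0, 0.5) for _ in range(1000)]

scale = symmetric_scale(weights)
q = quantize_int8(weights, scale)
recovered = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, recovered)]

print(f"scale          = {scale:.5f}")
print(f"max abs error  = {max(errors):.5f} (bound: scale / 2 = {scale / 2:.5f})")
print(f"mean abs error = {sum(errors) / len(errors):.5f}")
```

Under this assumed scheme, each weight is off by at most half a quantization step (`scale / 2`). Training with quantization in the loop, as in QAT, lets the network adapt to this error during training instead of encountering it only at deployment.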

These capabilities and limitations frame the practical application space for the F28P55x NPU in automotive and industrial embedded systems, where balancing computational power with resource constraints is essential for successful implementation.