SPRADB0, May 2023: AM62A3, AM62A3-Q1, AM62A7, AM62A7-Q1, AM68A, AM69A

 

Table of Contents

  Abstract
  Trademarks
  1 Introduction
    1.1 Intended Audience
    1.2 Host Machine Information
  2 Creating the Dataset
    2.1 Collecting Images
    2.2 Labelling Images
    2.3 Augmenting the Dataset (Optional)
  3 Selecting a Model
  4 Training the Model
    4.1 Input Optimization (Optional)
  5 Compiling the Model
  6 Using the Model
  7 Building the End Application
    7.1 Optimizing the Application With TI’s Gstreamer Plugins
    7.2 Using a Raw MIPI-CSI2 Camera
  8 Summary
  9 References

4.1 Input Optimization (Optional)

The training YAML configuration file contains an optional 'input_optimization' parameter that relates to quantization and compilation. In short, it folds two preprocessing steps into the first few layers of the model to save CPU cycles and DDR bandwidth. This section provides context for that option, along with the necessary background on model quantization, which is a crucial step in compiling the model.
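For reference, the flag might appear in the training configuration roughly as follows. The 'input_optimization' key is the one named in this document; the surrounding layout and other keys are placeholders that depend on the tooling release in use:

```yaml
training:
    # ...other training settings...
    # Fold the mean-subtract and scale preprocessing into the first
    # layers of the exported model (see the discussion below).
    input_optimization: True
```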

Models are almost universally trained with floating-point weights and activation values. A model can be trained with quantization awareness (QAT, or quantization-aware training), but its weights are still stored in floating point; weights can also be quantized after training (PTQ, or post-training quantization). PTQ converts the model into a fixed-point format that neural network accelerators are often optimized for. For TI's architecture, part of compilation is to create a quantization function that maps float32 values to 8-bit or 16-bit integers; this is sometimes referred to as 'calibration'. This function is shared across the network, so it is important that all float32 values fall in a similar range so that they map consistently to fixed point without "clipping" very high or low values to the edges of the limited integer range (1).
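As a simplified illustration of why a consistent range matters, the sketch below applies a single per-tensor scale to quantize float32 values to int8. The function names and the symmetric per-tensor scheme are assumptions made for this example; it is not TI's actual calibration implementation:

```python
import numpy as np

def quantize_int8(x: np.ndarray, scale: float) -> np.ndarray:
    """Map float32 values to int8 using one shared per-tensor scale.

    Values whose magnitude exceeds roughly scale * 127 cannot be
    represented and are clipped to the edges of the int8 range.
    """
    q = np.round(x / scale)
    return np.clip(q, -128, 127).astype(np.int8)

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately recover the original float values."""
    return q.astype(np.float32) * scale

# Calibration would choose 'scale' from observed value ranges. If most
# values lie in [-1, 1] but one outlier reaches 10.0, a scale chosen for
# the bulk of the data clips the outlier:
x = np.array([0.5, -0.75, 0.9, 10.0], dtype=np.float32)
scale = 1.0 / 127  # covers roughly [-1, 1]
print(dequantize(quantize_int8(x, scale), scale))
# -> approximately [0.504, -0.748, 0.898, 1.0]; the 10.0 outlier is
#    clipped to ~1.0 and its information is lost
```

A value that falls outside the range covered by the shared scale is clipped to the edge of the integer range, which is exactly the situation calibration tries to avoid.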

A common way to limit clipping when mapping floats to integers is to keep weights and values in a range between -1 and 1 or between 0 and 1. For interior layers, this is made easier by using regularization during training. For the model input, images are typically in an 8-bit RGB format and are converted to float32 by subtracting a mean and multiplying by a scale factor, so that the values entering the model fall in this smaller, consistent range. This preprocessing step is necessary for the model, but it is performed inefficiently on the CPU.
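The preprocessing in question typically looks like the following NumPy sketch. The mean and scale values here are placeholders; the real values come from how the model was trained:

```python
import numpy as np

# Placeholder per-channel mean and scale; the actual values come from
# the model's training configuration.
MEAN = np.array([128.0, 128.0, 128.0], dtype=np.float32)
SCALE = np.array([1.0 / 128.0, 1.0 / 128.0, 1.0 / 128.0], dtype=np.float32)

def preprocess(rgb_u8: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 uint8 RGB image to normalized float32.

    Every 1-byte pixel becomes a 4-byte float, so this step quadruples
    the size of the tensor handed to the model, and it runs on the CPU
    when it is not folded into the network.
    """
    return (rgb_u8.astype(np.float32) - MEAN) * SCALE

image = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)
model_input = preprocess(image)  # float32 values roughly in [-1, 1)
```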

"Input optimization" is a strategy for making this preprocessing more efficient by including it in the model. The subtract and multiply operations are moved into the model itself as two additional layers at the input, where these element-wise operations run on the C7xMMA. This reduces CPU load and DDR bandwidth because the original 8-bit or 16-bit integer RGB pixels can be provided as-is, rather than as preprocessed 32-bit floats.

This input_optimization step can also be pulled into the model as part of compilation.
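As a conceptual sketch of what the folded layers look like, the following PyTorch wrapper performs the element-wise subtract and multiply inside the network so that the exported graph accepts raw 8-bit pixels. This illustrates the idea rather than the exact graph transformation TI's tooling performs, and the class and value names are made up for the example:

```python
import torch
import torch.nn as nn

class InputOptimizedModel(nn.Module):
    """Wraps a trained backbone with subtract and multiply layers at the input.

    After export (for example to ONNX), the graph starts with these two
    element-wise ops, so the runtime can be fed uint8 RGB tensors and
    the normalization runs on the accelerator instead of the CPU.
    """

    def __init__(self, backbone: nn.Module, mean, scale):
        super().__init__()
        self.backbone = backbone
        # Shaped for NCHW input; registered as buffers so they are
        # exported as constants alongside the model weights.
        self.register_buffer("mean", torch.tensor(mean).view(1, -1, 1, 1))
        self.register_buffer("scale", torch.tensor(scale).view(1, -1, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x.float()                      # raw 0-255 pixel values
        x = (x - self.mean) * self.scale   # the two folded input layers
        return self.backbone(x)

# Example usage with a toy backbone and placeholder normalization constants.
wrapped = InputOptimizedModel(
    backbone=nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU()),
    mean=[128.0, 128.0, 128.0],
    scale=[1.0 / 128.0, 1.0 / 128.0, 1.0 / 128.0],
)
pixels = torch.randint(0, 256, (1, 3, 224, 224), dtype=torch.uint8)
out = wrapped(pixels)
```

Because the mean and scale are stored as constants in the exported graph, the two element-wise operations become ordinary layers that the compiler can schedule on the C7xMMA like any other layer.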

Hybrid networks that use more than one quantization bit-depth, such as 8-bit on some layers and 16-bit on others, have more than one quantization function. Nonetheless, keeping weights in a similar range throughout the network reduces the amount of clipping.