SPRUIG3 User guide

SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

5.2.3 Optimization of OFFSET_NP1-Based Transpose

VCC may recognize transpose sequences that use OFFSET_NP1. The general pattern is to store into a scratch buffer using OFFSET_NP1 stores, then read the scratch buffer back using NPT loads. There is no direct translation for the OFFSET_NP1 store, but the streaming engine does support a transposed read mode. If transpose recognition is enabled, the migration tool may therefore transform the sequence to use non-transposed stores in place of the OFFSET_NP1 stores, and transposed streaming-engine loads in place of the normal vector loads. In other words, the transpose operation shifts from the store to the subsequent load. Because the layout of the data in the scratch buffer then differs from its VCOP layout, this transformation is valid only when the scratch buffer is used solely for the transpose operation.
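The shift of the transpose from the store to the load can be illustrated with a small host-side C model. This is a simplified sketch under stated assumptions, not VCC output: it assumes an 8-lane vector (LANES), models an OFFSET_NP1 store as writing lane k of store j to scratch[k * PITCH + j] with a row pitch of N + 1, models an NPT load as a contiguous row read, and uses a plain column-wise loop as a stand-in for the streaming engine's transposed read mode. The names vcop_style, migrated, paths_agree, LANES, and PITCH are hypothetical. Both paths load the same transposed result, but the two scratch buffers hold different layouts.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LANES 8            /* assumed vector width N (hypothetical) */
#define PITCH (LANES + 1)  /* assumed OFFSET_NP1 row pitch of N + 1 */

/* Simplified model of the original VCOP sequence.
   Store j scatters its lane k to scratch[k * PITCH + j] (OFFSET_NP1-style);
   load k then reads row k contiguously (NPT-style), yielding column k of in. */
void vcop_style(const uint32_t *in, uint32_t *scratch, uint32_t *out)
{
    for (int j = 0; j < LANES; j++)
        for (int k = 0; k < LANES; k++)
            scratch[k * PITCH + j] = in[j * LANES + k];
    for (int k = 0; k < LANES; k++)
        for (int j = 0; j < LANES; j++)
            out[k * LANES + j] = scratch[k * PITCH + j];
}

/* Simplified model of the migrated sequence.
   Plain contiguous stores keep the input layout; a column-wise read stands
   in for the streaming engine's transposed read mode. */
void migrated(const uint32_t *in, uint32_t *scratch, uint32_t *out)
{
    for (int j = 0; j < LANES; j++)
        for (int k = 0; k < LANES; k++)
            scratch[j * LANES + k] = in[j * LANES + k];
    for (int k = 0; k < LANES; k++)
        for (int j = 0; j < LANES; j++)
            out[k * LANES + j] = scratch[j * LANES + k];
}

/* Runs both models; returns 1 if the loaded results are identical.
   vcop_scratch needs LANES * PITCH elements, mig_scratch LANES * LANES. */
int paths_agree(const uint32_t *in, uint32_t *vcop_scratch, uint32_t *mig_scratch)
{
    uint32_t a[LANES * LANES], b[LANES * LANES];
    assert(in && vcop_scratch && mig_scratch);
    vcop_style(in, vcop_scratch, a);
    migrated(in, mig_scratch, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

For an 8 x 8 input holding 0 to 63 in row-major order, paths_agree returns 1, yet element 1 of the VCOP-style scratch buffer holds 8 while element 1 of the migrated scratch buffer holds 1. Any other code that reads the scratch buffer would therefore see different data after the transformation.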

Transpose detection and transformation may be enabled by:

Enabling the --transpose command line option. This option enables automatic detection of transpose sequences and applies the transformation at every eligible point in the file. If a kernel that should use transpose is in the same file as one that should not, separate them into two files.
Applying the __tscratch keyword to a parameter (for example, __tscratch __vptr_uint32 scratch_buffer). This method of enabling transpose takes effect even if --transpose is not specified.

The transpose transformation can be performed only when all of the following conditions are met:

All stores in the sequence are OFFSET_NP1.
All loads in the sequence are NPT.
The buffer used for the transpose is not later examined by the caller. (After the transformation, the data in the buffer is not stored in transposed order.)
Enough streaming engines and streaming address generators are available.
The stores and loads are 32-bit. (The streaming engine supports all transpose read configurations only for 32-bit data. If the data type is the only reason a transpose is not performed, VCC issues a warning.)
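The conditions above can be summarized as a single predicate. The sketch below is a hypothetical model, not VCC's implementation: the struct fields, enum values, and check_transpose are invented for illustration. It encodes the rule that a warning is warranted only when the element width is the sole reason the transformation is rejected.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of one candidate store/load sequence. */
typedef struct {
    bool all_stores_offset_np1; /* every store in the sequence is OFFSET_NP1 */
    bool all_loads_npt;         /* every load in the sequence is NPT */
    bool caller_reads_buffer;   /* the caller examines the scratch buffer later */
    bool engines_available;     /* enough streaming engines and address generators */
    int  elem_bits;             /* width of the stored/loaded elements */
} transpose_candidate;

typedef enum {
    TRANSPOSE_OK,             /* transformation can be applied */
    TRANSPOSE_REJECTED,       /* some non-type condition fails */
    TRANSPOSE_REJECTED_WARN   /* only the data type fails: warn */
} transpose_verdict;

transpose_verdict check_transpose(const transpose_candidate *c)
{
    assert(c);
    bool other_ok = c->all_stores_offset_np1 && c->all_loads_npt &&
                    !c->caller_reads_buffer && c->engines_available;
    bool type_ok = (c->elem_bits == 32);

    if (!other_ok)
        return TRANSPOSE_REJECTED;
    return type_ok ? TRANSPOSE_OK : TRANSPOSE_REJECTED_WARN;
}
```

In this model, a candidate that uses 16-bit data but meets every other condition yields TRANSPOSE_REJECTED_WARN; if its buffer is also read by the caller, the model returns TRANSPOSE_REJECTED instead.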

The transpose transformation correctly handles unrolled reads or unrolled writes, transforming them as a set. It also correctly handles a transpose scratch buffer that has been split so that one portion is used separately from another portion. However, it does not correctly handle unrolled reads/writes combined with a split transpose scratch buffer, because VCC can no longer distinguish the offset due to the unroll from the offset due to the split.
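A toy model shows why the combination is ambiguous. It assumes, purely for illustration, that each access lands at portion * portion_size + copy * unroll_stride elements from the buffer base; count_decompositions is a hypothetical helper, and this is not a description of VCC's actual address analysis. With only unrolling or only splitting, each offset has exactly one explanation; with both, an offset can have several.

```c
#include <assert.h>

/* Toy model (assumption): an access lands at
   portion * portion_size + copy * unroll_stride elements from the buffer base.
   Returns how many (portion, copy) pairs explain the given offset; more than
   one means the split and the unroll cannot be told apart. */
int count_decompositions(int offset, int portions, int portion_size,
                         int copies, int unroll_stride)
{
    assert(portions > 0 && copies > 0);
    int n = 0;
    for (int p = 0; p < portions; p++)
        for (int u = 0; u < copies; u++)
            if (p * portion_size + u * unroll_stride == offset)
                n++;
    return n;
}
```

With two 8-element portions and an unroll stride of 8, offset 8 can be explained either as portion 1, copy 0 or as portion 0, copy 1, so count_decompositions(8, 2, 8, 2, 8) returns 2.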