SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 


Optimization of OFFSET_NP1-Based Transpose

VCC may recognize transpose sequences that use OFFSET_NP1. The general pattern is to store into a scratch buffer using OFFSET_NP1 stores, then read the scratch buffer back using NPT loads. There is no direct translation for the OFFSET_NP1 store, but the streaming engine does support a transposed read mode. If transpose recognition is enabled, the migration tool may replace the OFFSET_NP1 stores with non-transposed stores and replace the normal vector loads with transposed loads through the streaming engine. The transpose operation thus shifts from the store to the subsequent load. Because this changes the layout of the data in the scratch buffer relative to its VCOP layout, the transformation is valid only when the scratch buffer is used for nothing other than the transpose operation.
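To make the layout difference concrete, the following Python sketch models both schemes on an 8x8 tile of 32-bit words. It is an illustration only, not VCC output: the function names, the 8-lane width, and the N+1 word stride used to model OFFSET_NP1 are assumptions made for the example, and plain lists stand in for vector memory and the streaming engine.

```python
N = 8  # assumed SIMD width in 32-bit lanes


def transpose_on_store(tile):
    """Model of the original pattern: strided stores, then linear loads."""
    pitch = N + 1  # assumed stride of N+1 words between lanes
    scratch = [None] * (N * pitch)
    for r, row in enumerate(tile):
        for lane, value in enumerate(row):
            # Lane k of row r lands k * (N + 1) words from the row's base.
            scratch[lane * pitch + r] = value
    # Linear loads of N consecutive words now return transposed rows.
    out = [scratch[c * pitch:c * pitch + N] for c in range(N)]
    return out, scratch


def transpose_on_load(tile):
    """Model of the migrated pattern: linear stores, then transposed reads."""
    scratch = [None] * (N * N)
    for r, row in enumerate(tile):
        scratch[r * N:(r + 1) * N] = row  # non-transposed store
    # The transposed read walks down a column of the scratch buffer.
    out = [[scratch[r * N + c] for r in range(N)] for c in range(N)]
    return out, scratch


tile = [[r * N + c for c in range(N)] for r in range(N)]
out_store, scratch_store = transpose_on_store(tile)
out_load, scratch_load = transpose_on_load(tile)
```

Both functions return the same transposed tile, but the contents of the scratch buffer differ: in the second model the buffer holds the data in its original, non-transposed order. This is why any other use of the scratch buffer invalidates the transformation.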

Transpose detection and transformation may be enabled by:

  1. Specifying the --transpose command-line option. This option enables automatic detection of transpose sequences and applies the transformation at every eligible point in the file. If a kernel that should use transpose is in the same file as one that should not, separate them into two files.
  2. Applying the __tscratch keyword to a parameter (for example, __tscratch __vptr_uint32 scratch_buffer). This method takes effect even if --transpose is not specified.

The transpose transformation may be performed under the following conditions:

  • All stores in the sequence are OFFSET_NP1.
  • All loads in the sequence are NPT.
  • The buffer used for the transpose is not later examined by the caller. (After the transformation, the data in the buffer is not transposed.)
  • Enough streaming engines and streaming address generators are available.
  • The stores and loads are 32-bit. (The streaming engine supports all transpose read configurations only for 32-bit data. If the data type is the only reason a transpose is not performed, VCC issues a warning.)
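The conditions above can be read as a single predicate plus one special case for the data-type warning. The Python sketch below is a hypothetical checklist, not VCC's implementation: the TransposeCandidate class, its field names, and check_transpose are invented for illustration and simply mirror the listed conditions.

```python
from dataclasses import dataclass


@dataclass
class TransposeCandidate:
    # Hypothetical fields, one per condition in the list above.
    all_stores_offset_np1: bool
    all_loads_npt: bool
    buffer_read_by_caller: bool
    enough_se_and_sa: bool  # streaming engines and address generators
    element_bits: int


def check_transpose(c):
    """Return (eligible, warn_about_type) for a candidate sequence."""
    other_ok = (
        c.all_stores_offset_np1
        and c.all_loads_npt
        and not c.buffer_read_by_caller
        and c.enough_se_and_sa
    )
    type_ok = c.element_bits == 32
    eligible = other_ok and type_ok
    # A warning applies only when the data type is the sole obstacle.
    warn_about_type = other_ok and not type_ok
    return eligible, warn_about_type
```

For example, a candidate that meets every condition except the element size is reported as not eligible with the type warning set, while one that also fails another condition is rejected without that warning.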

The transpose transformation correctly handles unrolled reads or unrolled writes, transforming them as a set. It also correctly handles a transpose scratch buffer that has been split so that one portion is used separately from another. However, it does not correctly handle a combination of unrolled reads or writes and a split transpose scratch buffer, because VCC cannot distinguish the offset introduced by the unroll from the offset introduced by the split.
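One way to picture the ambiguity is to treat each access as a single constant offset into the scratch buffer and ask how that offset could have arisen. The Python sketch below is a thought experiment only. It does not describe how VCC analyzes offsets, and the function name, the byte offsets, and the unroll step are all assumed values.

```python
def decompositions(observed_offset, unroll_step, split_bases):
    """List (split_base, unroll_index) pairs that explain one offset."""
    found = []
    for base in split_bases:
        rest = observed_offset - base
        if rest >= 0 and rest % unroll_step == 0:
            found.append((base, rest // unroll_step))
    return found


# Assumed layout: buffer split at byte 64, unrolled accesses 32 bytes apart.
explanations = decompositions(64, unroll_step=32, split_bases=[0, 64])
```

Here an access at offset 64 is explained equally well by the second unrolled access beyond the first in the first portion (base 0, index 2) and by the first access in the second portion (base 64, index 0). With only unrolling or only a split there is a single explanation, which is consistent with those cases being handled.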