SPRUIG3C January   2018  – August 2019 TDA4VM , TDA4VM-Q1

 

  1.   Read This First
    1.     About This Manual
    2.     Related Documentation
    3.     Trademarks
  2. 1Overview and Scope
    1. 1.1 Comparing VCOP and C7000
    2. 1.2 About this Document
      1. 1.2.1 Documentation Conventions
    3. 1.3 Output Format
    4. 1.4 Data Types
      1. 1.4.1 40-bit Incompatibilities
      2. 1.4.2 40-Bit Detection in Host Emulation Mode
    5. 1.5 SIMD Width
    6. 1.6 VCOP Virtual Machine
  3. 2Kernel API
    1. 2.1 Overview
    2. 2.2 Parameter Block
      1. 2.2.1 Tvals Structure
      2. 2.2.2 Pblock Manipulation
  4. 3Loop Control
    1. 3.1 Overview
    2. 3.2 Loop Control and Nested Loops
    3. 3.3 Repeat Loops
    4. 3.4 Compound Conditions
    5. 3.5 Early Exit
  5. 4Addressing
    1. 4.1 Overview
    2. 4.2 Streaming Engines
    3. 4.3 Streaming Address Generators
    4. 4.4 Indexed Addressing
    5. 4.5 Circular Addressing
  6. 5Operations
    1. 5.1 Load Operations
    2. 5.2 Store Operations
      1. 5.2.1 Predicated Stores
      2. 5.2.2 Scatter and Transposing Stores
      3. 5.2.3 Optimization of OFFSET_NP1-Based Transpose
      4. 5.2.4 Rounding Stores
      5. 5.2.5 Saturating Stores
    3. 5.3 Arithmetic Operations
      1. 5.3.1 Vector Compares
      2. 5.3.2 Multiplication with Rounding, Truncation, or Left Shift
    4. 5.4 Lookup and Histogram Table Operations
      1. 5.4.1 Determination of Table Size
      2. 5.4.2 Table Configuration
      3. 5.4.3 Copy-in Operation
      4. 5.4.4 Copy-out Operation
      5. 5.4.5 Index Adjustment from Non-zero Agen
      6. 5.4.6 Lookup Operation
      7. 5.4.7 Histogram Update Operation
      8. 5.4.8 16-Way Lookup and Histogram
  7. 6Performance
    1. 6.1 Overview
    2. 6.2 Compiler Requirements
    3. 6.3 Automatic Performance Profiling
    4. 6.4 Performance Options
  8.   A Warnings and Notes
    1.     A.1 Compatibility Warnings
    2.     A.2 Efficiency Warnings

Scatter and Transposing Stores

VCOP has three distribution modes which store vector elements in a non-linear fashion. The OFFSET_NP1 mode stores each element 9 words apart, effecting a 9x8 transpose. (The 9-word offset causes each element to be stored into a different memory bank, to achieve a parallel store). The SDDA ( s_scatter in Kernel-C) mode stores each element at a data-dependent offset; the offsets come from another vector. The PDDA ( p_scatter in Kernel-C) mode is like SDDA except the offsets must be such that each element goes to a different bank, so they can be parallel. The PDDA mode is also typically used for transpose.

There is no direct support for these operations on C7x. On C7x, transpose is supported by the Streaming Engine, but only for loads and not for stores.

For the specific case of OFFSET_NP1 used to store into a scratch buffer for a transpose operation, the migration tool may detect this and re-arrange the layout of the scratch buffer so as to implement the transpose using the streaming engine. See Section 5.3.3.

Otherwise, the migration tool generates naive sequences for these modes, storing elements one at a time rather than in parallel.

Efficiency Warning: Parallel Stores
Unless the above optimizations are enabled, kernels that use these distribution modes will have inefficient translations.