SPRUIG3C January   2018  – August 2019 TDA4VM , TDA4VM-Q1

 

  1.   Read This First
    1.     About This Manual
    2.     Related Documentation
    3.     Trademarks
  2. 1Overview and Scope
    1. 1.1 Comparing VCOP and C7000
    2. 1.2 About this Document
      1. 1.2.1 Documentation Conventions
    3. 1.3 Output Format
    4. 1.4 Data Types
      1. 1.4.1 40-bit Incompatibilities
      2. 1.4.2 40-Bit Detection in Host Emulation Mode
    5. 1.5 SIMD Width
    6. 1.6 VCOP Virtual Machine
  3. 2Kernel API
    1. 2.1 Overview
    2. 2.2 Parameter Block
      1. 2.2.1 Tvals Structure
      2. 2.2.2 Pblock Manipulation
  4. 3Loop Control
    1. 3.1 Overview
    2. 3.2 Loop Control and Nested Loops
    3. 3.3 Repeat Loops
    4. 3.4 Compound Conditions
    5. 3.5 Early Exit
  5. 4Addressing
    1. 4.1 Overview
    2. 4.2 Streaming Engines
    3. 4.3 Streaming Address Generators
    4. 4.4 Indexed Addressing
    5. 4.5 Circular Addressing
  6. 5Operations
    1. 5.1 Load Operations
    2. 5.2 Store Operations
      1. 5.2.1 Predicated Stores
      2. 5.2.2 Scatter and Transposing Stores
      3. 5.2.3 Optimization of OFFSET_NP1-Based Transpose
      4. 5.2.4 Rounding Stores
      5. 5.2.5 Saturating Stores
    3. 5.3 Arithmetic Operations
      1. 5.3.1 Vector Compares
      2. 5.3.2 Multiplication with Rounding, Truncation, or Left Shift
    4. 5.4 Lookup and Histogram Table Operations
      1. 5.4.1 Determination of Table Size
      2. 5.4.2 Table Configuration
      3. 5.4.3 Copy-in Operation
      4. 5.4.4 Copy-out Operation
      5. 5.4.5 Index Adjustment from Non-zero Agen
      6. 5.4.6 Lookup Operation
      7. 5.4.7 Histogram Update Operation
      8. 5.4.8 16-Way Lookup and Histogram
  7. 6Performance
    1. 6.1 Overview
    2. 6.2 Compiler Requirements
    3. 6.3 Automatic Performance Profiling
    4. 6.4 Performance Options
  8.   A Warnings and Notes
    1.     A.1 Compatibility Warnings
    2.     A.2 Efficiency Warnings

Compiler Requirements

The migration tool generates C++ code that invokes methods in the virtual machine, which is also C++. As such, it relies heavily on the compiler to achieve an efficient translation. In particular, the following compiler optimizations are critical for VCOP translation:

  • Inlining. The virtual machine is implemented as C++ classes with static inline methods. The migration tool expects the compiler to inline them fully. The vloops() function of a translated kernel should contain no calls.
  • Software pipelining. This is an important key to performance on C7x. The compiler should be able to pipeline any loop generated by the migration tool for a VCOP kernel.
  • #pragma PARALLEL_LOOP. There are no loop-carried memory dependencies in a VCOP kernel (this is a built-in property of the architecture). The migration tool uses the PARALLEL_LOOP pragma to convey this property to the compiler. The compiler must not limit the initiation interval of a software pipelined loop due to memory dependence.
  • NLC. The migration tool does not directly invoke the NLC. It expresses loop control using explicit loops and relies on the compiler to collapse them using the NLC where appropriate.
  • Hoisting. The migration tool generates many loop-invariant operations within inner loops and relies on the compiler to recognize that they are invariant and hoist them outside the loop.
  • Forward Substitution for Addressing Primitives. The migration tool generates SE- and SA-based loads as move instructions with the SE/SA pseudo-register as the source operand. Since this operand encodes as a register operand, it can be often be propagated to where it’s used. For example, the sequence:
       VMV SE0++,Vreg1
       ...
       VADDW Vreg1,Vreg2,Vdst
    

    can be optimized to:

       VADDW SE0++,Vreg2,Vdst
    

    A related issue is that the migration tool sometimes generates SA-based references to objects separately from the actual indirection. The compiler should combine these. For example, the migration tool may generate:

       int16 *p = &__sa_noadv_int(0, 4, base); // note “address of”
       ...
       *p = Vreg; 
    

    this generates:

       ADDAW reg_base, SA0++, reg_p
       ...
       VSTW Vreg,*reg_p
    

    which should be optimized to:

       VSTW Vreg,*reg_base[SA0++]
    

This is by no means an exhaustive list. Efficient translation relies on dozens of compiler optimizations, not all of which have been implemented.