SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 


Arithmetic Operations

Most of VCOP’s arithmetic operations translate directly to either a single C7x instruction or a short sequence of instructions.

VCOP arithmetic operations generally operate on vectors of signed 40-bit elements, although some ignore the upper 8 bits and operate on 32-bit elements. C7x lacks direct support for 40-bit arithmetic. One option would be to model each lane as a 64-bit element, but in practice most kernels rely on only 32 bits of precision. The migration tool therefore models VCOP vectors as having 32-bit elements and translates them accordingly (see Section 1.4).
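The difference between the two element widths can be sketched on a host machine. The following is a minimal illustration, not code from the migration tool: the helper names wrap40, wrap32, add40, and add32 are hypothetical, and a 40-bit lane is assumed to behave as a two's-complement value sign-extended from bit 39.

```cpp
#include <cstdint>

// Hypothetical helper: keep the low 40 bits of x and sign-extend from bit 39,
// mimicking one 40-bit lane (an assumed model, not the tool's code).
inline int64_t wrap40(int64_t x) {
    const uint64_t mask = (1ULL << 40) - 1;
    uint64_t u = static_cast<uint64_t>(x) & mask;
    if (u & (1ULL << 39)) {
        u |= ~mask;  // replicate the sign bit into bits 40..63
    }
    return static_cast<int64_t>(u);
}

// Hypothetical helper: keep the low 32 bits of x, mimicking one lane of the
// 32-bit model. The unsigned-to-signed conversion is modular on mainstream
// compilers and is guaranteed to be so from C++20 onward.
inline int32_t wrap32(int64_t x) {
    return static_cast<int32_t>(static_cast<uint32_t>(static_cast<uint64_t>(x)));
}

// Addition in a 40-bit lane: the sum has 8 extra bits of headroom.
inline int64_t add40(int64_t a, int64_t b) { return wrap40(a + b); }

// Addition in a 32-bit lane: the sum wraps at 32 bits.
inline int32_t add32(int32_t a, int32_t b) {
    return wrap32(static_cast<int64_t>(a) + b);
}
```

For small operands the two models agree. They diverge only when an intermediate result needs more than 32 bits, which is the case a kernel author would need to check when migrating.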

Most arithmetic operations are sign-agnostic, but in some cases treating elements as unsigned rather than signed can correct translation errors arising from the loss of the guard bits (see Section 1.4.1). Therefore, some translations have unsigned forms in addition to the default signed forms.
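As an illustration of why an unsigned interpretation can help, consider averaging two large non-negative values. The sketch below is a host-side example under stated assumptions, not one of the tool's actual unsigned forms: the function names are hypothetical, and right shift of a negative signed value is assumed to be arithmetic (true on mainstream compilers, guaranteed from C++20).

```cpp
#include <cstdint>

// Reference behavior with headroom above bit 31: the sum of two 32-bit
// non-negative values cannot overflow, so the shift sees the true sum.
inline int64_t avg_guarded(int64_t a, int64_t b) { return (a + b) >> 1; }

// 32-bit signed lane: the sum wraps into the sign bit, and the arithmetic
// shift then propagates that sign bit, giving a wrong (negative) result.
inline int32_t avg_signed32(int32_t a, int32_t b) {
    const uint32_t sum = static_cast<uint32_t>(a) + static_cast<uint32_t>(b);
    return static_cast<int32_t>(sum) >> 1;
}

// 32-bit unsigned lane: bit 31 is treated as magnitude, and the logical
// shift recovers the same value as the guarded reference.
inline uint32_t avg_unsigned32(uint32_t a, uint32_t b) { return (a + b) >> 1; }
```

The unsigned form matches the reference here only because both inputs are non-negative and their sum still fits in 32 bits. It does not restore the full 40-bit range.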

In cases where an operation is represented by a C operator, the migration tool generates the expression using the operator, and the C7x compiler generates the appropriate vector instruction. For example, for this VCOP Kernel-C statement:

Vdst = Vsrc1 + Vsrc2; 

The migration tool simply copies the expression verbatim and the compiler generates a VADDW instruction.

In other cases where an operation translates directly to a C7x instruction, the migration tool generates a C7x intrinsic to invoke the instruction. For example:

Vdst = min(Vsrc1, Vsrc2); 

translates to

Vdst = __min(Vsrc1, Vsrc2); 

which turns directly into a VMINW instruction.
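The two forms seen so far, an operator and an intrinsic call, can be mimicked on a host machine with a small lane-wise model. This is only a sketch: IntVec and min_sketch are hypothetical stand-ins rather than C7x types or intrinsics, and the choice of eight lanes is an assumption made for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side stand-in for a vector of 32-bit lanes.
struct IntVec {
    std::array<int32_t, 8> lane;
};

// Operator form: plays the role of "Vdst = Vsrc1 + Vsrc2".
inline IntVec operator+(const IntVec& a, const IntVec& b) {
    IntVec r{};
    for (std::size_t i = 0; i < r.lane.size(); ++i) {
        // Add as unsigned so that overflow wraps instead of being undefined.
        r.lane[i] = static_cast<int32_t>(static_cast<uint32_t>(a.lane[i]) +
                                         static_cast<uint32_t>(b.lane[i]));
    }
    return r;
}

// Function form: plays the role of an intrinsic call such as a lane-wise min.
inline IntVec min_sketch(const IntVec& a, const IntVec& b) {
    IntVec r{};
    for (std::size_t i = 0; i < r.lane.size(); ++i) {
        r.lane[i] = std::min(a.lane[i], b.lane[i]);
    }
    return r;
}
```

On the real target the loops disappear: the compiler maps the operator or intrinsic onto a single vector instruction.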

If there is no direct translation to a single instruction, the migration tool relies on the virtual machine to provide the translation. Each such operation is implemented by a class in the virtual machine. Like the load and store classes, these are template classes, so variations can be specified at compile time through template parameters. For example, this VCOP Kernel-C statement:

Vdst = unpack(Vsrc1, Vsrc2); 

translates to:

bit_unpack<int>::apply(Vdst, Vsrc1, Vsrc2); 

The element type is specified by the template parameter int. All arithmetic classes have a single method called apply() that implements the translation. In this case the apply() method invokes a sequence of four C7x intrinsics.
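The general shape of such a class can be sketched as follows. This is not the virtual machine's implementation of bit_unpack, whose internals are not shown here. The class abs_diff_sketch is a hypothetical operation chosen only to show the pattern of a template parameter for the element type plus a single static apply() method, with std::array standing in for a vector.

```cpp
#include <array>
#include <cstddef>

// Hypothetical operation class following the pattern described above:
// the element type T is a template parameter, and the whole translation
// lives in one static apply() method.
template <typename T>
struct abs_diff_sketch {
    template <std::size_t N>
    static void apply(std::array<T, N>& dst,
                      const std::array<T, N>& a,
                      const std::array<T, N>& b) {
        for (std::size_t i = 0; i < N; ++i) {
            // Order the operands first so the subtraction is never negative.
            // For signed T this can still overflow at the extremes of the range.
            const T hi = a[i] > b[i] ? a[i] : b[i];
            const T lo = a[i] > b[i] ? b[i] : a[i];
            dst[i] = static_cast<T>(hi - lo);
        }
    }
};
```

A call then reads much like the generated code above, for example abs_diff_sketch<int>::apply(dst, a, b). Instantiating the same class with an unsigned element type selects the unsigned variant at compile time, with no runtime dispatch.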