SPRUIG3C January   2018  – August 2019 TDA4VM , TDA4VM-Q1

 

  1.   Read This First
    1.     About This Manual
    2.     Related Documentation
    3.     Trademarks
  2. 1Overview and Scope
    1. 1.1 Comparing VCOP and C7000
    2. 1.2 About this Document
      1. 1.2.1 Documentation Conventions
    3. 1.3 Output Format
    4. 1.4 Data Types
      1. 1.4.1 40-bit Incompatibilities
      2. 1.4.2 40-Bit Detection in Host Emulation Mode
    5. 1.5 SIMD Width
    6. 1.6 VCOP Virtual Machine
  3. 2Kernel API
    1. 2.1 Overview
    2. 2.2 Parameter Block
      1. 2.2.1 Tvals Structure
      2. 2.2.2 Pblock Manipulation
  4. 3Loop Control
    1. 3.1 Overview
    2. 3.2 Loop Control and Nested Loops
    3. 3.3 Repeat Loops
    4. 3.4 Compound Conditions
    5. 3.5 Early Exit
  5. 4Addressing
    1. 4.1 Overview
    2. 4.2 Streaming Engines
    3. 4.3 Streaming Address Generators
    4. 4.4 Indexed Addressing
    5. 4.5 Circular Addressing
  6. 5Operations
    1. 5.1 Load Operations
    2. 5.2 Store Operations
      1. 5.2.1 Predicated Stores
      2. 5.2.2 Scatter and Transposing Stores
      3. 5.2.3 Optimization of OFFSET_NP1-Based Transpose
      4. 5.2.4 Rounding Stores
      5. 5.2.5 Saturating Stores
    3. 5.3 Arithmetic Operations
      1. 5.3.1 Vector Compares
      2. 5.3.2 Multiplication with Rounding, Truncation, or Left Shift
    4. 5.4 Lookup and Histogram Table Operations
      1. 5.4.1 Determination of Table Size
      2. 5.4.2 Table Configuration
      3. 5.4.3 Copy-in Operation
      4. 5.4.4 Copy-out Operation
      5. 5.4.5 Index Adjustment from Non-zero Agen
      6. 5.4.6 Lookup Operation
      7. 5.4.7 Histogram Update Operation
      8. 5.4.8 16-Way Lookup and Histogram
  7. 6Performance
    1. 6.1 Overview
    2. 6.2 Compiler Requirements
    3. 6.3 Automatic Performance Profiling
    4. 6.4 Performance Options
  8.   A Warnings and Notes
    1.     A.1 Compatibility Warnings
    2.     A.2 Efficiency Warnings

Load Operations

VCOP’s fundamental load operation fetches a vector of 8 consecutive elements from memory into the 8 lanes of a vector register. In memory the elements are 8, 16, or 32 bit signed or unsigned values; they are sign or zero extended into the 40-bit lanes of the register. On C7x the migration tool models VCOP’s 40 bit lanes as 32 bit lanes; a vector register contains 16 of these lanes. (In 8-way SIMD mode only 8 of these 16 lanes are used; in 16-way SIMD mode all 16 are used.)

VCOP has various additional “distribution modes” that provide for alternate data layouts in memory: for example, as the data is read in it may be oversampled (read each element multiple times), undersampled (skip elements), deinterleaved, and so on. Descriptions of the distribution modes can be found in the VCOP CPU manual or the programmer’s guide.

Thus, load operations are characterized primarily by the element type in memory and distribution mode. The virtual machine has a template class called vcop_load that implements the various combinations using C7x operations. The template parameters of the class specify the type and distribution mode, along with a specification of what kind of low-level addressing to use (SE, SA, or indirect) and the number of SIMD lanes to emulate (8 or 16). The class has two methods, load() and SE_load(), which implement the load operation as specified by the template parameters. The load() method implements non-SE-based loads, and the SE_load() method implements SE-based loads.

For example, here is the translation of a VCOP load instruction using the CIRC2 distribution mode and SA-based addressing.

Kernel-C:

__vptr_uint16 in;
Vreg = in[Agen0].circ2();

translates to:

vcop_load<ushort, circ2, sa1adv>::load(Vreg, (uchar*)(tvals->p1));

The template parameters ushort, and circ2 specify the data type and distribution mode. The template parameter sa1adv tells the template to use SA-based addressing, with SA1 and advancing enabled. The SIMD factor defaults to 8. The runtime argument Vreg is the destination vector register (passed by reference, since the load writes into it). The tvals->p1 expression is the base address. The load() method, when load template is specialized for ushort, circ2, and sa1adv, results in generation of the specific C7x sequence to implement that combination.

The specific C7x instruction sequences generated by specialized load() methods that result from all combinations of type and distribution mode can be determined by examining the template specializations in the header files.

The load() methods need to handle loading the vector elements according to the distribution mode, and sign- or zero-extension. Most combinations of type and distribution mode are covered by a single C7x instruction. A few need additional instructions to exactly mimic VCOP’s specific modes.

In general SE-based loads, invoked via the SE_load() method, simply rely on the SE configuration as setup in the init() function (see Section 1.5), and translate to a simple access of the corresponding SE source register.

The basic setup for the SE is based on the data type. For example, signed 16-bit data uses element type __SE_ELETYPE_16BIT and __SE_PROMOTE_2X_SIGNEXT for sign-extension to 32 bits. The default vector length for 8-way SIMD is __SE_VECLEN_32BYTE, for 8 lanes of 32-bit (4-byte) data. Additional SE features used to implement the specific distribution modes are configured by the various specializations of the vcop_load template.

C7x can speculatively load without risk of faulting. Therefore most load sequences simply load a full C7x vector’s worth of data—that is, 16 lanes of 32-bit values. In 8-way SIMD mode, the extra values are simply unused.