SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 


Store Operations

Like loads, VCOP's store operations are characterized by data type and distribution mode. Beyond distribution mode, there are two primary considerations when translating VCOP stores for C7x: packing and lane masking.

Packing is the opposite of sign extension. The source data in C7x registers is 32 bits wide. When storing to 16- or 8-bit element types, the elements must be truncated to that size. The C7x has direct instruction support for such packing stores. Packing depends on size but not signedness. Signed and unsigned types generally use the exact same sequence, so there are three fundamental translation sequences for each mode, corresponding to 8-bit, 16-bit, or 32-bit data.
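The size-only dependence of packing can be illustrated with a scalar model. The sketch below is not the C7x packing-store instruction or any virtual machine API; `pack_store` is a hypothetical helper that keeps the low bits of each 32-bit lane, which is what truncation to an 8- or 16-bit element amounts to.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical scalar model of a packing store (illustration only, not a
// C7x intrinsic): each 32-bit lane is truncated to sizeof(Elem) bytes.
template <typename Elem>
void pack_store(const int32_t* lanes, std::size_t count, Elem* dst) {
    static_assert(sizeof(Elem) <= sizeof(int32_t), "element wider than a lane");
    assert(lanes != nullptr && dst != nullptr);
    using U = typename std::make_unsigned<Elem>::type;
    for (std::size_t i = 0; i < count; ++i) {
        // Conversion to an unsigned type is modular, so this keeps exactly
        // the low bits of the lane, whatever the sign of the lane value.
        U low = static_cast<U>(static_cast<uint32_t>(lanes[i]));
        // Copy the bit pattern so signed and unsigned destinations are
        // treated identically.
        std::memcpy(&dst[i], &low, sizeof(Elem));
    }
}
```

Calling `pack_store` on the same lanes with an `int8_t` destination and a `uint8_t` destination leaves identical bytes in memory, which is why one sequence per element size is enough in this model.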

Signedness does come into play for rounding and saturation; these are covered in Section 5.2.4 and Section 5.2.5.

Unlike loads, translation of stores is sensitive to the number of lanes. While excess lanes can be safely loaded and ignored, stores must take care to store only the number of lanes being modeled. In the default 8-way SIMD mode, this means stores are limited to 8 lanes, even though the C7x vectors contain 16 lanes of (32-bit) data. There is instruction support for such partial-vector stores in some cases, but not all. In particular, there are no partial-vector packing stores. Packing stores in 8-way SIMD mode that use regular indirect addressing therefore require an explicit predicate to mask the store of unused lanes. This lane-masking predicate is constant and can be computed outside the loop.
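A scalar model may help make the lane-masking requirement concrete. The sketch below is an illustration only and assumes nothing about the actual C7x predicate registers or intrinsics: `Vec`, `LaneMask`, `make_lane_mask`, `masked_pack_store16`, and `store_rows` are all hypothetical names. It models a 16-lane vector, an 8-lane mask built once outside the loop, and a packing store to 16-bit elements that skips the masked-off lanes.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorLanes = 16;  // 32-bit lanes per vector, per the text
constexpr std::size_t kSimdLanes = 8;     // default VCOP SIMD width being modeled

using Vec = std::array<int32_t, kVectorLanes>;
using LaneMask = std::array<bool, kVectorLanes>;

// Hypothetical helper: enable the first `active` lanes, disable the rest.
LaneMask make_lane_mask(std::size_t active) {
    assert(active <= kVectorLanes);
    LaneMask mask{};  // all lanes start disabled
    for (std::size_t i = 0; i < active; ++i) mask[i] = true;
    return mask;
}

// Hypothetical model of a predicated packing store to 16-bit elements:
// only lanes enabled in `mask` are truncated and written.
void masked_pack_store16(const Vec& v, const LaneMask& mask, uint16_t* dst) {
    for (std::size_t i = 0; i < kVectorLanes; ++i) {
        if (mask[i]) {
            dst[i] = static_cast<uint16_t>(static_cast<uint32_t>(v[i]));
        }
    }
}

// The mask is loop-invariant, so it is built once before the loop.
void store_rows(const Vec* rows, std::size_t n, uint16_t* dst) {
    const LaneMask mask = make_lane_mask(kSimdLanes);
    for (std::size_t r = 0; r < n; ++r) {
        masked_pack_store16(rows[r], mask, dst + r * kSimdLanes);
    }
}
```

Without the mask, each iteration in this model would write 16 elements and overwrite the 8 elements that belong to the next iteration, or run past the end of the buffer on the last one.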

When using SA-based addressing, the SA automatically provides lane masking based on the VECLEN flag in the SA setup vector. In this case the explicit lane masks are not needed.

The virtual machine has a template class called vcop_store that implements the various combinations in terms of C7x operations. The template parameters of the class specify the type and distribution mode, along with a specification of what kind of low-level addressing to use (SA or indirect), and the number of SIMD lanes to emulate (8 or 16). The class has two methods: store() for regular stores and store_pred() for predicated stores. For example, here is the translation of a store instruction using the DS2 distribution mode and SA-based addressing.

Kernel-C source:

__vptr_s8 out;
out[Agen1].ds2() = Vreg;

translates to:

vcop_store<char, ds2, sa0adv>::store(Vreg, (uchar *)(tvals->p2));

The template parameters char and ds2 specify the data type and distribution mode. The template parameter sa0adv tells the template to use SA-based addressing, with SA0 and advancing enabled. The SIMD factor defaults to 8. The runtime argument Vreg is the source vector register. The tvals->p2 expression is the base address.

Most unconditional stores using the basic store distribution modes translate to a one- or two-instruction sequence. Collating stores store a data-dependent number of elements in packed fashion and are therefore more involved.

Efficiency Warning: Collating Stores
Collating stores require a long translation sequence and may perform poorly.
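To show why the element count of a collating store is data dependent, here is a minimal scalar sketch. It assumes that a per-lane predicate selects which elements are kept; `collate_store` is a hypothetical helper written for illustration and is neither the virtual machine's implementation nor the C7x instruction sequence.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a collating store (illustration only):
// lanes whose predicate is true are written back-to-back to dst, and
// the number of elements written is returned.
std::size_t collate_store(const int32_t* lanes, const bool* pred,
                          std::size_t count, int32_t* dst) {
    assert(lanes != nullptr && pred != nullptr && dst != nullptr);
    std::size_t written = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (pred[i]) {
            dst[written] = lanes[i];  // next free slot, not slot i
            ++written;
        }
    }
    return written;
}
```

Because the output position of each element depends on how many earlier lanes were selected, the write address cannot be derived from the lane index alone, which is consistent with the longer translation sequence noted in the warning above.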