SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 


Tvals Structure

To preserve the dispatch API, the C7x translation uses a similar approach. Expressions that are kernel invariant are precomputed by the init() function and stored in a memory block; the translated kernel then accesses them through the block’s address, which is passed to the vloops() function. However, the size, layout, and contents of the block are different. On C7x, restricting parameter elements to 16-bit values is needlessly inefficient; parameter expressions often represent constants, addresses, or even vectors (for example, Streaming Engine setup vectors). For this reason, the pblock object on C7x is defined as a structure whose fields are defined by the migration tool. The address of the structure is passed to the init() function, which assigns to the fields of the structure, and to the vloops() function, which accesses them. The structure has type <kernel>_tvals_t, and is referred to simply as the tvals structure throughout this document.

Even though the translated parameter block object is a structure, the pblock pointer parameters to the init() and vloops() functions are declared as unsigned short* to preserve API compatibility. Each function casts the pointer to a pointer to the tvals structure before using it.
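The cast described above can be sketched as follows. This is a minimal illustration and not the migration tool's actual output: the kernel name mykernel, the field names, the function signatures, and the function bodies are all hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical tvals structure. The real one is generated by the migration
   tool; these field names are illustrative only. */
typedef struct
{
   const int *src;     /* captured kernel parameter */
   int       *dst;     /* captured kernel parameter */
   int        count;   /* captured kernel parameter */
   int        last;    /* precomputed kernel-invariant expression: count - 1 */
} mykernel_tvals_t;

/* init(): the pblock arrives as unsigned short* and is cast to the structure
   type before the fields are assigned. */
void mykernel_init(const int *src, int *dst, int count, unsigned short *pblock)
{
   mykernel_tvals_t *tvals = (mykernel_tvals_t *)pblock;

   assert(pblock != NULL);
   tvals->src   = src;
   tvals->dst   = dst;
   tvals->count = count;
   tvals->last  = count - 1;
}

/* vloops(): the same pointer is cast back and the stored fields are read. */
void mykernel_vloops(unsigned short *pblock)
{
   mykernel_tvals_t *tvals = (mykernel_tvals_t *)pblock;

   for (int i = 0; i <= tvals->last; i++)
      tvals->dst[i] = 2 * tvals->src[i];
}
```

In standard C, the memory behind the unsigned short* must be suitably aligned for the structure type. Memory returned by malloc() satisfies this; a plain array of unsigned short is not guaranteed to.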

The param_count() API function allows the client to allocate the pblock rather than using the built-in statically allocated one. It returns the required size of the pblock in unsigned short units. For example, a user-managed pblock might be allocated as:

typedef unsigned short ptype;
ptype *my_pblock = (ptype*)malloc(mykernel_param_count() * sizeof(ptype)); 

To preserve compatibility, for C7x the return value of the param_count() function is still in terms of unsigned short units, even though the underlying object is actually a structure. This allows client code that allocates the pblock as above to remain unchanged.
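One way a param_count() function could report a structure size in unsigned short units is to round the structure size up to a whole number of units. The sketch below assumes that approach; the structure contents and the implementation of mykernel_param_count() are hypothetical and are not taken from the migration tool's generated code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tvals structure standing in for the generated one. */
typedef struct
{
   int    N;
   void  *ptr;
   double scale;
} mykernel_tvals_t;

/* Assumed implementation: size of the structure, rounded up to a whole
   number of unsigned short units. */
unsigned int mykernel_param_count(void)
{
   size_t unit = sizeof(unsigned short);
   return (unsigned int)((sizeof(mykernel_tvals_t) + unit - 1) / unit);
}

/* Client-side allocation, written the same way as for VCOP. */
unsigned short *alloc_pblock(void)
{
   size_t bytes = mykernel_param_count() * sizeof(unsigned short);

   /* The allocation must be large enough to hold the whole structure. */
   assert(bytes >= sizeof(mykernel_tvals_t));
   return (unsigned short *)malloc(bytes);
}
```

Because the client multiplies the returned count by sizeof(unsigned short), the allocation covers the full structure without the client needing to know its type.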

The tvals structure consists of one field for each kernel parameter to capture its value, plus a nested substructure for each vloop in the kernel to hold loop-specific expressions. Furthermore, each substructure is actually an array of substructures. This is to handle vloops that use the “repeat” feature, whereby the vloop runs multiple times using different parameter sets.

Table 2-1 illustrates the migration tool-generated tvals structure for a simple kernel with two vloop commands, the first of which is inside a repeat loop.

Table 2-1 Migration tool-generated parameter block access via tvals structure

Kernel-C Program:

kernel(__vptr_int32 parm1, int N)
{
   // "repeat" loop
   foreach (roi, roi < N, 5)
   {
      // vloop 0
      for (...)
      {
      }
   }

   // vloop 1
   for (...)
   {
   }
}

Generated Tvals Structure:

typedef struct
{
   __vptr_int32 parm1;   // kernel parm
   int N;                // kernel parm
   struct                // tvals for loop0
   {
      struct
      {
         int p0; 
         uchar* p1; 
         ... 
      } tvals[5]; 
   } loop0;

   struct                // tvals for loop1
   {
      struct 
      {
         uchar* p0;
         uint p1;
         uint16 p2; 
         ... 
      } tvals[1]; 
   } loop1;

} kernel_tvals_t;

Access to Parameters in vloops() Function:

kernel_vloops(ushort *pblock)   // ushort* for API compatibility
{
   kernel_tvals_t *tvals = (kernel_tvals_t *)pblock;    // cast as tvals_t*

   // retrieve captured kernel parameters
   __vptr_int32 parm1 = tvals->parm1;
   int N = tvals->N;

   // access to kernel parameters
   ... parm1 ...

   // access to parameters for vloop 0 (index by repeat loop counter)
   ... tvals->loop0.tvals[roi].p0 ... 

   // access to parameters for vloop 1 
   ... tvals->loop1.tvals[0].p2 ... 
}
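Table 2-1 shows only the vloops() side. The sketch below adds a hypothetical init() side to show how the per-repetition array entries might be populated and later indexed by the repeat counter. The structure is a reduced stand-in for the one in the table; the field types, the placeholder values, and the function names kernel_init_sketch() and kernel_vloops_sketch() are assumptions made for illustration.

```c
#include <assert.h>

#define MAX_REPEAT 5   /* matches the bound of 5 on the repeat loop in the table */

/* Reduced, hypothetical version of the structure in the table. */
typedef struct
{
   int N;                                                  /* kernel parm */
   struct { struct { int p0; } tvals[MAX_REPEAT]; } loop0; /* one entry per repetition */
   struct { struct { int p2; } tvals[1]; } loop1;          /* not repeated: one entry */
} kernel_tvals_t;

/* init() side: fill one parameter set per repetition of vloop 0. */
void kernel_init_sketch(int N, unsigned short *pblock)
{
   kernel_tvals_t *tvals = (kernel_tvals_t *)pblock;

   assert(N >= 0 && N <= MAX_REPEAT);
   tvals->N = N;
   for (int roi = 0; roi < N; roi++)
      tvals->loop0.tvals[roi].p0 = 10 * roi;   /* placeholder expression */
   tvals->loop1.tvals[0].p2 = 99;              /* placeholder expression */
}

/* vloops() side: index vloop 0 parameters by the repeat counter and
   vloop 1 parameters by 0. Returns the sum of the values read. */
int kernel_vloops_sketch(unsigned short *pblock)
{
   kernel_tvals_t *tvals = (kernel_tvals_t *)pblock;
   int acc = 0;

   for (int roi = 0; roi < tvals->N; roi++)
      acc += tvals->loop0.tvals[roi].p0;
   acc += tvals->loop1.tvals[0].p2;
   return acc;
}
```

The array bound on loop0.tvals fixes the maximum number of repetitions at structure-definition time, which is why the repeat loop carries an explicit upper bound.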

Compatibility Warning: Pblock Size
The parameter block tends to be larger on C7x than on VCOP. Programs that allocate the pblock (dynamically or statically) without using the kernel_pblock_size() function may fail if the allocated space is too small for the translated tvals structure.
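Client code that keeps a fixed-size pblock from a VCOP build can guard against this failure with a capacity check before calling init(). The sketch below is one hypothetical way to do so. It uses the unsigned short unit count described earlier for param_count(); the helper pblock_fits(), the buffer legacy_pblock, and the stub mykernel_param_count() with its return value of 48 are all invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Stub standing in for the generated param_count() function. The value 48
   is arbitrary and chosen only for this illustration. */
unsigned int mykernel_param_count(void)
{
   return 48u;
}

/* Returns nonzero if a buffer holding capacity_units unsigned shorts is
   large enough for a pblock that needs required_units unsigned shorts. */
int pblock_fits(size_t capacity_units, unsigned int required_units)
{
   return capacity_units >= (size_t)required_units;
}

/* Hypothetical pblock sized for the original VCOP kernel. */
static unsigned short legacy_pblock[32];

/* Returns nonzero if the legacy buffer can be reused for the C7x kernel. */
int legacy_pblock_is_large_enough(void)
{
   size_t capacity = sizeof(legacy_pblock) / sizeof(legacy_pblock[0]);
   return pblock_fits(capacity, mykernel_param_count());
}
```

With these illustrative numbers the check fails, signalling that the buffer must be enlarged or allocated dynamically from the reported size.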