SPRUIG3C January   2018  – August 2019 TDA4VM , TDA4VM-Q1

 

  1.   Read This First
    1.     About This Manual
    2.     Related Documentation
    3.     Trademarks
  2. 1Overview and Scope
    1. 1.1 Comparing VCOP and C7000
    2. 1.2 About this Document
      1. 1.2.1 Documentation Conventions
    3. 1.3 Output Format
    4. 1.4 Data Types
      1. 1.4.1 40-bit Incompatibilities
      2. 1.4.2 40-Bit Detection in Host Emulation Mode
    5. 1.5 SIMD Width
    6. 1.6 VCOP Virtual Machine
  3. 2Kernel API
    1. 2.1 Overview
    2. 2.2 Parameter Block
      1. 2.2.1 Tvals Structure
      2. 2.2.2 Pblock Manipulation
  4. 3Loop Control
    1. 3.1 Overview
    2. 3.2 Loop Control and Nested Loops
    3. 3.3 Repeat Loops
    4. 3.4 Compound Conditions
    5. 3.5 Early Exit
  5. 4Addressing
    1. 4.1 Overview
    2. 4.2 Streaming Engines
    3. 4.3 Streaming Address Generators
    4. 4.4 Indexed Addressing
    5. 4.5 Circular Addressing
  6. 5Operations
    1. 5.1 Load Operations
    2. 5.2 Store Operations
      1. 5.2.1 Predicated Stores
      2. 5.2.2 Scatter and Transposing Stores
      3. 5.2.3 Optimization of OFFSET_NP1-Based Transpose
      4. 5.2.4 Rounding Stores
      5. 5.2.5 Saturating Stores
    3. 5.3 Arithmetic Operations
      1. 5.3.1 Vector Compares
      2. 5.3.2 Multiplication with Rounding, Truncation, or Left Shift
    4. 5.4 Lookup and Histogram Table Operations
      1. 5.4.1 Determination of Table Size
      2. 5.4.2 Table Configuration
      3. 5.4.3 Copy-in Operation
      4. 5.4.4 Copy-out Operation
      5. 5.4.5 Index Adjustment from Non-zero Agen
      6. 5.4.6 Lookup Operation
      7. 5.4.7 Histogram Update Operation
      8. 5.4.8 16-Way Lookup and Histogram
  7. 6Performance
    1. 6.1 Overview
    2. 6.2 Compiler Requirements
    3. 6.3 Automatic Performance Profiling
    4. 6.4 Performance Options
  8.   A Warnings and Notes
    1.     A.1 Compatibility Warnings
    2.     A.2 Efficiency Warnings

Lookup and Histogram Table Operations

VCOP has dedicated support for Lookup and Histogram Table (LHT) operations that access memory in a non-linear fashion. Both operate by indexing into a table using index values that contain offsets from the table base. In the case of lookup table, the lookup operation loads from memory at the indexed locations and returns the values. In the case of histogram, the update operation increments the memory at the indexed locations.

Both lookup table and histogram support the notion of parallel tables, in which up to 8 distinct tables are referenced using index values contained in the lanes of a vector register (and in the case of lookup, values read from the table are returned in corresponding lanes). To enable parallel access to the tables, the tables are interleaved in memory such that each distinct table occupies a different memory bank or banks than the others.

C7x supports LHT operations in a very similar way. However there are two important differences that fundamentally affect the translation of these operations to C7x.

First, VCOP supports LHT operations directly on its local memory, consisting of the IBUFA, IBUFB, and WBUF buffers. Each is 32KB, for a total of 96KB. Typically these buffers will be mapped to L2 on C7x. But C7x supports LHT operations only on a 32KB area of L1D. Therefore LHT operations require table data to be copied from L2 into L1D before the operation, and then (in the case of histogram) copied back out.

Second, VCOP’s memory is laid out as 8 banks, each 32 bits wide. Each full-width line from the memory interface contains 8x32 or 256 bits. C7x’s memory is laid out as 16 banks, each 64 bits wide. Each line contains 16x64 or 1024 bits. For LHT operations that use parallel tables, the bank geometry affects the table layout in memory.

For example, consider a LHT operation with 4 parallel tables and 32-bit elements. We’ll call them T0, T1, T2, and T3. (In the following description, for simplicity, “locations” refer to word offsets within the linear address space of table memory. In reality, both machines are byte addressable.) On VCOP each table occupies 2 of the 8 banks. In each table line, the two banks corresponding to a given table contain 2 elements: locations 0 and 1 are part of table T0; locations 2 and 3 are part of T1, and so on. On C7x, each table occupies 4 of the 16 banks; in each line the 4 64-bit banks contain 8 table elements. Locations 0—7 are part of T0; locations 8—15 are part of T1, and so on. Figure 5-1 illustrates the two layouts for the table in this example. In this figure, TN[M] represents the element at index M of table N. Shading indicates memory banks. Heavy borders indicate table boundaries.

GUID-D140DE96-12A9-42CC-B9E7-ABE41E84E07F-low.pngFigure 5-1 Differences in Table Layout Between VCOP and C7x

VCOP kernels assume tables are laid out according to VCOP’s geometry; therefore this layout must be preserved at kernel boundaries. In order to operate on such tables with C7x LHT operations, the tables must be rearranged to conform to C7X’s geometry.

In general, a kernel that uses LHT operations must go through the following steps when translated to C7x:

  1. (In the init() function) Compute the LTCR control register flags to configure the table for the number of parallel tables and element type, and store the configuration in the tvals structure.
  2. (Before loop)
    1. Copy table data from L2 location to L1D.
    2. Rearrange table layout from VCOP to C7x.
  3. (Before loop) Configure the table using the value from the tvals structure
  4. (Within loop) Perform the lookup or histogram operation on table in L1D
  5. (After loop, histogram only)
    1. Copy table data from L1D to L2
    2. Rearrange table layout from C7x to VCOP

In practice, steps 2a and 2b (copy and rearrange) are combined: the table is rearranged on-the-fly as it is copied. Similarly, steps 5a and 5b are combined.

Efficiency Warning: LHT Operations
The translation of LHT operations suffers from the overhead introduced by copying and rearranging the table data. This step takes approximately one cycle per 512 bits (64 bytes) of data. The exact penalty depends on the size of the table.

The virtual machine contains two template classes, vcop_lookup and vcop_hist. These implement the translation steps in the previous outline. The template parameters specify the element data type, the number of parallel tables, and in the case of vcop_lookup, the interpolation degree. Both classes are based on another template called LHT_base which provides functionality common to both classes.