SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 


16-Way Lookup and Histogram

VCOP memory is arranged as lines of 8 32-bit banks, for a total of 256 bits per line. This layout complements the 8-way SIMD width of table lookup operations: each SIMD lane corresponds to a lookup in one memory bank. The table layout in VCOP memory reflects this architecture.
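The bank arrangement described above can be pictured with a small C model. This is only an illustrative sketch: it assumes that consecutive 32-bit words fall in consecutive banks, which the text implies but does not state, and the names bank_of and line_of are hypothetical helpers, not part of any TI header.

```c
#include <stdint.h>

/* 8 banks x 32 bits = 256 bits per line, as stated in the text. */
enum { BANKS_PER_LINE = 8, BYTES_PER_BANK = 4 };

/* ASSUMPTION: consecutive 32-bit words map to consecutive banks. */

/* Bank that holds the 32-bit word containing byte_offset. */
unsigned bank_of(uint32_t byte_offset)
{
    return (byte_offset / BYTES_PER_BANK) % BANKS_PER_LINE;
}

/* Memory line that holds the 32-bit word containing byte_offset. */
unsigned line_of(uint32_t byte_offset)
{
    return byte_offset / (BANKS_PER_LINE * BYTES_PER_BANK);
}
```

Under this assumption, eight lookups whose addresses resolve to eight different banks can be serviced from eight banks at once, which is the pairing of lanes and banks that the paragraph above describes.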

The table layout must not change between 8-way and 16-way SIMD. The producer of a table is independent of its consumer, so the producer need not know whether the table will be read using 8-way or 16-way SIMD. For 16-way SIMD, therefore, only the lookup operation itself changes; the source table layout in L2 does not.

These considerations result in the following default scheme to support 16-way lookup and histogram:

  • The table is assumed to be laid out in L2 memory according to the num_tables parameter of the _LOOKUP directive. This layout is dictated by VCOP’s memory architecture and num_tables, and is independent of the SIMD width that will be used to read the table.
  • If vcop_simd=16 is specified, the lookup table operation is assumed to use twice the number of lanes specified by num_tables. For example, if num_tables==8, the lookup operation uses 16 lanes. If num_tables==2, the lookup operation uses 4 lanes.
  • Doubling the lookup width is enabled by duplicating the table in L1D as it is copied from L2. In effect, the N-way table becomes a 2N-way table, with the data from lanes [0,N-1] duplicated in lanes [N, 2N-1] (where N is num_tables).
  • For a histogram operation, when the table is copied back to L2, the accumulated sums from the duplicated tables must be added together (with saturation).
  • In this way the lookup throughput is doubled without requiring a change to the table layout in L2.
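The copy-in duplication and the histogram copy-out merge listed above can be sketched in plain C. This is a behavioral model under stated assumptions, not the code the tools generate: it assumes entries are interleaved by lane (entry e of lane t at index e * lanes + t), it assumes 16-bit unsigned bins, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTIONS (not taken from the manual): lane-interleaved layout,
 * uint16_t elements, and unsigned saturation at UINT16_MAX. */

/* Copy-in: expand an N-lane table from L2 into a 2N-lane table in L1D. */
void copy_in_duplicate(const uint16_t *l2, uint16_t *l1d,
                       size_t n_lanes, size_t entries)
{
    for (size_t e = 0; e < entries; e++) {
        for (size_t t = 0; t < n_lanes; t++) {
            uint16_t v = l2[e * n_lanes + t];
            l1d[e * 2 * n_lanes + t] = v;            /* lanes [0, N-1]  */
            l1d[e * 2 * n_lanes + n_lanes + t] = v;  /* lanes [N, 2N-1] */
        }
    }
}

/* Histogram copy-out: fold the two copies together with a saturating add.
 * This models only the merge step; how non-zero initial bin values are
 * treated on copy-in is not covered by this sketch. */
void copy_out_merge(const uint16_t *l1d, uint16_t *l2,
                    size_t n_lanes, size_t entries)
{
    for (size_t e = 0; e < entries; e++) {
        for (size_t t = 0; t < n_lanes; t++) {
            uint32_t sum = (uint32_t)l1d[e * 2 * n_lanes + t]
                         + (uint32_t)l1d[e * 2 * n_lanes + n_lanes + t];
            l2[e * n_lanes + t] =
                (sum > UINT16_MAX) ? (uint16_t)UINT16_MAX : (uint16_t)sum;
        }
    }
}
```

In this model the L2 buffer keeps its N-lane shape on both sides of the kernel, which is the point of the scheme: only the L1D copy is widened.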

A few considerations apply to this scheme.

  • First, the table requires twice as much memory in L1D as in L2. The maximum table size that can be supported in 16-way SIMD is therefore 16K bytes.
  • Second, because of the duplication, the throughput of the copy-in/out operations is 256 bits per cycle instead of 512 bits as in 8-way mode.
  • Third, as with arithmetic kernels, the kernel must be written in a SIMD-width-independent way in order to take advantage of 16-way SIMD. That is, loop control (trip counts) and addressing must be expressed in terms of VCOP_SIMD_WIDTH rather than being hard-coded or, as is more commonly the case, expressed in terms of num_tables.
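The first and third considerations above reduce to simple arithmetic, sketched here in C. The helper names are hypothetical, and the fallback definition of VCOP_SIMD_WIDTH is only a stand-in so the fragment compiles on its own; in a real kernel the value is supplied by the tools.

```c
#include <stddef.h>

#ifndef VCOP_SIMD_WIDTH
#define VCOP_SIMD_WIDTH 8  /* stand-in value for this sketch only */
#endif

/* Trip count written in terms of VCOP_SIMD_WIDTH instead of a literal 8
 * or num_tables, so it halves automatically in 16-way mode. */
size_t vector_trip_count(size_t n_elements)
{
    return (n_elements + VCOP_SIMD_WIDTH - 1) / VCOP_SIMD_WIDTH;
}

/* L1D footprint of a table that occupies l2_bytes in L2, under the
 * default duplication: 1x for 8-way, 2x for 16-way. */
size_t l1d_table_bytes(size_t l2_bytes)
{
    return l2_bytes * (VCOP_SIMD_WIDTH / 8);
}
```

A loop bound written as a literal, or derived from num_tables, would stay the same in 16-way mode and the kernel would process each element twice or read past its data, which is why the text asks for bounds derived from VCOP_SIMD_WIDTH.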

This scheme is the default, but is not beneficial for all cases of lookup operations. For this reason, _LOOKUP has been given an additional parameter to control this table duplication behavior, duplication:

_LOOKUP(num_tables, num_pts, duplication, table_size)

By default, duplication is VCOP_SIMD_WIDTH/8. That is, it is 1 for 8-way and 2 for 16-way. In addition, num_pts may be up to VCOP_SIMD_WIDTH. The duplication parameter is particularly useful for one-table configurations. Consider the case of:

_LOOKUP(1, 8)

If the desired effect is to look up as many points as possible, this specification may be rewritten as:

_LOOKUP(1, VCOP_SIMD_WIDTH, 1)

This configuration will look up 8 points in 8-way mode and 16 points in 16-way mode from a single table in L1D. If the desired effect is to look up sets of 8 points, the specification may instead be rewritten as:

_LOOKUP(1, 8, VCOP_SIMD_WIDTH/8)

This configuration will look up 8 points using one index from one table in 8-way mode. In 16-way mode, the table will be duplicated in memory, two index values will be used, and two sets of 8 points will be returned.
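The two one-table rewrites can be compared with a small C model of the _LOOKUP parameters. The struct, the function names, and the formulas for indices and points per operation are assumptions inferred from the examples in this section; they are not an API of the migration tool.

```c
/* Hypothetical model of the first three _LOOKUP parameters. */
typedef struct {
    int num_tables;
    int num_pts;
    int duplication;
} lookup_cfg;

/* Index values consumed by one lookup operation (inferred formula). */
int indices_per_lookup(lookup_cfg c)
{
    return c.num_tables * c.duplication;
}

/* Data points returned by one lookup operation (inferred formula). */
int points_per_lookup(lookup_cfg c)
{
    return c.num_tables * c.duplication * c.num_pts;
}

/* Models _LOOKUP(1, VCOP_SIMD_WIDTH, 1): as many points as possible. */
lookup_cfg max_points_cfg(int simd_width)
{
    lookup_cfg c = { 1, simd_width, 1 };
    return c;
}

/* Models _LOOKUP(1, 8, VCOP_SIMD_WIDTH/8): sets of 8 points. */
lookup_cfg sets_of_8_cfg(int simd_width)
{
    lookup_cfg c = { 1, 8, simd_width / 8 };
    return c;
}
```

Both configurations return 16 points per operation in 16-way mode. The difference is that the first reads them from a single table copy with one index, while the second uses two indices and relies on a duplicated table.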