SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 

  Read This First
    About This Manual
    Related Documentation
    Trademarks
  1 Overview and Scope
    1.1 Comparing VCOP and C7000
    1.2 About this Document
      1.2.1 Documentation Conventions
    1.3 Output Format
    1.4 Data Types
      1.4.1 40-bit Incompatibilities
      1.4.2 40-Bit Detection in Host Emulation Mode
    1.5 SIMD Width
    1.6 VCOP Virtual Machine
  2 Kernel API
    2.1 Overview
    2.2 Parameter Block
      2.2.1 Tvals Structure
      2.2.2 Pblock Manipulation
  3 Loop Control
    3.1 Overview
    3.2 Loop Control and Nested Loops
    3.3 Repeat Loops
    3.4 Compound Conditions
    3.5 Early Exit
  4 Addressing
    4.1 Overview
    4.2 Streaming Engines
    4.3 Streaming Address Generators
    4.4 Indexed Addressing
    4.5 Circular Addressing
  5 Operations
    5.1 Load Operations
    5.2 Store Operations
      5.2.1 Predicated Stores
      5.2.2 Scatter and Transposing Stores
      5.2.3 Optimization of OFFSET_NP1-Based Transpose
      5.2.4 Rounding Stores
      5.2.5 Saturating Stores
    5.3 Arithmetic Operations
      5.3.1 Vector Compares
      5.3.2 Multiplication with Rounding, Truncation, or Left Shift
    5.4 Lookup and Histogram Table Operations
      5.4.1 Determination of Table Size
      5.4.2 Table Configuration
      5.4.3 Copy-in Operation
      5.4.4 Copy-out Operation
      5.4.5 Index Adjustment from Non-zero Agen
      5.4.6 Lookup Operation
      5.4.7 Histogram Update Operation
      5.4.8 16-Way Lookup and Histogram
  6 Performance
    6.1 Overview
    6.2 Compiler Requirements
    6.3 Automatic Performance Profiling
    6.4 Performance Options
  A Warnings and Notes
    A.1 Compatibility Warnings
    A.2 Efficiency Warnings

Comparing VCOP and C7000

VCOP kernels are written in VCOP Kernel-C, a domain-specific language that is a subset of C++. A migration tool accepts kernels written in Kernel-C and outputs code for C7x that is functionally equivalent, within limits. This document describes the VCOP Kernel-C migration tool: its scope, capabilities, and limitations.

When compiled for native execution on EVE, VCOP Kernel-C is translated to EVE code by a tool called vcc-arp32. (VCC stands for VCOP C Compiler.) The vcc-arp32 tool generates kernel code for VCOP, as well as C code for ARP32 that sets up the execution environment so that the kernel can be called as a C function. This C code is then compiled by the ARP32 compiler for native execution on EVE.

For C7x, the VCOP Kernel-C migration tool is called vcc7x. It translates a kernel written in VCOP Kernel-C into a C function that can be compiled for C7x and called in the same way as on ARP32.

Typically, the kernel is called by dispatch code running on the ARP32 that manages buffers, transfers data, and synchronizes execution between VCOP and ARP32. The APIs that provide this functionality are part of the ARP32 RTS; they are re-implemented for C7x so that dispatch code can be migrated with little or no modification.

The migration tool is supported by a library that implements specific VCOP operations on C7x. The library is called the VCOP Virtual Machine (VVM, or simply virtual machine) since it layers a VCOP-like interface onto the underlying C7x. VCC translates Kernel-C statements into calls to the virtual machine’s APIs. The VVM is implemented as a C++ library that uses template classes and inline functions to achieve abstraction and efficiency.
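As a rough illustration of how template classes and inline functions can layer a VCOP-like vector interface onto another target, the sketch below models an 8-lane vector type in portable C++. The names `VvmVec`, `vvm_load`, and `vvm_add` are hypothetical and chosen only for this example; they are not the actual VVM API, and a real implementation would map onto C7x vector types rather than `std::array`.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a VVM-style vector type (NOT the real VVM API).
// N lanes of element type T; N defaults to 8 to mirror VCOP's 8-way SIMD.
template <typename T, std::size_t N = 8>
struct VvmVec {
    std::array<T, N> lane{};
};

// Hypothetical load of N consecutive elements starting at p.
template <typename T, std::size_t N = 8>
inline VvmVec<T, N> vvm_load(const T* p) {
    VvmVec<T, N> v;
    for (std::size_t i = 0; i < N; ++i) v.lane[i] = p[i];
    return v;
}

// Hypothetical element-wise add. Being an inline template, it can be
// expanded at the call site, so the abstraction need not cost a call.
template <typename T, std::size_t N>
inline VvmVec<T, N> vvm_add(const VvmVec<T, N>& a, const VvmVec<T, N>& b) {
    VvmVec<T, N> r;
    for (std::size_t i = 0; i < N; ++i) {
        r.lane[i] = static_cast<T>(a.lane[i] + b.lane[i]);
    }
    return r;
}
```

In this model, a translated Kernel-C statement such as a vector add would become a call like `vvm_add(vvm_load<int32_t>(a), vvm_load<int32_t>(b))`, with the element type and lane count fixed at compile time.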

On the surface, VCOP and C7x are similar: both provide SIMD capability in the form of vector-oriented datapaths, and both are designed for iterative (rather than thread-level) operation using pipelined execution.

However, there are also many differences.

  • VCOP has 8-way SIMD with 40-bit integer arithmetic in each lane, for a total of 320 bits of arithmetic width. C7x has 512 bits of width with variable lane widths, arranged as 64 lanes x 8 bits, 32 x 16, 16 x 32, or 8 x 64.
  • VCOP has built-in control for up to four levels of nested looping, whereas C7x relies on traditional explicit loop control and software pipelining.
  • VCOP has built-in address generators that automatically compute multi-dimensional array accesses; C7x has the Streaming Engine (SE) and Streaming Address Generators (SA), which are similar concepts but differ in the details.
  • VCOP has built-in support for lookup-table and histogram operations; C7x also supports these operations but not as directly.
  • VCOP has a rich set of addressing modes that enable loads and stores to perform upsampling, downsampling, interleaving, scatter/gather, and so on. C7x has support for many of these operations but others require software emulation.
  • VCOP has 96K bytes of L1 memory for buffers, lookup tables, and histogram bins that must be manually loaded and unloaded; C7x has 32K bytes of L1D SRAM (in addition to an L1 data cache that is automatically managed).
  • Finally, VCOP has a parameter buffer that allows kernel-invariant values to be computed once, before the kernel executes. C7x has no such dedicated support, although a similar mechanism can be implemented in software.
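The width figures in the first bullet can be checked with a few lines of arithmetic. The sketch below is a simplified model, not a description of hardware behavior: the constant and function names are invented for this example, and `wrap40` and `wrap32` assume plain two's-complement wraparound. It also shows why lane width matters during migration: an intermediate value that fits in a 40-bit lane may not fit in a 32-bit lane.

```cpp
#include <cassert>
#include <cstdint>

// Figures taken from the comparison above.
constexpr int kVcopLanes     = 8;
constexpr int kVcopLaneBits  = 40;
constexpr int kC7xVectorBits = 512;

// Total VCOP arithmetic width: 8 lanes x 40 bits.
constexpr int vcop_total_bits() { return kVcopLanes * kVcopLaneBits; }

// Number of C7x lanes for a given element width in bits.
inline int c7x_lanes(int element_bits) {
    assert(element_bits > 0 && kC7xVectorBits % element_bits == 0);
    return kC7xVectorBits / element_bits;
}

// Simplified model (an assumption): keep the low 40 bits and sign-extend.
inline int64_t wrap40(int64_t v) {
    const uint64_t mask = (uint64_t{1} << 40) - 1;
    uint64_t u = static_cast<uint64_t>(v) & mask;
    if (u & (uint64_t{1} << 39)) u |= ~mask;  // replicate the sign bit
    return static_cast<int64_t>(u);
}

// Simplified model (an assumption): keep the low 32 bits, reinterpret as signed.
inline int32_t wrap32(int64_t v) {
    return static_cast<int32_t>(static_cast<uint32_t>(v));
}

static_assert(vcop_total_bits() == 320, "8 lanes x 40 bits");
```

For example, `int64_t{INT32_MAX} + 1` is preserved by `wrap40` but becomes `INT32_MIN` under `wrap32`, which is the kind of discrepancy a migrated kernel has to guard against when it runs in 32-bit lanes.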

All these differences must be accounted for when translating code written for VCOP to C7x.
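One of these differences, the parameter buffer, can be approximated in ordinary C++ by splitting a kernel into an initialization step and an execution step. The sketch below is only an illustration of that idea under stated assumptions: `ScaleParams`, `scale_init`, and `scale_run` are hypothetical names, and the scaling kernel itself is invented for the example rather than taken from the migration tool's output.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical parameter block: values that do not change between
// iterations of the kernel are computed once and stored here.
struct ScaleParams {
    int32_t     gain;   // multiplier applied to every element
    int         shift;  // right shift applied after the multiply
    int32_t     round;  // rounding term, derived from shift
    std::size_t count;  // number of elements to process
};

// Initialization step: compute the kernel-invariant values once.
inline ScaleParams scale_init(int32_t gain, int shift, std::size_t count) {
    ScaleParams p;
    p.gain  = gain;
    p.shift = shift;
    p.round = (shift > 0) ? (int32_t{1} << (shift - 1)) : 0;
    p.count = count;
    return p;
}

// Execution step: the loop only reads the precomputed parameters.
inline void scale_run(const ScaleParams& p, const int16_t* in, int16_t* out) {
    for (std::size_t i = 0; i < p.count; ++i) {
        const int32_t v = (int32_t{in[i]} * p.gain + p.round) >> p.shift;
        out[i] = static_cast<int16_t>(v);
    }
}
```

With this split, dispatch code can call `scale_init` once and then call `scale_run` repeatedly on different buffers without repeating the setup work.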