SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 

Automatic Performance Profiling

To help evaluate how translated kernels running on C7x perform compared to native execution on VCOP, an automatic profiling mechanism has been added to both the Kernel-C compiler (vcc-arp32) and the migration tool (vcc7x). It measures the execution cycles of both the init() function and the vloops() function on both targets so they can be compared.

The profiling mechanism relies on the built-in TSC cycle counter available on both EVE and C7x. The TSC is accessed through API calls in the RTS.

On EVE, the kernel executes asynchronously. That is, the vloops() function only dispatches the kernel loops to VCOP and returns without waiting for them to complete. A subsequent call to _vcop_vloop_done() synchronizes ARP32 execution with the completion of the kernel. EVE programmers can manage this synchronization themselves by calling the init() function, the vloops() function, and _vcop_vloop_done() from the client code, possibly interspersed with other operations such as memory transfers.
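
The call ordering described above can be sketched on a host machine as follows. This is only an illustration of the sequence, not EVE code: sim_init(), sim_vloops(), sim_vloop_done(), and sim_memory_transfer() are hypothetical stand-ins for the generated init() and vloops() functions, _vcop_vloop_done(), and some unrelated client work. Their argument lists are assumptions, since the real functions take kernel-specific parameters.

```c
#include <stdbool.h>

/* Host-side stand-ins (hypothetical). On EVE, the real counterparts are
 * the generated init()/vloops() functions and _vcop_vloop_done(). */
static bool vcop_busy = false;      /* models "kernel still running on VCOP" */
static int  transfers_done = 0;     /* counts overlapped client operations   */

static void sim_init(void)            { /* would set up the parameter block */ }
static void sim_vloops(void)          { vcop_busy = true;  } /* dispatch only, returns at once */
static void sim_vloop_done(void)      { vcop_busy = false; } /* models waiting for completion  */
static void sim_memory_transfer(void) { transfers_done++;  } /* unrelated client work          */

/* Reports whether the modeled kernel is still in flight. */
bool kernel_is_busy(void) { return vcop_busy; }

/* Low-level invocation: the client does other work while the kernel runs,
 * then synchronizes. Returns the number of overlapped operations so far. */
int run_kernel_overlapped(void)
{
    sim_init();
    sim_vloops();            /* kernel dispatched; ARP32 continues   */
    sim_memory_transfer();   /* work overlapped with kernel execution */
    sim_vloop_done();        /* synchronize before using the results  */
    return transfers_done;
}
```

Calling run_kernel_overlapped() once performs one overlapped transfer and leaves the modeled kernel idle. The point of the sketch is only the ordering: results must not be consumed before the synchronization call.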

Alternatively, programmers can use the higher-level kernel() function call, which wraps these other calls. The kernel() function waits for the VCOP loop to complete before returning. Automatic profiling is supported only when the kernel is invoked through the higher-level kernel() API. However, users who use the lower-level calls can still use the profiling mechanism manually by inserting calls to the timer functions themselves.

The mechanism works as follows:

  • Profiling is enabled via the --profile switch on the VCC command line.
  • VCC defines two global variables for each kernel to hold the cycle counts for the init() and vloops() functions. Given a kernel named mykernel, the variables are named mykernel_init_cycles and mykernel_vloops_cycles.
  • In the kernel() function, VCC wraps the calls to the init() and vloops() functions with cycle counting via _tsc_gettime() and stores the resulting cycle counts in the counter variables. In the EVE case, the vloops cycle count is taken after _vcop_vloop_done(), so that the elapsed time includes complete execution of the vloop command on VCOP.
  • At the bottom of the kernel() function, VCC inserts a call to a new API function __vcop_log_kernel_profile() that records the accumulated cycle counts along with the kernel’s name. This function is defined in the compiler’s runtime support library.
  • When the program ends (after main() returns) the values are automatically printed. For example:
       kernel profiling results:
         vcop_fir_2D_short: init cycles=207 vloops cycles=64
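
The steps in the list above can be approximated with the following host-side sketch of an instrumented kernel() wrapper. It is a model under stated assumptions, not the code VCC actually generates: sim_tsc_gettime() stands in for _tsc_gettime(), sim_log_kernel_profile() stands in for __vcop_log_kernel_profile(), and their signatures, the cycles_t type, and the fake cycle costs are all hypothetical. The stand-in logger prints immediately, whereas the real mechanism prints when the program ends.

```c
#include <stdio.h>

typedef unsigned long long cycles_t;   /* assumed counter width */

/* Fake time-stamp counter; stands in for the TSC read by _tsc_gettime(). */
static cycles_t fake_tsc = 0;
static cycles_t sim_tsc_gettime(void) { return fake_tsc; }

/* Stand-in kernel pieces; each advances the fake counter by an arbitrary
 * cost chosen only to mirror the sample report. */
static void sim_mykernel_init(void)   { fake_tsc += 207; }
static void sim_mykernel_vloops(void) { fake_tsc += 4;   }  /* dispatch only   */
static void sim_vloop_done(void)      { fake_tsc += 60;  }  /* wait for kernel */

/* Counter variables named as the text describes for a kernel "mykernel". */
cycles_t mykernel_init_cycles   = 0;
cycles_t mykernel_vloops_cycles = 0;

/* Hypothetical stand-in for __vcop_log_kernel_profile(); real signature unknown. */
static void sim_log_kernel_profile(const char *name,
                                   cycles_t init_cycles,
                                   cycles_t vloops_cycles)
{
    printf("  %s: init cycles=%llu vloops cycles=%llu\n",
           name, init_cycles, vloops_cycles);
}

/* Model of the instrumented high-level kernel() call. */
void sim_mykernel(void)
{
    cycles_t start = sim_tsc_gettime();
    sim_mykernel_init();
    mykernel_init_cycles = sim_tsc_gettime() - start;

    start = sim_tsc_gettime();
    sim_mykernel_vloops();
    sim_vloop_done();        /* end time is taken after completion */
    mykernel_vloops_cycles = sim_tsc_gettime() - start;

    sim_log_kernel_profile("mykernel",
                           mykernel_init_cycles, mykernel_vloops_cycles);
}
```

Taking the end timestamp after the synchronization call is what makes the vloops count cover the full kernel execution rather than just the dispatch.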
    

If the kernel is not invoked through the high-level kernel() API, users may still insert cycle counting code manually and call __vcop_log_kernel_profile() to register the accumulated cycle counts.

On EVE, VCOP is clocked at twice the rate of ARP32. The reported cycles for EVE are ARP32 cycles, not VCOP cycles. Therefore, to compare C7x cycles to VCOP cycles, the EVE cycle counts should be doubled.
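
As a worked example of this conversion, the helper functions below double a reported EVE count before comparing it with a C7x count. The function names are illustrative and are not part of any TI API.

```c
#include <stdint.h>

/* Convert a cycle count reported on EVE (ARP32 cycles) to VCOP cycles.
 * VCOP runs at twice the ARP32 clock, so the count is doubled. */
uint64_t arp32_to_vcop_cycles(uint64_t arp32_cycles)
{
    return arp32_cycles * 2u;
}

/* Ratio of C7x cycles to VCOP cycles for the same kernel.
 * Returns 0.0 if the EVE count is zero, to avoid dividing by zero. */
double c7x_to_vcop_ratio(uint64_t c7x_cycles, uint64_t eve_arp32_cycles)
{
    uint64_t vcop_cycles = arp32_to_vcop_cycles(eve_arp32_cycles);
    if (vcop_cycles == 0) {
        return 0.0;
    }
    return (double)c7x_cycles / (double)vcop_cycles;
}
```

For instance, a report of 64 vloops cycles on EVE corresponds to 128 VCOP cycles, so a C7x run that takes 128 cycles would give a ratio of 1.0.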