SPRUIV4C May   2020  – December 2023

 

  1.   1
  2.   Read This First
    1.     About This Manual
    2.     Related Documentation
    3.     Trademarks
  3. 2Introduction
    1. 2.1 C7000 Digital Signal Processor CPU Architecture Overview
    2. 2.2 C7000 Split Datapath and Functional Units
  4. 3C7000 C/C++ Compiler Options
    1. 3.1 Overview
    2. 3.2 Selecting Compiler Options for Performance
    3. 3.3 Understanding Compiler Optimization
      1. 3.3.1 Software Pipelining
      2. 3.3.2 Vectorization and Vector Predication
      3. 3.3.3 Automatic Use of Streaming Engine and Streaming Address Generator
      4. 3.3.4 Loop Collapsing and Loop Coalescing
      5. 3.3.5 Automatic Inlining
      6. 3.3.6 If Conversion
  5. 4Basic Code Optimization
    1. 4.1  Signed Types for Iteration Counters and Limits
    2. 4.2  Floating-Point Division
    3. 4.3  Loop-Carried Dependencies and the Restrict Keyword
      1. 4.3.1 Loop-Carried Dependencies
      2. 4.3.2 The Restrict Keyword
      3. 4.3.3 Run-Time Alias Disambiguation
    4. 4.4  Function Calls and Inlining
    5. 4.5  MUST_ITERATE and PROB_ITERATE Pragmas and Attributes
    6. 4.6  If Statements and Nested If Statements
    7. 4.7  Intrinsics
    8. 4.8  Vector Types
    9. 4.9  C++ Features to Use and Avoid
    10. 4.10 Streaming Engine
    11. 4.11 Streaming Address Generator
    12. 4.12 Optimized Libraries
    13. 4.13 Memory Optimizations
  6. 5Understanding the Assembly Comment Blocks
    1. 5.1 Software Pipelining Processing Stages
    2. 5.2 Software Pipeline Information Comment Block
      1. 5.2.1 Loop and Iteration Count Information
      2. 5.2.2 Dependency and Resource Bounds
      3. 5.2.3 Initiation Interval (ii) and Iterations
      4. 5.2.4 Constant Extensions
      5. 5.2.5 Resources Used and Register Tables
      6. 5.2.6 Stage Collapsing
      7. 5.2.7 Memory Bank Conflicts
      8. 5.2.8 Loop Duration Formula
    3. 5.3 Single Scheduled Iteration Comment Block
    4. 5.4 Identifying Pipeline Failures and Performance Issues
      1. 5.4.1 Issues that Prevent a Loop from Being Software Pipelined
      2. 5.4.2 Software Pipeline Failure Messages
      3. 5.4.3 Performance Issues
  7. 6Revision History

Automatic Use of Streaming Engine and Streaming Address Generator

The compiler can use the Streaming Engine (SE) and/or the Streaming Address Generator (SA) automatically if the --auto_stream=no_saving option is used on C7100 and C7120 devices or the --auto_stream=saving option is used on C7504 and later devices.

If the weighted_vector_sum_v3.cpp example in Section 3.3.2 is compiled with the --auto_stream=no_saving option, the following software pipeline information block is generated. (The generated assembly in this example is for C7100.)

;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum_v2.cpp
;*      Loop source line                 : 7
;*      Loop opening brace source line   : 7
;*      Loop closing brace source line   : 9
;*      Loop Unroll Multiple             : 32x
;*      Known Minimum Iteration Count    : 32
;*      Known Max Iteration Count Factor : 1
;*      Loop Carried Dependency Bound(^) : 2
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 4 iterations in parallel. . .
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C36||:
;*   0              TICK                               ; [A_U]
;*   1              VMPYWW  .N2     VBM1,SE0++,VBL0   ; [B_N2] |8|  ^
;*     ||           VMPYWW  .M2     VBM0,SE1++,VBL1   ; [B_M2] |8|  ^
;*   2              VMPYWW  .N2     VBM1,SE0++,VBL0   ; [B_N2] |8|  ^
;*     ||           VMPYWW  .M2     VBM0,SE1++,VBL1   ; [B_M2] |8|  ^
;*   3              NOP             0x2               ; [A_B]
;*   5              VADDW   .L2     VBL1,VBL0,VB0     ; [B_L2] |8|
;*   6              VST16W  .D2     VB0,*D0(0)        ; [A_D2] |8|
;*     ||           VADDW   .L2     VBL1,VBL0,VB0     ; [B_L2] |8|
;*   7              VST16W  .D2     VB0,*D0(64)       ; [A_D2] |8| [C0]
;*     ||           ADDD    .D1     D0,0x80,D0        ; [A_D1] |7| [C1]
;*     ||           BNL     .B1     ||$C$C36||        ; [A_B] |7|
;*   8              ; BRANCHCC OCCURS {||$C$C36||}    ; [] |7|

In this case, the compiler uses SE0 and SE1 to replace the loads that previously set a lower ii bound of 4. With these loads instead being performed with SEs, an ii of 2 is achieved. To use the SEs in the above example, the compiler must configure and open them. The configuration and open actions are shown in comments added by the --src_interlist option before the loop:

;***    -----------------------    S$1 = __internal_SE_TEMPLATE_1_i_1_i_d_i_d_i_d_i_d_i_d_2_4;
;***    -----------------------    S$1.ICNT0 = C$5 = (unsigned)(n+15&0xfffffff0);
;***    -----------------------    __se_open_V0_U32_O(*__se_mem((packed void *)a), 0, S$1);
;***    -----------------------    S$3 = __internal_SE_TEMPLATE_1_i_1_i_d_i_d_i_d_i_d_i_d_2_4;
;***    -----------------------    S$3.ICNT0 = C$5;
;***    -----------------------    __se_open_V0_U32_O(*__se_mem((packed void *)b), 1, S$3);

By default, the compiler uses the SE or SA only if using them appears to be profitable and legal.

For profitability, a key consideration is that using the SEs or SAs comes with a processing overhead; the compiler does not necessarily know whether this overhead is profitable. In the example, the MUST_ITERATE pragma indicates the minimum iteration count is 1024, which convinces the compiler that use of SEs or SAs is likely profitable, so the compiler performs the transformation. If the compiler is not using the SE or SA and you want to cause the compiler to use them, indicating the number of iterations with the MUST_ITERATE or PROB_ITERATE pragma can help.

For legality, most reasons for not using the SE or the SA relate to whether an addressing pattern can always be mapped to an SE or SA. These reasons include, but are not limited to:

  • Iteration counter (ICNT) values that exceed the range of an unsigned 32-bit type. For example, this occurs in for (i = 0; i < icnt; i++) when i and icnt are 64-bit types.
  • DIM values that exceed the range of a signed 32-bit type. For example, this occurs in data_in[i*dim] when dim is a 64-bit type.
  • Additions or multiplies in addressing that exceed the range of a signed 32-bit type. For example, this occurs in data_in[i*dim] when i or dim is a 64-bit type.
  • Addressing exceeding the range of INT_MIN to INT_MAX elements. For example, in int16_ptr[i] when int16_ptr is an int16 * and i is an int, the maximum range is INT_MIN*16 elements to INT_MAX*16 elements.

Each of these are edge cases that are unlikely to occur in practice. To allow the compiler to ignore them, use the --assume_addresses_ok_for_stream option.

If using the SE or SA is not profitable in practice, you can override the --auto_stream and/or --assume_addresses_ok_for_stream options for a single function using the FUNCTION_OPTIONS pragma.

If the code explicitly uses the SE or SA in a function, the compiler does not choose to use either the SE or the SA for optimization. In this case, the compiler assumes that the code handles all aspects of optimization with the SE and SA within that function.

For further information on automatic use of the SE and SA and the associated compiler options, see the C7000 C/C++ Compiler User's Guide (SPRUIG8).