SPRUIV4C User guide

SPRUIV4C May 2020 – December 2023

3.3.1 Software Pipelining

Very Long Instruction Word (VLIW) digital signal processors (DSPs) like the C7000 depend on software pipelining of loops to achieve maximum performance. Software pipelining is a technique in which successive iterations of a source loop are overlapped so that the functional units on the CPU are used in as many cycles as possible throughout the loop.

The following figure shows loop iteration execution both without and with software pipelining. You can see that without software pipelining, loops are scheduled so that loop iteration i completes before iteration i+1 begins. With software pipelining, iterations overlap. Thus, as long as correctness can be preserved, iteration i+1 can start before iteration i finishes. This generally permits a much higher utilization of the machine’s resources than might be achieved from other scheduling techniques. In a software-pipelined loop, even though a single iteration might take s cycles to complete, a new iteration is initiated every ii cycles.

Figure 3-2 Effects of Software Pipelining on Execution

In an efficient software pipelined loop, ii is much less than s. ii is called the initiation interval; it is the number of cycles between starting iteration i and starting iteration i+1. s is the number of cycles for the first iteration to complete, or equivalently, the length of a single scheduled iteration of the software-pipelined loop.
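As a rough illustration of why a small ii matters, the idealized cycle counts can be sketched in C++. The function names and the sample values (s = 10, ii = 2, 100 iterations) are assumptions made for this sketch, and the model ignores loop setup, stalls, and any code outside the loop.

```cpp
#include <cstdint>

// Idealized cycle count when each iteration finishes before the next starts.
constexpr std::uint64_t cycles_without_pipelining(std::uint64_t n, std::uint64_t s)
{
    return n * s;
}

// Idealized cycle count when a new iteration starts every ii cycles:
// the last iteration starts at cycle (n - 1) * ii and needs s cycles to finish.
constexpr std::uint64_t cycles_with_pipelining(std::uint64_t n, std::uint64_t s,
                                               std::uint64_t ii)
{
    return n == 0 ? 0 : (n - 1) * ii + s;
}

// Assumed sample values: 100 iterations, s = 10, ii = 2.
static_assert(cycles_without_pipelining(100, 10) == 1000, "100 * 10");
static_assert(cycles_with_pipelining(100, 10, 2) == 208, "99 * 2 + 10");
```

For large iteration counts the pipelined total is dominated by n * ii, which is why the initiation interval, rather than the length of a single iteration, largely determines loop performance in this model.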

The compiler attempts to software pipeline the innermost source loops. These are loops that do not have any other loops within them. Note that during the compilation process, software pipelining occurs after inlining and after loop transformations that may combine loops, so in certain cases you may see the compiler software pipelining more of your code than you expect.

After software pipelining, the loop has three major phases, as shown in the following figure:

A pipe-up (prolog) phase, during which the overlapped iterations are started.
A steady-state (kernel) phase, during which a new iteration is started every ii cycles while earlier iterations continue to execute.
A pipe-down (epilog) phase, during which any iterations that have not yet completed are allowed to finish.

Figure 3-3 Loop Iterations with Prolog and Epilog
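The three phases can be imitated by hand in source code. The following is a conceptual sketch only: it splits each iteration of a weighted sum into load, compute, and store stages and overlaps them across iterations. The function name and the three-stage split are assumptions made for illustration. The compiler performs the real transformation on scheduled machine instructions, not on source code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conceptual, hand-pipelined weighted sum with three stages per iteration:
// load -> multiply/add -> store. Illustration only.
std::vector<int> weighted_sum_staged(const std::vector<int>& a,
                                     const std::vector<int>& b,
                                     int weight_a, int weight_b)
{
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<int> out(n);
    if (n < 2) {                               // too short to overlap anything
        for (std::size_t i = 0; i < n; ++i)
            out[i] = a[i] * weight_a + b[i] * weight_b;
        return out;
    }

    // Prolog (pipe-up): start iterations 0 and 1 without finishing either.
    int la = a[0], lb = b[0];                  // load,    iteration 0
    int r = la * weight_a + lb * weight_b;     // compute, iteration 0
    la = a[1]; lb = b[1];                      // load,    iteration 1

    // Kernel (steady state): each pass finishes one iteration and starts one.
    for (std::size_t i = 0; i + 2 < n; ++i) {
        out[i] = r;                            // store,   iteration i
        r = la * weight_a + lb * weight_b;     // compute, iteration i+1
        la = a[i + 2]; lb = b[i + 2];          // load,    iteration i+2
    }

    // Epilog (pipe-down): drain the two iterations still in flight.
    out[n - 2] = r;                            // store,   iteration n-2
    r = la * weight_a + lb * weight_b;         // compute, iteration n-1
    out[n - 1] = r;                            // store,   iteration n-1
    return out;
}
```

In the kernel, each pass touches three different iterations at once, which mirrors how a software-pipelined kernel keeps several iterations in flight.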

The following example shows the source code for a simple weighted vector sum.

// weighted_vector_sum.cpp
// Compile with "cl7x -mv7100 --opt_level=3 --debug_software_pipeline
// --src_interlist --symdebug:none weighted_vector_sum.cpp"

void weighted_sum(int *restrict a, int *restrict b, int *restrict out,
                  int weight_a, int weight_b, int n)
{
    #pragma UNROLL(1)
    #pragma MUST_ITERATE(1024, ,32)
    for (int i = 0; i < n; i++)
    {
        out[i] = a[i] * weight_a + b[i] * weight_b;
    }
}

To simplify this first software-pipelining example, two pragmas are used:

The UNROLL pragma tells the compiler not to perform vectorization, which is a transformation technique that is demonstrated in the next section.
The MUST_ITERATE pragma conveys information on how many times the loop executes and is explained later in this document. The example uses this pragma to prevent a "duplicate loop" from being generated, which is also explained later in this document.
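In this example, MUST_ITERATE(1024, ,32) states that the loop runs at least 1024 times and that the iteration count is a multiple of 32. The pragma is information supplied by the programmer, so a caller has to make sure the value of n actually honors it. A minimal sketch of such a check follows. The helper name is hypothetical and is not part of any TI API.

```cpp
// Hypothetical helper (not a TI API): true when a trip count n is consistent
// with #pragma MUST_ITERATE(min_count, , multiple), that is, with no maximum.
constexpr bool satisfies_must_iterate(int n, int min_count, int multiple)
{
    return multiple > 0 && n >= min_count && n % multiple == 0;
}

static_assert(satisfies_must_iterate(1024, 1024, 32), "minimum count, multiple of 32");
static_assert(satisfies_must_iterate(2048, 1024, 32), "larger multiple of 32");
static_assert(!satisfies_must_iterate(512, 1024, 32), "below the minimum");
static_assert(!satisfies_must_iterate(1040, 1024, 32), "not a multiple of 32");
```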

Then we compile this code with the following command:

cl7x --opt_level=3 --debug_software_pipeline --src_interlist --symdebug:none weighted_vector_sum.cpp

The --symdebug:none option prevents the compiler from generating debug information and the associated debug directives in the assembly. This debug information is not relevant to the discussion in this document and, if included, would unnecessarily lengthen the examples shown here. Normally, you would not turn off debug generation, because generating debug information does not degrade performance.

Because the --src_interlist option is used, the compiler-generated assembly file is not deleted and has the following contents:

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum.cpp
;*      Loop source line                 : 10
;*      Loop opening brace source line   : 11
;*      Loop closing brace source line   : 13
;*      Known Minimum Iteration Count    : 1024                    
;*      Known Max Iteration Count Factor : 32
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 7 iterations in parallel
;*
;*      Partitioned Resource Bound(*)    : 2 (post-sched)
. . .
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C36||:
;*   0              TICK                               ; [A_U] 
;*   1              SLDW    .D1     *D1++(4),BM0      ; [A_D1] |12| 
;*     ||           SLDW    .D2     *D2++(4),BM1      ; [A_D2] |12| 
;*   2              NOP     0x5     ; [A_B] 
;*   7              MPYWW   .N2     BM2,BM0,BL0       ; [B_N] |12| 
;*     ||           MPYWW   .M2     BM3,BM1,BL1       ; [B_M2] |12| 
;*   8              NOP     0x3     ; [A_B] 
;*  11              ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12| 
;*  12              STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| 
;*     ||           BNL     .B1     ||$C$C36||        ; [A_B] |10| 
;*  13              ; BRANCHCC OCCURS {||$C$C36||}    ; [] |10| 
;*----------------------------------------------------------------------------*
||$C$L1||:    ; PIPED LOOP PROLOG
;          EXCLUSIVE CPU CYCLES: 8
 
           TICK                               ; [A_U] (R) (SP) <1,0> 
||         SLDW    .D1     *D1++(4),BM1      ; [A_D1] |12| (P) <1,1> 
||         SLDW    .D2     *D2++(4),BM0      ; [A_D2] |12| (P) <1,1> 
 
 
           MV      .L2X    A7,B0             ; [B_L2] |7| (R) 
||         TICK                               ; [A_U] (P) <2,0> 
 
           MV      .L2X    A8,B1             ; [B_L2] |7| (R) 
||         SLDW    .D1     *D1++(4),BM0      ; [A_D1] |12| (P) <2,1> 
||         SLDW    .D2     *D2++(4),BM1      ; [A_D2] |12| (P) <2,1> 
 
           MV      .S2     B0,BM2            ; [B_S2] (R) 
||         MV      .L2     B1,BM3            ; [B_L2] (R) 
||         TICK                               ; [A_U] (P) <3,0> 
 
 
           MPYWW   .N2     BM2,BM1,BL0       ; [B_N] |12| (P) <0,7> 
||         MPYWW   .M2     BM3,BM0,BL1       ; [B_M2] |12| (P) <0,7> 
||         SLDW    .D1     *D1++(4),BM0      ; [A_D1] |12| (P) <3,1> 
||         SLDW    .D2     *D2++(4),BM1      ; [A_D2] |12| (P) <3,1> 
 
           TICK                               ; [A_U] (P) <4,0> 
 
           MPYWW   .N2     BM2,BM1,BL0       ; [B_N] |12| (P) <1,7> 
||         MPYWW   .M2     BM3,BM0,BL1       ; [B_M2] |12| (P) <1,7> 
||         SLDW    .D1     *D1++(4),BM0      ; [A_D1] |12| (P) <4,1> 
||         SLDW    .D2     *D2++(4),BM1      ; [A_D2] |12| (P) <4,1> 
 
           MV      .D2     A6,D0             ; [A_D2] (R) 
||         ADDD    .D1     SP,0xfffffff8,SP  ; [A_D1] (R) 
||         TICK                               ; [A_U] (P) <5,0> 

;** --------------------------------------------------------------------------*
||$C$L2||:    ; PIPED LOOP KERNEL
;          EXCLUSIVE CPU CYCLES: 2

           ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12| <0,11> 
||         MPYWW   .N2     BM2,BM0,BL0       ; [B_N] |12| <2,7> 
||         MPYWW   .M2     BM3,BM1,BL1       ; [B_M2] |12| <2,7> 
||         SLDW    .D1     *D1++(4),BM0      ; [A_D1] |12| <5,1> 
||         SLDW    .D2     *D2++(4),BM1      ; [A_D2] |12| <5,1> 


           BNL     .B1     ||$C$L2||         ; [A_B] |10| <0,12> 
||         STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| <0,12> 
||         TICK                               ; [A_U] <6,0>
;** --------------------------------------------------------------------------*
||$C$L3||:    ; PIPED LOOP EPILOG
;          EXCLUSIVE CPU CYCLES: 7
;**     -----------------------    return;

           ADDD    .D2     SP,0x8,SP         ; [A_D2] (O) 
||         LDD     .D1     *SP(16),A9        ; [A_D1] (O) 
||         ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12| (E) <4,11> 
||         MPYWW   .N2     BM2,BM0,BL1       ; [B_N] |12| (E) <6,7> 
||         MPYWW   .M2     BM3,BM1,BL0       ; [B_M2] |12| (E) <6,7> 

           STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| (E) <4,12> 
           ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12| (E) <5,11> 
           STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| (E) <5,12> 
           ADDW    .L2     BL0,BL1,B0        ; [B_L2] |12| (E) <6,11> 
           STW     .D1X    B0,*D0++(4)       ; [A_D1] |12| (E) <6,12> 

           RET     .B1     ; [A_B] (O) 
||         PROT                               ; [A_U] (E) 

           ; RETURN OCCURS {RP}              ; [] (O)

This assembly output shows the software pipelined loop from the compiler-generated assembly file along with part of the software pipelining information comment block, which includes important information about various characteristics of the loop.

If the compiler successfully software pipelines a loop, the compiler-generated assembly code contains a software pipeline information comment block with a message of the form "ii = xx Schedule found with yy iterations in parallel". The initiation interval (ii) measures how often the software-pipelined loop can start executing a new iteration of the loop. The smaller the initiation interval, the fewer cycles it takes to execute the entire loop. The software-pipelined loop information also includes the source lines from which the loop originates, a description of the resource and latency requirements for the loop, and whether the loop was unrolled (among other information). When compiling with -mw, the information also contains a copy of the single scheduled iteration.

In this example, the achieved initiation interval (ii) is 2 cycles, and the number of iterations that will run in parallel is 7.
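The two numbers in the schedule message are related. Assuming the single scheduled iteration in the listing occupies cycles 0 through 12 (so s = 13), the number of overlapped iterations is s divided by ii, rounded up. The helper name below is hypothetical, and the sketch only checks this arithmetic.

```cpp
// Number of iterations in flight in the kernel: ceil(s / ii), for ii > 0.
constexpr unsigned iterations_in_parallel(unsigned s, unsigned ii)
{
    return (s + ii - 1) / ii;
}

// s = 13 (cycles 0..12 of the single scheduled iteration), ii = 2.
static_assert(iterations_in_parallel(13, 2) == 7, "matches the reported 7 iterations");
```

This appears consistent with the kernel listing, where the first number in the angle-bracket annotations ranges from 0 to 6.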

The comment block also includes a single-scheduled iteration view of the software pipelined loop. This view lets you see how the compiler transformed the code and how it scheduled one iteration of the loop so that iterations can overlap. See Section 5.2 for more information on how to interpret the information in this comment block.