SPRUIV4C User guide

SPRUIV4C May 2020 – December 2023

4.3.1 Loop-Carried Dependencies

In certain cases when software pipelining, the compiler will not be able to overlap successive iterations of the loop in order to get the best performance. When the compiler is not able to overlap successive iterations of the loop, performance suffers: the initiation interval (ii, described earlier) will be larger than desired and few functional units will be simultaneously utilized.

In almost all cases, this is due to a loop-carried dependency. A loop-carried dependency prevents to some degree the execution of iteration i+1 from overlapping with iteration i. A loop-carried dependency bound is a lower limit on the initiation interval of the software pipelined loop (and thus a limit on the speed of the software pipelined loop). A loop-carried dependency bound arises because there is a cycle in the ordering constraints (dependencies) for a set of the instructions in a loop. Out of all these cycle lengths in the loop, the maximum loop-carried dependency cycle is the loop-carried dependency bound. This can occur even if there are plenty of functional units available to perform several iterations in parallel.

If the loop-carried dependency bound is greater than the partitioned resource bound, then one of the loop-carried dependencies is slowing the loop, as the initiation interval is always at least the maximum of the partitioned resource bound and the loop-carried dependency bound.

To reduce or eliminate a problematic loop-carried dependency, one must identify the cycle and then find a way to shorten or break it.

The following example shows a loop with a problematic loop-carried dependency.

void weighted_sum(int * a, int * b, int * out,                                 
                  int weight_a, int weight_b, int n)                           
{                                                                              
    #pragma UNROLL(1)                                                          
    #pragma MUST_ITERATE(1024, ,32)                                            
    for (int i = 0; i < n; i++)                                                
    {                                                                          
        out[i] = a[i] * weight_a + b[i] * weight_b;                            
    }                                                                          
}

The compiler-generated assembly code for this example (shown below) shows that the Loop Carried Dependency Bound in the Software Pipeline Information section of the assembly code is 7 cycles.

;*----------------------------------------------------------------------------* 
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : weighted_vector_sum_v3.cpp
;*      Loop source line                 : 10
;*      Loop opening brace source line   : 11
;*      Loop closing brace source line   : 13
;*      Known Minimum Iteration Count    : 1024                    
;*      Known Max Iteration Count Factor : 32
;*      Loop Carried Dependency Bound(^) : 12
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound       : 2 (pre-sched)
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 12 Schedule found with 2 iterations in parallel
. . .
;*----------------------------------------------------------------------------*
;*        SINGLE SCHEDULED ITERATION
;*
;*        ||$C$C51||:
;*   0              TICK                               ; [A_U] 
;*   1              LDW     .D2     *D1++(4),BM0      ; [A_D2] |12|  ^ 
;*     ||           LDW     .D1     *D2++(4),BM1      ; [A_D1] |12|  ^ 
;*   2              NOP     0x5     ; [A_B] 
;*   7              MPYWW   .M2     BM2,BM0,BL0       ; [B_M2] |12|  ^ 
;*     ||           MPYWW   .N2     BM3,BM1,BL1       ; [B_N2] |12|  ^ 
;*   8              NOP     0x3     ; [A_B] 
;*  11              ADDW    .L2     BL1,BL0,B0        ; [B_L2] |12|  ^ 
;*  12              STW     .D1X    B0,*D0++(4)       ; [A_D1] |12|  ^ 
;*     ||           BNL     .B1     ||$C$C51||        ; [A_B] |10| 
;*  13              ; BRANCHCC OCCURS {||$C$C51||}    ; [] |10| 
;*----------------------------------------------------------------------------*

The final software pipelined initiation interval of the software pipelined loop is at least the greater of the Loop Carried Dependency Bound and the Partitioned Resource Bound. When the Loop Carried Dependency Bound value is greater than the Partitioned Resource Bound value, this indicates the code has a loop-carried dependency bound problem that likely should be addressed. In other words, when the loop-carried dependence bound is greater than the partitioned resource bound, the software pipelined loop could likely run faster if the loop-carried dependency bound is eliminated. Therefore in this example, because the partitioned resource bound is 2 and the loop-carried dependency bound is 12, this code has an issue that should be investigated.

To identify the problem, we need to look at the instructions involved in the loop-carried dependency. These instructions are marked with the caret "^" symbol in the comment block in the compiler-generated assembly file. Notice that the load and store instructions are marked with a caret. This tells us the compiler thinks there may be a loop-carried dependence between successive iterations. This is likely because the compiler cannot prove the stores are writing to an area of memory that is independent of the location from which the load instructions are loading values. In absence of information about the locations of the pointers, arrays and address access patterns, the compiler must assume that successive iterations may load from the location of the previous iteration's stores. See Hand-Tuning Loops and Control Code on the TMS320C6000 (SPRA666) for more about loop-carried dependencies and how to identify them.