Optimize loops for the case in which they are looping #9

@rasky

  vge $v24, $v23, $v26.e0                            ## L:250  |  ***78 | finish_mask = vec_pos_int >= vec_step8:ufract.0;
  sqv $v23, 0, 0, $s7                                ## L:249  |      ^ | @Barrier("vpos") store(vec_pos_int, SCRATCH_ADDR); ## Barrier: 0x1
  cfc2 $t9, $vcc                                     ## L:251  |     79 | finish_bitmask = get_vcc();
  LABEL_Mixer_Resample_0009:
  vaddc $v28, $v28, $v26.e7                          ## L:292  |      ^ | vec_pos += vec_step8.7;
  lhu $t8, 0($s7)                                    ## L:264  |     80 | @Barrier("vpos") a0:u16 = load(SCRATCH_ADDR, 0x00); ## Barrier: 0x1

In situations like the above, the first VU instruction of the loop (vaddc) pairs with cfc2 (before the loop) on the first pass, but pairs with lhu on every later iteration (provided the loop start is 8-byte aligned; otherwise the first instruction at a branch target is not paired at all). Mispairing the first instruction can ripple through the instructions that follow: the whole subsequent sequence of pairs shifts as well, and this can introduce stalls.
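To make the difference concrete, here is a minimal sketch of the pairing behavior in Python. It is a deliberately simplified model, not the real RSP issue logic: it assumes two adjacent instructions pair whenever one is scalar (SU) and the other vector (VU), and it ignores register dependencies, alignment, and every other hardware rule. The function name `pair_greedy` and the SU/VU classification of the five instructions are assumptions made for illustration.

```python
def pair_greedy(kinds):
    """Group consecutive instructions into issue slots (simplified model).

    `kinds` is a list of "SU" / "VU" strings in program order.
    Two adjacent instructions share a slot when their kinds differ.
    Returns a list of tuples of instruction indices, one tuple per slot.
    """
    slots = []
    i = 0
    while i < len(kinds):
        if i + 1 < len(kinds) and kinds[i] != kinds[i + 1]:
            slots.append((i, i + 1))  # dual-issue: one SU + one VU
            i += 2
        else:
            slots.append((i,))        # single-issue
            i += 1
    return slots


# Assumed classification of the listing: vge, sqv, cfc2, vaddc, lhu
first_pass = pair_greedy(["VU", "SU", "SU", "VU", "SU"])
# Entering at the loop label (assumed 8-byte aligned): vaddc, lhu
iteration = pair_greedy(["VU", "SU"])

print(first_pass)  # vaddc (index 3) shares a slot with cfc2 (index 2)
print(iteration)   # vaddc (index 0) shares a slot with lhu (index 1)
```

Under this model the same two loop instructions land in different slots depending on how the loop is entered, which is the effect described above.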

There are several intertwined issues here:

  1. The behavior of the code is subtly affected by whether the loop start label is 8-byte aligned or not. This can make benchmarks oscillate, with seemingly random effects on performance. This could be fixed if RSPL knew the offset of the label and inserted a nop / vnop just before the loop when needed. (In fact, knowing the full address is not required; it would be sufficient to know the 8-byte phase of the label, which could be calculated by simply forcing the alignment of all functions via .balign 8 and then keeping track of the phase.)
  2. The reordering seems to "ignore" the loop start label, and prefers to pair the first loop instruction with the one before it. In general, it is more important to optimize iterations than first passes. So the reordering should treat the start label like a "barrier", and in general ignore pairings with what comes before it. The same goes for the cycle counter in the comments: it should assume there is a barrier at the label, as this more closely matches what happens while the loop is iterating and is therefore more useful to see.
  3. Even if the previous suggestion were implemented, the first pass through the loop might still be mispaired by accident. This is less important as it only happens once, but it could be avoided by some smart insertion of nop / vnop just before the loop start (which interacts with point 1, as that also requires inserting nops).
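The phase tracking from point 1 could look roughly like the following Python sketch. It is not RSPL's actual code: the helper name `align_loop_labels` and the line-based input format are hypothetical. It assumes the function starts 8-byte aligned (for example via .balign 8), that every non-label line assembles to exactly one 4-byte instruction, and it always pads with nop; choosing between nop and vnop depending on the preceding instruction is left out.

```python
INSTR_SIZE = 4  # each RSP instruction is 4 bytes
ALIGN = 8       # dual-issue at a branch target needs 8-byte alignment


def align_loop_labels(lines, loop_labels):
    """Insert a `nop` before every loop label that is not 8-byte aligned.

    `lines`: assembly lines; a line ending in ":" is a label, anything
             else is assumed to be one 4-byte instruction.
    `loop_labels`: set of label names (without ":") that are loop starts.
    Returns a new list; the input is not modified.
    """
    out = []
    offset = 0  # byte offset from the (8-byte aligned) function start
    for line in lines:
        text = line.strip()
        if text.endswith(":"):
            if text[:-1] in loop_labels and offset % ALIGN != 0:
                out.append("nop")      # pad so the label lands on phase 0
                offset += INSTR_SIZE
            out.append(line)
        else:
            out.append(line)
            offset += INSTR_SIZE
    return out


code = ["vge", "sqv", "cfc2", "LOOP:", "vaddc", "lhu"]
padded = align_loop_labels(code, {"LOOP"})
print(padded)
```

Here three instructions precede the label, so it sits at byte offset 12 and a single nop moves it to offset 16. Only the offset modulo 8 matters, which is why the full address is never needed.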
