Skip to content

[SYCL] Add barrier optimization pass #19353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: sycl
Choose a base branch
from

Conversation

MrSidims
Copy link
Contributor

@MrSidims MrSidims commented Jul 9, 2025

It removes redundant barriers (both back-to-back and in general in CFG) and downgrades global barrier to local if there are no global memory accesses 'between' them. See description in
SYCLOptimizeBackToBackBarrier.cpp for more details.

const uint64_t LHSWeight = LHSIt->second;
const uint64_t RHSWeight = RHSIt->second;

if (LHSWeight > RHSWeight)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it ok to compare value of static_cast(LHS) and don't use weight?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general - yes. The idea I had in #16750 (from where the function and map originates) is to make the pass forward compatible. Aka if new Scopes are introduced - the pass wouldn't break them as it would consider them as Unknown. Not sure if it's worth to keep, as for example current version of the pass doesn't consider 'unknown' memory semantic masks.

Changed |= eliminateBoundaryBarriers(BarrierPtrs);
// Then remove redundant barriers within a single basic block.
for (auto &BarrierBBPair : BarriersByBB)
Changed = eliminateBackToBackInBB(BarrierBBPair.first, BarrierBBPair.second,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can eliminateBackToBackInBB be merged into eliminateDominatedBarriers? eliminateBackToBackInBB is just a special case of the latter in that all barriers are in a single BB, right?

Copy link
Contributor Author

@MrSidims MrSidims Jul 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can be merged. Yet I've left them split as eliminate back to back barriers function is algorithmic-wise simpler, then CFG elimination. And I though that it's a good idea to first optimize back-to-back barriers, then (not yet implemented) hoist 2 or more barriers into one in case if their appropriate blocks share the same predecessor and their semantics match, and only then do CFG-aware removal/downgrade on the remaining barriers).


// If identical then drop Cur.
if (CmpExec == CompareRes::EQUAL && CmpMem == CompareRes::EQUAL) {
if (noFencedMemAccessesBetween(Last.CI, Cur.CI, FenceLast, BBMemInfo)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a note: there could be repeated classifyMemScope calculation of somes instructions in noFencedMemAccessesBetween, e.g. following case:

barrier(CrossDevice)
Instruction Set 1 (RegionMemScope == None)
barrier(Device)
Instruction Set 2 (RegionMemScope == None)
barrier(Workgroup)
Instruction Set 3 (RegionMemScope == None)
barrier(Subgroup)

I guess the case is rare, so probably no need to optimize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. But this would require to do extra memorization (by default), which might be worse comparing extra calculus in the rare case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general there are other ways to define fenced regions between barriers, but I haven't though about them until last Monday, when I found a similar work :) Re-making scanning and re-defining fenced regions is a possible enhancement for the pass.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general there are other ways to define fenced regions between barriers, but I haven't though about them until last Monday, when I found a similar work :) Re-making scanning and re-defining fenced regions is a possible enhancement for the pass.

Sounding interesting, is there a link to the work?

Copy link
Contributor Author

@MrSidims MrSidims Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenju-he I meant CPU middle end pass, which while is not doing the same as this pass, yet have quite interesting idea for function preparation :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Right it is not the same. I agree that a region based algorithm would be better. Basically the pass here is merging equivalent regions.

if (Fence == RegionMemScope::Unknown)
continue;

if (DT.dominates(B1->CI, B2->CI)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there repeated calculation for the case

  • B1 = A0, B2 = A1, A0 dominates A1, noFencedAccessesCFG returns false
  • B1 = A1, B2 = A0, A1 post-dominates A0, noFencedAccessesCFG is called again on the same instructions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another potential repeated calculation case is:
A0 dominates A1, A1 dominates A2. noFencedAccessesCFG is called twice on the instructions between A0 and A1.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is something I'm refactoring and will continue to refactor by merging elimination in CFG and downgrade functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be partially resolved.

It removes redundant barriers (both back-to-back and in general in CFG)
and downgrades global barrier to local if there are no global memory
accesses 'between' them. See description in
SYCLOptimizeBackToBackBarrier.cpp for more details.

Signed-off-by: Sidorov, Dmitry <[email protected]>
@MrSidims MrSidims force-pushed the optimize-barrier-2 branch from e5b70b9 to 6d51dad Compare July 13, 2025 11:21
@MrSidims MrSidims force-pushed the optimize-barrier-2 branch from 6d51dad to 7710333 Compare July 13, 2025 11:22
@MrSidims MrSidims marked this pull request as ready for review July 13, 2025 11:22
@MrSidims MrSidims requested review from a team as code owners July 13, 2025 11:22
@MrSidims MrSidims requested a review from aelovikov-intel July 13, 2025 11:22
@MrSidims MrSidims changed the title [testing for now][SYCL] Add barrier optimization pass [SYCL] Add barrier optimization pass Jul 13, 2025
@MrSidims MrSidims marked this pull request as draft July 13, 2025 12:05
TODO: merge CFG elimination and barrier downgrade

Signed-off-by: Sidorov, Dmitry <[email protected]>
@MrSidims MrSidims force-pushed the optimize-barrier-2 branch from 7710333 to 527e8e3 Compare July 14, 2025 10:00
@MrSidims MrSidims requested a review from wenju-he July 14, 2025 10:06
Copy link
Contributor

@sarnex sarnex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmake lgtm

@sarnex sarnex requested a review from a team July 14, 2025 14:34
Signed-off-by: Dmitry Sidorov <[email protected]>
@asudarsa
Copy link
Contributor

Moving it out of a draft, @wenju-he please take another look, @intel/dpcpp-tools-reviewers please take a look.

There are currently several TODOs and FIXMEs in the code, but I guess in the current PR I will only address these 2 related issues: #19353 (comment) and #19353 (comment) , other than that the pass is (from my perspective) ready.

UPD: also the pass should be moved out from SYCLLowerIR , and I'm open to the suggestions :)

Hi @MrSidims

Can you please state your reasons behind requiring such a move? That might help to get some ideas. Thanks

// loads from __spirv_BuiltIn GVs)
// – Unknown : any other mayReadOrWriteMemory() (intrinsics, calls,
// generic addrspace)
// * Walk the function and record every barrier call into a list of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lines 14-26 and 27-39 seem similar. Please check. Thanks

// accesses >= B.MemScope, then remove B.
// - **Exit** : For each barrier B, if on *every* path from B to any function
// return there are no
// accesses >= B.MemScope, then remove B.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does no accesses >= B.MemScope mean? Thanks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll clarify that (and it seems like clang-format interferes the understanding - I'll disable it for the header), but basically it means, that there are no accesses from a wider or equal memory, then memory scope and memory semantics of the barrier. Aka we look at address space that accompanies the access, see to which memory it belongs and compare with the fenced memory.

@@ -808,6 +813,8 @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
.sinkCommonInsts(true)));
FPM.addPass(InstCombinePass());
invokePeepholeEPCallbacks(FPM, Level);
if (SYCLOptimizationMode)
FPM.addPass(SYCLOptimizeBarriersPass());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should the pass be run via registerPipeline*Callback in function

void EmitAssemblyHelper::RunOptimizationPipeline(

Most of SYCL passes are added there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied, thanks!

@@ -575,6 +578,8 @@ PassBuilder::buildO1FunctionSimplificationPipeline(OptimizationLevel Level,
SimplifyCFGPass(SimplifyCFGOptions().convertSwitchRangeToICmp(true)));
FPM.addPass(InstCombinePass());
invokePeepholeEPCallbacks(FPM, Level);
if (SYCLOptimizationMode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be interesting if non sycl-opt mode can use the pass as well. It seems not possible at the moment as libspirv ControlBarrier implementation is inlined before optimization pass pipeline starts.

Copy link
Contributor Author

@MrSidims MrSidims Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I probably misunderstanding what SYCLOptimizationMode is. I though it basically mean [O1, O2, O3] on the device code in the frontend (and no optnone passed to BE), or it's not the case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SYCLOptimizationMode disables some optimizations. SYCLOptimizationMode should be meant for spir64 or SPIR-V target only as the target lacks TTI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved the pass from here.

// FIXME: this check is way too conservative.
if (Fence != RegionMemScope::Unknown && ADominatesB &&
PDT.dominates(B->CI, A->CI) &&
noFencedAccessesCFG(A->CI, B->CI, Fence, BBMemInfo)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there could be repeated calculation if this function is called multiple times and a basic block is re-scanned in multiple ranges (A, B), (A, C), etc.
It might be enough to calculate the highest ExecScope and MemScope score of each basic block only once. Then the score can be used if the basic block is scanned multiple times.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, do you mean on line 532 just compare with Required? Realistically it should be enough, though in some rare scenarios it will result in a missing optimization. For now I've added a shortcut in hasFencedAccesses for cases when path from barrier A to barrier B is looking like this: BB-A -> other BB -> BB-B - with this shortcut scan for other BB will be omitted.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean each basic block has its own score, i.e. highest score of the instructions within the basic block. After calculating the score, there is no need to iterate over instructions over and over again.

@@ -147,6 +147,9 @@
#include "llvm/Transforms/Vectorize/SLPVectorizer.h"
#include "llvm/Transforms/Vectorize/VectorCombine.h"

// TODO: move it elsewhere
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this something we should do before merging this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm thinking of just Transforms

// loads from __spirv_BuiltIn GVs)
// – Unknown : any other mayReadOrWriteMemory() (intrinsics, calls,
// generic addrspace)
// * Walk the function and record every barrier call into a list of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment not redundant with the block just above?

Signed-off-by: Sidorov, Dmitry <[email protected]>
Signed-off-by: Sidorov, Dmitry <[email protected]>
@MrSidims MrSidims requested a review from a team as a code owner July 16, 2025 00:33
Signed-off-by: Sidorov, Dmitry <[email protected]>
Copy link
Contributor

@maarquitos14 maarquitos14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few more nits, but overall looks very good. Great job!

<< ") returned " << false << "\n");
return false;
}
// do not enqueue successors (there are none).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// do not enqueue successors (there are none).
// Do not enqueue successors (there are none).

if (semanticsSuperset(A.Semantic, B.Semantic) &&
!semanticsSuperset(B.Semantic, A.Semantic))
return false;
// then fall back to exec/mem‐scope width as before:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// then fall back to exec/mem‐scope width as before:
// Then fall back to exec/mem‐scope width as before:

}
break;
}
if (Cur.CI) // still alive?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (Cur.CI) // still alive?
if (Cur.CI) // Still alive?

}

// True if BD is the first real instruction of the function.
static bool isAtKernelEntry(BarrierDesc &BD) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still non-const, though. Can we add it?

}

// True if BD is immediately before a return/unreachable and nothing follows.
static bool isAtKernelExit(BarrierDesc &BD) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants