[SYCL] Add barrier optimization pass #19353
Conversation
Force-pushed from 1a48836 to e5b70b9.
const uint64_t LHSWeight = LHSIt->second;
const uint64_t RHSWeight = RHSIt->second;

if (LHSWeight > RHSWeight)
Is it OK to compare the raw static_cast values of LHS and RHS and not use the weight map?
In general - yes. The idea I had in #16750 (where the function and map originate) is to make the pass forward compatible. That is, if new scopes are introduced, the pass won't break on them, as it will treat them as Unknown. Not sure if it's worth keeping, since for example the current version of the pass doesn't handle 'unknown' memory semantics masks.
Changed |= eliminateBoundaryBarriers(BarrierPtrs);
// Then remove redundant barriers within a single basic block.
for (auto &BarrierBBPair : BarriersByBB)
  Changed = eliminateBackToBackInBB(BarrierBBPair.first, BarrierBBPair.second,
Can eliminateBackToBackInBB be merged into eliminateDominatedBarriers? eliminateBackToBackInBB is just a special case of the latter in that all barriers are in a single BB, right?
Yes, it can be merged. Yet I've left them split, since the back-to-back elimination function is algorithmically simpler than the CFG elimination. And I thought it's a good idea to first optimize back-to-back barriers, then (not yet implemented) hoist two or more barriers into one in case their blocks share the same predecessor and their semantics match, and only then do CFG-aware removal/downgrade on the remaining barriers.

// If identical then drop Cur.
if (CmpExec == CompareRes::EQUAL && CmpMem == CompareRes::EQUAL) {
  if (noFencedMemAccessesBetween(Last.CI, Cur.CI, FenceLast, BBMemInfo)) {
Just a note: there could be repeated classifyMemScope calculation on some instructions in noFencedMemAccessesBetween, e.g. in the following case:
barrier(CrossDevice)
Instruction Set 1 (RegionMemScope == None)
barrier(Device)
Instruction Set 2 (RegionMemScope == None)
barrier(Workgroup)
Instruction Set 3 (RegionMemScope == None)
barrier(Subgroup)
I guess the case is rare, so probably no need to optimize.
True. But this would require extra memoization (by default), which might be worse than the extra computation in the rare case.
In general there are other ways to define fenced regions between barriers, but I hadn't thought about them until last Monday, when I found a similar work :) Re-making the scanning and re-defining fenced regions is a possible enhancement for the pass.
Sounds interesting, is there a link to the work?
@wenju-he I meant the CPU middle-end pass, which, while not doing the same as this pass, has quite an interesting idea for function preparation :)
I see. Right, it is not the same. I agree that a region-based algorithm would be better. Basically the pass here is merging equivalent regions.
if (Fence == RegionMemScope::Unknown)
  continue;

if (DT.dominates(B1->CI, B2->CI)) {
Is there a repeated calculation for this case?
- B1 = A0, B2 = A1, A0 dominates A1, noFencedAccessesCFG returns false
- B1 = A1, B2 = A0, A1 post-dominates A0, noFencedAccessesCFG is called again on the same instructions
Another potential repeated-calculation case: A0 dominates A1, and A1 dominates A2. noFencedAccessesCFG is called twice on the instructions between A0 and A1.
Yeah, this is something I'm refactoring and will continue to refactor by merging the CFG elimination and downgrade functions.
Should be partially resolved.
It removes redundant barriers (both back-to-back and in general in the CFG) and downgrades a global barrier to local if there are no global memory accesses 'between' them. See the description in SYCLOptimizeBackToBackBarrier.cpp for more details.
Signed-off-by: Sidorov, Dmitry <[email protected]>
Force-pushed from e5b70b9 to 6d51dad.
Force-pushed from 6d51dad to 7710333.
TODO: merge CFG elimination and barrier downgrade
Signed-off-by: Sidorov, Dmitry <[email protected]>
Force-pushed from 7710333 to 527e8e3.
cmake lgtm
Signed-off-by: Dmitry Sidorov <[email protected]>
Force-pushed from 8b88749 to 1b25b97.
Hi @MrSidims, can you please state your reasons for requiring such a move? That might help to get some ideas. Thanks
//     loads from __spirv_BuiltIn GVs)
//   – Unknown : any other mayReadOrWriteMemory() (intrinsics, calls,
//     generic addrspace)
// * Walk the function and record every barrier call into a list of
Lines 14-26 and 27-39 seem similar. Please check. Thanks
//     accesses >= B.MemScope, then remove B.
// - **Exit** : For each barrier B, if on *every* path from B to any function
//     return there are no
//     accesses >= B.MemScope, then remove B.
What does "no accesses >= B.MemScope" mean? Thanks
I'll clarify that (and it seems like clang-format interferes with readability - I'll disable it for the header), but basically it means that there are no accesses to memory wider than or equal to the memory scope and memory semantics of the barrier. That is, we look at the address space that accompanies the access, see which memory it belongs to, and compare it with the fenced memory.
@@ -808,6 +813,8 @@ PassBuilder::buildFunctionSimplificationPipeline(OptimizationLevel Level,
                       .sinkCommonInsts(true)));
   FPM.addPass(InstCombinePass());
   invokePeepholeEPCallbacks(FPM, Level);
+  if (SYCLOptimizationMode)
+    FPM.addPass(SYCLOptimizeBarriersPass());
Should the pass be run via a registerPipeline*Callback in EmitAssemblyHelper::RunOptimizationPipeline (llvm/clang/lib/CodeGen/BackendUtil.cpp, line 873 at a960c7f)? Most of the SYCL passes are added there.
Applied, thanks!
@@ -575,6 +578,8 @@ PassBuilder::buildO1FunctionSimplificationPipeline(OptimizationLevel Level,
   FPM.addPass(
       SimplifyCFGPass(SimplifyCFGOptions().convertSwitchRangeToICmp(true)));
   FPM.addPass(InstCombinePass());
   invokePeepholeEPCallbacks(FPM, Level);
+  if (SYCLOptimizationMode)
It would be interesting if non-SYCL-opt mode could use the pass as well. It seems not possible at the moment, as the libspirv ControlBarrier implementation is inlined before the optimization pass pipeline starts.
I'm probably misunderstanding what SYCLOptimizationMode is. I thought it basically means [O1, O2, O3] on the device code in the frontend (and no optnone passed to the BE), or is that not the case?
SYCLOptimizationMode disables some optimizations. SYCLOptimizationMode should be meant for the spir64 or SPIR-V target only, as the target lacks TTI.
Moved the pass from here.
// FIXME: this check is way too conservative.
if (Fence != RegionMemScope::Unknown && ADominatesB &&
    PDT.dominates(B->CI, A->CI) &&
    noFencedAccessesCFG(A->CI, B->CI, Fence, BBMemInfo)) {
I think there could be repeated calculation if this function is called multiple times and a basic block is re-scanned in multiple ranges (A, B), (A, C), etc. It might be enough to calculate the highest ExecScope and MemScope score of each basic block only once; then the score can be reused when the basic block is scanned multiple times.
Correct. Do you mean on line 532 to just compare with Required? Realistically that should be enough, though in some rare scenarios it will result in a missed optimization. For now I've added a shortcut in hasFencedAccesses for cases when the path from barrier A to barrier B looks like: BB-A -> other BB -> BB-B - with this shortcut, the scan of 'other BB' will be omitted.
I mean each basic block has its own score, i.e. the highest score of the instructions within the basic block. After calculating the score, there is no need to iterate over the instructions over and over again.
@@ -147,6 +147,9 @@
 #include "llvm/Transforms/Vectorize/SLPVectorizer.h"
 #include "llvm/Transforms/Vectorize/VectorCombine.h"

+// TODO: move it elsewhere
Is this something we should do before merging this PR?
Yeah, I'm thinking of just Transforms.
//     loads from __spirv_BuiltIn GVs)
//   – Unknown : any other mayReadOrWriteMemory() (intrinsics, calls,
//     generic addrspace)
// * Walk the function and record every barrier call into a list of
Is this comment not redundant with the block just above?
Signed-off-by: Sidorov, Dmitry <[email protected]>
Just a few more nits, but overall looks very good. Great job!
<< ") returned " << false << "\n"); | ||
return false; | ||
} | ||
// do not enqueue successors (there are none). |
Suggested change:
-// do not enqueue successors (there are none).
+// Do not enqueue successors (there are none).
if (semanticsSuperset(A.Semantic, B.Semantic) &&
    !semanticsSuperset(B.Semantic, A.Semantic))
  return false;
// then fall back to exec/mem‐scope width as before:
Suggested change:
-// then fall back to exec/mem‐scope width as before:
+// Then fall back to exec/mem‐scope width as before:
  }
  break;
}
if (Cur.CI) // still alive?
Suggested change:
-if (Cur.CI) // still alive?
+if (Cur.CI) // Still alive?
}

// True if BD is the first real instruction of the function.
static bool isAtKernelEntry(BarrierDesc &BD) {
This is still non-const, though. Can we add it?
}

// True if BD is immediately before a return/unreachable and nothing follows.
static bool isAtKernelExit(BarrierDesc &BD) {
Can we add it?
It removes redundant barriers (both back-to-back and in general in the CFG) and downgrades a global barrier to local if there are no global memory accesses 'between' them. See the description in SYCLOptimizeBackToBackBarrier.cpp for more details.