[AMDGPU] Use reverse iteration in CodeGenPrepare #145484

Open · wants to merge 1 commit into base: users/pierre-vh/rm-promote-i32-cgp

Conversation

Pierre-vh (Contributor)

To make the switch to reverse iteration easier, I also removed all "eraseFromParent" calls from the visitors; instead, instructions are added to a set of values to delete once the whole function has been visited.
This avoids crashes caused by visitors deleting operands that the iteration has not reached yet. In theory we could allow a visitor to delete the instruction it visited (and only that one), but I think having one idiom for everything is less error-prone.

Fixes #140219
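
Roughly, the idiom looks like this (a condensed sketch of the patch below; `visitFoo` and `buildReplacement` are placeholder names, not functions from the patch):

```cpp
// Class member: visitors only mark instructions dead; nothing is erased
// until the whole function has been walked, so no visitor can invalidate
// IR the iteration still has to reach.
mutable SetVector<Value *> DeadVals;

bool visitFoo(Instruction &I) {
  Value *Repl = buildReplacement(I); // placeholder for the actual combine
  I.replaceAllUsesWith(Repl);
  DeadVals.insert(&I);               // defer; do not call eraseFromParent()
  return true;
}

// After the walk, drain the set. The callback keeps the set free of
// dangling pointers when recursive deletion removes other queued values.
while (!DeadVals.empty())
  RecursivelyDeleteTriviallyDeadInstructions(
      DeadVals.pop_back_val(), TLI, /*MSSAU=*/nullptr,
      [&](Value *V) { DeadVals.remove(V); });
```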


Pierre-vh commented Jun 24, 2025

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite.

@Pierre-vh Pierre-vh requested review from arsenm, jayfoad and shiltian June 24, 2025 09:39
@Pierre-vh Pierre-vh marked this pull request as ready for review June 24, 2025 09:39
@llvmbot

llvmbot commented Jun 24, 2025

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

To make the switch to reverse iteration easier, I also removed all "eraseFromParent" calls from the visitors; instead, instructions are added to a set of values to delete once the whole function has been visited.
This avoids crashes caused by visitors deleting operands that the iteration has not reached yet. In theory we could allow a visitor to delete the instruction it visited (and only that one), but I think having one idiom for everything is less error-prone.

Fixes #140219


Patch is 48.17 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/145484.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp (+29-53)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-break-large-phis-heuristics.ll (+21-21)
  • (modified) llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll (+70-40)
  • (modified) llvm/test/CodeGen/AMDGPU/fdiv_flags.f32.ll (+33-105)
  • (modified) llvm/test/CodeGen/AMDGPU/uniform-select.ll (+32-32)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
index 5f1983791cfae..2a3aa1ac672b6 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp
@@ -15,6 +15,7 @@
 #include "AMDGPU.h"
 #include "AMDGPUTargetMachine.h"
 #include "SIModeRegisterDefaults.h"
+#include "llvm/ADT/SetVector.h"
 #include "llvm/Analysis/AssumptionCache.h"
 #include "llvm/Analysis/ConstantFolding.h"
 #include "llvm/Analysis/TargetLibraryInfo.h"
@@ -109,6 +110,7 @@ class AMDGPUCodeGenPrepareImpl
   bool FlowChanged = false;
   mutable Function *SqrtF32 = nullptr;
   mutable Function *LdexpF32 = nullptr;
+  mutable SetVector<Value *> DeadVals;
 
   DenseMap<const PHINode *, bool> BreakPhiNodesCache;
 
@@ -285,28 +287,19 @@ bool AMDGPUCodeGenPrepareImpl::run() {
   BreakPhiNodesCache.clear();
   bool MadeChange = false;
 
-  Function::iterator NextBB;
-  for (Function::iterator FI = F.begin(), FE = F.end(); FI != FE; FI = NextBB) {
-    BasicBlock *BB = &*FI;
-    NextBB = std::next(FI);
-
-    BasicBlock::iterator Next;
-    for (BasicBlock::iterator I = BB->begin(), E = BB->end(); I != E;
-         I = Next) {
-      Next = std::next(I);
-
-      MadeChange |= visit(*I);
-
-      if (Next != E) { // Control flow changed
-        BasicBlock *NextInstBB = Next->getParent();
-        if (NextInstBB != BB) {
-          BB = NextInstBB;
-          E = BB->end();
-          FE = F.end();
-        }
-      }
+  for (BasicBlock &BB : reverse(F)) {
+    for (Instruction &I : make_early_inc_range(reverse(BB))) {
+      if (!DeadVals.contains(&I))
+        MadeChange |= visit(I);
     }
   }
+
+  while (!DeadVals.empty()) {
+    RecursivelyDeleteTriviallyDeadInstructions(
+        DeadVals.pop_back_val(), TLI, /*MSSAU*/ nullptr,
+        [&](Value *V) { DeadVals.remove(V); });
+  }
+
   return MadeChange;
 }
 
@@ -426,7 +419,7 @@ bool AMDGPUCodeGenPrepareImpl::replaceMulWithMul24(BinaryOperator &I) const {
   Value *NewVal = insertValues(Builder, Ty, ResultVals);
   NewVal->takeName(&I);
   I.replaceAllUsesWith(NewVal);
-  I.eraseFromParent();
+  DeadVals.insert(&I);
 
   return true;
 }
@@ -500,10 +493,10 @@ bool AMDGPUCodeGenPrepareImpl::foldBinOpIntoSelect(BinaryOperator &BO) const {
                                           FoldedT, FoldedF);
   NewSelect->takeName(&BO);
   BO.replaceAllUsesWith(NewSelect);
-  BO.eraseFromParent();
+  DeadVals.insert(&BO);
   if (CastOp)
-    CastOp->eraseFromParent();
-  Sel->eraseFromParent();
+    DeadVals.insert(CastOp);
+  DeadVals.insert(Sel);
   return true;
 }
 
@@ -900,7 +893,7 @@ bool AMDGPUCodeGenPrepareImpl::visitFDiv(BinaryOperator &FDiv) {
   if (NewVal) {
     FDiv.replaceAllUsesWith(NewVal);
     NewVal->takeName(&FDiv);
-    RecursivelyDeleteTriviallyDeadInstructions(&FDiv, TLI);
+    DeadVals.insert(&FDiv);
   }
 
   return true;
@@ -1310,7 +1303,8 @@ within the byte are all 0.
 static bool tryNarrowMathIfNoOverflow(Instruction *I,
                                       const SITargetLowering *TLI,
                                       const TargetTransformInfo &TTI,
-                                      const DataLayout &DL) {
+                                      const DataLayout &DL,
+                                      SetVector<Value *> &DeadVals) {
   unsigned Opc = I->getOpcode();
   Type *OldType = I->getType();
 
@@ -1365,7 +1359,7 @@ static bool tryNarrowMathIfNoOverflow(Instruction *I,
 
   Value *Zext = Builder.CreateZExt(Arith, OldType);
   I->replaceAllUsesWith(Zext);
-  I->eraseFromParent();
+  DeadVals.insert(I);
   return true;
 }
 
@@ -1376,7 +1370,7 @@ bool AMDGPUCodeGenPrepareImpl::visitBinaryOperator(BinaryOperator &I) {
   if (UseMul24Intrin && replaceMulWithMul24(I))
     return true;
   if (tryNarrowMathIfNoOverflow(&I, ST.getTargetLowering(),
-                                TM.getTargetTransformInfo(F), DL))
+                                TM.getTargetTransformInfo(F), DL, DeadVals))
     return true;
 
   bool Changed = false;
@@ -1441,7 +1435,7 @@ bool AMDGPUCodeGenPrepareImpl::visitBinaryOperator(BinaryOperator &I) {
 
     if (NewDiv) {
       I.replaceAllUsesWith(NewDiv);
-      I.eraseFromParent();
+      DeadVals.insert(&I);
       Changed = true;
     }
   }
@@ -1497,7 +1491,7 @@ bool AMDGPUCodeGenPrepareImpl::visitLoadInst(LoadInst &I) {
     Value *ValTrunc = Builder.CreateTrunc(WidenLoad, IntNTy);
     Value *ValOrig = Builder.CreateBitCast(ValTrunc, I.getType());
     I.replaceAllUsesWith(ValOrig);
-    I.eraseFromParent();
+    DeadVals.insert(&I);
     return true;
   }
 
@@ -1539,7 +1533,7 @@ bool AMDGPUCodeGenPrepareImpl::visitSelectInst(SelectInst &I) {
 
   Fract->takeName(&I);
   I.replaceAllUsesWith(Fract);
-  RecursivelyDeleteTriviallyDeadInstructions(&I, TLI);
+  DeadVals.insert(&I);
   return true;
 }
 
@@ -1827,7 +1821,7 @@ bool AMDGPUCodeGenPrepareImpl::visitPHINode(PHINode &I) {
   }
 
   I.replaceAllUsesWith(Vec);
-  I.eraseFromParent();
+  DeadVals.insert(&I);
   return true;
 }
 
@@ -1908,7 +1902,7 @@ bool AMDGPUCodeGenPrepareImpl::visitAddrSpaceCastInst(AddrSpaceCastInst &I) {
   auto *Intrin = B.CreateIntrinsic(
       I.getType(), Intrinsic::amdgcn_addrspacecast_nonnull, {I.getOperand(0)});
   I.replaceAllUsesWith(Intrin);
-  I.eraseFromParent();
+  DeadVals.insert(&I);
   return true;
 }
 
@@ -2005,16 +1999,10 @@ bool AMDGPUCodeGenPrepareImpl::visitFMinLike(IntrinsicInst &I) {
   Value *Fract = applyFractPat(Builder, FractArg);
   Fract->takeName(&I);
   I.replaceAllUsesWith(Fract);
-
-  RecursivelyDeleteTriviallyDeadInstructions(&I, TLI);
+  DeadVals.insert(&I);
   return true;
 }
 
-static bool isOneOrNegOne(const Value *Val) {
-  const APFloat *C;
-  return match(Val, m_APFloat(C)) && C->getExactLog2Abs() == 0;
-}
-
 // Expand llvm.sqrt.f32 calls with !fpmath metadata in a semi-fast way.
 bool AMDGPUCodeGenPrepareImpl::visitSqrt(IntrinsicInst &Sqrt) {
   Type *Ty = Sqrt.getType()->getScalarType();
@@ -2035,18 +2023,6 @@ bool AMDGPUCodeGenPrepareImpl::visitSqrt(IntrinsicInst &Sqrt) {
   if (ReqdAccuracy < 1.0f)
     return false;
 
-  // FIXME: This is an ugly hack for this pass using forward iteration instead
-  // of reverse. If it worked like a normal combiner, the rsq would form before
-  // we saw a sqrt call.
-  auto *FDiv =
-      dyn_cast_or_null<FPMathOperator>(Sqrt.getUniqueUndroppableUser());
-  if (FDiv && FDiv->getOpcode() == Instruction::FDiv &&
-      FDiv->getFPAccuracy() >= 1.0f &&
-      canOptimizeWithRsq(FPOp, FDiv->getFastMathFlags(), SqrtFMF) &&
-      // TODO: We should also handle the arcp case for the fdiv with non-1 value
-      isOneOrNegOne(FDiv->getOperand(0)))
-    return false;
-
   Value *SrcVal = Sqrt.getOperand(0);
   bool CanTreatAsDAZ = canIgnoreDenormalInput(SrcVal, &Sqrt);
 
@@ -2070,7 +2046,7 @@ bool AMDGPUCodeGenPrepareImpl::visitSqrt(IntrinsicInst &Sqrt) {
   Value *NewSqrt = insertValues(Builder, Sqrt.getType(), ResultVals);
   NewSqrt->takeName(&Sqrt);
   Sqrt.replaceAllUsesWith(NewSqrt);
-  Sqrt.eraseFromParent();
+  DeadVals.insert(&Sqrt);
   return true;
 }
 
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-break-large-phis-heuristics.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-break-large-phis-heuristics.ll
index c94b33334646d..1f36902762f0b 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-break-large-phis-heuristics.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-break-large-phis-heuristics.ll
@@ -726,16 +726,16 @@ define amdgpu_kernel void @used_by_unbreakable_and_breakable_phi(<5 x double> %i
 ; CHECK-NEXT:    [[LARGEPHI_EXTRACTSLICE815:%.*]] = extractelement <5 x double> [[LARGEPHI_INSERTSLICE4]], i64 4
 ; CHECK-NEXT:    br label [[END]]
 ; CHECK:       end:
-; CHECK-NEXT:    [[TMP5:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE01]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE1]], [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP6:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE22]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE3]], [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP7:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE43]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE5]], [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP8:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE64]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE7]], [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP9:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE85]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE9]], [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP10:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE011]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP11:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE212]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP12:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE413]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP13:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE614]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
-; CHECK-NEXT:    [[TMP14:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE815]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP5:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE01]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP6:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE22]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP7:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE43]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP8:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE64]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP9:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE85]], [[THEN1]] ], [ 0.000000e+00, [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP10:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE011]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE1]], [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP11:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE212]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE3]], [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP12:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE413]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE5]], [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP13:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE614]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE7]], [[FINALLY]] ]
+; CHECK-NEXT:    [[TMP14:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE815]], [[THEN1]] ], [ [[LARGEPHI_EXTRACTSLICE9]], [[FINALLY]] ]
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE016:%.*]] = insertelement <5 x double> poison, double [[TMP10]], i64 0
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE117:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE016]], double [[TMP11]], i64 1
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE218:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE117]], double [[TMP12]], i64 2
@@ -746,8 +746,8 @@ define amdgpu_kernel void @used_by_unbreakable_and_breakable_phi(<5 x double> %i
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE28:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE17]], double [[TMP7]], i64 2
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE39:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE28]], double [[TMP8]], i64 3
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE410:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE39]], double [[TMP9]], i64 4
-; CHECK-NEXT:    store <5 x double> [[LARGEPHI_INSERTSLICE410]], ptr [[OUT]], align 1
 ; CHECK-NEXT:    store <5 x double> [[LARGEPHI_INSERTSLICE420]], ptr [[OUT]], align 1
+; CHECK-NEXT:    store <5 x double> [[LARGEPHI_INSERTSLICE410]], ptr [[OUT]], align 1
 ; CHECK-NEXT:    ret void
 ;
 entry:
@@ -1187,11 +1187,11 @@ define amdgpu_kernel void @test_breakable_chain_5_out_of_7(<5 x double> %in, ptr
 ; CHECK-NEXT:    [[LARGEPHI_EXTRACTSLICE960:%.*]] = extractelement <5 x double> [[IN]], i64 4
 ; CHECK-NEXT:    br i1 [[COND]], label [[END:%.*]], label [[COND5_END]]
 ; CHECK:       cond5.end:
-; CHECK-NEXT:    [[TMP25:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE041]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE1]], [[COND5_TRUE]] ]
-; CHECK-NEXT:    [[TMP26:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE242]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE3]], [[COND5_TRUE]] ]
-; CHECK-NEXT:    [[TMP27:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE443]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE5]], [[COND5_TRUE]] ]
-; CHECK-NEXT:    [[TMP28:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE644]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE7]], [[COND5_TRUE]] ]
-; CHECK-NEXT:    [[TMP29:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE845]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE9]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP25:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE041]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE152]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP26:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE242]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE354]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP27:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE443]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE556]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP28:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE644]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE758]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP29:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE845]], [[COND4_END]] ], [ [[LARGEPHI_EXTRACTSLICE960]], [[COND5_TRUE]] ]
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE046:%.*]] = insertelement <5 x double> poison, double [[TMP25]], i64 0
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE147:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE046]], double [[TMP26]], i64 1
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE248:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE147]], double [[TMP27]], i64 2
@@ -1204,11 +1204,11 @@ define amdgpu_kernel void @test_breakable_chain_5_out_of_7(<5 x double> %in, ptr
 ; CHECK-NEXT:    [[LARGEPHI_EXTRACTSLICE859:%.*]] = extractelement <5 x double> [[LARGEPHI_INSERTSLICE450]], i64 4
 ; CHECK-NEXT:    br label [[END]]
 ; CHECK:       end:
-; CHECK-NEXT:    [[TMP30:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE051]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE152]], [[COND5_TRUE]] ]
-; CHECK-NEXT:    [[TMP31:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE253]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE354]], [[COND5_TRUE]] ]
-; CHECK-NEXT:    [[TMP32:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE455]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE556]], [[COND5_TRUE]] ]
-; CHECK-NEXT:    [[TMP33:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE657]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE758]], [[COND5_TRUE]] ]
-; CHECK-NEXT:    [[TMP34:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE859]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE960]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP30:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE051]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE1]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP31:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE253]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE3]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP32:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE455]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE5]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP33:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE657]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE7]], [[COND5_TRUE]] ]
+; CHECK-NEXT:    [[TMP34:%.*]] = phi double [ [[LARGEPHI_EXTRACTSLICE859]], [[COND5_END]] ], [ [[LARGEPHI_EXTRACTSLICE9]], [[COND5_TRUE]] ]
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE061:%.*]] = insertelement <5 x double> poison, double [[TMP30]], i64 0
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE162:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE061]], double [[TMP31]], i64 1
 ; CHECK-NEXT:    [[LARGEPHI_INSERTSLICE263:%.*]] = insertelement <5 x double> [[LARGEPHI_INSERTSLICE162]], double [[TMP32]], i64 2
diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
index a4f9ce3e7350a..7ff86ac152feb 100644
--- a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
+++ b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-fdiv.ll
@@ -2160,7 +2160,22 @@ define amdgpu_kernel void @rsq_f32_vector_fpmath(ptr addrspace(1) %out, <2 x flo
 ; IEEE-GOODFREXP-NEXT:    [[TMP38:%.*]] = insertelement <2 x float> poison, float [[TMP27]], i64 0
 ; IEEE-GOODFREXP-NEXT:    [[MD_1ULP_UNDEF:%.*]] = insertelement <2 x float> [[TMP38]], float [[TMP37]], i64 1
 ; IEEE-GOODFREXP-NEXT:    store volatile <2 x float> [[MD_1ULP_UNDEF]], ptr addrspace(1) [[OUT]], align 4
-; IEEE-GOODFREXP-NEXT:    [[SQRT_X_3ULP:%.*]] = call contract <2 x float> @llvm.sqrt.v2f32(<2 x float> [[X]]), !fpmath [[META3:![0-9]+]]
+; IEEE-GOODFREXP-NEXT:    [[TMP56:%.*]] = extractelement <2 x float> [[X]], i64 0
+; IEEE-GOODFREXP-NEXT:    [[TMP57:%.*]] = extractelement <2 x float> [[X]], i64 1
+; IEEE-GOODFREXP-NEXT:    [[TMP58:%.*]] = fcmp olt float [[TMP56]], 0x3810000000000000
+; IEEE-GOODFREXP-NEXT:    [[TMP59:%.*]] = select i1 [[TMP58]], i32 32, i32 0
+; IEEE-GOODFREXP-NEXT:    [[TMP60:%.*]] = call float @llvm.ldexp.f32.i32(float [[TMP56]], i32 [[TMP59]])
+; IEEE-GOODFREXP-NEXT:    [[TMP61:%.*]] = call float @llvm.amdgcn.sqrt.f32(float [[TMP60]])
+; IEEE-GOODFREXP-NEXT:    [[TMP62:%.*]] = select i1 [[TMP58]], i32 -16, i32 0
+; IEEE-GOODFREXP-NEXT:    [[TMP63:%.*]] = call float @llvm.ldexp.f32.i32(float [[TMP61]], i32 [[TMP62]])
+; IEEE-GOODFREXP-NEXT:    [[TMP64:%.*]] = fcmp olt float [[TMP57]], 0x3810000000000000
+; IEEE-GOODFREXP-NEXT:    [[TMP65:%.*]] = select i1 [[TMP64]], i32 32, i32 0
+; IEEE-GOODFREXP-NEXT:    [[TMP66:%.*]] = call float @llvm.ldexp.f32.i32(float [[TMP57]], i32 [[TMP65]])
+; IEEE-GOODFREXP-NEXT:    [[TMP67:%.*]] = call float @llvm.amdgcn.sqrt.f32(float [[TMP66]])
+; IEEE-GOODFREXP-NEXT:    [[TMP68:%.*]] = select i1 [[TMP64]], i32 -16, i32 0
+; IEEE-GOODFREXP-NEXT:    [[TMP69:%.*]] = call float @llvm.ldexp.f32.i32(float [[TMP67]], i32 [[TMP68]])
+; IEEE-GOODFREXP-NEXT:    [[TMP70:%.*]] = insertelement <2 x float> poison, float [[TMP63]], i64 0
+; IEEE-GOODFREXP-NEXT:    [[SQRT_X_3ULP:%.*]] = insertelement <2 x float> [[TMP70]], float [[TMP69]], i64 1
 ; IEEE-GOODFREXP-NEXT:    [[TMP39:%.*]] = extractelement <2 x float> [[SQRT_X_3ULP]], i64 0
 ; IEEE-GOODFREXP-NEXT:    [[TMP40:%.*]] = extractelement <2 x float> [[SQRT_X_3ULP]], i64 1
 ; IEEE-GOODFREXP-NEXT:    [[TMP41:%.*]] = extractelement <2 x float> [[X]], i64 0
@@ -2231,7 +2246,22 @@ define amdgpu_kernel void @rsq_f32_vector_fpmath(ptr addrspace(1) %out, <2 x flo
 ; IEEE-BADFREXP-NEXT:    [[TMP38:%.*]] = insertelement <2 x float> poison, float [[TMP27]], i64 0
 ; IEEE-BADFREXP-NEXT:    [[MD_1ULP_UNDEF:%.*]] = insertelement <2 x float> [[TMP38]], float [[TMP37]], i64 1
 ; IEEE-BADFREXP-NEXT:    store volatile <2 x float> [[MD_1ULP_UNDEF]], ptr addrspace(1) [[OUT]], align 4
-; IEEE-BADFREXP-NEXT:    [[SQRT_X_3ULP:%.*]] = call contract <2 x float> @llvm.sqrt.v2f32(<2 x float> [[X]]), !fpmath [[META3:![0-9]+]]
+; IEEE-BADFREXP-NEXT:    [[TMP56:%.*]] = extractelement <2 x float> [[X]], i64 0
+; IEEE-BADFREXP-NEXT:    [[TMP57:%.*]] = extractelement <2 x float> [[X]], i64 1
+; IEEE-BADFREXP-NEXT:    [[TMP58:%.*]] = fcmp olt float [[TMP56]], 0x3810000000000000
+; IEEE-BADFREXP-NEXT:    [[TMP59:%.*]] = select i1 [[TMP58]], i32 32, i32 0
+; IEEE-BADFREXP-NEXT:    [[TMP60:%.*]] = call float @llvm.ldexp.f32.i32(float [[TMP56]], i32 [[TMP59]])
+; IEEE-BADFREXP-NEXT:    [[TMP61:%.*]] = call float @llvm.amdgcn.sqrt.f32(float [[TMP60]])
+; IEEE-BADFREXP-NEXT:    [[TMP62:%.*]] = select i1 [[TMP58]], i32 -16, i32 0
+; IEEE-BADFREXP-NEXT:    [[TMP63:%.*]] = call float @llvm.ldexp.f32.i32(float [[TMP61]], i32 [[TMP62]])
+; IEEE-BADFREXP-NEXT:    [[TMP64:%.*]] = fcmp olt float [[TMP57]], 0x3810000000000000
+; IEEE-BADFREXP-NEXT:    [[TMP65:%.*]] = select i1 [[TMP64]], i32 32, i32 0
+; IEEE-BADFREXP-NEXT:    [[TMP66:%.*]] = call float @llvm.ldexp.f32.i32(float [[TMP57]], i32 [[TMP65]])
+; IEEE-BADFREXP-NEXT:    [[TMP67:%.*]] = call float @llvm.amdgcn.sqrt.f32(float [[TMP66]])
+; IEEE-BADFREXP-NEXT:    [[TMP68:%.*]] = select i1 [[TMP64]], i32 -16, i32 0
+; IEEE-BADFREXP-NEXT:    [[TMP69:%.*]] = call float @llvm.ldexp.f32.i32(float [[TMP67]], i32 [[TMP68]])
+; IEEE-BADFREXP-NEXT:    [[TMP70:%.*]] = insertelement <2 x float> poison, float [[TMP63]], i64 0
+; IEEE-BADFREXP-NEXT:    [[SQRT_X_3ULP:%.*]] = insertelement <2 x float> [[TMP70]], float [[TMP69]], i64 1
 ; IEEE-BADFREXP-NEXT:    [[TMP39:%.*]] = extractelement <2 x float> [[SQRT_X_3ULP]], i64 0
 ; IEEE-BADFREXP-NEXT:    [[TMP40:%.*]] = extractelement <2 x float> [[SQRT_X_3ULP]], i64 1
 ; IEEE-BADFREXP-NEXT:    [[TMP41:%.*]] = extra...
[truncated]

@@ -1634,29 +1634,18 @@ define float @v_recip_sqrt_f32_ulp25_contract(float %x) {
; IR-IEEE-SDAG-LABEL: v_recip_sqrt_f32_ulp25_contract:
; IR-IEEE-SDAG: ; %bb.0:
; IR-IEEE-SDAG-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; IR-IEEE-SDAG-NEXT: s_mov_b32 s4, 0xf800000
; IR-IEEE-SDAG-NEXT: v_mul_f32_e32 v1, 0x4f800000, v0
; IR-IEEE-SDAG-NEXT: s_mov_b32 s4, 0x800000
Pierre-vh (Contributor, Author):

This file changed after I deleted the hack that worked around forward iteration.

@@ -2160,7 +2160,22 @@ define amdgpu_kernel void @rsq_f32_vector_fpmath(ptr addrspace(1) %out, <2 x flo
; IEEE-GOODFREXP-NEXT: [[TMP38:%.*]] = insertelement <2 x float> poison, float [[TMP27]], i64 0
; IEEE-GOODFREXP-NEXT: [[MD_1ULP_UNDEF:%.*]] = insertelement <2 x float> [[TMP38]], float [[TMP37]], i64 1
; IEEE-GOODFREXP-NEXT: store volatile <2 x float> [[MD_1ULP_UNDEF]], ptr addrspace(1) [[OUT]], align 4
; IEEE-GOODFREXP-NEXT: [[SQRT_X_3ULP:%.*]] = call contract <2 x float> @llvm.sqrt.v2f32(<2 x float> [[X]]), !fpmath [[META3:![0-9]+]]
; IEEE-GOODFREXP-NEXT: [[TMP56:%.*]] = extractelement <2 x float> [[X]], i64 0
Pierre-vh (Contributor, Author):

I don't understand why this changed so much, or whether it's a good thing. I've only observed this change, and there's no corresponding codegen regression, so I assume it's neutral?


while (!DeadVals.empty()) {
RecursivelyDeleteTriviallyDeadInstructions(
DeadVals.pop_back_val(), TLI, /*MSSAU*/ nullptr,
Contributor:

Suggested change:
-        DeadVals.pop_back_val(), TLI, /*MSSAU*/ nullptr,
+        DeadVals.pop_back_val(), TLI, /*MSSAU=*/ nullptr,
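
(The suggested `/*MSSAU=*/` form is the argument-comment style that tools such as clang-tidy's `bugprone-argument-comment` check recognize and verify against the parameter name.)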

@@ -109,6 +110,7 @@ class AMDGPUCodeGenPrepareImpl
bool FlowChanged = false;
mutable Function *SqrtF32 = nullptr;
mutable Function *LdexpF32 = nullptr;
mutable SetVector<Value *> DeadVals;
Contributor:

I don't think this needs to be a set; the iteration shouldn't visit the same instruction twice (at least, other IR combiner passes all seem to use vectors).

Pierre-vh (Contributor, Author):

I didn't use a set because I thought we could visit twice; I used one because I call the recursive delete function, and I need to avoid cases where it deletes another instruction in the set (and we then visit a dangling pointer).

I can of course use a vector + find if you prefer.
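
For illustration, a hypothetical sketch of the hazard with a plain worklist and no deletion callback (`A` and `B` are assumed instructions, not names from the patch):

```cpp
// Suppose %b is an operand of %a, %a is %b's only user, and both were
// queued as dead during the walk.
SmallVector<Instruction *, 8> Worklist = {B, A};
while (!Worklist.empty())
  RecursivelyDeleteTriviallyDeadInstructions(Worklist.pop_back_val(), TLI);
// Popping A deletes A and, recursively, the now-dead B. B is still in
// Worklist, so the next pop returns a pointer to freed memory.
```

The SetVector plus the about-to-delete callback removes `B` from the set before it is freed, so the loop never sees a stale entry.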

@jayfoad

jayfoad commented Jun 24, 2025

I don't understand the high level motivation here. "Normal" combining/simplification order is to visit the operands of an instruction before you visit the instruction itself. That way the "visit" function can assume that the operands have already been simplified. GlobalISel combines already work this way, and a lot of effort has been put into trying to make the SelectionDAG combiner work this way too.

@arsenm

arsenm commented Jun 24, 2025

> I don't understand the high level motivation here. "Normal" combining/simplification order is to visit the operands of an instruction before you visit the instruction itself.

Pattern matching is bottom up. This is essentially a selection problem, and selection is done bottom up.

> That way the "visit" function can assume that the operands have already been simplified.

This is one of the problems: you want to see the original value before it's been dirtied up by the transformations.

> GlobalISel combines already work this way,

The GlobalISel combiner is in reverse order.

Doing it in reverse also largely avoids the problem of stale uniformity info. All newly created values are falsely seen as uniform; by iterating in reverse, you never encounter those new, incorrect values.
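
For concreteness, a sketch of the ordering for the sqrt/fdiv case (the loop is the shape this patch uses; the two IR lines are an assumed example):

```cpp
// Program order in the block:
//   %s = call contract float @llvm.sqrt.f32(float %x), !fpmath !0
//   %d = fdiv contract float 1.0, %s, !fpmath !0
//
// Reverse iteration visits %d while %s is still an unexpanded sqrt call,
// so the fdiv+sqrt pair can fold straight to rsq. With forward iteration,
// visitSqrt would already have expanded %s, which is what the deleted
// look-ahead hack compensated for.
for (Instruction &I : make_early_inc_range(reverse(BB))) {
  if (!DeadVals.contains(&I)) // skip instructions already replaced
    MadeChange |= visit(I);
}
```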

@jayfoad

jayfoad commented Jun 24, 2025

>> I don't understand the high level motivation here. "Normal" combining/simplification order is to visit the operands of an instruction before you visit the instruction itself.
>
> Pattern matching is bottom up. This is essentially a selection problem, and selection is done bottom up.

When you say "bottom up", I think you mean the opposite of how it's used in the instruction selection literature, so I was trying to avoid that phrase.

>> GlobalISel combines already work this way,
>
> The GlobalISel combiner is in reverse order.

??? It iterates reverse(*MBB) to push instructions onto the back of the worklist, and then pops them off the back to process them.

@arsenm

arsenm commented Jun 24, 2025

This is not a simplifying pass; it is making the IR more complicated. We have to do hacks like this to prevent later, more profitable combines from having to parse expanded IR:

> // FIXME: This is an ugly hack for this pass using forward iteration instead

@jayfoad

jayfoad commented Jun 24, 2025

> This is not a simplifying pass; it is making the IR more complicated. We have to do hacks like this to prevent later, more profitable combines from having to parse expanded IR:

Fair enough, makes sense. I just want to make sure the justification is properly understood and explained, preferably in the commit message.

> // FIXME: This is an ugly hack for this pass using forward iteration instead

I only disagree with the "like a normal combiner" part of that comment :)

@Pierre-vh Pierre-vh requested a review from arsenm June 26, 2025 08:34