[AMDGPU] Insert `s_wait_xcnt(0)` before atomics to work around write combining misses hazards #167605

shiltian · 2025-11-12T00:05:42Z

This patch adds a workaround for a hazard on GFX1250: it inserts an `s_wait_xcnt(0)` instruction before any atomic operation that might write to memory.

Fixes SWDEV-543703.

shiltian · 2025-11-12T00:06:01Z

This stack of pull requests is managed by Graphite. Learn more about stacking.

rampitec

It would be better to do it in SIInsertWaitcnts. That way you would factor in other waits, so you do not wait twice.
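The point about factoring in other waits can be illustrated with a small standalone model. This is only a sketch: `PendingWait`, `mergeWaits`, `xcntWaitsEmitted`, and `NoWait` are hypothetical names, not the data structures used by SIInsertWaitcnts. It assumes only that a smaller counter value is a stricter wait.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sentinel meaning "no wait required on this counter".
constexpr unsigned NoWait = ~0u;

// Hypothetical, simplified model of the waits pending before one instruction.
// This is NOT the real LLVM wait-count type.
struct PendingWait {
  unsigned StoreCnt = NoWait;
  unsigned XCnt = NoWait;
};

// Merge two requirements: per counter, the stricter (smaller) value wins.
inline PendingWait mergeWaits(const PendingWait &A, const PendingWait &B) {
  PendingWait R;
  R.StoreCnt = std::min(A.StoreCnt, B.StoreCnt);
  R.XCnt = std::min(A.XCnt, B.XCnt);
  // The merged requirement is never weaker than either input.
  assert(R.XCnt <= A.XCnt && R.XCnt <= B.XCnt);
  return R;
}

// Number of xcnt waits that one merged record would produce (0 or 1).
inline unsigned xcntWaitsEmitted(const PendingWait &W) {
  return W.XCnt == NoWait ? 0u : 1u;
}
```

In this model, a wait that is already pending and the wait demanded by the hazard collapse into a single record, so only one instruction would be emitted. A separate pass that inserts its own wait without seeing the pending one could not make that merge.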

llvm/test/CodeGen/AMDGPU/atomic_store_local.ll

llvmbot · 2025-11-13T03:28:47Z

@llvm/pr-subscribers-llvm-globalisel

@llvm/pr-subscribers-backend-amdgpu

Author: Shilei Tian (shiltian)

Changes

This patch adds a workaround for a hazard on GFX1250: it inserts an `s_wait_xcnt(0)` instruction before any atomic operation that might write to memory.

Fixes SWDEV-543703.

Patch is 156.26 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/167605.diff

13 Files Affected:

(modified) llvm/lib/Target/AMDGPU/AMDGPU.td (+9)
(modified) llvm/lib/Target/AMDGPU/GCNSubtarget.h (+5-1)
(modified) llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp (+6)
(modified) llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll (+6)
(modified) llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll (+56)
(modified) llvm/test/CodeGen/AMDGPU/flat-saddr-atomics.ll (+188)
(modified) llvm/test/CodeGen/AMDGPU/fp-atomics-gfx942.ll (+1)
(modified) llvm/test/CodeGen/AMDGPU/fp64-atomics-gfx90a.ll (+6)
(modified) llvm/test/CodeGen/AMDGPU/global-atomicrmw-fadd.ll (+72)
(modified) llvm/test/CodeGen/AMDGPU/literal64.ll (+2)
(modified) llvm/test/CodeGen/AMDGPU/scale-offset-flat.ll (+3)
(modified) llvm/test/CodeGen/AMDGPU/scale-offset-global.ll (+2)
(modified) llvm/test/CodeGen/AMDGPU/wait-xcnt.mir (-1)

diff --git a/llvm/lib/Target/AMDGPU/AMDGPU.td b/llvm/lib/Target/AMDGPU/AMDGPU.td
index b008354cfd462..fc7c892575fdc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPU.td
+++ b/llvm/lib/Target/AMDGPU/AMDGPU.td
@@ -895,6 +895,12 @@ def FeatureCvtFP8VOP1Bug : SubtargetFeature<"cvt-fp8-vop1-bug",
   [FeatureFP8ConversionInsts]
 >;
 
+def FeatureWriteCombiningMissesHazards : SubtargetFeature<"write-combining-misses-hazards",
+  "HasWriteCombiningMissesHazards",
+  "true",
+  "Write combining misses hazards that require s_wait_cnt(0) before every atomic operation"
+>;
+
 def FeaturePkFmacF16Inst : SubtargetFeature<"pk-fmac-f16-inst",
   "HasPkFmacF16Inst",
   "true",
@@ -2145,6 +2151,7 @@ def FeatureISAVersion12_50 : FeatureSet<
    FeatureXNACK,
    FeatureClusters,
    FeatureD16Writes32BitVgpr,
+   FeatureWriteCombiningMissesHazards,
 ]>;
 
 def FeatureISAVersion12_51 : FeatureSet<
@@ -2945,6 +2952,8 @@ def HasGWS : Predicate<"Subtarget->hasGWS()">;
 def HasCvtFP8VOP1Bug : Predicate<"Subtarget->hasCvtFP8VOP1Bug()">;
 def HasNoCvtFP8VOP1Bug : Predicate<"!Subtarget->hasCvtFP8VOP1Bug()">;
 
+def HasWriteCombiningMissesHazards : Predicate<"Subtarget->hasWriteCombiningMissesHazards()">;
+
 def HasAtomicCSubNoRtnInsts : Predicate<"Subtarget->hasAtomicCSubNoRtnInsts()">;
 
 def HasScalarDwordx3Loads : Predicate<"Subtarget->hasScalarDwordx3Loads()">;
diff --git a/llvm/lib/Target/AMDGPU/GCNSubtarget.h b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
index f377b8aaf1333..52ca334f71bd4 100644
--- a/llvm/lib/Target/AMDGPU/GCNSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/GCNSubtarget.h
@@ -282,7 +282,7 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
   bool HasPointSampleAccel = false;
   bool HasLdsBarrierArriveAtomic = false;
   bool HasSetPrioIncWgInst = false;
-
+  bool HasWriteCombiningMissesHazards = false;
   bool RequiresCOV6 = false;
   bool UseBlockVGPROpsForCSR = false;
   bool HasGloballyAddressableScratch = false;
@@ -1834,6 +1834,10 @@ class GCNSubtarget final : public AMDGPUGenSubtargetInfo,
     return getGeneration() == GFX12;
   }
 
+  bool hasWriteCombiningMissesHazards() const {
+    return HasWriteCombiningMissesHazards;
+  }
+
   // Requires s_wait_alu(0) after s102/s103 write and src_flat_scratch_base
   // read.
   bool hasScratchBaseForwardingHazard() const {
diff --git a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
index 306d59d0867cd..bff0416e10b86 100644
--- a/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp
@@ -2086,6 +2086,12 @@ bool SIInsertWaitcnts::generateWaitcntInstBefore(MachineInstr &MI,
   // Verify that the wait is actually needed.
   ScoreBrackets.simplifyWaitcnt(Wait);
 
+  // An s_wait_xcnt(0) before every atomic store/RMW operation is required to
+  // work around the write combining misses hazard.
+  if (ST->hasWriteCombiningMissesHazards() && SIInstrInfo::isAtomic(MI) &&
+      SIInstrInfo::isVMEM(MI) && MI.mayStore())
+    Wait.XCnt = 0;
+
   // When forcing emit, we need to skip terminators because that would break the
   // terminators of the MBB if we emit a waitcnt between terminators.
   if (ForceEmitZeroFlag && !MI.isTerminator())
diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
index 7e297f46a780e..45843444143dc 100644
--- a/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
+++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/fp64-atomics-gfx90a.ll
@@ -1764,6 +1764,7 @@ define double @global_atomic_fadd_f64_rtn_pat(ptr addrspace(1) %ptr, double %dat
 ; GFX1250-NEXT:    v_mov_b64_e32 v[2:3], 4.0
 ; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    global_inv scope:SCOPE_SYS
@@ -1802,6 +1803,7 @@ define double @global_atomic_fadd_f64_rtn_pat_agent(ptr addrspace(1) %ptr, doubl
 ; GFX1250-NEXT:    v_mov_b64_e32 v[2:3], 4.0
 ; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    global_inv scope:SCOPE_DEV
@@ -1842,6 +1844,7 @@ define double @global_atomic_fadd_f64_rtn_pat_system(ptr addrspace(1) %ptr, doub
 ; GFX1250-NEXT:    v_mov_b64_e32 v[2:3], 4.0
 ; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    global_inv scope:SCOPE_SYS
@@ -2088,6 +2091,7 @@ define double @flat_atomic_fadd_f64_rtn_pat(ptr %ptr) #1 {
 ; GFX1250-NEXT:    v_mov_b64_e32 v[2:3], 4.0
 ; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    flat_atomic_add_f64 v[0:1], v[0:1], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    global_inv scope:SCOPE_SYS
@@ -2126,6 +2130,7 @@ define double @flat_atomic_fadd_f64_rtn_pat_agent(ptr %ptr) #1 {
 ; GFX1250-NEXT:    v_mov_b64_e32 v[2:3], 4.0
 ; GFX1250-NEXT:    global_wb scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    flat_atomic_add_f64 v[0:1], v[0:1], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_DEV
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    global_inv scope:SCOPE_DEV
@@ -2168,6 +2173,7 @@ define double @flat_atomic_fadd_f64_rtn_pat_system(ptr %ptr) #1 {
 ; GFX1250-NEXT:    v_mov_b64_e32 v[2:3], 4.0
 ; GFX1250-NEXT:    global_wb scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_storecnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    flat_atomic_add_f64 v[0:1], v[0:1], v[2:3] th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    global_inv scope:SCOPE_SYS
diff --git a/llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll b/llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll
index 54871a622189b..d159746726442 100644
--- a/llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll
+++ b/llvm/test/CodeGen/AMDGPU/atomics-system-scope.ll
@@ -6,6 +6,7 @@ define float @global_system_atomic_fadd_f32(ptr addrspace(1) %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -18,6 +19,7 @@ define float @global_one_as_atomic_fadd_f32(ptr addrspace(1) %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -30,6 +32,7 @@ define double @global_system_atomic_fadd_f64(ptr addrspace(1) %ptr, double %val)
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -42,6 +45,7 @@ define double @global_one_as_atomic_fadd_f64(ptr addrspace(1) %ptr, double %val)
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -54,6 +58,7 @@ define float @global_system_atomic_fmin_f32(ptr addrspace(1) %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_num_f32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -66,6 +71,7 @@ define float @global_one_as_atomic_fmin_f32(ptr addrspace(1) %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_num_f32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -78,6 +84,7 @@ define double @global_system_atomic_fmin_f64(ptr addrspace(1) %ptr, double %val)
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_num_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -90,6 +97,7 @@ define double @global_one_as_atomic_fmin_f64(ptr addrspace(1) %ptr, double %val)
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_num_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -102,6 +110,7 @@ define float @global_system_atomic_fmax_f32(ptr addrspace(1) %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_num_f32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -114,6 +123,7 @@ define float @global_one_as_atomic_fmax_f32(ptr addrspace(1) %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_num_f32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -126,6 +136,7 @@ define double @global_system_atomic_fmax_f64(ptr addrspace(1) %ptr, double %val)
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_num_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -138,6 +149,7 @@ define double @global_one_as_atomic_fmax_f64(ptr addrspace(1) %ptr, double %val)
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_num_f64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -150,6 +162,7 @@ define i32 @global_one_as_atomic_min_i32(ptr addrspace(1) %ptr, i32 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_i32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -162,6 +175,7 @@ define i32 @global_system_atomic_min_i32(ptr addrspace(1) %ptr, i32 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_i32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -174,6 +188,7 @@ define i32 @global_one_as_atomic_max_i32(ptr addrspace(1) %ptr, i32 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_i32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -186,6 +201,7 @@ define i32 @global_system_atomic_max_i32(ptr addrspace(1) %ptr, i32 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_i32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -198,6 +214,7 @@ define i32 @global_one_as_atomic_umin_i32(ptr addrspace(1) %ptr, i32 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_u32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -210,6 +227,7 @@ define i32 @global_system_atomic_umin_i32(ptr addrspace(1) %ptr, i32 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_u32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -222,6 +240,7 @@ define i32 @global_one_as_atomic_umax_i32(ptr addrspace(1) %ptr, i32 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_u32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -234,6 +253,7 @@ define i32 @global_system_atomic_umax_i32(ptr addrspace(1) %ptr, i32 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_u32 v0, v[0:1], v2, off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -246,6 +266,7 @@ define i64 @global_one_as_atomic_min_i64(ptr addrspace(1) %ptr, i64 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_i64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -258,6 +279,7 @@ define i64 @global_system_atomic_min_i64(ptr addrspace(1) %ptr, i64 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_i64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -270,6 +292,7 @@ define i64 @global_one_as_atomic_max_i64(ptr addrspace(1) %ptr, i64 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_i64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -282,6 +305,7 @@ define i64 @global_system_atomic_max_i64(ptr addrspace(1) %ptr, i64 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_i64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -294,6 +318,7 @@ define i64 @global_one_as_atomic_umin_i64(ptr addrspace(1) %ptr, i64 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_u64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -306,6 +331,7 @@ define i64 @global_system_atomic_umin_i64(ptr addrspace(1) %ptr, i64 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_min_u64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -318,6 +344,7 @@ define i64 @global_one_as_atomic_umax_i64(ptr addrspace(1) %ptr, i64 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_u64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -330,6 +357,7 @@ define i64 @global_system_atomic_umax_i64(ptr addrspace(1) %ptr, i64 %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_max_u64 v[0:1], v[0:1], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -514,6 +542,7 @@ define float @flat_system_atomic_fadd_f32(ptr %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    flat_atomic_add_f32 v0, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -526,6 +555,7 @@ define float @flat_one_as_atomic_fadd_f32(ptr %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    flat_atomic_add_f32 v0, v[0:1], v2 th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_set_pc_i64 s[30:31]
@@ -562,6 +592,7 @@ define double @flat_system_atomic_fadd_f64(ptr %ptr, double %val) {
 ; GFX1250-NEXT:    s_xor_b32 s1, exec_lo, s1
 ; GFX1250-NEXT:    s_cbranch_execz .LBB34_5
 ; GFX1250-NEXT:  ; %bb.4: ; %atomicrmw.global
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f64 v[0:1], v[4:5], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; GFX1250-NEXT:    ; implicit-def: $vgpr2_vgpr3
@@ -627,6 +658,7 @@ define double @flat_one_as_atomic_fadd_f64(ptr %ptr, double %val) {
 ; GFX1250-NEXT:    s_xor_b32 s1, exec_lo, s1
 ; GFX1250-NEXT:    s_cbranch_execz .LBB35_5
 ; GFX1250-NEXT:  ; %bb.4: ; %atomicrmw.global
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    global_atomic_add_f64 v[0:1], v[4:5], v[2:3], off th:TH_ATOMIC_RETURN scope:SCOPE_SYS
 ; GFX1250-NEXT:    ; implicit-def: $vgpr4_vgpr5
 ; GFX1250-NEXT:    ; implicit-def: $vgpr2_vgpr3
@@ -668,6 +700,7 @@ define float @flat_system_atomic_fmin_f32(ptr %ptr, float %val) {
 ; GFX1250:       ; %bb.0:
 ; GFX1250-NEXT:    s_wait_loadcnt_dscnt 0x0
 ; GFX1250-NEXT:    s_wait_kmcnt 0x0
+; GFX1250-NEXT:    s_wait_xcnt 0x0
 ; GFX1250-NEXT:    flat_atomic_min_num_f32 v0, v[0:1],...
[truncated]
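The condition added in the SIInsertWaitcnts.cpp hunk above can be modeled in isolation. The following is a minimal sketch, not the pass itself: `InstrDesc`, `requiredXCnt`, and `kNoXCntWait` are hypothetical stand-ins for the `MachineInstr` queries and the `Wait.XCnt` field, and the boolean parameter stands in for `ST->hasWriteCombiningMissesHazards()`.

```cpp
#include <cassert>

// Hypothetical sentinel meaning "no xcnt wait requested".
constexpr unsigned kNoXCntWait = ~0u;

// Hypothetical stand-in for the three instruction properties the hunk checks.
struct InstrDesc {
  bool IsAtomic; // models SIInstrInfo::isAtomic(MI)
  bool IsVMEM;   // models SIInstrInfo::isVMEM(MI)
  bool MayStore; // models MI.mayStore()
};

// Returns the xcnt value to wait for before the instruction.
// ExistingXCnt is whatever earlier logic had already decided.
inline unsigned requiredXCnt(bool HasHazard, const InstrDesc &MI,
                             unsigned ExistingXCnt) {
  // Same shape as the hunk: all four conditions must hold to force a full wait.
  if (HasHazard && MI.IsAtomic && MI.IsVMEM && MI.MayStore)
    return 0;
  return ExistingXCnt;
}
```

Under these assumptions, a storing VMEM atomic on a subtarget with the hazard always gets a wait of 0, while any instruction that fails one of the checks keeps the previously computed value.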

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp


Pierre-vh · 2025-11-13T09:14:50Z

How is this different from the workaround already implemented in SIMemoryLegalizer? We also insert xcnt waits there. See finalizeStore in SIGfx12CacheControl, and the requiresWaitXCntBeforeAtomicStores Subtarget hook.

shiltian · 2025-11-13T14:34:25Z

Aha, thanks for pointing that out, @Pierre-vh. The workaround was proposed after your PR had been merged (internally), but I think that should cover this case.

shiltian requested a review from rampitec November 12, 2025 00:05

shiltian changed the title ~~[AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write-combining miss bug~~ [AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write combining miss bug Nov 12, 2025

shiltian changed the title ~~[AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write combining miss bug~~ [AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write combining misses bug Nov 12, 2025

shiltian force-pushed the users/shiltian/write-combining-miss-bug branch from b79ad30 to 6cbe7d6 Compare November 12, 2025 00:13

shiltian changed the title ~~[AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write combining misses bug~~ [AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write combining misses hazzards Nov 12, 2025

shiltian changed the title ~~[AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write combining misses hazzards~~ [AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write combining misses hazards Nov 12, 2025

shiltian force-pushed the users/shiltian/write-combining-miss-bug branch 3 times, most recently from 53455f7 to c15b311 Compare November 12, 2025 16:48

rampitec reviewed Nov 12, 2025

View reviewed changes

llvm/test/CodeGen/AMDGPU/atomic_store_local.ll Outdated Show resolved Hide resolved

shiltian force-pushed the users/shiltian/write-combining-miss-bug branch from c15b311 to 0033198 Compare November 13, 2025 03:27

shiltian marked this pull request as ready for review November 13, 2025 03:28

llvmbot added backend:AMDGPU llvm:globalisel labels Nov 13, 2025

shiltian requested a review from rampitec November 13, 2025 03:39

shiltian force-pushed the users/shiltian/write-combining-miss-bug branch from 0033198 to e58e799 Compare November 13, 2025 04:18

arsenm reviewed Nov 13, 2025

View reviewed changes

llvm/lib/Target/AMDGPU/SIInsertWaitcnts.cpp Outdated Show resolved Hide resolved

shiltian added 2 commits November 13, 2025 00:05

[AMDGPU] Insert s_wait_xcnt(0) before atomics to work around write-…

ccde342

…combining miss hazard This patch adds a workaround for a hazard on GFX1250, which inserts an `s_wait_xcnt(0)` instruction before any atomic operation that might write to memory. Fixes SWDEV-543703.

remove unnecessary mayStore check

4d47649

shiltian force-pushed the users/shiltian/write-combining-miss-bug branch from e58e799 to 4d47649 Compare November 13, 2025 05:06

Pierre-vh self-requested a review November 13, 2025 09:14

shiltian closed this Nov 13, 2025

shiltian deleted the users/shiltian/write-combining-miss-bug branch November 13, 2025 14:34
