[LoadStoreVectorizer] Propagate alignment through contiguous chain #145733

Open · wants to merge 3 commits into main

Conversation

dakersnar (Contributor)

At this point in the vectorization pass, we are guaranteed to have a contiguous chain with defined offsets for each element. Using this information, we can derive and upgrade alignment for elements in the chain based on their offset from previous well-aligned elements. This enables vectorization of chains that are longer than the maximum vector length of the target. This algorithm is also robust to the head of the chain not being well-aligned; if we find a better alignment while iterating from the beginning to the end of the chain, we will use that alignment moving forward.
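
A standalone sketch of the derivation (illustrative only: Access, commonPow2, and propagateForward below are made-up stand-ins for the pass's ChainElem, llvm::commonAlignment, and the new helper):

// Each access carries its byte offset from the chain leader and its current
// alignment (a power of two).
#include <algorithm>
#include <cstdint>
#include <vector>

struct Access {
  uint64_t OffsetFromLeader;
  uint64_t Alignment;
};

// Largest power of two dividing both values, i.e. what llvm::commonAlignment
// computes for an Align and a byte offset.
static uint64_t commonPow2(uint64_t AlignA, uint64_t Offset) {
  if (Offset == 0)
    return AlignA;                                 // every power of two divides 0
  return std::min(AlignA, Offset & (~Offset + 1)); // lowest set bit of Offset
}

// Forward pass: remember the best-aligned access seen so far and derive each
// later access's alignment from its distance to that access.
static void propagateForward(std::vector<Access> &Chain) {
  uint64_t BestAlign = Chain.front().Alignment;
  uint64_t BestOffset = Chain.front().OffsetFromLeader;
  for (Access &A : Chain) {
    if (A.Alignment > BestAlign) {
      BestAlign = A.Alignment;
      BestOffset = A.OffsetFromLeader;
    }
    A.Alignment = std::max(
        A.Alignment, commonPow2(BestAlign, A.OffsetFromLeader - BestOffset));
  }
}

For example, with eight 4-byte accesses at offsets 0..28 and the leader at align 16 (the shape of the tests below), the access at offset 16 is upgraded from align 4 to commonPow2(16, 16) = 16, which is what lets the second group of four elements be vectorized as well.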

@llvmbot (Member) commented Jun 25, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Drew Kersnar (dakersnar)

Changes

At this point in the vectorization pass, we are guaranteed to have a contiguous chain with defined offsets for each element. Using this information, we can derive and upgrade alignment for elements in the chain based on their offset from previous well-aligned elements. This enables vectorization of chains that are longer than the maximum vector length of the target. This algorithm is also robust to the head of the chain not being well-aligned; if we find a better alignment while iterating from the beginning to the end of the chain, we will use that alignment moving forward.


Full diff: https://github.com/llvm/llvm-project/pull/145733.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+35)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll (+296)
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 89f63c3b66aad..e14a936b764e5 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -343,6 +343,9 @@ class Vectorizer {
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+  /// Propagates the best alignment in a chain of contiguous accesses
+  void propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const;
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -716,6 +719,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
   unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
   unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
 
+  // We know that the accesses are contiguous. Propagate alignment
+  // information so that slices of the chain can still be vectorized.
+  propagateBestAlignmentsInChain(C);
+  LLVM_DEBUG({
+    dbgs() << "LSV: Chain after alignment propagation:\n";
+    dumpChain(C);
+  });
+
   std::vector<Chain> Ret;
   for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
     // Find candidate chains of size not greater than the largest vector reg.
@@ -1634,3 +1645,27 @@ std::optional<APInt> Vectorizer::getConstantOffset(Value *PtrA, Value *PtrB,
         .sextOrTrunc(OrigBitWidth);
   return std::nullopt;
 }
+
+void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
+  ChainElem BestAlignedElem = C[0];
+  Align BestAlignSoFar = getLoadStoreAlignment(C[0].Inst);
+
+  for (const ChainElem &E : C) {
+    Align OrigAlign = getLoadStoreAlignment(E.Inst);
+    if (OrigAlign > BestAlignSoFar) {
+      BestAlignedElem = E;
+      BestAlignSoFar = OrigAlign;
+    }
+
+    APInt OffsetFromBestAlignedElem =
+        E.OffsetFromLeader - BestAlignedElem.OffsetFromLeader;
+    assert(OffsetFromBestAlignedElem.isNonNegative());
+    // commonAlignment is equivalent to a greatest common power-of-two divisor;
+    // it returns the largest power of 2 that divides both A and B.
+    Align NewAlign = commonAlignment(
+        BestAlignSoFar, OffsetFromBestAlignedElem.getLimitedValue());
+    if (NewAlign > OrigAlign)
+      setLoadStoreAlignment(E.Inst, NewAlign);
+  }
+  return;
+}
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
new file mode 100644
index 0000000000000..a1878dc051d99
--- /dev/null
+++ b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
@@ -0,0 +1,296 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes=load-store-vectorizer -S < %s | FileCheck %s
+
+; The IR has the first float3 labeled with align 16, and that 16 should
+; be propagated such that the second set of 4 values
+; can also be vectorized together.
+%struct.float3 = type { float, float, float }
+%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT:    store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    ret void
+;
+  store float 0.000000e+00, ptr %1, align 16
+  %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+  store float 0.000000e+00, ptr %getElem, align 4
+  %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+  store float 0.000000e+00, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+  store float 0.000000e+00, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+  store float 0.000000e+00, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+  store float 0.000000e+00, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+  store i32 0, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+  store i32 0, ptr %getElem13, align 4
+  ret void
+}
+
+define void @testLoad(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT:    [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT:    [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT:    [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT:    [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT:    [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT:    [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT:    [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT:    ret void
+;
+  %l1 = load float, ptr %1, align 16
+  %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+  %l2 = load float, ptr %getElem, align 4
+  %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+  %l3 = load float, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+  %l4 = load float, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+  %l5 = load float, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+  %l6 = load float, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+  %l7 = load i32, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+  %l8 = load i32, ptr %getElem13, align 4
+  ret void
+}
+
+; Also, test without the struct geps, to see if it still works with i8 geps/ptradd
+
+define void @testStorei8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT:    store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    ret void
+;
+  store float 0.000000e+00, ptr %1, align 16
+  %getElem = getelementptr inbounds i8, ptr %1, i64 4
+  store float 0.000000e+00, ptr %getElem, align 4
+  %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+  store float 0.000000e+00, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+  store float 0.000000e+00, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+  store float 0.000000e+00, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+  store float 0.000000e+00, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+  store i32 0, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+  store i32 0, ptr %getElem13, align 4
+  ret void
+}
+
+define void @testLoadi8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT:    [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT:    [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT:    [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT:    [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT:    [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT:    [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT:    [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT:    ret void
+;
+  %l1 = load float, ptr %1, align 16
+  %getElem = getelementptr inbounds i8, ptr %1, i64 4
+  %l2 = load float, ptr %getElem, align 4
+  %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+  %l3 = load float, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+  %l4 = load float, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+  %l5 = load float, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+  %l6 = load float, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+  %l7 = load i32, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+  %l8 = load i32, ptr %getElem13, align 4
+  ret void
+}
+
+
+; This version of the test adjusts the struct to hold two i32s at the beginning,
+; but still assumes that the first float3 is 16 aligned. If the alignment
+; propagation works correctly, it should be able to load this struct in three
+; loads: a 2x32, a 4x32, and a 4x32. Without the alignment propagation, the last
+; 4x32 will instead be a 2x32 and a 2x32
+%struct.S2 = type { i32, i32, %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT:    [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT:    store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT:    store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    ret void
+;
+  store i32 0, ptr %1, align 8
+  %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+  store i32 0, ptr %getElem, align 4
+  %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+  store float 0.000000e+00, ptr %getElem1, align 16
+  %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+  store float 0.000000e+00, ptr %getElem2, align 4
+  %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+  store float 0.000000e+00, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+  store float 0.000000e+00, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+  store float 0.000000e+00, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+  store float 0.000000e+00, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+  store i32 0, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+  store i32 0, ptr %getElem13, align 4
+  ret void
+}
+
+define void @testLoad_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT:    [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT:    [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT:    [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT:    [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT:    [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT:    [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT:    [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT:    [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT:    [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT:    [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT:    [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT:    ret void
+;
+  %l = load i32, ptr %1, align 8
+  %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+  %l2 = load i32, ptr %getElem, align 4
+  %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+  %l3 = load float, ptr %getElem1, align 16
+  %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+  %l4 = load float, ptr %getElem2, align 4
+  %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+  %l5 = load float, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+  %l6 = load float, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+  %l7 = load float, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+  %l8 = load float, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+  %l9 = load i32, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+  %l0 = load i32, ptr %getElem13, align 4
+  ret void
+}
+
+; Also, test without the struct geps, to see if it still works with i8 geps/ptradd
+
+define void @testStorei8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT:    [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT:    store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT:    store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    ret void
+;
+  store i32 0, ptr %1, align 8
+  %getElem = getelementptr inbounds i8, ptr %1, i64 4
+  store i32 0, ptr %getElem, align 4
+  %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+  store float 0.000000e+00, ptr %getElem1, align 16
+  %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+  store float 0.000000e+00, ptr %getElem2, align 4
+  %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+  store float 0.000000e+00, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+  store float 0.000000e+00, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+  store float 0.000000e+00, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+  store float 0.000000e+00, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+  store i32 0, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+  store i32 0, ptr %getElem13, align 4
+  ret void
+}
+
+define void @testLoadi8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT:    [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT:    [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT:    [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT:    [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT:    [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT:    [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT:    [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT:    [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT:    [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT:    [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT:    [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT:    ret void
+;
+  %l = load i32, ptr %1, align 8
+  %getElem = getelementptr inbounds i8, ptr %1, i64 4
+  %l2 = load i32, ptr %getElem, align 4
+  %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+  %l3 = load float, ptr %getElem1, align 16
+  %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+  %l4 = load float, ptr %getElem2, align 4
+  %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+  %l5 = load float, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+  %l6 = load float, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+  %l7 = load float, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+  %l8 = load float, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+  %l9 = load i32, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+  %l0 = load i32, ptr %getElem13, align 4
+  ret void
+}

@AlexMaclean (Member) left a comment

Nice

Comment on lines -883 to -890
// If this is a load/store of an alloca, we might have upgraded the alloca's
// alignment earlier. Get the new alignment.
if (AS == DL.getAllocaAddrSpace()) {
Alignment = std::max(
Alignment,
getOrEnforceKnownAlignment(getLoadStorePointerOperand(C[0].Inst),
MaybeAlign(), DL, C[0].Inst, nullptr, &DT));
}
@dakersnar (Contributor, author) commented Jun 25, 2025

Regarding this change: I don't totally understand why we delay the upgrade of load/store alignment, but I'd imagine it is preferred to minimize IR changes before vectorization is certain.

If this change gets accepted, we are ok with eagerly upgrading load/store alignment. So, for consistency, we should also be ok with eagerly doing so for the existing alloca alignment upgrade optimization. Hence, I am bundling this simplification with this change, which lets us remove this handling in vectorizeChain by instead calling setLoadStoreAlignment on line 837.

There is one test case change that results from this, in massive_indirection.ll. This is not a regression, as the target supports unaligned loads. The upgrade from alignment of 8 to an alignment of 4294967296 that happened with this call to getOrEnforceKnownAlignment is not the intended purpose of this block of code, merely an unintended side effect. This is evidenced by the fact that there are no allocas in the test case.

@AlexMaclean (Member) left a comment

LGTM, but please wait for review by someone with more LSV experience before landing.

@dakersnar (Contributor, author)

Adding @michalpaszkowski to reviewers since I saw you reviewed a recently merged LSV change. If you can think of anyone else who would be a better fit to review this let me know :)

%l7 = load i32, ptr %getElem12, align 8
%getElem13 = getelementptr inbounds i8, ptr %1, i64 28
%l8 = load i32, ptr %getElem13, align 4
ret void
Contributor

Can you use more representative tests with some uses? As is, these all optimize to a memset or a no-op.

@dakersnar (Contributor, author)

Slight pushback: my understanding is that lit tests are most useful when they are minimal reproducers of the problem the optimization is targeting. Adding uses would not really change the nature of this optimization. Tests like llvm/test/Transforms/LoadStoreVectorizer/NVPTX/vectorize_i1.ll follow this thinking.

If you think it would be better, I could combine each pair of load and store tests into individual tests, storing the result of the loads. Other LSV tests use that pattern a lot.

@@ -1634,3 +1638,32 @@ std::optional<APInt> Vectorizer::getConstantOffset(Value *PtrA, Value *PtrB,
.sextOrTrunc(OrigBitWidth);
return std::nullopt;
}

void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
auto PropagateAlignments = [](auto ChainIt) {
Contributor

Make this a proper helper function instead of a lambda

@dakersnar (Contributor, author)

Hmm, I can't figure out a way to have this take in both the iterator and the reverse iterator without it being a templated helper function. I was leaning towards this lambda approach because it is only used in this one function; do you think a separate helper would still be preferred even if we need to use a template? Open to ideas.
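
For concreteness, a minimal sketch of the templated alternative (assumptions: Access stands in for ChainElem, std::views::reverse stands in for llvm::reverse, and the alignment update is inlined rather than calling get/setLoadStoreAlignment; requires C++20):

#include <algorithm>
#include <cstdint>
#include <ranges>
#include <vector>

struct Access {
  uint64_t OffsetFromLeader;
  uint64_t Alignment; // power of two
};

// Templated over the range type so one body serves both iteration orders.
template <typename RangeT> static void propagateAlignmentsOver(RangeT &&Chain) {
  const Access *Best = nullptr;
  for (Access &A : Chain) {
    if (!Best || A.Alignment > Best->Alignment)
      Best = &A;
    // Distance to the best-aligned element seen so far in this direction.
    uint64_t Dist = A.OffsetFromLeader > Best->OffsetFromLeader
                        ? A.OffsetFromLeader - Best->OffsetFromLeader
                        : Best->OffsetFromLeader - A.OffsetFromLeader;
    // Largest power of two dividing both Best->Alignment and Dist.
    uint64_t Common = Dist == 0 ? Best->Alignment
                                : std::min(Best->Alignment, Dist & (~Dist + 1));
    A.Alignment = std::max(A.Alignment, Common);
  }
}

static void propagateBestAlignments(std::vector<Access> &Chain) {
  propagateAlignmentsOver(Chain);                      // forward
  propagateAlignmentsOver(std::views::reverse(Chain)); // backward
}

The trade-off is the one raised above: the template makes it a standalone helper, but it is still effectively single-use and now lives outside its only caller.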

Comment on lines +4 to +6
; The IR has the first float3 labeled with align 16, and that 16 should
; be propagated such that the second set of 4 values
; can also be vectorized together.
Contributor

Is this papering over a missed middle-end optimization? The middle end should have inferred the pointer argument to align 16, and then each successive access should already have a refined alignment derived from that.

@dakersnar (Contributor, author)

I think this is nice to have for a variety of edge cases, but the specific motivator for this change is based on InstCombine's handling of nested structs. You are correct that if loads/stores are unpacked from aligned loads/stores of aggregates, alignment will be propagated to the unpacked elements. However, with nested structs, InstCombine unpacks one layer at a time, losing alignment context in between passes over the worklist.

Before IC:

; RUN: opt < %s -passes=instcombine -S | FileCheck %s
%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
%struct.float3 = type { float, float, float }

define void @_Z7init_s1P2S1S0_(ptr noundef %v, ptr noundef %vout) {
  %v.addr = alloca ptr, align 8
  %vout.addr = alloca ptr, align 8
  store ptr %v, ptr %v.addr, align 8
  store ptr %vout, ptr %vout.addr, align 8
  %tmp = load ptr, ptr %vout.addr, align 8
  %tmp1 = load ptr, ptr %v.addr, align 8
  %1 = load %struct.S1, ptr %tmp1, align 16
  store %struct.S1 %1, ptr %tmp, align 16
  ret void
}

After IC:

define void @_Z7init_s1P2S1S0_(ptr noundef %v, ptr noundef %vout) {
  %.unpack.unpack = load float, ptr %v, align 16
  %.unpack.elt7 = getelementptr inbounds nuw i8, ptr %v, i64 4
  %.unpack.unpack8 = load float, ptr %.unpack.elt7, align 4
  %.unpack.elt9 = getelementptr inbounds nuw i8, ptr %v, i64 8
  %.unpack.unpack10 = load float, ptr %.unpack.elt9, align 8
  %.elt1 = getelementptr inbounds nuw i8, ptr %v, i64 12
  %.unpack2.unpack = load float, ptr %.elt1, align 4
  %.unpack2.elt12 = getelementptr inbounds nuw i8, ptr %v, i64 16
  %.unpack2.unpack13 = load float, ptr %.unpack2.elt12, align 4 ; <----------- this should be align 16
  %.unpack2.elt14 = getelementptr inbounds nuw i8, ptr %v, i64 20
  %.unpack2.unpack15 = load float, ptr %.unpack2.elt14, align 4
  %.elt3 = getelementptr inbounds nuw i8, ptr %v, i64 24
  %.unpack4 = load i32, ptr %.elt3, align 8
  %.elt5 = getelementptr inbounds nuw i8, ptr %v, i64 28
  %.unpack6 = load i32, ptr %.elt5, align 4
  store float %.unpack.unpack, ptr %vout, align 16
  %vout.repack23 = getelementptr inbounds nuw i8, ptr %vout, i64 4
  store float %.unpack.unpack8, ptr %vout.repack23, align 4
  %vout.repack25 = getelementptr inbounds nuw i8, ptr %vout, i64 8
  store float %.unpack.unpack10, ptr %vout.repack25, align 8
  %vout.repack17 = getelementptr inbounds nuw i8, ptr %vout, i64 12
  store float %.unpack2.unpack, ptr %vout.repack17, align 4
  %vout.repack17.repack27 = getelementptr inbounds nuw i8, ptr %vout, i64 16
  store float %.unpack2.unpack13, ptr %vout.repack17.repack27, align 4
  %vout.repack17.repack29 = getelementptr inbounds nuw i8, ptr %vout, i64 20
  store float %.unpack2.unpack15, ptr %vout.repack17.repack29, align 4
  %vout.repack19 = getelementptr inbounds nuw i8, ptr %vout, i64 24
  store i32 %.unpack4, ptr %vout.repack19, align 8
  %vout.repack21 = getelementptr inbounds nuw i8, ptr %vout, i64 28
  store i32 %.unpack6, ptr %vout.repack21, align 4
  ret void
}

To visualize what's happening under the hood, InstCombine is unpacking the load in stages like this:
%1 = load %struct.S1, ptr %tmp1, align 16
->
load struct.float3 align 16
load struct.float3 align 4
load i32 align 8
load i32 align 4
->
load float align 16
load float align 4
load float align 8
load float align 4
load float align 4
load float align 4
load i32 align 8
load i32 align 4

Comment on lines +1667 to +1668
PropagateAlignments(C);
PropagateAlignments(reverse(C));
Contributor

Why does this need to go through twice? I'd expect each load and store to have started with an optimally computed alignment.

@dakersnar (Contributor, author)

test_forward_and_reverse in the test file demonstrates why going backwards could theoretically be useful. If you have a well-aligned element later in the chain, propagating the alignment backwards could improve earlier elements in the chain.

I haven't thought of a specific end-to-end test that would trigger this, but I think it can't hurt to include.
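
A small numeric illustration of that scenario (hypothetical values, not taken from test_forward_and_reverse; the arrays below just model the offsets and alignments of 4-byte accesses):

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  // Chain of 4-byte accesses: byte offset from the leader and declared align.
  uint64_t Offset[] = {0, 4, 8, 12, 16};
  uint64_t Align[] = {4, 4, 4, 4, 16};
  // A forward-only pass leaves the head at align 4, since nothing earlier in
  // the chain is better aligned. Walking in reverse from the align-16 element
  // at offset 16 upgrades the head: the largest power of two dividing both 16
  // (the best alignment) and 16 (the distance) is 16.
  uint64_t Best = Align[4], BestOff = Offset[4];
  for (int i = 4; i >= 0; --i) {
    if (Align[i] > Best) {
      Best = Align[i];
      BestOff = Offset[i];
    }
    uint64_t Dist = BestOff - Offset[i];
    uint64_t Common = Dist == 0 ? Best : std::min(Best, Dist & (~Dist + 1));
    Align[i] = std::max(Align[i], Common);
  }
  for (int i = 0; i < 5; ++i) // prints aligns 16, 4, 8, 4, 16
    std::printf("offset %2llu -> align %llu\n", (unsigned long long)Offset[i],
                (unsigned long long)Align[i]);
}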
