[LoadStoreVectorizer] Propagate alignment through contiguous chain #145733

Open · wants to merge 3 commits into main

Conversation

dakersnar (Contributor)

At this point in the vectorization pass, we are guaranteed to have a contiguous chain with defined offsets for each element. Using this information, we can derive and upgrade alignment for elements in the chain based on their offset from previous well-aligned elements. This enables vectorization of chains that are longer than the maximum vector length of the target. This algorithm is also robust to the head of the chain not being well-aligned; if we find a better alignment while iterating from the beginning to the end of the chain, we will use that alignment moving forward.
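
A standalone sketch of the derivation (illustrative only: Access, commonPow2, and propagateForward below are made-up stand-ins for the pass's ChainElem, llvm::commonAlignment, and the new helper):

// Each access carries its byte offset from the chain leader and its current
// alignment (a power of two).
#include <algorithm>
#include <cstdint>
#include <vector>

struct Access {
  uint64_t OffsetFromLeader;
  uint64_t Alignment;
};

// Largest power of two dividing both values, i.e. what llvm::commonAlignment
// computes for an Align and a byte offset.
static uint64_t commonPow2(uint64_t AlignA, uint64_t Offset) {
  if (Offset == 0)
    return AlignA;                                 // every power of two divides 0
  return std::min(AlignA, Offset & (~Offset + 1)); // lowest set bit of Offset
}

// Forward pass: remember the best-aligned access seen so far and derive each
// later access's alignment from its distance to that access.
static void propagateForward(std::vector<Access> &Chain) {
  uint64_t BestAlign = Chain.front().Alignment;
  uint64_t BestOffset = Chain.front().OffsetFromLeader;
  for (Access &A : Chain) {
    if (A.Alignment > BestAlign) {
      BestAlign = A.Alignment;
      BestOffset = A.OffsetFromLeader;
    }
    A.Alignment = std::max(
        A.Alignment, commonPow2(BestAlign, A.OffsetFromLeader - BestOffset));
  }
}

For example, with eight 4-byte accesses at offsets 0..28 and the leader at align 16 (the shape of the tests below), the access at offset 16 is upgraded from align 4 to commonPow2(16, 16) = 16, which is what lets the second group of four elements be vectorized as well.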

@llvmbot (Member) commented Jun 25, 2025

@llvm/pr-subscribers-llvm-transforms

@llvm/pr-subscribers-vectorizers

Author: Drew Kersnar (dakersnar)

Changes

At this point in the vectorization pass, we are guaranteed to have a contiguous chain with defined offsets for each element. Using this information, we can derive and upgrade alignment for elements in the chain based on their offset from previous well-aligned elements. This enables vectorization of chains that are longer than the maximum vector length of the target. This algorithm is also robust to the head of the chain not being well-aligned; if we find a better alignment while iterating from the beginning to the end of the chain, we will use that alignment moving forward.


Full diff: https://github.com/llvm/llvm-project/pull/145733.diff

2 Files Affected:

  • (modified) llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp (+35)
  • (added) llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll (+296)
diff --git a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
index 89f63c3b66aad..e14a936b764e5 100644
--- a/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoadStoreVectorizer.cpp
@@ -343,6 +343,9 @@ class Vectorizer {
   /// Postcondition: For all i, ret[i][0].second == 0, because the first instr
   /// in the chain is the leader, and an instr touches distance 0 from itself.
   std::vector<Chain> gatherChains(ArrayRef<Instruction *> Instrs);
+
+  /// Propagates the best alignment in a chain of contiguous accesses
+  void propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const;
 };
 
 class LoadStoreVectorizerLegacyPass : public FunctionPass {
@@ -716,6 +719,14 @@ std::vector<Chain> Vectorizer::splitChainByAlignment(Chain &C) {
   unsigned AS = getLoadStoreAddressSpace(C[0].Inst);
   unsigned VecRegBytes = TTI.getLoadStoreVecRegBitWidth(AS) / 8;
 
+  // We know that the accesses are contiguous. Propagate alignment
+  // information so that slices of the chain can still be vectorized.
+  propagateBestAlignmentsInChain(C);
+  LLVM_DEBUG({
+    dbgs() << "LSV: Chain after alignment propagation:\n";
+    dumpChain(C);
+  });
+
   std::vector<Chain> Ret;
   for (unsigned CBegin = 0; CBegin < C.size(); ++CBegin) {
     // Find candidate chains of size not greater than the largest vector reg.
@@ -1634,3 +1645,27 @@ std::optional<APInt> Vectorizer::getConstantOffset(Value *PtrA, Value *PtrB,
         .sextOrTrunc(OrigBitWidth);
   return std::nullopt;
 }
+
+void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
+  ChainElem BestAlignedElem = C[0];
+  Align BestAlignSoFar = getLoadStoreAlignment(C[0].Inst);
+
+  for (const ChainElem &E : C) {
+    Align OrigAlign = getLoadStoreAlignment(E.Inst);
+    if (OrigAlign > BestAlignSoFar) {
+      BestAlignedElem = E;
+      BestAlignSoFar = OrigAlign;
+    }
+
+    APInt OffsetFromBestAlignedElem =
+        E.OffsetFromLeader - BestAlignedElem.OffsetFromLeader;
+    assert(OffsetFromBestAlignedElem.isNonNegative());
+    // commonAlignment is equivalent to a greatest common power-of-two divisor;
+    // it returns the largest power of 2 that divides both A and B.
+    Align NewAlign = commonAlignment(
+        BestAlignSoFar, OffsetFromBestAlignedElem.getLimitedValue());
+    if (NewAlign > OrigAlign)
+      setLoadStoreAlignment(E.Inst, NewAlign);
+  }
+  return;
+}
diff --git a/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
new file mode 100644
index 0000000000000..a1878dc051d99
--- /dev/null
+++ b/llvm/test/Transforms/LoadStoreVectorizer/prop-align.ll
@@ -0,0 +1,296 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5
+; RUN: opt -passes=load-store-vectorizer -S < %s | FileCheck %s
+
+; The IR has the first float3 labeled with align 16, and that 16 should
+; be propagated such that the second set of 4 values
+; can also be vectorized together.
+%struct.float3 = type { float, float, float }
+%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT:    store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    ret void
+;
+  store float 0.000000e+00, ptr %1, align 16
+  %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+  store float 0.000000e+00, ptr %getElem, align 4
+  %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+  store float 0.000000e+00, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+  store float 0.000000e+00, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+  store float 0.000000e+00, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+  store float 0.000000e+00, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+  store i32 0, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+  store i32 0, ptr %getElem13, align 4
+  ret void
+}
+
+define void @testLoad(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT:    [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT:    [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT:    [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT:    [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S1:%.*]], ptr [[TMP0]], i64 0, i32 1, i32 1
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT:    [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT:    [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT:    [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT:    ret void
+;
+  %l1 = load float, ptr %1, align 16
+  %getElem = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 1
+  %l2 = load float, ptr %getElem, align 4
+  %getElem8 = getelementptr inbounds %struct.float3, ptr %1, i64 0, i32 2
+  %l3 = load float, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1
+  %l4 = load float, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 1
+  %l5 = load float, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 1, i32 2
+  %l6 = load float, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 2
+  %l7 = load i32, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds %struct.S1, ptr %1, i64 0, i32 3
+  %l8 = load i32, ptr %getElem13, align 4
+  ret void
+}
+
+; Also, test without the struct geps, to see if it still works with i8 geps/ptradd
+
+define void @testStorei8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    store <4 x float> zeroinitializer, ptr [[TMP0]], align 16
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT:    store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    ret void
+;
+  store float 0.000000e+00, ptr %1, align 16
+  %getElem = getelementptr inbounds i8, ptr %1, i64 4
+  store float 0.000000e+00, ptr %getElem, align 4
+  %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+  store float 0.000000e+00, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+  store float 0.000000e+00, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+  store float 0.000000e+00, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+  store float 0.000000e+00, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+  store i32 0, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+  store i32 0, ptr %getElem13, align 4
+  ret void
+}
+
+define void @testLoadi8(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    [[TMP2:%.*]] = load <4 x float>, ptr [[TMP0]], align 16
+; CHECK-NEXT:    [[L11:%.*]] = extractelement <4 x float> [[TMP2]], i32 0
+; CHECK-NEXT:    [[L22:%.*]] = extractelement <4 x float> [[TMP2]], i32 1
+; CHECK-NEXT:    [[L33:%.*]] = extractelement <4 x float> [[TMP2]], i32 2
+; CHECK-NEXT:    [[L44:%.*]] = extractelement <4 x float> [[TMP2]], i32 3
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 16
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    [[L55:%.*]] = extractelement <4 x i32> [[TMP3]], i32 0
+; CHECK-NEXT:    [[TMP4:%.*]] = bitcast i32 [[L55]] to float
+; CHECK-NEXT:    [[L66:%.*]] = extractelement <4 x i32> [[TMP3]], i32 1
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[L66]] to float
+; CHECK-NEXT:    [[L77:%.*]] = extractelement <4 x i32> [[TMP3]], i32 2
+; CHECK-NEXT:    [[L88:%.*]] = extractelement <4 x i32> [[TMP3]], i32 3
+; CHECK-NEXT:    ret void
+;
+  %l1 = load float, ptr %1, align 16
+  %getElem = getelementptr inbounds i8, ptr %1, i64 4
+  %l2 = load float, ptr %getElem, align 4
+  %getElem8 = getelementptr inbounds i8, ptr %1, i64 8
+  %l3 = load float, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds i8, ptr %1, i64 12
+  %l4 = load float, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds i8, ptr %1, i64 16
+  %l5 = load float, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds i8, ptr %1, i64 20
+  %l6 = load float, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds i8, ptr %1, i64 24
+  %l7 = load i32, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds i8, ptr %1, i64 28
+  %l8 = load i32, ptr %getElem13, align 4
+  ret void
+}
+
+
+; This version of the test adjusts the struct to hold two i32s at the beginning,
+; but still assumes that the first float3 is 16 aligned. If the alignment
+; propagation works correctly, it should be able to load this struct in three
+; loads: a 2x32, a 4x32, and a 4x32. Without the alignment propagation, the last
+; 4x32 will instead be a 2x32 and a 2x32
+%struct.S2 = type { i32, i32, %struct.float3, %struct.float3, i32, i32 }
+
+define void @testStore_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStore_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT:    [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT:    store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT:    store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    ret void
+;
+  store i32 0, ptr %1, align 8
+  %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+  store i32 0, ptr %getElem, align 4
+  %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+  store float 0.000000e+00, ptr %getElem1, align 16
+  %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+  store float 0.000000e+00, ptr %getElem2, align 4
+  %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+  store float 0.000000e+00, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+  store float 0.000000e+00, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+  store float 0.000000e+00, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+  store float 0.000000e+00, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+  store i32 0, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+  store i32 0, ptr %getElem13, align 4
+  ret void
+}
+
+define void @testLoad_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoad_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT:    [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT:    [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT:    [[GETELEM1:%.*]] = getelementptr inbounds [[STRUCT_S2:%.*]], ptr [[TMP0]], i64 0, i32 2
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT:    [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT:    [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT:    [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT:    [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds [[STRUCT_S2]], ptr [[TMP0]], i64 0, i32 3, i32 1
+; CHECK-NEXT:    [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT:    [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT:    [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT:    [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT:    ret void
+;
+  %l = load i32, ptr %1, align 8
+  %getElem = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 1
+  %l2 = load i32, ptr %getElem, align 4
+  %getElem1 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2
+  %l3 = load float, ptr %getElem1, align 16
+  %getElem2 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 1
+  %l4 = load float, ptr %getElem2, align 4
+  %getElem8 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 2, i32 2
+  %l5 = load float, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3
+  %l6 = load float, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 1
+  %l7 = load float, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 3, i32 2
+  %l8 = load float, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 4
+  %l9 = load i32, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds %struct.S2, ptr %1, i64 0, i32 5
+  %l0 = load i32, ptr %getElem13, align 4
+  ret void
+}
+
+; Also, test without the struct geps, to see if it still works with i8 geps/ptradd
+
+define void @testStorei8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testStorei8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    store <2 x i32> zeroinitializer, ptr [[TMP0]], align 8
+; CHECK-NEXT:    [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT:    store <4 x float> zeroinitializer, ptr [[GETELEM1]], align 16
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT:    store <4 x i32> zeroinitializer, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    ret void
+;
+  store i32 0, ptr %1, align 8
+  %getElem = getelementptr inbounds i8, ptr %1, i64 4
+  store i32 0, ptr %getElem, align 4
+  %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+  store float 0.000000e+00, ptr %getElem1, align 16
+  %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+  store float 0.000000e+00, ptr %getElem2, align 4
+  %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+  store float 0.000000e+00, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+  store float 0.000000e+00, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+  store float 0.000000e+00, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+  store float 0.000000e+00, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+  store i32 0, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+  store i32 0, ptr %getElem13, align 4
+  ret void
+}
+
+define void @testLoadi8_2(ptr nocapture writeonly %1) {
+; CHECK-LABEL: define void @testLoadi8_2(
+; CHECK-SAME: ptr writeonly captures(none) [[TMP0:%.*]]) {
+; CHECK-NEXT:    [[TMP2:%.*]] = load <2 x i32>, ptr [[TMP0]], align 8
+; CHECK-NEXT:    [[L1:%.*]] = extractelement <2 x i32> [[TMP2]], i32 0
+; CHECK-NEXT:    [[L22:%.*]] = extractelement <2 x i32> [[TMP2]], i32 1
+; CHECK-NEXT:    [[GETELEM1:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 8
+; CHECK-NEXT:    [[TMP3:%.*]] = load <4 x float>, ptr [[GETELEM1]], align 16
+; CHECK-NEXT:    [[L33:%.*]] = extractelement <4 x float> [[TMP3]], i32 0
+; CHECK-NEXT:    [[L44:%.*]] = extractelement <4 x float> [[TMP3]], i32 1
+; CHECK-NEXT:    [[L55:%.*]] = extractelement <4 x float> [[TMP3]], i32 2
+; CHECK-NEXT:    [[L66:%.*]] = extractelement <4 x float> [[TMP3]], i32 3
+; CHECK-NEXT:    [[GETELEM10:%.*]] = getelementptr inbounds i8, ptr [[TMP0]], i64 24
+; CHECK-NEXT:    [[TMP4:%.*]] = load <4 x i32>, ptr [[GETELEM10]], align 16
+; CHECK-NEXT:    [[L77:%.*]] = extractelement <4 x i32> [[TMP4]], i32 0
+; CHECK-NEXT:    [[TMP5:%.*]] = bitcast i32 [[L77]] to float
+; CHECK-NEXT:    [[L88:%.*]] = extractelement <4 x i32> [[TMP4]], i32 1
+; CHECK-NEXT:    [[TMP6:%.*]] = bitcast i32 [[L88]] to float
+; CHECK-NEXT:    [[L99:%.*]] = extractelement <4 x i32> [[TMP4]], i32 2
+; CHECK-NEXT:    [[L010:%.*]] = extractelement <4 x i32> [[TMP4]], i32 3
+; CHECK-NEXT:    ret void
+;
+  %l = load i32, ptr %1, align 8
+  %getElem = getelementptr inbounds i8, ptr %1, i64 4
+  %l2 = load i32, ptr %getElem, align 4
+  %getElem1 = getelementptr inbounds i8, ptr %1, i64 8
+  %l3 = load float, ptr %getElem1, align 16
+  %getElem2 = getelementptr inbounds i8, ptr %1, i64 12
+  %l4 = load float, ptr %getElem2, align 4
+  %getElem8 = getelementptr inbounds i8, ptr %1, i64 16
+  %l5 = load float, ptr %getElem8, align 8
+  %getElem9 = getelementptr inbounds i8, ptr %1, i64 20
+  %l6 = load float, ptr %getElem9, align 4
+  %getElem10 = getelementptr inbounds i8, ptr %1, i64 24
+  %l7 = load float, ptr %getElem10, align 4
+  %getElem11 = getelementptr inbounds i8, ptr %1, i64 28
+  %l8 = load float, ptr %getElem11, align 4
+  %getElem12 = getelementptr inbounds i8, ptr %1, i64 32
+  %l9 = load i32, ptr %getElem12, align 8
+  %getElem13 = getelementptr inbounds i8, ptr %1, i64 36
+  %l0 = load i32, ptr %getElem13, align 4
+  ret void
+}

@AlexMaclean (Member) left a comment

Nice

Comment on lines -883 to -890
// If this is a load/store of an alloca, we might have upgraded the alloca's
// alignment earlier. Get the new alignment.
if (AS == DL.getAllocaAddrSpace()) {
Alignment = std::max(
Alignment,
getOrEnforceKnownAlignment(getLoadStorePointerOperand(C[0].Inst),
MaybeAlign(), DL, C[0].Inst, nullptr, &DT));
}
@dakersnar (Contributor, author) commented Jun 25, 2025

Regarding this change: I don't totally understand why we delay the upgrade of load/store alignment, but I'd imagine it is preferred to minimize IR changes before vectorization is certain.

If this change gets accepted, we are ok with eagerly upgrading load/store alignment. So, for consistency, we should also be ok with eagerly doing so for the existing alloca alignment upgrade optimization. Hence, I am bundling this simplification with this change, which lets us remove this handling in vectorizeChain by instead calling setLoadStoreAlignment on line 837.

There is one test case change that results from this, in massive_indirection.ll. This is not a regression, as the target supports unaligned loads. The upgrade from alignment of 8 to an alignment of 4294967296 that happened with this call to getOrEnforceKnownAlignment is not the intended purpose of this block of code, merely an unintended side effect. This is evidenced by the fact that there are no allocas in the test case.

@AlexMaclean (Member) left a comment

LGTM, but please wait for review by someone with more LSV experience before landing.

@dakersnar (Contributor, author)

Adding @michalpaszkowski to reviewers since I saw you reviewed a recently merged LSV change. If you can think of anyone else who would be a better fit to review this let me know :)

%l7 = load i32, ptr %getElem12, align 8
%getElem13 = getelementptr inbounds i8, ptr %1, i64 28
%l8 = load i32, ptr %getElem13, align 4
ret void
Contributor

Can you use more representative tests with some uses? As is, these all optimize to a memset or a no-op.

@dakersnar (Contributor, author)

Slight pushback: my understanding is that lit tests are most useful when they are minimal reproducers of the problem the optimization is targeting. Adding uses would not really change the nature of this optimization. Tests like llvm/test/Transforms/LoadStoreVectorizer/NVPTX/vectorize_i1.ll follow this thinking.

If you think it would be better, I could combine each pair of load and store tests into individual tests, storing the result of the loads. Other LSV tests use that pattern a lot.

@@ -1634,3 +1638,32 @@ std::optional<APInt> Vectorizer::getConstantOffset(Value *PtrA, Value *PtrB,
.sextOrTrunc(OrigBitWidth);
return std::nullopt;
}

void Vectorizer::propagateBestAlignmentsInChain(ArrayRef<ChainElem> C) const {
auto PropagateAlignments = [](auto ChainIt) {
Contributor

Make this a proper helper function instead of a lambda

@dakersnar (Contributor, author)

Hmm, I can't figure out a way to have this take in both the iterator and the reverse iterator without it being a templated helper function. I was leaning towards this lambda approach because it is only used in this one function; do you think a separate helper would still be preferred even if we need to use a template? Open to ideas.
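
For concreteness, a minimal sketch of the templated alternative (assumptions: Access stands in for ChainElem, std::views::reverse stands in for llvm::reverse, and the alignment update is inlined rather than calling get/setLoadStoreAlignment; requires C++20):

#include <algorithm>
#include <cstdint>
#include <ranges>
#include <vector>

struct Access {
  uint64_t OffsetFromLeader;
  uint64_t Alignment; // power of two
};

// Templated over the range type so one body serves both iteration orders.
template <typename RangeT> static void propagateAlignmentsOver(RangeT &&Chain) {
  const Access *Best = nullptr;
  for (Access &A : Chain) {
    if (!Best || A.Alignment > Best->Alignment)
      Best = &A;
    // Distance to the best-aligned element seen so far in this direction.
    uint64_t Dist = A.OffsetFromLeader > Best->OffsetFromLeader
                        ? A.OffsetFromLeader - Best->OffsetFromLeader
                        : Best->OffsetFromLeader - A.OffsetFromLeader;
    // Largest power of two dividing both Best->Alignment and Dist.
    uint64_t Common = Dist == 0 ? Best->Alignment
                                : std::min(Best->Alignment, Dist & (~Dist + 1));
    A.Alignment = std::max(A.Alignment, Common);
  }
}

static void propagateBestAlignments(std::vector<Access> &Chain) {
  propagateAlignmentsOver(Chain);                      // forward
  propagateAlignmentsOver(std::views::reverse(Chain)); // backward
}

The trade-off is the one raised above: the template makes it a standalone helper, but it is still effectively single-use and now lives outside its only caller.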

Comment on lines +4 to +6
; The IR has the first float3 labeled with align 16, and that 16 should
; be propagated such that the second set of 4 values
; can also be vectorized together.
Contributor

Is this papering over a missed middle-end optimization? The middle end should have inferred the pointer argument to align 16, and then each successive access should already have a refined alignment derived from that.

@dakersnar (Contributor, author)

I think this is nice to have for a variety of edge cases, but the specific motivator for this change is based on InstCombine's handling of nested structs. You are correct that if loads/stores are unpacked from aligned loads/stores of aggregates, alignment will be propagated to the unpacked elements. However, with nested structs, InstCombine unpacks one layer at a time, losing alignment context in between passes over the worklist.

Before IC:

; RUN: opt < %s -passes=instcombine -S | FileCheck %s
%struct.S1 = type { %struct.float3, %struct.float3, i32, i32 }
%struct.float3 = type { float, float, float }

define void @_Z7init_s1P2S1S0_(ptr noundef %v, ptr noundef %vout) {
  %v.addr = alloca ptr, align 8
  %vout.addr = alloca ptr, align 8
  store ptr %v, ptr %v.addr, align 8
  store ptr %vout, ptr %vout.addr, align 8
  %tmp = load ptr, ptr %vout.addr, align 8
  %tmp1 = load ptr, ptr %v.addr, align 8
  %1 = load %struct.S1, ptr %tmp1, align 16
  store %struct.S1 %1, ptr %tmp, align 16
  ret void
}

After IC:

define void @_Z7init_s1P2S1S0_(ptr noundef %v, ptr noundef %vout) {
  %.unpack.unpack = load float, ptr %v, align 16
  %.unpack.elt7 = getelementptr inbounds nuw i8, ptr %v, i64 4
  %.unpack.unpack8 = load float, ptr %.unpack.elt7, align 4
  %.unpack.elt9 = getelementptr inbounds nuw i8, ptr %v, i64 8
  %.unpack.unpack10 = load float, ptr %.unpack.elt9, align 8
  %.elt1 = getelementptr inbounds nuw i8, ptr %v, i64 12
  %.unpack2.unpack = load float, ptr %.elt1, align 4
  %.unpack2.elt12 = getelementptr inbounds nuw i8, ptr %v, i64 16
  %.unpack2.unpack13 = load float, ptr %.unpack2.elt12, align 4 ; <----------- this should be align 16
  %.unpack2.elt14 = getelementptr inbounds nuw i8, ptr %v, i64 20
  %.unpack2.unpack15 = load float, ptr %.unpack2.elt14, align 4
  %.elt3 = getelementptr inbounds nuw i8, ptr %v, i64 24
  %.unpack4 = load i32, ptr %.elt3, align 8
  %.elt5 = getelementptr inbounds nuw i8, ptr %v, i64 28
  %.unpack6 = load i32, ptr %.elt5, align 4
  store float %.unpack.unpack, ptr %vout, align 16
  %vout.repack23 = getelementptr inbounds nuw i8, ptr %vout, i64 4
  store float %.unpack.unpack8, ptr %vout.repack23, align 4
  %vout.repack25 = getelementptr inbounds nuw i8, ptr %vout, i64 8
  store float %.unpack.unpack10, ptr %vout.repack25, align 8
  %vout.repack17 = getelementptr inbounds nuw i8, ptr %vout, i64 12
  store float %.unpack2.unpack, ptr %vout.repack17, align 4
  %vout.repack17.repack27 = getelementptr inbounds nuw i8, ptr %vout, i64 16
  store float %.unpack2.unpack13, ptr %vout.repack17.repack27, align 4
  %vout.repack17.repack29 = getelementptr inbounds nuw i8, ptr %vout, i64 20
  store float %.unpack2.unpack15, ptr %vout.repack17.repack29, align 4
  %vout.repack19 = getelementptr inbounds nuw i8, ptr %vout, i64 24
  store i32 %.unpack4, ptr %vout.repack19, align 8
  %vout.repack21 = getelementptr inbounds nuw i8, ptr %vout, i64 28
  store i32 %.unpack6, ptr %vout.repack21, align 4
  ret void
}

To visualize what's happening under the hood, InstCombine is unpacking the load in stages like this:
%1 = load %struct.S1, ptr %tmp1, align 16
->
load struct.float3 align 16
load struct.float3 align 4
load i32 align 8
load i32 align 4
->
load float align 16
load float align 4
load float align 8
load float align 4
load float align 4
load float align 4
load i32 align 8
load i32 align 4

Comment on lines +1667 to +1668
PropagateAlignments(C);
PropagateAlignments(reverse(C));
Contributor

Why does this need to go through twice? I'd expect each load and store to have started with an optimally computed alignment.

@dakersnar (Contributor, author)

test_forward_and_reverse in the test file demonstrates why going backwards could theoretically be useful. If you have a well-aligned element later in the chain, propagating the alignment backwards could improve earlier elements in the chain.

I haven't thought of a specific end-to-end test that would trigger this, but I think it can't hurt to include.
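
A small numeric illustration of that scenario (hypothetical values, not taken from test_forward_and_reverse; the arrays below just model the offsets and alignments of 4-byte accesses):

#include <algorithm>
#include <cstdint>
#include <cstdio>

int main() {
  // Chain of 4-byte accesses: byte offset from the leader and declared align.
  uint64_t Offset[] = {0, 4, 8, 12, 16};
  uint64_t Align[] = {4, 4, 4, 4, 16};
  // A forward-only pass leaves the head at align 4, since nothing earlier in
  // the chain is better aligned. Walking in reverse from the align-16 element
  // at offset 16 upgrades the head: the largest power of two dividing both 16
  // (the best alignment) and 16 (the distance) is 16.
  uint64_t Best = Align[4], BestOff = Offset[4];
  for (int i = 4; i >= 0; --i) {
    if (Align[i] > Best) {
      Best = Align[i];
      BestOff = Offset[i];
    }
    uint64_t Dist = BestOff - Offset[i];
    uint64_t Common = Dist == 0 ? Best : std::min(Best, Dist & (~Dist + 1));
    Align[i] = std::max(Align[i], Common);
  }
  for (int i = 0; i < 5; ++i) // prints aligns 16, 4, 8, 4, 16
    std::printf("offset %2llu -> align %llu\n", (unsigned long long)Offset[i],
                (unsigned long long)Align[i]);
}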
