Skip to content

[CIR] Update Accumulate Bits Algorithm #1688

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 17, 2025

Conversation

Andres-Salamanca
Copy link
Contributor

This PR addresses the feedback from llvm/llvm-project#142041 (comment). Our algorithm for accumulating bitfields has diverged from CodeGen since Clang 19.

There is one key difference: in CIR, we use the function getBitfieldStorageType, which checks whether the bit width of the current accumulation run is a valid fundamental width (i.e., a power of two: 8, 16, 32, 64). If it is, it returns a CIR type of that size otherwise, it returns an array with the closest alignment.

For example, given the following struct:

struct S {
  int a : 4;
  int b : 27;
  int c : 17;
  int d : 2;
  int e : 15;
  unsigned f;
};

The CodeGen output is:

%struct.S = type { i64, i16, i32 }

Whereas the new CIR algorithm produces:

!cir.record<struct "S" {!cir.array<!u8i x 7>, !u16i, !u32i}>

In CIR, the algorithm accumulates up to field d, resulting in 50 accumulated bits. Since 50 is not a fundamental width, the closest alignment is 56 bits, which leads to the type !cir.array<!u8i x 7>. The algorithm stops before accumulating field e because including it would exceed the register size (64), which is not ideal.

At this point, it's unclear whether this divergence from CodeGen represents an improvement. If we wanted to match CodeGen exactly, we would need to replace the use of getBitfieldStorageType with getUIntNType. The difference is that getUIntNType always returns the closest power-of-two integer type instead of falling back to an array when the size is not a fundamental width.

With this change, CIR would match CodeGen's layout exactly. It would require the following small code change:

diff --git a/clang/lib/CIR/CodeGen/CIRRecordLayoutBuilder.cpp b/clang/lib/CIR/CodeGen/CIRRecordLayoutBuilder.cpp
index 7c1802bff077..17538b191738 100644
--- a/clang/lib/CIR/CodeGen/CIRRecordLayoutBuilder.cpp
+++ b/clang/lib/CIR/CodeGen/CIRRecordLayoutBuilder.cpp
@@ -616,7 +616,7 @@ CIRRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
       if (!InstallBest) {
         // Determine if accumulating the just-seen span will create an expensive
         // access unit or not.
-        mlir::Type Type = getBitfieldStorageType(astContext.toBits(AccessSize));
+        mlir::Type Type = getUIntNType(astContext.toBits(AccessSize));
         if (!astContext.getTargetInfo().hasCheapUnalignedBitFieldAccess())
           llvm_unreachable("NYI");

@@ -674,12 +674,12 @@ CIRRecordLowering::accumulateBitFields(RecordDecl::field_iterator Field,
         // remain there after a stable sort.
         mlir::Type Type;
         if (BestClipped) {
-          assert(getSize(getBitfieldStorageType(
+          assert(getSize(getUIntNType(
                      astContext.toBits(AccessSize))) > AccessSize &&
                  "Clipped access need not be clipped");
           Type = getByteArrayType(AccessSize);
         } else {
-          Type = getBitfieldStorageType(astContext.toBits(AccessSize));
+          Type = getUIntNType(astContext.toBits(AccessSize));
           assert(getSize(Type) == AccessSize &&
                  "Unclipped access must be clipped");
         }

You can see a comparison between the two functions https://godbolt.org/z/qjx1MaEWG.

I'm currently unsure whether using one function over the other has performance implications. Regarding the ARM error I mentioned in the meeting: it was an assert I had forgotten to update. It's now fixed sorry for the confusion.

@Andres-Salamanca
Copy link
Contributor Author

@andykaylor
Copy link
Collaborator

It's difficult to compare the implementations without seeing a side-by-side comparison of the resulting LLVM IR for accessing elements in a structure. I suspect that representing the memory using an array of bytes is going to be bad compared to using power-of-two integers because they result in different physical representations on little-endian machines.

Consider this example:

https://godbolt.org/z/3eWjee55f

struct S {
  int a : 4;
  int b : 27;
  int c : 17;
  int d : 2;
  int e : 15;
  unsigned f;
};

S s;

int f() {
  return s.c;
}

The current incubator produces

%struct.S = type { i32, [3 x i8], i16, i32 }

I think we could get similar results with your PR by replacing a and b with a single integer field.

The LLVM IR generated looks like this:

define dso_local range(i32 -65536, 65536) i32 @_Z1fv() local_unnamed_addr #0 !dbg !4 {
  %1 = load i24, ptr getelementptr inbounds nuw (i8, ptr @s, i64 4), align 4, !dbg !7
  %2 = shl i24 %1, 7, !dbg !7
  %3 = ashr exact i24 %2, 7, !dbg !7
  %4 = sext i24 %3 to i32, !dbg !7
  ret i32 %4, !dbg !8
}

That doesn't seem too bad until you consider that c is placed in bytes 0-3, so it crosses the boundary where it is split in little-endian representation. The x86-64 assembly generated looks like this:

_Z1fv:
.L_Z1fv$local:
  movzx eax, byte ptr [rip + .Ls$local+6]
  movzx ecx, word ptr [rip + .Ls$local+4]
  shl ecx, 15
  shl eax, 31
  or eax, ecx
  sar eax, 15
  ret

We had to load two different values and combine them to get the correct result!

By contrast, classic codegen emits this:

%struct.S = type { i32, i24, i16, i32 }

define dso_local noundef i32 @f()() local_unnamed_addr #0 !dbg !30 {
  %1 = load i32, ptr getelementptr inbounds (%struct.S, ptr @s, i64 0, i32 1), align 4, !dbg !33
  %2 = shl i32 %1, 15, !dbg !33
  %3 = ashr exact i32 %2, 15, !dbg !33
  ret i32 %3, !dbg !34
}

And the generated x86-64 assembly is this:

f():
  mov eax, dword ptr [rip + s+4]
  shl eax, 15
  sar eax, 15
  ret

It almost looks like classic codegen is cheating by telling LLVM IR to load an i32 instead of an i24 but it can do that because using an i24 in the structure definition means that the endianness is baked into the structure.

@andykaylor
Copy link
Collaborator

That is to say, I think you should make the change to get this to emit the same structure that classic codegen emits.

@Andres-Salamanca
Copy link
Contributor Author

Yes, @andykaylor 100% you're correct
Previously, with the array, it looked like this in my machine:

_Z1fv:                                  # @_Z1fv
.L_Z1fv$local:
    .type   .L_Z1fv$local,@function
    .cfi_startproc
# %bb.0:
    movzbl  .Ls$local+6(%rip), %eax
    shll    $16, %eax
    movzwl  .Ls$local+4(%rip), %ecx
    orl     %eax, %ecx
                                            
    movl    %ecx, %eax
    shlq    $47, %rax
    sarq    $47, %rax
                                            
    movl    %eax, -4(%rsp)
    movl    -4(%rsp), %eax
    retq

And now, using power-of-two types, it looks much better exactly like CodeGen:

_Z1fv:                                  # @_Z1fv
.L_Z1fv$local:
    .type   .L_Z1fv$local,@function
    .cfi_startproc
# %bb.0:
    movq    .Ls$local(%rip), %rax
    shlq    $15, %rax
    sarq    $47, %rax
                                            
    retq

Thanks for the detailed explanation, Andy ;)

@bcardosolopes
Copy link
Member

Thanks for fixing this. I agree with Andy here too, matching classic codegen is the way to go!

@bcardosolopes bcardosolopes merged commit 6ddc48e into llvm:main Jun 17, 2025
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants