
Rework initialization of constants & class variables #15333

Merged
12 commits merged into crystal-lang:master on Jan 20, 2025

Conversation

@ysbaddaden (Contributor) commented Jan 9, 2025

Follow-up to and closes #15216 by @BlobCodes. Also incorporates the LLVM optimization from master...BlobCodes:crystal:perf/crystal-once-v2.

Similar to the original PR, this changes the flag to have 3 states instead of 2. This led to changing the signature of the __crystal_once[_init] functions, which now take an Int8 (i8) instead of a Bool (i1). In addition, it drops the "once state" that isn't needed anymore.

This requires a new compiler build to benefit from the improvement. The legacy versions of the __crystal_once[_init] methods are still supported by both the stdlib and the compiler to keep forward & backward compatibility (1.15- can build 1.16+ and 1.16+ can build 1.15-).

Unlike the original PR, this one doesn't use any atomics: the mutex already protects the flag, and the mutex itself is explicitly initialized by the main thread before any other thread is started (i.e. no parallelism issues).

A follow-up could leverage ReferenceStorage and .unsafe_construct to inline the Mutex instead of allocating it in the GC heap. Along with #15330, __crystal_once_init could then become allocation-free, which could prove useful for such a core, low-level feature.
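
For reference, a minimal sketch of the mechanism described above. It is a reconstruction for illustration only: the helper names (once_init, once), the Proc type, and the error handling are assumptions and simplifications, not the exact code in this PR.

module Crystal
  # three-state flag stored as an i8, one per lazily-initialized constant/class variable
  enum OnceState : Int8
    Processing    = -1
    Uninitialized =  0
    Initialized   =  1
  end

  @@once_mutex : Mutex?

  # called by the main thread before any other thread is started,
  # so creating the mutex itself needs no synchronization
  def self.once_init : Nil
    @@once_mutex = Mutex.new
  end

  # slow path: reached only when the flag isn't Initialized yet
  def self.once(flag : OnceState*, initializer : Proc(Nil)) : Nil
    @@once_mutex.not_nil!.synchronize do
      case flag.value
      in .uninitialized?
        flag.value = OnceState::Processing
        initializer.call
        flag.value = OnceState::Initialized
      in .processing?
        # reentered from the same initializer: truly recursive
        raise "BUG: recursive initialization" # the real code prints an error and exits
      in .initialized?
        # another thread finished the initialization while we waited on the mutex
      end
    end
  end
end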

@oprypin (Member) commented Jan 9, 2025

This should have a Co-authored-by: attribution somewhere in some commit

@ysbaddaden force-pushed the refactor/crystal-once branch from d1de2eb to ce6e9a2 on January 10, 2025 16:40
@ysbaddaden (Contributor Author)

Rebased to add the co-authored-by to the individual commits 👍

@oprypin (Member) commented Jan 10, 2025

I think it's not registering in the interface. Maybe it has to be on the last line 🤔

@straight-shoota (Member)

FWIW we could also add the co-author attribution in the squash commit when we merge this PR (though we'd need to remember that, so it's best to already add it to the original commits 😅)

@Blacksmoke16 (Member)

I'm pretty sure the easiest way would be to cherry-pick their commits into a branch, then push up your own. If there are commits from more than one author on a branch, GH should automatically add the co-authored-by to the squash commit message for all additional authors.

@ysbaddaden (Contributor Author) commented Jan 10, 2025

I added a commit to only enable the feature from crystal 1.16.0-dev. Let's merge after updating the version in the master branch.

EDIT: why is CI getting completely broken because I changed the version comparison from 1.15.0-dev to 1.16.0-dev?

@ysbaddaden (Contributor Author) commented Jan 10, 2025

While trying to work on #14905, I noticed that this is still concurrency-unsafe because we only enable the Mutex for preview_mt. The example thus fails with "recursive initialization", which is wrong because these are concurrent accesses.
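
A purely hypothetical reproduction of that situation (not the example from #14905; Thread is an internal API, used here only because plain fibers wouldn't run in parallel without preview_mt):

# The constant is initialized lazily, on first access. Without the mutex,
# a second thread can observe the flag in its Processing state and get
# reported as a "recursive initialization", even though the two accesses
# are merely concurrent.
GREETING = String.build do |io|
  1_000.times { io << "hello " } # not-quite-instant work to widen the race window
end

threads = Array.new(2) { Thread.new { GREETING.bytesize } }
threads.each(&.join)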

@ysbaddaden marked this pull request as draft on January 13, 2025 09:30
@ysbaddaden (Contributor Author)

Moving back to draft while I'm working on #14905. I might change the new __crystal_once signature to take a closure_data pointer, so we can pass any proc.

@ysbaddaden (Contributor Author)

Ready again. I won't need to change the fun signatures, so it's 👌

Co-authored-by: David Keller <[email protected]>

Based on the PR by @BlobCodes:
crystal-lang#15216

The performance improvement is two-fold:

1. the use of an i8 instead of an i1 boolean to have 3 states instead
   of 2, which permits quickly detecting recursive calls without an
   array;
2. inline tricks to optimize the fast and slow paths.

Unlike the PR:

1. Doesn't use atomics: it already uses a mutex that guarantees
   acquire/release memory ordering semantics, and __crystal_once_init is only
   ever called in the main thread before any other thread is started.
2. Removes the need for a state maintained by the compiler, yet keeps
   forward and backward compatibility (both signatures are supported).
Co-authored-by: David Keller <[email protected]>

@BlobCodes: I noticed that adding this code prevents LLVM from
re-running the once mechanism multiple times for the same variable.

Modified to avoid undefined behavior when the assumption doesn't hold;
it doubles as a safety net (print error + exit).
Co-authored-by: David Keller <[email protected]>

@BlobCodes: I think it would be better to print the bug message in
`Crystal.once` instead of `__crystal_once` to reduce complexity at the
callsite. The previous unreachable method can then be used in the
inlined `__crystal_once` so LLVM also knows it doesn't have to re-run
the method.

It's now even safe because `Crystal.once` would panic if it failed; it
should already be impossible, but let's err on the safe side.
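
Putting these commit messages together, the entry point has roughly the following shape (a sketch under the same assumptions as above; the body of Crystal.once_unreachable, i.e. however the patch actually tells LLVM the branch is dead, is deliberately left out):

fun __crystal_once(flag : Crystal::OnceState*, initializer : Void*) : Nil
  # fast path, checked at every access to a lazily-initialized constant/class variable
  return if flag.value.initialized?

  # slow path: takes the mutex, runs the initializer, or panics on recursion
  Crystal.once(flag, Proc(Nil).new(initializer, Pointer(Void).null))

  # Crystal.once either left the flag Initialized or never returned, so this
  # branch is dead code; the hint lets LLVM drop repeated __crystal_once calls
  # for the same global within a function
  Crystal.once_unreachable unless flag.value.initialized?
end
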
@ysbaddaden force-pushed the refactor/crystal-once branch from f7e50b7 to 2296069 on January 14, 2025 08:57
@ysbaddaden (Contributor Author) commented Jan 14, 2025

CI is finally all green.

src/crystal/once.cr (outdated review thread, resolved)
src/compiler/crystal/codegen/once.cr (outdated review thread, resolved)
Comment on lines 102 to 105
# tell LLVM that it can optimize away repeated `__crystal_once` calls for
# this global (e.g. repeated access to a constant in a single function);
# this is truly unreachable otherwise `Crystal.once` would have panicked
Crystal.once_unreachable unless flag.value.initialized?

Member:

praise: This is pretty darn clever 👏 Kudos to @BlobCodes

@ysbaddaden (Contributor Author):

I second that: thanks a lot @BlobCodes 🙇

src/crystal/once.cr (outdated review thread, resolved)
Comment on lines +32 to +36
enum OnceState : Int8
Processing = -1
Uninitialized = 0
Initialized = 1
end

@BlobCodes (Contributor) commented Jan 16, 2025:

This is very nit-picky, but in my v2 branch, I replaced this three-state enum with an Int8 so all comparisons are done against 0 (== Initialized becomes > 0).

Comparing with 0 results in fewer (or smaller) assembly instructions on most CPUs.
The only arch I'm really familiar with is RISC-V, so here's an example in RISC-V:

# enum comparison `return if state == Initialized`
  li t0, 1         # load 1 into t0
  bne t0, a0, init # initialize if needed
  ret              # early return (or normal program flow if inlined)
init:
  # stuff

# comparison `return if state <= 0`
  blez a0, init # initialize if needed
  ret           # early-return (or normal program flow if inlined)
init:
  # stuff

Of course this is a micro-optimization, but if this is inlined into every const access, it could be noticeable.
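
In Crystal terms the suggestion amounts to something like this (hypothetical helper names; the patch under review keeps the enum):

# current fast path: equality check against the Initialized member (compare with 1)
def fast_path_enum?(flag : Crystal::OnceState*) : Bool
  flag.value.initialized?
end

# suggested variant: plain Int8 where any positive value means "initialized",
# so the fast path becomes a signed comparison against 0
def fast_path_int8?(flag : Int8*) : Bool
  flag.value > 0
end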

@ysbaddaden (Contributor Author) commented Jan 16, 2025:

EDIT: I fixed the examples below... I stupidly used a UInt8 🤦

On x86_64 the only difference is in the jump instruction:

; flag.value == 1
cmpb   $0x1,(%rdi)
jne    39 <foo+0x9>

; flag.value > 0
cmpb   $0x0,(%rdi)
jle    39 <foo+0x9>

Same for ARM32:

; flag.value == 1
ldrb    r0, [r0]
cmp     r0, #1
moveq   pc, lr

; flag.value > 0
ldrb    r0, [r0]
cmp     r0, #1
movge   pc, lr

And the same for AArch64:

; flag.value == 1
ldrb    w8, [x0]
cmp     w8, #0x1
b.ne    40 <foo+0x10>

; flag.value > 0
ldrb    w8, [x0]
cmp     w8, #0x1
b.lt    40 <foo+0x10>

NOTE: I'm not fluent in the assembly of each arch. I compiled a tiny program with --cross-compile --target=..., then used objdump --disassemble from a cross-toolchain build of binutils to compare the LLVM-generated assembly.

Contributor:

; flag.value > 0
cmpb   $0x0,(%rdi)
je     39 <foo+0x9>

To me that looks like an unsigned comparison. So it only checks x != 0.

@ysbaddaden (Contributor Author):

I just wrote a commit, but didn't push it (yet): I'm wondering if we'd really benefit from the change in practice 🤔

@ysbaddaden (Contributor Author):

I tested again, it's actually a jle!

@ysbaddaden (Contributor Author):

@BlobCodes I ran my tests again and updated my comment above.

We can probably do better in manually written assembly, but LLVM actually generates the same assembly for all 3 architectures. The only difference is in the jump instruction 🤷

@ysbaddaden (Contributor Author):

Let's see in a follow-up if we can squeeze even more performance out of this idea.

@ysbaddaden added this to the 1.16.0 milestone on Jan 17, 2025
@oprypin (Member) commented Jan 17, 2025

Do not forget!! The Co-authored-by needs to be moved to the bottom.

@straight-shoota (Member)

@oprypin: GitHub seems to recognize it. It correctly populates the Co-Authored-By tag for the commit message:
[screenshot: the suggested commit message with the Co-Authored-By tag pre-filled]

@oprypin (Member) commented Jan 17, 2025

@straight-shoota That's not right. The suggested commit message that I get is huge, with all messages concatenated. Maybe you edited it yourself. Try to clear browser caches.

@straight-shoota (Member)

I'm using https://github.com/refined-github/refined-github which is probably adjusting the commit message from the default experience (or it's GitHub's New Merge Experience feature). Anyway, as a result I get the correct attribution pre-filled. So it must be discoverable somehow.

@oprypin (Member) commented Jan 17, 2025

Hmm well in that case it's sad that it completely loses the concatenated commit descriptions. That seems undesirable in general, not only here. Or maybe you consider it a positive, I don't know

@straight-shoota merged commit 8d02c8b into crystal-lang:master on Jan 20, 2025
70 checks passed
@ysbaddaden deleted the refactor/crystal-once branch on January 30, 2025 17:42