-
-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rework initialization of constants & class variables #15333
Rework initialization of constants & class variables #15333
Conversation
This should have a |
d1de2eb
to
ce6e9a2
Compare
Rebased to add the |
I think it's not registering in the interface. Maybe it has to be on the last line 🤔 |
FWIW we could also add the co-author attribution in the squash commit when we merge this PR (though we'd need to remember that, so it's best to already add it to the original commits 😅) |
I'm pretty sure the easiest way would be to cherry pick their commits into a branch, then push up your own. If there are commits from more than 1 author on a branch GH should automatically add the |
I added a commit to only enable the feature from crystal EDIT: why is CI getting completely broken because I changed the version comparison from |
Trying to work on #14905 I notice that this is still concurrency unsafe because we only enable |
Moving back to draft while I'm working on #14905. I might change the new |
Ready again. I won't need change to the |
Co-authored-by: David Keller <[email protected]> Based on the PR by @BlobCodes: crystal-lang#15216 The performance improvement is two-fold: 1. the usage of a i8 instead of an i1 boolean to have 3 states instead of 2, which permits to quickly detect recursive calls without an array; 2. inline tricks to optimize the fast and slow paths. Unlike the PR: 1. Doesn't use atomics: it already uses a mutex that guarantees acquire release memory ordering semantics, and __crystal_once_init is only ever called in the main thread before any other thread is started. 2. Removes the need for a state maintained by the compiler, yet keeps forward and backward compatibility (both signatures are supported).
Co-authored-by: David Keller <[email protected]> @BlobCodes: I noticed that adding this code prevents LLVM from re-running the once mechanism multiple times for the same variable. Modified to avoid an undefined behavior when the assumption doesn't hold that doubles as a safety net (print error + exit).
Co-authored-by: David Keller <[email protected]> @BlobCodes: I think it would be better to print the bug message in `Crystal.once` instead of `__crystal_once` to reduce complexity at the callsite. The previous unreachable method can then be used in the inlined `__crystal_once` so LLVM also knows it doesn't have to re-run the method. It's now even safe because `Crystal.once` would panic if it failed; it should already be impossible, but let's err on the safe side.
f7e50b7
to
2296069
Compare
CI is finally all green. |
src/crystal/once.cr
Outdated
# tell LLVM that it can optimize away repeated `__crystal_once` calls for | ||
# this global (e.g. repeated access to constant in a single funtion); | ||
# this is truly unreachable otherwise `Crystal.once` would have panicked | ||
Crystal.once_unreachable unless flag.value.initialized? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
praise: This is pretty darn clever 👏 Kudos to @BlobCodes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second that: thanks a lot @BlobCodes 🙇
enum OnceState : Int8 | ||
Processing = -1 | ||
Uninitialized = 0 | ||
Initialized = 1 | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very nit-picky, but in my v2 branch, I replaced this trinary enum with a Int8
so all comparisons are done with 0 (== Initialized
becomes > 0
).
Comparing with 0
results in fewer (or smaller) assembly instructions on most CPUs
The only arch I'm really familiar with is risc-v, so here's an example in risc-v:
# enum comparison `return if state == Initialized`
li t0, 1 # load 1 into t0
bne t0, a0, init # initialize if needed
ret # early return (or normal program flow if inlined)
init:
# stuff
# comparison `return if state <= 0`
blez a0, init # initialize if needed
ret # early-return (or normal program flow if inlined)
init:
# stuff
Of course this is micro-optimization, but if this is inlined into every const access, it could be noticable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
EDIT: I fixed the examples below... I stupidly used a UInt8 🤦
On x86_64 the only difference is in the jump instruction:
; flag.value == 1
cmpb $0x1,(%rdi)
jne 39 <foo+0x9>
; flag.value > 0
cmpb $0x0,(%rdi)
jle 39 <foo+0x9>
Same for ARM32:
; flag.value == 1
ldrb r0, [r0]
cmp r0, #1
moveq pc, lr
; flag.value > 0
ldrb r0, [r0]
cmp r0, #1
movge pc, lr
But AArch64 indeed only needs one instruction instead of two for the equality check: And same for AArch64:
; flag.value == 1
ldrb w8, [x0]
cmp w8, #0x1
b.ne 40 <foo+0x10>
; flag.value > 0
ldrb w8, [x0]
cmp w8, #0x1
b.lt 40 <*foo+0x10>
NOTE: I'm not fluent in the assembly of each arch. I compiled a tiny program with --cross-compile --target=...
then used objdump --disassemble
from a crosschain build of binutils
to compare the LLVM generated assembly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
; flag.value > 0
cmpb $0x0,(%rdi)
je 39 <foo+0x9>
To me that looks like an unsigned comparison. So it only checks x != 0
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wrote a commit, but didn't push it (yet): I'm wondering if we'd really benefit from the change in practice 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested again, it's actually a jle
!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@BlobCodes I run my tests again and updated my comment above.
We can probably do better in manually written assembly, but LLVM actually generates the same assembly for all 3 architectures. The only difference stands in the jump instruction 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's see in a follow up if we can squeeze even more performance with this idea.
Do not forget!! Co-authored-by to be moved to the bottom |
@oprypin: GitHub seems to recognize it. Correctly populates the |
@straight-shoota That's not right. The suggested commit message that I get is huge, with all messages concatenated. Maybe you edited it yourself. Try to clear browser caches. |
I'm using https://github.com/refined-github/refined-github which is probably adjusting the commit message from the default experience (or it's GitHub's New Merge Experience feature). Anyway, as a result I get the correct attribution pre-filled. So it must be discoverable somehow. I |
Hmm well in that case it's sad that it completely loses the concatenated commit descriptions. That seems undesirable in general, not only here. Or maybe you consider it a positive, I don't know |
Follow up and closes #15216 by @BlobCodes. Also incorporates the LLVM optimization from master...BlobCodes:crystal:perf/crystal-once-v2.
Similar to the original PR, this changes the
flag
to have 3 states instead of 2. This led to change the signature of the__crystal_once[_init]
functions that now take an Int8 (i8) instead of Bool (i1). In addition it drops the "once state" that isn't needed anymore.This requires a new compiler build to benefit from the improvement. The legacy versions of the
__crystal_once[_init]
methods are still supported by both the stdlib and the compiler to keep both forward & backward compatibility (1.15- can build 1.16+ and 1.16+ can build 1.15-).Unlike the original PR, this one doesn't use any atomics: the mutex already protects the flag, and the mutex itself is explicitly initialized by the main thread before any other thread is started (i.e. no parallelism issues).
A follow-up could leverage
ReferenceStorage
and.unsafe_construct
to inline theMutex
instead of allocating in the GC heap. Along with #15330 then__crystal_once_init
could become allocation free, which could prove useful for such a core/low level feature.