[stdlib] Remove redundant `next_power_of_two()` from `math.mojo` and optimize existing function in `bit.mojo` #4278
Signed-off-by: martinvuyk <[email protected]>
I'm interested in the perf difference (if any) between the two implementations. I think we should just remove it from `math` entirely rather than keeping the import around in `mojo/stdlib/src/math/__init__.mojo`.
@JoeLoser I moved the implementation, added a benchmark, and changed the whole PR description.
Could you please post the benchmark results (numbers) as a comment or in the description?
Read the PR description: it's a 2x speedup, 199 ms vs. 100 ms to execute the code. Posting the exact results doesn't make much difference for a doubling in perf in a trivial function.
Shall we write it in the same branchless fashion as I did in #4253?
@soraros that is very interesting. Now I'm wondering whether `val = Scalar[DType.index](int_value)` followed by `select(val <= 1, 1, 1 << (bitwidthof[Int]() - count_leading_zeros(val - 1)))` is faster than using bitwise ops, i.e. `(1 << (bitwidthof[Int]() - count_leading_zeros(val - 1))) & -Int(val > 1) | Int(val <= 1)`. I'm also wondering whether we should implement this function for
I think they ultimately generate the same asm, the In fact, the |
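To check that the two formulations in this exchange agree on their results, here is a small Python model (Python rather than Mojo, so it runs without the Mojo toolchain). It is a sketch under assumptions: `Int` is taken to be 64 bits wide, `count_leading_zeros` is emulated with `int.bit_length`, and the names `clz`, `npot_select`, and `npot_bitwise` are hypothetical. It only compares values; it says nothing about which form is faster or what assembly either one produces.

```python
BITS = 64  # assumption: width of Mojo's Int on a 64-bit target
MASK = (1 << BITS) - 1


def clz(x: int) -> int:
    """Count leading zeros of x viewed as an unsigned BITS-wide word."""
    return BITS - (x & MASK).bit_length()


def npot_select(val: int) -> int:
    """Select form: return 1 for val <= 1, otherwise the shifted value."""
    shifted = (1 << (BITS - clz(val - 1))) & MASK
    return 1 if val <= 1 else shifted


def npot_bitwise(val: int) -> int:
    """Mask form: -(val > 1) is all ones or zero, so no select is needed."""
    shifted = (1 << (BITS - clz(val - 1))) & MASK
    # `&` binds tighter than `|`, matching the expression in the comment.
    return shifted & -int(val > 1) | int(val <= 1)


if __name__ == "__main__":
    for v in (0, 1, 2, 3, 5, 8, 9, 1000):
        print(v, npot_select(v), npot_bitwise(v))
```

When `val > 1` the mask `-int(val > 1)` is `-1` (all ones), so the shifted value passes through unchanged; otherwise the mask is `0` and the trailing `| int(val <= 1)` supplies the result `1`.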
Nothing better than over-engineering the most trivial of functions for 2 hours. Anyway, it was fun trying and learning some new bitwise tricks. @soraros it seems like the best (most stable) option is to use
Totally agree, now it looks very pretty :]
I actually find eyeballing the generated IR and assembly really helpful when comparing different implementations. It can greatly enhance the experience of "over-engineering."
Co-authored-by: soraros <[email protected]> Signed-off-by: martinvuyk <[email protected]>
LGTM. Thanks a bunch for the contribution, and sorry for not searching for the existing implementation. By the way, if you want to dump the asm, you can do:
```mojo
from sys import bitwidthof
from compile import compile_info
from utils._select import _select_register_value as select
from bit import count_leading_zeros


@always_inline
fn next_power_of_two(val: UInt) -> UInt:
    return select(
        val == 0,
        1,
        1 << (bitwidthof[__type_of(val)]() - count_leading_zeros(val - 1)),
    )


def main():
    print(compile_info[next_power_of_two, emission_kind="llvm"]())
    print(compile_info[next_power_of_two, emission_kind="asm"]())
```
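The snippet above guards only `val == 0` rather than `val <= 1`, which works because the argument is unsigned. The following Python model illustrates why. It is a sketch under assumptions, not stdlib code: `UInt` is taken to be 64 bits wide, wraparound is emulated with a mask, and `clz` and `next_power_of_two_uint` are hypothetical names.

```python
BITS = 64  # assumption: width of Mojo's UInt on a 64-bit target
MASK = (1 << BITS) - 1


def clz(x: int) -> int:
    """Count leading zeros of x viewed as an unsigned BITS-wide word."""
    return BITS - (x & MASK).bit_length()


def next_power_of_two_uint(val: int) -> int:
    """Model of the unsigned version for 0 <= val <= 2**63."""
    wrapped = (val - 1) & MASK   # 0 - 1 wraps around to the all-ones word
    shift = BITS - clz(wrapped)  # equals BITS only when val - 1 has its top bit set
    # val == 1 needs no guard: val - 1 == 0, clz == BITS, so the shift is 0.
    # val == 0 does: the wrapped value gives a shift of BITS, which overflows.
    return 1 if val == 0 else 1 << shift


if __name__ == "__main__":
    for v in (0, 1, 2, 3, 1000, 2**63):
        print(v, next_power_of_two_uint(v))
```

Inputs above `2**63` are outside the model's range, since their next power of two does not fit in 64 bits.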
No worries, I also reimplemented the logic and wanted to commit it to the stdlib when I realized
I think I'm going to start doing that more often and not eyeballing benchmarks for small functions so much 😅
…th' into remove-next-power-of-two-from-math
!sync
…9108) [External] [stdlib] [NFC] Fix `_select_register_value` docstrings Split off from #4278 Co-authored-by: martinvuyk <[email protected]> Closes #4287 MODULAR_ORIG_COMMIT_REV_ID: c4cc97159708bec888c996ed8559448b4af4c833
Optimization of `next_power_of_two`
Benchmark results
CPU: i7 7th gen
Measured in milliseconds to do 100 million calculations
I decided to go for `..._int_v3` and `..._uint_v2` (which use `SIMD[bool].select`) since they are the most stable in performance between executions, even though `..._int_v4` and `..._uint_v1` have sometimes given better results.