Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stdlib] Remove redundant next_power_of_two() from math.mojo and optimize existing function in bit.mojo #4278

Open
wants to merge 15 commits into
base: main
Choose a base branch
from

Conversation

martinvuyk
Copy link
Contributor

@martinvuyk martinvuyk commented Apr 3, 2025

Optimization of next_power_of_two

Benchmark results

CPU: i7 7th gen
Measured in miliseconds to do 100 million calculations

bench_next_power_of_two_int_v1,204.92954861666666
bench_next_power_of_two_int_v2,117.86505349666666
bench_next_power_of_two_int_v3,111.60907072111111
bench_next_power_of_two_int_v4,105.20949030055556
bench_next_power_of_two_uint_v1,89.40362252435895
bench_next_power_of_two_uint_v2,82.5889815992857
bench_next_power_of_two_uint_v3,85.60044429065934
bench_next_power_of_two_uint_v4,108.61543930999999

I decided to go for ..._int_v3 and ..._uint_v2 (which use SIMD[bool].select) since they are the most stable in performance between executions, even though the ..._int_v4 and ..._uint_v1 have sometimes given better results.

@martinvuyk martinvuyk requested a review from a team as a code owner April 3, 2025 23:20
@martinvuyk martinvuyk changed the title [stdlib] [NFC] Remove redundant next_power_of_two from math.mojo [stdlib] [NFC] Remove redundant next_power_of_two() from math.mojo Apr 3, 2025
Copy link
Collaborator

@JoeLoser JoeLoser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested in perf difference (if any) between the two implementations. I think we should just remove it from math entirely rather than having the import around in mojo/stdlib/src/math/__init__.mojo.

Signed-off-by: martinvuyk <[email protected]>
@martinvuyk martinvuyk changed the title [stdlib] [NFC] Remove redundant next_power_of_two() from math.mojo [stdlib] Move redundant next_power_of_two() from math.mojo and replace existing function in bit.mojo Apr 4, 2025
@martinvuyk
Copy link
Contributor Author

@JoeLoser moved the implementation, added a benchmark and changed the whole PR description.

Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@gryznar
Copy link
Contributor

gryznar commented Apr 4, 2025

@JoeLoser moved the implementation, added a benchmark and changed the whole PR description.

Could you please post benchmark results (numbers) as comment / description?

@martinvuyk
Copy link
Contributor Author

Read the PR description. 2x speedup. 199ms vs 100ms to execute the code, doesn't make much difference to post the exact results for a doubling in perf in a trivial function.

Signed-off-by: martinvuyk <[email protected]>
@soraros
Copy link
Contributor

soraros commented Apr 4, 2025

Shall we write it in the same branchless fashion like I did in #4253?

@martinvuyk
Copy link
Contributor Author

@soraros that is very interesting. Now I'm wondering whether

val = Scalar[DType.index](int_value)
select(val <= 1, 1, 1 << (bitwidthof[Int]() - count_leading_zeros(val - 1)))

is faster or not than using bitwise ops i.e.

(1 << (bitwidthof[Int]() - count_leading_zeros(val - 1))) & -Int(val > 1) | Int(val <= 1)

I'm also wondering whether we should implement this function for UInt as well at some point

@soraros
Copy link
Contributor

soraros commented Apr 4, 2025

I think they ultimately generate the same asm, the select variant is just a bit more readable.

In fact, the select variant is equivalent to using if, though the generated LLVM IR is quite a bit simpler, which is beneficial, since it's an @always_inline function.

@martinvuyk martinvuyk changed the title [stdlib] Move redundant next_power_of_two() from math.mojo and replace existing function in bit.mojo [stdlib] Remove redundant next_power_of_two() from math.mojo and optimize existing function in bit.mojo Apr 4, 2025
@martinvuyk
Copy link
Contributor Author

martinvuyk commented Apr 4, 2025

Nothing better than over-engineering the most trivial of functions for 2 hours. Anyway, it was fun trying and learning some new bitwise tricks. @soraros it seems like the best (most stable) option is to use SIMD[bool].select, and when the compiler knows that it's a UInt being cast to a DType.index the performance improves noticeably (see the numbers in the PR description)

Signed-off-by: martinvuyk <[email protected]>
@soraros

This comment was marked as outdated.

@martinvuyk
Copy link
Contributor Author

I personally find g more readable. Also, there's a constructor from scalar to Int, so you could skip the index().

Totally agree, now it looks very pretty :]

@soraros
Copy link
Contributor

soraros commented Apr 4, 2025

I actually find eyeballing the generated IR and assembly really helpful when comparing different implementations. It can greatly enhance the experience of "over-engineering."

Co-authored-by: soraros <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
Copy link
Contributor

@abduld abduld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks a bunch for the contribution and sorry for not searching for the existing implementation. btw. If you want to dump the asm, then you can do

from sys import bitwidthof
from compile import compile_info
from utils._select import _select_register_value as select
from bit import count_leading_zeros


@always_inline
fn next_power_of_two(val: UInt) -> UInt:
    return select(
        val == 0,
        1,
        1 << (bitwidthof[__type_of(val)]() - count_leading_zeros(val - 1)),
    )


def main():
    print(compile_info[next_power_of_two, emission_kind="llvm"]())

    print(compile_info[next_power_of_two, emission_kind="asm"]())

@martinvuyk
Copy link
Contributor Author

sorry for not searching for the existing implementation.

no worries I also reimplemented the logic and wanted to commit it to the stdlib when I realized bit_ceil was already implemented and in the bit module. I'm not a C++ programmer so I didn't know to look for that function name, that's why I proposed the function name change in #3271

If you want to dump the asm, then you can do

I think I'm going to start doing that more often and not eyeballing benchmarks for small functions so much 😅

Signed-off-by: martinvuyk <[email protected]>
Signed-off-by: martinvuyk <[email protected]>
@JoeLoser
Copy link
Collaborator

JoeLoser commented Apr 5, 2025

!sync

@modularbot modularbot added the imported-internally Signals that a given pull request has been imported internally. label Apr 5, 2025
modularbot pushed a commit that referenced this pull request Apr 6, 2025
…9108)

[External] [stdlib] [NFC] Fix `_select_register_value` docstrings

Split off from #4278

Co-authored-by: martinvuyk <[email protected]>
Closes #4287
MODULAR_ORIG_COMMIT_REV_ID: c4cc97159708bec888c996ed8559448b4af4c833
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
imported-internally Signals that a given pull request has been imported internally.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants