-
Notifications
You must be signed in to change notification settings - Fork 1.1k
ecmult_multi: reduce strauss memory usage by 30% #1761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
The current approach looks clean. What other approaches do you have in mind? |
hebasto
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Concept ACK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On my x86_64 system, this PR reduces the memory allocated on the scratch space from 2224 bytes to 1452 bytes per point.
Please ping me once it’s undrafted.
|
The current approach requires a temporary array
Both options seem to be worse. |
I might suggest a third option: hebasto@5c0d6ee. |
| int wnaf_tmp[256]; | ||
| int ret, i; | ||
|
|
||
| VERIFY_CHECK(2 <= w && w <= 8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| VERIFY_CHECK(2 <= w && w <= 8); | |
| VERIFY_CHECK(2 <= w && w <= 7); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why w = 8 wouldn't work. The documentation of wnaf says
* - each wnaf[i] is either 0, or an odd integer between -(1<<(w-1) - 1) and (1<<(w-1) - 1)
So for w = 8, wnaf[i] is in [-127, 127] which fits in an int8_t.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, yes, you're right. I was getting confused. secp256k1_ecmult_wnaf itself needs w <= 31 (and not 32), if only because it performs a carry << w shift (for int carry) which is certainly UB if int is 32 bits. (In fact, if carry == 1, then even 1 << 31 is UB. This is another edge case that we should fix! Let me add this to the other issue.)
But since your function only copies the results, everything is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, if carry == 1, then even 1 << 31 is UB. This is another edge case that we should fix! Let me add this to the other issue
Oh, great catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the current approach in the PR is good. It may not be elegant to have a tmp array, but it's simple and correct. We'd need to benchmark if the tmp array makes a difference in the end. But I think this PR needs a benchmark in general to make sure that using int8_t does not increase the running time (much).
If we want to avoid it, here's yet another variant: real-or-random@f83731b
It uses a macro to define different variants of secp256k1_ecmult_wnaf parametrized in the output type. The macro is not elegant either, but this variant is better for type safety than just turning secp256k1_ecmult_wnaf into a macro.
In fact, the current secp256k1_ecmult_wnaf needs the unstated and unchecked assumption that int has at least 32 value bits when it VERIFY_CHECKs that w <= 31. In practice, we call it only with WINDOW_A == 5 and WINDOW_G == ECMULT_WINDOW_SIZE where the latter is configurable in the range 2..24.
A consequence of this "bug" is that the code fails on a 16-bit platform if you set ECMULT_WINDOW_SIZE > 16. I don't think we need to support this, but code without unchecked assumptions is bad. So I suggest that we rewrite the function to use int32_t instead of int even if we don't use my macro approach. Alternatively, we could add the assumption that INT_MAX >= INT32_MAX but this forbids 16-bit platforms, and the code seems to work on them in principle; see #792 (comment).
Sorry, I forgot to comment on that option. That's also clean, but it introduces a lot of code complexity. The way I see it:
|
|
Thanks for demonstrating creative alternative solutions :) If it weren't for the layers of indirection, I'd consider @hebasto's approach to be the most elegant. The PR's current approach is just so much simpler. And I ran benchmarks with |
real-or-random
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
utACK 26166c4
This is a draft because I'm not sure about the cleanest way to implement it.