7z: Change "data length" cost to reflect "data length penalty" #5919
Conversation
Force-pushed from 5871699 to a470579
solardiz left a comment:
I'm unsure this is the right approach, but I don't mind.
src/7z_common_plug.c (outdated):

```c
	if (sevenzip_trust_padding)
		cost >>= (pad_size * 8);

	cost += sevenzip_iteration_count(salt);
```
This assumes that packed size (sans early reject) and iterations contribute to the total cost equally, but that's probably a great oversimplification. Maybe there should be an empirically tuned coefficient here at least?
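For illustration only, the kind of weighting being suggested might look like the sketch below. The function and parameter names are hypothetical, and the DATA_WEIGHT value is a placeholder; a real coefficient would have to come from benchmarking both formats.

```c
#include <stdint.h>

/* Hypothetical weighting coefficient; it would need empirical tuning so
 * that the padding-shifted data size and the KDF iteration count end up
 * on a comparable scale. */
#define DATA_WEIGHT 16

/* Sketch only: the real plugin derives these values from its salt struct. */
static uint64_t estimated_cost(uint64_t packed_size, unsigned int pad_size,
                               uint64_t iterations, int trust_padding)
{
	uint64_t data_term = packed_size;

	if (trust_padding)
		data_term >>= (pad_size * 8); /* each padding byte rejects ~255/256 of candidates early */

	return DATA_WEIGHT * data_term + iterations;
}
```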
I first had the size multiplied by the iteration count, before shifting, which was way worse. Adding the iterations after shifting the size makes some sense: for many archives, the result will correctly end up only a little above the iteration count. This is all a simplification anyway, because the inflate method matters too.
The thing is, a huge archive with little or no padding is a huge hit to performance.
Instead of this whole PR, we could change the existing "data length" cost to be a padding-shifted one and rename it to "effective size after early reject" or something better (not sure what).
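A worked example shows why the ordering matters. The numbers here are assumptions of mine, not figures from the PR: a 100 MiB stream, two padding bytes usable for early reject, and a 2^19-iteration KDF.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t packed_size = 100ULL << 20; /* 100 MiB of packed data */
	unsigned int pad_size = 2;           /* padding bytes usable for early reject */
	uint64_t iterations = 1ULL << 19;    /* KDF iteration count */

	/* Shift first, then add: the data term shrinks to 1600, so the total
	 * ends up only a little above the iteration count. */
	uint64_t shift_then_add = (packed_size >> (pad_size * 8)) + iterations;

	/* Multiply by iterations first, then shift: the data size still
	 * dominates by orders of magnitude, i.e. the "way worse" variant. */
	uint64_t mul_then_shift = (packed_size * iterations) >> (pad_size * 8);

	printf("shift-then-add: %llu\nmultiply-then-shift: %llu\n",
	       (unsigned long long)shift_then_add,
	       (unsigned long long)mul_then_shift);
	return 0;
}
```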
I did consider something like 8 + (pad_size * 8) though. Thinking about it again, that's probably better, and it also raises the 4 GB ceiling. I can run some speed tests.
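Assuming the reported cost is a 32-bit value (my assumption, not stated in the PR), the effect of the extra fixed 8-bit shift on that ceiling is easy to see:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* Old scheme with no padding: cost == packed size, so sizes above
	 * 4 GB cannot be represented in 32 bits. */
	uint64_t old_ceiling = (uint64_t)UINT32_MAX;

	/* With a fixed extra 8-bit shift, even an unpadded stream is divided
	 * by 256, pushing the representable ceiling to roughly 1 TB. */
	uint64_t new_ceiling = ((uint64_t)UINT32_MAX + 1) << 8;

	printf("old ceiling: %llu bytes, new ceiling: %llu bytes\n",
	       (unsigned long long)old_ceiling,
	       (unsigned long long)new_ceiling);
	return 0;
}
```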
src/7z_fmt_plug.c (outdated):

```c
		MAX_KEYS_PER_CRYPT,
		FMT_CASE | FMT_8_BIT | FMT_OMP | FMT_UNICODE | FMT_ENC | FMT_DYNA_SALT | FMT_HUGE_INPUT,
		{
			"cost",
```
Maybe call this "estimated cost"? Also in the other format.
Force-pushed from a470579 to 9cef6fb
Force-pushed from b2ebb40 to 01f2b69
Now called "data size penalty" instead, to better reflect this. It is much more usable for actually filtering out hashes: tens or hundreds of megabytes without a single byte of padding are a huge hit to performance (especially for the GPU format), but as soon as we have a byte or two of padding, the impact very quickly approaches zero.
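To make the "quickly approaches zero" point concrete, here is a small illustration under the assumption that the penalty is simply the packed size shifted right by 8 bits per usable padding byte; the 100 MiB figure is an example of mine, not from the PR.

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t packed_size = 100ULL << 20; /* a 100 MiB archive */

	/* 0 padding bytes: penalty 104857600 (the full size)
	 * 1 padding byte:  penalty     409600
	 * 2 padding bytes: penalty       1600 */
	for (unsigned int pad = 0; pad <= 2; pad++)
		printf("%u padding byte(s): penalty %llu\n",
		       pad, (unsigned long long)(packed_size >> (pad * 8)));
	return 0;
}
```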
Force-pushed from 01f2b69 to de4e8e4
After some testing I gave up on this. I did find a formula of […] that helped on CPU, but I took a much simpler approach instead: this commit now merely changes the existing "data length" cost to a "data size penalty", by simply decreasing the data size according to how much padding we have. The old cost was informative but useless for filtering; the new one reflects that a 4 GB data size is no problem at all as long as we have at least two bytes of padding for early rejection, while a 50 MB size with no padding will hit performance significantly.
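My reading of the final approach, as a standalone sketch. The names are hypothetical, the real change lives in the format's tunable-cost reporting, and the 32-bit clamping is my assumption.

```c
#include <stdint.h>

/* "Data size penalty": the packed size reduced according to how much
 * padding is available for early rejection, clamped to a 32-bit range
 * (clamping behavior assumed, not taken from the PR). */
static unsigned int data_size_penalty(uint64_t packed_size,
                                      unsigned int pad_size,
                                      int trust_padding)
{
	uint64_t penalty = packed_size;

	if (trust_padding)
		penalty >>= (pad_size * 8);

	return penalty > 0xffffffffULL ? 0xffffffffU : (unsigned int)penalty;
}
```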
Make this the default as none of the others are very useful other than for information.
I'm still fond of the other costs, so this had to bump FMT_TUNABLE_COSTS to 5. I have wanted to do that before, and I think it can't hurt, even though some "costs" are more important for information than for filtering.