Add small bias towards bit set encoding. #14155
Currently, blocks of postings get encoded as a bit set instead of packed deltas (FOR) whenever the bit set is more storage-efficient. However, the bit set is also considerably more CPU-efficient at search time, so this PR introduces a small bias towards the bit set encoding: it is now used as soon as it is more storage-efficient than FOR would be with one more bit per value. The impact on storage efficiency for the Wikipedia dataset is negligible (+0.15% on `.doc` files, and `.doc` files don't dominate storage requirements, positions do), while some queries get a good speedup.
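To make the rule concrete, here is a minimal sketch of the old and new decisions. It is not the code from this PR: the class and method names are hypothetical, and it assumes a bit set spans the block's doc ID range rounded up to whole 64-bit words, while FOR costs `bitsPerValue * blockSize` bits. The real writer likely handles further cases (for example fully consecutive blocks) that are left out here.

```java
// Hypothetical illustration only; names and cost model are assumptions, not Lucene APIs.
final class BitSetBiasSketch {

    /** Bits needed to represent the largest delta in the block (at least 1). */
    static int bitsRequired(int[] deltas) {
        int or = 0;
        for (int d : deltas) {
            or |= d;
        }
        return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
    }

    /** Assumed bit set cost: the doc ID range covered by the block, rounded up to 64-bit words. */
    static long bitSetBits(int[] deltas) {
        long range = 0;
        for (int d : deltas) {
            range += d;
        }
        return ((range + 63) / 64) * 64;
    }

    /** Old rule: bit set only when it is strictly smaller than FOR at the required width. */
    static boolean useBitSetUnbiased(int[] deltas) {
        long forBits = (long) bitsRequired(deltas) * deltas.length;
        return bitSetBits(deltas) < forBits;
    }

    /** New rule: bit set as soon as it is smaller than FOR with one more bit per value. */
    static boolean useBitSetBiased(int[] deltas) {
        int nextBitsPerValue = Math.min(32, bitsRequired(deltas) + 1);
        long forBitsNext = (long) nextBitsPerValue * deltas.length;
        return bitSetBits(deltas) < forBitsNext;
    }

    public static void main(String[] args) {
        // A block of 128 deltas: 127 deltas of 3 and one delta of 4 (3 bits per value).
        int[] deltas = new int[128];
        java.util.Arrays.fill(deltas, 3);
        deltas[0] = 4;
        System.out.println("unbiased picks bit set: " + useBitSetUnbiased(deltas));
        System.out.println("biased picks bit set:   " + useBitSetBiased(deltas));
    }
}
```

Under these assumptions, the block in `main` costs 448 bits as a bit set versus 384 bits with FOR at 3 bits per value, so the old rule keeps FOR. The biased rule compares against FOR at 4 bits per value (512 bits) and switches to the bit set, accepting a slightly larger block in exchange for cheaper decoding at search time.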
luceneutil on wikibigall:
Good trade-off!
Nice!
Nightly benchmarks agree with my local results: https://benchmarks.mikemccandless.com/CountOrHighHigh.html
And the index size increase is indeed very small: https://benchmarks.mikemccandless.com/indexing.html#FixedIndexSize
You need a telescope to see the difference.