Add small bias towards bit set encoding. #14155

Merged (1 commit) on Jan 23, 2025

jpountz (Contributor) commented Jan 21, 2025

Currently, blocks of postings get encoded as a bit set instead of packed deltas (FOR) whenever the bit set is more storage-efficient. However, the bit set is considerably more CPU-efficient at search time, so this PR introduces a small bias towards the bit set encoding: it is now used as soon as it is more storage-efficient than FOR with the next number of bits per value.

The impact on storage efficiency for the Wikipedia dataset is negligible: `.doc` files grow by 0.15%, and `.doc` files don't dominate storage requirements anyway (positions do). Meanwhile, some queries get a good speedup.
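
As a rough illustration of the decision described above, the following Java sketch compares the two rules. It is not Lucene's actual code. The class and method names, the block size of 128, the rounding of the bit set to 64-bit words, and the reading of "the next number of bits per value" as one more bit than the block's deltas require are all assumptions made for the example. The real writer may also handle cases that are ignored here.

```java
final class BitSetBiasSketch {
  // Assumed block size, for illustration only; not stated in this PR.
  static final int BLOCK_SIZE = 128;

  /** Bits needed to represent the largest doc ID delta in the block (at least 1). */
  static int bitsRequired(int[] deltas) {
    int or = 0;
    for (int d : deltas) {
      or |= d;
    }
    return Math.max(1, 32 - Integer.numberOfLeadingZeros(or));
  }

  /** Bits used by FOR: every delta is packed with the same number of bits. */
  static long forBits(int bitsPerValue, int count) {
    return (long) bitsPerValue * count;
  }

  /**
   * Bits used by a bit set covering the doc ID range spanned by the block,
   * i.e. the sum of the deltas, rounded up to whole 64-bit words (assumption).
   */
  static long bitSetBits(int[] deltas) {
    long range = 0;
    for (int d : deltas) {
      range += d;
    }
    return ((range + 63) / 64) * 64;
  }

  /** Old rule: bit set only when it is strictly smaller than FOR. */
  static boolean useBitSetUnbiased(int[] deltas) {
    return bitSetBits(deltas) < forBits(bitsRequired(deltas), deltas.length);
  }

  /** Biased rule: bit set when it is smaller than FOR with one more bit per value. */
  static boolean useBitSetBiased(int[] deltas) {
    int nextBitsPerValue = Math.min(Integer.SIZE, bitsRequired(deltas) + 1);
    return bitSetBits(deltas) < forBits(nextBitsPerValue, deltas.length);
  }
}
```

Take a block of 128 deltas where 127 are 2 and one is 3. FOR needs 2 bits per value (256 bits), while the bit set spans 257 doc IDs (320 bits once rounded up to 64-bit words). The unbiased rule picks FOR. The biased rule compares against 3 bits per value (384 bits) and picks the bit set, paying 64 extra bits for that block in exchange for cheaper decoding at search time.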

@jpountz jpountz added this to the 10.2.0 milestone Jan 21, 2025
jpountz (Contributor, Author) commented Jan 21, 2025

luceneutil on wikibigall:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff  p-value
                          Phrase       15.38      (5.7%)       14.98      (6.5%)   -2.6% ( -13% -   10%) 0.182
                      OrHighHigh       54.73      (3.6%)       53.73      (5.1%)   -1.8% ( -10% -    7%) 0.189
                     OrStopWords       35.42      (4.4%)       34.93      (6.3%)   -1.4% ( -11% -    9%) 0.421
              Or2Terms2StopWords      167.99      (3.0%)      166.21      (4.2%)   -1.1% (  -8% -    6%) 0.358
                         Prefix3      138.09      (4.1%)      136.64      (3.2%)   -1.1% (  -8% -    6%) 0.368
                       OrHighMed      204.64      (3.0%)      202.61      (4.8%)   -1.0% (  -8% -    7%) 0.434
                          Fuzzy1       81.81      (2.5%)       81.05      (3.0%)   -0.9% (  -6% -    4%) 0.287
                 AndHighOrMedMed       44.21      (1.4%)       43.80      (2.0%)   -0.9% (  -4% -    2%) 0.089
                          Fuzzy2       77.05      (2.3%)       76.37      (2.9%)   -0.9% (  -5% -    4%) 0.282
                DismaxOrHighHigh      120.45      (2.1%)      119.40      (2.4%)   -0.9% (  -5% -    3%) 0.221
                 FilteredPrefix3      132.19      (4.0%)      131.04      (3.3%)   -0.9% (  -7% -    6%) 0.456
                      TermDTSort      277.77      (8.1%)      275.54      (6.4%)   -0.8% ( -14% -   14%) 0.728
                          OrMany       19.73      (3.0%)       19.58      (3.4%)   -0.7% (  -6% -    5%) 0.473
                      OrHighRare      278.72      (4.3%)      276.77      (4.5%)   -0.7% (  -9% -    8%) 0.614
                 DismaxOrHighMed      176.16      (1.6%)      175.09      (3.0%)   -0.6% (  -5% -    4%) 0.422
                     CountPhrase        4.26      (1.4%)        4.24      (2.8%)   -0.5% (  -4% -    3%) 0.434
                       CountTerm     9427.40      (3.8%)     9388.54      (4.3%)   -0.4% (  -8% -    7%) 0.748
                        Wildcard       78.75      (3.7%)       78.48      (3.2%)   -0.3% (  -6% -    6%) 0.755
                   TermMonthSort     3352.76      (2.8%)     3342.92      (2.7%)   -0.3% (  -5% -    5%) 0.735
                        Or3Terms      174.57      (3.1%)      174.06      (4.3%)   -0.3% (  -7% -    7%) 0.806
             And2Terms2StopWords      165.15      (3.0%)      164.80      (2.6%)   -0.2% (  -5% -    5%) 0.806
               CombinedOrHighMed       72.05      (1.6%)       71.91      (1.5%)   -0.2% (  -3% -    2%) 0.696
              CombinedAndHighMed       55.64      (1.6%)       55.60      (1.7%)   -0.1% (  -3% -    3%) 0.895
              CombinedOrHighHigh       18.79      (1.6%)       18.78      (1.5%)   -0.1% (  -3% -    3%) 0.902
      FilteredOr2Terms2StopWords      150.72      (1.2%)      150.69      (1.3%)   -0.0% (  -2% -    2%) 0.952
             CombinedAndHighHigh       15.07      (1.8%)       15.08      (1.8%)    0.0% (  -3% -    3%) 0.933
                   TermTitleSort      146.88      (1.9%)      146.96      (2.2%)    0.1% (  -3% -    4%) 0.937
                  FilteredPhrase       33.19      (1.7%)       33.23      (1.7%)    0.1% (  -3% -    3%) 0.830
                    CombinedTerm       31.22      (2.0%)       31.27      (2.2%)    0.1% (  -3% -    4%) 0.828
                  FilteredOrMany       16.60      (4.2%)       16.63      (3.8%)    0.2% (  -7% -    8%) 0.894
             CountFilteredOrMany       24.78      (1.6%)       24.85      (5.8%)    0.3% (  -6% -    7%) 0.831
                            Term      469.54      (3.4%)      470.98      (2.8%)    0.3% (  -5% -    6%) 0.754
             CountFilteredPhrase       26.37      (1.9%)       26.45      (2.5%)    0.3% (  -4% -    4%) 0.661
               FilteredOrHighMed      156.74      (1.1%)      157.36      (1.0%)    0.4% (  -1% -    2%) 0.240
                    AndStopWords       32.20      (3.6%)       32.35      (3.8%)    0.5% (  -6% -    8%) 0.692
     FilteredAnd2Terms2StopWords      202.13      (2.0%)      203.08      (1.7%)    0.5% (  -3% -    4%) 0.423
                      DismaxTerm      574.08      (3.4%)      577.75      (2.6%)    0.6% (  -5% -    6%) 0.504
                FilteredOr3Terms      165.82      (1.3%)      166.90      (1.3%)    0.6% (  -1% -    3%) 0.111
                          IntNRQ      112.17     (15.4%)      112.90     (14.0%)    0.7% ( -24% -   35%) 0.888
               FilteredAnd3Terms      193.47      (2.2%)      194.74      (2.3%)    0.7% (  -3% -    5%) 0.357
             FilteredOrStopWords       47.71      (1.3%)       48.03      (1.7%)    0.7% (  -2% -    3%) 0.165
                       And3Terms      175.24      (2.9%)      176.44      (3.0%)    0.7% (  -5% -    6%) 0.459
                     AndHighHigh       43.83      (2.5%)       44.13      (2.8%)    0.7% (  -4% -    6%) 0.413
                  FilteredIntNRQ      110.65     (14.9%)      111.63     (13.9%)    0.9% ( -24% -   34%) 0.845
               TermDayOfYearSort      644.13      (1.5%)      652.34      (1.9%)    1.3% (  -2% -    4%) 0.019
                      AndHighMed      127.96      (2.6%)      129.67      (2.6%)    1.3% (  -3% -    6%) 0.105
              FilteredAndHighMed      131.59      (2.6%)      133.42      (2.5%)    1.4% (  -3% -    6%) 0.084
                        PKLookup      278.30      (2.2%)      282.31      (1.7%)    1.4% (  -2% -    5%) 0.020
                     CountOrMany       28.24      (2.4%)       28.65      (2.2%)    1.5% (  -3% -    6%) 0.045
          CountFilteredOrHighMed      117.39      (0.7%)      119.17      (1.9%)    1.5% (  -1% -    4%) 0.001
            FilteredAndStopWords       55.24      (2.4%)       56.09      (2.0%)    1.5% (  -2% -    6%) 0.026
              FilteredOrHighHigh       68.44      (1.4%)       69.57      (1.4%)    1.6% (  -1% -    4%) 0.000
                    FilteredTerm      159.84      (1.8%)      162.73      (1.5%)    1.8% (  -1% -    5%) 0.001
                AndMedOrHighHigh       65.61      (2.6%)       66.89      (1.8%)    2.0% (  -2% -    6%) 0.005
             FilteredAndHighHigh       69.14      (2.1%)       70.66      (1.6%)    2.2% (  -1% -    6%) 0.000
         CountFilteredOrHighHigh      105.70      (0.8%)      110.12      (3.1%)    4.2% (   0% -    8%) 0.000
                  CountOrHighMed      348.57      (1.8%)      365.98      (2.0%)    5.0% (   1% -    8%) 0.000
                 CountAndHighMed      297.29      (2.0%)      332.78      (2.4%)   11.9% (   7% -   16%) 0.000
                 CountOrHighHigh      282.00      (1.7%)      319.03      (2.5%)   13.1% (   8% -   17%) 0.000
                CountAndHighHigh      296.83      (2.3%)      337.52      (2.6%)   13.7% (   8% -   19%) 0.000
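
The Pct diff column appears consistent with the relative change in mean QPS between the two runs. The Java sketch below reproduces it for two rows of the table. The class and method names are hypothetical, and this is only a reading of the numbers above, not luceneutil's own code.

```java
final class PctDiffSketch {
  /** Relative QPS change of the modified version versus the baseline, in percent. */
  static double pctDiff(double baselineQps, double modifiedQps) {
    return 100.0 * (modifiedQps - baselineQps) / baselineQps;
  }
}
```

For example, `pctDiff(15.38, 14.98)` is about -2.6 (the Phrase row) and `pctDiff(296.83, 337.52)` is about 13.7 (the CountAndHighHigh row).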

gf2121 (Contributor) left a comment

Good trade-off!

@jpountz jpountz merged commit 8487718 into apache:main Jan 23, 2025
5 checks passed
@jpountz jpountz deleted the add_slight_bias_towards_bit_set branch January 23, 2025 13:40
jpountz added a commit that referenced this pull request Jan 23, 2025
mikemccand (Member) left a comment

Nice!

jpountz (Contributor, Author) commented Jan 24, 2025

Nightly benchmarks agree with my local results: https://benchmarks.mikemccandless.com/CountOrHighHigh.html

jpountz (Contributor, Author) commented Jan 24, 2025

And the index size increase is indeed very small: https://benchmarks.mikemccandless.com/indexing.html#FixedIndexSize

rmuir (Member) commented Jan 24, 2025

You need a telescope to see the difference
