Add small bias towards bit set encoding. #14155

jpountz · 2025-01-21T09:26:13Z

Currently, a block of postings is encoded as a bit set instead of packed deltas (FOR) whenever the bit set is more storage-efficient. However, the bit set is considerably more CPU-efficient at search time, so this PR introduces a small bias towards the bit set encoding: it is now used as soon as it is more storage-efficient than FOR with the next number of bits per value (one more bit per value than the block actually requires).

The impact on storage efficiency for the Wikipedia dataset is negligible (+0.15% on `.doc` files, and `.doc` files don't dominate storage requirements; positions do), while some queries get a good speedup.
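To make the decision rule concrete, here is a minimal Java sketch of the comparison described above. It is not the code from this PR: the class name, the method names, the block size of 128, the rounding of the bit set to whole 64-bit words, and the strict `<` comparison are all assumptions made for illustration.

```java
import java.util.Arrays;

// Illustrative sketch only: all names and constants here are hypothetical,
// not taken from the Lucene source.
public class BitSetBiasSketch {
  // Assumed number of doc IDs per postings block.
  static final int BLOCK_SIZE = 128;

  /** Number of bits needed to represent the largest delta (at least 1). */
  static int bitsRequired(int[] deltas) {
    int or = 0;
    for (int d : deltas) {
      or |= d; // OR-ing keeps the highest set bit of any delta
    }
    return Math.max(1, Integer.SIZE - Integer.numberOfLeadingZeros(or));
  }

  /** Bits used by FOR (packed deltas) at the given width. */
  static long forBits(int bitsPerValue) {
    return (long) bitsPerValue * BLOCK_SIZE;
  }

  /** Bits used by a bit set covering the block's doc ID range, rounded up to 64-bit words (assumption). */
  static long bitSetBits(int[] deltas) {
    long range = 0;
    for (int d : deltas) {
      range += d; // the sum of deltas is the doc ID span of the block
    }
    return ((range + 63) / 64) * 64;
  }

  /** Previous rule: bit set only when it is strictly smaller than FOR. */
  static boolean useBitSetUnbiased(int[] deltas) {
    return bitSetBits(deltas) < forBits(bitsRequired(deltas));
  }

  /** Biased rule: bit set when it is smaller than FOR with one more bit per value. */
  static boolean useBitSetBiased(int[] deltas) {
    int nextBitsPerValue = Math.min(Integer.SIZE, bitsRequired(deltas) + 1);
    return bitSetBits(deltas) < forBits(nextBitsPerValue);
  }

  public static void main(String[] args) {
    // Half of the deltas are 2 and half are 3: FOR needs 2 bits per value.
    int[] deltas = new int[BLOCK_SIZE];
    Arrays.fill(deltas, 0, BLOCK_SIZE / 2, 2);
    Arrays.fill(deltas, BLOCK_SIZE / 2, BLOCK_SIZE, 3);
    System.out.println("FOR bits:     " + forBits(bitsRequired(deltas)));
    System.out.println("bit set bits: " + bitSetBits(deltas));
    System.out.println("unbiased picks bit set: " + useBitSetUnbiased(deltas));
    System.out.println("biased picks bit set:   " + useBitSetBiased(deltas));
  }
}
```

In this example the block spans 320 doc IDs, so the bit set costs 320 bits against 256 bits for FOR at 2 bits per value. The unbiased rule keeps FOR, while the biased rule accepts the slightly larger bit set because it is still smaller than FOR at 3 bits per value (384 bits). Under these assumptions a biased block never costs more than one extra bit per doc, which is consistent with the small storage increase reported above.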

jpountz · 2025-01-21T09:33:12Z

luceneutil on wikibigall:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                          Phrase       15.38      (5.7%)       14.98      (6.5%)   -2.6% ( -13% -   10%) 0.182
                      OrHighHigh       54.73      (3.6%)       53.73      (5.1%)   -1.8% ( -10% -    7%) 0.189
                     OrStopWords       35.42      (4.4%)       34.93      (6.3%)   -1.4% ( -11% -    9%) 0.421
              Or2Terms2StopWords      167.99      (3.0%)      166.21      (4.2%)   -1.1% (  -8% -    6%) 0.358
                         Prefix3      138.09      (4.1%)      136.64      (3.2%)   -1.1% (  -8% -    6%) 0.368
                       OrHighMed      204.64      (3.0%)      202.61      (4.8%)   -1.0% (  -8% -    7%) 0.434
                          Fuzzy1       81.81      (2.5%)       81.05      (3.0%)   -0.9% (  -6% -    4%) 0.287
                 AndHighOrMedMed       44.21      (1.4%)       43.80      (2.0%)   -0.9% (  -4% -    2%) 0.089
                          Fuzzy2       77.05      (2.3%)       76.37      (2.9%)   -0.9% (  -5% -    4%) 0.282
                DismaxOrHighHigh      120.45      (2.1%)      119.40      (2.4%)   -0.9% (  -5% -    3%) 0.221
                 FilteredPrefix3      132.19      (4.0%)      131.04      (3.3%)   -0.9% (  -7% -    6%) 0.456
                      TermDTSort      277.77      (8.1%)      275.54      (6.4%)   -0.8% ( -14% -   14%) 0.728
                          OrMany       19.73      (3.0%)       19.58      (3.4%)   -0.7% (  -6% -    5%) 0.473
                      OrHighRare      278.72      (4.3%)      276.77      (4.5%)   -0.7% (  -9% -    8%) 0.614
                 DismaxOrHighMed      176.16      (1.6%)      175.09      (3.0%)   -0.6% (  -5% -    4%) 0.422
                     CountPhrase        4.26      (1.4%)        4.24      (2.8%)   -0.5% (  -4% -    3%) 0.434
                       CountTerm     9427.40      (3.8%)     9388.54      (4.3%)   -0.4% (  -8% -    7%) 0.748
                        Wildcard       78.75      (3.7%)       78.48      (3.2%)   -0.3% (  -6% -    6%) 0.755
                   TermMonthSort     3352.76      (2.8%)     3342.92      (2.7%)   -0.3% (  -5% -    5%) 0.735
                        Or3Terms      174.57      (3.1%)      174.06      (4.3%)   -0.3% (  -7% -    7%) 0.806
             And2Terms2StopWords      165.15      (3.0%)      164.80      (2.6%)   -0.2% (  -5% -    5%) 0.806
               CombinedOrHighMed       72.05      (1.6%)       71.91      (1.5%)   -0.2% (  -3% -    2%) 0.696
              CombinedAndHighMed       55.64      (1.6%)       55.60      (1.7%)   -0.1% (  -3% -    3%) 0.895
              CombinedOrHighHigh       18.79      (1.6%)       18.78      (1.5%)   -0.1% (  -3% -    3%) 0.902
      FilteredOr2Terms2StopWords      150.72      (1.2%)      150.69      (1.3%)   -0.0% (  -2% -    2%) 0.952
             CombinedAndHighHigh       15.07      (1.8%)       15.08      (1.8%)    0.0% (  -3% -    3%) 0.933
                   TermTitleSort      146.88      (1.9%)      146.96      (2.2%)    0.1% (  -3% -    4%) 0.937
                  FilteredPhrase       33.19      (1.7%)       33.23      (1.7%)    0.1% (  -3% -    3%) 0.830
                    CombinedTerm       31.22      (2.0%)       31.27      (2.2%)    0.1% (  -3% -    4%) 0.828
                  FilteredOrMany       16.60      (4.2%)       16.63      (3.8%)    0.2% (  -7% -    8%) 0.894
             CountFilteredOrMany       24.78      (1.6%)       24.85      (5.8%)    0.3% (  -6% -    7%) 0.831
                            Term      469.54      (3.4%)      470.98      (2.8%)    0.3% (  -5% -    6%) 0.754
             CountFilteredPhrase       26.37      (1.9%)       26.45      (2.5%)    0.3% (  -4% -    4%) 0.661
               FilteredOrHighMed      156.74      (1.1%)      157.36      (1.0%)    0.4% (  -1% -    2%) 0.240
                    AndStopWords       32.20      (3.6%)       32.35      (3.8%)    0.5% (  -6% -    8%) 0.692
     FilteredAnd2Terms2StopWords      202.13      (2.0%)      203.08      (1.7%)    0.5% (  -3% -    4%) 0.423
                      DismaxTerm      574.08      (3.4%)      577.75      (2.6%)    0.6% (  -5% -    6%) 0.504
                FilteredOr3Terms      165.82      (1.3%)      166.90      (1.3%)    0.6% (  -1% -    3%) 0.111
                          IntNRQ      112.17     (15.4%)      112.90     (14.0%)    0.7% ( -24% -   35%) 0.888
               FilteredAnd3Terms      193.47      (2.2%)      194.74      (2.3%)    0.7% (  -3% -    5%) 0.357
             FilteredOrStopWords       47.71      (1.3%)       48.03      (1.7%)    0.7% (  -2% -    3%) 0.165
                       And3Terms      175.24      (2.9%)      176.44      (3.0%)    0.7% (  -5% -    6%) 0.459
                     AndHighHigh       43.83      (2.5%)       44.13      (2.8%)    0.7% (  -4% -    6%) 0.413
                  FilteredIntNRQ      110.65     (14.9%)      111.63     (13.9%)    0.9% ( -24% -   34%) 0.845
               TermDayOfYearSort      644.13      (1.5%)      652.34      (1.9%)    1.3% (  -2% -    4%) 0.019
                      AndHighMed      127.96      (2.6%)      129.67      (2.6%)    1.3% (  -3% -    6%) 0.105
              FilteredAndHighMed      131.59      (2.6%)      133.42      (2.5%)    1.4% (  -3% -    6%) 0.084
                        PKLookup      278.30      (2.2%)      282.31      (1.7%)    1.4% (  -2% -    5%) 0.020
                     CountOrMany       28.24      (2.4%)       28.65      (2.2%)    1.5% (  -3% -    6%) 0.045
          CountFilteredOrHighMed      117.39      (0.7%)      119.17      (1.9%)    1.5% (  -1% -    4%) 0.001
            FilteredAndStopWords       55.24      (2.4%)       56.09      (2.0%)    1.5% (  -2% -    6%) 0.026
              FilteredOrHighHigh       68.44      (1.4%)       69.57      (1.4%)    1.6% (  -1% -    4%) 0.000
                    FilteredTerm      159.84      (1.8%)      162.73      (1.5%)    1.8% (  -1% -    5%) 0.001
                AndMedOrHighHigh       65.61      (2.6%)       66.89      (1.8%)    2.0% (  -2% -    6%) 0.005
             FilteredAndHighHigh       69.14      (2.1%)       70.66      (1.6%)    2.2% (  -1% -    6%) 0.000
         CountFilteredOrHighHigh      105.70      (0.8%)      110.12      (3.1%)    4.2% (   0% -    8%) 0.000
                  CountOrHighMed      348.57      (1.8%)      365.98      (2.0%)    5.0% (   1% -    8%) 0.000
                 CountAndHighMed      297.29      (2.0%)      332.78      (2.4%)   11.9% (   7% -   16%) 0.000
                 CountOrHighHigh      282.00      (1.7%)      319.03      (2.5%)   13.1% (   8% -   17%) 0.000
                CountAndHighHigh      296.83      (2.3%)      337.52      (2.6%)   13.7% (   8% -   19%) 0.000

gf2121

Good trade-off!

mikemccand

Nice!

jpountz · 2025-01-24T13:58:17Z

Nightly benchmarks agree with my local results: https://benchmarks.mikemccandless.com/CountOrHighHigh.html

jpountz · 2025-01-24T14:05:53Z

And the index size increase is indeed very small: https://benchmarks.mikemccandless.com/indexing.html#FixedIndexSize

rmuir · 2025-01-24T17:09:24Z

You need a telescope to see the difference

jpountz added this to the 10.2.0 milestone Jan 21, 2025

gf2121 approved these changes Jan 23, 2025

jpountz merged commit 8487718 into apache:main Jan 23, 2025
5 checks passed

jpountz deleted the add_slight_bias_towards_bit_set branch January 23, 2025 13:40

mikemccand reviewed Jan 23, 2025
