Vectorize bitset to array #14910

Draft: wants to merge 31 commits into base: main

gf2121
Contributor

@gf2121 gf2121 commented Jul 7, 2025

This is a minimal proof of concept for vectorizing the conversion of a bitset into an array, which can be a hot path when postings are encoded as a bitset. This version currently runs only on AVX512, but can be adapted to other instruction sets in the future.

Benchmark                             (bitSetSize)   Mode  Cnt      Score      Error   Units
BitsetToArrayBenchmark.baseline                128  thrpt    5   5477.202 ±   36.920  ops/ms
BitsetToArrayBenchmark.baseline                256  thrpt    5   6197.595 ±   92.064  ops/ms
BitsetToArrayBenchmark.baseline                512  thrpt    5   7121.446 ±  113.840  ops/ms
BitsetToArrayBenchmark.baseline                768  thrpt    5   7361.335 ±  286.118  ops/ms
BitsetToArrayBenchmark.vectorized512           128  thrpt    5  85321.831 ± 1539.445  ops/ms
BitsetToArrayBenchmark.vectorized512           256  thrpt    5  58632.773 ± 1130.691  ops/ms
BitsetToArrayBenchmark.vectorized512           512  thrpt    5  48780.092 ±  958.403  ops/ms
BitsetToArrayBenchmark.vectorized512           768  thrpt    5  29373.799 ±  392.238  ops/ms

github-actions bot commented Jul 7, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@uschindler
Contributor

This cannot be merged without adding this to the java24 part and removing the requires of the incubator module for JMH.

I assume this is only meant for quick checks and stays draft?

@gf2121
Contributor Author

gf2121 commented Jul 7, 2025

Thanks for reminding!

> I assume this is only meant for quick checks and stays draft?

Yes. Once the code is integrated into VectorUtil, the benchmark will call VectorUtil directly and no longer require the incubator module, just like the other benchmarks.

@gf2121
Contributor Author

gf2121 commented Jul 9, 2025

I managed to get some luceneutil data on AVX512

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 FilteredPrefix3       76.86      (3.2%)       76.20      (5.5%)   -0.9% (  -9% -    8%) 0.552
                         Prefix3       81.74      (3.1%)       81.13      (4.9%)   -0.7% (  -8% -    7%) 0.567
                AndMedOrHighHigh       22.83      (1.8%)       22.78      (2.0%)   -0.3% (  -3% -    3%) 0.670
                   TermMonthSort     1184.75      (2.7%)     1182.73      (8.4%)   -0.2% ( -10% -   11%) 0.931
              CombinedOrHighHigh        7.95      (2.8%)        7.94      (2.7%)   -0.2% (  -5% -    5%) 0.844
             And2Terms2StopWords       66.71      (2.5%)       66.61      (4.6%)   -0.1% (  -7% -    7%) 0.903
                          Fuzzy1       33.29      (2.5%)       33.26      (3.4%)   -0.1% (  -5% -    5%) 0.929
             FilteredOrStopWords       11.72      (2.0%)       11.71      (3.5%)   -0.1% (  -5% -    5%) 0.929
                    FilteredTerm       68.60      (1.6%)       68.56      (4.1%)   -0.0% (  -5% -    5%) 0.962
             CountFilteredIntNRQ       24.33      (1.1%)       24.33      (2.4%)   -0.0% (  -3% -    3%) 0.991
                          IntNRQ       55.49      (1.2%)       55.51      (3.1%)    0.0% (  -4% -    4%) 0.959
      FilteredOr2Terms2StopWords       59.21      (2.1%)       59.27      (4.5%)    0.1% (  -6% -    6%) 0.929
                 CountOrHighHigh       63.85      (1.0%)       63.93      (2.4%)    0.1% (  -3% -    3%) 0.827
             CountFilteredPhrase       12.29      (1.6%)       12.31      (1.9%)    0.1% (  -3% -    3%) 0.800
                      AndHighMed       72.10      (2.0%)       72.23      (4.2%)    0.2% (  -5% -    6%) 0.860
                          Phrase        9.76      (2.2%)        9.78      (3.3%)    0.2% (  -5% -    5%) 0.823
                  CountOrHighMed       93.67      (1.3%)       93.87      (2.7%)    0.2% (  -3% -    4%) 0.746
                 CountAndHighMed       90.17      (1.0%)       90.37      (2.2%)    0.2% (  -2% -    3%) 0.690
                      DismaxTerm      331.70      (3.2%)      332.55      (6.0%)    0.3% (  -8% -    9%) 0.867
               FilteredAnd3Terms      105.66      (2.0%)      105.96      (2.9%)    0.3% (  -4% -    5%) 0.717
          CountFilteredOrHighMed       32.69      (1.4%)       32.78      (2.0%)    0.3% (  -3% -    3%) 0.588
                          Fuzzy2       30.47      (2.4%)       30.57      (3.4%)    0.3% (  -5% -    6%) 0.724
              Or2Terms2StopWords       68.51      (2.3%)       68.74      (4.7%)    0.3% (  -6% -    7%) 0.765
             CombinedAndHighHigh        8.07      (2.1%)        8.10      (2.2%)    0.4% (  -3% -    4%) 0.596
                   TermTitleSort       59.34      (2.8%)       59.55      (4.0%)    0.4% (  -6% -    7%) 0.739
                        Wildcard       48.97      (3.6%)       49.16      (4.6%)    0.4% (  -7% -    8%) 0.766
                CountAndHighHigh       63.48      (1.3%)       63.74      (2.2%)    0.4% (  -3% -    3%) 0.472
                      TermDTSort      191.53      (2.0%)      192.32      (5.5%)    0.4% (  -7% -    8%) 0.752
             FilteredAndHighHigh       16.87      (1.3%)       16.94      (2.3%)    0.4% (  -3% -    4%) 0.457
            FilteredAndStopWords       13.70      (1.6%)       13.76      (2.2%)    0.4% (  -3% -    4%) 0.468
         CountFilteredOrHighHigh       27.44      (0.9%)       27.56      (1.7%)    0.4% (  -2% -    3%) 0.305
                         Respell       27.54      (2.1%)       27.66      (2.4%)    0.4% (  -3% -    4%) 0.534
                    CombinedTerm       16.58      (2.9%)       16.66      (3.0%)    0.5% (  -5% -    6%) 0.621
                       OrHighMed       87.71      (2.3%)       88.12      (4.9%)    0.5% (  -6% -    7%) 0.702
                            Term      421.54      (3.5%)      423.51      (5.9%)    0.5% (  -8% -   10%) 0.761
                  FilteredIntNRQ       54.79      (1.9%)       55.09      (2.7%)    0.5% (  -4% -    5%) 0.468
              FilteredOrHighHigh       18.14      (1.8%)       18.24      (3.5%)    0.5% (  -4% -    5%) 0.539
                 DismaxOrHighMed       57.31      (1.9%)       57.65      (5.4%)    0.6% (  -6% -    8%) 0.647
     FilteredAnd2Terms2StopWords       69.42      (1.9%)       69.84      (3.1%)    0.6% (  -4% -    5%) 0.450
               TermDayOfYearSort      317.00      (2.3%)      319.07      (3.9%)    0.7% (  -5% -    6%) 0.515
              FilteredAndHighMed       46.77      (1.5%)       47.11      (2.6%)    0.7% (  -3% -    4%) 0.270
                      OrHighRare      116.92      (4.6%)      117.89      (5.9%)    0.8% (  -9% -   11%) 0.620
                 AndHighOrMedMed       21.55      (2.1%)       21.74      (1.9%)    0.9% (  -3% -    5%) 0.172
                FilteredOr3Terms       52.71      (1.9%)       53.18      (4.2%)    0.9% (  -5% -    7%) 0.386
                  FilteredPhrase       12.77      (1.8%)       12.89      (3.2%)    0.9% (  -4% -    6%) 0.262
               FilteredOrHighMed       50.52      (2.4%)       50.99      (4.5%)    0.9% (  -5% -    8%) 0.416
               CombinedOrHighMed       28.48      (2.3%)       28.76      (4.4%)    1.0% (  -5% -    7%) 0.392
                DismaxOrHighHigh       39.77      (1.9%)       40.16      (3.4%)    1.0% (  -4% -    6%) 0.256
                       And3Terms       84.76      (2.1%)       85.63      (3.6%)    1.0% (  -4% -    6%) 0.272
                        Or3Terms       76.27      (1.3%)       77.08      (3.7%)    1.1% (  -3% -    6%) 0.226
              CombinedAndHighMed       29.04      (2.3%)       29.39      (4.1%)    1.2% (  -5% -    7%) 0.252
                        PKLookup       75.51      (1.3%)       76.44      (3.2%)    1.2% (  -3% -    5%) 0.112
                       CountTerm     2847.75      (5.6%)     2897.40      (8.8%)    1.7% ( -11% -   17%) 0.454
                      OrHighHigh       29.50      (2.0%)       30.17      (2.9%)    2.3% (  -2% -    7%) 0.004
                          IntSet      150.21      (4.2%)      154.28      (4.6%)    2.7% (  -5% -   11%) 0.051
                     AndHighHigh       30.04      (1.9%)       31.63      (3.1%)    5.3% (   0% -   10%) 0.000
                    AndStopWords       10.60      (1.9%)       11.83      (2.0%)   11.6% (   7% -   15%) 0.000
                     OrStopWords       11.41      (2.8%)       13.26      (3.0%)   16.2% (  10% -   22%) 0.000

@gf2121
Contributor Author

gf2121 commented Jul 10, 2025

Some more data:

Mac M2

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                       CountTerm    12276.27     (12.1%)    11998.30      (7.3%)   -2.3% ( -19% -   19%) 0.563
                   TermMonthSort     4162.59      (8.8%)     4111.81      (3.4%)   -1.2% ( -12% -   12%) 0.641
                CountAndHighHigh       84.34      (2.6%)       83.54      (2.5%)   -0.9% (  -5% -    4%) 0.342
          CountFilteredOrHighMed       48.75      (4.9%)       48.31      (3.8%)   -0.9% (  -9% -    8%) 0.591
         CountFilteredOrHighHigh       39.65      (4.2%)       39.30      (3.2%)   -0.9% (  -7% -    6%) 0.543

                                                    ...

                      OrHighHigh       48.67     (12.7%)       52.67      (2.7%)    8.2% (  -6% -   27%) 0.023
                    AndStopWords       16.25      (9.7%)       17.63      (4.3%)    8.5% (  -5% -   24%) 0.004
                     AndHighHigh       50.29     (13.5%)       55.32      (2.5%)   10.0% (  -5% -   30%) 0.009
                     OrStopWords       18.18     (10.6%)       20.61      (3.1%)   13.4% (   0% -   30%) 0.000

AVX512 (mentioned above)

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 FilteredPrefix3       76.86      (3.2%)       76.20      (5.5%)   -0.9% (  -9% -    8%) 0.552
                         Prefix3       81.74      (3.1%)       81.13      (4.9%)   -0.7% (  -8% -    7%) 0.567
                AndMedOrHighHigh       22.83      (1.8%)       22.78      (2.0%)   -0.3% (  -3% -    3%) 0.670
                   TermMonthSort     1184.75      (2.7%)     1182.73      (8.4%)   -0.2% ( -10% -   11%) 0.931
              CombinedOrHighHigh        7.95      (2.8%)        7.94      (2.7%)   -0.2% (  -5% -    5%) 0.844
             And2Terms2StopWords       66.71      (2.5%)       66.61      (4.6%)   -0.1% (  -7% -    7%) 0.903

                                                    ...
 
                       CountTerm     2847.75      (5.6%)     2897.40      (8.8%)    1.7% ( -11% -   17%) 0.454
                      OrHighHigh       29.50      (2.0%)       30.17      (2.9%)    2.3% (  -2% -    7%) 0.004
                          IntSet      150.21      (4.2%)      154.28      (4.6%)    2.7% (  -5% -   11%) 0.051
                     AndHighHigh       30.04      (1.9%)       31.63      (3.1%)    5.3% (   0% -   10%) 0.000
                    AndStopWords       10.60      (1.9%)       11.83      (2.0%)   11.6% (   7% -   15%) 0.000
                     OrStopWords       11.41      (2.8%)       13.26      (3.0%)   16.2% (  10% -   22%) 0.000

same AVX512 machine without --add-modules=jdk.incubator.vector

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                 FilteredPrefix3       74.47      (3.7%)       73.32      (4.0%)   -1.5% (  -8% -    6%) 0.210
                         Prefix3       79.35      (3.9%)       78.23      (3.5%)   -1.4% (  -8% -    6%) 0.232
                       CountTerm     2921.10      (6.2%)     2897.44      (7.5%)   -0.8% ( -13% -   13%) 0.708
             And2Terms2StopWords       62.09      (1.7%)       61.80      (2.6%)   -0.5% (  -4% -    3%) 0.482

                                                    ...
 
                      OrHighHigh       27.33      (2.4%)       27.66      (2.0%)    1.2% (  -3% -    5%) 0.092
                      OrHighRare      116.98      (3.3%)      118.89      (2.7%)    1.6% (  -4% -    7%) 0.088
                     AndHighHigh       27.52      (2.0%)       28.02      (1.5%)    1.8% (  -1% -    5%) 0.001
                    AndStopWords       10.60      (3.0%)       11.01      (1.7%)    3.8% (   0% -    8%) 0.000
                     OrStopWords       11.26      (3.6%)       11.97      (2.2%)    6.3% (   0% -   12%) 0.000

@gf2121 gf2121 marked this pull request as ready for review July 10, 2025 16:36

@jpountz
Contributor

jpountz commented Jul 10, 2025

This is very cool and the speedup makes sense to me. When dynamic pruning is enabled, only queries whose leading clauses are dense benefit significantly from this speedup (OrStopWords and AndStopWords). But if you benchmarked exhaustive evaluation, I'm sure we'd see a bigger speedup on all disjunctive queries that have at least one dense postings list.

Like for #14896, I'd like to split this PR in two: one where we merge your scalar improvements, and then this one where we add support for vectorization. By the way, we may want to look into other approaches for the scalar case. Since we only use bit sets in postings when many bits would be set, a linear scan should perform quite efficiently? (foreach (bit in 0..n) { if bitSet.get(bit) out.append(bit); }) I imagine that you used a micro benchmark to come up with your manual unrolling, let's include this micro benchmark in the PR?
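The linear scan suggested above could look roughly like the following sketch. It uses `java.util.BitSet` as a stand-in for Lucene's `FixedBitSet`, and the class name `LinearScanSketch` is hypothetical. It also assumes that each emitted value is `base` plus the bit index, which the thread does not spell out.

```java
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical sketch: java.util.BitSet stands in for Lucene's FixedBitSet.
final class LinearScanSketch {

  /**
   * Appends base + i to out for every set bit i in [from, to).
   * Returns the number of values written.
   */
  static int bitsetToArray(BitSet bits, int from, int to, int base, int[] out) {
    int size = 0;
    for (int i = from; i < to; i++) {
      if (bits.get(i)) { // one branch per bit; cheap when most bits are set
        out[size++] = base + i;
      }
    }
    return size;
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet(64);
    bits.set(1);
    bits.set(3);
    bits.set(40);
    int[] out = new int[8];
    int n = bitsetToArray(bits, 0, 64, 100, out);
    System.out.println(Arrays.toString(Arrays.copyOf(out, n)));
  }
}
```

The branch in the loop is well predicted only when the bit set is very dense or very sparse, which is why measuring it against the unrolled variants in a micro benchmark matters.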

@gf2121
Contributor Author

gf2121 commented Jul 13, 2025

JMH results with the vectorized implementations:

Benchmark                                                (bitCount)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.dense                                      5  thrpt    5   9.583 ± 0.238  ops/us
BitsetToArrayBenchmark.dense                                     10  thrpt    5   6.926 ± 0.151  ops/us
BitsetToArrayBenchmark.dense                                     20  thrpt    5   4.597 ± 0.042  ops/us
BitsetToArrayBenchmark.dense                                     30  thrpt    5   3.420 ± 0.033  ops/us
BitsetToArrayBenchmark.dense                                     40  thrpt    5   3.766 ± 0.013  ops/us
BitsetToArrayBenchmark.dense                                     50  thrpt    5   5.299 ± 0.126  ops/us
BitsetToArrayBenchmark.dense                                     60  thrpt    5   8.991 ± 0.223  ops/us
BitsetToArrayBenchmark.denseBranchLess                            5  thrpt    5  13.520 ± 0.132  ops/us
BitsetToArrayBenchmark.denseBranchLess                           10  thrpt    5  13.440 ± 0.575  ops/us
BitsetToArrayBenchmark.denseBranchLess                           20  thrpt    5  13.521 ± 0.289  ops/us
BitsetToArrayBenchmark.denseBranchLess                           30  thrpt    5  13.488 ± 0.641  ops/us
BitsetToArrayBenchmark.denseBranchLess                           40  thrpt    5  13.501 ± 0.375  ops/us
BitsetToArrayBenchmark.denseBranchLess                           50  thrpt    5  13.555 ± 0.384  ops/us
BitsetToArrayBenchmark.denseBranchLess                           60  thrpt    5  13.524 ± 0.498  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                        5  thrpt    5   8.521 ± 0.120  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       10  thrpt    5   6.315 ± 0.164  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       20  thrpt    5  11.531 ± 0.176  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       30  thrpt    5  11.493 ± 0.255  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       40  thrpt    5  11.535 ± 0.018  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       50  thrpt    5  11.539 ± 0.084  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       60  thrpt    5   9.100 ± 0.017  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                    5  thrpt    5  15.428 ± 0.155  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   10  thrpt    5  15.424 ± 0.282  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   20  thrpt    5  15.375 ± 0.341  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   30  thrpt    5  15.395 ± 0.121  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   40  thrpt    5  15.308 ± 0.407  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   50  thrpt    5  15.322 ± 0.174  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   60  thrpt    5  15.439 ± 0.064  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                   5  thrpt    5  15.795 ± 0.380  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  10  thrpt    5  15.827 ± 0.228  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  20  thrpt    5  15.672 ± 0.991  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  30  thrpt    5  15.789 ± 0.327  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  40  thrpt    5  15.764 ± 0.350  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  50  thrpt    5  15.725 ± 0.393  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  60  thrpt    5  15.868 ± 0.028  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                  5  thrpt    5  25.889 ± 0.471  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 10  thrpt    5  25.975 ± 0.129  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 20  thrpt    5  25.852 ± 0.299  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 30  thrpt    5  25.888 ± 0.371  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 40  thrpt    5  25.708 ± 1.028  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 50  thrpt    5  25.856 ± 0.612  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 60  thrpt    5  25.931 ± 0.144  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512               5  thrpt    5  28.221 ± 0.545  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              10  thrpt    5  28.306 ± 0.209  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              20  thrpt    5  26.827 ± 1.704  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              30  thrpt    5  27.027 ± 0.214  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              40  thrpt    5  26.504 ± 0.909  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              50  thrpt    5  25.725 ± 0.084  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              60  thrpt    5  25.495 ± 1.521  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2           5  thrpt    5   1.137 ± 0.473  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          10  thrpt    5   0.856 ± 0.312  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          20  thrpt    5   0.171 ± 0.091  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          30  thrpt    5   0.159 ± 0.072  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          40  thrpt    5   0.097 ± 0.042  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          50  thrpt    5   0.069 ± 0.021  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          60  thrpt    5   0.068 ± 0.041  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              5  thrpt    5  20.310 ± 0.139  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             10  thrpt    5  20.125 ± 0.352  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             20  thrpt    5  19.961 ± 0.653  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             30  thrpt    5  20.025 ± 1.040  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             40  thrpt    5  20.051 ± 0.556  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             50  thrpt    5  20.128 ± 0.131  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             60  thrpt    5  19.769 ± 2.266  ops/us
BitsetToArrayBenchmark.denseInvert                                5  thrpt    5  19.958 ± 0.355  ops/us
BitsetToArrayBenchmark.denseInvert                               10  thrpt    5  13.497 ± 0.826  ops/us
BitsetToArrayBenchmark.denseInvert                               20  thrpt    5   6.995 ± 0.093  ops/us
BitsetToArrayBenchmark.denseInvert                               30  thrpt    5   4.579 ± 0.035  ops/us
BitsetToArrayBenchmark.denseInvert                               40  thrpt    5   4.447 ± 0.028  ops/us
BitsetToArrayBenchmark.denseInvert                               50  thrpt    5   4.082 ± 0.051  ops/us
BitsetToArrayBenchmark.denseInvert                               60  thrpt    5   6.732 ± 0.145  ops/us
BitsetToArrayBenchmark.forLoop                                    5  thrpt    5  26.332 ± 0.080  ops/us
BitsetToArrayBenchmark.forLoop                                   10  thrpt    5  21.765 ± 0.029  ops/us
BitsetToArrayBenchmark.forLoop                                   20  thrpt    5  15.878 ± 0.247  ops/us
BitsetToArrayBenchmark.forLoop                                   30  thrpt    5  12.606 ± 0.251  ops/us
BitsetToArrayBenchmark.forLoop                                   40  thrpt    5  10.440 ± 0.036  ops/us
BitsetToArrayBenchmark.forLoop                                   50  thrpt    5   8.875 ± 0.164  ops/us
BitsetToArrayBenchmark.forLoop                                   60  thrpt    5   7.735 ± 0.171  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                     5  thrpt    5  26.018 ± 0.586  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    10  thrpt    5  21.031 ± 0.364  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    20  thrpt    5  15.683 ± 0.266  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    30  thrpt    5  12.502 ± 0.056  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    40  thrpt    5  10.330 ± 0.212  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    50  thrpt    5   8.842 ± 0.020  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    60  thrpt    5   7.705 ± 0.172  ops/us
BitsetToArrayBenchmark.hybrid                                     5  thrpt    5  25.588 ± 0.491  ops/us
BitsetToArrayBenchmark.hybrid                                    10  thrpt    5  21.151 ± 0.403  ops/us
BitsetToArrayBenchmark.hybrid                                    20  thrpt    5  15.653 ± 0.263  ops/us
BitsetToArrayBenchmark.hybrid                                    30  thrpt    5  12.431 ± 0.027  ops/us
BitsetToArrayBenchmark.hybrid                                    40  thrpt    5  15.414 ± 0.032  ops/us
BitsetToArrayBenchmark.hybrid                                    50  thrpt    5  15.415 ± 0.065  ops/us
BitsetToArrayBenchmark.hybrid                                    60  thrpt    5  15.188 ± 0.806  ops/us
BitsetToArrayBenchmark.whileLoop                                  5  thrpt    5  29.224 ± 0.503  ops/us
BitsetToArrayBenchmark.whileLoop                                 10  thrpt    5  23.237 ± 0.697  ops/us
BitsetToArrayBenchmark.whileLoop                                 20  thrpt    5  16.777 ± 0.278  ops/us
BitsetToArrayBenchmark.whileLoop                                 30  thrpt    5  13.019 ± 0.213  ops/us
BitsetToArrayBenchmark.whileLoop                                 40  thrpt    5  10.700 ± 0.095  ops/us
BitsetToArrayBenchmark.whileLoop                                 50  thrpt    5   9.047 ± 0.015  ops/us
BitsetToArrayBenchmark.whileLoop                                 60  thrpt    5   7.786 ± 0.224  ops/us
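In the numbers above, the `denseBranchLess*` variants hold roughly constant throughput across bit counts, while the loop-based variants slow down as more bits are set. One scalar way to get that behaviour is to store a candidate value unconditionally and advance the write index by the value of the bit. The sketch below illustrates that general technique only; it is an assumption, not the benchmarked code, and the class name `BranchlessSketch` is hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of a branchless word-to-array conversion.
final class BranchlessSketch {

  /**
   * Writes base + i for every set bit i of word, starting at out[offset].
   * Returns the index one past the last value written.
   * Because every iteration stores, out needs one slot of slack past the result.
   */
  static int word2Array(long word, int base, int[] out, int offset) {
    for (int i = 0; i < Long.SIZE; i++) {
      out[offset] = base + i;              // unconditional store, may be overwritten
      offset += (int) ((word >>> i) & 1L); // advance only if bit i is set
    }
    return offset;
  }

  public static void main(String[] args) {
    int[] out = new int[4]; // 3 set bits + 1 slot of slack
    int end = word2Array(0b1010L | (1L << 63), 10, out, 0);
    System.out.println(Arrays.toString(Arrays.copyOf(out, end)));
  }
}
```

The slack slot is the reason a caller of such a routine has to size the output array larger than the cardinality.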

@jpountz
Contributor

jpountz commented Jul 13, 2025

Thank you for updating the benchmark. I suggest we first figure out how to handle compress() on #14896 before coming back to this PR.

@gf2121
Contributor Author

gf2121 commented Jul 13, 2025

> I suggest we first figure out how to handle compress() on #14896 before coming back to this PR.

+1, I'm tracking this PR as well.

@github-actions github-actions bot added this to the 10.3.0 milestone Jul 17, 2025
@@ -25,6 +25,7 @@
requires org.apache.lucene.core;
requires org.apache.lucene.expressions;
requires org.apache.lucene.sandbox;
requires jdk.incubator.vector;
Contributor

Please remove the code from benchmark module and only benchmark code in the java24 source set.

}
}

// NOCOMMIT remove vectorized methods and requirement on vector module before merge.
Contributor

Ah, yes this needs to go away. Our build system found this already. ❤️

@@ -39,6 +40,7 @@ int word2Array(long word, int base, int[] docs, int offset) {
return intWord2Array((int) (word >>> 32), docs, offset, base + 32);
}

@SuppressForbidden(reason = "Uses compress only where fast and carefully contained")
Contributor

The if check as described in forbidden apis is missing.

Contributor Author

The check happens before the instance is picked. Must we also check it within this method?

if (Constants.HAS_FAST_COMPRESS_MASK_CAST
    && PanamaVectorConstants.PREFERRED_VECTOR_BITSIZE >= 256) {
  return PanamaBitSetUtil.INSTANCE;
} else {
  return BitSetUtil.INSTANCE;
}

Contributor

Hi, then it is OK. Please add this info to the @SuppressForbidden reason!

Contributor

Maybe say: The whole BitsetUtil impl instance is only used when HAS_FAST_COMPRESS_MASK_CAST is enabled.

Contributor

I just want to make sure that Robert does not complain.

Contributor

@uschindler uschindler left a comment

Just my final comment: the setup for how the instances are picked looks fine, although I would have preferred an interface instead of subclassing.

That is, BitSetUtil as a public (but restricted) interface, with DefaultBitSetUtil as the non-vectorized implementation and PanamaBitSetUtil as the vectorized one, both package-private.
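The interface-based layout described above might be sketched as follows. This is an illustration under several assumptions: the `lookup(boolean)` factory and the demo class are hypothetical, the interface is package-private here only so the example fits in one file, and the body of `PanamaBitSetUtil` is a scalar placeholder because the example does not depend on the incubating Vector API.

```java
// Hypothetical demo driver, placed first so the file can be launched directly.
final class BitSetUtilDemo {
  public static void main(String[] args) {
    int[] docs = new int[8];
    // false selects the scalar implementation in this sketch.
    int end = BitSetUtil.lookup(false).word2Array(0b1001L, 64, docs, 0);
    System.out.println(end + " docs extracted");
  }
}

// The contract shared by both implementations.
interface BitSetUtil {
  int word2Array(long word, int base, int[] docs, int offset);

  // Hypothetical factory; in the PR the choice depends on
  // Constants.HAS_FAST_COMPRESS_MASK_CAST and the preferred vector size.
  static BitSetUtil lookup(boolean fastCompress) {
    return fastCompress ? PanamaBitSetUtil.INSTANCE : DefaultBitSetUtil.INSTANCE;
  }
}

// Non-vectorized implementation, package-private.
final class DefaultBitSetUtil implements BitSetUtil {
  static final DefaultBitSetUtil INSTANCE = new DefaultBitSetUtil();

  private DefaultBitSetUtil() {}

  @Override
  public int word2Array(long word, int base, int[] docs, int offset) {
    while (word != 0L) {
      docs[offset++] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1; // clear the lowest set bit
    }
    return offset;
  }
}

// Vectorized implementation, package-private.
final class PanamaBitSetUtil implements BitSetUtil {
  static final PanamaBitSetUtil INSTANCE = new PanamaBitSetUtil();

  private PanamaBitSetUtil() {}

  @Override
  public int word2Array(long word, int base, int[] docs, int offset) {
    // Placeholder: a real implementation would use the Vector API here.
    return DefaultBitSetUtil.INSTANCE.word2Array(word, base, docs, offset);
  }
}
```

With an interface, neither implementation inherits behaviour from the other, so the vectorized class cannot silently fall back to a scalar method it forgot to override.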

@@ -206,7 +208,8 @@ private static Optional<Module> lookupVectorModule() {
"org.apache.lucene.util.VectorUtil",
"org.apache.lucene.codecs.lucene103.Lucene103PostingsReader",
"org.apache.lucene.codecs.lucene103.PostingIndexInput",
-"org.apache.lucene.tests.util.TestSysoutsLimits");
+"org.apache.lucene.tests.util.TestSysoutsLimits",
+"org.apache.lucene.util.FixedBitSet");
Contributor

FWIW we should try to avoid expanding this list. Hopefully as per my other comment, we can move bitsetToArray to VectorUtil instead so that this list can stay as-is.

Contributor

Fully agree.

@@ -114,6 +114,8 @@ public static VectorizationProvider getInstance() {
/** Create a new {@link PostingDecodingUtil} for the given {@link IndexInput}. */
public abstract PostingDecodingUtil newPostingDecodingUtil(IndexInput input) throws IOException;

public abstract BitSetUtil newBitSetUtil();
Contributor

I would rather have the methods of BitSetUtil in this class. PostingDecodingUtil is a bit different because it requires state (the MemorySegment), which is not the case with BitSetUtil.

Then we can call bitsetToArray from VectorUtil and don't need to add a new class that is allowed to call the vector API.

Contributor

I agree with this. It is stateless so let's reuse the already existing method. Then we can also have the "if Constant" part there.

You can ignore my other note.


for (int i = 0; i < Integer.SIZE; i += INT_SPECIES.length()) {
  IntVector.fromArray(INT_SPECIES, IDENTITY, i)
      .add(base)
Contributor

Should we broadcast base to an IntVector outside of the loop so that it's only done once rather than once per iteration?

private static int intWord2Array(int word, int[] resultArray, int offset, int base) {
  IntVector bitMask = IntVector.fromArray(INT_SPECIES, IDENTITY_MASK, 0);

  for (int i = 0; i < Integer.SIZE; i += INT_SPECIES.length()) {
Contributor

I'm a bit uncomfortable with the underlying assumption that INT_SPECIES.length() is a divisor of Integer.SIZE. Can we write the code in a way that doesn't make this assumption or add a check somewhere?
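
One way to avoid the assumption, sketched in plain scalar Java rather than with the Vector API: walk the word in chunks of `laneCount` bits and clamp the last chunk. Here `laneCount` stands in for `INT_SPECIES.length()`, and the class and method names are hypothetical, not taken from this PR.

```java
final class WordChunks {
  /**
   * Writes base + (index of each set bit in word) into out, starting at offset,
   * and returns the next free offset. laneCount models INT_SPECIES.length().
   */
  static int intWordToArray(int word, int[] out, int offset, int base, int laneCount) {
    if (laneCount <= 0) {
      throw new IllegalArgumentException("laneCount must be positive: " + laneCount);
    }
    for (int i = 0; i < Integer.SIZE; i += laneCount) {
      // Clamp the final chunk so a lane count that does not divide 32 still works.
      int lanes = Math.min(laneCount, Integer.SIZE - i);
      for (int lane = 0; lane < lanes; lane++) {
        int bit = i + lane;
        if ((word & (1 << bit)) != 0) {
          out[offset++] = base + bit;
        }
      }
    }
    return offset;
  }
}
```

In vector code the clamp would presumably become a lane mask on the last iteration. The cheaper alternative is a one-time check at class initialization that the lane count divides `Integer.SIZE`.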


@SuppressForbidden(reason = "Uses compress only where fast and carefully contained")
private static int intWord2Array(int word, int[] resultArray, int offset, int base) {
IntVector bitMask = IntVector.fromArray(INT_SPECIES, IDENTITY_MASK, 0);
Contributor

I wonder if this should be a private static final field?

IntVector.fromArray(INT_SPECIES, IDENTITY, i)
.add(base)
.compress(bitMask.and(word).compare(VectorOperators.NE, 0))
.reinterpretAsInts()
Contributor

Why is this necessary? Doesn't compress() already return an IntVector?

: "Array length must be at least bitSet.cardinality(from, to) + 1";
public final int bitsetToArray(FixedBitSet bitSet, int from, int to, int base, int[] array) {
assert bitSet.cardinality(from, to) + 16 <= array.length
: "Array length must be at least bitSet.cardinality(from, to) + 16";

Objects.checkFromToIndex(from, to, bitSet.length());
Contributor

I wonder if we should refactor this bitsetToArray method to compute the bitCount of each word up-front to reduce dependencies between iterations of this loop (we rely on the result of wordToArray to know the next index at which to start writing data, so we can't start the next iteration of the loop before the current one is finished).
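
A minimal scalar sketch of that idea, assuming the bitset is available as a plain `long[]` (the class and method names are hypothetical, and the PR's actual method takes a `FixedBitSet` with `from`/`to` bounds). A first pass turns per-word popcounts into start offsets, so the second pass never needs the previous word's result to know where to write.

```java
final class PrecomputedOffsets {
  /** Expands the set bits of words into out as base + bitIndex; returns the count written. */
  static int bitsToArray(long[] words, int base, int[] out) {
    // Pass 1: a prefix sum of popcounts gives each word's start offset in out.
    int[] starts = new int[words.length + 1];
    for (int i = 0; i < words.length; i++) {
      starts[i + 1] = starts[i] + Long.bitCount(words[i]);
    }
    // Pass 2: each word writes to its own range, independent of the other words.
    for (int i = 0; i < words.length; i++) {
      long word = words[i];
      int offset = starts[i];
      int wordBase = base + (i << 6); // 64 bits per word
      while (word != 0) {
        out[offset++] = wordBase + Long.numberOfTrailingZeros(word);
        word &= word - 1; // clear the lowest set bit
      }
    }
    return starts[words.length];
  }
}
```

Whether the JIT actually overlaps the iterations of the second pass is something only a benchmark can tell.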

Contributor

By the way, maybe the micro benchmark should be updated to operate on a FixedBitSet instead of a single word, to better capture this sort of thing.

Contributor Author

I understand the idea and can try it. I thought it might be challenging for the compiler to prove that the write ranges do not overlap across iterations.

Contributor Author

Here is the result

BitsetToArrayBenchmark.hybrid                            256  thrpt    5  5.239 ± 0.202  ops/us
BitsetToArrayBenchmark.hybrid                            384  thrpt    5  6.077 ± 0.135  ops/us
BitsetToArrayBenchmark.hybrid                            512  thrpt    5  6.259 ± 0.087  ops/us
BitsetToArrayBenchmark.hybrid                            768  thrpt    5  5.926 ± 0.078  ops/us
BitsetToArrayBenchmark.hybrid                           1024  thrpt    5  4.889 ± 0.056  ops/us
BitsetToArrayBenchmark.hybridUnrolling                   256  thrpt    5  4.850 ± 0.054  ops/us
BitsetToArrayBenchmark.hybridUnrolling                   384  thrpt    5  5.940 ± 0.086  ops/us
BitsetToArrayBenchmark.hybridUnrolling                   512  thrpt    5  6.271 ± 0.098  ops/us
BitsetToArrayBenchmark.hybridUnrolling                   768  thrpt    5  5.328 ± 0.106  ops/us
BitsetToArrayBenchmark.hybridUnrolling                  1024  thrpt    5  4.174 ± 0.059  ops/us

In case you are interested, here is the full result of this new benchmark.

Benchmark                                        (bitLength)   Mode  Cnt  Score   Error   Units
BitsetToArrayBenchmark.dense                             256  thrpt    5  2.152 ± 0.017  ops/us
BitsetToArrayBenchmark.dense                             384  thrpt    5  1.308 ± 0.024  ops/us
BitsetToArrayBenchmark.dense                             512  thrpt    5  1.156 ± 0.020  ops/us
BitsetToArrayBenchmark.dense                             768  thrpt    5  0.991 ± 0.024  ops/us
BitsetToArrayBenchmark.dense                            1024  thrpt    5  0.888 ± 0.020  ops/us
BitsetToArrayBenchmark.denseBranchLess                   256  thrpt    5  5.646 ± 0.050  ops/us
BitsetToArrayBenchmark.denseBranchLess                   384  thrpt    5  3.999 ± 0.044  ops/us
BitsetToArrayBenchmark.denseBranchLess                   512  thrpt    5  3.097 ± 0.085  ops/us
BitsetToArrayBenchmark.denseBranchLess                   768  thrpt    5  2.099 ± 0.017  ops/us
BitsetToArrayBenchmark.denseBranchLess                  1024  thrpt    5  1.622 ± 0.020  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov               256  thrpt    5  3.692 ± 0.032  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov               384  thrpt    5  2.572 ± 0.033  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov               512  thrpt    5  1.970 ± 0.018  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov               768  thrpt    5  0.815 ± 0.015  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov              1024  thrpt    5  0.728 ± 0.008  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel           256  thrpt    5  5.803 ± 0.054  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel           384  thrpt    5  4.106 ± 0.056  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel           512  thrpt    5  3.202 ± 0.033  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel           768  thrpt    5  2.181 ± 0.018  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel          1024  thrpt    5  1.657 ± 0.019  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          256  thrpt    5  6.157 ± 0.104  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          384  thrpt    5  4.380 ± 0.042  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          512  thrpt    5  3.392 ± 0.060  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling          768  thrpt    5  2.354 ± 0.023  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling         1024  thrpt    5  1.794 ± 0.012  ops/us
BitsetToArrayBenchmark.denseInvert                       256  thrpt    5  1.865 ± 0.034  ops/us
BitsetToArrayBenchmark.denseInvert                       384  thrpt    5  1.791 ± 0.025  ops/us
BitsetToArrayBenchmark.denseInvert                       512  thrpt    5  1.817 ± 0.018  ops/us
BitsetToArrayBenchmark.denseInvert                       768  thrpt    5  1.904 ± 0.015  ops/us
BitsetToArrayBenchmark.denseInvert                      1024  thrpt    5  1.816 ± 0.016  ops/us
BitsetToArrayBenchmark.forLoop                           256  thrpt    5  5.645 ± 0.081  ops/us
BitsetToArrayBenchmark.forLoop                           384  thrpt    5  6.118 ± 0.073  ops/us
BitsetToArrayBenchmark.forLoop                           512  thrpt    5  6.352 ± 0.068  ops/us
BitsetToArrayBenchmark.forLoop                           768  thrpt    5  5.957 ± 0.081  ops/us
BitsetToArrayBenchmark.forLoop                          1024  thrpt    5  4.931 ± 0.088  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling            256  thrpt    5  5.411 ± 0.156  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling            384  thrpt    5  5.699 ± 0.238  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling            512  thrpt    5  5.585 ± 0.161  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling            768  thrpt    5  4.667 ± 0.101  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           1024  thrpt    5  3.999 ± 0.037  ops/us
BitsetToArrayBenchmark.hybrid                            256  thrpt    5  5.239 ± 0.202  ops/us
BitsetToArrayBenchmark.hybrid                            384  thrpt    5  6.077 ± 0.135  ops/us
BitsetToArrayBenchmark.hybrid                            512  thrpt    5  6.259 ± 0.087  ops/us
BitsetToArrayBenchmark.hybrid                            768  thrpt    5  5.926 ± 0.078  ops/us
BitsetToArrayBenchmark.hybrid                           1024  thrpt    5  4.889 ± 0.056  ops/us
BitsetToArrayBenchmark.hybridUnrolling                   256  thrpt    5  4.850 ± 0.054  ops/us
BitsetToArrayBenchmark.hybridUnrolling                   384  thrpt    5  5.940 ± 0.086  ops/us
BitsetToArrayBenchmark.hybridUnrolling                   512  thrpt    5  6.271 ± 0.098  ops/us
BitsetToArrayBenchmark.hybridUnrolling                   768  thrpt    5  5.328 ± 0.106  ops/us
BitsetToArrayBenchmark.hybridUnrolling                  1024  thrpt    5  4.174 ± 0.059  ops/us
BitsetToArrayBenchmark.whileLoop                         256  thrpt    5  3.932 ± 0.021  ops/us
BitsetToArrayBenchmark.whileLoop                         384  thrpt    5  3.779 ± 0.054  ops/us
BitsetToArrayBenchmark.whileLoop                         512  thrpt    5  3.889 ± 0.049  ops/us
BitsetToArrayBenchmark.whileLoop                         768  thrpt    5  3.747 ± 0.083  ops/us
BitsetToArrayBenchmark.whileLoop                        1024  thrpt    5  3.856 ± 0.088  ops/us

Contributor Author

It is also interesting that forLoop is much faster than whileLoop here, which matches my luceneutil results; the previous benchmark did not show this.
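
For readers without the benchmark source at hand, this is roughly what the two shapes could look like. It is a sketch under the assumption that `forLoop` bounds the loop with `Long.bitCount` while `whileLoop` tests the remaining word; the class name and signatures are hypothetical and the benchmark's real code may differ.

```java
final class WordLoops {
  /** while-loop form: the exit condition depends on the word as it is updated. */
  static int whileLoop(long word, int base, int[] out, int offset) {
    while (word != 0) {
      int ntz = Long.numberOfTrailingZeros(word);
      out[offset++] = base + ntz;
      word ^= 1L << ntz; // clear the bit that was just emitted
    }
    return offset;
  }

  /** for-loop form: the trip count is computed once up front with Long.bitCount. */
  static int forLoop(long word, int base, int[] out, int offset) {
    for (int i = 0, n = Long.bitCount(word); i < n; i++) {
      int ntz = Long.numberOfTrailingZeros(word);
      out[offset++] = base + ntz;
      word ^= 1L << ntz; // clear the bit that was just emitted
    }
    return offset;
  }
}
```

Both variants write the same values; only the loop control differs.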

Contributor Author

In addition, here are the benchmarks for the vectorized methods:

Benchmark                                                 (bitLength)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.denseBranchLessVectorized                  256  thrpt    5  17.648 ± 0.059  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                  384  thrpt    5  14.186 ± 0.049  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                  512  thrpt    5  12.102 ± 0.203  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                  768  thrpt    5   9.585 ± 0.068  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 1024  thrpt    5   7.433 ± 0.093  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              256  thrpt    5  10.885 ± 0.069  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              384  thrpt    5   7.534 ± 0.280  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              512  thrpt    5   6.894 ± 0.019  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              768  thrpt    5   5.068 ± 0.014  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             1024  thrpt    5   3.862 ± 0.109  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong          256  thrpt    5  18.918 ± 0.168  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong          384  thrpt    5  18.055 ± 0.093  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong          512  thrpt    5  17.804 ± 0.079  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong          768  thrpt    5  16.803 ± 0.051  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong         1024  thrpt    5  17.066 ± 0.042  ops/us

@jpountz
Contributor

jpountz commented Jul 17, 2025

FWIW I benchmarked this change on my machine (AVX2), there may be a small speedup for some queries but not as much as previously reported:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
              FilteredDismaxTerm      159.75      (1.8%)      156.59      (2.0%)   -2.0% (  -5% -    1%) 0.020
                    FilteredTerm      160.05      (1.8%)      157.28      (2.4%)   -1.7% (  -5% -    2%) 0.064
                  FilteredPhrase       31.66      (1.9%)       31.21      (2.4%)   -1.4% (  -5% -    2%) 0.144
        FilteredDismaxOrHighHigh       69.26      (2.1%)       68.44      (3.1%)   -1.2% (  -6% -    4%) 0.324
                            Term      624.66      (6.6%)      617.30      (4.6%)   -1.2% ( -11% -   10%) 0.644
     FilteredAnd2Terms2StopWords      211.38      (1.5%)      209.23      (1.9%)   -1.0% (  -4% -    2%) 0.194
                       CountTerm     9064.76      (1.7%)     8973.82      (3.5%)   -1.0% (  -6% -    4%) 0.413
               FilteredAnd3Terms      187.39      (1.3%)      185.79      (1.3%)   -0.9% (  -3% -    1%) 0.142
              CombinedOrHighHigh       22.75      (1.0%)       22.56      (1.6%)   -0.8% (  -3% -    1%) 0.154
             And2Terms2StopWords      199.52      (1.9%)      197.87      (2.5%)   -0.8% (  -5% -    3%) 0.397
                 AndHighOrMedMed       49.74      (1.6%)       49.33      (1.9%)   -0.8% (  -4% -    2%) 0.296
             CountFilteredPhrase       24.76      (2.0%)       24.57      (2.9%)   -0.8% (  -5% -    4%) 0.487
                AndMedOrHighHigh       85.96      (1.2%)       85.32      (2.4%)   -0.8% (  -4% -    2%) 0.383
                       And3Terms      235.75      (1.9%)      234.03      (2.1%)   -0.7% (  -4% -    3%) 0.410
              Or2Terms2StopWords      200.01      (2.5%)      198.59      (3.1%)   -0.7% (  -6% -    5%) 0.572
                  FilteredIntNRQ      289.06      (0.8%)      287.05      (1.7%)   -0.7% (  -3% -    1%) 0.251
         FilteredDismaxOrHighMed      125.35      (1.9%)      124.53      (2.3%)   -0.7% (  -4% -    3%) 0.486
              FilteredOrHighHigh       65.45      (2.9%)       65.04      (2.0%)   -0.6% (  -5% -    4%) 0.572
                    CombinedTerm       37.78      (1.8%)       37.56      (2.1%)   -0.6% (  -4% -    3%) 0.499
              FilteredAndHighMed      153.64      (1.7%)      152.75      (2.0%)   -0.6% (  -4% -    3%) 0.489
             CombinedAndHighHigh       22.87      (1.0%)       22.75      (1.9%)   -0.6% (  -3% -    2%) 0.412
                      TermDTSort      408.86      (3.1%)      406.81      (4.9%)   -0.5% (  -8% -    7%) 0.783
      FilteredOr2Terms2StopWords      143.31      (2.1%)      142.59      (1.7%)   -0.5% (  -4% -    3%) 0.551
               FilteredOrHighMed      149.58      (1.9%)      148.84      (1.3%)   -0.5% (  -3% -    2%) 0.488
                          OrMany       22.71      (2.2%)       22.60      (2.4%)   -0.5% (  -4% -    4%) 0.627
              CombinedAndHighMed       87.88      (0.7%)       87.45      (1.1%)   -0.5% (  -2% -    1%) 0.232
                      OrHighRare      273.20     (12.1%)      272.03     (11.1%)   -0.4% ( -21% -   25%) 0.934
                FilteredOr3Terms      162.88      (1.8%)      162.22      (1.2%)   -0.4% (  -3% -    2%) 0.545
               CombinedOrHighMed       86.79      (0.6%)       86.45      (1.2%)   -0.4% (  -2% -    1%) 0.352
            FilteredAndStopWords       64.14      (2.8%)       63.95      (3.8%)   -0.3% (  -6% -    6%) 0.844
                  FilteredOrMany       15.94      (2.8%)       15.90      (1.5%)   -0.3% (  -4% -    4%) 0.779
                        Or3Terms      226.68      (2.3%)      226.18      (2.8%)   -0.2% (  -5% -    4%) 0.846
                  CountOrHighMed      359.27      (2.3%)      358.53      (2.1%)   -0.2% (  -4% -    4%) 0.834
                 DismaxOrHighMed      184.76      (2.7%)      184.40      (2.1%)   -0.2% (  -4% -    4%) 0.853
                      DismaxTerm      730.52      (2.8%)      729.16      (2.1%)   -0.2% (  -5% -    4%) 0.868
                 FilteredPrefix3      147.61      (1.7%)      147.38      (2.9%)   -0.2% (  -4% -    4%) 0.883
                 CountAndHighMed      305.37      (1.3%)      305.04      (1.4%)   -0.1% (  -2% -    2%) 0.859
                       OrHighMed      251.29      (3.0%)      251.10      (3.0%)   -0.1% (  -5% -    6%) 0.955
                      AndHighMed      197.92      (2.3%)      197.86      (2.3%)   -0.0% (  -4% -    4%) 0.975
               TermDayOfYearSort      270.57      (6.3%)      270.54      (6.1%)   -0.0% ( -11% -   13%) 0.996
             FilteredOrStopWords       43.96      (3.1%)       43.98      (2.6%)    0.0% (  -5% -    5%) 0.978
                     CountPhrase        4.06      (2.8%)        4.06      (2.1%)    0.1% (  -4% -    5%) 0.946
             FilteredAndHighHigh       77.63      (2.8%)       77.74      (3.0%)    0.1% (  -5% -    6%) 0.918
          CountFilteredOrHighMed      146.93      (0.7%)      147.31      (0.6%)    0.3% (  -1% -    1%) 0.370
                     CountOrMany       28.14      (1.4%)       28.23      (1.2%)    0.3% (  -2% -    2%) 0.613
                DismaxOrHighHigh      126.24      (5.6%)      126.71      (4.8%)    0.4% (  -9% -   11%) 0.874
         CountFilteredOrHighHigh      135.00      (1.0%)      135.57      (0.7%)    0.4% (  -1% -    2%) 0.276
             CountFilteredOrMany       26.38      (1.7%)       26.54      (1.5%)    0.6% (  -2% -    3%) 0.397
                 CountOrHighHigh      332.56      (2.5%)      336.29      (2.4%)    1.1% (  -3% -    6%) 0.307
                CountAndHighHigh      349.06      (2.2%)      353.95      (2.1%)    1.4% (  -2% -    5%) 0.149
                      OrHighHigh       76.02      (3.4%)       77.64      (3.1%)    2.1% (  -4% -    9%) 0.149
                     OrStopWords       47.26      (3.3%)       48.31      (3.7%)    2.2% (  -4% -    9%) 0.162
                    AndStopWords       45.69      (2.5%)       46.71      (3.0%)    2.2% (  -3% -    7%) 0.071
                     AndHighHigh       67.01      (2.5%)       69.40      (3.2%)    3.6% (  -2% -    9%) 0.005

@uschindler
Contributor

Just my final comment: the way the instances are set up looks fine, although I would have preferred an interface instead of subclassing.

So: BitSetUtil as a public (but restricted) interface, with DefaultBitSetUtil as the non-vectorized implementation and PanamaBitSetUtil as another implementation, both package-private.
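
A rough sketch of that layout, with a hypothetical method signature over a plain `long[]` (the PR's method takes a `FixedBitSet`). Only the scalar implementation is shown, since a PanamaBitSetUtil would need the incubating Vector API.

```java
// Hypothetical sketch: in Lucene the interface would be public (with restricted access);
// it is package-private here only so the snippet compiles as a single file.
interface BitSetUtil {
  /** Writes base + index of every set bit in words into out; returns the number written. */
  int bitsetToArray(long[] words, int base, int[] out);
}

/** Scalar fallback. A PanamaBitSetUtil would implement the same interface with the Vector API. */
final class DefaultBitSetUtil implements BitSetUtil {
  @Override
  public int bitsetToArray(long[] words, int base, int[] out) {
    int count = 0;
    for (int i = 0; i < words.length; i++) {
      long word = words[i];
      int wordBase = base + (i << 6); // 64 bits per word
      while (word != 0) {
        out[count++] = wordBase + Long.numberOfTrailingZeros(word);
        word &= word - 1; // clear the lowest set bit
      }
    }
    return count;
  }
}
```

Callers would only see the interface type, so the provider can hand out either implementation without subclassing.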

See Adrien's comment.

@gf2121
Contributor Author

gf2121 commented Jul 18, 2025

there may be a small speedup for some queries but not as much as previously reported.

I guess your benchmark ran against current main, which includes #14935, while the previous report did not. The following report is what I get on AVX512 now, so I think your result on AVX2 is expected.

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                      OrHighRare      127.60      (6.9%)      124.88      (8.4%)   -2.1% ( -16% -   14%) 0.382
                            Term      483.00      (4.6%)      475.41      (4.9%)   -1.6% ( -10% -    8%) 0.296
              CombinedAndHighMed       31.34      (2.7%)       30.99      (3.1%)   -1.1% (  -6% -    4%) 0.226
                      TermDTSort      230.76      (3.2%)      229.41      (3.3%)   -0.6% (  -6% -    6%) 0.567
                      DismaxTerm      422.26      (3.1%)      420.12      (3.1%)   -0.5% (  -6% -    5%) 0.603
                    FilteredTerm       72.81      (1.7%)       72.52      (2.0%)   -0.4% (  -4% -    3%) 0.505
                  CountOrHighMed      101.54      (1.1%)      101.15      (1.5%)   -0.4% (  -2% -    2%) 0.351
              FilteredOrHighHigh       19.23      (1.0%)       19.16      (1.4%)   -0.3% (  -2% -    2%) 0.358
                      AndHighMed       79.76      (2.2%)       79.50      (3.0%)   -0.3% (  -5% -    4%) 0.687
                AndMedOrHighHigh       24.19      (1.6%)       24.12      (1.9%)   -0.3% (  -3% -    3%) 0.578
                       OrHighMed       97.86      (2.0%)       97.58      (2.6%)   -0.3% (  -4% -    4%) 0.688
                 CountOrHighHigh       68.53      (0.8%)       68.36      (1.1%)   -0.2% (  -2% -    1%) 0.422
                    CombinedTerm       17.09      (2.9%)       17.05      (3.2%)   -0.2% (  -6% -    6%) 0.808
               CombinedOrHighMed       30.64      (2.2%)       30.57      (3.6%)   -0.2% (  -5% -    5%) 0.810
             CombinedAndHighHigh        8.45      (2.7%)        8.43      (2.6%)   -0.2% (  -5% -    5%) 0.787
              FilteredAndHighMed       50.35      (1.3%)       50.26      (1.5%)   -0.2% (  -2% -    2%) 0.687
                       And3Terms       98.50      (3.7%)       98.32      (4.6%)   -0.2% (  -8% -    8%) 0.891
                  FilteredPhrase       14.27      (2.1%)       14.24      (1.9%)   -0.2% (  -4% -    3%) 0.791
               FilteredOrHighMed       54.69      (1.4%)       54.60      (1.4%)   -0.2% (  -2% -    2%) 0.722
                CountAndHighHigh       67.35      (0.9%)       67.26      (1.2%)   -0.1% (  -2% -    2%) 0.697
                        Wildcard       56.31      (3.3%)       56.26      (3.3%)   -0.1% (  -6% -    6%) 0.924
                          Fuzzy1       37.10      (2.2%)       37.07      (2.5%)   -0.1% (  -4% -    4%) 0.927
          CountFilteredOrHighMed       34.07      (0.4%)       34.05      (0.6%)   -0.1% (  -1% -    0%) 0.688
                         Respell       29.69      (2.6%)       29.67      (2.4%)   -0.1% (  -4% -    5%) 0.939
         CountFilteredOrHighHigh       28.53      (0.5%)       28.52      (0.7%)   -0.0% (  -1% -    1%) 0.817
                          Fuzzy2       33.99      (2.2%)       34.00      (2.4%)    0.0% (  -4% -    4%) 0.985
                 AndHighOrMedMed       22.20      (2.0%)       22.20      (2.0%)    0.0% (  -3% -    4%) 0.982
                 CountAndHighMed       96.12      (1.2%)       96.13      (1.4%)    0.0% (  -2% -    2%) 0.967
                       CountTerm     4947.81      (3.3%)     4949.52      (3.7%)    0.0% (  -6% -    7%) 0.975
                 DismaxOrHighMed       65.72      (2.0%)       65.78      (2.4%)    0.1% (  -4% -    4%) 0.897
                  FilteredIntNRQ       60.66      (0.9%)       60.72      (1.2%)    0.1% (  -2% -    2%) 0.791
               TermDayOfYearSort      407.35      (2.2%)      407.88      (2.3%)    0.1% (  -4% -    4%) 0.855
             CountFilteredIntNRQ       25.92      (0.4%)       25.95      (0.5%)    0.1% (   0% -    1%) 0.324
                FilteredOr3Terms       57.49      (1.2%)       57.60      (1.1%)    0.2% (  -2% -    2%) 0.599
                          IntNRQ       61.25      (1.0%)       61.39      (1.2%)    0.2% (  -1% -    2%) 0.526
             FilteredAndHighHigh       17.66      (1.1%)       17.70      (0.8%)    0.2% (  -1% -    2%) 0.455
             FilteredOrStopWords       12.28      (1.8%)       12.31      (1.9%)    0.2% (  -3% -    4%) 0.677
      FilteredOr2Terms2StopWords       65.25      (1.1%)       65.43      (1.0%)    0.3% (  -1% -    2%) 0.408
                        Or3Terms       89.77      (4.0%)       90.03      (4.7%)    0.3% (  -8% -    9%) 0.829
                          Phrase       10.62      (3.2%)       10.65      (3.3%)    0.3% (  -5% -    7%) 0.753
                DismaxOrHighHigh       44.42      (2.5%)       44.57      (2.8%)    0.3% (  -4% -    5%) 0.696
                         Prefix3       97.90      (3.1%)       98.26      (3.2%)    0.4% (  -5% -    6%) 0.712
     FilteredAnd2Terms2StopWords       76.73      (1.7%)       77.03      (2.3%)    0.4% (  -3% -    4%) 0.536
            FilteredAndStopWords       14.35      (1.2%)       14.41      (1.0%)    0.4% (  -1% -    2%) 0.209
               FilteredAnd3Terms      116.13      (1.1%)      116.65      (1.3%)    0.4% (  -1% -    2%) 0.236
                 FilteredPrefix3       91.42      (3.0%)       91.84      (2.9%)    0.5% (  -5% -    6%) 0.620
                          IntSet      167.56      (5.1%)      168.35      (5.0%)    0.5% (  -9% -   11%) 0.767
              Or2Terms2StopWords       77.52      (2.5%)       77.94      (3.0%)    0.5% (  -4% -    6%) 0.536
             And2Terms2StopWords       73.71      (2.6%)       74.12      (3.0%)    0.6% (  -4% -    6%) 0.536
                   TermMonthSort     1991.89      (2.0%)     2008.30      (2.8%)    0.8% (  -3% -    5%) 0.281
                      OrHighHigh       32.79      (4.1%)       33.08      (4.5%)    0.9% (  -7% -    9%) 0.520
              CombinedOrHighHigh        8.25      (3.1%)        8.32      (3.2%)    0.9% (  -5% -    7%) 0.360
                   TermTitleSort       72.91      (1.9%)       73.66      (3.1%)    1.0% (  -3% -    6%) 0.211
             CountFilteredPhrase       12.76      (1.6%)       12.91      (1.7%)    1.1% (  -2% -    4%) 0.036
                     AndHighHigh       32.95      (3.0%)       33.61      (3.9%)    2.0% (  -4% -    9%) 0.067
                    AndStopWords       12.52      (4.9%)       13.05      (6.1%)    4.2% (  -6% -   15%) 0.015
                     OrStopWords       13.62      (4.4%)       14.58      (5.4%)    7.1% (  -2% -   17%) 0.000

@gf2121
Contributor Author

gf2121 commented Jul 18, 2025

FWIW I benchmarked this change on my machine (AVX2), there may be a small speedup for some queries but not as much as previously reported.

According to this report, this optimization only yields noticeable improvements on AVX512, so I am now a bit hesitant to move forward.

@gf2121 gf2121 marked this pull request as draft July 19, 2025 13:22
@uschindler
Contributor

Much better now. I am still not sure if this is all worth the complexity!

3 participants