Vectorize bitset to array #14910

gf2121 · 2025-07-07T12:22:40Z

This is a minimal proof of concept for vectorizing the conversion of a bitset into an array, which can be a hot path when postings are encoded as a bitset. This version currently only runs on AVX512, but it can be adapted to other instruction sets in the future.

Benchmark                             (bitSetSize)   Mode  Cnt      Score      Error   Units
BitsetToArrayBenchmark.baseline                128  thrpt    5   5477.202 ±   36.920  ops/ms
BitsetToArrayBenchmark.baseline                256  thrpt    5   6197.595 ±   92.064  ops/ms
BitsetToArrayBenchmark.baseline                512  thrpt    5   7121.446 ±  113.840  ops/ms
BitsetToArrayBenchmark.baseline                768  thrpt    5   7361.335 ±  286.118  ops/ms
BitsetToArrayBenchmark.vectorized512           128  thrpt    5  85321.831 ± 1539.445  ops/ms
BitsetToArrayBenchmark.vectorized512           256  thrpt    5  58632.773 ± 1130.691  ops/ms
BitsetToArrayBenchmark.vectorized512           512  thrpt    5  48780.092 ±  958.403  ops/ms
BitsetToArrayBenchmark.vectorized512           768  thrpt    5  29373.799 ±  392.238  ops/ms
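
For readers unfamiliar with the operation being benchmarked, here is a minimal scalar sketch of "bitset to array". It is not the PR's code: the class name, the method signature, and the use of a raw `long[]` in place of a `FixedBitSet` are assumptions made for illustration only.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: converts the set bits of a
// long[]-backed bitset into an array of (base + bitIndex) values.
final class ScalarBitsetToArray {
  static int bitsetToArray(long[] words, int base, int[] out) {
    int count = 0;
    for (int w = 0; w < words.length; w++) {
      long word = words[w];
      int wordBase = base + (w << 6); // each word covers 64 bit positions
      while (word != 0L) {
        // The index of the lowest set bit gives the next value to emit.
        out[count++] = wordBase + Long.numberOfTrailingZeros(word);
        word &= word - 1; // clear the lowest set bit
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int[] out = new int[4];
    int n = bitsetToArray(new long[] {0b1011L, 1L << 63}, 100, out);
    System.out.println(n + " " + Arrays.toString(out));
  }
}
```

The inner loop runs once per set bit, so its cost and its branch behavior depend on density, which is one reason a data-parallel formulation is attractive for dense postings.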

github-actions · 2025-07-07T12:23:35Z

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

uschindler · 2025-07-07T15:32:47Z

This cannot be merged without moving this to the java24 part and removing the JMH module's requires on the incubator module.

I assume this is only meant for quick checks and stays draft?

gf2121 · 2025-07-07T15:48:35Z

Thanks for reminding!

> I assume this is only meant for quick checks and stays draft?

Yes. Once the code is integrated into VectorUtil, the benchmark will call VectorUtil directly and will no longer require the incubator module, just like the other benchmarks.

gf2121 · 2025-07-09T11:20:20Z

I managed to get some luceneutil data on AVX512:

                            Task   QPS baseline    StdDev   QPS my_modified_version    StdDev                Pct diff p-value
                 FilteredPrefix3       76.86      (3.2%)       76.20      (5.5%)   -0.9% (  -9% -    8%) 0.552
                         Prefix3       81.74      (3.1%)       81.13      (4.9%)   -0.7% (  -8% -    7%) 0.567
                AndMedOrHighHigh       22.83      (1.8%)       22.78      (2.0%)   -0.3% (  -3% -    3%) 0.670
                   TermMonthSort     1184.75      (2.7%)     1182.73      (8.4%)   -0.2% ( -10% -   11%) 0.931
              CombinedOrHighHigh        7.95      (2.8%)        7.94      (2.7%)   -0.2% (  -5% -    5%) 0.844
             And2Terms2StopWords       66.71      (2.5%)       66.61      (4.6%)   -0.1% (  -7% -    7%) 0.903
                          Fuzzy1       33.29      (2.5%)       33.26      (3.4%)   -0.1% (  -5% -    5%) 0.929
             FilteredOrStopWords       11.72      (2.0%)       11.71      (3.5%)   -0.1% (  -5% -    5%) 0.929
                    FilteredTerm       68.60      (1.6%)       68.56      (4.1%)   -0.0% (  -5% -    5%) 0.962
             CountFilteredIntNRQ       24.33      (1.1%)       24.33      (2.4%)   -0.0% (  -3% -    3%) 0.991
                          IntNRQ       55.49      (1.2%)       55.51      (3.1%)    0.0% (  -4% -    4%) 0.959
      FilteredOr2Terms2StopWords       59.21      (2.1%)       59.27      (4.5%)    0.1% (  -6% -    6%) 0.929
                 CountOrHighHigh       63.85      (1.0%)       63.93      (2.4%)    0.1% (  -3% -    3%) 0.827
             CountFilteredPhrase       12.29      (1.6%)       12.31      (1.9%)    0.1% (  -3% -    3%) 0.800
                      AndHighMed       72.10      (2.0%)       72.23      (4.2%)    0.2% (  -5% -    6%) 0.860
                          Phrase        9.76      (2.2%)        9.78      (3.3%)    0.2% (  -5% -    5%) 0.823
                  CountOrHighMed       93.67      (1.3%)       93.87      (2.7%)    0.2% (  -3% -    4%) 0.746
                 CountAndHighMed       90.17      (1.0%)       90.37      (2.2%)    0.2% (  -2% -    3%) 0.690
                      DismaxTerm      331.70      (3.2%)      332.55      (6.0%)    0.3% (  -8% -    9%) 0.867
               FilteredAnd3Terms      105.66      (2.0%)      105.96      (2.9%)    0.3% (  -4% -    5%) 0.717
          CountFilteredOrHighMed       32.69      (1.4%)       32.78      (2.0%)    0.3% (  -3% -    3%) 0.588
                          Fuzzy2       30.47      (2.4%)       30.57      (3.4%)    0.3% (  -5% -    6%) 0.724
              Or2Terms2StopWords       68.51      (2.3%)       68.74      (4.7%)    0.3% (  -6% -    7%) 0.765
             CombinedAndHighHigh        8.07      (2.1%)        8.10      (2.2%)    0.4% (  -3% -    4%) 0.596
                   TermTitleSort       59.34      (2.8%)       59.55      (4.0%)    0.4% (  -6% -    7%) 0.739
                        Wildcard       48.97      (3.6%)       49.16      (4.6%)    0.4% (  -7% -    8%) 0.766
                CountAndHighHigh       63.48      (1.3%)       63.74      (2.2%)    0.4% (  -3% -    3%) 0.472
                      TermDTSort      191.53      (2.0%)      192.32      (5.5%)    0.4% (  -7% -    8%) 0.752
             FilteredAndHighHigh       16.87      (1.3%)       16.94      (2.3%)    0.4% (  -3% -    4%) 0.457
            FilteredAndStopWords       13.70      (1.6%)       13.76      (2.2%)    0.4% (  -3% -    4%) 0.468
         CountFilteredOrHighHigh       27.44      (0.9%)       27.56      (1.7%)    0.4% (  -2% -    3%) 0.305
                         Respell       27.54      (2.1%)       27.66      (2.4%)    0.4% (  -3% -    4%) 0.534
                    CombinedTerm       16.58      (2.9%)       16.66      (3.0%)    0.5% (  -5% -    6%) 0.621
                       OrHighMed       87.71      (2.3%)       88.12      (4.9%)    0.5% (  -6% -    7%) 0.702
                            Term      421.54      (3.5%)      423.51      (5.9%)    0.5% (  -8% -   10%) 0.761
                  FilteredIntNRQ       54.79      (1.9%)       55.09      (2.7%)    0.5% (  -4% -    5%) 0.468
              FilteredOrHighHigh       18.14      (1.8%)       18.24      (3.5%)    0.5% (  -4% -    5%) 0.539
                 DismaxOrHighMed       57.31      (1.9%)       57.65      (5.4%)    0.6% (  -6% -    8%) 0.647
     FilteredAnd2Terms2StopWords       69.42      (1.9%)       69.84      (3.1%)    0.6% (  -4% -    5%) 0.450
               TermDayOfYearSort      317.00      (2.3%)      319.07      (3.9%)    0.7% (  -5% -    6%) 0.515
              FilteredAndHighMed       46.77      (1.5%)       47.11      (2.6%)    0.7% (  -3% -    4%) 0.270
                      OrHighRare      116.92      (4.6%)      117.89      (5.9%)    0.8% (  -9% -   11%) 0.620
                 AndHighOrMedMed       21.55      (2.1%)       21.74      (1.9%)    0.9% (  -3% -    5%) 0.172
                FilteredOr3Terms       52.71      (1.9%)       53.18      (4.2%)    0.9% (  -5% -    7%) 0.386
                  FilteredPhrase       12.77      (1.8%)       12.89      (3.2%)    0.9% (  -4% -    6%) 0.262
               FilteredOrHighMed       50.52      (2.4%)       50.99      (4.5%)    0.9% (  -5% -    8%) 0.416
               CombinedOrHighMed       28.48      (2.3%)       28.76      (4.4%)    1.0% (  -5% -    7%) 0.392
                DismaxOrHighHigh       39.77      (1.9%)       40.16      (3.4%)    1.0% (  -4% -    6%) 0.256
                       And3Terms       84.76      (2.1%)       85.63      (3.6%)    1.0% (  -4% -    6%) 0.272
                        Or3Terms       76.27      (1.3%)       77.08      (3.7%)    1.1% (  -3% -    6%) 0.226
              CombinedAndHighMed       29.04      (2.3%)       29.39      (4.1%)    1.2% (  -5% -    7%) 0.252
                        PKLookup       75.51      (1.3%)       76.44      (3.2%)    1.2% (  -3% -    5%) 0.112
                       CountTerm     2847.75      (5.6%)     2897.40      (8.8%)    1.7% ( -11% -   17%) 0.454
                      OrHighHigh       29.50      (2.0%)       30.17      (2.9%)    2.3% (  -2% -    7%) 0.004
                          IntSet      150.21      (4.2%)      154.28      (4.6%)    2.7% (  -5% -   11%) 0.051
                     AndHighHigh       30.04      (1.9%)       31.63      (3.1%)    5.3% (   0% -   10%) 0.000
                    AndStopWords       10.60      (1.9%)       11.83      (2.0%)   11.6% (   7% -   15%) 0.000
                     OrStopWords       11.41      (2.8%)       13.26      (3.0%)   16.2% (  10% -   22%) 0.000

gf2121 · 2025-07-10T16:36:06Z

Some more data:

Mac M2

                            Task   QPS baseline    StdDev   QPS my_modified_version    StdDev                Pct diff p-value
                       CountTerm    12276.27     (12.1%)    11998.30      (7.3%)   -2.3% ( -19% -   19%) 0.563
                   TermMonthSort     4162.59      (8.8%)     4111.81      (3.4%)   -1.2% ( -12% -   12%) 0.641
                CountAndHighHigh       84.34      (2.6%)       83.54      (2.5%)   -0.9% (  -5% -    4%) 0.342
          CountFilteredOrHighMed       48.75      (4.9%)       48.31      (3.8%)   -0.9% (  -9% -    8%) 0.591
         CountFilteredOrHighHigh       39.65      (4.2%)       39.30      (3.2%)   -0.9% (  -7% -    6%) 0.543

                                                    ...

                      OrHighHigh       48.67     (12.7%)       52.67      (2.7%)    8.2% (  -6% -   27%) 0.023
                    AndStopWords       16.25      (9.7%)       17.63      (4.3%)    8.5% (  -5% -   24%) 0.004
                     AndHighHigh       50.29     (13.5%)       55.32      (2.5%)   10.0% (  -5% -   30%) 0.009
                     OrStopWords       18.18     (10.6%)       20.61      (3.1%)   13.4% (   0% -   30%) 0.000

AVX512 (mentioned above)

                            Task   QPS baseline    StdDev   QPS my_modified_version    StdDev                Pct diff p-value
                 FilteredPrefix3       76.86      (3.2%)       76.20      (5.5%)   -0.9% (  -9% -    8%) 0.552
                         Prefix3       81.74      (3.1%)       81.13      (4.9%)   -0.7% (  -8% -    7%) 0.567
                AndMedOrHighHigh       22.83      (1.8%)       22.78      (2.0%)   -0.3% (  -3% -    3%) 0.670
                   TermMonthSort     1184.75      (2.7%)     1182.73      (8.4%)   -0.2% ( -10% -   11%) 0.931
              CombinedOrHighHigh        7.95      (2.8%)        7.94      (2.7%)   -0.2% (  -5% -    5%) 0.844
             And2Terms2StopWords       66.71      (2.5%)       66.61      (4.6%)   -0.1% (  -7% -    7%) 0.903

                                                    ...
 
                       CountTerm     2847.75      (5.6%)     2897.40      (8.8%)    1.7% ( -11% -   17%) 0.454
                      OrHighHigh       29.50      (2.0%)       30.17      (2.9%)    2.3% (  -2% -    7%) 0.004
                          IntSet      150.21      (4.2%)      154.28      (4.6%)    2.7% (  -5% -   11%) 0.051
                     AndHighHigh       30.04      (1.9%)       31.63      (3.1%)    5.3% (   0% -   10%) 0.000
                    AndStopWords       10.60      (1.9%)       11.83      (2.0%)   11.6% (   7% -   15%) 0.000
                     OrStopWords       11.41      (2.8%)       13.26      (3.0%)   16.2% (  10% -   22%) 0.000

Same AVX512 machine, without --add-modules=jdk.incubator.vector

                            Task   QPS baseline    StdDev   QPS my_modified_version    StdDev                Pct diff p-value
                 FilteredPrefix3       74.47      (3.7%)       73.32      (4.0%)   -1.5% (  -8% -    6%) 0.210
                         Prefix3       79.35      (3.9%)       78.23      (3.5%)   -1.4% (  -8% -    6%) 0.232
                       CountTerm     2921.10      (6.2%)     2897.44      (7.5%)   -0.8% ( -13% -   13%) 0.708
             And2Terms2StopWords       62.09      (1.7%)       61.80      (2.6%)   -0.5% (  -4% -    3%) 0.482

                                                    ...
 
                      OrHighHigh       27.33      (2.4%)       27.66      (2.0%)    1.2% (  -3% -    5%) 0.092
                      OrHighRare      116.98      (3.3%)      118.89      (2.7%)    1.6% (  -4% -    7%) 0.088
                     AndHighHigh       27.52      (2.0%)       28.02      (1.5%)    1.8% (  -1% -    5%) 0.001
                    AndStopWords       10.60      (3.0%)       11.01      (1.7%)    3.8% (   0% -    8%) 0.000
                     OrStopWords       11.26      (3.6%)       11.97      (2.2%)    6.3% (   0% -   12%) 0.000

jpountz · 2025-07-10T21:32:42Z

This is very cool and the speedup makes sense to me. When dynamic pruning is enabled, only queries whose leading clauses are dense benefit significantly from this speedup (OrStopWords and AndStopWords). But if you benchmarked exhaustive evaluation, I'm sure we'd be seeing a bigger speedup on all disjunctive queries that have one dense postings list or more.

Like for #14896, I'd like to split this PR in two: one where we merge your scalar improvements, and then this one where we add support for vectorization. By the way, we may want to look into other approaches for the scalar case. Since we only use bit sets in postings when many bits would be set, a linear scan should perform quite efficiently? (foreach (bit in 0..n) { if bitSet.get(bit) out.append(bit); }) I imagine that you used a micro benchmark to come up with your manual unrolling, let's include this micro benchmark in the PR?
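
The linear scan suggested in pseudocode above could look roughly like the following sketch. It assumes a raw `long[]` rather than a `FixedBitSet`, and the class and method names are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch of the "test every bit" linear scan; not Lucene code.
final class LinearScanBitsetToArray {
  static int linearScan(long[] words, int numBits, int base, int[] out) {
    int count = 0;
    for (int bit = 0; bit < numBits; bit++) {
      // For a long left operand, Java uses only the low 6 bits of the shift
      // distance, so (1L << bit) selects position (bit % 64) within the word.
      if ((words[bit >> 6] & (1L << bit)) != 0) {
        out[count++] = base + bit;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    int[] out = new int[3];
    int n = linearScan(new long[] {0b101L, 0b10L}, 128, 10, out);
    System.out.println(n + " " + Arrays.toString(out));
  }
}
```

Unlike a loop over set bits only, the trip count here is fixed at `numBits`, so the work per call does not depend on how many bits are set; whether the per-bit branch predicts well still depends on the data.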

gf2121 · 2025-07-13T07:13:58Z

JMH results with the vectorized implementations:

Benchmark                                                (bitCount)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.dense                                      5  thrpt    5   9.583 ± 0.238  ops/us
BitsetToArrayBenchmark.dense                                     10  thrpt    5   6.926 ± 0.151  ops/us
BitsetToArrayBenchmark.dense                                     20  thrpt    5   4.597 ± 0.042  ops/us
BitsetToArrayBenchmark.dense                                     30  thrpt    5   3.420 ± 0.033  ops/us
BitsetToArrayBenchmark.dense                                     40  thrpt    5   3.766 ± 0.013  ops/us
BitsetToArrayBenchmark.dense                                     50  thrpt    5   5.299 ± 0.126  ops/us
BitsetToArrayBenchmark.dense                                     60  thrpt    5   8.991 ± 0.223  ops/us
BitsetToArrayBenchmark.denseBranchLess                            5  thrpt    5  13.520 ± 0.132  ops/us
BitsetToArrayBenchmark.denseBranchLess                           10  thrpt    5  13.440 ± 0.575  ops/us
BitsetToArrayBenchmark.denseBranchLess                           20  thrpt    5  13.521 ± 0.289  ops/us
BitsetToArrayBenchmark.denseBranchLess                           30  thrpt    5  13.488 ± 0.641  ops/us
BitsetToArrayBenchmark.denseBranchLess                           40  thrpt    5  13.501 ± 0.375  ops/us
BitsetToArrayBenchmark.denseBranchLess                           50  thrpt    5  13.555 ± 0.384  ops/us
BitsetToArrayBenchmark.denseBranchLess                           60  thrpt    5  13.524 ± 0.498  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                        5  thrpt    5   8.521 ± 0.120  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       10  thrpt    5   6.315 ± 0.164  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       20  thrpt    5  11.531 ± 0.176  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       30  thrpt    5  11.493 ± 0.255  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       40  thrpt    5  11.535 ± 0.018  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       50  thrpt    5  11.539 ± 0.084  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov                       60  thrpt    5   9.100 ± 0.017  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                    5  thrpt    5  15.428 ± 0.155  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   10  thrpt    5  15.424 ± 0.282  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   20  thrpt    5  15.375 ± 0.341  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   30  thrpt    5  15.395 ± 0.121  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   40  thrpt    5  15.308 ± 0.407  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   50  thrpt    5  15.322 ± 0.174  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel                   60  thrpt    5  15.439 ± 0.064  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                   5  thrpt    5  15.795 ± 0.380  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  10  thrpt    5  15.827 ± 0.228  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  20  thrpt    5  15.672 ± 0.991  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  30  thrpt    5  15.789 ± 0.327  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  40  thrpt    5  15.764 ± 0.350  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  50  thrpt    5  15.725 ± 0.393  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling                  60  thrpt    5  15.868 ± 0.028  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                  5  thrpt    5  25.889 ± 0.471  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 10  thrpt    5  25.975 ± 0.129  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 20  thrpt    5  25.852 ± 0.299  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 30  thrpt    5  25.888 ± 0.371  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 40  thrpt    5  25.708 ± 1.028  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 50  thrpt    5  25.856 ± 0.612  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 60  thrpt    5  25.931 ± 0.144  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512               5  thrpt    5  28.221 ± 0.545  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              10  thrpt    5  28.306 ± 0.209  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              20  thrpt    5  26.827 ± 1.704  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              30  thrpt    5  27.027 ± 0.214  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              40  thrpt    5  26.504 ± 0.909  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              50  thrpt    5  25.725 ± 0.084  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512              60  thrpt    5  25.495 ± 1.521  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2           5  thrpt    5   1.137 ± 0.473  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          10  thrpt    5   0.856 ± 0.312  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          20  thrpt    5   0.171 ± 0.091  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          30  thrpt    5   0.159 ± 0.072  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          40  thrpt    5   0.097 ± 0.042  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          50  thrpt    5   0.069 ± 0.021  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized512AVX2          60  thrpt    5   0.068 ± 0.041  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              5  thrpt    5  20.310 ± 0.139  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             10  thrpt    5  20.125 ± 0.352  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             20  thrpt    5  19.961 ± 0.653  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             30  thrpt    5  20.025 ± 1.040  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             40  thrpt    5  20.051 ± 0.556  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             50  thrpt    5  20.128 ± 0.131  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             60  thrpt    5  19.769 ± 2.266  ops/us
BitsetToArrayBenchmark.denseInvert                                5  thrpt    5  19.958 ± 0.355  ops/us
BitsetToArrayBenchmark.denseInvert                               10  thrpt    5  13.497 ± 0.826  ops/us
BitsetToArrayBenchmark.denseInvert                               20  thrpt    5   6.995 ± 0.093  ops/us
BitsetToArrayBenchmark.denseInvert                               30  thrpt    5   4.579 ± 0.035  ops/us
BitsetToArrayBenchmark.denseInvert                               40  thrpt    5   4.447 ± 0.028  ops/us
BitsetToArrayBenchmark.denseInvert                               50  thrpt    5   4.082 ± 0.051  ops/us
BitsetToArrayBenchmark.denseInvert                               60  thrpt    5   6.732 ± 0.145  ops/us
BitsetToArrayBenchmark.forLoop                                    5  thrpt    5  26.332 ± 0.080  ops/us
BitsetToArrayBenchmark.forLoop                                   10  thrpt    5  21.765 ± 0.029  ops/us
BitsetToArrayBenchmark.forLoop                                   20  thrpt    5  15.878 ± 0.247  ops/us
BitsetToArrayBenchmark.forLoop                                   30  thrpt    5  12.606 ± 0.251  ops/us
BitsetToArrayBenchmark.forLoop                                   40  thrpt    5  10.440 ± 0.036  ops/us
BitsetToArrayBenchmark.forLoop                                   50  thrpt    5   8.875 ± 0.164  ops/us
BitsetToArrayBenchmark.forLoop                                   60  thrpt    5   7.735 ± 0.171  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                     5  thrpt    5  26.018 ± 0.586  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    10  thrpt    5  21.031 ± 0.364  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    20  thrpt    5  15.683 ± 0.266  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    30  thrpt    5  12.502 ± 0.056  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    40  thrpt    5  10.330 ± 0.212  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    50  thrpt    5   8.842 ± 0.020  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling                    60  thrpt    5   7.705 ± 0.172  ops/us
BitsetToArrayBenchmark.hybrid                                     5  thrpt    5  25.588 ± 0.491  ops/us
BitsetToArrayBenchmark.hybrid                                    10  thrpt    5  21.151 ± 0.403  ops/us
BitsetToArrayBenchmark.hybrid                                    20  thrpt    5  15.653 ± 0.263  ops/us
BitsetToArrayBenchmark.hybrid                                    30  thrpt    5  12.431 ± 0.027  ops/us
BitsetToArrayBenchmark.hybrid                                    40  thrpt    5  15.414 ± 0.032  ops/us
BitsetToArrayBenchmark.hybrid                                    50  thrpt    5  15.415 ± 0.065  ops/us
BitsetToArrayBenchmark.hybrid                                    60  thrpt    5  15.188 ± 0.806  ops/us
BitsetToArrayBenchmark.whileLoop                                  5  thrpt    5  29.224 ± 0.503  ops/us
BitsetToArrayBenchmark.whileLoop                                 10  thrpt    5  23.237 ± 0.697  ops/us
BitsetToArrayBenchmark.whileLoop                                 20  thrpt    5  16.777 ± 0.278  ops/us
BitsetToArrayBenchmark.whileLoop                                 30  thrpt    5  13.019 ± 0.213  ops/us
BitsetToArrayBenchmark.whileLoop                                 40  thrpt    5  10.700 ± 0.095  ops/us
BitsetToArrayBenchmark.whileLoop                                 50  thrpt    5   9.047 ± 0.015  ops/us
BitsetToArrayBenchmark.whileLoop                                 60  thrpt    5   7.786 ± 0.224  ops/us
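
The benchmark source is not shown in this thread, so the following is only a guess at the general shape of a "branch-less" variant: store unconditionally and advance the write position by the value of the tested bit. The class name and signature are hypothetical. Note that this style needs one spare slot in the output array beyond the number of set bits.

```java
// Hypothetical sketch of a branch-free word-to-array conversion; the actual
// denseBranchLess* benchmark methods may be implemented differently.
final class BranchlessWordToArray {
  /** Requires out.length >= offset + Long.bitCount(word) + 1. */
  static int wordToArray(long word, int base, int[] out, int offset) {
    for (int i = 0; i < Long.SIZE; i++) {
      out[offset] = base + i;              // always store, even if bit i is clear
      offset += (int) ((word >>> i) & 1L); // keep the store only if bit i is set
    }
    return offset; // next free position
  }

  public static void main(String[] args) {
    int[] out = new int[4];
    int end = wordToArray(0b1001L | (1L << 63), 0, out, 0);
    System.out.println(end + " -> " + out[0] + ", " + out[1] + ", " + out[2]);
  }
}
```

Because the loop body has no data-dependent branch, its cost should be largely independent of how many bits are set, which would be consistent with the flat scores of the branch-less rows across bitCount values.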

jpountz · 2025-07-13T07:33:26Z

Thank you for updating the benchmark. I suggest we first figure out how to handle compress() on #14896 before coming back to this PR.

gf2121 · 2025-07-13T07:43:52Z

> I suggest we first figure out how to handle compress() on #14896 before coming back to this PR.

+1, I'm tracking this PR as well.

uschindler · 2025-07-17T15:01:15Z

lucene/benchmark-jmh/src/java/module-info.java

@@ -25,6 +25,7 @@
  requires org.apache.lucene.core;
  requires org.apache.lucene.expressions;
  requires org.apache.lucene.sandbox;
+  requires jdk.incubator.vector;


Please remove this code from the benchmark module and only benchmark code that lives in the java24 source set.

uschindler · 2025-07-17T15:02:13Z

lucene/benchmark-jmh/src/java/org/apache/lucene/benchmark/jmh/BitsetToArrayBenchmark.java

+    }
+  }
+
+  // NOCOMMIT remove vectorized methods and requirement on vector module before merge.


Ah, yes this needs to go away. Our build system found this already. ❤️

uschindler · 2025-07-17T15:15:52Z

lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaBitSetUtil.java

@@ -39,6 +40,7 @@ int word2Array(long word, int base, int[] docs, int offset) {
    return intWord2Array((int) (word >>> 32), docs, offset, base + 32);
  }

+  @SuppressForbidden(reason = "Uses compress only where fast and carefully contained")


The if check described in the forbidden-apis rule is missing.

The check happens before the instance is picked. Do we also need to check it within this method?

lucene/lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaVectorizationProvider.java

Lines 94 to 99 in 2068441

    if (Constants.HAS_FAST_COMPRESS_MASK_CAST
        && PanamaVectorConstants.PREFERRED_VECTOR_BITSIZE >= 256) {
      return PanamaBitSetUtil.INSTANCE;
    } else {
      return BitSetUtil.INSTANCE;
    }

Hi, then it is OK. Please add this info to the @SuppressForbidden reason!

Maybe say: The whole BitSetUtil impl instance is only used when HAS_FAST_COMPRESS_MASK_CAST is enabled.

I just want to make sure that Robert does not complain.

uschindler

Just my final comment: the setup for how the instances are selected looks fine, although I would have preferred an interface instead of subclassing.

So: BitSetUtil as a public (but restricted) interface, with DefaultBitSetUtil as the non-vectorized implementation and PanamaBitSetUtil as the vectorized one, both package-private.
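
As a rough illustration of that layout (not the code in this PR), the sketch below uses the names from the comment above and the `word2Array` signature from the diff. The return convention (next free offset), the selector class, and its parameters are assumptions, and the "Panama" implementation is only a placeholder that delegates to scalar code so that the example runs without the incubator module.

```java
// In the real layout this interface would be public (but restricted);
// it is package-private here only so that the sketch fits in one file.
interface BitSetUtil {
  /** Writes base + i for every set bit i of word; returns the next free offset (assumed). */
  int word2Array(long word, int base, int[] docs, int offset);
}

// Non-vectorized implementation, package-private.
final class DefaultBitSetUtil implements BitSetUtil {
  @Override
  public int word2Array(long word, int base, int[] docs, int offset) {
    while (word != 0L) {
      docs[offset++] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1; // clear the lowest set bit
    }
    return offset;
  }
}

// Placeholder for the vectorized implementation, package-private.
// A real version would use the Vector API instead of delegating.
final class PanamaBitSetUtil implements BitSetUtil {
  private final BitSetUtil scalar = new DefaultBitSetUtil();

  @Override
  public int word2Array(long word, int base, int[] docs, int offset) {
    return scalar.word2Array(word, base, docs, offset);
  }
}

// Hypothetical selector mirroring the provider check quoted earlier in the thread.
final class BitSetUtilSelector {
  static BitSetUtil select(boolean hasFastCompress, int preferredVectorBitSize) {
    return (hasFastCompress && preferredVectorBitSize >= 256)
        ? new PanamaBitSetUtil()
        : new DefaultBitSetUtil();
  }
}
```

With an interface, callers depend only on `BitSetUtil`, and neither implementation inherits behavior from the other.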

jpountz · 2025-07-17T16:01:40Z

lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java

@@ -206,7 +208,8 @@ private static Optional<Module> lookupVectorModule() {
          "org.apache.lucene.util.VectorUtil",
          "org.apache.lucene.codecs.lucene103.Lucene103PostingsReader",
          "org.apache.lucene.codecs.lucene103.PostingIndexInput",
-          "org.apache.lucene.tests.util.TestSysoutsLimits");
+          "org.apache.lucene.tests.util.TestSysoutsLimits",
+          "org.apache.lucene.util.FixedBitSet");


FWIW we should try to avoid expanding this list. Hopefully as per my other comment, we can move bitsetToArray to VectorUtil instead so that this list can stay as-is.

Fully agree.

jpountz · 2025-07-17T16:04:33Z

lucene/core/src/java/org/apache/lucene/internal/vectorization/VectorizationProvider.java

@@ -114,6 +114,8 @@ public static VectorizationProvider getInstance() {
  /** Create a new {@link PostingDecodingUtil} for the given {@link IndexInput}. */
  public abstract PostingDecodingUtil newPostingDecodingUtil(IndexInput input) throws IOException;

+  public abstract BitSetUtil newBitSetUtil();


I would rather have the methods of BitSetUtil in this class. PostingDecodingUtil is a bit different because it requires state (the MemorySegment), which is not the case with BitSetUtil.

Then we can call bitsetToArray from VectorUtil and don't need to add a new class that is allowed to call the vector API.

I agree with this. It is stateless so let's reuse the already existing method. Then we can also have the "if Constant" part there.

You can ignore my other note.

jpountz · 2025-07-17T16:06:54Z

lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaBitSetUtil.java

+
+    for (int i = 0; i < Integer.SIZE; i += INT_SPECIES.length()) {
+      IntVector.fromArray(INT_SPECIES, IDENTITY, i)
+          .add(base)


Should we broadcast base to an IntVector outside of the loop so that it's only done once rather than once per iteration?

jpountz · 2025-07-17T16:14:20Z

lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaBitSetUtil.java

+  private static int intWord2Array(int word, int[] resultArray, int offset, int base) {
+    IntVector bitMask = IntVector.fromArray(INT_SPECIES, IDENTITY_MASK, 0);
+
+    for (int i = 0; i < Integer.SIZE; i += INT_SPECIES.length()) {


I'm a bit uncomfortable with the underlying assumption that INT_SPECIES.length() is a divisor of Integer.SIZE. Can we write the code in a way that doesn't make this assumption or add a check somewhere?
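
Two ways to address this can be sketched in scalar form, with a hypothetical `lanes` parameter standing in for `INT_SPECIES.length()`: either fail fast when the lane count does not divide `Integer.SIZE`, or clamp the last chunk so the loop is correct for any lane count. Neither is taken from the PR.

```java
// Hypothetical scalar emulation of the chunked loop; not the PR's code.
final class LaneChunking {
  /** Option 1: an explicit check, e.g. run once at class initialization. */
  static void requireDivisor(int lanes) {
    if (lanes <= 0 || Integer.SIZE % lanes != 0) {
      throw new IllegalStateException("lane count " + lanes + " does not divide " + Integer.SIZE);
    }
  }

  /** Option 2: clamp the final chunk so that any positive lane count works. */
  static int intWord2Array(int word, int[] out, int offset, int base, int lanes) {
    for (int i = 0; i < Integer.SIZE; i += lanes) {
      int limit = Math.min(i + lanes, Integer.SIZE); // the last chunk may be shorter
      for (int bit = i; bit < limit; bit++) {
        if ((word & (1 << bit)) != 0) {
          out[offset++] = base + bit;
        }
      }
    }
    return offset;
  }
}
```

In actual Vector API code, the clamped tail would typically be expressed with a mask over the valid lanes rather than a shorter inner loop.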

jpountz · 2025-07-17T16:15:36Z

lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaBitSetUtil.java

+
+  @SuppressForbidden(reason = "Uses compress only where fast and carefully contained")
+  private static int intWord2Array(int word, int[] resultArray, int offset, int base) {
+    IntVector bitMask = IntVector.fromArray(INT_SPECIES, IDENTITY_MASK, 0);


I wonder if this should be a private static final field?

jpountz · 2025-07-17T16:17:01Z

lucene/core/src/java24/org/apache/lucene/internal/vectorization/PanamaBitSetUtil.java

+      IntVector.fromArray(INT_SPECIES, IDENTITY, i)
+          .add(base)
+          .compress(bitMask.and(word).compare(VectorOperators.NE, 0))
+          .reinterpretAsInts()


Why is this necessary? Doesn't compress() already return an IntVector?

jpountz · 2025-07-17T16:44:56Z

lucene/core/src/java/org/apache/lucene/internal/vectorization/BitSetUtil.java

-        : "Array length must be at least bitSet.cardinality(from, to) + 1";
+  public final int bitsetToArray(FixedBitSet bitSet, int from, int to, int base, int[] array) {
+    assert bitSet.cardinality(from, to) + 16 <= array.length
+        : "Array length must be at least bitSet.cardinality(from, to) + 16";

    Objects.checkFromToIndex(from, to, bitSet.length());


I wonder if we should refactor this bitsetToArray method to compute the bitCount of each word up-front to reduce dependencies between iterations of this loop (we rely on the result of wordToArray to know the next index at which to start writing data, so we can't start the next iteration of the loop before the current one is finished).

By the way, maybe the micro benchmark should be updated to operate on a FixedBitSet instead of a single word to better capture this sort of thing.
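
A minimal sketch of the "compute bit counts up-front" idea, assuming a plain `long[]` in place of `FixedBitSet`. The class and method names are hypothetical, and the temporary `offsets` array is only there to keep the example short; a real implementation would presumably avoid the allocation. The point is that each word's write position is known before any word is expanded, so one iteration no longer waits on the result of the previous one.

```java
final class PrefixCountSketch {
  /** Writes base + index of every set bit into array and returns the number of entries written. */
  static int bitsetToArray(long[] words, int base, int[] array) {
    // Pass 1: prefix sums of per-word bit counts give each word its start offset.
    int[] offsets = new int[words.length + 1];
    for (int i = 0; i < words.length; i++) {
      offsets[i + 1] = offsets[i] + Long.bitCount(words[i]);
    }
    // Pass 2: every word writes into its own disjoint range of the output.
    for (int i = 0; i < words.length; i++) {
      wordToArray(words[i], base + (i << 6), array, offsets[i]);
    }
    return offsets[words.length];
  }

  private static void wordToArray(long word, int base, int[] array, int offset) {
    int bitCount = Long.bitCount(word);
    for (int i = 0; i < bitCount; i++) {
      array[offset + i] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1; // clear the lowest set bit
    }
  }
}
```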

I understand the idea; I can try. I thought it could be a challenge for the compiler to know that the write ranges do not overlap across iterations.

Here is the result

BitsetToArrayBenchmark.hybrid              256  thrpt    5  5.239 ± 0.202  ops/us
BitsetToArrayBenchmark.hybrid              384  thrpt    5  6.077 ± 0.135  ops/us
BitsetToArrayBenchmark.hybrid              512  thrpt    5  6.259 ± 0.087  ops/us
BitsetToArrayBenchmark.hybrid              768  thrpt    5  5.926 ± 0.078  ops/us
BitsetToArrayBenchmark.hybrid             1024  thrpt    5  4.889 ± 0.056  ops/us
BitsetToArrayBenchmark.hybridUnrolling     256  thrpt    5  4.850 ± 0.054  ops/us
BitsetToArrayBenchmark.hybridUnrolling     384  thrpt    5  5.940 ± 0.086  ops/us
BitsetToArrayBenchmark.hybridUnrolling     512  thrpt    5  6.271 ± 0.098  ops/us
BitsetToArrayBenchmark.hybridUnrolling     768  thrpt    5  5.328 ± 0.106  ops/us
BitsetToArrayBenchmark.hybridUnrolling    1024  thrpt    5  4.174 ± 0.059  ops/us

In case you are interested, here is the full result of this new benchmark.

Benchmark                                       (bitLength)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.dense                            256  thrpt    5   2.152 ± 0.017  ops/us
BitsetToArrayBenchmark.dense                            384  thrpt    5   1.308 ± 0.024  ops/us
BitsetToArrayBenchmark.dense                            512  thrpt    5   1.156 ± 0.020  ops/us
BitsetToArrayBenchmark.dense                            768  thrpt    5   0.991 ± 0.024  ops/us
BitsetToArrayBenchmark.dense                           1024  thrpt    5   0.888 ± 0.020  ops/us
BitsetToArrayBenchmark.denseBranchLess                  256  thrpt    5   5.646 ± 0.050  ops/us
BitsetToArrayBenchmark.denseBranchLess                  384  thrpt    5   3.999 ± 0.044  ops/us
BitsetToArrayBenchmark.denseBranchLess                  512  thrpt    5   3.097 ± 0.085  ops/us
BitsetToArrayBenchmark.denseBranchLess                  768  thrpt    5   2.099 ± 0.017  ops/us
BitsetToArrayBenchmark.denseBranchLess                 1024  thrpt    5   1.622 ± 0.020  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov              256  thrpt    5   3.692 ± 0.032  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov              384  thrpt    5   2.572 ± 0.033  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov              512  thrpt    5   1.970 ± 0.018  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov              768  thrpt    5   0.815 ± 0.015  ops/us
BitsetToArrayBenchmark.denseBranchLessCmov             1024  thrpt    5   0.728 ± 0.008  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel          256  thrpt    5   5.803 ± 0.054  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel          384  thrpt    5   4.106 ± 0.056  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel          512  thrpt    5   3.202 ± 0.033  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel          768  thrpt    5   2.181 ± 0.018  ops/us
BitsetToArrayBenchmark.denseBranchLessParallel         1024  thrpt    5   1.657 ± 0.019  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling         256  thrpt    5   6.157 ± 0.104  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling         384  thrpt    5   4.380 ± 0.042  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling         512  thrpt    5   3.392 ± 0.060  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling         768  thrpt    5   2.354 ± 0.023  ops/us
BitsetToArrayBenchmark.denseBranchLessUnrolling        1024  thrpt    5   1.794 ± 0.012  ops/us
BitsetToArrayBenchmark.denseInvert                      256  thrpt    5   1.865 ± 0.034  ops/us
BitsetToArrayBenchmark.denseInvert                      384  thrpt    5   1.791 ± 0.025  ops/us
BitsetToArrayBenchmark.denseInvert                      512  thrpt    5   1.817 ± 0.018  ops/us
BitsetToArrayBenchmark.denseInvert                      768  thrpt    5   1.904 ± 0.015  ops/us
BitsetToArrayBenchmark.denseInvert                     1024  thrpt    5   1.816 ± 0.016  ops/us
BitsetToArrayBenchmark.forLoop                          256  thrpt    5   5.645 ± 0.081  ops/us
BitsetToArrayBenchmark.forLoop                          384  thrpt    5   6.118 ± 0.073  ops/us
BitsetToArrayBenchmark.forLoop                          512  thrpt    5   6.352 ± 0.068  ops/us
BitsetToArrayBenchmark.forLoop                          768  thrpt    5   5.957 ± 0.081  ops/us
BitsetToArrayBenchmark.forLoop                         1024  thrpt    5   4.931 ± 0.088  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           256  thrpt    5   5.411 ± 0.156  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           384  thrpt    5   5.699 ± 0.238  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           512  thrpt    5   5.585 ± 0.161  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling           768  thrpt    5   4.667 ± 0.101  ops/us
BitsetToArrayBenchmark.forLoopManualUnrolling          1024  thrpt    5   3.999 ± 0.037  ops/us
BitsetToArrayBenchmark.hybrid                           256  thrpt    5   5.239 ± 0.202  ops/us
BitsetToArrayBenchmark.hybrid                           384  thrpt    5   6.077 ± 0.135  ops/us
BitsetToArrayBenchmark.hybrid                           512  thrpt    5   6.259 ± 0.087  ops/us
BitsetToArrayBenchmark.hybrid                           768  thrpt    5   5.926 ± 0.078  ops/us
BitsetToArrayBenchmark.hybrid                          1024  thrpt    5   4.889 ± 0.056  ops/us
BitsetToArrayBenchmark.hybridUnrolling                  256  thrpt    5   4.850 ± 0.054  ops/us
BitsetToArrayBenchmark.hybridUnrolling                  384  thrpt    5   5.940 ± 0.086  ops/us
BitsetToArrayBenchmark.hybridUnrolling                  512  thrpt    5   6.271 ± 0.098  ops/us
BitsetToArrayBenchmark.hybridUnrolling                  768  thrpt    5   5.328 ± 0.106  ops/us
BitsetToArrayBenchmark.hybridUnrolling                 1024  thrpt    5   4.174 ± 0.059  ops/us
BitsetToArrayBenchmark.whileLoop                        256  thrpt    5   3.932 ± 0.021  ops/us
BitsetToArrayBenchmark.whileLoop                        384  thrpt    5   3.779 ± 0.054  ops/us
BitsetToArrayBenchmark.whileLoop                        512  thrpt    5   3.889 ± 0.049  ops/us
BitsetToArrayBenchmark.whileLoop                        768  thrpt    5   3.747 ± 0.083  ops/us
BitsetToArrayBenchmark.whileLoop                       1024  thrpt    5   3.856 ± 0.088  ops/us

It is also interesting that forLoop is much faster than whileLoop. This matches my luceneutil results, whereas the previous benchmark did not show it.
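
For readers without the benchmark source at hand, the two scalar variants probably look roughly like the sketch below. This is an assumption based only on the benchmark names: the exact bodies of `forLoop` and `whileLoop` are not shown in this thread, and `ScalarLoopSketch` is a hypothetical class name. Both methods produce the same output; the difference is that the counted loop has a trip count known up-front from `Long.bitCount`, while the `while` loop's exit depends on the data being mutated inside the loop.

```java
final class ScalarLoopSketch {
  /** Data-dependent exit: loops until no set bits remain. */
  static int whileLoop(long word, int base, int[] array, int offset) {
    while (word != 0L) {
      array[offset++] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1; // clear the lowest set bit
    }
    return offset;
  }

  /** Counted loop: the number of iterations is fixed before the loop starts. */
  static int forLoop(long word, int base, int[] array, int offset) {
    int bitCount = Long.bitCount(word);
    for (int i = 0; i < bitCount; i++) {
      array[offset + i] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1; // clear the lowest set bit
    }
    return offset + bitCount;
  }
}
```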

In addition, here are the benchmarks for the vectorized methods:

Benchmark                                                 (bitLength)   Mode  Cnt   Score   Error   Units
BitsetToArrayBenchmark.denseBranchLessVectorized                  256  thrpt    5  17.648 ± 0.059  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                  384  thrpt    5  14.186 ± 0.049  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                  512  thrpt    5  12.102 ± 0.203  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                  768  thrpt    5   9.585 ± 0.068  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorized                 1024  thrpt    5   7.433 ± 0.093  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              256  thrpt    5  10.885 ± 0.069  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              384  thrpt    5   7.534 ± 0.280  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              512  thrpt    5   6.894 ± 0.019  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2              768  thrpt    5   5.068 ± 0.014  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedAVX2             1024  thrpt    5   3.862 ± 0.109  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong          256  thrpt    5  18.918 ± 0.168  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong          384  thrpt    5  18.055 ± 0.093  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong          512  thrpt    5  17.804 ± 0.079  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong          768  thrpt    5  16.803 ± 0.051  ops/us
BitsetToArrayBenchmark.denseBranchLessVectorizedFromLong         1024  thrpt    5  17.066 ± 0.042  ops/us

jpountz · 2025-07-17T16:47:55Z

FWIW I benchmarked this change on my machine (AVX2), there may be a small speedup for some queries but not as much as previously reported:

                           TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
              FilteredDismaxTerm      159.75      (1.8%)      156.59      (2.0%)   -2.0% (  -5% -    1%) 0.020
                    FilteredTerm      160.05      (1.8%)      157.28      (2.4%)   -1.7% (  -5% -    2%) 0.064
                  FilteredPhrase       31.66      (1.9%)       31.21      (2.4%)   -1.4% (  -5% -    2%) 0.144
        FilteredDismaxOrHighHigh       69.26      (2.1%)       68.44      (3.1%)   -1.2% (  -6% -    4%) 0.324
                            Term      624.66      (6.6%)      617.30      (4.6%)   -1.2% ( -11% -   10%) 0.644
     FilteredAnd2Terms2StopWords      211.38      (1.5%)      209.23      (1.9%)   -1.0% (  -4% -    2%) 0.194
                       CountTerm     9064.76      (1.7%)     8973.82      (3.5%)   -1.0% (  -6% -    4%) 0.413
               FilteredAnd3Terms      187.39      (1.3%)      185.79      (1.3%)   -0.9% (  -3% -    1%) 0.142
              CombinedOrHighHigh       22.75      (1.0%)       22.56      (1.6%)   -0.8% (  -3% -    1%) 0.154
             And2Terms2StopWords      199.52      (1.9%)      197.87      (2.5%)   -0.8% (  -5% -    3%) 0.397
                 AndHighOrMedMed       49.74      (1.6%)       49.33      (1.9%)   -0.8% (  -4% -    2%) 0.296
             CountFilteredPhrase       24.76      (2.0%)       24.57      (2.9%)   -0.8% (  -5% -    4%) 0.487
                AndMedOrHighHigh       85.96      (1.2%)       85.32      (2.4%)   -0.8% (  -4% -    2%) 0.383
                       And3Terms      235.75      (1.9%)      234.03      (2.1%)   -0.7% (  -4% -    3%) 0.410
              Or2Terms2StopWords      200.01      (2.5%)      198.59      (3.1%)   -0.7% (  -6% -    5%) 0.572
                  FilteredIntNRQ      289.06      (0.8%)      287.05      (1.7%)   -0.7% (  -3% -    1%) 0.251
         FilteredDismaxOrHighMed      125.35      (1.9%)      124.53      (2.3%)   -0.7% (  -4% -    3%) 0.486
              FilteredOrHighHigh       65.45      (2.9%)       65.04      (2.0%)   -0.6% (  -5% -    4%) 0.572
                    CombinedTerm       37.78      (1.8%)       37.56      (2.1%)   -0.6% (  -4% -    3%) 0.499
              FilteredAndHighMed      153.64      (1.7%)      152.75      (2.0%)   -0.6% (  -4% -    3%) 0.489
             CombinedAndHighHigh       22.87      (1.0%)       22.75      (1.9%)   -0.6% (  -3% -    2%) 0.412
                      TermDTSort      408.86      (3.1%)      406.81      (4.9%)   -0.5% (  -8% -    7%) 0.783
      FilteredOr2Terms2StopWords      143.31      (2.1%)      142.59      (1.7%)   -0.5% (  -4% -    3%) 0.551
               FilteredOrHighMed      149.58      (1.9%)      148.84      (1.3%)   -0.5% (  -3% -    2%) 0.488
                          OrMany       22.71      (2.2%)       22.60      (2.4%)   -0.5% (  -4% -    4%) 0.627
              CombinedAndHighMed       87.88      (0.7%)       87.45      (1.1%)   -0.5% (  -2% -    1%) 0.232
                      OrHighRare      273.20     (12.1%)      272.03     (11.1%)   -0.4% ( -21% -   25%) 0.934
                FilteredOr3Terms      162.88      (1.8%)      162.22      (1.2%)   -0.4% (  -3% -    2%) 0.545
               CombinedOrHighMed       86.79      (0.6%)       86.45      (1.2%)   -0.4% (  -2% -    1%) 0.352
            FilteredAndStopWords       64.14      (2.8%)       63.95      (3.8%)   -0.3% (  -6% -    6%) 0.844
                  FilteredOrMany       15.94      (2.8%)       15.90      (1.5%)   -0.3% (  -4% -    4%) 0.779
                        Or3Terms      226.68      (2.3%)      226.18      (2.8%)   -0.2% (  -5% -    4%) 0.846
                  CountOrHighMed      359.27      (2.3%)      358.53      (2.1%)   -0.2% (  -4% -    4%) 0.834
                 DismaxOrHighMed      184.76      (2.7%)      184.40      (2.1%)   -0.2% (  -4% -    4%) 0.853
                      DismaxTerm      730.52      (2.8%)      729.16      (2.1%)   -0.2% (  -5% -    4%) 0.868
                 FilteredPrefix3      147.61      (1.7%)      147.38      (2.9%)   -0.2% (  -4% -    4%) 0.883
                 CountAndHighMed      305.37      (1.3%)      305.04      (1.4%)   -0.1% (  -2% -    2%) 0.859
                       OrHighMed      251.29      (3.0%)      251.10      (3.0%)   -0.1% (  -5% -    6%) 0.955
                      AndHighMed      197.92      (2.3%)      197.86      (2.3%)   -0.0% (  -4% -    4%) 0.975
               TermDayOfYearSort      270.57      (6.3%)      270.54      (6.1%)   -0.0% ( -11% -   13%) 0.996
             FilteredOrStopWords       43.96      (3.1%)       43.98      (2.6%)    0.0% (  -5% -    5%) 0.978
                     CountPhrase        4.06      (2.8%)        4.06      (2.1%)    0.1% (  -4% -    5%) 0.946
             FilteredAndHighHigh       77.63      (2.8%)       77.74      (3.0%)    0.1% (  -5% -    6%) 0.918
          CountFilteredOrHighMed      146.93      (0.7%)      147.31      (0.6%)    0.3% (  -1% -    1%) 0.370
                     CountOrMany       28.14      (1.4%)       28.23      (1.2%)    0.3% (  -2% -    2%) 0.613
                DismaxOrHighHigh      126.24      (5.6%)      126.71      (4.8%)    0.4% (  -9% -   11%) 0.874
         CountFilteredOrHighHigh      135.00      (1.0%)      135.57      (0.7%)    0.4% (  -1% -    2%) 0.276
             CountFilteredOrMany       26.38      (1.7%)       26.54      (1.5%)    0.6% (  -2% -    3%) 0.397
                 CountOrHighHigh      332.56      (2.5%)      336.29      (2.4%)    1.1% (  -3% -    6%) 0.307
                CountAndHighHigh      349.06      (2.2%)      353.95      (2.1%)    1.4% (  -2% -    5%) 0.149
                      OrHighHigh       76.02      (3.4%)       77.64      (3.1%)    2.1% (  -4% -    9%) 0.149
                     OrStopWords       47.26      (3.3%)       48.31      (3.7%)    2.2% (  -4% -    9%) 0.162
                    AndStopWords       45.69      (2.5%)       46.71      (3.0%)    2.2% (  -3% -    7%) 0.071
                     AndHighHigh       67.01      (2.5%)       69.40      (3.2%)    3.6% (  -2% -    9%) 0.005

uschindler · 2025-07-17T18:10:38Z

Just my final comment: the way the instances are set up looks fine, although I would have preferred an interface instead of subclassing.

That is, BitSetUtil as a public (but restricted) interface, with DefaultBitSetUtil as the non-vectorized implementation and PanamaBitSetUtil as the vectorized one, both package-private.
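
As a rough illustration of that layout (not the code in this PR), the sketch below uses deliberately different, hypothetical names (`BitSetUtilApi`, `DefaultBitSetUtilSketch`, `PanamaBitSetUtilSketch`, `BitSetUtilLookup`) and a single simplified method. The "Panama" class is only a placeholder that delegates to the scalar logic, since the real one would need the incubating Vector API. The selection condition mirrors the `HAS_FAST_COMPRESS_MASK_CAST` and `PREFERRED_VECTOR_BITSIZE >= 256` check from the PR, passed in here as plain parameters.

```java
/** Hypothetical interface; in the proposal this role would be played by BitSetUtil. */
interface BitSetUtilApi {
  /** Writes base + index of each set bit of word into array at offset; returns the new offset. */
  int wordToArray(long word, int base, int[] array, int offset);
}

/** Non-vectorized implementation. */
final class DefaultBitSetUtilSketch implements BitSetUtilApi {
  @Override
  public int wordToArray(long word, int base, int[] array, int offset) {
    int bitCount = Long.bitCount(word);
    for (int i = 0; i < bitCount; i++) {
      array[offset + i] = base + Long.numberOfTrailingZeros(word);
      word &= word - 1; // clear the lowest set bit
    }
    return offset + bitCount;
  }
}

/** Placeholder for the vectorized implementation; it delegates to scalar code in this sketch. */
final class PanamaBitSetUtilSketch implements BitSetUtilApi {
  private final BitSetUtilApi scalar = new DefaultBitSetUtilSketch();

  @Override
  public int wordToArray(long word, int base, int[] array, int offset) {
    // A real implementation would use compress() on an IntVector here.
    return scalar.wordToArray(word, base, array, offset);
  }
}

/** Picks an implementation once; callers only ever see the interface. */
final class BitSetUtilLookup {
  static BitSetUtilApi lookup(boolean hasFastCompressMaskCast, int preferredVectorBitSize) {
    if (hasFastCompressMaskCast && preferredVectorBitSize >= 256) {
      return new PanamaBitSetUtilSketch();
    }
    return new DefaultBitSetUtilSketch();
  }
}
```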

See Adrien's comment.

gf2121 · 2025-07-18T05:27:18Z

there may be a small speedup for some queries but not as much as previously reported.

I guess your benchmark ran against current main, which includes #14935, while the previous report did not. The following report is what I get on AVX512 now, so I think your result on AVX2 is expected.

                            TaskQPS baseline      StdDevQPS my_modified_version      StdDev                Pct diff p-value
                      OrHighRare      127.60      (6.9%)      124.88      (8.4%)   -2.1% ( -16% -   14%) 0.382
                            Term      483.00      (4.6%)      475.41      (4.9%)   -1.6% ( -10% -    8%) 0.296
              CombinedAndHighMed       31.34      (2.7%)       30.99      (3.1%)   -1.1% (  -6% -    4%) 0.226
                      TermDTSort      230.76      (3.2%)      229.41      (3.3%)   -0.6% (  -6% -    6%) 0.567
                      DismaxTerm      422.26      (3.1%)      420.12      (3.1%)   -0.5% (  -6% -    5%) 0.603
                    FilteredTerm       72.81      (1.7%)       72.52      (2.0%)   -0.4% (  -4% -    3%) 0.505
                  CountOrHighMed      101.54      (1.1%)      101.15      (1.5%)   -0.4% (  -2% -    2%) 0.351
              FilteredOrHighHigh       19.23      (1.0%)       19.16      (1.4%)   -0.3% (  -2% -    2%) 0.358
                      AndHighMed       79.76      (2.2%)       79.50      (3.0%)   -0.3% (  -5% -    4%) 0.687
                AndMedOrHighHigh       24.19      (1.6%)       24.12      (1.9%)   -0.3% (  -3% -    3%) 0.578
                       OrHighMed       97.86      (2.0%)       97.58      (2.6%)   -0.3% (  -4% -    4%) 0.688
                 CountOrHighHigh       68.53      (0.8%)       68.36      (1.1%)   -0.2% (  -2% -    1%) 0.422
                    CombinedTerm       17.09      (2.9%)       17.05      (3.2%)   -0.2% (  -6% -    6%) 0.808
               CombinedOrHighMed       30.64      (2.2%)       30.57      (3.6%)   -0.2% (  -5% -    5%) 0.810
             CombinedAndHighHigh        8.45      (2.7%)        8.43      (2.6%)   -0.2% (  -5% -    5%) 0.787
              FilteredAndHighMed       50.35      (1.3%)       50.26      (1.5%)   -0.2% (  -2% -    2%) 0.687
                       And3Terms       98.50      (3.7%)       98.32      (4.6%)   -0.2% (  -8% -    8%) 0.891
                  FilteredPhrase       14.27      (2.1%)       14.24      (1.9%)   -0.2% (  -4% -    3%) 0.791
               FilteredOrHighMed       54.69      (1.4%)       54.60      (1.4%)   -0.2% (  -2% -    2%) 0.722
                CountAndHighHigh       67.35      (0.9%)       67.26      (1.2%)   -0.1% (  -2% -    2%) 0.697
                        Wildcard       56.31      (3.3%)       56.26      (3.3%)   -0.1% (  -6% -    6%) 0.924
                          Fuzzy1       37.10      (2.2%)       37.07      (2.5%)   -0.1% (  -4% -    4%) 0.927
          CountFilteredOrHighMed       34.07      (0.4%)       34.05      (0.6%)   -0.1% (  -1% -    0%) 0.688
                         Respell       29.69      (2.6%)       29.67      (2.4%)   -0.1% (  -4% -    5%) 0.939
         CountFilteredOrHighHigh       28.53      (0.5%)       28.52      (0.7%)   -0.0% (  -1% -    1%) 0.817
                          Fuzzy2       33.99      (2.2%)       34.00      (2.4%)    0.0% (  -4% -    4%) 0.985
                 AndHighOrMedMed       22.20      (2.0%)       22.20      (2.0%)    0.0% (  -3% -    4%) 0.982
                 CountAndHighMed       96.12      (1.2%)       96.13      (1.4%)    0.0% (  -2% -    2%) 0.967
                       CountTerm     4947.81      (3.3%)     4949.52      (3.7%)    0.0% (  -6% -    7%) 0.975
                 DismaxOrHighMed       65.72      (2.0%)       65.78      (2.4%)    0.1% (  -4% -    4%) 0.897
                  FilteredIntNRQ       60.66      (0.9%)       60.72      (1.2%)    0.1% (  -2% -    2%) 0.791
               TermDayOfYearSort      407.35      (2.2%)      407.88      (2.3%)    0.1% (  -4% -    4%) 0.855
             CountFilteredIntNRQ       25.92      (0.4%)       25.95      (0.5%)    0.1% (   0% -    1%) 0.324
                FilteredOr3Terms       57.49      (1.2%)       57.60      (1.1%)    0.2% (  -2% -    2%) 0.599
                          IntNRQ       61.25      (1.0%)       61.39      (1.2%)    0.2% (  -1% -    2%) 0.526
             FilteredAndHighHigh       17.66      (1.1%)       17.70      (0.8%)    0.2% (  -1% -    2%) 0.455
             FilteredOrStopWords       12.28      (1.8%)       12.31      (1.9%)    0.2% (  -3% -    4%) 0.677
      FilteredOr2Terms2StopWords       65.25      (1.1%)       65.43      (1.0%)    0.3% (  -1% -    2%) 0.408
                        Or3Terms       89.77      (4.0%)       90.03      (4.7%)    0.3% (  -8% -    9%) 0.829
                          Phrase       10.62      (3.2%)       10.65      (3.3%)    0.3% (  -5% -    7%) 0.753
                DismaxOrHighHigh       44.42      (2.5%)       44.57      (2.8%)    0.3% (  -4% -    5%) 0.696
                         Prefix3       97.90      (3.1%)       98.26      (3.2%)    0.4% (  -5% -    6%) 0.712
     FilteredAnd2Terms2StopWords       76.73      (1.7%)       77.03      (2.3%)    0.4% (  -3% -    4%) 0.536
            FilteredAndStopWords       14.35      (1.2%)       14.41      (1.0%)    0.4% (  -1% -    2%) 0.209
               FilteredAnd3Terms      116.13      (1.1%)      116.65      (1.3%)    0.4% (  -1% -    2%) 0.236
                 FilteredPrefix3       91.42      (3.0%)       91.84      (2.9%)    0.5% (  -5% -    6%) 0.620
                          IntSet      167.56      (5.1%)      168.35      (5.0%)    0.5% (  -9% -   11%) 0.767
              Or2Terms2StopWords       77.52      (2.5%)       77.94      (3.0%)    0.5% (  -4% -    6%) 0.536
             And2Terms2StopWords       73.71      (2.6%)       74.12      (3.0%)    0.6% (  -4% -    6%) 0.536
                   TermMonthSort     1991.89      (2.0%)     2008.30      (2.8%)    0.8% (  -3% -    5%) 0.281
                      OrHighHigh       32.79      (4.1%)       33.08      (4.5%)    0.9% (  -7% -    9%) 0.520
              CombinedOrHighHigh        8.25      (3.1%)        8.32      (3.2%)    0.9% (  -5% -    7%) 0.360
                   TermTitleSort       72.91      (1.9%)       73.66      (3.1%)    1.0% (  -3% -    6%) 0.211
             CountFilteredPhrase       12.76      (1.6%)       12.91      (1.7%)    1.1% (  -2% -    4%) 0.036
                     AndHighHigh       32.95      (3.0%)       33.61      (3.9%)    2.0% (  -4% -    9%) 0.067
                    AndStopWords       12.52      (4.9%)       13.05      (6.1%)    4.2% (  -6% -   15%) 0.015
                     OrStopWords       13.62      (4.4%)       14.58      (5.4%)    7.1% (  -2% -   17%) 0.000

gf2121 · 2025-07-18T10:47:44Z

FWIW I benchmarked this change on my machine (AVX2), there may be a small speedup for some queries but not as much as previously reported.

According to this report, this optimization only brings noticeable improvements on AVX512, so I am now actually a bit hesitant to move on.

uschindler · 2025-07-19T19:53:36Z

Much better now. I am still not sure if this is all worth the complexity!

vectorize_bitset_to_array

7c7b333

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Jul 7, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Jul 7, 2025

gf2121 marked this pull request as draft July 7, 2025 12:22

iter

08d7a73

iter

505562f

github-actions bot added the module:core/codecs label Jul 10, 2025

gf2121 added 3 commits July 10, 2025 23:43

iter

ab79f5e

iter

0f13d7a

iter

82a609c

license

df47bdd

gf2121 marked this pull request as ready for review July 10, 2025 16:36

gf2121 mentioned this pull request Jul 11, 2025

Optimize bitset to array #14935

Merged

gf2121 added 2 commits July 13, 2025 14:22

Merge remote-tracking branch 'origin/main' into vectorize_bitset_to_array

52d80c7

follow another PR

7f9eddb

Merge remote-tracking branch 'origin/main' into vectorize_bitset_to_array

0e49d07

iter

5b3a859

fix

98a1012

fix

67a0e85

CHANGES

b9cef73

github-actions bot added this to the 10.3.0 milestone Jul 17, 2025

iter

d277fa4

uschindler requested changes Jul 17, 2025

iter

2068441

uschindler reviewed Jul 17, 2025

jpountz reviewed Jul 17, 2025

benchmark

7d50184

gf2121 added 7 commits July 18, 2025 18:52

unused var

b2727dd

add back vector benchmarks

6711f3d

add back vector benchmarks

fdba5e1

iter

d694b51

fromLong supported in AVX2, it is convert not supported

4959f9e

for benchmark only

5fb85ab

iter

17fc767

gf2121 marked this pull request as draft July 19, 2025 13:22

gf2121 added 2 commits July 20, 2025 03:11

move to VectorUtil

b8dd168

Merge remote-tracking branch 'origin/main' into vectorize_bitset_to_array

f6f52d7

    if (Constants.HAS_FAST_COMPRESS_MASK_CAST
        && PanamaVectorConstants.PREFERRED_VECTOR_BITSIZE >= 256) {
      return PanamaBitSetUtil.INSTANCE;
    } else {
      return BitSetUtil.INSTANCE;
    }
