Use Vector API to decode BKD docIds #14203

gf2121 · 2025-02-06T07:55:37Z

Context: #14176

I found that with a constant block size (512), the JIT can auto-vectorize the decoding loop. It does not do so when the block size becomes variable, which can happen in real BKD leaves. This PR proposes using the Vector API to decode docIds in BKD.
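For readers unfamiliar with the kind of loop in question, here is a minimal sketch of a 16-bits-per-value decode. It is not the actual Lucene code: the class name `Bpv16Sketch`, the `encode` helper, and the layout (each 32-bit word holds one doc ID in its high half and one in its low half, with `count` assumed even) are assumptions made for illustration. The point is that `count` is the loop bound, so the benchmark's constant vs. variable count changes what the JIT knows about the trip count.

```java
import java.util.Arrays;

/** Hypothetical sketch of a bpv=16 decode loop; not the Lucene DocIdsWriter code. */
final class Bpv16Sketch {

  /** Packs count 16-bit doc IDs into count/2 ints (count assumed even). */
  static int[] encode(int[] docIds, int count) {
    int half = count >>> 1;
    int[] packed = new int[half];
    for (int i = 0; i < half; i++) {
      // High 16 bits: docIds[i]; low 16 bits: docIds[half + i].
      packed[i] = (docIds[i] << 16) | (docIds[half + i] & 0xFFFF);
    }
    return packed;
  }

  /** Unpacks count doc IDs; the trip count depends on the count argument. */
  static void decode(int[] packed, int count, int[] docIds) {
    int half = count >>> 1;
    for (int i = 0; i < half; i++) {
      int word = packed[i];
      docIds[i] = word >>> 16;
      docIds[half + i] = word & 0xFFFF;
    }
  }

  public static void main(String[] args) {
    int[] docs = {1, 65535, 42, 7, 300, 0};
    int[] out = new int[docs.length];
    decode(encode(docs, docs.length), docs.length, out);
    System.out.println(Arrays.equals(docs, out)); // prints true
  }
}
```

Whether a loop like this gets vectorized depends on the JVM version, the CPU, and how the method is inlined, so the benchmark numbers below remain the only evidence for this PR.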

MAC M2

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score   Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   85.316 ± 2.181  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  208.971 ± 2.734  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   85.752 ± 2.129  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  147.652 ± 1.786  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  186.534 ± 2.376  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  213.891 ± 4.671  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  140.298 ± 2.189  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  134.398 ± 1.640  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   87.278 ± 1.432  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  201.612 ± 3.277  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   87.148 ± 1.704  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   84.830 ± 8.852  ops/ms

Linux X86 (AVX512 supported)

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   27.711 ±  2.777  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  132.859 ± 16.914  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   34.672 ±  5.730  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   33.017 ±  5.080  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   99.538 ± 11.813  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  107.525 ± 11.693  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   69.268 ± 10.351  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   64.134 ±  7.790  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   27.531 ±  3.810  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  125.707 ±  9.652  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   22.528 ±  4.724  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   23.903 ±  3.505  ops/ms

gf2121 · 2025-02-07T09:16:09Z

E2E result on Mac M2 is a bit disappointing:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                          IntSet      834.36      (3.4%)      835.17      (3.8%)    0.1% (  -6% -    7%) 0.933
             CountFilteredIntNRQ       90.86      (2.2%)       92.56      (3.2%)    1.9% (  -3% -    7%) 0.033
                      TermDTSort      201.03      (6.7%)      206.65      (4.9%)    2.8% (  -8% -   15%) 0.131
                          IntNRQ      147.82      (2.0%)      151.98      (2.9%)    2.8% (  -2% -    7%) 0.000
                  FilteredIntNRQ      145.10      (2.9%)      150.02      (3.1%)    3.4% (  -2% -    9%) 0.000
               TermDayOfYearSort      200.93      (5.7%)      208.48      (4.0%)    3.8% (  -5% -   14%) 0.016

The profile suggests the bottleneck is FixedBitSet#set rather than decoding.
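To make the bottleneck claim concrete, here is a minimal sketch of what setting one bit per decoded doc ID in a `long[]`-backed bitset involves. The class and method names are hypothetical and this is not Lucene's source. Each iteration is a read-modify-write on a word whose index comes from the data. It is only an assumption, not something the profile in this thread establishes, that this per-doc work is why faster decoding barely shows up end to end.

```java
/** Hypothetical sketch of per-doc bit setting in a long[]-backed bitset. */
final class BitSetSketch {

  /** Sets the bit for each of the first count doc IDs. */
  static void setAll(long[] bits, int[] docIds, int count) {
    for (int i = 0; i < count; i++) {
      int doc = docIds[i];
      // Word index is doc / 64; Java uses only the low 6 bits of doc
      // as the shift distance for a long, i.e. doc % 64.
      bits[doc >> 6] |= 1L << doc;
    }
  }

  /** Returns whether the bit for doc is set. */
  static boolean get(long[] bits, int doc) {
    return (bits[doc >> 6] & (1L << doc)) != 0;
  }
}
```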

gf2121 · 2025-02-08T05:26:07Z

On an AVX-512 Linux X86 machine:

                            Task    QPS baseline      StdDev    QPS my_modified_version      StdDev                Pct diff p-value
                          IntSet      247.99      (4.1%)      244.08      (2.5%)   -1.6% (  -7% -    5%) 0.143
                      TermDTSort       82.84      (6.5%)       83.46      (8.8%)    0.8% ( -13% -   17%) 0.759
               TermDayOfYearSort       83.58      (4.6%)       85.12      (6.3%)    1.8% (  -8% -   13%) 0.290
             CountFilteredIntNRQ       38.61      (2.9%)       42.32      (2.6%)    9.6% (   4% -   15%) 0.000
                  FilteredIntNRQ       64.02      (3.6%)       75.48      (3.7%)   17.9% (  10% -   26%) 0.000
                          IntNRQ       66.28      (4.5%)       79.10      (2.9%)   19.3% (  11% -   28%) 0.000

jpountz · 2025-02-10T20:52:55Z

Thanks for looking into it. Were you able to confirm that the difference with the variable count is indeed auto-vectorization not kicking in, as opposed to something else such as different loop unrolling? I'm curious whether you can compare the produced assembly and/or trick the JVM into generating more efficient code by writing the loop a bit differently, e.g. by having a fixed-size inner loop?

gf2121 · 2025-02-11T15:43:09Z

Thanks for the feedback! I implemented the fixed-size inner loop and printed the assembly for all variants: perf_asm.log

With profiling enabled, #current countVariable=true and #current countVariable=false run at the same (slow) speed. It seems that profiling prevents some optimizations.
According to the assembly, #current bpv=16 does not get auto-vectorized. #current bpv=24 gets vectorized on the shift loop, but not on the remainder loop.
According to the assembly, #innerLoop gets auto-vectorized, but it is slower than the Vector API.
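The fixed-size inner loop idea can be sketched as follows. This is not the code from the PR: the class name, the block size of 64, and the use of a plain `>>> 8` shift as the loop body are assumptions chosen to keep the example small. The outer loop walks the variable `count` in blocks, the inner loop has a compile-time-constant trip count, and a scalar tail handles what is left.

```java
/** Hypothetical sketch of a fixed-size inner loop over a variable count. */
final class InnerLoopSketch {

  /** Assumed block size; the value used in the PR is not stated in this thread. */
  static final int BLOCK = 64;

  /** Straightforward loop whose trip count is the variable count. */
  static void shiftPlain(int[] scratch, int[] docIds, int count) {
    for (int i = 0; i < count; i++) {
      docIds[i] = scratch[i] >>> 8;
    }
  }

  /** Same result, but the hot inner loop always runs exactly BLOCK times. */
  static void shiftBlocked(int[] scratch, int[] docIds, int count) {
    int i = 0;
    for (int limit = count - BLOCK; i <= limit; i += BLOCK) {
      for (int j = 0; j < BLOCK; j++) {
        docIds[i + j] = scratch[i + j] >>> 8;
      }
    }
    // Scalar tail for the final count % BLOCK values.
    for (; i < count; i++) {
      docIds[i] = scratch[i] >>> 8;
    }
  }
}
```

A constant trip count gives the JIT a loop shape it can unroll or vectorize without knowing `count`, but as the numbers below show, that does not guarantee code as fast as the explicit Vector API version.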

MAC M2

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5  103.490 ±  6.785  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  212.488 ±  5.383  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   91.203 ±  1.023  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  149.742 ±  1.953  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  213.162 ±  1.598  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  216.529 ±  2.518  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  153.970 ±  1.101  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  140.103 ±  3.001  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5  129.281 ±  0.471  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5  131.083 ±  8.775  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   99.597 ±  2.850  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   96.235 ± 14.875  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5  104.314 ±  0.557  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  202.175 ± 10.863  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   86.016 ±  1.315  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   85.609 ±  5.733  ops/ms

Linux X86 AVX512 profiling disabled

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   41.138 ±  1.770  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  142.277 ±  0.943  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   43.104 ±  0.066  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   42.760 ±  0.496  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   86.565 ±  0.904  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5   86.624 ±  0.395  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   80.064 ±  2.604  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   76.638 ± 18.692  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5   43.810 ±  1.096  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5   42.485 ±  0.073  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   37.255 ±  0.994  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   37.243 ±  0.593  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   41.415 ±  0.079  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  145.713 ±  0.381  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   27.758 ±  4.210  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   28.519 ±  1.839  ops/ms

Linux X86 AVX512 profiling enabled

Benchmark                            (bpv)  (countVariable)   Mode  Cnt   Score   Error   Units
BKDCodecBenchmark.current               16             true  thrpt    5  29.878 ± 0.130  ops/ms
BKDCodecBenchmark.current:asm           16             true  thrpt          NaN             ---
BKDCodecBenchmark.current               16            false  thrpt    5  29.314 ± 0.229  ops/ms
BKDCodecBenchmark.current:asm           16            false  thrpt          NaN             ---
BKDCodecBenchmark.current               24             true  thrpt    5  34.874 ± 0.320  ops/ms
BKDCodecBenchmark.current:asm           24             true  thrpt          NaN             ---
BKDCodecBenchmark.current               24            false  thrpt    5  33.987 ± 0.055  ops/ms
BKDCodecBenchmark.current:asm           24            false  thrpt          NaN             ---
BKDCodecBenchmark.currentVector         16             true  thrpt    5  79.717 ± 5.983  ops/ms
BKDCodecBenchmark.currentVector:asm     16             true  thrpt          NaN             ---
BKDCodecBenchmark.currentVector         16            false  thrpt    5  81.924 ± 3.799  ops/ms
BKDCodecBenchmark.currentVector:asm     16            false  thrpt          NaN             ---
BKDCodecBenchmark.currentVector         24             true  thrpt    5  65.615 ± 8.901  ops/ms
BKDCodecBenchmark.currentVector:asm     24             true  thrpt          NaN             ---
BKDCodecBenchmark.currentVector         24            false  thrpt    5  74.759 ± 2.173  ops/ms
BKDCodecBenchmark.currentVector:asm     24            false  thrpt          NaN             ---
BKDCodecBenchmark.innerLoop             16             true  thrpt    5  40.869 ± 3.407  ops/ms
BKDCodecBenchmark.innerLoop:asm         16             true  thrpt          NaN             ---
BKDCodecBenchmark.innerLoop             16            false  thrpt    5  41.825 ± 1.644  ops/ms
BKDCodecBenchmark.innerLoop:asm         16            false  thrpt          NaN             ---
BKDCodecBenchmark.innerLoop             24             true  thrpt    5  37.251 ± 3.447  ops/ms
BKDCodecBenchmark.innerLoop:asm         24             true  thrpt          NaN             ---
BKDCodecBenchmark.innerLoop             24            false  thrpt    5  37.419 ± 1.238  ops/ms
BKDCodecBenchmark.innerLoop:asm         24            false  thrpt          NaN             ---
BKDCodecBenchmark.legacy                16             true  thrpt    5  28.477 ± 3.747  ops/ms
BKDCodecBenchmark.legacy:asm            16             true  thrpt          NaN             ---
BKDCodecBenchmark.legacy                16            false  thrpt    5  29.838 ± 0.163  ops/ms
BKDCodecBenchmark.legacy:asm            16            false  thrpt          NaN             ---
BKDCodecBenchmark.legacy                24             true  thrpt    5  28.295 ± 1.224  ops/ms
BKDCodecBenchmark.legacy:asm            24             true  thrpt          NaN             ---
BKDCodecBenchmark.legacy                24            false  thrpt    5  27.915 ± 0.911  ops/ms
BKDCodecBenchmark.legacy:asm            24            false  thrpt          NaN             ---

jpountz · 2025-02-11T23:10:03Z

> #current bpv=24 gets vectorized on the shift loop, but not for the remainder loop.

This is an interesting observation. I wonder if a small refactoring could help it get auto-vectorized? E.g. what if we applied the 0xFF mask to scratch in the shift loop rather than the remainder loop? Or if we split the remainder loop into 3 loops, one for each 8 bits that get contributed to the value?

Sorry for pushing, but if we could get auto-vectorization to do the right thing, then this would automatically benefit all users, not only those who enable the vector module.
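The suggestion above to split the remainder loop into three loops can be sketched like this. The layout is an assumption for illustration, not confirmed by this thread: `scratch` holds `3 * quarter` ints, the top 24 bits of each int are one doc ID (the shift loop), and the low 8 bits of three ints at stride `quarter` combine into one more doc ID (the remainder loop). Names are hypothetical, and any tail for counts that are not a multiple of 4 is omitted.

```java
/** Hypothetical sketch: single-pass vs. three-pass remainder loop for bpv=24. */
final class RemainderSplitSketch {

  /** Shift loop followed by one remainder loop that combines three bytes at once. */
  static void decodeSinglePass(int[] scratch, int quarter, int[] docIds) {
    int q3 = quarter * 3;
    for (int i = 0; i < q3; i++) {
      docIds[i] = scratch[i] >>> 8;
    }
    for (int i = 0; i < quarter; i++) {
      docIds[q3 + i] =
          ((scratch[i] & 0xFF) << 16)
              | ((scratch[quarter + i] & 0xFF) << 8)
              | (scratch[2 * quarter + i] & 0xFF);
    }
  }

  /** Same shift loop, but the remainder is built by three simple loops. */
  static void decodeThreePasses(int[] scratch, int quarter, int[] docIds) {
    int q3 = quarter * 3;
    for (int i = 0; i < q3; i++) {
      docIds[i] = scratch[i] >>> 8;
    }
    // Each pass contributes 8 bits and reads a single contiguous range.
    for (int i = 0; i < quarter; i++) {
      docIds[q3 + i] = (scratch[i] & 0xFF) << 16;
    }
    for (int i = 0; i < quarter; i++) {
      docIds[q3 + i] |= (scratch[quarter + i] & 0xFF) << 8;
    }
    for (int i = 0; i < quarter; i++) {
      docIds[q3 + i] |= scratch[2 * quarter + i] & 0xFF;
    }
  }
}
```

Both methods produce identical output. Whether the three-pass form actually gets auto-vectorized, and whether the extra passes over `docIds` pay for themselves, would need to be checked in the assembly and the JMH benchmark.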

gf2121 added 4 commits January 28, 2025 10:30

bpv24

97af1d2

only reduce virtual call

617cec7

iter

c72f9f4

jmh

4446855

github-actions bot added the module:core/other label Feb 6, 2025

gf2121 mentioned this pull request Feb 6, 2025

Reduce virtual calls when visiting bpv24-encoded doc ids in BKD leaves #14176

Merged

gf2121 added module:core/codecs and removed module:core/other labels Feb 6, 2025

iter

b031449

github-actions bot added module:core/other and removed module:core/codecs labels Feb 6, 2025

gf2121 added 2 commits February 6, 2025 18:39

stash

e7a3056

e2e benchmark

10511a2

gf2121 force-pushed the vector_bpv24 branch from b0a2983 to 10511a2 Compare February 7, 2025 08:54

bwc issue

bb1b923

github-actions bot added module:core/codecs and removed module:core/other labels Feb 8, 2025

gf2121 added 8 commits February 8, 2025 15:45

jmh fix

28cb597

fix

62c214f

iter

d977c02

license

c5653f6

iter

9bf0870

add license

3c91bf3

add java doc

a474043

private

5c4e1e9

gf2121 changed the title ~~[WIP] Introduce bpv24 vectorized decoding for DocIdsWriter~~ Introduce bpv24 vectorized decoding for DocIdsWriter Feb 8, 2025

gf2121 requested review from jpountz and iverase February 8, 2025 09:19

gf2121 added 2 commits February 8, 2025 17:26

simplify

6f8fc2b

add CHANGES

654f0a6

gf2121 changed the title ~~Introduce bpv24 vectorized decoding for DocIdsWriter~~ Use Vector API to decode BKD docIds Feb 8, 2025

gf2121 and others added 6 commits February 8, 2025 17:49

iter

6273aa9

Merge branch 'main' into vector_bpv24

285ae58

iter

5ee86c9

iter

decf31b

iter

114d677

iter

4b5e2f2

inner loop

ae65f61
