Use Vector API to decode BKD docIds #14203

Open
wants to merge 25 commits into
base: main

gf2121
Contributor

@gf2121 gf2121 commented Feb 6, 2025

Context: #14176

I found that when running with a constant block size (512), the JIT can auto-vectorize the decoding loop. This no longer happens when the block size becomes variable, which can be the case in real BKD leaves. This PR proposes using the Vector API to decode doc IDs in BKD.
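For readers unfamiliar with the loop in question, here is a minimal scalar sketch of a 16-bits-per-value decode with a caller-supplied count. It is an illustration only, not the code from this PR: the class name `Bpv16Sketch`, the method names, and the packing layout (two 16-bit doc IDs per 32-bit word, first half of the block in the high bits) are assumptions made for the example.

```java
import java.util.Arrays;

final class Bpv16Sketch {
  // Assumed layout: word i holds docIds[i] in its high 16 bits and
  // docIds[half + i] in its low 16 bits; an odd trailing value gets its own word.
  static int[] encode(int[] docIds, int count) {
    int half = count >> 1;
    int[] packed = new int[half + (count & 1)];
    for (int i = 0; i < half; i++) {
      packed[i] = (docIds[i] << 16) | (docIds[half + i] & 0xFFFF);
    }
    if ((count & 1) == 1) {
      packed[half] = docIds[count - 1];
    }
    return packed;
  }

  // The trip count depends on `count`, which is the "variable block size" case.
  // Calling this with a compile-time constant such as 512 is the "constant" case.
  static void decode(int[] packed, int count, int[] docIds) {
    int half = count >> 1;
    for (int i = 0; i < half; i++) {
      docIds[i] = packed[i] >>> 16;
      docIds[half + i] = packed[i] & 0xFFFF;
    }
    if ((count & 1) == 1) {
      docIds[count - 1] = packed[half];
    }
  }

  public static void main(String[] args) {
    int[] docIds = {1, 65535, 42, 7, 300};
    int[] decoded = new int[docIds.length];
    decode(encode(docIds, docIds.length), docIds.length, decoded);
    System.out.println(Arrays.toString(decoded));
  }
}
```

The benchmark parameter `countVariable` presumably toggles between these two ways of supplying the count; the sketch only shows the loop shape and makes no claim about what the JIT does with it.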

Mac M2

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score   Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   85.316 ± 2.181  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  208.971 ± 2.734  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   85.752 ± 2.129  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  147.652 ± 1.786  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  186.534 ± 2.376  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  213.891 ± 4.671  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  140.298 ± 2.189  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  134.398 ± 1.640  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   87.278 ± 1.432  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  201.612 ± 3.277  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   87.148 ± 1.704  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   84.830 ± 8.852  ops/ms

Linux X86 (AVX512 supported)

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   27.711 ±  2.777  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  132.859 ± 16.914  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   34.672 ±  5.730  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   33.017 ±  5.080  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   99.538 ± 11.813  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  107.525 ± 11.693  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   69.268 ± 10.351  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   64.134 ±  7.790  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   27.531 ±  3.810  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  125.707 ±  9.652  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   22.528 ±  4.724  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   23.903 ±  3.505  ops/ms

@gf2121
Contributor Author

gf2121 commented Feb 7, 2025

The end-to-end (E2E) result on Mac M2 is a bit disappointing:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                          IntSet      834.36      (3.4%)      835.17      (3.8%)    0.1% (  -6% -    7%) 0.933
             CountFilteredIntNRQ       90.86      (2.2%)       92.56      (3.2%)    1.9% (  -3% -    7%) 0.033
                      TermDTSort      201.03      (6.7%)      206.65      (4.9%)    2.8% (  -8% -   15%) 0.131
                          IntNRQ      147.82      (2.0%)      151.98      (2.9%)    2.8% (  -2% -    7%) 0.000
                  FilteredIntNRQ      145.10      (2.9%)      150.02      (3.1%)    3.4% (  -2% -    9%) 0.000
               TermDayOfYearSort      200.93      (5.7%)      208.48      (4.0%)    3.8% (  -5% -   14%) 0.016

The profile suggests the bottleneck is FixedBitSet#set rather than decoding.
[image: profiler output]

@gf2121
Contributor Author

gf2121 commented Feb 8, 2025

On an AVX-512 Linux x86 machine:

                            Task  QPS baseline      StdDev  QPS my_modified_version      StdDev                Pct diff p-value
                          IntSet      247.99      (4.1%)      244.08      (2.5%)   -1.6% (  -7% -    5%) 0.143
                      TermDTSort       82.84      (6.5%)       83.46      (8.8%)    0.8% ( -13% -   17%) 0.759
               TermDayOfYearSort       83.58      (4.6%)       85.12      (6.3%)    1.8% (  -8% -   13%) 0.290
             CountFilteredIntNRQ       38.61      (2.9%)       42.32      (2.6%)    9.6% (   4% -   15%) 0.000
                  FilteredIntNRQ       64.02      (3.6%)       75.48      (3.7%)   17.9% (  10% -   26%) 0.000
                          IntNRQ       66.28      (4.5%)       79.10      (2.9%)   19.3% (  11% -   28%) 0.000

@gf2121 gf2121 changed the title from "[WIP] Introduce bpv24 vectorized decoding for DocIdsWriter" to "Introduce bpv24 vectorized decoding for DocIdsWriter" Feb 8, 2025
@gf2121 gf2121 requested review from jpountz and iverase February 8, 2025 09:19
@gf2121 gf2121 changed the title from "Introduce bpv24 vectorized decoding for DocIdsWriter" to "Use Vector API to decode BKD docIds" Feb 8, 2025
@jpountz
Contributor

jpountz commented Feb 10, 2025

Thanks for looking into it. Were you able to confirm that the difference with the variable count is indeed due to auto-vectorization not kicking in, as opposed to something else such as different loop unrolling? I'm curious whether you could compare the generated assembly and/or trick the JVM into generating more efficient code by writing the loop a bit differently, e.g. by having a fixed-size inner loop?
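One way to read the "fixed-size inner loop" idea is sketched below: the variable-length range is processed in chunks whose trip count is a compile-time constant, followed by a scalar tail. This is only an assumption about what such a rewrite could look like. The class `InnerLoopSketch`, the chunk size of 64, and the 16-bit layout (high half / low half of each word) are hypothetical and not taken from the PR.

```java
import java.util.Arrays;

final class InnerLoopSketch {
  // Hypothetical chunk size; the inner loop always runs exactly CHUNK iterations.
  static final int CHUNK = 64;

  // Baseline: one loop whose trip count is only known at run time.
  static void decodePlain(int[] packed, int half, int[] docIds) {
    for (int i = 0; i < half; i++) {
      docIds[i] = packed[i] >>> 16;
      docIds[half + i] = packed[i] & 0xFFFF;
    }
  }

  // Variant: constant-trip-count inner loop plus a scalar tail for the remainder.
  static void decodeChunked(int[] packed, int half, int[] docIds) {
    int i = 0;
    for (; i + CHUNK <= half; i += CHUNK) {
      for (int j = 0; j < CHUNK; j++) {
        int word = packed[i + j];
        docIds[i + j] = word >>> 16;
        docIds[half + i + j] = word & 0xFFFF;
      }
    }
    for (; i < half; i++) {
      docIds[i] = packed[i] >>> 16;
      docIds[half + i] = packed[i] & 0xFFFF;
    }
  }

  public static void main(String[] args) {
    int half = 150; // deliberately not a multiple of CHUNK
    int[] packed = new int[half];
    for (int i = 0; i < half; i++) {
      packed[i] = (i << 16) | (half + i);
    }
    int[] a = new int[2 * half];
    int[] b = new int[2 * half];
    decodePlain(packed, half, a);
    decodeChunked(packed, half, b);
    System.out.println(Arrays.equals(a, b));
  }
}
```

Both methods produce the same output; whether the chunked form is compiled to better code would have to be confirmed from the generated assembly.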

@gf2121
Contributor Author

gf2121 commented Feb 11, 2025

Thanks for the feedback! I implemented the fixed-size inner loop and printed the assembly for all variants: perf_asm.log

  • When profiling is enabled, #current countVariable=true and #current countVariable=false have the same (slow) speed. It seems that profiling prevents some optimizations.

  • According to the assembly, #current bpv=16 does not get auto-vectorized. #current bpv=24 gets vectorized in the shift loop, but not in the remainder loop.

  • According to the assembly, the inner loop gets auto-vectorized, but it is slower than the Vector API version.

Mac M2

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5  103.490 ±  6.785  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  212.488 ±  5.383  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   91.203 ±  1.023  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5  149.742 ±  1.953  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5  213.162 ±  1.598  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5  216.529 ±  2.518  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5  153.970 ±  1.101  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5  140.103 ±  3.001  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5  129.281 ±  0.471  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5  131.083 ±  8.775  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   99.597 ±  2.850  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   96.235 ± 14.875  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5  104.314 ±  0.557  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  202.175 ± 10.863  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   86.016 ±  1.315  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   85.609 ±  5.733  ops/ms

Linux X86 AVX512 profiling disabled

Benchmark                        (bpv)  (countVariable)   Mode  Cnt    Score    Error   Units
BKDCodecBenchmark.current           16             true  thrpt    5   41.138 ±  1.770  ops/ms
BKDCodecBenchmark.current           16            false  thrpt    5  142.277 ±  0.943  ops/ms
BKDCodecBenchmark.current           24             true  thrpt    5   43.104 ±  0.066  ops/ms
BKDCodecBenchmark.current           24            false  thrpt    5   42.760 ±  0.496  ops/ms
BKDCodecBenchmark.currentVector     16             true  thrpt    5   86.565 ±  0.904  ops/ms
BKDCodecBenchmark.currentVector     16            false  thrpt    5   86.624 ±  0.395  ops/ms
BKDCodecBenchmark.currentVector     24             true  thrpt    5   80.064 ±  2.604  ops/ms
BKDCodecBenchmark.currentVector     24            false  thrpt    5   76.638 ± 18.692  ops/ms
BKDCodecBenchmark.innerLoop         16             true  thrpt    5   43.810 ±  1.096  ops/ms
BKDCodecBenchmark.innerLoop         16            false  thrpt    5   42.485 ±  0.073  ops/ms
BKDCodecBenchmark.innerLoop         24             true  thrpt    5   37.255 ±  0.994  ops/ms
BKDCodecBenchmark.innerLoop         24            false  thrpt    5   37.243 ±  0.593  ops/ms
BKDCodecBenchmark.legacy            16             true  thrpt    5   41.415 ±  0.079  ops/ms
BKDCodecBenchmark.legacy            16            false  thrpt    5  145.713 ±  0.381  ops/ms
BKDCodecBenchmark.legacy            24             true  thrpt    5   27.758 ±  4.210  ops/ms
BKDCodecBenchmark.legacy            24            false  thrpt    5   28.519 ±  1.839  ops/ms

Linux X86 AVX512 profiling enabled

Benchmark                            (bpv)  (countVariable)   Mode  Cnt   Score   Error   Units
BKDCodecBenchmark.current               16             true  thrpt    5  29.878 ± 0.130  ops/ms
BKDCodecBenchmark.current:asm           16             true  thrpt          NaN             ---
BKDCodecBenchmark.current               16            false  thrpt    5  29.314 ± 0.229  ops/ms
BKDCodecBenchmark.current:asm           16            false  thrpt          NaN             ---
BKDCodecBenchmark.current               24             true  thrpt    5  34.874 ± 0.320  ops/ms
BKDCodecBenchmark.current:asm           24             true  thrpt          NaN             ---
BKDCodecBenchmark.current               24            false  thrpt    5  33.987 ± 0.055  ops/ms
BKDCodecBenchmark.current:asm           24            false  thrpt          NaN             ---
BKDCodecBenchmark.currentVector         16             true  thrpt    5  79.717 ± 5.983  ops/ms
BKDCodecBenchmark.currentVector:asm     16             true  thrpt          NaN             ---
BKDCodecBenchmark.currentVector         16            false  thrpt    5  81.924 ± 3.799  ops/ms
BKDCodecBenchmark.currentVector:asm     16            false  thrpt          NaN             ---
BKDCodecBenchmark.currentVector         24             true  thrpt    5  65.615 ± 8.901  ops/ms
BKDCodecBenchmark.currentVector:asm     24             true  thrpt          NaN             ---
BKDCodecBenchmark.currentVector         24            false  thrpt    5  74.759 ± 2.173  ops/ms
BKDCodecBenchmark.currentVector:asm     24            false  thrpt          NaN             ---
BKDCodecBenchmark.innerLoop             16             true  thrpt    5  40.869 ± 3.407  ops/ms
BKDCodecBenchmark.innerLoop:asm         16             true  thrpt          NaN             ---
BKDCodecBenchmark.innerLoop             16            false  thrpt    5  41.825 ± 1.644  ops/ms
BKDCodecBenchmark.innerLoop:asm         16            false  thrpt          NaN             ---
BKDCodecBenchmark.innerLoop             24             true  thrpt    5  37.251 ± 3.447  ops/ms
BKDCodecBenchmark.innerLoop:asm         24             true  thrpt          NaN             ---
BKDCodecBenchmark.innerLoop             24            false  thrpt    5  37.419 ± 1.238  ops/ms
BKDCodecBenchmark.innerLoop:asm         24            false  thrpt          NaN             ---
BKDCodecBenchmark.legacy                16             true  thrpt    5  28.477 ± 3.747  ops/ms
BKDCodecBenchmark.legacy:asm            16             true  thrpt          NaN             ---
BKDCodecBenchmark.legacy                16            false  thrpt    5  29.838 ± 0.163  ops/ms
BKDCodecBenchmark.legacy:asm            16            false  thrpt          NaN             ---
BKDCodecBenchmark.legacy                24             true  thrpt    5  28.295 ± 1.224  ops/ms
BKDCodecBenchmark.legacy:asm            24             true  thrpt          NaN             ---
BKDCodecBenchmark.legacy                24            false  thrpt    5  27.915 ± 0.911  ops/ms
BKDCodecBenchmark.legacy:asm            24            false  thrpt          NaN             ---

@jpountz
Contributor

jpountz commented Feb 11, 2025

> #current bpv=24 gets vectorized on the shift loop, but not for the remainder loop.

This is an interesting observation. I wonder if a small refactoring could help it get auto-vectorized? E.g. what if we applied the 0xFF mask to scratch in the shift loop rather than the remainder loop? Or if we split the remainder loop into 3 loops, one for each 8 bits that get contributed to the value?

Sorry for pushing, but if we could get auto-vectorization to do the right thing, then this would automatically benefit all users, not only those who enable the vector module.
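To make the second suggestion concrete, here is a rough sketch of a 24-bit decode with the remainder loop written once in fused form and once split into three passes, one per contributed byte. The layout is an assumption for illustration (each 32-bit word carries one 24-bit value in its high bits and one byte of a further value in its low 8 bits, with `count` a multiple of 4), and `Bpv24Sketch` and its methods are hypothetical names, not the PR's code.

```java
import java.util.Arrays;

final class Bpv24Sketch {
  // Assumed layout: numInts = 3 * quarter words. Word i holds docIds[i] << 8.
  // The low bytes of words i, quarter + i and 2 * quarter + i together hold
  // the three bytes of docIds[numInts + i].
  static int[] encode(int[] docIds, int quarter) {
    int numInts = quarter * 3;
    int[] scratch = new int[numInts];
    for (int i = 0; i < numInts; i++) {
      scratch[i] = docIds[i] << 8;
    }
    for (int i = 0; i < quarter; i++) {
      int v = docIds[numInts + i];
      scratch[i] |= v & 0xFF;
      scratch[quarter + i] |= (v >>> 8) & 0xFF;
      scratch[2 * quarter + i] |= (v >>> 16) & 0xFF;
    }
    return scratch;
  }

  // Fused remainder loop: each iteration reads three words at different offsets.
  static void decodeFused(int[] scratch, int quarter, int[] docIds) {
    int numInts = quarter * 3;
    for (int i = 0; i < numInts; i++) {
      docIds[i] = scratch[i] >>> 8; // shift loop
    }
    for (int i = 0; i < quarter; i++) {
      docIds[numInts + i] = (scratch[i] & 0xFF)
          | ((scratch[quarter + i] & 0xFF) << 8)
          | ((scratch[2 * quarter + i] & 0xFF) << 16);
    }
  }

  // Split remainder: three simple loops, each contributing 8 bits.
  static void decodeSplit(int[] scratch, int quarter, int[] docIds) {
    int numInts = quarter * 3;
    for (int i = 0; i < numInts; i++) {
      docIds[i] = scratch[i] >>> 8; // shift loop
    }
    for (int i = 0; i < quarter; i++) {
      docIds[numInts + i] = scratch[i] & 0xFF;
    }
    for (int i = 0; i < quarter; i++) {
      docIds[numInts + i] |= (scratch[quarter + i] & 0xFF) << 8;
    }
    for (int i = 0; i < quarter; i++) {
      docIds[numInts + i] |= (scratch[2 * quarter + i] & 0xFF) << 16;
    }
  }

  public static void main(String[] args) {
    int[] docIds = {1, 2, 3, 0xABCDEF, 5, 6, 7, 0x123456};
    int[] scratch = encode(docIds, 2);
    int[] fused = new int[8];
    int[] split = new int[8];
    decodeFused(scratch, 2, fused);
    decodeSplit(scratch, 2, split);
    System.out.println(Arrays.equals(fused, split) && Arrays.equals(docIds, split));
  }
}
```

The split form does more passes over `scratch`, so whether it is a net win, and whether C2 actually vectorizes each pass, would need to be checked in the assembly and the benchmark.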
