Simplify ForDeltaUtil's prefix sum. #14979

Merged: 1 commit, Jul 23, 2025

@jpountz jpountz commented Jul 21, 2025

I remember benchmarking prefix sums quite extensively, and unrolled loops
performed significantly better than their rolled counterpart, in both micro
and macro benchmarks. The rolled loop looked like this:

```java
private static void prefixSum(int[] arr, int len) {
  for (int i = 1; i < len; ++i) {
    arr[i] += arr[i-1];
  }
}
```

However, I recently discovered that rewriting the loop this way performs much
better, and almost on par with the unrolled variant:

```java
private static void prefixSum(int[] arr, int len) {
  int sum = 0;
  for (int i = 0; i < len; ++i) {
    sum += arr[i];
    arr[i] = sum;
  }
}
```

I haven't checked the assembly yet, but both a JMH benchmark and luceneutil
agree that it doesn't introduce a slowdown, so I cut over prefix sums to this
approach.
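
As a sanity check on the rewrite described above, the following is a minimal sketch that runs both loops on the same random input and compares the results. The class name `PrefixSumCheck`, the method names, the array size of 128, and the fixed seed are assumptions made for illustration; none of this is taken from Lucene's `ForDeltaUtil` or its tests.

```java
import java.util.Arrays;
import java.util.Random;

public class PrefixSumCheck {

  // Original form: each element reads the previously written array element.
  static void prefixSumInPlace(int[] arr, int len) {
    for (int i = 1; i < len; ++i) {
      arr[i] += arr[i - 1];
    }
  }

  // Rewritten form: the running total is kept in a local variable.
  static void prefixSumRunning(int[] arr, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) {
      sum += arr[i];
      arr[i] = sum;
    }
  }

  public static void main(String[] args) {
    Random random = new Random(42); // fixed seed, chosen arbitrarily
    for (int iter = 0; iter < 1000; ++iter) {
      int[] expected = new int[128];
      for (int i = 0; i < expected.length; ++i) {
        expected[i] = random.nextInt(1 << 16);
      }
      int[] actual = expected.clone();
      // Also exercise partial lengths, including 0 and 1.
      int len = random.nextInt(expected.length + 1);
      prefixSumInPlace(expected, len);
      prefixSumRunning(actual, len);
      if (!Arrays.equals(expected, actual)) {
        throw new AssertionError("mismatch at iteration " + iter);
      }
    }
    System.out.println("both loops agree");
  }
}
```

Both forms leave `arr[0]` unchanged and wrap identically on `int` overflow, so they should be interchangeable for any `len` between 0 and `arr.length`.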
@jpountz added this to the 10.3.0 milestone Jul 21, 2025
@jpountz added the skip-changelog label Jul 21, 2025
jpountz commented Jul 21, 2025

                            Task  QPS baseline    StdDev  QPS my_modified_version    StdDev                Pct diff  p-value
                      TermDTSort      395.64      (5.8%)      390.57      (4.0%)   -1.3% ( -10% -    9%) 0.482
                      OrHighRare      303.65      (7.1%)      300.02      (5.2%)   -1.2% ( -12% -   11%) 0.599
                       CountTerm     9330.02      (3.5%)     9238.82      (3.2%)   -1.0% (  -7% -    5%) 0.422
                  FilteredPhrase       32.40      (1.5%)       32.19      (1.4%)   -0.7% (  -3% -    2%) 0.215
                   TermTitleSort       84.19      (4.2%)       83.64      (4.4%)   -0.7% (  -8% -    8%) 0.676
               CombinedOrHighMed       88.57      (0.7%)       88.17      (2.0%)   -0.4% (  -3% -    2%) 0.419
              CombinedOrHighHigh       23.30      (0.8%)       23.21      (3.2%)   -0.4% (  -4% -    3%) 0.638
               FilteredOrHighMed      153.60      (1.0%)      153.10      (1.1%)   -0.3% (  -2% -    1%) 0.401
                     CountPhrase        4.24      (2.1%)        4.23      (3.5%)   -0.3% (  -5% -    5%) 0.772
                  CountOrHighMed      358.51      (1.0%)      357.43      (1.9%)   -0.3% (  -3% -    2%) 0.584
      FilteredOr2Terms2StopWords      147.65      (0.9%)      147.28      (1.2%)   -0.2% (  -2% -    1%) 0.538
                 FilteredPrefix3      151.73      (2.5%)      151.40      (1.7%)   -0.2% (  -4% -    4%) 0.776
              FilteredOrHighHigh       67.41      (1.9%)       67.29      (1.6%)   -0.2% (  -3% -    3%) 0.777
                FilteredOr3Terms      167.05      (0.8%)      166.74      (1.0%)   -0.2% (  -2% -    1%) 0.592
                          OrMany       23.50      (3.0%)       23.46      (2.6%)   -0.2% (  -5% -    5%) 0.862
             And2Terms2StopWords      206.60      (1.4%)      206.31      (1.3%)   -0.1% (  -2% -    2%) 0.770
               TermDayOfYearSort      282.79      (4.1%)      282.54      (3.7%)   -0.1% (  -7% -    8%) 0.950
             CountFilteredPhrase       25.43      (2.3%)       25.41      (2.1%)   -0.1% (  -4% -    4%) 0.922
             FilteredOrStopWords       45.74      (1.9%)       45.73      (1.9%)   -0.0% (  -3% -    3%) 0.990
                AndMedOrHighHigh       88.27      (1.9%)       88.28      (1.7%)    0.0% (  -3% -    3%) 0.980
                  FilteredIntNRQ      297.19      (0.7%)      297.43      (0.8%)    0.1% (  -1% -    1%) 0.783
                 CountOrHighHigh      340.83      (1.8%)      341.27      (2.9%)    0.1% (  -4% -    4%) 0.884
          CountFilteredOrHighMed      149.06      (0.6%)      149.26      (0.7%)    0.1% (  -1% -    1%) 0.559
                    CombinedTerm       39.45      (0.9%)       39.51      (0.5%)    0.1% (  -1% -    1%) 0.586
                  FilteredOrMany       16.55      (1.1%)       16.57      (1.2%)    0.2% (  -2% -    2%) 0.715
                     CountOrMany       29.11      (1.3%)       29.17      (1.6%)    0.2% (  -2% -    3%) 0.721
         CountFilteredOrHighHigh      136.99      (0.8%)      137.25      (1.0%)    0.2% (  -1% -    1%) 0.547
              CombinedAndHighMed       89.73      (0.8%)       89.93      (0.6%)    0.2% (  -1% -    1%) 0.382
                    AndStopWords       47.24      (2.7%)       47.35      (2.1%)    0.2% (  -4% -    5%) 0.789
             CountFilteredOrMany       27.25      (1.2%)       27.32      (1.5%)    0.2% (  -2% -    2%) 0.617
                      AndHighMed      202.48      (2.5%)      202.99      (1.9%)    0.3% (  -3% -    4%) 0.750
              Or2Terms2StopWords      206.67      (1.4%)      207.22      (1.9%)    0.3% (  -3% -    3%) 0.664
                    FilteredTerm      162.69      (2.2%)      163.18      (2.7%)    0.3% (  -4% -    5%) 0.744
                     AndHighHigh       69.16      (3.1%)       69.37      (2.4%)    0.3% (  -5% -    6%) 0.758
               FilteredAnd3Terms      189.84      (1.5%)      190.44      (1.0%)    0.3% (  -2% -    2%) 0.496
     FilteredAnd2Terms2StopWords      214.48      (2.4%)      215.19      (1.2%)    0.3% (  -3% -    4%) 0.631
                       And3Terms      240.86      (2.3%)      241.78      (1.5%)    0.4% (  -3% -    4%) 0.593
                 AndHighOrMedMed       51.39      (1.4%)       51.62      (1.2%)    0.4% (  -2% -    3%) 0.359
             CombinedAndHighHigh       23.50      (1.1%)       23.61      (0.7%)    0.5% (  -1% -    2%) 0.149
                CountAndHighHigh      357.29      (1.8%)      359.20      (2.5%)    0.5% (  -3% -    4%) 0.507
                     OrStopWords       48.86      (2.2%)       49.19      (2.2%)    0.7% (  -3% -    5%) 0.413
                       OrHighMed      258.66      (1.8%)      260.44      (1.6%)    0.7% (  -2% -    4%) 0.272
            FilteredAndStopWords       64.59      (4.0%)       65.06      (2.5%)    0.7% (  -5% -    7%) 0.555
                 CountAndHighMed      307.15      (0.7%)      309.50      (1.3%)    0.8% (  -1% -    2%) 0.044
                      OrHighHigh       78.09      (2.2%)       78.75      (2.1%)    0.8% (  -3% -    5%) 0.280
              FilteredAndHighMed      155.05      (2.9%)      156.38      (1.5%)    0.9% (  -3% -    5%) 0.307
             FilteredAndHighHigh       77.96      (4.5%)       78.64      (2.5%)    0.9% (  -5% -    8%) 0.506
                   TermMonthSort     3341.21      (1.3%)     3373.59      (2.0%)    1.0% (  -2% -    4%) 0.111
                        Or3Terms      230.62      (1.8%)      233.16      (1.7%)    1.1% (  -2% -    4%) 0.090
                            Term      666.11      (5.6%)      677.28      (3.6%)    1.7% (  -7% -   11%) 0.328

@gf2121 gf2121 left a comment

I played with your benchmark and can reproduce the speedup locally (prefixSumScalarNew).

Benchmark                                        (size)   Mode  Cnt   Score   Error   Units
PrefixSumBenchmark.prefixSumScalar                  128  thrpt    5  17.567 ± 0.117  ops/us
PrefixSumBenchmark.prefixSumScalarInlined           128  thrpt    5  26.228 ± 0.086  ops/us
PrefixSumBenchmark.prefixSumScalarNew               128  thrpt    5  25.864 ± 0.043  ops/us
PrefixSumBenchmark.prefixSumVector128               128  thrpt    5  20.668 ± 0.350  ops/us
PrefixSumBenchmark.prefixSumVector128_v2            128  thrpt    5  26.103 ± 0.176  ops/us
PrefixSumBenchmark.prefixSumVector256               128  thrpt    5  28.632 ± 0.956  ops/us
PrefixSumBenchmark.prefixSumVector256_v2            128  thrpt    5  44.185 ± 0.978  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_inline     128  thrpt    5  43.949 ± 0.225  ops/us
PrefixSumBenchmark.prefixSumVector256_v3            128  thrpt    5  20.108 ± 1.157  ops/us
PrefixSumBenchmark.prefixSumVector512               128  thrpt    5  32.676 ± 0.266  ops/us
PrefixSumBenchmark.prefixSumVector512_v2            128  thrpt    5  57.176 ± 0.413  ops/us

I checked the assembly, and the only difference I can see is that the baseline holds the running value in a register only within the unrolled (8x) loop body, so it needs to read from the array before each iteration, while this PR keeps the running sum in a register across iterations.
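
To make that difference concrete, here is a hand-written source-level approximation. It is only a sketch: the generated assembly is not shown in this thread, the unroll factor is 4 rather than 8 for brevity, and the class and method names (`CarriedSumSketch`, `reloadPerBlock`, `carryAcrossBlocks`) are hypothetical.

```java
import java.util.Arrays;

public class CarriedSumSketch {

  // Approximates the baseline as described: the running value sits in a
  // local inside each unrolled block, but is re-read from the array at the
  // top of every block.
  static void reloadPerBlock(int[] arr, int len) {
    int i = 1;
    for (; i + 3 < len; i += 4) {
      int prev = arr[i - 1]; // load from memory before each block
      prev += arr[i];
      arr[i] = prev;
      prev += arr[i + 1];
      arr[i + 1] = prev;
      prev += arr[i + 2];
      arr[i + 2] = prev;
      prev += arr[i + 3];
      arr[i + 3] = prev;
    }
    for (; i < len; ++i) { // scalar tail
      arr[i] += arr[i - 1];
    }
  }

  // Approximates this PR as described: the running sum is carried in a
  // local across blocks, so no block starts by re-reading the array.
  static void carryAcrossBlocks(int[] arr, int len) {
    int sum = 0;
    int i = 0;
    for (; i + 3 < len; i += 4) {
      sum += arr[i];
      arr[i] = sum;
      sum += arr[i + 1];
      arr[i + 1] = sum;
      sum += arr[i + 2];
      arr[i + 2] = sum;
      sum += arr[i + 3];
      arr[i + 3] = sum;
    }
    for (; i < len; ++i) { // scalar tail
      sum += arr[i];
      arr[i] = sum;
    }
  }

  public static void main(String[] args) {
    int[] a = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    int[] b = a.clone();
    reloadPerBlock(a, a.length);
    carryAcrossBlocks(b, b.length);
    System.out.println(Arrays.toString(a)); // [1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
    System.out.println(Arrays.equals(a, b)); // true
  }
}
```

In the first method every block begins with a load of an element that the previous block just stored; in the second, the only value flowing between blocks is the local `sum`.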

jpountz commented Jul 22, 2025

Thanks for checking! For reference, here's what I get on my machine (AMD Ryzen 9 3900X):

Benchmark                                          (size)   Mode  Cnt   Score    Error   Units
PrefixSumBenchmark.prefixSumScalar                    128  thrpt    5  19.081 ±  0.550  ops/us
PrefixSumBenchmark.prefixSumScalar                   1024  thrpt    5   2.180 ±  0.097  ops/us
PrefixSumBenchmark.prefixSumScalarUnrolled            128  thrpt    5  32.679 ±  1.819  ops/us
PrefixSumBenchmark.prefixSumScalarUnrolled           1024  thrpt    5  31.804 ±  0.067  ops/us
PrefixSumBenchmark.prefixSumScalar_v2                 128  thrpt    5  30.677 ±  0.308  ops/us
PrefixSumBenchmark.prefixSumScalar_v2                1024  thrpt    5   3.501 ±  0.035  ops/us
PrefixSumBenchmark.prefixSumVector128                 128  thrpt    5  16.519 ±  0.724  ops/us
PrefixSumBenchmark.prefixSumVector128                1024  thrpt    5   1.845 ±  0.003  ops/us
PrefixSumBenchmark.prefixSumVector128_v2              128  thrpt    5  19.237 ±  0.518  ops/us
PrefixSumBenchmark.prefixSumVector128_v2             1024  thrpt    5   1.883 ±  0.014  ops/us
PrefixSumBenchmark.prefixSumVector256                 128  thrpt    5  23.473 ±  0.164  ops/us
PrefixSumBenchmark.prefixSumVector256                1024  thrpt    5   3.029 ±  0.021  ops/us
PrefixSumBenchmark.prefixSumVector256_v2              128  thrpt    5  27.053 ±  0.129  ops/us
PrefixSumBenchmark.prefixSumVector256_v2             1024  thrpt    5   3.162 ±  0.093  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_unrolled     128  thrpt    5  26.211 ±  0.156  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_unrolled    1024  thrpt    5  25.478 ±  0.185  ops/us
PrefixSumBenchmark.prefixSumVector256_v3              128  thrpt    5  14.690 ±  0.037  ops/us
PrefixSumBenchmark.prefixSumVector256_v3             1024  thrpt    5   1.920 ±  0.057  ops/us
PrefixSumBenchmark.prefixSumVector512                 128  thrpt    5   0.052 ±  0.005  ops/us
PrefixSumBenchmark.prefixSumVector512                1024  thrpt    5   0.006 ±  0.001  ops/us
PrefixSumBenchmark.prefixSumVector512_v2              128  thrpt    5   0.082 ±  0.005  ops/us
PrefixSumBenchmark.prefixSumVector512_v2             1024  thrpt    5   0.010 ±  0.001  ops/us
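
For a quick read of the scalar rows in the two JMH tables above, the throughput ratios can be worked out directly from the reported scores. This is a small arithmetic sketch only: the class name `SpeedupRatios` is made up, and it assumes that `prefixSumScalarNew` and `prefixSumScalar_v2` both correspond to the rewritten loop and that `prefixSumScalarInlined` and `prefixSumScalarUnrolled` are the unrolled baselines.

```java
import java.util.Locale;

public class SpeedupRatios {

  // Ratio of two throughput scores (ops/us); above 1.0 means the candidate is faster.
  static double ratio(double candidate, double reference) {
    return candidate / reference;
  }

  public static void main(String[] args) {
    // gf2121's run, size 128
    System.out.printf(Locale.ROOT, "new vs rolled:   %.2fx%n", ratio(25.864, 17.567)); // 1.47x
    System.out.printf(Locale.ROOT, "new vs unrolled: %.2fx%n", ratio(25.864, 26.228)); // 0.99x
    // jpountz's run (AMD Ryzen 9 3900X)
    System.out.printf(Locale.ROOT, "v2 vs rolled, 128:   %.2fx%n", ratio(30.677, 19.081)); // 1.61x
    System.out.printf(Locale.ROOT, "v2 vs rolled, 1024:  %.2fx%n", ratio(3.501, 2.180)); // 1.61x
    System.out.printf(Locale.ROOT, "v2 vs unrolled, 128: %.2fx%n", ratio(30.677, 32.679)); // 0.94x
  }
}
```

At size 128, both runs are consistent with the description: the rewritten loop is well ahead of the rolled baseline and slightly behind the unrolled variant.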

@jpountz jpountz merged commit a2a9a3b into apache:main Jul 23, 2025
8 checks passed
@jpountz jpountz deleted the simplify_prefix_sum branch July 23, 2025 19:25
jpountz added a commit that referenced this pull request Jul 23, 2025
Labels: module:core/codecs, skip-changelog