Simplify ForDeltaUtil's prefix sum. #14979
Conversation
I remember benchmarking prefix sums quite extensively, and unrolled loops performed significantly better than their rolled counterparts, both on micro and macro benchmarks:

```java
private static void prefixSum(int[] arr, int len) {
  for (int i = 1; i < len; ++i) {
    arr[i] += arr[i - 1];
  }
}
```

However, I recently discovered that rewriting the loop this way performs much better, almost on par with the unrolled variant:

```java
private static void prefixSum(int[] arr, int len) {
  int sum = 0;
  for (int i = 0; i < len; ++i) {
    sum += arr[i];
    arr[i] = sum;
  }
}
```

I haven't checked the assembly yet, but both a JMH benchmark and luceneutil agree that it doesn't introduce a slowdown, so I cut over prefix sums to this approach.
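For anyone comparing the two variants side by side, here's a minimal self-contained check (my sketch, not code from this PR) showing that both loops compute the same prefix sums. The class and method names are illustrative only:

```java
import java.util.Arrays;

public class PrefixSumDemo {
  // Rolled baseline: each iteration reads arr[i - 1] back from the array.
  static void prefixSumOld(int[] arr, int len) {
    for (int i = 1; i < len; ++i) {
      arr[i] += arr[i - 1];
    }
  }

  // New variant: the running sum lives in a local, which the JIT can keep
  // in a register across iterations instead of reloading from memory.
  static void prefixSumNew(int[] arr, int len) {
    int sum = 0;
    for (int i = 0; i < len; ++i) {
      sum += arr[i];
      arr[i] = sum;
    }
  }

  public static void main(String[] args) {
    int[] a = {3, 1, 4, 1, 5, 9, 2, 6};
    int[] b = a.clone();
    prefixSumOld(a, a.length);
    prefixSumNew(b, b.length);
    System.out.println(Arrays.toString(a)); // [3, 4, 8, 9, 14, 23, 25, 31]
    System.out.println(Arrays.equals(a, b)); // true
  }
}
```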
I played with your benchmark and can reproduce the speed up locally (prefixSumScalarNew).
```
Benchmark                                         (size)   Mode  Cnt   Score   Error   Units
PrefixSumBenchmark.prefixSumScalar                   128  thrpt    5  17.567 ± 0.117  ops/us
PrefixSumBenchmark.prefixSumScalarInlined            128  thrpt    5  26.228 ± 0.086  ops/us
PrefixSumBenchmark.prefixSumScalarNew                128  thrpt    5  25.864 ± 0.043  ops/us
PrefixSumBenchmark.prefixSumVector128                128  thrpt    5  20.668 ± 0.350  ops/us
PrefixSumBenchmark.prefixSumVector128_v2             128  thrpt    5  26.103 ± 0.176  ops/us
PrefixSumBenchmark.prefixSumVector256                128  thrpt    5  28.632 ± 0.956  ops/us
PrefixSumBenchmark.prefixSumVector256_v2             128  thrpt    5  44.185 ± 0.978  ops/us
PrefixSumBenchmark.prefixSumVector256_v2_inline      128  thrpt    5  43.949 ± 0.225  ops/us
PrefixSumBenchmark.prefixSumVector256_v3             128  thrpt    5  20.108 ± 1.157  ops/us
PrefixSumBenchmark.prefixSumVector512                128  thrpt    5  32.676 ± 0.266  ops/us
PrefixSumBenchmark.prefixSumVector512_v2             128  thrpt    5  57.176 ± 0.413  ops/us
```
I checked the assembly, and the only difference I can see is that the baseline re-reads the previous element from the array before each step of the unrolled (8x) loop body, while this PR carries the running sum in a register across iterations.
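For readers who haven't seen the unrolled variant being discussed, here's a sketch of what a manually unrolled scalar prefix sum can look like (my reconstruction for illustration, not ForDeltaUtil's actual code, and unrolled 4x rather than 8x for brevity). Each statement still reads the previous element back from the array, which is the re-load described above:

```java
import java.util.Arrays;

public class PrefixSumUnrolled {
  // Hypothetical 4x-unrolled prefix sum. Assumes len is a multiple of 4,
  // as with the fixed block sizes ForDeltaUtil operates on.
  static void prefixSumUnrolled(int[] arr, int len) {
    arr[1] += arr[0];
    arr[2] += arr[1];
    arr[3] += arr[2];
    for (int i = 4; i < len; i += 4) {
      // Every line below loads its predecessor from the array rather than
      // carrying the running sum in a local variable.
      arr[i] += arr[i - 1];
      arr[i + 1] += arr[i];
      arr[i + 2] += arr[i + 1];
      arr[i + 3] += arr[i + 2];
    }
  }

  public static void main(String[] args) {
    int[] a = {1, 2, 3, 4, 5, 6, 7, 8};
    prefixSumUnrolled(a, a.length);
    System.out.println(Arrays.toString(a)); // [1, 3, 6, 10, 15, 21, 28, 36]
  }
}
```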
Thanks for checking! For reference here's what it gives on my machine (AMD Ryzen 9 3900X):