perf(cpu): adopt .NET 10 AVX-512 TernaryLogic and widen gather#141
Merged
mivertowski merged 1 commit intomainfrom Apr 22, 2026
Merged
perf(cpu): adopt .NET 10 AVX-512 TernaryLogic and widen gather#141mivertowski merged 1 commit intomainfrom
mivertowski merged 1 commit intomainfrom
Conversation
Five targeted adoption sites in the CPU SIMD reduction/branchless paths:
1. AdvancedSimdKernels.VectorGatherFloat32: replaces the scalar inner
loop (marked "Use scalar approach for gather" in a comment) with a
real Avx2.GatherVector256. On AVX-512 hosts, issues two back-to-back
256-bit gathers per iteration to cover 16 floats at a time — .NET 10
SDK 10.0.106 does not expose Avx512F.GatherVector512 directly, so
stitching two AVX2 gathers is the widest form available.
2. AdvancedSimdKernels.VectorScatterFloat32: removes dead AVX-512 loads
that preceded a scalar scatter loop. The SDK does not expose
Avx512F.Scatter, so the scatter remains scalar but is now unrolled
by 4 so the JIT can keep more stores in flight. Byte-identical result.
3. AdvancedSimdPatterns.ConditionalSelectAvx512: collapses the
CompareGreaterThan + BlendVariable sequence to a single vpternlogd
via Avx512F.TernaryLogic(trueVec, falseVec, mask, 0xCA). Truth table
0xCA = (C ? A : B). One instruction instead of one blend.
4. AdvancedSimdPatterns.ConditionalSelectSse: on AVX-512VL-capable hosts
the legacy Sse.And/AndNot/Or manual blend triple collapses to a
single Vector128 vpternlogd via Avx512F.VL.TernaryLogic. SSE4.1 and
pre-SSE4.1 fallbacks are preserved byte-for-byte.
5. AdvancedSimdPatterns.GatherFloat32: new GatherFloat32Avx512 branch
processes 16 elements per iteration on AVX-512 hosts by stitching two
Avx2.GatherVector256 calls, then delegating the remainder to the
existing AVX2 and scalar paths.
Each site keeps its non-AVX-512 fallback intact and byte-identical. Seven
new correctness tests in SimdOperationsTests cover the changed paths
against a scalar reference and pass on AVX2-only hardware.
Build: dotnet build DotCompute.sln --configuration Release -> 0 errors,
0 warnings. CPU unit tests: 139/140 pass (the one failure is a
pre-existing timing-based flake, unrelated).
Not adopted (SDK gap): Avx512F.GatherVector512 and Avx512F.Scatter are
not exposed in .NET 10 SDK 10.0.106. Revisit once the x86 intrinsics
surface catches up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Five targeted adoption sites for the .NET 10 AVX-512 / AVX2 intrinsics surface in the CPU backend's reduction and branchless paths. All behaviour is byte-for-byte identical on non-AVX-512 hardware; AVX-512 hosts get fewer instructions per element.
Adoption sites
AdvancedSimdKernels.VectorGatherFloat32Avx2.GatherVector256replacing a scalar inner loop that was documented as a workaroundvpgatherdpsvs 8 scalar loadsAdvancedSimdKernels.VectorScatterFloat32AdvancedSimdPatterns.ConditionalSelectAvx512Avx512F.TernaryLogic(..., 0xCA)instead ofCompareGreaterThan+BlendVariablevpternlogdvs compare + blendAdvancedSimdPatterns.ConditionalSelectSseAvx512F.VL.TernaryLogicreplacesSse.And/AndNot/Ortriple when AVX-512VL is presentvpternlogdvs three SSE opsAdvancedSimdPatterns.GatherFloat32Avx512Avx2.GatherVector256per iterationAvx512F.GatherVector512Seven new correctness tests in
SimdOperationsTestscover each changed path against a scalar reference.Intrinsics wanted but not available
Avx512F.GatherVector512andAvx512F.Scatterare not exposed in .NET 10 SDK 10.0.106 —System.Runtime.Intrinsics.X86.Avx512Fonly surfacesTernaryLogic,BlendVariable, arithmetic, and FMA. This is why Sites 1 and 5 stitch two AVX2 gathers instead of a single 512-bit gather, and Site 2 stays scalar. The code calls this out in XML remarks so the next revisit is obvious once the surface catches up.Build and test
dotnet build DotCompute.sln --configuration Release-> 0 errors, 0 warningsDotCompute.Backends.CPU.Tests-> 139/140 passing. The single failure (VectorOperations_Performance_SimdFasterThanScalar) is a pre-existing timing-based flake that passes in isolation and doesn't touch any modified code paths.KernelDebug*andRecovery*) are unrelated to this PR.Test plan
GatherFloat32once on AVX-512 hardware to measure the stitched-gather winScope discipline
src/Backends/DotCompute.Backends.CPU/Kernels/AdvancedSimdKernels.cssrc/Backends/DotCompute.Backends.CPU/Kernels/AdvancedSimdPatterns.cstests/Unit/DotCompute.Backends.CPU.Tests/SimdOperationsTests.cs🤖 Generated with Claude Code