While building qdk-chemistry (which depends on GauXC) with the clang-cl compiler on Windows (microsoft/qdk-chemistry#450), I noticed several test failures when setting OMP_NUM_THREADS > 2.
I've tracked down the issue to GauXC's XC integrator module showing numerical instabilities when using "too many" threads.
I built GauXC with clang-cl (MSVC LLVM compiler), with vcpkg providing OpenBLAS:
cmake -S "$RepoRoot" -B "$BuildDir" `
-GNinja `
-DCMAKE_POLICY_VERSION_MINIMUM="3.5" `
-DFETCHCONTENT_QUIET=OFF `
-DCMAKE_BUILD_TYPE=Release `
-DCMAKE_CXX_STANDARD=20 `
-DCMAKE_CXX_STANDARD_REQUIRED=ON `
-DCMAKE_CXX_FLAGS="/DWIN32 /D_WINDOWS /GR /EHsc /D_USE_MATH_DEFINES" `
-DCMAKE_C_FLAGS="/DWIN32 /D_WINDOWS -Wno-implicit-function-declaration" `
-DCMAKE_C_COMPILER=clang-cl `
-DCMAKE_CXX_COMPILER=clang-cl `
-DCMAKE_TOOLCHAIN_FILE="$env:CMAKE_TOOLCHAIN_FILE" `
-DVCPKG_TARGET_TRIPLET=x64-windows `
-DVCPKG_INSTALLED_DIR="$VcpkgInstalledDir" `
-DGAUXC_ENABLE_OPENMP=ON `
-DGAUXC_ENABLE_MPI=OFF `
-DGAUXC_ENABLE_CUDA=OFF `
-DGAUXC_ENABLE_HDF5=ON `
-DGAUXC_ENABLE_MAGMA=OFF `
-DGAUXC_ENABLE_CUTLASS=ON `
-DEXCHCXX_ENABLE_LIBXC=OFF `
-DGAUXC_ENABLE_TESTS=ON
cmake --build "$BuildDir" --parallel 6
# run the tests
$env:OMP_NUM_THREADS = 8
Push-Location "$BuildDir"
ctest --output-on-failure --verbose -E "GAUXC_MPI_TEST"
The 2nd_derivative_test and xc_integrator tests fail with OMP_NUM_THREADS > 2; the higher the thread count, the larger the numerical instabilities that appear.
Here are the results of these tests for 2, 4, and 8 OpenMP threads, showing the tests passing with OMP_NUM_THREADS = 2, failing with 4, and failing more dramatically with 8:
I asked Claude to analyze the code, and it identifies the culprit as GauXC's XC integrator, which uses element-by-element #pragma omp atomic accumulation on shared matrices (inc_by_submat_atomic in util.hpp). Not sure if it makes sense...
The pattern is:
#pragma omp parallel for
for (auto& task : tasks) {
  // compute local XC contributions...
  for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++) {
      #pragma omp atomic
      VXC(i, j) += local_contribution(i, j);
    }
}
Instead of using thread-local buffers with a reduction, the code does element-by-element atomic accumulation on shared matrices (VXCs, VXCz, VXCy, VXCx). This is thousands of individual atomic operations per batch, on data that multiple threads are reading and writing simultaneously.
Why it fails with clang-cl/libomp but not GCC/libgomp
This isn't strictly a data race in the traditional sense — #pragma omp atomic does guarantee atomicity of each individual add. The problem is more subtle:
- libomp (LLVM/clang) uses more aggressive work-stealing and dynamic scheduling. Threads interleave more aggressively, so atomic contention is much higher. Under heavy contention, the accumulated floating-point errors from non-deterministic operation ordering can cause the SCF to diverge.
- libgomp (GCC) uses simpler scheduling that happens to serialize work more, reducing contention and keeping the accumulation order more consistent across runs.
- The XC matrix contributions are not associative in floating-point: (a + b) + c ≠ a + (b + c) due to rounding. With 8 threads doing thousands of atomics in unpredictable order, the accumulated matrix can differ enough between runs to push SCF convergence off a cliff, especially for sensitive systems like benzene/PBE.
"XC Integrator" test — two racy paths
1. eval_exc_vxc (VXC matrix failures)
Test → reference_replicated_xc_host_integrator_exc_vxc.hpp → exc_vxc_local_work_()
Racy sites:
- Lines 500-503: scalar EXC_WORK/NEL_WORK atomic accumulation
- Line 556: lwd->inc_vxc() → util.hpp:154-158, the core culprit: inc_by_submat_atomic() does element-by-element #pragma omp atomic into the shared VXC matrix
- Lines 557-564: same for VXCz/VXCy/VXCx (UKS/GKS)
2. eval_exc_grad (gradient failures)
Test → reference_replicated_xc_host_integrator_exc_grad.hpp → exc_grad_local_work_()
Racy sites:
- Lines 576-589: 3-6 #pragma omp atomic per atom into shared EXC_GRAD[]
- reference/weights.cxx:786-798 (Becke) and lines 948-978 (SSF): weight derivative atomics into exc_grad_w[]
"XC Integrator FXC" test — FXC contraction
Test → reference_replicated_xc_host_integrator_fxc_contraction.hpp → fxc_contraction_local_work_()
Racy sites:
- Lines 530-531: scalar NEL_WORK atomic
- Line 569: lwd->inc_vxc() → util.hpp:154-158 into the shared FXC matrix
Common bottleneck
All paths funnel through src/xc_integrator/local_work_driver/host/util.hpp:154-158:
#pragma omp atomic
ABig_use[ ii + jj * LDAB ] += ASmall_use[ ii + jj * LDAS ];
This is the single function (inc_by_submat_atomic) that needs to be replaced with thread-local accumulation + reduction to fix all three test failures.
...not sure about the quality of its suggested implementation, but I hope this helps.