
XC integrator OpenMP problems with LLVM libomp on Windows (clang-cl) #196

@lorisercole

Description

While building qdk-chemistry (which depends on GauXC) with the clang-cl compiler on Windows (microsoft/qdk-chemistry#450), I noticed several test failures when setting OMP_NUM_THREADS > 2.
I've tracked the issue down to GauXC's XC integrator module, which shows numerical instabilities when using "too many" threads.

I built GauXC with clang-cl (MSVC LLVM compiler), with vcpkg providing OpenBLAS:

cmake -S "$RepoRoot" -B "$BuildDir" `
    -GNinja `
    -DCMAKE_POLICY_VERSION_MINIMUM="3.5" `
    -DFETCHCONTENT_QUIET=OFF `
    -DCMAKE_BUILD_TYPE=Release `
    -DCMAKE_CXX_STANDARD=20 `
    -DCMAKE_CXX_STANDARD_REQUIRED=ON `
    -DCMAKE_CXX_FLAGS="/DWIN32 /D_WINDOWS /GR /EHsc /D_USE_MATH_DEFINES" `
    -DCMAKE_C_FLAGS="/DWIN32 /D_WINDOWS -Wno-implicit-function-declaration" `
    -DCMAKE_C_COMPILER=clang-cl `
    -DCMAKE_CXX_COMPILER=clang-cl `
    -DCMAKE_TOOLCHAIN_FILE="$env:CMAKE_TOOLCHAIN_FILE" `
    -DVCPKG_TARGET_TRIPLET=x64-windows `
    -DVCPKG_INSTALLED_DIR="$VcpkgInstalledDir" `
    -DGAUXC_ENABLE_OPENMP=ON `
    -DGAUXC_ENABLE_MPI=OFF `
    -DGAUXC_ENABLE_CUDA=OFF `
    -DGAUXC_ENABLE_HDF5=ON `
    -DGAUXC_ENABLE_MAGMA=OFF `
    -DGAUXC_ENABLE_CUTLASS=ON `
    -DEXCHCXX_ENABLE_LIBXC=OFF `
    -DGAUXC_ENABLE_TESTS=ON
cmake --build "$BuildDir" --parallel 6

# run the tests
$env:OMP_NUM_THREADS = 8
Push-Location "$BuildDir"
ctest --output-on-failure --verbose -E "GAUXC_MPI_TEST"

The 2nd_derivative_test and xc_integrator tests fail with OMP_NUM_THREADS > 2; the higher the thread count, the larger the numerical instabilities. Here are the results of these tests for 2, 4, and 8 OpenMP threads, showing the tests passing with OMP_NUM_THREADS = 2, failing with 4, and failing more dramatically with 8:


I've asked Claude to analyze the code, and it suspects the culprit is GauXC's XC integrator, which uses element-by-element #pragma omp atomic accumulation on shared matrices (inc_by_submat_atomic in util.hpp). Not sure if this makes sense...

The pattern is:

#pragma omp parallel for
for (auto& task : tasks) {
    // compute local XC contributions...
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            #pragma omp atomic
            VXC(i, j) += local_contribution(i, j);
        }
}

Instead of using thread-local buffers with a reduction, the code does element-by-element atomic accumulation on shared matrices (VXCs, VXCz, VXCy, VXCx). This is thousands of individual atomic operations per batch, on data that multiple threads are reading and writing simultaneously.

Why it fails with clang-cl/libomp but not GCC/libgomp

This isn't strictly a data race in the traditional sense: #pragma omp atomic does guarantee atomicity of each individual add. The problem is more subtle:

  1. libomp (LLVM/clang) uses more aggressive work-stealing and dynamic scheduling, so threads interleave more and atomic contention is much higher. Under heavy contention, the accumulated floating-point errors from non-deterministic operation ordering can cause the SCF to diverge.
  2. libgomp (GCC) uses simpler scheduling that happens to serialize work more, reducing contention and keeping the accumulation order more consistent across runs.
  3. The XC matrix contributions are not associative in floating point: (a + b) + c ≠ a + (b + c) due to rounding. With 8 threads doing thousands of atomics in unpredictable order, the accumulated matrix can differ enough between runs to push SCF convergence off a cliff, especially for sensitive systems like benzene/PBE.

"XC Integrator" test — two racy paths

1. eval_exc_vxc (VXC matrix failures)

Test → reference_replicated_xc_host_integrator_exc_vxc.hpp → exc_vxc_local_work_()
Racy sites:

  • Lines 500-503: scalar EXC_WORK/NEL_WORK atomic accumulation
  • Line 556: lwd->inc_vxc() → util.hpp:154-158, the core culprit: inc_by_submat_atomic() does element-by-element #pragma omp atomic into the shared VXC matrix
  • Lines 557-564: same for VXCz/VXCy/VXCx (UKS/GKS)

2. eval_exc_grad (gradient failures)

Test → reference_replicated_xc_host_integrator_exc_grad.hpp → exc_grad_local_work_()
Racy sites:

  • Lines 576-589: 3-6 #pragma omp atomic updates per atom into the shared EXC_GRAD[]
  • reference/weights.cxx:786-798 (Becke) and lines 948-978 (SSF): weight derivative atomics into exc_grad_w[]

"XC Integrator FXC" test — FXC contraction

Test → reference_replicated_xc_host_integrator_fxc_contraction.hpp → fxc_contraction_local_work_()
Racy sites:

  • Lines 530-531: scalar NEL_WORK atomic
  • Line 569: lwd->inc_vxc() → util.hpp:154-158 into the shared FXC matrix

Common bottleneck

All paths funnel through src/xc_integrator/local_work_driver/host/util.hpp:154-158:

#pragma omp atomic
ABig_use[ ii + jj * LDAB ] += ASmall_use[ ii + jj * LDAS ];

This is the single function (inc_by_submat_atomic) that needs to be replaced with thread-local accumulation + reduction to fix all three test failures.

...not sure about the quality of its suggested implementation, but I hope this helps.
