
XC integrator OpenMP problems with LLVM libomp on Windows (clang-cl) #196

@lorisercole

Description

While building qdk-chemistry (which depends on GauXC) with the clang-cl compiler on Windows (microsoft/qdk-chemistry#450), I noticed several test failures when setting OMP_NUM_THREADS > 2.
I've tracked the issue down to GauXC's XC integrator module, which shows numerical instabilities when using "too many" threads.

I built GauXC with clang-cl (MSVC LLVM compiler), with vcpkg providing OpenBLAS:

cmake -S "$RepoRoot" -B "$BuildDir" `
    -GNinja `
    -DCMAKE_POLICY_VERSION_MINIMUM="3.5" `
    -DFETCHCONTENT_QUIET=OFF `
    -DCMAKE_BUILD_TYPE=Release `
    -DCMAKE_CXX_STANDARD=20 `
    -DCMAKE_CXX_STANDARD_REQUIRED=ON `
    -DCMAKE_CXX_FLAGS="/DWIN32 /D_WINDOWS /GR /EHsc /D_USE_MATH_DEFINES" `
    -DCMAKE_C_FLAGS="/DWIN32 /D_WINDOWS -Wno-implicit-function-declaration" `
    -DCMAKE_C_COMPILER=clang-cl `
    -DCMAKE_CXX_COMPILER=clang-cl `
    -DCMAKE_TOOLCHAIN_FILE="$env:CMAKE_TOOLCHAIN_FILE" `
    -DVCPKG_TARGET_TRIPLET=x64-windows `
    -DVCPKG_INSTALLED_DIR="$VcpkgInstalledDir" `
    -DGAUXC_ENABLE_OPENMP=ON `
    -DGAUXC_ENABLE_MPI=OFF `
    -DGAUXC_ENABLE_CUDA=OFF `
    -DGAUXC_ENABLE_HDF5=ON `
    -DGAUXC_ENABLE_MAGMA=OFF `
    -DGAUXC_ENABLE_CUTLASS=ON `
    -DEXCHCXX_ENABLE_LIBXC=OFF `
    -DGAUXC_ENABLE_TESTS=ON
cmake --build "$BuildDir" --parallel 6

# run the tests
$env:OMP_NUM_THREADS = 8
Push-Location "$BuildDir"
ctest --output-on-failure --verbose -E "GAUXC_MPI_TEST"

The 2nd_derivative_test and xc_integrator tests fail with OMP_NUM_THREADS > 2; the higher the thread count, the larger the numerical instabilities. Here are the results of these tests for 2, 4, and 8 OpenMP threads, showing the tests passing with OMP_NUM_THREADS = 2, failing with 4, and failing more dramatically with 8:


I've asked Claude to analyze the code, and it suspects the culprit is GauXC's XC integrator, which uses element-by-element #pragma omp atomic accumulation on shared matrices (inc_by_submat_atomic in util.hpp). Not sure if this makes sense...

The pattern is:

#pragma omp parallel for
for (auto& task : tasks) {
    // compute local XC contributions...
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            #pragma omp atomic
            VXC(i, j) += local_contribution(i, j);
        }
}

Instead of using thread-local buffers with a reduction, the code does element-by-element atomic accumulation on shared matrices (VXCs, VXCz, VXCy, VXCx). This is thousands of individual atomic operations per batch, on data that multiple threads are reading and writing simultaneously.

Why it fails with clang-cl/libomp but not GCC/libgomp

This isn't strictly a data race in the traditional sense: #pragma omp atomic does guarantee atomicity of each individual add. The problem is more subtle:

  1. libomp (LLVM/clang) uses more aggressive work-stealing and dynamic scheduling, so threads interleave more and atomic contention is much higher. Under heavy contention, the accumulated floating-point errors from non-deterministic operation ordering can cause the SCF to diverge.
  2. libgomp (GCC) uses simpler scheduling that happens to serialize work more, reducing contention and keeping the accumulation order more consistent across runs.
  3. The XC matrix contributions are not associative in floating point: (a + b) + c ≠ a + (b + c) due to rounding. With 8 threads doing thousands of atomics in unpredictable order, the accumulated matrix can differ enough between runs to push SCF convergence off a cliff, especially for sensitive systems like benzene/PBE.

"XC Integrator" test — two racy paths

1. eval_exc_vxc (VXC matrix failures)

Test → reference_replicated_xc_host_integrator_exc_vxc.hpp → exc_vxc_local_work_()
Racy sites:

  • Lines 500-503: scalar EXC_WORK/NEL_WORK atomic accumulation
  • Line 556: lwd->inc_vxc() → util.hpp:154-158, the core culprit: inc_by_submat_atomic() does element-by-element #pragma omp atomic into the shared VXC matrix
  • Lines 557-564: same for VXCz/VXCy/VXCx (UKS/GKS)

2. eval_exc_grad (gradient failures)

Test → reference_replicated_xc_host_integrator_exc_grad.hpp → exc_grad_local_work_()
Racy sites:

  • Lines 576-589: 3-6 #pragma omp atomic updates per atom into the shared EXC_GRAD[]
  • reference/weights.cxx:786-798 (Becke) and lines 948-978 (SSF): weight derivative atomics into exc_grad_w[]

"XC Integrator FXC" test — FXC contraction

Test → reference_replicated_xc_host_integrator_fxc_contraction.hpp → fxc_contraction_local_work_()
Racy sites:

  • Lines 530-531: scalar NEL_WORK atomic
  • Line 569: lwd->inc_vxc() → util.hpp:154-158 into the shared FXC matrix

Common bottleneck

All paths funnel through src/xc_integrator/local_work_driver/host/util.hpp:154-158:

#pragma omp atomic
ABig_use[ ii + jj * LDAB ] += ASmall_use[ ii + jj * LDAS ];

This is the single function (inc_by_submat_atomic) that needs to be replaced with thread-local accumulation + reduction to fix all three test failures.

...not sure about the quality of its suggested implementation, but I hope this helps.
