Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction by mooreneural · Pull Request #177 · NVIDIA-Digital-Bio/nvMolKit

mooreneural · 2026-05-19T00:45:06Z

Scientific correctness and GPU performance improvements across six subsystems. No public API changes.

Correctness fixes

tfd_kernels.cu — Use double-precision sqrt for TFD pair-index decoding (float32 loses bits past ~1415 conformers); guard ring-torsion branch against numQ==0 division by zero.
bfgs_minimize.cu — Use |energy| in BFGS gradient-convergence denominator; raw negative energy was clamped to 1, silently tightening convergence.
triangle_smooth.cu — Replace cudaDeviceSynchronize() with cudaStreamSynchronize(data_.stream()) in copyToHost, scoping the sync to the correct stream.

Performance fixes

conformer_rmsd.cu — Replace 17 sequential cub::BlockReduce + __syncthreads() calls in computePairRmsd with warp-shuffle reductions. Sync count drops 17→3 (Kabsch) or 17→1 (prealigned); shared memory drops ~1120→400 bytes per block.
symmetric_eigensolver.cu — Move AsyncDeviceVector from a per-call heap allocation to a persistent states_ member of BatchedEigenSolver::Impl; reused across calls, grows only when batch size increases.
similarity_kernels.cu — Add per-device int8_t cache for tensor-op capability; cudaGetDeviceProperties called at most once per device per process instead of on every similarity launch.
clustering.py (fused_butina) — Reduce GPU↔CPU syncs from ~6 to 2 per iteration by batching neigh max+argmax into one tolist() call, batching cluster_count+is_free into one tolist() call, and maintaining a CPU mirror of the indices tensor.

Test plan

Conformer RMSD tests (prealigned and Kabsch paths)
TFD tests with >1415 conformers and ring torsions with 0 quartets
Fused Butina produces identical cluster assignments
BFGS converges correctly for negative-energy geometries
Similarity tests on sm_80+ and pre-Ampere hardware
Eigensolver convergence unchanged across multiple calls on same instance

…FGS, and fused Butina Scientific / correctness: - conformer_rmsd.cu: no change to algorithm, only implementation (see below) - tfd_kernels.cu: use double-precision sqrt for pair-index decoding (float32 loses bits past ~1415 conformers); guard ring-torsion path against numQ==0 division by zero - bfgs_minimize.cu: use |energy| in gradient convergence denominator so negative-energy geometries mid-minimisation don't tighten the threshold asymmetrically - triangle_smooth.cu: replace cudaDeviceSynchronize() with stream-scoped cudaStreamSynchronize in copyToHost, avoiding a device-wide stall Performance: - conformer_rmsd.cu: replace 17 sequential cub::BlockReduce + __syncthreads() calls in computePairRmsd with warp-shuffle reductions (warpSumDouble) and a single 400-byte shared buffer; reduces sync count to 3 for the alignment path and 1 for the prealigned path, and cuts shared memory from ~1120 to 400 bytes - symmetric_eigensolver.cu: move AsyncDeviceVector<curandState> from a stack-local allocation inside launchBatchEigensolverKernel (re-allocated on every call) to a persistent states_ member of BatchedEigenSolver::Impl, resizing only when the batch grows - similarity_kernels.cu: add a per-device int8_t cache (g_tensorOpsCache[16]) for isTensorOpsSupportedCached(); cudaGetDeviceProperties is now called at most once per device per process instead of on every similarity launch - clustering.py (fused_butina): reduce GPU↔CPU synchronisation from ~6 D2H syncs per clustering iteration to 2 by (a) batching neigh.max() + flip.argmax() into one torch.stack().tolist() call, (b) batching cluster_count + is_free into one torch.cat().tolist() call, and (c) maintaining a CPU mirror of the indices tensor to avoid a per-centroid sync

greptile-apps · 2026-05-19T00:50:43Z

Greptile Summary

Six targeted correctness and performance fixes across independent subsystems, with no public API changes. Each fix is well-scoped and the changes are individually verifiable.

Correctness: TFD pair-index decode promoted to double sqrt (float32 precision loss past ~1415 conformers), ring-torsion numQ==0 divide-by-zero guarded, BFGS gradient denominator uses fabs(energy) to avoid silent convergence tightening on negative-energy geometries, and triangle_smooth stream sync scoped to the correct stream with error checking.
Performance: RMSD warp-shuffle replaces 17 sequential cub::BlockReduce calls (sync count 17→3 Kabsch / 17→1 prealigned, shared memory ~1120→400 bytes), eigensolver curandState buffer persists across calls, similarity kernel device-capability query cached per device using std::atomic<int8_t>, and fused_butina GPU↔CPU syncs reduced from ~6 to 2 per iteration by batching reads and maintaining a CPU mirror of the indices tensor.

Confidence Score: 5/5

All changes are well-isolated, individually verifiable fixes; no shared state is introduced and no existing contracts are broken.

Each subsystem change is small and self-contained: the warp-shuffle barriers are correctly placed, the CPU indices mirror in fused_butina stays coherent because the underlying Triton kernel never writes back to the indices tensor, and all correctness fixes (fabs, double sqrt, numQ guard, stream sync) are straightforward one-to-few-line changes with clear before/after semantics.

No files require special attention; src/similarity_kernels.cu has outstanding CUDA error-return handling gaps in isTensorOpsSupportedCached that were noted in a prior review thread but are not introduced by this PR.

Important Files Changed

Filename	Overview
nvmolkit/clustering.py	Reduces GPU↔CPU syncs by batching max/argmax and cluster_count/is_free reads; CPU mirror of `indices` is kept in sync because `extract_cluster_and_singletons` only reads `indices`, never writes to it.
src/conformer_rmsd.cu	Replaces 17 sequential cub::BlockReduce + __syncthreads() calls with a two-phase warp-shuffle reduction; shared memory usage drops to 400 bytes. Synchronization barriers are correctly placed in all three phases.
src/minimizer/bfgs_minimize.cu	One-line correctness fix: `fabs(energies[sysIdx])` prevents negative energies from clamping the gradient convergence denominator to 1.
src/similarity_kernels.cu	Adds per-device `std::atomic<int8_t>` cache so `cudaGetDeviceProperties` is called at most once per device. The previous atomic-UB concern is resolved; CUDA error-return handling on `cudaGetDevice`/`cudaGetDeviceProperties` inside the new helper remains absent.
src/symmetric_eigensolver.cu	Moves `AsyncDeviceVector<curandState>` from per-call allocation to a persistent `states_` member; buffer grows lazily when batch size increases, matching previous allocation semantics on first use and new calls.
src/tfd/tfd_kernels.cu	Fixes pair-index decode to use `double` sqrt (avoiding float32 precision loss past ~1415 conformers) and guards the ring-torsion branch against `numQ == 0` division by zero.
src/triangle_smooth.cu	Replaces device-wide `cudaDeviceSynchronize()` with stream-scoped `cudaCheckError(cudaStreamSynchronize(data_.stream()))`, correctly limiting the sync scope and adding error propagation.

_{Reviews (3): Last reviewed commit: "Fix clang-format and ruff format violati..." | Re-trigger Greptile}

greptile-apps · 2026-05-19T00:50:47Z

+bool isTensorOpsSupportedCached() {
+  int device;
+  cudaGetDevice(&device);
+  if (device >= 0 && device < kMaxDevices && g_tensorOpsCache[device] != 0) {
+    return g_tensorOpsCache[device] == 2;
+  }
+  cudaDeviceProp props;
+  cudaGetDeviceProperties(&props, device);
+  const bool result = supportsTensorOps(props.major, props.minor);
+  if (device >= 0 && device < kMaxDevices) {
+    g_tensorOpsCache[device] = result ? 2 : 1;
+  }
+  return result;


Missing CUDA error checks in cached path

The original call sites used cudaCheckError(cudaGetDeviceProperties(&deviceProp, device)), but isTensorOpsSupportedCached() calls both cudaGetDevice and cudaGetDeviceProperties without checking their return codes. If cudaGetDevice fails, device is uninitialized; the guard device >= 0 && device < kMaxDevices might still pass with a garbage value, and the subsequent cudaGetDeviceProperties with that value would silently populate props with garbage — causing supportsTensorOps to select the wrong kernel path without any diagnostic.

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

similarity_kernels.cu: add blank line between constexpr int kMaxDevices and int8_t g_tensorOpsCache declarations to break AlignConsecutiveDeclarations grouping — mixed-type consecutive declarations produced formatting that differed from what clang-format expected. clustering.py: collapse torch.stack([...]) call to Black-style multiline form without magic trailing comma; previous multiline-with-trailing-comma form triggered Black's trailing-comma expansion rule, failing ruff format.

greptile-apps Bot reviewed May 19, 2026

View reviewed changes

mooreneural and others added 2 commits May 18, 2026 19:53

Update src/similarity_kernels.cu

97dc55b

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177

Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177
mooreneural wants to merge 3 commits into
NVIDIA-Digital-Bio:mainfrom
mooreneural:nvMolKit

mooreneural commented May 19, 2026

Uh oh!

greptile-apps Bot commented May 19, 2026 •

edited

Loading

Greptile Summary

Uh oh!

Uh oh!

greptile-apps Bot May 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mooreneural commented May 19, 2026

Scientific correctness and GPU performance improvements across six subsystems. No public API changes.

Correctness fixes

Performance fixes

Test plan

Uh oh!

greptile-apps Bot commented May 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Uh oh!

Uh oh!

greptile-apps Bot May 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

greptile-apps Bot commented May 19, 2026 •

edited

Loading