Skip to content

Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177

Open
mooreneural wants to merge 3 commits into
NVIDIA-Digital-Bio:mainfrom
mooreneural:nvMolKit
Open

Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177
mooreneural wants to merge 3 commits into
NVIDIA-Digital-Bio:mainfrom
mooreneural:nvMolKit

Conversation

@mooreneural
Copy link
Copy Markdown

Scientific correctness and GPU performance improvements across six subsystems. No public API changes.

Correctness fixes

  • tfd_kernels.cu — Use double-precision sqrt for TFD pair-index decoding (float32 loses bits past ~1415 conformers); guard ring-torsion branch against numQ==0 division by zero.
  • bfgs_minimize.cu — Use |energy| in BFGS gradient-convergence denominator; raw negative energy was clamped to 1, silently tightening convergence.
  • triangle_smooth.cu — Replace cudaDeviceSynchronize() with cudaStreamSynchronize(data_.stream()) in copyToHost, scoping the sync to the correct stream.

Performance fixes

  • conformer_rmsd.cu — Replace 17 sequential cub::BlockReduce + __syncthreads() calls in computePairRmsd with warp-shuffle reductions. Sync count drops 17→3 (Kabsch) or 17→1 (prealigned); shared memory drops ~1120→400 bytes per block.
  • symmetric_eigensolver.cu — Move AsyncDeviceVector from a per-call heap allocation to a persistent states_ member of BatchedEigenSolver::Impl; reused across calls, grows only when batch size increases.
  • similarity_kernels.cu — Add per-device int8_t cache for tensor-op capability; cudaGetDeviceProperties called at most once per device per process instead of on every similarity launch.
  • clustering.py (fused_butina) — Reduce GPU↔CPU syncs from ~6 to 2 per iteration by batching neigh max+argmax into one tolist() call, batching cluster_count+is_free into one tolist() call, and maintaining a CPU mirror of the indices tensor.

Test plan

  • Conformer RMSD tests (prealigned and Kabsch paths)
  • TFD tests with >1415 conformers and ring torsions with 0 quartets
  • Fused Butina produces identical cluster assignments
  • BFGS converges correctly for negative-energy geometries
  • Similarity tests on sm_80+ and pre-Ampere hardware
  • Eigensolver convergence unchanged across multiple calls on same instance

…FGS, and fused Butina

Scientific / correctness:
- conformer_rmsd.cu: no change to algorithm, only implementation (see below)
- tfd_kernels.cu: use double-precision sqrt for pair-index decoding (float32
  loses bits past ~1415 conformers); guard ring-torsion path against numQ==0
  division by zero
- bfgs_minimize.cu: use |energy| in gradient convergence denominator so
  negative-energy geometries mid-minimisation don't tighten the threshold
  asymmetrically
- triangle_smooth.cu: replace cudaDeviceSynchronize() with stream-scoped
  cudaStreamSynchronize in copyToHost, avoiding a device-wide stall

Performance:
- conformer_rmsd.cu: replace 17 sequential cub::BlockReduce + __syncthreads()
  calls in computePairRmsd with warp-shuffle reductions (warpSumDouble) and a
  single 400-byte shared buffer; reduces sync count to 3 for the alignment path
  and 1 for the prealigned path, and cuts shared memory from ~1120 to 400 bytes
- symmetric_eigensolver.cu: move AsyncDeviceVector<curandState> from a
  stack-local allocation inside launchBatchEigensolverKernel (re-allocated on
  every call) to a persistent states_ member of BatchedEigenSolver::Impl,
  resizing only when the batch grows
- similarity_kernels.cu: add a per-device int8_t cache (g_tensorOpsCache[16])
  for isTensorOpsSupportedCached(); cudaGetDeviceProperties is now called at
  most once per device per process instead of on every similarity launch
- clustering.py (fused_butina): reduce GPU↔CPU synchronisation from ~6 D2H
  syncs per clustering iteration to 2 by (a) batching neigh.max() +
  flip.argmax() into one torch.stack().tolist() call, (b) batching
  cluster_count + is_free into one torch.cat().tolist() call, and (c)
  maintaining a CPU mirror of the indices tensor to avoid a per-centroid sync
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

Six targeted correctness and performance fixes across independent subsystems, with no public API changes. Each fix is well-scoped and the changes are individually verifiable.

  • Correctness: TFD pair-index decode promoted to double sqrt (float32 precision loss past ~1415 conformers), ring-torsion numQ==0 divide-by-zero guarded, BFGS gradient denominator uses fabs(energy) to avoid silent convergence tightening on negative-energy geometries, and triangle_smooth stream sync scoped to the correct stream with error checking.
  • Performance: RMSD warp-shuffle replaces 17 sequential cub::BlockReduce calls (sync count 17→3 Kabsch / 17→1 prealigned, shared memory ~1120→400 bytes), eigensolver curandState buffer persists across calls, similarity kernel device-capability query cached per device using std::atomic<int8_t>, and fused_butina GPU↔CPU syncs reduced from ~6 to 2 per iteration by batching reads and maintaining a CPU mirror of the indices tensor.

Confidence Score: 5/5

All changes are well-isolated, individually verifiable fixes; no shared state is introduced and no existing contracts are broken.

Each subsystem change is small and self-contained: the warp-shuffle barriers are correctly placed, the CPU indices mirror in fused_butina stays coherent because the underlying Triton kernel never writes back to the indices tensor, and all correctness fixes (fabs, double sqrt, numQ guard, stream sync) are straightforward one-to-few-line changes with clear before/after semantics.

No files require special attention; src/similarity_kernels.cu has outstanding CUDA error-return handling gaps in isTensorOpsSupportedCached that were noted in a prior review thread but are not introduced by this PR.

Important Files Changed

Filename Overview
nvmolkit/clustering.py Reduces GPU↔CPU syncs by batching max/argmax and cluster_count/is_free reads; CPU mirror of indices is kept in sync because extract_cluster_and_singletons only reads indices, never writes to it.
src/conformer_rmsd.cu Replaces 17 sequential cub::BlockReduce + __syncthreads() calls with a two-phase warp-shuffle reduction; shared memory usage drops to 400 bytes. Synchronization barriers are correctly placed in all three phases.
src/minimizer/bfgs_minimize.cu One-line correctness fix: fabs(energies[sysIdx]) prevents negative energies from clamping the gradient convergence denominator to 1.
src/similarity_kernels.cu Adds per-device std::atomic<int8_t> cache so cudaGetDeviceProperties is called at most once per device. The previous atomic-UB concern is resolved; CUDA error-return handling on cudaGetDevice/cudaGetDeviceProperties inside the new helper remains absent.
src/symmetric_eigensolver.cu Moves AsyncDeviceVector<curandState> from per-call allocation to a persistent states_ member; buffer grows lazily when batch size increases, matching previous allocation semantics on first use and new calls.
src/tfd/tfd_kernels.cu Fixes pair-index decode to use double sqrt (avoiding float32 precision loss past ~1415 conformers) and guards the ring-torsion branch against numQ == 0 division by zero.
src/triangle_smooth.cu Replaces device-wide cudaDeviceSynchronize() with stream-scoped cudaCheckError(cudaStreamSynchronize(data_.stream())), correctly limiting the sync scope and adding error propagation.

Reviews (3): Last reviewed commit: "Fix clang-format and ruff format violati..." | Re-trigger Greptile

Comment thread src/similarity_kernels.cu Outdated
Comment thread src/similarity_kernels.cu
Comment on lines +76 to +88
bool isTensorOpsSupportedCached() {
int device;
cudaGetDevice(&device);
if (device >= 0 && device < kMaxDevices && g_tensorOpsCache[device] != 0) {
return g_tensorOpsCache[device] == 2;
}
cudaDeviceProp props;
cudaGetDeviceProperties(&props, device);
const bool result = supportsTensorOps(props.major, props.minor);
if (device >= 0 && device < kMaxDevices) {
g_tensorOpsCache[device] = result ? 2 : 1;
}
return result;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Missing CUDA error checks in cached path

The original call sites used cudaCheckError(cudaGetDeviceProperties(&deviceProp, device)), but isTensorOpsSupportedCached() calls both cudaGetDevice and cudaGetDeviceProperties without checking their return codes. If cudaGetDevice fails, device is uninitialized; the guard device >= 0 && device < kMaxDevices might still pass with a garbage value, and the subsequent cudaGetDeviceProperties with that value would silently populate props with garbage — causing supportsTensorOps to select the wrong kernel path without any diagnostic.

mooreneural and others added 2 commits May 18, 2026 19:53
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
similarity_kernels.cu: add blank line between constexpr int kMaxDevices and
int8_t g_tensorOpsCache declarations to break AlignConsecutiveDeclarations
grouping — mixed-type consecutive declarations produced formatting that
differed from what clang-format expected.

clustering.py: collapse torch.stack([...]) call to Black-style multiline form
without magic trailing comma; previous multiline-with-trailing-comma form
triggered Black's trailing-comma expansion rule, failing ruff format.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant