Perf & correctness fixes: RMSD warp-shuffle, TFD precision, eigensolver persistence, similarity cache, BFGS sign fix, fused Butina sync reduction#177
Conversation
…FGS, and fused Butina Scientific / correctness: - conformer_rmsd.cu: no change to algorithm, only implementation (see below) - tfd_kernels.cu: use double-precision sqrt for pair-index decoding (float32 loses bits past ~1415 conformers); guard ring-torsion path against numQ==0 division by zero - bfgs_minimize.cu: use |energy| in gradient convergence denominator so negative-energy geometries mid-minimisation don't tighten the threshold asymmetrically - triangle_smooth.cu: replace cudaDeviceSynchronize() with stream-scoped cudaStreamSynchronize in copyToHost, avoiding a device-wide stall Performance: - conformer_rmsd.cu: replace 17 sequential cub::BlockReduce + __syncthreads() calls in computePairRmsd with warp-shuffle reductions (warpSumDouble) and a single 400-byte shared buffer; reduces sync count to 3 for the alignment path and 1 for the prealigned path, and cuts shared memory from ~1120 to 400 bytes - symmetric_eigensolver.cu: move AsyncDeviceVector<curandState> from a stack-local allocation inside launchBatchEigensolverKernel (re-allocated on every call) to a persistent states_ member of BatchedEigenSolver::Impl, resizing only when the batch grows - similarity_kernels.cu: add a per-device int8_t cache (g_tensorOpsCache[16]) for isTensorOpsSupportedCached(); cudaGetDeviceProperties is now called at most once per device per process instead of on every similarity launch - clustering.py (fused_butina): reduce GPU↔CPU synchronisation from ~6 D2H syncs per clustering iteration to 2 by (a) batching neigh.max() + flip.argmax() into one torch.stack().tolist() call, (b) batching cluster_count + is_free into one torch.cat().tolist() call, and (c) maintaining a CPU mirror of the indices tensor to avoid a per-centroid sync
|
| Filename | Overview |
|---|---|
| nvmolkit/clustering.py | Reduces GPU↔CPU syncs by batching max/argmax and cluster_count/is_free reads; CPU mirror of indices is kept in sync because extract_cluster_and_singletons only reads indices, never writes to it. |
| src/conformer_rmsd.cu | Replaces 17 sequential cub::BlockReduce + __syncthreads() calls with a two-phase warp-shuffle reduction; shared memory usage drops to 400 bytes. Synchronization barriers are correctly placed in all three phases. |
| src/minimizer/bfgs_minimize.cu | One-line correctness fix: fabs(energies[sysIdx]) prevents negative energies from clamping the gradient convergence denominator to 1. |
| src/similarity_kernels.cu | Adds per-device std::atomic<int8_t> cache so cudaGetDeviceProperties is called at most once per device. The previous atomic-UB concern is resolved; CUDA error-return handling on cudaGetDevice/cudaGetDeviceProperties inside the new helper remains absent. |
| src/symmetric_eigensolver.cu | Moves AsyncDeviceVector<curandState> from per-call allocation to a persistent states_ member; buffer grows lazily when batch size increases, matching previous allocation semantics on first use and new calls. |
| src/tfd/tfd_kernels.cu | Fixes pair-index decode to use double sqrt (avoiding float32 precision loss past ~1415 conformers) and guards the ring-torsion branch against numQ == 0 division by zero. |
| src/triangle_smooth.cu | Replaces device-wide cudaDeviceSynchronize() with stream-scoped cudaCheckError(cudaStreamSynchronize(data_.stream())), correctly limiting the sync scope and adding error propagation. |
Reviews (3): Last reviewed commit: "Fix clang-format and ruff format violati..." | Re-trigger Greptile
| bool isTensorOpsSupportedCached() { | ||
| int device; | ||
| cudaGetDevice(&device); | ||
| if (device >= 0 && device < kMaxDevices && g_tensorOpsCache[device] != 0) { | ||
| return g_tensorOpsCache[device] == 2; | ||
| } | ||
| cudaDeviceProp props; | ||
| cudaGetDeviceProperties(&props, device); | ||
| const bool result = supportsTensorOps(props.major, props.minor); | ||
| if (device >= 0 && device < kMaxDevices) { | ||
| g_tensorOpsCache[device] = result ? 2 : 1; | ||
| } | ||
| return result; |
There was a problem hiding this comment.
Missing CUDA error checks in cached path
The original call sites used cudaCheckError(cudaGetDeviceProperties(&deviceProp, device)), but isTensorOpsSupportedCached() calls both cudaGetDevice and cudaGetDeviceProperties without checking their return codes. If cudaGetDevice fails, device is uninitialized; the guard device >= 0 && device < kMaxDevices might still pass with a garbage value, and the subsequent cudaGetDeviceProperties with that value would silently populate props with garbage — causing supportsTensorOps to select the wrong kernel path without any diagnostic.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
similarity_kernels.cu: add blank line between constexpr int kMaxDevices and int8_t g_tensorOpsCache declarations to break AlignConsecutiveDeclarations grouping — mixed-type consecutive declarations produced formatting that differed from what clang-format expected. clustering.py: collapse torch.stack([...]) call to Black-style multiline form without magic trailing comma; previous multiline-with-trailing-comma form triggered Black's trailing-comma expansion rule, failing ruff format.
Scientific correctness and GPU performance improvements across six subsystems. No public API changes.
Correctness fixes
Performance fixes
Test plan