Add downloadable agent skill and unit test for snippets#175
Conversation
…-Bio#140) Introduce the data structures needed to return on-device coordinates from ETKDG/MMFF/UFF without packing them back into RDKit conformers. C++: - New header src/device_coord_result.h defining `CoordinateOutput` enum and the flat CSR-style `DeviceCoordResult` struct (positions, atom_starts, mol/conf indices, optional energies/converged, gpu_id). - nvmolkit/array_helpers.h: add `makePyArrayBorrowed` and an `owned` flag on `PyArray` for non-owning views over persistent device buffers. Python: - nvmolkit/types.py: add `CoordinateOutput` enum and `DeviceCoordResult` class with `per_molecule()` (lazy nested torch views) and `dense()` (padded materialization). `AsyncGpuResult` accepts an optional `gpu_id` so buffers on a non-default device are wrapped onto the correct torch device. Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Introduces a small `coord_collect` library used by upcoming ETKDG/MMFF/UFF device-output paths to consolidate per-GPU result buffers onto a single target GPU. - `copyDeviceToDeviceAsync`: peer-to-peer (or system fallback) async copy with explicit cross-stream ordering via a CUDA event, plus a same-GPU fast path that degenerates to `cudaMemcpyAsync`. - `enablePeerAccess`: best-effort bidirectional peer access enable, used once during driver setup so subsequent `cudaMemcpyPeerAsync` calls take the fast path. - gtest coverage for same-GPU copy, zero-byte no-op, self peer enable, and a multi-GPU copy that skips on single-GPU hosts. Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
nMols is the authoritative outer-list size for per-molecule views; it includes molecules that produced zero conformers, which the old max()+1 approach silently dropped. This fixes per_molecule() silently truncating the outer list when high-indexed molecules produce zero conformers (e.g. partial ETKDG success). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CoordinateOutput::DEVICE to embedMolecules. In DEVICE mode, optimized ETKDG coordinates stay on the GPU and are returned as a flat-CSR DeviceCoordResult collected onto a single target GPU. RDKit conformer lists are not modified in this mode. Conformer pruning is rejected when DEVICE is selected (follow-up). Implementation: - New `etkdg_device_collect` library: thread-local DeviceCoordCollector, shared cap state for per-molecule conformer count limits, a 4D->3D packing kernel (`appendActive`), an ETKDGCollectDeviceCoordsStage, and `finalizeOnTarget` that uses copyDeviceToDeviceAsync to consolidate per-thread partials onto the target GPU. - `embedMolecules` now returns `std::optional<DeviceCoordResult>`, accepts `output` and `targetGpu` parameters, swaps in the collect stage in DEVICE mode, and skips the host conformer-pruning post-pass. - Added a sortedMols->original-input index map so user-visible molIndices reflect the caller's input order rather than the internal sorted order. Tests (test_etkdg_device_output): shape, multi-molecule indexing, multi- conformer indexing, and pruning-rejection. Bit-exact comparison with the host path is intentionally not asserted because ETKDG is non-deterministic across calls due to random initialization of multiple parallel attempts. Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Add nMols parameter to ETKDG finalizeOnTarget so the returned DeviceCoordResult.nMols reflects the full input batch size, including molecules that produced zero conformers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CoordinateOutput::DEVICE to MMFFMinimizeMoleculesConfs and
UFFMinimizeMoleculesConfs. In DEVICE mode, optimized coordinates,
energies, and convergence flags stay on the GPU and are returned via the
new `device` field of {MMFF,UFF}MinimizeResult, collected onto a single
target GPU. RDKit conformer lists are not modified in this mode.
Implementation:
- New `ff_device_collect` library (src/minimizer/): per-thread
FFDeviceCoordCollector accumulating positions, energies, and converged
flags on the executing GPU; `appendBatch` issues async D2D copies into
the collector and downloads only the small statuses buffer to derive
int8 convergence flags; `finalizeOnTarget` consolidates per-thread
partials onto the target GPU using copyDeviceToDeviceAsync, builds CSR
atom_starts on the host, and uploads the index arrays.
- `bfgs_mmff.cpp` / `bfgs_uff.cpp`: in DEVICE mode, skip the host
position/energy copy + writeBackResults and instead append directly into
a thread-local FFDeviceCoordCollector; finalize after the OMP loop.
- `MMFFMinimizeResult` / `UFFMinimizeResult`: add an optional
`device` field to carry the on-device result.
Tests (test_ff_device_output): MMFF / UFF shape on ethanol, multi-mol
correct CSR indexing for MMFF, empty-input handling. All four pass on
single-GPU; existing MMFF/UFF tests remain untouched.
Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Adds an optional `deviceInput` parameter to MMFFMinimizeMoleculesConfs and UFFMinimizeMoleculesConfs. When supplied, starting coordinates are taken from the on-GPU DeviceCoordResult instead of from each conformer's host positions, broadcast to every executing GPU before BFGS runs. Each input mol still needs at least one RDKit conformer for force-field construction (consulted only for hybridization/VDW special-case checks; the actual positions used for minimization come from device_input). Implementation: - ff_device_collect: DeviceInputIndex builder validates the device-input conformer set matches the host-flattened (mol, conf) labels exactly and precomputes per-conformer source offsets. broadcastDeviceInputBatch then issues one cross-GPU async copy per batch slot using copyDeviceToDeviceAsync. - bfgs_mmff / bfgs_uff: build the index up front, share it with workers, precompute per-batch (srcIdx, atomCount) lists, and overwrite positions on each executing GPU after the host->device upload. Constraints + device_input is rejected with a clear error message (handling that combination is a follow-up). Tests (test_ff_device_output): MMFF round-trip with device_input matches no-input within BFGS tolerance, mismatched conformer-count and constraints combinations are rejected. Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Adds DEVICE-mode output to MMFFBatchedForcefield and UFFBatchedForcefield
so chained Python workflows can read on-GPU energies / gradients /
positions / minimize results without the existing device->host->device
triple-pump.
C++ (NativeMMFFBatchedForcefield / NativeUFFBatchedForcefield):
- Track the wrapper's home GPU at construction time.
- New `computeEnergyDevice` / `computeGradientsDevice` return borrowed
PyArrays over the wrapper's persistent energy/grad buffers.
- New `positionsDevice` returns a borrowed (n_atoms, 3) PyArray over the
wrapper's persistent positions buffer.
- New `indexBuffers` exposes (atom_starts, mol_indices, conf_indices) as
AsyncGpuResults.
- New `minimizeDevice` plumbs through to MMFF/UFF MinimizeMoleculesConfs
with output=DEVICE, refreshes the wrapper's positions buffer in-place
via a single D2D memcpy on the same stream (no host roundtrip), and
returns a Python DeviceCoordResult.
Python (`MMFFBatchedForcefield`, `UFFBatchedForcefield`):
- `compute_energy(output=DEVICE)` returns AsyncGpuResult of total energies.
- `compute_gradients(output=DEVICE)` returns flat AsyncGpuResult of
gradients in CSR layout.
- New `positions()` and `indices()` accessors for the wrapper's
persistent device state.
- `minimize(output=DEVICE, target_gpu=None)` returns DeviceCoordResult.
- Imports `_arrayHelpers` so PyArray returns are usable.
CMake:
- Add `${CMAKE_SOURCE_DIR}/src` to bfgs_mmff/bfgs_uff PUBLIC includes so
consumers can reach `device_coord_result.h`.
- Link `_batchedForcefield` against `coord_collect` and `device`.
Tests (test_batched_forcefield_device.py): compute_energy / gradients /
indices / positions accessor / minimize device-mode equivalence with
host mode for both MMFF and UFF. 10 new tests pass; 30 existing
batched-FF tests continue to pass.
Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Add nMols parameter to FF finalizeOnTarget so the returned DeviceCoordResult.nMols reflects the full input batch size. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ship a SKILL.md file under agent-skills/nvmolkit-usage/ that downstream developers can copy into their own .cursor/skills/ to help an agent write correct nvMolKit-using code. The skill covers the public Python entry points, the AsyncGpuResult contract, runtime requirements, runnable recipes (Morgan + Tanimoto, ETKDG embed, MMFF minimize, single- GPU HardwareOptions tuning), and reference tables for HardwareOptions and SubstructSearchConfig. Advertise the skill from README.md and add a short page to the docs site (docs/agent_skill.rst, wired into the Guides toctree) so it's discoverable from the rendered docs.
…s toctree
The skill PR had been carrying device-output additions to
nvmolkit/batchedForcefield.{cpp,py} via a merge from the device-output
branch; those belong in their own PR and are reverted here to match main.
docs/agent_skill.rst was an orphan file - added to the Guides toctree
in docs/index.rst, dropped the stale per-feature enumeration, and
replaced the dangling "see the install section above" with a proper
cross-reference to a new :ref:`installation` label.
|
| Filename | Overview |
|---|---|
| nvmolkit/tests/test_skill.py | New test that extracts and runs every Python code block from SKILL.md via subprocess; parametrize call happens eagerly at collection time (see existing thread). |
| agent-skills/nvmolkit-usage/SKILL.md | New agent skill file with entry-point table, runtime requirements, result type docs, and six self-contained runnable recipes. |
| docs/agent_skill.rst | New RST page documenting how to install and use the Cursor agent skill; correctly cross-references the installation section. |
| docs/index.rst | Adds .. _installation: label and wires agent_skill into the toctree; straightforward docs wiring. |
| README.md | Adds a short section pointing users to the agent skill directory; no issues. |
Reviews (3): Last reviewed commit: "Merge branch 'main' into agent/nvmolkit-..." | Re-trigger Greptile
evasnow1992
left a comment
There was a problem hiding this comment.
Thank you for adding this skill. Changes look good to me!
No description provided.