Skip to content

Add downloadable agent skill and unit test for snippets#175

Merged
scal444 merged 20 commits into
NVIDIA-Digital-Bio:mainfrom
scal444:agent/nvmolkit-usage-skill
May 18, 2026
Merged

Add downloadable agent skill and unit test for snippets#175
scal444 merged 20 commits into
NVIDIA-Digital-Bio:mainfrom
scal444:agent/nvmolkit-usage-skill

Conversation

@scal444
Copy link
Copy Markdown
Collaborator

@scal444 scal444 commented May 18, 2026

No description provided.

scal444 and others added 18 commits April 30, 2026 12:07
…-Bio#140)

Introduce the data structures needed to return on-device coordinates from
ETKDG/MMFF/UFF without packing them back into RDKit conformers.

C++:
- New header src/device_coord_result.h defining `CoordinateOutput` enum and
  the flat CSR-style `DeviceCoordResult` struct (positions, atom_starts,
  mol/conf indices, optional energies/converged, gpu_id).
- nvmolkit/array_helpers.h: add `makePyArrayBorrowed` and an `owned` flag on
  `PyArray` for non-owning views over persistent device buffers.

Python:
- nvmolkit/types.py: add `CoordinateOutput` enum and `DeviceCoordResult`
  class with `per_molecule()` (lazy nested torch views) and `dense()`
  (padded materialization). `AsyncGpuResult` accepts an optional `gpu_id`
  so buffers on a non-default device are wrapped onto the correct torch
  device.

Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Introduces a small `coord_collect` library used by upcoming ETKDG/MMFF/UFF
device-output paths to consolidate per-GPU result buffers onto a single
target GPU.

- `copyDeviceToDeviceAsync`: peer-to-peer (or system fallback) async copy
  with explicit cross-stream ordering via a CUDA event, plus a same-GPU
  fast path that degenerates to `cudaMemcpyAsync`.
- `enablePeerAccess`: best-effort bidirectional peer access enable, used
  once during driver setup so subsequent `cudaMemcpyPeerAsync` calls take
  the fast path.
- gtest coverage for same-GPU copy, zero-byte no-op, self peer enable, and
  a multi-GPU copy that skips on single-GPU hosts.

Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
nMols is the authoritative outer-list size for per-molecule views; it
includes molecules that produced zero conformers, which the old max()+1
approach silently dropped. This fixes per_molecule() silently truncating
the outer list when high-indexed molecules produce zero conformers (e.g.
partial ETKDG success).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CoordinateOutput::DEVICE to embedMolecules. In DEVICE mode, optimized
ETKDG coordinates stay on the GPU and are returned as a flat-CSR
DeviceCoordResult collected onto a single target GPU. RDKit conformer lists
are not modified in this mode. Conformer pruning is rejected when DEVICE is
selected (follow-up).

Implementation:
- New `etkdg_device_collect` library: thread-local DeviceCoordCollector,
  shared cap state for per-molecule conformer count limits, a 4D->3D
  packing kernel (`appendActive`), an ETKDGCollectDeviceCoordsStage, and
  `finalizeOnTarget` that uses copyDeviceToDeviceAsync to consolidate
  per-thread partials onto the target GPU.
- `embedMolecules` now returns `std::optional<DeviceCoordResult>`, accepts
  `output` and `targetGpu` parameters, swaps in the collect stage in DEVICE
  mode, and skips the host conformer-pruning post-pass.
- Added a sortedMols->original-input index map so user-visible molIndices
  reflect the caller's input order rather than the internal sorted order.

Tests (test_etkdg_device_output): shape, multi-molecule indexing, multi-
conformer indexing, and pruning-rejection. Bit-exact comparison with the
host path is intentionally not asserted because ETKDG is non-deterministic
across calls due to random initialization of multiple parallel attempts.

Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Add nMols parameter to ETKDG finalizeOnTarget so the returned
DeviceCoordResult.nMols reflects the full input batch size, including
molecules that produced zero conformers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds CoordinateOutput::DEVICE to MMFFMinimizeMoleculesConfs and
UFFMinimizeMoleculesConfs. In DEVICE mode, optimized coordinates,
energies, and convergence flags stay on the GPU and are returned via the
new `device` field of {MMFF,UFF}MinimizeResult, collected onto a single
target GPU. RDKit conformer lists are not modified in this mode.

Implementation:
- New `ff_device_collect` library (src/minimizer/): per-thread
  FFDeviceCoordCollector accumulating positions, energies, and converged
  flags on the executing GPU; `appendBatch` issues async D2D copies into
  the collector and downloads only the small statuses buffer to derive
  int8 convergence flags; `finalizeOnTarget` consolidates per-thread
  partials onto the target GPU using copyDeviceToDeviceAsync, builds CSR
  atom_starts on the host, and uploads the index arrays.
- `bfgs_mmff.cpp` / `bfgs_uff.cpp`: in DEVICE mode, skip the host
  position/energy copy + writeBackResults and instead append directly into
  a thread-local FFDeviceCoordCollector; finalize after the OMP loop.
- `MMFFMinimizeResult` / `UFFMinimizeResult`: add an optional
  `device` field to carry the on-device result.

Tests (test_ff_device_output): MMFF / UFF shape on ethanol, multi-mol
correct CSR indexing for MMFF, empty-input handling. All four pass on
single-GPU; existing MMFF/UFF tests remain untouched.

Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Adds an optional `deviceInput` parameter to MMFFMinimizeMoleculesConfs and
UFFMinimizeMoleculesConfs. When supplied, starting coordinates are taken
from the on-GPU DeviceCoordResult instead of from each conformer's host
positions, broadcast to every executing GPU before BFGS runs. Each input
mol still needs at least one RDKit conformer for force-field construction
(consulted only for hybridization/VDW special-case checks; the actual
positions used for minimization come from device_input).

Implementation:
- ff_device_collect: DeviceInputIndex builder validates the device-input
  conformer set matches the host-flattened (mol, conf) labels exactly and
  precomputes per-conformer source offsets. broadcastDeviceInputBatch then
  issues one cross-GPU async copy per batch slot using copyDeviceToDeviceAsync.
- bfgs_mmff / bfgs_uff: build the index up front, share it with workers,
  precompute per-batch (srcIdx, atomCount) lists, and overwrite positions
  on each executing GPU after the host->device upload.

Constraints + device_input is rejected with a clear error message
(handling that combination is a follow-up).

Tests (test_ff_device_output): MMFF round-trip with device_input matches
no-input within BFGS tolerance, mismatched conformer-count and constraints
combinations are rejected.

Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Adds DEVICE-mode output to MMFFBatchedForcefield and UFFBatchedForcefield
so chained Python workflows can read on-GPU energies / gradients /
positions / minimize results without the existing device->host->device
triple-pump.

C++ (NativeMMFFBatchedForcefield / NativeUFFBatchedForcefield):
- Track the wrapper's home GPU at construction time.
- New `computeEnergyDevice` / `computeGradientsDevice` return borrowed
  PyArrays over the wrapper's persistent energy/grad buffers.
- New `positionsDevice` returns a borrowed (n_atoms, 3) PyArray over the
  wrapper's persistent positions buffer.
- New `indexBuffers` exposes (atom_starts, mol_indices, conf_indices) as
  AsyncGpuResults.
- New `minimizeDevice` plumbs through to MMFF/UFF MinimizeMoleculesConfs
  with output=DEVICE, refreshes the wrapper's positions buffer in-place
  via a single D2D memcpy on the same stream (no host roundtrip), and
  returns a Python DeviceCoordResult.

Python (`MMFFBatchedForcefield`, `UFFBatchedForcefield`):
- `compute_energy(output=DEVICE)` returns AsyncGpuResult of total energies.
- `compute_gradients(output=DEVICE)` returns flat AsyncGpuResult of
  gradients in CSR layout.
- New `positions()` and `indices()` accessors for the wrapper's
  persistent device state.
- `minimize(output=DEVICE, target_gpu=None)` returns DeviceCoordResult.
- Imports `_arrayHelpers` so PyArray returns are usable.

CMake:
- Add `${CMAKE_SOURCE_DIR}/src` to bfgs_mmff/bfgs_uff PUBLIC includes so
  consumers can reach `device_coord_result.h`.
- Link `_batchedForcefield` against `coord_collect` and `device`.

Tests (test_batched_forcefield_device.py): compute_energy / gradients /
indices / positions accessor / minimize device-mode equivalence with
host mode for both MMFF and UFF. 10 new tests pass; 30 existing
batched-FF tests continue to pass.

Signed-off-by: Kevin Boyd <kboyd@nvidia.com>
Add nMols parameter to FF finalizeOnTarget so the returned
DeviceCoordResult.nMols reflects the full input batch size.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Ship a SKILL.md file under agent-skills/nvmolkit-usage/ that downstream
developers can copy into their own .cursor/skills/ to help an agent
write correct nvMolKit-using code. The skill covers the public Python
entry points, the AsyncGpuResult contract, runtime requirements,
runnable recipes (Morgan + Tanimoto, ETKDG embed, MMFF minimize, single-
GPU HardwareOptions tuning), and reference tables for HardwareOptions
and SubstructSearchConfig.

Advertise the skill from README.md and add a short page to the docs
site (docs/agent_skill.rst, wired into the Guides toctree) so it's
discoverable from the rendered docs.
…s toctree

The skill PR had been carrying device-output additions to
nvmolkit/batchedForcefield.{cpp,py} via a merge from the device-output
branch; those belong in their own PR and are reverted here to match main.

docs/agent_skill.rst was an orphan file - added to the Guides toctree
in docs/index.rst, dropped the stale per-feature enumeration, and
replaced the dangling "see the install section above" with a proper
cross-reference to a new :ref:`installation` label.
@scal444 scal444 requested a review from evasnow1992 May 18, 2026 15:57
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 18, 2026

Greptile Summary

This PR ships a Cursor/agent skill (SKILL.md) that teaches AI coding agents the nvMolKit public API, and adds a test that extracts every Python code block from the skill file and runs each one as a subprocess against the installed package.

  • agent-skills/nvmolkit-usage/SKILL.md: New skill file with entry-point table, result-type documentation, and six self-contained runnable recipes covering fingerprints, similarity, ETKDG embedding, MMFF/UFF minimization, and BatchedForcefield constraints.
  • nvmolkit/tests/test_skill.py: New parametrized test that parses Python code blocks from SKILL.md with a regex and runs each via subprocess.run, asserting zero exit code; the parametrize call eagerly evaluates _extract_python_blocks at collection time (flagged in a prior review thread).
  • docs/: New agent_skill.rst page and minor index.rst wiring (.. _installation: label, toctree entry).

Confidence Score: 5/5

Safe to merge — all changes are purely additive with no modifications to existing production code paths.

All six SKILL.md recipes are self-contained and verified by the new test. Documentation and README changes are purely additive. The only open concern (collection-time failure if SKILL.md moves) was already raised in the existing review thread.

nvmolkit/tests/test_skill.py — the eager parametrize evaluation at collection time.

Important Files Changed

Filename Overview
nvmolkit/tests/test_skill.py New test that extracts and runs every Python code block from SKILL.md via subprocess; parametrize call happens eagerly at collection time (see existing thread).
agent-skills/nvmolkit-usage/SKILL.md New agent skill file with entry-point table, runtime requirements, result type docs, and six self-contained runnable recipes.
docs/agent_skill.rst New RST page documenting how to install and use the Cursor agent skill; correctly cross-references the installation section.
docs/index.rst Adds .. _installation: label and wires agent_skill into the toctree; straightforward docs wiring.
README.md Adds a short section pointing users to the agent skill directory; no issues.

Reviews (3): Last reviewed commit: "Merge branch 'main' into agent/nvmolkit-..." | Re-trigger Greptile

Comment thread nvmolkit/tests/test_skill.py
Copy link
Copy Markdown
Collaborator

@evasnow1992 evasnow1992 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding this skill. Changes look good to me!

@scal444 scal444 merged commit 434440f into NVIDIA-Digital-Bio:main May 18, 2026
8 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants