feat(ci): add GPU test workflow with pytest gpu marker#156
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
📝 Walkthrough
Adds GPU test support: a new GitHub Actions workflow for GPU tests, Pixi tasks and a pytest "gpu" marker in pyproject.toml, and annotates many tests with `@pytest.mark.gpu`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
autonumber
participant Dev as "Developer / PR"
participant GH as "GitHub Actions"
participant Runner as "gpu-1 Runner"
participant Pixi as "setup-pixi / pixi"
participant Pytest as "pytest"
Dev->>GH: push / PR / workflow_dispatch
GH->>Runner: schedule gpu-tests job
Runner->>Runner: checkout repo
Runner->>Pixi: [email protected]
Runner->>Runner: build CUDA extensions (python -c "import ...")
Runner->>Pixi: pixi run -e <env> gpu-tests
Pixi->>Pytest: run tests ( -m gpu / -m 'not gpu' )
Pytest-->>Runner: test results
    Runner-->>GH: upload status / logs
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/gpu-tests.yml (1)
23-40: Add a timeout/concurrency guard for the GPU runner. These jobs target scarce self-hosted GPU capacity. A stuck matrix entry or a burst of force-pushes can pin `gpu-1` for a long time; adding `timeout-minutes` and a `concurrency` group would make the workflow much safer operationally.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-tests.yml around lines 23 - 40, Add a timeout and a concurrency guard to the gpu-tests job to protect scarce self-hosted GPU runner capacity: in the job block named "gpu-tests" (the job with strategy.matrix.environment and steps Checkout/Install pixi/Run tests) add a timeout-minutes value (e.g., 60) and a concurrency stanza with a stable group string such as "gpu-tests-${{ matrix.environment }}" and cancel-in-progress: true so only one matrix entry per environment runs at a time and stuck jobs are automatically timed out/cancelled.
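A minimal sketch of the suggested guard, assuming the job and matrix shown in this PR; the 60-minute cap and group name are illustrative, not the author's values:

```yaml
jobs:
  gpu-tests:
    runs-on: gpu-1
    timeout-minutes: 60              # illustrative cap; tune to the slowest matrix entry
    concurrency:
      group: gpu-tests-${{ github.ref }}-${{ matrix.environment }}
      cancel-in-progress: true       # force-pushes cancel superseded runs
    strategy:
      fail-fast: false
      matrix:
        environment: [boltz-dev, protenix-dev, rf3-dev]
```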
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pyproject.toml`:
- Around line 99-100: The all-tests task currently expands to a plain pytest run
and will execute tests marked with the "gpu" marker; update the Pixi task
configuration so the non-GPU default explicitly deselects that marker (e.g., add
pytest option -m "not gpu" to tool.pixi.tasks.all-tests) or alternatively split
the tasks into separate CPU and GPU tasks (create tool.pixi.tasks.cpu-tests with
-m "not gpu" and tool.pixi.tasks.gpu-tests with -m "gpu"); reference the "gpu"
marker and the task name tool.pixi.tasks.all-tests when applying the change.
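As a sketch of the split-task option described above (task names from the review; the exact pytest arguments in the repo may differ):

```toml
# pyproject.toml: hypothetical Pixi task split
[tool.pixi.tasks]
cpu-tests = "pytest -m 'not gpu'"  # default lane deselects GPU-marked tests
gpu-tests = "pytest -m gpu"        # GPU lane runs only @pytest.mark.gpu tests
```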
---
Nitpick comments:
In @.github/workflows/gpu-tests.yml:
- Around line 23-40: Add a timeout and a concurrency guard to the gpu-tests job
to protect scarce self-hosted GPU runner capacity: in the job block named
"gpu-tests" (the job with strategy.matrix.environment and steps Checkout/Install
pixi/Run tests) add a timeout-minutes value (e.g., 60) and a concurrency stanza
with a stable group string such as "gpu-tests-${{ matrix.environment }}" and
cancel-in-progress: true so only one matrix entry per environment runs at a time
and stuck jobs are automatically timed out/cancelled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bc589c61-ae7b-424c-bf56-04e67f0d4738
📒 Files selected for processing (10)
- .github/workflows/gpu-tests.yml
- pyproject.toml
- tests/integration/test_pipeline_integration.py
- tests/metrics/test_lddt_metrics.py
- tests/models/boltz/test_boltz_wrapper.py
- tests/models/test_model_wrapper_protocol.py
- tests/rewards/test_real_space_density_reward.py
- tests/utils/test_atom_reconciler.py
- tests/utils/test_density_utils.py
- tests/utils/test_frame_transforms.py
🧹 Nitpick comments (3)
.github/workflows/gpu-tests.yml (3)
24-24: Self-hosted runner label is valid; consider adding an actionlint config. The `gpu-1` label is correctly used for a self-hosted runner. The actionlint warning is a false positive. To suppress it in future runs, you can add an `actionlint.yaml` at the repo root:
📝 Optional: Add actionlint.yaml to register custom labels
```yaml
# actionlint.yaml
self-hosted-runner:
  labels:
    - gpu-1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-tests.yml at line 24, The actionlint false-positive for the self-hosted runner label "gpu-1" can be silenced by adding an actionlint configuration that registers that custom label; create an actionlint.yaml at the repo root containing a self-hosted-runner.labels entry including "gpu-1" so actionlint recognizes the label and stops flagging runs-on: gpu-1 as invalid, and commit that file to the repository to apply the rule for future CI checks.
22-28: Add a timeout to prevent hung jobs on the self-hosted GPU runner. Self-hosted runners, especially GPU nodes, are typically limited resources. Without a `timeout-minutes`, a hung test could block the runner indefinitely.
⏱️ Proposed fix to add job timeout
```diff
 jobs:
   gpu-tests:
     runs-on: gpu-1
+    timeout-minutes: 60
     strategy:
       fail-fast: false
       matrix:
         environment: [boltz-dev, protenix-dev, rf3-dev]
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-tests.yml around lines 22 - 28, The gpu-tests job currently has no timeout and can hang indefinitely; add a timeout-minutes key to the gpu-tests job (same indentation level as runs-on and strategy) to cap job runtime (e.g., timeout-minutes: 60 or another agreed limit) so the self-hosted GPU runner is freed if tests hang; keep the job name gpu-tests and existing matrix/strategy intact.
3-20: Consider adding concurrency control to avoid queuing on limited GPU hardware. If multiple PRs or pushes trigger this workflow simultaneously, they'll queue on the single GPU runner. A concurrency group can cancel in-progress runs for superseded commits on the same branch/PR.
🔄 Optional: Add concurrency group
```diff
 on:
   push:
     branches: [main]
     paths:
       - 'src/**'
       - 'tests/**'
       - 'pyproject.toml'
       - 'pixi.lock'
       - '.github/workflows/gpu-tests.yml'
   pull_request:
     branches: [main]
     paths:
       - 'src/**'
       - 'tests/**'
       - 'pyproject.toml'
       - 'pixi.lock'
       - '.github/workflows/gpu-tests.yml'
   workflow_dispatch:
+
+concurrency:
+  group: gpu-tests-${{ github.head_ref || github.ref }}
+  cancel-in-progress: true
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-tests.yml around lines 3 - 20, Add a concurrency block to the workflow to prevent multiple runs from queuing on limited GPU hardware: under the top-level YAML (near the existing on: push/pull_request/workflow_dispatch) add a concurrency key with a group that uniquely identifies the branch or PR (e.g. using expressions like ${{ github.ref }} or ${{ github.event.pull_request.number || github.ref }}) and set cancel-in-progress: true so superseded commits cancel in-flight runs; this ensures the workflow triggered by push, pull_request or workflow_dispatch will not pile up on the single GPU runner.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In @.github/workflows/gpu-tests.yml:
- Line 24: The actionlint false-positive for the self-hosted runner label
"gpu-1" can be silenced by adding an actionlint configuration that registers
that custom label; create an actionlint.yaml at the repo root containing a
self-hosted-runner.labels entry including "gpu-1" so actionlint recognizes the
label and stops flagging runs-on: gpu-1 as invalid, and commit that file to the
repository to apply the rule for future CI checks.
- Around line 22-28: The gpu-tests job currently has no timeout and can hang
indefinitely; add a timeout-minutes key to the gpu-tests job (same indentation
level as runs-on and strategy) to cap job runtime (e.g., timeout-minutes: 60 or
another agreed limit) so the self-hosted GPU runner is freed if tests hang; keep
the job name gpu-tests and existing matrix/strategy intact.
- Around line 3-20: Add a concurrency block to the workflow to prevent multiple
runs from queuing on limited GPU hardware: under the top-level YAML (near the
existing on: push/pull_request/workflow_dispatch) add a concurrency key with a
group that uniquely identifies the branch or PR (e.g. using expressions like ${{
github.ref }} or ${{ github.event.pull_request.number || github.ref }}) and set
cancel-in-progress: true so superseded commits cancel in-flight runs; this
ensures the workflow triggered by push, pull_request or workflow_dispatch will
not pile up on the single GPU runner.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 3e3e4e8e-1de4-4026-9134-86244b3288fb
⛔ Files ignored due to path filters (1)
`pixi.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (2)
- .github/workflows/gpu-tests.yml
- pyproject.toml
Actionable comments posted: 2
🧹 Nitpick comments (1)
.github/workflows/gpu-tests.yml (1)
23-43: Add timeout and concurrency guards for the GPU queue. A hung test can monopolize the runner indefinitely, and repeated pushes will queue redundant matrix jobs. Adding `timeout-minutes` and `concurrency.cancel-in-progress` will keep the GPU lane usable.
Suggested hardening
```diff
 jobs:
   gpu-tests:
+    timeout-minutes: 60
+    concurrency:
+      group: gpu-tests-${{ github.ref }}-${{ matrix.environment }}
+      cancel-in-progress: true
     runs-on: gpu-1
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-tests.yml around lines 23 - 43, Add job-level timeout and concurrency guards to the gpu-tests job to prevent hung runs and redundant queued matrix jobs: add a timeout-minutes key (e.g., timeout-minutes: 60) under the gpu-tests job and add a concurrency section with a unique group name (use the job name and matrix where appropriate, e.g., group: gpu-tests-${{ matrix.environment }}) and cancel-in-progress: true so in-progress GPU runs are cancelled when a new push starts; update the gpu-tests job definition (job id: gpu-tests and steps like "Build CUDA extensions" / "Run tests") to include these fields.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gpu-tests.yml:
- Around line 22-37: The gpu-tests job currently runs on the self-hosted runner
"gpu-1" with default GITHUB_TOKEN permissions and mutable action tags; update
the "gpu-tests" job to declare minimal job-level permissions (e.g., permissions:
{ contents: read }) and replace the mutable action references
("actions/checkout@v4" and "prefix-dev/[email protected]") with their pinned
full commit SHAs so the workflow does not grant unnecessary write access and
uses immutable action versions.
- Around line 12-24: The gpu-tests job currently runs on the self-hosted runner
"runs-on: gpu-1" for pull_request events (see the pull_request trigger and the
gpu-tests job), which allows untrusted PR code on your infrastructure; fix this
by restricting self-hosted execution—add a job-level guard such as checking
github.event_name == 'push' on the gpu-tests job so it only uses gpu-1 for trunk
pushes, and route pull_request validation to GitHub-hosted runners (or require
approved contributors via branch/actor restrictions or an environment with
required reviewers) so external PRs never execute on the gpu-1 runner.
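A combined sketch of the two inline comments above, assuming the job shape in this PR; the commit SHAs are deliberate placeholders to resolve, and the event guard is one of the options the comment lists (an environment with required reviewers is the stricter alternative):

```yaml
jobs:
  gpu-tests:
    # Keep untrusted fork PR code off the self-hosted runner: only run for
    # trunk pushes and manual dispatch.
    if: github.event_name == 'push' || github.event_name == 'workflow_dispatch'
    runs-on: gpu-1
    permissions:
      contents: read   # minimal GITHUB_TOKEN scope; the job only reads the repo
    steps:
      # Pin actions to full commit SHAs instead of mutable tags (placeholders shown).
      - uses: actions/checkout@<full-commit-sha>        # the v4 release commit
      - uses: prefix-dev/setup-pixi@<full-commit-sha>   # the pinned setup-pixi commit
```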
---
Nitpick comments:
In @.github/workflows/gpu-tests.yml:
- Around line 23-43: Add job-level timeout and concurrency guards to the
gpu-tests job to prevent hung runs and redundant queued matrix jobs: add a
timeout-minutes key (e.g., timeout-minutes: 60) under the gpu-tests job and add
a concurrency section with a unique group name (use the job name and matrix
where appropriate, e.g., group: gpu-tests-${{ matrix.environment }}) and
cancel-in-progress: true so in-progress GPU runs are cancelled when a new push
starts; update the gpu-tests job definition (job id: gpu-tests and steps like
"Build CUDA extensions" / "Run tests") to include these fields.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 525da684-d18b-420d-89f5-5ad889e8ffe3
📒 Files selected for processing (1)
- .github/workflows/gpu-tests.yml
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/gpu-tests.yml (1)
39-40: Add error handling for missing `nvcc`. If `nvcc` is not in PATH, the `which nvcc` command will fail and produce an empty or invalid `CUDA_HOME`. Consider adding validation.
🛡️ Proposed fix with validation
```diff
       - name: Set CUDA_HOME
-        run: echo "CUDA_HOME=$(dirname $(dirname $(which nvcc)))" >> "$GITHUB_ENV"
+        run: |
+          NVCC_PATH=$(which nvcc) || { echo "::error::nvcc not found in PATH"; exit 1; }
+          echo "CUDA_HOME=$(dirname $(dirname "$NVCC_PATH"))" >> "$GITHUB_ENV"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-tests.yml around lines 39 - 40, The "Set CUDA_HOME" step uses `which nvcc` directly and can produce an empty/invalid CUDA_HOME if nvcc is missing; update that step to first check for nvcc (e.g., test "$(which nvcc)" or command -v nvcc) and fail the job with a clear error message or skip setting CUDA_HOME if not found, and only export CUDA_HOME=$(dirname $(dirname $(which nvcc))) when the check succeeds so the workflow doesn't continue with an empty variable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gpu-tests.yml:
- Around line 42-43: The import wrapper in
src/sampleworks/core/forward_models/xray/real_space_density_deps/ops/csrc/__init__.py
currently swallows exceptions from torch.cpp_extension.load() and sets
CUDA_AVAILABLE = False, causing CI build steps (like importing
dilate_points_cuda) to silently succeed; change it to propagate compilation
failures by re-raising the caught exception (or logging the error and then
raise) and additionally perform an explicit runtime check after import: if not
CUDA_AVAILABLE: raise RuntimeError("CUDA extension failed to build or CUDA is
unavailable") so that imports of symbols like dilate_points_cuda fail fast and
the GitHub Actions step fails on bad compilation instead of deferring to later
tests.
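A minimal sketch of the fail-fast shape described above, assuming the wrapper follows the try/except pattern the prompt implies (module path and source list are illustrative, not the repo's actual code):

```python
# Hypothetical shape of ops/csrc/__init__.py after the fix.
from torch.utils import cpp_extension

try:
    # JIT-compile the CUDA extension at import time.
    dilate_points_cuda = cpp_extension.load(
        name="dilate_points_cuda",
        sources=["dilate_points.cu"],  # illustrative source list
    )
    CUDA_AVAILABLE = True
except Exception as exc:
    CUDA_AVAILABLE = False
    # Propagate compilation failures so the CI import step fails fast
    # instead of deferring the error to later tests.
    raise RuntimeError("CUDA extension failed to build or CUDA is unavailable") from exc
```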
---
Nitpick comments:
In @.github/workflows/gpu-tests.yml:
- Around line 39-40: The "Set CUDA_HOME" step uses `which nvcc` directly and can
produce an empty/invalid CUDA_HOME if nvcc is missing; update that step to first
check for nvcc (e.g., test "$(which nvcc)" or command -v nvcc) and fail the job
with a clear error message or skip setting CUDA_HOME if not found, and only
export CUDA_HOME=$(dirname $(dirname $(which nvcc))) when the check succeeds so
the workflow doesn't continue with an empty variable.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: efad377a-52ed-4157-9187-2b3f36dc1b27
📒 Files selected for processing (1)
- .github/workflows/gpu-tests.yml
Actionable comments posted: 1
🧹 Nitpick comments (1)
.github/workflows/gpu-tests.yml (1)
15-37: Consider adding a job timeout to prevent hung GPU jobs. GPU tests can hang due to CUDA driver issues, deadlocks, or resource contention. Without a timeout, a stuck job will consume the self-hosted runner indefinitely.
⏱️ Suggested change to add timeout
```diff
   gpu-tests:
     runs-on: gpu-1
+    timeout-minutes: 60
     permissions:
       contents: read
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/gpu-tests.yml around lines 15 - 37, The gpu-tests job currently has no timeout which can leave self-hosted GPU runners hung indefinitely; add a timeout-minutes setting (e.g., timeout-minutes: 60) to the gpu-tests job definition so the workflow automatically cancels stuck runs, placing it directly under the gpu-tests job keys (near runs-on/strategy/name) to apply to that job; reference the job identifier "gpu-tests" and update the YAML accordingly to enforce a safe timeout value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/workflows/gpu-tests.yml:
- Around line 33-37: The CI activation environment used by the Build CUDA
extensions and Run tests steps doesn't export CUDA_HOME, causing
torch.utils.cpp_extension.load() (used from __init__.py while importing
dilate_points_cuda) to fail to find CUDA headers; update the pixi activation
environment in pyproject.toml under [tool.pixi.activation.env] to add
CUDA_HOME="$CONDA_PREFIX" so the CUDA toolkit from the pixi environment is
exposed during the pixi run -e ${{ matrix.environment }} python3 -c "from
sampleworks.core.forward_models.xray.real_space_density_deps.ops.csrc import
dilate_points_cuda" and subsequent JIT compilation.
---
Nitpick comments:
In @.github/workflows/gpu-tests.yml:
- Around line 15-37: The gpu-tests job currently has no timeout which can leave
self-hosted GPU runners hung indefinitely; add a timeout-minutes setting (e.g.,
timeout-minutes: 60) to the gpu-tests job definition so the workflow
automatically cancels stuck runs, placing it directly under the gpu-tests job
keys (near runs-on/strategy/name) to apply to that job; reference the job
identifier "gpu-tests" and update the YAML accordingly to enforce a safe timeout
value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5eb62b47-2a36-4791-80cb-032e86d40f05
⛔ Files ignored due to path filters (1)
`pixi.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (2)
- .github/workflows/gpu-tests.yml
- pyproject.toml
🚧 Files skipped from review as they are similar to previous changes (1)
- pyproject.toml
marcuscollins left a comment
There are a few places I think we could remove the @pytest.mark.gpu but it isn't strictly necessary. I'll approve and if you could make those changes before merging that would be helpful, but not required.
- Add .github/workflows/gpu-tests.yml running on gpu-1 self-hosted runner
- Add @pytest.mark.gpu to tests requiring GPU hardware
- Add cuda-toolkit pixi dependency and CUDA_HOME activation env
- Add cpu-tests and gpu-tests pixi tasks
- Gate with gpu-testing environment for security (required reviewers)
- 30-minute timeout per job
Force-pushed from d4f020f to 2a093db
Add GitHub Actions workflow for self-hosted GPU runners that runs tests marked with @pytest.mark.gpu across boltz-dev, protenix-dev, and rf3-dev environments.
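For reference, a hedged sketch of how such a marked test looks (test name and body are hypothetical; the marker itself must also be registered under pytest's markers config in pyproject.toml):

```python
import pytest
import torch

@pytest.mark.gpu  # selected by `pytest -m gpu`, skipped by `-m "not gpu"`
def test_cuda_kernel_runs_on_gpu():
    assert torch.cuda.is_available()
```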