
Test/enumerate gpus coverage #223

Merged
nikopueringer merged 4 commits into nikopueringer:main from DaniSalam4eto:test/enumerate-gpus-coverage
Apr 17, 2026

Conversation

@DaniSalam4eto
Contributor

@DaniSalam4eto DaniSalam4eto commented Apr 13, 2026

What does this change?

How was it tested?

Checklist

  • uv run pytest passes
  • uv run ruff check passes
  • uv run ruff format --check passes

@DaniSalam4eto
Contributor Author

This PR adds GPU-free unit tests for enumerate_gpus() / _enumerate_nvidia() / _enumerate_amd() (follow-up to #211). All subprocess and CUDA access is mocked. Please see CI for ruff + pytest; I’ll address any failures promptly.

@apekshik

One of the new tests, test_empty_nvidia_result_is_returned_as_is, appears to lock in what looks like a bug introduced in commit d836296.

enumerate_gpus currently does:

if gpus is not None:
    return gpus

That treats an empty list returned from _enumerate_nvidia as a terminal success, which means it skips the AMD / torch fallback path even though [] only means "the probe ran and found nothing."

That can happen in real environments where nvidia-smi is installed but no NVIDIA GPU is actually present, such as:

  • stale driver setups
  • CI runners
  • VMs

The dispatch should instead be:

if gpus:
    return gpus

That way, [] falls through to the next probe.
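With that change, the dispatch reduces to a first-truthy-wins walk. A minimal sketch (simplified: `probes` stands in for the real `_enumerate_nvidia` / `_enumerate_amd` / torch helpers, which each return a list on success, `[]` for "ran but found nothing", or `None` for "tool unavailable"):

```python
def enumerate_gpus_sketch(probes):
    """First-truthy-wins dispatch: both [] and None fall through
    to the next probe; only a non-empty list is terminal."""
    for probe in probes:
        gpus = probe()
        if gpus:  # truthiness, not `is not None`
            return gpus
    return []  # nothing found by any backend

# [] from the first probe no longer short-circuits the walk
result = enumerate_gpus_sketch([lambda: [], lambda: ["amd0"]])
```

Here `result` is `["amd0"]`, whereas the `is not None` check would have returned `[]` from the first probe.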

The test should also be inverted so it asserts that fallthrough happens, rather than enforcing the current early-return behavior.

@Raiden129
Contributor

I opened a pull request to fix the issue in device_utils, so you just need to update the test to reflect the new behavior.

DaniSalam4eto added a commit to DaniSalam4eto/CorridorKey that referenced this pull request Apr 16, 2026
enumerate_gpus() now uses truthiness (if gpus) instead of is not None
so [] from nvidia-smi does not skip AMD/torch fallback.

Updates test_empty_nvidia_result to assert fallthrough to AMD (apekshik
review on PR nikopueringer#223).

Made-with: Cursor
@DaniSalam4eto
Contributor Author

Thanks for the catch, @apekshik, and for the fix, @Raiden129.

Updated in this push:

device_utils.py: changed if gpus is not None: to if gpus: for both the NVIDIA and AMD dispatch checks in enumerate_gpus(). An empty list from _enumerate_nvidia() (stale driver, CI runner, VM) now correctly falls through to AMD / torch instead of being treated as terminal success.
test_device_utils.py: renamed test_empty_nvidia_result_is_returned_as_is to test_empty_nvidia_result_falls_through_to_amd and inverted the assertion so it enforces the fallthrough contract. NVIDIA returns [], AMD is consulted and its result is returned.
@Raiden129 I included the device_utils.py fix here as well so the test passes in isolation; happy to drop it if your PR lands first and I rebase on top.

@nikopueringer
Owner

@DaniSalam4eto Hey there! Was reading through the comments here, and wanted to let you know that I merged @Raiden129 's fix. Since I merged that first, I saw your comment about rebasing, and wanted to let you know. Give me the green light when you're ready, and I will merge this PR. Thanks!

@DaniSalam4eto DaniSalam4eto force-pushed the test/enumerate-gpus-coverage branch from 2bdcb22 to 4146977 on April 17, 2026 at 15:56
PR nikopueringer#211 added enumerate_gpus() and its platform-specific helpers to
device_utils.py without unit tests. This commit is the first of three
that close that gap; it introduces the shared test scaffolding and
covers the NVIDIA path.

Scaffolding added at module level in tests/test_device_utils.py:

- _fake_completed(stdout, returncode): builds a MagicMock shaped like
  subprocess.CompletedProcess so monkeypatched subprocess.run() calls
  return something the helpers can read .stdout / .returncode off of.

- _SubprocessRouter: callable class keyed on cmd[0] so one monkeypatch
  can satisfy both the amd-smi and rocm-smi subprocess.run() calls
  inside _enumerate_amd() with different responses per binary. Missing
  entries raise FileNotFoundError to mimic the OS behavior when a CLI
  is absent, which is the exact signal the helpers use to fall through
  to the next backend.

New test class TestEnumerateNvidia covers every branch of
_enumerate_nvidia():

- test_parses_single_gpu — happy path, including the MiB -> GiB
  conversion (nvidia-smi reports MiB with --format=csv,nounits and the
  helper divides by 1024).
- test_parses_multi_gpu — exercises the CSV row loop with two GPUs.
- test_nonzero_returncode_returns_none — early return when nvidia-smi
  exits non-zero (e.g. driver not loaded).
- test_binary_missing_returns_none — FileNotFoundError path when
  nvidia-smi isn't installed at all.
- test_timeout_returns_none — subprocess.TimeoutExpired path when the
  5-second query wedges.
- test_malformed_rows_are_skipped — verifies the `len(parts) >= 4`
  guard so partial/garbage rows don't crash the parser.

Imports updated: add `subprocess` (for TimeoutExpired) and
`_enumerate_nvidia` from device_utils. Module docstring expanded to
mention enumerate_gpus() alongside the existing coverage areas.

All tests run GPU-free via monkeypatch and fit the existing
monkeypatch style of TestDetectBestDevice / TestResolveDevice in the
same file. No conftest or fixture changes required.

Second of three commits backfilling tests for PR nikopueringer#211. Covers every
branch of _enumerate_amd(), which walks a three-stage fallback:
amd-smi (ROCm 6.0+, JSON) -> rocm-smi (legacy, CSV) -> Windows
registry. All five tests run GPU-free on any platform.

New test class TestEnumerateAmd:

- test_amdsmi_success — happy path with a realistic amd-smi payload
  containing {"asic": {"market_name": ...}, "vram": {"size": {"value": MiB}}}.
  Verifies MiB -> GiB conversion and that vram_free_gb is populated from
  vram_total_gb (amd-smi's static subcommand reports no free-memory field).

- test_amdsmi_malformed_entry_skipped — two-entry payload where the
  second has a non-numeric VRAM size ("garbage"). Locks in the inner
  try/except that PR nikopueringer#211 added after review feedback: one bad GPU
  entry must not crash the whole enumeration. ValueError from
  float("garbage") is the specific path exercised.

- test_amdsmi_invalid_json_falls_through_to_rocmsmi — amd-smi returns
  "not-json". json.JSONDecodeError is caught by the outer handler and
  rocm-smi runs next. Mock returns 16 GiB total / 1 GiB used in bytes;
  assertions verify the bytes -> GiB conversion (divide by 1024**3) and
  that vram_free_gb = (total - used) / 1024**3 = 15.0.

- test_amdsmi_missing_falls_through_to_rocmsmi — router has no entry
  for amd-smi, so FileNotFoundError is raised and caught. Second
  subprocess.run call hits rocm-smi and succeeds. Verifies the
  amd-smi-not-installed path independently from the JSONDecodeError
  path above.

- test_no_tools_on_non_windows_returns_none — neither amd-smi nor
  rocm-smi present, sys.platform forced to "linux" so the Windows
  winreg branch is skipped. Verifies the function returns None
  (the "unavailable" sentinel) rather than an empty list.

Imports updated: add `json` (for json.dumps payload construction) and
`sys` (for monkeypatching sys.platform in the last test), plus
`_enumerate_amd` from device_utils. The _SubprocessRouter helper
introduced in the previous commit is reused here.

Third and final commit backfilling PR nikopueringer#211 tests. The previous two
commits covered the helpers (_enumerate_nvidia, _enumerate_amd) in
isolation; this one locks in the public entry point enumerate_gpus(),
which walks NVIDIA -> AMD -> torch.cuda -> [] and decides which source
to trust.

New module-level helper:

- _amd_must_not_run: raises AssertionError when called. Used as a
  monkeypatch target in the tests that must prove AMD enumeration is
  never invoked when NVIDIA already answered (first-match-wins).
  Module-scoped so ruff's blank-line rules around nested defs stay
  happy.

New test class TestEnumerateGpus:

- test_prefers_nvidia — NVIDIA helper returns a GPUInfo list; AMD
  helper is wired to _amd_must_not_run. Verifies dispatch stops at
  the first successful backend.

- test_falls_back_to_amd_when_nvidia_unavailable — NVIDIA returns
  None (unavailable sentinel) and AMD returns a populated list. The
  AMD result is returned unchanged.

- test_empty_nvidia_result_is_returned_as_is — the subtle one.
  nvidia-smi succeeding with zero GPUs returns []. That is NOT the
  same as None, and enumerate_gpus() must return [] without falling
  through to AMD. Written with _amd_must_not_run so this contract is
  enforced as an assertion, not an inference. Prevents a future
  refactor from accidentally collapsing `is not None` into truthiness
  and changing behavior.

- test_torch_fallback_when_smi_tools_absent — both smi helpers return
  None, torch.cuda.is_available() is True with 2 devices. Mocks
  get_device_properties() to return MagicMock objects with
  .total_memory and .name set so the helper's bytes -> GiB conversion
  (divide by 1024**3) can be verified for both GPUs. Also asserts
  vram_free_gb mirrors vram_total_gb (torch.cuda has no free-memory
  accessor in this code path).

- test_returns_empty_when_nothing_available — all three backends
  produce nothing (smi helpers return None, torch.cuda.is_available
  returns False). The function must return an empty list rather than
  raise, so CLI/UI code can simply check `if not gpus`.

Imports updated: add `GPUInfo` and `enumerate_gpus` from device_utils.

End state: 16 new tests across three commits, covering every branch
of the GPU enumeration code added in PR nikopueringer#211. All tests mock
subprocess and torch.cuda via monkeypatch, run GPU-free in CI, and
match the existing monkeypatch style used elsewhere in the file.

enumerate_gpus() now uses truthiness (if gpus) instead of is not None
so [] from nvidia-smi does not skip AMD/torch fallback.

Updates test_empty_nvidia_result to assert fallthrough to AMD (apekshik
review on PR nikopueringer#223).

Made-with: Cursor
@DaniSalam4eto
Contributor Author

@nikopueringer Rebased on top of main now that #228 is in. Git dropped the duplicate device_utils.py change during the rebase, so this PR is now tests-only: the renamed test_empty_nvidia_result_falls_through_to_amd locks in the fallthrough contract from @Raiden129's fix. Green light from my side, ready to merge.

@nikopueringer nikopueringer merged commit 207b879 into nikopueringer:main Apr 17, 2026
7 checks passed
@nikopueringer
Owner

Merged! Thank you!
