Test/enumerate gpus coverage #223
Conversation
|
This PR adds GPU-free unit tests for enumerate_gpus() / _enumerate_nvidia() / _enumerate_amd() (follow-up to #211). All subprocess and CUDA access is mocked. Please see CI for ruff + pytest; I’ll address any failures promptly. |
|
One of the new tests, test_empty_nvidia_result_is_returned_as_is, locks in a dispatch bug. enumerate_gpus() currently does:

```python
if gpus is not None:
    return gpus
```

That treats an empty list returned from _enumerate_nvidia() as terminal success and skips the AMD/torch fallback. That can happen in real environments where nvidia-smi is installed but reports zero GPUs (stale driver, CI runner, VM). The dispatch should instead be:

```python
if gpus:
    return gpus
```

That way, an empty NVIDIA result falls through to the AMD and torch backends. The test should also be inverted so it asserts that fallthrough happens, rather than enforcing the current early-return behavior.
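The difference between the two checks can be sketched in isolation. Only the two `if` forms come from the review; the helper wiring below is illustrative, not the actual device_utils code.

```python
def dispatch_is_not_none(nvidia, amd):
    gpus = nvidia()
    if gpus is not None:  # [] passes this check, so AMD is never tried
        return gpus
    return amd() or []

def dispatch_truthy(nvidia, amd):
    gpus = nvidia()
    if gpus:              # [] is falsy, so an empty result falls through
        return gpus
    return amd() or []

# nvidia-smi present but reporting zero GPUs; one AMD GPU available.
no_nvidia_gpus = lambda: []
one_amd_gpu = lambda: [{"name": "Radeon RX 7900 XTX"}]

print(dispatch_is_not_none(no_nvidia_gpus, one_amd_gpu))  # []
print(dispatch_truthy(no_nvidia_gpus, one_amd_gpu))       # [{'name': 'Radeon RX 7900 XTX'}]
```

With the `is not None` form the AMD backend is silently skipped; with truthiness the empty NVIDIA answer is treated the same as "unavailable".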
|
I opened a pull request to fix the issue in device_utils, so you just need to update the test to reflect the new behavior.
enumerate_gpus() now uses truthiness (if gpus) instead of is not None so [] from nvidia-smi does not skip AMD/torch fallback. Updates test_empty_nvidia_result to assert fallthrough to AMD (apekshik review on PR nikopueringer#223). Made-with: Cursor
|
Thanks for the information: @apekshik and the fix from: @Raiden129. Updated in this push: device_utils.py: changed if gpus is not None: to if gpus: for both the NVIDIA and AMD dispatch checks in enumerate_gpus(). An empty list from _enumerate_nvidia() (stale driver, CI runner, VM) now correctly falls through to AMD / torch instead of being treated as terminal success. |
|
@DaniSalam4eto Hey there! Was reading through the comments here, and wanted to let you know that I merged @Raiden129 's fix. Since I merged that first, I saw your comment about rebasing, and wanted to let you know. Give me the green light when you're ready, and I will merge this PR. Thanks! |
Force-pushed from 2bdcb22 to 4146977
PR nikopueringer#211 added enumerate_gpus() and its platform-specific helpers to device_utils.py without unit tests. This commit is the first of three that close that gap; it introduces the shared test scaffolding and covers the NVIDIA path.

Scaffolding added at module level in tests/test_device_utils.py:

- _fake_completed(stdout, returncode): builds a MagicMock shaped like subprocess.CompletedProcess so monkeypatched subprocess.run() calls return something the helpers can read .stdout / .returncode off of.
- _SubprocessRouter: callable class keyed on cmd[0] so one monkeypatch can satisfy both the amd-smi and rocm-smi subprocess.run() calls inside _enumerate_amd() with different responses per binary. Missing entries raise FileNotFoundError to mimic the OS behavior when a CLI is absent, which is the exact signal the helpers use to fall through to the next backend.

New test class TestEnumerateNvidia covers every branch of _enumerate_nvidia():

- test_parses_single_gpu — happy path, including the MiB -> GiB conversion (nvidia-smi reports MiB with --format=csv,nounits and the helper divides by 1024).
- test_parses_multi_gpu — exercises the CSV row loop with two GPUs.
- test_nonzero_returncode_returns_none — early return when nvidia-smi exits non-zero (e.g. driver not loaded).
- test_binary_missing_returns_none — FileNotFoundError path when nvidia-smi isn't installed at all.
- test_timeout_returns_none — subprocess.TimeoutExpired path when the 5-second query wedges.
- test_malformed_rows_are_skipped — verifies the `len(parts) >= 4` guard so partial/garbage rows don't crash the parser.

Imports updated: add `subprocess` (for TimeoutExpired) and `_enumerate_nvidia` from device_utils. Module docstring expanded to mention enumerate_gpus() alongside the existing coverage areas.

All tests run GPU-free via monkeypatch and fit the existing monkeypatch style of TestDetectBestDevice / TestResolveDevice in the same file. No conftest or fixture changes required.
Second of three commits backfilling tests for PR nikopueringer#211. Covers every branch of _enumerate_amd(), which walks a three-stage fallback: amd-smi (ROCm 6.0+, JSON) -> rocm-smi (legacy, CSV) -> Windows registry. All five tests run GPU-free on any platform.

New test class TestEnumerateAmd:

- test_amdsmi_success — happy path with a realistic amd-smi payload containing {"asic": {"market_name": ...}, "vram": {"size": {"value": MiB}}}. Verifies MiB -> GiB conversion and that vram_free_gb is populated from vram_total_gb (amd-smi's static subcommand reports no free-memory field).
- test_amdsmi_malformed_entry_skipped — two-entry payload where the second has a non-numeric VRAM size ("garbage"). Locks in the inner try/except that PR nikopueringer#211 added after review feedback: one bad GPU entry must not crash the whole enumeration. ValueError from float("garbage") is the specific path exercised.
- test_amdsmi_invalid_json_falls_through_to_rocmsmi — amd-smi returns "not-json". json.JSONDecodeError is caught by the outer handler and rocm-smi runs next. Mock returns 16 GiB total / 1 GiB used in bytes; assertions verify the bytes -> GiB conversion (divide by 1024**3) and that vram_free_gb = (total - used) / 1024**3 = 15.0.
- test_amdsmi_missing_falls_through_to_rocmsmi — router has no entry for amd-smi, so FileNotFoundError is raised and caught. Second subprocess.run call hits rocm-smi and succeeds. Verifies the amd-smi-not-installed path independently from the JSONDecodeError path above.
- test_no_tools_on_non_windows_returns_none — neither amd-smi nor rocm-smi present, sys.platform forced to "linux" so the Windows winreg branch is skipped. Verifies the function returns None (the "unavailable" sentinel) rather than an empty list.

Imports updated: add `json` (for json.dumps payload construction) and `sys` (for monkeypatching sys.platform in the last test), plus `_enumerate_amd` from device_utils.

The _SubprocessRouter helper introduced in the previous commit is reused here.
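The skip-one-bad-entry contract that test_amdsmi_malformed_entry_skipped locks in can be reconstructed in isolation, using the payload shape quoted above. The parsing loop is a sketch, not the actual _enumerate_amd() body, and the market names are made up for illustration.

```python
import json

# Two-entry amd-smi-style payload: the second entry has a non-numeric VRAM size.
payload = json.dumps([
    {"asic": {"market_name": "Radeon RX 7900 XTX"},
     "vram": {"size": {"value": 24576}}},        # 24576 MiB -> 24 GiB
    {"asic": {"market_name": "Broken Entry"},
     "vram": {"size": {"value": "garbage"}}},    # float() raises ValueError
])

gpus = []
for entry in json.loads(payload):
    try:
        vram_gb = float(entry["vram"]["size"]["value"]) / 1024  # MiB -> GiB
    except (ValueError, TypeError):
        continue  # one bad GPU entry must not crash the whole enumeration
    gpus.append({"name": entry["asic"]["market_name"], "vram_total_gb": vram_gb})

print(gpus)  # [{'name': 'Radeon RX 7900 XTX', 'vram_total_gb': 24.0}]
```

The inner try/except is the key: without it, the ValueError from the malformed second entry would abort the loop and discard the valid first GPU as well.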
Third and final commit backfilling PR nikopueringer#211 tests. The previous two commits covered the helpers (_enumerate_nvidia, _enumerate_amd) in isolation; this one locks in the public entry point enumerate_gpus(), which walks NVIDIA -> AMD -> torch.cuda -> [] and decides which source to trust.

New module-level helper:

- _amd_must_not_run: raises AssertionError when called. Used as a monkeypatch target in the tests that must prove AMD enumeration is never invoked when NVIDIA already answered (first-match-wins). Module-scoped so ruff's blank-line rules around nested defs stay happy.

New test class TestEnumerateGpus:

- test_prefers_nvidia — NVIDIA helper returns a GPUInfo list; AMD helper is wired to _amd_must_not_run. Verifies dispatch stops at the first successful backend.
- test_falls_back_to_amd_when_nvidia_unavailable — NVIDIA returns None (unavailable sentinel) and AMD returns a populated list. The AMD result is returned unchanged.
- test_empty_nvidia_result_is_returned_as_is — the subtle one. nvidia-smi succeeding with zero GPUs returns []. That is NOT the same as None, and enumerate_gpus() must return [] without falling through to AMD. Written with _amd_must_not_run so this contract is enforced as an assertion, not an inference. Prevents a future refactor from accidentally collapsing `is not None` into truthiness and changing behavior.
- test_torch_fallback_when_smi_tools_absent — both smi helpers return None, torch.cuda.is_available() is True with 2 devices. Mocks get_device_properties() to return MagicMock objects with .total_memory and .name set so the helper's bytes -> GiB conversion (divide by 1024**3) can be verified for both GPUs. Also asserts vram_free_gb mirrors vram_total_gb (torch.cuda has no free-memory accessor in this code path).
- test_returns_empty_when_nothing_available — all three backends produce nothing (smi helpers return None, torch.cuda.is_available returns False). The function must return an empty list rather than raise, so CLI/UI code can simply check `if not gpus`.

Imports updated: add `GPUInfo` and `enumerate_gpus` from device_utils.

End state: 16 new tests across three commits, covering every branch of the GPU enumeration code added in PR nikopueringer#211. All tests mock subprocess and torch.cuda via monkeypatch, run GPU-free in CI, and match the existing monkeypatch style used elsewhere in the file.
enumerate_gpus() now uses truthiness (if gpus) instead of is not None so [] from nvidia-smi does not skip AMD/torch fallback. Updates test_empty_nvidia_result to assert fallthrough to AMD (apekshik review on PR nikopueringer#223). Made-with: Cursor
|
@nikopueringer Rebased on top of main now that #228 is in. Git dropped the duplicate device_utils.py change during the rebase, so this PR is now tests-only; the renamed test_empty_nvidia_result_falls_through_to_amd locks in the fallthrough contract from @Raiden129's fix. Green light from my side, ready to merge.
|
Merged! Thank you! |
What does this change?
How was it tested?
Checklist
- `uv run pytest` passes
- `uv run ruff check` passes
- `uv run ruff format --check` passes