Open
Description
Context
In some cases, the inventory operator reports millions of available GPUs on a host that physically has only a few (e.g. 8).
Likely causes (to confirm in code):
- Integer underflow/overflow when computing available = total - used.
- Bad or transient values from the NVIDIA device plugin / DRA (uninitialized, negative, or huge) not being validated.
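To illustrate the first suspected cause, here is a minimal Go sketch of how an unsigned subtraction wraps around. The function name `naiveAvailable` and the `uint32` type are assumptions made for illustration; the operator's actual types and code path still need to be confirmed.

```go
package main

// naiveAvailable is a hypothetical stand-in for the suspected bug.
// With unsigned operands, total - used cannot go negative: when
// used > total the result wraps around modulo 2^32.
func naiveAvailable(total, used uint32) uint32 {
	return total - used
}
```

With these assumed types, `naiveAvailable(8, 9)` returns 4294967295 rather than -1, so a single stale or over-counted `used` value would be enough to advertise an enormous GPU count on an 8-GPU host.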
Proposed fix
- Add strict validation around GPU counts: enforce 0 <= used <= total and 0 <= available <= total; if out of range, clamp to safe values and log a warning.
- Use safe integer types for intermediate calculations to avoid underflow/overflow, then validate before storing.
- Add a sanity cap on per-node GPU totals (e.g. ignore the value and report an error if total > 128), or, if larger totals are legitimate, store them in a wider integer type.
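The three points above could be combined roughly as follows. This is only a sketch under assumptions: `GPUCounts`, `sanitizeGPUCounts`, and `maxGPUsPerNode` are hypothetical names, the cap of 128 is the example value from this issue, and the real operator may need different types or log through its own logger.

```go
package main

import (
	"fmt"
	"log"
)

// maxGPUsPerNode is a hypothetical sanity cap (example value from the issue).
const maxGPUsPerNode = 128

// GPUCounts holds validated values; the invariants are
// 0 <= Used <= Total and Available == Total - Used.
type GPUCounts struct {
	Total, Used, Available uint64
}

// sanitizeGPUCounts takes raw counts as signed 64-bit integers so that
// negative or huge inputs are visible instead of silently wrapping.
// An implausible total is rejected; an out-of-range used value is
// clamped and a warning is logged.
func sanitizeGPUCounts(node string, rawTotal, rawUsed int64) (GPUCounts, error) {
	if rawTotal < 0 || rawTotal > maxGPUsPerNode {
		return GPUCounts{}, fmt.Errorf("node %s: implausible GPU total %d (want 0..%d)",
			node, rawTotal, maxGPUsPerNode)
	}

	used := rawUsed
	switch {
	case used < 0:
		log.Printf("node %s: negative GPU used count %d, clamping to 0", node, rawUsed)
		used = 0
	case used > rawTotal:
		log.Printf("node %s: GPU used count %d exceeds total %d, clamping", node, rawUsed, rawTotal)
		used = rawTotal
	}

	// Safe: 0 <= used <= rawTotal <= maxGPUsPerNode at this point.
	return GPUCounts{
		Total:     uint64(rawTotal),
		Used:      uint64(used),
		Available: uint64(rawTotal - used),
	}, nil
}
```

In this sketch, a node reporting total 8 and used 9 ends up with 0 available GPUs and a warning in the log, while a total above the cap is returned as an error so the caller can skip the node or keep its previous values.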
Status
Backlog (not prioritized)