Inventory operator reports absurd GPU counts #429

@vertex451

Context
In some cases, the inventory operator reports millions of available GPUs on a host that physically has only a few (e.g. 8).
Likely causes (to confirm in code):

  • Integer underflow/overflow when computing available = total - used.
  • Bad or transient values from the NVIDIA device plugin / DRA (uninitialized, negative, or huge) not being validated.
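
To illustrate the first suspected cause, here is a minimal Go sketch of how an unchecked unsigned subtraction can produce an absurd count. The operator's actual code is not shown in this issue, so the function name `naiveAvailable` and the `uint64` type are assumptions made purely for illustration.

```go
package main

import "fmt"

// naiveAvailable is a hypothetical stand-in for the suspected pattern:
// an unsigned subtraction with no bounds check. The name and the uint64
// type are assumptions, not taken from the operator's code.
func naiveAvailable(total, used uint64) uint64 {
	// Unsigned arithmetic in Go wraps around silently when used > total.
	return total - used
}

func main() {
	fmt.Println(naiveAvailable(8, 3)) // 5
	fmt.Println(naiveAvailable(8, 9)) // 18446744073709551615
}
```

If a transient reading puts `used` even one above `total`, the result wraps to a huge value instead of going negative, which would be consistent with the symptom described above.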

Proposed fix

  • Add strict validation around GPU counts: enforce 0 <= used <= total and 0 <= available <= total; if out of range, clamp to safe values and log a warning.
  • Use safe integer types for intermediate calculations (e.g. signed values, so that a negative result is detectable instead of wrapping) to avoid underflow/overflow, then validate before storing.
  • Add a sanity cap on per‑node GPU totals (e.g. ignore the value or report an error if total > 128), or use a wider type if larger values must be supported.
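
The proposed fix could be sketched roughly as follows. This is only an illustration under stated assumptions: `sanitizeGPUCounts`, `GPUCounts`, and `maxGPUsPerNode` are hypothetical names, raw values are assumed to arrive as signed integers, and the cap of 128 is the example figure from the list above rather than an agreed limit.

```go
package main

import (
	"fmt"
	"log"
)

// maxGPUsPerNode is an illustrative sanity cap (the 128 from the proposal).
const maxGPUsPerNode = 128

// GPUCounts is a hypothetical container for validated per-node counts.
type GPUCounts struct {
	Total, Used, Available uint64
}

// sanitizeGPUCounts is a hypothetical helper. It takes raw signed values so
// that negative inputs are detectable, rejects implausible totals, clamps
// used into [0, total], and only then derives available.
func sanitizeGPUCounts(total, used int64) (GPUCounts, error) {
	if total < 0 || total > maxGPUsPerNode {
		return GPUCounts{}, fmt.Errorf("implausible GPU total %d (expected 0..%d)", total, maxGPUsPerNode)
	}
	if used < 0 {
		log.Printf("warning: negative GPU usage %d, clamping to 0", used)
		used = 0
	}
	if used > total {
		log.Printf("warning: GPU usage %d exceeds total %d, clamping", used, total)
		used = total
	}
	// After clamping, 0 <= used <= total, so the subtraction cannot wrap.
	return GPUCounts{
		Total:     uint64(total),
		Used:      uint64(used),
		Available: uint64(total - used),
	}, nil
}

func main() {
	counts, err := sanitizeGPUCounts(8, 9) // used > total: clamped with a warning
	fmt.Println(counts.Available, err)

	_, err = sanitizeGPUCounts(1000000, 0) // total above the cap: rejected
	fmt.Println(err)
}
```

Rejecting an out-of-range total while clamping an out-of-range `used` is one possible split between "report error" and "clamp and warn"; the issue leaves that choice open.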

Metadata

Labels

GPU, repo/provider (Akash provider-services repo issues)

Projects

Status

Backlog (not prioritized)
