@github-actions github-actions bot commented Nov 3, 2025

🤖 Automated backport of #1400 to release-1.18

✅ Cherry-pick completed successfully with no conflicts.

Original PR: #1400
Original Author: @cdesiniotis

Cherry-picked commits (1):

  • 61f9bde Redirect log message to stderr in nvidia runtime wrapper script

This backport was automatically created by the backport bot.

This change is required to make our nvidia runtime wrapper compliant with
the OCI runtime spec. All OCI-compliant runtimes must support the operations
documented at https://github.com/opencontainers/runtime-spec/blob/v1.2.1/runtime.md#operations.
Before this change, our nvidia runtime wrapper did not produce the expected
output when the query state operation (`state <container-id>`) was invoked
while the nvidia kernel modules were not loaded. In that case, the wrapper
wrote an extra log message to stdout, so the stdout of this command no longer
adhered to the schema defined in the OCI runtime spec. Redirecting the log
message to stderr makes us compliant.
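
The effect of the redirect can be sketched with a small Python simulation. This is not the wrapper script itself: `run_state`, `is_valid_state`, the container ID `abc123`, the bundle path, and the shortened log text are all hypothetical names chosen for illustration, and the state object carries only a few of the fields the OCI runtime spec defines.

```python
import io
import json

# Abbreviated, illustrative log text; the real wrapper's message is longer.
LOG_MESSAGE = "nvidia driver modules are not yet loaded"


def run_state(container_id, log_to_stderr):
    """Simulate a wrapper handling `state <container-id>`.

    Returns a (stdout_text, stderr_text) tuple.
    """
    stdout, stderr = io.StringIO(), io.StringIO()
    # Before the fix the log line shared stdout with the state JSON.
    log_stream = stderr if log_to_stderr else stdout
    print(LOG_MESSAGE, file=log_stream)
    # Minimal state object (hypothetical values).
    state = {
        "ociVersion": "1.2.1",
        "id": container_id,
        "status": "stopped",
        "bundle": "/hypothetical/bundle",
    }
    json.dump(state, stdout)
    return stdout.getvalue(), stderr.getvalue()


def is_valid_state(stdout_text):
    """Return True if stdout holds exactly one JSON object."""
    try:
        return isinstance(json.loads(stdout_text), dict)
    except json.JSONDecodeError:
        return False


before, _ = run_state("abc123", log_to_stderr=False)
after, logs = run_state("abc123", log_to_stderr=True)
```

In this sketch `is_valid_state(before)` is false because the log line precedes the JSON, while `is_valid_state(after)` is true and the log text is still available to the caller on stderr.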

This issue was discovered when deploying GPU Operator 25.10.0 on nodes using cri-o.
GPU Operator 25.10.0 is the first release that installs nvidia runtime handlers
with cri-o by default, as opposed to installing an OCI hook file. When performing
a GPU driver upgrade, pods in the gpu-operator namespace would stay in the
`Init:RunContainerError` state for several minutes until the new driver finished
installing. No nvidia driver modules are loaded during those several minutes.
When inspecting the cri-o logs, we observed the following error message:

```
level=warning msg="Error updating the container status \"16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028\": failed to decode container status for 16779f4cd2414a164aae56856b491f86fe0c6b803a3b4474ada2cc0864c8e028: skipThreeBytes: expect ull, error found in #2 byte of ...|nvidia drive|..., bigger context ...|nvidia driver modules are not yet loaded, invoking /|..." id=a4b48041-edc4-48c2-8d75-4ad03cb3d8e1 name=/runtime.v1.RuntimeService/CreateContainer
```

This error message indicates cri-o failed to get the status of the container because
it could not decode the JSON returned by the runtime handler.
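
The `expect ull` wording in the log is consistent with a decoder that read the leading `n` of `nvidia` and then tried to match the JSON literal `null`. The Python sketch below imitates that single step under this assumption; `skip_literal` is a hypothetical helper written for illustration and is not cri-o's decoder or any library's API.

```python
def skip_literal(text):
    """Toy decoder step: match a JSON literal selected by its first character.

    Raises ValueError naming the characters that were expected after the
    first one, loosely mirroring the "expect ull" message in the log.
    """
    literals = {"n": "null", "t": "true", "f": "false"}
    first = text[:1]
    if first not in literals:
        raise ValueError("not a literal")
    expected_rest = literals[first][1:]
    actual_rest = text[1:1 + len(expected_rest)]
    if actual_rest != expected_rest:
        raise ValueError(f"expect {expected_rest}")
    return literals[first]


# A stray log line at the start of stdout fails the literal check.
try:
    skip_literal("nvidia driver modules are not yet loaded")
    message = ""
except ValueError as err:
    message = str(err)
```

Here `message` ends up as `expect ull`, while `skip_literal("null")` succeeds. Any non-JSON text at the start of stdout fails in a similar way, which is why the log line has to go to stderr.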

Signed-off-by: Christopher Desiniotis <[email protected]>
(cherry picked from commit 61f9bde)

copy-pr-bot bot commented Nov 3, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@elezar elezar merged commit 39ad9b2 into release-1.18 Nov 4, 2025
1 check passed
@elezar elezar deleted the backport-1400-to-release-1.18 branch November 4, 2025 09:29