Redirect log message to stderr in nvidia runtime wrapper script #1400
This change is required to make our nvidia runtime wrapper compliant with the OCI runtime spec. All OCI-compliant runtimes must support the operations documented at https://github.com/opencontainers/runtime-spec/blob/v1.2.1/runtime.md#operations. Before this change, our nvidia runtime wrapper was not producing the expected output when the query state operation (`state <container-id>`) was invoked AND the nvidia kernel modules happened to not be loaded. In this case, we were emitting an extra log message which caused the stdout of this command to not adhere to the schema defined in the OCI runtime spec. Redirecting the log message to stderr makes us compliant.

This issue was discovered when deploying GPU Operator 25.10.0 on nodes using cri-o. GPU Operator 25.10.0 is the first release that installs nvidia runtime handlers with cri-o by default, as opposed to installing an OCI hook file. When performing a GPU driver upgrade, pods in the gpu-operator namespace would be in the `Init:RunContainerError` state for several minutes until the new driver finished installing -- note that no nvidia driver modules are loaded during this span of several minutes. When inspecting the cri-o logs, we observed the following error message:

This error message indicates cri-o failed to get the status of the container because it could not decode the JSON returned by the runtime handler.
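
To illustrate why the stream matters, here is a minimal sketch of the failure mode and the fix. It is not the actual wrapper script from this PR: the function names (`log`, `print_state`, `bad_state`, `good_state`), the log text, and the state values are all hypothetical, chosen only to show that anything printed to stdout ahead of the state JSON ends up in what the caller tries to decode.

```shell
#!/bin/sh
# Hypothetical stand-in for a runtime wrapper; names and messages are illustrative.

# log writes diagnostics to stderr so stdout carries only the runtime's output.
log() {
    echo "$*" >&2
}

# print_state stands in for the low-level runtime's `state <container-id>` output.
print_state() {
    printf '{"ociVersion":"1.2.1","id":"%s","status":"stopped","bundle":"/tmp/demo"}\n' "$1"
}

# Before: the log line goes to stdout and precedes the JSON document.
bad_state() {
    echo "nvidia kernel modules are not loaded"
    print_state "$1"
}

# After: the log line goes to stderr; stdout contains only the JSON document.
good_state() {
    log "nvidia kernel modules are not loaded"
    print_state "$1"
}

# A caller such as a container engine reads stdout only.
bad_output=$(bad_state demo-container 2>/dev/null)
state_json=$(good_state demo-container 2>/dev/null)

echo "stdout without redirect:"
echo "$bad_output"
echo "stdout with redirect:"
echo "$state_json"
```

With the redirect in place, the captured stdout is a single JSON document, so a JSON decoder on the caller's side no longer encounters the log text.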