
Conversation


@a-mccarthy a-mccarthy commented Mar 6, 2025

@a-mccarthy a-mccarthy marked this pull request as draft March 6, 2025 19:58

github-actions bot commented Mar 6, 2025

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-162

@ArangoGutierrez ArangoGutierrez requested a review from Copilot March 7, 2025 15:29

@Copilot Copilot AI left a comment


PR Overview

This pull request introduces documentation-related changes for the DRA driver while also providing configuration manifests for enabling dynamic resource allocation and related compute domain functionality.

  • Adds a kubeadm initialization configuration to enable the DynamicResourceAllocation feature.
  • Introduces a ComputeDomain resource and a pod manifest for resource channel injection.
  • Provides an additional ComputeDomain definition in a separate file.
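
For orientation, here is a minimal sketch of what such a kubeadm initialization configuration could look like. This is not a copy of the file in this PR; the kubeadm v1beta4 config API and the resource.k8s.io/v1beta1 API group (Kubernetes v1.32) are assumptions here:

# Hedged sketch only; the authoritative file is gpu-operator/manifests/input/kubeadm-init-config.yaml in this PR.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true
  - name: runtime-config
    value: resource.k8s.io/v1beta1=true
controllerManager:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true
scheduler:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true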

Reviewed Changes

  • gpu-operator/manifests/input/kubeadm-init-config.yaml: Defines kubeadm ClusterConfiguration and KubeletConfiguration with dynamic resource allocation settings.
  • gpu-operator/manifests/input/imex-channel-injection.yaml: Contains ComputeDomain and Pod definitions for testing resource channel injection.
  • gpu-operator/manifests/input/dra-compute-domain-crd.yaml: Provides an alternative ComputeDomain resource definition that duplicates functionality in imex-channel-injection.yaml.

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

gpu-operator/manifests/input/dra-compute-domain-crd.yaml:1

  • The ComputeDomain resource defined here duplicates the one in imex-channel-injection.yaml. Consider consolidating the definitions to avoid potential conflicts during deployment.
apiVersion: resource.nvidia.com/v1beta1

@a-mccarthy a-mccarthy requested a review from guptaNswati March 10, 2025 16:29

elezar commented Mar 21, 2025

As a general comment, we need to call out what that driver enables -- which is to use Multi-Node NVLink -- and not how this is enabled. The UX for ComputeDomains has been designed to explicitly make minimal references to IMEX since this is an implementation detail. It is required to enable MNNVL, but should not be something that a user is concerned with.

I'll make another pass at reviewing this soon.

@ArangoGutierrez ArangoGutierrez requested a review from Copilot March 21, 2025 13:50

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces draft manifests for the DRA driver documentation and related resources, outlining initial configurations and resource definitions.

  • Added a kubeadm initialization configuration file with feature gate settings for dynamic resource allocation.
  • Introduced a ComputeDomain resource manifest.
  • Provided an injection manifest that combines a ComputeDomain definition with a Pod using the resource.

Reviewed Changes

Copilot reviewed 4 out of 7 changed files in this pull request and generated 1 comment.

  • gpu-operator/manifests/input/kubeadm-init-config.yaml: New kubeadm init config with extraArgs for enabling dynamic resource allocation.
  • gpu-operator/manifests/input/dra-compute-domain-crd.yaml: New ComputeDomain resource definition with basic specs.
  • gpu-operator/manifests/input/imex-channel-injection.yaml: Injection manifest combining a ComputeDomain and a Pod definition.
Files not reviewed (3)
  • gpu-operator/index.rst: Language not supported
  • gpu-operator/manifests/output/compute-domain-channel-injection-crd.txt: Language not supported
  • gpu-operator/manifests/output/imex-logs.txt: Language not supported
Comments suppressed due to low confidence (1)

gpu-operator/manifests/input/imex-channel-injection.yaml:2

  • The ComputeDomain resource is defined in both this file and in dra-compute-domain-crd.yaml. Confirm if this duplication is intentional or consider consolidating these definitions to avoid potential deployment conflicts.
apiVersion: resource.nvidia.com/v1beta1
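
To make the duplication concern concrete, this is roughly the shape of a ComputeDomain paired with a Pod that consumes its channel. It is a hedged sketch based on upstream dra-driver-gpu examples, not a copy of the manifests in this PR; the resource names, the image, and field names such as numNodes and channel.resourceClaimTemplate are assumptions:

# Hedged sketch; names, image, and field layout are assumptions.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  containers:
  - name: ctr
    image: ubuntu:24.04
    # List the injected IMEX channel device(s), then idle.
    command: ["bash", "-c", "ls -la /dev/nvidia-caps-imex-channels; sleep 3600"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0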

--namespace nvidia-dra-driver-gpu \
--set nvidiaDriverRoot=/run/nvidia/driver \
--set nvidiaCtkPath=/usr/local/nvidia/toolkit/nvidia-ctk \
--set resources.gpus.enabled=false

Thank you for following this ❤️

@a-mccarthy

@klueska and @jgehrcke, I pushed some updates to this PR; please review if you have a chance. I still need to add more content around the GPU resource allocation use case, but I'd like to get some feedback on the page structure and whether the rest of the content for MNNVL support makes sense.


@jgehrcke jgehrcke left a comment


Thank you @a-mccarthy! I looked at the last two commits only and added just a small number of comments.

After more quick/diagonal reading, I noticed that more changes are required. Maybe it would be useful if I add a commit to this branch -- WDYT?

Multi-Node NVLink support with the NVIDIA DRA Driver for GPUs
#############################################################
##################################################
NVIDIA Dynamic Resource Allocation Driver for GPUs

Let's still call this "NVIDIA DRA Driver for GPUs"

* - ``resources.gpus.enabled``
- Specifies whether to enable the NVIDIA DRA Driver for GPUs to manage GPU resource allocation.
This feature is not yet supported and you must set to ``false``.
This feature is in Technolody Preview and only recommended for testing, not production enviroments.

technology :)
environments

- Specifies whether to enable the NVIDIA DRA Driver for GPUs to manage GPU resource allocation.
This feature is not yet supported and you must set to ``false``.
This feature is in Technolody Preview and only recommended for testing, not production enviroments.
To use with MNNVL, set to ``false``.

Let's remove this sentence.

(it's not correct: the driver has two components: the CD side of things, and the GPU side of things -- they can both be enabled independently).

For more information about Kubernetes Dynamic Resource Allocation (DRA), refer to the `Kubernetes DRA documentation <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/>`_.
This page provides an overview of the DRA Driver for GPUs, its supported functionality, installing the component with the GPU Operator, and examples of using the DRA Driver for GPUs with supported use cases.

Before continuing, you should be familiar with the components of the `Kubernetes DRA feature <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/>`_.

I think this can be the first sentence in this section, and it should be the sentence that introduces the DRA abbreviation.

@a-mccarthy a-mccarthy requested a review from chenopis July 7, 2025 13:14

@chenopis chenopis left a comment


A few small nits and questions. LGTM otherwise.

The `NVIDIA GPU Feature Discovery <https://github.com/NVIDIA/k8s-device-plugin/tree/main/docs/gpu-feature-discovery>`_ adds a Clique ID to each GPU node.
This is a unique identifier within an NVLink domain (physically connected GPUs over NVLink) that indicates which GPUs within that domain are physically capable of talking to each other.

The partitioning of GPUs into a set of cliques is done at the NVSwitch layer, not at the individual node layer. All GPUs on a given node are guaranteed to have the same <ClusterUUID.Clique ID> pair.

And why is it a pair? Oh, does the pair refer to the ClusterUUID and the Clique ID? Would "value" be better? I think of "pair" as something like (X,Y). Later on, "<cluster-uuid, clique-id>" is referenced, so this was confusing for me.

Within the bracket notation, would it be better if "Clique ID" were "Clique_ID" to make it clear that it is one value?
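
For readers skimming this thread: the identifier under discussion typically surfaces as a node label whose value combines both parts. The label key below is an assumption for illustration, not quoted from this PR:

# Hypothetical node label added by GPU Feature Discovery; the value format follows the excerpt above.
labels:
  nvidia.com/gpu.clique: '<cluster-uuid>.<clique-id>'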

* - ``mpirun`` command argument ``-ppr:4:node``

-
* Number of GPUs per node as the process-per-resource number

Does this need to be a bullet item since there is only one?
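
To illustrate the row above: with 4 GPUs per node and 2 nodes, an Open MPI style invocation would look roughly like this (flag spelling and the application name are assumptions for illustration):

$ mpirun -np 8 --map-by ppr:4:node ./my_app   # 2 nodes x 4 processes per node = 8 ranks, one per GPU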

@a-mccarthy a-mccarthy requested review from chenopis and jgehrcke July 11, 2025 18:47

@klueska klueska left a comment


First quick pass

Comment on lines 31 to 40
The two parts of this driver
============================

NVIDIA's DRA Driver for GPUs is comprised of two subsystems that are largely independent of each other: one manages GPUs, and the other one manages ComputeDomains.

The next documentation chapter contains instructions for how to install both parts or just one of them.
Additionally, we have prepared two separate documentation chapters, providing more in-depth information for each of the two subsystems:

- `Documentation for ComputeDomain support <foo2>`_
- `Documentation for GPU support <foo1>`_

Does it make sense to just merge this into the intro? And then put the installation instructions inline here (instead of in a separate file)?


put the installation instructions inline here (instead of in a separate file)?

Right now I think I still prefer the separation, but I want to think that through.

Currently, the left sidebar shows: (screenshot of the sidebar navigation)
We could instead show "Introduction & Installation" -- is that what you had in mind?


yeah I'd be fine with that


just merge this into the intro?

Did you mean the entire subsection with "this" or maybe just the documentation links?

Something about the tail end of this document isn't quite ideal yet -- this is probably what you mean. I agree.

Let's see... Right now we have a section titled "Introduction" with three subsections:

  1. tagline / short intro (w/o explicit subsection header)
  2. dra primer
  3. concept: two parts

That is, as it stands (maybe not clearly visible with typographical means), both "The two parts of this driver" and "A primer on DRA" are below Introduction in the hierarchy. In that sense, this is already "in the intro".

I think you mean to maybe pull this "up", and to maybe remove a sub section heading. I can see that we need to improve something here, but I am not yet sure. Some thoughts that I believe also matter:

  • The concept of "The two parts of this driver" (regardless of specific wording) is so important that I think it is valuable to have it reflected in a heading. That's important cognitive input to a reader. It makes it more likely that readers walk away knowing about that concept.
  • Having "A primer on DRA" as the only subsection is maybe a little funky.

While I am here, I thought about replacing "The two parts of this driver" with "The twofold nature of this driver" as something potentially stylistically nicer. The few percent of ambiguity that this might introduce are taken care of in the text body of that subsection.

Lots of text. Just thinking out loud.


We could instead show "Introduction & Installation" -- is that what you had in mind?

done that

- Kubernetes v1.32 or newer.
- DRA and corresponding API groups must be enabled (`see Kubernetes docs <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation>`_).
- GPU Driver 565 or later.
- NVIDIA's GPU Operator v25.3.0 or later, installed with CDI enabled (use the ``--set cdi.enabled=true`` commandline argument during ``helm install``). For reference, please refer to the GPU Operator `installation documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options>`__.

This isn't technically true. We always use it in conjunction with the operator, but it doesn't have to be.

Is there some way to strongly suggest it rather than list it as a requirement?


I'll prepend that last bullet with "While not strictly required, we recommend using .."


done

Comment on lines 83 to 99
#. Install the driver, providing install-time configuration parameters. Example:

.. code-block:: console
$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.3.0-rc.4" \
--create-namespace \
--namespace nvidia-dra-driver-gpu \
--set nvidiaDriverRoot=/run/nvidia/driver \
--set resources.gpus.enabled=false

This is only usable for operator-managed drivers. With host-installed drivers, nvidiaDriverRoot=/.


Edit -- I see that you put that caveat in the Note below.
Most people copy and paste, so this may trip people up if there is not a full alternative.


so this may trip people up if there is not a full alternative

Ack. I will add another block for host-provided drivers (nvidiaDriverRoot=/).


done
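
For reference while reading this thread, a hedged sketch of the host-installed-driver variant discussed above; it reuses the flags shown in the excerpts and only changes the driver root:

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.3.0-rc.4" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot=/ \
    --set resources.gpus.enabled=false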

NVIDIA's `GB200 NVL72 <https://www.nvidia.com/en-us/data-center/gb200-nvl72/>`_ and comparable systems are designed specifically around Multi-Node NVLink (`MNNVL <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html>`_) to turn a rack of GPU machines -- each with a small number of GPUs -- into a supercomputer with a large number of GPUs communicating at high bandwidth (1.8 TB/s chip-to-chip, and over `130 TB/s cumulative bandwidth <https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/overview.html#fifth-generation-nvlink>`_ on a GB200 NVL72).

NVIDIA's DRA Driver for GPUs enables MNNVL for Kubernetes workloads by introducing a new concept -- the **ComputeDomain**:
when workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.

Suggested change
when workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.
when a workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.


done

Comment on lines 26 to 27

A design goal of this DRA driver is to make IMEX, as much as possible, an implementation detail that workload authors and cluster operators do not need to be concerned with: the driver launches and/or reconfigures IMEX daemons and establishes and injects IMEX channels into containers as needed.
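
As a concrete illustration of the channel injection described above: a container that received an IMEX channel typically sees a character device under /dev/nvidia-caps-imex-channels. A hedged way to verify (the pod name is a placeholder):

$ kubectl exec <pod-name> -- ls /dev/nvidia-caps-imex-channels
# Expect one or more channelN entries (for example channel0) once injection has happened.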

Add a link to the imex user guide?


@jgehrcke jgehrcke Jul 23, 2025


Good point!

Rendered: (screenshot of the rendered section)

Personally, I don't find the IMEX user guide very good as it stands, and so I find that in the motivation, https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service is a better reference than https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.

"IMEX" in line 2 of that block already links to https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service (a good, quick intro to IMEX).

I have now made "IMEX channels" (second-last line) link to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html (a specific section in the IMEX user guide).

We also link to the IMEX user guide in "related resources".


done (linked "IMEX channels" to what I said above)

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.3.0-rc.4" \
--create-namespace --namespace nvidia-dra-driver-gpu \

@jgehrcke jgehrcke Jul 23, 2025


I do not bother explicitly setting nvidiaDriverRoot=/ here. This is explained in the note below, and much better explained in the operator install docs which are linked to above.

Key words here are "Example for", which I hope will make responsible users think a tiny bit about the command they construct (based on ref docs, instead of based on this example alone).

I think this is good enough, especially in view of this section changing significantly soon (upon releasing operator 25.8.0).


In the future we can probably do better and, like the operator does, autodetect this.

Signed-off-by: Abigail McCarthy <[email protected]>
Co-authored-by: Tariq <[email protected]>
- Kubernetes v1.32 or newer.
- DRA and corresponding API groups must be enabled (`see Kubernetes docs <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation>`_).
- NVIDIA GPU Driver 565 or later.
- While not strictly required, we recommend using NVIDIA's GPU Operator v25.3.0 or later, installed with CDI enabled (use the ``--set cdi.enabled=true`` commandline argument during ``helm install``). For reference, please refer to the GPU Operator `installation documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options>`__.

It's not strictly required to use the GPU Operator to enable CDI in the underlying runtime. But it is required to enable CDI in the underlying runtime in general.


re-thought this, pushed a commit

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.3.0-rc.4" \
--create-namespace --namespace nvidia-dra-driver-gpu \

Suggested change
--create-namespace --namespace nvidia-dra-driver-gpu \
--create-namespace \
--namespace nvidia-dra-driver-gpu \


@jgehrcke jgehrcke Jul 24, 2025


Well, in a previous commit I just deliberately put these two related args on the same line -- to save a tiny bit of vertical space without losing clarity. (might even make it easier to draw attention to the more interesting part of the set of arguments).

But I can choose to not care :) Reverting.


I don't like having more than one flag on a single line (unless they are all on exactly one line). It makes it hard to scan top to bottom and see all of the flags.


@jgehrcke jgehrcke Jul 24, 2025


I feel that way most of the time, too. I sometimes make an exception for a logical group, as in the case of --create-namespace --namespace nvidia-dra-driver-gpu. Anyway, changed!

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.3.0-rc.4" \
--create-namespace --namespace nvidia-dra-driver-gpu \

Suggested change
--create-namespace --namespace nvidia-dra-driver-gpu \
--create-namespace \
--namespace nvidia-dra-driver-gpu \

Comment on lines 58 to 62
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=${version} \
--set cdi.enabled=true

Suggested change
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=${version} \
--set cdi.enabled=true
$ helm install nvidia/gpu-operator \
--wait \
--version=${version} \
--generate-name \
--create-namespace \
--namespace gpu-operator \
--set cdi.enabled=true


Is this actually rendered anywhere? Should we drop it?


Is this actually rendered anywhere? Should we drop it?


Is this actually rendered anywhere? Should we drop it?


@jgehrcke jgehrcke Jul 24, 2025


not used, right -- dropping!


@klueska klueska left a comment


Still some rough edges in places, but I'm happy enough with it to merge it.

Towards clarity, minimalism, and low cognitive burden.

Added more thoughts about security -- after all, IMEX and hence CDs
are primarily delivering a security boundary.

Sanity-checked by Kevin.

This is a milestone; much (rather obviously) still needs to be improved from here on.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke commented Jul 24, 2025

Thanks for review cycles and discussion. Squashed -- waiting for CI, then merging.

@jgehrcke jgehrcke merged commit d2766f5 into NVIDIA:main Jul 24, 2025
2 checks passed