
Conversation


@a-mccarthy a-mccarthy commented Mar 6, 2025

@a-mccarthy a-mccarthy marked this pull request as draft March 6, 2025 19:58

github-actions bot commented Mar 6, 2025

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-162

@ArangoGutierrez ArangoGutierrez requested a review from Copilot March 7, 2025 15:29

@Copilot Copilot AI left a comment


PR Overview

This pull request introduces documentation-related changes for the DRA driver while also providing configuration manifests for enabling dynamic resource allocation and related compute domain functionality.

  • Adds a kubeadm initialization configuration to enable the DynamicResourceAllocation feature.
  • Introduces a ComputeDomain resource and a pod manifest for resource channel injection.
  • Provides an additional ComputeDomain definition in a separate file.
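
For orientation, here is a minimal sketch of what such a kubeadm initialization configuration could look like. This is not a copy of the file in this PR; the kubeadm v1beta4 config API and the resource.k8s.io/v1beta1 API group (Kubernetes v1.32) are assumptions here:

# Hedged sketch only; the authoritative file is gpu-operator/manifests/input/kubeadm-init-config.yaml in this PR.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true
  - name: runtime-config
    value: resource.k8s.io/v1beta1=true
controllerManager:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true
scheduler:
  extraArgs:
  - name: feature-gates
    value: DynamicResourceAllocation=true
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true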

Reviewed Changes

  • gpu-operator/manifests/input/kubeadm-init-config.yaml: Defines kubeadm ClusterConfiguration and KubeletConfiguration with dynamic resource allocation settings.
  • gpu-operator/manifests/input/imex-channel-injection.yaml: Contains ComputeDomain and Pod definitions for testing resource channel injection.
  • gpu-operator/manifests/input/dra-compute-domain-crd.yaml: Provides an alternative ComputeDomain resource definition that duplicates functionality in imex-channel-injection.yaml.

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

gpu-operator/manifests/input/dra-compute-domain-crd.yaml:1

  • The ComputeDomain resource defined here duplicates the one in imex-channel-injection.yaml. Consider consolidating the definitions to avoid potential conflicts during deployment.
apiVersion: resource.nvidia.com/v1beta1

@a-mccarthy a-mccarthy requested a review from guptaNswati March 10, 2025 16:29

elezar commented Mar 21, 2025

As a general comment, we need to call out what that driver enables -- which is to use Multi-Node NVLink -- and not how this is enabled. The UX for ComputeDomains has been designed to explicitly make minimal references to IMEX since this is an implementation detail. It is required to enable MNNVL, but should not be something that a user is concerned with.

I'll make another pass at reviewing this soon.

@ArangoGutierrez ArangoGutierrez requested a review from Copilot March 21, 2025 13:50

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces draft manifests for the DRA driver documentation and related resources, outlining initial configurations and resource definitions.

  • Added a kubeadm initialization configuration file with feature gate settings for dynamic resource allocation.
  • Introduced a ComputeDomain resource manifest.
  • Provided an injection manifest that combines a ComputeDomain definition with a Pod using the resource.

Reviewed Changes

Copilot reviewed 4 out of 7 changed files in this pull request and generated 1 comment.

  • gpu-operator/manifests/input/kubeadm-init-config.yaml: New kubeadm init config with extraArgs for enabling dynamic resource allocation.
  • gpu-operator/manifests/input/dra-compute-domain-crd.yaml: New ComputeDomain resource definition with basic specs.
  • gpu-operator/manifests/input/imex-channel-injection.yaml: Injection manifest combining a ComputeDomain and a Pod definition.
Files not reviewed (3)
  • gpu-operator/index.rst: Language not supported
  • gpu-operator/manifests/output/compute-domain-channel-injection-crd.txt: Language not supported
  • gpu-operator/manifests/output/imex-logs.txt: Language not supported
Comments suppressed due to low confidence (1)

gpu-operator/manifests/input/imex-channel-injection.yaml:2

  • The ComputeDomain resource is defined in both this file and in dra-compute-domain-crd.yaml. Confirm if this duplication is intentional or consider consolidating these definitions to avoid potential deployment conflicts.
apiVersion: resource.nvidia.com/v1beta1
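
To make the duplication concern concrete, this is roughly the shape of a ComputeDomain paired with a Pod that consumes its channel. It is a hedged sketch based on upstream dra-driver-gpu examples, not a copy of the manifests in this PR; the resource names, the image, and field names such as numNodes and channel.resourceClaimTemplate are assumptions:

# Hedged sketch; names, image, and field layout are assumptions.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  containers:
  - name: ctr
    image: ubuntu:24.04
    # List the injected IMEX channel device(s), then idle.
    command: ["bash", "-c", "ls -la /dev/nvidia-caps-imex-channels; sleep 3600"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0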

--namespace nvidia-dra-driver-gpu \
--set nvidiaDriverRoot=/run/nvidia/driver \
--set nvidiaCtkPath=/usr/local/nvidia/toolkit/nvidia-ctk \
--set resources.gpus.enabled=false

Thank you for following this ❤️

@a-mccarthy

@klueska and @jgehrcke, I pushed some updates to this PR; please review if you have a chance. I still need to add more content around the GPU resource allocation use case, but I'd like to get some feedback on the page structure and whether the rest of the content for MNNVL support makes sense.


@jgehrcke jgehrcke left a comment


Thank you @a-mccarthy! I looked at the last two commits only and added just a small number of comments.

After more quick/diagonal reading, I noticed that more changes are required. Maybe it would be useful if I add a commit to this branch -- WDYT?

Multi-Node NVLink support with the NVIDIA DRA Driver for GPUs
#############################################################
##################################################
NVIDIA Dynamic Resource Allocation Driver for GPUs

Let's still call this "NVIDIA DRA Driver for GPUs"

* - ``resources.gpus.enabled``
- Specifies whether to enable the NVIDIA DRA Driver for GPUs to manage GPU resource allocation.
This feature is not yet supported and you must set to ``false``.
This feature is in Technolody Preview and only recommended for testing, not production enviroments.

technology :)
environments

- Specifies whether to enable the NVIDIA DRA Driver for GPUs to manage GPU resource allocation.
This feature is not yet supported and you must set to ``false``.
This feature is in Technolody Preview and only recommended for testing, not production enviroments.
To use with MNNVL, set to ``false``.

Let's remove this sentence.

(it's not correct: the driver has two components: the CD side of things, and the GPU side of things -- they can both be enabled independently).

For more information about Kubernetes Dynamic Resource Allocation (DRA), refer to the `Kubernetes DRA documentation <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/>`_.
This page provides an overview of the DRA Driver for GPUs, its supported functionality, installing the component with the GPU Operator, and examples of using the DRA Driver for GPUs with supported use cases.

Before continuing, you should be familiar with the components of the `Kubernetes DRA feature <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/>`_.

I think this can be the first sentence in this section, and it should be the sentence that introduces the DRA abbreviation.

@a-mccarthy a-mccarthy requested a review from chenopis July 7, 2025 13:14

@chenopis chenopis left a comment


A few small nits and questions. LGTM otherwise.

The `NVIDIA GPU Feature Discovery <https://github.com/NVIDIA/k8s-device-plugin/tree/main/docs/gpu-feature-discovery>`_ adds a Clique ID to each GPU node.
This is a unique identifier within an NVLink domain (physically connected GPUs over NVLink) that indicates which GPUs within that domain are physically capable of talking to each other.

The partitioning of GPUs into a set of cliques is done at the NVSwitch layer, not at the individual node layer. All GPUs on a given node are guaranteed to have the same <ClusterUUID.Clique ID> pair.

And why is it a pair? Oh, does the pair refer to the ClusterUUID and the Clique ID? Would "value" be better? I think of "pair" as something like (X,Y). Later on, "<cluster-uuid, clique-id>" is referenced, so this was confusing for me.

Within the bracket notation, would it be better if "Clique ID" were "Clique_ID" to make it clear that it is one value?
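
For readers skimming this thread: the identifier under discussion typically surfaces as a node label whose value combines both parts. The label key below is an assumption for illustration, not quoted from this PR:

# Hypothetical node label added by GPU Feature Discovery; the value format follows the excerpt above.
labels:
  nvidia.com/gpu.clique: '<cluster-uuid>.<clique-id>'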

* - ``mpirun`` command argument ``-ppr:4:node``

-
* Number of GPUs per node as the process-per-resource number

Does this need to be a bullet item since there is only one?
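
To illustrate the row above: with 4 GPUs per node and 2 nodes, an Open MPI style invocation would look roughly like this (flag spelling and the application name are assumptions for illustration):

$ mpirun -np 8 --map-by ppr:4:node ./my_app   # 2 nodes x 4 processes per node = 8 ranks, one per GPU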

@a-mccarthy a-mccarthy requested review from chenopis and jgehrcke July 11, 2025 18:47

@klueska klueska left a comment


First quick pass

Comment on lines 31 to 40
The two parts of this driver
============================

NVIDIA's DRA Driver for GPUs is comprised of two subsystems that are largely independent of each other: one manages GPUs, and the other one manages ComputeDomains.

The next documentation chapter contains instructions for how to install both parts or just one of them.
Additionally, we have prepared two separate documentation chapters, providing more in-depth information for each of the two subsystems:

- `Documentation for ComputeDomain support <foo2>`_
- `Documentation for GPU support <foo1>`_

Does it make sense to just merge this into the intro? And then put the installation instructions inline here (instead of in a separate file)?


put the installation instructions inline here (instead of in a separate file)?

Right now I think I still prefer the separation, but I want to think that through.

Currently, the left sidebar shows: (screenshot of the sidebar navigation)
We could instead show "Introduction & Installation" -- is that what you had in mind?


yeah I'd be fine with that


just merge this into the intro?

Did you mean the entire subsection with "this" or maybe just the documentation links?

Something about the tail end of this document isn't quite ideal yet -- this is probably what you mean. I agree.

Let's see... Right now we have a section titled "Introduction" with three subsections:

  1. tagline / short intro (w/o explicit subsection header)
  2. dra primer
  3. concept: two parts

That is, as it stands (maybe not clearly visible with typographical means), both "The two parts of this driver" and "A primer on DRA" are below Introduction in the hierarchy. In that sense, this is already "in the intro".

I think you mean to maybe pull this "up", and to maybe remove a sub section heading. I can see that we need to improve something here, but I am not yet sure. Some thoughts that I believe also matter:

  • The concept of "The two parts of this driver" (regardless of specific wording) is so important that I think it is valuable to have it reflected in a heading. That's important cognitive input to a reader. It makes it more likely that readers walk away knowing about that concept.
  • Having "A primer on DRA" as the only subsection is maybe a little funky.

While I am here, I thought about replacing "The two parts of this driver" with "The twofold nature of this driver" as something potentially stylistically nicer. The few percent of ambiguity that this might introduce are taken care of in the text body of that subsection.

Lots of text. Just thinking out loud.


We could instead show "Introduction & Installation" -- is that what you had in mind?

done that

- Kubernetes v1.32 or newer.
- DRA and corresponding API groups must be enabled (`see Kubernetes docs <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation>`_).
- GPU Driver 565 or later.
- NVIDIA's GPU Operator v25.3.0 or later, installed with CDI enabled (use the ``--set cdi.enabled=true`` commandline argument during ``helm install``). For reference, please refer to the GPU Operator `installation documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options>`__.

This isn't technically true. We always use it in conjunction with the operator, but it doesn't have to be.

Is there some way to strongly suggest it rather than list it as a requirement?


I'll prepend that last bullet with "While not strictly required, we recommend using .."


done

Comment on lines 83 to 99
#. Install the driver, providing install-time configuration parameters. Example:

.. code-block:: console
$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.3.0-rc.4" \
--create-namespace \
--namespace nvidia-dra-driver-gpu \
--set nvidiaDriverRoot=/run/nvidia/driver \
--set resources.gpus.enabled=false

This is only usable for operator-managed drivers. With host-installed drivers, nvidiaDriverRoot=/.


Edit -- I see that you put that caveat in the Note below.
Most people copy and paste, so this may trip people up if there is not a full alternative.


so this may trip people up if there is not a full alternative

Ack. I will add another block for host-provided drivers (nvidiaDriverRoot=/).


done
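
For reference while reading this thread, a hedged sketch of the host-installed-driver variant discussed above; it reuses the flags shown in the excerpts and only changes the driver root:

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
    --version="25.3.0-rc.4" \
    --create-namespace \
    --namespace nvidia-dra-driver-gpu \
    --set nvidiaDriverRoot=/ \
    --set resources.gpus.enabled=false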

NVIDIA's `GB200 NVL72 <https://www.nvidia.com/en-us/data-center/gb200-nvl72/>`_ and comparable systems are designed specifically around Multi-Node NVLink (`MNNVL <https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html>`_) to turn a rack of GPU machines -- each with a small number of GPUs -- into a supercomputer with a large number of GPUs communicating at high bandwidth (1.8 TB/s chip-to-chip, and over `130 TB/s cumulative bandwidth <https://docs.nvidia.com/multi-node-nvlink-systems/multi-node-tuning-guide/overview.html#fifth-generation-nvlink>`_ on a GB200 NVL72).

NVIDIA's DRA Driver for GPUs enables MNNVL for Kubernetes workloads by introducing a new concept -- the **ComputeDomain**:
when workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.

Suggested change
when workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.
when a workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory **securely** via NVLink among all pods that comprise the workload.


done

Comment on lines 26 to 27

A design goal of this DRA driver is to make IMEX, as much as possible, an implementation detail that workload authors and cluster operators do not need to be concerned with: the driver launches and/or reconfigures IMEX daemons and establishes and injects IMEX channels into containers as needed.
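
As a concrete illustration of the channel injection described above: a container that received an IMEX channel typically sees a character device under /dev/nvidia-caps-imex-channels. A hedged way to verify (the pod name is a placeholder):

$ kubectl exec <pod-name> -- ls /dev/nvidia-caps-imex-channels
# Expect one or more channelN entries (for example channel0) once injection has happened.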

Add a link to the imex user guide?


@jgehrcke jgehrcke Jul 23, 2025


Good point!

Rendered: (screenshot of the rendered section)

Personally, I don't find the IMEX user guide very good as it stands, and so I find that in the motivation, https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service is a better reference than https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/overview.html.

"IMEX" in line 2 of that block already links to https://docs.nvidia.com/multi-node-nvlink-systems/mnnvl-user-guide/overview.html#internode-memory-exchange-service (a good, quick intro to IMEX).

I have now made "IMEX channels" (second-last line) link to https://docs.nvidia.com/multi-node-nvlink-systems/imex-guide/imexchannels.html (a specific section in the IMEX user guide).

We also link to the IMEX user guide in "related resources".


done (linked "IMEX channels" to what I said above)

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.3.0-rc.4" \
--create-namespace --namespace nvidia-dra-driver-gpu \

@jgehrcke jgehrcke Jul 23, 2025


I do not bother explicitly setting nvidiaDriverRoot=/ here. This is explained in the note below, and much better explained in the operator install docs which are linked to above.

Key words here are "Example for", which I hope will make responsible users think a tiny bit about the command they construct (based on ref docs, instead of based on this example alone).

I think this is good enough, especially in view of this section changing significantly soon (upon releasing operator 25.8.0).


In the future we can probably do better and, like the operator does, autodetect this.

Signed-off-by: Abigail McCarthy <[email protected]>
Co-authored-by: Tariq <[email protected]>
- Kubernetes v1.32 or newer.
- DRA and corresponding API groups must be enabled (`see Kubernetes docs <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/#enabling-dynamic-resource-allocation>`_).
- NVIDIA GPU Driver 565 or later.
- While not strictly required, we recommend using NVIDIA's GPU Operator v25.3.0 or later, installed with CDI enabled (use the ``--set cdi.enabled=true`` commandline argument during ``helm install``). For reference, please refer to the GPU Operator `installation documentation <https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#common-chart-customization-options>`__.

It's not strictly required to use the GPU Operator to enable CDI in the underlying runtime. But it is required to enable CDI in the underlying runtime in general.


re-thought this, pushed a commit

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.3.0-rc.4" \
--create-namespace --namespace nvidia-dra-driver-gpu \

Suggested change
--create-namespace --namespace nvidia-dra-driver-gpu \
--create-namespace \
--namespace nvidia-dra-driver-gpu \


@jgehrcke jgehrcke Jul 24, 2025


Well, in a previous commit I just deliberately put these two related args on the same line -- to save a tiny bit of vertical space without losing clarity. (might even make it easier to draw attention to the more interesting part of the set of arguments).

But I can choose to not care :) Reverting.


I don't like having more than one flag on a single line (unless they are all on exactly one line). It makes it hard to scan top to bottom and see all of the flags.


@jgehrcke jgehrcke Jul 24, 2025


I feel that way most of the time, too. I sometimes make an exception for a logical group, as in the case of --create-namespace --namespace nvidia-dra-driver-gpu. Anyway, changed!

$ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
--version="25.3.0-rc.4" \
--create-namespace --namespace nvidia-dra-driver-gpu \

Suggested change
--create-namespace --namespace nvidia-dra-driver-gpu \
--create-namespace \
--namespace nvidia-dra-driver-gpu \

Comment on lines 58 to 62
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=${version} \
--set cdi.enabled=true

Suggested change
$ helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=${version} \
--set cdi.enabled=true
$ helm install nvidia/gpu-operator \
--wait \
--version=${version} \
--generate-name \
--create-namespace \
--namespace gpu-operator \
--set cdi.enabled=true


Is this actually rendered anywhere? Should we drop it?


Is this actually rendered anywhere? Should we drop it?


Is this actually rendered anywhere? Should we drop it?


@jgehrcke jgehrcke Jul 24, 2025


not used, right -- dropping!


@klueska klueska left a comment


Still some rough edges in places, but I'm happy enough with it to merge it.

Towards clarity, minimalism, and low cognitive burden.

Added more thoughts about security -- after all, IMEX and hence CDs
are primarily delivering a security boundary.

Sanity-checked by Kevin.

This is a milestone; much (rather obviously) still needs to be improved from here on.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>

jgehrcke commented Jul 24, 2025

Thanks for review cycles and discussion. Squashed -- waiting for CI, then merging.

@jgehrcke jgehrcke merged commit d2766f5 into NVIDIA:main Jul 24, 2025
2 checks passed