
adding dra docs #162 (Draft)

wants to merge 4 commits into main
Conversation

a-mccarthy (Collaborator)

This is a draft version of the DRA driver documentation.

To be decided:

  • What workflows to highlight to users
  • Where is the best place for these docs to live

Signed-off-by: Abigail McCarthy <[email protected]>
@a-mccarthy marked this pull request as draft March 6, 2025 19:58

github-actions bot commented Mar 6, 2025

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-162

About the NVIDIA GPU DRA Driver
*******************************

The NVIDIA GPU DRA Driver leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA IMEX channels available in GH200 and GB200 GPUs.


Maybe we should add a bit about IMEX channels. E.g.
.... to support NVIDIA IMEX (Internode Memory Exchange/Management Service) channels available in GH200 and GB200 systems that allow the GPUs to directly read/write each other's memory over a high-bandwidth NVLink.

Collaborator

defined in lines 13 and 14

The NVIDIA GPU DRA Driver creates and manages IMEX channels through the creation of a ComputeDomain custom resource. Use this custom resource to define your resource templates, and then reference the templates within your workload specs.

An IMEX channel is a construct that allows a set of GPUs to directly read and write each other's memory over a high-bandwidth NVLink.
The NVLink connection may either be directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch.


oh you defined it here.
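For illustration, a minimal ComputeDomain manifest could look like the following sketch. The ``apiVersion``, ``kind``, ``metadata.name``, and ``numNodes`` fields are taken from excerpts quoted elsewhere in this review; the ``channel.resourceClaimTemplate`` stanza and the template name are assumptions about how the generated ResourceClaimTemplate is named and exposed.

.. code-block:: yaml

   apiVersion: resource.nvidia.com/v1beta1
   kind: ComputeDomain
   metadata:
     name: imex-channel-injection
   spec:
     # Number of nodes expected to participate in this compute domain (required).
     numNodes: 1
     # Assumed stanza: names the ResourceClaimTemplate that workloads reference.
     channel:
       resourceClaimTemplate:
         name: imex-channel-0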

The NVLink connection may either be directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch.
Once an IMEX channel has been established for a set of GPUs, they are free to read and write each other's memory via extensions to the CUDA memory call APIs.

The ability to support IMEX channels on GH200 and GB200 systems is essential, as they have been designed specifically to exploit the use of IMEX channels to turn a rack of GPU machines (each with a small number of GPUs) into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.


I think it is important to first summarize what is to follow: that IMEX channels are meant for multi-node communication.

IMEX channel (by definition) is a resource that spans multiple nodes, hence the ability to support IMEX channels on GH200 and GB200 systems is essential. These GH200 and GB200 systems are designed specifically to leverage the use of IMEX channels to turn a rack of GPU machines (each with a small number of GPUs) into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.


Kubernetes Dynamic Resource Allocation (DRA), available as beta in Kubernetes v1.32, is an API for requesting and sharing resources between pods and containers inside a pod.
This feature treats specialized hardware as a definable and reusable object and provides the necessary primitives to support cross-node resources such as IMEX channels.
Along with the NVIDIA GPU DRA Driver, you are able to use DRA to define IMEX channel resources that can be managed by Kubernetes.


NVIDIA GPU DRA Driver uses DRA features to define IMEX channel resources that can be managed by Kubernetes.
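As a rough illustration of the workload side, a Pod references the generated ResourceClaimTemplate through the standard DRA fields in its spec. This is a hedged sketch: the template name ``imex-channel-0``, the container image, and the device path it lists are placeholders rather than values defined by this page.

.. code-block:: yaml

   apiVersion: v1
   kind: Pod
   metadata:
     name: imex-channel-injection
   spec:
     containers:
     - name: ctr
       image: ubuntu:22.04
       # Lists the injected IMEX channel device and then idles (placeholder workload).
       command: ["bash", "-c", "ls -la /dev/nvidia-caps-imex-channels && sleep 3600"]
       resources:
         claims:
         # Binds the claim declared below to this container.
         - name: imex-channel-0
     resourceClaims:
     # References the ResourceClaimTemplate generated for the ComputeDomain (assumed name).
     - name: imex-channel-0
       resourceClaimTemplateName: imex-channel-0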

Drivers must be pre-installed on all GPU nodes before installing the NVIDIA GPU Operator, as operator-managed drivers are not supported at this time.

- IMEX packages installed on GPU nodes with systemd service disabled.
The IMEX package versions must match the installed driver version.


Refer to the release notes for the exact version of the driver and IMEX to install.

Collaborator Author

@guptaNswati can you share a link to the release notes you are referring to? Thanks!


This is the official way to install NVIDIA drivers: https://www.nvidia.com/en-us/drivers/


$ curl -fsSL -o /tmp/nvidia-imex-${IMEX_DRIVER_VERSION}_${DRIVER_VERSION}-1_${TARGETARCH}.deb \
    https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/nvidia-imex-${IMEX_DRIVER_VERSION}_${DRIVER_VERSION}-1_${TARGETARCH}.deb && \
  dpkg -i /tmp/nvidia-imex-${IMEX_DRIVER_VERSION}_${DRIVER_VERSION}-1_${TARGETARCH}.deb && \
  nvidia-imex --version && \
  ls /etc/nvidia-imex


don't need this ls

The following example shows how to install IMEX drivers.

.. code-block:: console


These assume you have already installed the NVIDIA drivers.

DRIVER_VERSION=$(nvidia-smi -i 0 --query-gpu=driver_version --format=csv,noheader,nounits)
IMEX_DRIVER_VERSION="${DRIVER_VERSION%%.*}"
TARGETARCH=$(uname -m | sed -E 's/^x86_64$/amd64/; s/^(aarch64|arm64)$/arm64/')

Collaborator

yes, line 32 defines the second prerequisite for installing drivers
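Putting the two quoted fragments together, an end-to-end install on a GPU node might look roughly like the following. This is a hedged sketch: the download path is taken from the snippet above, and the systemd unit name ``nvidia-imex.service`` is an assumption based on the prerequisite to disable the IMEX systemd service.

.. code-block:: console

   # Derive versions from the already-installed NVIDIA driver.
   $ DRIVER_VERSION=$(nvidia-smi -i 0 --query-gpu=driver_version --format=csv,noheader,nounits)
   $ IMEX_DRIVER_VERSION="${DRIVER_VERSION%%.*}"
   $ TARGETARCH=$(uname -m | sed -E 's/^x86_64$/amd64/; s/^(aarch64|arm64)$/arm64/')

   # Download and install the IMEX package that matches the driver branch.
   $ curl -fsSL -o /tmp/nvidia-imex-${IMEX_DRIVER_VERSION}_${DRIVER_VERSION}-1_${TARGETARCH}.deb \
       https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/sbsa/nvidia-imex-${IMEX_DRIVER_VERSION}_${DRIVER_VERSION}-1_${TARGETARCH}.deb
   $ sudo dpkg -i /tmp/nvidia-imex-${IMEX_DRIVER_VERSION}_${DRIVER_VERSION}-1_${TARGETARCH}.deb
   $ nvidia-imex --version

   # Assumed unit name: disable the IMEX systemd service, per the prerequisites.
   $ sudo systemctl disable --now nvidia-imex.service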

Once all daemons have been fully started, the DRA driver unblocks each worker, injects its IMEX channel into the worker and allows it to start running.


View the Compute Domain resources on your cluster
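A hedged sketch of what inspecting these resources could look like; the plural resource name ``computedomains`` and the namespace are assumptions based on the CRD group and the Helm values shown elsewhere in this review:

.. code-block:: console

   # List ComputeDomain objects (assumed plural name for the resource.nvidia.com CRD).
   $ kubectl get computedomains.resource.nvidia.com -A

   # Show the ResourceClaims and ResourceClaimTemplates generated for workloads.
   $ kubectl get resourceclaims,resourceclaimtemplates -A

   # Check the DRA driver pods themselves (assumed namespace from the Helm install).
   $ kubectl get pods -n nvidia-dra-driver-gpu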


we should also add the MPI nvbandwidth example.


It's also described here: NVIDIA/k8s-dra-driver-gpu#249

Collaborator Author

will add that!

Node and Pod Affinity Strategies
=========================================

The ComputeDomain object is not tied directly to the notion of an IMEX channel.


I would rephrase it:

A ComputeDomain isn’t strictly about IMEX channels—it’s about running workloads across a group of compute nodes. This means even if some nodes are not IMEX capable, they can still be part of the same ComputeDomain. To control where your workloads run, you should use NodeAffinity and PodAffinity rules.

@ArangoGutierrez requested a review from Copilot March 7, 2025 15:29


PR Overview

This pull request introduces documentation-related changes for the DRA driver while also providing configuration manifests for enabling dynamic resource allocation and related compute domain functionality.

  • Adds a kubeadm initialization configuration to enable the DynamicResourceAllocation feature.
  • Introduces a ComputeDomain resource and a pod manifest for resource channel injection.
  • Provides an additional ComputeDomain definition in a separate file.

Reviewed Changes

  • gpu-operator/manifests/input/kubeadm-init-config.yaml: Defines kubeadm ClusterConfiguration and KubeletConfiguration with dynamic resource allocation settings.
  • gpu-operator/manifests/input/imex-channel-injection.yaml: Contains ComputeDomain and Pod definitions for testing resource channel injection.
  • gpu-operator/manifests/input/dra-compute-domain-crd.yaml: Provides an alternative ComputeDomain resource definition that duplicates functionality in imex-channel-injection.yaml.

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

gpu-operator/manifests/input/dra-compute-domain-crd.yaml:1

  • The ComputeDomain resource defined here duplicates the one in imex-channel-injection.yaml. Consider consolidating the definitions to avoid potential conflicts during deployment.
apiVersion: resource.nvidia.com/v1beta1
@a-mccarthy requested a review from guptaNswati March 10, 2025 16:29


******************************************************************
Run a Multi-node nvbandwidth Test Requiring IMEX Channels with MPI
Collaborator Author


I need to do some more testing to clean up this example and add some explanations around what's happening.


############################################################
Install the NVIDIA GPU DRA Driver and Configure IMEX Support
Member

I think what we're supporting from a customer perspective is "Multi-Node NVLink" and how we're supporting it is through IMEX channels. The IMEX channels are an implementation detail from the user perspective.

I also believe that the DRA driver is no longer called the IMEX driver, but the Compute Domain DRA Driver.

Contributor
@jgehrcke Mar 21, 2025

Quick feedback on naming.

I am OK with "NVIDIA GPU DRA Driver".

We need consistency. Naming is hard.

Here are my preferences, also after talking to Kevin:

NVIDIA DRA Driver for GPUs

Kubernetes DRA driver for NVIDIA GPUs

The latter is most precise and intuitive I think. But maybe too long, and too k8sy.

I trust that a certain decision has already been made, and we are late in the game with naming discussion. Yet, it's an important discussion.

For the Helm chart (public) listing I picked "NVIDIA DRA Driver for GPUs", see

https://catalog.ngc.nvidia.com/orgs/nvidia/helm-charts/nvidia-dra-driver-gpu

Here is how we should think about it:

The DRA driver for NVIDIA GPUs enables multi-node GPU workloads in Kubernetes environments

That's the very high-level.

I agree with Evan that anything IMEX is an (important) implementation detail.

Contributor

I also believe that the DRA driver is no longer called the IMEX driver

And yes yes yes to this. We should think that it never was named IMEX driver. This was an organizational brainfart. :)

Contributor

After Slack discussion: for now, we decided to use "NVIDIA DRA Driver for GPUs", like we did for nSpect and also the Helm chart registry. Kevin also prefers this:

I usually prefer NVIDIA DRA Driver for GPUs so there’s not 3 all caps acronyms right next to each other

This page provides more information about the GPU DRA Driver, including how to install it and examples of deploying workloads that use IMEX channels.

*******************************
About the NVIDIA GPU DRA Driver
Member

What is the "GPU DRA Driver" in this context? Is it the driver that is managing the lifecycle of compute domains or access to GPUs themselves?

Contributor

What is the "GPU DRA Driver" in this context?

The "NVIDIA GPU DRA Driver" in this document always refers to the component developed in https://github.com/NVIDIA/k8s-dra-driver-gpu

Is it the driver that is managing the lifecycle of compute domains

in that sense: yes

@elezar (Member) commented Mar 21, 2025

As a general comment, we need to call out what that driver enables -- which is to use Multi-Node NVLink -- and not how this is enabled. The UX for ComputeDomains has been designed to explicitly make minimal references to IMEX since this is an implementation detail. It is required to enable MNNVL, but should not be something that a user is concerned with.

I'll make another pass at reviewing this soon.

@ArangoGutierrez requested a review from Copilot March 21, 2025 13:50
Copilot AI left a comment

Pull Request Overview

This PR introduces draft manifests for the DRA driver documentation and related resources, outlining initial configurations and resource definitions.

  • Added a kubeadm initialization configuration file with feature gate settings for dynamic resource allocation.
  • Introduced a ComputeDomain resource manifest.
  • Provided an injection manifest that combines a ComputeDomain definition with a Pod using the resource.

Reviewed Changes

Copilot reviewed 4 out of 7 changed files in this pull request and generated 1 comment.

  • gpu-operator/manifests/input/kubeadm-init-config.yaml: New kubeadm init config with extraArgs for enabling dynamic resource allocation.
  • gpu-operator/manifests/input/dra-compute-domain-crd.yaml: New ComputeDomain resource definition with basic specs.
  • gpu-operator/manifests/input/imex-channel-injection.yaml: Injection manifest combining a ComputeDomain and a Pod definition.

Files not reviewed (3)

  • gpu-operator/index.rst: Language not supported
  • gpu-operator/manifests/output/compute-domain-channel-injection-crd.txt: Language not supported
  • gpu-operator/manifests/output/imex-logs.txt: Language not supported
Comments suppressed due to low confidence (1)

gpu-operator/manifests/input/imex-channel-injection.yaml:2

  • The ComputeDomain resource is defined in both this file and in dra-compute-domain-crd.yaml. Confirm if this duplication is intentional or consider consolidating these definitions to avoid potential deployment conflicts.
apiVersion: resource.nvidia.com/v1beta1

Comment on lines +5 to +16
- name: "feature-gates"
value: "DynamicResourceAllocation=true"
- name: "runtime-config"
value: "resource.k8s.io/v1alpha3=true"
controllerManager:
extraArgs:
- name: "feature-gates"
value: "DynamicResourceAllocation=true"
scheduler:
extraArgs:
- name: "feature-gates"
value: "DynamicResourceAllocation=true"
Copilot AI Mar 21, 2025

The indentation for 'value' under the extraArgs list item appears inconsistent with standard YAML formatting. Consider aligning it to 2 additional spaces relative to the '-' line (e.g., 4 spaces instead of 5) to ensure proper parsing.


An IMEX channel is a construct that allows a set of GPUs to directly read and write each other's memory over a high-bandwidth NVLink.
The NVLink connection may either be directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch.
Once an IMEX channel has been established for a set of GPUs, they are free to read and write each other's memory via extensions to the CUDA memory call APIs.
Contributor
@jgehrcke Mar 21, 2025

My hunch is that we should not go into this detail here but just link to reference documentation. It's very easy to say something wrong/misleading here.

Collaborator Author

@jgehrcke do you happen to know where that documentation lives? I have included a lot of this info because I was unable to find a place where we can link out to :)

Contributor
@jgehrcke Mar 21, 2025

Reference docs are here, for example: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MALLOC__ASYNC.html#group__CUDA__MALLOC__ASYNC_1g8aa4c143dbc20293659cd883232b95f2

There is the beautiful statement of

When exporter and importer CUDA processes have been granted access to the same IMEX channel, they can securely share memory.

And that's already a better description because of the emphasis on securely.

Maybe we can borrow that statement, and otherwise just link to refdocs.

--namespace nvidia-dra-driver-gpu \
--set nvidiaDriverRoot=/run/nvidia/driver \
--set nvidiaCtkPath=/usr/local/nvidia/toolkit/nvidia-ctk \
--set resources.gpus.enabled=false
Contributor

Thank you for following this ❤️
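For context, the quoted flags belong to a Helm install of the DRA driver chart. A complete invocation might look roughly like the following hedged sketch; the repository URL, release name, and version are assumptions, while the chart name follows the NGC listing linked earlier (nvidia-dra-driver-gpu):

.. code-block:: console

   # Assumed Helm repository, release name, and version; flags as quoted above.
   $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
   $ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
       --version="25.3.0" \
       --create-namespace \
       --namespace nvidia-dra-driver-gpu \
       --set nvidiaDriverRoot=/run/nvidia/driver \
       --set nvidiaCtkPath=/usr/local/nvidia/toolkit/nvidia-ctk \
       --set resources.gpus.enabled=false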

* - ``nvidiaDriverRoot``
- Specifies the driver root on the host.
For Operator managed drivers, use ``/run/nvidia/driver``.
For pre-installed drivers, use ``/``.
Contributor

Can we provide a recommended mode of operation?

Something like "Use Operator-managed drivers if in doubt".


* - ``nvidiaCtkPath``
- Specifies the path of The NVIDIA Container Tool Kit binary (nvidia-ctk) on the host, as it should appear in the the generated CDI specification.
The exact path depends on the system that runs on the node.
Contributor

Maybe here we should also say sth like "use the Operator-managed nvidia-ctk if in doubt"

CC @elezar

Member

I think we can do something like we do above for the nvidiaDriverRoot and call out the two options with their default values.

  • /usr/bin/nvidia-ctk for a pre-installed NVIDIA Container Toolkit
  • /usr/local/nvidia/toolkit/nvidia-ctk for a GPU Operator-installed NVIDIA Container Toolkit

Prerequisites
=============

- GH200 and GB200 GPUs with Mulit-Node NVLink connections between GPUs.
Contributor

Multi

and maybe we can also introduce the abbreviation "(MNNVL)" here and re-use it below if we can. It's seemingly (as I am learning about this, too) a very common abbreviation in this growing ecosystem.

Member

Note that the Compute Domain driver does not require a GH200 or GB200 system. The driver will function on systems that do not have MNNVL.

Collaborator Author

@elezar what do you mean by function? like it will deploy and not throw errors, or will it actually create resources on the cluster, even though there is no MNNVL

This means even if some nodes are not IMEX capable, they can still be part of the same ComputeDomain.
You must apply NodeAffinity and PodAffinity rules to make sure your workloads run on IMEX capable nodes.

For example, you could set PodAffinity with a preferred topologyKey set to ``nvidia.com/gpu.clique`` for workloads to span multiple NVLink domains but want them packed as tightly as possible. Or use a required topologyKey set to ``nvidia.com/gpu.clique`` when you require all workloads deployed into the same NVLink domain, but don't care which one.
Contributor

I want to add my understanding here and then we can think about potentially rewording/reordering, simplifying.

Technically, when requiring the value of nvidia.com/gpu.clique to be the same among all jobs then this is the same as saying: all jobs must land in the same domain & clique. And that means: all jobs are mutually reachable (all-to-all communication is possible). For a MNNVL setup this is what's typically wanted, and maybe we should lead with that.

So, I consider this use case to be more advanced or exotic:

for workloads to span multiple NVLink domains but want them packed as tightly as possible.

But maybe I don't know enough.
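To make the affinity discussion concrete, a required pod-affinity rule keyed on the clique label could be written as in the following hedged sketch (placed under the Pod's ``spec``; the workload label is a placeholder):

.. code-block:: yaml

   affinity:
     podAffinity:
       requiredDuringSchedulingIgnoredDuringExecution:
       # All pods carrying this label must be scheduled into the same NVLink domain and clique.
       - labelSelector:
           matchLabels:
             app: mnnvl-workload   # placeholder label selecting the workload's pods
         topologyKey: nvidia.com/gpu.clique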

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
Member

@klueska do you have ideas about another name for this?

Install the NVIDIA GPU DRA Driver and Configure IMEX Support
############################################################

THe NVIDIA GPU DRA Driver is an additional component you can install after the GPU Operator that enables you to use the Kubernetes DRA feature to define IMEX channel resources that are managed by Kubernetes.
Member

What about:

Suggested change
THe NVIDIA GPU DRA Driver is an additional component you can install after the GPU Operator that enables you to use the Kubernetes DRA feature to define IMEX channel resources that are managed by Kubernetes.
The NVIDIA GPU DRA Driver is an additional component you can install alongside the GPU Operator that enables you to use the Kubernetes DRA feature to support Multi-Node NVLink in distributed applications.

We may also want to link to the Kubernetes DRA documentation? https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/

About the NVIDIA GPU DRA Driver
*******************************

The NVIDIA GPU DRA Driver leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA IMEX channels available in GH200 and GB200 GPUs.
Member

Does something like the following better capture the intent of the COMPUTE DOMAIN DRIVER:

The NVIDIA GPU DRA Driver provides a Compute Domain abstraction (as a Kubernetes CRD) that allows distributed applications to make use of technologies such as Multi-node NVLink if available. The underlying "connectivity" (not sure about this word) is managed by the NVIDIA GPU DRA Driver to ensure portability of workloads.

Member

Note that this does not mention IMEX channels.

Install the NVIDIA GPU DRA Driver
*********************************

The GPU DRA Driver is an addiitonal component that can be installed after you've installed the GPU Operator on your clsuter.
Member

I don't think it's strictly speaking required that the GPU Operator be installed first, but may make things simpler.

Also a typo:

Suggested change
The GPU DRA Driver is an addiitonal component that can be installed after you've installed the GPU Operator on your clsuter.
The GPU DRA Driver is an addiitonal component that can be installed after you've installed the GPU Operator on your k8s cluster.


- GH200 and GB200 GPUs with Mulit-Node NVLink connections between GPUs.

- Kubernetes v1.32 multi-node cluster with the DynamitcResourceAllocation feature gate enabled.
Member

Suggested change
- Kubernetes v1.32 multi-node cluster with the DynamitcResourceAllocation feature gate enabled.
- A Kubernetes v1.32 cluster with the `DynamitcResourceAllocation` feature gate enabled and the `resource.k8s.io` API group enabled.


- Kubernetes v1.32 multi-node cluster with the DynamitcResourceAllocation feature gate enabled.

The following is a sample for enabling DRA feature gates.
Member

Suggested change
The following is a sample for enabling DRA feature gates.
The following is a sample for enabling the required feature gates and API groups.

:language: yaml
:caption: Sample Kubeadm Init Config with DRA Feature Gates Enabled
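The sample itself is pulled in from ``kubeadm-init-config.yaml`` and is not shown inline here. Based on the lines quoted earlier in this review (+5 to +16) and the PR overview, the file presumably looks roughly like the following; the ``apiVersion`` values and the KubeletConfiguration stanza are assumptions:

.. code-block:: yaml

   apiVersion: kubeadm.k8s.io/v1beta4
   kind: ClusterConfiguration
   apiServer:
     extraArgs:
     # Enable the DRA feature gate and the resource.k8s.io API group on the API server.
     - name: "feature-gates"
       value: "DynamicResourceAllocation=true"
     - name: "runtime-config"
       value: "resource.k8s.io/v1alpha3=true"
   controllerManager:
     extraArgs:
     - name: "feature-gates"
       value: "DynamicResourceAllocation=true"
   scheduler:
     extraArgs:
     - name: "feature-gates"
       value: "DynamicResourceAllocation=true"
   ---
   # Assumed stanza: the kubelet must also enable the feature gate.
   apiVersion: kubelet.config.k8s.io/v1beta1
   kind: KubeletConfiguration
   featureGates:
     DynamicResourceAllocation: true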

- The NVIDIA GPU Operator v25.3.0 or later installed with CDI enabled on all nodes and NVIDIA GPU Driver 565 or later.
Member

The v565 driver is only for GH200 / GB200 systems where MNNVL is supported. For other systems older driver versions should be sufficient.

--version=${version} \
--set cdi.enabled=true

Note if you want to install the GPU DRA Driver using pre-installed drivers, you must install NVIDIA GPU Driver 565 or later, the corresponding IMEX packages on GPU nodes, and disable the IMEX systemd service before installing the GPU Operator.
Member

Suggested change
Note if you want to install the GPU DRA Driver using pre-installed drivers, you must install NVIDIA GPU Driver 565 or later, the corresponding IMEX packages on GPU nodes, and disable the IMEX systemd service before installing the GPU Operator.
Note if you want to install the NVIDIA GPU DRA Driver using pre-installed drivers, you must install NVIDIA GPU Driver 565 or later, the corresponding IMEX packages on GPU nodes, and disable the IMEX systemd service before installing the GPU Operator.


* - ``nvidiaDriverRoot``
- Specifies the driver root on the host.
For Operator managed drivers, use ``/run/nvidia/driver``.
Member

Suggested change
For Operator managed drivers, use ``/run/nvidia/driver``.
For Operator-managed drivers, use ``/run/nvidia/driver``.

- ``/``

* - ``nvidiaCtkPath``
- Specifies the path of The NVIDIA Container Tool Kit binary (nvidia-ctk) on the host, as it should appear in the the generated CDI specification.
Member

Suggested change
- Specifies the path of The NVIDIA Container Tool Kit binary (nvidia-ctk) on the host, as it should appear in the the generated CDI specification.
- Specifies the path of The NVIDIA Container Toolkit CLI binary (nvidia-ctk) on the host.

Member

The "generated CDI specification" is probably an implementation detail that should not be relevant in this case.

node3 nvidia.com/gpu.clique 1fbed3a8-bd74-4c83-afcb-cfb75ebc9304.1
node4 nvidia.com/gpu.clique 1fbed3a8-bd74-4c83-afcb-cfb75ebc9304.1

The GPU DRA Driver adds a Clique ID to each GPU node.
Member

The GPU DRA Driver does not add this label. This label is added by GPU Feature Discovery.
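Whichever component applies it, a quick way to inspect the clique label on nodes (a hedged sketch using standard kubectl flags):

.. code-block:: console

   # Show the clique label as an extra column for every node.
   $ kubectl get nodes -L nvidia.com/gpu.clique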

The NVIDIA GPU DRA Driver introduces a new custom resource called ComputeDomain, which creates a DRA ResourceClaimTemplate that you can reference in workloads.
The ComputeDomain resource also creates a unique ResourceClaim for each worker that links it back to the ComputeDomain where the ResourceClaimTemplate is defined.

If a subset of the nodes associated with a ComputeDomain are capable of communicating over IMEX, the NVIDIA Kubernetes DRA will set up a one-off IMEX domain to allow GPUs to communicate over their multi-node NVLink connections. Multiple IMEX domains will be created as necessary depending on the number (and availability) of nodes allocated to the ComputeDomain.
Member

Suggestion: IMEX -> Multi-Node NVLink (MNNVL)

- None

* - ``numNodes`` (required)
- Specifies the number of nodes in the IMEX domain.
Member

Suggested change
- Specifies the number of nodes in the IMEX domain.
- Specifies the number of nodes in the Compute Domain.
