adding dra docs #162

Open · wants to merge 6 commits into main

Conversation

@a-mccarthy (Collaborator) commented Mar 6, 2025

@a-mccarthy marked this pull request as draft March 6, 2025 19:58

github-actions bot commented Mar 6, 2025

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-162

@ArangoGutierrez requested a review from Copilot March 7, 2025 15:29
Copilot AI left a comment

PR Overview

This pull request introduces documentation-related changes for the DRA driver while also providing configuration manifests for enabling dynamic resource allocation and related compute domain functionality.

  • Adds a kubeadm initialization configuration to enable the DynamicResourceAllocation feature (see the sketch after this list).
  • Introduces a ComputeDomain resource and a pod manifest for resource channel injection.
  • Provides an additional ComputeDomain definition in a separate file.
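
For orientation, a minimal sketch of what such a kubeadm configuration typically looks like is shown below. This is an illustration only, not the manifest reviewed in this PR; the exact apiVersion and feature-gate spelling depend on the kubeadm and Kubernetes releases in use.

# Hypothetical sketch -- not the file from this PR.
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
apiServer:
  extraArgs:
    # Enable the DRA feature gate and its API group on the API server.
    - name: feature-gates
      value: DynamicResourceAllocation=true
    - name: runtime-config
      value: resource.k8s.io/v1beta1=true
controllerManager:
  extraArgs:
    - name: feature-gates
      value: DynamicResourceAllocation=true
scheduler:
  extraArgs:
    - name: feature-gates
      value: DynamicResourceAllocation=true
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# The kubelet must also opt in to dynamic resource allocation.
featureGates:
  DynamicResourceAllocation: true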

Reviewed Changes

  • gpu-operator/manifests/input/kubeadm-init-config.yaml: Defines kubeadm ClusterConfiguration and KubeletConfiguration with dynamic resource allocation settings.
  • gpu-operator/manifests/input/imex-channel-injection.yaml: Contains ComputeDomain and Pod definitions for testing resource channel injection.
  • gpu-operator/manifests/input/dra-compute-domain-crd.yaml: Provides an alternative ComputeDomain resource definition that duplicates functionality in imex-channel-injection.yaml.

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (1)

gpu-operator/manifests/input/dra-compute-domain-crd.yaml:1

  • The ComputeDomain resource defined here duplicates the one in imex-channel-injection.yaml. Consider consolidating the definitions to avoid potential conflicts during deployment.
apiVersion: resource.nvidia.com/v1beta1

@a-mccarthy requested a review from guptaNswati March 10, 2025 16:29
This page provides more information about the GPU DRA Driver, including how to install it and examples of deploying workloads using IMEX channels.

*******************************
About the NVIDIA GPU DRA Driver
Member left a comment

What is the "GPU DRA Driver" in this context? Is it the driver that is managing the lifecycle of compute domains or access to GPUs themselves?

Contributor replied

What is the "GPU DRA Driver" in this context?

The "NVIDIA GPU DRA Driver" in this document always refers to the component developed in https://github.com/NVIDIA/k8s-dra-driver-gpu

Is it the driver that is managing the lifecycle of compute domains

in that sense: yes

@elezar (Member) commented Mar 21, 2025

As a general comment, we need to call out what that driver enables -- which is to use Multi-Node NVLink -- and not how this is enabled. The UX for ComputeDomains has been designed to explicitly make minimal references to IMEX since this is an implementation detail. It is required to enable MNNVL, but should not be something that a user is concerned with.

I'll make another pass at reviewing this soon.

@ArangoGutierrez requested a review from Copilot March 21, 2025 13:50
Copilot AI left a comment

Pull Request Overview

This PR introduces draft manifests for the DRA driver documentation and related resources, outlining initial configurations and resource definitions.

  • Added a kubeadm initialization configuration file with feature gate settings for dynamic resource allocation.
  • Introduced a ComputeDomain resource manifest.
  • Provided an injection manifest that combines a ComputeDomain definition with a Pod using the resource (see the sketch after this list).
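
For illustration, the Pod side of such an injection manifest wires a pod-level resource claim to a container via the standard Kubernetes DRA fields. The sketch below is hypothetical: the image, command, and the imex-channel-0 template name are assumed from the upstream k8s-dra-driver-gpu example, not copied from this PR.

# Hypothetical sketch of the Pod portion of an injection manifest.
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  containers:
  - name: ctr
    image: ubuntu:24.04
    # Show the injected IMEX channel device, then keep the pod alive.
    command: ["bash", "-c", "ls -la /dev/nvidia-caps-imex-channels && sleep 3600"]
    resources:
      claims:
      # Reference the pod-level claim declared below.
      - name: imex-channel
  resourceClaims:
  # Instantiate a ResourceClaim from the template that the ComputeDomain creates.
  - name: imex-channel
    resourceClaimTemplateName: imex-channel-0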

Reviewed Changes

Copilot reviewed 4 out of 7 changed files in this pull request and generated 1 comment.

  • gpu-operator/manifests/input/kubeadm-init-config.yaml: New kubeadm init config with extraArgs for enabling dynamic resource allocation.
  • gpu-operator/manifests/input/dra-compute-domain-crd.yaml: New ComputeDomain resource definition with basic specs.
  • gpu-operator/manifests/input/imex-channel-injection.yaml: Injection manifest combining a ComputeDomain and a Pod definition.
Files not reviewed (3)
  • gpu-operator/index.rst: Language not supported
  • gpu-operator/manifests/output/compute-domain-channel-injection-crd.txt: Language not supported
  • gpu-operator/manifests/output/imex-logs.txt: Language not supported
Comments suppressed due to low confidence (1)

gpu-operator/manifests/input/imex-channel-injection.yaml:2

  • The ComputeDomain resource is defined in both this file and in dra-compute-domain-crd.yaml. Confirm if this duplication is intentional or consider consolidating these definitions to avoid potential deployment conflicts.
apiVersion: resource.nvidia.com/v1beta1


An IMEX channel is a construct that allows a set of GPUs to directly read and write each other's memory over a high-bandwidth NVLink.
The NVLink connection may either be directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch.
Once an IMEX channel has been established for a set of GPUs, they are free to read and write each other's memory via extensions to the CUDA memory call APIs.
@jgehrcke (Contributor) commented Mar 21, 2025

My hunch is that we should not go into this detail here but just link to reference documentation. It's very easy to say something wrong/misleading here.

Collaborator (PR author) replied

@jgehrcke do you happen to know where that documentation lives? I have included a lot of this info b/c I was unable to find a place where we can link out to :)

@jgehrcke (Contributor) replied Mar 21, 2025

Reference docs are here, for example: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MALLOC__ASYNC.html#group__CUDA__MALLOC__ASYNC_1g8aa4c143dbc20293659cd883232b95f2

There is the beautiful statement of

When exporter and importer CUDA processes have been granted access to the same IMEX channel, they can securely share memory.

And that's already a better description because of the emphasis on securely.

Maybe we can borrow that statement, and otherwise just link to refdocs.

--namespace nvidia-dra-driver-gpu \
--set nvidiaDriverRoot=/run/nvidia/driver \
--set nvidiaCtkPath=/usr/local/nvidia/toolkit/nvidia-ctk \
--set resources.gpus.enabled=false
Contributor left a comment

Thank you for following this ❤️
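
As a side note on the flags in the hunk above: with Helm, the same settings can equivalently be kept in a values file and passed with -f, which is easier to review than repeated --set flags. A hypothetical equivalent:

# values.yaml -- hypothetical equivalent of the --set flags quoted above
nvidiaDriverRoot: /run/nvidia/driver
nvidiaCtkPath: /usr/local/nvidia/toolkit/nvidia-ctk
resources:
  gpus:
    enabled: false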

@a-mccarthy marked this pull request as ready for review March 24, 2025 16:22
The NVIDIA DRA Driver for GPUs is an additional component you can install alongside the GPU Operator that enables you to use the Kubernetes Dynamic Resource Allocation (DRA) feature to support Multi-Node NVLink in NVIDIA HGX GB200 NVL GPUs.
This page provides more information about installing the DRA Driver for GPUs and examples of deploying workloads utilizing Multi-Node NVLink with NVIDIA HGX GB200 NVL systems.

NVIDIA HGX GB200 NVL systems are designed specifically to leverage the use of IMEX channels to turn a rack of GPU machines, each with a small number of GPUs, into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.
Member left a comment

Suggested change
NVIDIA HGX GB200 NVL systems are designed specifically to leverage the use of IMEX channels to turn a rack of GPU machines, each with a small number of GPUs, into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.
NVIDIA HGX GB200 NVL systems are designed specifically to use Multi-Node NVLinks to turn a rack of GPU machines, each with a small number of GPUs, into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.

IMEX channels are a low-level implementation detail.

About the NVIDIA DRA Driver for GPUs
************************************

The NVIDIA DRA Driver for GPUs leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA Multi-Node NVLink available in NVIDIA HGX GB200 NVL GPUs.
Member left a comment

This seems to repeat what was said in the previous section. Was that the intent?

************************************

The NVIDIA DRA Driver for GPUs leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA Multi-Node NVLink available in NVIDIA HGX GB200 NVL GPUs.
The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain where you can define your resource templates, and then reference the templates within your workload definitions.
Member left a comment

Should we "highlight" ComputeDomain?

************************************

The NVIDIA DRA Driver for GPUs leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA Multi-Node NVLink available in NVIDIA HGX GB200 NVL GPUs.
The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain where you can define your resource templates, and then reference the templates within your workload definitions.
Member left a comment

The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain which can be referenced in jobs that are expected to span multiple nodes. In the case of nodes connected using MNNVL, the required resources are automatically provisioned to allow a set of GPUs to directly read and write each other's memory over a high-bandwidth NVLink.

The NVIDIA DRA Driver for GPUs leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA Multi-Node NVLink available in NVIDIA HGX GB200 NVL GPUs.
The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain where you can define your resource templates, and then reference the templates within your workload definitions.

A ComputeDomain creates and manages an IMEX channel, a construct that allows a set of GPUs to directly read and write each other's memory over a high-bandwidth NVLink.
Member left a comment

Did we ever reach consensus on whether we want to actually mention "IMEX channels" here? (A compute domain can also be used on nodes that do not support MNNVL.)
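
For readers following this thread, a minimal ComputeDomain sketch is shown below. The apiVersion matches the hunks quoted earlier in this review; numNodes and the channel resource claim template field are assumed from the upstream k8s-dra-driver-gpu examples, not copied from this PR.

apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: compute-domain-example
spec:
  # Number of nodes expected to join this compute domain.
  numNodes: 1
  channel:
    # Template from which per-pod resource claims are created.
    resourceClaimTemplate:
      name: compute-domain-channel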

The NVIDIA DRA Driver for GPUs is an additional component you can install alongside the GPU Operator that enables you to use the Kubernetes Dynamic Resource Allocation (DRA) feature to support Multi-Node NVLink in NVIDIA HGX GB200 NVL GPUs.
This page provides more information about installing the DRA Driver for GPUs and examples of deploying workloads utilizing Multi-Node NVLink with NVIDIA HGX GB200 NVL systems.

NVIDIA HGX GB200 NVL systems are designed specifically to leverage the use of IMEX channels to turn a rack of GPU machines, each with a small number of GPUs, into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.
Contributor left a comment

Should we expand IMEX once before we use this acronym everywhere else?

I believe IMEX stands for Internode Memory Exchange

Signed-off-by: Abigail McCarthy <[email protected]>

Co-authored-by: Tariq <[email protected]>
Signed-off-by: Abigail McCarthy <[email protected]>
Signed-off-by: Abigail McCarthy <[email protected]>
Signed-off-by: Abigail McCarthy <[email protected]>
@a-mccarthy requested a review from jgehrcke May 29, 2025 10:04
@a-mccarthy (Collaborator, PR author) commented

@klueska and @jgehrcke, I pushed some updates to this PR, please review if you have a chance. I still need to add more content around the GPU resource allocation use case, but I'd like to get some feedback on the page structure and whether the rest of the content for MNNVL support makes sense.

@jgehrcke (Contributor) left a comment

Thank you @a-mccarthy! I looked at the last two commits only and added just a small number of comments.

After more quick/diagonal reading, I noticed that more changes are required. Maybe it would be useful if I add a commit to this branch -- WDYT?

Multi-Node NVLink support with the NVIDIA DRA Driver for GPUs
#############################################################
##################################################
NVIDIA Dynamic Resource Allocation Driver for GPUs
Contributor left a comment

Let's still call this "NVIDIA DRA Driver for GPUs"

@@ -123,7 +141,8 @@ To view all the options, run ``helm show values nvidia/nvidia-dra-driver-gpu``.

* - ``resources.gpus.enabled``
- Specifies whether to enable the NVIDIA DRA Driver for GPUs to manage GPU resource allocation.
This feature is not yet supported and you must set to ``false``.
This feature is in Technolody Preview and only recommended for testing, not production enviroments.
Contributor left a comment

technology :)
environments

@@ -123,7 +141,8 @@ To view all the options, run ``helm show values nvidia/nvidia-dra-driver-gpu``.

* - ``resources.gpus.enabled``
- Specifies whether to enable the NVIDIA DRA Driver for GPUs to manage GPU resource allocation.
This feature is not yet supported and you must set to ``false``.
This feature is in Technolody Preview and only recommended for testing, not production enviroments.
To use with MNNVL, set to ``false``.
Contributor left a comment

Let's remove this sentence.

(it's not correct: the driver has two components: the CD side of things, and the GPU side of things -- they can both be enabled independently).


Before continuing, you should be familiar with the components of the `Kubernetes DRA feature <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/>`_.
Contributor left a comment

I think this can be the first sentence in this section, and it should be the sentence that introduces the DRA abbreviation.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>