adding dra docs #162
Conversation
PR Overview
This pull request introduces documentation-related changes for the DRA driver while also providing configuration manifests for enabling dynamic resource allocation and related compute domain functionality.
- Adds a kubeadm initialization configuration to enable the DynamicResourceAllocation feature.
- Introduces a ComputeDomain resource and a pod manifest for resource channel injection.
- Provides an additional ComputeDomain definition in a separate file.
Reviewed Changes
File | Description |
---|---|
gpu-operator/manifests/input/kubeadm-init-config.yaml | Defines kubeadm ClusterConfiguration and KubeletConfiguration with dynamic resource allocation settings. |
gpu-operator/manifests/input/imex-channel-injection.yaml | Contains ComputeDomain and Pod definitions for testing resource channel injection. |
gpu-operator/manifests/input/dra-compute-domain-crd.yaml | Provides an alternative ComputeDomain resource definition that duplicates functionality in imex-channel-injection.yaml. |
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Comments suppressed due to low confidence (1)
gpu-operator/manifests/input/dra-compute-domain-crd.yaml:1
- The ComputeDomain resource defined here duplicates the one in imex-channel-injection.yaml. Consider consolidating the definitions to avoid potential conflicts during deployment.
apiVersion: resource.nvidia.com/v1beta1
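As context for the files listed in the review table above: kubeadm-init-config.yaml enables the DynamicResourceAllocation feature gate across the control plane and kubelet. The file itself is not quoted in this thread, so the following is a minimal sketch only, assuming the kubeadm v1beta3 API and the upstream Kubernetes DRA feature-gate and runtime-config names:

# Sketch only: assumed shape of a kubeadm init config enabling DRA.
# API versions and the resource.k8s.io group version are assumptions,
# not quoted from the file under review.
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
    runtime-config: "resource.k8s.io/v1beta1=true"
controllerManager:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
scheduler:
  extraArgs:
    feature-gates: "DynamicResourceAllocation=true"
---
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  DynamicResourceAllocation: true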
gpu-operator/dra-driver.rst
Outdated
This page details more information about the GPU DRA Driver, including how to install it and examples of deploying workloads using IMEX channels.

*******************************
About the NVIDIA GPU DRA Driver
What is the "GPU DRA Driver" in this context? Is it the driver that is managing the lifecycle of compute domains or access to GPUs themselves?
What is the "GPU DRA Driver" in this context?
The "NVIDIA GPU DRA Driver" in this document always refers to the component developed in https://github.com/NVIDIA/k8s-dra-driver-gpu
Is it the driver that is managing the lifecycle of compute domains
in that sense: yes
As a general comment, we need to call out what that driver enables -- which is to use Multi-Node NVLink -- and not how this is enabled. The UX for ComputeDomains has been designed to explicitly make minimal references to IMEX since this is an implementation detail. It is required to enable MNNVL, but should not be something that a user is concerned with. I'll make another pass at reviewing this soon.
Pull Request Overview
This PR introduces draft manifests for the DRA driver documentation and related resources, outlining initial configurations and resource definitions.
- Added a kubeadm initialization configuration file with feature gate settings for dynamic resource allocation.
- Introduced a ComputeDomain resource manifest.
- Provided an injection manifest that combines a ComputeDomain definition with a Pod using the resource.
Reviewed Changes
Copilot reviewed 4 out of 7 changed files in this pull request and generated 1 comment.
File | Description |
---|---|
gpu-operator/manifests/input/kubeadm-init-config.yaml | New kubeadm init config with extraArgs for enabling dynamic resource allocation. |
gpu-operator/manifests/input/dra-compute-domain-crd.yaml | New ComputeDomain resource definition with basic specs. |
gpu-operator/manifests/input/imex-channel-injection.yaml | Injection manifest combining a ComputeDomain and a Pod definition. |
Files not reviewed (3)
- gpu-operator/index.rst: Language not supported
- gpu-operator/manifests/output/compute-domain-channel-injection-crd.txt: Language not supported
- gpu-operator/manifests/output/imex-logs.txt: Language not supported
Comments suppressed due to low confidence (1)
gpu-operator/manifests/input/imex-channel-injection.yaml:2
- The ComputeDomain resource is defined in both this file and in dra-compute-domain-crd.yaml. Confirm if this duplication is intentional or consider consolidating these definitions to avoid potential deployment conflicts.
apiVersion: resource.nvidia.com/v1beta1
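For context on the duplication question raised here: a ComputeDomain definition in the style of the upstream k8s-dra-driver-gpu examples is small, which makes it easy to define twice by accident. A minimal sketch, with names and values that are illustrative (following the upstream examples rather than the files under review):

# Sketch of a ComputeDomain; metadata and claim-template names are illustrative.
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0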
gpu-operator/dra-driver.rst
Outdated
An IMEX channel is a construct that allows a set of GPUs to directly read and write each other's memory over a high-bandwidth NVLink.
The NVLink connection may either be directly between GPUs on the same node or between GPUs on separate nodes connected by an NVSwitch.
Once an IMEX channel has been established for a set of GPUs, they are free to read and write each other's memory via extensions to the CUDA memory call APIs.
My hunch is that we should not go into this detail here but just link to reference documentation. It's very easy to say something wrong/misleading here.
@jgehrcke do you happen to know where that documentation lives? I have included a lot of this info because I was unable to find a place where we can link out to :)
Reference docs are here, for example: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__MALLOC__ASYNC.html#group__CUDA__MALLOC__ASYNC_1g8aa4c143dbc20293659cd883232b95f2
There is the beautiful statement of:

    When exporter and importer CUDA processes have been granted access to the same IMEX channel, they can securely share memory.

And that's already a better description because of the emphasis on "securely".
Maybe we can borrow that statement, and otherwise just link to refdocs.
--namespace nvidia-dra-driver-gpu \
--set nvidiaDriverRoot=/run/nvidia/driver \
--set nvidiaCtkPath=/usr/local/nvidia/toolkit/nvidia-ctk \
--set resources.gpus.enabled=false
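The first line of this command is cut off by the diff view. Reconstructed under assumptions, the full invocation presumably resembles the sketch below; the release name and --create-namespace flag are guesses, while the chart reference nvidia/nvidia-dra-driver-gpu does appear later in this review:

# Sketch: release name and --create-namespace are assumptions, not from the diff.
helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
  --create-namespace \
  --namespace nvidia-dra-driver-gpu \
  --set nvidiaDriverRoot=/run/nvidia/driver \
  --set nvidiaCtkPath=/usr/local/nvidia/toolkit/nvidia-ctk \
  --set resources.gpus.enabled=false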
Thank you for following this ❤️
gpu-operator/dra-driver.rst
Outdated
The NVIDIA DRA Driver for GPUs is an additional component you can install alongside the GPU Operator that enables you to use the Kubernetes Dynamic Resource Allocation (DRA) feature to support Multi-Node NVLink in NVIDIA HGX GB200 NVL GPUs.
This page details more information about installing the DRA Driver for GPUs and examples of deploying workloads utilizing Multi-Node NVLink with NVIDIA HGX GB200 NVL systems.

NVIDIA HGX GB200 NVL systems are designed specifically to leverage the use of IMEX channels to turn a rack of GPU machines, each with a small number of GPUs, into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.
Suggested change:

-NVIDIA HGX GB200 NVL systems are designed specifically to leverage the use of IMEX channels to turn a rack of GPU machines, each with a small number of GPUs, into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.
+NVIDIA HGX GB200 NVL systems are designed specifically to use Multi-Node NVLinks to turn a rack of GPU machines, each with a small number of GPUs, into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.

IMEX channels are a low-level implementation detail.
gpu-operator/dra-driver.rst
Outdated
About the NVIDIA DRA Driver for GPUs
************************************

The NVIDIA DRA Driver for GPUs leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA Multi-Node NVLink available in NVIDIA HGX GB200 NVL GPUs.
This seems to repeat what was said in the previous section. Was that the intent?
gpu-operator/dra-driver.rst
Outdated
************************************

The NVIDIA DRA Driver for GPUs leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA Multi-Node NVLink available in NVIDIA HGX GB200 NVL GPUs.
The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain where you can define your resource templates, and then reference the templates within your workload definitions.
Should we "highlight" ComputeDomain?
gpu-operator/dra-driver.rst
Outdated
************************************

The NVIDIA DRA Driver for GPUs leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA Multi-Node NVLink available in NVIDIA HGX GB200 NVL GPUs.
The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain where you can define your resource templates, and then reference the templates within your workload definitions.
The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain, which can be referenced in jobs that are expected to span multiple nodes. In the case of nodes connected using MNNVL, the required resources are automatically provisioned to allow a set of GPUs to directly read and write each other's memory over a high-bandwidth NVLink.
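To make "referenced in jobs" concrete: a workload consumes the ComputeDomain through a generated resource claim, roughly like the sketch below. The image, command, and all names are illustrative; the resourceClaims pod fields follow the upstream Kubernetes DRA API, and imex-channel-0 would be the claim template created by a ComputeDomain like the one sketched earlier in this thread.

# Sketch of a pod consuming an IMEX channel via a DRA resource claim.
# All names are illustrative, not quoted from the files under review.
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c", "ls -la /dev/nvidia-caps-imex-channels && sleep infinity"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0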
gpu-operator/dra-driver.rst
Outdated
The NVIDIA DRA Driver for GPUs leverages the Kubernetes Dynamic Resource Allocation (DRA) API to support NVIDIA Multi-Node NVLink available in NVIDIA HGX GB200 NVL GPUs.
The NVIDIA DRA Driver for GPUs introduces a Kubernetes custom resource named ComputeDomain where you can define your resource templates, and then reference the templates within your workload definitions.

A ComputeDomain creates and manages an IMEX channel, a construct that allows a set of GPUs to directly read and write each other's memory over a high-bandwidth NVLink.
Did we ever reach consensus on whether we want to actually mention "IMEX channels" here? (A compute domain can also be used on nodes that do not support MNNVL.)
gpu-operator/dra-driver.rst
Outdated
The NVIDIA DRA Driver for GPUs is an additional component you can install alongside the GPU Operator that enables you to use the Kubernetes Dynamic Resource Allocation (DRA) feature to support Multi-Node NVLink in NVIDIA HGX GB200 NVL GPUs.
This page details more information about installing the DRA Driver for GPUs and examples of deploying workloads utilizing Multi-Node NVLink with NVIDIA HGX GB200 NVL systems.

NVIDIA HGX GB200 NVL systems are designed specifically to leverage the use of IMEX channels to turn a rack of GPU machines, each with a small number of GPUs, into a giant supercomputer with up to 72 GPUs communicating at full NVLink bandwidth.
Should we expand IMEX once before we use this acronym everywhere else?
I believe IMEX stands for Internode Memory Exchange
Signed-off-by: Abigail McCarthy <[email protected]>
Co-authored-by: Tariq <[email protected]>
gpu-operator/manifests/output/compute-domain-channel-injection-crd.txt
Outdated
Signed-off-by: Abigail McCarthy <[email protected]>
Signed-off-by: Abigail McCarthy <[email protected]>
Signed-off-by: Abigail McCarthy <[email protected]>
Signed-off-by: Abigail McCarthy <[email protected]>
Thank you @a-mccarthy! I looked at the last two commits only and added just a small number of comments.
After more quick/diagonal reading, I noticed that more changes are required. Maybe it is useful when I add a commit to this branch -- WDYT?
gpu-operator/dra-driver.rst
Outdated
-Multi-Node NVLink support with the NVIDIA DRA Driver for GPUs
-#############################################################
+##################################################
+NVIDIA Dynamic Resource Allocation Driver for GPUs
Let's still call this "NVIDIA DRA Driver for GPUs"
@@ -123,7 +141,8 @@ To view all the options, run ``helm show values nvidia/nvidia-dra-driver-gpu``.

 * - ``resources.gpus.enabled``
   - Specifies whether to enable the NVIDIA DRA Driver for GPUs to manage GPU resource allocation.
-    This feature is not yet supported and you must set to ``false``.
+    This feature is in Technolody Preview and only recommended for testing, not production enviroments.
technology :)
environments
@@ -123,7 +141,8 @@ To view all the options, run ``helm show values nvidia/nvidia-dra-driver-gpu``.

 * - ``resources.gpus.enabled``
   - Specifies whether to enable the NVIDIA DRA Driver for GPUs to manage GPU resource allocation.
-    This feature is not yet supported and you must set to ``false``.
+    This feature is in Technolody Preview and only recommended for testing, not production enviroments.
+    To use with MNNVL, set to ``false``.
Let's remove this sentence.
(it's not correct: the driver has two components: the CD side of things, and the GPU side of things -- they can both be enabled independently).
gpu-operator/dra-driver.rst
Outdated
Before continuing, you should be familiar with the components of the `Kubernetes DRA feature <https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/>`_.
I think this can be the first sentence in this section, and it should be the sentence that introduces the DRA abbreviation.
Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
This PR updates the docs for the new DRA driver component.
links: