diff --git a/gpu-operator/dra-cds.rst b/gpu-operator/dra-cds.rst
new file mode 100644
index 000000000..0738426aa
--- /dev/null
+++ b/gpu-operator/dra-cds.rst
@@ -0,0 +1,232 @@
+.. license-header
+    SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    SPDX-License-Identifier: Apache-2.0
+
+##########################
+NVIDIA DRA Driver for GPUs
+##########################
+
+.. _dra_docs_compute_domains:
+
+********************************************
+ComputeDomains: Multi-Node NVLink simplified
+********************************************
+
+Motivation
+==========
+
+NVIDIA's `GB200 NVL72 `_ and comparable systems are designed specifically around Multi-Node NVLink (`MNNVL `_) to turn a rack of GPU machines -- each with a small number of GPUs -- into a supercomputer with a large number of GPUs communicating at high bandwidth (1.8 TB/s chip-to-chip, and over `130 TB/s cumulative bandwidth `_ on a GB200 NVL72).
+
+NVIDIA's DRA Driver for GPUs enables MNNVL for Kubernetes workloads by introducing a new concept -- the **ComputeDomain**:
+when a workload requests a ComputeDomain, NVIDIA's DRA Driver for GPUs performs all the heavy lifting required for sharing GPU memory securely via NVLink among all pods that comprise the workload.
+
+.. note::
+
+   Users may appreciate knowing that -- under the hood -- NVIDIA Internode Memory Exchange (`IMEX `_) primitives need to be orchestrated for mapping GPU memory over NVLink.
+
+   A design goal of this DRA driver is to make IMEX, as much as possible, an implementation detail that workload authors and cluster operators do not need to be concerned with: the driver launches and/or reconfigures IMEX daemons and establishes and injects IMEX channels into containers as needed.
+
+
+.. _dra-docs-cd-guarantees:
+
+Guarantees
+==========
+
+By design, an individual ComputeDomain guarantees:
+
+#. **MNNVL-reachability** between pods that are in the domain.
+#. **secure isolation** from pods that are not in the domain and that run in a different Kubernetes namespace.
+
+In terms of lifetime, a ComputeDomain is ephemeral: its lifetime is bound to the lifetime of the consuming workload.
+In terms of placement, our design choice is that a ComputeDomain follows the workload.
+
+That means: once workload pods that request a ComputeDomain get scheduled onto specific nodes, the domain automatically forms around them.
+Upon workload completion, all ComputeDomain-associated resources get torn down automatically.
+
+For more detail on the security properties of a ComputeDomain, see `Security `__.
+
+
+A deeper dive: related resources
+================================
+
+For more background on how ComputeDomains facilitate orchestrating MNNVL workloads on Kubernetes, see `this doc `_ and `this slide deck `_.
+For an outlook on planned improvements to the ComputeDomain concept, refer to `this document `_.
+
+Details about IMEX and its relationship to NVLink can be found in `NVIDIA's IMEX guide `_ and in `NVIDIA's NVLink guide `_.
+CUDA API documentation for `cuMemCreate `_ provides a starting point for learning how to share GPU memory via IMEX/NVLink.
+If you are looking for a higher-level GPU communication library, `NVIDIA's NCCL `_ supports MNNVL in versions newer than 2.25.
+
+
+Usage example: a multi-node nvbandwidth test
+============================================
+
+This example demonstrates how to run an MNNVL workload across multiple nodes using a ComputeDomain (CD).
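+
+At its core, the job manifest used in this example pairs a ``ComputeDomain`` with an MPIJob whose worker pods reference the domain's resource claim template.
+The following is a minimal sketch of the ``ComputeDomain`` portion only; the field layout follows the ``resource.nvidia.com/v1beta1`` API used throughout this documentation, the claim template name is illustrative, and the complete manifest (including the MPIJob) comes from the validation instructions linked in the steps below:
+
+.. code-block:: yaml
+
+   apiVersion: resource.nvidia.com/v1beta1
+   kind: ComputeDomain
+   metadata:
+     name: nvbandwidth-test-compute-domain
+   spec:
+     # Total number of nodes participating in the domain (see the parameter table below).
+     numNodes: 2
+     channel:
+       resourceClaimTemplate:
+         # Name of the resource claim template that consuming pods reference (illustrative).
+         name: nvbandwidth-test-compute-domain-channel
+
+Each MPIJob worker pod then references that claim template via ``spec.resourceClaims`` (using ``resourceClaimTemplateName``) and ``containers[].resources.claims``; through this claim, the DRA driver injects the IMEX channel into the worker containers.
+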
+As example CUDA workload that performs MNNVL communication, we have picked `nvbandwidth `_. +Since nvbandwidth requires MPI, below we also install the `Kubeflow MPI Operator `_. + +**Steps:** + +#. Install the MPI Operator. + + .. code-block:: console + + $ kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml + +#. Create a test job file called ``nvbandwidth-test-job.yaml``. + To do that, follow `this part of the CD validation instructions `_. + This example is configured to run across two nodes, using four GPUs per node. + If you want to use different numbers, please adjust the parameters in the spec according to the table below: + + .. list-table:: + :header-rows: 1 + + * - Parameter + - Value (in example) + + * - ``ComputeDomain.spec.numNodes`` + - Total number of nodes to use in the test (2). + + * - ``MPIJob.spec.slotsPerWorker`` + - Number of GPUs per node to use -- this must match the ``ppr`` number below (4). + + * - ``MPIJob.spec.mpiReplicaSpecs.Worker.replicas`` + - Also set this to the number of nodes (2). + + * - ``mpirun`` command argument ``-ppr:4:node`` + - Set this to the number of GPUs to use per node (4) + + * - ``mpirun`` command argument ``-np`` value + - Set this to the total number of GPUs in the test (8). + +#. Apply the manifest. + + .. code-block:: console + + $ kubectl apply -f nvbandwidth-test-job.yaml + + *Example Output* + + .. code-block:: output + + computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain configured + mpijob.kubeflow.org/nvbandwidth-test configured + +#. Verify that the nvbandwidth pods were created. + + .. code-block:: console + + $ kubectl get pods + + *Example Output* + + .. code-block:: output + + NAME READY STATUS RESTARTS AGE + nvbandwidth-test-launcher-lzv84 1/1 Running 0 8s + nvbandwidth-test-worker-0 1/1 Running 0 15s + nvbandwidth-test-worker-1 1/1 Running 0 15s + + +#. Verify that the ComputeDomain pods were created for each node. + + .. code-block:: console + + $ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain + + *Example Output* + + .. code-block:: output + + NAME READY STATUS RESTARTS AGE + nvbandwidth-test-compute-domain-ht24d-9jhmj 1/1 Running 0 20s + nvbandwidth-test-compute-domain-ht24d-rcn2c 1/1 Running 0 20s + +#. Verify the nvbandwidth test output. + + .. code-block:: console + + $ kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher + + *Example Output* + + .. code-block:: output + + Warning: Permanently added '[nvbandwidth-test-worker-0.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts. + Warning: Permanently added '[nvbandwidth-test-worker-1.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts. + [nvbandwidth-test-worker-0:00025] MCW rank 0 bound to socket 0[core 0[hwt 0]]: + + [...] + + [nvbandwidth-test-worker-1:00025] MCW rank 7 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.] 
+      nvbandwidth Version: v0.7
+      Built from Git version: v0.7
+
+      MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
+      CUDA Runtime Version: 12080
+      CUDA Driver Version: 12080
+      Driver Version: 570.124.06
+
+      Process 0 (nvbandwidth-test-worker-0): device 0: HGX GB200 (00000008:01:00)
+      Process 1 (nvbandwidth-test-worker-0): device 1: HGX GB200 (00000009:01:00)
+      Process 2 (nvbandwidth-test-worker-0): device 2: HGX GB200 (00000018:01:00)
+      Process 3 (nvbandwidth-test-worker-0): device 3: HGX GB200 (00000019:01:00)
+      Process 4 (nvbandwidth-test-worker-1): device 0: HGX GB200 (00000008:01:00)
+      Process 5 (nvbandwidth-test-worker-1): device 1: HGX GB200 (00000009:01:00)
+      Process 6 (nvbandwidth-test-worker-1): device 2: HGX GB200 (00000018:01:00)
+      Process 7 (nvbandwidth-test-worker-1): device 3: HGX GB200 (00000019:01:00)
+
+      Running multinode_device_to_device_memcpy_read_ce.
+      memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
+                0         1         2         3         4         5         6         7
+      0       N/A    798.02    798.25    798.02    798.02    797.88    797.73    797.95
+      1    798.10       N/A    797.80    798.02    798.02    798.25    797.88    798.02
+      2    797.95    797.95       N/A    797.73    797.80    797.95    797.95    797.65
+      3    798.10    798.02    797.95       N/A    798.02    798.10    797.88    797.73
+      4    797.80    798.02    798.02    798.02       N/A    797.95    797.80    798.02
+      5    797.80    797.95    798.10    798.10    797.95       N/A    797.95    797.88
+      6    797.73    797.95    798.10    798.02    797.95    797.88       N/A    797.80
+      7    797.88    798.02    797.95    798.02    797.88    797.95    798.02       N/A
+
+      SUM multinode_device_to_device_memcpy_read_ce 44685.29
+
+      NOTE: The reported results may not reflect the full capabilities of the platform.
+
+#. Clean up.
+
+   .. code-block:: console
+
+      $ kubectl delete -f nvbandwidth-test-job.yaml
+
+.. _dra-docs-cd-security:
+
+Security
+========
+
+As indicated in `Guarantees `__, the ComputeDomain primitive provides a *security boundary*. That deserves a few clarifying remarks.
+
+NVLink enables mapping remote GPU memory so that it can be read from and written to with regular CUDA API calls (as if it were normal, local GPU memory).
+From a security point of view, that raises the question: can any GPU in the same NVLink partition freely read and mutate another GPU's memory, or is there an authorization layer in between?
+There is indeed such a layer:
+IMEX has been introduced specifically as a means for providing secure isolation between GPUs that are in the same NVLink partition.
+With IMEX, every individual GPU memory export/import operation can be subject to fine-grained access control.
+
+With the following two additional constraints, we can now better understand the security guarantee provided by ComputeDomains:
+
+- The ComputeDomain security boundary is implemented with IMEX.
+- A job submitted to Kubernetes namespace `A` cannot be part of a ComputeDomain created for namespace `B`.
+
+That is, ComputeDomains (only) promise robust IMEX-based isolation between jobs that are **not** part of the same Kubernetes namespace.
+If a bad actor has access to a Kubernetes namespace, they may be able to mutate ComputeDomains (and, with them, IMEX primitives) in that namespace.
+That, in turn, may allow them to disable or trivially work around IMEX access control.
+
+With ComputeDomains, the overall ambition is that the security isolation between jobs in different Kubernetes namespaces is strong enough to responsibly allow for multi-tenant environments in which compute jobs that conceptually cannot trust each other are "only" separated by the Kubernetes namespace boundary.
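+
+One way to see these IMEX primitives at work is to inspect a workload container that is part of a ComputeDomain: the DRA driver injects an IMEX channel device into it.
+As a quick check -- shown here for a worker pod name from the nvbandwidth example above; the exact device file names may differ on your system -- list the injected channel devices:
+
+.. code-block:: console
+
+   $ kubectl exec nvbandwidth-test-worker-0 -- ls -la /dev/nvidia-caps-imex-channels
+
+The listing is expected to show one or more ``channel*`` device files; these back the access-controlled GPU memory export/import operations described above.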
+
+
+Additional remarks
+==================
+
+We are planning to extend the ComputeDomain documentation, with a focus on API reference material, known limitations, best practices, and security.
+
+As we iterate on design and implementation, we are particularly interested in your feedback -- please reach out via the issue tracker or discussion forum in the `GitHub repository `_.
diff --git a/gpu-operator/dra-gpus.rst b/gpu-operator/dra-gpus.rst
new file mode 100644
index 000000000..e44178216
--- /dev/null
+++ b/gpu-operator/dra-gpus.rst
@@ -0,0 +1,33 @@
+.. license-header
+    SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    SPDX-License-Identifier: Apache-2.0
+
+##########################
+NVIDIA DRA Driver for GPUs
+##########################
+
+.. _dra_docs_gpus:
+
+**************
+GPU allocation
+**************
+
+Compared to `traditional GPU allocation `_ using coarse-grained count-based requests, the GPU allocation side of this driver enables fine-grained control and powerful features long desired by the community, such as:
+
+#. Controlled sharing of individual GPUs between multiple pods and/or containers.
+#. GPU selection via complex constraints expressed via `CEL `_.
+#. Dynamic partitioning.
+
+To learn more about this part of the driver and about what we are planning to build in the future, have a look at `these release notes `_.
+
+While the GPU allocation features of this driver can be tried out, they are not yet officially supported.
+Hence, the GPU kubelet plugin is currently disabled by default in the Helm chart installation.
+
+For documentation on how to use and test the current set of GPU allocation features, please head over to the `demo section `_ of the driver's README and to its `quickstart directory `_.
+
+.. note::
+   This part of the NVIDIA DRA Driver for GPUs is in **Technology Preview**.
+   It is not yet supported in production environments and is not functionally complete.
+   Technology Preview features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
+   These releases may not have full documentation, and testing is limited.
+
diff --git a/gpu-operator/dra-intro-install.rst b/gpu-operator/dra-intro-install.rst
new file mode 100644
index 000000000..805d82cf2
--- /dev/null
+++ b/gpu-operator/dra-intro-install.rst
@@ -0,0 +1,110 @@
+.. license-header
+    SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+    SPDX-License-Identifier: Apache-2.0
+
+##########################
+NVIDIA DRA Driver for GPUs
+##########################
+
+************
+Introduction
+************
+
+With NVIDIA's DRA Driver for GPUs, your Kubernetes workload can allocate and consume the following two types of resources:
+
+* **GPUs**: for controlled sharing and dynamic reconfiguration of GPUs. A modern replacement for the traditional GPU allocation method (using `NVIDIA's device plugin `_). We are excited about this part of the driver; however, it is not yet fully supported (Technology Preview).
+* **ComputeDomains**: for robust and secure Multi-Node NVLink (MNNVL) for NVIDIA GB200 and similar systems. Fully supported.
+
+A primer on DRA
+===============
+
+Dynamic Resource Allocation (DRA) is a novel concept in Kubernetes for flexibly requesting, configuring, and sharing specialized devices like GPUs.
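+
+For readers who have not used DRA before, the following sketch shows the general shape of a claim-based device request.
+It is stock Kubernetes DRA rather than anything specific to this driver; the resource names and the ``gpu.nvidia.com`` device class name are illustrative assumptions:
+
+.. code-block:: yaml
+
+   apiVersion: resource.k8s.io/v1beta1
+   kind: ResourceClaimTemplate
+   metadata:
+     name: single-gpu
+   spec:
+     spec:
+       devices:
+         requests:
+         - name: gpu
+           # Device class advertised by a DRA driver; illustrative name.
+           deviceClassName: gpu.nvidia.com
+   ---
+   apiVersion: v1
+   kind: Pod
+   metadata:
+     name: gpu-consumer
+   spec:
+     containers:
+     - name: ctr
+       image: ubuntu:22.04
+       command: ["nvidia-smi"]
+       resources:
+         claims:
+         - name: gpu
+     # The pod-level claim is generated from the template above.
+     resourceClaims:
+     - name: gpu
+       resourceClaimTemplateName: single-gpu
+
+The pod references a claim by name; the scheduler and the vendor's DRA driver together decide which concrete device satisfies the request and how it is prepared for the container.
+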
+DRA puts device configuration and scheduling into the hands of device vendors via drivers like this one.
+For NVIDIA devices, DRA provides two particularly beneficial characteristics:
+
+#. A clean way to allocate **cross-node resources** in Kubernetes (leveraged here for providing NVLink connectivity across pods running on multiple nodes).
+#. Mechanisms to explicitly **share, partition, and reconfigure** devices **on-the-fly** based on user requests (leveraged here for advanced GPU allocation).
+
+To understand and make best use of NVIDIA's DRA Driver for GPUs, we recommend becoming familiar with DRA by working through the `official documentation `_.
+
+
+The twofold nature of this driver
+=================================
+
+NVIDIA's DRA Driver for GPUs consists of two subsystems that are largely independent of each other: one manages GPUs, and the other manages ComputeDomains.
+
+Below, you can find instructions for installing both parts or just one of them.
+Additionally, we have prepared two separate documentation chapters that provide more in-depth information on each of the two subsystems:
+
+- :ref:`Documentation for ComputeDomain (MNNVL) support `
+- :ref:`Documentation for GPU support `
+
+
+************
+Installation
+************
+
+Prerequisites
+=============
+
+- Kubernetes v1.32 or newer.
+- DRA and the corresponding API groups must be enabled (`see Kubernetes docs `_).
+- GPU Driver 565 or later.
+- NVIDIA's GPU Operator v25.3.0 or later, installed with CDI enabled (use the ``--set cdi.enabled=true`` command-line argument during ``helm install``). For reference, see the GPU Operator `installation documentation `__.
+
+..
+   For convenience, the following example shows how to enable CDI upon GPU Operator installation:
+
+   .. code-block:: console
+
+      $ helm install --wait --generate-name \
+           -n gpu-operator --create-namespace \
+           nvidia/gpu-operator \
+           --version=${version} \
+           --set cdi.enabled=true
+
+.. note::
+
+   If you want to use ComputeDomains and a pre-installed NVIDIA GPU Driver:
+
+   - Make sure that the corresponding ``nvidia-imex-*`` packages are installed.
+   - Disable the IMEX systemd service before installing the GPU Operator.
+   - Refer to the `docs on installing the GPU Operator with a pre-installed GPU driver `__.
+
+
+Configure and Helm-install the driver
+=====================================
+
+#. Add the NVIDIA Helm repository:
+
+   .. code-block:: console
+
+      $ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
+          && helm repo update
+
+#. Install the driver, providing install-time configuration parameters. Example:
+
+   .. code-block:: console
+
+      $ helm install nvidia-dra-driver-gpu nvidia/nvidia-dra-driver-gpu \
+          --version="25.3.0" \
+          --create-namespace \
+          --namespace nvidia-dra-driver-gpu \
+          --set nvidiaDriverRoot=/run/nvidia/driver \
+          --set resources.gpus.enabled=false
+
+All install-time configuration parameters can be listed by running ``helm show values nvidia/nvidia-dra-driver-gpu``.
+
+.. note::
+
+   - A common mode of operation for now is to enable only the ComputeDomain subsystem (to have GPUs allocated using the traditional device plugin). The example above achieves that by setting ``resources.gpus.enabled=false``.
+   - Setting ``nvidiaDriverRoot=/run/nvidia/driver`` above expects a GPU Operator-provided GPU driver. This parameter must be changed if the GPU driver is installed directly on the host (typically rooted at ``/``, which is the default value for ``nvidiaDriverRoot``).
+ - In a future release, NVIDIA's DRA Driver for GPUs will be bundled with the NVIDIA GPU Operator (and does not need to be installed as a separate Helm chart anymore). + + +Validate installation +===================== + +We recommend to perform validation steps to confirm that your setup works as expected. +To that end, we have prepared separate documentation: + +- `Testing ComputeDomain allocation `_ +- [TODO] Testing GPU allocation diff --git a/gpu-operator/index.rst b/gpu-operator/index.rst index 678d1b7f9..4455331c9 100644 --- a/gpu-operator/index.rst +++ b/gpu-operator/index.rst @@ -75,4 +75,13 @@ Azure AKS Google GKE +.. toctree:: + :caption: NVIDIA DRA Driver for GPUs + :titlesonly: + :hidden: + + Introduction & Installation + GPUs + ComputeDomains + .. include:: overview.rst diff --git a/gpu-operator/manifests/input/dra-compute-domain-crd.yaml b/gpu-operator/manifests/input/dra-compute-domain-crd.yaml new file mode 100644 index 000000000..924563d74 --- /dev/null +++ b/gpu-operator/manifests/input/dra-compute-domain-crd.yaml @@ -0,0 +1,9 @@ +apiVersion: resource.nvidia.com/v1beta1 +kind: ComputeDomain +metadata: + name: imex-channel-injection +spec: + numNodes: 1 + channel: + resourceClaimTemplate: + name: imex-channel-0 diff --git a/gpu-operator/manifests/input/imex-channel-injection.yaml b/gpu-operator/manifests/input/imex-channel-injection.yaml new file mode 100644 index 000000000..f812dd47d --- /dev/null +++ b/gpu-operator/manifests/input/imex-channel-injection.yaml @@ -0,0 +1,28 @@ +--- +apiVersion: resource.nvidia.com/v1beta1 +kind: ComputeDomain +metadata: + name: imex-channel-injection +spec: + numNodes: 1 + channel: + resourceClaimTemplate: + name: imex-channel-0 +--- +apiVersion: v1 +kind: Pod +metadata: + name: imex-channel-injection +spec: + containers: + - name: ctr + image: ubuntu:22.04 + command: ["bash", "-c"] + args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"] + resources: + claims: + - name: imex-channel-0 + resourceClaims: + - name: imex-channel-0 + resourceClaimTemplateName: imex-channel-0 + diff --git a/gpu-operator/manifests/input/kubeadm-init-config.yaml b/gpu-operator/manifests/input/kubeadm-init-config.yaml new file mode 100644 index 000000000..e913e464d --- /dev/null +++ b/gpu-operator/manifests/input/kubeadm-init-config.yaml @@ -0,0 +1,21 @@ +apiVersion: kubeadm.k8s.io/v1beta4 +kind: ClusterConfiguration +apiServer: + extraArgs: + - name: "feature-gates" + value: "DynamicResourceAllocation=true" + - name: "runtime-config" + value: "resource.k8s.io/v1beta1=true" +controllerManager: + extraArgs: + - name: "feature-gates" + value: "DynamicResourceAllocation=true" +scheduler: + extraArgs: + - name: "feature-gates" + value: "DynamicResourceAllocation=true" +--- +apiVersion: kubelet.config.k8s.io/v1beta1 +kind: KubeletConfiguration +featureGates: + DynamicResourceAllocation: true diff --git a/repo.toml b/repo.toml index 8e47ee289..15df2ddf1 100644 --- a/repo.toml +++ b/repo.toml @@ -265,4 +265,4 @@ copyright_start = 2024 [repo_docs.projects.secure-services-istio-keycloak.builds.linkcheck] build_by_default = false -output_format = "linkcheck" \ No newline at end of file +output_format = "linkcheck"