|
| 1 | +<!-- |
| 2 | +SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. |
| 3 | +SPDX-License-Identifier: Apache-2.0 |
| 4 | +
|
| 5 | +Licensed under the Apache License, Version 2.0 (the "License"); |
| 6 | +you may not use this file except in compliance with the License. |
| 7 | +You may obtain a copy of the License at |
| 8 | +
|
| 9 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 10 | +
|
| 11 | +Unless required by applicable law or agreed to in writing, software |
| 12 | +distributed under the License is distributed on an "AS IS" BASIS, |
| 13 | +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 14 | +See the License for the specific language governing permissions and |
| 15 | +limitations under the License. |
| 16 | +--> |
| 17 | + |
| 18 | +# README – End‑to‑End (Ginkgo/Gomega) Test Suite for the NVIDIA K8s Device Plugin |
| 19 | + |
| 20 | +--- |
| 21 | + |
| 22 | +## 1 Purpose |
| 23 | +This repository contains a self‑contained Ginkgo v2 / Gomega end‑to‑end (E2E) test suite that |
| 24 | + |
| 25 | +1. Creates an **isolated namespace** per run. |
| 26 | +2. Deploys the **NVIDIA k8s‑device‑plugin Helm chart** under a random release name. |
| 27 | +3. Executes a **CUDA “*n‑body*” benchmark job** to validate GPU scheduling. |
| 28 | + |
| 29 | +On test failure the suite gathers logs. After every run it **ensures full cleanup** (namespace deletion, finalizer removal). |
| 30 | +The suite targets CI pipelines and developers validating chart or driver changes before promotion. |
| 31 | + |
| 32 | +--- |
| 33 | + |
| 34 | +## 2 Prerequisites |
| 35 | + |
| 36 | +| Requirement | Notes | |
| 37 | +|----------------------|-------------------------------------------------------------------------------| |
| 38 | +| **Go ≥ 1.22** | Needed for building helper binaries. | |
| 39 | +| **Kubernetes cluster** | Must be reachable via `kubectl`; worker nodes require NVIDIA GPUs. | |
| 40 | +| **Helm v3 CLI** | Only required for manual debugging; the suite uses a programmatic client. | |
| 41 | +| **Linux/macOS host** | The Makefile assumes a POSIX‑compatible shell. | |
| 42 | + |
| 43 | +--- |
| 44 | + |
| 45 | +## 3 Environment variables |
| 46 | + |
| 47 | +| Variable | Required | Default | Description | |
| 48 | +|----------|----------|---------|-------------| |
| 49 | +| `KUBECONFIG` | ✔ | — | Path to the target‑cluster kubeconfig. | |
| 50 | +| `HELM_CHART` | ✔ | — | Helm chart reference (e.g. `oci://ghcr.io/nvidia/k8s-device-plugin`). | |
| 51 | +| `E2E_IMAGE_REPO` | ✔ | — | Repository hosting the image under test. | |
| 52 | +| `E2E_IMAGE_TAG` | ✔ | — | Image tag to test. | |
| 53 | +| `E2E_IMAGE_PULL_POLICY` | ✔ | — | Image pull policy (`Always`, `IfNotPresent`, …). | |
| 54 | +| `E2E_TIMEOUT_SECONDS` | ✖ | `1800` | Global timeout (s). | |
| 55 | +| `LOG_ARTIFACTS_DIR` | ✖ | `./artifacts` | Directory for Helm & test logs. | |
| 56 | +| `COLLECT_LOGS_FROM` | ✖ | (unset) | Comma‑separated node list or `all` for log collection. | |
| 57 | +| `NVIDIA_DRIVER_ENABLED` | ✖ | `false` | Set to `true` when the NVIDIA driver is available on the nodes; when `false`, the GPU job is skipped. | |
| 58 | + |
| 59 | +> *Unset variables fall back to defaults via `getIntEnvVar` / `getBoolEnvVar`.* |
| 60 | +
|
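The fallback behaviour described above can be sketched as follows. This is a minimal illustration, not the suite's actual code: the source only names `getIntEnvVar` and `getBoolEnvVar`, so the signatures and the choice to fall back to the default on unparsable values are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// getIntEnvVar returns the integer value of the named variable, or def when
// the variable is unset, empty, or not a valid integer.
// Assumed signature: the suite's real helper may differ.
func getIntEnvVar(name string, def int) int {
	raw, ok := os.LookupEnv(name)
	if !ok || raw == "" {
		return def
	}
	n, err := strconv.Atoi(raw)
	if err != nil {
		return def
	}
	return n
}

// getBoolEnvVar returns the boolean value of the named variable, or def when
// the variable is unset, empty, or not accepted by strconv.ParseBool.
// Assumed signature: the suite's real helper may differ.
func getBoolEnvVar(name string, def bool) bool {
	raw, ok := os.LookupEnv(name)
	if !ok || raw == "" {
		return def
	}
	b, err := strconv.ParseBool(raw)
	if err != nil {
		return def
	}
	return b
}

func main() {
	// Defaults mirror the table above.
	timeout := getIntEnvVar("E2E_TIMEOUT_SECONDS", 1800)
	driver := getBoolEnvVar("NVIDIA_DRIVER_ENABLED", false)
	fmt.Printf("timeout=%ds driverEnabled=%t\n", timeout, driver)
}
```

Silently falling back on a malformed value keeps CI runs going, but a stricter variant could fail fast instead so that typos in pipeline configuration surface early.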
| 61 | +--- |
| 62 | + |
| 63 | +## 4 Build helper binaries |
| 64 | + |
| 65 | +```bash |
| 66 | +make ginkgo |
| 67 | +# → ./bin/ginkgo (latest v2 CLI) |
| 68 | +``` |
| 69 | + |
| 70 | +--- |
| 71 | + |
| 72 | +## 5 Run the suite |
| 73 | + |
| 74 | +### 5.1 Default invocation |
| 75 | +```bash |
| 76 | +make test-e2e |
| 77 | +``` |
| 78 | +Builds the Ginkgo CLI (if missing), executes all specs under `./tests/e2e`, and writes a JSON report to `ginkgo.json`. |
| 79 | + |
| 80 | +### 5.2 Focused run / extra flags |
| 81 | +```bash |
| 82 | +GINKGO_ARGS='--focus="[GPU Job]" --keep-going' make test-e2e |
| 83 | +``` |
| 84 | +Any flag accepted by `ginkgo run` can be forwarded through `GINKGO_ARGS`. |
| 85 | + |
| 86 | +--- |
| 87 | + |
| 88 | +## 6 Execution flow |
| 89 | + |
| 90 | +| Phase | Key functions / objects | Description | |
| 91 | +|-------|-------------------------|-------------| |
| 92 | +| **Init** | `TestMain`, `getTestEnv` | Validates env vars, sets global timeout. | |
| 93 | +| **Client setup** | `getK8sClients`, `getHelmClient` | Creates REST clients (core, CRD, NFD) and a Helm client that shares the same `rest.Config`. | |
| 94 | +| **Namespace** | `CreateTestingNS` | Generates a unique namespace labelled `e2e-run=<uid>`. | |
| 95 | +| **Chart deploy** | `helmClient.InstallRelease` | Installs the chart in the test namespace with a random release name. | |
| 96 | +| **Workload** | `newGPUJob` | Launches `nvcr.io/nvidia/k8s/cuda-sample:nbody` requesting `nvidia.com/gpu=1`. | |
| 97 | +| **Assertions** | Gomega matchers | Waits for `JobSucceeded == 1` and validates pod logs. | |
| 98 | +| **Cleanup** | `cleanupNamespaceResources`, `AfterSuite` | Removes finalizers, deletes namespace, closes Helm log file. | |
| 99 | + |
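The Namespace and Chart deploy phases both depend on names that do not collide between runs. The sketch below shows one way such names could be generated. It is an assumption-laden illustration: the suite exposes a `randomSuffix()` helper, but its implementation is not shown in this README, and `uniqueName`, `isDNSLabel`, and the five-character length are hypothetical.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// Lowercase letters and digits only, so the result is safe inside a
// Kubernetes DNS-1123 label.
const suffixAlphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

// suffixLen is a hypothetical length chosen for illustration.
const suffixLen = 5

// randomSuffix returns a short random string. The modulo introduces a slight
// bias towards early alphabet characters, which is acceptable for test names.
func randomSuffix() string {
	buf := make([]byte, suffixLen)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	for i, b := range buf {
		buf[i] = suffixAlphabet[int(b)%len(suffixAlphabet)]
	}
	return string(buf)
}

// uniqueName (hypothetical) joins a prefix and a random suffix.
func uniqueName(prefix string) string {
	return prefix + "-" + randomSuffix()
}

// isDNSLabel (hypothetical) checks the basic DNS-1123 label rules: 1 to 63
// characters, lowercase alphanumerics or '-', no leading or trailing '-'.
func isDNSLabel(s string) bool {
	if len(s) == 0 || len(s) > 63 {
		return false
	}
	for i := 0; i < len(s); i++ {
		c := s[i]
		alnum := (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9')
		edge := i == 0 || i == len(s)-1
		if !alnum && (c != '-' || edge) {
			return false
		}
	}
	return true
}

func main() {
	ns := uniqueName("e2e")
	release := uniqueName("nvdp")
	fmt.Println("namespace:", ns, "valid:", isDNSLabel(ns))
	fmt.Println("release:  ", release, "valid:", isDNSLabel(release))
}
```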
| 100 | +--- |
| 101 | + |
| 102 | +## 7 Artifacts & logs |
| 103 | + |
| 104 | +``` |
| 105 | +${LOG_ARTIFACTS_DIR}/ |
| 106 | +└── helm/ |
| 107 | + ├── helm_logs # Release operations, one per test namespace |
| 108 | + └── ... |
| 109 | +
|
| 110 | +ginkgo.json # Structured test outcome for CI parsing |
| 111 | +``` |
| 112 | +If `COLLECT_LOGS_FROM` is set, additional node‑level or container logs are archived in the same directory. |
| 113 | + |
| 114 | +--- |
| 115 | + |
| 116 | +## 8 Extending the suite |
| 117 | + |
| 118 | +### 8.1 Creating additional spec files |
| 119 | + |
| 120 | +1. Add a new `_test.go` file under `tests/e2e`. |
| 121 | +2. Import the Ginkgo/Gomega DSL: |
| 122 | + ```go |
| 123 | + import ( |
| 124 | + . "github.com/onsi/ginkgo/v2" |
| 125 | + . "github.com/onsi/gomega" |
| 126 | + ) |
| 127 | + ``` |
| 128 | +3. Wrap your tests with `Describe`, `Context`, `When`, `It`, etc. |
| 129 | +4. Scope all resources to `testNamespace` and always guard API calls with `Expect(err).NotTo(HaveOccurred())`. |
| 130 | +5. Use helpers such as `wait.PollUntilContextTimeout` for custom waits and back‑off loops. |
| 131 | + |
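For step 5, the polling pattern can be sketched with the standard library alone. The `pollUntil` function below is a hypothetical stand-in that mirrors the parameter order of `wait.PollUntilContextTimeout` (context, interval, timeout, immediate flag, condition); in a real spec you would call the `k8s.io/apimachinery` function instead. The demo helpers are illustrative only.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// conditionFunc reports whether the awaited state has been reached.
// Returning a non-nil error aborts the poll immediately.
type conditionFunc func(ctx context.Context) (done bool, err error)

// pollUntil (hypothetical stand-in) calls cond every interval until it
// returns true, returns an error, or the timeout expires.
func pollUntil(ctx context.Context, interval, timeout time.Duration, immediate bool, cond conditionFunc) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	if immediate {
		if done, err := cond(ctx); err != nil || done {
			return err
		}
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if done, err := cond(ctx); err != nil || done {
				return err
			}
		}
	}
}

// demoAttempts polls a fake condition that becomes true on the third call.
func demoAttempts() (int, error) {
	attempts := 0
	err := pollUntil(context.Background(), 10*time.Millisecond, time.Second, true,
		func(ctx context.Context) (bool, error) {
			attempts++
			return attempts >= 3, nil
		})
	return attempts, err
}

// demoTimeout polls a condition that never succeeds and reports whether the
// resulting error is a deadline error.
func demoTimeout() bool {
	err := pollUntil(context.Background(), 10*time.Millisecond, 50*time.Millisecond, true,
		func(ctx context.Context) (bool, error) { return false, nil })
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	attempts, err := demoAttempts()
	fmt.Println("attempts:", attempts, "err:", err)
	fmt.Println("timed out:", demoTimeout())
}
```

Inside a spec, the returned error would typically be checked with `Expect(err).NotTo(HaveOccurred())`, in line with step 4.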
| 132 | +### 8.2 Adding additional *When* blocks to `device-plugin_test.go` |
| 133 | +The suite already contains a high‑level file, `tests/e2e/device-plugin_test.go`, which drives most GPU‑focused checks. To extend it: |
| 134 | + |
| 135 | +1. **Open** `tests/e2e/device-plugin_test.go`. |
| 136 | +2. **Locate** the outer `Describe("GPU Device Plugin", Ordered, func() { … })` wrapper. |
| 137 | +3. **Add a sibling `When` container** under this `Describe` for each new behaviour you want to validate: |
| 138 | + ```go |
| 139 | + When("....", func() { |
| 140 | + It("should ......", func(ctx context.Context) { |
| 141 | + // |
| 142 | + // |
| 143 | + // ... |
| 144 | + }) |
| 145 | + }) |
| 146 | + ``` |
| 147 | +4. **Use `Ordered`** on the `When` block *only* if its order relative to other tests is significant (e.g. upgrade/downgrade flows). Otherwise omit it for independent execution. |
| 148 | +5. **Share helpers**: you can reference `helmClient`, `clientSet`, `randomSuffix()`, `eventuallyNonControlPlaneNodes`, etc., directly because they are package‑level variables/functions exposed by `e2e`. |
| 149 | +6. **Diagnostics on failure** are automatic – `AfterEach` will collect logs whenever `CurrentSpecReport().Failed()` is `true`. |
| 150 | + |
| 151 | +> Keep each `When` block focused on one behaviour. If it spawns multiple `It` tests, make sure they are idempotent and leave no residual resources so that later blocks start from a clean state. |
| 152 | +
|
| 153 | +--- |
| 154 | + |
| 155 | +## 9 Troubleshooting |
| 156 | + |
| 157 | +| Symptom | Possible fix | |
| 158 | +|---------|--------------| |
| 159 | +| **`ErrImagePull` for CUDA job** | Validate `E2E_IMAGE_REPO` / `E2E_IMAGE_TAG` and registry access. | |
| 160 | +| Job stuck in **`Pending`** | Ensure nodes advertise `nvidia.com/gpu` and tolerations match taints. | |
| 161 | +| Helm install failure | Render manifests locally via `helm template $HELM_CHART` to inspect errors. | |
| 162 | + |
| 163 | +--- |
| 164 | + |
| 165 | +## 10 License |
| 166 | +This test code is released under the same license as the NVIDIA k8s‑device‑plugin project (Apache‑2.0). |
| 167 | + |
| 168 | +--- |
| 169 | + |
| 170 | +## 11 References |
| 171 | +* [Ginkgo v2](https://github.com/onsi/ginkgo) |
| 172 | +* [mittwald/go‑helm‑client](https://github.com/mittwald/go-helm-client) |
| 173 | +* [Kubernetes‑sigs/Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery) |
| 174 | +* [Kubernetes blog – *End‑to‑End Testing for Everyone*](https://kubernetes.io/blog/2020/07/27/kubernetes-e2e-testing-for-everyone/) |