Commit 4ea15c9

[no-relnote] E2E update
Signed-off-by: Carlos Eduardo Arango Gutierrez <[email protected]>
1 parent 3d3bee1 commit 4ea15c9

405 files changed: +13210 -6407 lines

tests/e2e/Makefile

Lines changed: 9 additions & 19 deletions
@@ -1,4 +1,5 @@
1-
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
1+
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
2+
# SPDX-License-Identifier: Apache-2.0
23
#
34
# Licensed under the Apache License, Version 2.0 (the "License");
45
# you may not use this file except in compliance with the License.
@@ -28,21 +29,10 @@ E2E_IMAGE_PULL_POLICY ?= IfNotPresent
2829
HELM_CHART ?= $(CURDIR)/deployments/helm/nvidia-device-plugin
2930
LOG_ARTIFACTS ?= $(CURDIR)/e2e_logs
3031

31-
.PHONY: test
32-
test:
33-
@if [ -z ${KUBECONFIG} ]; then \
34-
echo "[ERR] KUBECONFIG missing, must be defined"; \
35-
exit 1; \
36-
fi
37-
cd $(CURDIR)/tests/e2e && $(GO_CMD) test -timeout $(GO_TEST_TIMEOUT) -v . -args \
38-
-kubeconfig=$(KUBECONFIG) \
39-
-driver-enabled=$(DRIVER_ENABLED) \
40-
-image.repo=$(E2E_IMAGE_REPO) \
41-
-image.tag=$(E2E_IMAGE_TAG) \
42-
-image.pull-policy=$(E2E_IMAGE_PULL_POLICY) \
43-
-log-artifacts=$(LOG_ARTIFACTS) \
44-
-helm-chart=$(HELM_CHART) \
45-
-helm-log-file=$(LOG_ARTIFACTS)/helm.log \
46-
-ginkgo.focus="\[nvidia\]" \
47-
-test.timeout=1h \
48-
-ginkgo.v
32+
.PHONY: ginkgo test-e2e
33+
ginkgo:
34+
mkdir -p $(CURDIR)/bin
35+
GOBIN=$(CURDIR)/bin go install github.com/onsi/ginkgo/v2/ginkgo@latest
36+
37+
test-e2e: ginkgo
38+
$(CURDIR)/bin/ginkgo $(GINKGO_ARGS) -v --json-report ginkgo.json ./tests/e2e/...

tests/e2e/README.md

Lines changed: 174 additions & 0 deletions
@@ -0,0 +1,174 @@
1+
<!--
2+
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
3+
SPDX-License-Identifier: Apache-2.0
4+
5+
Licensed under the Apache License, Version 2.0 (the "License");
6+
you may not use this file except in compliance with the License.
7+
You may obtain a copy of the License at
8+
9+
http://www.apache.org/licenses/LICENSE-2.0
10+
11+
Unless required by applicable law or agreed to in writing, software
12+
distributed under the License is distributed on an "AS IS" BASIS,
13+
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
14+
See the License for the specific language governing permissions and
15+
limitations under the License.
16+
-->
17+
18+
# README – End‑to‑End (Ginkgo/Gomega) Test Suite for the NVIDIA K8s Device Plugin
19+
20+
---
21+
22+
## 1  Purpose
23+
This directory contains a self‑contained Ginkgo v2 / Gomega end‑to‑end (E2E) test suite that:
24+
25+
1. Creates an **isolated namespace** per run.
26+
2. Deploys the **NVIDIA k8s‑device‑plugin Helm chart** under a random release name.
27+
3. Executes a **CUDA “*n‑body*” benchmark job** to validate GPU scheduling.
28+
29+
On test failure the suite gathers logs. After every run it attempts **full cleanup** (namespace deletion, finalizer removal).
30+
The suite targets CI pipelines and developers validating chart or driver changes before promotion.
31+
32+
---
33+
34+
## 2  Prerequisites
35+
36+
| Requirement | Notes |
37+
|----------------------|-------------------------------------------------------------------------------|
38+
| **Go ≥ 1.22** | Needed for building helper binaries. |
39+
| **Kubernetes cluster** | Must be reachable via `kubectl`; worker nodes require NVIDIA GPUs. |
40+
| **Helm v3 CLI** | Only required for manual debugging; the suite uses a programmatic client. |
41+
| **Linux/macOS host** | The Makefile assumes a POSIX‑compatible shell. |
42+
43+
---
44+
45+
## 3  Environment variables
46+
47+
| Variable | Required | Default | Description |
48+
|----------|----------|---------|-------------|
49+
| `KUBECONFIG` | yes | (none) | Path to the target‑cluster kubeconfig. |
50+
| `HELM_CHART` | yes | (none) | Helm chart reference (e.g. `oci://ghcr.io/nvidia/k8s-device-plugin`). |
51+
| `E2E_IMAGE_REPO` | yes | (none) | Repository hosting the image under test. |
52+
| `E2E_IMAGE_TAG` | yes | (none) | Image tag to test. |
53+
| `E2E_IMAGE_PULL_POLICY` | yes | (none) | Image pull policy (`Always`, `IfNotPresent`, …). |
54+
| `E2E_TIMEOUT_SECONDS` | no | `1800` | Global timeout (s). |
55+
| `LOG_ARTIFACTS_DIR` | no | `./artifacts` | Directory for Helm & test logs. |
56+
| `COLLECT_LOGS_FROM` | no | (unset) | Comma‑separated node list or `all` for log collection. |
57+
| `NVIDIA_DRIVER_ENABLED` | no | `false` | Skip GPU job when driver is unavailable. |
58+
59+
> *Unset variables fall back to defaults via `getIntEnvVar` / `getBoolEnvVar`.*
60+
61+
---
62+
63+
## 4  Build helper binaries
64+
65+
```bash
66+
make ginkgo
67+
# → ./bin/ginkgo (latest v2 CLI)
68+
```
69+
70+
---
71+
72+
## 5  Run the suite
73+
74+
### 5.1  Default invocation
75+
```bash
76+
make test-e2e
77+
```
78+
Installs the Ginkgo CLI into `./bin` (the `ginkgo` target runs `go install` on every invocation), executes all specs under `./tests/e2e`, and writes a JSON report to `ginkgo.json`.
79+
80+
### 5.2  Focused run / extra flags
81+
```bash
82+
GINKGO_ARGS='--focus="\[GPU Job\]" --keep-going' make test-e2e
83+
```
84+
Any flag accepted by `ginkgo run` can be forwarded through `GINKGO_ARGS`.
85+
86+
---
87+
88+
## 6  Execution flow
89+
90+
| Phase | Key functions / objects | Description |
91+
|-------|-------------------------|-------------|
92+
| **Init** | `TestMain`, `getTestEnv` | Validates env vars, sets global timeout. |
93+
| **Client setup** | `getK8sClients`, `getHelmClient` | Creates REST clients (core, CRD, NFD) and a Helm client that shares the same `rest.Config`. |
94+
| **Namespace** | `CreateTestingNS` | Generates a unique namespace labelled `e2e-run=<uid>`. |
95+
| **Chart deploy** | `helmClient.InstallRelease` | Installs the chart in the test namespace with a random release name. |
96+
| **Workload** | `newGPUJob` | Launches `nvcr.io/nvidia/k8s/cuda-sample:nbody` requesting `nvidia.com/gpu=1`. |
97+
| **Assertions** | Gomega matchers | Waits for `JobSucceeded == 1` and validates pod logs. |
98+
| **Cleanup** | `cleanupNamespaceResources`, `AfterSuite` | Removes finalizers, deletes namespace, closes Helm log file. |
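
The unique namespace and random release name in the table above both rely on some source of short random identifiers. A minimal sketch of one way to build them follows; `uniqueName` and its parameters are hypothetical and are not the suite's actual `randomSuffix()` implementation.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// suffixAlphabet contains only characters that are valid inside a
// Kubernetes DNS-1123 label (lowercase letters and digits).
const suffixAlphabet = "abcdefghijklmnopqrstuvwxyz0123456789"

// uniqueName returns prefix followed by n random characters drawn from
// suffixAlphabet. Hypothetical helper: the modulo mapping is slightly
// biased, which is acceptable for test resource names.
func uniqueName(prefix string, n int) (string, error) {
	buf := make([]byte, n)
	if _, err := rand.Read(buf); err != nil {
		return "", fmt.Errorf("read random bytes: %w", err)
	}
	for i, b := range buf {
		buf[i] = suffixAlphabet[int(b)%len(suffixAlphabet)]
	}
	return prefix + string(buf), nil
}

func main() {
	ns, err := uniqueName("e2e-", 8)
	if err != nil {
		panic(err)
	}
	fmt.Println("namespace:", ns)
}
```

Restricting the alphabet to lowercase letters and digits keeps the result usable as both a namespace and a Helm release name.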
99+
100+
---
101+
102+
## 7  Artifacts & logs
103+
104+
```
105+
${LOG_ARTIFACTS_DIR}/
106+
└── helm/
107+
├── helm_logs # Release operations, one per test namespace
108+
└── ...
109+
110+
ginkgo.json # Structured test outcome for CI parsing
111+
```
112+
If `COLLECT_LOGS_FROM` is set, additional node‑level or container logs are archived in the same directory.
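
For CI parsing, `ginkgo.json` can be read with `encoding/json`. The sketch below assumes the report is a JSON array of suite reports whose entries carry `SpecReports` with `ContainerHierarchyTexts`, `LeafNodeText`, and a string `State`, as emitted by Ginkgo v2; check the field names against the Ginkgo version in use. The sample data and the `failedSpecs` helper are illustrative only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// specReport and suiteReport model only the fields this sketch needs.
// Field names are assumed to match Ginkgo v2's JSON report.
type specReport struct {
	ContainerHierarchyTexts []string
	LeafNodeText            string
	State                   string
}

type suiteReport struct {
	SuiteDescription string
	SuiteSucceeded   bool
	SpecReports      []specReport
}

// failedSpecs returns the full description of every spec whose state is
// anything other than passed, skipped, or pending.
func failedSpecs(data []byte) ([]string, error) {
	var suites []suiteReport
	if err := json.Unmarshal(data, &suites); err != nil {
		return nil, fmt.Errorf("decode report: %w", err)
	}
	var failed []string
	for _, suite := range suites {
		for _, spec := range suite.SpecReports {
			switch spec.State {
			case "passed", "skipped", "pending":
				continue // not a failure; move to the next spec
			}
			parts := append([]string{}, spec.ContainerHierarchyTexts...)
			parts = append(parts, spec.LeafNodeText)
			failed = append(failed, strings.Join(parts, " "))
		}
	}
	return failed, nil
}

// sample is made-up data standing in for the contents of ginkgo.json.
const sample = `[{"SuiteDescription":"E2E","SuiteSucceeded":false,"SpecReports":[
  {"ContainerHierarchyTexts":["GPU Device Plugin"],"LeafNodeText":"runs the job","State":"passed"},
  {"ContainerHierarchyTexts":["GPU Device Plugin"],"LeafNodeText":"collects logs","State":"failed"}]}]`

func main() {
	failed, err := failedSpecs([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, name := range failed {
		fmt.Println("FAILED:", name)
	}
}
```

In a pipeline, the sample constant would be replaced by the bytes read from the report file.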
113+
114+
---
115+
116+
## 8 Extending the suite
117+
118+
### 8.1 Creating additional spec files
119+
120+
1. Add a new `_test.go` file under `tests/e2e`.
121+
2. Import the Ginkgo/Gomega DSL:
122+
```go
123+
import (
124+
. "github.com/onsi/ginkgo/v2"
125+
. "github.com/onsi/gomega"
126+
)
127+
```
128+
3. Wrap your tests with `Describe`, `Context`, `When`, `It`, etc.
129+
4. Scope all resources to `testNamespace` and always guard API calls with `Expect(err).NotTo(HaveOccurred())`.
130+
5. Use helpers such as `wait.PollUntilContextTimeout` for custom waits and back‑off loops.
131+
132+
### 8.2 Adding additional *When* blocks to `device-plugin_test.go`
133+
The suite already contains a high‑level file, `tests/e2e/device-plugin_test.go`, which drives most GPU‑focused checks. To extend it:
134+
135+
1. **Open** `tests/e2e/device-plugin_test.go`.
136+
2. **Locate** the outer `Describe("GPU Device Plugin", Ordered, func() { … })` wrapper.
137+
3. **Add a sibling `When` container** under this `Describe` for each new behaviour you want to validate:
138+
```go
139+
When("....", func() {
140+
It("should ......", func(ctx context.Context) {
141+
//
142+
//
143+
// ...
144+
})
145+
})
146+
```
147+
4. **Use `Ordered`** on the `When` block *only* if its order relative to other tests is significant (e.g. upgrade/downgrade flows). Otherwise omit it for independent execution.
148+
5. **Share helpers**: you can reference `helmClient`, `clientSet`, `randomSuffix()`, `eventuallyNonControlPlaneNodes`, etc., directly because they are package‑level variables/functions exposed by `e2e`.
149+
6. **Diagnostics on failure** are automatic – `AfterEach` will collect logs whenever `CurrentSpecReport().Failed()` is `true`.
150+
151+
> Keep each `When` block focused on one behaviour. If it spawns multiple `It` tests, make sure they are idempotent and leave no residual resources so that later blocks start from a clean state.
152+
153+
---
154+
155+
## 9 Troubleshooting
156+
157+
| Symptom | Possible fix |
158+
|---------|--------------|
159+
| **`ErrImagePull` for CUDA job** | Validate `E2E_IMAGE_REPO` / `E2E_IMAGE_TAG` and registry access. |
160+
| Job stuck in **`Pending`** | Ensure nodes advertise `nvidia.com/gpu` and tolerations match taints. |
161+
| Helm install failure | Render manifests locally via `helm template $HELM_CHART` to inspect errors. |
162+
163+
---
164+
165+
## 10  License
166+
This test code is released under the same license as the NVIDIA k8s‑device‑plugin project (Apache‑2.0).
167+
168+
---
169+
170+
## 11  References
171+
* [Ginkgo v2](https://github.com/onsi/ginkgo)
172+
* [mittwald/go‑helm‑client](https://github.com/mittwald/go-helm-client)
173+
* [Kubernetes‑sigs/Node Feature Discovery](https://github.com/kubernetes-sigs/node-feature-discovery)
174+
* [Kubernetes blog – *End‑to‑End Testing for Everyone*](https://kubernetes.io/blog/2020/07/27/kubernetes-e2e-testing-for-everyone/)
