chore: add a tutorial on how to write a new health monitor#1428
chore: add a tutorial on how to write a new health monitor#1428nitz2407 wants to merge 3 commits into
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a README link to tutorials and a new end-to-end guide for writing, validating, packaging, and registering an NVSentinel health monitor, including protocol details, an example Go implementation, remediation rules, and CI/ko integration. ChangesHealth Monitor Tutorial
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
9329434 to
5aed4a5
Compare
|
🌿 Fern Docs Preview: https://nvidia-preview-pull-request-1428.docs.buildwithfern.com/nvsentinel |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 227-249: Validate the parsed POLL_INTERVAL_SECONDS value in the
health monitor startup flow before creating the ticker in the polling loop. In
the section that currently parses pollSeconds and later calls time.NewTicker,
add a guard to reject zero or negative values and return a clear error instead
of proceeding. Keep the fix near the existing POLL_INTERVAL_SECONDS handling so
the failure is caught before the ticker is constructed.
- Around line 520-530: The setup steps in the tutorial need to be reordered
because `go get github.com/nvidia/nvsentinel/commons@v0.0.0` will fail in a
fresh checkout until the local replacement is already in place. Update the
instructions in the `writing-a-health-monitor` section so the `go.mod` replace
for `github.com/nvidia/nvsentinel/commons` is added first (or via `go mod edit
-replace`) before running `go get`, keeping the dependency fetch step after the
replace is configured.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fd89c363-9c5b-4a74-9877-39fa927e7df7
📒 Files selected for processing (2)
docs/README.mddocs/tutorials/writing-a-health-monitor.md
Merging this branch will decrease overall coverage
Coverage by fileChanged files (no unit tests)
Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code. Changed unit test files
|
89c9f74 to
ec0590a
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 15-19: The blockquotes in the tutorial markdown currently contain
blank lines that trigger markdownlint MD028, so tighten the copy in the affected
section and the repeated block around Appendix C by removing the internal blank
lines or splitting each quoted paragraph into separate blockquote blocks. Update
the relevant markdown text in the tutorial content so the quoted lines remain
readable but no longer have blank lines inside the same blockquote.
- Around line 263-265: The trigger-file check is treating all stat failures as
“healthy,” which can hide permission or I/O problems and clear an active fault;
update the fileExists helper and the demo check logic so only os.IsNotExist is
interpreted as the file being absent. Return the error from fileExists instead
of collapsing every os.Stat failure to false, and in the demo check path that
logs with slog.Info, propagate non-not-exist errors as unhealthy rather than
deriving healthy solely from fileExists. Apply the same fix to both occurrences
referenced by the demo snippets so the health monitor behavior is consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 62ef8d12-b0b4-4b72-9c3c-2e94b6c95a67
📒 Files selected for processing (2)
docs/README.mddocs/tutorials/writing-a-health-monitor.md
✅ Files skipped from review due to trivial changes (1)
- docs/README.md
| - A fault-quarantine rule that acts on its events. | ||
| - Unit tests and the production-grade publishing path. | ||
|
|
||
| > **Who is this for?** Developers new to NVSentinel. You need to know basic Go and |
There was a problem hiding this comment.
Can we anchor this more for developers who are extending NVSentinel? Similar to how anyone can write a controller for Kubernetes, some may be accepted upstream, but some may not need to be upstream?
There was a problem hiding this comment.
It means the user is already aware about NVSentinel, and we no need to put information about it inside the doc?
There was a problem hiding this comment.
meaning user is already using it and wants to extend. So just some basic information as like a refresher
|
|
||
| --- | ||
|
|
||
| ## 1. What a health monitor is (the 2-minute version) |
There was a problem hiding this comment.
nit:
| ## 1. What a health monitor is (the 2-minute version) | |
| ## 1. What a health monitor is |
| | `checkName` | string | yes | The specific check, e.g. `"DemoHealthCheck"`. | | ||
| | `isHealthy` | bool | yes | `true` = recovery/clear; `false` = a fault. | | ||
| | `isFatal` | bool | yes | `true` means "serious enough to act on". Quarantine rules typically require `isFatal == true`. | | ||
| | `message` | string | recommended | Human-readable description. | |
There was a problem hiding this comment.
this is the error message as is from the hardware not a human readable description
| | `entitiesImpacted` | repeated Entity | optional | Sub-components affected, e.g. `{entityType:"GPU", entityValue:"3"}`. | | ||
| | `errorCode` | repeated string | optional | Vendor/error codes. | | ||
| | `metadata` | map<string,string> | optional | Free-form key/values. | | ||
| | `quarantineOverrides` / `drainOverrides` | BehaviourOverrides | optional | `{force, skip}` to override default cordon/drain behaviour. | |
There was a problem hiding this comment.
let's not document this, let's drop this for now
| | `errorCode` | repeated string | optional | Vendor/error codes. | | ||
| | `metadata` | map<string,string> | optional | Free-form key/values. | | ||
| | `quarantineOverrides` / `drainOverrides` | BehaviourOverrides | optional | `{force, skip}` to override default cordon/drain behaviour. | | ||
| | `id` | string | optional | Leave empty; platform-connector assigns one. | |
There was a problem hiding this comment.
we don't have to mention this
|
|
||
| --- | ||
|
|
||
| ## 9. Variant: cluster watcher (Deployment) |
|
|
||
| --- | ||
|
|
||
| ## 10. Package as a Helm subchart |
| > `helm template demo-health-monitor distros/kubernetes/nvsentinel/charts/demo-health-monitor` | ||
| > and validate with `helm lint distros/kubernetes/nvsentinel/charts/demo-health-monitor`. | ||
|
|
||
| ### Make remediation act on your events |
There was a problem hiding this comment.
Let's not mention this here, we can do this as a part of the fault quarantine tutorial
|
|
||
| --- | ||
|
|
||
| ## 11. Register in the build |
|
|
||
| ## Appendix C: One-shot AI prompt | ||
|
|
||
| Paste this to an AI coding agent working in the NVSentinel repo. Replace the bracketed |
There was a problem hiding this comment.
This needs to work on an empty repo
64487fd to
02e66f5
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
docs/tutorials/writing-a-health-monitor.md (1)
40-40: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick winAlign the socket path with the mount.
The diagram says the monitor talks to
unix:///var/run/nvsentinel.sock, but the deploy step mountshostPath/var/run/nvsentinelinto/var/run. That only works if the socket actually lives under a nested directory, so readers will end up wiring the wrong path. Please make the mount and URI agree.Also applies to: 368-369
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tutorials/writing-a-health-monitor.md` at line 40, The diagram and deploy instructions use mismatched socket paths, so update the health monitor docs to make the mount target and the gRPC URI refer to the same location. Adjust the wording and diagram in the writing-a-health-monitor tutorial around the monitor-to-Platform Connector connection so the socket path shown for the monitor matches the hostPath mount used in the deployment step.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Line 40: The diagram and deploy instructions use mismatched socket paths, so
update the health monitor docs to make the mount target and the gRPC URI refer
to the same location. Adjust the wording and diagram in the
writing-a-health-monitor tutorial around the monitor-to-Platform Connector
connection so the socket path shown for the monitor matches the hostPath mount
used in the deployment step.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: f8a3d558-7b55-443b-920b-297b665bae97
📒 Files selected for processing (1)
docs/tutorials/writing-a-health-monitor.md
02e66f5 to
1105a3a
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
docs/tutorials/writing-a-health-monitor.md (1)
152-173: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winClarify the
go.modscaffold wording.This says "two local replace directives", but the snippet only adds
data-modelshere;commonsis introduced later in section 8. As written, it reads like a missing step.🛠 Suggested wording
-Create `go.mod` with the two local replace directives every monitor uses (the module -path is resolved locally, not from a registry): +Create `go.mod` with the local replace directive needed for the demo module (the +module path is resolved locally, not from a registry):🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/tutorials/writing-a-health-monitor.md` around lines 152 - 173, The `go.mod` scaffold wording is inconsistent with the snippet because it claims “two local replace directives” even though this section only adds the `data-models` replace and introduces `commons` later in section 8. Update the tutorial text around the `go.mod` example to say this scaffold includes the local replace for `data-models` only, and mention that the `commons` replace is added later when the production publisher is introduced, so readers do not think a step is missing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 626-651: The publish example currently introduces fakePC but does
not use it, and the guidance about healthpub.New is incomplete. Update the
tutorial section around fakePC, TestBuildEvent_Unhealthy, and the Publish
example so the fake client is actually used when demonstrating delivery through
healthpub, or remove the fakePC mention entirely if only buildEvent is being
tested. Make the example consistent with the non-unix target behavior of
healthpub.New so the fakePC path is exercised instead of left unused.
---
Nitpick comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 152-173: The `go.mod` scaffold wording is inconsistent with the
snippet because it claims “two local replace directives” even though this
section only adds the `data-models` replace and introduces `commons` later in
section 8. Update the tutorial text around the `go.mod` example to say this
scaffold includes the local replace for `data-models` only, and mention that the
`commons` replace is added later when the production publisher is introduced, so
readers do not think a step is missing.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 15e923c0-04d0-4d5e-9098-d2553ea37bdd
📒 Files selected for processing (2)
docs/README.mddocs/tutorials/writing-a-health-monitor.md
✅ Files skipped from review due to trivial changes (1)
- docs/README.md
1105a3a to
c7c3147
Compare
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
c7c3147 to
fb27671
Compare
|
|
||
| - A working Go health monitor that detects a fault and reports it to NVSentinel. | ||
| - A container image for it, deployed to a live cluster as a DaemonSet. | ||
| - A fault-quarantine rule that acts on its events. |
There was a problem hiding this comment.
I thought we didn't want to handle this here?
There was a problem hiding this comment.
I think we should put it here as without it flow is incomplete.
|
|
||
| ```bash | ||
| cd <repo-root> # the NVSentinel repository root | ||
| docker build -t demo-health-monitor:dev \ |
There was a problem hiding this comment.
Can we move the build and publish steps to the previous section?
| **2. Load it into the KIND cluster** so the node can use it without pushing to a registry: | ||
|
|
||
| ```bash | ||
| kind load docker-image demo-health-monitor:dev --name kind-nvsentinel |
There was a problem hiding this comment.
let's have the users push to dockerhub and pull from there
|
|
||
| > Tear down the monitor with `kubectl delete ds/demo-health-monitor -n nvsentinel`. | ||
|
|
||
| ### Trigger and verify |
There was a problem hiding this comment.
let's just validate node conditions
|
|
||
| --- | ||
|
|
||
| ## 8. Level up to production-grade |
fb27671 to
294532a
Compare
|
|
||
| ## 7. Run and verify | ||
|
|
||
| Deploy the image (built in [section 6](#6-containerize)) as a DaemonSet, then trigger a fault |
There was a problem hiding this comment.
nit: we don't need the callback
| Deploy the image (built in [section 6](#6-containerize)) as a DaemonSet, then trigger a fault | |
| Deploy the image as a DaemonSet, then trigger a fault |
| docker push docker.io/<dockerhub-user>/demo-health-monitor:dev | ||
| ``` | ||
|
|
||
| > Use a **public** Docker Hub repository so the cluster can pull without credentials. For a |
There was a problem hiding this comment.
We can rephrase to say something like if you are using a private repo then make sure you create the required image pull credentials
| Deploy the image (built in [section 6](#6-containerize)) as a DaemonSet, then trigger a fault | ||
| and watch it flow through the pipeline. | ||
|
|
||
| > **Prerequisites (tools):** `docker`, `kubectl`. |
There was a problem hiding this comment.
Let's put all the prereuisites at the top rather than per section
|
|
||
| > **Prerequisites (tools):** `docker`, `kubectl`. | ||
|
|
||
| > **Prerequisite — a running NVSentinel cluster.** Verification needs NVSentinel already |
There was a problem hiding this comment.
Let's keep it simple, a cluster with NVSentinel running
| containers: | ||
| - name: demo-health-monitor | ||
| image: docker.io/<dockerhub-user>/demo-health-monitor:dev | ||
| imagePullPolicy: Always |
There was a problem hiding this comment.
| imagePullPolicy: Always | |
| imagePullPolicy: IfNotPresent |
|
|
||
| ### Trigger and verify | ||
|
|
||
| > Assumes the [prerequisite](#7-run-and-verify) is met: the cluster is up and |
There was a problem hiding this comment.
this is second time we have mentioned this, let's drop duplicate information. Let's keep all the prerequisites at the time and not keep repreating outselves
| scheduled there (replace the IP with whatever you set `CHECK_TARGET` to): | ||
|
|
||
| ```bash | ||
| docker exec kind-nvsentinel-worker ip route add blackhole 192.168.1.1/32 |
There was a problem hiding this comment.
this will not work for a real cluster
There was a problem hiding this comment.
Mentioned specifically that the example uses the KIND cluster for verification.
| back (`Status=False`, `Reason=NetworkReachabilityIsHealthy`, `Message="No Health Failures"`): | ||
|
|
||
| ```bash | ||
| docker exec kind-nvsentinel-worker ip route del blackhole 192.168.1.1/32 |
There was a problem hiding this comment.
will not work on a real cluster
There was a problem hiding this comment.
Mentioned specifically that the example uses the KIND cluster for verification.
| -o jsonpath='{range .status.conditions[?(@.type=="NetworkReachability")]}{.type}{" "}{.status}{" "}{.reason}{" "}{.message}{"\n"}{end}' | ||
| ``` | ||
|
|
||
| > **Want automatic remediation (cordon/drain)?** Recording the condition proves your event |
There was a problem hiding this comment.
I think we talked about this ealier. This tutorial should stop at the node conditions
|
|
||
| --- | ||
|
|
||
| ## Appendix A: Key file locations |
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
294532a to
ae4c4a2
Compare
Summary
Adds a complete, developer-facing guide for writing NVSentinel health monitors, and its one-shot AI prompt.
Testing
Validated one-shot AI prompt end-to-end on a local KIND cluster via Tilt.
Type of Change
Component(s) Affected
Testing
Checklist
Summary by CodeRabbit