chore: add a tutorial on how to write a new health monitor #1428

Open
nitz2407 wants to merge 3 commits into NVIDIA:main from nitz2407:nitijain/NKX-12166

@nitz2407 nitz2407 commented Jun 26, 2026

Contributor

Summary

Adds a complete, developer-facing guide for writing NVSentinel health monitors, along with a one-shot AI prompt.

Testing

Validated the one-shot AI prompt end-to-end on a local KIND cluster via Tilt.

Type of Change

  • 🐛 Bug fix
  • ✨ New feature
  • 💥 Breaking change
  • 📚 Documentation
  • 🔧 Refactoring
  • 🔨 Build/CI

Component(s) Affected

  • Core Services
  • Documentation/CI
  • Fault Management
  • Health Monitors
  • Janitor
  • [] Other: ____________

Testing

  • Tests pass locally
  • Manual testing completed
  • No breaking changes (or documented)

Checklist

  • Self-review completed
  • Documentation updated (if needed)
  • Ready for review

Summary by CodeRabbit

  • Documentation
    • Added a new “Writing a New Health Monitor” tutorial with end-to-end guidance to build, test, containerize, and deploy a health monitor (no GPU) that reports faults via a Unix-domain gRPC socket.
    • Documents the required platform connector health event contract, required vs. recommended fields, and key semantics for healthy/fatal behavior and recommended actions.
    • Includes local verification steps using Tilt/cluster tooling and required ruleset matching for emitted monitor events.
    • Expanded the documentation with a new tutorials section and a direct link to the guide.

@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds a README link to tutorials and a new end-to-end guide for writing, validating, packaging, and registering an NVSentinel health monitor, including protocol details, an example Go implementation, remediation rules, and CI/ko integration.

Changes

Health Monitor Tutorial

| Layer / File(s) | Summary |
|---|---|
| **Tutorial entry and scope**: `docs/README.md`, `docs/tutorials/writing-a-health-monitor.md` | Adds the tutorials section in the README and introduces the tutorial purpose and monitor responsibilities. |
| **Contract and rules**: `docs/tutorials/writing-a-health-monitor.md` | Documents the required gRPC API, HealthEvent field requirements, ProcessingStrategy semantics, and the identity, transition, clearing, and isFatal rules. |
| **Demo monitor implementation**: `docs/tutorials/writing-a-health-monitor.md` | Shows the Go demo monitor with env-based configuration, Unix-socket gRPC dialing, trigger-file polling, edge-triggered emission, and event-building helpers. |
| **Validation and image build**: `docs/tutorials/writing-a-health-monitor.md` | Describes the Tilt/KIND local workflow, the fault-quarantine ruleset, and the multi-stage Dockerfile for building the demo monitor image. |
| **Production publisher and reference**: `docs/tutorials/writing-a-health-monitor.md` | Switches publishing to commons/pkg/healthpub, including socket gating, retries, a buildEvent test, and the appendix material for actions, paths, and prompt text. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐇 I hopped through docs and found a tune,
A health monitor guide beneath the moon.
From socket to chart, the path is bright,
With trigger-file steps and guidance just right.
Thump thump—this burrow feels complete!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: a new tutorial for writing a health monitor. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@github-actions

Contributor

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 227-249: Validate the parsed POLL_INTERVAL_SECONDS value in the
health monitor startup flow before creating the ticker in the polling loop. In
the section that currently parses pollSeconds and later calls time.NewTicker,
add a guard to reject zero or negative values and return a clear error instead
of proceeding. Keep the fix near the existing POLL_INTERVAL_SECONDS handling so
the failure is caught before the ticker is constructed.
- Around line 520-530: The setup steps in the tutorial need to be reordered
because `go get github.com/nvidia/nvsentinel/commons@v0.0.0` will fail in a
fresh checkout until the local replacement is already in place. Update the
instructions in the `writing-a-health-monitor` section so the `go.mod` replace
for `github.com/nvidia/nvsentinel/commons` is added first (or via `go mod edit
-replace`) before running `go get`, keeping the dependency fetch step after the
replace is configured.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fd89c363-9c5b-4a74-9877-39fa927e7df7

📥 Commits

Reviewing files that changed from the base of the PR and between 873f3ef and 5aed4a5.

📒 Files selected for processing (2)
  • docs/README.md
  • docs/tutorials/writing-a-health-monitor.md

Comment thread docs/tutorials/writing-a-health-monitor.md
Comment thread docs/tutorials/writing-a-health-monitor.md Outdated
@github-actions

Contributor

Merging this branch will decrease overall coverage

| Impacted Packages | Coverage Δ | 🤖 |
|---|---|---|
| github.com/nvidia/nvsentinel/janitor-provider/pkg/csp/generic | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/labeler | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/labeler/pkg/devicecounts | 58.85% (ø) | |
| github.com/nvidia/nvsentinel/labeler/pkg/initializer | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/labeler/pkg/labeler | 52.69% (-1.15%) | 👎 |
| github.com/nvidia/nvsentinel/labeler/pkg/metrics | 0.00% (ø) | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator | 54.73% (ø) | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler | 34.59% (ø) | |

Coverage by file

Changed files (no unit tests)

| Changed File | Coverage Δ | Total | Covered | Missed | 🤖 |
|---|---|---|---|---|---|
| github.com/nvidia/nvsentinel/janitor-provider/pkg/csp/generic/generic.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/labeler/main.go | 0.00% (ø) | 152 | 0 | 152 | |
| github.com/nvidia/nvsentinel/labeler/pkg/devicecounts/device_counts.go | 56.65% (ø) | 406 | 230 | 176 | |
| github.com/nvidia/nvsentinel/labeler/pkg/devicecounts/resource_slices.go | 89.66% (ø) | 29 | 26 | 3 | |
| github.com/nvidia/nvsentinel/labeler/pkg/initializer/init.go | 0.00% (ø) | 52 | 0 | 52 | |
| github.com/nvidia/nvsentinel/labeler/pkg/labeler/labeler.go | 58.74% (-1.29%) | 698 | 410 (-9) | 288 (+9) | 👎 |
| github.com/nvidia/nvsentinel/labeler/pkg/labeler/resource_slice_events.go | 1.22% (ø) | 82 | 1 | 81 | |
| github.com/nvidia/nvsentinel/labeler/pkg/metrics/metrics.go | 0.00% (ø) | 0 | 0 | 0 | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator/evaluator.go | 55.03% (ø) | 636 | 350 | 286 | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator/types.go | 33.33% (ø) | 9 | 3 | 6 | |
| github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler.go | 34.59% (ø) | 1590 | 550 | 1040 | |

Please note that the "Total", "Covered", and "Missed" counts above refer to code statements instead of lines of code. The value in brackets refers to the test coverage of that file in the old version of the code.

Changed unit test files

  • github.com/nvidia/nvsentinel/janitor-provider/pkg/csp/generic/generic_test.go
  • github.com/nvidia/nvsentinel/labeler/pkg/devicecounts/device_counts_test.go
  • github.com/nvidia/nvsentinel/labeler/pkg/devicecounts/resource_slices_test.go
  • github.com/nvidia/nvsentinel/labeler/pkg/labeler/labeler_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/evaluator/evaluator_integration_test.go
  • github.com/nvidia/nvsentinel/node-drainer/pkg/reconciler/reconciler_integration_test.go

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch 2 times, most recently from 89c9f74 to ec0590a Compare June 26, 2026 12:45

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 15-19: The blockquotes in the tutorial markdown currently contain
blank lines that trigger markdownlint MD028, so tighten the copy in the affected
section and the repeated block around Appendix C by removing the internal blank
lines or splitting each quoted paragraph into separate blockquote blocks. Update
the relevant markdown text in the tutorial content so the quoted lines remain
readable but no longer have blank lines inside the same blockquote.
- Around line 263-265: The trigger-file check is treating all stat failures as
“healthy,” which can hide permission or I/O problems and clear an active fault;
update the fileExists helper and the demo check logic so only os.IsNotExist is
interpreted as the file being absent. Return the error from fileExists instead
of collapsing every os.Stat failure to false, and in the demo check path that
logs with slog.Info, propagate non-not-exist errors as unhealthy rather than
deriving healthy solely from fileExists. Apply the same fix to both occurrences
referenced by the demo snippets so the health monitor behavior is consistent.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 62ef8d12-b0b4-4b72-9c3c-2e94b6c95a67

📥 Commits

Reviewing files that changed from the base of the PR and between 5aed4a5 and ec0590a.

📒 Files selected for processing (2)
  • docs/README.md
  • docs/tutorials/writing-a-health-monitor.md
✅ Files skipped from review due to trivial changes (1)
  • docs/README.md

Comment thread docs/tutorials/writing-a-health-monitor.md Outdated
Comment thread docs/tutorials/writing-a-health-monitor.md Outdated
- A fault-quarantine rule that acts on its events.
- Unit tests and the production-grade publishing path.

> **Who is this for?** Developers new to NVSentinel. You need to know basic Go and

Collaborator

Can we anchor this more for developers who are extending NVSentinel? Similar to how anyone can write a controller for Kubernetes, some may be accepted upstream, but some may not need to be upstream?

Contributor Author

Does that mean the user is already familiar with NVSentinel, so we don't need to include information about it in the doc?

Collaborator

Meaning the user is already using it and wants to extend it, so just some basic information as a refresher.

Contributor Author

Rephrased


---

## 1. What a health monitor is (the 2-minute version)

Collaborator

nit:

Suggested change
## 1. What a health monitor is (the 2-minute version)
## 1. What a health monitor is

Contributor Author

done

| `checkName` | string | yes | The specific check, e.g. `"DemoHealthCheck"`. |
| `isHealthy` | bool | yes | `true` = recovery/clear; `false` = a fault. |
| `isFatal` | bool | yes | `true` means "serious enough to act on". Quarantine rules typically require `isFatal == true`. |
| `message` | string | recommended | Human-readable description. |

Collaborator

This is the error message as-is from the hardware, not a human-readable description.

Contributor Author

done

| `entitiesImpacted` | repeated Entity | optional | Sub-components affected, e.g. `{entityType:"GPU", entityValue:"3"}`. |
| `errorCode` | repeated string | optional | Vendor/error codes. |
| `metadata` | map<string,string> | optional | Free-form key/values. |
| `quarantineOverrides` / `drainOverrides` | BehaviourOverrides | optional | `{force, skip}` to override default cordon/drain behaviour. |

Collaborator

let's not document this, let's drop this for now

Contributor Author

done

| `errorCode` | repeated string | optional | Vendor/error codes. |
| `metadata` | map<string,string> | optional | Free-form key/values. |
| `quarantineOverrides` / `drainOverrides` | BehaviourOverrides | optional | `{force, skip}` to override default cordon/drain behaviour. |
| `id` | string | optional | Leave empty; platform-connector assigns one. |

Collaborator

we don't have to mention this

Contributor Author

done


---

## 9. Variant: cluster watcher (Deployment)

Collaborator

Can this be skipped?

Contributor Author

removed


---

## 10. Package as a Helm subchart

Collaborator

Can be skipped

Contributor Author

removed

> `helm template demo-health-monitor distros/kubernetes/nvsentinel/charts/demo-health-monitor`
> and validate with `helm lint distros/kubernetes/nvsentinel/charts/demo-health-monitor`.

### Make remediation act on your events

Collaborator

Let's not mention this here, we can do this as a part of the fault quarantine tutorial

Contributor Author

removed


---

## 11. Register in the build

Collaborator

Can be skipped

Contributor Author

removed


## Appendix C: One-shot AI prompt

Paste this to an AI coding agent working in the NVSentinel repo. Replace the bracketed

Collaborator

This needs to work on an empty repo

Contributor Author

done

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 64487fd to 02e66f5 Compare June 26, 2026 16:01

@coderabbitai coderabbitai Bot left a comment

Contributor

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
docs/tutorials/writing-a-health-monitor.md (1)

40-40: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Align the socket path with the mount.

The diagram says the monitor talks to unix:///var/run/nvsentinel.sock, but the deploy step mounts hostPath /var/run/nvsentinel into /var/run. That only works if the socket actually lives under a nested directory, so readers will end up wiring the wrong path. Please make the mount and URI agree.

Also applies to: 368-369
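
To make the mismatch concrete, here is a small Go sketch (an illustration only; `hostSocketPath` is a hypothetical helper, not part of NVSentinel). Given the gRPC target the monitor dials, the container mount path, and the hostPath behind that mount, it computes where the socket file would have to live on the host for the target to resolve.

```go
package main

import (
	"fmt"
	"net/url"
	"path/filepath"
	"strings"
)

// hostSocketPath (hypothetical) maps a unix:// target seen inside the
// container back to the host path it resolves to through a hostPath
// mount. It returns an error if the socket is not under the mount.
func hostSocketPath(target, mountPath, hostPath string) (string, error) {
	u, err := url.Parse(target)
	if err != nil {
		return "", err
	}
	if u.Scheme != "unix" {
		return "", fmt.Errorf("target %q is not a unix socket URI", target)
	}

	// Path of the socket relative to the container mount point.
	rel, err := filepath.Rel(mountPath, u.Path)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, "../") {
		return "", fmt.Errorf("socket %s is outside mount %s", u.Path, mountPath)
	}
	return filepath.Join(hostPath, rel), nil
}

func main() {
	// Values taken from the review comment above.
	p, err := hostSocketPath(
		"unix:///var/run/nvsentinel.sock", // URI shown in the diagram
		"/var/run",                        // mountPath in the container
		"/var/run/nvsentinel",             // hostPath on the node
	)
	fmt.Println(p, err)
}
```

With the values from the comment, the helper resolves to `/var/run/nvsentinel/nvsentinel.sock` on the host, which is the nested location the reviewer points out. If the socket actually lives elsewhere, either the mount path or the URI in the tutorial needs to change.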

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/tutorials/writing-a-health-monitor.md` at line 40, The diagram and
deploy instructions use mismatched socket paths, so update the health monitor
docs to make the mount target and the gRPC URI refer to the same location.
Adjust the wording and diagram in the writing-a-health-monitor tutorial around
the monitor-to-Platform Connector connection so the socket path shown for the
monitor matches the hostPath mount used in the deployment step.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f8a3d558-7b55-443b-920b-297b665bae97

📥 Commits

Reviewing files that changed from the base of the PR and between ec0590a and 64487fd.

📒 Files selected for processing (1)
  • docs/tutorials/writing-a-health-monitor.md

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 02e66f5 to 1105a3a Compare June 26, 2026 16:06

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🧹 Nitpick comments (1)
docs/tutorials/writing-a-health-monitor.md (1)

152-173: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Clarify the go.mod scaffold wording.

This says "two local replace directives", but the snippet only adds data-models here; commons is introduced later in section 8. As written, it reads like a missing step.

🛠 Suggested wording
-Create `go.mod` with the two local replace directives every monitor uses (the module
-path is resolved locally, not from a registry):
+Create `go.mod` with the local replace directive needed for the demo module (the
+module path is resolved locally, not from a registry):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/tutorials/writing-a-health-monitor.md` around lines 152 - 173, The
`go.mod` scaffold wording is inconsistent with the snippet because it claims
“two local replace directives” even though this section only adds the
`data-models` replace and introduces `commons` later in section 8. Update the
tutorial text around the `go.mod` example to say this scaffold includes the
local replace for `data-models` only, and mention that the `commons` replace is
added later when the production publisher is introduced, so readers do not think
a step is missing.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/tutorials/writing-a-health-monitor.md`:
- Around line 626-651: The publish example currently introduces fakePC but does
not use it, and the guidance about healthpub.New is incomplete. Update the
tutorial section around fakePC, TestBuildEvent_Unhealthy, and the Publish
example so the fake client is actually used when demonstrating delivery through
healthpub, or remove the fakePC mention entirely if only buildEvent is being
tested. Make the example consistent with the non-unix target behavior of
healthpub.New so the fakePC path is exercised instead of left unused.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 15e923c0-04d0-4d5e-9098-d2553ea37bdd

📥 Commits

Reviewing files that changed from the base of the PR and between 64487fd and 02e66f5.

📒 Files selected for processing (2)
  • docs/README.md
  • docs/tutorials/writing-a-health-monitor.md
✅ Files skipped from review due to trivial changes (1)
  • docs/README.md

Comment thread docs/tutorials/writing-a-health-monitor.md Outdated
@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 1105a3a to c7c3147 Compare June 26, 2026 17:26
nitz2407 added 2 commits June 26, 2026 23:04
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from c7c3147 to fb27671 Compare June 26, 2026 17:34

- A working Go health monitor that detects a fault and reports it to NVSentinel.
- A container image for it, deployed to a live cluster as a DaemonSet.
- A fault-quarantine rule that acts on its events.

Collaborator

I thought we didn't want to handle this here?

Contributor Author

I think we should keep it here; without it, the flow is incomplete.


```bash
cd <repo-root> # the NVSentinel repository root
docker build -t demo-health-monitor:dev \
```

Collaborator

Can we move the build and publish steps to the previous section?

Contributor Author

Done

**2. Load it into the KIND cluster** so the node can use it without pushing to a registry:

```bash
kind load docker-image demo-health-monitor:dev --name kind-nvsentinel
```

Collaborator

let's have the users push to dockerhub and pull from there

Contributor Author

Done


> Tear down the monitor with `kubectl delete ds/demo-health-monitor -n nvsentinel`.

### Trigger and verify

Collaborator

let's just validate node conditions

Contributor Author

Done


---

## 8. Level up to production-grade

Collaborator

let's drop this

Contributor Author

Done

@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from fb27671 to 294532a Compare June 27, 2026 07:18

## 7. Run and verify

Deploy the image (built in [section 6](#6-containerize)) as a DaemonSet, then trigger a fault

Collaborator

nit: we don't need the callback

Suggested change
Deploy the image (built in [section 6](#6-containerize)) as a DaemonSet, then trigger a fault
Deploy the image as a DaemonSet, then trigger a fault

Contributor Author

Done

```bash
docker push docker.io/<dockerhub-user>/demo-health-monitor:dev
```

> Use a **public** Docker Hub repository so the cluster can pull without credentials. For a

Collaborator

We can rephrase this to say something like: if you are using a private repo, make sure you create the required image pull credentials.

Contributor Author

Done

Deploy the image (built in [section 6](#6-containerize)) as a DaemonSet, then trigger a fault
and watch it flow through the pipeline.

> **Prerequisites (tools):** `docker`, `kubectl`.

Collaborator

Let's put all the prerequisites at the top rather than per section.

Contributor Author

Done


> **Prerequisites (tools):** `docker`, `kubectl`.

> **Prerequisite — a running NVSentinel cluster.** Verification needs NVSentinel already

Collaborator

Let's keep it simple: a cluster with NVSentinel running.

Contributor Author

Done

containers:
- name: demo-health-monitor
image: docker.io/<dockerhub-user>/demo-health-monitor:dev
imagePullPolicy: Always

Collaborator

Suggested change
imagePullPolicy: Always
imagePullPolicy: IfNotPresent

Contributor Author

Done


### Trigger and verify

> Assumes the [prerequisite](#7-run-and-verify) is met: the cluster is up and

Collaborator

This is the second time we have mentioned this; let's drop the duplicate information. Let's keep all the prerequisites at the top and not keep repeating ourselves.

Contributor Author

Done

scheduled there (replace the IP with whatever you set `CHECK_TARGET` to):

```bash
docker exec kind-nvsentinel-worker ip route add blackhole 192.168.1.1/32
```

Collaborator

this will not work for a real cluster

Contributor Author

Mentioned specifically that the example uses the KIND cluster for verification.

back (`Status=False`, `Reason=NetworkReachabilityIsHealthy`, `Message="No Health Failures"`):

```bash
docker exec kind-nvsentinel-worker ip route del blackhole 192.168.1.1/32
```

Collaborator

will not work on a real cluster

Contributor Author

Mentioned specifically that the example uses the KIND cluster for verification.

```bash
-o jsonpath='{range .status.conditions[?(@.type=="NetworkReachability")]}{.type}{" "}{.status}{" "}{.reason}{" "}{.message}{"\n"}{end}'
```

> **Want automatic remediation (cordon/drain)?** Recording the condition proves your event

Collaborator

I think we talked about this earlier. This tutorial should stop at the node conditions.

Contributor Author

Done


---

## Appendix A: Key file locations

Collaborator

Let's drop this

Contributor Author

Done

Signed-off-by: Nitin Jain (SW-CLOUD) <nitijain@nvidia.com>
@nitz2407 nitz2407 force-pushed the nitijain/NKX-12166 branch from 294532a to ae4c4a2 Compare June 29, 2026 10:07