Releases: NVIDIA/NVSentinel

Release v1.12.0

github-actions released this 29 Jun 12:46
v1.12.0
4ac11c6

This release prevents the DCGM connectivity error that fired on every new node during GPU Operator bootstrapping, adds device-count labels from the labeler so downstream consumers can detect nodes reporting fewer GPUs or NICs than expected, adds an opt-in Magic SysRq reboot path for the generic bare-metal provider, and includes reliability fixes for the labeler, node-drainer, and the PostgreSQL store.

Major New Features

Prevent DCGM Connectivity Errors on Node Bootstrapping (#1425, #1423)

On a freshly launched node, the gpu-health-monitor pod could become ready before the GPU Operator's nvidia-dcgm pod finished its init-container startup sequence, producing a GpuDcgmConnectivityFailure unhealthy condition (DCGM_CONNECTIVITY_ERROR, CONTACT_SUPPORT) that only cleared minutes later once DCGM came up. The gpu-health-monitor is no longer scheduled until the nvidia-dcgm pod on the node is ready, so a normal node bootstrap no longer emits a false connectivity error. Node-deletion teardown behavior is unchanged.

Expected Device-Count Labels from Labeler (#1395)

The labeler can now write normalized current and expected device-count labels (e.g. nvsentinel.dgxc.nvidia.com/gpu.count.current / .expected) onto nodes, giving downstream modules a signal for detecting nodes that advertise fewer devices than their peers. The current count is derived from a configurable CEL expression that supports both device-plugin/GFD-style node labels and DRA ResourceSlice advertisements; the expected count is either learned from peers in the same grouping-label partition or pinned via per-class overrides. The feature is configured per device class (GPU, NIC) in a TOML ConfigMap and is disabled by default. See ADR-043 for the design.
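As a rough illustration of how a downstream consumer might use these labels, the sketch below compares the current and expected counts on a node's label map. The helper name missingDevices is hypothetical, and the full ".expected" label key is inferred from the shorthand in the example above; check the labeler configuration for the keys actually written in a given deployment.

```go
package main

import (
	"fmt"
	"strconv"
)

// Label keys follow the example in the release notes. The ".expected" key is
// inferred from the "/ .expected" shorthand and is an assumption.
const (
	gpuCurrentLabel  = "nvsentinel.dgxc.nvidia.com/gpu.count.current"
	gpuExpectedLabel = "nvsentinel.dgxc.nvidia.com/gpu.count.expected"
)

// missingDevices (hypothetical helper) returns how many devices a node is
// short of its expected count. ok is false when either label is absent or is
// not an integer, in which case no conclusion should be drawn.
func missingDevices(labels map[string]string, currentKey, expectedKey string) (missing int, ok bool) {
	current, err := strconv.Atoi(labels[currentKey])
	if err != nil {
		return 0, false
	}
	expected, err := strconv.Atoi(labels[expectedKey])
	if err != nil {
		return 0, false
	}
	if current >= expected {
		return 0, true
	}
	return expected - current, true
}

func main() {
	// Made-up label values for a node that advertises 7 of 8 GPUs.
	labels := map[string]string{gpuCurrentLabel: "7", gpuExpectedLabel: "8"}
	if n, ok := missingDevices(labels, gpuCurrentLabel, gpuExpectedLabel); ok && n > 0 {
		fmt.Printf("node is missing %d GPU(s)\n", n)
	}
}
```

Because the feature is disabled by default, a consumer written this way treats missing labels as "unknown" rather than as a fault.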

Opt-In SysRq Reboot for the Generic Bare-Metal Provider (#1418)

The generic bare-metal janitor provider now supports an opt-in Linux Magic SysRq reboot mode (janitor-provider.csp.generic.useSysrqReboot=true), which reboots a node by writing b to /proc/sysrq-trigger from a privileged Job rather than using the default chroot-based reboot. This is useful on hosts where the chroot path is unreliable. The existing chroot-based reboot remains the default, so existing deployments are unaffected.

Bug Fixes & Reliability

  • Lazily initialize ResourceSlice informers in the labeler (#1422): The labeler eagerly started a DRA ResourceSlice informer even when device-count detection used only the device-plugin method, spamming failed to list *v1.ResourceSlice: the server could not find the requested resource errors on clusters without the resource.k8s.io API. The informer is now initialized lazily only when a class actually requires ResourceSlice data, and string digits are normalized to numbers during count evaluation. Follow-up to the device-count feature (#1395).
  • Node-drainer ignores stale AlreadyQuarantined events (#1419, #1415): A stale AlreadyQuarantined event re-enqueued via a later change-stream update — after the node had already been unquarantined and its quarantineHealthEvent annotation removed — was treated as "not already drained" and fell through to normal drain evaluation. That marked the stale event Succeeded and mutated the node-state label (triggering an invalid none -> draining transition) despite there being no active quarantine context. The already-drained check now handles a missing annotation on a stale AlreadyQuarantined event correctly instead of proceeding to drain.
  • Fixed PostgreSQL UpdateDocument placeholder collision (#1391): In the direct PostgreSQL store, UpdateDocument did not bind SET parameters before WHERE parameters, so combined update+filter statements could apply parameters in the wrong order. WHERE placeholders are now shifted after the update args (regex-based, so multi-digit placeholders such as $10 are not rewritten incorrectly) and executed with update args followed by filter args.
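The placeholder fix in the last item above can be sketched as follows. This is a minimal illustration rather than the store's actual code: shiftPlaceholders, buildUpdate, and the health_events table name are hypothetical. It shows why a regex that matches the whole placeholder is safer than plain string replacement, which would rewrite the "$1" inside "$10".

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// placeholderRE matches a whole PostgreSQL positional placeholder ($1, $10, ...),
// so "$10" is never treated as "$1" followed by "0".
var placeholderRE = regexp.MustCompile(`\$(\d+)`)

// shiftPlaceholders (hypothetical helper) renumbers every $n in clause to $(n+offset).
func shiftPlaceholders(clause string, offset int) string {
	return placeholderRE.ReplaceAllStringFunc(clause, func(m string) string {
		n, err := strconv.Atoi(m[1:]) // strip the leading '$'
		if err != nil {
			return m
		}
		return "$" + strconv.Itoa(n+offset)
	})
}

// buildUpdate (hypothetical helper) combines a SET clause and a WHERE clause
// that were each numbered from $1, and returns args in the matching order:
// update args first, then filter args.
func buildUpdate(table, setClause string, setArgs []any, whereClause string, whereArgs []any) (string, []any) {
	query := fmt.Sprintf("UPDATE %s SET %s WHERE %s",
		table, setClause, shiftPlaceholders(whereClause, len(setArgs)))
	args := append(append([]any{}, setArgs...), whereArgs...)
	return query, args
}

func main() {
	query, args := buildUpdate(
		"health_events", // hypothetical table name
		"status = $1, note = $2", []any{"done", "ok"},
		"id = $1", []any{42},
	)
	fmt.Println(query)
	fmt.Println(args...)
}
```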

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.12.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.11.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.12.0 \
  --namespace nvsentinel \
  --reuse-values

Release v1.11.0

github-actions released this 22 Jun 12:58
v1.11.0
7cf58f4

This release improves the accuracy of Mean Time To Repair (MTTR) reporting by letting dashboards distinguish automated remediations from events that require manual intervention, and fixes a checkpoint-advancement bug in the fault-remediation reconciler that could silently drop live health events on cold start.

Major New Features

Recommended-action label on MTTR metrics (#1406)

The fault_quarantine_node_remediation_duration_excluding_drain_seconds MTTR histogram now carries a recommended_action label. Previously, nodes that required manual handling (e.g. a CONTACT_SUPPORT recommended action) could sit cordoned for hours before an operator acted, and that long idle time was bucketed alongside genuine automated remediations, inflating MTTR on Grafana dashboards. With the new label, dashboards can filter out CONTACT_SUPPORT and other manual events so MTTR reflects only automated remediations.

Bug Fixes & Reliability

  • Fixed cold-start checkpoint advancement on document ID errors (#1411): Cold-start events are enqueued without resume tokens, but the document-ID error path in the fault-remediation reconciler called the watcher directly, where an empty token could resolve to the current MongoDB or PostgreSQL stream position and advance the checkpoint past events that had not yet been handled. Document-ID extraction failures are now routed through safeMarkProcessed, so cold-start events are no longer incorrectly marked processed and remediation events are no longer silently lost.

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @fallintoplace.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.11.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.10.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.11.0 \
  --namespace nvsentinel \
  --reuse-values

Release v1.10.0

github-actions released this 15 Jun 12:56
v1.10.0
6c78e01

NVSentinel v1.10.0 expands GPU health coverage with a new GPU thermal-margin watch, reduces memory footprint through optional per-policy namespace scoping in the Kubernetes Object Monitor, and lays the API foundation for external breakfix coordination via the ExternalRemediationRequest CRD. This release also adds finer-grained scheduling control for platform connectors, fixes fault-remediation handling of deleted nodes, and corrects Helm rendering for proxy-terminated PostgreSQL TLS configurations.

Major New Features

GPU Thermal Margin Watch (#1371, #1388)

Adds a new GpuThermalMarginWatch health check to the GPU health monitor that detects when a GPU crosses its hardware thermal-slowdown boundary. The monitor samples DCGM field 153 (DCGM_FI_DEV_GPU_TEMP_LIMIT) as a signed thermal-margin signal and compares it against a per-GPU hardware slowdown threshold; because that offset varies by SKU and is not exposed by DCGM, the metadata-collector reads it once per GPU via NVML field 194 (NVML_FI_DEV_TEMPERATURE_SLOWDOWN_TLIMIT) and publishes it in gpu_metadata.json. When a GPU's live margin falls below its slowdown threshold, NVSentinel raises a fatal GpuThermalMarginWatch event (error code GPU_TEMP_HW_SLOWDOWN_VIOLATION, recommended action CONTACT_SUPPORT) and clears it automatically once the margin recovers. Unlike the existing GpuThermalWatch, which only signals that throttling increased, this gives operators a quantifiable measure of how far past the hardware slowdown line a GPU has gone. The feature is opt-in via enable/store-only toggles. A companion operator runbook walks responders through confirming the alert with live telemetry and nvidia-smi, checking per-GPU threshold metadata, applying remediation, and reproducing the condition under load.
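The raise-and-clear behavior described above amounts to a small per-GPU state machine. The sketch below is an assumption-laden illustration, not the monitor's implementation: the thermalWatch type, integer units, and the sample threshold and margins are all made up. It only shows the comparison of a live margin against a per-GPU threshold and the emission of one event per transition.

```go
package main

import "fmt"

// thermalWatch (hypothetical type) tracks, per GPU, whether the live margin
// is currently below that GPU's slowdown threshold.
type thermalWatch struct {
	thresholds map[string]int  // per-GPU threshold, e.g. read from gpu_metadata.json
	violating  map[string]bool // GPUs with an active violation
}

func newThermalWatch(thresholds map[string]int) *thermalWatch {
	return &thermalWatch{thresholds: thresholds, violating: map[string]bool{}}
}

// observe records one margin sample and returns "raise" when the GPU first
// drops below its threshold, "clear" when it recovers, and "" otherwise.
// GPUs without a known threshold are ignored.
func (w *thermalWatch) observe(gpu string, margin int) string {
	threshold, known := w.thresholds[gpu]
	if !known {
		return ""
	}
	below := margin < threshold
	switch {
	case below && !w.violating[gpu]:
		w.violating[gpu] = true
		return "raise"
	case !below && w.violating[gpu]:
		delete(w.violating, gpu)
		return "clear"
	}
	return ""
}

func main() {
	// Made-up threshold and samples, for illustration only.
	w := newThermalWatch(map[string]int{"GPU-0": -2})
	for _, margin := range []int{10, 1, -3, -5, 4} {
		if action := w.observe("GPU-0", margin); action != "" {
			fmt.Printf("margin=%d -> %s GPU_TEMP_HW_SLOWDOWN_VIOLATION\n", margin, action)
		}
	}
}
```

Emitting only on transitions keeps a GPU that stays past the line from producing a new event on every sample.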

Optional Namespace Scoping in KOM Policies (#1394)

The Kubernetes Object Monitor (KOM) now supports optional per-policy namespace scoping. Setting resource.namespace on a namespaced resource in a KOM policy instructs controller-runtime to build an informer cache scoped to a single namespace rather than watching every object of that GVK cluster-wide, dramatically reducing memory usage for high-cardinality resources such as Pods. In testing, a Pod-watching policy scoped to one namespace held steady at ~18Mi even with 2000 pods scheduled in an unmonitored namespace, versus ~153Mi (roughly 9x) when watching cluster-wide. The field is rejected for cluster-scoped resources; leave it unset when cluster-wide monitoring is genuinely required.

ExternalRemediationRequest CRD Foundation (#1376)

Introduces the foundation for the ExternalRemediationRequest (ERR) CRD, a new coordination surface in the nvsentinel.dgxc.nvidia.com API group that lets NVSentinel hand off node ownership to an external breakfix system. This first PR ships the API shape only: the apiserver now accepts ERR objects via a new proto-generated CRD packaged in the janitor Helm chart, with scheme registration and RBAC granting janitor access to externalremediationrequests plus its status and finalizers subresources. It also adds custom protojson marshaling so proto well-known types (such as the Timestamp on Condition.lastTransitionTime) serialize as RFC3339, and centralizes the nvsentinel.dgxc.nvidia.com/managed node label and ERR identity constants in a new commons/pkg/managed package. This is foundational/preview only: no component observes ERR objects yet, and the reconciler, fault-remediation producer, and node-labeler gating land in follow-up PRs.

Affinity Support for Platform Connector (#1375)

The NVSentinel Helm chart now supports a platformConnector.affinity value, letting operators control how the platform connector DaemonSet pods are scheduled onto nodes. When set, the affinity block (for example, nodeAffinity rules matching custom node labels) is rendered into the DaemonSet's pod spec; when left empty (the default is {}) it renders nothing, so existing deployments are unaffected. This is useful for pinning connectors to specific node pools or hardware. The change includes a new scheduling configuration reference in the platform-connectors docs.

Bug Fixes & Reliability

  • Ignore deleted nodes in fault remediation (#1396, #1387): Fixed fault-remediation retrying health events forever when the target node had been deleted from the cluster. Previously, GetRemediationState/checkExistingCRStatus failed with a Kubernetes "Node not found" error before the event could be marked terminal, so controller-runtime kept retrying and cold-start re-enqueued the stale event on every restart. The reconciler now detects the not-found error via apierrors.IsNotFound, marks remediation events for deleted nodes terminal with faultRemediated=false (cancellation events terminal with faultRemediated=true), and advances the change-stream resume token so the event is recorded as processed and never retried again.
  • Render valid platform-connectors DaemonSet without a client cert (#1397, #1241): Fixed the platform-connectors DaemonSet in the umbrella Helm chart so it renders a valid manifest when PostgreSQL is the datastore but no client certificate is mounted (platformConnector.postgresqlStore.clientCertMountPath set to ""), a common configuration when TLS is terminated by a cloud-sql-proxy sidecar. Previously the template emitted a volumeMount referencing a non-existent volume, causing ArgoCD and Kubernetes to reject the DaemonSet as invalid. The cert volume, volumeMount, and fix-cert-permissions init container are now only rendered when a mount path is actually configured, bringing the DaemonSet in line with the subchart Deployments that already handled this case.
  • Fixed preflight E2E test flakiness (#1374): Hardened the preflight E2E test helper to comprehensively wait for and validate all preflight-related init containers before assertions run, eliminating race conditions where assertions could execute against pods whose init containers had not yet finished. Test/CI reliability only; no runtime behavior change.
  • Fixed preflight test flakiness on webhook restart (#1372): Fixed intermittent preflight test failures that occurred when the admission webhook restarted mid-test by wrapping test GPU pod creation in an automatic retry and correcting pod object construction in the test utilities. Test/CI reliability only; no runtime behavior change.

Documentation

  • Supported GPU architectures (#1389): Adds an official "GPU Support" section to the README and OVERVIEW docs listing the validated NVIDIA GPU architectures: Volta (V100), Ampere (A100), Hopper (H100), Ada Lovelace (L4/L40/L40S), and Blackwell (B200, GB200, GB300, RTX Pro 6000). It clarifies that NVSentinel works with any GPU supported by the NVIDIA GPU Operator, and documents that while most components run across all architectures, the optional nccl-loopback preflight check compiles GPU kernels targeting only Ampere, Ada Lovelace, Hopper, and Blackwell, so it must be disabled or skipped on Volta/V100 nodes.

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.10.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.9.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.10.0 \
  --namespace nvsentinel \
  --reuse-values

Release v1.9.0

github-actions released this 08 Jun 15:58
v1.9.0
30ee12d

This release tightens preflight semantics so DCGM execution errors and non-actionable diagnostic failures no longer block workloads, adds per-init-container controls for inheriting workload env/volume mounts (so NCCL loopback can run in a clean environment while allreduce can pick up workload fabric config), introduces an optional image-cache DaemonSet for preflight images, adds an out-of-cluster deployment mode for platform-connector, and fixes a startup-race bug that left syslog-health-monitor using an empty GPU driver version for the lifetime of the pod.

Major New Features

Per-Init-Container Env & Volume Inheritance Flags (#1370)

Preflight init containers previously inherited workload environment variables matching ncclEnvPatterns and volume mounts matching volumeMountPatterns uniformly across every check. That was too broad — workload-specific NCCL/fabric configuration could poison checks meant to run with a curated environment (e.g. NCCL loopback inheriting workload settings that alter local GPU P2P/NVLink/NVSwitch behavior). Each preflight init container can now opt in or out of inheritance independently:

- name: preflight-nccl-loopback
  inheritUserEnv: false
  inheritUserVolumeMounts: false
- name: preflight-nccl-allreduce
  inheritUserEnv: true            # workload fabric config still flows through
  inheritUserVolumeMounts: true

Built-in checks default to curated environments; deployments can opt in selectively where inheritance is actually required.
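To make the effect of inheritUserEnv concrete, here is a small sketch of per-check environment filtering. It assumes that ncclEnvPatterns entries behave like shell-style globs (matched here with path.Match); the real matching rules, the checkConfig type, and the inheritedEnv helper are assumptions for illustration.

```go
package main

import (
	"fmt"
	"path"
	"sort"
)

// checkConfig (hypothetical type) mirrors the per-init-container flag shown above.
type checkConfig struct {
	Name           string
	InheritUserEnv bool
}

// inheritedEnv returns the sorted names of workload environment variables a
// check would receive: none when inheritance is off, otherwise those that
// match at least one pattern. Glob semantics are an assumption.
func inheritedEnv(check checkConfig, workloadEnv map[string]string, patterns []string) []string {
	if !check.InheritUserEnv {
		return nil
	}
	var names []string
	for name := range workloadEnv {
		for _, pattern := range patterns {
			if ok, err := path.Match(pattern, name); err == nil && ok {
				names = append(names, name)
				break
			}
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	workloadEnv := map[string]string{"NCCL_DEBUG": "INFO", "NCCL_IB_HCA": "mlx5", "HOME": "/root"}
	patterns := []string{"NCCL_*"}

	loopback := checkConfig{Name: "preflight-nccl-loopback", InheritUserEnv: false}
	allreduce := checkConfig{Name: "preflight-nccl-allreduce", InheritUserEnv: true}

	fmt.Println(loopback.Name, inheritedEnv(loopback, workloadEnv, patterns))
	fmt.Println(allreduce.Name, inheritedEnv(allreduce, workloadEnv, patterns))
}
```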

Image-Cache DaemonSet for Preflight Images (#1365)

New optional DaemonSet preflight-image-cache pre-pulls all preflight check images on every node, eliminating cold-start image-pull latency from the critical path of a workload's first preflight run. Each container in the DaemonSet idles after pulling its image. Gated behind imageCache.enabled (default false), with configurable resources, pod annotations, and scheduling overrides. A pod-template config checksum annotation forces a rollout when the config content changes.

Out-of-Cluster Platform-Connector Deployment (#1359)

platform-connectors accepts an optional --kubeconfig flag for explicit out-of-cluster Kubernetes authentication. The kubeconfig path is threaded through startup, connector initialization, and pipeline transformer creation so both the Kubernetes connector and MetadataAugmentor use the same client config when platform-connectors runs outside the cluster (e.g., under systemd). When --kubeconfig is unset, existing in-cluster auth behavior is unchanged.

Synced DCGM Error Mappings (#1369)

Updated dcgmerrorsmapping.csv to match the latest upstream DCGM dcgm_errors.h enum. New mappings:

  • DCGM_FR_SRAM_THRESHOLD, DCGM_FR_NVLINK_EFFECTIVE_BER_THRESHOLD, DCGM_FR_NVLINK_SYMBOL_BER_THRESHOLD, DCGM_FR_IMEX_UNHEALTHY, DCGM_FR_FABRIC_PROBE_STATE, DCGM_FR_BINARY_PERMISSIONS, DCGM_FR_GPU_RECOVERY_DRAIN_P2P (recommended action: CONTACT_SUPPORT)
  • DCGM_FR_FALLEN_OFF_BUS, DCGM_FR_GPU_RECOVERY_REBOOT (recommended action: RESTART_BM)
  • DCGM_FR_GPU_RECOVERY_RESET, DCGM_FR_GPU_RECOVERY_DRAIN_RESET, DCGM_FR_NCCL_ERROR (recommended action: COMPONENT_RESET)

Bug Fixes & Reliability

  • DCGM_ST_* Should Not Fail Preflight (#1364, #1363): DCGM_ST_* codes (e.g. DCGM_ST_IN_USE, DCGM_ST_DIAG_ALREADY_RUNNING) are diagnostic execution failures — the framework could not complete the run — not confirmed hardware faults. Previously these surfaced as fatal health events that cordoned the node. preflight-dcgm-diag now retries on DCGM_ST_* for a configurable number of attempts (DCGM_DIAG_STATUS_RETRY_MAX_ATTEMPTS, DCGM_DIAG_STATUS_RETRY_INTERVAL_SECONDS); if the status persists it emits a non-fatal unhealthy HealthEvent with RecommendedAction=NONE (carrying the DCGM_ST_* status name in the errorCode) and exits successfully so the workload is not blocked. Also adds clean shutdown — dcgmStopDiagnostic is called on termination signals.
  • Preflight-DCGM-Diag Non-Actionable Failures Are Non-Fatal (#1358): DCGM diag failures whose recommended action resolves to NONE (e.g., XID detected during the run with no actionable remediation) are now emitted as non-fatal — the init container exits 0 and the workload's next preflight container runs. Previously these triggered Init:Error and blocked the workload. Bumped DCGM to 4.5.2 to match gpu-health-monitor.
  • Syslog-HM Driver Version Startup Race (#1362): Fixed a long-standing startup race where syslog-health-monitor cached DriverVersion = "" if the monitor started before metadata-collector populated /var/lib/nvsentinel/gpu_metadata.json. The stale empty value was then used for the lifetime of the pod, breaking driver-version-dependent XID 144–150 decoding (the analyzer fell back to WORKFLOW_NVLINK5_ERR / CONTACT_SUPPORT instead of returning RESET_GPU / COMPONENT_RESET). Subtle because #1302 had already fixed metadata recovery for PCI → GPU UUID lookups, masking this code path. GetDriverVersion() now reloads metadata at request time when the cached value is empty, so the monitor recovers once metadata-collector writes the file. A new Prometheus metric tracks XID decode requests that ran without a driver version.
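The DCGM_ST_* handling in the first item above follows a retry-then-downgrade pattern, which can be sketched as below. This is a simplified model under stated assumptions: runDiag is a hypothetical callback that returns a DCGM_ST_* status name when the run could not execute and an empty string otherwise, and the attempt count is assumed to include the first try. The two parameters stand in for DCGM_DIAG_STATUS_RETRY_MAX_ATTEMPTS and DCGM_DIAG_STATUS_RETRY_INTERVAL_SECONDS.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// outcome (hypothetical type) summarizes what the init container would report.
type outcome struct {
	ExitCode          int
	Fatal             bool
	RecommendedAction string
	ErrorCode         string // DCGM_ST_* status name when the run never executed
}

// runWithStatusRetry retries while runDiag reports a DCGM_ST_* execution
// status. If the status persists, it returns a non-fatal outcome with
// RecommendedAction NONE and exit code 0 so the workload is not blocked.
func runWithStatusRetry(runDiag func() string, maxAttempts int, interval time.Duration) outcome {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	status := ""
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		status = runDiag()
		if !strings.HasPrefix(status, "DCGM_ST_") {
			// The diagnostic ran; its results are evaluated elsewhere.
			return outcome{ExitCode: 0}
		}
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return outcome{ExitCode: 0, Fatal: false, RecommendedAction: "NONE", ErrorCode: status}
}

func main() {
	// Simulated diagnostic that is always busy.
	busy := func() string { return "DCGM_ST_IN_USE" }
	result := runWithStatusRetry(busy, 3, 10*time.Millisecond)
	fmt.Printf("%+v\n", result)
}
```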

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @sulixu.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.9.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.8.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.9.0 \
  --namespace nvsentinel \
  --reuse-values

Release v1.8.0

github-actions released this 01 Jun 13:20
v1.8.0
fc43d07

This release replaces node-drainer's FIFO worker queue with a two-lane priority queue so a single noisy node can no longer starve drains on other nodes, adds a drainGPUPods flag to scope eviction to GPU-requesting workloads, makes drain and quarantine overrides configurable from the kubernetes-object-monitor, fills in missing recommended actions for newer XIDs, and remediates several CVEs across container images and the Go toolchain.

Major New Features

Priority Queue for Node-Drainer (#1341)

Replaced node-drainer's ready-FIFO ordering with a two-lane priority queue layered under the existing Kubernetes rate-limiting workqueue. Events for nodes that have not yet reached draining get one high-priority representative; additional queued work for the same node stays low-priority to prevent grouped floods from blocking later nodes. Queue priority state is in-memory and follows successful node label transitions — setting draining marks the node as draining, while unquarantine or terminal drain labels clear it. Retry, drain action evaluation, and health-event lifecycle semantics are unchanged. A new Prometheus counter node_drainer_queue_items_assigned_total{priority, reason} tracks assignment decisions.
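The lane-assignment rule described above can be modeled in a few lines. The sketch below is a simplified, in-memory illustration and not the node-drainer's code: the laneAssigner type and its method names are hypothetical, and it omits the rate-limiting workqueue, retries, and metrics. It shows only how one high-priority representative per not-yet-draining node keeps a flood of events for one node from blocking others.

```go
package main

import "fmt"

type priority string

const (
	high priority = "high"
	low  priority = "low"
)

// laneAssigner (hypothetical type) keeps the in-memory state used to pick a lane.
type laneAssigner struct {
	draining map[string]bool // nodes whose label has reached draining
	hasHigh  map[string]bool // nodes that already have a high-priority representative
}

func newLaneAssigner() *laneAssigner {
	return &laneAssigner{draining: map[string]bool{}, hasHigh: map[string]bool{}}
}

// assign picks a lane for a newly queued event on node.
func (a *laneAssigner) assign(node string) priority {
	if a.draining[node] || a.hasHigh[node] {
		return low
	}
	a.hasHigh[node] = true
	return high
}

// markDraining records a successful transition of the node label to draining.
func (a *laneAssigner) markDraining(node string) {
	a.draining[node] = true
	delete(a.hasHigh, node)
}

// clearNode records an unquarantine or a terminal drain label.
func (a *laneAssigner) clearNode(node string) {
	delete(a.draining, node)
	delete(a.hasHigh, node)
}

func main() {
	a := newLaneAssigner()
	for _, node := range []string{"node-a", "node-a", "node-a", "node-b"} {
		fmt.Println(node, a.assign(node))
	}
}
```

In the demo, the second and third events for node-a land in the low lane, so the first event for node-b is not stuck behind them.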

drainGPUPods Filter (#1310, #1264)

New Helm flag node-drainer.drainGPUPods (default false) restricts pod eviction during fault remediation to workloads that request GPU resources (nvidia.com/gpu or nvidia.com/pgpu). When enabled, CPU-only pods (logging agents, monitoring sidecars, infrastructure DaemonSets) stay running on the node, while GPU workloads — the ones actually blocked by the GPU fault — are evicted. The filter inspects both regular containers and init containers. Default behavior is unchanged so existing deployments are unaffected.
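A minimal sketch of the filter follows. It uses simplified stand-in types rather than the Kubernetes API types, and the function names are hypothetical; it only illustrates checking both regular and init containers for a nvidia.com/gpu or nvidia.com/pgpu request.

```go
package main

import "fmt"

// container and pod are simplified stand-ins for the Kubernetes API types.
type container struct {
	Name     string
	Requests map[string]int64 // resource name -> requested quantity
}

type pod struct {
	Name           string
	Containers     []container
	InitContainers []container
}

// gpuResources lists the resource names named in the release notes.
var gpuResources = []string{"nvidia.com/gpu", "nvidia.com/pgpu"}

// requestsGPU reports whether any regular or init container requests a GPU resource.
func requestsGPU(p pod) bool {
	for _, group := range [][]container{p.Containers, p.InitContainers} {
		for _, c := range group {
			for _, resource := range gpuResources {
				if c.Requests[resource] > 0 {
					return true
				}
			}
		}
	}
	return false
}

// podsToEvict returns every pod when drainGPUPods is false (the default
// behavior) and only GPU-requesting pods when it is true.
func podsToEvict(pods []pod, drainGPUPods bool) []pod {
	if !drainGPUPods {
		return pods
	}
	var selected []pod
	for _, p := range pods {
		if requestsGPU(p) {
			selected = append(selected, p)
		}
	}
	return selected
}

func main() {
	pods := []pod{
		{Name: "trainer", Containers: []container{{Name: "main", Requests: map[string]int64{"nvidia.com/gpu": 8}}}},
		{Name: "log-agent", Containers: []container{{Name: "agent", Requests: map[string]int64{"cpu": 1}}}},
	}
	for _, p := range podsToEvict(pods, true) {
		fmt.Println("evict:", p.Name)
	}
}
```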

Drain & Quarantine Overrides from Kubernetes Object Monitor (#1342)

drainOverrides and quarantineOverrides are now configurable on health events emitted by kubernetes-object-monitor policies, matching the support that already existed in other monitors. Cluster operators can declare per-policy overrides directly in the TOML/YAML config:

healthEvent:
  componentClass: Node
  isFatal: true
  message: "Node is not ready"
  recommendedAction: CONTACT_SUPPORT
  errorCode:
    - NODE_NOT_READY
  quarantineOverrides:
    force: true                # or skip: true; do not set both
  drainOverrides:
    skip: true                 # or force: true; do not set both

force and skip are mutually exclusive per override block; the chart validates this at template time. This unlocks scenarios like "cordon the node but do not evict pods" (the example tested in the PR) without requiring a separate health monitor.

Bug Fixes & Reliability

  • Missing XID Recommended Actions (#1343): Filled in recommended actions for XIDs that were listed in the XID analyzer catalog but missing from the gpu-health-monitor mapping: an additional GPU recovery scenario now triggers COMPONENT_RESET, and fabric-related failures now trigger RESTART_VM. Bringing the mapping in line with the catalog prevents these XIDs from being silently classified as NONE/CONTACT_SUPPORT.
  • Preflight Build Platform Arg + FQ CEL for Preflight (#1352): Fixed a missing --platform argument in the preflight-checks Docker build/publish targets that caused multi-platform image operations to silently produce single-platform artifacts. Also added a new fault-quarantine CEL policy so nodes are cordoned when preflight agents emit fatal health events (respecting existing node-exclusion settings) — preflight failures now flow through the same cordon path as other monitors.

Security & Infrastructure

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback! Special thanks to first-time contributor @coderuhaan2004.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.8.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.7.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.8.0 \
  --namespace nvsentinel \
  --reuse-values

Release v1.7.0

github-actions released this 26 May 18:46
v1.7.0
c28bf0c

This release fixes a fault-remediation bug where every historical cancellation replayed on every restart (eventually causing OOM kills), adds a Helm gate to disable the external-MongoDB setup job for tenants who provision the database themselves, brings the docs site onto NVIDIA's shared Fern global theme, and ships a large set of GitHub repository automation workflows.

Major New Features

External MongoDB Setup Job Gate (#1311)

The post-install/post-upgrade hook job that provisions collections, indexes, and x509 users on external MongoDB can now be disabled independently of the external-MongoDB configuration. Set global.datastore.setupJob.enabled: false to opt out — useful for deployments where the datastore is provisioned out-of-band and the setup job's auth requirements don't match the tenant identity. Defaults to true, so existing deployments are unaffected.

Repository Automation Workflows (#1306)

Adds a suite of GitHub Actions workflows and issue templates for repository hygiene:

  • Merge conflict check — runs on PR creation and main push; adds a needs-rebase label when a PR diverges from main.
  • Dependabot auto-merge — auto-merges Dependabot PRs that contain only semver-patch updates.
  • Issue triage — applies needs-triage and area/* labels to new issues.
  • Labeler — applies area/* labels to PRs based on the paths touched.
  • Welcome — posts a templated message on first-time contributors' issues and PRs.
  • Inactive PR reminder — comments on PRs that have been inactive for 14–30 days.
  • Issue SLAs — labels and comments on issues that have breached priority-tiered SLAs.
  • Lock threads — locks closed issues and PRs after 90 days.

New issue templates for documentation requests and updates are added; the Question template is removed in favor of Discussions; the Bug/Feature templates now require a contributor agreement checkbox and add a component selector.

Bug Fixes & Reliability

  • Fault-Remediation Cancellation Completion Marker (#1335): Fixed a bug where handleCancellationEvent cleared Kubernetes annotations and advanced the change-stream resume token but never wrote faultRemediated back to MongoDB, while the cold-start cancellation query had no faultremediated == nil filter. Together this meant every historical cancellation replayed on every fault-remediation restart, growing monotonically and eventually causing OOM kills. The fix:

    • handleCancellationEvent now calls updateNodeRemediatedStatus(true) after clearing annotations, writing the same completion marker the remediation path already writes.
    • The cold-start cancellation query leg now requires faultremediated == nil, so already-processed cancellations are excluded.
    • The call returns an error (rather than just logging) if the marker write fails, preventing the resume token from advancing without a durable terminal state.
  • Slinky Drainer Annotation Prefix (#1318): Corrected the node annotation prefix used by the Slinky Drainer plugin from [J] [NVSentinel] to [T] [NVSentinel] so automated breakfix is detected with the expected T prefix. Demo documentation updated to match.

Docs Site

  • NVIDIA Global Theme (#1320, #1321): Migrated the Fern docs site from per-repo theme assets to the shared global-theme: nvidia, deleting ~1,126 lines of custom theme code (footer/badge components, NVIDIA SVGs, main.css, and the footer/layout/colors/theme/logo/favicon/js/css blocks in docs.yml). Added multi-source: true to the Fern instance config so the global theme's JS bundle (OneTrust cookie consent SDK) loads alongside the CSS portion. Fern CLI was bumped to 5.30.2 (required for global-theme support).

  • Frozen-Only Versioning (#1319, #1315): All versions in the docs dropdown now serve frozen content from their git tag — the "live docs" entry served from main has been removed. The newest version is stamped "Latest · vX.Y.Z" transiently at publish time. Eliminates duplicate dropdown entries, off-by-one pruning, and the dependency on the GitHub releases API for stamping. Version entries are sorted by semver descending (sort -rV) after insertion, so backport patches like v1.5.1 don't end up above newer releases; registration is now skipped when the publishing tag equals the latest release (the "Latest" stamp already covers it).

  • CI Runner Migration (#1324): Standardized CI runners onto a dedicated linux-amd64-cpu4 flavor to unblock Dependabot PR merging.

Acknowledgments

This release includes contributions from:

Thanks also to @rohansav for diagnosing and authoring the cancellation completion marker fix that was cherry-picked into #1335.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.7.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.6.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.7.0 \
  --namespace nvsentinel \
  --reuse-values

Release v1.6.0

github-actions released this 19 May 06:37
v1.6.0
f9e2daa

This release adds configurable XID cancellation in the syslog health monitor, repeated-NIC analyzer rules for non-fatal degradation signals, a producer-side gate that prevents stale health events from skewing fault-quarantine metrics during platform-connector outages, identity-aware node-condition compaction that fixes stuck conditions on long entity names, and the v1.5.1 fault-remediation cold-start fix for users upgrading directly from v1.5.0.

Major New Features

XID Cancellation in Syslog Health Monitor (#1270)

The syslog-health-monitor can now be configured with cancellation rules that suppress related XID events when a source XID is observed. Rules are declared in a TOML ConfigMap by source/target error code:

cancellations:
  - name: SysLogsXIDError
    enabled: true
    rules:
      - onErrorCode: "162"
        cancelErrorCodes: ["163"]

When a source XID fires, the monitor emits a synthetic healthy event that clears matching target XIDs from the node condition. The platform-connector and fault-quarantine resolve health events by errorCode when present (falling back to entities-impacted otherwise), so the existing resolution semantics for non-XID checks are unaffected. A new Prometheus metric counts emitted cancellations by check / source / target error code.
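The rule evaluation described above can be sketched as a lookup from an observed source code to synthetic healthy events for its targets. The types and the cancellationsFor helper below are hypothetical and ignore entities, metrics, and the enabled flag; they only illustrate the source-to-target expansion.

```go
package main

import "fmt"

// rule mirrors one entry under "rules" in the configuration example above.
type rule struct {
	OnErrorCode      string
	CancelErrorCodes []string
}

// healthEvent is a simplified stand-in for the real health event message.
type healthEvent struct {
	CheckName string
	ErrorCode string
	IsHealthy bool
}

// cancellationsFor (hypothetical helper) returns one synthetic healthy event
// per target error code of every rule whose source matches the observed code.
func cancellationsFor(checkName, observedCode string, rules []rule) []healthEvent {
	var events []healthEvent
	for _, r := range rules {
		if r.OnErrorCode != observedCode {
			continue
		}
		for _, target := range r.CancelErrorCodes {
			events = append(events, healthEvent{CheckName: checkName, ErrorCode: target, IsHealthy: true})
		}
	}
	return events
}

func main() {
	rules := []rule{{OnErrorCode: "162", CancelErrorCodes: []string{"163"}}}
	for _, e := range cancellationsFor("SysLogsXIDError", "162", rules) {
		fmt.Printf("%+v\n", e)
	}
}
```

Carrying the target errorCode on the healthy event is what lets the resolvers clear only the matching XID rather than everything on the impacted entity.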

Repeated NIC Analyzer Rules (#1272)

Two new Health Events Analyzer rules escalate repeated non-fatal NIC signals:

  • RepeatedNICDriverError: escalates selected non-fatal SysLogsNICDriverError patterns when the same pattern repeats 3 times on a node within 1 hour. Noisy diagnostic-only signals like access_reg_failed are excluded from escalation.
  • RepeatedNICDegradation: escalates non-fatal NIC degradation events when the same NIC + NICPort sees 3 degradation events within 1 hour.

Both rules escalate to CONTACT_SUPPORT rather than REPLACE_VM — deterministic NIC failures still use first-event REPLACE_VM, while repeated diagnostic/degradation signals are surfaced for human triage. Aggregation is scoped to the same NIC + NICPort so events on different ports do not aggregate incorrectly.
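The "3 events within 1 hour on the same NIC + NICPort" condition is a sliding-window count keyed by port. The sketch below is an illustration under assumptions: repeatTracker, portKey, and minutesIn are hypothetical names, events are assumed to arrive in chronological order, and an event exactly one hour old is treated as outside the window.

```go
package main

import (
	"fmt"
	"time"
)

// portKey scopes aggregation to one NIC port on one node.
type portKey struct{ Node, NIC, Port string }

// repeatTracker (hypothetical type) keeps recent event times per port.
type repeatTracker struct {
	window    time.Duration
	threshold int
	seen      map[portKey][]time.Time
}

// newRepeatTracker uses the values from the release notes: 3 events in 1 hour.
func newRepeatTracker() *repeatTracker {
	return &repeatTracker{window: time.Hour, threshold: 3, seen: map[portKey][]time.Time{}}
}

// record stores an event and reports whether the port has now reached the
// threshold within the window ending at the event time.
func (t *repeatTracker) record(k portKey, at time.Time) bool {
	cutoff := at.Add(-t.window)
	kept := t.seen[k][:0]
	for _, ts := range t.seen[k] {
		if ts.After(cutoff) {
			kept = append(kept, ts)
		}
	}
	kept = append(kept, at)
	t.seen[k] = kept
	return len(kept) >= t.threshold
}

// minutesIn is a demo helper returning a fixed start time plus m minutes.
func minutesIn(m int) time.Time {
	return time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC).Add(time.Duration(m) * time.Minute)
}

func main() {
	tracker := newRepeatTracker()
	port1 := portKey{Node: "node-a", NIC: "mlx5_0", Port: "1"}
	port2 := portKey{Node: "node-a", NIC: "mlx5_0", Port: "2"}

	fmt.Println(tracker.record(port1, minutesIn(0)))
	fmt.Println(tracker.record(port1, minutesIn(20)))
	fmt.Println(tracker.record(port2, minutesIn(30))) // different port, counted separately
	fmt.Println(tracker.record(port1, minutesIn(40))) // third event on port1 within the hour
}
```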

Bug Fixes & Reliability

  • Platform-Connector Outage Gating (#1259): When platform-connector restarted (graceful redeploy, OOM, helm upgrade), every health monitor on the node held in-flight events in its retry loop with the original GeneratedTimestamp. When platform-connector returned, those stale events landed at fault-quarantine and were misattributed as multi-minute fault_quarantine_node_quarantine_duration_seconds histogram entries, even when fault-quarantine actually cordoned in ≤100 ms. Each monitor now stat-checks the platform-connector Unix socket before every gRPC send; if the socket is missing the send is skipped (no buffering, no cache mutation) and the next polling cycle re-emits the event with a fresh timestamp. Recovery is bounded by the polling cadence regardless of how long the outage lasts. A shared publisher in commons/pkg/healthpub consolidates the gate, retry policy, and Prometheus counters across all Go monitors; the Python gpu-health-monitor gets the same gate inline. Also fixes a related bug where syslog-health-monitor.handleBootIDChange persisted the new BootID before delivering post-reboot healthy events — any send failure left those events permanently lost. BootID is now persisted only after every healthy event has been delivered, and a pendingPostRebootBootIDClear is retried at the top of every poll cycle.

  • Node Condition Cleanup for Truncated Entity Messages (#1304): Fixed an issue where entities with long values (e.g., v1/Pod:prod/61f345d08c9a432a-134a464884734f90) would be byte-truncated mid-token by the platform-connector's per-message compaction, leaving subsequent healthy events unable to clear the condition (the exact-substring cleanup lookup never matched the truncated form). compactMessageField is rewritten to parse the structured identity prefix (ErrorCode + entity tokens) and only truncate the trailing diagnostic free-text — identity tokens are never byte-truncated. A backward-compatible entityMatchesMessage helper falls back to prefix matching when there is evidence of truncation (token ends in ... or is the last token with no Recommended Action=), so nodes already carrying truncated conditions from older releases can also be cleared.

  • Fault-Quarantine Empty-Annotation Handling (#1309): Fixed a bug where fault-quarantine treated quarantineHealthEvent: "[]" as an active quarantine. When fault-quarantine processed a healthy event that cleared the last entity from a quarantined node, it wrote the annotation as an empty JSON array before performUncordon() removed the key entirely. If fault-quarantine restarted or hit a conflict before the key was removed, the next fatal event for that node followed the handleAlreadyQuarantinedNode path, which appended the event without cordoning the node. The fix adds a shared annotation.IsEmptyValue() helper that treats "", whitespace, and "[]" as absent; it is used by hasExistingQuarantine() and the related test helpers. The same PR also hardens NIC E2E teardown to restart the NIC monitor before deleting the fake sysfs tree, eliminating a burst of false "device disappeared" fatal events that contaminated downstream tests.

  • NIC Fatal Events Cordon Nodes (#1288): Updated fault-quarantine rules so fatal syslog-health-monitor events for the NIC component class now cordon nodes (previously only GPU did), and added a new ruleset for fatal nic-health-monitor events. E2E coverage was extended to assert that fatal NIC events cordon and that recovery uncordons. Also prevents node-drainer from marking drain status terminal when a node-state label update fails, so the event can be retried instead of leaving DB and node state inconsistent.

  • Syslog HM Metadata Cache Retry (#1302, #1287): Fixed an issue where the syslog-health-monitor cached the metadata-collector output regardless of parse success — a failed parse poisoned the cache and prevented later retries from picking up valid metadata. Metadata is now cached only after a successful parse; failed parses are retried on the next lookup. Fixes a class of PCI-to-GPU-UUID resolution failures that persisted even after the metadata file appeared on disk.

  • Fault-Remediation Cold-Start Replay (#1281): Brings the v1.5.1 hotfix forward for users upgrading directly from v1.5.0. Cold-start replay for intentionally-skipped events without a terminal remediation status could create duplicate RebootNode CRs after fault-quarantine had uncordoned the node, and unsupported actions like CONTACT_SUPPORT could re-apply the remediation-failed label on every fault-remediation restart. The fix uses the existing node remediation annotation to close stale events for the covered equivalence groups before clearing the annotation (faultremediated=true on UnQuarantined, faultremediated=false on Cancelled and unsupported actions), and shares the cold-start eligibility query between cold-start and cleanup paths. See v1.5.1 release notes for the full description.

  • Fern Docs Preview Build (#1285, #1284): Frozen version content (v1.2.0, v1.3.0, v1.4.0) is now included in PR preview builds via the same git archive loop used by the publish workflow, so previews match the production docs site. Preview-comment generation now streams fern generate output via tee instead of capturing it into a variable that swallowed errors.
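
The producer-side gate from the Platform-Connector Outage Gating fix above can be sketched as follows. This is an illustrative Python model rather than the shared Go publisher: `socket_is_present`, `publish_if_connected`, and the `probe` parameter are hypothetical names introduced for the example.

```python
import os
import stat


def socket_is_present(path):
    """Return True only if `path` exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False


def publish_if_connected(socket_path, build_event, send, probe=socket_is_present):
    """Send one event if the socket is present; otherwise skip the send.

    `build_event` is only called after the gate passes, so a skipped cycle
    buffers nothing and the next cycle builds an event with a fresh timestamp.
    """
    if not probe(socket_path):
        return False
    send(build_event())
    return True
```

A polling loop would call `publish_if_connected` once per cycle and simply try again on the next cycle when it returns False, which is why recovery time depends on the polling cadence and not on the length of the outage.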

CI / Docs Publishing

  • Auto-Registered Versioned Docs (#1290, #1292, #1293, #1291): Fern docs publish is now triggered on release tag push instead of docs/v* tags. On each release tag the workflow auto-registers the new version in the Fern dropdown via yq and prunes the list to Latest + the 3 most recent releases. Pre-release tags (containing -) still trigger publish but skip registration and pruning. Registry changes are persisted via peter-evans/create-pull-request after publish succeeds, with a pre-stamp backup so transient Latest · vX.Y.Z entries are never committed. Frozen-version checkout is hardened with glob guards, MDX brace/angle-bracket escaping, and consistent git show-ref tag verification across both publish and preview-build workflows.
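
The registration and pruning behavior can be modeled in a few lines. This is a sketch only: the real workflow edits the Fern configuration with yq, and `update_registry` and `dropdown` are hypothetical helpers that assume the pushed tag is the newest release and that the list is kept newest first.

```python
def update_registry(releases, tag, keep=3):
    """Return the frozen-release list after `tag` is pushed.

    `releases` holds release tags, newest first, without the "Latest" entry.
    """
    if "-" in tag:
        # Pre-release tag: publishing still happens, but the registry
        # is neither extended nor pruned.
        return list(releases)
    updated = [tag] + [r for r in releases if r != tag]
    return updated[:keep]


def dropdown(releases):
    """The version dropdown is "Latest" followed by the frozen releases."""
    return ["Latest"] + list(releases)


current = ["v1.4.0", "v1.3.0", "v1.2.0"]
print(dropdown(update_registry(current, "v1.5.0")))
# ['Latest', 'v1.5.0', 'v1.4.0', 'v1.3.0']
print(dropdown(update_registry(current, "v1.6.0-rc1")))
# ['Latest', 'v1.4.0', 'v1.3.0', 'v1.2.0']
```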

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.6.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.5.0 or v1.5.1:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.6.0 \
  --namespace nvsentinel \
  --reuse-values

Release v1.5.1

github-actions released this 14 May 10:42
v1.5.1
f963044

This is a hotfix release on top of v1.5.0 containing a single critical fix to the fault-remediation cold-start path.

Bug Fixes

  • Fault-Remediation Cold-Start Replay (#1281): Fixed cold-start replay for health events that were intentionally skipped but left without a terminal remediation status. Previously, some skip paths only advanced the change-stream token while leaving healtheventstatus.faultremediated == nil on the event document, so a fault-remediation restart could cold-start the stale event and process it a second time. This produced two related replay bugs:

    • A skipped event behind an equivalent in-progress remediation CR could create a duplicate RebootNode after fault-remediation restart — allowing workloads to be scheduled back onto an uncordoned node before an unwanted reboot was triggered.
    • Unsupported recommended actions (e.g., CONTACT_SUPPORT) could replay on every fault-remediation restart and re-apply the dgxc.nvidia.com/nvsentinel-state=remediation-failed label to a node that had already recovered.

    The fix uses the existing node remediation annotation to close stale events for the covered equivalence groups before clearing the annotation:

    • On UnQuarantined, covered stale events are marked faultremediated=true.
    • On Cancelled, covered stale events are marked faultremediated=false (manual/external cancellation does not prove the fault was remediated).
    • Unsupported recommended actions are now made terminal with faultremediated=false instead of remaining cold-start eligible.
    • The cold-start "unresolved remediation-ready event" query was extracted into a shared helper so cold-start and cleanup paths use the same criteria.
    • A succeeded existing CR only covers events created before the remediation annotation for that equivalence group was created — later events are treated as a new remediation session and may create a new CR.
    • Matching remediation annotation groups are evaluated deterministically by newest CreatedAt first.

    Cleanup is scoped to the equivalence group, not the node, so unrelated remediation actions on the same node are not incorrectly closed.
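
The terminal-status mapping and group scoping described above can be summarized in a short sketch. It is an illustrative model, not the fault-remediation code: the event dictionaries, the `"UnsupportedAction"` outcome label, and the function names are hypothetical.

```python
from typing import Optional


def terminal_fault_remediated(outcome: str) -> Optional[bool]:
    """Map a cleanup outcome to the terminal faultremediated value."""
    if outcome == "UnQuarantined":
        return True
    if outcome in ("Cancelled", "UnsupportedAction"):
        # Cancellation does not prove remediation; unsupported actions
        # are made terminal so they stop being cold-start eligible.
        return False
    return None


def close_stale_events(events, group, outcome):
    """Close unresolved events in one equivalence group; return how many."""
    value = terminal_fault_remediated(outcome)
    if value is None:
        return 0
    closed = 0
    for event in events:
        if event["group"] == group and event["faultremediated"] is None:
            event["faultremediated"] = value
            closed += 1
    return closed


events = [
    {"group": "restart", "faultremediated": None},
    {"group": "gpu-reset", "faultremediated": None},
]
print(close_stale_events(events, "restart", "Cancelled"))  # 1
print(events[1]["faultremediated"])                        # None
```

Filtering on the group is what leaves the unrelated `gpu-reset` event on the same node untouched.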

Acknowledgments

Thanks to @XRFXLP and @KaivalyaMDabhadkar for diagnosing and fixing this issue.

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Upgrade from v1.5.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.5.1 \
  --namespace nvsentinel \
  --reuse-values

Fresh install:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.5.1 \
  --namespace nvsentinel \
  --create-namespace

Release v1.3.1

github-actions released this 14 May 10:15
v1.3.1
705fddd

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel --version v1.3.1

Release v1.5.0

github-actions released this 11 May 12:30
v1.5.0
489b522

This release significantly expands the NIC Health Monitor (alpha) with NIC driver syslog detection and counter-based degradation checks for InfiniBand and Ethernet/RoCE, fixes the GPU reset workflow to inject devices via HostPath volumes for more reliable resets, adds a versioned docs dropdown with content pinning, and ships a critical fix for faultRemediated decoding on wrapped BoolValue records.

Major New Features

NIC Health Monitor Expansion (Alpha)

Status: Alpha. APIs, Helm values, and event schemas may change in future releases. Not recommended for production use yet — feedback and bug reports welcome.

Building on the link-state detection introduced in v1.4.0, the NIC Health Monitor now adds NIC driver syslog detection, counter-based degradation checks, and end-to-end test coverage.

  • NIC Driver Syslog Detection (#1257): New SysLogsNICDriverError check in the syslog-health-monitor detects mlx5_core NIC driver/firmware errors from kernel logs. Ships 8 default patterns (3 Fatal: cmd_exec_timeout, health_poll_failed, unrecoverable_err; 5 Non-Fatal: netdev_watchdog, pci_power_insufficient, port_module_high_temp, access_reg_failed, module_unplugged) verified against upstream Linux kernel source. Fatal patterns publish a node condition with a REPLACE_VM recommended action; non-fatal patterns emit Kubernetes events with no remediation. Operator-configurable via a TOML ConfigMap with BDF extraction for resolving the affected NIC entity. Includes Prometheus metrics for driver-error events.
  • Counter-Based Degradation Detection (#1248): Adds counter polling for InfiniBand and Ethernet/RoCE NICs on a dedicated 1s loop, with latching, reset/recovery, and both delta and velocity (per-second / per-minute) evaluation. New counterDetection configuration block supports per-counter thresholds, velocity units, fatality flags, and recommended actions. Fatal counter breaches (e.g., link_downed, excessive_buffer_overrun_errors) publish node conditions; non-fatal counters (e.g., symbol_error, roce_slow_restart, carrier_changes) emit Kubernetes events. Latched breaches persist until the underlying counter resets.
  • Configuration Hardening (#1271): NIC counter configuration is now restricted to a hardcoded allowlist of supported counter name/path pairs (thresholds and severity remain operator-tunable). Startup validation now rejects unsupported or mismatched counter selections.
  • Tilt End-to-End Tests (#1249, #1260): Comprehensive Tilt e2e coverage for both counter detection (TestNICCounterIBDegradation, TestNICCounterEthernetDegradation, TestNICCounterBelowThreshold, TestNICCounterBootIDClearsBreachState) and syslog NIC driver detection (TestSyslogHealthMonitorNICDriverDetection), exercising fatal/non-fatal paths, multi-device faults, threshold validation, latch/clear behavior, and boot-ID recovery. AWS/GCP integration and e2e workflow timeouts were raised from 60 to 75 minutes to accommodate the new coverage.
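
The delta and velocity evaluation with latching described in the counter-detection item above might look roughly like this. It is a minimal sketch under stated assumptions: `CounterState` and `evaluate` are hypothetical names, a breach is taken to mean "strictly greater than the threshold", and a counter moving backwards is treated as a reset.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CounterState:
    last_value: Optional[int] = None
    last_time: Optional[float] = None
    latched: bool = False


def evaluate(state, value, now, threshold, per_seconds=None):
    """Feed one sample into `state`; return True while a breach is latched.

    per_seconds=None selects delta mode (increase since the last sample);
    otherwise the increase is scaled to a rate per `per_seconds` seconds.
    """
    if state.last_value is None:
        # The first sample only establishes a baseline.
        state.last_value, state.last_time = value, now
        return state.latched
    if value < state.last_value:
        # Counter went backwards: treat it as a reset and clear the latch.
        state.latched = False
    else:
        delta = value - state.last_value
        elapsed = now - state.last_time
        if per_seconds is None:
            measure = delta
        elif elapsed > 0:
            measure = delta * per_seconds / elapsed
        else:
            measure = 0
        if measure > threshold:
            state.latched = True
    state.last_value, state.last_time = value, now
    return state.latched


s = CounterState()
samples = [(100, 0.0), (105, 1.0), (120, 2.0), (120, 3.0), (0, 4.0)]
print([evaluate(s, v, t, threshold=10) for v, t in samples])
# [False, False, True, True, False]
```

The fourth sample shows the latch: the counter stopped increasing, but the breach stays reported until the counter resets.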

GPU Reset HostPath Architecture (#1243)

Fixed the GPU reset workflow to inject GPU devices via HostPath volumes instead of NVIDIA_VISIBLE_DEVICES, eliminating a class of failures where the privileged reset pod could not have GPUs injected on nodes with reset-pending XIDs (nvidia-container-cli.real: detection error: nvml error: gpu requires reset).

  • The reset job now mounts /run/nvidia/driver and /sys via HostPath and invokes nvidia-smi through chroot /run/nvidia/driver. NVIDIA_VISIBLE_DEVICES=void disables nvidia-container-toolkit injection for the reset container.
  • Manual persistence-mode toggling is removed — nvidia-smi --gpu-reset now handles persistence mode automatically because /run/nvidia-persistenced is available through the driver mount.
  • The gpu-feature-discovery pod is now evicted in addition to nvidia-device-plugin, nvidia-dcgm, and nvidia-dcgm-exporter, since GFD opens NVML handles in a loop that can block resets.
  • HostNetwork=true is no longer required for the reset workflow.
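
To make the HostPath layout concrete, the sketch below builds the relevant parts of a reset container spec as plain Python dictionaries. Only the mount paths, the `NVIDIA_VISIBLE_DEVICES=void` setting, the privileged mode, and the `chroot /run/nvidia/driver` invocation come from the notes above; the container and volume names, the `-i` GPU selector, and the function names are assumptions made for illustration.

```python
def reset_volumes():
    """HostPath volumes for the driver root and /sys."""
    return [
        {"name": "driver", "hostPath": {"path": "/run/nvidia/driver"}},
        {"name": "sys", "hostPath": {"path": "/sys"}},
    ]


def reset_container(gpu_id, image):
    """Container spec fragment for a HostPath-based GPU reset (hypothetical)."""
    return {
        "name": "gpu-reset",  # hypothetical container name
        "image": image,
        "securityContext": {"privileged": True},
        # "void" turns off nvidia-container-toolkit device injection.
        "env": [{"name": "NVIDIA_VISIBLE_DEVICES", "value": "void"}],
        # Run nvidia-smi from the driver root instead of the container image.
        "command": [
            "chroot", "/run/nvidia/driver",
            "nvidia-smi", "--gpu-reset", "-i", gpu_id,
        ],
        "volumeMounts": [
            {"name": "driver", "mountPath": "/run/nvidia/driver"},
            {"name": "sys", "mountPath": "/sys"},
        ],
    }
```

Because the GPUs reach the container through the mounted paths instead of toolkit injection, the pod can start even while a GPU is reporting that it requires a reset.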

Versioned Docs Dropdown (#1263, #1262)

The Fern docs site now ships a version dropdown with true content pinning for frozen releases.

  • Replaces the single dev version with Latest plus frozen entries (v1.2.0, v1.3.0, v1.4.0).
  • Frozen versions serve docs extracted from their git tags via git archive at publish time — users no longer see drift between docs and the release they're running.
  • The Latest display-name is stamped with the current GitHub release tag during CI.
  • Publish workflow gains a concurrency group, set -o pipefail, and step-summary URLs.

Bug Fixes & Reliability

  • faultRemediated Decoding for Wrapped BoolValue Records (#1255): Fixed health event status decoding when healtheventstatus.faultremediated is stored as a protobuf BoolValue document ({"value": false}) while the datastore model exposes it as *bool. Adds JSON/BSON compatibility for both wrapped and plain boolean shapes so legacy datastore records and proto/change-stream payloads both decode safely; writers continue emitting the wrapped shape expected by proto/change-stream consumers. Normalizes legacy plain-boolean values before proto unmarshalling in the shared event parser — fixes decoding on the node-drainer query path against MongoDB and PostgreSQL.
  • Fern Docs CI Hardening (#1251, #1250): Closed a regex gap in the MDX safety check that let bare <img> tags slip through, replaced the hardcoded fern-api@4.42.1 pin with a dynamic lookup against fern/fern.config.json, and added the production custom domain (docs.nvidia.com/nvsentinel) and canonical-host metadata.
  • macOS Dev Environment Setup (#1126, #1125): make dev-env-setup now works end-to-end on macOS (Apple Silicon). Replaces wget (not installed by default on macOS) with curl, installs Go via Homebrew when the manual tar extraction would fail, installs missing tools (addlicense, protoc-gen-go, golangci-lint, gotestsum, gocover-cobertura) at the pinned versions from .versions.yaml, adds GOPATH/bin to PATH, and uses pipx for Poetry to avoid PEP 668 externally-managed-environment errors.
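
The dual-shape decoding from the faultRemediated fix above can be illustrated with a small helper. This is a sketch covering only the two shapes named in that item (a plain boolean and a wrapped `{"value": ...}` document); the function names are hypothetical and the real implementation lives in the project's JSON/BSON codecs.

```python
def decode_fault_remediated(raw):
    """Decode a stored faultremediated field into True, False, or None."""
    if raw is None:
        return None  # status not set yet
    if isinstance(raw, bool):
        return raw  # legacy plain-boolean shape
    if isinstance(raw, dict) and isinstance(raw.get("value"), bool):
        return raw["value"]  # wrapped BoolValue shape
    raise ValueError(f"unsupported faultremediated shape: {raw!r}")


def encode_fault_remediated(value):
    """Writers keep emitting the wrapped shape."""
    return None if value is None else {"value": value}


print(decode_fault_remediated({"value": False}))  # False
print(decode_fault_remediated(True))              # True
print(decode_fault_remediated(None))              # None
print(encode_fault_remediated(False))             # {'value': False}
```

Accepting both shapes on read while writing only one keeps older records readable without changing what downstream consumers receive.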

Acknowledgments

This release includes contributions from:

Thank you to everyone who contributed code, testing, documentation, design reviews, and feedback!

Container Images

See versions.txt for the full list of container images and versions.

Helm Chart

Install with:

helm install nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.5.0 \
  --namespace nvsentinel \
  --create-namespace

To upgrade from v1.4.0:

helm upgrade nvsentinel oci://ghcr.io/nvidia/nvsentinel \
  --version v1.5.0 \
  --namespace nvsentinel \
  --reuse-values