Skip to content

feat: nvswitch telemetry gaps#2945

Merged
mkoci merged 27 commits into
NVIDIA:mainfrom
mkoci:nvswitch_telemetry_gaps
Jun 30, 2026
Merged

feat: nvswitch telemetry gaps#2945
mkoci merged 27 commits into
NVIDIA:mainfrom
mkoci:nvswitch_telemetry_gaps

Conversation

@mkoci

@mkoci mkoci commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Description

This PR covers the NV Switch port from #2283. It closes the GB200 NVSwitch telemetry gaps for NVUE gNMI streaming, NVUE REST, and NMX-T.

Type of Change

  • Add - New feature or capability

Related Issues

#2283

Testing

  • Unit tests added/updated
  • Manual testing performed

Additional Notes

Gated previously insecure TLS verification behind dangerously_skip_tls_verification in config surface

  • Source mappings were validated against live GB200 NVSwitch endpoints (gNMI / NVUE-REST / NMX-T).
  • Deployment note: gNMI TLS verification is now strict by default. Lab or self-signed
    NVOS gNMI endpoints must set dangerously_skip_tls_verification = true to connect.

@copy-pr-bot

copy-pr-bot Bot commented Jun 27, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 27, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 955de19a-3a92-4e02-a2d9-613b495758c4

📥 Commits

Reviewing files that changed from the base of the PR and between 4e7af93 and 886fd6d.

📒 Files selected for processing (4)
  • crates/health/example/config.example.toml
  • crates/health/src/collectors/nvue/gnmi/sample_processor.rs
  • crates/health/src/collectors/nvue/rest/collector.rs
  • crates/health/src/config.rs
🚧 Files skipped from review as they are similar to previous changes (3)
  • crates/health/example/config.example.toml
  • crates/health/src/config.rs
  • crates/health/src/collectors/nvue/rest/collector.rs

Summary by CodeRabbit

  • New Features
    • Expanded NVUE health monitoring with platform environment (fan, temperature, status) and platform-general metrics toggles.
    • Added sensor range metrics for min/max thresholds.
    • Improved OTLP metric naming to consistently include a configurable prefix.
  • Bug Fixes
    • Made TLS behavior explicit: strict verification is the default for NMX-T and NVUE gNMI (with opt-in skipping via configuration).
    • Updated status-style outputs to use StateSet-style state series instead of numeric codes.
    • Refined metric identity handling so switch serial uses the correct OTLP resource key.

Walkthrough

This PR updates health collector configuration, NVUE gNMI and REST metric collection, NMX-T scraping, OTLP export naming, and sensor threshold handling.

Changes

Configuration, TLS Abstraction, and OTLP Naming

Layer / File(s) Summary
New config fields and example config
crates/health/src/config.rs, crates/health/example/config.example.toml
Adds TLS-skip flags, NVUE platform-general enablement, NVUE REST platform-environment enablement, and updates the bundled example config and parsing tests.
TLS verifier helper
crates/health/src/collectors/nvue/tls.rs
Removes self_signed_tls_config() and adds accept_any_cert_verifier() with updated rustls imports.
OTLP metric prefix and serial key
crates/health/src/otlp/convert.rs, crates/health/src/otlp/metrics_drain.rs, crates/health/src/sink/otlp.rs
Extends OTLP export naming with a metric prefix, changes datapoint attribute handling, renames the serial resource key, and threads the prefix through the drain task and sink wiring.

NMX-T Collector Allowlist Rewrite

Layer / File(s) Summary
Allowlist types and scrape parsing
crates/health/src/collectors/nmxt.rs
Adds metric and label allowlists, parsed scrape sample types, temperature and state helpers, port validation, and Prometheus line parsing for labeled and unlabeled samples.
Collector wiring and scrape iteration
crates/health/src/collectors/nmxt.rs
Removes stored switch identity, builds the HTTP client with conditional TLS skipping, and rewrites scrape iteration to emit allowlisted metrics, temperature conversion, and down-blame state output.
NMX-T emission tests
crates/health/src/collectors/nmxt.rs
Covers allowlist exactness, canonical label re-export, module temperature conversion, down-blame fan-out, port validation, and once-per-port emission behavior.

NVUE gNMI: TLS Skip, Platform-General Paths, and StateSet Processing

Layer / File(s) Summary
Client TLS skip and subscribe paths
crates/health/src/collectors/nvue/gnmi/client.rs, crates/health/src/collectors/nvue/gnmi/subscriber.rs
Adds the TLS-skip flag to GnmiClient, introduces TLS endpoint configuration, extends subscription paths for platform-general, and wires the flag from configuration through the subscriber.
Sample processor StateSet and singleton emission
crates/health/src/collectors/nvue/gnmi/sample_processor.rs
Adds the NVUE sample stream constant, classifies platform-general as a singleton entity, introduces StateSet and switch-level emit helpers, and rewrites interface and component leaf handling.
Numeric leaf dispatcher and state helpers
crates/health/src/collectors/nvue/gnmi/sample_processor.rs
Introduces the table-driven numeric interface dispatcher, replaces enum-style numeric helpers with string-to-state helpers, and adds link width and speed parsing.
Sample processor test coverage
crates/health/src/collectors/nvue/gnmi/sample_processor.rs
Expands tests for mapping functions, StateSet fan-out, platform-general metrics, numeric leaf dispatch, and unmapped-leaf suppression.

NVUE REST: Platform Environment Endpoints and StateSet Emission

Layer / File(s) Summary
REST client platform environment support
crates/health/src/collectors/nvue/rest/client.rs
Adds platform-environment endpoint constants, gated fetchers, response types, and parsing tests for fan, temperature, and aggregate environment payloads.
StateSet emission and platform-environment collection
crates/health/src/collectors/nvue/rest/collector.rs
Replaces numeric status helpers with state mappings, introduces StateSet emission, and extends iteration to collect fan, temperature, and FAN_STATUS environment metrics.
REST collector tests
crates/health/src/collectors/nvue/rest/collector.rs
Adds StateSet assertions, helper mapping tests, and emission coverage for fan speed, temperature, and LED status handling.

Sensor Range Metric Emission

Layer / File(s) Summary
Sensor range metric emission
crates/health/src/collectors/sensors.rs
Introduces SensorRangeKind, derives range values in sensor updates, reuses sample fields for threshold emission, and adds tests for suffix and label formatting.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

rack health

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and clearly refers to the NVSwitch telemetry feature work in this changeset.
Description check ✅ Passed The description is directly related to the NVSwitch telemetry, TLS, and testing changes in the pull request.
Docstring Coverage ✅ Passed Docstring coverage is 93.30% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@mkoci mkoci force-pushed the nvswitch_telemetry_gaps branch from b4b9e2a to 961fd8c Compare June 28, 2026 02:28
@copy-pr-bot

copy-pr-bot Bot commented Jun 28, 2026

Copy link
Copy Markdown

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@mkoci mkoci marked this pull request as ready for review June 29, 2026 12:45
@mkoci mkoci requested a review from a team as a code owner June 29, 2026 12:45
@mkoci mkoci requested review from jayzhudev and yoks June 29, 2026 12:49

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (5)
crates/health/src/collectors/sensors.rs (1)

404-422: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a behavioral test for range metric emission.

These tests only pin the enum strings. The new contract is in Lines 347-396: include_sensor_thresholds gating, None suppression, and the sensor_range label on emitted samples. Please cover that path with a table-driven test around update_sensor or emit_sensor_range_metric; otherwise the exported series shape can drift without a failing test. As per coding guidelines, "Use table-driven test style when writing tests in Rust" and "Prefer carbide-test-support scenarios (scenarios! / value_scenarios!) or explicit cases (check_cases / check_values) for test coverage in Rust." As per path instructions, prefer findings about behavior and missing tests over style-only comments.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/collectors/sensors.rs` around lines 404 - 422, The current
tests only verify SensorRangeKind string helpers, but they do not cover the
actual range-metric emission behavior in update_sensor and
emit_sensor_range_metric. Add a table-driven behavioral test that exercises
include_sensor_thresholds gating, confirms None suppresses emission, and asserts
the emitted sample carries the sensor_range label with the expected value. Use
the existing symbols SensorRangeKind, update_sensor, and
emit_sensor_range_metric so the test stays tied to the contract and not just the
enum strings.

Sources: Coding guidelines, Path instructions

crates/health/src/otlp/convert.rs (1)

254-258: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win

Enforce the resource-only identity invariant at the converter boundary.

Line 254 copies every sample.labels entry to the datapoint, while the new test only proves identity is absent when the sample has no labels. If a collector later emits switch_id, switch_serial, or switch_serial_number, this converter will still duplicate resource identity onto datapoints.

Suggested hardening
+const OTLP_RESOURCE_ONLY_DATAPOINT_LABELS: &[&str] =
+    &["switch_id", "switch_serial", "switch_serial_number", "switch_ip"];
+
         let attributes: Vec<KeyValue> = sample
             .labels
             .iter()
+            .filter(|(k, _)| !OTLP_RESOURCE_ONLY_DATAPOINT_LABELS.contains(&k.as_ref()))
             .map(|(k, v)| kv(k, v.clone()))
             .collect();

As per coding guidelines, implementation claims should be backed by code or tests.

Also applies to: 799-803

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/otlp/convert.rs` around lines 254 - 258, The datapoint
conversion currently copies every entry from `sample.labels` into `attributes`,
which can leak resource identity onto datapoints. Update the converter boundary
so the label-to-attribute mapping in this conversion path explicitly filters out
identity keys such as `switch_id`, `switch_serial`, and `switch_serial_number`
before collecting `attributes`. Add or extend tests around the same conversion
logic to prove these labels are not propagated when present, not just when
labels are empty.

Source: Coding guidelines

crates/health/src/collectors/nmxt.rs (1)

973-1005: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Exercise the production emission path in tests.

These tests copy the same emission logic instead of calling the collector’s production emission path, so they can pass while scrape_iteration regresses. Consider extracting the post-scrape loop into a helper that accepts parsed NmxtMetricSamples and invoking that helper from both scrape_iteration and these tests. As per coding guidelines, verification should exercise the behavior that changed.

Also applies to: 1082-1114

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/collectors/nmxt.rs` around lines 973 - 1005, The test logic
around down_blame emission is duplicating the collector’s production path
instead of exercising it, so it can miss regressions in scrape_iteration.
Extract the post-scrape emission loop from scrape_iteration into a shared helper
that takes parsed NmxtMetricSample values and performs the down_blame event
generation, then call that helper from both scrape_iteration and the affected
tests (including the similar block in the other test section) so verification
covers the real behavior.

Source: Coding guidelines

crates/health/src/config.rs (1)

1889-1942: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make the new config parse test table-driven.

Line 1889 adds two duplicated parse/assert flows for the same operation. Please collapse them into cases so adding the next config variant stays cheap and consistent with the Rust test style.

♻️ Proposed refactor
-        let config: Config = Figment::new()
-            .merge(Serialized::defaults(Config::default()))
-            .merge(Toml::string(omitted))
-            .extract()
-            .expect("failed to parse omitted tls flag");
-
-        let Configurable::Enabled(nvue) = config.collectors.nvue else {
-            panic!("nvue config should be enabled");
-        };
-        let Configurable::Enabled(gnmi) = nvue.gnmi else {
-            panic!("gnmi config should be enabled");
-        };
-        assert!(!gnmi.dangerously_skip_tls_verification);
-
-        let enabled = r#"
+        let enabled = r#"
 [endpoint_sources.carbide_api]
 enabled = false
 
 [sinks.health_report]
 enabled = false
@@
 dangerously_skip_tls_verification = true
 "#;
 
-        let config: Config = Figment::new()
-            .merge(Serialized::defaults(Config::default()))
-            .merge(Toml::string(enabled))
-            .extract()
-            .expect("failed to parse enabled tls flag");
-
-        let Configurable::Enabled(nvue) = config.collectors.nvue else {
-            panic!("nvue config should be enabled");
-        };
-        let Configurable::Enabled(gnmi) = nvue.gnmi else {
-            panic!("gnmi config should be enabled");
-        };
-        assert!(gnmi.dangerously_skip_tls_verification);
+        for (name, toml, expected) in [
+            ("omitted tls flag", omitted, false),
+            ("enabled tls flag", enabled, true),
+        ] {
+            let config: Config = Figment::new()
+                .merge(Serialized::defaults(Config::default()))
+                .merge(Toml::string(toml))
+                .extract()
+                .unwrap_or_else(|_| panic!("failed to parse {name}"));
+
+            let Configurable::Enabled(nvue) = config.collectors.nvue else {
+                panic!("nvue config should be enabled for {name}");
+            };
+            let Configurable::Enabled(gnmi) = nvue.gnmi else {
+                panic!("gnmi config should be enabled for {name}");
+            };
+            assert_eq!(gnmi.dangerously_skip_tls_verification, expected, "{name}");
+        }

As per coding guidelines, "Use table-driven test style when writing tests in Rust".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/config.rs` around lines 1889 - 1942, Refactor
test_nvue_gnmi_dangerous_tls_skip_defaults_false_and_parses_true in
crates::health::config into a table-driven test instead of two duplicated
Figment parse/assert flows. Keep the same setup for Config::default,
Figment::new, and the Configurable::Enabled unwraps for config.collectors.nvue
and nvue.gnmi, but iterate over cases covering omitted and explicitly true
dangerously_skip_tls_verification and assert the expected boolean per case. This
will make adding future GNMI config variants consistent and cheap.

Source: Coding guidelines

crates/health/src/collectors/nvue/gnmi/sample_processor.rs (1)

785-788: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the missing table case for interface_link_down_events.

The mapping exists, but the new numeric table test does not assert its emitted metric type/unit; the older no-sink test only checks entity count. Add it to keep the mapping table fully covered. As per coding guidelines, verification should exercise the behavior that changed.

Proposed test addition
             (
                 &["phy-diag", "state", "plr-bw-loss-percent"],
                 "interface_plr_bw_loss_percent",
                 "percent",
             ),
+            (
+                &["phy-diag", "state", "unintentional-link-down-events"],
+                "interface_link_down_events",
+                "count",
+            ),
         ];

Also applies to: 1758-2067

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/collectors/nvue/gnmi/sample_processor.rs` around lines 785
- 788, The mapping for interface_link_down_events is present in
sample_processor.rs, but the numeric table test still does not verify its
emitted metric type and unit. Update the table-driven coverage in the sample
processor tests to include the interface_link_down_events case and assert the
expected metric name, type, and unit so the new mapping is exercised by the
changed-behavior test rather than only the older no-sink entity-count check.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/health/example/config.example.toml`:
- Around line 225-226: The example config is missing the new fan
platform-environment toggle, so update the configuration example around the
platform_environment_* entries to include platform_environment_fan_enabled
alongside platform_environment_temperature_enabled and
platform_environment_status_enabled. Use the existing NvueRestPaths-related
platform environment section as the reference point so the new option is
documented consistently with the other enabled flags.

In `@crates/health/src/collectors/nmxt.rs`:
- Around line 249-250: Remove the free-text Status_Message mapping from
NMXT_LABEL_MAP or otherwise filter it out before MetricSample construction in
nmxt.rs, so sink-agnostic metric events never carry this high-cardinality label.
Update the mapping/collection path around NMXT_LABEL_MAP and the MetricSample
emission logic to keep only stable, low-cardinality labels, and apply the same
fix anywhere else the allowlist is built or reused.
- Around line 479-485: The NMX-T HTTP client currently bypasses TLS verification
unconditionally in the `NmxtCollector` setup. Add a
`dangerously_skip_tls_verification` field to `NmxtCollectorConfig`, thread that
config into the `reqwest::Client::builder()` path, and only call
`danger_accept_invalid_certs(true)` when that option is enabled; otherwise build
the client with normal certificate validation.

In `@crates/health/src/collectors/nvue/gnmi/sample_processor.rs`:
- Around line 771-777: The metric name for the retry-events leaf is incorrect in
the nvue GNMI sample mapping and is currently conflated with the retry-codes
metric. Update the relevant entry in sample_processor.rs for the
plr-xmit-retry-events-within-t-sec-max leaf so the exported name uses an
events-based identifier (not codes), and make the same rename in the matching
duplicate mapping referenced by the reviewer. Keep the surrounding mapping
structure and unit unchanged, and use the existing sample definitions to ensure
the new name stays aligned with the leaf semantics.
- Around line 892-923: Both link parsing helpers currently accept any
successfully parsed f64, so invalid values can flow into MetricSample and
downstream exporters. Update link_width_to_f64 and link_speed_to_gbps to reject
NaN, infinities, and negative numbers after parsing, returning None for those
cases. Keep the validation close to the parsing logic in these two functions,
and add table-driven tests covering finite positives plus NaN, infinity, and
negative edge cases.

In `@crates/health/src/collectors/nvue/rest/collector.rs`:
- Around line 79-86: Update fan_max_speed_to_f64 in the NVUE REST collector to
reject invalid fan RPM values before they are emitted. After parsing the string
into f64, filter out non-finite results using is_finite() and return None for
any value less than 0.0; keep temp_to_f64 unchanged. Add tests covering NaN,
inf, and a negative RPM case to verify fan_max_speed_to_f64 drops them.

---

Nitpick comments:
In `@crates/health/src/collectors/nmxt.rs`:
- Around line 973-1005: The test logic around down_blame emission is duplicating
the collector’s production path instead of exercising it, so it can miss
regressions in scrape_iteration. Extract the post-scrape emission loop from
scrape_iteration into a shared helper that takes parsed NmxtMetricSample values
and performs the down_blame event generation, then call that helper from both
scrape_iteration and the affected tests (including the similar block in the
other test section) so verification covers the real behavior.

In `@crates/health/src/collectors/nvue/gnmi/sample_processor.rs`:
- Around line 785-788: The mapping for interface_link_down_events is present in
sample_processor.rs, but the numeric table test still does not verify its
emitted metric type and unit. Update the table-driven coverage in the sample
processor tests to include the interface_link_down_events case and assert the
expected metric name, type, and unit so the new mapping is exercised by the
changed-behavior test rather than only the older no-sink entity-count check.

In `@crates/health/src/collectors/sensors.rs`:
- Around line 404-422: The current tests only verify SensorRangeKind string
helpers, but they do not cover the actual range-metric emission behavior in
update_sensor and emit_sensor_range_metric. Add a table-driven behavioral test
that exercises include_sensor_thresholds gating, confirms None suppresses
emission, and asserts the emitted sample carries the sensor_range label with the
expected value. Use the existing symbols SensorRangeKind, update_sensor, and
emit_sensor_range_metric so the test stays tied to the contract and not just the
enum strings.

In `@crates/health/src/config.rs`:
- Around line 1889-1942: Refactor
test_nvue_gnmi_dangerous_tls_skip_defaults_false_and_parses_true in
crates::health::config into a table-driven test instead of two duplicated
Figment parse/assert flows. Keep the same setup for Config::default,
Figment::new, and the Configurable::Enabled unwraps for config.collectors.nvue
and nvue.gnmi, but iterate over cases covering omitted and explicitly true
dangerously_skip_tls_verification and assert the expected boolean per case. This
will make adding future GNMI config variants consistent and cheap.

In `@crates/health/src/otlp/convert.rs`:
- Around line 254-258: The datapoint conversion currently copies every entry
from `sample.labels` into `attributes`, which can leak resource identity onto
datapoints. Update the converter boundary so the label-to-attribute mapping in
this conversion path explicitly filters out identity keys such as `switch_id`,
`switch_serial`, and `switch_serial_number` before collecting `attributes`. Add
or extend tests around the same conversion logic to prove these labels are not
propagated when present, not just when labels are empty.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4ec16842-9c93-4c47-a2e2-621bceb0b447

📥 Commits

Reviewing files that changed from the base of the PR and between 87a5337 and 961fd8c.

📒 Files selected for processing (14)
  • crates/health/example/config.example.toml
  • crates/health/src/collectors/nmxt.rs
  • crates/health/src/collectors/nvue/gnmi/client.rs
  • crates/health/src/collectors/nvue/gnmi/sample_processor.rs
  • crates/health/src/collectors/nvue/gnmi/subscriber.rs
  • crates/health/src/collectors/nvue/rest/client.rs
  • crates/health/src/collectors/nvue/rest/collector.rs
  • crates/health/src/collectors/nvue/tls.rs
  • crates/health/src/collectors/sensors.rs
  • crates/health/src/config.rs
  • crates/health/src/otlp/convert.rs
  • crates/health/src/otlp/metrics_drain.rs
  • crates/health/src/sink/otlp.rs
  • crates/health/src/sink/prometheus.rs

Comment thread crates/health/example/config.example.toml
Comment thread crates/health/src/collectors/nmxt.rs Outdated
Comment thread crates/health/src/collectors/nmxt.rs Outdated
Comment thread crates/health/src/collectors/nvue/gnmi/sample_processor.rs
Comment thread crates/health/src/collectors/nvue/gnmi/sample_processor.rs Outdated
Comment thread crates/health/src/collectors/nvue/rest/collector.rs
@github-actions

Copy link
Copy Markdown

🔍 Container Scan Summary

Service Total Critical High Medium Low Other
boot-artifacts-aarch64 3 0 0 3 0 0
boot-artifacts-x86_64 3 0 0 3 0 0
forge-admin-cli-x86_64 288 6 26 105 7 144
machine-validation-runner 751 30 190 274 36 221
machine_validation 751 30 190 274 36 221
machine_validation-aarch64 751 30 190 274 36 221
nvmetal-carbide 751 30 190 274 36 221
TOTAL 3298 126 786 1207 151 1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

@yoks yoks changed the title Nvswitch telemetry gaps feat: nvswitch telemetry gaps Jun 29, 2026
mkoci added 18 commits June 29, 2026 14:00
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…ted mappings

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…VUE REST

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
… sources

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…etheus sink

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…el cardinality fixes

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…ation changes

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…h_serial label

Compose OTLP metric name as {prefix}_{name}_{metric_type}_{unit} to match the
Prometheus sink, and promote switch_serial/switch_id onto datapoint attributes so
Grafana switch dashboards resolve identically across export paths.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
The NMX-T collector built its reqwest client without danger_accept_invalid_certs,
unlike the sibling NVUE REST collector. On minimal runtime images this fails at
client build time (native-root-CA load) and the switch serves a self-signed cert
anyway, so NMX-T never collected. Match the NVUE REST self-signed handling.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
tonic 0.14 auto-injects a strict system-root TLS verifier for https:// URIs
(Endpoint::from) and layers its own TlsConnector over any custom connector
(channel/service/connector.rs). That silently negated the hand-rolled
hyper-rustls skip-verify connector, so tonic strictly verified and rejected
NVOS's self-signed gNMI cert -- the channel died right after the server
Certificate message (opaque 'transport error', no HTTP/2 frames).

Use Endpoint::tls_config_with_verifier(ClientTlsConfig::new(), <verifier>) so
the AcceptAnyCertVerifier is applied in tonic's own TLS layer; drop the
hand-rolled connector. tls.rs now exposes accept_any_cert_verifier() instead
of self_signed_tls_config().

Validated on gb-nvl-124-switch06: gNMI SAMPLE+ON_CHANGE streams connect and
86 carbide_hardware_health_nvue_gnmi_* metric families flow via the OtlpSink
into VictoriaMetrics.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…nfig

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
… config for dev

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
mkoci added 7 commits June 29, 2026 14:01
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
The generated matrix/validation docs were already dropped in
3b0a075 (chore(health): remove temp docs from repo), but the
one-shot generator script was missed. It has no callers, its
required inputs are not in the repo, and its outputs are no longer
tracked, so it cannot run from a clean checkout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
@mkoci mkoci force-pushed the nvswitch_telemetry_gaps branch from 961fd8c to 5eb393d Compare June 29, 2026 18:07
…ssage from labels due to high cardinality

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
crates/health/src/config.rs (1)

1058-1066: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the new path toggles in the example-config test.

The new REST platform-environment flags and gNMI platform_general_enabled default/parse surface are not asserted here, so the example can drift without this test catching it.

Suggested test coverage
             if let Configurable::Enabled(ref rest) = nvue.rest {
                 assert_eq!(rest.poll_interval, Duration::from_secs(60));
                 assert_eq!(rest.request_timeout, Duration::from_secs(30));
+                assert!(rest.paths.platform_environment_fan_enabled);
+                assert!(rest.paths.platform_environment_temperature_enabled);
+                assert!(rest.paths.platform_environment_status_enabled);
             } else {
                 panic!("nvue rest config should be enabled in example config");
             }
             if let Configurable::Enabled(ref gnmi) = nvue.gnmi {
                 assert_eq!(gnmi.gnmi_port, 9339);
                 assert_eq!(gnmi.sample_interval, Duration::from_secs(300));
                 assert_eq!(gnmi.request_timeout, Duration::from_secs(30));
                 assert!(!gnmi.dangerously_skip_tls_verification);
+                assert!(gnmi.paths.platform_general_enabled);
                 assert!(gnmi.system_events_enabled);
             } else {

As per coding guidelines, verification should exercise the behavior that changed.

Also applies to: 1373-1384

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/config.rs` around lines 1058 - 1066, The example-config
test for Config/default parsing is missing assertions for the newly added path
toggles, so update that test to explicitly verify the REST platform-environment
flags and the gNMI platform_general_enabled default/parse behavior. Use the
Config default() surface and the example-config test setup to assert these new
fields alongside the existing enabled flags so future drift is caught by the
test.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/health/src/config.rs`:
- Around line 1895-1918: The NMX-T TLS config test only covers the default and
explicit true cases, but it misses the enabled [collectors.nmxt] deserialization
path when dangerously_skip_tls_verification is omitted. Update
test_nmxt_dangerous_tls_skip_defaults_false_and_parses_true in
NmxtCollectorConfig/Config to use a table-driven style that covers both omitted
and true inputs, and assert the parsed Configurable::Enabled(nmxt) value matches
the expected default false/explicit true behavior.

---

Nitpick comments:
In `@crates/health/src/config.rs`:
- Around line 1058-1066: The example-config test for Config/default parsing is
missing assertions for the newly added path toggles, so update that test to
explicitly verify the REST platform-environment flags and the gNMI
platform_general_enabled default/parse behavior. Use the Config default()
surface and the example-config test setup to assert these new fields alongside
the existing enabled flags so future drift is caught by the test.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c29cb3e-6beb-4e6d-a4ed-feb3061ec11a

📥 Commits

Reviewing files that changed from the base of the PR and between 5eb393d and 4e7af93.

📒 Files selected for processing (3)
  • crates/health/example/config.example.toml
  • crates/health/src/collectors/nmxt.rs
  • crates/health/src/config.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • crates/health/src/collectors/nmxt.rs

Comment thread crates/health/src/config.rs Outdated
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
@mkoci mkoci merged commit d63a8ad into NVIDIA:main Jun 30, 2026
59 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants