feat: nvswitch telemetry gaps by mkoci · Pull Request #2945 · NVIDIA/infra-controller

mkoci · 2026-06-27T19:27:28Z

Description

This PR covers the NVSwitch port from #2283. It closes the GB200 NVSwitch telemetry gaps for NVUE gNMI streaming, NVUE REST, and NMX-T.

Type of Change

Add - New feature or capability

Related Issues

#2283

Testing

Unit tests added/updated
Manual testing performed

Additional Notes

Gated the previously insecure TLS behavior (skipping certificate verification) behind the dangerously_skip_tls_verification option in the config surface.

Source mappings were validated against live GB200 NVSwitch endpoints (gNMI / NVUE-REST / NMX-T).
Deployment note: gNMI TLS verification is now strict by default. Lab or self-signed
NVOS gNMI endpoints must set dangerously_skip_tls_verification = true to connect.
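
The strict-by-default behavior in the deployment note can be sketched as below. This is a minimal, std-only illustration and not the crate's code: `GnmiTlsSettings`, `TlsMode`, and `tls_mode` are hypothetical names, and only the `dangerously_skip_tls_verification` field name comes from this PR.

```rust
/// Hypothetical stand-in for the gNMI collector's TLS settings.
/// Only the field name is taken from the PR; the struct is illustrative.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct GnmiTlsSettings {
    /// `bool::default()` is `false`, so an omitted flag means strict verification.
    pub dangerously_skip_tls_verification: bool,
}

/// How the client should treat the server certificate (illustrative enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TlsMode {
    /// Verify the certificate chain against trusted roots (the default).
    Strict,
    /// Accept any certificate; meant for lab or self-signed endpoints only.
    AcceptAnyCert,
}

/// Map the config flag to a TLS mode. Skipping verification is opt-in.
pub fn tls_mode(settings: GnmiTlsSettings) -> TlsMode {
    if settings.dangerously_skip_tls_verification {
        TlsMode::AcceptAnyCert
    } else {
        TlsMode::Strict
    }
}
```

Deriving `Default` keeps the insecure path unreachable unless the operator sets the flag explicitly.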

copy-pr-bot · 2026-06-27T19:27:32Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

coderabbitai · 2026-06-27T19:27:37Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 955de19a-3a92-4e02-a2d9-613b495758c4

📥 Commits

Reviewing files that changed from the base of the PR and between 4e7af93 and 886fd6d.

📒 Files selected for processing (4)

crates/health/example/config.example.toml
crates/health/src/collectors/nvue/gnmi/sample_processor.rs
crates/health/src/collectors/nvue/rest/collector.rs
crates/health/src/config.rs

🚧 Files skipped from review as they are similar to previous changes (3)

crates/health/example/config.example.toml
crates/health/src/config.rs
crates/health/src/collectors/nvue/rest/collector.rs

Summary by CodeRabbit

New Features
- Expanded NVUE health monitoring with platform environment (fan, temperature, status) and platform-general metrics toggles.
- Added sensor range metrics for min/max thresholds.
- Improved OTLP metric naming to consistently include a configurable prefix.
Bug Fixes
- Made TLS behavior explicit: strict verification is the default for NMX-T and NVUE gNMI (with opt-in skipping via configuration).
- Updated status-style outputs to use StateSet-style state series instead of numeric codes.
- Refined metric identity handling so switch serial uses the correct OTLP resource key.
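
As a rough illustration of the StateSet-style output mentioned above: a StateSet exposes one series per possible state, with value 1 for the active state and 0 for the others, instead of a single series carrying a numeric code. The sketch below is std-only; `StateSample`, `state_set`, and the state names in the usage are hypothetical, and returning `None` for an unknown status is an assumption rather than the collector's confirmed behavior.

```rust
/// One emitted series: the state label value and its 0/1 sample value.
#[derive(Debug, Clone, PartialEq)]
pub struct StateSample {
    pub state: &'static str,
    pub value: f64,
}

/// Fan a single status string out into one series per known state.
/// Assumption: an unrecognized status yields `None` so nothing misleading is emitted.
pub fn state_set(known_states: &[&'static str], current: &str) -> Option<Vec<StateSample>> {
    if !known_states.iter().any(|state| *state == current) {
        return None;
    }
    Some(
        known_states
            .iter()
            .map(|&state| StateSample {
                state,
                // Exactly one state is active at a time.
                value: if state == current { 1.0 } else { 0.0 },
            })
            .collect(),
    )
}
```

For example, `state_set(&["ok", "absent", "bad"], "absent")` yields three samples, of which only the `absent` one carries the value 1.0.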

Walkthrough

This PR updates health collector configuration, NVUE gNMI and REST metric collection, NMX-T scraping, OTLP export naming, and sensor threshold handling.

Changes

Configuration, TLS Abstraction, and OTLP Naming

Layer / File(s)	Summary
New config fields and example config `crates/health/src/config.rs`, `crates/health/example/config.example.toml`	Adds TLS-skip flags, NVUE platform-general enablement, NVUE REST platform-environment enablement, and updates the bundled example config and parsing tests.
TLS verifier helper `crates/health/src/collectors/nvue/tls.rs`	Removes `self_signed_tls_config()` and adds `accept_any_cert_verifier()` with updated `rustls` imports.
OTLP metric prefix and serial key `crates/health/src/otlp/convert.rs`, `crates/health/src/otlp/metrics_drain.rs`, `crates/health/src/sink/otlp.rs`	Extends OTLP export naming with a metric prefix, changes datapoint attribute handling, renames the serial resource key, and threads the prefix through the drain task and sink wiring.

NMX-T Collector Allowlist Rewrite

Layer / File(s)	Summary
Allowlist types and scrape parsing `crates/health/src/collectors/nmxt.rs`	Adds metric and label allowlists, parsed scrape sample types, temperature and state helpers, port validation, and Prometheus line parsing for labeled and unlabeled samples.
Collector wiring and scrape iteration `crates/health/src/collectors/nmxt.rs`	Removes stored switch identity, builds the HTTP client with conditional TLS skipping, and rewrites scrape iteration to emit allowlisted metrics, temperature conversion, and down-blame state output.
NMX-T emission tests `crates/health/src/collectors/nmxt.rs`	Covers allowlist exactness, canonical label re-export, module temperature conversion, down-blame fan-out, port validation, and once-per-port emission behavior.
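
To make the "Prometheus line parsing for labeled and unlabeled samples" and allowlist rows more concrete, here is a simplified std-only sketch. It is not the code in `nmxt.rs`: `ScrapeSample`, `parse_line`, and `allowlisted` are hypothetical names, and the parser assumes label values contain no `"`, `,`, or `}` characters.

```rust
/// A parsed scrape sample (hypothetical shape, not the crate's type).
#[derive(Debug, Clone, PartialEq)]
pub struct ScrapeSample {
    pub name: String,
    pub labels: Vec<(String, String)>,
    pub value: f64,
}

/// Parse one exposition line: `name value` or `name{k="v",k2="v2"} value`.
/// Simplification: label values must not contain `"`, `,` or `}`.
pub fn parse_line(line: &str) -> Option<ScrapeSample> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None; // blank line or HELP/TYPE comment
    }
    let (name, labels, rest) = match line.find('{') {
        Some(open) => {
            let close = line.rfind('}')?;
            if close < open {
                return None;
            }
            let mut labels = Vec::new();
            for pair in line[open + 1..close].split(',') {
                if pair.trim().is_empty() {
                    continue; // tolerate `name{}` and trailing commas
                }
                let (key, raw) = pair.split_once('=')?;
                let value = raw.trim().strip_prefix('"')?.strip_suffix('"')?;
                labels.push((key.trim().to_string(), value.to_string()));
            }
            (&line[..open], labels, &line[close + 1..])
        }
        None => {
            let (name, rest) = line.split_once(char::is_whitespace)?;
            (name, Vec::new(), rest)
        }
    };
    // The first token after the name/labels is the value; a trailing timestamp is ignored.
    let value = rest.split_whitespace().next()?.parse::<f64>().ok()?;
    Some(ScrapeSample { name: name.trim().to_string(), labels, value })
}

/// Keep only samples whose metric name is on the allowlist.
pub fn allowlisted<'a>(sample: &'a ScrapeSample, allow: &[&str]) -> Option<&'a ScrapeSample> {
    allow
        .iter()
        .any(|name| *name == sample.name.as_str())
        .then_some(sample)
}
```

Filtering by an explicit allowlist after parsing means unknown metric families are dropped rather than exported by accident.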

NVUE gNMI: TLS Skip, Platform-General Paths, and StateSet Processing

Layer / File(s)	Summary
Client TLS skip and subscribe paths `crates/health/src/collectors/nvue/gnmi/client.rs`, `crates/health/src/collectors/nvue/gnmi/subscriber.rs`	Adds the TLS-skip flag to `GnmiClient`, introduces TLS endpoint configuration, extends subscription paths for platform-general, and wires the flag from configuration through the subscriber.
Sample processor StateSet and singleton emission `crates/health/src/collectors/nvue/gnmi/sample_processor.rs`	Adds the NVUE sample stream constant, classifies platform-general as a singleton entity, introduces StateSet and switch-level emit helpers, and rewrites interface and component leaf handling.
Numeric leaf dispatcher and state helpers `crates/health/src/collectors/nvue/gnmi/sample_processor.rs`	Introduces the table-driven numeric interface dispatcher, replaces enum-style numeric helpers with string-to-state helpers, and adds link width and speed parsing.
Sample processor test coverage `crates/health/src/collectors/nvue/gnmi/sample_processor.rs`	Expands tests for mapping functions, StateSet fan-out, platform-general metrics, numeric leaf dispatch, and unmapped-leaf suppression.

NVUE REST: Platform Environment Endpoints and StateSet Emission

Layer / File(s)	Summary
REST client platform environment support `crates/health/src/collectors/nvue/rest/client.rs`	Adds platform-environment endpoint constants, gated fetchers, response types, and parsing tests for fan, temperature, and aggregate environment payloads.
StateSet emission and platform-environment collection `crates/health/src/collectors/nvue/rest/collector.rs`	Replaces numeric status helpers with state mappings, introduces StateSet emission, and extends iteration to collect fan, temperature, and FAN_STATUS environment metrics.
REST collector tests `crates/health/src/collectors/nvue/rest/collector.rs`	Adds StateSet assertions, helper mapping tests, and emission coverage for fan speed, temperature, and LED status handling.

Sensor Range Metric Emission

Layer / File(s)	Summary
Sensor range metric emission `crates/health/src/collectors/sensors.rs`	Introduces `SensorRangeKind`, derives range values in sensor updates, reuses sample fields for threshold emission, and adds tests for suffix and label formatting.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

rack health

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and clearly refers to the NVSwitch telemetry feature work in this changeset.
Description check	✅ Passed	The description is directly related to the NVSwitch telemetry, TLS, and testing changes in the pull request.
Docstring Coverage	✅ Passed	Docstring coverage is 93.30% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

copy-pr-bot · 2026-06-28T02:28:49Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (5)

crates/health/src/collectors/sensors.rs (1)

404-422: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add a behavioral test for range metric emission.

These tests only pin the enum strings. The new contract is in Lines 347-396: include_sensor_thresholds gating, None suppression, and the sensor_range label on emitted samples. Please cover that path with a table-driven test around update_sensor or emit_sensor_range_metric; otherwise the exported series shape can drift without a failing test. As per coding guidelines, "Use table-driven test style when writing tests in Rust" and "Prefer carbide-test-support scenarios (scenarios! / value_scenarios!) or explicit cases (check_cases / check_values) for test coverage in Rust." As per path instructions, prefer findings about behavior and missing tests over style-only comments.
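
A table-driven behavioral test of the kind requested here might look roughly like the following. Everything in it is a stand-in: `RangeKind`, `Sample`, and `emit_range` are hypothetical simplifications, the real `SensorRangeKind` variants and the `emit_sensor_range_metric` signature may differ, and a plain array is used instead of the carbide-test-support macros.

```rust
/// Illustrative stand-in; the real `SensorRangeKind` variants may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum RangeKind {
    Min,
    Max,
}

impl RangeKind {
    pub fn label(self) -> &'static str {
        match self {
            RangeKind::Min => "min",
            RangeKind::Max => "max",
        }
    }
}

/// Emitted sample: value plus its labels (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
pub struct Sample {
    pub value: f64,
    pub labels: Vec<(&'static str, &'static str)>,
}

/// Emit a range sample only when thresholds are enabled and a value is present.
pub fn emit_range(include_thresholds: bool, kind: RangeKind, value: Option<f64>) -> Option<Sample> {
    if !include_thresholds {
        return None;
    }
    value.map(|v| Sample { value: v, labels: vec![("sensor_range", kind.label())] })
}

/// Table-driven check: each row is (case name, gate, kind, input, expected label).
pub fn run_cases() -> Result<(), String> {
    let cases = [
        ("gate off", false, RangeKind::Max, Some(90.0), None),
        ("missing value", true, RangeKind::Min, None, None),
        ("max emitted", true, RangeKind::Max, Some(90.0), Some("max")),
        ("min emitted", true, RangeKind::Min, Some(5.0), Some("min")),
    ];
    for (name, gate, kind, input, expected) in cases {
        let got = emit_range(gate, kind, input);
        let got_label = got.as_ref().map(|s| s.labels[0].1);
        if got_label != expected {
            return Err(format!("{name}: expected {expected:?}, got {got_label:?}"));
        }
    }
    Ok(())
}
```

Each row pins one part of the contract (gating, `None` suppression, label value), so a change to the exported series shape fails a named case.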
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/collectors/sensors.rs` around lines 404 - 422, The current
tests only verify SensorRangeKind string helpers, but they do not cover the
actual range-metric emission behavior in update_sensor and
emit_sensor_range_metric. Add a table-driven behavioral test that exercises
include_sensor_thresholds gating, confirms None suppresses emission, and asserts
the emitted sample carries the sensor_range label with the expected value. Use
the existing symbols SensorRangeKind, update_sensor, and
emit_sensor_range_metric so the test stays tied to the contract and not just the
enum strings.
Sources: Coding guidelines, Path instructions

crates/health/src/otlp/convert.rs (1)

254-258: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick win

Enforce the resource-only identity invariant at the converter boundary.

Line 254 copies every sample.labels entry to the datapoint, while the new test only proves identity is absent when the sample has no labels. If a collector later emits switch_id, switch_serial, or switch_serial_number, this converter will still duplicate resource identity onto datapoints.
Suggested hardening
+const OTLP_RESOURCE_ONLY_DATAPOINT_LABELS: &[&str] =
+    &["switch_id", "switch_serial", "switch_serial_number", "switch_ip"];
+
         let attributes: Vec<KeyValue> = sample
             .labels
             .iter()
+            .filter(|(k, _)| !OTLP_RESOURCE_ONLY_DATAPOINT_LABELS.contains(&k.as_ref()))
             .map(|(k, v)| kv(k, v.clone()))
             .collect();
As per coding guidelines, implementation claims should be backed by code or tests.

Also applies to: 799-803
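
The property this comment asks for (identity labels are dropped even when they are present on the sample) can be checked with a std-only sketch like the one below. `datapoint_attributes` is a hypothetical helper operating on plain string pairs instead of the crate's `KeyValue` type, and the key list simply mirrors the suggestion above.

```rust
/// Label keys that belong on the OTLP resource, not on datapoints.
/// The key list mirrors the review suggestion; treat it as illustrative.
const RESOURCE_ONLY_LABELS: &[&str] =
    &["switch_id", "switch_serial", "switch_serial_number", "switch_ip"];

/// Return datapoint attributes with resource-identity keys removed.
pub fn datapoint_attributes(labels: &[(String, String)]) -> Vec<(String, String)> {
    labels
        .iter()
        .filter(|(key, _)| !RESOURCE_ONLY_LABELS.contains(&key.as_str()))
        .cloned()
        .collect()
}
```

A test that feeds in a `switch_serial` label alongside an ordinary label, and asserts that only the ordinary one survives, covers the case the existing empty-labels test does not.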
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/otlp/convert.rs` around lines 254 - 258, The datapoint
conversion currently copies every entry from `sample.labels` into `attributes`,
which can leak resource identity onto datapoints. Update the converter boundary
so the label-to-attribute mapping in this conversion path explicitly filters out
identity keys such as `switch_id`, `switch_serial`, and `switch_serial_number`
before collecting `attributes`. Add or extend tests around the same conversion
logic to prove these labels are not propagated when present, not just when
labels are empty.
Source: Coding guidelines

crates/health/src/collectors/nmxt.rs (1)

973-1005: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Exercise the production emission path in tests.

These tests copy the same emission logic instead of calling the collector’s production emission path, so they can pass while scrape_iteration regresses. Consider extracting the post-scrape loop into a helper that accepts parsed NmxtMetricSamples and invoking that helper from both scrape_iteration and these tests. As per coding guidelines, verification should exercise the behavior that changed.

Also applies to: 1082-1114
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/collectors/nmxt.rs` around lines 973 - 1005, The test logic
around down_blame emission is duplicating the collector’s production path
instead of exercising it, so it can miss regressions in scrape_iteration.
Extract the post-scrape emission loop from scrape_iteration into a shared helper
that takes parsed NmxtMetricSample values and performs the down_blame event
generation, then call that helper from both scrape_iteration and the affected
tests (including the similar block in the other test section) so verification
covers the real behavior.
Source: Coding guidelines

crates/health/src/config.rs (1)

1889-1942: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make the new config parse test table-driven.

Line 1889 adds two duplicated parse/assert flows for the same operation. Please collapse them into cases so adding the next config variant stays cheap and consistent with the Rust test style.

♻️ Proposed refactor

-        let config: Config = Figment::new()
-            .merge(Serialized::defaults(Config::default()))
-            .merge(Toml::string(omitted))
-            .extract()
-            .expect("failed to parse omitted tls flag");
-
-        let Configurable::Enabled(nvue) = config.collectors.nvue else {
-            panic!("nvue config should be enabled");
-        };
-        let Configurable::Enabled(gnmi) = nvue.gnmi else {
-            panic!("gnmi config should be enabled");
-        };
-        assert!(!gnmi.dangerously_skip_tls_verification);
-
-        let enabled = r#"
+        let enabled = r#"
 [endpoint_sources.carbide_api]
 enabled = false
 
 [sinks.health_report]
 enabled = false
@@
 dangerously_skip_tls_verification = true
 "#;
 
-        let config: Config = Figment::new()
-            .merge(Serialized::defaults(Config::default()))
-            .merge(Toml::string(enabled))
-            .extract()
-            .expect("failed to parse enabled tls flag");
-
-        let Configurable::Enabled(nvue) = config.collectors.nvue else {
-            panic!("nvue config should be enabled");
-        };
-        let Configurable::Enabled(gnmi) = nvue.gnmi else {
-            panic!("gnmi config should be enabled");
-        };
-        assert!(gnmi.dangerously_skip_tls_verification);
+        for (name, toml, expected) in [
+            ("omitted tls flag", omitted, false),
+            ("enabled tls flag", enabled, true),
+        ] {
+            let config: Config = Figment::new()
+                .merge(Serialized::defaults(Config::default()))
+                .merge(Toml::string(toml))
+                .extract()
+                .unwrap_or_else(|_| panic!("failed to parse {name}"));
+
+            let Configurable::Enabled(nvue) = config.collectors.nvue else {
+                panic!("nvue config should be enabled for {name}");
+            };
+            let Configurable::Enabled(gnmi) = nvue.gnmi else {
+                panic!("gnmi config should be enabled for {name}");
+            };
+            assert_eq!(gnmi.dangerously_skip_tls_verification, expected, "{name}");
+        }

As per coding guidelines, "Use table-driven test style when writing tests in Rust".

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/config.rs` around lines 1889 - 1942, Refactor
test_nvue_gnmi_dangerous_tls_skip_defaults_false_and_parses_true in
crates::health::config into a table-driven test instead of two duplicated
Figment parse/assert flows. Keep the same setup for Config::default,
Figment::new, and the Configurable::Enabled unwraps for config.collectors.nvue
and nvue.gnmi, but iterate over cases covering omitted and explicitly true
dangerously_skip_tls_verification and assert the expected boolean per case. This
will make adding future GNMI config variants consistent and cheap.

Source: Coding guidelines

crates/health/src/collectors/nvue/gnmi/sample_processor.rs (1)

785-788: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Add the missing table case for interface_link_down_events.

The mapping exists, but the new numeric table test does not assert its emitted metric type/unit; the older no-sink test only checks entity count. Add it to keep the mapping table fully covered. As per coding guidelines, verification should exercise the behavior that changed.

Proposed test addition

             (
                 &["phy-diag", "state", "plr-bw-loss-percent"],
                 "interface_plr_bw_loss_percent",
                 "percent",
             ),
+            (
+                &["phy-diag", "state", "unintentional-link-down-events"],
+                "interface_link_down_events",
+                "count",
+            ),
         ];

Also applies to: 1758-2067

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/collectors/nvue/gnmi/sample_processor.rs` around lines 785
- 788, The mapping for interface_link_down_events is present in
sample_processor.rs, but the numeric table test still does not verify its
emitted metric type and unit. Update the table-driven coverage in the sample
processor tests to include the interface_link_down_events case and assert the
expected metric name, type, and unit so the new mapping is exercised by the
changed-behavior test rather than only the older no-sink entity-count check.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/health/example/config.example.toml`:
- Around line 225-226: The example config is missing the new fan
platform-environment toggle, so update the configuration example around the
platform_environment_* entries to include platform_environment_fan_enabled
alongside platform_environment_temperature_enabled and
platform_environment_status_enabled. Use the existing NvueRestPaths-related
platform environment section as the reference point so the new option is
documented consistently with the other enabled flags.

In `@crates/health/src/collectors/nmxt.rs`:
- Around line 249-250: Remove the free-text Status_Message mapping from
NMXT_LABEL_MAP or otherwise filter it out before MetricSample construction in
nmxt.rs, so sink-agnostic metric events never carry this high-cardinality label.
Update the mapping/collection path around NMXT_LABEL_MAP and the MetricSample
emission logic to keep only stable, low-cardinality labels, and apply the same
fix anywhere else the allowlist is built or reused.
- Around line 479-485: The NMX-T HTTP client currently bypasses TLS verification
unconditionally in the `NmxtCollector` setup. Add a
`dangerously_skip_tls_verification` field to `NmxtCollectorConfig`, thread that
config into the `reqwest::Client::builder()` path, and only call
`danger_accept_invalid_certs(true)` when that option is enabled; otherwise build
the client with normal certificate validation.

In `@crates/health/src/collectors/nvue/gnmi/sample_processor.rs`:
- Around line 771-777: The metric name for the retry-events leaf is incorrect in
the nvue GNMI sample mapping and is currently conflated with the retry-codes
metric. Update the relevant entry in sample_processor.rs for the
plr-xmit-retry-events-within-t-sec-max leaf so the exported name uses an
events-based identifier (not codes), and make the same rename in the matching
duplicate mapping referenced by the reviewer. Keep the surrounding mapping
structure and unit unchanged, and use the existing sample definitions to ensure
the new name stays aligned with the leaf semantics.
- Around line 892-923: Both link parsing helpers currently accept any
successfully parsed f64, so invalid values can flow into MetricSample and
downstream exporters. Update link_width_to_f64 and link_speed_to_gbps to reject
NaN, infinities, and negative numbers after parsing, returning None for those
cases. Keep the validation close to the parsing logic in these two functions,
and add table-driven tests covering finite positives plus NaN, infinity, and
negative edge cases.

In `@crates/health/src/collectors/nvue/rest/collector.rs`:
- Around line 79-86: Update fan_max_speed_to_f64 in the NVUE REST collector to
reject invalid fan RPM values before they are emitted. After parsing the string
into f64, filter out non-finite results using is_finite() and return None for
any value less than 0.0; keep temp_to_f64 unchanged. Add tests covering NaN,
inf, and a negative RPM case to verify fan_max_speed_to_f64 drops them.
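
The validation requested for `link_width_to_f64`, `link_speed_to_gbps`, and `fan_max_speed_to_f64` shares one pattern, sketched here with a hypothetical helper named `parse_non_negative_finite`. The real functions may do additional parsing (for example unit handling) that is not shown. Rust's `f64` parser accepts strings such as `NaN` and `inf`, which is why the explicit filter is needed.

```rust
/// Parse a numeric string, rejecting NaN, infinities, and negative values.
/// Hypothetical helper; not suitable for temperatures, which may be negative.
pub fn parse_non_negative_finite(raw: &str) -> Option<f64> {
    raw.trim()
        .parse::<f64>()
        .ok()
        .filter(|v| v.is_finite() && *v >= 0.0)
}
```

Returning `None` keeps invalid readings out of the emitted samples entirely, rather than forwarding values that downstream exporters may reject or misrender.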

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4ec16842-9c93-4c47-a2e2-621bceb0b447

📥 Commits

Reviewing files that changed from the base of the PR and between 87a5337 and 961fd8c.

📒 Files selected for processing (14)

crates/health/example/config.example.toml
crates/health/src/collectors/nmxt.rs
crates/health/src/collectors/nvue/gnmi/client.rs
crates/health/src/collectors/nvue/gnmi/sample_processor.rs
crates/health/src/collectors/nvue/gnmi/subscriber.rs
crates/health/src/collectors/nvue/rest/client.rs
crates/health/src/collectors/nvue/rest/collector.rs
crates/health/src/collectors/nvue/tls.rs
crates/health/src/collectors/sensors.rs
crates/health/src/config.rs
crates/health/src/otlp/convert.rs
crates/health/src/otlp/metrics_drain.rs
crates/health/src/sink/otlp.rs
crates/health/src/sink/prometheus.rs

github-actions · 2026-06-29T14:14:00Z

🔍 Container Scan Summary

Service	Total	Critical	High	Medium	Low	Other
boot-artifacts-aarch64	3	0	0	3	0	0
boot-artifacts-x86_64	3	0	0	3	0	0
forge-admin-cli-x86_64	288	6	26	105	7	144
machine-validation-runner	751	30	190	274	36	221
machine_validation	751	30	190	274	36	221
machine_validation-aarch64	751	30	190	274	36	221
nvmetal-carbide	751	30	190	274	36	221
TOTAL	3298	126	786	1207	151	1028

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

…ted mappings Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

…VUE REST Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

… sources Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

…etheus sink Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

…el cardinality fixes Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

…ation changes Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

…h_serial label

Compose OTLP metric name as {prefix}_{name}_{metric_type}_{unit} to match the Prometheus sink, and promote switch_serial/switch_id onto datapoint attributes so Grafana switch dashboards resolve identically across export paths.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
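
The naming scheme in this commit message, `{prefix}_{name}_{metric_type}_{unit}`, can be illustrated with a small std-only sketch. `compose_metric_name` is a hypothetical function, and skipping empty segments is an assumption made here to avoid doubled underscores; the converter in `convert.rs` may handle that case differently.

```rust
/// Compose an exported metric name as `{prefix}_{name}_{metric_type}_{unit}`.
/// Assumption: empty segments are skipped so no double underscores appear.
pub fn compose_metric_name(prefix: &str, name: &str, metric_type: &str, unit: &str) -> String {
    [prefix, name, metric_type, unit]
        .iter()
        .filter(|part| !part.is_empty())
        .copied()
        .collect::<Vec<_>>()
        .join("_")
}
```

With made-up inputs, `compose_metric_name("acme", "fan", "speed", "rpm")` produces `acme_fan_speed_rpm`.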

The NMX-T collector built its reqwest client without danger_accept_invalid_certs, unlike the sibling NVUE REST collector. On minimal runtime images this fails at client build time (native-root-CA load) and the switch serves a self-signed cert anyway, so NMX-T never collected. Match the NVUE REST self-signed handling.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

tonic 0.14 auto-injects a strict system-root TLS verifier for https:// URIs (Endpoint::from) and layers its own TlsConnector over any custom connector (channel/service/connector.rs). That silently negated the hand-rolled hyper-rustls skip-verify connector, so tonic strictly verified and rejected NVOS's self-signed gNMI cert -- the channel died right after the server Certificate message (opaque 'transport error', no HTTP/2 frames).

Use Endpoint::tls_config_with_verifier(ClientTlsConfig::new(), <verifier>) so the AcceptAnyCertVerifier is applied in tonic's own TLS layer; drop the hand-rolled connector. tls.rs now exposes accept_any_cert_verifier() instead of self_signed_tls_config().

Validated on gb-nvl-124-switch06: gNMI SAMPLE+ON_CHANGE streams connect and 86 carbide_hardware_health_nvue_gnmi_* metric families flow via the OtlpSink into VictoriaMetrics.

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

…nfig Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

… config for dev Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

The generated matrix/validation docs were already dropped in 3b0a075 (chore(health): remove temp docs from repo), but the one-shot generator script was missed. It has no callers, its required inputs are not in the repo, and its outputs are no longer tracked, so it cannot run from a clean checkout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

…ssage from labels due to high cardinality Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

crates/health/src/config.rs (1)

1058-1066: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert the new path toggles in the example-config test.

The new REST platform-environment flags and gNMI platform_general_enabled default/parse surface are not asserted here, so the example can drift without this test catching it.

Suggested test coverage

             if let Configurable::Enabled(ref rest) = nvue.rest {
                 assert_eq!(rest.poll_interval, Duration::from_secs(60));
                 assert_eq!(rest.request_timeout, Duration::from_secs(30));
+                assert!(rest.paths.platform_environment_fan_enabled);
+                assert!(rest.paths.platform_environment_temperature_enabled);
+                assert!(rest.paths.platform_environment_status_enabled);
             } else {
                 panic!("nvue rest config should be enabled in example config");
             }
             if let Configurable::Enabled(ref gnmi) = nvue.gnmi {
                 assert_eq!(gnmi.gnmi_port, 9339);
                 assert_eq!(gnmi.sample_interval, Duration::from_secs(300));
                 assert_eq!(gnmi.request_timeout, Duration::from_secs(30));
                 assert!(!gnmi.dangerously_skip_tls_verification);
+                assert!(gnmi.paths.platform_general_enabled);
                 assert!(gnmi.system_events_enabled);
             } else {

As per coding guidelines, verification should exercise the behavior that changed.

Also applies to: 1373-1384

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/health/src/config.rs` around lines 1058 - 1066, The example-config
test for Config/default parsing is missing assertions for the newly added path
toggles, so update that test to explicitly verify the REST platform-environment
flags and the gNMI platform_general_enabled default/parse behavior. Use the
Config default() surface and the example-config test setup to assert these new
fields alongside the existing enabled flags so future drift is caught by the
test.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/health/src/config.rs`:
- Around line 1895-1918: The NMX-T TLS config test only covers the default and
explicit true cases, but it misses the enabled [collectors.nmxt] deserialization
path when dangerously_skip_tls_verification is omitted. Update
test_nmxt_dangerous_tls_skip_defaults_false_and_parses_true in
NmxtCollectorConfig/Config to use a table-driven style that covers both omitted
and true inputs, and assert the parsed Configurable::Enabled(nmxt) value matches
the expected default false/explicit true behavior.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4c29cb3e-6beb-4e6d-a4ed-feb3061ec11a

📥 Commits

Reviewing files that changed from the base of the PR and between 5eb393d and 4e7af93.

📒 Files selected for processing (3)

crates/health/example/config.example.toml
crates/health/src/collectors/nmxt.rs
crates/health/src/config.rs

🚧 Files skipped from review as they are similar to previous changes (1)

crates/health/src/collectors/nmxt.rs

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

mkoci force-pushed the nvswitch_telemetry_gaps branch from b4b9e2a to 961fd8c Compare June 28, 2026 02:28

mkoci marked this pull request as ready for review June 29, 2026 12:45

mkoci requested a review from a team as a code owner June 29, 2026 12:45

mkoci requested review from jayzhudev and yoks June 29, 2026 12:49

coderabbitai Bot reviewed Jun 29, 2026

yoks mentioned this pull request Jun 29, 2026

change: hw-health universal stage support #2938

Open

10 tasks

yoks changed the title ~~Nvswitch telemetry gaps~~ feat: nvswitch telemetry gaps Jun 29, 2026

mkoci added 18 commits June 29, 2026 14:00

docs(health): add GB200 NVSWITCH telemetry matrix

2018d03

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

docs(health): record nv-redfish dependency path

340d046

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

docs(health): clarify nv-redfish local patch strategy

f9105b5

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

feat(health): collect GB200 NVSwitch telemetry gaps

4fa9b6f

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

feat(health): rework GB200 NVSwitch telemetry to explicit live-validated mappings

3db8cb2

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

feat(health): reclaim 4 NVSwitch cable fault rows via NMX-T

acd2a38

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

feat(health): implement 6 string-valued NVSwitch catalog rows

eea4bcd

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

feat(health): implement 21 temp-threshold + 8 temp-current rows via NVUE REST

2bb4bbe

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

feat(health): reclaim 5 NVSwitch catalog rows via live gNMI/NVUE-REST sources

f0b4fa2

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

feat(health): exclude high-cardinality free-text labels from the Prometheus sink

589ab5a

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

refactor(health): struct allowlists, StateSet enum metrics, NMX-T label cardinality fixes

dedab8e

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

docs(health): reconcile GB200 matrix + runbook for StateSet/representation changes

f73d4d8

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

chore(health): remove temp docs from repo

db79da0

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

fix(health): prevent empty labels from propagating. Update example config

b12337d

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

fix(health): default to strict TLS verification. add optional flag in config for dev

51e080f

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

mkoci added 7 commits June 29, 2026 14:01

lint(health): fix

f758afe

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

chore(health): fix comment copy

2863717

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

fix(health): nmxt cleanup. Fix wasteful label rebuilds

ae8df21

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

chore(health): comment cleanup. fixing labels

a5723d3

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

fix(health): added back allowlist guard

4a8bcfe

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

fix(health): remove label dupes

5eb393d

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

mkoci force-pushed the nvswitch_telemetry_gaps branch from 961fd8c to 5eb393d Compare June 29, 2026 18:07

fix(health): add dangerous TLS gate to nmxt as well. Remove status_message from labels due to high cardinality

4e7af93

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

coderabbitai Bot reviewed Jun 29, 2026

Comment thread crates/health/src/config.rs Outdated

yoks approved these changes Jun 29, 2026

fix(health): address @CodeRabbit's random nits :/

886fd6d

Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>

yoks approved these changes Jun 29, 2026

mkoci merged commit d63a8ad into NVIDIA:main Jun 30, 2026
59 checks passed

mkoci merged 27 commits into NVIDIA:main from mkoci:nvswitch_telemetry_gaps


github-actions Bot commented Jun 29, 2026

🔍 Container Scan Summary
