feat: nvswitch telemetry gaps#2945
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
🚧 Files skipped from review as they are similar to previous changes (3)
Summary by CodeRabbit
WalkthroughThis PR updates health collector configuration, NVUE gNMI and REST metric collection, NMX-T scraping, OTLP export naming, and sensor threshold handling. ChangesConfiguration, TLS Abstraction, and OTLP Naming
NMX-T Collector Allowlist Rewrite
NVUE gNMI: TLS Skip, Platform-General Paths, and StateSet Processing
NVUE REST: Platform Environment Endpoints and StateSet Emission
Sensor Range Metric Emission
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Suggested labels
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
b4b9e2a to
961fd8c
Compare
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (5)
crates/health/src/collectors/sensors.rs (1)
404-422: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a behavioral test for range metric emission.
These tests only pin the enum strings. The new contract is in Lines 347-396:
include_sensor_thresholdsgating,Nonesuppression, and thesensor_rangelabel on emitted samples. Please cover that path with a table-driven test aroundupdate_sensororemit_sensor_range_metric; otherwise the exported series shape can drift without a failing test. As per coding guidelines, "Use table-driven test style when writing tests in Rust" and "Prefer carbide-test-support scenarios (scenarios!/value_scenarios!) or explicit cases (check_cases/check_values) for test coverage in Rust." As per path instructions, prefer findings about behavior and missing tests over style-only comments.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/collectors/sensors.rs` around lines 404 - 422, The current tests only verify SensorRangeKind string helpers, but they do not cover the actual range-metric emission behavior in update_sensor and emit_sensor_range_metric. Add a table-driven behavioral test that exercises include_sensor_thresholds gating, confirms None suppresses emission, and asserts the emitted sample carries the sensor_range label with the expected value. Use the existing symbols SensorRangeKind, update_sensor, and emit_sensor_range_metric so the test stays tied to the contract and not just the enum strings.Sources: Coding guidelines, Path instructions
crates/health/src/otlp/convert.rs (1)
254-258: 🗄️ Data Integrity & Integration | 🔵 Trivial | ⚡ Quick winEnforce the resource-only identity invariant at the converter boundary.
Line 254 copies every
sample.labelsentry to the datapoint, while the new test only proves identity is absent when the sample has no labels. If a collector later emitsswitch_id,switch_serial, orswitch_serial_number, this converter will still duplicate resource identity onto datapoints.Suggested hardening
+const OTLP_RESOURCE_ONLY_DATAPOINT_LABELS: &[&str] = + &["switch_id", "switch_serial", "switch_serial_number", "switch_ip"]; + let attributes: Vec<KeyValue> = sample .labels .iter() + .filter(|(k, _)| !OTLP_RESOURCE_ONLY_DATAPOINT_LABELS.contains(&k.as_ref())) .map(|(k, v)| kv(k, v.clone())) .collect();As per coding guidelines, implementation claims should be backed by code or tests.
Also applies to: 799-803
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/otlp/convert.rs` around lines 254 - 258, The datapoint conversion currently copies every entry from `sample.labels` into `attributes`, which can leak resource identity onto datapoints. Update the converter boundary so the label-to-attribute mapping in this conversion path explicitly filters out identity keys such as `switch_id`, `switch_serial`, and `switch_serial_number` before collecting `attributes`. Add or extend tests around the same conversion logic to prove these labels are not propagated when present, not just when labels are empty.Source: Coding guidelines
crates/health/src/collectors/nmxt.rs (1)
973-1005: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winExercise the production emission path in tests.
These tests copy the same emission logic instead of calling the collector’s production emission path, so they can pass while
scrape_iterationregresses. Consider extracting the post-scrape loop into a helper that accepts parsedNmxtMetricSamples and invoking that helper from bothscrape_iterationand these tests. As per coding guidelines, verification should exercise the behavior that changed.Also applies to: 1082-1114
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/collectors/nmxt.rs` around lines 973 - 1005, The test logic around down_blame emission is duplicating the collector’s production path instead of exercising it, so it can miss regressions in scrape_iteration. Extract the post-scrape emission loop from scrape_iteration into a shared helper that takes parsed NmxtMetricSample values and performs the down_blame event generation, then call that helper from both scrape_iteration and the affected tests (including the similar block in the other test section) so verification covers the real behavior.Source: Coding guidelines
crates/health/src/config.rs (1)
1889-1942: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMake the new config parse test table-driven.
Line 1889 adds two duplicated parse/assert flows for the same operation. Please collapse them into cases so adding the next config variant stays cheap and consistent with the Rust test style.
♻️ Proposed refactor
- let config: Config = Figment::new() - .merge(Serialized::defaults(Config::default())) - .merge(Toml::string(omitted)) - .extract() - .expect("failed to parse omitted tls flag"); - - let Configurable::Enabled(nvue) = config.collectors.nvue else { - panic!("nvue config should be enabled"); - }; - let Configurable::Enabled(gnmi) = nvue.gnmi else { - panic!("gnmi config should be enabled"); - }; - assert!(!gnmi.dangerously_skip_tls_verification); - - let enabled = r#" + let enabled = r#" [endpoint_sources.carbide_api] enabled = false [sinks.health_report] enabled = false @@ dangerously_skip_tls_verification = true "#; - let config: Config = Figment::new() - .merge(Serialized::defaults(Config::default())) - .merge(Toml::string(enabled)) - .extract() - .expect("failed to parse enabled tls flag"); - - let Configurable::Enabled(nvue) = config.collectors.nvue else { - panic!("nvue config should be enabled"); - }; - let Configurable::Enabled(gnmi) = nvue.gnmi else { - panic!("gnmi config should be enabled"); - }; - assert!(gnmi.dangerously_skip_tls_verification); + for (name, toml, expected) in [ + ("omitted tls flag", omitted, false), + ("enabled tls flag", enabled, true), + ] { + let config: Config = Figment::new() + .merge(Serialized::defaults(Config::default())) + .merge(Toml::string(toml)) + .extract() + .unwrap_or_else(|_| panic!("failed to parse {name}")); + + let Configurable::Enabled(nvue) = config.collectors.nvue else { + panic!("nvue config should be enabled for {name}"); + }; + let Configurable::Enabled(gnmi) = nvue.gnmi else { + panic!("gnmi config should be enabled for {name}"); + }; + assert_eq!(gnmi.dangerously_skip_tls_verification, expected, "{name}"); + }As per coding guidelines, "Use table-driven test style when writing tests in Rust".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/config.rs` around lines 1889 - 1942, Refactor test_nvue_gnmi_dangerous_tls_skip_defaults_false_and_parses_true in crates::health::config into a table-driven test instead of two duplicated Figment parse/assert flows. Keep the same setup for Config::default, Figment::new, and the Configurable::Enabled unwraps for config.collectors.nvue and nvue.gnmi, but iterate over cases covering omitted and explicitly true dangerously_skip_tls_verification and assert the expected boolean per case. This will make adding future GNMI config variants consistent and cheap.Source: Coding guidelines
crates/health/src/collectors/nvue/gnmi/sample_processor.rs (1)
785-788: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd the missing table case for
interface_link_down_events.The mapping exists, but the new numeric table test does not assert its emitted metric type/unit; the older no-sink test only checks entity count. Add it to keep the mapping table fully covered. As per coding guidelines, verification should exercise the behavior that changed.
Proposed test addition
( &["phy-diag", "state", "plr-bw-loss-percent"], "interface_plr_bw_loss_percent", "percent", ), + ( + &["phy-diag", "state", "unintentional-link-down-events"], + "interface_link_down_events", + "count", + ), ];Also applies to: 1758-2067
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/collectors/nvue/gnmi/sample_processor.rs` around lines 785 - 788, The mapping for interface_link_down_events is present in sample_processor.rs, but the numeric table test still does not verify its emitted metric type and unit. Update the table-driven coverage in the sample processor tests to include the interface_link_down_events case and assert the expected metric name, type, and unit so the new mapping is exercised by the changed-behavior test rather than only the older no-sink entity-count check.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/health/example/config.example.toml`:
- Around line 225-226: The example config is missing the new fan
platform-environment toggle, so update the configuration example around the
platform_environment_* entries to include platform_environment_fan_enabled
alongside platform_environment_temperature_enabled and
platform_environment_status_enabled. Use the existing NvueRestPaths-related
platform environment section as the reference point so the new option is
documented consistently with the other enabled flags.
In `@crates/health/src/collectors/nmxt.rs`:
- Around line 249-250: Remove the free-text Status_Message mapping from
NMXT_LABEL_MAP or otherwise filter it out before MetricSample construction in
nmxt.rs, so sink-agnostic metric events never carry this high-cardinality label.
Update the mapping/collection path around NMXT_LABEL_MAP and the MetricSample
emission logic to keep only stable, low-cardinality labels, and apply the same
fix anywhere else the allowlist is built or reused.
- Around line 479-485: The NMX-T HTTP client currently bypasses TLS verification
unconditionally in the `NmxtCollector` setup. Add a
`dangerously_skip_tls_verification` field to `NmxtCollectorConfig`, thread that
config into the `reqwest::Client::builder()` path, and only call
`danger_accept_invalid_certs(true)` when that option is enabled; otherwise build
the client with normal certificate validation.
In `@crates/health/src/collectors/nvue/gnmi/sample_processor.rs`:
- Around line 771-777: The metric name for the retry-events leaf is incorrect in
the nvue GNMI sample mapping and is currently conflated with the retry-codes
metric. Update the relevant entry in sample_processor.rs for the
plr-xmit-retry-events-within-t-sec-max leaf so the exported name uses an
events-based identifier (not codes), and make the same rename in the matching
duplicate mapping referenced by the reviewer. Keep the surrounding mapping
structure and unit unchanged, and use the existing sample definitions to ensure
the new name stays aligned with the leaf semantics.
- Around line 892-923: Both link parsing helpers currently accept any
successfully parsed f64, so invalid values can flow into MetricSample and
downstream exporters. Update link_width_to_f64 and link_speed_to_gbps to reject
NaN, infinities, and negative numbers after parsing, returning None for those
cases. Keep the validation close to the parsing logic in these two functions,
and add table-driven tests covering finite positives plus NaN, infinity, and
negative edge cases.
In `@crates/health/src/collectors/nvue/rest/collector.rs`:
- Around line 79-86: Update fan_max_speed_to_f64 in the NVUE REST collector to
reject invalid fan RPM values before they are emitted. After parsing the string
into f64, filter out non-finite results using is_finite() and return None for
any value less than 0.0; keep temp_to_f64 unchanged. Add tests covering NaN,
inf, and a negative RPM case to verify fan_max_speed_to_f64 drops them.
---
Nitpick comments:
In `@crates/health/src/collectors/nmxt.rs`:
- Around line 973-1005: The test logic around down_blame emission is duplicating
the collector’s production path instead of exercising it, so it can miss
regressions in scrape_iteration. Extract the post-scrape emission loop from
scrape_iteration into a shared helper that takes parsed NmxtMetricSample values
and performs the down_blame event generation, then call that helper from both
scrape_iteration and the affected tests (including the similar block in the
other test section) so verification covers the real behavior.
In `@crates/health/src/collectors/nvue/gnmi/sample_processor.rs`:
- Around line 785-788: The mapping for interface_link_down_events is present in
sample_processor.rs, but the numeric table test still does not verify its
emitted metric type and unit. Update the table-driven coverage in the sample
processor tests to include the interface_link_down_events case and assert the
expected metric name, type, and unit so the new mapping is exercised by the
changed-behavior test rather than only the older no-sink entity-count check.
In `@crates/health/src/collectors/sensors.rs`:
- Around line 404-422: The current tests only verify SensorRangeKind string
helpers, but they do not cover the actual range-metric emission behavior in
update_sensor and emit_sensor_range_metric. Add a table-driven behavioral test
that exercises include_sensor_thresholds gating, confirms None suppresses
emission, and asserts the emitted sample carries the sensor_range label with the
expected value. Use the existing symbols SensorRangeKind, update_sensor, and
emit_sensor_range_metric so the test stays tied to the contract and not just the
enum strings.
In `@crates/health/src/config.rs`:
- Around line 1889-1942: Refactor
test_nvue_gnmi_dangerous_tls_skip_defaults_false_and_parses_true in
crates::health::config into a table-driven test instead of two duplicated
Figment parse/assert flows. Keep the same setup for Config::default,
Figment::new, and the Configurable::Enabled unwraps for config.collectors.nvue
and nvue.gnmi, but iterate over cases covering omitted and explicitly true
dangerously_skip_tls_verification and assert the expected boolean per case. This
will make adding future GNMI config variants consistent and cheap.
In `@crates/health/src/otlp/convert.rs`:
- Around line 254-258: The datapoint conversion currently copies every entry
from `sample.labels` into `attributes`, which can leak resource identity onto
datapoints. Update the converter boundary so the label-to-attribute mapping in
this conversion path explicitly filters out identity keys such as `switch_id`,
`switch_serial`, and `switch_serial_number` before collecting `attributes`. Add
or extend tests around the same conversion logic to prove these labels are not
propagated when present, not just when labels are empty.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4ec16842-9c93-4c47-a2e2-621bceb0b447
📒 Files selected for processing (14)
crates/health/example/config.example.tomlcrates/health/src/collectors/nmxt.rscrates/health/src/collectors/nvue/gnmi/client.rscrates/health/src/collectors/nvue/gnmi/sample_processor.rscrates/health/src/collectors/nvue/gnmi/subscriber.rscrates/health/src/collectors/nvue/rest/client.rscrates/health/src/collectors/nvue/rest/collector.rscrates/health/src/collectors/nvue/tls.rscrates/health/src/collectors/sensors.rscrates/health/src/config.rscrates/health/src/otlp/convert.rscrates/health/src/otlp/metrics_drain.rscrates/health/src/sink/otlp.rscrates/health/src/sink/prometheus.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…ted mappings Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…VUE REST Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
… sources Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…etheus sink Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…el cardinality fixes Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…ation changes Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…h_serial label
Compose OTLP metric name as {prefix}_{name}_{metric_type}_{unit} to match the
Prometheus sink, and promote switch_serial/switch_id onto datapoint attributes so
Grafana switch dashboards resolve identically across export paths.
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
The NMX-T collector built its reqwest client without danger_accept_invalid_certs, unlike the sibling NVUE REST collector. On minimal runtime images this fails at client build time (native-root-CA load) and the switch serves a self-signed cert anyway, so NMX-T never collected. Match the NVUE REST self-signed handling. Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
tonic 0.14 auto-injects a strict system-root TLS verifier for https:// URIs (Endpoint::from) and layers its own TlsConnector over any custom connector (channel/service/connector.rs). That silently negated the hand-rolled hyper-rustls skip-verify connector, so tonic strictly verified and rejected NVOS's self-signed gNMI cert -- the channel died right after the server Certificate message (opaque 'transport error', no HTTP/2 frames). Use Endpoint::tls_config_with_verifier(ClientTlsConfig::new(), <verifier>) so the AcceptAnyCertVerifier is applied in tonic's own TLS layer; drop the hand-rolled connector. tls.rs now exposes accept_any_cert_verifier() instead of self_signed_tls_config(). Validated on gb-nvl-124-switch06: gNMI SAMPLE+ON_CHANGE streams connect and 86 carbide_hardware_health_nvue_gnmi_* metric families flow via the OtlpSink into VictoriaMetrics. Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
…nfig Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
… config for dev Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
The generated matrix/validation docs were already dropped in 3b0a075 (chore(health): remove temp docs from repo), but the one-shot generator script was missed. It has no callers, its required inputs are not in the repo, and its outputs are no longer tracked, so it cannot run from a clean checkout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
961fd8c to
5eb393d
Compare
…ssage from labels due to high cardinality Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
crates/health/src/config.rs (1)
1058-1066: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert the new path toggles in the example-config test.
The new REST platform-environment flags and gNMI
platform_general_enableddefault/parse surface are not asserted here, so the example can drift without this test catching it.Suggested test coverage
if let Configurable::Enabled(ref rest) = nvue.rest { assert_eq!(rest.poll_interval, Duration::from_secs(60)); assert_eq!(rest.request_timeout, Duration::from_secs(30)); + assert!(rest.paths.platform_environment_fan_enabled); + assert!(rest.paths.platform_environment_temperature_enabled); + assert!(rest.paths.platform_environment_status_enabled); } else { panic!("nvue rest config should be enabled in example config"); } if let Configurable::Enabled(ref gnmi) = nvue.gnmi { assert_eq!(gnmi.gnmi_port, 9339); assert_eq!(gnmi.sample_interval, Duration::from_secs(300)); assert_eq!(gnmi.request_timeout, Duration::from_secs(30)); assert!(!gnmi.dangerously_skip_tls_verification); + assert!(gnmi.paths.platform_general_enabled); assert!(gnmi.system_events_enabled); } else {As per coding guidelines, verification should exercise the behavior that changed.
Also applies to: 1373-1384
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/health/src/config.rs` around lines 1058 - 1066, The example-config test for Config/default parsing is missing assertions for the newly added path toggles, so update that test to explicitly verify the REST platform-environment flags and the gNMI platform_general_enabled default/parse behavior. Use the Config default() surface and the example-config test setup to assert these new fields alongside the existing enabled flags so future drift is caught by the test.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/health/src/config.rs`:
- Around line 1895-1918: The NMX-T TLS config test only covers the default and
explicit true cases, but it misses the enabled [collectors.nmxt] deserialization
path when dangerously_skip_tls_verification is omitted. Update
test_nmxt_dangerous_tls_skip_defaults_false_and_parses_true in
NmxtCollectorConfig/Config to use a table-driven style that covers both omitted
and true inputs, and assert the parsed Configurable::Enabled(nmxt) value matches
the expected default false/explicit true behavior.
---
Nitpick comments:
In `@crates/health/src/config.rs`:
- Around line 1058-1066: The example-config test for Config/default parsing is
missing assertions for the newly added path toggles, so update that test to
explicitly verify the REST platform-environment flags and the gNMI
platform_general_enabled default/parse behavior. Use the Config default()
surface and the example-config test setup to assert these new fields alongside
the existing enabled flags so future drift is caught by the test.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4c29cb3e-6beb-4e6d-a4ed-feb3061ec11a
📒 Files selected for processing (3)
crates/health/example/config.example.tomlcrates/health/src/collectors/nmxt.rscrates/health/src/config.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- crates/health/src/collectors/nmxt.rs
Signed-off-by: mkoci <26286151+mkoci@users.noreply.github.com>
Description
This PR covers the NV Switch port from #2283. It closes the GB200 NVSwitch telemetry gaps for NVUE gNMI streaming, NVUE REST, and NMX-T.
Type of Change
Related Issues
#2283
Testing
Additional Notes
Gated previously insecure TLS verification behind
dangerously_skip_tls_verificationin config surfaceNVOS gNMI endpoints must set
dangerously_skip_tls_verification = trueto connect.