feat(observability): propagate W3C trace context across nico-api's network boundaries#2700
Conversation
…twork boundaries nico-api now accepts and produces W3C Trace Context headers (traceparent, tracestate) at its network boundaries, so a request already traced upstream stays one trace as it flows through nico-api and on to the services it calls. - Install the standard W3C TraceContextPropagator at startup (the OTel default is no-op). - Ingress (REST + gRPC): one shared per-request layer extracts the inbound context and parents the request span to it; with no valid inbound context the span is a fresh root. - Egress: gRPC clients are wrapped in a tower TraceInjectService; reqwest clients are built through reqwest-tracing's middleware; the two OAuth2 token providers inject manually, since the oauth2 crate builds their requests. - The local tracing-enabled flag stays the master switch: an inbound sampled flag never forces recording here. Shared extract/inject glue and the tower middleware live in a new trace-propagation crate, with unit tests and an ingress->egress round-trip. Per-client details and maintainer guidance are in docs/observability/tracing.md. Closes NVIDIA#2438. Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: ⛔ Files ignored due to path filters (1)
📒 Files selected for processing (3)
💤 Files with no reviewable changes (1)
🚧 Files skipped from review as they are similar to previous changes (2)
Summary by CodeRabbit
WalkthroughAdds a new ChangesW3C Trace Context Propagation
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2700.docs.buildwithfern.com/infra-controller |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/nras/src/client.rs`:
- Around line 52-57: The HTTP client created in the new_with_config method lacks
timeout configuration, which can cause indefinite blocking during network issues
or NRAS service outages. Add a timeout field to the Config struct to make it
configurable, then apply this timeout to the reqwest client builder by calling
the .timeout() method on the ClientBuilder chain before calling .build(). Ensure
the timeout value is propagated from Config and applied either at the client
level in new_with_config or on individual requests made through methods like
attest_gpu to enforce bounded wait times for external service calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 931e99d9-7c3a-48f6-9587-76664d251c90
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
📒 Files selected for processing (32)
Cargo.tomlcrates/api-core/Cargo.tomlcrates/api-core/src/handlers/redfish.rscrates/api-core/src/logging/api_logs.rscrates/api-core/src/logging/setup.rscrates/api-core/src/machine_identity/token_exchange.rscrates/api-web/Cargo.tomlcrates/api-web/src/auth.rscrates/component-manager/Cargo.tomlcrates/component-manager/src/nsm.rscrates/component-manager/src/psm.rscrates/component-manager/src/tls.rscrates/firmware/Cargo.tomlcrates/firmware/src/downloader.rscrates/libmlx/Cargo.tomlcrates/libmlx/src/firmware/source.rscrates/libnmxc/Cargo.tomlcrates/libnmxc/src/lib.rscrates/libnmxc/src/nmxc_api.rscrates/libnmxm/Cargo.tomlcrates/libnmxm/src/lib.rscrates/mqttea/Cargo.tomlcrates/mqttea/src/auth/oauth2_provider.rscrates/nras/Cargo.tomlcrates/nras/src/client.rscrates/nras/src/lib.rscrates/rpc/Cargo.tomlcrates/rpc/src/forge_tls_client.rscrates/trace-propagation/Cargo.tomlcrates/trace-propagation/src/lib.rscrates/trace-propagation/tests/no_global_propagator.rsdocs/observability/tracing.md
| pub fn new_with_config(config: &crate::Config) -> NrasVerifierClient { | ||
| NrasVerifierClient { | ||
| config: config.clone(), | ||
| http_client: reqwest::Client::new(), | ||
| http_client: reqwest_middleware::ClientBuilder::new(reqwest::Client::new()) | ||
| .with(reqwest_tracing::TracingMiddleware::default()) | ||
| .build(), |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Verify timeout configuration on the NRAS egress path.
rg -n "NrasVerifierClient|Client::new\\(|Client::builder\\(|connect_timeout\\(|\\.timeout\\(" crates/nras/src/client.rs -C3Repository: NVIDIA/infra-controller
Length of output: 1072
🏁 Script executed:
#!/bin/bash
# Search for Config definition and any timeout fields
rg -n "struct Config|timeout|Timeout" crates/nras/src/ -A2 -B2 | head -100Repository: NVIDIA/infra-controller
Length of output: 329
🏁 Script executed:
#!/bin/bash
# Search for attest_gpu implementation and how the http_client is used
rg -n "async fn attest_gpu" crates/nras/src/client.rs -A 30Repository: NVIDIA/infra-controller
Length of output: 2719
🏁 Script executed:
#!/bin/bash
# Check if reqwest::Client has any builder methods or default timeout behavior
fd -e rs -e toml . crates/nras/ | xargs rg -l "reqwest" | head -5Repository: NVIDIA/infra-controller
Length of output: 167
🏁 Script executed:
#!/bin/bash
# Get the full Config struct definition
rg -n "pub struct Config" crates/nras/src/lib.rs -A 20Repository: NVIDIA/infra-controller
Length of output: 699
🏁 Script executed:
#!/bin/bash
# Search across all crates for any timeout configuration patterns for reqwest clients
rg -n "\.timeout\(|connect_timeout\(" --type rs crates/ | grep -i "nras\|http_client" | head -20Repository: NVIDIA/infra-controller
Length of output: 94
🏁 Script executed:
#!/bin/bash
# Check Cargo.toml for reqwest version and any timeout-related dependencies
cat crates/nras/Cargo.tomlRepository: NVIDIA/infra-controller
Length of output: 1427
🏁 Script executed:
#!/bin/bash
# Verify reqwest default timeout behavior and check if any global timeout patterns exist elsewhere
rg -n "timeout\(" crates/nras/ -iRepository: NVIDIA/infra-controller
Length of output: 49
Configure explicit timeout bounds on the NRAS HTTP client.
The reqwest::Client::new() at line 55 lacks any timeout configuration, permitting indefinite blocking during NRAS outages or network degradation. The Config struct contains no timeout fields, and individual requests in attest_gpu (line 81) are not wrapped with timeout constraints. Propagate a configurable timeout through Config and apply it via .timeout() on the client or per-request to ensure bounded wait times during external service calls.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/nras/src/client.rs` around lines 52 - 57, The HTTP client created in
the new_with_config method lacks timeout configuration, which can cause
indefinite blocking during network issues or NRAS service outages. Add a timeout
field to the Config struct to make it configurable, then apply this timeout to
the reqwest client builder by calling the .timeout() method on the ClientBuilder
chain before calling .build(). Ensure the timeout value is propagated from
Config and applied either at the client level in new_with_config or on
individual requests made through methods like attest_gpu to enforce bounded wait
times for external service calls.
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
|
@akorobkov-nvda do you want to take a look at this? |
polarweasel
left a comment
There was a problem hiding this comment.
Sent you a replacement for tracing.md in Slack, @adnandnv. Drop it in and this should be good to go, docs-wise.
…e-propagation Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com>
Surround every fenced code block with a blank line before and after so Fern/CommonMark no longer mis-parses an un-spaced fence (notably the TOML/bash blocks indented inside the TL;DR list) and bleeds a large span of the page into a single code block. Also repoint the core-metrics catalogue link to the in-repo core_metrics.md and apply minor prose copy-edits. Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com>
…e-propagation Signed-off-by: Adnan Dosani <258082943+adnandnv@users.noreply.github.com> # Conflicts: # crates/firmware/src/downloader.rs
Thanks @polarweasel! Done. |
Implements W3C Trace Context propagation for nico-api so a request traced upstream keeps one trace as
it passes through nico-api and into the services it calls.
TraceContextPropagatorat startup (the OTel default is no-op).traceparent/tracestateand parents the request span to it; no valid inbound context → fresh root.
TraceInjectService; reqwest clients go throughreqwest-tracing's middleware; the two OAuth2 token providers inject manually (theoauth2cratebuilds their requests).
tracing-enabledswitch stays the master control — an inboundsampledflag can't force local recording.Per-client details and how to add a new client are in
docs/observability/tracing.md.Related issues
#2438
Type of Change
Breaking Changes
Testing
Additional Notes
bmc-proxyandhardware-healthare intentionally not instrumented:they build no span exporter, so injecting there would be dead code.
docs/observability/tracing.md§1.7documents how to add propagation to a new client if that changes.
trace-propagationholds the shared glue + tower middleware.tracing.mdmarker references are a doc-staleness fix, not part of this feature. The doc stilldescribed the sampler's old
code.namespacemarker; fix: Fixes for OTLP tracing #2268 replaced it withcarbide.trace_rootwithout updating the doc. The
code.namespace→carbide.trace_rootdoc edits just correct that.(The doc's
ParentBasedmention is also updated, since this PR unwraps the sampler.)opentelemetry-http0.31,reqwest-middleware0.5,reqwest-tracing0.7.