fix(site-explorer): widen duration histogram buckets#2704
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR adds explicit millisecond histogram buckets for site-explorer latency metrics, exposes a public helper for building the view, and registers that view in production and test metric provider setup. ChangesSite Explorer Histogram Bucket Fix
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Pull request overview
This PR improves observability for carbide-site-explorer by configuring explicit OpenTelemetry histogram buckets (in milliseconds) that extend up to 1 hour, so long-running site explorer operations no longer concentrate in the +Inf bucket and become unusable for monitoring.
Changes:
- Add a reusable
site_explorer_latency_histogram_view()helper that builds an explicit-bucket histogram view for site explorer duration metrics. - Export the helper from
carbide-site-explorerand register the views incarbide-api-core’s metrics setup. - Add dependencies (
carbide-metrics-utils,opentelemetry_sdk) and a unit test to ensure the view builder succeeds.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| crates/site-explorer/src/metrics.rs | Adds explicit histogram bucket boundaries + a helper for constructing OTel views; adds a basic build test. |
| crates/site-explorer/src/lib.rs | Re-exports the view helper for use by other crates. |
| crates/site-explorer/Cargo.toml | Adds dependencies needed to build OTel metric views. |
| crates/api-core/src/logging/setup.rs | Registers the new site-explorer histogram views during meter provider construction (and mirrors it in tests). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/site-explorer/src/metrics.rs (1)
769-777: ⚡ Quick winPrefer table-driven cases for the view-construction test.
This test calls the same fallible operation with multiple input variants; converting it to a scenario table will make extension and failure labeling cleaner.
As per coding guidelines, “Reach for a table whenever two or more tests call the same operation with different inputs.”
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/site-explorer/src/metrics.rs` around lines 769 - 777, The test function site_explorer_latency_histogram_views_build calls the same operation site_explorer_latency_histogram_view multiple times with different inputs. Refactor this into a table-driven test by creating a collection of test case tuples containing the input string and a descriptive label for each variant, then iterate over this collection to call the function once per case. This will make the test more maintainable and easier to extend with additional cases in the future.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/site-explorer/src/metrics.rs`:
- Around line 769-777: The test function
site_explorer_latency_histogram_views_build calls the same operation
site_explorer_latency_histogram_view multiple times with different inputs.
Refactor this into a table-driven test by creating a collection of test case
tuples containing the input string and a descriptive label for each variant,
then iterate over this collection to call the function once per case. This will
make the test more maintainable and easier to extend with additional cases in
the future.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: c35b0c86-848f-422b-a183-f8d94cf1ebec
📒 Files selected for processing (4)
crates/api-core/src/logging/setup.rscrates/site-explorer/Cargo.tomlcrates/site-explorer/src/lib.rscrates/site-explorer/src/metrics.rs
|
@coderabbitai full review |
✅ Action performedFull review finished. |
|
/ok to test b6ee2bc |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
crates/site-explorer/src/metrics.rs (1)
766-793: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winThe test asserts construction, not the behavior the PR delivers.
site_explorer_latency_histogram_views_buildonly proves the view builds; it never demonstrates that observations above 10 s land in the extended buckets rather than+Inf. Since that redistribution is the contract under repair, consider a Prometheus round-trip assertion mirroringtest_gauge_aggregationinsetup.rs: record a multi-second value through aMeterProviderwired with this view and assert the encoded output exposes a bucket boundary beyond10000.Separately, the case loop is a hand-rolled table; the repository convention is to express these via
carbide-test-support(scenarios!/check_cases) so failures are labelled by scenario automatically.As per coding guidelines: "Prefer table-driven tests using the
carbide-test-supportcrate withscenarios!... for fallible operations". As per path instructions: "Prefer findings about behavior ... and missing tests over style-only comments."Would you like me to draft a
scenarios!-based replacement plus a bucket-redistribution assertion?🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/site-explorer/src/metrics.rs` around lines 766 - 793, The current test in site_explorer_latency_histogram_views_build only verifies that site_explorer_latency_histogram_view constructs successfully, not that the latency bucket redistribution actually works. Update the tests around site_explorer_latency_histogram_view to perform a Prometheus round-trip similar to test_gauge_aggregation in setup.rs: record a multi-second observation through a MeterProvider configured with this view and assert the exported histogram includes a bucket boundary beyond 10000 instead of collapsing into +Inf. Also replace the hand-rolled case loop with carbide-test-support table-driven coverage using scenarios! or check_cases so each label is reported automatically on failure.Sources: Coding guidelines, Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@crates/site-explorer/src/metrics.rs`:
- Around line 766-793: The current test in
site_explorer_latency_histogram_views_build only verifies that
site_explorer_latency_histogram_view constructs successfully, not that the
latency bucket redistribution actually works. Update the tests around
site_explorer_latency_histogram_view to perform a Prometheus round-trip similar
to test_gauge_aggregation in setup.rs: record a multi-second observation through
a MeterProvider configured with this view and assert the exported histogram
includes a bucket boundary beyond 10000 instead of collapsing into +Inf. Also
replace the hand-rolled case loop with carbide-test-support table-driven
coverage using scenarios! or check_cases so each label is reported automatically
on failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 47dd7e1e-31db-41aa-a436-cda0b44f9ae3
📒 Files selected for processing (4)
crates/api-core/src/logging/setup.rscrates/site-explorer/Cargo.tomlcrates/site-explorer/src/lib.rscrates/site-explorer/src/metrics.rs
🔍 Container Scan Summary
Per-CVE detail lives in the per-service |
7b20b00 to
1f81242
Compare
|
@Susanpdl there problems with some commits missing sign-off signatures, cryptographc signatures, or both. Please review https://github.com/NVIDIA/infra-controller/blob/main/CONTRIBUTING.md and update the PR with properly signed (essentially git commit -s -S ...) commits. |
a706047 to
0db1dc5
Compare
|
Thanks for the review feedback, @mxh-0xbb. I've updated the PR branch with properly signed commits:
Please let me know if anything else is needed. |
|
/ok to test 0db1dc5 |
ajf
left a comment
There was a problem hiding this comment.
lgtm
@kensimon or @akorobkov-nvda if you want to take a look.
|
🌿 Preview your docs: https://nvidia-preview-pull-request-2704.docs.buildwithfern.com/infra-controller |
|
Pushed a fix for the failing
Commit: 8b6ea1b ( Please re-run CI when convenient — happy to address anything else that comes up. |
|
@Susanpdl can you please resolve current merge conflict? Thanks |
Default OpenTelemetry histogram buckets top out at 10 seconds, so site explorer iteration latency observations on production sites land in +Inf. Register explicit millisecond buckets up to one hour for site explorer duration histograms. Fixes NVIDIA#2352 Signed-off-by: Susan Poudel <susanpdl77@gmail.com>
Keep the default millisecond buckets through 10 seconds so sub-second endpoint timings stay useful, then extend the upper range to one hour. Narrow the view filter to carbide_site_explorer_*_latency so histograms that declare seconds units are not given millisecond bucket boundaries. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>
Replace the hand-rolled view-build loop with carbide-test-support scenarios and add a Prometheus round-trip test that records a 30s observation and verifies it lands in the extended bucket range. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>
Satisfy clippy::items_after_test_module by placing mod tests at the end of metrics.rs. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>
8b6ea1b to
d56cb6c
Compare
|
/ok to test d56cb6c |
Satisfy check-format-nightly: merge the metrics re-exports into one use statement, expand the histogram boundary array one element per line, and inline the test meter constructor. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>
Head branch was pushed to by a user without write access
|
@yoks Resolved the merge conflict with I also fixed a Latest commit: c3b983c. Could you kick off another |
|
/ok to test c3b983c |
Summary
Site explorer iteration latency uses the default OpenTelemetry histogram buckets, which top out at 10 seconds. On production sites most observations land in the
+Infbucket, so the metric is not useful for monitoring.This change registers explicit millisecond buckets up to one hour for site explorer duration histograms, including iteration latency and per endpoint exploration duration.
Fixes #2352
Test plan
cargo test -p carbide-site-explorercarbide_site_explorer_iteration_latency_milliseconds_bucketobservations spread across buckets above 10s instead of concentrating in+Inf