fix(site-explorer): widen duration histogram buckets by Susanpdl · Pull Request #2704 · NVIDIA/infra-controller

Susanpdl · 2026-06-19T15:34:34Z

Summary

Site explorer iteration latency uses the default OpenTelemetry histogram buckets, which top out at 10 seconds. On production sites most observations land in the +Inf bucket, so the metric is not useful for monitoring.

This change registers explicit millisecond buckets up to one hour for site explorer duration histograms, including iteration latency and per endpoint exploration duration.
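
To make the bucket problem concrete, here is a minimal, standard-library-only sketch of how an observation is assigned to an explicit bucket. The first 15 boundaries are the OpenTelemetry defaults; the seven values above 10,000 ms are assumptions chosen for illustration, not the boundaries this PR registers, and `bucket_upper_bound` is a hypothetical helper rather than part of any SDK.

```rust
/// Illustrative boundaries in milliseconds: the OpenTelemetry defaults
/// (0 through 10_000) followed by an ASSUMED extension up to one hour.
const BOUNDARIES_MS: [f64; 22] = [
    0.0, 5.0, 10.0, 25.0, 50.0, 75.0, 100.0, 250.0, 500.0, 750.0, 1_000.0, 2_500.0, 5_000.0,
    7_500.0, 10_000.0, // end of the default range (10 s)
    30_000.0, 60_000.0, 120_000.0, 300_000.0, 600_000.0, 1_800_000.0, 3_600_000.0,
];

/// Returns the upper bound (`le`) of the first bucket that contains the value,
/// or `None` when the value exceeds every boundary and falls into +Inf.
/// Buckets are upper-inclusive, so a value equal to a boundary belongs to it.
fn bucket_upper_bound(boundaries: &[f64], value_ms: f64) -> Option<f64> {
    boundaries.iter().copied().find(|&bound| value_ms <= bound)
}

fn main() {
    let defaults_only = &BOUNDARIES_MS[..15];
    let slow_iteration_ms = 45_000.0; // a 45 s iteration

    // With only the default boundaries the observation has no finite bucket.
    println!("{:?}", bucket_upper_bound(defaults_only, slow_iteration_ms));
    // With the extended boundaries it lands in a finite bucket.
    println!("{:?}", bucket_upper_bound(&BOUNDARIES_MS, slow_iteration_ms));
}
```

With the default range alone, every iteration slower than 10 s is indistinguishable from every other; the extension only needs enough finite boundaries above 10,000 ms to separate them.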

Fixes #2352

Test plan

- Added a unit test that the histogram views build successfully
- CI: `cargo test -p carbide-site-explorer`
- After deploy, confirm `carbide_site_explorer_iteration_latency_milliseconds_bucket` observations spread across buckets above 10s instead of concentrating in +Inf

copy-pr-bot · 2026-06-19T15:34:38Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-19T15:34:55Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Walkthrough

This PR adds explicit millisecond histogram buckets for site-explorer latency metrics, exposes a public helper for building the view, and registers that view in production and test metric provider setup.

Changes

Site Explorer Histogram Bucket Fix

| Layer / File(s) | Summary |
| --- | --- |
| Histogram view helper and tests: `crates/site-explorer/Cargo.toml`, `crates/site-explorer/src/metrics.rs` | Adds telemetry dependencies, updates metric imports, defines explicit millisecond boundaries, implements `site_explorer_latency_histogram_view`, and adds unit coverage for name-filter handling and Prometheus bucket placement. |
| Public re-export and meter wiring: `crates/site-explorer/src/lib.rs`, `crates/api-core/src/logging/setup.rs` | Re-exports `site_explorer_latency_histogram_view` from the crate root and adds the site-explorer latency histogram views to both `create_metrics` and `test_gauge_aggregation`. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title is concise and accurately reflects the main change to site-explorer histogram buckets. |
| Description check | ✅ Passed | The description is clearly related to the histogram bucket changes and test coverage. |
| Linked Issues check | ✅ Passed | The changes replace default site-explorer histogram buckets with explicit millisecond buckets, matching `#2352`. |
| Out of Scope Changes check | ✅ Passed | The added tests, exports, and dependencies support the bucket change and do not appear unrelated. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |

Copilot

Pull request overview

This PR improves observability for carbide-site-explorer by configuring explicit OpenTelemetry histogram buckets (in milliseconds) that extend up to 1 hour, so long-running site explorer operations no longer concentrate in the +Inf bucket and become unusable for monitoring.

Changes:

- Add a reusable `site_explorer_latency_histogram_view()` helper that builds an explicit-bucket histogram view for site explorer duration metrics.
- Export the helper from `carbide-site-explorer` and register the views in `carbide-api-core`’s metrics setup.
- Add dependencies (`carbide-metrics-utils`, `opentelemetry_sdk`) and a unit test to ensure the view builder succeeds.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `crates/site-explorer/src/metrics.rs` | Adds explicit histogram bucket boundaries + a helper for constructing OTel views; adds a basic build test. |
| `crates/site-explorer/src/lib.rs` | Re-exports the view helper for use by other crates. |
| `crates/site-explorer/Cargo.toml` | Adds dependencies needed to build OTel metric views. |
| `crates/api-core/src/logging/setup.rs` | Registers the new site-explorer histogram views during meter provider construction (and mirrors it in tests). |

coderabbitai

🧹 Nitpick comments (1)

crates/site-explorer/src/metrics.rs (1)
769-777: ⚡ Quick win

Prefer table-driven cases for the view-construction test.

This test calls the same fallible operation with multiple input variants; converting it to a scenario table will make extension and failure labeling cleaner.

As per coding guidelines, “Reach for a table whenever two or more tests call the same operation with different inputs.”
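
A table-driven version of such a test might look like the sketch below. It uses only the standard library; `build_view` is a hypothetical stand-in for the real fallible view builder (its validation rule is invented for the example), and the instrument names are illustrative. The repository's own table helpers are not used here because their API is not shown in this thread.

```rust
/// Hypothetical stand-in for a fallible view builder: it rejects an empty
/// instrument-name filter and otherwise returns the filter unchanged.
fn build_view(name_filter: &str) -> Result<String, String> {
    if name_filter.is_empty() {
        return Err("instrument name filter must not be empty".to_string());
    }
    Ok(name_filter.to_string())
}

/// Runs every case and returns the labels of the cases whose outcome
/// did not match the expectation, so failures are reported by name.
fn failing_cases(cases: &[(&str, &str, bool)]) -> Vec<String> {
    cases
        .iter()
        .filter(|(_, input, should_build)| build_view(input).is_ok() != *should_build)
        .map(|(label, _, _)| label.to_string())
        .collect()
}

fn main() {
    // Each row: (label, input, expected to build?)
    let cases = [
        ("exact name", "carbide_site_explorer_iteration_latency", true),
        ("wildcard name", "carbide_site_explorer_*_latency", true),
        ("empty filter", "", false),
    ];
    let failures = failing_cases(&cases);
    assert!(failures.is_empty(), "failing cases: {failures:?}");
}
```

Adding a scenario is then a one-line change to the table, and a failure names the row that broke.
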
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/metrics.rs` around lines 769 - 777, The test
function site_explorer_latency_histogram_views_build calls the same operation
site_explorer_latency_histogram_view multiple times with different inputs.
Refactor this into a table-driven test by creating a collection of test case
tuples containing the input string and a descriptive label for each variant,
then iterate over this collection to call the function once per case. This will
make the test more maintainable and easier to extend with additional cases in
the future.
Source: Coding guidelines

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: c35b0c86-848f-422b-a183-f8d94cf1ebec

📥 Commits

Reviewing files that changed from the base of the PR and between d5e5b61 and c623bed.

📒 Files selected for processing (4)

crates/api-core/src/logging/setup.rs
crates/site-explorer/Cargo.toml
crates/site-explorer/src/lib.rs
crates/site-explorer/src/metrics.rs

ianderson-nvidia · 2026-06-24T16:17:22Z

@coderabbitai full review

coderabbitai · 2026-06-24T16:17:31Z

✅ Action performed

Full review finished.

ianderson-nvidia · 2026-06-24T16:17:51Z

/ok to test b6ee2bc

coderabbitai

🧹 Nitpick comments (1)

crates/site-explorer/src/metrics.rs (1)
766-793: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

The test asserts construction, not the behavior the PR delivers.

site_explorer_latency_histogram_views_build only proves the view builds; it never demonstrates that observations above 10 s land in the extended buckets rather than +Inf. Since that redistribution is the contract under repair, consider a Prometheus round-trip assertion mirroring test_gauge_aggregation in setup.rs: record a multi-second value through a MeterProvider wired with this view and assert the encoded output exposes a bucket boundary beyond 10000.
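
The behavior such a round-trip test would check can be sketched without any exporter. The example below mimics the cumulative `_bucket{le=...}` series of a Prometheus histogram using only the standard library; the boundary lists are shortened, assumed values, and both functions are hypothetical helpers, not the `MeterProvider` wiring the review asks for.

```rust
/// Cumulative counts per upper bound, mimicking Prometheus `_bucket{le=...}` series.
/// The final entry uses `f64::INFINITY` to represent the `+Inf` bucket.
fn cumulative_buckets(boundaries: &[f64], observations: &[f64]) -> Vec<(f64, u64)> {
    boundaries
        .iter()
        .copied()
        .chain(std::iter::once(f64::INFINITY))
        .map(|le| {
            let count = observations.iter().filter(|&&v| v <= le).count() as u64;
            (le, count)
        })
        .collect()
}

/// Smallest finite `le` whose cumulative count is non-zero, if any.
fn first_populated_finite_bucket(buckets: &[(f64, u64)]) -> Option<f64> {
    buckets
        .iter()
        .find(|(le, count)| le.is_finite() && *count > 0)
        .map(|(le, _)| *le)
}

fn main() {
    // Assumed boundaries: a shortened default range plus an illustrative extension.
    let default_only = [1_000.0, 5_000.0, 10_000.0];
    let extended = [1_000.0, 5_000.0, 10_000.0, 60_000.0, 3_600_000.0];
    let observations = [30_000.0]; // one 30 s observation, in milliseconds

    let before = cumulative_buckets(&default_only, &observations);
    let after = cumulative_buckets(&extended, &observations);

    // Before: only +Inf counts the observation, so no finite bucket is populated.
    println!("{:?}", first_populated_finite_bucket(&before));
    // After: a finite bucket above 10_000 counts it.
    println!("{:?}", first_populated_finite_bucket(&after));
}
```

A real test would make the same assertion against the encoded Prometheus output instead of an in-memory vector.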

Separately, the case loop is a hand-rolled table; the repository convention is to express these via carbide-test-support (scenarios! / check_cases) so failures are labelled by scenario automatically.

As per coding guidelines: "Prefer table-driven tests using the carbide-test-support crate with scenarios! ... for fallible operations". As per path instructions: "Prefer findings about behavior ... and missing tests over style-only comments."

Would you like me to draft a scenarios!-based replacement plus a bucket-redistribution assertion?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/site-explorer/src/metrics.rs` around lines 766 - 793, The current test
in site_explorer_latency_histogram_views_build only verifies that
site_explorer_latency_histogram_view constructs successfully, not that the
latency bucket redistribution actually works. Update the tests around
site_explorer_latency_histogram_view to perform a Prometheus round-trip similar
to test_gauge_aggregation in setup.rs: record a multi-second observation through
a MeterProvider configured with this view and assert the exported histogram
includes a bucket boundary beyond 10000 instead of collapsing into +Inf. Also
replace the hand-rolled case loop with carbide-test-support table-driven
coverage using scenarios! or check_cases so each label is reported automatically
on failure.
Sources: Coding guidelines, Path instructions

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 47dd7e1e-31db-41aa-a436-cda0b44f9ae3

📥 Commits

Reviewing files that changed from the base of the PR and between d5e5b61 and b6ee2bc.

📒 Files selected for processing (4)

crates/api-core/src/logging/setup.rs
crates/site-explorer/Cargo.toml
crates/site-explorer/src/lib.rs
crates/site-explorer/src/metrics.rs

github-actions · 2026-06-24T17:44:27Z

🔍 Container Scan Summary

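| Service | Total | Critical | High | Medium | Low | Other |
| --- | --- | --- | --- | --- | --- | --- |
| boot-artifacts-aarch64 | 3 | 0 | 0 | 3 | 0 | 0 |
| boot-artifacts-x86_64 | 3 | 0 | 0 | 3 | 0 | 0 |
| forge-admin-cli-x86_64 | 288 | 6 | 26 | 105 | 7 | 144 |
| machine-validation-runner | 751 | 30 | 190 | 274 | 36 | 221 |
| machine_validation | 751 | 30 | 190 | 274 | 36 | 221 |
| machine_validation-aarch64 | 751 | 30 | 190 | 274 | 36 | 221 |
| nvmetal-carbide | 751 | 30 | 190 | 274 | 36 | 221 |
| TOTAL | 3298 | 126 | 786 | 1207 | 151 | 1028 |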

Per-CVE detail lives in the per-service grype-* artifacts (JSON + SARIF). Severity counts only — no CVE IDs published here.

mxh-0xbb · 2026-06-25T07:27:35Z

@Susanpdl there are problems with some commits missing sign-off signatures, cryptographic signatures, or both. Please review https://github.com/NVIDIA/infra-controller/blob/main/CONTRIBUTING.md and update the PR with properly signed (essentially `git commit -s -S ...`) commits.

Susanpdl · 2026-06-25T12:49:13Z

Thanks for the review feedback, @mxh-0xbb.

I've updated the PR branch with properly signed commits:

- All commits now include DCO Signed-off-by trailers
- All commits are cryptographically signed with my SSH signing key and show as Verified on GitHub

Please let me know if anything else is needed.

ajf · 2026-06-25T17:35:02Z

/ok to test 0db1dc5

ajf

lgtm

@kensimon or @akorobkov-nvda if you want to take a look.

github-actions · 2026-06-25T17:50:12Z

🌿 Preview your docs: https://nvidia-preview-pull-request-2704.docs.buildwithfern.com/infra-controller

Susanpdl · 2026-06-25T18:33:24Z

Pushed a fix for the failing lint-police check.

clippy::items_after_test_module was triggered because mod tests in crates/site-explorer/src/metrics.rs sat above MetricHolder. I moved the test module to the end of the file.
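
For readers unfamiliar with the lint, the layout it asks for can be sketched as follows. The `MetricHolder` here is a simplified, hypothetical stand-in (its field and method are invented for the example), not the type in `metrics.rs`.

```rust
/// Hypothetical stand-in for the item that used to sit below the test module.
#[derive(Default)]
pub struct MetricHolder {
    pub last_iteration_ms: f64,
}

impl MetricHolder {
    /// Stores the most recent iteration latency in milliseconds.
    pub fn record(&mut self, latency_ms: f64) {
        self.last_iteration_ms = latency_ms;
    }
}

fn main() {
    let mut holder = MetricHolder::default();
    holder.record(1_500.0);
    println!("last iteration: {} ms", holder.last_iteration_ms);
}

// The test module comes last: no items follow it, which is the layout
// clippy::items_after_test_module asks for.
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn record_keeps_latest_value() {
        let mut holder = MetricHolder::default();
        holder.record(42.0);
        assert_eq!(holder.last_iteration_ms, 42.0);
    }
}
```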

Commit: 8b6ea1b (fix(site-explorer): move tests after MetricHolder for clippy)

Please re-run CI when convenient — happy to address anything else that comes up.

yoks · 2026-06-29T16:56:32Z

@Susanpdl can you please resolve current merge conflict? Thanks

Default OpenTelemetry histogram buckets top out at 10 seconds, so site explorer iteration latency observations on production sites land in +Inf. Register explicit millisecond buckets up to one hour for site explorer duration histograms. Fixes NVIDIA#2352 Signed-off-by: Susan Poudel <susanpdl77@gmail.com>

Keep the default millisecond buckets through 10 seconds so sub-second endpoint timings stay useful, then extend the upper range to one hour. Narrow the view filter to carbide_site_explorer_*_latency so histograms that declare seconds units are not given millisecond bucket boundaries. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>
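
The effect of the narrower filter can be illustrated with a small wildcard matcher. This is a standard-library sketch, assuming `*` matches any run of characters as the filter above suggests; `wildcard_match` is a hypothetical helper rather than the SDK's matching code, and `carbide_site_explorer_example_duration` is an invented name standing in for a histogram that declares seconds.

```rust
/// Matches `name` against `pattern`, where `*` stands for any run of characters
/// (including an empty one). Every other character must match literally.
fn wildcard_match(pattern: &str, name: &str) -> bool {
    let (p, n) = (pattern.as_bytes(), name.as_bytes());
    let (mut pi, mut ni) = (0, 0);
    // Position of the most recent `*` and the name index it was last tried at.
    let mut backtrack: Option<(usize, usize)> = None;

    while ni < n.len() {
        if pi < p.len() && p[pi] == b'*' {
            // Remember the star and first try matching it against nothing.
            backtrack = Some((pi, ni));
            pi += 1;
        } else if pi < p.len() && p[pi] == n[ni] {
            pi += 1;
            ni += 1;
        } else if let Some((star_pi, star_ni)) = backtrack {
            // Let the last `*` absorb one more character and retry.
            pi = star_pi + 1;
            ni = star_ni + 1;
            backtrack = Some((star_pi, star_ni + 1));
        } else {
            return false;
        }
    }
    // Any pattern characters left over must all be `*`.
    p[pi..].iter().all(|&c| c == b'*')
}

fn main() {
    let filter = "carbide_site_explorer_*_latency";
    for name in [
        "carbide_site_explorer_iteration_latency", // assumed millisecond histogram
        "carbide_site_explorer_example_duration",  // invented seconds histogram
    ] {
        println!("{name}: {}", wildcard_match(filter, name));
    }
}
```

Only names ending in `_latency` pick up the millisecond boundaries; anything else under the `carbide_site_explorer_` prefix keeps its existing aggregation.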

Replace the hand-rolled view-build loop with carbide-test-support scenarios and add a Prometheus round-trip test that records a 30s observation and verifies it lands in the extended bucket range. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>

Satisfy clippy::items_after_test_module by placing mod tests at the end of metrics.rs. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>

yoks · 2026-06-29T18:48:51Z

/ok to test d56cb6c

Satisfy check-format-nightly: merge the metrics re-exports into one use statement, expand the histogram boundary array one element per line, and inline the test meter constructor. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>

Susanpdl · 2026-06-29T19:13:43Z

@yoks Resolved the merge conflict with main (rebased onto latest, kept both bmc-mock and the new opentelemetry-prometheus/prometheus dev-deps, and preserved endpoint_exploration_step_latency handling in MetricHolder).

I also fixed a check-format-nightly failure from the last run — applied nightly rustfmt to crates/site-explorer (merged the metrics re-exports, expanded the histogram boundary array, inlined the test meter constructor).

Latest commit: c3b983c. Could you kick off another /ok to test when you get a chance? Thanks!

yoks · 2026-06-29T19:37:16Z

/ok to test c3b983c

Susanpdl requested a review from a team as a code owner June 19, 2026 15:34

Copilot AI review requested due to automatic review settings June 19, 2026 15:34

Copilot started reviewing on behalf of Susanpdl June 19, 2026 15:35 View session

Copilot AI reviewed Jun 19, 2026

Comment thread crates/site-explorer/src/metrics.rs

Comment thread crates/api-core/src/logging/setup.rs

Comment thread crates/api-core/src/logging/setup.rs

coderabbitai Bot reviewed Jun 19, 2026

coderabbitai Bot reviewed Jun 24, 2026

Susanpdl force-pushed the fix/site-explorer-histogram-buckets branch from 7b20b00 to 1f81242 Compare June 25, 2026 01:26

Susanpdl force-pushed the fix/site-explorer-histogram-buckets branch 2 times, most recently from a706047 to 0db1dc5 Compare June 25, 2026 12:39

ajf approved these changes Jun 25, 2026

Susanpdl added 4 commits June 29, 2026 14:47

fix(site-explorer): move tests after MetricHolder for clippy

d56cb6c

Satisfy clippy::items_after_test_module by placing mod tests at the end of metrics.rs. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>

Susanpdl force-pushed the fix/site-explorer-histogram-buckets branch from 8b6ea1b to d56cb6c Compare June 29, 2026 18:47

yoks approved these changes Jun 29, 2026

yoks enabled auto-merge (squash) June 29, 2026 18:54

style(site-explorer): apply nightly rustfmt formatting

c3b983c

Satisfy check-format-nightly: merge the metrics re-exports into one use statement, expand the histogram boundary array one element per line, and inline the test meter constructor. Signed-off-by: Susan Poudel <susanpdl77@gmail.com>

auto-merge was automatically disabled June 29, 2026 19:11
Head branch was pushed to by a user without write access

yoks approved these changes Jun 29, 2026

yoks enabled auto-merge (squash) June 29, 2026 19:37

yoks merged commit 68fc56e into NVIDIA:main Jun 29, 2026
56 checks passed

6 participants
