Skip to content

Conversation

@SkArchon
Copy link
Contributor

@SkArchon SkArchon commented Sep 10, 2025

This is a PR to debug operation planning time, not for merge

Summary by CodeRabbit

  • New Features

    • Added time-based warnings in logs when operation planning exceeds defined thresholds (e.g., high and upper-limit alerts).
    • Continued collection of planning-time metrics for visibility into performance.
  • Chores

    • Standardized planning-time thresholds across metric backends for consistent observability.
    • No changes to public APIs or configuration.

Checklist

  • I have discussed my proposed changes in an issue and have received approval to proceed.
  • I have followed the coding standards of the project.
  • Tests or benchmarks have been added or updated.
  • Documentation has been updated on https://github.com/wundergraph/cosmo-docs.
  • I have read the Contributors Guide.

@coderabbitai
Copy link

coderabbitai bot commented Sep 10, 2025

Walkthrough

Introduces time-based logging thresholds for operation planning time in both OTLP and Prometheus metric stores. Logs “high” and “upper limit” messages based on 5s and 10s thresholds before recording to existing histograms. No public APIs or signatures changed.

Changes

Cohort / File(s) Summary of modifications
Operation planning time logging & thresholds
router/pkg/metric/otlp_metric_store.go, router/pkg/metric/prom_metric_store.go
Added pre-recording logging based on two time thresholds (5s, 10s). If >= upper limit, log “upper limit”; else if >= limit, log “high”. Preserved existing histogram recording. In Prom store, introduced constants and time import.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Title Check ⚠️ Warning The title uses a “wip:” prefix and frames the change as temporary debugging rather than clearly summarizing the main feature of adding threshold-based logging for operation planning time, making it both noisy and vague. Please remove the “wip:” prefix and rephrase to succinctly describe the core change, for example “Add threshold-based logging for operation planning time.”
✅ Passed checks (2 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link

github-actions bot commented Sep 10, 2025

Router-nonroot image scan passed

✅ No security vulnerabilities found in image:

ghcr.io/wundergraph/cosmo/router:sha-6cac283235f94ac76f23d9bf020087a1b70ba75e-nonroot

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
router/pkg/metric/prom_metric_store.go (3)

10-10: Import is fine; consider keeping thresholds decoupled from this file.

"time" is only needed to compute thresholds added below. If you later hoist thresholds into a shared file (see next comment), you can drop this import here.


18-21: Name/units clarity for thresholds; consider centralizing.

  • The constants are milliseconds (ms) expressed as float64. Consider either:
    • Renaming to include units (e.g., operationPlanningWarnMs/UpperMs), or
    • Defining as time.Duration in a common file (e.g., thresholds.go) and converting at call sites.
  • Place these in a shared package file (same package) so OTLP/Prom stores don’t depend on one file defining them.

No code change required for this PR since it’s debug-only, but centralizing avoids accidental removal/rename drift.


148-153: Use clearer messages, include thresholds, and bump severity for upper limit.

Current logs are a bit awkward and lack threshold context. Suggest WARN on upper limit and INFO on warn threshold, include units and threshold values, and add a “store” field for filtering.

-	if planningTime >= upperTimeLimit {
-		h.logger.Info("prometheus: the operation planning time upper limit", zap.Float64("planningTime", planningTime))
-	} else if planningTime >= timeLimit {
-		h.logger.Info("prometheus: the operation planning time high", zap.Float64("planningTime", planningTime))
-	}
+	if planningTime >= upperTimeLimit {
+		h.logger.Warn("operation planning time exceeded upper threshold",
+			zap.Float64("planningTimeMs", planningTime),
+			zap.Float64("thresholdMs", upperTimeLimit),
+			zap.String("store", "prometheus"),
+		)
+	} else if planningTime >= timeLimit {
+		h.logger.Info("operation planning time above warn threshold",
+			zap.Float64("planningTimeMs", planningTime),
+			zap.Float64("thresholdMs", timeLimit),
+			zap.String("store", "prometheus"),
+		)
+	}
router/pkg/metric/otlp_metric_store.go (2)

137-141: Mirror the Prom store logging improvements and include thresholds.

Align message wording, severity, and fields for easier cross-store analysis.

-	if planningTime >= upperTimeLimit {
-		h.logger.Info("otlp: the operation planning time upper limit", zap.Float64("planningTime", planningTime))
-	} else if planningTime >= timeLimit {
-		h.logger.Info("otlp: the operation planning time high", zap.Float64("planningTime", planningTime))
-	}
+	if planningTime >= upperTimeLimit {
+		h.logger.Warn("operation planning time exceeded upper threshold",
+			zap.Float64("planningTimeMs", planningTime),
+			zap.Float64("thresholdMs", upperTimeLimit),
+			zap.String("store", "otlp"),
+		)
+	} else if planningTime >= timeLimit {
+		h.logger.Info("operation planning time above warn threshold",
+			zap.Float64("planningTimeMs", planningTime),
+			zap.Float64("thresholdMs", timeLimit),
+			zap.String("store", "otlp"),
+		)
+	}

137-141: Avoid cross-file constant coupling.

upperTimeLimit/timeLimit are defined in prom_metric_store.go. They’re package-level so this compiles, but it’s fragile. Move them into a small shared file in this package (e.g., router/pkg/metric/thresholds.go) to prevent accidental breakage if the Prom file changes.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 61d53ce and 4f1679e.

📒 Files selected for processing (2)
  • router/pkg/metric/otlp_metric_store.go (1 hunks)
  • router/pkg/metric/prom_metric_store.go (2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.

Applied to files:

  • router/pkg/metric/otlp_metric_store.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.

Applied to files:

  • router/pkg/metric/prom_metric_store.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.

Applied to files:

  • router/pkg/metric/prom_metric_store.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: build-router
  • GitHub Check: build_push_image
  • GitHub Check: build_test
  • GitHub Check: image_scan (nonroot)
  • GitHub Check: integration_test (./telemetry)
  • GitHub Check: integration_test (./events)
  • GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
  • GitHub Check: image_scan
  • GitHub Check: build_push_image (nonroot)
  • GitHub Check: Analyze (go)
  • GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
router/pkg/metric/prom_metric_store.go (1)

148-153: Unit consistency is correct – no changes needed
MeasureOperationPlanningTime accepts a time.Duration, converts it to milliseconds (float64(duration)/float64(time.Millisecond)), and then compares against the timeLimit/upperTimeLimit (also in ms), so thresholds align.

@SkArchon SkArchon changed the title chore: log messages for debugging operation planning wip: log messages for debugging operation planning Sep 10, 2025
Copy link
Member

@endigma endigma left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no-op to take the PR out of my review queue

@github-actions
Copy link

github-actions bot commented Oct 9, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added Stale and removed Stale labels Oct 9, 2025
@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Oct 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants