-
Notifications
You must be signed in to change notification settings - Fork 187
wip: log messages for debugging operation planning #2204
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughIntroduces time-based logging thresholds for operation planning time in both OTLP and Prometheus metric stores. Logs “high” and “upper limit” messages based on 5s and 10s thresholds before recording to existing histograms. No public APIs or signatures changed. Changes
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Comment |
Router-nonroot image scan passed✅ No security vulnerabilities found in image: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
router/pkg/metric/prom_metric_store.go (3)
10-10: Import is fine; consider keeping thresholds decoupled from this file."time" is only needed to compute thresholds added below. If you later hoist thresholds into a shared file (see next comment), you can drop this import here.
18-21: Name/units clarity for thresholds; consider centralizing.
- The constants are milliseconds (ms) expressed as float64. Consider either:
- Renaming to include units (e.g., operationPlanningWarnMs/UpperMs), or
- Defining as time.Duration in a common file (e.g., thresholds.go) and converting at call sites.
- Place these in a shared package file (same package) so OTLP/Prom stores don’t depend on one file defining them.
No code change required for this PR since it’s debug-only, but centralizing avoids accidental removal/rename drift.
148-153: Use clearer messages, include thresholds, and bump severity for upper limit.Current logs are a bit awkward and lack threshold context. Suggest WARN on upper limit and INFO on warn threshold, include units and threshold values, and add a “store” field for filtering.
- if planningTime >= upperTimeLimit { - h.logger.Info("prometheus: the operation planning time upper limit", zap.Float64("planningTime", planningTime)) - } else if planningTime >= timeLimit { - h.logger.Info("prometheus: the operation planning time high", zap.Float64("planningTime", planningTime)) - } + if planningTime >= upperTimeLimit { + h.logger.Warn("operation planning time exceeded upper threshold", + zap.Float64("planningTimeMs", planningTime), + zap.Float64("thresholdMs", upperTimeLimit), + zap.String("store", "prometheus"), + ) + } else if planningTime >= timeLimit { + h.logger.Info("operation planning time above warn threshold", + zap.Float64("planningTimeMs", planningTime), + zap.Float64("thresholdMs", timeLimit), + zap.String("store", "prometheus"), + ) + }router/pkg/metric/otlp_metric_store.go (2)
137-141: Mirror the Prom store logging improvements and include thresholds.Align message wording, severity, and fields for easier cross-store analysis.
- if planningTime >= upperTimeLimit { - h.logger.Info("otlp: the operation planning time upper limit", zap.Float64("planningTime", planningTime)) - } else if planningTime >= timeLimit { - h.logger.Info("otlp: the operation planning time high", zap.Float64("planningTime", planningTime)) - } + if planningTime >= upperTimeLimit { + h.logger.Warn("operation planning time exceeded upper threshold", + zap.Float64("planningTimeMs", planningTime), + zap.Float64("thresholdMs", upperTimeLimit), + zap.String("store", "otlp"), + ) + } else if planningTime >= timeLimit { + h.logger.Info("operation planning time above warn threshold", + zap.Float64("planningTimeMs", planningTime), + zap.Float64("thresholdMs", timeLimit), + zap.String("store", "otlp"), + ) + }
137-141: Avoid cross-file constant coupling.upperTimeLimit/timeLimit are defined in prom_metric_store.go. They’re package-level so this compiles, but it’s fragile. Move them into a small shared file in this package (e.g., router/pkg/metric/thresholds.go) to prevent accidental breakage if the Prom file changes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
router/pkg/metric/otlp_metric_store.go(1 hunks)router/pkg/metric/prom_metric_store.go(2 hunks)
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-08-14T17:22:41.662Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/oltp_connection_metric_store.go:46-49
Timestamp: 2025-08-14T17:22:41.662Z
Learning: In the OTLP connection metric store (router/pkg/metric/oltp_connection_metric_store.go), errors from startInitMetrics are intentionally logged but not propagated - this is existing behavior to ensure metrics failures don't block core system functionality.
Applied to files:
router/pkg/metric/otlp_metric_store.go
📚 Learning: 2025-08-14T17:45:57.509Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/metric/prom_event_metric_store.go:64-65
Timestamp: 2025-08-14T17:45:57.509Z
Learning: In the Cosmo router codebase, meter providers (both OTLP and Prometheus) are shut down at the router level rather than by individual metric stores. This is why metric store implementations like promEventMetrics use no-op Shutdown() methods - the router orchestrates the shutdown process for all meter providers.
Applied to files:
router/pkg/metric/prom_metric_store.go
📚 Learning: 2025-08-14T16:47:03.744Z
Learnt from: SkArchon
PR: wundergraph/cosmo#2137
File: router/pkg/pubsub/redis/adapter.go:114-116
Timestamp: 2025-08-14T16:47:03.744Z
Learning: In the Cosmo router codebase, EventMetricStore fields are never nil - the system uses the null object pattern with NewNoopEventMetricStore() when metrics are disabled, eliminating the need for nil checks before calling EventMetricStore methods.
Applied to files:
router/pkg/metric/prom_metric_store.go
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: build-router
- GitHub Check: build_push_image
- GitHub Check: build_test
- GitHub Check: image_scan (nonroot)
- GitHub Check: integration_test (./telemetry)
- GitHub Check: integration_test (./events)
- GitHub Check: integration_test (./. ./fuzzquery ./lifecycle ./modules)
- GitHub Check: image_scan
- GitHub Check: build_push_image (nonroot)
- GitHub Check: Analyze (go)
- GitHub Check: Analyze (javascript-typescript)
🔇 Additional comments (1)
router/pkg/metric/prom_metric_store.go (1)
148-153: Unit consistency is correct – no changes needed
MeasureOperationPlanningTimeaccepts atime.Duration, converts it to milliseconds (float64(duration)/float64(time.Millisecond)), and then compares against thetimeLimit/upperTimeLimit(also in ms), so thresholds align.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no-op to take the PR out of my review queue
|
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
|
This PR was marked stale due to lack of activity. It will be closed in 14 days. |
This is a PR to debug operation planning time, not for merge
Summary by CodeRabbit
New Features
Chores
Checklist