metrics: split DB time from drain time, widen buckets to 30s#22
Conversation
Two related fixes for the Database Performance dashboard, which becomes uninformative whenever queries take longer than 10s. 1. zooid_query_duration_seconds used prometheus.DefBuckets, whose last finite bucket is 10s. With dbOpTimeout = 30s, anything between 10s and 30s lands in +Inf and histogram_quantile reports the bucket boundary — so p99 (and often p95/p50) pins at exactly 10s on the dashboard. New buckets cover the full 30s budget. 2. The single duration metric conflated DB time with consumer drain time: the timer in queryEventsWith covered the full iter.Seq lifetime, including time blocked inside yield(evt) waiting for the WebSocket peer. A back-pressured client looked identical to slow Postgres in the metric. Two new sibling histograms split the breakdown: * zooid_query_db_seconds = total - drain (Postgres + scan + parse) * zooid_query_drain_seconds = time blocked yielding to consumer zooid_query_duration_seconds keeps its original "total wall time" semantics so existing dashboards/alerts don't break. The new metrics are additive. Both new histograms share the same widened buckets so they're directly comparable. Test added that saves N events, queries with a sleeping consumer, and verifies drain accounting matches the induced sleep within scheduler-jitter slack while DB time stays well below it. Fixes #21.
There was a problem hiding this comment.
Code Review
This pull request enhances query performance monitoring by splitting the total query duration into DB-side time and consumer-drain time. It introduces two new Prometheus histograms, QueryDBDuration and QueryDrainDuration, and updates the existing QueryDuration with custom buckets. A helper function observeQueryTimings is added to manage these metrics, and comprehensive tests are included to verify the accounting logic. A performance optimization was suggested to use WithLabelValues instead of With(prometheus.Labels{...}) to avoid map allocations during high-frequency query operations.
There was a problem hiding this comment.
Pull request overview
This PR improves Zooid query latency observability by widening histogram buckets up to the typical 30s DB timeout and separating “DB time” from “consumer drain/back-pressure time” during event iteration, addressing Grafana percentiles clipping at 10s and conflating slow peers with slow Postgres.
Changes:
- Replace
prometheus.DefBucketswith explicit buckets spanning 5ms → 30s for query-duration histograms. - Split query timing instrumentation into three comparable histograms: total, DB (total minus drain), and drain.
- Add a metrics test that validates drain accounting under an intentionally slow consumer.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
zooid/metrics.go |
Introduces widened bucket boundaries and adds two new histograms (*_db_seconds, *_drain_seconds) alongside the existing total-time metric. |
zooid/events.go |
Updates query instrumentation to track drain time per yielded event and emits total/DB/drain histograms via a shared helper. |
zooid/metrics_test.go |
Extends existing histogram test and adds a new test verifying drain-vs-DB accounting under a slow consumer. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- gemini: switch to HistogramVec.WithLabelValues to skip the per-query
map allocation that With(Labels{...}) performs on the hot path.
- Copilot: reword the observeQueryTimings comment — non-negativity of
(wall - drainTotal) is from disjoint-sub-interval containment, not
from monotonic time. Monotonic only protects against wall-clock
adjustments, which is unrelated.
Fixes #21.
Why
The Database Performance dashboard for the groupchats relay was showing p99 (and frequently p95/p50) pinned at exactly 10s for long stretches. Two problems compounded:
zooid_query_duration_secondsusedprometheus.DefBuckets, whose last finite bucket is 10s. WithdbOpTimeout = 30s, anything between 10s and 30s landed in+Infandhistogram_quantilereported the bucket boundary.queryEventsWithcovered the entireiter.Seqlifetime — including time blocked insideyield(evt)waiting for a slow WebSocket peer. A back-pressured client looked identical to slow Postgres.What
Wider buckets (5ms → 30s, 14 boundaries) on all query-duration histograms so the slow tail is actually visible.
Three histograms instead of one:
zooid_query_duration_seconds— total wall time. Kept with original semantics so existing dashboards/alerts don't break.zooid_query_db_seconds— total minus drain. Approximates Postgres + driver + scan + parse.zooid_query_drain_seconds— time blocked yielding to the consumer.All three share the same buckets, directly comparable in dashboards.
observeQueryTimingshelper takes the three observers + start/drain as args — no closure, no captures, no heap allocation on the hot path.Verification
TestMetrics_QueryDrainAccounting: saves 5 events, queries with a consumer that sleeps 50ms per row, verifies:dbDelta < drainDelta— DB time stays well below draindb + drain ≈ total— invariant holds within 1ms float tolerancetotal ≤ wall + 50ms— observation doesn't exceed measured wall timego test -timeout 300s ./...passes;gofmt -l -s .clean.Tradeoffs / notes for reviewer
zooid_query_duration_secondskeeps total-time semantics. New metrics are additive. Dashboard JSON updates for groupchats relay live in the consumer repo (sphere-infra) and add two new panels next to the existing one.dbOpTimeout. InternalreplaceEventOncereads can exceed 30s with their 60s budget — those land in+Inf, accepted as a small minority on the read path. Comment inmetrics.goexplains.histogram_quantileover the existing metric will be noisy until enough new samples accumulate under the new boundaries. Worth it; the panel was meaningless above 10s.