metrics: split DB time from drain time, widen buckets to 30s by MastaP · Pull Request #22 · unicitynetwork/unicity-relay

MastaP · 2026-05-09T08:56:33Z

Fixes #21.

Why

The Database Performance dashboard for the groupchats relay was showing p99 (and frequently p95/p50) pinned at exactly 10s for long stretches. Two problems compounded:

zooid_query_duration_seconds used prometheus.DefBuckets, whose last finite bucket is 10s. With dbOpTimeout = 30s, anything between 10s and 30s landed in +Inf and histogram_quantile reported the bucket boundary.
The timer in queryEventsWith covered the entire iter.Seq lifetime — including time blocked inside yield(evt) waiting for a slow WebSocket peer. A back-pressured client looked identical to slow Postgres.

What

Wider buckets (5ms → 30s, 14 boundaries) on all query-duration histograms so the slow tail is actually visible.
Three histograms instead of one:
- zooid_query_duration_seconds — total wall time. Kept with original semantics so existing dashboards/alerts don't break.
- zooid_query_db_seconds — total minus drain. Approximates Postgres + driver + scan + parse.
- zooid_query_drain_seconds — time blocked yielding to the consumer.
All three share the same buckets, directly comparable in dashboards.
observeQueryTimings helper takes the three observers + start/drain as args — no closure, no captures, no heap allocation on the hot path.

Verification

TestMetrics_QueryDrainAccounting: saves 5 events, queries with a consumer that sleeps 50ms per row, verifies:

Each histogram increments by exactly 1 sample
Drain delta ≥ 0.9 × (n × sleep) — covers cumulative induced sleep within scheduler-jitter slack
dbDelta < drainDelta — DB time stays well below drain
db + drain ≈ total — invariant holds within 1ms float tolerance
total ≤ wall + 50ms — observation doesn't exceed measured wall time

go test -timeout 300s ./... passes; gofmt -l -s . clean.

Tradeoffs / notes for reviewer

No backward-compat break: existing zooid_query_duration_seconds keeps total-time semantics. New metrics are additive. Dashboard JSON updates for groupchats relay live in the consumer repo (sphere-infra) and add two new panels next to the existing one.
Bucket ceiling at 30s: covers the typical dbOpTimeout. Internal replaceEventOnce reads can exceed 30s with their 60s budget — those land in +Inf, accepted as a small minority on the read path. Comment in metrics.go explains.
Histogram boundary change invalidates incremental aggregation for the existing metric: for ~24h after deploy, histogram_quantile over the existing metric will be noisy until enough new samples accumulate under the new boundaries. Worth it; the panel was meaningless above 10s.

Two related fixes for the Database Performance dashboard, which becomes uninformative whenever queries take longer than 10s. 1. zooid_query_duration_seconds used prometheus.DefBuckets, whose last finite bucket is 10s. With dbOpTimeout = 30s, anything between 10s and 30s lands in +Inf and histogram_quantile reports the bucket boundary — so p99 (and often p95/p50) pins at exactly 10s on the dashboard. New buckets cover the full 30s budget. 2. The single duration metric conflated DB time with consumer drain time: the timer in queryEventsWith covered the full iter.Seq lifetime, including time blocked inside yield(evt) waiting for the WebSocket peer. A back-pressured client looked identical to slow Postgres in the metric. Two new sibling histograms split the breakdown: * zooid_query_db_seconds = total - drain (Postgres + scan + parse) * zooid_query_drain_seconds = time blocked yielding to consumer zooid_query_duration_seconds keeps its original "total wall time" semantics so existing dashboards/alerts don't break. The new metrics are additive. Both new histograms share the same widened buckets so they're directly comparable. Test added that saves N events, queries with a sleeping consumer, and verifies drain accounting matches the induced sleep within scheduler-jitter slack while DB time stays well below it. Fixes #21.

gemini-code-assist

Code Review

This pull request enhances query performance monitoring by splitting the total query duration into DB-side time and consumer-drain time. It introduces two new Prometheus histograms, QueryDBDuration and QueryDrainDuration, and updates the existing QueryDuration with custom buckets. A helper function observeQueryTimings is added to manage these metrics, and comprehensive tests are included to verify the accounting logic. A performance optimization was suggested to use WithLabelValues instead of With(prometheus.Labels{...}) to avoid map allocations during high-frequency query operations.

Copilot

Pull request overview

This PR improves Zooid query latency observability by widening histogram buckets up to the typical 30s DB timeout and separating “DB time” from “consumer drain/back-pressure time” during event iteration, addressing Grafana percentiles clipping at 10s and conflating slow peers with slow Postgres.

Changes:

Replace prometheus.DefBuckets with explicit buckets spanning 5ms → 30s for query-duration histograms.
Split query timing instrumentation into three comparable histograms: total, DB (total minus drain), and drain.
Add a metrics test that validates drain accounting under an intentionally slow consumer.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File	Description
`zooid/metrics.go`	Introduces widened bucket boundaries and adds two new histograms (`_db_seconds`, `_drain_seconds`) alongside the existing total-time metric.
`zooid/events.go`	Updates query instrumentation to track drain time per yielded event and emits total/DB/drain histograms via a shared helper.
`zooid/metrics_test.go`	Extends existing histogram test and adds a new test verifying drain-vs-DB accounting under a slow consumer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

- gemini: switch to HistogramVec.WithLabelValues to skip the per-query map allocation that With(Labels{...}) performs on the hot path. - Copilot: reword the observeQueryTimings comment — non-negativity of (wall - drainTotal) is from disjoint-sub-interval containment, not from monotonic time. Monotonic only protects against wall-clock adjustments, which is unrelated.

MastaP requested a review from Copilot May 9, 2026 08:57

Copilot started reviewing on behalf of MastaP May 9, 2026 08:57 View session

gemini-code-assist Bot reviewed May 9, 2026

View reviewed changes

Comment thread zooid/events.go Outdated

Copilot AI reviewed May 9, 2026

View reviewed changes

Comment thread zooid/events.go Outdated

MastaP merged commit bf541a9 into master May 9, 2026
1 check passed

MastaP deleted the fix/query-duration-buckets-and-drain branch May 9, 2026 09:07

MastaP mentioned this pull request May 9, 2026

Denormalize kind onto event_tags so tag-filter CTE can pre-filter by kind #23

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

metrics: split DB time from drain time, widen buckets to 30s#22

metrics: split DB time from drain time, widen buckets to 30s#22
MastaP merged 2 commits into
masterfrom
fix/query-duration-buckets-and-drain

MastaP commented May 9, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

MastaP commented May 9, 2026

Why

What

Verification

Tradeoffs / notes for reviewer

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants