Skip to content

metrics: split DB time from drain time, widen buckets to 30s#22

Merged
MastaP merged 2 commits into
masterfrom
fix/query-duration-buckets-and-drain
May 9, 2026
Merged

metrics: split DB time from drain time, widen buckets to 30s#22
MastaP merged 2 commits into
masterfrom
fix/query-duration-buckets-and-drain

Conversation

@MastaP

@MastaP MastaP commented May 9, 2026

Copy link
Copy Markdown
Member

Fixes #21.

Why

The Database Performance dashboard for the groupchats relay was showing p99 (and frequently p95/p50) pinned at exactly 10s for long stretches. Two problems compounded:

  1. zooid_query_duration_seconds used prometheus.DefBuckets, whose last finite bucket is 10s. With dbOpTimeout = 30s, anything between 10s and 30s landed in +Inf and histogram_quantile reported the bucket boundary.
  2. The timer in queryEventsWith covered the entire iter.Seq lifetime — including time blocked inside yield(evt) waiting for a slow WebSocket peer. A back-pressured client looked identical to slow Postgres.

What

  • Wider buckets (5ms → 30s, 14 boundaries) on all query-duration histograms so the slow tail is actually visible.

  • Three histograms instead of one:

    • zooid_query_duration_seconds — total wall time. Kept with original semantics so existing dashboards/alerts don't break.
    • zooid_query_db_seconds — total minus drain. Approximates Postgres + driver + scan + parse.
    • zooid_query_drain_seconds — time blocked yielding to the consumer.

    All three share the same buckets, directly comparable in dashboards.

  • observeQueryTimings helper takes the three observers + start/drain as args — no closure, no captures, no heap allocation on the hot path.

Verification

TestMetrics_QueryDrainAccounting: saves 5 events, queries with a consumer that sleeps 50ms per row, verifies:

  • Each histogram increments by exactly 1 sample
  • Drain delta ≥ 0.9 × (n × sleep) — covers cumulative induced sleep within scheduler-jitter slack
  • dbDelta < drainDelta — DB time stays well below drain
  • db + drain ≈ total — invariant holds within 1ms float tolerance
  • total ≤ wall + 50ms — observation doesn't exceed measured wall time

go test -timeout 300s ./... passes; gofmt -l -s . clean.

Tradeoffs / notes for reviewer

  • No backward-compat break: existing zooid_query_duration_seconds keeps total-time semantics. New metrics are additive. Dashboard JSON updates for groupchats relay live in the consumer repo (sphere-infra) and add two new panels next to the existing one.
  • Bucket ceiling at 30s: covers the typical dbOpTimeout. Internal replaceEventOnce reads can exceed 30s with their 60s budget — those land in +Inf, accepted as a small minority on the read path. Comment in metrics.go explains.
  • Histogram boundary change invalidates incremental aggregation for the existing metric: for ~24h after deploy, histogram_quantile over the existing metric will be noisy until enough new samples accumulate under the new boundaries. Worth it; the panel was meaningless above 10s.

Two related fixes for the Database Performance dashboard, which becomes
uninformative whenever queries take longer than 10s.

1. zooid_query_duration_seconds used prometheus.DefBuckets, whose last
   finite bucket is 10s. With dbOpTimeout = 30s, anything between 10s
   and 30s lands in +Inf and histogram_quantile reports the bucket
   boundary — so p99 (and often p95/p50) pins at exactly 10s on the
   dashboard. New buckets cover the full 30s budget.

2. The single duration metric conflated DB time with consumer drain
   time: the timer in queryEventsWith covered the full iter.Seq
   lifetime, including time blocked inside yield(evt) waiting for the
   WebSocket peer. A back-pressured client looked identical to slow
   Postgres in the metric.

   Two new sibling histograms split the breakdown:

   * zooid_query_db_seconds   = total - drain  (Postgres + scan + parse)
   * zooid_query_drain_seconds = time blocked yielding to consumer

   zooid_query_duration_seconds keeps its original "total wall time"
   semantics so existing dashboards/alerts don't break. The new metrics
   are additive.

Both new histograms share the same widened buckets so they're directly
comparable. Test added that saves N events, queries with a sleeping
consumer, and verifies drain accounting matches the induced sleep
within scheduler-jitter slack while DB time stays well below it.

Fixes #21.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request enhances query performance monitoring by splitting the total query duration into DB-side time and consumer-drain time. It introduces two new Prometheus histograms, QueryDBDuration and QueryDrainDuration, and updates the existing QueryDuration with custom buckets. A helper function observeQueryTimings is added to manage these metrics, and comprehensive tests are included to verify the accounting logic. A performance optimization was suggested to use WithLabelValues instead of With(prometheus.Labels{...}) to avoid map allocations during high-frequency query operations.

Comment thread zooid/events.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves Zooid query latency observability by widening histogram buckets up to the typical 30s DB timeout and separating “DB time” from “consumer drain/back-pressure time” during event iteration, addressing Grafana percentiles clipping at 10s and conflating slow peers with slow Postgres.

Changes:

  • Replace prometheus.DefBuckets with explicit buckets spanning 5ms → 30s for query-duration histograms.
  • Split query timing instrumentation into three comparable histograms: total, DB (total minus drain), and drain.
  • Add a metrics test that validates drain accounting under an intentionally slow consumer.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
zooid/metrics.go Introduces widened bucket boundaries and adds two new histograms (*_db_seconds, *_drain_seconds) alongside the existing total-time metric.
zooid/events.go Updates query instrumentation to track drain time per yielded event and emits total/DB/drain histograms via a shared helper.
zooid/metrics_test.go Extends existing histogram test and adds a new test verifying drain-vs-DB accounting under a slow consumer.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread zooid/events.go Outdated
- gemini: switch to HistogramVec.WithLabelValues to skip the per-query
  map allocation that With(Labels{...}) performs on the hot path.
- Copilot: reword the observeQueryTimings comment — non-negativity of
  (wall - drainTotal) is from disjoint-sub-interval containment, not
  from monotonic time. Monotonic only protects against wall-clock
  adjustments, which is unrelated.
@MastaP MastaP merged commit bf541a9 into master May 9, 2026
1 check passed
@MastaP MastaP deleted the fix/query-duration-buckets-and-drain branch May 9, 2026 09:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Query duration histogram clips at 10s and conflates DB time with consumer back-pressure

2 participants