Skip to content

feat(obix): surface silent failures in cache loop, pg-listener, and listeners#48

Merged
sandipndev merged 3 commits into
mainfrom
task/obix-stall-observability
Apr 15, 2026
Merged

feat(obix): surface silent failures in cache loop, pg-listener, and listeners#48
sandipndev merged 3 commits into
mainfrom
task/obix-stall-observability

Conversation

@sandipndev
Copy link
Copy Markdown
Member

@sandipndev sandipndev commented Apr 14, 2026

Summary

Observability-only change to make the stall described in GaloyMoney/lana-bank#5035 self-diagnosing. No behavior is modified — every previously-silent drop or task exit in the outbox path now emits a short-lived OTEL span that is queryable in Honeycomb.

Context: the same stall signature reproduced on staging today (0.53.0-rc.5, job 0.6.18, obix 0.2.21) even though PR #87 in job (lost-handler self-steal fix) is live. 25 outbox handlers frozen at the exact same sequence; 7 events sitting unprocessed in persistent_outbox_events; zero log output from the server pod for 30+ minutes. The job poller is demonstrably alive (the two task.* crons keep dispatching every ~15s in Honeycomb) — so the stall is inside the obix subscription layer, and today there is no way to tell which silent-drop site is firing.

What this PR changes

Every previously-silent failure site now emits a trace-visible signal:

  • persistent/cache.rscache_fill_receiver Lagged / Closed, notification / backfill channel closure, broadcast-loop halt on sequence gap, broadcast-loop halt on persistent_event_sender having no active receivers, and load_next_page failure in handle_backfill_request.
  • persistent/listener.rsBroadcastStreamRecvError::Lagged drops (previously ()-swallowed).
  • ephemeral/cache.rs + ephemeral/listener.rs — symmetric signals for the ephemeral path.
  • out/pg_notify.rsPgListener.recv() errors and forward-send failures.
  • New handle::spawn_supervised wraps the three long-lived background tasks (persistent_cache_loop, ephemeral_cache_loop, pg_listener) in AssertUnwindSafe::catch_unwind and records both normal-exit and panic termination. A silent cache-loop death is no longer silent.

Why spans instead of tracing::error!

All these failure sites live inside tokio::spawn'd long-lived select! loops with no surrounding span. lana-bank's tracing-opentelemetry layer only exports spans via OTLP — bare tracing::error! events without a parent span land in stdout but never reach Honeycomb. To make these stall signals queryable, each one has to open (and close) its own short-lived span.

The record_* helper pattern

Each failure site calls a small #[tracing::instrument]-decorated no-op helper function at the bottom of the module:

#[tracing::instrument(
    name = "obix.persistent_cache.cache_fill_lagged",
    level = "error",
    fields(otel.status_code = "ERROR"),
)]
fn record_cache_fill_lagged(dropped: u64, last_broadcast_sequence: u64, highest_known: u64) {}

The empty body is intentional. #[instrument] creates a span on entry, records the function's arguments as span fields, and closes the span on return — so an empty function produces a single short-lived trace-visible signal with structured attributes, and the whole exporter pipeline (otel layer → OTLP → Honeycomb) fires as usual.

This pattern is not used elsewhere in lana-bank, es-entity, job, or obix main — it was introduced in this PR. The alternatives we considered:

  1. Refactor each select! arm body into its own #[instrument]ed async method — matches lana-bank's usual convention, but requires threading persistent_cache / last_broadcast_sequence / cache refs through method signatures for what is essentially an error log. Invasive for an observability-only PR.
  2. Bare tracing::error! without a span — aesthetically matches lana-bank, but the signal is invisible in Honeycomb. Defeats the purpose of the PR.
  3. Inline error_span!(...).in_scope(|| ()) at each site — same mechanics as the helper pattern, but 14× the ceremony.

The empty-body #[instrument] helper keeps every call site a plain, type-checked function call and declares the span name / level / otel.status_code once per signal. Localised to this subsystem; not being introduced as a project-wide convention.

Signal names follow the shape obix.<subsystem>.<condition>, e.g. obix.persistent_cache.sequence_gap, obix.supervisor.task_panicked — stable, queryable, dot-hierarchical so Honeycomb can filter by prefix.

Deliberately out of scope

Actual behavioral fixes to the stall (e.g., the listener re-reading latest_known_persisted() so latest_known self-corrects after a dropped broadcast event; periodic gap-fill sweeper; the c9868b2 gap-fix branch decision). Those should be driven by what the traces in this PR surface on the next reproduction, not by guessing.

Test plan

  • cargo build (SQLX_OFFLINE)
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo fmt --check — clean
  • cargo nextest run — all 17 tests pass
  • Deploy to staging, run simulation, observe next stall, query name = "obix.*" in Honeycomb.

🤖 Generated with Claude Code

…isteners

The outbox delivery path has several silent-drop sites that, when they
fire, leave all handlers stalled with no log output and no process
restart — exactly the failure mode tracked in GaloyMoney/lana-bank#5035.

This is an observability-only change: no behavior is modified. Every
previously-silent failure now emits a structured `tracing::error!` or
`tracing::warn!` so the next stall is self-diagnosing.

Covered:
- `persistent_cache` loop: log Lagged / Closed on cache_fill_receiver,
  log notification / backfill channel closures, log when the broadcast
  stalls because the next sequence is missing (gap) or because
  `persistent_event_sender` has no active receivers.
- `persistent_listener`: log `BroadcastStreamRecvError::Lagged` drops
  (previously `()`-swallowed).
- `ephemeral_cache` + `ephemeral_listener`: symmetric logging.
- `pg_listener` forwarder: log PgListener.recv() errors and forward-
  send failures when the cache loop receiver is gone.
- New `handle::spawn_supervised` wrapper around the two long-lived
  background tasks (`persistent_cache_loop`, `ephemeral_cache_loop`,
  `pg_listener`): catches panics via `AssertUnwindSafe::catch_unwind`
  and logs both normal-exit and panic termination. A silent cache-loop
  death is now loud.

Out of scope (follow-up PR): behavioral fixes to the stall itself
(e.g., re-reading `latest_known_persisted()` in the listener, or
periodic gap-fill) — those decisions should be driven by the logs this
PR produces on the next stall.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@sandipndev sandipndev marked this pull request as ready for review April 15, 2026 04:48
Bare tracing::error!/warn! events with no parent span never reach the
OTLP exporter in lana-bank's tracing-opentelemetry setup — they only
land in stdout. Replace each failure site with a short-lived
error_span!/warn_span! that closes via .in_scope(|| ()), making the
signal a queryable Honeycomb row. Error-severity spans carry
otel.status_code = "ERROR" so the span is marked errored.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the inline error_span!/warn_span! + .in_scope(|| ()) pattern
with small #[tracing::instrument]ed no-op helper functions grouped at
the bottom of each module. Call sites become plain function calls with
type-checked arguments, matching the #[instrument] convention used
throughout lana-bank. Span name, level, and otel.status_code are
declared once on the helper instead of repeated at every site.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@HonestMajority HonestMajority left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure about producing errors from in this library. I think generally it should be decided higher up in the stack what is an error or not. But we can ship and change later if we want

@sandipndev
Copy link
Copy Markdown
Member Author

Yeah this is just for debugging purposes, we can revert this later

@sandipndev sandipndev merged commit 6c95972 into main Apr 15, 2026
4 checks passed
@sandipndev sandipndev deleted the task/obix-stall-observability branch April 15, 2026 15:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants