feat(obix): surface silent failures in cache loop, pg-listener, and listeners #48
Merged
…isteners

The outbox delivery path has several silent-drop sites that, when they fire, leave all handlers stalled with no log output and no process restart: exactly the failure mode tracked in GaloyMoney/lana-bank#5035.

This is an observability-only change: no behavior is modified. Every previously-silent failure now emits a structured `tracing::error!` or `tracing::warn!` so the next stall is self-diagnosing.

Covered:
- `persistent_cache` loop: log Lagged / Closed on `cache_fill_receiver`, log notification / backfill channel closures, log when the broadcast stalls because the next sequence is missing (gap) or because `persistent_event_sender` has no active receivers.
- `persistent_listener`: log `BroadcastStreamRecvError::Lagged` drops (previously `()`-swallowed).
- `ephemeral_cache` + `ephemeral_listener`: symmetric logging.
- `pg_listener` forwarder: log `PgListener.recv()` errors and forward-send failures when the cache loop receiver is gone.
- New `handle::spawn_supervised` wrapper around the three long-lived background tasks (`persistent_cache_loop`, `ephemeral_cache_loop`, `pg_listener`): catches panics via `AssertUnwindSafe::catch_unwind` and logs both normal-exit and panic termination. A silent cache-loop death is now loud.

Out of scope (follow-up PR): behavioral fixes to the stall itself (e.g., re-reading `latest_known_persisted()` in the listener, or periodic gap-fill). Those decisions should be driven by the logs this PR produces on the next stall.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
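The supervision idea described in this commit can be sketched with the standard library alone. This is a synchronous analog under stated assumptions, not the PR's implementation: the real `handle::spawn_supervised` wraps async tasks, whereas this sketch wraps a plain closure with `std::panic::catch_unwind`. The names `TaskExit`, `panic_message`, and `run_supervised` are hypothetical, and `eprintln!` stands in for whatever log or span the real wrapper emits.

```rust
use std::any::Any;
use std::panic::{catch_unwind, AssertUnwindSafe};

/// How a supervised task ended (hypothetical type, for illustration only).
#[derive(Debug, PartialEq)]
enum TaskExit {
    /// The task returned normally.
    Completed,
    /// The task panicked; the message is kept when the payload is a string.
    Panicked(String),
}

/// Extract a readable message from a panic payload.
fn panic_message(payload: &(dyn Any + Send)) -> String {
    if let Some(s) = payload.downcast_ref::<&'static str>() {
        (*s).to_string()
    } else if let Some(s) = payload.downcast_ref::<String>() {
        s.clone()
    } else {
        "non-string panic payload".to_string()
    }
}

/// Run `task`, catch a panic if one occurs, and report the outcome either way.
/// A normal exit is reported too, because a long-lived loop returning at all
/// is itself worth noticing.
fn run_supervised<F: FnOnce()>(name: &str, task: F) -> TaskExit {
    match catch_unwind(AssertUnwindSafe(task)) {
        Ok(()) => {
            eprintln!("task {} exited normally", name);
            TaskExit::Completed
        }
        Err(payload) => {
            let message = panic_message(&*payload);
            eprintln!("task {} panicked: {}", name, message);
            TaskExit::Panicked(message)
        }
    }
}
```

Calling `run_supervised("cache_loop", || { /* loop body */ })` returns a `TaskExit` describing how the closure ended. `AssertUnwindSafe` is needed because closures that capture mutable state are not `UnwindSafe` by default; using it asserts that observing that state after a panic is acceptable, which holds here because the task is not resumed.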
siddhart1o1
approved these changes
Apr 15, 2026
Bare tracing::error!/warn! events with no parent span never reach the OTLP exporter in lana-bank's tracing-opentelemetry setup — they only land in stdout. Replace each failure site with a short-lived error_span!/warn_span! that closes via .in_scope(|| ()), making the signal a queryable Honeycomb row. Error-severity spans carry otel.status_code = "ERROR" so the span is marked errored. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace the inline error_span!/warn_span! + .in_scope(|| ()) pattern with small #[tracing::instrument]ed no-op helper functions grouped at the bottom of each module. Call sites become plain function calls with type-checked arguments, matching the #[instrument] convention used throughout lana-bank. Span name, level, and otel.status_code are declared once on the helper instead of repeated at every site. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
HonestMajority
approved these changes
Apr 15, 2026
Contributor
HonestMajority
left a comment
Not sure about producing errors from within this library. I think generally it should be decided higher up in the stack what is an error or not. But we can ship and change it later if we want.
Member
Author
Yeah, this is just for debugging purposes; we can revert this later.
Summary
Observability-only change to make the stall described in GaloyMoney/lana-bank#5035 self-diagnosing. No behavior is modified — every previously-silent drop or task exit in the outbox path now emits a short-lived OTEL span that is queryable in Honeycomb.
Context: the same stall signature reproduced on staging today (0.53.0-rc.5, job 0.6.18, obix 0.2.21) even though PR #87 in `job` (lost-handler self-steal fix) is live. 25 outbox handlers frozen at the exact same sequence; 7 events sitting unprocessed in `persistent_outbox_events`; zero log output from the server pod for 30+ minutes. The job poller is demonstrably alive (the two `task.*` crons keep dispatching every ~15s in Honeycomb), so the stall is inside the obix subscription layer, and today there is no way to tell which silent-drop site is firing.

What this PR changes
Every previously-silent failure site now emits a trace-visible signal:

- `persistent/cache.rs`: `cache_fill_receiver` Lagged / Closed, notification / backfill channel closure, broadcast-loop halt on sequence gap, broadcast-loop halt on `persistent_event_sender` having no active receivers, and `load_next_page` failure in `handle_backfill_request`.
- `persistent/listener.rs`: `BroadcastStreamRecvError::Lagged` drops (previously `()`-swallowed).
- `ephemeral/cache.rs` + `ephemeral/listener.rs`: symmetric signals for the ephemeral path.
- `out/pg_notify.rs`: `PgListener.recv()` errors and forward-send failures.
- `handle::spawn_supervised` wraps the three long-lived background tasks (`persistent_cache_loop`, `ephemeral_cache_loop`, `pg_listener`) in `AssertUnwindSafe::catch_unwind` and records both normal-exit and panic termination. A silent cache-loop death is no longer silent.

Why spans instead of `tracing::error!`

All these failure sites live inside `tokio::spawn`'d long-lived `select!` loops with no surrounding span. lana-bank's `tracing-opentelemetry` layer only exports spans via OTLP; bare `tracing::error!` events without a parent span land in stdout but never reach Honeycomb. To make these stall signals queryable, each one has to open (and close) its own short-lived span.

The `record_*` helper pattern

Each failure site calls a small `#[tracing::instrument]`-decorated no-op helper function at the bottom of the module. The empty body is intentional.
`#[instrument]` creates a span on entry, records the function's arguments as span fields, and closes the span on return, so an empty function produces a single short-lived trace-visible signal with structured attributes, and the whole exporter pipeline (otel layer → OTLP → Honeycomb) fires as usual.

This pattern is not used elsewhere in `lana-bank`, `es-entity`, `job`, or obix `main`; it was introduced in this PR. The alternatives we considered:

- Moving each `select!` arm body into its own `#[instrument]`ed async method: matches lana-bank's usual convention, but requires threading `persistent_cache` / `last_broadcast_sequence` / cache refs through method signatures for what is essentially an error log. Invasive for an observability-only PR.
- `tracing::error!` without a span: aesthetically matches lana-bank, but the signal is invisible in Honeycomb. Defeats the purpose of the PR.
- `error_span!(...).in_scope(|| ())` at each site: same mechanics as the helper pattern, but 14× the ceremony.

The empty-body `#[instrument]` helper keeps every call site a plain, type-checked function call and declares the span name / level / `otel.status_code` once per signal. Localised to this subsystem; not being introduced as a project-wide convention.

Signal names follow the shape `obix.<subsystem>.<condition>`, e.g. `obix.persistent_cache.sequence_gap`, `obix.supervisor.task_panicked`: stable, queryable, and dot-hierarchical so Honeycomb can filter by prefix.

Deliberately out of scope
Actual behavioral fixes to the stall (e.g., the listener re-reading `latest_known_persisted()` so `latest_known` self-corrects after a dropped broadcast event; periodic gap-fill sweeper; the `c9868b2` gap-fix branch decision). Those should be driven by what the traces in this PR surface on the next reproduction, not by guessing.

Test plan
- `cargo build` (SQLX_OFFLINE)
- `cargo clippy --all-targets -- -D warnings`: clean
- `cargo fmt --check`: clean
- `cargo nextest run`: all 17 tests pass
- `name = "obix.*"` in Honeycomb.

🤖 Generated with Claude Code