Summary
During a post-crash Hindsight recovery incident, /health continued to report healthy while async worker lanes were degraded by deterministic poison operations.
The immediate data-shape bug is tracked separately in #1669 and a narrow validation fix is open in #1670. This issue is the operational follow-up: deterministic async failures should be quarantined or surfaced as degraded health/status instead of requiring SQL + journal inspection.
Observed failure classes from the incident:
asyncpg.exceptions.DataError: different vector dimensions 384 and 0
index "idx_memory_links_to_type_weight" contains unexpected zero page at block 33008
HINT: Please REINDEX it.
What happened
After an unclean VM shutdown and WebUI session recovery, Hindsight had derived async work stuck/failing in the background queue. The REST API stayed healthy:
{"status":"healthy","database":"connected"}
But worker logs and async_operations showed repeated deterministic failures and pending backlog. In the local cleanup pass, we had to manually:
- cancel active poison operations with different vector dimensions 384 and 0;
- reindex a crash-damaged PostgreSQL index;
- cancel batch_retain parent operations with task_payload is null that remained as pending queue work;
- restart the worker and watch WORKER_STATS until global: pending=0.
Final local state after manual cleanup:
pending/processing async_operations: 0
active vector/zero-page errors: 0
WORKER_STATS ... global: pending=0 ... my_active: none
Impact
- /health can report healthy while retain/consolidation work is functionally degraded.
- Deterministic failures can retry until max retries or keep reappearing as queue work.
- Operators cannot distinguish normal backlog from poison backlog without SQL and journal access.
- Null-payload parent operations can make the queue look pending even when there is no executable payload.
- The remediation path is operationally risky because it requires direct DB inspection/mutation.
Suggested behavior
Add an explicit async-operation quarantine/degraded-status path:
- Classify deterministic/non-transient failures after N retries or by known error classes:
  - invalid embedding/vector dimension;
  - schema/data-shape errors;
  - invalid/null payload operations;
  - corruption/index errors that require operator intervention;
  - oversized single-item payloads when splitting cannot reduce them.
- Move those operations out of the active retry lane as quarantined or failed_permanent.
- Store safe metadata, not retained content:
  - operation id;
  - operation type;
  - error class/signature;
  - retry count;
  - payload byte size / payload-null flag;
  - created/updated timestamps.
- Expose queue degradation via /health or a dedicated status endpoint:
  - active poison/quarantined operation count;
  - oldest retrying operation age;
  - recent deterministic failure classes;
  - pending operations by type/status/payload-null;
  - whether reserved worker lanes are blocked.
- Keep privacy intact: status output should never expose retained text or payload contents.
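The classification and safe-metadata steps above could look roughly like the following. This is a minimal sketch, not Hindsight's actual code: the names classify_failure, build_quarantine_record, QuarantineRecord, the error-class labels, and the default max_retries=3 are all hypothetical. Only the two error substrings come from the incident logs.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Substrings observed in this incident, mapped to hypothetical error-class labels.
DETERMINISTIC_SIGNATURES = {
    "different vector dimensions": "invalid_vector_dimension",
    "contains unexpected zero page": "index_corruption",
}


@dataclass(frozen=True)
class QuarantineRecord:
    """Safe metadata only: no payload contents and no raw error text."""
    operation_id: str
    operation_type: str
    error_class: str
    retry_count: int
    payload_bytes: Optional[int]
    payload_is_null: bool
    quarantined_at: datetime


def classify_failure(error_message: str, payload: Optional[bytes],
                     retry_count: int, max_retries: int = 3) -> Optional[str]:
    """Return an error class if the operation should leave the retry lane.

    Returns None when the failure should go through ordinary retry/backoff.
    """
    if payload is None:
        return "null_payload"
    for needle, error_class in DETERMINISTIC_SIGNATURES.items():
        if needle in error_message:
            return error_class
    if retry_count >= max_retries:
        return "retries_exhausted"
    return None


def build_quarantine_record(operation_id: str, operation_type: str,
                            error_message: str, payload: Optional[bytes],
                            retry_count: int,
                            max_retries: int = 3) -> Optional[QuarantineRecord]:
    """Build a quarantine record, or return None if the failure looks transient."""
    error_class = classify_failure(error_message, payload, retry_count, max_retries)
    if error_class is None:
        return None
    return QuarantineRecord(
        operation_id=operation_id,
        operation_type=operation_type,
        error_class=error_class,
        retry_count=retry_count,
        payload_bytes=None if payload is None else len(payload),
        payload_is_null=payload is None,
        quarantined_at=datetime.now(timezone.utc),
    )
```

The record keeps the error class rather than the raw exception message, on the assumption that exception text could echo parts of the payload.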
Regression test ideas
- A retain/consolidation operation that repeatedly raises invalid vector dimension is moved out of active retry flow after the configured threshold.
- A null-payload parent batch_retain cannot remain indefinitely as normal pending executable work.
- /health or /status reports degraded queue state without leaking payload contents.
- Transient LLM/network failures still use ordinary retry/backoff.
- Quarantined operations do not block unrelated later retain operations.
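A regression test for the status path could exercise a small aggregation function like the one sketched below. The function name summarize_queue, the row keys (status, operation_type, payload, error_class), and the report fields are assumptions for illustration; they do not describe the real async_operations schema or the current /health response.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Mapping


def summarize_queue(operations: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Aggregate queue rows into a status report that carries counts only."""
    quarantined = 0
    pending_null_payload = 0
    pending_by_type: Counter = Counter()
    error_classes: Counter = Counter()

    for op in operations:
        status = op["status"]
        if status == "quarantined":
            quarantined += 1
            error_classes[op.get("error_class") or "unknown"] += 1
        elif status == "pending":
            pending_by_type[op["operation_type"]] += 1
            if op.get("payload") is None:
                pending_null_payload += 1

    # Any quarantined operation or non-executable pending row marks the queue degraded.
    degraded = quarantined > 0 or pending_null_payload > 0
    return {
        "status": "degraded" if degraded else "healthy",
        "quarantined": quarantined,
        "pending_by_type": dict(pending_by_type),
        "pending_null_payload": pending_null_payload,
        "error_classes": dict(error_classes),
    }


if __name__ == "__main__":
    rows = [
        {"status": "quarantined", "operation_type": "retain",
         "payload": "private note", "error_class": "invalid_vector_dimension"},
        {"status": "pending", "operation_type": "batch_retain", "payload": None},
    ]
    print(summarize_queue(rows))
```

Because the report is built only from counters and class labels, a test can assert both that the status is degraded and that no payload string appears anywhere in the serialized output.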
Relationship to existing work
The key point: validation fixes individual poison sources; quarantine/status prevents the next deterministic poison source from silently degrading the worker lane.