Async operations need poison quarantine / degraded queue health #1671

@ai-ag2026

Summary

During a post-crash Hindsight recovery incident, /health continued to report healthy while async worker lanes were degraded by deterministic poison operations.

The immediate data-shape bug is tracked separately in #1669, and a narrow validation fix is open in #1670. This issue is the operational follow-up: deterministic async failures should be quarantined or surfaced as degraded health/status instead of requiring SQL and journal inspection to find them.

Observed failure classes from the incident:

asyncpg.exceptions.DataError: different vector dimensions 384 and 0
index "idx_memory_links_to_type_weight" contains unexpected zero page at block 33008
HINT:  Please REINDEX it.

What happened

After an unclean VM shutdown and WebUI session recovery, Hindsight had derived async work that was stuck or failing in the background queue. The REST API still reported healthy:

{"status":"healthy","database":"connected"}

But the worker logs and the async_operations table showed repeated deterministic failures and a pending backlog. In the local cleanup pass, we had to manually:

  • cancel active poison operations failing with different vector dimensions 384 and 0;
  • reindex a crash-damaged PostgreSQL index;
  • cancel batch_retain parent operations whose task_payload is null and that remained as pending queue work;
  • restart the worker and watch WORKER_STATS until global: pending=0.

Final local state after manual cleanup:

pending/processing async_operations: 0
active vector/zero-page errors: 0
WORKER_STATS ... global: pending=0 ... my_active: none

Impact

  • /health can report healthy while retain/consolidation work is functionally degraded.
  • Deterministic failures are retried until the retry limit is reached, or keep reappearing as queue work.
  • Operators cannot distinguish normal backlog from poison backlog without SQL and journal access.
  • Null-payload parent operations can make the queue look pending even when there is no executable payload.
  • The remediation path is operationally risky because it requires direct DB inspection/mutation.

Suggested behavior

Add an explicit async-operation quarantine/degraded-status path:

  1. Classify deterministic/non-transient failures after N retries or by known error classes:
    • invalid embedding/vector dimension;
    • schema/data-shape errors;
    • invalid/null payload operations;
    • corruption/index errors that require operator intervention;
    • oversized single-item payloads when splitting cannot reduce them.
  2. Move those operations out of the active retry lane as quarantined or failed_permanent.
  3. Store safe metadata, not retained content:
    • operation id;
    • operation type;
    • error class/signature;
    • retry count;
    • payload byte size / payload-null flag;
    • created/updated timestamps.
  4. Expose queue degradation via /health or a dedicated status endpoint:
    • active poison/quarantined operation count;
    • oldest retrying operation age;
    • recent deterministic failure classes;
    • pending operations by type/status/payload-null;
    • whether reserved worker lanes are blocked.
  5. Keep privacy intact: status output should never expose retained text or payload contents.
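Steps 3 to 5 above could look roughly like the following. This is a minimal sketch under stated assumptions, not Hindsight's actual code: QuarantineRecord, make_record, queue_health, and every field name in the returned dict are hypothetical names chosen for illustration. The record stores only the exception class name and the payload byte size, so neither the payload nor the error message (which may embed data) reaches the status output.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


# Hypothetical record type: holds only the safe metadata listed in step 3.
@dataclass(frozen=True)
class QuarantineRecord:
    operation_id: str
    operation_type: str
    error_class: str
    retry_count: int
    payload_bytes: Optional[int]  # None means the payload was null
    created_at: datetime
    updated_at: datetime

    @property
    def payload_null(self) -> bool:
        return self.payload_bytes is None


def make_record(operation_id: str, operation_type: str, exc: BaseException,
                retry_count: int, payload: Optional[bytes],
                created_at: datetime, now: datetime) -> QuarantineRecord:
    """Build a record from a failed operation without keeping its content."""
    return QuarantineRecord(
        operation_id=operation_id,
        operation_type=operation_type,
        # Class name only; str(exc) is deliberately not stored.
        error_class=type(exc).__name__,
        retry_count=retry_count,
        payload_bytes=None if payload is None else len(payload),
        created_at=created_at,
        updated_at=now,
    )


def queue_health(records: list[QuarantineRecord],
                 oldest_retry_started: Optional[datetime],
                 now: datetime) -> dict:
    """Summarise queue degradation for a /health-style response (step 4)."""
    oldest_age = None
    if oldest_retry_started is not None:
        oldest_age = (now - oldest_retry_started).total_seconds()
    return {
        # Assumption: any quarantined operation marks the queue as degraded.
        "status": "degraded" if records else "healthy",
        "quarantined_count": len(records),
        "oldest_retrying_age_seconds": oldest_age,
        "failure_classes": dict(Counter(r.error_class for r in records)),
        "payload_null_count": sum(1 for r in records if r.payload_null),
    }
```

Whether a non-empty quarantine should flip the top-level status to degraded, or only appear on a dedicated status endpoint, is a design choice this sketch does not settle.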

Regression test ideas

  • A retain/consolidation operation that repeatedly raises invalid vector dimension is moved out of active retry flow after the configured threshold.
  • A null-payload parent batch_retain cannot remain indefinitely as normal pending executable work.
  • /health or /status reports degraded queue state without leaking payload contents.
  • Transient LLM/network failures still use ordinary retry/backoff.
  • Quarantined operations do not block unrelated later retain operations.
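As a starting point for these tests, the retry-versus-quarantine decision can be isolated in a pure function and exercised without a database. The sketch below is an assumption-laden illustration: next_state, classify, the state names, and the default threshold of 3 are hypothetical, and the only details taken from the incident are the two error message fragments. Matching on message text is a stand-in for whatever error classification the worker can actually perform.

```python
import re
from typing import Optional, Tuple

# Signatures taken from the two error messages observed in the incident.
DETERMINISTIC_SIGNATURES = {
    "vector_dimension": re.compile(r"different vector dimensions \d+ and \d+"),
    "index_corruption": re.compile(r"contains unexpected zero page"),
}

# Assumption: these built-in types stand in for transient LLM/network errors.
TRANSIENT_TYPES = (TimeoutError, ConnectionError)


def classify(exc: BaseException) -> Optional[str]:
    """Return a signature name for a known deterministic error, else None."""
    if isinstance(exc, TRANSIENT_TYPES):
        return None
    message = str(exc)
    for name, pattern in DETERMINISTIC_SIGNATURES.items():
        if pattern.search(message):
            return name
    return None


def next_state(exc: Optional[BaseException], retry_count: int,
               payload_is_null: bool = False,
               max_retries: int = 3) -> Tuple[str, Optional[str]]:
    """Decide what happens to an operation after a failed attempt."""
    if payload_is_null:
        # Nothing executable: never leave it as ordinary pending work.
        return ("quarantined", "null_payload")
    if exc is not None:
        signature = classify(exc)
        if signature is not None:
            return ("quarantined", signature)
        if isinstance(exc, TRANSIENT_TYPES):
            # Transient errors keep the ordinary retry/backoff path.
            if retry_count < max_retries:
                return ("retry", None)
            return ("failed", "retries_exhausted")
    # Unknown errors retry up to the threshold, then leave the active lane.
    if retry_count < max_retries:
        return ("retry", None)
    return ("quarantined", "repeated_failure")


def test_vector_dimension_is_quarantined():
    exc = ValueError("different vector dimensions 384 and 0")
    assert next_state(exc, retry_count=0) == ("quarantined", "vector_dimension")


def test_transient_failure_still_retries():
    assert next_state(TimeoutError("llm timeout"), retry_count=1) == ("retry", None)


def test_null_payload_parent_is_not_left_pending():
    assert next_state(None, 0, payload_is_null=True) == ("quarantined", "null_payload")
```

The test functions follow pytest naming conventions but are plain functions, so they can also be called directly.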

Relationship to existing work

The key point: validation fixes individual poison sources; quarantine/status prevents the next deterministic poison source from silently degrading the worker lane.
