Async operations need poison quarantine / degraded queue health

## Summary

During a post-crash Hindsight recovery incident, `/health` continued to report healthy while async worker lanes were degraded by deterministic poison operations.

The immediate data-shape bug is tracked separately in #1669, and a narrow validation fix is open in #1670. This issue is the operational follow-up: deterministic async failures should be quarantined or surfaced as degraded health/status instead of requiring SQL and journal inspection to detect.

Observed failure classes from the incident:

```text
asyncpg.exceptions.DataError: different vector dimensions 384 and 0
index "idx_memory_links_to_type_weight" contains unexpected zero page at block 33008
HINT:  Please REINDEX it.
```

## What happened

After an unclean VM shutdown and WebUI session recovery, derived async work in Hindsight's background queue was stuck or failing. The REST API still reported healthy:

```json
{"status":"healthy","database":"connected"}
```

Meanwhile, worker logs and `async_operations` showed repeated deterministic failures and a pending backlog. In the local cleanup pass, we had to manually:

- cancel active poison operations with `different vector dimensions 384 and 0`;
- reindex a crash-damaged PostgreSQL index;
- cancel `batch_retain` parent operations with `task_payload is null` that remained as pending queue work;
- restart the worker and watch `WORKER_STATS` until `global: pending=0`.

Final local state after manual cleanup:

```text
pending/processing async_operations: 0
active vector/zero-page errors: 0
WORKER_STATS ... global: pending=0 ... my_active: none
```

## Impact

- `/health` can report healthy while retain/consolidation work is functionally degraded.
- Deterministic failures are retried until they hit the retry limit, or keep reappearing as queue work.
- Operators cannot distinguish normal backlog from poison backlog without SQL and journal access.
- Null-payload parent operations can make the queue look pending even when there is no executable payload.
- The remediation path is operationally risky because it requires direct DB inspection/mutation.

## Suggested behavior

Add an explicit async-operation quarantine/degraded-status path:

1. Classify deterministic/non-transient failures after `N` retries or by known error classes:
   - invalid embedding/vector dimension;
   - schema/data-shape errors;
   - invalid/null payload operations;
   - corruption/index errors that require operator intervention;
   - oversized single-item payloads when splitting cannot reduce them.
2. Move those operations out of the active retry lane as `quarantined` or `failed_permanent`.
3. Store safe metadata, not retained content:
   - operation id;
   - operation type;
   - error class/signature;
   - retry count;
   - payload byte size / payload-null flag;
   - created/updated timestamps.
4. Expose queue degradation via `/health` or a dedicated status endpoint:
   - active poison/quarantined operation count;
   - oldest retrying operation age;
   - recent deterministic failure classes;
   - pending operations by type/status/payload-null;
   - whether reserved worker lanes are blocked.
5. Keep privacy intact: status output should never expose retained text or payload contents.
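Steps 1 to 3 could be sketched roughly as follows. This is a minimal illustration, not Hindsight's actual worker code: `FailureClass`, `classify_failure`, `QuarantineRecord`, `quarantine_decision`, and `MAX_RETRIES` are all hypothetical names, and the regex signatures are taken only from the error text seen in this incident.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class FailureClass(str, Enum):
    # Hypothetical classes; a real list would cover more of step 1.
    VECTOR_DIMENSION = "vector_dimension"
    INDEX_CORRUPTION = "index_corruption"
    NULL_PAYLOAD = "null_payload"
    TRANSIENT = "transient"


# Signatures copied from the incident logs; assumed, not exhaustive.
_DETERMINISTIC_PATTERNS = [
    (re.compile(r"different vector dimensions \d+ and \d+"),
     FailureClass.VECTOR_DIMENSION),
    (re.compile(r"contains unexpected zero page"),
     FailureClass.INDEX_CORRUPTION),
]

MAX_RETRIES = 3  # placeholder for the configurable threshold N


def classify_failure(error_message: str, payload: Optional[bytes]) -> FailureClass:
    """Map an error to a failure class; unknown errors count as transient."""
    if payload is None:
        return FailureClass.NULL_PAYLOAD
    for pattern, failure_class in _DETERMINISTIC_PATTERNS:
        if pattern.search(error_message):
            return failure_class
    return FailureClass.TRANSIENT


@dataclass(frozen=True)
class QuarantineRecord:
    """Safe metadata only: no payload bytes and no raw error text."""
    operation_id: str
    operation_type: str
    error_class: str
    retry_count: int
    payload_bytes: int
    payload_is_null: bool
    quarantined_at: datetime


def quarantine_decision(
    operation_id: str,
    operation_type: str,
    error_message: str,
    payload: Optional[bytes],
    retry_count: int,
) -> Optional[QuarantineRecord]:
    """Return a record if the operation should leave the retry lane, else None."""
    failure_class = classify_failure(error_message, payload)
    if failure_class is FailureClass.TRANSIENT:
        if retry_count < MAX_RETRIES:
            return None  # ordinary retry/backoff still applies
        error_class = "retries_exhausted"
    else:
        error_class = failure_class.value  # known class: quarantine at once
    return QuarantineRecord(
        operation_id=operation_id,
        operation_type=operation_type,
        error_class=error_class,
        retry_count=retry_count,
        payload_bytes=0 if payload is None else len(payload),
        payload_is_null=payload is None,
        quarantined_at=datetime.now(timezone.utc),
    )
```

The record deliberately stores the payload size and a null flag rather than the payload itself, and an error class rather than the raw message, since error text can embed retained content.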

## Regression test ideas

- A retain/consolidation operation that repeatedly raises invalid vector dimension is moved out of active retry flow after the configured threshold.
- A null-payload parent `batch_retain` cannot remain indefinitely as normal pending executable work.
- `/health` or `/status` reports degraded queue state without leaking payload contents.
- Transient LLM/network failures still use ordinary retry/backoff.
- Quarantined operations do not block unrelated later retain operations.
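The health-reporting checks above could be exercised against a small pure function like the one below. It is a sketch under assumptions: `OperationRow`, `queue_health`, the `quarantined` status value, and the 30-minute age threshold are hypothetical, and a real implementation would read these fields from `async_operations`.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class OperationRow:
    """Metadata-only view of one queue row (hypothetical shape)."""
    operation_type: str
    status: str  # assumed values: "pending", "processing", "quarantined"
    error_class: Optional[str]
    payload_is_null: bool
    created_at: datetime


def queue_health(
    rows: Iterable[OperationRow],
    now: datetime,
    max_active_age: timedelta = timedelta(minutes=30),  # assumed threshold
) -> dict:
    """Summarize queue state without touching payload contents."""
    rows = list(rows)
    quarantined = [r for r in rows if r.status == "quarantined"]
    active = [r for r in rows if r.status in ("pending", "processing")]
    null_pending = sum(1 for r in active if r.payload_is_null)
    oldest_age = max((now - r.created_at for r in active), default=timedelta(0))
    degraded = bool(quarantined) or null_pending > 0 or oldest_age > max_active_age
    return {
        "status": "degraded" if degraded else "healthy",
        "quarantined_count": len(quarantined),
        "null_payload_pending": null_pending,
        "oldest_active_age_seconds": int(oldest_age.total_seconds()),
        "failure_classes": dict(
            Counter(r.error_class for r in quarantined if r.error_class)
        ),
        "active_by_type": dict(Counter(r.operation_type for r in active)),
    }
```

Because the function only sees metadata fields, a test can assert both that the status flips to `degraded` and that no payload text can appear in the output by construction.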

## Relationship to existing work

- Complements #1669: zero-length embeddings reaching pgvector.
- Complements #1670: validates retain embedding dimensions before pgvector writes.
- Related to #1571: oversized single retain items can also become worker poison.

The key point: validation fixes individual poison sources; quarantine/status prevents the next deterministic poison source from silently degrading the worker lane.


Async operations need poison quarantine / degraded queue health #1671
