Goal
Improve the reliability of async job processing by making retries, failure isolation, and failure visibility explicit.
Scope
- Define retry strategy per job class
- Add dead-letter queue or failure sink
- Add observability around retry/failure counts
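
One possible shape for the scope items above, as a minimal sketch rather than a prescribed design. The job class names (`send_email`, `generate_report`), the `RetryPolicy` fields, the in-memory `dead_letters` list, and the `metrics` counter are all hypothetical stand-ins; a real system would likely use its queue's own dead-letter feature and a proper metrics client.

```python
import time
from collections import Counter
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class RetryPolicy:
    """Explicit, configurable retry settings for one job class (assumed fields)."""
    max_attempts: int = 3
    base_delay_s: float = 1.0
    backoff_factor: float = 2.0
    max_delay_s: float = 60.0

    def delay_for(self, attempt: int) -> float:
        """Capped exponential delay to wait after failed attempt `attempt` (1-based)."""
        return min(self.base_delay_s * self.backoff_factor ** (attempt - 1),
                   self.max_delay_s)


# Hypothetical job classes; unknown classes fall back to DEFAULT_POLICY.
RETRY_POLICIES = {
    "send_email": RetryPolicy(max_attempts=5, base_delay_s=2.0),
    "generate_report": RetryPolicy(max_attempts=2, base_delay_s=10.0),
}
DEFAULT_POLICY = RetryPolicy()

metrics: Counter = Counter()   # stand-in for a real metrics client
dead_letters: list = []        # stand-in for a dead-letter queue or failure sink


def process(job_class: str, payload: Any, handler: Callable[[Any], None],
            sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run handler under the job class's policy; dead-letter the job once retries run out."""
    policy = RETRY_POLICIES.get(job_class, DEFAULT_POLICY)
    for attempt in range(1, policy.max_attempts + 1):
        try:
            handler(payload)
            metrics[f"{job_class}.succeeded"] += 1
            return True
        except Exception as exc:
            metrics[f"{job_class}.failed_attempts"] += 1
            if attempt == policy.max_attempts:
                # Isolate the poison message together with context for investigation.
                dead_letters.append({
                    "job_class": job_class,
                    "payload": payload,
                    "error": repr(exc),
                    "attempts": attempt,
                })
                metrics[f"{job_class}.dead_lettered"] += 1
                return False
            metrics[f"{job_class}.retries"] += 1
            sleep(policy.delay_for(attempt))
    return False
```

Passing `sleep` as a parameter keeps the backoff testable without real waiting. Catching every `Exception` is a simplification; a real policy would probably distinguish retryable errors from permanent ones.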
Acceptance Criteria
- Retry policy is explicit and configurable
- Poison messages are isolated for investigation
- Alerts/metrics exist for recurring failures
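
To make "recurring failures" checkable, one option is a sliding-window threshold per job class. The sketch below is an assumption about how that could look, not an agreed design: the `RecurringFailureDetector` name, the default threshold, and the window length are hypothetical, and a production setup would more likely express this as an alert rule in the monitoring system.

```python
from collections import deque
from typing import Deque, Dict


class RecurringFailureDetector:
    """Flags a job class once it fails `threshold` times within `window_s` seconds."""

    def __init__(self, threshold: int = 5, window_s: float = 300.0) -> None:
        self.threshold = threshold
        self.window_s = window_s
        self._events: Dict[str, Deque[float]] = {}

    def record_failure(self, job_class: str, now: float) -> bool:
        """Record a failure at time `now` (seconds); return True if an alert should fire."""
        events = self._events.setdefault(job_class, deque())
        events.append(now)
        # Drop failures that have aged out of the window.
        while events and now - events[0] > self.window_s:
            events.popleft()
        return len(events) >= self.threshold
```

A caller would invoke `record_failure(job_class, time.time())` wherever a job is dead-lettered and page or notify when it returns `True`.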