fix: downgrade shutdown reschedule log severity #112
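Why this is not a job-system failure

The Honeycomb error trace is reporting shutdown cleanup, not a failed job-system operation. During shutdown, the job poller looks for executions that are still marked `running` for the shutting-down poller instance and reschedules them. In the observed trace, the lookup and the follow-up updates succeeded. So the trace is not saying that a job-system operation failed. It is saying that jobs were still running after the shutdown timeout and were forcibly rescheduled.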
That is a useful operational signal, but it is expected recovery behavior during shutdown. The current problem is that the recovery log is emitted at ERROR level once per job, which makes Honeycomb treat successful shutdown cleanup as a LANA error and page ZenDuty. Real failures in this path are still protected: failures from the cleanup query, `find_all`, the repo updates, or the commit still propagate as errors.
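A minimal sketch of that separation, assuming the cleanup function returns a `Result`: the per-job recovery log can sit at WARN while a failing repo call still returns early through `?`. Every name here (`Level`, `RepoError`, `find_running`, `cleanup`) is a hypothetical stand-in and not the job crate's API.

```rust
// Hypothetical stand-ins; none of these names come from the job crate.
#[derive(Debug, PartialEq)]
enum Level {
    Warn,
    Error,
}

#[derive(Debug, PartialEq)]
struct RepoError(String);

/// Pretend repo lookup: fails when `healthy` is false.
fn find_running(healthy: bool) -> Result<Vec<u32>, RepoError> {
    if healthy {
        Ok(vec![1, 2, 3])
    } else {
        Err(RepoError("db unavailable".to_string()))
    }
}

/// Logs each reschedule at WARN, while `?` still surfaces repo failures.
fn cleanup(healthy: bool, log: &mut Vec<(Level, u32)>) -> Result<usize, RepoError> {
    // A real failure returns early as Err and never reaches the log loop.
    let ids = find_running(healthy)?;
    for id in &ids {
        // Expected recovery during shutdown, so it is recorded as a warning.
        log.push((Level::Warn, *id));
    }
    Ok(ids.len())
}
```

In this model, a healthy run yields only WARN entries, and an unhealthy one yields an `Err` with no log entries at all, so lowering the log level does not hide the failure.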
Summary
Downgrade the shutdown cleanup reschedule log from `error!` to `warn!`.

Investigation
ZenDuty incidents fired in both staging and volcano-qa from Honeycomb LANA error alerts. Both alerts came from the same shutdown cleanup message:
`Job still running after shutdown timeout, forcing reschedule`

The alert fired because this message was emitted once per rescheduled job at ERROR level. In the observed incidents, each environment emitted 37 of these events.

We inspected the Honeycomb trace for staging, including trace `0aff61c2ffbb9b45f56515a684886a7b`.

At first glance, the trace made `job_repo.find_all` look suspicious, but that call succeeded. It returned the 37 job IDs with `status_code = 0`. The downstream `job_repo.update` spans also succeeded. The ERRORs were the per-job `tracing::error!` span events emitted after `find_all`, while the job crate was intentionally rescheduling running jobs during shutdown cleanup.

We also checked frequency over the queried 30-day window:
So this appears to be low-frequency shutdown/restart cleanup behavior, but noisy enough to page because each rescheduled job currently emits an ERROR event.
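To illustrate why one restart is enough to page, here is a small model of an alert that counts ERROR-level events. It is only a sketch: `Level`, `Event`, `reschedule_events`, and `error_count` are hypothetical names, and the counting rule is an assumption rather than the actual Honeycomb trigger definition.

```rust
// Hypothetical model of severity-based alerting; not the Honeycomb or tracing API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Level {
    Warn,
    Error,
}

#[derive(Debug)]
struct Event {
    level: Level,
    message: &'static str,
}

/// Emit one event per rescheduled job, as the shutdown cleanup does.
fn reschedule_events(job_count: usize, level: Level) -> Vec<Event> {
    (0..job_count)
        .map(|_| Event {
            level,
            message: "Job still running after shutdown timeout, forcing reschedule",
        })
        .collect()
}

/// Count the events that an ERROR-only alert query would match.
fn error_count(events: &[Event]) -> usize {
    events.iter().filter(|e| e.level == Level::Error).count()
}
```

With 37 rescheduled jobs, the same cleanup produces 37 alert-matching events at `Level::Error` and none at `Level::Warn`, while the events themselves are still recorded.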
Rationale
`kill_remaining_jobs` is an intentional recovery path during shutdown. It handles jobs still marked `running` for the shutting-down poller instance by:

- setting their state back to `pending`
- clearing `poller_instance_id`
- recording `ExecutionAborted { reason: "killed job" }`
- recording `ExecutionScheduled { attempt, scheduled_at }`

This preserves the work and makes the job eligible to be picked up again by a poller. That is a useful operational signal, but it is not itself a job-system failure.
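The recovery path described above can be sketched as an in-memory state transition. This is an illustration under assumptions, not the job crate's implementation: the `Job` struct, the integer poller IDs and timestamps, carrying `attempt` over unchanged, and using the current time as `scheduled_at` are all hypothetical choices, and the real code persists these changes through its repo.

```rust
// Hypothetical in-memory model; the real job crate persists this via its repo.
#[derive(Debug, Clone, PartialEq)]
enum State {
    Pending,
    Running,
}

#[derive(Debug, Clone, PartialEq)]
enum JobEvent {
    ExecutionAborted { reason: String },
    ExecutionScheduled { attempt: u32, scheduled_at: u64 },
}

#[derive(Debug)]
struct Job {
    state: State,
    poller_instance_id: Option<u32>,
    attempt: u32,
    events: Vec<JobEvent>,
}

/// Reschedule every job still marked running for `poller`.
/// Returns how many jobs were rescheduled.
fn kill_remaining_jobs(jobs: &mut [Job], poller: u32, now: u64) -> usize {
    let mut rescheduled = 0;
    for job in jobs.iter_mut() {
        // Only touch jobs owned by the poller instance that is shutting down.
        if job.state != State::Running || job.poller_instance_id != Some(poller) {
            continue;
        }
        job.state = State::Pending;
        job.poller_instance_id = None;
        job.events.push(JobEvent::ExecutionAborted {
            reason: "killed job".to_string(),
        });
        // Assumption: the attempt number is carried over unchanged.
        job.events.push(JobEvent::ExecutionScheduled {
            attempt: job.attempt,
            scheduled_at: now,
        });
        rescheduled += 1;
    }
    rescheduled
}
```

Jobs owned by other poller instances, and jobs that are already pending, are left untouched in this model, which is why the operation reads as recovery and not as a failure.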
Real failures in this path are still preserved: DB/repo operations continue to use `?`, so failures from the cleanup update, `find_all`, repo updates, or commit still propagate as actual errors.

Via ZenDuty Alerts