
fix: downgrade shutdown reschedule log severity #112

Closed
Nsandomeno wants to merge 1 commit into main from chore--downgrade-kill-job-exception-level

Conversation

@Nsandomeno

Summary

Downgrade the shutdown cleanup reschedule log from error! to warn!.

Investigation

ZenDuty incidents fired in both staging and volcano-qa from Honeycomb LANA error alerts. Both alerts came from the same shutdown cleanup message:

Job still running after shutdown timeout, forcing reschedule

The alert fired because this message was emitted once per rescheduled job at ERROR level. In the observed incidents, each environment emitted 37 of these events.

We inspected the Honeycomb trace for staging, including trace 0aff61c2ffbb9b45f56515a684886a7b.

At first glance, the trace made job_repo.find_all look suspicious, but that call succeeded. It returned the 37 job IDs with status_code = 0. The downstream job_repo.update spans also succeeded. The ERRORs were the per-job tracing::error! span events emitted after find_all, while the job crate was intentionally rescheduling running jobs during shutdown cleanup.

We also checked how often this occurred over the queried 30-day window:

  • staging: 1 cleanup occurrence, 37 jobs rescheduled
  • volcano-qa: 1 cleanup occurrence, 37 jobs rescheduled

So this appears to be low-frequency shutdown/restart cleanup behavior, but noisy enough to page because each rescheduled job currently emits an ERROR event.
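To make the paging arithmetic concrete, here is a minimal std-only sketch. It assumes the alert simply matches ERROR-level events; the actual Honeycomb query and threshold are not shown in this PR. `Level`, `cleanup_events`, and `error_count` are hypothetical names used only for illustration.

```rust
/// Log severities relevant to this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Warn,
    Error,
}

/// Hypothetical stand-in for one cleanup pass: it emits one event per
/// rescheduled job, all at the same level.
pub fn cleanup_events(n_rescheduled: usize, level: Level) -> Vec<Level> {
    vec![level; n_rescheduled]
}

/// Number of events an ERROR-level alert query would match (assumed behavior).
pub fn error_count(events: &[Level]) -> usize {
    events.iter().filter(|l| **l == Level::Error).count()
}
```

With 37 rescheduled jobs, `error_count(&cleanup_events(37, Level::Error))` is 37, while the same pass at `Level::Warn` contributes 0 events to an ERROR-level alert.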

Rationale

kill_remaining_jobs is an intentional recovery path during shutdown. It handles jobs still marked running for the shutting-down poller instance by:

  • resetting the execution to pending
  • clearing poller_instance_id
  • recording ExecutionAborted { reason: "killed job" }
  • recording ExecutionScheduled { attempt, scheduled_at }

This preserves the work and makes the job eligible to be picked up again by a poller. That is useful operational signal, but it is not itself a job-system failure.
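The four steps above can be sketched as a single state transition. This is a simplified std-only model, not the job crate's actual code: the `Execution` struct, the `State` enum, the `reschedule` helper, the `u64` timestamp, and the choice to leave `attempt` unchanged are all assumptions made for illustration. Only the event names and the `"killed job"` reason come from the description above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum State {
    Running,
    Pending,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Event {
    ExecutionAborted { reason: String },
    ExecutionScheduled { attempt: u32, scheduled_at: u64 },
}

/// Hypothetical in-memory view of one job execution.
#[derive(Debug)]
pub struct Execution {
    pub state: State,
    pub poller_instance_id: Option<String>,
    pub attempt: u32,
    pub events: Vec<Event>,
}

/// Applies the shutdown recovery steps to a still-running execution.
/// `now` is an assumed timestamp (for example, Unix seconds).
pub fn reschedule(exec: &mut Execution, now: u64) {
    // Reset the execution so a poller can pick it up again.
    exec.state = State::Pending;
    // Detach it from the poller instance that is shutting down.
    exec.poller_instance_id = None;
    // Record why the previous run ended, then the new schedule.
    exec.events.push(Event::ExecutionAborted {
        reason: "killed job".to_string(),
    });
    exec.events.push(Event::ExecutionScheduled {
        attempt: exec.attempt, // assumption: attempt is carried over as-is
        scheduled_at: now,
    });
}
```

Nothing in this transition discards the job; it only changes ownership and appends two events, which is why the path reads as recovery rather than failure.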

Real failures in this path are still preserved: DB/repo operations continue to use ?, so failures from the cleanup update, find_all, repo updates, or commit still propagate as actual errors.

Via ZenDuty Alerts

@Nsandomeno
Author

Why this is not a job-system failure

The Honeycomb error trace is reporting shutdown cleanup, not a failed job-system operation.

During shutdown, the job poller looks for executions that are still marked running for the poller instance that is going away. For those executions, kill_remaining_jobs intentionally moves them back to pending, clears poller_instance_id, and records abort/reschedule events.

In the observed trace, n_killed = 37, which means 37 running executions matched the cleanup query and were reset for retry. The suspicious-looking job_repo.find_all span succeeded and returned those 37 job IDs. The subsequent job_repo.update spans also succeeded, which indicates the abort/reschedule events were persisted.

So the trace is not saying:

  • the DB failed
  • find_all failed
  • cleanup failed
  • jobs were lost

It is saying:

  • shutdown happened
  • 37 in-flight jobs did not complete before cleanup
  • the job crate recovered by making those jobs eligible to run again

That is useful operational signal, but it is expected recovery behavior during shutdown. The current problem is that the recovery log is emitted at ERROR level once per job, which makes Honeycomb treat successful shutdown cleanup as a LANA error and page ZenDuty.

Real failures in this path are still protected: the cleanup query, find_all, per-job repo updates, and transaction commit all use ?, so actual DB/repo failures would still propagate as real errors. Downgrading the per-job recovery log to warn! does not hide those failures.
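A small std-only sketch of why the log level and error propagation are independent. The names `RepoError`, `update`, and `kill_remaining`, the in-memory `log` vector, and the ordering of the log line relative to the update are hypothetical; the real function uses the repo and `tracing` macros. The point is only that `?` returns the error to the caller regardless of the severity of the per-job message.

```rust
#[derive(Debug, PartialEq)]
pub struct RepoError(pub String);

/// Hypothetical repo update: job id 0 simulates a DB failure.
fn update(job_id: u32) -> Result<(), RepoError> {
    if job_id == 0 {
        Err(RepoError(format!("update failed for job {}", job_id)))
    } else {
        Ok(())
    }
}

/// Reschedules each job, recording a WARN-level line per job.
/// Any repo failure still propagates to the caller through `?`.
pub fn kill_remaining(job_ids: &[u32], log: &mut Vec<String>) -> Result<usize, RepoError> {
    let mut n_killed = 0;
    for &id in job_ids {
        // Recovery message, now at WARN instead of ERROR.
        log.push(format!(
            "WARN job {}: still running after shutdown timeout, forcing reschedule",
            id
        ));
        // A real failure exits here with Err, independent of the log level.
        update(id)?;
        n_killed += 1;
    }
    Ok(n_killed)
}
```

Calling `kill_remaining(&[1, 2, 3], &mut log)` returns `Ok(3)`, while a slice containing the failing id returns `Err` at that job, so the caller still sees the failure.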

@Nsandomeno Nsandomeno reopened this Jun 2, 2026
@Nsandomeno Nsandomeno closed this Jun 2, 2026