fix: downgrade shutdown reschedule log severity by Nsandomeno · Pull Request #112 · GaloyMoney/job

Nsandomeno · 2026-06-01T21:19:41Z

Summary

Downgrade the shutdown cleanup reschedule log from error! to warn!.

Investigation

ZenDuty incidents fired in both staging and volcano-qa from Honeycomb LANA error alerts. Both alerts came from the same shutdown cleanup message:

Job still running after shutdown timeout, forcing reschedule

The alert fired because this message was emitted once per rescheduled job at ERROR level. In the observed incidents, each environment emitted 37 of these events.

We inspected the Honeycomb trace for staging, including trace 0aff61c2ffbb9b45f56515a684886a7b.

At first glance, the trace made job_repo.find_all look suspicious, but that call succeeded. It returned the 37 job IDs with status_code = 0. The downstream job_repo.update spans also succeeded. The ERRORs were the per-job tracing::error! span events emitted after find_all, while the job crate was intentionally rescheduling running jobs during shutdown cleanup.

We also checked frequency over the queried 30-day window:

staging: 1 cleanup occurrence, 37 jobs rescheduled
volcano-qa: 1 cleanup occurrence, 37 jobs rescheduled

So this appears to be low-frequency shutdown/restart cleanup behavior, but noisy enough to page because each rescheduled job currently emits an ERROR event.
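To make the paging behavior concrete, here is a minimal sketch of how one cleanup pass turns into 37 ERROR events. It assumes the alert counts ERROR-level events; the actual Honeycomb trigger threshold is not stated in this PR. `Level`, `cleanup_events`, and `error_count` are hypothetical names used only for illustration and are not part of the job crate.

```rust
/// Severity of a log event; a simplified stand-in for tracing's levels.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Warn,
    Error,
}

/// One event per rescheduled job, all at the given level,
/// mirroring the per-job log emitted during shutdown cleanup.
pub fn cleanup_events(n_rescheduled: usize, level: Level) -> Vec<Level> {
    vec![level; n_rescheduled]
}

/// Number of events an ERROR-based alert would count.
pub fn error_count(events: &[Level]) -> usize {
    events.iter().filter(|l| **l == Level::Error).count()
}
```

With 37 rescheduled jobs, `error_count(&cleanup_events(37, Level::Error))` is 37, while the same pass at `Level::Warn` contributes 0 to an ERROR-based alert.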

Rationale

kill_remaining_jobs is an intentional recovery path during shutdown. It handles jobs still marked running for the shutting-down poller instance by:

resetting the execution to pending
clearing poller_instance_id
recording ExecutionAborted { reason: "killed job" }
recording ExecutionScheduled { attempt, scheduled_at }

This preserves the work and makes the job eligible to be picked up again by a poller. That is a useful operational signal, but it is not itself a job-system failure.
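The four steps listed above can be sketched as a single state transition. This is an illustrative model, not the job crate's code: the `Execution` struct, the `State` enum, the `u64` poller ID, and the `reschedule_on_shutdown` function are all assumptions. Only the event names and the "killed job" reason come from this PR. Whether the real code increments `attempt` is not stated here, so the sketch leaves it unchanged.

```rust
use std::time::SystemTime;

#[derive(Debug, Clone, PartialEq)]
pub enum State {
    Running,
    Pending,
}

#[derive(Debug, Clone, PartialEq)]
pub enum JobEvent {
    ExecutionAborted { reason: String },
    ExecutionScheduled { attempt: u32, scheduled_at: SystemTime },
}

/// Hypothetical in-memory view of one job execution.
#[derive(Debug)]
pub struct Execution {
    pub state: State,
    pub poller_instance_id: Option<u64>, // assumed type; the real ID type is not shown in the PR
    pub attempt: u32,
    pub events: Vec<JobEvent>,
}

/// Applies the shutdown recovery steps to one still-running execution.
pub fn reschedule_on_shutdown(exec: &mut Execution, now: SystemTime) {
    // 1. Reset the execution to pending.
    exec.state = State::Pending;
    // 2. Clear the owning poller instance so another poller can claim it.
    exec.poller_instance_id = None;
    // 3. Record why the execution stopped.
    exec.events.push(JobEvent::ExecutionAborted {
        reason: "killed job".to_string(),
    });
    // 4. Record that the job is scheduled again (attempt left as-is: an assumption).
    exec.events.push(JobEvent::ExecutionScheduled {
        attempt: exec.attempt,
        scheduled_at: now,
    });
}
```

After the call the execution is pending, unowned, and carries both events, which is why no work is lost.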

Real failures in this path are still preserved: DB/repo operations continue to use ?, so failures from the cleanup update, find_all, repo updates, or commit still propagate as actual errors.

Via ZenDuty Alerts

Staging
QA

Nsandomeno · 2026-06-02T14:36:16Z

Why this is not a job-system failure

The Honeycomb error trace is reporting shutdown cleanup, not a failed job-system operation.

During shutdown, the job poller looks for executions that are still marked running for the poller instance that is going away. For those executions, kill_remaining_jobs intentionally moves them back to pending, clears poller_instance_id, and records abort/reschedule events.

In the observed trace, n_killed = 37, which means 37 running executions matched the cleanup query and were reset for retry. The suspicious-looking job_repo.find_all span succeeded and returned those 37 job IDs. The subsequent job_repo.update spans also succeeded, which indicates the abort/reschedule events were persisted.

So the trace is not saying:

the DB failed
find_all failed
cleanup failed
jobs were lost

It is saying:

shutdown happened
37 in-flight jobs did not complete before cleanup
the job crate recovered by making those jobs eligible to run again

That is a useful operational signal, but it is expected recovery behavior during shutdown. The current problem is that the recovery log is emitted at ERROR level once per job, which makes Honeycomb treat successful shutdown cleanup as a LANA error and page ZenDuty.

Real failures in this path are still protected: the cleanup query, find_all, per-job repo updates, and transaction commit all use ?, so actual DB/repo failures would still propagate as real errors. Downgrading the per-job recovery log to warn! does not hide those failures.
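The separation between log severity and error propagation can be sketched as follows. This is a simplified model under stated assumptions: `Repo`, `RepoError`, `kill_remaining`, and the in-memory `log` vector are hypothetical stand-ins, and the order of logging versus updating is assumed. The point it illustrates is that the per-job log level and the `?` on the repo call are independent.

```rust
#[derive(Debug, PartialEq)]
pub enum Level {
    Warn,
    Error,
}

#[derive(Debug, PartialEq)]
pub struct RepoError(pub String);

/// Hypothetical repo whose update fails for one configured job ID.
pub struct Repo {
    pub failing_job: Option<u32>,
}

impl Repo {
    pub fn update(&self, job_id: u32) -> Result<(), RepoError> {
        if self.failing_job == Some(job_id) {
            Err(RepoError(format!("update failed for job {}", job_id)))
        } else {
            Ok(())
        }
    }
}

/// Reschedules each job, logging at Warn, and propagates any repo failure.
pub fn kill_remaining(
    repo: &Repo,
    job_ids: &[u32],
    log: &mut Vec<(Level, u32)>,
) -> Result<usize, RepoError> {
    for &id in job_ids {
        // Recovery notice: downgraded severity, one entry per job.
        log.push((Level::Warn, id));
        // A real failure still short-circuits via `?`, regardless of log level.
        repo.update(id)?;
    }
    Ok(job_ids.len())
}
```

With a healthy repo the function returns `Ok(n)` and only Warn entries are logged; with a failing update the caller receives `Err(RepoError(..))`, so the failure surfaces through the return value rather than through the recovery log.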

fix: downgrade shutdown reschedule log severity

95d337c

Nsandomeno requested review from HonestMajority and sandipndev June 1, 2026 21:19

Nsandomeno closed this Jun 2, 2026

Nsandomeno reopened this Jun 2, 2026

Nsandomeno closed this Jun 2, 2026

Nsandomeno wants to merge 1 commit into main from chore--downgrade-kill-job-exception-level
