"My queue looks wedged — what do I run?" The commands below are in the order you probably want them. Shipped with v0.19.1 after a production incident where the queue held for 90+ minutes before the operator noticed.
gbrain doctor --json | jq '.checks[] | select(.name == "queue_health")'queue_health flags two patterns:
- stalled-forever: active job whose
started_atis older than 1h. - waiting-depth: any per-name queue deeper than 10 (override via
GBRAIN_QUEUE_WAITING_THRESHOLD). Signals a missingmaxWaiting.
The nastiest stall: the worker process is running (passes ps / kill -0 /
container health), but its DB connection died (common behind a transaction
pooler) and never came back, so it claims no jobs and finishes nothing. Jobs
pile up with 0 active. Liveness checks all pass; nothing crashes.
As of v0.42.22.0 this self-heals — you usually won't have to do anything:
- The worker exits on its own dead pool. Under a supervisor, the worker's
DB-liveness probe runs and self-exits (
db_dead) after ~3 minutes; the supervisor respawns it with a fresh pool. - The supervisor restarts a worker that stops making progress. If a queue
has claimable work, 0 live-lock active jobs, and no completions for 15
minutes while the child is alive, the supervisor restarts it (covers stuck
handlers too, not just dead pools). Tune with
--wedge-restart-minutes/--wedge-restart-checksongbrain jobs supervisor(0 disables).
The signal is loud now — check either:
gbrain jobs stats --queue default # prints a WEDGED QUEUE line
gbrain doctor --json | jq '.checks[] | select(.name == "wedged_queue")'wedged_queue is a per-queue health error (0 active_healthy + waiting > 0 +
stale completions). Manual fix if you ever need it:
gbrain jobs supervisor stop && gbrain jobs supervisor start # fresh pool
gbrain jobs retry <id> # dead-lettered jobs# Who's active right now?
gbrain jobs list --status active
# Who's waiting, biggest pile first?
gbrain jobs list --status waiting --limit 50
# What's wrong with a specific job?
gbrain jobs get <id># Force-kill a single stuck job:
gbrain jobs cancel <id>
# Clear a specific job entirely (last resort):
gbrain jobs delete <id>
# Health smoke on the mechanism itself:
gbrain jobs smoke --wedge-rescue- stalled-forever — A worker claimed a job, started executing, and has
held the row for over an hour. The wall-clock sweep evicts jobs past
2×
timeout_ms; if one's still active, either notimeout_mswas set or the sweep is newly deployed and this job predates it. Cancel it. - waiting-depth — Submitters are piling up jobs faster than workers
drain them. Set
--max-waiting Non the submission or on the programmaticqueue.add()call. If you want a taller pile, raise the threshold viaGBRAIN_QUEUE_WAITING_THRESHOLD=50 gbrain doctor.
# If you're running autopilot with --no-worker, check that your external
# worker (systemd / Docker / OpenClaw service-manager) is alive:
gbrain jobs list --status active | head -5If the list is empty AND your submissions keep piling up, no worker is claiming. Start one:
GBRAIN_ALLOW_SHELL_JOBS=1 gbrain jobs work --concurrency 4- B7 —
minion_workersheartbeat table for ground-truth liveness (the--no-workerprobe and the droppedqueue_healthworker-heartbeat subcheck both need this). - B3 —
gbrain doctor --fixlearns to rescue queue wedges.