fix(webhook): bump default per-attempt timeout from 15s to 60s#74

Merged
viktor-shcherb merged 2 commits into main from fix/webhook-timeout-bump on Apr 30, 2026

@viktor-shcherb
Member

Summary

The jobseek-murmur-shim's accept handler runs a defense-in-depth rerunProbes step that spawns one Python subprocess per board (Playwright + httpx + the entire crawler tree on a cold venv). For a 1-board run this empirically takes ~20s, which busts the 15s DEFAULT_WEBHOOK_REQUEST_TIMEOUT_MS.

Result: every webhook delivery times out with "This operation was aborted". Both retry attempts hit the same wall, webhook_status flips to failed, and the demo run never visibly publishes — even though the receiver is happily writing the catalog row server-side.
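The failure mode above can be reproduced in miniature. This is a minimal sketch, not murmur's actual delivery code: `attemptWithTimeout`, `deliver`, `slowReceiver`, and the two-attempt policy are hypothetical names and assumptions chosen to mirror the description (a per-attempt race timer, two attempts, then `failed`). Durations are scaled down so it runs quickly.

```typescript
// Hypothetical sketch: names and retry policy are assumptions, not murmur's code.
type DeliveryStatus = "delivered" | "failed";

// Race one attempt against a timer; abort the attempt when the timer wins.
async function attemptWithTimeout<T>(
  attempt: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number,
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort();
      reject(new Error("This operation was aborted"));
    }, timeoutMs);
  });
  try {
    return await Promise.race([attempt(controller.signal), timeout]);
  } finally {
    // Stop the timer if the attempt settled first.
    clearTimeout(timer);
  }
}

// Try up to `maxAttempts` times; every attempt gets the same per-attempt budget.
async function deliver(
  attempt: (signal: AbortSignal) => Promise<unknown>,
  timeoutMs: number,
  maxAttempts = 2,
): Promise<DeliveryStatus> {
  for (let i = 0; i < maxAttempts; i++) {
    try {
      await attemptWithTimeout(attempt, timeoutMs);
      return "delivered";
    } catch {
      // Timeout or transport error: fall through to the next attempt.
    }
  }
  return "failed";
}

// Simulated receiver that needs `workMs` before it responds. It ignores the
// abort signal, much like a server that keeps working after the client gives up.
function slowReceiver(workMs: number) {
  return (_signal: AbortSignal) =>
    new Promise<string>((resolve) => setTimeout(() => resolve("ok"), workMs));
}
```

With a receiver that needs 200ms, `deliver(slowReceiver(200), 50)` exhausts both attempts and resolves to `"failed"`, while `deliver(slowReceiver(200), 1000)` resolves to `"delivered"`. Retrying does not help when the receiver's work time is consistently above the per-attempt budget, which is why the fix raises the budget rather than the attempt count.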

Live evidence (prod)

Run r_b33a598674192a29cf6d0eb7 (Discord):

  • Manual POST to https://jobseek.colophon-group.org/api/murmur/accept with the run's composed final_output returned {ok:true, applied:true, company_id:..., board_count:1} in 22.6s
  • Murmur's deliver attempts: both logged webhook.attempt_transport_error error:"This operation was aborted" after the 15s race timer expired

Fix

Bump the default to 60s. That gives roughly 2.7x headroom over the observed 22.6s worst case (~3x over the typical ~20s), and it is still well under the agent claim TTL, so a runaway receiver can't lock anything for too long.

The constant is already env-overridable via the route-level seam, so callers (and tests) that want a tighter bound keep the option.
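For readers unfamiliar with this kind of override, here is a rough sketch of how an env-overridable default might be resolved. It is an illustration only: the PR does not show the seam's code, so `resolveWebhookTimeoutMs` and the `WEBHOOK_REQUEST_TIMEOUT_MS` variable name are hypothetical. Only the constant name and the 60s value come from this PR.

```typescript
// Default value from this PR.
const DEFAULT_WEBHOOK_REQUEST_TIMEOUT_MS = 60_000;

// Hypothetical helper: resolve the per-attempt timeout from an env-like map,
// falling back to the default when the override is missing or unusable.
function resolveWebhookTimeoutMs(
  env: Record<string, string | undefined>,
  key = "WEBHOOK_REQUEST_TIMEOUT_MS", // assumed name, not confirmed by the PR
): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") {
    return DEFAULT_WEBHOOK_REQUEST_TIMEOUT_MS;
  }
  const parsed = Number(raw);
  // Ignore NaN, fractions, zero, and negatives instead of passing them on.
  if (!Number.isInteger(parsed) || parsed <= 0) {
    return DEFAULT_WEBHOOK_REQUEST_TIMEOUT_MS;
  }
  return parsed;
}
```

Under these assumptions, a test that wants the old, tighter bound could call `resolveWebhookTimeoutMs({ WEBHOOK_REQUEST_TIMEOUT_MS: "15000" })`, while production with no override gets 60000. Falling back on invalid input is a design choice of this sketch; failing loudly at startup would be an equally reasonable alternative.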

Test plan

  • src/webhook.test.ts — all 13 tests pass
  • After deploy + a fresh demo run: webhook fires, webhook_status flips to delivered, the company row appears in jobseek's catalog table on local Hetzner Postgres

🤖 Generated with Claude Code

viktor-shcherb and others added 2 commits April 30, 2026 12:24
The jobseek-murmur-shim accept handler does a defense-in-depth
`rerunProbes` step that spawns one Python subprocess per board
(Playwright + httpx + the entire crawler tree on a cold venv).
Empirically that takes ~20s for a 1-board run (Discord), which
busts the original 15s budget — every webhook delivery times out
with `This operation was aborted`, both attempts hit the timeout,
and webhook_status flips to `failed` even though the receiver is
still happily writing the catalog row.

Live evidence on prod (run r_b33a598674192a29cf6d0eb7):
  - manual POST to https://jobseek.colophon-group.org/api/murmur/accept
    completed in 22.6s with `{ok:true, applied:true}`
  - murmur's deliver attempts both logged
    `webhook.attempt_transport_error error:"This operation was aborted"`
    after the 15s race timer expired

60s gives ~3x headroom over the observed worst case while still
capping a runaway receiver well below the agent claim TTL.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@viktor-shcherb viktor-shcherb merged commit cd4c55c into main Apr 30, 2026
2 checks passed