ci: tighten job timeouts + step-level limits for timed_out visibility #249

Merged: pasrom merged 1 commit into `main` from `ci/tighten-job-timeouts` on May 12, 2026

@pasrom commented on May 12, 2026

Summary

Two workflows (`ci.yml`, `e2e.yml`) had job-timeouts far above their observed p50, `e2e-app.yml` had outgrown its budget, and none paired the long-running steps with step-level timeouts. Result today: a hung test consumes the full job budget and surfaces as an opaque `cancelled` (identical to user-cancel or concurrency-cancel), so API forensics are needed to tell them apart.

This PR sets job-timeouts to honest worst-case plus margin and adds step-level limits so hangs surface as `timed_out` on the actual hung step.

Changes

| Workflow | Job | Job timeout before | Job timeout after | New step timeout | Step covered |
|---|---|---|---|---|---|
| `ci.yml` | `test` | 30 min | 25 min | 18 min | Swift tests |
| `e2e.yml` | `e2e` | 60 min | 30 min | 25 min | Run E2E tests |
| `e2e-app.yml` | `e2e-app` | 15 min | 20 min | 10 min × 3 | each live-recording lane |
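As a rough illustration of the `ci.yml` row, the pairing of a job-level budget with a step-level limit might look like the sketch below. Only the job name `test`, the step name `Swift tests`, and the two timeout values come from this PR. The trigger, runner label, checkout step, and test command are assumptions, not the repository's actual workflow.

```yaml
# Sketch only, not the repository's real ci.yml.
name: ci
on: [pull_request]             # assumption: the actual triggers are not shown in this PR

jobs:
  test:
    runs-on: macos-latest      # assumption: the real runner label may differ
    timeout-minutes: 25        # job-level budget (was 30)
    steps:
      - uses: actions/checkout@v4
      - name: Swift tests
        timeout-minutes: 18    # step-level limit, expires before the job budget does
        run: swift test        # assumption: the actual test command may differ
```

Because the step limit is shorter than the job limit, a hang in `Swift tests` should hit the step limit first, which is what the PR relies on to get `timed_out` on the step.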

Why

  • `ci.yml/test`: observed p50 5–7 min, max ~15 min cold-cache. The 18 min step-timeout is roughly 3× p50. The old 30 min job-timeout was up to 6× the observed p50, so a hang ate 25 min of CI time before surfacing.
  • `e2e.yml` observed p50 ~5 min. Old 60 min was defensive for 1.75 GB model downloads on cold cache, but the Mini's disk is persistent so cold cache is rare. 30 min covers worst-case fresh-download. 25 min step-timeout fails fast on hang.
  • `e2e-app.yml`: 15 min was set when only one lane existed (~6 min wall). After #247 (test(e2e-app): cover record-only mode with sidecar+WAV assertions) and #248 (feat(rpc): /action/enqueueFile + chained reimport E2E) there are three lanes totaling ~14 min observed. Bumped to 20 min for honest margin; 10 min per lane lets us tell which lane hung.
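The per-lane limits for `e2e-app.yml` could be sketched roughly as follows. The lane names, script paths, and runner label are hypothetical placeholders. Only the job name `e2e-app`, the 20 min job budget, and the 10 min limit per lane are taken from this PR.

```yaml
# Sketch only: lane names and commands below are hypothetical.
jobs:
  e2e-app:
    runs-on: self-hosted                  # assumption: the real runner label may differ
    timeout-minutes: 20                   # job-level budget (was 15)
    steps:
      - uses: actions/checkout@v4
      - name: Live-recording lane A       # hypothetical lane name
        timeout-minutes: 10
        run: ./scripts/e2e-app-lane-a.sh  # hypothetical script
      - name: Live-recording lane B       # hypothetical lane name
        timeout-minutes: 10
        run: ./scripts/e2e-app-lane-b.sh  # hypothetical script
      - name: Live-recording lane C       # hypothetical lane name
        timeout-minutes: 10
        run: ./scripts/e2e-app-lane-c.sh  # hypothetical script
```

The three step limits add up to 30 min, which is more than the 20 min job budget. In this layout the step limit identifies a single hung lane, while the job budget still caps the total run.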

Background: `docs/plans/.local/open/2026-05-11-job-timeout-visibility.md` documents the cancelled-vs-timed_out distinction from PR #231's incident.

Test plan

  • All three YAMLs syntactically valid (`python3 -c 'import yaml; yaml.safe_load(open(…))'`)
  • PR-CI green on the new `test` budgets (this PR itself exercises them)
  • After merge: future hangs in `Swift tests` / `Run E2E tests` / `e2e-app` lanes surface as `timed_out` on the step, not `cancelled` on the job

…lity

Cuts massively-over-budget job-timeouts down to honest worst-case +
margin, and pairs each long-running step with its own step-timeout so a
hang surfaces as `timed_out` on that step instead of as an opaque
job-level `cancelled` after the full budget elapses.

- ci.yml `test`: 30 → 25 min job; new 18 min step-timeout on Swift tests
  (p50 5–7 min, ~3× margin)
- e2e.yml `e2e`: 60 → 30 min job; new 25 min step-timeout on Run E2E
  tests (cold cache covers multi-GB model downloads)
- e2e-app.yml `e2e-app`: 15 → 20 min job (the three live-recording
  lanes now total ~14 min observed); new 10 min step-timeout on each
  lane so we can tell which one hung

Background: docs/plans/.local/open/2026-05-11-job-timeout-visibility.md
documents the cancelled-vs-timed_out distinction. Job-level timeouts map
to `cancelled` (same as user-cancel and concurrency-cancel) — only step-
level timeouts produce `timed_out`. Without step-timeouts, a hung test
takes ~30 min to fail AND looks identical to a manual cancel in the UI.

github-actions bot added the `chore` (Maintenance or non-functional changes) label on May 12, 2026
pasrom merged commit 5ee13e4 into `main` on May 12, 2026
11 checks passed
pasrom deleted the `ci/tighten-job-timeouts` branch on May 12, 2026 18:53