ci: tighten job timeouts + step-level limits for timed_out visibility #249
Merged
…lity

Cuts massively-over-budget job-timeouts down to honest worst-case + margin, and pairs each long-running step with its own step-timeout so a hang surfaces as `timed_out` on that step instead of as an opaque job-level `cancelled` after the full budget elapses.

- ci.yml `test`: 30 → 25 min job; new 18 min step-timeout on Swift tests (p50 5–7 min, ~3× margin)
- e2e.yml `e2e`: 60 → 30 min job; new 25 min step-timeout on Run E2E tests (cold cache covers multi-GB model downloads)
- e2e-app.yml `e2e-app`: 15 → 20 min job (the three live-recording lanes now total ~14 min observed); new 10 min step-timeout on each lane so we can tell which one hung

Background: docs/plans/.local/open/2026-05-11-job-timeout-visibility.md documents the cancelled-vs-timed_out distinction. Job-level timeouts map to `cancelled` (same as user-cancel and concurrency-cancel); only step-level timeouts produce `timed_out`. Without step-timeouts, a hung test takes ~30 min to fail AND looks identical to a manual cancel in the UI.
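As a rough illustration of the per-lane limits in the `e2e-app` bullet, the job might look something like the sketch below. Only the job name and the 20 and 10 minute values come from the commit message; the runner, the lane names, and the `./scripts/run-lane.sh` command are hypothetical placeholders, not the repository's actual configuration.

```yaml
# Sketch of .github/workflows/e2e-app.yml (illustrative only).
# Lane names, runner, and commands below are hypothetical.
name: e2e-app
on: [pull_request]            # assumed trigger

jobs:
  e2e-app:
    runs-on: macos-latest     # assumed runner
    timeout-minutes: 20       # job-level budget (was 15)
    steps:
      - uses: actions/checkout@v4

      # Each lane carries its own limit, so a hang is reported
      # against the specific lane rather than the whole job.
      - name: Live recording lane 1          # hypothetical name
        timeout-minutes: 10
        run: ./scripts/run-lane.sh lane-1    # hypothetical command

      - name: Live recording lane 2          # hypothetical name
        timeout-minutes: 10
        run: ./scripts/run-lane.sh lane-2    # hypothetical command

      - name: Live recording lane 3          # hypothetical name
        timeout-minutes: 10
        run: ./scripts/run-lane.sh lane-3    # hypothetical command
```

With the lanes totalling roughly 14 minutes in normal runs, a single lane that hangs should reach its own 10 minute limit before the 20 minute job budget runs out.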
Summary
Three workflows had job-timeouts massively over their actual p50, and none paired the long-running steps with step-level timeouts. Result today: a hung test consumes the full job budget and then surfaces as an opaque `cancelled`, identical to a user-cancel or concurrency-cancel, so telling the cases apart takes API forensics.
This PR tightens job-timeouts to honest worst-case + 3× margin and adds step-level limits so hangs surface as `timed_out` on the actual hung step.
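A minimal sketch of the pairing for the `ci.yml` `test` job, assuming standard GitHub Actions syntax. The `timeout-minutes` key is valid at both job and step level; the 25 and 18 minute values and the "Swift tests" step name are taken from this PR, while the triggers, runner, and `swift test` command are assumptions.

```yaml
# Sketch of .github/workflows/ci.yml (illustrative only).
name: ci
on: [push, pull_request]      # assumed triggers

jobs:
  test:
    runs-on: macos-latest     # assumed runner
    timeout-minutes: 25       # job-level budget (was 30)
    steps:
      - uses: actions/checkout@v4

      - name: Swift tests
        # Step-level limit set below the job budget, so a hang is
        # attributed to this step before the job-level timeout fires.
        timeout-minutes: 18
        run: swift test       # assumed command
```

Keeping the step limit under the job limit is what lets the step-level timeout fire first.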
Changes
Why
Background: `docs/plans/.local/open/2026-05-11-job-timeout-visibility.md` documents the cancelled-vs-timed_out distinction from PR #231's incident.
Test plan