fix(test): de-flake self-update banner flow under CPU starvation#135
Merged
Conversation
CI (windows) flaked on self-update-flow scenario 7 ("first frame never contains
the banner — banner is async"): the test waited 15s for the banner, which never
rendered. Root cause: checkSelfUpdate's fetch has a 3s abort timeout; under
heavy parallel CPU load the event loop starves past 3s before the (fast, local)
stub response is processed, so the fetch is aborted → null → no banner → the
banner-wait times out (~18s total). Not scenario 7's logic — any banner-wait
test (4/5/7) can hit it.
Fix: make the fetch timeout env-overridable via
SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS (call-time read, mirrors getRegistryUrl /
getSelfUpdateCacheDir; production keeps the 3s default; non-numeric/non-positive
falls back). The flow harness sets it to 45s so the local-stub fetch can't be
spuriously aborted under load, and bumps scenarios 4/7 banner waits to 30s
(matching 5) — ordered test-timeout 60s > fetch 45s > wait 30s, no races.
Tests: getConfiguredFetchTimeoutMs unit matrix (default / override / bad input);
self-update-flow 9/9 + self-update-check 33/33 green in isolation.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR addresses a CI flake in the Layer-1 self-update banner flow tests on Windows under CPU starvation by making the self-update fetch timeout configurable via an environment variable, while preserving the production default behavior.
Changes:
- Add
getConfiguredFetchTimeoutMs()to readSHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MSat call time and use it as the default forcheckSelfUpdate’s fetch timeout. - Extend unit test coverage with a matrix covering default/valid override/invalid override cases for the new timeout configuration.
- Harden Layer-1 self-update flow tests by setting a larger fetch timeout and aligning banner
waitForbudgets (scenarios 4 and 7) to 30s.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
source/core/self-update-check.ts |
Adds env-overridable fetch timeout resolution and uses it as the default fetch timeout for checkSelfUpdate. |
tests/unit/self-update-check.test.ts |
Adds unit tests validating the new env-based timeout configuration behavior. |
tests/component/flows/self-update-flow.test.tsx |
Sets the new env var in the harness and increases banner wait budgets to reduce flakiness under CPU starvation. |
Comment on lines
+145
to
+150
| export function getConfiguredFetchTimeoutMs(): number { | ||
| const raw = process.env['SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS']; | ||
| if (raw === undefined) return FETCH_TIMEOUT_MS; | ||
| const n = Number(raw.trim()); | ||
| return Number.isFinite(n) && n > 0 ? n : FETCH_TIMEOUT_MS; | ||
| } |
Owner
Author
There was a problem hiding this comment.
Good catch — fixed in 3dbc420. getConfiguredFetchTimeoutMs now Math.truncs and clamps to 2_147_483_647 (2^31−1), so a value above the setTimeout ceiling no longer overflows to ~1ms (which would have made a large value abort sooner, not later). Added a unit case covering the fractional-truncate and clamp paths.
Owner
Author
Address Copilot review on #135: getConfiguredFetchTimeoutMs returned any finite positive number, but Node's setTimeout delay is a 32-bit signed int — a value above 2^31-1 overflows to ~1ms, which would make a fat-fingered large SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS abort the fetch almost immediately (the opposite of intent). Truncate fractions and clamp to 2,147,483,647. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The
v0.1.3release run'sci.ymlWindows cell flaked onself-update-flowscenario 7 ("first rendered frame never contains the banner — banner is async"): it waited 15s for the banner, which never rendered (~18s total). Not a regression — the only change vs the green #105 PR was the version bump, and the test passes in isolation.Root cause
checkSelfUpdate's fetch has a 3s abort timeout (FETCH_TIMEOUT_MS). Under heavy parallel CPU load the event loop starves past 3s before the (fast, local) stub response is processed, so the fetch is aborted →null→ the banner never renders → the 15swaitFortimes out. This hits any banner-wait scenario (4/5/7), not just 7 — it just landed on 7 this run.Fix
Make the fetch timeout env-overridable via
SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS— a call-time read that mirrors the file's existinggetRegistryUrl/getSelfUpdateCacheDirtest-injection pattern. Production keeps the 3s default; a non-numeric/non-positive value falls back. The flow harness sets it to 45s so the local-stub fetch can't be spuriously aborted under load, and bumps scenarios 4/7 banner waits to 30s (matching 5). Timeout ordering is race-free: test 60s > fetch 45s > wait 30s.This is a root-cause fix (prevents the spurious abort), not a blind
waitForbump.Tests
getConfiguredFetchTimeoutMsunit matrix (default / trimmed override / non-numeric / non-positive / empty → default).self-update-flow9/9 andself-update-check33/33 green in isolation; typecheck clean.Note (out of scope)
Running the full suite in one parallel batch on a saturated machine surfaces the same class of CPU-starvation flake in other Layer-1 flow tests (e.g.
install-flow#12) — different tests fail each run, all pass in isolation. That's a broader Layer-1-under-load concurrency-tuning question (vitest pool sizing / per-file isolation), separate from this targeted fix. Happy to file a tracking issue if you want it pursued.