Skip to content

fix(test): de-flake self-update banner flow under CPU starvation#135

Merged
breferrari merged 2 commits into
mainfrom
fix/flaky-self-update-flow-banner
Jun 2, 2026
Merged

fix(test): de-flake self-update banner flow under CPU starvation#135
breferrari merged 2 commits into
mainfrom
fix/flaky-self-update-flow-banner

Conversation

@breferrari

Copy link
Copy Markdown
Owner

Problem

The v0.1.3 release run's ci.yml Windows cell flaked on self-update-flow scenario 7 ("first rendered frame never contains the banner — banner is async"): it waited 15s for the banner, which never rendered (~18s total). Not a regression — the only change vs the green #105 PR was the version bump, and the test passes in isolation.

Root cause

checkSelfUpdate's fetch has a 3s abort timeout (FETCH_TIMEOUT_MS). Under heavy parallel CPU load the event loop starves past 3s before the (fast, local) stub response is processed, so the fetch is aborted → null → the banner never renders → the 15s waitFor times out. This hits any banner-wait scenario (4/5/7), not just 7 — it just landed on 7 this run.

Fix

Make the fetch timeout env-overridable via SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS — a call-time read that mirrors the file's existing getRegistryUrl / getSelfUpdateCacheDir test-injection pattern. Production keeps the 3s default; a non-numeric/non-positive value falls back. The flow harness sets it to 45s so the local-stub fetch can't be spuriously aborted under load, and bumps scenarios 4/7 banner waits to 30s (matching 5). Timeout ordering is race-free: test 60s > fetch 45s > wait 30s.

This is a root-cause fix (prevents the spurious abort), not a blind waitFor bump.

Tests

  • New getConfiguredFetchTimeoutMs unit matrix (default / trimmed override / non-numeric / non-positive / empty → default).
  • self-update-flow 9/9 and self-update-check 33/33 green in isolation; typecheck clean.

Note (out of scope)

Running the full suite in one parallel batch on a saturated machine surfaces the same class of CPU-starvation flake in other Layer-1 flow tests (e.g. install-flow #12) — different tests fail each run, all pass in isolation. That's a broader Layer-1-under-load concurrency-tuning question (vitest pool sizing / per-file isolation), separate from this targeted fix. Happy to file a tracking issue if you want it pursued.

CI (windows) flaked on self-update-flow scenario 7 ("first frame never contains
the banner — banner is async"): the test waited 15s for the banner, which never
rendered. Root cause: checkSelfUpdate's fetch has a 3s abort timeout; under
heavy parallel CPU load the event loop starves past 3s before the (fast, local)
stub response is processed, so the fetch is aborted → null → no banner → the
banner-wait times out (~18s total). Not scenario 7's logic — any banner-wait
test (4/5/7) can hit it.

Fix: make the fetch timeout env-overridable via
SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS (call-time read, mirrors getRegistryUrl /
getSelfUpdateCacheDir; production keeps the 3s default; non-numeric/non-positive
falls back). The flow harness sets it to 45s so the local-stub fetch can't be
spuriously aborted under load, and bumps scenarios 4/7 banner waits to 30s
(matching 5) — ordered test-timeout 60s > fetch 45s > wait 30s, no races.

Tests: getConfiguredFetchTimeoutMs unit matrix (default / override / bad input);
self-update-flow 9/9 + self-update-check 33/33 green in isolation.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 2, 2026 00:28

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a CI flake in the Layer-1 self-update banner flow tests on Windows under CPU starvation by making the self-update fetch timeout configurable via an environment variable, while preserving the production default behavior.

Changes:

  • Add getConfiguredFetchTimeoutMs() to read SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS at call time and use it as the default for checkSelfUpdate’s fetch timeout.
  • Extend unit test coverage with a matrix covering default/valid override/invalid override cases for the new timeout configuration.
  • Harden Layer-1 self-update flow tests by setting a larger fetch timeout and aligning banner waitFor budgets (scenarios 4 and 7) to 30s.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
source/core/self-update-check.ts Adds env-overridable fetch timeout resolution and uses it as the default fetch timeout for checkSelfUpdate.
tests/unit/self-update-check.test.ts Adds unit tests validating the new env-based timeout configuration behavior.
tests/component/flows/self-update-flow.test.tsx Sets the new env var in the harness and increases banner wait budgets to reduce flakiness under CPU starvation.

Comment on lines +145 to +150
export function getConfiguredFetchTimeoutMs(): number {
const raw = process.env['SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS'];
if (raw === undefined) return FETCH_TIMEOUT_MS;
const n = Number(raw.trim());
return Number.isFinite(n) && n > 0 ? n : FETCH_TIMEOUT_MS;
}

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 3dbc420. getConfiguredFetchTimeoutMs now Math.truncs and clamps to 2_147_483_647 (2^31−1), so a value above the setTimeout ceiling no longer overflows to ~1ms (which would have made a large value abort sooner, not later). Added a unit case covering the fractional-truncate and clamp paths.

@breferrari

Copy link
Copy Markdown
Owner Author

Broader Layer-1-under-load flakiness now tracked in #136 (this PR fixes the self-update instance; #136 covers the class).

Address Copilot review on #135: getConfiguredFetchTimeoutMs returned any finite
positive number, but Node's setTimeout delay is a 32-bit signed int — a value
above 2^31-1 overflows to ~1ms, which would make a fat-fingered large
SHARDMIND_SELF_UPDATE_FETCH_TIMEOUT_MS abort the fetch almost immediately (the
opposite of intent). Truncate fractions and clamp to 2,147,483,647.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@breferrari breferrari merged commit a73c7ce into main Jun 2, 2026
6 checks passed
@breferrari breferrari deleted the fix/flaky-self-update-flow-banner branch June 2, 2026 00:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants