docs: update runner doctor — A13 fixed in v0.27.15, B7 rootless repair incomplete #5762
Pull request overview
This PR syncs the self-hosted runner “runner doctor” knowledge base across its three mirrored copies, incorporating the latest scan window learnings for ARC/DinD (A13) and rootless cleanup (B7) so triage guidance matches current AWF behavior.
Changes:
- Updates failure mode A13 to “Fixed in AWF v0.27.15” via `runner.topology: "arc-dind"` and the `sysroot-stage` init container approach.
- Updates failure mode B7 to reflect the remaining rootless Docker edge case and the post-v0.27.15 `chmod -R a+rwX` mitigation.
- Removes A13 from the “known unresolved items / open gaps” lists and updates the corresponding prose.
Show a summary per file
| File | Description |
|---|---|
| `.github/workflows/shared/self-hosted-failure-modes.md` | Updates A13/B7 table rows and removes A13 from known-unresolved list. |
| `.github/workflows/self-hosted-runner-doctor.md` | Updates the “open gaps” guidance to reflect A13 being fixed in v0.27.15. |
| `.github/agents/self-hosted-runner-doctor.md` | Mirrors the same runner-doctor guidance and A13/B7 updates. |
Review details
- Files reviewed: 3/3 changed files
- Comments generated: 4
- Review effort level: Low
| | B5 | `getaddrinfo EAI_AGAIN <awmg-cli-proxy>` → `awf-cli-proxy could not connect to the external DIFC proxy` → `The agent was never invoked` in `--network-isolation` + `--topology-attach` runs | Startup ordering deadlock: `connectTopologyContainers()` runs only after `startContainers()` succeeds, but `startContainers()` blocks on the cli-proxy health gate that requires the topology peer to be reachable on `awf-net` (which `internal: true`). The peer is never attached → EAI_AGAIN → fail-fast → deadlock. Deterministic, not flaky. | Resolved in AWF: attach topology peers to `awf-net` before the health-gated bring-up (Fix A: split `up -d`, network first → attach → remaining); also harden cli-proxy to treat `EAI_AGAIN`/`ENOTFOUND` as not-yet-ready (Fix B) | Confirm `topologyAttach` is non-empty; check the cli-proxy logs for `EAI_AGAIN`; verify AWF version includes the ordering fix | #5543, #5542 | | ||
| | B6 | `EACCES` in `upload-artifact` step after a `sudo: false` (`--network-isolation`) AWF run; firewall log/audit dirs present but unreadable | Sidecars write files as non-runner UIDs (squid → uid 13, cli-proxy → `cliproxy`, agent/iptables-init → root). AWF's `chmod -R a+rX` repair runs as the unprivileged runner and silently fails at `debug` level on files it doesn't own | Resolved in AWF: (a) run Node sidecars as runner UID via compose `user:`; (b) root perm-fixer container at cleanup (daemon-run, mounts log dir, chowns to runner UID, skipped when `--keep-containers`); (c) promote swallowed-`chmod` failure from `debug` to `warn` | `ls -la <firewall-logs-dir>` after run — look for root or uid-13 owned files; check AWF logs for the swallowed `chmod` warning | #5545, #5542 | | ||
| | B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf-<ts>-chroot-home/<path>` (e.g. `.aws/config`, cloud credentials) | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Fixed in AWF v0.27.13**: upgrade AWF; `removeWorkDirectories` now catches `EACCES` and retries after spinning up a short-lived repair container (`CHOWN`/`DAC_OVERRIDE`/`FOWNER` capabilities) that chowns the files back to the host user | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13 | #5653 | | ||
| | B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf-<ts>-chroot-home/<path>` (e.g. `.aws/config`, cloud credentials); or `[WARN] Failed to remove chroot home directory after permission repair` after a seemingly successful repair container run | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Partially fixed in AWF v0.27.13** (repair container with CHOWN/DAC_OVERRIDE/FOWNER capabilities); **further fix merged post-v0.27.15** (#5717): in rootless Docker the repair container's `chown` operates within the user namespace and may not change host-level ownership. The post-v0.27.15 fix adds `chmod -R a+rwX` so the host can delete the directory regardless of ownership. Non-fatal if unfixed — leaves an orphan `/tmp/awf-*-chroot-home` dir. | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13; check AWF logs for `[WARN] Failed to remove chroot home directory after permission repair` | #5653, github/gh-aw-firewall#5708, github/gh-aw-firewall#5717 | |
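The updated B7 row describes a delete, repair, retry sequence in which a permission-widening step (`chmod -R a+rwX`) lets the host remove the directory even when ownership could not be changed. The following is a minimal Python sketch of that control flow only. It is not AWF's code (the row names Node's `fs.rmSync` and a repair container), and `remove_with_repair`, `chmod_recursive_rwX`, and `_add_rwX` are hypothetical names chosen for illustration. The repair here is a local `chmod`, which only works on files the caller owns; it stands in for the containerized repair.

```python
import os
import shutil
import stat
from typing import Callable, List

# Message text taken from the B7 row; everything else here is illustrative.
WARNING = "Failed to remove chroot home directory after permission repair"


def _add_rwX(path: str) -> None:
    """Add rw for everyone; add x only for directories or already-executable files."""
    mode = os.lstat(path).st_mode
    if stat.S_ISLNK(mode):
        return  # leave symlinks alone
    new_mode = stat.S_IMODE(mode) | 0o666
    if stat.S_ISDIR(mode) or mode & 0o111:
        new_mode |= 0o111
    os.chmod(path, new_mode)


def chmod_recursive_rwX(root: str) -> None:
    """Local stand-in for `chmod -R a+rwX` (assumption: caller owns the files)."""
    _add_rwX(root)
    for current, dirs, files in os.walk(root):
        # Top-down walk: widen each child before os.walk descends into it.
        for name in dirs + files:
            _add_rwX(os.path.join(current, name))


def remove_with_repair(
    path: str, repair: Callable[[str], None], warnings: List[str]
) -> bool:
    """Delete `path`; on a permission error run `repair` once, then retry."""
    try:
        shutil.rmtree(path)
        return True
    except PermissionError:
        pass  # analogous to the EACCES case in the B7 row

    repair(path)
    try:
        shutil.rmtree(path)
        return True
    except PermissionError:
        # Non-fatal: record a warning and leave the orphan directory behind.
        warnings.append(WARNING)
        return False
```

Passing the repair step as a callable keeps the retry logic separate from how permissions are actually fixed, so the same flow applies whether the repair is a local `chmod` or a privileged helper.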
@@ -109,9 +109,9 @@ Prefer the narrowest match. Examples:
### 4. Check for known unresolved problems

@@ -81,9 +81,9 @@ Prefer the narrowest match. Examples:
### 4. Check for known unresolved problems
@copilot address review feedback

Done in 3707555..latest:

✅ Copilot review passed with no inline comments.

@copilot Add the

📡 Smoke OTel Tracing completed. All tracing scenarios validated. ✅

✅ Smoke Gemini completed. All facets verified. 💎

✅ Build Test Suite completed successfully!

✅ Smoke Copilot BYOK AOAI (api-key) completed. Copilot AOAI BYOK (api-key) mode operational. 🔓

✅ Contribution Check completed successfully! Contribution guidelines review complete for PR #5762: all applicable CONTRIBUTING.md requirements are satisfied; no comment needed.

✨ The prophecy is fulfilled... Smoke Codex has completed its mystical journey. The stars align. 🌟

✅ Smoke Copilot BYOK AOAI (Entra) completed. Copilot AOAI BYOK (Entra) mode operational. 🔓

✅ Smoke Claude passed

🔌 Smoke Services — All services reachable! ✅

📰 VERDICT: Smoke Copilot has concluded. All systems operational. This is a developing story. 🎤

✅ Smoke Copilot BYOK completed. Copilot BYOK mode operational. 🔓

🚀 Security Guard has started processing this pull request

🔑 Smoke Copilot PAT PAT auth validated. All systems operational. ✅

Chroot tests passed! Smoke Chroot - All security and functionality tests succeeded.
🤖 Smoke Test Results — PASS
PR: docs: update runner doctor — A13 fixed in v0.27.15, B7 rootless repair incomplete
Overall: ✅ PASS

Smoke Test: Copilot PAT Auth — PR #5762 (
Overall: FAIL — pre-computed step outputs (
Auth mode: PAT (COPILOT_GITHUB_TOKEN)

Smoke Test: Claude Engine Validation
Overall result: PASS ✅

Smoke Test: Copilot BYOK (Direct) Mode ✅
All tests passed.
Status: PASS
Running in direct BYOK mode via

Direct BYOK (github-oidc+Azure AD) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw)
Overall: PASS

docs: update runner doctor — A13 fixed in v0.27.15, B7 rootless repair incomplete
Running in direct BYOK mode (COPILOT_PROVIDER_API_KEY + COPILOT_PROVIDER_BASE_URL) via api-proxy → Azure OpenAI (Foundry, o4-mini-aw)
Overall: PASS

🔭 Smoke Test: API Proxy OpenTelemetry Tracing
All scenarios pass. OTEL tracing integration is correctly implemented.
✅ fix: ensure chmod runs even when chown fails in rootless permission repair
Warning: Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
network:
  allowed:
    - defaults
    - "registry.npmjs.org"
See Network Configuration for more information.

🔍 Chroot Version Comparison Results
Overall: ❌ Not all tests passed — Python and Node.js versions differ between host and chroot.

Smoke Test: GitHub Actions Services Connectivity
Overall: FAIL

Gemini Engine Smoke Test Results
Overall status: FAIL
Warning: Firewall blocked 1 domain
The following domain was blocked by the firewall during workflow execution:
network:
  allowed:
    - defaults
    - "localhost"
See Network Configuration for more information.

🏗️ Build Test Suite Results
Overall: 8/8 ecosystems passed — ✅ PASS
Knowledge-base sync for scan window 2026-06-29→30: two lessons from #5693/#5696 (A13 shipped) and #5708/#5717 (B7 rootless edge case).

A13 — Fixed in v0.27.15 (arc-dind topology)
`runner.topology: "arc-dind"` now triggers a `sysroot-stage` init container that copies the signed `build-tools` image filesystem into a `sysroot` volume at `/host:ro` before the agent starts. The old "Unresolved / bake binaries into daemon image" workaround is gone.
- `awf --version` ≥ v0.27.15 and `sysroot-stage` in compose output
- #5693, #5696
- `stageBaseSystem()`/`dind.preStageDirs` framing

B7 — Partially fixed; rootless `chown` insufficient
In rootless Docker the repair container's `chown` runs inside the user namespace and may not affect host-level ownership, leaving an orphan `/tmp/awf-*-chroot-home` dir. Post-v0.27.15 (#5717) adds `chmod -R a+rwX` as a fallback.
- `[WARN] Failed to remove chroot home directory after permission repair` trigger
- `chmod` fix
- #5708, #5717

Files changed
All three mirrored files updated identically per the updater workflow's sync contract:
- `.github/workflows/shared/self-hosted-failure-modes.md`
- `.github/workflows/self-hosted-runner-doctor.md`
- `.github/agents/self-hosted-runner-doctor.md`
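
The A13 detection hint in this description (an AWF version of at least v0.27.15 together with `sysroot-stage` appearing in the compose output) can be expressed as a small check. This is a sketch under assumptions: the exact output format of `awf --version` is not shown on this page, so the code only assumes the output contains a `major.minor.patch` substring, and `parse_version` and `a13_fix_expected` are hypothetical helper names.

```python
import re
from typing import Tuple

# Minimum version taken from the A13 note above.
A13_FIXED_IN = (0, 27, 15)


def parse_version(text: str) -> Tuple[int, ...]:
    """Extract the first major.minor.patch triple from `text` (optional leading 'v')."""
    match = re.search(r"v?(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def a13_fix_expected(version_output: str, compose_output: str) -> bool:
    """True when both A13 signals are present: new enough AWF and a sysroot-stage entry."""
    new_enough = parse_version(version_output) >= A13_FIXED_IN
    return new_enough and "sysroot-stage" in compose_output
```

Comparing integer tuples avoids the string-comparison trap where `"0.27.9"` would sort after `"0.27.15"`.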