Summary
Proposed knowledge-base changes
File: .github/workflows/shared/self-hosted-failure-modes.md
1. Add A13 (new row in Category A - ARC / DinD)
Insert after the A12 row:
| A13 | `chroot: failed to run command '/bin/sh': No such file or directory` or `[entrypoint][ERROR] capsh not found on host system` on a **glibc/Debian daemon** (not musl/Alpine) | ARC/DinD split-fs: system-mount source dirs (`/tmp/gh-aw/{usr,bin,lib,...}`) are empty because nothing populates them; `stageBaseSystem()` does not yet exist. The entrypoint "musl/Alpine" warning is **misleading**: it fires because no dynamic loader is found, not because the daemon is musl. `dind.preStageDirs` only mkdir's empty work dirs; it does not stage a base userland. | **Unresolved**: `stageBaseSystem()` capability not yet implemented; base userland must originate from the AWF-signed image to preserve security invariants (never from runner/daemon-writable paths for pre-`capsh` execution). Workaround: bake required binaries (`/bin/sh`, `bash`, `capsh`, loader, coreutils) directly into the DinD daemon image | Confirm daemon libc: `ldd --version` inside the DinD container. Then check whether the staging dir is populated: `ls /tmp/gh-aw/usr/bin/sh` (or `/tmp/gh-aw/bin/sh`) on the runner side. If the daemon is glibc but the file is missing, this is A13, not A4. | #5541 |
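The diagnostic in the A13 row boils down to a two-input decision: which libc the daemon reports, and whether the staged `sh` exists. The sketch below illustrates that decision only; the function and enum names are hypothetical and are not part of AWF or the doctor. It assumes the combined stdout/stderr of `ldd --version` is passed in as text.

```python
from enum import Enum


class Verdict(Enum):
    A13 = "glibc daemon, base userland not staged"
    A4_CANDIDATE = "musl daemon: check A4 before A13"
    UNDETERMINED = "not enough evidence for A13"


def detect_libc(ldd_version_output: str) -> str:
    """Classify the combined stdout/stderr of `ldd --version`."""
    text = ldd_version_output.lower()
    if "musl" in text:
        return "musl"
    # glibc builds typically print "GNU libc" or a distro-tagged "GLIBC".
    if "glibc" in text or "gnu libc" in text:
        return "glibc"
    return "unknown"


def triage_chroot_failure(ldd_version_output: str, staged_sh_present: bool) -> Verdict:
    """Apply the row's rule: glibc daemon + missing staged sh means A13, not A4."""
    libc = detect_libc(ldd_version_output)
    if libc == "musl":
        return Verdict.A4_CANDIDATE
    if libc == "glibc" and not staged_sh_present:
        return Verdict.A13
    return Verdict.UNDETERMINED
```

For example, `triage_chroot_failure("ldd (GNU libc) 2.31", staged_sh_present=False)` yields `Verdict.A13`, where `staged_sh_present` would come from checking `/tmp/gh-aw/usr/bin/sh` (or `/tmp/gh-aw/bin/sh`) on the runner side.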
2. Add B5 (new row in Category B - Self-hosted runners)
Append after the B4 row:
| B5 | `getaddrinfo EAI_AGAIN <awmg-cli-proxy>` → `awf-cli-proxy could not connect to the external DIFC proxy` → `The agent was never invoked` in `--network-isolation` + `--topology-attach` runs | Startup ordering deadlock: `connectTopologyContainers()` runs only after `startContainers()` succeeds, but `startContainers()` blocks on the cli-proxy health gate, which requires the topology peer to be reachable on `awf-net` (which is `internal: true`). The peer is never attached → EAI_AGAIN → fail-fast → deadlock. Deterministic, not flaky. | Resolved in AWF: attach topology peers to `awf-net` before the health-gated bring-up (Fix A: split `up -d`, network first → attach → remaining); also harden cli-proxy to treat `EAI_AGAIN`/`ENOTFOUND` as not-yet-ready (Fix B) | Confirm `topologyAttach` is non-empty; check the cli-proxy logs for `EAI_AGAIN`; verify AWF version includes the ordering fix | #5543, #5542 |
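Fix B in the B5 row describes treating transient resolver errors as "not yet ready" rather than fatal. The following is a minimal Python sketch of that idea, not AWF's actual implementation. The name `wait_for_peer` and its parameters are hypothetical, and `socket.EAI_NONAME` is used as the closest Python counterpart of `ENOTFOUND`, which is an assumption.

```python
import socket
import time

# Resolver errors interpreted as "peer not attached yet" (assumption:
# EAI_NONAME stands in for the ENOTFOUND code named in the row).
NOT_READY = {socket.EAI_AGAIN, socket.EAI_NONAME}


def wait_for_peer(host, port, attempts=5, delay=0.5,
                  resolve=socket.getaddrinfo, sleep=time.sleep):
    """Return True once `host` resolves, False if it never does.

    Errors in NOT_READY are retried; any other resolver error is re-raised.
    `resolve` and `sleep` are injectable so the loop can be exercised offline.
    """
    for attempt in range(attempts):
        try:
            resolve(host, port)
            return True
        except socket.gaierror as exc:
            if exc.errno not in NOT_READY:
                raise
            if attempt < attempts - 1:
                sleep(delay)
    return False
```

A retry loop like this only softens the symptom. If the peer is never attached to `awf-net`, the loop still ends in `False`, which is why the row lists the ordering change (Fix A) as the primary fix.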
3. Add B6 (new row in Category B - Self-hosted runners)
Append after the B5 row:
| B6 | `EACCES` in `upload-artifact` step after a `sudo: false` (`--network-isolation`) AWF run; firewall log/audit dirs present but unreadable | Sidecars write files as non-runner UIDs (squid → uid 13, cli-proxy → `cliproxy`, agent/iptables-init → root). AWF's `chmod -R a+rX` repair runs as the unprivileged runner and silently fails at `debug` level on files it doesn't own | Resolved in AWF: (a) run Node sidecars as runner UID via compose `user:`; (b) root perm-fixer container at cleanup (daemon-run, mounts log dir, chowns to runner UID, skipped when `--keep-containers`); (c) promote swallowed-`chmod` failure from `debug` to `warn` | `ls -la <firewall-logs-dir>` after the run and look for root or uid-13 owned files; check AWF logs for the swallowed `chmod` warning | #5545, #5542 |
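The `ls -la` check in the B6 row can also be scripted. The sketch below walks a log directory and reports entries whose owner differs from the runner UID. The helper name `find_foreign_owned` is hypothetical, and it assumes the script runs as the runner user on a POSIX host.

```python
import os


def find_foreign_owned(root, runner_uid=None):
    """Return sorted (path, owner_uid) pairs under `root` not owned by runner_uid.

    Directories that cannot be listed are skipped by os.walk, but they are
    still reported themselves, because their ownership is read from the parent.
    """
    if runner_uid is None:
        runner_uid = os.geteuid()
    foreign = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            owner = os.lstat(path).st_uid  # lstat: do not follow symlinks
            if owner != runner_uid:
                foreign.append((path, owner))
    return sorted(foreign)
```

An empty result means every entry is owned by the runner; entries owned by uid 0 or uid 13 match the B6 pattern described in the row.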
4. Add error-string quick-lookup entries
Add the following rows to the error-string table (after the existing mkdirat row):
| `getaddrinfo EAI_AGAIN <topology-peer>` with `awf-cli-proxy could not connect to the external DIFC proxy` | B5 |
| `EACCES` in `upload-artifact` after `sudo: false` (`--network-isolation`) AWF run | B6 |
| `chroot: failed to run command '/bin/sh'` on glibc daemon (not musl, as confirmed by `ldd --version`) | A13 |
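The three lookup rows above amount to a substring-to-ID mapping. As a rough illustration (the `RULES` table and `match_failure_modes` are hypothetical names, and plain substring matching is a simplification of how a human reads a log), they could be encoded like this:

```python
# Each rule: (failure-mode ID, substrings that must all appear in the log text).
RULES = [
    ("B5", ("getaddrinfo EAI_AGAIN",
            "awf-cli-proxy could not connect to the external DIFC proxy")),
    ("B6", ("EACCES", "upload-artifact")),
    # A13 additionally requires confirming a glibc daemon via `ldd --version`.
    ("A13", ("chroot: failed to run command '/bin/sh'",)),
]


def match_failure_modes(log_text):
    """Return the IDs of all rules whose substrings all occur in log_text."""
    return [mode for mode, needles in RULES
            if all(needle in log_text for needle in needles)]
```

A match is only a candidate: the A13 rule cannot tell a glibc daemon from a musl one, so the libc check from the table row still applies.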
5. Add A13 to "Known unresolved items"
- A13 / #5541: base-userland staging for ARC/DinD split-fs (`stageBaseSystem()` not yet implemented; security-preserving fix requires sourcing the base userland from the AWF-signed image)
Proposed doctor changes
File: .github/workflows/self-hosted-runner-doctor.md
This file imports shared/self-hosted-failure-modes.md and contains no duplicated catalog rows; it only needs playbook-level additions.
In §3 "Match symptom → failure mode", add to the hint list:
- `chroot: failed to run command '/bin/sh'` on a glibc daemon → A13 (empty staging, not A4 musl)
- `EAI_AGAIN <awmg-cli-proxy>` in network-isolation + topology-attach → B5
- `EACCES` in upload-artifact after sudo:false → B6
In §4 "Check for known unresolved problems", add:
A13 / #5541: ARC/DinD split-fs base-userland staging is not yet implemented; AWF cannot currently run end-to-end on a split-fs runner with an empty `/host`.
Proposed portable agent changes
File: .github/agents/self-hosted-runner-doctor.md
This file embeds the full catalog and must be updated in parallel with the shared file. It is currently also missing A12 (the mkdirat / read-only filesystem fix from #5481/#5482 that was added to the shared file but never synced here).
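A sync gap like the missing A12 row can be detected mechanically by comparing the catalog row IDs in the two files. The sketch below assumes catalog rows begin with a cell containing an ID such as `A12` or `B5`, as the rows proposed here do; the function names are hypothetical, and both files are passed in as already-read text.

```python
import re

# A catalog row starts with a table cell holding an ID such as A12 or B5.
# Error-string rows start with a backtick, so they do not match.
ROW_ID = re.compile(r"^\|\s*([A-Z]\d+)\s*\|", re.MULTILINE)


def catalog_ids(markdown):
    """Collect the set of failure-mode IDs found in catalog table rows."""
    return set(ROW_ID.findall(markdown))


def missing_from_portable(shared_md, portable_md):
    """IDs present in the shared catalog but absent from the portable agent."""
    return sorted(catalog_ids(shared_md) - catalog_ids(portable_md))
```

Run against the current contents of the shared file and the portable agent, a check of this kind would be expected to report `A12` until the sync described below is applied.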
1. Add missing A12 (sync gap)
After the A11 row, insert:
| A12 | `mkdirat ... : read-only file system` during agent chroot startup on ARC/DinD | `chroot.binariesSourcePath` set to the same root as `--docker-host-path-prefix` (e.g. both `/tmp/gh-aw`); Docker mounts `/tmp/gh-aw/usr:/host/usr:ro` first, then the attempt to mkdir `/host/usr/local/bin` as a nested overlay mount point fails because the parent is read-only | **Fixed in firewall v0.27.10**: upgrade AWF; the overlay is now mounted at `/host/tmp/awf-runner-bin:ro` (writable `/host/tmp` parent) instead of `/host/usr/local/bin:ro` | Check `awf --version`; inspect agent container logs for `mkdirat`; verify `chroot.binariesSourcePath` equals `docker-host-path-prefix` root | #5481, #5482 |
2. Add A13 (same text as shared file change above)
Insert after A12.
3. Add B5 and B6 (same text as shared file changes above)
Append after B4.
4. Add error-string quick-lookup entries
Add the mkdirat row that is in the shared file but missing from the portable agent, plus the three new rows:
| `mkdirat ... : read-only file system` during chroot agent startup | A12 |
| `getaddrinfo EAI_AGAIN <topology-peer>` with `awf-cli-proxy could not connect to the external DIFC proxy` | B5 |
| `EACCES` in `upload-artifact` after `sudo: false` (`--network-isolation`) AWF run | B6 |
| `chroot: failed to run command '/bin/sh'` on glibc daemon (not musl, as confirmed by `ldd --version`) | A13 |
5. Add A13 to "Known unresolved items"
- A13 / #5541: base-userland staging for ARC/DinD split-fs (`stageBaseSystem()` not yet implemented; security-preserving fix requires sourcing the base userland from the AWF-signed image)
6. Update §3 playbook hint (portable agent has same text as the workflow doctor)
Add the same three hint lines as the workflow doctor (A13, B5, B6).
7. Update §4 "Check for known unresolved problems"
Add the A13 entry as in the workflow doctor.
Source issues and PRs
| Citation | Title | State | Lesson |
|---|---|---|---|
| #5541 | [ARC-DinD] Chroot /host base userland not staged on split-fs runners | Open | A13: empty `/host` staging, unresolved |
| #5543 | network-isolation: topology-attach ordering deadlock starves the cli-proxy health gate | Closed (completed) | B5: `EAI_AGAIN` deadlock |
| #5545 | network-isolation: rootless firewall logs are unreadable → EACCES on artifact upload | Closed (completed) | B6: rootless EACCES |
| #5542 | network-isolation (sudo:false) rollout regressions (parent tracking issue) | Closed (completed) | Covers B5 + B6 |
Generated by Runner Doctor Updater · 55.5 AIC · 9.7K