
🩺 Runner Doctor Update: A13 (empty /host staging), B5 (topology-attach deadlock), B6 (rootless artifact EACCES) + portable-agent A… #5582

Description

@github-actions

Summary


Proposed knowledge-base changes

File: .github/workflows/shared/self-hosted-failure-modes.md

1 - Add A13 (new row in Category A - ARC / DinD)

Insert after the A12 row:

| A13 | `chroot: failed to run command '/bin/sh': No such file or directory` or `[entrypoint][ERROR] capsh not found on host system` on a **glibc/Debian daemon** (not musl/Alpine) | ARC/DinD split-fs: system-mount source dirs (`/tmp/gh-aw/{usr,bin,lib,...}`) are empty because nothing populates them; `stageBaseSystem()` does not yet exist. The entrypoint "musl/Alpine" warning is **misleading**: it fires because no dynamic loader is found, not because the daemon is musl. `dind.preStageDirs` only mkdir's empty work dirs; it does not stage a base userland. | **Unresolved**: `stageBaseSystem()` capability not yet implemented; base userland must originate from the AWF-signed image to preserve security invariants (never from runner/daemon-writable paths for pre-`capsh` execution). Workaround: bake required binaries (`/bin/sh`, `bash`, `capsh`, loader, coreutils) directly into the DinD daemon image | Confirm daemon libc: `ldd --version` inside the DinD container. Then check whether the staging dir is populated: `ls /tmp/gh-aw/usr/bin/sh` (or `/tmp/gh-aw/bin/sh`) on the runner side. If the daemon is glibc but the file is missing, this is A13, not A4. | #5541 |
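
The A13 diagnostic in the row above reduces to two observations: which libc the daemon uses, and whether a shell exists under the staging dir. A minimal Python sketch of that decision follows. It assumes the `/tmp/gh-aw` prefix and the two shell paths from the row; the helper names (`parse_libc`, `staged_shell_exists`, `classify_chroot_failure`) are hypothetical and not part of AWF.

```python
from pathlib import Path

# Candidate shell locations under the staging prefix, taken from the A13 row.
SHELL_CANDIDATES = ("usr/bin/sh", "bin/sh")


def parse_libc(ldd_version_output: str) -> str:
    """Guess the libc family from the text printed by `ldd --version`."""
    text = ldd_version_output.lower()
    if "musl" in text:
        return "musl"
    if "glibc" in text or "gnu libc" in text:
        return "glibc"
    return "unknown"


def staged_shell_exists(staging_root: str = "/tmp/gh-aw") -> bool:
    """True if any candidate shell path exists under the staging root."""
    root = Path(staging_root)
    return any((root / rel).exists() for rel in SHELL_CANDIDATES)


def classify_chroot_failure(libc: str, shell_staged: bool) -> str:
    """Map the two observations to a catalog ID (hypothetical helper)."""
    if libc == "glibc" and not shell_staged:
        return "A13"  # glibc daemon with an empty staging dir
    if libc == "musl":
        return "A4"  # the catalog's musl case
    return "unclassified"
```

Run `parse_libc` on output captured inside the DinD container and `staged_shell_exists` on the runner side, since the two checks look at different filesystems.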

2 - Add B5 (new row in Category B - Self-hosted runners)

Append after the B4 row:

| B5 | `getaddrinfo EAI_AGAIN <awmg-cli-proxy>` → `awf-cli-proxy could not connect to the external DIFC proxy` → `The agent was never invoked` in `--network-isolation` + `--topology-attach` runs | Startup ordering deadlock: `connectTopologyContainers()` runs only after `startContainers()` succeeds, but `startContainers()` blocks on the cli-proxy health gate that requires the topology peer to be reachable on `awf-net` (which is `internal: true`). The peer is never attached → EAI_AGAIN → fail-fast → deadlock. Deterministic, not flaky. | Resolved in AWF: attach topology peers to `awf-net` before the health-gated bring-up (Fix A: split `up -d`, network first → attach → remaining); also harden cli-proxy to treat `EAI_AGAIN`/`ENOTFOUND` as not-yet-ready (Fix B) | Confirm `topologyAttach` is non-empty; check the cli-proxy logs for `EAI_AGAIN`; verify AWF version includes the ordering fix | #5543, #5542 |
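
The deadlock and both fixes in the B5 row can be illustrated with a toy model of the bring-up order. This is a sketch only: the step names, `bring_up`, and `is_not_yet_ready` are hypothetical and do not mirror AWF's code. The single rule it encodes comes from the row: the cli-proxy health gate can pass only if the topology peer is already attached to `awf-net`.

```python
from typing import List

# Hypothetical step names for illustration; they are not AWF identifiers.
CREATE_NETWORK = "create-network"
ATTACH_PEERS = "attach-topology-peers"
START_GATED = "start-health-gated-services"


def bring_up(steps: List[str]) -> str:
    """Walk the steps in order and report whether startup succeeds."""
    network_exists = False
    peers_attached = False
    for step in steps:
        if step == CREATE_NETWORK:
            network_exists = True
        elif step == ATTACH_PEERS:
            if not network_exists:
                return "error: no network to attach to"
            peers_attached = True
        elif step == START_GATED:
            # The health gate needs the peer to be resolvable on awf-net.
            if not peers_attached:
                return "deadlock: EAI_AGAIN, peer never attached"
        else:
            return f"error: unknown step {step}"
    return "ok"


# Fix B from the row: resolver errors that mean "not ready yet", not "fatal".
NOT_YET_READY_CODES = frozenset({"EAI_AGAIN", "ENOTFOUND"})


def is_not_yet_ready(error_code: str) -> bool:
    """True if a health probe should retry instead of failing fast."""
    return error_code in NOT_YET_READY_CODES
```

In this model, `bring_up([CREATE_NETWORK, START_GATED, ATTACH_PEERS])` reproduces the pre-fix order and returns the deadlock message, while the Fix A order `bring_up([CREATE_NETWORK, ATTACH_PEERS, START_GATED])` returns `"ok"`.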

3 - Add B6 (new row in Category B - Self-hosted runners)

Append after the B5 row:

| B6 | `EACCES` in `upload-artifact` step after a `sudo: false` (`--network-isolation`) AWF run; firewall log/audit dirs present but unreadable | Sidecars write files as non-runner UIDs (squid → uid 13, cli-proxy → `cliproxy`, agent/iptables-init → root). AWF's `chmod -R a+rX` repair runs as the unprivileged runner and silently fails at `debug` level on files it doesn't own | Resolved in AWF: (a) run Node sidecars as runner UID via compose `user:`; (b) root perm-fixer container at cleanup (daemon-run, mounts log dir, chowns to runner UID, skipped when `--keep-containers`); (c) promote swallowed-`chmod` failure from `debug` to `warn` | `ls -la <firewall-logs-dir>` after the run and look for root-owned or uid-13-owned files; check AWF logs for the swallowed `chmod` warning | #5545, #5542 |
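
The `ls -la` check in the B6 row can also be scripted. The sketch below walks a log directory and lists every entry the current user cannot read, together with its owner UID, which is roughly what `upload-artifact` would trip over. The function name `find_unreadable` is hypothetical; the directory to scan is whatever `<firewall-logs-dir>` is on your runner.

```python
import os
from typing import List, Tuple


def find_unreadable(log_dir: str) -> List[Tuple[str, int]]:
    """Return (path, owner_uid) for entries the current user cannot read."""
    problems = []
    # os.walk skips directories it cannot list, but those directories still
    # appear in their parent's dirnames, so they are reported below.
    for dirpath, dirnames, filenames in os.walk(log_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                owner = os.lstat(path).st_uid
            except OSError:
                continue  # entry vanished between listing and stat
            if not os.access(path, os.R_OK):
                problems.append((path, owner))
    return problems
```

Run it as the runner user, not as root: `os.access` reports nearly everything as readable for root, which would hide the problem. Owner UIDs of 0 or 13 in the result match the sidecar ownership described in the row.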

4 - Add error-string quick-lookup entries

Add the following rows to the error-string table (after the existing mkdirat row):

| `getaddrinfo EAI_AGAIN <topology-peer>` with `awf-cli-proxy could not connect to the external DIFC proxy` | B5 |
| `EACCES` in `upload-artifact` after `sudo: false` (`--network-isolation`) AWF run | B6 |
| `chroot: failed to run command '/bin/sh'` on glibc daemon (not musl; confirmed by `ldd --version`) | A13 |
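
For illustration, the three new lookup rows can be expressed as a small matcher over raw log text. This is a hedged sketch, not how the doctor works: `LOOKUP` and `match_failure_modes` are hypothetical names, and plain substring matching is an assumption (real logs may wrap or prefix lines). A match is only a hint. A13 still needs the `ldd --version` confirmation, and B6 applies only to `sudo: false` runs.

```python
from typing import List, Tuple

# (required substrings, failure-mode ID), transcribed from the rows above.
LOOKUP: List[Tuple[Tuple[str, ...], str]] = [
    (
        (
            "getaddrinfo EAI_AGAIN",
            "awf-cli-proxy could not connect to the external DIFC proxy",
        ),
        "B5",
    ),
    (("EACCES", "upload-artifact"), "B6"),
    (("chroot: failed to run command '/bin/sh'",), "A13"),
]


def match_failure_modes(log_text: str) -> List[str]:
    """Return the IDs whose required substrings all occur in the log text."""
    ids = []
    for needles, mode in LOOKUP:
        if all(needle in log_text for needle in needles):
            ids.append(mode)
    return ids
```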

5 - Add A13 to "Known unresolved items"

- A13 / #5541: base-userland staging for ARC/DinD split-fs (`stageBaseSystem()` not yet implemented; security-preserving fix requires sourcing the base userland from the AWF-signed image)

Proposed doctor changes

File: .github/workflows/self-hosted-runner-doctor.md

This file imports shared/self-hosted-failure-modes.md and contains no duplicated catalog rows; it only needs playbook-level additions.

In §3 "Match symptom → failure mode", add to the hint list:

- `chroot: failed to run command '/bin/sh'` on a glibc daemon → A13 (empty staging, not A4 musl)
- `EAI_AGAIN <awmg-cli-proxy>` in network-isolation + topology-attach → B5
- `EACCES` in upload-artifact after sudo:false → B6

In §4 "Check for known unresolved problems", add:

A13 / #5541: ARC/DinD split-fs base-userland staging is not yet implemented; AWF cannot currently run end-to-end on a split-fs runner with an empty /host.

Proposed portable agent changes

File: .github/agents/self-hosted-runner-doctor.md

This file embeds the full catalog and must be updated in parallel with the shared file. It is currently also missing A12 (the mkdirat / read-only filesystem fix from #5481/#5482 that was added to the shared file but never synced here).

1 - Add missing A12 (sync gap)

After the A11 row, insert:

| A12 | `mkdirat ... : read-only file system` during agent chroot startup on ARC/DinD | `chroot.binariesSourcePath` set to the same root as `--docker-host-path-prefix` (e.g. both `/tmp/gh-aw`); Docker mounts `/tmp/gh-aw/usr:/host/usr:ro` first, then the attempt to mkdir `/host/usr/local/bin` as a nested overlay mount point fails because the parent is read-only | **Fixed in firewall v0.27.10**: upgrade AWF; the overlay is now mounted at `/host/tmp/awf-runner-bin:ro` (writable `/host/tmp` parent) instead of `/host/usr/local/bin:ro` | Check `awf --version`; inspect agent container logs for `mkdirat`; check whether `chroot.binariesSourcePath` equals the `--docker-host-path-prefix` root | #5481, #5482 |
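
The two A12 checks (path collision and version) are easy to get subtly wrong by comparing raw strings. The sketch below shows one way to do both. The function names are hypothetical. It assumes the version is a plain dotted number such as `0.27.10`, optionally prefixed with `v`; the actual output format of `awf --version` is not specified in the row.

```python
from pathlib import PurePosixPath
from typing import Tuple

# First firewall version with the A12 fix, per the row above.
FIXED_IN = (0, 27, 10)


def same_root(binaries_source_path: str, host_path_prefix: str) -> bool:
    """True if both settings name the same directory (lexical comparison)."""
    # PurePosixPath ignores trailing slashes and collapses repeated ones,
    # so "/tmp/gh-aw/" and "/tmp/gh-aw" compare equal.
    return PurePosixPath(binaries_source_path) == PurePosixPath(host_path_prefix)


def parse_version(text: str) -> Tuple[int, ...]:
    """Parse 'v0.27.10' or '0.27.10' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def has_a12_fix(version_text: str) -> bool:
    """True if the version is at or above the release that fixed A12."""
    return parse_version(version_text) >= FIXED_IN
```

Comparing tuples of integers matters here: as strings, `"0.27.9"` sorts after `"0.27.10"`, which would wrongly report an unfixed version as fixed.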

2 - Add A13 (same text as shared file change above)

Insert after A12.

3 - Add B5 and B6 (same text as shared file changes above)

Append after B4.

4 - Add error-string quick-lookup entries

Add the mkdirat row that is in the shared file but missing from the portable agent, plus the three new rows:

| `mkdirat ... : read-only file system` during chroot agent startup | A12 |
| `getaddrinfo EAI_AGAIN <topology-peer>` with `awf-cli-proxy could not connect to the external DIFC proxy` | B5 |
| `EACCES` in `upload-artifact` after `sudo: false` (`--network-isolation`) AWF run | B6 |
| `chroot: failed to run command '/bin/sh'` on glibc daemon (not musl; confirmed by `ldd --version`) | A13 |

5 - Add A13 to "Known unresolved items"

- A13 / #5541: base-userland staging for ARC/DinD split-fs (`stageBaseSystem()` not yet implemented; security-preserving fix requires sourcing the base userland from the AWF-signed image)

6 - Update §3 playbook hint (portable agent has same text as the workflow doctor)

Add the same three hint lines as the workflow doctor (A13, B5, B6).

7 - Update §4 "Check for known unresolved problems"

Add the A13 entry as in the workflow doctor.


Source issues and PRs

| Citation | Title | State | Lesson |
|---|---|---|---|
| #5541 | [ARC-DinD] Chroot /host base userland not staged on split-fs runners | Open | A13: empty /host staging, unresolved |
| #5543 | network-isolation: topology-attach ordering deadlock starves the cli-proxy health gate | Closed (completed) | B5: EAI_AGAIN deadlock |
| #5545 | network-isolation: rootless firewall logs are unreadable → EACCES on artifact upload | Closed (completed) | B6: rootless EACCES |
| #5542 | network-isolation (sudo:false) rollout regressions (parent tracking issue) | Closed (completed) | Covers B5 + B6 |

Generated by Runner Doctor Updater · 55.5 AIC · ⊞ 9.7K · ◷

  • expires on Jul 26, 2026, 5:08 PM UTC
