Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 5 additions & 6 deletions .github/agents/self-hosted-runner-doctor.md
Original file line number Diff line number Diff line change
Expand Up @@ -79,11 +79,11 @@ Prefer the narrowest match. Examples:
- `400 bad request: Authorization header is badly formatted` → C3
- `diagnosis=unknown` (proxy reachable, no connection error) or `reachable-but-api-error` from DIFC probe with `GITHUB_SERVER_URL=*.ghe.com` → C7 (DIFC proxy not enterprise-host-aware)

### 4. Check for known unresolved problems
### 4. Check for known gaps and notable fixes

If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, the remaining `GH_HOST` leak to user steps, or ARC/DinD base-userland staging), say so explicitly instead of implying there is a shipped fix.
If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, or the remaining `GH_HOST` leak to user steps), say so explicitly instead of implying there is a shipped fix.

A13 / #5541 — ARC/DinD split-fs base-userland staging is not yet implemented; AWF cannot currently run end-to-end on a split-fs runner with an empty /host.
A13 / github/gh-aw-firewall#5693, github/gh-aw-firewall#5696 — ARC/DinD split-fs base-userland staging is **fixed in AWF v0.27.15**: set `runner.topology: "arc-dind"` in the AWF config JSON. The `sysroot-stage` init container copies the signed `build-tools` image filesystem into a `sysroot` volume mounted at `/host:ro` before the agent starts.

C7 / #5615 — DIFC proxy enterprise-host awareness for `*.ghe.com` data-residency is not yet implemented in the companion projects; AWF ≥ v0.27.12 provides improved diagnostics (HTTP status + targeted hint) but the underlying cause remains unresolved.

Expand Down Expand Up @@ -144,7 +144,7 @@ Establish these facts before matching a failure mode:
| A10 | `Docker socket not found` plus `Invalid container ID format: arc-...` | MCP gateway assumed `/var/run/docker.sock`, group 0, and Docker-style container IDs | Propagate `DOCKER_HOST`, detect socket GID, relax pod-name handling | `stat -c '%g' ${DOCKER_HOST#unix://}`, `cat /proc/self/cgroup` | #2267, #2292, #2664, #2706, #2808 |
| A11 | Threat detection passes even though the engine binary is missing | `GH_AW_DETECTION_CONTINUE_ON_ERROR` suppressed a real setup failure | Reconsider default or log the skipped check explicitly | `printenv GH_AW_DETECTION_CONTINUE_ON_ERROR`; inspect agent logs for `ENOENT` | #4787 |
| A12 | `mkdirat ... : read-only file system` during agent chroot startup on ARC/DinD | `chroot.binariesSourcePath` set to the same root as `--docker-host-path-prefix` (e.g. both `/tmp/gh-aw`); Docker mounts `/tmp/gh-aw/usr:/host/usr:ro` first, then the attempt to mkdir `/host/usr/local/bin` as a nested overlay mount point fails because the parent is read-only | **Fixed in firewall v0.27.10**: upgrade AWF; the overlay is now mounted at `/host/tmp/awf-runner-bin:ro` (writable `/host/tmp` parent) instead of `/host/usr/local/bin:ro` | Check `awf --version`; inspect agent container logs for `mkdirat`; verify `chroot.binariesSourcePath` equals `docker-host-path-prefix` root | #5481, #5482 |
| A13 | `chroot: failed to run command '/bin/sh': No such file or directory` or `[entrypoint][ERROR] capsh not found on host system` on a **glibc/Debian daemon** (not musl/Alpine) | ARC/DinD split-fs: system-mount source dirs (`/tmp/gh-aw/{usr,bin,lib,...}`) are empty because nothing populates them; `stageBaseSystem()` does not yet exist. The entrypoint "musl/Alpine" warning is **misleading** — it fires because no dynamic loader is found, not because the daemon is musl. `dind.preStageDirs` only mkdirs empty work dirs; it does not stage a base userland. | **Unresolved** — `stageBaseSystem()` capability not yet implemented; base userland must originate from the AWF-signed image to preserve security invariants (never from runner/daemon-writable paths for pre-`capsh` execution). Workaround: bake required binaries (`/bin/sh`, `bash`, `capsh`, loader, coreutils) directly into the DinD daemon image | Confirm daemon libc: `ldd --version` inside the DinD container. Then check whether the staging dir is populated: `ls /tmp/gh-aw/usr/bin/sh` (or `/tmp/gh-aw/bin/sh`) on the runner side. If the daemon is glibc but the file is missing, this is A13, not A4. | #5541 |
| A13 | `chroot: failed to run command '/bin/sh': No such file or directory` or `[entrypoint][ERROR] capsh not found on host system` on a **glibc/Debian daemon** (not musl/Alpine) | ARC/DinD split-fs: system-mount source dirs (`/tmp/gh-aw/{usr,bin,lib,...}`) are empty because nothing populates them. The entrypoint "musl/Alpine" warning is **misleading** — it fires because no dynamic loader is found, not because the daemon is musl. | **Fixed in AWF v0.27.15**: set `runner.topology: "arc-dind"` in the AWF config JSON. AWF emits a `sysroot-stage` init container that copies the signed `build-tools` image filesystem (`bash`, `capsh`, `gcc`, dev libs, coreutils) into a named `sysroot` volume mounted at `/host:ro` before the agent starts. Use `runner.sysrootImage` to pin a specific image. | Check `awf --version` ≥ v0.27.15; verify `runner.topology: "arc-dind"` is set; inspect compose output for `sysroot-stage` service and `sysroot` volume | #5541, github/gh-aw-firewall#5693, github/gh-aw-firewall#5696 |

## Category B — Self-hosted runners

Expand All @@ -156,7 +156,7 @@ Establish these facts before matching a failure mode:
| B4 | `node: command not found` after `actions/setup-node` on self-hosted | Node was installed in `$HOME/work/_tool` and that toolcache is not visible | Mount / expose the runner toolcache; use `AWF_EXTRA_TOOLCACHE_DIRS` if needed | `which node`; inspect `$HOME/work/_tool/node` | #3544, #3545 |
| B5 | `getaddrinfo EAI_AGAIN <awmg-cli-proxy>` → `awf-cli-proxy could not connect to the external DIFC proxy` → `The agent was never invoked` in `--network-isolation` + `--topology-attach` runs | Startup ordering deadlock: `connectTopologyContainers()` runs only after `startContainers()` succeeds, but `startContainers()` blocks on the cli-proxy health gate that requires the topology peer to be reachable on `awf-net` (which `internal: true`). The peer is never attached → EAI_AGAIN → fail-fast → deadlock. Deterministic, not flaky. | Resolved in AWF: attach topology peers to `awf-net` before the health-gated bring-up (Fix A: split `up -d`, network first → attach → remaining); also harden cli-proxy to treat `EAI_AGAIN`/`ENOTFOUND` as not-yet-ready (Fix B) | Confirm `topologyAttach` is non-empty; check the cli-proxy logs for `EAI_AGAIN`; verify AWF version includes the ordering fix | #5543, #5542 |
| B6 | `EACCES` in `upload-artifact` step after a `sudo: false` (`--network-isolation`) AWF run; firewall log/audit dirs present but unreadable | Sidecars write files as non-runner UIDs (squid → uid 13, cli-proxy → `cliproxy`, agent/iptables-init → root). AWF's `chmod -R a+rX` repair runs as the unprivileged runner and silently fails at `debug` level on files it doesn't own | Resolved in AWF: (a) run Node sidecars as runner UID via compose `user:`; (b) root perm-fixer container at cleanup (daemon-run, mounts log dir, chowns to runner UID, skipped when `--keep-containers`); (c) promote swallowed-`chmod` failure from `debug` to `warn` | `ls -la <firewall-logs-dir>` after run — look for root or uid-13 owned files; check AWF logs for the swallowed `chmod` warning | #5545, #5542 |
| B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf-<ts>-chroot-home/<path>` (e.g. `.aws/config`, cloud credentials) | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Fixed in AWF v0.27.13**: upgrade AWF; `removeWorkDirectories` now catches `EACCES` and retries after spinning up a short-lived repair container (`CHOWN`/`DAC_OVERRIDE`/`FOWNER` capabilities) that chowns the files back to the host user | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13 | #5653 |
| B7 | AWF < v0.27.13: unhandled `EACCES` stack trace shows `unlink ... /tmp/awf-<ts>-chroot-home/<path>` (e.g. `.aws/config`, cloud credentials). AWF ≥ v0.27.13: `removeWorkDirectories()` catches the error and emits `[WARN] Failed to remove chroot home directory after permission repair` instead of crashing | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Partially fixed in AWF v0.27.13** (repair container with CHOWN/DAC_OVERRIDE/FOWNER capabilities); **further fix merged post-v0.27.15** (#5717): in rootless Docker the repair container's `chown` operates within the user namespace and may not change host-level ownership. The post-v0.27.15 fix adds `chmod -R a+rwX` so the host can delete the directory regardless of ownership. Non-fatal if unfixed — leaves an orphan `/tmp/awf-*-chroot-home` dir. | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13; check AWF logs for `[WARN] Failed to remove chroot home directory after permission repair` | #5653, github/gh-aw-firewall#5708, github/gh-aw-firewall#5717 |

## Category C — GHES / GHEC / `ghe.com`

Expand Down Expand Up @@ -210,5 +210,4 @@ Flag these explicitly instead of implying there is a complete fix:
- D3 / #1727 — lingering `--enable-dind` cleanup
- D4 / #4849 — enterprise header injection extension point
- C5 / #3937 — full `GH_HOST` leak fix still requires gh-aw changes
- A13 / #5541 — base-userland staging for ARC/DinD split-fs (`stageBaseSystem()` not yet implemented; security-preserving fix requires sourcing the base userland from the AWF-signed image)
- C7 / #5615 — DIFC proxy enterprise-host awareness for `*.ghe.com` data-residency (root cause unresolved; tracked in github/gh-aw-mcpg#8202 and github/gh-aw#41911)
6 changes: 3 additions & 3 deletions .github/workflows/self-hosted-runner-doctor.md
Original file line number Diff line number Diff line change
Expand Up @@ -107,11 +107,11 @@ Prefer the narrowest match. Examples:
- `400 bad request: Authorization header is badly formatted` → C3
- `diagnosis=unknown` (proxy reachable, no connection error) or `reachable-but-api-error` from DIFC probe with `GITHUB_SERVER_URL=*.ghe.com` → C7 (DIFC proxy not enterprise-host-aware)

### 4. Check for known unresolved problems
### 4. Check for known gaps and notable fixes

If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, the remaining `GH_HOST` leak to user steps, or ARC/DinD base-userland staging), say so explicitly instead of implying there is a shipped fix.
If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, or the remaining `GH_HOST` leak to user steps), say so explicitly instead of implying there is a shipped fix.

A13 / #5541 — ARC/DinD split-fs base-userland staging is not yet implemented; AWF cannot currently run end-to-end on a split-fs runner with an empty /host.
A13 / github/gh-aw-firewall#5693, github/gh-aw-firewall#5696 — ARC/DinD split-fs base-userland staging is **fixed in AWF v0.27.15**: set `runner.topology: "arc-dind"` in the AWF config JSON. The `sysroot-stage` init container copies the signed `build-tools` image filesystem into a `sysroot` volume mounted at `/host:ro` before the agent starts.

C7 / #5615 — DIFC proxy enterprise-host awareness for `*.ghe.com` data-residency is not yet implemented in the companion projects; AWF ≥ v0.27.12 provides improved diagnostics (HTTP status + targeted hint) but the underlying cause remains unresolved.

Expand Down
Loading
Loading