From b8a8eeb1f3dbdbce0368ad04217be0744639a10c Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 1 Jul 2026 18:32:56 +0000 Subject: [PATCH 1/3] Initial plan From 3707555a5d58ca8a64ee6716fe78411a468702db Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 1 Jul 2026 18:41:21 +0000 Subject: [PATCH 2/3] docs: update runner doctor A13 fixed v0.27.15, B7 partial fix note --- .github/agents/self-hosted-runner-doctor.md | 9 ++++----- .github/workflows/self-hosted-runner-doctor.md | 4 ++-- .github/workflows/shared/self-hosted-failure-modes.md | 5 ++--- 3 files changed, 8 insertions(+), 10 deletions(-) diff --git a/.github/agents/self-hosted-runner-doctor.md b/.github/agents/self-hosted-runner-doctor.md index 56f054fcd..9a59017b3 100644 --- a/.github/agents/self-hosted-runner-doctor.md +++ b/.github/agents/self-hosted-runner-doctor.md @@ -81,9 +81,9 @@ Prefer the narrowest match. Examples: ### 4. Check for known unresolved problems -If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, the remaining `GH_HOST` leak to user steps, or ARC/DinD base-userland staging), say so explicitly instead of implying there is a shipped fix. +If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, or the remaining `GH_HOST` leak to user steps), say so explicitly instead of implying there is a shipped fix. -A13 / #5541 — ARC/DinD split-fs base-userland staging is not yet implemented; AWF cannot currently run end-to-end on a split-fs runner with an empty /host. +A13 / github/gh-aw-firewall#5693, github/gh-aw-firewall#5696 — ARC/DinD split-fs base-userland staging is **fixed in AWF v0.27.15**: set `runner.topology: "arc-dind"` in the AWF config JSON. The `sysroot-stage` init container copies the signed `build-tools` image filesystem into a `sysroot` volume mounted at `/host:ro` before the agent starts. C7 / #5615 — DIFC proxy enterprise-host awareness for `*.ghe.com` data-residency is not yet implemented in the companion projects; AWF ≥ v0.27.12 provides improved diagnostics (HTTP status + targeted hint) but the underlying cause remains unresolved. @@ -144,7 +144,7 @@ Establish these facts before matching a failure mode: | A10 | `Docker socket not found` plus `Invalid container ID format: arc-...` | MCP gateway assumed `/var/run/docker.sock`, group 0, and Docker-style container IDs | Propagate `DOCKER_HOST`, detect socket GID, relax pod-name handling | `stat -c '%g' ${DOCKER_HOST#unix://}`, `cat /proc/self/cgroup` | #2267, #2292, #2664, #2706, #2808 | | A11 | Threat detection passes even though the engine binary is missing | `GH_AW_DETECTION_CONTINUE_ON_ERROR` suppressed a real setup failure | Reconsider default or log the skipped check explicitly | `printenv GH_AW_DETECTION_CONTINUE_ON_ERROR`; inspect agent logs for `ENOENT` | #4787 | | A12 | `mkdirat ... : read-only file system` during agent chroot startup on ARC/DinD | `chroot.binariesSourcePath` set to the same root as `--docker-host-path-prefix` (e.g. both `/tmp/gh-aw`); Docker mounts `/tmp/gh-aw/usr:/host/usr:ro` first, then the attempt to mkdir `/host/usr/local/bin` as a nested overlay mount point fails because the parent is read-only | **Fixed in firewall v0.27.10**: upgrade AWF; the overlay is now mounted at `/host/tmp/awf-runner-bin:ro` (writable `/host/tmp` parent) instead of `/host/usr/local/bin:ro` | Check `awf --version`; inspect agent container logs for `mkdirat`; verify `chroot.binariesSourcePath` equals `docker-host-path-prefix` root | #5481, #5482 | -| A13 | `chroot: failed to run command '/bin/sh': No such file or directory` or `[entrypoint][ERROR] capsh not found on host system` on a **glibc/Debian daemon** (not musl/Alpine) | ARC/DinD split-fs: system-mount source dirs (`/tmp/gh-aw/{usr,bin,lib,...}`) are empty because nothing populates them; `stageBaseSystem()` does not yet exist. The entrypoint "musl/Alpine" warning is **misleading** — it fires because no dynamic loader is found, not because the daemon is musl. `dind.preStageDirs` only mkdirs empty work dirs; it does not stage a base userland. | **Unresolved** — `stageBaseSystem()` capability not yet implemented; base userland must originate from the AWF-signed image to preserve security invariants (never from runner/daemon-writable paths for pre-`capsh` execution). Workaround: bake required binaries (`/bin/sh`, `bash`, `capsh`, loader, coreutils) directly into the DinD daemon image | Confirm daemon libc: `ldd --version` inside the DinD container. Then check whether the staging dir is populated: `ls /tmp/gh-aw/usr/bin/sh` (or `/tmp/gh-aw/bin/sh`) on the runner side. If the daemon is glibc but the file is missing, this is A13, not A4. | #5541 | +| A13 | `chroot: failed to run command '/bin/sh': No such file or directory` or `[entrypoint][ERROR] capsh not found on host system` on a **glibc/Debian daemon** (not musl/Alpine) | ARC/DinD split-fs: system-mount source dirs (`/tmp/gh-aw/{usr,bin,lib,...}`) are empty because nothing populates them. The entrypoint "musl/Alpine" warning is **misleading** — it fires because no dynamic loader is found, not because the daemon is musl. | **Fixed in AWF v0.27.15**: set `runner.topology: "arc-dind"` in the AWF config JSON. AWF emits a `sysroot-stage` init container that copies the signed `build-tools` image filesystem (`bash`, `capsh`, `gcc`, dev libs, coreutils) into a named `sysroot` volume mounted at `/host:ro` before the agent starts. Use `runner.sysrootImage` to pin a specific image. | Check `awf --version` ≥ v0.27.15; verify `runner.topology: "arc-dind"` is set; inspect compose output for `sysroot-stage` service and `sysroot` volume | #5541, github/gh-aw-firewall#5693, github/gh-aw-firewall#5696 | ## Category B — Self-hosted runners @@ -156,7 +156,7 @@ Establish these facts before matching a failure mode: | B4 | `node: command not found` after `actions/setup-node` on self-hosted | Node was installed in `$HOME/work/_tool` and that toolcache is not visible | Mount / expose the runner toolcache; use `AWF_EXTRA_TOOLCACHE_DIRS` if needed | `which node`; inspect `$HOME/work/_tool/node` | #3544, #3545 | | B5 | `getaddrinfo EAI_AGAIN ` → `awf-cli-proxy could not connect to the external DIFC proxy` → `The agent was never invoked` in `--network-isolation` + `--topology-attach` runs | Startup ordering deadlock: `connectTopologyContainers()` runs only after `startContainers()` succeeds, but `startContainers()` blocks on the cli-proxy health gate that requires the topology peer to be reachable on `awf-net` (which `internal: true`). The peer is never attached → EAI_AGAIN → fail-fast → deadlock. Deterministic, not flaky. | Resolved in AWF: attach topology peers to `awf-net` before the health-gated bring-up (Fix A: split `up -d`, network first → attach → remaining); also harden cli-proxy to treat `EAI_AGAIN`/`ENOTFOUND` as not-yet-ready (Fix B) | Confirm `topologyAttach` is non-empty; check the cli-proxy logs for `EAI_AGAIN`; verify AWF version includes the ordering fix | #5543, #5542 | | B6 | `EACCES` in `upload-artifact` step after a `sudo: false` (`--network-isolation`) AWF run; firewall log/audit dirs present but unreadable | Sidecars write files as non-runner UIDs (squid → uid 13, cli-proxy → `cliproxy`, agent/iptables-init → root). AWF's `chmod -R a+rX` repair runs as the unprivileged runner and silently fails at `debug` level on files it doesn't own | Resolved in AWF: (a) run Node sidecars as runner UID via compose `user:`; (b) root perm-fixer container at cleanup (daemon-run, mounts log dir, chowns to runner UID, skipped when `--keep-containers`); (c) promote swallowed-`chmod` failure from `debug` to `warn` | `ls -la ` after run — look for root or uid-13 owned files; check AWF logs for the swallowed `chmod` warning | #5545, #5542 | -| B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf--chroot-home/` (e.g. `.aws/config`, cloud credentials) | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Fixed in AWF v0.27.13**: upgrade AWF; `removeWorkDirectories` now catches `EACCES` and retries after spinning up a short-lived repair container (`CHOWN`/`DAC_OVERRIDE`/`FOWNER` capabilities) that chowns the files back to the host user | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13 | #5653 | +| B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf--chroot-home/` (e.g. `.aws/config`, cloud credentials); or `[WARN] Failed to remove chroot home directory after permission repair` after a seemingly successful repair container run | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Partially fixed in AWF v0.27.13** (repair container with CHOWN/DAC_OVERRIDE/FOWNER capabilities); **further fix merged post-v0.27.15** (#5717): in rootless Docker the repair container's `chown` operates within the user namespace and may not change host-level ownership. The post-v0.27.15 fix adds `chmod -R a+rwX` so the host can delete the directory regardless of ownership. Non-fatal if unfixed — leaves an orphan `/tmp/awf-*-chroot-home` dir. | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13; check AWF logs for `[WARN] Failed to remove chroot home directory after permission repair` | #5653, github/gh-aw-firewall#5708, github/gh-aw-firewall#5717 | ## Category C — GHES / GHEC / `ghe.com` @@ -210,5 +210,4 @@ Flag these explicitly instead of implying there is a complete fix: - D3 / #1727 — lingering `--enable-dind` cleanup - D4 / #4849 — enterprise header injection extension point - C5 / #3937 — full `GH_HOST` leak fix still requires gh-aw changes -- A13 / #5541 — base-userland staging for ARC/DinD split-fs (`stageBaseSystem()` not yet implemented; security-preserving fix requires sourcing the base userland from the AWF-signed image) - C7 / #5615 — DIFC proxy enterprise-host awareness for `*.ghe.com` data-residency (root cause unresolved; tracked in github/gh-aw-mcpg#8202 and github/gh-aw#41911) diff --git a/.github/workflows/self-hosted-runner-doctor.md b/.github/workflows/self-hosted-runner-doctor.md index ccb133edb..72baa6c56 100644 --- a/.github/workflows/self-hosted-runner-doctor.md +++ b/.github/workflows/self-hosted-runner-doctor.md @@ -109,9 +109,9 @@ Prefer the narrowest match. Examples: ### 4. Check for known unresolved problems -If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, the remaining `GH_HOST` leak to user steps, or ARC/DinD base-userland staging), say so explicitly instead of implying there is a shipped fix. +If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, or the remaining `GH_HOST` leak to user steps), say so explicitly instead of implying there is a shipped fix. -A13 / #5541 — ARC/DinD split-fs base-userland staging is not yet implemented; AWF cannot currently run end-to-end on a split-fs runner with an empty /host. +A13 / github/gh-aw-firewall#5693, github/gh-aw-firewall#5696 — ARC/DinD split-fs base-userland staging is **fixed in AWF v0.27.15**: set `runner.topology: "arc-dind"` in the AWF config JSON. The `sysroot-stage` init container copies the signed `build-tools` image filesystem into a `sysroot` volume mounted at `/host:ro` before the agent starts. C7 / #5615 — DIFC proxy enterprise-host awareness for `*.ghe.com` data-residency is not yet implemented in the companion projects; AWF ≥ v0.27.12 provides improved diagnostics (HTTP status + targeted hint) but the underlying cause remains unresolved. diff --git a/.github/workflows/shared/self-hosted-failure-modes.md b/.github/workflows/shared/self-hosted-failure-modes.md index 320aea8aa..a659dde70 100644 --- a/.github/workflows/shared/self-hosted-failure-modes.md +++ b/.github/workflows/shared/self-hosted-failure-modes.md @@ -29,7 +29,7 @@ Establish these facts before matching a failure mode: | A10 | `Docker socket not found` plus `Invalid container ID format: arc-...` | MCP gateway assumed `/var/run/docker.sock`, group 0, and Docker-style container IDs | Propagate `DOCKER_HOST`, detect socket GID, relax pod-name handling | `stat -c '%g' ${DOCKER_HOST#unix://}`, `cat /proc/self/cgroup` | #2267, #2292, #2664, #2706, #2808 | | A11 | Threat detection passes even though the engine binary is missing | `GH_AW_DETECTION_CONTINUE_ON_ERROR` suppressed a real setup failure | Reconsider default or log the skipped check explicitly | `printenv GH_AW_DETECTION_CONTINUE_ON_ERROR`; inspect agent logs for `ENOENT` | #4787 | | A12 | `mkdirat ... : read-only file system` during agent chroot startup on ARC/DinD | `chroot.binariesSourcePath` set to the same root as `--docker-host-path-prefix` (e.g. both `/tmp/gh-aw`); Docker mounts `/tmp/gh-aw/usr:/host/usr:ro` first, then the attempt to mkdir `/host/usr/local/bin` as a nested overlay mount point fails because the parent is read-only | **Fixed in firewall v0.27.10**: upgrade AWF; the overlay is now mounted at `/host/tmp/awf-runner-bin:ro` (writable `/host/tmp` parent) instead of `/host/usr/local/bin:ro` | Check `awf --version`; inspect agent container logs for `mkdirat`; verify `chroot.binariesSourcePath` equals `docker-host-path-prefix` root | #5481, #5482 | -| A13 | `chroot: failed to run command '/bin/sh': No such file or directory` or `[entrypoint][ERROR] capsh not found on host system` on a **glibc/Debian daemon** (not musl/Alpine) | ARC/DinD split-fs: system-mount source dirs (`/tmp/gh-aw/{usr,bin,lib,...}`) are empty because nothing populates them; `stageBaseSystem()` does not yet exist. The entrypoint "musl/Alpine" warning is **misleading** — it fires because no dynamic loader is found, not because the daemon is musl. `dind.preStageDirs` only mkdirs empty work dirs; it does not stage a base userland. | **Unresolved** — `stageBaseSystem()` capability not yet implemented; base userland must originate from the AWF-signed image to preserve security invariants (never from runner/daemon-writable paths for pre-`capsh` execution). Workaround: bake required binaries (`/bin/sh`, `bash`, `capsh`, loader, coreutils) directly into the DinD daemon image | Confirm daemon libc: `ldd --version` inside the DinD container. Then check whether the staging dir is populated: `ls /tmp/gh-aw/usr/bin/sh` (or `/tmp/gh-aw/bin/sh`) on the runner side. If the daemon is glibc but the file is missing, this is A13, not A4. | #5541 | +| A13 | `chroot: failed to run command '/bin/sh': No such file or directory` or `[entrypoint][ERROR] capsh not found on host system` on a **glibc/Debian daemon** (not musl/Alpine) | ARC/DinD split-fs: system-mount source dirs (`/tmp/gh-aw/{usr,bin,lib,...}`) are empty because nothing populates them. The entrypoint "musl/Alpine" warning is **misleading** — it fires because no dynamic loader is found, not because the daemon is musl. | **Fixed in AWF v0.27.15**: set `runner.topology: "arc-dind"` in the AWF config JSON. AWF emits a `sysroot-stage` init container that copies the signed `build-tools` image filesystem (`bash`, `capsh`, `gcc`, dev libs, coreutils) into a named `sysroot` volume mounted at `/host:ro` before the agent starts. Use `runner.sysrootImage` to pin a specific image. | Check `awf --version` ≥ v0.27.15; verify `runner.topology: "arc-dind"` is set; inspect compose output for `sysroot-stage` service and `sysroot` volume | #5541, github/gh-aw-firewall#5693, github/gh-aw-firewall#5696 | ## Category B — Self-hosted runners @@ -41,7 +41,7 @@ Establish these facts before matching a failure mode: | B4 | `node: command not found` after `actions/setup-node` on self-hosted | Node was installed in `$HOME/work/_tool` and that toolcache is not visible | Mount / expose the runner toolcache; use `AWF_EXTRA_TOOLCACHE_DIRS` if needed | `which node`; inspect `$HOME/work/_tool/node` | #3544, #3545 | | B5 | `getaddrinfo EAI_AGAIN ` → `awf-cli-proxy could not connect to the external DIFC proxy` → `The agent was never invoked` in `--network-isolation` + `--topology-attach` runs | Startup ordering deadlock: `connectTopologyContainers()` runs only after `startContainers()` succeeds, but `startContainers()` blocks on the cli-proxy health gate that requires the topology peer to be reachable on `awf-net` (which `internal: true`). The peer is never attached → EAI_AGAIN → fail-fast → deadlock. Deterministic, not flaky. | Resolved in AWF: attach topology peers to `awf-net` before the health-gated bring-up (Fix A: split `up -d`, network first → attach → remaining); also harden cli-proxy to treat `EAI_AGAIN`/`ENOTFOUND` as not-yet-ready (Fix B) | Confirm `topologyAttach` is non-empty; check the cli-proxy logs for `EAI_AGAIN`; verify AWF version includes the ordering fix | #5543, #5542 | | B6 | `EACCES` in `upload-artifact` step after a `sudo: false` (`--network-isolation`) AWF run; firewall log/audit dirs present but unreadable | Sidecars write files as non-runner UIDs (squid → uid 13, cli-proxy → `cliproxy`, agent/iptables-init → root). AWF's `chmod -R a+rX` repair runs as the unprivileged runner and silently fails at `debug` level on files it doesn't own | Resolved in AWF: (a) run Node sidecars as runner UID via compose `user:`; (b) root perm-fixer container at cleanup (daemon-run, mounts log dir, chowns to runner UID, skipped when `--keep-containers`); (c) promote swallowed-`chmod` failure from `debug` to `warn` | `ls -la ` after run — look for root or uid-13 owned files; check AWF logs for the swallowed `chmod` warning | #5545, #5542 | -| B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf--chroot-home/` (e.g. `.aws/config`, cloud credentials) | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Fixed in AWF v0.27.13**: upgrade AWF; `removeWorkDirectories` now catches `EACCES` and retries after spinning up a short-lived repair container (`CHOWN`/`DAC_OVERRIDE`/`FOWNER` capabilities) that chowns the files back to the host user | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13 | #5653 | +| B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf--chroot-home/` (e.g. `.aws/config`, cloud credentials); or `[WARN] Failed to remove chroot home directory after permission repair` after a seemingly successful repair container run | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Partially fixed in AWF v0.27.13** (repair container with CHOWN/DAC_OVERRIDE/FOWNER capabilities); **further fix merged post-v0.27.15** (#5717): in rootless Docker the repair container's `chown` operates within the user namespace and may not change host-level ownership. The post-v0.27.15 fix adds `chmod -R a+rwX` so the host can delete the directory regardless of ownership. Non-fatal if unfixed — leaves an orphan `/tmp/awf-*-chroot-home` dir. | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13; check AWF logs for `[WARN] Failed to remove chroot home directory after permission repair` | #5653, github/gh-aw-firewall#5708, github/gh-aw-firewall#5717 | ## Category C — GHES / GHEC / `ghe.com` @@ -95,5 +95,4 @@ Flag these explicitly instead of implying there is a complete fix: - D3 / #1727 — lingering `--enable-dind` cleanup - D4 / #4849 — enterprise header injection extension point - C5 / #3937 — full `GH_HOST` leak fix still requires gh-aw changes -- A13 / #5541 — base-userland staging for ARC/DinD split-fs (`stageBaseSystem()` not yet implemented; security-preserving fix requires sourcing the base userland from the AWF-signed image) - C7 / #5615 — DIFC proxy enterprise-host awareness for `*.ghe.com` data-residency (root cause unresolved; tracked in github/gh-aw-mcpg#8202 and github/gh-aw#41911) From 655842091449db87da15d8125a20e37458d9539b Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Wed, 1 Jul 2026 19:24:21 +0000 Subject: [PATCH 3/3] docs: fix B7 signal wording and rename known-unresolved heading --- .github/agents/self-hosted-runner-doctor.md | 4 ++-- .github/workflows/self-hosted-runner-doctor.md | 2 +- .github/workflows/shared/self-hosted-failure-modes.md | 2 +- 3 files changed, 4 insertions(+), 4 deletions(-) diff --git a/.github/agents/self-hosted-runner-doctor.md b/.github/agents/self-hosted-runner-doctor.md index 9a59017b3..a13402006 100644 --- a/.github/agents/self-hosted-runner-doctor.md +++ b/.github/agents/self-hosted-runner-doctor.md @@ -79,7 +79,7 @@ Prefer the narrowest match. Examples: - `400 bad request: Authorization header is badly formatted` → C3 - `diagnosis=unknown` (proxy reachable, no connection error) or `reachable-but-api-error` from DIFC probe with `GITHUB_SERVER_URL=*.ghe.com` → C7 (DIFC proxy not enterprise-host-aware) -### 4. Check for known unresolved problems +### 4. Check for known gaps and notable fixes If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, or the remaining `GH_HOST` leak to user steps), say so explicitly instead of implying there is a shipped fix. @@ -156,7 +156,7 @@ Establish these facts before matching a failure mode: | B4 | `node: command not found` after `actions/setup-node` on self-hosted | Node was installed in `$HOME/work/_tool` and that toolcache is not visible | Mount / expose the runner toolcache; use `AWF_EXTRA_TOOLCACHE_DIRS` if needed | `which node`; inspect `$HOME/work/_tool/node` | #3544, #3545 | | B5 | `getaddrinfo EAI_AGAIN ` → `awf-cli-proxy could not connect to the external DIFC proxy` → `The agent was never invoked` in `--network-isolation` + `--topology-attach` runs | Startup ordering deadlock: `connectTopologyContainers()` runs only after `startContainers()` succeeds, but `startContainers()` blocks on the cli-proxy health gate that requires the topology peer to be reachable on `awf-net` (which `internal: true`). The peer is never attached → EAI_AGAIN → fail-fast → deadlock. Deterministic, not flaky. | Resolved in AWF: attach topology peers to `awf-net` before the health-gated bring-up (Fix A: split `up -d`, network first → attach → remaining); also harden cli-proxy to treat `EAI_AGAIN`/`ENOTFOUND` as not-yet-ready (Fix B) | Confirm `topologyAttach` is non-empty; check the cli-proxy logs for `EAI_AGAIN`; verify AWF version includes the ordering fix | #5543, #5542 | | B6 | `EACCES` in `upload-artifact` step after a `sudo: false` (`--network-isolation`) AWF run; firewall log/audit dirs present but unreadable | Sidecars write files as non-runner UIDs (squid → uid 13, cli-proxy → `cliproxy`, agent/iptables-init → root). AWF's `chmod -R a+rX` repair runs as the unprivileged runner and silently fails at `debug` level on files it doesn't own | Resolved in AWF: (a) run Node sidecars as runner UID via compose `user:`; (b) root perm-fixer container at cleanup (daemon-run, mounts log dir, chowns to runner UID, skipped when `--keep-containers`); (c) promote swallowed-`chmod` failure from `debug` to `warn` | `ls -la ` after run — look for root or uid-13 owned files; check AWF logs for the swallowed `chmod` warning | #5545, #5542 | -| B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf--chroot-home/` (e.g. `.aws/config`, cloud credentials); or `[WARN] Failed to remove chroot home directory after permission repair` after a seemingly successful repair container run | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Partially fixed in AWF v0.27.13** (repair container with CHOWN/DAC_OVERRIDE/FOWNER capabilities); **further fix merged post-v0.27.15** (#5717): in rootless Docker the repair container's `chown` operates within the user namespace and may not change host-level ownership. The post-v0.27.15 fix adds `chmod -R a+rwX` so the host can delete the directory regardless of ownership. Non-fatal if unfixed — leaves an orphan `/tmp/awf-*-chroot-home` dir. | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13; check AWF logs for `[WARN] Failed to remove chroot home directory after permission repair` | #5653, github/gh-aw-firewall#5708, github/gh-aw-firewall#5717 | +| B7 | AWF < v0.27.13: unhandled `EACCES` stack trace shows `unlink ... /tmp/awf--chroot-home/` (e.g. `.aws/config`, cloud credentials). AWF ≥ v0.27.13: `removeWorkDirectories()` catches the error and emits `[WARN] Failed to remove chroot home directory after permission repair` instead of crashing | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Partially fixed in AWF v0.27.13** (repair container with CHOWN/DAC_OVERRIDE/FOWNER capabilities); **further fix merged post-v0.27.15** (#5717): in rootless Docker the repair container's `chown` operates within the user namespace and may not change host-level ownership. The post-v0.27.15 fix adds `chmod -R a+rwX` so the host can delete the directory regardless of ownership. Non-fatal if unfixed — leaves an orphan `/tmp/awf-*-chroot-home` dir. | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13; check AWF logs for `[WARN] Failed to remove chroot home directory after permission repair` | #5653, github/gh-aw-firewall#5708, github/gh-aw-firewall#5717 | ## Category C — GHES / GHEC / `ghe.com` diff --git a/.github/workflows/self-hosted-runner-doctor.md b/.github/workflows/self-hosted-runner-doctor.md index 72baa6c56..13d7b8e37 100644 --- a/.github/workflows/self-hosted-runner-doctor.md +++ b/.github/workflows/self-hosted-runner-doctor.md @@ -107,7 +107,7 @@ Prefer the narrowest match. Examples: - `400 bad request: Authorization header is badly formatted` → C3 - `diagnosis=unknown` (proxy reachable, no connection error) or `reachable-but-api-error` from DIFC probe with `GITHUB_SERVER_URL=*.ghe.com` → C7 (DIFC proxy not enterprise-host-aware) -### 4. Check for known unresolved problems +### 4. Check for known gaps and notable fixes If the best match is one of the known open gaps (gVisor/Kata runtime support, `--enable-dind` cleanup, enterprise header-injection extension points, or the remaining `GH_HOST` leak to user steps), say so explicitly instead of implying there is a shipped fix. diff --git a/.github/workflows/shared/self-hosted-failure-modes.md b/.github/workflows/shared/self-hosted-failure-modes.md index a659dde70..abaa895bb 100644 --- a/.github/workflows/shared/self-hosted-failure-modes.md +++ b/.github/workflows/shared/self-hosted-failure-modes.md @@ -41,7 +41,7 @@ Establish these facts before matching a failure mode: | B4 | `node: command not found` after `actions/setup-node` on self-hosted | Node was installed in `$HOME/work/_tool` and that toolcache is not visible | Mount / expose the runner toolcache; use `AWF_EXTRA_TOOLCACHE_DIRS` if needed | `which node`; inspect `$HOME/work/_tool/node` | #3544, #3545 | | B5 | `getaddrinfo EAI_AGAIN ` → `awf-cli-proxy could not connect to the external DIFC proxy` → `The agent was never invoked` in `--network-isolation` + `--topology-attach` runs | Startup ordering deadlock: `connectTopologyContainers()` runs only after `startContainers()` succeeds, but `startContainers()` blocks on the cli-proxy health gate that requires the topology peer to be reachable on `awf-net` (which `internal: true`). The peer is never attached → EAI_AGAIN → fail-fast → deadlock. Deterministic, not flaky. | Resolved in AWF: attach topology peers to `awf-net` before the health-gated bring-up (Fix A: split `up -d`, network first → attach → remaining); also harden cli-proxy to treat `EAI_AGAIN`/`ENOTFOUND` as not-yet-ready (Fix B) | Confirm `topologyAttach` is non-empty; check the cli-proxy logs for `EAI_AGAIN`; verify AWF version includes the ordering fix | #5543, #5542 | | B6 | `EACCES` in `upload-artifact` step after a `sudo: false` (`--network-isolation`) AWF run; firewall log/audit dirs present but unreadable | Sidecars write files as non-runner UIDs (squid → uid 13, cli-proxy → `cliproxy`, agent/iptables-init → root). AWF's `chmod -R a+rX` repair runs as the unprivileged runner and silently fails at `debug` level on files it doesn't own | Resolved in AWF: (a) run Node sidecars as runner UID via compose `user:`; (b) root perm-fixer container at cleanup (daemon-run, mounts log dir, chowns to runner UID, skipped when `--keep-containers`); (c) promote swallowed-`chmod` failure from `debug` to `warn` | `ls -la ` after run — look for root or uid-13 owned files; check AWF logs for the swallowed `chmod` warning | #5545, #5542 | -| B7 | AWF exits with unhandled `EACCES` error during cleanup; stack trace shows `unlink ... /tmp/awf--chroot-home/` (e.g. `.aws/config`, cloud credentials); or `[WARN] Failed to remove chroot home directory after permission repair` after a seemingly successful repair container run | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Partially fixed in AWF v0.27.13** (repair container with CHOWN/DAC_OVERRIDE/FOWNER capabilities); **further fix merged post-v0.27.15** (#5717): in rootless Docker the repair container's `chown` operates within the user namespace and may not change host-level ownership. The post-v0.27.15 fix adds `chmod -R a+rwX` so the host can delete the directory regardless of ownership. Non-fatal if unfixed — leaves an orphan `/tmp/awf-*-chroot-home` dir. | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13; check AWF logs for `[WARN] Failed to remove chroot home directory after permission repair` | #5653, github/gh-aw-firewall#5708, github/gh-aw-firewall#5717 | +| B7 | AWF < v0.27.13: unhandled `EACCES` stack trace shows `unlink ... /tmp/awf--chroot-home/` (e.g. `.aws/config`, cloud credentials). AWF ≥ v0.27.13: `removeWorkDirectories()` catches the error and emits `[WARN] Failed to remove chroot home directory after permission repair` instead of crashing | In rootless Docker mode the agent container runs with UID namespace remapping. Files created by the agent inside the `chroot-home` temp directory are owned by remapped UIDs. AWF's `removeWorkDirectories()` runs as the unprivileged host runner and `fs.rmSync` fails on these files. | **Partially fixed in AWF v0.27.13** (repair container with CHOWN/DAC_OVERRIDE/FOWNER capabilities); **further fix merged post-v0.27.15** (#5717): in rootless Docker the repair container's `chown` operates within the user namespace and may not change host-level ownership. The post-v0.27.15 fix adds `chmod -R a+rwX` so the host can delete the directory regardless of ownership. Non-fatal if unfixed — leaves an orphan `/tmp/awf-*-chroot-home` dir. | `ls -la /tmp/awf-*-chroot-home/` after a rootless run — files owned by non-runner UIDs confirm the mode; upgrade to AWF ≥ v0.27.13; check AWF logs for `[WARN] Failed to remove chroot home directory after permission repair` | #5653, github/gh-aw-firewall#5708, github/gh-aw-firewall#5717 | ## Category C — GHES / GHEC / `ghe.com`