network-isolation: rootless firewall logs are unreadable by the runner → EACCES on artifact upload (fixable in AWF)

## Summary

In `--network-isolation` mode (`sandbox.agent.sudo: false`), the firewall's log and audit artifacts are written into host-mounted directories by container processes running under UIDs other than the runner's, so the **unprivileged runner cannot read them**. AWF's existing permission-repair step fails silently when AWF is rootless, so the runner hits `EACCES` when `upload-artifact` zips the firewall logs. The agent run itself may succeed, but the artifact upload fails.

This is **fully addressable within gh-aw-firewall**, and the fix works identically on standard runners and on ARC/DinD. It is the rootless-permission sibling of the topology-attach deadlock (#5543); the broader `sudo: false` rollout regression set is #5542, and the ARC/chroot track is #5541.

### Evidence

gh-aw [PR github/gh-aw#41426](https://github.com/github/gh-aw/pull/41426) reverted `glossary-maintainer` from `sudo: false` to `sudo: true`. Part of the motivation: under `sudo: false` the previous `sudo chmod -R a+rX /tmp/gh-aw/sandbox/firewall` workaround step is gone, but the firewall containers still write files the runner can't read, so `upload-artifact` fails with `EACCES`.

---

## Root cause — the rootless permission asymmetry

AWF already attempts a repair: `preserveCleanupArtifacts()` runs `chmod -R a+rX` on every log/audit dir at cleanup (`src/artifact-preservation.ts:59,74,160,172,189,201`). But that `chmod` is **host-side**, executed by the AWF process:

- When AWF runs under `sudo` (`sudo: true`), the process is root → the `chmod` succeeds (and gh-aw additionally ran its own `sudo chmod`).
- When AWF runs **rootless** (`sudo: false`), the process is the **runner UID** and gets `EPERM` on files it doesn't own. The failure is caught and logged at **`debug` only** (`src/artifact-preservation.ts:61-63`), so it is silent.

The files were written by container processes with UIDs that don't match the runner:

- `cli-proxy` → `USER cliproxy` (non-root, **no in-container root at all** — `containers/cli-proxy/Dockerfile:44`)
- `api-proxy` → `USER apiproxy` (non-root — `containers/api-proxy/Dockerfile:43`)
- `squid` → `USER proxy` / uid 13 (`containers/squid/Dockerfile:79`)
- agent / iptables-init → start as **root inside the container**; on a stock daemon (no userns-remap) those files land **root-owned** on the host

The log dirs are pre-created `0777` so files *can* be created (`src/workdir-setup.ts:151-198`), but the runner can neither `chmod` nor reliably read what another UID wrote. Result: `EACCES` at artifact-upload time.
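The asymmetry can be made concrete with the classic Unix permission check. The sketch below is illustrative only: `FileMeta`, `Caller`, `canRead`, and `canChmod` are hypothetical names, and the model ignores ACLs and capabilities. It shows why a runner that is neither owner nor root can be locked out of a file and also unable to relax its mode.

```typescript
// Hypothetical model of the kernel's owner/group/other check (no ACLs, no capabilities).
interface FileMeta { uid: number; gid: number; mode: number }
interface Caller { uid: number; gids: number[] }

// Read access: root always passes; otherwise exactly one permission class applies.
function canRead(file: FileMeta, who: Caller): boolean {
  if (who.uid === 0) return true;
  if (who.uid === file.uid) return (file.mode & 0o400) !== 0;
  if (who.gids.includes(file.gid)) return (file.mode & 0o040) !== 0;
  return (file.mode & 0o004) !== 0;
}

// chmod is allowed only for the file's owner or root, whatever the mode bits say.
function canChmod(file: FileMeta, who: Caller): boolean {
  return who.uid === 0 || who.uid === file.uid;
}
```

For a file owned by uid 13 with mode `0640`, a runner with uid 1001 fails both checks, which matches the `EPERM` on the host-side `chmod` and the later `EACCES` on upload.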

**Invariant for any fix:** the artifacts must become runner-readable **without host root** (there is no host `sudo` in rootless mode, and none at all in an ARC pod) — either created readable, or relaxed by a privileged actor that *is* available, namely a container under the Docker daemon.

---

## Proposed solution (four parts)

A layered fix: **1a** removes the mismatch at the source for the common case, **2b** is the universal backstop, **1b** is cheap hardening, and the **swallowed-chmod** change restores observability.

### 1a — Run the Node sidecars as the runner's UID:GID (primary)

Add compose `user: "${uid}:${gid}"` to the `cli-proxy` and `api-proxy` services, using the host uid/gid AWF already resolves for the agent (`getSafeHostUid`/`getSafeHostGid`). Every file these sidecars write is then **runner-owned → trivially readable, no chmod needed**.

- Low risk: both are simple Node servers writing to a `0777` bind mount; neither needs in-container root.
- **Do not** blanket-apply this to `squid` — it expects uid 13 to own `/var/spool/squid`; running it as an arbitrary uid can break its cache/db. squid is covered by 2b instead.
- Files: `src/services/cli-proxy-service.ts`, `src/services/api-proxy-service-config.ts` (add `user:`), reuse the existing host-uid resolution.
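A minimal sketch of what adding `user:` could look like. The `ComposeService` shape and the `withRunnerUser` helper are assumptions made for illustration, not AWF's actual service types; the uid/gid are expected to come from the existing `getSafeHostUid`/`getSafeHostGid` resolution.

```typescript
// Hypothetical compose service shape; the real AWF type may differ.
interface ComposeService { image: string; user?: string; [key: string]: unknown }

// Returns a copy of the service that runs as the given numeric uid:gid.
function withRunnerUser<T extends ComposeService>(service: T, uid: number, gid: number): T {
  if (!Number.isInteger(uid) || !Number.isInteger(gid) || uid < 0 || gid < 0) {
    throw new Error(`invalid uid:gid ${uid}:${gid}`);
  }
  return { ...service, user: `${uid}:${gid}` };
}
```

The helper would be applied only to the `cli-proxy` and `api-proxy` service definitions, leaving `squid` on its image-defined user.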

### 2b — Root "perm-fixer" container at cleanup (universal backstop)

When AWF is rootless (`process.getuid() !== 0`), run a short-lived root container under the daemon that chowns/chmods AWF's log+audit dirs to the runner. This covers **everything 1a doesn't**: squid uid-13 files and any root-owned agent/iptables-init output.

Why it works where the host-side `chmod` fails: on a stock (non-userns-remapped) daemon, **container uid 0 == host uid 0** over the bind-mounted files, so the container holds `CHOWN`/`DAC_OVERRIDE`/`FOWNER` regardless of which service UID wrote each file.

Placement: step 3 of the cleanup sequence (`src/commands/main-action.ts:154-165`), `cleanup() → preserveCleanupArtifacts()`. By then `stopContainers()` (step 2) has already stopped the containers, so there are **no write races**.

Sketch:

```ts
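// src/services/perm-fixer.ts: called from preserveCleanupArtifacts when getuid() !== 0
// Sketch only. AWF_AGENT_IMAGE, getLocalDockerEnv and logger are assumed to be in scope.
import { execa } from 'execa';
import { applyHostPathPrefixToVolumes } from './host-path-prefix';

async function fixArtifactOwnership(dirs: string[], uid: number, gid: number,
                                    dockerHostPathPrefix?: string): Promise<void> {
  for (const dir of dirs) {                              // proxyLogsDir, auditDir, sessionStateDir
    const [translated] = applyHostPathPrefixToVolumes(   // ARC translation; no-op when no prefix
      [`${dir}:/fix`], dockerHostPathPrefix);            // src/services/host-path-prefix.ts:75
    const result = await execa('docker', [
      'run', '--rm', '--network', 'none',
      '--user', '0:0',                                   // explicit root, in case the image sets USER
      '--entrypoint', 'sh',                              // bypass any image ENTRYPOINT (the agent's would chroot)
      '--cap-drop', 'ALL', '--cap-add', 'CHOWN', '--cap-add', 'DAC_OVERRIDE', '--cap-add', 'FOWNER',
      '-e', `TUID=${uid}`, '-e', `TGID=${gid}`,
      '-v', translated,
      AWF_AGENT_IMAGE,                                   // reuse an already-pulled image: NO network pull
      // ';' rather than '&&': chmod is the floor and must run even if chown fails
      '-c', 'chown -R "$TUID:$TGID" /fix; chmod -R a+rX /fix',
    ], { env: getLocalDockerEnv(), reject: false });
    if (result.exitCode !== 0) {                         // best-effort, but never silent
      logger.warn(`perm-fixer failed for ${dir}: ${result.stderr}`);
    }
  }
}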
```

- **Minimum guarantee:** `chmod -R a+rX` → world-readable + dir-traversable (all `upload-artifact` needs to read). **Stronger:** `chown` to the runner so it can also delete/move the dir afterward.
- **ARC + non-ARC, same code:** the `-v` source is routed through `applyHostPathPrefixToVolumes`, which is a **no-op when there's no prefix** (`host-path-prefix.ts:76`) and applies the daemon-side prefix in ARC. The chown targets the runner's **numeric** uid/gid, which is identical on both sides of the shared volume.
- **Image:** reuse the agent/squid image (already pulled) — never `busybox`/`alpine` that would need a network pull through the firewall.
- **Skip when `--keep-containers`** (debugging; leave perms as-is).
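The gating conditions above can be kept in one small predicate. This is a sketch under assumptions: `PermFixerContext` and `shouldRunPermFixer` are hypothetical names, and `keepContainers` stands in for however AWF surfaces the `--keep-containers` flag.

```typescript
// Hypothetical inputs for the decision; field names are illustrative.
interface PermFixerContext {
  processUid: number;      // e.g. process.getuid() of the AWF orchestrator
  keepContainers: boolean; // true when --keep-containers was passed
}

// Run the perm-fixer only when AWF is rootless and not in debugging mode.
function shouldRunPermFixer(ctx: PermFixerContext): boolean {
  if (ctx.keepContainers) return false; // debugging: leave perms as-is
  return ctx.processUid !== 0;          // rootful runs already chmod successfully
}
```

Keeping the decision pure makes it easy to unit-test separately from the `docker run` call.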

#### Why AWF can still launch the perm-fixer (the chroot is the agent container's, not AWF's)

A natural objection: "doesn't AWF chroot/jail itself, so it can't launch more containers at cleanup?" No — these are two separate processes in separate namespaces:

- **The AWF orchestrator** (`src/cli.ts` → `src/commands/main-action.ts`) is an ordinary host/runner process. It **never `chroot`s itself**; every `chroot` reference in `src/` is just configuration it *passes into* the agent container (paths/identity/caps — `src/awf-config-schema.json:613`, `src/types/runtime-options.ts:132`). It talks to Docker the whole time via `execa('docker', …, { env: getLocalDockerEnv() })`.
- **The agent container** is what chroots: `containers/agent/entrypoint.sh` (PID 1 *inside that one container*) does `chroot /host` and drops `CAP_SYS_CHROOT`/`CAP_SYS_ADMIN` (`entrypoint.sh:399-402`). That jail lives entirely in the agent container's mount namespace and has **zero effect** on the AWF process or any other container.

By cleanup time the agent container has already **exited** (`runAgentCommand` returned), so its chroot is gone regardless. Launching the perm-fixer is just one more `docker run` from the same un-jailed orchestrator that already launched squid/agent/api-proxy/cli-proxy — architecturally identical.

The only real precondition is the one that is already true whenever AWF runs at all: **the AWF process can reach the Docker daemon** (socket access via the `docker` group — *not* host root). In rootless mode the orchestrator is the unprivileged runner user, which is exactly why it can't `chmod` other-UID files directly — but it can still `docker run`, and the perm-fixer *container* runs as root in its own namespace (container-uid-0 == host-uid-0 over the bind mount on a stock daemon) to do the chown the host process couldn't. If AWF couldn't talk to the daemon, it could never have started the firewall in the first place.

### 1b — Permissive file modes at the source (hardening)

Ensure the sidecar log writers create files world-readable (e.g. `fs.createWriteStream(LOG_FILE, { flags: 'a', mode: 0o644 })` in `containers/cli-proxy/server.js:59` and the api-proxy equivalent, and/or `umask 0` in entrypoints). Cheap, FS/daemon-agnostic, and reduces reliance on 2b for the node sidecars. Does not fix ownership of root-/uid-13-owned trees, so it complements rather than replaces 2b.
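One detail worth keeping in mind: the `mode` passed at creation is masked by the process umask and only applies when the file is first created. The sketch below is one possible way to make the result deterministic; `openWorldReadableLog` is a hypothetical helper, not existing sidecar code. It opens the file, then sets the mode explicitly with `fchmod`.

```typescript
import * as fs from 'node:fs';

// Opens (or creates) an append-only log and forces mode 0644 regardless of umask.
function openWorldReadableLog(logFile: string): fs.WriteStream {
  const fd = fs.openSync(logFile, 'a', 0o644); // creation mode is still subject to umask
  try {
    fs.fchmodSync(fd, 0o644);                  // explicit chmod is not masked by umask
  } catch {
    // Not the owner of a pre-existing file: keep whatever mode it already has.
  }
  return fs.createWriteStream(logFile, { fd }); // the stream takes over and closes the fd
}
```

This only helps for files the sidecar itself creates, which is why it complements rather than replaces 2b.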

### Restore observability for the swallowed `chmod`

Today the rootless `chmod -R a+rX` failure is logged at `debug` and lost (`src/artifact-preservation.ts:61-63`). Promote it to a **`warn`** (e.g. "could not relax artifact permissions as a non-root user; rootless perm-fixer will repair ownership"), so this class of failure can't silently regress and is diagnosable from default logs.
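A sketch of how the message could be built before handing it to the warn-level logger. `chmodFailureMessage` is a hypothetical helper, and the wording of the rootless message is taken from the suggestion above; the actual logger call and error shape in `src/artifact-preservation.ts` are not shown here.

```typescript
// Builds the text to log at warn level when the host-side chmod fails.
function chmodFailureMessage(code: string | undefined, processUid: number, dir: string): string {
  const permissionError = code === 'EPERM' || code === 'EACCES';
  if (processUid !== 0 && permissionError) {
    // Expected in rootless mode: the runner does not own the files.
    return `could not relax artifact permissions on ${dir} as a non-root user (${code}); ` +
           'rootless perm-fixer will repair ownership';
  }
  // Anything else is unexpected and is reported with the raw error code.
  return `chmod -R a+rX failed on ${dir}: ${code ?? 'unknown error'}`;
}
```

A caller would pass the error's `code` from the existing catch block and log the result with `warn` instead of `debug`.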

---

## ARC/DinD compatibility

The fix is ARC-correct by construction because ARC forces exactly two rules, both already satisfied above:

1. **Never assume host root** — the runner pod is unprivileged and there is no host `sudo`; only the DinD daemon can run a root container. → 2b runs under the daemon.
2. **Every helper bind mount must use the existing path translation** — runner and daemon have separate filesystems bridged by a shared volume + `--docker-host-path-prefix`. → 2b routes its `-v` through `applyHostPathPrefixToVolumes`, and chowns to the runner's numeric uid (identical on both sides).
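To illustrate rule 2, here is a hypothetical stand-in for the translation contract. It is not AWF's `applyHostPathPrefixToVolumes`; it only assumes the behaviour described above (a no-op without a prefix, and the prefix prepended to the bind-mount source otherwise), and the example prefix value is invented.

```typescript
// Hypothetical illustration of prefixing a bind-mount source for a remote daemon.
function prefixVolumeSource(volume: string, prefix?: string): string {
  if (!prefix) return volume;                   // non-ARC: leave the spec untouched
  const idx = volume.indexOf(':');
  if (idx < 0) return volume;                   // not a source:target spec
  const source = volume.slice(0, idx);
  if (!source.startsWith('/')) return volume;   // named volume, not a host path
  const rest = volume.slice(idx);               // ":/target" plus any ":ro" suffix
  return `${prefix.replace(/\/+$/, '')}${source}${rest}`;
}
```

Because only the source side changes, the container-side path (`/fix` in the perm-fixer sketch) and the numeric uid/gid stay the same in both topologies.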

Per-solution: **1a** ✅ (numeric runner uid on the shared volume), **1b** ✅ (mode-based, FS-agnostic), **2b** ✅ (daemon-run root + translated path), restore-observability ✅. Known limitation: a **userns-remapped daemon** maps container root to a subordinate uid, so 2b's `chown` may fail there; `chmod a+rX` still applies as the floor.

---

## Boundary / coordination note

AWF can only guarantee the directories it is told about (`--proxy-logs-dir`, the audit dir, the session-state dir). The failing path `/tmp/gh-aw/sandbox/firewall/{logs,audit}` may also contain output from **gh-aw-managed** containers (`awmg-mcpg`, `awmg-cli-proxy`) written *outside* AWF's known dirs. A 2b perm-fixer bind-mounting the passed-in tree covers what AWF owns; anything gh-aw writes elsewhere remains gh-aw's responsibility. The issue should make this split explicit so neither side assumes the other handles it.

---

## Acceptance criteria

- Under `--network-isolation` on a standard hosted runner, firewall `logs/`/`audit/` artifacts are readable by the unprivileged runner and `upload-artifact` succeeds with **no** external `sudo`/`chmod` step.
- The same holds on an ARC/DinD runner (shared-volume artifacts owned by / readable to the runner uid).
- `sudo: true` (rootful) runs are unchanged (2b is gated on `getuid() !== 0`).
- The rootless host-side `chmod` failure is visible at `warn` level.
- Regression coverage: a CI job exercising `sudo: false` end-to-end (sidecar logs + audit + artifact upload) so the EACCES regression can't silently return.
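For the regression job, a check along the following lines could run as the runner user just before `upload-artifact`. It is a sketch, not existing AWF test code: `findUnreadable` is a hypothetical helper that walks a tree and reports every entry the current process cannot read (or, for directories, traverse).

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Returns every path under root that the current user cannot read or traverse.
function findUnreadable(root: string): string[] {
  const unreadable: string[] = [];
  const visit = (p: string): void => {
    let stat: fs.Stats;
    try { stat = fs.lstatSync(p); } catch { unreadable.push(p); return; }
    if (stat.isSymbolicLink()) return;           // do not follow links out of the tree
    const need = stat.isDirectory()
      ? fs.constants.R_OK | fs.constants.X_OK    // list and traverse
      : fs.constants.R_OK;
    try { fs.accessSync(p, need); } catch { unreadable.push(p); return; }
    if (stat.isDirectory()) {
      for (const name of fs.readdirSync(p)) visit(path.join(p, name));
    }
  };
  visit(root);
  return unreadable;
}
```

A CI step could fail when the returned list is non-empty for the firewall `logs/` and `audit/` directories, and print the offending paths.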

## References

- gh-aw PR: github/gh-aw#41426 (revert `sudo: false → true` for glossary-maintainer)
- Sibling issues: #5543 (topology-attach deadlock), #5542 (rollout regression set), #5541 (ARC/chroot)
- AWF code: `src/artifact-preservation.ts:52-208`, `src/workdir-setup.ts:117-200`, `src/commands/main-action.ts:148-174`, `src/services/host-path-prefix.ts:75-80`, `src/services/cli-proxy-service.ts`, `src/services/api-proxy-service-config.ts`, `containers/cli-proxy/Dockerfile:44`, `containers/api-proxy/Dockerfile:43`, `containers/squid/Dockerfile:79`, `containers/cli-proxy/server.js:45-59`

