Two distinct regressions block the sandbox.agent.sudo: false (network-isolation) rollout on standard GitHub-hosted runners (this is not ARC — see #5541 for the ARC/chroot track). Both were surfaced by gh-aw PR github/gh-aw#41426, which had to revert sudo: false → true for the glossary-maintainer workflow, and by the failing Typist run github/gh-aw run 28168827390 / job 83427672448.
Topology-attach ordering deadlock: in --network-isolation mode, AWF's own awf-cli-proxy sidecar can never become healthy because the external DIFC proxy / MCP-gateway containers it must reach are joined to awf-net only after container startup has already gated on that sidecar's health. Result: getaddrinfo EAI_AGAIN awmg-cli-proxy → awf-cli-proxy could not connect to the external DIFC proxy → "The agent was never invoked" → exit 1.
Rootless permission regression: firewall containers write files into the host-mounted logs/ and audit/ dirs with ownership the unprivileged runner can't read; without sudo, the previous sudo chmod -R a+rX /tmp/gh-aw/sandbox/firewall repair step is gone, so upload-artifact fails with EACCES.
These make sudo: false non-viable for any workflow that uses the CLI proxy + artifact upload until fixed.
Problem 1 — Topology-attach ordering deadlock (firewall never starts)
Symptom (Typist run 28168827390)
[tcp-tunnel] Upstream error (...): getaddrinfo EAI_AGAIN awmg-cli-proxy
[cli-proxy] ERROR: DIFC proxy liveness probe failed for localhost:18443 ...
[cli-proxy] Failing fast to avoid repeated in-agent retries
AWF firewall failed to start: awf-cli-proxy could not connect to the external
DIFC proxy (or exited before establishing a connection). ... The agent was never invoked.
The config used topologyAttach: ["awmg-mcpg","awmg-cli-proxy"] and --difc-proxy-host awmg-cli-proxy:18443. The hostname awmg-cli-proxy never resolves on awf-net.
Root cause — a startup ordering deadlock
The sequence in cli-workflow.ts is:
Step 2 startContainers() (src/container-lifecycle.ts:44-66) runs docker compose up -d. Because the agent service depends_on the cli-proxy with condition: service_healthy (src/services/cli-proxy-service.ts:82-93, src/compose-generator.ts), compose blocks on the cli-proxy becoming healthy.
The cli-proxy entrypoint (containers/cli-proxy/entrypoint.sh:59-97) probes the external DIFC proxy via a TCP tunnel to AWF_DIFC_PROXY_HOST (awmg-cli-proxy). It retries 10 times then exit 1.
But awmg-cli-proxy is only attached to awf-net in Step 2.5, connectTopologyContainers() (src/cli-workflow.ts:122-133, src/topology.ts:91-120), which runs only after startContainers() returns successfully.
So:
startContainers() waits for awf-cli-proxy healthy
↓ requires
awf-cli-proxy probe reaches awmg-cli-proxy
↓ requires
docker network connect awf-net awmg-cli-proxy ← Step 2.5
↓ runs only after
startContainers() returns ← never reached
awf-net is internal: true (src/compose-generator.ts:234-240); the external awmg-cli-proxy is launched by gh-aw on the default bridge network, so until AWF explicitly runs docker network connect to join it to awf-net, the name is unresolvable from inside awf-net (EAI_AGAIN). The attach never happens → deadlock → fail-fast → agent never invoked.
(The gh-aw-launched awmg-mcpg gateway showing early ECONNRESET health failures is a secondary symptom of the same "peers not yet joined to awf-net" condition.)
Why it isn't just flakiness
It is deterministic: the attach step is structurally sequenced after the health gate it depends on. Any workflow combining --network-isolation + --topology-attach + a CLI proxy that probes a topology peer at startup will hit it 100% of the time.
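The structural nature of the failure can be illustrated by modeling the startup steps as a dependency graph and searching it for a cycle. This is only a sketch: the node labels below are descriptive names chosen for illustration, not identifiers from the AWF source, and the "reordered" graph assumes the attach step can run as soon as awf-net exists.

```typescript
// Hypothetical model of the startup dependency graph. Node names are labels
// for this sketch only, not identifiers from the AWF code base.
type Graph = Record<string, string[]>; // node -> nodes it must wait for

// Depth-first search that returns one dependency cycle, or null if none exists.
function findCycle(graph: Graph): string[] | null {
  const done: Record<string, boolean> = {};
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    const seenAt = path.indexOf(node);
    if (seenAt !== -1) return path.slice(seenAt).concat(node); // back edge found
    if (done[node]) return null;
    path.push(node);
    for (const dep of graph[node] || []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    path.pop();
    done[node] = true;
    return null;
  };

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

// Current ordering: the attach step waits for startContainers() to return.
const currentOrder: Graph = {
  "startContainers returns": ["awf-cli-proxy healthy"],
  "awf-cli-proxy healthy": ["probe reaches awmg-cli-proxy"],
  "probe reaches awmg-cli-proxy": ["peers attached to awf-net"],
  "peers attached to awf-net": ["startContainers returns"],
};

// Assumed reordering: the attach step waits only for the network to exist.
const reordered: Graph = {
  ...currentOrder,
  "peers attached to awf-net": ["awf-net created"],
  "awf-net created": [],
};

console.log(findCycle(currentOrder)); // a five-element path that starts and ends at the same node
console.log(findCycle(reordered)); // null
```

Because the cycle is a property of the ordering itself, no amount of retrying inside the health gate can resolve it; only removing one of the edges does.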
Potential solutions
Reorder: attach topology peers before the health gate. Join --topology-attach containers to awf-net before (or concurrently with) bringing up the dependent sidecars. Options: bring up squid/network first, run connectTopologyContainers(), then up -d the cli-proxy/agent; or split compose so the attach happens between network creation and the health-gated services.
Make the cli-proxy probe tolerant of a not-yet-attached peer. Treat EAI_AGAIN/ENOTFOUND as "not-yet-ready" (retry, not fail-fast) rather than terminal, with a longer DNS-aware backoff, so the sidecar survives until the attach lands.
Pre-create and attach via compose, not post-hoc. Declare the external peers as external_links/extra networks, or have AWF create awf-net and attach peers, then generate/launch compose with the peers already reachable.
Better diagnostics. Distinguish "peer not attached to awf-net" from "peer attached but DIFC proxy not listening" so this failure is self-explaining.
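The probe-tolerance and diagnostics options above could share one piece of logic: classify the probe error, then decide whether to retry or fail with a self-explaining message. The sketch below is written in TypeScript purely for illustration (the actual probe lives in containers/cli-proxy/entrypoint.sh); the verdict names, the `nextStep` helper, and the retry policy values are all assumptions, not existing AWF code.

```typescript
// Hypothetical verdicts; the names are assumptions made for this sketch.
type ProbeVerdict = "peer-not-attached" | "proxy-not-listening" | "fatal";

// Map a system error code from the probe to a verdict.
function classifyProbeError(code: string | undefined): ProbeVerdict {
  switch (code) {
    case "EAI_AGAIN":
    case "ENOTFOUND":
      return "peer-not-attached"; // name does not resolve on awf-net (yet)
    case "ECONNREFUSED":
    case "ECONNRESET":
      return "proxy-not-listening"; // name resolves, service not ready
    default:
      return "fatal";
  }
}

interface RetryPolicy {
  maxAttempts: number;
  baseDelayMs: number;
  maxDelayMs: number;
}

type NextStep =
  | { action: "retry"; delayMs: number; verdict: ProbeVerdict }
  | { action: "fail"; verdict: ProbeVerdict; message: string };

// Decide what to do after a failed probe attempt (attempt is 1-based).
function nextStep(code: string | undefined, attempt: number, policy: RetryPolicy): NextStep {
  const verdict = classifyProbeError(code);
  const messages: Record<ProbeVerdict, string> = {
    "peer-not-attached": "peer name does not resolve; it may not be attached to awf-net",
    "proxy-not-listening": "peer resolves but the DIFC proxy is not accepting connections",
    "fatal": "unexpected probe error",
  };
  if (verdict === "fatal" || attempt >= policy.maxAttempts) {
    return { action: "fail", verdict, message: messages[verdict] + " (" + String(code) + ")" };
  }
  // Exponential backoff, capped at maxDelayMs.
  const delayMs = Math.min(policy.maxDelayMs, policy.baseDelayMs * Math.pow(2, attempt - 1));
  return { action: "retry", delayMs, verdict };
}

// Illustrative values only.
const probePolicy: RetryPolicy = { maxAttempts: 30, baseDelayMs: 500, maxDelayMs: 5000 };
console.log(nextStep("EAI_AGAIN", 1, probePolicy));
console.log(nextStep("EAI_AGAIN", 30, probePolicy));
```

Note that a tolerant probe only helps if something else attaches the peer in the meantime; while the attach step is still sequenced after the health gate, retrying longer merely delays the same failure.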
Problem 2 — Rootless permission regression (EACCES on artifact upload)
Evidence
gh-aw PR github/gh-aw#41426 reverting glossary-maintainer to sudo: true removed, among other things, the sudo chmod -R a+rX /tmp/gh-aw/sandbox/firewall step. Under sudo: false that step is gone, but the firewall containers still write files into the host-mounted logs/ and audit/ dirs under /tmp/gh-aw/sandbox/firewall. The unprivileged runner then hits EACCES when upload-artifact zips them.
Root cause
AWF pre-creates the log dirs 0777 (src/workdir-setup.ts:155-198), but the files written into them come from container processes (squid as proxy/uid 13, cli-proxy as cliproxy, api-proxy, agent) whose UIDs don't match the runner user. AWF's own chmod -R a+rX repair (src/artifact-preservation.ts:52-74,160-201) can only relax permissions on files AWF can chmod — it can't fix files owned by other UIDs without privileges. In sudo mode gh-aw papered over this with the external sudo chmod; rootless mode has no such escape hatch.
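The permission problem can be made concrete with a simplified model of the classic POSIX owner/group/other checks. This is a sketch under stated assumptions: it ignores ACLs, capabilities, and directory traversal bits; the runner UID 1001 and the file mode 0640 are illustrative guesses, while uid 13 for squid comes from the description above.

```typescript
// Simplified POSIX permission model; field names are for this sketch only.
interface FileMeta { uid: number; gid: number; mode: number }
interface Caller { uid: number; gids: number[] }

// Can the caller read the file, based on owner/group/other read bits?
function canRead(file: FileMeta, caller: Caller): boolean {
  if (caller.uid === 0) return true; // root bypasses the mode bits
  if (caller.uid === file.uid) return (file.mode & 0o400) !== 0;
  if (caller.gids.indexOf(file.gid) !== -1) return (file.mode & 0o040) !== 0;
  return (file.mode & 0o004) !== 0;
}

// Only the file's owner (or root) may change its mode.
function canChmod(file: FileMeta, caller: Caller): boolean {
  return caller.uid === 0 || caller.uid === file.uid;
}

// Assumed values: a squid log written as uid 13 with mode 0640,
// and an unprivileged runner with uid/gid 1001.
const squidLog: FileMeta = { uid: 13, gid: 13, mode: 0o640 };
const runner: Caller = { uid: 1001, gids: [1001] };
const root: Caller = { uid: 0, gids: [0] };

console.log(canRead(squidLog, runner)); // false: no "other" read bit
console.log(canChmod(squidLog, runner)); // false: runner is not the owner
console.log(canChmod(squidLog, root)); // true: what the sudo chmod step relied on
```

The 0777 mode on the parent directory does not change either answer for the runner, which is why pre-creating the dirs is not sufficient on its own.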
Potential solutions
Write firewall logs/audit as the invoking user. Run the log-writing sidecars (or their bind-mounted output dirs) with the runner's UID/GID — e.g. --user "$(id -u):$(id -g)", the same UID-mapping AWF already does for the agent (containers/agent/entrypoint.sh) — so output is runner-readable by construction.
Repair ownership/permissions on shutdown without sudo. Have the sidecars (which are root inside their container, even when AWF runs rootless on the host) chown/chmod -R a+rX their own output dirs to the host caller's UID on exit, so no host-side privilege is needed.
Use a UID-mapped / userns-remapped volume for the firewall log/audit dirs so host-visible ownership matches the runner.
Document that under --network-isolation the external sudo chmod step is unnecessary because AWF guarantees runner-readable artifacts.
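As a rough illustration of the first option, a compose service definition could be given the caller's UID/GID through the compose `user` key before the file is written. The `ComposeService` shape, the `runAsCaller` helper, and the image name are hypothetical; whether each sidecar (squid in particular) can actually run under an arbitrary UID is an open question this sketch does not answer.

```typescript
// Minimal service shape for this sketch; real compose services carry more keys.
interface ComposeService {
  image: string;
  user?: string;
  volumes?: string[];
}

// Return a copy of the service that runs as the given host UID/GID,
// so files it writes to bind mounts are owned by the caller.
function runAsCaller(service: ComposeService, uid: number, gid: number): ComposeService {
  return { ...service, user: uid + ":" + gid };
}

// Hypothetical sidecar definition; the image name is a placeholder.
const logSidecar: ComposeService = {
  image: "example/sidecar",
  volumes: ["/tmp/gh-aw/sandbox/firewall/logs:/var/log/sidecar"],
};

// In Node on POSIX the IDs could come from process.getuid() and process.getgid();
// 1001 is an assumed runner UID/GID used here for illustration.
const mapped = runAsCaller(logSidecar, 1001, 1001);
console.log(mapped.user); // "1001:1001"
```

Returning a copy rather than mutating the input keeps the original definition available for the sudo: true path, where no remapping is needed.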
Acceptance criteria
A --network-isolation + --topology-attach run with a CLI-proxy DIFC peer starts the firewall and invokes the agent (no EAI_AGAIN deadlock) on a standard hosted runner.
Firewall logs/ and audit/ artifacts written under sudo: false are readable by the unprivileged runner; upload-artifact succeeds with no external chmod/sudo step.
Regression coverage: a CI job exercising the sudo: false topology end-to-end (cli-proxy + topology-attach + artifact upload), so neither regression can silently return.
References
gh-aw PR github/gh-aw#41426 (reverted sudo: false → true for glossary-maintainer)
Failing Typist run: github/gh-aw run 28168827390, job 83427672448
src/cli-workflow.ts:95-133, src/topology.ts:91-120, src/container-lifecycle.ts:44-145, src/services/cli-proxy-service.ts:82-93, containers/cli-proxy/entrypoint.sh:59-97, src/compose-generator.ts:220-256, src/workdir-setup.ts:155-198, src/artifact-preservation.ts:52-201