
network-isolation (sudo:false) rollout regressions on standard runners: topology-attach ordering deadlock + rootless log EACCES #5542

Description

@lpcox

Summary

Two distinct regressions block the sandbox.agent.sudo: false (network-isolation) rollout on standard GitHub-hosted runners (this is not ARC — see #5541 for the ARC/chroot track). Both were surfaced by gh-aw PR github/gh-aw#41426, which had to revert sudo: false → true for the glossary-maintainer workflow, and by the failing Typist run github/gh-aw run 28168827390 / job 83427672448.

  1. Topology-attach ordering deadlock: in --network-isolation mode, AWF's own awf-cli-proxy sidecar can never become healthy, because the external DIFC proxy / MCP-gateway containers it must reach are joined to awf-net only after container startup has already gated on that sidecar's health. Result: getaddrinfo EAI_AGAIN awmg-cli-proxy → awf-cli-proxy could not connect to the external DIFC proxy → "The agent was never invoked" → exit 1.
  2. Rootless permission regression: firewall containers write files into the host-mounted logs/ and audit/ dirs with ownership the unprivileged runner can't read; without sudo, the previous sudo chmod -R a+rX /tmp/gh-aw/sandbox/firewall repair step is gone, so upload-artifact fails with EACCES.

These make sudo: false non-viable for any workflow that uses the CLI proxy + artifact upload until fixed.


Problem 1 — Topology-attach ordering deadlock (firewall never starts)

Symptom (Typist run 28168827390)

[tcp-tunnel] Upstream error (...): getaddrinfo EAI_AGAIN awmg-cli-proxy
[cli-proxy] ERROR: DIFC proxy liveness probe failed for localhost:18443 ...
[cli-proxy] Failing fast to avoid repeated in-agent retries
AWF firewall failed to start: awf-cli-proxy could not connect to the external
DIFC proxy (or exited before establishing a connection). ... The agent was never invoked.

The config used topologyAttach: ["awmg-mcpg","awmg-cli-proxy"] and --difc-proxy-host awmg-cli-proxy:18443. The hostname awmg-cli-proxy never resolves on awf-net.

Root cause — a startup ordering deadlock

The sequence in cli-workflow.ts is:

  1. Step 2 startContainers() (src/container-lifecycle.ts:44-66) runs docker compose up -d. Because the agent service depends_on the cli-proxy with condition: service_healthy (src/services/cli-proxy-service.ts:82-93, src/compose-generator.ts), compose blocks on the cli-proxy becoming healthy.
  2. The cli-proxy entrypoint (containers/cli-proxy/entrypoint.sh:59-97) probes the external DIFC proxy via a TCP tunnel to AWF_DIFC_PROXY_HOST (awmg-cli-proxy). It retries 10 times, then exits with status 1.
  3. But awmg-cli-proxy is only attached to awf-net in Step 2.5 connectTopologyContainers() (src/cli-workflow.ts:122-133, src/topology.ts:91-120) — which runs only after startContainers() returns successfully.

So:

startContainers() waits for awf-cli-proxy healthy
        ↓ requires
awf-cli-proxy probe reaches awmg-cli-proxy
        ↓ requires
docker network connect awf-net awmg-cli-proxy   ← Step 2.5
        ↓ runs only after
startContainers() returns                        ← never reached

awf-net is internal: true (src/compose-generator.ts:234-240); the external awmg-cli-proxy is launched by gh-aw on the default bridge network, so until AWF explicitly docker network connects it to awf-net, the name is unresolvable from inside awf-net (EAI_AGAIN). The attach never happens → deadlock → fail-fast → agent never invoked.

(The gh-aw-launched awmg-mcpg gateway showing early ECONNRESET health failures is a secondary symptom of the same "peers not yet joined to awf-net" condition.)

Why it isn't just flakiness

It is deterministic: the attach step is structurally sequenced after the health gate it depends on. Any workflow combining --network-isolation + --topology-attach + a CLI proxy that probes a topology peer at startup will hit it 100% of the time.
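
The structural nature of the deadlock can be illustrated with a small dependency model. This is only a sketch: the node labels paraphrase the steps described in this issue, and `findCycle`, `current`, and `reordered` are hypothetical names that do not exist in AWF. The second graph assumes one possible fix, in which the attach depends only on awf-net having been created.

```typescript
// Hypothetical model: each key is a step or condition, and its value lists
// what must happen first. None of these names come from the AWF codebase.
type Graph = Record<string, string[]>;

// Depth-first search that returns one dependency cycle, or null if acyclic.
function findCycle(graph: Graph): string[] | null {
  const state: Record<string, "visiting" | "done"> = {};
  const stack: string[] = [];

  const visit = (node: string): string[] | null => {
    if (state[node] === "done") return null;
    if (state[node] === "visiting") {
      // Found a back edge: the cycle is the stack from `node` onward.
      return [...stack.slice(stack.indexOf(node)), node];
    }
    state[node] = "visiting";
    stack.push(node);
    for (const dep of graph[node] ?? []) {
      const cycle = visit(dep);
      if (cycle) return cycle;
    }
    stack.pop();
    state[node] = "done";
    return null;
  };

  for (const node of Object.keys(graph)) {
    const cycle = visit(node);
    if (cycle) return cycle;
  }
  return null;
}

// Ordering as described in this issue: the attach waits on startContainers().
const current: Graph = {
  "startContainers returns": ["awf-cli-proxy healthy"],
  "awf-cli-proxy healthy": ["probe reaches awmg-cli-proxy"],
  "probe reaches awmg-cli-proxy": ["awmg-cli-proxy attached to awf-net"],
  "awmg-cli-proxy attached to awf-net": ["startContainers returns"],
};

// Assumed reordering: the attach only needs the network to exist.
const reordered: Graph = {
  ...current,
  "awmg-cli-proxy attached to awf-net": ["awf-net created"],
  "awf-net created": [],
};

console.log(findCycle(current)); // a cycle through all four steps
console.log(findCycle(reordered)); // null
```

Because the cycle is in the ordering itself rather than in timing, no retry count or timeout on the existing probe can resolve it.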

Potential solutions

  • Reorder: attach topology peers before the health gate. Join --topology-attach containers to awf-net before (or concurrently with) bringing up the dependent sidecars. Options: bring up squid/network first, run connectTopologyContainers(), then up -d the cli-proxy/agent; or split compose so the attach happens between network creation and the health-gated services.
  • Make the cli-proxy probe tolerant of a not-yet-attached peer. Treat EAI_AGAIN/ENOTFOUND as "not-yet-ready" (retry, not fail-fast) rather than terminal, with a longer DNS-aware backoff, so the sidecar survives until the attach lands.
  • Pre-create and attach via compose, not post-hoc. Declare the external peers as external_links/extra networks, or have AWF create awf-net and attach peers, then generate/launch compose with the peers already reachable.
  • Better diagnostics. Distinguish "peer not attached to awf-net" from "peer attached but DIFC proxy not listening" so this failure is self-explaining.
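
As a rough illustration of the "tolerant probe" option above, the sketch below classifies DNS failures as retryable and applies a capped exponential backoff. It is not the existing entrypoint logic: `classifyProbeError`, `backoffMs`, `probeWithRetry`, the backoff constants, and the choice to fail fast on every other error code are all assumptions made for the example. The error object is assumed to carry a Node-style `code` property such as `EAI_AGAIN`.

```typescript
// Hypothetical classification of probe failures; names are illustrative only.
type ProbeOutcome = "retry" | "fail-fast";

// DNS error codes that may mean "peer not yet attached to awf-net".
const NOT_YET_ATTACHED: string[] = ["EAI_AGAIN", "ENOTFOUND"];

function classifyProbeError(code: string | undefined): ProbeOutcome {
  return code !== undefined && NOT_YET_ATTACHED.indexOf(code) !== -1
    ? "retry"
    : "fail-fast";
}

// Exponential backoff capped at maxMs (the defaults are arbitrary assumptions).
function backoffMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `probe` up to maxAttempts times. DNS-style failures are retried after
// a backoff; any other failure, or the final attempt, rethrows the error.
// `sleep` is injected so the caller decides how to wait (for example a
// setTimeout-based promise).
async function probeWithRetry(
  probe: () => Promise<void>,
  maxAttempts: number,
  sleep: (ms: number) => Promise<void>,
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      await probe();
      return;
    } catch (err) {
      const code = (err as { code?: string } | null)?.code;
      const lastAttempt = attempt + 1 >= maxAttempts;
      if (classifyProbeError(code) === "fail-fast" || lastAttempt) throw err;
      await sleep(backoffMs(attempt));
    }
  }
}
```

On its own this only helps if the attach can actually land while the sidecar is still retrying, so it complements the reordering option rather than replacing it.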

Problem 2 — Rootless permission regression (EACCES on artifact upload)

Evidence

gh-aw PR github/gh-aw#41426 reverting glossary-maintainer to sudo: true removed, among other things, the sudo chmod -R a+rX /tmp/gh-aw/sandbox/firewall step. Under sudo: false that step is gone, but the firewall containers still write files into the host-mounted logs/ and audit/ dirs under /tmp/gh-aw/sandbox/firewall. The unprivileged runner then hits EACCES when upload-artifact zips them.

Root cause

AWF pre-creates the log dirs 0777 (src/workdir-setup.ts:155-198), but the files written into them come from container processes (squid as proxy/uid 13, cli-proxy as cliproxy, api-proxy, agent) whose UIDs don't match the runner user. AWF's own chmod -R a+rX repair (src/artifact-preservation.ts:52-74,160-201) can only relax permissions on files AWF can chmod — it can't fix files owned by other UIDs without privileges. In sudo mode gh-aw papered over this with the external sudo chmod; rootless mode has no such escape hatch.
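
The following sketch models why a 0777 directory does not help here. It is a simplified POSIX check that ignores ACLs, capabilities, and the root override for reads. The runner UID/GID (1001) and the file mode (0640) are assumed values chosen for illustration, not taken from the failing run; only the squid uid 13 comes from the description above.

```typescript
interface FileMeta {
  uid: number;
  gid: number;
  mode: number; // permission bits, e.g. 0o640
}

interface Caller {
  uid: number;
  gids: number[];
}

// Simplified POSIX read check: exactly one permission class applies
// (owner, then group, then other).
function canRead(file: FileMeta, caller: Caller): boolean {
  if (caller.uid === file.uid) return (file.mode & 0o400) !== 0;
  if (caller.gids.indexOf(file.gid) !== -1) return (file.mode & 0o040) !== 0;
  return (file.mode & 0o004) !== 0;
}

// chmod is only permitted for the file's owner or for root.
function canChmod(file: FileMeta, caller: Caller): boolean {
  return caller.uid === 0 || caller.uid === file.uid;
}

// Assumed example values: a log written by squid (uid 13) with mode 0640,
// and an unprivileged runner with uid/gid 1001.
const squidLog: FileMeta = { uid: 13, gid: 13, mode: 0o640 };
const runner: Caller = { uid: 1001, gids: [1001] };

console.log(canRead(squidLog, runner)); // false: "other" has no read bit
console.log(canChmod(squidLog, runner)); // false: runner is neither owner nor root
console.log(canRead({ ...squidLog, mode: 0o644 }, runner)); // true once a+r is applied
```

The directory mode never enters the check for the file's contents, and the only callers able to add the missing read bit are the container-side owner or root.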

Potential solutions

  • Write firewall logs/audit as the invoking user. Run the log-writing sidecars (or their bind-mounted output dirs) with the runner's UID/GID — e.g. --user "$(id -u):$(id -g)", the same UID-mapping AWF already does for the agent (containers/agent/entrypoint.sh) — so output is runner-readable by construction.
  • Repair ownership/permissions on shutdown without sudo. Have the sidecars (which are root inside their container, even when AWF runs rootless on the host) chown/chmod -R a+rX their own output dirs to the host caller's UID on exit, so no host-side privilege is needed.
  • Use a UID-mapped / userns-remapped volume for the firewall log/audit dirs so host-visible ownership matches the runner.
  • Document that under --network-isolation the external sudo chmod step is unnecessary because AWF guarantees runner-readable artifacts.
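
A minimal sketch of the first option, assuming the generated compose document is available as a plain object before it is written out. The `ComposeFile` and `Service` shapes, the `withCallerUser` helper, and the service names and image strings in the example are hypothetical; only the compose `user` key itself is standard. Whether each sidecar (squid in particular) can start correctly as a non-root user is not verified here.

```typescript
// Hypothetical, minimal shapes for a compose document.
type Service = { image?: string; user?: string; [key: string]: unknown };
type ComposeFile = { services: Record<string, Service> };

// Returns a copy of `compose` in which the named services run as uid:gid,
// so files they write to bind mounts are owned by the invoking user.
function withCallerUser(
  compose: ComposeFile,
  serviceNames: string[],
  uid: number,
  gid: number,
): ComposeFile {
  if (!Number.isInteger(uid) || !Number.isInteger(gid) || uid < 0 || gid < 0) {
    throw new Error(`invalid uid/gid: ${uid}:${gid}`);
  }
  const user = `${uid}:${gid}`;
  const services: Record<string, Service> = { ...compose.services };
  for (const name of serviceNames) {
    const svc = services[name];
    if (svc === undefined) throw new Error(`unknown service: ${name}`);
    services[name] = { ...svc, user };
  }
  return { ...compose, services };
}

// Example with made-up service names and images.
const example: ComposeFile = {
  services: {
    "log-writer": { image: "example/log-writer" },
    agent: { image: "example/agent" },
  },
};

const patched = withCallerUser(example, ["log-writer"], 1001, 1001);
console.log(patched.services["log-writer"].user); // "1001:1001"
```

In a Node process the caller's IDs would typically come from `process.getuid()` and `process.getgid()`, which are available on POSIX platforms.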

Acceptance criteria

  • A --network-isolation + --topology-attach run with a CLI-proxy DIFC peer starts the firewall and invokes the agent (no EAI_AGAIN deadlock) on a standard hosted runner.
  • Firewall logs/ and audit/ artifacts written under sudo: false are readable by the unprivileged runner; upload-artifact succeeds with no external chmod/sudo step.
  • Regression coverage: a CI job exercising the sudo: false topology end-to-end (cli-proxy + topology-attach + artifact upload), so neither regression can silently return.

Labels: bug, enhancement, testing
