Context
The macOS mount-failure mitigations (#248, #276, #303, #376, #397) all work around a single upstream root cause: Apple Virtualization.framework's virtio-fs server has a cache-coherency/concurrency bug (Apple Feedback FB16008360) where metadata ops (mkdirat, getxattr, overlay metacopy) intermittently fail under load and across sleep/wake. As of the research done for #397, this is unfixed upstream (containers/podman#23061 closed as "Apple's bug"; #24725 open).
Idea
The libkrun / krunkit Podman-machine provider ships its own virtio-fs implementation (not AVF's), with stricter semantics that reportedly avoid the AVF failure mode described above. Switching the default macOS provider could address the root cause directly and let us retire the accumulated workarounds (vfkit watchdog, sleep prevention, overlay-volume placement, status named-volume mirroring) over time.
What to evaluate
- Reliability: does krunkit's virtio-fs actually eliminate the exit_code=4 / error_type=mount_failure class under sustained + concurrent agent I/O and across sleep/wake?
- Operational cost: provider install/bootstrap, podman machine init --provider, migration for existing users.
- Trade-offs: performance, GPU passthrough, maturity vs. AVF, CI implications.
- Which existing mitigations become removable vs. should stay as defense-in-depth.
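For the reliability question, a small metadata stress probe could give a comparable failure count under each provider. The following is a minimal sketch, not an existing script in this repo: the names `probe_once` and `run_probe`, the worker and iteration counts, and the `user.probe` attribute name are all assumptions made for illustration. It runs concurrent mkdir/stat/getxattr/rmdir cycles against a directory and collects any unexpected `OSError`. It does not exercise overlay metacopy or sleep/wake.

```python
import errno
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Optional

# errno values meaning "attribute absent" or "xattrs unsupported here";
# these are expected on a healthy mount and are not counted as failures.
BENIGN_XATTR_ERRNOS = {errno.ENODATA, errno.ENOTSUP}


def probe_once(root: str, worker: int, i: int) -> Optional[str]:
    """Run one mkdir/stat/getxattr/rmdir cycle; return an error string or None."""
    path = os.path.join(root, f"probe-{worker}-{i}")
    try:
        os.mkdir(path)
        os.stat(path)
        if hasattr(os, "getxattr"):  # os.getxattr only exists on Linux
            try:
                os.getxattr(path, "user.probe")  # hypothetical attribute name
            except OSError as exc:
                if exc.errno not in BENIGN_XATTR_ERRNOS:
                    raise
        os.rmdir(path)
    except OSError as exc:
        return f"{path}: {exc!r}"
    return None


def run_probe(root: str, workers: int = 8, iterations: int = 200) -> Dict[str, object]:
    """Run `workers` threads of `iterations` cycles each; summarize failures."""

    def work(worker: int) -> List[str]:
        errors = []
        for i in range(iterations):
            err = probe_once(root, worker, i)
            if err is not None:
                errors.append(err)
        return errors

    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_worker = list(pool.map(work, range(workers)))

    failures = [e for errs in per_worker for e in errs]
    return {"ops": workers * iterations, "failures": failures}
```

Calling `run_probe` on a host-shared directory from inside the guest, once per provider, would give a rough failure rate to compare. A zero count from a short run would not by itself show the bug is absent, since the source describes the failures as intermittent.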
Pointers
Tracked from the #397 overlay research. https://claude.ai/code/session_01GyUwb9P9abPA2WnUAqnCUU
Relevant code: scripts/lib/podman-health.sh, scripts/lib/vfkit-watchdog.sh, scripts/lib/status-sync.sh, and the overlay-volume work in scripts/lib/overlay-sandbox.sh.