Evaluate libkrun/krunkit Podman-machine provider to address AVF virtio-fs root cause (#376/#397 follow-up) #409

Description

@aviadshiber

Context

The macOS mount-failure mitigations (#248, #276, #303, #376, #397) all work around a single upstream root cause: the virtio-fs server in Apple's Virtualization.framework (AVF) has a cache-coherency/concurrency bug (Apple Feedback FB16008360) in which metadata operations (mkdirat, getxattr, overlay metacopy) intermittently fail under load and across sleep/wake. As of the research done for #397, the bug remains unfixed upstream (containers/podman#23061 was closed as "Apple's bug"; #24725 is still open).

Idea

The libkrun / krunkit Podman-machine provider ships its own virtio-fs implementation rather than using AVF's, and its stricter semantics reportedly avoid the failures caused by AVF's laxer behavior. Switching the default macOS provider to it could address the root cause directly and, over time, let us retire the accumulated workarounds (vfkit watchdog, sleep prevention, overlay-volume placement, status named-volume mirroring).

What to evaluate

  • Reliability: does krunkit's virtio-fs actually eliminate the exit_code=4 / error_type=mount_failure class under sustained + concurrent agent I/O and across sleep/wake?
  • Operational cost: provider install/bootstrap, podman machine init --provider, migration for existing users.
  • Trade-offs: performance, GPU passthrough, maturity vs. AVF, CI implications.
  • Cleanup: which existing mitigations become removable, and which should stay as defense-in-depth.
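
For the reliability item, a minimal sketch of a stress harness follows. It assumes the evaluation runs the same workload against a directory backed by each provider's virtio-fs mount and compares error counts. The function name `stress_metadata_ops`, the worker and iteration counts, and the choice of operations are illustrative assumptions, not part of the existing tooling. `os.stat` stands in for the xattr lookups because Python's `os.getxattr` is Linux-only, and the sketch makes no claim to reproduce the exact exit_code=4 path.

```python
import errno
import os
import sys
import tempfile
import threading
from collections import Counter


def stress_metadata_ops(root, workers=8, iterations=200):
    """Run concurrent mkdir/stat/rename/rmdir cycles under `root`.

    Returns (attempts, errors), where `errors` maps an errno name to the
    number of cycles that failed with it. Hypothetical helper for illustration.
    """
    errors = Counter()
    lock = threading.Lock()

    def worker(worker_id):
        local_errors = Counter()
        for i in range(iterations):
            path = os.path.join(root, f"w{worker_id}-{i}")
            renamed = path + ".renamed"
            try:
                os.mkdir(path)            # directory creation (mkdirat-style)
                os.stat(path)             # metadata lookup
                os.rename(path, renamed)  # metadata update
                os.rmdir(renamed)         # cleanup
            except OSError as exc:
                name = errno.errorcode.get(exc.errno, str(exc.errno))
                local_errors[name] += 1
        # Merge per-thread counts once, under a lock.
        with lock:
            errors.update(local_errors)

    threads = [threading.Thread(target=worker, args=(n,)) for n in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return workers * iterations, errors


if __name__ == "__main__":
    # Pass the mounted directory to test; defaults to a local temp dir.
    if len(sys.argv) > 1:
        attempts, errors = stress_metadata_ops(sys.argv[1])
    else:
        with tempfile.TemporaryDirectory() as tmp:
            attempts, errors = stress_metadata_ops(tmp)
    print(f"attempts={attempts} failed={sum(errors.values())} by_errno={dict(errors)}")
```

Running the script with the mount path as its only argument, once per provider and again after a sleep/wake cycle, would give a rough failure-rate comparison. A cycle that fails partway leaves its directory behind, so each run should use a fresh target directory.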

Pointers

Tracked from the #397 overlay research. https://claude.ai/code/session_01GyUwb9P9abPA2WnUAqnCUU

Metadata

Assignees

No one assigned

    Labels

    enhancement: New feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
