Background
The package had never passed a workflow. Two root-cause bugs were found and have already been fixed and pushed to master:
aca8046: startos/versions/current.ts had version: '0.21.1-rc0:0', which is invalid ExVer (alphabetic and numeric pre-release identifiers must be dot-delimited). It crashed start-cli s9pk pack at manifest evaluation for all variants, so every prior run died identically at ~1m49s. Fixed to 0.21.1-rc.0:0.
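A pre-flight check along these lines could catch the same class of mistake before start-cli runs. It is a minimal sketch, not the real ExVer grammar: it assumes a version of the form upstream:downstream with no flavor prefix, and it only enforces the rule described above, that each dot-separated pre-release identifier is purely alphabetic or purely numeric. The helper name check_prerelease is hypothetical.

```shell
# Hypothetical helper, not part of start-cli; a simplified lint only.
check_prerelease() {
  version=${1%%:*}                 # drop the ":<downstream>" revision
  case "$version" in
    *-*) prerelease=${version#*-} ;;
    *)   return 0 ;;               # no pre-release part to check
  esac
  # Split on dots; reject any identifier that mixes letters and digits.
  if printf '%s\n' "$prerelease" | tr '.' '\n' |
      grep -Eqv '^([0-9]+|[A-Za-z]+)$'; then
    return 1
  fi
  return 0
}

check_prerelease '0.21.1-rc0:0'  || echo "rejected: 0.21.1-rc0:0"
check_prerelease '0.21.1-rc.0:0' && echo "accepted: 0.21.1-rc.0:0"
```

Running a check like this as an early workflow step would fail in seconds rather than at manifest evaluation.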
4c1d1de: scripts/patch-dockerfiles.sh injected the dist-named SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM. vLLM's setup.py calls setuptools_scm.get_version() with no dist_name, and the transitively bumped setuptools-scm v10 (now backed by vcs_versioning) no longer infers the project name from pyproject.toml in that call, so the named override never matched and the build failed with "setuptools-scm was unable to detect version". Switched to the generic SETUPTOOLS_SCM_PRETEND_VERSION (safe, since the Dockerfile builds only one project). UPDATING.md updated to match.
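For illustration, the generic override could be injected roughly as follows. This is a sketch under assumptions, since the actual contents of patch-dockerfiles.sh are not shown here: the function name inject_scm_version, the choice to add an ENV line after every FROM, and the version string 0.21.1rc0 are all placeholders. Only the variable name SETUPTOOLS_SCM_PRETEND_VERSION comes from the fix itself.

```shell
# Hypothetical filter: reads a Dockerfile on stdin, writes the patched one to stdout.
# An ENV line is added after every FROM because ENV values are only inherited by
# stages that build FROM the stage that set them.
inject_scm_version() {
  awk -v v="$1" '
    { print }
    /^FROM / { print "ENV SETUPTOOLS_SCM_PRETEND_VERSION=" v }
  '
}

# Example input is made up; the version value is a placeholder.
printf 'FROM python:3.12 AS build\nRUN pip install .\n' | inject_scm_version 0.21.1rc0
```

Because the generic variable applies to every setuptools-scm project in the build, this approach is only appropriate while the Dockerfile builds a single project, as noted above.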
Current state (run 26610370261)
| variant | result | notes |
| --- | --- | --- |
| rocm | ✅ built & packed | amd64-only, as designed |
| cpu / x86_64 | ✅ built & packed | source build works end-to-end |
| nvidia | ❌ out of disk | packs two CUDA images (amd64 8.6 GB + arm64 9.9 GB compressed; ~50 GB+ uncompressed in the docker store during pack) on one runner. Disk so exhausted the runner couldn't write its own logs (job log truncated to 2 lines). |
| cpu / aarch64 | ❌ after ~2.5 h | shared BuildKit cache clobber (see below) |
So the fundamental packaging is fixed (rocm + cpu/x86 are green). Both remaining failures are on the aarch64 paths of the two heavy variants.
cpu/aarch64 root cause
Upstream Dockerfile.cpu mounts --mount=type=cache,target=/vllm-workspace/.deps with no platform scoping. The x86 build populates .deps/onednn-src with a shallow oneDNN clone at the x86 default ref; the arm build reuses that same cache, needs the arm-pinned oneDNN commit 9c5be1cc…, tries to update the existing clone, and hits fatal: reference is not a tree (the commit isn't in the shallow x86 history). Both arches run sequentially on one runner under QEMU.
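The failure class can be reproduced outside the build with two throwaway repositories. This is only an illustrative sketch: the repositories, commit messages, and the function name demo_shallow_miss are made up, and it stands in for whatever the upstream build does when it updates the cached clone. It needs git on the PATH.

```shell
# Reproduces "commit not present in a shallow clone" with temporary repos.
demo_shallow_miss() (
  set -eu
  work=$(mktemp -d)
  trap 'rm -rf "$work"' EXIT
  commit() {
    git -c user.name=demo -c user.email=demo@example.invalid \
      commit -q --allow-empty -m "$1"
  }

  git init -q "$work/upstream"
  cd "$work/upstream"
  commit "commit the arm build pins"
  pinned=$(git rev-parse HEAD)
  commit "default ref the x86 build clones"

  # Depth-1 clone of the default ref, standing in for the cached .deps/onednn-src.
  git clone -q --depth 1 "file://$work/upstream" "$work/cache"
  cd "$work/cache"

  # The older pinned commit is not in the shallow history, so checkout fails.
  if git checkout -q "$pinned" 2>/dev/null; then
    echo "pinned commit found"
  else
    echo "pinned commit missing from shallow clone"
  fi
)

demo_shallow_miss
```

Dropping the 2>/dev/null redirect shows git's own error message for the missing commit.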
Both prebuilt nvidia arches are confirmed published on Docker Hub at the pinned nightly; rocm is amd64-only.
Decisions needed
1. Strategic — stay on vLLM master/nightly, or pin a stable release?
The repo deliberately tracks vLLM master (0b9f2c8; submodule sits 132 commits past v0.21.1rc0) for master-only features per the release notes: Gemma 4 presets, MTP speculative decoding, DGX Spark / GB10 hardware detection.
- Nightly is the sole reason cpu is built from source — vLLM ships no prebuilt cpu image at nightly commits. That source build is what produces all the QEMU / oneDNN / setuptools-scm complexity.
- If a stable release were pinned and vLLM publishes a stable cpu image, cpu becomes a prebuilt-image pull like nvidia/rocm and the entire source-build path disappears. Needs: (a) confirm those master-only features aren't required; (b) verify a stable cpu image exists.
2. aarch64 scope for the heavy variants — keep or drop?
- Drop nvidia/aarch64 → disk issue becomes trivial (one CUDA image).
- Drop cpu/aarch64 → removes the 2.5 h emulated build and the oneDNN cache bug entirely.
- Keeping either is doable but costs real CI time / runner disk every release.
3. If keeping arches, the in-place fixes
- nvidia disk: enable FREE_DISK_SPACE: true (currently commented out in all three workflow files) and prune the first arch's image before packing the second (FREE_DISK_SPACE alone may not fit two CUDA images).
- cpu/aarch64 cache: have patch-dockerfiles.sh make the .deps cache mount per-arch (e.g. id=deps-${TARGETARCH}) or drop the .deps cache; then accept ~2.5 h emulated builds per release.
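The per-arch cache mount could be patched in with a rewrite like the one below. It is a sketch under several assumptions: the mount text in Dockerfile.cpu matches the string quoted above exactly, the affected stage declares ARG TARGETARCH, and the Dockerfile frontend in use expands variables in mount options. The function name scope_deps_cache and the sample RUN line are hypothetical. When no id is given, BuildKit uses the target path as the cache id, which is why both platforms currently share one cache.

```shell
# Hypothetical filter: reads a Dockerfile on stdin, writes the patched one to stdout.
scope_deps_cache() {
  # Single quotes keep ${TARGETARCH} literal so the Docker build, not this shell,
  # expands it.
  sed 's|--mount=type=cache,target=/vllm-workspace/\.deps|--mount=type=cache,id=deps-${TARGETARCH},target=/vllm-workspace/.deps|g'
}

# The command after the mount is a placeholder, not the upstream RUN line.
printf 'RUN --mount=type=cache,target=/vllm-workspace/.deps cmake ..\n' | scope_deps_cache
```

With distinct ids, the arm build would start from its own empty .deps cache instead of inheriting the x86 shallow clone.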
cc @dr-bonez — the version + setuptools-scm bugs are fixed and on master; what's left is the arch/scope strategy and the nightly-vs-stable call, both product decisions rather than bugs.