CI: nvidia (disk) and cpu/aarch64 (oneDNN cache) failures — arch scope & nightly-vs-stable decisions needed #8

Description

@MattDHill

Background

The package had never passed a workflow. Two root-cause bugs were found and already fixed + pushed to master:

  • aca8046: startos/versions/current.ts had version: '0.21.1-rc0:0', which is invalid ExVer (pre-release alpha/numeric identifiers must be dot-delimited). It crashed start-cli s9pk pack at manifest evaluation for all variants, so every prior run died identically at ~1m49s. Fixed to 0.21.1-rc.0:0.
  • 4c1d1de: scripts/patch-dockerfiles.sh injected the dist-named SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM. vLLM's setup.py calls setuptools_scm.get_version() with no dist_name, and the transitively-bumped setuptools-scm v10 (now backed by vcs_versioning) no longer infers the project name from pyproject.toml in that call, so the named override never matched → setuptools-scm was unable to detect version. Switched to the generic SETUPTOOLS_SCM_PRETEND_VERSION (safe, since the Dockerfile builds one project). UPDATING.md updated to match.
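
To make the second bug concrete, here is a simplified model of the override lookup described above. It is a sketch, not setuptools-scm's actual code: the function name resolve_pretend_version is hypothetical, the name normalisation is an approximation, and 1.2.3 is a placeholder version rather than the value used in this repo.

```shell
# Simplified model of the override lookup order; NOT setuptools-scm's real code.
# Usage: resolve_pretend_version DIST_NAME   (DIST_NAME may be empty)
resolve_pretend_version() {
  dist_name=$1
  if [ -n "$dist_name" ]; then
    # Normalise e.g. "my-pkg" to "MY_PKG" to form the dist-specific variable name.
    suffix=$(printf '%s' "$dist_name" | tr '[:lower:]' '[:upper:]' | tr -c 'A-Z0-9' '_')
    eval "named=\${SETUPTOOLS_SCM_PRETEND_VERSION_FOR_${suffix}-}"
    if [ -n "$named" ]; then
      printf '%s\n' "$named"
      return 0
    fi
  fi
  # The generic override does not depend on knowing the project name.
  printf '%s\n' "${SETUPTOOLS_SCM_PRETEND_VERSION-}"
}

# 1.2.3 is a placeholder version, not the value used in the repo.
unset SETUPTOOLS_SCM_PRETEND_VERSION
SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM=1.2.3
echo "named only, no dist_name: '$(resolve_pretend_version "")'"
SETUPTOOLS_SCM_PRETEND_VERSION=1.2.3
echo "generic, no dist_name:    '$(resolve_pretend_version "")'"
```

With no dist_name, the first call prints an empty value because the named variable is never consulted; the second call finds the generic variable. That matches the behaviour seen in the failing builds.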

Current state (run 26610370261)

| variant | result | notes |
|---|---|---|
| rocm | ✅ built & packed | amd64-only, as designed |
| cpu / x86_64 | ✅ built & packed | source build works end-to-end |
| nvidia | ❌ out of disk | packs two CUDA images (amd64 8.6 GB + arm64 9.9 GB compressed; ~50 GB+ uncompressed in the docker store during pack) on one runner. Disk was so exhausted that the runner couldn't write its own logs (job log truncated to 2 lines). |
| cpu / aarch64 | ❌ after ~2.5 h | shared BuildKit cache clobber; see below |

So the fundamental packaging is fixed (rocm + cpu/x86 are green). Both remaining failures are on the aarch64 paths of the two heavy variants.

cpu/aarch64 root cause

Upstream Dockerfile.cpu mounts --mount=type=cache,target=/vllm-workspace/.deps with no platform scoping. The x86 build populates .deps/onednn-src with a shallow oneDNN clone at the x86 default ref; the arm build reuses that same cache, needs the arm-pinned oneDNN commit 9c5be1cc…, tries to update the existing clone, and hits fatal: reference is not a tree (the commit isn't in the shallow x86 history). Both arches run sequentially on one runner under QEMU.
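
The failure mode can be reproduced outside the build with plain git. The sketch below is illustrative only: it uses a throwaway local repository with two empty commits in place of oneDNN, and the function name demo_shallow_checkout_failure is hypothetical. It assumes git and mktemp are available.

```shell
# Reproduces "reference is not a tree" with a throwaway repo (not oneDNN itself).
demo_shallow_checkout_failure() {
  work=$(mktemp -d) || return 1
  git init -q "$work/upstream"
  # First commit stands in for the arm-pinned commit.
  git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.invalid \
    -c commit.gpgsign=false commit -q --allow-empty -m "pinned commit"
  pinned=$(git -C "$work/upstream" rev-parse HEAD)
  # Second commit stands in for the default ref that the x86 build clones.
  git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.invalid \
    -c commit.gpgsign=false commit -q --allow-empty -m "default ref"
  # file:// forces the normal transport so that --depth is honoured locally.
  git clone -q --depth 1 "file://$work/upstream" "$work/cache"
  # The pinned commit is outside the shallow history, so checkout fails.
  git -C "$work/cache" checkout -q "$pinned" 2>&1 || true
  rm -rf "$work"
}

demo_shallow_checkout_failure
```

The shared cache plays the role of the shallow clone here: whichever arch populates it first decides which history is present for the other.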

Both prebuilt nvidia arches are confirmed published on Docker Hub at the pinned nightly; rocm is amd64-only.

Decisions needed

1. Strategic — stay on vLLM master/nightly, or pin a stable release?

The repo deliberately tracks vLLM master (0b9f2c8; submodule sits 132 commits past v0.21.1rc0) for master-only features per the release notes: Gemma 4 presets, MTP speculative decoding, DGX Spark / GB10 hardware detection.

  • Nightly is the sole reason cpu is built from source — vLLM ships no prebuilt cpu image at nightly commits. That source build is what produces all the QEMU / oneDNN / setuptools-scm complexity.
  • If a stable release were pinned and vLLM publishes a stable cpu image, cpu becomes a prebuilt-image pull like nvidia/rocm and the entire source-build path disappears. Needs: (a) confirm those master-only features aren't required; (b) verify a stable cpu image exists.

2. aarch64 scope for the heavy variants — keep or drop?

  • Drop nvidia/aarch64 → disk issue becomes trivial (one CUDA image).
  • Drop cpu/aarch64 → removes the 2.5 h emulated build and the oneDNN cache bug entirely.
  • Keeping either is doable but costs real CI time / runner disk every release.

3. If keeping arches, the in-place fixes

  • nvidia disk: enable FREE_DISK_SPACE: true (currently commented out in all three workflow files) and prune the first arch's image before packing the second (FREE_DISK_SPACE alone may not fit two CUDA images).
  • cpu/aarch64 cache: have patch-dockerfiles.sh make the .deps cache mount per-arch (e.g. id=deps-${TARGETARCH}) or drop the .deps cache; then accept ~2.5 h emulated builds per release.
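
A minimal sketch of the per-arch cache rewrite, assuming the mount appears in Dockerfile.cpu exactly as quoted above. The helper name scope_deps_cache_per_arch is hypothetical and the real patch-dockerfiles.sh may be structured differently. It also assumes that BuildKit expands ${TARGETARCH} in the mount id, which should only happen when the stage declares ARG TARGETARCH; that is worth verifying against the Dockerfile frontend version in use.

```shell
# Hypothetical helper: reads a Dockerfile on stdin, writes the patched one to stdout.
# Adds an arch-specific id so x86 and arm builds stop sharing one .deps cache.
scope_deps_cache_per_arch() {
  sed 's|--mount=type=cache,target=/vllm-workspace/\.deps|--mount=type=cache,id=deps-${TARGETARCH},target=/vllm-workspace/.deps|g'
}

# Illustrative input line; "true" stands in for the real build command.
printf '%s\n' 'RUN --mount=type=cache,target=/vllm-workspace/.deps true' \
  | scope_deps_cache_per_arch
```

The single quotes keep ${TARGETARCH} literal so that it is expanded by the build, not by the shell running the patch script. Lines that do not contain the mount pass through unchanged.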

cc @dr-bonez — the version + setuptools-scm bugs are fixed and on master; what's left is the arch/scope strategy and the nightly-vs-stable call, both product decisions rather than bugs.
