CI: nvidia (disk) and cpu/aarch64 (oneDNN cache) failures — arch scope & nightly-vs-stable decisions needed

## Background

The package had never passed a workflow. Two root-cause bugs were found and **already fixed + pushed to `master`**:

- **`aca8046`** — `startos/versions/current.ts` had `version: '0.21.1-rc0:0'`, which is invalid ExVer (pre-release alpha/numeric identifiers must be dot-delimited). It crashed `start-cli s9pk pack` at *manifest evaluation* for **all** variants, so every prior run died identically at ~1m49s. Fixed to `0.21.1-rc.0:0`.
- **`4c1d1de`** — `scripts/patch-dockerfiles.sh` injected the dist-named `SETUPTOOLS_SCM_PRETEND_VERSION_FOR_VLLM`. vLLM's `setup.py` calls `setuptools_scm.get_version()` with no `dist_name`, and the transitively-bumped **setuptools-scm v10** (now backed by `vcs_versioning`) no longer infers the project name from `pyproject.toml` in that call, so the named override never matched → `setuptools-scm was unable to detect version`. Switched to the **generic** `SETUPTOOLS_SCM_PRETEND_VERSION` (safe — the Dockerfile builds one project). `UPDATING.md` updated to match.
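
The ExVer rule behind the first fix can be illustrated with a small shell check. This is a minimal sketch, not the validator `start-cli` uses: `prerelease_ok` is a hypothetical helper, it ignores any flavor prefix, and it assumes that "dot-delimited" means each dot-separated pre-release identifier is either all letters or all digits.

```shell
# Sketch only: checks just the rule described above, not the full ExVer grammar.
# prerelease_ok is a hypothetical helper, not part of start-cli.
prerelease_ok() {
  upstream=${1%%:*}             # drop the ":<downstream>" revision, e.g. ":0"
  case "$upstream" in
    *-*) pre=${upstream#*-} ;;  # text after the first "-" is the pre-release part
    *)   return 0 ;;            # no pre-release part, nothing to check
  esac
  # Assumption: each dot-separated identifier is all letters or all digits.
  printf '%s\n' "$pre" | grep -Eq '^([A-Za-z]+|[0-9]+)(\.([A-Za-z]+|[0-9]+))*$'
}

prerelease_ok '0.21.1-rc0:0'  || echo 'rejected: 0.21.1-rc0:0'
prerelease_ok '0.21.1-rc.0:0' && echo 'accepted: 0.21.1-rc.0:0'
```

Under that assumption the old value is rejected because `rc0` mixes letters and digits in one identifier, while `rc.0` splits them into two.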

## Current state (run [`26610370261`](https://github.com/Start9Labs/vllm-startos/actions/runs/26610370261))

| variant | result | notes |
|---|---|---|
| **rocm** | ✅ built & packed | amd64-only, as designed |
| **cpu / x86_64** | ✅ built & packed | source build works end-to-end |
| **nvidia** | ❌ **out of disk** | packs two CUDA images (amd64 8.6 GB + arm64 9.9 GB compressed; ~50 GB+ uncompressed in the docker store during `pack`) on one runner. The disk was so full that the runner couldn't write its own logs (job log truncated to 2 lines). |
| **cpu / aarch64** | ❌ after ~2.5 h | shared BuildKit cache clobber — see below |

So the fundamental packaging is fixed (rocm + cpu/x86 are green). Both remaining failures are on the **aarch64 paths of the two heavy variants**.

### cpu/aarch64 root cause
Upstream `Dockerfile.cpu` mounts `--mount=type=cache,target=/vllm-workspace/.deps` with **no platform scoping**. The x86 build populates `.deps/onednn-src` with a shallow oneDNN clone at the x86 default ref; the arm build reuses that same cache, needs the arm-pinned oneDNN commit `9c5be1cc…`, tries to *update* the existing clone, and hits `fatal: reference is not a tree` (the commit isn't in the shallow x86 history). Both arches run sequentially on one runner under QEMU.
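
The failure mode can be reproduced locally without Docker. The sketch below is a simplified stand-in for what happens inside the shared cache, not the actual build step: `repro_shallow_checkout` is a hypothetical helper that creates a throwaway two-commit repository, shallow-clones its tip (playing the role of the x86 clone), and then asks the clone for the older commit (playing the role of the arm pin).

```shell
# Sketch: reproduce "reference is not a tree" with throwaway repos.
# repro_shallow_checkout is a hypothetical helper; it needs only git and mktemp.
repro_shallow_checkout() (
  set -eu
  work=$(mktemp -d)
  trap 'rm -rf "$work"' EXIT
  ident='-c user.name=ci -c user.email=ci@example.invalid'

  # "Upstream" with two commits: the older one stands in for the arm pin.
  git init -q "$work/upstream"
  git -C "$work/upstream" $ident commit -q --allow-empty -m 'pinned commit'
  pinned=$(git -C "$work/upstream" rev-parse HEAD)
  git -C "$work/upstream" $ident commit -q --allow-empty -m 'default ref tip'

  # Shallow clone of the tip only, like the clone left in the shared cache.
  git clone -q --depth 1 "file://$work/upstream" "$work/cached"

  # Reusing that clone for the pinned commit fails: the object is not present.
  if git -C "$work/cached" checkout -q "$pinned" 2>"$work/err"; then
    echo 'checkout unexpectedly succeeded'
  else
    cat "$work/err"
  fi
)

repro_shallow_checkout
```

If the reproduction behaves like the CI run, the function prints a `fatal: reference is not a tree` line for the pinned commit, because a depth-1 clone holds only the tip commit.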

Both prebuilt nvidia arches are confirmed published on Docker Hub at the pinned nightly; rocm is amd64-only.

## Decisions needed

### 1. Strategic — stay on vLLM master/nightly, or pin a stable release?
The repo deliberately tracks vLLM **master** (`0b9f2c8`; submodule sits 132 commits past `v0.21.1rc0`) for master-only features per the release notes: **Gemma 4 presets, MTP speculative decoding, DGX Spark / GB10 hardware detection.**

- Nightly is the *sole reason* cpu is built from source — vLLM ships no prebuilt cpu image at nightly commits. That source build is what produces all the QEMU / oneDNN / setuptools-scm complexity.
- If a **stable** release is pinned *and* vLLM publishes a prebuilt cpu image for that release, cpu becomes a prebuilt-image pull like nvidia/rocm and the entire source-build path disappears. This needs two checks: (a) confirm the master-only features above aren't required; (b) verify that a stable cpu image exists.

### 2. aarch64 scope for the heavy variants — keep or drop?
- Drop **nvidia/aarch64** → disk issue becomes trivial (one CUDA image).
- Drop **cpu/aarch64** → removes the 2.5 h emulated build and the oneDNN cache bug entirely.
- Keeping either is doable but costs real CI time / runner disk every release.

### 3. If keeping arches, the in-place fixes
- **nvidia disk:** enable `FREE_DISK_SPACE: true` (currently commented out in all three workflow files) **and** prune the first arch's image before packing the second (`FREE_DISK_SPACE` alone may not free enough room for two CUDA images).
- **cpu/aarch64 cache:** have `patch-dockerfiles.sh` make the `.deps` cache mount per-arch (e.g. `id=deps-${TARGETARCH}`) or drop the `.deps` cache; then accept ~2.5 h emulated builds per release.
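
For the per-arch cache option, the patch could be a single `sed` substitution. This is a minimal sketch under assumptions: `scope_deps_cache_per_arch` is a hypothetical helper rather than existing code in `patch-dockerfiles.sh`, it assumes the mount option appears in `Dockerfile.cpu` exactly as quoted above, and the `cmake --build .` command in the usage line is only a placeholder. The build stage would probably also need an `ARG TARGETARCH` declaration for BuildKit to expand the variable.

```shell
# Hypothetical helper for scripts/patch-dockerfiles.sh:
# reads a Dockerfile on stdin and writes the patched Dockerfile to stdout.
scope_deps_cache_per_arch() {
  # Single quotes keep ${TARGETARCH} literal so BuildKit, not the shell, expands it.
  sed 's#--mount=type=cache,target=/vllm-workspace/\.deps#--mount=type=cache,id=deps-${TARGETARCH},target=/vllm-workspace/.deps#g'
}

# Usage with a placeholder RUN line:
printf '%s\n' 'RUN --mount=type=cache,target=/vllm-workspace/.deps cmake --build .' \
  | scope_deps_cache_per_arch
```

With a distinct `id` per architecture, the arm build would start from its own empty `.deps` cache instead of inheriting the x86 oneDNN clone. Lines that do not contain the mount option pass through unchanged.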

---
cc @dr-bonez — the version + setuptools-scm bugs are fixed and on `master`; what's left is the arch/scope strategy and the nightly-vs-stable call, both product decisions rather than bugs.

