[KV Offload]: Tiered offloading + MoonCakeStore #2689

Merged. samsja merged 7 commits into main from mooncake-baby, Jun 2, 2026.

S1ro1 (Collaborator) commented Jun 2, 2026

Replace the flat `inference.kv_cache_offload` (single `cpu_bytes`) with a typed, backend-swappable, tiered config discriminated on `type`:

  • native: vLLM `OffloadingConnector` (cpu) / `TieringOffloadingSpec` (cpu+disk, fs_python secondary). Fully self-contained.
  • mooncake: per-node standalone-store (`mooncake_master` + `mooncake_client`, RDMA), cpu and/or disk tiers.

Composable `cpu`/`disk` tiers (cpu required, disk-only rejected); `disk.num_bytes` is mooncake-only (native fs tier is filesystem-bounded; warns if set on native).
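
As a rough illustration of the shape described above, the following is a minimal sketch using plain dataclasses. The class names, the `parse_offload` helper, and the warning text are assumptions made for this example and are not taken from the PR's code; a later commit in this PR also drops `disk.num_bytes` entirely.

```python
import warnings
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class CpuTier:
    num_bytes: int  # CPU RAM budget for offloaded KV blocks


@dataclass(frozen=True)
class DiskTier:
    path: str  # directory backing the disk tier
    num_bytes: Optional[int] = None  # only meaningful for the mooncake backend


@dataclass(frozen=True)
class KVCacheOffload:
    type: Literal["native", "mooncake"]  # discriminator selecting the backend
    cpu: CpuTier
    disk: Optional[DiskTier] = None


def parse_offload(raw: dict) -> KVCacheOffload:
    """Validate a raw mapping and build the typed config (hypothetical helper)."""
    backend = raw.get("type")
    if backend not in ("native", "mooncake"):
        raise ValueError(f"unknown offload type: {backend!r}")
    if "cpu" not in raw:
        # Covers the disk-only case: the cpu tier is always required.
        raise ValueError("cpu tier is required; disk-only offload is rejected")
    cpu = CpuTier(num_bytes=int(raw["cpu"]["num_bytes"]))
    disk = None
    if "disk" in raw:
        disk = DiskTier(path=raw["disk"]["path"], num_bytes=raw["disk"].get("num_bytes"))
        if backend == "native" and disk.num_bytes is not None:
            # The native fs tier is bounded by the filesystem, so the value is unused.
            warnings.warn("disk.num_bytes is ignored by the native backend")
    return KVCacheOffload(type=backend, cpu=cpu, disk=disk)
```

Discriminating on `type` keeps the tier fields shared between backends, so switching backend is a one-field change in the config.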

Centralize the vLLM `kv_transfer_config` build in `InferenceConfig.build_kv_transfer_config` (NIXL transfer + offload composed via `MultiConnector`); both SLURM templates stop hand-rolling the JSON and launch the node-local Mooncake store. New `src/prime_rl/inference/mooncake.py` launches the master+client and writes `MOONCAKE_CONFIG_PATH`; `server.py` brings it up for local runs.
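
The composition rule can be sketched as a free function. This is not the PR's `InferenceConfig.build_kv_transfer_config`; the signature is an assumption, the offload connector is passed in as an opaque dict, and any backend-specific `kv_connector_extra_config` keys are deliberately left out because they are not shown in this conversation.

```python
from typing import Optional


def build_kv_transfer_config(use_pd_kv_transfer: bool, offload: Optional[dict]) -> Optional[dict]:
    """Compose the transfer and offload connectors (illustrative signature)."""
    connectors = []
    if use_pd_kv_transfer:
        # Disaggregated prefill/decode moves KV between nodes via NIXL.
        connectors.append({"kv_connector": "NixlConnector", "kv_role": "kv_both"})
    if offload is not None:
        connectors.append(offload)

    if not connectors:
        return None  # neither feature enabled: no kv_transfer_config at all
    if len(connectors) == 1:
        return connectors[0]  # a single connector needs no wrapper
    # Both apply: wrap them so vLLM drives each connector in order.
    return {
        "kv_connector": "MultiConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {"connectors": connectors},
    }
```

Building the dict in one place means the SLURM templates only need to serialize the result instead of assembling JSON by hand.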

Verified:

  • Single-node (Qwen3-0.6B, H200): all four configs boot + serve; native-cpu and mooncake-rdma both round-trip KV (store ~1.1GB / reload ~205MB).
  • Multi-node 3-node SLURM disaggregated (Qwen3-30B-A3B): NIXL P/D transfer serves coherent output; with mooncake offload, MultiConnector[Nixl, MooncakeStore] + per-node RDMA store serves coherent output.

Note

High Risk
Breaking inference config and changes to the KV transfer/offload path for multi-node and P/D SLURM jobs, plus new distributed Mooncake/RDMA dependencies.

Overview
Replaces the flat inference.kv_cache_offload.cpu_bytes setting with a type-discriminated offload config: required cpu.num_bytes, optional disk.path, and backends native (vLLM OffloadingConnector / TieringOffloadingSpec) or mooncake (shared RDMA store on SLURM). Old cpu_bytes is rejected — configs must migrate per CHANGELOG.
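
For readers migrating existing configs, the mapping from the old flat key to the new shape might look like the sketch below. The CHANGELOG remains the authoritative migration guide; the helper name and the choice of `native` as the default backend are assumptions made here.

```python
def migrate_kv_cache_offload(old: dict, backend: str = "native") -> dict:
    """Map a flat {'cpu_bytes': N} table to the type-discriminated shape (hypothetical helper)."""
    if "type" in old:
        return dict(old)  # already in the new shape, nothing to do
    if "cpu_bytes" not in old:
        raise ValueError("expected either 'type' (new shape) or 'cpu_bytes' (old shape)")
    # The old single knob becomes the required cpu tier of the chosen backend.
    return {"type": backend, "cpu": {"num_bytes": old["cpu_bytes"]}}
```

For example, `migrate_kv_cache_offload({"cpu_bytes": 1 << 30})` yields a `native` config whose `cpu.num_bytes` carries the old value.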

InferenceConfig.build_kv_transfer_config now assembles vLLM kv_transfer_config (NIXL for disaggregated P/D via auto-set use_pd_kv_transfer, plus offload; MultiConnector when both apply). to_vllm() uses that instead of inline offload JSON. SLURM inference / multi_node_rl templates drop hand-built kv_transfer_config in VLLM_EXTRA and, for Mooncake, include _mooncake_store.sh.j2 (master/client, MOONCAKE_CONFIG_PATH) with cleanup for mooncake_* processes. RL/inference entrypoints pass Mooncake template vars; validate_mooncake_offload_requires_slurm blocks Mooncake without SLURM. Adds mooncake-transfer-engine and docs/changelog for the new shape.

Reviewed by Cursor Bugbot for commit 44f951d. Bugbot is set up for automated code reviews on this repo.

S1ro1 and others added 3 commits June 2, 2026 06:25
Replace the flat `inference.kv_cache_offload` (single `cpu_bytes`) with a typed,
backend-swappable, tiered config discriminated on `type`:

- native: vLLM `OffloadingConnector` (cpu) / `TieringOffloadingSpec` (cpu+disk,
  fs_python secondary). Fully self-contained.
- mooncake: per-node standalone-store (`mooncake_master` + `mooncake_client`, RDMA),
  cpu and/or disk tiers.

Composable `cpu`/`disk` tiers (cpu required, disk-only rejected); `disk.num_bytes`
is mooncake-only (native fs tier is filesystem-bounded; warns if set on native).

Centralize the vLLM `kv_transfer_config` build in
`InferenceConfig.build_kv_transfer_config` (NIXL transfer + offload composed via
`MultiConnector`); both SLURM templates stop hand-rolling the JSON and launch the
node-local Mooncake store. New `src/prime_rl/inference/mooncake.py` launches the
master+client and writes `MOONCAKE_CONFIG_PATH`; `server.py` brings it up for local
runs.

Verified:
- Single-node (Qwen3-0.6B, H200): all four configs boot + serve; native-cpu and
  mooncake-rdma both round-trip KV (store ~1.1GB / reload ~205MB).
- Multi-node 3-node SLURM disaggregated (Qwen3-30B-A3B): NIXL P/D transfer serves
  coherent output; with mooncake offload, MultiConnector[Nixl, MooncakeStore] +
  per-node RDMA store serves coherent output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rver

- DiskOffloadTier.num_bytes: removed. Neither backend enforces a disk byte quota
  (capacity is bounded by the filesystem at disk.path), so it was a no-op knob.
- MooncakeKVCacheOffloadConfig.metadata_server: removed. The launcher always hosts
  the HTTP metadata server on the node-local master and auto-derives the URL from the
  master host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/prime_rl/inference/mooncake.py Outdated
Comment thread src/prime_rl/inference/mooncake.py Outdated
Comment thread src/prime_rl/inference/mooncake.py Outdated
Comment thread packages/prime-rl-configs/src/prime_rl/configs/inference.py
Comment thread src/prime_rl/inference/mooncake.py Outdated
Address PR review: replace the Python launcher CLI (+ env.sh round-trip) with an
explicit bash launch in the SLURM templates, and document the breaking config change.

- New `templates/_mooncake_store.sh.j2` partial (shared via `{% include %}` by
  `inference.sbatch.j2` and `multi_node_rl.sbatch.j2`): launches the node-local
  `mooncake_master` + `mooncake_client`, writes the config JSON, and exports
  `PYTHONHASHSEED` / `MOONCAKE_CONFIG_PATH` inline — visible in the rendered script.
  Removes the hacky orphan-process CLI, the `print`, and the unquoted-`env.sh` round-trip.
- Delete `src/prime_rl/inference/mooncake.py` and the `server.py` store-launch hook.
- Mooncake offload is now SLURM-only: `inference_local` errors (when no store env is
  present) and an `RLConfig` validator rejects local RL + mooncake. Native offload is
  unchanged and still works locally. Entrypoints pass `kv_offload_{cpu_bytes,disk_path,
  device_name}` template vars.
- CHANGELOG.md: document the `inference.kv_cache_offload.cpu_bytes` -> discriminated
  `type` breaking change (per .cursor/BUGBOT.md).

Verified single-node SLURM (Qwen3-0.6B): store launches via the template bash,
MooncakeStoreConnector (standalone-store) loads, and the server serves coherent output.
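
The PR performs this step in bash inside the template; purely to illustrate what "writes the config JSON and exports `MOONCAKE_CONFIG_PATH`" amounts to, here is a Python sketch. Every JSON key, both port numbers, the file name, and the function itself are assumptions for illustration and should be checked against the rendered template and the Mooncake documentation.

```python
import json
import os
import socket
import tempfile
from pathlib import Path
from typing import Optional


def write_store_config(master_host: str, cpu_bytes: int, device_name: str = "",
                       out_dir: Optional[str] = None) -> Path:
    """Write a store config file and export its path (hypothetical helper)."""
    config = {
        "local_hostname": socket.gethostname(),
        "metadata_server": f"http://{master_host}:8080/metadata",  # assumed port and path
        "master_server_address": f"{master_host}:50051",  # assumed port
        "global_segment_size": cpu_bytes,  # assumed key for the cpu tier size
        "protocol": "rdma",
        "device_name": device_name,
    }
    directory = Path(out_dir) if out_dir is not None else Path(tempfile.mkdtemp())
    path = directory / "mooncake_config.json"  # assumed file name
    path.write_text(json.dumps(config, indent=2))
    # The connector locates the file through this environment variable.
    os.environ["MOONCAKE_CONFIG_PATH"] = str(path)
    return path
```

Doing the equivalent inline in the sbatch script, as this commit does, keeps the exported values visible in the rendered job file.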

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit cb7b4d4.

Comment thread pyproject.toml
S1ro1 and others added 3 commits June 2, 2026 07:36
Switch the Mooncake offload backend from per-node isolated stores to a single shared
distributed pool, so all inference nodes' CPU RAM (and disk) form one KV cache and a
prefix cached on any node is reusable by every other node/replica.

- `_mooncake_store.sh.j2`: the head inference node (first in the SLURM nodelist) runs the
  one master + HTTP metadata server; every node runs a client pointing at the head and
  contributes its segment. Store keys are model + parallel-rank + content-hash (no instance
  id), so the pool is reused across nodes and replicas.
- Drop the now-unused `MooncakeKVCacheOffloadConfig.master_server_address` (auto head).
- docs: describe the shared pool (total pool ≈ num_bytes × #inference-nodes).

Verified on a 2-node SLURM job (Qwen3-0.6B): one master on the head, both clients join
(2 segments), and a prefix cached on node A is served from node B over RDMA
(load_get ~160 MB cross-node hit).
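
The pool layout and the key scheme described above can be sketched as follows. Both helpers are hypothetical: the node list is assumed to be already expanded from the SLURM nodelist, and the key format and hash function are invented for illustration, since the commit only states which components go into a key.

```python
import hashlib
from typing import Dict, List


def plan_shared_pool(nodes: List[str], num_bytes: int) -> Dict[str, object]:
    """Pick the head node and estimate pool size for an expanded node list."""
    if not nodes:
        raise ValueError("at least one inference node is required")
    return {
        "head": nodes[0],        # first node hosts the master + metadata server
        "clients": list(nodes),  # every node, head included, contributes a segment
        "total_pool_bytes": num_bytes * len(nodes),  # approximate, per the docs note
    }


def store_key(model: str, parallel_rank: int, token_ids: List[int]) -> str:
    """Build a key from model, parallel rank, and content hash (illustrative format)."""
    content_hash = hashlib.sha256(repr(token_ids).encode()).hexdigest()[:16]
    # No instance id in the key, so any node or replica derives the same key.
    return f"{model}/rank{parallel_rank}/{content_hash}"
```

With two nodes and `num_bytes` of 64 GiB each, the plan reports a pool of about 128 GiB, and two replicas hashing the same prefix on the same rank arrive at the same key.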

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The RLConfig `validate_mooncake_offload_requires_slurm` config validator already guards
the misuse case; the per-request runtime check in `inference_local` (and its `import os`)
was redundant. Remove it.
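
The guard that remains is a config-time check. A minimal sketch of its logic, assuming a plain function in place of the actual `RLConfig` validator (the two-argument signature is an assumption):

```python
from typing import Optional


def validate_mooncake_offload_requires_slurm(offload_type: Optional[str], uses_slurm: bool) -> None:
    """Reject mooncake offload for non-SLURM runs (simplified stand-in)."""
    if offload_type == "mooncake" and not uses_slurm:
        raise ValueError(
            "mooncake KV offload is SLURM-only; use type='native' for local runs"
        )
```

Failing once at config validation avoids repeating the same check on every request in the local inference path.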

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
samsja merged commit c13409e into main on Jun 2, 2026
18 checks passed
snimu added a commit that referenced this pull request Jun 2, 2026
…air_rl cadence align)

Picks up 4e7aebf in research-configs: migrates [inference.kv_cache_offload]
from flat cpu_bytes to the new tiered/discriminated shape (PR #2689) on
the three forth-lang cells in active use (cmb_code_a0p005, rl_muon, rl),
and aligns glm45air_rl ckpt+eval cadence with the muon cell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>