[KV Offload]: Tiered offloading + MoonCakeStore #2689
Merged
Conversation
Replace the flat `inference.kv_cache_offload` (single `cpu_bytes`) with a typed, backend-swappable, tiered config discriminated on `type`:
- native: vLLM `OffloadingConnector` (cpu) / `TieringOffloadingSpec` (cpu+disk, fs_python secondary). Fully self-contained.
- mooncake: per-node standalone-store (`mooncake_master` + `mooncake_client`, RDMA), cpu and/or disk tiers.

Composable `cpu`/`disk` tiers (cpu required, disk-only rejected); `disk.num_bytes` is mooncake-only (native fs tier is filesystem-bounded; warns if set on native).

Centralize the vLLM `kv_transfer_config` build in `InferenceConfig.build_kv_transfer_config` (NIXL transfer + offload composed via `MultiConnector`); both SLURM templates stop hand-rolling the JSON and launch the node-local Mooncake store. New `src/prime_rl/inference/mooncake.py` launches the master+client and writes `MOONCAKE_CONFIG_PATH`; `server.py` brings it up for local runs.

Verified:
- Single-node (Qwen3-0.6B, H200): all four configs boot + serve; native-cpu and mooncake-rdma both round-trip KV (store ~1.1GB / reload ~205MB).
- Multi-node 3-node SLURM disaggregated (Qwen3-30B-A3B): NIXL P/D transfer serves coherent output; with mooncake offload, MultiConnector[Nixl, MooncakeStore] + per-node RDMA store serves coherent output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
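To make the `type`-discriminated shape concrete, here is a minimal sketch using only stdlib dataclasses. It is not the PR's code: the class names (`CpuTier`, `DiskTier`, `NativeOffload`, `MooncakeOffload`) and the `parse_kv_cache_offload` helper are hypothetical, and the sketch assumes the field names `cpu.num_bytes` and `disk.path` and leaves out `disk.num_bytes`.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Union


@dataclass(frozen=True)
class CpuTier:
    num_bytes: int  # CPU RAM budget for offloaded KV blocks


@dataclass(frozen=True)
class DiskTier:
    path: str  # directory backing the secondary tier


@dataclass(frozen=True)
class NativeOffload:  # hypothetical name
    cpu: CpuTier
    disk: Optional[DiskTier] = None
    type: Literal["native"] = "native"


@dataclass(frozen=True)
class MooncakeOffload:  # hypothetical name
    cpu: CpuTier
    disk: Optional[DiskTier] = None
    type: Literal["mooncake"] = "mooncake"


KVCacheOffload = Union[NativeOffload, MooncakeOffload]
_BACKENDS = {"native": NativeOffload, "mooncake": MooncakeOffload}


def parse_kv_cache_offload(raw: dict) -> KVCacheOffload:
    """Pick the backend class from `type`, then validate the tiers."""
    if "cpu_bytes" in raw:
        raise ValueError("flat `cpu_bytes` is not accepted; use `cpu.num_bytes`")
    backend = _BACKENDS.get(raw.get("type"))
    if backend is None:
        raise ValueError(f"unknown offload type: {raw.get('type')!r}")
    if "cpu" not in raw:
        # Also covers the disk-only case, which the PR rejects.
        raise ValueError("a `cpu` tier is required (disk-only is rejected)")
    cpu = CpuTier(num_bytes=int(raw["cpu"]["num_bytes"]))
    disk = DiskTier(path=raw["disk"]["path"]) if "disk" in raw else None
    return backend(cpu=cpu, disk=disk)
```

Keying the backend on `type` means swapping `native` for `mooncake` changes one field while the tier blocks stay the same, which is presumably what "backend-swappable" refers to.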
…rver
- DiskOffloadTier.num_bytes: removed. Neither backend enforces a disk byte quota (capacity is bounded by the filesystem at disk.path), so it was a no-op knob.
- MooncakeKVCacheOffloadConfig.metadata_server: removed. The launcher always hosts the HTTP metadata server on the node-local master and auto-derives the URL from the master host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
samsja
reviewed
Jun 2, 2026
Address PR review: replace the Python launcher CLI (+ env.sh round-trip) with an
explicit bash launch in the SLURM templates, and document the breaking config change.
- New `templates/_mooncake_store.sh.j2` partial (shared via `{% include %}` by
`inference.sbatch.j2` and `multi_node_rl.sbatch.j2`): launches the node-local
`mooncake_master` + `mooncake_client`, writes the config JSON, and exports
`PYTHONHASHSEED` / `MOONCAKE_CONFIG_PATH` inline — visible in the rendered script.
Removes the hacky orphan-process CLI, the `print`, and the unquoted-`env.sh` round-trip.
- Delete `src/prime_rl/inference/mooncake.py` and the `server.py` store-launch hook.
- Mooncake offload is now SLURM-only: `inference_local` errors (when no store env is
present) and an `RLConfig` validator rejects local RL + mooncake. Native offload is
unchanged and still works locally. Entrypoints pass `kv_offload_{cpu_bytes,disk_path,
device_name}` template vars.
- CHANGELOG.md: document the `inference.kv_cache_offload.cpu_bytes` -> discriminated
`type` breaking change (per .cursor/BUGBOT.md).
Verified single-node SLURM (Qwen3-0.6B): store launches via the template bash,
MooncakeStoreConnector (standalone-store) loads, and the server serves coherent output.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
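The breaking change noted for CHANGELOG.md amounts to moving the flat `cpu_bytes` value under a `cpu` tier and adding a `type`. A rough sketch of that migration on a plain dict follows; the helper name `migrate_kv_cache_offload` is hypothetical, the default backend is an assumption, and the actual CHANGELOG wording is not shown in this thread.

```python
def migrate_kv_cache_offload(old: dict, backend: str = "native") -> dict:
    """Rewrite a flat `cpu_bytes` offload section into the tiered shape.

    Sections that are already in the new shape are returned unchanged (as a copy).
    """
    if "cpu_bytes" not in old:
        return dict(old)
    # Carry over any other keys untouched; only `cpu_bytes` is relocated.
    new = {k: v for k, v in old.items() if k != "cpu_bytes"}
    new["type"] = backend  # assumption: native is the closest match to the old behaviour
    new["cpu"] = {"num_bytes": old["cpu_bytes"]}
    return new


old_section = {"cpu_bytes": 8 * 1024**3}
print(migrate_kv_cache_offload(old_section))
```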
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit cb7b4d4.
Switch the Mooncake offload backend from per-node isolated stores to a single shared distributed pool, so all inference nodes' CPU RAM (and disk) form one KV cache and a prefix cached on any node is reusable by every other node/replica.
- `_mooncake_store.sh.j2`: the head inference node (first in the SLURM nodelist) runs the one master + HTTP metadata server; every node runs a client pointing at the head and contributes its segment. Store keys are model + parallel-rank + content-hash (no instance id), so the pool is reused across nodes and replicas.
- Drop the now-unused `MooncakeKVCacheOffloadConfig.master_server_address` (auto head).
- docs: describe the shared pool (total pool ≈ num_bytes × #inference-nodes).

Verified on a 2-node SLURM job (Qwen3-0.6B): one master on the head, both clients join (2 segments), and a prefix cached on node A is served from node B over RDMA (load_get ~160 MB cross-node hit).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
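The shared-pool arithmetic and the keying idea can be illustrated with a small sketch. Everything here is hypothetical: `head_node`, `pool_bytes`, and `store_key` are illustration-only helpers, and the real key layout is whatever the Mooncake store connector produces. The sketch only shows why a key built from model + parallel rank + content hash, with no instance id, resolves to the same entry from any node.

```python
import hashlib
from typing import Sequence


def head_node(nodelist: Sequence[str]) -> str:
    """The first inference node hosts the single master + metadata server."""
    return nodelist[0]


def pool_bytes(cpu_num_bytes: int, num_inference_nodes: int) -> int:
    """Approximate pool size: every node contributes one segment of num_bytes."""
    return cpu_num_bytes * num_inference_nodes


def store_key(model: str, parallel_rank: int, block_tokens: Sequence[int]) -> str:
    """Illustrative key: no node or instance id, so any replica derives the same key."""
    digest = hashlib.sha256(repr(tuple(block_tokens)).encode()).hexdigest()[:16]
    return f"{model}@rank{parallel_rank}@{digest}"


nodes = ["node-a", "node-b"]
print(head_node(nodes), pool_bytes(8 * 1024**3, len(nodes)))
print(store_key("Qwen3-0.6B", 0, [1, 2, 3]))
```

Including the parallel rank in the key keeps shards from different ranks apart, while leaving out the instance id is what lets node B reuse a prefix cached by node A.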
The RLConfig `validate_mooncake_offload_requires_slurm` config validator already guards the misuse case; the per-request runtime check in `inference_local` (and its `import os`) was redundant. Remove it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
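A config-time guard of this kind might look like the sketch below. The dataclasses `OffloadSketch` and `RunSketch` and the `slurm` flag are hypothetical stand-ins for the real `RLConfig` fields, which are not shown in this thread; only the validator's name comes from the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OffloadSketch:  # hypothetical stand-in for the offload config
    type: str


@dataclass
class RunSketch:  # hypothetical stand-in for RLConfig
    kv_cache_offload: Optional[OffloadSketch] = None
    slurm: bool = False


def validate_mooncake_offload_requires_slurm(cfg: RunSketch) -> RunSketch:
    """Reject mooncake offload for local runs once, when the config is loaded."""
    offload = cfg.kv_cache_offload
    if offload is not None and offload.type == "mooncake" and not cfg.slurm:
        raise ValueError("mooncake KV offload requires a SLURM deployment")
    return cfg
```

Checking once at load time is why a second check inside the request path adds nothing: an invalid combination never reaches the server.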
samsja
approved these changes
Jun 2, 2026
snimu
added a commit
that referenced
this pull request
Jun 2, 2026
…air_rl cadence align) Picks up 4e7aebf in research-configs: migrates [inference.kv_cache_offload] from flat cpu_bytes to the new tiered/discriminated shape (PR #2689) on the three forth-lang cells in active use (cmb_code_a0p005, rl_muon, rl), and aligns glm45air_rl ckpt+eval cadence with the muon cell. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Note
High Risk
Breaking inference config and changes to the KV transfer/offload path for multi-node and P/D SLURM jobs, plus new distributed Mooncake/RDMA dependencies.
Overview
Replaces the flat `inference.kv_cache_offload.cpu_bytes` setting with a `type`-discriminated offload config: required `cpu.num_bytes`, optional `disk.path`, and backends `native` (vLLM `OffloadingConnector` / `TieringOffloadingSpec`) or `mooncake` (shared RDMA store on SLURM). Old `cpu_bytes` is rejected; configs must migrate per CHANGELOG.

`InferenceConfig.build_kv_transfer_config` now assembles vLLM `kv_transfer_config` (NIXL for disaggregated P/D via auto-set `use_pd_kv_transfer`, plus offload; `MultiConnector` when both apply). `to_vllm()` uses that instead of inline offload JSON. SLURM `inference` / `multi_node_rl` templates drop hand-built `kv_transfer_config` in `VLLM_EXTRA` and, for Mooncake, include `_mooncake_store.sh.j2` (master/client, `MOONCAKE_CONFIG_PATH`) with cleanup for `mooncake_*` processes. RL/inference entrypoints pass Mooncake template vars; `validate_mooncake_offload_requires_slurm` blocks Mooncake without SLURM. Adds `mooncake-transfer-engine` and docs/changelog for the new shape.

Reviewed by Cursor Bugbot for commit 44f951d. Bugbot is set up for automated code reviews on this repo.
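The composition rule described in the overview (one connector alone, or `MultiConnector` when both transfer and offload apply) can be sketched as a free function. This is an assumption-laden illustration rather than the body of `InferenceConfig.build_kv_transfer_config`: the signature is hypothetical, `kv_role` is fixed to `kv_both` for simplicity, and the per-tier `kv_connector_extra_config` (CPU size, disk path) is omitted because its exact keys are not shown in this thread.

```python
import json
from typing import Optional


def build_kv_transfer_config(use_pd_kv_transfer: bool, offload_type: Optional[str]) -> Optional[dict]:
    """Compose the vLLM kv_transfer_config from P/D transfer and offload settings."""
    connectors = []
    if use_pd_kv_transfer:
        connectors.append({"kv_connector": "NixlConnector", "kv_role": "kv_both"})
    if offload_type == "native":
        connectors.append({"kv_connector": "OffloadingConnector", "kv_role": "kv_both"})
    elif offload_type == "mooncake":
        connectors.append({"kv_connector": "MooncakeStoreConnector", "kv_role": "kv_both"})
    elif offload_type is not None:
        raise ValueError(f"unknown offload type: {offload_type!r}")

    if not connectors:
        return None  # nothing to pass to vLLM
    if len(connectors) == 1:
        return connectors[0]  # a lone connector needs no wrapper
    return {
        "kv_connector": "MultiConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {"connectors": connectors},
    }


# The resulting dict would be serialized for vLLM's --kv-transfer-config flag.
print(json.dumps(build_kv_transfer_config(True, "mooncake")))
```

Building the dict in one place, rather than in each SLURM template, keeps the connector ordering and the single-versus-multi decision consistent across launch paths.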