[KV Offload]: Tiered offloading + MoonCakeStore by S1ro1 · Pull Request #2689 · PrimeIntellect-ai/prime-rl

S1ro1 · 2026-06-02T00:57:14Z

Replace the flat inference.kv_cache_offload (single cpu_bytes) with a typed, backend-swappable, tiered config discriminated on type:

- native: vLLM OffloadingConnector (cpu) / TieringOffloadingSpec (cpu+disk, fs_python secondary). Fully self-contained.
- mooncake: per-node standalone-store (mooncake_master + mooncake_client, RDMA), cpu and/or disk tiers.

Composable cpu/disk tiers (cpu required, disk-only rejected); disk.num_bytes is mooncake-only (native fs tier is filesystem-bounded; warns if set on native).

Centralize the vLLM kv_transfer_config build in InferenceConfig.build_kv_transfer_config (NIXL transfer + offload composed via MultiConnector); both SLURM templates stop hand-rolling the JSON and launch the node-local Mooncake store. New src/prime_rl/inference/mooncake.py launches the master+client and writes MOONCAKE_CONFIG_PATH; server.py brings it up for local runs.
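
As a rough illustration of the composition described above, the sketch below builds a single kv_transfer_config dict from an optional NIXL transfer connector and an optional offload connector, wrapping them in MultiConnector only when both apply. It is not the PR's implementation: the standalone function signature, the `NixlConnector` / `OffloadingConnector` connector strings, and the `cpu_bytes_to_use` extra-config key are assumptions based on the general layout of vLLM's kv_transfer_config and may differ from what `InferenceConfig.build_kv_transfer_config` actually emits.

```python
import json
from typing import Optional


def build_kv_transfer_config(use_pd_kv_transfer: bool, offload: Optional[dict]) -> Optional[dict]:
    """Compose transfer and offload connectors into one kv_transfer_config dict.

    Hypothetical standalone version; the real method lives on InferenceConfig.
    """
    connectors = []
    if use_pd_kv_transfer:
        # Disaggregated prefill/decode: KV blocks move between P and D over NIXL.
        connectors.append({"kv_connector": "NixlConnector", "kv_role": "kv_both"})
    if offload is not None:
        connectors.append(offload)

    if not connectors:
        return None  # nothing to configure
    if len(connectors) == 1:
        return connectors[0]  # a single connector needs no wrapper
    # Both apply: wrap them in a MultiConnector, transfer first, offload second.
    return {
        "kv_connector": "MultiConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {"connectors": connectors},
    }


# Assumed shape of a native CPU offload connector entry (key name is an assumption).
native_cpu = {
    "kv_connector": "OffloadingConnector",
    "kv_role": "kv_both",
    "kv_connector_extra_config": {"cpu_bytes_to_use": 8 * 1024**3},
}

print(json.dumps(build_kv_transfer_config(True, native_cpu), indent=2))
```

Keeping this logic in one place means the SLURM templates only need to serialize the result, rather than each maintaining its own copy of the JSON.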

Verified:

- Single-node (Qwen3-0.6B, H200): all four configs boot + serve; native-cpu and mooncake-rdma both round-trip KV (store ~1.1GB / reload ~205MB).
- Multi-node 3-node SLURM disaggregated (Qwen3-30B-A3B): NIXL P/D transfer serves coherent output; with mooncake offload, MultiConnector[Nixl, MooncakeStore] + per-node RDMA store serves coherent output.

Note

High Risk
Breaking inference config and changes to the KV transfer/offload path for multi-node and P/D SLURM jobs, plus new distributed Mooncake/RDMA dependencies.

Overview
Replaces the flat inference.kv_cache_offload.cpu_bytes setting with a type-discriminated offload config: required cpu.num_bytes, optional disk.path, and backends native (vLLM OffloadingConnector / TieringOffloadingSpec) or mooncake (shared RDMA store on SLURM). Old cpu_bytes is rejected — configs must migrate per CHANGELOG.
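
To make the new shape concrete, here is a minimal sketch of a type-discriminated offload config with a required cpu tier and an optional disk tier. It uses plain dataclasses as a stand-in; the class names, the `parse_offload` helper, and the error messages are hypothetical and are not taken from the PR, which defines its own config models.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class CpuTier:
    num_bytes: int  # required CPU tier size


@dataclass
class DiskTier:
    path: str  # capacity is bounded by the filesystem at this path


@dataclass
class NativeOffload:
    cpu: CpuTier
    disk: Optional[DiskTier] = None
    type: str = "native"


@dataclass
class MooncakeOffload:
    cpu: CpuTier
    disk: Optional[DiskTier] = None
    type: str = "mooncake"


# The `type` field selects the backend class (hypothetical registry).
BACKENDS = {"native": NativeOffload, "mooncake": MooncakeOffload}


def parse_offload(raw: dict) -> Union[NativeOffload, MooncakeOffload]:
    """Dispatch on `type`, rejecting the legacy flat shape and disk-only configs."""
    if "cpu_bytes" in raw:
        raise ValueError("flat cpu_bytes is no longer accepted; use cpu.num_bytes with a type")
    backend = BACKENDS.get(raw.get("type"))
    if backend is None:
        raise ValueError(f"unknown offload type: {raw.get('type')!r}")
    if "cpu" not in raw:
        raise ValueError("cpu tier is required; disk-only offload is rejected")
    disk = DiskTier(**raw["disk"]) if "disk" in raw else None
    return backend(cpu=CpuTier(**raw["cpu"]), disk=disk)


cfg = parse_offload({"type": "mooncake", "cpu": {"num_bytes": 8 * 1024**3}, "disk": {"path": "/tmp/kv"}})
print(cfg)
```

A config that still sets `cpu_bytes`, or that supplies only a `disk` tier, fails at parse time in this sketch, which mirrors the rejection behavior described above.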

InferenceConfig.build_kv_transfer_config now assembles vLLM kv_transfer_config (NIXL for disaggregated P/D via auto-set use_pd_kv_transfer, plus offload; MultiConnector when both apply). to_vllm() uses that instead of inline offload JSON. SLURM inference / multi_node_rl templates drop hand-built kv_transfer_config in VLLM_EXTRA and, for Mooncake, include _mooncake_store.sh.j2 (master/client, MOONCAKE_CONFIG_PATH) with cleanup for mooncake_* processes. RL/inference entrypoints pass Mooncake template vars; validate_mooncake_offload_requires_slurm blocks Mooncake without SLURM. Adds mooncake-transfer-engine and docs/changelog for the new shape.

Reviewed by Cursor Bugbot for commit 44f951d.

…rver

- DiskOffloadTier.num_bytes: removed. Neither backend enforces a disk byte quota (capacity is bounded by the filesystem at disk.path), so it was a no-op knob.
- MooncakeKVCacheOffloadConfig.metadata_server: removed. The launcher always hosts the HTTP metadata server on the node-local master and auto-derives the URL from the master host.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Address PR review: replace the Python launcher CLI (+ env.sh round-trip) with an explicit bash launch in the SLURM templates, and document the breaking config change.

- New `templates/_mooncake_store.sh.j2` partial (shared via `{% include %}` by `inference.sbatch.j2` and `multi_node_rl.sbatch.j2`): launches the node-local `mooncake_master` + `mooncake_client`, writes the config JSON, and exports `PYTHONHASHSEED` / `MOONCAKE_CONFIG_PATH` inline (visible in the rendered script). Removes the hacky orphan-process CLI, the `print`, and the unquoted-`env.sh` round-trip.
- Delete `src/prime_rl/inference/mooncake.py` and the `server.py` store-launch hook.
- Mooncake offload is now SLURM-only: `inference_local` errors (when no store env is present) and an `RLConfig` validator rejects local RL + mooncake. Native offload is unchanged and still works locally. Entrypoints pass `kv_offload_{cpu_bytes,disk_path,device_name}` template vars.
- CHANGELOG.md: document the `inference.kv_cache_offload.cpu_bytes` -> discriminated `type` breaking change (per .cursor/BUGBOT.md).

Verified single-node SLURM (Qwen3-0.6B): store launches via the template bash, MooncakeStoreConnector (standalone-store) loads, and the server serves coherent output.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit cb7b4d4.

Switch the Mooncake offload backend from per-node isolated stores to a single shared distributed pool, so all inference nodes' CPU RAM (and disk) form one KV cache and a prefix cached on any node is reusable by every other node/replica.

- `_mooncake_store.sh.j2`: the head inference node (first in the SLURM nodelist) runs the one master + HTTP metadata server; every node runs a client pointing at the head and contributes its segment. Store keys are model + parallel-rank + content-hash (no instance id), so the pool is reused across nodes and replicas.
- Drop the now-unused `MooncakeKVCacheOffloadConfig.master_server_address` (auto head).
- docs: describe the shared pool (total pool ≈ num_bytes × #inference-nodes).

Verified on a 2-node SLURM job (Qwen3-0.6B): one master on the head, both clients join (2 segments), and a prefix cached on node A is served from node B over RDMA (load_get ~160 MB cross-node hit).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
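
The following sketch illustrates why omitting the instance id from the store key lets any node reuse a prefix cached by another, and how the pool size estimate and head-node choice from this commit work out. The function names, the key format, and the use of SHA-256 over token ids are illustrative assumptions; the actual key derivation is done inside the connector and is not shown in this PR.

```python
import hashlib
from typing import List


def store_key(model: str, parallel_rank: int, token_ids: List[int]) -> str:
    """Key built from model + parallel rank + content hash, with no instance id."""
    content_hash = hashlib.sha256(repr(token_ids).encode()).hexdigest()[:16]
    return f"{model}:rank{parallel_rank}:{content_hash}"


def head_node(nodelist: List[str]) -> str:
    """The first node in the SLURM nodelist hosts the single master."""
    return nodelist[0]


def total_pool_bytes(num_bytes_per_node: int, num_inference_nodes: int) -> int:
    """Approximate shared pool size: each node contributes one segment."""
    return num_bytes_per_node * num_inference_nodes


prefix = [101, 2023, 2003, 1037]
key_on_node_a = store_key("Qwen3-0.6B", 0, prefix)
key_on_node_b = store_key("Qwen3-0.6B", 0, prefix)
# Same model, rank, and content give the same key on every node, so B can hit A's entry.
print(key_on_node_a == key_on_node_b)
print(head_node(["node-a", "node-b"]))
print(total_pool_bytes(8 * 1024**3, 2))
```

If the key included a per-instance identifier, the two lookups above would differ and the cross-node hit reported in the verification would not be possible.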

The RLConfig `validate_mooncake_offload_requires_slurm` config validator already guards the misuse case; the per-request runtime check in `inference_local` (and its `import os`) was redundant. Remove it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
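
For context, a config-time guard of this kind might look like the sketch below, which fails as soon as the config object is constructed, so no per-request check is needed later. The `RLConfigSketch` class and its `offload_type` / `uses_slurm` fields are hypothetical simplifications; only the validator name comes from the PR, and its real body is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RLConfigSketch:
    offload_type: Optional[str] = None  # "native", "mooncake", or None (hypothetical field)
    uses_slurm: bool = False            # hypothetical stand-in for the SLURM settings

    def __post_init__(self) -> None:
        self.validate_mooncake_offload_requires_slurm()

    def validate_mooncake_offload_requires_slurm(self) -> None:
        # Mooncake needs the store launched by the SLURM template, so reject local runs early.
        if self.offload_type == "mooncake" and not self.uses_slurm:
            raise ValueError("mooncake KV offload requires a SLURM deployment")


RLConfigSketch(offload_type="native")                    # native still works locally
RLConfigSketch(offload_type="mooncake", uses_slurm=True)  # accepted
try:
    RLConfigSketch(offload_type="mooncake")
except ValueError as err:
    print(f"rejected: {err}")
```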

…air_rl cadence align) Picks up 4e7aebf in research-configs: migrates [inference.kv_cache_offload] from flat cpu_bytes to the new tiered/discriminated shape (PR #2689) on the three forth-lang cells in active use (cmb_code_a0p005, rl_muon, rl), and aligns glm45air_rl ckpt+eval cadence with the muon cell. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

S1ro1 and others added 3 commits June 2, 2026 06:25

Merge branch 'main' into mooncake-baby

f37b0be

samsja reviewed Jun 2, 2026

Comment thread src/prime_rl/inference/mooncake.py Outdated

samsja reviewed Jun 2, 2026

Comment thread src/prime_rl/inference/mooncake.py Outdated

samsja reviewed Jun 2, 2026

Comment thread src/prime_rl/inference/mooncake.py Outdated

cursor Bot reviewed Jun 2, 2026

Comment thread packages/prime-rl-configs/src/prime_rl/configs/inference.py

Comment thread src/prime_rl/inference/mooncake.py Outdated

cursor Bot reviewed Jun 2, 2026

Comment thread pyproject.toml

S1ro1 and others added 3 commits June 2, 2026 07:36

Merge branch 'main' into mooncake-baby

b523c28

samsja approved these changes Jun 2, 2026

samsja merged commit c13409e into main Jun 2, 2026
18 checks passed

samsja merged 7 commits into main from mooncake-baby
