feat: wire r3 v3 routed experts #2487

Open
S1ro1 wants to merge 14 commits into main from feat/r3-v3-routed-experts

Conversation

@S1ro1 (Collaborator) commented May 13, 2026

Summary

  • expose choices[i].routed_experts as compact base64 NumPy payloads from the prime-rl vLLM token/chat wrappers
  • keep routed experts as a first-class TrainingSample.routed_experts field via a RoutedExperts transport struct; tolist() is too expensive, so the struct carries raw bytes plus shape/dtype (a sketch follows this list)
  • stitch multi-turn routed experts by mutating the existing sample only, then load the packed bytes in the trainer with torch.frombuffer
  • reject trainer.enable_router_replay with inference.kv_cache_offload; CPU KV offload/router-cache recovery is intentionally not supported in this version
  • pin the upstream vLLM nightly wheel 0.20.2rc1.dev354+g24337fb86.cu129 mirrored to the prime-rl v0.5.0 release, and keep the prime-rl vLLM plugin patches for upstream compatibility
  • patch vLLM config validation in prime-rl to allow routed-experts capture with the NIXL connector; P/D routed experts are stitched by the router, while CPU KV offload remains rejected by prime-rl validation
  • pin verifiers to upstream main 7fdf522
  • pin vllm-router to release 0.1.24, which includes P/D routed-experts stitching
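
A minimal sketch of the transport pieces named above. RoutedExperts, serialize_routed_experts, and the bytes + shape + dtype layout come from this PR; the exact field names, the compact-dtype rule, and the payload dict shape are assumptions for illustration:

```python
import base64
from dataclasses import dataclass

import numpy as np


@dataclass
class RoutedExperts:
    """Transport struct: raw packed bytes plus the metadata needed to
    reinterpret them later, avoiding a costly tolist() round-trip."""

    data: bytes
    shape: tuple[int, ...]
    dtype: str  # e.g. "uint8", "int16", "int32"


def _compact_dtype(max_expert_id: int) -> np.dtype:
    # Assumed rule: smallest integer type that fits the largest expert ID
    # seen in this step (so per-step dtypes can legitimately differ).
    if max_expert_id <= np.iinfo(np.uint8).max:
        return np.dtype(np.uint8)
    if max_expert_id <= np.iinfo(np.int16).max:
        return np.dtype(np.int16)
    return np.dtype(np.int32)


def serialize_routed_experts(routed: np.ndarray) -> dict:
    """Pack per-token expert IDs into the compact base64 payload attached
    to choices[i].routed_experts by the serving wrappers."""
    compact = routed.astype(_compact_dtype(int(routed.max())))
    return {
        "data": base64.b64encode(compact.tobytes()).decode("ascii"),
        "shape": list(compact.shape),
        "dtype": str(compact.dtype),
    }


def deserialize_routed_experts(payload: dict) -> RoutedExperts:
    """Rebuild the transport struct on the orchestrator side."""
    return RoutedExperts(
        data=base64.b64decode(payload["data"]),
        shape=tuple(payload["shape"]),
        dtype=payload["dtype"],
    )
```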

Related PRs

Verification

  • uv lock --check
  • uv run ruff check --config=pyproject.toml
  • uv run ruff format --check --config=pyproject.toml
  • uv run ruff check src/prime_rl/transport/types.py src/prime_rl/orchestrator/trajectories.py src/prime_rl/trainer/batch.py src/prime_rl/trainer/rl/data.py src/prime_rl/inference/vllm/routed_experts.py src/prime_rl/inference/vllm/serving_tokens.py src/prime_rl/inference/vllm/serving_chat_with_tokens.py src/prime_rl/inference/patches.py tests/unit/inference/test_serving_tokens.py tests/unit/orchestrator/test_batch.py tests/unit/orchestrator/test_trajectories.py
  • git diff --check
  • uv run python - <<'PY' ... transformers_v5_compat() ... PY to verify the vLLM plugin patches DPEngineCoreProc on the nightly wheel
  • uv run pytest tests/unit/inference/test_serving_tokens.py tests/unit/orchestrator/test_batch.py tests/unit/orchestrator/test_trajectories.py (59 passed)

Note

Medium Risk
Changes the inference→orchestrator→trainer data contract for routed_experts (new packed-bytes struct and base64 NumPy payloads) and updates batch assembly/tensorization logic, which could break router-replay or training if shape/dtype handling is off. Also pins to a custom vLLM wheel and adjusts vLLM monkey patches, increasing integration risk across upstream versions.

Overview
Enables router replay to consume compact routed-expert decisions end-to-end by exporting choices[i].routed_experts as a base64-encoded NumPy payload (new serialize_routed_experts/RoutedExpertsCapture) and updating both the chat and tokens vLLM serving wrappers to attach this field.

Refactors the training data path to avoid expensive tolist() conversions by introducing a RoutedExperts transport struct (raw bytes + shape + dtype) and updating trajectory stitching, batch packing/padding, and trainer tensorization (torch.frombuffer) to operate on the packed representation.
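
A sketch of that trainer-side tensorization, continuing the RoutedExperts struct sketched under Summary. torch.frombuffer plus the reshape/to/unsqueeze chain appear in the diff below; the dtype map and the bytearray copy (see the Bugbot note further down) are illustrative:

```python
import torch

# Hypothetical dtype-string -> torch dtype map for the compact payloads.
_TORCH_DTYPES = {"uint8": torch.uint8, "int16": torch.int16, "int32": torch.int32}


def routed_experts_to_tensor(packed: "RoutedExperts") -> torch.Tensor:
    # bytearray(...) hands frombuffer a writable buffer; without it the
    # resulting tensor can alias the immutable bytes object.
    return (
        torch.frombuffer(bytearray(packed.data), dtype=_TORCH_DTYPES[packed.dtype])
        .reshape(packed.shape)
        .to(torch.int32)
        .unsqueeze(0)
    )
```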

Adds a config validation that forbids trainer.enable_router_replay with inference.kv_cache_offload, tweaks vLLM DP pause/resume monkey patches to bypass upstream two-phase pause behavior, and updates dependencies/pins (custom vllm wheel, verifiers rev, uv lock updates including tokenspeed-mla).
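
The validation rule itself is simple; here is a sketch of the shape it might take. Only the two flag names and the rejection come from the PR — whether prime-rl wires this through a pydantic model, and where the flags live, is assumed:

```python
from pydantic import BaseModel, model_validator


class RLConfig(BaseModel):
    enable_router_replay: bool = False  # trainer.enable_router_replay
    kv_cache_offload: bool = False      # inference.kv_cache_offload

    @model_validator(mode="after")
    def _reject_replay_with_kv_offload(self) -> "RLConfig":
        if self.enable_router_replay and self.kv_cache_offload:
            raise ValueError(
                "trainer.enable_router_replay cannot be combined with "
                "inference.kv_cache_offload: CPU KV offload cannot recover "
                "router decisions in this version"
            )
        return self
```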

Reviewed by Cursor Bugbot for commit 9438623.

@S1ro1 force-pushed the feat/r3-v3-routed-experts branch from bf79561 to 721a874 on May 13, 2026 at 12:13
@S1ro1 force-pushed the feat/r3-v3-routed-experts branch from bc91c30 to e55328f on May 14, 2026 at 14:09
@S1ro1 force-pushed the feat/r3-v3-routed-experts branch from e55328f to 1fea38e on May 14, 2026 at 14:13
@S1ro1 marked this pull request as ready for review on May 14, 2026 at 15:52
```diff
- sample.routed_experts.extend(step_routed[prefix_len:])
+ if prefix_len > 0 and prefix_len <= step_routed.shape[0]:
+     sample_routed_experts[prefix_len - 1] = step_routed[prefix_len - 1]
+ sample_routed_experts = np.concatenate((sample_routed_experts, step_routed[prefix_len:]), axis=0)
```
Mixed compact dtypes cause silent truncation during stitching

Medium Severity

When stitching multi-turn routed experts, _decode_routed_experts preserves each step's independently-chosen compact dtype. If step 1 serializes as uint8 (all expert IDs ≤ 255) and step 2 as int16 (some IDs > 255), the boundary replacement sample_routed_experts[prefix_len - 1] = step_routed[prefix_len - 1] writes int16 values into a uint8 array, silently truncating expert IDs via numpy overflow. The subsequent np.concatenate upcasts correctly, but the corrupted boundary value persists. This affects models with more than 255 experts where per-step value ranges happen to differ.
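
A minimal standalone repro of the wraparound described above (array values are illustrative):

```python
import numpy as np

# Step 1 fit in uint8 (all expert IDs <= 255); step 2 needed int16.
sample_routed_experts = np.array([[3, 17], [200, 255]], dtype=np.uint8)
step_routed = np.array([[300, 12], [301, 5]], dtype=np.int16)
prefix_len = 1

# Boundary replacement casts int16 into the uint8 array, wrapping mod 256.
sample_routed_experts[prefix_len - 1] = step_routed[prefix_len - 1]
print(sample_routed_experts[0])  # [44 12] -- expert 300 silently became 44

# The concatenate upcasts to int16, but the corrupted boundary row persists.
stitched = np.concatenate((sample_routed_experts, step_routed[prefix_len:]), axis=0)
print(stitched.dtype, stitched[0])  # int16 [44 12]
```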

Additional Locations (1)

Reviewed by Cursor Bugbot for commit 9438623.

@cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit d6d06b4.

```python
    .reshape(packed_routed_experts.shape)
    .to(torch.int32)
    .unsqueeze(0)
)
```

Read-only tensor from torch.frombuffer on immutable bytes

Medium Severity

torch.frombuffer is called on packed_routed_experts.data which is bytes (immutable). When the compact dtype is already int32, .to(torch.int32) is a no-op returning self, so the final tensor remains read-only and backed by the immutable buffer. For uint8/int16 sources, .to(torch.int32) creates a writable copy, making the behavior dtype-dependent. The analogous pixel_values conversion at line 228 correctly wraps in bytearray(...) to ensure mutability. Passing bytearray(packed_routed_experts.data) here would make the behavior consistent and safe regardless of source dtype.
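
A quick standalone demonstration of the dtype-dependent behavior (not the trainer code itself):

```python
import torch

payload = bytes(8)  # immutable bytes, like RoutedExperts.data

t32 = torch.frombuffer(payload, dtype=torch.int32)  # UserWarning: buffer not writable
print(t32.to(torch.int32).data_ptr() == t32.data_ptr())  # True: same dtype, .to() returns
                                                         # self, still aliasing the bytes

t16 = torch.frombuffer(payload, dtype=torch.int16)
print(t16.to(torch.int32).data_ptr() == t16.data_ptr())  # False: upcast made a fresh copy

safe = torch.frombuffer(bytearray(payload), dtype=torch.int32)  # writable from the start
safe[0] = 7  # OK, and only the bytearray copy is mutated
```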


Reviewed by Cursor Bugbot for commit d6d06b4.
