feat: wire r3 v3 routed experts replay #2487
Conversation
```python
sample.routed_experts.extend(step_routed[prefix_len:])
if prefix_len > 0 and prefix_len <= step_routed.shape[0]:
    sample_routed_experts[prefix_len - 1] = step_routed[prefix_len - 1]
sample_routed_experts = np.concatenate((sample_routed_experts, step_routed[prefix_len:]), axis=0)
```
Mixed compact dtypes cause silent truncation during stitching
Medium Severity
When stitching multi-turn routed experts, _decode_routed_experts preserves each step's independently-chosen compact dtype. If step 1 serializes as uint8 (all expert IDs ≤ 255) and step 2 as int16 (some IDs > 255), the boundary replacement sample_routed_experts[prefix_len - 1] = step_routed[prefix_len - 1] writes int16 values into a uint8 array, silently truncating expert IDs via numpy overflow. The subsequent np.concatenate upcasts correctly, but the corrupted boundary value persists. This affects models with more than 255 experts where per-step value ranges happen to differ.
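A minimal sketch of the failure mode described above, with illustrative values (the array names and `prefix_len` mirror the flagged code, but the payloads are hypothetical). Writing an `int16` slice into a `uint8` array is an unsafe cast in NumPy and wraps silently, and the later `np.concatenate` upcast does not repair the corrupted boundary value:

```python
import numpy as np

prefix_len = 3
step1_routed = np.array([3, 17, 200], dtype=np.uint8)     # step 1 fit in uint8
step2_routed = np.array([5, 9, 300, 12], dtype=np.int16)  # step 2 needed int16

# Boundary replacement as in the flagged code: the int16 value 300
# wraps to 300 % 256 == 44 in the uint8 destination, with no error.
sample_routed = step1_routed.copy()
sample_routed[prefix_len - 1 : prefix_len] = step2_routed[prefix_len - 1 : prefix_len]
assert sample_routed[prefix_len - 1] == 44  # corrupted expert ID

# np.concatenate upcasts the result to int16, but the wrapped value persists.
stitched = np.concatenate((sample_routed, step2_routed[prefix_len:]), axis=0)
assert stitched.dtype == np.int16 and stitched[prefix_len - 1] == 44

# One possible fix: upcast to the common dtype before the boundary write.
common = np.result_type(step1_routed.dtype, step2_routed.dtype)
fixed = step1_routed.astype(common)
fixed[prefix_len - 1] = step2_routed[prefix_len - 1]
assert fixed[prefix_len - 1] == 300
```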
Reviewed by Cursor Bugbot for commit 9438623.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit d6d06b4.
```python
    .reshape(packed_routed_experts.shape)
    .to(torch.int32)
    .unsqueeze(0)
)
```
Read-only tensor from torch.frombuffer on immutable bytes
Medium Severity
torch.frombuffer is called on packed_routed_experts.data which is bytes (immutable). When the compact dtype is already int32, .to(torch.int32) is a no-op returning self, so the final tensor remains read-only and backed by the immutable buffer. For uint8/int16 sources, .to(torch.int32) creates a writable copy, making the behavior dtype-dependent. The analogous pixel_values conversion at line 228 correctly wraps in bytearray(...) to ensure mutability. Passing bytearray(packed_routed_experts.data) here would make the behavior consistent and safe regardless of source dtype.
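A minimal sketch of the suggested `bytearray(...)` wrap, with illustrative values. Copying the immutable `bytes` into a `bytearray` gives `torch.frombuffer` a writable buffer, so the resulting tensor is safe to mutate even when the compact dtype is already `int32` and `.to(torch.int32)` is a no-op:

```python
import numpy as np
import torch

ids = np.array([1, 2, 3, 4], dtype=np.int32)
payload = ids.tobytes()  # immutable bytes, like the transport struct's payload

# bytearray(payload) is a writable copy, so the frombuffer tensor never
# aliases the original immutable buffer, regardless of source dtype.
tensor = (
    torch.frombuffer(bytearray(payload), dtype=torch.int32)
    .reshape(2, 2)
    .to(torch.int32)  # no-op for int32 sources, copy for uint8/int16
    .unsqueeze(0)
)
tensor[0, 0, 0] = 99             # writes to the bytearray copy
assert payload == ids.tobytes()  # the original bytes are untouched
```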


Summary
- Export `choices[i].routed_experts` as compact base64 NumPy payloads from the prime-rl vLLM token/chat wrappers
- Populate the `TrainingSample.routed_experts` field using a `RoutedExperts` transport struct; `tolist()` is too expensive, so the struct carries raw bytes plus shape/dtype and the trainer tensorizes with `torch.frombuffer`
- Forbid `trainer.enable_router_replay` with `inference.kv_cache_offload`; CPU KV offload/router-cache recovery is intentionally not supported in this version
- Pin vLLM to the custom wheel `0.20.2rc1.dev354+g24337fb86.cu129` mirrored to the prime-rl `v0.5.0` release, and keep the prime-rl vLLM plugin patches for upstream compatibility (`7fdf522`)
- Bump `vllm-router` to release `0.1.24`, which includes P/D routed-experts stitching

Related PRs
Verification
- `uv lock --check`
- `uv run ruff check --config=pyproject.toml`
- `uv run ruff format --check --config=pyproject.toml`
- `uv run ruff check src/prime_rl/transport/types.py src/prime_rl/orchestrator/trajectories.py src/prime_rl/trainer/batch.py src/prime_rl/trainer/rl/data.py src/prime_rl/inference/vllm/routed_experts.py src/prime_rl/inference/vllm/serving_tokens.py src/prime_rl/inference/vllm/serving_chat_with_tokens.py src/prime_rl/inference/patches.py tests/unit/inference/test_serving_tokens.py tests/unit/orchestrator/test_batch.py tests/unit/orchestrator/test_trajectories.py`
- `git diff --check`
- `uv run python - <<'PY' ... transformers_v5_compat() ... PY` to verify the vLLM plugin patches and `DPEngineCoreProc` on the nightly wheel
- `uv run pytest tests/unit/inference/test_serving_tokens.py tests/unit/orchestrator/test_batch.py tests/unit/orchestrator/test_trajectories.py` (59 passed)

Note
Medium Risk
Changes the inference→orchestrator→trainer data contract for `routed_experts` (new packed-bytes struct and base64 NumPy payloads) and updates batch assembly/tensorization logic, which could break router replay or training if shape/dtype handling is off. Also pins to a custom vLLM wheel and adjusts vLLM monkey patches, increasing integration risk across upstream versions.

Overview
Enables router replay to consume compact routed-expert decisions end-to-end by exporting `choices[i].routed_experts` as a base64-encoded NumPy payload (new `serialize_routed_experts` / `RoutedExpertsCapture`) and updating both the chat and tokens vLLM serving wrappers to attach this field.

Refactors the training data path to avoid expensive `tolist()` conversions by introducing a `RoutedExperts` transport struct (raw `bytes` + shape + dtype) and updating trajectory stitching, batch packing/padding, and trainer tensorization (`torch.frombuffer`) to operate on the packed representation.

Adds a config validation that forbids `trainer.enable_router_replay` with `inference.kv_cache_offload`, tweaks vLLM DP pause/resume monkey patches to bypass upstream two-phase pause behavior, and updates dependencies/pins (custom `vllm` wheel, `verifiers` rev, uv lock updates including `tokenspeed-mla`).
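The packed transport described above can be sketched end-to-end. The names `RoutedExperts` and `serialize_routed_experts` come from this PR, but the field names, dtype-selection policy, and return shapes below are assumptions for illustration, not the actual implementation:

```python
import base64
from dataclasses import dataclass

import numpy as np
import torch


@dataclass
class RoutedExperts:
    """Hypothetical transport struct: raw bytes plus shape/dtype metadata."""
    data: bytes
    shape: tuple[int, ...]
    dtype: str


def _compact_dtype(max_id: int) -> np.dtype:
    # Illustrative policy: smallest of uint8/int16/int32 that fits every ID.
    if max_id <= np.iinfo(np.uint8).max:
        return np.dtype(np.uint8)
    if max_id <= np.iinfo(np.int16).max:
        return np.dtype(np.int16)
    return np.dtype(np.int32)


def serialize_routed_experts(routed: np.ndarray) -> tuple[str, tuple[int, ...], str]:
    # Inference side: ship a compact base64 payload instead of tolist().
    dtype = _compact_dtype(int(routed.max()))
    b64 = base64.b64encode(routed.astype(dtype).tobytes()).decode("ascii")
    return b64, routed.shape, dtype.name


# Orchestrator side: decode into the transport struct without tolist().
b64, shape, dtype_name = serialize_routed_experts(np.array([[3, 300], [17, 5]]))
packed = RoutedExperts(data=base64.b64decode(b64), shape=shape, dtype=dtype_name)

# Trainer side: bytearray keeps the buffer writable for any source dtype.
tensor = (
    torch.frombuffer(bytearray(packed.data), dtype=getattr(torch, packed.dtype))
    .reshape(packed.shape)
    .to(torch.int32)
    .unsqueeze(0)
)
assert tensor.shape == (1, 2, 2) and tensor[0, 0, 1] == 300
```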