Fix: add layerwise as default #2467

Merged
S1ro1 merged 3 commits into main from fix/nccl-layerwise-default on May 10, 2026
Conversation

@S1ro1 (Collaborator) commented May 10, 2026

Fixes broken NCCL weight transfer on vLLM 0.20.


Note

Medium Risk
Changes the hot weight-reload path for both filesystem and NCCL updates to use vLLM's layerwise reload APIs, which can affect runtime model correctness and stability during in-place updates. The risk is mitigated by keeping the kernel-format path intact and adding explicit MLA absorbed-weight recomputation on that path.

Overview
Switches checkpoint-format weight updates (filesystem and NCCL) to a new load_weights_checkpoint_layerwise flow that wraps model.load_weights(...) in vLLM's initialize_layerwise_reload/finalize_layerwise_reload and set_current_vllm_config, replacing the previous process_weights_after_loading postprocessing.
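The checkpoint-format flow above can be sketched as follows. This is a minimal, self-contained illustration of the call ordering only: the real vLLM hooks (set_current_vllm_config, initialize_layerwise_reload, finalize_layerwise_reload) are replaced by hypothetical stubs, and DummyModel stands in for the vLLM model object.

```python
from contextlib import contextmanager

calls = []  # records call order, for illustration only

@contextmanager
def set_current_vllm_config(config):
    # stand-in for vLLM's context manager that installs the active config
    calls.append("set_config")
    try:
        yield
    finally:
        calls.append("clear_config")

def initialize_layerwise_reload(model):
    # stub: prepares the model for an in-place, per-layer weight reload
    calls.append("init")

def finalize_layerwise_reload(model):
    # stub: re-derives post-load state once all layers have been reloaded
    calls.append("finalize")

def load_weights_checkpoint_layerwise(model, weights, vllm_config):
    """Wrap model.load_weights(...) in the layerwise reload protocol.

    Without the active-config context, config-dependent layers can fail
    with "Current vLLM config is not set" on the first in-place update.
    """
    with set_current_vllm_config(vllm_config):
        initialize_layerwise_reload(model)
        model.load_weights(weights)
        finalize_layerwise_reload(model)

class DummyModel:
    def load_weights(self, weights):
        calls.append("load_weights")

load_weights_checkpoint_layerwise(DummyModel(), weights=[], vllm_config=object())
```

The key point is that load_weights runs strictly inside the config context and between the initialize/finalize hooks, which is what the old process_weights_after_loading postprocessing did not guarantee.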

For NCCL kernel-format transfers, removes the old generic postprocess step and instead explicitly calls update_mla_absorbed_weights after load_weights_kernel, then returns early.
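A sketch of the kernel-format branching described above. All names here (apply_weight_update, the stubs, DummyModel) are hypothetical stand-ins for the PR's internals in weight_transfer.py; only the control flow is the point: kernel-format transfers recompute MLA absorbed weights explicitly and return early, while checkpoint-format transfers fall through to the layerwise flow.

```python
def load_weights_kernel(model, payload):
    # stub: loads kernel-format tensors received over NCCL
    model.log.append("kernel_load")

def update_mla_absorbed_weights(model):
    # stub: recomputes MLA absorbed (fused) weights after a kernel-format load
    model.log.append("mla_recompute")

def load_weights_checkpoint_layerwise(model, payload):
    # stub: the checkpoint-format layerwise reload flow
    model.log.append("layerwise_load")

def apply_weight_update(model, payload, fmt):
    if fmt == "kernel":
        load_weights_kernel(model, payload)
        update_mla_absorbed_weights(model)  # explicit, replacing the old generic postprocess
        return "kernel-path"                # early return: skip the layerwise flow
    load_weights_checkpoint_layerwise(model, payload)
    return "layerwise-path"

class DummyModel:
    def __init__(self):
        self.log = []
```

The early return keeps the kernel-format path intact, which is the mitigation the risk note above refers to.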

Reviewed by Cursor Bugbot for commit 25f17df.

@cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 718bbb8.

Comment thread: src/prime_rl/inference/vllm/worker/weight_transfer.py
S1ro1 and others added 2 commits May 10, 2026 18:57
Mirror the same fix as the NCCL path: replace
`model.load_weights + process_weights_after_loading` with
`load_weights_checkpoint_layerwise`. Without it, MoE + filesystem
transport hits the same `Current vLLM config is not set`
AssertionError at `worker/filesystem.py` on the first weight update.

Verified end-to-end on Qwen3-30B-A3B-Thinking-2507 + reverse-text:
filesystem broadcast returns 200 OK, orchestrator step 2 consumed
new weights cleanly (off-policy level 1, reward 0.3583), trainer
steady-state at ~29K tok/s, MFU 21%, no NaN, no errors.
@S1ro1 S1ro1 merged commit 319af10 into main May 10, 2026
8 of 10 checks passed

2 participants