Fix: add layerwise as default #2467

Merged
S1ro1 merged 3 commits into main from fix/nccl-layerwise-default on May 10, 2026
Conversation

@S1ro1 (Collaborator) commented May 10, 2026

Fixes broken NCCL weight transfer on vLLM 0.20.


Note

Medium Risk
Changes the hot weight-reload path for both filesystem and NCCL updates to use vLLM's layerwise reload APIs, which can affect runtime model correctness and stability during in-place updates. The risk is mitigated by keeping the kernel-format path intact and adding explicit MLA absorbed-weight recomputation on that path.

Overview
Switches checkpoint-format weight updates (filesystem and NCCL) to a new load_weights_checkpoint_layerwise flow that wraps model.load_weights(...) in vLLM's initialize_layerwise_reload/finalize_layerwise_reload and set_current_vllm_config, replacing the previous process_weights_after_loading postprocessing.
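The checkpoint-format flow above can be sketched as follows. This is a minimal, self-contained illustration of the call ordering only: the real vLLM hooks (set_current_vllm_config, initialize_layerwise_reload, finalize_layerwise_reload) are replaced by hypothetical stubs, and DummyModel stands in for the vLLM model object.

```python
from contextlib import contextmanager

calls = []  # records call order, for illustration only

@contextmanager
def set_current_vllm_config(config):
    # stand-in for vLLM's context manager that installs the active config
    calls.append("set_config")
    try:
        yield
    finally:
        calls.append("clear_config")

def initialize_layerwise_reload(model):
    # stub: prepares the model for an in-place, per-layer weight reload
    calls.append("init")

def finalize_layerwise_reload(model):
    # stub: re-derives post-load state once all layers have been reloaded
    calls.append("finalize")

def load_weights_checkpoint_layerwise(model, weights, vllm_config):
    """Wrap model.load_weights(...) in the layerwise reload protocol.

    Without the active-config context, config-dependent layers can fail
    with "Current vLLM config is not set" on the first in-place update.
    """
    with set_current_vllm_config(vllm_config):
        initialize_layerwise_reload(model)
        model.load_weights(weights)
        finalize_layerwise_reload(model)

class DummyModel:
    def load_weights(self, weights):
        calls.append("load_weights")

load_weights_checkpoint_layerwise(DummyModel(), weights=[], vllm_config=object())
```

The key point is that load_weights runs strictly inside the config context and between the initialize/finalize hooks, which is what the old process_weights_after_loading postprocessing did not guarantee.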

For NCCL kernel-format transfers, removes the old generic postprocess step and instead explicitly calls update_mla_absorbed_weights after load_weights_kernel, then returns early.
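A sketch of the kernel-format branching described above. All names here (apply_weight_update, the stubs, DummyModel) are hypothetical stand-ins for the PR's internals in weight_transfer.py; only the control flow is the point: kernel-format transfers recompute MLA absorbed weights explicitly and return early, while checkpoint-format transfers fall through to the layerwise flow.

```python
def load_weights_kernel(model, payload):
    # stub: loads kernel-format tensors received over NCCL
    model.log.append("kernel_load")

def update_mla_absorbed_weights(model):
    # stub: recomputes MLA absorbed (fused) weights after a kernel-format load
    model.log.append("mla_recompute")

def load_weights_checkpoint_layerwise(model, payload):
    # stub: the checkpoint-format layerwise reload flow
    model.log.append("layerwise_load")

def apply_weight_update(model, payload, fmt):
    if fmt == "kernel":
        load_weights_kernel(model, payload)
        update_mla_absorbed_weights(model)  # explicit, replacing the old generic postprocess
        return "kernel-path"                # early return: skip the layerwise flow
    load_weights_checkpoint_layerwise(model, payload)
    return "layerwise-path"

class DummyModel:
    def __init__(self):
        self.log = []
```

The early return keeps the kernel-format path intact, which is the mitigation the risk note above refers to.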

Reviewed by Cursor Bugbot for commit 25f17df.

@cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.


Reviewed by Cursor Bugbot for commit 718bbb8.

Comment thread: src/prime_rl/inference/vllm/worker/weight_transfer.py
S1ro1 and others added 2 commits May 10, 2026 18:57
Mirror the same fix as the NCCL path: replace
`model.load_weights + process_weights_after_loading` with
`load_weights_checkpoint_layerwise`. Without it, MoE + filesystem
transport hits the same `Current vLLM config is not set`
AssertionError at `worker/filesystem.py` on the first weight update.

Verified end-to-end on Qwen3-30B-A3B-Thinking-2507 + reverse-text:
filesystem broadcast returns 200 OK, orchestrator step 2 consumed
new weights cleanly (off-policy level 1, reward 0.3583), trainer
steady-state at ~29K tok/s, MFU 21%, no NaN, no errors.
@S1ro1 S1ro1 merged commit 319af10 into main May 10, 2026
8 of 10 checks passed

2 participants