[Weight Transfer]: Support layerwise reloading for online quantization #2464

Open
S1ro1 wants to merge 5 commits into main from online-quant

Conversation

@S1ro1 (Collaborator) commented May 9, 2026

Note

Medium Risk
Touches inference weight hot-reload and NCCL/filesystem broadcast flows; incorrect flag propagation or vLLM layerwise reload behavior could break online weight updates or degrade inference stability, especially under FP8 quantization.

Overview
Adds support for vLLM layerwise weight reloading during online inference weight updates, primarily to enable online FP8 quantization (inference.quantization=fp8_per_block).

Configuration is extended with a quantization option and a weight_broadcast.layerwise flag (propagated and validated across trainer, orchestrator, and inference). Layerwise reload is auto-enabled when quantization is set, and incompatible combinations (e.g., layerwise together with quantize_in_weight_transfer) are rejected.
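
As a rough illustration of the described validation (the class layout here is hypothetical; only the names quantization, layerwise, and quantize_in_weight_transfer come from this PR):

```python
# Hypothetical sketch of the flag validation described above, not the PR's actual code.
from dataclasses import dataclass, field


@dataclass
class WeightBroadcastConfig:
    layerwise: bool = False
    quantize_in_weight_transfer: bool = False


@dataclass
class InferenceConfig:
    quantization: str | None = None  # e.g. "fp8_per_block"
    weight_broadcast: WeightBroadcastConfig = field(default_factory=WeightBroadcastConfig)

    def __post_init__(self) -> None:
        # Online quantization requires layerwise reloading, so enable it automatically.
        if self.quantization is not None:
            self.weight_broadcast.layerwise = True
        # Layerwise reload cannot be combined with quantizing inside the weight transfer.
        if self.weight_broadcast.layerwise and self.weight_broadcast.quantize_in_weight_transfer:
            raise ValueError(
                "weight_broadcast.layerwise is incompatible with quantize_in_weight_transfer"
            )
```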

Inference weight update RPCs now carry the layerwise flag end-to-end (server → worker, and orchestrator → /init_broadcaster). Weight loading is refactored into a shared load_weights_checkpoint_or_layerwise helper that uses vLLM's initialize_layerwise_reload/finalize_layerwise_reload, plus a workaround to run the FP8 online conversion without torch.compile when needed.
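
For orientation, a sketch of what such a helper's control flow could look like. The hook names initialize_layerwise_reload/finalize_layerwise_reload come from the PR description, but where they live, their signatures, and the per-layer loop are assumptions here, and the torch.compile workaround is omitted:

```python
# Illustrative control flow only; not the PR's actual implementation.
from typing import Iterable

import torch


def load_weights_checkpoint_or_layerwise(
    model,
    weights: Iterable[tuple[str, torch.Tensor]],
    layerwise: bool,
) -> None:
    if not layerwise:
        # Checkpoint path: hand the full set of weights to the model in one call.
        model.load_weights(weights)
        return

    # Layerwise path: let vLLM prepare per-layer reloading (so e.g. the FP8 online
    # conversion can run as each layer arrives), then finalize once every weight
    # has been streamed in.
    model.initialize_layerwise_reload()
    try:
        for name, tensor in weights:
            model.load_weights([(name, tensor)])
    finally:
        model.finalize_layerwise_reload()
```

Wrapping the streaming loop in the initialize/finalize pair presumably lets both the NCCL and filesystem broadcast flows reuse the same reload logic, matching the end-to-end flag propagation described above.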

Reviewed by Cursor Bugbot for commit 3974fd9.


@cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 0683bba.

Comment thread on src/prime_rl/inference/vllm/worker/nccl.py (outdated)

@samsja (Member) left a comment

LGTM sir

2 participants