chore: relaunch vllm 0.20.1 bump #2448
Merged
Conversation
Re-applies #2427 (vllm 0.20.1 bump) after revert in #2437.
Why
- `torch.mm(..., out_dtype=torch.float32)` needs torch >= 2.10, which is comfortably available with the new flash-attn wheel pinned to torch 2.11 here.
- The `LogitsProcessor` API is byte-identical between 0.19.0 and 0.20.1 (`__init__`, `_get_logits`, `_gather_logits` signatures unchanged), so `monkey_patch_fp32_lm_head` from #2441 (feat(inference): fp32 lm_head via native bf16xbf16 -> fp32 mm, alt to #2438) keeps working without modification.
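For context, a minimal sketch of the call the fp32 lm_head patch builds on; the tensor shapes and variable names below are illustrative, and only the `torch.mm(..., out_dtype=torch.float32)` call and its torch >= 2.10 requirement come from this PR:

```python
import torch

# Illustrative shapes only: a small batch of bf16 hidden states and a bf16 lm_head weight.
hidden_states = torch.randn(8, 4096, dtype=torch.bfloat16)
lm_head_weight = torch.randn(32000, 4096, dtype=torch.bfloat16)

# Native bf16 x bf16 matmul that accumulates straight into fp32 logits,
# instead of computing in bf16 and upcasting afterwards.
# Per the PR description, the out_dtype argument requires torch >= 2.10.
logits = torch.mm(hidden_states, lm_head_weight.t(), out_dtype=torch.float32)
assert logits.shape == (8, 32000) and logits.dtype == torch.float32
```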
What

- Bump to `vllm>=0.20.1` and update the pinned flash-attn wheel to the `+cu128torch2.11` build.
- Add `torchvision`/`torchaudio` deps via `pytorch-cu128`.
- Remove `monkey_patch_fused_moe_lora_dp` (fixed in vLLM 0.20 via #40338).
- Remove `monkey_patch_offloading_connector_cpu_block_count` (fixed in vLLM 0.20 via #39617).
- `monkey_patch_fp32_lm_head` (#2441: feat(inference): fp32 lm_head via native bf16xbf16 -> fp32 mm, alt to #2438) is preserved as-is; auto-merge handled the overlap with the patches removed by the bump.

🤖 Generated with Claude Code
Note
Medium Risk
Upgrades core inference dependencies (vLLM/PyTorch/Flash-Attn), which can change runtime behavior and performance across serving and distributed execution despite minimal application-code changes.
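As a rough guardrail against that risk, a hypothetical post-upgrade sanity check (not part of this PR) could assert the version floors the bump relies on:

```python
# Hypothetical sanity check; asserts the version floors described in this PR.
from packaging.version import Version

import torch
import vllm

# torch.mm(..., out_dtype=torch.float32) needs torch >= 2.10 per the PR description.
assert Version(torch.__version__.split("+")[0]) >= Version("2.10")
# The fused-MoE LoRA DP and offloading-connector fixes landed upstream in vLLM 0.20.
assert Version(vllm.__version__) >= Version("0.20.1")
print(f"inference stack OK: torch {torch.__version__}, vllm {vllm.__version__}")
```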
Overview
Updates the inference dependency stack to vLLM `>=0.20.1` and aligns CUDA 12.8 wheels by adding `torchvision`/`torchaudio` from the `pytorch-cu128` index and updating the pinned `flash-attn` wheel to the `+cu128torch2.11` build. Simplifies local vLLM patching by removing monkey patches that are now fixed upstream (the LoRA+MoE+DP corruption workaround and the offloading connector CPU block count fix), and adjusts comments/config (`exclude-newer-package` now unblocks `vllm`).

Reviewed by Cursor Bugbot for commit 464bea4.