Changes from all commits (507 commits)
81b608d
[Bugfix] reasoning_parser parameter handling in run_batch.py (#26225)
inc-jeong Oct 16, 2025
7de0029
[ROCm][FEAT] Fuse DeepSeek shared experts into AITER fused_moe ops (#…
kliuae Oct 16, 2025
b1ecb43
[CI] Enable Blackwell Llama4 MoE tests (#26731)
mgoin Oct 16, 2025
445ee12
[BUG] Allow runai_streamer_sharded in config check (#26958)
ahao-anyscale Oct 16, 2025
668d940
[bugfix] Fix SP + PP without specifying compile size (#26955)
angelayi Oct 16, 2025
29ddae4
[BugFix] Work around graph partition x torch.compile cache issue (#26…
zou3519 Oct 16, 2025
c3e8091
[DOC][XPU]update feature parity with Intel GPU (#26954)
xuechendi Oct 16, 2025
e76ac40
[Chore] Rename `utils` submodules (#26920)
DarkLight1337 Oct 16, 2025
5cec582
[PERF] Qwen3-next MTP speedup (change bool mask indexing to index_sel…
vadiklyutiy Oct 16, 2025
2f7b895
Deepseek-v3 Batch Invariant on 8xH100 (#26609)
bwasti Oct 16, 2025
c70ac7e
[CI/Build] Update expected beam search output for Phi3V (#26978)
DarkLight1337 Oct 16, 2025
aa4ddbe
[Hardware][CPU][PowerPC]Disable torch.compile() in toptopk sampling (…
Akashcodes732 Oct 16, 2025
4bc9280
[CI/Build] Fix AMD import failures in CI (#26841)
zhewenl Oct 16, 2025
b020e9c
[Benchmark] Use truncation by default for pooling benchmarks (#26992)
DarkLight1337 Oct 16, 2025
9adb917
[Chore] Separate out `vllm.utils.collections` (#26990)
DarkLight1337 Oct 16, 2025
1b0643c
[Model][Bugfix] fix ernie45 vl run failed from shared experts optimiz…
CSWYF3634076 Oct 16, 2025
4b9033e
Cleanup code after Python 3.10 upgrade (#26520)
lgeiger Oct 16, 2025
8752f7a
[MISC] fix import violations for re and triton modules (#26654)
llsj14 Oct 16, 2025
71db0c1
[Bugfix] Correct LayerNorm epsilon parameter in modernbert.py (#27008)
bogdanminko Oct 16, 2025
9e123d4
[Benchmark] Show E2EL by default for pooling models (#27014)
DarkLight1337 Oct 16, 2025
5390be1
[Attention] Tune CUTLASS MLA num_splits (#26846)
MatthewBonanni Oct 16, 2025
3a821b5
[NIXL] Improve request_finished() debug logs (#25665)
markmc Oct 16, 2025
dc5de46
[docs] standardize Hugging Face env var to `HF_TOKEN` (deprecates `HU…
yankay Oct 16, 2025
40b9d7b
[CI] Replace large models with tiny alternatives in tests (#24057)
tahsintunan Oct 16, 2025
88914c2
[Feature] Add process_weights_after_loading to AttentionImpl (#26870)
lengrongfu Oct 16, 2025
62339c0
[Model] Fix Qwen3VL mm mapping (#27027)
jeejeelee Oct 16, 2025
4db9fe7
Fix Qwen2.5 VL image grid docstring (#27033)
skyloevil Oct 16, 2025
ecea4be
Support `set` in the CLI generation (#27031)
hmellor Oct 16, 2025
3f8d633
[gpt-oss][1/N] EZ: refactor serving_responses for modularity (#26948)
qandrew Oct 16, 2025
9d9c085
Support block size of 256 used by Intel HPU (#26883)
mandy-li Oct 16, 2025
176d46e
[Compressed Tensors] Always clone output for compile robustness (#26849)
kylesayrs Oct 16, 2025
9216eec
Adding Warmup to Benchmark Serving (#26943)
kimbochen Oct 16, 2025
65688b9
[Bug] Fix batch invariant test `has` to `is` (#27032)
yewentao256 Oct 16, 2025
5dcfe89
[GPTOSS][DP/EP][Marlin] Enable GPTOSS Batched DP/EP using Marlin kern…
varun-sundar-rabindranath Oct 16, 2025
226908b
[Feature] Migrate DeepGEMM API from `get_m_alignment_for_contiguous_l…
yewentao256 Oct 16, 2025
368330d
[CI] Prune Quantization Tests and skip compilation (#27038)
mgoin Oct 16, 2025
3ea7010
[Bug] Add Assertion for `random-input-len` / `random-output-len` (#26…
yewentao256 Oct 16, 2025
7da8634
[small][batch invariance] Rename the env and internal flags to simpli…
bwasti Oct 16, 2025
e426ef9
Refactor Transformers backend to use mixins (#26906)
hmellor Oct 16, 2025
110524e
[NVIDIA] [Perf] Update to leverage flashinfer trtllm FP4 MOE throughp…
jiahanc Oct 16, 2025
e5b31da
[torch.compile] Passing only necessary compilation config to inductor…
luccafong Oct 17, 2025
fb84a87
[Chore] Separate out `vllm.utils.importlib` (#27022)
DarkLight1337 Oct 17, 2025
9d9d49c
[torch.compile] fix simple inductor graph partition test (#27050)
BoyuanFeng Oct 17, 2025
61786b1
Remove unused imports (#26972)
lgeiger Oct 17, 2025
9fb0b5a
vllm bench serve shows num of failed requests (#26478)
tomasruizt Oct 17, 2025
ef59e41
[Docs] Reduce custom syntax used in docs (#27009)
hmellor Oct 17, 2025
19214a5
[Perf] Exploit out-of-band buffers in shm_broadcast (#26961)
njhill Oct 17, 2025
c212bf1
disable graph partition in custom op (#26952)
BoyuanFeng Oct 17, 2025
c682f60
[Bugfix][Qwen] fixes the weights dtype in qwen3_next: it is actually …
sighingnow Oct 17, 2025
04f1cf4
[Core] Change `execute_model_with_error_logging()` to be a ctx manage…
njhill Oct 17, 2025
6e92c84
[Bugfix] Fix ReplicatedLinearWithLoRA (#27065)
jeejeelee Oct 17, 2025
aefff95
[Kernel] Lazy import FlashInfer (#26977)
jeejeelee Oct 17, 2025
e61d110
[CI/Build] Update Llama4 eval yaml (#27070)
zhewenl Oct 17, 2025
cda86b7
[Model] Always use Transformers backend for PaliGemma and Gemma3-MM (…
DarkLight1337 Oct 17, 2025
bd85240
[Model] Add support for LightOnOCR (#26916)
staghado Oct 17, 2025
e7ac46e
[CI/Build] Update compressed tensor test path to fix CPU CI (#27068)
bigPYJ1151 Oct 17, 2025
50e9d0b
[Kernel][Performance] Fuse float cast and renormalize to topk softmax…
izhuhaoran Oct 17, 2025
ec0bec4
[CI] fix docs build failed (#27082)
chaunceyjiang Oct 17, 2025
93bff5b
Update troubleshooting.md and remind VLLM_TRACE_FUNCTION usage (#27069)
Prowindy Oct 17, 2025
9d9ae2d
[VLM][Refactor] Remove useless func `get_input_positions` in `MRotary…
MengqingCao Oct 17, 2025
860453b
[Docs] Replace all explicit anchors with real links (#27087)
hmellor Oct 17, 2025
1e9d934
[Docs] Replace `rst` style double-backtick with `md` single-backtick …
hmellor Oct 17, 2025
684e2c4
[Model]Improve Qwen3VLMoeForConditionalGeneration packed_modules_mapp…
jeejeelee Oct 17, 2025
5edb6d5
[Harware][AMD][Model] Triton MoE tuning configs for GLM-4.5 for MI350…
rkarhila-amd Oct 17, 2025
bec5337
Fix incorrect docstring for stop_profile() method (#27101)
hyongtao-code Oct 17, 2025
9d6cea0
[torch.compile] Enable attention and allreduce fusion without custom …
ProExpertProg Oct 17, 2025
b075c73
[CI] Nixl integration tests (#27010)
NickLucche Oct 17, 2025
717695a
[Data-parallel] Allow DP>1 for world_size > num_gpus on node (8) (#26…
patrickvonplaten Oct 17, 2025
da3b5e3
[bugfix] Qwen3-VL fix video incorrect timestamp calculations while do…
wulipc Oct 17, 2025
3f232e7
[CI] Remove forbidden slash (#27112)
NickLucche Oct 17, 2025
c97724d
[ROCM] MoE fp4 CK kernel (#26545)
maleksan85 Oct 17, 2025
b2a1519
[ROCm][Bugfix][Model] Fix illegal memory access when running qwen3_mo…
rasmith Oct 17, 2025
21dd40c
[Bugfix] [AITER] [ROCm] Fix Quark MoE Quant Config and AITER Fused Mo…
vllmellm Oct 17, 2025
7eed6ea
[Chore] Remove unused `PolyNorm` layer (#27110)
Isotr0py Oct 17, 2025
c8f71f5
[Bugfix] Use PIECEWISE cudagraphs on Blackwell if max_model_len > 131…
mgoin Oct 17, 2025
54f088f
[Minor] Remove unnecessary error message (#27115)
zhuohan123 Oct 17, 2025
5ec81ee
[V1][Spec Decode] Fix greedy temperature detection after sampler refa…
Pradyun92 Oct 17, 2025
70f882e
[Test] Make `test_failure` more stable for batch invariance (#27054)
yewentao256 Oct 17, 2025
5b42c77
[BugFix][Core] Fix error when enable async-scheduling in multi-node e…
lhtin Oct 17, 2025
af46ced
[Perf] Add H100 fused MoE config (#25398)
skyloevil Oct 18, 2025
8111e83
[CI/Build] tests(v1): feed Triton attention the (num_blocks, 2, …) KV…
hl475 Oct 18, 2025
bda05c4
[GPT-OSS] Structure_Tag support for gpt-oss tool-call in cot (#25515)
Hanchenli Oct 18, 2025
78d3bba
[Misc] Rev DeepEP (#27122)
varun-sundar-rabindranath Oct 18, 2025
427b8eb
[DOC][FEATURES][CPU]update cpu feature for v1 (#27135)
xuechendi Oct 18, 2025
c4322fd
[Test] Add test for /health endpoint on engine failure (#26074)
dongbo910220 Oct 18, 2025
4fe6fac
[Chore] Separate out `vllm.utils.mem_utils` (#27143)
iAmir97 Oct 18, 2025
272514e
[Feature] Batch Invariant: Support DeepGEMM and Blackwell (#27127)
yewentao256 Oct 18, 2025
b4aa02d
[fix][cpu] fix prefill attention in CPU attention backend (#27035)
fadara01 Oct 18, 2025
2cd6601
[Misc] Refactor `get_kv_cache_spec` into `AttentionLayerBase` (#26587)
NickLucche Oct 18, 2025
bdac0c7
[Models][QwenVL] Remove unnecessary `.contiguous()` calls (#27106)
lgeiger Oct 18, 2025
a6e7382
[Chore] Clean up pytorch helper functions in `vllm.utils` (#26908)
Isotr0py Oct 18, 2025
b8a28c7
Fix incorrect string formatting in barrier timeout exceptions (#27149)
hyongtao-code Oct 18, 2025
fb3d109
[Minor] Add some clarifying comments to recent changes (#27130)
njhill Oct 18, 2025
23830a0
[BugFix] Fix failing gemma-3-1b-it test: `test_lm_eval_accuracy_v1_en…
LucasWilkinson Oct 18, 2025
2ce71b9
[Chore] Separate out profiling utilities from vllm.utils (#27150)
dongbo910220 Oct 18, 2025
de01143
[BugFix] fix graph partition signature (#27139)
BoyuanFeng Oct 18, 2025
aa5a77c
[BugFix] Disable fp8 kv-cache by default for DeepSeek V3.2 (#27121)
LucasWilkinson Oct 18, 2025
3cca461
[V1][Metrics][Plugin] Add plugin support for custom `StatLoggerBase` …
ptovam Oct 18, 2025
7e611e9
[Minor] Remove unused env variable (#27161)
WoosukKwon Oct 19, 2025
c72cde6
[BugFix] Fix lazy imports involving outlines_core (#27158)
22quinn Oct 19, 2025
17e77db
[Chore] Separate out hashing utilities from vllm.utils (#27151)
dongbo910220 Oct 19, 2025
19c6372
[Benchmark] Convenience script for multiple parameter combinations (#…
DarkLight1337 Oct 19, 2025
478bd82
output type conversion fix (#27159)
jianyuh Oct 19, 2025
f65a2ee
[Chore] Separate out `vllm.utils.network_utils` (#27164)
iAmir97 Oct 19, 2025
c2ea2c8
[Misc] Move utils to avoid conflicts with stdlib, and move tests (#27…
DarkLight1337 Oct 19, 2025
65749c8
[Bugfix] Fix error with penalties when speculative decoding and struc…
southfreebird Oct 19, 2025
a2e665b
Fix typo in ValueError message: use `kv_role` instead of `kv_disagg_r…
hyongtao-code Oct 19, 2025
c757d0b
[Model][VLM] Support Bee-8B Model (#27012)
uyzhang Oct 20, 2025
c0700ae
[LoRA] LoRA cuda graph specialization (#25914)
andylolu2 Oct 20, 2025
07ee43f
[Kernel] Accelerate solve_tril with TMA (#26746)
ZJY0516 Oct 20, 2025
76e06a5
AArch64 CPU Docker pipeline (#26931)
ioghiban Oct 20, 2025
84a158d
Nemotron Nano V2 VL + EVS Video Support (#27107)
BloodAxe Oct 20, 2025
5227215
[Kernel][Model] Tune fused_moe Triton configs for Qwen3-30B A3/A3B on…
shivampr Oct 20, 2025
34f6539
[Bugfix][CI] Fix `Distributed Tests (4 GPUs)` async_sched+ray test (#…
NickLucche Oct 20, 2025
f6383fe
[Feature][Quantization] auto_round support for mixed bits quantizatio…
n1ck-guo Oct 20, 2025
323625b
[ROCm] enable some tests in entrypoints test groups on AMD (#26725)
Concurrensee Oct 21, 2025
f8d4205
[ez] add uv lock to gitignore (#27212)
qandrew Oct 21, 2025
ee4604d
[Quantization] Automatically infer AWQ `modules_to_not_convert` field…
Isotr0py Oct 21, 2025
b2a6870
[V0 Deprecation] Remove V0 metrics code (#27215)
njhill Oct 21, 2025
a425176
[cpu] Dispatch un-quantized linear to oneDNN/ACL by default for AArch…
fadara01 Oct 21, 2025
f6a47a6
create is_in_the_same_node on cpu (#26832)
helunwencser Oct 21, 2025
5c7ada1
[Frontend] Enforce tokenize=False when applying chat template (#27205)
russellb Oct 21, 2025
2ff764c
[Feature][Kernel]FusedMoE LoRA (#21229)
wcwuwc Oct 21, 2025
79bddea
[BugFix] GPT-OSS Attention DP + MoE TP weight loading issue (#24032)
nvpohanh Oct 21, 2025
d3f9bda
[ModelOpt] Load w13/w2_input_scale for all experts, nvfp4 (#26135)
wenscarl Oct 21, 2025
97e3403
[Bugfix] Fix gpt-oss w4a8 DP/EP on B200 (#26729)
varun-sundar-rabindranath Oct 21, 2025
9d41021
[Bugfix] Fix broken MTP weight loading for FP8 KV Scales (#27227)
benchislett Oct 21, 2025
25756dd
[Fix][Spec Decode] Fix llama4 draft loading with different quantizati…
linzebing Oct 21, 2025
5a187fa
[Nixl] Minor refactor to handshake related metadata (#26410)
NickLucche Oct 21, 2025
ddce9d7
[MM][Core] Decouple ViT backend from LM backend (#27061)
ywang96 Oct 21, 2025
48956d8
[Deepseek v3.2] Optimize top_k_per_row (#26763)
dcampora Oct 21, 2025
b3117b3
[Chore] Separate out NCCL utilities from vllm.utils (#27197)
dongbo910220 Oct 21, 2025
93489c0
[CI] Install pre-release version of `apache-tvm-ffi` for `flashinfer`…
hmellor Oct 21, 2025
ba7e59a
[ROCM] Enable CompressedTensorsWNA16 (#27187)
JartX Oct 21, 2025
3b92b85
Add @pavanimajety to .github/codeowners for Flashinfer, ModelOpt rela…
pavanimajety Oct 21, 2025
ec5ef5a
[ROCm] Update Triton, Torch, and AITER branches for ROCm base Dockerf…
micah-wil Oct 21, 2025
1932d99
[Feature] Batch Invariant for R1 TP 8 on Blackwell (#27229)
yewentao256 Oct 21, 2025
f354642
[Bugfix][P/D] Reduce num_threads used by nixl ucx backend (#27196)
dagrayvid Oct 21, 2025
e51c9b6
[V0 Deprecation] Remove V0 executors (#27142)
njhill Oct 21, 2025
a16cfbe
[Bugfix] fixes the decoding metadata of dense mla's fp8 kvcache. (#27…
sighingnow Oct 21, 2025
063cb5c
Update PyTorch to 2.9.0+cu129 (#24994)
huydhn Oct 21, 2025
870f776
[Performance] Dual stream execution of "shared_experts" and "selected…
alexm-redhat Oct 21, 2025
687a7c4
Updated xgrammar backend to not deny supported string formats (#27253)
ExtReMLapin Oct 21, 2025
90a10be
[Bugfix] skip cuda graph for drafter when running with eager (#26821)
benchislett Oct 21, 2025
d703e3e
[P/D] KVConnector for decode benchmarking (#25986)
tlrmchlsmth Oct 21, 2025
36a963f
[Deepseek v3.2] Remove extra logics in indexer (#26465)
IwakuraRein Oct 21, 2025
2fb57bf
[DOC] [ROCm] Add ROCm quickstart guide (#26505)
vllmellm Oct 22, 2025
f1d24eb
[CI] Nixl integration tests DP-EP (#27199)
NickLucche Oct 22, 2025
61e6f00
[Benchmark] Add plot utility for parameter sweep (#27168)
DarkLight1337 Oct 22, 2025
aaa47a6
[torch.compile] Enable silu_mul_fp8_quant fusion without custom ops e…
ZJY0516 Oct 22, 2025
ce78347
[1/N][Platform] Cleanup useless function (#26982)
wangxiyuan Oct 22, 2025
32092f1
Update release pipeline for PyTorch 2.9.0 (#27303)
huydhn Oct 22, 2025
2bb9bfe
Remove last `level` references not removed in #26355 (#27260)
hmellor Oct 22, 2025
8ceb0e8
fixed reasoning streaming with tool_choice="required" (#24108)
ExtReMLapin Oct 22, 2025
8a2b9ca
[Frontend][3/N] Improve all pooling task | Support binary embedding r…
noooop Oct 22, 2025
3ef109e
[Bugfix][CPU] Disable dual stream execution for experts on CPU (#27320)
bigPYJ1151 Oct 22, 2025
c059c9f
[Log] Add Warning for `LLM(data_parallel_size=k)` single-process DP U…
yewentao256 Oct 22, 2025
c61b853
Bugfix - pass 'max_num_tokens_padded' into 'moe_lora_align_block_size…
gnovack Oct 22, 2025
c86ac81
[Core] Handle MoE LoRA edge cases (#27335)
jeejeelee Oct 22, 2025
91350f3
[docs] Update v1 metrics design doc (#27332)
markmc Oct 22, 2025
3931943
Mirroring changes in test-pipeline.yaml into test-amd.yaml (#27242)
Alexei-V-Ivanov-AMD Oct 22, 2025
5b0e9b8
[Chore] Separate out optional dependency checks from vllm.utils (#27207)
dongbo910220 Oct 22, 2025
60293c4
[Model] Upstream Deepseek-OCR model (#27247)
Isotr0py Oct 22, 2025
857216a
[NIXL] Terminate handshake listener thread in shutdown (#26404)
markmc Oct 22, 2025
2aabbcf
[Bug] Fix DeepSeek-V2.5-1210-FP8 issue (#27267)
yewentao256 Oct 22, 2025
fecd0f0
[bugfix] remove unused parameters to reduce unnecessary vram usage (#…
ReinForce-II Oct 22, 2025
53766e7
[Bugfix] Add missing 'is_internal_router' attribute to FusedMoEWithLo…
jeejeelee Oct 22, 2025
04d5802
[NIXL] use Host buffer to support TP_ratio > 1 for XPU (#27140)
xuechendi Oct 22, 2025
f0f25be
[Bugfix] Make `get_mrope_input_positions` instance methods (#27342)
DarkLight1337 Oct 22, 2025
de2821b
[Bugfix] Fix HF format InternVL large variants video processing (#27330)
Isotr0py Oct 22, 2025
1b6dc44
[Frontend] Require flag for loading text and image embeds (#27204)
russellb Oct 22, 2025
3694915
[P/D] Dynamic `kv_output_aggregator` collect size (#26734)
NickLucche Oct 22, 2025
22474aa
Support Anthropic API /v1/messages Endpoint (#22627)
LiuLi1998 Oct 22, 2025
05b7c8e
[Bugfix] Disable FlexAttention direct block mask building for encoder…
Isotr0py Oct 22, 2025
dc93147
[Model] Revert PR #26715: Restore custom PaliGemma and Gemma3-MM impl…
lucianommartins Oct 22, 2025
7dccff2
[Doc] Fix numbering sequence in prefix caching (#27357)
gigit0000 Oct 22, 2025
9f86cc2
[Prefix Cache] Use LoRA name for consistent KV-cache block hashing (#…
sagiahrac Oct 22, 2025
72b2eba
[Feature] publisher default set zmq in kv_event config (#26915)
lengrongfu Oct 22, 2025
65b7070
[BugFix] bugfix for Flash Attention MLA with full cuda graph IMA foll…
Daisy-Ma-coder Oct 22, 2025
3affbf4
[Chore] Separate out system utilities from vllm.utils (#27201)
dongbo910220 Oct 22, 2025
b923e26
[MLA] Bump FlashMLA (#27354)
MatthewBonanni Oct 22, 2025
718433e
[Bugfix] Fix deepseek-ocr multi-image inference and add `merge_by_fie…
Isotr0py Oct 23, 2025
a861472
[Bugfix] Fix SLA tuner initialization (#27355)
DarkLight1337 Oct 23, 2025
070d17b
[Bugfix] Fix incorrect kv cache metrics in grafana.json (#27133)
fangpings Oct 23, 2025
bf81da3
[Bugfix][Core] running queue index leakage exception (#26754)
CLFutureX Oct 23, 2025
1a19076
[CORE] Support Prefix Caching with Prompt Embeds (#27219)
qthequartermasterman Oct 23, 2025
cbb3052
[V1][spec decode] return logprobs for spec decoding (#26060)
TheEpicDolphin Oct 23, 2025
a7014b9
[Model] Add num_cached_tokens for PoolingRequestOutput (#27378)
noooop Oct 23, 2025
bebe4a6
[Chore] Remove duplicate `has_` functions in vllm.utils (#27372)
jonathanc-n Oct 23, 2025
0613f63
[CI/Build] Fix Prithvi plugin test (#27393)
DarkLight1337 Oct 23, 2025
6c633e1
[Bugfix] Fix args settings for guided decoding args (#27375)
luccafong Oct 23, 2025
0a85a28
[CI/Build] Fix AMD CI: test_cpu_gpu.py (#27388)
zhewenl Oct 23, 2025
a1ad387
add SLA information into comparison graph for vLLM Benchmark Suite (#…
louie-tsai Oct 23, 2025
46a5edb
[CI] Reorganize entrypoints tests (#27403)
chaunceyjiang Oct 23, 2025
f40d05b
[Metrics] [KVConnector] Add connector prefix cache hit rate stats (#2…
ptovam Oct 23, 2025
fc0a99f
[Model] Add MoE support for NemotronH (#25863)
tomeras91 Oct 23, 2025
0877397
Run mypy on the lowest supported Python version instead of system Pyt…
hmellor Oct 23, 2025
0fa71bf
[Bugfix] Honor --mm_encoder_attn_backend when used (#27124)
bradleyhd Oct 23, 2025
1c5d804
[Feature] Pydantic validation for speculative.py (#27156)
Navya1707 Oct 23, 2025
4f7f936
[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selection (fix D…
ilmarkov Oct 23, 2025
ba5f805
[CI/Build] Remove unnecessary flags from test registry (#27353)
DarkLight1337 Oct 23, 2025
a791471
[Frontend][4/N] Improve all pooling task | Add plugin pooling task (#…
noooop Oct 23, 2025
996d325
Mirroring the test definitions (2025-10-22) (#27362)
Alexei-V-Ivanov-AMD Oct 23, 2025
7785fd2
[Bugfix] Fix dp_chunking enablement logic in FusedMoE layer (#27220)
alexm-redhat Oct 23, 2025
f67a708
[Bugfix][ROCm][DeepSeek] Fix for forward_hip in rope for DeepSeek (#2…
gshtras Oct 23, 2025
d8f307a
[Bugfix] Fix AWQ marlin layer skipping (#27416)
Isotr0py Oct 23, 2025
5ff270a
[Misc] Add triton_kernels dependency (#27370)
varun-sundar-rabindranath Oct 23, 2025
2237c9c
[Chore] Separate out `vllm.utils.platform_utils.py` (#27374)
jonathanc-n Oct 23, 2025
f5e5da6
[Attention] Fix FlashMLA metadata builder arguments for q_len > 1 (#2…
MatthewBonanni Oct 23, 2025
be160a9
[Bugfix][DP] Fix creating too many DP Placement Groups (#26880)
kebe7jun Oct 23, 2025
c0ea035
[Model] Siglip Embedding Support (#27324)
piood Oct 23, 2025
5ff5ee2
[Hardware][POWERPC] Disable oneDNN path in vllm/model_executor/layers…
Akashcodes732 Oct 23, 2025
ec3ed22
Granite 4.0 quark quantization support (#26944)
xiao-llm Oct 24, 2025
d1395fb
Fix pooling adapters for Transformers backend (#27338)
hmellor Oct 24, 2025
ca4d31d
[Kernel] Add GPTQv2 format support for low-bit or asymmetric quantiza…
xxxxyu Oct 24, 2025
1d404a0
[Misc] Add TPU usage report when using tpu_inference. (#27423)
hfan Oct 24, 2025
c471066
[Bugfix][CI] Move resolving cudagraph_mode before initializing attn_m…
fhl2000 Oct 24, 2025
2d0baa3
Fix EventPublisherFactory logic for disabled KV cache events (#27419)
usberkeley Oct 24, 2025
e657a5a
[Chore] remove structural tags logging lines (#27451)
aarnphm Oct 24, 2025
badb509
[Bugfix] Fix Pydantic union resolution for ResponseFunctionToolCall i…
strinczer Oct 24, 2025
4e1b946
[Misc] Avoid "PyTorch non-writable tensors" warning in RayPPCommunica…
ruisearch42 Oct 24, 2025
65bff0c
[Docs] remove v1 column for embedding models (#27446)
piood Oct 24, 2025
ab333cc
[MM][Bugfix] Replace `PatchEmbed`'s conv3d to linear layer (#27418)
Isotr0py Oct 24, 2025
e402acb
[BugFix] Fix torchrun DP with LLM class (#27395)
22quinn Oct 24, 2025
2287228
[Refactor] move tool parsing logic from protocol.py to the tool parse…
chaunceyjiang Oct 24, 2025
0097e4d
[Benchmark] Enable benchmark to run with `encoding_format="bytes"` (#…
DarkLight1337 Oct 24, 2025
13e3d6b
Fix AArch64 CPU Docker pipeline (#27331)
ioghiban Oct 24, 2025
b2be57e
[MISC] `cudagraph_capture_sizes` related improvements (#26016)
fhl2000 Oct 24, 2025
2bc5aa8
Fix test named tool use (#27458)
chaunceyjiang Oct 24, 2025
ea80126
[Doc] Fix minor issues in docs/design/metrics.md (#27436)
draftbk Oct 24, 2025
89d9730
[cpu][fix] Fix onednn_mm crash on consecutive matmuls with same M,K,N…
fadara01 Oct 24, 2025
fce6445
[compile] Turn standalone_compile back on (#27460)
zou3519 Oct 24, 2025
146e279
[NIXL][BUGFIX] delay done_recving queue cleanup to bottom of get_fini…
xuechendi Oct 24, 2025
4308bdb
[Bugfix] Fix MultiConnector stats reconstruction across process bound…
kouroshHakha Oct 24, 2025
cfc6818
[Attention] Add MLA prefill backend: trtllm_ragged_attention_deepseek…
minosfuture Oct 24, 2025
ff55c3b
[Bugfix] Fix interns1-vit qk norm code path (#27480)
Isotr0py Oct 24, 2025
4c87e6e
[CI/Build] Fix test_torch_utils in AMD CI (#27317)
zhewenl Oct 24, 2025
cbb9f93
[Document] Add ms-swift library to rlhf.md (#27469)
hjh0119 Oct 24, 2025
962f87f
[Perf][Async Scheduling] Remove CPU->GPU sync in dummy_run (#27455)
lhtin Oct 24, 2025
76ff636
[Distributed] Basic set of configuration for large EP deployment on G…
wpc Oct 24, 2025
ad2c0e6
[Log] Optimize Startup Log (#26740)
yewentao256 Oct 24, 2025
bc82752
[Misc][DP] Guard mxfp4 implementation selection (#27484)
varun-sundar-rabindranath Oct 24, 2025
b997f6b
[KVConnector] Migrate the LMCache integration code to be vLLM native …
ApostaC Oct 25, 2025
846354e
[CI] Add tests for cudagraph (#27391)
ZJY0516 Oct 25, 2025
622ac20
Revert "[Misc] Remove use of CUDA_VISIBLE_DEVICES for device selectio…
zhuohan123 Oct 25, 2025
d61ec81
[Core][Hybrid allocator + kv connector 1/n] Enable hybrid allocator +…
KuntaiDu Oct 25, 2025
3893339
[Misc] Simplify max tokens in multimodal registry (#27500)
DarkLight1337 Oct 25, 2025
2509106
[Attention] Add missing kv cache scale setup (#27490)
MatthewBonanni Oct 25, 2025
651c9fb
[CI/Build] Refactor processing tests (#27470)
DarkLight1337 Oct 25, 2025
57d72e5
merge main
0xrushi Oct 26, 2025
4 changes: 2 additions & 2 deletions .buildkite/check-wheel-size.py
@@ -5,11 +5,11 @@
 import sys
 import zipfile
 
-# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 450 MiB
+# Read the VLLM_MAX_SIZE_MB environment variable, defaulting to 500 MiB
 # Note that we have 800 MiB quota, please use it wisely.
 # See https://github.com/pypi/support/issues/6326 .
 # Please also sync the value with the one in Dockerfile.
-VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 450))
+VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 500))
 
 
 def print_top_10_largest_files(zip_file):
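For context on the limit bump above (450 MiB → 500 MiB), here is a minimal, illustrative sketch of the kind of size check this script performs; the function name and reporting details below are assumptions for illustration, not the script's actual code:

```python
import os
import zipfile

# Default limit mirrors the new value above; CI can still override it via the env var.
VLLM_MAX_SIZE_MB = int(os.environ.get("VLLM_MAX_SIZE_MB", 500))


def wheel_size_ok(wheel_path: str) -> bool:
    """Return True if the wheel fits the size budget, else print its largest members."""
    size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
    if size_mb <= VLLM_MAX_SIZE_MB:
        return True
    # On failure, list the ten largest files inside the wheel to aid debugging,
    # similar in spirit to print_top_10_largest_files() shown in the diff.
    with zipfile.ZipFile(wheel_path) as zf:
        largest = sorted(zf.infolist(), key=lambda i: i.file_size, reverse=True)[:10]
        for info in largest:
            print(f"{info.filename}: {info.file_size / (1024 * 1024):.2f} MiB")
    return False
```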
12 changes: 12 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -0,0 +1,12 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.419
- name: "exact_match,flexible-extract"
value: 0.416
limit: 1000
num_fewshot: 5
@@ -0,0 +1,12 @@
# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 100 -t 8
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
metrics:
- name: "relaxed_accuracy,none"
# TODO(zhewenl): model card is 0.90, but the actual score is 0.80.
value: 0.80
limit: 100
num_fewshot: 0
@@ -0,0 +1,10 @@
# For hf script, without -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh -m meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 -l 250 -t 8 -f 5
model_name: "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8"
tasks:
- name: "mmlu_pro"
metrics:
- name: "exact_match,custom-extract"
value: 0.80
limit: 250 # will run on 250 * 14 subjects = 3500 samples
num_fewshot: 5
@@ -1,4 +1,5 @@
-# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -b auto -l 1319 -f 5 -t 1
+# For vllm script, with -t option (tensor parallel size)
+# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic -l 1319 -t 1
 model_name: "RedHatAI/Qwen2.5-VL-3B-Instruct-FP8-Dynamic"
 tasks:
 - name: "gsm8k"
12 changes: 12 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2.5-VL-7B-Instruct.yaml
@@ -0,0 +1,12 @@
# For vllm script, with -t option (tensor parallel size).
# bash .buildkite/lm-eval-harness/run-lm-eval-chartqa-vllm-vlm-baseline.sh -m Qwen/Qwen2.5-VL-7B-Instruct -l 2500 -t 1

model_name: "Qwen/Qwen2.5-VL-7B-Instruct"
backend: "vllm-vlm"
tasks:
- name: "chartqa"
metrics:
- name: "relaxed_accuracy,none"
value: 0.855
limit: 2500
num_fewshot: 0
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-large-h100.txt
@@ -0,0 +1 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8.yaml
@@ -0,0 +1 @@
Meta-Llama-4-Maverick-17B-128E-Instruct-FP8-MM.yaml
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-mm-small.txt
@@ -0,0 +1 @@
Qwen2.5-VL-7B-Instruct.yaml
@@ -0,0 +1,44 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on chartqa for vllm.
#
# Make sure you have lm-eval-harness installed:
# pip install lm-eval==0.4.9

usage() {
echo``
echo "Runs lm eval harness on ChartQA using multimodal vllm."
echo "This pathway is intended to be used to create baselines for "
echo "our correctness tests in vllm's CI."
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -l - limit number of samples to run"
echo " -t - tensor parallel size to run at"
echo
}

while getopts "m:l:t:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model vllm-vlm \
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE" \
--tasks chartqa \
--batch_size auto \
--apply_chat_template \
--limit $LIMIT
.buildkite/lm-eval-harness/run-lm-eval-gsm-hf-baseline.sh: file mode changed 100644 → 100755 (no content changes)
50 changes: 50 additions & 0 deletions .buildkite/lm-eval-harness/run-lm-eval-mmlupro-vllm-baseline.sh
@@ -0,0 +1,50 @@
#!/bin/bash
# We can use this script to compute baseline accuracy on MMLUPRO for vllm.
# We use this for fp8, which HF does not support.
#
# Make sure you have lm-eval-harness installed:
# pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@206b7722158f58c35b7ffcd53b035fdbdda5126d#egg=lm-eval[api]

usage() {
echo``
echo "Runs lm eval harness on MMLU Pro using huggingface transformers."
echo "This pathway is intended to be used to create baselines for "
echo "our automated nm-test-accuracy workflow"
echo
echo "usage: ${0} <options>"
echo
echo " -m - huggingface stub or local directory of the model"
echo " -l - limit number of samples to run"
echo " -f - number of fewshot samples to use"
echo " -t - tensor parallel size to run at"
echo
}

while getopts "m:b:l:f:t:" OPT; do
case ${OPT} in
m )
MODEL="$OPTARG"
;;
b )
BATCH_SIZE="$OPTARG"
;;
l )
LIMIT="$OPTARG"
;;
f )
FEWSHOT="$OPTARG"
;;
t )
TP_SIZE="$OPTARG"
;;
\? )
usage
exit 1
;;
esac
done

lm_eval --model vllm \
--model_args "pretrained=$MODEL,tensor_parallel_size=$TP_SIZE,add_bos_token=true,trust_remote_code=true,max_model_len=4096" \
--tasks mmlu_pro --num_fewshot "$FEWSHOT" --limit "$LIMIT" \
--batch_size auto
12 changes: 9 additions & 3 deletions .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -19,21 +19,27 @@
 def launch_lm_eval(eval_config, tp_size):
     trust_remote_code = eval_config.get("trust_remote_code", False)
     max_model_len = eval_config.get("max_model_len", 4096)
+    batch_size = eval_config.get("batch_size", "auto")
+    backend = eval_config.get("backend", "vllm")
     model_args = (
         f"pretrained={eval_config['model_name']},"
         f"tensor_parallel_size={tp_size},"
         f"enforce_eager=true,"
         f"add_bos_token=true,"
         f"trust_remote_code={trust_remote_code},"
-        f"max_model_len={max_model_len}"
+        f"max_model_len={max_model_len},"
     )
     results = lm_eval.simple_evaluate(
-        model="vllm",
+        model=backend,
         model_args=model_args,
         tasks=[task["name"] for task in eval_config["tasks"]],
         num_fewshot=eval_config["num_fewshot"],
         limit=eval_config["limit"],
-        batch_size="auto",
+        # TODO(yeq): using chat template w/ fewshot_as_multiturn is supposed to help
+        # text models. however, this is regressing measured strict-match for
+        # existing text models in CI, so only apply it for mm.
+        apply_chat_template=backend == "vllm-vlm",
+        batch_size=batch_size,
     )
     return results
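To illustrate how the new optional YAML keys flow through this function, here is a small self-contained sketch; the helper below is illustrative only (it mirrors the logic in the diff above and is not part of the CI harness), and it uses the ChartQA config added earlier in this diff:

```python
# Illustrative sketch: how `backend` and `batch_size` from an eval config YAML
# would translate into lm_eval.simple_evaluate() keyword arguments.
def build_simple_evaluate_kwargs(eval_config: dict, tp_size: int) -> dict:
    backend = eval_config.get("backend", "vllm")      # "vllm" or "vllm-vlm"
    batch_size = eval_config.get("batch_size", "auto")
    max_model_len = eval_config.get("max_model_len", 4096)
    model_args = (
        f"pretrained={eval_config['model_name']},"
        f"tensor_parallel_size={tp_size},"
        f"max_model_len={max_model_len},"
    )
    return {
        "model": backend,
        "model_args": model_args,
        "tasks": [task["name"] for task in eval_config["tasks"]],
        "num_fewshot": eval_config["num_fewshot"],
        "limit": eval_config["limit"],
        # Chat template is only applied for the multimodal backend, per the diff above.
        "apply_chat_template": backend == "vllm-vlm",
        "batch_size": batch_size,
    }


if __name__ == "__main__":
    # Mirrors Qwen2.5-VL-7B-Instruct.yaml from this PR (metrics omitted for brevity).
    cfg = {
        "model_name": "Qwen/Qwen2.5-VL-7B-Instruct",
        "backend": "vllm-vlm",
        "tasks": [{"name": "chartqa"}],
        "num_fewshot": 0,
        "limit": 2500,
    }
    print(build_simple_evaluate_kwargs(cfg, tp_size=1))
```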
