@mmangkad mmangkad commented Oct 26, 2025

Purpose

This PR reverts the default MXFP4 backend for SM90 (Hopper) GPUs from Triton back to Marlin to fix a significant performance regression. Commit a9f55dc added the triton_kernels dependency, which caused SM90 to default to Triton instead of Marlin. Benchmarks below show the cost of the Triton default on SM90: average latency rose from 1.17 s to 1.58 s (35% slower) and request throughput dropped from 3.78 req/s to 3.12 req/s (a 17% reduction). This PR restores Marlin as the SM90 default when no environment variable is set, and introduces a new environment variable, VLLM_MXFP4_USE_TRITON, for explicitly selecting the Triton backend.
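The selection logic described above can be sketched as follows. This is a hypothetical simplification, not the actual vLLM code: the function name `select_mxfp4_backend` is invented for illustration, and the real implementation also considers other backends and kernel availability. The env var names match those discussed in the PR.

```python
import os


def select_mxfp4_backend(sm_version: int) -> str:
    """Pick an MXFP4 backend for a given SM (compute capability) version.

    Hypothetical sketch of the selection order described in the PR:
    explicit env-var overrides win; otherwise SM90 defaults to Marlin.
    """
    # Explicit user override takes precedence over any default.
    if os.environ.get("VLLM_MXFP4_USE_MARLIN") == "1":
        return "marlin"
    # New env var introduced by this PR for opting into Triton.
    if os.environ.get("VLLM_MXFP4_USE_TRITON") == "1":
        return "triton"
    # With no override, SM90 (Hopper) defaults back to Marlin,
    # even when triton_kernels is installed.
    if sm_version == 90:
        return "marlin"
    return "triton"
```

With neither variable set, `select_mxfp4_backend(90)` returns `"marlin"`, restoring the pre-a9f55dc behavior; setting `VLLM_MXFP4_USE_TRITON=1` opts back into Triton.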

Test Plan

# Triton
vllm bench latency --model openai/gpt-oss-120b --num-iters 100
vllm serve openai/gpt-oss-120b
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --random-input-len 1000 --random-output-len 1000 --num-prompts 1000 --max-concurrency 16
# Marlin
export VLLM_MXFP4_USE_MARLIN=1
vllm bench latency --model openai/gpt-oss-120b --num-iters 100
vllm serve openai/gpt-oss-120b
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --random-input-len 1000 --random-output-len 1000 --num-prompts 1000 --max-concurrency 16

Test Result

# Triton
Avg latency: 1.5815536511300048 seconds
10% percentile latency: 1.5204667982000046 seconds
25% percentile latency: 1.553816903249924 seconds
50% percentile latency: 1.5796886724999695 seconds
75% percentile latency: 1.6134025832499788 seconds
90% percentile latency: 1.6404986918001214 seconds
99% percentile latency: 1.701717556580029 seconds

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  320.22    
Total input tokens:                      1000000   
Total generated tokens:                  286769    
Request throughput (req/s):              3.12      
Output token throughput (tok/s):         895.53    
Peak output token throughput (tok/s):    1552.00   
Peak concurrent requests:                30.00     
Total Token throughput (tok/s):          4018.36   
---------------Time to First Token----------------
Mean TTFT (ms):                          280.78    
Median TTFT (ms):                        125.90    
P99 TTFT (ms):                           5398.79   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.37     
Median TPOT (ms):                        16.21     
P99 TPOT (ms):                           80.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.64     
Median ITL (ms):                         12.28     
P99 ITL (ms):                            109.48    
==================================================
# Marlin
Avg latency: 1.1686289020399931 seconds
10% percentile latency: 1.1123060915998166 seconds
25% percentile latency: 1.1479770297498817 seconds
50% percentile latency: 1.172815559500009 seconds
75% percentile latency: 1.1934974020000482 seconds
90% percentile latency: 1.2131133147000355 seconds
99% percentile latency: 1.2565373172201773 seconds

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  264.78    
Total input tokens:                      1000000   
Total generated tokens:                  270412    
Request throughput (req/s):              3.78      
Output token throughput (tok/s):         1021.27   
Peak output token throughput (tok/s):    1568.00   
Peak concurrent requests:                28.00     
Total Token throughput (tok/s):          4797.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          122.76    
Median TTFT (ms):                        99.74     
P99 TTFT (ms):                           489.35    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.50     
Median TPOT (ms):                        14.79     
P99 TPOT (ms):                           32.83     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.99     
Median ITL (ms):                         11.78     
P99 ITL (ms):                            78.76     
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a performance regression on SM90 GPUs for MXFP4 quantization by reverting the default backend to Marlin from Triton. The change is well-implemented, introducing a new environment variable VLLM_MXFP4_USE_TRITON to allow users to explicitly select the Triton backend, which is a good approach. The logic for backend selection is sound. I have one suggestion to improve robustness by explicitly handling cases where conflicting environment variables are set.
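The reviewer's suggestion about conflicting environment variables could look something like the sketch below. This is an assumption about what "explicitly handling" means, not code from the PR: the helper name `resolve_backend_override` is invented, and failing loudly (rather than silently preferring one variable) is one of several reasonable designs.

```python
import os
from typing import Optional


def resolve_backend_override() -> Optional[str]:
    """Return an explicit MXFP4 backend override, or None if unset.

    Hypothetical sketch: raise instead of silently preferring one
    backend when both override variables are set.
    """
    use_marlin = os.environ.get("VLLM_MXFP4_USE_MARLIN") == "1"
    use_triton = os.environ.get("VLLM_MXFP4_USE_TRITON") == "1"
    if use_marlin and use_triton:
        # Conflicting overrides: surface the misconfiguration to the user.
        raise ValueError(
            "VLLM_MXFP4_USE_MARLIN and VLLM_MXFP4_USE_TRITON are both set; "
            "unset one to choose a backend."
        )
    if use_marlin:
        return "marlin"
    if use_triton:
        return "triton"
    return None
```

An alternative design is to log a warning and pick a documented winner; raising makes the conflict impossible to miss at startup.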

