@mmangkad mmangkad commented Oct 26, 2025

Purpose

This PR reverts the default MXFP4 backend for SM90 (Hopper) GPUs from Triton back to Marlin to fix a significant performance regression. Commit a9f55dc added the triton_kernels dependency, which caused SM90 to default to Triton instead of Marlin. Benchmarks below show the cost of the Triton default on SM90: average latency rose from 1.17 s to 1.58 s (35% slower) and request throughput dropped from 3.78 req/s to 3.12 req/s (a 17% reduction). This PR restores Marlin as the SM90 default when no environment variable is set, and introduces a new environment variable, VLLM_MXFP4_USE_TRITON, for explicitly selecting the Triton backend.
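The selection logic described above can be sketched as follows. This is a hypothetical simplification, not the actual vLLM code: the function name `select_mxfp4_backend` is invented for illustration, and the real implementation also considers other backends and kernel availability. The env var names match those discussed in the PR.

```python
import os


def select_mxfp4_backend(sm_version: int) -> str:
    """Pick an MXFP4 backend for a given SM (compute capability) version.

    Hypothetical sketch of the selection order described in the PR:
    explicit env-var overrides win; otherwise SM90 defaults to Marlin.
    """
    # Explicit user override takes precedence over any default.
    if os.environ.get("VLLM_MXFP4_USE_MARLIN") == "1":
        return "marlin"
    # New env var introduced by this PR for opting into Triton.
    if os.environ.get("VLLM_MXFP4_USE_TRITON") == "1":
        return "triton"
    # With no override, SM90 (Hopper) defaults back to Marlin,
    # even when triton_kernels is installed.
    if sm_version == 90:
        return "marlin"
    return "triton"
```

With neither variable set, `select_mxfp4_backend(90)` returns `"marlin"`, restoring the pre-a9f55dc behavior; setting `VLLM_MXFP4_USE_TRITON=1` opts back into Triton.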

Test Plan

# Triton
vllm bench latency --model openai/gpt-oss-120b --num-iters 100
vllm serve openai/gpt-oss-120b
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --random-input-len 1000 --random-output-len 1000 --num-prompts 1000 --max-concurrency 16
# Marlin
export VLLM_MXFP4_USE_MARLIN=1
vllm bench latency --model openai/gpt-oss-120b --num-iters 100
vllm serve openai/gpt-oss-120b
vllm bench serve --model openai/gpt-oss-120b --dataset-name random --random-input-len 1000 --random-output-len 1000 --num-prompts 1000 --max-concurrency 16

Test Result

# Triton
Avg latency: 1.5815536511300048 seconds
10% percentile latency: 1.5204667982000046 seconds
25% percentile latency: 1.553816903249924 seconds
50% percentile latency: 1.5796886724999695 seconds
75% percentile latency: 1.6134025832499788 seconds
90% percentile latency: 1.6404986918001214 seconds
99% percentile latency: 1.701717556580029 seconds

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  320.22    
Total input tokens:                      1000000   
Total generated tokens:                  286769    
Request throughput (req/s):              3.12      
Output token throughput (tok/s):         895.53    
Peak output token throughput (tok/s):    1552.00   
Peak concurrent requests:                30.00     
Total Token throughput (tok/s):          4018.36   
---------------Time to First Token----------------
Mean TTFT (ms):                          280.78    
Median TTFT (ms):                        125.90    
P99 TTFT (ms):                           5398.79   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.37     
Median TPOT (ms):                        16.21     
P99 TPOT (ms):                           80.90     
---------------Inter-token Latency----------------
Mean ITL (ms):                           16.64     
Median ITL (ms):                         12.28     
P99 ITL (ms):                            109.48    
==================================================
# Marlin
Avg latency: 1.1686289020399931 seconds
10% percentile latency: 1.1123060915998166 seconds
25% percentile latency: 1.1479770297498817 seconds
50% percentile latency: 1.172815559500009 seconds
75% percentile latency: 1.1934974020000482 seconds
90% percentile latency: 1.2131133147000355 seconds
99% percentile latency: 1.2565373172201773 seconds

============ Serving Benchmark Result ============
Successful requests:                     1000      
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  264.78    
Total input tokens:                      1000000   
Total generated tokens:                  270412    
Request throughput (req/s):              3.78      
Output token throughput (tok/s):         1021.27   
Peak output token throughput (tok/s):    1568.00   
Peak concurrent requests:                28.00     
Total Token throughput (tok/s):          4797.98   
---------------Time to First Token----------------
Mean TTFT (ms):                          122.76    
Median TTFT (ms):                        99.74     
P99 TTFT (ms):                           489.35    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          15.50     
Median TPOT (ms):                        14.79     
P99 TPOT (ms):                           32.83     
---------------Inter-token Latency----------------
Mean ITL (ms):                           14.99     
Median ITL (ms):                         11.78     
P99 ITL (ms):                            78.76     
==================================================

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request correctly addresses a performance regression on SM90 GPUs for MXFP4 quantization by reverting the default backend to Marlin from Triton. The change is well-implemented, introducing a new environment variable VLLM_MXFP4_USE_TRITON to allow users to explicitly select the Triton backend, which is a good approach. The logic for backend selection is sound. I have one suggestion to improve robustness by explicitly handling cases where conflicting environment variables are set.
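The reviewer's suggestion about conflicting environment variables could look something like the sketch below. This is an assumption about what "explicitly handling" means, not code from the PR: the helper name `resolve_backend_override` is invented, and failing loudly (rather than silently preferring one variable) is one of several reasonable designs.

```python
import os
from typing import Optional


def resolve_backend_override() -> Optional[str]:
    """Return an explicit MXFP4 backend override, or None if unset.

    Hypothetical sketch: raise instead of silently preferring one
    backend when both override variables are set.
    """
    use_marlin = os.environ.get("VLLM_MXFP4_USE_MARLIN") == "1"
    use_triton = os.environ.get("VLLM_MXFP4_USE_TRITON") == "1"
    if use_marlin and use_triton:
        # Conflicting overrides: surface the misconfiguration to the user.
        raise ValueError(
            "VLLM_MXFP4_USE_MARLIN and VLLM_MXFP4_USE_TRITON are both set; "
            "unset one to choose a backend."
        )
    if use_marlin:
        return "marlin"
    if use_triton:
        return "triton"
    return None
```

An alternative design is to log a warning and pick a documented winner; raising makes the conflict impossible to miss at startup.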

