
Conversation

@vadiklyutiy (Contributor) commented Oct 25, 2025

Purpose

This PR is a refactoring of GDN (Gated Delta Net).

The main goal is to allow wider use of torch.compile.

  1. Separated the forward pass of GDN attention into three distinct pieces: input projection, core attention, and output projection. Previously, the projections lived inside the GDN custom op and were not covered by torch.compile (the new split is sketched below).
  2. Added an RMSNormGated class that implements a torch-native gated RMSNorm and used it for GDN. torch.compile generates good code for RMSNormGated, better even than the custom Triton kernel used before (the class is sketched below).
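
As a rough illustration of item 2, below is a minimal sketch of a torch-native gated RMSNorm. It is illustrative only, not the PR's actual implementation: the gating order (SiLU-gated multiplication applied after normalization), the epsilon default, and the fp32 upcast are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNormGated(nn.Module):
    """Sketch of a torch-native gated RMSNorm (illustrative only)."""

    def __init__(self, hidden_size: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.eps = eps

    def forward(self, x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
        # Compute RMS statistics in fp32 for numerical stability.
        dtype = x.dtype
        x = x.float()
        x = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        # Learned scale plus a SiLU gate; torch.compile can fuse this
        # elementwise chain, which is where the win over a standalone
        # Triton kernel can come from.
        return self.weight * x.to(dtype) * F.silu(gate)
```

And for item 1, a hypothetical shape of the three-way split. Only the structure (projections in plain torch, core attention left as a custom op) mirrors the PR; all names and shapes here are made up.

```python
class GDNBlockSketch(nn.Module):
    """Illustrative decomposition of a GDN layer into three stages."""

    def __init__(self, hidden_size: int, inner_size: int) -> None:
        super().__init__()
        # 1. Input projection: plain torch ops, now visible to torch.compile.
        self.in_proj = nn.Linear(hidden_size, 2 * inner_size, bias=False)
        # 3. Output projection: gated norm + linear, also plain torch ops.
        self.norm = RMSNormGated(inner_size)
        self.out_proj = nn.Linear(inner_size, hidden_size, bias=False)

    def core_attention(self, x: torch.Tensor) -> torch.Tensor:
        # 2. Stand-in for the core GDN attention, which in the real model
        #    stays a custom op that torch.compile does not trace into.
        return x

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        projected, gate = self.in_proj(hidden_states).chunk(2, dim=-1)
        core_out = self.core_attention(projected)
        return self.out_proj(self.norm(core_out, gate))
```

With this split, torch.compile covers the input/output projections and the gated norm, leaving only the core attention as an opaque op.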

Functional Test Result

lm_eval
Before

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8491|±  |0.0099|
|     |       |strict-match    |     5|exact_match|↑  |0.8059|±  |0.0109|

After

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8544|±  |0.0097|
|     |       |strict-match    |     5|exact_match|↑  |0.8127|±  |0.0107|

Perf Test Result

Server

VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_USE_FLASHINFER_MOE_FP16=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching --async-scheduling --max_cudagraph_capture_size=2048

Prefill

vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 8192 --random-output 1 --max-concurrency 512 --num-prompt 512 --ignore-eos

Before: Total Token throughput (tok/s): 104098.78
After: Total Token throughput (tok/s): 105270.70
Speedup: 1.1%

Decode1

vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 32 --random-output 1024 --max-concurrency 512 --num-prompt 512 --ignore-eos

Before: Output token throughput (tok/s): 19212.17
After: Output token throughput (tok/s): 22384.37
Speedup: 16.5%

Decode2

vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 32 --random-output 1024 --max-concurrency 1024 --num-prompt 1024 --ignore-eos

Before: Output token throughput (tok/s): 28821.37
After: Output token throughput (tok/s): 30298.90
Speedup: 5.1%

Decode3
Server

VLLM_ATTENTION_BACKEND=FLASH_ATTN VLLM_USE_FLASHINFER_MOE_FP16=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4 --enable-expert-parallel --no-enable-prefix-caching --async-scheduling

(without increasing --max_cudagraph_capture_size)

vllm bench serve --backend vllm --model Qwen/Qwen3-Next-80B-A3B-Instruct --endpoint /v1/completions --dataset-name random --random-input 32 --random-output 1024 --max-concurrency 1024 --num-prompt 1024 --ignore-eos

Before: Output token throughput (tok/s): 16586.93
After: Output token throughput (tok/s): 18953.92
Speedup: 14.3%

Signed-off-by: Vadim Gimpelson <[email protected]>
@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors the Gated Delta Net (GDN) attention mechanism to improve torch.compile compatibility and performance. By decoupling the input/output projections from the core custom operator and introducing a native PyTorch RMSNormGated layer, the changes yield significant decode throughput improvements. The refactoring is well-executed and the code is clear. I have one high-severity suggestion regarding a local import in a performance-critical path, which should be moved to the top level of the module to adhere to best practices and avoid potential overhead.
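
For reference, the pattern being flagged looks roughly like the following; the gated-activation helper is a made-up stand-in, not the PR's actual code.

```python
import torch

# Flagged pattern: a function-local import on a performance-critical path.
# Python caches the module after the first load, but the import statement
# still runs on every call and hides the dependency from readers.
def gated_act_local(x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    import torch.nn.functional as F
    return x * F.silu(gate)


# Suggested shape of the fix: hoist the import to module level so it is
# resolved once at import time.
import torch.nn.functional as F

def gated_act(x: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    return x * F.silu(gate)
```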

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


@ZJY0516 (Contributor) commented Oct 27, 2025

CC @heheda12345

Signed-off-by: Vadim Gimpelson <[email protected]>
@vadiklyutiy (Contributor, Author) commented

@ALL
Could you please take a look at this PR?


Labels

qwen (Related to Qwen models)