[webgpu] Optimize FlashAttention for prefill #25395
base: main
Conversation
This PR enhances unidirectional `FlashAttention` by applying causal masking inside the main loop. The optimization eliminates unnecessary memory loads by skipping future entries in the KV cache. Testing on Lunar Lake shows up to a 20% performance improvement for `phi-4-mini-accuracy4` (with a prompt length of 4096). Similar gains were observed for other models, including `Qwen3-0.6B-accuracy4`. This PR also switches to the more readable `unidirectional` attribute instead of `is_gqa` to control causal masking.
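As a rough illustration of the change (a minimal sketch assuming a tiled main loop; `tile_size`, the loop structure, and the early `break` are placeholders, and only the `seq_causal_length` expression and uniform names come from the actual shader):

```wgsl
// Sketch of causal masking inside the FlashAttention main loop (not the real kernel).
let tile_size = 16u;  // placeholder tile width, not the kernel's actual tiling
for (var k_start = 0u; k_start < uniforms.total_sequence_length; k_start += tile_size) {
  // Causal boundary for this query row; mirrors the select in the diff below.
  let seq_causal_length = select(uniforms.total_sequence_length,
                                 uniforms.past_sequence_length + q_idx_global + 1,
                                 uniforms.is_unidirectional > 0);
  if (k_start >= seq_causal_length) {
    break;  // future KV-cache tiles are never loaded, saving memory traffic
  }
  // load K/V tile, compute qk scores, apply the online-softmax update ...
}
```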
Lunar Lake, Phi-4-mini-accuracy4: [benchmark chart]
@sushraja-msft @qjia7 pls take a look.
LGTM with nits.
@@ -337,7 +352,7 @@ Status FlashAttentionProgram::GenerateShaderCode(ShaderHelper& shader) const {
   qk_4 = qk_4 + loadAttentionBias(q_idx_global, k_start+12, head_idx);
 }

-let seq_causal_length = select(uniforms.total_sequence_length, uniforms.past_sequence_length + q_idx_global + 1, uniforms.is_gqa > 0);
+let seq_causal_length = select(uniforms.total_sequence_length, uniforms.past_sequence_length + q_idx_global + 1, uniforms.is_unidirectional > 0);
nit: Is it better if we move this assignment out of the for loop?
Sure. Let me address it in follow-up PRs to keep this one focused.
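For reference, a sketch of the hoisted form the nit suggests (loop structure and tile size are again placeholders): since the causal boundary depends only on values that are fixed for a given query row, it can be computed once before the K loop.

```wgsl
// seq_causal_length is loop-invariant for this query row, so compute it once up front.
let seq_causal_length = select(uniforms.total_sequence_length,
                               uniforms.past_sequence_length + q_idx_global + 1,
                               uniforms.is_unidirectional > 0);
let tile_size = 16u;  // placeholder tile width
for (var k_start = 0u; k_start < seq_causal_length; k_start += tile_size) {
  // load K/V tile, compute qk scores, apply the online-softmax update ...
}
```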
/azp run Linux QNN CI Pipeline, Win_TRT_Minimal_CUDA_Test_CI, Windows ARM64 QNN CI Pipeline, Windows GPU Doc Gen CI Pipeline, Windows x64 QNN CI Pipeline
Azure Pipelines successfully started running 5 pipeline(s).
Description

This PR enhances unidirectional `FlashAttention` by applying causal masking inside the main loop. The optimization eliminates unnecessary memory loads by skipping future entries in the KV cache. Testing on Lunar Lake shows up to a 20% performance improvement for `phi-4-mini-accuracy4` (with a prompt length of 4096). Similar gains were observed for other models, including `Qwen3-0.6B-accuracy4`. This PR also switches to the more readable `unidirectional` attribute instead of `is_gqa` to control causal masking.

Motivation and Context
See above.