@sunjiweiswift

Description

Type

- [ ] Bug
- [ ] Feature
- [ ] Performance
- [ ] Refactor

Testing

- [ ] Tests pass
- [ ] Xe12
- [ ] Xe20

Performance

| Metric | Before | After |
|--------|--------|-------|

References

Fixes #

Checklist

- [ ] Copyright
- [ ] Co-pilot Review
- [ ] Deprecated APIs not used

@sunjiweiswift sunjiweiswift marked this pull request as draft November 11, 2025 05:36
@sunjiweiswift sunjiweiswift reopened this Nov 11, 2025
@sunjiweiswift sunjiweiswift force-pushed the fmha_GQA branch 9 times, most recently from 5cb846f to 90a414a Compare November 17, 2025 08:51

@airMeng airMeng left a comment

Based on my XeTLA experience, Q-head folding improves decoding performance but hurts prefill. Should this optimization be made conditional, similar to `reduce_A`?
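
To make the question concrete, here is a minimal Python sketch of what a conditional fold could look like at the shape level. It is not taken from this PR. It assumes that "Q-head folding" means packing the Q heads that share one KV head into the row dimension of the Q tile, and that a short query length is a reasonable proxy for decoding. The names `AttnShape`, `should_fold_q_heads`, `q_layout`, and the `max_folded_rows` threshold are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnShape:
    """Hypothetical container for the attention problem size."""
    batch: int
    num_q_heads: int
    num_kv_heads: int
    seq_len_q: int
    head_dim: int


def gqa_group_size(shape):
    """Number of Q heads that share one KV head."""
    if shape.num_kv_heads <= 0 or shape.num_q_heads % shape.num_kv_heads != 0:
        raise ValueError("num_q_heads must be a positive multiple of num_kv_heads")
    return shape.num_q_heads // shape.num_kv_heads


def should_fold_q_heads(shape, max_folded_rows=64):
    """Assumed heuristic: fold only for GQA with a short, decode-like query.

    max_folded_rows is an illustrative threshold, not a tuned value.
    """
    group = gqa_group_size(shape)
    return group > 1 and group * shape.seq_len_q <= max_folded_rows


def q_layout(shape, max_folded_rows=64):
    """Return the logical Q layout as (batch, heads, rows, head_dim)."""
    if should_fold_q_heads(shape, max_folded_rows):
        group = gqa_group_size(shape)
        # Folded: iterate over KV heads; each tile holds the whole Q-head group.
        return (shape.batch, shape.num_kv_heads,
                group * shape.seq_len_q, shape.head_dim)
    # Unfolded: one tile per Q head, as in the regular prefill path.
    return (shape.batch, shape.num_q_heads, shape.seq_len_q, shape.head_dim)
```

With 32 Q heads and 8 KV heads, a decode step (`seq_len_q=1`) would take the folded layout `(1, 8, 4, 128)`, while a 1024-token prefill would keep `(1, 32, 1024, 128)`. Whether the real kernel should switch on query length, on a template parameter, or on something else is exactly the open question above.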

@sunjiweiswift sunjiweiswift force-pushed the fmha_GQA branch 2 times, most recently from e8576ac to c114d9f Compare November 18, 2025 05:31

2 participants