Override SSM_A op for Qwen3 Next to reduce splits #17587
On the test server this improves pp512 t/s from 900 to 1300.
I don't understand this change, but it also improves perf for Vulkan. I see that the SSM_SCAN operation in this model isn't supported by the Vulkan backend, but even if I mark it supported it still isn't assigned to the GPU split. If we ran it on the GPU would that help?
It's better to use a different tensor identifier other from
@ggerganov I thought about it, but people will kill me for breaking existing GGUFs :) |
When the graph is built, weights have to be assigned to a backend. That's where the tensor default operations come in: each weight is assigned based on the operation it is supposed to be used in. If the default operation (for SSM_A it's SSM_SCAN) is unsupported on a given backend, the weight is moved to the CPU. Because of that, any tensor generated from that weight's projection will also be placed on the CPU, influencing further graph split decisions (that's what I called "poisoning").
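To make the placement rule above concrete, here is a toy Python model of it. This is only a sketch of the behavior described in the comment, not llama.cpp or ggml code: the `GPU_SUPPORTED_OPS` table, the `Weight` class, both helper functions, and the choice of `MUL` as the override op are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical op support table for an imaginary GPU backend.
# Real backends report support per op and per tensor shape/type.
GPU_SUPPORTED_OPS = {"MUL_MAT", "MUL", "ADD"}


@dataclass(frozen=True)
class Weight:
    name: str
    default_op: str  # the op this weight is expected to be consumed by


def place_weight(weight: Weight) -> str:
    """Place a weight on the GPU only if its default op is supported there."""
    return "GPU" if weight.default_op in GPU_SUPPORTED_OPS else "CPU"


def place_node(input_placements: list[str]) -> str:
    """Simplified 'poisoning' rule: one CPU input pulls the result to the CPU."""
    return "CPU" if "CPU" in input_placements else "GPU"


# Default mapping: SSM_A is tied to SSM_SCAN, which this toy GPU lacks.
ssm_a = Weight("ssm_a", "SSM_SCAN")
# Overridden mapping: same weight, tied to an op the toy GPU supports.
# "MUL" is an arbitrary stand-in, not necessarily the op used in the PR.
ssm_a_override = Weight("ssm_a", "MUL")

for w in (ssm_a, ssm_a_override):
    weight_at = place_weight(w)
    gate_at = place_node([weight_at, "GPU"])  # gate uses the weight and an activation
    next_at = place_node([gate_at])           # anything derived from the gate
    print(w.default_op, weight_at, gate_at, next_at)
```

In this model the only difference between the two runs is the default op attached to the weight, which is enough to flip every downstream placement.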
You can fix it without breaking existing GGUFs as in #17548 |
Not that way because I'd break other hybrid models that use this tensor. |
Not really, look closely, you simply create a new identifier and map it to the old tensor name for |
This massively reduces the number of splits for the Qwen3 Next graph by placing the initial gate tensor on the backend. Otherwise it is put on the CPU, which recursively poisons all other layers and leads to splits.
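As a rough illustration of why one CPU-placed tensor per layer inflates the split count, the sketch below treats a split as a maximal run of consecutive graph nodes on the same backend. The three-node layer layout and the helper names are hypothetical, and the real scheduler is more involved than this.

```python
from itertools import groupby


def count_splits(placements: list[str]) -> int:
    """Count maximal runs of consecutive nodes assigned to the same backend."""
    return sum(1 for _ in groupby(placements))


def layer_placements(gate_backend: str) -> list[str]:
    """Hypothetical layer: two GPU nodes with the gate node in between."""
    return ["GPU", gate_backend, "GPU"]


n_layers = 3
gate_on_cpu = layer_placements("CPU") * n_layers  # gate falls back to the CPU
gate_on_gpu = layer_placements("GPU") * n_layers  # gate stays on the backend

print(count_splits(gate_on_cpu))
print(count_splits(gate_on_gpu))
```

With the gate on the CPU, every layer forces a GPU to CPU to GPU round trip, so the count grows with the number of layers. With the gate on the backend, the whole toy graph is a single run.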