[sharktank] Coerce paged attention args' dtype to avoid mismatch #994
With the introduction of the KV cache dtype config option, we may encounter configurations that mix the dtypes of the attention op's arguments, for example when the KV cache is stored in lower precision. With this change, all attention args have their dtype converted.
Fixes #896 (comment).
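To illustrate the kind of mismatch described above, here is a minimal NumPy sketch, not the sharktank implementation. It assumes the KV cache is held in `float16` while the query is `float32`, and that the arguments are coerced to the query's dtype; the target dtype this PR actually uses is not stated here. The helper names `coerce_attention_args` and `attention` are hypothetical.

```python
import math

import numpy as np


def coerce_attention_args(q, k, v, dtype=None):
    """Cast q, k, v to a single dtype.

    Defaulting to q's dtype is an assumption made for this sketch.
    """
    target = np.dtype(dtype) if dtype is not None else q.dtype
    return tuple(t.astype(target, copy=False) for t in (q, k, v))


def attention(q, k, v):
    """Plain scaled dot-product attention over dtype-coerced arguments."""
    q, k, v = coerce_attention_args(q, k, v)
    scale = 1.0 / math.sqrt(q.shape[-1])
    scores = (q @ np.swapaxes(k, -1, -2)) * scale
    # Subtract the row maximum for a numerically stable softmax.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


rng = np.random.default_rng(0)
q = rng.standard_normal((2, 4, 8)).astype(np.float32)
# Stand-in for a KV cache stored in lower precision.
k_cache = rng.standard_normal((2, 4, 8)).astype(np.float16)
v_cache = rng.standard_normal((2, 4, 8)).astype(np.float16)

out = attention(q, k_cache, v_cache)
```

After coercion, every operand of the matmuls shares one dtype, so the result dtype follows the query instead of depending on implicit promotion rules.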
|
This fixes the mismatch of dtypes. Try
|
Or better yet
If you don't have the change iree-org/iree#20005
This flag works.
Would it have an impact on the flag we use in #907? I guess we should also set |
This export flag does not work when compiling with IREE. https://sharkpublic.blob.core.windows.net/sharkpublic/chi/llama/fp8_32_kv16.mlir
|
I feel so confused when you say |