[GEMM perf] Poor GEMM performance on A770 #1765
A770 does not support DPAS 16, so the kernel is likely a fully unrolled loop.
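For context on the "fully unrolled loop" remark: a DPAS instruction accumulates a small matrix tile per invocation, so without it each output element of a GEMM is built up from scalar fused multiply-adds. A minimal pure-Python sketch of that fallback shape (illustrative only; this is not the actual generated kernel):

```python
# Without a hardware matrix-multiply instruction (DPAS), C = A @ B
# degrades to one scalar multiply-accumulate per element per k-step,
# which is what an unrolled-loop GEMM kernel amounts to.
def scalar_fma_gemm(A, B):
    M, K = len(A), len(A[0])
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):  # one FMA per k-step
                acc += A[m][k] * B[k][n]
            C[m][n] = acc
    return C
```

With DPAS, the inner `k` loop is instead consumed in tile-sized chunks by a single instruction, which is where the throughput gap comes from.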
What's our timeline for supporting fast GEMM on A770?
What does a fast GEMM on the A770 mean? It doesn't have DPAS, so it will lag behind. Do you mean efficiency?
Torch does not use Triton for GEMM, neither for XPU nor for CUDA. There is an existing, performant solution for GEMM in PyTorch on A770. Why do we need Triton to be competitive?
@alexbaden, per @whitneywhtsang's comments in the issue, DPAS8 is supported via a different OpenCL built-in.
When I run the GEMM benchmark on A770 I get about ~0.3 TFLOPS, while the 1550 can get about 250 TFLOPS.
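For reference, achieved TFLOPS in such benchmarks is usually derived from the standard GEMM flop count 2·M·N·K divided by runtime. A quick sanity-check helper (the shape and timing below are hypothetical, not taken from the benchmark report):

```python
def gemm_tflops(M, N, K, seconds):
    """Achieved TFLOPS for one M x N x K GEMM (2*M*N*K flops)."""
    return 2 * M * N * K / seconds / 1e12

# Hypothetical example: a 4096x4096x4096 GEMM taking 0.5 s
# lands in the ~0.3 TFLOPS range reported above.
rate = gemm_tflops(4096, 4096, 4096, 0.5)
```

This also shows how large the gap is: at ~250 TFLOPS the same hypothetical GEMM would finish in well under a millisecond.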
Performance table:
File with Triton cache from the run (cache is in the `cache` folder): benchmark-reports (6).zip
My run, just in case:
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10215632110/job/28265440574