[GEMM perf] Poor GEMM performance on A770 #1765

Open
Egor-Krivov opened this issue Aug 2, 2024 · 6 comments
Labels
enhancement New feature or request performance

Comments

@Egor-Krivov
Contributor

When I run the GEMM benchmark on A770, I get about 0.3 TFLOPS, while the 1550 can reach about 250 TFLOPS.

Performance table:
image

File with triton cache from the run (cache is in cache folder):
benchmark-reports (6).zip

My run, just in case:
https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/10215632110/job/28265440574
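For readers comparing the numbers above, here is a minimal sketch of how a GEMM timing maps to TFLOPS, assuming the usual convention of 2*M*N*K floating-point operations per matmul. The function names and the 4096x4096x4096 problem size are illustrative assumptions, not taken from the benchmark in this issue.

```python
def gemm_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOPS for an (m x k) @ (k x n) matmul that took `seconds`."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    return flops / seconds / 1e12


def gemm_seconds(m: int, n: int, k: int, tflops: float) -> float:
    """Time one matmul would take at a given sustained TFLOPS rate."""
    return 2 * m * n * k / (tflops * 1e12)


if __name__ == "__main__":
    # Hypothetical problem size, chosen only for illustration.
    m = n = k = 4096
    for rate in (0.3, 250.0):
        print(f"{rate} TFLOPS -> {gemm_seconds(m, n, k, rate) * 1e3:.3f} ms")
```

At this hypothetical size, 0.3 TFLOPS corresponds to roughly 0.46 s per matmul, while 250 TFLOPS corresponds to roughly 0.55 ms.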

@alexbaden
Contributor

A770 does not support DPAS 16, so the kernel is likely a fully unrolled loop.
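To illustrate what "DPAS 16" likely refers to, here is a rough pure-Python sketch of a single DPAS-style tile operation, D = C + A @ B. It assumes the tile shape is (repeat count) x (systolic depth * ops per channel) x (execution size), with repeat count 8, systolic depth 8, and 2 ops per channel for fp16, and an execution size of 8 or 16 depending on the hardware. Those parameter values and all function names are assumptions for illustration, not taken from this thread or from the backend's code.

```python
from math import ceil


def tile_shape(execution_size, repeat_count=8, systolic_depth=8, ops_per_channel=2):
    """Return the assumed (M, K, N) shape of one DPAS-style tile."""
    return repeat_count, systolic_depth * ops_per_channel, execution_size


def dpas_tile(a, b, c):
    """Emulate one tile multiply-accumulate: returns c + a @ b (lists of lists)."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "A must be m x k"
    assert len(c) == m and all(len(row) == n for row in c), "C must be m x n"
    return [
        [c[i][j] + sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def tiles_needed(m, n, k, execution_size):
    """Number of tile operations needed to cover an m x n x k GEMM."""
    tm, tk, tn = tile_shape(execution_size)
    return ceil(m / tm) * ceil(n / tn) * ceil(k / tk)


if __name__ == "__main__":
    for exec_size in (8, 16):
        print(exec_size, tile_shape(exec_size), tiles_needed(4096, 4096, 4096, exec_size))
```

Under these assumptions, an execution size of 8 halves the tile width, so covering the same GEMM takes twice as many tile operations as with an execution size of 16. A kernel generated for the 16-wide form cannot be mapped onto the 8-wide instruction without a different lowering.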

@Egor-Krivov
Contributor Author

What's our timeline for supporting fast GEMM on A770?

@vlad-penkin added the enhancement label Aug 2, 2024
@aregm
Contributor

aregm commented Aug 2, 2024

> What's our timeline for supporting fast GEMM on A770?

What does a fast GEMM on A770 mean? It doesn't have DPAS, so it will lag behind. Do you mean efficiency?

@Egor-Krivov
Contributor Author

I think the current performance is lower than could be expected. Here is another GEMM benchmark (in milliseconds) comparing our Triton matmul implementation against IPEX torch (oneDNN). We get about a 100x slowdown when using Triton vs. IPEX torch.
image

@alexbaden
Contributor

Torch does not use Triton for GEMM - neither for XPU nor for CUDA. There is an existing, performant solution for GEMM in PyTorch on A770. Why do we need Triton to be competitive?
We have line of sight to very good GEMM performance on hardware with DPAS instructions. On A770, we would effectively be starting over. What consumer demand justifies such resource-intensive work?

@vlad-penkin
Contributor

@alexbaden as per @whitneywhtsang's comments in the issue

DPAS8 is supported via a different OpenCL built-in.


4 participants