Investigate cases in which subgroup size 16 is noticeably slower #1371
Comments
So far, we haven't been able to reproduce this behavior on the current
Work on this ticket was blocked by #1647 for most of last week. We were able to obtain data from a CI run at the end of last week and will present a summary offline in the meeting.
As discussed offline in chat, varying performance between different CI runs (based on the same commit) is still a problem for this investigation. We'll still try to identify a consistent outlier in the performance comparison between the different subgroup sizes and then investigate what causes that performance difference.
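A minimal sketch of how run-to-run noise could be quantified so that a real subgroup-size regression can be told apart from ordinary CI variance. The input format (model name mapped to speedup over PyTorch eager) and the 5% threshold are assumptions, not the repository's actual tooling:

```python
# Sketch: per-benchmark run-to-run noise, to separate signal from CI variance.
from statistics import mean, stdev

# Hypothetical speedups from three CI runs of the same commit.
runs = [
    {"BartForCausalLM": 2.41, "T5Small": 0.63},
    {"BartForCausalLM": 2.55, "T5Small": 0.66},
    {"BartForCausalLM": 2.48, "T5Small": 0.64},
]

for model in runs[0]:
    values = [run[model] for run in runs]
    cv = stdev(values) / mean(values)  # coefficient of variation
    # Flag benchmarks whose spread is too large to trust a single-run
    # comparison; the 5% cutoff is an arbitrary example value.
    status = "noisy" if cv > 0.05 else "stable"
    print(f"{model}: mean {mean(values):.2f}x, CV {cv:.1%} ({status})")
```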
After comparing CI runs 4 and 5, three outlier models could be identified for which subgroup size 16 provided consistently worse performance than subgroup size 32 across both runs:
For
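A minimal sketch of the consistency check described above: keep only models that regress with subgroup size 16 in both CI runs, so one-off noise drops out. The per-run dictionaries and all values are hypothetical; only the model names come from this issue:

```python
# Sketch: intersect the outlier sets of two CI runs.
# Each value is (speedup with subgroup size 16, speedup with subgroup size 32).
run4 = {
    "PLBartForCausalLM": (2.36, 3.58),
    "BartForCausalLM": (1.19, 2.52),
    "GPT2": (3.10, 3.05),
}
run5 = {
    "PLBartForCausalLM": (2.40, 3.55),
    "BartForCausalLM": (1.25, 2.49),
    "GPT2": (2.95, 3.00),
}

def outliers(run):
    """Models where subgroup size 16 is slower than subgroup size 32."""
    return {m for m, (sg16, sg32) in run.items() if sg16 < sg32}

consistent = outliers(run4) & outliers(run5)
print("Consistent subgroup-size-16 outliers:", sorted(consistent))
```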
Add a script to easily compare two runs of the "E2E performance" CI workflow. The script compares the speedup over PyTorch eager yielded by the two different CI runs, prints an evaluation, and can also visualize the data as a boxplot. For more details on the usage of the script, see the accompanying README. This script was written and used for #1370 and #1371. Closes #1848. --------- Signed-off-by: Lukas Sommer <[email protected]>
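A rough sketch of what the core of such a comparison might look like; the file names, the simplified CSV layout (`model,speedup`), and the plotting details are assumptions, and the actual script lives in the repository:

```python
# Sketch: compare speedup-over-PyTorch-eager between two "E2E performance"
# CI runs and visualize the per-model deltas as a boxplot.
import csv
import matplotlib.pyplot as plt

def load(path):
    # Assumed CSV layout: header "model,speedup", one row per benchmark.
    with open(path, newline="") as f:
        return {row["model"]: float(row["speedup"]) for row in csv.DictReader(f)}

run_a = load("run_a.csv")  # e.g. subgroup size 32
run_b = load("run_b.csv")  # e.g. subgroup size 16
common = sorted(run_a.keys() & run_b.keys())

# Relative change of run B vs. run A, per model.
deltas = [run_b[m] / run_a[m] - 1.0 for m in common]
for model, delta in zip(common, deltas):
    print(f"{model}: {delta:+.1%}")

plt.boxplot(deltas)
plt.ylabel("relative speedup change (run B vs. run A)")
plt.title("E2E performance: run B vs. run A")
plt.savefig("comparison_boxplot.png")
```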
After running huggingface benchmarks with subgroup size 16 (https://github.com/intel/intel-xpu-backend-for-triton/actions/runs/9499093865), we saw some cases in which subgroup size 16 reported worse performance:

- huggingface amp_bf16 inference PLBartForCausalLM: 3.58x vs 2.36x speedup
- huggingface bf16 inference BartForCausalLM: 2.52x vs 1.19x speedup
- huggingface f32 training T5Small: 0.65x vs 0.50x speedup

Investigate and create follow-up issues if needed, or write a report in this issue.
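A small worked calculation of the relative regressions implied by the figures above, assuming in each pair the first number is the subgroup-size-32 speedup and the second the subgroup-size-16 speedup (the issue does not label them explicitly):

```python
# Relative drop in speedup when switching from subgroup size 32 to 16,
# using the figures reported above (ordering of the pairs is an assumption).
cases = {
    "PLBartForCausalLM (amp_bf16 inference)": (2.36, 3.58),
    "BartForCausalLM (bf16 inference)": (1.19, 2.52),
    "T5Small (f32 training)": (0.50, 0.65),
}
for name, (sg16, sg32) in cases.items():
    print(f"{name}: {1 - sg16 / sg32:.0%} lower speedup with subgroup size 16")
```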
Env: