The investigation in #1371 has revealed that in some cases kernels perform slower with subgroup-size 16 than with subgroup-size 32.
Further analysis of one of the outliers (BlenderbotSmallForCausalLM inference with amp_fp16) revealed that a kernel containing a reduction was particularly affected by the change in subgroup-size.
We should investigate why the change in subgroup-size causes this performance difference and fix it if possible.
The example kernel and more information can be found in this comment on #1371.
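As a starting point for the investigation, here is a minimal sketch of how the shape of a two-level tree reduction changes with the subgroup size. It is a purely illustrative counting model, not the kernel from #1371: the function name `reduction_shape` and the work-group size of 256 are assumptions made for this example, and it does not claim to explain the observed slowdown.

```python
def reduction_shape(work_group_size: int, subgroup_size: int) -> dict:
    """Count the pieces of a two-level tree reduction (illustrative model only).

    Assumes each subgroup first reduces its own values with a shuffle-based
    tree, and the per-subgroup partial results are combined afterwards.
    """
    if subgroup_size <= 0 or subgroup_size & (subgroup_size - 1):
        raise ValueError("subgroup size must be a power of two")
    if work_group_size % subgroup_size != 0:
        raise ValueError("work-group size must be a multiple of the subgroup size")

    subgroups = work_group_size // subgroup_size
    return {
        # Number of subgroups the work-group is split into.
        "subgroups": subgroups,
        # A tree reduction over n lanes takes log2(n) shuffle steps.
        "shuffle_steps_per_subgroup": subgroup_size.bit_length() - 1,
        # One partial result per subgroup must be combined across subgroups.
        "partials_to_combine": subgroups,
    }


# Hypothetical work-group size of 256, chosen only for illustration.
for sg in (16, 32):
    print(sg, reduction_shape(256, sg))
```

Under these assumptions, halving the subgroup size removes one shuffle step per subgroup but doubles the number of partial results that must be combined across subgroups. Whether this affects the kernel in question would need to be confirmed by profiling.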