Reduction performance with subgroup size 16 #1868

sommerlukas · 2024-08-13T08:55:29Z

The investigation in #1371 revealed that in some cases kernels run slower with subgroup size 16 than with subgroup size 32.

Further analysis of one of the outliers (BlenderbotSmallForCausalLM inference with amp_fp16) showed that a kernel containing a reduction was particularly affected by the change in subgroup size.

We should investigate why the change in subgroup size causes this performance difference and fix it if possible.

The example kernel and more information can be found in this comment on #1371.
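
For context, here is a minimal pure-Python sketch of how the subgroup size changes the shape of a work-group reduction. It is a simplified model under stated assumptions (a shuffle-down tree inside each subgroup, followed by a combine of one partial result per subgroup), not the backend's actual lowering and not a diagnosis of the regression. The function names `subgroup_reduce` and `workgroup_reduce` are hypothetical and exist only for this illustration.

```python
def subgroup_reduce(values, subgroup_size):
    """Sum one subgroup's values with a shuffle-down tree.

    Returns (sum, number_of_shuffle_steps). Hypothetical helper for illustration.
    """
    lanes = list(values)
    steps = 0
    offset = subgroup_size // 2
    while offset > 0:
        # Lanes below `offset` add the value held by lane i + offset.
        for i in range(offset):
            lanes[i] += lanes[i + offset]
        offset //= 2
        steps += 1
    # After the last step, lane 0 holds the sum of the whole subgroup.
    return lanes[0], steps


def workgroup_reduce(values, subgroup_size):
    """Reduce a work-group's values in two stages.

    Returns (total, shuffle_steps_per_subgroup, number_of_partials).
    Assumes a power-of-two subgroup size that evenly divides the input.
    """
    assert subgroup_size > 0 and subgroup_size & (subgroup_size - 1) == 0
    assert len(values) % subgroup_size == 0

    partials = []
    steps = 0
    for start in range(0, len(values), subgroup_size):
        chunk = values[start:start + subgroup_size]
        partial, steps = subgroup_reduce(chunk, subgroup_size)
        partials.append(partial)

    # Stage two: combine one partial per subgroup. On a GPU this stage
    # typically needs communication across subgroups; here it is a plain sum.
    return sum(partials), steps, len(partials)


if __name__ == "__main__":
    data = list(range(256))  # one value per work-item, 256 work-items
    for size in (32, 16):
        total, steps, partials = workgroup_reduce(data, size)
        print(f"subgroup size {size}: total={total}, "
              f"shuffle steps={steps}, partials={partials}")
```

In this model, halving the subgroup size removes one shuffle step per subgroup but doubles the number of partial results that have to be combined across subgroups. Whether that trade-off accounts for the slowdown observed here is exactly what needs to be investigated.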


sommerlukas added the performance label Aug 13, 2024

sommerlukas mentioned this issue Aug 13, 2024

Investigate cases in which subgroup size 16 is noticeable slower #1371

Closed

vlad-penkin added this to the 4.0 [Performance] Core milestone Aug 13, 2024

vlad-penkin added the research label Aug 13, 2024
