This is too conservative and can be greatly improved knowing PVC's SLM configuration for parallel accesses (64 banks providing access to 8 B each). So the ideal mechanism would be:
Store (64 banks * 8 B/bank / X B/element) elements
Leave (1 bank * 8 B/bank / X B/element) empty spots
Store (64 banks * 8 B/bank / X B/element) elements
...
Assuming fp32 elements:
0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
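The layout rule above can be sketched as an index mapping. This is only an illustration under the numbers quoted in this issue (64 banks, 8 B per bank); the names `row_layout` and `padded_index` are hypothetical and do not come from the code base, and the sketch assumes the element size divides 8 B evenly.

```python
BANKS = 64       # SLM banks, as stated in the issue
BANK_WIDTH = 8   # bytes served per bank, as stated in the issue


def row_layout(elem_bytes):
    """Return (data elements per row, padding elements per row)."""
    elems_per_row = BANKS * BANK_WIDTH // elem_bytes  # 64 banks * 8 B / X B
    pad_per_row = BANK_WIDTH // elem_bytes            # 1 bank * 8 B / X B
    return elems_per_row, pad_per_row


def padded_index(i, elem_bytes):
    """Map logical element index i to its slot in the padded SLM buffer."""
    elems_per_row, pad_per_row = row_layout(elem_bytes)
    row, col = divmod(i, elems_per_row)
    return row * (elems_per_row + pad_per_row) + col


if __name__ == "__main__":
    # fp32: 128 data elements followed by 2 empty slots, so rows start 130 apart.
    print(row_layout(4))
    print([padded_index(i, 4) for i in (0, 127, 128, 256)])
```

For fp32 this gives a row stride of 130 elements, which matches the `[130]` offset used in the code later in this issue.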
Again, for fp32, in terms of code:
; Store untransposed
call spir_funccc void @intel_sub_group_block_write8(ptr(3) %ptr0, <8 x float> %data)
%ptr1 = getelementptr inbounds %ptr0[130], float
; ...
; Load transposed
%vec0 = load <8 x float> %ptrwi0
%ptrwi1 = getelementptr inbounds %ptrwi0[1], <8 x float>
; ...
; Take into account empty elements
%ptrwi16 = getelementptr inbounds %ptrwi15[10], float
The approach described in #2890 performs the reduction directly in registers, so SLM is not involved and no tuning for bank conflicts is required. Closing this issue.
As of now, sub-group transpose bank conflict prevention leaves one empty item in every 17 items (sub-group size 16, plus 1) to avoid bank conflicts.
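To make the cost of this scheme concrete, here is a rough sketch of the stride-17 mapping and of the share of SLM spent on padding. It is compared with a layout of 128 fp32 elements followed by 2 empty slots (64 banks of 8 B each). The helper names are hypothetical, and the sketch assumes the empty slot follows each group of 16 items.

```python
from fractions import Fraction

SUB_GROUP_SIZE = 16  # sub-group size quoted in the issue


def current_padded_index(i):
    """Stride-17 scheme: one empty slot after every 16 items (assumed order)."""
    return i + i // SUB_GROUP_SIZE


def padding_fraction(data_items, pad_items):
    """Share of the padded buffer that holds no data."""
    return Fraction(pad_items, data_items + pad_items)


current = padding_fraction(SUB_GROUP_SIZE, 1)  # 1 slot in every 17
bank_aware_fp32 = padding_fraction(128, 2)     # 2 slots in every 130

if __name__ == "__main__":
    print(current, bank_aware_fp32)
```

Under these assumptions, the current scheme spends 1/17 of the buffer on padding, while the bank-aware fp32 layout spends 2/130.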