Fine tune sub-group transpose bank conflict prevention for PVC #2797

Closed
victor-eds opened this issue Nov 22, 2024 · 2 comments
victor-eds commented Nov 22, 2024

As of now, sub-group transpose bank conflict prevention leaves one empty slot (X) in every 17 slots (sub-group size of 16, plus 1) to avoid bank conflicts:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 X
...

This is too conservative and can be improved by taking into account PVC's SLM configuration for parallel accesses (64 banks, each providing access to 8 B). With X B as the element size, the ideal mechanism would be:

  • Store (64 banks * 8 B/bank / X B/element) elements
  • Leave (1 bank * 8 B/bank / X B/element) empty spots
  • Store (64 banks * 8 B/bank / X B/element) elements
  • ...

Assuming fp32 elements (4 B each), each row holds 64 * 8 / 4 = 128 elements followed by 8 / 4 = 2 empty spots (marked X):

0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
0 1 2 3 4 5 ... 127 X X
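
The two layouts above can be compared with a small index-mapping sketch. This is only an illustration of the arithmetic in this issue, not code from the compiler: the helper names `padded_offset` and `proposed_layout` are hypothetical, and the sketch assumes the element size evenly divides the 8 B bank width.

```python
def padded_offset(i: int, chunk: int, pad: int) -> int:
    """Physical slot of logical element i when `pad` empty slots
    follow every `chunk` stored elements."""
    return i + (i // chunk) * pad


def proposed_layout(elem_bytes: int, banks: int = 64, bank_bytes: int = 8):
    """(chunk, pad), in elements, for the proposed scheme: store one full
    pass over all banks, then leave one bank's worth of empty slots.
    Assumes elem_bytes divides bank_bytes (hypothetical helper)."""
    chunk = banks * bank_bytes // elem_bytes
    pad = bank_bytes // elem_bytes
    return chunk, pad


# Current scheme: 16 elements, then 1 empty slot.
print(padded_offset(16, chunk=16, pad=1))   # 17

# Proposed scheme for fp32 (4 B elements).
chunk, pad = proposed_layout(elem_bytes=4)
print(chunk, pad)                           # 128 2
print(padded_offset(128, chunk, pad))       # 130
```

Under these assumptions the padding overhead drops from 1 slot in 17 (about 5.9%) to 2 slots in 130 (about 1.5%) for fp32.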

Again, for fp32, in terms of code:

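; Store untransposed: each block write covers 8 x 16 = 128 floats
call spir_func void @intel_sub_group_block_write8(ptr addrspace(3) %ptr0, <8 x float> %data)
; Next row starts after 128 elements + 2 empty spots
%ptr1 = getelementptr inbounds float, ptr addrspace(3) %ptr0, i32 130
; ...
; Load transposed
%vec0 = load <8 x float>, ptr addrspace(3) %ptrwi0
%ptrwi1 = getelementptr inbounds <8 x float>, ptr addrspace(3) %ptrwi0, i32 1
; ...
; Take into account empty elements: 8 loaded floats + 2 empty spots
%ptrwi16 = getelementptr inbounds float, ptr addrspace(3) %ptrwi15, i32 10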
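
The pointer steps in the snippet above can be checked with a short sketch of the offsets involved. It assumes that the elided lines keep advancing the load pointer by one `<8 x float>` per step, which the issue does not spell out. The names `store_offset` and `load_offset` are hypothetical and exist only for this illustration.

```python
ROW_ELEMS = 128  # 64 banks * 8 B/bank / 4 B per fp32 element
PAD_ELEMS = 2    # 1 bank * 8 B/bank / 4 B per fp32 element
VEC = 8          # elements per <8 x float> access


def store_offset(n: int) -> int:
    """Offset in floats of the n-th block write relative to %ptr0."""
    return n * (ROW_ELEMS + PAD_ELEMS)


def load_offset(k: int) -> int:
    """Offset in floats of the k-th <8 x float> load relative to %ptrwi0,
    assuming consecutive vector loads within a row (an assumption)."""
    loads_per_row = ROW_ELEMS // VEC  # 16 loads cover one 128-element row
    row, col = divmod(k, loads_per_row)
    return row * (ROW_ELEMS + PAD_ELEMS) + col * VEC


print(store_offset(1))                    # 130, matches %ptr1
print(load_offset(16) - load_offset(15))  # 10, matches %ptrwi16
```

The step of 10 between the 16th and 17th loads is the 8 floats just read plus the 2 empty spots at the end of the row.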

victor-eds commented

Postponed as #2890 looks like a more promising approach.

@vlad-penkin vlad-penkin added this to the 4.0 [Performance] Core milestone Dec 2, 2024

sommerlukas commented

The approach described in #2890 performs the reduction directly in registers, so SLM is not involved and no tuning for bank conflicts is required. Closing this issue.
